
ML Project Template

A starting-point structure and checklist for running ML experiments on OSC.

Workflow Overview

flowchart LR
    A[Setup\nEnv & Project] --> B[Data Prep\nDownload & Process]
    B --> C[Develop\nModel & Training Code]
    C --> D[Submit Jobs\nSLURM Batch]
    D --> E[Track\nMetrics & Artifacts]
    E --> F{Converged?}
    F -->|No| C
    F -->|Yes| G[Results\nAnalysis & Report]

Project Structure
~/projects/my_ml_project/
├── README.md                 # Project documentation
├── requirements.txt          # Python dependencies
├── .gitignore               # See template below
├── data/                    # Small data files, data scripts
│   ├── download_data.sh
│   └── preprocess.py
├── src/                     # Source code
│   ├── __init__.py
│   ├── models/
│   ├── data/
│   ├── utils/
│   └── train.py             # Training entry point
├── scripts/                 # SLURM job scripts
│   ├── train_baseline.sh
│   └── hyperparameter_search.sh
├── configs/                 # Experiment configs (YAML)
│   ├── default.yaml
│   └── experiment1.yaml
├── notebooks/               # Exploratory analysis only
├── tests/
├── logs/
├── checkpoints/
└── results/
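
The layout above can be created in one step. A minimal sketch, assuming bash (for brace expansion); the path `~/projects/my_ml_project` is illustrative — use your own project name:

```shell
# Sketch: scaffold the project skeleton shown above (path is illustrative).
PROJECT_ROOT="$HOME/projects/my_ml_project"

# Directories, including the nested src/ packages
mkdir -p "$PROJECT_ROOT"/{data,src/{models,data,utils},scripts,configs,notebooks,tests,logs,checkpoints,results}

# Top-level files and the Python entry points
touch "$PROJECT_ROOT"/{README.md,requirements.txt,.gitignore} \
      "$PROJECT_ROOT"/src/{__init__.py,train.py}
```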

Data Organization on Scratch

Store large datasets and job outputs on scratch, not in your home directory:

/fs/scratch/PAS1234/$USER/
├── datasets/               # Large datasets
│   ├── imagenet/
│   ├── cifar10/
│   └── custom_dataset/
└── my_ml_project/         # Project-specific data
    ├── processed_data/
    ├── checkpoints/       # Model checkpoints
    └── results/           # Experiment outputs

Scratch is purged after 90 days of inactivity.

Copy final results and best checkpoints to your home or project directory. See Clusters Overview for storage details.
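
A minimal sketch of that copy step with `rsync`, assuming the scratch layout above; `PAS1234` is the example allocation from this page, and `best.pt` is a hypothetical checkpoint filename:

```shell
# Sketch: pull results and the best checkpoint off scratch before the purge.
# PAS1234 and best.pt are placeholders — substitute your allocation and filenames.
SCRATCH_DIR="${SCRATCH_DIR:-/fs/scratch/PAS1234/$USER/my_ml_project}"
SAFE_DIR="${SAFE_DIR:-$HOME/projects/my_ml_project}"

mkdir -p "$SAFE_DIR/results" "$SAFE_DIR/checkpoints"

# -a preserves permissions and timestamps; guarded so the sketch is a no-op
# if the scratch paths do not exist yet
if [ -d "$SCRATCH_DIR/results" ]; then
    rsync -a "$SCRATCH_DIR/results/" "$SAFE_DIR/results/"
fi
if [ -f "$SCRATCH_DIR/checkpoints/best.pt" ]; then
    rsync -a "$SCRATCH_DIR/checkpoints/best.pt" "$SAFE_DIR/checkpoints/"
fi
```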

.gitignore

Use the Python .gitignore template from GitHub's gitignore repo. Add project-specific patterns for data, checkpoints, and logs as needed.
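
As a sketch, the project-specific additions could be appended like so; the pattern list is a suggestion, not exhaustive:

```shell
# Sketch: append project-specific patterns on top of GitHub's Python template.
cat >> .gitignore <<'EOF'
# Data, checkpoints, and logs live on scratch or are too large for git
data/raw/
checkpoints/
logs/
results/
*.pt
*.ckpt
# Experiment-tracking working directories (MLflow, W&B)
mlruns/
wandb/
EOF
```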

Checklist

  • Environment — uv venv (or plain venv) created and documented in pyproject.toml or requirements.txt (Environment Management)
  • PyTorch — installed with the correct CUDA version and verified on a GPU node (PyTorch & GPU Setup)
  • Training script — uses argparse, device setup, checkpointing, and logging
  • Job scripts — SLURM batch scripts for training and sweeps (Job Submission)
  • Experiment tracking — MLflow, W&B, or DVC configured (Data & Experiment Tracking)
  • Reproducibility — random seeds set, configs saved with checkpoints
  • Data on scratch — large files on /fs/scratch/, not $HOME
  • Version control — code committed, large files in .gitignore
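
Tying the checklist together, here is a sketch of `scripts/train_baseline.sh`, written out via heredoc so it can be inspected before submitting. The account `PAS1234`, the time/GPU requests, the venv path, and the `train.py` flags (assumed from the argparse item above) are all placeholders — adjust for your allocation and code:

```shell
# Sketch: write a minimal SLURM batch script for the baseline training run.
# All resource requests and paths below are illustrative.
cat > train_baseline.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=train_baseline
#SBATCH --account=PAS1234
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --output=logs/%x_%j.out

set -euo pipefail
cd "$HOME/projects/my_ml_project"
source .venv/bin/activate   # or: module load + conda activate, per your env setup

# Checkpoints go to scratch, per the data-organization section above
python src/train.py --config configs/default.yaml \
    --checkpoint-dir "/fs/scratch/PAS1234/$USER/my_ml_project/checkpoints"
EOF
```

Submit with `sbatch scripts/train_baseline.sh`; SLURM writes stdout/stderr to `logs/` (the directory must exist before submission).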