
ML Project Template

A starting-point structure and checklist for running ML experiments on OSC.

Workflow Overview

flowchart LR
    A[Setup\nEnv & Project] --> B[Data Prep\nDownload & Process]
    B --> C[Develop\nModel & Training Code]
    C --> D[Submit Jobs\nSLURM Batch]
    D --> E[Track\nMetrics & Artifacts]
    E --> F{Converged?}
    F -->|No| C
    F -->|Yes| G[Results\nAnalysis & Report]

Project Structure
~/projects/my_ml_project/
├── README.md                 # Project documentation
├── requirements.txt          # Python dependencies
├── .gitignore               # See template below
├── data/                    # Small data files, data scripts
│   ├── download_data.sh
│   └── preprocess.py
├── src/                     # Source code
│   ├── __init__.py
│   ├── models/
│   ├── data/
│   ├── utils/
│   └── train.py             # Training entry point
├── scripts/                 # SLURM job scripts
│   ├── train_baseline.sh
│   └── hyperparameter_search.sh
├── configs/                 # Experiment configs (YAML)
│   ├── default.yaml
│   └── experiment1.yaml
├── notebooks/               # Exploratory analysis only
├── tests/
├── logs/
├── checkpoints/
└── results/
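
The layout above can be created in one step. A minimal sketch, assuming bash (for brace expansion); the path `~/projects/my_ml_project` is illustrative — use your own project name:

```shell
# Sketch: scaffold the project skeleton shown above (path is illustrative).
PROJECT_ROOT="$HOME/projects/my_ml_project"

# Directories, including the nested src/ packages
mkdir -p "$PROJECT_ROOT"/{data,src/{models,data,utils},scripts,configs,notebooks,tests,logs,checkpoints,results}

# Top-level files and the Python entry points
touch "$PROJECT_ROOT"/{README.md,requirements.txt,.gitignore} \
      "$PROJECT_ROOT"/src/{__init__.py,train.py}
```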

Data Organization on Scratch

Store large datasets and job outputs on scratch, not in your home directory:

/fs/scratch/PAS1234/$USER/
├── datasets/               # Large datasets
│   ├── imagenet/
│   ├── cifar10/
│   └── custom_dataset/
└── my_ml_project/         # Project-specific data
    ├── processed_data/
    ├── checkpoints/       # Model checkpoints
    └── results/           # Experiment outputs

Scratch is purged after 90 days of inactivity.

Copy final results and best checkpoints to your home or project directory. See Clusters Overview for storage details.
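
A minimal sketch of that copy step with `rsync`, assuming the scratch layout above; `PAS1234` is the example allocation from this page, and `best.pt` is a hypothetical checkpoint filename:

```shell
# Sketch: pull results and the best checkpoint off scratch before the purge.
# PAS1234 and best.pt are placeholders — substitute your allocation and filenames.
SCRATCH_DIR="${SCRATCH_DIR:-/fs/scratch/PAS1234/$USER/my_ml_project}"
SAFE_DIR="${SAFE_DIR:-$HOME/projects/my_ml_project}"

mkdir -p "$SAFE_DIR/results" "$SAFE_DIR/checkpoints"

# -a preserves permissions and timestamps; guarded so the sketch is a no-op
# if the scratch paths do not exist yet
if [ -d "$SCRATCH_DIR/results" ]; then
    rsync -a "$SCRATCH_DIR/results/" "$SAFE_DIR/results/"
fi
if [ -f "$SCRATCH_DIR/checkpoints/best.pt" ]; then
    rsync -a "$SCRATCH_DIR/checkpoints/best.pt" "$SAFE_DIR/checkpoints/"
fi
```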

.gitignore

Use the Python .gitignore template from GitHub's gitignore repo. Add project-specific patterns for data, checkpoints, and logs as needed.
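
As a sketch, the project-specific additions could be appended like so; the pattern list is a suggestion, not exhaustive:

```shell
# Sketch: append project-specific patterns on top of GitHub's Python template.
cat >> .gitignore <<'EOF'
# Data, checkpoints, and logs live on scratch or are too large for git
data/raw/
checkpoints/
logs/
results/
*.pt
*.ckpt
# Experiment-tracking working directories (MLflow, W&B)
mlruns/
wandb/
EOF
```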

Checklist

  • Environment — uv venv (or plain venv) created and documented in pyproject.toml or requirements.txt (Environment Management)
  • PyTorch — installed with the correct CUDA version and verified on a GPU node (PyTorch & GPU Setup)
  • Training script — uses argparse, device setup, checkpointing, and logging
  • Job scripts — SLURM batch scripts for training and sweeps (Job Submission)
  • Experiment tracking — MLflow, W&B, or DVC configured (Data & Experiment Tracking)
  • Reproducibility — random seeds set, configs saved with checkpoints
  • Data on scratch — large files on /fs/scratch/, not $HOME
  • Version control — code committed, large files in .gitignore
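
Tying the checklist together, here is a sketch of `scripts/train_baseline.sh`, written out via heredoc so it can be inspected before submitting. The account `PAS1234`, the time/GPU requests, the venv path, and the `train.py` flags (assumed from the argparse item above) are all placeholders — adjust for your allocation and code:

```shell
# Sketch: write a minimal SLURM batch script for the baseline training run.
# All resource requests and paths below are illustrative.
cat > train_baseline.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=train_baseline
#SBATCH --account=PAS1234
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --output=logs/%x_%j.out

set -euo pipefail
cd "$HOME/projects/my_ml_project"
source .venv/bin/activate   # or: module load + conda activate, per your env setup

# Checkpoints go to scratch, per the data-organization section above
python src/train.py --config configs/default.yaml \
    --checkpoint-dir "/fs/scratch/PAS1234/$USER/my_ml_project/checkpoints"
EOF
```

Submit with `sbatch scripts/train_baseline.sh`; SLURM writes stdout/stderr to `logs/` (the directory must exist before submission).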