ML Project Template
A starting-point structure and checklist for running ML experiments on OSC.
Workflow Overview
flowchart LR
A[Setup\nEnv & Project] --> B[Data Prep\nDownload & Process]
B --> C[Develop\nModel & Training Code]
C --> D[Submit Jobs\nSLURM Batch]
D --> E[Track\nMetrics & Artifacts]
E --> F{Converged?}
F -->|No| C
F -->|Yes| G[Results\nAnalysis & Report]
Recommended Directory Layout
~/projects/my_ml_project/
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── .gitignore # See template below
├── data/ # Small data files, data scripts
│ ├── download_data.sh
│ └── preprocess.py
├── src/ # Source code
│ ├── __init__.py
│ ├── models/
│ ├── data/
│ ├── utils/
│ └── train.py # Training entry point
├── scripts/ # SLURM job scripts
│ ├── train_baseline.sh
│ └── hyperparameter_search.sh
├── configs/ # Experiment configs (YAML)
│ ├── default.yaml
│ └── experiment1.yaml
├── notebooks/ # Exploratory analysis only
├── tests/
├── logs/
├── checkpoints/
└── results/
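The layout above can be created by hand, but a small script keeps new projects consistent. The following is a minimal sketch using only the standard library. The `scaffold` function and the `DIRS`/`FILES` lists are illustrative names, not part of any OSC tooling, and the placeholder files are created empty.

```python
from pathlib import Path

# Directories and placeholder files taken from the layout above.
DIRS = [
    "data", "src/models", "src/data", "src/utils",
    "scripts", "configs", "notebooks", "tests",
    "logs", "checkpoints", "results",
]
FILES = [
    "README.md", "requirements.txt", ".gitignore",
    "src/__init__.py", "src/train.py", "configs/default.yaml",
]


def scaffold(root):
    """Create the recommended directory tree under root and return the resolved root."""
    root = Path(root).expanduser()
    for d in DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)
    for f in FILES:
        # touch() leaves existing files untouched, so re-running is safe
        (root / f).touch(exist_ok=True)
    return root
```

Calling `scaffold("~/projects/my_ml_project")` would produce the tree shown above, minus the example job scripts and data scripts, which are project-specific.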
Data Organization on Scratch
Store large datasets and job outputs on scratch, not in your home directory:
/fs/scratch/PAS1234/$USER/
├── datasets/ # Large datasets
│ ├── imagenet/
│ ├── cifar10/
│ └── custom_dataset/
└── my_ml_project/ # Project-specific data
    ├── processed_data/
    ├── checkpoints/ # Model checkpoints
    └── results/ # Experiment outputs
Warning: files on scratch are purged after 90 days of inactivity.
Copy final results and best checkpoints to your home or project directory before then. See Clusters Overview for storage details.
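One way to copy outputs off scratch is a short helper run at the end of a job. This is a sketch under assumptions: the function name `archive_outputs`, the `checkpoints/best*` naming convention for best checkpoints, and the destination path are all hypothetical and should be adapted to how your training code names its files.

```python
import shutil
from pathlib import Path


def archive_outputs(scratch_project, dest, patterns=("checkpoints/best*", "results/*")):
    """Copy files matching the glob patterns from scratch to dest.

    Relative paths are preserved, so checkpoints/best.pt on scratch
    ends up at dest/checkpoints/best.pt. Returns the copied paths.
    """
    scratch_project = Path(scratch_project)
    dest = Path(dest)
    copied = []
    for pattern in patterns:
        for src in sorted(scratch_project.glob(pattern)):
            if not src.is_file():
                continue  # skip subdirectories; only plain files are copied
            target = dest / src.relative_to(scratch_project)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, target)  # copy2 also preserves timestamps
            copied.append(target)
    return copied
```

For example, `archive_outputs("/fs/scratch/PAS1234/<user>/my_ml_project", Path.home() / "projects/my_ml_project")` would copy the matching files into the project's own `checkpoints/` and `results/` directories. For very large transfers, a tool such as `rsync` may be a better fit.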
.gitignore
Use the Python .gitignore template from GitHub's gitignore repo. Add project-specific patterns for data, checkpoints, and logs as needed.
Checklist
- Environment: uv venv (or plain venv) created and documented in pyproject.toml or requirements.txt (Environment Management)
- PyTorch: installed with the correct CUDA version and verified on a GPU node (PyTorch & GPU Setup)
- Training script: uses argparse, device setup, checkpointing, and logging
- Job scripts: SLURM batch scripts for training and sweeps (Job Submission)
- Experiment tracking: MLflow, W&B, or DVC configured (Data & Experiment Tracking)
- Reproducibility: random seeds set, configs saved with checkpoints
- Data on scratch: large files on /fs/scratch/, not $HOME
- Version control: code committed, large files in .gitignore
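The training-script and reproducibility items can be sketched as follows. This is a minimal, standard-library sketch, not the template's actual `train.py`: the flag names (`--seed`, `--epochs`, `--lr`, `--checkpoint-dir`), the helper names, and the `config.json` filename are assumptions for illustration, and JSON is used instead of YAML only to avoid a third-party dependency. NumPy and PyTorch are seeded only if they are installed. Device setup, logging, and the training loop itself are left out.

```python
import argparse
import json
import random
from pathlib import Path


def parse_args(argv=None):
    """Parse command-line flags; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--checkpoint-dir", type=Path, default=Path("checkpoints"))
    return parser.parse_args(argv)


def set_seed(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # seeds PyTorch's default random generators
    except ImportError:
        pass


def save_config(args, checkpoint_dir):
    """Write the parsed arguments next to the checkpoints as config.json."""
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    # Path objects are not JSON-serializable, so store them as strings
    config = {k: str(v) if isinstance(v, Path) else v for k, v in vars(args).items()}
    path = checkpoint_dir / "config.json"
    path.write_text(json.dumps(config, indent=2, sort_keys=True))
    return path
```

A training script would typically call `parse_args()`, then `set_seed(args.seed)`, then `save_config(args, args.checkpoint_dir)` before the first epoch, so that every checkpoint directory records the settings that produced it.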