# ML Workflows
What this section covers
The ML research stack on OSC: framework setup (PyTorch, PyG, RAPIDS), project structure, experiment tracking, analytics, deployment, and hard-won lessons from running ablation campaigns at scale.
Prerequisites: you've completed *Working on OSC*, so you can submit SLURM jobs and manage environments. These pages assume you know `sbatch` and `module load`.
## Framework setup
- **PyTorch & GPU Setup**: Install PyTorch against OSC's CUDA, request GPUs correctly, and verify the install. Multi-GPU and `torch.compile` notes for the ablation-campaign crowd.
- **PyG (PyTorch Geometric)**: Graph neural networks on OSC. Clean install against the pinned PyTorch version, dataset caching, mini-batch loaders for large graphs.
- **GPU Preprocessing (RAPIDS)**: 10–100× faster tabular preprocessing with cuDF/cuML. A drop-in replacement for pandas on datasets that otherwise take hours to filter.
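Before launching anything from the pages above, it helps to sanity-check the GPU environment. A minimal sketch (the helper name `cuda_summary` is made up) that degrades gracefully when PyTorch or a GPU is absent, so it is safe to run on a login node:

```python
import importlib.util

def cuda_summary() -> str:
    """Report GPU visibility without crashing on a bare environment."""
    if importlib.util.find_spec("torch") is None:
        return "no torch"   # environment not set up yet
    import torch
    if not torch.cuda.is_available():
        return "cpu only"   # wrong node type, or CUDA/driver mismatch
    return f"{torch.cuda.device_count()}x {torch.cuda.get_device_name(0)}"

print(cuda_summary())
```

Run it once inside the job allocation: "cpu only" on a GPU node usually means the PyTorch build doesn't match the loaded CUDA module.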
## Project structure & iteration
- **ML Project Template**: A starting-point directory layout and run checklist. What to standardize across ML repos so the lab can onboard onto each other's projects fast.
- **Notebook-to-Script Workflow**: Iterate in Jupyter, then graduate to `python -m` scripts for sbatch. Shared patterns for keeping the two in sync without copy-paste.
- **Data & Experiment Tracking**: DVC for datasets, W&B/MLflow for runs, TensorBoard for training curves, Parquet for loaders. The full tracking stack the lab actually uses.
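The notebook-to-script hand-off boils down to one rule: put the logic in an importable function, and make the CLI a thin wrapper. A sketch under assumed names (`run`, the `--lr`/`--epochs` flags, and the module path below are all hypothetical):

```python
import argparse

def run(lr: float = 1e-3, epochs: int = 10) -> dict:
    """Shared core: imported in the notebook, called by the CLI under sbatch."""
    # ... real training loop goes here ...
    return {"lr": lr, "epochs": epochs, "status": "ok"}

def main(argv=None) -> dict:
    parser = argparse.ArgumentParser(description="Ablation run entry point")
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args(argv)
    return run(lr=args.lr, epochs=args.epochs)
```

In the notebook: `from mypkg.train import run; run(lr=3e-4)`. In the batch script: `python -m mypkg.train --lr 3e-4`. Same code path both ways, so nothing drifts.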
## Analytics & deployment
- **DuckDB Analytics Layer**: SQL over Parquet; no server, a single binary, faster than pandas. How the lab slices experiment results across hundreds of runs.
- **Hugging Face Spaces**: Free hosting for Streamlit/Gradio dashboards and Quarto reports, with CI-driven deploys from GitHub. How the OSC usage dashboard ships.
## Lessons from the trenches
- **HPC Training Nuances**: Worker starvation, prebatching, gradient-accumulation gotchas, the stuff that only shows up at scale. Read before your first ablation campaign to save a week of wasted compute.
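One gradient-accumulation gotcha is easy to show without any framework: if the optimizer only steps every `accum_steps` batches, a trailing partial window silently drops its gradients unless you also step on the last batch. A framework-free sketch (the helper name is made up):

```python
def accumulation_schedule(num_batches: int, accum_steps: int):
    """Yield (batch_idx, do_step): when to call optimizer.step()/zero_grad()."""
    for i in range(num_batches):
        # Step at each full window, AND on the final batch so a partial
        # trailing window is not silently discarded.
        do_step = ((i + 1) % accum_steps == 0) or (i == num_batches - 1)
        yield i, do_step

steps = [i for i, do_step in accumulation_schedule(10, 4) if do_step]
print(steps)
```

With 10 batches and `accum_steps=4`, steps fire after batches 3, 7, and 9; the last step covers only two batches, so a fixed `loss / accum_steps` scaling is slightly off for that window — worth knowing before averaging metrics across an ablation campaign.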