# ML Workflows
What this section covers
The ML research stack on OSC: framework setup (PyTorch, PyG, RAPIDS), project structure, experiment tracking, analytics, deployment, and hard-won lessons from running ablation campaigns at scale.
Prerequisites: you've completed *Working on OSC*, so you can submit SLURM jobs and manage environments. These pages assume you know `sbatch` and `module load`.
## Framework setup
- **PyTorch & GPU Setup**: Install PyTorch against OSC's CUDA, request GPUs correctly, and verify the install. Multi-GPU and `torch.compile` notes for the ablation-campaign crowd.
- **PyG (PyTorch Geometric)**: Graph neural networks on OSC. Clean install against the pinned PyTorch version, dataset caching, mini-batch loaders for large graphs.
- **GPU Preprocessing (RAPIDS)**: 10–100× faster tabular preprocessing with cuDF/cuML. A drop-in replacement for pandas on datasets that otherwise take hours to filter.
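Before launching anything from the pages above, it helps to sanity-check the GPU environment. A minimal sketch (the helper name `cuda_summary` is made up) that degrades gracefully when PyTorch or a GPU is absent, so it is safe to run on a login node:

```python
import importlib.util

def cuda_summary() -> str:
    """Report GPU visibility without crashing on a bare environment."""
    if importlib.util.find_spec("torch") is None:
        return "no torch"   # environment not set up yet
    import torch
    if not torch.cuda.is_available():
        return "cpu only"   # wrong node type, or CUDA/driver mismatch
    return f"{torch.cuda.device_count()}x {torch.cuda.get_device_name(0)}"

print(cuda_summary())
```

Run it once inside the job allocation: "cpu only" on a GPU node usually means the PyTorch build doesn't match the loaded CUDA module.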
## Project structure & iteration
- **ML Project Template**: A starting-point directory layout and run checklist. What to standardize across ML repos so the lab can onboard onto each other's projects fast.
- **Notebook-to-Script Workflow**: Iterate in Jupyter, then graduate to `python -m` scripts for sbatch. Shared patterns for keeping the two in sync without copy-paste.
- **Data & Experiment Tracking**: DVC for datasets, W&B/MLflow for runs, TensorBoard for training curves, Parquet for loaders. The full tracking stack the lab actually uses.
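The notebook-to-script hand-off boils down to one rule: put the logic in an importable function, and make the CLI a thin wrapper. A sketch under assumed names (`run`, the `--lr`/`--epochs` flags, and the module path below are all hypothetical):

```python
import argparse

def run(lr: float = 1e-3, epochs: int = 10) -> dict:
    """Shared core: imported in the notebook, called by the CLI under sbatch."""
    # ... real training loop goes here ...
    return {"lr": lr, "epochs": epochs, "status": "ok"}

def main(argv=None) -> dict:
    parser = argparse.ArgumentParser(description="Ablation run entry point")
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args(argv)
    return run(lr=args.lr, epochs=args.epochs)
```

In the notebook: `from mypkg.train import run; run(lr=3e-4)`. In the batch script: `python -m mypkg.train --lr 3e-4`. Same code path both ways, so nothing drifts.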
## Analytics & deployment
- **DuckDB Analytics Layer**: SQL over Parquet; no server, a single binary, faster than pandas. How the lab slices experiment results across hundreds of runs.
- **Hugging Face Spaces**: Free hosting for Streamlit/Gradio dashboards and Quarto reports, with CI-driven deploys from GitHub. How the OSC usage dashboard ships.
## Lessons from the trenches
- **HPC Training Nuances**: Worker starvation, prebatching, gradient-accumulation gotchas, the stuff that only shows up at scale. Read before your first ablation campaign to save a week of wasted compute.
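One gradient-accumulation gotcha is easy to show without any framework: if the optimizer only steps every `accum_steps` batches, a trailing partial window silently drops its gradients unless you also step on the last batch. A framework-free sketch (the helper name is made up):

```python
def accumulation_schedule(num_batches: int, accum_steps: int):
    """Yield (batch_idx, do_step): when to call optimizer.step()/zero_grad()."""
    for i in range(num_batches):
        # Step at each full window, AND on the final batch so a partial
        # trailing window is not silently discarded.
        do_step = ((i + 1) % accum_steps == 0) or (i == num_batches - 1)
        yield i, do_step

steps = [i for i, do_step in accumulation_schedule(10, 4) if do_step]
print(steps)
```

With 10 batches and `accum_steps=4`, steps fire after batches 3, 7, and 9; the last step covers only two batches, so a fixed `loss / accum_steps` scaling is slightly off for that window — worth knowing before averaging metrics across an ablation campaign.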