ML Workflows

What this section covers

The ML research stack on OSC: framework setup (PyTorch, PyG, RAPIDS), project structure, experiment tracking, analytics, deployment, and hard-won lessons from running ablation campaigns at scale.

Prerequisites: you've done Working on OSC — you can submit SLURM jobs and manage environments. These pages assume you know sbatch and module load.
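The sbatch side of that prerequisite can be sketched as a minimal GPU job script. This is a hedged sketch, not an OSC-verified template: the account code, module names, environment name, and training entry point below are all placeholders you would replace with your own.

```shell
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --account=PAS1234        # placeholder: your OSC project code
#SBATCH --time=02:00:00
#SBATCH --gpus-per-node=1
#SBATCH --mem=32G

# module and environment names vary by cluster and software version
module load miniconda3 cuda
source activate my-env           # placeholder environment name
python -m project.train --epochs 10
```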


Framework setup

  • PyTorch & GPU Setup


    Install PyTorch against OSC's CUDA, request GPUs correctly, verify the install. Multi-GPU and torch.compile notes for the ablation-campaign crowd.

    Set up PyTorch
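A quick way to verify an install like the one above, assuming PyTorch is importable in the active environment. The matrix multiply only runs when a GPU was actually allocated to the job, so the same script is safe on login and compute nodes.

```python
import torch

# confirm the wheel and the CUDA toolkit it was built against
print("torch", torch.__version__, "| built for CUDA", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # a tiny matmul exercises the allocated GPU end to end
    print("device:", torch.cuda.get_device_name(0))
    x = torch.randn(3, 3, device="cuda")
    print("matmul ok:", (x @ x.T).shape)
```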

  • PyG (PyTorch Geometric)


    Graph neural networks on OSC. Clean install against the pinned PyTorch version, dataset caching, mini-batch loaders for large graphs.

    Install PyG

  • GPU Preprocessing (RAPIDS)


    10–100× faster tabular preprocessing with cuDF/cuML. A near drop-in replacement for pandas on datasets that otherwise take hours to filter.

    Accelerate preprocessing
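A sketch of the drop-in idea, written against pandas so it runs anywhere. cuDF exposes a matching API for calls like these, so on a GPU node the import swap in the first comment is the only change; the column names and values here are made up.

```python
import pandas as pd  # on a GPU node: import cudf as pd (near drop-in API)

df = pd.DataFrame({
    "run": ["a", "b", "c", "d"],
    "loss": [0.9, 0.4, 0.7, 0.2],
})

# typical preprocessing step: filter, then aggregate —
# the same calls work unchanged on a cuDF DataFrame
kept = df[df["loss"] < 0.8]
print(len(kept), kept["loss"].mean())
```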

Project structure & iteration

  • ML Project Template


    Starting-point directory layout and run checklist. What to standardize across ML repos so lab members can get up to speed on each other's projects fast.

    Use the template
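    One possible shape for such a layout, as an illustration only — the directory names are placeholders, not the template's actual contents:

    ```
    project/
    ├── configs/        # experiment configs, one file per run variant
    ├── data/           # DVC-tracked, kept out of git
    ├── notebooks/      # exploration only; nothing sbatch depends on
    ├── src/project/    # importable package, run via python -m
    ├── scripts/        # sbatch entry points
    └── results/        # Parquet run outputs for the analytics layer
    ```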

  • Notebook-to-Script Workflow


    Iterate in Jupyter, then graduate to python -m scripts for sbatch. Shared patterns for keeping the two in sync without copy-paste.

    Graduate notebooks
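One common shape for the graduation step: a module with an importable main() behind an argument parser, runnable as python -m from an sbatch script and importable from a notebook. The module name and flags below are illustrative, not the workflow's prescribed ones.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # keeping the parser in a function lets notebooks build it too
    parser = argparse.ArgumentParser(description="training entry point (sketch)")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=1e-3)
    return parser

def main(argv=None) -> int:
    # argv=None reads sys.argv under sbatch; a notebook can pass a list
    args = build_parser().parse_args(argv)
    print(f"training for {args.epochs} epochs at lr={args.lr}")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```

From a notebook the same logic runs as `main(["--epochs", "2"])`, so there is one code path instead of a copy-pasted pair.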

  • Data & Experiment Tracking


    DVC for datasets, W&B/MLflow for runs, TensorBoard for training curves, Parquet for loaders. The full tracking stack the lab actually uses.

    Track experiments
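The DVC half of that stack, as a command sketch — the dataset path is a placeholder. The small .dvc pointer file is what gets committed; the data itself goes to whatever remote the repo has configured.

```shell
# track a dataset with DVC (placeholder path)
dvc add data/raw/graphs.bin

# commit the pointer file, not the data
git add data/raw/graphs.bin.dvc .gitignore
git commit -m "Track raw graphs with DVC"

# upload the data to the configured DVC remote
dvc push
```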

Analytics & deployment

  • DuckDB Analytics Layer


    SQL over Parquet — no server, single binary, faster than pandas. How the lab slices experiment results across hundreds of runs.

    Query with DuckDB

  • Hugging Face Spaces


    Free hosting for Streamlit/Gradio dashboards and Quarto reports. CI-driven deploys from GitHub — how the OSC usage dashboard ships.

    Deploy a Space

Lessons from the trenches

  • HPC Training Nuances


    Worker starvation, prebatching, gradient-accumulation gotchas: the failure modes that only show up at scale. Read before your first ablation campaign and save a week of wasted compute.

    Avoid the traps
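The gradient-accumulation gotcha in that list reduces to arithmetic: calling backward() on each raw micro-batch loss sums the gradients, so skipping the 1/accum_steps scaling makes the accumulated gradient (and hence the effective learning rate) accum_steps times too large. A framework-free sketch with made-up loss values:

```python
# made-up per-micro-batch mean losses for one optimizer step
micro_losses = [0.9, 1.1, 1.0, 1.2]
accum_steps = len(micro_losses)

# wrong: accumulating raw losses sums to accum_steps x the batch mean
unscaled_total = sum(micro_losses)

# right: divide each loss by accum_steps so the accumulated value
# matches what one large batch of the same samples would produce
effective_loss = sum(l / accum_steps for l in micro_losses)

print(unscaled_total, effective_loss)
```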