
OSC Clusters Overview

Understand OSC's high-performance computing clusters, available resources, and how to choose the right configuration for your workloads.

HPC Terminology Glossary

Before diving into OSC's clusters, familiarize yourself with these key HPC terms:

| Term | Definition |
| --- | --- |
| Cluster | A collection of interconnected computers (nodes) that work together as a single system |
| Node | A single computer within a cluster, containing CPUs, memory, and sometimes GPUs |
| Login Node | A shared entry point for connecting to the cluster — used for file editing, job submission, and light tasks only |
| Compute Node | A node dedicated to running jobs — accessed through the job scheduler, not directly |
| Core / CPU | A single processing unit; modern nodes have many cores (e.g., 40–48 per node on Pitzer) |
| GPU | A graphics processing unit used for accelerated computing, especially deep learning |
| Partition | A logical grouping of nodes with specific resource limits and policies (also called a queue) |
| Allocation | A grant of compute time (measured in core-hours) assigned to a project account |
| Batch Job | A job submitted via a script that runs without user interaction |
| Interactive Job | A job that provides a live shell session on a compute node |
| Scheduler | Software (SLURM at OSC) that manages job queues and allocates resources |
| Module | A system for loading and managing software packages (e.g., module load python/3.12) |
| Scratch Space | High-performance temporary storage for active jobs — files are purged after inactivity |
| Home Directory | Persistent personal storage with limited quota (~/ or /users/) |
| Project Space | Shared storage for a research group, tied to a project allocation |
| Walltime | The maximum clock time a job is allowed to run |
| Core-Hours | The billing unit for compute time: cores × hours (e.g., 4 cores × 2 hours = 8 core-hours) |
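
The Module entry above is worth expanding, since you will use it constantly. A minimal sketch of the everyday commands, assuming an Lmod-style module system (the python/3.12 version string is just the example from the table; run module avail to see what is actually installed):

# List software modules available on the current cluster
module avail

# Search all versions of a package across the module tree (Lmod)
module spider python

# Load a specific version, then confirm what is loaded
module load python/3.12
module list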

Cluster Architecture

When you SSH into OSC, you land on a login node — a shared gateway for editing files and submitting jobs. Compute-intensive work runs on compute nodes allocated by the SLURM scheduler. All nodes share the same filesystems (home, scratch, project).

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#e8f4fd', 'primaryTextColor': '#1a1a1a', 'lineColor': '#555'}}}%%
flowchart LR
    subgraph Your Machine
        A["fa:fa-laptop SSH Client"]:::external
    end
    subgraph OSC Login Nodes
        B["fa:fa-server pitzer-login01"]:::process
        C["fa:fa-server pitzer-login02"]:::process
    end
    subgraph SLURM Scheduler
        D["fa:fa-cogs sbatch / sinteractive"]:::infra
    end
    subgraph Compute Nodes
        E["fa:fa-server CPU Nodes\n557 nodes\n40-48 cores each"]:::process
        F["fa:fa-microchip GPU Nodes\n78 nodes\n2-4 V100s each"]:::process
        G["fa:fa-memory Large Memory\n16 nodes\nup to 3 TB RAM"]:::process
    end
    subgraph Shared Filesystems
        H["fa:fa-hard-drive /users — Home\n/fs/scratch — Scratch\n/fs/ess — Project"]:::data
    end

    A --> B & C
    B & C --> D
    D --> E & F & G
    E & F & G --- H
    B & C --- H

    classDef process fill:#e8f4fd,stroke:#3b82f6
    classDef external fill:#ede9fe,stroke:#7c3aed
    classDef infra fill:#f3e8ff,stroke:#9333ea
    classDef data fill:#d1fae5,stroke:#059669

Do not run compute on login nodes

Login nodes are shared by all users. Running training, preprocessing, or heavy builds on them slows everyone down and may get your processes killed. Use sinteractive or sbatch for anything beyond editing and job submission.
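
For interactive work beyond editing, request a shell on a compute node instead. A minimal plain-SLURM sketch (PAS1234 is a placeholder project account; the sinteractive wrapper covers the same use case):

# Request a 30-minute interactive shell on a GPU debug node
# PAS1234 is a placeholder project account
srun --account=PAS1234 --partition=gpudebug \
     --gpus-per-node=1 --cpus-per-task=4 \
     --time=00:30:00 --pty bash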

OSC Clusters

OSC currently operates three clusters. All use the SLURM job scheduler and share the same filesystems.

Pitzer (Primary)

Pitzer is the lab's primary cluster. All specs below are verified against live sinfo output (March 2026).

| Node Type | Nodes | CPUs | RAM | GPUs | Partition |
| --- | --- | --- | --- | --- | --- |
| Standard (2018 Skylake) | 217 | 40 | 192 GB | | cpu |
| Standard (2020 Cascade Lake) | 340 | 48 | 192 GB | | cpu-exp |
| Dual GPU (2018 Skylake) | 32 | 40 | 384 GB | 2× V100 16 GB | gpu |
| Dual GPU (2020 Cascade Lake) | 42 | 48 | 384 GB | 2× V100 32 GB | gpu-exp |
| Quad GPU (2020 Cascade Lake) | 4 | 48 | 768 GB | 4× V100 32 GB + NVLink | gpu-quad |
| Large Memory (2018) | 4 | 80 | 3 TB | | hugemem |
| Large Memory (2020) | 12 | 48 | 768 GB | | largemem |

Total: 651 nodes, ~29,000 cores. Interconnect: Mellanox EDR InfiniBand (100 Gbps). OS: RHEL 9.

Source: OSC Pitzer Documentation

Which GPU partition?

  • gpu (32 nodes) — V100 16 GB, 40 CPUs. Fine for most training.
  • gpu-exp (42 nodes) — V100 32 GB, 48 CPUs. Use when you need more GPU memory (larger models, bigger batches).
  • gpu-quad (4 nodes) — 4× V100 32 GB with NVLink, 768 GB RAM. For multi-GPU or very large models.
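
In a job script the choice comes down to a couple of SBATCH lines. A sketch for the middle option (the account and resource numbers are illustrative, not recommendations):

#!/bin/bash
#SBATCH --account=PAS1234        # placeholder project account
#SBATCH --partition=gpu-exp      # 32 GB V100 nodes
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=12:00:00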

Ascend & Cardinal (Newer Clusters)

OSC has deployed two newer clusters with more powerful GPUs. Both are now in production and accepting jobs.

| Cluster | GPU | GPU Memory | Notes |
| --- | --- | --- | --- |
| Ascend | NVIDIA A100 | 40 GB / 80 GB | Best for large-model training, transformer workloads |
| Cardinal | NVIDIA H100 | 94 GB | Latest generation, highest throughput |

Access may require a separate allocation request. Check OSC's systems page for current availability and how to request access.

Owens is decommissioned

Owens was fully shut down in February 2025. If you see references to Owens in older scripts or documentation, replace owens.osc.edu with pitzer.osc.edu. All Owens data was migrated to the shared filesystem.

Partitions and Queues

Each partition groups nodes with the same resource profile and time limits. The table below is from live sinfo output (March 2026):

Pitzer Partitions

| Partition | Max Walltime | Nodes | GPUs | Use Case |
| --- | --- | --- | --- | --- |
| cpu | 7 days | 217 | | Single-node CPU jobs (Skylake, 40 cores) |
| cpu-exp | 7 days | 340 | | Single-node CPU jobs (Cascade Lake, 48 cores) |
| longcpu | 14 days | 217 | | Long-running CPU jobs |
| gpu | 7 days | 32 | 2× V100 16 GB | GPU training |
| gpu-exp | 7 days | 42 | 2× V100 32 GB | GPU training (more memory) |
| gpu-quad | 7 days | 4 | 4× V100 32 GB | Multi-GPU training |
| gpudebug | 1 hour | 32 | V100 | Quick GPU testing (high priority) |
| gpudebug-exp | 1 hour | 42 | V100-32G | Quick GPU testing |
| debug-cpu | 1 hour | 217 | | Quick CPU testing (high priority) |
| largemem | 7 days | 12 | | Jobs needing 768 GB RAM |
| hugemem | 7 days | 4 | | Jobs needing up to 3 TB RAM |
| gpubackfill | 4 hours | 32 | V100 | Free GPU time in scheduling gaps |
| gpubackfill-exp | 4 hours | 42 | V100-32G | Free GPU time in scheduling gaps |
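
Partition limits drift over time, so it is worth reproducing this table from the scheduler itself. A sketch using standard sinfo format specifiers (partition, time limit, node count, GPU/GRES configuration):

# Compare the table above against the live cluster configuration
sinfo -o "%15P %12l %6D %25G"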

Backfill partitions are free

gpubackfill and gpubackfill-exp don't charge your allocation. The tradeoff is a 4-hour walltime limit. Great for short experiments and hyperparameter exploration.
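
Pointing an existing script at backfill is usually a two-line change (a sketch; the 4-hour cap is the hard limit, so checkpoint accordingly):

#SBATCH --partition=gpubackfill
#SBATCH --time=04:00:00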

Choosing the Right Partition

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#e8f4fd', 'primaryTextColor': '#1a1a1a', 'lineColor': '#555'}}}%%
flowchart TD
    A["What type of job?"]:::process --> B@{ shape: diam, label: "Need a GPU?" }
    B:::decision -->|Yes| C@{ shape: diam, label: "Quick test < 1 hr?" }
    C:::decision -->|Yes| E@{ shape: stadium, label: "fa:fa-microchip gpudebug / gpudebug-exp" }
    E:::success
    C -->|No| F@{ shape: diam, label: "Need > 16 GB VRAM?" }
    F:::decision -->|Yes| G@{ shape: stadium, label: "fa:fa-microchip gpu-exp (32 GB)\nor gpu-quad (4× 32 GB)" }
    G:::success
    F -->|No| H@{ shape: stadium, label: "fa:fa-microchip gpu (16 GB)" }
    H:::success
    B -->|No| D@{ shape: diam, label: "Multi-node?" }
    D:::decision -->|Yes| I@{ shape: stadium, label: "fa:fa-server cpu + srun\n(multi-node MPI)" }
    I:::success
    D -->|No| J@{ shape: diam, label: "Need > 192 GB RAM?" }
    J:::decision -->|Yes| K@{ shape: stadium, label: "fa:fa-memory largemem (768 GB)\nor hugemem (3 TB)" }
    K:::success
    J -->|No| L@{ shape: diam, label: "Run > 7 days?" }
    L:::decision -->|Yes| M@{ shape: stadium, label: "fa:fa-server longcpu" }
    M:::success
    L -->|No| N@{ shape: stadium, label: "fa:fa-server cpu / cpu-exp" }
    N:::success

    classDef process fill:#e8f4fd,stroke:#3b82f6
    classDef decision fill:#fef3c7,stroke:#d97706
    classDef success fill:#d1fae5,stroke:#059669

Start with gpudebug for testing

Always test your job scripts on a debug partition first. Debug jobs start quickly (often within seconds) and help you catch errors before committing to long runs.
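
Because sbatch command-line options override the #SBATCH directives inside a script, you can smoke-test an existing script on the debug partition without editing it (train.sbatch is a hypothetical script name):

# 15-minute test run of an existing job script on the debug partition
# train.sbatch is a placeholder for your own script
sbatch --partition=gpudebug --time=00:15:00 train.sbatch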

Storage Tiers

OSC provides four storage tiers, all visible from every login and compute node. Each serves a different purpose.

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#e8f4fd', 'primaryTextColor': '#1a1a1a', 'lineColor': '#555'}}}%%
flowchart LR
    subgraph Backed Up
        HOME["fa:fa-house Home\n/users/PAS.../user\n500 GB · NFS\nCode, configs"]:::backed
        PROJECT["fa:fa-folder-open Project\n/fs/ess/PAS...\n1-5 TB · GPFS\nShared datasets"]:::backed
    end
    subgraph Not Backed Up
        SCRATCH["fa:fa-bolt Scratch\n/fs/scratch/PAS...\n100 TB · GPFS\nActive job data"]:::notbacked
        TMPDIR["fa:fa-gauge-high $TMPDIR\nLocal disk\nJob-only · Fastest\nStaging I/O-heavy data"]:::notbacked
    end

    HOME -->|"rsync large files"| SCRATCH
    SCRATCH -->|"cp at job start"| TMPDIR
    PROJECT -->|"rsync shared data"| SCRATCH

    classDef backed fill:#d1fae5,stroke:#059669
    classDef notbacked fill:#fef3c7,stroke:#d97706

Storage Details

| Tier | Path | Filesystem | Quota | Purge | Backed Up | Performance |
| --- | --- | --- | --- | --- | --- | --- |
| Home | /users/<project>/<user> | NetApp NFS | 500 GB, 1M files | None (archived after 18 months inactive) | Yes (daily, 2 tape copies) | ~40 GB/s read/write |
| Project | /fs/ess/<project> | GPFS | 1–5 TB (varies by allocation) | None | Yes (daily, 2 tape copies) | ~60 GB/s read, ~50 GB/s write |
| Scratch | /fs/scratch/<project> | GPFS | 100 TB, 25M files | 60 days inactivity, purged Wednesdays | No | ~170 GB/s read, ~70 GB/s write |
| $TMPDIR | Local compute node disk | Local | Varies by node | Deleted when job ends | No | Fastest (local I/O) |

Source: OSC Storage Environment

Scratch purge: 60 days, not recoverable

Files on scratch that have not been accessed for 60 days are automatically deleted every Wednesday. This is not recoverable — scratch is not backed up. Copy final results and trained models to home or project space.

Do not use scripts to artificially touch files and reset access times — this violates OSC policy and can result in account suspension.

Which Storage for What

| Data Type | Store On | Why |
| --- | --- | --- |
| Source code, configs, small scripts | Home | Backed up, persistent, git-managed |
| Shared datasets (lab-wide) | Project (/fs/ess/) | Backed up, shared across users |
| Training data, checkpoints, logs | Scratch | High throughput, large quota |
| I/O-heavy training data during a job | $TMPDIR | Fastest reads, avoids NFS/GPFS contention |
| Final results, trained models to keep | Home or Project | Backed up, won't be purged |
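
A common job-script pattern for the $TMPDIR row is to stage data onto node-local disk at job start and copy results back before the job ends. A sketch (PAS1234, the dataset path, and train.py are placeholders):

# Stage input data from scratch onto fast node-local storage
# (PAS1234 and the dataset path are placeholders)
cp -r /fs/scratch/PAS1234/dataset "$TMPDIR/"

# Train against the local copy (train.py is a placeholder script)
python train.py --data "$TMPDIR/dataset" --out "$TMPDIR/results"

# Copy results somewhere persistent; $TMPDIR is deleted when the job ends
cp -r "$TMPDIR/results" /fs/ess/PAS1234/results/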

Check your current usage:

# Home quota
quota -s

# Project storage usage
du -sh /fs/ess/PAS1234

# Scratch usage
du -sh /fs/scratch/PAS1234

Shared Project Directories

Use project space for datasets and environments that the whole lab needs:

/fs/ess/PAS1234/
├── datasets/           # Shared datasets
├── envs/               # Shared conda/venv environments
├── username1/          # Individual work directories
└── username2/

Keep a README in the project root documenting what each directory contains. For creating shared conda or venv environments, see Environment Management.

Compute Allocations

Every project has an allocation of core-hours. Check your balance with:

# Check your project's remaining core-hours
sbalance

# Or for a specific account
sbalance -a PAS1234

Monitor your allocation

When your allocation runs out, jobs will no longer be scheduled. Check sbalance regularly and request additional time through your PI if needed.

Resource Request Guidelines

For complete SBATCH job script templates (GPU training, CPU processing, debug, multi-GPU), see the Job Submission Guide.

Match CPU cores to GPU

Request 4–8 CPU cores per GPU to keep the data pipeline fast enough to feed the GPU. See the PyTorch Performance Tuning Guide for data loading optimization.
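
In SBATCH terms the pairing can be expressed directly; a sketch (6 cores per GPU is just a mid-range choice, not a measured optimum):

#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=6    # keeps DataLoader workers fed without over-requesting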

| Workload | Partition | GPUs | CPUs | Memory | Typical Walltime |
| --- | --- | --- | --- | --- | --- |
| Quick test | gpudebug | 0–1 | 2–4 | 8–16 GB | 15–30 min |
| CPU preprocessing | cpu | 0 | 8–16 | 32–64 GB | 1–4 hours |
| Single GPU training | gpu or gpu-exp | 1 | 4–8 | 32–64 GB | 4–24 hours |
| Multi-GPU training | gpu-quad | 2–4 | 16–32 | 128–192 GB | 12–48 hours |
| Large-memory job | largemem | 0 | 8–48 | 384–768 GB | 2–24 hours |
| Hyperparameter sweep | gpu (array) | 1 per task | 4–8 | 32 GB | 2–8 hours per task |
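
For the hyperparameter-sweep row, SLURM job arrays launch one independent task per configuration. A minimal sketch (configs.txt and train.py are hypothetical; each line of configs.txt holds one set of flags):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=04:00:00
#SBATCH --array=0-7               # 8 tasks, one per hyperparameter setting

# Pick this task's line from configs.txt (sed is 1-indexed, array IDs start at 0)
# configs.txt and train.py are placeholders
CONFIG=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" configs.txt)
python train.py $CONFIG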

Troubleshooting

"Invalid account" Error

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Your project account may not have access to the partition you requested. Check:

# List your accounts and partitions
sacctmgr show associations user=$USER format=Account,Partition

Cannot See GPUs

If nvidia-smi shows no GPUs, make sure you requested GPU resources:

#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1
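
It also helps to print what the node actually exposes at the top of the job script, so a misconfigured request shows up in the job log immediately:

# Confirm the job actually received a GPU
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi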

You do NOT need module load cuda for PyPI torch

If you installed PyTorch from PyPI (via pip install or uv add), the wheels bundle their own CUDA libraries. Only load a CUDA module if you are compiling custom CUDA extensions. See PyTorch & GPU Setup.
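
To confirm PyTorch itself sees the GPU, a one-liner inside the job is enough (assumes torch is importable in the active environment):

# Print the bundled CUDA version and whether a GPU is visible to PyTorch
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"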

Jobs Pending with "Resources" Reason

Your job is requesting more resources than are currently available. Try:

  • Reducing the number of GPUs or nodes
  • Shortening the walltime (SLURM's backfill scheduler can slot shorter jobs into scheduling gaps left by larger reservations)
  • Using gpubackfill or gpubackfill-exp for short experiments (4-hour max, doesn't charge allocation)
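
To see how SLURM itself explains the wait, check the pending reason and estimated start time for your jobs (standard squeue options):

# Pending reason and the scheduler's estimated start time
squeue -u $USER --start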
