OSC Clusters Overview¶

Understand OSC's high-performance computing clusters, available resources, and how to choose the right configuration for your workloads.

HPC Terminology Glossary¶

Before diving into OSC's clusters, familiarize yourself with these key HPC terms:

Term	Definition
Cluster	A collection of interconnected computers (nodes) that work together as a single system
Node	A single computer within a cluster, containing CPUs, memory, and sometimes GPUs
Login Node	A shared entry point for connecting to the cluster — used for file editing, job submission, and light tasks only
Compute Node	A node dedicated to running jobs — accessed through the job scheduler, not directly
Core / CPU	A single processing unit; modern nodes have many cores (e.g., 40–96 per node)
GPU	A graphics processing unit used for accelerated computing, especially deep learning
Partition	A logical grouping of nodes with specific resource limits and policies (also called a queue)
Allocation	A grant of compute time (measured in core-hours) assigned to a project account
Batch Job	A job submitted via a script that runs without user interaction
Interactive Job	A job that provides a live shell session on a compute node
Scheduler	Software (SLURM at OSC) that manages job queues and allocates resources
Module	A system for loading and managing software packages (e.g., `module load python/3.12`)
Scratch Space	High-performance temporary storage for active jobs — files are purged after inactivity
Home Directory	Persistent personal storage with limited quota (`~/` or `/users/`)
Project Space	Shared storage for a research group, tied to a project allocation
Walltime	The maximum clock time a job is allowed to run
Core-Hours	The billing unit for compute time: cores × hours (e.g., 4 cores × 2 hours = 8 core-hours)

OSC Clusters¶

OSC operates two primary clusters available to researchers. Both use the SLURM job scheduler and share the same filesystem.

Pitzer¶

Pitzer is OSC's newer and more powerful cluster, ideal for GPU-accelerated and large-scale workloads.

Specification	Details
Launched	2018 (expanded 2020)
Total Nodes	~800
CPU Type	Intel Xeon 6148 (Skylake) and 8268 (Cascade Lake)
Cores per Node	40 (Skylake) or 48 (Cascade Lake)
RAM per Node	192 GB standard, 768 GB on large-memory nodes
GPU Nodes	NVIDIA V100 (32 GB) and A100 (40 GB / 80 GB)
GPUs per GPU Node	Up to 4 V100s or up to 4 A100s
Interconnect	Intel Omni-Path / HDR InfiniBand
Operating System	RHEL 9

Recommended for ML workloads

Pitzer's A100 GPUs provide the best performance for deep learning training. Request them with --gpus-per-node=a100:1.

Owens¶

Owens is OSC's older cluster, well-suited for CPU-intensive workloads and smaller GPU jobs.

Specification	Details
Launched	2016
Total Nodes	~800
CPU Type	Intel Xeon E5-2680 v4 (Broadwell)
Cores per Node	28
RAM per Node	128 GB standard, 384 GB or 768 GB on large-memory nodes
GPU Nodes	NVIDIA P100 (16 GB)
GPUs per GPU Node	1 P100
Interconnect	Intel Omni-Path
Operating System	RHEL 9

Cluster Comparison¶

Feature	Pitzer	Owens
Generation	Newer (2018+)	Older (2016)
Cores per Node	40–48	28
RAM per Node	192 GB+	128 GB+
GPU Options	V100, A100	P100
GPU Memory	32–80 GB	16 GB
Multi-GPU Nodes	Up to 4 GPUs	1 GPU
Best For	GPU training, large jobs	CPU work, smaller GPU jobs
Queue Wait Times	Can be longer (popular)	Often shorter

Newer clusters

OSC has announced additional clusters (Ascend, Cardinal). As they become available for general use, this guide will be updated. Check OSC's systems page for the latest.

Both clusters share the same filesystem

Your home directory, project space, and scratch space are accessible from both Pitzer and Owens. You do not need to copy files between clusters.

Partitions and Queues¶

Each cluster has multiple partitions with different resource limits and policies.

Pitzer Partitions¶

Partition	Max Walltime	Max Nodes	GPU Access	Use Case
`serial`	168:00:00 (7 days)	1	No	Single-node CPU jobs
`parallel`	168:00:00 (7 days)	20+	No	Multi-node MPI jobs
`gpu`	48:00:00 (2 days)	Variable	Yes (V100, A100)	GPU-accelerated workloads
`debug`	01:00:00 (1 hour)	2	Yes	Quick testing and debugging
`longserial`	336:00:00 (14 days)	1	No	Long-running single-node jobs
`largemem`	168:00:00 (7 days)	1	No	Jobs requiring 384+ GB RAM
`hugemem`	168:00:00 (7 days)	1	No	Jobs requiring 768+ GB RAM

Owens Partitions¶

Partition	Max Walltime	Max Nodes	GPU Access	Use Case
`serial`	168:00:00 (7 days)	1	No	Single-node CPU jobs
`parallel`	168:00:00 (7 days)	20+	No	Multi-node MPI jobs
`gpu`	168:00:00 (7 days)	Variable	Yes (P100)	GPU-accelerated workloads
`debug`	01:00:00 (1 hour)	2	Yes	Quick testing and debugging
`longserial`	336:00:00 (14 days)	1	No	Long-running single-node jobs
`hugemem`	168:00:00 (7 days)	1	No	Jobs requiring 768+ GB RAM

Choosing the Right Partition¶

flowchart TD
    A[What type of job?] --> B{Need a GPU?}
    B -->|Yes| C{Quick test < 1 hr?}
    B -->|No| D{Multi-node?}
    C -->|Yes| E[debug partition]
    C -->|No| F[gpu partition]
    D -->|Yes| G[parallel partition]
    D -->|No| H{Need > 192 GB RAM?}
    H -->|Yes| I[largemem or hugemem]
    H -->|No| J{Run > 7 days?}
    J -->|Yes| K[longserial partition]
    J -->|No| L[serial partition]

Start with debug for testing

Always test your job scripts on the debug partition first. Debug jobs start quickly and help you catch errors before committing to long runs.

Resource Limits and Quotas¶

Compute Allocations¶

Every project has an allocation of core-hours. Check your balance with:

# Check your project's remaining core-hours
sbalance

# Or for a specific account
sbalance -a PAS1234

Monitor your allocation

When your allocation runs out, jobs will no longer be scheduled. Check sbalance regularly and request additional time through your PI if needed.

Storage Quotas¶

OSC provides three types of storage, each with different purposes and limits:

Storage	Path	Quota	Purge Policy	Backed Up	Use For
Home	`~/` or `/users/<username>`	500 GB	None	Yes	Code, configs, small datasets
Scratch	`/fs/scratch/<project>`	100 TB (project)	Files deleted after 90 days of inactivity	No	Active job data, temporary files
Project	`/fs/ess/<project>`	Varies by allocation	None	Yes	Shared datasets, results, models

Check your current usage:

# Check home directory quota
quota -s

# Check project storage usage
du -sh /fs/ess/PAS1234

# Check scratch usage
du -sh /fs/scratch/PAS1234

Scratch is purged automatically

Files on scratch that have not been accessed for 90 days are automatically deleted. Never store important results only on scratch. Copy final results to your home or project directory.

Shared Project Directories¶

Use project space for datasets and environments that the whole lab needs:

/fs/ess/PAS1234/
├── datasets/           # Shared datasets
├── envs/               # Shared conda/venv environments
├── username1/          # Individual work directories
└── username2/

Keep a README in the project root documenting what each directory contains. For creating shared conda or venv environments, see Environment Management.

Job Limits per User¶

Typical per-user limits (these may vary by project):

Limit	Value
Max running jobs	~256
Max queued jobs	~1000
Max GPUs per user	Varies by partition
Max cores per job	Depends on partition and allocation

Choosing the Right Resources¶

For complete SBATCH job script templates (GPU training, CPU processing, debug, multi-GPU), see the Job Submission Guide.

Match CPU cores to GPU

Request 4–8 CPU cores per GPU to keep the data pipeline fast enough to feed the GPU.

Resource Request Guidelines¶

Workload	Partition	GPUs	CPUs	Memory	Typical Walltime
Small test	`debug`	0–1	2–4	8–16 GB	15–30 min
CPU preprocessing	`serial`	0	8–16	32–64 GB	1–4 hours
Single GPU training	`gpu`	1	4–8	32–64 GB	4–24 hours
Multi-GPU training	`gpu`	2–4	16–32	128–192 GB	12–48 hours
Large-memory job	`largemem`	0	8–48	384–768 GB	2–24 hours
Hyperparameter sweep	`gpu` (array)	1 per task	4–8	32 GB	2–8 hours per task

Troubleshooting¶

"Invalid account" Error¶

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Your project account may not have access to the partition you requested. Check:

# List your accounts
sacctmgr show associations user=$USER format=Account,Partition

Cannot See GPUs¶

If nvidia-smi shows no GPUs, make sure you requested GPU resources:

#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1

And load the CUDA module in your script:

module load cuda/12.4

Jobs Pending with "Resources" Reason¶

Your job is requesting more resources than are currently available. Try:

Reducing the number of GPUs or nodes
Shortening the walltime (shorter jobs fit into gaps more easily)
Using a different partition (e.g., Owens instead of Pitzer)

Next Steps¶

Set up your OSC account if you haven't already
Learn to connect via SSH
Submit your first job