Job Submission Guide¶
Learn how to submit and manage jobs on OSC using the SLURM job scheduler.
Overview¶
OSC uses SLURM (Simple Linux Utility for Resource Management) to schedule and manage jobs on compute nodes.
Quick Start¶
Interactive Job (Testing)¶
# Simplest way to get a compute node
sinteractive -A PAS1234 -c 4 -t 01:00:00
# With GPU
sinteractive -A PAS1234 -c 4 -g 1 -t 01:00:00
Batch Job (Production)¶
Create job.sh:
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --account=PAS1234
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=02:00:00
#SBATCH --output=job_%j.out
# Your commands here
python train.py
Submit it with sbatch:
sbatch job.sh
SLURM Basics¶
Essential Commands¶
# Submit batch job
sbatch job_script.sh
# Interactive session
srun -p partition --pty bash
# List your jobs
squeue -u $USER
# Cancel job
scancel <job_id>
# Cancel all your jobs
scancel -u $USER
# Job details
scontrol show job <job_id>
# Job efficiency (after completion)
seff <job_id>
Job States¶
- PD (Pending): Waiting for resources
- R (Running): Job is running
- CG (Completing): Job is finishing
- CD (Completed): Job finished successfully
- F (Failed): Job failed
- CA (Cancelled): Job was cancelled
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#e8f4fd', 'primaryTextColor': '#1a1a1a', 'lineColor': '#555'}}}%%
flowchart LR
A@{ shape: bolt, label: "fa:fa-paper-plane sbatch" }:::trigger --> B@{ shape: hex, label: "fa:fa-clock Pending\nPD" }
B:::decision --> C["fa:fa-play Running\nR"]:::process
C --> D["fa:fa-spinner Completing\nCG"]:::process
D --> E@{ shape: stadium, label: "fa:fa-check Completed\nCD" }
E:::success
C --> F@{ shape: stadium, label: "fa:fa-xmark Failed\nF" }
F:::error
C --> G@{ shape: stadium, label: "fa:fa-xmark Timeout" }
G:::error
B --> H@{ shape: stadium, label: "fa:fa-xmark Cancelled\nCA" }
H:::error
C --> H
classDef trigger fill:#fef3c7,stroke:#d97706
classDef decision fill:#fef3c7,stroke:#d97706
classDef process fill:#e8f4fd,stroke:#3b82f6
classDef success fill:#d1fae5,stroke:#059669
classDef error fill:#fee2e2,stroke:#ef4444
Interactive Sessions¶
Interactive sessions give you a shell on a compute node — your own dedicated resources with no contention from other users. Use them for anything heavier than editing code or submitting jobs.
sinteractive (Recommended)¶
The simplest way to get a compute node:
# Basic: 1 core, debug partition, default time
sinteractive -A PAS1234
# Specify resources
sinteractive -A PAS1234 -c 4 -t 02:00:00
# With GPU
sinteractive -A PAS1234 -c 4 -g 1 -t 01:00:00
# On a specific cluster
sinteractive -A PAS1234 -c 4 -t 02:00:00 -p cpu
Replace PAS1234
Use your actual OSC project code. See Account Setup.
You'll see output like:
salloc: Pending job allocation 14269
salloc: job 14269 has been allocated resources
salloc: Granted job allocation 14269
[user@p0591 ~]$
Your prompt changes to show the compute node hostname (e.g., p0591). When done, type exit to release the node.
srun (Alternative)¶
# CPU-only
srun -p debug -c 4 --time=30:00 --account=PAS1234 --pty bash
# With GPU
srun -p gpu --gpus-per-node=1 --time=01:00:00 --account=PAS1234 --pty bash
Login Node vs. Compute Node¶
Your home directory (~/) is the same NFS mount from both login and compute nodes — same files, same paths, same permissions. You don't need to copy anything.
| Task | Where to Run | Why |
|---|---|---|
| Edit code, browse files, git | Login node | Lightweight, no allocation needed |
| Submit jobs (sbatch) | Login node | Just sends a request to SLURM |
| AI coding tools (Claude Code, etc.) | Login node | Bottleneck is network API latency, not local CPU |
| Run tests (pytest) | Compute node | Can use significant CPU/memory |
| Preprocessing scripts | Compute node | CPU-intensive, may run for minutes |
| quarto render, mkdocs build | Compute node | Builds can be CPU-heavy |
| Anything with a GPU | Compute node | GPUs only available on compute nodes |
Cost of interactive sessions
A 1-core interactive session for 2 hours costs 2 core-hours — roughly the same as a single core running for 2 hours in a batch job. A 4-core session for 2 hours costs 8 core-hours. For perspective, a typical 4-hour GPU training run on 4 cores costs 16 core-hours. Interactive sessions are cheap, but don't leave them idle — exit when you're done.
Creating Job Scripts¶
Anatomy of a SLURM Script¶
Every SLURM batch script has three sections:
#!/bin/bash # 1. Shebang line
#SBATCH --job-name=my_job # 2. SBATCH directives
#SBATCH --account=PAS1234
#SBATCH --time=02:00:00
module load python/3.12 # 3. Execution block
source .venv/bin/activate # uv (recommended) — or ~/venvs/myproject/bin/activate for pip+venv
python train.py
Replace PAS1234
PAS1234 is a placeholder. Use your actual OSC project code, found at my.osc.edu under your project list. See Account Setup for details.
Section 1 — Shebang line: Must be the very first line. Tells the system to use Bash.
Section 2 — SBATCH directives: Lines starting with #SBATCH configure job resources. They look like comments to Bash, but SLURM reads them.
Section 3 — Execution block: Everything after the directives is your actual script — module loading, environment activation, and commands.
Directives must come before any executable line
SLURM stops reading #SBATCH directives at the first non-comment, non-blank line. Any directive placed after an executable command (like echo or module load) is silently ignored.
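A minimal illustration of the pitfall (the job name and values are placeholders); the last directive below is never applied:
#!/bin/bash
#SBATCH --job-name=demo       # read by SLURM
#SBATCH --time=01:00:00       # read by SLURM
echo "setup"                  # first executable line: SLURM stops reading directives here
#SBATCH --mem=32G             # silently ignored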
SBATCH Directive Reference¶
Every #SBATCH line declares one resource or behavior. They are grouped below by function.
Identity & Accounting¶
| Directive | Description |
|---|---|
| --job-name=NAME | Label shown in squeue output. Keep it short and descriptive (e.g., train_exp03). Default: script filename. |
| --account=PAS1234 | Required. The OSC project allocation to charge. Find yours at my.osc.edu → Projects. |
Compute Resources¶
| Directive | Description |
|---|---|
| --nodes=N | Number of physical nodes. Use 1 for single-node jobs (the common case). Multi-node only needed for distributed training or Ray clusters. Default: 1. |
| --ntasks-per-node=N | Independent processes per node. For most Python ML jobs, leave at 1; parallelism comes from --cpus-per-task and --gpus-per-node instead. Use >1 only with srun/MPI or torchrun. Default: 1. |
| --cpus-per-task=N | CPU cores allocated to each task. Controls how many DataLoader workers, preprocessing threads, or parallel operations your job can run. Set num_workers in your DataLoader up to this value. |
| --gpus-per-node=N | GPUs allocated per node. Accepts a count (1, 2) or a type:count (v100:1, v100-32g:1). Only valid on GPU partitions. |
| --mem=SIZE | Total RAM per node (e.g., 32G, 64G). Mutually exclusive with --mem-per-cpu. Pitzer standard nodes have 192 GB; GPU nodes share this across up to 4 GPUs. |
| --mem-per-cpu=SIZE | RAM per CPU core (e.g., 4G). Useful when you want memory to scale with CPU count. Cannot combine with --mem. |
Choosing CPUs and memory for GPU jobs
The Clusters Overview recommends 4–8 CPU cores and 32–64 GB memory per GPU for single-GPU training. The CPUs feed data to the GPU via DataLoader workers. If nvidia-smi shows low GPU utilization, increase CPUs and num_workers first — the GPU is likely waiting on data (see PyTorch Performance Tuning Guide).
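A small sketch of what this looks like in Python (the toy dataset stands in for your own): read the allocated core count from SLURM's environment and pass it to the DataLoader instead of hard-coding it.
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset; replace with your real Dataset.
dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

# Size num_workers from the SLURM allocation so it tracks --cpus-per-task.
cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
loader = DataLoader(dataset, batch_size=64, num_workers=cpus, pin_memory=True)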
Time & Scheduling¶
| Directive | Description |
|---|---|
| --time=HH:MM:SS | Required. Maximum walltime. SLURM kills the job when this expires. Formats: 02:00:00 (2 hours), 2-12:00:00 (2 days 12 hours). Shorter requests get scheduled faster due to backfill scheduling. |
| --partition=NAME | Queue to submit to. Common: gpu (7-day max), cpu (CPU only), gpudebug/debug-cpu (1-hour max, high priority for testing). See Clusters Overview for full list. Default: cluster-dependent. |
Output & Logging¶
| Directive | Description |
|---|---|
| --output=PATH | File for stdout. Supports substitution: %j → job ID, %A → array job ID, %a → array task ID. Example: logs/train_%j.out. Default: slurm-%j.out in submit directory. |
| --error=PATH | File for stderr. Same substitutions as --output. If omitted, stderr merges into the --output file. |
Create the logs directory first
SLURM does not create parent directories for output files. If you use --output=logs/job_%j.out, run mkdir -p logs before submitting. If the directory doesn't exist, the job fails immediately and no output file is written.
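One way to make this hard to forget is to create the directory as part of the submit command:
mkdir -p logs && sbatch job_script.sh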
Notifications¶
| Directive | Description |
|---|---|
| --mail-type=EVENTS | When to send email. Values: BEGIN, END, FAIL, ALL. Comma-separate multiples: END,FAIL. |
| --mail-user=EMAIL | Destination address. Use your name.N@osu.edu address. |
Advanced Scheduling¶
| Directive | Description |
|---|---|
| --array=RANGE | Run a job array. 1-10 runs 10 jobs; 1-100%10 runs 100 with max 10 concurrent. Each job gets a unique $SLURM_ARRAY_TASK_ID. See Job Arrays. |
| --dependency=COND:ID | Wait for another job. afterok:12345 starts only if job 12345 succeeds. afterany:12345 starts regardless. See Job Dependencies. |
| --exclusive | Reserve the entire node; no sharing with other users. Useful for benchmarking or memory-sensitive workloads. Expensive: charges all node cores. |
| --constraint=FEATURE | Request nodes with a specific feature tag. Check available features with sinfo -o "%N %f". |
| --signal=B:SIG@TIME | Send a signal to the job before walltime expires. --signal=B:USR1@300 sends SIGUSR1 five minutes before timeout; used for graceful checkpointing. See Graceful Timeout Handling. |
| --requeue | Allow SLURM to requeue the job if the node fails. Combine with checkpoint-resume logic. |
Basic Job Script Template¶
#!/bin/bash
#SBATCH --job-name=my_job # Job name
#SBATCH --account=PAS1234 # Project account
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=1 # Tasks per node
#SBATCH --cpus-per-task=4 # CPUs per task
#SBATCH --time=02:00:00 # Time limit (HH:MM:SS)
#SBATCH --output=logs/job_%j.out # Standard output (%j = job ID)
#SBATCH --error=logs/job_%j.err # Standard error
#SBATCH --mail-type=END,FAIL # Email on END or FAIL
#SBATCH --mail-user=user@osu.edu # Email address
# Print job info
echo "Job started at: $(date)"
echo "Running on node: $(hostname)"
echo "Job ID: $SLURM_JOB_ID"
# Load modules
module load python/3.12
# Activate environment
source .venv/bin/activate # uv (recommended) — or ~/venvs/myproject/bin/activate for pip+venv
# Run your code
python train.py --epochs 100
# Print completion
echo "Job ended at: $(date)"
GPU Job Script¶
#!/bin/bash
#SBATCH --job-name=gpu_training
#SBATCH --account=PAS1234
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=1 # Number of GPUs
#SBATCH --time=04:00:00
#SBATCH --output=logs/gpu_job_%j.out
# Load modules
module load python/3.12
# module load cuda/12.4 # Only needed for custom CUDA extensions — PyPI torch bundles CUDA
# Activate environment
source ~/venvs/pytorch/bin/activate
# Set environment variables
export CUDA_VISIBLE_DEVICES=0
# Verify GPU
nvidia-smi
# Run training
python train.py --device cuda --epochs 100
Multi-GPU Job Script¶
#!/bin/bash
#SBATCH --job-name=multi_gpu
#SBATCH --account=PAS1234
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4 # Use 4 GPUs
#SBATCH --time=08:00:00
#SBATCH --output=logs/multi_gpu_%j.out
module load python/3.12
# module load cuda/12.4 # Only needed for custom CUDA extensions — PyPI torch bundles CUDA
source ~/venvs/pytorch/bin/activate
# Run with PyTorch DDP (torchrun replaces the deprecated torch.distributed.launch)
torchrun --nproc_per_node=4 train.py --distributed
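On the Python side, train.py needs the usual DDP setup; a minimal sketch (torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process it spawns):
import os
import torch
import torch.distributed as dist

# Join the process group and bind this process to its own GPU.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Then wrap your model, e.g.:
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])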
Common Job Patterns¶
Copy-paste recipes for the most frequent lab workloads. Click a pattern to expand its full job script.
CPU-Only Data Processing
For data preprocessing, feature extraction, or file conversion jobs that don't need a GPU:
#!/bin/bash
#SBATCH --job-name=data_preprocess
#SBATCH --account=PAS1234
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=logs/preprocess_%j.out
module load python/3.12
source .venv/bin/activate # uv (recommended) — or ~/venvs/myproject/bin/activate for pip+venv
# Use all allocated CPUs
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
python preprocess.py \
--input-dir data/raw/ \
--output-dir data/processed/ \
--workers $SLURM_CPUS_PER_TASK
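Inside preprocess.py, a --workers value like this typically drives a process pool; a minimal sketch, with process_file standing in for your real per-file work:
import os
from multiprocessing import Pool
from pathlib import Path

def process_file(path: Path) -> str:
    # Placeholder for your real per-file processing.
    return path.name

if __name__ == "__main__":
    workers = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
    files = list(Path("data/raw").glob("*"))
    with Pool(workers) as pool:
        results = pool.map(process_file, files)
    print(f"Processed {len(results)} files with {workers} workers")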
Checkpoint-Resume Pattern
For long training jobs that may hit walltime limits or need to recover from failures:
#!/bin/bash
#SBATCH --job-name=train_resume
#SBATCH --account=PAS1234
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=24:00:00
#SBATCH --output=logs/train_%j.out
module load python/3.12
# module load cuda/12.4 # Only needed for custom CUDA extensions — PyPI torch bundles CUDA
source ~/venvs/pytorch/bin/activate
# Automatically resume from latest checkpoint if one exists
CHECKPOINT_DIR="checkpoints/"
LATEST=$(ls -t ${CHECKPOINT_DIR}/*.pt 2>/dev/null | head -1)
if [ -n "$LATEST" ]; then
echo "Resuming from checkpoint: $LATEST"
python train.py --resume "$LATEST"
else
echo "Starting fresh training"
python train.py
fi
Your Python training script should save checkpoints periodically:
# In your training loop
for epoch in range(start_epoch, num_epochs):
train_one_epoch(model, dataloader, optimizer)
# Save checkpoint every 5 epochs
if epoch % 5 == 0:
torch.save({
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
}, f'checkpoints/checkpoint_epoch_{epoch}.pt')
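The --resume flag the job script passes is your own argument, not a SLURM feature. A minimal sketch of the matching load logic (model and optimizer here are stand-ins for your real objects):
import argparse
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                        # stand-in for your model
optimizer = torch.optim.Adam(model.parameters())

parser = argparse.ArgumentParser()
parser.add_argument("--resume", default=None)
args = parser.parse_args()

start_epoch = 0
if args.resume:
    ckpt = torch.load(args.resume, map_location="cpu")
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    start_epoch = ckpt["epoch"] + 1             # continue from the next epoch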
Graceful Timeout Handling
When a job hits its walltime, SLURM sends SIGTERM followed by SIGKILL after a short grace period (SLURM docs) — any in-flight training step is lost. Use --signal to get advance warning and save a checkpoint before the kill:
#!/bin/bash
#SBATCH --job-name=train_graceful
#SBATCH --account=PAS1234
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1
#SBATCH --time=08:00:00
#SBATCH --signal=B:USR1@300 # Send USR1 five minutes before timeout
#SBATCH --output=logs/train_%j.out
source .venv/bin/activate
python train.py --epochs 500 --checkpoint-dir checkpoints/
In your Python code, catch the signal and save state:
import signal
import sys
def handle_timeout(signum, frame):
"""Save checkpoint when SLURM sends USR1 before walltime."""
print("Received USR1 — saving checkpoint before timeout")
save_checkpoint(model, optimizer, epoch, "checkpoints/timeout_ckpt.pt")
sys.exit(0)
signal.signal(signal.SIGUSR1, handle_timeout)
PyTorch Lightning handles this automatically
If you use PyTorch Lightning, the SLURMEnvironment plugin catches SIGUSR1 and triggers checkpoint saving — no manual signal handling needed. Just add --signal=B:USR1@300 to your SBATCH header.
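A minimal sketch, assuming a recent Lightning release where the plugin lives under lightning.pytorch.plugins.environments:
import signal
from lightning.pytorch import Trainer
from lightning.pytorch.plugins.environments import SLURMEnvironment

# Tell Lightning which signal SLURM will send before walltime expires.
trainer = Trainer(
    plugins=[SLURMEnvironment(requeue_signal=signal.SIGUSR1)],
    default_root_dir="checkpoints/",
)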
Data Staging for I/O-Heavy Jobs
OSC has three storage tiers with different performance characteristics. Staging data to faster storage before training reduces I/O bottleneck:
Home (NFS, permanent) → Slow random reads, limited quota
↓ rsync
Scratch (GPFS, 60-day purge) → Fast parallel I/O, large quota
↓ cp
$TMPDIR (local disk, job-only) → Fastest, but ephemeral — deleted when job ends
For training jobs that read many small files (e.g., image datasets, graph tensors), copy data to $TMPDIR at job start:
#!/bin/bash
#SBATCH --job-name=train_staged
#SBATCH --account=PAS1234
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1
#SBATCH --time=08:00:00
#SBATCH --output=logs/train_%j.out
source .venv/bin/activate
# Stage data: scratch → local SSD
SCRATCH_DATA="/fs/scratch/PAS1234/$USER/datasets/my_data"
LOCAL_DATA="$TMPDIR/my_data"
if [ -d "$SCRATCH_DATA" ]; then
echo "Staging data to local SSD..."
cp -r "$SCRATCH_DATA" "$LOCAL_DATA"
DATA_ROOT="$LOCAL_DATA"
else
echo "Using scratch directly"
DATA_ROOT="$SCRATCH_DATA"
fi
python train.py --data-root "$DATA_ROOT"
When to stage data
- Consider staging if your dataset is many small files (images, .pt graph tensors): NFS metadata operations are a common bottleneck for small-file workloads.
- Skip staging if your data is a few large files (Parquet, HDF5): GPFS is optimized for sequential reads.
- Check $TMPDIR size: local disk capacity varies by node. Use df -h $TMPDIR at job start to verify (see the sketch below).
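A sketch of that check, assuming you know roughly how large your dataset is (NEEDED_GB is a made-up figure; SCRATCH_DATA is the variable from the script above):
# Fall back to scratch if the node-local disk is too small for the dataset
NEEDED_GB=50                                              # adjust to your dataset size
AVAIL_GB=$(df -BG --output=avail "$TMPDIR" | tail -1 | tr -dc '0-9')
if [ "$AVAIL_GB" -lt "$NEEDED_GB" ]; then
    echo "Only ${AVAIL_GB} GB free in \$TMPDIR; reading from scratch instead"
    DATA_ROOT="$SCRATCH_DATA"
else
    cp -r "$SCRATCH_DATA" "$TMPDIR/my_data"
    DATA_ROOT="$TMPDIR/my_data"
fi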
CUDA Memory Configuration
For GPU training jobs, set the PyTorch CUDA memory allocator to avoid fragmentation-related OOM errors:
# Add to your job script, before python
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,garbage_collection_threshold:0.8
| Setting | What It Does |
|---|---|
expandable_segments:True | Allocates GPU memory in growable segments instead of fixed blocks. Reduces fragmentation when tensors vary in size. See PyTorch CUDA memory management. |
garbage_collection_threshold:0.8 | Triggers CUDA garbage collection when the ratio of allocated memory to reserved memory drops below this threshold (i.e., when fragmentation is high). Default is 0.0 (disabled). |
When to use this
Recommended for models with variable-size inputs (graph neural networks, NLP with dynamic padding) where tensor sizes change between iterations, causing memory fragmentation.
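The variable is read when PyTorch's CUDA allocator initializes, so set it in the job script before launching Python. A quick way to confirm it took effect inside your training script:
import os
import torch

# Confirm the allocator config made it into the job environment.
print("PYTORCH_CUDA_ALLOC_CONF =", os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "<not set>"))

if torch.cuda.is_available():
    _ = torch.randn(1024, 1024, device="cuda")   # force an allocation
    print(torch.cuda.memory_summary(abbreviated=True))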
Long-Running Job with Email Alerts
Get notified when important jobs start, finish, or fail:
#!/bin/bash
#SBATCH --job-name=long_training
#SBATCH --account=PAS1234
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1
#SBATCH --time=48:00:00
#SBATCH --output=logs/long_train_%j.out
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=name.1@osu.edu
module load python/3.12
# module load cuda/12.4 # Only needed for custom CUDA extensions — PyPI torch bundles CUDA
source ~/venvs/pytorch/bin/activate
echo "Training started at $(date) on $(hostname)"
python train.py --config configs/full_training.yaml
echo "Training finished at $(date)"
Partitions (Queues)¶
For partition details (time limits, GPU availability, node counts), see the Clusters Overview.
Resource Requests¶
CPUs and Memory¶
# Request 8 CPUs
#SBATCH --cpus-per-task=8
# Request 32 GB memory
#SBATCH --mem=32G
# Request memory per CPU
#SBATCH --mem-per-cpu=4G
GPUs¶
# Request 1 GPU (any type)
#SBATCH --gpus-per-node=1
# Request specific GPU type (Pitzer)
#SBATCH --gpus-per-node=v100:1 # V100 16 GB (gpu partition)
#SBATCH --gpus-per-node=v100-32g:1 # V100 32 GB (gpu-exp partition)
# Request multiple GPUs
#SBATCH --gpus-per-node=2
Time Limits¶
# Format: HH:MM:SS
#SBATCH --time=00:30:00 # 30 minutes
#SBATCH --time=02:00:00 # 2 hours
#SBATCH --time=24:00:00 # 24 hours
# Or use days-hours format
#SBATCH --time=2-12:00:00 # 2 days, 12 hours
Job Arrays¶
Run multiple similar jobs efficiently:
Basic Job Array¶
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --account=PAS1234
#SBATCH --array=1-10 # Run 10 jobs
#SBATCH --time=01:00:00
#SBATCH --output=logs/job_%A_%a.out # %A = array ID, %a = task ID
# Use array task ID
python train.py --seed $SLURM_ARRAY_TASK_ID
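Inside train.py, the seed can be consumed like this (a small sketch; the --seed argument name matches the call above):
import argparse
import random

import numpy as np
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=0)
args = parser.parse_args()

# Seed every RNG you rely on so each array task is reproducible.
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)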
Advanced Job Array¶
#!/bin/bash
#SBATCH --array=1-100%10 # 100 jobs, max 10 concurrent
# Define parameters for each task (toy grid: 3 x 3 = 9 combinations; size the --array range to your real grid)
learning_rates=(0.001 0.01 0.1)
batch_sizes=(16 32 64)
# Get parameters for this task
idx=$SLURM_ARRAY_TASK_ID
lr=${learning_rates[$((idx % 3))]}
bs=${batch_sizes[$((idx / 3 % 3))]}
python train.py --lr $lr --batch-size $bs
Job Dependencies¶
Chain jobs together:
# Submit first job
job1=$(sbatch --parsable job1.sh)
# Submit second job after first completes
sbatch --dependency=afterok:$job1 job2.sh
# Submit after job completes (success or failure)
sbatch --dependency=afterany:$job1 job3.sh
# Submit after multiple jobs complete
sbatch --dependency=afterok:$job1:$job2 job4.sh
Consider a pipeline orchestrator for complex pipelines
If you have multi-step pipelines with many dependencies, Ray can manage task scheduling, dependency tracking, and fault tolerance from Python. See Pipeline Orchestration.
Monitoring Jobs¶
Check Job Status¶
# List your jobs
squeue -u $USER
# Detailed view
squeue -u $USER --format="%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"
# Watch job status
watch -n 10 squeue -u $USER
View Job Details¶
# Current job info
scontrol show job <job_id>
# Job accounting info (after completion)
sacct -j <job_id> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS
# Job efficiency
seff <job_id>
Monitor Running Job¶
# SSH to compute node
squeue -u $USER # Get node name
ssh <nodename> # e.g., ssh p0123
# Monitor resources
top
htop
nvidia-smi # For GPU jobs
View Job Output¶
# Tail output file while job runs
tail -f logs/job_12345.out
# Follow with automatic refresh
watch -n 5 tail -20 logs/job_12345.out
Environment Variables¶
SLURM provides useful environment variables:
# In your job script
echo "Job ID: $SLURM_JOB_ID"
echo "Job name: $SLURM_JOB_NAME"
echo "Node list: $SLURM_JOB_NODELIST"
echo "Number of nodes: $SLURM_JOB_NUM_NODES"
echo "CPUs per task: $SLURM_CPUS_PER_TASK"
echo "Array task ID: $SLURM_ARRAY_TASK_ID"
echo "Working directory: $SLURM_SUBMIT_DIR"
Use in Python:
import os
job_id = os.environ.get('SLURM_JOB_ID')
task_id = os.environ.get('SLURM_ARRAY_TASK_ID', '0')
Best Practices¶
- Test with the debug partition first: use --partition=gpudebug (1-hour max, high priority) before submitting long jobs.
- Don't over-request resources: request only the CPUs, memory, and time you need. Over-requesting wastes allocation and increases queue wait time.
- Organize output files: mkdir -p logs and use --output=logs/job_%j.out. SLURM won't create directories for you.
- Check job efficiency after completion: run seff <job_id> and aim for >80% CPU efficiency and >50% GPU utilization.
- Set PYTORCH_CUDA_ALLOC_CONF: add export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to every GPU job. See CUDA Memory Configuration.
- Use --signal for long training runs: --signal=B:USR1@300 gives you 5 minutes to checkpoint before timeout. See Graceful Timeout Handling.
- Stage data for I/O-heavy jobs: copy many-small-files datasets to $TMPDIR at job start. See Data Staging.
Troubleshooting¶
Common failure modes and their fixes. Click an entry to expand.
Job Pending Forever
# Check reason
squeue -u $USER
# Common reasons and solutions:
# - QOSMaxGRESPerUser: Too many GPU jobs running
# - ReqNodeNotAvail: Maintenance window soon
# - Resources: Requesting too many resources
# - Priority: Other jobs have higher priority
Solution: Reduce resources or wait.
Job Fails Immediately
# Check the exit code and state after the failure
sacct -j <job_id> --format=JobID,State,ExitCode,Elapsed
# Common causes:
# - --output/--error points to a directory that doesn't exist (run mkdir -p logs first)
# - Wrong --account or --partition value
# - Environment not set up (missing module load or virtual environment activation)
GPU Not Detected
# Verify GPU requested
#SBATCH --gpus-per-node=1
# Verify GPU is visible
nvidia-smi
# Check in Python
python -c "import torch; print(torch.cuda.is_available())"
Note
If you installed PyTorch from PyPI, you do not need module load cuda. PyPI wheels bundle CUDA. Only load a CUDA module if you are compiling custom CUDA extensions.
Example Workflows¶
Hyperparameter Search¶
#!/bin/bash
#SBATCH --job-name=hyperparam_search
#SBATCH --account=PAS1234
#SBATCH --array=1-27
#SBATCH --output=logs/hp_%A_%a.out
#SBATCH --gpus-per-node=1
#SBATCH --time=04:00:00
# Define hyperparameter grid
lrs=(0.001 0.01 0.1)
batch_sizes=(16 32 64)
dropouts=(0.1 0.3 0.5)
# Map array task ID to hyperparameters
idx=$SLURM_ARRAY_TASK_ID
lr_idx=$((idx % 3))
bs_idx=$(((idx / 3) % 3))
dropout_idx=$(((idx / 9) % 3))
lr=${lrs[$lr_idx]}
bs=${batch_sizes[$bs_idx]}
dropout=${dropouts[$dropout_idx]}
# Run training
python train.py \
--lr $lr \
--batch-size $bs \
--dropout $dropout \
--experiment-name "hp_search_${SLURM_ARRAY_TASK_ID}"
Multi-Stage Pipeline¶
# Stage 1: Data preprocessing
job1=$(sbatch --parsable preprocess.sh)
# Stage 2: Training (after preprocessing)
job2=$(sbatch --dependency=afterok:$job1 --parsable train.sh)
# Stage 3: Evaluation (after training)
sbatch --dependency=afterok:$job2 evaluate.sh
Next Steps¶
- Learn Environment Management
- Set up PyTorch on OSC
- Automate pipelines with Pipeline Orchestration