PyTorch & GPU Setup¶
Everything you need to install PyTorch, request GPUs, and train efficiently on OSC.
Prerequisites¶
- OSC account with GPU access
- SSH connection configured
- Basic familiarity with Python and PyTorch
Quick Setup¶
With uv:
# 1. Create venv with OSC's system Python (never use uv-managed Python on OSC)
uv venv --python /apps/python/3.12/bin/python3
# 2. Activate
source .venv/bin/activate
# 3. Install PyTorch (PyPI wheels bundle CUDA — no module load cuda needed)
uv add "torch>=2.8.0,<2.9" torchvision torchaudio
# 4. Install common ML packages
uv add numpy pandas matplotlib scikit-learn jupyter tensorboard
With venv + pip:
# 1. Load Python module
module load python/3.12
# 2. Create virtual environment
python -m venv ~/venvs/pytorch
# 3. Activate
source ~/venvs/pytorch/bin/activate
# 4. Install PyTorch (PyPI wheels bundle CUDA — no --index-url needed)
pip install --upgrade pip
pip install "torch>=2.8.0,<2.9" torchvision torchaudio
# 5. Install common ML packages
pip install numpy pandas matplotlib scikit-learn jupyter tensorboard
No module load cuda needed with PyPI torch
PyTorch wheels on PyPI bundle their own NVIDIA libraries. You do not need module load cuda or --index-url pytorch wheel URLs. Only load a CUDA module if you are compiling custom CUDA extensions or using venv/pip with the legacy pytorch wheel index.
Version Constraint Triangle (PyTorch + PyG + CUDA)
If you plan to use PyTorch Geometric (PyG), there is a three-way version coupling between PyTorch, PyG extension wheels, and CUDA. PyPI may ship torch 2.10+ but PyG only has compiled wheels up to torch 2.8.0. Installing mismatched versions compiles fine but segfaults at runtime (silent C++ ABI mismatch). Always check data.pyg.org/whl/ for the latest torch version supported by PyG before upgrading. See PyG Setup for full details.
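Before upgrading, you can print the installed torch and CUDA versions you would match against PyG's wheel index (a small sketch; the wheel-index filename shown in the comment is illustrative):

```python
import torch

# PyG wheel indexes are keyed by torch version + CUDA version,
# e.g. torch-2.8.0+cu126 (illustrative; check data.pyg.org/whl/)
print(torch.__version__)   # e.g. "2.8.0+cu126"
print(torch.version.cuda)  # e.g. "12.6" (None on CPU-only builds)
```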
Detailed Setup¶
Step 1: Create Virtual Environment¶
Follow the uv or module-based venv commands from Quick Setup above.
Step 2: Install PyTorch¶
PyTorch wheels on PyPI now bundle CUDA libraries. No --index-url or module load cuda is needed.
# uv
uv add "torch>=2.8.0,<2.9" torchvision torchaudio
# pip
pip install "torch>=2.8.0,<2.9" torchvision torchaudio
Step 3: Install Additional Packages¶
For graph neural network libraries (PyTorch Geometric), see the PyG Setup Guide.
# Core scientific packages
pip install numpy pandas matplotlib seaborn
# Machine learning utilities
pip install scikit-learn scipy
# Deep learning tools
pip install tensorboard wandb
# Jupyter for notebooks
pip install jupyter ipykernel
# Computer vision
pip install opencv-python pillow
# NLP (if needed)
pip install transformers datasets tokenizers
# Optimization and monitoring
pip install tqdm pytorch-lightning
Step 4: Verify Installation¶
Create test_pytorch.py:
import torch
import torchvision
print("=" * 50)
print("PyTorch Installation Test")
print("=" * 50)
print(f"\nPyTorch version: {torch.__version__}")
print(f"Torchvision version: {torchvision.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
    # Test tensor operations
    print("\nTesting GPU operations...")
    x = torch.rand(5, 3)
    print(f"CPU tensor shape: {x.shape}")
    x_gpu = x.cuda()
    print(f"GPU tensor device: {x_gpu.device}")
    print("GPU operations working!")
else:
    print("\nCUDA not available. Running on CPU.")
print("\n" + "=" * 50)
Test on GPU node:
# Request GPU node
srun -p gpu --gpus-per-node=1 --time=00:10:00 --pty bash
# Activate environment
source .venv/bin/activate # uv
# or: source ~/venvs/pytorch/bin/activate # venv + pip
# Run test
python test_pytorch.py
# Exit
exit
Requesting GPUs¶
Available GPU Types¶
Pitzer (Current)¶
| GPU Model | Memory | Partition | Nodes | Best For |
|---|---|---|---|---|
| NVIDIA V100 | 16 GB | gpu | 32 | General training |
| NVIDIA V100 | 32 GB | gpu-exp | 42 | Larger models, bigger batches |
| NVIDIA V100 (×4, NVLink) | 32 GB | gpu-quad | 4 | Multi-GPU training |
Ascend / Cardinal (Newer — may require separate allocation)¶
| GPU Model | Memory | Best For |
|---|---|---|
| NVIDIA A100 | 40 / 80 GB | Transformers, large-model training |
| NVIDIA H100 | 94 GB | Latest generation, highest throughput |
See the Clusters Overview for full specs and access details.
Interactive GPU Session¶
# Request any available GPU on Pitzer
srun -p gpu --gpus-per-node=1 --time=01:00:00 --pty bash
# Request V100 32 GB specifically
srun -p gpu-exp --gpus-per-node=1 --time=01:00:00 --pty bash
# Request multiple GPUs (quad nodes)
srun -p gpu-quad --gpus-per-node=2 --time=01:00:00 --pty bash
# With more CPUs and memory
srun -p gpu --gpus-per-node=1 --cpus-per-task=8 --mem=64G --time=02:00:00 --pty bash
Batch Job¶
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1 # Any available GPU
# #SBATCH --gpus-per-node=v100:1 # OR: request specific GPU type
#SBATCH --cpus-per-task=4 # CPUs (for data loading)
#SBATCH --mem=32G # Memory
#SBATCH --time=08:00:00
# Your GPU job commands
Monitoring GPUs¶
Using nvidia-smi¶
# Basic GPU info
nvidia-smi
# Continuous monitoring
watch -n 1 nvidia-smi
# Specific GPU
nvidia-smi -i 0
# Show processes
nvidia-smi pmon
# Detailed query
nvidia-smi --query-gpu=index,name,memory.total,memory.used,memory.free,utilization.gpu --format=csv
Using Python¶
import torch
# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
# GPU information
print(f"Number of GPUs: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
    print(f"  Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.2f} GB")
# Memory usage
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
GPU Selection¶
# Use specific GPU
export CUDA_VISIBLE_DEVICES=0
# Use multiple GPUs
export CUDA_VISIBLE_DEVICES=0,1
# In Python (must be set before torch initializes CUDA, or it has no effect)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
# Or use device parameter
device = torch.device('cuda:0')
Performance¶
Data Loading Optimization¶
from torch.utils.data import DataLoader
# Use multiple workers (match CPUs requested)
train_loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,            # Match --cpus-per-task
    pin_memory=True,          # Faster GPU transfer
    prefetch_factor=2,        # Prefetch batches
    persistent_workers=True   # Keep workers alive
)
Mixed Precision Training¶
Mixed precision uses FP16 where possible, saving memory and speeding up training.
# torch.cuda.amp is deprecated since PyTorch 2.4 — use torch.amp
from torch.amp import autocast, GradScaler

scaler = GradScaler("cuda")
for epoch in range(num_epochs):
    for data, target in train_loader:
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        # Forward pass in mixed precision
        with autocast("cuda"):
            output = model(data)
            loss = criterion(output, target)
        # Backward pass with scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
Mixed Precision Gotchas
FP16 has a much smaller dynamic range than FP32 (max ±65504). Two common traps (see PyTorch AMP docs):
- Intermediate overflow — Operations that accumulate large values (reductions, layer norms, custom statistics) can overflow FP16, producing inf or NaN. PyTorch's autocast keeps some ops in FP32 automatically, but custom operations may need explicit .float() casts.
- Loss scaling failures — If GradScaler repeatedly skips optimizer steps (logged as "Found inf/nan in gradients"), your loss magnitude may exceed FP16 range. The default init_scale is 2**16 (GradScaler docs); lowering it (e.g., 2**10) can help stabilize early training.
CUDA Memory Allocator¶
For models with variable-size inputs (GNNs, NLP with dynamic padding), set the CUDA memory allocator to reduce fragmentation:
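One commonly used setting is the expandable-segments allocator (a sketch; see the PyTorch CUDA semantics notes for other PYTORCH_CUDA_ALLOC_CONF options):

```shell
# Set before launching training so the allocator uses expandable segments,
# which reduces fragmentation with variable-size inputs
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```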
See the Job Submission Guide for a full explanation of allocator settings.
Multiprocessing with CUDA¶
PyTorch with CUDA requires spawn (not fork) for multiprocessing when child processes use the GPU. Forking after CUDA initialization corrupts GPU state — see PyTorch multiprocessing notes.
# At the top of your main script
import torch.multiprocessing as mp
mp.set_start_method("spawn", force=True)
If you use PyTorch Lightning, set it in the Trainer:
# In your config YAML
trainer:
  strategy:
    class_path: pytorch_lightning.strategies.DDPStrategy
    init_args:
      start_method: spawn
DataLoader workers are different
DataLoader(num_workers=N) uses fork by default. This is generally safe because worker processes typically don't initialize CUDA themselves — they load data on CPU. The spawn requirement applies when you explicitly create processes that use GPUs (e.g., distributed training, Ray actors with GPU resources).
Gradient Accumulation¶
Simulate larger batch sizes without more GPU memory:
accumulation_steps = 4  # Effective batch size = batch_size * 4
for i, (data, target) in enumerate(train_loader):
    data, target = data.cuda(), target.cuda()
    # Forward pass
    output = model(data)
    loss = criterion(output, target)
    # Scale loss and backward
    loss = loss / accumulation_steps
    loss.backward()
    # Update weights every accumulation_steps
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
torch.compile() (PyTorch 2.0+)¶
torch.compile() JIT-compiles your model for faster execution with minimal code changes.
import torch
model = MyModel().cuda()
# Basic usage — tries the best available backend
model = torch.compile(model)
# Specify backend explicitly
model = torch.compile(model, backend="inductor") # Default, good general choice
# Max performance (longer compile time, best runtime)
model = torch.compile(model, mode="max-autotune")
Requires PyTorch 2.0+
torch.compile() is only available in PyTorch 2.0 and later. Check your version with torch.__version__. If you're using an older version, upgrade with pip install --upgrade "torch>=2.8.0,<2.9" (or the equivalent uv add command).
V100 and newer GPUs benefit most
torch.compile() with mode="max-autotune" leverages GPU-specific optimizations. V100s on Pitzer support this; A100s on Ascend (if you have access) benefit even more from TF32 tensor cores.
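Building on the note above, TF32 matmuls can also be enabled explicitly (a small sketch; the call is safe on any device but only takes effect on Ampere and newer GPUs — V100 lacks TF32 units):

```python
import torch

# Allow TF32 for float32 matmuls on Ampere+ (A100/H100); trades a little
# precision for large tensor-core speedups
torch.set_float32_matmul_precision("high")
print(torch.get_float32_matmul_precision())  # → high
```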
Gradient Checkpointing
Save memory by recomputing activations during backward pass:
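A minimal sketch using torch.utils.checkpoint.checkpoint_sequential (the layer sizes here are illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
)
x = torch.randn(8, 64, requires_grad=True)

# Split the model into 2 segments; activations inside each segment are
# recomputed during the backward pass instead of being stored
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)  # gradients flow as usual
```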
Profiling with PyTorch Profiler
import torch.profiler
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for i in range(10):
        output = model(input)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

# Print results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Save trace for visualization
prof.export_chrome_trace("trace.json")
# View at chrome://tracing
Multi-GPU Training¶
DataParallel (Simple, Single Node)¶
import torch.nn as nn
model = MyModel()
# Wrap model for multi-GPU
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)
model = model.cuda()
# Train as usual
for data, target in train_loader:
    output = model(data.cuda())  # Automatically distributed
    loss = criterion(output, target.cuda())
    loss.backward()
    optimizer.step()
Job script:
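A minimal batch script for the snippet above (the account line is omitted and train_dp.py is a hypothetical script name):

```shell
#!/bin/bash
#SBATCH --job-name=dp_train
#SBATCH --partition=gpu-quad
#SBATCH --gpus-per-node=2
#SBATCH --time=02:00:00

source .venv/bin/activate
# Single process; nn.DataParallel spreads each batch across visible GPUs
python train_dp.py
```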
DistributedDataParallel (Recommended)¶
More efficient than DataParallel:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)
    # Create model and move to GPU
    model = MyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.Adam(ddp_model.parameters())
    # Use DistributedSampler
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=64)
    # Training loop
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for data, target in dataloader:
            data, target = data.to(rank), target.to(rank)
            output = ddp_model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    cleanup()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size)
Job script (the script above launches its own workers via mp.spawn, so run it with plain python — combining it with torchrun would spawn processes twice):
#!/bin/bash
#SBATCH --gpus-per-node=4
python train_ddp.py
# Alternative: torchrun (replaces the deprecated torch.distributed.launch).
# Use it only if you drop the mp.spawn call and read RANK/WORLD_SIZE from the
# environment instead:
# torchrun --nproc_per_node=4 train_ddp.py
Memory Management¶
Check Memory Usage¶
import torch
# Current GPU memory usage
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
# Peak memory usage
print(f"Max allocated: {torch.cuda.max_memory_allocated(0) / 1e9:.2f} GB")
# Detailed memory summary
print(torch.cuda.memory_summary(device=0, abbreviated=False))
Clear GPU Memory¶
# Clear cache
torch.cuda.empty_cache()
# Delete tensors explicitly
del large_tensor
torch.cuda.empty_cache()
# Move to CPU and delete
large_tensor = large_tensor.cpu()
del large_tensor
torch.cuda.empty_cache()
Memory-Efficient Practices¶
# 1. Use in-place operations
x.add_(y) # Instead of x = x + y
# 2. Use torch.no_grad() for inference
with torch.no_grad():
output = model(input)
# 3. Clear gradients efficiently
optimizer.zero_grad(set_to_none=True) # More memory efficient
# 4. Set memory fraction
torch.cuda.set_per_process_memory_fraction(0.8, device=0)
Using PyTorch in Jobs¶
Batch Job Script¶
Create pytorch_job.sh:
#!/bin/bash
#SBATCH --job-name=pytorch_train
#SBATCH --account=PAS1234
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --time=04:00:00
#SBATCH --output=logs/train_%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your.email@osu.edu
# Print job info
echo "Job started at: $(date)"
echo "Running on node: $(hostname)"
echo "Job ID: $SLURM_JOB_ID"
# Activate environment (PyPI torch bundles CUDA — no module load cuda needed)
source .venv/bin/activate
# Verify GPU
nvidia-smi
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
# Run training
python train.py \
--data-path /fs/scratch/PAS1234/$USER/data \
--epochs 100 \
--batch-size 64 \
--lr 0.001 \
--device cuda
echo "Job completed at: $(date)"
Submit:
sbatch pytorch_job.sh
Checkpointing¶
Save Checkpoints¶
def save_checkpoint(model, optimizer, epoch, loss, path):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, path)

# Save every N epochs
if epoch % 10 == 0:
    save_checkpoint(
        model, optimizer, epoch, loss,
        f'checkpoints/epoch_{epoch}.pth'
    )
# Save best model
if loss < best_loss:
    save_checkpoint(
        model, optimizer, epoch, loss,
        'checkpoints/best_model.pth'
    )
Load Checkpoints¶
def load_checkpoint(model, optimizer, path):
    # weights_only=False is required to load full checkpoints (optimizer
    # state, scalars) on torch >= 2.6, where weights_only defaults to True
    checkpoint = torch.load(path, weights_only=False)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    return checkpoint['epoch'], checkpoint['loss']

# Resume training
if os.path.exists('checkpoints/best_model.pth'):
    epoch, loss = load_checkpoint(model, optimizer, 'checkpoints/best_model.pth')
    print(f"Resumed from epoch {epoch}")
Troubleshooting¶
Common failure modes and their fixes. Click an entry to expand.
CUDA Out of Memory
Symptoms: the job crashes with torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate ... (older PyTorch versions report it as a RuntimeError).
Solutions:
- Reduce batch size
- Use gradient accumulation
- Use mixed precision
- Clear cache (torch.cuda.empty_cache())
- Use gradient checkpointing
- Reduce model size
CUDA Not Available
Checks:
# 1. Verify you're on a GPU node (not a login node)
squeue -u $USER
# 2. Check node has GPU
nvidia-smi
# 3. Check PyTorch sees CUDA
python -c "import torch; print(torch.cuda.is_available())"
# 4. If still failing, reinstall PyTorch
pip uninstall torch torchvision torchaudio
pip install "torch>=2.8.0,<2.9" torchvision torchaudio
You do NOT need module load cuda for PyPI torch
If you installed PyTorch from PyPI (via pip install or uv add), the wheels bundle their own CUDA libraries. module load cuda is only needed if you are compiling custom CUDA extensions (e.g. custom C++/CUDA kernels). If torch.cuda.is_available() returns False, the most common cause is running on a login node instead of a GPU compute node.
Slow Training
Common issues:
- Too few data loader workers
- Not using pin_memory
- Not using mixed precision
- CPU-GPU transfer bottleneck
Diagnose:
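One quick check is to time the DataLoader alone with different worker counts (a sketch using a synthetic dataset; substitute your own). If more workers cut the time substantially, data loading was the bottleneck:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

def time_loader(num_workers: int) -> int:
    # Synthetic stand-in for a real dataset
    dataset = TensorDataset(torch.randn(2048, 32), torch.randint(0, 10, (2048,)))
    loader = DataLoader(dataset, batch_size=64, num_workers=num_workers,
                        pin_memory=torch.cuda.is_available())
    start = time.perf_counter()
    n = sum(1 for _ in loader)  # iterate the whole loader, counting batches
    print(f"num_workers={num_workers}: {n} batches in "
          f"{time.perf_counter() - start:.3f}s")
    return n

if __name__ == "__main__":
    for workers in (0, 2):
        time_loader(workers)
```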
Module Import Errors
Best Practices¶
- Always use virtual environments
- Test on GPU node before batch submission
- Save checkpoints regularly
- Use mixed precision for faster training
- Monitor GPU usage with nvidia-smi
- Don't over-request GPUs you won't use
- Use appropriate batch size for your GPU memory
- Pin memory and use multiple workers for data loading
- Profile before optimizing — find actual bottlenecks
- Document your environment in requirements.txt
Next Steps¶
- Read ML Workflow Guide
- Review Job Submission
Resources¶
- PyTorch Performance Tuning Guide — detailed optimization recipes that are otherwise hard to find