PyTorch & GPU Setup¶
Everything you need to install PyTorch, request GPUs, and train efficiently on OSC.
Prerequisites¶
- OSC account with GPU access
- SSH connection configured
- Basic familiarity with Python and PyTorch
Quick Setup¶
Option A (uv):
# 1. Create venv with OSC's system Python (never use uv-managed Python on OSC)
uv venv --python /apps/python/3.12/bin/python3
# 2. Activate
source .venv/bin/activate
# 3. Install PyTorch (PyPI wheels bundle CUDA — no module load cuda needed)
uv add "torch>=2.8.0,<2.9" torchvision torchaudio
# 4. Install common ML packages
uv add numpy pandas matplotlib scikit-learn jupyter tensorboard
Option B (venv + pip):
# 1. Load Python module
module load python/3.12
# 2. Create virtual environment
python -m venv ~/venvs/pytorch
# 3. Activate
source ~/venvs/pytorch/bin/activate
# 4. Install PyTorch (PyPI wheels bundle CUDA — no --index-url needed)
pip install --upgrade pip
pip install "torch>=2.8.0,<2.9" torchvision torchaudio
# 5. Install common ML packages
pip install numpy pandas matplotlib scikit-learn jupyter tensorboard
No module load cuda needed with PyPI torch
PyTorch wheels on PyPI bundle their own NVIDIA libraries. You do not need module load cuda or --index-url pytorch wheel URLs. Only load a CUDA module if you are compiling custom CUDA extensions or using venv/pip with the legacy pytorch wheel index.
Version Constraint Triangle (PyTorch + PyG + CUDA)
If you plan to use PyTorch Geometric (PyG), there is a three-way version coupling between PyTorch, PyG extension wheels, and CUDA. PyPI may ship torch 2.10+ while PyG only has compiled wheels up to torch 2.8.0. A mismatched install appears to work but segfaults at runtime (a silent C++ ABI mismatch). Always check data.pyg.org/whl/ for the latest torch version supported by PyG before upgrading. See PyG Setup for full details.
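To see which wheel index applies to your environment, you can print the two version numbers that matter. The URL pattern below matches data.pyg.org's current layout, but treat it as an assumption and verify the page exists before installing:

```python
import torch

# Derive which PyG wheel index to check before installing extensions.
torch_version = torch.__version__.split("+")[0]   # e.g. "2.8.0"
cuda_version = torch.version.cuda                 # None on CPU-only builds
tag = f"cu{cuda_version.replace('.', '')}" if cuda_version else "cpu"
print(f"https://data.pyg.org/whl/torch-{torch_version}+{tag}.html")
```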
Detailed Setup¶
Step 1: Create Virtual Environment¶
Use either the uv or the venv + pip commands from Quick Setup above; the remaining steps work with both.
Step 2: Install PyTorch¶
PyTorch wheels on PyPI now bundle CUDA libraries. No --index-url or module load cuda is needed.
# uv
uv add "torch>=2.8.0,<2.9" torchvision torchaudio
# pip
pip install "torch>=2.8.0,<2.9" torchvision torchaudio
Step 3: Install Additional Packages¶
For graph neural network libraries (PyTorch Geometric), see the PyG Setup Guide.
# Core scientific packages
pip install numpy pandas matplotlib seaborn
# Machine learning utilities
pip install scikit-learn scipy
# Deep learning tools
pip install tensorboard wandb
# Jupyter for notebooks
pip install jupyter ipykernel
# Computer vision
pip install opencv-python pillow
# NLP (if needed)
pip install transformers datasets tokenizers
# Optimization and monitoring
pip install tqdm pytorch-lightning
Step 4: Verify Installation¶
Create test_pytorch.py:
import torch
import torchvision

print("=" * 50)
print("PyTorch Installation Test")
print("=" * 50)
print(f"\nPyTorch version: {torch.__version__}")
print(f"Torchvision version: {torchvision.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name(0)}")

    # Test tensor operations
    print("\nTesting GPU operations...")
    x = torch.rand(5, 3)
    print(f"CPU tensor shape: {x.shape}")
    x_gpu = x.cuda()
    print(f"GPU tensor device: {x_gpu.device}")
    print("GPU operations working!")
else:
    print("\nCUDA not available. Running on CPU.")

print("\n" + "=" * 50)
Test on GPU node:
# Request GPU node (replace PAS1234 with your project account)
srun -A PAS1234 -p gpu --gpus-per-node=1 --time=00:10:00 --pty bash
# Activate environment
source .venv/bin/activate # uv
# or: source ~/venvs/pytorch/bin/activate # venv + pip
# Run test
python test_pytorch.py
# Exit
exit
Requesting GPUs¶
Available GPU Types¶
Pitzer Cluster¶
| GPU Model | Memory | CUDA Cores | Best For | Quantity |
|---|---|---|---|---|
| NVIDIA V100 | 32 GB | 5120 | Training large models | Limited |
| NVIDIA A100 | 40 GB | 6912 | Latest ML workloads | Limited |
Owens Cluster (Older)¶
| GPU Model | Memory | CUDA Cores | Best For | Quantity |
|---|---|---|---|---|
| NVIDIA P100 | 16 GB | 3584 | General GPU work | Many |
Which GPU to Use?¶
- A100: Latest architectures (Transformers, large models)
- V100: Most ML workloads, good balance
- P100: Older but widely available, good for testing
Interactive GPU Session¶
# Request any available GPU (replace PAS1234 with your project account)
srun -A PAS1234 -p gpu --gpus-per-node=1 --time=01:00:00 --pty bash
# Request specific GPU type
srun -A PAS1234 -p gpu --gpus-per-node=v100:1 --time=01:00:00 --pty bash
srun -A PAS1234 -p gpu --gpus-per-node=a100:1 --time=01:00:00 --pty bash
# Request multiple GPUs
srun -A PAS1234 -p gpu --gpus-per-node=2 --time=01:00:00 --pty bash
# With more CPUs and memory
srun -A PAS1234 -p gpu --gpus-per-node=1 --cpus-per-task=8 --mem=64G --time=02:00:00 --pty bash
Batch Job¶
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --account=PAS1234
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1 # Any available GPU
# #SBATCH --gpus-per-node=v100:1 # OR: request specific GPU type
#SBATCH --cpus-per-task=4 # CPUs (for data loading)
#SBATCH --mem=32G # Memory
#SBATCH --time=08:00:00
# Your GPU job commands
Monitoring GPUs¶
Using nvidia-smi¶
# Basic GPU info
nvidia-smi
# Continuous monitoring
watch -n 1 nvidia-smi
# Specific GPU
nvidia-smi -i 0
# Show processes
nvidia-smi pmon
# Detailed query
nvidia-smi --query-gpu=index,name,memory.total,memory.used,memory.free,utilization.gpu --format=csv
Using Python¶
import torch
# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
# GPU information
print(f"Number of GPUs: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
print(f" Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.2f} GB")
# Memory usage
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
GPU Selection¶
# Use specific GPU
export CUDA_VISIBLE_DEVICES=0
# Use multiple GPUs
export CUDA_VISIBLE_DEVICES=0,1
# In Python (must be set before torch initializes CUDA)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
# Or use device parameter
device = torch.device('cuda:0')
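A device-agnostic variant of the same idea, sketched below, falls back to CPU when no GPU is allocated, so the same script also runs on login nodes:

```python
import torch

# Select the GPU when one is allocated, otherwise fall back to CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.rand(8, 3).to(device)   # move tensors to the selected device
print(f"Using device: {device}")
```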
Performance¶
Data Loading Optimization¶
from torch.utils.data import DataLoader
# Use multiple workers (match CPUs requested)
train_loader = DataLoader(
dataset,
batch_size=64,
shuffle=True,
num_workers=4, # Match --cpus-per-task
pin_memory=True, # Faster GPU transfer
prefetch_factor=2, # Prefetch batches
persistent_workers=True # Keep workers alive
)
Mixed Precision Training¶
Mixed precision uses FP16 where possible, saving memory and speeding up training.
# torch.cuda.amp is deprecated since PyTorch 2.4 — use torch.amp
from torch.amp import GradScaler, autocast

scaler = GradScaler('cuda')

for epoch in range(num_epochs):
    for data, target in train_loader:
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        # Forward pass in mixed precision
        with autocast('cuda'):
            output = model(data)
            loss = criterion(output, target)
        # Backward pass with scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
Gradient Accumulation¶
Simulate larger batch sizes without more GPU memory:
accumulation_steps = 4  # Effective batch size = batch_size * 4

optimizer.zero_grad()
for i, (data, target) in enumerate(train_loader):
    data, target = data.cuda(), target.cuda()
    # Forward pass
    output = model(data)
    loss = criterion(output, target)
    # Scale loss and backward
    loss = loss / accumulation_steps
    loss.backward()
    # Update weights every accumulation_steps
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
torch.compile() (PyTorch 2.0+)¶
torch.compile() JIT-compiles your model for faster execution. On OSC's A100 GPUs, it can provide significant speedups with minimal code changes.
import torch
model = MyModel().cuda()
# Basic usage — tries the best available backend
model = torch.compile(model)
# Specify backend explicitly
model = torch.compile(model, backend="inductor") # Default, good general choice
# Max performance (longer compile time, best runtime)
model = torch.compile(model, mode="max-autotune")
Requires PyTorch 2.0+
torch.compile() is only available in PyTorch 2.0 and later. Check your version with torch.__version__; if you're on an older version, upgrade with pip install --upgrade torch torchvision torchaudio.
A100 GPUs benefit the most
torch.compile() with mode="max-autotune" takes advantage of A100-specific features like TF32 tensor cores. Request A100s on Pitzer for best results: --gpus-per-node=a100:1.
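TF32 matmuls are not fully enabled by default; a minimal sketch of turning them on with the standard PyTorch setting:

```python
import torch

# Allow TF32 tensor-core matmuls on Ampere GPUs such as the A100.
# 'high' trades a small amount of FP32 precision for large speedups;
# the default 'highest' keeps full FP32 accuracy.
torch.set_float32_matmul_precision('high')
print(torch.get_float32_matmul_precision())
```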
Gradient Checkpointing
Save memory by recomputing activations during the backward pass instead of storing them.
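A minimal sketch using torch.utils.checkpoint.checkpoint_sequential on a hypothetical nn.Sequential model (the layer sizes are placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical model: any nn.Sequential works. checkpoint_sequential
# splits it into segments and recomputes each segment's activations
# during backward instead of storing them, trading compute for memory.
model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

x = torch.rand(32, 128, requires_grad=True)
out = checkpoint_sequential(model, 2, x, use_reentrant=False)  # 2 segments
out.sum().backward()  # activations are recomputed here
```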
Profiling with PyTorch Profiler
import torch.profiler
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for i in range(10):
        output = model(input)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

# Print results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Save trace for visualization
prof.export_chrome_trace("trace.json")
# View at chrome://tracing
Multi-GPU Training¶
DataParallel (Simple, Single Node)¶
import torch.nn as nn
model = MyModel()

# Wrap model for multi-GPU
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)
model = model.cuda()

# Train as usual
for data, target in train_loader:
    optimizer.zero_grad()
    output = model(data.cuda())  # Automatically distributed
    loss = criterion(output, target.cuda())
    loss.backward()
    optimizer.step()
Job script: same single-node layout as the batch job above, with --gpus-per-node=2 (or however many GPUs you want DataParallel to use).
DistributedDataParallel (Recommended)¶
More efficient than DataParallel:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # Create model and move to GPU
    model = MyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Use DistributedSampler so each rank sees a distinct shard
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=64)

    # Training loop
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for data, target in dataloader:
            data, target = data.to(rank), target.to(rank)
            output = ddp_model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    cleanup()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size)
Job script:
#!/bin/bash
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=1
# The mp.spawn script above starts one worker per GPU itself:
python train_ddp.py
# Alternatively, drop mp.spawn and launch with torchrun (which replaces
# the deprecated torch.distributed.launch); workers then read their rank
# from the RANK/LOCAL_RANK environment variables:
# torchrun --nproc_per_node=4 train_ddp.py
Memory Management¶
Check Memory Usage¶
import torch
# Current GPU memory usage
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
# Peak memory usage
print(f"Max allocated: {torch.cuda.max_memory_allocated(0) / 1e9:.2f} GB")
# Detailed memory summary
print(torch.cuda.memory_summary(device=0, abbreviated=False))
Clear GPU Memory¶
# Clear cache
torch.cuda.empty_cache()
# Delete tensors explicitly
del large_tensor
torch.cuda.empty_cache()
# Move to CPU and delete
large_tensor = large_tensor.cpu()
del large_tensor
torch.cuda.empty_cache()
Memory-Efficient Practices¶
# 1. Use in-place operations
x.add_(y) # Instead of x = x + y
# 2. Use torch.no_grad() for inference
with torch.no_grad():
output = model(input)
# 3. Clear gradients efficiently
optimizer.zero_grad(set_to_none=True) # More memory efficient
# 4. Set memory fraction
torch.cuda.set_per_process_memory_fraction(0.8, device=0)
Using PyTorch in Jobs¶
Batch Job Script¶
Create pytorch_job.sh:
#!/bin/bash
#SBATCH --job-name=pytorch_train
#SBATCH --account=PAS1234
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --time=04:00:00
#SBATCH --output=logs/train_%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your.email@osu.edu
# Print job info
echo "Job started at: $(date)"
echo "Running on node: $(hostname)"
echo "Job ID: $SLURM_JOB_ID"
# Activate environment (PyPI torch bundles CUDA — no module load cuda needed)
source .venv/bin/activate
# Verify GPU
nvidia-smi
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
# Run training
python train.py \
--data-path /fs/scratch/PAS1234/$USER/data \
--epochs 100 \
--batch-size 64 \
--lr 0.001 \
--device cuda
echo "Job completed at: $(date)"
Submit:
sbatch pytorch_job.sh
Checkpointing¶
Save Checkpoints¶
def save_checkpoint(model, optimizer, epoch, loss, path):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, path)

# Save every N epochs
if epoch % 10 == 0:
    save_checkpoint(
        model, optimizer, epoch, loss,
        f'checkpoints/epoch_{epoch}.pth'
    )

# Save best model
if loss < best_loss:
    save_checkpoint(
        model, optimizer, epoch, loss,
        'checkpoints/best_model.pth'
    )
Load Checkpoints¶
import os

def load_checkpoint(model, optimizer, path):
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    return checkpoint['epoch'], checkpoint['loss']

# Resume training
if os.path.exists('checkpoints/best_model.pth'):
    epoch, loss = load_checkpoint(model, optimizer, 'checkpoints/best_model.pth')
    print(f"Resumed from epoch {epoch}")
Troubleshooting¶
CUDA Out of Memory¶
Symptoms: training crashes with torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate ...
Solutions:
- Reduce batch size
- Use gradient accumulation
- Use mixed precision
- Clear cache with torch.cuda.empty_cache()
- Use gradient checkpointing
- Reduce model size
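The first fix can be automated. A hedged sketch (try_batch_sizes and run_step are hypothetical names, not a PyTorch API) that halves the batch size until a training step fits:

```python
import torch

def try_batch_sizes(run_step, start_batch_size=64):
    """Halve the batch size until run_step fits in GPU memory.
    run_step is a hypothetical callable that runs one training step
    at the given batch size and raises on OOM."""
    batch_size = start_batch_size
    while batch_size >= 1:
        try:
            run_step(batch_size)
            return batch_size
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying
            batch_size //= 2
    raise RuntimeError("even batch size 1 does not fit in GPU memory")
```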
CUDA Not Available¶
Checks:
# 1. Verify you're on a GPU node (not a login node)
squeue -u $USER
# 2. Check node has GPU
nvidia-smi
# 3. Check PyTorch sees CUDA
python -c "import torch; print(torch.cuda.is_available())"
# 4. If still failing, reinstall PyTorch
pip uninstall torch torchvision torchaudio
pip install "torch>=2.8.0,<2.9" torchvision torchaudio
You do NOT need module load cuda for PyPI torch
If you installed PyTorch from PyPI (via pip install or uv add), the wheels bundle their own CUDA libraries. module load cuda is only needed if you are compiling custom CUDA extensions (e.g. custom C++/CUDA kernels). If torch.cuda.is_available() returns False, the most common cause is running on a login node instead of a GPU compute node.
Slow Training¶
Common issues:
- Too few data loader workers
- Not using pin_memory
- Not using mixed precision
- CPU-GPU transfer bottleneck
Diagnose:
# Profile to find bottlenecks
import torch.profiler

with torch.profiler.profile() as prof:
    train_one_epoch()

print(prof.key_averages().table())
Module Import Errors¶
# Verify environment activated
which python # Should point to venv
# Reinstall package
pip install --force-reinstall torch
# Check Python path
python -c "import sys; print('\n'.join(sys.path))"
Best Practices¶
- Always use virtual environments
- Test on GPU node before batch submission
- Save checkpoints regularly
- Use mixed precision for faster training
- Monitor GPU usage with nvidia-smi
- Don't over-request GPUs you won't use
- Use appropriate batch size for your GPU memory
- Pin memory and use multiple workers for data loading
- Profile before optimizing — find actual bottlenecks
- Document your environment in requirements.txt
Next Steps¶
- Read ML Workflow Guide
- Review Job Submission
Resources¶
- PyTorch Performance Tuning Guide — detailed, practical optimization recipes