PyTorch & GPU Setup¶
Everything you need to install PyTorch, request GPUs, and train efficiently on OSC.
Prerequisites¶
- OSC account with GPU access
- SSH connection configured
- Basic familiarity with Python and PyTorch
Quick Setup¶
Option A (uv):
# 1. Create venv with OSC's system Python (never use uv-managed Python on OSC)
uv venv --python /apps/python/3.12/bin/python3
# 2. Activate
source .venv/bin/activate
# 3. Install PyTorch (PyPI wheels bundle CUDA — no module load cuda needed)
uv add "torch>=2.8.0,<2.9" torchvision torchaudio
# 4. Install common ML packages
uv add numpy pandas matplotlib scikit-learn jupyter tensorboard
Option B (venv + pip):
# 1. Load Python module
module load python/3.12
# 2. Create virtual environment
python -m venv ~/venvs/pytorch
# 3. Activate
source ~/venvs/pytorch/bin/activate
# 4. Install PyTorch (PyPI wheels bundle CUDA — no --index-url needed)
pip install --upgrade pip
pip install "torch>=2.8.0,<2.9" torchvision torchaudio
# 5. Install common ML packages
pip install numpy pandas matplotlib scikit-learn jupyter tensorboard
No module load cuda needed with PyPI torch
PyTorch wheels on PyPI bundle their own NVIDIA libraries. You do not need module load cuda or --index-url pytorch wheel URLs. Only load a CUDA module if you are compiling custom CUDA extensions or using venv/pip with the legacy pytorch wheel index.
Version Constraint Triangle (PyTorch + PyG + CUDA)
If you plan to use PyTorch Geometric (PyG), there is a three-way version coupling between PyTorch, PyG extension wheels, and CUDA. PyPI may ship torch 2.10+ while PyG only has compiled wheels up to torch 2.8.0. A mismatched install appears to work but segfaults at runtime (a silent C++ ABI mismatch). Always check data.pyg.org/whl/ for the latest torch version supported by PyG before upgrading. See PyG Setup for full details.
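To see which wheel index applies to your environment, you can print the two version numbers that matter. The URL pattern below matches data.pyg.org's current layout, but treat it as an assumption and verify the page exists before installing:

```python
import torch

# Derive which PyG wheel index to check before installing extensions.
torch_version = torch.__version__.split("+")[0]   # e.g. "2.8.0"
cuda_version = torch.version.cuda                 # None on CPU-only builds
tag = f"cu{cuda_version.replace('.', '')}" if cuda_version else "cpu"
print(f"https://data.pyg.org/whl/torch-{torch_version}+{tag}.html")
```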
Detailed Setup¶
Step 1: Create Virtual Environment¶
Use either the uv or the venv + pip commands from Quick Setup above; the remaining steps work with both.
Step 2: Install PyTorch¶
PyTorch wheels on PyPI now bundle CUDA libraries. No --index-url or module load cuda is needed.
# uv
uv add "torch>=2.8.0,<2.9" torchvision torchaudio
# pip
pip install "torch>=2.8.0,<2.9" torchvision torchaudio
Step 3: Install Additional Packages¶
For graph neural network libraries (PyTorch Geometric), see the PyG Setup Guide.
# Core scientific packages
pip install numpy pandas matplotlib seaborn
# Machine learning utilities
pip install scikit-learn scipy
# Deep learning tools
pip install tensorboard wandb
# Jupyter for notebooks
pip install jupyter ipykernel
# Computer vision
pip install opencv-python pillow
# NLP (if needed)
pip install transformers datasets tokenizers
# Optimization and monitoring
pip install tqdm pytorch-lightning
Step 4: Verify Installation¶
Create test_pytorch.py:
import torch
import torchvision

print("=" * 50)
print("PyTorch Installation Test")
print("=" * 50)
print(f"\nPyTorch version: {torch.__version__}")
print(f"Torchvision version: {torchvision.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name(0)}")

    # Test tensor operations
    print("\nTesting GPU operations...")
    x = torch.rand(5, 3)
    print(f"CPU tensor shape: {x.shape}")
    x_gpu = x.cuda()
    print(f"GPU tensor device: {x_gpu.device}")
    print("GPU operations working!")
else:
    print("\nCUDA not available. Running on CPU.")

print("\n" + "=" * 50)
Test on GPU node:
# Request GPU node (replace PAS1234 with your project account)
srun -A PAS1234 -p gpu --gpus-per-node=1 --time=00:10:00 --pty bash
# Activate environment
source .venv/bin/activate # uv
# or: source ~/venvs/pytorch/bin/activate # venv + pip
# Run test
python test_pytorch.py
# Exit
exit
Requesting GPUs¶
Available GPU Types¶
Pitzer Cluster¶
| GPU Model | Memory | CUDA Cores | Best For | Quantity |
|---|---|---|---|---|
| NVIDIA V100 | 32 GB | 5120 | Training large models | Limited |
| NVIDIA A100 | 40 GB | 6912 | Latest ML workloads | Limited |
Owens Cluster (Older)¶
| GPU Model | Memory | CUDA Cores | Best For | Quantity |
|---|---|---|---|---|
| NVIDIA P100 | 16 GB | 3584 | General GPU work | Many |
Which GPU to Use?¶
- A100: Latest architectures (Transformers, large models)
- V100: Most ML workloads, good balance
- P100: Older but widely available, good for testing
Interactive GPU Session¶
# Request any available GPU (replace PAS1234 with your project account)
srun -A PAS1234 -p gpu --gpus-per-node=1 --time=01:00:00 --pty bash
# Request specific GPU type
srun -A PAS1234 -p gpu --gpus-per-node=v100:1 --time=01:00:00 --pty bash
srun -A PAS1234 -p gpu --gpus-per-node=a100:1 --time=01:00:00 --pty bash
# Request multiple GPUs
srun -A PAS1234 -p gpu --gpus-per-node=2 --time=01:00:00 --pty bash
# With more CPUs and memory
srun -A PAS1234 -p gpu --gpus-per-node=1 --cpus-per-task=8 --mem=64G --time=02:00:00 --pty bash
Batch Job¶
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --account=PAS1234
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1 # Any available GPU
# #SBATCH --gpus-per-node=v100:1 # OR: request specific GPU type
#SBATCH --cpus-per-task=4 # CPUs (for data loading)
#SBATCH --mem=32G # Memory
#SBATCH --time=08:00:00
# Your GPU job commands
Monitoring GPUs¶
Using nvidia-smi¶
# Basic GPU info
nvidia-smi
# Continuous monitoring
watch -n 1 nvidia-smi
# Specific GPU
nvidia-smi -i 0
# Show processes
nvidia-smi pmon
# Detailed query
nvidia-smi --query-gpu=index,name,memory.total,memory.used,memory.free,utilization.gpu --format=csv
Using Python¶
import torch
# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
# GPU information
print(f"Number of GPUs: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
print(f" Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.2f} GB")
# Memory usage
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
GPU Selection¶
# Use specific GPU
export CUDA_VISIBLE_DEVICES=0
# Use multiple GPUs
export CUDA_VISIBLE_DEVICES=0,1
# In Python (must be set before torch initializes CUDA)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
# Or use device parameter
device = torch.device('cuda:0')
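A device-agnostic variant of the same idea, sketched below, falls back to CPU when no GPU is allocated, so the same script also runs on login nodes:

```python
import torch

# Select the GPU when one is allocated, otherwise fall back to CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.rand(8, 3).to(device)   # move tensors to the selected device
print(f"Using device: {device}")
```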
Performance¶
Data Loading Optimization¶
from torch.utils.data import DataLoader
# Use multiple workers (match CPUs requested)
train_loader = DataLoader(
dataset,
batch_size=64,
shuffle=True,
num_workers=4, # Match --cpus-per-task
pin_memory=True, # Faster GPU transfer
prefetch_factor=2, # Prefetch batches
persistent_workers=True # Keep workers alive
)
Mixed Precision Training¶
Mixed precision uses FP16 where possible, saving memory and speeding up training.
# torch.cuda.amp is deprecated since PyTorch 2.4 — use torch.amp
from torch.amp import GradScaler, autocast

scaler = GradScaler('cuda')

for epoch in range(num_epochs):
    for data, target in train_loader:
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        # Forward pass in mixed precision
        with autocast('cuda'):
            output = model(data)
            loss = criterion(output, target)
        # Backward pass with scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
Gradient Accumulation¶
Simulate larger batch sizes without more GPU memory:
accumulation_steps = 4  # Effective batch size = batch_size * 4

optimizer.zero_grad()
for i, (data, target) in enumerate(train_loader):
    data, target = data.cuda(), target.cuda()
    # Forward pass
    output = model(data)
    loss = criterion(output, target)
    # Scale loss and backward
    loss = loss / accumulation_steps
    loss.backward()
    # Update weights every accumulation_steps
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
torch.compile() (PyTorch 2.0+)¶
torch.compile() JIT-compiles your model for faster execution. On OSC's A100 GPUs, it can provide significant speedups with minimal code changes.
import torch
model = MyModel().cuda()
# Basic usage — tries the best available backend
model = torch.compile(model)
# Specify backend explicitly
model = torch.compile(model, backend="inductor") # Default, good general choice
# Max performance (longer compile time, best runtime)
model = torch.compile(model, mode="max-autotune")
Requires PyTorch 2.0+
torch.compile() is only available in PyTorch 2.0 and later. Check your version with torch.__version__; if you're on an older version, upgrade with pip install --upgrade torch torchvision torchaudio.
A100 GPUs benefit the most
torch.compile() with mode="max-autotune" takes advantage of A100-specific features like TF32 tensor cores. Request A100s on Pitzer for best results: --gpus-per-node=a100:1.
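TF32 matmuls are not fully enabled by default; a minimal sketch of turning them on with the standard PyTorch setting:

```python
import torch

# Allow TF32 tensor-core matmuls on Ampere GPUs such as the A100.
# 'high' trades a small amount of FP32 precision for large speedups;
# the default 'highest' keeps full FP32 accuracy.
torch.set_float32_matmul_precision('high')
print(torch.get_float32_matmul_precision())
```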
Gradient Checkpointing
Save memory by recomputing activations during the backward pass instead of storing them.
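A minimal sketch using torch.utils.checkpoint.checkpoint_sequential on a hypothetical nn.Sequential model (the layer sizes are placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical model: any nn.Sequential works. checkpoint_sequential
# splits it into segments and recomputes each segment's activations
# during backward instead of storing them, trading compute for memory.
model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

x = torch.rand(32, 128, requires_grad=True)
out = checkpoint_sequential(model, 2, x, use_reentrant=False)  # 2 segments
out.sum().backward()  # activations are recomputed here
```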
Profiling with PyTorch Profiler
import torch.profiler
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for i in range(10):
        output = model(input)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

# Print results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Save trace for visualization
prof.export_chrome_trace("trace.json")
# View at chrome://tracing
Multi-GPU Training¶
DataParallel (Simple, Single Node)¶
import torch.nn as nn
model = MyModel()

# Wrap model for multi-GPU
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)
model = model.cuda()

# Train as usual
for data, target in train_loader:
    optimizer.zero_grad()
    output = model(data.cuda())  # Automatically distributed
    loss = criterion(output, target.cuda())
    loss.backward()
    optimizer.step()
Job script: same single-node layout as the batch job above, with --gpus-per-node=2 (or however many GPUs you want DataParallel to use).
DistributedDataParallel (Recommended)¶
More efficient than DataParallel:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # Create model and move to GPU
    model = MyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Use DistributedSampler so each rank sees a distinct shard
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=64)

    # Training loop
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for data, target in dataloader:
            data, target = data.to(rank), target.to(rank)
            output = ddp_model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    cleanup()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size)
Job script:
#!/bin/bash
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=1
# The mp.spawn script above starts one worker per GPU itself:
python train_ddp.py
# Alternatively, drop mp.spawn and launch with torchrun (which replaces
# the deprecated torch.distributed.launch); workers then read their rank
# from the RANK/LOCAL_RANK environment variables:
# torchrun --nproc_per_node=4 train_ddp.py
Memory Management¶
Check Memory Usage¶
import torch
# Current GPU memory usage
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
# Peak memory usage
print(f"Max allocated: {torch.cuda.max_memory_allocated(0) / 1e9:.2f} GB")
# Detailed memory summary
print(torch.cuda.memory_summary(device=0, abbreviated=False))
Clear GPU Memory¶
# Clear cache
torch.cuda.empty_cache()
# Delete tensors explicitly
del large_tensor
torch.cuda.empty_cache()
# Move to CPU and delete
large_tensor = large_tensor.cpu()
del large_tensor
torch.cuda.empty_cache()
Memory-Efficient Practices¶
# 1. Use in-place operations
x.add_(y) # Instead of x = x + y
# 2. Use torch.no_grad() for inference
with torch.no_grad():
output = model(input)
# 3. Clear gradients efficiently
optimizer.zero_grad(set_to_none=True) # More memory efficient
# 4. Set memory fraction
torch.cuda.set_per_process_memory_fraction(0.8, device=0)
Using PyTorch in Jobs¶
Batch Job Script¶
Create pytorch_job.sh:
#!/bin/bash
#SBATCH --job-name=pytorch_train
#SBATCH --account=PAS1234
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --time=04:00:00
#SBATCH --output=logs/train_%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your.email@osu.edu
# Print job info
echo "Job started at: $(date)"
echo "Running on node: $(hostname)"
echo "Job ID: $SLURM_JOB_ID"
# Activate environment (PyPI torch bundles CUDA — no module load cuda needed)
source .venv/bin/activate
# Verify GPU
nvidia-smi
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
# Run training
python train.py \
--data-path /fs/scratch/PAS1234/$USER/data \
--epochs 100 \
--batch-size 64 \
--lr 0.001 \
--device cuda
echo "Job completed at: $(date)"
Submit:
sbatch pytorch_job.sh
Checkpointing¶
Save Checkpoints¶
def save_checkpoint(model, optimizer, epoch, loss, path):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, path)

# Save every N epochs
if epoch % 10 == 0:
    save_checkpoint(
        model, optimizer, epoch, loss,
        f'checkpoints/epoch_{epoch}.pth'
    )

# Save best model
if loss < best_loss:
    save_checkpoint(
        model, optimizer, epoch, loss,
        'checkpoints/best_model.pth'
    )
Load Checkpoints¶
import os

def load_checkpoint(model, optimizer, path):
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    return checkpoint['epoch'], checkpoint['loss']

# Resume training
if os.path.exists('checkpoints/best_model.pth'):
    epoch, loss = load_checkpoint(model, optimizer, 'checkpoints/best_model.pth')
    print(f"Resumed from epoch {epoch}")
Troubleshooting¶
CUDA Out of Memory¶
Symptoms: training crashes with torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate ...
Solutions:
- Reduce batch size
- Use gradient accumulation
- Use mixed precision
- Clear cache with torch.cuda.empty_cache()
- Use gradient checkpointing
- Reduce model size
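The first fix can be automated. A hedged sketch (try_batch_sizes and run_step are hypothetical names, not a PyTorch API) that halves the batch size until a training step fits:

```python
import torch

def try_batch_sizes(run_step, start_batch_size=64):
    """Halve the batch size until run_step fits in GPU memory.
    run_step is a hypothetical callable that runs one training step
    at the given batch size and raises on OOM."""
    batch_size = start_batch_size
    while batch_size >= 1:
        try:
            run_step(batch_size)
            return batch_size
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before retrying
            batch_size //= 2
    raise RuntimeError("even batch size 1 does not fit in GPU memory")
```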
CUDA Not Available¶
Checks:
# 1. Verify you're on a GPU node (not a login node)
squeue -u $USER
# 2. Check node has GPU
nvidia-smi
# 3. Check PyTorch sees CUDA
python -c "import torch; print(torch.cuda.is_available())"
# 4. If still failing, reinstall PyTorch
pip uninstall torch torchvision torchaudio
pip install "torch>=2.8.0,<2.9" torchvision torchaudio
You do NOT need module load cuda for PyPI torch
If you installed PyTorch from PyPI (via pip install or uv add), the wheels bundle their own CUDA libraries. module load cuda is only needed if you are compiling custom CUDA extensions (e.g. custom C++/CUDA kernels). If torch.cuda.is_available() returns False, the most common cause is running on a login node instead of a GPU compute node.
Slow Training¶
Common issues:
- Too few data loader workers
- Not using pin_memory
- Not using mixed precision
- CPU-GPU transfer bottleneck
Diagnose:
# Profile to find bottlenecks
import torch.profiler

with torch.profiler.profile() as prof:
    train_one_epoch()

print(prof.key_averages().table())
Module Import Errors¶
# Verify environment activated
which python # Should point to venv
# Reinstall package
pip install --force-reinstall torch
# Check Python path
python -c "import sys; print('\n'.join(sys.path))"
Best Practices¶
- Always use virtual environments
- Test on GPU node before batch submission
- Save checkpoints regularly
- Use mixed precision for faster training
- Monitor GPU usage with nvidia-smi
- Don't over-request GPUs you won't use
- Use appropriate batch size for your GPU memory
- Pin memory and use multiple workers for data loading
- Profile before optimizing — find actual bottlenecks
- Document your environment in requirements.txt
Next Steps¶
- Read ML Workflow Guide
- Review Job Submission
Resources¶
- PyTorch Performance Tuning Guide — detailed, practical optimization recipes