Skip to content

PyTorch & GPU Setup

Everything you need to install PyTorch, request GPUs, and train efficiently on OSC.

Prerequisites

  • OSC account with GPU access
  • SSH connection configured
  • Basic familiarity with Python and PyTorch

Quick Setup

# 1. Create venv with OSC's system Python (never use uv-managed Python on OSC)
uv venv --python /apps/python/3.12/bin/python3

# 2. Activate
source .venv/bin/activate

# 3. Install PyTorch (PyPI wheels bundle CUDA — no module load cuda needed)
uv add "torch>=2.8.0,<2.9" torchvision torchaudio

# 4. Install common ML packages
uv add numpy pandas matplotlib scikit-learn jupyter tensorboard
# 1. Load Python module
module load python/3.12

# 2. Create virtual environment
python -m venv ~/venvs/pytorch

# 3. Activate
source ~/venvs/pytorch/bin/activate

# 4. Install PyTorch (PyPI wheels bundle CUDA — no --index-url needed)
pip install --upgrade pip
pip install "torch>=2.8.0,<2.9" torchvision torchaudio

# 5. Install common ML packages
pip install numpy pandas matplotlib scikit-learn jupyter tensorboard

No module load cuda needed with PyPI torch

PyTorch wheels on PyPI bundle their own NVIDIA libraries. You do not need module load cuda or --index-url pytorch wheel URLs. Only load a CUDA module if you are compiling custom CUDA extensions or using venv/pip with the legacy pytorch wheel index.

Version Constraint Triangle (PyTorch + PyG + CUDA)

If you plan to use PyTorch Geometric (PyG), there is a three-way version coupling between PyTorch, PyG extension wheels, and CUDA. PyPI may ship torch 2.10+ but PyG only has compiled wheels up to torch 2.8.0. Installing mismatched versions compiles fine but segfaults at runtime (silent C++ ABI mismatch). Always check data.pyg.org/whl/ for the latest torch version supported by PyG before upgrading. See PyG Setup for full details.

Detailed Setup

Step 1: Create Virtual Environment

# Use OSC's system Python (never uv-managed Python on OSC)
uv venv --python /apps/python/3.12/bin/python3
source .venv/bin/activate
module load python/3.12
python -m venv ~/venvs/pytorch
source ~/venvs/pytorch/bin/activate
pip install --upgrade pip

Step 2: Install PyTorch

PyTorch wheels on PyPI now bundle CUDA libraries. No --index-url or module load cuda is needed.

# uv
uv add "torch>=2.8.0,<2.9" torchvision torchaudio

# pip
pip install "torch>=2.8.0,<2.9" torchvision torchaudio

Step 3: Install Additional Packages

For graph neural network libraries (PyTorch Geometric), see the PyG Setup Guide.

# Core scientific packages
pip install numpy pandas matplotlib seaborn

# Machine learning utilities
pip install scikit-learn scipy

# Deep learning tools
pip install tensorboard wandb

# Jupyter for notebooks
pip install jupyter ipykernel

# Computer vision
pip install opencv-python pillow

# NLP (if needed)
pip install transformers datasets tokenizers

# Optimization and monitoring
pip install tqdm pytorch-lightning

Step 4: Verify Installation

Create test_pytorch.py:

import torch
import torchvision

print("=" * 50)
print("PyTorch Installation Test")
print("=" * 50)

print(f"\nPyTorch version: {torch.__version__}")
print(f"Torchvision version: {torchvision.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.current_device()}")
    print(f"GPU name: {torch.cuda.get_device_name(0)}")

    # Test tensor operations
    print("\nTesting GPU operations...")
    x = torch.rand(5, 3)
    print(f"CPU tensor shape: {x.shape}")

    x_gpu = x.cuda()
    print(f"GPU tensor device: {x_gpu.device}")
    print("GPU operations working!")
else:
    print("\nCUDA not available. Running on CPU.")

print("\n" + "=" * 50)

Test on GPU node:

# Request GPU node
srun -p gpu --gpus-per-node=1 --time=00:10:00 --pty bash

# Activate environment
source .venv/bin/activate  # uv
# or: source ~/venvs/pytorch/bin/activate  # venv + pip

# Run test
python test_pytorch.py

# Exit
exit

Requesting GPUs

Available GPU Types

Pitzer Cluster

GPU Model Memory CUDA Cores Best For Quantity
NVIDIA V100 32 GB 5120 Training large models Limited
NVIDIA A100 40 GB 6912 Latest ML workloads Limited

Owens Cluster (Older)

GPU Model Memory CUDA Cores Best For Quantity
NVIDIA P100 16 GB 3584 General GPU work Many

Which GPU to Use?

  • A100: Latest architectures (Transformers, large models)
  • V100: Most ML workloads, good balance
  • P100: Older but widely available, good for testing

Interactive GPU Session

# Request any available GPU
srun -p gpu --gpus-per-node=1 --time=01:00:00 --pty bash

# Request specific GPU type
srun -p gpu --gpus-per-node=v100:1 --time=01:00:00 --pty bash
srun -p gpu --gpus-per-node=a100:1 --time=01:00:00 --pty bash

# Request multiple GPUs
srun -p gpu --gpus-per-node=2 --time=01:00:00 --pty bash

# With more CPUs and memory
srun -p gpu --gpus-per-node=1 --cpus-per-task=8 --mem=64G --time=02:00:00 --pty bash

Batch Job

#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1           # Any available GPU
# #SBATCH --gpus-per-node=v100:1   # OR: request specific GPU type
#SBATCH --cpus-per-task=4           # CPUs (for data loading)
#SBATCH --mem=32G                   # Memory
#SBATCH --time=08:00:00

# Your GPU job commands

Monitoring GPUs

Using nvidia-smi

# Basic GPU info
nvidia-smi

# Continuous monitoring
watch -n 1 nvidia-smi

# Specific GPU
nvidia-smi -i 0

# Show processes
nvidia-smi pmon

# Detailed query
nvidia-smi --query-gpu=index,name,memory.total,memory.used,memory.free,utilization.gpu --format=csv

Using Python

import torch

# Check CUDA availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")

# GPU information
print(f"Number of GPUs: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
    print(f"  Memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.2f} GB")

# Memory usage
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")

GPU Selection

# Use specific GPU
export CUDA_VISIBLE_DEVICES=0

# Use multiple GPUs
export CUDA_VISIBLE_DEVICES=0,1
# In Python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# Or use device parameter
device = torch.device('cuda:0')

Performance

Data Loading Optimization

from torch.utils.data import DataLoader

# Use multiple workers (match CPUs requested)
train_loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,        # Match --cpus-per-task
    pin_memory=True,      # Faster GPU transfer
    prefetch_factor=2,    # Prefetch batches
    persistent_workers=True  # Keep workers alive
)

Mixed Precision Training

Mixed precision uses FP16 where possible, saving memory and speeding up training.

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for epoch in range(num_epochs):
    for data, target in train_loader:
        data, target = data.cuda(), target.cuda()

        optimizer.zero_grad()

        # Forward pass in mixed precision
        with autocast():
            output = model(data)
            loss = criterion(output, target)

        # Backward pass with scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

Gradient Accumulation

Simulate larger batch sizes without more GPU memory:

accumulation_steps = 4  # Effective batch size = batch_size * 4

for i, (data, target) in enumerate(train_loader):
    data, target = data.cuda(), target.cuda()

    # Forward pass
    output = model(data)
    loss = criterion(output, target)

    # Scale loss and backward
    loss = loss / accumulation_steps
    loss.backward()

    # Update weights every accumulation_steps
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

torch.compile() (PyTorch 2.0+)

torch.compile() JIT-compiles your model for faster execution. On OSC's A100 GPUs, it can provide significant speedups with minimal code changes.

import torch

model = MyModel().cuda()

# Basic usage — tries the best available backend
model = torch.compile(model)

# Specify backend explicitly
model = torch.compile(model, backend="inductor")  # Default, good general choice

# Max performance (longer compile time, best runtime)
model = torch.compile(model, mode="max-autotune")

Requires PyTorch 2.0+

torch.compile() is only available in PyTorch 2.0 and later. Check your version with torch.__version__. If you're using an older version, upgrade with:

pip install --upgrade "torch>=2.8.0,<2.9" torchvision torchaudio

A100 GPUs benefit the most

torch.compile() with mode="max-autotune" takes advantage of A100-specific features like TF32 tensor cores. Request A100s on Pitzer for best results: --gpus-per-node=a100:1.

Gradient Checkpointing

Save memory by recomputing activations during backward pass:

import torch.utils.checkpoint as checkpoint

class MyModel(nn.Module):
    def forward(self, x):
        # Use checkpointing for memory-intensive layers
        x = checkpoint.checkpoint(self.layer1, x)
        x = checkpoint.checkpoint(self.layer2, x)
        return x
Profiling with PyTorch Profiler
import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for i in range(10):
        output = model(input)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

# Print results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

# Save trace for visualization
prof.export_chrome_trace("trace.json")
# View at chrome://tracing

Multi-GPU Training

DataParallel (Simple, Single Node)

import torch.nn as nn

model = MyModel()

# Wrap model for multi-GPU
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)

model = model.cuda()

# Train as usual
for data, target in train_loader:
    output = model(data.cuda())  # Automatically distributed
    loss = criterion(output, target.cuda())
    loss.backward()
    optimizer.step()

Job script:

#SBATCH --gpus-per-node=4

More efficient than DataParallel:

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # Create model and move to GPU
    model = MyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # Use DistributedSampler
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=64)

    # Training loop
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for data, target in dataloader:
            data, target = data.to(rank), target.to(rank)
            output = ddp_model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    cleanup()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(train, args=(world_size,), nprocs=world_size)

Job script:

#!/bin/bash
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=4

# torchrun replaces the deprecated torch.distributed.launch
torchrun --nproc_per_node=4 train_ddp.py

Memory Management

Check Memory Usage

import torch

# Current GPU memory usage
print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")

# Peak memory usage
print(f"Max allocated: {torch.cuda.max_memory_allocated(0) / 1e9:.2f} GB")

# Detailed memory summary
print(torch.cuda.memory_summary(device=0, abbreviated=False))

Clear GPU Memory

# Clear cache
torch.cuda.empty_cache()

# Delete tensors explicitly
del large_tensor
torch.cuda.empty_cache()

# Move to CPU and delete
large_tensor = large_tensor.cpu()
del large_tensor
torch.cuda.empty_cache()

Memory-Efficient Practices

# 1. Use in-place operations
x.add_(y)  # Instead of x = x + y

# 2. Use torch.no_grad() for inference
with torch.no_grad():
    output = model(input)

# 3. Clear gradients efficiently
optimizer.zero_grad(set_to_none=True)  # More memory efficient

# 4. Set memory fraction
torch.cuda.set_per_process_memory_fraction(0.8, device=0)

Using PyTorch in Jobs

Batch Job Script

Create pytorch_job.sh:

#!/bin/bash
#SBATCH --job-name=pytorch_train
#SBATCH --account=PAS1234
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --time=04:00:00
#SBATCH --output=logs/train_%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your.email@osu.edu

# Print job info
echo "Job started at: $(date)"
echo "Running on node: $(hostname)"
echo "Job ID: $SLURM_JOB_ID"

# Activate environment (PyPI torch bundles CUDA — no module load cuda needed)
source .venv/bin/activate

# Verify GPU
nvidia-smi
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# Run training
python train.py \
    --data-path /fs/scratch/PAS1234/$USER/data \
    --epochs 100 \
    --batch-size 64 \
    --lr 0.001 \
    --device cuda

echo "Job completed at: $(date)"

Submit:

mkdir -p logs
sbatch pytorch_job.sh

Checkpointing

Save Checkpoints

def save_checkpoint(model, optimizer, epoch, loss, path):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, path)

# Save every N epochs
if epoch % 10 == 0:
    save_checkpoint(
        model, optimizer, epoch, loss,
        f'checkpoints/epoch_{epoch}.pth'
    )

# Save best model
if loss < best_loss:
    save_checkpoint(
        model, optimizer, epoch, loss,
        'checkpoints/best_model.pth'
    )

Load Checkpoints

def load_checkpoint(model, optimizer, path):
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch = checkpoint['epoch']
    loss = checkpoint['loss']
    return epoch, loss

# Resume training
if os.path.exists('checkpoints/best_model.pth'):
    epoch, loss = load_checkpoint(model, optimizer, 'checkpoints/best_model.pth')
    print(f"Resumed from epoch {epoch}")

Troubleshooting

CUDA Out of Memory

Symptoms:

RuntimeError: CUDA out of memory

Solutions:

  1. Reduce batch size

    batch_size = 32  # Instead of 64
    

  2. Use gradient accumulation

    accumulation_steps = 2  # Effective batch size = 64
    

  3. Use mixed precision

    from torch.cuda.amp import autocast
    with autocast():
        output = model(input)
    

  4. Clear cache

    torch.cuda.empty_cache()
    

  5. Use gradient checkpointing

    model.gradient_checkpointing_enable()
    

  6. Reduce model size

CUDA Not Available

Checks:

# 1. Verify you're on a GPU node (not a login node)
squeue -u $USER

# 2. Check node has GPU
nvidia-smi

# 3. Check PyTorch sees CUDA
python -c "import torch; print(torch.cuda.is_available())"

# 4. If still failing, reinstall PyTorch
pip uninstall torch torchvision torchaudio
pip install "torch>=2.8.0,<2.9" torchvision torchaudio

You do NOT need module load cuda for PyPI torch

If you installed PyTorch from PyPI (via pip install or uv add), the wheels bundle their own CUDA libraries. module load cuda is only needed if you are compiling custom CUDA extensions (e.g. custom C++/CUDA kernels). If torch.cuda.is_available() returns False, the most common cause is running on a login node instead of a GPU compute node.

Slow Training

Common issues: - Too few data loader workers - Not using pin_memory - Not using mixed precision - CPU-GPU transfer bottleneck

Diagnose:

# Profile to find bottlenecks
import torch.profiler
with torch.profiler.profile() as prof:
    train_one_epoch()
print(prof.key_averages().table())

Module Import Errors

# Verify environment activated
which python  # Should point to venv

# Reinstall package
pip install --force-reinstall torch

# Check Python path
python -c "import sys; print('\n'.join(sys.path))"

Best Practices

  1. Always use virtual environments
  2. Test on GPU node before batch submission
  3. Save checkpoints regularly
  4. Use mixed precision for faster training
  5. Monitor GPU usage with nvidia-smi
  6. Don't over-request GPUs you won't use
  7. Use appropriate batch size for your GPU memory
  8. Pin memory and use multiple workers for data loading
  9. Profile before optimizing — find actual bottlenecks
  10. Document your environment in requirements.txt

Next Steps

Resources