Troubleshooting Guide¶

Common issues and solutions for working on OSC.

flowchart TD
    A{What's the problem?} --> B["Connection"]
    A --> H["VS Code Remote"]
    A --> C["Module & Environment"]
    A --> D["GPU / CUDA"]
    A --> E["Job Submission"]
    A --> F["File System"]
    A --> I["Data Transfer"]
    A --> J["Python / PyTorch"]
    A --> G["Performance"]

Connection Issues¶

SSH Connection Failed¶

Problem: Cannot connect to OSC

ssh: connect to host pitzer.osc.edu port 22: Connection refused

Solutions:

Check that you are on the OSU network or VPN

# Install and connect to OSU VPN
# Download from: https://osuitsm.service-now.com/

Verify hostname

# Correct hostnames:
pitzer.osc.edu
owens.osc.edu

# Not: pitzer.org or pitzer.com

Check system status
Visit: https://www.osc.edu/resources/system-status

Try alternative cluster

ssh username@owens.osc.edu  # If pitzer is down

SSH Key Authentication Failed¶

Problem: Permission denied with SSH key

Permission denied (publickey,gssapi-keyex,gssapi-with-mic)

Solutions:

Verify key is added to OSC
Log into my.osc.edu
Check "SSH Public Keys"
Wait 10 minutes after adding

Check key permissions

chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_ed25519
chmod 644 ~/.ssh/id_ed25519.pub

Verify SSH config

cat ~/.ssh/config
# Should have:
# IdentityFile ~/.ssh/id_ed25519

Test with verbose output

ssh -v pitzer
# Look for which key is being tried
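The permission check above can also be scripted. The following is a minimal sketch, assuming the key file names used in this section (`id_ed25519` and `id_ed25519.pub`); `check_ssh_permissions` is a hypothetical helper name, not an OSC tool. It reports any file whose mode is more permissive than the `chmod` values listed above.

```python
import stat
from pathlib import Path

# Maximum permission bits, mirroring the chmod values in this section.
# The empty name stands for the ~/.ssh directory itself.
EXPECTED = {"": 0o700, "id_ed25519": 0o600, "id_ed25519.pub": 0o644}


def check_ssh_permissions(ssh_dir=None):
    """Return a list of human-readable problems found under ssh_dir."""
    ssh_dir = Path(ssh_dir) if ssh_dir else Path.home() / ".ssh"
    problems = []
    for name, allowed in EXPECTED.items():
        path = ssh_dir / name if name else ssh_dir
        if not path.exists():
            problems.append(f"{path}: missing")
            continue
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & ~allowed:  # some bit is set beyond the allowed mask
            problems.append(
                f"{path}: mode {mode:o} is more permissive than {allowed:o}"
            )
    return problems


# Usage: print each problem found in ~/.ssh (prints nothing if all is well)
# for problem in check_ssh_permissions():
#     print(problem)
```

An empty result means the modes match what SSH expects; it does not confirm that the key is registered with OSC.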

VS Code Remote Connection Hangs¶

Problem: "Setting up SSH Host..." hangs

See VS Code Remote-SSH Issues below for detailed solutions covering connection timeouts, extension host crashes, slow performance, and ProxyCommand issues.

VS Code Remote-SSH Issues¶

Detailed troubleshooting for VS Code Remote-SSH connections to OSC. For general SSH issues, see Connection Issues above. For Remote-SSH setup, see Remote Development.

Connection Timeout¶

Problem: "Setting up SSH Host..." hangs or times out

Solutions:

Verify SSH works outside VS Code first

ssh pitzer  # Must work before Remote-SSH will

Delete the remote VS Code server
```
ssh pitzer
rm -rf ~/.vscode-server
```
Check disk quota — VS Code Server needs ~500 MB
```
ssh pitzer
quota -s
```
Restart VS Code completely — close all windows, then reopen

Extension Host Crash¶

Problem: "The VS Code Server failed to start" or extension host keeps crashing

Solutions:

Clear the extensions directory

ssh pitzer
rm -rf ~/.vscode-server/extensions

Reduce installed remote extensions — each extension increases memory usage. Only install essentials remotely (Python, Pylance, Jupyter)
Check disk usage: extensions can fill your home directory
```
du -sh ~/.vscode-server/
```

Slow Performance¶

Problem: Typing lag, slow file loading, or frequent disconnects

Solutions:

Exclude large directories from the file watcher

Add to .vscode/settings.json:

{
  "files.watcherExclude": {
    "**/data": true,
    "**/checkpoints": true,
    "**/__pycache__": true,
    "**/node_modules": true,
    "**/.git/objects": true
  }
}

Disable unused remote extensions — open Extensions (Ctrl+Shift+X), filter to installed on SSH, and disable anything not needed
Work on the native filesystem — avoid opening folders on /fs/scratch/ if possible; prefer ~/projects/

ProxyCommand Issues¶

Problem: VS Code fails when SSH config uses ProxyCommand or ProxyJump

Solutions:

Use ProxyJump instead of ProxyCommand — VS Code handles ProxyJump better:

Host pitzer
    HostName pitzer.osc.edu
    User your.username
    ProxyJump bastion.osc.edu

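Set RemoteCommand to none in the host entry of your SSH config. A RemoteCommand that launches another shell or program can prevent VS Code from starting its server.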

Workspace Trust¶

Problem: VS Code asks about workspace trust on every connection

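Solution: When prompted, choose "Trust" for your home directory or project folder. To review or add trusted folders later, run "Workspaces: Manage Workspace Trust" from the Command Palette. Separately, the setting below controls what happens when you open a file from outside a trusted folder; "open" opens such files without prompting: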

{
  "security.workspace.trust.untrustedFiles": "open"
}

Module and Environment Issues¶

Module Not Found¶

Problem: module: command not found

Solution:

# Module system not initialized
# Add to ~/.bashrc:
source /etc/profile.d/modules.sh

# Or reinitialize shell
bash --login

Module Load Fails¶

Problem: Module not found

Module 'xyz' not found

Solutions:

Search for correct name

module spider xyz
module avail 2>&1 | grep xyz  # module prints to stderr, so redirect before piping

Check dependencies

module spider xyz/version
# Shows required modules to load first

Virtual Environment Won't Activate¶

Problem: Environment doesn't activate

Solutions:

Verify path exists

# uv (recommended) — .venv/ in project root
ls .venv/bin/activate

# pip+venv — ~/venvs/ convention
ls ~/venvs/myproject/bin/activate

Recreate if corrupted

# uv (recommended)
rm -rf .venv
uv venv --python /apps/python/3.12/bin/python3

# pip+venv
rm -rf ~/venvs/myproject
module load python/3.12
python -m venv ~/venvs/myproject

Check Python module loaded (pip+venv only)

module list 2>&1 | grep python  # module prints to stderr, so redirect before piping
module load python/3.12

Package Installation Fails¶

Problem: pip install fails

Solutions:

Update pip
```
pip install --upgrade pip
```
Check disk quota
```
quota -s
# If over quota, clean up
```
Install to user directory (only when no virtual environment is active; pip rejects --user inside a venv)
```
pip install --user package_name
```
Clear pip cache
```
pip cache purge
```

GPU Issues¶

CUDA Not Available¶

Problem: torch.cuda.is_available() returns False

Quick checks: verify you're on a GPU node (nvidia-smi) and check torch.cuda.is_available(). If you installed PyTorch from PyPI, you do not need module load cuda -- PyPI wheels bundle CUDA. Only load a CUDA module if you're compiling custom CUDA extensions. For full diagnostic steps and reinstall commands, see PyTorch & GPU Setup -- Troubleshooting.
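The quick checks above can be collected in one script. This is a minimal sketch rather than the diagnostic procedure from PyTorch & GPU Setup; `cuda_report` is a hypothetical helper name. It only imports `torch` if the package is installed, so it also runs in an environment without PyTorch.

```python
import importlib.util
import shutil


def cuda_report():
    """Collect basic facts that help explain why CUDA is unavailable."""
    report = {
        # nvidia-smi is normally only usable on GPU nodes.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_cuda_build": None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch

        # torch.version.cuda is None for CPU-only builds of PyTorch.
        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    return report


# Usage:
# for key, value in cuda_report().items():
#     print(f"{key}: {value}")
```

If `torch_cuda_build` is `None`, the installed wheel is CPU-only and loading a CUDA module will not help. If it is set but `cuda_available` is `False`, the more likely cause is that the script is not running on a GPU node.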

CUDA Out of Memory¶

Problem: RuntimeError: CUDA out of memory

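Start by reducing the batch size. torch.cuda.empty_cache() releases cached memory that PyTorch no longer uses, but it cannot free memory held by live tensors, so it rarely fixes this error by itself. For a complete list of solutions (gradient accumulation, mixed precision, gradient checkpointing), see the CUDA Out of Memory section of PyTorch & GPU Setup.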
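Gradient accumulation, one of the options mentioned above, keeps the effective batch size while lowering the memory needed per step. The sketch below assumes the usual PyTorch training-loop interface (`model(inputs)`, `loss.backward()`, `optimizer.step()`, `optimizer.zero_grad()`); the function name `train_one_epoch` and its parameters are illustrative, not taken from the lab's code.

```python
def train_one_epoch(model, optimizer, loss_fn, batches, accum_steps=4):
    """Step the optimizer once per `accum_steps` micro-batches.

    Returns the number of optimizer steps taken.
    """
    optimizer.zero_grad()
    optimizer_steps = 0
    pending = 0  # micro-batches whose gradients have not been applied yet
    for inputs, targets in batches:
        # Divide so the accumulated gradient matches one large batch.
        loss = loss_fn(model(inputs), targets) / accum_steps
        loss.backward()  # gradients add up across micro-batches
        pending += 1
        if pending == accum_steps:
            optimizer.step()
            optimizer.zero_grad()
            optimizer_steps += 1
            pending = 0
    if pending:  # apply leftover gradients from a final partial group
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1
    return optimizer_steps
```

With a micro-batch of 16 and `accum_steps=4`, each optimizer step sees gradients from 64 samples while only 16 are resident on the GPU at a time.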

GPU Utilization Low¶

Problem: GPU usage < 50% during training

Increase num_workers in your DataLoader to match --cpus-per-task
Enable pin_memory=True
Use a larger batch size if memory allows
Profile to find the actual bottleneck — see PyTorch & GPU Setup — Slow Training
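The first two suggestions can be wired to the job's CPU allocation so the DataLoader settings follow the job script. A minimal sketch, assuming the job requests `--cpus-per-task` (Slurm then sets `SLURM_CPUS_PER_TASK`); `dataloader_kwargs` is a hypothetical helper name.

```python
import os


def dataloader_kwargs(fallback_workers=2):
    """Build DataLoader keyword arguments from the Slurm CPU allocation."""
    # Slurm sets SLURM_CPUS_PER_TASK when the job requests --cpus-per-task.
    workers = int(os.environ.get("SLURM_CPUS_PER_TASK", fallback_workers))
    return {
        "num_workers": workers,
        "pin_memory": True,  # page-locked host memory speeds up GPU copies
        "persistent_workers": workers > 0,  # keep workers alive across epochs
    }


# Usage inside a training script (dataset is defined elsewhere):
# loader = torch.utils.data.DataLoader(
#     dataset, batch_size=64, shuffle=True, **dataloader_kwargs()
# )
```

Whether this raises GPU utilization depends on where the bottleneck is, so profile before and after the change.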

Job Submission Issues¶

Job Pending Forever¶

Problem: Job stuck in PD (pending) state

Check reason:

squeue -u $USER
# The reason is shown in the last column, NODELIST(REASON)

Common reasons and solutions:

| Reason | Meaning | Solution |
|---|---|---|
| QOSMaxGRESPerUser | Too many GPU jobs running | Wait or cancel old jobs: scancel <job_id> |
| Resources | Requesting too many resources | Reduce resources requested |
| ReqNodeNotAvail | Maintenance window approaching | Reduce time limit or wait |
| Priority | Other jobs have higher priority | Wait in queue |
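When several jobs are pending, the reasons listed above can be looked up automatically. The sketch below parses the text produced by `squeue -u $USER -h -o "%i %T %r"` (job ID, state, reason, no header); the `ADVICE` mapping simply restates this section, and `explain_pending` is a hypothetical helper name.

```python
# Advice keyed by Slurm pending reason, restating the list in this section.
ADVICE = {
    "QOSMaxGRESPerUser": "Too many GPU jobs running; wait or cancel old jobs.",
    "Resources": "Requesting too many resources; reduce resources requested.",
    "ReqNodeNotAvail": "Maintenance window approaching; reduce time limit or wait.",
    "Priority": "Other jobs have higher priority; wait in queue.",
}


def explain_pending(squeue_output):
    """Map each pending job ID to advice.

    Expects the output of: squeue -u $USER -h -o "%i %T %r"
    """
    result = {}
    for line in squeue_output.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) < 3 or parts[1] != "PENDING":
            continue  # skip blank lines and jobs that are not pending
        job_id, _, reason = parts
        # Some reasons carry extra detail after a comma.
        key = reason.split(",")[0].strip()
        result[job_id] = ADVICE.get(key, f"No advice recorded for reason: {reason}")
    return result


# Usage:
# import subprocess, getpass
# text = subprocess.run(
#     ["squeue", "-u", getpass.getuser(), "-h", "-o", "%i %T %r"],
#     capture_output=True, text=True, check=True,
# ).stdout
# print(explain_pending(text))
```

Reasons that are not in the mapping are passed through unchanged so they can be looked up in the Slurm documentation.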

Job Fails Immediately¶

Problem: Job exits with error immediately

Solutions:

Check error logs

cat logs/job_<jobid>.err
tail -50 logs/job_<jobid>.out

Common causes:

Module not loaded:

# Add to job script
module load python/3.12

Environment not activated:

source .venv/bin/activate  # uv (recommended)
# or: source ~/venvs/myproject/bin/activate  # pip+venv

File not found:

# Use absolute paths
python ~/projects/myproject/train.py

Permission denied:

chmod +x script.sh

Test interactively first

srun -p debug --pty bash
# Run commands manually

Job Times Out¶

Problem: Job killed after time limit

Solutions:

Increase time limit

#SBATCH --time=08:00:00  # Instead of 02:00:00

Implement checkpointing

# Save model weights and the current epoch every 10 epochs
if epoch % 10 == 0:
    checkpoint = {'model': model.state_dict(), 'epoch': epoch}
    torch.save(checkpoint, 'checkpoint.pth')

Resume from checkpoint

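import os
import torch

start_epoch = 0  # default when no checkpoint exists
if os.path.exists('checkpoint.pth'):
    checkpoint = torch.load('checkpoint.pth')
    model.load_state_dict(checkpoint['model'])
    start_epoch = checkpoint['epoch'] + 1  # continue after the saved epoch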
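Periodic checkpoints can still lose several epochs of work if the job is killed between saves. One option is to track wall-clock time and save once the limit is close. The sketch below is an assumption-laden illustration: `TimeBudget` is a hypothetical class, the limit is passed in by hand to match `#SBATCH --time`, and the clock starts when the object is created rather than when the job starts.

```python
import time


class TimeBudget:
    """Track wall-clock time against a job's time limit."""

    def __init__(self, limit_seconds, margin_seconds=300, clock=time.monotonic):
        self._clock = clock
        # Stop early by `margin_seconds` to leave time for saving.
        self._deadline = clock() + limit_seconds - margin_seconds

    def nearly_exhausted(self):
        """Return True once less than the safety margin remains."""
        return self._clock() >= self._deadline


# Usage inside a training loop (save_checkpoint is a hypothetical helper):
# budget = TimeBudget(limit_seconds=8 * 3600)  # matches --time=08:00:00
# for epoch in range(start_epoch, num_epochs):
#     train(...)
#     if budget.nearly_exhausted():
#         save_checkpoint(epoch)
#         break
```

Choose a margin longer than one epoch plus the time needed to write the checkpoint, since the check only runs between epochs.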

File System Issues¶

Disk Quota Exceeded¶

Problem: Cannot write files

Disk quota exceeded

Solutions:

Check quota
```
quota -s
```
Find large directories
```
du -sh ~/* ~/.[!.]* 2>/dev/null | sort -hr | head -10  # includes hidden directories such as ~/.cache
```

Clean up

# Remove old virtual environments
rm -rf ~/venvs/old_project

# Clean pip cache
pip cache purge

# Clean conda cache
conda clean --all

# Remove old checkpoints
rm checkpoints/epoch_*.pth

# Clear Python cache
find . -type d -name __pycache__ -prune -exec rm -rf {} +  # -prune stops find from descending into deleted directories

Use scratch space

# Move large data to scratch
mv large_dataset/ /fs/scratch/PAS1234/$USER/

File Permission Denied¶

Problem: Cannot access file

Solutions:

Check permissions
```
ls -la file.txt
```

Fix permissions

chmod 644 file.txt      # Owner can read/write; group and others can read
chmod 755 directory/    # Owner has full access; group and others can list and enter

Check ownership

ls -l file.txt
# File should be owned by you

Data Transfer Issues¶

rsync/scp Fails¶

Problem: Transfer interrupted or failed

Solutions:

Use rsync with resume

rsync -avz --progress --partial source/ pitzer:~/dest/

Check disk space on destination
```
ssh pitzer
quota -s
```
Check network connection
```
ping pitzer.osc.edu
```

Use tmux for long transfers

tmux new -s transfer
rsync -avz source/ pitzer:~/dest/
# Ctrl+b, then d to detach

Slow File Transfer¶

Solutions:

Compress during transfer

rsync -avz source/ pitzer:~/dest/
# -z enables compression

Archive first, then transfer

tar -czf archive.tar.gz large_directory/
rsync -avz --progress archive.tar.gz pitzer:~/

Use multiple parallel transfers

# Split into parts and transfer separately
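One way to split a transfer into parts is to run one `rsync` per top-level entry of the source directory. The sketch below assumes `rsync` is installed locally and that `pitzer` is an SSH host alias as used elsewhere on this page; `build_rsync_commands` and `run_parallel` are hypothetical helper names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_rsync_commands(source_dir, dest):
    """Return one rsync command per top-level entry in source_dir."""
    entries = sorted(Path(source_dir).iterdir())
    # No trailing slash on the entry, so each one is recreated under dest.
    return [["rsync", "-az", "--partial", str(entry), dest] for entry in entries]


def run_parallel(commands, max_parallel=4):
    """Run the commands concurrently and return their exit codes in order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))


# Usage:
# commands = build_rsync_commands("large_directory", "pitzer:~/dest/")
# exit_codes = run_parallel(commands, max_parallel=4)
```

A nonzero exit code marks a part that should be retried. Keep `max_parallel` small, since many simultaneous SSH connections may be throttled.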

Python/PyTorch Issues¶

Import Error¶

Problem: ModuleNotFoundError: No module named 'torch'

Solutions:

Verify environment activated

which python
# Should point to venv, not system Python

Reinstall package
```
pip install torch
```

Check Python version

python --version
# Should match venv Python version
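The same checks can be run from inside Python, which is useful in a job script where `which python` output is easy to miss. A minimal sketch using only the standard library; `environment_summary` is a hypothetical helper name.

```python
import sys


def environment_summary():
    """Describe which interpreter is running and whether it is in a venv."""
    return {
        "executable": sys.executable,
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


print(environment_summary())
```

If `in_virtualenv` is `False` in a job log, the environment was not activated before the script started.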

Jupyter Kernel Issues¶

Problem: Jupyter can't find kernel

Solutions:

Install ipykernel
```
pip install ipykernel
```

Add environment as kernel

python -m ipykernel install --user --name=myproject

List kernels
```
jupyter kernelspec list
```

Performance Issues¶

For GPU performance troubleshooting (slow training, profiling, DataLoader optimization), see PyTorch & GPU Setup.

Out of Memory (RAM)¶

Problem: Job killed due to RAM

Solutions:

Request more memory
```
#SBATCH --mem=64G  # Instead of 32G
```

Reduce data in memory

# Don't load entire dataset at once
# Use generators or data loaders

Monitor memory usage
```
top -u $USER
```
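The "reduce data in memory" and "monitor memory usage" steps above can be sketched as follows, assuming a line-oriented text file and a Linux node (where `ru_maxrss` is reported in kilobytes). `iter_batches` and `peak_memory_mb` are hypothetical helper names.

```python
import resource


def iter_batches(path, batch_size=1000):
    """Yield lists of up to batch_size lines without reading the whole file."""
    batch = []
    with open(path) as handle:
        for line in handle:  # file objects are read lazily, line by line
            batch.append(line.rstrip("\n"))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final partial batch
        yield batch


def peak_memory_mb():
    """Peak resident memory of this process in MB (Linux units assumed)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


# Usage:
# for batch in iter_batches("data.txt", batch_size=10_000):
#     process(batch)  # process is defined elsewhere
# print(f"peak memory: {peak_memory_mb():.0f} MB")
```

Logging the peak at the end of a run gives a concrete number to compare against the `--mem` request.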

Getting More Help¶

OSC Support: Email oschelp@osc.edu or call (614) 292-9248
System Status: osc.edu/resources/system-status
Lab: Ask lab members on Slack/Teams, or consult your PI
See Useful Links for OSC portals and support contacts

Debug Checklist¶

When things go wrong:

For cluster details and resource limits, see Clusters Overview. For job script patterns, see Job Submission.