Troubleshooting Guide
Common issues and solutions for working on OSC.
flowchart TD
A{What's the problem?} --> B["Connection"]
A --> H["VS Code Remote"]
A --> C["Module & Environment"]
A --> D["GPU / CUDA"]
A --> E["Job Submission"]
A --> F["File System"]
A --> I["Data Transfer"]
A --> J["Python / PyTorch"]
A --> G["Performance"]
Jump to: Connection | VS Code Remote | Modules & Env | GPU | Jobs | Files | Data Transfer | Python/PyTorch | Performance
Connection Issues
SSH Connection Failed
Problem: Cannot connect to OSC
Solutions:
- Check that you are on the OSU network or VPN
- Verify the hostname
- Check system status
  - Visit: https://www.osc.edu/resources/system-status
- Try an alternative cluster
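The checks above can be sketched as a small shell helper. The cluster hostnames are assumptions for illustration (substitute the login hosts you actually use), and the port probe relies on bash's /dev/tcp redirection and the coreutils timeout command.

```shell
# Hostnames are assumptions; replace them with the login hosts you use.
OSC_HOSTS="pitzer.osc.edu ascend.osc.edu cardinal.osc.edu"

# port_open HOST [PORT]: succeed if a TCP connection opens within 5 seconds.
port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "${2:-22}" 2>/dev/null
}

# Probe each candidate host and report which ones accept connections on port 22.
check_osc_hosts() {
    for host in $OSC_HOSTS; do
        if port_open "$host" 22; then
            echo "$host: reachable"
        else
            echo "$host: NOT reachable (check VPN, hostname, and system status)"
        fi
    done
}
```

Running check_osc_hosts from your laptop shows whether the problem is specific to one cluster or affects every host, which usually points to VPN or network trouble.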
SSH Key Authentication Failed
Problem: Permission denied with SSH key
Solutions:
- Verify key is added to OSC
  - Log into my.osc.edu
  - Check "SSH Public Keys"
- Wait 10 minutes after adding
- Check key permissions
- Verify SSH config
- Test with verbose output
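A minimal sketch of the permission check, assuming your keys live in ~/.ssh and follow the usual id_* naming (both are assumptions; adjust if you store keys elsewhere). It applies the permissions OpenSSH expects: owner-only for the directory, private keys, and config.

```shell
# fix_ssh_perms [DIR]: tighten permissions on an SSH directory.
# DIR defaults to ~/.ssh; key names matching id_* are an assumption.
fix_ssh_perms() {
    dir="${1:-$HOME/.ssh}"
    chmod 700 "$dir" || return 1
    for f in "$dir"/id_* "$dir"/config; do
        [ -e "$f" ] || continue
        case "$f" in
            *.pub) chmod 644 "$f" ;;   # public keys may be world-readable
            *)     chmod 600 "$f" ;;   # private keys and config: owner only
        esac
    done
}
```

After running fix_ssh_perms, `ssh -v <username>@<host>` prints each authentication step, and `ssh -G <host>` prints the configuration ssh resolves for that host, which helps confirm that the expected key file is being offered.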
VS Code Remote Connection Hangs
Problem: "Setting up SSH Host..." hangs
See VS Code Remote-SSH Issues below for detailed solutions covering connection timeouts, extension host crashes, slow performance, and ProxyCommand issues.
VS Code Remote-SSH Issues
Detailed troubleshooting for VS Code Remote-SSH connections to OSC. For general SSH issues, see Connection Issues above. For Remote-SSH setup, see Remote Development.
Connection Timeout
Problem: "Setting up SSH Host..." hangs or times out
Solutions:
- Verify SSH works outside VS Code first
- Delete the remote VS Code server
- Check disk quota: VS Code Server needs ~500 MB
- Restart VS Code completely: close all windows, then reopen
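The server reset can be sketched as follows. It assumes the server lives in ~/.vscode-server, the default Remote-SSH install location; the VSCODE_SERVER_DIR variable is a hypothetical override introduced here. Run it on the cluster over a plain SSH session, with all VS Code windows closed.

```shell
# Default install location used by Remote-SSH on the remote host.
VSCODE_SERVER_DIR="${VSCODE_SERVER_DIR:-$HOME/.vscode-server}"

# Print the server directory's size in MB (0 if it does not exist).
vscode_server_size_mb() {
    if [ -d "$VSCODE_SERVER_DIR" ]; then
        du -sm "$VSCODE_SERVER_DIR" | cut -f1
    else
        echo 0
    fi
}

# Remove the server so VS Code reinstalls it on the next connection.
reset_vscode_server() {
    [ -n "$VSCODE_SERVER_DIR" ] || return 1
    rm -rf "$VSCODE_SERVER_DIR"
}
```

Checking vscode_server_size_mb before and after the reset gives a rough idea of how much home-directory quota the server was using.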
Extension Host Crash
Problem: "The VS Code Server failed to start" or extension host keeps crashing
Solutions:
- Clear the extensions directory
- Reduce installed remote extensions: each extension increases memory usage. Only install essentials remotely (Python, Pylance, Jupyter)
- Check disk quota: extensions can fill your home directory
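As a sketch, the two helpers below list remote extensions by size and clear the directory. They assume the default location ~/.vscode-server/extensions; EXT_DIR is a hypothetical override introduced here.

```shell
# Where Remote-SSH keeps remote extensions by default.
EXT_DIR="${EXT_DIR:-$HOME/.vscode-server/extensions}"

# List installed remote extensions by size in KB, largest last.
list_remote_extensions() {
    [ -d "$EXT_DIR" ] || { echo "no extensions directory at $EXT_DIR"; return 1; }
    find "$EXT_DIR" -mindepth 1 -maxdepth 1 -type d -exec du -sk {} + | sort -n
}

# Remove all remote extensions; reinstall only the essentials afterwards.
clear_remote_extensions() {
    [ -d "$EXT_DIR" ] || return 0
    rm -rf "$EXT_DIR"
}
```

Listing first makes it easier to see whether a single large extension accounts for most of the space before removing everything.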
Slow Performance
Problem: Typing lag, slow file loading, or frequent disconnects
Solutions:
- Exclude large directories from the file watcher by listing them under files.watcherExclude in .vscode/settings.json
- Disable unused remote extensions: open Extensions (Ctrl+Shift+X), filter to those installed on the SSH host, and disable anything not needed
- Work on the native filesystem: avoid opening folders on /fs/scratch/ if possible; prefer ~/projects/
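One way to set up the watcher exclusions is sketched below. files.watcherExclude is a standard VS Code setting, but the directory patterns are assumptions; list whichever large folders your project actually contains. The helper refuses to overwrite an existing settings file.

```shell
# write_watcher_excludes PROJECT_DIR
# Creates PROJECT_DIR/.vscode/settings.json unless one already exists.
# The excluded directory names are assumptions; list your own large folders.
write_watcher_excludes() {
    settings="$1/.vscode/settings.json"
    if [ -e "$settings" ]; then
        echo "$settings already exists; merge the keys by hand" >&2
        return 1
    fi
    mkdir -p "$1/.vscode"
    cat > "$settings" <<'EOF'
{
  "files.watcherExclude": {
    "**/.git/objects/**": true,
    "**/__pycache__/**": true,
    "**/data/**": true,
    "**/checkpoints/**": true
  }
}
EOF
}
```

For example, `write_watcher_excludes ~/projects/myproject` (a hypothetical path) writes the file for one project; reload the VS Code window afterwards so the watcher picks up the change.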
ProxyCommand Issues
Problem: VS Code fails when SSH config uses ProxyCommand or ProxyJump
Solutions:
- Use ProxyJump instead of ProxyCommand (VS Code handles ProxyJump better)
- Set RemoteCommand to none in your SSH config for the host entry
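For illustration, the helper below prints an SSH config stanza that combines both suggestions. Every name in it (the host alias, the login hostname, the username, and the jump host) is a placeholder assumption, not a value from this guide.

```shell
# Print an example ~/.ssh/config stanza; every name in it is a placeholder.
print_ssh_stanza() {
    cat <<'EOF'
Host osc-example
    HostName pitzer.osc.edu
    User your_osc_username
    ProxyJump jump.example.edu
    RemoteCommand none
EOF
}
```

Review and edit the output before appending it to ~/.ssh/config, then test the alias with plain ssh before trying it in VS Code.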
Workspace Trust
Problem: VS Code asks about workspace trust on every connection
Solution: When prompted, choose "Trust" for your home directory or project folder. You can also manage trusted folders in VS Code by running "Workspaces: Manage Workspace Trust" from the Command Palette.
Module and Environment Issues
Module Not Found
Problem: module: command not found
Solution:
# Module system not initialized
# Add to ~/.bashrc:
source /etc/profile.d/modules.sh
# Or reinitialize shell
bash --login
Module Load Fails
Problem: Module not found
Solutions:
- Search for correct name
- Check dependencies
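A minimal sketch of the search step, assuming the cluster uses Lmod, whose module spider subcommand searches all module trees. The wrapper name search_module is hypothetical.

```shell
# search_module NAME[/VERSION]
# With a bare NAME, Lmod lists the available versions; with NAME/VERSION it
# also reports which modules must be loaded first.
search_module() {
    if ! type module >/dev/null 2>&1; then
        echo "module command not available in this shell" >&2
        return 1
    fi
    module spider "$1"
}
```

For example, `search_module cuda` lists versions, and `search_module cuda/<version>` with one of the listed versions shows the prerequisites to load before it.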
Virtual Environment Won't Activate
Problem: Environment doesn't activate
Solutions:
- Verify path exists
- Recreate if corrupted
- Check Python module loaded (pip+venv only)
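The first two checks can be sketched as below, assuming a standard venv layout (bin/activate and bin/python). An environment whose bin/python points at an interpreter that no longer exists fails the check, which is a common sign that it needs to be recreated.

```shell
# venv_ok DIR: succeed only if DIR looks like a usable virtual environment.
# A dangling bin/python symlink fails the -x test.
venv_ok() {
    [ -f "$1/bin/activate" ] && [ -x "$1/bin/python" ]
}

# recreate_venv DIR: delete DIR and build a fresh environment there.
# Load your Python module first so python3 is the interpreter you expect.
recreate_venv() {
    [ -n "$1" ] || return 1
    rm -rf "$1" && python3 -m venv "$1"
}
```

A typical use, with a hypothetical path: `venv_ok ~/venvs/myproject || recreate_venv ~/venvs/myproject`, followed by `source ~/venvs/myproject/bin/activate`. Recreating the environment discards its installed packages, so reinstall from your requirements file afterwards.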
Package Installation Fails
Problem: pip install fails
Solutions:
- Update pip
- Check disk quota
- Install to user directory
- Clear pip cache
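The pip steps above might be combined as in this sketch. The function name pip_retry is hypothetical. Note one assumption made explicit in the code: inside an activated virtual environment pip rejects --user, so the flag is only used when VIRTUAL_ENV is unset.

```shell
# pip_retry PACKAGE: upgrade pip, drop cached downloads, then install PACKAGE.
pip_retry() {
    python -m pip install --upgrade pip
    python -m pip cache purge
    if [ -n "${VIRTUAL_ENV:-}" ]; then
        python -m pip install "$1"          # venv active: --user is rejected
    else
        python -m pip install --user "$1"   # no venv: install under ~/.local
    fi
}
```

If the install still fails, check your disk quota before retrying, since a full home directory produces similar errors.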
GPU Issues
CUDA Not Available
Problem: torch.cuda.is_available() returns False
Quick checks: verify you're on a GPU node (nvidia-smi) and check torch.cuda.is_available(). If you installed PyTorch from PyPI, you do not need module load cuda -- PyPI wheels bundle CUDA. Only load a CUDA module if you're compiling custom CUDA extensions. For full diagnostic steps and reinstall commands, see PyTorch & GPU Setup -- Troubleshooting.
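The quick checks can be run together with a helper like the one below. It is a sketch, not the diagnostic procedure from the PyTorch & GPU Setup page; the function name gpu_quick_check is hypothetical, and it assumes python resolves to the environment where PyTorch is installed.

```shell
# gpu_quick_check: confirm a GPU is visible, then ask PyTorch what it sees.
gpu_quick_check() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: this is probably not a GPU node"
        return 1
    fi
    nvidia-smi --query-gpu=name,memory.total --format=csv
    # Prints the PyTorch version, the CUDA version it was built with,
    # and whether it can reach a GPU.
    python -c 'import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())'
}
```

If the second value printed by PyTorch is None, the installed build is CPU-only and needs to be reinstalled with a CUDA-enabled wheel.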
CUDA Out of Memory
Problem: RuntimeError: CUDA out of memory
Start by reducing batch size or clearing the cache with torch.cuda.empty_cache(). For a complete list of solutions (gradient accumulation, mixed precision, gradient checkpointing), see PyTorch & GPU Setup — CUDA Out of Memory.
GPU Utilization Low
Problem: GPU usage < 50% during training
- Increase num_workers in your DataLoader to match --cpus-per-task
- Enable pin_memory=True
- Use a larger batch size if memory allows
- Profile to find the actual bottleneck; see PyTorch & GPU Setup -- Slow Training
Job Submission Issues
Job Pending Forever
Problem: Job stuck in PD (pending) state
Check the reason Slurm reports in the NODELIST(REASON) column of squeue -u $USER.
Common reasons and solutions:
- QOSMaxGRESPerUser
  - Too many GPU jobs running
  - Wait or cancel old jobs: scancel <job_id>
- Resources
  - Requesting too many resources
  - Reduce resources requested
- ReqNodeNotAvail
  - Maintenance window approaching
  - Reduce time limit or wait
- Priority
  - Other jobs have higher priority
  - Wait in queue
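To see the pending reason without scanning the full queue listing, something like the following sketch can help. It uses standard squeue options; the function names and column widths are arbitrary choices.

```shell
# List your pending jobs with the scheduler's reason in the last column.
pending_reasons() {
    squeue -u "${USER:-$(id -un)}" -t PENDING -o "%.12i %.24j %.10T %r"
}

# Show Slurm's estimated start time for one job.
job_start_estimate() {
    squeue -j "$1" --start
}
```

The reason printed by pending_reasons maps onto the list above; job_start_estimate <job_id> gives Slurm's current estimate of when the job will begin, which can change as other jobs finish or are submitted.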
Job Fails Immediately
Problem: Job exits with error immediately
Solutions:
- Check error logs
- Common causes:
  - Module not loaded
  - Environment not activated
  - File not found
  - Permission denied
- Test interactively first
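A sketch for the log check: the helper below prints the job's accounting record and the tail of its log. It assumes the log uses Slurm's default name slurm-<jobid>.out in the current directory unless you pass a path; the function name is hypothetical.

```shell
# job_postmortem JOBID [LOGFILE]
# LOGFILE defaults to slurm-JOBID.out, Slurm's default output name; pass a
# path if your script sets --output or --error.
job_postmortem() {
    log="${2:-slurm-$1.out}"
    sacct -j "$1" --format=JobID,JobName,State,ExitCode,Elapsed
    if [ -f "$log" ]; then
        echo "--- last 20 lines of $log ---"
        tail -n 20 "$log"
    else
        echo "log file $log not found"
    fi
}
```

The last lines of the log usually name the cause directly (a missing module, an inactive environment, a wrong path, or a permission error). To test interactively, start a session with salloc or srun --pty bash plus your usual account and resource flags, then run the script's commands one at a time.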
Job Times Out
Problem: Job killed after time limit
Solutions:
- Increase time limit
- Implement checkpointing
- Resume from checkpoint
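The resume step might look like the sketch below inside a job script. The script name train.py, its --resume flag, and the .pt checkpoint extension are all hypothetical; adapt them to however your training code saves and loads state. The time limit itself is raised with the standard #SBATCH --time=HH:MM:SS directive.

```shell
# latest_checkpoint DIR: print the newest *.pt file in DIR, or nothing.
latest_checkpoint() {
    ls -t "$1"/*.pt 2>/dev/null | head -n 1
}

# build_train_command DIR: print the command the job script would run.
# train.py and its --resume flag are hypothetical.
build_train_command() {
    ckpt=$(latest_checkpoint "$1")
    if [ -n "$ckpt" ]; then
        echo "python train.py --resume $ckpt"
    else
        echo "python train.py"
    fi
}
```

Because the command falls back to a fresh start when no checkpoint exists, the same job script can be resubmitted unchanged after each timeout.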
File System Issues
Disk Quota Exceeded
Problem: Cannot write files
Solutions:
- Check quota
- Find large directories
- Clean up
- Use scratch space
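To find what is filling a directory, a helper along these lines can be used. It is a sketch built on find, du, and sort; the function name is hypothetical.

```shell
# largest_entries DIR [N]: print the N largest items directly under DIR
# (sizes in KB, largest last). Hidden files and directories are included.
largest_entries() {
    find "$1" -mindepth 1 -maxdepth 1 -exec du -sk {} + | sort -n | tail -n "${2:-10}"
}
```

Running `largest_entries ~ 10` shows the ten largest items in your home directory, including hidden ones such as caches. Large datasets and checkpoints found this way are candidates for moving to scratch space.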
File Permission Denied
Problem: Cannot access file
Solutions:
- Check permissions
- Fix permissions
- Check ownership
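A small sketch of the three checks, using GNU stat, id, and chmod. The function names are hypothetical, and granting group read access is only one possible fix; choose the permissions that match how your lab shares data.

```shell
# show_access PATH: print owner:group, octal mode, and name, then your
# own user and group membership for comparison.
show_access() {
    stat -c '%U:%G %a %n' "$1"
    id
}

# grant_group_read PATH: let group members read a file or directory tree
# you own. The capital X adds execute only to directories (and to files
# that are already executable), so directories become traversable.
grant_group_read() {
    chmod -R g+rX "$1"
}
```

If show_access reports an owner other than you, chmod will fail; in that case ask the owner (or your PI) to adjust the permissions.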
Data Transfer Issues
rsync/scp Fails
Problem: Transfer interrupted or failed
Solutions:
- Use rsync with resume
- Check disk space on destination
- Check network connection
- Use tmux for long transfers
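A sketch of a resumable transfer, using standard rsync and df options. The function names are hypothetical.

```shell
# resumable_copy SRC DEST: --partial keeps partially transferred files so a
# rerun can reuse them instead of starting each file from scratch.
resumable_copy() {
    rsync -av --partial --progress "$1" "$2"
}

# free_space PATH: show free space on the filesystem that holds PATH.
free_space() {
    df -h "$1"
}
```

For long transfers, start a tmux session first (`tmux new -s transfer`), run resumable_copy inside it, detach with Ctrl-b d, and reattach later with `tmux attach -t transfer`. The session name is arbitrary. If the transfer is interrupted, rerunning the same command picks up where it left off.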
Slow File Transfer
Solutions:
- Compress during transfer
- Archive first, then transfer
- Use multiple parallel transfers
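The archive-first approach can be sketched with tar as below; the function name is hypothetical. Bundling many small files into one archive avoids per-file overhead during the transfer.

```shell
# pack_dir DIR ARCHIVE: bundle DIR into one gzip-compressed tarball.
# Paths inside the archive are relative to DIR's parent directory.
pack_dir() {
    tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}
```

After transferring the archive, unpack it on the destination with `tar -xzf <archive>`. For compression during the transfer itself, rsync's -z flag compresses data in flight; it helps most with text-like data and little with files that are already compressed.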
Python/PyTorch Issues
Import Error
Problem: ModuleNotFoundError: No module named 'torch'
Solutions:
- Verify environment activated
- Reinstall package
- Check Python version
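Most import errors come from running a different interpreter than the one the package was installed into. The sketch below checks which python3 is active and whether it can locate a given top-level module; the function names are hypothetical.

```shell
# can_import MODULE: succeed if the active python3 can locate MODULE.
can_import() {
    python3 -c 'import importlib.util, sys
sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)' "$1"
}

# show_interpreter: print the path and version of the active python3.
show_interpreter() {
    python3 -c 'import sys; print(sys.executable); print(sys.version.split()[0])'
}
```

If show_interpreter prints a system path rather than one inside your environment, the environment is not activated. If the path is right but `can_import torch` fails, reinstall the package in that environment.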
Jupyter Kernel Issues
Problem: Jupyter can't find kernel
Solutions:
- Install ipykernel
- Add environment as kernel
- List kernels
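The three steps can be sketched as follows, assuming the target environment is already activated so that python refers to it. The function names and the display-name format are arbitrary choices.

```shell
# register_kernel NAME: install ipykernel into the active environment and
# register that environment as a Jupyter kernel called NAME.
register_kernel() {
    python -m pip install ipykernel
    python -m ipykernel install --user --name "$1" --display-name "Python ($1)"
}

# list_kernels: show every kernel Jupyter can currently see.
list_kernels() {
    jupyter kernelspec list
}
```

After `register_kernel myenv` (a hypothetical name), the kernel should appear in list_kernels and in the Jupyter kernel picker once the page is refreshed.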
Performance Issues
For GPU performance troubleshooting (slow training, profiling, DataLoader optimization), see PyTorch & GPU Setup.
Out of Memory (RAM)
Problem: Job killed for exceeding its memory (RAM) request
Solutions:
- Request more memory
- Reduce data in memory
- Monitor memory usage
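To decide how much memory to request, it helps to compare what a finished job asked for with what it actually used. The sketch below wraps the standard sacct fields ReqMem and MaxRSS; the function name is hypothetical.

```shell
# memory_report JOBID: show requested memory (ReqMem) next to the peak
# resident memory recorded for each job step (MaxRSS).
memory_report() {
    sacct -j "$1" --format=JobID,State,ReqMem,MaxRSS,Elapsed
}
```

If MaxRSS is close to ReqMem, or the job state shows an out-of-memory failure, raise the request in the job script with the #SBATCH --mem directive (for example --mem=64G, an illustrative value) or reduce how much data the job holds in memory at once.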
Getting More Help
- OSC Support: Email oschelp@osc.edu or call (614) 292-9248
- System Status: osc.edu/resources/system-status
- Lab: Ask lab members on Slack/Teams, or consult your PI
- See Useful Links for OSC portals and support contacts
Debug Checklist
When things go wrong:
- Check system status
- Verify SSH connection works
- Confirm modules loaded
- Verify environment activated
- Check disk quota
- Review error logs
- Test interactively before batch jobs
- Search error message online
- Contact OSC support if needed
For cluster details and resource limits, see Clusters Overview. For job script patterns, see Job Submission.