GPU Preprocessing (RAPIDS)

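Accelerate large-scale tabular data preprocessing on OSC GPUs using NVIDIA RAPIDS. For filtering, grouping, and feature engineering on large tables, it can run roughly 10-100x faster than CPU-based pandas/NumPy; the actual speedup depends on data size, column types, and the GPU.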

What Is RAPIDS?

RAPIDS is a suite of GPU-accelerated libraries whose APIs mirror familiar CPU libraries:

Library	CPU Equivalent	Purpose
cuDF	pandas	GPU DataFrames — filtering, grouping, joins
cuML	scikit-learn	GPU machine learning — preprocessing, clustering, regression
cuGraph	NetworkX	GPU graph analytics

RAPIDS is useful when your preprocessing pipeline is the bottleneck — large CSVs, millions of rows, complex feature engineering. If your data fits in memory and processes in seconds, stick with pandas.
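One way to check whether preprocessing really is the bottleneck is to time each pipeline stage before investing in a GPU environment. The sketch below uses only the standard library; `time_stages`, `share_of_total`, and the stage names are hypothetical stand-ins for your own preprocessing and training steps.

```python
import time
from typing import Callable, Dict


def time_stages(stages: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each stage once, in order, and return wall-clock seconds per stage."""
    timings = {}
    for name, fn in stages.items():
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    return timings


def share_of_total(timings: Dict[str, float]) -> Dict[str, float]:
    """Convert per-stage seconds into fractions of the total runtime."""
    total = sum(timings.values())
    if total == 0:
        return {name: 0.0 for name in timings}
    return {name: seconds / total for name, seconds in timings.items()}


if __name__ == "__main__":
    # Hypothetical stand-in stages; replace them with your real pipeline steps.
    timings = time_stages({
        "preprocess": lambda: sum(i * i for i in range(200_000)),
        "train": lambda: sum(range(50_000)),
    })
    for name, frac in share_of_total(timings).items():
        print(f"{name}: {timings[name]:.3f}s ({frac:.0%} of total)")
```

If preprocessing accounts for only a small share of the total, moving it to the GPU is unlikely to change end-to-end runtime much.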

Why a Separate Conda Environment?

This guide installs RAPIDS with conda rather than pip or uv. RAPIDS also publishes pip wheels (for example cudf-cu12), but they are not used here. The conda packages depend on conda-forge builds of CUDA libraries that can conflict with pip-installed PyTorch. For this reason:

RAPIDS preprocessing runs in a dedicated conda environment
Model training runs in your normal uv/venv environment with PyTorch
The two environments communicate through files (Parquet), not shared memory

This two-environment pattern is intentional — it keeps your training environment clean while giving you GPU-accelerated preprocessing when you need it.
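The file-based handoff can be made more robust by having the preprocessing job write a small manifest as its last step, so the training job can refuse to start on a partially written output directory. This is a minimal sketch using only the standard library; the `_MANIFEST.json` file name and both function names are assumptions for illustration, not part of RAPIDS or of this project's scripts.

```python
import json
from pathlib import Path
from typing import List

MANIFEST_NAME = "_MANIFEST.json"  # hypothetical marker file name


def write_manifest(output_dir: str) -> Path:
    """Call at the end of preprocessing: record every Parquet file written."""
    out = Path(output_dir)
    files = sorted(p.name for p in out.glob("*.parquet"))
    manifest = out / MANIFEST_NAME
    # Write to a temp file, then rename, so readers never see a partial manifest.
    tmp = manifest.with_name(manifest.name + ".tmp")
    tmp.write_text(json.dumps({"files": files}, indent=2))
    tmp.replace(manifest)
    return manifest


def read_manifest(data_dir: str) -> List[Path]:
    """Call at the start of training: return the Parquet paths, or fail loudly."""
    data = Path(data_dir)
    manifest = data / MANIFEST_NAME
    if not manifest.exists():
        raise FileNotFoundError(f"{manifest} missing: preprocessing may not have finished")
    names = json.loads(manifest.read_text())["files"]
    missing = [n for n in names if not (data / n).exists()]
    if missing:
        raise FileNotFoundError(f"listed in manifest but not on disk: {missing}")
    return [data / n for n in names]
```

Neither function touches the Parquet contents, so both environments can use them without sharing any DataFrame library.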

Setup

Create a dedicated conda environment for RAPIDS:

# Load conda
module load python/3.12

# Create RAPIDS environment (this takes a few minutes)
conda create -n gnn-rapids \
    -c rapidsai -c conda-forge -c nvidia \
    rapids=24.12 cuda-version=12.6 python=3.12 \
    -y

# Verify installation
conda activate gnn-rapids
python -c "import cudf; print(f'cuDF version: {cudf.__version__}')"
conda deactivate

Do not install PyTorch in this environment

The RAPIDS conda environment is for preprocessing only. Keep PyTorch in your separate uv/venv environment to avoid dependency conflicts.

Soft-Import Pattern

To write code that works with or without RAPIDS, use a RAPIDS_AVAILABLE flag:

try:
    import cudf
    import cuml
    RAPIDS_AVAILABLE = True
except ImportError:
    RAPIDS_AVAILABLE = False

def load_data(path: str):
    """Load data using cuDF if available, otherwise pandas."""
    if RAPIDS_AVAILABLE:
        return cudf.read_parquet(path)
    else:
        import pandas as pd
        return pd.read_parquet(path)

This lets the same script run on both GPU nodes (with RAPIDS) and login nodes (without RAPIDS, falling back to pandas).
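A variation on the flag above checks whether the package is installed without importing it, which avoids paying the import cost just to decide which backend to use. This sketch relies on `importlib.util.find_spec` from the standard library; `pick_backend` and `load_backend` are hypothetical helper names.

```python
import importlib
import importlib.util
from typing import Sequence


def pick_backend(candidates: Sequence[str] = ("cudf", "pandas")) -> str:
    """Return the first candidate top-level module name that is installed."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    raise ImportError(f"none of {list(candidates)} is installed")


def load_backend(candidates: Sequence[str] = ("cudf", "pandas")):
    """Import and return the chosen DataFrame module."""
    return importlib.import_module(pick_backend(candidates))
```

A caller could then write `xp = load_backend()` followed by `xp.read_parquet(path)`, since both cuDF and pandas expose `read_parquet`. Note that `find_spec` only reports that a package is installed; the import itself can still fail on a node without a usable GPU.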

SLURM Script for GPU Preprocessing

scripts/preprocess_gpu.sh:

#!/bin/bash
#SBATCH --job-name=rapids_preprocess
#SBATCH --account=PAS1234
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=64G
#SBATCH --time=01:00:00
# The logs/ directory must exist before you submit; SLURM does not create it
#SBATCH --output=logs/preprocess_%j.out

echo "Job started at: $(date)"
echo "Node: $(hostname)"

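# Load a CUDA module. "12.x" is a placeholder: pick a real version from `module avail cuda`.
# (The conda environment typically ships its own CUDA runtime libraries; the node's NVIDIA driver must support them.)
module load cuda/12.x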

# Activate RAPIDS conda environment
module load python/3.12
conda activate gnn-rapids

# Run preprocessing only
python scripts/preprocess.py \
    --input data/raw/ \
    --output data/processed/ \
    --preprocess-only

conda deactivate
echo "Job finished at: $(date)"
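The job script assumes that scripts/preprocess.py accepts --input, --output, and --preprocess-only. That script is not shown here, so the following is only a sketch of what its argument parsing might look like; the `build_parser` name and the help strings are assumptions.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Argument parser matching the flags used in the job script (sketch)."""
    parser = argparse.ArgumentParser(description="Tabular preprocessing entry point")
    parser.add_argument("--input", type=Path, required=True,
                        help="directory containing the raw input files")
    parser.add_argument("--output", type=Path, required=True,
                        help="directory where Parquet output is written")
    parser.add_argument("--preprocess-only", action="store_true",
                        help="stop after writing Parquet; skip any later stages")
    return parser


def parse_cli(argv=None) -> argparse.Namespace:
    """Parse argv (defaults to sys.argv[1:]) into a namespace."""
    return build_parser().parse_args(argv)
```

argparse turns `--preprocess-only` into the attribute `args.preprocess_only`, which defaults to `False` when the flag is omitted.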

After preprocessing completes, run training in your normal environment:

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --account=PAS1234
#SBATCH --partition=gpu
#SBATCH --gpus-per-node=1

# Normal uv/venv environment — reads the Parquet files RAPIDS produced
source .venv/bin/activate
python scripts/train.py --data data/processed/
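To avoid submitting the training job by hand, SLURM can hold it until preprocessing succeeds by using `sbatch --dependency=afterok:<jobid>`. The sketch below builds those commands from Python. `scripts/train.sh` is an assumed file name for the training script above, all function names are hypothetical, and nothing is submitted unless you call `submit_chain` on a machine where `sbatch` is available.

```python
import subprocess
from typing import List, Optional


def sbatch_command(script: str, after_ok: Optional[str] = None) -> List[str]:
    """Build an sbatch command; --parsable makes sbatch print just the job ID."""
    cmd = ["sbatch", "--parsable"]
    if after_ok is not None:
        # Start only if the given job completed successfully (exit code 0).
        cmd.append(f"--dependency=afterok:{after_ok}")
    cmd.append(script)
    return cmd


def parse_job_id(stdout: str) -> str:
    """--parsable prints 'jobid' or 'jobid;cluster'; keep only the job ID."""
    return stdout.strip().split(";")[0]


def submit_chain(preprocess_script: str, train_script: str) -> str:
    """Submit preprocessing, then training gated on its success.

    Returns the training job ID.
    """
    first = subprocess.run(sbatch_command(preprocess_script),
                           check=True, capture_output=True, text=True)
    pre_id = parse_job_id(first.stdout)
    second = subprocess.run(sbatch_command(train_script, after_ok=pre_id),
                            check=True, capture_output=True, text=True)
    return parse_job_id(second.stdout)
```

For example, `submit_chain("scripts/preprocess_gpu.sh", "scripts/train.sh")` would queue both jobs at once while leaving the training job pending until preprocessing exits cleanly.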

When NOT to Use RAPIDS

Small datasets (< 1M rows) — pandas is fast enough, and the conda setup overhead isn't worth it
Graph neural network training — use PyTorch Geometric in your uv/venv environment instead
Anything that needs PyTorch — RAPIDS and PyTorch should live in separate environments
Interactive exploration — use pandas in a Jupyter notebook; RAPIDS is for batch preprocessing

Next Steps

Environment Management — uv, venv, and module management on OSC
PyTorch & GPU Setup — Setting up your training environment
Pipeline Orchestration — Ray pipelines that chain preprocessing → training
DuckDB Analytics Layer — Querying preprocessed results with SQL