r system-level dependencies and Conda for Python packages.
Best Practice: Always purge modules at the start of your script to ensure a clean state, then load only what is required.
#!/bin/bash
# setup_environment.sh
# Reset to base state
module purge
# Load system compilers and CUDA toolkit
# Specific versions ensure reproducibility
module load gcc/11.3.0
module load cuda/11.8.0
module load openmpi/4.1.4
# Initialize Conda for Python isolation
eval "$(conda shell.bash hook)"
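# Create or activate project environment
# Match the environment name exactly; a plain substring match would also accept e.g. ai_project_v10
if ! conda env list | grep -qE '^ai_project_v1[[:space:]]'; then
    conda create -n ai_project_v1 python=3.10 -y
fi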
conda activate ai_project_v1
# Install framework dependencies
# Ensure PyTorch matches the loaded CUDA version
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate deepspeed
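The comment about matching PyTorch to the loaded CUDA version can be made mechanical. The sketch below derives the wheel tag (such as cu118) from a module name; `cuda_wheel_tag` is a hypothetical helper, and it assumes modules are named `cuda/<major>.<minor>[.<patch>]` as in the script above. Not every CUDA version has a matching PyTorch wheel index, so treat the result as a starting point.

```shell
#!/bin/bash
# Hypothetical helper: map a module name such as "cuda/11.8.0" to a
# PyTorch wheel tag such as "cu118". Assumes the cuda/<major>.<minor> naming.
cuda_wheel_tag() {
    local version="${1#*/}"    # strip the "cuda/" prefix
    if [[ "$version" =~ ^([0-9]+)\.([0-9]+) ]]; then
        printf 'cu%s%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
        echo "cannot parse a CUDA version from '$1'" >&2
        return 1
    fi
}

# Example use (commented out so the snippet has no side effects):
# pip install torch --index-url "https://download.pytorch.org/whl/$(cuda_wheel_tag cuda/11.8.0)"
```

Keeping the module name in one variable and deriving the tag from it means a later CUDA upgrade touches a single line.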
2. Declarative Job Submission with SLURM
SLURM (Simple Linux Utility for Resource Management) is the industry standard scheduler. You do not "run" code; you submit a job description that SLURM matches against available resources.
Architecture Decision: Use batch scripts (sbatch) for reproducibility and monitoring. Reserve interactive sessions (srun) only for debugging.
#!/bin/bash
# submit_experiment.sh
#SBATCH --job-name=llm_finetune_exp
#SBATCH --partition=accel
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:8
#SBATCH --time=48:00:00
#SBATCH --mem=0
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --account=project_allocation_id
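# SLURM Variables Mapping
# SLURM_* variables are injected by the scheduler
# World size is nodes x GPUs per node (2 x 8 = 16 here); torchrun sets WORLD_SIZE for its workers itself, so this export is informational
export WORLD_SIZE=$((SLURM_NNODES * SLURM_GPUS_ON_NODE))
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)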
# Load environment
source setup_environment.sh
# Move to working directory
cd "$SLURM_SUBMIT_DIR"
# Execute distributed training
# srun starts one torchrun launcher per node; the c10d rendezvous assigns node ranks,
# so no per-node --node_rank is needed
srun --cpus-per-task="$SLURM_CPUS_PER_TASK" torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node="$SLURM_GPUS_ON_NODE" \
    --rdzv_id="$SLURM_JOB_ID" \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:29500" \
    train_llm.py \
    --config configs/deepseek_7b.yaml
Rationale:
--gres=gpu:8: Requests 8 GPUs per node, which matches the common 8-GPU node layout. Adjust the count to your cluster's hardware.
--mem=0: Requests all available memory on the node. Use specific values (e.g., --mem=256G) if sharing nodes is allowed, though AI workloads typically require exclusive nodes.
--output with %j: Uses the job ID for unique log files, preventing overwrites.
torchrun integration: The master address is derived from the SLURM node list, so the distributed setup needs no manually configured IPs.
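The sizing behind the directives above is simple arithmetic, sketched here for clarity. Both helper names are hypothetical; the numbers come from the job script (2 nodes, 8 GPUs per node, 32 CPUs per task).

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not part of SLURM or PyTorch.

# Total number of training processes: one per GPU across all nodes.
compute_world_size() {
    local nnodes="$1" gpus_per_node="$2"
    echo $(( nnodes * gpus_per_node ))
}

# CPU cores available to each GPU process, e.g. for data loader workers.
cpus_per_gpu() {
    local cpus_per_task="$1" gpus_per_node="$2"
    echo $(( cpus_per_task / gpus_per_node ))
}

compute_world_size 2 8    # prints 16
cpus_per_gpu 32 8         # prints 4
```

With 4 cores per GPU process, a data loader configured with more than 4 workers per process would oversubscribe the allocation.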
3. Containerized Workflows with Apptainer
For maximum reproducibility and to bypass module conflicts, HPC environments often support Apptainer (formerly Singularity). Unlike Docker, Apptainer runs without root privileges and integrates seamlessly with the host file system and GPU drivers.
# Build container locally or on a build node
# The image file is named after the CUDA version the base image actually ships (11.7)
apptainer build pytorch_2.0_cuda117.sif docker://pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
# Execute within SLURM script
apptainer exec --nv \
    --bind /scratch:/data \
    pytorch_2.0_cuda117.sif \
    python train_llm.py
Key Flag: --nv mounts NVIDIA drivers from the host, enabling GPU acceleration inside the container.
4. Storage Hierarchy Optimization
HPC storage is tiered. Misusing tiers causes performance degradation.
$HOME: Backed up, quota-limited, slow I/O. Use only for scripts and configs.
$SCRATCH: High-performance, non-backed up, large capacity. Use for datasets and model checkpoints.
$TMPDIR: Node-local NVMe. Use for temporary I/O during job execution.
Implementation: Symlink datasets to scratch and use node-local storage for active data loading.
# In job script
# Copy active dataset to node-local NVMe for maximum I/O throughput
cp -r /scratch/project/datasets/training_data "$TMPDIR/"
# Run training reading from $TMPDIR
python train_llm.py --data_dir="$TMPDIR/training_data"
# Sync results back to scratch
rsync -av checkpoints/ /scratch/project/results/
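Staging can fail when node-local storage is missing or full, so a job script may want a fallback to the shared copy. The following is a minimal sketch; `stage_dataset` is a hypothetical helper, and it assumes a plain `cp -r` is acceptable for the dataset size.

```shell
#!/bin/bash
# Hypothetical helper: copy a dataset to node-local storage and print the
# directory the training script should read from. Falls back to the
# original location when no usable local directory exists or the copy fails.
stage_dataset() {
    local src="$1"
    local local_root="${2:-${TMPDIR:-}}"
    if [ -z "$local_root" ] || [ ! -d "$local_root" ] || [ ! -w "$local_root" ]; then
        echo "$src"
        return 0
    fi
    local dest
    dest="$local_root/$(basename "$src")"
    if cp -r "$src" "$local_root/" 2>/dev/null; then
        echo "$dest"
    else
        # Copy failed (for example, the local disk is full): remove the partial copy.
        rm -rf -- "$dest"
        echo "$src"
    fi
}

# Example use inside a job script (commented out):
# DATA_DIR="$(stage_dataset /scratch/project/datasets/training_data)"
# python train_llm.py --data_dir="$DATA_DIR"
```

The training command stays identical in both cases; only the value of `DATA_DIR` changes.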
Pitfall Guide
1. The Login Node Trap
Explanation: Users run heavy computations, data preprocessing, or interactive Python sessions on login nodes. These nodes are shared for code editing and job submission. Heavy usage degrades the experience for all users and may result in account suspension.
Fix: Always use srun --pty bash to request an interactive compute node for heavy tasks.
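One way to make this hard to get wrong is a guard at the top of heavy scripts. The sketch below treats the presence of `SLURM_JOB_ID` as a sign that the script runs inside an allocation; this is a heuristic, `require_compute_node` is a hypothetical helper, and the srun flags in the hint reuse the partition name from the examples above.

```shell
#!/bin/bash
# Hypothetical guard: refuse to run heavy work outside a SLURM allocation.
# Heuristic: SLURM sets SLURM_JOB_ID inside batch jobs and srun sessions.
require_compute_node() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "Not inside a SLURM allocation on ${HOSTNAME:-this host}; refusing to run." >&2
        echo "Request a node first, e.g.: srun --partition=accel --gres=gpu:1 --time=01:00:00 --pty bash" >&2
        return 1
    fi
}

# Example use at the top of a preprocessing script (commented out):
# require_compute_node || exit 1
```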
2. CUDA/PyTorch Version Mismatch
Explanation: Loading a CUDA module (e.g., cuda/12.1) while using a PyTorch binary compiled for an older version (e.g., cu118) causes runtime errors or silent fallback to CPU.
Fix: Verify compatibility using torch.version.cuda. Ensure the loaded module matches the PyTorch build. Use module spider cuda to list available versions.
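The comparison can be scripted. This sketch assumes the module system exports the colon-separated `LOADEDMODULES` variable (both Environment Modules and Lmod normally do) and that CUDA modules are named `cuda/<version>`; the two helper names are hypothetical.

```shell
#!/bin/bash
# Print the version of the loaded CUDA module (e.g. "11.8.0").
# Assumes LOADEDMODULES is a colon-separated list such as "gcc/11.3.0:cuda/11.8.0".
loaded_cuda_version() {
    local entry
    local IFS=':'
    for entry in ${LOADEDMODULES:-}; do
        case "$entry" in
            cuda/*) echo "${entry#cuda/}"; return 0 ;;
        esac
    done
    return 1
}

# Succeed when both versions share the same major.minor prefix.
same_major_minor() {
    local a b
    a="$(echo "$1" | cut -d. -f1,2)"
    b="$(echo "$2" | cut -d. -f1,2)"
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example use after activating the environment (commented out):
# torch_cuda="$(python -c 'import torch; print(torch.version.cuda)')"
# same_major_minor "$torch_cuda" "$(loaded_cuda_version)" || echo "CUDA mismatch" >&2
```

A CPU-only PyTorch build reports `None` for `torch.version.cuda`, which this check treats as a mismatch.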
3. Resource Hoarding
Explanation: Requesting more GPUs or memory than the job utilizes. This extends queue times for the user and reduces cluster efficiency.
Fix: Profile jobs locally to determine exact resource needs. Use --gres=gpu:1 if the script only uses one GPU. Monitor utilization with nvidia-smi during test runs.
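During a test run, the nvidia-smi check can be reduced to a single number. The sketch below counts GPUs whose utilization is under a threshold; `count_idle_gpus` is a hypothetical helper and the 10% default is an arbitrary assumption.

```shell
#!/bin/bash
# Hypothetical helper: read one GPU utilization percentage per line on stdin
# and print how many GPUs are below the threshold (default: 10 percent).
count_idle_gpus() {
    local threshold="${1:-10}"
    awk -v t="$threshold" \
        '$1 ~ /^[0-9]+$/ && ($1 + 0) < (t + 0) { idle++ } END { print idle + 0 }'
}

# Example use on a compute node while the job is running (commented out):
# nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | count_idle_gpus 10
```

A result that stays above zero throughout training suggests the job requested more GPUs than it uses.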
4. Storage I/O Saturation
Explanation: Reading large datasets directly from $HOME or network storage during training. This saturates the file server bandwidth, causing training to stall.
Fix: Stage data to $SCRATCH or $TMPDIR before training. Use data loading libraries that support parallel I/O.
5. Module Pollution
Explanation: Loading multiple conflicting modules (e.g., different GCC or Python versions) leads to linker errors and undefined behavior.
Fix: Always start scripts with module purge. Load modules in a specific order: compilers first, then libraries, then applications.
6. Ignoring Job Exit Codes
Explanation: Assuming a job completed successfully because the log file exists. Jobs may fail silently or exit with non-zero codes due to OOM errors.
Fix: Check job status with sacct or scontrol show job. Implement error handling in scripts to exit immediately on failure: set -e.
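The sacct check can be turned into a pass/fail test for follow-up automation. This sketch reads `State|ExitCode` lines, the shape produced by `sacct --format=State,ExitCode --noheader --parsable2`; `job_succeeded` is a hypothetical helper, and it reports a job that is still running as not succeeded.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if every job step on stdin is
# COMPLETED with exit code 0:0. Input lines look like "COMPLETED|0:0".
job_succeeded() {
    local state exit_code seen=0
    while IFS='|' read -r state exit_code; do
        [ -z "$state" ] && continue
        seen=1
        if [ "$state" != "COMPLETED" ] || [ "$exit_code" != "0:0" ]; then
            echo "step did not succeed: state=$state exit=$exit_code" >&2
            return 1
        fi
    done
    # No accounting records at all is also treated as a failure.
    [ "$seen" -eq 1 ]
}

# Example use (commented out; JOB_ID is a placeholder):
# sacct -j "$JOB_ID" --format=State,ExitCode --noheader --parsable2 | job_succeeded
```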
7. Zombie Jobs
Explanation: Jobs that appear running but are stuck due to deadlocks or network issues, consuming resources indefinitely.
Fix: Set reasonable --time limits. Use monitoring tools to detect stalled jobs. Implement heartbeat mechanisms in long-running scripts.
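A heartbeat can be as simple as a timestamp file that the training loop refreshes and a watchdog inspects. The sketch below is one possible shape; both helper names, the file location, and the 30-minute limit in the usage comment are assumptions.

```shell
#!/bin/bash
# Hypothetical helpers for a file-based heartbeat.

# Record the current time (seconds since the epoch) in the heartbeat file.
# Writing to a temporary file first keeps readers from seeing a partial write.
write_heartbeat() {
    local file="$1"
    date +%s > "$file.tmp" && mv "$file.tmp" "$file"
}

# Succeed when the heartbeat is missing or older than max_age seconds.
heartbeat_is_stale() {
    local file="$1" max_age="$2" last now
    [ -r "$file" ] || return 0
    last="$(cat "$file")"
    now="$(date +%s)"
    [ $(( now - last )) -gt "$max_age" ]
}

# Example watchdog running in the background of a job script (commented out):
# while sleep 300; do
#     if heartbeat_is_stale "$SLURM_SUBMIT_DIR/heartbeat" 1800; then
#         scancel "$SLURM_JOB_ID"
#     fi
# done &
```

The training script would call `write_heartbeat` (or write the same file from Python) every few steps, so a deadlocked job stops updating it and gets cancelled instead of holding nodes until the time limit.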
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Quick Prototype / Debugging | Local Machine or Interactive srun | Low latency, immediate feedback. | Low (Local) / Allocated (HPC) |
| Single-Node Training (< 8 GPUs) | HPC Batch Job | Access to A100/H100, no hourly cost. | Allocated (Free at point of use) |
| Multi-Node LLM Training | HPC Batch with torchrun | InfiniBand interconnect, massive scale. | Allocated (High compute allocation) |
| Burst Scaling / Variable Load | Cloud Instances | Elastic provisioning, pay-per-use. | High (Hourly rates) |
| Reproducible Research | Apptainer Container on HPC | Environment isolation, portability. | Allocated (Storage for images) |
Configuration Template
Copy this template for a robust, production-ready SLURM job.
#!/bin/bash
# =============================================================================
# SLURM Job Template for AI Workloads
# =============================================================================
#SBATCH --job-name=prod_ai_job
#SBATCH --partition=accel
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:4
#SBATCH --time=12:00:00
#SBATCH --mem=128G
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --account=your_project_id
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_email@domain.com
# Strict error handling
set -euo pipefail
# 1. Environment Setup
module purge
module load gcc/11.3.0
module load cuda/11.8.0
eval "$(conda shell.bash hook)"
conda activate ai_env
# 2. Working Directory
cd "$SLURM_SUBMIT_DIR"
# 3. Data Staging (Optional: Copy to node-local storage)
# if [ -d "${TMPDIR:-}" ]; then
#     cp -r /scratch/data/dataset "$TMPDIR/"
#     DATA_DIR="$TMPDIR/dataset"
# else
#     DATA_DIR=/scratch/data/dataset
# fi
# 4. Execution
echo "Starting job on $(hostname)"
echo "GPUs available: $(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)"
python main.py \
--epochs 50 \
--batch-size 64 \
--lr 1e-4 \
--output-dir /scratch/results/model_run_$(date +%Y%m%d)
# 5. Cleanup
echo "Job completed successfully."
Quick Start Guide
1. Connect via SSH:
ssh username@login.cluster.cineca.it
Ensure you have 2FA configured if required.
2. Load Modules and Create Environment:
module purge
module load cuda/11.8.0
conda create -n hpc_test python=3.10 -y
conda activate hpc_test
pip install torch --index-url https://download.pytorch.org/whl/cu118
3. Write a Job Script:
Create test_job.sh with the configuration template above. Adjust --gres and --time as needed, set --job-name=test_job so the log file names match the commands in the next step, and activate hpc_test instead of ai_env.
4. Submit and Monitor:
mkdir -p logs
sbatch test_job.sh
squeue -u $USER
tail -f logs/test_job_*.out
SLURM does not create the logs directory itself, so it must exist before submission.
5. Verify Results:
Check the output directory for model checkpoints and logs. Use sacct to review job accounting data.