Model Training¶
This guide explains how to train models using our PyTorch Lightning-based training framework.
Overview¶
The training system consists of several key components:
- Lightning Modules: Models implemented as PyTorch Lightning modules
- Training Script: Central `train.py` script for experiment execution
- Docker Environment: Containerized training environment
- Slurm Integration: Scripts for cluster training
Model Implementation¶
All models should be implemented as PyTorch Lightning modules.
Key Components¶
- Model Definition: Inherit from `LightningModule`
- Training Step: Implement `training_step`
- Validation Step: Implement `validation_step`
- Test Step: Implement `test_step`
- Configure Optimizers: Implement `configure_optimizers`
```python
import torch
import torch.nn as nn
from lightning import LightningModule


class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(...)
        # Loss criterion used in training_step
        self.criterion = nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = self.criterion(y_hat, y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())
```
Training Script¶
The main training script (src/train.py) handles:
- Configuration management via Hydra
- DataModule initialization
- Model instantiation
- Logger and callback setup
- Training execution
We recommend running training in conjunction with experiment management.
Basic Usage¶
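Assuming `src/train.py` is the Hydra entry point described above, a minimal run with a couple of overrides might look like:

```shell
# Train with the default config, overriding a few Hydra values
python src/train.py trainer.max_epochs=10 data.batch_size=32
```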
Configuration¶
Training configurations are managed by Hydra and stored in configs/:
```yaml
# configs/train.yaml
defaults:
  - model: base_model
  - data: ptbxl
  - trainer: default
  - callbacks: default
  - logger: wandb

model:
  lr: 0.001
  hidden_size: 128

data:
  batch_size: 64
  num_workers: 4

trainer:
  max_epochs: 100
  accelerator: gpu
  devices: 1
```
Training Environments¶
Local Development¶
For local development and testing, you can run training directly:
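The exact command depends on your environment; a typical direct run (assuming dependencies are installed locally) might be:

```shell
# Quick local smoke test: single epoch on CPU
python src/train.py trainer.accelerator=cpu trainer.devices=1 trainer.max_epochs=1
```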
Docker Environment¶
For reproducible training, use the provided Docker environment:
```shell
# Build the container
cd docker
./build.sh

# Run training in container
docker run --gpus all ipoleprojection/projection:latest python src/train.py
```
Cluster Training (Slurm)¶
For large-scale training on a Slurm cluster, use the provided scripts in scripts/slurm/:
Basic Job Submission¶
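Assuming the submission script at `scripts/slurm/train.sh` (the file referenced below), a job can be submitted with:

```shell
# Submit the training job; sbatch prints the assigned job ID
sbatch scripts/slurm/train.sh
```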
Resource Configuration¶
The Slurm scripts include default resource configurations:
```shell
#SBATCH -t 1-5:00:00       # Run time (days-hours:minutes:seconds)
#SBATCH -p performance     # Partition (queue)
#SBATCH --gpus=1           # Number of GPUs
#SBATCH --cpus-per-gpu=4   # CPUs per GPU
#SBATCH --mem=32G          # Memory per node
```
Modify these settings in scripts/slurm/train.sh based on your needs.
Job Management¶
```shell
# Check job status
squeue -u $USER

# Cancel a job
scancel <job_id>

# View job logs
tail -f logs/projection_train_<job_id>.out
tail -f logs/projection_train_<job_id>.err
```
Singularity Container¶
The training jobs run within a Singularity container to ensure reproducibility and consistent environments across cluster nodes. The container setup includes:
- Container Location: Uses `$HOME/projection_latest.sif` as the container image
- Mounted Directories:
  - Workspace: Job-specific workspace at `/tmp/projection_${SLURM_JOB_ID}` → `/workspace`
  - Data: Project data from `$DATA_PATH` → `/workspace/data`
- Environment Configuration
- GPU Support: Container runs with the `--nv` flag for NVIDIA GPU access
To update the container image, use the provided script:
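The script itself is not reproduced here; assuming the image is published on Docker Hub under the `ipoleprojection/projection:latest` tag used above, an equivalent manual update would be:

```shell
# Rebuild the Singularity image from the latest Docker image
singularity pull --force $HOME/projection_latest.sif docker://ipoleprojection/projection:latest
```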
This containerized approach ensures consistent software environments, proper isolation, and reproducible experiments across different compute nodes in the cluster.
AWS SageMaker Training¶
Note: The AWS SageMaker integration is currently in beta and serves as a proof of concept. While the setup provides a solid foundation for cloud training, it has not undergone extensive testing and may encounter issues.
The framework supports training on AWS SageMaker using custom ECR (Elastic Container Registry) images. This setup provides scalable cloud training with managed infrastructure.
Prerequisites¶
- AWS CLI Setup
- Docker Image:
  - Base image: PyTorch training container from AWS Deep Learning Containers
  - Location: `docker/aws/Dockerfile`
  - Custom additions:
    - Rye package management
    - SageMaker directory structure
    - NVIDIA CUDA support
Directory Structure¶
```
docker/aws/
└── Dockerfile           # SageMaker-compatible container definition
scripts/aws/
├── entrypoint.sh        # Container entry point
├── sm_entrypoint.py     # Hydra configuration adapter
├── train.py             # SageMaker training script
└── test.py              # Test script for validation
```
Configuration¶
The training job uses Hydra configurations with SageMaker-specific adaptations:
- Hyperparameters: SageMaker training job hyperparameters are automatically translated to Hydra overrides
- Environment: Uses standard SageMaker paths:
  - `/opt/ml/input/data/training`: Training data
  - `/opt/ml/model`: Model artifacts
  - `/opt/ml/output`: Training outputs
Usage¶
1. Build and Push Image
2. Launch Training: Use the AWS SageMaker SDK or console to launch a training job with:
   - Training image: Your pushed ECR image
   - Entry point: `/opt/ml/code/entrypoint.sh`
   - Hyperparameters: Passed as standard SageMaker hyperparameters
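The build-and-push step is not spelled out in this guide; a typical sequence (account ID, region, and repository name are placeholders, not values from this repository) might look like:

```shell
# Authenticate Docker with your ECR registry
aws ecr get-login-password --region <region> \
  | docker login --username AWS --password-stdin <account>.dkr.ecr.<region>.amazonaws.com

# Build the SageMaker-compatible image and push it to ECR
docker build -f docker/aws/Dockerfile -t <account>.dkr.ecr.<region>.amazonaws.com/projection:latest .
docker push <account>.dkr.ecr.<region>.amazonaws.com/projection:latest
```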
The SageMaker setup automatically handles infrastructure provisioning, data transfer, and artifact management while maintaining compatibility with the local training workflow through Hydra configurations.