Distributed Execution
MIND provides first-class support for distributed training and inference, allowing you to scale models across multiple nodes with automatic sharding, gradient synchronization, and fault tolerance.
Early Access: Distributed execution is currently in Phase 14 development. The APIs described here are subject to change. See the Roadmap for current status.
Overview
MIND's distributed execution framework supports three parallelism strategies:
- Data Parallelism: Replicate the model across devices and split data batches
- Model Parallelism: Split large models across devices when they don't fit in memory
- Pipeline Parallelism: Split model layers across devices for improved throughput
Getting Started
Enable distributed execution with the @distributed annotation:
use mind::distributed::{init, world_size, rank};

@distributed(strategy = "data_parallel")
fn train_step(
    model: &mut Model,
    optimizer: &mut Optimizer,
    batch: Tensor<f32, [B, 784]>,
    labels: Tensor<i64, [B]>,
) -> f32 {
    let output = model.forward(batch);
    let loss = cross_entropy(output, labels);
    // Gradients are automatically synchronized across all nodes
    loss.backward();
    optimizer.step();
    loss.item()
}

fn main() {
    // Initialize the distributed runtime
    init();
    println!("Running on node {} of {}", rank(), world_size());

    // Model, optimizer, and data loading are omitted for brevity
    for epoch in 0..100 {
        let loss = train_step(&mut model, &mut optimizer, batch, labels);
        if rank() == 0 {
            println!("Epoch {}: loss = {:.4}", epoch, loss);
        }
    }
}

Data Parallelism
Data parallelism replicates the model on each device and splits input batches. Gradients are averaged across all replicas using all-reduce operations.
use mind::distributed::{DataParallel, AllReduce};

// Wrap the model for data-parallel training; `devices` lists the local accelerators to replicate across
let model = DataParallel::new(model, devices);

// Training loop - batches are automatically distributed across replicas
for batch in dataloader.iter() {
    let loss = model.forward(batch);
    loss.backward();
    // Gradients synchronized via NCCL/Gloo
    optimizer.step();
}
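Conceptually, the synchronization that runs after backward() is an all-reduce that averages each gradient across replicas. The snippet below is purely illustrative: the AllReduce::sum helper is an assumed name, and in practice DataParallel performs this step for you.

// Illustrative only - DataParallel does this internally for every parameter.
// AllReduce::sum is an assumed helper: element-wise sum of `grad_local` across all ranks.
let grad_sum = AllReduce::sum(grad_local);
let grad_avg = grad_sum / world_size() as f32; // mean gradient applied by optimizer.step()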
Gradient Synchronization
MIND supports multiple collective communication backends:
| Backend | Devices | Notes |
|---|---|---|
| NCCL | NVIDIA GPU | Recommended for multi-GPU training |
| Gloo | CPU, GPU | Cross-platform, supports TCP/IP |
| MPI | CPU, GPU | HPC clusters with InfiniBand |
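The backend is normally chosen when the runtime is initialized. The sketch below assumes a Backend enum and an init_with_backend entry point; these are illustrative names rather than a finalized API.

use mind::distributed::{init_with_backend, Backend};

fn main() {
    // Hypothetical: request NCCL explicitly instead of relying on auto-detection
    init_with_backend(Backend::Nccl);
    // ... training proceeds as in the examples above ...
}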
Model Parallelism
For models that don't fit on a single device, use model parallelism to split layers across devices:
use mind::distributed::{ModelParallel, DeviceMap};

// Define how layers map to devices
let device_map = DeviceMap::new()
    .layer("embed", Device::cuda(0))
    .layers("transformer.0..12", Device::cuda(0))
    .layers("transformer.12..24", Device::cuda(1))
    .layer("head", Device::cuda(1));

// Create the model-parallel wrapper
let model = ModelParallel::new(model, device_map);

// The forward pass automatically moves activations between devices
let output = model.forward(input);

Pipeline Parallelism
Pipeline parallelism enables higher throughput by overlapping computation across micro-batches:
use mind::distributed::{Pipeline, Schedule};

// Configure the pipeline with micro-batching
let pipeline = Pipeline::new(model)
    .stages(4)                  // Split into 4 pipeline stages
    .micro_batches(8)           // 8 micro-batches per batch
    .schedule(Schedule::GPipe); // Or Schedule::PipeDream

// Training with pipeline parallelism
for batch in dataloader.iter() {
    let loss = pipeline.forward_backward(batch);
    optimizer.step();
}

Multi-Node Training
Scale training across multiple machines using the MIND distributed launcher:
# Launch on 4 nodes with 8 GPUs each
mind launch --nodes 4 --gpus-per-node 8 \
    --master-addr 192.168.1.1 \
    --master-port 29500 \
    train.mind

# Or use a hostfile
mind launch --hostfile hosts.txt --gpus-per-node 8 train.mind
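A hostfile lists the machines to launch on, one per line. The layout below, a hostname followed by a slot count, is an assumed format and may differ from the launcher's final syntax.

# hosts.txt (assumed format)
node01 slots=8
node02 slots=8
node03 slots=8
node04 slots=8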
Fault Tolerance
MIND supports elastic training with automatic checkpointing and recovery:
use mind::distributed::{Elastic, Checkpoint};

// Enable elastic training with periodic checkpointing
let trainer = Elastic::new(model)
    .min_nodes(2)
    .max_nodes(8)
    .checkpoint_dir("checkpoints/")
    .checkpoint_interval(1000); // steps between checkpoints

// Training automatically resumes from the latest checkpoint on node failure
trainer.fit(dataloader, epochs);
Best Practices
- Start with data parallelism for most workloads; it is the simplest strategy and usually the most efficient when the model fits on a single device
- Use gradient accumulation to simulate larger batch sizes without using more memory (see the sketch after this list)
- Profile communication overhead with MIND_PROFILE=1
- Enable mixed precision training to reduce communication bandwidth
- Use gradient compression for slow network connections
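As a sketch of the gradient accumulation practice above: run backward() over several micro-batches and call the optimizer once, which grows the effective batch size without growing per-device memory. The accumulation count and the zero_grad method name are assumptions for illustration.

// Hypothetical sketch: effective batch size = accum_steps * per-device batch size
let accum_steps = 4;
for (i, batch) in dataloader.iter().enumerate() {
    let loss = model.forward(batch);
    // Scale so the accumulated gradient matches a full-batch average
    (loss / accum_steps as f32).backward();
    if (i + 1) % accum_steps == 0 {
        optimizer.step();       // apply the accumulated gradients once per window
        optimizer.zero_grad();  // assumed reset method; name may differ
    }
}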
Learn More
See the Future Extensions page for upcoming distributed features and the Roadmap for development status.