Distributed Execution

MIND provides first-class support for distributed training and inference, allowing you to scale models across multiple nodes with automatic sharding, gradient synchronization, and fault tolerance.

Early Access: Distributed execution is currently in Phase 14 development. The APIs described here are subject to change. See the Roadmap for current status.

Overview

MIND's distributed execution framework supports three parallelism strategies:

  • Data Parallelism: Replicate the model across devices and split data batches
  • Model Parallelism: Split large models across devices when they don't fit in memory
  • Pipeline Parallelism: Split model layers across devices for improved throughput

Getting Started

Enable distributed execution with the @distributed annotation:

use mind::distributed::{init, world_size, rank};

@distributed(strategy = "data_parallel")
fn train_step(
    model: &mut Model,
    optimizer: &mut Optimizer,          // optimizer type shown here is illustrative
    batch: Tensor<f32, [B, 784]>,
    labels: Tensor<u32, [B]>,           // label tensor type shown here is illustrative
) -> f32 {
    let output = model.forward(batch);
    let loss = cross_entropy(output, labels);

    // Gradients are automatically synchronized across all nodes
    loss.backward();
    optimizer.step();

    loss.item()
}

fn main() {
    // Initialize distributed runtime
    init();

    println!("Running on node {} of {}", rank(), world_size());

    // Model, optimizer, and training data setup elided for brevity
    for epoch in 0..100 {
        let loss = train_step(&mut model, &mut optimizer, batch, labels);
        if rank() == 0 {
            println!("Epoch {}: loss = {:.4}", epoch, loss);
        }
    }
}

Data Parallelism

Data parallelism replicates the model on each device and splits input batches. Gradients are averaged across all replicas using all-reduce operations.

use mind::distributed::{DataParallel, AllReduce};

// Wrap the model for data parallel training;
// `devices` is the list of local accelerators to replicate across (setup elided)
let model = DataParallel::new(model, devices);

// Training loop - each batch is automatically split across the replicas
for batch in dataloader.iter() {
    let loss = model.forward(batch);
    loss.backward();

    // Gradients were all-reduced (averaged) via NCCL/Gloo during backward(),
    // so every replica applies an identical update
    optimizer.step();
}
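
Conceptually, the wrapper averages each replica's gradients with an all-reduce after the backward pass. The snippet below is a minimal sketch of that step using the AllReduce primitive imported above; iterating parameters via model.parameters() and the AllReduce::mean call are illustrative assumptions, not a finalized API.

// Sketch of the synchronization DataParallel performs after backward()
// (method names are illustrative assumptions)
for param in model.parameters() {
    // Average this parameter's gradient across all replicas
    param.grad = AllReduce::mean(param.grad);
}

// Every replica now holds identical averaged gradients, so
// optimizer.step() produces identical weights on every device.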

Gradient Synchronization

MIND supports multiple collective communication backends:

Backend   Devices       Notes
NCCL      NVIDIA GPU    Recommended for multi-GPU training
Gloo      CPU, GPU      Cross-platform, supports TCP/IP
MPI       CPU, GPU      HPC clusters with InfiniBand
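
The backend is selected when the distributed runtime is initialized. The snippet below is only a sketch of that idea: the Backend enum, the init_with_backend function, and the Device::cuda_available() check are hypothetical names used for illustration and may not match the Phase 14 API.

use mind::distributed::{init_with_backend, Backend};  // hypothetical names

// Prefer NCCL when NVIDIA GPUs are present, otherwise fall back to Gloo
let backend = if Device::cuda_available() { Backend::NCCL } else { Backend::Gloo };
init_with_backend(backend);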

Model Parallelism

For models that don't fit on a single device, use model parallelism to split layers across devices:

use mind::distributed::{ModelParallel, DeviceMap};

// Define how layers map to devices
let device_map = DeviceMap::new()
    .layer("embed", Device::cuda(0))
    .layers("transformer.0..12", Device::cuda(0))
    .layers("transformer.12..24", Device::cuda(1))
    .layer("head", Device::cuda(1));

// Create model parallel wrapper
let model = ModelParallel::new(model, device_map);

// Forward pass automatically moves tensors between devices
let output = model.forward(input);

Pipeline Parallelism

Pipeline parallelism enables higher throughput by overlapping computation across micro-batches:

use mind::distributed::{Pipeline, Schedule};

// Configure pipeline with micro-batching
let pipeline = Pipeline::new(model)
    .stages(4)                    // Split into 4 pipeline stages
    .micro_batches(8)             // 8 micro-batches per batch
    .schedule(Schedule::GPipe);   // Or Schedule::PipeDream

// Training with pipeline parallelism
for batch in dataloader.iter() {
    let loss = pipeline.forward_backward(batch);
    optimizer.step();
}
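
As a rule of thumb for the GPipe schedule, the idle "bubble" fraction of the pipeline is roughly (stages - 1) / (micro_batches + stages - 1). With the configuration above (4 stages, 8 micro-batches) that is 3 / 11 ≈ 27% idle time, so raising micro_batches relative to stages improves utilization at the cost of extra activation memory.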

Multi-Node Training

Scale training across multiple machines using the MIND distributed launcher:

# Launch on 4 nodes with 8 GPUs each
mind launch --nodes 4 --gpus-per-node 8 \
    --master-addr 192.168.1.1 \
    --master-port 29500 \
    train.mind

# Or use a hostfile
mind launch --hostfile hosts.txt --gpus-per-node 8 train.mind
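
The hostfile lists the machines to launch on. The example below assumes a common convention of one hostname or IP address per line; check your version of mind launch for the exact format it accepts.

# hosts.txt (assumed format: one hostname or IP address per line)
192.168.1.1
192.168.1.2
192.168.1.3
192.168.1.4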

Fault Tolerance

MIND supports elastic training with automatic checkpointing and recovery:

use mind::distributed::{Elastic, Checkpoint};

// Enable elastic training with checkpointing
let trainer = Elastic::new(model)
    .min_nodes(2)
    .max_nodes(8)
    .checkpoint_dir("checkpoints/")
    .checkpoint_interval(1000);  // steps

// Training automatically resumes on node failure
trainer.fit(dataloader, epochs);
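
The Checkpoint type imported above can also be used to restore state by hand, for example when relaunching a job outside of elastic mode. This is only a sketch: load_latest and restore are assumed method names, not a confirmed API.

// Manually resume from the most recent checkpoint, if one exists
// (`load_latest` and `restore` are illustrative names)
if let Some(ckpt) = Checkpoint::load_latest("checkpoints/") {
    ckpt.restore(&mut model, &mut optimizer);
}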

Best Practices

  • Start with data parallelism for most workloads - it is the simplest strategy and usually the most efficient
  • Use gradient accumulation to simulate larger batch sizes without additional memory (see the sketch after this list)
  • Profile communication overhead with MIND_PROFILE=1
  • Enable mixed precision training to reduce the volume of gradient communication
  • Use gradient compression on slow network connections
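
Gradient accumulation, referenced in the list above, runs several forward/backward passes before each optimizer step, so the effective batch size grows without extra activation memory per step. A minimal sketch follows; the zero_grad method name is an assumption.

// Accumulate gradients over 4 batches, then apply a single update;
// effective batch size = 4 x the per-iteration batch size.
let accumulation_steps = 4;

for (i, batch) in dataloader.iter().enumerate() {
    let loss = model.forward(batch);
    loss.backward();  // gradients accumulate across iterations

    if (i + 1) % accumulation_steps == 0 {
        optimizer.step();        // apply the accumulated update
        optimizer.zero_grad();   // clear gradients for the next window (assumed name)
    }
}

When combined with data parallelism, it is usually worth synchronizing gradients only on the iterations that call optimizer.step(), which avoids redundant all-reduces.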

Learn More

See the Future Extensions page for upcoming distributed features and the Roadmap for development status.