Runtime

The MIND runtime provides deterministic execution of compiled models with minimal overhead. It supports deployment targets ranging from embedded devices to cloud servers.

Architecture

┌─────────────────────────────────────┐
│           Application               │
├─────────────────────────────────────┤
│         Runtime API (C/Rust)        │
├─────────────────────────────────────┤
│    Executor    │   Memory Manager   │
├────────────────┼────────────────────┤
│   CPU Backend  │   GPU Backends     │
│   (Stable)     │  CUDA, Metal,      │
│                │  ROCm, WebGPU      │
└────────────────┴────────────────────┘

GPU Backends: all four (CUDA, Metal, ROCm, WebGPU) are production-ready. The production CUDA 12.8+ backend is available via an Enterprise license.

GPU Runtime (Enterprise)

The Enterprise GPU runtime provides production-grade GPU acceleration across all four backends; the CUDA backend's highlights include:

  • cuBLAS/cuDNN: TF32 Tensor Cores for matmul, auto-tuned convolutions
  • Memory Allocator: CachingAllocator achieves 8.3M allocs/sec (180x faster than cudaMalloc)
  • Tensor Cores: TF32, FP16, FP8 (Ada Lovelace+) with PTX mma.sync
  • Async Streams: 8 streams (6 compute, 2 transfer) for overlapped execution
  • Supported GPUs: SM_80+ (Ampere, Ada Lovelace, Hopper)
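With four backends available, a runtime typically resolves which one to use by probing a preference order against what the host actually supports. The sketch below illustrates that idea only; `GpuBackend` and `pick_backend` are hypothetical names, not the MIND runtime API:

```rust
// Hypothetical sketch: resolving a GPU backend by preference order.
// Not the actual MIND runtime API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GpuBackend {
    Cuda,
    Metal,
    Rocm,
    WebGpu,
}

/// Return the first preferred backend that the host reports as available.
fn pick_backend(preferred: &[GpuBackend], available: &[GpuBackend]) -> Option<GpuBackend> {
    preferred.iter().copied().find(|b| available.contains(b))
}
```

For example, on a Linux host exposing only ROCm and WebGPU, a preference list of `[Cuda, Rocm, WebGpu]` would resolve to `Rocm`.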

Execution Modes

Mode                 Use Case                     Characteristics
AOT (Ahead-of-Time)  Production deployment        Fastest startup, smallest binary
JIT (Just-in-Time)   Development, dynamic shapes  Flexible, runtime optimization
Interpreter          Debugging, conformance       Reference implementation
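The trade-offs in the table can drive an automatic choice: static-shape production builds favor AOT, dynamic shapes need runtime compilation, and everything else falls back to the reference interpreter. This is an illustrative sketch under those assumptions; `ExecutionMode` and `choose_mode` are hypothetical names, not the runtime's API:

```rust
// Hypothetical sketch of mode selection driven by the table's trade-offs.

#[derive(Debug, PartialEq, Eq)]
enum ExecutionMode {
    Aot,         // fastest startup, smallest binary
    Jit,         // flexible, runtime optimization
    Interpreter, // reference implementation
}

fn choose_mode(production: bool, dynamic_shapes: bool) -> ExecutionMode {
    match (production, dynamic_shapes) {
        (true, false) => ExecutionMode::Aot,          // static shapes: fully pre-compilable
        (_, true) => ExecutionMode::Jit,              // dynamic shapes: compile at runtime
        (false, false) => ExecutionMode::Interpreter, // debugging / conformance runs
    }
}
```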

Memory Management

  • Static allocation: Memory planned at compile time for AOT
  • Arena allocator: Fast bump allocation for intermediate tensors
  • Buffer reuse: Automatic sharing of memory between non-overlapping tensors
  • Device memory: Unified API for CPU and GPU memory
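The arena strategy can be sketched in a few lines: all intermediate tensors share one pre-sized buffer, "allocation" is just advancing an aligned offset, and the whole arena is released at once when the forward pass ends. This is a minimal illustration of the technique, not the runtime's actual allocator:

```rust
// Minimal bump (arena) allocator sketch: one pre-sized buffer,
// allocation = advancing an offset, deallocation = a single reset.

struct Arena {
    buf: Vec<u8>,
    offset: usize,
}

impl Arena {
    fn with_capacity(bytes: usize) -> Self {
        Arena { buf: vec![0u8; bytes], offset: 0 }
    }

    /// Reserve `size` bytes aligned to `align`; returns the start offset,
    /// or None if the arena is exhausted.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        let start = (self.offset + align - 1) / align * align; // round up to alignment
        if start + size > self.buf.len() {
            return None;
        }
        self.offset = start + size;
        Some(start)
    }

    /// Free every intermediate tensor at once after the pass completes.
    fn reset(&mut self) {
        self.offset = 0;
    }
}
```

Because freeing is a single pointer reset, per-tensor allocation cost is constant and there is no fragmentation within a pass.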

Determinism Guarantees

The runtime provides strong determinism guarantees:

// Create runtime with deterministic mode (default)
let rt = Runtime::new(RuntimeConfig {
    deterministic: true,      // IEEE 754 strict, no threading non-determinism
    seed: 42,                 // RNG seed for reproducibility
    ..Default::default()
});

// Same inputs always produce same outputs
let out1 = model.forward(&input);
let out2 = model.forward(&input);
assert_eq!(out1, out2);  // Guaranteed
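The strict mode matters because floating-point addition is not associative: a thread pool that combines partial sums in a varying order can change results bit-for-bit between runs, so a deterministic runtime must fix the reduction order. A self-contained illustration (plain Rust, not runtime code):

```rust
// Floating-point addition is order-sensitive: summing the same values
// in a different order can produce a different f32 result.

fn sum_left_to_right(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0f32, |acc, &x| acc + x)
}
```

For example, `sum_left_to_right(&[1e8, 1.0, -1e8])` yields `0.0` (the `1.0` is absorbed into `1e8` before being cancelled), while the reordered `sum_left_to_right(&[1e8, -1e8, 1.0])` yields `1.0`. Fixing the evaluation order, as `deterministic: true` implies, removes this source of run-to-run drift.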

Resource Limits

let config = RuntimeConfig {
    max_memory_mb: 1024,      // Memory limit
    max_threads: 4,           // Thread pool size
    timeout_ms: Some(5000),   // Execution timeout
    ..Default::default()
};

let rt = Runtime::new(config);

Profiling

// Enable profiling
let rt = Runtime::new(RuntimeConfig {
    profile: true,
    ..Default::default()
});

model.forward(&input);

// Get profile data
let profile = rt.get_profile();
for op in profile.operations {
    println!("{}: {}ms", op.name, op.duration_ms);
}

Learn More

See the full runtime specification at mind-spec/runtime.md and the runtime repository at mind-runtime.