Introduction to MLOS Training
Machine learning training workloads have unique characteristics that generic operating systems aren't optimized for. Training involves repetitive patterns of computation (forward pass, backward pass, optimizer steps) interleaved with I/O operations (data loading, checkpointing) that benefit enormously from OS-level awareness.
MLOS (Machine Learning Operating System) introduces the Training Workload API in v7.1.0, providing kernel-level optimizations specifically designed for ML training. This guide will walk you through everything you need to know to leverage these optimizations in your training pipelines.
Why Use MLOS for Training?
Traditional training setups rely on generic Linux scheduling and memory management. While Linux is an excellent general-purpose OS, it lacks awareness of ML-specific patterns. Here's what MLOS brings to the table:
NUMA-Aware Allocation
Automatic placement of training data and model weights on optimal NUMA nodes, reducing cross-socket memory latency by up to 40%.
CPU Affinity Management
Bind training threads to specific CPU cores for consistent cache behavior and reduced context switching.
Phase-Aware Scheduling
The OS understands forward/backward/optimizer phases and adjusts priorities accordingly for optimal throughput.
Safe Checkpointing
Memory-locked buffers ensure checkpoint saves complete successfully even under memory pressure.
Memory Pressure Handling
Graceful callbacks when system memory runs low, allowing your training to adapt rather than crash.
Distributed Primitives
OS-level barrier and all-reduce operations for multi-node training with lower latency than userspace implementations.
The Training Workload API
The Training Workload API (libmlos-training) provides a C interface that can be used directly or through language bindings. The API is designed around the concept of training workloads - long-running processes that go through predictable phases.
Training Phases
MLOS recognizes six distinct training phases, each with different resource characteristics:
| Phase | Constant | Priority | Characteristics |
|---|---|---|---|
| Idle | MLOS_TRAINING_PHASE_IDLE | Lowest | Workload inactive, minimal resources |
| Data Loading | MLOS_TRAINING_PHASE_DATA_LOADING | Low | I/O bound, prefetch optimization |
| Forward Pass | MLOS_TRAINING_PHASE_FORWARD | Normal | Compute bound, cache-friendly |
| Backward Pass | MLOS_TRAINING_PHASE_BACKWARD | High | Compute + memory, gradient pinning |
| Optimizer | MLOS_TRAINING_PHASE_OPTIMIZER | Normal | Weight updates, bandwidth-sensitive |
| Checkpoint | MLOS_TRAINING_PHASE_CHECKPOINT | Critical | I/O bound, memory-locked |
Core API Functions
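The entry points referenced throughout this guide are mlos_training_request_resources(), set_phase(), and alloc_checkpoint_buffer(), together with the phase constants above. Below is a minimal sketch of driving libmlos-training from Python with ctypes; the shared-object filename, the prefixed set-phase/release symbol names, and the numeric phase values are assumptions for illustration, not the documented ABI.

```python
# Sketch: calling libmlos-training directly via ctypes. Only
# mlos_training_request_resources() and the MLOS_TRAINING_PHASE_* constants
# are named in this guide; the other symbols, numeric values, and signatures
# below are assumptions -- check the installed headers for the real ones.
import ctypes

lib = ctypes.CDLL("libmlos-training.so")

# Phase constants (numeric values assumed to follow the table above).
MLOS_TRAINING_PHASE_IDLE = 0
MLOS_TRAINING_PHASE_DATA_LOADING = 1
MLOS_TRAINING_PHASE_FORWARD = 2
MLOS_TRAINING_PHASE_BACKWARD = 3
MLOS_TRAINING_PHASE_OPTIMIZER = 4
MLOS_TRAINING_PHASE_CHECKPOINT = 5

# Assumed prototypes for the core entry points.
lib.mlos_training_request_resources.restype = ctypes.c_int
lib.mlos_training_set_phase.argtypes = [ctypes.c_int]
lib.mlos_training_set_phase.restype = ctypes.c_int
lib.mlos_training_release_resources.restype = ctypes.c_int

# Example: announce the forward pass to the scheduler.
lib.mlos_training_set_phase(MLOS_TRAINING_PHASE_FORWARD)
```

In practice most users will go through language bindings rather than raw ctypes; the remaining examples in this guide use a hypothetical high-level `mlos_training` Python module for readability.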
Getting Started
Prerequisites
Before you begin, ensure you have:
- MLOS Core v7.1.0 or later installed
- Axon v3.1.9 or later for model management
- Linux kernel 5.15+ (for full feature support)
- Your preferred ML framework (PyTorch, TensorFlow, JAX)
Installation
Basic Usage Example
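A minimal phase-annotated training loop, assuming hypothetical Python bindings (module `mlos_training`, class `TrainingWorkload`). Only the resource request, set_phase(), the phase names, and the need to release resources come from this guide; everything else is illustrative.

```python
# Sketch of a phase-annotated training loop. The mlos_training module,
# TrainingWorkload class, and method names are hypothetical bindings over
# the C API described in this guide.
import torch
import mlos_training as mlos

workload = mlos.TrainingWorkload(name="basic-example")
# Maps to mlos_training_request_resources() in the C API.
workload.request_resources(num_cpus=8, memory_bytes=16 * 1024**3,
                           exclusive_cpus=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = torch.utils.data.TensorDataset(torch.randn(1024, 512),
                                      torch.randint(0, 10, (1024,)))
loader = torch.utils.data.DataLoader(data, batch_size=64, num_workers=4)

try:
    for epoch in range(3):
        workload.set_phase(mlos.PHASE_DATA_LOADING)   # covers the first fetch
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)

            workload.set_phase(mlos.PHASE_FORWARD)
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)

            workload.set_phase(mlos.PHASE_BACKWARD)
            loss.backward()

            workload.set_phase(mlos.PHASE_OPTIMIZER)
            optimizer.step()
            optimizer.zero_grad()

            workload.set_phase(mlos.PHASE_DATA_LOADING)  # next batch fetch

        workload.set_phase(mlos.PHASE_CHECKPOINT)  # lock phase before saving
        torch.save(model.state_dict(), f"checkpoint_epoch{epoch}.pt")
finally:
    workload.release_resources()
```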
Case Study: Artifactiq YOLO Training
Artifactiq is a visual intelligence platform that uses YOLO (You Only Look Once) models for real-time object detection. Their training pipeline demonstrates the practical benefits of MLOS optimizations in a production environment.
Real Results
Artifactiq reported 22% faster training times and 35% reduction in memory fragmentation after integrating MLOS Training API into their pipeline.
Training Configuration
Artifactiq trains YOLOv8 variants using the Ultralytics framework with PyTorch. Here's their typical configuration:
| Parameter | Value | Notes |
|---|---|---|
| Base Model | YOLOv8n/s/m | Nano for edge, Small/Medium for servers |
| Image Size | 320-640px | Varies by deployment target |
| Batch Size | 16-64 | MLOS enables larger batches via memory optimization |
| Epochs | 50-300 | With early stopping |
| Framework | PyTorch + Ultralytics | MLOS-aware PyTorch wrapper |
| Export Formats | ONNX (FP32/FP16/INT8) | Via Axon for deployment |
Pipeline Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ ARTIFACTIQ TRAINING PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA PREPARATION │
│ ├── Image Collection (custom datasets + COCO) │
│ ├── Annotation (YOLO format: class x_center y_center width height) │
│ └── Augmentation (Albumentations: mosaic, mixup, HSV shifts) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ MLOS TRAINING WORKLOAD │
│ ├── Resource Allocation (NUMA-aware, CPU affinity) │
│ ├── Phase Tracking (data_load → forward → backward → optimizer) │
│ ├── Memory Management (gradient pinning, checkpoint buffers) │
│ └── Monitoring (mlgpu for GPU utilization tracking) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ MODEL EXPORT & DEPLOYMENT │
│ ├── PyTorch → ONNX (via Ultralytics export) │
│ ├── ONNX Optimization (FP16 quantization, INT8 calibration) │
│ ├── Axon Registration (axon register model.onnx) │
│ └── Deployment (MLOS Core inference with <0.8ms p99 latency) │
└─────────────────────────────────────────────────────────────────────────────┘
MLOS-Integrated Training Script
Here's how MLOS can be wired into Ultralytics YOLO training along the lines of Artifactiq's pipeline:
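The sketch below uses Ultralytics' public callback hooks to emit phase hints without touching the training loop itself. The `mlos_training` bindings are hypothetical, and the hyperparameters and dataset file are placeholders rather than Artifactiq's production values.

```python
# Sketch: Ultralytics YOLO training with MLOS resource requests and phase
# hints attached via callback hooks. The mlos_training bindings are
# hypothetical; hyperparameters and the dataset YAML are placeholders.
import mlos_training as mlos
from ultralytics import YOLO

workload = mlos.TrainingWorkload(name="artifactiq-yolov8")
workload.request_resources(num_cpus=16, memory_bytes=64 * 1024**3,
                           exclusive_cpus=True, numa_node=0)

model = YOLO("yolov8s.pt")

# Map Ultralytics callback events onto MLOS phase hints.
model.add_callback("on_train_batch_start",
                   lambda trainer: workload.set_phase(mlos.PHASE_FORWARD))
model.add_callback("on_train_batch_end",
                   lambda trainer: workload.set_phase(mlos.PHASE_DATA_LOADING))
model.add_callback("on_model_save",
                   lambda trainer: workload.set_phase(mlos.PHASE_CHECKPOINT))

try:
    model.train(data="artifactiq.yaml", epochs=100, imgsz=640, batch=32)
    model.export(format="onnx", half=True)  # FP16 ONNX for Axon registration
finally:
    workload.release_resources()
```

Attaching hints through callbacks keeps the Ultralytics trainer untouched, which matters when the framework is upgraded independently of the MLOS integration.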
Dataset Configuration
Artifactiq uses the standard Ultralytics YAML format for dataset configuration:
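A minimal data.yaml in that format, written from Python here so the examples stay in one language; the dataset root, split paths, and class names are placeholders, not Artifactiq's actual dataset.

```python
# Write a standard Ultralytics dataset config (data.yaml). Paths and class
# names are placeholders, not Artifactiq's actual dataset.
import yaml  # pip install pyyaml

dataset_config = {
    "path": "/data/artifactiq",   # dataset root
    "train": "images/train",      # relative to 'path'
    "val": "images/val",
    "names": {0: "person", 1: "vehicle", 2: "package"},
}

with open("artifactiq.yaml", "w") as f:
    yaml.safe_dump(dataset_config, f, sort_keys=False)
```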
PyTorch Integration
For custom PyTorch training loops (not using Ultralytics), MLOS provides deeper integration through a custom DataLoader wrapper and training context manager:
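A sketch of the general shape such a wrapper and context manager could take, using the same hypothetical `mlos_training` bindings as the earlier examples; it is illustrative, not the library's actual implementation.

```python
# Sketch: a phase-aware DataLoader wrapper plus a phase context manager for
# custom PyTorch loops. The mlos_training bindings are hypothetical and this
# shows the general shape only.
from contextlib import contextmanager
import torch
import mlos_training as mlos

workload = mlos.TrainingWorkload(name="custom-loop")
workload.request_resources(num_cpus=8, memory_bytes=16 * 1024**3)

class MLOSDataLoader:
    """Marks PHASE_DATA_LOADING while each batch is being fetched."""
    def __init__(self, loader, workload):
        self.loader, self.workload = loader, workload

    def __iter__(self):
        it = iter(self.loader)
        while True:
            self.workload.set_phase(mlos.PHASE_DATA_LOADING)
            try:
                batch = next(it)
            except StopIteration:
                return
            yield batch

@contextmanager
def phase(workload, p):
    """Announce a phase for the duration of a with-block; the next block
    (or the loader wrapper) announces the following phase."""
    workload.set_phase(p)
    yield

# Minimal model/data so the loop below runs end to end.
model = torch.nn.Linear(128, 2)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataset = torch.utils.data.TensorDataset(torch.randn(256, 128),
                                         torch.randint(0, 2, (256,)))
loader = MLOSDataLoader(torch.utils.data.DataLoader(dataset, batch_size=32),
                        workload)

for inputs, targets in loader:
    with phase(workload, mlos.PHASE_FORWARD):
        loss = criterion(model(inputs), targets)
    with phase(workload, mlos.PHASE_BACKWARD):
        loss.backward()
    with phase(workload, mlos.PHASE_OPTIMIZER):
        optimizer.step()
        optimizer.zero_grad()

workload.release_resources()
```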
Distributed Training
MLOS provides OS-level primitives for distributed training that complement frameworks like PyTorch DDP or Horovod. The key advantages are lower-latency barriers and all-reduce operations.
Performance Note
MLOS distributed primitives show 15-20% latency reduction compared to NCCL for small tensor operations, while being comparable for large tensors. The benefit is most pronounced in communication-bound scenarios.
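A sketch of that split in practice: PyTorch DDP keeps gradient all-reduce on NCCL, while an OS-level barrier covers lightweight synchronization points such as a pre-checkpoint rendezvous. The workload.barrier() call and the rest of the `mlos_training` bindings are assumptions, not documented signatures.

```python
# Sketch: DDP gradients stay on NCCL; MLOS barriers (assumed binding:
# workload.barrier()) cover lightweight synchronization points. Launch with
# torchrun so RANK/LOCAL_RANK/WORLD_SIZE are set.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import mlos_training as mlos

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
rank = dist.get_rank()

workload = mlos.TrainingWorkload(name=f"ddp-rank{rank}")
# Best practice: every rank requests identical resources.
workload.request_resources(num_cpus=8, memory_bytes=32 * 1024**3)

model = DDP(torch.nn.Linear(512, 10).cuda(), device_ids=[local_rank])

# ... training loop: gradient all-reduce still happens inside DDP via NCCL ...

# Low-latency OS-level rendezvous before rank 0 writes the checkpoint.
workload.barrier()
if rank == 0:
    workload.set_phase(mlos.PHASE_CHECKPOINT)
    torch.save(model.module.state_dict(), "checkpoint.pt")
workload.barrier()

workload.release_resources()
dist.destroy_process_group()
```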
Monitoring with mlgpu
Artifactiq uses mlgpu, an open-source GPU monitoring tool, alongside MLOS training. While MLOS handles CPU-side optimizations, mlgpu provides real-time visibility into GPU utilization, memory, and thermal status.
mlgpu displays:
- GPU utilization percentage
- Memory usage (used/total)
- Temperature and power draw
- Running processes and their memory consumption
- Framework detection (PyTorch, TensorFlow, etc.)
Integrating mlgpu with Training Scripts
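The simplest pattern is to keep mlgpu running for the duration of a training run, either in a separate terminal or launched from the script itself. The sketch below only starts the bare mlgpu command; any flags for logging or refresh intervals are not covered here, so check mlgpu's own documentation.

```python
# Sketch: keep mlgpu running alongside training. Only the bare "mlgpu"
# command is assumed here; consult mlgpu's documentation for flags such as
# log output or refresh intervals.
import subprocess

monitor = subprocess.Popen(["mlgpu"])
try:
    # ... run the MLOS-integrated training loop from the sections above ...
    pass
finally:
    monitor.terminate()
    monitor.wait()
```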
Best Practices
1. Resource Allocation
- Request resources early: Allocate MLOS resources at the start of your training script, before loading data or models.
- Use exclusive CPUs: For dedicated training servers, set exclusive_cpus=true to prevent interference from other processes.
- Match NUMA nodes: If you know your GPU's NUMA topology, specify the matching NUMA node for CPU allocation.
2. Phase Hints
- Be granular: Call set_phase() at the start of each logical phase, not just once per epoch.
- Include data loading: Don't forget to mark data loading phases - this enables prefetch optimizations.
- Mark checkpoints: Always transition to PHASE_CHECKPOINT before saving, even for quick saves.
3. Memory Management
- Register pressure callbacks: Always register a memory pressure callback to handle OOM gracefully.
- Use checkpoint buffers: For critical checkpoints, use alloc_checkpoint_buffer() to ensure saves complete (see the sketch after this list).
- Monitor with mlgpu: Keep an eye on GPU memory alongside MLOS CPU memory optimizations.
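A sketch combining the first two items above: a memory-pressure callback that shrinks the batch size instead of letting the process be OOM-killed, plus a checkpoint written through a memory-locked buffer. register_pressure_callback() and the buffer's write()/flush_to() methods are assumed binding names; alloc_checkpoint_buffer() is the call named in this guide.

```python
# Sketch: memory-pressure callback plus a memory-locked checkpoint buffer.
# register_pressure_callback() and the buffer's write()/flush_to() methods
# are assumed binding names; alloc_checkpoint_buffer() is named in this guide.
import io
import torch
import mlos_training as mlos

workload = mlos.TrainingWorkload(name="pressure-aware")
workload.request_resources(num_cpus=8, memory_bytes=32 * 1024**3)

batch_size = 64

def on_memory_pressure(level):
    """Invoked by MLOS when system memory runs low: shrink the batch size
    rather than crash."""
    global batch_size
    batch_size = max(8, batch_size // 2)
    print(f"memory pressure (level={level}); batch_size -> {batch_size}")

workload.register_pressure_callback(on_memory_pressure)

# Serialize the model into bytes, then copy through an mlock'd buffer so the
# save completes even under memory pressure.
payload = io.BytesIO()
torch.save(torch.nn.Linear(10, 10).state_dict(), payload)
raw = payload.getvalue()

workload.set_phase(mlos.PHASE_CHECKPOINT)
buf = workload.alloc_checkpoint_buffer(size_bytes=len(raw))
buf.write(raw)                    # assumed buffer method
buf.flush_to("checkpoint.ckpt")   # assumed buffer method
workload.release_resources()
```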
4. Distributed Training
- Use MLOS barriers: For frequent synchronization points, MLOS barriers are faster than NCCL.
- Keep NCCL for gradients: Use standard NCCL all-reduce for gradient synchronization (optimized for large tensors).
- Consistent resource allocation: Ensure all ranks request the same resources for predictable behavior.
Common Pitfall
Don't forget to release resources in a finally block or context manager. Unreleased resources can cause issues for subsequent training runs.
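One way to make the release automatic is a small context manager around the workload lifetime, using the same hypothetical bindings as the earlier examples:

```python
# Sketch: guarantee release_resources() runs even if training raises.
from contextlib import contextmanager
import mlos_training as mlos

@contextmanager
def mlos_workload(name, **resources):
    workload = mlos.TrainingWorkload(name=name)
    workload.request_resources(**resources)
    try:
        yield workload
    finally:
        workload.release_resources()   # always runs, even on exceptions

# Usage:
with mlos_workload("yolo-train", num_cpus=8,
                   memory_bytes=16 * 1024**3) as workload:
    workload.set_phase(mlos.PHASE_FORWARD)
    # ... training ...
```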
Troubleshooting
Resource Allocation Failures
Symptom: mlos_training_request_resources() returns an error.
Solutions:
- Check available memory with free -h
- Ensure no other MLOS workloads have exclusive CPU claims
- Reduce num_cpus or memory_bytes in your request
- Verify the MLOS kernel module is loaded: lsmod | grep mlos
Phase Hints Not Working
Symptom: No performance improvement despite phase hints.
Solutions:
- Verify hints are being called (add debug logging)
- Check that the training loop isn't I/O bound elsewhere
- Ensure sufficient CPU cores are allocated (minimum 4 recommended)
- Profile with perf to identify actual bottlenecks
Checkpoint Failures Under Pressure
Symptom: Checkpoints fail or corrupt when system memory is low.
Solutions:
- Use alloc_checkpoint_buffer() for mlock'd memory
- Reduce checkpoint size (save only model weights, not optimizer state)
- Implement memory pressure callback to reduce batch size dynamically
- Consider periodic cleanup of old checkpoints
Ready to Optimize Your Training?
Get started with MLOS Training Workload API today and see the performance difference.