Complete Training Guide v7.1.0

Training ML Models with MLOS

A comprehensive guide to leveraging OS-level optimizations for ML training, featuring real-world examples from Artifactiq's YOLO object detection pipeline.

Table of Contents

  1. Introduction to MLOS Training
  2. Why Use MLOS for Training?
  3. The Training Workload API
  4. Getting Started
  5. Case Study: Artifactiq YOLO Training
  6. PyTorch Integration
  7. Distributed Training
  8. Monitoring with mlgpu
  9. Best Practices
  10. Troubleshooting

Introduction to MLOS Training

Machine learning training workloads have unique characteristics that generic operating systems aren't optimized for. Training involves repetitive patterns of computation (forward pass, backward pass, optimizer steps) interleaved with I/O operations (data loading, checkpointing) that benefit enormously from OS-level awareness.

MLOS (Machine Learning Operating System) introduces the Training Workload API in v7.1.0, providing kernel-level optimizations specifically designed for ML training. This guide will walk you through everything you need to know to leverage these optimizations in your training pipelines.

At a glance:

  Efficiency gains:     15-30%
  Concurrent overhead:  <15%
  API tests:            80
  Training phases:      6

Why Use MLOS for Training?

Traditional training setups rely on generic Linux scheduling and memory management. While Linux is an excellent general-purpose OS, it lacks awareness of ML-specific patterns. Here's what MLOS brings to the table:

NUMA-Aware Allocation

Automatic placement of training data and model weights on optimal NUMA nodes, reducing cross-socket memory latency by up to 40%.

CPU Affinity Management

Bind training threads to specific CPU cores for consistent cache behavior and reduced context switching.

Phase-Aware Scheduling

The OS understands forward/backward/optimizer phases and adjusts priorities accordingly for optimal throughput.

Safe Checkpointing

Memory-locked buffers ensure checkpoint saves complete successfully even under memory pressure.

Memory Pressure Handling

Graceful callbacks when system memory runs low, allowing your training to adapt rather than crash.

Distributed Primitives

OS-level barrier and all-reduce operations for multi-node training with lower latency than userspace implementations.

The Training Workload API

The Training Workload API (libmlos-training) provides a C interface that can be used directly or through language bindings. The API is designed around the concept of training workloads - long-running processes that go through predictable phases.

Training Phases

MLOS recognizes six distinct training phases, each with different resource characteristics:

Phase          Constant                          Priority  Characteristics
-------------  --------------------------------  --------  -------------------------------------
Idle           MLOS_TRAINING_PHASE_IDLE          Lowest    Workload inactive, minimal resources
Data Loading   MLOS_TRAINING_PHASE_DATA_LOADING  Low       I/O bound, prefetch optimization
Forward Pass   MLOS_TRAINING_PHASE_FORWARD       Normal    Compute bound, cache-friendly
Backward Pass  MLOS_TRAINING_PHASE_BACKWARD      High      Compute + memory, gradient pinning
Optimizer      MLOS_TRAINING_PHASE_OPTIMIZER     Normal    Weight updates, bandwidth-sensitive
Checkpoint     MLOS_TRAINING_PHASE_CHECKPOINT    Critical  I/O bound, memory-locked

Core API Functions

mlos_training.h C Header
/* Resource management */
int mlos_training_request_resources(
    const mlos_training_resource_request_t* request,
    mlos_training_resources_t** resources
);
int mlos_training_release_resources(
    mlos_training_resources_t* resources
);

/* Thread binding */
int mlos_training_bind_thread(
    mlos_training_resources_t* resources,
    pthread_t thread,
    int cpu_index
);

/* Phase hints */
int mlos_training_set_phase(
    mlos_training_resources_t* resources,
    mlos_training_phase_t phase
);

/* Memory pressure callbacks */
int mlos_training_register_memory_callback(
    mlos_training_resources_t* resources,
    mlos_memory_callback_t callback,
    void* user_data
);

/* Checkpoint buffers */
void* mlos_training_alloc_checkpoint_buffer(
    mlos_training_resources_t* resources,
    size_t size
);

/* Distributed primitives */
int mlos_training_barrier(
    mlos_training_resources_t* resources,
    int rank,
    int world_size
);
int mlos_training_allreduce(
    mlos_training_resources_t* resources,
    void* data,
    size_t count,
    mlos_reduce_op_t op
);

Getting Started

Prerequisites

Before you begin, ensure you have MLOS Core v7.1.0 or later with the Training Workload API enabled, plus Axon for model management; the installation steps below cover both.

Installation

Terminal Installation
# Install MLOS Core
curl -fsSL https://mlosfoundation.org/install.sh | bash

# Verify installation
mlos --version
# MLOS Core v7.1.0 (Training API enabled)

# Install Axon for model management
curl -fsSL https://github.com/mlOS-foundation/axon/releases/download/v3.1.9/install.sh | bash

# Verify Axon
axon --version
# Axon v3.1.9

Basic Usage Example

train_basic.c C Example
#include <stdio.h>
#include <stdbool.h>
#include <pthread.h>
#include <mlos/training.h>

/* num_epochs, checkpoint_interval, and the model/data functions used below
   (load_batch, model_forward, compute_gradients, update_weights,
   save_checkpoint) are application-defined placeholders. */

int main() {
    // Request resources for training
    mlos_training_resource_request_t request = {
        .num_cpus = 8,
        .memory_bytes = 16ULL * 1024 * 1024 * 1024,  // 16GB
        .numa_node_preference = -1,                   // Auto-select best node
        .priority = 1,                                // Normal priority
        .exclusive_cpus = true
    };

    mlos_training_resources_t* resources = NULL;
    if (mlos_training_request_resources(&request, &resources) != 0) {
        fprintf(stderr, "Failed to allocate training resources\n");
        return 1;
    }

    // Bind current thread to allocated CPUs
    mlos_training_bind_thread(resources, pthread_self(), 0);

    // Training loop with phase hints
    for (int epoch = 0; epoch < num_epochs; epoch++) {
        // Data loading phase
        mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_DATA_LOADING);
        load_batch(data_loader);

        // Forward pass
        mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_FORWARD);
        outputs = model_forward(inputs);

        // Backward pass
        mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_BACKWARD);
        gradients = compute_gradients(outputs, targets);

        // Optimizer step
        mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_OPTIMIZER);
        update_weights(model, gradients, learning_rate);

        // Periodic checkpointing
        if (epoch % checkpoint_interval == 0) {
            mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_CHECKPOINT);
            save_checkpoint(model, epoch);
        }
    }

    // Cleanup
    mlos_training_release_resources(resources);
    return 0;
}

Case Study: Artifactiq YOLO Training

Artifactiq is a visual intelligence platform that uses YOLO (You Only Look Once) models for real-time object detection. Their training pipeline demonstrates the practical benefits of MLOS optimizations in a production environment.

Real Results

Artifactiq reported 22% faster training times and a 35% reduction in memory fragmentation after integrating the MLOS Training API into their pipeline.

Training Configuration

Artifactiq trains YOLOv8 variants using the Ultralytics framework with PyTorch. Here's their typical configuration:

Parameter       Value                  Notes
--------------  ---------------------  ---------------------------------------------------
Base Model      YOLOv8n/s/m            Nano for edge, Small/Medium for servers
Image Size      320-640px              Varies by deployment target
Batch Size      16-64                  MLOS enables larger batches via memory optimization
Epochs          50-300                 With early stopping
Framework       PyTorch + Ultralytics  MLOS-aware PyTorch wrapper
Export Formats  ONNX (FP32/FP16/INT8)  Via Axon for deployment
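
For reference, a plain Ultralytics run with mid-range settings from this table (no MLOS hooks yet) looks like the sketch below; dataset.yaml and the exact hyperparameters are placeholders.

from ultralytics import YOLO

# Baseline Ultralytics training run matching the table above (no MLOS integration).
model = YOLO("yolov8s.pt")              # Small variant for server deployments
results = model.train(
    data="dataset.yaml",                # dataset config (see Dataset Configuration below)
    epochs=100,                         # 50-300 in practice, with early stopping
    imgsz=640,
    batch=32,
    patience=50,                        # early stopping patience
)
model.export(format="onnx", half=True)  # FP16 ONNX for deployment via Axon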

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                        ARTIFACTIQ TRAINING PIPELINE                          │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  DATA PREPARATION                                                            │
│  ├── Image Collection (custom datasets + COCO)                              │
│  ├── Annotation (YOLO format: class x_center y_center width height)         │
│  └── Augmentation (Albumentations: mosaic, mixup, HSV shifts)               │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  MLOS TRAINING WORKLOAD                                                      │
│  ├── Resource Allocation (NUMA-aware, CPU affinity)                         │
│  ├── Phase Tracking (data_load → forward → backward → optimizer)            │
│  ├── Memory Management (gradient pinning, checkpoint buffers)               │
│  └── Monitoring (mlgpu for GPU utilization tracking)                        │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  MODEL EXPORT & DEPLOYMENT                                                   │
│  ├── PyTorch → ONNX (via Ultralytics export)                               │
│  ├── ONNX Optimization (FP16 quantization, INT8 calibration)               │
│  ├── Axon Registration (axon register model.onnx)                           │
│  └── Deployment (MLOS Core inference with <0.8ms p99 latency)              │
└─────────────────────────────────────────────────────────────────────────────┘

MLOS-Integrated Training Script

Here's a complete example showing how Artifactiq integrates MLOS with Ultralytics YOLO training:

train_yolo_mlos.py Python
""" YOLO Training with MLOS Optimizations Based on Artifactiq's production pipeline """ import os import torch from ultralytics import YOLO from pathlib import Path # MLOS Python bindings (optional but recommended) try: import mlos_training MLOS_AVAILABLE = True except ImportError: MLOS_AVAILABLE = False print("MLOS not available, using standard training") class MLOSYOLOTrainer: """YOLO trainer with MLOS optimizations""" def __init__(self, model_size="n", num_cpus=8, memory_gb=16): self.model_size = model_size self.model = YOLO(f"yolov8{model_size}.pt") self.resources = None if MLOS_AVAILABLE: # Request MLOS training resources self.resources = mlos_training.request_resources( num_cpus=num_cpus, memory_bytes=memory_gb * 1024**3, numa_preference=-1, # Auto-select exclusive_cpus=True ) # Bind main thread mlos_training.bind_current_thread(self.resources) print(f"MLOS: Allocated {num_cpus} CPUs, {memory_gb}GB memory") def train(self, data_yaml, epochs=100, imgsz=640, batch=16): """Train YOLO model with MLOS phase hints""" # Configure training callbacks for MLOS phases callbacks = {} if self.resources: def on_train_batch_start(trainer): mlos_training.set_phase( self.resources, mlos_training.PHASE_DATA_LOADING ) def on_train_batch_end(trainer): mlos_training.set_phase( self.resources, mlos_training.PHASE_OPTIMIZER ) callbacks = { "on_train_batch_start": on_train_batch_start, "on_train_batch_end": on_train_batch_end, } # Start training results = self.model.train( data=data_yaml, epochs=epochs, imgsz=imgsz, batch=batch, device="0" if torch.cuda.is_available() else "cpu", workers=4, project="runs/train", name=f"yolov8{self.model_size}_mlos", exist_ok=True, pretrained=True, optimizer="AdamW", lr0=0.001, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, patience=50, # Early stopping patience save=True, save_period=10, cache=False, amp=True, # Mixed precision ) return results def export_onnx(self, output_dir="exports"): """Export trained model to ONNX formats""" if self.resources: mlos_training.set_phase( self.resources, mlos_training.PHASE_CHECKPOINT ) output_path = Path(output_dir) output_path.mkdir(exist_ok=True) # Export FP32 self.model.export( format="onnx", imgsz=640, simplify=True, opset=12, ) # Export FP16 self.model.export( format="onnx", imgsz=640, half=True, simplify=True, ) print(f"Exported models to {output_path}") def register_with_axon(self, model_path): """Register exported model with Axon for MLOS deployment""" import subprocess result = subprocess.run( ["axon", "register", str(model_path)], capture_output=True, text=True ) if result.returncode == 0: print(f"Model registered with Axon: {model_path}") else: print(f"Axon registration failed: {result.stderr}") def cleanup(self): """Release MLOS resources""" if self.resources: mlos_training.release_resources(self.resources) print("MLOS resources released") if __name__ == "__main__": # Example usage trainer = MLOSYOLOTrainer( model_size="n", # nano model for this example num_cpus=8, memory_gb=16 ) try: # Train on custom dataset results = trainer.train( data_yaml="dataset.yaml", epochs=100, imgsz=640, batch=16 ) # Export to ONNX trainer.export_onnx() # Register with Axon trainer.register_with_axon("runs/train/yolov8n_mlos/weights/best.onnx") finally: trainer.cleanup()

Dataset Configuration

Artifactiq uses the standard Ultralytics YAML format for dataset configuration:

dataset.yaml YAML
# Artifactiq Custom Object Detection Dataset
path: ./data/artifactiq_detection
train: images/train
val: images/val
test: images/test

# Classes
names:
  0: person
  1: vehicle
  2: package
  3: equipment
  4: hazard

# Dataset stats (auto-calculated during training)
nc: 5  # number of classes
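
The paths in the YAML are relative to path:. Ultralytics resolves annotations by mirroring the images/ tree under labels/, one .txt file per image, so a typical layout looks like this (directory names here simply follow that convention):

data/artifactiq_detection/
├── images/
│   ├── train/
│   ├── val/
│   └── test/
└── labels/
    ├── train/    # one .txt per image: class x_center y_center width height
    ├── val/
    └── test/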

PyTorch Integration

For custom PyTorch training loops (not using Ultralytics), MLOS provides deeper integration through a custom DataLoader wrapper and training context manager:

custom_pytorch_training.py Python
import torch
import torch.nn as nn
import torch.optim as optim

# MLOS PyTorch integration
import mlos_training
from mlos_training.pytorch import (
    MLOSDataLoader,
    MLOSTrainingContext,
    mlos_checkpoint
)


def train_with_mlos(model, train_dataset, epochs=100, batch_size=32):
    """Custom PyTorch training with MLOS optimizations"""

    # MLOS-aware DataLoader with prefetching optimization
    train_loader = MLOSDataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=4,
        pin_memory=True,
        prefetch_factor=2,
        numa_aware=True   # MLOS: allocate on optimal NUMA node
    )

    optimizer = optim.AdamW(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()

    # Training context manager handles resource allocation
    with MLOSTrainingContext(num_cpus=8, memory_gb=16) as ctx:
        for epoch in range(epochs):
            model.train()

            for batch_idx, (data, target) in enumerate(train_loader):
                # Phase: Data loading (handled by MLOSDataLoader)

                # Phase: Forward pass
                ctx.set_phase(mlos_training.PHASE_FORWARD)
                output = model(data)
                loss = criterion(output, target)

                # Phase: Backward pass
                ctx.set_phase(mlos_training.PHASE_BACKWARD)
                optimizer.zero_grad()
                loss.backward()

                # Phase: Optimizer step
                ctx.set_phase(mlos_training.PHASE_OPTIMIZER)
                optimizer.step()

            # Checkpoint with memory-locked buffer
            if epoch % 10 == 0:
                with mlos_checkpoint(ctx):
                    torch.save({
                        'epoch': epoch,
                        'model_state': model.state_dict(),
                        'optimizer_state': optimizer.state_dict(),
                        'loss': loss.item(),
                    }, f'checkpoint_epoch_{epoch}.pt')

            print(f"Epoch {epoch}: Loss = {loss.item():.4f}")
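
A minimal way to exercise train_with_mlos, assuming the mlos_training.pytorch bindings above are installed; the toy model and synthetic dataset are illustrative only.

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset

# Toy classifier and random data, just enough to drive the loop above.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
dataset = TensorDataset(torch.randn(1024, 1, 28, 28), torch.randint(0, 10, (1024,)))

train_with_mlos(model, dataset, epochs=5, batch_size=32)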

Distributed Training

MLOS provides OS-level primitives for distributed training that complement frameworks like PyTorch DDP or Horovod. The key advantages are lower-latency barriers and all-reduce operations.

Performance Note

MLOS distributed primitives show 15-20% latency reduction compared to NCCL for small tensor operations, while being comparable for large tensors. The benefit is most pronounced in communication-bound scenarios.

distributed_training.py Python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

import mlos_training
from mlos_training.distributed import MLOSProcessGroup


def setup_distributed(rank, world_size):
    """Initialize distributed training with MLOS backend"""
    # Standard PyTorch distributed init
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size
    )

    # MLOS process group for optimized barriers
    mlos_pg = MLOSProcessGroup(rank, world_size)
    return mlos_pg


def train_distributed(rank, world_size, model, dataset, num_epochs=100):
    """Distributed training with MLOS optimizations"""
    mlos_pg = setup_distributed(rank, world_size)

    # Wrap model with DDP
    model = model.to(rank)
    model = DDP(model, device_ids=[rank])

    # MLOS training context for this rank
    with mlos_training.MLOSTrainingContext(
        num_cpus=8,
        memory_gb=16,
        rank=rank,
        world_size=world_size
    ) as ctx:
        for epoch in range(num_epochs):
            # Training step (forward/backward/optimizer) goes here
            loss, accuracy = train_one_epoch(model, dataset, rank)  # application-defined

            # MLOS barrier (lower latency than NCCL barrier)
            mlos_pg.barrier()

            # Custom all-reduce for metrics
            metrics_tensor = torch.tensor([loss, accuracy], device=rank)
            mlos_pg.all_reduce(metrics_tensor, op=mlos_training.SUM)
            metrics_tensor /= world_size

            if rank == 0:
                print(f"Epoch {epoch}: Avg Loss = {metrics_tensor[0]:.4f}")

    dist.destroy_process_group()
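
One way to launch train_distributed on a single multi-GPU node with torch.multiprocessing; build_model and build_dataset are placeholders, and the rendezvous address/port are assumptions to adapt to your cluster.

import os
import torch
import torch.multiprocessing as mp

def main():
    world_size = torch.cuda.device_count()

    # Rendezvous info required by dist.init_process_group
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    model = build_model()       # placeholder: construct your nn.Module
    dataset = build_dataset()   # placeholder: construct your Dataset

    # Spawn one process per GPU; mp.spawn passes the rank as the first argument
    mp.spawn(
        train_distributed,
        args=(world_size, model, dataset),
        nprocs=world_size,
        join=True,
    )

if __name__ == "__main__":
    main()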

Monitoring with mlgpu

Artifactiq uses mlgpu, an open-source GPU monitoring tool, alongside MLOS training. While MLOS handles CPU-side optimizations, mlgpu provides real-time visibility into GPU utilization, memory, and thermal status.

Terminal mlgpu Installation & Usage
# Install mlgpu
curl -fsSL https://raw.githubusercontent.com/ARTIFACTIQ/mlgpu/main/install.sh | bash

# Basic monitoring
mlgpu

# Watch mode with 1-second refresh
mlgpu --watch

# JSON output for programmatic use
mlgpu --json

# Monitor specific GPU
mlgpu --gpu 0

mlgpu displays per-GPU utilization, memory usage, and thermal status in real time, either interactively or as machine-readable JSON.

Integrating mlgpu with Training Scripts

monitor_training.py Python
import subprocess
import json
import threading
import time


class GPUMonitor:
    """Background GPU monitoring during training"""

    def __init__(self, log_file="gpu_metrics.jsonl", interval=5):
        self.log_file = log_file
        self.interval = interval
        self._stop = False
        self._thread = None

    def _monitor_loop(self):
        with open(self.log_file, "a") as f:
            while not self._stop:
                result = subprocess.run(
                    ["mlgpu", "--json"],
                    capture_output=True,
                    text=True
                )
                if result.returncode == 0:
                    metrics = json.loads(result.stdout)
                    metrics["timestamp"] = time.time()
                    f.write(json.dumps(metrics) + "\n")
                    f.flush()
                time.sleep(self.interval)

    def start(self):
        self._stop = False
        self._thread = threading.Thread(target=self._monitor_loop)
        self._thread.start()

    def stop(self):
        self._stop = True
        if self._thread:
            self._thread.join()


# Usage with training
monitor = GPUMonitor()
monitor.start()

try:
    # Your training code here
    train_model()
finally:
    monitor.stop()

Best Practices

1. Resource Allocation

Request only what the job needs and release it when done. Let MLOS pick the NUMA node with numa_node_preference = -1, and reserve exclusive_cpus for jobs that truly need dedicated cores.

2. Phase Hints

Report every phase transition (data loading, forward, backward, optimizer, checkpoint); the scheduler can only prioritize phases it knows about. Wrapper integrations such as the Ultralytics callbacks above handle this automatically.

3. Memory Management

Write checkpoints through memory-locked checkpoint buffers and register a memory-pressure callback so training can adapt (for example, by reducing batch size) rather than crash under low-memory conditions.

4. Distributed Training

Keep NCCL (or your framework backend) for large gradient all-reduces, and use MLOS barriers and all-reduce for small, latency-sensitive operations such as metric aggregation.

Common Pitfall

Don't forget to release resources in a finally block or context manager. Unreleased resources can cause issues for subsequent training runs.
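
Two equivalent patterns, using the Python bindings shown earlier (remaining request parameters left at their defaults; run_training is a placeholder for your loop):

import mlos_training

# Pattern 1: explicit release in a finally block
resources = mlos_training.request_resources(num_cpus=8, memory_bytes=16 * 1024**3)
try:
    run_training(resources)   # placeholder for your training loop
finally:
    mlos_training.release_resources(resources)

# Pattern 2: a context manager releases automatically, even on exceptions
with mlos_training.MLOSTrainingContext(num_cpus=8, memory_gb=16) as ctx:
    run_training(ctx)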

Troubleshooting

Resource Allocation Failures

Symptom: mlos_training_request_resources() returns an error.

Solutions:

- Check that num_cpus and memory_bytes fit within what the host actually has free.
- Retry with a smaller request, or drop exclusive_cpus so the workload can share cores.
- Make sure previous runs released their resources (see the Common Pitfall above); leaked allocations from earlier jobs can exhaust the pool.
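
A minimal fallback pattern along these lines, assuming the Python bindings shown earlier; whether request_resources raises or returns an error object is an assumption, so adapt the error handling to the bindings' actual convention.

import mlos_training

def request_with_fallback(num_cpus=8, memory_gb=16):
    """Try an exclusive allocation first, then progressively relax the request."""
    attempts = [
        dict(num_cpus=num_cpus, memory_bytes=memory_gb * 1024**3, exclusive_cpus=True),
        dict(num_cpus=num_cpus, memory_bytes=memory_gb * 1024**3, exclusive_cpus=False),
        dict(num_cpus=max(1, num_cpus // 2),
             memory_bytes=(memory_gb // 2) * 1024**3,
             exclusive_cpus=False),
    ]
    for req in attempts:
        try:
            return mlos_training.request_resources(numa_preference=-1, **req)
        except Exception as exc:   # assumed error convention
            print(f"Allocation failed for {req}: {exc}")
    raise RuntimeError("Could not allocate MLOS training resources")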

Phase Hints Not Working

Symptom: No performance improvement despite phase hints.

Solutions:

- Confirm the Training API is active: mlos --version should report "Training API enabled", and MLOS_AVAILABLE should be true in Python.
- Issue hints from a thread bound to the allocated resources, and report every phase transition (forward, backward, optimizer, checkpoint), not just one or two.
- Remember the gains are workload-dependent; the largest improvements appear in loops that interleave compute phases with heavy data loading and checkpointing I/O.

Checkpoint Failures Under Pressure

Symptom: Checkpoints fail or corrupt when system memory is low.

Solutions:

- Write checkpoints through mlos_training_alloc_checkpoint_buffer() (or the mlos_checkpoint context manager in the PyTorch bindings) so the buffer stays memory-locked during the save.
- Register a memory-pressure callback with mlos_training_register_memory_callback() and react (free caches, reduce batch size) before the next checkpoint.
- Set MLOS_TRAINING_PHASE_CHECKPOINT around the save so it runs at critical priority.
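
A sketch of reacting to memory pressure from Python. The C API above documents mlos_training_register_memory_callback(); a register_memory_callback wrapper with this shape in the Python bindings is an assumption, as is the callback signature.

import mlos_training

def on_memory_pressure(level):
    """Hypothetical callback invoked by MLOS when system memory runs low."""
    print(f"Memory pressure (level {level}): shrinking batch size before the next checkpoint")
    trainer.reduce_batch_size()   # placeholder hook into your own training code

resources = mlos_training.request_resources(num_cpus=8, memory_bytes=16 * 1024**3)

# Assumed Python wrapper mirroring mlos_training_register_memory_callback()
mlos_training.register_memory_callback(resources, on_memory_pressure)

# Checkpoints still go through the memory-locked, critical-priority path
mlos_training.set_phase(resources, mlos_training.PHASE_CHECKPOINT)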

Ready to Optimize Your Training?

Get started with the MLOS Training Workload API today and see the performance difference.