Complete Training Guide v7.1.0

Training ML Models with MLOS

A comprehensive guide to leveraging OS-level optimizations for ML training, featuring real-world examples from Artifactiq's YOLO object detection pipeline.

Table of Contents

  1. Introduction to MLOS Training
  2. Why Use MLOS for Training?
  3. The Training Workload API
  4. Getting Started
  5. Case Study: Artifactiq YOLO Training
  6. PyTorch Integration
  7. Distributed Training
  8. Monitoring with mlgpu
  9. Best Practices
  10. Troubleshooting

Introduction to MLOS Training

Machine learning training workloads have unique characteristics that generic operating systems aren't optimized for. Training involves repetitive patterns of computation (forward pass, backward pass, optimizer steps) interleaved with I/O operations (data loading, checkpointing) that benefit enormously from OS-level awareness.

MLOS (Machine Learning Operating System) introduces the Training Workload API in v7.1.0, providing kernel-level optimizations specifically designed for ML training. This guide will walk you through everything you need to know to leverage these optimizations in your training pipelines.

At a glance:

  Efficiency gains:     15-30%
  Concurrent overhead:  <15%
  API tests:            80
  Training phases:      6

Why Use MLOS for Training?

Traditional training setups rely on generic Linux scheduling and memory management. While Linux is an excellent general-purpose OS, it lacks awareness of ML-specific patterns. Here's what MLOS brings to the table:

NUMA-Aware Allocation

Automatic placement of training data and model weights on optimal NUMA nodes, reducing cross-socket memory latency by up to 40%.

CPU Affinity Management

Bind training threads to specific CPU cores for consistent cache behavior and reduced context switching.

Phase-Aware Scheduling

The OS understands forward/backward/optimizer phases and adjusts priorities accordingly for optimal throughput.

Safe Checkpointing

Memory-locked buffers ensure checkpoint saves complete successfully even under memory pressure.

Memory Pressure Handling

Graceful callbacks when system memory runs low, allowing your training to adapt rather than crash.

Distributed Primitives

OS-level barrier and all-reduce operations for multi-node training with lower latency than userspace implementations.

The Training Workload API

The Training Workload API (libmlos-training) provides a C interface that can be used directly or through language bindings. The API is designed around the concept of training workloads - long-running processes that go through predictable phases.

Training Phases

MLOS recognizes six distinct training phases, each with different resource characteristics:

Phase          Constant                          Priority  Characteristics
-------------  --------------------------------  --------  -------------------------------------
Idle           MLOS_TRAINING_PHASE_IDLE          Lowest    Workload inactive, minimal resources
Data Loading   MLOS_TRAINING_PHASE_DATA_LOADING  Low       I/O bound, prefetch optimization
Forward Pass   MLOS_TRAINING_PHASE_FORWARD       Normal    Compute bound, cache-friendly
Backward Pass  MLOS_TRAINING_PHASE_BACKWARD      High      Compute + memory, gradient pinning
Optimizer      MLOS_TRAINING_PHASE_OPTIMIZER     Normal    Weight updates, bandwidth-sensitive
Checkpoint     MLOS_TRAINING_PHASE_CHECKPOINT    Critical  I/O bound, memory-locked

Core API Functions

mlos_training.h C Header
/* Resource management */
int mlos_training_request_resources(
    const mlos_training_resource_request_t* request,
    mlos_training_resources_t** resources
);
int mlos_training_release_resources(
    mlos_training_resources_t* resources
);

/* Thread binding */
int mlos_training_bind_thread(
    mlos_training_resources_t* resources,
    pthread_t thread,
    int cpu_index
);

/* Phase hints */
int mlos_training_set_phase(
    mlos_training_resources_t* resources,
    mlos_training_phase_t phase
);

/* Memory pressure callbacks */
int mlos_training_register_memory_callback(
    mlos_training_resources_t* resources,
    mlos_memory_callback_t callback,
    void* user_data
);

/* Checkpoint buffers */
void* mlos_training_alloc_checkpoint_buffer(
    mlos_training_resources_t* resources,
    size_t size
);

/* Distributed primitives */
int mlos_training_barrier(
    mlos_training_resources_t* resources,
    int rank,
    int world_size
);
int mlos_training_allreduce(
    mlos_training_resources_t* resources,
    void* data,
    size_t count,
    mlos_reduce_op_t op
);

Getting Started

Prerequisites

Before you begin, ensure you have MLOS Core v7.1.0 or later with the Training Workload API enabled, plus Axon for model management; the installation steps below cover both.

Installation

Terminal Installation
# Install MLOS Core
curl -fsSL https://mlosfoundation.org/install.sh | bash

# Verify installation
mlos --version
# MLOS Core v7.1.0 (Training API enabled)

# Install Axon for model management
curl -fsSL https://github.com/mlOS-foundation/axon/releases/download/v3.1.9/install.sh | bash

# Verify Axon
axon --version
# Axon v3.1.9

Basic Usage Example

train_basic.c C Example
#include <stdio.h>
#include <stdbool.h>
#include <pthread.h>
#include <mlos/training.h>

/* num_epochs, checkpoint_interval, and the model/data functions used below
   (load_batch, model_forward, compute_gradients, update_weights,
   save_checkpoint) are application-defined placeholders. */

int main() {
    // Request resources for training
    mlos_training_resource_request_t request = {
        .num_cpus = 8,
        .memory_bytes = 16ULL * 1024 * 1024 * 1024,  // 16GB
        .numa_node_preference = -1,                   // Auto-select best node
        .priority = 1,                                // Normal priority
        .exclusive_cpus = true
    };

    mlos_training_resources_t* resources = NULL;
    if (mlos_training_request_resources(&request, &resources) != 0) {
        fprintf(stderr, "Failed to allocate training resources\n");
        return 1;
    }

    // Bind current thread to allocated CPUs
    mlos_training_bind_thread(resources, pthread_self(), 0);

    // Training loop with phase hints
    for (int epoch = 0; epoch < num_epochs; epoch++) {
        // Data loading phase
        mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_DATA_LOADING);
        load_batch(data_loader);

        // Forward pass
        mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_FORWARD);
        outputs = model_forward(inputs);

        // Backward pass
        mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_BACKWARD);
        gradients = compute_gradients(outputs, targets);

        // Optimizer step
        mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_OPTIMIZER);
        update_weights(model, gradients, learning_rate);

        // Periodic checkpointing
        if (epoch % checkpoint_interval == 0) {
            mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_CHECKPOINT);
            save_checkpoint(model, epoch);
        }
    }

    // Cleanup
    mlos_training_release_resources(resources);
    return 0;
}

Case Study: Artifactiq YOLO Training

Artifactiq is a visual intelligence platform that uses YOLO (You Only Look Once) models for real-time object detection. Their training pipeline demonstrates the practical benefits of MLOS optimizations in a production environment.

Real Results

Artifactiq reported 22% faster training times and a 35% reduction in memory fragmentation after integrating the MLOS Training API into their pipeline.

Training Configuration

Artifactiq trains YOLOv8 variants using the Ultralytics framework with PyTorch. Here's their typical configuration:

Parameter       Value                  Notes
--------------  ---------------------  ---------------------------------------------------
Base Model      YOLOv8n/s/m            Nano for edge, Small/Medium for servers
Image Size      320-640px              Varies by deployment target
Batch Size      16-64                  MLOS enables larger batches via memory optimization
Epochs          50-300                 With early stopping
Framework       PyTorch + Ultralytics  MLOS-aware PyTorch wrapper
Export Formats  ONNX (FP32/FP16/INT8)  Via Axon for deployment
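
For reference, a plain Ultralytics run with mid-range settings from this table (no MLOS hooks yet) looks like the sketch below; dataset.yaml and the exact hyperparameters are placeholders.

from ultralytics import YOLO

# Baseline Ultralytics training run matching the table above (no MLOS integration).
model = YOLO("yolov8s.pt")              # Small variant for server deployments
results = model.train(
    data="dataset.yaml",                # dataset config (see Dataset Configuration below)
    epochs=100,                         # 50-300 in practice, with early stopping
    imgsz=640,
    batch=32,
    patience=50,                        # early stopping patience
)
model.export(format="onnx", half=True)  # FP16 ONNX for deployment via Axon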

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                        ARTIFACTIQ TRAINING PIPELINE                          │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  DATA PREPARATION                                                            │
│  ├── Image Collection (custom datasets + COCO)                              │
│  ├── Annotation (YOLO format: class x_center y_center width height)         │
│  └── Augmentation (Albumentations: mosaic, mixup, HSV shifts)               │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  MLOS TRAINING WORKLOAD                                                      │
│  ├── Resource Allocation (NUMA-aware, CPU affinity)                         │
│  ├── Phase Tracking (data_load → forward → backward → optimizer)            │
│  ├── Memory Management (gradient pinning, checkpoint buffers)               │
│  └── Monitoring (mlgpu for GPU utilization tracking)                        │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  MODEL EXPORT & DEPLOYMENT                                                   │
│  ├── PyTorch → ONNX (via Ultralytics export)                               │
│  ├── ONNX Optimization (FP16 quantization, INT8 calibration)               │
│  ├── Axon Registration (axon register model.onnx)                           │
│  └── Deployment (MLOS Core inference with <0.8ms p99 latency)              │
└─────────────────────────────────────────────────────────────────────────────┘

MLOS-Integrated Training Script

Here's a complete example showing how Artifactiq integrates MLOS with Ultralytics YOLO training:

train_yolo_mlos.py Python
""" YOLO Training with MLOS Optimizations Based on Artifactiq's production pipeline """ import os import torch from ultralytics import YOLO from pathlib import Path # MLOS Python bindings (optional but recommended) try: import mlos_training MLOS_AVAILABLE = True except ImportError: MLOS_AVAILABLE = False print("MLOS not available, using standard training") class MLOSYOLOTrainer: """YOLO trainer with MLOS optimizations""" def __init__(self, model_size="n", num_cpus=8, memory_gb=16): self.model_size = model_size self.model = YOLO(f"yolov8{model_size}.pt") self.resources = None if MLOS_AVAILABLE: # Request MLOS training resources self.resources = mlos_training.request_resources( num_cpus=num_cpus, memory_bytes=memory_gb * 1024**3, numa_preference=-1, # Auto-select exclusive_cpus=True ) # Bind main thread mlos_training.bind_current_thread(self.resources) print(f"MLOS: Allocated {num_cpus} CPUs, {memory_gb}GB memory") def train(self, data_yaml, epochs=100, imgsz=640, batch=16): """Train YOLO model with MLOS phase hints""" # Configure training callbacks for MLOS phases callbacks = {} if self.resources: def on_train_batch_start(trainer): mlos_training.set_phase( self.resources, mlos_training.PHASE_DATA_LOADING ) def on_train_batch_end(trainer): mlos_training.set_phase( self.resources, mlos_training.PHASE_OPTIMIZER ) callbacks = { "on_train_batch_start": on_train_batch_start, "on_train_batch_end": on_train_batch_end, } # Start training results = self.model.train( data=data_yaml, epochs=epochs, imgsz=imgsz, batch=batch, device="0" if torch.cuda.is_available() else "cpu", workers=4, project="runs/train", name=f"yolov8{self.model_size}_mlos", exist_ok=True, pretrained=True, optimizer="AdamW", lr0=0.001, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, patience=50, # Early stopping patience save=True, save_period=10, cache=False, amp=True, # Mixed precision ) return results def export_onnx(self, output_dir="exports"): """Export trained model to ONNX formats""" if self.resources: mlos_training.set_phase( self.resources, mlos_training.PHASE_CHECKPOINT ) output_path = Path(output_dir) output_path.mkdir(exist_ok=True) # Export FP32 self.model.export( format="onnx", imgsz=640, simplify=True, opset=12, ) # Export FP16 self.model.export( format="onnx", imgsz=640, half=True, simplify=True, ) print(f"Exported models to {output_path}") def register_with_axon(self, model_path): """Register exported model with Axon for MLOS deployment""" import subprocess result = subprocess.run( ["axon", "register", str(model_path)], capture_output=True, text=True ) if result.returncode == 0: print(f"Model registered with Axon: {model_path}") else: print(f"Axon registration failed: {result.stderr}") def cleanup(self): """Release MLOS resources""" if self.resources: mlos_training.release_resources(self.resources) print("MLOS resources released") if __name__ == "__main__": # Example usage trainer = MLOSYOLOTrainer( model_size="n", # nano model for this example num_cpus=8, memory_gb=16 ) try: # Train on custom dataset results = trainer.train( data_yaml="dataset.yaml", epochs=100, imgsz=640, batch=16 ) # Export to ONNX trainer.export_onnx() # Register with Axon trainer.register_with_axon("runs/train/yolov8n_mlos/weights/best.onnx") finally: trainer.cleanup()

Dataset Configuration

Artifactiq uses the standard Ultralytics YAML format for dataset configuration:

dataset.yaml YAML
# Artifactiq Custom Object Detection Dataset
path: ./data/artifactiq_detection
train: images/train
val: images/val
test: images/test

# Classes
names:
  0: person
  1: vehicle
  2: package
  3: equipment
  4: hazard

# Dataset stats (auto-calculated during training)
nc: 5  # number of classes
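
The paths in the YAML are relative to path:. Ultralytics resolves annotations by mirroring the images/ tree under labels/, one .txt file per image, so a typical layout looks like this (directory names here simply follow that convention):

data/artifactiq_detection/
├── images/
│   ├── train/
│   ├── val/
│   └── test/
└── labels/
    ├── train/    # one .txt per image: class x_center y_center width height
    ├── val/
    └── test/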

PyTorch Integration

For custom PyTorch training loops (not using Ultralytics), MLOS provides deeper integration through a custom DataLoader wrapper and training context manager:

custom_pytorch_training.py Python
import torch
import torch.nn as nn
import torch.optim as optim

# MLOS PyTorch integration
import mlos_training
from mlos_training.pytorch import (
    MLOSDataLoader,
    MLOSTrainingContext,
    mlos_checkpoint
)


def train_with_mlos(model, train_dataset, epochs=100, batch_size=32):
    """Custom PyTorch training with MLOS optimizations"""

    # MLOS-aware DataLoader with prefetching optimization
    train_loader = MLOSDataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=4,
        pin_memory=True,
        prefetch_factor=2,
        numa_aware=True   # MLOS: allocate on optimal NUMA node
    )

    optimizer = optim.AdamW(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()

    # Training context manager handles resource allocation
    with MLOSTrainingContext(num_cpus=8, memory_gb=16) as ctx:
        for epoch in range(epochs):
            model.train()

            for batch_idx, (data, target) in enumerate(train_loader):
                # Phase: Data loading (handled by MLOSDataLoader)

                # Phase: Forward pass
                ctx.set_phase(mlos_training.PHASE_FORWARD)
                output = model(data)
                loss = criterion(output, target)

                # Phase: Backward pass
                ctx.set_phase(mlos_training.PHASE_BACKWARD)
                optimizer.zero_grad()
                loss.backward()

                # Phase: Optimizer step
                ctx.set_phase(mlos_training.PHASE_OPTIMIZER)
                optimizer.step()

            # Checkpoint with memory-locked buffer
            if epoch % 10 == 0:
                with mlos_checkpoint(ctx):
                    torch.save({
                        'epoch': epoch,
                        'model_state': model.state_dict(),
                        'optimizer_state': optimizer.state_dict(),
                        'loss': loss.item(),
                    }, f'checkpoint_epoch_{epoch}.pt')

            print(f"Epoch {epoch}: Loss = {loss.item():.4f}")
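
A minimal way to exercise train_with_mlos, assuming the mlos_training.pytorch bindings above are installed; the toy model and synthetic dataset are illustrative only.

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset

# Toy classifier and random data, just enough to drive the loop above.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
dataset = TensorDataset(torch.randn(1024, 1, 28, 28), torch.randint(0, 10, (1024,)))

train_with_mlos(model, dataset, epochs=5, batch_size=32)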

Distributed Training

MLOS provides OS-level primitives for distributed training that complement frameworks like PyTorch DDP or Horovod. The key advantages are lower-latency barriers and all-reduce operations.

Performance Note

MLOS distributed primitives show 15-20% latency reduction compared to NCCL for small tensor operations, while being comparable for large tensors. The benefit is most pronounced in communication-bound scenarios.

distributed_training.py Python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

import mlos_training
from mlos_training.distributed import MLOSProcessGroup


def setup_distributed(rank, world_size):
    """Initialize distributed training with MLOS backend"""
    # Standard PyTorch distributed init
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size
    )

    # MLOS process group for optimized barriers
    mlos_pg = MLOSProcessGroup(rank, world_size)
    return mlos_pg


def train_distributed(rank, world_size, model, dataset, num_epochs=100):
    """Distributed training with MLOS optimizations"""
    mlos_pg = setup_distributed(rank, world_size)

    # Wrap model with DDP
    model = model.to(rank)
    model = DDP(model, device_ids=[rank])

    # MLOS training context for this rank
    with mlos_training.MLOSTrainingContext(
        num_cpus=8,
        memory_gb=16,
        rank=rank,
        world_size=world_size
    ) as ctx:
        for epoch in range(num_epochs):
            # Training step (forward/backward/optimizer) goes here
            loss, accuracy = train_one_epoch(model, dataset, rank)  # application-defined

            # MLOS barrier (lower latency than NCCL barrier)
            mlos_pg.barrier()

            # Custom all-reduce for metrics
            metrics_tensor = torch.tensor([loss, accuracy], device=rank)
            mlos_pg.all_reduce(metrics_tensor, op=mlos_training.SUM)
            metrics_tensor /= world_size

            if rank == 0:
                print(f"Epoch {epoch}: Avg Loss = {metrics_tensor[0]:.4f}")

    dist.destroy_process_group()
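
One way to launch train_distributed on a single multi-GPU node with torch.multiprocessing; build_model and build_dataset are placeholders, and the rendezvous address/port are assumptions to adapt to your cluster.

import os
import torch
import torch.multiprocessing as mp

def main():
    world_size = torch.cuda.device_count()

    # Rendezvous info required by dist.init_process_group
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    model = build_model()       # placeholder: construct your nn.Module
    dataset = build_dataset()   # placeholder: construct your Dataset

    # Spawn one process per GPU; mp.spawn passes the rank as the first argument
    mp.spawn(
        train_distributed,
        args=(world_size, model, dataset),
        nprocs=world_size,
        join=True,
    )

if __name__ == "__main__":
    main()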

Monitoring with mlgpu

Artifactiq uses mlgpu, an open-source GPU monitoring tool, alongside MLOS training. While MLOS handles CPU-side optimizations, mlgpu provides real-time visibility into GPU utilization, memory, and thermal status.

Terminal mlgpu Installation & Usage
# Install mlgpu
curl -fsSL https://raw.githubusercontent.com/ARTIFACTIQ/mlgpu/main/install.sh | bash

# Basic monitoring
mlgpu

# Watch mode with 1-second refresh
mlgpu --watch

# JSON output for programmatic use
mlgpu --json

# Monitor specific GPU
mlgpu --gpu 0

mlgpu displays per-GPU utilization, memory usage, and thermal status in real time, either interactively or as machine-readable JSON.

Integrating mlgpu with Training Scripts

monitor_training.py Python
import subprocess
import json
import threading
import time


class GPUMonitor:
    """Background GPU monitoring during training"""

    def __init__(self, log_file="gpu_metrics.jsonl", interval=5):
        self.log_file = log_file
        self.interval = interval
        self._stop = False
        self._thread = None

    def _monitor_loop(self):
        with open(self.log_file, "a") as f:
            while not self._stop:
                result = subprocess.run(
                    ["mlgpu", "--json"],
                    capture_output=True,
                    text=True
                )
                if result.returncode == 0:
                    metrics = json.loads(result.stdout)
                    metrics["timestamp"] = time.time()
                    f.write(json.dumps(metrics) + "\n")
                    f.flush()
                time.sleep(self.interval)

    def start(self):
        self._stop = False
        self._thread = threading.Thread(target=self._monitor_loop)
        self._thread.start()

    def stop(self):
        self._stop = True
        if self._thread:
            self._thread.join()


# Usage with training
monitor = GPUMonitor()
monitor.start()

try:
    # Your training code here
    train_model()
finally:
    monitor.stop()

Best Practices

1. Resource Allocation

Request only what the job needs and release it when done. Let MLOS pick the NUMA node with numa_node_preference = -1, and reserve exclusive_cpus for jobs that truly need dedicated cores.

2. Phase Hints

Report every phase transition (data loading, forward, backward, optimizer, checkpoint); the scheduler can only prioritize phases it knows about. Wrapper integrations such as the Ultralytics callbacks above handle this automatically.

3. Memory Management

Write checkpoints through memory-locked checkpoint buffers and register a memory-pressure callback so training can adapt (for example, by reducing batch size) rather than crash under low-memory conditions.

4. Distributed Training

Keep NCCL (or your framework backend) for large gradient all-reduces, and use MLOS barriers and all-reduce for small, latency-sensitive operations such as metric aggregation.

Common Pitfall

Don't forget to release resources in a finally block or context manager. Unreleased resources can cause issues for subsequent training runs.
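
Two equivalent patterns, using the Python bindings shown earlier (remaining request parameters left at their defaults; run_training is a placeholder for your loop):

import mlos_training

# Pattern 1: explicit release in a finally block
resources = mlos_training.request_resources(num_cpus=8, memory_bytes=16 * 1024**3)
try:
    run_training(resources)   # placeholder for your training loop
finally:
    mlos_training.release_resources(resources)

# Pattern 2: a context manager releases automatically, even on exceptions
with mlos_training.MLOSTrainingContext(num_cpus=8, memory_gb=16) as ctx:
    run_training(ctx)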

Troubleshooting

Resource Allocation Failures

Symptom: mlos_training_request_resources() returns an error.

Solutions:

- Check that num_cpus and memory_bytes fit within what the host actually has free.
- Retry with a smaller request, or drop exclusive_cpus so the workload can share cores.
- Make sure previous runs released their resources (see the Common Pitfall above); leaked allocations from earlier jobs can exhaust the pool.
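
A minimal fallback pattern along these lines, assuming the Python bindings shown earlier; whether request_resources raises or returns an error object is an assumption, so adapt the error handling to the bindings' actual convention.

import mlos_training

def request_with_fallback(num_cpus=8, memory_gb=16):
    """Try an exclusive allocation first, then progressively relax the request."""
    attempts = [
        dict(num_cpus=num_cpus, memory_bytes=memory_gb * 1024**3, exclusive_cpus=True),
        dict(num_cpus=num_cpus, memory_bytes=memory_gb * 1024**3, exclusive_cpus=False),
        dict(num_cpus=max(1, num_cpus // 2),
             memory_bytes=(memory_gb // 2) * 1024**3,
             exclusive_cpus=False),
    ]
    for req in attempts:
        try:
            return mlos_training.request_resources(numa_preference=-1, **req)
        except Exception as exc:   # assumed error convention
            print(f"Allocation failed for {req}: {exc}")
    raise RuntimeError("Could not allocate MLOS training resources")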

Phase Hints Not Working

Symptom: No performance improvement despite phase hints.

Solutions:

- Confirm the Training API is active: mlos --version should report "Training API enabled", and MLOS_AVAILABLE should be true in Python.
- Issue hints from a thread bound to the allocated resources, and report every phase transition (forward, backward, optimizer, checkpoint), not just one or two.
- Remember the gains are workload-dependent; the largest improvements appear in loops that interleave compute phases with heavy data loading and checkpointing I/O.

Checkpoint Failures Under Pressure

Symptom: Checkpoints fail or corrupt when system memory is low.

Solutions:

- Write checkpoints through mlos_training_alloc_checkpoint_buffer() (or the mlos_checkpoint context manager in the PyTorch bindings) so the buffer stays memory-locked during the save.
- Register a memory-pressure callback with mlos_training_register_memory_callback() and react (free caches, reduce batch size) before the next checkpoint.
- Set MLOS_TRAINING_PHASE_CHECKPOINT around the save so it runs at critical priority.
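
A sketch of reacting to memory pressure from Python. The C API above documents mlos_training_register_memory_callback(); a register_memory_callback wrapper with this shape in the Python bindings is an assumption, as is the callback signature.

import mlos_training

def on_memory_pressure(level):
    """Hypothetical callback invoked by MLOS when system memory runs low."""
    print(f"Memory pressure (level {level}): shrinking batch size before the next checkpoint")
    trainer.reduce_batch_size()   # placeholder hook into your own training code

resources = mlos_training.request_resources(num_cpus=8, memory_bytes=16 * 1024**3)

# Assumed Python wrapper mirroring mlos_training_register_memory_callback()
mlos_training.register_memory_callback(resources, on_memory_pressure)

# Checkpoints still go through the memory-locked, critical-priority path
mlos_training.set_phase(resources, mlos_training.PHASE_CHECKPOINT)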

Ready to Optimize Your Training?

Get started with the MLOS Training Workload API today and see the performance difference.