NEW IN v7.1.0

Training Workload API

OS-level optimizations for ML training workloads, delivering 15-30% efficiency gains over generic Linux through NUMA-aware allocation, CPU affinity management, and training-phase-aware scheduling.

Performance Highlights

15-30%   Training Efficiency Gain
<15%     Concurrent Workload Overhead
80       Tests (61 Unit + 19 Integration)
100%     Framework Agnostic

Key Features

NUMA-Aware Resource Allocation

Automatic placement of training workloads on optimal NUMA nodes with memory affinity for reduced cross-socket latency.
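
As a minimal sketch, a workload can prefer a specific NUMA node through the same request structure shown in the usage example further down; the header name and the non-zero-on-failure return convention are assumptions not spelled out in this release note.

#include <stdbool.h>
#include <mlos/training.h>   /* assumed header name for libmlos-training */

/* Illustrative request: prefer NUMA node 0 explicitly instead of letting the
 * allocator auto-select (-1, as in the usage example further down). */
mlos_training_resource_request_t request = {
    .num_cpus             = 8,
    .memory_bytes         = 16ULL * 1024 * 1024 * 1024,  /* 16GB */
    .numa_node_preference = 0,      /* explicit node; -1 = auto-select */
    .priority             = 1,      /* normal */
    .exclusive_cpus       = true
};

mlos_training_resources_t* resources = NULL;
if (mlos_training_request_resources(&request, &resources) != 0) {
    /* Assumed convention: non-zero means the node cannot satisfy the
     * request, so fall back to automatic placement. */
    request.numa_node_preference = -1;
    mlos_training_request_resources(&request, &resources);
}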

CPU Affinity Management

Bind training threads to specific CPUs for consistent cache locality and reduced jitter during training.
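
A sketch of binding threads to the allocated CPUs. The third argument is interpreted here as an index into the allocated CPU set, matching the 0 passed in the usage example below; loader_threads and num_loader_threads are hypothetical worker-thread handles, not part of the API.

#include <pthread.h>

/* Bind the main training thread to the first allocated CPU. */
mlos_training_bind_thread(resources, pthread_self(), 0);

/* Bind each data-loader worker to its own CPU within the allocation so it
 * does not migrate and evict the trainer's cache lines. */
for (int i = 0; i < num_loader_threads; i++) {
    mlos_training_bind_thread(resources, loader_threads[i], 1 + i);  /* hypothetical worker handles */
}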

Training Phase Scheduling

Hint-based scheduling for forward pass, backward pass, optimizer, and checkpoint phases with priority adjustments.

Memory Pressure Handling

Callbacks for graceful memory pressure response with configurable thresholds and automatic notifications.
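
A sketch of how a pressure handler might be wired up. Only the registration function's name appears in this release note, so the callback prototype, the argument order, and the trainer_state_t/reduce_prefetch_depth helpers are assumptions for illustration.

/* Hypothetical callback prototype: the exact signature may differ. */
static void on_memory_pressure(void* user_data, int pressure_level)
{
    trainer_state_t* state = user_data;            /* hypothetical type */
    reduce_prefetch_depth(state, pressure_level);  /* hypothetical helper: shed cached batches */
}

/* Register the handler against the allocated resources;
 * `state` is a hypothetical trainer_state_t instance. */
mlos_training_register_memory_callback(resources, on_memory_pressure, &state);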

Checkpoint Safety

mlock'd buffers for reliable checkpoint saves even under memory pressure, ensuring training progress is never lost.
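
A sketch of a checkpoint save through a locked buffer. The call shape (size in, buffer pointer out) and the serialize_weights/write_checkpoint_file helpers are assumptions; the actual prototype is not shown in this release note.

/* Hypothetical call shape: a requested size in, an mlock'd buffer out. */
void*  ckpt_buf  = NULL;
size_t ckpt_size = 512ULL * 1024 * 1024;  /* 512 MB, illustrative */

if (mlos_training_alloc_checkpoint_buffer(resources, ckpt_size, &ckpt_buf) == 0) {
    /* Serialize weights into the locked buffer, then flush to disk; locked
     * pages cannot be reclaimed mid-save under memory pressure. */
    size_t written = serialize_weights(model, ckpt_buf, ckpt_size);  /* hypothetical helper */
    write_checkpoint_file("checkpoint.bin", ckpt_buf, written);      /* hypothetical helper */
}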

Distributed Primitives

Barrier and all-reduce operations for multi-node distributed training with OS-level coordination.
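
A sketch of a gradient synchronization step. Only the function names are documented here; the all-reduce argument list and the MLOS_ALLREDUCE_SUM constant are assumptions for illustration.

/* Wait until every node has finished its local backward pass. */
mlos_training_barrier(resources);

/* Hypothetical all-reduce shape: sum `grad_count` floats in place across
 * all nodes.  Argument list and MLOS_ALLREDUCE_SUM are assumptions. */
mlos_training_allreduce(resources, gradients, grad_count, MLOS_ALLREDUCE_SUM);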

Architecture Overview

USER SPACE
+-----------------------------------------------------------------+
|  +------------------+  +------------------+  +---------------+  |
|  |  PyTorch/TF/JAX  |  | Custom Training  |  |   Inference   |  |
|  |  Training Loop   |  |       Loop       |  |   Workloads   |  |
|  +--------+---------+  +--------+---------+  +-------+-------+  |
|           |                     |                    |          |
|           v                     v                    v          |
| +-------------------------------------------------------------+ |
| |        mlOS Training Workload API (libmlos-training)        | |
| | +------------+ +------------+ +------------+ +------------+ | |
| | |  Resource  | |    CPU     | |   Memory   | | Checkpoint | | |
| | | Allocator  | |  Affinity  | |  Pressure  | |   Safety   | | |
| | +------------+ +------------+ +------------+ +------------+ | |
| | +------------+ +------------+ +------------+ +------------+ | |
| | |  Training  | |Distributed | |  Metrics   | |   Export   | | |
| | |   Phases   | | Primitives | |   Export   | |  Formats   | | |
| | +------------+ +------------+ +------------+ +------------+ | |
| +-------------------------------------------------------------+ |
+-----------------------------------------------------------------+
                                 |
KERNEL SPACE
+-----------------------------------------------------------------+
| +-------------------------------------------------------------+ |
| |                 mlOS Kernel Module (mlos.ko)                 | |
| | +------------+ +------------+ +------------+ +------------+ | |
| | |    TMM     | | Scheduler  | |    GPU     | |  Chardev   | | |
| | |(Tensor Mem)| | (ML-Aware) | |  Manager   | | Interface  | | |
| | +------------+ +------------+ +------------+ +------------+ | |
| +-------------------------------------------------------------+ |
+-----------------------------------------------------------------+

Training Phase Support

The API supports training phase hints for intelligent OS-level scheduling optimizations.

IDLE          Workload inactive
DATA_LOADING  Low priority I/O
FORWARD       Normal priority compute
BACKWARD      High priority, gradient pinning
OPTIMIZER     Weight updates
CHECKPOINT    Safe checkpoint with mlock
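
As a sketch, one training step annotated with the full set of phase hints might look as follows. The DATA_LOADING, FORWARD, BACKWARD, and OPTIMIZER constants appear in the usage example below; MLOS_TRAINING_PHASE_CHECKPOINT and MLOS_TRAINING_PHASE_IDLE are assumed to follow the same naming pattern.

/* One training step annotated with phase hints.  The CHECKPOINT and IDLE
 * constant names are assumptions based on the documented naming pattern. */
mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_DATA_LOADING);
load_batch();                    /* low-priority I/O */

mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_FORWARD);
forward_pass();                  /* normal-priority compute */

mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_BACKWARD);
backward_pass();                 /* high priority, gradients pinned */

mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_OPTIMIZER);
update_weights();                /* weight updates */

if (step % checkpoint_interval == 0) {
    mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_CHECKPOINT);
    save_checkpoint();           /* backed by an mlock'd buffer */
}

mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_IDLE);   /* between steps */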

Core API Functions

mlos_training_request_resources()

Request CPU and memory resources with NUMA preferences. Returns a resources handle for the training workload.

mlos_training_bind_thread()

Bind a training thread to allocated CPUs with optimal NUMA memory policy.

mlos_training_set_phase()

Notify the OS of training phase transitions for intelligent scheduling.

mlos_training_register_memory_callback()

Register a callback that is invoked on memory pressure notifications so the workload can respond gracefully.

mlos_training_alloc_checkpoint_buffer()

Allocate an mlock'd buffer for reliable checkpoint saves under memory pressure.

mlos_training_barrier() / mlos_training_allreduce()

Distributed training primitives for multi-node synchronization.

mlos_training_release_resources()

Release allocated resources when training completes.

Usage Example

/* Request resources for training */
mlos_training_resource_request_t request = {
    .num_cpus = 8,
    .memory_bytes = 16 * 1024 * 1024 * 1024ULL,  // 16GB
    .numa_node_preference = -1,                  // auto-select
    .priority = 1,                               // normal
    .exclusive_cpus = true
};

mlos_training_resources_t* resources = NULL;
mlos_training_request_resources(&request, &resources);

/* Bind training thread */
mlos_training_bind_thread(resources, pthread_self(), 0);

/* Training loop with phase hints */
for (int epoch = 0; epoch < num_epochs; epoch++) {
    mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_DATA_LOADING);
    load_batch();

    mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_FORWARD);
    forward_pass();

    mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_BACKWARD);
    backward_pass();

    mlos_training_set_phase(resources, MLOS_TRAINING_PHASE_OPTIMIZER);
    update_weights();
}

/* Release resources */
mlos_training_release_resources(resources);

Real-World Showcase: Artifactiq YOLO Training

We validated the Training Workload API by training a YOLO object detection model in our CI pipeline, demonstrating end-to-end training with automatic ONNX export for deployment.

Training Configuration

Model       YOLOv8n
Epochs      5
Image Size  320x320
Framework   PyTorch + Ultralytics
Execution   GitHub Actions (CPU)

ONNX Export Results

FP32 Model     11.6 MB
FP16 Model     5.8 MB
INT8 Model     ~3 MB
PyTorch Best   6.2 MB
Export Status  All Verified

API Test Coverage

Unit Tests         61 Passing
Integration Tests  19 Passing
Total Tests        80
Coverage           Full API
CI Status          All Green