Runtime Modes

From userspace portability to full kernel-accelerated performance

Understanding MLOS Runtime Modes

MLOS Core is designed to run efficiently across different system configurations. Whether you're running on a standard Linux server, a container in the cloud, or a purpose-built MLOS Linux distribution, Core automatically detects and utilizes available optimizations.

The runtime mode determines which acceleration features are available and directly impacts inference latency, throughput, and resource utilization.

Available Runtime Modes

Userspace Only

Available Now

Baseline

Pure userspace execution using ONNX Runtime for neural network inference and llama.cpp for LLM workloads. Works on any system without special requirements.

  • Requirements: None (works everywhere)
  • Platforms: Linux, macOS, Windows
  • Use Case: Development, CI/CD, containers
  • Latency: Model-dependent baseline

Kernel Basic

Kernel Module

Target: Improved Latency

The mlos-ml kernel module provides optimized memory management: zero-copy tensor transfers and intelligent memory pooling reduce allocation overhead.

  • Requirements: mlos-ml kernel module
  • Platforms: Linux (kernel 5.15+)
  • Use Case: Production servers
  • Features: Memory manager, zero-copy I/O
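
Once the mlos-ml package is installed for your kernel, loading the module is a standard modprobe; a minimal sketch (installation steps vary by distribution, and the lsmod output mirrors the Mode Detection section below):

# Load the mlos-ml module and confirm it registered
$ sudo modprobe mlos_ml
$ lsmod | grep mlos_ml
mlos_ml                45056  0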

Kernel Scheduler

Kernel Module

Target: Consistent Low Latency

Adds an ML-aware CPU scheduler with preemption control and intelligent thread affinity, keeping latency consistent under mixed workloads.

  • Requirements: mlos-ml module + scheduler
  • Platforms: Linux (kernel 5.15+)
  • Use Case: Latency-critical workloads
  • Features: Memory + ML scheduler
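
The scheduler is controlled by the enable_scheduler module parameter (see Mode Detection below). Assuming the parameter can also be set at load time, enabling this mode looks like:

# Load the module with the ML-aware scheduler enabled
$ sudo modprobe mlos_ml enable_scheduler=1
$ cat /sys/module/mlos_ml/parameters/enable_scheduler
1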

Kernel Full

Maximum Performance

Target: Maximum Throughput

Enables the full kernel optimization stack: memory management, ML-aware scheduling, and integrated GPU resource management for maximum throughput.

  • Requirements: mlos-ml module (full config)
  • Platforms: MLOS Linux distributions
  • Use Case: High-performance inference
  • Features: Memory + Scheduler + GPU
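
Assuming both feature flags are load-time parameters (they appear under /sys/module/mlos_ml/parameters/ in the Mode Detection section below), the full configuration would be loaded as:

# Load the module with the scheduler and GPU manager enabled
$ sudo modprobe mlos_ml enable_scheduler=1 enable_gpu_manager=1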

Note: Performance improvements are under active benchmarking. Validated numbers will be published from E2E testing with the kernel module enabled.

Architecture Overview

Application Layer

Your ML applications, API clients, model serving frameworks

MLOS Core (Userspace)

REST/gRPC APIs, Plugin Manager, Model Registry, ONNX Runtime, llama.cpp

mlos-ml Kernel Module (Optional)

Memory Manager | ML-Aware Scheduler | GPU Resource Manager

Hardware

CPU | GPU (CUDA/ROCm/Metal) | Memory | Storage

Feature Comparison

Feature                    Userspace   Kernel Basic   Kernel Scheduler   Kernel Full
ONNX Runtime Inference     Yes         Yes            Yes                Yes
llama.cpp LLM Support      Yes         Yes            Yes                Yes
REST/gRPC APIs             Yes         Yes            Yes                Yes
Zero-Copy Tensor I/O       No          Yes            Yes                Yes
Memory Pooling             No          Yes            Yes                Yes
ML-Aware Scheduling        No          No             Yes                Yes
Preemption Control         No          No             Yes                Yes
GPU Memory Management      No          No             No                 Yes
Multi-GPU Orchestration    No          No             No                 Yes
Works in Containers        Yes         Yes*           Yes*               Yes*

* Requires privileged container or host kernel module
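
Since containers share the host kernel, the simplest setup loads the module on the host and lets the containerized Core detect it. A sketch (the mlos/core image name is hypothetical):

# Load the module on the host; containers share the host kernel
$ sudo modprobe mlos_ml

# Run Core normally; it detects the host module at startup. Depending on
# how Core talks to the module, the container may also need the relevant
# device node or the --privileged flag.
$ docker run -d --rm -p 8080:8080 mlos/core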

Mode Detection

MLOS Core automatically detects the available runtime mode at startup. You can also check the current mode programmatically or via the command line.

Command Line Detection

# Check if kernel module is loaded
$ lsmod | grep mlos_ml
mlos_ml                45056  0

# Check module parameters (Linux only)
$ cat /sys/module/mlos_ml/parameters/enable_scheduler
1

$ cat /sys/module/mlos_ml/parameters/enable_gpu_manager
1

# Check via Core health endpoint
$ curl -s http://localhost:8080/health | jq '.runtime_mode'
"kernel_full"

E2E Test Report

When viewing E2E validation reports, the Runtime Mode card in the "Release Versions" section indicates which mode was active during testing:

  • "Userspace Only (No Kernel Optimizations)" - Tests ran without kernel acceleration
  • "Kernel Module (Memory Manager)" - Basic kernel optimizations active
  • "Kernel Module (Memory + Scheduler)" - Scheduler optimizations enabled
  • "Kernel Module (Full: Memory, Scheduler, GPU)" - All kernel features enabled

Deployment Recommendations

CI/CD and Development

Use Userspace Only mode for GitHub Actions, local development, and container-based workflows. This provides maximum portability with zero special requirements.
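
Because userspace mode has no kernel dependency, the same image runs unchanged in CI. A sketch (the mlos/core image name and the exact userspace mode string are assumptions):

# Start Core in a CI job; without the mlos-ml module it runs in userspace mode
$ docker run -d --rm -p 8080:8080 mlos/core
$ curl -s http://localhost:8080/health | jq '.runtime_mode'
"userspace"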

Production Servers

Deploy with Kernel Basic or Kernel Scheduler for improved latency and throughput. Install the mlos-ml kernel module on your Linux servers.
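
To make the module persist across reboots, the standard modules-load.d and modprobe.d mechanisms apply (the options line assumes enable_scheduler is a load-time parameter, as noted above):

# Load mlos-ml at boot
$ echo mlos_ml | sudo tee /etc/modules-load.d/mlos-ml.conf

# Enable the ML-aware scheduler for Kernel Scheduler mode
$ echo "options mlos_ml enable_scheduler=1" | sudo tee /etc/modprobe.d/mlos-ml.conf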

High-Performance Inference

For maximum performance, use Kernel Full mode with MLOS Linux distributions (Ubuntu or Flatcar variants). This provides integrated GPU management and optimal resource utilization.

Ready to Deploy?

Check out our E2E validation reports to see runtime modes in action.
