Understanding MLOS Runtime Modes
MLOS Core is designed to run efficiently across different system configurations. Whether you're running on a standard Linux server, a container in the cloud, or a purpose-built MLOS Linux distribution, Core automatically detects and utilizes available optimizations.
The runtime mode determines which acceleration features are available and directly impacts inference latency, throughput, and resource utilization.
Available Runtime Modes
Userspace Only
Available Now · Baseline
Pure userspace execution using ONNX Runtime for neural network inference and llama.cpp for LLM workloads. Works on any system without special requirements.
- Requirements: None (works everywhere)
- Platforms: Linux, macOS, Windows
- Use Case: Development, CI/CD, containers
- Latency: Model-dependent baseline
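A minimal sketch of trying this mode in a container; the `mlos/core` image name is an assumption for illustration, while the port and `/health` endpoint match the detection examples later on this page:

```bash
# Hypothetical image name; Userspace Only mode has no host requirements,
# so any container runtime works.
docker run --rm -p 8080:8080 mlos/core

# With no kernel module visible, Core should report a userspace mode
# (the exact value name is not shown in this page's examples).
curl -s http://localhost:8080/health | jq '.runtime_mode'
```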
Kernel Basic
Kernel Module · Target: Improved Latency
Kernel module provides optimized memory management with zero-copy tensor transfers and intelligent memory pooling for reduced allocation overhead.
- Requirements: mlos-ml kernel module
- Platforms: Linux (kernel 5.15+)
- Use Case: Production servers
- Features: Memory manager, zero-copy I/O
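Loading the module is a standard `modprobe` step. A sketch, assuming the module is installed and packaged under the `mlos_ml` name shown in Mode Detection below:

```bash
# Load the mlos-ml kernel module (Linux 5.15+, requires root).
sudo modprobe mlos_ml

# Confirm it is active -- same check used in Mode Detection below.
lsmod | grep mlos_ml
```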
Kernel Scheduler
Kernel Module · Target: Consistent Low Latency
ML-aware CPU scheduler with preemption control and intelligent thread affinity for consistent latency under mixed workloads.
- Requirements: mlos-ml module + scheduler
- Platforms: Linux (kernel 5.15+)
- Use Case: Latency-critical workloads
- Features: Memory + ML scheduler
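The `enable_scheduler` parameter appears under `/sys/module/mlos_ml/parameters/` in the detection examples below; passing it at load time, as sketched here, assumes it is a standard load-time module parameter:

```bash
# Reload the module with the ML-aware scheduler enabled.
sudo modprobe -r mlos_ml
sudo modprobe mlos_ml enable_scheduler=1

# Verify the parameter took effect.
cat /sys/module/mlos_ml/parameters/enable_scheduler   # expect: 1
```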
Kernel Full
Maximum Performance · Target: Maximum Throughput
Full kernel optimization with memory management, ML-aware scheduling, and integrated GPU resource management for maximum throughput.
- Requirements: mlos-ml module (full config)
- Platforms: MLOS Linux distributions
- Use Case: High-performance inference
- Features: Memory + Scheduler + GPU
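A sketch of enabling the full configuration, combining both parameters documented in Mode Detection and verifying the result through the health endpoint:

```bash
# Reload with both optional features on; this assumes the memory
# manager is active by default whenever the module is loaded.
sudo modprobe -r mlos_ml 2>/dev/null || true
sudo modprobe mlos_ml enable_scheduler=1 enable_gpu_manager=1

# Core should now report the fully accelerated mode.
curl -s http://localhost:8080/health | jq '.runtime_mode'   # "kernel_full"
```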
Note: Performance improvements are under active benchmarking; validated numbers from E2E testing with the kernel module enabled will be published when available.
Architecture Overview
```
Application Layer
  Your ML applications, API clients, model serving frameworks
                              |
MLOS Core (Userspace)
  REST/gRPC APIs, Plugin Manager, Model Registry, ONNX Runtime, llama.cpp
                              |
mlos-ml Kernel Module (Optional)
  Memory Manager | ML-Aware Scheduler | GPU Resource Manager
                              |
Hardware
  CPU | GPU (CUDA/ROCm/Metal) | Memory | Storage
```
Feature Comparison
| Feature | Userspace | Kernel Basic | Kernel Scheduler | Kernel Full |
|---|---|---|---|---|
| ONNX Runtime Inference | Yes | Yes | Yes | Yes |
| llama.cpp LLM Support | Yes | Yes | Yes | Yes |
| REST/gRPC APIs | Yes | Yes | Yes | Yes |
| Zero-Copy Tensor I/O | No | Yes | Yes | Yes |
| Memory Pooling | No | Yes | Yes | Yes |
| ML-Aware Scheduling | No | No | Yes | Yes |
| Preemption Control | No | No | Yes | Yes |
| GPU Memory Management | No | No | No | Yes |
| Multi-GPU Orchestration | No | No | No | Yes |
| Works in Containers | Yes | Yes* | Yes* | Yes* |
* Requires a privileged container or the mlos-ml module loaded on the host kernel
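For the starred rows, the sketch below shows one way to satisfy the requirement. The image name is again an assumption, and `--privileged` is the blunt option; narrower device or capability grants may suffice depending on how the module exposes its interface:

```bash
# Load the module on the host first; containers share the host kernel.
sudo modprobe mlos_ml

# Run Core privileged so it can reach the module's interface.
docker run --rm --privileged -p 8080:8080 mlos/core
```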
Mode Detection
MLOS Core automatically detects the available runtime mode at startup. You can also check the current mode programmatically or via the command line.
Command Line Detection
```bash
# Check if the kernel module is loaded
$ lsmod | grep mlos_ml
mlos_ml                45056  0

# Check module parameters (Linux only)
$ cat /sys/module/mlos_ml/parameters/enable_scheduler
1
$ cat /sys/module/mlos_ml/parameters/enable_gpu_manager
1

# Check via the Core health endpoint
$ curl -s http://localhost:8080/health | jq '.runtime_mode'
"kernel_full"
```
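For scripted checks, the same endpoint can gate deployment logic. Only `kernel_full` appears verbatim above; the other value names in this sketch are assumptions:

```bash
#!/usr/bin/env bash
# Branch on the runtime mode reported by the /health endpoint.
mode=$(curl -s http://localhost:8080/health | jq -r '.runtime_mode')

case "$mode" in
  kernel_full)      echo "All kernel optimizations active" ;;
  kernel_scheduler) echo "Memory + scheduler active" ;;   # value name assumed
  kernel_basic)     echo "Memory manager active" ;;       # value name assumed
  *)                echo "Userspace only: $mode" ;;
esac
```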
E2E Test Report
When viewing E2E validation reports, the Runtime Mode card in the "Release Versions" section indicates which mode was active during testing:
- "Userspace Only (No Kernel Optimizations)" - Tests ran without kernel acceleration
- "Kernel Module (Memory Manager)" - Basic kernel optimizations active
- "Kernel Module (Memory + Scheduler)" - Scheduler optimizations enabled
- "Kernel Module (Full: Memory, Scheduler, GPU)" - All kernel features enabled
Deployment Recommendations
CI/CD and Development
Use Userspace Only mode for GitHub Actions, local development, and container-based workflows. This provides maximum portability with zero special requirements.
Production Servers
Deploy with Kernel Basic or Kernel Scheduler for improved latency and throughput. Install the mlos-ml kernel module on your Linux servers.
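To make the module survive reboots, the standard systemd load mechanisms apply. A sketch, with the parameter name taken from Mode Detection above:

```bash
# Load mlos_ml automatically at boot.
echo mlos_ml | sudo tee /etc/modules-load.d/mlos-ml.conf

# Enable the ML-aware scheduler by default for Kernel Scheduler mode.
echo "options mlos_ml enable_scheduler=1" | sudo tee /etc/modprobe.d/mlos-ml.conf
```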
High-Performance Inference
For maximum performance, use Kernel Full mode with MLOS Linux distributions (Ubuntu or Flatcar variants). This provides integrated GPU management and optimal resource utilization.
Ready to Deploy?
Check out our E2E validation reports to see runtime modes in action, and read the Architecture Guide for a deeper look at how the layers fit together.