Architecture

Kernel-level machine learning runtime with plugin-based design

System Overview

mlOS Core implements a kernel-level machine learning operating system with a plugin-based architecture. The system provides standardized interfaces for ML frameworks while maintaining high performance and resource efficiency through direct OS integration.

Layered view of the system, top to bottom:

  • Client Applications — your app, API clients, SDKs
  • API Layer — HTTP REST, gRPC, IPC (Unix socket)
  • Runtime Layer — mlOS Core Engine
  • Plugins — ONNX Runtime, llama.cpp, PyTorch, custom
  • Hardware — CPU, GPU (CUDA/ROCm/Metal), memory, storage

Core Components

mlOS Core Engine

The central orchestrator that manages the entire mlOS system.

  • Plugin Registry — Manages loaded ML framework plugins with lifecycle tracking
  • Model Registry — Tracks registered models, versions, and metadata
  • Resource Manager — Allocates and manages compute resources (CPU, GPU, memory)
  • SMI Interface — Standard Model Interface for plugin communication
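
The SMI can be pictured as a small contract that every plugin implements. Below is a minimal Python sketch under assumed names: `load_model`, `infer`, and `unload_model` are illustrative, not the actual mlOS API.

```python
from abc import ABC, abstractmethod
from typing import Any


class StandardModelInterface(ABC):
    """Hypothetical sketch of the SMI contract a plugin implements."""

    @abstractmethod
    def load_model(self, model_id: str, path: str) -> None: ...

    @abstractmethod
    def infer(self, model_id: str, inputs: Any) -> Any: ...

    @abstractmethod
    def unload_model(self, model_id: str) -> None: ...


class EchoPlugin(StandardModelInterface):
    """Toy plugin: tracks models in a dict and echoes inputs back."""

    def __init__(self):
        self.models = {}

    def load_model(self, model_id, path):
        self.models[model_id] = path

    def infer(self, model_id, inputs):
        if model_id not in self.models:
            raise KeyError(model_id)
        return inputs

    def unload_model(self, model_id):
        self.models.pop(model_id, None)
```

Because every plugin speaks the same interface, the core can route requests without knowing which framework sits behind them.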

Multi-Protocol API Layer

Three API protocols for different use cases.

  • HTTP REST — Management operations, easy integration (port 8080)
  • gRPC — High-performance binary protocol for production (port 8081)
  • IPC — Ultra-low-latency Unix domain sockets for local applications
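
As an illustration, a client would pick whichever protocol fits its latency budget and then build a request against it. The sketch below only constructs a hypothetical REST call; the `/v1/models/{id}/infer` route and payload shape are assumptions, not the documented mlOS API.

```python
import json

# Assumed base URL for the REST API (port 8080 per the list above).
MLOS_HTTP = "http://localhost:8080"


def build_infer_request(model_id: str, inputs) -> tuple[str, str]:
    """Construct the URL and JSON body for a hypothetical
    POST /v1/models/{id}/infer call over the REST API."""
    url = f"{MLOS_HTTP}/v1/models/{model_id}/infer"
    body = json.dumps({"inputs": inputs})
    return url, body
```

An actual client would POST `body` to `url` with any HTTP library; the gRPC and IPC paths would carry the same logical request over their respective transports.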

Plugin Architecture

Framework-agnostic plugin system for maximum flexibility.

  • Dynamic Loading — Load/unload plugins without restart
  • Process Isolation — Each plugin runs in a separate process
  • Version Support — Multiple versions can run simultaneously
  • Hot-Swapping — Update plugins without downtime
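
The registry bookkeeping behind dynamic loading can be sketched with Python's `importlib`. Real mlOS plugins run as separate processes, so this models only the load/unload lifecycle, not the isolation:

```python
import importlib


class PluginRegistry:
    """Minimal sketch: load/unload plugins keyed by name at runtime."""

    def __init__(self):
        self._plugins = {}

    def load(self, name: str, module_name: str):
        """Dynamically import a module and register it as a plugin."""
        mod = importlib.import_module(module_name)
        self._plugins[name] = mod
        return mod

    def unload(self, name: str):
        """Drop a plugin from the registry; returns it, or None."""
        return self._plugins.pop(name, None)

    def get(self, name: str):
        return self._plugins[name]
```

Hot-swapping then amounts to loading the new version, atomically repointing the registry entry, and unloading the old one once in-flight requests drain.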

System Flow

The typical flow of operations in mlOS from registration to inference.

1. Register Plugin — Client application registers an ML framework plugin with mlOS Core
2. Load Plugin — mlOS Core dynamically loads and initializes the plugin
3. Register Model — Client registers a model with mlOS Core, which forwards it to the plugin
4. Inference Request — Client sends an inference request to mlOS Core via any API protocol
5. Execute & Return — mlOS Core routes the request to the appropriate plugin, executes inference, and returns the results
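
The five steps above can be condensed into a toy in-memory stand-in for mlOS Core. All names here are illustrative; the real core adds process isolation, resource management, and the API layer:

```python
class MlosCoreSketch:
    """Toy stand-in for mlOS Core illustrating the five-step flow."""

    def __init__(self):
        self.plugins = {}   # plugin name -> callable
        self.models = {}    # model id -> plugin name

    def register_plugin(self, name, plugin):
        """Steps 1-2: register and 'load' a plugin."""
        self.plugins[name] = plugin

    def register_model(self, model_id, plugin_name):
        """Step 3: bind a model to the plugin that serves it."""
        self.models[model_id] = plugin_name

    def infer(self, model_id, inputs):
        """Steps 4-5: route to the owning plugin and return results."""
        plugin = self.plugins[self.models[model_id]]
        return plugin(inputs)


core = MlosCoreSketch()
core.register_plugin("echo", lambda x: x)   # a trivial 'framework'
core.register_model("m1", "echo")
```

The key routing fact is the two-level lookup: model id resolves to a plugin name, which resolves to the plugin itself.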

Performance Characteristics

mlOS achieves high performance through kernel-level optimizations and efficient resource management.

| Operation               | HTTP API | gRPC API | IPC API  |
|-------------------------|----------|----------|----------|
| Plugin Registration     | ~5ms     | ~2ms     | ~0.5ms   |
| Model Registration      | ~10ms    | ~5ms     | ~1ms     |
| Inference (small model) | ~2ms     | ~1ms     | ~0.1ms   |
| Inference (large model) | ~50ms    | ~25ms    | ~10ms    |
| Health Check            | ~1ms     | ~0.5ms   | ~0.05ms  |

Deployment Patterns

Single Node Deployment

mlOS Core runs on a single node with multiple plugins. Ideal for development, testing, and small-scale deployments.

Distributed Deployment

Multiple mlOS Core nodes behind a load balancer or service mesh. Enables horizontal scaling and high availability for production.
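
From the client's side, the distributed pattern reduces to spreading requests across node addresses. A minimal round-robin sketch follows; in production the load balancer or service mesh does this, and the node addresses are assumptions:

```python
from itertools import cycle


class RoundRobinClient:
    """Sketch of client-side balancing across mlOS Core nodes."""

    def __init__(self, nodes):
        self._nodes = cycle(nodes)

    def next_node(self) -> str:
        """Return the next node address in rotation."""
        return next(self._nodes)
```

A real balancer would add health checks so that a failed node is skipped, which is what gives the pattern its high-availability property.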

Security & Isolation

mlOS implements comprehensive security measures for production deployments.

Plugin Sandboxing

Each plugin runs in an isolated process with resource limits, preventing cascading failures.
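
The core property of process isolation is that a crash in the child never takes down the parent. This can be sketched with a child interpreter; it is illustrative only, not mlOS's actual sandbox:

```python
import subprocess
import sys


def run_isolated(snippet: str, timeout: float = 5.0):
    """Run work in a separate Python process. If the child crashes or
    exceeds the timeout, the parent (the 'core') survives.
    Returns (returncode, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout.strip()
```

A real sandbox would additionally pin CPU and memory limits on the child (e.g. via cgroups or `setrlimit`) before handing it any work.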

API Security

Token-based authentication and role-based access control for all API endpoints.

Rate Limiting

Per-client request throttling to prevent abuse and ensure fair resource usage.
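
A common way to implement per-client throttling is a token bucket: tokens refill at a steady rate up to a cap, and each request spends one. A minimal sketch, not mlOS's actual limiter (the injectable clock is just for testability):

```python
import time


class TokenBucket:
    """Per-client token bucket: `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self._now = now
        self._last = now()

    def allow(self) -> bool:
        """Refill based on elapsed time, then try to spend one token."""
        t = self._now()
        self.tokens = min(self.capacity, self.tokens + (t - self._last) * self.rate)
        self._last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Keeping one bucket per client token gives the fairness property: a noisy client exhausts only its own bucket, never its neighbors'.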