v7.0.0-beta First Beta Release - 4× faster inference

The Runtime Layer Your Models Deserve

Kernel-native ML execution. One unified runtime for ONNX, PyTorch, and LLMs. Predictive memory management that cuts latency by 30%.

4× Faster Inference · 42+ Models Tested · 100% Open Source

Without MLOS: Complex Multi-Framework Setup

  • 🔥 PyTorch: +180ms cold start
  • 📊 TensorFlow: +220ms cold start
  • 📦 ONNX Runtime: +95ms cold start
  • Latency: ~200ms · Memory: 3× overhead · Integration: per-framework
  • Memory fragmentation · No batching · Manual NUMA

With MLOS: Unified Runtime Layer

  • One entry point for 📦 .onnx, 🔥 .pt, and 🦙 .gguf models, handled by mlOS Core v7.0.0-beta
  • 🔌 ONNX Plugin (hot-reload ready) · 🦙 GGUF Plugin (streaming mode)
  • ⚡ Tensor Pool (3.6M ops/sec) · 🧠 NUMA Manager (auto-pinned)
  • <50μs overhead · 4× faster · 42 models
  • Interfaces: REST API on :8080 · gRPC on :8081 · IPC over unix://
  • One Unified API · Auto-batching · Zero-copy Transfer

Built for Production ML

Everything you need to run models in production, without the infrastructure headaches.

⚡ Kernel-Native Scheduling

ML workloads are first-class OS citizens. Priority-based scheduling ensures latency-sensitive inference always runs first.

🧠 Predictive Memory

NUMA-aware tensor placement with prefetching. Memory is ready before your model needs it.

🔄 Universal Runtime

One API for ONNX, PyTorch, TensorFlow, and GGUF models. No more juggling frameworks.
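
To make "one API" concrete, here is a sketch of the same request shape used for two different model formats; the /v1/models/... route and the JSON payload are illustrative assumptions, not the finalized interface.

terminal
$ # Illustrative only: an ONNX-backed encoder...
$ curl -X POST http://localhost:8080/v1/models/bert-base-uncased/infer \
    -H "Content-Type: application/json" \
    -d '{"inputs": {"text": "MLOS makes this simple."}}'

$ # ...and a GGUF-backed LLM, with only the model name changing
$ curl -X POST http://localhost:8080/v1/models/tinyllama/infer \
    -H "Content-Type: application/json" \
    -d '{"inputs": {"prompt": "Say hello."}}'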

🦙 Native LLM Support

Built-in llama.cpp integration. Run Llama, Qwen, DeepSeek and other LLMs with streaming generation.
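
A sketch of what streaming generation could look like over the REST interface; the /generate route and the stream flag are illustrative assumptions rather than documented endpoints.

terminal
$ curl -N -X POST http://localhost:8080/v1/models/tinyllama/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Explain NUMA in one sentence.", "stream": true}'
...tokens arrive incrementally as they are generated...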

📊 Real-Time Metrics

Comprehensive observability for inference latency, memory usage, and model health.
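
As an illustration, metrics could be scraped the way most exporters expose them; the /metrics path below is an assumption modeled on common Prometheus-style conventions, not a documented route.

terminal
$ curl http://localhost:8080/metrics
...per-model latency, memory, and health counters...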

🔧 Simple REST API

Standard HTTP interface that works with any language. Register, load, infer. Done.
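
A sketch of that register, load, infer flow from the shell. The routes and payloads here are illustrative assumptions, not the finalized API.

terminal
$ # 1. Register a model with the runtime (illustrative route)
$ curl -X POST http://localhost:8080/v1/models \
    -d '{"name": "bert-base-uncased", "format": "onnx"}'

$ # 2. Load it into memory
$ curl -X POST http://localhost:8080/v1/models/bert-base-uncased/load

$ # 3. Run inference
$ curl -X POST http://localhost:8080/v1/models/bert-base-uncased/infer \
    -d '{"inputs": {"text": "Ship it."}}'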

Deploy in 60 Seconds

Install the CLI, add your model, and start serving. MLOS handles conversion, optimization, and scaling automatically.

  • Automatic ONNX conversion from PyTorch/TensorFlow
  • Built-in model versioning and hot-reload
  • Hardware-optimized inference out of the box
  • Zero-config for common model architectures
terminal
$ curl -sSL axon.mlosfoundation.org | sh
✓ Axon CLI installed

$ axon install hf/bert-base-uncased
Downloading from Hugging Face...
Converting to ONNX format...
✓ Model ready: bert-base-uncased

$ mlos_core serve --port 8080
✓ Serving at http://localhost:8080
Ready for inference requests

42+ Models CI-Tested

Every model validated in our E2E pipeline. If it's here, it works.

  • 📝 BERT · Text Encoding
  • 🔍 DistilBERT · Fast Classification
  • 🤖 RoBERTa · NLU Tasks
  • 💬 GPT-2 · Text Generation
  • 🖼️ ViT · Vision Transformer
  • 🎨 CLIP · Multimodal
  • 🦙 TinyLlama · GGUF/LLM
  • 🌐 Qwen2 · GGUF/LLM

View All Test Results →

Latest News

Updates from the MLOS Foundation


Latest Posts

Deep dives, technical insights, and updates from the MLOS Foundation

View All Posts →

Ready to Ship ML?

Open source. Production ready. Start deploying models in minutes.