Standard REST API for easy integration. JSON request/response format. Best for development and web applications.
Best For
- Web applications
- Development & testing
- Cross-platform clients
- Management operations
High-performance binary protocol with protobuf serialization. Streaming support for LLM responses.
Best For
- Production deployments
- Microservices
- LLM streaming
- High-throughput workloads
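As a minimal sketch of exercising the gRPC endpoint from the command line, the grpcurl call below assumes the server listens on port 50051, has gRPC reflection enabled, and exposes an `inference.InferenceService/Infer` method; the port, service, and method names are illustrative placeholders, not the actual generated API.

```bash
# Hypothetical gRPC inference call via grpcurl
# (port, service, and method names are assumptions, not the documented API)
grpcurl -plaintext \
  -d '{"model": "bert-base", "inputs": {"input_ids": [[101, 7592, 102]]}}' \
  localhost:50051 inference.InferenceService/Infer
```

For production clients, generated stubs from the service's .proto definitions (the "Generated" client libraries noted in the comparison table below) are the intended path.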
Unix domain socket for ultra-low latency. Same-machine communication only. Lowest possible overhead.
Best For
- Local applications
- Sidecar pattern
- Real-time inference
- Embedded systems
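If the server exposes the same REST interface over its Unix domain socket (an assumption here, as is the `/tmp/inference.sock` path), curl can exercise the IPC transport directly:

```bash
# Hypothetical IPC call over a Unix domain socket
# (socket path and REST-over-socket layout are assumptions)
curl --unix-socket /tmp/inference.sock \
  -X POST http://localhost/models/bert-base/inference \
  -H "Content-Type: application/json" \
  -d '{"inputs": {"input_ids": [[101, 7592, 102]]}}'
```

With `--unix-socket`, the hostname in the URL is only a placeholder; the connection goes over the socket file rather than TCP.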
Protocol Comparison
| Feature | HTTP REST | gRPC | IPC |
|---|---|---|---|
| Inference Latency | ~2-5ms | ~0.5-1ms | ~0.05-0.1ms |
| Serialization | JSON | Protobuf | Binary |
| Streaming Support | SSE | Native | Native |
| Network | TCP/HTTPS | HTTP/2 | Unix Socket |
| Cross-Platform | Yes | Yes | Local only |
| Client Libraries | Any HTTP | Generated | Custom |
Quick Examples
```bash
# Register a model
curl -X POST http://localhost:8080/models \
  -H "Content-Type: application/json" \
  -d '{"name": "bert-base", "path": "/models/bert.onnx"}'

# Run inference
curl -X POST http://localhost:8080/models/bert-base/inference \
  -H "Content-Type: application/json" \
  -d '{"inputs": {"input_ids": [[101, 7592, 102]]}}'

# Health check
curl http://localhost:8080/health
```
```bash
# Generate text with TinyLlama
curl -X POST http://localhost:8080/models/tinyllama/inference \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is machine learning?",
    "max_tokens": 128,
    "temperature": 0.7
  }'

# Response
{
  "text": "Machine learning is a subset of artificial intelligence...",
  "tokens_generated": 45,
  "latency_ms": 234.5
}
```
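The comparison table lists SSE as the HTTP streaming mechanism; a request along the following lines could stream tokens as they are generated, though the `stream` flag and event-stream response shown here are assumptions rather than the documented API.

```bash
# Hypothetical streaming request
# (the "stream" parameter and SSE response format are assumptions)
curl -N -X POST http://localhost:8080/models/tinyllama/inference \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{"prompt": "What is machine learning?", "max_tokens": 128, "stream": true}'
```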
Runtime Plugins
High-performance inference for ONNX models. Supports CPU and GPU execution with automatic optimization.
Native LLM execution for GGUF quantized models. Streaming generation, 4-bit quantization support, and optimized sampling.
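Registering a GGUF model presumably follows the same pattern as the ONNX registration shown in the Quick Examples; the file path and quantization suffix below are illustrative only.

```bash
# Hypothetical registration of a 4-bit GGUF model (file path is illustrative)
curl -X POST http://localhost:8080/models \
  -H "Content-Type: application/json" \
  -d '{"name": "tinyllama", "path": "/models/tinyllama-q4_k_m.gguf"}'
```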