Runtime & APIs

Three protocols for different use cases: HTTP REST, gRPC, and IPC over Unix domain sockets. Choose based on your latency and integration requirements.

🌐 HTTP REST (port 8080)

Standard REST API for easy integration. JSON request/response format. Best for development and web applications.

Latency: ~2-5ms
Throughput: High

Best For

  • Web applications
  • Development & testing
  • Cross-platform clients
  • Management operations
🔌 IPC Socket

Unix domain socket for ultra-low latency. Same-machine communication only. Lowest possible overhead (see the sketch after this card).

Latency: ~0.05-0.1ms
Throughput: Maximum

Best For

  • Local applications
  • Sidecar pattern
  • Real-time inference
  • Embedded systems
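
The documented IPC transport is a custom binary protocol (see the comparison below), but as a minimal illustration: if the daemon also accepts its HTTP/JSON routes over the socket, curl can reach it with --unix-socket. Both that assumption and the /run/inference.sock path are hypothetical, not documented values.

# Hypothetical: the socket path and HTTP-over-UDS support are assumptions,
# not documented behavior; the documented IPC protocol is binary.
curl --unix-socket /run/inference.sock \
  -X POST http://localhost/models/bert-base/inference \
  -H "Content-Type: application/json" \
  -d '{"inputs": {"input_ids": [[101, 7592, 102]]}}'

With --unix-socket, curl sends the request over the socket and ignores the URL's hostname, so localhost is just a placeholder.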

Protocol Comparison

Feature           | HTTP REST | gRPC      | IPC
------------------|-----------|-----------|------------
Inference Latency | ~2-5ms    | ~0.5-1ms  | ~0.05-0.1ms
Serialization     | JSON      | Protobuf  | Binary
Streaming Support | SSE       | Native    | Native
Network           | TCP/HTTPS | HTTP/2    | Unix Socket
Cross-Platform    | Yes       | Yes       | Local only
Client Libraries  | Any HTTP  | Generated | Custom
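
The gRPC service definitions are not listed in this section. As a sketch, if the server enables gRPC reflection you can discover them with grpcurl; both the reflection support and the :9090 port are assumptions.

# Hypothetical: port 9090 and enabled reflection are assumptions
grpcurl -plaintext localhost:9090 list

# Inspect a discovered service before calling it
grpcurl -plaintext localhost:9090 describe <service-name>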

Quick Examples

HTTP REST - Inference Request
# Register a model
curl -X POST http://localhost:8080/models \
  -H "Content-Type: application/json" \
  -d '{"name": "bert-base", "path": "/models/bert.onnx"}'

# Run inference
curl -X POST http://localhost:8080/models/bert-base/inference \
  -H "Content-Type: application/json" \
  -d '{"inputs": {"input_ids": [[101, 7592, 102]]}}'

# Health check
curl http://localhost:8080/health

LLM Generation (GGUF)
# Generate text with TinyLlama
curl -X POST http://localhost:8080/models/tinyllama/inference \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is machine learning?",
    "max_tokens": 128,
    "temperature": 0.7
  }'

# Response
{
  "text": "Machine learning is a subset of artificial intelligence...",
  "tokens_generated": 45,
  "latency_ms": 234.5
}
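
HTTP REST - Streaming (SSE)

The comparison table lists SSE as the HTTP streaming transport. Below is an illustrative streaming variant of the generation call above; the "stream" request field is an assumption, not a documented parameter.

# Hypothetical: "stream": true is an assumed request field.
# -N disables curl's output buffering so events print as they arrive.
curl -N -X POST http://localhost:8080/models/tinyllama/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is machine learning?", "max_tokens": 128, "stream": true}'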

Runtime Plugins

🔷 ONNX Runtime (Built-in)

High-performance inference for ONNX models. Supports CPU and GPU execution with automatic optimization (see the sketch below).

Formats: .onnx
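
As an illustration of how a registration might opt in to GPU execution, here is a sketch. The "options" object and "use_gpu" field are hypothetical, not documented parameters.

# Hypothetical: "options"/"use_gpu" are illustrative fields only
curl -X POST http://localhost:8080/models \
  -H "Content-Type: application/json" \
  -d '{"name": "resnet50", "path": "/models/resnet50.onnx", "options": {"use_gpu": true}}'
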
🦙 llama.cpp (Built-in)

Native LLM execution for GGUF quantized models. Streaming generation, 4-bit quantization support, and optimized sampling (registration sketch below).

Formats: .gguf
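
Registering a GGUF model presumably follows the same /models route shown in Quick Examples. A minimal sketch; the file path is illustrative.

# Register a GGUF model (path is illustrative)
curl -X POST http://localhost:8080/models \
  -H "Content-Type: application/json" \
  -d '{"name": "tinyllama", "path": "/models/tinyllama-q4.gguf"}'

Once registered, the generation request shown under Quick Examples targets the model by name.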