Standard REST API for easy integration. JSON request/response format. Best for development and web applications.
Best For
- Web applications
- Development & testing
- Cross-platform clients
- Management operations
High-performance binary protocol with protobuf serialization. Streaming support for LLM responses.
Best For
- Production deployments
- Microservices
- LLM streaming
- High-throughput workloads
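As a minimal sketch of exercising the gRPC endpoint from the command line, the grpcurl call below assumes the server listens on port 50051, has gRPC reflection enabled, and exposes an `inference.InferenceService/Infer` method; the port, service, and method names are illustrative placeholders, not the actual generated API.

```bash
# Hypothetical gRPC inference call via grpcurl
# (port, service, and method names are assumptions, not the documented API)
grpcurl -plaintext \
  -d '{"model": "bert-base", "inputs": {"input_ids": [[101, 7592, 102]]}}' \
  localhost:50051 inference.InferenceService/Infer
```

For production clients, generated stubs from the service's .proto definitions (the "Generated" client libraries noted in the comparison table below) are the intended path.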
Unix domain socket for ultra-low latency. Same-machine communication only. Lowest possible overhead.
Best For
- Local applications
- Sidecar pattern
- Real-time inference
- Embedded systems
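If the server exposes the same REST interface over its Unix domain socket (an assumption here, as is the `/tmp/inference.sock` path), curl can exercise the IPC transport directly:

```bash
# Hypothetical IPC call over a Unix domain socket
# (socket path and REST-over-socket layout are assumptions)
curl --unix-socket /tmp/inference.sock \
  -X POST http://localhost/models/bert-base/inference \
  -H "Content-Type: application/json" \
  -d '{"inputs": {"input_ids": [[101, 7592, 102]]}}'
```

With `--unix-socket`, the hostname in the URL is only a placeholder; the connection goes over the socket file rather than TCP.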
Protocol Comparison
| Feature | HTTP REST | gRPC | IPC |
|---|---|---|---|
| Inference Latency | ~2-5ms | ~0.5-1ms | ~0.05-0.1ms |
| Serialization | JSON | Protobuf | Binary |
| Streaming Support | SSE | Native | Native |
| Network | TCP/HTTPS | HTTP/2 | Unix Socket |
| Cross-Platform | Yes | Yes | Local only |
| Client Libraries | Any HTTP | Generated | Custom |
Quick Examples
```bash
# Register a model
curl -X POST http://localhost:8080/models \
  -H "Content-Type: application/json" \
  -d '{"name": "bert-base", "path": "/models/bert.onnx"}'

# Run inference
curl -X POST http://localhost:8080/models/bert-base/inference \
  -H "Content-Type: application/json" \
  -d '{"inputs": {"input_ids": [[101, 7592, 102]]}}'

# Health check
curl http://localhost:8080/health
```
```bash
# Generate text with TinyLlama
curl -X POST http://localhost:8080/models/tinyllama/inference \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is machine learning?",
    "max_tokens": 128,
    "temperature": 0.7
  }'

# Response
{
  "text": "Machine learning is a subset of artificial intelligence...",
  "tokens_generated": 45,
  "latency_ms": 234.5
}
```
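The comparison table lists SSE as the HTTP streaming mechanism; a request along the following lines could stream tokens as they are generated, though the `stream` flag and event-stream response shown here are assumptions rather than the documented API.

```bash
# Hypothetical streaming request
# (the "stream" parameter and SSE response format are assumptions)
curl -N -X POST http://localhost:8080/models/tinyllama/inference \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{"prompt": "What is machine learning?", "max_tokens": 128, "stream": true}'
```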
Runtime Plugins
High-performance inference for ONNX models. Supports CPU and GPU execution with automatic optimization.
Native LLM execution for GGUF quantized models. Streaming generation, 4-bit quantization support, and optimized sampling.
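Registering a GGUF model presumably follows the same pattern as the ONNX registration shown in the Quick Examples; the file path and quantization suffix below are illustrative only.

```bash
# Hypothetical registration of a 4-bit GGUF model (file path is illustrative)
curl -X POST http://localhost:8080/models \
  -H "Content-Type: application/json" \
  -d '{"name": "tinyllama", "path": "/models/tinyllama-q4_k_m.gguf"}'
```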