
Chat Completions

🚧 Cortex is under construction.

Cortex's Chat API is compatible with OpenAI's Chat Completions endpoint, making it a drop-in replacement for OpenAI when running local inference.
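For example, a minimal chat request against a locally running Cortex server might look like the sketch below. The base URL (here assumed to be http://localhost:1337) depends on how your server is configured; the request and response bodies follow the standard OpenAI Chat Completions schema.

# Assumption: the Cortex server is running locally on port 1337; adjust to your setup.
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "janhq/TinyLlama-1.1B-Chat-v1.0-GGUF",
    "messages": [
      {"role": "user", "content": "Hello, what can you do?"}
    ]
  }'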

For local inference, Cortex is multi-engine and supports the following model formats:

  • GGUF: A generalizable LLM format that runs across CPUs and GPUs. Cortex implements a GGUF runtime through llama.cpp.
  • TensorRT: A production-ready, enterprise-grade LLM format optimized for fast inference on NVIDIA GPUs. Cortex implements a TensorRT runtime through TensorRT-LLM.

Cortex routes requests to multiple APIs for remote inference while providing a single, easy-to-use, OpenAI-compatible endpoint.

Usage


# Streaming chat with a local GGUF model (output streams to the terminal)
cortex chat --model janhq/TinyLlama-1.1B-Chat-v1.0-GGUF
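
If you are calling the server over HTTP instead of the CLI, streaming uses the standard OpenAI-style stream flag. The sketch below assumes the same local base URL as in the earlier example and returns incremental chunks rather than a single JSON response.

# Assumption: local server at http://localhost:1337; "stream": true requests
# OpenAI-style streamed chunks instead of one complete JSON body.
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "janhq/TinyLlama-1.1B-Chat-v1.0-GGUF",
    "messages": [{"role": "user", "content": "Tell me a short story."}],
    "stream": true
  }'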

Capabilities

Multiple Local Engines

Cortex scales applications from prototype to production, running on CPU-only laptops via llama.cpp and on GPU-accelerated hardware via TensorRT-LLM.

To configure each engine, refer to:

  • Use llama.cpp
  • Use tensorrt-llm

Learn more about our engine architecture:

  • cortex.cpp
  • cortex.llamacpp
  • cortex.tensorRTLLM

Multiple Remote APIs

Cortex also acts as an aggregator, routing remote inference requests from a single endpoint to multiple providers. Currently, Cortex supports the following providers (an illustrative request sketch follows the list):

  • OpenAI
  • Groq
  • Cohere
  • Anthropic
  • MistralAI
  • Martian
  • OpenRouter
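
Because remote providers are reached through the same OpenAI-compatible endpoint, a remote request looks like a local one with a different model identifier. The sketch below is illustrative only: the model name "gpt-4o-mini" and any required provider credentials are assumptions, so consult the provider-specific setup docs for the exact configuration.

# Hypothetical example: routing to a remote provider through the same local endpoint.
# The model identifier and provider API-key configuration are assumptions here.
curl http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Summarize the benefits of a unified endpoint."}]
  }'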
Learn more about Chat Completions capabilities: