Cortex.llamacpp

warning

🚧 Cortex is under construction.

Cortex.llamacpp is a C++ inference library that can be loaded by any server at runtime. It includes llama.cpp as a submodule (and occasionally upstreams changes to it) for GGUF inference.

In addition to llama.cpp, cortex.llamacpp adds:

  • OpenAI compatibility for the stateless endpoints
  • Model orchestration, such as model warm-up and running concurrent models
info

Cortex.llamacpp was formerly called “Nitro”.

If you already use Jan or Cortex, cortex.llamacpp is bundled by default and you don’t need this guide. This guide walks you through using cortex.llamacpp as a standalone library in any custom C++ server.

Usage

To include cortex.llamacpp in your own server implementation, follow this server example.
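
At runtime, a host server typically loads cortex.llamacpp as a shared library and retrieves the engine interface from it. The sketch below illustrates that flow with dlopen/dlsym; the library file name, the exported factory symbol (create_engine), and the EngineI class name are assumptions for illustration, so refer to the server example for the exact symbols.

    // Minimal sketch: load the engine library at runtime and obtain the engine
    // interface. File and symbol names are assumptions, not the verified API.
    #include <dlfcn.h>

    #include <iostream>

    #include "cortex-common/enginei.h"  // engine interface definition (assumed path)

    int main() {
      void* handle = dlopen("./libengine.so", RTLD_LAZY);  // assumed library name
      if (handle == nullptr) {
        std::cerr << "dlopen failed: " << dlerror() << "\n";
        return 1;
      }
      // Hypothetical extern "C" factory exported by the engine library.
      auto create_engine =
          reinterpret_cast<EngineI* (*)()>(dlsym(handle, "create_engine"));
      if (create_engine == nullptr) {
        std::cerr << "symbol lookup failed: " << dlerror() << "\n";
        return 1;
      }
      EngineI* engine = create_engine();
      // ... wire engine->LoadModel / engine->HandleChatCompletion into your routes ...
      dlclose(handle);
      return 0;
    }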

Interface

Cortex.llamacpp exposes the following interfaces:

  • HandleChatCompletion: Processes chat completion tasks

    void HandleChatCompletion(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • HandleEmbedding: Generates embeddings for the input data provided

    void HandleEmbedding(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • LoadModel: Loads a model based on the specifications

    void LoadModel(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • UnloadModel: Unloads a model as specified

    void UnloadModel(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

  • GetModelStatus: Retrieves the status of a model

    void GetModelStatus(
        std::shared_ptr<Json::Value> jsonBody,
        std::function<void(Json::Value&&, Json::Value&&)>&& callback);

Parameters:

  • jsonBody: The request content in JSON format.
  • callback: A function that handles the response.
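
Taken together, a host server drives the engine by filling a Json::Value request and passing a callback that receives two Json::Value results (assumed here to be a status object and the response payload). The sketch below builds a LoadModel request followed by an OpenAI-style chat completion; the request field names (llama_model_path, ctx_len, ngl, messages, stream) are assumptions for illustration and may differ from the engine's actual schema.

    // Illustrative only: field names and the meaning of the two callback values
    // are assumptions, not the verified request/response schema.
    #include <functional>
    #include <iostream>
    #include <memory>

    #include <json/json.h>

    #include "cortex-common/enginei.h"  // assumed header for the engine interface

    void RunDemo(EngineI* engine) {
      // Load a GGUF model (field names assumed).
      auto load_req = std::make_shared<Json::Value>();
      (*load_req)["llama_model_path"] = "/models/example.Q4_K_M.gguf";
      (*load_req)["ctx_len"] = 4096;
      (*load_req)["ngl"] = 33;
      engine->LoadModel(load_req, [](Json::Value&& status, Json::Value&& result) {
        std::cout << "load: " << result.toStyledString();
      });

      // OpenAI-style chat completion request.
      auto chat_req = std::make_shared<Json::Value>();
      Json::Value msg;
      msg["role"] = "user";
      msg["content"] = "Hello!";
      (*chat_req)["messages"].append(msg);
      (*chat_req)["stream"] = false;
      engine->HandleChatCompletion(
          chat_req, [](Json::Value&& status, Json::Value&& result) {
            std::cout << "chat: " << result.toStyledString();
          });
    }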

Architecture

The main components include:

  • enginei: the engine interface definition that all engines implement; it handles endpoint logic and facilitates communication between cortex.cpp and the llama engine (see the sketch after this list).
  • llama engine: exposes APIs for embedding and inference. It loads and unloads models and simplifies API calls to llama.cpp.
  • llama.cpp: submodule from the llama.cpp repository that provides the core functionality for embeddings and inference.
  • llama server context: a wrapper that offers a simpler, more user-friendly interface to the llama.cpp APIs.
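
Based on the signatures listed under Interface, the engine interface defined in base/cortex-common/enginei.h can be pictured as an abstract class along the lines of the sketch below. This is a reconstruction for illustration; the class name used here (EngineI) and any additional members should be checked against the actual header.

    // Reconstruction of the engine interface from the documented signatures.
    // The class name is an assumption; see base/cortex-common/enginei.h.
    #pragma once

    #include <functional>
    #include <memory>

    #include <json/json.h>

    class EngineI {
     public:
      virtual ~EngineI() = default;

      // Processes chat completion tasks.
      virtual void HandleChatCompletion(
          std::shared_ptr<Json::Value> jsonBody,
          std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

      // Generates embeddings for the provided input.
      virtual void HandleEmbedding(
          std::shared_ptr<Json::Value> jsonBody,
          std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

      // Loads a model based on the given specifications.
      virtual void LoadModel(
          std::shared_ptr<Json::Value> jsonBody,
          std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

      // Unloads the specified model.
      virtual void UnloadModel(
          std::shared_ptr<Json::Value> jsonBody,
          std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;

      // Retrieves the status of a model.
      virtual void GetModelStatus(
          std::shared_ptr<Json::Value> jsonBody,
          std::function<void(Json::Value&&, Json::Value&&)>&& callback) = 0;
    };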

[Diagram: cortex.llamacpp architecture]

Communication Protocols

  • Streaming: Responses are processed and returned one token at a time.
  • RESTful: The response is processed as a whole. After the llama server context completes the entire process, it returns a single result back to cortex.cpp.
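
In practice the difference shows up in how often the completion callback fires. A minimal sketch, assuming a "stream" flag on the request and an "is_done" field on the status value (both are assumptions, not the verified schema):

    // Sketch of consuming the two modes; "stream" and "is_done" are assumed names.
    #include <iostream>
    #include <memory>

    #include <json/json.h>

    #include "cortex-common/enginei.h"  // assumed header for the engine interface

    void ChatStreaming(EngineI* engine, std::shared_ptr<Json::Value> req) {
      (*req)["stream"] = true;
      engine->HandleChatCompletion(
          req, [](Json::Value&& status, Json::Value&& chunk) {
            // Streaming: the callback fires once per generated token/chunk.
            std::cout << chunk.toStyledString();
            if (status.get("is_done", false).asBool()) {
              std::cout << "[stream finished]\n";
            }
          });
    }

    void ChatRestful(EngineI* engine, std::shared_ptr<Json::Value> req) {
      (*req)["stream"] = false;
      engine->HandleChatCompletion(
          req, [](Json::Value&& /*status*/, Json::Value&& result) {
            // RESTful: the callback fires once with the complete response.
            std::cout << result.toStyledString() << "\n";
          });
    }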


Code Structure


.
├── base                              # Engine interface definition
│   └── cortex-common                 # Common interfaces used for all engines
│       └── enginei.h                 # Defines abstract classes and interface methods for engines
├── examples                          # Server example to integrate the engine
│   └── server.cc                     # Example server demonstrating engine integration
├── llama.cpp                         # Upstream llama.cpp repository
│   └── (files from upstream llama.cpp)
├── src                               # Source implementation for llama.cpp
│   ├── chat_completion_request.h     # OpenAI-compatible request handling
│   ├── llama_client_slot             # Manages the vector of slots for parallel processing
│   ├── llama_engine                  # Implementation of the llama.cpp engine for model loading and inference
│   └── llama_server_context          # Context management for chat completion requests
│       ├── slot                      # Struct for slot management
│       ├── llama_context             # Struct for llama context management
│       ├── chat_completion           # Struct for chat completion management
│       └── embedding                 # Struct for embedding management
└── third-party                       # Dependencies of the cortex.llamacpp project
    └── (list of third-party dependencies)

Runtime

Roadmap

The future plans for Cortex.llamacpp are focused on enhancing performance and expanding capabilities. Key areas of improvement include:

  • Performance Enhancements: Optimizing speed and reducing memory usage to ensure efficient processing of tasks.
  • Multimodal Model Compatibility: Expanding support to include a variety of multimodal models, enabling a broader range of applications and use cases.

To follow the latest developments, see the cortex.llamacpp GitHub repository.