Completion Service
LLM gateway that routes completion requests to the correct provider. Handles streaming, normalizes responses across providers, and emits token usage events.
- Tech: NestJS 11, RabbitMQ (producer)
- Port: 4000 (may be overridden by deployment configuration)
- Auth: JWT, API Key, Public
- Database: None (stateless)
Endpoints
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /api/v1/completions | Public | Execute an LLM completion |
| GET | /api/v1/health | Public | Health check |
| GET | /metrics | Public | Prometheus metrics |
POST /api/v1/completions
The single core endpoint. It accepts a completion request, routes it to the correct LLM provider, and returns the response.
Request
{
"model": "gpt-4o",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Hello" }
],
"stream": true,
"temperature": 0.7,
"maxOutputTokens": 4096,
"tools": [],
"toolChoice": "auto"
}
Non-Streaming Response
{
"outputText": "Hello! How can I help you?",
"output": [],
"usage": {
"inputTokens": 25,
"outputTokens": 8
},
"model": "gpt-4o",
"finishReason": "stop"
}
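For illustration, a minimal non-streaming call might look like the following TypeScript sketch. The base URL (http://localhost:4000) and the absence of an auth header are assumptions; the endpoint itself is Public.

// Hypothetical client sketch; base URL is an assumption.
async function complete(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:4000/api/v1/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
      stream: false,
      temperature: 0.7,
      maxOutputTokens: 4096
    })
  });
  if (!res.ok) throw new Error(`Completion failed with status ${res.status}`);
  const data = await res.json();
  // Non-streaming responses carry the full text in outputText
  return data.outputText;
}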
Streaming Response (NDJSON)
When stream: true, the response is a stream of newline-delimited JSON chunks:
{"type":"text_delta_start","content":"Hello"}
{"type":"text_delta","content":"! How"}
{"type":"text_delta","content":" can I help"}
{"type":"text_delta_end","content":""}
{"type":"usage","inputTokens":25,"outputTokens":8}
{"type":"finish","finishReason":"stop"}
Chunk types:
- text_delta_start / text_delta / text_delta_end -- Text generation
- tool_call_start / tool_call_delta / tool_call_end -- Tool/function calls
- usage -- Token counts
- finish -- Completion finished
- error -- Error occurred
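A sketch of how a client might consume the NDJSON stream, accumulating text deltas until the stream ends. The base URL is an assumption, as above.

// Sketch of consuming the newline-delimited JSON stream.
async function streamCompletion(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:4000/api/v1/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "gpt-4o", messages: [{ role: "user", content: prompt }], stream: true })
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  let text = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // hold back a possibly partial trailing line
    for (const line of lines) {
      if (!line.trim()) continue;
      const chunk = JSON.parse(line);
      // text_delta_start carries the first piece of text; deltas follow until text_delta_end
      if (chunk.type === "text_delta_start" || chunk.type === "text_delta") {
        text += chunk.content;
      } else if (chunk.type === "error") {
        throw new Error("completion stream reported an error");
      }
    }
  }
  return text;
}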
Supported Providers
| Provider | Key | Notes |
|---|---|---|
| OpenAI | openai | GPT-4, GPT-4o, GPT-3.5, etc. |
| Azure OpenAI | azure_openai | Azure-hosted OpenAI models |
| Azure OpenAI Completions | azure_openai_completions | Azure completions API variant |
| Anthropic | anthropic | Claude models |
| Anthropic Bedrock | anthropic_bedrock | Claude via AWS Bedrock |
| Google | google | Gemini models |
| Mistral | mistral | Mistral models |
| Jamba | jamba | AI21 Jamba models |
| Ollama | ollama | Local models via Ollama |
| vLLM | vllm | Self-hosted models via vLLM |
| Remote Custom | remote_completion | Any OpenAI-compatible endpoint |
Model Configuration
At startup, the service fetches its model/provider configuration from llm-core:
GET /api/v1/services-config/models-providers
This returns the full mapping of which models are available from which providers, along with provider-specific configuration (API keys, endpoints, deployment names). Alternatively, a local JSON config file can be used via LOCAL_MODEL_PROVIDERS_CONFIG_PATH.
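A rough startup sketch of this behavior in TypeScript. The LLM_CORE_URL environment variable name is an assumption; the shape of the returned configuration is defined by llm-core and not reproduced here.

import { readFileSync } from "node:fs";

async function loadModelProvidersConfig(): Promise<unknown> {
  // Prefer a local JSON file when LOCAL_MODEL_PROVIDERS_CONFIG_PATH is set
  const localPath = process.env.LOCAL_MODEL_PROVIDERS_CONFIG_PATH;
  if (localPath) {
    return JSON.parse(readFileSync(localPath, "utf8"));
  }
  // Otherwise fetch the model/provider mapping from llm-core at startup
  // (LLM_CORE_URL is an assumed variable name for the llm-core base URL)
  const res = await fetch(`${process.env.LLM_CORE_URL}/api/v1/services-config/models-providers`);
  if (!res.ok) throw new Error(`Failed to fetch model/provider config: ${res.status}`);
  return res.json();
}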
Token Usage Tracking
After every completion (streaming or non-streaming), the service publishes a transaction event to RabbitMQ:
{
"model": "gpt-4o",
"provider": "openai",
"inputTokens": 250,
"outputTokens": 150,
"totalTokens": 400,
"reasoningTokens": 0,
"deploymentName": "gpt-4o",
"stream": true,
"temperature": 0.7,
"maxOutputTokens": 4096,
"toolsCount": 2,
"timestamp": "2026-01-15T10:30:00.000Z",
"messages": [{ "role": "user", "content": "..." }],
"context": { "userId": "uuid", "conversationId": "uuid" }
}
The event includes all completion parameters (temperature, topP, topK, responseFormat, toolChoice, etc.) and the full context object passed in the original request. llm-core consumes these events and stores them in the model_transactions table.
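As a sketch only, publishing such an event with amqplib could look like the following; the queue name and the RABBITMQ_URL variable are illustrative assumptions, not values taken from the service.

import * as amqp from "amqplib";

async function publishTransactionEvent(event: Record<string, unknown>): Promise<void> {
  // Connection URL and queue name are assumptions for illustration
  const conn = await amqp.connect(process.env.RABBITMQ_URL ?? "amqp://localhost");
  const channel = await conn.createChannel();
  const queue = "model_transaction_events";
  await channel.assertQueue(queue, { durable: true });
  // Publish the transaction event as a persistent JSON message
  channel.sendToQueue(queue, Buffer.from(JSON.stringify(event)), { persistent: true });
  await channel.close();
  await conn.close();
}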
Inter-Service Communication
| Target | Protocol | Purpose |
|---|---|---|
| llm-core | HTTP (GET) | Fetch model/provider configuration at startup |
| LLM APIs (external) | HTTPS | Send completions to OpenAI, Anthropic, Google, etc. |
| RabbitMQ | AMQP (producer) | Emit transaction events |