RAG Service

Retrieval-Augmented Generation query layer. Takes a user query, embeds it, performs vector similarity search against pgvector, optionally reranks results, and returns relevant document chunks.

This is the "read/query side" of the RAG system. The embedding-service handles the "write/indexing side."

  • Tech: NestJS 11, TypeORM, PostgreSQL (pgvector)
  • Port: 4000
  • Auth: JWT, API Key, Public
  • Database: Shared document database (reads documents, chunks, embeddings tables)

Endpoints

| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /api/v1/rag/retrieve | Public | RAG retrieval with optional reranking |
| POST | /api/v1/chunks/similarity-search | Public | Direct vector similarity search |
| GET | /api/v1/health | Public | Health check |

POST /api/v1/rag/retrieve

Main retrieval endpoint.

Request

```json
{
  "query": "How does kubernetes handle pod scheduling?",
  "grade": 0.7,
  "similarityTopK": 20,
  "docIds": ["uuid-1", "uuid-2"],
  "engineName": "default",
  "rerank": {
    "enabled": true,
    "strategy": "bge",
    "topK": 5,
    "score": 0.5
  }
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| query | string | Yes | Search query |
| grade | number | Yes | 0-1 similarity threshold |
| similarityTopK | number | No | 5-50, number of candidates from vector search |
| docIds | string[] | No | Filter results to specific documents |
| engineName | string | No | "default" (advanced) or "naive" |
| rerank | object | No | Reranking configuration |
| rerank.enabled | boolean | No | Enable reranking |
| rerank.strategy | string | No | "bge" (model-based) or "llm" (LLM-based) |
| rerank.topK | number | No | Number of results after reranking |
| rerank.score | number | No | Minimum rerank score threshold |
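A minimal client-side sketch of calling this endpoint. The helper and its clamping behavior are illustrative, not part of the service; only the field names and ranges come from the table above.

```typescript
// Illustrative helper: builds a /api/v1/rag/retrieve request body,
// clamping fields to the documented ranges (grade 0-1, similarityTopK 5-50).
interface RerankOptions {
  enabled?: boolean;
  strategy?: "bge" | "llm";
  topK?: number;
  score?: number;
}

interface RetrieveBody {
  query: string;
  grade: number;
  similarityTopK?: number;
  docIds?: string[];
  engineName?: "default" | "naive";
  rerank?: RerankOptions;
}

function buildRetrieveBody(
  query: string,
  grade: number,
  opts: Partial<Omit<RetrieveBody, "query" | "grade">> = {}
): RetrieveBody {
  const body: RetrieveBody = {
    query,
    // grade is a 0-1 similarity threshold
    grade: Math.min(1, Math.max(0, grade)),
    ...opts,
  };
  if (body.similarityTopK !== undefined) {
    // documented range: 5-50 candidates from vector search
    body.similarityTopK = Math.min(50, Math.max(5, body.similarityTopK));
  }
  return body;
}

// Usage (assumes the service is reachable on its default port 4000):
// const res = await fetch("http://localhost:4000/api/v1/rag/retrieve", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(
//     buildRetrieveBody("How does kubernetes handle pod scheduling?", 0.7)
//   ),
// });
```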

Response

```json
{
  "results": [
    {
      "content": "Kubernetes uses a scheduler that...",
      "score": 0.89,
      "metadata": {
        "documentId": "uuid",
        "fileName": "k8s-docs.pdf",
        "chunkIndex": 5,
        "pageNumber": 12,
        "contentType": "text"
      }
    }
  ],
  "query": "How does kubernetes handle pod scheduling?",
  "totalResults": 3,
  "engineUsed": "advanced",
  "pipeline": {
    "reranked": true,
    "rerankStrategy": "bge",
    "candidatesConsidered": 20
  }
}
```
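For consumers in TypeScript, the response shape above can be typed as follows. These interfaces are inferred from the example payload, not exported by the service; the `topResult` helper is purely illustrative.

```typescript
// Typings inferred from the example response payload.
interface ChunkMetadata {
  documentId: string;
  fileName: string;
  chunkIndex: number;
  pageNumber?: number;
  contentType: string;
}

interface RetrievalResult {
  content: string;
  score: number;
  metadata: ChunkMetadata;
}

interface RetrieveResponse {
  results: RetrievalResult[];
  query: string;
  totalResults: number;
  engineUsed: string;
  pipeline: {
    reranked: boolean;
    rerankStrategy?: string;
    candidatesConsidered: number;
  };
}

// Small helper: pick the best-scoring chunk from a response.
function topResult(res: RetrieveResponse): RetrievalResult | undefined {
  return [...res.results].sort((a, b) => b.score - a.score)[0];
}
```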

Retrieval Engines

Advanced (default):

  1. Embed the query via embedding-service
  2. Run pgvector cosine similarity search
  3. Rerank candidates (BGE model or LLM-based)
  4. Filter by rerank score threshold
  5. Return top-K results

Naive:

  1. Embed the query via embedding-service
  2. Run pgvector cosine similarity search
  3. Filter by grade (similarity threshold)
  4. Return results directly (no reranking)
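The two pipelines above can be sketched as plain functions with the HTTP calls injected as dependencies. This is a simplified model of the flow, not the service's actual code; the default candidate count (20) and thresholds mirror the example request.

```typescript
// Sketch of the two retrieval engines. embed/search/rerank stand in for
// the HTTP calls to embedding-service, pgvector, and the reranker.
type Candidate = { content: string; score: number };

interface Deps {
  embed: (query: string) => Promise<number[]>;
  search: (vec: number[], topK: number) => Promise<Candidate[]>;
  rerank?: (query: string, cands: Candidate[]) => Promise<Candidate[]>;
}

// Naive: embed, vector search, filter by similarity grade, return directly.
async function retrieveNaive(
  query: string,
  grade: number,
  deps: Deps
): Promise<Candidate[]> {
  const vec = await deps.embed(query);
  const candidates = await deps.search(vec, 20);
  return candidates.filter((c) => c.score >= grade);
}

// Advanced: embed, vector search, rerank, filter by rerank score, take top-K.
async function retrieveAdvanced(
  query: string,
  deps: Deps,
  topK = 5,
  minRerankScore = 0.5
): Promise<Candidate[]> {
  const vec = await deps.embed(query);
  const candidates = await deps.search(vec, 20);
  const reranked = deps.rerank
    ? await deps.rerank(query, candidates)
    : candidates;
  return reranked
    .filter((c) => c.score >= minRerankScore)
    .slice(0, topK);
}
```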

POST /api/v1/chunks/similarity-search

Low-level vector search. Accepts a pre-computed embedding vector and returns matching chunks.

Request

```json
{
  "queryEmbedding": [0.0123, -0.0456, ...],
  "topK": 10,
  "docIds": ["uuid-1"]
}
```

Uses pgvector's `<=>` cosine distance operator directly.
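The kind of parameterized SQL this endpoint would run can be sketched as below. The table and column names (`chunks`, `document_id`, `embedding`) are assumptions for illustration; only the `<=>` operator usage comes from the text above.

```typescript
// Illustrative: builds the parameterized pgvector query behind a
// similarity search. `<=>` is cosine distance; 1 - distance = similarity.
// Table/column names are assumptions, not the service's actual schema.
function buildSimilaritySql(filterByDocIds: boolean): string {
  const where = filterByDocIds ? "WHERE chunk.document_id = ANY($2)" : "";
  const limitParam = filterByDocIds ? "$3" : "$2";
  return `
    SELECT chunk.id, chunk.content,
           1 - (chunk.embedding <=> $1::vector) AS similarity
    FROM chunks chunk
    ${where}
    ORDER BY chunk.embedding <=> $1::vector
    LIMIT ${limitParam}
  `;
}

// With TypeORM this would typically run as a raw query, e.g.:
// const rows = await dataSource.query(buildSimilaritySql(true),
//   [JSON.stringify(queryEmbedding), docIds, topK]);
```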

Note: The rag-service entity declares the embedding column as vector(1536), but the actual vectors stored by embedding-service are 1024-dimensional. This is a stale value in the rag-service entity that should be updated to 1024. It does not affect runtime behavior because pgvector handles the cosine distance calculation regardless of the declared column length.

Reranking Strategies

BGE Model Reranking

Calls an external reranker service (POST /v1/rerank) that runs a BGE cross-encoder model. Scores each candidate against the query and re-orders by relevance.

  • Timeout: 30s
  • Retry: 2 attempts
  • Configured via RERANKER_SERVICE_URL
  • Optional -- RAG works without it
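The call policy above (bounded attempts, graceful fallback when the optional reranker is down) can be sketched as a small wrapper. This is a generic pattern, not the service's actual implementation; in practice the timeout would be enforced per attempt, e.g. with `AbortSignal.timeout(30_000)` on the fetch.

```typescript
// Sketch of the documented call policy: retry up to `attempts` times,
// and treat persistent failure as "no reranker" rather than an error,
// since BGE reranking is optional.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 2
): Promise<T | undefined> {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch {
      // swallow and retry; on final failure fall through
    }
  }
  // caller falls back to raw vector-search ordering
  return undefined;
}
```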

LLM Reranking

Uses the completion service to ask an LLM to score and rank candidates. More accurate but slower and more expensive.
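One common shape for LLM-based reranking is to ask the model to score every candidate in a single completion and parse the reply. The prompt wording and reply format below are assumptions for illustration, not the service's actual prompt.

```typescript
// Illustrative LLM rerank flow: build one scoring prompt for all
// candidates, then re-order by the scores the LLM returns.
// Prompt text and JSON reply format are assumptions.
function buildRerankPrompt(query: string, candidates: string[]): string {
  const numbered = candidates.map((c, i) => `[${i}] ${c}`).join("\n");
  return (
    `Score each passage 0-1 for relevance to the query.\n` +
    `Query: ${query}\nPassages:\n${numbered}\n` +
    `Reply as JSON: [{"index": 0, "score": 0.9}, ...]`
  );
}

// Parse the LLM's JSON reply and re-order candidates best-first.
function applyLlmScores(candidates: string[], reply: string): string[] {
  const scores: { index: number; score: number }[] = JSON.parse(reply);
  return scores
    .sort((a, b) => b.score - a.score)
    .map((s) => candidates[s.index]);
}
```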

Inter-Service Communication

| Target | Protocol | Purpose |
|---|---|---|
| embedding-service | HTTP | POST /api/v1/embed/single -- Embed query text |
| completion-service | HTTP | POST /api/v1/completions -- LLM reranking |
| reranker-service | HTTP | POST /v1/rerank -- BGE model reranking (optional) |
| PostgreSQL (shared document database) | SQL | Cosine similarity search via pgvector |