RAG Query Flow
End-to-end flow from a user query to relevant document chunks via vector similarity search and reranking.
Services Involved
rag-service, embedding-service, completion-service (for LLM reranking), reranker-service (optional, for BGE reranking), PostgreSQL (pgvector)
Prerequisites
Documents must already be processed (status: PROCESSED). This means they have been parsed and embedded via the Document Processing Flow.
Steps (Advanced Engine -- Default)
1. Receive Query
The caller (typically the MCP RAG tool, or a client hitting the API directly) calls POST /api/v1/rag/retrieve on rag-service:
{
  "query": "How does kubernetes handle pod scheduling?",
  "grade": 0.7,
  "similarityTopK": 20,
  "docIds": ["uuid-1", "uuid-2"],
  "rerank": {
    "enabled": true,
    "strategy": "bge",
    "topK": 5,
    "score": 0.5
  }
}
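As a sketch, the payload above could be assembled like this (the defaults shown are illustrative assumptions, not necessarily the service's actual defaults):

```python
def build_retrieve_request(query, grade=0.7, similarity_top_k=20,
                           doc_ids=None, rerank=None):
    """Assemble a /api/v1/rag/retrieve payload.

    The default values here are illustrative; docIds and rerank are
    optional and omitted from the payload when not supplied.
    """
    payload = {"query": query, "grade": grade, "similarityTopK": similarity_top_k}
    if doc_ids:
        payload["docIds"] = doc_ids
    if rerank:
        payload["rerank"] = rerank
    return payload
```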
2. Embed the Query
rag-service calls embedding-service to convert the query text into a vector:
rag-service --POST /api/v1/embed/single--> embedding-service
embedding-service calls the OpenAI/Azure OpenAI embedding API and returns a 1024-dimension vector.
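A minimal sketch of this hop, assuming the request carries a "text" field and the response an "embedding" field (both field names are assumptions about the service contract; the HTTP client is injected so the sketch stays transport-agnostic):

```python
def embed_query(query: str, post) -> list[float]:
    """Convert the query text into a vector via embedding-service.

    `post` is any callable (path, json_body) -> parsed JSON response,
    e.g. a thin wrapper around an HTTP library. The "text"/"embedding"
    field names are assumptions, not the documented contract.
    """
    resp = post("/api/v1/embed/single", {"text": query})
    vector = resp["embedding"]
    if len(vector) != 1024:  # the embedding API returns 1024-dim vectors
        raise ValueError(f"expected 1024 dimensions, got {len(vector)}")
    return vector
```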
3. Vector Similarity Search
rag-service queries PostgreSQL directly using pgvector:
SELECT c.*, e.embedding <=> query_vector AS distance
FROM embeddings e
JOIN chunks c ON e.chunk_id = c.id
WHERE e.document_id IN (:docIds)
ORDER BY distance ASC
LIMIT :similarityTopK
The <=> operator computes cosine distance (1 minus cosine similarity), so ordering ascending returns the closest chunks first. The HNSW index on the embedding column keeps this search fast.
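For intuition, a pure-Python equivalent of pgvector's <=> operator:

```python
import math

def cosine_distance(a, b):
    """Pure-Python equivalent of pgvector's <=> operator:
    1 - (a . b) / (|a| * |b|). Lower means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Parallel vectors -> 0.0, orthogonal -> 1.0, opposite -> 2.0
```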
4. Rerank Results
If reranking is enabled, candidates are re-scored for higher precision.
BGE strategy: Calls an external reranker service:
rag-service --POST /v1/rerank--> reranker-service
The BGE cross-encoder model scores each candidate against the original query and re-orders them.
LLM strategy: Calls completion-service:
rag-service --POST /api/v1/completions--> completion-service
Asks the LLM to score and rank the candidates. This is typically more accurate than BGE reranking, but slower and more expensive.
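Whichever strategy is used, the outcome is one relevance score per candidate, and re-ordering reduces to a sort. A sketch (the rerankScore field name is an assumption):

```python
def reorder_by_rerank(candidates, scores):
    """Attach reranker scores to the vector-search candidates and
    re-order them best-first. `scores` is parallel to `candidates`."""
    scored = [dict(c, rerankScore=s) for c, s in zip(candidates, scores)]
    return sorted(scored, key=lambda c: c["rerankScore"], reverse=True)
```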
5. Filter and Return
After reranking, results below the score threshold are removed. The top-K results are returned:
{
  "results": [
    {
      "content": "Kubernetes uses a scheduler that watches...",
      "score": 0.89,
      "metadata": {
        "documentId": "uuid-1",
        "fileName": "k8s-guide.pdf",
        "chunkIndex": 12,
        "pageNumber": 45,
        "contentType": "text"
      }
    }
  ],
  "totalResults": 3,
  "engineUsed": "advanced",
  "pipeline": {
    "reranked": true,
    "rerankStrategy": "bge",
    "candidatesConsidered": 20
  }
}
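The filter-and-truncate step can be sketched as follows, assuming each reranked candidate carries a score field:

```python
def filter_and_truncate(reranked, score_threshold, top_k):
    """Drop candidates below the rerank score threshold, then keep the
    top_k best-first. Mirrors the rerank.score / rerank.topK request
    parameters from the retrieve payload."""
    kept = [c for c in reranked if c["score"] >= score_threshold]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return kept[:top_k]
```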
Steps (Naive Engine)
A simpler pipeline without reranking:
- Embed the query (same as step 2 above)
- Vector similarity search (same as step 3)
- Filter results by the grade threshold (cosine similarity score)
- Return results directly
Use the naive engine by setting "engineName": "naive" in the request.
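The naive engine's grade filter can be sketched like this, converting pgvector's cosine distance back into a similarity before comparing against grade (that conversion is an assumption about how grade is applied):

```python
def naive_grade_filter(rows, grade):
    """Keep rows whose cosine similarity (1 - cosine distance) meets the
    grade threshold. `rows` are (content, distance) pairs from the
    similarity search, already ordered closest-first."""
    return [(content, 1.0 - distance)
            for content, distance in rows
            if 1.0 - distance >= grade]
```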
Diagram
Caller (MCP RAG tool or direct)
|
| POST /api/v1/rag/retrieve
v
rag-service
|
| POST /api/v1/embed/single
v
embedding-service --> OpenAI Embedding API
|
| Returns 1024-dim vector
v
rag-service
|
| SQL: cosine similarity search (pgvector)
v
PostgreSQL (document database)
|
| Top-K candidates
v
rag-service
|
|--- BGE rerank? --> reranker-service (POST /v1/rerank)
|--- LLM rerank? --> completion-service (POST /api/v1/completions)
|--- No rerank? --> filter by grade
|
v
Filtered, ranked results --> Caller