Parser Service
Converts documents (PDF, DOCX, HTML, etc.) to markdown. Supports four parser backends. Runs both a synchronous HTTP API and an asynchronous RabbitMQ worker for queue-based processing.
- Tech: Python, FastAPI, SQLAlchemy, RabbitMQ (aio-pika)
- Port: 4004
- Auth: None (internal service)
- Database: Shared document database (parsing_jobs table, reads documents table)
Endpoints
| Method | Path | Rate Limit | Description |
|---|---|---|---|
| POST | /api/v1/parser/parse | 10/min | Synchronous parse -- upload file, get markdown back |
| POST | /api/v1/parser/parse-async | 30/min | Async parse -- enqueue job, returns job_id |
| GET | /api/v1/parser/parse-status/{job_id} | -- | Check async job status |
| GET | /api/v1/parser/health | -- | Health of parser backends |
| GET | /api/v1/parser/info | -- | Available parser information |
| GET | /health/ | -- | Basic health |
| GET | /health/liveness | -- | Kubernetes liveness probe |
| GET | /health/readiness | -- | Kubernetes readiness probe (checks memory, disk, parsers, RabbitMQ) |
| GET | /health/detailed | -- | Detailed health with all indicators |
| GET | /health/healthz | -- | Legacy alias for /health/ |
| GET | /health/readyz | -- | Legacy alias for /health/readiness |
| GET | /metrics | -- | Prometheus metrics |
POST /api/v1/parser/parse
Synchronous parsing. Send the raw file content as the request body with Content-Type application/octet-stream.
Headers:
X-Original-Filename (required) -- Original file name with extension
Query Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| parser_method | string | None (auto-detect) | Parser backend to use. If not specified, the service auto-selects based on file type. |
| extract_tables | boolean | false | Extract tables from document |
| extract_images | boolean | false | Extract images from document |
| ocr_enabled | boolean | false | Enable OCR |
| ocr_mode | string | -- | OCR mode |
| caption_images | boolean | false | Generate image captions via Vision LLM |
| document_id | string | -- | Link to document record |
| user_id | string | -- | Requesting user |
Response: Parsed markdown text.
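As a rough illustration of the request shape described above, the sketch below builds (but does not send) a synchronous parse request using only the Python standard library. The base URL `http://localhost:4004`, the helper name `build_parse_request`, and the lowercase `true`/`false` encoding of boolean query parameters are assumptions made for the example, not confirmed details of the service.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the service is reachable locally on its documented port.
BASE_URL = "http://localhost:4004"


def build_parse_request(content: bytes, filename: str, **params) -> Request:
    """Build (but do not send) a POST request for the synchronous parse endpoint.

    `build_parse_request` is a hypothetical helper, not part of the service.
    """
    # Drop unset options; encode booleans as lowercase "true"/"false"
    # (assumed encoding, FastAPI accepts these for boolean query parameters).
    query = {
        key: str(value).lower() if isinstance(value, bool) else str(value)
        for key, value in params.items()
        if value is not None
    }
    url = f"{BASE_URL}/api/v1/parser/parse"
    if query:
        url += "?" + urlencode(query)
    return Request(
        url,
        data=content,  # raw file bytes as the request body
        method="POST",
        headers={
            "Content-Type": "application/octet-stream",
            "X-Original-Filename": filename,
        },
    )


request = build_parse_request(
    b"%PDF-1.7 ...", "report.pdf", extract_tables=True, ocr_enabled=False
)
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it; the response body would then be read as the parsed markdown.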
POST /api/v1/parser/parse-async
Same parameters as /parse. Returns a job ID for status polling.
Supports Idempotency-Key header to prevent duplicate jobs.
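The enqueue-then-poll flow can be sketched as follows. The response schema is not documented here, so the `status` field and the terminal state names (`completed`, `failed`) are assumptions, as are the helper names `new_idempotency_key` and `wait_for_job`. The status lookup is injected as a callable so the sketch runs without a live service; in practice it would issue a GET to /api/v1/parser/parse-status/{job_id}.

```python
import time
import uuid
from typing import Callable

# Assumption: the real service may use different status names.
TERMINAL_STATES = {"completed", "failed"}


def new_idempotency_key() -> str:
    """Generate a value for the Idempotency-Key header; reuse it when retrying."""
    return str(uuid.uuid4())


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal state or the poll budget runs out."""
    for attempt in range(max_polls):
        status = fetch_status(job_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        if attempt < max_polls - 1:
            sleep(interval)  # wait before the next poll
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Stand-in for the HTTP status call, so the example is self-contained.
_responses = iter(
    [{"status": "queued"}, {"status": "processing"}, {"status": "completed"}]
)
result = wait_for_job(lambda job_id: next(_responses), "job-123", sleep=lambda _: None)
print(result)
```

Sending the same Idempotency-Key value on a retried parse-async request is what lets the service recognise the retry as a duplicate rather than a new job.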
Parser Backends
| Backend | Key | Supported Files | Notes |
|---|---|---|---|
| Azure Document Intelligence | document_intelligence | PDF, DOCX, DOC, TXT, HTML, MD, RTF, XLSX, XLS | Default. Uses Azure SDK. |
| PyMuPDF | pdf_pymupdf | PDF | Fallback. HTTP microservice at port 8001. |
| MinerU | mineru_parser | PDF | Optional. HTTP microservice at port 8002. |
| Marker | marker_parser | PDF | Optional. HTTP microservice at port 8003. |
The parser chain tries the selected backend first and falls back to alternatives on failure.
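A minimal sketch of this fallback behaviour, assuming each backend can be modelled as a callable that takes file bytes and returns markdown. The function `parse_with_fallback`, the exception `AllParsersFailed`, and the stub backends are hypothetical; the service's actual ordering and error handling may differ.

```python
from typing import Callable, Dict, Optional, Tuple

Parser = Callable[[bytes], str]


class AllParsersFailed(Exception):
    """Raised when every backend in the chain has failed (hypothetical)."""


def parse_with_fallback(
    content: bytes,
    parsers: Dict[str, Parser],
    selected: Optional[str] = None,
) -> Tuple[str, str]:
    """Try the selected backend first, then the remaining ones in order.

    Returns (backend_key, markdown) from the first backend that succeeds.
    """
    order = list(parsers)
    if selected is not None:
        if selected not in parsers:
            raise KeyError(f"unknown parser backend: {selected}")
        order.remove(selected)
        order.insert(0, selected)

    errors = {}
    for key in order:
        try:
            return key, parsers[key](content)
        except Exception as exc:  # any failure moves on to the next backend
            errors[key] = exc
    raise AllParsersFailed(errors)


# Stub backends standing in for the real parsers.
def failing_backend(content: bytes) -> str:
    raise RuntimeError("backend unavailable")


def working_backend(content: bytes) -> str:
    return "# Parsed document"


backends = {
    "document_intelligence": failing_backend,
    "pdf_pymupdf": working_backend,
}
used, markdown = parse_with_fallback(b"...", backends, selected="document_intelligence")
print(used, markdown)
```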
Worker
Separate process (worker.py) that:
- Consumes from the parsing_jobs RabbitMQ queue
- Parses the document using the configured backend
- Publishes results to the parsing_results queue
- Publishes progress to the parsing_progress queue
Messages are formatted for NestJS microservice compatibility (includes pattern field).
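NestJS microservice transports generally expect a JSON envelope with a `pattern` field naming the handler and a `data` field carrying the payload. The sketch below shows one way such a message body might be built. The helper name `build_nest_message`, the pattern value, and the payload fields are illustrative assumptions, not the worker's actual schema.

```python
import json


def build_nest_message(pattern: str, payload: dict) -> bytes:
    """Wrap a payload in a NestJS-style envelope and encode it as JSON bytes."""
    envelope = {"pattern": pattern, "data": payload}
    return json.dumps(envelope).encode("utf-8")


# Assumption: the pattern matches the queue name; field names are hypothetical.
body = build_nest_message(
    "parsing_results",
    {"job_id": "job-123", "status": "completed", "markdown": "# Parsed document"},
)
print(body.decode("utf-8"))
```

With aio-pika, bytes like these would typically become the body of the message published to the results or progress queue.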
Image Captioning
When caption_images=true, extracted images are sent to the completion service's Vision LLM for caption generation. The captions are embedded in the markdown output.
Inter-Service Communication
| Target | Protocol | Purpose |
|---|---|---|
| document-service | RabbitMQ | Publish parsing_results and parsing_progress |
| completion-service | HTTP | Image captioning via Vision LLM |
| PyMuPDF microservice | HTTP (localhost:8001) | PDF parsing |
| MinerU microservice | HTTP (localhost:8002) | PDF parsing |
| Marker microservice | HTTP (localhost:8003) | PDF parsing |
| Azure Document Intelligence | Azure SDK | Document parsing |