Parser Overview
Before a document can be chunked or embedded, it must first go through a parsing stage. In our system, parsers read the uploaded document (PDF or Word format) and extract its textual and visual content into a structured format.
What is a Parser?
Parsing is the first step in our platform's document processing pipeline. A parser takes a raw document and converts it into a clean, machine-readable format that downstream operations such as embedding, chunking, translation, and Q&A generation can work with.
Supported Formats
- PDF (.pdf)
- Word Documents (.doc, .docx)
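
As a rough illustration of the format list above, a client could check a file's extension before uploading it. This is a minimal sketch, not the platform's actual validation: the function name `is_supported` and the constant `SUPPORTED_EXTENSIONS` are hypothetical, and only the three extensions come from this page.

```python
from pathlib import Path

# Extensions taken from the "Supported Formats" list; the names below are hypothetical.
SUPPORTED_EXTENSIONS = {".pdf", ".doc", ".docx"}


def is_supported(filename: str) -> bool:
    """Return True if the file extension matches one of the supported formats."""
    # Compare case-insensitively so "REPORT.PDF" is treated like "report.pdf".
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `is_supported("notes.docx")` is true, while a file with no extension or a `.csv` extension is rejected.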
How It Works
- Upload your document – drag & drop or select from your file system.
- Choose a parser – select one of the available parsing strategies depending on your document type or goal.
- The parser extracts the content – the file is converted into structured text (and optionally metadata, tables, images, etc.).
- Post-processing options – after parsing, you can:
  - Generate embeddings
  - Split content into chunks
  - Translate to English
  - Generate questions and answers
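
The flow above (parse first, then apply optional post-processing) can be sketched in a few lines. Everything here is an assumption made for illustration: `ParsedDocument`, `stub_text_parser`, `split_into_chunks`, and `run_pipeline` are hypothetical names, the parser is a stub that does not read any file, and the chunking is a naive fixed-width split rather than the platform's own strategy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ParsedDocument:
    """Hypothetical container for what a parser returns."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def stub_text_parser(path: str) -> ParsedDocument:
    # Stand-in for a real parser; a real one would read and extract the file at `path`.
    return ParsedDocument(text="Parsed content of " + path, metadata={"source": path})


def split_into_chunks(doc: ParsedDocument, max_chars: int = 10) -> List[str]:
    # Naive fixed-width chunking, shown only as an example post-processing step.
    return [doc.text[i:i + max_chars] for i in range(0, len(doc.text), max_chars)]


def run_pipeline(
    path: str,
    parser: Callable[[str], ParsedDocument],
    post_steps: Dict[str, Callable[[ParsedDocument], object]],
) -> Tuple[ParsedDocument, Dict[str, object]]:
    # Parsing always runs first; each post-processing step then receives the parsed document.
    doc = parser(path)
    results = {name: step(doc) for name, step in post_steps.items()}
    return doc, results


doc, results = run_pipeline("report.pdf", stub_text_parser, {"chunks": split_into_chunks})
```

Passing the post-processing steps as a dictionary mirrors the fact that they are optional and independent of each other: embeddings, translation, or Q&A generation could be added as further entries without changing the parsing step.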
Available Parsers
| Parser | Description |
|---|---|
| Text | Basic raw text extraction from documents. Fast, minimal formatting. |
| Text & Image | Extracts text and image references for mixed content files. |
| Marker | Advanced parser that supports tables, forms, images, equations, and chunking-ready output. |
| Tika | Apache Tika-based parser for extracting text and metadata from a wide variety of file formats. |
| Semantic | Parser that focuses on semantic structure and logical separation of content. |
| Flex | Smart parser that auto-selects strategy based on file type and content complexity. |
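
To make the table easier to act on, here is a minimal sketch of how one might pick a parser by hand from a document's characteristics. It is not a description of how Flex selects a strategy internally, which this page does not specify. The enum `Parser`, the function `suggest_parser`, and its two boolean inputs are hypothetical; only the parser names come from the table.

```python
from enum import Enum


class Parser(Enum):
    """Parser names as listed in the table; the enum itself is hypothetical."""
    TEXT = "Text"
    TEXT_AND_IMAGE = "Text & Image"
    MARKER = "Marker"
    TIKA = "Tika"
    SEMANTIC = "Semantic"
    FLEX = "Flex"


def suggest_parser(has_images: bool, has_tables_or_equations: bool) -> Parser:
    """Illustrative rule of thumb based on the table's descriptions."""
    if has_tables_or_equations:
        # Marker is described as supporting tables, forms, images, and equations.
        return Parser.MARKER
    if has_images:
        # Text & Image is described as handling mixed text and image content.
        return Parser.TEXT_AND_IMAGE
    # Plain documents only need basic raw text extraction.
    return Parser.TEXT
```

When the document's characteristics are not known in advance, choosing Flex and letting it select a strategy is the option the table points to.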