Parser Overview
Before a document can be embedded or chunked, it must go through a parsing stage. In our system, parsers read the uploaded document (PDF or Word format) and extract its textual and visual content into a structured format.
What is a Parser?
A parser is the first step in our platform's document processing pipeline. It takes raw documents and converts them into a clean, machine-readable format, which enables downstream operations such as embedding, chunking, translation, and Q&A generation.
Supported Formats
- PDF (`.pdf`)
- Word Documents (`.doc`, `.docx`)
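As a minimal sketch of how a client could check a file against the formats listed above before uploading, the snippet below maps file extensions to format labels. The names `SUPPORTED_EXTENSIONS` and `detect_format` are hypothetical and are not part of the platform's API; only the three extensions come from this page.

```python
from pathlib import Path

# Extensions taken from the list above; the mapping itself is illustrative.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".doc": "Word Document",
    ".docx": "Word Document",
}


def detect_format(filename: str) -> str:
    """Return the format label for a supported file, or raise ValueError."""
    # Compare case-insensitively so "Report.PDF" is treated like "report.pdf".
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return SUPPORTED_EXTENSIONS[suffix]
```

Checking the extension is only a first filter; it says nothing about whether the file contents are actually valid for that format.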
How It Works
- Upload your document: drag and drop it, or select it from your file system.
- Choose a parser: select one of the available parsing strategies based on your document type or goal.
- Parse: the parser converts the file into structured text (and, optionally, metadata, tables, images, etc.).
- Post-process: after parsing, you can:
  - Generate embeddings
  - Split content into chunks
  - Translate to English
  - Generate questions and answers
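The flow above can be pictured as a parse step followed by optional post-processing steps. The sketch below is illustrative only: `ParsedDocument`, `plain_text_parser`, `split_into_chunks`, and `run_pipeline` are hypothetical names, the parser is a stand-in that merely decodes bytes (a real parser would read PDF or Word files), and the chunking rule is a naive one chosen for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ParsedDocument:
    """Hypothetical container for the structured output of a parser."""
    text: str
    metadata: dict = field(default_factory=dict)
    tables: list = field(default_factory=list)
    images: list = field(default_factory=list)


def plain_text_parser(raw: bytes) -> ParsedDocument:
    """Stand-in parser: decodes bytes as UTF-8 instead of reading PDF/Word."""
    return ParsedDocument(text=raw.decode("utf-8"), metadata={"parser": "stub"})


def split_into_chunks(doc: ParsedDocument, max_chars: int = 200) -> List[str]:
    """Naive chunking: pack whole paragraphs into chunks of up to max_chars."""
    chunks: List[str] = []
    current = ""
    for paragraph in doc.text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Start a new chunk if adding this paragraph would exceed the limit.
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks


def run_pipeline(
    raw: bytes,
    parser: Callable[[bytes], ParsedDocument],
    post_steps: Dict[str, Callable[[ParsedDocument], object]],
) -> Dict[str, object]:
    """Parse once, then apply each selected post-processing step."""
    doc = parser(raw)
    return {name: step(doc) for name, step in post_steps.items()}
```

For example, `run_pipeline(data, plain_text_parser, {"chunks": split_into_chunks})` parses once and returns a dictionary with a single `"chunks"` entry. Embedding, translation, and Q&A generation would be further entries in `post_steps` under the same assumptions.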
Available Parsers
Parser | Description |
---|---|
Text | Basic raw text extraction from documents. Fast, minimal formatting. |
Text & Image | Extracts text and image references for mixed content files. |
Marker | Advanced parser that supports tables, forms, images, equations, and chunking-ready output. |
Tika | Apache Tika-based parser for extracting text and metadata from a wide variety of file formats. |
Semantic | Parser that focuses on semantic structure and logical separation of content. |
Flex | Smart parser that auto-selects strategy based on file type and content complexity. |
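One way to read the table is as a rough decision rule for picking a parser by hand. The sketch below encodes that reading; the flag names, the order of the checks, and the `suggest_parser` function are assumptions made for illustration and do not describe how Flex or any other parser on the platform actually selects a strategy.

```python
from enum import Enum


class Parser(Enum):
    """Parser names as listed in the table above."""
    TEXT = "Text"
    TEXT_AND_IMAGE = "Text & Image"
    MARKER = "Marker"
    TIKA = "Tika"
    SEMANTIC = "Semantic"
    FLEX = "Flex"


def suggest_parser(
    has_images: bool = False,
    has_tables_or_equations: bool = False,
    needs_metadata: bool = False,
) -> Parser:
    """Hypothetical rule of thumb derived from the table descriptions."""
    # Marker is described as handling tables, forms, images, and equations.
    if has_tables_or_equations:
        return Parser.MARKER
    # Text & Image is described as extracting text plus image references.
    if has_images:
        return Parser.TEXT_AND_IMAGE
    # Tika is described as extracting text and metadata.
    if needs_metadata:
        return Parser.TIKA
    # Otherwise fall back to fast, minimal raw text extraction.
    return Parser.TEXT
```

When the document's characteristics are unknown in advance, the table suggests choosing Flex and letting it pick the strategy.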