📄️ Parser Overview
Before a document can be chunked or embedded, it must first go through a parsing stage.
📄️ Marker
Marker converts documents (PDF, DOCX, PPTX, images, etc.) into clean Markdown, JSON, or HTML, with support for structured data extraction, table/form parsing, image extraction, and optional LLM-based accuracy enhancements.
📄️ Tika Parser
Apache Tika is a content detection and analysis tool that extracts text and metadata, and detects language, across a wide range of document formats.