Marker
Marker converts documents (PDF, DOCX, PPTX, images, etc.) into clean Markdown, JSON, or HTML, with support for structured data extraction, table/form parsing, image extraction, and optional LLM-based accuracy enhancements.
Endpoint
POST /api/v1/services/process-document?chunk_size=500&overlap=15
Request Format
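The file is sent as the raw request body (the cURL examples below use --data-binary), with Content-Type set to the file's own type rather than multipart/form-data.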
Headers
- X-Original-Filename: Original file name. Unicode names such as Hebrew are supported; percent-encode them as UTF-8, as in the Hebrew example below.
- Content-Type: Must match the uploaded file type (e.g. application/pdf)
Query Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| chunk_size | number | 500 | Number of tokens per chunk |
| overlap | number | 15 | Number of overlapping tokens between chunks |
Response Format
The response is the parsed document: its content is extracted and divided into chunks according to the specified chunk_size and overlap.
Each chunk contains a segment of the document's content, formatted for further processing or embedding.
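To make the interaction between chunk_size and overlap concrete, here is a minimal sketch of sliding-window chunking. It is an illustration only, not the service's implementation: the function name `chunk_tokens` is hypothetical, and the tokens are assumed to be an already-tokenized list (the tokenizer the service uses is not specified here).

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 500, overlap: int = 15) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens.

    Hypothetical helper: the real service may tokenize and split differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new chunk advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, each chunk starts 485 tokens after the previous one, so the last 15 tokens of one chunk reappear at the start of the next.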
Example cURL Usage
English filename:
curl --location 'https://8ehqmu89grlsbn-8001.proxy.runpod.net/services/process-document?chunk_size=500&overlap=15' \
--header 'X-Original-Filename: myfile' \
--header 'Content-Type: application/pdf' \
--data-binary '@/path/to/myfile.pdf'
Hebrew filename:
curl --location 'https://8ehqmu89grlsbn-8001.proxy.runpod.net/services/process-document?chunk_size=500&overlap=15' \
--header 'X-Original-Filename: %D7%9E%D7%A1%D7%9E%D7%9A' \
--header 'Content-Type: application/pdf' \
--data-binary '@/path/to/מסמך.pdf'
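The same request can be prepared from Python. The sketch below mirrors the cURL examples using only the standard library; `build_request` is a hypothetical helper, and the base URL is passed in by the caller rather than assumed. The file name is percent-encoded as UTF-8, which yields the same header value as the Hebrew example above.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_request(base_url: str, filename: str, body: bytes, content_type: str,
                  chunk_size: int = 500, overlap: int = 15) -> Request:
    """Prepare (but do not send) a POST request shaped like the cURL examples.

    Hypothetical helper: parameter names are illustrative, not part of the API.
    """
    query = urlencode({"chunk_size": chunk_size, "overlap": overlap})
    headers = {
        # Percent-encode so non-ASCII names survive as an ASCII header value.
        "X-Original-Filename": quote(filename),
        "Content-Type": content_type,
    }
    # The raw file bytes form the request body, as with curl --data-binary.
    return Request(f"{base_url}?{query}", data=body, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would send it; the response format is as described under Response Format.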
Supported File Types
- .pdf
- .docx
- .pptx
- .xlsx
- .html
- .epub
- Image formats: .png, .jpg, .jpeg, .tiff, etc.
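Because the Content-Type header must match the uploaded file, a client needs a mapping from extension to MIME type. The sketch below uses the standard registered MIME types for the extensions listed above; the `content_type_for` helper is hypothetical, the table covers only the extensions named in this section, and whether the service expects exactly these values is an assumption.

```python
from pathlib import Path

# Standard MIME types for the extensions listed above (not an exhaustive list).
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".html": "text/html",
    ".epub": "application/epub+zip",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".tiff": "image/tiff",
}


def content_type_for(filename: str) -> str:
    """Return the Content-Type for a file name, or raise ValueError if unlisted."""
    suffix = Path(filename).suffix.lower()  # case-insensitive: ".PDF" -> ".pdf"
    if suffix not in CONTENT_TYPES:
        raise ValueError(f"Unsupported file type: {filename!r}")
    return CONTENT_TYPES[suffix]
```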
Notes
- Marker automatically detects your Torch device (CPU, GPU, or MPS).