Marker

Marker converts documents (PDF, DOCX, PPTX, images, etc.) into clean Markdown, JSON, or HTML, with support for structured data extraction, table/form parsing, image extraction, and optional LLM-based accuracy enhancements.

Endpoint

POST /api/v1/services/process-document?chunk_size=500&overlap=15

Request Format

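The file is sent as the raw request body, with Content-Type set to the file's own MIME type. The cURL examples below use --data-binary for this rather than multipart/form-data fields.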

Headers

  • X-Original-Filename: Original file name. Non-ASCII names (such as Hebrew) are sent percent-encoded as UTF-8, as in the Hebrew example below.
  • Content-Type: MIME type of the uploaded file (e.g. application/pdf)
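As a minimal sketch of how these two headers could be prepared in Python: the helper name build_headers is hypothetical, and the choice to send the name without its extension is an assumption taken from the cURL examples on this page, not a documented requirement.

```python
import mimetypes
from pathlib import Path
from urllib.parse import quote


def build_headers(path: str) -> dict[str, str]:
    """Build the X-Original-Filename and Content-Type headers for one file."""
    p = Path(path)
    # Percent-encode the UTF-8 bytes of the name so the header value stays ASCII.
    # Assumption: the name is sent without its extension, as in the cURL examples.
    encoded_name = quote(p.stem, safe="")
    # Guess the MIME type from the extension; fall back to a generic binary type.
    content_type, _ = mimetypes.guess_type(p.name)
    return {
        "X-Original-Filename": encoded_name,
        "Content-Type": content_type or "application/octet-stream",
    }


if __name__ == "__main__":
    print(build_headers("/path/to/מסמך.pdf"))
```

For the Hebrew file name used on this page, the encoded value matches the one in the Hebrew cURL example.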

Query Parameters

Name        Type    Default  Description
chunk_size  number  500      Number of tokens per chunk
overlap     number  15       Number of overlapping tokens between chunks
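To illustrate how chunk_size and overlap interact, here is a minimal sliding-window sketch. It is an assumption about the general technique only: the service's actual tokenizer and boundary handling are not described on this page, and chunk_tokens is a hypothetical name.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 500, overlap: int = 15) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    # Each new chunk starts (chunk_size - overlap) tokens after the previous one.
    step = chunk_size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    demo = [str(i) for i in range(10)]
    for chunk in chunk_tokens(demo, chunk_size=4, overlap=1):
        print(chunk)
```

With chunk_size=4 and overlap=1, each chunk begins with the last token of the previous chunk. A larger overlap preserves more context across chunk boundaries at the cost of more chunks.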

Response Format

The response contains the parsed document content, split into chunks according to the specified chunk_size and overlap.
Each chunk holds one segment of the document's content, ready for further processing or embedding.

Example cURL Usage

English filename:

curl --location 'https://8ehqmu89grlsbn-8001.proxy.runpod.net/services/process-document?chunk_size=500&overlap=15' \
--header 'X-Original-Filename: myfile' \
--header 'Content-Type: application/pdf' \
--data-binary '@/path/to/myfile.pdf'

Hebrew filename:

curl --location 'https://8ehqmu89grlsbn-8001.proxy.runpod.net/services/process-document?chunk_size=500&overlap=15' \
--header 'X-Original-Filename: %D7%9E%D7%A1%D7%9E%D7%9A' \
--header 'Content-Type: application/pdf' \
--data-binary '@/path/to/מסמך.pdf'
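The same request can be sketched in Python with only the standard library. The host is copied from the cURL examples above and is likely specific to one deployment. build_request is a hypothetical helper, and the response is left unparsed because its schema is not documented here.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Host copied from the cURL examples; replace it with your own deployment.
BASE_URL = "https://8ehqmu89grlsbn-8001.proxy.runpod.net/services/process-document"


def build_request(data: bytes, filename: str, content_type: str,
                  chunk_size: int = 500, overlap: int = 15) -> Request:
    """Build a POST request that carries the file as the raw request body."""
    query = urlencode({"chunk_size": chunk_size, "overlap": overlap})
    headers = {
        # Percent-encode the name so non-ASCII characters survive in a header.
        "X-Original-Filename": quote(filename, safe=""),
        "Content-Type": content_type,
    }
    return Request(f"{BASE_URL}?{query}", data=data, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_request(b"%PDF-1.4 ...", "מסמך", "application/pdf")
    print(req.get_method(), req.full_url)
    print(req.header_items())
```

Passing the returned object to urllib.request.urlopen sends the request. Building the request separately from sending it keeps the URL and headers easy to inspect before any network call is made.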

Supported File Types

  • .pdf
  • .docx
  • .pptx
  • .xlsx
  • .html
  • .epub
  • Image formats: .png, .jpg, .jpeg, .tiff, etc.

Notes

  • Marker automatically detects your Torch device (CPU, GPU, or MPS).