Tika Parser

Apache Tika is a content detection and analysis toolkit that extracts text and metadata, and detects the language, from a wide range of document formats.
It parses PDF, Word, Excel, PowerPoint, and HTML documents, as well as various image and multimedia formats.

Endpoint

POST /api/v1/services/tika-extract?chunk_size=500&overlap=15

Request Format

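The file is sent as the raw request body, with the Content-Type header set to the file's MIME type. The cURL examples below use --data-binary for this rather than multipart form fields.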

Headers

  • X-Original-Filename: Original file name. Non-ASCII names (such as Hebrew) are percent-encoded as UTF-8, as in the Hebrew cURL example below.
  • Content-Type: MIME type of the uploaded file (e.g. application/pdf)
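
As an illustration of the two headers, here is a minimal Python sketch that percent-encodes a file name and guesses the MIME type from the file extension. The helper name build_headers and the application/octet-stream fallback are assumptions made for this example and are not part of the service's documented behavior.

```python
import mimetypes
from urllib.parse import quote


def build_headers(original_name: str, file_path: str) -> dict:
    """Build the two request headers described above (hypothetical helper)."""
    # Guess the MIME type from the file extension, e.g. ".pdf" -> "application/pdf".
    content_type, _ = mimetypes.guess_type(file_path)
    return {
        # Percent-encode the name as UTF-8 so non-ASCII names fit in a header.
        "X-Original-Filename": quote(original_name, safe=""),
        # Assumption: fall back to a generic binary type when the extension
        # is unknown; the service may reject this value.
        "Content-Type": content_type or "application/octet-stream",
    }
```

Calling build_headers("קובץ", "/path/to/קובץ.pdf") produces the same encoded value that appears in the Hebrew cURL example, while a plain ASCII name such as "document" passes through unchanged.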

Response Format

The response contains the text extracted from the file, split into chunks according to the chunk_size and overlap query parameters.
Each chunk holds one segment of the document's content, ready for further processing or embedding.
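
To give a feel for how chunk_size and overlap interact, here is a minimal Python sketch of sliding-window chunking. This page does not state the unit of the two parameters or the exact splitting rule used by the service, so the sketch assumes both are measured in characters and that each chunk starts chunk_size - overlap characters after the previous one. The function name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 15) -> list[str]:
    """Split text into overlapping chunks (illustrative, character-based)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text, so no
        # trailing chunk is emitted that only repeats already-covered text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

Under these assumptions, consecutive chunks share their last and first overlap characters, which keeps a sentence that straddles a chunk boundary at least partly visible in both chunks.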

Example cURL Usage

English filename:

curl --location 'https://your-domain.com/api/v1/services/tika-extract' \
--header 'X-Original-Filename: document' \
--header 'Content-Type: application/pdf' \
--data-binary '@/path/to/document.pdf'

Hebrew filename:

curl --location 'https://your-domain.com/api/v1/services/tika-extract' \
--header 'X-Original-Filename: %D7%A7%D7%95%D7%91%D7%A5' \
--header 'Content-Type: application/pdf' \
--data-binary '@/path/to/קובץ.pdf'
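
For callers who prefer Python over cURL, the requests above can be sketched with the standard library. This is an illustration under stated assumptions rather than an official client: build_request is a hypothetical helper, https://your-domain.com is the same placeholder host used in the cURL examples, and the chunk_size and overlap defaults are taken from the Endpoint section.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

ENDPOINT_PATH = "/api/v1/services/tika-extract"


def build_request(base_url: str, original_name: str, data: bytes,
                  content_type: str, chunk_size: int = 500,
                  overlap: int = 15) -> Request:
    """Build a POST request equivalent to the cURL examples (hypothetical helper)."""
    query = urlencode({"chunk_size": chunk_size, "overlap": overlap})
    url = f"{base_url.rstrip('/')}{ENDPOINT_PATH}?{query}"
    headers = {
        # Percent-encode the file name so non-ASCII names are header-safe.
        "X-Original-Filename": quote(original_name, safe=""),
        "Content-Type": content_type,
    }
    # The raw file bytes form the request body, like curl's --data-binary.
    return Request(url, data=data, headers=headers, method="POST")


# Usage (performs a network call, so it is left commented out):
# from urllib.request import urlopen
# with open("/path/to/document.pdf", "rb") as f:
#     req = build_request("https://your-domain.com", "document",
#                         f.read(), "application/pdf")
# with urlopen(req) as resp:
#     body = resp.read()  # response encoding is not specified on this page
```

Building the Request object separately from sending it makes it easy to inspect the final URL and headers before any data leaves the machine.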

Supported File Types

  • .pdf
  • .doc, .docx
  • .xls, .xlsx
  • .ppt, .pptx
  • .html, .xml, .txt

Notes

  • Tika focuses on accurate text and metadata extraction; it does not preserve document structure such as tables or layout formatting.
  • Best suited for use cases that require raw text, such as document indexing.