
fix: [ISSUE] Indexing large PDF files time out API call from Unstract frontend #1195

@prakkert


Describe the bug

In Prompt Studio, after creating a new project with Unstract Open Source, Ollama (LLM and embedding connector), PostgreSQL, and unstructured-io-api, I upload a PDF file with a large number of pages (1.8 MB, 214 pages) and try to index it using the Index button on the Manage Document Variants page. According to the logs shown at the bottom of the Prompt Studio project screen, the indexing fails exactly 5 minutes after the indexing action starts. The frontend shows this error message:

Failed
Error while indexing 'FullText.pdf'. Error from text extractor 'Unstructured IO Community'. unstructured-io: Error while calling the unstructured-io service

The unstructured docker logs show a '200 OK', though. The timestamp of this response is around 6 minutes later than the timestamp at which unstructured_api receives the indexing request.

I suspect there's a timeout on the API call made from the Unstract frontend to the unstructured API endpoint. In the Unstract code, I see references to timeouts of 300 seconds. Changing them manually has no effect, though: even after changing all of these timeout values, Unstract still errors out after 5 minutes.
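
To illustrate the suspected mechanism, here is a minimal Python sketch using only the standard library. It is not Unstract's or unstructured-io's code: `SlowHandler`, `call_with_timeout`, and `run_demo` are hypothetical names, and the delays are scaled down from minutes to fractions of a second. A stand-in server answers 200 OK after a delay, and a client whose timeout is shorter than that delay gives up first.

```python
import http.server
import threading
import time
import urllib.request


class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in for a slow extraction service."""

    delay_seconds = 0.5  # assumption: stands in for several minutes of work

    def do_GET(self):
        time.sleep(self.delay_seconds)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"done")
        except OSError:
            pass  # the client may already have closed the connection

    def log_message(self, *args):
        pass  # keep the demo quiet


def call_with_timeout(url, timeout_seconds):
    """Return 'ok <status>' or 'failed: <ExceptionName>' for one GET request."""
    # An empty ProxyHandler keeps the request off any proxy set in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout_seconds) as response:
            return f"ok {response.status}"
    except OSError as exc:  # covers URLError and socket timeouts
        return f"failed: {type(exc).__name__}"


def run_demo():
    """Call the slow server once with a too-short and once with a long timeout."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/"
    try:
        too_short = call_with_timeout(url, timeout_seconds=0.1)
        long_enough = call_with_timeout(url, timeout_seconds=5.0)
    finally:
        server.shutdown()
        server.server_close()
    return too_short, long_enough
```

If the real cause is similar, the value that matters is the timeout of whichever component actually issues the request to the extraction service, which may not be the one that was edited.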

Smaller PDF files (under 20 pages) do index properly for me.

To reproduce

  1. Spin up an Unstract Open Source environment per the instructions on the Unstract blog: https://unstract.com/blog/open-source-document-data-extraction-with-unstract-deepseek/
  2. When the instructions prompt you to upload a PDF file, choose a PDF file with a high page count. Example used for demonstration purposes: https://computingbook.org/FullText.pdf (use at your own risk; I haven't checked it for anything malicious).
  3. Click the 'Index' button for this newly uploaded file, marked by the circular arrow.
  4. The logs on the lower part of the Prompt Studio project page should state that indexing has started ('Indexing started'). Take note of the timestamp.
  5. While indexing is running, run 'docker logs -f {CONTAINER_ID}' to check the logs on the unstructured-io-api docker container.
  6. After exactly 5 minutes (give or take a second, I suppose there's a slight delay in Unstract's backend when the API call returns), the indexing will have failed.
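
To check the "exactly 5 minutes" observation from steps 4 and 6 without doing the arithmetic by hand, a small Python helper can compute the elapsed time between two log timestamps. This is a sketch under assumptions: it expects the `YYYY-MM-DD HH:MM:SS,mmm` format used in the unstructured_api logs, the function names are hypothetical, and `end_offset_hours` is an illustrative way to correct for a frontend that displays times in a different timezone.

```python
from datetime import datetime, timedelta

# Timestamp layout seen in the unstructured_api logs, e.g. "2025-03-17 11:37:51,418".
LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_timestamp(text):
    """Parse one log timestamp into a naive datetime."""
    return datetime.strptime(text, LOG_FORMAT)


def elapsed_seconds(start, end, end_offset_hours=0):
    """Seconds between two timestamps.

    end_offset_hours is subtracted from the end timestamp, for the case where
    the end time was read from a UI that runs ahead of the container clock.
    """
    start_dt = parse_log_timestamp(start)
    end_dt = parse_log_timestamp(end) - timedelta(hours=end_offset_hours)
    return (end_dt - start_dt).total_seconds()
```

For example, with the request timestamp from the log and a made-up failure timestamp read from a frontend that is one hour ahead, `elapsed_seconds("2025-03-17 11:37:51,418", "2025-03-17 12:42:51,418", end_offset_hours=1)` returns `300.0`. The second timestamp is illustrative only and not taken from the actual logs.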

Addendum: the unstructured container doesn't give me a 200 OK when processing the example file. I'm not sure why, but that might be an indicator of something important to this issue.

Expected behavior

I expect Unstract to properly handle PDF indexing when that process takes more than 5 minutes. If this cannot be handled elegantly and automagically, at least allow users to manually set a timeout if 300 seconds is too short for background processes to do their work. Users should not have to look up timeout values in code, because Unstract is a 'no code platform'.
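
One possible shape for such a setting, sketched in Python under stated assumptions: the environment variable name `EXTRACTION_TIMEOUT_SECONDS` and the function `resolve_timeout` are hypothetical and are not existing Unstract settings. The 300-second default only mirrors the value seen in the code.

```python
import math
import os

DEFAULT_TIMEOUT_SECONDS = 300.0  # mirrors the 300 s value referenced in the code
TIMEOUT_ENV_VAR = "EXTRACTION_TIMEOUT_SECONDS"  # hypothetical name, not a real setting


def resolve_timeout(env=None):
    """Return the request timeout in seconds, preferring a user-supplied value."""
    env = os.environ if env is None else env
    raw = env.get(TIMEOUT_ENV_VAR, "").strip()
    if not raw:
        return DEFAULT_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"{TIMEOUT_ENV_VAR} must be a number, got {raw!r}") from None
    if not math.isfinite(value) or value <= 0:
        raise ValueError(f"{TIMEOUT_ENV_VAR} must be a positive number, got {raw!r}")
    return value
```

With something like this, a user could set the variable to `900` in the deployment's environment file instead of searching the code for timeout constants, and an invalid value would fail loudly at startup rather than silently falling back.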

Environment details

Version: Latest (cloned Sunday 16th of March 2025 21:00 GMT)

Kernel: Linux 6.13.7-arch1-1
CPU: 12th Gen Intel(R) Core(TM) i5-12600K (16) @ 4.60 GHz
GPU 1: NVIDIA GeForce RTX 3070 Lite Hash Rate [Discrete]
Memory: 13.08 GiB / 62.57 GiB (21%)
Disk space: 2.67 TiB (7% in use, aggr.)

$ sudo docker version:

Client:
 Version: 28.0.1
 API version: 1.48
 Go version: go1.24.1
 Git commit: 068a01ea94
 Built: Thu Mar 6 19:09:48 2025
 OS/Arch: linux/amd64
 Context: default

Server:
 Engine:
  Version: 28.0.1
  API version: 1.48 (minimum version 1.24)
  Go version: go1.24.1
  Git commit: bbd0a17ccc
  Built: Thu Mar 6 19:09:48 2025
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: v2.0.3
  GitCommit: 06b99ca80cdbfbc6cc8bd567021738c9af2b36ce.m
 runc:
  Version: 1.2.5
  GitCommit:
 docker-init:
  Version: 0.19.0
  GitCommit: de40ad0

Containers:
REPOSITORY                                                  TAG      IMAGE ID      CREATED        SIZE
ollama/ollama                                               latest   43d119f0a361  2 days ago     3.45GB
unstract/prompt-service                                     latest   cc5a99fd9d02  3 days ago     1.48GB
unstract/frontend                                           latest   1c6a0ac703bf  3 days ago     73.4MB
unstract/backend                                            latest   5f23e37626b6  3 days ago     1.76GB
unstract/platform-service                                   latest   063bece45d1c  3 days ago     1.42GB
unstract/runner                                             latest   3877391ad115  3 days ago     610MB
unstract/x2text-service                                     latest   f14c1e1029a9  3 days ago     260MB
minio/minio                                                 latest   2eaf94c71682  4 days ago     182MB
minio/mc                                                    latest   b27ef7f401c2  4 days ago     84MB
pgvector/pgvector                                           pg15     34de5cb0838f  11 days ago    437MB
downloads.unstructured.io/unstructured-io/unstructured-api  latest   4ed6e30ee87a  4 weeks ago    9.87GB
qdrant/qdrant                                               v1.8.3   e7cb95ab55c6  12 months ago  162MB
traefik                                                     v2.10    ee69e8120b64  13 months ago  153MB
flipt/flipt                                                 v1.34.0  db65814a90e4  14 months ago  65.7MB
redis                                                       7.2.3    e40e2763392d  15 months ago  138MB

Additional context

Nothing to add, I think.

Screenshots

Note the discrepancy in the timestamps. Frontend is set in a different timezone, which accounts for a 1 hour difference.


2025-03-17 11:37:51,418 unstructured_api DEBUG pipeline_api input params: {"filename": "unstract/prompt-studio-data/mock_org/mock_user_id/2cbd7437-b721-435b-a0cc-bafd4a421a1e/FullText.pdf", "response_type": "application/json", "coordinates": false, "encoding": "utf-8", "hi_res_model_name": null, "include_page_breaks": false, "ocr_languages": null, "pdf_infer_table_structure": true, "skip_infer_table_types": [], "strategy": "auto", "xml_keep_tags": false, "languages": null, "extract_image_block_types": null, "unique_element_ids": false, "chunking_strategy": null, "combine_under_n_chars": null, "max_characters": 500, "multipage_sections": true, "new_after_n_chars": null, "overlap": 0, "overlap_all": false, "starting_page_number": null, "include_slide_notes": true}
2025-03-17 11:37:51,419 unstructured_api DEBUG filetype: application/pdf
2025-03-17 11:37:51,428 unstructured_api DEBUG partition input data: {"content_type": "application/pdf", "strategy": "auto", "ocr_languages": null, "coordinates": false, "pdf_infer_table_structure": true, "include_page_breaks": false, "encoding": "utf-8", "hi_res_model_name": null, "xml_keep_tags": false, "skip_infer_table_types": [], "languages": null, "chunking_strategy": null, "multipage_sections": true, "combine_under_n_chars": null, "new_after_n_chars": null, "max_characters": 500, "overlap": 0, "overlap_all": false, "extract_image_block_types": null, "extract_image_block_to_payload": false, "unique_element_ids": false, "include_slide_notes": true}
2025-03-17 11:37:51,428 unstructured WARNING The pdf_infer_table_structure kwarg is deprecated. Please use skip_infer_table_types instead.
2025-03-17 11:37:57,399 unstructured_inference INFO Reading PDF for file: /tmp/tmp3dyy955y/document.pdf ...


Labels: bug (Something isn't working)
