Description
Describe the bug
In Prompt Studio, after creating a new project with Unstract Open Source, ollama (LLM and embedding connector), postgresql and unstructured-io-api, I upload a PDF file with a large number of pages (1.8 MB, 214 pages) and try to index it using the Index button on the Manage Document Variants page. The indexing fails exactly 5 minutes after starting the indexing action, according to the logs shown at the bottom of the Prompt Studio project screen. An error message is shown on the frontend:
Failed
Error while indexing 'FullText.pdf'. Error from text extractor 'Unstructured IO Community'. unstructured-io: Error while calling the unstructured-io service
The unstructured docker logs show a '200 OK', though. This response has a timestamp that's around 6 minutes later than the timestamp at which unstructured_api receives the indexing request.
I suspect there's a timeout on the API call made from the Unstract frontend to the unstructured API endpoint. In the Unstract code, I see references to timeouts that are 300 seconds long. Changing them manually has no effect, though: even after I changed all of these timeout values, Unstract still errors out after 5 minutes.
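The mismatch between the frontend error and the '200 OK' in the container logs is consistent with a client-side timeout. The following is a minimal, self-contained sketch of that failure mode, not Unstract's actual code: a local stand-in server answers slowly, and a client with a shorter timeout gives up first. `SERVER_DELAY`, `CLIENT_TIMEOUT`, `SlowHandler` and `run_demo` are hypothetical names, and the delays are scaled down from minutes to fractions of a second.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

SERVER_DELAY = 0.5    # stands in for the ~6 minutes of text extraction
CLIENT_TIMEOUT = 0.1  # stands in for the suspected 300-second limit


class SlowHandler(BaseHTTPRequestHandler):
    """Stand-in for a slow extraction service (hypothetical)."""

    def do_GET(self):
        time.sleep(SERVER_DELAY)
        # From the server's point of view the request succeeded.
        self.server.statuses.append(200)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"done")
        except OSError:
            pass  # the client has already disconnected

    def log_message(self, *args):
        pass  # keep the demo quiet


def run_demo():
    """Return (client outcome, status codes recorded by the server)."""
    server = HTTPServer(("127.0.0.1", 0), SlowHandler)
    server.statuses = []
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/"
    try:
        urllib.request.urlopen(url, timeout=CLIENT_TIMEOUT)
        outcome = "ok"
    except OSError:  # covers URLError and socket timeouts
        outcome = "timeout"
    time.sleep(SERVER_DELAY + 0.2)  # let the server finish its work
    server.shutdown()
    server.server_close()
    return outcome, list(server.statuses)
```

Calling `run_demo()` should report a client-side timeout alongside a server-side 200, which mirrors the symptom described above: the caller fails while the service still logs success.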
Smaller PDF files (under 20 pages) do index properly for me.
To reproduce
- Spin up an Unstract Open Source environment per the instruction on the Unstract blog: https://unstract.com/blog/open-source-document-data-extraction-with-unstract-deepseek/
- When the instruction prompts to upload a PDF file, choose a PDF file with a high page count. Example used for demonstration purposes: https://computingbook.org/FullText.pdf (use at your own risk, haven't checked it for anything malicious).
- Click the 'Index' button for this newly uploaded file, marked by the circular arrow.
- The logs on the lower part of the Prompt Studio project page should state that indexing has started ('Indexing started'). Take note of the timestamp.
- While indexing has started, run 'docker logs -f {CONTAINER_ID}' to check the logs on the unstructured-io-api docker container.
- After exactly 5 minutes (give or take a second, I suppose there's a slight delay in Unstract's backend when the API call returns), the indexing will have failed.
Addendum: the unstructured container doesn't give me a 200 OK when processing the example file. I'm not sure why, but that might be an indicator of something important to this issue.
Expected behavior
I expect Unstract to properly handle PDF indexing when that process takes more than 5 minutes. If this cannot be handled elegantly and automatically, at least allow users to manually set a timeout if 300 seconds is too short for background processes to do their work. Users should not have to look up timeout values in code, because Unstract is a 'no code platform'.
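One possible shape for a user-adjustable timeout is sketched below, assuming the value is read from an environment variable with the current 300 seconds as the fallback. The variable name `X2TEXT_REQUEST_TIMEOUT_SECONDS` and the helper `resolve_timeout` are hypothetical; they are not existing Unstract settings.

```python
import math
import os

DEFAULT_TIMEOUT_SECONDS = 300.0
# Hypothetical variable name, not an existing Unstract setting.
TIMEOUT_ENV_VAR = "X2TEXT_REQUEST_TIMEOUT_SECONDS"


def resolve_timeout(env=None):
    """Return the configured timeout in seconds, or the default.

    Missing, non-numeric, non-finite, or non-positive values all
    fall back to DEFAULT_TIMEOUT_SECONDS.
    """
    env = os.environ if env is None else env
    raw = env.get(TIMEOUT_ENV_VAR, "").strip()
    if not raw:
        return DEFAULT_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_SECONDS
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_TIMEOUT_SECONDS
    return value
```

With this approach, a user indexing large documents could raise the limit (for example to 900 seconds) in the environment of the relevant service without touching the code.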
Environment details
Version: Latest (cloned Sunday 16th of March 2025 21:00 GMT)
Kernel: Linux 6.13.7-arch1-1
CPU: 12th Gen Intel(R) Core(TM) i5-12600K (16) @ 4.60 GHz
GPU 1: NVIDIA GeForce RTX 3070 Lite Hash Rate [Discrete]
Memory: 13.08 GiB / 62.57 GiB (21%)
Disk space: 2.67 TiB (7% in use, aggr.)
$ sudo docker version:
Client:
 Version: 28.0.1
 API version: 1.48
 Go version: go1.24.1
 Git commit: 068a01ea94
 Built: Thu Mar 6 19:09:48 2025
 OS/Arch: linux/amd64
 Context: default
Server:
 Engine:
  Version: 28.0.1
  API version: 1.48 (minimum version 1.24)
  Go version: go1.24.1
  Git commit: bbd0a17ccc
  Built: Thu Mar 6 19:09:48 2025
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: v2.0.3
  GitCommit: 06b99ca80cdbfbc6cc8bd567021738c9af2b36ce.m
 runc:
  Version: 1.2.5
  GitCommit:
 docker-init:
  Version: 0.19.0
  GitCommit: de40ad0
Containers:
REPOSITORY TAG IMAGE ID CREATED SIZE
ollama/ollama latest 43d119f0a361 2 days ago 3.45GB
unstract/prompt-service latest cc5a99fd9d02 3 days ago 1.48GB
unstract/frontend latest 1c6a0ac703bf 3 days ago 73.4MB
unstract/backend latest 5f23e37626b6 3 days ago 1.76GB
unstract/platform-service latest 063bece45d1c 3 days ago 1.42GB
unstract/runner latest 3877391ad115 3 days ago 610MB
unstract/x2text-service latest f14c1e1029a9 3 days ago 260MB
minio/minio latest 2eaf94c71682 4 days ago 182MB
minio/mc latest b27ef7f401c2 4 days ago 84MB
pgvector/pgvector pg15 34de5cb0838f 11 days ago 437MB
downloads.unstructured.io/unstructured-io/unstructured-api latest 4ed6e30ee87a 4 weeks ago 9.87GB
qdrant/qdrant v1.8.3 e7cb95ab55c6 12 months ago 162MB
traefik v2.10 ee69e8120b64 13 months ago 153MB
flipt/flipt v1.34.0 db65814a90e4 14 months ago 65.7MB
redis 7.2.3 e40e2763392d 15 months ago 138MB
Additional context
Nothing to add, I think.
Screenshots
Note the discrepancy in the timestamps. Frontend is set in a different timezone, which accounts for a 1 hour difference.
2025-03-17 11:37:51,418 unstructured_api DEBUG pipeline_api input params: {"filename": "unstract/prompt-studio-data/mock_org/mock_user_id/2cbd7437-b721-435b-a0cc-bafd4a421a1e/FullText.pdf", "response_type": "application/json", "coordinates": false, "encoding": "utf-8", "hi_res_model_name": null, "include_page_breaks": false, "ocr_languages": null, "pdf_infer_table_structure": true, "skip_infer_table_types": [], "strategy": "auto", "xml_keep_tags": false, "languages": null, "extract_image_block_types": null, "unique_element_ids": false, "chunking_strategy": null, "combine_under_n_chars": null, "max_characters": 500, "multipage_sections": true, "new_after_n_chars": null, "overlap": 0, "overlap_all": false, "starting_page_number": null, "include_slide_notes": true}
2025-03-17 11:37:51,419 unstructured_api DEBUG filetype: application/pdf
2025-03-17 11:37:51,428 unstructured_api DEBUG partition input data: {"content_type": "application/pdf", "strategy": "auto", "ocr_languages": null, "coordinates": false, "pdf_infer_table_structure": true, "include_page_breaks": false, "encoding": "utf-8", "hi_res_model_name": null, "xml_keep_tags": false, "skip_infer_table_types": [], "languages": null, "chunking_strategy": null, "multipage_sections": true, "combine_under_n_chars": null, "new_after_n_chars": null, "max_characters": 500, "overlap": 0, "overlap_all": false, "extract_image_block_types": null, "extract_image_block_to_payload": false, "unique_element_ids": false, "include_slide_notes": true}
2025-03-17 11:37:51,428 unstructured WARNING The pdf_infer_table_structure kwarg is deprecated. Please use skip_infer_table_types instead.
2025-03-17 11:37:57,399 unstructured_inference INFO Reading PDF for file: /tmp/tmp3dyy955y/document.pdf ...
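To compare timestamps like these without doing the arithmetic by hand, a small helper can parse the leading timestamp of two log lines and report the gap in seconds. This is a sketch that assumes the `YYYY-MM-DD HH:MM:SS,mmm` prefix seen in the unstructured logs above; `parse_log_timestamp`, `elapsed_seconds` and the `second_offset` parameter (for sources in another timezone, such as the frontend's 1 hour difference) are hypothetical names.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp prefix used in the unstructured log lines.
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def parse_log_timestamp(line):
    """Extract the leading timestamp of a log line as a datetime."""
    match = TIMESTAMP_RE.match(line)
    if match is None:
        raise ValueError(f"no leading timestamp in: {line[:40]!r}")
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


def elapsed_seconds(first_line, second_line, second_offset=timedelta(0)):
    """Seconds between two log lines.

    second_offset is subtracted from the second timestamp, for the
    case where the second source logs in a different timezone.
    """
    start = parse_log_timestamp(first_line)
    end = parse_log_timestamp(second_line) - second_offset
    return (end - start).total_seconds()


received = "2025-03-17 11:37:51,418 unstructured_api DEBUG pipeline_api input params"
reading = "2025-03-17 11:37:57,399 unstructured_inference INFO Reading PDF for file"
print(elapsed_seconds(received, reading))
```

Applied to the request line and the eventual '200 OK' line, the same helper would show whether the service took longer than the 300 seconds after which the frontend reports the failure.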