fix: [ISSUE] Indexing large PDF files time out API call from Unstract frontend

## Describe the bug
In Prompt Studio, after creating a new project with Unstract Open Source, ollama (LLM and embedding connector), postgresql and unstructured-io-api, I upload a PDF file with a large number of pages (1.8 MB, 214 pages) and try to index it using the Index button on the Manage Document Variants page. The indexing fails exactly 5 minutes after starting the indexing action, according to the logs shown at the bottom of the Prompt Studio project screen. An error message is shown on the frontend: 

Failed
Error while indexing 'FullText.pdf'. Error from text extractor 'Unstructured IO Community'. unstructured-io: Error while calling the unstructured-io service

The unstructured docker logs show a '200 OK', though. This response is logged around 6 minutes after unstructured_api receives the indexing request.

I suspect there's a timeout on the API call made from the Unstract frontend to the unstructured API endpoint. In the Unstract code, I see references to timeouts that are 300 seconds long. Changing them by hand makes no difference, though: even after changing all of these timeout values, Unstract still errors out after 5 minutes.
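To illustrate the suspected failure mode, here is a minimal, self-contained sketch using only the Python standard library. It is not Unstract's or unstructured-api's code: `SlowHandler`, `call_with_timeout` and `demo` are hypothetical names, and the delays are scaled down from minutes to fractions of a second. It shows that a client whose read timeout is shorter than the server's processing time reports an error, while the same request succeeds when the timeout is long enough.

```python
import http.server
import socket
import threading
import time
import urllib.error
import urllib.request


class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Stand-in for a slow extraction service (hypothetical, not unstructured-api)."""

    delay_s = 0.5  # scaled-down stand-in for the ~6 minutes seen in this report

    def do_GET(self):
        time.sleep(self.delay_s)
        try:
            # The server finishes its work and tries to answer 200 ...
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"done")
        except (BrokenPipeError, ConnectionResetError):
            # ... even if the client has already hung up.
            pass

    def log_message(self, *args):
        pass  # keep the demo quiet


def call_with_timeout(url, timeout_s):
    """Return a short description of what the client observed."""
    # Empty ProxyHandler: ignore any proxy settings from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout_s) as resp:
            return f"ok {resp.status}"
    except (TimeoutError, socket.timeout, urllib.error.URLError) as exc:
        return f"client gave up: {type(exc).__name__}"


def demo():
    """Call the slow server once with a short and once with a long timeout."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/"
    try:
        too_short = call_with_timeout(url, timeout_s=0.1)
        long_enough = call_with_timeout(url, timeout_s=5)
    finally:
        server.shutdown()
        server.server_close()
    return too_short, long_enough
```

If the real setup behaves the same way, that would explain why the unstructured container can complete its work while Unstract has already reported a failure.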

Smaller PDF files (under 20 pages) do index properly for me.

## To reproduce
1. Spin up an Unstract Open Source environment per the instructions on the Unstract blog: https://unstract.com/blog/open-source-document-data-extraction-with-unstract-deepseek/
2. When the instructions prompt you to upload a PDF file, choose a PDF file with a high page count. Example used for demonstration purposes: https://computingbook.org/FullText.pdf (use at your own risk, I haven't checked it for anything malicious).
3. Click the 'Index' button for this newly uploaded file, marked by the circular arrow.
4. The logs on the lower part of the Prompt Studio project page should state that indexing has started ('Indexing started'). Take note of the timestamp.
5. While indexing is running, run 'docker logs -f {CONTAINER_ID}' to check the logs on the unstructured-io-api docker container.
6. After exactly 5 minutes (give or take a second, I suppose there's a slight delay in Unstract's backend when the API call returns), the indexing will have failed.
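To check the 5-minute interval from the noted timestamps, a small helper like the one below may be handy. It is only a sketch: `elapsed_seconds` is a hypothetical name, it assumes both timestamps use the `YYYY-MM-DD HH:MM:SS,mmm` format seen in the unstructured_api logs, and the end timestamp in the usage line is made up for illustration. The `end_offset_hours` parameter compensates for a timezone difference between the two sources.

```python
from datetime import datetime, timedelta

# Format of the unstructured_api log timestamps, e.g. "2025-03-17 11:37:51,418".
LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start, end, end_offset_hours=0):
    """Seconds between two log timestamps.

    end_offset_hours is subtracted from `end`, for the case where the end
    timestamp comes from a source that is ahead by a fixed number of hours.
    """
    started = datetime.strptime(start, LOG_FORMAT)
    ended = datetime.strptime(end, LOG_FORMAT) - timedelta(hours=end_offset_hours)
    return (ended - started).total_seconds()
```

For example, `elapsed_seconds("2025-03-17 11:37:51,418", "2025-03-17 12:42:51,918", end_offset_hours=1)` returns `300.5` (the second timestamp is invented, not taken from the logs).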

Addendum: the unstructured container doesn't give me a 200 OK when processing the example file. I'm not sure why, but it might be relevant to this issue.

## Expected behavior
I expect Unstract to properly handle PDF indexing when that process takes more than 5 minutes. If this cannot be handled elegantly and automatically, at least allow users to manually set a timeout if 300 seconds is too short for background processes to do their work. Users should not have to look up timeout values in code, because Unstract is a 'no code platform'.
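As a rough idea of what a user-settable timeout could look like, here is a minimal sketch that reads the value from an environment variable and falls back to the current 300 seconds. The variable name `EXTRACTION_TIMEOUT_SECONDS` and the function `resolve_timeout` are assumptions made for illustration; they are not existing Unstract settings or APIs.

```python
import os

DEFAULT_TIMEOUT_S = 300.0  # the 300 seconds mentioned in this report


def resolve_timeout(env_var="EXTRACTION_TIMEOUT_SECONDS", default=DEFAULT_TIMEOUT_S):
    """Read a timeout in seconds from the environment.

    The variable name is hypothetical. Missing, non-numeric or
    non-positive values fall back to the default.
    """
    raw = os.environ.get(env_var, "").strip()
    if not raw:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    return value if value > 0 else default
```

With this approach, something like `EXTRACTION_TIMEOUT_SECONDS=900` in the environment file would give long documents 15 minutes, without anyone having to edit code.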

## Environment details
Version: Latest (cloned Sunday 16th of March 2025 21:00 GMT)

Kernel: Linux 6.13.7-arch1-1
CPU: 12th Gen Intel(R) Core(TM) i5-12600K (16) @ 4.60 GHz
GPU 1: NVIDIA GeForce RTX 3070 Lite Hash Rate [Discrete]
Memory: 13.08 GiB / 62.57 GiB (21%)
Disk space: 2.67 TiB (7% in use, aggr.)

$ sudo docker version:

Client:
 Version:           28.0.1
 API version:       1.48
 Go version:        go1.24.1
 Git commit:        068a01ea94
 Built:             Thu Mar  6 19:09:48 2025
 OS/Arch:           linux/amd64
 Context:           default

Server:
 Engine:
  Version:          28.0.1
  API version:      1.48 (minimum version 1.24)
  Go version:       go1.24.1
  Git commit:       bbd0a17ccc
  Built:            Thu Mar  6 19:09:48 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v2.0.3
  GitCommit:        06b99ca80cdbfbc6cc8bd567021738c9af2b36ce.m
 runc:
  Version:          1.2.5
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Containers:
REPOSITORY                                                   TAG       IMAGE ID       CREATED         SIZE
ollama/ollama                                                latest    43d119f0a361   2 days ago      3.45GB
unstract/prompt-service                                      latest    cc5a99fd9d02   3 days ago      1.48GB
unstract/frontend                                            latest    1c6a0ac703bf   3 days ago      73.4MB
unstract/backend                                             latest    5f23e37626b6   3 days ago      1.76GB
unstract/platform-service                                    latest    063bece45d1c   3 days ago      1.42GB
unstract/runner                                              latest    3877391ad115   3 days ago      610MB
unstract/x2text-service                                      latest    f14c1e1029a9   3 days ago      260MB
minio/minio                                                  latest    2eaf94c71682   4 days ago      182MB
minio/mc                                                     latest    b27ef7f401c2   4 days ago      84MB
pgvector/pgvector                                            pg15      34de5cb0838f   11 days ago     437MB
downloads.unstructured.io/unstructured-io/unstructured-api   latest    4ed6e30ee87a   4 weeks ago     9.87GB
qdrant/qdrant                                                v1.8.3    e7cb95ab55c6   12 months ago   162MB
traefik                                                      v2.10     ee69e8120b64   13 months ago   153MB
flipt/flipt                                                  v1.34.0   db65814a90e4   14 months ago   65.7MB
redis                                                        7.2.3     e40e2763392d   15 months ago   138MB

## Additional context
Nothing to add, I think. 

## Screenshots

Note the discrepancy in the timestamps. The frontend is set to a different timezone, which accounts for a 1 hour difference.

![Image](https://github.com/user-attachments/assets/a78994df-13f1-4453-9ced-4f04bfdd6db6)

![Image](https://github.com/user-attachments/assets/cadbe460-550d-4099-9862-a6e6b2ab9205)

![Image](https://github.com/user-attachments/assets/4b06223a-d928-4f02-891c-2ba7c6d56a63)

`2025-03-17 11:37:51,418 unstructured_api DEBUG pipeline_api input params: {"filename": "unstract/prompt-studio-data/mock_org/mock_user_id/2cbd7437-b721-435b-a0cc-bafd4a421a1e/FullText.pdf", "response_type": "application/json", "coordinates": false, "encoding": "utf-8", "hi_res_model_name": null, "include_page_breaks": false, "ocr_languages": null, "pdf_infer_table_structure": true, "skip_infer_table_types": [], "strategy": "auto", "xml_keep_tags": false, "languages": null, "extract_image_block_types": null, "unique_element_ids": false, "chunking_strategy": null, "combine_under_n_chars": null, "max_characters": 500, "multipage_sections": true, "new_after_n_chars": null, "overlap": 0, "overlap_all": false, "starting_page_number": null, "include_slide_notes": true}
2025-03-17 11:37:51,419 unstructured_api DEBUG filetype: application/pdf
2025-03-17 11:37:51,428 unstructured_api DEBUG partition input data: {"content_type": "application/pdf", "strategy": "auto", "ocr_languages": null, "coordinates": false, "pdf_infer_table_structure": true, "include_page_breaks": false, "encoding": "utf-8", "hi_res_model_name": null, "xml_keep_tags": false, "skip_infer_table_types": [], "languages": null, "chunking_strategy": null, "multipage_sections": true, "combine_under_n_chars": null, "new_after_n_chars": null, "max_characters": 500, "overlap": 0, "overlap_all": false, "extract_image_block_types": null, "extract_image_block_to_payload": false, "unique_element_ids": false, "include_slide_notes": true}
2025-03-17 11:37:51,428 unstructured WARNING The pdf_infer_table_structure kwarg is deprecated. Please use skip_infer_table_types instead.
2025-03-17 11:37:57,399 unstructured_inference INFO Reading PDF for file: /tmp/tmp3dyy955y/document.pdf ...`

![Image](https://github.com/user-attachments/assets/d8ebde0f-d7c5-4cf1-8dc8-705a7a63a8f2)

![Image](https://github.com/user-attachments/assets/599a4234-de2b-4d8e-b77a-3c3f973b6424)


fix: [ISSUE] Indexing large PDF files time out API call from Unstract frontend #1195
