fix: [ISSUE] indexing pdf with scans inside failed with timeout, tesseract vs llama3.2-vision? #1194

@gaetanquentin

Describe the bug

Uploading and indexing a large PDF that contains scanned pages fails with a timeout. Tesseract is used for OCR but is too slow, and it is still running when the extractor times out:

unstract-backend                | 172.28.0.1 - - [17/Mar/2025:09:57:30 +0000] "GET /api/v1/socket/?EIO=4&transport=websocket HTTP/1.1" 400 25 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36"
unstract-x2text-service         | [2025-03-17 09:57:30 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:7)
unstract-x2text-service         | [2025-03-17 09:57:30 +0000] [7] [ERROR] Error handling request /api/v1/x2text/process
unstract-x2text-service         | Traceback (most recent call last):
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 134, in handle
unstract-x2text-service         |     self.handle_request(listener, req, client, addr)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 177, in handle_request
unstract-x2text-service         |     respiter = self.wsgi(environ, resp.start_response)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-x2text-service         |     return self.wsgi_app(environ, start_response)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-x2text-service         |     response = self.full_dispatch_request()
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-x2text-service         |     rv = self.dispatch_request()
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-x2text-service         |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-x2text-service         |   File "/app/app/authentication_middleware.py", line 16, in wrapper
unstract-x2text-service         |     return func(*args, **kwargs)
unstract-x2text-service         |   File "/app/app/controllers/controller.py", line 120, in process
unstract-x2text-service         |     response = requests.request(
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/requests/api.py", line 59, in request
unstract-x2text-service         |     return session.request(method=method, url=url, **kwargs)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
unstract-x2text-service         |     resp = self.send(prep, **send_kwargs)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
unstract-x2text-service         |     r = adapter.send(request, **kwargs)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/requests/adapters.py", line 667, in send
unstract-x2text-service         |     resp = conn.urlopen(
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 789, in urlopen
unstract-x2text-service         |     response = self._make_request(
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 536, in _make_request
unstract-x2text-service         |     response = conn.getresponse()
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/urllib3/connection.py", line 464, in getresponse
unstract-x2text-service         |     httplib_response = super().getresponse()
unstract-x2text-service         |   File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
unstract-x2text-service         |     response.begin()
unstract-x2text-service         |   File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
unstract-x2text-service         |     version, status, reason = self._read_status()
unstract-x2text-service         |   File "/usr/local/lib/python3.9/http/client.py", line 281, in _read_status
unstract-x2text-service         |     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
unstract-x2text-service         |   File "/usr/local/lib/python3.9/socket.py", line 716, in readinto
unstract-x2text-service         |     return self._sock.recv_into(b)
unstract-x2text-service         |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base.py", line 204, in handle_abort
unstract-x2text-service         |     sys.exit(1)
unstract-x2text-service         | SystemExit: 1
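
For context on the log above: gunicorn's master process logs `WORKER TIMEOUT` when a worker stays silent for longer than its `timeout` setting (30 seconds by default), and the traceback shows the sync worker was still blocked inside `requests.request(...)`, waiting for the extractor to answer. A minimal sketch of a gunicorn config file that raises the limit follows. The file name, the 900-second value, and the idea that the x2text service can be started with a custom config are all assumptions, not taken from the Unstract repository.

```python
# gunicorn.conf.py (hypothetical file; all values are illustrative assumptions)

# Seconds a worker may stay silent before the master kills it.
# gunicorn's default is 30, which a long OCR run can easily exceed.
timeout = 900

# Seconds a worker gets to finish in-flight requests after a restart signal.
graceful_timeout = 60

# Sync workers handle one request at a time, so a long-running extraction
# blocks its worker for the whole duration of the call.
worker_class = "sync"
workers = 2
```

Raising the timeout only hides the slowness of the OCR step; it does not make the extraction itself faster.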

To reproduce

LLM profile:
Name: ollama-deepseek-r1
LLM: ollama-deepseek-r1
Embedding Model: ollama-emb-deepseek-r1
Vector Database: pg-vdb-1
Text Extractor: unstructured-io-1

Expected behavior

Indexing completes successfully.

Environment details

  • Version: latest, with optional profile

Additional context

Question

Is there a way to replace Tesseract, which is old and not GPU-accelerated, with the llama3.2-vision model?
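
To make the question concrete, here is a rough sketch of what transcribing one scanned page with llama3.2-vision through a local Ollama server could look like. It is not an Unstract text extractor adapter. The endpoint URL, the prompt wording, the timeout, and the helper names `build_ocr_payload` and `ocr_page` are assumptions for illustration, and each PDF page would first have to be rendered to an image by some other tool.

```python
import base64
import json
import urllib.request

# Assumed address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_ocr_payload(image_bytes: bytes, model: str = "llama3.2-vision") -> dict:
    """Build a non-streaming /api/generate request body with one inline image."""
    return {
        "model": model,
        # Hypothetical prompt; wording would need tuning for real documents.
        "prompt": "Transcribe all text visible in this scanned page.",
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ocr_page(image_bytes: bytes, timeout_s: float = 600.0) -> str:
    """Send one page image to Ollama and return the model's text output."""
    body = json.dumps(build_ocr_payload(image_bytes)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    # An explicit timeout keeps the caller from blocking indefinitely.
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ocr_page(open("page-1.png", "rb").read())` would return the transcription for one page, with `page-1.png` being a placeholder file name. Whether this is faster or more accurate than Tesseract on these scans is untested.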

Labels: bug (Something isn't working)
