Description
When uploading files through the new web UI, there is currently no validation to ensure that files labeled as PDF are actually PDF files. During a recent upload of multiple files, one file was labeled as a PDF but was not in PDF format. This caused the processing pipeline to hit an error, and the system now appears stuck, unable to process further files. From what I can tell, the same issue will occur via the API, as there is no file type check there either.
Issue Details:
- The system attempted to process a non-PDF file, leading to an error where the file header did not match the expected PDF format.
- The logs show an unclear error message and suggest the system is stuck, as the same series of messages for the same file keeps repeating:
2025-01-02 22:32:54.130 INFO [140499353970496] Client:86 | Subscribing on Topic :persistent://tg/flow/document-load
2025-01-02 22:32:54.131 INFO [140498914039488] HandlerBase:111 | [persistent://tg/flow/document-load, decoding.pdf, 0] Getting connection from pool
2025-01-02 22:32:54.131 INFO [140498914039488] BinaryProtoLookupService:85 | Lookup response for persistent://tg/flow/document-load, lookup-broker-url pulsar://localhost:6650, from [192.168.0.2:41392 -> 192.168.0.11:6650]
2025-01-02 22:32:54.134 INFO [140498914039488] ConsumerImpl:300 | [persistent://tg/flow/document-load, decoding.pdf, 0] Created consumer on broker [192.168.0.2:41394 -> 192.168.0.11:6650]
{'pulsar_host': 'pulsar://pulsar:6650', 'log_level': <LogLevel.INFO: 'info'>, 'metrics': True, 'metrics_port': 8000, 'input_queue': 'persistent://tg/flow/document-load', 'subscriber': 'decoding.pdf', 'output_queue': 'persistent://tg/flow/text-document-load'}
PDF inited
PDF message received
Decoding https://trustgraph.ai/doc/0fe61ab7-2bf4-4b3a-8ce6-d8c013b6458e...
invalid pdf header: b'{\n "'
- Subsequent files are not being processed, potentially due to the system halting further operations.
Proposed Solution:
- Implement a file type check before processing begins. This would verify that files labeled as PDFs match the correct header signature for PDF files (e.g., %PDF- at the beginning of the file).
- Provide a clear error message to the user if a file fails the type check.
- Add error handling to ensure the system can continue processing valid files even if an invalid file is encountered.
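A minimal sketch of the header check proposed above, assuming the raw file bytes are available before decoding starts. The names check_pdf_header and NotAPdfError are hypothetical and do not exist in TrustGraph; the check is deliberately strict and only accepts content that begins with %PDF-.

```python
PDF_MAGIC = b"%PDF-"  # signature that opens a well-formed PDF file


class NotAPdfError(ValueError):
    """Hypothetical exception for content labeled as PDF that is not one."""


def check_pdf_header(data: bytes, name: str = "<upload>") -> None:
    """Raise NotAPdfError unless data starts with the PDF signature."""
    if not data.startswith(PDF_MAGIC):
        # Include the first few bytes so the user can see what was uploaded.
        raise NotAPdfError(
            f"{name}: expected a PDF (header {PDF_MAGIC!r}), "
            f"but the file starts with {data[:5]!r}"
        )
```

Run against the file from the logs above, whose content begins with JSON, this would raise an error that names the file and shows the offending bytes instead of only printing "invalid pdf header". The same function could be called from either the upload path or the decoder.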
Steps to Reproduce:
- Upload a set of files through the TrustGraph web UI, including a non-PDF file with a .pdf extension.
- Observe that the processing pipeline encounters an error and halts further processing.
Expected Behavior:
- The system should validate files during the upload phase or during processing, reject invalid files, and raise an error.
- Processing should continue for valid files without interruption.
Logs and Screenshots:
- Logs provided above. Additional logs/screenshots can be shared upon request.
Priority: Medium
Let me know if further information is needed to address this issue.
This could be as straightforward as a try/except here:
trustgraph/trustgraph-flow/trustgraph/decoding/pdf/pdf_decoder.py
Lines 55 to 69 in ee9837c
    with open(fp.name, mode='rb') as f:
        loader = PyPDFLoader(fp.name)
        pages = loader.load()
        for ix, page in enumerate(pages):
            r = TextDocument(
                metadata=v.metadata,
                text=page.page_content.encode("utf-8"),
            )
            self.send(r)
print("Done.", flush=True)
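A rough sketch of what that error handling could look like, written as a standalone example rather than a patch to pdf_decoder.py. Here decode_pdf stands in for the PyPDFLoader step and handle_message for the decoder's message handler; both names are hypothetical. The point is only that a failure on one document is logged and skipped rather than propagated.

```python
import logging

logger = logging.getLogger("pdf_decoder")


def decode_pdf(blob: bytes) -> list[str]:
    """Stand-in for the real decoding step (hypothetical)."""
    if not blob.startswith(b"%PDF-"):
        raise ValueError(f"invalid pdf header: {blob[:5]!r}")
    return ["page text"]


def handle_message(doc_id: str, blob: bytes, send) -> bool:
    """Decode one document; return False instead of raising on bad input."""
    try:
        pages = decode_pdf(blob)
    except Exception as exc:
        # Log which document failed and why, then move on to the next one.
        logger.error("Skipping %s: %s", doc_id, exc)
        return False
    for text in pages:
        send(text)
    return True
```

The repeating log lines suggest the failing message is redelivered after the exception. If that is the case, the handler would also need to acknowledge or otherwise discard the bad message so it is not retried forever; that is an assumption about the cause, not something the logs confirm.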