Feature/Bug: Add a pdf filetype check on processing #243

@toliver38

When uploading files through the new web UI, there is currently no validation that files labeled as PDF are actually PDF files. During a recent upload of multiple files, one file was labeled as a PDF but was not in PDF format. This caused the processing pipeline to hit an error, and the system now appears stuck, unable to process further files. From what I can tell, the same issue will occur via the API, as there is no file type check.

Issue Details:

  • The system attempted to process a non-PDF file, leading to an error where the file header did not match the expected PDF format.
  • The logs show an unclear error message and suggest the system is stuck, as the same series of messages keeps repeating for the same file:

2025-01-02 22:32:54.130 INFO  [140499353970496] Client:86 | Subscribing on Topic :persistent://tg/flow/document-load

2025-01-02 22:32:54.131 INFO  [140498914039488] HandlerBase:111 | [persistent://tg/flow/document-load, decoding.pdf, 0] Getting connection from pool

2025-01-02 22:32:54.131 INFO  [140498914039488] BinaryProtoLookupService:85 | Lookup response for persistent://tg/flow/document-load, lookup-broker-url pulsar://localhost:6650, from [192.168.0.2:41392 -> 192.168.0.11:6650] 

2025-01-02 22:32:54.134 INFO  [140498914039488] ConsumerImpl:300 | [persistent://tg/flow/document-load, decoding.pdf, 0] Created consumer on broker [192.168.0.2:41394 -> 192.168.0.11:6650] 

{'pulsar_host': 'pulsar://pulsar:6650', 'log_level': <LogLevel.INFO: 'info'>, 'metrics': True, 'metrics_port': 8000, 'input_queue': 'persistent://tg/flow/document-load', 'subscriber': 'decoding.pdf', 'output_queue': 'persistent://tg/flow/text-document-load'}

PDF inited

PDF message received

Decoding https://trustgraph.ai/doc/0fe61ab7-2bf4-4b3a-8ce6-d8c013b6458e...

invalid pdf header: b'{\n  "'

  • Subsequent files are not being processed, potentially due to the system halting further operations.

Proposed Solution:

  • Implement a file type check before processing begins. This would verify that files labeled as PDFs match the correct header signature for PDF files (e.g., %PDF- at the beginning of the file).
  • Provide a clear error message to the user if a file fails the type check.
  • Add error handling to ensure the system can continue processing valid files even if an invalid file is encountered.
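
The header check proposed above could look roughly like the following. This is a minimal sketch, not TrustGraph code: the names `PDF_MAGIC`, `looks_like_pdf`, and `validate_upload` are hypothetical, and it assumes the raw bytes of the upload are available before decoding starts. It only inspects the first five bytes, so it catches mislabeled files such as the JSON payload in the log above, but it does not prove the rest of the file is a well-formed PDF.

```python
# Hypothetical helper names; a sketch of a pre-processing header check.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header signature."""
    return data.startswith(PDF_MAGIC)


def validate_upload(name: str, data: bytes) -> None:
    """Raise ValueError with a readable message when the payload is not a PDF."""
    if not looks_like_pdf(data):
        raise ValueError(
            f"{name}: not a PDF file "
            f"(expected header {PDF_MAGIC!r}, found {data[:len(PDF_MAGIC)]!r})"
        )
```

Calling `validate_upload` at upload time (web UI or API) would let the user see the rejection immediately, while calling it again in the decoder guards against anything that slips past.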

Steps to Reproduce:

  1. Upload a set of files through the TrustGraph web UI, including a non-PDF file with a .pdf extension.
  2. Observe that the processing pipeline encounters an error and halts further processing.

Expected Behavior:

  • The system should validate files during the upload phase or during processing, reject invalid files, and raise an error.
  • Processing should continue for valid files without interruption.

Logs and Screenshots:

  • Logs provided above. Additional logs/screenshots can be shared upon request.

Priority: Medium

Let me know if further information is needed to address this issue.

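This could be as straightforward as a try/except around this block:

# Excerpt from the PDF decoder; fp, v, and self come from the enclosing method.
with open(fp.name, mode='rb') as f:
    # The loader reads from the path directly; f itself is not used.
    loader = PyPDFLoader(fp.name)
    pages = loader.load()
    # Emit one TextDocument per extracted page.
    for ix, page in enumerate(pages):
        r = TextDocument(
            metadata=v.metadata,
            text=page.page_content.encode("utf-8"),
        )
        self.send(r)
print("Done.", flush=True)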
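
To illustrate the try/except idea without depending on the real decoder, here is a hedged sketch of a processing loop that logs and skips a failing document instead of stalling. Everything in it is hypothetical: `process_all`, the `decode` and `send` callables, and the logger name stand in for the PDF loader and `self.send` in the excerpt above.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("pdf-decoder")  # hypothetical logger name


def process_all(
    documents: Iterable[Tuple[str, bytes]],
    decode: Callable[[bytes], List[str]],
    send: Callable[[str], None],
) -> List[str]:
    """Decode each document, forwarding pages; log and skip failures.

    Returns the IDs of the documents that were rejected.
    """
    rejected: List[str] = []
    for doc_id, data in documents:
        try:
            pages = decode(data)
        except Exception as exc:
            # Broad on purpose: one bad file must not block the rest.
            logger.error("Skipping %s: %s", doc_id, exc)
            rejected.append(doc_id)
            continue
        for page in pages:
            send(page)
    return rejected
```

In a Pulsar-backed consumer, the skipped message would presumably also need to be acknowledged or routed elsewhere so that it is not redelivered; how TrustGraph's handler deals with that is not shown in this issue.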

Labels: enhancement (New feature or request)
