Feature/Bug: Add a pdf filetype check on processing

When uploading files through the new web UI, nothing currently validates that files labeled as PDF are actually PDFs. During a recent multi-file upload, one file was labeled as a PDF but was not in PDF format. The processing pipeline hit an error on that file, and the system now appears stuck and unable to process further files. As far as I can tell, the same issue will occur via the API, since there is no filetype check there either.

**Issue Details**:  
- The system attempted to process a non-PDF file, leading to an error where the file header did not match the expected PDF format.  
- Logs show an unclear error message and suggest the system is stuck, as the same series of messages for the same file keeps repeating:  
```
2025-01-02 22:32:54.128 INFO  [140498914039488] ProducerImpl:220 | [persistent://tg/flow/text-document-load, ] Created producer on broker [192.168.0.2:41394 -> 192.168.0.11:6650]

2025-01-02 22:32:54.130 INFO  [140499353970496] Client:86 | Subscribing on Topic :persistent://tg/flow/document-load

2025-01-02 22:32:54.131 INFO  [140498914039488] HandlerBase:111 | [persistent://tg/flow/document-load, decoding.pdf, 0] Getting connection from pool

2025-01-02 22:32:54.131 INFO  [140498914039488] BinaryProtoLookupService:85 | Lookup response for persistent://tg/flow/document-load, lookup-broker-url pulsar://localhost:6650, from [192.168.0.2:41392 -> 192.168.0.11:6650] 

2025-01-02 22:32:54.134 INFO  [140498914039488] ConsumerImpl:300 | [persistent://tg/flow/document-load, decoding.pdf, 0] Created consumer on broker [192.168.0.2:41394 -> 192.168.0.11:6650] 

{'pulsar_host': 'pulsar://pulsar:6650', 'log_level': <LogLevel.INFO: 'info'>, 'metrics': True, 'metrics_port': 8000, 'input_queue': 'persistent://tg/flow/document-load', 'subscriber': 'decoding.pdf', 'output_queue': 'persistent://tg/flow/text-document-load'}

PDF inited

PDF message received

Decoding https://trustgraph.ai/doc/0fe61ab7-2bf4-4b3a-8ce6-d8c013b6458e...

invalid pdf header: b'{\n  "'
```


- Subsequent files are not being processed, apparently because the decoder keeps retrying the same invalid file.  

**Proposed Solution**:  
- Implement a file type check before processing begins. This would verify that files labeled as PDFs match the correct header signature for PDF files (e.g., `%PDF-` at the beginning of the file).  
- Provide a clear error message to the user if a file fails the type check.  
- Add error handling to ensure the system can continue processing valid files even if an invalid file is encountered.  
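
A minimal sketch of the header check proposed above, not tied to the TrustGraph codebase. The function name `check_pdf_header` and its `name` parameter are hypothetical; the only assumption about the format is that a valid file starts with the `%PDF-` signature.

```python
PDF_MAGIC = b"%PDF-"  # signature expected at the very start of a PDF file


def check_pdf_header(data: bytes, name: str = "<upload>") -> None:
    """Raise ValueError with a readable message if data is not a PDF.

    Hypothetical helper: only the first bytes are inspected, so this
    rejects mislabeled files but does not prove the PDF is well-formed.
    """
    if not data.startswith(PDF_MAGIC):
        raise ValueError(
            f"{name}: not a PDF file "
            f"(expected header {PDF_MAGIC!r}, got {data[:5]!r})"
        )
```

Calling this on the decoded bytes before they reach the PDF loader would turn the `invalid pdf header` failure into an error that names the offending file and shows what was found instead of the signature.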

**Steps to Reproduce**:  
1. Upload a set of files through the TrustGraph web UI, including a non-PDF file with a `.pdf` extension.  
2. Observe that the processing pipeline encounters an error and halts further processing.  

**Expected Behavior**:  
- The system should validate files during the upload phase or during processing, reject invalid files, and raise an error.  
- Processing should continue for valid files without interruption.  

**Logs and Screenshots**:  
- Logs provided above. Additional logs/screenshots can be shared upon request.  

**Priority**: Medium  

Let me know if further information is needed to address this issue.

This could be as straightforward as a try/except around this block: https://github.com/trustgraph-ai/trustgraph/blob/ee9837c9ca628105170540f71871c010b167b06e/trustgraph-flow/trustgraph/decoding/pdf/pdf_decoder.py#L55-L69


    with open(fp.name, mode='rb') as f:

        loader = PyPDFLoader(fp.name)
        pages = loader.load()

        for ix, page in enumerate(pages):

            r = TextDocument(
                metadata=v.metadata,
                text=page.page_content.encode("utf-8"),
            )

            self.send(r)

    print("Done.", flush=True)
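
To illustrate the try/except idea, here is a rough sketch that is independent of the real decoder. `decode_all` and its `decode` callback are hypothetical names; in the actual code the callback would stand in for the `PyPDFLoader(...).load()` step, and how a failed message should be acknowledged or routed on the queue is left open.

```python
from typing import Callable, Iterable, List, Tuple


def decode_all(
    documents: Iterable[Tuple[str, bytes]],
    decode: Callable[[bytes], List[str]],
) -> Tuple[List[str], List[str]]:
    """Decode each (id, data) pair, skipping documents that fail.

    Returns the pages from all documents that decoded, plus one
    error string per document that raised during decoding.
    """
    pages: List[str] = []
    errors: List[str] = []
    for doc_id, data in documents:
        try:
            pages.extend(decode(data))
        except Exception as e:
            # Record the failure and move on instead of halting the pipeline.
            errors.append(f"{doc_id}: {e}")
    return pages, errors
```

The point of the design is that one bad document produces an error entry rather than an unhandled exception, so later documents still get processed.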

Feature/Bug: Add a pdf filetype check on processing #243

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Feature/Bug: Add a pdf filetype check on processing #243

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions