rex993 (Contributor) commented Jun 15, 2025

  • Install libreoffice-nogui in Docker container
  • Add soffice availability check on initialization
  • Implement unified Office-to-PDF conversion method
  • Add support for DOCX, PPTX, and XLSX files
  • Include additional unstructured extras for Office formats
  • Add unstructured support for Office documents when ColPali is disabled
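
For readers who have not seen the diff, the availability check and the unified conversion step listed above could look roughly like the sketch below. This is an illustration and not the code in this PR: the function names `find_soffice`, `build_convert_command`, `expected_pdf_path`, and `convert_office_to_pdf` are hypothetical, as is the 120-second timeout. The only parts taken from LibreOffice itself are the `--headless`, `--convert-to`, and `--outdir` command-line options.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional

# File types named in the PR description.
OFFICE_SUFFIXES = {".docx", ".pptx", ".xlsx"}


def find_soffice() -> Optional[str]:
    """Return the path of the soffice binary on PATH, or None if it is absent."""
    return shutil.which("soffice")


def build_convert_command(soffice: str, src: Path, out_dir: Path) -> List[str]:
    """Build the headless LibreOffice command that converts src to PDF."""
    return [soffice, "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(src)]


def expected_pdf_path(src: Path, out_dir: Path) -> Path:
    """soffice names the output after the input stem, with a .pdf suffix."""
    return out_dir / f"{src.stem}.pdf"


def convert_office_to_pdf(src: Path, out_dir: Path, timeout: float = 120.0) -> Path:
    """Convert one DOCX, PPTX, or XLSX file to PDF (hypothetical helper)."""
    if src.suffix.lower() not in OFFICE_SUFFIXES:
        raise ValueError(f"unsupported Office format: {src.suffix}")
    soffice = find_soffice()
    if soffice is None:
        raise RuntimeError("soffice not found on PATH; is LibreOffice installed?")
    # check=True raises CalledProcessError on a non-zero exit code.
    subprocess.run(build_convert_command(soffice, src, out_dir),
                   check=True, capture_output=True, timeout=timeout)
    return expected_pdf_path(src, out_dir)
```

Using one code path for all three formats means the caller only needs to handle a PDF afterwards, which is presumably the point of the "unified" method in the description.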

rex993 added 2 commits June 15, 2025 12:49
- Install libreoffice-nogui in Docker container
- Add soffice availability check on initialization
- Implement unified Office-to-PDF conversion method
- Add support for DOCX, PPTX, and XLSX files
- Include additional unstructured extras for Office formats

jazzberry-ai bot commented Jun 15, 2025

Bug Report

| Name | Severity | Example test case | Description |
|---|---|---|---|
| Overly aggressive text sanitization | Medium | Upload a document with specific Unicode characters. | The _sanitize_text function in core/parser/morphik_parser.py might remove valid Unicode characters, leading to data loss. |
| Unhandled soffice failure | Medium | Attempt to convert an Office document to PDF when the output directory is not writable. | The _convert_office_to_pdf function in core/services/document_service.py doesn't handle the case where the soffice command fails to create the PDF file in the expected location, potentially leading to uninformative error messages. |
| Missing empty input file check | Low | Upload an empty Office document. | The _convert_office_to_pdf function in core/services/document_service.py doesn't check whether the input file is empty before calling soffice, potentially leading to unnecessary processing and empty PDF files. |
| Missing runtime soffice availability check | Low | Start the service and then uninstall LibreOffice. | The _convert_office_to_pdf function in core/services/document_service.py checks for soffice availability only on initialization and does not account for it becoming unavailable later. |
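
One way the three soffice findings in this report could be addressed is sketched below. It is a minimal example under stated assumptions, not the project's `_convert_office_to_pdf`: the function name `convert_office_to_pdf_checked`, the `OfficeConversionError` type, and the injectable `which` and `run` parameters (added so the checks can be tested without LibreOffice installed) are all hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Optional


class OfficeConversionError(RuntimeError):
    """Hypothetical error type for failed Office-to-PDF conversions."""


def convert_office_to_pdf_checked(
    src: Path,
    out_dir: Path,
    which: Callable[[str], Optional[str]] = shutil.which,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    timeout: float = 120.0,
) -> Path:
    # Empty input check: skip the soffice call for missing or zero-byte files.
    if not src.is_file() or src.stat().st_size == 0:
        raise OfficeConversionError(f"input file is missing or empty: {src}")

    # Runtime availability check: look soffice up on every call,
    # not only once at service start.
    soffice = which("soffice")
    if soffice is None:
        raise OfficeConversionError("soffice is not available on PATH")

    out_dir.mkdir(parents=True, exist_ok=True)
    result = run(
        [soffice, "--headless", "--convert-to", "pdf",
         "--outdir", str(out_dir), str(src)],
        capture_output=True, text=True, timeout=timeout,
    )

    # soffice failure check: verify that the PDF actually exists and is
    # non-empty, and surface stderr when it does not.
    pdf_path = out_dir / f"{src.stem}.pdf"
    if (result.returncode != 0 or not pdf_path.is_file()
            or pdf_path.stat().st_size == 0):
        raise OfficeConversionError(
            f"soffice did not produce {pdf_path} "
            f"(exit code {result.returncode}): {result.stderr.strip()}"
        )
    return pdf_path
```

Checking for the output file rather than trusting the exit code alone is deliberate here, since the report describes soffice failing to create the PDF in the expected location.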



jazzberry-ai bot commented Jun 15, 2025

An error occurred.

This error may be due to rate limits. If this error persists, please email us.

Adityav369 (Collaborator) commented

Added default support for all office documents in latest commit: f9e6e3e. Thanks!

@Adityav369 Adityav369 closed this Aug 14, 2025