@rateixei rateixei commented Oct 13, 2025

This PR implements a way to import Docx DrawingML objects as PNG images into DoclingDocument objects. This includes diagrams, hand-drawn shapes, and Word/Excel charts.

This is performed with the following steps:

  • Once a DrawingML object is found, an empty copy of the docx file is created (this is needed to keep the formatting styles that are defined in the file). One file is created for each paragraph that contains a DrawingML object.
  • The file is populated with the DrawingML object of that paragraph.
  • The docx file is temporarily saved and exported to a temporary PDF. The export uses LibreOffice ONLY, if it is available either on PATH or through the Docling-specific environment variable DOCLING_LIBREOFFICE_CMD. If LibreOffice is not available, a warning is displayed.
  • The temporary PDF is read as a PNG, which is then cropped to remove the page rectangle and stored as a Pillow image.
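
The LibreOffice lookup and the docx-to-PDF export described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names and the executable names tried on PATH are assumptions, and only DOCLING_LIBREOFFICE_CMD comes from the description. The rasterizing and cropping step is left out.

```python
import os
import shutil
import subprocess
import warnings
from pathlib import Path
from typing import List, Optional

# Executable names tried on PATH (an assumption; the PR does not list them).
CANDIDATE_NAMES = ("libreoffice", "soffice")


def find_libreoffice_cmd() -> Optional[str]:
    """Return the LibreOffice command, or None if it cannot be found."""
    # The Docling-specific environment variable takes precedence.
    override = os.environ.get("DOCLING_LIBREOFFICE_CMD")
    if override:
        return override
    for name in CANDIDATE_NAMES:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_convert_command(cmd: str, docx_path: Path, out_dir: Path) -> List[str]:
    """Build the headless docx -> PDF conversion command line."""
    return [
        cmd,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def docx_to_pdf(docx_path: Path, out_dir: Path) -> Optional[Path]:
    """Convert one temporary docx file to PDF; warn and return None without LibreOffice."""
    cmd = find_libreoffice_cmd()
    if cmd is None:
        warnings.warn("LibreOffice not found; DrawingML objects will be skipped.")
        return None
    subprocess.run(
        build_convert_command(cmd, docx_path, out_dir),
        check=True,
        capture_output=True,
    )
    # LibreOffice writes <stem>.pdf into the output directory.
    pdf_path = out_dir / (docx_path.stem + ".pdf")
    return pdf_path if pdf_path.exists() else None
```

Returning None instead of raising keeps LibreOffice optional: a caller can skip the DrawingML picture and continue converting the rest of the document.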

An example docx file containing diagrams, figures and charts is attached, along with the DoclingDocument export.

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

drawingml_example.docx
drawingml_example.json

@rateixei rateixei requested a review from dolfim-ibm October 13, 2025 13:08

github-actions bot commented Oct 13, 2025

DCO Check Passed

Thanks @rateixei, all your commits are properly signed off. 🎉

mergify bot commented Oct 13, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
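
For illustration, the title rule above can be tried locally with Python's `re` module. This is only a sketch: the helper name is hypothetical, and it assumes the rule behaves like a regular-expression search on the PR title.

```python
import re

# The pattern is copied from the merge protection rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def follows_conventional_commits(title: str) -> bool:
    """Return True if the title starts with a conventional commit prefix."""
    return TITLE_RULE.search(title) is not None


for title in (
    "feat(docx): Process drawingml objects in docx",
    "[WIP] feat(docx): Process drawingml objects in docx",
):
    print(title, "->", follows_conventional_commits(title))
```

Because the pattern is anchored with `^`, any prefix placed before the commit type, such as a `[WIP]` tag, makes the check fail.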

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

codecov bot commented Oct 13, 2025

Codecov Report

❌ Patch coverage is 84.46602% with 16 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
docling/backend/msword_backend.py | 75.00% | 12 Missing ⚠️
docling/backend/docx/drawingml/utils.py | 92.72% | 4 Missing ⚠️

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

I, Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>, hereby add my Signed-off-by to this commit: 9518fff

Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
@rateixei rateixei requested a review from dolfim-ibm October 13, 2025 14:05
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>
@rateixei rateixei marked this pull request as ready for review October 14, 2025 07:31
@rateixei rateixei changed the title from "[WIP] feat(docx): Adding feature to import drawingml objects in doclingdocument" to "feat(docx): Adding feature to import drawingml objects in doclingdocument" Oct 14, 2025
@rateixei rateixei requested a review from ceberam October 14, 2025 07:31
dosubot bot commented Oct 14, 2025

Related Documentation

Checked 3 published document(s). No updates required.

dolfim-ibm previously approved these changes Oct 14, 2025
@dolfim-ibm dolfim-ibm left a comment

lgtm

@dolfim-ibm dolfim-ibm changed the title from "feat(docx): Adding feature to import drawingml objects in doclingdocument" to "feat(docx): Process drawingml objects" Oct 14, 2025

@ceberam ceberam left a comment

  • IMHO, LibreOffice should be an optional dependency in Docling. However, if a user does not have LibreOffice installed, the backend test fails:
    FAILED tests/test_backend_msword.py::test_e2e_docx_conversions - AssertionError: export to markdown failed on tests/data/docx/drawingml.docx

    We could first check whether LibreOffice is installed and compare the output accordingly.
  • Just FYI for later: instead of an environment variable, we could use backend options to set the path to LibreOffice, since that is more transparent and easier to document. I introduce backend options in PR #2011, which I am preparing.
  • Could we extend this feature to the xlsx and pptx backends?
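
One way to read the first suggestion is to pick the expected output depending on whether LibreOffice can be found. The sketch below is hypothetical: the helper names, the executable names, and the `.no_libreoffice.md` naming scheme are assumptions and are not taken from the Docling test suite.

```python
import os
import shutil
from pathlib import Path


def libreoffice_available() -> bool:
    """True if LibreOffice is set via the env var or found on PATH."""
    if os.environ.get("DOCLING_LIBREOFFICE_CMD"):
        return True
    # "libreoffice" and "soffice" are assumed executable names.
    return any(shutil.which(name) for name in ("libreoffice", "soffice"))


def expected_markdown_path(docx_path: Path, groundtruth_dir: Path) -> Path:
    """Pick the reference Markdown file that matches the local setup."""
    # Hypothetical naming scheme: a separate reference file for runs
    # where DrawingML objects cannot be rendered.
    suffix = ".md" if libreoffice_available() else ".no_libreoffice.md"
    return groundtruth_dir / f"{docx_path.name}{suffix}"
```

A test could then compare the exported Markdown against `expected_markdown_path(...)`, so that the same test passes on machines with and without LibreOffice.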

@dolfim-ibm dolfim-ibm changed the title from "feat(docx): Process drawingml objects" to "feat(docx): Process drawingml objects in docx" Oct 14, 2025
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
dolfim-ibm and others added 2 commits October 14, 2025 14:49
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Rafael Teixeira de Lima <Rafael.td.lima@gmail.com>

@ceberam ceberam left a comment

🏆

@rateixei rateixei merged commit 1682993 into main Oct 15, 2025
25 checks passed
@rateixei rateixei deleted the rtdl/export_drawingml branch October 15, 2025 08:58
dosubot bot commented Oct 15, 2025

Documentation Updates

Checked 3 published document(s). No updates required.
