Charts Extraction from PPTX files #2306

@valentin-best

Description

Requested feature

Add support for extracting charts from PPTX files in Docling.
Currently, Docling ignores charts when parsing .pptx presentations, so the data behind them (e.g., bar, line, and pie charts) is lost to downstream processing.

The feature could expose chart metadata and data series in a structured format (e.g., JSON) so that downstream tools can either visualize them or analyze the data programmatically.

Example JSON output for a simple bar chart:

{
  "slide": 3,
  "shape_id": 12,
  "type": "clusteredColumn",
  "title": "Revenue by Region",
  "categories": ["Q1","Q2","Q3","Q4"],
  "series": [
    {"name":"EMEA","values":[12.3,14.2,15.0,16.8]},
    {"name":"AMER","values":[10.1,11.0,13.4,14.0]}
  ]
}

This would make Docling much more useful for users working with business presentations where charts are as important as text.

Alternatives

Currently, the only workaround is to parse PPTX files manually with python-pptx: walk through all shapes, extract the chart data, and build custom JSON output.
While this works, it means writing and maintaining additional code outside Docling, and it rules out a unified API for accessing text and charts together.
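
For reference, the manual workaround described above might look roughly like the sketch below. It assumes python-pptx is installed. `chart_to_dict` and `extract_charts` are hypothetical helper names used only for illustration and are not part of Docling or python-pptx. The sketch only visits top-level shapes on each slide, takes categories from the first plot, and reports the chart type using python-pptx's enum member names (e.g. `COLUMN_CLUSTERED`) rather than the `clusteredColumn` spelling in the example JSON.

```python
def chart_to_dict(chart, slide_no, shape_id):
    """Flatten one python-pptx chart into the JSON-like layout shown above.

    Hypothetical helper: it relies only on attributes that python-pptx
    chart objects expose (has_title, chart_title, chart_type, plots).
    """
    title = None
    # Check has_title first so that reading the title never alters the chart.
    if chart.has_title and chart.chart_title.has_text_frame:
        title = chart.chart_title.text_frame.text

    plots = list(chart.plots)
    # Assumption: every plot shares the category axis of the first plot.
    categories = [str(c) for c in plots[0].categories] if plots else []
    series = [
        {"name": s.name, "values": list(s.values)}
        for plot in plots
        for s in plot.series
    ]

    chart_type = chart.chart_type
    return {
        "slide": slide_no,
        "shape_id": shape_id,
        # Fall back to str() if the enum value has no .name attribute.
        "type": getattr(chart_type, "name", str(chart_type)),
        "title": title,
        "categories": categories,
        "series": series,
    }


def extract_charts(path):
    """Return one dict per chart found on the top level of each slide."""
    from pptx import Presentation  # third-party: pip install python-pptx

    charts = []
    for slide_no, slide in enumerate(Presentation(path).slides, start=1):
        for shape in slide.shapes:
            # Only graphic frames that hold a chart report has_chart == True.
            if getattr(shape, "has_chart", False):
                charts.append(chart_to_dict(shape.chart, slide_no, shape.shape_id))
    return charts
```

Calling `json.dumps(extract_charts("deck.pptx"), indent=2)` (with `deck.pptx` as a placeholder path) would then give output in the shape of the example above. Charts nested inside group shapes and chart types that python-pptx cannot classify are not handled here.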

Additional context

I’d be happy to contribute an initial implementation using python-pptx, at least for the most common chart types (bar/column/line/pie).
Would you accept a PR adding this feature behind a flag or as part of the default PPTX parsing pipeline?

Labels

enhancement (New feature or request), pptx (issue related to pptx backend)