Contribution branch #2339
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
# Example Contribution | ||
Review comment: Please remove this. We have similar steps outlined in the CONTRIBUTING file. If you would like to propose changes there, please do it in a standalone PR. |
||
|
||
This file demonstrates a sample contribution to the Docling project. | ||
|
||
## What this shows | ||
- How to clone the repository | ||
- How to set up the development environment with `uv` | ||
- How to create a contribution branch | ||
- How to make changes and prepare for pushing | ||
|
||
## Development Setup Completed | ||
- ✅ Repository cloned | ||
- ✅ Virtual environment created with `uv sync` | ||
- ✅ Dependencies installed (244 packages) | ||
- ✅ CLI tool working (`docling --version`) | ||
- ✅ Pre-commit hooks installed | ||
- ✅ Contribution branch created | ||
|
||
## Next Steps for Contributing | ||
1. Fork the repository on GitHub | ||
2. Add your fork as a remote: `git remote add fork https://github.com/YOUR-USERNAME/docling.git` | ||
3. Make your changes | ||
4. Run tests: `uv run pytest` | ||
5. Run pre-commit checks: `uv run pre-commit run --all-files` | ||
6. Commit and push to your fork | ||
7. Create a Pull Request | ||
|
||
## Project Information | ||
- **Language**: Python 3.9-3.13 | ||
- **Package Manager**: uv | ||
- **Current Version**: 2.54.0 | ||
- **Main Purpose**: Document processing and parsing (PDF, DOCX, HTML, etc.) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,107 @@ | ||
# Performance Statistics Feature | ||
Review comment: We won't add a standalone page for this feature. Please rework it and add it as an addition to The section could be about the performance statistics in general and a subsection (shorter than this) for how to use it in the CLI. |
||
|
||
This document describes the new `--stats` performance statistics feature added to the Docling CLI. | ||
|
||
## Overview | ||
|
||
The `--stats` flag provides detailed performance metrics and timing information for document conversion operations. This feature is valuable for: | ||
|
||
- Understanding processing bottlenecks | ||
- Optimizing conversion workflows | ||
- Benchmarking performance across different systems | ||
- Debugging slow conversion processes | ||
|
||
## Usage | ||
|
||
Add the `--stats` flag to any `docling` conversion command: | ||
|
||
```bash | ||
# Single document with stats | ||
docling document.pdf --stats | ||
|
||
# Multiple documents with stats | ||
docling documents/ --stats --output ./converted | ||
|
||
# With other options | ||
docling document.pdf --stats --to json --to md --output ./output | ||
``` | ||
|
||
## Output Format | ||
|
||
The statistics output includes two main sections: | ||
|
||
### 1. Performance Statistics Table | ||
|
||
Shows high-level conversion metrics: | ||
|
||
- **Total Documents**: Number of documents processed | ||
- **Successful**: Number of successfully converted documents | ||
- **Failed**: Number of failed conversions | ||
- **Total Pages**: Sum of all pages across documents | ||
- **Avg Pages/Doc**: Average pages per document | ||
- **Total Time**: Total processing time in seconds | ||
- **Throughput (docs/s)**: Documents processed per second | ||
- **Throughput (pages/s)**: Pages processed per second | ||
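
The derived metrics in this list follow from the raw counts and the total time. The sketch below shows one way to compute them. `ConversionSummary` is a hypothetical container, not a Docling class, and dividing by all documents rather than only the successful ones is an assumption, since the description above does not say which is used.

```python
from dataclasses import dataclass


@dataclass
class ConversionSummary:
    """Hypothetical holder for the summary metrics (not part of Docling)."""

    total_documents: int
    successful: int
    failed: int
    total_pages: int
    total_time_s: float

    @property
    def avg_pages_per_doc(self) -> float:
        # Guard against an empty run to avoid division by zero.
        if self.total_documents == 0:
            return 0.0
        return self.total_pages / self.total_documents

    @property
    def docs_per_second(self) -> float:
        # Assumption: throughput counts every document, failed ones included.
        if self.total_time_s <= 0:
            return 0.0
        return self.total_documents / self.total_time_s

    @property
    def pages_per_second(self) -> float:
        if self.total_time_s <= 0:
            return 0.0
        return self.total_pages / self.total_time_s


# Illustrative numbers only; they are not taken from a real run.
summary = ConversionSummary(
    total_documents=4, successful=3, failed=1, total_pages=10, total_time_s=2.0
)
print(f"Avg Pages/Doc: {summary.avg_pages_per_doc:.1f}")
print(f"Throughput (docs/s): {summary.docs_per_second:.2f}")
print(f"Throughput (pages/s): {summary.pages_per_second:.2f}")
```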
|
||
### 2. Pipeline Timings Table | ||
|
||
Provides a detailed breakdown of processing time by pipeline operation: | ||
|
||
- **Operation**: Name of the pipeline stage (e.g., layout, table_structure, ocr) | ||
- **Total (s)**: Total time spent in this operation across all documents | ||
- **Avg (s)**: Average time per operation instance | ||
- **Min (s)**: Minimum time observed | ||
- **Max (s)**: Maximum time observed | ||
- **Count**: Number of times this operation was executed | ||
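
The columns above are plain aggregates over the individual timing samples of each operation. A minimal sketch of that aggregation follows. The function name `aggregate_timings` and the `(operation, seconds)` input shape are assumptions made for illustration and do not describe how the PR stores its measurements. Sorting by total time is also an assumption, chosen because it puts the most expensive operations first.

```python
from collections import defaultdict
from statistics import mean


def aggregate_timings(samples):
    """Group (operation, seconds) samples and compute per-operation statistics.

    Returns a list of row dicts sorted by total time, largest first.
    """
    by_operation = defaultdict(list)
    for operation, seconds in samples:
        by_operation[operation].append(seconds)

    rows = [
        {
            "operation": operation,
            "total": sum(values),
            "avg": mean(values),
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
        for operation, values in by_operation.items()
    ]
    return sorted(rows, key=lambda row: row["total"], reverse=True)


# Illustrative samples only: two pages went through layout, one through OCR.
rows = aggregate_timings([("layout", 0.5), ("ocr", 0.125), ("layout", 0.25)])
for row in rows:
    print(
        f"{row['operation']:<10} total={row['total']:.3f} avg={row['avg']:.3f} "
        f"min={row['min']:.3f} max={row['max']:.3f} count={row['count']}"
    )
```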
|
||
## Implementation Details | ||
|
||
- Enabling `--stats` automatically enables pipeline profiling (`DOCLING_DEBUG_PROFILE_PIPELINE_TIMINGS=true`) | ||
- Statistics are collected during processing and displayed after completion | ||
- The feature works with single documents, multiple documents, and batch processing | ||
- All timing measurements use high-precision monotonic time | ||
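
To illustrate what a high-precision monotonic measurement looks like in Python, the sketch below times a block with `time.perf_counter()`, which is unaffected by system clock adjustments. Whether the PR uses this exact clock is an assumption, and the `timed` helper and `store` dictionary are hypothetical names that do not appear in Docling.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(operation, store):
    """Append the elapsed wall time of the wrapped block to store[operation]."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    try:
        yield
    finally:
        # Record the sample even if the wrapped block raises.
        store.setdefault(operation, []).append(time.perf_counter() - start)


store = {}
with timed("layout", store):
    time.sleep(0.01)  # stand-in for real pipeline work

print(f"layout took {store['layout'][0]:.3f}s")
```

Recording in a `finally` clause means a failed operation still contributes a sample, which is useful when debugging slow or failing conversions.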
|
||
## Example Output | ||
|
||
``` | ||
📊 Performance Statistics | ||
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ | ||
┃ Metric ┃ Value ┃ | ||
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ | ||
│ 📄 Total Documents │ 1 │ | ||
│ ✅ Successful │ 1 │ | ||
│ ❌ Failed │ 0 │ | ||
│ 📃 Total Pages │ 1 │ | ||
│ 📊 Avg Pages/Doc │ 1.0 │ | ||
│ ⏱️ Total Time │ 5.13s │ | ||
│ 🚀 Throughput (docs/s) │ 0.20 │ | ||
│ 📄 Throughput (pages/s) │ 0.20 │ | ||
└─────────────────────────┴───────┘ | ||
⚙️ Pipeline Timings | ||
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━┓ | ||
┃ Operation ┃ Total (s) ┃ Avg (s) ┃ Min (s) ┃ Max (s) ┃ Count ┃ | ||
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━┩ | ||
│ pipeline_total │ 1.456 │ 1.456 │ 1.456 │ 1.456 │ 1 │ | ||
│ doc_build │ 1.410 │ 1.410 │ 1.410 │ 1.410 │ 1 │ | ||
│ table_structure │ 0.673 │ 0.673 │ 0.673 │ 0.673 │ 1 │ | ||
│ layout │ 0.508 │ 0.508 │ 0.508 │ 0.508 │ 1 │ | ||
│ ocr │ 0.115 │ 0.115 │ 0.115 │ 0.115 │ 1 │ | ||
│ page_parse │ 0.061 │ 0.061 │ 0.061 │ 0.061 │ 1 │ | ||
│ doc_assemble │ 0.046 │ 0.046 │ 0.046 │ 0.046 │ 1 │ | ||
│ page_init │ 0.045 │ 0.045 │ 0.045 │ 0.045 │ 1 │ | ||
│ reading_order │ 0.005 │ 0.005 │ 0.005 │ 0.005 │ 1 │ | ||
│ page_assemble │ 0.001 │ 0.001 │ 0.001 │ 0.001 │ 1 │ | ||
│ doc_enrich │ 0.000 │ 0.000 │ 0.000 │ 0.000 │ 1 │ | ||
└─────────────────┴───────────┴─────────┴─────────┴─────────┴───────┘ | ||
``` | ||
|
||
## Performance Insights | ||
|
||
From the example above, you can see that: | ||
|
||
- **Table structure detection** (0.673s) and **layout analysis** (0.508s) dominate: together they take 1.181s, about 81% of `pipeline_total` (1.456s) | ||
- **OCR processing** takes 0.115s for this document | ||
- **Page parsing** (0.061s) and **document assembly** (0.046s) are comparatively cheap | ||
|
||
This information helps identify optimization opportunities and understand where processing time is spent. |
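
A quick way to read the example table is to express each stage as a share of `pipeline_total`. The sketch below does this with the numbers from the example output. The helper name `share_of_total` is hypothetical. The rows appear to be nested (for example, `doc_build` seems to include the per-page stages), which is an assumption, so the shares are not expected to add up to 100%.

```python
# Values copied from the example output above, in seconds.
PIPELINE_TOTAL_S = 1.456
STAGE_SECONDS = {
    "table_structure": 0.673,
    "layout": 0.508,
    "ocr": 0.115,
    "page_parse": 0.061,
    "doc_assemble": 0.046,
}


def share_of_total(stage_seconds, total_s):
    """Return each stage's share of total_s as a percentage, rounded to 0.1."""
    return {
        name: round(100 * seconds / total_s, 1)
        for name, seconds in stage_seconds.items()
    }


shares = share_of_total(STAGE_SECONDS, PIPELINE_TOTAL_S)
for name, percent in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<16} {percent:5.1f}%")
```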
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,91 @@ | ||
#!/usr/bin/env python3 | ||
Review comment: We don't need an example for the CLI. Please remove. |
||
""" | ||
Example script demonstrating the new --stats performance feature in Docling CLI. | ||
This script shows how the --stats flag provides detailed performance insights | ||
for document conversion operations. | ||
""" | ||
|
||
import subprocess | ||
import sys | ||
from pathlib import Path | ||
|
||
|
||
def run_docling_with_stats(): | ||
    """Demonstrate the --stats feature with example documents.""" | ||
|
||
    print("🚀 Docling CLI Performance Statistics Demo") | ||
    print("=" * 50) | ||
    print() | ||
|
||
    # Example 1: Single document with stats | ||
    print("📄 Example 1: Single Document Performance Analysis") | ||
    print("-" * 40) | ||
|
||
    cmd = [ | ||
        "docling", | ||
        "tests/data/pdf/2305.03393v1-pg9.pdf", | ||
        "--stats", | ||
        "--output", "/tmp/stats_demo_single" | ||
    ] | ||
|
||
    print(f"Command: {' '.join(cmd)}") | ||
    print() | ||
|
||
    try: | ||
        result = subprocess.run(cmd, capture_output=True, text=True, check=True) | ||
        print("✅ Conversion completed successfully!") | ||
        print("\nOutput:") | ||
        print(result.stdout) | ||
        if result.stderr: | ||
            print("Warnings/Info:") | ||
            print(result.stderr) | ||
    except subprocess.CalledProcessError as e: | ||
        print(f"❌ Error: {e}") | ||
        return | ||
|
||
    print("\n" + "=" * 50) | ||
    print() | ||
|
||
    # Example 2: Multiple documents with stats | ||
    print("📄 Example 2: Batch Processing Performance Analysis") | ||
    print("-" * 40) | ||
|
||
    cmd = [ | ||
        "docling", | ||
        "tests/data/pdf/2305.03393v1-pg9.pdf", | ||
        "tests/data/pdf/code_and_formula.pdf", | ||
        "--stats", | ||
        "--output", "/tmp/stats_demo_batch" | ||
    ] | ||
|
||
    print(f"Command: {' '.join(cmd)}") | ||
    print() | ||
|
||
    try: | ||
        result = subprocess.run(cmd, capture_output=True, text=True, check=True) | ||
        print("✅ Batch conversion completed successfully!") | ||
        print("\nOutput:") | ||
        print(result.stdout) | ||
        if result.stderr: | ||
            print("Warnings/Info:") | ||
            print(result.stderr) | ||
    except subprocess.CalledProcessError as e: | ||
        print(f"❌ Error: {e}") | ||
        return | ||
|
||
    print("\n🎉 Demo completed! The --stats feature provides valuable insights into:") | ||
    print(" • Overall conversion performance (throughput, timing)") | ||
    print(" • Detailed pipeline operation breakdowns") | ||
    print(" • Processing bottlenecks identification") | ||
    print(" • Batch processing analytics") | ||
|
||
|
||
if __name__ == "__main__": | ||
    # Check if we're in the right directory | ||
    if not Path("tests/data/pdf").exists(): | ||
        print("❌ Error: This script must be run from the Docling repository root directory") | ||
        print(" Please run: cd /path/to/docling && python examples/stats_demo.py") | ||
        sys.exit(1) | ||
|
||
    run_docling_with_stats() |
Review comment: Please do not touch the CHANGELOG. This will be done by the CI/CD automatically.