|
1 | | -This project is currently a work in progress and just getting off the ground! Check back soon! |
| 1 | +# Welcome to the Docling Java Project! |
| 2 | + |
| 3 | +This is the repository for Docling Java, a Java API for using [Docling](https://github.com/docling-project).
| 4 | + |
| 5 | +[](https://docling-project.github.io/docling-java/) |
| 6 | +[](https://docling-project.github.io/docling-java) |
| 7 | +[](https://opensource.org/licenses/MIT) |
| 8 | +[](https://docling.ai/discord) |
| 9 | +[](https://www.bestpractices.dev/projects/11397) |
| 10 | + |
| 11 | +[Docling](https://github.com/docling-project) simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem. |
| 12 | + |
| 13 | +## Features |
| 14 | + |
| 15 | +* 🗂️ Parsing of [multiple document formats][supported_formats] incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, VTT, images (PNG, TIFF, JPEG, ...), and more |
| 16 | +* 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more |
| 17 | +* 🧬 Unified, expressive [DoclingDocument][docling_document] representation format |
| 18 | +* ↪️ Various [export formats][supported_formats] and options, including Markdown, HTML, [DocTags](https://arxiv.org/abs/2503.11576) and lossless JSON |
| 19 | +* 🔒 Local execution capabilities for sensitive data and air-gapped environments |
| 20 | +* 🤖 Plug-and-play [integrations][integrations] including [LangChain4j](https://docs.langchain4j.dev/) |
| 21 | +* 🔍 Extensive OCR support for scanned PDFs and images |
| 22 | +* 👓 Support for several Visual Language Models ([GraniteDocling](https://huggingface.co/ibm-granite/granite-docling-258M))
| 23 | +* 🎙️ Audio support with Automatic Speech Recognition (ASR) models |
| 24 | + |
| 25 | +## Artifacts |
| 26 | + |
| 27 | +This project aims to provide the following artifacts: |
| 28 | + |
| 29 | +* [`docling-api`](api): The core API for interacting with Docling. Should be framework-agnostic. |
| 30 | +* [`docling-client`](client): A reference implementation of the [`docling-api`](api) using Java's [`HttpClient`](https://openjdk.org/groups/net/httpclient/intro.html) and [Jackson](https://github.com/FasterXML/jackson). |
| 31 | +* [`docling-testing`](testing): Utilities for testing Docling.
| 32 | +* [`docling-testcontainers`](testcontainers): A [Testcontainers module](https://testcontainers.com/) for running Docling in a Docker container. |
| 33 | + |
| 34 | +## Getting started |
| 35 | + |
| 36 | +Use `DoclingApi.convertSource()` to convert individual documents. For example: |
| 37 | + |
| 38 | +```java |
| 39 | +import ai.docling.api.DoclingApi; |
| 40 | +import ai.docling.api.convert.request.ConvertDocumentRequest; |
| 41 | +import ai.docling.api.convert.response.ConvertDocumentResponse; |
| 42 | +import ai.docling.client.DoclingClient;
| | +import java.net.URI;
| 43 | + |
| 44 | +DoclingApi doclingApi = DoclingClient.builder() |
| 45 | + .baseUrl("<location of docling server>") |
| 46 | + .build(); |
| 47 | + |
| 48 | +ConvertDocumentRequest request = ConvertDocumentRequest.builder() |
| 49 | + .addHttpSources(URI.create("https://arxiv.org/pdf/2408.09869")) |
| 50 | + .build(); |
| 51 | + |
| 52 | +ConvertDocumentResponse response = doclingApi.convertSource(request); |
| 53 | +System.out.println(response.document().markdownContent()); |
| 54 | +``` |
| 55 | + |
| 56 | +More [usage information](https://docling-project.github.io/docling-java) is available in the docs.
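
Since `docling-client` is described as being built on Java's `HttpClient`, it may help to see roughly what an equivalent raw HTTP request could look like. The following is a minimal sketch, not the actual client implementation: the endpoint path `/v1/convert/source`, the JSON payload shape, the class name `ConvertRequestSketch`, and the server address `http://localhost:5001` are all assumptions made for illustration. The request is only built, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative sketch only; not the actual docling-client implementation.
class ConvertRequestSketch {

    // Assumption: the conversion endpoint path exposed by the Docling server.
    static final String CONVERT_PATH = "/v1/convert/source";

    // Builds the JSON payload by hand, without escaping the URL.
    // Assumption: the server accepts a list of HTTP sources in this shape.
    static String jsonBody(String documentUrl) {
        return "{\"sources\":[{\"kind\":\"http\",\"url\":\"" + documentUrl + "\"}]}";
    }

    // Creates (but does not send) a POST request carrying the JSON payload.
    static HttpRequest buildRequest(String baseUrl, String documentUrl) {
        return HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + CONVERT_PATH))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(jsonBody(documentUrl)))
            .build();
    }

    public static void main(String[] args) {
        // Hypothetical local server address.
        HttpRequest request = buildRequest(
            "http://localhost:5001", "https://arxiv.org/pdf/2408.09869");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

In practice the typed `DoclingApi` shown above is the more convenient route, since it handles serialization and response mapping; a real client would also build the payload with a JSON library such as Jackson rather than by string concatenation.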
| 57 | + |
| 58 | +## Get help and support |
| 59 | + |
| 60 | +Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling-java/discussions). |
| 61 | + |
| 62 | + |
| 63 | +## Contributing |
| 64 | + |
| 65 | +Please read [Contributing to Docling Java](CONTRIBUTING.md) for details. |
| 66 | + |
| 67 | +## License |
| 68 | + |
| 69 | +The Docling codebase is under the MIT license.
| 70 | +For individual model usage, please refer to the model licenses found in the original packages. |
| 71 | + |
| 72 | +### IBM ❤️ Open Source AI |
| 73 | + |
| 74 | +The project was started by the AI for knowledge team at IBM Research Zurich. |
| 75 | + |
| 76 | +[supported_formats]: https://docling-project.github.io/docling/usage/supported_formats/ |
| 77 | +[docling_document]: https://docling-project.github.io/docling/concepts/docling_document/ |
| 78 | +[integrations]: https://docling-project.github.io/docling/integrations/ |
2 | 79 |
|
3 | 80 | ## Contributors ✨ |
4 | 81 |