Dual column text layout lost #374

@piertdewitt-glitch

Hi,

Please forgive me if I write something that doesn't make sense; I'm new to this and still learning.

I'm trying to feed single invoices to an LLM (through n8n) so I can ask questions about them. For clarity, I'm using n8n's AI nodes, which do not support uploading PDF files and asking questions about them.
The issue I run into is that some invoices have multiple two-column tables on them. For example:
Invoice number:__123456 ____________Customer: some company
Invoice date:_____01-01-2025 ________Customer number: 654321
Due date:________31-01-2025 ________Customer tax ID: 321654

Using docling-serve's UI to convert the PDF correctly labels each text element's position in docling-json (verifiable through the docling render tab). Also, the doctags format seems to correctly identify each text position within loc_xxx tags. However, in markdown, it looks something like this:
Invoice number:
Some company
01-01-2025
123456
Customer number
Due date:
321654

If I use the docling-json or doctags in a prompt, the LLM isn't able to correctly comprehend the format and can't answer many questions about it. I think the markdown would work, except for invoices with columns (of which there are quite a few). I have been watching a lot of YouTube videos and reading a lot of articles, but I can't find or understand how to feed the text from invoices, including each element's location, to an LLM. It needs to understand the locations of the text elements to understand that 123456 is the invoice number.
I tried almost all possible options in the web UI (different OCR engines, fast and accurate table modes, the VLM pipeline, etc.).
Maybe my assumptions are wrong?
Assumption: an LLM can only work with markdown if you want it to 'see' the document like we see it.
Only tables that are detected, are represented in markdown as objects where elements have a specific location.
Markdown ignores location tags for text elements.
There is a conspiracy against me to never mention how to get documents with element location data to an LLM (directly or through a RAG). This is only discussed in secret forums which are hidden from me and google search.

Assuming you just can't use the JSON or doctags location tags with LLMs, is there a workaround for this? Maybe there's a way to convert those location tags from the JSON to tables using an intermediary step and output the result to markdown? It's just invoices, I don't need the images or anything, I just need all the invoice's text and locations in such a way that the AI can understand the invoice like we see it. It does work if I upload the PDF to ChatGPT or Gemini, for example, so there must be a way, right? :-)
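
To make the "intermediary step" idea concrete, here is a minimal sketch of what I imagine, not a confirmed docling feature. It assumes each text element has already been reduced to a dict with hypothetical keys `text`, `x` (left edge) and `y` (top edge, measured from the top of the page); mapping the real docling-json bounding boxes into that shape, including any coordinate-origin flip, is left out and would need to be checked against the actual export. Elements whose `y` values fall within a tolerance are treated as one visual row, sorted left to right, and written out as a markdown table.

```python
from typing import Dict, List

Element = Dict[str, object]  # hypothetical shape: {"text": str, "x": float, "y": float}


def group_into_rows(elements: List[Element], y_tolerance: float = 5.0) -> List[List[Element]]:
    """Cluster elements into visual rows by vertical position.

    An element joins the current row when its y is within y_tolerance
    of the first (topmost) element of that row; otherwise it starts a new row.
    """
    rows: List[List[Element]] = []
    for el in sorted(elements, key=lambda e: (e["y"], e["x"])):
        if rows and abs(el["y"] - rows[-1][0]["y"]) <= y_tolerance:
            rows[-1].append(el)
        else:
            rows.append([el])
    # Inside each row, restore left-to-right reading order.
    return [sorted(row, key=lambda e: e["x"]) for row in rows]


def rows_to_markdown(rows: List[List[Element]]) -> str:
    """Render rows as a markdown table, padding short rows with empty cells."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    lines = [
        "| " + " | ".join(f"col{i + 1}" for i in range(width)) + " |",
        "| " + " | ".join("---" for _ in range(width)) + " |",
    ]
    for row in rows:
        # Escape pipes so cell text cannot break the table structure.
        cells = [str(el["text"]).replace("|", "\\|") for el in row]
        cells += [""] * (width - len(cells))
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Made-up coordinates that mimic the first two lines of the invoice example.
sample = [
    {"text": "some company", "x": 400, "y": 99},
    {"text": "Invoice number:", "x": 50, "y": 100},
    {"text": "123456", "x": 150, "y": 101},
    {"text": "Customer:", "x": 300, "y": 100},
    {"text": "Invoice date:", "x": 50, "y": 120},
    {"text": "01-01-2025", "x": 150, "y": 120},
    {"text": "Customer number:", "x": 300, "y": 121},
    {"text": "654321", "x": 400, "y": 120},
]

print(rows_to_markdown(group_into_rows(sample)))
```

With the sample coordinates, each label ends up in the cell next to its value, which is the relationship the plain markdown export loses. The tolerance is a guess and would probably need tuning per invoice layout; a fixed threshold will also break on rotated or skewed scans.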
