
@getwithashish getwithashish commented Sep 20, 2024

Summary

This pull request introduces a new feature to locate the bounding box of each section within an image, enhancing the traceability of the markdown content. Users now have the ability to toggle this feature to obtain bounding box information for any markdown-generated section.

Why

Previously, there was no way to trace which section of the image the generated markdown originated from, limiting the interpretability of the output. This feature addresses that gap by providing bounding box coordinates for each markdown section.

Changes

  1. Added new prompt to segment the markdown (prompts.py)
  2. Perform OCR on the image to extract its text (ocr.py)
  3. Find bounding box of a specific string in the image (bounding_box.py)
  4. Added new error messages for OCR and bounding box operations (messages.py)
  5. Find bounding box of each section of the markdown, if bounding_box param is set to True (pdf.py)
  6. Remove markdown format from a string (text.py)
  7. Append system prompt to segment the markdown, if bounding_box param is set to True (modellitellm.py)
  8. Added a new Section type which will include all the identified sections of a page, along with their corresponding bounding boxes (types.py)
  9. Updated the Page model to include sections and their bounding boxes (zerox.py)
  10. Added a section delimiter which is used to separate the various sections in the markdown (patterns.py)
  11. Added dependencies to pyproject.toml (pyproject.toml)
  12. Wrote script for installing Tesseract (pre_install.py)
  13. Updated package metadata (setup.cfg)
  14. Updated README.md with bounding box details (README.md)
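As a rough illustration of item 10, the segmented markdown can be cut into sections by splitting on the delimiter. This is a minimal sketch: the delimiter string and the `split_sections` helper are hypothetical and are not taken from patterns.py.

```python
# Hypothetical delimiter; the real value lives in patterns.py and may differ.
SECTION_DELIMITER = "<<SECTION>>"


def split_sections(markdown: str) -> list[str]:
    """Split segmented markdown on the delimiter, dropping empty pieces."""
    parts = markdown.split(SECTION_DELIMITER)
    return [part.strip() for part in parts if part.strip()]


sections = split_sections("# Title<<SECTION>>\n\nFirst paragraph.<<SECTION>>")
```

Note that the sample output further down keeps surrounding whitespace and an empty trailing section, so the PR's own splitting evidently does less cleanup than this sketch.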

Functionality

  1. Get segmented markdown from the AI model
  2. Open and enhance the image using Pillow
  3. Perform OCR on the image using pytesseract
  4. For each section of the markdown:
    1. Remove the markdown format of that section
    2. Perform a similarity search using Levenshtein distance to match the section with the OCR data
    3. Calculate the normalized bounding box around the matched substring, ensuring all relevant text is enclosed
    4. Bounding box format: (left, top, width, height)
  5. Include sections along with their bounding boxes in the Page model for easier access and visualization
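Steps 4.2 and 4.3 can be sketched as follows. This is not the code in bounding_box.py: the `OcrWord` type, the fixed-size sliding window, and the function names are assumptions made for illustration, and the OCR words are assumed to arrive in reading order with pixel boxes.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    """One OCR token with its pixel box (hypothetical shape for this sketch)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def best_window(section: str, words: list[OcrWord]) -> tuple[int, int]:
    """Return (start, end) indices of the word run closest to `section`.

    Assumes `words` is non-empty. The window has as many words as the section.
    """
    n = max(1, min(len(section.split()), len(words)))
    best, best_dist = (0, n), None
    for start in range(len(words) - n + 1):
        candidate = " ".join(w.text for w in words[start:start + n])
        dist = levenshtein(section, candidate)
        if best_dist is None or dist < best_dist:
            best, best_dist = (start, start + n), dist
    return best


def normalized_box(words: list[OcrWord], image_width: int, image_height: int):
    """Union of the word boxes as normalized (left, top, width, height)."""
    left = min(w.left for w in words)
    top = min(w.top for w in words)
    right = max(w.left + w.width for w in words)
    bottom = max(w.top + w.height for w in words)
    return (left / image_width, top / image_height,
            (right - left) / image_width, (bottom - top) / image_height)
```

Taking the union of the matched word boxes is what keeps all of the matched text enclosed, including text that wraps onto a second line.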

Bounding Box De-Normalization

Bounding boxes are normalized (values between 0 and 1). To de-normalize, multiply the normalized values by the image's dimensions (width, height):

denormalized_left = left * image_width
denormalized_top = top * image_height
denormalized_width = width * image_width
denormalized_height = height * image_height
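The same arithmetic as a small helper, for convenience. The `denormalize` name and the rounding to whole pixels are choices made for this sketch, not part of the PR.

```python
def denormalize(box, image_width, image_height):
    """Convert a normalized (left, top, width, height) box to whole pixels."""
    left, top, width, height = box
    return (
        round(left * image_width),
        round(top * image_height),
        round(width * image_width),
        round(height * image_height),
    )


pixel_box = denormalize((0.25, 0.5, 0.5, 0.25), image_width=800, image_height=600)
```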

Usage

result = await zerox(
    file_path=file_path,
    model=model,
    bounding_box=True,  # enable per-section bounding boxes
    output_dir=output_dir,
    custom_system_prompt=custom_system_prompt,
    select_pages=select_pages,
    **kwargs
)
print(result)

Output

ZeroxOutput(
    completion_time=9432.975,
    file_name='cs101',
    input_tokens=36877,
    output_tokens=515,
    pages=[
        Page(
            content='| Type    | Description                          | Wrapper Class |\n' +
                    '|---------|--------------------------------------|---------------|\n' +
                    '| byte    | 8-bit signed 2s complement integer   | Byte          |\n' +
                    '| short   | 16-bit signed 2s complement integer  | Short         |\n' +
                    '| int     | 32-bit signed 2s complement integer  | Integer       |\n' +
                    '| long    | 64-bit signed 2s complement integer  | Long          |\n' +
                    '| float   | 32-bit IEEE 754 floating point number| Float         |\n' +
                    '| double  | 64-bit floating point number         | Double        |\n' +
                    '| boolean | may be set to true or false          | Boolean       |\n' +
                    '| char    | 16-bit Unicode (UTF-16) character    | Character     |\n\n' +
                    'Table 26.2.: Primitive types in Java\n\n' +
                    '## 26.3.1. Declaration & Assignment\n\n' +
                    'Java is a statically typed language meaning that all variables must be declared before you can use ' +
                    'them or refer to them. In addition, when declaring a variable, you must specify both its type and ' +
                    'its identifier. For example:\n\n' +
                    '```java\n' +
                    'int numUnits;\n' +
                    'double costPerUnit;\n' +
                    'char firstInitial;\n' +
                    'boolean isStudent;\n' +
                    '```\n\n' +
                    'Each declaration specifies the variable’s type followed by the identifier and ending with a ' +
                    'semicolon. The identifier rules are fairly standard: a name can consist of lowercase and ' +
                    'uppercase alphabetic characters, numbers, and underscores but may not begin with a numeric ' +
                    'character. We adopt the modern camelCasing naming convention for variables in our code. In ' +
                    'general, variables must be assigned a value before you can use them in an expression. You do not ' +
                    'have to immediately assign a value when you declare them (though it is good practice), but some ' +
                    'value must be assigned before they can be used or the compiler will issue an error.\n\n' +
                    'The assignment operator is a single equal sign, `=` and is a right-to-left assignment. That is, ' +
                    'the variable that we wish to assign the value to appears on the left-hand-side while the value ' +
                    '(literal, variable or expression) is on the right-hand-side. Using our variables from before, ' +
                    'we can assign them values:\n\n' +
                    '> 2 Instance variables, that is variables declared as part of an object do have default values. ' +
                    'For objects, the default is `null`, for all numeric types, zero is the default value. For the ' +
                    'boolean type, `false` is the default, and the default char value is `\\0`, the null-terminating ' +
                    'character (zero in the ASCII table).',
            content_length=2333,
            sections=[
              Section(
                content='| Type    | Description                               | Wrapper Class |\n|---------|-------------------------------------------|---------------|\n| byte    | 8-bit signed 2s complement integer        | Byte          |\n| short   | 16-bit signed 2s complement integer       | Short         |\n| int     | 32-bit signed 2s complement integer       | Integer       |\n| long    | 64-bit signed 2s complement integer       | Long          |\n| float   | 32-bit IEEE 754 floating point number     | Float         |\n| double  | 64-bit floating point number               | Double        |\n| boolean | may be set to `true` or `false`          | Boolean       |\n| char    | 16-bit Unicode (UTF-16) character        | Character     |\n\n**Table 26.2.: Primitive types in Java**  ',
                bounding_box=(0.16198125836680052, 0.07765151515151515, 0.5863453815261044, 0.19602272727272727)
              ),
              Section(
                content='\n\n### 26.3.1. Declaration & Assignment  ',
                bounding_box=(0.08433734939759036, 0.30303030303030304, 0.37751004016064255, 0.01231060606060606)
              ),
              Section(
                content='\n\nJava is a statically typed language meaning that all variables must be declared before you can use them or refer to them. In addition, when declaring a variable, you must specify both its type and its identifier. For example:\n\n```java\nint numUnits;  \ndouble costPerUnit;  \nchar firstInitial;  \nboolean isStudent;  \n```  ',
                bounding_box=(0.08299866131191433, 0.3446969696969697, 0.749665327978581, 0.13541666666666666)
              ),
              Section(
                  content="\n\nEach declaration specifies the variable's type followed by the identifier and ending with a semicolon. The identifier rules are fairly standard: a name can consist of lowercase and uppercase alphabetic characters, numbers, and underscores but may not begin with a numeric character. We adopt the modern camelCasing naming convention for variables in our code. In general, variables **must** be assigned a value before you can use them in an expression. You do not have to immediately assign a value when you declare them (though it is good practice), but some value must be assigned before they can be used or the compiler will issue an error.²  ",
                  bounding_box=(0.08299866131191433, 0.5501893939393939, 0.751004016064257, 0.1571969696969697)
              ),
              Section(
                content='\n\nThe assignment operator is a single equal sign, `=` and is a right-to-left assignment. That is, the variable that we wish to assign the value to appears on the left-hand-side while the value (literal, variable or expression) is on the right-hand-side. Using our variables from before, we can assign them values:  ',
                bounding_box=(0.08299866131191433, 0.6534090909090909, 0.7483266398929049, 0.10795454545454546)
              ),
              Section(
                content='\n\n²Instance variables, that is variables declared as part of an object do have default values. For objects, the default is `null`, for all numeric types, zero is the default value. For the `boolean` type, `false` is the default, and the default `char` value is `\\0`, the null-terminating character (zero in the ASCII table).  ',
                bounding_box=(0.08299866131191433, 0.6998106060606061, 0.749665327978581, 0.13825757575757575)
              ),
              Section(
                content='',
                bounding_box=(0.7054886211512718, 0.048295454545454544, 0.03480589022757698, 0.00946969696969697)
              )
            ],
            page=1
        )
    ]
)
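A caller that wants to draw these boxes usually needs corner coordinates rather than (left, top, width, height). The sketch below shows one way to get them and to skip sections with empty content, such as the last one in the output above. The `Section` tuple here is a stand-in that mirrors the printed fields, not the class from types.py.

```python
from typing import NamedTuple


class Section(NamedTuple):
    """Stand-in mirroring the fields printed above (not the real type)."""
    content: str
    bounding_box: tuple


def section_rectangles(sections, image_width, image_height):
    """Return pixel (x0, y0, x1, y1) rectangles for sections that have text."""
    rects = []
    for section in sections:
        if not section.content.strip():
            continue  # nothing to attribute the box to
        left, top, width, height = section.bounding_box
        rects.append((
            round(left * image_width),
            round(top * image_height),
            round((left + width) * image_width),
            round((top + height) * image_height),
        ))
    return rects
```

Each returned tuple is in the corner form that most drawing libraries accept for rectangles.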

Generated Markdown

| Type    | Description                             | Wrapper Class |
|---------|-----------------------------------------|---------------|
| byte    | 8-bit signed 2s complement integer      | Byte          |
| short   | 16-bit signed 2s complement integer     | Short         |
| int     | 32-bit signed 2s complement integer     | Integer       |
| long    | 64-bit signed 2s complement integer     | Long          |
| float   | 32-bit IEEE 754 floating point number   | Float         |
| double  | 64-bit floating point number            | Double        |
| boolean | may be set to `true` or `false`        | Boolean       |
| char    | 16-bit Unicode (UTF-16) character      | Character     |

Table 26.2.: Primitive types in Java  

## 26.3.1. Declaration & Assignment  

Java is a statically typed language meaning that all variables must be declared before you can use them or refer to them. In addition, when declaring a variable, you must specify both its type and its identifier. For example:

```java
int numUnits;  
double costPerUnit;  
char firstInitial;  
boolean isStudent;  
```

Each declaration specifies the variable's type followed by the identifier and ending with a semicolon. The identifier rules are fairly standard: a name can consist of lowercase and uppercase alphabetic characters, numbers, and underscores but may not begin with a numeric character. We adopt the modern camelCasing naming convention for variables in our code. In general, variables **must** be assigned a value before you can use them in an expression. You do not have to immediately assign a value when you declare them (though it is good practice), but some value must be assigned before they can be used or the compiler will issue an error.  

The assignment operator is a single equal sign, `=` and is a right-to-left assignment. That is, the variable that we wish to assign the value to appears on the left-hand-side while the value (literal, variable or expression) is on the right-hand-side. Using our variables from before, we can assign them values:  

2Instance variables, that is variables declared as part of an object do have default values. For objects, the default is `null`, for all numeric types, zero is the default value. For the `boolean` type, `false` is the default, and the default `char` value is `\0`, the null-terminating character (zero in the ASCII table).  

Screenshots

[Image: page plotted with the detected bounding boxes]

Performance Impact

  • Generation without bounding box: ~17 seconds
  • Generation with bounding box: ~22 seconds

The overhead of roughly 5 seconds is expected, given the extra OCR and matching work that bounding box detection requires.

@getwithashish getwithashish marked this pull request as ready for review September 20, 2024 14:37
@getwithashish
Author

@tylermaran Please have a look at this PR.

@getwithashish
Author

getwithashish commented Oct 7, 2024

Hi @tylermaran! I wanted to check in on my PR #44. If you have any feedback, I’d love to hear it—just making sure it hasn’t gotten lost in the shuffle!

I’m also planning to work on a Node version for this PR, so any input would be super helpful.

@tylermaran
Contributor

Hey @getwithashish! Sorry I sat on this one for so long. We're starting to really look into bounding boxes now and will be testing out your PR.
One thing we're thinking about, though, is pretty much running your method in reverse.

i.e.

  • find bounding boxes for everything using Tesseract.
  • use an image library like Pillow to cut up the document into each bounding box
  • pass each image separately to the model

I think this method gives a couple improvements:

  1. The responses will be much faster. The slowest part of zerox is waiting for the LLM to send back all the tokens. If we split each page into a set of 4 different LLM requests, we can run them all in parallel.
  2. Higher accuracy. We've noticed that most LLMs get "lazy" when there's a bunch of different charts on a page. But if we send each image independently, it will do a great job.
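The parallel part of this idea could look roughly like the sketch below. `transcribe_region` is a hypothetical stand-in that only sleeps; a real version would send one cropped image to the model. Nothing here is zerox code.

```python
import asyncio


async def transcribe_region(region_id: int) -> str:
    """Stand-in for one model request on a single cropped region."""
    await asyncio.sleep(0.01)  # simulates waiting on the model
    return f"markdown for region {region_id}"


async def transcribe_page(region_ids):
    """Run all region requests concurrently; results keep the input order."""
    return await asyncio.gather(*(transcribe_region(r) for r in region_ids))


page = asyncio.run(transcribe_page([0, 1, 2, 3]))
```

Because `asyncio.gather` returns results in the order the awaitables were passed, the per-region markdown can be joined back in reading order without extra bookkeeping.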

@99991

99991 commented Jan 8, 2025

PyTesseract is very old and much worse at OCR than GPT (try with handwritten notes for example), so this PR would be a massive downgrade. I am not sure if it is even a good idea for finding bounding boxes. I'd suggest looking into topics such as "Layout Detection" or "Layout Analysis". Here is a relatively recent benchmark: https://github.com/opendatalab/OmniDocBench?tab=readme-ov-file#layout-detection

DocLayout-YOLO does not seem too bad (license might be problematic), but there are new models being released every week, so I'd suggest abstracting it somehow.
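One possible shape for such an abstraction is a small interface that any layout backend can satisfy. The names `LayoutDetector`, `FullPageDetector`, and `find_regions` are hypothetical, and the trivial backend exists only to show the contract.

```python
from typing import Protocol

# Normalized (left, top, width, height), matching the format used in this PR.
Box = tuple[float, float, float, float]


class LayoutDetector(Protocol):
    """Anything that can return region boxes for a page image."""

    def detect(self, image_path: str) -> list[Box]:
        ...


class FullPageDetector:
    """Trivial backend: treats the whole page as a single region."""

    def detect(self, image_path: str) -> list[Box]:
        return [(0.0, 0.0, 1.0, 1.0)]


def find_regions(detector: LayoutDetector, image_path: str) -> list[Box]:
    """Callers depend only on the protocol, so backends can be swapped."""
    return detector.detect(image_path)


regions = find_regions(FullPageDetector(), "page.png")
```

A Tesseract-based or model-based detector would then be one more class with the same `detect` method, leaving the calling code unchanged.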


Development

Successfully merging this pull request may close these issues.

Research: Add bounding boxes to response
