Commit 594b4b6

Merge pull request #24 from RapidAI/wired_table_optim
Wired table optim
2 parents 37b6b76 + ae3f873 commit 594b4b6

19 files changed

+2029
-335
lines changed

.github/workflows/lineless_table_rec.yml

Lines changed: 37 additions & 37 deletions
@@ -35,40 +35,40 @@ jobs:
3535
3636
pytest tests/test_lineless_table_rec.py
3737
38-
GenerateWHL_PushPyPi:
39-
needs: UnitTesting
40-
runs-on: ubuntu-latest
41-
42-
steps:
43-
- uses: actions/checkout@v3
44-
45-
- name: Set up Python 3.7
46-
uses: actions/setup-python@v4
47-
with:
48-
python-version: '3.7'
49-
architecture: 'x64'
50-
51-
- name: Run setup.py
52-
run: |
53-
pip install -r requirements.txt
54-
python -m pip install --upgrade pip
55-
pip install wheel get_pypi_latest_version
56-
57-
wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/lineless_table_rec_models.zip
58-
unzip lineless_table_rec_models.zip
59-
mv lineless_table_rec_models/*.onnx lineless_table_rec/models/
60-
61-
python setup_lineless.py bdist_wheel "${{ github.event.head_commit.message }}"
62-
63-
# - name: Publish distribution 📦 to Test PyPI
64-
# uses: pypa/gh-action-pypi-publish@v1.5.0
65-
# with:
66-
# password: ${{ secrets.TEST_PYPI_API_TOKEN }}
67-
# repository_url: https://test.pypi.org/legacy/
68-
# packages_dir: dist/
69-
70-
- name: Publish distribution 📦 to PyPI
71-
uses: pypa/gh-action-pypi-publish@v1.5.0
72-
with:
73-
password: ${{ secrets.PYPI_API_TOKEN }}
74-
packages_dir: dist/
38+
# GenerateWHL_PushPyPi:
39+
# needs: UnitTesting
40+
# runs-on: ubuntu-latest
41+
#
42+
# steps:
43+
# - uses: actions/checkout@v3
44+
#
45+
# - name: Set up Python 3.7
46+
# uses: actions/setup-python@v4
47+
# with:
48+
# python-version: '3.7'
49+
# architecture: 'x64'
50+
#
51+
# - name: Run setup.py
52+
# run: |
53+
# pip install -r requirements.txt
54+
# python -m pip install --upgrade pip
55+
# pip install wheel get_pypi_latest_version
56+
#
57+
# wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/lineless_table_rec_models.zip
58+
# unzip lineless_table_rec_models.zip
59+
# mv lineless_table_rec_models/*.onnx lineless_table_rec/models/
60+
#
61+
# python setup_lineless.py bdist_wheel "${{ github.event.head_commit.message }}"
62+
#
63+
# # - name: Publish distribution 📦 to Test PyPI
64+
# # uses: pypa/gh-action-pypi-publish@v1.5.0
65+
# # with:
66+
# # password: ${{ secrets.TEST_PYPI_API_TOKEN }}
67+
# # repository_url: https://test.pypi.org/legacy/
68+
# # packages_dir: dist/
69+
#
70+
# - name: Publish distribution 📦 to PyPI
71+
# uses: pypa/gh-action-pypi-publish@v1.5.0
72+
# with:
73+
# password: ${{ secrets.PYPI_API_TOKEN }}
74+
# packages_dir: dist/

.github/workflows/table_cls.yml

Lines changed: 30 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -35,33 +35,33 @@ jobs:
3535
3636
pytest tests/test_table_cls.py
3737
38-
GenerateWHL_PushPyPi:
39-
needs: UnitTesting
40-
runs-on: ubuntu-latest
41-
42-
steps:
43-
- uses: actions/checkout@v3
44-
45-
- name: Set up Python 3.10
46-
uses: actions/setup-python@v4
47-
with:
48-
python-version: '3.10'
49-
architecture: 'x64'
50-
51-
- name: Run setup.py
52-
run: |
53-
pip install -r requirements.txt
54-
python -m pip install --upgrade pip
55-
pip install wheel get_pypi_latest_version
56-
57-
wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/table_cls_models.zip
58-
unzip table_cls_models.zip
59-
mv table_cls_models/*.onnx table_cls/models/
60-
61-
python setup_table_cls.py bdist_wheel "${{ github.event.head_commit.message }}"
62-
63-
- name: Publish distribution 📦 to PyPI
64-
uses: pypa/gh-action-pypi-publish@v1.5.0
65-
with:
66-
password: ${{ secrets.TABLE_CLS }}
67-
packages_dir: dist/
38+
# GenerateWHL_PushPyPi:
39+
# needs: UnitTesting
40+
# runs-on: ubuntu-latest
41+
#
42+
# steps:
43+
# - uses: actions/checkout@v3
44+
#
45+
# - name: Set up Python 3.10
46+
# uses: actions/setup-python@v4
47+
# with:
48+
# python-version: '3.10'
49+
# architecture: 'x64'
50+
#
51+
# - name: Run setup.py
52+
# run: |
53+
# pip install -r requirements.txt
54+
# python -m pip install --upgrade pip
55+
# pip install wheel get_pypi_latest_version
56+
#
57+
# wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/table_cls_models.zip
58+
# unzip table_cls_models.zip
59+
# mv table_cls_models/*.onnx table_cls/models/
60+
#
61+
# python setup_table_cls.py bdist_wheel "${{ github.event.head_commit.message }}"
62+
#
63+
# - name: Publish distribution 📦 to PyPI
64+
# uses: pypa/gh-action-pypi-publish@v1.5.0
65+
# with:
66+
# password: ${{ secrets.TABLE_CLS }}
67+
# packages_dir: dist/

.github/workflows/wired_table_rec.yml

Lines changed: 30 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -35,33 +35,33 @@ jobs:
3535
3636
pytest tests/test_wired_table_rec.py
3737
38-
GenerateWHL_PushPyPi:
39-
needs: UnitTesting
40-
runs-on: ubuntu-latest
41-
42-
steps:
43-
- uses: actions/checkout@v3
44-
45-
- name: Set up Python 3.7
46-
uses: actions/setup-python@v4
47-
with:
48-
python-version: '3.7'
49-
architecture: 'x64'
50-
51-
- name: Run setup.py
52-
run: |
53-
pip install -r requirements.txt
54-
python -m pip install --upgrade pip
55-
pip install wheel get_pypi_latest_version
56-
57-
wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/wired_table_rec_models.zip
58-
unzip wired_table_rec_models.zip
59-
mv wired_table_rec_models/*.onnx wired_table_rec/models/
60-
61-
python setup_wired.py bdist_wheel "${{ github.event.head_commit.message }}"
62-
63-
- name: Publish distribution 📦 to PyPI
64-
uses: pypa/gh-action-pypi-publish@v1.5.0
65-
with:
66-
password: ${{ secrets.PYPI_API_TOKEN }}
67-
packages_dir: dist/
38+
# GenerateWHL_PushPyPi:
39+
# needs: UnitTesting
40+
# runs-on: ubuntu-latest
41+
#
42+
# steps:
43+
# - uses: actions/checkout@v3
44+
#
45+
# - name: Set up Python 3.7
46+
# uses: actions/setup-python@v4
47+
# with:
48+
# python-version: '3.7'
49+
# architecture: 'x64'
50+
#
51+
# - name: Run setup.py
52+
# run: |
53+
# pip install -r requirements.txt
54+
# python -m pip install --upgrade pip
55+
# pip install wheel get_pypi_latest_version
56+
#
57+
# wget https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/wired_table_rec_models.zip
58+
# unzip wired_table_rec_models.zip
59+
# mv wired_table_rec_models/*.onnx wired_table_rec/models/
60+
#
61+
# python setup_wired.py bdist_wheel "${{ github.event.head_commit.message }}"
62+
#
63+
# - name: Publish distribution 📦 to PyPI
64+
# uses: pypa/gh-action-pypi-publish@v1.5.0
65+
# with:
66+
# password: ${{ secrets.PYPI_API_TOKEN }}
67+
# packages_dir: dist/

README.md

Lines changed: 96 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
<div align="center">
22
<div align="center">
3-
<h1><b>📊 Table Structure Recognition</b></h1>
3+
<h1><b>📊 Table Structure Recognition</b></h1>
44
</div>
55
<a href=""><img src="https://img.shields.io/badge/Python->=3.6,<3.12-aff.svg"></a>
66
<a href=""><img src="https://img.shields.io/badge/OS-Linux%2C%20Mac%2C%20Win-pink.svg"></a>
@@ -10,61 +10,125 @@
1010
<a href="https://semver.org/"><img alt="SemVer2.0" src="https://img.shields.io/badge/SemVer-2.0-brightgreen"></a>
1111
<a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
1212
<a href="https://github.com/RapidAI/TableStructureRec/blob/c41bbd23898cb27a957ed962b0ffee3c74dfeff1/LICENSE"><img alt="GitHub" src="https://img.shields.io/badge/license-Apache 2.0-blue"></a>
13+
</div>
14+
15+
### Introduction
16+
17+
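💖 This repository is an inference library for structured recognition of tables in documents. It includes the table recognition model from Paddle,
18+
the wired and lineless table recognition models from Alibaba Duguang, a wired table model contributed by llaipython (WeChat), and the table classification model built into NetEase QAnything.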
19+
20+
#### Features
21+
Uses ONNXRuntime as the inference engine; single-image inference on CPU takes 1-7 s
22+
23+
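🎯 Combines a table-type classification model to distinguish wired tables from lineless tables, so each task is more specialized and accuracy is higher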
1324

14-
[简体中文](./docs/README_zh.md) | English
25+
🛡️ Depends on no third-party training framework, only on the essential base libraries, which avoids package conflicts
26+
27+
### Demo
28+
<div align="center">
29+
<img src="https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/demo_img_output.gif" alt="Demo" width="100%" height="100%">
1530
</div>
1631

17-
### Introduction
32+
### Benchmark Results
33+
[TableRecognitionMetric evaluation tool](https://github.com/SWHL/TableRecognitionMetric) [Evaluation dataset](https://huggingface.co/datasets/SWHL/table_rec_test_dataset) [Rapid OCR](https://github.com/RapidAI/RapidOCR)
1834

19-
This repo is an inference library used for structured recognition of tables in documents, including table structure recognition algorithm models from PaddleOCR, wired and wireless table recognition algorithm models from Alibaba Duguang, etc.
35+
| Method | TEDS |
36+
|:---------------------------------------------------------------------------------------------------------------------------|:----:|
37+
| lineless_table_rec | 0.53561 |
38+
| [RapidTable](https://github.com/RapidAI/RapidStructure/blob/b800b156015bf5cd6f5429295cdf48be682fd97e/docs/README_Table.md) | 0.58786 |
39+
| wired_table_rec v1 | 0.70279 |
40+
| wired_table_rec v2 | 0.78007 |
41+
| table_cls + wired_table_rec v1 + lineless_table_rec | 0.74692 |
42+
| table_cls + wired_table_rec v2 + lineless_table_rec | 0.80235 |
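
To make the table easier to compare, the following minimal sketch computes relative TEDS gains from the scores listed above. The scores are copied from the table; the helper name `relative_gain` is hypothetical and not part of any package in this repository.

```python
# TEDS scores copied from the table above.
teds = {
    "lineless_table_rec": 0.53561,
    "RapidTable": 0.58786,
    "wired_table_rec v1": 0.70279,
    "wired_table_rec v2": 0.78007,
    "table_cls + wired_table_rec v1 + lineless_table_rec": 0.74692,
    "table_cls + wired_table_rec v2 + lineless_table_rec": 0.80235,
}


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction (hypothetical helper)."""
    return (new - old) / old


# Method with the highest TEDS score.
best = max(teds, key=teds.get)
# Gain of the v2 wired model over v1.
v2_over_v1 = relative_gain(teds["wired_table_rec v2"], teds["wired_table_rec v1"])
# Gain of the classifier-routed pipeline over the v2 wired model alone.
routed_over_v2 = relative_gain(teds[best], teds["wired_table_rec v2"])

print(f"best: {best} ({teds[best]:.5f})")
print(f"wired v2 over wired v1: {v2_over_v1:.1%}")
print(f"routed pipeline over wired v2 alone: {routed_over_v2:.1%}")
```

On these numbers, routing through `table_cls` adds a smaller gain on top of v2 than v2 adds over v1, which suggests that most of the improvement in this commit comes from the new wired model.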
2043

21-
The repo has improved the pre- and post-processing of form recognition and combined with OCR to ensure that the form recognition part can be used directly.
44+
### Installation
45+
``` python {linenos=table}
46+
pip install wired_table_rec lineless_table_rec table_cls
47+
```
2248

23-
The repo will continue to focus on the field of table recognition, integrate the latest and most useful table recognition algorithms, and strive to create the most valuable table recognition tool library.
49+
### Quick Start
50+
``` python {linenos=table}
51+
import os
2452

25-
Welcome everyone to continue to pay attention.
53+
from lineless_table_rec import LinelessTableRecognition
54+
from lineless_table_rec.utils_table_recover import format_html, plot_rec_box_with_logic_info, plot_rec_box
55+
from table_cls import TableCls
56+
from wired_table_rec import WiredTableRecognition
2657

27-
### What is Table Structure Recognition?
58+
lineless_engine = LinelessTableRecognition()
59+
wired_engine = WiredTableRecognition()
60+
table_cls = TableCls()
61+
img_path = 'images/img14.jpg'
2862

29-
Table Structure Recognition (TSR) aims to extract the logical or physical structure of table images, thereby converting unstructured table images into machine-readable formats.
63+
cls,elasp = table_cls(img_path)
64+
if cls == 'wired':
65+
table_engine = wired_engine
66+
else:
67+
table_engine = lineless_engine
68+
html, elasp, polygons, logic_points, ocr_res = table_engine(img_path)
69+
print(f"elasp: {elasp}")
3070

31-
Logical structure: represents the row/column relationship of cells (such as the same row, the same column) and the span information of cells.
71+
# output_dir = f'outputs'
72+
# complete_html = format_html(html)
73+
# os.makedirs(os.path.dirname(f"{output_dir}/table.html"), exist_ok=True)
74+
# with open(f"{output_dir}/table.html", "w", encoding="utf-8") as file:
75+
# file.write(complete_html)
76+
# # Visualize table recognition boxes + logical row/column info
77+
# plot_rec_box_with_logic_info(
78+
# img_path, f"{output_dir}/table_rec_box.jpg", logic_points, polygons
79+
# )
80+
# # Visualize OCR recognition boxes
81+
# plot_rec_box(img_path, f"{output_dir}/ocr_box.jpg", ocr_res)
82+
```
3283

33-
Physical structure: includes not only the logical structure, but also the cell's bounding box, content and other information, emphasizing the physical location of the cell.
84+
## FAQ (Frequently Asked Questions)
3485

35-
<div align='center'>
36-
<img src="https://github.com/RapidAI/TableStructureRec/releases/download/v0.0.0/TSRFramework.jpg" width=70%>
37-
</div>
86+
1. **Q: Can skewed images be handled?**
87+
- A: The project does not currently support recognizing skewed images; please deskew the image first. PRs that address this are welcome.
3888

39-
Figure from: [Improving Table Structure Recognition with Visual-Alignment Sequential Coordinate Modeling](https://openaccess.thecvf.com/content/CVPR2023/html/Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper.html)
89+
2. **Q: The recognized boxes are missing the text inside them.**
90+
- A: The small RapidOCR model is used by default. For higher accuracy, download a higher-accuracy OCR model from the [model list](https://rapidai.github.io/RapidOCRDocs/model_list/#_1)
91+
and pass in ocr_result at run time.
92+
93+
3. **Q: Do the models support GPU acceleration?**
94+
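- A: Table model inference is already very fast: on the order of 100 ms for wired tables and 500 ms for lineless tables.
95+
Most of the time is spent in the OCR stage; see [rapidocr_paddle](https://rapidai.github.io/RapidOCRDocs/install_usage/rapidocr_paddle/usage/#_3) to speed up OCR.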
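
The second FAQ answer mentions passing `ocr_result` at run time. A minimal sketch of that pattern follows, under two assumptions: the OCR engine is a callable returning `(ocr_result, elapsed)`, and the table engine accepts an `ocr_result` keyword as the FAQ states. The function `recognize_with_custom_ocr` and both stand-in engines are hypothetical, used only so the example runs without any models installed.

```python
from typing import Any, Callable, Tuple


def recognize_with_custom_ocr(
    img_path: str,
    table_engine: Callable[..., Any],
    ocr_engine: Callable[[str], Tuple[Any, Any]],
) -> Any:
    """Run a caller-supplied OCR engine once, then pass its result to the table engine."""
    # Assumption: ocr_engine(img_path) returns (ocr_result, elapsed).
    ocr_result, _elapsed = ocr_engine(img_path)
    # `ocr_result` is the keyword named in the FAQ answer.
    return table_engine(img_path, ocr_result=ocr_result)


# Hypothetical stand-ins for a real OCR engine and a real table engine.
def fake_ocr(path: str):
    boxes = [[[[0, 0], [10, 0], [10, 5], [0, 5]], "cell text", 0.99]]
    return boxes, 0.01


def fake_table_engine(path: str, ocr_result=None):
    return f"<table><tr><td>{ocr_result[0][1]}</td></tr></table>"


print(recognize_with_custom_ocr("images/img14.jpg", fake_table_engine, fake_ocr))
```

In real use, the stand-ins would be replaced by an OCR engine loaded with the higher-accuracy model and by one of the table engines from the Quick Start section.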
4096

41-
### Documentation
97+
### TODO List
98+
- [ ] Skew correction for images before recognition
99+
- [ ] Enlarge the dataset and add more benchmark comparisons
100+
- [ ] Optimize the lineless table model
42101

43-
Full documentation can be found on [docs](https://rapidai.github.io/TableStructureRec/docs/), in Chinese.
102+
### Processing Pipeline
103+
```mermaid
104+
flowchart TD
105+
A[/Table image/] --> B([Table classification])
106+
B --> C([Wired table recognition]) & D([Lineless table recognition]) --> E([Text recognition rapidocr_onnxruntime])
107+
E --> F[/Structured HTML output/]
108+
```
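
The flowchart can be read as a classify-then-route step. The following sketch expresses that step with injected callables so it runs without any models. The function `recognize_table` and the three stub functions are hypothetical; the `'wired'` label and the `(label, elapsed)` return shape are assumptions taken from the Quick Start example.

```python
from typing import Any, Callable, Tuple

Engine = Callable[[str], Any]


def recognize_table(
    img_path: str,
    classify: Callable[[str], Tuple[str, float]],
    wired: Engine,
    lineless: Engine,
) -> Tuple[str, Any]:
    """Classify the table image, then dispatch it to the matching recognizer."""
    label, _elapsed = classify(img_path)
    # Anything not labelled 'wired' falls back to the lineless engine,
    # mirroring the if/else in the Quick Start example.
    engine = wired if label == "wired" else lineless
    return label, engine(img_path)


# Hypothetical stand-ins for TableCls, WiredTableRecognition and LinelessTableRecognition.
def stub_classify(path: str) -> Tuple[str, float]:
    return ("wired" if "wired" in path else "lineless", 0.0)


def stub_wired(path: str) -> str:
    return "<table><!-- wired --></table>"


def stub_lineless(path: str) -> str:
    return "<table><!-- lineless --></table>"


print(recognize_table("wired_demo.jpg", stub_classify, stub_wired, stub_lineless))
print(recognize_table("plain_demo.jpg", stub_classify, stub_wired, stub_lineless))
```

Passing the engines in as arguments keeps the routing logic testable on its own; with the real packages, the three stubs would be replaced by the engine instances created in the Quick Start example.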
44109

45-
### Acknowledgements
110+
### Acknowledgements
46111

47-
[PaddleOCR Table](https://github.com/PaddlePaddle/PaddleOCR/blob/4b17511491adcfd0f3e2970895d06814d1ce56cc/ppstructure/table/README_ch.md)
112+
[PaddleOCR Table Recognition](https://github.com/PaddlePaddle/PaddleOCR/blob/4b17511491adcfd0f3e2970895d06814d1ce56cc/ppstructure/table/README_ch.md)
48113

49-
[Cycle CenterNet](https://www.modelscope.cn/models/damo/cv_dla34_table-structure-recognition_cycle-centernet/summary)
114+
[Duguang Table Structure Recognition - Wired Tables](https://www.modelscope.cn/models/damo/cv_dla34_table-structure-recognition_cycle-centernet/summary)
50115

51-
[LORE](https://www.modelscope.cn/models/damo/cv_resnet-transformer_table-structure-recognition_lore/summary)
116+
[Duguang Table Structure Recognition - Lineless Tables](https://www.modelscope.cn/models/damo/cv_resnet-transformer_table-structure-recognition_lore/summary)
52117

53-
### Contributing
118+
[Qanything-RAG](https://github.com/netease-youdao/QAnything)
54119

55-
Pull requests are welcome. For major changes, please open an issue first
56-
to discuss what you would like to change.
120+
Many thanks to llaipython (WeChat; offers a full paid high-accuracy table extraction service) for providing the high-accuracy wired table model.
57121

58-
Please make sure to update tests as appropriate.
122+
### Contributing
59123

60-
### [Sponsor](https://rapidai.github.io/Knowledge-QA-LLM/docs/sponsor/)
124+
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
61125

62-
If you want to sponsor the project, you can directly click the **Buy me a coffee** image, please write a note (e.g. your github account name) to facilitate adding to the sponsorship list below.
126+
Please make sure to update tests as appropriate.
63127

64-
<div align="left">
65-
<a href="https://www.buymeacoffee.com/SWHL"><img src="https://raw.githubusercontent.com/RapidAI/.github/main/assets/buymeacoffe.png" width="30%" height="30%"></a>
66-
</div>
128+
### [Sponsor](https://rapidai.github.io/Knowledge-QA-LLM/docs/sponsor/)
129+
130+
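If you would like to sponsor the project, click the Sponsor button at the top of this page and include a note (**your GitHub account name**) so that you can be added to the sponsor list.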
67131

68-
### License
132+
### License
69133

70-
This project is released under the [Apache 2.0 license](https://github.com/RapidAI/TableStructureRec/blob/c41bbd23898cb27a957ed962b0ffee3c74dfeff1/LICENSE).
134+
This project is released under the [Apache 2.0](https://github.com/RapidAI/TableStructureRec/blob/c41bbd23898cb27a957ed962b0ffee3c74dfeff1/LICENSE) open-source license.
