Commit d04f3f22 authored by liukaiwen

# feat(model inference): add table recognition and conversion to LaTeX

# What's Changed

### New Features

- Add table content recognition. We use the weights of [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) to convert table images to LaTeX.

### Instructions

- Install the new dependencies: `pip install pypandoc struct-eqtable==0.1.0`
- Download the [StructEqTable weights](https://huggingface.co/wanderkid/PDF-Extract-Kit/tree/main/models/TabRec) and put them under the models/ directory.
- Edit the 'table-mode' value to turn on table recognition, which is turned off by default.
- If you have not downloaded any models before, refer to [how to download models](docs/how_to_download_models_zh_cn.md).
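
The 'table-mode' switch mentioned above could be flipped from a script. The following is a minimal sketch only: it assumes the configuration is a JSON file and that 'table-mode' is a top-level boolean key, neither of which this commit message confirms. The helper name `set_table_mode` and the config path are hypothetical.

```python
import json
from pathlib import Path


def set_table_mode(config_path, enabled=True):
    """Set the 'table-mode' key in a JSON config file and return the updated config.

    Assumption: the config is JSON and 'table-mode' is a top-level boolean.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["table-mode"] = enabled  # all other keys are left untouched
    path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return config
```

Calling `set_table_mode("path/to/your-config.json")` would turn the feature on; passing `enabled=False` restores the default described above.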
parent c98e7b98
@@ -91,6 +91,7 @@ MinerU was born in the pre-training of [书生-浦语](https://github.com/InternLM/InternLM) ...
 - Preserves the structure of the original document, including headings, paragraphs, lists, etc.
 - Extracts images, image captions, tables, and table captions
 - Automatically recognizes formulas in the document and converts them to LaTeX
+- Automatically recognizes tables in the document and converts them to LaTeX
 - Automatically detects garbled PDFs and enables OCR
 - Supports CPU and GPU environments
 - Supports Windows/Linux/Mac platforms
@@ -235,7 +236,7 @@ TODO
 - [ ] List recognition in body text
 - [ ] Code block recognition in body text
 - [ ] Table of contents recognition
-- [ ] Table recognition
+- [x] Table recognition
 - [ ] Chemical formula recognition
 - [ ] Geometric figure recognition
@@ -270,6 +271,7 @@ The project currently leverages PyMuPDF to deliver advanced functionalities; how ...
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
 - [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
 - [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
+- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
 # Citation
......
@@ -73,5 +73,15 @@ git clone https://www.modelscope.cn/wanderkid/PDF-Extract-Kit.git
 │   ├── README.md
 │   ├── tokenizer_config.json
 │   └── tokenizer.json
+│── TabRec
+│   └─StructEqTable
+│       ├── config.json
+│       ├── generation_config.json
+│       ├── model.safetensors
+│       ├── preprocessor_config.json
+│       ├── special_tokens_map.json
+│       ├── spiece.model
+│       ├── tokenizer.json
+│       └── tokenizer_config.json
 └── README.md
``` ```
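
After downloading, the directory layout listed in this hunk can be checked before running inference. This is a rough sketch, not part of the commit: the helper `missing_table_weights` is hypothetical, and the file list is copied from the tree above.

```python
from pathlib import Path

# File names taken from the TabRec/StructEqTable listing in the diff above.
EXPECTED_FILES = (
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "preprocessor_config.json",
    "special_tokens_map.json",
    "spiece.model",
    "tokenizer.json",
    "tokenizer_config.json",
)


def missing_table_weights(models_dir, subdir="TabRec/StructEqTable"):
    """Return the expected weight files that are absent under models_dir/subdir."""
    weights_dir = Path(models_dir) / subdir
    return [name for name in EXPECTED_FILES if not (weights_dir / name).is_file()]
```

An empty list means every file from the listing is present; any names returned point at an incomplete download.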
@@ -253,9 +253,8 @@ def para_to_standard_format_v2(para_block, img_buket_path, page_idx):
         }
         for block in para_block['blocks']:
             if block['type'] == BlockType.TableBody:
-                #TODO
                 if block["lines"][0]["spans"][0].get('content', ''):
-                    para_content['table_body'] = f"\n {block['lines'][0]['spans'][0]['content']} \n"
+                    para_content['table_body'] = f"\n\n$\n {block['lines'][0]['spans'][0]['content']}\n$\n\n"
                 para_content['img_path'] = join_path(img_buket_path, block["lines"][0]["spans"][0]['image_path'])
             if block['type'] == BlockType.TableCaption:
                 para_content['table_caption'] = merge_para_with_text(block)
......
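
The change in this hunk wraps the recognized table content in `$` delimiters, each on its own line, with blank lines around the block. The standalone sketch below reproduces only that string formatting and the emptiness guard; the helper names `format_table_body` and `table_body_from_block` are hypothetical, and the block is a plain dict standing in for the project's real data structure.

```python
def format_table_body(latex):
    """Wrap table LaTeX the way the new f-string in the diff does."""
    return f"\n\n$\n {latex}\n$\n\n"


def table_body_from_block(block):
    """Return the formatted table body, or None when the first span has no content."""
    span = block["lines"][0]["spans"][0]
    content = span.get("content", "")  # same guard as in the diff
    return format_table_body(content) if content else None


# Illustrative input only; real blocks come from the layout pipeline.
example_block = {
    "lines": [{"spans": [{"content": r"\begin{tabular}{c}a\end{tabular}",
                          "image_path": "table.jpg"}]}]
}
table_body = table_body_from_block(example_block)
```

Compared with the old `"\n {content} \n"` form, the delimiters mark the content as a math block for downstream Markdown rendering.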
@@ -8,4 +8,4 @@ weights:
   layout: Layout/model_final.pth
   mfd: MFD/weights.pt
   mfr: MFR/UniMERNet
-  table: Table/
\ No newline at end of file
+  table: TabRec/StructEqTable
\ No newline at end of file
@@ -13,4 +13,5 @@ scikit-learn
 tqdm
 htmltabletomd
 pypandoc
-pyopenssl==24.0.0
\ No newline at end of file
+pyopenssl==24.0.0
+struct-eqtable==0.1.0
\ No newline at end of file
@@ -8,4 +8,6 @@ fast-langdetect==0.2.0
 wordninja>=2.0.0
 scikit-learn>=1.0.2
 pdfminer.six==20231228
+pypandoc
+struct-eqtable==0.1.0
 # The requirements.txt must ensure that only necessary external dependencies are introduced. If there are new dependencies to add, please contact the project administrator.
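
Since this commit pins `struct-eqtable==0.1.0` and adds `pypandoc`, it can be useful to confirm what is actually installed in the environment. The sketch below uses the standard-library `importlib.metadata`; the helper `pin_status` is hypothetical and not part of the project.

```python
from importlib.metadata import PackageNotFoundError, version


def pin_status(package, pinned=None):
    """Report whether a distribution is installed and matches an optional pin."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return "missing"
    if pinned is not None and installed != pinned:
        return f"mismatch: installed {installed}, pinned {pinned}"
    return "ok"


# Pins taken from the requirements hunks above; pypandoc is unpinned there.
statuses = {
    "struct-eqtable": pin_status("struct-eqtable", "0.1.0"),
    "pypandoc": pin_status("pypandoc"),
}
```

Each entry in `statuses` is `"ok"`, `"missing"`, or a mismatch message, depending on the local environment.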