Commit 1dc915a4, authored by yyy, committed by GitHub (unverified)

release: release 0.7.1 version (#526)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>
parent 7f0fe200
...@@ -30,6 +30,7 @@ ...@@ -30,6 +30,7 @@
</div> </div>
# Changelog # Changelog
- 2024/08/30: Version 0.7.1 released, added a Paddle TableMaster table recognition option
- 2024/08/09: Version 0.7.0b1 released, simplified installation process, added table recognition functionality - 2024/08/09: Version 0.7.0b1 released, simplified installation process, added table recognition functionality
- 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation - 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation
- 2024/07/05: Initial open-source release - 2024/07/05: Initial open-source release
...@@ -171,7 +172,7 @@ In non-mainline environments, due to the diversity of hardware and software conf ...@@ -171,7 +172,7 @@ In non-mainline environments, due to the diversity of hardware and software conf
```bash ```bash
conda create -n MinerU python=3.10 conda create -n MinerU python=3.10
conda activate MinerU conda activate MinerU
pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
``` ```
#### 2. Download model weight files #### 2. Download model weight files
...@@ -200,6 +201,7 @@ Find the `magic-pdf.json` file in your user directory and configure the "models- ...@@ -200,6 +201,7 @@ Find the `magic-pdf.json` file in your user directory and configure the "models-
// other config // other config
"models-dir": "D:/models", "models-dir": "D:/models",
"table-config": { "table-config": {
"model": "TableMaster", // The other supported value is 'struct_eqtable'
"is_table_recog_enable": false, // Table recognition is disabled by default, modify this value to enable it "is_table_recog_enable": false, // Table recognition is disabled by default, modify this value to enable it
"max_time": 400 "max_time": 400
} }
...@@ -311,13 +313,7 @@ TODO ...@@ -311,13 +313,7 @@ TODO
- Comic books, art books, elementary school textbooks, and exercise books are not well-parsed yet - Comic books, art books, elementary school textbooks, and exercise books are not well-parsed yet
- Enabling OCR may produce better results in PDFs with a high density of formulas - Enabling OCR may produce better results in PDFs with a high density of formulas
- If you are processing PDFs with a large number of formulas, it is strongly recommended to enable the OCR function. When using PyMuPDF to extract text, overlapping text lines can occur, leading to inaccurate formula insertion positions. - If you are processing PDFs with a large number of formulas, it is strongly recommended to enable the OCR function. When using PyMuPDF to extract text, overlapping text lines can occur, leading to inaccurate formula insertion positions.
- **Table Recognition** is currently in the testing phase; recognition speed is slow, and accuracy needs improvement. Below are some performance test results in an Ubuntu 22.04 LTS + Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz + NVIDIA GeForce RTX 4090 environment for reference.
| Table Size | Parsing Time |
|---------------|----------------------------|
| 6\*5 55kb | 37s |
| 16\*12 284kb | 3m18s |
| 44\*7 559kb | 4m12s |
# FAQ # FAQ
[FAQ in Chinese](docs/FAQ_zh_cn.md) [FAQ in Chinese](docs/FAQ_zh_cn.md)
......
...@@ -116,13 +116,13 @@ pip install detectron2 --extra-index-url https://wheels.myhloli.com ...@@ -116,13 +116,13 @@ pip install detectron2 --extra-index-url https://wheels.myhloli.com
>CUDA/MPSによる加速については、[CUDAまたはMPSによる加速](#4-CUDAまたはMPSによる加速)を参照してください。 >CUDA/MPSによる加速については、[CUDAまたはMPSによる加速](#4-CUDAまたはMPSによる加速)を参照してください。
```bash ```bash
pip install magic-pdf[full]==0.6.2b1 pip install -U magic-pdf[full]
``` ```
> ❗️❗️❗️ > ❗️❗️❗️
> 私たちは0.6.2 ベータ版を事前にリリースし、私たちのログに記載されている多くの問題に対処しました。しかし、このビルドはまだ完全なQAテストを経ておらず、最終的なリリース品質を表していません。問題に遭遇した場合は、問題を通じて速やかに報告するか、0.6.1バージョンに戻ることをお願いします。 > 私たちは0.6.2 ベータ版を事前にリリースし、私たちのログに記載されている多くの問題に対処しました。しかし、このビルドはまだ完全なQAテストを経ておらず、最終的なリリース品質を表していません。問題に遭遇した場合は、問題を通じて速やかに報告するか、0.6.1バージョンに戻ることをお願いします。
> ```bash > ```bash
> pip install magic-pdf[full-cpu]==0.6.1 > pip install -U magic-pdf[full]
> ``` > ```
......
...@@ -33,6 +33,7 @@ ...@@ -33,6 +33,7 @@
# 更新记录 # 更新记录
- 2024/08/30 0.7.1发布,集成了paddle tablemaster表格识别功能
- 2024/08/09 0.7.0b1发布,简化安装步骤提升易用性,加入表格识别功能 - 2024/08/09 0.7.0b1发布,简化安装步骤提升易用性,加入表格识别功能
- 2024/08/01 0.6.2b1发布,优化了依赖冲突问题和安装文档 - 2024/08/01 0.6.2b1发布,优化了依赖冲突问题和安装文档
- 2024/07/05 首次开源 - 2024/07/05 首次开源
...@@ -179,7 +180,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c ...@@ -179,7 +180,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
```bash ```bash
conda create -n MinerU python=3.10 conda create -n MinerU python=3.10
conda activate MinerU conda activate MinerU
pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
``` ```
#### 2. 下载模型权重文件 #### 2. 下载模型权重文件
...@@ -208,6 +209,7 @@ cp magic-pdf.template.json ~/magic-pdf.json ...@@ -208,6 +209,7 @@ cp magic-pdf.template.json ~/magic-pdf.json
// other config // other config
"models-dir": "D:/models", "models-dir": "D:/models",
"table-config": { "table-config": {
"model": "TableMaster", // 使用structEqTable请修改为'struct_eqtable'
"is_table_recog_enable": false, // 表格识别功能默认是关闭的,如果需要修改此处的值 "is_table_recog_enable": false, // 表格识别功能默认是关闭的,如果需要修改此处的值
"max_time": 400 "max_time": 400
} }
...@@ -321,14 +323,6 @@ TODO ...@@ -321,14 +323,6 @@ TODO
- 漫画书、艺术图册、小学教材、习题尚不能很好解析 - 漫画书、艺术图册、小学教材、习题尚不能很好解析
- 在一些公式密集的PDF上强制启用OCR效果会更好 - 在一些公式密集的PDF上强制启用OCR效果会更好
- 如果您要处理包含大量公式的pdf,强烈建议开启OCR功能。使用pymuPDF提取文字的时候会出现文本行互相重叠的情况导致公式插入位置不准确。 - 如果您要处理包含大量公式的pdf,强烈建议开启OCR功能。使用pymuPDF提取文字的时候会出现文本行互相重叠的情况导致公式插入位置不准确。
- **表格识别**目前处于测试阶段,识别速度较慢,识别准确度有待提升。以下是我们在Ubuntu 22.04 LTS + Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz + NVIDIA GeForce RTX 4090环境下的一些性能测试结果,可供参考。
| 表格大小 | 解析耗时 |
|---------------|----------------------------|
| 6\*5 55kb | 37s |
| 16\*12 284kb | 3m18s |
| 44\*7 559kb | 4m12s |
# FAQ # FAQ
......
...@@ -48,7 +48,7 @@ ...@@ -48,7 +48,7 @@
### 5. Install Applications ### 5. Install Applications
```sh ```sh
pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
``` ```
❗ After installation, make sure to check the version of `magic-pdf` using the following command: ❗ After installation, make sure to check the version of `magic-pdf` using the following command:
```sh ```sh
......
...@@ -43,7 +43,7 @@ conda activate MinerU ...@@ -43,7 +43,7 @@ conda activate MinerU
``` ```
## 5. 安装应用 ## 5. 安装应用
```bash ```bash
pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
``` ```
> ❗️下载完成后,务必通过以下命令确认magic-pdf的版本是否正确 > ❗️下载完成后,务必通过以下命令确认magic-pdf的版本是否正确
> >
......
...@@ -19,7 +19,7 @@ Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86 ...@@ -19,7 +19,7 @@ Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86
### 4. Install Applications ### 4. Install Applications
``` ```
pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
``` ```
>❗️After installation, verify the version of `magic-pdf`: >❗️After installation, verify the version of `magic-pdf`:
> ```bash > ```bash
......
...@@ -20,7 +20,7 @@ conda activate MinerU ...@@ -20,7 +20,7 @@ conda activate MinerU
``` ```
## 4. 安装应用 ## 4. 安装应用
```bash ```bash
pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
``` ```
> ❗️下载完成后,务必通过以下命令确认magic-pdf的版本是否正确 > ❗️下载完成后,务必通过以下命令确认magic-pdf的版本是否正确
> >
......
...@@ -44,6 +44,21 @@ The structure of the model folder is as follows, including configuration files a ...@@ -44,6 +44,21 @@ The structure of the model folder is as follows, including configuration files a
│ ├── spiece.model │ ├── spiece.model
│ ├── tokenizer.json │ ├── tokenizer.json
│ └── tokenizer_config.json │ └── tokenizer_config.json
│ └─ TableMaster
│ └─ ch_PP-OCRv3_det_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ └─ ch_PP-OCRv3_rec_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ └─ table_structure_tablemaster_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ ├── ppocr_keys_v1.txt
│ └── table_master_structure_dict.txt
└── README.md └── README.md
``` ```
#### 2. Check whether the model file is fully downloaded. #### 2. Check whether the model file is fully downloaded.
......
...@@ -74,6 +74,21 @@ print(f"模型文件下载路径为:{model_dir}/models") ...@@ -74,6 +74,21 @@ print(f"模型文件下载路径为:{model_dir}/models")
│ ├── spiece.model │ ├── spiece.model
│ ├── tokenizer.json │ ├── tokenizer.json
│ └── tokenizer_config.json │ └── tokenizer_config.json
│ └─ TableMaster
│ └─ ch_PP-OCRv3_det_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ └─ ch_PP-OCRv3_rec_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ └─ table_structure_tablemaster_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ ├── ppocr_keys_v1.txt
│ └── table_master_structure_dict.txt
└── README.md └── README.md
``` ```
......
...@@ -6,6 +6,7 @@ ...@@ -6,6 +6,7 @@
"models-dir":"/tmp/models", "models-dir":"/tmp/models",
"device-mode":"cpu", "device-mode":"cpu",
"table-config": { "table-config": {
"model": "TableMaster",
"is_table_recog_enable": false, "is_table_recog_enable": false,
"max_time": 400 "max_time": 400
} }
......
...@@ -132,6 +132,8 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout, mode, img_buket_path=""): ...@@ -132,6 +132,8 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout, mode, img_buket_path=""):
# if processed by table model # if processed by table model
if span.get('latex', ''): if span.get('latex', ''):
para_text += f"\n\n$\n {span['latex']}\n$\n\n" para_text += f"\n\n$\n {span['latex']}\n$\n\n"
elif span.get('html', ''):
para_text += f"\n\n{span['html']}\n\n"
else: else:
para_text += f"\n![{table_caption}]({join_path(img_buket_path, span['image_path'])}) \n" para_text += f"\n![{table_caption}]({join_path(img_buket_path, span['image_path'])}) \n"
for block in para_block['blocks']: # 3rd.拼table_footnote for block in para_block['blocks']: # 3rd.拼table_footnote
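The hunk above inserts an HTML branch between the existing LaTeX branch and the image fallback. A minimal standalone sketch of that precedence follows, assuming a span is a plain dict. `render_table_span` and its parameters are hypothetical names, and `posixpath.join` stands in for the project's `join_path` helper, whose exact behavior is not shown in this diff.

```python
import posixpath


def render_table_span(span: dict, img_dir: str = "", caption: str = "") -> str:
    """Return the Markdown fragment for one table span (illustrative only)."""
    if span.get("latex", ""):
        # LaTeX result (struct_eqtable path): wrap it in a math block.
        return f"\n\n$\n {span['latex']}\n$\n\n"
    if span.get("html", ""):
        # HTML result (TableMaster path): embed the markup unchanged.
        return f"\n\n{span['html']}\n\n"
    # No model result: fall back to the cropped table image.
    return f"\n![{caption}]({posixpath.join(img_dir, span['image_path'])})  \n"
```

Because LaTeX is checked first, a span that somehow carried both keys would still render as LaTeX; the diff only ever sets one of them.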
...@@ -256,6 +258,8 @@ def para_to_standard_format_v2(para_block, img_buket_path, page_idx): ...@@ -256,6 +258,8 @@ def para_to_standard_format_v2(para_block, img_buket_path, page_idx):
if block['type'] == BlockType.TableBody: if block['type'] == BlockType.TableBody:
if block["lines"][0]["spans"][0].get('latex', ''): if block["lines"][0]["spans"][0].get('latex', ''):
para_content['table_body'] = f"\n\n$\n {block['lines'][0]['spans'][0]['latex']}\n$\n\n" para_content['table_body'] = f"\n\n$\n {block['lines'][0]['spans'][0]['latex']}\n$\n\n"
elif block["lines"][0]["spans"][0].get('html', ''):
para_content['table_body'] = f"\n\n{block['lines'][0]['spans'][0]['html']}\n\n"
para_content['img_path'] = join_path(img_buket_path, block["lines"][0]["spans"][0]['image_path']) para_content['img_path'] = join_path(img_buket_path, block["lines"][0]["spans"][0]['image_path'])
if block['type'] == BlockType.TableCaption: if block['type'] == BlockType.TableCaption:
para_content['table_caption'] = merge_para_with_text(block) para_content['table_caption'] = merge_para_with_text(block)
......
...@@ -10,5 +10,31 @@ block维度自定义字段 ...@@ -10,5 +10,31 @@ block维度自定义字段
# block中lines是否被删除 # block中lines是否被删除
LINES_DELETED = "lines_deleted" LINES_DELETED = "lines_deleted"
# struct eqtable
STRUCT_EQTABLE = "struct_eqtable"
# table recognition max time default value # table recognition max time default value
TABLE_MAX_TIME_VALUE = 400 TABLE_MAX_TIME_VALUE = 400
\ No newline at end of file
# pp_table_result_max_length
TABLE_MAX_LEN = 480
# pp table structure algorithm
TABLE_MASTER = "TableMaster"
# table master structure dict
TABLE_MASTER_DICT = "table_master_structure_dict.txt"
# table master dir
TABLE_MASTER_DIR = "table_structure_tablemaster_infer/"
# pp detect model dir
DETECT_MODEL_DIR = "ch_PP-OCRv3_det_infer"
# pp rec model dir
REC_MODEL_DIR = "ch_PP-OCRv3_rec_infer"
# pp rec char dict path
REC_CHAR_DICT = "ppocr_keys_v1.txt"
__version__ = "0.7.0b1" __version__ = "0.7.1"
...@@ -562,8 +562,11 @@ class MagicModel: ...@@ -562,8 +562,11 @@ class MagicModel:
elif category_id == 5: elif category_id == 5:
# 获取table模型结果 # 获取table模型结果
latex = layout_det.get("latex", None) latex = layout_det.get("latex", None)
html = layout_det.get("html", None)
if latex: if latex:
span["latex"] = latex span["latex"] = latex
elif html:
span["html"] = html
span["type"] = ContentType.Table span["type"] = ContentType.Table
elif category_id == 13: elif category_id == 13:
span["content"] = layout_det["latex"] span["content"] = layout_det["latex"]
......
...@@ -2,7 +2,7 @@ from loguru import logger ...@@ -2,7 +2,7 @@ from loguru import logger
import os import os
import time import time
from magic_pdf.libs.Constants import TABLE_MAX_TIME_VALUE from magic_pdf.libs.Constants import *
os.environ['NO_ALBUMENTATIONS_UPDATE'] = '1' # 禁止albumentations检查更新 os.environ['NO_ALBUMENTATIONS_UPDATE'] = '1' # 禁止albumentations检查更新
try: try:
...@@ -34,10 +34,18 @@ from magic_pdf.model.pek_sub_modules.layoutlmv3.model_init import Layoutlmv3_Pre ...@@ -34,10 +34,18 @@ from magic_pdf.model.pek_sub_modules.layoutlmv3.model_init import Layoutlmv3_Pre
from magic_pdf.model.pek_sub_modules.post_process import get_croped_image, latex_rm_whitespace from magic_pdf.model.pek_sub_modules.post_process import get_croped_image, latex_rm_whitespace
from magic_pdf.model.pek_sub_modules.self_modify import ModifiedPaddleOCR from magic_pdf.model.pek_sub_modules.self_modify import ModifiedPaddleOCR
from magic_pdf.model.pek_sub_modules.structeqtable.StructTableModel import StructTableModel from magic_pdf.model.pek_sub_modules.structeqtable.StructTableModel import StructTableModel
from magic_pdf.model.ppTableModel import ppTableModel
def table_model_init(model_path, max_time, _device_='cpu'):
table_model = StructTableModel(model_path, max_time=max_time, device=_device_) def table_model_init(table_model_type, model_path, max_time, _device_='cpu'):
if table_model_type == STRUCT_EQTABLE:
table_model = StructTableModel(model_path, max_time=max_time, device=_device_)
else:
config = {
"model_dir": model_path,
"device": _device_
}
table_model = ppTableModel(config)
return table_model return table_model
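The new `table_model_init` dispatches on the model name and treats anything other than `struct_eqtable` as TableMaster. The sketch below reproduces only that control flow. `FakeStructTable` and `FakePaddleTable` are hypothetical stand-ins so the example runs without any model weights; the two constants are copied from the values this commit adds to `Constants`.

```python
from dataclasses import dataclass

STRUCT_EQTABLE = "struct_eqtable"  # value taken from the Constants hunk
TABLE_MASTER = "TableMaster"       # value taken from the Constants hunk


@dataclass
class FakeStructTable:
    """Stand-in for StructTableModel; loads nothing."""
    model_path: str
    max_time: int
    device: str


@dataclass
class FakePaddleTable:
    """Stand-in for ppTableModel; loads nothing."""
    config: dict


def table_model_init(table_model_type, model_path, max_time, _device_="cpu"):
    if table_model_type == STRUCT_EQTABLE:
        return FakeStructTable(model_path, max_time=max_time, device=_device_)
    # Every other value, including a misspelled name, takes the Paddle branch.
    # Note that max_time is not forwarded on this branch, as in the diff.
    return FakePaddleTable({"model_dir": model_path, "device": _device_})
```

In the committed code, the caller first looks the name up in the `weights` mapping, so an unknown name would most likely fail at that lookup before reaching the `else` branch.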
...@@ -104,9 +112,11 @@ class CustomPEKModel: ...@@ -104,9 +112,11 @@ class CustomPEKModel:
# 初始化解析配置 # 初始化解析配置
self.apply_layout = kwargs.get("apply_layout", self.configs["config"]["layout"]) self.apply_layout = kwargs.get("apply_layout", self.configs["config"]["layout"])
self.apply_formula = kwargs.get("apply_formula", self.configs["config"]["formula"]) self.apply_formula = kwargs.get("apply_formula", self.configs["config"]["formula"])
# table config
self.table_config = kwargs.get("table_config", self.configs["config"]["table_config"]) self.table_config = kwargs.get("table_config", self.configs["config"]["table_config"])
self.apply_table = self.table_config.get("is_table_recog_enable", False) self.apply_table = self.table_config.get("is_table_recog_enable", False)
self.table_max_time = self.table_config.get("max_time", TABLE_MAX_TIME_VALUE) self.table_max_time = self.table_config.get("max_time", TABLE_MAX_TIME_VALUE)
self.table_model_type = self.table_config.get("model", TABLE_MASTER)
self.apply_ocr = ocr self.apply_ocr = ocr
logger.info( logger.info(
"DocAnalysis init, this may take some times. apply_layout: {}, apply_formula: {}, apply_ocr: {}, apply_table: {}".format( "DocAnalysis init, this may take some times. apply_layout: {}, apply_formula: {}, apply_ocr: {}, apply_table: {}".format(
...@@ -141,10 +151,11 @@ class CustomPEKModel: ...@@ -141,10 +151,11 @@ class CustomPEKModel:
if self.apply_ocr: if self.apply_ocr:
self.ocr_model = ModifiedPaddleOCR(show_log=show_log) self.ocr_model = ModifiedPaddleOCR(show_log=show_log)
# init structeqtable # init table model
if self.apply_table: if self.apply_table:
self.table_model = table_model_init(str(os.path.join(models_dir, self.configs["weights"]["table"])), table_model_dir = self.configs["weights"][self.table_model_type]
max_time = self.table_max_time, _device_=self.device) self.table_model = table_model_init(self.table_model_type, str(os.path.join(models_dir, table_model_dir)),
max_time=self.table_max_time, _device_=self.device)
logger.info('DocAnalysis init done!') logger.info('DocAnalysis init done!')
def __call__(self, image): def __call__(self, image):
...@@ -278,16 +289,28 @@ class CustomPEKModel: ...@@ -278,16 +289,28 @@ class CustomPEKModel:
new_image, _ = crop_img(res, pil_img) new_image, _ = crop_img(res, pil_img)
single_table_start_time = time.time() single_table_start_time = time.time()
logger.info("------------------table recognition processing begins-----------------") logger.info("------------------table recognition processing begins-----------------")
latex_code = None
html_code = None
with torch.no_grad(): with torch.no_grad():
latex_code = self.table_model.image2latex(new_image)[0] if self.table_model_type == STRUCT_EQTABLE:
latex_code = self.table_model.image2latex(new_image)[0]
else:
html_code = self.table_model.img2html(new_image)
run_time = time.time() - single_table_start_time run_time = time.time() - single_table_start_time
logger.info(f"------------table recognition processing ends within {run_time}s-----") logger.info(f"------------table recognition processing ends within {run_time}s-----")
if run_time > self.table_max_time: if run_time > self.table_max_time:
logger.warning(f"------------table recognition processing exceeds max time {self.table_max_time}s----------") logger.warning(f"------------table recognition processing exceeds max time {self.table_max_time}s----------")
# 判断是否返回正常 # 判断是否返回正常
expected_ending = latex_code.strip().endswith('end{tabular}') or latex_code.strip().endswith('end{table}')
if latex_code and expected_ending: if latex_code:
res["latex"] = latex_code expected_ending = latex_code.strip().endswith('end{tabular}') or latex_code.strip().endswith(
'end{table}')
if expected_ending:
res["latex"] = latex_code
else:
logger.warning(f"------------table recognition processing fails----------")
elif html_code:
res["html"] = html_code
else: else:
logger.warning(f"------------table recognition processing fails----------") logger.warning(f"------------table recognition processing fails----------")
table_cost = round(time.time() - table_start, 2) table_cost = round(time.time() - table_start, 2)
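The reworked validation can be read as a small decision function: a LaTeX result is accepted only if it ends with a closing `tabular` or `table` environment, an HTML result is accepted whenever it is non-empty, and everything else counts as a failure. A sketch under that reading, where `pick_table_result` is a hypothetical helper name and not part of the commit:

```python
def pick_table_result(latex_code, html_code):
    """Return (key, value) to store on the detection result, or None on failure."""
    if latex_code:
        stripped = latex_code.strip()
        if stripped.endswith("end{tabular}") or stripped.endswith("end{table}"):
            return ("latex", latex_code)
        # Truncated or malformed LaTeX is rejected rather than stored.
        return None
    if html_code:
        return ("html", html_code)
    return None
```

The guard on `latex_code` matters: the removed line called `latex_code.strip()` unconditionally, which would raise `AttributeError` once `latex_code` can be `None` on the TableMaster path.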
......
...@@ -12,7 +12,6 @@ class StructTableModel: ...@@ -12,7 +12,6 @@ class StructTableModel:
self.model = StructTable(self.model_path, self.max_new_tokens, self.max_time) self.model = StructTable(self.model_path, self.max_new_tokens, self.max_time)
def image2latex(self, image) -> str: def image2latex(self, image) -> str:
#
table_latex = self.model.forward(image) table_latex = self.model.forward(image)
return table_latex return table_latex
......
from paddleocr.ppstructure.table.predict_table import TableSystem
from paddleocr.ppstructure.utility import init_args
from magic_pdf.libs.Constants import *
import os
from PIL import Image
import numpy as np


class ppTableModel(object):
    """
    This class is responsible for converting image of table into HTML format using a pre-trained model.

    Attributes:
    - table_sys: An instance of TableSystem initialized with parsed arguments.

    Methods:
    - __init__(config): Initializes the model with configuration parameters.
    - img2html(image): Converts a PIL Image or NumPy array to HTML string.
    - parse_args(**kwargs): Parses configuration arguments.
    """

    def __init__(self, config):
        """
        Parameters:
        - config (dict): Configuration dictionary containing model_dir and device.
        """
        args = self.parse_args(**config)
        self.table_sys = TableSystem(args)

    def img2html(self, image):
        """
        Parameters:
        - image (PIL.Image or np.ndarray): The image of the table to be converted.

        Return:
        - HTML (str): A string representing the HTML structure with content of the table.
        """
        if isinstance(image, Image.Image):
            image = np.array(image)
        pred_res, _ = self.table_sys(image)
        pred_html = pred_res["html"]
        res = '<td><table border="1">' + pred_html.replace("<html><body><table>", "").replace("</table></body></html>",
                                                                                               "") + "</table></td>\n"
        return res

    def parse_args(self, **kwargs):
        parser = init_args()
        model_dir = kwargs.get("model_dir")
        table_model_dir = os.path.join(model_dir, TABLE_MASTER_DIR)
        table_char_dict_path = os.path.join(model_dir, TABLE_MASTER_DICT)
        det_model_dir = os.path.join(model_dir, DETECT_MODEL_DIR)
        rec_model_dir = os.path.join(model_dir, REC_MODEL_DIR)
        rec_char_dict_path = os.path.join(model_dir, REC_CHAR_DICT)
        device = kwargs.get("device", "cpu")
        use_gpu = True if device == "cuda" else False
        config = {
            "use_gpu": use_gpu,
            "table_max_len": kwargs.get("table_max_len", TABLE_MAX_LEN),
            "table_algorithm": TABLE_MASTER,
            "table_model_dir": table_model_dir,
            "table_char_dict_path": table_char_dict_path,
            "det_model_dir": det_model_dir,
            "rec_model_dir": rec_model_dir,
            "rec_char_dict_path": rec_char_dict_path,
        }
        parser.set_defaults(**config)
        return parser.parse_args([])
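`parse_args` takes a ready-made argument parser, overrides its defaults, and then parses an empty argument list so that the process's real command line cannot leak into the model configuration. The same pattern with only the standard library is sketched below. The three `add_argument` calls are hypothetical stand-ins for the parser returned by `init_args()`, and their initial defaults are arbitrary placeholders rather than PaddleOCR's real ones.

```python
import argparse
import os


def build_args(model_dir: str, device: str = "cpu") -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Placeholder options standing in for the parser that init_args() returns.
    parser.add_argument("--use_gpu", default=True)
    parser.add_argument("--table_model_dir", default=None)
    parser.add_argument("--table_max_len", type=int, default=0)

    # Parser-level defaults take precedence over the per-argument defaults above.
    parser.set_defaults(
        use_gpu=(device == "cuda"),
        table_model_dir=os.path.join(model_dir, "table_structure_tablemaster_infer/"),
        table_max_len=480,
    )
    # Parsing an explicit empty list ignores sys.argv entirely.
    return parser.parse_args([])
```

For example, `build_args("/models", "cuda").use_gpu` evaluates to `True`, while any device string other than `"cuda"` yields `False`, matching the `use_gpu` rule in the class above.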
...@@ -3,6 +3,7 @@ config: ...@@ -3,6 +3,7 @@ config:
layout: True layout: True
formula: True formula: True
table_config: table_config:
model: TableMaster
is_table_recog_enable: False is_table_recog_enable: False
max_time: 400 max_time: 400
...@@ -10,4 +11,5 @@ weights: ...@@ -10,4 +11,5 @@ weights:
layout: Layout/model_final.pth layout: Layout/model_final.pth
mfd: MFD/weights.pt mfd: MFD/weights.pt
mfr: MFR/UniMERNet mfr: MFR/UniMERNet
table: TabRec/StructEqTable struct_eqtable: TabRec/StructEqTable
\ No newline at end of file TableMaster: TabRec/TableMaster
\ No newline at end of file
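With this change the `weights` section is keyed by the model name itself, so the value of `table_config.model` doubles as the lookup key for the weights directory. A small sketch of that resolution follows, assuming the YAML has already been loaded into plain dicts. `resolve_table_model_dir` is a hypothetical helper, and the explicit `ValueError` is an addition for illustration; the committed code indexes the mapping directly.

```python
import os

# Mirrors the two table entries in the weights section after this commit.
WEIGHTS = {
    "struct_eqtable": "TabRec/StructEqTable",
    "TableMaster": "TabRec/TableMaster",
}


def resolve_table_model_dir(models_dir, table_config, weights=WEIGHTS):
    """Return (model_type, absolute weights dir) for the configured table model."""
    model_type = table_config.get("model", "TableMaster")  # TableMaster is the default
    if model_type not in weights:
        raise ValueError(f"unknown table model {model_type!r}; expected one of {sorted(weights)}")
    return model_type, os.path.join(models_dir, weights[model_type])
```

A config that omits `model` therefore resolves to the TableMaster directory, which matches the default used in `CustomPEKModel`.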
import unittest
from PIL import Image
from magic_pdf.model.ppTableModel import ppTableModel


class TestppTableModel(unittest.TestCase):
    def test_image2html(self):
        img = Image.open("tests/test_table/assets/table.jpg")
        # Change the table model path to match your environment
        config = {"device": "cuda",
                  "model_dir": "D:/models/PDF-Extract-Kit/models/TabRec/TableMaster"}
        table_model = ppTableModel(config)
        res = table_model.img2html(img)
        true_value = """<td><table border="1"><thead><tr><td><b>Methods</b></td><td><b>R</b></td><td><b>P</b></td><td><b>F</b></td><td><b>FPS</b></td></tr></thead><tbody><tr><td>SegLink [26]</td><td>70.0</td><td>86.0</td><td>77.0</td><td>8.9</td></tr><tr><td>PixelLink [4]</td><td>73.2</td><td>83.0</td><td>77.8</td><td>-</td></tr><tr><td>TextSnake [18]</td><td>73.9</td><td>83.2</td><td>78.3</td><td>1.1</td></tr><tr><td>TextField [37]</td><td>75.9</td><td>87.4</td><td>81.3</td><td>5.2 </td></tr><tr><td>MSR[38]</td><td>76.7</td><td>87.4</td><td>81.7</td><td>-</td></tr><tr><td>FTSN[3]</td><td>77.1</td><td>87.6</td><td>82.0</td><td>-</td></tr><tr><td>LSE[30]</td><td>81.7</td><td>84.2</td><td>82.9</td><td>-</td></tr><tr><td>CRAFT [2]</td><td>78.2</td><td>88.2</td><td>82.9</td><td>8.6</td></tr><tr><td>MCN [16]</td><td>79</td><td>88.</td><td>83</td><td>-</td></tr><tr><td>ATRR[35]</td><td>82.1</td><td>85.2</td><td>83.6</td><td>-</td></tr><tr><td>PAN [34]</td><td>83.8</td><td>84.4</td><td>84.1</td><td>30.2</td></tr><tr><td>DB[12]</td><td>79.2</td><td>91.5</td><td>84.9</td><td>32.0</td></tr><tr><td>DRRG [41]</td><td>82.30</td><td>88.05</td><td>85.08</td><td>-</td></tr><tr><td>Ours (SynText)</td><td>80.68</td><td>85.40</td><td>82.97</td><td>12.68</td></tr><tr><td>Ours (MLT-17)</td><td>84.54</td><td>86.62</td><td>85.57</td><td>12.31</td></tr></tbody></table></td>\n"""
        self.assertEqual(true_value, res)


if __name__ == "__main__":
    unittest.main()
\ No newline at end of file