Commit 9d689790, authored by linfeng, committed by GitHub

Merge branch 'opendatalab:dev' into dev

parents bcef0868 fb383ba6
......@@ -80,6 +80,7 @@ body:
-
- "0.6.x"
- "0.7.x"
- "0.8.x"
validations:
required: true
......
......@@ -38,11 +38,12 @@ jobs:
echo $GITHUB_WORKSPACE && sh tests/retry_env.sh
- name: unit test
run: |
cd $GITHUB_WORKSPACE && export PYTHONPATH=. && coverage run -m pytest tests/test_unit.py --cov=magic_pdf/ --cov-report term-missing --cov-report html
cd $GITHUB_WORKSPACE && python tests/clean_coverage.py
cd $GITHUB_WORKSPACE && export PYTHONPATH=. && coverage run -m pytest tests/unittest --cov=magic_pdf/ --cov-report term-missing --cov-report html
cd $GITHUB_WORKSPACE && python tests/get_coverage.py
- name: cli test
run: |
cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_cli_sdk.py
source ~/.bashrc && cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_cli.py
notify_to_feishu:
if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure') && (github.ref_name == 'master') }}
......
......@@ -659,3 +659,4 @@ specific requirements.
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU AGPL, see
<https://www.gnu.org/licenses/>.
......@@ -14,8 +14,9 @@
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Demo-yellow.svg?logo=)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/ModelScope-Demo-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=&labelColor=white)](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/papayalove/b5f4913389e7ff9883c6b687de156e78/mineru_demo.ipynb)
[![Paper](https://img.shields.io/badge/Paper-arXiv-green)](#)
......@@ -40,6 +41,7 @@
</div>
# Changelog
- 2024/09/09: Version 0.8.0 released, supporting fast deployment with Dockerfile, and launching demos on Huggingface and Modelscope.
- 2024/08/30: Version 0.7.1 released, adding a Paddle TableMaster table recognition option
- 2024/08/09: Version 0.7.0b1 released, simplified installation process, added table recognition functionality
- 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation
......@@ -178,7 +180,9 @@ In non-mainline environments, due to the diversity of hardware and software conf
### Online Demo
[Click here for the online demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=&labelColor=white)](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
### Quick CPU Demo
......@@ -353,7 +357,6 @@ TODO
- If you are processing PDFs with a large number of formulas, it is strongly recommended to enable the OCR function. When using PyMuPDF to extract text, overlapping text lines can occur, leading to inaccurate formula insertion positions.
# FAQ
[FAQ in Chinese](docs/FAQ_zh_cn.md)
......
......@@ -14,8 +14,9 @@
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-Demo-yellow.svg?logo=)](https://huggingface.co/spaces/opendatalab/MinerU)
[![ModelScope](https://img.shields.io/badge/ModelScope-Demo-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=&labelColor=white)](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/papayalove/b5f4913389e7ff9883c6b687de156e78/mineru_demo.ipynb)
[![Paper](https://img.shields.io/badge/Paper-arXiv-green)](#)
......@@ -40,6 +41,7 @@
</div>
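# Changelog
- 2024/09/09: Version 0.8.0 released, supporting fast deployment with a Dockerfile and launching demos on HuggingFace and ModelScope
- 2024/08/30: Version 0.7.1 released, integrating Paddle TableMaster table recognition
- 2024/08/09: Version 0.7.0b1 released, simplifying the installation steps for better usability and adding table recognition
- 2024/08/01: Version 0.6.2b1 released, resolving dependency conflicts and improving the installation documentation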
......@@ -178,7 +180,9 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
### Online Demo
[Click here for the online demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
[![OpenDataLab](https://img.shields.io/badge/Demo_on_OpenDataLab-blue?logo=&labelColor=white)](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
[![ModelScope](https://img.shields.io/badge/Demo_on_ModelScope-purple?logo=&labelColor=white)](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
[![HuggingFace](https://img.shields.io/badge/Demo_on_HuggingFace-yellow.svg?logo=&labelColor=white)](https://huggingface.co/spaces/opendatalab/MinerU)
### Quick CPU Demo
......@@ -356,8 +360,8 @@ TODO
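- On some formula-dense PDFs, forcing OCR on gives better results
- If you are processing PDFs with a large number of formulas, it is strongly recommended to enable the OCR function. When using PyMuPDF to extract text, text lines can overlap, leading to inaccurate formula insertion positions.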
# FAQ
# FAQ
[Frequently Asked Questions](docs/FAQ_zh_cn.md)
......
......@@ -44,3 +44,11 @@ pip uninstall fairscale
pip install fairscale
```
Reference: https://github.com/opendatalab/MinerU/issues/411
### 6. On some newer devices like the H100, the text parsed during OCR using CUDA acceleration is garbled.
CUDA 11 has poor compatibility with newer GPUs, so the CUDA version used by Paddle needs to be upgraded:
```bash
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
```
Reference: https://github.com/opendatalab/MinerU/issues/558
......@@ -41,3 +41,11 @@ pip uninstall fairscale
pip install fairscale
```
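Reference: https://github.com/opendatalab/MinerU/issues/411
### 6. On some newer devices such as the H100, text parsed by OCR with CUDA acceleration is garbled.
CUDA 11 has poor compatibility with newer GPUs, so the CUDA version used by Paddle needs to be upgraded: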
```bash
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
```
Reference: https://github.com/opendatalab/MinerU/issues/558
from huggingface_hub import snapshot_download
model_dir = snapshot_download('opendatalab/PDF-Extract-Kit')
print(f"model dir is: {model_dir}/models")
......@@ -6,58 +6,8 @@ wget https://github.com/opendatalab/MinerU/raw/master/docs/download_models_hf.py
python download_models_hf.py
```
After the Python script finishes executing, it will output the directory where the models are downloaded.
### 2. Additional steps
#### 1. Check whether the model directory is downloaded completely.
The structure of the model folder is as follows, including configuration files and weight files of different components:
```
../
├── Layout
│ ├── config.json
│ └── model_final.pth
├── MFD
│ └── weights.pt
├── MFR
│ └── UniMERNet
│ ├── config.json
│ ├── preprocessor_config.json
│ ├── pytorch_model.bin
│ ├── README.md
│ ├── tokenizer_config.json
│ └── tokenizer.json
│── TabRec
│ └─StructEqTable
│ ├── config.json
│ ├── generation_config.json
│ ├── model.safetensors
│ ├── preprocessor_config.json
│ ├── special_tokens_map.json
│ ├── spiece.model
│ ├── tokenizer.json
│ └── tokenizer_config.json
│ └─ TableMaster
│ └─ ch_PP-OCRv3_det_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ └─ ch_PP-OCRv3_rec_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ └─ table_structure_tablemaster_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ ├── ppocr_keys_v1.txt
│ └── table_master_structure_dict.txt
└── README.md
```
#### 2. Check whether the model file is fully downloaded.
Check that the size of each model file in the directory matches the size listed on the web page. If possible, also verify each file's SHA-256 checksum to confirm that the download is complete.
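The checksum check mentioned above can be scripted. The following is a minimal sketch using only the Python standard library; `sha256_of_file` and `verify_model_file` are hypothetical helper names, and the expected digest is assumed to be copied by hand from the model's download page.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path, expected_sha256):
    """Return True if the file exists and its digest matches the published value."""
    path = Path(path)
    if not path.is_file():
        return False
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Reading in chunks keeps memory use flat for multi-gigabyte weight files. A `False` result means the file is either missing or differs from the published digest, so it should be downloaded again.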
#### 3.
### 2. Modify the model path in the configuration file
Additionally, in `~/magic-pdf.json`, update the model directory path to the absolute path of the `models` directory output by the previous Python script. Otherwise, you will encounter an error indicating that the model cannot be loaded.
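The config update can also be done programmatically. This is a sketch under stated assumptions: the key name `models-dir` is an assumption (the text above only says "model directory path"), and `set_models_dir` is a hypothetical helper, not part of magic-pdf.

```python
import json
from pathlib import Path


def set_models_dir(config_path, models_dir, key="models-dir"):
    """Store the absolute models directory in a JSON config file.

    The default key name is an assumption; check your own magic-pdf.json.
    """
    config_path = Path(config_path).expanduser()
    config = {}
    if config_path.exists():
        # Keep every other setting that is already in the file
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config[key] = str(Path(models_dir).expanduser().resolve())
    config_path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False), encoding="utf-8"
    )
    return config[key]


# Example call (commented out so the sketch does not touch a real config):
# set_models_dir("~/magic-pdf.json", "/path/printed/by/the/script/models")
```

Resolving the path before writing it avoids the load failure described above when the tool is later started from a different working directory.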
......@@ -21,55 +21,7 @@ wget https://gitee.com/myhloli/MinerU/raw/master/docs/download_models.py
python download_models.py
```
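After the Python script finishes executing, it will output the directory where the models are downloaded.
## 【❗️Required❗️】 additional steps (be sure to complete the following after the models finish downloading)
### 1. Check whether the model directory is downloaded completely
The structure of the model folder is as follows, including configuration files and weight files of different components: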
```
./
├── Layout # layout detection model
│ ├── config.json
│ └── model_final.pth
├── MFD # formula detection
│ └── weights.pt
├── MFR # formula recognition model
│ └── UniMERNet
│ ├── config.json
│ ├── preprocessor_config.json
│ ├── pytorch_model.bin
│ ├── README.md
│ ├── tokenizer_config.json
│ └── tokenizer.json
│── TabRec # table recognition model
│ └─StructEqTable
│ ├── config.json
│ ├── generation_config.json
│ ├── model.safetensors
│ ├── preprocessor_config.json
│ ├── special_tokens_map.json
│ ├── spiece.model
│ ├── tokenizer.json
│ └── tokenizer_config.json
│ └─ TableMaster
│ └─ ch_PP-OCRv3_det_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ └─ ch_PP-OCRv3_rec_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ └─ table_structure_tablemaster_infer
│ ├── inference.pdiparams
│ ├── inference.pdiparams.info
│ └── inference.pdmodel
│ ├── ppocr_keys_v1.txt
│ └── table_master_structure_dict.txt
└── README.md
```
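### 2. Check whether the model files are fully downloaded
Check that the size of each model file in the directory matches the description on the web page. If possible, verify with sha256 that the models are fully downloaded.
### 3. Modify the model path in magic-pdf.json
Additionally, in `~/magic-pdf.json`, set the model directory to the absolute path of the models directory output by the earlier Python script; otherwise you will get an error that the model cannot be loaded.
## After the download completes: modify the model path in magic-pdf.json
In `~/magic-pdf.json`, set the model directory to the absolute path of the models directory output by the script in the previous step; otherwise you will get an error that the model cannot be loaded.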
......@@ -116,17 +116,20 @@ def ocr_mk_markdown_with_para_core(paras_of_layout, mode, img_buket_path=''):
def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
mode,
img_buket_path=''):
img_buket_path='',
parse_type="auto",
lang=None
):
page_markdown = []
for para_block in paras_of_layout:
para_text = ''
para_type = para_block['type']
if para_type == BlockType.Text:
para_text = merge_para_with_text(para_block)
para_text = merge_para_with_text(para_block, parse_type=parse_type, lang=lang)
elif para_type == BlockType.Title:
para_text = f'# {merge_para_with_text(para_block)}'
para_text = f'# {merge_para_with_text(para_block, parse_type=parse_type, lang=lang)}'
elif para_type == BlockType.InterlineEquation:
para_text = merge_para_with_text(para_block)
para_text = merge_para_with_text(para_block, parse_type=parse_type, lang=lang)
elif para_type == BlockType.Image:
if mode == 'nlp':
continue
......@@ -139,17 +142,17 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
para_text += f"\n![]({join_path(img_buket_path, span['image_path'])}) \n"
for block in para_block['blocks']: # 2nd: append image_caption
if block['type'] == BlockType.ImageCaption:
para_text += merge_para_with_text(block)
para_text += merge_para_with_text(block, parse_type=parse_type, lang=lang)
for block in para_block['blocks']: # 3rd: append image_footnote
if block['type'] == BlockType.ImageFootnote:
para_text += merge_para_with_text(block)
para_text += merge_para_with_text(block, parse_type=parse_type, lang=lang)
elif para_type == BlockType.Table:
if mode == 'nlp':
continue
elif mode == 'mm':
for block in para_block['blocks']: # 1st: append table_caption
if block['type'] == BlockType.TableCaption:
para_text += merge_para_with_text(block)
para_text += merge_para_with_text(block, parse_type=parse_type, lang=lang)
for block in para_block['blocks']: # 2nd: append table_body
if block['type'] == BlockType.TableBody:
for line in block['lines']:
......@@ -164,7 +167,7 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
para_text += f"\n![]({join_path(img_buket_path, span['image_path'])}) \n"
for block in para_block['blocks']: # 3rd: append table_footnote
if block['type'] == BlockType.TableFootnote:
para_text += merge_para_with_text(block)
para_text += merge_para_with_text(block, parse_type=parse_type, lang=lang)
if para_text.strip() == '':
continue
......@@ -174,7 +177,7 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
return page_markdown
def merge_para_with_text(para_block):
def merge_para_with_text(para_block, parse_type="auto", lang=None):
def detect_language(text):
en_pattern = r'[a-zA-Z]+'
......@@ -205,7 +208,11 @@ def merge_para_with_text(para_block):
content = span['content']
# language = detect_lang(content)
language = detect_language(content)
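if language == 'en': # Only split long English words; splitting Chinese text would lose content
# Check whether a less common (non-English) language was specified
if lang is not None and lang != 'en':
content = ocr_escape_special_markdown_char(content)
else: # Logic for the default (non-minority-language) case
if language == 'en' and parse_type == 'ocr': # Only split long English words; splitting Chinese text would lose content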
content = ocr_escape_special_markdown_char(
split_long_words(content))
else:
......@@ -265,41 +272,39 @@ def para_to_standard_format(para, img_buket_path):
return para_content
def para_to_standard_format_v2(para_block, img_buket_path, page_idx):
def para_to_standard_format_v2(para_block, img_buket_path, page_idx, parse_type="auto", lang=None, drop_reason=None):
para_type = para_block['type']
para_content = {}
if para_type == BlockType.Text:
para_content = {
'type': 'text',
'text': merge_para_with_text(para_block),
'page_idx': page_idx,
'text': merge_para_with_text(para_block, parse_type=parse_type, lang=lang),
}
elif para_type == BlockType.Title:
para_content = {
'type': 'text',
'text': merge_para_with_text(para_block),
'text': merge_para_with_text(para_block, parse_type=parse_type, lang=lang),
'text_level': 1,
'page_idx': page_idx,
}
elif para_type == BlockType.InterlineEquation:
para_content = {
'type': 'equation',
'text': merge_para_with_text(para_block),
'text': merge_para_with_text(para_block, parse_type=parse_type, lang=lang),
'text_format': 'latex',
'page_idx': page_idx,
}
elif para_type == BlockType.Image:
para_content = {'type': 'image', 'page_idx': page_idx}
para_content = {'type': 'image'}
for block in para_block['blocks']:
if block['type'] == BlockType.ImageBody:
para_content['img_path'] = join_path(
img_buket_path,
block['lines'][0]['spans'][0]['image_path'])
if block['type'] == BlockType.ImageCaption:
para_content['img_caption'] = merge_para_with_text(block)
para_content['img_caption'] = merge_para_with_text(block, parse_type=parse_type, lang=lang)
if block['type'] == BlockType.ImageFootnote:
para_content['img_footnote'] = merge_para_with_text(block)
para_content['img_footnote'] = merge_para_with_text(block, parse_type=parse_type, lang=lang)
elif para_type == BlockType.Table:
para_content = {'type': 'table', 'page_idx': page_idx}
para_content = {'type': 'table'}
for block in para_block['blocks']:
if block['type'] == BlockType.TableBody:
if block["lines"][0]["spans"][0].get('latex', ''):
......@@ -308,9 +313,14 @@ def para_to_standard_format_v2(para_block, img_buket_path, page_idx):
para_content['table_body'] = f"\n\n{block['lines'][0]['spans'][0]['html']}\n\n"
para_content['img_path'] = join_path(img_buket_path, block["lines"][0]["spans"][0]['image_path'])
if block['type'] == BlockType.TableCaption:
para_content['table_caption'] = merge_para_with_text(block)
para_content['table_caption'] = merge_para_with_text(block, parse_type=parse_type, lang=lang)
if block['type'] == BlockType.TableFootnote:
para_content['table_footnote'] = merge_para_with_text(block)
para_content['table_footnote'] = merge_para_with_text(block, parse_type=parse_type, lang=lang)
para_content['page_idx'] = page_idx
if drop_reason is not None:
para_content['drop_reason'] = drop_reason
return para_content
......@@ -394,13 +404,19 @@ def ocr_mk_mm_standard_format(pdf_info_dict: list):
def union_make(pdf_info_dict: list,
make_mode: str,
drop_mode: str,
img_buket_path: str = ''):
img_buket_path: str = '',
parse_type: str = "auto",
lang=None):
output_content = []
for page_info in pdf_info_dict:
drop_reason_flag = False
drop_reason = None
if page_info.get('need_drop', False):
drop_reason = page_info.get('drop_reason')
if drop_mode == DropMode.NONE:
pass
elif drop_mode == DropMode.NONE_WITH_REASON:
drop_reason_flag = True
elif drop_mode == DropMode.WHOLE_PDF:
raise Exception((f'drop_mode is {DropMode.WHOLE_PDF} ,'
f'drop_reason is {drop_reason}'))
......@@ -417,16 +433,20 @@ def union_make(pdf_info_dict: list,
continue
if make_mode == MakeMode.MM_MD:
page_markdown = ocr_mk_markdown_with_para_core_v2(
paras_of_layout, 'mm', img_buket_path)
paras_of_layout, 'mm', img_buket_path, parse_type=parse_type, lang=lang)
output_content.extend(page_markdown)
elif make_mode == MakeMode.NLP_MD:
page_markdown = ocr_mk_markdown_with_para_core_v2(
paras_of_layout, 'nlp')
paras_of_layout, 'nlp', parse_type=parse_type, lang=lang)
output_content.extend(page_markdown)
elif make_mode == MakeMode.STANDARD_FORMAT:
for para_block in paras_of_layout:
if drop_reason_flag:
para_content = para_to_standard_format_v2(
para_block, img_buket_path, page_idx, parse_type=parse_type, lang=lang, drop_reason=drop_reason)
else:
para_content = para_to_standard_format_v2(
para_block, img_buket_path, page_idx)
para_block, img_buket_path, page_idx, parse_type=parse_type, lang=lang)
output_content.append(para_content)
if make_mode in [MakeMode.MM_MD, MakeMode.NLP_MD]:
return '\n\n'.join(output_content)
......
......@@ -8,3 +8,4 @@ class DropMode:
WHOLE_PDF = "whole_pdf"
SINGLE_PAGE = "single_page"
NONE = "none"
NONE_WITH_REASON = "none_with_reason"
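The effect of the new `NONE_WITH_REASON` mode can be seen in a reduced form. The sketch below is a simplified stand-in for the `STANDARD_FORMAT` branch of `union_make`, not the project's code: `collect_blocks`, the flat `blocks` list, and the use of the loop index as `page_idx` are assumptions made for illustration, and the `SINGLE_PAGE` branch (elided in the diff) is assumed to skip the page.

```python
class DropMode:
    WHOLE_PDF = "whole_pdf"
    SINGLE_PAGE = "single_page"
    NONE = "none"
    NONE_WITH_REASON = "none_with_reason"


def collect_blocks(pages, drop_mode):
    """Build content items, tagging dropped pages only in NONE_WITH_REASON mode."""
    output = []
    for page_idx, page in enumerate(pages):
        drop_reason = None
        attach_reason = False
        if page.get("need_drop", False):
            drop_reason = page.get("drop_reason")
            if drop_mode == DropMode.WHOLE_PDF:
                raise RuntimeError(
                    f"drop_mode is {drop_mode}, drop_reason is {drop_reason}"
                )
            if drop_mode == DropMode.SINGLE_PAGE:
                continue  # assumed behaviour: skip only this page
            # NONE keeps the page silently; NONE_WITH_REASON keeps it and tags it
            attach_reason = drop_mode == DropMode.NONE_WITH_REASON
        for text in page.get("blocks", []):
            item = {"type": "text", "text": text, "page_idx": page_idx}
            if attach_reason and drop_reason is not None:
                item["drop_reason"] = drop_reason
            output.append(item)
    return output
```

With this mode a consumer of the content list still receives every block but can filter or flag the ones that came from a page marked for dropping.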
......@@ -426,3 +426,22 @@ def bbox_distance(bbox1, bbox2):
elif top:
return y2 - y1b
return 0.0
def box_area(bbox):
return (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
def get_overlap_area(bbox1, bbox2):
"""Compute the overlap area of bbox1 and bbox2 (an absolute area, not a ratio)."""
# Determine the coordinates of the intersection rectangle
x_left = max(bbox1[0], bbox2[0])
y_top = max(bbox1[1], bbox2[1])
x_right = min(bbox1[2], bbox2[2])
y_bottom = min(bbox1[3], bbox2[3])
if x_right < x_left or y_bottom < y_top:
return 0.0
# Area of the intersection rectangle
return (x_right - x_left) * (y_bottom - y_top)
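A short usage example makes the return values of the two new helpers concrete. The functions are copied from the hunk above so the snippet runs on its own; `overlap_ratio` is a hypothetical helper added only to contrast an absolute area with a ratio. Boxes are assumed to be `[x0, y0, x1, y1]`.

```python
def box_area(bbox):
    """Area of an [x0, y0, x1, y1] box."""
    return (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])


def get_overlap_area(bbox1, bbox2):
    """Absolute area of the intersection of two boxes (0.0 if disjoint)."""
    x_left = max(bbox1[0], bbox2[0])
    y_top = max(bbox1[1], bbox2[1])
    x_right = min(bbox1[2], bbox2[2])
    y_bottom = min(bbox1[3], bbox2[3])
    if x_right < x_left or y_bottom < y_top:
        return 0.0
    return (x_right - x_left) * (y_bottom - y_top)


def overlap_ratio(bbox1, bbox2):
    """Hypothetical helper: overlap area as a fraction of bbox1's area."""
    area = box_area(bbox1)
    return get_overlap_area(bbox1, bbox2) / area if area > 0 else 0.0


a = [0, 0, 10, 10]
b = [5, 5, 15, 15]
c = [20, 20, 30, 30]

print(box_area(a))             # 100
print(get_overlap_area(a, b))  # 25
print(get_overlap_area(a, c))  # 0.0
print(overlap_ratio(a, b))     # 0.25
```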
__version__ = "0.7.1"
__version__ = "0.8.0"
......@@ -57,14 +57,14 @@ class ModelSingleton:
cls._instance = super().__new__(cls)
return cls._instance
def get_model(self, ocr: bool, show_log: bool):
key = (ocr, show_log)
def get_model(self, ocr: bool, show_log: bool, lang=None):
key = (ocr, show_log, lang)
if key not in self._models:
self._models[key] = custom_model_init(ocr=ocr, show_log=show_log)
self._models[key] = custom_model_init(ocr=ocr, show_log=show_log, lang=lang)
return self._models[key]
def custom_model_init(ocr: bool = False, show_log: bool = False):
def custom_model_init(ocr: bool = False, show_log: bool = False, lang=None):
model = None
if model_config.__model_mode__ == "lite":
......@@ -78,7 +78,7 @@ def custom_model_init(ocr: bool = False, show_log: bool = False):
model_init_start = time.time()
if model == MODEL.Paddle:
from magic_pdf.model.pp_structure_v2 import CustomPaddleModel
custom_model = CustomPaddleModel(ocr=ocr, show_log=show_log)
custom_model = CustomPaddleModel(ocr=ocr, show_log=show_log, lang=lang)
elif model == MODEL.PEK:
from magic_pdf.model.pdf_extract_kit import CustomPEKModel
# Read model-dir and device from the configuration file
......@@ -89,7 +89,9 @@ def custom_model_init(ocr: bool = False, show_log: bool = False):
"show_log": show_log,
"models_dir": local_models_dir,
"device": device,
"table_config": table_config}
"table_config": table_config,
"lang": lang,
}
custom_model = CustomPEKModel(**model_input)
else:
logger.error("Not allow model_name!")
......@@ -104,10 +106,10 @@ def custom_model_init(ocr: bool = False, show_log: bool = False):
def doc_analyze(pdf_bytes: bytes, ocr: bool = False, show_log: bool = False,
start_page_id=0, end_page_id=None):
start_page_id=0, end_page_id=None, lang=None):
model_manager = ModelSingleton()
custom_model = model_manager.get_model(ocr, show_log)
custom_model = model_manager.get_model(ocr, show_log, lang)
images = load_images_from_pdf(pdf_bytes)
......
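Adding `lang` to the cache key means one model instance is kept per `(ocr, show_log, lang)` combination. The sketch below shows that pattern in isolation; `ModelCache`, the `factory` argument, and `fake_init` are hypothetical stand-ins for `ModelSingleton` and `custom_model_init`, which need the real model weights to run.

```python
class ModelCache:
    """Process-wide cache with one entry per (ocr, show_log, lang) key."""

    _instance = None
    _models = {}

    def __new__(cls, *args, **kwargs):
        # Classic singleton: every construction returns the same object
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_model(self, ocr, show_log, lang=None, factory=None):
        key = (ocr, show_log, lang)
        if key not in self._models:
            # Only build a model the first time this exact key is requested
            self._models[key] = factory(ocr=ocr, show_log=show_log, lang=lang)
        return self._models[key]


calls = []


def fake_init(ocr, show_log, lang):
    """Hypothetical factory that records how often it is invoked."""
    calls.append((ocr, show_log, lang))
    return {"ocr": ocr, "show_log": show_log, "lang": lang}


cache = ModelCache()
m1 = cache.get_model(True, False, lang="en", factory=fake_init)
m2 = ModelCache().get_model(True, False, lang="en", factory=fake_init)
m3 = cache.get_model(True, False, lang="fr", factory=fake_init)
```

Here `m1` and `m2` are the same object, while `m3` triggers a second initialization. Without `lang` in the key, a request for a different language would silently reuse the first model.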
import json
from magic_pdf.libs.boxbase import (_is_in, _is_part_overlap, bbox_distance,
bbox_relative_pos, calculate_iou,
calculate_overlap_area_in_bbox1_area_ratio)
bbox_relative_pos, box_area, calculate_iou,
calculate_overlap_area_in_bbox1_area_ratio,
get_overlap_area)
from magic_pdf.libs.commons import fitz, join_path
from magic_pdf.libs.coordinate_transform import get_scale_ratio
from magic_pdf.libs.local_math import float_gt
......@@ -12,6 +13,7 @@ from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
CAPATION_OVERLAP_AREA_RATIO = 0.6
MERGE_BOX_OVERLAP_AREA_RATIO = 1.1
class MagicModel:
......@@ -165,6 +167,8 @@ class MagicModel:
dis_table_footnote.get(i, float('inf')),
)
for i in range(len(footnotes)):
if i not in dis_figure_footnote:
continue
if dis_table_footnote.get(i, float('inf')) > dis_figure_footnote[i]:
footnotes[i]['category_id'] = CategoryId.ImageFootnote
......@@ -191,6 +195,44 @@ class MagicModel:
Select all subjects that overlap the merged bbox and whose overlap area is larger than the object's area.
Then compute the shortest distance between the selected subjects and the object.
"""
def search_overlap_between_boxes(
subject_idx, object_idx
):
idxes = [subject_idx, object_idx]
x0s = [all_bboxes[idx]['bbox'][0] for idx in idxes]
y0s = [all_bboxes[idx]['bbox'][1] for idx in idxes]
x1s = [all_bboxes[idx]['bbox'][2] for idx in idxes]
y1s = [all_bboxes[idx]['bbox'][3] for idx in idxes]
merged_bbox = [
min(x0s),
min(y0s),
max(x1s),
max(y1s),
]
ratio = 0
other_objects = list(
map(
lambda x: {'bbox': x['bbox'], 'score': x['score']},
filter(
lambda x: x['category_id']
not in (object_category_id, subject_category_id),
self.__model_list[page_no]['layout_dets'],
),
)
)
for other_object in other_objects:
ratio = max(
ratio,
get_overlap_area(
merged_bbox, other_object['bbox']
) * 1.0 / box_area(all_bboxes[object_idx]['bbox'])
)
if ratio >= MERGE_BOX_OVERLAP_AREA_RATIO:
break
return ratio
def may_find_other_nearest_bbox(subject_idx, object_idx):
ret = float('inf')
......@@ -299,6 +341,15 @@ class MagicModel:
):
continue
subject_idx, object_idx = i, j
if all_bboxes[j]['category_id'] == subject_category_id:
subject_idx, object_idx = j, i
if search_overlap_between_boxes(subject_idx, object_idx) >= MERGE_BOX_OVERLAP_AREA_RATIO:
dis[i][j] = float('inf')
dis[j][i] = dis[i][j]
continue
dis[i][j] = bbox_distance(all_bboxes[i]['bbox'], all_bboxes[j]['bbox'])
dis[j][i] = dis[i][j]
......@@ -627,13 +678,13 @@ class MagicModel:
span['type'] = ContentType.Image
elif category_id == 5:
# Get the table model result
latex = layout_det.get("latex", None)
html = layout_det.get("html", None)
latex = layout_det.get('latex', None)
html = layout_det.get('html', None)
if latex:
span["latex"] = latex
span['latex'] = latex
elif html:
span["html"] = html
span["type"] = ContentType.Table
span['html'] = html
span['type'] = ContentType.Table
elif category_id == 13:
span['content'] = layout_det['latex']
span['type'] = ContentType.InlineEquation
......
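The pairing filter added in `search_overlap_between_boxes` can be tried on plain coordinates. The sketch below is a reduced version under stated assumptions: boxes are bare `[x0, y0, x1, y1]` lists instead of the layout dicts used in `MagicModel`, and `merged_bbox` and `max_intruder_ratio` are hypothetical names. It computes the box spanning a subject and an object, then measures how much of that span is covered by an unrelated box, relative to the object's own area.

```python
MERGE_BOX_OVERLAP_AREA_RATIO = 1.1


def box_area(bbox):
    return (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])


def get_overlap_area(bbox1, bbox2):
    x_left = max(bbox1[0], bbox2[0])
    y_top = max(bbox1[1], bbox2[1])
    x_right = min(bbox1[2], bbox2[2])
    y_bottom = min(bbox1[3], bbox2[3])
    if x_right < x_left or y_bottom < y_top:
        return 0.0
    return (x_right - x_left) * (y_bottom - y_top)


def merged_bbox(b1, b2):
    """Smallest box that contains both inputs."""
    return [min(b1[0], b2[0]), min(b1[1], b2[1]),
            max(b1[2], b2[2]), max(b1[3], b2[3])]


def max_intruder_ratio(subject, obj, others):
    """Largest overlap of any other box with the merged span, over the object's area.

    Assumes the object box has a non-zero area.
    """
    merged = merged_bbox(subject, obj)
    ratio = 0.0
    for other in others:
        ratio = max(ratio, get_overlap_area(merged, other) / box_area(obj))
        if ratio >= MERGE_BOX_OVERLAP_AREA_RATIO:
            break  # already enough to reject the pair
    return ratio


figure = [0, 0, 100, 100]
caption = [0, 300, 100, 320]        # area 2000
text_between = [0, 120, 100, 280]   # sits between figure and caption

ratio = max_intruder_ratio(figure, caption, [text_between])
print(ratio)                                  # 8.0
print(ratio >= MERGE_BOX_OVERLAP_AREA_RATIO)  # True, so the pair is rejected
```

In the diff, a pair that reaches the threshold gets a distance of `float('inf')`, so a caption is not attached to a figure when another block of comparable size lies between them.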
......@@ -58,7 +58,7 @@ def mfd_model_init(weight):
def mfr_model_init(weight_dir, cfg_path, _device_='cpu'):
args = argparse.Namespace(cfg_path=cfg_path, options=None)
cfg = Config(args)
cfg.config.model.pretrained = os.path.join(weight_dir, "pytorch_model.bin")
cfg.config.model.pretrained = os.path.join(weight_dir, "pytorch_model.pth")
cfg.config.model.model_config.model_name = weight_dir
cfg.config.model.tokenizer_config.path = weight_dir
task = tasks.setup_task(cfg)
......@@ -74,7 +74,10 @@ def layout_model_init(weight, config_file, device):
return model
def ocr_model_init(show_log: bool = False, det_db_box_thresh=0.3):
def ocr_model_init(show_log: bool = False, det_db_box_thresh=0.3, lang=None):
if lang is not None:
model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=det_db_box_thresh, lang=lang)
else:
model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=det_db_box_thresh)
return model
......@@ -134,7 +137,8 @@ def atom_model_init(model_name: str, **kwargs):
elif model_name == AtomicModel.OCR:
atom_model = ocr_model_init(
kwargs.get("ocr_show_log"),
kwargs.get("det_db_box_thresh")
kwargs.get("det_db_box_thresh"),
kwargs.get("lang")
)
elif model_name == AtomicModel.Table:
atom_model = table_model_init(
......@@ -177,9 +181,10 @@ class CustomPEKModel:
self.table_max_time = self.table_config.get("max_time", TABLE_MAX_TIME_VALUE)
self.table_model_type = self.table_config.get("model", TABLE_MASTER)
self.apply_ocr = ocr
self.lang = kwargs.get("lang", None)
logger.info(
"DocAnalysis init, this may take some times. apply_layout: {}, apply_formula: {}, apply_ocr: {}, apply_table: {}".format(
self.apply_layout, self.apply_formula, self.apply_ocr, self.apply_table
"DocAnalysis init, this may take some times. apply_layout: {}, apply_formula: {}, apply_ocr: {}, apply_table: {}, lang: {}".format(
self.apply_layout, self.apply_formula, self.apply_ocr, self.apply_table, self.lang
)
)
assert self.apply_layout, "DocAnalysis must contain layout model."
......@@ -225,11 +230,13 @@ class CustomPEKModel:
)
# Initialize OCR
if self.apply_ocr:
# self.ocr_model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=0.3)
self.ocr_model = atom_model_manager.get_atom_model(
atom_model_name=AtomicModel.OCR,
ocr_show_log=show_log,
det_db_box_thresh=0.3
det_db_box_thresh=0.3,
lang=self.lang
)
# init table model
if self.apply_table:
......@@ -243,6 +250,7 @@ class CustomPEKModel:
table_max_time=self.table_max_time,
device=self.device
)
logger.info('DocAnalysis init done!')
def __call__(self, image):
......@@ -383,6 +391,7 @@ class CustomPEKModel:
latex_code = self.table_model.image2latex(new_image)[0]
else:
html_code = self.table_model.img2html(new_image)
run_time = time.time() - single_table_start_time
logger.info(f"------------table recognition processing ends within {run_time}s-----")
if run_time > self.table_max_time:
......
......@@ -18,7 +18,10 @@ def region_to_bbox(region):
class CustomPaddleModel:
def __init__(self, ocr: bool = False, show_log: bool = False):
def __init__(self, ocr: bool = False, show_log: bool = False, lang=None):
if lang is not None:
self.model = PPStructure(table=False, ocr=ocr, show_log=show_log, lang=lang)
else:
self.model = PPStructure(table=False, ocr=ocr, show_log=show_log)
def __call__(self, img):
......
......@@ -17,7 +17,7 @@ class AbsPipe(ABC):
PIP_TXT = "txt"
def __init__(self, pdf_bytes: bytes, model_list: list, image_writer: AbsReaderWriter, is_debug: bool = False,
start_page_id=0, end_page_id=None):
start_page_id=0, end_page_id=None, lang=None):
self.pdf_bytes = pdf_bytes
self.model_list = model_list
self.image_writer = image_writer
......@@ -25,6 +25,7 @@ class AbsPipe(ABC):
self.is_debug = is_debug
self.start_page_id = start_page_id
self.end_page_id = end_page_id
self.lang = lang
def get_compress_pdf_mid_data(self):
return JsonCompressor.compress_json(self.pdf_mid_data)
......@@ -94,7 +95,9 @@ class AbsPipe(ABC):
"""
pdf_mid_data = JsonCompressor.decompress_json(compressed_pdf_mid_data)
pdf_info_list = pdf_mid_data["pdf_info"]
content_list = union_make(pdf_info_list, MakeMode.STANDARD_FORMAT, drop_mode, img_buket_path)
parse_type = pdf_mid_data["_parse_type"]
lang = pdf_mid_data.get("_lang", None)
content_list = union_make(pdf_info_list, MakeMode.STANDARD_FORMAT, drop_mode, img_buket_path, parse_type, lang)
return content_list
@staticmethod
......@@ -104,7 +107,9 @@ class AbsPipe(ABC):
"""
pdf_mid_data = JsonCompressor.decompress_json(compressed_pdf_mid_data)
pdf_info_list = pdf_mid_data["pdf_info"]
md_content = union_make(pdf_info_list, md_make_mode, drop_mode, img_buket_path)
parse_type = pdf_mid_data["_parse_type"]
lang = pdf_mid_data.get("_lang", None)
md_content = union_make(pdf_info_list, md_make_mode, drop_mode, img_buket_path, parse_type, lang)
return md_content
......@@ -10,19 +10,21 @@ from magic_pdf.user_api import parse_ocr_pdf
class OCRPipe(AbsPipe):
def __init__(self, pdf_bytes: bytes, model_list: list, image_writer: AbsReaderWriter, is_debug: bool = False,
start_page_id=0, end_page_id=None):
super().__init__(pdf_bytes, model_list, image_writer, is_debug, start_page_id, end_page_id)
start_page_id=0, end_page_id=None, lang=None):
super().__init__(pdf_bytes, model_list, image_writer, is_debug, start_page_id, end_page_id, lang)
def pipe_classify(self):
pass
def pipe_analyze(self):
self.model_list = doc_analyze(self.pdf_bytes, ocr=True,
start_page_id=self.start_page_id, end_page_id=self.end_page_id)
start_page_id=self.start_page_id, end_page_id=self.end_page_id,
lang=self.lang)
def pipe_parse(self):
self.pdf_mid_data = parse_ocr_pdf(self.pdf_bytes, self.model_list, self.image_writer, is_debug=self.is_debug,
start_page_id=self.start_page_id, end_page_id=self.end_page_id)
start_page_id=self.start_page_id, end_page_id=self.end_page_id,
lang=self.lang)
def pipe_mk_uni_format(self, img_parent_path: str, drop_mode=DropMode.WHOLE_PDF):
result = super().pipe_mk_uni_format(img_parent_path, drop_mode)
......
......@@ -11,19 +11,21 @@ from magic_pdf.user_api import parse_txt_pdf
class TXTPipe(AbsPipe):
def __init__(self, pdf_bytes: bytes, model_list: list, image_writer: AbsReaderWriter, is_debug: bool = False,
start_page_id=0, end_page_id=None):
super().__init__(pdf_bytes, model_list, image_writer, is_debug, start_page_id, end_page_id)
start_page_id=0, end_page_id=None, lang=None):
super().__init__(pdf_bytes, model_list, image_writer, is_debug, start_page_id, end_page_id, lang)
def pipe_classify(self):
pass
def pipe_analyze(self):
self.model_list = doc_analyze(self.pdf_bytes, ocr=False,
start_page_id=self.start_page_id, end_page_id=self.end_page_id)
start_page_id=self.start_page_id, end_page_id=self.end_page_id,
lang=self.lang)
def pipe_parse(self):
self.pdf_mid_data = parse_txt_pdf(self.pdf_bytes, self.model_list, self.image_writer, is_debug=self.is_debug,
start_page_id=self.start_page_id, end_page_id=self.end_page_id)
start_page_id=self.start_page_id, end_page_id=self.end_page_id,
lang=self.lang)
def pipe_mk_uni_format(self, img_parent_path: str, drop_mode=DropMode.WHOLE_PDF):
result = super().pipe_mk_uni_format(img_parent_path, drop_mode)
......
......@@ -14,9 +14,9 @@ from magic_pdf.user_api import parse_union_pdf, parse_ocr_pdf
class UNIPipe(AbsPipe):
def __init__(self, pdf_bytes: bytes, jso_useful_key: dict, image_writer: AbsReaderWriter, is_debug: bool = False,
start_page_id=0, end_page_id=None):
start_page_id=0, end_page_id=None, lang=None):
self.pdf_type = jso_useful_key["_pdf_type"]
super().__init__(pdf_bytes, jso_useful_key["model_list"], image_writer, is_debug, start_page_id, end_page_id)
super().__init__(pdf_bytes, jso_useful_key["model_list"], image_writer, is_debug, start_page_id, end_page_id, lang)
if len(self.model_list) == 0:
self.input_model_is_empty = True
else:
......@@ -28,22 +28,26 @@ class UNIPipe(AbsPipe):
def pipe_analyze(self):
if self.pdf_type == self.PIP_TXT:
self.model_list = doc_analyze(self.pdf_bytes, ocr=False,
start_page_id=self.start_page_id, end_page_id=self.end_page_id)
start_page_id=self.start_page_id, end_page_id=self.end_page_id,
lang=self.lang)
elif self.pdf_type == self.PIP_OCR:
self.model_list = doc_analyze(self.pdf_bytes, ocr=True,
start_page_id=self.start_page_id, end_page_id=self.end_page_id)
start_page_id=self.start_page_id, end_page_id=self.end_page_id,
lang=self.lang)
def pipe_parse(self):
if self.pdf_type == self.PIP_TXT:
self.pdf_mid_data = parse_union_pdf(self.pdf_bytes, self.model_list, self.image_writer,
is_debug=self.is_debug, input_model_is_empty=self.input_model_is_empty,
start_page_id=self.start_page_id, end_page_id=self.end_page_id)
start_page_id=self.start_page_id, end_page_id=self.end_page_id,
lang=self.lang)
elif self.pdf_type == self.PIP_OCR:
self.pdf_mid_data = parse_ocr_pdf(self.pdf_bytes, self.model_list, self.image_writer,
is_debug=self.is_debug,
start_page_id=self.start_page_id, end_page_id=self.end_page_id)
start_page_id=self.start_page_id, end_page_id=self.end_page_id,
lang=self.lang)
def pipe_mk_uni_format(self, img_parent_path: str, drop_mode=DropMode.WHOLE_PDF):
def pipe_mk_uni_format(self, img_parent_path: str, drop_mode=DropMode.NONE_WITH_REASON):
result = super().pipe_mk_uni_format(img_parent_path, drop_mode)
logger.info("uni_pipe mk content list finished")
return result
......
......@@ -2,13 +2,13 @@ model:
arch: unimernet
model_type: unimernet
model_config:
model_name: ./models
max_seq_len: 1024
length_aware: False
model_name: ./models/unimernet_base
max_seq_len: 1536
load_pretrained: True
pretrained: ./models/pytorch_model.bin
pretrained: './models/unimernet_base/pytorch_model.pth'
tokenizer_config:
path: ./models
path: ./models/unimernet_base
datasets:
formula_rec_eval:
......
......@@ -10,6 +10,6 @@ config:
weights:
layout: Layout/model_final.pth
mfd: MFD/weights.pt
mfr: MFR/UniMERNet
mfr: MFR/unimernet_base
struct_eqtable: TabRec/StructEqTable
TableMaster: TabRec/TableMaster
\ No newline at end of file
......@@ -44,6 +44,18 @@ auto: automatically choose the best method for parsing pdf from ocr and txt.
without method specified, auto will be used by default.""",
default='auto',
)
@click.option(
'-l',
'--lang',
'lang',
type=str,
help="""
Input the languages in the pdf (if known) to improve OCR accuracy. Optional.
Use the language "Abbreviation" listed at this URL:
https://paddlepaddle.github.io/PaddleOCR/en/ppocr/blog/multi_languages.html#5-support-languages-and-abbreviations
""",
default=None,
)
@click.option(
'-d',
'--debug',
......@@ -68,7 +80,7 @@ without method specified, auto will be used by default.""",
help='The ending page for PDF parsing, beginning from 0.',
default=None,
)
def cli(path, output_dir, method, debug_able, start_page_id, end_page_id):
def cli(path, output_dir, method, lang, debug_able, start_page_id, end_page_id):
model_config.__use_inside_model__ = True
model_config.__model_mode__ = 'full'
os.makedirs(output_dir, exist_ok=True)
......@@ -90,6 +102,7 @@ def cli(path, output_dir, method, debug_able, start_page_id, end_page_id):
debug_able,
start_page_id=start_page_id,
end_page_id=end_page_id,
lang=lang
)
except Exception as e:
......
......@@ -44,9 +44,10 @@ def do_parse(
f_draw_model_bbox=False,
start_page_id=0,
end_page_id=None,
lang=None,
):
if debug_able:
logger.warning("debug mode is on")
logger.warning('debug mode is on')
f_dump_content_list = True
f_draw_model_bbox = True
......@@ -61,13 +62,13 @@ def do_parse(
if parse_method == 'auto':
jso_useful_key = {'_pdf_type': '', 'model_list': model_list}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer, is_debug=True,
start_page_id=start_page_id, end_page_id=end_page_id)
start_page_id=start_page_id, end_page_id=end_page_id, lang=lang)
elif parse_method == 'txt':
pipe = TXTPipe(pdf_bytes, model_list, image_writer, is_debug=True,
start_page_id=start_page_id, end_page_id=end_page_id)
start_page_id=start_page_id, end_page_id=end_page_id, lang=lang)
elif parse_method == 'ocr':
pipe = OCRPipe(pdf_bytes, model_list, image_writer, is_debug=True,
start_page_id=start_page_id, end_page_id=end_page_id)
start_page_id=start_page_id, end_page_id=end_page_id, lang=lang)
else:
logger.error('unknown parse method')
exit(1)
......
......@@ -26,7 +26,7 @@ PARSE_TYPE_OCR = "ocr"
def parse_txt_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWriter, is_debug=False,
start_page_id=0, end_page_id=None,
start_page_id=0, end_page_id=None, lang=None,
*args, **kwargs):
"""
Parse a text-based PDF.
......@@ -44,11 +44,14 @@ def parse_txt_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWrit
pdf_info_dict["_version_name"] = __version__
if lang is not None:
pdf_info_dict["_lang"] = lang
return pdf_info_dict
def parse_ocr_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWriter, is_debug=False,
start_page_id=0, end_page_id=None,
start_page_id=0, end_page_id=None, lang=None,
*args, **kwargs):
"""
Parse an OCR-type PDF.
......@@ -66,12 +69,15 @@ def parse_ocr_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWrit
pdf_info_dict["_version_name"] = __version__
if lang is not None:
pdf_info_dict["_lang"] = lang
return pdf_info_dict
def parse_union_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWriter, is_debug=False,
input_model_is_empty: bool = False,
start_page_id=0, end_page_id=None,
start_page_id=0, end_page_id=None, lang=None,
*args, **kwargs):
"""
Parse a PDF that mixes OCR and text content, extracting everything.
......@@ -95,9 +101,11 @@ def parse_union_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWr
if pdf_info_dict is None or pdf_info_dict.get("_need_drop", False):
logger.warning(f"parse_pdf_by_txt drop or error, switch to parse_pdf_by_ocr")
if input_model_is_empty:
pdf_models = doc_analyze(pdf_bytes, ocr=True,
pdf_models = doc_analyze(pdf_bytes,
ocr=True,
start_page_id=start_page_id,
end_page_id=end_page_id)
end_page_id=end_page_id,
lang=lang)
pdf_info_dict = parse_pdf(parse_pdf_by_ocr)
if pdf_info_dict is None:
raise Exception("Both parse_pdf_by_txt and parse_pdf_by_ocr failed.")
......@@ -108,4 +116,7 @@ def parse_union_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWr
pdf_info_dict["_version_name"] = __version__
if lang is not None:
pdf_info_dict["_lang"] = lang
return pdf_info_dict
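The three `parse_*_pdf` functions above follow one convention: `lang` defaults to `None`, and the `_lang` key is written into the result only when a language was actually supplied. A minimal sketch of that convention in isolation; `attach_lang` is a hypothetical helper used for illustration, not a function in `magic_pdf`, and the version string is a placeholder:

```python
from typing import Optional


def attach_lang(pdf_info_dict: dict, lang: Optional[str] = None) -> dict:
    """Record the OCR language hint only when the caller supplied one."""
    if lang is not None:
        pdf_info_dict["_lang"] = lang
    return pdf_info_dict


# Placeholder result dicts; a real pdf_info_dict carries the parsed pages too.
without_hint = attach_lang({"_version_name": "0.8.0"})
with_hint = attach_lang({"_version_name": "0.8.0"}, lang="en")
print(without_hint)
print(with_hint)
```

Leaving the key out when no hint is given keeps the output identical to earlier versions for callers that never pass `lang`.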
......@@ -3,4 +3,6 @@
## Project List
- [llama_index_rag](./llama_index_rag/README.md): Build a lightweight RAG system based on llama_index
- [gradio_app](./gradio_app/README.md): Build a web app based on gradio
......@@ -3,3 +3,5 @@
## Project List
- [llama_index_rag](./llama_index_rag/README_zh-CN.md): Build a lightweight RAG system based on llama_index
- [gradio_app](./gradio_app/README_zh-CN.md): Build a web app based on Gradio
## Installation
MinerU(>=0.8.0)
> If you already have a functioning MinerU environment, you can skip this step.
>
[Deploy in CPU environment](https://github.com/opendatalab/MinerU?tab=readme-ov-file#quick-cpu-demo)
[Deploy in GPU environment](https://github.com/opendatalab/MinerU?tab=readme-ov-file#using-gpu)
Third-party Software
```bash
pip install gradio gradio-pdf
```
## Start Gradio App
```bash
python app.py
```
## Use Gradio App
Access http://127.0.0.1:7860 in your web browser
\ No newline at end of file
## Installation
MinerU(>=0.8.0)
> If you already have a working MinerU environment, you can skip this step.
>
[Deploy in a CPU environment](https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md#%E4%BD%BF%E7%94%A8cpu%E5%BF%AB%E9%80%9F%E4%BD%93%E9%AA%8C)
[Deploy in a GPU environment](https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md#%E4%BD%BF%E7%94%A8gpu)
Third-party software
```bash
pip install gradio gradio-pdf
```
## Start the Gradio app
```bash
python app.py
```
## Use the Gradio app
Open http://127.0.0.1:7860 in your browser
\ No newline at end of file
......@@ -14,8 +14,6 @@ from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.tools.common import do_parse, prepare_env
os.system("pip install gradio")
os.system("pip install gradio-pdf")
import gradio as gr
from gradio_pdf import PDF
......@@ -25,12 +23,15 @@ def read_fn(path):
return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)
def parse_pdf(doc_path, output_dir, end_page_id):
def parse_pdf(doc_path, output_dir, end_page_id, is_ocr):
os.makedirs(output_dir, exist_ok=True)
try:
file_name = f"{str(Path(doc_path).stem)}_{time.time()}"
pdf_data = read_fn(doc_path)
if is_ocr:
parse_method = "ocr"
else:
parse_method = "auto"
local_image_dir, local_md_dir = prepare_env(output_dir, file_name, parse_method)
do_parse(
......@@ -92,9 +93,9 @@ def replace_image_with_base64(markdown_text, image_dir_path):
return re.sub(pattern, replace, markdown_text)
def to_markdown(file_path, end_pages):
def to_markdown(file_path, end_pages, is_ocr):
# Get the paths of the recognized Markdown file and the zip archive
local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1)
local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1, is_ocr)
archive_zip_path = os.path.join("./output", compute_sha256(local_md_dir) + ".zip")
zip_archive_success = compress_directory_to_zip(local_md_dir, archive_zip_path)
if zip_archive_success == 0:
......@@ -111,14 +112,6 @@ def to_markdown(file_path, end_pages):
return md_content, txt_content, archive_zip_path, new_pdf_path
# def show_pdf(file_path):
# with open(file_path, "rb") as f:
# base64_pdf = base64.b64encode(f.read()).decode('utf-8')
# pdf_display = f'<embed src="data:application/pdf;base64,{base64_pdf}" ' \
# f'width="100%" height="1000" type="application/pdf">'
# return pdf_display
latex_delimiters = [{"left": "$$", "right": "$$", "display": True},
{"left": '$', "right": '$', "display": False}]
......@@ -141,16 +134,29 @@ model_init = init_model()
logger.info(f"model_init: {model_init}")
with open("header.html", "r") as file:
header = file.read()
if __name__ == "__main__":
with gr.Blocks() as demo:
gr.HTML(header)
with gr.Row():
with gr.Column(variant='panel', scale=5):
pdf_show = gr.Markdown()
max_pages = gr.Slider(1, 10, 5, step=1, label="Max convert pages")
with gr.Row() as bu_flow:
is_ocr = gr.Checkbox(label="Force enable OCR")
change_bu = gr.Button("Convert")
clear_bu = gr.ClearButton([pdf_show], value="Clear")
pdf_show = PDF(label="Please upload pdf", interactive=True, height=800)
with gr.Accordion("Examples:"):
example_root = os.path.join(os.path.dirname(__file__), "examples")
gr.Examples(
examples=[os.path.join(example_root, _) for _ in os.listdir(example_root) if
_.endswith("pdf")],
inputs=pdf_show,
)
with gr.Column(variant='panel', scale=5):
output_file = gr.File(label="convert result", interactive=False)
......@@ -160,8 +166,7 @@ if __name__ == "__main__":
latex_delimiters=latex_delimiters, line_breaks=True)
with gr.Tab("Markdown text"):
md_text = gr.TextArea(lines=45, show_copy_button=True)
change_bu.click(fn=to_markdown, inputs=[pdf_show, max_pages], outputs=[md, md_text, output_file, pdf_show])
clear_bu.add([md, pdf_show, md_text, output_file])
change_bu.click(fn=to_markdown, inputs=[pdf_show, max_pages, is_ocr], outputs=[md, md_text, output_file, pdf_show])
clear_bu.add([md, pdf_show, md_text, output_file, is_ocr])
demo.launch()
\ No newline at end of file
<html><head>
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.15.4/css/all.css">
<style>
.link-block {
border: 1px solid transparent;
border-radius: 24px;
background-color: rgba(54, 54, 54, 1);
cursor: pointer !important;
}
.link-block:hover {
background-color: rgba(54, 54, 54, 0.75) !important;
cursor: pointer !important;
}
.external-link {
display: inline-flex;
align-items: center;
height: 36px;
line-height: 36px;
padding: 0 16px;
cursor: pointer !important;
}
.external-link,
.external-link:hover {
cursor: pointer !important;
}
a {
text-decoration: none;
}
</style></head>
<body>
<div style="
display: flex;
flex-direction: column;
justify-content: center;
align-items: center;
text-align: center;
background: linear-gradient(45deg, #007bff 0%, #0056b3 100%);
padding: 24px;
gap: 24px;
border-radius: 8px;
">
<div style="
display: flex;
flex-direction: column;
align-items: center;
gap: 16px;
">
<div style="display: flex; flex-direction: column; gap: 8px">
<h1 style="
font-size: 48px;
color: #fafafa;
margin: 0;
font-family: 'Trebuchet MS', 'Lucida Sans Unicode',
'Lucida Grande', 'Lucida Sans', Arial, sans-serif;
">
MinerU: PDF Extraction Demo
</h1>
</div>
</div>
<p style="
margin: 0;
line-height: 1.6rem;
font-size: 16px;
color: #fafafa;
opacity: 0.8;
">
A one-stop, open-source, high-quality data extraction tool that supports
PDF/webpage/e-book extraction.<br>
</p>
<style>
.link-block {
display: inline-block;
}
.link-block + .link-block {
margin-left: 20px;
}
</style>
<div class="column has-text-centered">
<div class="publication-links">
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/opendatalab/MinerU" class="external-link button is-normal is-rounded is-dark" style="text-decoration: none; cursor: pointer">
<span class="icon" style="margin-right: 4px">
<i class="fab fa-github" style="color: white; margin-right: 4px"></i>
</span>
<span style="color: white">Code</span>
</a>
</span>
<!-- Homepage Link. -->
<span class="link-block">
<a href="https://opendatalab.com/" class="external-link button is-normal is-rounded is-dark" style="text-decoration: none; cursor: pointer">
<span class="icon" style="margin-right: 8px">
<i class="fas fa-globe" style="color: white"></i>
</span>
<span style="color: white">Homepage</span>
</a>
</span>
</div>
</div>
<!-- New Demo Links -->
</div>
</body></html>
\ No newline at end of file
magic-pdf[full]>=0.8.0
gradio
gradio-pdf
\ No newline at end of file
## Installation
<details open="open">
<summary><h2 style="display: inline-block">Table of Contents</h2></summary>
<ol>
<li><a href="#介绍">Introduction</a></li>
<li><a href="#安装">Installation</a></li>
<li><a href="#示例">Example</a></li>
<li><a href="#开发">Development</a></li>
</ol>
</details>
MinerU
## Introduction
```bash
git clone https://github.com/opendatalab/MinerU.git
cd MinerU
`MinerU` provides a data `API` that lets users import data into a `RAG` system. This project uses `Tongyi Qianwen` to show how to build a lightweight `RAG` system.
<p align="center">
<img src="rag_data_api.png" width="300px" style="vertical-align:middle;">
</p>
## Installation
conda create -n MinerU python=3.10
conda activate MinerU
pip install .[full] --extra-index-url https://wheels.myhloli.com
Environment requirements
```text
NVIDIA A100 80GB,
Centos 7 3.10.0-957.el7.x86_64
Client: Docker Engine - Community
Version: 24.0.5
API version: 1.43
Go version: go1.20.6
Git commit: ced0996
Built: Fri Jul 21 20:39:02 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 24.0.5
API version: 1.43 (minimum version 1.12)
Go version: go1.20.6
Git commit: a61e2b4
Built: Fri Jul 21 20:38:05 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.25
GitCommit: d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f
runc:
Version: 1.1.10
GitCommit: v1.1.10-0-g18a0cb0
docker-init:
Version: 0.19.0
GitCommit: de40ad0
```
Please refer to the [documentation](../../README_zh-CN.md) to install MinerU
Third-party software
```bash
# install
pip install modelscope==1.14.0
pip install llama-index-vector-stores-elasticsearch==0.2.0
pip install llama-index-embeddings-dashscope==0.2.0
pip install llama-index-core==0.10.68
......@@ -26,39 +70,12 @@ pip install accelerate==0.33.0
pip uninstall transformer-engine
```
## Environment Configuration
```
export DASHSCOPE_API_KEY={some_key}
export ES_USER={some_es_user}
export ES_PASSWORD={some_es_password}
export ES_URL=http://{es_url}:9200
```
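A script that depends on these four variables can fail early with a clear message instead of partway through a run. A minimal sketch, assuming a hypothetical `load_settings` helper (it is not part of this project) and using the sample values from this README:

```python
import os

REQUIRED_VARS = ("DASHSCOPE_API_KEY", "ES_USER", "ES_PASSWORD", "ES_URL")


def load_settings(environ=os.environ) -> dict:
    """Return the required settings, or raise naming every missing variable."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


# Demonstration with an explicit mapping instead of the real environment.
sample = {
    "DASHSCOPE_API_KEY": "{some_key}",  # placeholder, not a real key
    "ES_USER": "elastic",
    "ES_PASSWORD": "llama_index",
    "ES_URL": "http://127.0.0.1:9200",
}
print(load_settings(sample)["ES_URL"])
```

Calling `load_settings()` with no argument reads the real process environment.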
To obtain a DASHSCOPE_API_KEY, see the [documentation](https://help.aliyun.com/zh/dashscope/opening-service)
## Usage
### Import Data
```bash
python data_ingestion.py -p some.pdf # load data from a single PDF
# or
python data_ingestion.py -p /opt/data/some_pdf_directory/ # load data from every PDF under {some_pdf_directory}
```
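The `-p` flag accepts either one PDF or a directory of PDFs. A minimal sketch of how such an argument could be resolved to a list of files; `collect_pdfs` and `main` are hypothetical and do not reproduce the actual `data_ingestion.py`, which may, for example, recurse into subdirectories:

```python
import argparse
from pathlib import Path


def collect_pdfs(path: str) -> list:
    """Return the PDF at `path`, or every PDF directly inside it if it is a directory."""
    target = Path(path)
    if target.is_dir():
        return sorted(p for p in target.iterdir() if p.suffix.lower() == ".pdf")
    if target.is_file() and target.suffix.lower() == ".pdf":
        return [target]
    return []  # missing path or not a PDF


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="list the PDFs that would be ingested")
    parser.add_argument("-p", "--path", required=True, help="a PDF file or a directory of PDFs")
    args = parser.parse_args(argv)
    for pdf in collect_pdfs(args.path):
        print(pdf)


# List the PDFs in the current directory (prints nothing if there are none).
main(["-p", "."])
```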
### Query
```bash
python query.py --question '{the_question_you_want_to_ask}'
```
## Example
````bash
# Start the Elasticsearch service
cd projects/llama_index_rag
docker compose up -d
or
......@@ -67,17 +84,41 @@ docker-compose up -d
# Configure environment variables
export ES_USER=elastic
export ES_PASSWORD=llama_index
export ES_URL=http://127.0.0.1:9200
export DASHSCOPE_API_KEY={some_key}
To obtain a DASHSCOPE_API_KEY, see the [documentation](https://help.aliyun.com/zh/dashscope/opening-service)
# Query before importing any data. Returns Tongyi Qianwen's default answer
python query.py -q 'how about the rights of men'
## outputs
question: how about the rights of men
answer: The topic of men's rights often refers to discussions around legal, social, and political issues that affect men specifically or differently from women. Movements related to men's rights advocate for addressing areas where men face discrimination or unique challenges, such as:
Child Custody: Ensuring that men have equal opportunities for custody of their children following divorce or separation.
Domestic Violence: Recognizing that men can also be victims of domestic abuse and ensuring they have access to support services.
Mental Health and Suicide Rates: Addressing the higher rates of suicide among men and providing mental health resources.
Military Conscription: In some countries, only men are required to register for military service, which is seen as a gender-based obligation.
Workplace Safety: Historically, more men than women have been employed in high-risk occupations, leading to higher workplace injury and death rates.
Parental Leave: Advocating for paternity leave policies that allow men to take time off work for family care.
Men's rights activism often intersects with broader discussions on gender equality and aims to promote fairness and equity across genders. It's important to note that while advocating for these issues, it should be done in a way that does not detract from or oppose the goals of gender equality and the rights of other groups. The focus should be on creating a fair society where everyone has equal opportunities and protections under the law.
# Import data
python data_ingestion.py example/data/declaration_of_the_rights_of_man_1789.pdf
python data_ingestion.py -p example/data/
or
python data_ingestion.py -p example/data/declaration_of_the_rights_of_man_1789.pdf
# Query after importing data. The Tongyi Qianwen model answers using the RAG system's retrieval results together with the context.
# Query
python query.py -q 'how about the rights of men'
## outputs
......
......@@ -8,7 +8,7 @@ fast-langdetect==0.2.0
wordninja>=2.0.0
scikit-learn>=1.0.2
pdfminer.six==20231228
unimernet==0.1.6
unimernet==0.2.1
matplotlib
ultralytics
paddleocr==2.7.3
......
......@@ -36,7 +36,7 @@ if __name__ == '__main__':
"paddlepaddle==3.0.0b1;platform_system=='Linux'",
"paddlepaddle==2.6.1;platform_system=='Windows' or platform_system=='Darwin'",
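],
"full": ["unimernet==0.1.6", # version 0.1.6 greatly trims the dependency set; this version is recommended
"full": ["unimernet==0.2.1", # upgrade unimernet to 0.2.1
"matplotlib<=3.9.0;platform_system=='Windows'", # 3.9.1 and later ship no prebuilt Windows packages; pin to avoid install failures on Windows machines without a build environment
"matplotlib;platform_system=='Linux' or platform_system=='Darwin'", # do not cap matplotlib on Linux and macOS, to avoid bugs caused by being unable to upgrade
"ultralytics", # YOLOv8, formula detection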
......
"""
clean coverage
"""
import os
import shutil
def delete_file(path):
    """Delete a file or a directory tree if it exists."""
    if os.path.isfile(path):
        try:
            os.remove(path)
            print(f"File '{path}' deleted.")
        except OSError as e:
            print(f"Error deleting file '{path}': {e}")
    elif os.path.isdir(path):
        try:
            shutil.rmtree(path)
            print(f"Directory '{path}' and its contents deleted.")
        except OSError as e:
            print(f"Error deleting directory '{path}': {e}")
if __name__ == "__main__":
delete_file("htmlcov")
\ No newline at end of file
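The cleanup step is easy to exercise in isolation. A minimal sketch, assuming a hypothetical `remove_path` helper (not the script's `delete_file`) that removes a file or directory tree and reports whether anything was deleted:

```python
import os
import shutil
import tempfile


def remove_path(path: str) -> bool:
    """Delete a file or directory tree; return True if something was removed."""
    try:
        if os.path.isfile(path):
            os.remove(path)
        elif os.path.isdir(path):
            shutil.rmtree(path)
        else:
            return False  # nothing exists at this path
    except OSError as err:
        print(f"Error deleting '{path}': {err}")
        return False
    return True


# Build a throwaway directory that mimics coverage's HTML report folder.
workdir = tempfile.mkdtemp()
report_dir = os.path.join(workdir, "htmlcov")
os.makedirs(report_dir)
with open(os.path.join(report_dir, "index.html"), "w") as fh:
    fh.write("<html></html>")

print(remove_path(report_dir))  # first call removes the directory
print(remove_path(report_dir))  # second call finds nothing to remove
shutil.rmtree(workdir)
```

Catching `OSError` covers the failures `os.remove` and `shutil.rmtree` raise for permission or filesystem problems.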
......@@ -2,7 +2,7 @@
get cov
"""
from bs4 import BeautifulSoup
import shutil
def get_covrage():
    """Get coverage."""
    # Send a request to fetch the page content
......
......@@ -8,7 +8,8 @@ while true; do
# prepare env
source activate MinerU
pip install -r requirements-qa.txt
pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
pip uninstall magic-pdf
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
exit_code=$?
if [ $exit_code -eq 0 ]; then
......
......@@ -2,6 +2,7 @@ import os
conf = {
"code_path": os.environ.get('GITHUB_WORKSPACE'),
"pdf_dev_path" : os.environ.get('GITHUB_WORKSPACE') + "/tests/test_cli/pdf_dev",
"pdf_res_path": "/tmp/magic-pdf"
"pdf_res_path": "/tmp/magic-pdf",
"jsonl_path": "s3://llm-qatest-pnorm/mineru/test/line1.jsonl",
"s3_pdf_path": "s3://llm-qatest-pnorm/mineru/test/test.pdf"
}
\ No newline at end of file
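The JSON fixture in this diff stores each `layout_dets` region as `poly`, eight numbers that in this data read as four (x, y) corners starting at the top-left and going clockwise. A minimal sketch of collapsing that to an axis-aligned box; `poly_to_bbox` is a hypothetical helper, and the corner order is inferred from the fixture values rather than from any documentation:

```python
def poly_to_bbox(poly):
    """Collapse an 8-number polygon [x0, y0, ..., x3, y3] to (x_min, y_min, x_max, y_max)."""
    if len(poly) != 8:
        raise ValueError("expected 8 coordinates, got %d" % len(poly))
    xs, ys = poly[0::2], poly[1::2]
    return min(xs), min(ys), max(xs), max(ys)


# A detection copied from the fixture (an entry that carries a "latex" field).
det = {
    "category_id": 13,
    "poly": [590, 747, 688, 747, 688, 778, 590, 778],
    "score": 0.91,
}
x0, y0, x1, y1 = poly_to_bbox(det["poly"])
print((x0, y0, x1, y1), "width:", x1 - x0, "height:", y1 - y0)
```

Taking the minimum and maximum, rather than trusting corner positions, keeps the helper correct even if a polygon lists its corners in a different order.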
This source diff could not be displayed because it is too large. You can view the blob instead.
[
{
"layout_dets": [
{
"category_id": 1,
"poly": [
578.2055053710938,
672.8831787109375,
1579.973388671875,
672.8831787109375,
1579.973388671875,
1034.681640625,
578.2055053710938,
1034.681640625
],
"score": 0.9999963045120239
},
{
"category_id": 1,
"poly": [
583.6041259765625,
1067.1112060546875,
1579.822265625,
1067.1112060546875,
1579.822265625,
1537.1324462890625,
583.6041259765625,
1537.1324462890625
],
"score": 0.9999961853027344
},
{
"category_id": 1,
"poly": [
585.4341430664062,
1568.220703125,
1578.5487060546875,
1568.220703125,
1578.5487060546875,
1931.516845703125,
585.4341430664062,
1931.516845703125
],
"score": 0.9999949336051941
},
{
"category_id": 1,
"poly": [
578.491455078125,
532.0020141601562,
1577.96337890625,
532.0020141601562,
1577.96337890625,
641.0128784179688,
578.491455078125,
641.0128784179688
],
"score": 0.999992847442627
},
{
"category_id": 1,
"poly": [
66.43791961669922,
1776.6951904296875,
530.4810180664062,
1776.6951904296875,
530.4810180664062,
1883.127685546875,
66.43791961669922,
1883.127685546875
],
"score": 0.9999925494194031
},
{
"category_id": 3,
"poly": [
70.23656463623047,
818.9393920898438,
517.8253784179688,
818.9393920898438,
517.8253784179688,
1076.5823974609375,
70.23656463623047,
1076.5823974609375
],
"score": 0.9999912977218628
},
{
"category_id": 1,
"poly": [
64.99957275390625,
651.9596557617188,
436.5134582519531,
651.9596557617188,
436.5134582519531,
723.5758056640625,
64.99957275390625,
723.5758056640625
],
"score": 0.9999804496765137
},
{
"category_id": 0,
"poly": [
556.2775268554688,
270.2123107910156,
1577.8211669921875,
270.2123107910156,
1577.8211669921875,
408.9685974121094,
556.2775268554688,
408.9685974121094
],
"score": 0.9999696016311646
},
{
"category_id": 1,
"poly": [
67.8562240600586,
1342.2239990234375,
530.5654296875,
1342.2239990234375,
530.5654296875,
1447.843017578125,
67.8562240600586,
1447.843017578125
],
"score": 0.9999648928642273
},
{
"category_id": 1,
"poly": [
65.74958801269531,
1631.3671875,
530.32861328125,
1631.3671875,
530.32861328125,
1772.413818359375,
65.74958801269531,
1772.413818359375
],
"score": 0.9999628067016602
},
{
"category_id": 1,
"poly": [
588.5570068359375,
2068.54931640625,
1525.3253173828125,
2068.54931640625,
1525.3253173828125,
2103.89013671875,
588.5570068359375,
2103.89013671875
],
"score": 0.9999607801437378
},
{
"category_id": 1,
"poly": [
586.5548706054688,
1963.105712890625,
1556.578125,
1963.105712890625,
1556.578125,
2034.8116455078125,
586.5548706054688,
2034.8116455078125
],
"score": 0.9999469518661499
},
{
"category_id": 5,
"poly": [
59.96487045288086,
1110.6282958984375,
529.9209594726562,
1110.6282958984375,
529.9209594726562,
1225.2921142578125,
59.96487045288086,
1225.2921142578125
],
"score": 0.999945878982544
},
{
"category_id": 2,
"poly": [
70.25292205810547,
103.42201232910156,
420.4892578125,
103.42201232910156,
420.4892578125,
223.39370727539062,
70.25292205810547,
223.39370727539062
],
"score": 0.9999405145645142
},
{
"category_id": 2,
"poly": [
1081.0203857421875,
2244.87890625,
1554.669189453125,
2244.87890625,
1554.669189453125,
2275.28662109375,
1081.0203857421875,
2275.28662109375
],
"score": 0.9999217987060547
},
{
"category_id": 1,
"poly": [
68.85404968261719,
345.9093017578125,
307.9080810546875,
345.9093017578125,
307.9080810546875,
409.0098876953125,
68.85404968261719,
409.0098876953125
],
"score": 0.9999183416366577
},
{
"category_id": 0,
"poly": [
65.58759307861328,
1295.9366455078125,
180.4149932861328,
1295.9366455078125,
180.4149932861328,
1328.867919921875,
65.58759307861328,
1328.867919921875
],
"score": 0.9998926520347595
},
{
"category_id": 2,
"poly": [
1245.0789794921875,
108.83513641357422,
1576.3131103515625,
108.83513641357422,
1576.3131103515625,
219.29042053222656,
1245.0789794921875,
219.29042053222656
],
"score": 0.9995975494384766
},
{
"category_id": 1,
"poly": [
65.75041961669922,
483.5210266113281,
428.6028137207031,
483.5210266113281,
428.6028137207031,
586.8894653320312,
65.75041961669922,
586.8894653320312
],
"score": 0.9993270635604858
},
{
"category_id": 0,
"poly": [
65.02926635742188,
445.02288818359375,
208.3317108154297,
445.02288818359375,
208.3317108154297,
476.65252685546875,
65.02926635742188,
476.65252685546875
],
"score": 0.9992279410362244
},
{
"category_id": 0,
"poly": [
556.96630859375,
453.08447265625,
673.0485229492188,
453.08447265625,
673.0485229492188,
490.60455322265625,
556.96630859375,
490.60455322265625
],
"score": 0.9949817657470703
},
{
"category_id": 1,
"poly": [
66.26518249511719,
1524.234130859375,
530.2540283203125,
1524.234130859375,
530.2540283203125,
1627.5291748046875,
66.26518249511719,
1627.5291748046875
],
"score": 0.9919581413269043
},
{
"category_id": 7,
"poly": [
62.5564079284668,
1227.41943359375,
380.10693359375,
1227.41943359375,
380.10693359375,
1252.8614501953125,
62.5564079284668,
1252.8614501953125
],
"score": 0.9918426275253296
},
{
"category_id": 1,
"poly": [
66.80464935302734,
1451.4775390625,
527.3795166015625,
1451.4775390625,
527.3795166015625,
1519.5836181640625,
66.80464935302734,
1519.5836181640625
],
"score": 0.9883899688720703
},
{
"category_id": 0,
"poly": [
65.36080932617188,
605.3754272460938,
181.24375915527344,
605.3754272460938,
181.24375915527344,
637.0076904296875,
65.36080932617188,
637.0076904296875
],
"score": 0.9870840311050415
},
{
"category_id": 0,
"poly": [
178.82904052734375,
264.6627197265625,
396.52825927734375,
264.6627197265625,
396.52825927734375,
315.41900634765625,
178.82904052734375,
315.41900634765625
],
"score": 0.9779323935508728
},
{
"category_id": 4,
"poly": [
66.15127563476562,
767.24658203125,
181.25694274902344,
767.24658203125,
181.25694274902344,
799.7832641601562,
66.15127563476562,
799.7832641601562
],
"score": 0.8932801485061646
},
{
"category_id": 13,
"poly": [
590,
747,
688,
747,
688,
778,
590,
778
],
"score": 0.91,
"latex": "+24.4\\%"
},
{
"category_id": 13,
"poly": [
1433,
855,
1492,
855,
1492,
886,
1433,
886
],
"score": 0.86,
"latex": "30\\%"
},
{
"category_id": 13,
"poly": [
238,
689,
264,
689,
264,
717,
238,
717
],
"score": 0.34,
"latex": "@"
},
{
"category_id": 13,
"poly": [
702,
1002,
722,
1002,
722,
1026,
702,
1026
],
"score": 0.33,
"latex": "^+"
},
{
"category_id": 13,
"poly": [
177,
1154,
223,
1154,
223,
1185,
177,
1185
],
"score": 0.28,
"latex": "(\\%)"
}
],
"page_info": {
"page_no": 0,
"height": 2339,
"width": 1654
}
},
{
"layout_dets": [
{
"category_id": 2,
"poly": [
88.00849151611328,
31.891826629638672,
300.7432861328125,
31.891826629638672,
300.7432861328125,
113.5999755859375,
88.00849151611328,
113.5999755859375
],
"score": 0.9999986886978149
},
{
"category_id": 2,
"poly": [
771.0192260742188,
2213.479248046875,
827.4273681640625,
2213.479248046875,
827.4273681640625,
2239.40185546875,
771.0192260742188,
2239.40185546875
],
"score": 0.9999963641166687
},
{
"category_id": 7,
"poly": [
544.2962646484375,
488.5493469238281,
988.3958129882812,
488.5493469238281,
988.3958129882812,
541.0634155273438,
544.2962646484375,
541.0634155273438
],
"score": 0.9999918341636658
},
{
"category_id": 2,
"poly": [
1082.88232421875,
82.37471771240234,
1519.4150390625,
82.37471771240234,
1519.4150390625,
114.9271011352539,
1082.88232421875,
114.9271011352539
],
"score": 0.9999632835388184
},
{
"category_id": 2,
"poly": [
1009.1597900390625,
2210.9462890625,
1535.9239501953125,
2210.9462890625,
1535.9239501953125,
2241.830322265625,
1009.1597900390625,
2241.830322265625
],
"score": 0.9999324679374695
},
{
"category_id": 5,
"poly": [
537.349365234375,
156.8784637451172,
1584.9866943359375,
156.8784637451172,
1584.9866943359375,
485.3042907714844,
537.349365234375,
485.3042907714844
],
"score": 0.9985955953598022
},
{
"category_id": 7,
"poly": [
62.69784927368164,
443.4034118652344,
249.9097137451172,
443.4034118652344,
249.9097137451172,
467.4612731933594,
62.69784927368164,
467.4612731933594
],
"score": 0.9873980283737183
},
{
"category_id": 5,
"poly": [
61.374210357666016,
138.51153564453125,
528.30517578125,
138.51153564453125,
528.30517578125,
443.5376281738281,
61.374210357666016,
443.5376281738281
],
"score": 0.9232220649719238
},
{
"category_id": 6,
"poly": [
548.1119384765625,
148.7312774658203,
797.3070678710938,
148.7312774658203,
797.3070678710938,
180.74609375,
548.1119384765625,
180.74609375
],
"score": 0.6074804663658142
},
{
"category_id": 13,
"poly": [
864,
455,
922,
455,
922,
482,
864,
482
],
"score": 0.74,
"latex": "6.0\\%"
},
{
"category_id": 13,
"poly": [
850,
418,
922,
418,
922,
445,
850,
445
],
"score": 0.64,
"latex": "35.3\\%"
},
{
"category_id": 13,
"poly": [
1501,
270,
1571,
270,
1571,
298,
1501,
298
],
"score": 0.54,
"latex": "13.8\\%"
},
{
"category_id": 13,
"poly": [
1013,
454,
1083,
454,
1083,
482,
1013,
482
],
"score": 0.52,
"latex": "15.0\\%"
},
{
"category_id": 13,
"poly": [
1012,
417,
1083,
417,
1083,
444,
1012,
444
],
"score": 0.52,
"latex": "33.7\\%"
},
{
"category_id": 13,
"poly": [
689,
456,
725,
456,
725,
482,
689,
482
],
"score": 0.48,
"latex": "(\\%)"
},
{
"category_id": 13,
"poly": [
850,
344,
922,
344,
922,
372,
850,
372
],
"score": 0.4,
"latex": "83.8\\%"
},
{
"category_id": 13,
"poly": [
863,
270,
922,
270,
922,
298,
863,
298
],
"score": 0.4,
"latex": "4.5\\%"
},
{
"category_id": 13,
"poly": [
1334,
270,
1406,
270,
1406,
298,
1334,
298
],
"score": 0.35,
"latex": "37.2\\%"
},
{
"category_id": 13,
"poly": [
618,
419,
656,
419,
656,
446,
618,
446
],
"score": 0.35,
"latex": "(\\%)"
}
],
"page_info": {
"page_no": 1,
"height": 2339,
"width": 1654
}
},
{
"layout_dets": [
{
"category_id": 2,
"poly": [
87.9037094116211,
31.59800148010254,
300.9930419921875,
31.59800148010254,
300.9930419921875,
113.4053955078125,
87.9037094116211,
113.4053955078125
],
"score": 0.9999939799308777
},
{
"category_id": 2,
"poly": [
1008.992919921875,
2209.248779296875,
1534.9334716796875,
2209.248779296875,
1534.9334716796875,
2242.77294921875,
1008.992919921875,
2242.77294921875
],
"score": 0.9999377131462097
},
{
"category_id": 2,
"poly": [
770.6600341796875,
2212.857666015625,
827.4126586914062,
2212.857666015625,
827.4126586914062,
2239.77197265625,
770.6600341796875,
2239.77197265625
],
"score": 0.9998395442962646
},
{
"category_id": 2,
"poly": [
1082.096923828125,
82.25012969970703,
1518.9267578125,
82.25012969970703,
1518.9267578125,
114.52576446533203,
1082.096923828125,
114.52576446533203
],
"score": 0.9996457099914551
},
{
"category_id": 7,
"poly": [
95.39900970458984,
1846.6380615234375,
564.4166870117188,
1846.6380615234375,
564.4166870117188,
1899.209716796875,
95.39900970458984,
1899.209716796875
],
"score": 0.9908766746520996
},
{
"category_id": 6,
"poly": [
95.4662094116211,
173.42832946777344,
470.21905517578125,
173.42832946777344,
470.21905517578125,
217.74632263183594,
95.4662094116211,
217.74632263183594
],
"score": 0.9437939524650574
},
{
"category_id": 5,
"poly": [
854.1142578125,
1043.93603515625,
1592.0174560546875,
1043.93603515625,
1592.0174560546875,
1846.16552734375,
854.1142578125,
1846.16552734375
],
"score": 0.8844046592712402
},
{
"category_id": 5,
"poly": [
92.02946472167969,
1331.8909912109375,
814.2915649414062,
1331.8909912109375,
814.2915649414062,
1842.6195068359375,
92.02946472167969,
1842.6195068359375
],
"score": 0.8743430972099304
},
{
"category_id": 5,
"poly": [
851.83984375,
224.99559020996094,
1592.4068603515625,
224.99559020996094,
1592.4068603515625,
1018.7105712890625,
851.83984375,
1018.7105712890625
],
"score": 0.8650150299072266
},
{
"category_id": 5,
"poly": [
91.79800415039062,
224.10838317871094,
816.58154296875,
224.10838317871094,
816.58154296875,
1248.422607421875,
91.79800415039062,
1248.422607421875
],
"score": 0.8604844808578491
},
{
"category_id": 5,
"poly": [
85.19661712646484,
220.71524047851562,
1602.3074951171875,
220.71524047851562,
1602.3074951171875,
1844.488525390625,
85.19661712646484,
1844.488525390625
],
"score": 0.6638449430465698
},
{
"category_id": 13,
"poly": [
737,
704,
804,
704,
804,
730,
737,
730
],
"score": 0.56,
"latex": "\\pmb{26.5\\%}"
},
{
"category_id": 13,
"poly": [
738,
673,
804,
673,
804,
699,
738,
699
],
"score": 0.48,
"latex": "\\pmb{16.2\\%}"
},
{
"category_id": 13,
"poly": [
736,
767,
805,
767,
805,
795,
736,
795
],
"score": 0.48,
"latex": "\\mathbf{\\lambda_{23.7\\%}}"
},
{
"category_id": 13,
"poly": [
231,
611,
268,
611,
268,
638,
231,
638
],
"score": 0.47,
"latex": "(\\%)"
},
{
"category_id": 13,
"poly": [
749,
736,
804,
736,
804,
763,
749,
763
],
"score": 0.41,
"latex": "\\pmb{9.2\\%}"
},
{
"category_id": 13,
"poly": [
737,
641,
804,
641,
804,
668,
737,
668
],
"score": 0.41,
"latex": "{\\bf38.0\\%}"
},
{
"category_id": 13,
"poly": [
748,
577,
805,
577,
805,
606,
748,
606
],
"score": 0.35,
"latex": "0.1\\%"
},
{
"category_id": 13,
"poly": [
187,
800,
222,
800,
222,
827,
187,
827
],
"score": 0.32,
"latex": "(\\%)"
},
{
"category_id": 13,
"poly": [
738,
830,
805,
830,
805,
857,
738,
857
],
"score": 0.28,
"latex": "\\mathbf{13.8\\%}"
},
{
"category_id": 13,
"poly": [
737,
862,
805,
862,
805,
889,
737,
889
],
"score": 0.27,
"latex": "\\mathbf{31.9\\%}"
},
{
"category_id": 13,
"poly": [
736,
955,
804,
955,
804,
983,
736,
983
],
"score": 0.26,
"latex": "\\pmb{65.3\\%}"
}
],
"page_info": {
"page_no": 2,
"height": 2339,
"width": 1654
}
},
{
"layout_dets": [
{
"category_id": 2,
"poly": [
86.3010025024414,
32.05937194824219,
303.65325927734375,
32.05937194824219,
303.65325927734375,
114.77494049072266,
86.3010025024414,
114.77494049072266
],
"score": 0.9999954700469971
},
{
"category_id": 1,
"poly": [
108.4952392578125,
590.2026977539062,
1536.75146484375,
590.2026977539062,
1536.75146484375,
688.4915771484375,
108.4952392578125,
688.4915771484375
],
"score": 0.9999932646751404
},
{
"category_id": 0,
"poly": [
95.94864654541016,
1205.4134521484375,
252.92477416992188,
1205.4134521484375,
252.92477416992188,
1246.0015869140625,
95.94864654541016,
1246.0015869140625
],
"score": 0.999992847442627
},
{
"category_id": 1,
"poly": [
106.48407745361328,
338.27471923828125,
1568.86328125,
338.27471923828125,
1568.86328125,
437.84783935546875,
106.48407745361328,
437.84783935546875
],
"score": 0.9999897480010986
},
{
"category_id": 2,
"poly": [
767.6918334960938,
2212.269287109375,
830.787353515625,
2212.269287109375,
830.787353515625,
2239.28515625,
767.6918334960938,
2239.28515625
],
"score": 0.9999850988388062
},
{
"category_id": 0,
"poly": [
96.18482208251953,
508.36334228515625,
291.4427490234375,
508.36334228515625,
291.4427490234375,
549.4661865234375,
96.18482208251953,
549.4661865234375
],
"score": 0.9999837875366211
},
{
"category_id": 2,
"poly": [
1082.2672119140625,
81.18732452392578,
1520.2149658203125,
81.18732452392578,
1520.2149658203125,
116.55751037597656,
1082.2672119140625,
116.55751037597656
],
"score": 0.9999496340751648
},
{
"category_id": 0,
"poly": [
96.45167541503906,
157.92835998535156,
319.21392822265625,
157.92835998535156,
319.21392822265625,
213.8436279296875,
96.45167541503906,
213.8436279296875
],
"score": 0.9999274015426636
},
{
"category_id": 0,
"poly": [
96.99238586425781,
257.6522216796875,
483.6472473144531,
257.6522216796875,
483.6472473144531,
301.53717041015625,
96.99238586425781,
301.53717041015625
],
"score": 0.9999104738235474
},
{
"category_id": 2,
"poly": [
1008.8760986328125,
2208.609375,
1536.0474853515625,
2208.609375,
1536.0474853515625,
2243.414306640625,
1008.8760986328125,
2243.414306640625
],
"score": 0.9998928308486938
},
{
"category_id": 1,
"poly": [
108.46533203125,
1288.0927734375,
1546.7518310546875,
1288.0927734375,
1546.7518310546875,
1383.8438720703125,
108.46533203125,
1383.8438720703125
],
"score": 0.9997898936271667
},
{
"category_id": 1,
"poly": [
107.81462860107422,
1678.24609375,
1227.880615234375,
1678.24609375,
1227.880615234375,
1711.37255859375,
107.81462860107422,
1711.37255859375
],
"score": 0.99957275390625
},
{
"category_id": 5,
"poly": [
109.75360107421875,
810.0169677734375,
1579.9549560546875,
810.0169677734375,
1579.9549560546875,
1171.6383056640625,
109.75360107421875,
1171.6383056640625
],
"score": 0.9994542598724365
},
{
"category_id": 1,
"poly": [
106.46218872070312,
1548.299072265625,
1540.3388671875,
1548.299072265625,
1540.3388671875,
1676.67919921875,
106.46218872070312,
1676.67919921875
],
"score": 0.9886452555656433
},
{
"category_id": 1,
"poly": [
107.52558898925781,
1386.4000244140625,
1540.886962890625,
1386.4000244140625,
1540.886962890625,
1447.8128662109375,
107.52558898925781,
1447.8128662109375
],
"score": 0.9709398150444031
},
{
"category_id": 1,
"poly": [
107.66414642333984,
1451.8369140625,
1537.99169921875,
1451.8369140625,
1537.99169921875,
1546.690185546875,
107.66414642333984,
1546.690185546875
],
"score": 0.9590120315551758
},
{
"category_id": 6,
"poly": [
95.90371704101562,
728.2855224609375,
328.1967468261719,
728.2855224609375,
328.1967468261719,
768.121826171875,
95.90371704101562,
768.121826171875
],
"score": 0.6999977827072144
},
{
"category_id": 1,
"poly": [
106.67481994628906,
1371.857421875,
1544.84814453125,
1371.857421875,
1544.84814453125,
1678.67236328125,
106.67481994628906,
1678.67236328125
],
"score": 0.5645973086357117
},
{
"category_id": 0,
"poly": [
95.94171142578125,
728.264404296875,
328.1947937011719,
728.264404296875,
328.1947937011719,
768.1663818359375,
95.94171142578125,
768.1663818359375
],
"score": 0.30702608823776245
},
{
"category_id": 13,
"poly": [
1247,
887,
1353,
887,
1353,
914,
1247,
914
],
"score": 0.91,
"latex": "5\\%{\\sim}20\\%"
},
{
"category_id": 13,
"poly": [
1181,
923,
1290,
923,
1290,
950,
1181,
950
],
"score": 0.9,
"latex": "-5\\%{+}5\\%"
},
{
"category_id": 13,
"poly": [
1416,
1047,
1469,
1047,
1469,
1077,
1416,
1077
],
"score": 0.87,
"latex": "10\\%"
},
{
"category_id": 13,
"poly": [
1254,
963,
1296,
963,
1296,
991,
1254,
991
],
"score": 0.86,
"latex": "5\\%"
},
{
"category_id": 13,
"poly": [
1373,
1003,
1428,
1003,
1428,
1032,
1373,
1032
],
"score": 0.86,
"latex": "10\\%"
},
{
"category_id": 13,
"poly": [
1332,
1047,
1388,
1047,
1388,
1076,
1332,
1076
],
"score": 0.86,
"latex": "\\cdot10\\%"
},
{
"category_id": 13,
"poly": [
1373,
1112,
1428,
1112,
1428,
1141,
1373,
1141
],
"score": 0.85,
"latex": "10\\%"
},
{
"category_id": 13,
"poly": [
1248,
854,
1302,
854,
1302,
880,
1248,
880
],
"score": 0.85,
"latex": "z0\\%"
}
],
"page_info": {
"page_no": 3,
"height": 2339,
"width": 1654
}
}
]
\ No newline at end of file
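The fixture above stores each detection as an 8-value `poly` (four corner points) with a `score`, and some regions are reported twice under different `category_id` values (for example, the near-identical boxes scored about 0.70 and 0.31 on page 3). The sketch below shows one way such entries might be read and de-duplicated when inspecting the fixture. The helper names `poly_to_bbox`, `iou`, and `drop_overlaps` and the 0.9 threshold are assumptions made for illustration; this is not the logic `magic_pdf` itself applies.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]


def poly_to_bbox(poly: List[float]) -> BBox:
    """Collapse [x0, y0, x1, y1, x2, y2, x3, y3] into (xmin, ymin, xmax, ymax)."""
    xs, ys = poly[0::2], poly[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drop_overlaps(dets: List[Dict], iou_threshold: float = 0.9) -> List[Dict]:
    """Keep only the highest-scoring detection among heavily overlapping boxes."""
    kept: List[Dict] = []
    for det in sorted(dets, key=lambda d: d["score"], reverse=True):
        box = poly_to_bbox(det["poly"])
        if all(iou(box, poly_to_bbox(k["poly"])) < iou_threshold for k in kept):
            kept.append(det)
    return kept


# Two entries copied from the page 3 `layout_dets` list in the fixture.
sample = [
    {"category_id": 6,
     "poly": [95.90371704101562, 728.2855224609375, 328.1967468261719, 728.2855224609375,
              328.1967468261719, 768.121826171875, 95.90371704101562, 768.121826171875],
     "score": 0.6999977827072144},
    {"category_id": 0,
     "poly": [95.94171142578125, 728.264404296875, 328.1947937011719, 728.264404296875,
              328.1947937011719, 768.1663818359375, 95.94171142578125, 768.1663818359375],
     "score": 0.30702608823776245},
]

if __name__ == "__main__":
    for det in drop_overlaps(sample):
        print(det["category_id"], poly_to_bbox(det["poly"]))
```

Sorting by score first means the survivor of each overlapping group is always the most confident detection, which keeps the filter deterministic.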
......@@ -9,7 +9,7 @@ from lib import common
import magic_pdf.model as model_config
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter
model_config.__use_inside_model__ = True
pdf_res_path = conf.conf['pdf_res_path']
code_path = conf.conf['code_path']
......@@ -178,6 +178,95 @@ class TestCli:
        common.cli_count_folders_and_check_contents(
            os.path.join(res_path, demo_name, 'ocr'))

    @pytest.mark.P1
    def test_pdf_dev_cli_local_jsonl_txt(self):
        """magic_pdf_dev cli local txt."""
        jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
        cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, "txt")
        logging.info(cmd)
        os.system(cmd)

    @pytest.mark.P1
    def test_pdf_dev_cli_local_jsonl_ocr(self):
        """magic_pdf_dev cli local ocr."""
        jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
        cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, 'ocr')
        logging.info(cmd)
        os.system(cmd)

    @pytest.mark.P1
    def test_pdf_dev_cli_local_jsonl_auto(self):
        """magic_pdf_dev cli local auto."""
        jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
        cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, 'auto')
        logging.info(cmd)
        os.system(cmd)

    @pytest.mark.P1
    def test_pdf_dev_cli_s3_jsonl_txt(self):
        """magic_pdf_dev cli s3 txt."""
        jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
        cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, "txt")
        logging.info(cmd)
        os.system(cmd)

    @pytest.mark.P1
    def test_pdf_dev_cli_s3_jsonl_ocr(self):
        """magic_pdf_dev cli s3 ocr."""
        jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
        cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, 'ocr')
        logging.info(cmd)
        os.system(cmd)

    @pytest.mark.P1
    def test_pdf_dev_cli_s3_jsonl_auto(self):
        """magic_pdf_dev cli s3 auto."""
        jsonl_path = os.path.join(pdf_dev_path, 'line1.jsonl')
        cmd = 'magic-pdf-dev --jsonl %s --method %s' % (jsonl_path, 'auto')
        logging.info(cmd)
        os.system(cmd)

    @pytest.mark.P1
    def test_pdf_dev_cli_pdf_json_auto(self):
        """magic_pdf_dev cli pdf+json auto."""
        json_path = os.path.join(pdf_dev_path, 'test_model.json')
        pdf_path = os.path.join(pdf_dev_path, 'pdf', 'research_report_1f978cd81fb7260c8f7644039ec2c054.pdf')
        cmd = 'magic-pdf-dev --pdf %s --json %s --method %s' % (pdf_path, json_path, 'auto')
        logging.info(cmd)
        os.system(cmd)

    @pytest.mark.P1
    def test_pdf_dev_cli_pdf_json_ocr(self):
        """magic_pdf_dev cli pdf+json ocr."""
        json_path = os.path.join(pdf_dev_path, 'test_model.json')
        pdf_path = os.path.join(pdf_dev_path, 'pdf', 'research_report_1f978cd81fb7260c8f7644039ec2c054.pdf')
        # Use the 'ocr' method here; the original passed 'auto', duplicating the test above.
        cmd = 'magic-pdf-dev --pdf %s --json %s --method %s' % (pdf_path, json_path, 'ocr')
        logging.info(cmd)
        os.system(cmd)

    @pytest.mark.P1
    def test_s3_sdk_auto(self):
        """magic_pdf sdk s3 auto."""
        pdf_ak = os.environ.get('pdf_ak', "")
        pdf_sk = os.environ.get('pdf_sk', "")
        pdf_bucket = os.environ.get('bucket', "")
        pdf_endpoint = os.environ.get('pdf_endpoint', "")
        s3_pdf_path = conf.conf["s3_pdf_path"]
        image_dir = "s3://" + pdf_bucket + "/mineru/test/test.md"
        s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
        s3image_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint, parent_path=image_dir)
        pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
        jso_useful_key = {"_pdf_type": "", "model_list": []}
        pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
        pipe.pipe_classify()
        pipe.pipe_analyze()
        pipe.pipe_parse()
        md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
        assert len(md_content) > 0


if __name__ == '__main__':
    pytest.main()
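The CLI tests above call `os.system(cmd)` and discard its return value, so a failing `magic-pdf-dev` invocation would not fail the test. A minimal sketch of one alternative, using the standard-library `subprocess.run` and asserting on the exit code, follows. The helper name `run_cli` is hypothetical, and the stand-in command is used only so the sketch runs without `magic-pdf-dev` installed.

```python
import subprocess
import sys
from typing import List


def run_cli(args: List[str]) -> subprocess.CompletedProcess:
    """Run a command without a shell and capture its stdout and stderr as text."""
    return subprocess.run(args, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in command; a real test would pass something like
    # ["magic-pdf-dev", "--jsonl", jsonl_path, "--method", "txt"].
    result = run_cli([sys.executable, "-c", "print('ok')"])
    assert result.returncode == 0, result.stderr
```

Passing the arguments as a list also avoids shell quoting problems when a path contains spaces.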
"""
test performance
"""
import os
import shutil
import json
from lib import calculate_score
import pytest
from conf import conf

code_path = os.environ.get('GITHUB_WORKSPACE')
pdf_dev_path = conf.conf["pdf_dev_path"]
pdf_res_path = conf.conf["pdf_res_path"]


class TestTable():
    """
    test table
    """
    def test_perf_close_table(self):
        """
        test perf when close table
        """


def get_score():
    """
    get score
    """
    score = calculate_score.Scoring(os.path.join(pdf_dev_path, "result.json"))
    score.calculate_similarity_total("mineru", pdf_dev_path)
    res = score.summary_scores()
    return res
"""
test table case
"""
import os
import shutil
import json
from lib import calculate_score
import pytest
from conf import conf

code_path = os.environ.get('GITHUB_WORKSPACE')
pdf_dev_path = conf.conf["pdf_dev_path"]
pdf_res_path = conf.conf["pdf_res_path"]


class TestTable():
    """
    test table
    """
    def test_paddle_table_master_cuda(self):
        """
        select table: paddle table master, mode is cuda
        """

    def test_paddle_table_master_cpu(self):
        """
        select table: paddle table master, mode is cpu
        """

    def test_st_table_cuda(self):
        """
        select table: ST, mode is cuda
        """

    def test_st_table_cpu(self):
        """
        select table: ST, mode is cpu
        """

    def test_close_table_cuda(self):
        """
        close table, mode is cuda
        """


def get_score():
    """
    get score
    """
    score = calculate_score.Scoring(os.path.join(pdf_dev_path, "result.json"))
    score.calculate_similarity_total("mineru", pdf_dev_path)
    res = score.summary_scores()
    return res
import pytest
from PIL import Image
from magic_pdf.model.ppTableModel import ppTableModel


class TestppTableModel:
    def test_image2html(self):
        img = Image.open("tests/unittest/test_table/assets/table.jpg")
        # Change this to the path of your table model
        config = {"device": "cuda",
                  "model_dir": "/home/quyuan/PDF-Extract-Kit/models/TabRec/TableMaster"}
        table_model = ppTableModel(config)
        res = table_model.img2html(img)
        true_value = """<td><table border="1"><thead><tr><td><b>Methods</b></td><td><b>R</b></td><td><b>P</b></td><td><b>F</b></td><td><b>FPS</b></td></tr></thead><tbody><tr><td>SegLink [26]</td><td>70.0</td><td>86.0</td><td>77.0</td><td>8.9</td></tr><tr><td>PixelLink [4]</td><td>73.2</td><td>83.0</td><td>77.8</td><td>-</td></tr><tr><td>TextSnake [18]</td><td>73.9</td><td>83.2</td><td>78.3</td><td>1.1</td></tr><tr><td>TextField [37]</td><td>75.9</td><td>87.4</td><td>81.3</td><td>5.2 </td></tr><tr><td>MSR[38]</td><td>76.7</td><td>87.4</td><td>81.7</td><td>-</td></tr><tr><td>FTSN[3]</td><td>77.1</td><td>87.6</td><td>82.0</td><td>-</td></tr><tr><td>LSE[30]</td><td>81.7</td><td>84.2</td><td>82.9</td><td>-</td></tr><tr><td>CRAFT [2]</td><td>78.2</td><td>88.2</td><td>82.9</td><td>8.6</td></tr><tr><td>MCN [16]</td><td>79</td><td>88.</td><td>83</td><td>-</td></tr><tr><td>ATRR[35]</td><td>82.1</td><td>85.2</td><td>83.6</td><td>-</td></tr><tr><td>PAN [34]</td><td>83.8</td><td>84.4</td><td>84.1</td><td>30.2</td></tr><tr><td>DB[12]</td><td>79.2</td><td>91.5</td><td>84.9</td><td>32.0</td></tr><tr><td>DRRG [41]</td><td>82.30</td><td>88.05</td><td>85.08</td><td>-</td></tr><tr><td>Ours (SynText)</td><td>80.68</td><td>85.40</td><td>82.97</td><td>12.68</td></tr><tr><td>Ours (MLT-17)</td><td>84.54</td><td>86.62</td><td>85.57</td><td>12.31</td></tr></tbody></table></td>\n"""
        assert res == true_value
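The test above compares the model's HTML output to the expected string byte for byte, so incidental whitespace (such as the trailing space in `5.2 ` or the final newline) has to match exactly. If that proves brittle, one option is to normalize both sides before comparing. The sketch below is only an illustration: `normalize_html` is a hypothetical helper, not part of `magic_pdf`, and it assumes whitespace next to tags is insignificant for this comparison.

```python
import re


def normalize_html(html: str) -> str:
    """Collapse whitespace runs and remove whitespace adjacent to tags."""
    html = re.sub(r"\s+", " ", html.strip())
    return re.sub(r"\s*(<[^>]+>)\s*", r"\1", html)


if __name__ == "__main__":
    expected = "<tr><td>TextField [37]</td><td>5.2 </td></tr>\n"
    actual = "<tr> <td>TextField [37]</td> <td>5.2</td> </tr>"
    assert normalize_html(actual) == normalize_html(expected)
```

In the test, the final assertion would then become `assert normalize_html(res) == normalize_html(true_value)`.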