Unverified commit c9a51491, authored by icecraft, committed by GitHub

feat: rename the file generated by command line tools (#401)

* feat: rename the file generated by command line tools

* feat: add pdf filename as prefix to {span,layout,model}.pdf

---------
Co-authored-by: icecraft <tmortred@gmail.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>
parent 041b9465
...@@ -5,6 +5,7 @@ ...@@ -5,6 +5,7 @@
</p> </p>
<!-- icon --> <!-- icon -->
[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU) [![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU) [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues) [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
...@@ -15,14 +16,17 @@ ...@@ -15,14 +16,17 @@
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a> <a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
<!-- language --> <!-- language -->
[English](README.md) | [简体中文](README_zh-CN.md) [English](README.md) | [简体中文](README_zh-CN.md)
<!-- hot link --> <!-- hot link -->
<p align="center"> <p align="center">
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: High-Quality PDF Extraction Toolkit</a>🔥🔥🔥 <a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: High-Quality PDF Extraction Toolkit</a>🔥🔥🔥
</p> </p>
<!-- join us --> <!-- join us -->
<p align="center"> <p align="center">
👋 join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a> 👋 join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
</p> </p>
...@@ -30,11 +34,13 @@ ...@@ -30,11 +34,13 @@
</div> </div>
# Changelog # Changelog
- 2024/08/09: Version 0.7.0b1 released, simplified installation process, added table recognition functionality - 2024/08/09: Version 0.7.0b1 released, simplified installation process, added table recognition functionality
- 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation - 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation
- 2024/07/05: Initial open-source release - 2024/07/05: Initial open-source release
<!-- TABLE OF CONTENT --> <!-- TABLE OF CONTENT -->
<details open="open"> <details open="open">
<summary><h2 style="display: inline-block">Table of Contents</h2></summary> <summary><h2 style="display: inline-block">Table of Contents</h2></summary>
<ol> <ol>
...@@ -73,10 +79,10 @@ ...@@ -73,10 +79,10 @@
</ol> </ol>
</details> </details>
# MinerU # MinerU
## Project Introduction ## Project Introduction
MinerU is a tool that converts PDFs into machine-readable formats (e.g., markdown, JSON), allowing for easy extraction into any format. MinerU is a tool that converts PDFs into machine-readable formats (e.g., markdown, JSON), allowing for easy extraction into any format.
MinerU was born during the pre-training process of [InternLM](https://github.com/InternLM/InternLM). We focus on solving symbol conversion issues in scientific literature and hope to contribute to technological development in the era of large models. MinerU was born during the pre-training process of [InternLM](https://github.com/InternLM/InternLM). We focus on solving symbol conversion issues in scientific literature and hope to contribute to technological development in the era of large models.
Compared to well-known commercial products, MinerU is still young. If you encounter any issues or if the results are not as expected, please submit an issue on [issue](https://github.com/opendatalab/MinerU/issues) and **attach the relevant PDF**. Compared to well-known commercial products, MinerU is still young. If you encounter any issues or if the results are not as expected, please submit an issue on [issue](https://github.com/opendatalab/MinerU/issues) and **attach the relevant PDF**.
...@@ -100,6 +106,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c ...@@ -100,6 +106,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
If you encounter any installation issues, please first consult the <a href="#faq">FAQ</a>. </br> If you encounter any installation issues, please first consult the <a href="#faq">FAQ</a>. </br>
If the parsing results are not as expected, refer to the <a href="#known-issues">Known Issues</a>. </br> If the parsing results are not as expected, refer to the <a href="#known-issues">Known Issues</a>. </br>
There are three different ways to experience MinerU: There are three different ways to experience MinerU:
- [Online Demo (No Installation Required)](#online-demo) - [Online Demo (No Installation Required)](#online-demo)
- [Quick CPU Demo (Windows, Linux, Mac)](#quick-cpu-demo) - [Quick CPU Demo (Windows, Linux, Mac)](#quick-cpu-demo)
- [Linux/Windows + CUDA](#Using-GPU) - [Linux/Windows + CUDA](#Using-GPU)
...@@ -168,33 +175,41 @@ In non-mainline environments, due to the diversity of hardware and software conf ...@@ -168,33 +175,41 @@ In non-mainline environments, due to the diversity of hardware and software conf
### Quick CPU Demo ### Quick CPU Demo
#### 1. Install magic-pdf #### 1. Install magic-pdf
```bash ```bash
conda create -n MinerU python=3.10 conda create -n MinerU python=3.10
conda activate MinerU conda activate MinerU
pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com
``` ```
#### 2. Download model weight files #### 2. Download model weight files
Refer to [How to Download Model Files](docs/how_to_download_models_en.md) for detailed instructions. Refer to [How to Download Model Files](docs/how_to_download_models_en.md) for detailed instructions.
> ❗️After downloading the models, please make sure to verify the completeness of the model files. > ❗️After downloading the models, please make sure to verify the completeness of the model files.
> >
> Check if the model file sizes match the description on the webpage. If possible, use sha256 to verify the integrity of the files. > Check if the model file sizes match the description on the webpage. If possible, use sha256 to verify the integrity of the files.
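The integrity check recommended above can be sketched with Python's standard library. This is a minimal sketch, not part of magic-pdf: the function names are hypothetical, and the expected digest must come from the model download page.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large weight files are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and spaces."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling `matches_expected(Path("path/to/weights.bin"), "<published digest>")` returns `False` for a truncated or corrupted download; both arguments here are placeholders.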
#### 3. Copy and configure the template file #### 3. Copy and configure the template file
You can find the `magic-pdf.template.json` template configuration file in the root directory of the repository. You can find the `magic-pdf.template.json` template configuration file in the root directory of the repository.
> ❗️Make sure to execute the following command to copy the configuration file to your **user directory**; otherwise, the program will not run. > ❗️Make sure to execute the following command to copy the configuration file to your **user directory**; otherwise, the program will not run.
> >
> The user directory for Windows is `C:\Users\YourUsername`, for Linux it is `/home/YourUsername`, and for macOS it is `/Users/YourUsername`. > The user directory for Windows is `C:\Users\YourUsername`, for Linux it is `/home/YourUsername`, and for macOS it is `/Users/YourUsername`.
```bash ```bash
cp magic-pdf.template.json ~/magic-pdf.json cp magic-pdf.template.json ~/magic-pdf.json
``` ```
Find the `magic-pdf.json` file in your user directory and configure the "models-dir" path to point to the directory where the model weight files were downloaded in [Step 2](#2-download-model-weight-files). Find the `magic-pdf.json` file in your user directory and configure the "models-dir" path to point to the directory where the model weight files were downloaded in [Step 2](#2-download-model-weight-files).
> ❗️Make sure to correctly configure the **absolute path** to the model weight files directory, otherwise the program will not run because it can't find the model files. > ❗️Make sure to correctly configure the **absolute path** to the model weight files directory, otherwise the program will not run because it can't find the model files.
> >
> On Windows, this path should include the drive letter and all backslashes (`\`) in the path should be replaced with forward slashes (`/`) to avoid syntax errors in the JSON file due to escape sequences. > On Windows, this path should include the drive letter and all backslashes (`\`) in the path should be replaced with forward slashes (`/`) to avoid syntax errors in the JSON file due to escape sequences.
> >
> For example: If the models are stored in the "models" directory at the root of the D drive, the "models-dir" value should be `D:/models`. > For example: If the models are stored in the "models" directory at the root of the D drive, the "models-dir" value should be `D:/models`.
```json ```json
{ {
// other config // other config
...@@ -206,14 +221,13 @@ Find the `magic-pdf.json` file in your user directory and configure the "models- ...@@ -206,14 +221,13 @@ Find the `magic-pdf.json` file in your user directory and configure the "models-
} }
``` ```
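The copy-and-configure step can also be scripted. The sketch below is an illustration under two assumptions: the template is plain JSON without comments, and the helper names (`to_json_safe_path`, `write_user_config`) are hypothetical rather than part of magic-pdf. Only the file names and the "models-dir" key come from the text above.

```python
import json
from pathlib import Path, PureWindowsPath
from typing import Optional


def to_json_safe_path(models_dir: str) -> str:
    """Rewrite backslashes as forward slashes so the JSON needs no escaping."""
    return PureWindowsPath(models_dir).as_posix()


def write_user_config(template: Path, models_dir: str,
                      home: Optional[Path] = None) -> Path:
    """Copy the template to <home>/magic-pdf.json with "models-dir" set."""
    # Assumption: the template parses as strict JSON (no // comments).
    config = json.loads(template.read_text(encoding="utf-8"))
    config["models-dir"] = to_json_safe_path(models_dir)
    target = (home or Path.home()) / "magic-pdf.json"
    target.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return target
```

Passing an absolute path such as `D:\models` stores `D:/models`, which matches the Windows guidance above; POSIX paths pass through unchanged.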
### Using GPU ### Using GPU
If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Please select the appropriate guide based on your system: If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Please select the appropriate guide based on your system:
- [Ubuntu 22.04 LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_en_US.md) - [Ubuntu 22.04 LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_en_US.md)
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md) - [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
## Usage ## Usage
### Command Line ### Command Line
...@@ -248,11 +262,11 @@ The results will be saved in the `{some_output_dir}` directory. The output file ...@@ -248,11 +262,11 @@ The results will be saved in the `{some_output_dir}` directory. The output file
```text ```text
├── some_pdf.md # markdown file ├── some_pdf.md # markdown file
├── images # directory for storing images ├── images # directory for storing images
├── layout.pdf # layout diagram ├── some_pdf_layout.pdf # layout diagram
├── middle.json # MinerU intermediate processing result ├── some_pdf_middle.json # MinerU intermediate processing result
├── model.json # model inference result ├── some_pdf_model.json # model inference result
├── origin.pdf # original PDF file ├── some_pdf_origin.pdf # original PDF file
└── spans.pdf # smallest granularity bbox position information diagram └── some_pdf_spans.pdf # smallest granularity bbox position information diagram
``` ```
For more information about the output files, please refer to the [Output File Description](docs/output_file_en_us.md). For more information about the output files, please refer to the [Output File Description](docs/output_file_en_us.md).
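The renamed outputs follow a simple pattern, which can be sketched as below. This assumes the prefix is the PDF file name without its extension, as the tree above suggests; `expected_outputs` is a hypothetical helper, not a magic-pdf function, and the `images` directory is left out because it is not renamed.

```python
from pathlib import Path


def expected_outputs(pdf_path: str) -> list[str]:
    """List the output file names for a PDF under the prefixed naming scheme."""
    stem = Path(pdf_path).stem  # "some_pdf" for "some_pdf.pdf"
    suffixes = ["layout.pdf", "middle.json", "model.json", "origin.pdf", "spans.pdf"]
    # The markdown file keeps the bare stem; the rest use "<stem>_<suffix>".
    return [f"{stem}.md"] + [f"{stem}_{suffix}" for suffix in suffixes]
```

A script that post-processes the output directory can use such a list to check that every expected file was produced.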
...@@ -260,6 +274,7 @@ For more information about the output files, please refer to the [Output File De ...@@ -260,6 +274,7 @@ For more information about the output files, please refer to the [Output File De
### API ### API
Processing files from local disk Processing files from local disk
```python ```python
image_writer = DiskReaderWriter(local_image_dir) image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir)) image_dir = str(os.path.basename(local_image_dir))
...@@ -272,6 +287,7 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none") ...@@ -272,6 +287,7 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
``` ```
Processing files from object storage Processing files from object storage
```python ```python
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint) s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/" image_dir = "s3://img_bucket/"
...@@ -286,10 +302,10 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none") ...@@ -286,10 +302,10 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
``` ```
For detailed implementation, refer to: For detailed implementation, refer to:
- [demo.py Simplest Processing Method](demo/demo.py) - [demo.py Simplest Processing Method](demo/demo.py)
- [magic_pdf_parse_main.py More Detailed Processing Workflow](demo/magic_pdf_parse_main.py) - [magic_pdf_parse_main.py More Detailed Processing Workflow](demo/magic_pdf_parse_main.py)
### Development Guide ### Development Guide
TODO TODO
...@@ -305,6 +321,7 @@ TODO ...@@ -305,6 +321,7 @@ TODO
- [ ] Geometric shape recognition - [ ] Geometric shape recognition
# Known Issues # Known Issues
- Reading order is segmented based on rules, which can cause disordered sequences in some cases - Reading order is segmented based on rules, which can cause disordered sequences in some cases
- Vertical text is not supported - Vertical text is not supported
- Lists, code blocks, and table of contents are not yet supported in the layout model - Lists, code blocks, and table of contents are not yet supported in the layout model
...@@ -314,17 +331,17 @@ TODO ...@@ -314,17 +331,17 @@ TODO
- **Table Recognition** is currently in the testing phase; recognition speed is slow, and accuracy needs improvement. Below are some performance test results in an Ubuntu 22.04 LTS + Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz + NVIDIA GeForce RTX 4090 environment for reference. - **Table Recognition** is currently in the testing phase; recognition speed is slow, and accuracy needs improvement. Below are some performance test results in an Ubuntu 22.04 LTS + Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz + NVIDIA GeForce RTX 4090 environment for reference.
| Table Size | Parsing Time | | Table Size | Parsing Time |
|---------------|----------------------------| | ------------ | ------------ |
| 6\*5 55kb | 37s | | 6\*5 55kb | 37s |
| 16\*12 284kb | 3m18s | | 16\*12 284kb | 3m18s |
| 44\*7 559kb | 4m12s | | 44\*7 559kb | 4m12s |
# FAQ # FAQ
[FAQ in Chinese](docs/FAQ_zh_cn.md) [FAQ in Chinese](docs/FAQ_zh_cn.md)
[FAQ in English](docs/FAQ_en_us.md) [FAQ in English](docs/FAQ_en_us.md)
# All Thanks To Our Contributors # All Thanks To Our Contributors
<a href="https://github.com/opendatalab/MinerU/graphs/contributors"> <a href="https://github.com/opendatalab/MinerU/graphs/contributors">
...@@ -337,8 +354,8 @@ TODO ...@@ -337,8 +354,8 @@ TODO
This project currently uses PyMuPDF to achieve advanced functionality. However, since it adheres to the AGPL license, it may impose restrictions on certain usage scenarios. In future iterations, we plan to explore and replace it with a more permissive PDF processing library to enhance user-friendliness and flexibility. This project currently uses PyMuPDF to achieve advanced functionality. However, since it adheres to the AGPL license, it may impose restrictions on certain usage scenarios. In future iterations, we plan to explore and replace it with a more permissive PDF processing library to enhance user-friendliness and flexibility.
# Acknowledgments # Acknowledgments
- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) - [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
...@@ -375,9 +392,11 @@ This project currently uses PyMuPDF to achieve advanced functionality. However, ...@@ -375,9 +392,11 @@ This project currently uses PyMuPDF to achieve advanced functionality. However,
</a> </a>
# Magic-doc # Magic-doc
[Magic-Doc](https://github.com/InternLM/magic-doc) Fast speed ppt/pptx/doc/docx/pdf extraction tool [Magic-Doc](https://github.com/InternLM/magic-doc) Fast speed ppt/pptx/doc/docx/pdf extraction tool
# Magic-html # Magic-html
[Magic-HTML](https://github.com/opendatalab/magic-html) Mixed web page extraction tool [Magic-HTML](https://github.com/opendatalab/magic-html) Mixed web page extraction tool
# Links # Links
......
...@@ -4,8 +4,8 @@ ...@@ -4,8 +4,8 @@
<img src="docs/images/MinerU-logo.png" width="300px" style="vertical-align:middle;"> <img src="docs/images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
</p> </p>
<!-- icon --> <!-- icon -->
[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU) [![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU) [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues) [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
...@@ -16,29 +16,31 @@ ...@@ -16,29 +16,31 @@
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a> <a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
<!-- language --> <!-- language -->
[English](README.md) | [简体中文](README_zh-CN.md)
[English](README.md) | [简体中文](README_zh-CN.md)
<!-- hot link --> <!-- hot link -->
<p align="center"> <p align="center">
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: 高质量PDF解析工具箱</a>🔥🔥🔥 <a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: 高质量PDF解析工具箱</a>🔥🔥🔥
</p> </p>
<!-- join us --> <!-- join us -->
<p align="center"> <p align="center">
👋 join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a> 👋 join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
</p> </p>
</div> </div>
# 更新记录 # 更新记录
- 2024/08/09 0.7.0b1发布,简化安装步骤提升易用性,加入表格识别功能 - 2024/08/09 0.7.0b1发布,简化安装步骤提升易用性,加入表格识别功能
- 2024/08/01 0.6.2b1发布,优化了依赖冲突问题和安装文档 - 2024/08/01 0.6.2b1发布,优化了依赖冲突问题和安装文档
- 2024/07/05 首次开源 - 2024/07/05 首次开源
<!-- TABLE OF CONTENT --> <!-- TABLE OF CONTENT -->
<details open="open"> <details open="open">
<summary><h2 style="display: inline-block">文档目录</h2></summary> <summary><h2 style="display: inline-block">文档目录</h2></summary>
<ol> <ol>
...@@ -77,10 +79,10 @@ ...@@ -77,10 +79,10 @@
</ol> </ol>
</details> </details>
# MinerU # MinerU
## 项目简介 ## 项目简介
MinerU是一款将PDF转化为机器可读格式的工具(如markdown、json),可以很方便地抽取为任意格式。 MinerU是一款将PDF转化为机器可读格式的工具(如markdown、json),可以很方便地抽取为任意格式。
MinerU诞生于[书生-浦语](https://github.com/InternLM/InternLM)的预训练过程中,我们将会集中精力解决科技文献中的符号转化问题,希望在大模型时代为科技发展做出贡献。 MinerU诞生于[书生-浦语](https://github.com/InternLM/InternLM)的预训练过程中,我们将会集中精力解决科技文献中的符号转化问题,希望在大模型时代为科技发展做出贡献。
相比国内外知名商用产品MinerU还很年轻,如果遇到问题或者结果不及预期请到[issue](https://github.com/opendatalab/MinerU/issues)提交问题,同时**附上相关PDF** 相比国内外知名商用产品MinerU还很年轻,如果遇到问题或者结果不及预期请到[issue](https://github.com/opendatalab/MinerU/issues)提交问题,同时**附上相关PDF**
...@@ -99,17 +101,16 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c ...@@ -99,17 +101,16 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
- 支持CPU和GPU环境 - 支持CPU和GPU环境
- 支持windows/linux/mac平台 - 支持windows/linux/mac平台
## 快速开始 ## 快速开始
如果遇到任何安装问题,请先查询 <a href="#faq">FAQ</a> </br> 如果遇到任何安装问题,请先查询 <a href="#faq">FAQ</a> </br>
如果遇到解析效果不及预期,参考 <a href="#known-issues">Known Issues</a></br> 如果遇到解析效果不及预期,参考 <a href="#known-issues">Known Issues</a></br>
有3种不同方式可以体验MinerU的效果: 有3种不同方式可以体验MinerU的效果:
- [在线体验(无需任何安装)](#在线体验) - [在线体验(无需任何安装)](#在线体验)
- [使用CPU快速体验(Windows,Linux,Mac)](#使用cpu快速体验) - [使用CPU快速体验(Windows,Linux,Mac)](#使用cpu快速体验)
- [Linux/Windows + CUDA](#使用gpu) - [Linux/Windows + CUDA](#使用gpu)
**⚠️安装前必看——软硬件环境支持说明** **⚠️安装前必看——软硬件环境支持说明**
为了确保项目的稳定性和可靠性,我们在开发过程中仅对特定的软硬件环境进行优化和测试。这样当用户在推荐的系统配置上部署和运行项目时,能够获得最佳的性能表现和最少的兼容性问题。 为了确保项目的稳定性和可靠性,我们在开发过程中仅对特定的软硬件环境进行优化和测试。这样当用户在推荐的系统配置上部署和运行项目时,能够获得最佳的性能表现和最少的兼容性问题。
...@@ -171,38 +172,46 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c ...@@ -171,38 +172,46 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
[在线体验点击这里](https://opendatalab.com/OpenSourceTools/Extractor/PDF) [在线体验点击这里](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
### 使用CPU快速体验 ### 使用CPU快速体验
#### 1. 安装magic-pdf #### 1. 安装magic-pdf
最新版本国内镜像源同步可能会有延迟,请耐心等待 最新版本国内镜像源同步可能会有延迟,请耐心等待
```bash ```bash
conda create -n MinerU python=3.10 conda create -n MinerU python=3.10
conda activate MinerU conda activate MinerU
pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
``` ```
#### 2. 下载模型权重文件 #### 2. 下载模型权重文件
详细参考 [如何下载模型文件](docs/how_to_download_models_zh_cn.md) 详细参考 [如何下载模型文件](docs/how_to_download_models_zh_cn.md)
> ❗️模型下载后请务必检查模型文件是否下载完整 > ❗️模型下载后请务必检查模型文件是否下载完整
> >
> 请检查目录下的模型文件大小与网页上描述是否一致,如果可以的话,最好通过sha256校验模型是否下载完整 > 请检查目录下的模型文件大小与网页上描述是否一致,如果可以的话,最好通过sha256校验模型是否下载完整
#### 3. 拷贝配置文件并进行配置 #### 3. 拷贝配置文件并进行配置
在仓库根目录可以获得 [magic-pdf.template.json](magic-pdf.template.json) 配置模版文件 在仓库根目录可以获得 [magic-pdf.template.json](magic-pdf.template.json) 配置模版文件
> ❗️务必执行以下命令将配置文件拷贝到【用户目录】下,否则程序将无法运行 > ❗️务必执行以下命令将配置文件拷贝到【用户目录】下,否则程序将无法运行
> >
> windows的用户目录为 "C:\Users\用户名", linux用户目录为 "/home/用户名", macOS用户目录为 "/Users/用户名" > windows的用户目录为 "C:\\Users\\用户名", linux用户目录为 "/home/用户名", macOS用户目录为 "/Users/用户名"
```bash ```bash
cp magic-pdf.template.json ~/magic-pdf.json cp magic-pdf.template.json ~/magic-pdf.json
``` ```
在用户目录中找到magic-pdf.json文件并配置"models-dir"为[2. 下载模型权重文件](#2-下载模型权重文件)中下载的模型权重文件所在目录 在用户目录中找到magic-pdf.json文件并配置"models-dir"为[2. 下载模型权重文件](#2-下载模型权重文件)中下载的模型权重文件所在目录
> ❗️务必正确配置模型权重文件所在目录的【绝对路径】,否则会因为找不到模型文件而导致程序无法运行 > ❗️务必正确配置模型权重文件所在目录的【绝对路径】,否则会因为找不到模型文件而导致程序无法运行
> >
> windows系统中此路径应包含盘符,且需把路径中所有的"\"替换为"/",否则会因为转义原因导致json文件语法错误。 > windows系统中此路径应包含盘符,且需把路径中所有的""替换为"/",否则会因为转义原因导致json文件语法错误。
> >
> 例如:模型放在D盘根目录的models目录,则model-dir的值应为"D:/models" > 例如:模型放在D盘根目录的models目录,则model-dir的值应为"D:/models"
```json ```json
{ {
// other config // other config
...@@ -214,14 +223,13 @@ cp magic-pdf.template.json ~/magic-pdf.json ...@@ -214,14 +223,13 @@ cp magic-pdf.template.json ~/magic-pdf.json
} }
``` ```
### 使用GPU ### 使用GPU
如果您的设备支持CUDA,且满足主线环境中的显卡要求,则可以使用GPU加速,请根据自己的系统选择适合的教程: 如果您的设备支持CUDA,且满足主线环境中的显卡要求,则可以使用GPU加速,请根据自己的系统选择适合的教程:
- [Ubuntu22.04LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md) - [Ubuntu22.04LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md)
- [Windows10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md) - [Windows10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md)
## 使用 ## 使用
### 命令行 ### 命令行
...@@ -256,19 +264,19 @@ magic-pdf -p {some_pdf} -o {some_output_dir} -m auto ...@@ -256,19 +264,19 @@ magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
```text ```text
├── some_pdf.md # markdown 文件 ├── some_pdf.md # markdown 文件
├── images # 存放图片目录 ├── images # 存放图片目录
├── layout.pdf # layout 绘图 ├── some_pdf_layout.pdf # layout 绘图
├── middle.json # minerU 中间处理结果 ├── some_pdf_middle.json # minerU 中间处理结果
├── model.json # 模型推理结果 ├── some_pdf_model.json # 模型推理结果
├── origin.pdf # 原 pdf 文件 ├── some_pdf_origin.pdf # 原 pdf 文件
└── spans.pdf # 最小粒度的bbox位置信息绘图 └── some_pdf_spans.pdf # 最小粒度的bbox位置信息绘图
``` ```
更多有关输出文件的信息,请参考[输出文件说明](docs/output_file_zh_cn.md) 更多有关输出文件的信息,请参考[输出文件说明](docs/output_file_zh_cn.md)
### API ### API
处理本地磁盘上的文件 处理本地磁盘上的文件
```python ```python
image_writer = DiskReaderWriter(local_image_dir) image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir)) image_dir = str(os.path.basename(local_image_dir))
...@@ -281,6 +289,7 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none") ...@@ -281,6 +289,7 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
``` ```
处理对象存储上的文件 处理对象存储上的文件
```python ```python
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint) s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/" image_dir = "s3://img_bucket/"
...@@ -295,10 +304,10 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none") ...@@ -295,10 +304,10 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
``` ```
详细实现可参考 详细实现可参考
- [demo.py 最简单的处理方式](demo/demo.py) - [demo.py 最简单的处理方式](demo/demo.py)
- [magic_pdf_parse_main.py 能够更清晰看到处理流程](demo/magic_pdf_parse_main.py) - [magic_pdf_parse_main.py 能够更清晰看到处理流程](demo/magic_pdf_parse_main.py)
### 二次开发 ### 二次开发
TODO TODO
...@@ -313,8 +322,8 @@ TODO ...@@ -313,8 +322,8 @@ TODO
- [ ] 化学式识别 - [ ] 化学式识别
- [ ] 几何图形识别 - [ ] 几何图形识别
# Known Issues # Known Issues
- 阅读顺序基于规则的分割,在一些情况下会乱序 - 阅读顺序基于规则的分割,在一些情况下会乱序
- 不支持竖排文字 - 不支持竖排文字
- 列表、代码块、目录在layout模型里还没有支持 - 列表、代码块、目录在layout模型里还没有支持
...@@ -324,19 +333,17 @@ TODO ...@@ -324,19 +333,17 @@ TODO
- **表格识别**目前处于测试阶段,识别速度较慢,识别准确度有待提升。以下是我们在Ubuntu 22.04 LTS + Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz + NVIDIA GeForce RTX 4090环境下的一些性能测试结果,可供参考。 - **表格识别**目前处于测试阶段,识别速度较慢,识别准确度有待提升。以下是我们在Ubuntu 22.04 LTS + Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz + NVIDIA GeForce RTX 4090环境下的一些性能测试结果,可供参考。
| 表格大小 | 解析耗时 | | 表格大小 | 解析耗时 |
|---------------|----------------------------| | ------------ | -------- |
| 6\*5 55kb | 37s | | 6\*5 55kb | 37s |
| 16\*12 284kb | 3m18s | | 16\*12 284kb | 3m18s |
| 44\*7 559kb | 4m12s | | 44\*7 559kb | 4m12s |
# FAQ # FAQ
[常见问题](docs/FAQ_zh_cn.md) [常见问题](docs/FAQ_zh_cn.md)
[FAQ](docs/FAQ_en_us.md) [FAQ](docs/FAQ_en_us.md)
# All Thanks To Our Contributors # All Thanks To Our Contributors
<a href="https://github.com/opendatalab/MinerU/graphs/contributors"> <a href="https://github.com/opendatalab/MinerU/graphs/contributors">
...@@ -350,6 +357,7 @@ TODO ...@@ -350,6 +357,7 @@ TODO
本项目目前采用PyMuPDF以实现高级功能,但因其遵循AGPL协议,可能对某些使用场景构成限制。未来版本迭代中,我们计划探索并替换为许可条款更为宽松的PDF处理库,以提升用户友好度及灵活性。 本项目目前采用PyMuPDF以实现高级功能,但因其遵循AGPL协议,可能对某些使用场景构成限制。未来版本迭代中,我们计划探索并替换为许可条款更为宽松的PDF处理库,以提升用户友好度及灵活性。
# Acknowledgments # Acknowledgments
- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy) - [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
...@@ -386,9 +394,11 @@ TODO ...@@ -386,9 +394,11 @@ TODO
</a> </a>
# Magic-doc # Magic-doc
[Magic-Doc](https://github.com/InternLM/magic-doc) Fast speed ppt/pptx/doc/docx/pdf extraction tool [Magic-Doc](https://github.com/InternLM/magic-doc) Fast speed ppt/pptx/doc/docx/pdf extraction tool
# Magic-html # Magic-html
[Magic-HTML](https://github.com/opendatalab/magic-html) Mixed web page extraction tool [Magic-HTML](https://github.com/opendatalab/magic-html) Mixed web page extraction tool
# Links # Links
......
## Overview ## Overview
After executing the `magic-pdf` command, in addition to outputting files related to markdown, several other files unrelated to markdown will also be generated. These files will be introduced one by one. After executing the `magic-pdf` command, in addition to outputting files related to markdown, several other files unrelated to markdown will also be generated. These files will be introduced one by one.
### some_pdf_layout.pdf
### layout.pdf
Each page layout consists of one or more boxes. The number at the top left of each box indicates its sequence number. Additionally, in `layout.pdf`, different content blocks are highlighted with different background colors. Each page layout consists of one or more boxes. The number at the top left of each box indicates its sequence number. Additionally, in `layout.pdf`, different content blocks are highlighted with different background colors.
![layout example](images/layout_example.png) ![layout example](images/layout_example.png)
### some_pdf_spans.pdf
### spans.pdf
All spans on the page are drawn with different colored line frames according to the span type. This file can be used for quality control, allowing for quick identification of issues such as missing text or unrecognized inline formulas. All spans on the page are drawn with different colored line frames according to the span type. This file can be used for quality control, allowing for quick identification of issues such as missing text or unrecognized inline formulas.
![spans example](images/spans_example.png) ![spans example](images/spans_example.png)
### some_pdf_model.json
### model.json
#### Structure Definition #### Structure Definition
```python ```python
from pydantic import BaseModel, Field from pydantic import BaseModel, Field
from enum import IntEnum from enum import IntEnum
...@@ -62,11 +61,9 @@ inference_result: list[PageInferenceResults] = [] ...@@ -62,11 +61,9 @@ inference_result: list[PageInferenceResults] = []
``` ```
The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3], representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively. The format of the poly coordinates is \[x0, y0, x1, y1, x2, y2, x3, y3\], representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively.
![Poly Coordinate Diagram](images/poly.png) ![Poly Coordinate Diagram](images/poly.png)
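For downstream code that prefers axis-aligned boxes, the eight-value poly can be reduced to a bounding box. The helper below is a hypothetical sketch, not part of magic-pdf; it relies only on the point order stated above.

```python
def poly_to_bbox(poly: list[float]) -> tuple[float, float, float, float]:
    """Convert [x0, y0, x1, y1, x2, y2, x3, y3] to (x_min, y_min, x_max, y_max)."""
    if len(poly) != 8:
        raise ValueError("poly must contain exactly 8 numbers")
    xs = poly[0::2]  # x0, x1, x2, x3
    ys = poly[1::2]  # y0, y1, y2, y3
    # min/max also covers slightly skewed quadrilaterals.
    return (min(xs), min(ys), max(xs), max(ys))
```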
#### example #### example
```json ```json
...@@ -120,15 +117,13 @@ The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3], represen ...@@ -120,15 +117,13 @@ The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3], represen
] ]
``` ```
### some_pdf_middle.json
### middle.json
| Field Name | Description | | Field Name | Description |
| :-----|:------------------------------------------| | :------------- | :------------------------------------------------------------------------------------------------------------- |
|pdf_info | list, each element is a dict representing the parsing result of each PDF page, see the table below for details | | pdf_info | list, each element is a dict representing the parsing result of each PDF page, see the table below for details |
|_parse_type | ocr \| txt, used to indicate the mode used in this intermediate parsing state | | \_parse_type | ocr \| txt, used to indicate the mode used in this intermediate parsing state |
|_version_name | string, indicates the version of magic-pdf used in this parsing | | \_version_name | string, indicates the version of magic-pdf used in this parsing |
<br> <br>
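The top-level fields in the table above can be read with the standard `json` module. This is a minimal sketch: the function names are hypothetical, and only the keys `pdf_info`, `_parse_type`, and `_version_name` come from the table.

```python
import json
from pathlib import Path


def load_middle(path: Path) -> dict:
    """Load a middle.json file (for example some_pdf_middle.json)."""
    return json.loads(path.read_text(encoding="utf-8"))


def summarize_middle(data: dict) -> dict:
    """Collect the top-level fields described in the table."""
    return {
        "pages": len(data["pdf_info"]),    # one dict per PDF page
        "parse_type": data["_parse_type"],  # "ocr" or "txt"
        "version": data["_version_name"],   # magic-pdf version string
    }
```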
...@@ -137,12 +132,12 @@ The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3], represen ...@@ -137,12 +132,12 @@ The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3], represen
Field structure description Field structure description
| Field Name | Description | | Field Name | Description |
| :-----| :---- | | :------------------ | :----------------------------------------------------------------------------------------------------------------- |
| preproc_blocks | Intermediate result after PDF preprocessing, not yet segmented | | preproc_blocks | Intermediate result after PDF preprocessing, not yet segmented |
| layout_bboxes | Layout segmentation results, containing layout direction (vertical, horizontal), and bbox, sorted by reading order | | layout_bboxes | Layout segmentation results, containing layout direction (vertical, horizontal), and bbox, sorted by reading order |
| page_idx | Page number, starting from 0 | | page_idx | Page number, starting from 0 |
| page_size | Page width and height | | page_size | Page width and height |
| _layout_tree | Layout tree structure | | \_layout_tree | Layout tree structure |
| images | list, each element is a dict representing an img_block | | images | list, each element is a dict representing an img_block |
| tables | list, each element is a dict representing a table_block | | tables | list, each element is a dict representing a table_block |
| interline_equations | list, each element is a dict representing an interline_equation_block | | interline_equations | list, each element is a dict representing an interline_equation_block |
...@@ -158,10 +153,10 @@ In the above table, `para_blocks` is an array of dicts, each dict representing a ...@@ -158,10 +153,10 @@ In the above table, `para_blocks` is an array of dicts, each dict representing a
The outer block is referred to as a first-level block, and the fields in the first-level block include: The outer block is referred to as a first-level block, and the fields in the first-level block include:
| Field Name | Description | | Field Name | Description |
| :-----| :---- | | :--------- | :------------------------------------------------------------- |
| type | Block type (table\|image)| | type | Block type (table\|image) |
|bbox | Block bounding box coordinates | | bbox | Block bounding box coordinates |
|blocks |list, each element is a dict representing a second-level block | | blocks | list, each element is a dict representing a second-level block |
<br> <br>
There are only two types of first-level blocks: "table" and "image". All other blocks are second-level blocks. There are only two types of first-level blocks: "table" and "image". All other blocks are second-level blocks.
...@@ -169,15 +164,15 @@ There are only two types of first-level blocks: "table" and "image". All other b ...@@ -169,15 +164,15 @@ There are only two types of first-level blocks: "table" and "image". All other b
The fields in a second-level block include: The fields in a second-level block include:
| Field Name | Description | | Field Name | Description |
| :-----| :---- | | :--------- | :---------------------------------------------------------------------------------------------------------- |
| type | Block type | | type | Block type |
| bbox | Block bounding box coordinates | | bbox | Block bounding box coordinates |
| lines | list, each element is a dict representing a line, used to describe the composition of a line of information| | lines | list, each element is a dict representing a line, used to describe the composition of a line of information |
Detailed explanation of second-level block types Detailed explanation of second-level block types
| type | Description | | type | Description |
|:-------------------| :---- | | :----------------- | :--------------------- |
| image_body | Main body of the image | | image_body | Main body of the image |
| image_caption | Image description text | | image_caption | Image description text |
| table_body | Main body of the table | | table_body | Main body of the table |
...@@ -185,7 +180,7 @@ Detailed explanation of second-level block types ...@@ -185,7 +180,7 @@ Detailed explanation of second-level block types
| table_footnote | Table footnote | | table_footnote | Table footnote |
| text | Text block | | text | Text block |
| title | Title block | | title | Title block |
| interline_equation | Block formula| | interline_equation | Block formula |
<br> <br>
...@@ -194,17 +189,16 @@ Detailed explanation of second-level block types ...@@ -194,17 +189,16 @@ Detailed explanation of second-level block types
The field format of a line is as follows: The field format of a line is as follows:
| Field Name | Description | | Field Name | Description |
| :-----| :---- | | :--------- | :------------------------------------------------------------------------------------------------------ |
| bbox | Bounding box coordinates of the line | | bbox | Bounding box coordinates of the line |
| spans | list, each element is a dict representing a span, used to describe the composition of the smallest unit | | spans | list, each element is a dict representing a span, used to describe the composition of the smallest unit |
<br> <br>
**span** **span**
| Field Name | Description | | Field Name | Description |
| :-----| :---- | | :------------------ | :------------------------------------------------------------------------------------------------------- |
| bbox | Bounding box coordinates of the span | | bbox | Bounding box coordinates of the span |
| type | Type of the span | | type | Type of the span |
| content \| img_path | Text spans use content, chart spans use img_path to store the actual text or screenshot path information | | content \| img_path | Text spans use content, chart spans use img_path to store the actual text or screenshot path information |
...@@ -212,7 +206,7 @@ The field format of a line is as follows: ...@@ -212,7 +206,7 @@ The field format of a line is as follows:
The types of spans are as follows: The types of spans are as follows:
| type | Description | | type | Description |
| :-----| :---- | | :----------------- | :------------- |
| image | Image | | image | Image |
| table | Table | | table | Table |
| text | Text | | text | Text |
...@@ -229,7 +223,6 @@ The block structure is as follows: ...@@ -229,7 +223,6 @@ The block structure is as follows:
First-level block (if any) -> Second-level block -> Line -> Span First-level block (if any) -> Second-level block -> Line -> Span
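The nesting above can be walked with a small generator. This is a minimal sketch, assuming first-level blocks carry the type strings "table" and "image" as stated earlier; `iter_spans` and the sample data are hypothetical and only mimic the documented fields.

```python
def iter_spans(para_blocks):
    """Yield every span, descending block -> (sub-block) -> line -> span."""
    for block in para_blocks:
        if block["type"] in ("table", "image"):
            # First-level block: its second-level blocks live under "blocks".
            second_level = block.get("blocks", [])
        else:
            # Any other block is already a second-level block.
            second_level = [block]
        for sub_block in second_level:
            for line in sub_block.get("lines", []):
                for span in line.get("spans", []):
                    yield span


sample_blocks = [
    {
        "type": "text",
        "bbox": [0, 0, 100, 20],
        "lines": [
            {
                "bbox": [0, 0, 100, 20],
                "spans": [
                    {"type": "text", "bbox": [0, 0, 100, 20], "content": "hello"}
                ],
            }
        ],
    },
    {
        "type": "image",
        "bbox": [0, 30, 100, 130],
        "blocks": [
            {
                "type": "image_body",
                "bbox": [0, 30, 100, 130],
                "lines": [
                    {
                        "bbox": [0, 30, 100, 130],
                        "spans": [
                            {
                                "type": "image",
                                "bbox": [0, 30, 100, 130],
                                "img_path": "images/example.jpg",
                            }
                        ],
                    }
                ],
            }
        ],
    },
]

# Text spans store "content"; image and table spans store "img_path".
payloads = [s.get("content", s.get("img_path")) for s in iter_spans(sample_blocks)]
print(payloads)  # ['hello', 'images/example.jpg']
```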
#### example #### example
```json ```json
......
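## Overview ## Overview
After the `magic-pdf` command runs, it produces not only the markdown-related files but also several files unrelated to markdown. These files are described one by one below. After the `magic-pdf` command runs, it produces not only the markdown-related files but also several files unrelated to markdown. These files are described one by one below.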
### some_pdf_layout.pdf
### layout.pdf
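The layout of each page consists of one or more boxes. The number in the top-left corner of each box indicates its sequence number. In addition, layout.pdf uses different background colors inside the boxes to mark different content blocks. The layout of each page consists of one or more boxes. The number in the top-left corner of each box indicates its sequence number. In addition, layout.pdf uses different background colors inside the boxes to mark different content blocks.
![layout page example](images/layout_example.png) ![layout page example](images/layout_example.png)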
### some_pdf_spans.pdf
### spans.pdf
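All spans on the page are drawn with outlines whose color depends on the span type. This file can be used for quality checks, making it quick to spot problems such as missing text or unrecognized interline equations. All spans on the page are drawn with outlines whose color depends on the span type. This file can be used for quality checks, making it quick to spot problems such as missing text or unrecognized interline equations.
![span page example](images/spans_example.png) ![span page example](images/spans_example.png)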
### some_pdf_model.json
### model.json
#### Structure Definition #### Structure Definition
```python ```python
from pydantic import BaseModel, Field from pydantic import BaseModel, Field
from enum import IntEnum from enum import IntEnum
...@@ -62,10 +61,9 @@ inference_result: list[PageInferenceResults] = [] ...@@ -62,10 +61,9 @@ inference_result: list[PageInferenceResults] = []
``` ```
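The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3], representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively. The format of the poly coordinates is \[x0, y0, x1, y1, x2, y2, x3, y3\], representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively.
![Poly Coordinate Diagram](images/poly.png) ![Poly Coordinate Diagram](images/poly.png)
#### example #### example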
```json ```json
...@@ -119,14 +117,13 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右 ...@@ -119,14 +117,13 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右
] ]
``` ```
### some_pdf_middle.json
### middle.json
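| Field Name | Description | | Field Name | Description |
| :-----|:------------------------------------------| | :------------- | :----------------------------------------------------------------- |
|pdf_info | list, each element is a dict representing the parsing result of each PDF page, see the table below for details | | pdf_info | list, each element is a dict representing the parsing result of each PDF page, see the table below for details |
|_parse_type | ocr \| txt, used to indicate the mode used in this intermediate parsing state | | \_parse_type | ocr \| txt, used to indicate the mode used in this intermediate parsing state |
|_version_name | string, indicates the version of magic-pdf used in this parsing | | \_version_name | string, indicates the version of magic-pdf used in this parsing |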
<br> <br>
...@@ -134,12 +131,12 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右 ...@@ -134,12 +131,12 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右
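Field structure description Field structure description
| Field Name | Description | | Field Name | Description |
| :-----| :---- | | :------------------ | :------------------------------------------------------------------- |
| preproc_blocks | Intermediate result after PDF preprocessing, not yet segmented | | preproc_blocks | Intermediate result after PDF preprocessing, not yet segmented |
| layout_bboxes | Layout segmentation results, containing layout direction (vertical, horizontal), and bbox, sorted by reading order | | layout_bboxes | Layout segmentation results, containing layout direction (vertical, horizontal), and bbox, sorted by reading order |
| page_idx | Page number, starting from 0 | | page_idx | Page number, starting from 0 |
| page_size | Page width and height | | page_size | Page width and height |
| _layout_tree | Layout tree structure | | \_layout_tree | Layout tree structure |
| images | list, each element is a dict representing an img_block | | images | list, each element is a dict representing an img_block |
| tables | list, each element is a dict representing a table_block | | tables | list, each element is a dict representing a table_block |
| interline_equations | list, each element is a dict representing an interline_equation_block | | interline_equations | list, each element is a dict representing an interline_equation_block |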
...@@ -155,10 +152,10 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右 ...@@ -155,10 +152,10 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右
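The outer block is referred to as a first-level block, and the fields in the first-level block include: The outer block is referred to as a first-level block, and the fields in the first-level block include:
| Field Name | Description | | Field Name | Description |
| :-----| :---- | | :----- | :---------------------------------------------- |
| type | Block type (table\|image)| | type | Block type (table\|image) |
|bbox | Block bounding box coordinates | | bbox | Block bounding box coordinates |
|blocks |list, each element is a dict representing a second-level block | | blocks | list, each element is a dict representing a second-level block |
<br> <br>
There are only two types of first-level blocks: "table" and "image". All other blocks are second-level blocks. There are only two types of first-level blocks: "table" and "image". All other blocks are second-level blocks.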
...@@ -166,15 +163,15 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右 ...@@ -166,15 +163,15 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右
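The fields in a second-level block include: The fields in a second-level block include:
| Field Name | Description | | Field Name | Description |
| :-----| :---- | | :----- | :----------------------------------------------------------- |
| type | Block type | | type | Block type |
| bbox | Block bounding box coordinates | | bbox | Block bounding box coordinates |
| lines | list, each element is a dict representing a line, used to describe the composition of a line of information| | lines | list, each element is a dict representing a line, used to describe the composition of a line of information |
Detailed explanation of second-level block types Detailed explanation of second-level block types
| type | Description | | type | Description |
|:-------------------| :---- | | :----------------- | :------------- |
| image_body | Main body of the image | | image_body | Main body of the image |
| image_caption | Image description text | | image_caption | Image description text |
| table_body | Main body of the table | | table_body | Main body of the table |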
...@@ -182,7 +179,7 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右 ...@@ -182,7 +179,7 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右
| table_footnote | Table footnote | | table_footnote | Table footnote |
| text | Text block | | text | Text block |
| title | Title block | | title | Title block |
| interline_equation | Block formula| | interline_equation | Block formula |
<br> <br>
...@@ -191,17 +188,16 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右 ...@@ -191,17 +188,16 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右
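The field format of a line is as follows: The field format of a line is as follows:
| Field Name | Description | | Field Name | Description |
| :-----| :---- | | :----- | :------------------------------------------------------------------- |
| bbox | Bounding box coordinates of the line | | bbox | Bounding box coordinates of the line |
| spans | list, each element is a dict representing a span, used to describe the composition of the smallest unit | | spans | list, each element is a dict representing a span, used to describe the composition of the smallest unit |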
<br> <br>
**span** **span**
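| Field Name | Description | | Field Name | Description |
| :-----| :---- | | :------------------ | :------------------------------------------------------------------------------- |
| bbox | Bounding box coordinates of the span | | bbox | Bounding box coordinates of the span |
| type | Type of the span | | type | Type of the span |
| content \| img_path | Text spans use content, chart spans use img_path to store the actual text or screenshot path information | | content \| img_path | Text spans use content, chart spans use img_path to store the actual text or screenshot path information |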
...@@ -209,14 +205,13 @@ line 的 字段格式如下 ...@@ -209,14 +205,13 @@ line 的 字段格式如下
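The types of spans are as follows: The types of spans are as follows:
| type | Description | | type | Description |
| :-----| :---- | | :----------------- | :------- |
| image | Image | | image | Image |
| table | Table | | table | Table |
| text | Text | | text | Text |
| inline_equation | Inline formula | | inline_equation | Inline formula |
| interline_equation | Block formula | | interline_equation | Block formula |
**Summary** **Summary**
A span is the smallest storage unit for all elements. A span is the smallest storage unit for all elements.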
...@@ -227,7 +222,6 @@ para_blocks内存储的元素为区块信息 ...@@ -227,7 +222,6 @@ para_blocks内存储的元素为区块信息
First-level block (if any) -> Second-level block -> Line -> Span First-level block (if any) -> Second-level block -> Line -> Span
#### example #### example
```json ```json
......
from magic_pdf.libs.Constants import CROSS_PAGE
from magic_pdf.libs.commons import fitz # PyMuPDF from magic_pdf.libs.commons import fitz # PyMuPDF
from magic_pdf.libs.ocr_content_type import ContentType, BlockType, CategoryId from magic_pdf.libs.Constants import CROSS_PAGE
from magic_pdf.libs.ocr_content_type import BlockType, CategoryId, ContentType
from magic_pdf.model.magic_model import MagicModel from magic_pdf.model.magic_model import MagicModel
...@@ -65,10 +65,11 @@ def draw_bbox_with_number(i, bbox_list, page, rgb_config, fill_config): ...@@ -65,10 +65,11 @@ def draw_bbox_with_number(i, bbox_list, page, rgb_config, fill_config):
) # Insert the index in the top left corner of the rectangle ) # Insert the index in the top left corner of the rectangle
def draw_layout_bbox(pdf_info, pdf_bytes, out_path): def draw_layout_bbox(pdf_info, pdf_bytes, out_path, filename):
layout_bbox_list = [] layout_bbox_list = []
dropped_bbox_list = [] dropped_bbox_list = []
tables_list, tables_body_list, tables_caption_list, tables_footnote_list = [], [], [], [] tables_list, tables_body_list = [], []
tables_caption_list, tables_footnote_list = [], []
imgs_list, imgs_body_list, imgs_caption_list = [], [], [] imgs_list, imgs_body_list, imgs_caption_list = [], [], []
titles_list = [] titles_list = []
texts_list = [] texts_list = []
...@@ -81,37 +82,37 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path): ...@@ -81,37 +82,37 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path):
titles = [] titles = []
texts = [] texts = []
interequations = [] interequations = []
for layout in page["layout_bboxes"]: for layout in page['layout_bboxes']:
page_layout_list.append(layout["layout_bbox"]) page_layout_list.append(layout['layout_bbox'])
layout_bbox_list.append(page_layout_list) layout_bbox_list.append(page_layout_list)
for dropped_bbox in page["discarded_blocks"]: for dropped_bbox in page['discarded_blocks']:
page_dropped_list.append(dropped_bbox["bbox"]) page_dropped_list.append(dropped_bbox['bbox'])
dropped_bbox_list.append(page_dropped_list) dropped_bbox_list.append(page_dropped_list)
for block in page["para_blocks"]: for block in page['para_blocks']:
bbox = block["bbox"] bbox = block['bbox']
if block["type"] == BlockType.Table: if block['type'] == BlockType.Table:
tables.append(bbox) tables.append(bbox)
for nested_block in block["blocks"]: for nested_block in block['blocks']:
bbox = nested_block["bbox"] bbox = nested_block['bbox']
if nested_block["type"] == BlockType.TableBody: if nested_block['type'] == BlockType.TableBody:
tables_body.append(bbox) tables_body.append(bbox)
elif nested_block["type"] == BlockType.TableCaption: elif nested_block['type'] == BlockType.TableCaption:
tables_caption.append(bbox) tables_caption.append(bbox)
elif nested_block["type"] == BlockType.TableFootnote: elif nested_block['type'] == BlockType.TableFootnote:
tables_footnote.append(bbox) tables_footnote.append(bbox)
elif block["type"] == BlockType.Image: elif block['type'] == BlockType.Image:
imgs.append(bbox) imgs.append(bbox)
for nested_block in block["blocks"]: for nested_block in block['blocks']:
bbox = nested_block["bbox"] bbox = nested_block['bbox']
if nested_block["type"] == BlockType.ImageBody: if nested_block['type'] == BlockType.ImageBody:
imgs_body.append(bbox) imgs_body.append(bbox)
elif nested_block["type"] == BlockType.ImageCaption: elif nested_block['type'] == BlockType.ImageCaption:
imgs_caption.append(bbox) imgs_caption.append(bbox)
elif block["type"] == BlockType.Title: elif block['type'] == BlockType.Title:
titles.append(bbox) titles.append(bbox)
elif block["type"] == BlockType.Text: elif block['type'] == BlockType.Text:
texts.append(bbox) texts.append(bbox)
elif block["type"] == BlockType.InterlineEquation: elif block['type'] == BlockType.InterlineEquation:
interequations.append(bbox) interequations.append(bbox)
tables_list.append(tables) tables_list.append(tables)
tables_body_list.append(tables_body) tables_body_list.append(tables_body)
...@@ -124,26 +125,33 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path): ...@@ -124,26 +125,33 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path):
texts_list.append(texts) texts_list.append(texts)
interequations_list.append(interequations) interequations_list.append(interequations)
pdf_docs = fitz.open("pdf", pdf_bytes) pdf_docs = fitz.open('pdf', pdf_bytes)
for i, page in enumerate(pdf_docs): for i, page in enumerate(pdf_docs):
draw_bbox_with_number(i, layout_bbox_list, page, [255, 0, 0], False) draw_bbox_with_number(i, layout_bbox_list, page, [255, 0, 0], False)
draw_bbox_without_number(i, dropped_bbox_list, page, [158, 158, 158], True) draw_bbox_without_number(i, dropped_bbox_list, page, [158, 158, 158],
draw_bbox_without_number(i, tables_list, page, [153, 153, 0], True) # color ! True)
draw_bbox_without_number(i, tables_body_list, page, [204, 204, 0], True) draw_bbox_without_number(i, tables_list, page, [153, 153, 0],
draw_bbox_without_number(i, tables_caption_list, page, [255, 255, 102], True) True) # color !
draw_bbox_without_number(i, tables_footnote_list, page, [229, 255, 204], True) draw_bbox_without_number(i, tables_body_list, page, [204, 204, 0],
True)
draw_bbox_without_number(i, tables_caption_list, page, [255, 255, 102],
True)
draw_bbox_without_number(i, tables_footnote_list, page,
[229, 255, 204], True)
draw_bbox_without_number(i, imgs_list, page, [51, 102, 0], True) draw_bbox_without_number(i, imgs_list, page, [51, 102, 0], True)
draw_bbox_without_number(i, imgs_body_list, page, [153, 255, 51], True) draw_bbox_without_number(i, imgs_body_list, page, [153, 255, 51], True)
draw_bbox_without_number(i, imgs_caption_list, page, [102, 178, 255], True) draw_bbox_without_number(i, imgs_caption_list, page, [102, 178, 255],
True)
draw_bbox_without_number(i, titles_list, page, [102, 102, 255], True) draw_bbox_without_number(i, titles_list, page, [102, 102, 255], True)
draw_bbox_without_number(i, texts_list, page, [153, 0, 76], True) draw_bbox_without_number(i, texts_list, page, [153, 0, 76], True)
draw_bbox_without_number(i, interequations_list, page, [0, 255, 0], True) draw_bbox_without_number(i, interequations_list, page, [0, 255, 0],
True)
# Save the PDF # Save the PDF
pdf_docs.save(f"{out_path}/layout.pdf") pdf_docs.save(f'{out_path}/{filename}_layout.pdf')
def draw_span_bbox(pdf_info, pdf_bytes, out_path): def draw_span_bbox(pdf_info, pdf_bytes, out_path, filename):
text_list = [] text_list = []
inline_equation_list = [] inline_equation_list = []
interline_equation_list = [] interline_equation_list = []
...@@ -154,22 +162,22 @@ def draw_span_bbox(pdf_info, pdf_bytes, out_path): ...@@ -154,22 +162,22 @@ def draw_span_bbox(pdf_info, pdf_bytes, out_path):
next_page_inline_equation_list = [] next_page_inline_equation_list = []
def get_span_info(span): def get_span_info(span):
if span["type"] == ContentType.Text: if span['type'] == ContentType.Text:
if span.get(CROSS_PAGE, False): if span.get(CROSS_PAGE, False):
next_page_text_list.append(span["bbox"]) next_page_text_list.append(span['bbox'])
else: else:
page_text_list.append(span["bbox"]) page_text_list.append(span['bbox'])
elif span["type"] == ContentType.InlineEquation: elif span['type'] == ContentType.InlineEquation:
if span.get(CROSS_PAGE, False): if span.get(CROSS_PAGE, False):
next_page_inline_equation_list.append(span["bbox"]) next_page_inline_equation_list.append(span['bbox'])
else: else:
page_inline_equation_list.append(span["bbox"]) page_inline_equation_list.append(span['bbox'])
elif span["type"] == ContentType.InterlineEquation: elif span['type'] == ContentType.InterlineEquation:
page_interline_equation_list.append(span["bbox"]) page_interline_equation_list.append(span['bbox'])
elif span["type"] == ContentType.Image: elif span['type'] == ContentType.Image:
page_image_list.append(span["bbox"]) page_image_list.append(span['bbox'])
elif span["type"] == ContentType.Table: elif span['type'] == ContentType.Table:
page_table_list.append(span["bbox"]) page_table_list.append(span['bbox'])
for page in pdf_info: for page in pdf_info:
page_text_list = [] page_text_list = []
...@@ -188,54 +196,56 @@ def draw_span_bbox(pdf_info, pdf_bytes, out_path): ...@@ -188,54 +196,56 @@ def draw_span_bbox(pdf_info, pdf_bytes, out_path):
next_page_inline_equation_list.clear() next_page_inline_equation_list.clear()
# Build dropped_list # Build dropped_list
for block in page["discarded_blocks"]: for block in page['discarded_blocks']:
if block["type"] == BlockType.Discarded: if block['type'] == BlockType.Discarded:
for line in block["lines"]: for line in block['lines']:
for span in line["spans"]: for span in line['spans']:
page_dropped_list.append(span["bbox"]) page_dropped_list.append(span['bbox'])
dropped_list.append(page_dropped_list) dropped_list.append(page_dropped_list)
# Build the remaining useful_list # Build the remaining useful_list
for block in page["para_blocks"]: for block in page['para_blocks']:
if block["type"] in [ if block['type'] in [
BlockType.Text, BlockType.Text,
BlockType.Title, BlockType.Title,
BlockType.InterlineEquation, BlockType.InterlineEquation,
]: ]:
for line in block["lines"]: for line in block['lines']:
for span in line["spans"]: for span in line['spans']:
get_span_info(span) get_span_info(span)
elif block["type"] in [BlockType.Image, BlockType.Table]: elif block['type'] in [BlockType.Image, BlockType.Table]:
for sub_block in block["blocks"]: for sub_block in block['blocks']:
for line in sub_block["lines"]: for line in sub_block['lines']:
for span in line["spans"]: for span in line['spans']:
get_span_info(span) get_span_info(span)
text_list.append(page_text_list) text_list.append(page_text_list)
inline_equation_list.append(page_inline_equation_list) inline_equation_list.append(page_inline_equation_list)
interline_equation_list.append(page_interline_equation_list) interline_equation_list.append(page_interline_equation_list)
image_list.append(page_image_list) image_list.append(page_image_list)
table_list.append(page_table_list) table_list.append(page_table_list)
pdf_docs = fitz.open("pdf", pdf_bytes) pdf_docs = fitz.open('pdf', pdf_bytes)
for i, page in enumerate(pdf_docs): for i, page in enumerate(pdf_docs):
# Get the data for the current page # Get the data for the current page
draw_bbox_without_number(i, text_list, page, [255, 0, 0], False) draw_bbox_without_number(i, text_list, page, [255, 0, 0], False)
draw_bbox_without_number(i, inline_equation_list, page, [0, 255, 0], False) draw_bbox_without_number(i, inline_equation_list, page, [0, 255, 0],
draw_bbox_without_number(i, interline_equation_list, page, [0, 0, 255], False) False)
draw_bbox_without_number(i, interline_equation_list, page, [0, 0, 255],
False)
draw_bbox_without_number(i, image_list, page, [255, 204, 0], False) draw_bbox_without_number(i, image_list, page, [255, 204, 0], False)
draw_bbox_without_number(i, table_list, page, [204, 0, 255], False) draw_bbox_without_number(i, table_list, page, [204, 0, 255], False)
draw_bbox_without_number(i, dropped_list, page, [158, 158, 158], False) draw_bbox_without_number(i, dropped_list, page, [158, 158, 158], False)
# Save the PDF # Save the PDF
pdf_docs.save(f"{out_path}/spans.pdf") pdf_docs.save(f'{out_path}/{filename}_spans.pdf')
def drow_model_bbox(model_list: list, pdf_bytes, out_path): def drow_model_bbox(model_list: list, pdf_bytes, out_path, filename):
dropped_bbox_list = [] dropped_bbox_list = []
tables_body_list, tables_caption_list, tables_footnote_list = [], [], [] tables_body_list, tables_caption_list, tables_footnote_list = [], [], []
imgs_body_list, imgs_caption_list = [], [] imgs_body_list, imgs_caption_list = [], []
titles_list = [] titles_list = []
texts_list = [] texts_list = []
interequations_list = [] interequations_list = []
pdf_docs = fitz.open("pdf", pdf_bytes) pdf_docs = fitz.open('pdf', pdf_bytes)
magic_model = MagicModel(model_list, pdf_docs) magic_model = MagicModel(model_list, pdf_docs)
for i in range(len(model_list)): for i in range(len(model_list)):
page_dropped_list = [] page_dropped_list = []
...@@ -245,26 +255,27 @@ def drow_model_bbox(model_list: list, pdf_bytes, out_path): ...@@ -245,26 +255,27 @@ def drow_model_bbox(model_list: list, pdf_bytes, out_path):
texts = [] texts = []
interequations = [] interequations = []
page_info = magic_model.get_model_list(i) page_info = magic_model.get_model_list(i)
layout_dets = page_info["layout_dets"] layout_dets = page_info['layout_dets']
for layout_det in layout_dets: for layout_det in layout_dets:
bbox = layout_det["bbox"] bbox = layout_det['bbox']
if layout_det["category_id"] == CategoryId.Text: if layout_det['category_id'] == CategoryId.Text:
texts.append(bbox) texts.append(bbox)
elif layout_det["category_id"] == CategoryId.Title: elif layout_det['category_id'] == CategoryId.Title:
titles.append(bbox) titles.append(bbox)
elif layout_det["category_id"] == CategoryId.TableBody: elif layout_det['category_id'] == CategoryId.TableBody:
tables_body.append(bbox) tables_body.append(bbox)
elif layout_det["category_id"] == CategoryId.TableCaption: elif layout_det['category_id'] == CategoryId.TableCaption:
tables_caption.append(bbox) tables_caption.append(bbox)
elif layout_det["category_id"] == CategoryId.TableFootnote: elif layout_det['category_id'] == CategoryId.TableFootnote:
tables_footnote.append(bbox) tables_footnote.append(bbox)
elif layout_det["category_id"] == CategoryId.ImageBody: elif layout_det['category_id'] == CategoryId.ImageBody:
imgs_body.append(bbox) imgs_body.append(bbox)
elif layout_det["category_id"] == CategoryId.ImageCaption: elif layout_det['category_id'] == CategoryId.ImageCaption:
imgs_caption.append(bbox) imgs_caption.append(bbox)
elif layout_det["category_id"] == CategoryId.InterlineEquation_YOLO: elif layout_det[
'category_id'] == CategoryId.InterlineEquation_YOLO:
interequations.append(bbox) interequations.append(bbox)
elif layout_det["category_id"] == CategoryId.Abandon: elif layout_det['category_id'] == CategoryId.Abandon:
page_dropped_list.append(bbox) page_dropped_list.append(bbox)
tables_body_list.append(tables_body) tables_body_list.append(tables_body)
...@@ -278,15 +289,19 @@ def drow_model_bbox(model_list: list, pdf_bytes, out_path): ...@@ -278,15 +289,19 @@ def drow_model_bbox(model_list: list, pdf_bytes, out_path):
dropped_bbox_list.append(page_dropped_list) dropped_bbox_list.append(page_dropped_list)
for i, page in enumerate(pdf_docs): for i, page in enumerate(pdf_docs):
draw_bbox_with_number(i, dropped_bbox_list, page, [158, 158, 158], True) # color ! draw_bbox_with_number(i, dropped_bbox_list, page, [158, 158, 158],
True) # color !
draw_bbox_with_number(i, tables_body_list, page, [204, 204, 0], True) draw_bbox_with_number(i, tables_body_list, page, [204, 204, 0], True)
draw_bbox_with_number(i, tables_caption_list, page, [255, 255, 102], True) draw_bbox_with_number(i, tables_caption_list, page, [255, 255, 102],
draw_bbox_with_number(i, tables_footnote_list, page, [229, 255, 204], True) True)
draw_bbox_with_number(i, tables_footnote_list, page, [229, 255, 204],
True)
draw_bbox_with_number(i, imgs_body_list, page, [153, 255, 51], True) draw_bbox_with_number(i, imgs_body_list, page, [153, 255, 51], True)
draw_bbox_with_number(i, imgs_caption_list, page, [102, 178, 255], True) draw_bbox_with_number(i, imgs_caption_list, page, [102, 178, 255],
True)
draw_bbox_with_number(i, titles_list, page, [102, 102, 255], True) draw_bbox_with_number(i, titles_list, page, [102, 102, 255], True)
draw_bbox_with_number(i, texts_list, page, [153, 0, 76], True) draw_bbox_with_number(i, texts_list, page, [153, 0, 76], True)
draw_bbox_with_number(i, interequations_list, page, [0, 255, 0], True) draw_bbox_with_number(i, interequations_list, page, [0, 255, 0], True)
# Save the PDF # Save the PDF
pdf_docs.save(f"{out_path}/model.pdf") pdf_docs.save(f'{out_path}/{filename}_model.pdf')
\ No newline at end of file
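The three drawing functions in this file now share one naming pattern, `{out_path}/{filename}_<kind>.pdf`. A minimal sketch of that pattern follows; `debug_pdf_paths` is a hypothetical helper written for illustration and does not exist in magic-pdf, and `some_pdf` is the placeholder name used in the README headings.

```python
def debug_pdf_paths(out_path, filename):
    """Return the paths the three drawing functions would save to."""
    kinds = ("layout", "spans", "model")
    # Mirrors the f-strings in the diff, e.g. f'{out_path}/{filename}_layout.pdf'.
    return {kind: f"{out_path}/{filename}_{kind}.pdf" for kind in kinds}


paths = debug_pdf_paths("output", "some_pdf")
print(paths["layout"])  # output/some_pdf_layout.pdf
```

Prefixing each file with the PDF name means several PDFs parsed into the same output directory no longer overwrite one another's debug files.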
import os import os
from pathlib import Path
import click import click
from loguru import logger from loguru import logger
from pathlib import Path
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
import magic_pdf.model as model_config import magic_pdf.model as model_config
from magic_pdf.tools.common import parse_pdf_methods, do_parse
from magic_pdf.libs.version import __version__ from magic_pdf.libs.version import __version__
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.tools.common import do_parse, parse_pdf_methods
@click.command() @click.command()
@click.version_option(__version__, "--version", "-v", help="display the version and exit") @click.version_option(__version__,
'--version',
'-v',
help='display the version and exit')
@click.option( @click.option(
"-p", '-p',
"--path", '--path',
"path", 'path',
type=click.Path(exists=True), type=click.Path(exists=True),
required=True, required=True,
help="local pdf filepath or directory", help='local pdf filepath or directory',
) )
@click.option( @click.option(
"-o", '-o',
"--output-dir", '--output-dir',
"output_dir", 'output_dir',
type=str, type=click.Path(),
help="output local directory", required=True,
default="", help='output local directory',
) )
@click.option( @click.option(
"-m", '-m',
"--method", '--method',
"method", 'method',
type=parse_pdf_methods, type=parse_pdf_methods,
help="""the method for parsing pdf. help="""the method for parsing pdf.
ocr: using ocr technique to extract information from pdf. ocr: using ocr technique to extract information from pdf.
txt: suitable for the text-based pdf only and outperform ocr. txt: suitable for the text-based pdf only and outperform ocr.
auto: automatically choose the best method for parsing pdf from ocr and txt. auto: automatically choose the best method for parsing pdf from ocr and txt.
without method specified, auto will be used by default.""", without method specified, auto will be used by default.""",
default="auto", default='auto',
) )
def cli(path, output_dir, method): def cli(path, output_dir, method):
model_config.__use_inside_model__ = True model_config.__use_inside_model__ = True
model_config.__model_mode__ = "full" model_config.__model_mode__ = 'full'
if output_dir == "": os.makedirs(output_dir, exist_ok=True)
if os.path.isdir(path):
output_dir = os.path.join(path, "output")
else:
output_dir = os.path.join(os.path.dirname(path), "output")
def read_fn(path): def read_fn(path):
disk_rw = DiskReaderWriter(os.path.dirname(path)) disk_rw = DiskReaderWriter(os.path.dirname(path))
...@@ -69,11 +70,11 @@ def cli(path, output_dir, method): ...@@ -69,11 +70,11 @@ def cli(path, output_dir, method):
logger.exception(e) logger.exception(e)
if os.path.isdir(path): if os.path.isdir(path):
for doc_path in Path(path).glob("*.pdf"): for doc_path in Path(path).glob('*.pdf'):
parse_doc(doc_path) parse_doc(doc_path)
else: else:
parse_doc(path) parse_doc(path)
if __name__ == "__main__": if __name__ == '__main__':
cli() cli()
import os
import json as json_parse import json as json_parse
import click import os
from pathlib import Path from pathlib import Path
from magic_pdf.libs.path_utils import (
parse_s3path, import click
parse_s3_range_params,
remove_non_official_s3_args,
)
from magic_pdf.libs.config_reader import (
get_s3_config,
)
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
import magic_pdf.model as model_config import magic_pdf.model as model_config
from magic_pdf.tools.common import parse_pdf_methods, do_parse from magic_pdf.libs.config_reader import get_s3_config
from magic_pdf.libs.path_utils import (parse_s3_range_params, parse_s3path,
remove_non_official_s3_args)
from magic_pdf.libs.version import __version__ from magic_pdf.libs.version import __version__
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter
from magic_pdf.tools.common import do_parse, parse_pdf_methods
def read_s3_path(s3path): def read_s3_path(s3path):
bucket, key = parse_s3path(s3path) bucket, key = parse_s3path(s3path)
s3_ak, s3_sk, s3_endpoint = get_s3_config(bucket) s3_ak, s3_sk, s3_endpoint = get_s3_config(bucket)
s3_rw = S3ReaderWriter( s3_rw = S3ReaderWriter(s3_ak, s3_sk, s3_endpoint, 'auto',
s3_ak, s3_sk, s3_endpoint, "auto", remove_non_official_s3_args(s3path) remove_non_official_s3_args(s3path))
)
may_range_params = parse_s3_range_params(s3path) may_range_params = parse_s3_range_params(s3path)
if may_range_params is None or 2 != len(may_range_params): if may_range_params is None or 2 != len(may_range_params):
byte_start, byte_end = 0, None byte_start, byte_end = 0, None
else: else:
byte_start, byte_end = int(may_range_params[0]), int(may_range_params[1]) byte_start, byte_end = int(may_range_params[0]), int(
may_range_params[1])
return s3_rw.read_offset( return s3_rw.read_offset(
remove_non_official_s3_args(s3path), remove_non_official_s3_args(s3path),
byte_start, byte_start,
...@@ -38,51 +35,48 @@ def read_s3_path(s3path): ...@@ -38,51 +35,48 @@ def read_s3_path(s3path):
@click.group() @click.group()
@click.version_option(__version__, "--version", "-v", help="Show version information") @click.version_option(__version__, '--version', '-v', help='Show version information')
def cli(): def cli():
pass pass
@cli.command() @cli.command()
@click.option( @click.option(
"-j", '-j',
"--jsonl", '--jsonl',
"jsonl", 'jsonl',
type=str, type=str,
help="Input jsonl path; a local file or a file on s3", help='Input jsonl path; a local file or a file on s3',
required=True, required=True,
) )
@click.option( @click.option(
"-m", '-m',
"--method", '--method',
"method", 'method',
type=parse_pdf_methods, type=parse_pdf_methods,
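help="Parsing method. txt: parse a text-based pdf, ocr: parse the pdf with optical character recognition, auto: let the program choose the parsing method", help='Parsing method. txt: parse a text-based pdf, ocr: parse the pdf with optical character recognition, auto: let the program choose the parsing method',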
default="auto", default='auto',
) )
@click.option( @click.option(
"-o", '-o',
"--output-dir", '--output-dir',
"output_dir", 'output_dir',
type=str, type=click.Path(),
help="Output to a local directory", required=True,
default="", help='Output to a local directory',
default='',
) )
def jsonl(jsonl, method, output_dir): def jsonl(jsonl, method, output_dir):
model_config.__use_inside_model__ = False model_config.__use_inside_model__ = False
if jsonl.startswith("s3://"): if jsonl.startswith('s3://'):
jso = json_parse.loads(read_s3_path(jsonl).decode("utf-8")) jso = json_parse.loads(read_s3_path(jsonl).decode('utf-8'))
full_jsonl_path = "."
else: else:
full_jsonl_path = os.path.realpath(jsonl)
with open(jsonl) as f: with open(jsonl) as f:
jso = json_parse.loads(f.readline()) jso = json_parse.loads(f.readline())
os.makedirs(output_dir, exist_ok=True)
if output_dir == "": s3_file_path = jso.get('file_location')
output_dir = os.path.join(os.path.dirname(full_jsonl_path), "output")
s3_file_path = jso.get("file_location")
if s3_file_path is None: if s3_file_path is None:
s3_file_path = jso.get("path") s3_file_path = jso.get('path')
pdf_file_name = Path(s3_file_path).stem pdf_file_name = Path(s3_file_path).stem
pdf_data = read_s3_path(s3_file_path) pdf_data = read_s3_path(s3_file_path)
...@@ -91,7 +85,7 @@ def jsonl(jsonl, method, output_dir): ...@@ -91,7 +85,7 @@ def jsonl(jsonl, method, output_dir):
output_dir, output_dir,
pdf_file_name, pdf_file_name,
pdf_data, pdf_data,
jso["doc_layout_result"], jso['doc_layout_result'],
method, method,
f_dump_content_list=True, f_dump_content_list=True,
f_draw_model_bbox=True, f_draw_model_bbox=True,
...@@ -100,43 +94,46 @@ def jsonl(jsonl, method, output_dir): ...@@ -100,43 +94,46 @@ def jsonl(jsonl, method, output_dir):
@cli.command() @cli.command()
@click.option( @click.option(
"-p", '-p',
"--pdf", '--pdf',
"pdf", 'pdf',
type=click.Path(exists=True), type=click.Path(exists=True),
required=True, required=True,
help="Local PDF file", help='Local PDF file',
) )
@click.option( @click.option(
"-j", '-j',
"--json", '--json',
"json_data", 'json_data',
type=click.Path(exists=True), type=click.Path(exists=True),
required=True, required=True,
help="JSON data produced by local model inference", help='JSON data produced by local model inference',
)
@click.option(
"-o", "--output-dir", "output_dir", type=str, help="Local output directory", default=""
) )
@click.option('-o',
'--output-dir',
'output_dir',
type=click.Path(),
required=True,
help='Local output directory',
default='')
@click.option( @click.option(
"-m", '-m',
"--method", '--method',
"method", 'method',
type=parse_pdf_methods, type=parse_pdf_methods,
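help="Parsing method. txt: parse a text-based pdf, ocr: parse the pdf with optical character recognition, auto: let the program choose the parsing method", help='Parsing method. txt: parse a text-based pdf, ocr: parse the pdf with optical character recognition, auto: let the program choose the parsing method',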
default="auto", default='auto',
) )
def pdf(pdf, json_data, output_dir, method): def pdf(pdf, json_data, output_dir, method):
model_config.__use_inside_model__ = False model_config.__use_inside_model__ = False
full_pdf_path = os.path.realpath(pdf) full_pdf_path = os.path.realpath(pdf)
if output_dir == "": os.makedirs(output_dir, exist_ok=True)
output_dir = os.path.join(os.path.dirname(full_pdf_path), "output")
def read_fn(path): def read_fn(path):
disk_rw = DiskReaderWriter(os.path.dirname(path)) disk_rw = DiskReaderWriter(os.path.dirname(path))
return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN) return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)
model_json_list = json_parse.loads(read_fn(json_data).decode("utf-8")) model_json_list = json_parse.loads(read_fn(json_data).decode('utf-8'))
file_name = str(Path(full_pdf_path).stem) file_name = str(Path(full_pdf_path).stem)
pdf_data = read_fn(full_pdf_path) pdf_data = read_fn(full_pdf_path)
...@@ -151,5 +148,5 @@ def pdf(pdf, json_data, output_dir, method): ...@@ -151,5 +148,5 @@ def pdf(pdf, json_data, output_dir, method):
) )
if __name__ == "__main__": if __name__ == '__main__':
cli() cli()
import os
import json as json_parse
import copy import copy
import json as json_parse
import os
import click import click
from loguru import logger from loguru import logger
import magic_pdf.model as model_config
from magic_pdf.libs.draw_bbox import (draw_layout_bbox, draw_span_bbox,
drow_model_bbox)
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.libs.draw_bbox import draw_layout_bbox, draw_span_bbox, drow_model_bbox
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.pipe.OCRPipe import OCRPipe from magic_pdf.pipe.OCRPipe import OCRPipe
from magic_pdf.pipe.TXTPipe import TXTPipe from magic_pdf.pipe.TXTPipe import TXTPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
import magic_pdf.model as model_config from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
def prepare_env(output_dir, pdf_file_name, method): def prepare_env(output_dir, pdf_file_name, method):
local_parent_dir = os.path.join(output_dir, pdf_file_name, method) local_parent_dir = os.path.join(output_dir, pdf_file_name, method)
local_image_dir = os.path.join(str(local_parent_dir), "images") local_image_dir = os.path.join(str(local_parent_dir), 'images')
local_md_dir = local_parent_dir local_md_dir = local_parent_dir
os.makedirs(local_image_dir, exist_ok=True) os.makedirs(local_image_dir, exist_ok=True)
os.makedirs(local_md_dir, exist_ok=True) os.makedirs(local_md_dir, exist_ok=True)
...@@ -40,22 +43,22 @@ def do_parse( ...@@ -40,22 +43,22 @@ def do_parse(
f_draw_model_bbox=False, f_draw_model_bbox=False,
): ):
orig_model_list = copy.deepcopy(model_list) orig_model_list = copy.deepcopy(model_list)
local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name, parse_method) local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name,
parse_method)
image_writer, md_writer = DiskReaderWriter(local_image_dir), DiskReaderWriter( image_writer, md_writer = DiskReaderWriter(
local_md_dir local_image_dir), DiskReaderWriter(local_md_dir)
)
image_dir = str(os.path.basename(local_image_dir)) image_dir = str(os.path.basename(local_image_dir))
if parse_method == "auto": if parse_method == 'auto':
jso_useful_key = {"_pdf_type": "", "model_list": model_list} jso_useful_key = {'_pdf_type': '', 'model_list': model_list}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer, is_debug=True) pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer, is_debug=True)
elif parse_method == "txt": elif parse_method == 'txt':
pipe = TXTPipe(pdf_bytes, model_list, image_writer, is_debug=True) pipe = TXTPipe(pdf_bytes, model_list, image_writer, is_debug=True)
elif parse_method == "ocr": elif parse_method == 'ocr':
pipe = OCRPipe(pdf_bytes, model_list, image_writer, is_debug=True) pipe = OCRPipe(pdf_bytes, model_list, image_writer, is_debug=True)
else: else:
logger.error("unknown parse method") logger.error('unknown parse method')
exit(1) exit(1)
pipe.pipe_classify() pipe.pipe_classify()
...@@ -65,58 +68,65 @@ def do_parse( ...@@ -65,58 +68,65 @@ def do_parse(
pipe.pipe_analyze() pipe.pipe_analyze()
orig_model_list = copy.deepcopy(pipe.model_list) orig_model_list = copy.deepcopy(pipe.model_list)
else: else:
logger.error("need model list input") logger.error('need model list input')
exit(2) exit(2)
pipe.pipe_parse() pipe.pipe_parse()
pdf_info = pipe.pdf_mid_data["pdf_info"] pdf_info = pipe.pdf_mid_data['pdf_info']
if f_draw_layout_bbox: if f_draw_layout_bbox:
draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir) draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir, pdf_file_name)
if f_draw_span_bbox: if f_draw_span_bbox:
draw_span_bbox(pdf_info, pdf_bytes, local_md_dir) draw_span_bbox(pdf_info, pdf_bytes, local_md_dir, pdf_file_name)
if f_draw_model_bbox: if f_draw_model_bbox:
drow_model_bbox(orig_model_list, pdf_bytes, local_md_dir) drow_model_bbox(orig_model_list, pdf_bytes, local_md_dir,
pdf_file_name)
md_content = pipe.pipe_mk_markdown( md_content = pipe.pipe_mk_markdown(image_dir,
image_dir, drop_mode=DropMode.NONE, md_make_mode=f_make_md_mode drop_mode=DropMode.NONE,
) md_make_mode=f_make_md_mode)
if f_dump_md: if f_dump_md:
md_writer.write( md_writer.write(
content=md_content, content=md_content,
path=f"{pdf_file_name}.md", path=f'{pdf_file_name}.md',
mode=AbsReaderWriter.MODE_TXT, mode=AbsReaderWriter.MODE_TXT,
) )
if f_dump_middle_json: if f_dump_middle_json:
md_writer.write( md_writer.write(
content=json_parse.dumps(pipe.pdf_mid_data, ensure_ascii=False, indent=4), content=json_parse.dumps(pipe.pdf_mid_data,
path="middle.json", ensure_ascii=False,
indent=4),
path=f'{pdf_file_name}_middle.json',
mode=AbsReaderWriter.MODE_TXT, mode=AbsReaderWriter.MODE_TXT,
) )
if f_dump_model_json: if f_dump_model_json:
md_writer.write( md_writer.write(
content=json_parse.dumps(orig_model_list, ensure_ascii=False, indent=4), content=json_parse.dumps(orig_model_list,
path="model.json", ensure_ascii=False,
indent=4),
path=f'{pdf_file_name}_model.json',
mode=AbsReaderWriter.MODE_TXT, mode=AbsReaderWriter.MODE_TXT,
) )
if f_dump_orig_pdf: if f_dump_orig_pdf:
md_writer.write( md_writer.write(
content=pdf_bytes, content=pdf_bytes,
path="origin.pdf", path=f'{pdf_file_name}_origin.pdf',
mode=AbsReaderWriter.MODE_BIN, mode=AbsReaderWriter.MODE_BIN,
) )
content_list = pipe.pipe_mk_uni_format(image_dir, drop_mode=DropMode.NONE) content_list = pipe.pipe_mk_uni_format(image_dir, drop_mode=DropMode.NONE)
if f_dump_content_list: if f_dump_content_list:
md_writer.write( md_writer.write(
content=json_parse.dumps(content_list, ensure_ascii=False, indent=4), content=json_parse.dumps(content_list,
path="content_list.json", ensure_ascii=False,
indent=4),
path=f'{pdf_file_name}_content_list.json',
mode=AbsReaderWriter.MODE_TXT, mode=AbsReaderWriter.MODE_TXT,
) )
logger.info(f"local output dir is {local_md_dir}") logger.info(f'local output dir is {local_md_dir}')
parse_pdf_methods = click.Choice(["ocr", "txt", "auto"]) parse_pdf_methods = click.Choice(['ocr', 'txt', 'auto'])
dependent on the service headway and the reliability of the departure time of the service to which passengers are incident.
After briefly introducing the random incidence model, which is often assumed to hold at short headways, the balance of this section reviews six studies of passenger incidence behavior that are motivated by understanding the relationships between service headway, service reliability, passenger incidence behavior, and passenger waiting time in a more nuanced fashion than is embedded in the random incidence assumption (2). Three of these studies depend on manually collected data, two studies use data from AFC systems, and one study analyzes the issue purely theoretically. These studies reveal much about passenger incidence behavior, but all are found to be limited in their general applicability by the methods with which they collect information about passengers and the services those passengers intend to use.
# Random Passenger Incidence Behavior
One characterization of passenger incidence behavior is that of random incidence (3). The key assumption underlying the random incidence model is that the process of passenger arrivals to the public transport service is independent from the vehicle departure process of the service. This implies that passengers become incident to the service at a random time, and thus the instantaneous rate of passenger arrivals to the service is uniform over a given period of time. Let $W$ and $H$ be random variables representing passenger waiting times and service headways, respectively. Under the random incidence assumption and the assumption that vehicle capacity is not a binding constraint, a classic result of transportation science is that
$$
E\left(W\right)=\frac{E\left[H^{2}\right]}{2E\left[H\right]}=\frac{E\left[H\right]}{2}\left(1+\operatorname{CV}\left(H\right)^{2}\right)
$$
where $E[X]$ is the probabilistic expectation of some random variable $X$ and $\operatorname{CV}(H)$ is the coefficient of variation of $H$, a unitless measure of the variability of $H$ defined as
$$
\operatorname{CV}\left(H\right)=\frac{\sigma_{H}}{E\left[H\right]}
$$
where $\sigma_{H}$ is the standard deviation of $H$ (4). The second expression in Equation 1 is particularly useful because it expresses the mean passenger waiting time as the sum of two components: the waiting time caused by the mean headway (i.e., the reciprocal of service frequency) and the waiting time caused by the variability of the headways (which is one measure of service reliability). When the service is perfectly reliable with constant headways, the mean waiting time will be simply half the headway.
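As a rough illustration of the waiting-time result above, the following sketch evaluates both forms of the expression on a list of observed headways. It assumes the observed headways are treated as an empirical distribution in which each headway is equally likely, so the population standard deviation is used; the function names `expected_wait` and `expected_wait_from_cv` are hypothetical and do not come from the source.

```python
from statistics import mean, pstdev


def expected_wait(headways):
    """Mean wait under random incidence: E[H^2] / (2 * E[H])."""
    mean_h = mean(headways)
    mean_h_squared = mean(h * h for h in headways)
    return mean_h_squared / (2 * mean_h)


def expected_wait_from_cv(headways):
    """Equivalent form: (E[H] / 2) * (1 + CV(H)^2)."""
    mean_h = mean(headways)
    # Assumption: population standard deviation, matching the
    # empirical-distribution reading of the headway sample.
    cv = pstdev(headways) / mean_h
    return (mean_h / 2) * (1 + cv ** 2)


# Perfectly regular service: the wait is half the headway.
regular = [10, 10, 10]
# Same mean headway (10 min), but irregular departures.
irregular = [5, 15]

print(expected_wait(regular), expected_wait(irregular))
print(expected_wait_from_cv(regular), expected_wait_from_cv(irregular))
```

With the same mean headway, the irregular service yields a longer mean wait, which is the variability term in the second form of the expression.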
# More Behaviorally Realistic Incidence Models
Jolliffe and Hutchinson studied bus passenger incidence in South London suburbs (5). They observed 10 bus stops for 1 h per day over 8 days, recording the times of passenger incidence and actual and scheduled bus departures. They limited their stop selection to those served by only a single bus route with a single service pattern so as to avoid ambiguity about which service a passenger was waiting for. The authors found that the actual average passenger waiting time was 30% less than predicted by the random incidence model. They also found that the empirical distributions of passenger incidence times (by time of day) had peaks just before the respective average bus departure times. They hypothesized the existence of three classes of passengers: with proportion $q$, passengers whose time of incidence is causally coincident with that of a bus departure (e.g., because they saw the approaching bus from their home or a shop window); with proportion $p(1-q)$, passengers who time their arrivals to minimize expected waiting time; and with proportion $(1-p)(1-q)$, passengers who are randomly incident. The authors found that $p$ was positively correlated with the potential reduction in waiting time (compared with arriving randomly) that resulted from knowledge of the timetable and of service reliability. They also found $p$ to be higher in the peak commuting periods rather than in the off-peak periods, indicating more awareness of the timetable or historical reliability, or both, by commuters.
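The three-class decomposition described above can be sketched as a simple mixture. The example below only encodes the proportions $q$, $p(1-q)$, and $(1-p)(1-q)$ stated in the text; the per-class mean waiting times passed to `mixture_mean_wait` are hypothetical inputs chosen for illustration, not values reported by the study, and both function names are assumptions.

```python
def class_shares(q, p):
    """Shares of the three passenger classes for given q and p."""
    if not (0.0 <= q <= 1.0 and 0.0 <= p <= 1.0):
        raise ValueError("q and p must be proportions in [0, 1]")
    return {
        "coincident": q,               # incidence coincides with a departure
        "timed": p * (1 - q),          # arrivals timed to minimize waiting
        "random": (1 - p) * (1 - q),   # randomly incident passengers
    }


def mixture_mean_wait(q, p, wait_coincident, wait_timed, wait_random):
    """Overall mean wait as the share-weighted average of class mean waits.

    The three per-class mean waits are illustrative inputs (assumptions).
    """
    shares = class_shares(q, p)
    return (shares["coincident"] * wait_coincident
            + shares["timed"] * wait_timed
            + shares["random"] * wait_random)


shares = class_shares(q=0.2, p=0.5)
print(shares, sum(shares.values()))
print(mixture_mean_wait(0.2, 0.5, wait_coincident=0.0,
                        wait_timed=2.0, wait_random=5.0))
```

Because the three shares always sum to one, raising $p$ shifts weight from the random class to the timed class without changing the coincident share.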
Bowman and Turnquist built on the concept of aware and unaware passengers of proportions $p$ and $(1-p)$, respectively. They proposed a utility-based model to estimate $p$ and the distribution of incidence times, and thus the mean waiting time, of aware passengers over a given headway as a function of the headway and reliability of bus departure times (1). They observed seven bus stops in Chicago, Illinois, each served by a single (different) bus route, between 6:00 and 8:00 a.m. for 5 to 10 days each. The bus routes had headways of 5 to 20 min and a range of reliabilities. The authors found that actual average waiting time was substantially less than predicted by the random incidence model. They estimated that $p$ was not statistically significantly different from 1.0, which they explain by the fact that all observations were taken during peak commuting times. Their model predicts that the longer the headway and the more reliable the departures, the more peaked the distribution of incidence times will be and the closer that peak will be to the next scheduled departure time. This prediction demonstrates what they refer to as a safety margin that passengers add to reduce the chance of missing their bus when the service is known to be somewhat unreliable. Such a safety margin can also result from unreliability in passengers’ journeys to the public transport stop or station. Bowman and Turnquist conclude from their model that the random incidence model underestimates the waiting time benefits of improving reliability and overestimates the waiting time benefits of increasing service frequency. This is because as reliability increases passengers can better predict departure times and so can time their incidence to decrease their waiting time.
Furth and Muller study the issue in a theoretical context and generally agree with the above findings (2). They are primarily concerned with the use of data from automatic vehicle-tracking systems to assess the impacts of reliability on passenger incidence behavior and waiting times. They propose that passengers will react to unreliability by departing earlier than they would with reliable services. Randomly incident unaware passengers will experience unreliability as a more dispersed distribution of headways and simply allocate additional time to their trip plan to improve the chance of arriving at their destination on time. Aware passengers, whose incidence is not entirely random, will react by timing their incidence somewhat earlier than the scheduled departure time to increase their chance of catching the desired service. The authors characterize these reactions as the costs of unreliability.
Luethi et al. continued with the analysis of manually collected data on actual passenger behavior (6). They use the language of probability to describe two classes of passengers. The first is timetable-dependent passengers (i.e., the aware passengers), whose incidence behavior is affected by awareness (possibly gained
[
{
"layout_dets": [
{
"category_id": 1,
"poly": [
882.4013061523438,
169.93817138671875,
1552.350341796875,
169.93817138671875,
1552.350341796875,
625.8263549804688,
882.4013061523438,
625.8263549804688
],
"score": 0.999992311000824
},
{
"category_id": 1,
"poly": [
882.474853515625,
1450.92822265625,
1551.4490966796875,
1450.92822265625,
1551.4490966796875,
1877.5712890625,
882.474853515625,
1877.5712890625
],
"score": 0.9999903440475464
},
{
"category_id": 1,
"poly": [
881.6513061523438,
626.2058715820312,
1552.1400146484375,
626.2058715820312,
1552.1400146484375,
1450.604736328125,
881.6513061523438,
1450.604736328125
],
"score": 0.9999856352806091
},
{
"category_id": 1,
"poly": [
149.41075134277344,
232.1595001220703,
819.0465087890625,
232.1595001220703,
819.0465087890625,
625.8865356445312,
149.41075134277344,
625.8865356445312
],
"score": 0.99998539686203
},
{
"category_id": 1,
"poly": [
149.3945770263672,
1215.5172119140625,
817.8850708007812,
1215.5172119140625,
817.8850708007812,
1304.873291015625,
149.3945770263672,
1304.873291015625
],
"score": 0.9999765157699585
},
{
"category_id": 1,
"poly": [
882.6979370117188,
1880.13916015625,
1552.15185546875,
1880.13916015625,
1552.15185546875,
2031.339599609375,
882.6979370117188,
2031.339599609375
],
"score": 0.9999744892120361
},
{
"category_id": 1,
"poly": [
148.96054077148438,
743.3055419921875,
818.6231689453125,
743.3055419921875,
818.6231689453125,
1074.2369384765625,
148.96054077148438,
1074.2369384765625
],
"score": 0.9999669790267944
},
{
"category_id": 1,
"poly": [
148.8435516357422,
1791.14306640625,
818.6885375976562,
1791.14306640625,
818.6885375976562,
2030.794189453125,
148.8435516357422,
2030.794189453125
],
"score": 0.9999618530273438
},
{
"category_id": 0,
"poly": [
150.7009735107422,
684.0087890625,
623.5106201171875,
684.0087890625,
623.5106201171875,
717.03662109375,
150.7009735107422,
717.03662109375
],
"score": 0.9999415278434753
},
{
"category_id": 8,
"poly": [
146.48068237304688,
1331.6737060546875,
317.2640075683594,
1331.6737060546875,
317.2640075683594,
1400.1722412109375,
146.48068237304688,
1400.1722412109375
],
"score": 0.9998958110809326
},
{
"category_id": 1,
"poly": [
149.42420959472656,
1430.8782958984375,
818.9042358398438,
1430.8782958984375,
818.9042358398438,
1672.7386474609375,
149.42420959472656,
1672.7386474609375
],
"score": 0.9998599290847778
},
{
"category_id": 1,
"poly": [
149.18746948242188,
172.10252380371094,
818.5662231445312,
172.10252380371094,
818.5662231445312,
230.4594268798828,
149.18746948242188,
230.4594268798828
],
"score": 0.9997718334197998
},
{
"category_id": 0,
"poly": [
149.0175018310547,
1732.1090087890625,
702.1005859375,
1732.1090087890625,
702.1005859375,
1763.6046142578125,
149.0175018310547,
1763.6046142578125
],
"score": 0.9997085928916931
},
{
"category_id": 2,
"poly": [
1519.802490234375,
98.59099578857422,
1551.985107421875,
98.59099578857422,
1551.985107421875,
119.48420715332031,
1519.802490234375,
119.48420715332031
],
"score": 0.9995552897453308
},
{
"category_id": 8,
"poly": [
146.9109649658203,
1100.156494140625,
544.2803344726562,
1100.156494140625,
544.2803344726562,
1184.929443359375,
146.9109649658203,
1184.929443359375
],
"score": 0.9995207786560059
},
{
"category_id": 2,
"poly": [
148.11611938476562,
99.87767791748047,
318.926025390625,
99.87767791748047,
318.926025390625,
120.70393371582031,
148.11611938476562,
120.70393371582031
],
"score": 0.999351441860199
},
{
"category_id": 9,
"poly": [
791.7642211914062,
1130.056396484375,
818.6940307617188,
1130.056396484375,
818.6940307617188,
1161.1080322265625,
791.7642211914062,
1161.1080322265625
],
"score": 0.9908884763717651
},
{
"category_id": 9,
"poly": [
788.37060546875,
1346.8450927734375,
818.5010986328125,
1346.8450927734375,
818.5010986328125,
1377.370361328125,
788.37060546875,
1377.370361328125
],
"score": 0.9873985052108765
},
{
"category_id": 14,
"poly": [
146,
1103,
543,
1103,
543,
1184,
146,
1184
],
"score": 0.94,
"latex": "E\\!\\left(W\\right)\\!=\\!\\frac{E\\!\\left[H^{2}\\right]}{2E\\!\\left[H\\right]}\\!=\\!\\frac{E\\!\\left[H\\right]}{2}\\!\\!\\left(1\\!+\\!\\operatorname{CV}\\!\\left(H\\right)^{2}\\right)"
},
{
"category_id": 13,
"poly": [
1196,
354,
1278,
354,
1278,
384,
1196,
384
],
"score": 0.91,
"latex": "p(1-q)"
},
{
"category_id": 13,
"poly": [
881,
415,
1020,
415,
1020,
444,
881,
444
],
"score": 0.91,
"latex": "(1-p)(1-q)"
},
{
"category_id": 14,
"poly": [
147,
1333,
318,
1333,
318,
1400,
147,
1400
],
"score": 0.91,
"latex": "\\mathbf{CV}\\big(H\\big)\\!=\\!\\frac{\\boldsymbol{\\upsigma}_{H}}{E\\big[H\\big]}"
},
{
"category_id": 13,
"poly": [
1197,
657,
1263,
657,
1263,
686,
1197,
686
],
"score": 0.9,
"latex": "(1-p)"
},
{
"category_id": 13,
"poly": [
213,
1217,
263,
1217,
263,
1244,
213,
1244
],
"score": 0.88,
"latex": "E[X]"
},
{
"category_id": 13,
"poly": [
214,
1434,
245,
1434,
245,
1459,
214,
1459
],
"score": 0.87,
"latex": "\\upsigma_{H}"
},
{
"category_id": 13,
"poly": [
324,
2002,
373,
2002,
373,
2028,
324,
2028
],
"score": 0.84,
"latex": "30\\%"
},
{
"category_id": 13,
"poly": [
1209,
693,
1225,
693,
1225,
717,
1209,
717
],
"score": 0.83,
"latex": "p"
},
{
"category_id": 13,
"poly": [
990,
449,
1007,
449,
1007,
474,
990,
474
],
"score": 0.81,
"latex": "p"
},
{
"category_id": 13,
"poly": [
346,
1277,
369,
1277,
369,
1301,
346,
1301
],
"score": 0.81,
"latex": "H"
},
{
"category_id": 13,
"poly": [
1137,
661,
1154,
661,
1154,
686,
1137,
686
],
"score": 0.81,
"latex": "p"
},
{
"category_id": 13,
"poly": [
522,
1432,
579,
1432,
579,
1459,
522,
1459
],
"score": 0.81,
"latex": "H\\left(4\\right)"
},
{
"category_id": 13,
"poly": [
944,
540,
962,
540,
962,
565,
944,
565
],
"score": 0.8,
"latex": "p"
},
{
"category_id": 13,
"poly": [
1444,
936,
1461,
936,
1461,
961,
1444,
961
],
"score": 0.79,
"latex": "p"
},
{
"category_id": 13,
"poly": [
602,
1247,
624,
1247,
624,
1270,
602,
1270
],
"score": 0.78,
"latex": "H"
},
{
"category_id": 13,
"poly": [
147,
1247,
167,
1247,
167,
1271,
147,
1271
],
"score": 0.77,
"latex": "X"
},
{
"category_id": 13,
"poly": [
210,
1246,
282,
1246,
282,
1274,
210,
1274
],
"score": 0.77,
"latex": "\\operatorname{CV}(H)"
},
{
"category_id": 13,
"poly": [
1346,
268,
1361,
268,
1361,
292,
1346,
292
],
"score": 0.76,
"latex": "q"
},
{
"category_id": 13,
"poly": [
215,
957,
238,
957,
238,
981,
215,
981
],
"score": 0.74,
"latex": "H"
},
{
"category_id": 13,
"poly": [
149,
956,
173,
956,
173,
981,
149,
981
],
"score": 0.63,
"latex": "W"
},
{
"category_id": 13,
"poly": [
924,
841,
1016,
841,
1016,
868,
924,
868
],
"score": 0.56,
"latex": "8{\\cdot}00\\;\\mathrm{a.m}"
},
{
"category_id": 13,
"poly": [
956,
871,
1032,
871,
1032,
898,
956,
898
],
"score": 0.43,
"latex": "20~\\mathrm{min}"
},
{
"category_id": 13,
"poly": [
1082,
781,
1112,
781,
1112,
808,
1082,
808
],
"score": 0.41,
"latex": "(l)"
},
{
"category_id": 13,
"poly": [
697,
1821,
734,
1821,
734,
1847,
697,
1847
],
"score": 0.3,
"latex": "^{1\\mathrm{~h~}}"
}
],
"page_info": {
"page_no": 0,
"height": 2200,
"width": 1700
}
}
]
import tempfile
import os import os
import shutil import shutil
import tempfile
from click.testing import CliRunner from click.testing import CliRunner
from magic_pdf.tools.cli import cli from magic_pdf.tools.cli import cli
...@@ -8,19 +9,20 @@ from magic_pdf.tools.cli import cli ...@@ -8,19 +9,20 @@ from magic_pdf.tools.cli import cli
def test_cli_pdf(): def test_cli_pdf():
# setup # setup
unitest_dir = "/tmp/magic_pdf/unittest/tools" unitest_dir = '/tmp/magic_pdf/unittest/tools'
filename = "cli_test_01" filename = 'cli_test_01'
os.makedirs(unitest_dir, exist_ok=True) os.makedirs(unitest_dir, exist_ok=True)
temp_output_dir = tempfile.mkdtemp(dir="/tmp/magic_pdf/unittest/tools") temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
os.makedirs(temp_output_dir, exist_ok=True)
# run # run
runner = CliRunner() runner = CliRunner()
result = runner.invoke( result = runner.invoke(
cli, cli,
[ [
"-p", '-p',
"tests/test_tools/assets/cli/pdf/cli_test_01.pdf", 'tests/test_tools/assets/cli/pdf/cli_test_01.pdf',
"-o", '-o',
temp_output_dir, temp_output_dir,
], ],
) )
...@@ -28,29 +30,31 @@ def test_cli_pdf(): ...@@ -28,29 +30,31 @@ def test_cli_pdf():
# check # check
assert result.exit_code == 0 assert result.exit_code == 0
base_output_dir = os.path.join(temp_output_dir, "cli_test_01/auto") base_output_dir = os.path.join(temp_output_dir, 'cli_test_01/auto')
r = os.stat(os.path.join(base_output_dir, f"{filename}.md")) r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
assert r.st_size > 7000 assert r.st_size > 7000
r = os.stat(os.path.join(base_output_dir, "middle.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
assert r.st_size > 200000 assert r.st_size > 200000
r = os.stat(os.path.join(base_output_dir, "model.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
assert r.st_size > 15000 assert r.st_size > 15000
r = os.stat(os.path.join(base_output_dir, "origin.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "layout.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "spans.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
assert os.path.exists(os.path.join(base_output_dir, "images")) is True assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
assert os.path.isdir(os.path.join(base_output_dir, "images")) is True assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True
assert os.path.exists(os.path.join(base_output_dir, "content_list.json")) is False assert os.path.exists(
os.path.join(base_output_dir,
f'{filename}_content_list.json')) is False
# teardown # teardown
shutil.rmtree(temp_output_dir) shutil.rmtree(temp_output_dir)
...@@ -58,68 +62,72 @@ def test_cli_pdf(): ...@@ -58,68 +62,72 @@ def test_cli_pdf():
def test_cli_path(): def test_cli_path():
# setup # setup
unitest_dir = "/tmp/magic_pdf/unittest/tools" unitest_dir = '/tmp/magic_pdf/unittest/tools'
os.makedirs(unitest_dir, exist_ok=True) os.makedirs(unitest_dir, exist_ok=True)
temp_output_dir = tempfile.mkdtemp(dir="/tmp/magic_pdf/unittest/tools") temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
os.makedirs(temp_output_dir, exist_ok=True)
# run # run
runner = CliRunner() runner = CliRunner()
result = runner.invoke( result = runner.invoke(
cli, ["-p", "tests/test_tools/assets/cli/path", "-o", temp_output_dir] cli, ['-p', 'tests/test_tools/assets/cli/path', '-o', temp_output_dir])
)
# check # check
assert result.exit_code == 0 assert result.exit_code == 0
filename = "cli_test_01" filename = 'cli_test_01'
base_output_dir = os.path.join(temp_output_dir, "cli_test_01/auto") base_output_dir = os.path.join(temp_output_dir, 'cli_test_01/auto')
r = os.stat(os.path.join(base_output_dir, f"{filename}.md")) r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
assert r.st_size > 7000 assert r.st_size > 7000
r = os.stat(os.path.join(base_output_dir, "middle.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
assert r.st_size > 200000 assert r.st_size > 200000
r = os.stat(os.path.join(base_output_dir, "model.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
assert r.st_size > 15000 assert r.st_size > 15000
r = os.stat(os.path.join(base_output_dir, "origin.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "layout.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "spans.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
assert os.path.exists(os.path.join(base_output_dir, "images")) is True assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
assert os.path.isdir(os.path.join(base_output_dir, "images")) is True assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True
assert os.path.exists(os.path.join(base_output_dir, "content_list.json")) is False assert os.path.exists(
os.path.join(base_output_dir,
f'{filename}_content_list.json')) is False
base_output_dir = os.path.join(temp_output_dir, "cli_test_02/auto") base_output_dir = os.path.join(temp_output_dir, 'cli_test_02/auto')
filename = "cli_test_02" filename = 'cli_test_02'
r = os.stat(os.path.join(base_output_dir, f"{filename}.md")) r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
assert r.st_size > 5000 assert r.st_size > 5000
r = os.stat(os.path.join(base_output_dir, "middle.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
assert r.st_size > 200000 assert r.st_size > 200000
r = os.stat(os.path.join(base_output_dir, "model.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
assert r.st_size > 15000 assert r.st_size > 15000
r = os.stat(os.path.join(base_output_dir, "origin.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "layout.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "spans.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
assert os.path.exists(os.path.join(base_output_dir, "images")) is True assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
assert os.path.isdir(os.path.join(base_output_dir, "images")) is True assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True
assert os.path.exists(os.path.join(base_output_dir, "content_list.json")) is False assert os.path.exists(
os.path.join(base_output_dir,
f'{filename}_content_list.json')) is False
# teardown # teardown
shutil.rmtree(temp_output_dir) shutil.rmtree(temp_output_dir)
import tempfile
import os import os
import shutil import shutil
import tempfile
from click.testing import CliRunner from click.testing import CliRunner
from magic_pdf.tools import cli_dev from magic_pdf.tools import cli_dev
...@@ -8,22 +9,23 @@ from magic_pdf.tools import cli_dev ...@@ -8,22 +9,23 @@ from magic_pdf.tools import cli_dev
def test_cli_pdf(): def test_cli_pdf():
# setup # setup
unitest_dir = "/tmp/magic_pdf/unittest/tools" unitest_dir = '/tmp/magic_pdf/unittest/tools'
filename = "cli_test_01" filename = 'cli_test_01'
os.makedirs(unitest_dir, exist_ok=True) os.makedirs(unitest_dir, exist_ok=True)
temp_output_dir = tempfile.mkdtemp(dir="/tmp/magic_pdf/unittest/tools") temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
os.makedirs(temp_output_dir, exist_ok=True)
# run # run
runner = CliRunner() runner = CliRunner()
result = runner.invoke( result = runner.invoke(
cli_dev.cli, cli_dev.cli,
[ [
"pdf", 'pdf',
"-p", '-p',
"tests/test_tools/assets/cli/pdf/cli_test_01.pdf", 'tests/test_tools/assets/cli/pdf/cli_test_01.pdf',
"-j", '-j',
"tests/test_tools/assets/cli_dev/cli_test_01.model.json", 'tests/test_tools/assets/cli_dev/cli_test_01.model.json',
"-o", '-o',
temp_output_dir, temp_output_dir,
], ],
) )
...@@ -31,31 +33,31 @@ def test_cli_pdf(): ...@@ -31,31 +33,31 @@ def test_cli_pdf():
# check # check
assert result.exit_code == 0 assert result.exit_code == 0
base_output_dir = os.path.join(temp_output_dir, "cli_test_01/auto") base_output_dir = os.path.join(temp_output_dir, 'cli_test_01/auto')
r = os.stat(os.path.join(base_output_dir, "content_list.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_content_list.json'))
assert r.st_size > 5000 assert r.st_size > 5000
r = os.stat(os.path.join(base_output_dir, f"{filename}.md")) r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
assert r.st_size > 7000 assert r.st_size > 7000
r = os.stat(os.path.join(base_output_dir, "middle.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
assert r.st_size > 200000 assert r.st_size > 200000
r = os.stat(os.path.join(base_output_dir, "model.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
assert r.st_size > 15000 assert r.st_size > 15000
r = os.stat(os.path.join(base_output_dir, "origin.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "layout.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "spans.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
assert os.path.exists(os.path.join(base_output_dir, "images")) is True assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
assert os.path.isdir(os.path.join(base_output_dir, "images")) is True assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True
# teardown # teardown
shutil.rmtree(temp_output_dir) shutil.rmtree(temp_output_dir)
...@@ -63,13 +65,14 @@ def test_cli_pdf(): ...@@ -63,13 +65,14 @@ def test_cli_pdf():
def test_cli_jsonl(): def test_cli_jsonl():
# setup # setup
unitest_dir = "/tmp/magic_pdf/unittest/tools" unitest_dir = '/tmp/magic_pdf/unittest/tools'
filename = "cli_test_01" filename = 'cli_test_01'
os.makedirs(unitest_dir, exist_ok=True) os.makedirs(unitest_dir, exist_ok=True)
temp_output_dir = tempfile.mkdtemp(dir="/tmp/magic_pdf/unittest/tools") temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
os.makedirs(temp_output_dir, exist_ok=True)
def mock_read_s3_path(s3path): def mock_read_s3_path(s3path):
with open(s3path, "rb") as f: with open(s3path, 'rb') as f:
return f.read() return f.read()
cli_dev.read_s3_path = mock_read_s3_path # mock cli_dev.read_s3_path = mock_read_s3_path # mock
...@@ -79,10 +82,10 @@ def test_cli_jsonl(): ...@@ -79,10 +82,10 @@ def test_cli_jsonl():
result = runner.invoke( result = runner.invoke(
cli_dev.cli, cli_dev.cli,
[ [
"jsonl", 'jsonl',
"-j", '-j',
"tests/test_tools/assets/cli_dev/cli_test_01.jsonl", 'tests/test_tools/assets/cli_dev/cli_test_01.jsonl',
"-o", '-o',
temp_output_dir, temp_output_dir,
], ],
) )
...@@ -90,31 +93,31 @@ def test_cli_jsonl(): ...@@ -90,31 +93,31 @@ def test_cli_jsonl():
# check # check
assert result.exit_code == 0 assert result.exit_code == 0
base_output_dir = os.path.join(temp_output_dir, "cli_test_01/auto") base_output_dir = os.path.join(temp_output_dir, 'cli_test_01/auto')
r = os.stat(os.path.join(base_output_dir, "content_list.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_content_list.json'))
assert r.st_size > 5000 assert r.st_size > 5000
r = os.stat(os.path.join(base_output_dir, f"{filename}.md")) r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
assert r.st_size > 7000 assert r.st_size > 7000
r = os.stat(os.path.join(base_output_dir, "middle.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
assert r.st_size > 200000 assert r.st_size > 200000
r = os.stat(os.path.join(base_output_dir, "model.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
assert r.st_size > 15000 assert r.st_size > 15000
r = os.stat(os.path.join(base_output_dir, "origin.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "layout.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "spans.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
assert os.path.exists(os.path.join(base_output_dir, "images")) is True assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
assert os.path.isdir(os.path.join(base_output_dir, "images")) is True assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True
# teardown # teardown
shutil.rmtree(temp_output_dir) shutil.rmtree(temp_output_dir)
import tempfile
import os import os
import shutil import shutil
import tempfile
import pytest import pytest
import magic_pdf.model as model_config
from magic_pdf.tools.common import do_parse from magic_pdf.tools.common import do_parse
@pytest.mark.parametrize("method", ["auto", "txt", "ocr"]) @pytest.mark.parametrize('method', ['auto', 'txt', 'ocr'])
def test_common_do_parse(method): def test_common_do_parse(method):
# setup # setup
unitest_dir = "/tmp/magic_pdf/unittest/tools" model_config.__use_inside_model__ = True
filename = "fake" unitest_dir = '/tmp/magic_pdf/unittest/tools'
filename = 'fake'
os.makedirs(unitest_dir, exist_ok=True) os.makedirs(unitest_dir, exist_ok=True)
temp_output_dir = tempfile.mkdtemp(dir="/tmp/magic_pdf/unittest/tools") temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
os.makedirs(temp_output_dir, exist_ok=True)
# run # run
with open("tests/test_tools/assets/common/cli_test_01.pdf", "rb") as f: with open('tests/test_tools/assets/common/cli_test_01.pdf', 'rb') as f:
bits = f.read() bits = f.read()
do_parse(temp_output_dir, filename, bits, [], method, f_dump_content_list=True) do_parse(temp_output_dir,
filename,
bits, [],
method,
f_dump_content_list=True)
# check # check
base_output_dir = os.path.join(temp_output_dir, f"fake/{method}") base_output_dir = os.path.join(temp_output_dir, f'fake/{method}')
r = os.stat(os.path.join(base_output_dir, "content_list.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_content_list.json'))
assert r.st_size > 5000 assert r.st_size > 5000
r = os.stat(os.path.join(base_output_dir, f"{filename}.md")) r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
assert r.st_size > 7000 assert r.st_size > 7000
r = os.stat(os.path.join(base_output_dir, "middle.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
assert r.st_size > 200000 assert r.st_size > 200000
r = os.stat(os.path.join(base_output_dir, "model.json")) r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
assert r.st_size > 15000 assert r.st_size > 15000
r = os.stat(os.path.join(base_output_dir, "origin.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "layout.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "spans.pdf")) r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
assert r.st_size > 500000 assert r.st_size > 500000
os.path.exists(os.path.join(base_output_dir, "images")) os.path.exists(os.path.join(base_output_dir, 'images'))
os.path.isdir(os.path.join(base_output_dir, "images")) os.path.isdir(os.path.join(base_output_dir, 'images'))
# teardown # teardown
shutil.rmtree(temp_output_dir) shutil.rmtree(temp_output_dir)
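Review note: every updated test above repeats the same naming rule. The Markdown file stays `{filename}.md`, and every other artifact gains a `{filename}_` prefix. A minimal sketch of that rule as a shared test helper follows. `expected_output_names` and `missing_outputs` are hypothetical names used for illustration only; they are not part of `magic_pdf`, and the list of suffixes is taken solely from the assertions in this diff.

```python
import os


def expected_output_names(filename, with_content_list=False):
    """List the file names the updated tests look for in <output>/<filename>/<method>/.

    Hypothetical helper: the suffixes mirror the assertions in this diff only.
    """
    # The Markdown output keeps its old name; it was already prefixed.
    names = [f'{filename}.md']
    suffixes = ['middle.json', 'model.json', 'origin.pdf',
                'layout.pdf', 'spans.pdf']
    if with_content_list:
        # Only the cli_dev and do_parse tests expect a content list.
        suffixes.insert(0, 'content_list.json')
    names += [f'{filename}_{suffix}' for suffix in suffixes]
    return names


def missing_outputs(base_output_dir, filename, with_content_list=False):
    """Return the expected file names that are absent from base_output_dir."""
    return [
        name
        for name in expected_output_names(filename, with_content_list)
        if not os.path.isfile(os.path.join(base_output_dir, name))
    ]
```

A test could then assert `missing_outputs(base_output_dir, filename) == []` before running the per-file size checks, so a naming regression is reported by file name rather than as a bare `FileNotFoundError` from `os.stat`.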