Unverified Commit c9a51491 authored by icecraft, committed by GitHub

feat: rename the file generated by command line tools (#401)

* feat: rename the file generated by command line tools

* feat: add pdf filename as prefix to {span,layout,model}.pdf

---------
Co-authored-by: icecraft <tmortred@gmail.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>
parent 041b9465
......@@ -5,6 +5,7 @@
</p>
<!-- icon -->
[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
......@@ -15,14 +16,17 @@
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
<!-- language -->
[English](README.md) | [简体中文](README_zh-CN.md)
<!-- hot link -->
<p align="center">
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: High-Quality PDF Extraction Toolkit</a>🔥🔥🔥
</p>
<!-- join us -->
<p align="center">
👋 join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
</p>
......@@ -30,11 +34,13 @@
</div>
# Changelog
- 2024/08/09: Version 0.7.0b1 released, simplified installation process, added table recognition functionality
- 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation
- 2024/07/05: Initial open-source release
<!-- TABLE OF CONTENT -->
<details open="open">
<summary><h2 style="display: inline-block">Table of Contents</h2></summary>
<ol>
......@@ -73,10 +79,10 @@
</ol>
</details>
# MinerU
## Project Introduction
MinerU is a tool that converts PDFs into machine-readable formats (e.g., markdown, JSON), allowing for easy extraction into any format.
MinerU was born during the pre-training process of [InternLM](https://github.com/InternLM/InternLM). We focus on solving symbol conversion issues in scientific literature and hope to contribute to technological development in the era of large models.
Compared to well-known commercial products, MinerU is still young. If you encounter any issues or if the results are not as expected, please report them on the [issue tracker](https://github.com/opendatalab/MinerU/issues) and **attach the relevant PDF**.
......@@ -100,6 +106,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
If you encounter any installation issues, please first consult the <a href="#faq">FAQ</a>. </br>
If the parsing results are not as expected, refer to the <a href="#known-issues">Known Issues</a>. </br>
There are three different ways to experience MinerU:
- [Online Demo (No Installation Required)](#online-demo)
- [Quick CPU Demo (Windows, Linux, Mac)](#quick-cpu-demo)
- [Linux/Windows + CUDA](#Using-GPU)
......@@ -168,33 +175,41 @@ In non-mainline environments, due to the diversity of hardware and software conf
### Quick CPU Demo
#### 1. Install magic-pdf
```bash
conda create -n MinerU python=3.10
conda activate MinerU
pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com
```
#### 2. Download model weight files
Refer to [How to Download Model Files](docs/how_to_download_models_en.md) for detailed instructions.
> ❗️After downloading the models, please make sure to verify the completeness of the model files.
>
>
> Check if the model file sizes match the description on the webpage. If possible, use sha256 to verify the integrity of the files.
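
The sha256 check mentioned above can be sketched in Python as follows. This is a minimal illustration using only the standard library; the helper names `sha256_of_file` and `matches_expected` are hypothetical, and the expected checksum has to come from wherever the model files are published.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read fixed-size chunks so large weight files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with a published checksum (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat regardless of how large the weight file is.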
#### 3. Copy and configure the template file
You can find the `magic-pdf.template.json` template configuration file in the root directory of the repository.
> ❗️Make sure to execute the following command to copy the configuration file to your **user directory**; otherwise, the program will not run.
>
>
> The user directory for Windows is `C:\Users\YourUsername`, for Linux it is `/home/YourUsername`, and for macOS it is `/Users/YourUsername`.
```bash
cp magic-pdf.template.json ~/magic-pdf.json
```
Find the `magic-pdf.json` file in your user directory and configure the "models-dir" path to point to the directory where the model weight files were downloaded in [Step 2](#2-download-model-weight-files).
> ❗️Make sure to correctly configure the **absolute path** to the model weight files directory, otherwise the program will not run because it can't find the model files.
>
> On Windows, this path should include the drive letter and all backslashes (`\`) in the path should be replaced with forward slashes (`/`) to avoid syntax errors in the JSON file due to escape sequences.
>
>
> For example: If the models are stored in the "models" directory at the root of the D drive, the "models-dir" value should be `D:/models`.
```json
{
// other config
......@@ -206,14 +221,13 @@ Find the `magic-pdf.json` file in your user directory and configure the "models-
}
```
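As a rough sketch of the path rule described above, a Windows path can be converted to the forward-slash form before it is written into the JSON file. The helper names `to_json_safe_path` and `set_models_dir` are hypothetical; only the `"models-dir"` key and the `D:/models` example come from the text above.

```python
import json
from pathlib import PureWindowsPath


def to_json_safe_path(windows_path):
    """Replace backslashes with forward slashes, keeping the drive letter."""
    return PureWindowsPath(windows_path).as_posix()


def set_models_dir(config, models_dir):
    """Return a copy of the config dict with "models-dir" set."""
    updated = dict(config)
    updated["models-dir"] = to_json_safe_path(models_dir)
    return updated


config = set_models_dir({}, r"D:\models")
print(json.dumps(config))  # {"models-dir": "D:/models"}
```

`PureWindowsPath` is used so the conversion behaves the same on any operating system.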
### Using GPU
If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Please select the appropriate guide based on your system:
- [Ubuntu 22.04 LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_en_US.md)
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
## Usage
### Command Line
......@@ -226,12 +240,12 @@ Options:
-v, --version display the version and exit
-p, --path PATH local pdf filepath or directory [required]
-o, --output-dir TEXT output local directory
-m, --method [ocr|txt|auto] the method for parsing pdf.
-m, --method [ocr|txt|auto] the method for parsing pdf.
ocr: using ocr technique to extract information from pdf,
txt: suitable for the text-based pdf only and outperform ocr,
auto: automatically choose the best method for parsing pdf
from ocr and txt.
without method specified, auto will be used by default.
without method specified, auto will be used by default.
--help Show this message and exit.
......@@ -246,13 +260,13 @@ magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
The results will be saved in the `{some_output_dir}` directory. The output file list is as follows:
```text
├── some_pdf.md # markdown file
├── images # directory for storing images
├── layout.pdf # layout diagram
├── middle.json # MinerU intermediate processing result
├── model.json # model inference result
├── origin.pdf # original PDF file
└── spans.pdf # smallest granularity bbox position information diagram
├── some_pdf.md # markdown file
├── images # directory for storing images
├── some_pdf_layout.pdf # layout diagram
├── some_pdf_middle.json # MinerU intermediate processing result
├── some_pdf_model.json # model inference result
├── some_pdf_origin.pdf # original PDF file
└── some_pdf_spans.pdf # smallest granularity bbox position information diagram
```
For more information about the output files, please refer to the [Output File Description](docs/output_file_en_us.md).
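The renamed file list above can be expressed as a small helper. This is only an illustration of the naming pattern, assuming the prefix is the input PDF's file name without its extension (as the `some_pdf` example suggests); `expected_outputs` is a hypothetical name and not part of magic-pdf.

```python
from pathlib import Path


def expected_outputs(pdf_path):
    """List the output names shown above for one input PDF."""
    stem = Path(pdf_path).stem  # assumed prefix: file name without extension
    return [
        f"{stem}.md",           # markdown file
        "images",               # directory for storing images
        f"{stem}_layout.pdf",   # layout diagram
        f"{stem}_middle.json",  # MinerU intermediate processing result
        f"{stem}_model.json",   # model inference result
        f"{stem}_origin.pdf",   # original PDF file
        f"{stem}_spans.pdf",    # span bbox diagram
    ]


print(expected_outputs("papers/some_pdf.pdf"))
```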
......@@ -260,6 +274,7 @@ For more information about the output files, please refer to the [Output File De
### API
Processing files from local disk
```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
......@@ -272,6 +287,7 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
Processing files from object storage
```python
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
......@@ -286,10 +302,10 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
For detailed implementation, refer to:
- [demo.py Simplest Processing Method](demo/demo.py)
- [magic_pdf_parse_main.py More Detailed Processing Workflow](demo/magic_pdf_parse_main.py)
### Development Guide
TODO
......@@ -305,6 +321,7 @@ TODO
- [ ] Geometric shape recognition
# Known Issues
- Reading order is determined by rule-based segmentation, which can produce out-of-order text in some cases
- Vertical text is not supported
- Lists, code blocks, and tables of contents are not yet supported in the layout model
......@@ -313,18 +330,18 @@ TODO
- If you are processing PDFs with a large number of formulas, it is strongly recommended to enable the OCR function. When using PyMuPDF to extract text, overlapping text lines can occur, leading to inaccurate formula insertion positions.
- **Table Recognition** is currently in the testing phase; recognition speed is slow, and accuracy needs improvement. Below are some performance test results in an Ubuntu 22.04 LTS + Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz + NVIDIA GeForce RTX 4090 environment for reference.
| Table Size | Parsing Time |
|---------------|----------------------------|
| 6\*5 55kb | 37s |
| 16\*12 284kb | 3m18s |
| 44\*7 559kb | 4m12s |
| Table Size | Parsing Time |
| ------------ | ------------ |
| 6\*5 55kb | 37s |
| 16\*12 284kb | 3m18s |
| 44\*7 559kb | 4m12s |
# FAQ
[FAQ in Chinese](docs/FAQ_zh_cn.md)
[FAQ in English](docs/FAQ_en_us.md)
# All Thanks To Our Contributors
<a href="https://github.com/opendatalab/MinerU/graphs/contributors">
......@@ -337,8 +354,8 @@ TODO
This project currently uses PyMuPDF to achieve advanced functionality. However, since it adheres to the AGPL license, it may impose restrictions on certain usage scenarios. In future iterations, we plan to explore and replace it with a more permissive PDF processing library to enhance user-friendliness and flexibility.
# Acknowledgments
- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
......@@ -375,9 +392,11 @@ This project currently uses PyMuPDF to achieve advanced functionality. However,
</a>
# Magic-doc
[Magic-Doc](https://github.com/InternLM/magic-doc) Fast ppt/pptx/doc/docx/pdf extraction tool
# Magic-html
[Magic-HTML](https://github.com/opendatalab/magic-html) Mixed web page extraction tool
# Links
......
......@@ -4,8 +4,8 @@
<img src="docs/images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
</p>
<!-- icon -->
[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
......@@ -16,29 +16,31 @@
<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
<!-- language -->
[English](README.md) | [简体中文](README_zh-CN.md)
[English](README.md) | [简体中文](README_zh-CN.md)
<!-- hot link -->
<p align="center">
<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: High-Quality PDF Extraction Toolkit</a>🔥🔥🔥
</p>
<!-- join us -->
<p align="center">
👋 join us on <a href="https://discord.gg/Tdedn9GTXq" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
</p>
</div>
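# Changelog
- 2024/08/09: Version 0.7.0b1 released; simplified the installation steps for better usability and added table recognition
- 2024/08/01: Version 0.6.2b1 released; resolved dependency conflicts and improved the installation documentation
- 2024/07/05: Initial open-source release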
<!-- TABLE OF CONTENT -->
<details open="open">
<summary><h2 style="display: inline-block">Table of Contents</h2></summary>
<ol>
......@@ -77,10 +79,10 @@
</ol>
</details>
# MinerU
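## Project Introduction
MinerU is a tool that converts PDFs into machine-readable formats (such as markdown and JSON), from which content can easily be extracted into any format.
MinerU was born during the pre-training process of [InternLM](https://github.com/InternLM/InternLM). We will focus on solving symbol conversion issues in scientific literature and hope to contribute to technological development in the era of large models.
Compared with well-known commercial products at home and abroad, MinerU is still young. If you run into problems or the results fall short of expectations, please report them at [issue](https://github.com/opendatalab/MinerU/issues) and **attach the relevant PDF**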
......@@ -99,17 +101,16 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
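- Supports CPU and GPU environments
- Supports Windows/Linux/Mac platforms
## Quick Start
If you encounter any installation issues, please first consult the <a href="#faq">FAQ</a> </br>
If the parsing results are not as expected, refer to the <a href="#known-issues">Known Issues</a></br>
There are three different ways to try MinerU:
- [Online Demo (No Installation Required)](#在线体验)
- [Quick CPU Demo (Windows, Linux, Mac)](#使用cpu快速体验)
- [Linux/Windows + CUDA](#使用gpu)
**⚠️Must read before installation: supported hardware and software environments**
To ensure the stability and reliability of the project, we optimize and test only for specific hardware and software environments during development. Users who deploy and run the project on a recommended system configuration therefore get the best performance and the fewest compatibility issues.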
......@@ -171,38 +172,46 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
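[Click here for the online demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
### Quick CPU Demo
#### 1. Install magic-pdf
Synchronization of the latest version to mirror sources in China may be delayed; please be patient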
```bash
conda create -n MinerU python=3.10
conda activate MinerU
pip install magic-pdf[full]==0.7.0b1 --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
```
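#### 2. Download model weight files
For details, refer to [How to Download Model Files](docs/how_to_download_models_zh_cn.md)
> ❗️After downloading the models, be sure to check that the model files were downloaded completely
>
>
> Check whether the model file sizes in the directory match the description on the webpage. If possible, verify the integrity of the models with sha256
#### 3. Copy and configure the configuration file
The [magic-pdf.template.json](magic-pdf.template.json) configuration template can be found in the root directory of the repository
> ❗️Be sure to run the following command to copy the configuration file to your **user directory**; otherwise the program will not run
>
> The user directory on Windows is "C:\Users\YourUsername", on Linux it is "/home/YourUsername", and on macOS it is "/Users/YourUsername"
>
> The user directory on Windows is "C:\\Users\\YourUsername", on Linux it is "/home/YourUsername", and on macOS it is "/Users/YourUsername"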
```bash
cp magic-pdf.template.json ~/magic-pdf.json
```
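Find the magic-pdf.json file in your user directory and set "models-dir" to the directory containing the model weight files downloaded in [2. Download model weight files](#2-下载模型权重文件)
> ❗️Be sure to configure the **absolute path** of the model weight directory correctly; otherwise the program will fail to run because it cannot find the model files
>
> On Windows, this path should include the drive letter, and every "\" in the path must be replaced with "/"; otherwise escape sequences will cause a syntax error in the JSON file.
>
> On Windows, this path should include the drive letter, and every "" in the path must be replaced with "/"; otherwise escape sequences will cause a syntax error in the JSON file.
>
> For example: if the models are stored in the models directory at the root of drive D, the value of models-dir should be "D:/models"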
```json
{
// other config
......@@ -214,14 +223,13 @@ cp magic-pdf.template.json ~/magic-pdf.json
}
```
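### Using GPU
If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Please choose the guide that matches your system:
- [Ubuntu 22.04 LTS + GPU](docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md)
- [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md)
## Usage
### Command Line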
......@@ -234,12 +242,12 @@ Options:
-v, --version display the version and exit
-p, --path PATH local pdf filepath or directory [required]
-o, --output-dir TEXT output local directory
-m, --method [ocr|txt|auto] the method for parsing pdf.
-m, --method [ocr|txt|auto] the method for parsing pdf.
ocr: using ocr technique to extract information from pdf,
txt: suitable for the text-based pdf only and outperform ocr,
auto: automatically choose the best method for parsing pdf
from ocr and txt.
without method specified, auto will be used by default.
without method specified, auto will be used by default.
--help Show this message and exit.
......@@ -254,21 +262,21 @@ magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
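After the command finishes, the results are saved in the `{some_output_dir}` directory. The output file list is as follows
```text
├── some_pdf.md # markdown file
├── images # directory for storing images
├── layout.pdf # layout diagram
├── middle.json # MinerU intermediate processing result
├── model.json # model inference result
├── origin.pdf # original PDF file
└── spans.pdf # smallest granularity bbox position information diagram
├── some_pdf.md # markdown file
├── images # directory for storing images
├── some_pdf_layout.pdf # layout diagram
├── some_pdf_middle.json # MinerU intermediate processing result
├── some_pdf_model.json # model inference result
├── some_pdf_origin.pdf # original PDF file
└── some_pdf_spans.pdf # smallest granularity bbox position information diagram
```
For more information about the output files, please refer to the [Output File Description](docs/output_file_zh_cn.md)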
### API
Processing files from local disk
```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
......@@ -281,6 +289,7 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
Processing files from object storage
```python
s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
......@@ -294,11 +303,11 @@ pipe.pipe_parse()
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```
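For detailed implementation, refer to
For detailed implementation, refer to
- [demo.py: the simplest processing method](demo/demo.py)
- [magic_pdf_parse_main.py: shows the processing workflow more clearly](demo/magic_pdf_parse_main.py)
### Development Guide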
TODO
......@@ -313,8 +322,8 @@ TODO
- [ ] Chemical formula recognition
- [ ] Geometric shape recognition
# Known Issues
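- Reading order is determined by rule-based segmentation, which can produce out-of-order text in some cases
- Vertical text is not supported
- Lists, code blocks, and tables of contents are not yet supported in the layout model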
......@@ -323,20 +332,18 @@ TODO
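- If you are processing PDFs with a large number of formulas, it is strongly recommended to enable the OCR function. When PyMuPDF is used to extract text, text lines can overlap, leading to inaccurate formula insertion positions.
- **Table Recognition** is currently in the testing phase; recognition is slow and accuracy needs improvement. Below are some performance test results from our Ubuntu 22.04 LTS + Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz + NVIDIA GeForce RTX 4090 environment, for reference.
| Table Size | Parsing Time |
|---------------|----------------------------|
| 6\*5 55kb | 37s |
| 16\*12 284kb | 3m18s |
| 44\*7 559kb | 4m12s |
| Table Size | Parsing Time |
| ------------ | ------------ |
| 6\*5 55kb | 37s |
| 16\*12 284kb | 3m18s |
| 44\*7 559kb | 4m12s |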
# FAQ
[FAQ in Chinese](docs/FAQ_zh_cn.md)
[FAQ](docs/FAQ_en_us.md)
# All Thanks To Our Contributors
<a href="https://github.com/opendatalab/MinerU/graphs/contributors">
......@@ -350,6 +357,7 @@ TODO
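This project currently uses PyMuPDF to achieve advanced functionality. However, because PyMuPDF is licensed under the AGPL, it may restrict certain usage scenarios. In future iterations, we plan to explore and switch to a PDF processing library with more permissive license terms to improve user-friendliness and flexibility.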
# Acknowledgments
- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
- [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
......@@ -386,9 +394,11 @@ TODO
</a>
# Magic-doc
[Magic-Doc](https://github.com/InternLM/magic-doc) Fast ppt/pptx/doc/docx/pdf extraction tool
# Magic-html
[Magic-HTML](https://github.com/opendatalab/magic-html) Mixed web page extraction tool
# Links
......
## Overview
In addition to the markdown output, the `magic-pdf` command generates several other files. Each of them is described below.
### some_pdf_layout.pdf
### layout.pdf
Each page layout consists of one or more boxes. The number at the top left of each box indicates its sequence number. Additionally, in `layout.pdf`, different content blocks are highlighted with different background colors.
![layout example](images/layout_example.png)
### some_pdf_spans.pdf
### spans.pdf
All spans on the page are drawn with different colored line frames according to the span type. This file can be used for quality control, allowing for quick identification of issues such as missing text or unrecognized inline formulas.
![spans example](images/spans_example.png)
### model.json
### some_pdf_model.json
#### Structure Definition
```python
from pydantic import BaseModel, Field
from enum import IntEnum
......@@ -34,12 +33,12 @@ class CategoryType(IntEnum):
table_footnote = 7 # Table footnote
isolate_formula = 8 # Block formula
formula_caption = 9 # Formula label
embedding = 13 # Inline formula
isolated = 14 # Block formula
text = 15 # OCR recognition result
class PageInfo(BaseModel):
page_no: int = Field(description="Page number, the first page is 0", ge=0)
height: int = Field(description="Page height", gt=0)
......@@ -51,22 +50,20 @@ class ObjectInferenceResult(BaseModel):
score: float = Field(description="Confidence of the inference result")
latex: str | None = Field(description="LaTeX parsing result", default=None)
html: str | None = Field(description="HTML parsing result", default=None)
class PageInferenceResults(BaseModel):
layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results", ge=0)
page_info: PageInfo = Field(description="Page metadata")
# The inference results of all pages, ordered by page number, are stored in a list as the inference results of MinerU
inference_result: list[PageInferenceResults] = []
```
The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3], representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively.
The format of the poly coordinates is \[x0, y0, x1, y1, x2, y2, x3, y3\], representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively.
![Poly Coordinate Diagram](images/poly.png)
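To make the poly layout concrete, here is a small sketch that turns an 8-value poly into an axis-aligned `(x0, y0, x1, y1)` box. The function name `poly_to_bbox` is hypothetical and only illustrates the coordinate order described above.

```python
def poly_to_bbox(poly):
    """Convert [x0, y0, x1, y1, x2, y2, x3, y3] into (min_x, min_y, max_x, max_y)."""
    if len(poly) != 8:
        raise ValueError("poly must contain exactly 8 numbers")
    xs = poly[0::2]  # x of top-left, top-right, bottom-right, bottom-left
    ys = poly[1::2]  # y of the same four corners
    return (min(xs), min(ys), max(xs), max(ys))


print(poly_to_bbox([10, 20, 110, 20, 110, 70, 10, 70]))  # (10, 20, 110, 70)
```

Taking the min and max over all four corners also handles polys that are slightly rotated.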
#### example
```json
......@@ -120,15 +117,13 @@ The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3], represen
]
```
### some_pdf_middle.json
### middle.json
| Field Name | Description |
| :-----|:------------------------------------------|
|pdf_info | list, each element is a dict representing the parsing result of each PDF page, see the table below for details |
|_parse_type | ocr \| txt, used to indicate the mode used in this intermediate parsing state |
|_version_name | string, indicates the version of magic-pdf used in this parsing |
| Field Name | Description |
| :------------- | :------------------------------------------------------------------------------------------------------------- |
| pdf_info | list, each element is a dict representing the parsing result of each PDF page, see the table below for details |
| \_parse_type | ocr \| txt, used to indicate the mode used in this intermediate parsing state |
| \_version_name | string, indicates the version of magic-pdf used in this parsing |
<br>
......@@ -136,18 +131,18 @@ The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3], represen
Field structure description
| Field Name | Description |
| :-----| :---- |
| preproc_blocks | Intermediate result after PDF preprocessing, not yet segmented |
| layout_bboxes | Layout segmentation results, containing layout direction (vertical, horizontal), and bbox, sorted by reading order |
| page_idx | Page number, starting from 0 |
| page_size | Page width and height |
| _layout_tree | Layout tree structure |
| images | list, each element is a dict representing an img_block |
| tables | list, each element is a dict representing a table_block |
| interline_equations | list, each element is a dict representing an interline_equation_block |
| discarded_blocks | List, block information returned by the model that needs to be dropped |
| para_blocks | Result after segmenting preproc_blocks |
| Field Name | Description |
| :------------------ | :----------------------------------------------------------------------------------------------------------------- |
| preproc_blocks | Intermediate result after PDF preprocessing, not yet segmented |
| layout_bboxes | Layout segmentation results, containing layout direction (vertical, horizontal), and bbox, sorted by reading order |
| page_idx | Page number, starting from 0 |
| page_size | Page width and height |
| \_layout_tree | Layout tree structure |
| images | list, each element is a dict representing an img_block |
| tables | list, each element is a dict representing a table_block |
| interline_equations | list, each element is a dict representing an interline_equation_block |
| discarded_blocks | List, block information returned by the model that needs to be dropped |
| para_blocks | Result after segmenting preproc_blocks |
In the above table, `para_blocks` is an array of dicts, each dict representing a block structure. A block can support up to one level of nesting.
......@@ -157,35 +152,35 @@ In the above table, `para_blocks` is an array of dicts, each dict representing a
The outer block is referred to as a first-level block, and the fields in the first-level block include:
| Field Name | Description |
| :-----| :---- |
| type | Block type (table\|image)|
|bbox | Block bounding box coordinates |
|blocks |list, each element is a dict representing a second-level block |
| Field Name | Description |
| :--------- | :------------------------------------------------------------- |
| type | Block type (table\|image) |
| bbox | Block bounding box coordinates |
| blocks | list, each element is a dict representing a second-level block |
<br>
There are only two types of first-level blocks: "table" and "image". All other blocks are second-level blocks.
The fields in a second-level block include:
| Field Name | Description |
| :-----| :---- |
| type | Block type |
| bbox | Block bounding box coordinates |
| lines | list, each element is a dict representing a line, used to describe the composition of a line of information|
| Field Name | Description |
| :--------- | :---------------------------------------------------------------------------------------------------------- |
| type | Block type |
| bbox | Block bounding box coordinates |
| lines | list, each element is a dict representing a line, used to describe the composition of a line of information |
Detailed explanation of second-level block types
| type | Description |
|:-------------------| :---- |
| type | Description |
| :----------------- | :--------------------- |
| image_body | Main body of the image |
| image_caption | Image description text |
| table_body | Main body of the table |
| table_caption | Table description text |
| table_footnote | Table footnote |
| text | Text block |
| title | Title block |
| interline_equation | Block formula|
| table_footnote | Table footnote |
| text | Text block |
| title | Title block |
| interline_equation | Block formula |
<br>
......@@ -193,31 +188,30 @@ Detailed explanation of second-level block types
The field format of a line is as follows:
| Field Name | Description |
| :-----| :---- |
| bbox | Bounding box coordinates of the line |
| spans | list, each element is a dict representing a span, used to describe the composition of the smallest unit |
| Field Name | Description |
| :--------- | :------------------------------------------------------------------------------------------------------ |
| bbox | Bounding box coordinates of the line |
| spans | list, each element is a dict representing a span, used to describe the composition of the smallest unit |
<br>
**span**
| Field Name | Description |
| :-----| :---- |
| bbox | Bounding box coordinates of the span |
| type | Type of the span |
| Field Name | Description |
| :------------------ | :------------------------------------------------------------------------------------------------------- |
| bbox | Bounding box coordinates of the span |
| type | Type of the span |
| content \| img_path | Text spans store their text in content; image and table spans store the screenshot path in img_path |
The types of spans are as follows:
| type | Description |
| :-----| :---- |
| image | Image |
| table | Table |
| text | Text |
| inline_equation | Inline formula |
| interline_equation | Block formula |
| type | Description |
| :----------------- | :------------- |
| image | Image |
| table | Table |
| text | Text |
| inline_equation | Inline formula |
| interline_equation | Block formula |
**Summary**
......@@ -229,7 +223,6 @@ The block structure is as follows:
First-level block (if any) -> Second-level block -> Line -> Span
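The nesting above can be walked with a short generator. This is a sketch under the assumption that first-level blocks carry a `blocks` list and second-level blocks carry `lines`, as the tables describe; `iter_spans` and the sample data are hypothetical and not part of magic-pdf.

```python
def iter_spans(para_blocks):
    """Yield every span, walking block -> (nested block) -> line -> span."""
    for block in para_blocks:
        # First-level blocks (table, image) hold second-level blocks under "blocks";
        # any other block is already a second-level block.
        inner_blocks = block["blocks"] if "blocks" in block else [block]
        for inner in inner_blocks:
            for line in inner.get("lines", []):
                yield from line.get("spans", [])


# Made-up sample shaped like the structure described above.
sample_para_blocks = [
    {"type": "text", "bbox": [0, 0, 10, 10],
     "lines": [{"bbox": [0, 0, 10, 10],
                "spans": [{"type": "text", "content": "hello",
                           "bbox": [0, 0, 10, 10]}]}]},
    {"type": "image", "bbox": [0, 20, 10, 30],
     "blocks": [{"type": "image_body", "bbox": [0, 20, 10, 30],
                 "lines": [{"bbox": [0, 20, 10, 30],
                            "spans": [{"type": "image",
                                       "img_path": "images/a.jpg",
                                       "bbox": [0, 20, 10, 30]}]}]}]},
]

span_types = [span["type"] for span in iter_spans(sample_para_blocks)]
print(span_types)  # ['text', 'image']
```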
#### example
```json
......
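## Overview
After the `magic-pdf` command runs, it generates several files unrelated to markdown in addition to the markdown-related output. These files are introduced one by one below
### some_pdf_layout.pdf
### layout.pdf
The layout of each page consists of one or more boxes. The number at the top left of each box indicates its sequence number. In addition, layout.pdf marks different content blocks with different background colors inside the boxes.
![layout example](images/layout_example.png)
### some_pdf_spans.pdf
### spans.pdf
All spans on the page are drawn with line frames of different colors according to span type. This file can be used for quality control, making it quick to spot issues such as missing text or unrecognized block formulas.
![spans example](images/spans_example.png)
### model.json
### some_pdf_model.json
#### Structure Definition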
```python
from pydantic import BaseModel, Field
from enum import IntEnum
......@@ -33,13 +32,13 @@ class CategoryType(IntEnum):
table_caption = 6 # Table caption
table_footnote = 7 # Table footnote
isolate_formula = 8 # Block formula
formula_caption = 9 # Formula label
formula_caption = 9 # Formula label
embedding = 13 # Inline formula
isolated = 14 # Block formula
text = 15 # OCR recognition result
class PageInfo(BaseModel):
page_no: int = Field(description="Page number, the first page is 0", ge=0)
height: int = Field(description="Page height", gt=0)
......@@ -51,21 +50,20 @@ class ObjectInferenceResult(BaseModel):
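score: float = Field(description="Confidence of the inference result")
latex: str | None = Field(description="LaTeX parsing result", default=None)
html: str | None = Field(description="HTML parsing result", default=None)
class PageInferenceResults(BaseModel):
layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results", ge=0)
page_info: PageInfo = Field(description="Page metadata")
# The inference results of all pages, ordered by page number, are stored in a list as the inference results of MinerU
inference_result: list[PageInferenceResults] = []
```
The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3], representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively
The format of the poly coordinates is \[x0, y0, x1, y1, x2, y2, x3, y3\], representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively
![Poly Coordinate Diagram](images/poly.png)
#### Example data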
```json
......@@ -119,32 +117,31 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右
]
```
### some_pdf_middle.json
### middle.json
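| Field Name | Description |
| :-----|:------------------------------------------|
|pdf_info | list, each element is a dict representing the parsing result of each PDF page, see the table below for details |
|_parse_type | ocr \| txt, used to indicate the mode used in this intermediate parsing state |
|_version_name | string, indicates the version of magic-pdf used in this parsing |
| Field Name | Description |
| :------------- | :------------------------------------------------------------------------------------------------------------- |
| pdf_info | list, each element is a dict representing the parsing result of each PDF page, see the table below for details |
| \_parse_type | ocr \| txt, used to indicate the mode used in this intermediate parsing state |
| \_version_name | string, indicates the version of magic-pdf used in this parsing |
<br>
**pdf_info**
Field structure description
| Field Name | Description |
| :-----| :---- |
| preproc_blocks | Intermediate result after PDF preprocessing, not yet segmented |
| layout_bboxes | Layout segmentation results, containing layout direction (vertical, horizontal), and bbox, sorted by reading order |
| page_idx | Page number, starting from 0 |
| page_size | Page width and height |
| _layout_tree | Layout tree structure |
| images | list, each element is a dict representing an img_block |
| tables | list, each element is a dict representing a table_block |
| interline_equations | list, each element is a dict representing an interline_equation_block |
| discarded_blocks | List, block information returned by the model that needs to be dropped |
| para_blocks | Result after segmenting preproc_blocks |
| Field Name | Description |
| :------------------ | :----------------------------------------------------------------------------------------------------------------- |
| preproc_blocks | Intermediate result after PDF preprocessing, not yet segmented |
| layout_bboxes | Layout segmentation results, containing layout direction (vertical, horizontal), and bbox, sorted by reading order |
| page_idx | Page number, starting from 0 |
| page_size | Page width and height |
| \_layout_tree | Layout tree structure |
| images | list, each element is a dict representing an img_block |
| tables | list, each element is a dict representing a table_block |
| interline_equations | list, each element is a dict representing an interline_equation_block |
| discarded_blocks | List, block information returned by the model that needs to be dropped |
| para_blocks | Result after segmenting preproc_blocks |
In the above table, `para_blocks` is an array of dicts, each dict representing a block structure. A block can support up to one level of nesting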
......@@ -154,35 +151,35 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右
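The outer block is referred to as a first-level block, and the fields in the first-level block include
| Field Name | Description |
| :-----| :---- |
| type | Block type (table\|image)|
|bbox | Block bounding box coordinates |
|blocks |list, each element is a dict representing a second-level block |
| Field Name | Description |
| :--------- | :------------------------------------------------------------- |
| type | Block type (table\|image) |
| bbox | Block bounding box coordinates |
| blocks | list, each element is a dict representing a second-level block |
<br>
There are only two types of first-level blocks: "table" and "image". All other blocks are second-level blocks
The fields in a second-level block include
| Field Name | Description |
| :-----| :---- |
| type | Block type |
| bbox | Block bounding box coordinates |
| lines | list, each element is a dict representing a line, used to describe the composition of a line of information|
| Field Name | Description |
| :--------- | :---------------------------------------------------------------------------------------------------------- |
| type | Block type |
| bbox | Block bounding box coordinates |
| lines | list, each element is a dict representing a line, used to describe the composition of a line of information |
Detailed explanation of second-level block types
| type | desc |
|:-------------------| :---- |
| image_body | Main body of the image |
| type | desc |
| :----------------- | :--------------------- |
| image_body | Main body of the image |
| image_caption | Image description text |
| table_body | Main body of the table |
| table_body | Main body of the table |
| table_caption | Table description text |
| table_footnote | Table footnote |
| text | Text block |
| title | Title block |
| interline_equation | Block formula|
| table_footnote | Table footnote |
| text | Text block |
| title | Title block |
| interline_equation | Block formula |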
<br>
......@@ -190,33 +187,31 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右
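The field format of a line is as follows
| Field Name | Description |
| :-----| :---- |
| bbox | Bounding box coordinates of the line |
| spans | list, each element is a dict representing a span, used to describe the composition of the smallest unit |
| Field Name | Description |
| :--------- | :------------------------------------------------------------------------------------------------------ |
| bbox | Bounding box coordinates of the line |
| spans | list, each element is a dict representing a span, used to describe the composition of the smallest unit |
<br>
**span**
| Field Name | Description |
| :-----| :---- |
| bbox | Bounding box coordinates of the span |
| type | Type of the span |
| Field Name | Description |
| :------------------ | :--------------------------------------------------------------------------------------------------- |
| bbox | Bounding box coordinates of the span |
| type | Type of the span |
| content \| img_path | Text spans store their text in content; image and table spans store the screenshot path in img_path |
The types of spans are as follows
| type | desc |
| :-----| :---- |
| image | Image |
| table | Table |
| text | Text |
| inline_equation | Inline formula |
| type | desc |
| :----------------- | :------------- |
| image | Image |
| table | Table |
| text | Text |
| inline_equation | Inline formula |
| interline_equation | Block formula |
**Summary**
A span is the smallest storage unit for all elements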
......@@ -227,7 +222,6 @@ para_blocks内存储的元素为区块信息
First-level block (if any) -> Second-level block -> Line -> Span
#### Example data
```json
......@@ -329,4 +323,4 @@ para_blocks内存储的元素为区块信息
"_parse_type": "txt",
"_version_name": "0.6.1"
}
```
\ No newline at end of file
```
from magic_pdf.libs.Constants import CROSS_PAGE
from magic_pdf.libs.commons import fitz # PyMuPDF
from magic_pdf.libs.ocr_content_type import ContentType, BlockType, CategoryId
from magic_pdf.libs.Constants import CROSS_PAGE
from magic_pdf.libs.ocr_content_type import BlockType, CategoryId, ContentType
from magic_pdf.model.magic_model import MagicModel
......@@ -65,10 +65,11 @@ def draw_bbox_with_number(i, bbox_list, page, rgb_config, fill_config):
) # Insert the index in the top left corner of the rectangle
def draw_layout_bbox(pdf_info, pdf_bytes, out_path):
def draw_layout_bbox(pdf_info, pdf_bytes, out_path, filename):
layout_bbox_list = []
dropped_bbox_list = []
tables_list, tables_body_list, tables_caption_list, tables_footnote_list = [], [], [], []
tables_list, tables_body_list = [], []
tables_caption_list, tables_footnote_list = [], []
imgs_list, imgs_body_list, imgs_caption_list = [], [], []
titles_list = []
texts_list = []
......@@ -81,37 +82,37 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path):
titles = []
texts = []
interequations = []
for layout in page["layout_bboxes"]:
page_layout_list.append(layout["layout_bbox"])
for layout in page['layout_bboxes']:
page_layout_list.append(layout['layout_bbox'])
layout_bbox_list.append(page_layout_list)
for dropped_bbox in page["discarded_blocks"]:
page_dropped_list.append(dropped_bbox["bbox"])
for dropped_bbox in page['discarded_blocks']:
page_dropped_list.append(dropped_bbox['bbox'])
dropped_bbox_list.append(page_dropped_list)
for block in page["para_blocks"]:
bbox = block["bbox"]
if block["type"] == BlockType.Table:
for block in page['para_blocks']:
bbox = block['bbox']
if block['type'] == BlockType.Table:
tables.append(bbox)
for nested_block in block["blocks"]:
bbox = nested_block["bbox"]
if nested_block["type"] == BlockType.TableBody:
for nested_block in block['blocks']:
bbox = nested_block['bbox']
if nested_block['type'] == BlockType.TableBody:
tables_body.append(bbox)
elif nested_block["type"] == BlockType.TableCaption:
elif nested_block['type'] == BlockType.TableCaption:
tables_caption.append(bbox)
elif nested_block["type"] == BlockType.TableFootnote:
elif nested_block['type'] == BlockType.TableFootnote:
tables_footnote.append(bbox)
elif block["type"] == BlockType.Image:
elif block['type'] == BlockType.Image:
imgs.append(bbox)
for nested_block in block["blocks"]:
bbox = nested_block["bbox"]
if nested_block["type"] == BlockType.ImageBody:
for nested_block in block['blocks']:
bbox = nested_block['bbox']
if nested_block['type'] == BlockType.ImageBody:
imgs_body.append(bbox)
elif nested_block["type"] == BlockType.ImageCaption:
elif nested_block['type'] == BlockType.ImageCaption:
imgs_caption.append(bbox)
elif block["type"] == BlockType.Title:
elif block['type'] == BlockType.Title:
titles.append(bbox)
elif block["type"] == BlockType.Text:
elif block['type'] == BlockType.Text:
texts.append(bbox)
elif block["type"] == BlockType.InterlineEquation:
elif block['type'] == BlockType.InterlineEquation:
interequations.append(bbox)
tables_list.append(tables)
tables_body_list.append(tables_body)
......@@ -124,26 +125,33 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path):
texts_list.append(texts)
interequations_list.append(interequations)
pdf_docs = fitz.open("pdf", pdf_bytes)
pdf_docs = fitz.open('pdf', pdf_bytes)
for i, page in enumerate(pdf_docs):
draw_bbox_with_number(i, layout_bbox_list, page, [255, 0, 0], False)
draw_bbox_without_number(i, dropped_bbox_list, page, [158, 158, 158], True)
draw_bbox_without_number(i, tables_list, page, [153, 153, 0], True) # color !
draw_bbox_without_number(i, tables_body_list, page, [204, 204, 0], True)
draw_bbox_without_number(i, tables_caption_list, page, [255, 255, 102], True)
draw_bbox_without_number(i, tables_footnote_list, page, [229, 255, 204], True)
draw_bbox_without_number(i, dropped_bbox_list, page, [158, 158, 158],
True)
draw_bbox_without_number(i, tables_list, page, [153, 153, 0],
True) # color !
draw_bbox_without_number(i, tables_body_list, page, [204, 204, 0],
True)
draw_bbox_without_number(i, tables_caption_list, page, [255, 255, 102],
True)
draw_bbox_without_number(i, tables_footnote_list, page,
[229, 255, 204], True)
draw_bbox_without_number(i, imgs_list, page, [51, 102, 0], True)
draw_bbox_without_number(i, imgs_body_list, page, [153, 255, 51], True)
draw_bbox_without_number(i, imgs_caption_list, page, [102, 178, 255], True)
draw_bbox_without_number(i, imgs_caption_list, page, [102, 178, 255],
True)
draw_bbox_without_number(i, titles_list, page, [102, 102, 255], True)
draw_bbox_without_number(i, texts_list, page, [153, 0, 76], True)
draw_bbox_without_number(i, interequations_list, page, [0, 255, 0], True)
draw_bbox_without_number(i, interequations_list, page, [0, 255, 0],
True)
# Save the PDF
pdf_docs.save(f"{out_path}/layout.pdf")
pdf_docs.save(f'{out_path}/{filename}_layout.pdf')
def draw_span_bbox(pdf_info, pdf_bytes, out_path):
def draw_span_bbox(pdf_info, pdf_bytes, out_path, filename):
text_list = []
inline_equation_list = []
interline_equation_list = []
......@@ -154,22 +162,22 @@ def draw_span_bbox(pdf_info, pdf_bytes, out_path):
next_page_inline_equation_list = []
def get_span_info(span):
if span["type"] == ContentType.Text:
if span['type'] == ContentType.Text:
if span.get(CROSS_PAGE, False):
next_page_text_list.append(span["bbox"])
next_page_text_list.append(span['bbox'])
else:
page_text_list.append(span["bbox"])
elif span["type"] == ContentType.InlineEquation:
page_text_list.append(span['bbox'])
elif span['type'] == ContentType.InlineEquation:
if span.get(CROSS_PAGE, False):
next_page_inline_equation_list.append(span["bbox"])
next_page_inline_equation_list.append(span['bbox'])
else:
page_inline_equation_list.append(span["bbox"])
elif span["type"] == ContentType.InterlineEquation:
page_interline_equation_list.append(span["bbox"])
elif span["type"] == ContentType.Image:
page_image_list.append(span["bbox"])
elif span["type"] == ContentType.Table:
page_table_list.append(span["bbox"])
page_inline_equation_list.append(span['bbox'])
elif span['type'] == ContentType.InterlineEquation:
page_interline_equation_list.append(span['bbox'])
elif span['type'] == ContentType.Image:
page_image_list.append(span['bbox'])
elif span['type'] == ContentType.Table:
page_table_list.append(span['bbox'])
for page in pdf_info:
page_text_list = []
......@@ -188,54 +196,56 @@ def draw_span_bbox(pdf_info, pdf_bytes, out_path):
next_page_inline_equation_list.clear()
# Build dropped_list
for block in page["discarded_blocks"]:
if block["type"] == BlockType.Discarded:
for line in block["lines"]:
for span in line["spans"]:
page_dropped_list.append(span["bbox"])
for block in page['discarded_blocks']:
if block['type'] == BlockType.Discarded:
for line in block['lines']:
for span in line['spans']:
page_dropped_list.append(span['bbox'])
dropped_list.append(page_dropped_list)
# Build the remaining useful lists
for block in page["para_blocks"]:
if block["type"] in [
BlockType.Text,
BlockType.Title,
BlockType.InterlineEquation,
for block in page['para_blocks']:
if block['type'] in [
BlockType.Text,
BlockType.Title,
BlockType.InterlineEquation,
]:
for line in block["lines"]:
for span in line["spans"]:
for line in block['lines']:
for span in line['spans']:
get_span_info(span)
elif block["type"] in [BlockType.Image, BlockType.Table]:
for sub_block in block["blocks"]:
for line in sub_block["lines"]:
for span in line["spans"]:
elif block['type'] in [BlockType.Image, BlockType.Table]:
for sub_block in block['blocks']:
for line in sub_block['lines']:
for span in line['spans']:
get_span_info(span)
text_list.append(page_text_list)
inline_equation_list.append(page_inline_equation_list)
interline_equation_list.append(page_interline_equation_list)
image_list.append(page_image_list)
table_list.append(page_table_list)
pdf_docs = fitz.open("pdf", pdf_bytes)
pdf_docs = fitz.open('pdf', pdf_bytes)
for i, page in enumerate(pdf_docs):
# Fetch the data for the current page
draw_bbox_without_number(i, text_list, page, [255, 0, 0], False)
draw_bbox_without_number(i, inline_equation_list, page, [0, 255, 0], False)
draw_bbox_without_number(i, interline_equation_list, page, [0, 0, 255], False)
draw_bbox_without_number(i, inline_equation_list, page, [0, 255, 0],
False)
draw_bbox_without_number(i, interline_equation_list, page, [0, 0, 255],
False)
draw_bbox_without_number(i, image_list, page, [255, 204, 0], False)
draw_bbox_without_number(i, table_list, page, [204, 0, 255], False)
draw_bbox_without_number(i, dropped_list, page, [158, 158, 158], False)
# Save the PDF
pdf_docs.save(f"{out_path}/spans.pdf")
pdf_docs.save(f'{out_path}/{filename}_spans.pdf')
def drow_model_bbox(model_list: list, pdf_bytes, out_path):
def drow_model_bbox(model_list: list, pdf_bytes, out_path, filename):
dropped_bbox_list = []
tables_body_list, tables_caption_list, tables_footnote_list = [], [], []
imgs_body_list, imgs_caption_list = [], []
titles_list = []
texts_list = []
interequations_list = []
pdf_docs = fitz.open("pdf", pdf_bytes)
pdf_docs = fitz.open('pdf', pdf_bytes)
magic_model = MagicModel(model_list, pdf_docs)
for i in range(len(model_list)):
page_dropped_list = []
......@@ -245,26 +255,27 @@ def drow_model_bbox(model_list: list, pdf_bytes, out_path):
texts = []
interequations = []
page_info = magic_model.get_model_list(i)
layout_dets = page_info["layout_dets"]
layout_dets = page_info['layout_dets']
for layout_det in layout_dets:
bbox = layout_det["bbox"]
if layout_det["category_id"] == CategoryId.Text:
bbox = layout_det['bbox']
if layout_det['category_id'] == CategoryId.Text:
texts.append(bbox)
elif layout_det["category_id"] == CategoryId.Title:
elif layout_det['category_id'] == CategoryId.Title:
titles.append(bbox)
elif layout_det["category_id"] == CategoryId.TableBody:
elif layout_det['category_id'] == CategoryId.TableBody:
tables_body.append(bbox)
elif layout_det["category_id"] == CategoryId.TableCaption:
elif layout_det['category_id'] == CategoryId.TableCaption:
tables_caption.append(bbox)
elif layout_det["category_id"] == CategoryId.TableFootnote:
elif layout_det['category_id'] == CategoryId.TableFootnote:
tables_footnote.append(bbox)
elif layout_det["category_id"] == CategoryId.ImageBody:
elif layout_det['category_id'] == CategoryId.ImageBody:
imgs_body.append(bbox)
elif layout_det["category_id"] == CategoryId.ImageCaption:
elif layout_det['category_id'] == CategoryId.ImageCaption:
imgs_caption.append(bbox)
elif layout_det["category_id"] == CategoryId.InterlineEquation_YOLO:
elif layout_det[
'category_id'] == CategoryId.InterlineEquation_YOLO:
interequations.append(bbox)
elif layout_det["category_id"] == CategoryId.Abandon:
elif layout_det['category_id'] == CategoryId.Abandon:
page_dropped_list.append(bbox)
tables_body_list.append(tables_body)
......@@ -278,15 +289,19 @@ def drow_model_bbox(model_list: list, pdf_bytes, out_path):
dropped_bbox_list.append(page_dropped_list)
for i, page in enumerate(pdf_docs):
draw_bbox_with_number(i, dropped_bbox_list, page, [158, 158, 158], True) # color !
draw_bbox_with_number(i, dropped_bbox_list, page, [158, 158, 158],
True) # color !
draw_bbox_with_number(i, tables_body_list, page, [204, 204, 0], True)
draw_bbox_with_number(i, tables_caption_list, page, [255, 255, 102], True)
draw_bbox_with_number(i, tables_footnote_list, page, [229, 255, 204], True)
draw_bbox_with_number(i, tables_caption_list, page, [255, 255, 102],
True)
draw_bbox_with_number(i, tables_footnote_list, page, [229, 255, 204],
True)
draw_bbox_with_number(i, imgs_body_list, page, [153, 255, 51], True)
draw_bbox_with_number(i, imgs_caption_list, page, [102, 178, 255], True)
draw_bbox_with_number(i, imgs_caption_list, page, [102, 178, 255],
True)
draw_bbox_with_number(i, titles_list, page, [102, 102, 255], True)
draw_bbox_with_number(i, texts_list, page, [153, 0, 76], True)
draw_bbox_with_number(i, interequations_list, page, [0, 255, 0], True)
# Save the PDF
pdf_docs.save(f"{out_path}/model.pdf")
\ No newline at end of file
pdf_docs.save(f'{out_path}/{filename}_model.pdf')
import os
from pathlib import Path
import click
from loguru import logger
from pathlib import Path
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
import magic_pdf.model as model_config
from magic_pdf.tools.common import parse_pdf_methods, do_parse
from magic_pdf.libs.version import __version__
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.tools.common import do_parse, parse_pdf_methods
@click.command()
@click.version_option(__version__, "--version", "-v", help="display the version and exit")
@click.version_option(__version__,
'--version',
'-v',
help='display the version and exit')
@click.option(
"-p",
"--path",
"path",
'-p',
'--path',
'path',
type=click.Path(exists=True),
required=True,
help="local pdf filepath or directory",
help='local pdf filepath or directory',
)
@click.option(
"-o",
"--output-dir",
"output_dir",
type=str,
help="output local directory",
default="",
'-o',
'--output-dir',
'output_dir',
type=click.Path(),
required=True,
help='output local directory',
default='',
)
@click.option(
"-m",
"--method",
"method",
'-m',
'--method',
'method',
type=parse_pdf_methods,
help="""the method for parsing pdf.
help="""the method for parsing pdf.
ocr: using ocr technique to extract information from pdf.
txt: suitable for the text-based pdf only and outperform ocr.
auto: automatically choose the best method for parsing pdf from ocr and txt.
without method specified, auto will be used by default.""",
default="auto",
default='auto',
)
def cli(path, output_dir, method):
model_config.__use_inside_model__ = True
model_config.__model_mode__ = "full"
if output_dir == "":
if os.path.isdir(path):
output_dir = os.path.join(path, "output")
else:
output_dir = os.path.join(os.path.dirname(path), "output")
model_config.__model_mode__ = 'full'
os.makedirs(output_dir, exist_ok=True)
def read_fn(path):
disk_rw = DiskReaderWriter(os.path.dirname(path))
......@@ -69,11 +70,11 @@ def cli(path, output_dir, method):
logger.exception(e)
if os.path.isdir(path):
for doc_path in Path(path).glob("*.pdf"):
for doc_path in Path(path).glob('*.pdf'):
parse_doc(doc_path)
else:
parse_doc(path)
if __name__ == "__main__":
if __name__ == '__main__':
cli()
import os
import json as json_parse
import click
import os
from pathlib import Path
from magic_pdf.libs.path_utils import (
parse_s3path,
parse_s3_range_params,
remove_non_official_s3_args,
)
from magic_pdf.libs.config_reader import (
get_s3_config,
)
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
import click
import magic_pdf.model as model_config
from magic_pdf.tools.common import parse_pdf_methods, do_parse
from magic_pdf.libs.config_reader import get_s3_config
from magic_pdf.libs.path_utils import (parse_s3_range_params, parse_s3path,
remove_non_official_s3_args)
from magic_pdf.libs.version import __version__
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter
from magic_pdf.tools.common import do_parse, parse_pdf_methods
def read_s3_path(s3path):
bucket, key = parse_s3path(s3path)
s3_ak, s3_sk, s3_endpoint = get_s3_config(bucket)
s3_rw = S3ReaderWriter(
s3_ak, s3_sk, s3_endpoint, "auto", remove_non_official_s3_args(s3path)
)
s3_rw = S3ReaderWriter(s3_ak, s3_sk, s3_endpoint, 'auto',
remove_non_official_s3_args(s3path))
may_range_params = parse_s3_range_params(s3path)
if may_range_params is None or 2 != len(may_range_params):
byte_start, byte_end = 0, None
else:
byte_start, byte_end = int(may_range_params[0]), int(may_range_params[1])
byte_start, byte_end = int(may_range_params[0]), int(
may_range_params[1])
return s3_rw.read_offset(
remove_non_official_s3_args(s3path),
byte_start,
......@@ -38,51 +35,48 @@ def read_s3_path(s3path):
@click.group()
@click.version_option(__version__, "--version", "-v", help="显示版本信息")
@click.version_option(__version__, '--version', '-v', help='显示版本信息')
def cli():
pass
@cli.command()
@click.option(
"-j",
"--jsonl",
"jsonl",
'-j',
'--jsonl',
'jsonl',
type=str,
help="输入 jsonl 路径,本地或者 s3 上的文件",
help='输入 jsonl 路径,本地或者 s3 上的文件',
required=True,
)
@click.option(
"-m",
"--method",
"method",
'-m',
'--method',
'method',
type=parse_pdf_methods,
help="指定解析方法。txt: 文本型 pdf 解析方法, ocr: 光学识别解析 pdf, auto: 程序智能选择解析方法",
default="auto",
help='指定解析方法。txt: 文本型 pdf 解析方法, ocr: 光学识别解析 pdf, auto: 程序智能选择解析方法',
default='auto',
)
@click.option(
"-o",
"--output-dir",
"output_dir",
type=str,
help="输出到本地目录",
default="",
'-o',
'--output-dir',
'output_dir',
type=click.Path(),
required=True,
help='输出到本地目录',
default='',
)
def jsonl(jsonl, method, output_dir):
model_config.__use_inside_model__ = False
if jsonl.startswith("s3://"):
jso = json_parse.loads(read_s3_path(jsonl).decode("utf-8"))
full_jsonl_path = "."
if jsonl.startswith('s3://'):
jso = json_parse.loads(read_s3_path(jsonl).decode('utf-8'))
else:
full_jsonl_path = os.path.realpath(jsonl)
with open(jsonl) as f:
jso = json_parse.loads(f.readline())
if output_dir == "":
output_dir = os.path.join(os.path.dirname(full_jsonl_path), "output")
s3_file_path = jso.get("file_location")
os.makedirs(output_dir, exist_ok=True)
s3_file_path = jso.get('file_location')
if s3_file_path is None:
s3_file_path = jso.get("path")
s3_file_path = jso.get('path')
pdf_file_name = Path(s3_file_path).stem
pdf_data = read_s3_path(s3_file_path)
......@@ -91,7 +85,7 @@ def jsonl(jsonl, method, output_dir):
output_dir,
pdf_file_name,
pdf_data,
jso["doc_layout_result"],
jso['doc_layout_result'],
method,
f_dump_content_list=True,
f_draw_model_bbox=True,
......@@ -100,43 +94,46 @@ def jsonl(jsonl, method, output_dir):
@cli.command()
@click.option(
"-p",
"--pdf",
"pdf",
'-p',
'--pdf',
'pdf',
type=click.Path(exists=True),
required=True,
help="本地 PDF 文件",
help='本地 PDF 文件',
)
@click.option(
"-j",
"--json",
"json_data",
'-j',
'--json',
'json_data',
type=click.Path(exists=True),
required=True,
help="本地模型推理出的 json 数据",
)
@click.option(
"-o", "--output-dir", "output_dir", type=str, help="本地输出目录", default=""
help='本地模型推理出的 json 数据',
)
@click.option('-o',
'--output-dir',
'output_dir',
type=click.Path(),
required=True,
help='本地输出目录',
default='')
@click.option(
"-m",
"--method",
"method",
'-m',
'--method',
'method',
type=parse_pdf_methods,
help="指定解析方法。txt: 文本型 pdf 解析方法, ocr: 光学识别解析 pdf, auto: 程序智能选择解析方法",
default="auto",
help='指定解析方法。txt: 文本型 pdf 解析方法, ocr: 光学识别解析 pdf, auto: 程序智能选择解析方法',
default='auto',
)
def pdf(pdf, json_data, output_dir, method):
model_config.__use_inside_model__ = False
full_pdf_path = os.path.realpath(pdf)
if output_dir == "":
output_dir = os.path.join(os.path.dirname(full_pdf_path), "output")
os.makedirs(output_dir, exist_ok=True)
def read_fn(path):
disk_rw = DiskReaderWriter(os.path.dirname(path))
return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)
model_json_list = json_parse.loads(read_fn(json_data).decode("utf-8"))
model_json_list = json_parse.loads(read_fn(json_data).decode('utf-8'))
file_name = str(Path(full_pdf_path).stem)
pdf_data = read_fn(full_pdf_path)
......@@ -151,5 +148,5 @@ def pdf(pdf, json_data, output_dir, method):
)
if __name__ == "__main__":
if __name__ == '__main__':
cli()
import os
import json as json_parse
import copy
import json as json_parse
import os
import click
from loguru import logger
import magic_pdf.model as model_config
from magic_pdf.libs.draw_bbox import (draw_layout_bbox, draw_span_bbox,
drow_model_bbox)
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.libs.draw_bbox import draw_layout_bbox, draw_span_bbox, drow_model_bbox
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.pipe.OCRPipe import OCRPipe
from magic_pdf.pipe.TXTPipe import TXTPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
import magic_pdf.model as model_config
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
def prepare_env(output_dir, pdf_file_name, method):
local_parent_dir = os.path.join(output_dir, pdf_file_name, method)
local_image_dir = os.path.join(str(local_parent_dir), "images")
local_image_dir = os.path.join(str(local_parent_dir), 'images')
local_md_dir = local_parent_dir
os.makedirs(local_image_dir, exist_ok=True)
os.makedirs(local_md_dir, exist_ok=True)
......@@ -40,22 +43,22 @@ def do_parse(
f_draw_model_bbox=False,
):
orig_model_list = copy.deepcopy(model_list)
local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name, parse_method)
local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name,
parse_method)
image_writer, md_writer = DiskReaderWriter(local_image_dir), DiskReaderWriter(
local_md_dir
)
image_writer, md_writer = DiskReaderWriter(
local_image_dir), DiskReaderWriter(local_md_dir)
image_dir = str(os.path.basename(local_image_dir))
if parse_method == "auto":
jso_useful_key = {"_pdf_type": "", "model_list": model_list}
if parse_method == 'auto':
jso_useful_key = {'_pdf_type': '', 'model_list': model_list}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer, is_debug=True)
elif parse_method == "txt":
elif parse_method == 'txt':
pipe = TXTPipe(pdf_bytes, model_list, image_writer, is_debug=True)
elif parse_method == "ocr":
elif parse_method == 'ocr':
pipe = OCRPipe(pdf_bytes, model_list, image_writer, is_debug=True)
else:
logger.error("unknown parse method")
logger.error('unknown parse method')
exit(1)
pipe.pipe_classify()
......@@ -65,58 +68,65 @@ def do_parse(
pipe.pipe_analyze()
orig_model_list = copy.deepcopy(pipe.model_list)
else:
logger.error("need model list input")
logger.error('need model list input')
exit(2)
pipe.pipe_parse()
pdf_info = pipe.pdf_mid_data["pdf_info"]
pdf_info = pipe.pdf_mid_data['pdf_info']
if f_draw_layout_bbox:
draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir)
draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir, pdf_file_name)
if f_draw_span_bbox:
draw_span_bbox(pdf_info, pdf_bytes, local_md_dir)
draw_span_bbox(pdf_info, pdf_bytes, local_md_dir, pdf_file_name)
if f_draw_model_bbox:
drow_model_bbox(orig_model_list, pdf_bytes, local_md_dir)
drow_model_bbox(orig_model_list, pdf_bytes, local_md_dir,
pdf_file_name)
md_content = pipe.pipe_mk_markdown(
image_dir, drop_mode=DropMode.NONE, md_make_mode=f_make_md_mode
)
md_content = pipe.pipe_mk_markdown(image_dir,
drop_mode=DropMode.NONE,
md_make_mode=f_make_md_mode)
if f_dump_md:
md_writer.write(
content=md_content,
path=f"{pdf_file_name}.md",
path=f'{pdf_file_name}.md',
mode=AbsReaderWriter.MODE_TXT,
)
if f_dump_middle_json:
md_writer.write(
content=json_parse.dumps(pipe.pdf_mid_data, ensure_ascii=False, indent=4),
path="middle.json",
content=json_parse.dumps(pipe.pdf_mid_data,
ensure_ascii=False,
indent=4),
path=f'{pdf_file_name}_middle.json',
mode=AbsReaderWriter.MODE_TXT,
)
if f_dump_model_json:
md_writer.write(
content=json_parse.dumps(orig_model_list, ensure_ascii=False, indent=4),
path="model.json",
content=json_parse.dumps(orig_model_list,
ensure_ascii=False,
indent=4),
path=f'{pdf_file_name}_model.json',
mode=AbsReaderWriter.MODE_TXT,
)
if f_dump_orig_pdf:
md_writer.write(
content=pdf_bytes,
path="origin.pdf",
path=f'{pdf_file_name}_origin.pdf',
mode=AbsReaderWriter.MODE_BIN,
)
content_list = pipe.pipe_mk_uni_format(image_dir, drop_mode=DropMode.NONE)
if f_dump_content_list:
md_writer.write(
content=json_parse.dumps(content_list, ensure_ascii=False, indent=4),
path="content_list.json",
content=json_parse.dumps(content_list,
ensure_ascii=False,
indent=4),
path=f'{pdf_file_name}_content_list.json',
mode=AbsReaderWriter.MODE_TXT,
)
logger.info(f"local output dir is {local_md_dir}")
logger.info(f'local output dir is {local_md_dir}')
parse_pdf_methods = click.Choice(["ocr", "txt", "auto"])
parse_pdf_methods = click.Choice(['ocr', 'txt', 'auto'])
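
As a rough guide to the renamed outputs, the sketch below lists the files that `do_parse` appears to write after this change when every dump and draw flag is enabled. `expected_output_files` is a hypothetical helper for illustration only and is not part of `magic_pdf`; the file names are copied from the `path=` arguments and `save` calls in this diff, and the directory layout follows `prepare_env`.

```python
import os


def expected_output_files(output_dir, pdf_file_name, method):
    """Return the output paths implied by this diff (illustrative, not library code)."""
    # prepare_env places everything under <output_dir>/<pdf_file_name>/<method>.
    parent_dir = os.path.join(output_dir, pdf_file_name, method)
    names = [
        f'{pdf_file_name}.md',
        f'{pdf_file_name}_middle.json',
        f'{pdf_file_name}_model.json',
        f'{pdf_file_name}_origin.pdf',
        f'{pdf_file_name}_content_list.json',
        f'{pdf_file_name}_layout.pdf',
        f'{pdf_file_name}_spans.pdf',
        f'{pdf_file_name}_model.pdf',
    ]
    return [os.path.join(parent_dir, name) for name in names]


paths = expected_output_files('out', 'demo', 'auto')
```

Prefixing each artifact with the PDF stem means files copied out of their per-document directory remain distinguishable, which the old fixed names such as `layout.pdf` did not allow.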
dependent on the service headway and the reliability of the departure time of the service to which passengers are incident.
After briefly introducing the random incidence model, which is often assumed to hold at short headways, the balance of this section reviews six studies of passenger incidence behavior that are motivated by understanding the relationships between service headway, service reliability, passenger incidence behavior, and passenger waiting time in a more nuanced fashion than is embedded in the random incidence assumption ( 2 ). Three of these studies depend on manually collected data, two studies use data from AFC systems, and one study analyzes the issue purely theoretically. These studies reveal much about passenger incidence behavior, but all are found to be limited in their general applicability by the methods with which they collect information about passengers and the services those passengers intend to use.
# Random Passenger Incidence Behavior
One characterization of passenger incidence behavior is that of random incidence ( 3 ). The key assumption underlying the random incidence model is that the process of passenger arrivals to the public transport service is independent from the vehicle departure process of the service. This implies that passengers become incident to the service at a random time, and thus the instantaneous rate of passenger arrivals to the service is uniform over a given period of time. Let $W$ and $H$ be random variables representing passenger waiting times and service headways, respectively. Under the random incidence assumption and the assumption that vehicle capacity is not a binding constraint, a classic result of transportation science is that
$$
E(W) = \frac{E[H^{2}]}{2E[H]} = \frac{E[H]}{2}\left(1 + \operatorname{CV}(H)^{2}\right) \tag{1}
$$
where $E[X]$ is the probabilistic expectation of some random variable $X$ and $\operatorname{CV}(H)$ is the coefficient of variation of $H$, a unitless measure of the variability of $H$ defined as
$$
\operatorname{CV}(H) = \frac{\sigma_{H}}{E[H]}
$$
where $\sigma_{H}$ is the standard deviation of $H$ ( 4 ). The second expression in Equation 1 is particularly useful because it expresses the mean passenger waiting time as the sum of two components: the waiting time caused by the mean headway (i.e., the reciprocal of service frequency) and the waiting time caused by the variability of the headways (which is one measure of service reliability). When the service is perfectly reliable with constant headways, the mean waiting time will be simply half the headway.
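
The two forms of the waiting-time result can be checked numerically. The sketch below is illustrative only: the function names and the sample headways are assumptions, and the population standard deviation is used so that both expressions agree exactly on a finite sample.

```python
from statistics import mean, pstdev


def wait_from_moments(headways):
    """Mean wait under random incidence: E[H^2] / (2 E[H])."""
    mean_h = mean(headways)
    mean_h_squared = mean(h * h for h in headways)
    return mean_h_squared / (2 * mean_h)


def wait_from_cv(headways):
    """Same quantity written as E[H]/2 * (1 + CV(H)^2)."""
    mean_h = mean(headways)
    cv = pstdev(headways) / mean_h  # coefficient of variation of the headways
    return mean_h / 2 * (1 + cv ** 2)


regular = [10, 10, 10, 10]   # perfectly reliable 10-minute service (made-up data)
irregular = [5, 15, 5, 15]   # same mean headway, alternating gaps (made-up data)

regular_wait = wait_from_moments(regular)
irregular_wait = wait_from_moments(irregular)
```

With the same mean headway of 10 minutes, the regular service gives a mean wait of half the headway, while the irregular one gives a longer wait; the difference is the reliability component of the second expression.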
# More Behaviorally Realistic Incidence Models
Jolliffe and Hutchinson studied bus passenger incidence in South London suburbs ( 5 ). They observed 10 bus stops for 1 h per day over 8 days, recording the times of passenger incidence and actual and scheduled bus departures. They limited their stop selection to those served by only a single bus route with a single service pattern so as to avoid ambiguity about which service a passenger was waiting for. The authors found that the actual average passenger waiting time was 30% less than predicted by the random incidence model. They also found that the empirical distributions of passenger incidence times (by time of day) had peaks just before the respective average bus departure times. They hypothesized the existence of three classes of passengers: with proportion $q$, passengers whose time of incidence is causally coincident with that of a bus departure (e.g., because they saw the approaching bus from their home or a shop window); with proportion $p(1-q)$, passengers who time their arrivals to minimize expected waiting time; and with proportion $(1-p)(1-q)$, passengers who are randomly incident. The authors found that $p$ was positively correlated with the potential reduction in waiting time (compared with arriving randomly) that resulted from knowledge of the timetable and of service reliability. They also found $p$ to be higher in the peak commuting periods rather than in the off-peak periods, indicating more awareness of the timetable or historical reliability, or both, by commuters.
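
The three-class split can be read as a weighted average of class-specific waiting times. The sketch below is only an illustration of that bookkeeping under stated assumptions: the function name, the default of zero waiting for coincident passengers, and all numeric inputs are made up here and are not taken from the study.

```python
def mean_wait_three_classes(q, p, wait_aware, wait_random, wait_coincident=0.0):
    """Population mean wait as a mixture over the three hypothesized classes."""
    share_coincident = q            # incidence coincides with a departure
    share_aware = p * (1 - q)       # time their arrival using the timetable
    share_random = (1 - p) * (1 - q)  # randomly incident
    # The three shares always sum to 1 for q and p in [0, 1].
    assert abs(share_coincident + share_aware + share_random - 1.0) < 1e-12
    return (share_coincident * wait_coincident
            + share_aware * wait_aware
            + share_random * wait_random)


# Made-up inputs: 20% coincident, half of the rest aware,
# aware passengers wait 3 min and random passengers wait 5 min on average.
example_wait = mean_wait_three_classes(q=0.2, p=0.5, wait_aware=3.0, wait_random=5.0)
```

Because aware and coincident passengers wait less than randomly incident ones, any positive $q$ or $p$ pulls the mixture below the random incidence prediction, which is consistent with the direction of the finding reported above.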
Bowman and Turnquist built on the concept of aware and unaware passengers of proportions $p$ and $(1-p)$, respectively. They proposed a utility-based model to estimate $p$ and the distribution of incidence times, and thus the mean waiting time, of aware passengers over a given headway as a function of the headway and reliability of bus departure times ( 1 ). They observed seven bus stops in Chicago, Illinois, each served by a single (different) bus route, between 6:00 and 8:00 a.m. for 5 to 10 days each. The bus routes had headways of 5 to 20 min and a range of reliabilities. The authors found that actual average waiting time was substantially less than predicted by the random incidence model. They estimated that $p$ was not statistically significantly different from 1.0, which they explain by the fact that all observations were taken during peak commuting times. Their model predicts that the longer the headway and the more reliable the departures, the more peaked the distribution of incidence times will be and the closer that peak will be to the next scheduled departure time. This prediction demonstrates what they refer to as a safety margin that passengers add to reduce the chance of missing their bus when the service is known to be somewhat unreliable. Such a safety margin can also result from unreliability in passengers’ journeys to the public transport stop or station. Bowman and Turnquist conclude from their model that the random incidence model underestimates the waiting time benefits of improving reliability and overestimates the waiting time benefits of increasing service frequency. This is because as reliability increases passengers can better predict departure times and so can time their incidence to decrease their waiting time.
Furth and Muller study the issue in a theoretical context and generally agree with the above findings ( 2 ). They are primarily concerned with the use of data from automatic vehicle-tracking systems to assess the impacts of reliability on passenger incidence behavior and waiting times. They propose that passengers will react to unreliability by departing earlier than they would with reliable services. Randomly incident unaware passengers will experience unreliability as a more dispersed distribution of headways and simply allocate additional time to their trip plan to improve the chance of arriving at their destination on time. Aware passengers, whose incidence is not entirely random, will react by timing their incidence somewhat earlier than the scheduled departure time to increase their chance of catching the desired service. The authors characterize these reactions as the costs of unreliability.
Luethi et al. continued with the analysis of manually collected data on actual passenger behavior ( 6 ). They use the language of probability to describe two classes of passengers. The first is timetable-dependent passengers (i.e., the aware passengers), whose incidence behavior is affected by awareness (possibly gained
[
{
"layout_dets": [
{
"category_id": 1,
"poly": [
882.4013061523438,
169.93817138671875,
1552.350341796875,
169.93817138671875,
1552.350341796875,
625.8263549804688,
882.4013061523438,
625.8263549804688
],
"score": 0.999992311000824
},
{
"category_id": 1,
"poly": [
882.474853515625,
1450.92822265625,
1551.4490966796875,
1450.92822265625,
1551.4490966796875,
1877.5712890625,
882.474853515625,
1877.5712890625
],
"score": 0.9999903440475464
},
{
"category_id": 1,
"poly": [
881.6513061523438,
626.2058715820312,
1552.1400146484375,
626.2058715820312,
1552.1400146484375,
1450.604736328125,
881.6513061523438,
1450.604736328125
],
"score": 0.9999856352806091
},
{
"category_id": 1,
"poly": [
149.41075134277344,
232.1595001220703,
819.0465087890625,
232.1595001220703,
819.0465087890625,
625.8865356445312,
149.41075134277344,
625.8865356445312
],
"score": 0.99998539686203
},
{
"category_id": 1,
"poly": [
149.3945770263672,
1215.5172119140625,
817.8850708007812,
1215.5172119140625,
817.8850708007812,
1304.873291015625,
149.3945770263672,
1304.873291015625
],
"score": 0.9999765157699585
},
{
"category_id": 1,
"poly": [
882.6979370117188,
1880.13916015625,
1552.15185546875,
1880.13916015625,
1552.15185546875,
2031.339599609375,
882.6979370117188,
2031.339599609375
],
"score": 0.9999744892120361
},
{
"category_id": 1,
"poly": [
148.96054077148438,
743.3055419921875,
818.6231689453125,
743.3055419921875,
818.6231689453125,
1074.2369384765625,
148.96054077148438,
1074.2369384765625
],
"score": 0.9999669790267944
},
{
"category_id": 1,
"poly": [
148.8435516357422,
1791.14306640625,
818.6885375976562,
1791.14306640625,
818.6885375976562,
2030.794189453125,
148.8435516357422,
2030.794189453125
],
"score": 0.9999618530273438
},
{
"category_id": 0,
"poly": [
150.7009735107422,
684.0087890625,
623.5106201171875,
684.0087890625,
623.5106201171875,
717.03662109375,
150.7009735107422,
717.03662109375
],
"score": 0.9999415278434753
},
{
"category_id": 8,
"poly": [
146.48068237304688,
1331.6737060546875,
317.2640075683594,
1331.6737060546875,
317.2640075683594,
1400.1722412109375,
146.48068237304688,
1400.1722412109375
],
"score": 0.9998958110809326
},
{
"category_id": 1,
"poly": [
149.42420959472656,
1430.8782958984375,
818.9042358398438,
1430.8782958984375,
818.9042358398438,
1672.7386474609375,
149.42420959472656,
1672.7386474609375
],
"score": 0.9998599290847778
},
{
"category_id": 1,
"poly": [
149.18746948242188,
172.10252380371094,
818.5662231445312,
172.10252380371094,
818.5662231445312,
230.4594268798828,
149.18746948242188,
230.4594268798828
],
"score": 0.9997718334197998
},
{
"category_id": 0,
"poly": [
149.0175018310547,
1732.1090087890625,
702.1005859375,
1732.1090087890625,
702.1005859375,
1763.6046142578125,
149.0175018310547,
1763.6046142578125
],
"score": 0.9997085928916931
},
{
"category_id": 2,
"poly": [
1519.802490234375,
98.59099578857422,
1551.985107421875,
98.59099578857422,
1551.985107421875,
119.48420715332031,
1519.802490234375,
119.48420715332031
],
"score": 0.9995552897453308
},
{
"category_id": 8,
"poly": [
146.9109649658203,
1100.156494140625,
544.2803344726562,
1100.156494140625,
544.2803344726562,
1184.929443359375,
146.9109649658203,
1184.929443359375
],
"score": 0.9995207786560059
},
{
"category_id": 2,
"poly": [
148.11611938476562,
99.87767791748047,
318.926025390625,
99.87767791748047,
318.926025390625,
120.70393371582031,
148.11611938476562,
120.70393371582031
],
"score": 0.999351441860199
},
{
"category_id": 9,
"poly": [
791.7642211914062,
1130.056396484375,
818.6940307617188,
1130.056396484375,
818.6940307617188,
1161.1080322265625,
791.7642211914062,
1161.1080322265625
],
"score": 0.9908884763717651
},
{
"category_id": 9,
"poly": [
788.37060546875,
1346.8450927734375,
818.5010986328125,
1346.8450927734375,
818.5010986328125,
1377.370361328125,
788.37060546875,
1377.370361328125
],
"score": 0.9873985052108765
},
{
"category_id": 14,
"poly": [
146,
1103,
543,
1103,
543,
1184,
146,
1184
],
"score": 0.94,
"latex": "E\\!\\left(W\\right)\\!=\\!\\frac{E\\!\\left[H^{2}\\right]}{2E\\!\\left[H\\right]}\\!=\\!\\frac{E\\!\\left[H\\right]}{2}\\!\\!\\left(1\\!+\\!\\operatorname{CV}\\!\\left(H\\right)^{2}\\right)"
},
{
"category_id": 13,
"poly": [
1196,
354,
1278,
354,
1278,
384,
1196,
384
],
"score": 0.91,
"latex": "p(1-q)"
},
{
"category_id": 13,
"poly": [
881,
415,
1020,
415,
1020,
444,
881,
444
],
"score": 0.91,
"latex": "(1-p)(1-q)"
},
{
"category_id": 14,
"poly": [
147,
1333,
318,
1333,
318,
1400,
147,
1400
],
"score": 0.91,
"latex": "\\mathbf{CV}\\big(H\\big)\\!=\\!\\frac{\\boldsymbol{\\upsigma}_{H}}{E\\big[H\\big]}"
},
{
"category_id": 13,
"poly": [
1197,
657,
1263,
657,
1263,
686,
1197,
686
],
"score": 0.9,
"latex": "(1-p)"
},
{
"category_id": 13,
"poly": [
213,
1217,
263,
1217,
263,
1244,
213,
1244
],
"score": 0.88,
"latex": "E[X]"
},
{
"category_id": 13,
"poly": [
214,
1434,
245,
1434,
245,
1459,
214,
1459
],
"score": 0.87,
"latex": "\\upsigma_{H}"
},
{
"category_id": 13,
"poly": [
324,
2002,
373,
2002,
373,
2028,
324,
2028
],
"score": 0.84,
"latex": "30\\%"
},
{
"category_id": 13,
"poly": [
1209,
693,
1225,
693,
1225,
717,
1209,
717
],
"score": 0.83,
"latex": "p"
},
{
"category_id": 13,
"poly": [
990,
449,
1007,
449,
1007,
474,
990,
474
],
"score": 0.81,
"latex": "p"
},
{
"category_id": 13,
"poly": [
346,
1277,
369,
1277,
369,
1301,
346,
1301
],
"score": 0.81,
"latex": "H"
},
{
"category_id": 13,
"poly": [
1137,
661,
1154,
661,
1154,
686,
1137,
686
],
"score": 0.81,
"latex": "p"
},
{
"category_id": 13,
"poly": [
522,
1432,
579,
1432,
579,
1459,
522,
1459
],
"score": 0.81,
"latex": "H\\left(4\\right)"
},
{
"category_id": 13,
"poly": [
944,
540,
962,
540,
962,
565,
944,
565
],
"score": 0.8,
"latex": "p"
},
{
"category_id": 13,
"poly": [
1444,
936,
1461,
936,
1461,
961,
1444,
961
],
"score": 0.79,
"latex": "p"
},
{
"category_id": 13,
"poly": [
602,
1247,
624,
1247,
624,
1270,
602,
1270
],
"score": 0.78,
"latex": "H"
},
{
"category_id": 13,
"poly": [
147,
1247,
167,
1247,
167,
1271,
147,
1271
],
"score": 0.77,
"latex": "X"
},
{
"category_id": 13,
"poly": [
210,
1246,
282,
1246,
282,
1274,
210,
1274
],
"score": 0.77,
"latex": "\\operatorname{CV}(H)"
},
{
"category_id": 13,
"poly": [
1346,
268,
1361,
268,
1361,
292,
1346,
292
],
"score": 0.76,
"latex": "q"
},
{
"category_id": 13,
"poly": [
215,
957,
238,
957,
238,
981,
215,
981
],
"score": 0.74,
"latex": "H"
},
{
"category_id": 13,
"poly": [
149,
956,
173,
956,
173,
981,
149,
981
],
"score": 0.63,
"latex": "W"
},
{
"category_id": 13,
"poly": [
924,
841,
1016,
841,
1016,
868,
924,
868
],
"score": 0.56,
"latex": "8{\\cdot}00\\;\\mathrm{a.m}"
},
{
"category_id": 13,
"poly": [
956,
871,
1032,
871,
1032,
898,
956,
898
],
"score": 0.43,
"latex": "20~\\mathrm{min}"
},
{
"category_id": 13,
"poly": [
1082,
781,
1112,
781,
1112,
808,
1082,
808
],
"score": 0.41,
"latex": "(l)"
},
{
"category_id": 13,
"poly": [
697,
1821,
734,
1821,
734,
1847,
697,
1847
],
"score": 0.3,
"latex": "^{1\\mathrm{~h~}}"
}
],
"page_info": {
"page_no": 0,
"height": 2200,
"width": 1700
}
}
]
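The model output above stores each detection as an eight-number `poly` (four corner points) together with a `category_id` and a `score`. A minimal sketch of how such a file might be read follows; it assumes the corner order seen in the entries above, and the helper names `poly_to_bbox` and `summarize` are hypothetical rather than part of magic_pdf. The embedded sample is a shortened copy of two entries, with coordinates rounded.

```python
import json
from collections import Counter

# Shortened sample in the same shape as the asset above (values rounded).
SAMPLE = """
[{"layout_dets": [
    {"category_id": 1,
     "poly": [882.4, 169.9, 1552.4, 169.9, 1552.4, 625.8, 882.4, 625.8],
     "score": 0.99},
    {"category_id": 13,
     "poly": [1196, 354, 1278, 354, 1278, 384, 1196, 384],
     "score": 0.91, "latex": "p(1-q)"}],
  "page_info": {"page_no": 0, "height": 2200, "width": 1700}}]
"""


def poly_to_bbox(poly):
    """Collapse [x0, y0, x1, y1, x2, y2, x3, y3] into (xmin, ymin, xmax, ymax)."""
    xs, ys = poly[0::2], poly[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def summarize(pages, min_score=0.5):
    """Count detections per category_id, ignoring low-confidence ones."""
    counts = Counter()
    for page in pages:
        for det in page["layout_dets"]:
            if det["score"] >= min_score:
                counts[det["category_id"]] += 1
    return counts


pages = json.loads(SAMPLE)
print(summarize(pages))
print(poly_to_bbox(pages[0]["layout_dets"][1]["poly"]))
```

The `min_score` threshold is an arbitrary choice for the example; several detections in the asset score below 0.5, so a consumer may want some cut-off.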
import tempfile
import os
import shutil
import tempfile
from click.testing import CliRunner
from magic_pdf.tools.cli import cli
......@@ -8,19 +9,20 @@ from magic_pdf.tools.cli import cli
def test_cli_pdf():
# setup
unitest_dir = "/tmp/magic_pdf/unittest/tools"
filename = "cli_test_01"
unitest_dir = '/tmp/magic_pdf/unittest/tools'
filename = 'cli_test_01'
os.makedirs(unitest_dir, exist_ok=True)
temp_output_dir = tempfile.mkdtemp(dir="/tmp/magic_pdf/unittest/tools")
temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
os.makedirs(temp_output_dir, exist_ok=True)
# run
runner = CliRunner()
result = runner.invoke(
cli,
[
"-p",
"tests/test_tools/assets/cli/pdf/cli_test_01.pdf",
"-o",
'-p',
'tests/test_tools/assets/cli/pdf/cli_test_01.pdf',
'-o',
temp_output_dir,
],
)
......@@ -28,29 +30,31 @@ def test_cli_pdf():
# check
assert result.exit_code == 0
base_output_dir = os.path.join(temp_output_dir, "cli_test_01/auto")
base_output_dir = os.path.join(temp_output_dir, 'cli_test_01/auto')
r = os.stat(os.path.join(base_output_dir, f"{filename}.md"))
r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
assert r.st_size > 7000
r = os.stat(os.path.join(base_output_dir, "middle.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
assert r.st_size > 200000
r = os.stat(os.path.join(base_output_dir, "model.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
assert r.st_size > 15000
r = os.stat(os.path.join(base_output_dir, "origin.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "layout.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "spans.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
assert r.st_size > 500000
assert os.path.exists(os.path.join(base_output_dir, "images")) is True
assert os.path.isdir(os.path.join(base_output_dir, "images")) is True
assert os.path.exists(os.path.join(base_output_dir, "content_list.json")) is False
assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True
assert os.path.exists(
os.path.join(base_output_dir,
f'{filename}_content_list.json')) is False
# teardown
shutil.rmtree(temp_output_dir)
......@@ -58,68 +62,72 @@ def test_cli_pdf():
def test_cli_path():
# setup
unitest_dir = "/tmp/magic_pdf/unittest/tools"
unitest_dir = '/tmp/magic_pdf/unittest/tools'
os.makedirs(unitest_dir, exist_ok=True)
temp_output_dir = tempfile.mkdtemp(dir="/tmp/magic_pdf/unittest/tools")
temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
os.makedirs(temp_output_dir, exist_ok=True)
# run
runner = CliRunner()
result = runner.invoke(
cli, ["-p", "tests/test_tools/assets/cli/path", "-o", temp_output_dir]
)
cli, ['-p', 'tests/test_tools/assets/cli/path', '-o', temp_output_dir])
# check
assert result.exit_code == 0
filename = "cli_test_01"
base_output_dir = os.path.join(temp_output_dir, "cli_test_01/auto")
filename = 'cli_test_01'
base_output_dir = os.path.join(temp_output_dir, 'cli_test_01/auto')
r = os.stat(os.path.join(base_output_dir, f"{filename}.md"))
r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
assert r.st_size > 7000
r = os.stat(os.path.join(base_output_dir, "middle.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
assert r.st_size > 200000
r = os.stat(os.path.join(base_output_dir, "model.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
assert r.st_size > 15000
r = os.stat(os.path.join(base_output_dir, "origin.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "layout.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "spans.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
assert r.st_size > 500000
assert os.path.exists(os.path.join(base_output_dir, "images")) is True
assert os.path.isdir(os.path.join(base_output_dir, "images")) is True
assert os.path.exists(os.path.join(base_output_dir, "content_list.json")) is False
assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True
assert os.path.exists(
os.path.join(base_output_dir,
f'{filename}_content_list.json')) is False
base_output_dir = os.path.join(temp_output_dir, "cli_test_02/auto")
filename = "cli_test_02"
base_output_dir = os.path.join(temp_output_dir, 'cli_test_02/auto')
filename = 'cli_test_02'
r = os.stat(os.path.join(base_output_dir, f"{filename}.md"))
r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
assert r.st_size > 5000
r = os.stat(os.path.join(base_output_dir, "middle.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
assert r.st_size > 200000
r = os.stat(os.path.join(base_output_dir, "model.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
assert r.st_size > 15000
r = os.stat(os.path.join(base_output_dir, "origin.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "layout.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "spans.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
assert r.st_size > 500000
assert os.path.exists(os.path.join(base_output_dir, "images")) is True
assert os.path.isdir(os.path.join(base_output_dir, "images")) is True
assert os.path.exists(os.path.join(base_output_dir, "content_list.json")) is False
assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True
assert os.path.exists(
os.path.join(base_output_dir,
f'{filename}_content_list.json')) is False
# teardown
shutil.rmtree(temp_output_dir)
import tempfile
import os
import shutil
import tempfile
from click.testing import CliRunner
from magic_pdf.tools import cli_dev
......@@ -8,22 +9,23 @@ from magic_pdf.tools import cli_dev
def test_cli_pdf():
# setup
unitest_dir = "/tmp/magic_pdf/unittest/tools"
filename = "cli_test_01"
unitest_dir = '/tmp/magic_pdf/unittest/tools'
filename = 'cli_test_01'
os.makedirs(unitest_dir, exist_ok=True)
temp_output_dir = tempfile.mkdtemp(dir="/tmp/magic_pdf/unittest/tools")
temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
os.makedirs(temp_output_dir, exist_ok=True)
# run
runner = CliRunner()
result = runner.invoke(
cli_dev.cli,
[
"pdf",
"-p",
"tests/test_tools/assets/cli/pdf/cli_test_01.pdf",
"-j",
"tests/test_tools/assets/cli_dev/cli_test_01.model.json",
"-o",
'pdf',
'-p',
'tests/test_tools/assets/cli/pdf/cli_test_01.pdf',
'-j',
'tests/test_tools/assets/cli_dev/cli_test_01.model.json',
'-o',
temp_output_dir,
],
)
......@@ -31,31 +33,31 @@ def test_cli_pdf():
# check
assert result.exit_code == 0
base_output_dir = os.path.join(temp_output_dir, "cli_test_01/auto")
base_output_dir = os.path.join(temp_output_dir, 'cli_test_01/auto')
r = os.stat(os.path.join(base_output_dir, "content_list.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_content_list.json'))
assert r.st_size > 5000
r = os.stat(os.path.join(base_output_dir, f"{filename}.md"))
r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
assert r.st_size > 7000
r = os.stat(os.path.join(base_output_dir, "middle.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
assert r.st_size > 200000
r = os.stat(os.path.join(base_output_dir, "model.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
assert r.st_size > 15000
r = os.stat(os.path.join(base_output_dir, "origin.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "layout.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "spans.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
assert r.st_size > 500000
assert os.path.exists(os.path.join(base_output_dir, "images")) is True
assert os.path.isdir(os.path.join(base_output_dir, "images")) is True
assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True
# teardown
shutil.rmtree(temp_output_dir)
......@@ -63,26 +65,27 @@ def test_cli_pdf():
def test_cli_jsonl():
# setup
unitest_dir = "/tmp/magic_pdf/unittest/tools"
filename = "cli_test_01"
unitest_dir = '/tmp/magic_pdf/unittest/tools'
filename = 'cli_test_01'
os.makedirs(unitest_dir, exist_ok=True)
temp_output_dir = tempfile.mkdtemp(dir="/tmp/magic_pdf/unittest/tools")
temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
os.makedirs(temp_output_dir, exist_ok=True)
def mock_read_s3_path(s3path):
with open(s3path, "rb") as f:
with open(s3path, 'rb') as f:
return f.read()
cli_dev.read_s3_path = mock_read_s3_path # mock
cli_dev.read_s3_path = mock_read_s3_path # mock
# run
runner = CliRunner()
result = runner.invoke(
cli_dev.cli,
[
"jsonl",
"-j",
"tests/test_tools/assets/cli_dev/cli_test_01.jsonl",
"-o",
'jsonl',
'-j',
'tests/test_tools/assets/cli_dev/cli_test_01.jsonl',
'-o',
temp_output_dir,
],
)
......@@ -90,31 +93,31 @@ def test_cli_jsonl():
# check
assert result.exit_code == 0
base_output_dir = os.path.join(temp_output_dir, "cli_test_01/auto")
base_output_dir = os.path.join(temp_output_dir, 'cli_test_01/auto')
r = os.stat(os.path.join(base_output_dir, "content_list.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_content_list.json'))
assert r.st_size > 5000
r = os.stat(os.path.join(base_output_dir, f"{filename}.md"))
r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
assert r.st_size > 7000
r = os.stat(os.path.join(base_output_dir, "middle.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
assert r.st_size > 200000
r = os.stat(os.path.join(base_output_dir, "model.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
assert r.st_size > 15000
r = os.stat(os.path.join(base_output_dir, "origin.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "layout.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "spans.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
assert r.st_size > 500000
assert os.path.exists(os.path.join(base_output_dir, "images")) is True
assert os.path.isdir(os.path.join(base_output_dir, "images")) is True
assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True
# teardown
shutil.rmtree(temp_output_dir)
import tempfile
import os
import shutil
import tempfile
import pytest
import magic_pdf.model as model_config
from magic_pdf.tools.common import do_parse
@pytest.mark.parametrize("method", ["auto", "txt", "ocr"])
@pytest.mark.parametrize('method', ['auto', 'txt', 'ocr'])
def test_common_do_parse(method):
# setup
unitest_dir = "/tmp/magic_pdf/unittest/tools"
filename = "fake"
model_config.__use_inside_model__ = True
unitest_dir = '/tmp/magic_pdf/unittest/tools'
filename = 'fake'
os.makedirs(unitest_dir, exist_ok=True)
temp_output_dir = tempfile.mkdtemp(dir="/tmp/magic_pdf/unittest/tools")
temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
os.makedirs(temp_output_dir, exist_ok=True)
# run
with open("tests/test_tools/assets/common/cli_test_01.pdf", "rb") as f:
with open('tests/test_tools/assets/common/cli_test_01.pdf', 'rb') as f:
bits = f.read()
do_parse(temp_output_dir, filename, bits, [], method, f_dump_content_list=True)
do_parse(temp_output_dir,
filename,
bits, [],
method,
f_dump_content_list=True)
# check
base_output_dir = os.path.join(temp_output_dir, f"fake/{method}")
base_output_dir = os.path.join(temp_output_dir, f'fake/{method}')
r = os.stat(os.path.join(base_output_dir, "content_list.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_content_list.json'))
assert r.st_size > 5000
r = os.stat(os.path.join(base_output_dir, f"{filename}.md"))
r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
assert r.st_size > 7000
r = os.stat(os.path.join(base_output_dir, "middle.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
assert r.st_size > 200000
r = os.stat(os.path.join(base_output_dir, "model.json"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
assert r.st_size > 15000
r = os.stat(os.path.join(base_output_dir, "origin.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "layout.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
assert r.st_size > 500000
r = os.stat(os.path.join(base_output_dir, "spans.pdf"))
r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
assert r.st_size > 500000
os.path.exists(os.path.join(base_output_dir, "images"))
os.path.isdir(os.path.join(base_output_dir, "images"))
    # assert the results; the bare calls discarded them
    assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
    assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True
# teardown
shutil.rmtree(temp_output_dir)
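The naming convention these tests check, in which every generated file except the `images` directory carries the PDF file name as a prefix, can be summarised in a small helper. This is only a sketch: `expected_output_paths` is a hypothetical function written for illustration, not part of magic_pdf, and the suffix list is taken from the assertions above.

```python
import os

# Suffixes asserted in the tests above; the Markdown file has no underscore.
SUFFIXES = [
    '.md',
    '_middle.json',
    '_model.json',
    '_origin.pdf',
    '_layout.pdf',
    '_spans.pdf',
]


def expected_output_paths(output_dir, filename, method='auto',
                          with_content_list=False):
    """Return the file paths the tests expect under <output_dir>/<filename>/<method>."""
    base = os.path.join(output_dir, filename, method)
    suffixes = list(SUFFIXES)
    if with_content_list:
        # Only the cli_dev and do_parse tests expect this file to exist.
        suffixes.append('_content_list.json')
    return [os.path.join(base, f'{filename}{suffix}') for suffix in suffixes]


for path in expected_output_paths('/tmp/out', 'cli_test_01'):
    print(path)
```

A test could loop over the returned paths and assert that each one exists, instead of repeating one `os.stat` call per file; the size thresholds would still need to be checked separately.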