Unverified Commit c9a51491 authored by icecraft, committed by GitHub

feat: rename the file generated by command line tools (#401)

* feat: rename the file generated by command line tools

* feat: add pdf filename as prefix to {span,layout,model}.pdf

---------
Co-authored-by: icecraft <tmortred@gmail.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>
parent 041b9465
## Overview

Besides the markdown-related output, running the `magic-pdf` command also generates several files that are unrelated to markdown. These files are described one by one below.

### some_pdf_layout.pdf

The layout of every page consists of one or more boxes. The number at the top-left corner of each box is its reading-order index. In addition, layout.pdf marks out the different content blocks with differently colored background fills inside the boxes.

![layout page example](images/layout_example.png)

### some_pdf_spans.pdf

All spans on a page are outlined with frames whose color depends on the span type. This file can be used for quality inspection: problems such as missing text or unrecognized interline (display) formulas can be spotted quickly.

![span page example](images/spans_example.png)

### some_pdf_model.json

#### Structure definition
```python
from pydantic import BaseModel, Field
from enum import IntEnum


class CategoryType(IntEnum):
    ...                     # earlier members elided in this diff
    table_caption = 6       # table caption
    table_footnote = 7      # table footnote
    isolate_formula = 8     # interline (display) formula
    formula_caption = 9     # label of an interline formula
    embedding = 13          # inline formula
    isolated = 14           # interline (display) formula
    text = 15               # ocr recognition result


class PageInfo(BaseModel):
    page_no: int = Field(description="page index, the first page is 0", ge=0)
    height: int = Field(description="page height", gt=0)
    ...                     # remaining fields elided in this diff


class ObjectInferenceResult(BaseModel):
    ...                     # leading fields elided in this diff
    score: float = Field(description="confidence of the inference result")
    latex: str | None = Field(description="latex parsing result", default=None)
    html: str | None = Field(description="html parsing result", default=None)


class PageInferenceResults(BaseModel):
    layout_dets: list[ObjectInferenceResult] = Field(description="page recognition results", ge=0)
    page_info: PageInfo = Field(description="page meta information")


# The inference results of all pages, placed into a list in page order,
# form the minerU inference result
inference_result: list[PageInferenceResults] = []
```
The poly coordinates use the format \[x0, y0, x1, y1, x2, y2, x3, y3\]; the pairs are the coordinates of the top-left, top-right, bottom-right and bottom-left corners, in that order.

![poly coordinate diagram](images/poly.png)
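For readers who prefer axis-aligned boxes, here is a minimal sketch of collapsing such a poly into a `[xmin, ymin, xmax, ymax]` bbox. The helper name and the sample coordinates are illustrative, not part of magic-pdf.

```python
# Minimal sketch: collapse an 8-value poly (top-left, top-right,
# bottom-right, bottom-left corner coordinates) into [xmin, ymin, xmax, ymax].
def poly_to_bbox(poly):
    xs, ys = poly[0::2], poly[1::2]
    return [min(xs), min(ys), max(xs), max(ys)]


print(poly_to_bbox([136.0, 781.0, 340.0, 781.0, 340.0, 806.0, 136.0, 806.0]))
# -> [136.0, 781.0, 340.0, 806.0]
```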
#### Example data

```json
[
    ...
]
```
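As a quick way to inspect a generated model.json without the pydantic models, a minimal sketch that counts detections per page. The file name is illustrative, and the `category_id` key of each detection dict is an assumption (those fields are elided from the snippet above).

```python
import json
from collections import Counter

# Tally detected categories per page of a generated *_model.json.
# Assumes each entry in layout_dets carries a "category_id" field.
with open("some_pdf_model.json", encoding="utf-8") as f:
    pages = json.load(f)

for page in pages:
    counts = Counter(det["category_id"] for det in page["layout_dets"])
    print(f"page {page['page_info']['page_no']}: {dict(counts)}")
```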
### some_pdf_middle.json

| Field | Description |
| :------------- | :------------------------------------------------------------------------------ |
| pdf_info | list; each element is a dict with the parsing result of one pdf page, see below |
| \_parse_type | ocr \| txt; the mode used for this intermediate result |
| \_version_name | string; the magic-pdf version used for this parse |

<br>

**pdf_info**

Field structure description

| Field | Description |
| :------------------ | :------------------------------------------------------------------------------------------------ |
| preproc_blocks | intermediate result after pdf preprocessing, not yet segmented into paragraphs |
| layout_bboxes | layout segmentation result, with the layout direction (vertical, horizontal) and bbox, sorted in reading order |
| page_idx | page index, starting from 0 |
| page_size | width and height of the page |
| \_layout_tree | layout tree structure |
| images | list; each element is a dict describing one img_block |
| tables | list; each element is a dict describing one table_block |
| interline_equations | list; each element is a dict describing one interline_equation_block |
| discarded_blocks | list; block information returned by the model that should be dropped |
| para_blocks | the result of segmenting preproc_blocks into paragraphs |

`para_blocks` above is an array of dicts; each dict is one block structure, and blocks support at most one level of nesting.

The outer block is called a first-level block. The fields of a first-level block are:

| Field | Description |
| :----- | :---------------------------------------------------------- |
| type | block type (table\|image) |
| bbox | rectangle coordinates of the block |
| blocks | list; every element is a second-level block given as a dict |

<br>

Only "table" and "image" exist as first-level block types; all other blocks are second-level blocks.

The fields of a second-level block are:

| Field | Description |
| :----- | :-------------------------------------------------------------------- |
| type | block type |
| bbox | rectangle coordinates of the block |
| lines | list; each element is a dict describing one line and its composition |

Second-level block types in detail:

| type | desc |
| :----------------- | :-------------------------------- |
| image_body | body of the image |
| image_caption | caption text of the image |
| table_body | body of the table |
| table_caption | caption text of the table |
| table_footnote | footnote of the table |
| text | text block |
| title | title block |
| interline_equation | interline (display) formula block |

<br>

The fields of a line are:

| Field | Description |
| :----- | :-------------------------------------------------------------------------------- |
| bbox | rectangle coordinates of the line |
| spans | list; each element is a dict describing one span, the smallest constituent unit |

<br>

**span**

| Field | Description |
| :------------------ | :----------------------------------------------------------------------------------------------------------------- |
| bbox | rectangle coordinates of the span |
| type | type of the span |
| content \| img_path | text-type spans use content; image and table spans use img_path to store the actual text or the path of the crop |

The span types are:

| type | desc |
| :----------------- | :-------------------------- |
| image | image |
| table | table |
| text | text |
| inline_equation | inline formula |
| interline_equation | interline (display) formula |

**Summary**

A span is the smallest storage unit of any element.

The elements stored in para_blocks are block information.

The hierarchy is: first-level block (if any) -> second-level block -> line -> span
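To make the hierarchy concrete, here is a minimal sketch that walks a middle.json file down to the span level and prints the text content. It uses plain dict access with the field names from the tables above; the file name and helper are illustrative, not part of magic-pdf's API.

```python
import json

# Walk middle.json: page -> para_blocks -> (nested blocks) -> lines -> spans.
def iter_spans(block):
    # first-level table/image blocks nest second-level blocks under "blocks";
    # every other block is already a second-level block
    for sub in block.get("blocks", [block]):
        for line in sub.get("lines", []):
            yield from line.get("spans", [])


with open("some_pdf_middle.json", encoding="utf-8") as f:
    middle = json.load(f)

for page in middle["pdf_info"]:
    for block in page["para_blocks"]:
        for span in iter_spans(block):
            if span["type"] in ("text", "inline_equation", "interline_equation"):
                print(span.get("content", ""))
```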
#### Example data

```json
{
    ...
    "_parse_type": "txt",
    "_version_name": "0.6.1"
}
```
# magic_pdf/tools/cli.py
import os
from pathlib import Path

import click
from loguru import logger

import magic_pdf.model as model_config
from magic_pdf.libs.version import __version__
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.tools.common import do_parse, parse_pdf_methods


@click.command()
@click.version_option(__version__,
                      '--version',
                      '-v',
                      help='display the version and exit')
@click.option(
    '-p',
    '--path',
    'path',
    type=click.Path(exists=True),
    required=True,
    help='local pdf filepath or directory',
)
@click.option(
    '-o',
    '--output-dir',
    'output_dir',
    type=click.Path(),
    required=True,
    help='output local directory',
    default='',
)
@click.option(
    '-m',
    '--method',
    'method',
    type=parse_pdf_methods,
    help="""the method for parsing pdf.
ocr: using ocr technique to extract information from pdf.
txt: suitable for the text-based pdf only and outperform ocr.
auto: automatically choose the best method for parsing pdf from ocr and txt.
without method specified, auto will be used by default.""",
    default='auto',
)
def cli(path, output_dir, method):
    model_config.__use_inside_model__ = True
    model_config.__model_mode__ = 'full'
    # -o is now required; the previous fallback that placed an "output"
    # directory next to the input path was removed
    os.makedirs(output_dir, exist_ok=True)

    def read_fn(path):
        disk_rw = DiskReaderWriter(os.path.dirname(path))
        # ... (unchanged lines elided in this diff)
            logger.exception(e)

    if os.path.isdir(path):
        for doc_path in Path(path).glob('*.pdf'):
            parse_doc(doc_path)
    else:
        parse_doc(path)


if __name__ == '__main__':
    cli()
# magic_pdf/tools/cli_dev.py
import json as json_parse
import os
from pathlib import Path

import click

import magic_pdf.model as model_config
from magic_pdf.libs.config_reader import get_s3_config
from magic_pdf.libs.path_utils import (parse_s3_range_params, parse_s3path,
                                       remove_non_official_s3_args)
from magic_pdf.libs.version import __version__
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
from magic_pdf.rw.S3ReaderWriter import S3ReaderWriter
from magic_pdf.tools.common import do_parse, parse_pdf_methods


def read_s3_path(s3path):
    bucket, key = parse_s3path(s3path)

    s3_ak, s3_sk, s3_endpoint = get_s3_config(bucket)
    s3_rw = S3ReaderWriter(s3_ak, s3_sk, s3_endpoint, 'auto',
                           remove_non_official_s3_args(s3path))
    may_range_params = parse_s3_range_params(s3path)
    if may_range_params is None or 2 != len(may_range_params):
        byte_start, byte_end = 0, None
    else:
        byte_start, byte_end = int(may_range_params[0]), int(
            may_range_params[1])
    return s3_rw.read_offset(
        remove_non_official_s3_args(s3path),
        byte_start,
        # ... (unchanged lines elided in this diff)


@click.group()
@click.version_option(__version__, '--version', '-v', help='显示版本信息')
def cli():
    pass


@cli.command()
@click.option(
    '-j',
    '--jsonl',
    'jsonl',
    type=str,
    help='输入 jsonl 路径,本地或者 s3 上的文件',
    required=True,
)
@click.option(
    '-m',
    '--method',
    'method',
    type=parse_pdf_methods,
    help='指定解析方法。txt: 文本型 pdf 解析方法, ocr: 光学识别解析 pdf, auto: 程序智能选择解析方法',
    default='auto',
)
@click.option(
    '-o',
    '--output-dir',
    'output_dir',
    type=click.Path(),
    required=True,
    help='输出到本地目录',
    default='',
)
def jsonl(jsonl, method, output_dir):
    model_config.__use_inside_model__ = False
    if jsonl.startswith('s3://'):
        jso = json_parse.loads(read_s3_path(jsonl).decode('utf-8'))
    else:
        with open(jsonl) as f:
            jso = json_parse.loads(f.readline())
    # -o is now required; the old fallback to an "output" directory next to
    # the jsonl file was removed
    os.makedirs(output_dir, exist_ok=True)
    s3_file_path = jso.get('file_location')
    if s3_file_path is None:
        s3_file_path = jso.get('path')
    pdf_file_name = Path(s3_file_path).stem
    pdf_data = read_s3_path(s3_file_path)
    # ... (unchanged lines elided in this diff)
        output_dir,
        pdf_file_name,
        pdf_data,
        jso['doc_layout_result'],
        method,
        f_dump_content_list=True,
        f_draw_model_bbox=True,
    # ... (unchanged lines elided in this diff)


@cli.command()
@click.option(
    '-p',
    '--pdf',
    'pdf',
    type=click.Path(exists=True),
    required=True,
    help='本地 PDF 文件',
)
@click.option(
    '-j',
    '--json',
    'json_data',
    type=click.Path(exists=True),
    required=True,
    help='本地模型推理出的 json 数据',
)
@click.option('-o',
              '--output-dir',
              'output_dir',
              type=click.Path(),
              required=True,
              help='本地输出目录',
              default='')
@click.option(
    '-m',
    '--method',
    'method',
    type=parse_pdf_methods,
    help='指定解析方法。txt: 文本型 pdf 解析方法, ocr: 光学识别解析 pdf, auto: 程序智能选择解析方法',
    default='auto',
)
def pdf(pdf, json_data, output_dir, method):
    model_config.__use_inside_model__ = False
    full_pdf_path = os.path.realpath(pdf)
    os.makedirs(output_dir, exist_ok=True)

    def read_fn(path):
        disk_rw = DiskReaderWriter(os.path.dirname(path))
        return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)

    model_json_list = json_parse.loads(read_fn(json_data).decode('utf-8'))

    file_name = str(Path(full_pdf_path).stem)
    pdf_data = read_fn(full_pdf_path)
    # ... (unchanged lines elided in this diff)
    )


if __name__ == '__main__':
    cli()
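For reference, the `jsonl` sub-command above reads a single JSON line that points at the pdf (`file_location`, falling back to `path`) and carries the pre-computed layout results under `doc_layout_result`. A purely illustrative sketch of producing such a line follows; the field names come from the code above, the values are placeholders.

```python
import json

# Illustrative input record for the jsonl sub-command.
record = {
    'file_location': 'tests/test_tools/assets/cli/pdf/cli_test_01.pdf',
    'doc_layout_result': [],  # per-page model inference results (model.json content)
}

with open('input.jsonl', 'w', encoding='utf-8') as f:
    f.write(json.dumps(record, ensure_ascii=False) + '\n')
```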
# magic_pdf/tools/common.py
import copy
import json as json_parse
import os

import click
from loguru import logger

import magic_pdf.model as model_config
from magic_pdf.libs.draw_bbox import (draw_layout_bbox, draw_span_bbox,
                                      drow_model_bbox)
from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
from magic_pdf.pipe.OCRPipe import OCRPipe
from magic_pdf.pipe.TXTPipe import TXTPipe
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter


def prepare_env(output_dir, pdf_file_name, method):
    local_parent_dir = os.path.join(output_dir, pdf_file_name, method)

    local_image_dir = os.path.join(str(local_parent_dir), 'images')
    local_md_dir = local_parent_dir
    os.makedirs(local_image_dir, exist_ok=True)
    os.makedirs(local_md_dir, exist_ok=True)
    # ... (unchanged lines elided in this diff, including the head of do_parse)
        f_draw_model_bbox=False,
):
    orig_model_list = copy.deepcopy(model_list)
    local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name,
                                                parse_method)

    image_writer, md_writer = DiskReaderWriter(
        local_image_dir), DiskReaderWriter(local_md_dir)
    image_dir = str(os.path.basename(local_image_dir))

    if parse_method == 'auto':
        jso_useful_key = {'_pdf_type': '', 'model_list': model_list}
        pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer, is_debug=True)
    elif parse_method == 'txt':
        pipe = TXTPipe(pdf_bytes, model_list, image_writer, is_debug=True)
    elif parse_method == 'ocr':
        pipe = OCRPipe(pdf_bytes, model_list, image_writer, is_debug=True)
    else:
        logger.error('unknown parse method')
        exit(1)

    pipe.pipe_classify()
    # ... (unchanged lines elided in this diff)
        pipe.pipe_analyze()
        orig_model_list = copy.deepcopy(pipe.model_list)
    else:
        logger.error('need model list input')
        exit(2)

    pipe.pipe_parse()
    pdf_info = pipe.pdf_mid_data['pdf_info']
    # the drawing helpers now receive pdf_file_name so the generated pdfs
    # carry the source file name as a prefix
    if f_draw_layout_bbox:
        draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir, pdf_file_name)
    if f_draw_span_bbox:
        draw_span_bbox(pdf_info, pdf_bytes, local_md_dir, pdf_file_name)
    if f_draw_model_bbox:
        drow_model_bbox(orig_model_list, pdf_bytes, local_md_dir,
                        pdf_file_name)

    md_content = pipe.pipe_mk_markdown(image_dir,
                                       drop_mode=DropMode.NONE,
                                       md_make_mode=f_make_md_mode)
    if f_dump_md:
        md_writer.write(
            content=md_content,
            path=f'{pdf_file_name}.md',
            mode=AbsReaderWriter.MODE_TXT,
        )

    if f_dump_middle_json:
        md_writer.write(
            content=json_parse.dumps(pipe.pdf_mid_data,
                                     ensure_ascii=False,
                                     indent=4),
            path=f'{pdf_file_name}_middle.json',  # was 'middle.json'
            mode=AbsReaderWriter.MODE_TXT,
        )

    if f_dump_model_json:
        md_writer.write(
            content=json_parse.dumps(orig_model_list,
                                     ensure_ascii=False,
                                     indent=4),
            path=f'{pdf_file_name}_model.json',  # was 'model.json'
            mode=AbsReaderWriter.MODE_TXT,
        )

    if f_dump_orig_pdf:
        md_writer.write(
            content=pdf_bytes,
            path=f'{pdf_file_name}_origin.pdf',  # was 'origin.pdf'
            mode=AbsReaderWriter.MODE_BIN,
        )

    content_list = pipe.pipe_mk_uni_format(image_dir, drop_mode=DropMode.NONE)
    if f_dump_content_list:
        md_writer.write(
            content=json_parse.dumps(content_list,
                                     ensure_ascii=False,
                                     indent=4),
            path=f'{pdf_file_name}_content_list.json',  # was 'content_list.json'
            mode=AbsReaderWriter.MODE_TXT,
        )

    logger.info(f'local output dir is {local_md_dir}')


parse_pdf_methods = click.Choice(['ocr', 'txt', 'auto'])
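A minimal usage sketch of `do_parse` follows. It mirrors the unit tests further down (empty model list, internal model enabled); the paths are illustrative.

```python
import os

import magic_pdf.model as model_config
from magic_pdf.tools.common import do_parse

model_config.__use_inside_model__ = True   # let do_parse run the analysis step itself
model_config.__model_mode__ = 'full'

pdf_path = 'some_pdf.pdf'                  # illustrative local file
output_dir = 'output'
pdf_file_name = os.path.splitext(os.path.basename(pdf_path))[0]

with open(pdf_path, 'rb') as f:
    pdf_bytes = f.read()

# Results land in {output_dir}/{pdf_file_name}/auto/, e.g. some_pdf.md,
# some_pdf_middle.json, some_pdf_model.json, some_pdf_origin.pdf,
# some_pdf_layout.pdf and some_pdf_spans.pdf
do_parse(output_dir, pdf_file_name, pdf_bytes, [], 'auto')
```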
dependent on the service headway and the reliability of the departure time of the service to which passengers are incident.
After briefly introducing the random incidence model, which is often assumed to hold at short headways, the balance of this section reviews six studies of passenger incidence behavior that are motivated by understanding the relationships between service headway, service reliability, passenger incidence behavior, and passenger waiting time in a more nuanced fashion than is embedded in the random incidence assumption (2). Three of these studies depend on manually collected data, two studies use data from AFC systems, and one study analyzes the issue purely theoretically. These studies reveal much about passenger incidence behavior, but all are found to be limited in their general applicability by the methods with which they collect information about passengers and the services those passengers intend to use.
# Random Passenger Incidence Behavior
One characterization of passenger incidence behavior is that of random incidence (3). The key assumption underlying the random incidence model is that the process of passenger arrivals to the public transport service is independent from the vehicle departure process of the service. This implies that passengers become incident to the service at a random time, and thus the instantaneous rate of passenger arrivals to the service is uniform over a given period of time. Let $W$ and $H$ be random variables representing passenger waiting times and service headways, respectively. Under the random incidence assumption and the assumption that vehicle capacity is not a binding constraint, a classic result of transportation science is that
$$
E[W] = \frac{E[H^{2}]}{2E[H]} = \frac{E[H]}{2}\left(1 + \operatorname{CV}(H)^{2}\right)
$$
where $E[X]$ is the probabilistic expectation of some random variable $X$ and $\operatorname{CV}(H)$ is the coefficient of variation of $H$, a unitless measure of the variability of $H$ defined as
$$
\operatorname{CV}(H) = \frac{\sigma_{H}}{E[H]}
$$
where $\sigma_{H}$ is the standard deviation of $H$ (4). The second expression in Equation 1 is particularly useful because it expresses the mean passenger waiting time as the sum of two components: the waiting time caused by the mean headway (i.e., the reciprocal of service frequency) and the waiting time caused by the variability of the headways (which is one measure of service reliability). When the service is perfectly reliable with constant headways, the mean waiting time will be simply half the headway.
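As a quick numeric illustration of Equation 1 (the headway values below are made up), the expected wait exceeds half the mean headway as soon as the headways vary:

```python
import statistics

# Evaluate E[W] = E[H]/2 * (1 + CV(H)^2) for an illustrative set of headways (minutes).
headways = [8.0, 12.0, 10.0, 9.0, 11.0]

mean_h = statistics.mean(headways)
cv_h = statistics.pstdev(headways) / mean_h
expected_wait = mean_h / 2 * (1 + cv_h ** 2)

print(round(expected_wait, 2))  # 5.1 -> slightly above half the 10-minute mean headway
```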
# More Behaviorally Realistic Incidence Models
Jolliffe and Hutchinson studied bus passenger incidence in South London suburbs (5). They observed 10 bus stops for 1 h per day over 8 days, recording the times of passenger incidence and actual and scheduled bus departures. They limited their stop selection to those served by only a single bus route with a single service pattern so as to avoid ambiguity about which service a passenger was waiting for. The authors found that the actual average passenger waiting time was 30% less than predicted by the random incidence model. They also found that the empirical distributions of passenger incidence times (by time of day) had peaks just before the respective average bus departure times. They hypothesized the existence of three classes of passengers: with proportion $q$, passengers whose time of incidence is causally coincident with that of a bus departure (e.g., because they saw the approaching bus from their home or a shop window); with proportion $p(1-q)$, passengers who time their arrivals to minimize expected waiting time; and with proportion $(1-p)(1-q)$, passengers who are randomly incident. The authors found that $p$ was positively correlated with the potential reduction in waiting time (compared with arriving randomly) that resulted from knowledge of the timetable and of service reliability. They also found $p$ to be higher in the peak commuting periods rather than in the off-peak periods, indicating more awareness of the timetable or historical reliability, or both, by commuters.

Bowman and Turnquist built on the concept of aware and unaware passengers of proportions $p$ and $(1-p)$, respectively. They proposed a utility-based model to estimate $p$ and the distribution of incidence times, and thus the mean waiting time, of aware passengers over a given headway as a function of the headway and reliability of bus departure times (1). They observed seven bus stops in Chicago, Illinois, each served by a single (different) bus route, between 6:00 and 8:00 a.m. for 5 to 10 days each. The bus routes had headways of 5 to 20 min and a range of reliabilities. The authors found that actual average waiting time was substantially less than predicted by the random incidence model. They estimated that $p$ was not statistically significantly different from 1.0, which they explain by the fact that all observations were taken during peak commuting times. Their model predicts that the longer the headway and the more reliable the departures, the more peaked the distribution of incidence times will be and the closer that peak will be to the next scheduled departure time. This prediction demonstrates what they refer to as a safety margin that passengers add to reduce the chance of missing their bus when the service is known to be somewhat unreliable. Such a safety margin can also result from unreliability in passengers' journeys to the public transport stop or station. Bowman and Turnquist conclude from their model that the random incidence model underestimates the waiting time benefits of improving reliability and overestimates the waiting time benefits of increasing service frequency. This is because as reliability increases passengers can better predict departure times and so can time their incidence to decrease their waiting time.

Furth and Muller study the issue in a theoretical context and generally agree with the above findings (2). They are primarily concerned with the use of data from automatic vehicle-tracking systems to assess the impacts of reliability on passenger incidence behavior and waiting times. They propose that passengers will react to unreliability by departing earlier than they would with reliable services. Randomly incident unaware passengers will experience unreliability as a more dispersed distribution of headways and simply allocate additional time to their trip plan to improve the chance of arriving at their destination on time. Aware passengers, whose incidence is not entirely random, will react by timing their incidence somewhat earlier than the scheduled departure time to increase their chance of catching the desired service. The authors characterize these reactions as the costs of unreliability.

Luethi et al. continued with the analysis of manually collected data on actual passenger behavior (6). They use the language of probability to describe two classes of passengers. The first is timetable-dependent passengers (i.e., the aware passengers), whose incidence behavior is affected by awareness (possibly gained
# tests for magic_pdf.tools.cli
import os
import shutil
import tempfile

from click.testing import CliRunner

from magic_pdf.tools.cli import cli
# ... (unchanged lines elided in this diff)


def test_cli_pdf():
    # setup
    unitest_dir = '/tmp/magic_pdf/unittest/tools'
    filename = 'cli_test_01'
    os.makedirs(unitest_dir, exist_ok=True)
    temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
    os.makedirs(temp_output_dir, exist_ok=True)

    # run
    runner = CliRunner()
    result = runner.invoke(
        cli,
        [
            '-p',
            'tests/test_tools/assets/cli/pdf/cli_test_01.pdf',
            '-o',
            temp_output_dir,
        ],
    )
    # ... (unchanged lines elided in this diff)

    # check
    assert result.exit_code == 0

    base_output_dir = os.path.join(temp_output_dir, 'cli_test_01/auto')

    r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
    assert r.st_size > 7000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
    assert r.st_size > 200000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
    assert r.st_size > 15000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
    assert r.st_size > 500000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
    assert r.st_size > 500000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
    assert r.st_size > 500000

    assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
    assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True
    assert os.path.exists(
        os.path.join(base_output_dir,
                     f'{filename}_content_list.json')) is False

    # teardown
    shutil.rmtree(temp_output_dir)
    # ... (unchanged lines elided in this diff)


def test_cli_path():
    # setup
    unitest_dir = '/tmp/magic_pdf/unittest/tools'
    os.makedirs(unitest_dir, exist_ok=True)
    temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
    os.makedirs(temp_output_dir, exist_ok=True)

    # run
    runner = CliRunner()
    result = runner.invoke(
        cli, ['-p', 'tests/test_tools/assets/cli/path', '-o', temp_output_dir])

    # check
    assert result.exit_code == 0

    filename = 'cli_test_01'
    base_output_dir = os.path.join(temp_output_dir, 'cli_test_01/auto')

    r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
    assert r.st_size > 7000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
    assert r.st_size > 200000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
    assert r.st_size > 15000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
    assert r.st_size > 500000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
    assert r.st_size > 500000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
    assert r.st_size > 500000

    assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
    assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True
    assert os.path.exists(
        os.path.join(base_output_dir,
                     f'{filename}_content_list.json')) is False

    base_output_dir = os.path.join(temp_output_dir, 'cli_test_02/auto')
    filename = 'cli_test_02'

    r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
    assert r.st_size > 5000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
    assert r.st_size > 200000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
    assert r.st_size > 15000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
    assert r.st_size > 500000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
    assert r.st_size > 500000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
    assert r.st_size > 500000

    assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
    assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True
    assert os.path.exists(
        os.path.join(base_output_dir,
                     f'{filename}_content_list.json')) is False

    # teardown
    shutil.rmtree(temp_output_dir)
# tests for magic_pdf.tools.cli_dev
import os
import shutil
import tempfile

from click.testing import CliRunner

from magic_pdf.tools import cli_dev
# ... (unchanged lines elided in this diff)


def test_cli_pdf():
    # setup
    unitest_dir = '/tmp/magic_pdf/unittest/tools'
    filename = 'cli_test_01'
    os.makedirs(unitest_dir, exist_ok=True)
    temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
    os.makedirs(temp_output_dir, exist_ok=True)

    # run
    runner = CliRunner()
    result = runner.invoke(
        cli_dev.cli,
        [
            'pdf',
            '-p',
            'tests/test_tools/assets/cli/pdf/cli_test_01.pdf',
            '-j',
            'tests/test_tools/assets/cli_dev/cli_test_01.model.json',
            '-o',
            temp_output_dir,
        ],
    )
    # ... (unchanged lines elided in this diff)

    # check
    assert result.exit_code == 0

    base_output_dir = os.path.join(temp_output_dir, 'cli_test_01/auto')

    r = os.stat(os.path.join(base_output_dir, f'{filename}_content_list.json'))
    assert r.st_size > 5000

    r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
    assert r.st_size > 7000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
    assert r.st_size > 200000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
    assert r.st_size > 15000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
    assert r.st_size > 500000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
    assert r.st_size > 500000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
    assert r.st_size > 500000

    assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
    assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True

    # teardown
    shutil.rmtree(temp_output_dir)
    # ... (unchanged lines elided in this diff)


def test_cli_jsonl():
    # setup
    unitest_dir = '/tmp/magic_pdf/unittest/tools'
    filename = 'cli_test_01'
    os.makedirs(unitest_dir, exist_ok=True)
    temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
    os.makedirs(temp_output_dir, exist_ok=True)

    def mock_read_s3_path(s3path):
        with open(s3path, 'rb') as f:
            return f.read()

    cli_dev.read_s3_path = mock_read_s3_path  # mock

    # run
    runner = CliRunner()
    result = runner.invoke(
        cli_dev.cli,
        [
            'jsonl',
            '-j',
            'tests/test_tools/assets/cli_dev/cli_test_01.jsonl',
            '-o',
            temp_output_dir,
        ],
    )
    # ... (unchanged lines elided in this diff)

    # check
    assert result.exit_code == 0

    base_output_dir = os.path.join(temp_output_dir, 'cli_test_01/auto')

    r = os.stat(os.path.join(base_output_dir, f'{filename}_content_list.json'))
    assert r.st_size > 5000

    r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
    assert r.st_size > 7000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
    assert r.st_size > 200000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
    assert r.st_size > 15000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
    assert r.st_size > 500000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
    assert r.st_size > 500000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
    assert r.st_size > 500000

    assert os.path.exists(os.path.join(base_output_dir, 'images')) is True
    assert os.path.isdir(os.path.join(base_output_dir, 'images')) is True

    # teardown
    shutil.rmtree(temp_output_dir)
# tests for magic_pdf.tools.common
import os
import shutil
import tempfile

import pytest

import magic_pdf.model as model_config
from magic_pdf.tools.common import do_parse


@pytest.mark.parametrize('method', ['auto', 'txt', 'ocr'])
def test_common_do_parse(method):
    # setup
    model_config.__use_inside_model__ = True
    unitest_dir = '/tmp/magic_pdf/unittest/tools'
    filename = 'fake'
    os.makedirs(unitest_dir, exist_ok=True)
    temp_output_dir = tempfile.mkdtemp(dir='/tmp/magic_pdf/unittest/tools')
    os.makedirs(temp_output_dir, exist_ok=True)

    # run
    with open('tests/test_tools/assets/common/cli_test_01.pdf', 'rb') as f:
        bits = f.read()
    do_parse(temp_output_dir,
             filename,
             bits, [],
             method,
             f_dump_content_list=True)

    # check
    base_output_dir = os.path.join(temp_output_dir, f'fake/{method}')

    r = os.stat(os.path.join(base_output_dir, f'{filename}_content_list.json'))
    assert r.st_size > 5000

    r = os.stat(os.path.join(base_output_dir, f'{filename}.md'))
    assert r.st_size > 7000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_middle.json'))
    assert r.st_size > 200000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_model.json'))
    assert r.st_size > 15000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_origin.pdf'))
    assert r.st_size > 500000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_layout.pdf'))
    assert r.st_size > 500000

    r = os.stat(os.path.join(base_output_dir, f'{filename}_spans.pdf'))
    assert r.st_size > 500000

    os.path.exists(os.path.join(base_output_dir, 'images'))
    os.path.isdir(os.path.join(base_output_dir, 'images'))

    # teardown
    shutil.rmtree(temp_output_dir)