Qin Kaijie / pdf-miner
Commit 6b76f5cb, authored Jul 15, 2024 by myhloli
update(readme): Optimizing the Installation Process
parent 1debe7fe
Showing 3 changed files with 80 additions and 74 deletions
README.md +41 -37
README_zh-CN.md +37 -36
demo/demo.py +2 -1
README.md
@@ -82,21 +82,22 @@ conda create -n MinerU python=3.10
conda activate MinerU
```
-### Usage Instructions
+### Installation and Configuration
#### 1. Install Magic-PDF
-Install using pip:
-```bash
-pip install magic-pdf
-```
-Alternatively, for built-in high-precision model parsing capabilities, use:
+Install the full-feature package with pip:
+>Note: The pip-installed package supports CPU-only and is ideal for quick tests.
+>
+>For CUDA/MPS acceleration in production, see [Acceleration Using CUDA or MPS](#4-Acceleration-Using-CUDA-or-MPS).
```bash
pip install magic-pdf[full-cpu]
```
-The high-precision models depend on detectron2, which requires a compiled installation.
-If you need to compile it yourself, refer to https://github.com/facebookresearch/detectron2/issues/5114
-Or directly use our pre-compiled wheel packages (limited to python 3.10):
+The full-feature package depends on detectron2, which requires a compilation installation.
+If you need to compile it yourself, please refer to https://github.com/facebookresearch/detectron2/issues/5114
+Alternatively, you can directly use our precompiled whl package (limited to Python 3.10):
```bash
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
```
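Review note: the README limits the precompiled detectron2 wheel to Python 3.10, so a setup script could check the interpreter version before choosing between the wheel index and a source build. The sketch below is only an illustration; `prebuilt_wheel_applicable` is a hypothetical helper, not part of magic-pdf.

```python
import sys


def prebuilt_wheel_applicable(version_info=None) -> bool:
    """Return True when the interpreter is Python 3.10.x (hypothetical helper)."""
    if version_info is None:
        version_info = sys.version_info
    # Only major.minor matters; the README limits the wheel to Python 3.10.
    return tuple(version_info[:2]) == (3, 10)


if __name__ == "__main__":
    if prebuilt_wheel_applicable():
        print("Python 3.10 detected: the precompiled wheel index may be usable.")
    else:
        print("Not Python 3.10: detectron2 would need to be compiled from source.")
```

Passing an explicit tuple such as `(3, 11, 0)` makes the check easy to exercise without switching interpreters.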
...
...
@@ -123,31 +124,8 @@ In magic-pdf.json, configure "models-dir" to point to the directory where the mo
```
-#### 4. Usage via Command Line
-###### simple
-```bash
-magic-pdf pdf-command --pdf "pdf_path" --inside_model true
-```
-After the program has finished, you can find the generated markdown files under the directory "/tmp/magic-pdf".
-You can find the corresponding xxx_model.json file in the markdown directory.
-If you intend to do secondary development on the post-processing pipeline, you can use the command:
-```bash
-magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
-```
-In this way, you won't need to re-run the model data, making debugging more convenient.
-###### more
-```bash
-magic-pdf --help
-```
-#### 5. Acceleration Using CUDA or MPS
+#### 4. Acceleration Using CUDA or MPS
If you have an available Nvidia GPU or are using a Mac with Apple Silicon, you can leverage acceleration with CUDA or MPS respectively.
##### CUDA
You need to install the corresponding PyTorch version according to your CUDA version.
...
...
@@ -172,13 +150,39 @@ You also need to modify the value of "device-mode" in the configuration file mag
}
```
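Review note: the README asks the reader to edit "device-mode" in magic-pdf.json by hand. A small script could make the same change, as in the sketch below. `set_device_mode` is a hypothetical helper, and the value strings used here ("cuda", "mps") are assumptions, since the concrete values are elided in this diff.

```python
import json
from pathlib import Path


def set_device_mode(config_path, device_mode):
    """Rewrite "device-mode" in a magic-pdf.json style config and return the new config."""
    path = Path(config_path).expanduser()
    config = json.loads(path.read_text(encoding="utf-8"))
    # Only "device-mode" changes; other keys such as "models-dir" are left untouched.
    config["device-mode"] = device_mode
    path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return config
```

A call might look like `set_device_mode("~/magic-pdf.json", "cuda")`, assuming the config was copied to the home directory as the README describes.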
-#### 6. Usage via Api
+### Usage
+#### 1.Usage via Command Line
+###### simple
+```bash
+magic-pdf pdf-command --pdf "pdf_path" --inside_model true
+```
+After the program has finished, you can find the generated markdown files under the directory "/tmp/magic-pdf".
+You can find the corresponding xxx_model.json file in the markdown directory.
+If you intend to do secondary development on the post-processing pipeline, you can use the command:
+```bash
+magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
+```
+In this way, you won't need to re-run the model data, making debugging more convenient.
+###### more
+```bash
+magic-pdf --help
+```
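Review note: the two command forms above differ only in whether the built-in model runs (`--inside_model true`) or a saved xxx_model.json is reused (`--model`). A wrapper that chooses between them could look like the sketch below. The helper names are hypothetical; only the flags shown in the README are used.

```python
import shlex
import subprocess


def build_magic_pdf_command(pdf_path, model_json_path=None):
    """Build the argument list for `magic-pdf pdf-command` (hypothetical helper)."""
    cmd = ["magic-pdf", "pdf-command", "--pdf", str(pdf_path)]
    if model_json_path is None:
        # First run: let the built-in model analyse the PDF.
        cmd += ["--inside_model", "true"]
    else:
        # Later runs: reuse the xxx_model.json written by an earlier run.
        cmd += ["--model", str(model_json_path)]
    return cmd


def run_magic_pdf(pdf_path, model_json_path=None):
    """Run the CLI; requires magic-pdf to be installed and on PATH."""
    cmd = build_magic_pdf_command(pdf_path, model_json_path)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution means the argument list can be inspected or tested without the package installed.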
+#### 2. Usage via Api
###### Local
```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
-jso_useful_key = {"_pdf_type": "", "model_list": model_json}
+jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
...
...
@@ -191,7 +195,7 @@ s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
-jso_useful_key = {"_pdf_type": "", "model_list": model_json}
+jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
...
...
README_zh-CN.md
@@ -78,19 +78,18 @@ conda activate MinerU
```
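Development is based on Python 3.10; if you encounter problems with other Python versions, please switch to 3.10.
-### Usage Instructions
+### Installation and Configuration
#### 1. Install Magic-PDF
-Install with pip:
-```bash
-pip install magic-pdf
-```
-Alternatively, if you need the built-in high-precision model parsing capability, use:
+Install the full-feature package with pip:
+>Due to PyPI restrictions, the pip-installed full-feature package supports CPU inference only and is recommended only for quickly testing parsing capability.
+>
+>To use CUDA/MPS acceleration in production, see [Accelerating Inference with CUDA or MPS](#4-使用CUDA或MPS加速推理)
```bash
pip install magic-pdf[full-cpu]
```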
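-The high-precision models depend on detectron2, which must be compiled and installed; if you need to compile it yourself, please refer to https://github.com/facebookresearch/detectron2/issues/5114
+The full-feature package depends on detectron2, which must be compiled and installed; if you need to compile it yourself, please refer to https://github.com/facebookresearch/detectron2/issues/5114
Alternatively, use our precompiled whl package directly (Python 3.10 only):
```bash
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/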
...
...
@@ -113,30 +112,9 @@ cp magic-pdf.template.json ~/magic-pdf.json
}
```
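-#### 4. Usage via Command Line
-###### Direct usage
-```bash
-magic-pdf pdf-command --pdf "pdf_path" --inside_model true
-```
-After the program finishes, you can find the generated markdown files in the "/tmp/magic-pdf" directory; the corresponding xxx_model.json file can be found in the markdown directory
-If you intend to do secondary development on the post-processing pipeline, you can use the command
-```bash
-magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
-```
-This way you do not need to re-run the model data, which makes debugging more convenient
-###### More usage
-```bash
-magic-pdf --help
-```
-#### 5. Acceleration Using CUDA or MPS
-###### CUDA
+#### 4. Accelerating Inference with CUDA or MPS
+If you have an available Nvidia GPU or are using a Mac with Apple Silicon, you can use CUDA or MPS for acceleration
+##### CUDA
You need to install the PyTorch version that matches your CUDA version
The following is the installation command for CUDA 11.8; for more information, see https://pytorch.org/get-started/locally/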
...
...
@@ -151,7 +129,7 @@ pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https
}
```
-###### MPS
+##### MPS
On macOS (M-series chip devices), MPS can be used to accelerate inference
You need to modify the value of "device-mode" in the configuration file magic-pdf.json
```json
...
...
@@ -161,13 +139,36 @@ pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https
```
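-#### 6. Usage via API
+### Usage Instructions
+#### 1. Usage via Command Line
+###### Direct usage
+```bash
+magic-pdf pdf-command --pdf "pdf_path" --inside_model true
+```
+After the program finishes, you can find the generated markdown files in the "/tmp/magic-pdf" directory; the corresponding xxx_model.json file can be found in the markdown directory
+If you intend to do secondary development on the post-processing pipeline, you can use the command
+```bash
+magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
+```
+This way you do not need to re-run the model data, which makes debugging more convenient
+###### More usage
+```bash
+magic-pdf --help
+```
+#### 2. Usage via API
###### Local usage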
```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
-jso_useful_key = {"_pdf_type": "", "model_list": model_json}
+jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
...
...
@@ -180,7 +181,7 @@ s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
-jso_useful_key = {"_pdf_type": "", "model_list": model_json}
+jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
...
...
demo/demo.py
@@ -12,7 +12,8 @@ try:
    pdf_path = os.path.join(current_script_dir, f"{demo_name}.pdf")
    model_path = os.path.join(current_script_dir, f"{demo_name}.json")
    pdf_bytes = open(pdf_path, "rb").read()
-    model_json = json.loads(open(model_path, "r", encoding="utf-8").read())
+    # model_json = json.loads(open(model_path, "r", encoding="utf-8").read())
+    model_json = []  # passing an empty list for model_json uses the built-in model for parsing
    jso_useful_key = {"_pdf_type": "", "model_list": model_json}
    local_image_dir = os.path.join(current_script_dir, 'images')
    image_dir = str(os.path.basename(local_image_dir))
...
...
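Review note: this hunk switches the demo from a saved model JSON to an empty `model_list`, which per the inline comment makes the built-in model do the parsing. Both modes could be kept available behind one function, as in the sketch below. `build_jso_useful_key` is a hypothetical helper, not part of the demo or of magic-pdf; the dictionary keys are the ones shown in the diff.

```python
import json


def build_jso_useful_key(model_json_path=None):
    """Build the dict passed to UNIPipe in the demo (hypothetical helper)."""
    if model_json_path is None:
        # Empty list: per the demo comment, the built-in model does the parsing.
        model_list = []
    else:
        # Reuse precomputed model output instead of re-running the model.
        with open(model_json_path, "r", encoding="utf-8") as f:
            model_list = json.load(f)
    return {"_pdf_type": "", "model_list": model_list}
```

Calling `build_jso_useful_key()` gives the new default of this commit, while passing a path restores the previous behaviour for debugging the post-processing pipeline.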