Commit 6b76f5cb authored by myhloli

update(readme): Optimizing the Installation Process

parent 1debe7fe
@@ -82,21 +82,22 @@ conda create -n MinerU python=3.10
conda activate MinerU
```
### Installation and Configuration
#### 1. Install Magic-PDF
Install the full-feature package with pip:
>Note: The pip-installed package supports CPU inference only and is best suited for quick tests.
>
>For CUDA/MPS acceleration in production, see [Acceleration Using CUDA or MPS](#4-Acceleration-Using-CUDA-or-MPS).
```bash
pip install magic-pdf[full-cpu]
```
The full-feature package depends on detectron2, which must be compiled during installation.
If you need to compile it yourself, please refer to https://github.com/facebookresearch/detectron2/issues/5114
Alternatively, you can directly use our precompiled whl package (limited to Python 3.10):
```bash
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
```
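If building detectron2 from source fails, one workable order of operations (an assumption about install ordering, not an official requirement) is to install the precompiled wheel first so that pip finds the dependency already satisfied when resolving the full-feature package:
```bash
# install the precompiled detectron2 wheel first (Python 3.10 only)
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
# then install the full-feature package; pip reuses the already-installed detectron2
pip install magic-pdf[full-cpu]
```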
@@ -123,31 +124,8 @@ In magic-pdf.json, configure "models-dir" to point to the directory where the models are stored
```
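For reference, a minimal sketch of what the "models-dir" entry in magic-pdf.json might look like; the path is a placeholder and the template's other fields are omitted:
```json
{
    "models-dir": "/path/to/downloaded/models"
}
```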
#### 4. Acceleration Using CUDA or MPS
If you have an available Nvidia GPU or are using a Mac with Apple Silicon, you can leverage acceleration with CUDA or MPS respectively.
##### CUDA
You need to install the corresponding PyTorch version according to your CUDA version.
@@ -172,13 +150,39 @@ You also need to modify the value of "device-mode" in the configuration file magic-pdf.json
}
```
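The body of that configuration block is elided in the hunk above; as a rough sketch, assuming the mode value follows the backend name ("cuda"), the change might look like:
```json
{
    "device-mode": "cuda"
}
```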
### Usage
#### 1. Usage via Command Line
###### simple
```bash
magic-pdf pdf-command --pdf "pdf_path" --inside_model true
```
After the program has finished, you can find the generated markdown files under the directory "/tmp/magic-pdf".
You can find the corresponding xxx_model.json file in the markdown directory.
If you intend to do secondary development on the post-processing pipeline, you can use the command:
```bash
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
```
This way, you won't need to re-run the model to regenerate its data, which makes debugging more convenient.
###### more
```bash
magic-pdf --help
```
#### 2. Usage via API
###### Local
```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
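# Sketch (not part of the diff): the demo script typically finishes by generating
# markdown roughly as below; the method name and arguments are assumptions and may
# differ between versions.
# md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")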
@@ -191,7 +195,7 @@ s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
......
@@ -78,19 +78,18 @@ conda activate MinerU
```
Development is based on Python 3.10; if you run into problems on other Python versions, please switch to 3.10.
### Installation and Configuration
#### 1. Install Magic-PDF
Install the full-feature package with pip:
>Note: Due to PyPI limitations, the pip-installed full-feature package supports CPU inference only and is recommended only for quickly testing parsing capabilities.
>
>For CUDA/MPS-accelerated inference in production, see [Acceleration Using CUDA or MPS](#4-使用CUDA或MPS加速推理)
```bash
pip install magic-pdf[full-cpu]
```
The full-feature package depends on detectron2, which must be compiled during installation. If you need to compile it yourself, please refer to https://github.com/facebookresearch/detectron2/issues/5114
Alternatively, you can directly use our precompiled whl package (Python 3.10 only):
```bash
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
```
@@ -113,30 +112,9 @@ cp magic-pdf.template.json ~/magic-pdf.json
}
```
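For quick reference, the configuration step from the hunk context above amounts to something like this (assuming it is run from the directory that contains magic-pdf.template.json):
```bash
# copy the bundled config template to the home directory, then edit it so that
# "models-dir" points at the directory where the model weights were downloaded
cp magic-pdf.template.json ~/magic-pdf.json
```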
#### 4. Acceleration Using CUDA or MPS
If you have an available NVIDIA GPU or are using a Mac with Apple Silicon, you can use CUDA or MPS for acceleration.
##### CUDA
You need to install the PyTorch version corresponding to your CUDA version.
The following is the installation command for CUDA 11.8; for more information, please refer to https://pytorch.org/get-started/locally/
@@ -151,7 +129,7 @@ pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https
}
```
##### MPS
On macOS devices with M-series chips, you can use MPS to accelerate inference.
You need to modify the value of "device-mode" in the configuration file magic-pdf.json.
```json
@@ -161,13 +139,36 @@ pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https
```
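The contents of that block are elided above; a rough sketch of the corresponding setting, assuming the value name mirrors the backend ("mps"):
```json
{
    "device-mode": "mps"
}
```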
### Usage
#### 1. Usage via Command Line
###### Direct usage
```bash
magic-pdf pdf-command --pdf "pdf_path" --inside_model true
```
After the program finishes, you can find the generated markdown files under the "/tmp/magic-pdf" directory, and the corresponding xxx_model.json file in the markdown directory.
If you intend to do secondary development on the post-processing pipeline, you can use the command:
```bash
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
```
This way you don't need to re-run the model, which makes debugging more convenient.
###### More usage
```bash
magic-pdf --help
```
#### 2. Usage via API
###### Local usage
```python
image_writer = DiskReaderWriter(local_image_dir)
image_dir = str(os.path.basename(local_image_dir))
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()
pipe.pipe_parse()
@@ -180,7 +181,7 @@ s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
image_dir = "s3://img_bucket/"
s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
jso_useful_key = {"_pdf_type": "", "model_list": []}
pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
pipe.pipe_classify()
pipe.pipe_parse()
@@ -268,4 +269,4 @@ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d7
<source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
<img alt="Star History Chart" src="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
</picture>
</a>
\ No newline at end of file
@@ -12,7 +12,8 @@ try:
pdf_path = os.path.join(current_script_dir, f"{demo_name}.pdf")
model_path = os.path.join(current_script_dir, f"{demo_name}.json")
pdf_bytes = open(pdf_path, "rb").read()
# model_json = json.loads(open(model_path, "r", encoding="utf-8").read())
model_json = []  # pass an empty model_json list to use the built-in models for parsing
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
local_image_dir = os.path.join(current_script_dir, 'images')
image_dir = str(os.path.basename(local_image_dir))
......