Commit cfe170b4 authored by 赵小蒙

update readme

parent 34ed90b7
<div id="top"></div> <div id="top"></div>
<div align="center"> <div align="center">
[![stars](https://img.shields.io/github/stars/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF) [![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF) [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![license](https://img.shields.io/github/license/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF/tree/main/LICENSE) [![license](https://img.shields.io/github/license/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues) [![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![open issues](https://img.shields.io/github/issues-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues) [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[English](README.md) | [简体中文](README_zh-CN.md) [English](README.md) | [简体中文](README_zh-CN.md)
...@@ -15,6 +15,15 @@ ...@@ -15,6 +15,15 @@
</div> </div>
# MinerU
## Introduction
MinerU is a one-stop, open-source data extraction tool that primarily includes the following features:
- PDF Document Extraction [Magic-PDF](#Magic-PDF)
- Webpage & E-book Extraction [Magic-Doc](#Magic-Doc)
# Magic-PDF # Magic-PDF
## Introduction ## Introduction
...@@ -49,17 +58,20 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3 ...@@ -49,17 +58,20 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3
### Submodule Repositories ### Submodule Repositories
- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
A Comprehensive Toolkit for High-Quality PDF Content Extraction
- [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark) - [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark)
An end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios
## Getting Started ## Getting Started
### Requirements ### Requirements
- Python 3.9 or newer - Python >= 3.9
### Usage Instructions ### Usage Instructions
#### 1. Install Magic-PDF #### 1. Install Magic-PDF
```bash ```bash
pip install magic-pdf pip install magic-pdf
``` ```
...@@ -67,11 +79,14 @@ pip install magic-pdf ...@@ -67,11 +79,14 @@ pip install magic-pdf
#### 2. Usage via Command Line #### 2. Usage via Command Line
###### simple ###### simple
```bash ```bash
cp magic-pdf.template.json ~/magic-pdf.json cp magic-pdf.template.json ~/magic-pdf.json
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path" magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
``` ```
###### more ###### more
```bash ```bash
magic-pdf --help magic-pdf --help
``` ```
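The command-line steps above can also be driven from a script. The following is a minimal sketch, not part of the project: the helper names `python_ok`, `install_config`, and `build_command` are hypothetical, and the only CLI arguments used are the ones shown in this README (`pdf-command`, `--pdf`, `--model`). It assumes `magic-pdf` is already installed and on `PATH`.

```python
import shutil
import sys
from pathlib import Path

MIN_PYTHON = (3, 9)  # taken from the Requirements section


def python_ok(version=sys.version_info):
    """Return True if the interpreter version meets the stated minimum."""
    return tuple(version[:2]) >= MIN_PYTHON


def install_config(template="magic-pdf.template.json", home=None):
    """Copy the template config to ~/magic-pdf.json and return the target path.

    `home` is a hypothetical override used to make the helper easy to test.
    """
    target = Path(home or Path.home()) / "magic-pdf.json"
    shutil.copy(template, target)
    return target


def build_command(pdf_path, model_json_path):
    """Build the argv list for the documented `magic-pdf pdf-command` call."""
    return [
        "magic-pdf", "pdf-command",
        "--pdf", str(pdf_path),
        "--model", str(model_json_path),
    ]
```

The returned list can be passed to `subprocess.run(build_command(pdf, model), check=True)`. Building the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.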
...@@ -112,9 +127,46 @@ Demo can be referred to [demo.py](demo/demo.py) ...@@ -112,9 +127,46 @@ Demo can be referred to [demo.py](demo/demo.py)
## License Information ## License Information
See [LICENSE.md](LICENSE.md) for details. [LICENSE.md](LICENSE.md)
The project currently relies on PyMuPDF for advanced functionality; because PyMuPDF is licensed under the AGPL, this may restrict certain use cases. In future releases, we plan to evaluate and move to a more permissively licensed PDF processing library to give users more flexibility.
## Acknowledgments ## Acknowledgments
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF) - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
# Magic-Doc
## Introduction
Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.
Key Features Include:
- Web Page Extraction
- Cross-modal precise parsing of text, images, tables, and formula information.
- E-Book Document Extraction
- Supports various document formats, including epub and mobi, with full adaptation for text and images.
- Language Type Identification
- Accurate recognition of 176 languages.
https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca
https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d
https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2
## Project Repository
- [Magic-Doc](https://github.com/magicpdf/Magic-Doc)
Outstanding Webpage and E-book Extraction Tool
<div id="top"></div> <div id="top"></div>
<div align="center"> <div align="center">
[![stars](https://img.shields.io/github/stars/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF) [![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![forks](https://img.shields.io/github/forks/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF) [![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![license](https://img.shields.io/github/license/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF/tree/main/LICENSE) [![license](https://img.shields.io/github/license/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues) [![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![open issues](https://img.shields.io/github/issues-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues) [![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[English](README.md) | [简体中文](README_zh-CN.md) [English](README.md) | [简体中文](README_zh-CN.md)
...@@ -21,8 +21,8 @@ ...@@ -21,8 +21,8 @@
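MinerU is a one-stop, open-source data extraction tool that mainly includes the following features: MinerU is a one-stop, open-source data extraction tool that mainly includes the following features:
- PDF document extraction (Magic-PDF) - PDF document extraction [Magic-PDF](#Magic-PDF)
- Webpage and e-book extraction (Magic-Doc) - Webpage and e-book extraction [Magic-Doc](#Magic-Doc)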
# Magic-PDF # Magic-PDF
...@@ -58,7 +58,7 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3 ...@@ -58,7 +58,7 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3
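### Submodule Repositories ### Submodule Repositories
- [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit) - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
A leading document analysis model A toolkit for high-quality PDF content extraction
- [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark) - [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark)
An end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios An end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios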
...@@ -67,11 +67,12 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3 ...@@ -67,11 +67,12 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3
### Requirements ### Requirements
python 3.9+ python >= 3.9
### Usage Instructions ### Usage Instructions
#### 1. Install Magic-PDF #### 1. Install Magic-PDF
```bash ```bash
pip install magic-pdf pip install magic-pdf
``` ```
...@@ -79,11 +80,14 @@ pip install magic-pdf ...@@ -79,11 +80,14 @@ pip install magic-pdf
#### 2. Usage via Command Line #### 2. Usage via Command Line
###### Direct usage ###### Direct usage
```bash ```bash
cp magic-pdf.template.json ~/magic-pdf.json cp magic-pdf.template.json ~/magic-pdf.json
magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path" magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
``` ```
###### More usage ###### More usage
```bash ```bash
magic-pdf --help magic-pdf --help
``` ```
...@@ -121,10 +125,13 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none") ...@@ -121,10 +125,13 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
[LICENSE.md](LICENSE.md) [LICENSE.md](LICENSE.md)
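This project currently uses PyMuPDF for advanced functionality; because PyMuPDF is licensed under the AGPL, this may restrict certain use cases. In future releases, we plan to evaluate and switch to a more permissively licensed PDF processing library to give users more flexibility.
## Acknowledgments ## Acknowledgments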
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF) - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
# Magic-Doc # Magic-Doc
## Introduction ## Introduction
......