Commit 17be5497 authored by 赵小蒙

update readme

parent 79d850b8
...@@ -17,6 +17,7 @@ ...@@ -17,6 +17,7 @@
# MinerU # MinerU
## Introduction ## Introduction
MinerU is a one-stop, open-source data extraction tool, primarily includes the following features: MinerU is a one-stop, open-source data extraction tool, primarily includes the following features:
...@@ -24,8 +25,10 @@ MinerU is a one-stop, open-source data extraction tool, primarily includes the f ...@@ -24,8 +25,10 @@ MinerU is a one-stop, open-source data extraction tool, primarily includes the f
- [Magic-PDF](#Magic-PDF) PDF Document Extraction - [Magic-PDF](#Magic-PDF) PDF Document Extraction
- [Magic-Doc](#Magic-Doc) Webpage & E-book Extraction - [Magic-Doc](#Magic-Doc) Webpage & E-book Extraction
# Magic-PDF # Magic-PDF
## Introduction ## Introduction
Magic-PDF is a tool designed to convert PDF documents into Markdown format, capable of processing files stored locally or on object storage supporting S3 protocol. Magic-PDF is a tool designed to convert PDF documents into Markdown format, capable of processing files stored locally or on object storage supporting S3 protocol.
...@@ -51,6 +54,7 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3 ...@@ -51,6 +54,7 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3
![Project Panorama](docs/images/project_panorama_en.png) ![Project Panorama](docs/images/project_panorama_en.png)
## Flowchart ## Flowchart
![Flowchart](docs/images/flowchart_en.png) ![Flowchart](docs/images/flowchart_en.png)
...@@ -62,6 +66,7 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3 ...@@ -62,6 +66,7 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3
- [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark) - [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark)
- An end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios - An end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios
## Getting Started ## Getting Started
### Requirements ### Requirements
...@@ -119,18 +124,21 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none") ...@@ -119,18 +124,21 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
Demo can be referred to [demo.py](demo/demo.py) Demo can be referred to [demo.py](demo/demo.py)
## All Thanks To Our Contributors ## All Thanks To Our Contributors
<a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors"> <a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
<img src="https://contrib.rocks/image?repo=magicpdf/Magic-PDF" /> <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
</a> </a>
## License Information ## License Information
[LICENSE.md](LICENSE.md) [LICENSE.md](LICENSE.md)
The project currently leverages PyMuPDF to deliver advanced functionalities; however, its adherence to the AGPL license may impose limitations on certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to enhance user-friendliness and flexibility. The project currently leverages PyMuPDF to deliver advanced functionalities; however, its adherence to the AGPL license may impose limitations on certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to enhance user-friendliness and flexibility.
## Acknowledgments ## Acknowledgments
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
...@@ -139,6 +147,7 @@ The project currently leverages PyMuPDF to deliver advanced functionalities; how ...@@ -139,6 +147,7 @@ The project currently leverages PyMuPDF to deliver advanced functionalities; how
# Magic-Doc # Magic-Doc
## Introduction ## Introduction
Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format. Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.
...@@ -166,6 +175,7 @@ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d7 ...@@ -166,6 +175,7 @@ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d7
## Project Repository ## Project Repository
- [Magic-Doc](https://github.com/magicpdf/Magic-Doc) - [Magic-Doc](https://github.com/magicpdf/Magic-Doc)
......
...@@ -17,6 +17,7 @@ ...@@ -17,6 +17,7 @@
# MinerU # MinerU
## 简介 ## 简介
MinerU 是一款一站式开源数据提取工具,主要包含以下功能: MinerU 是一款一站式开源数据提取工具,主要包含以下功能:
...@@ -26,6 +27,7 @@ MinerU 是一款一站式开源数据提取工具,主要包含以下功能: ...@@ -26,6 +27,7 @@ MinerU 是一款一站式开源数据提取工具,主要包含以下功能:
# Magic-PDF # Magic-PDF
## 简介 ## 简介
Magic-PDF 是一款将 PDF 转化为 markdown 格式的工具。支持转换本地文档或者位于支持S3协议对象存储上的文件。 Magic-PDF 是一款将 PDF 转化为 markdown 格式的工具。支持转换本地文档或者位于支持S3协议对象存储上的文件。
...@@ -121,12 +123,20 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none") ...@@ -121,12 +123,20 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
详细实现可参考 [demo.py](demo/demo.py) 详细实现可参考 [demo.py](demo/demo.py)
## 感谢我们的贡献者
<a href="https://github.com/magicpdf/Magic-PDF/graphs/contributors">
<img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
</a>
## 版权说明 ## 版权说明
[LICENSE.md](LICENSE.md) [LICENSE.md](LICENSE.md)
本项目目前采用PyMuPDF以实现高级功能,但因其遵循AGPL协议,可能对某些使用场景构成限制。未来版本迭代中,我们计划探索并替换为许可条款更为宽松的PDF处理库,以提升用户友好度及灵活性。 本项目目前采用PyMuPDF以实现高级功能,但因其遵循AGPL协议,可能对某些使用场景构成限制。未来版本迭代中,我们计划探索并替换为许可条款更为宽松的PDF处理库,以提升用户友好度及灵活性。
## 鸣谢 ## 鸣谢
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF) - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
...@@ -134,6 +144,7 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none") ...@@ -134,6 +144,7 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
# Magic-Doc # Magic-Doc
## 简介 ## 简介
Magic-Doc 是一款支持将网页或多格式电子书转换为 markdown 格式的工具。 Magic-Doc 是一款支持将网页或多格式电子书转换为 markdown 格式的工具。
...@@ -161,6 +172,7 @@ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d7 ...@@ -161,6 +172,7 @@ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d7
## 项目仓库 ## 项目仓库
- [Magic-Doc](https://github.com/magicpdf/Magic-Doc) - [Magic-Doc](https://github.com/magicpdf/Magic-Doc)
......
import regex
import unicodedata
from fast_langdetect import detect_langs
RE_BAD_CHARS = regex.compile(r"\p{Cc}|\p{Cs}")
def remove_bad_chars(text):
    return RE_BAD_CHARS.sub("", text)
def detect_lang(text: str) -> str:
    if len(text) == 0:
......
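The `RE_BAD_CHARS` pattern in this hunk targets two Unicode general categories: `Cc` (control characters) and `Cs` (surrogates). As a rough illustration of what that filter does, here is a minimal sketch that approximates it with only the standard library's `unicodedata.category`. It swaps in a per-character check for the third-party `regex` module used in the commit. The name `remove_bad_chars_stdlib` and the sample string are hypothetical and not part of the repository.

```python
import unicodedata

# Assumption: these mirror the two categories named in the commit's pattern,
# \p{Cc} (control characters) and \p{Cs} (surrogate code points).
BAD_CATEGORIES = {"Cc", "Cs"}


def remove_bad_chars_stdlib(text: str) -> str:
    """Drop every character whose Unicode general category is Cc or Cs."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in BAD_CATEGORIES
    )


# Hypothetical sample: NUL, BEL, a lone surrogate, and a trailing newline.
sample = "Mi\x00ner\x07U\ud800 ok\n"
cleaned = remove_bad_chars_stdlib(sample)
print(repr(cleaned))  # 'MinerU ok'
```

Note that `Cc` also covers `\n`, `\t`, and `\r`, so a filter of this shape removes line breaks and tabs along with NUL bytes. Whether the commit's callers rely on that is not visible in this excerpt.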