Commit 55404808 authored by drunkpig, committed by GitHub

Release 0.8.0 (#587)

* release: release 0.7.1 version (#526)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* Hotfix readme 0.7.1 (#528)

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

add HF, modelscope, colab url

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* Rename README.md to README_zh-CN.md

* Create readme.md

* Rename readme.md to README.md

* Rename README.md to README_zh-CN.md

* Update README_zh-CN.md

* Create README.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* Create download_models_hf.py

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

* Update README.md

* Update README_zh-CN.md

* Update FAQ_zh_cn.md

* Update FAQ_en_us.md

* Update FAQ_zh_cn.md

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#573)

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* Update README_zh-CN.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* add rag data api

* Update README_zh-CN.md

update rag api image

* Update README.md

docs: remove RAG related release notes

* Update README_zh-CN.md

docs: remove RAG related release notes

* Update README_zh-CN.md

update changelog

---------
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>
parent 0a8c7b35
......@@ -40,6 +40,7 @@
</div>
# Changelog
- 2024/09/09: Version 0.8.0 released, supporting fast deployment with Dockerfile, and launching demos on Huggingface and Modelscope.
- 2024/08/30: Version 0.7.1 released, added the paddle tablemaster table recognition option
- 2024/08/09: Version 0.7.0b1 released, simplified installation process, added table recognition functionality
- 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation
......@@ -353,7 +354,6 @@ TODO
- If you are processing PDFs with a large number of formulas, it is strongly recommended to enable the OCR function. When using PyMuPDF to extract text, overlapping text lines can occur, leading to inaccurate formula insertion positions.
# FAQ
[FAQ in Chinese](docs/FAQ_zh_cn.md)
......
......@@ -40,6 +40,7 @@
</div>
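# Changelog
- 2024/09/09: Version 0.8.0 released, with support for fast deployment via Dockerfile and new demos on huggingface and modelscope
- 2024/08/30: Version 0.7.1 released, integrating paddle tablemaster table recognition
- 2024/08/09: Version 0.7.0b1 released, simplifying the installation steps for ease of use and adding table recognition
- 2024/08/01: Version 0.6.2b1 released, resolving dependency conflicts and improving the installation documentation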
......@@ -356,8 +357,8 @@ TODO
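- Forcing OCR on gives better results on some formula-dense PDFs
- If you are processing PDFs with a large number of formulas, it is strongly recommended to enable OCR. When text is extracted with PyMuPDF, text lines can overlap, which makes formula insertion positions inaccurate.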
# FAQ
# FAQ
[FAQ](docs/FAQ_zh_cn.md)
......
......@@ -44,3 +44,11 @@ pip uninstall fairscale
pip install fairscale
```
Reference: https://github.com/opendatalab/MinerU/issues/411
### 6. On some newer devices such as the H100, text parsed by OCR with CUDA acceleration is garbled.
CUDA 11 has poor compatibility with newer graphics cards, so the CUDA version used by Paddle needs to be upgraded.
```bash
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
```
Reference: https://github.com/opendatalab/MinerU/issues/558
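Before reinstalling, it can help to confirm which CUDA version the installed Paddle build targets. The following is a minimal sketch, assuming `paddle.version.cuda()` is available in the installed release; the helper names `parse_cuda_major` and `paddle_cuda_version` are hypothetical and not part of MinerU or Paddle.

```python
from typing import Optional


def parse_cuda_major(version: str) -> Optional[int]:
    """Return the major CUDA version from a string such as '11.8', or None."""
    head = version.strip().split(".")[0]
    return int(head) if head.isdigit() else None


def paddle_cuda_version() -> Optional[str]:
    """Return the CUDA version string reported by Paddle, or None if unavailable."""
    try:
        import paddle  # optional dependency; may not be installed
    except ImportError:
        return None
    cuda = paddle.version.cuda()
    # CPU-only builds report a non-version value, which parse_cuda_major rejects.
    return cuda if isinstance(cuda, str) else None


if __name__ == "__main__":
    version = paddle_cuda_version()
    major = parse_cuda_major(version) if version else None
    if major is None:
        print("No CUDA-enabled Paddle build detected.")
    elif major < 12:
        print(f"Paddle targets CUDA {version}; newer GPUs may need the cu123 build.")
    else:
        print(f"Paddle targets CUDA {version}.")
```

If the reported major version is 11 on a device such as the H100, the `pip install` command above is the upgrade path this FAQ entry describes.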
......@@ -41,3 +41,11 @@ pip uninstall fairscale
pip install fairscale
```
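Reference: https://github.com/opendatalab/MinerU/issues/411
### 6. On some newer devices such as the H100, text parsed by OCR with CUDA acceleration is garbled.
CUDA 11 has poor compatibility with newer graphics cards, so the CUDA version used by paddle needs to be upgraded.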
```bash
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
```
Reference: https://github.com/opendatalab/MinerU/issues/558
from huggingface_hub import snapshot_download
model_dir = snapshot_download('opendatalab/PDF-Extract-Kit')
print(f"model dir is: {model_dir}/models")
......@@ -230,6 +230,7 @@ class CustomPEKModel:
)
# Initialize OCR
if self.apply_ocr:
# self.ocr_model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=0.3)
self.ocr_model = atom_model_manager.get_atom_model(
atom_model_name=AtomicModel.OCR,
......@@ -249,6 +250,7 @@ class CustomPEKModel:
table_max_time=self.table_max_time,
device=self.device
)
logger.info('DocAnalysis init done!')
def __call__(self, image):
......@@ -389,6 +391,7 @@ class CustomPEKModel:
latex_code = self.table_model.image2latex(new_image)[0]
else:
html_code = self.table_model.img2html(new_image)
run_time = time.time() - single_table_start_time
logger.info(f"------------table recognition processing ends within {run_time}s-----")
if run_time > self.table_max_time:
......
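The hunk above measures how long table recognition took and compares it with `table_max_time`; what happens when the limit is exceeded is not shown in this diff. As a minimal sketch of the same measure-then-compare pattern, assuming a hypothetical helper `timed_call` that is not part of MinerU:

```python
import time
from typing import Any, Callable, Tuple


def timed_call(func: Callable[[], Any], max_seconds: float) -> Tuple[Any, float, bool]:
    """Run func once and return (result, elapsed seconds, whether the budget was exceeded)."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed > max_seconds


if __name__ == "__main__":
    result, elapsed, too_slow = timed_call(lambda: sum(range(1000)), max_seconds=5.0)
    print(f"result={result}, elapsed={elapsed:.4f}s, exceeded={too_slow}")
```

Note that this only detects an overrun after the call returns; it does not interrupt a slow call.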
......@@ -5,3 +5,4 @@
- [llama_index_rag](./llama_index_rag/README.md): Build a lightweight RAG system based on llama_index
- [gradio_app](./gradio_app/README.md): Build a web app based on gradio
......@@ -4,3 +4,4 @@
- [llama_index_rag](./llama_index_rag/README_zh-CN.md): Build a lightweight RAG system based on llama_index
- [gradio_app](./gradio_app/README_zh-CN.md): Build a web app based on Gradio
## Installation
<details open="open">
<summary><h2 style="display: inline-block">Table of Contents</h2></summary>
<ol>
<li><a href="#介绍">Introduction</a></li>
<li><a href="#安装">Installation</a></li>
<li><a href="#示例">Example</a></li>
<li><a href="#开发">Development</a></li>
</ol>
</details>
MinerU
## Introduction
```bash
git clone https://github.com/opendatalab/MinerU.git
cd MinerU
`MinerU` provides a data `API` that lets users import data into a `RAG` system. This project uses `Tongyi Qianwen` to show how to build a lightweight `RAG` system.
<p align="center">
<img src="rag_data_api.png" width="300px" style="vertical-align:middle;">
</p>
## Installation
conda create -n MinerU python=3.10
conda activate MinerU
pip install .[full] --extra-index-url https://wheels.myhloli.com
Environment requirements
```text
NVIDIA A100 80GB,
Centos 7 3.10.0-957.el7.x86_64
Client: Docker Engine - Community
Version: 24.0.5
API version: 1.43
Go version: go1.20.6
Git commit: ced0996
Built: Fri Jul 21 20:39:02 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 24.0.5
API version: 1.43 (minimum version 1.12)
Go version: go1.20.6
Git commit: a61e2b4
Built: Fri Jul 21 20:38:05 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.25
GitCommit: d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f
runc:
Version: 1.1.10
GitCommit: v1.1.10-0-g18a0cb0
docker-init:
Version: 0.19.0
GitCommit: de40ad0
```
Refer to the [documentation](../../README_zh-CN.md) to install MinerU
Third-party software
```bash
# install
pip install modelscope==1.14.0
pip install llama-index-vector-stores-elasticsearch==0.2.0
pip install llama-index-embeddings-dashscope==0.2.0
pip install llama-index-core==0.10.68
......@@ -26,39 +71,13 @@ pip install accelerate==0.33.0
pip uninstall transformer-engine
```
## Environment configuration
```
export DASHSCOPE_API_KEY={some_key}
export ES_USER={some_es_user}
export ES_PASSWORD={some_es_password}
export ES_URL=http://{es_url}:9200
```
To activate a DASHSCOPE_API_KEY, refer to the [documentation](https://help.aliyun.com/zh/dashscope/opening-service)
## Usage
### Import data
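A script that depends on these four variables can fail early with a clear message when one is missing. The following is a minimal sketch of such a check; the helper name `load_rag_settings` is hypothetical, and how `data_ingestion.py` and `query.py` actually read their configuration is not shown in this diff.

```python
import os
from typing import Dict, Mapping

# Variable names taken from the export commands above.
REQUIRED_VARS = ("DASHSCOPE_API_KEY", "ES_USER", "ES_PASSWORD", "ES_URL")


def load_rag_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the required variables, raising if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


if __name__ == "__main__":
    # Demo values only; real credentials come from the shell environment.
    demo_env = {
        "DASHSCOPE_API_KEY": "demo-key",
        "ES_USER": "elastic",
        "ES_PASSWORD": "llama_index",
        "ES_URL": "http://127.0.0.1:9200",
    }
    print(load_rag_settings(demo_env)["ES_URL"])
```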
```bash
python data_ingestion.py -p some.pdf # load data from pdf
# or
python data_ingestion.py -p /opt/data/some_pdf_directory/ # load data from all PDFs under the directory {some_pdf_directory}
```
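The `-p` option accepts either a single PDF or a directory of PDFs. As a minimal sketch of how such a path could be resolved into a list of files, assuming a hypothetical helper `collect_pdfs` and a non-recursive directory scan (the actual behavior of `data_ingestion.py` is not shown in this diff):

```python
import tempfile
from pathlib import Path
from typing import List


def collect_pdfs(path: str) -> List[Path]:
    """Return the PDF itself, or every PDF directly under a directory (non-recursive)."""
    target = Path(path)
    if target.is_file():
        return [target] if target.suffix.lower() == ".pdf" else []
    if target.is_dir():
        return sorted(
            p for p in target.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
        )
    raise FileNotFoundError(f"No such file or directory: {path}")


if __name__ == "__main__":
    # Build a throwaway directory to demonstrate both call styles.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "some.pdf").write_bytes(b"")
        (Path(tmp) / "notes.txt").write_bytes(b"")
        print([p.name for p in collect_pdfs(tmp)])
        print([p.name for p in collect_pdfs(str(Path(tmp) / "some.pdf"))])
```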
### Query
```bash
python query.py --question '{the_question_you_want_to_ask}'
```
## Example
````bash
# Start the es service
cd projects/llama_index_rag
docker compose up -d
# or
......@@ -67,17 +86,41 @@ docker-compose up -d
# Configure environment variables
export ES_USER=elastic
export ES_PASSWORD=llama_index
export ES_URL=http://127.0.0.1:9200
export DASHSCOPE_API_KEY={some_key}
To activate a DASHSCOPE_API_KEY, refer to the [documentation](https://help.aliyun.com/zh/dashscope/opening-service)
# Query before importing any data. Returns the default Tongyi Qianwen answer
python query.py -q 'how about the rights of men'
## outputs
question: how about the rights of men
answer: The topic of men's rights often refers to discussions around legal, social, and political issues that affect men specifically or differently from women. Movements related to men's rights advocate for addressing areas where men face discrimination or unique challenges, such as:
Child Custody: Ensuring that men have equal opportunities for custody of their children following divorce or separation.
Domestic Violence: Recognizing that men can also be victims of domestic abuse and ensuring they have access to support services.
Mental Health and Suicide Rates: Addressing the higher rates of suicide among men and providing mental health resources.
Military Conscription: In some countries, only men are required to register for military service, which is seen as a gender-based obligation.
Workplace Safety: Historically, more men than women have been employed in high-risk occupations, leading to higher workplace injury and death rates.
Parental Leave: Advocating for paternity leave policies that allow men to take time off work for family care.
Men's rights activism often intersects with broader discussions on gender equality and aims to promote fairness and equity across genders. It's important to note that while advocating for these issues, it should be done in a way that does not detract from or oppose the goals of gender equality and the rights of other groups. The focus should be on creating a fair society where everyone has equal opportunities and protections under the law.
# Import data
python data_ingestion.py example/data/declaration_of_the_rights_of_man_1789.pdf
python data_ingestion.py -p example/data/
# or
python data_ingestion.py -p example/data/declaration_of_the_rights_of_man_1789.pdf
# Query after importing data. The Tongyi Qianwen model answers using the RAG system's retrieval results together with the context.
# Query
python query.py -q 'how about the rights of men'
## outputs
......