Release 0.8.0 (#586)

* Update README_zh-CN.md (#404) (#409)
  correct FAQ url
  Co-authored-by: sfk <18810651050@163.com>
* add dockerfile (#189)
  Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
* Update cla.yml
* Update cla.yml
* fix(ocr_mkcontent): improve language detection and content formatting (#458)
  Optimize the language detection logic to enhance content formatting. This change addresses issues with long word segmentation. Language detection now uses a threshold to determine the language of a text based on the proportion of English characters. Formatting rules for content have been updated to consider a list of languages (initially including Chinese, Japanese, and Korean) where no space is added between content segments for inline equations and text spans, improving the handling of Asian languages. The impact of these changes includes improved accuracy in language detection, better segmentation of long words, and more appropriate spacing in content formatting for multiple languages.
* fix(self_modify): merge detection boxes for optimized text region detection (#448)
  Merge adjacent and overlapping detection boxes to optimize text region detection in the document. Post-processing of text boxes is enhanced by consolidating them into larger text lines, taking into account their vertical and horizontal alignment. This improvement reduces fragmentation and improves the readability of detected text blocks.
* fix(pdf-extract): adjust box threshold for OCR detection (#447)
  Tuned the detection box threshold parameter in the OCR model initialization to improve the accuracy of text extraction from images. The threshold was modified from 0.6 to 0.3 to filter out smaller detection boxes, which is expected to enhance the quality of the extracted text by reducing noise and false positives in the OCR process.
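Note on #458: the threshold-based language check and the no-space joining rule could look roughly like the sketch below. The threshold value, the language codes, and the helper names `is_mostly_english` and `join_segments` are assumptions made for illustration; they are not taken from the ocr_mkcontent code.

```python
# Illustrative sketch only; names and values below are assumptions.
ENGLISH_RATIO_THRESHOLD = 0.5          # assumed value, not stated in the commit
NO_SPACE_LANGS = {"zh", "ja", "ko"}    # Chinese, Japanese, Korean


def is_mostly_english(text: str) -> bool:
    """Return True when ASCII letters make up at least the threshold share
    of the non-whitespace characters in `text`."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    english = sum(1 for c in chars if c.isascii() and c.isalpha())
    return english / len(chars) >= ENGLISH_RATIO_THRESHOLD


def join_segments(segments: list[str], lang: str) -> str:
    """Join text spans and inline equations, omitting the separating space
    for languages that do not use spaces between words."""
    sep = "" if lang in NO_SPACE_LANGS else " "
    return sep.join(segments)
```

With these helpers, `join_segments(["see", "$x$", "here"], "en")` keeps spaces around the inline equation, while the same call with `"zh"` concatenates the segments directly.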
* feat: rename the file generated by command line tools (#401)
  * feat: rename the file generated by command line tools
  * feat: add pdf filename as prefix to {span,layout,model}.pdf
  ---------
  Co-authored-by: icecraft <tmortred@gmail.com>
  Co-authored-by: icecraft <xurui1@pjlab.org.cn>
* fix(ocr_mkcontent): revise table caption output (#397)
  * fix(ocr_mkcontent): revise table caption output
    - Ensure that table captions are properly included in the output.
    - Remove the redundant `table_caption` variable.
  * Update cla.yml
  * Update bug_report.yml
  * feat(cli): add debug option for detailed error handling
    Enable users to invoke the CLI command with a new debug flag to get detailed debugging information.
  * fix(pdf-extract-kit): adjust crop_paste parameters for better accuracy
    The crop_paste_x and crop_paste_y values in pdf_extract_kit.py have been modified to improve the accuracy and consistency of OCR processing. The new values are set to 25 to ensure more precise image cropping and pasting, which leads to better OCR recognition results.
  * Update README_zh-CN.md (#404)
    correct FAQ url
  * Update README_zh-CN.md (#404) (#409) (#410)
    correct FAQ url
    Co-authored-by: sfk <18810651050@163.com>
  * Update FAQ_zh_cn.md
    add new issue
  * Update FAQ_en_us.md
  * Update README_Windows_CUDA_Acceleration_zh_CN.md
  * Update README_zh-CN.md
  * @Thepathakarpit has signed the CLA in opendatalab/MinerU#418
  * fix(pdf-extract-kit): increase crop_paste margin for OCR processing
    Double the crop_paste margin from 25 to 50 to ensure better OCR accuracy and handling of border cases. This change helps improve the overall quality of OCR'ed text by providing more context around the detected text areas.
  * fix(common): deep copy model list before drawing model bbox
    Use a deep copy of the original model list in `drow_model_bbox` to avoid potential modifications to the source data. This ensures the integrity of the original models is maintained while generating the model bounding boxes visualization.
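Note on the deep-copy fix: the idea is to hand the drawing code its own copy so that in-place edits never reach the caller's data. The sketch below uses a hypothetical `draw_boxes` helper and a made-up `layout_dets` structure; it does not reproduce `drow_model_bbox` itself.

```python
import copy


def draw_boxes(model_list):
    """Hypothetical drawing helper that mutates its input by scaling
    every bbox in place (a stand-in for any in-place preprocessing)."""
    for page in model_list:
        for det in page["layout_dets"]:
            det["bbox"] = [v * 2 for v in det["bbox"]]
    return model_list


# Made-up model output for illustration.
original = [{"layout_dets": [{"bbox": [1, 2, 3, 4]}]}]

# Passing a deep copy keeps nested dicts and lists of `original` untouched.
drawn = draw_boxes(copy.deepcopy(original))
```

A shallow copy (`list(original)` or `copy.copy`) would not be enough here, because the nested dictionaries would still be shared with the caller.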
  ---------
  Co-authored-by: sfk <18810651050@163.com>
  Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* build(docker): update docker build step (#471)
  * build(docker): update base image to Ubuntu 22.04 and install PaddlePaddle
    Upgrade the Docker base image from ubuntu:latest to ubuntu:22.04 for improved performance and stability. Additionally, integrate PaddlePaddle GPU version 3.0.0b1 into the Docker build for enhanced AI capabilities. The MinIO configuration file has also been updated to the latest version.
  * build(dockerfile): Updated the Dockerfile
  * build(Dockerfile): update Dockerfile
  * docs(docker): add instructions for quick deployment with Docker
    Include Docker-based deployment instructions in the README for both English and Chinese locales. This update provides users a quick-start guide to using Docker for deployment, with notes on GPU VRAM requirements and default acceleration features.
  * build(docker): Layer the installation of dependencies, downloading the model, and the setup of the program itself.
* upload an introduction about chemical formula and update readme.md (#489)
  * upload an introduction about chemical formula
  * rename 2 files
  * update readme.md at TODO in chemistry
  * rename 2 files and update readme.md at TODO in chemistry
  * update README_zh-CN.md at TODO in chemistry
* fix: remove the default value of output option in tools/cli.py and tools/cli_dev.py (#494)
  Co-authored-by: icecraft <xurui1@pjlab.org.cn>
* feat: add test case (#499)
  Co-authored-by: quyuan <quyuan@pjlab.org>
* Update cla.yml
* Update gpu-ci.yml
* Update cli.yml
* Delete .github/workflows/gpu-ci.yml
* fix(pdf-parse-union-core): #492 decrease span threshold for block filling (#500)
  Reduce the span threshold used in fill_spans_in_blocks from 0.6 to 0.3 to improve the accuracy of block filling based on layout analysis.
* fix(detect_all_bboxes): remove small overlapping blocks by merging (#501)
  Previously, small blocks that overlapped with larger ones were merely removed. This fix changes the approach to merge smaller blocks into the larger block instead, ensuring that no information is lost and the larger block encompasses all the text content fully.
* feat(cli&analyze&pipeline): add start_page and end_page args for pagination (#507)
  * feat(cli&analyze&pipeline): add start_page and end_page args for pagination
    Add start_page_id and end_page_id arguments to various components of the PDF parsing pipeline to support pagination functionality. This feature allows users to specify the range of pages to be processed, enhancing the efficiency and flexibility of the system.
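Note on #507: a page-range argument pair like `start_page_id` / `end_page_id` might be resolved as in the sketch below. The helper name `resolve_page_range`, the 0-based inclusive convention, and the clamping behaviour are assumptions for illustration, not the pipeline's actual logic.

```python
def resolve_page_range(num_pages, start_page_id=0, end_page_id=None):
    """Return the 0-based page indices to process (assumed inclusive range).

    `end_page_id=None` is taken to mean "through the last page"; values
    that are negative or past the end are clamped to the last page.
    """
    if num_pages <= 0:
        raise ValueError("document has no pages")
    last = num_pages - 1
    # Compare against None explicitly so that end_page_id == 0 stays valid
    # and selects only the first page.
    if end_page_id is None or end_page_id < 0 or end_page_id > last:
        end_page_id = last
    if start_page_id < 0 or start_page_id > end_page_id:
        raise ValueError("start_page_id is outside the selected range")
    return range(start_page_id, end_page_id + 1)
```

Checking `end_page_id is None` instead of relying on truthiness matters because `0` is falsy in Python and would otherwise be silently replaced by the last page index.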
* Feat/support rag (#510)
  * Create requirements-docker.txt
  * feat: update deps to support rag
  * feat: add support to rag, add rag_data_reader api for rag integration
  * feat: let user retrieve the filename of the processed file
  * feat: add projects demo for rag integrations
  ---------
  Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
  Co-authored-by: icecraft <xurui1@pjlab.org.cn>
* Update Dockerfile
* feat(gradio): add app by gradio (#512)
* fix: replace \u0002, \u0003 in common text (#521)
  * fix replace \u0002, \u0003 in common text
  * fix(para): When an English line ends with a hyphen, do not add a space at the end.
* fix(end_page_id): Fix the issue where end_page_id is corrected to len-1 when its input is 0. (#518)
* fix(para): When an English line ends with a hyphen, do not add a space at the end. (#523)
  * fix replace \u0002, \u0003 in common text
  * fix(para): When an English line ends with a hyphen, do not add a space at the end.
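Note on the fix(para) hyphen rule: when joining the lines of an English paragraph, a line that ends with a hyphen is followed directly by the next line, with no space inserted. The helper `join_lines` below is a hypothetical sketch of that rule only, not the para module's code.

```python
def join_lines(lines):
    """Join paragraph lines with single spaces, except after a line that
    ends with a hyphen, where the next line follows without a space."""
    texts = [ln.strip() for ln in lines if ln.strip()]
    pieces = []
    for i, text in enumerate(texts):
        pieces.append(text)
        is_last = i == len(texts) - 1
        if not is_last and not text.endswith("-"):
            pieces.append(" ")
    return "".join(pieces)
```

For example, the lines `"a well-"` and `"known result"` are joined as `"a well-known result"` rather than `"a well- known result"`.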
  * fix: delete hyphen at end of line
* Release: Release 0.7.1 version, update dev (#527)
  * feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493)
    * Update cla.yml
    * Update bug_report.yml
    * Update README_zh-CN.md (#404)
      correct FAQ url
    * Update README_zh-CN.md (#404) (#409) (#410)
      correct FAQ url
      Co-authored-by: sfk <18810651050@163.com>
    * Update FAQ_zh_cn.md
      add new issue
    * Update FAQ_en_us.md
    * Update README_Windows_CUDA_Acceleration_zh_CN.md
    * Update README_zh-CN.md
    * @Thepathakarpit has signed the CLA in opendatalab/MinerU#418
    * Update cla.yml
    * feat: add tablemaster_paddle (#463)
      * Update README_zh-CN.md (#404) (#409)
        correct FAQ url
        Co-authored-by: sfk <18810651050@163.com>
      * add dockerfile (#189)
        Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
      * Update cla.yml
      * Update cla.yml
      ---------
      Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
      Co-authored-by: sfk <18810651050@163.com>
      Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
      Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
    * <fix>(para_split_v2): index out of range issue of span_text first char (#396)
      Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
    * @Matthijz98 has signed the CLA in opendatalab/MinerU#467
    * Create download_models.py
    * Create requirements-docker.txt
    * feat<table model>: add tablemaster with paddleocr to detect and recognize table
    * @strongerfly has signed the CLA in opendatalab/MinerU#487
  * feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508)
    * Update cla.yml
    * Delete .github/workflows/gpu-ci.yml
    * Update Huggingface and ModelScope links to organization account
    * feat<table model>: add tablemaster with paddleocr to detect and recognize table
  * feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511)
    * feat<table model>: add tablemaster with paddleocr to detect and recognize table
  ---------
  Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
  Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
  Co-authored-by: sfk <18810651050@163.com>
  Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
  Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
  Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
  Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
  Co-authored-by: wangbinDL <wangbin_research@163.com>
* Hotfix readme 0.7.1 (#529)
  * release: release 0.7.1 version (#526)
  * Update README.md
  * Update README_zh-CN.md
  * Update README_zh-CN.md
  ---------
  Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
  Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
  Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
  Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
  Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
  Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
  Co-authored-by: wangbinDL <wangbin_research@163.com>
* Update README_zh-CN.md
  delete Known issue about table recognition
* Update Dockerfile
* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#542)
  * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination
  * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384
* fix: typo error in markdown (#536)
  Co-authored-by: icecraft <xurui1@pjlab.org.cn>
* fix(gradio): remove unused imports and simplify pdf display (#534)
  Removed the previously used gradio and gradio-pdf imports which were not leveraged in the code. Also, replaced the custom `show_pdf` function with direct use of the `PDF` component from gradio for a simpler and more integrated PDF upload and display solution, improving code maintainability and readability.
* Feat/support footnote in figure (#532)
  * feat: support figure footnote
  * feat: using the relative position to combine footnote, table, image
  * feat: add the readme of projects
  * fix: code spell in unittest
  ---------
  Co-authored-by: icecraft <xurui1@pjlab.org.cn>
* refactor(pdf_extract_kit): implement singleton pattern for atomic models (#533)
  Refactor the pdf_extract_kit module to utilize a singleton pattern when initializing atomic models. This change ensures that atomic models are instantiated at most once, optimizing memory usage and reducing redundant initialization steps. The AtomModelSingleton class now manages the instantiation and retrieval of atomic models, improving the overall structure and efficiency of the codebase.
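Note on #533: a singleton that creates each atomic model at most once can be sketched as below. Only the class name `AtomModelSingleton` comes from the commit text; the `get_atom_model` signature, the `factory` argument, and the `load_ocr` stand-in are assumptions for illustration and may differ from pdf_extract_kit.

```python
class AtomModelSingleton:
    """Process-wide registry that builds each named model at most once."""

    _instance = None
    _models = {}

    def __new__(cls):
        # Always hand back the same instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_atom_model(self, name, factory):
        """Return the cached model for `name`, building it on first use.

        `factory` is a hypothetical zero-argument callable that creates
        the model.
        """
        if name not in self._models:
            self._models[name] = factory()
        return self._models[name]


calls = []


def load_ocr():
    """Stand-in for an expensive model constructor."""
    calls.append("ocr")
    return object()


first = AtomModelSingleton().get_atom_model("ocr", load_ocr)
second = AtomModelSingleton().get_atom_model("ocr", load_ocr)
```

Both lookups return the same object and `load_ocr` runs only once, which is the memory and start-up saving the refactor describes. This sketch is not thread-safe; concurrent callers would need a lock around the first construction.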
* Update README.md
* Update README_zh-CN.md
* Update README_zh-CN.md
  add HF, modelscope, colab url
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update README_zh-CN.md
* Rename README.md to README_zh-CN.md
* Create readme.md
* Rename readme.md to README.md
* Rename README.md to README_zh-CN.md
* Update README_zh-CN.md
* Create README.md
* Update README.md
* Update README.md
* Update README.md
* Update README_zh-CN.md
* Update README.md
* Update README_zh-CN.md
* Update README_zh-CN.md
* Update README.md
* Update README_zh-CN.md
* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#573)
  * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination
  * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384
* Update README_zh-CN.md
* Update README.md
* Update README.md
* Update README.md
* Update README_zh-CN.md
  add rag data api
* Update README_zh-CN.md
  update rag api image
* Update README.md
  docs: remove RAG related release notes
* Update README_zh-CN.md
  docs: remove RAG related release notes
* Update README_zh-CN.md
  update changelog

---------
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: icecraft <tmortred@163.com>
Co-authored-by: icecraft <tmortred@gmail.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Siyu Hao <131659128+GDDGCZ518@users.noreply.github.com>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: quyuan <quyuan@pjlab.org>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>

Realese 0.8.0 (#586)
* Update README_zh-CN.md (#404) (#409) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * add dockerfile (#189) Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> * Update cla.yml * Update cla.yml * fix(ocr_mkcontent): improve language detection and content formatting (#458) Optimize the language detection logic to enhance content formatting. This change addresses issues with long word segmentation. Language detection now uses a threshold to determine the language of a text based on the proportion of English characters. Formatting rules for content have been updated to consider a list of languages (initially including Chinese, Japanese, and Korean) where no space is added between content segments for inline equations and text spans, improving the handling of Asian languages. The impact of these changes includes improved accuracy in language detection, better segmentation of long words, and more appropriate spacing in content formatting for multiple languages. * fix(self_modify): merge detection boxes for optimized text region detection (#448) Merge adjacent and overlapping detection boxes to optimize text region detection in the document. Post processing of text boxes is enhanced by consolidating them into larger text lines, taking into account their vertical and horizontal alignment. This improvement reduces fragmentation and improves the readability of detected text blocks. * fix(pdf-extract): adjust box threshold for OCR detection (#447) Tuned the detection box threshold parameter in the OCR model initialization to improve the accuracy of text extraction from images. The threshold was modified from 0.6 to 0.3 to filter out smaller detection boxes, which is expected to enhance the quality of the extracted text by reducing noise and false positives in the OCR process. 
* feat: rename the file generated by command line tools (#401) * feat: rename the file generated by command line tools * feat: add pdf filename as prefix to {span,layout,model}.pdf --------- Co-authored-by: icecraft <tmortred@gmail.com> Co-authored-by: icecraft <xurui1@pjlab.org.cn> * fix(ocr_mkcontent): revise table caption output (#397) * fix(ocr_mkcontent): revise table caption output - Ensuring that table captions are properly included in the output. - Remove the redundant `table_caption` variable。 * Update cla.yml * Update bug_report.yml * feat(cli): add debug option for detailed error handling Enable users to invoke the CLI command with a new debug flag to get detailed debugging information. * fix(pdf-extract-kit): adjust crop_paste parameters for better accuracyThe crop_paste_x and crop_paste_y values in the pdf_extract_kit.py have been modified to improve the accuracy and consistency of OCR processing. The new values are set to 25 to ensure more precise image cropping and pasting which leads to better OCR recognition results. * Update README_zh-CN.md (#404) correct FAQ url * Update README_zh-CN.md (#404) (#409) (#410) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * Update FAQ_zh_cn.md add new issue * Update FAQ_en_us.md * Update README_Windows_CUDA_Acceleration_zh_CN.md * Update README_zh-CN.md * @Thepathakarpit has signed the CLA in opendatalab/MinerU#418 * fix(pdf-extract-kit): increase crop_paste margin for OCR processingDouble the crop_paste margin from25 to 50 to ensure better OCR accuracy and handling of border cases. This change will help in improving the overall quality of OCR'ed text by providing more context around the detected text areas. * fix(common): deep copy model list before drawing model bbox Use a deep copy of the original model list in `drow_model_bbox` to avoid potential modifications to the source data. This ensures the integrity of the original models is maintained while generating the model bounding boxes visualization. 
--------- Co-authored-by: sfk <18810651050@163.com> Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * build(docker): update docker build step (#471) * build(docker): update base image to Ubuntu 22.04 and install PaddlePaddle Upgrade the Docker base image from ubuntu:latest to ubuntu:22.04 for improved performance and stability. Additionally, integrate PaddlePaddle GPU version 3.0.0b1 into the Docker build for enhanced AI capabilities. The MinIO configuration file has also been updated to the latest version. * build(dockerfile): Updated the Dockerfile * build(Dockerfile): update Dockerfile * docs(docker): add instructions for quick deployment with Docker Include Docker-based deployment instructions in the README for both English and Chinese locales. This update provides users a quick-start guide to using Docker for deployment, with notes on GPU VRAM requirements and default acceleration features. * build(docker): Layer the installation of dependencies, downloading the model, and the setup of the program itself. * build(docker): Layer the installation of dependencies, downloading the model, and the setup of the program itself. 
* upload an introduction about chemical formula and update readme.md (#489) * upload an introduction about chemical formula * rename 2 files * update readme.md at TODO in chemistry * rename 2 files and update readme.md at TODO in chemistry * update README_zh-CN.md at TODO in chemistry * upload an introduction about chemical formula and update readme.md (#489) * upload an introduction about chemical formula * rename 2 files * update readme.md at TODO in chemistry * rename 2 files and update readme.md at TODO in chemistry * update README_zh-CN.md at TODO in chemistry * fix: remove the default value of output option in tools/cli.py and tools/cli_dev.py (#494) Co-authored-by: icecraft <xurui1@pjlab.org.cn> * feat: add test case (#499) Co-authored-by: quyuan <quyuan@pjlab.org> * Update cla.yml * Update gpu-ci.yml * Update cli.yml * Delete .github/workflows/gpu-ci.yml * fix(pdf-parse-union-core): #492 decrease span threshold for block filling (#500) Reduce the span threshold used in fill_spans_in_blocks from 0.6 to 0.3 to improve the accuracy of block filling based on layout analysis. * fix(detect_all_bboxes): remove small overlapping blocks by merging (#501) Previously, small blocks that overlapped with larger ones were merely removed. This fix changes the approach to merge smaller blocks into the larger block instead, ensuring that no information is lost and the larger block encompasses all the text content fully. * feat(cli&analyze&pipeline): add start_page and end_page args for pagination (#507) * feat(cli&analyze&pipeline): add start_page and end_page args for pagination Add start_page_id and end_page_id arguments to various components of the PDF parsing pipeline to support pagination functionality. This feature allows users to specify the range of pages to be processed, enhancing the efficiency and flexibility of the system. 
* feat(cli&analyze&pipeline): add start_page and end_page args for pagination Add start_page_id and end_page_id arguments to various components of the PDF parsing pipeline to support pagination functionality. This feature allows users to specify the range of pages to be processed, enhancing the efficiency and flexibility of the system. * feat(cli&analyze&pipeline): add start_page and end_page args for pagination Add start_page_id and end_page_id arguments to various components of the PDF parsing pipeline to support pagination functionality. This feature allows users to specify the range of pages to be processed, enhancing the efficiency and flexibility of the system. * Feat/support rag (#510) * Create requirements-docker.txt * feat: update deps to support rag * feat: add support to rag, add rag_data_reader api for rag integration * feat: let user retrieve the filename of the processed file * feat: add projects demo for rag integrations --------- Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> Co-authored-by: icecraft <xurui1@pjlab.org.cn> * Update Dockerfile * feat(gradio): add app by gradio (#512) * fix: replace \u0002, \u0003 in common text (#521) * fix replace \u0002, \u0003 in common text * fix(para): When an English line ends with a hyphen, do not add a space at the end. * fix(end_page_id): Fix the issue where end_page_id is corrected to len-1 when its input is 0. (#518) * fix(para): When an English line ends with a hyphen, do not add a space at the end. (#523) * fix replace \u0002, \u0003 in common text * fix(para): When an English line ends with a hyphen, do not add a space at the end. 
* fix: delete hyphen at end of line * Release: Release 0.7.1 verison, update dev (#527) * feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493) * Update cla.yml * Update bug_report.yml * Update README_zh-CN.md (#404) correct FAQ url * Update README_zh-CN.md (#404) (#409) (#410) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * Update FAQ_zh_cn.md add new issue * Update FAQ_en_us.md * Update README_Windows_CUDA_Acceleration_zh_CN.md * Update README_zh-CN.md * @Thepathakarpit has signed the CLA in opendatalab/MinerU#418 * Update cla.yml * feat: add tablemaster_paddle (#463) * Update README_zh-CN.md (#404) (#409) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * add dockerfile (#189) Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> * Update cla.yml * Update cla.yml --------- Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: sfk <18810651050@163.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> * <fix>(para_split_v2): index out of range issue of span_text first char (#396) Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> * @Matthijz98 has signed the CLA in opendatalab/MinerU#467 * Create download_models.py * Create requirements-docker.txt * feat<table model>: add tablemaster with paddleocr to detect and recognize table * @strongerfly has signed the CLA in opendatalab/MinerU#487 * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table --------- Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> Co-authored-by: sfk <18810651050@163.com> Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: 
github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> * feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508) * Update cla.yml * Update bug_report.yml * Update README_zh-CN.md (#404) correct FAQ url * Update README_zh-CN.md (#404) (#409) (#410) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * Update FAQ_zh_cn.md add new issue * Update FAQ_en_us.md * Update README_Windows_CUDA_Acceleration_zh_CN.md * Update README_zh-CN.md * @Thepathakarpit has signed the CLA in opendatalab/MinerU#418 * Update cla.yml * feat: add tablemaster_paddle (#463) * Update README_zh-CN.md (#404) (#409) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * add dockerfile (#189) Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> * Update cla.yml * Update cla.yml --------- Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: sfk <18810651050@163.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> * <fix>(para_split_v2): index out of range issue of span_text first char (#396) Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> * @Matthijz98 has signed the CLA in opendatalab/MinerU#467 * Create download_models.py * Create requirements-docker.txt * feat<table model>: add tablemaster with paddleocr to detect and recognize table * @strongerfly has signed the CLA in opendatalab/MinerU#487 * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * Update cla.yml * Delete .github/workflows/gpu-ci.yml * Update Huggingface and ModelScope links to 
organization account * feat<table model>: add tablemaster with paddleocr to detect and recognize table --------- Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> Co-authored-by: sfk <18810651050@163.com> Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com> Co-authored-by: wangbinDL <wangbin_research@163.com> * feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511) * Update cla.yml * Update bug_report.yml * Update README_zh-CN.md (#404) correct FAQ url * Update README_zh-CN.md (#404) (#409) (#410) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * Update FAQ_zh_cn.md add new issue * Update FAQ_en_us.md * Update README_Windows_CUDA_Acceleration_zh_CN.md * Update README_zh-CN.md * @Thepathakarpit has signed the CLA in opendatalab/MinerU#418 * Update cla.yml * feat: add tablemaster_paddle (#463) * Update README_zh-CN.md (#404) (#409) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * add dockerfile (#189) Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> * Update cla.yml * Update cla.yml --------- Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: sfk <18810651050@163.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> * <fix>(para_split_v2): index out of range issue of span_text first char (#396) Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> * @Matthijz98 has signed the CLA in opendatalab/MinerU#467 * Create download_models.py * Create requirements-docker.txt * feat<table model>: add tablemaster with paddleocr to detect and recognize table * @strongerfly has signed the CLA in opendatalab/MinerU#487 * feat<table model>: add 
tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * Update cla.yml * Delete .github/workflows/gpu-ci.yml * Update Huggingface and ModelScope links to organization account * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table --------- Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> Co-authored-by: sfk <18810651050@163.com> Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com> Co-authored-by: wangbinDL <wangbin_research@163.com> --------- Co-authored-by: Kaiwen Liu <lkw_buaa@163.com> Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> Co-authored-by: sfk <18810651050@163.com> Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> Co-authored-by: wangbinDL <wangbin_research@163.com> * Hotfix readme 0.7.1 (#529) * release: release 0.7.1 version (#526) * Update README_zh-CN.md (#404) (#409) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * add dockerfile (#189) Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> * Update cla.yml * Update cla.yml * feat<table model>: add tablemaster with paddleocr to detect and recognize 
table (#493) * Update cla.yml * Update bug_report.yml * Update README_zh-CN.md (#404) correct FAQ url * Update README_zh-CN.md (#404) (#409) (#410) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * Update FAQ_zh_cn.md add new issue * Update FAQ_en_us.md * Update README_Windows_CUDA_Acceleration_zh_CN.md * Update README_zh-CN.md * @Thepathakarpit has signed the CLA in opendatalab/MinerU#418 * Update cla.yml * feat: add tablemaster_paddle (#463) * Update README_zh-CN.md (#404) (#409) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * add dockerfile (#189) Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> * Update cla.yml * Update cla.yml --------- Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: sfk <18810651050@163.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> * <fix>(para_split_v2): index out of range issue of span_text first char (#396) Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> * @Matthijz98 has signed the CLA in opendatalab/MinerU#467 * Create download_models.py * Create requirements-docker.txt * feat<table model>: add tablemaster with paddleocr to detect and recognize table * @strongerfly has signed the CLA in opendatalab/MinerU#487 * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table --------- Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> Co-authored-by: sfk <18810651050@163.com> Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: liukaiwen 
<liukaiwen@pjlab.org.cn> * feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508) * Update cla.yml * Update bug_report.yml * Update README_zh-CN.md (#404) correct FAQ url * Update README_zh-CN.md (#404) (#409) (#410) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * Update FAQ_zh_cn.md add new issue * Update FAQ_en_us.md * Update README_Windows_CUDA_Acceleration_zh_CN.md * Update README_zh-CN.md * @Thepathakarpit has signed the CLA in opendatalab/MinerU#418 * Update cla.yml * feat: add tablemaster_paddle (#463) * Update README_zh-CN.md (#404) (#409) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * add dockerfile (#189) Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> * Update cla.yml * Update cla.yml --------- Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: sfk <18810651050@163.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> * <fix>(para_split_v2): index out of range issue of span_text first char (#396) Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> * @Matthijz98 has signed the CLA in opendatalab/MinerU#467 * Create download_models.py * Create requirements-docker.txt * feat<table model>: add tablemaster with paddleocr to detect and recognize table * @strongerfly has signed the CLA in opendatalab/MinerU#487 * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * Update cla.yml * Delete .github/workflows/gpu-ci.yml * Update Huggingface and ModelScope links to organization account * feat<table model>: add tablemaster with paddleocr to detect and recognize table --------- Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> 
Co-authored-by: sfk <18810651050@163.com> Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com> Co-authored-by: wangbinDL <wangbin_research@163.com> * feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511) * Update cla.yml * Update bug_report.yml * Update README_zh-CN.md (#404) correct FAQ url * Update README_zh-CN.md (#404) (#409) (#410) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * Update FAQ_zh_cn.md add new issue * Update FAQ_en_us.md * Update README_Windows_CUDA_Acceleration_zh_CN.md * Update README_zh-CN.md * @Thepathakarpit has signed the CLA in opendatalab/MinerU#418 * Update cla.yml * feat: add tablemaster_paddle (#463) * Update README_zh-CN.md (#404) (#409) correct FAQ url Co-authored-by: sfk <18810651050@163.com> * add dockerfile (#189) Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> * Update cla.yml * Update cla.yml --------- Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: sfk <18810651050@163.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> * <fix>(para_split_v2): index out of range issue of span_text first char (#396) Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> * @Matthijz98 has signed the CLA in opendatalab/MinerU#467 * Create download_models.py * Create requirements-docker.txt * feat<table model>: add tablemaster with paddleocr to detect and recognize table * @strongerfly has signed the CLA in opendatalab/MinerU#487 * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add 
tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * Update cla.yml * Delete .github/workflows/gpu-ci.yml * Update Huggingface and ModelScope links to organization account * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table * feat<table model>: add tablemaster with paddleocr to detect and recognize table --------- Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> Co-authored-by: sfk <18810651050@163.com> Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com> Co-authored-by: wangbinDL <wangbin_research@163.com> --------- Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: sfk <18810651050@163.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> Co-authored-by: Kaiwen Liu <lkw_buaa@163.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> Co-authored-by: wangbinDL <wangbin_research@163.com> * Update README.md * Update README_zh-CN.md * Update README_zh-CN.md --------- Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com> Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> Co-authored-by: Kaiwen Liu <lkw_buaa@163.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> Co-authored-by: wangbinDL 
<wangbin_research@163.com> * Update README_zh-CN.md delete Known issue about table recognition * Update Dockerfile * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#542) * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 * fix: typo error in markdown (#536) Co-authored-by: icecraft <xurui1@pjlab.org.cn> * fix(gradio): remove unused imports and simplify pdf display (#534) Removed the previously used gradio and gradio-pdf imports which were not leveraged in the code. Also, replaced the custom `show_pdf` function with direct use of the `PDF` component from gradio for a simpler and more integrated PDF upload and display solution, improving code maintainability and readability. * Feat/support footnote in figure (#532) * feat: support figure footnote * feat: using the relative position to combine footnote, table, image * feat: add the readme of projects * fix: code spell in unittest --------- Co-authored-by: icecraft <xurui1@pjlab.org.cn> * refactor(pdf_extract_kit): implement singleton pattern for atomic models (#533) Refactor the pdf_extract_kit module to utilize a singleton pattern when initializing atomic models. This change ensures that atomic models are instantiated at most once, optimizing memory usage and reducing redundant initialization steps. The AtomModelSingleton class now manages the instantiation and retrieval of atomic models, improving the overall structure and efficiency of the codebase. 
* Update README.md * Update README_zh-CN.md * Update README_zh-CN.md add HF、modelscope、colab url * Update README.md * Update README.md * Update README.md * Update README.md * Update README_zh-CN.md * Rename README.md to README_zh-CN.md * Create readme.md * Rename readme.md to README.md * Rename README.md to README_zh-CN.md * Update README_zh-CN.md * Create README.md * Update README.md * Update README.md * Update README.md * Update README_zh-CN.md * Update README.md * Update README_zh-CN.md * Update README_zh-CN.md * Update README.md * Update README_zh-CN.md * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#573) * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 * fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 * Update README_zh-CN.md * Update README.md * Update README.md * Update README.md * Update README_zh-CN.md * add rag data api * Update README_zh-CN.md update rag api image * Update README.md docs: remove RAG related release notes * Update README_zh-CN.md docs: remove RAG related release notes * Update README_zh-CN.md update 更新记录 --------- Co-authored-by: sfk <18810651050@163.com> Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn> Co-authored-by: Xiaomeng Zhao <moe@myhloli.com> Co-authored-by: icecraft <tmortred@163.com> Co-authored-by: icecraft <tmortred@gmail.com> Co-authored-by: icecraft <xurui1@pjlab.org.cn> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Siyu Hao <131659128+GDDGCZ518@users.noreply.github.com> Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com> Co-authored-by: quyuan <quyuan@pjlab.org> Co-authored-by: Kaiwen 
Liu <lkw_buaa@163.com> Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn> Co-authored-by: wangbinDL <wangbin_research@163.com>
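The language-detection change in #458 is described above only in prose: a threshold on the proportion of English characters decides the language, and segments are joined without spaces for Chinese, Japanese, and Korean. The sketch below illustrates that idea under stated assumptions. The names `detect_lang`, `join_segments`, and `NO_SPACE_LANGS`, the 0.5 threshold, and the Unicode-range checks are all hypothetical and are not taken from the MinerU source.

```python
# Hypothetical sketch; names, threshold, and ranges are assumptions, not MinerU code.
NO_SPACE_LANGS = {"zh", "ja", "ko"}  # languages joined without spaces
ENGLISH_RATIO_THRESHOLD = 0.5        # assumed value; the commit message gives none


def detect_lang(text: str, threshold: float = ENGLISH_RATIO_THRESHOLD) -> str:
    """Classify text by the share of ASCII letters among its non-space characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return "unknown"
    english = sum(1 for c in chars if c.isascii() and c.isalpha())
    if english / len(chars) >= threshold:
        return "en"
    # Crude fallback checks on Unicode ranges (kana, Hangul syllables, CJK ideographs).
    if any("\u3040" <= c <= "\u30ff" for c in chars):
        return "ja"
    if any("\uac00" <= c <= "\ud7a3" for c in chars):
        return "ko"
    if any("\u4e00" <= c <= "\u9fff" for c in chars):
        return "zh"
    return "unknown"


def join_segments(segments: list[str], lang: str) -> str:
    """Join text spans and inline equations, omitting the space for CJK languages."""
    sep = "" if lang in NO_SPACE_LANGS else " "
    return sep.join(s.strip() for s in segments if s.strip())


if __name__ == "__main__":
    parts = ["质能方程", "$E=mc^2$", "由爱因斯坦提出"]
    print(join_segments(parts, detect_lang("".join(parts))))
```

A ratio test of this kind avoids classifying a mostly Chinese line as English because it contains a few Latin identifiers, which is presumably why a threshold is used rather than a first-character check.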
Unverified Commit 9f352df0 authored Sep 10, 2024 by drunkpig Committed by GitHub Sep 10, 2024
78 changed files
--- a/.github/workflows/cli.yml
+++ b/.github/workflows/cli.yml
@@ -6,20 +6,24 @@ on:
  push:
    branches:
      - "master"
+      - "dev"
    paths-ignore:
      - "cmds/**"
      - "**.md"
+      - "**.yml"
  pull_request:
    branches:
      - "master"
+      - "dev"
    paths-ignore:
      - "cmds/**"
      - "**.md"
+      - "**.yml"
  workflow_dispatch:
 jobs:
  cli-test:
-    runs-on: ubuntu-latest
-    timeout-minutes: 40
+    runs-on: pdf
+    timeout-minutes: 120
    strategy:
      fail-fast: true

@@ -28,27 +32,22 @@ jobs:
      uses: actions/checkout@v3
      with:
        fetch-depth: 2
-      
-    - name: check-requirements
-      run: |
-        pip install -r requirements.txt
-        pip install -r requirements-qa.txt
-        pip install magic-pdf
-    - name: test_cli
+
+    - name: install
      run: |
-        cp magic-pdf.template.json ~/magic-pdf.json
-        echo $GITHUB_WORKSPACE
-        cd $GITHUB_WORKSPACE && export PYTHONPATH=. && pytest -s -v tests/test_unit.py
-        cd $GITHUB_WORKSPACE &&  pytest -s -v tests/test_cli/test_cli.py
-                                                                                                                            
-    - name: benchmark
+        echo $GITHUB_WORKSPACE && sh tests/retry_env.sh
+    - name: unit test
+      run: |        
+        cd $GITHUB_WORKSPACE && export PYTHONPATH=. && coverage run -m  pytest  tests/test_unit.py --cov=magic_pdf/ --cov-report term-missing --cov-report html
+        cd $GITHUB_WORKSPACE && python tests/get_coverage.py
+    - name: cli test
      run: |
-        cd $GITHUB_WORKSPACE &&  pytest -s -v tests/test_cli/test_bench.py
+        cd $GITHUB_WORKSPACE &&  pytest -s -v tests/test_cli/test_cli_sdk.py

  notify_to_feishu:
    if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure') && (github.ref_name == 'master') }}
-    needs: [cli-test]
-    runs-on: ubuntu-latest
+    needs: cli-test
+    runs-on: pdf
    steps:
    - name: get_actor
      run: |
@@ -67,9 +66,5 @@ jobs:

    - name: notify
      run: |
-        curl  ${{ secrets.WEBHOOK_URL }} -H 'Content-Type: application/json'  -d '{
-        "msgtype": "text",
-        "text": {
-            "mentioned_list": ["${{ env.METIONS }}"] , "content": "'${{ github.repository }}' GitHubAction Failed!\n 细节请查看：https://github.com/'${{ github.repository }}'/actions/runs/'${GITHUB_RUN_ID}'"
-        }
-        }'   
\ No newline at end of file
+        echo ${{ secrets.USER_ID }}
+        curl -X POST -H "Content-Type: application/json" -d '{"msg_type":"post","content":{"post":{"zh_cn":{"title":"'${{ github.repository }}' GitHubAction Failed","content":[[{"tag":"text","text":""},{"tag":"a","text":"Please click here for details ","href":"https://github.com/'${{ github.repository }}'/actions/runs/'${GITHUB_RUN_ID}'"},{"tag":"at","user_id":"'${{ secrets.USER_ID }}'"}]]}}}}'  ${{ secrets.WEBHOOK_URL }}
--- a/.gitignore
+++ b/.gitignore
@@ -30,10 +30,10 @@ tmp/
 tmp
 .vscode
 .vscode/
-/tests/
 ocr_demo

 /app/common/__init__.py
 /magic_pdf/config/__init__.py
 source.dev.env

+tmp
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -3,6 +3,7 @@ repos:
    rev: 5.0.4
    hooks:
      - id: flake8
+        args: ["--max-line-length=120", "--ignore=E131,E125,W503,W504,E203"]
  - repo: https://github.com/PyCQA/isort
    rev: 5.11.5
    hooks:
@@ -11,6 +12,7 @@ repos:
    rev: v0.32.0
    hooks:
      - id: yapf
+        args: ["--style={based_on_style: google, column_limit: 120, indent_width: 4}"]
  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.1
    hooks:
@@ -41,4 +43,4 @@ repos:
    rev: v1.3.1
    hooks:
      - id: docformatter
-        args: ["--in-place", "--wrap-descriptions", "79"]
+        args: ["--in-place", "--wrap-descriptions", "119"]
--- a/Dockerfile
+++ b/Dockerfile
 # Use the official Ubuntu base image
-FROM ubuntu:latest
+FROM ubuntu:22.04

 # Set environment variables to non-interactive to avoid prompts during installation
 ENV DEBIAN_FRONTEND=noninteractive
@@ -29,17 +29,23 @@ RUN python3 -m venv /opt/mineru_venv

 # Activate the virtual environment and install necessary Python packages
 RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
-    pip install --upgrade pip && \
-    pip install magic-pdf[full-cpu] detectron2 --extra-index-url https://myhloli.github.io/wheels/"
-
-# Copy the configuration file template and set up the model directory
-COPY magic-pdf.template.json /root/magic-pdf.json
-
-# Set the models directory in the configuration file (adjust the path as needed)
-RUN sed -i 's|/tmp/models|/opt/models|g' /root/magic-pdf.json
-
-# Create the models directory
-RUN mkdir -p /opt/models
+    pip3 install --upgrade pip && \
+    wget https://gitee.com/myhloli/MinerU/raw/master/requirements-docker.txt && \
+    pip3 install -r requirements-docker.txt --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple && \
+    pip3 install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/"
+
+# Copy the configuration file template and install magic-pdf latest
+RUN /bin/bash -c "wget https://gitee.com/myhloli/MinerU/raw/master/magic-pdf.template.json && \
+    cp magic-pdf.template.json /root/magic-pdf.json && \
+    source /opt/mineru_venv/bin/activate && \
+    pip3 install -U magic-pdf"
+
+# Download models and update the configuration file
+RUN /bin/bash -c "pip3 install modelscope && \
+    wget https://gitee.com/myhloli/MinerU/raw/master/docs/download_models.py && \
+    python3 download_models.py && \
+    sed -i 's|/tmp/models|/root/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models|g' /root/magic-pdf.json && \
+    sed -i 's|cpu|cuda|g' /root/magic-pdf.json"

 # Set the entry point to activate the virtual environment and run the command line tool
 ENTRYPOINT ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]
--- a/README.md
+++ b/README.md
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
--- a/README_zh-CN.md.bak
+++ b/README_zh-CN.md.bak
--- a/app.py
+++ b/app.py
+# Copyright (c) Opendatalab. All rights reserved.
+
+import base64
+import os
+import time
+import zipfile
+from pathlib import Path
+import re
+
+from loguru import logger
+
+from magic_pdf.libs.hash_utils import compute_sha256
+from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
+from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
+from magic_pdf.tools.common import do_parse, prepare_env
+
+os.system("pip install gradio")
+os.system("pip install gradio-pdf")
+import gradio as gr
+from gradio_pdf import PDF
+
+
+def read_fn(path):
+    disk_rw = DiskReaderWriter(os.path.dirname(path))
+    return disk_rw.read(os.path.basename(path), AbsReaderWriter.MODE_BIN)
+
+
+def parse_pdf(doc_path, output_dir, end_page_id):
+    os.makedirs(output_dir, exist_ok=True)
+
+    try:
+        file_name = f"{str(Path(doc_path).stem)}_{time.time()}"
+        pdf_data = read_fn(doc_path)
+        parse_method = "auto"
+        local_image_dir, local_md_dir = prepare_env(output_dir, file_name, parse_method)
+        do_parse(
+            output_dir,
+            file_name,
+            pdf_data,
+            [],
+            parse_method,
+            False,
+            end_page_id=end_page_id,
+        )
+        return local_md_dir, file_name
+    except Exception as e:
+        logger.exception(e)
+
+
+def compress_directory_to_zip(directory_path, output_zip_path):
+    """
+    Compress the given directory into a ZIP file.
+
+    :param directory_path: path of the directory to compress
+    :param output_zip_path: path of the output ZIP file
+    """
+    try:
+        with zipfile.ZipFile(output_zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
+
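+            # Walk every file and subdirectory under the directory
+            for root, dirs, files in os.walk(directory_path):
+                for file in files:
+                    # Build the full file path
+                    file_path = os.path.join(root, file)
+                    # Compute the path relative to the directory being compressed
+                    arcname = os.path.relpath(file_path, directory_path)
+                    # Add the file to the ZIP archive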
+                    zipf.write(file_path, arcname)
+        return 0
+    except Exception as e:
+        logger.exception(e)
+        return -1
+
+
+def image_to_base64(image_path):
+    with open(image_path, "rb") as image_file:
+        return base64.b64encode(image_file.read()).decode('utf-8')
+
+
+def replace_image_with_base64(markdown_text, image_dir_path):
+    # Match image tags in the Markdown text
+    pattern = r'\!\[(?:[^\]]*)\]\(([^)]+)\)'
+
+    # Replace the image link with an inline base64 data URI
+    def replace(match):
+        relative_path = match.group(1)
+        full_path = os.path.join(image_dir_path, relative_path)
+        base64_image = image_to_base64(full_path)
+        return f"![{relative_path}](data:image/jpeg;base64,{base64_image})"
+
+    # Apply the replacement
+    return re.sub(pattern, replace, markdown_text)
+
+
+def to_markdown(file_path, end_pages):
+    # Get the paths of the recognized md file and the zip archive
+    local_md_dir, file_name = parse_pdf(file_path, './output', end_pages - 1)
+    archive_zip_path = os.path.join("./output", compute_sha256(local_md_dir) + ".zip")
+    zip_archive_success = compress_directory_to_zip(local_md_dir, archive_zip_path)
+    if zip_archive_success == 0:
+        logger.info("Compression succeeded")
+    else:
+        logger.error("Compression failed")
+    md_path = os.path.join(local_md_dir, file_name + ".md")
+    with open(md_path, 'r', encoding='utf-8') as f:
+        txt_content = f.read()
+    md_content = replace_image_with_base64(txt_content, local_md_dir)
+    # Return the path of the converted PDF
+    new_pdf_path = os.path.join(local_md_dir, file_name + "_layout.pdf")
+
+    return md_content, txt_content, archive_zip_path, new_pdf_path
+
+
+# def show_pdf(file_path):
+#     with open(file_path, "rb") as f:
+#         base64_pdf = base64.b64encode(f.read()).decode('utf-8')
+#     pdf_display = f'<embed src="data:application/pdf;base64,{base64_pdf}" ' \
+#                   f'width="100%" height="1000" type="application/pdf">'
+#     return pdf_display
+
+
+latex_delimiters = [{"left": "$$", "right": "$$", "display": True},
+                    {"left": '$', "right": '$', "display": False}]
+
+
+def init_model():
+    from magic_pdf.model.doc_analyze_by_custom_model import ModelSingleton
+    try:
+        model_manager = ModelSingleton()
+        txt_model = model_manager.get_model(False, False)
+        logger.info("txt_model initialized")
+        ocr_model = model_manager.get_model(True, False)
+        logger.info("ocr_model initialized")
+        return 0
+    except Exception as e:
+        logger.exception(e)
+        return -1
+
+
+model_init = init_model()
+logger.info(f"model_init: {model_init}")
+
+
+if __name__ == "__main__":
+    with gr.Blocks() as demo:
+        with gr.Row():
+            with gr.Column(variant='panel', scale=5):
+                pdf_show = gr.Markdown()
+                max_pages = gr.Slider(1, 10, 5, step=1, label="Max convert pages")
+                with gr.Row() as bu_flow:
+                    change_bu = gr.Button("Convert")
+                    clear_bu = gr.ClearButton([pdf_show], value="Clear")
+                pdf_show = PDF(label="Please upload pdf", interactive=True, height=800)
+
+            with gr.Column(variant='panel', scale=5):
+                output_file = gr.File(label="convert result", interactive=False)
+                with gr.Tabs():
+                    with gr.Tab("Markdown rendering"):
+                        md = gr.Markdown(label="Markdown rendering", height=900, show_copy_button=True,
+                                         latex_delimiters=latex_delimiters, line_breaks=True)
+                    with gr.Tab("Markdown text"):
+                        md_text = gr.TextArea(lines=45, show_copy_button=True)
+        change_bu.click(fn=to_markdown, inputs=[pdf_show, max_pages], outputs=[md, md_text, output_file, pdf_show])
+        clear_bu.add([md, pdf_show, md_text, output_file])
+
+    demo.launch()
+
--- a/docs/chemical_knowledge_introduction/introduction.pdf
+++ b/docs/chemical_knowledge_introduction/introduction.pdf
--- a/docs/chemical_knowledge_introduction/introduction.xmind
+++ b/docs/chemical_knowledge_introduction/introduction.xmind
--- a/docs/output_file_en_us.md
+++ b/docs/output_file_en_us.md
--- a/docs/output_file_zh_cn.md
+++ b/docs/output_file_zh_cn.md
-
-
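 ## Overview
+
 Besides the markdown-related output, the `magic-pdf` command also generates several files that are unrelated to markdown. They are described one by one below.

+### some_pdf_layout.pdf

-### layout.pdf 
 The layout of each page consists of one or more boxes. The number in the upper-left corner of each box is its sequence number. In addition, layout.pdf uses different background color blocks inside the boxes to mark different content blocks.

 ![layout page example](images/layout_example.png)

+### some_pdf_spans.pdf

-### spans.pdf 
 All spans on a page are drawn with line boxes whose color depends on the span type. This file can be used for quality checks, making it quick to spot problems such as missing text or unrecognized interline equations.

 ![span page example](images/spans_example.png)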

-
-### model.json
+### some_pdf_model.json

 #### Structure definition
+
 ```python
 from pydantic import BaseModel, Field
 from enum import IntEnum
@@ -33,13 +32,13 @@ class CategoryType(IntEnum):
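     table_caption = 6       # table caption
     table_footnote = 7      # table footnote
     isolate_formula = 8     # interline formula
-     formula_caption = 9     # label of the interline formula 
-     
+     formula_caption = 9     # label of the interline formula
+
     embedding = 13          # inline formula
     isolated = 14           # interline formula
     text = 15               # OCR recognition result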
-   
-     
+
+
 class PageInfo(BaseModel):
    page_no: int = Field(description="page number, the first page is 0", ge=0)
    height: int = Field(description="page height", gt=0)
@@ -51,21 +50,20 @@ class ObjectInferenceResult(BaseModel):
    score: float = Field(description="confidence of the inference result")
    latex: str | None = Field(description="LaTeX parsing result", default=None)
    html: str | None = Field(description="HTML parsing result", default=None)
-  
+
 class PageInferenceResults(BaseModel):
     layout_dets: list[ObjectInferenceResult] = Field(description="page recognition results", ge=0)
     page_info: PageInfo = Field(description="page metadata")
-    
-    
+
+
 # The minerU inference result is the list of per-page inference results, in page order
 inference_result: list[PageInferenceResults] = []

 ```

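-The poly coordinate format is [x0, y0, x1, y1, x2, y2, x3, y3], giving the coordinates of the top-left, top-right, bottom-right, and bottom-left points in that order
+The poly coordinate format is \[x0, y0, x1, y1, x2, y2, x3, y3\], giving the coordinates of the top-left, top-right, bottom-right, and bottom-left points in that order
 ![poly coordinate diagram](images/poly.png)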
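
As an illustration of the poly layout described above, here is a minimal sketch that converts a poly list into an axis-aligned bounding box. `poly_to_bbox` is a hypothetical helper, not a magic-pdf function, and the sample coordinates are made up.

```python
def poly_to_bbox(poly):
    """Convert [x0, y0, x1, y1, x2, y2, x3, y3] to (x_min, y_min, x_max, y_max)."""
    if len(poly) != 8:
        raise ValueError("poly must hold 8 numbers (4 corner points)")
    xs = poly[0::2]  # x0, x1, x2, x3
    ys = poly[1::2]  # y0, y1, y2, y3
    return (min(xs), min(ys), max(xs), max(ys))


# Corners in the documented order: top-left, top-right, bottom-right, bottom-left.
sample_poly = [99.0, 100.0, 730.0, 100.0, 730.0, 245.0, 99.0, 245.0]
print(poly_to_bbox(sample_poly))
```

Taking the min and max over all four corners, instead of reading only the first and third points, keeps the result valid even when a box is slightly rotated.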

-
 #### Example data

 ```json
@@ -119,32 +117,31 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右
 ]
 ```

+### some_pdf_middle.json

-### middle.json
-
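-| Field | Description | 
-| :-----|:------------------------------------------|
-|pdf_info | list; each element is a dict with the parse result of one PDF page, see the table below |
-|_parse_type | ocr \| txt, identifies the mode used for the intermediate state of this parse |
-|_version_name | string, the magic-pdf version used for this parse |
+| Field          | Description                                                                             |
+| :------------- | :-------------------------------------------------------------------------------------- |
+| pdf_info       | list; each element is a dict with the parse result of one PDF page, see the table below |
+| \_parse_type   | ocr \| txt, identifies the mode used for the intermediate state of this parse           |
+| \_version_name | string, the magic-pdf version used for this parse                                       |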

 <br>

 **pdf_info**
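 Field structure description

-| Field | Description | 
-| :-----| :---- |
-| preproc_blocks | intermediate result after PDF preprocessing, not yet split into paragraphs |
-| layout_bboxes | layout segmentation result with direction (vertical, horizontal) and bbox, sorted in reading order |
-| page_idx | page number, starting from 0 |
-| page_size | page width and height | 
-| _layout_tree | layout tree structure |
-| images | list; each element is a dict representing an img_block |
-| tables | list; each element is a dict representing a table_block |
-| interline_equations | list; each element is a dict representing an interline_equation_block |
-| discarded_blocks | list; block information that the model marks to be dropped |
-| para_blocks | result of splitting preproc_blocks into paragraphs |
+| Field               | Description                                                                                        |
+| :------------------ | :------------------------------------------------------------------------------------------------- |
+| preproc_blocks      | intermediate result after PDF preprocessing, not yet split into paragraphs                         |
+| layout_bboxes       | layout segmentation result with direction (vertical, horizontal) and bbox, sorted in reading order |
+| page_idx            | page number, starting from 0                                                                       |
+| page_size           | page width and height                                                                              |
+| \_layout_tree       | layout tree structure                                                                              |
+| images              | list; each element is a dict representing an img_block                                             |
+| tables              | list; each element is a dict representing a table_block                                            |
+| interline_equations | list; each element is a dict representing an interline_equation_block                              |
+| discarded_blocks    | list; block information that the model marks to be dropped                                         |
+| para_blocks         | result of splitting preproc_blocks into paragraphs                                                 |

 In the table above, `para_blocks` is an array of dicts; each dict is a block structure, and blocks support at most one level of nesting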

@@ -154,35 +151,35 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右

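 The outer block is called a first-level block; the fields of a first-level block include

-| Field | Description |
-| :-----| :---- |
-| type | block type (table\|image)|
-|bbox | block bounding box coordinates |
-|blocks |list; each element is a second-level block in dict format |
+| Field  | Description                                               |
+| :----- | :-------------------------------------------------------- |
+| type   | block type (table\|image)                                 |
+| bbox   | block bounding box coordinates                            |
+| blocks | list; each element is a second-level block in dict format |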

 <br>
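 First-level blocks come in only two types, "table" and "image"; all other blocks are second-level blocks

 The fields of a second-level block include

-| Field | Description |
-| :-----| :---- |
-| type | block type |
-| bbox | block bounding box coordinates |
-| lines | list; each element is a dict representing a line, describing what the line is made of| 
+| Field | Description                                                                           |
+| :---- | :------------------------------------------------------------------------------------ |
+| type  | block type                                                                            |
+| bbox  | block bounding box coordinates                                                        |
+| lines | list; each element is a dict representing a line, describing what the line is made of |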

 Second-level block types in detail

-| type               | desc | 
-|:-------------------| :---- |
-| image_body         | image body |
+| type               | desc                          |
+| :----------------- | :---------------------------- |
+| image_body         | image body                    |
 | image_caption      | descriptive text of the image |
-| table_body         | table body |
+| table_body         | table body                    |
 | table_caption      | descriptive text of the table |
-| table_footnote     | table footnote |
-| text               | text block |
-| title              | title block |
-| interline_equation | interline equation block| 
+| table_footnote     | table footnote                |
+| text               | text block                    |
+| title              | title block                   |
+| interline_equation | interline equation block      |

 <br>

@@ -190,33 +187,31 @@ poly 坐标的格式 [x0, y0, x1, y1, x2, y2, x3, y3], 分别表示左上、右

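 The fields of a line are as follows

-| Field | Description | 
-| :-----| :---- |
-| bbox | line bounding box coordinates |
-| spans | list; each element is a dict representing a span, describing one smallest building unit |
-
+| Field | Description                                                                             |
+| :---- | :-------------------------------------------------------------------------------------- |
+| bbox  | line bounding box coordinates                                                           |
+| spans | list; each element is a dict representing a span, describing one smallest building unit |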

 <br>

 **span**

-| Field | Description | 
-| :-----| :---- |
-| bbox | span bounding box coordinates |
-| type | span type |
+| Field               | Description                                                                                               |
+| :------------------ | :-------------------------------------------------------------------------------------------------------- |
+| bbox                | span bounding box coordinates                                                                             |
+| type                | span type                                                                                                 |
 | content \| img_path | text spans use content and image/table spans use img_path, storing the actual text or the screenshot path |

 The span types are as follows

-| type | desc | 
-| :-----| :---- |
-| image | image | 
-| table | table |
-| text | text |
-| inline_equation | inline equation |
+| type               | desc               |
+| :----------------- | :----------------- |
+| image              | image              |
+| table              | table              |
+| text               | text               |
+| inline_equation    | inline equation    |
 | interline_equation | interline equation |

-
 **Summary**

 A span is the smallest storage unit of all elements
@@ -227,7 +222,6 @@ para_blocks内存储的元素为区块信息

 first-level block (if any) -> second-level block -> line -> span
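
The nesting summarized above can be walked with a few nested loops. The sketch below is only an illustration: `iter_spans` is a hypothetical helper, and the sample dict is invented and heavily trimmed, following the field tables in this document rather than real `magic-pdf` output.

```python
def iter_spans(middle):
    """Yield (page_idx, block_type, span) for every span in a middle.json dict."""
    for page in middle.get("pdf_info", []):
        for block in page.get("para_blocks", []):
            # First-level blocks (table/image) wrap second-level blocks in
            # "blocks"; any other block is itself a second-level block.
            for sub_block in block.get("blocks", [block]):
                for line in sub_block.get("lines", []):
                    for span in line.get("spans", []):
                        yield page.get("page_idx"), sub_block.get("type"), span


# Hypothetical sample data (bbox values are placeholders).
sample = {
    "pdf_info": [{
        "page_idx": 0,
        "para_blocks": [
            {"type": "title", "bbox": [0, 0, 100, 20], "lines": [
                {"bbox": [0, 0, 100, 20], "spans": [
                    {"bbox": [0, 0, 100, 20], "type": "text", "content": "Heading"}]}]},
            {"type": "image", "bbox": [0, 30, 100, 120], "blocks": [
                {"type": "image_body", "bbox": [0, 30, 100, 100], "lines": [
                    {"bbox": [0, 30, 100, 100], "spans": [
                        {"bbox": [0, 30, 100, 100], "type": "image",
                         "img_path": "images/a.jpg"}]}]},
                {"type": "image_caption", "bbox": [0, 100, 100, 120], "lines": [
                    {"bbox": [0, 100, 100, 120], "spans": [
                        {"bbox": [0, 100, 100, 120], "type": "text",
                         "content": "Figure 1"}]}]},
            ]},
        ],
    }],
    "_parse_type": "txt",
    "_version_name": "0.6.1",
}

for page_idx, block_type, span in iter_spans(sample):
    # Text spans carry "content"; image/table spans carry "img_path".
    print(page_idx, block_type, span.get("content", span.get("img_path")))
```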

-
 #### Example data

 ```json
@@ -329,4 +323,4 @@ para_blocks内存储的元素为区块信息
    "_parse_type": "txt",
    "_version_name": "0.6.1"
 }
-```
\ No newline at end of file
+```
--- a/magic_pdf/dict2md/ocr_mkcontent.py
+++ b/magic_pdf/dict2md/ocr_mkcontent.py
--- a/magic_pdf/integrations/__init__.py
+++ b/magic_pdf/integrations/__init__.py
--- a/magic_pdf/integrations/rag/__init__.py
+++ b/magic_pdf/integrations/rag/__init__.py
--- a/magic_pdf/integrations/rag/api.py
+++ b/magic_pdf/integrations/rag/api.py
+import os
+from pathlib import Path
+
+from loguru import logger
+
+from magic_pdf.integrations.rag.type import (ElementRelation, LayoutElements,
+                                             Node)
+from magic_pdf.integrations.rag.utils import inference
+
+
+class RagPageReader:
+
+    def __init__(self, pagedata: LayoutElements):
+        self.o = [
+            Node(
+                category_type=v.category_type,
+                text=v.text,
+                image_path=v.image_path,
+                anno_id=v.anno_id,
+                latex=v.latex,
+                html=v.html,
+            ) for v in pagedata.layout_dets
+        ]
+
+        self.pagedata = pagedata
+
+    def __iter__(self):
+        return iter(self.o)
+
+    def get_rel_map(self) -> list[ElementRelation]:
+        return self.pagedata.extra.element_relation
+
+
+class RagDocumentReader:
+
+    def __init__(self, ragdata: list[LayoutElements]):
+        self.o = [RagPageReader(v) for v in ragdata]
+
+    def __iter__(self):
+        return iter(self.o)
+
+
+class DataReader:
+
+    def __init__(self, path_or_directory: str, method: str, output_dir: str):
+        self.path_or_directory = path_or_directory
+        self.method = method
+        self.output_dir = output_dir
+        self.pdfs = []
+        if os.path.isdir(path_or_directory):
+            for doc_path in Path(path_or_directory).glob('*.pdf'):
+                self.pdfs.append(doc_path)
+        else:
+            assert path_or_directory.endswith('.pdf')
+            self.pdfs.append(Path(path_or_directory))
+
+    def get_documents_count(self) -> int:
+        """Returns the number of documents in the directory."""
+        return len(self.pdfs)
+
+    def get_document_result(self, idx: int) -> RagDocumentReader | None:
+        """
+        Args:
+            idx (int): the index of documents under the
+                directory path_or_directory
+
+        Returns:
+            RagDocumentReader | None: RagDocumentReader is an iterable object,
+            more details @RagDocumentReader
+        """
+        if idx >= self.get_documents_count() or idx < 0:
+            logger.error(f'invalid idx: {idx}')
+            return None
+        res = inference(str(self.pdfs[idx]), self.output_dir, self.method)
+        if res is None:
+            logger.warning(f'failed to inference pdf {self.pdfs[idx]}')
+            return None
+        return RagDocumentReader(res)
+
+    def get_document_filename(self, idx: int) -> Path:
+        """get the filename of the document."""
+        return self.pdfs[idx]
--- a/magic_pdf/integrations/rag/type.py
+++ b/magic_pdf/integrations/rag/type.py
+from enum import Enum
+
+from pydantic import BaseModel, Field
+
+
+# rag
+class CategoryType(Enum):  # Python 3.10 does not support StrEnum
+    text = 'text'
+    title = 'title'
+    interline_equation = 'interline_equation'
+    image = 'image'
+    image_body = 'image_body'
+    image_caption = 'image_caption'
+    table = 'table'
+    table_body = 'table_body'
+    table_caption = 'table_caption'
+    table_footnote = 'table_footnote'
+
+
+class ElementRelType(Enum):
+    sibling = 'sibling'
+
+
+class PageInfo(BaseModel):
+    page_no: int = Field(description='the index of page, start from zero',
+                         ge=0)
+    height: int = Field(description='the height of page', gt=0)
+    width: int = Field(description='the width of page', ge=0)
+    image_path: str | None = Field(description='the image of this page',
+                                   default=None)
+
+
+class ContentObject(BaseModel):
+    category_type: CategoryType = Field(description='category')
+    poly: list[float] = Field(
+        description=('Coordinates, need to convert back to PDF coordinates,'
+                     ' order is top-left, top-right, bottom-right, bottom-left'
+                     ' x,y coordinates'))
+    ignore: bool = Field(description='whether ignore this object',
+                         default=False)
+    text: str | None = Field(description='text content of the object',
+                             default=None)
+    image_path: str | None = Field(description='path of embedded image',
+                                   default=None)
+    order: int = Field(description='the order of this object within a page',
+                       default=-1)
+    anno_id: int = Field(description='unique id', default=-1)
+    latex: str | None = Field(description='latex result', default=None)
+    html: str | None = Field(description='html result', default=None)
+
+
+class ElementRelation(BaseModel):
+    source_anno_id: int = Field(description='unique id of the source object',
+                                default=-1)
+    target_anno_id: int = Field(description='unique id of the target object',
+                                default=-1)
+    relation: ElementRelType = Field(
+        description='the relation between source and target element')
+
+
+class LayoutElementsExtra(BaseModel):
+    element_relation: list[ElementRelation] = Field(
+        description='the relation between source and target element')
+
+
+class LayoutElements(BaseModel):
+    layout_dets: list[ContentObject] = Field(
+        description='layout element details')
+    page_info: PageInfo = Field(description='page info')
+    extra: LayoutElementsExtra = Field(description='extra information')
+
+
+# iter data format
+class Node(BaseModel):
+    category_type: CategoryType = Field(description='category')
+    text: str | None = Field(description='text content of the object',
+                             default=None)
+    image_path: str | None = Field(description='path of embedded image',
+                                   default=None)
+    anno_id: int = Field(description='unique id', default=-1)
+    latex: str | None = Field(description='latex result', default=None)
+    html: str | None = Field(description='html result', default=None)
--- a/magic_pdf/integrations/rag/utils.py
+++ b/magic_pdf/integrations/rag/utils.py
--- a/magic_pdf/layout/layout_sort.py
+++ b/magic_pdf/layout/layout_sort.py
--- a/magic_pdf/libs/boxbase.py
+++ b/magic_pdf/libs/boxbase.py
--- a/magic_pdf/libs/draw_bbox.py
+++ b/magic_pdf/libs/draw_bbox.py
--- a/magic_pdf/libs/ocr_content_type.py
+++ b/magic_pdf/libs/ocr_content_type.py
 class ContentType:
-    Image = "image"
-    Table = "table"
-    Text = "text"
-    InlineEquation = "inline_equation"
-    InterlineEquation = "interline_equation"
-    
+    Image = 'image'
+    Table = 'table'
+    Text = 'text'
+    InlineEquation = 'inline_equation'
+    InterlineEquation = 'interline_equation'
+
+
 class BlockType:
-    Image = "image"
-    ImageBody = "image_body"
-    ImageCaption = "image_caption"
-    Table = "table"
-    TableBody = "table_body"
-    TableCaption = "table_caption"
-    TableFootnote = "table_footnote"
-    Text = "text"
-    Title = "title"
-    InterlineEquation = "interline_equation"
-    Footnote = "footnote"
-    Discarded = "discarded"
+    Image = 'image'
+    ImageBody = 'image_body'
+    ImageCaption = 'image_caption'
+    ImageFootnote = 'image_footnote'
+    Table = 'table'
+    TableBody = 'table_body'
+    TableCaption = 'table_caption'
+    TableFootnote = 'table_footnote'
+    Text = 'text'
+    Title = 'title'
+    InterlineEquation = 'interline_equation'
+    Footnote = 'footnote'
+    Discarded = 'discarded'


 class CategoryId:
@@ -33,3 +35,4 @@ class CategoryId:
    InlineEquation = 13
    InterlineEquation_YOLO = 14
    OcrText = 15
+    ImageFootnote = 101
--- a/magic_pdf/model/doc_analyze_by_custom_model.py
+++ b/magic_pdf/model/doc_analyze_by_custom_model.py
@@ -103,20 +103,32 @@ def custom_model_init(ocr: bool = False, show_log: bool = False):
    return custom_model


-def doc_analyze(pdf_bytes: bytes, ocr: bool = False, show_log: bool = False):
+def doc_analyze(pdf_bytes: bytes, ocr: bool = False, show_log: bool = False,
+                start_page_id=0, end_page_id=None):

    model_manager = ModelSingleton()
    custom_model = model_manager.get_model(ocr, show_log)

    images = load_images_from_pdf(pdf_bytes)

+    # end_page_id = end_page_id if end_page_id else len(images) - 1
+    end_page_id = end_page_id if end_page_id is not None and end_page_id >= 0 else len(images) - 1
+
+    if end_page_id > len(images) - 1:
+        logger.warning("end_page_id is out of range, use images length")
+        end_page_id = len(images) - 1
+
    model_json = []
    doc_analyze_start = time.time()
+
    for index, img_dict in enumerate(images):
        img = img_dict["img"]
        page_width = img_dict["width"]
        page_height = img_dict["height"]
-        result = custom_model(img)
+        if start_page_id <= index <= end_page_id:
+            result = custom_model(img)
+        else:
+            result = []
        page_info = {"page_no": index, "height": page_height, "width": page_width}
        page_dict = {"layout_dets": result, "page_info": page_info}
        model_json.append(page_dict)
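
The page-range handling added to `doc_analyze` can be sketched in isolation. The helper names `resolve_end_page_id` and `pages_to_run` are hypothetical and exist only for this illustration. The logic mirrors the hunk above: a missing or negative `end_page_id` means "through the last page", an out-of-range value is clamped, and only indices between `start_page_id` and `end_page_id` (inclusive) would be sent to the model.

```python
def resolve_end_page_id(end_page_id, page_count):
    """Normalize end_page_id the way the hunk above does (hypothetical helper)."""
    last_index = page_count - 1
    # None or a negative value means "process through the last page".
    if end_page_id is None or end_page_id < 0:
        return last_index
    # Values past the end are clamped to the last page.
    return min(end_page_id, last_index)


def pages_to_run(page_count, start_page_id=0, end_page_id=None):
    """Return the page indices that would be passed to the model."""
    end = resolve_end_page_id(end_page_id, page_count)
    return [i for i in range(page_count) if start_page_id <= i <= end]


print(pages_to_run(5))         # no limits given
print(pages_to_run(5, 1, 2))   # explicit range
print(pages_to_run(5, 0, 0))   # first page only
print(pages_to_run(5, 0, 99))  # out-of-range end is clamped
```

Note that `end_page_id=0` selects only the first page. The commented-out truthiness check in the hunk would have treated `0` as unset, which is presumably why the explicit `is not None` test replaced it.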

--- a/magic_pdf/model/magic_model.py
+++ b/magic_pdf/model/magic_model.py
--- a/magic_pdf/model/model_list.py
+++ b/magic_pdf/model/model_list.py
 class MODEL:
    Paddle = "pp_structure_v2"
    PEK = "pdf_extract_kit"
+
+
+class AtomicModel:
+    Layout = "layout"
+    MFD = "mfd"
+    MFR = "mfr"
+    OCR = "ocr"
+    Table = "table"
--- a/magic_pdf/model/pdf_extract_kit.py
+++ b/magic_pdf/model/pdf_extract_kit.py
--- a/magic_pdf/model/pek_sub_modules/self_modify.py
+++ b/magic_pdf/model/pek_sub_modules/self_modify.py
@@ -12,6 +12,7 @@ from paddleocr.ppocr.utils.utility import check_and_read, alpha_to_color, binari
 from paddleocr.tools.infer.utility import draw_ocr_box_txt, get_rotate_crop_image, get_minarea_rect_crop

 from magic_pdf.libs.boxbase import __is_overlaps_y_exceeds_threshold
+from magic_pdf.pre_proc.ocr_dict_merge import merge_spans_to_line

 logger = get_logger()

@@ -162,6 +163,86 @@ def update_det_boxes(dt_boxes, mfd_res):
    return new_dt_boxes


+def merge_overlapping_spans(spans):
+    """
+    Merges overlapping spans on the same line.
+
+    :param spans: A list of span coordinates [(x1, y1, x2, y2), ...]
+    :return: A list of merged spans
+    """
+    # Return an empty list if the input spans list is empty
+    if not spans:
+        return []
+
+    # Sort spans by their starting x-coordinate
+    spans.sort(key=lambda x: x[0])
+
+    # Initialize the list of merged spans
+    merged = []
+    for span in spans:
+        # Unpack span coordinates
+        x1, y1, x2, y2 = span
+        # If the merged list is empty or there's no horizontal overlap, add the span directly
+        if not merged or merged[-1][2] < x1:
+            merged.append(span)
+        else:
+            # If there is horizontal overlap, merge the current span with the previous one
+            last_span = merged.pop()
+            # Update the merged span's top-left corner to the smaller (x1, y1) and bottom-right to the larger (x2, y2)
+            x1 = min(last_span[0], x1)
+            y1 = min(last_span[1], y1)
+            x2 = max(last_span[2], x2)
+            y2 = max(last_span[3], y2)
+            # Add the merged span back to the list
+            merged.append((x1, y1, x2, y2))
+
+    # Return the list of merged spans
+    return merged
+
+
+def merge_det_boxes(dt_boxes):
+    """
+    Merge detection boxes.
+
+    This function takes a list of detected bounding boxes, each represented by four corner points.
+    The goal is to merge these bounding boxes into larger text regions.
+
+    Parameters:
+    dt_boxes (list): A list containing multiple text detection boxes, where each box is defined by four corner points.
+
+    Returns:
+    list: A list containing the merged text regions, where each region is represented by four corner points.
+    """
+    # Convert the detection boxes into a dictionary format with bounding boxes and type
+    dt_boxes_dict_list = []
+    for text_box in dt_boxes:
+        text_bbox = points_to_bbox(text_box)
+        text_box_dict = {
+            'bbox': text_bbox,
+            'type': 'text',
+        }
+        dt_boxes_dict_list.append(text_box_dict)
+
+    # Merge adjacent text regions into lines
+    lines = merge_spans_to_line(dt_boxes_dict_list)
+
+    # Initialize a new list for storing the merged text regions
+    new_dt_boxes = []
+    for line in lines:
+        line_bbox_list = []
+        for span in line:
+            line_bbox_list.append(span['bbox'])
+
+        # Merge overlapping text regions within the same line
+        merged_spans = merge_overlapping_spans(line_bbox_list)
+
+        # Convert the merged text regions back to point format and add them to the new detection box list
+        for span in merged_spans:
+            new_dt_boxes.append(bbox_to_points(span))
+
+    return new_dt_boxes
+
+
 class ModifiedPaddleOCR(PaddleOCR):
    def ocr(self, img, det=True, rec=True, cls=True, bin=False, inv=False, mfd_res=None, alpha_color=(255, 255, 255)):
        """
@@ -265,6 +346,9 @@ class ModifiedPaddleOCR(PaddleOCR):
        img_crop_list = []

        dt_boxes = sorted_boxes(dt_boxes)
+
+        dt_boxes = merge_det_boxes(dt_boxes)
+
        if mfd_res:
            bef = time.time()
            dt_boxes = update_det_boxes(dt_boxes, mfd_res)
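
To make the behavior of `merge_overlapping_spans` concrete, here is a self-contained restatement of the same interval-merging idea with a small worked input. It is a sketch, not the code from the hunk: it uses `sorted()` instead of sorting the caller's list in place, and the sample boxes are made up.

```python
def merge_overlapping_spans(spans):
    """Merge horizontally overlapping (x1, y1, x2, y2) boxes on one line."""
    if not spans:
        return []
    merged = []
    # sorted() leaves the caller's list untouched (the diff sorts in place).
    for x1, y1, x2, y2 in sorted(spans, key=lambda s: s[0]):
        if not merged or merged[-1][2] < x1:
            # No horizontal overlap with the previous box: start a new one.
            merged.append((x1, y1, x2, y2))
        else:
            # Overlap (or touching edges): grow the previous box to cover both.
            lx1, ly1, lx2, ly2 = merged.pop()
            merged.append((min(lx1, x1), min(ly1, y1), max(lx2, x2), max(ly2, y2)))
    return merged


boxes = [(50, 10, 80, 20), (0, 10, 30, 22), (25, 8, 55, 20), (100, 10, 120, 20)]
print(merge_overlapping_spans(boxes))
```

Because the comparison is a strict `<`, two boxes whose edges merely touch (`x2 == x1` of the next box) are merged as well.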

--- a/magic_pdf/para/para_split_v2.py
+++ b/magic_pdf/para/para_split_v2.py
--- a/magic_pdf/pdf_parse_union_core.py
+++ b/magic_pdf/pdf_parse_union_core.py
--- a/magic_pdf/pipe/AbsPipe.py
+++ b/magic_pdf/pipe/AbsPipe.py
@@ -16,12 +16,15 @@ class AbsPipe(ABC):
    PIP_OCR = "ocr"
    PIP_TXT = "txt"

-    def __init__(self, pdf_bytes: bytes, model_list: list, image_writer: AbsReaderWriter, is_debug: bool = False):
+    def __init__(self, pdf_bytes: bytes, model_list: list, image_writer: AbsReaderWriter, is_debug: bool = False,
+                 start_page_id=0, end_page_id=None):
        self.pdf_bytes = pdf_bytes
        self.model_list = model_list
        self.image_writer = image_writer
        self.pdf_mid_data = None  # uncompressed
        self.is_debug = is_debug
+        self.start_page_id = start_page_id
+        self.end_page_id = end_page_id
    
    def get_compress_pdf_mid_data(self):
        return JsonCompressor.compress_json(self.pdf_mid_data)

--- a/magic_pdf/pipe/OCRPipe.py
+++ b/magic_pdf/pipe/OCRPipe.py
@@ -9,17 +9,20 @@ from magic_pdf.user_api import parse_ocr_pdf

 class OCRPipe(AbsPipe):

-    def __init__(self, pdf_bytes: bytes, model_list: list, image_writer: AbsReaderWriter, is_debug: bool = False):
-        super().__init__(pdf_bytes, model_list, image_writer, is_debug)
+    def __init__(self, pdf_bytes: bytes, model_list: list, image_writer: AbsReaderWriter, is_debug: bool = False,
+                 start_page_id=0, end_page_id=None):
+        super().__init__(pdf_bytes, model_list, image_writer, is_debug, start_page_id, end_page_id)

    def pipe_classify(self):
        pass

    def pipe_analyze(self):
-        self.model_list = doc_analyze(self.pdf_bytes, ocr=True)
+        self.model_list = doc_analyze(self.pdf_bytes, ocr=True,
+                                      start_page_id=self.start_page_id, end_page_id=self.end_page_id)

    def pipe_parse(self):
-        self.pdf_mid_data = parse_ocr_pdf(self.pdf_bytes, self.model_list, self.image_writer, is_debug=self.is_debug)
+        self.pdf_mid_data = parse_ocr_pdf(self.pdf_bytes, self.model_list, self.image_writer, is_debug=self.is_debug,
+                                          start_page_id=self.start_page_id, end_page_id=self.end_page_id)

    def pipe_mk_uni_format(self, img_parent_path: str, drop_mode=DropMode.WHOLE_PDF):
        result = super().pipe_mk_uni_format(img_parent_path, drop_mode)

--- a/magic_pdf/pipe/TXTPipe.py
+++ b/magic_pdf/pipe/TXTPipe.py
--- a/magic_pdf/pipe/UNIPipe.py
+++ b/magic_pdf/pipe/UNIPipe.py
--- a/magic_pdf/pre_proc/ocr_detect_all_bboxes.py
+++ b/magic_pdf/pre_proc/ocr_detect_all_bboxes.py
--- a/magic_pdf/pre_proc/ocr_dict_merge.py
+++ b/magic_pdf/pre_proc/ocr_dict_merge.py
--- a/magic_pdf/tools/cli.py
+++ b/magic_pdf/tools/cli.py
--- a/magic_pdf/tools/cli_dev.py
+++ b/magic_pdf/tools/cli_dev.py
--- a/magic_pdf/tools/common.py
+++ b/magic_pdf/tools/common.py
--- a/magic_pdf/user_api.py
+++ b/magic_pdf/user_api.py
--- a/projects/README.md
+++ b/projects/README.md
+# Welcome to the MinerU Project List
+
+## Project List
+
+- [llama_index_rag](./llama_index_rag/README.md): Build a lightweight RAG system based on llama_index
--- a/projects/README_zh-CN.md
+++ b/projects/README_zh-CN.md
+# 欢迎来到 MinerU 项目列表
+
+## 项目列表
+
+- [llama_index_rag](./llama_index_rag/README_zh-CN.md): 基于 llama_index 构建轻量级 RAG 系统
--- a/projects/llama_index_rag/README.md
+++ b/projects/llama_index_rag/README.md
--- a/projects/llama_index_rag/README_zh-CN.md
+++ b/projects/llama_index_rag/README_zh-CN.md
--- a/projects/llama_index_rag/data_ingestion.py
+++ b/projects/llama_index_rag/data_ingestion.py
--- a/projects/llama_index_rag/docker-compose.yml
+++ b/projects/llama_index_rag/docker-compose.yml
--- a/projects/llama_index_rag/example/data/declaration_of_the_rights_of_man_1789.pdf
+++ b/projects/llama_index_rag/example/data/declaration_of_the_rights_of_man_1789.pdf
--- a/projects/llama_index_rag/query.py
+++ b/projects/llama_index_rag/query.py
--- a/projects/llama_index_rag/rag_data_api.png
+++ b/projects/llama_index_rag/rag_data_api.png
--- a/requirements-qa.txt
+++ b/requirements-qa.txt
--- a/requirements.txt
+++ b/requirements.txt
--- a/tests/get_coverage.py
+++ b/tests/get_coverage.py
--- a/tests/pdf_indicator/overall_indicator.py
+++ b/tests/pdf_indicator/overall_indicator.py
--- a/tests/retry_env.sh
+++ b/tests/retry_env.sh
--- a/tests/test_cli/lib/common.py
+++ b/tests/test_cli/lib/common.py
--- a/tests/test_cli/pdf_dev/academic_literature_0b2c9c91f5232541a7ace8984df306b2_model.json
+++ b/tests/test_cli/pdf_dev/academic_literature_0b2c9c91f5232541a7ace8984df306b2_model.json
--- a/tests/test_cli/pdf_dev/academic_literature_f7904bc37cc2e25c1e3e412978854b10_model.json
+++ b/tests/test_cli/pdf_dev/academic_literature_f7904bc37cc2e25c1e3e412978854b10_model.json
--- a/tests/test_cli/pdf_dev/academic_literature_fbdb99151e811688574c0c4c67341074_model.json
+++ b/tests/test_cli/pdf_dev/academic_literature_fbdb99151e811688574c0c4c67341074_model.json
--- a/tests/test_cli/pdf_dev/annotations/cleaned/cleaned_academic_literature_0b2c9c91f5232541a7ace8984df306b2.md
+++ b/tests/test_cli/pdf_dev/annotations/cleaned/cleaned_academic_literature_0b2c9c91f5232541a7ace8984df306b2.md
--- a/tests/test_cli/pdf_dev/annotations/cleaned/cleaned_academic_literature_f7904bc37cc2e25c1e3e412978854b10.md
+++ b/tests/test_cli/pdf_dev/annotations/cleaned/cleaned_academic_literature_f7904bc37cc2e25c1e3e412978854b10.md
--- a/tests/test_cli/pdf_dev/annotations/cleaned/cleaned_academic_literature_fbdb99151e811688574c0c4c67341074.md
+++ b/tests/test_cli/pdf_dev/annotations/cleaned/cleaned_academic_literature_fbdb99151e811688574c0c4c67341074.md
--- a/tests/test_cli/pdf_dev/annotations/cleaned/cleaned_ordinary_textbook_1d9a847603a5e37e379738316820850d.md
+++ b/tests/test_cli/pdf_dev/annotations/cleaned/cleaned_ordinary_textbook_1d9a847603a5e37e379738316820850d.md
--- a/tests/test_cli/pdf_dev/ordinary_textbook_1d9a847603a5e37e379738316820850d_model.json
+++ b/tests/test_cli/pdf_dev/ordinary_textbook_1d9a847603a5e37e379738316820850d_model.json
--- a/tests/test_cli/pdf_dev/pdf/academic_literature_0b2c9c91f5232541a7ace8984df306b2.pdf
+++ b/tests/test_cli/pdf_dev/pdf/academic_literature_0b2c9c91f5232541a7ace8984df306b2.pdf
--- a/tests/test_cli/pdf_dev/pdf/academic_literature_f7904bc37cc2e25c1e3e412978854b10.pdf
+++ b/tests/test_cli/pdf_dev/pdf/academic_literature_f7904bc37cc2e25c1e3e412978854b10.pdf
--- a/tests/test_cli/pdf_dev/pdf/academic_literature_fbdb99151e811688574c0c4c67341074.pdf
+++ b/tests/test_cli/pdf_dev/pdf/academic_literature_fbdb99151e811688574c0c4c67341074.pdf
--- a/tests/test_cli/pdf_dev/pdf/ordinary_textbook_1d9a847603a5e37e379738316820850d.pdf
+++ b/tests/test_cli/pdf_dev/pdf/ordinary_textbook_1d9a847603a5e37e379738316820850d.pdf
--- a/tests/test_cli/pdf_dev/research_report_1f978cd81fb7260c8f7644039ec2c054_model.json
+++ b/tests/test_cli/pdf_dev/research_report_1f978cd81fb7260c8f7644039ec2c054_model.json
--- a/tests/test_cli/test_bench.py
+++ b/tests/test_cli/test_bench.py
--- a/tests/test_cli/test_cli.py
+++ b/tests/test_cli/test_cli.py
--- a/tests/test_cli/test_cli_sdk.py
+++ b/tests/test_cli/test_cli_sdk.py
--- a/tests/test_integrations/test_rag/assets/middle.json
+++ b/tests/test_integrations/test_rag/assets/middle.json
--- a/tests/test_integrations/test_rag/assets/one_page_with_table_image.2.pdf
+++ b/tests/test_integrations/test_rag/assets/one_page_with_table_image.2.pdf
--- a/tests/test_integrations/test_rag/assets/one_page_with_table_image.pdf
+++ b/tests/test_integrations/test_rag/assets/one_page_with_table_image.pdf
--- a/tests/test_integrations/test_rag/test_api.py
+++ b/tests/test_integrations/test_rag/test_api.py
--- a/tests/test_integrations/test_rag/test_utils.py
+++ b/tests/test_integrations/test_rag/test_utils.py
--- a/tests/test_para/test_hyphen_at_line_end.py
+++ b/tests/test_para/test_hyphen_at_line_end.py
--- a/tests/test_tools/test_common.py
+++ b/tests/test_tools/test_common.py
--- a/tests/test_unit.py
+++ b/tests/test_unit.py