Commits · 6f58eeabd97f95d6b5bdb6b6962f9e7dfe37721f · Qin Kaijie / pdf-miner

28 Aug, 2024 12 commits
- merge: sync from master branch · 6f58eeab
  drunkpig authored Aug 28, 2024
  
  6f58eeab
- fix(detect_all_bboxes): remove small overlapping blocks by merging (#501) · 9067cd31
  Xiaomeng Zhao authored Aug 28, 2024
```
Previously, small blocks that overlapped with larger ones were merely removed. This fix
changes the approach to merge smaller blocks into the larger block instead, ensuring that
no information is lost and the larger block encompasses all the text content fully.
```
  9067cd31
- fix(pdf-parse-union-core): #492 decrease span threshold for block filling (#500) · 58bfcc9c
  Xiaomeng Zhao authored Aug 28, 2024
```
Reduce the span threshold used in fill_spans_in_blocks from 0.6 to 0.3 to
improve the accuracy of block filling based on layout analysis.
```
  58bfcc9c
- Update Huggingface and ModelScope links to organization account · 7f0fe200
  wangbinDL authored Aug 28, 2024
  
  7f0fe200
- Delete .github/workflows/gpu-ci.yml · 1a01cecb
  yyy authored Aug 28, 2024
  
  1a01cecb
- Delete .github/workflows/gpu-ci.yml · 2c3e35fe
  yyy authored Aug 28, 2024
  
  2c3e35fe
- Update cli.yml · df56e35a
  yyy authored Aug 28, 2024
  
  df56e35a
- Update gpu-ci.yml · c948e58e
  yyy authored Aug 28, 2024
  
  c948e58e
- Update cla.yml · e1c8348d
  Xiaomeng Zhao authored Aug 28, 2024
  
  e1c8348d
- Update cla.yml · 4a7d1fe5
  Xiaomeng Zhao authored Aug 28, 2024
  
  4a7d1fe5
- feat: add test case (#499) · 3d2fb836
  yyy authored Aug 28, 2024
```
Co-authored-by: quyuan <quyuan@pjlab.org>
```
  3d2fb836
- fix: remove the default value of output option in tools/cli.py and tools/cli_dev.py (#494) · f0a8886c
  icecraft authored Aug 28, 2024
```
Co-authored-by: icecraft <xurui1@pjlab.org.cn>
```
  f0a8886c
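
To illustrate the span-threshold change in 58bfcc9c (#500): the sketch below assumes a span is placed into a block when the fraction of the span's area lying inside the block exceeds a threshold. The helper names and the `(x0, y0, x1, y1)` box layout are hypothetical and are not taken from the project's actual `fill_spans_in_blocks` code; only the 0.6 and 0.3 values come from the commit message.

```python
def overlap_ratio(span, block):
    """Fraction of the span's area that lies inside the block."""
    sx0, sy0, sx1, sy1 = span
    bx0, by0, bx1, by1 = block
    inter_w = min(sx1, bx1) - max(sx0, bx0)
    inter_h = min(sy1, by1) - max(sy0, by0)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    span_area = (sx1 - sx0) * (sy1 - sy0)
    return (inter_w * inter_h) / span_area if span_area > 0 else 0.0


def fill_spans(blocks, spans, threshold):
    """Assign each span to the first block it overlaps by more than `threshold`."""
    filled = {i: [] for i in range(len(blocks))}
    for span in spans:
        for i, block in enumerate(blocks):
            if overlap_ratio(span, block) > threshold:
                filled[i].append(span)
                break
    return filled


# Hypothetical data: the second span lies only 40% inside the block.
blocks = [(0, 0, 100, 50)]
spans = [(10, 10, 40, 20), (80, 10, 130, 20)]

strict = fill_spans(blocks, spans, threshold=0.6)  # old value
loose = fill_spans(blocks, spans, threshold=0.3)   # new value
```

Under these assumptions, the span that straddles the block edge is dropped at 0.6 but kept at 0.3, which is the kind of case a lower threshold is meant to recover.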
26 Aug, 2024 3 commits

upload an introduction about chemical formula and update readme.md (#489) · bab19e78

Siyu Hao authored Aug 26, 2024

* upload an introduction about chemical formula

* rename 2 files

* update readme.md at TODO in chemistry

* rename 2 files and update readme.md at TODO in chemistry

* update README_zh-CN.md at TODO in chemistry

bab19e78

upload an introduction about chemical formula and update readme.md (#489) · 1754c040

Siyu Hao authored Aug 26, 2024

* upload an introduction about chemical formula

* rename 2 files

* update readme.md at TODO in chemistry

* rename 2 files and update readme.md at TODO in chemistry

* update README_zh-CN.md at TODO in chemistry

1754c040

@strongerfly has signed the CLA in opendatalab/MinerU#487 · 355a17aa
github-actions[bot] authored Aug 26, 2024

355a17aa

22 Aug, 2024 1 commit

build(docker): update docker build step (#471) · 1fc0b76d

Xiaomeng Zhao authored Aug 22, 2024

* build(docker): update base image to Ubuntu 22.04 and install PaddlePaddle

Upgrade the Docker base image from ubuntu:latest to ubuntu:22.04 for improved
performance and stability.

Additionally, integrate PaddlePaddle GPU version 3.0.0b1
into the Docker build for enhanced AI capabilities. The MinIO configuration file has
also been updated to the latest version.

* build(dockerfile): Updated the Dockerfile

* build(Dockerfile): update Dockerfile

* docs(docker): add instructions for quick deployment with Docker

Include Docker-based deployment instructions in the README for both English and
Chinese locales. This update provides users with a quick-start guide to using Docker for
deployment, with notes on GPU VRAM requirements and default acceleration features.

* build(docker): Layer the installation of dependencies, downloading the model, and the setup of the program itself.

* build(docker): Layer the installation of dependencies, downloading the model, and the setup of the program itself.

1fc0b76d

21 Aug, 2024 2 commits
- Create requirements-docker.txt · a7c0898d
  Xiaomeng Zhao authored Aug 21, 2024
  
  a7c0898d
- Create download_models.py · 7ce807f4
  Xiaomeng Zhao authored Aug 21, 2024
  
  7ce807f4
20 Aug, 2024 8 commits

@Matthijz98 has signed the CLA in opendatalab/MinerU#467 · c0336b75
github-actions[bot] authored Aug 20, 2024

c0336b75

fix(ocr_mkcontent): revise table caption output (#397) · dd19f59e

Xiaomeng Zhao authored Aug 20, 2024

* fix(ocr_mkcontent): revise table caption output

- Ensure that table captions are properly included in the output.
- Remove the redundant `table_caption` variable.

* Update cla.yml

* Update bug_report.yml

* feat(cli): add debug option for detailed error handling

Enable users to invoke the CLI command with a new debug flag to get detailed debugging information.

* fix(pdf-extract-kit): adjust crop_paste parameters for better accuracy

The crop_paste_x and crop_paste_y values in pdf_extract_kit.py have been modified
to improve the accuracy and consistency of OCR processing. The new values are set to 25
to ensure more precise image cropping and pasting, which leads to better OCR recognition
results.

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* fix(pdf-extract-kit): increase crop_paste margin for OCR processing

Double the crop_paste margin from 25 to 50 to ensure better OCR accuracy and
handling of border cases. This change will help improve the overall quality of
OCR'ed text by providing more context around the detected text areas.

* fix(common): deep copy model list before drawing model bbox

Use a deep copy of the original model list in `drow_model_bbox` to avoid potential
modifications to the source data. This ensures the integrity of the original models
is maintained while generating the model bounding boxes visualization.

---------
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

dd19f59e
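
The deep-copy fix described in dd19f59e can be sketched as follows. This is a minimal illustration, assuming the model list is a list of per-page dicts holding detections with a `bbox` field; the function names and the data layout are hypothetical and do not reproduce the project's `drow_model_bbox`.

```python
import copy


def scale_bboxes_in_place(pages, scale):
    """Rewrite every detection's bbox; this mutates the dicts it is given."""
    for page in pages:
        for det in page["layout_dets"]:
            det["bbox"] = [v * scale for v in det["bbox"]]


def bboxes_for_drawing(model_list, scale):
    """Prepare bboxes for visualization without touching the caller's data."""
    # deepcopy duplicates the nested dicts and lists as well; a shallow copy
    # (e.g. list(model_list)) would still share the inner detection dicts.
    working = copy.deepcopy(model_list)
    scale_bboxes_in_place(working, scale)
    return working


# Hypothetical model output for a single page.
model_list = [{"layout_dets": [{"bbox": [10, 20, 30, 40]}]}]
drawn = bboxes_for_drawing(model_list, scale=2)
```

After the call, `drawn` holds the scaled boxes while `model_list` is unchanged, so later pipeline stages still see the original detections.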

<fix>(para_split_v2): index out of range issue of span_text first char (#396) · 65c3ac66
Kaiwen Liu authored Aug 20, 2024
```
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
```
65c3ac66

feat: rename the file generated by command line tools (#401) · c9a51491

icecraft authored Aug 20, 2024

* feat: rename the file generated by command line tools

* feat: add pdf filename as prefix to {span,layout,model}.pdf

---------
Co-authored-by: icecraft <tmortred@gmail.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

c9a51491
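
A rough sketch of the renaming in c9a51491, assuming the prefix is the PDF's file stem joined to the original name with an underscore. The separator, the function name, and the output directory are assumptions made for illustration; the commit message only states that the PDF filename becomes a prefix of `{span,layout,model}.pdf`.

```python
from pathlib import Path


def debug_output_paths(pdf_path, out_dir):
    """Build per-document names for the span/layout/model debug PDFs."""
    stem = Path(pdf_path).stem  # "docs/report.pdf" -> "report"
    return {
        kind: Path(out_dir) / f"{stem}_{kind}.pdf"  # underscore is an assumed separator
        for kind in ("span", "layout", "model")
    }


paths = debug_output_paths("docs/report.pdf", "out")
```

Prefixing with the stem keeps the debug files of several PDFs from overwriting one another when they share an output directory.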

fix(pdf-extract): adjust box threshold for OCR detection (#447) · 041b9465

Xiaomeng Zhao authored Aug 20, 2024

Tuned the detection box threshold parameter in the OCR model initialization to improve the
accuracy of text extraction from images. The threshold was modified from 0.6 to
0.3 to filter out smaller detection boxes, which is expected to enhance the quality of the extracted
text by reducing noise and false positives in the OCR process.

041b9465

fix(self_modify): merge detection boxes for optimized text region detection (#448) · 3da5c411

Xiaomeng Zhao authored Aug 20, 2024

Merge adjacent and overlapping detection boxes to optimize text region detection in
the document. Post-processing of text boxes is enhanced by consolidating them into
larger text lines, taking into account their vertical and horizontal alignment. This
improvement reduces fragmentation and improves the readability of detected text blocks.

3da5c411
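
The merging described in 3da5c411 could look roughly like the sketch below. It assumes axis-aligned `(x0, y0, x1, y1)` boxes and a greedy single pass; the function names, the `max_gap` and `min_v_overlap` parameters, and their default values are hypothetical and are not taken from the project.

```python
def vertical_overlap(a, b):
    """Overlap of the two y-ranges as a fraction of the shorter box's height."""
    inter = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return max(inter, 0) / shorter if shorter > 0 else 0.0


def merge_into_lines(boxes, max_gap=10, min_v_overlap=0.8):
    """Greedily merge boxes that sit on the same line and are horizontally close."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        for i, line in enumerate(lines):
            # Horizontal distance between the boxes; negative when they overlap.
            gap = max(box[0], line[0]) - min(box[2], line[2])
            if vertical_overlap(line, box) >= min_v_overlap and gap <= max_gap:
                # Replace the line with the union of the two boxes.
                lines[i] = (min(line[0], box[0]), min(line[1], box[1]),
                            max(line[2], box[2]), max(line[3], box[3]))
                break
        else:
            lines.append(tuple(box))
    return lines


# Hypothetical detections: two fragments of one line, a distant box on the
# same row, and a box on the next row.
boxes = [(0, 10, 50, 30), (55, 11, 100, 31), (0, 50, 40, 70), (300, 10, 350, 30)]
merged = merge_into_lines(boxes)
```

This sketch does not re-check lines against each other after they grow, so a production version would likely iterate until no further merges occur.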

fix(ocr_mkcontent): improve language detection and content formatting (#458) · 66e3ce9c

Xiaomeng Zhao authored Aug 20, 2024

Optimize the language detection logic to enhance content formatting. This
change addresses issues with long word segmentation. Language detection now uses a
threshold to determine the language of a text based on the proportion of English characters.
Formatting rules for content have been updated to consider a list of languages (initially
including Chinese, Japanese, and Korean) where no space is added between content segments
for inline equations and text spans, improving the handling of Asian languages.

The impact of these changes includes improved accuracy in language detection, better
segmentation of long words, and more appropriate spacing in content formatting for multiple
languages.

66e3ce9c
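
One way to picture the behavior described in 66e3ce9c is the sketch below. It assumes language is judged on the whole paragraph from the share of ASCII letters, with a simple Unicode-range fallback for Chinese, Japanese, and Korean. The 0.5 threshold, the function names, and the range checks are assumptions for illustration, not the project's detection logic.

```python
NO_SPACE_LANGS = {"zh", "ja", "ko"}  # languages joined without spaces


def detect_lang(text, en_threshold=0.5):
    """Return 'en' when ASCII letters reach `en_threshold` of non-space chars."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return "unknown"
    en = sum(c.isascii() and c.isalpha() for c in chars)
    if en / len(chars) >= en_threshold:
        return "en"
    if any("\u3040" <= c <= "\u30ff" for c in chars):  # hiragana / katakana
        return "ja"
    if any("\uac00" <= c <= "\ud7a3" for c in chars):  # hangul syllables
        return "ko"
    if any("\u4e00" <= c <= "\u9fff" for c in chars):  # CJK unified ideographs
        return "zh"
    return "unknown"


def join_segments(segments):
    """Join text spans and inline equations, omitting spaces for CJK text."""
    lang = detect_lang("".join(segments))
    sep = "" if lang in NO_SPACE_LANGS else " "
    return sep.join(segments)


zh_line = join_segments(["质能方程", "$E=mc^2$", "很有名"])
en_line = join_segments(["The identity", "$E=mc^2$", "is famous"])
```

With these assumptions the Chinese sentence is joined without inserted spaces around the inline equation, while the English sentence keeps single spaces between segments.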

feat: add tablemaster_paddle (#463) · 0b764d59

Kaiwen Liu authored Aug 20, 2024

* Update README_zh-CN.md (#404) (#409)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

0b764d59

16 Aug, 2024 4 commits
- Update cla.yml · e5e57569
  Xiaomeng Zhao authored Aug 16, 2024
  
  e5e57569
- Update cla.yml · f4316f02
  Xiaomeng Zhao authored Aug 16, 2024
  
  f4316f02
- Update cla.yml · 78d4104b
  Xiaomeng Zhao authored Aug 16, 2024
  
  78d4104b
- add dockerfile (#189) · c6e9578b
  Aoyang Fang authored Aug 16, 2024
```
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
```
  c6e9578b
15 Aug, 2024 2 commits
- Merge pull request #398 from opendatalab/myhloli-patch-1 · 72666130
  Xiaomeng Zhao authored Aug 15, 2024
```
Update cla.yml
```
  72666130
- Merge pull request #400 from opendatalab/myhloli-patch-2 · 2fb964f8
  Xiaomeng Zhao authored Aug 15, 2024
```
Update bug_report.yml
```
  2fb964f8
13 Aug, 2024 8 commits
- @Thepathakarpit has signed the CLA in opendatalab/MinerU#418 · 295df329
  github-actions[bot] authored Aug 13, 2024
  
  295df329
- Update README_zh-CN.md · f794dfa7
  Xiaomeng Zhao authored Aug 13, 2024
  
  f794dfa7
- Update README_Windows_CUDA_Acceleration_zh_CN.md · 8601d233
  Xiaomeng Zhao authored Aug 13, 2024
  
  8601d233
- Update FAQ_en_us.md · 4983bc1d
  Xiaomeng Zhao authored Aug 13, 2024
  
  4983bc1d
- Update FAQ_zh_cn.md · 2f01dbab
  Xiaomeng Zhao authored Aug 13, 2024
```
add new issue
```
  2f01dbab
- Update README_zh-CN.md (#404) (#409) (#410) · 19a5dbf1
  drunkpig authored Aug 13, 2024
```
correct FAQ url
Co-authored-by: sfk <18810651050@163.com>
```
  19a5dbf1
- Update README_zh-CN.md (#404) (#409) · 45158e1b
  drunkpig authored Aug 13, 2024
```
correct FAQ url
Co-authored-by: sfk <18810651050@163.com>
```
  45158e1b
- Update README_zh-CN.md (#404) · bc038aa4
  sfk authored Aug 13, 2024
```
correct FAQ url
```
  bc038aa4