Commits · 3d2fb83665daa51cf894a417517005d8cc183eb7 · Qin Kaijie / pdf-miner

28 Aug, 2024 2 commits
- feat: add test case (#499) · 3d2fb836
  yyy authored Aug 28, 2024
```
Co-authored-by: quyuan <quyuan@pjlab.org>
```
  3d2fb836
- fix: remove the default value of output option in tools/cli.py and tools/cli_dev.py (#494) · f0a8886c
  icecraft authored Aug 28, 2024
```
Co-authored-by: icecraft <xurui1@pjlab.org.cn>
```
  f0a8886c
26 Aug, 2024 2 commits

upload an introduction about chemical formula and update readme.md (#489) · bab19e78

Siyu Hao authored Aug 26, 2024

* upload an introduction about chemical formula

* rename 2 files

* update readme.md at TODO in chemstery

* rename 2 files and update readme.md at TODO in chemstery

* update README_zh-CN.md at TODO in chemstery

bab19e78

upload an introduction about chemical formula and update readme.md (#489) · 1754c040

Siyu Hao authored Aug 26, 2024

* upload an introduction about chemical formula

* rename 2 files

* update readme.md at TODO in chemstery

* rename 2 files and update readme.md at TODO in chemstery

* update README_zh-CN.md at TODO in chemstery

1754c040

22 Aug, 2024 1 commit

build(docker): update docker build step (#471) · 1fc0b76d

Xiaomeng Zhao authored Aug 22, 2024

* build(docker): update base image to Ubuntu 22.04 and install PaddlePaddle

Upgrade the Docker base image from ubuntu:latest to ubuntu:22.04 for improved
performance and stability.

Additionally, integrate PaddlePaddle GPU version 3.0.0b1 into the Docker build for
enhanced AI capabilities. The MinIO configuration file has also been updated to the
latest version.

* build(dockerfile): Updated the Dockerfile

* build(Dockerfile): update Dockerfile

* docs(docker): add instructions for quick deployment with Docker

Include Docker-based deployment instructions in the README for both English and
Chinese locales. This update provides users a quick-start guide to using Docker for
deployment, with notes on GPU VRAM requirements and default acceleration features.

* build(docker): Layer the installation of dependencies, downloading the model, and the setup of the program itself.

* build(docker): Layer the installation of dependencies, downloading the model, and the setup of the program itself.

1fc0b76d

20 Aug, 2024 5 commits

fix(ocr_mkcontent): revise table caption output (#397) · dd19f59e

Xiaomeng Zhao authored Aug 20, 2024

* fix(ocr_mkcontent): revise table caption output

- Ensure that table captions are properly included in the output.
- Remove the redundant `table_caption` variable.

* Update cla.yml

* Update bug_report.yml

* feat(cli): add debug option for detailed error handling

Enable users to invoke the CLI command with a new debug flag to get detailed debugging information.

* fix(pdf-extract-kit): adjust crop_paste parameters for better accuracy

The crop_paste_x and crop_paste_y values in pdf_extract_kit.py have been modified
to improve the accuracy and consistency of OCR processing. The new values are set to 25
to ensure more precise image cropping and pasting, which leads to better OCR recognition
results.

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* fix(pdf-extract-kit): increase crop_paste margin for OCR processing

Double the crop_paste margin from 25 to 50 to ensure better OCR accuracy and
handling of border cases. This change will help in improving the overall quality of
OCR'ed text by providing more context around the detected text areas.

* fix(common): deep copy model list before drawing model bbox

Use a deep copy of the original model list in `drow_model_bbox` to avoid potential
modifications to the source data. This ensures the integrity of the original models
is maintained while generating the model bounding boxes visualization.

---------
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

dd19f59e
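
The deep-copy fix described in dd19f59e can be illustrated with a minimal sketch. The helper name `scale_bboxes_for_drawing` and the `layout_dets` / `bbox` field names are assumptions made for illustration; they are not taken from the repository, and the real `drow_model_bbox` may be structured differently.

```python
import copy


def scale_bboxes_for_drawing(model_list, scale):
    """Hypothetical helper: return scaled bboxes without mutating the input.

    The nested structure (pages -> "layout_dets" -> "bbox") is assumed
    for this example only.
    """
    # Deep copy so nested dicts and lists are duplicated, not shared.
    working = copy.deepcopy(model_list)
    for page in working:
        for det in page["layout_dets"]:
            det["bbox"] = [v * scale for v in det["bbox"]]
    return working


original = [{"layout_dets": [{"bbox": [1, 2, 3, 4]}]}]
scaled = scale_bboxes_for_drawing(original, 2)
```

A shallow copy (`list(model_list)`) would still share the inner dictionaries, so the in-place `bbox` rewrite would leak back into the source data; `copy.deepcopy` avoids that.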

feat: rename the file generated by command line tools (#401) · c9a51491

icecraft authored Aug 20, 2024

* feat: rename the file generated by command line tools

* feat: add pdf filename as prefix to {span,layout,model}.pdf

---------
Co-authored-by: icecraft <tmortred@gmail.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

c9a51491

fix(pdf-extract): adjust box threshold for OCR detection (#447) · 041b9465

Xiaomeng Zhao authored Aug 20, 2024

Tuned the detection box threshold parameter in the OCR model initialization to improve the
accuracy of text extraction from images. The threshold was modified from 0.6 to
0.3 to filter out smaller detection boxes, which is expected to enhance the quality of the extracted
text by reducing noise and false positives in the OCR process.

041b9465

fix(self_modify): merge detection boxes for optimized text region detection (#448) · 3da5c411

Xiaomeng Zhao authored Aug 20, 2024

Merge adjacent and overlapping detection boxes to optimize text region detection in
the document. Post processing of text boxes is enhanced by consolidating them into
larger text lines, taking into account their vertical and horizontal alignment. This
improvement reduces fragmentation and improves the readability of detected text blocks.

3da5c411
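
As a rough illustration of the box merging described in 3da5c411, the sketch below greedily merges boxes that sit on the same text line. It is not the project's implementation: the function names, the `(x0, y0, x1, y1)` box layout, and the `max_gap` and `min_overlap` parameters are all assumptions chosen for the example.

```python
def vertical_overlap_ratio(a, b):
    """Vertical overlap of two boxes relative to the shorter box's height."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    min_height = min(a[3] - a[1], b[3] - b[1])
    return max(overlap, 0) / min_height if min_height > 0 else 0.0


def merge_boxes_into_lines(boxes, max_gap=10, min_overlap=0.5):
    """Greedy single pass: merge each box into the first compatible line.

    Boxes are (x0, y0, x1, y1). max_gap and min_overlap are illustrative
    thresholds, not values from the repository.
    """
    merged = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        for i, line in enumerate(merged):
            # Horizontal gap between the two x-intervals (negative if they overlap).
            gap = max(box[0], line[0]) - min(box[2], line[2])
            if vertical_overlap_ratio(box, line) >= min_overlap and gap <= max_gap:
                merged[i] = (
                    min(line[0], box[0]),
                    min(line[1], box[1]),
                    max(line[2], box[2]),
                    max(line[3], box[3]),
                )
                break
        else:
            # No compatible line found: start a new one.
            merged.append(tuple(box))
    return merged


lines = merge_boxes_into_lines([(0, 0, 10, 10), (12, 1, 30, 11), (0, 50, 20, 60)])
```

Because this is a single greedy pass, two lines that only become mergeable after each has grown are left separate; repeating the pass until nothing changes would handle that case.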

fix(ocr_mkcontent): improve language detection and content formatting (#458) · 66e3ce9c

Xiaomeng Zhao authored Aug 20, 2024

Optimize the language detection logic to enhance content formatting. This
change addresses issues with long word segmentation. Language detection now uses a
threshold to determine the language of a text based on the proportion of English characters.
Formatting rules for content have been updated to consider a list of languages (initially
including Chinese, Japanese, and Korean) where no space is added between content segments
for inline equations and text spans, improving the handling of Asian languages.

The impact of these changes includes improved accuracy in language detection, better
segmentation of long words, and more appropriate spacing in content formatting for multiple
languages.

66e3ce9c
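
The two ideas in 66e3ce9c (a threshold on the proportion of English characters, and a list of languages joined without spaces) could look roughly like the sketch below. The function names, the 0.5 threshold, and the `zh` / `ja` / `ko` language codes are assumptions for illustration; the commit message does not give the actual values or identifiers.

```python
import string

# Assumed language codes for Chinese, Japanese, and Korean.
NO_SPACE_LANGS = {"zh", "ja", "ko"}


def is_mostly_english(text, threshold=0.5):
    """True if ASCII letters make up at least `threshold` of non-space characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    english = sum(1 for c in chars if c in string.ascii_letters)
    return english / len(chars) >= threshold


def join_segments(segments, lang):
    """Join text spans and inline equations, omitting spaces for listed languages."""
    separator = "" if lang in NO_SPACE_LANGS else " "
    return separator.join(segments)


english_joined = join_segments(["the", "formula"], "en")
chinese_joined = join_segments(["公式", "E=mc^2"], "zh")
```

Keeping the no-space languages in a set makes it straightforward to extend the list later, which matches the commit's description of the list as an initial one.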

16 Aug, 2024 3 commits
- Update cla.yml · f4316f02
  Xiaomeng Zhao authored Aug 16, 2024
  
  f4316f02
- Update cla.yml · 78d4104b
  Xiaomeng Zhao authored Aug 16, 2024
  
  78d4104b
- add dockerfile (#189) · c6e9578b
  Aoyang Fang authored Aug 16, 2024
```
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
```
  c6e9578b
13 Aug, 2024 1 commit
- Update README_zh-CN.md (#404) (#409) · 45158e1b
  drunkpig authored Aug 13, 2024
```
correct FAQ url
Co-authored-by: sfk <18810651050@163.com>
```
  45158e1b
12 Aug, 2024 2 commits
- docs: add PR template · 7cdf88c6
  xuchao authored Aug 12, 2024
  
  7cdf88c6
- docs: add pre-commit-config · 82e50654
  xuchao authored Aug 12, 2024
  
  82e50654
10 Aug, 2024 1 commit

docs(faq): add solution for libGL.so.1 missing on WSL2 Ubuntu22.04 · 0405461d

myhloli authored Aug 10, 2024

Add FAQ entries in both English and Chinese to address the issue where the
libGL.so.1 library is missing on Ubuntu22.04 when running under WSL2. The
FAQ now includes instructions on how to install the missing library, resolving
the corresponding ImportError.

Closes https://github.com/opendatalab/MinerU/issues/388

0405461d

09 Aug, 2024 23 commits
- Update README_Windows_CUDA_Acceleration_en_US.md · 24503530
  sfk authored Aug 09, 2024
  
  24503530
- Update README_Windows_CUDA_Acceleration_zh_CN.md · ece8dac4
  sfk authored Aug 09, 2024
  
  ece8dac4
- Update README_Ubuntu_CUDA_Acceleration_en_US.md · 409ece82
  sfk authored Aug 09, 2024
  
  409ece82
- Update README_Ubuntu_CUDA_Acceleration_zh_CN.md · 18f82ab7
  sfk authored Aug 09, 2024
  
  18f82ab7
- Update version.py with new version · b44a7df9
  myhloli authored Aug 09, 2024
  
  b44a7df9
- Merge pull request #386 from myhloli/master · fa3475a4
  Xiaomeng Zhao authored Aug 09, 2024
```
feat(draw_bbox): add model bbox drawing functionality
```
  fa3475a4
- feat(draw_bbox): add model bbox drawing functionality · c90ee891
  myhloli authored Aug 09, 2024
```
Implement the feature to draw bounding boxes for model elements in the PDF. This includes
adding new drawing functions and modifying existing ones to accommodate the new feature.
Also, updates are made to CLI tools and common utilities to support the model bbox drawing.
```
  c90ee891
- docs: update to 0.7.0b1 · e7b0f8be
  xuchao authored Aug 09, 2024
  
  e7b0f8be
- Create FAQ_en_us.md · 85e36358
  sfk authored Aug 09, 2024
  
  85e36358
- Create output_file_en_us.md · cf704253
  sfk authored Aug 09, 2024
  
  cf704253
- Update README_zh-CN_v2.md · 54baabd8
  sfk authored Aug 09, 2024
```
edit FAQ
```
  54baabd8
- Update README_v2.md · 8cc8ab17
  sfk authored Aug 09, 2024
```
update doc url
```
  8cc8ab17
- Update README_zh-CN_v2.md · ba25b1db
  sfk authored Aug 09, 2024
```
update discord url
```
  ba25b1db
- Update README_zh-CN_v2.md · 004beb5c
  sfk authored Aug 09, 2024
```
update content
```
  004beb5c
- Update README_zh-CN_v2.md · c1ad30e7
  sfk authored Aug 09, 2024
```
update content
```
  c1ad30e7
- Update README_v2.md · 5a0cce0c
  sfk authored Aug 09, 2024
  
  5a0cce0c
- Update FAQ_zh_cn.md · b03b5cdd
  Xiaomeng Zhao authored Aug 09, 2024
  
  b03b5cdd
- Update README_v2.md · d9e72e92
  sfk authored Aug 09, 2024
  
  d9e72e92
- Update README_v2.md · 755e8a9b
  sfk authored Aug 09, 2024
  
  755e8a9b
- Update README_v2.md · b413a89d
  sfk authored Aug 09, 2024
  
  b413a89d
- Update README_zh-CN_v2.md · 58e429b6
  sfk authored Aug 09, 2024
  
  58e429b6
- Create README_v2.md · a9063f8c
  sfk authored Aug 09, 2024
  
  a9063f8c
- Update README_zh-CN_v2.md · f8261f35
  sfk authored Aug 09, 2024
  
  f8261f35