Commits · dd19f59eb6e53a2fde77b29b9a4a7be9c91a93f1 · Qin Kaijie / pdf-miner

20 Aug, 2024 5 commits

fix(ocr_mkcontent): revise table caption output (#397) · dd19f59e

Xiaomeng Zhao authored Aug 20, 2024

* fix(ocr_mkcontent): revise table caption output

- Ensure that table captions are properly included in the output.
- Remove the redundant `table_caption` variable.

* Update cla.yml

* Update bug_report.yml

* feat(cli): add debug option for detailed error handling

Enable users to invoke the CLI command with a new debug flag to get detailed debugging information.

* fix(pdf-extract-kit): adjust crop_paste parameters for better accuracy

The crop_paste_x and crop_paste_y values in pdf_extract_kit.py have been modified
to improve the accuracy and consistency of OCR processing. The new values are set to 25
to ensure more precise image cropping and pasting, which leads to better OCR recognition
results.

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url
Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* fix(pdf-extract-kit): increase crop_paste margin for OCR processing

Double the crop_paste margin from 25 to 50 to ensure better OCR accuracy and
handling of border cases. This change will help improve the overall quality of
OCR'ed text by providing more context around the detected text areas.

* fix(common): deep copy model list before drawing model bbox

Use a deep copy of the original model list in `drow_model_bbox` to avoid potential
modifications to the source data. This ensures the integrity of the original models
is maintained while generating the model bounding boxes visualization.

---------
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

dd19f59e
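The deep-copy fix for `drow_model_bbox` described above can be illustrated with a minimal sketch. The function name `draw_model_bbox_sketch`, the `layout_dets` and `bbox` keys, and the rounding step are all assumptions made for illustration, not the project's actual code; the point is only that mutating a deep copy leaves the caller's list intact.

```python
import copy


def draw_model_bbox_sketch(model_list):
    """Collect drawable boxes without modifying the caller's data."""
    # deepcopy duplicates nested dicts and lists, so in-place edits below
    # cannot leak back into the original model_list.
    working = copy.deepcopy(model_list)
    boxes = []
    for page in working:
        for det in page["layout_dets"]:  # hypothetical key name
            # Hypothetical in-place normalisation done before drawing.
            det["bbox"] = [round(v) for v in det["bbox"]]
            boxes.append(det["bbox"])
    return boxes


original = [{"layout_dets": [{"bbox": [10.4, 20.6, 30.2, 40.9]}]}]
boxes = draw_model_bbox_sketch(original)
```

A shallow copy (`list(model_list)`) would not be enough here, because the nested dictionaries would still be shared with the source data.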

feat: rename the file generated by command line tools (#401) · c9a51491

icecraft authored Aug 20, 2024

* feat: rename the file generated by command line tools

* feat: add pdf filename as prefix to {span,layout,model}.pdf

---------
Co-authored-by: icecraft <tmortred@gmail.com>
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

c9a51491
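As a rough illustration of the renaming in c9a51491, the sketch below builds the three debug file names with the PDF file name as a prefix. The helper name `debug_output_paths` and the underscore separator are assumptions; the commit message only states that the PDF filename is prepended to {span,layout,model}.pdf.

```python
from pathlib import Path


def debug_output_paths(pdf_path, out_dir):
    """Map each debug artifact kind to a path prefixed with the PDF's stem."""
    stem = Path(pdf_path).stem  # "docs/report.pdf" -> "report"
    # The "_" separator is an assumption; the commit does not specify it.
    return {
        kind: Path(out_dir) / f"{stem}_{kind}.pdf"
        for kind in ("span", "layout", "model")
    }


paths = debug_output_paths("docs/report.pdf", "out")
```

Prefixing keeps the outputs of several PDFs from overwriting each other when they are written to the same directory.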

fix(pdf-extract): adjust box threshold for OCR detection (#447) · 041b9465

Xiaomeng Zhao authored Aug 20, 2024

Tuned the detection box threshold parameter in the OCR model initialization to improve the
accuracy of text extraction from images. The threshold was modified from 0.6 to
0.3 to filter out smaller detection boxes, which is expected to enhance the quality of the extracted
text by reducing noise and false positives in the OCR process.

041b9465

fix(self_modify): merge detection boxes for optimized text region detection (#448) · 3da5c411

Xiaomeng Zhao authored Aug 20, 2024

Merge adjacent and overlapping detection boxes to optimize text region detection in
the document. Post-processing of text boxes is enhanced by consolidating them into
larger text lines, taking into account their vertical and horizontal alignment. This
improvement reduces fragmentation and improves the readability of detected text blocks.

3da5c411
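The merging idea in 3da5c411 can be sketched as follows. This is not the project's implementation: the `(x0, y0, x1, y1)` box format, the function names, and the `max_gap` and `min_overlap` defaults are assumptions chosen for illustration. Boxes are merged into a line when they overlap vertically by a sufficient share and sit close together horizontally.

```python
def vertical_overlap_ratio(a, b):
    """Share of the shorter box's height that the two boxes have in common."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    min_height = min(a[3] - a[1], b[3] - b[1])
    return overlap / min_height if min_height > 0 else 0.0


def horizontal_gap(a, b):
    """Horizontal distance between two boxes (negative if they overlap)."""
    return max(a[0], b[0]) - min(a[2], b[2])


def merge_boxes_into_lines(boxes, max_gap=10, min_overlap=0.5):
    """Greedy single pass: grow an existing line or start a new one."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        for i, line in enumerate(lines):
            same_row = vertical_overlap_ratio(line, box) >= min_overlap
            close = horizontal_gap(line, box) <= max_gap
            if same_row and close:
                # Replace the line with the union of both rectangles.
                lines[i] = (
                    min(line[0], box[0]), min(line[1], box[1]),
                    max(line[2], box[2]), max(line[3], box[3]),
                )
                break
        else:
            lines.append(tuple(box))
    return lines


merged = merge_boxes_into_lines([(0, 0, 50, 10), (55, 1, 100, 11), (0, 30, 40, 40)])
```

A single greedy pass is the simplest variant; it can miss merges that only become possible after another merge, so a production version might repeat the pass until nothing changes.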

fix(ocr_mkcontent): improve language detection and content formatting (#458) · 66e3ce9c

Xiaomeng Zhao authored Aug 20, 2024

Optimize the language detection logic to enhance content formatting. This
change addresses issues with long word segmentation. Language detection now uses a
threshold to determine the language of a text based on the proportion of English characters.
Formatting rules for content have been updated to consider a list of languages (initially
including Chinese, Japanese, and Korean) where no space is added between content segments
for inline equations and text spans, improving the handling of Asian languages.

The impact of these changes includes improved accuracy in language detection, better
segmentation of long words, and more appropriate spacing in content formatting for multiple
languages.

66e3ce9c
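A minimal sketch of the two rules described in 66e3ce9c: a threshold on the proportion of English characters, and a list of languages that get no space between segments. The function names, the 0.5 threshold, and the Unicode-range fallback for telling Chinese, Japanese, and Korean apart are assumptions for illustration; the commit message does not say how the actual detection is implemented.

```python
NO_SPACE_LANGS = {"zh", "ja", "ko"}  # languages joined without spaces


def detect_lang_sketch(text, en_threshold=0.5):
    """Return 'en' when enough non-space characters are ASCII letters."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return "unknown"
    english = sum(ch.isascii() and ch.isalpha() for ch in chars)
    if english / len(chars) >= en_threshold:
        return "en"
    # Crude script-based fallback (an assumption, not the project's logic).
    if any("\uac00" <= ch <= "\ud7a3" for ch in chars):  # Hangul syllables
        return "ko"
    if any("\u3040" <= ch <= "\u30ff" for ch in chars):  # Hiragana/Katakana
        return "ja"
    if any("\u4e00" <= ch <= "\u9fff" for ch in chars):  # CJK ideographs
        return "zh"
    return "unknown"


def join_segments(segments):
    """Join text spans and inline equations with language-aware spacing."""
    lang = detect_lang_sketch("".join(segments))
    sep = "" if lang in NO_SPACE_LANGS else " "
    return sep.join(segments)


english = join_segments(["The energy is", "$E=mc^2$", "as shown"])
chinese = join_segments(["能量为", "$E=mc^2$", "如图所示"])
```

Using a proportion rather than a single-character check keeps a mostly Chinese sentence that contains a few Latin symbols from being treated as English.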

16 Aug, 2024 3 commits
- Update cla.yml · f4316f02
  Xiaomeng Zhao authored Aug 16, 2024
  
  f4316f02
- Update cla.yml · 78d4104b
  Xiaomeng Zhao authored Aug 16, 2024
  
  78d4104b
- add dockerfile (#189) · c6e9578b
  Aoyang Fang authored Aug 16, 2024
```
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
```
  c6e9578b
13 Aug, 2024 1 commit
- Update README_zh-CN.md (#404) (#409) · 45158e1b
  drunkpig authored Aug 13, 2024
```
correct FAQ url
Co-authored-by: sfk <18810651050@163.com>
```
  45158e1b
12 Aug, 2024 2 commits
- docs: add PR template · 7cdf88c6
  xuchao authored Aug 12, 2024
  
  7cdf88c6
- docs: add pre-commit-config · 82e50654
  xuchao authored Aug 12, 2024
  
  82e50654
10 Aug, 2024 1 commit

docs(faq): add solution for libGL.so.1 missing on WSL2 Ubuntu22.04 · 0405461d

myhloli authored Aug 10, 2024

Add FAQ entries in both English and Chinese to address the issue where the
libGL.so.1 library is missing on Ubuntu22.04 when running under WSL2. The
FAQ now includes instructions on how to install the missing library, resolving
the corresponding ImportError.

Closes https://github.com/opendatalab/MinerU/issues/388

0405461d

09 Aug, 2024 28 commits
- Update README_Windows_CUDA_Acceleration_en_US.md · 24503530
  sfk authored Aug 09, 2024
  
  24503530
- Update README_Windows_CUDA_Acceleration_zh_CN.md · ece8dac4
  sfk authored Aug 09, 2024
  
  ece8dac4
- Update README_Ubuntu_CUDA_Acceleration_en_US.md · 409ece82
  sfk authored Aug 09, 2024
  
  409ece82
- Update README_Ubuntu_CUDA_Acceleration_zh_CN.md · 18f82ab7
  sfk authored Aug 09, 2024
  
  18f82ab7
- Update version.py with new version · b44a7df9
  myhloli authored Aug 09, 2024
  
  b44a7df9
- Merge pull request #386 from myhloli/master · fa3475a4
  Xiaomeng Zhao authored Aug 09, 2024
```
feat(draw_bbox): add model bbox drawing functionality
```
  fa3475a4
- feat(draw_bbox): add model bbox drawing functionality · c90ee891
  myhloli authored Aug 09, 2024
```
Implement the feature to draw bounding boxes for model elements in the PDF. This includes
adding new drawing functions and modifying existing ones to accommodate the new feature.
Also, updates are made to CLI tools and common utilities to support the model bbox drawing.
```
  c90ee891
- docs: update to 0.7.0b1 · e7b0f8be
  xuchao authored Aug 09, 2024
  
  e7b0f8be
- Create FAQ_en_us.md · 85e36358
  sfk authored Aug 09, 2024
  
  85e36358
- Create output_file_en_us.md · cf704253
  sfk authored Aug 09, 2024
  
  cf704253
- Update README_zh-CN_v2.md · 54baabd8
  sfk authored Aug 09, 2024
```
edit FAQ
```
  54baabd8
- Update README_v2.md · 8cc8ab17
  sfk authored Aug 09, 2024
```
update doc url
```
  8cc8ab17
- Update README_zh-CN_v2.md · ba25b1db
  sfk authored Aug 09, 2024
```
update discord url
```
  ba25b1db
- Update README_zh-CN_v2.md · 004beb5c
  sfk authored Aug 09, 2024
```
update content
```
  004beb5c
- Update README_zh-CN_v2.md · c1ad30e7
  sfk authored Aug 09, 2024
```
update content
```
  c1ad30e7
- Update README_v2.md · 5a0cce0c
  sfk authored Aug 09, 2024
  
  5a0cce0c
- Update FAQ_zh_cn.md · b03b5cdd
  Xiaomeng Zhao authored Aug 09, 2024
  
  b03b5cdd
- Update README_v2.md · d9e72e92
  sfk authored Aug 09, 2024
  
  d9e72e92
- Update README_v2.md · 755e8a9b
  sfk authored Aug 09, 2024
  
  755e8a9b
- Update README_v2.md · b413a89d
  sfk authored Aug 09, 2024
  
  b413a89d
- Update README_zh-CN_v2.md · 58e429b6
  sfk authored Aug 09, 2024
  
  58e429b6
- Create README_v2.md · a9063f8c
  sfk authored Aug 09, 2024
  
  a9063f8c
- Update README_zh-CN_v2.md · f8261f35
  sfk authored Aug 09, 2024
  
  f8261f35
- Update README_zh-CN_v2.md · 90f4e364
  sfk authored Aug 09, 2024
  
  90f4e364
- Update README_zh-CN_v2.md · fc6a7c30
  sfk authored Aug 09, 2024
  
  fc6a7c30
- Merge pull request #379 from myhloli/master · 4ec8466e
  Xiaomeng Zhao authored Aug 09, 2024
```
fix(doc-analyze): adjust image scaling limit to 9000 pixels
```
  4ec8466e
- fix(doc-analyze): adjust image scaling limit to 9000 pixels · 445a397f
  myhloli authored Aug 09, 2024
```
Previously, images were not enlarged if their width or height exceeded 3000 pixels.
This threshold has been increased to 9000 pixels to better handle high-resolution
scans and improve the analysis of documents with larger dimensions.
```
  445a397f
- docs: how to use table recognition · f3ad9be3
  xuchao authored Aug 09, 2024
  
  f3ad9be3
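The image scaling limit changed in 445a397f (listed above) can be sketched as a simple size check. The function name `enlarged_size`, the scale factor of 2, and the return format are assumptions for illustration; only the 3000 and 9000 pixel limits come from the commit message.

```python
def enlarged_size(width, height, scale=2, limit=9000):
    """Return the scaled size, or the original size if either side exceeds limit."""
    if width > limit or height > limit:
        # Already very large: skip enlargement to avoid oversized images.
        return width, height
    return width * scale, height * scale


new_limit = enlarged_size(4000, 5000)              # under 9000: enlarged
old_limit = enlarged_size(4000, 5000, limit=3000)  # over 3000: left as-is
too_big = enlarged_size(9500, 1000)                # over 9000: left as-is
```

With the earlier limit of 3000, the 4000 x 5000 page in this example would have been analyzed at its original size; with 9000 it is enlarged first.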