Commits · fc49f5c4465da08bdc4516786fcf45487e748a2d · Qin Kaijie / pdf-miner

18 Oct, 2024 1 commit

refactor(magic_pdf): remove unused parameters and simplify functions · fc49f5c4

myhloli authored Oct 18, 2024

- Remove unused parameters parse_type and lang from various functions
- Simplify function calls by removing unnecessary arguments
- Update related files to reflect these changes

fc49f5c4

17 Oct, 2024 2 commits

Merge pull request #753 from myhloli/dev · fe21eebd

Xiaomeng Zhao authored Oct 17, 2024

refactor(ocr):Increase the dilation factor in OCR to address the issue of word concatenation.

fe21eebd

refactor(ocr):Increase the dilation factor in OCR to address the issue of word concatenation. · 011a1b97

myhloli authored Oct 17, 2024

- Remove unused functions such as split_long_words, ocr_mk_mm_markdown_with_para, etc.
- Simplify ocr_mk_markdown_with_para_core_v2 by removing unnecessary language detection and word splitting logic
- Remove wordninja dependency from requirements
- Update ocr_model_init to include additional parameters for OCR model configuration

011a1b97

16 Oct, 2024 4 commits
- Merge pull request #747 from myhloli/dev · 2a409845
  Xiaomeng Zhao authored Oct 16, 2024
```
update example files
```
  2a409845
- update example files · cf377ce4
  myhloli authored Oct 16, 2024
  
  cf377ce4
- Merge remote-tracking branch 'origin/dev' into dev · ab3d2d17
  myhloli authored Oct 16, 2024
  
  ab3d2d17
- docs: enhance document parsing capabilities · 237c062d
  myhloli authored Oct 16, 2024
```
- Improve reading order with model-based sorting
- Add list recognition within text
- Implement table of contents recognition
- Support table recognition
- Enhance code block and geometric shape recognition
- Address known issues in both English and Chinese READMEs
```
  237c062d
15 Oct, 2024 6 commits

Merge pull request #744 from myhloli/para-split-v3 · f50bc87b
Xiaomeng Zhao authored Oct 15, 2024
```
fix(para_split_v3): refine list block detection in paragraph splitting
```
f50bc87b

refactor(para_split_v3): refine list block detection in paragraph splitting · 81b9fd7b

myhloli authored Oct 15, 2024

- Update list block detection logic to require at least 2 numeric start lines
- Ensure the number of numeric start lines matches the number of end lines
- Remove detection of non-border starting lines for simplicity

81b9fd7b
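
The detection rule described in this commit can be illustrated with a minimal sketch. It is not the code from para_split_v3: the function name `looks_like_list_block`, the `(text, right_x)` line representation, and the `tolerance` parameter are all hypothetical, and "end lines" is assumed here to mean lines that stop short of the block's right border.

```python
def looks_like_list_block(lines, block_right, tolerance=2.0):
    """Guess whether a text block is a numbered list.

    lines: list of (text, right_x) tuples, one per line (hypothetical shape).
    block_right: x coordinate of the block's right border.
    """
    # Lines whose first non-space character is a digit ("numeric start lines").
    numeric_starts = sum(1 for text, _ in lines if text.lstrip()[:1].isdigit())
    # Assumption: an "end line" is one that stops short of the right border.
    short_ends = sum(1 for _, right in lines if block_right - right > tolerance)
    # Require at least 2 numeric starts, matched one-to-one by end lines.
    return numeric_starts >= 2 and numeric_starts == short_ends
```

Under these assumptions, a block with two numbered items that each end before the right border is accepted, while an ordinary paragraph or a block with a single numbered line is rejected.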

Merge pull request #743 from myhloli/para-split-v3 · 0d83fb77
Xiaomeng Zhao authored Oct 15, 2024
```
refactor(para_split_v3): merge list and index block detection
```
0d83fb77

fix(split_v3): Fix the rule adaptation for some special list samples. · 244b8684
myhloli authored Oct 15, 2024

244b8684

refactor(pdf): adjust span filling threshold in block construction · 7e301b84

myhloli authored Oct 15, 2024

Increased the threshold for filling spans in blocks from 0.3 to 0.5 to improve the accuracy of block formation. This change helps refine the grouping of spans into blocks, potentially enhancing the overall structure and readability of the PDF content.

7e301b84
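
To show what raising the threshold from 0.3 to 0.5 can change, here is a minimal sketch of overlap-based span filling. The commit message does not say how overlap is measured, so the metric (fraction of the span's area inside the block), the strict `>` comparison, and the names `overlap_ratio` and `fill_spans` are assumptions made for illustration.

```python
def overlap_ratio(span, block):
    """Fraction of the span's area inside the block; boxes are (x0, y0, x1, y1)."""
    sx0, sy0, sx1, sy1 = span
    bx0, by0, bx1, by1 = block
    width = min(sx1, bx1) - max(sx0, bx0)
    height = min(sy1, by1) - max(sy0, by0)
    span_area = (sx1 - sx0) * (sy1 - sy0)
    if width <= 0 or height <= 0 or span_area <= 0:
        return 0.0  # no intersection, or a degenerate span
    return (width * height) / span_area


def fill_spans(spans, block, threshold=0.5):
    """Keep the spans whose overlap with the block exceeds the threshold."""
    return [span for span in spans if overlap_ratio(span, block) > threshold]
```

In this sketch, a span with 40% of its area inside a block is assigned at a threshold of 0.3 but left out at 0.5, which is the kind of borderline case the change targets.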

refactor(para_split_v3): merge list and index block detection · fdcb49d3

myhloli authored Oct 15, 2024

- Combine __is_list_block() and __is_index_block() into a single function __is_list_or_index_block()
- Simplify block type determination logic
- Remove redundant code and improve readability
- Optimize block merging process

fdcb49d3

14 Oct, 2024 4 commits

fix(magic_pdf): include List and Index block types in processing · 0a9a6d3e

myhloli authored Oct 14, 2024

Add List and Index to the list of block types being processed in the draw_bbox.py file. This inclusion ensures that these block types are handled similarly to other text-containing blocks, improving the overall document processing accuracy and consistency.

0a9a6d3e

Merge pull request #740 from myhloli/para-split-v3 · 702b6ac9
Xiaomeng Zhao authored Oct 14, 2024
```
feat(list&index block): detect and merge list and index blocks
```
702b6ac9

feat(list&index block): detect and merge list and index blocks · 1f1dd353

myhloli authored Oct 14, 2024

- Add detection for list and index blocks in OCR processing
- Implement merging of list and index blocks across pages
- Update block types to include list and index categories
- Adjust text merging logic to handle new block types
- Modify layout drawing to distinguish list and index blocks

1f1dd353

feat: manager docs with sphinx (#737) · c479245e

icecraft authored Oct 14, 2024

* feat: manager docs with sphinx

* fix: readthedocs configure

* feat: support multiple language

* fix: add .readthedocs.yaml

* fix: requirments.txt path

---------
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

c479245e

10 Oct, 2024 6 commits
- Merge pull request #718 from myhloli/para-split-v3 · b9631f30
  Xiaomeng Zhao authored Oct 10, 2024
```
fix: Solving the Grouping Anomaly Issue with Multiple Consecutive Non-Text Blocks
```
  b9631f30
- fix: Solving the Grouping Anomaly Issue with Multiple Consecutive Non-Text Blocks · 7b42d5a0
  myhloli authored Oct 10, 2024
  
  7b42d5a0
- Merge pull request #717 from myhloli/dev · 964715b2
  Xiaomeng Zhao authored Oct 10, 2024
```
Update how_to_download_models_zh_cn.md
```
  964715b2
- Merge branch 'opendatalab:dev' into dev · d1c9c7dd
  Xiaomeng Zhao authored Oct 10, 2024
  
  d1c9c7dd
- Merge pull request #716 from myhloli/para-split-v3 · ea7bc620
  Xiaomeng Zhao authored Oct 10, 2024
```
feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support
```
  ea7bc620
- feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support · 6f63e70e
  myhloli authored Oct 10, 2024
```
- Reintegrate para_split_v3 into the pdf_parse_union_core_v2 process
- Add support for specifying page range in doc_analyze_by_custom_model
- Implement garbage collection and memory cleaning after processing
- Refine image loading from PDF, including handling out-of-range pages
```
  6f63e70e
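
The page range handling and post-processing cleanup listed in this commit might look roughly like the sketch below. It does not reproduce doc_analyze_by_custom_model: the helper names `clamp_page_range` and `process_pages`, the zero-based inclusive indexing, and the `handler` callback are hypothetical.

```python
import gc


def clamp_page_range(page_count, start_page=0, end_page=None):
    """Return the in-range page indices (assumed zero-based, end inclusive)."""
    if end_page is None or end_page >= page_count:
        end_page = page_count - 1  # clip an open or out-of-range end
    start_page = max(start_page, 0)
    return list(range(start_page, end_page + 1))


def process_pages(pages, start_page=0, end_page=None, handler=len):
    """Apply handler to each selected page, then trigger garbage collection."""
    selected = clamp_page_range(len(pages), start_page, end_page)
    results = [handler(pages[index]) for index in selected]
    gc.collect()  # release unreferenced per-page objects after processing
    return results
```

A request for pages 1 to 10 of a 5-page document yields indices 1 to 4 here, and a start page beyond the document yields an empty selection rather than an error.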
09 Oct, 2024 3 commits
- Update how_to_download_models_zh_cn.md · 7f9d80fc
  Xiaomeng Zhao authored Oct 09, 2024
  
  7f9d80fc
- Merge pull request #706 from myhloli/dev · 675f8e66
  Xiaomeng Zhao authored Oct 09, 2024
```
Update README_Windows_CUDA_Acceleration_en_US.md
```
  675f8e66
- Update README_Windows_CUDA_Acceleration_en_US.md · 4e58bf8f
  Xiaomeng Zhao authored Oct 09, 2024
  
  4e58bf8f
08 Oct, 2024 14 commits

Merge pull request #701 from myhloli/dev · 1030ebad
Xiaomeng Zhao authored Oct 08, 2024
```
docs: update CUDA acceleration guides and README content
```
1030ebad

docs: update CUDA acceleration guides and README content · a1c7b5a7

myhloli authored Oct 08, 2024

- Update GPU hardware support information in README.md and README_zh-CN.md
- Enhance CUDA acceleration guides for Ubuntu and Windows
- Modify README_zh-CN.md to reflect changes in GPU requirements and configurations
- Update TODO list to mark semantic reading order as completed

a1c7b5a7

docs: update CUDA acceleration guides and README content · 2fb3869e

myhloli authored Oct 08, 2024

- Update GPU hardware support information in README.md and README_zh-CN.md
- Enhance CUDA acceleration guides for Ubuntu and Windows
- Modify README_zh-CN.md to reflect changes in GPU requirements and configurations
- Update TODO list to mark semantic reading order as completed

2fb3869e

Merge pull request #700 from myhloli/dev · 01306098
Xiaomeng Zhao authored Oct 08, 2024
```
docs: add filename to wget command in model download scripts
```
01306098

docs: add filename to wget command in model download scripts · 5de6af68

myhloli authored Oct 08, 2024

- Update wget commands in both English and Chinese documentation to specify the filename
- Improve clarity and prevent potential filename conflicts when downloading the scripts

5de6af68

Merge pull request #699 from myhloli/dev · 7b787555
Xiaomeng Zhao authored Oct 08, 2024
```
feat(docs): automate model download and configuration
```
7b787555

feat(docs): automate model download and configuration · 6c9b23c3

myhloli authored Oct 08, 2024

- Add scripts to download models and update configuration file
- Remove manual steps for modifying model paths
- Update documentation for both ModelScope and HuggingFace model downloads
- Improve user experience by automating the entire process

6c9b23c3

feat(docs): automate model download and configuration · cf385779

myhloli authored Oct 08, 2024

- Add scripts to download models and update configuration file
- Remove manual steps for modifying model paths
- Update documentation for both ModelScope and HuggingFace model downloads
- Improve user experience by automating the entire process

cf385779

Merge pull request #698 from myhloli/dev · 8786d208
Xiaomeng Zhao authored Oct 08, 2024
```
feat(layoutreader): support local model directory and improve model loading
```
8786d208

docs: add layoutreader to related projects · 0b2b0cef

myhloli authored Oct 08, 2024

Added a link to the layoutreader repository in the Related Projects sections of both the README.md and README_zh-CN.md files. This addition helps to provide users with more resources and tools related to document layout analysis and processing.

0b2b0cef

docs: update model download instructions for version 0.9.x and later- Add note... · b28157ce

myhloli authored Oct 08, 2024

docs: update model download instructions for version 0.9.x and later

- Add note about separate download for layoutreader model in version 0.9.x and later
- Include example code for downloading layoutreader model using ModelScope
- Clarify that previous download methods do not support updating to version 0.9.x and later

b28157ce

feat(layoutreader): support local model directory and improve model loading · ded2818a

myhloli authored Oct 08, 2024

- Add function to get local LayoutReader model directory
- Check and use local model directory if available
- Fall back to online model if local directory not found
- Update model initialization to support local path
- Refactor model loading in singleton class

ded2818a
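
The local-first lookup with an online fallback described in this commit can be sketched as follows. The function name `resolve_layoutreader_source` and the placeholder identifier `DEFAULT_MODEL_ID` are hypothetical and do not come from the repository.

```python
from pathlib import Path

# Placeholder for the online model identifier; not the project's real value.
DEFAULT_MODEL_ID = "hypothetical/layoutreader"


def resolve_layoutreader_source(local_dir=None, fallback=DEFAULT_MODEL_ID):
    """Return the local model directory if it exists, else the online fallback."""
    if local_dir and Path(local_dir).is_dir():
        return str(Path(local_dir))
    return fallback
```

The returned string would then be handed to whatever loader initializes the model, so one code path can serve both local and online sources.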

Merge pull request #696 from icecraft/fix/caption_match · 3fb0494b
Xiaomeng Zhao authored Oct 08, 2024
```
fix: caption|footnote match algorithm
```
3fb0494b

fix: caption|footnote match algorithm · f31433b8
icecraft authored Oct 08, 2024

f31433b8