Commits · ece7f8d5a476d6fdcf3aa948f1786e69e8c96aed · Qin Kaijie / pdf-miner

15 Oct, 2024 1 commit
- Merge pull request #6 from opendatalab/dev · ece7f8d5
  Kaiwen Liu authored Oct 15, 2024
```
Dev
```
  ece7f8d5
14 Oct, 2024 3 commits

Merge pull request #740 from myhloli/para-split-v3 · 702b6ac9
Xiaomeng Zhao authored Oct 14, 2024
```
feat(list&index block): detect and merge list and index blocks
```
702b6ac9

feat(list&index block): detect and merge list and index blocks · 1f1dd353

myhloli authored Oct 14, 2024

- Add detection for list and index blocks in OCR processing
- Implement merging of list and index blocks across pages
- Update block types to include list and index categories
- Adjust text merging logic to handle new block types
- Modify layout drawing to distinguish list and index blocks

1f1dd353

feat: manager docs with sphinx (#737) · c479245e

icecraft authored Oct 14, 2024

* feat: manager docs with sphinx

* fix: readthedocs configure

* feat: support multiple language

* fix: add .readthedocs.yaml

* fix: requirements.txt path

---------
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

c479245e

10 Oct, 2024 6 commits
- Merge pull request #718 from myhloli/para-split-v3 · b9631f30
  Xiaomeng Zhao authored Oct 10, 2024
```
fix: Solving the Grouping Anomaly Issue with Multiple Consecutive Non-Text Blocks
```
  b9631f30
- fix: Solving the Grouping Anomaly Issue with Multiple Consecutive Non-Text Blocks · 7b42d5a0
  myhloli authored Oct 10, 2024
  
  7b42d5a0
- Merge pull request #717 from myhloli/dev · 964715b2
  Xiaomeng Zhao authored Oct 10, 2024
```
Update how_to_download_models_zh_cn.md
```
  964715b2
- Merge branch 'opendatalab:dev' into dev · d1c9c7dd
  Xiaomeng Zhao authored Oct 10, 2024
  
  d1c9c7dd
- Merge pull request #716 from myhloli/para-split-v3 · ea7bc620
  Xiaomeng Zhao authored Oct 10, 2024
```
feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support
```
  ea7bc620
- feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support · 6f63e70e
  myhloli authored Oct 10, 2024
```
- Reintegrate para_split_v3 into the pdf_parse_union_core_v2 process
- Add support for specifying page range in doc_analyze_by_custom_model
- Implement garbage collection and memory cleaning after processing
- Refine image loading from PDF, including handling out-of-range pages
```
  6f63e70e
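
The page range handling described in 6f63e70e can be illustrated with a minimal sketch. The helper name `resolve_page_range` and the parameters `start_page_id` and `end_page_id` are assumptions made for illustration; the commit message does not show the actual signature of `doc_analyze_by_custom_model`. The sketch only demonstrates one plausible way to clamp a requested range so that out-of-range pages are never loaded.

```python
def resolve_page_range(page_count, start_page_id=0, end_page_id=None):
    """Return the zero-based page indices to process (hypothetical helper).

    An end index that is missing, negative, or past the last page is
    clamped to the last page; a start index past the end yields no pages.
    """
    if page_count <= 0:
        return []
    last = page_count - 1
    if end_page_id is None or end_page_id < 0 or end_page_id > last:
        end_page_id = last
    start_page_id = max(0, start_page_id)
    if start_page_id > end_page_id:
        return []
    return list(range(start_page_id, end_page_id + 1))


if __name__ == "__main__":
    # Request pages 2..4 of a 10-page document.
    print(resolve_page_range(10, 2, 4))
```

Clamping rather than raising is a design choice of this sketch, not something the commit message confirms.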
09 Oct, 2024 3 commits
- Update how_to_download_models_zh_cn.md · 7f9d80fc
  Xiaomeng Zhao authored Oct 09, 2024
  
  7f9d80fc
- Merge pull request #706 from myhloli/dev · 675f8e66
  Xiaomeng Zhao authored Oct 09, 2024
```
Update README_Windows_CUDA_Acceleration_en_US.md
```
  675f8e66
- Update README_Windows_CUDA_Acceleration_en_US.md · 4e58bf8f
  Xiaomeng Zhao authored Oct 09, 2024
  
  4e58bf8f
08 Oct, 2024 20 commits

Merge pull request #701 from myhloli/dev · 1030ebad
Xiaomeng Zhao authored Oct 08, 2024
```
docs: update CUDA acceleration guides and README content
```
1030ebad

docs: update CUDA acceleration guides and README content · a1c7b5a7

myhloli authored Oct 08, 2024

- Update GPU hardware support information in README.md and README_zh-CN.md
- Enhance CUDA acceleration guides for Ubuntu and Windows
- Modify README_zh-CN.md to reflect changes in GPU requirements and configurations
- Update TODO list to mark semantic reading order as completed

a1c7b5a7

docs: update CUDA acceleration guides and README content · 2fb3869e

myhloli authored Oct 08, 2024

- Update GPU hardware support information in README.md and README_zh-CN.md
- Enhance CUDA acceleration guides for Ubuntu and Windows
- Modify README_zh-CN.md to reflect changes in GPU requirements and configurations
- Update TODO list to mark semantic reading order as completed

2fb3869e

Merge pull request #700 from myhloli/dev · 01306098
Xiaomeng Zhao authored Oct 08, 2024
```
docs: add filename to wget command in model download scripts
```
01306098

docs: add filename to wget command in model download scripts · 5de6af68

myhloli authored Oct 08, 2024

- Update wget commands in both English and Chinese documentation to specify the filename
- Improve clarity and prevent potential filename conflicts when downloading the scripts

5de6af68

Merge pull request #699 from myhloli/dev · 7b787555
Xiaomeng Zhao authored Oct 08, 2024
```
feat(docs): automate model download and configuration
```
7b787555

feat(docs): automate model download and configuration · 6c9b23c3

myhloli authored Oct 08, 2024

- Add scripts to download models and update configuration file
- Remove manual steps for modifying model paths
- Update documentation for both ModelScope and HuggingFace model downloads
- Improve user experience by automating the entire process

6c9b23c3

feat(docs): automate model download and configuration · cf385779

myhloli authored Oct 08, 2024

- Add scripts to download models and update configuration file
- Remove manual steps for modifying model paths
- Update documentation for both ModelScope and HuggingFace model downloads
- Improve user experience by automating the entire process

cf385779

Merge pull request #698 from myhloli/dev · 8786d208
Xiaomeng Zhao authored Oct 08, 2024
```
feat(layoutreader): support local model directory and improve model loading
```
8786d208

docs: add layoutreader to related projects · 0b2b0cef

myhloli authored Oct 08, 2024

Added a link to the layoutreader repository in the Related Projects sections of both the README.md and README_zh-CN.md files. This addition helps to provide users with more resources and tools related to document layout analysis and processing.

0b2b0cef

docs: update model download instructions for version 0.9.x and later · b28157ce

myhloli authored Oct 08, 2024

docs: update model download instructions for version 0.9.x and later
- Add note about separate download for layoutreader model in version 0.9.x and later
- Include example code for downloading layoutreader model using ModelScope
- Clarify that previous download methods do not support updating to version 0.9.x and later

b28157ce

feat(layoutreader): support local model directory and improve model loading · ded2818a

myhloli authored Oct 08, 2024

- Add function to get local LayoutReader model directory
- Check and use local model directory if available
- Fall back to online model if local directory not found
- Update model initialization to support local path
- Refactor model loading in singleton class

ded2818a
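
The local-first lookup described in ded2818a might look roughly like the sketch below. The function name `resolve_model_source` and the placeholder model identifier are hypothetical; the commit message does not show the project's actual function or the online model name. Only the standard-library call `os.path.isdir` is relied on.

```python
import os


def resolve_model_source(local_dir, online_model_id):
    """Prefer a local model directory, else fall back to the online model.

    Both arguments are illustrative: `local_dir` may be None or a path
    that does not exist, in which case the online identifier is returned.
    """
    if local_dir and os.path.isdir(local_dir):
        return local_dir
    return online_model_id


if __name__ == "__main__":
    # "example-org/layoutreader" is a placeholder, not a real model id.
    print(resolve_model_source("/path/that/does/not/exist", "example-org/layoutreader"))
```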

Merge pull request #696 from icecraft/fix/caption_match · 3fb0494b
Xiaomeng Zhao authored Oct 08, 2024
```
fix: caption|footnote match algorithm
```
3fb0494b
fix: caption|footnote match algorithm · f31433b8
icecraft authored Oct 08, 2024

f31433b8
Merge pull request #695 from icecraft/fix/caption_match · 763688c0
Xiaomeng Zhao authored Oct 08, 2024
```
fix: caption or footnote match algorithm
```
763688c0
fix: caption or footnote match algorithm · ef45ad08
icecraft authored Oct 08, 2024

ef45ad08
Merge pull request #694 from myhloli/dev · 3458f85a
Xiaomeng Zhao authored Oct 08, 2024
```
perf(pdf_extract_kit): conditional memory cleanup based on GPU capacity
```
3458f85a

perf(pdf_extract_kit): conditional memory cleanup based on GPU capacity · fb9949c4

myhloli authored Oct 08, 2024

- Introduce a conditional memory cleanup step in the PDF extraction process
- Assess available GPU memory before deciding to perform memory cleanup
- Log the time taken for garbage collection when it occurs
- This optimization helps to balance performance and resource utilization

fb9949c4
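
As a rough illustration of the conditional cleanup in fb9949c4, the sketch below runs garbage collection only when the GPU memory figure is at or below a threshold, and reports how long the collection took. The function name, the 8 GB threshold, and passing the memory figure in as a plain number are all assumptions; the commit message does not state the real threshold or how GPU memory is queried.

```python
import gc
import time


def maybe_clean_memory(gpu_memory_gb, threshold_gb=8.0):
    """Run gc.collect() only on memory-constrained GPUs (hypothetical helper).

    Returns the elapsed time in seconds, rounded to two decimals, or None
    when cleanup was skipped because the GPU has memory to spare.
    """
    if gpu_memory_gb > threshold_gb:
        return None
    start = time.perf_counter()
    gc.collect()
    return round(time.perf_counter() - start, 2)


if __name__ == "__main__":
    elapsed = maybe_clean_memory(6.0)
    if elapsed is not None:
        print(f"gc time: {elapsed}")
```

Skipping the collection on larger GPUs avoids paying the cleanup cost where memory pressure is unlikely, which matches the balance between performance and resource use that the commit describes.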

Merge pull request #693 from myhloli/dev · 69eb2c3b
Xiaomeng Zhao authored Oct 08, 2024
```
feat: add arXiv paper link to header and adjust PDF parsing logic
```
69eb2c3b

feat: add arXiv paper link to header and adjust PDF parsing logic · a71db703

myhloli authored Oct 08, 2024

feat: add arXiv paper link to header and adjust PDF parsing logic
- Add arXiv paper link to the header template for easy access to the latest research paper.
- Modify the PDF parsing logic to handle edge cases more accurately, particularly in determining the number of lines in a block based on its height.

a71db703

06 Oct, 2024 2 commits

Merge pull request #690 from myhloli/dev · de60127c
Xiaomeng Zhao authored Oct 06, 2024
```
refactor(model): improve timing information and performance
```
de60127c

refactor(model): improve timing information and performance · be1b1ae7

myhloli authored Oct 06, 2024

- Enhance timing output precision to two decimal places for better readability
- Calculate and log document analysis speed in pages per second
- Optimize logging for YOLO and table recognition processes
- Remove unnecessary comments and improve code efficiency

be1b1ae7
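
The timing changes in be1b1ae7 amount to two-decimal formatting plus a pages-per-second figure. A minimal sketch, assuming a hypothetical helper `format_speed` and an invented log wording (the commit message does not show the real log line):

```python
def format_speed(page_count, elapsed_seconds):
    """Format elapsed time and throughput with two decimal places."""
    if elapsed_seconds <= 0:
        # Avoid division by zero for empty or instantaneous runs.
        return f"doc analyze time: {elapsed_seconds:.2f}s, speed: n/a"
    speed = page_count / elapsed_seconds
    return f"doc analyze time: {elapsed_seconds:.2f}s, speed: {speed:.2f} pages/second"


if __name__ == "__main__":
    print(format_speed(10, 4.0))
```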

30 Sep, 2024 5 commits
- Update README_zh-CN.md · 14bb5865
  sfk authored Sep 30, 2024
```
add arxiv url
```
  14bb5865
- Update README.md · 0ae9979a
  sfk authored Sep 30, 2024
```
add arxiv url
```
  0ae9979a
- Update README.md · cd55083b
  sfk authored Sep 30, 2024
  
  cd55083b
- Update Miner technical report bibtex · aca52da1
  wangbinDL authored Sep 30, 2024
  
  aca52da1
- Merge pull request #672 from myhloli/add-layoutreader · bcbee130
  Xiaomeng Zhao authored Sep 30, 2024
```
feat: add layoutreader to sort blocks
```
  bcbee130