Commits · d1c9c7dd89027fa6206f84a7f10b13c8344e0564 · Qin Kaijie / pdf-miner

10 Oct, 2024 3 commits

Merge branch 'opendatalab:dev' into dev · d1c9c7dd
Xiaomeng Zhao authored Oct 10, 2024

d1c9c7dd
Merge pull request #716 from myhloli/para-split-v3 · ea7bc620
Xiaomeng Zhao authored Oct 10, 2024
```
feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support
```
ea7bc620

feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support · 6f63e70e

myhloli authored Oct 10, 2024

- Reintegrate para_split_v3 into the pdf_parse_union_core_v2 process
- Add support for specifying page range in doc_analyze_by_custom_model
- Implement garbage collection and memory cleaning after processing
- Refine image loading from PDF, including handling out-of-range pages

6f63e70e
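
A minimal sketch of the page-range handling and post-processing cleanup described in 6f63e70e. The helper names `resolve_page_range` and `process_pages`, the parameters `start_page_id` and `end_page_id`, and the convention that a missing or out-of-range end means "through the last page" are assumptions for illustration, not the actual signature of `doc_analyze_by_custom_model`.

```python
import gc


def resolve_page_range(total_pages, start_page_id=0, end_page_id=None):
    """Clamp a requested page range to the pages that actually exist."""
    if total_pages <= 0:
        return range(0)
    last = total_pages - 1
    # Assumed convention: None, negative, or too-large end means "last page".
    if end_page_id is None or end_page_id < 0 or end_page_id > last:
        end_page_id = last
    start_page_id = max(0, start_page_id)
    # An out-of-range start yields an empty range instead of raising.
    return range(start_page_id, end_page_id + 1)


def process_pages(pages, start_page_id=0, end_page_id=None):
    """Process only the selected pages, then release temporary objects."""
    selected = [pages[i] for i in resolve_page_range(len(pages), start_page_id, end_page_id)]
    results = [page.upper() for page in selected]  # stand-in for the real analysis
    del selected
    gc.collect()  # explicit garbage collection after processing
    return results
```

Clamping rather than raising keeps a request such as "pages 0 to 99" usable on a five-page document.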

09 Oct, 2024 3 commits
- Update how_to_download_models_zh_cn.md · 7f9d80fc
  Xiaomeng Zhao authored Oct 09, 2024
  
  7f9d80fc
- Merge pull request #706 from myhloli/dev · 675f8e66
  Xiaomeng Zhao authored Oct 09, 2024
```
Update README_Windows_CUDA_Acceleration_en_US.md
```
  675f8e66
- Update README_Windows_CUDA_Acceleration_en_US.md · 4e58bf8f
  Xiaomeng Zhao authored Oct 09, 2024
  
  4e58bf8f
08 Oct, 2024 20 commits

Merge pull request #701 from myhloli/dev · 1030ebad
Xiaomeng Zhao authored Oct 08, 2024
```
docs: update CUDA acceleration guides and README content
```
1030ebad

docs: update CUDA acceleration guides and README content · a1c7b5a7

myhloli authored Oct 08, 2024

- Update GPU hardware support information in README.md and README_zh-CN.md
- Enhance CUDA acceleration guides for Ubuntu and Windows
- Modify README_zh-CN.md to reflect changes in GPU requirements and configurations
- Update TODO list to mark semantic reading order as completed

a1c7b5a7

docs: update CUDA acceleration guides and README content · 2fb3869e

myhloli authored Oct 08, 2024

- Update GPU hardware support information in README.md and README_zh-CN.md
- Enhance CUDA acceleration guides for Ubuntu and Windows
- Modify README_zh-CN.md to reflect changes in GPU requirements and configurations
- Update TODO list to mark semantic reading order as completed

2fb3869e

Merge pull request #700 from myhloli/dev · 01306098
Xiaomeng Zhao authored Oct 08, 2024
```
docs: add filename to wget command in model download scripts
```
01306098

docs: add filename to wget command in model download scripts · 5de6af68

myhloli authored Oct 08, 2024

- Update wget commands in both English and Chinese documentation to specify the filename
- Improve clarity and prevent potential filename conflicts when downloading the scripts

5de6af68

Merge pull request #699 from myhloli/dev · 7b787555
Xiaomeng Zhao authored Oct 08, 2024
```
feat(docs): automate model download and configuration
```
7b787555

feat(docs): automate model download and configuration · 6c9b23c3

myhloli authored Oct 08, 2024

- Add scripts to download models and update configuration file
- Remove manual steps for modifying model paths
- Update documentation for both ModelScope and HuggingFace model downloads
- Improve user experience by automating the entire process

6c9b23c3

feat(docs): automate model download and configuration · cf385779

myhloli authored Oct 08, 2024

- Add scripts to download models and update configuration file
- Remove manual steps for modifying model paths
- Update documentation for both ModelScope and HuggingFace model downloads
- Improve user experience by automating the entire process

cf385779

Merge pull request #698 from myhloli/dev · 8786d208
Xiaomeng Zhao authored Oct 08, 2024
```
feat(layoutreader): support local model directory and improve model loading
```
8786d208

docs: add layoutreader to related projects · 0b2b0cef

myhloli authored Oct 08, 2024

Added a link to the layoutreader repository in the Related Projects sections of both the README.md and README_zh-CN.md files. This addition helps to provide users with more resources and tools related to document layout analysis and processing.

0b2b0cef

docs: update model download instructions for version 0.9.x and later · b28157ce

myhloli authored Oct 08, 2024

- Add note about separate download for layoutreader model in version 0.9.x and later
- Include example code for downloading layoutreader model using ModelScope
- Clarify that previous download methods do not support updating to version 0.9.x and later

b28157ce

feat(layoutreader): support local model directory and improve model loading · ded2818a

myhloli authored Oct 08, 2024

- Add function to get local LayoutReader model directory
- Check and use local model directory if available
- Fall back to online model if local directory not found
- Update model initialization to support local path
- Refactor model loading in singleton class

ded2818a
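
The local-directory fallback and singleton loading described in ded2818a could look roughly like the sketch below. `resolve_model_source`, `ModelSingleton`, and the placeholder identifier `example-org/layoutreader` are hypothetical names chosen for illustration; the real function names and model identifier are not given in the commit message.

```python
from pathlib import Path

# Placeholder identifier, not the project's actual online model name.
DEFAULT_ONLINE_MODEL = "example-org/layoutreader"


def resolve_model_source(local_dir=None, online_id=DEFAULT_ONLINE_MODEL):
    """Prefer an existing local model directory, else fall back to the online id."""
    if local_dir and Path(local_dir).is_dir():
        return str(Path(local_dir))
    return online_id


class ModelSingleton:
    """Process-wide cache so each model source is loaded only once."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}
        return cls._instance

    def get_model(self, source, loader):
        # `loader` is any callable that builds a model from a path or id.
        if source not in self._models:
            self._models[source] = loader(source)
        return self._models[source]
```

Passing the loader in as a callable keeps the sketch independent of any particular model library.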

Merge pull request #696 from icecraft/fix/caption_match · 3fb0494b
Xiaomeng Zhao authored Oct 08, 2024
```
fix: caption|footnote match algorithm
```
3fb0494b
fix: caption|footnote match algorithm · f31433b8
icecraft authored Oct 08, 2024

f31433b8
Merge pull request #695 from icecraft/fix/caption_match · 763688c0
Xiaomeng Zhao authored Oct 08, 2024
```
fix: caption or footnote match algorithm
```
763688c0
fix: caption or footnote match algorithm · ef45ad08
icecraft authored Oct 08, 2024

ef45ad08
Merge pull request #694 from myhloli/dev · 3458f85a
Xiaomeng Zhao authored Oct 08, 2024
```
perf(pdf_extract_kit): conditional memory cleanup based on GPU capacity
```
3458f85a

perf(pdf_extract_kit): conditional memory cleanup based on GPU capacity · fb9949c4

myhloli authored Oct 08, 2024

- Introduce a conditional memory cleanup step in the PDF extraction process
- Assess available GPU memory before deciding to perform memory cleanup
- Log the time taken for garbage collection when it occurs
- This optimization helps to balance performance and resource utilization

fb9949c4
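
A minimal sketch of the conditional cleanup described in fb9949c4, assuming the GPU memory figure is obtained elsewhere and passed in as gigabytes. The function name `maybe_clean_memory` and the 8 GB threshold are hypothetical; the commit message does not state the actual cutoff.

```python
import gc
import logging
import time

logger = logging.getLogger(__name__)

LOW_VRAM_THRESHOLD_GB = 8  # hypothetical cutoff, not taken from the commit


def maybe_clean_memory(gpu_memory_gb, threshold_gb=LOW_VRAM_THRESHOLD_GB):
    """Run garbage collection only when GPU memory is below the threshold.

    Returns True if cleanup ran, False if it was skipped.
    """
    if gpu_memory_gb is None or gpu_memory_gb >= threshold_gb:
        return False  # plenty of memory (or no GPU info): skip the cleanup cost
    start = time.perf_counter()
    gc.collect()
    # In a PyTorch pipeline, releasing cached GPU memory would plausibly go here.
    logger.info("gc time: %.2f s", time.perf_counter() - start)
    return True
```

Skipping the cleanup on large-memory devices avoids paying the collection cost on every document when memory pressure is unlikely.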

Merge pull request #693 from myhloli/dev · 69eb2c3b
Xiaomeng Zhao authored Oct 08, 2024
```
feat: add arXiv paper link to header and adjust PDF parsing logic
```
69eb2c3b

feat: add arXiv paper link to header and adjust PDF parsing logic · a71db703

myhloli authored Oct 08, 2024

- Add arXiv paper link to the header template for easy access to the latest research paper.
- Modify the PDF parsing logic to handle edge cases more accurately, particularly in determining the number of lines in a block based on its height.

a71db703

06 Oct, 2024 2 commits

Merge pull request #690 from myhloli/dev · de60127c
Xiaomeng Zhao authored Oct 06, 2024
```
refactor(model): improve timing information and performance
```
de60127c

refactor(model): improve timing information and performance · be1b1ae7

myhloli authored Oct 06, 2024

- Enhance timing output precision to two decimal places for better readability
- Calculate and log document analysis speed in pages per second
- Optimize logging for YOLO and table recognition processes
- Remove unnecessary comments and improve code efficiency

be1b1ae7
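
The timing output described in be1b1ae7 (two decimal places plus a pages-per-second figure) can be sketched as follows. The function name `format_timing` and the exact message wording are assumptions for illustration.

```python
def format_timing(page_count, elapsed_seconds):
    """Format elapsed time and throughput with two decimal places."""
    # Guard against division by zero for extremely fast or empty runs.
    speed = page_count / elapsed_seconds if elapsed_seconds > 0 else 0.0
    return (
        f"doc analyze time: {elapsed_seconds:.2f} s, "
        f"speed: {speed:.2f} pages/second"
    )
```

For example, `format_timing(10, 4.0)` yields `doc analyze time: 4.00 s, speed: 2.50 pages/second`.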

30 Sep, 2024 6 commits
- Update README_zh-CN.md · 14bb5865
  sfk authored Sep 30, 2024
```
add arxiv url
```
  14bb5865
- Update README.md · 0ae9979a
  sfk authored Sep 30, 2024
```
add arxiv url
```
  0ae9979a
- Update README.md · cd55083b
  sfk authored Sep 30, 2024
  
  cd55083b
- Update Miner technical report bibtex · aca52da1
  wangbinDL authored Sep 30, 2024
  
  aca52da1
- Merge pull request #672 from myhloli/add-layoutreader · bcbee130
  Xiaomeng Zhao authored Sep 30, 2024
```
feat: add layoutreader to sort blocks
```
  bcbee130
- chore: remove useless files · fcf24242
  myhloli authored Sep 30, 2024
  
  fcf24242
29 Sep, 2024 2 commits

refactor(magic_pdf): improve line sorting and block indexing · 564c4ce1

myhloli authored Sep 29, 2024

- Insert lines into blocks based on median line height
- Calculate block index using line indices median
- Remove virtual line information for table and image blocks
- Enhance line sorting algorithm for different block types
- Add line height calculation function

564c4ce1
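
One way to read "calculate block index using line indices median" from 564c4ce1 is sketched below. The data layout (each block a dict with a `lines` list whose entries carry an integer `index`) and the choice to place blocks without lines last are assumptions, not the project's confirmed structures.

```python
from statistics import median


def block_index(block):
    """Median of the reading-order indices of the block's lines, or None if empty."""
    indices = [line["index"] for line in block.get("lines", [])]
    return median(indices) if indices else None


def sort_blocks(blocks):
    """Order blocks by their median line index; blocks without lines go last."""
    def key(block):
        idx = block_index(block)
        return float("inf") if idx is None else idx
    return sorted(blocks, key=key)
```

Using the median rather than the first or last line index makes a block's position less sensitive to a single misplaced line.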

refactor(memory management): remove unused clean_memory function · 4c9bf8ab

myhloli authored Sep 29, 2024

The clean_memory function has been removed from pdf_parse_union_core_v2.py due to it not being used.
This change streamlines the code and prevents potential confusion regarding its purpose.

4c9bf8ab

28 Sep, 2024 3 commits

refactor(magic_pdf): import model helpers directly for clarity · 42a7d792

myhloli authored Sep 28, 2024

Update import statements in `pdf_parse_union_core_v2.py` to directly import
`prepare_inputs`, `boxes2inputs`, and `parse_logits` from `magic_pdf.model.v3.helpers`
instead of from `magic_pdf.model.v3`. This change streamlines the imports, making the
code more readable and maintaining a cleaner approach to modular design.

42a7d792

refactor(pdf_parse_union_core_v2): update import paths to use new package structure · 5522d0a3

myhloli authored Sep 28, 2024

Adapt import statements in `pdf_parse_union_core_v2.py` to reflect the updated package
structure, changing from the `magic_pdf.v3.helpers` module to the `magic_pdf.model.v3`
module. This ensures compatibility with the revised directory layout.

5522d0a3

fix(pdf_parse): handle blocks without lines and enable bf16 on compatible devices · 2145a8b6

myhloli authored Sep 28, 2024

Blocks without lines are now correctly indexed even when they contain textual content rendered
as images. The sorting logic has been updated to accommodate this scenario. Additionally, the
LayoutLMv3 model initialization has been enhanced to utilize bfloat16 precision on devices that
support it, offering potential performance benefits on supported hardware.

2145a8b6
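
A hedged sketch of the precision choice described in 2145a8b6. It assumes the check is made on the CUDA compute capability, since NVIDIA GPUs from the Ampere generation (compute capability 8.0) onward support bfloat16 natively; the function name `choose_dtype` and its parameters are hypothetical, and the actual code may query the framework instead (PyTorch, for instance, offers `torch.cuda.is_bf16_supported()`).

```python
def choose_dtype(device_type, compute_capability_major=None):
    """Return the precision name to use for model initialization.

    Falls back to float32 whenever bfloat16 support cannot be established.
    """
    if (
        device_type == "cuda"
        and compute_capability_major is not None
        and compute_capability_major >= 8  # Ampere or newer (assumed criterion)
    ):
        return "bfloat16"
    return "float32"
```

Defaulting to float32 when the capability is unknown keeps the sketch safe on CPUs and older GPUs.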

27 Sep, 2024 1 commit

refactor(pdf_parse): remove redundant sorting and optimize block indexing · 177ab08e

myhloli authored Sep 27, 2024

Removed redundant sorting of lines by model and optimized calculation of block
indexes by using a single pass through the sorted lines. This change simplifies the
code and potentially improves performance by reducing the number of sorting
operations and unnecessary iterations over blocks without lines.

177ab08e
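
The single-pass indexing mentioned in 177ab08e might be approximated as below: given lines already in reading order, each block receives an index the first time one of its lines appears, so blocks without lines are never visited. The `block_id` field and the function name are assumptions for illustration, not the project's actual data model.

```python
def index_blocks_single_pass(sorted_lines):
    """Assign each block an index by first appearance in the sorted lines."""
    block_index = {}
    for line in sorted_lines:
        # setdefault leaves the index unchanged if the block was already seen.
        block_index.setdefault(line["block_id"], len(block_index))
    return block_index
```

Because the lines are sorted once up front, no per-block sorting or second loop over the blocks is needed.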