Commits · c200effc3a1a2af548d1918cf4fb2eb4087c7a32 · Qin Kaijie / pdf-miner

24 Oct, 2024 3 commits
- style: remove unused log info · c200effc
  icecraft authored Oct 24, 2024
  
  c200effc
- feat: update docs deps · bb179954
  icecraft authored Oct 24, 2024
  
  bb179954
- feat: add [figure | table] match [caption | footnote] match algorithm v2 · 283b597a
  icecraft authored Oct 19, 2024
```
feat: add Data api
```
  283b597a
23 Oct, 2024 10 commits
- Merge pull request #777 from myhloli/add-doclayout-yolo · e36627be
  Xiaomeng Zhao authored Oct 23, 2024
```
feat: add support for non-PDF file conversion to PDF
```
  e36627be
- feat: add support for non-PDF file conversion to PDF · 4834baf4
  myhloli authored Oct 23, 2024
```
- Implement to_pdf function to convert non-PDF files to PDF format
- Integrate file upload functionality for PDF and image files
- Update UI to include file upload component and PDF preview
- Add conversion button and update its functionality to handle new file types
```
  4834baf4
- Merge pull request #776 from myhloli/add-doclayout-yolo · d1c0546a
  Xiaomeng Zhao authored Oct 23, 2024
```
build(docker): add doclayout-yolo dependency
```
  d1c0546a
- build(docker): add doclayout-yolo dependency · 2468016b
  myhloli authored Oct 23, 2024
```
- Add doclayout-yolo==0.0.2 to requirements-docker.txt
```
  2468016b
- Merge pull request #774 from myhloli/add-doclayout-yolo · 97585a72
  Xiaomeng Zhao authored Oct 23, 2024
```
build(setup): add doclayout_yolo dependency
```
  97585a72
- build(setup): add doclayout_yolo dependency · 73fe8914
  myhloli authored Oct 23, 2024
```
- Add doclayout_yolo==0.0.2 to the list of dependencies in setup.py
```
  73fe8914
- Merge pull request #773 from myhloli/add-doclayout-yolo · c1ba9dcb
  Xiaomeng Zhao authored Oct 23, 2024
```
feat(model): add support for DocLayout-YOLO model
```
  c1ba9dcb
- feat(model): add support for DocLayout-YOLO model · 1279f2cd
  myhloli authored Oct 23, 2024
```
- Add new layout model option: DocLayout-YOLO
- Implement model initialization and prediction for DocLayout-YOLO
- Update configuration options to include new model
- Modify existing code to support both LayoutLMv3 and DocLayout-YOLO models
- Update Gradio app to support more Custom Switch
```
  1279f2cd
- Merge pull request #769 from myhloli/add-doclayout-yolo · efb5851f
  Xiaomeng Zhao authored Oct 23, 2024
```
update：update config json
```
  efb5851f
- update：update config json · 790691d6
  myhloli authored Oct 23, 2024
  
  790691d6
21 Oct, 2024 5 commits
- Merge pull request #766 from myhloli/docs-update · d18a55ec
  Xiaomeng Zhao authored Oct 21, 2024
```
docs:Update the driver requirements on the Ubuntu system.
```
  d18a55ec
- docs:Update the driver requirements on the Ubuntu system. · 8e72f9db
  myhloli authored Oct 21, 2024
  
  8e72f9db
- fix(ocr_mkcontent): expand para_to_standard_format_v2 to handle list and index blocks · 64408576
  myhloli authored Oct 21, 2024
```
- Modified the condition to include List and Index block types
- This change enhances the function's capability to process different paragraph types
```
  64408576
- Merge pull request #765 from myhloli/add-list-group · e4904cd6
  Xiaomeng Zhao authored Oct 21, 2024
```
refactor(para): improve paragraph splitting algorithm
```
  e4904cd6
- refactor(para): improve paragraph splitting algorithm · 8cc76c49
  myhloli authored Oct 21, 2024
```
- Adjust the threshold for identifying index blocks from 3 lines to 2 lines
- Add a new function __is_list_group to detect if a group of blocks is a list
- Modify the paragraph merging logic to handle list groups differently
```
  8cc76c49
18 Oct, 2024 1 commit

refactor(magic_pdf): remove unused parameters and simplify functions · fc49f5c4

myhloli authored Oct 18, 2024

- Remove unused parameters parse_type and lang from various functions
- Simplify function calls by removing unnecessary arguments
- Update related files to reflect these changes

fc49f5c4

17 Oct, 2024 2 commits

Merge pull request #753 from myhloli/dev · fe21eebd

Xiaomeng Zhao authored Oct 17, 2024

refactor(ocr):Increase the dilation factor in OCR to address the issue of word concatenation.

fe21eebd

refactor(ocr):Increase the dilation factor in OCR to address the issue of word concatenation. · 011a1b97

myhloli authored Oct 17, 2024

- Remove unused functions such as split_long_words, ocr_mk_mm_markdown_with_para, etc.
- Simplify ocr_mk_markdown_with_para_core_v2 by removing unnecessary language detection and word splitting logic
- Remove wordninja dependency from requirements
- Update ocr_model_init to include additional parameters for OCR model configuration

011a1b97

16 Oct, 2024 4 commits
- Merge pull request #747 from myhloli/dev · 2a409845
  Xiaomeng Zhao authored Oct 16, 2024
```
update example files
```
  2a409845
- update example files · cf377ce4
  myhloli authored Oct 16, 2024
  
  cf377ce4
- Merge remote-tracking branch 'origin/dev' into dev · ab3d2d17
  myhloli authored Oct 16, 2024
  
  ab3d2d17
- docs: enhance document parsing capabilities · 237c062d
  myhloli authored Oct 16, 2024
```
- Improve reading order with model-based sorting
- Add list recognition within text
- Implement table of contents recognition
- Support table recognition
- Enhance code block and geometric shape recognition
- Address known issues in both English and Chinese READMEs
```
  237c062d
15 Oct, 2024 6 commits

Merge pull request #744 from myhloli/para-split-v3 · f50bc87b
Xiaomeng Zhao authored Oct 15, 2024
```
fix(para_split_v3): refine list block detection in paragraph splitting
```
f50bc87b

refactor(para_split_v3): refine list block detection in paragraph splitting · 81b9fd7b

myhloli authored Oct 15, 2024

- Update list block detection logic to require at least 2 numeric start lines
- Ensure the number of numeric start lines matches the number of end lines
- Remove detection of non-border starting lines for simplicity

81b9fd7b
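
The commit body above states two conditions for treating a block as a list. A minimal sketch of how such a check could look follows. It is not the code from this commit: the function name `looks_like_numbered_list`, the `END_MARKS` tuple, and the reading of "end lines" as lines that finish with terminal punctuation are all assumptions made for illustration.

```python
from typing import List

# Assumed set of characters that mark the end of a list item (hypothetical).
END_MARKS = (".", "。", ";", "；")


def looks_like_numbered_list(lines: List[str]) -> bool:
    """Return True if the lines resemble a numbered list.

    Two conditions, following the commit description:
      1. at least 2 lines start with a digit, and
      2. the count of digit-start lines equals the count of "end lines",
         taken here (an assumption) to mean lines ending in END_MARKS.
    """
    stripped = [ln.strip() for ln in lines if ln.strip()]
    numeric_starts = sum(1 for ln in stripped if ln[0].isdigit())
    item_ends = sum(1 for ln in stripped if ln.endswith(END_MARKS))
    return numeric_starts >= 2 and numeric_starts == item_ends


if __name__ == "__main__":
    block = [
        "1. Install the package",
        "and its dependencies.",
        "2. Run the parser.",
    ]
    print(looks_like_numbered_list(block))
```

Requiring the two counts to match is one way to avoid flagging ordinary paragraphs that merely contain a line starting with a year or a figure number.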

Merge pull request #743 from myhloli/para-split-v3 · 0d83fb77
Xiaomeng Zhao authored Oct 15, 2024
```
refactor(para_split_v3): merge list and index block detection
```
0d83fb77

fix(split_v3): Fix the rule adaptation for some special list samples. · 244b8684
myhloli authored Oct 15, 2024

244b8684

refactor(pdf): adjust span filling threshold in block construction · 7e301b84

myhloli authored Oct 15, 2024

Increased the threshold for filling spans in blocks from 0.3 to 0.5 to improve the accuracy of block formation. This change helps refine the grouping of spans into blocks, potentially enhancing the overall structure and readability of the PDF content.

7e301b84
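
To make the effect of the threshold change concrete, here is a small sketch of overlap-based span assignment. It assumes a span is attached to a block when the fraction of the span's area lying inside the block exceeds the threshold. The names `overlap_ratio` and `fill_spans`, the bounding-box layout, and the strict `>` comparison are hypothetical and not taken from the repository.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def overlap_ratio(span: BBox, block: BBox) -> float:
    """Fraction of the span's area that lies inside the block."""
    inter_w = max(0.0, min(span[2], block[2]) - max(span[0], block[0]))
    inter_h = max(0.0, min(span[3], block[3]) - max(span[1], block[1]))
    span_area = (span[2] - span[0]) * (span[3] - span[1])
    if span_area <= 0:
        return 0.0
    return (inter_w * inter_h) / span_area


def fill_spans(blocks: List[BBox], spans: List[BBox],
               threshold: float = 0.5) -> Dict[int, List[BBox]]:
    """Assign each span to the first block it overlaps by more than threshold."""
    assigned: Dict[int, List[BBox]] = {i: [] for i in range(len(blocks))}
    for span in spans:
        for i, block in enumerate(blocks):
            if overlap_ratio(span, block) > threshold:
                assigned[i].append(span)
                break  # a span belongs to at most one block
    return assigned


if __name__ == "__main__":
    block = (0.0, 0.0, 100.0, 100.0)
    edge_span = (80.0, 10.0, 130.0, 20.0)  # 40% of its area is inside the block
    print(fill_spans([block], [edge_span], threshold=0.3))
    print(fill_spans([block], [edge_span], threshold=0.5))
```

In this sketch, a span that straddles a block edge with 40% of its area inside is picked up at a threshold of 0.3 but left out at 0.5, which is the kind of borderline case the stricter value excludes.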

refactor(para_split_v3): merge list and index block detection · fdcb49d3

myhloli authored Oct 15, 2024

- Combine __is_list_block() and __is_index_block() into a single function __is_list_or_index_block()
- Simplify block type determination logic
- Remove redundant code and improve readability
- Optimize block merging process

fdcb49d3

14 Oct, 2024 4 commits

fix(magic_pdf): include List and Index block types in processing · 0a9a6d3e

myhloli authored Oct 14, 2024

Add List and Index to the list of block types being processed in the draw_bbox.py file. This inclusion ensures that these block types are handled similarly to other text-containing blocks, improving the overall document processing accuracy and consistency.

0a9a6d3e

Merge pull request #740 from myhloli/para-split-v3 · 702b6ac9
Xiaomeng Zhao authored Oct 14, 2024
```
feat(list&index block): detect and merge list and index blocks
```
702b6ac9

feat(list&index block): detect and merge list and index blocks · 1f1dd353

myhloli authored Oct 14, 2024

- Add detection for list and index blocks in OCR processing
- Implement merging of list and index blocks across pages
- Update block types to include list and index categories
- Adjust text merging logic to handle new block types
- Modify layout drawing to distinguish list and index blocks

1f1dd353

feat: manage docs with sphinx (#737) · c479245e

icecraft authored Oct 14, 2024

* feat: manage docs with sphinx

* fix: readthedocs configure

* feat: support multiple language

* fix: add .readthedocs.yaml

* fix: requirements.txt path

---------
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

c479245e

10 Oct, 2024 5 commits
- Merge pull request #718 from myhloli/para-split-v3 · b9631f30
  Xiaomeng Zhao authored Oct 10, 2024
```
fix: Solving the Grouping Anomaly Issue with Multiple Consecutive Non-Text Blocks
```
  b9631f30
- fix: Solving the Grouping Anomaly Issue with Multiple Consecutive Non-Text Blocks · 7b42d5a0
  myhloli authored Oct 10, 2024
  
  7b42d5a0
- Merge pull request #717 from myhloli/dev · 964715b2
  Xiaomeng Zhao authored Oct 10, 2024
```
Update how_to_download_models_zh_cn.md
```
  964715b2
- Merge branch 'opendatalab:dev' into dev · d1c9c7dd
  Xiaomeng Zhao authored Oct 10, 2024
  
  d1c9c7dd
- Merge pull request #716 from myhloli/para-split-v3 · ea7bc620
  Xiaomeng Zhao authored Oct 10, 2024
```
feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support
```
  ea7bc620