Commits · 02b7999299b7c50f791a8f7f1a2d24c6cec26bfd · Qin Kaijie / pdf-miner

25 Oct, 2024 3 commits

add init to magic_pdf.config · 02b79992
myhloli authored Oct 25, 2024

02b79992

refactor(ocr): adjust OCR processing parameters · 1807126e

myhloli authored Oct 25, 2024

- Lower the Y-axis overlap threshold for merging spans into lines from 0.6 to 0.5
- Reduce the unclip ratio for OCR detection from 2.4 to 1.8
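
To illustrate what the first threshold controls, here is a minimal sketch of merging span boxes into lines by vertical overlap. It is not the code from this commit: the function names, the `(x0, y0, x1, y1)` box layout, and the choice to normalise the overlap by the shorter box are all assumptions made for the example.

```python
def y_overlap_ratio(a, b):
    """Vertical overlap of two (x0, y0, x1, y1) boxes, relative to the shorter box.

    Normalising by the shorter box is an assumption for this sketch.
    """
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return max(0.0, overlap) / shorter if shorter > 0 else 0.0


def merge_spans_into_lines(spans, threshold=0.5):
    """Group span boxes into lines (hypothetical helper).

    Spans are visited top to bottom; a span joins the current line when it
    overlaps that line's most recently added span by more than `threshold`.
    """
    lines = []
    for span in sorted(spans, key=lambda s: (s[1], s[0])):
        if lines and y_overlap_ratio(lines[-1][-1], span) > threshold:
            lines[-1].append(span)
        else:
            lines.append([span])
    # Order the spans inside each line from left to right.
    return [sorted(line, key=lambda s: s[0]) for line in lines]


spans = [(0, 0, 10, 10), (12, 4.5, 20, 14.5), (0, 30, 10, 40)]
lines_new = merge_spans_into_lines(spans, threshold=0.5)
lines_old = merge_spans_into_lines(spans, threshold=0.6)
```

In this example the first two spans overlap vertically by 55% of their height, so they form one line at a threshold of 0.5 but stay separate at 0.6. A lower threshold therefore merges slightly misaligned spans more readily.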

1807126e

refactor(ocr): improve image and table block handling · c34c9d21

myhloli authored Oct 25, 2024

- Split image and table blocks into separate categories
- Add group_id to image and table blocks
- Update block processing logic to handle new categories
- Modify layout splitting and span filling to accommodate new block types
- Adjust block indexing and sorting to consider new structures

c34c9d21

24 Oct, 2024 5 commits
- refactor(magic_pdf): adjust confidence threshold for DocLayout_YOLO model · ce72cf05
  myhloli authored Oct 24, 2024
```
- Changed the confidence threshold from 0.15 to 0.25 in the DocLayout_YOLO model prediction
- This adjustment aims to improve the accuracy of layout detection by filtering out low-confidence predictions
```
  ce72cf05
- Merge pull request #782 from icecraft/feat/data_api · 82dd7ac5
  Xiaomeng Zhao authored Oct 24, 2024
```
Feat/data api
```
  82dd7ac5
- style: remove unused log info · c200effc
  icecraft authored Oct 24, 2024
  
  c200effc
- feat: update docs deps · bb179954
  icecraft authored Oct 24, 2024
  
  bb179954
- feat: add [figure | table] match [caption | footnote] match algorithm v2 · 283b597a
  icecraft authored Oct 19, 2024
```
feat: add Data api
```
  283b597a
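
The confidence-threshold change in ce72cf05 above can be pictured with a minimal sketch. It only shows the general idea of dropping low-confidence layout detections; the `filter_detections` helper and the dictionary layout of a detection are hypothetical and are not taken from the DocLayout_YOLO integration.

```python
def filter_detections(detections, conf_threshold=0.25):
    """Keep only detections whose score reaches the threshold (hypothetical helper)."""
    return [d for d in detections if d["score"] >= conf_threshold]


# Made-up detections, used only to show the effect of the two thresholds.
detections = [
    {"label": "text", "score": 0.90},
    {"label": "table", "score": 0.20},
    {"label": "figure", "score": 0.10},
]
kept_old = filter_detections(detections, conf_threshold=0.15)
kept_new = filter_detections(detections, conf_threshold=0.25)
```

With these sample scores, the table detection at 0.20 survives a threshold of 0.15 but is discarded at 0.25, which is the trade-off the commit message describes: fewer spurious regions at the cost of possibly losing weak but correct ones.
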
23 Oct, 2024 10 commits
- Merge pull request #777 from myhloli/add-doclayout-yolo · e36627be
  Xiaomeng Zhao authored Oct 23, 2024
```
feat: add support for non-PDF file conversion to PDF
```
  e36627be
- feat: add support for non-PDF file conversion to PDF · 4834baf4
  myhloli authored Oct 23, 2024
```
- Implement to_pdf function to convert non-PDF files to PDF format
- Integrate file upload functionality for PDF and image files
- Update UI to include file upload component and PDF preview
- Add conversion button and update its functionality to handle new file types
```
  4834baf4
- Merge pull request #776 from myhloli/add-doclayout-yolo · d1c0546a
  Xiaomeng Zhao authored Oct 23, 2024
```
build(docker): add doclayout-yolo dependency
```
  d1c0546a
- build(docker): add doclayout-yolo dependency · 2468016b
  myhloli authored Oct 23, 2024
```
- Add doclayout-yolo==0.0.2 to requirements-docker.txt
```
  2468016b
- Merge pull request #774 from myhloli/add-doclayout-yolo · 97585a72
  Xiaomeng Zhao authored Oct 23, 2024
```
build(setup): add doclayout_yolo dependency
```
  97585a72
- build(setup): add doclayout_yolo dependency · 73fe8914
  myhloli authored Oct 23, 2024
```
- Add doclayout_yolo==0.0.2 to the list of dependencies in setup.py
```
  73fe8914
- Merge pull request #773 from myhloli/add-doclayout-yolo · c1ba9dcb
  Xiaomeng Zhao authored Oct 23, 2024
```
feat(model): add support for DocLayout-YOLO model
```
  c1ba9dcb
- feat(model): add support for DocLayout-YOLO model · 1279f2cd
  myhloli authored Oct 23, 2024
```
- Add new layout model option: DocLayout-YOLO
- Implement model initialization and prediction for DocLayout-YOLO
- Update configuration options to include new model
- Modify existing code to support both LayoutLMv3 and DocLayout-YOLO models
- Update Gradio app to support more custom switches
```
  1279f2cd
- Merge pull request #769 from myhloli/add-doclayout-yolo · efb5851f
  Xiaomeng Zhao authored Oct 23, 2024
```
update: update config json
```
  efb5851f
- update: update config json · 790691d6
  myhloli authored Oct 23, 2024
  
  790691d6
21 Oct, 2024 5 commits
- Merge pull request #766 from myhloli/docs-update · d18a55ec
  Xiaomeng Zhao authored Oct 21, 2024
```
docs: Update the driver requirements on the Ubuntu system.
```
  d18a55ec
- docs: Update the driver requirements on the Ubuntu system. · 8e72f9db
  myhloli authored Oct 21, 2024
  
  8e72f9db
- fix(ocr_mkcontent): expand para_to_standard_format_v2 to handle list and index blocks · 64408576
  myhloli authored Oct 21, 2024
```
- Modified the condition to include List and Index block types
- This change enhances the function's capability to process different paragraph types
```
  64408576
- Merge pull request #765 from myhloli/add-list-group · e4904cd6
  Xiaomeng Zhao authored Oct 21, 2024
```
refactor(para): improve paragraph splitting algorithm
```
  e4904cd6
- refactor(para): improve paragraph splitting algorithm · 8cc76c49
  myhloli authored Oct 21, 2024
```
- Adjust the threshold for identifying index blocks from 3 lines to 2 lines
- Add a new function __is_list_group to detect if a group of blocks is a list
- Modify the paragraph merging logic to handle list groups differently
```
  8cc76c49
18 Oct, 2024 1 commit

refactor(magic_pdf): remove unused parameters and simplify functions · fc49f5c4

myhloli authored Oct 18, 2024

- Remove unused parameters parse_type and lang from various functions
- Simplify function calls by removing unnecessary arguments
- Update related files to reflect these changes

fc49f5c4

17 Oct, 2024 2 commits

Merge pull request #753 from myhloli/dev · fe21eebd

Xiaomeng Zhao authored Oct 17, 2024

refactor(ocr): Increase the dilation factor in OCR to address the issue of word concatenation.

fe21eebd

refactor(ocr): Increase the dilation factor in OCR to address the issue of word concatenation. · 011a1b97

myhloli authored Oct 17, 2024

- Remove unused functions such as split_long_words, ocr_mk_mm_markdown_with_para, etc.
- Simplify ocr_mk_markdown_with_para_core_v2 by removing unnecessary language detection and word splitting logic
- Remove wordninja dependency from requirements
- Update ocr_model_init to include additional parameters for OCR model configuration

011a1b97

16 Oct, 2024 4 commits
- Merge pull request #747 from myhloli/dev · 2a409845
  Xiaomeng Zhao authored Oct 16, 2024
```
update example files
```
  2a409845
- update example files · cf377ce4
  myhloli authored Oct 16, 2024
  
  cf377ce4
- Merge remote-tracking branch 'origin/dev' into dev · ab3d2d17
  myhloli authored Oct 16, 2024
  
  ab3d2d17
- docs: enhance document parsing capabilities · 237c062d
  myhloli authored Oct 16, 2024
```
- Improve reading order with model-based sorting
- Add list recognition within text
- Implement table of contents recognition
- Support table recognition
- Enhance code block and geometric shape recognition
- Address known issues in both English and Chinese READMEs
```
  237c062d
15 Oct, 2024 6 commits

Merge pull request #744 from myhloli/para-split-v3 · f50bc87b
Xiaomeng Zhao authored Oct 15, 2024
```
fix(para_split_v3): refine list block detection in paragraph splitting
```
f50bc87b

refactor(para_split_v3): refine list block detection in paragraph splitting · 81b9fd7b

myhloli authored Oct 15, 2024

- Update list block detection logic to require at least 2 numeric start lines
- Ensure the number of numeric start lines matches the number of end lines
- Remove detection of non-border starting lines for simplicity

81b9fd7b

Merge pull request #743 from myhloli/para-split-v3 · 0d83fb77
Xiaomeng Zhao authored Oct 15, 2024
```
refactor(para_split_v3): merge list and index block detection
```
0d83fb77

fix(split_v3): Fix the rule adaptation for some special list samples. · 244b8684
myhloli authored Oct 15, 2024

244b8684

refactor(pdf): adjust span filling threshold in block construction · 7e301b84

myhloli authored Oct 15, 2024

Increased the threshold for filling spans in blocks from 0.3 to 0.5 to improve the accuracy of block formation. This change helps refine the grouping of spans into blocks, potentially enhancing the overall structure and readability of the PDF content.
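
A minimal sketch of how such a span-filling threshold could work, assuming the compared quantity is the fraction of a span's area that falls inside the block. The helper names and the `(x0, y0, x1, y1)` box layout are hypothetical and not taken from the commit.

```python
def overlap_ratio_in_span(span, block):
    """Fraction of the span's area that lies inside the block.

    Both arguments are (x0, y0, x1, y1) boxes; this ratio definition is an
    assumption for the sketch.
    """
    ix0, iy0 = max(span[0], block[0]), max(span[1], block[1])
    ix1, iy1 = min(span[2], block[2]), min(span[3], block[3])
    intersection = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    span_area = (span[2] - span[0]) * (span[3] - span[1])
    return intersection / span_area if span_area > 0 else 0.0


def fill_spans_in_block(spans, block, threshold=0.5):
    """Return the spans assigned to the block (hypothetical helper)."""
    return [s for s in spans if overlap_ratio_in_span(s, block) > threshold]


block = (0, 0, 100, 100)
inside = (10, 10, 20, 20)      # fully inside the block
straddling = (90, 0, 115, 10)  # only 40% of its area is inside the block
filled_old = fill_spans_in_block([inside, straddling], block, threshold=0.3)
filled_new = fill_spans_in_block([inside, straddling], block, threshold=0.5)
```

Under these assumptions, the straddling span is attached to the block at a threshold of 0.3 but left out at 0.5, so the higher value keeps spans that mostly belong elsewhere from being pulled into the block.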

7e301b84

refactor(para_split_v3): merge list and index block detection · fdcb49d3

myhloli authored Oct 15, 2024

- Combine __is_list_block() and __is_index_block() into a single function __is_list_or_index_block()
- Simplify block type determination logic
- Remove redundant code and improve readability
- Optimize block merging process

fdcb49d3

14 Oct, 2024 4 commits

fix(magic_pdf): include List and Index block types in processing · 0a9a6d3e

myhloli authored Oct 14, 2024

Add List and Index to the list of block types being processed in the draw_bbox.py file. This inclusion ensures that these block types are handled similarly to other text-containing blocks, improving the overall document processing accuracy and consistency.

0a9a6d3e

Merge pull request #740 from myhloli/para-split-v3 · 702b6ac9
Xiaomeng Zhao authored Oct 14, 2024
```
feat(list&index block): detect and merge list and index blocks
```
702b6ac9

feat(list&index block): detect and merge list and index blocks · 1f1dd353

myhloli authored Oct 14, 2024

- Add detection for list and index blocks in OCR processing
- Implement merging of list and index blocks across pages
- Update block types to include list and index categories
- Adjust text merging logic to handle new block types
- Modify layout drawing to distinguish list and index blocks

1f1dd353

feat: manage docs with sphinx (#737) · c479245e

icecraft authored Oct 14, 2024

* feat: manage docs with sphinx

* fix: readthedocs configuration

* feat: support multiple languages

* fix: add .readthedocs.yaml

* fix: requirements.txt path

---------
Co-authored-by: icecraft <xurui1@pjlab.org.cn>

c479245e