- 02 Nov, 2024 5 commits
-
-
Xiaomeng Zhao authored
docs(tutorial): update magic-pdf command with output directory
-
myhloli authored
- Add '-o ./output' flag to magic-pdf command in multiple documentation files
-
Xiaomeng Zhao authored
feat(list): improve list detection algorithm & fix(list): improve list identification accuracy
-
myhloli authored
feat(list): improve list detection algorithm - Add center_close_num and external_sides_not_close_num variables to analyze line positioning - Implement new list detection condition for centered lines - Enhance existing list detection logic with additional checks
-
myhloli authored
fix(list): improve list identification accuracy - Adjust the threshold for determining right-side spacing to 0.26 * block_weight - Add TODO comment for special list identification with all centered lines - Modify the condition for recognizing short item lists with left alignment - Update the condition for identifying the end of a list item
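The right-side spacing threshold mentioned above can be illustrated with a minimal sketch. It assumes that `block_weight` is the block's horizontal extent (x1 - x0) and that a line counts as "not close" to the right edge when the gap exceeds 0.26 of that extent. The function name and the bounding-box layout `[x0, y0, x1, y1]` are hypothetical and are not taken from the project code.

```python
def line_leaves_right_gap(line_bbox, block_bbox, ratio=0.26):
    """Return True when the gap between the line's right edge and the
    block's right edge is larger than ratio * block_weight.

    Assumption: block_weight is the block's horizontal extent (x1 - x0).
    Bounding boxes are [x0, y0, x1, y1].
    """
    block_weight = block_bbox[2] - block_bbox[0]
    right_gap = block_bbox[2] - line_bbox[2]
    return right_gap > ratio * block_weight


block = [0, 0, 100, 50]
short_line = [0, 0, 70, 10]   # ends 30 units before the block's right edge
long_line = [0, 12, 80, 22]   # ends 20 units before the block's right edge
short_result = line_leaves_right_gap(short_line, block)
long_result = line_leaves_right_gap(long_line, block)
```

A line that ends well short of the right edge is a typical hint that a list item has finished, which is presumably why the commit tunes this ratio.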
-
- 01 Nov, 2024 15 commits
-
-
Xiaomeng Zhao authored
fix(ocr_mkcontent): improve content handling for different languages and equation types
-
myhloli authored
- Include InlineEquation in the condition for handling text content - Remove separate block for InlineEquation processing - Ensures consistent handling of inline equations and text, improving content formatting
-
myhloli authored
fix(ocr_mkcontent): improve content handling for different languages and equation types - Adjust content formatting for Chinese, Japanese, Korean, and Western languages - Implement proper spacing rules around inline equations - Remove unnecessary empty lines in paragraph text
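One possible reading of the language-dependent spacing rules is sketched below. This is an illustration only, not the code in `ocr_mkcontent.py`: the function name, the `(kind, text)` span layout, the language codes, and the exact rules (no separator between CJK text spans, a space after Western text spans, spaces around inline equations) are all assumptions.

```python
import re

# Assumption: languages written without spaces between words.
CJK_LANGS = {"zh", "ja", "ko"}


def join_spans(spans, lang):
    """Join (kind, text) spans into one paragraph string.

    kind is 'text' or 'inline_equation' (hypothetical labels).
    """
    parts = []
    for kind, text in spans:
        if kind == "inline_equation":
            # Pad inline equations with spaces on both sides.
            parts.append(f" ${text}$ ")
        elif lang in CJK_LANGS:
            # CJK text spans are concatenated without a separator.
            parts.append(text)
        else:
            # Western text spans are separated by a single space.
            parts.append(text + " ")
    # Collapse repeated spaces and trim the ends.
    return re.sub(r" {2,}", " ", "".join(parts)).strip()


zh_result = join_spans(
    [("text", "设"), ("inline_equation", "x"), ("text", "为变量")], "zh"
)
en_result = join_spans(
    [("text", "Let"), ("inline_equation", "x"), ("text", "be a variable")], "en"
)
```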
-
Xiaomeng Zhao authored
Feat/tune docs
-
Xiaomeng Zhao authored
feat(pdf_parse): improve span filtering and add new block types
-
myhloli authored
- Refactor remove_outside_spans function to filter spans more accurately - Add image_footnote, index, and list block types to output file documentation - Update draw_span_bbox to use preproc_blocks instead of para_blocks - Bump version to 0.9.0
-
xu rui authored
-
xu rui authored
-
xu rui authored
-
icecraft authored
-
xu rui authored
-
Xiaomeng Zhao authored
fix(pdf_parse): improve span removal logic for all content types
-
myhloli authored
- Update remove_outside_spans function to handle all content types - Add processing for text and equation spans - Improve overlap calculation for better accuracy
-
myhloli authored
- Update remove_outside_spans function to handle all content types - Add processing for text and equation spans - Improve overlap calculation for better accuracy
-
myhloli authored
- Update remove_outside_spans function to handle all content types - Add processing for text and equation spans - Improve overlap calculation for better accuracy
-
- 31 Oct, 2024 2 commits
-
-
Xiaomeng Zhao authored
fix(pdf_parse): optimize span processing by removing outside spans
-
myhloli authored
- Add new function `remove_outside_spans` to filter spans based on image and table blocks - Reorder span processing steps to improve efficiency - Update imports to include `calculate_overlap_area_in_bbox1_area_ratio`
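The filtering idea behind `remove_outside_spans` can be sketched as follows. This is a simplified, self-contained illustration rather than the project's implementation: the helper below only mimics what `calculate_overlap_area_in_bbox1_area_ratio` is presumed to compute, and the dictionary layout, the function names, and the 0.5 threshold are assumptions.

```python
def overlap_ratio_in_bbox1(bbox1, bbox2):
    """Share of bbox1's area covered by bbox2, from 0.0 to 1.0.

    Bounding boxes are [x0, y0, x1, y1].
    """
    x0 = max(bbox1[0], bbox2[0])
    y0 = max(bbox1[1], bbox2[1])
    x1 = min(bbox1[2], bbox2[2])
    y1 = min(bbox1[3], bbox2[3])
    if x1 <= x0 or y1 <= y0:
        return 0.0  # no intersection
    area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    if area1 <= 0:
        return 0.0  # degenerate box
    return (x1 - x0) * (y1 - y0) / area1


def drop_spans_outside_blocks(spans, blocks, min_ratio=0.5):
    """Keep an image/table span only if it mostly lies inside a block
    of the same type; spans of other types pass through unchanged."""
    kept = []
    for span in spans:
        if span["type"] not in ("image", "table"):
            kept.append(span)
            continue
        inside = any(
            block["type"] == span["type"]
            and overlap_ratio_in_bbox1(span["bbox"], block["bbox"]) > min_ratio
            for block in blocks
        )
        if inside:
            kept.append(span)
    return kept


blocks = [{"type": "image", "bbox": [0, 0, 20, 20]}]
spans = [
    {"type": "image", "bbox": [0, 0, 10, 10]},        # inside the image block
    {"type": "image", "bbox": [100, 100, 110, 110]},  # far outside any block
    {"type": "text", "bbox": [0, 0, 5, 5]},           # not filtered in this sketch
]
kept_spans = drop_spans_outside_blocks(spans, blocks)
```

Running the filter early, before later span-processing steps, means those steps see fewer spans, which matches the efficiency motivation stated in the commit.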
-
- 30 Oct, 2024 6 commits
-
-
Xiaomeng Zhao authored
-
liukaiwen authored
-
Xiaomeng Zhao authored
fix(magic_pdf): handle missing image_path in spans
-
myhloli authored
# Conflicts: # magic_pdf/dict2md/ocr_mkcontent.py
-
myhloli authored
- Add check for 'image_path' in spans to avoid errors when it's missing - Update image handling in both paragraph text and content dictionary - Improve error handling and make the code more robust
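A guard of the kind described here might look like the sketch below. The function name and the Markdown output format are hypothetical; the only detail taken from the commit message is that a span may lack the `image_path` key.

```python
def image_markdown(span):
    """Return a Markdown image link for a span, or an empty string
    when the span carries no usable 'image_path'."""
    image_path = span.get("image_path")
    if not image_path:
        # Missing or empty path: emit nothing instead of raising KeyError.
        return ""
    return f"![]({image_path})"


with_path = image_markdown({"type": "image", "image_path": "images/a.jpg"})
without_path = image_markdown({"type": "image"})
```

Using `dict.get` keeps the lookup and the missing-key check in one place, so callers that build paragraph text or a content dictionary can share the same helper.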
-
myhloli authored
- Update image content extraction to iterate through all spans in a block - Add support for extracting table content from spans within a block - Handle multiple content types within table spans (latex, html, image) - Refactor code to be more modular and easier to maintain
-
- 29 Oct, 2024 2 commits
-
-
Xiaomeng Zhao authored
(docs&build): switch to Aliyun PyPI mirror
-
myhloli authored
- Update PyPI mirror from Tsinghua to Aliyun in multiple Dockerfiles and installation scripts - This change may improve package download speed and reliability for users in China
-
- 28 Oct, 2024 10 commits
-
-
Xiaomeng Zhao authored
docs(README): update model download instructions for PDF-Extract-Kit 1.0
-
myhloli authored
- Update README.md and README_zh-CN.md to include new model download instructions - Provide detailed steps on how to download models after PDF-Extract-Kit 1.0 repository change - Emphasize the need to re-download models due to repository change
-
myhloli authored
- Update README.md and README_zh-CN.md to include new model download instructions - Provide detailed steps on how to download models after PDF-Extract-Kit 1.0 repository change - Emphasize the need to re-download models due to repository change
-
Xiaomeng Zhao authored
refactor(table): disable StructEqTable support and add TableMaster support
-
myhloli authored
- Remove import and usage of StructTableModel - Add support for TableMaster model - Update table model initialization logic to support TableMaster - Log error and exit if StructEqTable is selected, as it's under upgrade - Update README files to reflect changes in table parsing capabilities
-
Xiaomeng Zhao authored
fix: add priority match rule
-
icecraft authored
-
Xiaomeng Zhao authored
perf: table model update with PP OCRv4
-
liukaiwen authored
-
liukaiwen authored
-