    fix(ocr_mkcontent): improve language detection and content formatting (#458) · 66e3ce9c
    Xiaomeng Zhao authored
    Optimize the language detection logic to improve content formatting and fix
    issues with long-word segmentation. Language detection now decides the language of a
    text by comparing its proportion of English characters against a threshold.
    The content formatting rules now consult a list of languages (initially Chinese,
    Japanese, and Korean) for which no space is added between content segments when
    joining inline equations and text spans, which improves the handling of Asian languages.
    
    Together, these changes improve the accuracy of language detection, the
    segmentation of long words, and the spacing applied when formatting content in
    multiple languages.
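
The two rules described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `magic_pdf`: the function names, the `0.5` threshold, and the `zh`/`ja`/`ko` language codes are all assumptions, since the message names neither the actual threshold nor the identifiers used.

```python
# Hypothetical values: the commit message does not give the real threshold
# or the language codes used in the repository.
ENGLISH_RATIO_THRESHOLD = 0.5
NO_SPACE_LANGS = {"zh", "ja", "ko"}


def is_mostly_english(text: str, threshold: float = ENGLISH_RATIO_THRESHOLD) -> bool:
    """Return True when ASCII letters make up at least `threshold`
    of the non-whitespace characters in `text`."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    english = sum(1 for c in chars if c.isascii() and c.isalpha())
    return english / len(chars) >= threshold


def join_spans(spans: list[str], lang: str) -> str:
    """Join text spans and inline equations, omitting the separating
    space for languages listed in NO_SPACE_LANGS."""
    separator = "" if lang in NO_SPACE_LANGS else " "
    cleaned = [s.strip() for s in spans if s.strip()]
    return separator.join(cleaned)
```

With these assumptions, `join_spans(["速度", "$v$", "很快"], "zh")` joins the segments with no spaces, while the same call with `"en"` separates them with single spaces. Keeping the no-space languages in a set means further languages can be added without touching the joining logic.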
Name
.github
demo
docs
magic_pdf
signatures/version1
tests
.gitignore
.pre-commit-config.yaml
Dockerfile
LICENSE.md
MinerU_CLA.md
README.md
README.md.bak
README_ja-JP.md
README_zh-CN.md
README_zh-CN.md.bak
magic-pdf.template.json
requirements-qa.txt
requirements.txt
setup.py
update_version.py