1. 20 Aug, 2024 4 commits
    • feat: rename the file generated by command line tools (#401) · c9a51491
      icecraft authored
      * feat: rename the file generated by command line tools
      
      * feat: add pdf filename as prefix to {span,layout,model}.pdf
      
      ---------
      Co-authored-by: icecraft <tmortred@gmail.com>
      Co-authored-by: icecraft <xurui1@pjlab.org.cn>
    • fix(pdf-extract): adjust box threshold for OCR detection (#447) · 041b9465
      Xiaomeng Zhao authored
      Tuned the detection box threshold parameter in the OCR model initialization to improve the
      accuracy of text extraction from images. The threshold was modified from 0.6 to
      0.3 to filter out smaller detection boxes, which is expected to enhance the quality of the extracted
      text by reducing noise and false positives in the OCR process.
    • fix(self_modify): merge detection boxes for optimized text region detection (#448) · 3da5c411
      Xiaomeng Zhao authored
      Merge adjacent and overlapping detection boxes to optimize text region detection in
      the document. Post-processing of text boxes is enhanced by consolidating them into
      larger text lines, taking into account their vertical and horizontal alignment. This
      improvement reduces fragmentation and improves the readability of detected text blocks.
    • fix(ocr_mkcontent): improve language detection and content formatting (#458) · 66e3ce9c
      Xiaomeng Zhao authored
      Optimize the language detection logic to enhance content formatting. This
      change addresses issues with long word segmentation. Language detection now uses a
      threshold to determine the language of a text based on the proportion of English characters.
      Formatting rules for content have been updated to consider a list of languages (initially
      including Chinese, Japanese, and Korean) where no space is added between content segments
      for inline equations and text spans, improving the handling of Asian languages.
      
      The impact of these changes includes improved accuracy in language detection, better
      segmentation of long words, and more appropriate spacing in content formatting for multiple
      languages.
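The language detection and spacing rule described in #458 can be illustrated with a minimal sketch. This is not the project's actual code: the names `detect_lang`, `join_segments`, `NO_SPACE_LANGS`, the 0.5 threshold, and the use of "zh" as a stand-in for any non-English text are all assumptions made for illustration.

```python
# Hypothetical sketch; names, threshold, and language codes are assumptions.
NO_SPACE_LANGS = {"zh", "ja", "ko"}  # Chinese, Japanese, Korean


def detect_lang(text, en_threshold=0.5):
    """Classify text by the proportion of ASCII letters among non-space characters.

    Returns "en" when that proportion reaches en_threshold, otherwise "zh"
    (used here as a placeholder for "not English").
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return "en"  # arbitrary default for empty input
    en_count = sum(1 for c in chars if c.isascii() and c.isalpha())
    return "en" if en_count / len(chars) >= en_threshold else "zh"


def join_segments(segments, lang):
    """Join text spans and inline equations.

    No separator is used for languages in NO_SPACE_LANGS; a single space
    is used for every other language.
    """
    sep = "" if lang in NO_SPACE_LANGS else " "
    return sep.join(s.strip() for s in segments if s.strip())


if __name__ == "__main__":
    zh_parts = ["质能方程", "$E=mc^2$", "很有名"]
    en_parts = ["The equation", "$E=mc^2$", "is famous"]
    print(join_segments(zh_parts, detect_lang("质能方程很有名")))
    print(join_segments(en_parts, detect_lang("The equation is famous")))
```

Passing the detected language into `join_segments` keeps the spacing rule separate from the detection heuristic, so the list of no-space languages can grow without touching the detector.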
  2. 16 Aug, 2024 3 commits
  3. 13 Aug, 2024 1 commit
  4. 12 Aug, 2024 2 commits
  5. 10 Aug, 2024 1 commit
  6. 09 Aug, 2024 29 commits
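The box merging described in commit 3da5c411 (#448) might look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's implementation: boxes are assumed to be `(x0, y0, x1, y1)` tuples, and `merge_boxes_into_lines`, `min_v_overlap`, and `max_h_gap` are hypothetical names and parameters.

```python
# Hypothetical sketch; box format, names, and thresholds are assumptions.

def vertical_overlap_ratio(a, b):
    """Vertical overlap of two boxes divided by the height of the shorter one."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    min_height = min(a[3] - a[1], b[3] - b[1])
    return max(overlap, 0) / min_height if min_height > 0 else 0.0


def merge_boxes_into_lines(boxes, min_v_overlap=0.5, max_h_gap=10):
    """Greedily merge boxes that share a text line and are horizontally close.

    A box joins an existing line when the two overlap vertically by at least
    min_v_overlap and the horizontal gap between them is at most max_h_gap
    (the gap is negative when the boxes overlap horizontally).
    This is a single pass; a fuller version could repeat until nothing changes.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        for i, line in enumerate(lines):
            h_gap = max(box[0], line[0]) - min(box[2], line[2])
            if vertical_overlap_ratio(box, line) >= min_v_overlap and h_gap <= max_h_gap:
                # Grow the line to the union of both rectangles.
                lines[i] = (
                    min(line[0], box[0]),
                    min(line[1], box[1]),
                    max(line[2], box[2]),
                    max(line[3], box[3]),
                )
                break
        else:
            lines.append(tuple(box))
    return lines


if __name__ == "__main__":
    detected = [(0, 0, 10, 10), (12, 1, 25, 11), (0, 30, 20, 40)]
    print(merge_boxes_into_lines(detected))
```

The first two boxes sit on the same line with a 2-pixel gap, so they collapse into one line box, while the third stays separate because it does not overlap them vertically.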