fix(pdf-extract): adjust box threshold for OCR detection (#447)

Set the detection box threshold (`det_db_box_thresh`) explicitly when initializing the OCR model to improve the accuracy of text extraction from images. The threshold was lowered from 0.6 to 0.3. A lower threshold keeps detection boxes with lower confidence scores instead of discarding them, which is expected to improve the quality of the extracted text by recovering text regions that were previously dropped.

Unverified Commit 041b9465 authored Aug 20, 2024 by Xiaomeng Zhao Committed by GitHub Aug 20, 2024

Showing 1 changed file with 1 addition and 1 deletion

magic_pdf/model/pdf_extract_kit.py +1 -1

--- a/magic_pdf/model/pdf_extract_kit.py
+++ b/magic_pdf/model/pdf_extract_kit.py
@@ -139,7 +139,7 @@ class CustomPEKModel:
        )
        # initialize OCR
        if self.apply_ocr:
-            self.ocr_model = ModifiedPaddleOCR(show_log=show_log)
+            self.ocr_model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=0.3)

        # init structeqtable
        if self.apply_table:
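
As a rough illustration of the effect of the `det_db_box_thresh` change, the sketch below mimics a score-based box filter in plain Python. The `Candidate` class, the `filter_boxes` helper, and the example scores are all hypothetical and are not part of magic_pdf or PaddleOCR. The only assumption carried over is that a detected box is kept when its confidence score is at least the threshold.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical detected text box with a confidence score in [0, 1]."""
    text: str
    score: float


def filter_boxes(candidates, box_thresh):
    """Keep only candidates whose score is at least box_thresh (assumed rule)."""
    return [c for c in candidates if c.score >= box_thresh]


# Made-up candidates: one clear region, one faint region, one speckle.
candidates = [
    Candidate("Title", 0.92),
    Candidate("faint footnote", 0.45),
    Candidate("speckle", 0.12),
]

# Old threshold (0.6) versus the value set in this commit (0.3).
kept_old = [c.text for c in filter_boxes(candidates, 0.6)]
kept_new = [c.text for c in filter_boxes(candidates, 0.3)]

print(kept_old)  # ['Title']
print(kept_new)  # ['Title', 'faint footnote']
```

Under these assumptions, the lower threshold retains the faint region that 0.6 discarded while still rejecting the lowest-scoring candidate. The trade-off is that some low-confidence boxes that are not real text may also pass.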