refactor(ocr): Increase the dilation factor in OCR to address word concatenation.
- Remove unused functions such as split_long_words, ocr_mk_mm_markdown_with_para, etc.
- Simplify ocr_mk_markdown_with_para_core_v2 by removing unnecessary language detection and word splitting logic
- Remove wordninja dependency from requirements
- Update ocr_model_init to include additional parameters for OCR model configuration
@@ -8,7 +8,6 @@ pdfminer.six==20231228
 pydantic>=2.7.2,<2.8.0
 PyMuPDF>=1.24.9
 scikit-learn>=1.0.2
-wordninja>=2.0.0
 torch>=2.2.2,<=2.3.1
 transformers
 # The requirements.txt must ensure that only necessary external dependencies are introduced. If there are new dependencies to add, please contact the project administrator.
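
Dropping wordninja from the requirements is only safe if no module still imports it. The following is a minimal sketch of how a reviewer might check that, not part of this commit: the helper name `find_imports` and the idea of scanning from the repository root are assumptions for illustration. It uses only the Python standard library.

```python
import ast
from pathlib import Path


def find_imports(root, module):
    """Return (path, line) pairs for every import of `module` under `root`.

    Hypothetical review helper; it is not part of the project.
    """
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            # Skip files that cannot be parsed as Python source.
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                names = [node.module or ""]
            else:
                continue
            # Match the module itself or any of its submodules.
            if any(n == module or n.startswith(module + ".") for n in names):
                hits.append((str(path), node.lineno))
    return hits


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for file_path, line_no in find_imports(".", "wordninja"):
        print(f"{file_path}:{line_no}: still imports wordninja")
```

An empty result suggests the dependency can be removed without breaking imports. Dynamic imports (for example via `importlib.import_module`) are not detected by this static scan.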