Qin Kaijie / pdf-miner
Commit 97a4e473, authored May 28, 2024 by 赵小蒙
change garbled rate check from not_common_character_rate to not_printable_rate
parent 5de37224
Showing 1 changed file with 12 additions and 3 deletions
magic_pdf/user_api.py (+12, -3)
@@ -88,7 +88,7 @@ def parse_union_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWr
             for span in line['spans']:
                 text_all += span['content']
 
-    def calculate_garbled_rate(text):
+    def calculate_not_common_character_rate(text):
         garbage_regex = re.compile(r'[^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a\u3000-\u303f\uff00-\uffef]')
         # Count the garbled characters
         garbage_count = len(garbage_regex.findall(text))
@@ -97,9 +97,18 @@ def parse_union_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWr
             return 0  # avoid division by zero
         return garbage_count / total
 
-    garbled_rate = calculate_garbled_rate(text_all)
+    def calculate_not_printable_rate(text):
+        printable = sum(1 for c in text if c.isprintable())
+        total = len(text)
+        if total == 0:
+            return 0  # avoid division by zero
+        return (total - printable) / total
 
-    if pdf_info_dict is None or pdf_info_dict.get("_need_drop", False) or garbled_rate > 0.8:
+    # not_common_character_rate = calculate_not_common_character_rate(text_all)
+    not_printable_rate = calculate_not_printable_rate(text_all)
+    # In tests on garbled PDFs: not_common_character_rate > 0.9, not_printable_rate > 0.1
+    # not_common_character_rate may misfire on less widely used languages; not_printable_rate is friendlier to them
+    if pdf_info_dict is None or pdf_info_dict.get("_need_drop", False) or not_printable_rate > 0.1:
         logger.warning(f"parse_pdf_by_txt drop or error or garbled_rate too large, switch to parse_pdf_by_ocr")
         pdf_info_dict = parse_pdf(parse_pdf_by_ocr)
         if pdf_info_dict is None:
...
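For illustration, the two metrics in this commit can be compared side by side on short strings. This is a minimal sketch, not the project's code: the regex is copied from the diff, but the top-level function names, the sample strings, and the `total = len(text)` step in the first function (elided between the two hunks) are assumptions.

```python
import re

# Regex copied from the diff: anything outside CJK ideographs, ASCII digits
# and letters, CJK punctuation, and full-width forms counts as "not common".
NOT_COMMON_RE = re.compile(
    r'[^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a\u3000-\u303f\uff00-\uffef]'
)


def not_common_character_rate(text: str) -> float:
    """Share of characters outside the whitelisted ranges (old check)."""
    total = len(text)  # assumption: the elided lines compute the total this way
    if total == 0:
        return 0.0  # avoid division by zero
    return len(NOT_COMMON_RE.findall(text)) / total


def not_printable_rate(text: str) -> float:
    """Share of characters for which str.isprintable() is False (new check)."""
    total = len(text)
    if total == 0:
        return 0.0  # avoid division by zero
    printable = sum(1 for c in text if c.isprintable())
    return (total - printable) / total


if __name__ == "__main__":
    # Hypothetical samples: Chinese, Russian, and a string with control bytes.
    samples = {
        "chinese": "这是一个测试",
        "russian": "Привет",
        "control": "\x00\x01ab\x02\x03",
    }
    for name, text in samples.items():
        print(name, not_common_character_rate(text), not_printable_rate(text))
```

With these definitions the Russian sample scores 1.0 on the character-range metric and 0.0 on the printable metric, which is consistent with the diff's comment that the old check can misfire on other languages. One caveat worth keeping in mind: `str.isprintable()` returns False for newlines and tabs, so text that retains many line breaks would raise `not_printable_rate`.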