Qin Kaijie / pdf-miner
Commit 3711a333, authored May 24, 2024 by 赵小蒙
add garbled_rate too large process logic
parent 543828c2
Showing 1 changed file with 9 additions and 3 deletions
magic_pdf/user_api.py
@@ -12,6 +12,8 @@
 Everything else, such as constructing s3cli and obtaining ak and sk, is implemented in code-clean. Do not introduce reverse dependencies!!!
 """
+import re
 from loguru import logger
 from magic_pdf.rw import AbsReaderWriter
@@ -87,13 +89,17 @@ def parse_union_pdf(pdf_bytes: bytes, pdf_models: list, imageWriter: AbsReaderWr
                         text_all += span['content']
     def calculate_garbled_rate(text):
-        printable = sum(1 for c in text if c.isprintable())
+        garbage_regex = re.compile(r'[^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a\u3000-\u303f\uff00-\uffef]')
+        # Count the garbled characters
+        garbage_count = len(garbage_regex.findall(text))
         total = len(text)
         if total == 0:
             return 0  # Avoid a division-by-zero error
-        return (total - printable) / total
+        return garbage_count / total
+    garbled_rate = calculate_garbled_rate(text_all)
-    if pdf_info_dict is None or pdf_info_dict.get("_need_drop", False) or calculate_garbled_rate(text_all) < 0.5:
+    if pdf_info_dict is None or pdf_info_dict.get("_need_drop", False) or garbled_rate > 0.8:
         logger.warning(f"parse_pdf_by_txt drop or error or garbled_rate too large, switch to parse_pdf_by_ocr")
         pdf_info_dict = parse_pdf(parse_pdf_by_ocr)
         if pdf_info_dict is None:
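The new garbled-rate check can be tried in isolation. The following is a minimal sketch, assuming the same character class and the same 0.8 threshold as the diff. `GARBAGE_REGEX` and `should_switch_to_ocr` are hypothetical names introduced here for illustration and are not part of `magic_pdf`; the regex is also compiled once at module level rather than on every call, which is a change from the committed code.

```python
import re

# Same character class as in the diff: anything outside CJK ideographs
# (U+4E00-U+9FA5), ASCII digits and letters, CJK punctuation (U+3000-U+303F),
# and half/full-width forms (U+FF00-U+FFEF) is counted as garbled.
# GARBAGE_REGEX is a hypothetical module-level name, not from the commit.
GARBAGE_REGEX = re.compile(
    r'[^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a\u3000-\u303f\uff00-\uffef]'
)


def calculate_garbled_rate(text: str) -> float:
    """Return the fraction of characters in text that match GARBAGE_REGEX."""
    total = len(text)
    if total == 0:
        return 0.0  # avoid division by zero on empty text
    garbage_count = len(GARBAGE_REGEX.findall(text))
    return garbage_count / total


def should_switch_to_ocr(text: str, threshold: float = 0.8) -> bool:
    """Hypothetical helper mirroring the `garbled_rate > 0.8` test in the diff."""
    return calculate_garbled_rate(text) > threshold


if __name__ == "__main__":
    print(calculate_garbled_rate("中文abc123"))
    print(calculate_garbled_rate("ab\ufffd\ufffd"))
    print(should_switch_to_ocr("\ufffd" * 9 + "a"))
```

Under this character class, ASCII spaces and ASCII punctuation also count as garbled, so ordinary English prose yields a nonzero rate; that may be one reason the threshold in the diff is as high as 0.8.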