Qin Kaijie / pdf-miner
Commit 877160a7, authored Apr 10, 2024 by 赵小蒙
fix parse_union_pdf logic
parent 51bb3b36
Showing 1 changed file with 13 additions and 3 deletions
magic_pdf/spark/spark_api.py
@@ -30,6 +30,9 @@ def parse_txt_pdf(pdf_bytes:bytes, pdf_models:list, imageWriter: AbsReaderWriter
         start_page_id=start_page,
         debug_mode=is_debug,
     )
+    pdf_info_dict["parse_type"] = "txt"
     return pdf_info_dict
@@ -44,6 +47,9 @@ def parse_ocr_pdf(pdf_bytes:bytes, pdf_models:list, imageWriter: AbsReaderWrite
         start_page_id=start_page,
         debug_mode=is_debug,
     )
+    pdf_info_dict["parse_type"] = "ocr"
     return pdf_info_dict
@@ -65,12 +71,16 @@ def parse_union_pdf(pdf_bytes:bytes, pdf_models:list, imageWriter: AbsReaderWri
             return None
     pdf_info_dict = parse_pdf(parse_pdf_by_txt)
     if pdf_info_dict is None or pdf_info_dict.get("need_drop", False):
         logger.warning(f"parse_pdf_by_txt drop or error, switch to parse_pdf_by_ocr")
         pdf_info_dict = parse_pdf(parse_pdf_by_ocr)
         if pdf_info_dict is None:
             raise Exception("Both parse_pdf_by_txt and parse_pdf_by_ocr failed.")
+        else:
+            pdf_info_dict["parse_type"] = "ocr"
+    else:
+        pdf_info_dict["parse_type"] = "txt"
     return pdf_info_dict
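The fallback that this commit tags with "parse_type" can be tried in isolation. The following is a minimal sketch, not the project's code: `parse_with_fallback` and the `fake_*` parsers are hypothetical stand-ins for `parse_union_pdf`, `parse_pdf_by_txt` and `parse_pdf_by_ocr`, and it assumes the inner `parse_pdf` helper returns None when a parser raises, as the `return None` context line suggests.

```python
from typing import Callable, Optional


def parse_with_fallback(
    primary: Callable[[], Optional[dict]],
    fallback: Callable[[], Optional[dict]],
) -> dict:
    """Try `primary`; use `fallback` if it fails or marks the result need_drop."""

    def run(method: Callable[[], Optional[dict]]) -> Optional[dict]:
        # Assumption: a parser exception is swallowed and reported as None.
        try:
            return method()
        except Exception:
            return None

    result = run(primary)
    if result is None or result.get("need_drop", False):
        result = run(fallback)
        if result is None:
            raise RuntimeError("Both parsers failed.")
        # Only reached when the fallback produced a result.
        result["parse_type"] = "ocr"
    else:
        result["parse_type"] = "txt"
    return result


# Hypothetical parsers used only to exercise each branch.
def fake_txt_ok() -> dict:
    return {"pages": 3}


def fake_txt_dropped() -> dict:
    return {"need_drop": True}


def fake_ocr_ok() -> dict:
    return {"pages": 3}


def fake_broken() -> dict:
    raise ValueError("simulated parser crash")


if __name__ == "__main__":
    print(parse_with_fallback(fake_txt_ok, fake_ocr_ok)["parse_type"])
    print(parse_with_fallback(fake_txt_dropped, fake_ocr_ok)["parse_type"])
    print(parse_with_fallback(fake_broken, fake_ocr_ok)["parse_type"])
```

The tag is set on every path that returns: "txt" when the first parser succeeds, and "ocr" when the second one had to take over.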