Qin Kaijie / pdf-miner
Commit 8a179269, authored May 07, 2024 by 赵小蒙
update draw_span_bbox logic
parent 413a9df2

Showing 1 changed file with 23 additions and 28 deletions

magic_pdf/libs/draw_bbox.py (+23 -28)
@@ -151,6 +151,25 @@ def draw_span_bbox(pdf_info, pdf_bytes, out_path):
     dropped_list = []
     next_page_text_list = []
     next_page_inline_equation_list = []
+
+    def get_span_info(span):
+        if span["type"] == ContentType.Text:
+            if span.get(CROSS_PAGE, False):
+                next_page_text_list.append(span["bbox"])
+            else:
+                page_text_list.append(span["bbox"])
+        elif span["type"] == ContentType.InlineEquation:
+            if span.get(CROSS_PAGE, False):
+                next_page_inline_equation_list.append(span["bbox"])
+            else:
+                page_inline_equation_list.append(span["bbox"])
+        elif span["type"] == ContentType.InterlineEquation:
+            page_interline_equation_list.append(span["bbox"])
+        elif span["type"] == ContentType.Image:
+            page_image_list.append(span["bbox"])
+        elif span["type"] == ContentType.Table:
+            page_table_list.append(span["bbox"])
+
     for page in pdf_info:
         page_text_list = []
         page_inline_equation_list = []
@@ -162,10 +181,10 @@ def draw_span_bbox(pdf_info, pdf_bytes, out_path):
         # Move cross-page spans into the next page's lists
         if len(next_page_text_list) > 0:
             page_text_list.extend(next_page_text_list)
-            next_page_text_list = []
+            next_page_text_list.clear()
         if len(next_page_inline_equation_list) > 0:
             page_inline_equation_list.extend(next_page_inline_equation_list)
-            next_page_inline_equation_list = []
+            next_page_inline_equation_list.clear()
 
         # Build dropped_list
         for block in page["discarded_blocks"]:
@@ -183,36 +202,12 @@ def draw_span_bbox(pdf_info, pdf_bytes, out_path):
             ]:
                 for line in block["lines"]:
                     for span in line["spans"]:
-                        if span["type"] == ContentType.Text:
-                            if span.get(CROSS_PAGE, False):
-                                next_page_text_list.append(span["bbox"])
-                            else:
-                                page_text_list.append(span["bbox"])
-                        elif span["type"] == ContentType.InlineEquation:
-                            if span.get(CROSS_PAGE, False):
-                                next_page_inline_equation_list.append(span["bbox"])
-                            else:
-                                page_inline_equation_list.append(span["bbox"])
-                        elif span["type"] == ContentType.InterlineEquation:
-                            page_interline_equation_list.append(span["bbox"])
-                        elif span["type"] == ContentType.Image:
-                            page_image_list.append(span["bbox"])
-                        elif span["type"] == ContentType.Table:
-                            page_table_list.append(span["bbox"])
+                        get_span_info(span)
             elif block["type"] in [BlockType.Image, BlockType.Table]:
                 for sub_block in block["blocks"]:
                     for line in sub_block["lines"]:
                         for span in line["spans"]:
-                            if span["type"] == ContentType.Text:
-                                page_text_list.append(span["bbox"])
-                            elif span["type"] == ContentType.InlineEquation:
-                                page_inline_equation_list.append(span["bbox"])
-                            elif span["type"] == ContentType.InterlineEquation:
-                                page_interline_equation_list.append(span["bbox"])
-                            elif span["type"] == ContentType.Image:
-                                page_image_list.append(span["bbox"])
-                            elif span["type"] == ContentType.Table:
-                                page_table_list.append(span["bbox"])
+                            get_span_info(span)
         text_list.append(page_text_list)
         inline_equation_list.append(page_inline_equation_list)
         interline_equation_list.append(page_interline_equation_list)
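
Note on the pattern in this change: the two duplicated if/elif chains are folded into one nested helper that closes over the per-page lists, and spans flagged as cross-page are parked in a carry-over list that is merged into the following page. The sketch below is a minimal, standalone illustration of that pattern and is not the project's code. The names `collect_text_bboxes` and `route_span`, the string constants `TEXT` and `CROSS_PAGE`, and the flat `page["spans"]` input shape are all assumptions made for the example; the real function walks blocks, lines and spans and handles more span types.

```python
# Hypothetical stand-ins for the project's constants (assumed values).
TEXT = "text"
CROSS_PAGE = "cross_page"


def collect_text_bboxes(pages):
    """Group text-span bboxes per page, deferring cross-page spans to the next page."""
    text_list = []            # one list of bboxes per page
    next_page_text_list = []  # carry-over for spans flagged as cross-page

    def route_span(span):
        # page_text_list is looked up when route_span is called, so each call
        # sees the list bound for the page currently being processed.
        if span["type"] == TEXT:
            if span.get(CROSS_PAGE, False):
                next_page_text_list.append(span["bbox"])
            else:
                page_text_list.append(span["bbox"])

    for page in pages:
        page_text_list = []
        # Merge spans carried over from the previous page, then empty the
        # carry-over list in place (extend has already copied the items).
        if next_page_text_list:
            page_text_list.extend(next_page_text_list)
            next_page_text_list.clear()
        for span in page["spans"]:
            route_span(span)
        text_list.append(page_text_list)
    return text_list


pages = [
    {"spans": [
        {"type": TEXT, "bbox": [0, 0, 1, 1]},
        {"type": TEXT, "bbox": [2, 2, 3, 3], CROSS_PAGE: True},
    ]},
    {"spans": [{"type": TEXT, "bbox": [4, 4, 5, 5]}]},
]
result = collect_text_bboxes(pages)
print(result)
```

Two behaviours are worth checking against the real function. In this sketch a cross-page span on the last page is never drawn, because no later page exists to receive it. Also, since the Image/Table branch now goes through the same helper, text and inline-equation spans inside those blocks are subject to the cross-page check, which the removed code did not apply there.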