Commit 11bd9432 (Unverified)
Authored Nov 01, 2024 by Xiaomeng Zhao
Committed by GitHub, Nov 01, 2024
Merge pull request #831 from opendatalab/dev
fix(pdf_parse): improve span removal logic for all content types
Parents: 4e685524, 73afb7d6
Showing 1 changed file with 8 additions and 2 deletions
magic_pdf/pdf_parse_union_core_v2.py (view file @ 11bd9432)
@@ -385,17 +385,20 @@ def revert_group_blocks(blocks):
 def remove_outside_spans(spans, all_bboxes):
     image_bboxes = []
     table_bboxes = []
+    other_block_bboxes = []
     for block in all_bboxes:
         block_type = block[7]
         block_bbox = block[0:4]
         if block_type == BlockType.ImageBody:
             image_bboxes.append(block_bbox)
         elif block_type == BlockType.TableBody:
             table_bboxes.append(block_bbox)
         else:
-            continue
+            other_block_bboxes.append(block_bbox)
     new_spans = []
     for span in spans:
         if span['type'] == ContentType.Image:
             for block_bbox in image_bboxes:
@@ -408,7 +411,10 @@ def remove_outside_spans(spans, all_bboxes):
                     new_spans.append(span)
                     break
         else:
-            new_spans.append(span)
+            for block_bbox in other_block_bboxes:
+                if calculate_overlap_area_in_bbox1_area_ratio(span['bbox'], block_bbox) > 0.5:
+                    new_spans.append(span)
+                    break
     return new_spans
...
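Read together, the two hunks appear to change how spans that are neither images nor tables are handled. Previously they were kept unconditionally. Now they are kept only when more than half of their area lies inside a block that is not an image body or a table body. The sketch below is a minimal, self-contained illustration of that behaviour, not the project's code. The `BlockType` and `ContentType` enums, the `make_block` helper, the `threshold` parameter and the `overlap_ratio_in_bbox1` function are stand-ins written for this example. In particular, `overlap_ratio_in_bbox1` assumes that `calculate_overlap_area_in_bbox1_area_ratio` returns the intersection area divided by the area of its first argument, and the table branch is assumed to mirror the image branch, because the diff elides it.

```python
from enum import Enum


class BlockType(Enum):
    # Stand-in values; the real enum lives elsewhere in the project.
    ImageBody = "image_body"
    TableBody = "table_body"
    Text = "text"


class ContentType(Enum):
    # Stand-in values; the real enum lives elsewhere in the project.
    Image = "image"
    Table = "table"
    Text = "text"


def overlap_ratio_in_bbox1(bbox1, bbox2):
    """Fraction of bbox1's area covered by bbox2 (assumed semantics)."""
    x0, y0 = max(bbox1[0], bbox2[0]), max(bbox1[1], bbox2[1])
    x1, y1 = min(bbox1[2], bbox2[2]), min(bbox1[3], bbox2[3])
    if x1 <= x0 or y1 <= y0:
        return 0.0  # no intersection
    area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    if area1 <= 0:
        return 0.0  # degenerate bbox1
    return (x1 - x0) * (y1 - y0) / area1


def make_block(bbox, block_type):
    """Hypothetical helper: bbox in slots 0..3, block type in slot 7."""
    return [*bbox, None, None, None, block_type]


def remove_outside_spans(spans, all_bboxes, threshold=0.5):
    # Partition block bboxes by type, as in the first hunk.
    image_bboxes, table_bboxes, other_block_bboxes = [], [], []
    for block in all_bboxes:
        block_type, block_bbox = block[7], block[0:4]
        if block_type == BlockType.ImageBody:
            image_bboxes.append(block_bbox)
        elif block_type == BlockType.TableBody:
            table_bboxes.append(block_bbox)
        else:
            other_block_bboxes.append(block_bbox)

    new_spans = []
    for span in spans:
        # Choose the candidate blocks that match the span's content type.
        if span['type'] == ContentType.Image:
            candidates = image_bboxes
        elif span['type'] == ContentType.Table:
            candidates = table_bboxes
        else:
            candidates = other_block_bboxes
        # Keep the span only if some candidate covers more than `threshold` of it.
        if any(overlap_ratio_in_bbox1(span['bbox'], b) > threshold for b in candidates):
            new_spans.append(span)
    return new_spans


blocks = [
    make_block((0, 0, 100, 50), BlockType.Text),
    make_block((0, 60, 100, 160), BlockType.ImageBody),
]
spans = [
    {'type': ContentType.Text, 'bbox': (10, 10, 40, 20)},     # inside the text block
    {'type': ContentType.Text, 'bbox': (10, 200, 40, 210)},   # outside every block
    {'type': ContentType.Image, 'bbox': (0, 60, 100, 160)},   # matches the image body
    {'type': ContentType.Text, 'bbox': (10, 70, 40, 80)},     # text lying only in the image body
]
kept = remove_outside_spans(spans, blocks)
```

In this example only the first and third spans survive. A text span that overlaps nothing, or that lies only inside an image body, is dropped, whereas the pre-change `else` branch would have kept both.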