Qin Kaijie / pdf-miner
Commit 3955a3b3, authored May 08, 2024 by 赵小蒙
update some annotation
parent 3a0a08e4
Showing 4 changed files with 27 additions and 3 deletions
magic_pdf/libs/boxbase.py (+24 -0)
magic_pdf/libs/draw_bbox.py (+1 -1)
magic_pdf/pdf_parse_union_core.py (+1 -1)
magic_pdf/pipe/AbsPipe.py (+1 -1)
magic_pdf/libs/boxbase.py

@@ -335,6 +335,19 @@ def find_right_nearest_text_bbox(pymu_blocks, obj_bbox):
 def bbox_relative_pos(bbox1, bbox2):
+    """
+    Determine the relative position of two rectangles.
+    Args:
+        bbox1: a 4-tuple giving the top-left and bottom-right corners of the first rectangle, in the form (x1, y1, x1b, y1b)
+        bbox2: a 4-tuple giving the top-left and bottom-right corners of the second rectangle, in the form (x2, y2, x2b, y2b)
+    Returns:
+        a 4-tuple describing the position of rectangle 1 relative to rectangle 2, in the form (left, right, bottom, top),
+        where left indicates whether rectangle 1 is to the left of rectangle 2, right whether rectangle 1 is to the right of rectangle 2,
+        bottom whether rectangle 1 is below rectangle 2, and top whether rectangle 1 is above rectangle 2
+    """
     x1, y1, x1b, y1b = bbox1
     x2, y2, x2b, y2b = bbox2
...
@@ -345,6 +358,17 @@ def bbox_relative_pos(bbox1, bbox2):
     return left, right, bottom, top
 
 
 def bbox_distance(bbox1, bbox2):
+    """
+    Compute the distance between two rectangles.
+    Args:
+        bbox1 (tuple): coordinates of the first rectangle, in the form (x1, y1, x2, y2), where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner.
+        bbox2 (tuple): coordinates of the second rectangle, in the form (x1, y1, x2, y2), where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner.
+    Returns:
+        float: the distance between the rectangles.
+    """
     def dist(point1, point2):
         return math.sqrt((point1[0] - point2[0]) ** 2 + (point1[1] - point2[1]) ** 2)
...
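The hunks above show only the docstrings and the first lines of each function, so the bodies are not visible here. As a rough illustration of what the two docstrings describe, here is a minimal sketch. The names `relative_pos` and `rect_distance` are hypothetical, the flag comparisons are an assumption based on the docstring wording, and the distance uses a per-axis gap formula rather than whatever the repository's `bbox_distance` actually does.

```python
import math


def relative_pos(bbox1, bbox2):
    """Return (left, right, bottom, top) flags for bbox1 relative to bbox2.

    Assumption: (x, y) is the top-left corner and y grows downward,
    so "top" means bbox1 ends before bbox2 starts on the y axis.
    """
    x1, y1, x1b, y1b = bbox1
    x2, y2, x2b, y2b = bbox2
    left = x1b < x2    # bbox1 lies entirely to the left of bbox2
    right = x2b < x1   # bbox1 lies entirely to the right of bbox2
    bottom = y2b < y1  # bbox1 lies entirely below bbox2
    top = y1b < y2     # bbox1 lies entirely above bbox2
    return left, right, bottom, top


def rect_distance(bbox1, bbox2):
    """Shortest Euclidean distance between two axis-aligned rectangles."""
    x1, y1, x1b, y1b = bbox1
    x2, y2, x2b, y2b = bbox2
    # Gap along each axis; zero when the projections overlap.
    dx = max(x2 - x1b, x1 - x2b, 0)
    dy = max(y2 - y1b, y1 - y2b, 0)
    return math.hypot(dx, dy)
```

With this sketch, two boxes that overlap on one axis are separated only by the gap on the other axis, diagonal neighbours get the corner-to-corner distance, and overlapping boxes get a distance of zero.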
magic_pdf/libs/draw_bbox.py

@@ -61,7 +61,7 @@ def draw_bbox_with_number(i, bbox_list, page, rgb_config, fill_config):
                 )  # Draw the rectangle
             page.insert_text(
                 (x0, y0 + 10), str(j + 1), fontsize=10, color=new_rgb
-            )  # Insert the index at the top left corner of the rectangle
+            )  # Insert the index in the top left corner of the rectangle
 
 
 def draw_layout_bbox(pdf_info, pdf_bytes, out_path):
...
magic_pdf/pdf_parse_union_core.py

@@ -32,7 +32,7 @@ def remove_horizontal_overlap_block_which_smaller(all_bboxes):
     is_useful_block_horz_overlap, smaller_bbox = check_useful_block_horizontal_overlap(useful_blocks)
     if is_useful_block_horz_overlap:
         logger.warning(
-            f"skip this page, reason: {DropReason.TEXT_BLCOK_HOR_OVERLAP}")
+            f"skip this page, reason: {DropReason.USEFUL_BLOCK_HOR_OVERLAP}")
         for bbox in all_bboxes.copy():
             if smaller_bbox == bbox[:4]:
                 all_bboxes.remove(bbox)
...
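The context lines of this hunk remove entries from `all_bboxes` while looping over `all_bboxes.copy()`. A small standalone sketch of that pattern follows. The helper name `remove_matching_bbox` and the sample data are hypothetical, and the sketch assumes each entry is a list whose first four values are the coordinates.

```python
def remove_matching_bbox(all_bboxes, smaller_bbox):
    """Remove, in place, every entry whose first four values equal smaller_bbox."""
    # Iterate over a shallow copy so that removing from the original list
    # does not make the loop skip the element after each removal.
    for bbox in all_bboxes.copy():
        if smaller_bbox == bbox[:4]:
            all_bboxes.remove(bbox)
    return all_bboxes


boxes = [
    [0, 0, 10, 10, "text"],
    [2, 2, 5, 5, "text"],
    [2, 2, 5, 5, "title"],
    [20, 0, 30, 10, "text"],
]
remove_matching_bbox(boxes, [2, 2, 5, 5])
print(boxes)
```

Looping over the list itself would skip the second of the two adjacent matches, because each removal shifts the remaining elements one position to the left.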
magic_pdf/pipe/AbsPipe.py

@@ -57,7 +57,7 @@ class AbsPipe(ABC):
     @staticmethod
     def classify(pdf_bytes: bytes) -> str:
         """
-        Based on the PDF metadata, determine whether or not it is a text PDF, or an OCR PDF
+        Based on the PDF metadata, determine whether it is a text PDF or an OCR PDF
         """
         pdf_meta = pdf_meta_scan(pdf_bytes)
         if pdf_meta.get("_need_drop", False):  # If the drop flag was returned, raise an exception
...
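The rest of `classify` is elided in this hunk. The sketch below only illustrates the drop-flag check that the visible lines and comment describe. Everything except the `"_need_drop"` key is an assumption made for illustration: the `scan` parameter stands in for `pdf_meta_scan`, and the `"_drop_reason"` and `"is_text_pdf"` keys, the `ValueError`, and the `"txt"` / `"ocr"` return labels are hypothetical.

```python
def classify_sketch(pdf_bytes, scan):
    """Classify a PDF from scanned metadata, rejecting documents flagged for dropping.

    `scan` is a stand-in for pdf_meta_scan. All metadata keys other than
    "_need_drop" are hypothetical.
    """
    pdf_meta = scan(pdf_bytes)
    # A missing flag defaults to False, so only an explicit flag raises.
    if pdf_meta.get("_need_drop", False):
        reason = pdf_meta.get("_drop_reason", "unknown")  # hypothetical key
        raise ValueError(f"pdf flagged for drop, reason: {reason}")
    # Hypothetical decision rule and labels, for illustration only.
    return "txt" if pdf_meta.get("is_text_pdf", False) else "ocr"
```

Passing the scanner in as a parameter keeps the sketch self-contained and makes it easy to exercise with a stub such as `lambda b: {"is_text_pdf": True}`.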