Qin Kaijie / pdf-miner
Commit 432e1ae5, authored Mar 22, 2024 by xu rui
feat: process title and footnote
parent e3e125ba
Showing 2 changed files with 16 additions and 3 deletions
magic_pdf/pdf_parse_for_train.py (+2, -1)
magic_pdf/train_utils/convert_to_train_format.py (+14, -2)
magic_pdf/pdf_parse_for_train.py
@@ -253,7 +253,8 @@ def parse_pdf_for_train(
         # isSimpleLayout_flag, fullColumn_cnt, subColumn_cnt, curPage_loss = evaluate_pdf_layout(page_id, page, model_output_json)
         Next, begin the preprocessing steps
         """
+        title_bboxs = parse_titles(page_id, page, model_output_json)
         """Remove each page's page numbers, headers, and footers"""
         page_no_bboxs = parse_pageNos(page_id, page, model_output_json)
         header_bboxs = parse_headers(page_id, page, model_output_json)
magic_pdf/train_utils/convert_to_train_format.py
@@ -35,8 +35,16 @@ def convert_to_train_format(jso: dict) -> []:
         # Footnotes; no examples seen so far
         for para in v["para_blocks"]:
-            n_bbox = {"category_id": 2, "bbox": para["bbox"]}
-            bboxes.append(n_bbox)
+            if "paras" in para:
+                paras = para["paras"]
+                for para_key, para_content in paras.items():
+                    para_bbox = para_content["para_bbox"]
+                    is_para_title = para_content["is_para_title"]
+                    if is_para_title:
+                        n_bbox = {"category_id": 0, "bbox": para_bbox}
+                    else:
+                        n_bbox = {"category_id": 2, "bbox": para_bbox}
+                    bboxes.append(n_bbox)
         for inline_equation in v["inline_equations"]:
             n_bbox = {"category_id": 13, "bbox": inline_equation["bbox"]}
@@ -46,6 +54,10 @@ def convert_to_train_format(jso: dict) -> []:
             n_bbox = {"category_id": 10, "bbox": inter_equation["bbox"]}
             bboxes.append(n_bbox)
+        for footnote in v['footnote_bboxes_tmp']:
+            n_bbox = {"category_id": 5, "bbox": footnote["bbox"]}
+            bboxes.append(n_bbox)
         info["bboxes"] = bboxes
         info["layout_tree"] = v["layout_bboxes"]
         pages.append(info)
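The paragraph and footnote handling added to convert_to_train_format can be tried in isolation with the sketch below. It is not the project's code: the helper name `collect_para_and_footnote_bboxes`, the constant names, and the sample page are hypothetical. The category IDs (0 for a title paragraph, 2 for other paragraphs, 5 for footnotes) and the dictionary keys are the ones in the diff. Unlike the commit, which indexes `v['footnote_bboxes_tmp']` and `para_content["is_para_title"]` directly, the sketch assumes a missing key should be tolerated and uses `dict.get`.

```python
from typing import Any

# Category IDs as they appear in the diff; the constant names are illustrative.
CATEGORY_TITLE = 0
CATEGORY_TEXT = 2
CATEGORY_FOOTNOTE = 5


def collect_para_and_footnote_bboxes(page: dict[str, Any]) -> list[dict[str, Any]]:
    """Hypothetical helper mirroring the two loops touched by this commit."""
    bboxes: list[dict[str, Any]] = []
    for para_block in page.get("para_blocks", []):
        # Blocks without a "paras" mapping contribute nothing, as in the diff.
        for para_content in para_block.get("paras", {}).values():
            # Assumption: a missing "is_para_title" flag means "not a title".
            is_title = para_content.get("is_para_title", False)
            category_id = CATEGORY_TITLE if is_title else CATEGORY_TEXT
            bboxes.append({"category_id": category_id, "bbox": para_content["para_bbox"]})
    # Assumption: a page without footnote data is treated as having none.
    for footnote in page.get("footnote_bboxes_tmp", []):
        bboxes.append({"category_id": CATEGORY_FOOTNOTE, "bbox": footnote["bbox"]})
    return bboxes


# Made-up page data, shaped like the keys the diff reads.
sample_page = {
    "para_blocks": [
        {
            "bbox": [0, 0, 100, 50],
            "paras": {
                "para_0": {"para_bbox": [0, 0, 100, 20], "is_para_title": True},
                "para_1": {"para_bbox": [0, 20, 100, 50], "is_para_title": False},
            },
        },
        {"bbox": [0, 60, 100, 80]},  # no "paras" key, so this block is skipped
    ],
    "footnote_bboxes_tmp": [{"bbox": [0, 90, 100, 95]}],
}

result = collect_para_and_footnote_bboxes(sample_page)
print([b["category_id"] for b in result])  # prints [0, 2, 5]
```

One consequence of the change, visible in the sample: a block in `para_blocks` that has no "paras" key no longer yields a bbox, whereas the previous code emitted one category 2 bbox per block.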