Qin Kaijie / pdf-miner
Commit b1ac8d03, authored Mar 15, 2024 by 赵小蒙
Update book_name generation logic
Parent: 84867933
Showing 1 changed file with 5 additions and 5 deletions

magic_pdf/pipeline.py (+5, -5)
@@ -57,7 +57,7 @@ def meta_scan(jso: dict, doc_layout_check=True) -> dict:
     try:
         data_source = get_data_source(jso)
         file_id = jso.get('file_id')
-        book_name = data_source + "/" + file_id
+        book_name = f"{data_source}/{file_id}"
 
         # Issue: the first page contains an excessive number of drawings
         # special_pdf_list = ['zlib/zlib_21822650']
@@ -103,7 +103,7 @@ def classify_by_type(jso: dict, debug_mode=False) -> dict:
         pdf_meta = jso.get('pdf_meta')
         data_source = get_data_source(jso)
         file_id = jso.get('file_id')
-        book_name = data_source + "/" + file_id
+        book_name = f"{data_source}/{file_id}"
         total_page = pdf_meta["total_page"]
         page_width = pdf_meta["page_width_pts"]
         page_height = pdf_meta["page_height_pts"]
@@ -169,7 +169,7 @@ def save_tables_to_s3(jso: dict, debug_mode=False) -> dict:
     try:
         data_source = get_data_source(jso)
         file_id = jso.get('file_id')
-        book_name = data_source + "/" + file_id
+        book_name = f"{data_source}/{file_id}"
         title = jso.get('title')
         url_encode_title = quote(title, safe='')
         if data_source != 'scihub':
@@ -262,7 +262,7 @@ def parse_pdf(jso: dict, start_page_id=0, debug_mode=False) -> dict:
         model_output_json_list = jso.get('doc_layout_result')
         data_source = get_data_source(jso)
         file_id = jso.get('file_id')
-        book_name = data_source + "/" + file_id
+        book_name = f"{data_source}/{file_id}"
 
         # Fixed as of 1.23.22
         # if debug_mode:
@@ -326,7 +326,7 @@ def ocr_parse_pdf(jso: dict, start_page_id=0, debug_mode=False) -> dict:
     model_output_json_list = jso.get('doc_layout_result')
     data_source = get_data_source(jso)
     file_id = jso.get('file_id')
-    book_name = data_source + "/" + file_id
+    book_name = f"{data_source}/{file_id}"
     try:
         save_path = "s3://mllm-raw-media/pdf2md_img/"
         image_s3_config = get_s3_config(save_path)
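
The two forms of book_name in this diff agree when both parts are strings, but they may differ when file_id is missing, since jso.get('file_id') returns None in that case. The sketch below is only an illustration of that difference and is not code from this repository: the helper names book_name_concat and book_name_fstring and the sample record are hypothetical, and the project's get_data_source is replaced by a plain dictionary lookup.

```python
def book_name_concat(data_source, file_id):
    # Old form from the removed lines: plain string concatenation.
    return data_source + "/" + file_id


def book_name_fstring(data_source, file_id):
    # New form from the added lines: f-string interpolation.
    return f"{data_source}/{file_id}"


# Hypothetical record; a real jso is assumed to carry many more fields.
jso = {"data_source": "zlib", "file_id": "zlib_21822650"}

# Both forms give the same result when both parts are strings.
same = book_name_concat(jso["data_source"], jso["file_id"]) == book_name_fstring(
    jso["data_source"], jso["file_id"]
)

# With a missing file_id, concatenation raises TypeError ...
try:
    book_name_concat("zlib", None)
    concat_error = None
except TypeError as exc:
    concat_error = type(exc).__name__

# ... while the f-string silently formats None into the name.
fstring_result = book_name_fstring("zlib", None)

print(same, concat_error, fstring_result)
```

If a missing file_id should still be treated as an error after this change, an explicit check before building book_name would be one way to keep that behavior.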