Qin Kaijie / pdf-miner
Commit 6199e608, authored Apr 30, 2024 by 赵小蒙
add union_make logic
parent 87ac340a
Showing 2 changed files with 45 additions and 0 deletions
magic_pdf/dict2md/ocr_mkcontent.py  +35 -0
magic_pdf/libs/MakeContentConfig.py  +10 -0
magic_pdf/dict2md/ocr_mkcontent.py
 from loguru import logger
 
+from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode
 from magic_pdf.libs.commons import join_path
 from magic_pdf.libs.language import detect_lang
 from magic_pdf.libs.markdown_utils import ocr_escape_special_markdown_char
...
@@ -319,3 +320,37 @@ def ocr_mk_mm_standard_format(pdf_info_dict: list):
                 content = line_to_standard_format(line)
                 content_list.append(content)
     return content_list
+
+
+def union_make(pdf_info_dict: list, make_mode: str, drop_mode: str, img_buket_path: str = ""):
+    output_content = []
+    for page_info in pdf_info_dict:
+        if page_info.get("need_drop", False):
+            drop_reason = page_info.get("drop_reason")
+            if drop_mode == DropMode.NONE:
+                pass
+            elif drop_mode == DropMode.WHOLE_PDF:
+                raise Exception(f"drop_mode is {DropMode.WHOLE_PDF} , drop_reason is {drop_reason}")
+            elif drop_mode == DropMode.SINGLE_PAGE:
+                logger.warning(f"drop_mode is {DropMode.SINGLE_PAGE} , drop_reason is {drop_reason}")
+                continue
+            else:
+                raise Exception(f"drop_mode can not be null")
+        paras_of_layout = page_info.get("para_blocks")
+        if not paras_of_layout:
+            continue
+        if make_mode == MakeMode.MM_MD:
+            page_markdown = ocr_mk_markdown_with_para_core_v2(paras_of_layout, "mm", img_buket_path)
+            output_content.extend(page_markdown)
+        elif make_mode == MakeMode.NLP_MD:
+            page_markdown = ocr_mk_markdown_with_para_core_v2(paras_of_layout, "nlp")
+            output_content.extend(page_markdown)
+        elif make_mode == MakeMode.STANDARD_FORMAT:
+            for para_block in paras_of_layout:
+                para_content = para_to_standard_format_v2(para_block, img_buket_path)
+                output_content.append(para_content)
+    if make_mode in [MakeMode.MM_MD, MakeMode.NLP_MD]:
+        return '\n\n'.join(output_content)
+    elif make_mode == MakeMode.STANDARD_FORMAT:
+        return output_content
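The page-dropping branch of union_make can be tried in isolation. What follows is a minimal sketch under stated assumptions, not the project's code: select_pages is a hypothetical helper that mirrors only the need_drop and empty para_blocks checks, it uses the standard logging module in place of loguru, and it redefines DropMode locally so that the block runs on its own.

```python
import logging

logger = logging.getLogger(__name__)


class DropMode:
    # Local copy of the constants added in magic_pdf/libs/MakeContentConfig.py
    WHOLE_PDF = "whole_pdf"
    SINGLE_PAGE = "single_page"
    NONE = "none"


def select_pages(pdf_info_dict: list, drop_mode: str) -> list:
    """Hypothetical helper: return the pages that would go on to be rendered."""
    kept = []
    for page_info in pdf_info_dict:
        if page_info.get("need_drop", False):
            drop_reason = page_info.get("drop_reason")
            if drop_mode == DropMode.NONE:
                pass  # flagged pages are kept
            elif drop_mode == DropMode.WHOLE_PDF:
                # one flagged page rejects the whole document
                raise Exception(f"drop_mode is {drop_mode}, drop_reason is {drop_reason}")
            elif drop_mode == DropMode.SINGLE_PAGE:
                logger.warning("dropping page, drop_reason is %s", drop_reason)
                continue  # skip only this page
            else:
                raise Exception("drop_mode can not be null")
        if not page_info.get("para_blocks"):
            continue  # pages without paragraph blocks are skipped in every mode
        kept.append(page_info)
    return kept


# Illustrative input: one clean page, one flagged page, one empty page.
pages = [
    {"para_blocks": ["p1"]},
    {"need_drop": True, "drop_reason": "garbled", "para_blocks": ["p2"]},
    {"para_blocks": []},
]
kept_none = select_pages(pages, DropMode.NONE)
kept_single = select_pages(pages, DropMode.SINGLE_PAGE)
```

As in the committed function, an unrecognised drop_mode raises only when some page is actually flagged need_drop; a document with no flagged pages passes through whatever value is supplied.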
magic_pdf/libs/MakeContentConfig.py
0 → 100644
+class MakeMode:
+    MM_MD = "mm_markdown"
+    NLP_MD = "nlp_markdown"
+    STANDARD_FORMAT = "standard_format"
+
+
+class DropMode:
+    WHOLE_PDF = "whole_pdf"
+    SINGLE_PAGE = "single_page"
+    NONE = "none"