pdf-miner (Qin Kaijie)
Commit 97153fab, authored Apr 08, 2024 by 赵小蒙
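(Unified format) Fix text loss caused by word splitting of long text in Chinese contexts
(Unified format) Fix extra spaces being added between content spans in Chinese contexts; fix formula content being escaped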
parent 05fe0548
Showing 1 changed file with 21 additions and 11 deletions

magic_pdf/dict2md/ocr_mkcontent.py (+21 -11)
@@ -73,7 +73,7 @@ def ocr_mk_mm_markdown_with_para(pdf_info_dict: dict):
     markdown = []
     for _, page_info in pdf_info_dict.items():
         paras_of_layout = page_info.get("para_blocks")
-        page_markdown = ocr_mk_mm_markdown_with_para_core(paras_of_layout, "mm")
+        page_markdown = ocr_mk_markdown_with_para_core(paras_of_layout, "mm")
         markdown.extend(page_markdown)
     return '\n\n'.join(markdown)
@@ -82,7 +82,7 @@ def ocr_mk_nlp_markdown_with_para(pdf_info_dict: dict):
     markdown = []
     for _, page_info in pdf_info_dict.items():
         paras_of_layout = page_info.get("para_blocks")
-        page_markdown = ocr_mk_mm_markdown_with_para_core(paras_of_layout, "nlp")
+        page_markdown = ocr_mk_markdown_with_para_core(paras_of_layout, "nlp")
         markdown.extend(page_markdown)
     return '\n\n'.join(markdown)
@@ -92,7 +92,7 @@ def ocr_mk_mm_markdown_with_para_and_pagination(pdf_info_dict: dict):
         paras_of_layout = page_info.get("para_blocks")
         if not paras_of_layout:
             continue
-        page_markdown = ocr_mk_mm_markdown_with_para_core(paras_of_layout, "mm")
+        page_markdown = ocr_mk_markdown_with_para_core(paras_of_layout, "mm")
         markdown_with_para_and_pagination.append({
             'page_no': page_no,
             'md_content': '\n\n'.join(page_markdown)
@@ -100,7 +100,7 @@ def ocr_mk_mm_markdown_with_para_and_pagination(pdf_info_dict: dict):
     return markdown_with_para_and_pagination
-def ocr_mk_mm_markdown_with_para_core(paras_of_layout, mode):
+def ocr_mk_markdown_with_para_core(paras_of_layout, mode):
     page_markdown = []
     for paras in paras_of_layout:
         for para in paras:
@@ -118,9 +118,9 @@ def ocr_mk_mm_markdown_with_para_core(paras_of_layout, mode):
                         else:
                             content = ocr_escape_special_markdown_char(content)
                     elif span_type == ContentType.InlineEquation:
-                        content = f"${ocr_escape_special_markdown_char(span['content'])}$"
+                        content = f"${span['content']}$"
                     elif span_type == ContentType.InterlineEquation:
-                        content = f"\n$$\n{ocr_escape_special_markdown_char(span['content'])}\n$$\n"
+                        content = f"\n$$\n{span['content']}\n$$\n"
                     elif span_type in [ContentType.Image, ContentType.Table]:
                         if mode == 'mm':
                             content = f"\n})\n"
@@ -147,13 +147,23 @@ def para_to_standard_format(para):
     inline_equation_num = 0
     for line in para:
         for span in line['spans']:
+            language = ''
             span_type = span.get('type')
             if span_type == ContentType.Text:
-                content = ocr_escape_special_markdown_char(split_long_words(span['content']))
+                content = span['content']
+                language = detect_lang(content)
+                if language == 'en':  # Only split long words for English; splitting Chinese text loses content
+                    content = ocr_escape_special_markdown_char(split_long_words(content))
+                else:
+                    content = ocr_escape_special_markdown_char(content)
             elif span_type == ContentType.InlineEquation:
-                content = f"${ocr_escape_special_markdown_char(span['content'])}$"
+                content = f"${span['content']}$"
                 inline_equation_num += 1
-            para_text += content + ' '
+            if language == 'en':  # In English text, contents must be separated by a space
+                para_text += content + ' '
+            else:  # In Chinese text, contents need no space separator
+                para_text += content
     para_content = {
         'type': 'text',
         'text': para_text,
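To make the effect of the space-handling change in this hunk concrete, here is a minimal sketch of language-dependent joining. `guess_lang` is a hypothetical stand-in for the project's `detect_lang`, whose implementation is not shown in this diff; it simply assumes any CJK ideograph means Chinese. `join_spans` is likewise an illustrative name, not a function from the repository.

```python
def guess_lang(text):
    """Hypothetical stand-in for detect_lang (assumption, not the project's code):
    return 'zh' if the text contains any CJK unified ideograph, else 'en'."""
    return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "en"


def join_spans(spans):
    """Concatenate span contents, adding a trailing space only after English spans."""
    para_text = ""
    for content in spans:
        if guess_lang(content) == "en":
            # English words need a separator between spans.
            para_text += content + " "
        else:
            # Chinese text is written without spaces between spans.
            para_text += content
    return para_text


print(repr(join_spans(["Hello", "world"])))  # 'Hello world '
print(repr(join_spans(["中文", "文本"])))      # '中文文本'
```

In the hunk itself, `language` is reset to `''` for every span and is only assigned for text spans, so an inline-equation span also takes the no-space branch.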
@@ -196,14 +206,14 @@ def line_to_standard_format(line):
                     return content
             else:
                 if span['type'] == ContentType.InterlineEquation:
-                    interline_equation = ocr_escape_special_markdown_char(span['content'])  # escape special characters
+                    interline_equation = span['content']  # escape special characters
                     content = {
                         'type': 'equation',
                         'latex': f"$$\n{interline_equation}\n$$"
                     }
                     return content
                 elif span['type'] == ContentType.InlineEquation:
-                    inline_equation = ocr_escape_special_markdown_char(span['content'])  # escape special characters
+                    inline_equation = span['content']  # escape special characters
                     line_text += f"${inline_equation}$"
                     inline_equation_num += 1
                 elif span['type'] == ContentType.Text:
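The equation changes in this commit stop passing LaTeX through the Markdown escaper. The sketch below illustrates why that matters. `escape_markdown` is a hypothetical stand-in for `ocr_escape_special_markdown_char`; the real helper's character set is not shown in this diff, so the set used here is an assumption.

```python
import re


def escape_markdown(text):
    """Hypothetical stand-in for ocr_escape_special_markdown_char (assumed
    character set): backslash-escape a few Markdown-significant characters."""
    return re.sub(r"([*`~_$])", r"\\\1", text)


def render_inline_equation(latex, escape):
    """Wrap LaTeX in $...$, optionally escaping it first (the pre-commit behaviour)."""
    body = escape_markdown(latex) if escape else latex
    return f"${body}$"


latex = "x_i^2 * y"
print(render_inline_equation(latex, escape=True))   # $x\_i^2 \* y$
print(render_inline_equation(latex, escape=False))  # $x_i^2 * y$
```

With escaping on, the subscript operator `_` becomes `\_`, which a LaTeX renderer treats as a literal underscore, so the formula no longer means the same thing. Leaving the span content untouched keeps the LaTeX intact.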