Qin Kaijie / pdf-miner
Commit 314f1637 (Unverified)
Authored Nov 03, 2024 by Xiaomeng Zhao
Committed by GitHub, Nov 03, 2024
Merge pull request #847 from myhloli/dev
fix(dict2md): improve text concatenation logic
Parents: 863cd6c5, 99cf160d
Changes: 1 changed file with 5 additions and 2 deletions

magic_pdf/dict2md/ocr_mkcontent.py (+5, -2)
@@ -145,7 +145,8 @@ def merge_para_with_text(para_block):
             elif span_type == ContentType.InterlineEquation:
                 content = f"\n$$\n{span['content']}\n$$\n"
-            if content.strip() != '':
+            content = content.strip()
+            if content != '':
                 langs = ['zh', 'ja', 'ko']
                 if line_lang in langs:  # Some documents put one character per span; single-character language detection is unreliable, so the whole line's text is used instead
                     if span_type in [ContentType.Text, ContentType.InterlineEquation]:
@@ -157,8 +158,10 @@ def merge_para_with_text(para_block):
                         # If the previous line ends with a hyphen, no space should be appended at the end
                         if __is_hyphen_at_line_end(content):
                             para_text += content[:-1]
+                        elif len(content) == 1 and content not in ['A', 'I', 'a', 'i']:
+                            para_text += content
                         else:  # In Western-language text, contents must be separated by spaces
-                            para_text += f"{content.strip()} "
+                            para_text += f"{content} "
                     elif span_type == ContentType.InterlineEquation:
                         para_text += content
             else:
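To make the effect of the change easier to follow, here is a minimal standalone sketch of the non-CJK text branch after this commit. It is not the project's code: `join_western_spans` and `ends_with_hyphen` are hypothetical names, the regex only stands in for `__is_hyphen_at_line_end` (whose body is not part of this diff), and skipping empty spans is a simplification. Equation spans and the `zh`/`ja`/`ko` branch are left out.

```python
import re


def ends_with_hyphen(text: str) -> bool:
    """Hypothetical stand-in for __is_hyphen_at_line_end (real body not shown in the diff)."""
    return re.search(r"[A-Za-z]-$", text) is not None


def join_western_spans(contents: list[str]) -> str:
    """Concatenate text-span contents using the branch order introduced by the commit."""
    para_text = ""
    for content in contents:
        # Strip once up front, so every later branch sees the trimmed text.
        content = content.strip()
        if content == "":
            continue  # simplification: empty spans are skipped
        if ends_with_hyphen(content):
            # Hyphenated line break: drop the hyphen and add no space.
            para_text += content[:-1]
        elif len(content) == 1 and content not in ["A", "I", "a", "i"]:
            # A lone character that is not a one-letter English word:
            # append it without a trailing space.
            para_text += content
        else:
            # Ordinary Western text: separate contents with a space.
            para_text += f"{content} "
    return para_text
```

With this sketch, `join_western_spans(["exam-", "ple", "text"])` returns `"example text "` and `join_western_spans(["w", "o", "r", "d"])` returns `"word"`. Because the strip now happens before the emptiness check, padded input such as `" here "` reaches every branch already trimmed, which is presumably why the final f-string no longer needs its own `.strip()`.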