Qin Kaijie / pdf-miner
Commit c9c14bea, authored Mar 04, 2024 by 赵小蒙

Update README

Parent: 9fe81795

Changes: 2 changed files with 43 additions and 11 deletions
README.md (+24 -11)
others/README.md (+19 -0)

README.md @ c9c14bea
````diff
-# pdf_toolbox
-Basic functions for PDF parsing
-## Distinguishing text-based PDFs from scanned PDFs
+# Magic-PDF
-```shell
-cat s3_pdf_path.example.pdf | parallel --colsep ' ' -j 10 "python pdf_meta_scan.py --s3-pdf-path {2} --s3-profile {1} >> {/}.jsonl"
+Convert PDFs to Markdown documents conveniently and accurately
-find dir/to/jsonl/ -type f -name "*.jsonl" | parallel -j 10 "python pdf_classfy_by_type.py --json_file {} >> {/}.jsonl"
-```
+### Getting Started
+###### Prerequisites
+python 3.9+
-```shell
-# To run the script on its own (needed once it is merged into code-clean), use the following as a reference:
-python -m pdf_meta_scan --s3-pdf-path "D:\pdf_files\内容排序测试_pdf\p3_图文混排 5.pdf" --s3-profile s2
+###### **Installation**
+1. Clone the repo
+```sh
+git clone https://github.com/myhloli/Magic-PDF.git
 ```
-## pdf
+### License
+This project is licensed under the MIT License. For details, see [LICENSE.txt](https://github.com/shaojintian/Best_README_template/blob/master/LICENSE.txt)
+### Acknowledgments
+- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
````
others/README.md @ c9c14bea (new file, 0 → 100644)
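# pdf_toolbox
Basic functions for PDF parsing

## Distinguishing text-based PDFs from scanned PDFs

```shell
# Scan metadata for every PDF in the list (each line: "<profile> <path>"), 10 jobs at a time
cat s3_pdf_path.example.pdf | parallel --colsep ' ' -j 10 "python pdf_meta_scan.py --s3-pdf-path {2} --s3-profile {1} >> {/}.jsonl"

# Classify each metadata file by PDF type, again 10 jobs at a time
find dir/to/jsonl/ -type f -name "*.jsonl" | parallel -j 10 "python pdf_classfy_by_type.py --json_file {} >> {/}.jsonl"
```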
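The first command above fans a list of PDFs out to ten concurrent `pdf_meta_scan.py` runs. For readers without GNU parallel, the same idea can be sketched in Python with the standard library. This is a minimal sketch under assumptions, not the project's code: the list-file layout (one `<profile> <path>` pair per line) is inferred from `--colsep ' '` with `{1}` and `{2}`, and `parse_line`, `build_command`, `scan_one`, `scan_all`, and the output file naming are all hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def parse_line(line: str) -> tuple[str, str]:
    """Split one list line into (profile, pdf_path).

    Assumption: column 1 is the profile and column 2 the path, as implied by
    {1} and {2} in the shell command. Unlike `--colsep ' '`, everything after
    the first space is kept as the path, so paths containing spaces survive.
    """
    profile, pdf_path = line.strip().split(" ", 1)
    return profile, pdf_path


def build_command(profile: str, pdf_path: str) -> list[str]:
    """Build the pdf_meta_scan.py invocation with the flags shown in the README."""
    return [
        sys.executable, "pdf_meta_scan.py",
        "--s3-pdf-path", pdf_path,
        "--s3-profile", profile,
    ]


def scan_one(line: str, out_dir: Path) -> int:
    """Run one scan and append its stdout to a per-PDF .jsonl file."""
    profile, pdf_path = parse_line(line)
    # Hypothetical naming: <pdf file name>.jsonl inside out_dir.
    out_file = out_dir / (Path(pdf_path).name + ".jsonl")
    with out_file.open("a", encoding="utf-8") as out:
        result = subprocess.run(build_command(profile, pdf_path), stdout=out, check=False)
    return result.returncode


def scan_all(list_file: Path, out_dir: Path, jobs: int = 10) -> list[int]:
    """Scan every listed PDF with at most `jobs` subprocesses at a time."""
    lines = [
        line
        for line in list_file.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    out_dir.mkdir(parents=True, exist_ok=True)
    # Threads are enough here: each worker only waits on a subprocess.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(lambda line: scan_one(line, out_dir), lines))
```

Calling `scan_all(Path("s3_pdf_path.example.pdf"), Path("jsonl_out"))` would then play the role of the first pipeline; `jsonl_out` is a placeholder directory name.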
```shell
# To run the script on its own (needed once it is merged into code-clean), use the following as a reference:
python -m pdf_meta_scan --s3-pdf-path "D:\pdf_files\内容排序测试_pdf\p3_图文混排 5.pdf" --s3-profile s2
```
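The standalone invocation above passes `--s3-pdf-path` and `--s3-profile` to the module. The option parser itself is not part of this commit, so the following is only a sketch of how such an entry point could look with `argparse`. `build_parser`, `main`, and the help strings are assumptions, and the real script may use a different library.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the two flags used in the README."""
    parser = argparse.ArgumentParser(prog="pdf_meta_scan")
    parser.add_argument(
        "--s3-pdf-path", required=True,
        help="location of the PDF to scan (the README example passes a local path)",
    )
    parser.add_argument(
        "--s3-profile", required=True,
        help="name of the storage profile to use, e.g. s2",
    )
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse the arguments and report what would be scanned."""
    args = build_parser().parse_args(argv)
    # argparse exposes --s3-pdf-path as args.s3_pdf_path (dashes become underscores).
    print(f"would scan {args.s3_pdf_path} with profile {args.s3_profile}")
    return 0
```

To support `python -m pdf_meta_scan`, a module like this would typically end with an `if __name__ == "__main__":` guard that calls `main()`.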
## pdf