centos7使用gpu加速的MinerU
https://mineru.readthedocs.io/zh-cn/latest/user_guide/install/boost_with_cuda.html
由于官方只有ubantu的安装教程,并没有基于centos7的,故需要自己修改命令安装并使用。
在运行此 Docker 容器之前,您可以使用以下命令检查您的设备是否支持 Docker 上的 CUDA 加速。
docker run --rm --gpus=all nvidia/cuda:12.1.0-base-centos7 nvidia-smi
注意cuda的版本需要和nvidia-smi中显示的一致
验证结果:
那就不用docker,直接新建环境并在conda环境中使用gpu加速即可。
1.安装 magic-pdf
conda create -n mineru python=3.10
conda activate mineru
pip install -U "magic-pdf[full]" --extra-index-url https://wheels.myhloli.com
2.下载模型
将download_models_hf.py修改为使用modelscope下载
import json
import os
import requests
from modelscope import snapshot_download
def download_json(url):
# 下载JSON文件
response = requests.get(url)
response.raise_for_status() # 检查请求是否成功
return response.json()
def download_and_modify_json(url, local_filename, modifications):
if os.path.exists(local_filename):
data = json.load(open(local_filename))
config_version = data.get('config_version', '0.0.0')
if config_version < '1.1.1':
data = download_json(url)
else:
data = download_json(url)
# 修改内容
for key, value in modifications.items():
data[key] = value
# 保存修改后的内容
with open(local_filename, 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=4)
if __name__ == '__main__':
# ModelScope 模型路径
mineru_patterns = [
"models/Layout/LayoutLMv3/*",
"models/Layout/YOLO/*",
"models/MFD/YOLO/*",
"models/MFR/unimernet_small_2501/*",
"models/TabRec/TableMaster/*",
"models/TabRec/StructEqTable/*",
]
model_dir = snapshot_download('opendatalab/PDF-Extract-Kit-1.0', allow_patterns=mineru_patterns)
layoutreader_pattern = [
"*.json",
"*.safetensors",
]
#layoutreader_model_dir = snapshot_download('hantian/layoutreader', allow_patterns=layoutreader_pattern)
layoutreader_model_dir = snapshot_download('zxyayase/layoutreader', allow_patterns=layoutreader_pattern)
model_dir = model_dir + '/models'
print(f'model_dir is: {model_dir}')
print(f'layoutreader_model_dir is: {layoutreader_model_dir}')
json_url = 'https://github.com/opendatalab/MinerU/raw/master/magic-pdf.template.json'
config_file_name = 'magic-pdf.json'
home_dir = os.path.expanduser('~')
config_file = os.path.join(home_dir, config_file_name)
json_mods = {
'models-dir': model_dir,
'layoutreader-model-dir': layoutreader_model_dir,
}
download_and_modify_json(json_url, config_file, json_mods)
print(f'The configuration file has been configured successfully, the path is: {config_file}')
遇到报错说模型不存在:
修改为’zxyayase/layoutreader’即可
3.验证json文件
如果 JSON 中不存在以下项目,请手动添加必填项目并删注释内容。
{
// other config
"layout-config": {
"model": "doclayout_yolo" // Please change to "layoutlmv3" when using layoutlmv3.
},
"formula-config": {
"mfd_model": "yolo_v8_mfd",
"mfr_model": "unimernet_small",
"enable": true // The formula recognition feature is enabled by default. If you need to disable it, please change the value here to "false".
},
"table-config": {
"model": "rapid_table", // Default to using "rapid_table", can be switched to "tablemaster" or "struct_eqtable".
"sub_model": "slanet_plus", // When the model is "rapid_table", you can choose a sub_model. The options are "slanet_plus" and "unitable"
"enable": true, // The table recognition feature is enabled by default. If you need to disable it, please change the value here to "false".
"max_time": 400
}
}
4.cpu运行
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/demo/small_ocr.pdf
magic-pdf -p small_ocr.pdf -o ./output
可见每页的处理时间大概是20多s。
5. gpu运行
修改【用户目录】中配置文件 magic-pdf.json 中”device-mode”的值
{
"device-mode":"cuda"
}
再次执行
magic-pdf -p small_ocr.pdf -o ./output