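[OCR] Optical Character Recognition of PDF Documents with Umi-OCR
I. Preface
Converting paper documents and scans into editable, searchable text is a common need. Tools such as Umi-OCR make this straightforward. This article walks through processing a PDF document with the Umi-OCR HTTP API, covering the complete flow from uploading the file to downloading the results.
II. What Is Umi-OCR?
Umi-OCR is an open-source, offline OCR tool. It recognizes text in multiple languages and is particularly well suited to Chinese documents. It also exposes an HTTP API, which makes it easier to integrate into other applications.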
III. Download Links
Lanzou Cloud https://hiroi-sora.lanzoul.com/s/umi-ocr (recommended for users in mainland China; no registration, no speed limit)
GitHub https://github.com/hiroi-sora/Umi-OCR/releases/latest
SourceForge https://sourceforge.net/projects/umi-ocr
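IV. Getting Started
The release package is distributed as a .7z archive or a .7z.exe self-extracting archive. The self-extracting archive can be unpacked on a computer that has no archive software installed.
The software requires no installation. After extracting, double-click Umi-OCR.exe to start the program.
Interface language
The Umi-OCR interface supports multiple languages. The first time you open the software, it switches language automatically according to your system settings.
To switch the language manually, go to Global Settings → 语言/Language.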
Before continuing, make sure Umi-OCR is running and that its HTTP service is listening at http://127.0.0.1:1224. You also need the path to the PDF file you want to process.
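Before sending any API request, it can help to confirm that something is actually listening on that address. The following is a minimal sketch that only tests whether a TCP connection succeeds; it does not call any Umi-OCR endpoint, and the helper name `is_port_open` is a hypothetical name introduced here for illustration.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper: it only checks that the port accepts connections,
    not that the listener is really Umi-OCR.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling `is_port_open("127.0.0.1", 1224)` before the upload step gives an early, readable failure if the service has not been started.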
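Step-by-Step Walkthrough
1. Upload the file and get a task ID
First, upload the PDF file to the Umi-OCR server and receive a unique task ID, which identifies the task in every later request. The file is sent in a POST request together with some basic task parameters (for example, the mixed extraction mode).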
url = "{}/api/doc/upload".format(base_url)
response = requests.post(url, files={"file": file}, data={"json": options_json})
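If the upload fails because the file name contains non-ASCII characters, you can work around it by sending the file under a temporary name that contains only ASCII characters.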
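As a sketch of that workaround, the helper below picks the name to send with the upload: the original base name if it is already ASCII, otherwise "temp" plus the original extension. The function name `ascii_safe_upload_name` and its `fallback_stem` parameter are hypothetical, introduced only for this example; the full code in this article instead retries after the server returns code 101.

```python
import os


def ascii_safe_upload_name(file_path: str, fallback_stem: str = "temp") -> str:
    """Return a file name that is safe to send in the upload request.

    Hypothetical helper: keeps the original base name when it is pure ASCII,
    otherwise substitutes `fallback_stem` and keeps the extension.
    """
    name = os.path.basename(file_path)
    if name.isascii():
        return name
    _, suffix = os.path.splitext(name)
    if not suffix.isascii():
        # Drop an extension that is itself non-ASCII.
        suffix = ""
    return fallback_stem + suffix
```

The result can be passed as the first element of the tuple form that `requests` accepts for uploads, for example `files={"file": (ascii_safe_upload_name(file_path), file)}`.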
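2. Poll the task status until the OCR task ends
Once the upload succeeds, poll the task status repeatedly until OCR processing finishes. Each response also reports progress as the number of pages processed so far out of the total page count.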
while True:
    time.sleep(1)
    response = requests.post(url, data=data_str, headers=headers)
    res_data = response.json()  # parse the status returned by the server
    if res_data["is_done"]:
        break
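The loop above runs forever if the task never reports completion. One possible refinement, sketched below, wraps the polling in a function with a deadline. The function `poll_until_done` and its parameters are hypothetical names introduced here; `fetch_status` stands for any callable that performs the status request and returns the parsed response, and only the `is_done` field used in this article is assumed.

```python
import time


def poll_until_done(fetch_status, interval=1.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call `fetch_status()` until it reports is_done, or raise TimeoutError.

    Hypothetical helper: `fetch_status` must return a dict such as the parsed
    response of the status request. `sleep` and `clock` are injectable so the
    loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.get("is_done"):
            return status
        if clock() >= deadline:
            raise TimeoutError("OCR task did not finish within {}s".format(timeout))
        sleep(interval)
```

With the variables from this article, `fetch_status` could be `lambda: requests.post(url, data=data_str, headers=headers).json()`.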
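3. Generate the target files and get a download link
Once the OCR task has finished, ask the server to prepare the output files and return a download link for them. In this step you choose which file formats to generate, such as txt or jsonl.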
download_options = {"file_types": ["txt", "jsonl"], "id": id}
response = requests.post(url, data=json.dumps(download_options), headers=headers)
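The full code in this article notes that the blank-page option is spelled `ingore_blank` in Umi-OCR 2.1.4 and earlier, and `ignore_blank` in later builds. A small sketch of how the request body could be built for either case follows. The function `build_download_options` and its `legacy_spelling` flag are hypothetical and exist only in this example.

```python
def build_download_options(task_id, file_types=("txt",), ignore_blank=False,
                           legacy_spelling=True):
    """Build the JSON-serializable body for the download request.

    Hypothetical helper: `legacy_spelling=True` uses the misspelled key
    `ingore_blank` (Umi-OCR 2.1.4 and earlier, per the article's code
    comments); False uses the corrected key `ignore_blank`.
    """
    blank_key = "ingore_blank" if legacy_spelling else "ignore_blank"
    return {
        "id": task_id,
        "file_types": list(file_types),
        blank_key: ignore_blank,
    }
```

The returned dict can be passed to `json.dumps` exactly like `download_options` in the snippet above.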
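4. Download the target file
With the download link in hand, you can download the processed file. For a better user experience, the full code in step 6 also shows how to report download progress.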
with open(download_path, "wb") as file:
    for chunk in response.iter_content(chunk_size=8192):
        file.write(chunk)
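When reporting progress, the percentage is computed from the `content-length` header, which may be absent; the full code in this article falls back to 0 in that case, and dividing by it would fail. The sketch below formats a progress line that tolerates an unknown total. The function name `format_progress` is hypothetical.

```python
def format_progress(downloaded: int, total: int) -> str:
    """Format a progress line from byte counts.

    Hypothetical helper: `total` may be 0 when the server sent no
    content-length header, in which case no percentage is shown.
    """
    megabytes = downloaded // 1048576  # bytes -> whole MB
    if total > 0:
        return "{}MB | {:.2f}%".format(megabytes, downloaded / total * 100)
    return "{}MB | size unknown".format(megabytes)
```

Inside the download loop, `print(format_progress(downloaded_size, total_size))` could replace the inline percentage calculation.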
5. Clean up the task
Finally, remember to clear the task data on the server to free its resources.
url = "{}/api/doc/clear/{}".format(base_url, id)
response = requests.get(url)
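If an earlier step raises an exception, the cleanup request above is never sent. One way to make cleanup unconditional is a small context manager, sketched below. The name `task_cleanup` and the `clear_task` callback are hypothetical; the callback is expected to perform the clear request shown above.

```python
from contextlib import contextmanager


@contextmanager
def task_cleanup(task_id, clear_task):
    """Yield `task_id` and always call `clear_task(task_id)` on exit.

    Hypothetical helper: `clear_task` is any callable that sends the
    cleanup request for the given task ID.
    """
    try:
        yield task_id
    finally:
        # Runs on normal exit and when the body raises.
        clear_task(task_id)
```

With the variables from this article, `clear_task` could be `lambda tid: requests.get("{}/api/doc/clear/{}".format(base_url, tid))`, and the polling and download steps would go inside the `with task_cleanup(id, clear_task):` block.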
6. Full code
import os
import json
import time
import requests
base_url = "http://127.0.0.1:1224"
url = "{}/api/doc/upload".format(base_url)
print("=======================================")
print("===== 1. Upload file, get task ID =====")
print("== URL:", url)
# Replace with the path to your PDF file
file_path = r"XXXXX.pdf"
# Task parameters
options_json = json.dumps(
    {
        "doc.extractionMode": "mixed",
    }
)
with open(file_path, "rb") as file:
    response = requests.post(url, files={"file": file}, data={"json": options_json})
response.raise_for_status()
res_data = json.loads(response.text)
if res_data["code"] == 101:
    # If code == 101, it indicates that the server did not receive the uploaded file.
    # On some Linux systems, if file_name contains non-ASCII characters, this error might occur.
    # In this case, we can specify a temp_name containing only ASCII characters to construct the upload request.
    file_name = os.path.basename(file_path)
    file_prefix, file_suffix = os.path.splitext(file_name)
    temp_name = "temp" + file_suffix
    print("[Warning] Detected file upload failure: code == 101")
    print(
        "Attempting to use temp_name",
        temp_name,
        "instead of the original file_name",
        file_name,
    )
    with open(file_path, "rb") as file:
        response = requests.post(
            url,
            # use temp_name to construct the upload request
            files={"file": (temp_name, file)},
            data={"json": options_json},
        )
    response.raise_for_status()
    res_data = json.loads(response.text)
assert res_data["code"] == 100, "Task submission failed: {}".format(res_data)
id = res_data["data"]
print("Task ID:", id)
url = "{}/api/doc/result".format(base_url)
print("===================================================")
print("===== 2. Poll task status until OCR task ends =====")
print("== URL:", url)
headers = {"Content-Type": "application/json"}
data_str = json.dumps(
    {
        "id": id,
        "is_data": True,
        "format": "text",
        "is_unread": True,
    }
)
while True:
    time.sleep(1)
    response = requests.post(url, data=data_str, headers=headers)
    response.raise_for_status()
    res_data = json.loads(response.text)
    assert res_data["code"] == 100, "Failed to get task status: {}".format(res_data)
    print(
        "    Progress: {}/{}".format(
            res_data["processed_count"], res_data["pages_count"]
        )
    )
    if res_data["data"]:
        print("{}\n========================".format(res_data["data"]))
    if res_data["is_done"]:
        state = res_data["state"]
        assert state == "success", "Task execution failed: {}".format(
            res_data["message"]
        )
        print("OCR task completed.")
        break
url = "{}/api/doc/download".format(base_url)
print("======================================================")
print("===== 3. Generate target file, get download link =====")
print("== URL:", url)
# Download file parameters
download_options = {
    "file_types": [
        "txt",
        "txtPlain",
        "jsonl",
        "csv",
        "pdfLayered",
        "pdfOneLayer",
    ],
    # ↓ `ingore_blank` is a typo. If you are using Umi-OCR version 2.1.4 or earlier, please use this incorrect spelling.
    # ↓ If you are using the latest code-built version of Umi-OCR, please use the corrected spelling `ignore_blank`.
    "ingore_blank": False,  # Do not ignore blank pages
}
download_options["id"] = id
data_str = json.dumps(download_options)
response = requests.post(url, data=data_str, headers=headers)
response.raise_for_status()
res_data = json.loads(response.text)
assert res_data["code"] == 100, "Failed to get download URL: {}".format(res_data)
url = res_data["data"]
name = res_data["name"]
print("===================================")
print("===== 4. Download target file =====")
print("== URL:", url)
# Save location for downloaded files
download_dir = "./download"
if not os.path.exists(download_dir):
    os.makedirs(download_dir)
download_path = os.path.join(download_dir, name)
response = requests.get(url, stream=True)
response.raise_for_status()
# Download file size
total_size = int(response.headers.get("content-length", 0))
downloaded_size = 0
log_size = 10485760 # Print progress every 10MB
with open(download_path, "wb") as file:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            file.write(chunk)
            downloaded_size += len(chunk)
            if downloaded_size >= log_size:
                log_size = downloaded_size + 10485760
                progress = (downloaded_size / total_size) * 100
                print(
                    "    Downloading file: {}MB | Progress: {:.2f}%".format(
                        int(downloaded_size / 1048576), progress
                    )
                )
print("Target file downloaded successfully: ", download_path)
url = "{}/api/doc/clear/{}".format(base_url, id)
print("============================")
print("===== 5. Clean up task =====")
print("== URL:", url)
response = requests.get(url)
response.raise_for_status()
res_data = json.loads(response.text)
assert res_data["code"] == 100, "Task cleanup failed: {}".format(res_data)
print("Task cleaned up successfully.")
print("======================\nProcess completed.")
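V. Conclusion
With the steps above, you can use Umi-OCR to run OCR on PDF documents and turn non-editable scanned PDFs into structured text. This reduces manual data entry and makes the content easier to search. I hope this article helps you get started with Umi-OCR quickly and apply it in your own projects.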