
[OCR] Optical Character Recognition of PDF Documents with Umi-OCR

article 2025/3/21 22:54:14

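I. Introduction

Turning paper documents and scans into editable, searchable text has become an everyday need, and tools such as Umi-OCR make it straightforward. This article walks through processing a PDF document with Umi-OCR's HTTP API, covering the full flow from file upload to result download.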

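II. What is Umi-OCR?

Umi-OCR is an open-source, offline OCR tool that recognizes text in multiple languages and is particularly well suited to Chinese documents. It exposes an HTTP API, which makes it easy to integrate into other applications.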

III. Download links

Lanzou Cloud https://hiroi-sora.lanzoul.com/s/umi-ocr (recommended in mainland China; no registration, no speed limit)
GitHub https://github.com/hiroi-sora/Umi-OCR/releases/latest
SourceForge https://sourceforge.net/projects/umi-ocr

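IV. Getting started

The release is distributed as a .7z archive or a .7z.exe self-extracting archive. The self-extracting archive can be unpacked on computers that have no archiving software installed.

No installation is required. After extracting the files, double-click Umi-OCR.exe to launch the program.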

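Interface language

The Umi-OCR interface is available in multiple languages. On first launch, it switches automatically to the language that matches your system settings.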

To change the language manually, open Global Settings → 语言/Language, as shown in the screenshot below.

[Screenshot: 1-标题-1.png]

Before calling the API, make sure Umi-OCR is running and that its service is listening at http://127.0.0.1:1224. You also need the path of the PDF file you want to process.
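
A quick way to confirm the prerequisite above is to test whether anything accepts TCP connections on that address. The following is a minimal sketch using only the Python standard library; the helper name `is_service_listening` is hypothetical, and a successful connection only shows that some process is listening on the port, not that it is Umi-OCR.

```python
import socket


def is_service_listening(host="127.0.0.1", port=1224, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper: it does not talk to the Umi-OCR API, it only
    checks that the port accepts connections.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

Calling `is_service_listening()` before the upload step lets a script stop early with a clear message instead of failing inside the first HTTP request.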

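Step-by-step walkthrough

1. Upload the file and get a task ID

First, upload the PDF file to the Umi-OCR server, which returns a unique task ID that every later request uses. The file is sent in a POST request together with the basic task parameters (for example, the mixed extraction mode).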

url = "{}/api/doc/upload".format(base_url)
with open(file_path, "rb") as file:  # open the PDF in binary mode for the upload
    response = requests.post(url, files={"file": file}, data={"json": options_json})
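
The upload response still has to be interpreted before the task ID can be used. Below is a small sketch of that check, based on the code values used by the complete script in this article (100 for success, 101 when the server did not receive the file). The names `UploadError` and `extract_task_id` are hypothetical and are not part of Umi-OCR or `requests`.

```python
class UploadError(Exception):
    """Raised when the upload response does not carry a task ID."""


def extract_task_id(res_data):
    """Return the task ID from a parsed /api/doc/upload response.

    Code meanings are taken from the complete script in this article:
    100 = success, 101 = the server did not receive the uploaded file.
    """
    code = res_data.get("code")
    if code == 100:
        return res_data["data"]
    if code == 101:
        raise UploadError("server did not receive the file (code 101)")
    raise UploadError("task submission failed: {}".format(res_data))
```

With the snippet above, the call would look like `task_id = extract_task_id(json.loads(response.text))`. Raising a dedicated exception for code 101 makes it easy to retry with a different file name.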

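If the upload fails because the file name contains non-ASCII characters (the server answers with code 101), resend the file under a temporary name that contains only ASCII characters.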
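
The workaround can be sketched as a helper that picks the name to send with the upload. The function name `ascii_upload_name` and the `fallback_stem` parameter are hypothetical; the "temp" plus original extension scheme mirrors the complete script in this article.

```python
import os


def ascii_upload_name(file_path, fallback_stem="temp"):
    """Return a file name that is safe to send in the upload request.

    ASCII-only names are returned unchanged. Otherwise the name is
    replaced by `fallback_stem` plus the original extension, so the
    server can still tell that the file is a PDF.
    """
    name = os.path.basename(file_path)
    if name.isascii():
        return name
    _, suffix = os.path.splitext(name)
    if not suffix.isascii():
        # Extremely unusual, but keep the result ASCII-only in any case.
        suffix = ""
    return fallback_stem + suffix
```

The returned name can be passed as the first element of the tuple form that `requests` accepts for file uploads, for example `files={"file": (ascii_upload_name(file_path), file)}`.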

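2. Poll the task status until the OCR task ends

After a successful upload, poll the task status repeatedly until OCR processing finishes. Each response reports progress as the number of pages processed out of the total page count.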


while True:
    time.sleep(1)
    response = requests.post(url, data=data_str, headers=headers)
    res_data = json.loads(response.text)  # parse the status returned by the server
    if res_data["is_done"]:
        break
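
The loop above runs forever if the task never finishes. One possible refinement, shown here as a sketch rather than as part of the original script, is to add a timeout and to separate the polling logic from the HTTP call. The function `poll_until_done` and its parameters are hypothetical; the response fields (`code`, `is_done`, `state`, `message`) are the ones used by the complete script in this article.

```python
import time


def poll_until_done(fetch_status, interval=1.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call `fetch_status()` until the task reports completion.

    `fetch_status` is any callable returning the parsed JSON response
    of the status request. Raises TimeoutError if the task is still
    running after `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        res_data = fetch_status()
        if res_data.get("code") != 100:
            raise RuntimeError("Failed to get task status: {}".format(res_data))
        if res_data.get("is_done"):
            if res_data.get("state") != "success":
                raise RuntimeError(
                    "Task execution failed: {}".format(res_data.get("message"))
                )
            return res_data
        if clock() >= deadline:
            raise TimeoutError("OCR task did not finish within the timeout")
        sleep(interval)
```

In a real script, `fetch_status` would wrap the `requests.post(url, data=data_str, headers=headers)` call and return the parsed response. Passing `sleep` and `clock` as parameters keeps the function testable without waiting.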

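3. Generate the target file and get the download link

Once the OCR task completes, ask the server to generate the output files and return a download link. The same request selects the output formats, such as txt and jsonl.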

url = "{}/api/doc/download".format(base_url)
download_options = {"file_types": ["txt", "jsonl"], "id": id}
response = requests.post(url, data=json.dumps(download_options), headers=headers)
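
The complete script in this article notes that the blank-page option is spelled `ingore_blank` in Umi-OCR 2.1.4 and earlier and `ignore_blank` in newer builds. The sketch below builds the request body for either spelling; the function `build_download_payload` and its `legacy_spelling` flag are hypothetical names introduced for illustration.

```python
import json


def build_download_payload(task_id, file_types, ignore_blank=False,
                           legacy_spelling=True):
    """Return the JSON body for the /api/doc/download request.

    legacy_spelling=True uses the misspelled key `ingore_blank`, which
    the article's script says is required by Umi-OCR 2.1.4 or earlier.
    """
    key = "ingore_blank" if legacy_spelling else "ignore_blank"
    payload = {
        "id": task_id,
        "file_types": list(file_types),
        key: ignore_blank,
    }
    return json.dumps(payload)
```

For example, `build_download_payload(id, ["txt", "jsonl"])` produces a string that can be passed as `data=` in the POST request shown above.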

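4. Download the target file

With the download link in hand, fetch the generated file. The snippet below streams it to disk in chunks; the complete script also prints the download progress.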

response = requests.get(url, stream=True)  # stream the body instead of loading it into memory
with open(download_path, "wb") as file:
    for chunk in response.iter_content(chunk_size=8192):
        file.write(chunk)
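
Progress reporting can be layered on top of the same chunked loop. The sketch below is one way to do it, assuming the chunks come from something like `response.iter_content(...)`; the function `save_chunks` and its parameters are hypothetical. It also handles the case where the total size is unknown, for example when the server sends no `Content-Length` header.

```python
def save_chunks(chunks, out_file, total_size=0,
                log_every=10 * 1024 * 1024, report=print):
    """Write an iterable of byte chunks to `out_file` and report progress.

    A message is emitted roughly every `log_every` bytes. The percentage
    is included only when `total_size` is known (greater than zero).
    Returns the number of bytes written.
    """
    downloaded = 0
    next_log = log_every
    for chunk in chunks:
        if not chunk:
            continue  # skip empty keep-alive chunks
        out_file.write(chunk)
        downloaded += len(chunk)
        if downloaded >= next_log:
            next_log = downloaded + log_every
            if total_size > 0:
                percent = downloaded / total_size * 100
                report("Downloaded {} bytes ({:.2f}%)".format(downloaded, percent))
            else:
                report("Downloaded {} bytes".format(downloaded))
    return downloaded
```

In the download step this could be called as `save_chunks(response.iter_content(chunk_size=8192), file, total_size)`, with `total_size` read from the response headers.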

5. Clean up the task

Finally, clear the task data on the server to free its resources.

url = "{}/api/doc/clear/{}".format(base_url, id)
response = requests.get(url)
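
If an earlier step raises an exception, the cleanup request above is never sent. One way to guarantee it, sketched here under the assumption that cleanup is wrapped in a callable, is a small context manager. The names `ocr_task` and `clear_task` are hypothetical and are not part of Umi-OCR.

```python
from contextlib import contextmanager


@contextmanager
def ocr_task(task_id, clear_task):
    """Yield `task_id` and always call `clear_task(task_id)` on exit.

    `clear_task` is any callable that performs the cleanup request,
    for example a function that sends GET /api/doc/clear/<id>.
    """
    try:
        yield task_id
    finally:
        # Runs on normal exit and when the body raises.
        clear_task(task_id)
```

The polling and download steps would then run inside `with ocr_task(id, clear_task):`, so the server-side task data is released even when one of them fails.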

6. Complete code

import os
import json
import time
import requests

base_url = "http://127.0.0.1:1224"

url = "{}/api/doc/upload".format(base_url)

print("=======================================")
print("===== 1. Upload file, get task ID =====")
print("== URL:", url)

# Replace with the path to your PDF file
file_path = r"XXXXX.pdf"

# Task parameters
options_json = json.dumps(
    {
        "doc.extractionMode": "mixed",
    }
)
with open(file_path, "rb") as file:
    response = requests.post(url, files={"file": file}, data={"json": options_json})
response.raise_for_status()
res_data = json.loads(response.text)
if res_data["code"] == 101:
    # If code == 101, it indicates that the server did not receive the uploaded file.
    # On some Linux systems, if file_name contains non-ASCII characters, this error might occur.
    # In this case, we can specify a temp_name containing only ASCII characters to construct the upload request.

    file_name = os.path.basename(file_path)
    file_prefix, file_suffix = os.path.splitext(file_name)
    temp_name = "temp" + file_suffix
    print("[Warning] Detected file upload failure: code == 101")
    print(
        "Attempting to use temp_name",
        temp_name,
        "instead of the original file_name",
        file_name,
    )
    with open(file_path, "rb") as file:
        response = requests.post(
            url,
            # use temp_name to construct the upload request
            files={"file": (temp_name, file)},
            data={"json": options_json},
        )
    response.raise_for_status()
    res_data = json.loads(response.text)
assert res_data["code"] == 100, "Task submission failed: {}".format(res_data)

id = res_data["data"]
print("Task ID:", id)

url = "{}/api/doc/result".format(base_url)
print("===================================================")
print("===== 2. Poll task status until OCR task ends =====")
print("== URL:", url)

headers = {"Content-Type": "application/json"}
data_str = json.dumps(
    {
        "id": id,
        "is_data": True,
        "format": "text",
        "is_unread": True,
    }
)
while True:
    time.sleep(1)
    response = requests.post(url, data=data_str, headers=headers)
    response.raise_for_status()
    res_data = json.loads(response.text)
    assert res_data["code"] == 100, "Failed to get task status: {}".format(res_data)

    print(
        "    Progress: {}/{}".format(
            res_data["processed_count"], res_data["pages_count"]
        )
    )
    if res_data["data"]:
        print("{}\n========================".format(res_data["data"]))
    if res_data["is_done"]:
        state = res_data["state"]
        assert state == "success", "Task execution failed: {}".format(
            res_data["message"]
        )
        print("OCR task completed.")
        break

url = "{}/api/doc/download".format(base_url)
print("======================================================")
print("===== 3. Generate target file, get download link =====")
print("== URL:", url)

# Download file parameters
download_options = {
    "file_types": [
        "txt",
        "txtPlain",
        "jsonl",
        "csv",
        "pdfLayered",
        "pdfOneLayer",
    ],
    # ↓ `ingore_blank` is a typo. If you are using Umi-OCR version 2.1.4 or earlier, please use this incorrect spelling.
    # ↓ If you are using the latest code-built version of Umi-OCR, please use the corrected spelling `ignore_blank`.
    "ingore_blank": False,  # Do not ignore blank pages
}
download_options["id"] = id
data_str = json.dumps(download_options)
response = requests.post(url, data=data_str, headers=headers)
response.raise_for_status()
res_data = json.loads(response.text)
assert res_data["code"] == 100, "Failed to get download URL: {}".format(res_data)

url = res_data["data"]
name = res_data["name"]

print("===================================")
print("===== 4. Download target file =====")
print("== URL:", url)

# Save location for downloaded files
download_dir = "./download"
os.makedirs(download_dir, exist_ok=True)
download_path = os.path.join(download_dir, name)
response = requests.get(url, stream=True)
response.raise_for_status()
# Download file size
total_size = int(response.headers.get("content-length", 0))
downloaded_size = 0
log_size = 10485760  # Print progress every 10MB

with open(download_path, "wb") as file:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            file.write(chunk)
            downloaded_size += len(chunk)
            if downloaded_size >= log_size:
                log_size = downloaded_size + 10485760
                # Content-Length may be missing (total_size == 0); avoid dividing by zero
                progress = (downloaded_size / total_size) * 100 if total_size else 0.0
                print(
                    "    Downloading file: {}MB | Progress: {:.2f}%".format(
                        int(downloaded_size / 1048576), progress
                    )
                )
print("Target file downloaded successfully: ", download_path)

url = "{}/api/doc/clear/{}".format(base_url, id)
print("============================")
print("===== 5. Clean up task =====")
print("== URL:", url)

response = requests.get(url)
response.raise_for_status()
res_data = json.loads(response.text)
assert res_data["code"] == 100, "Task cleanup failed: {}".format(res_data)
print("Task cleaned up successfully.")

print("======================\nProcess completed.")