当前位置：首页 > article >正文

从需求文档到智能化测试：基于 PaddleOCR 的图片信息处理与自动化提取

article 2025/3/17 6:19:04

在软件测试的实际工作中，需求文档是测试工程师的重要工具。然而，随着需求文档的复杂化，文档中的图片（如业务流程图、UI 示意图和测试用例说明图）不仅承载了大量关键信息，也是测试用例设计的重要依据。如何从需求文档中快速、准确地提取图片内容，甚至将其转化为结构化文本进行分析，是提升测试效率的关键。

本文将通过结合 Python-Docx 和 PaddleOCR，实现一套自动化提取需求文档图片并进行文字识别的解决方案。测试工程师可以利用本文提供的工具，快速从指定标题下的图片中提取内容并识别文字，大幅减少人工操作的时间成本。

1. 测试工程师的痛点：为何要自动提取图片并识别文字？

在需求分析阶段，测试工程师通常需要从需求文档中提取图片，并根据图片内容设计测试用例。这一过程中存在以下常见痛点：

图片信息分散
图片可能分布在文档的多个章节中，手动查找费时费力。
手动识别效率低下
即使找到图片，人工识别和整理图片内容容易出现遗漏或理解偏差。
文档标题样式多样化
不同文档的标题样式可能不一致，难以通过固定规则准确提取。
图片文字识别复杂
图片中的文字可能包含中英文混合内容、复杂表格或特殊字体，普通工具识别效果有限。

基于这些痛点，我们提出了一个解决方案：通过 Python-Docx 提取指定标题下的图片，并结合 PaddleOCR 对图片内容进行文字识别，最终将识别结果输出，供测试工程师高效使用。

2. 技术方案：从图片提取到文字识别

我们的技术方案分为以下几个步骤：

提取指定标题下的图片
利用 python-docx 从需求文档中提取指定标题下的图片，支持模糊匹配标题，兼容多种标题样式。
使用 PaddleOCR 进行文字识别
对提取的图片进行 OCR 处理，识别图片中的文字内容，支持中英文混合及表格处理。
输出结构化识别结果
将识别到的内容按标题和图片分组输出，便于测试工程师直接使用。

以下是完整的代码实现。

3. 代码实现：自动提取图片并进行文字识别

3.1 获取指定标题下的图片

我们通过 python-docx 提取文档中指定标题下的图片，并保存到本地：

import os
import cv2
import numpy as np
import xml.etree.cElementTree as ET
from docx import Document

def is_heading_enhanced(paragraph):
    """
    判断段落是否为标题（增强版）。
    包含对样式名称、字体大小和加粗特征的检查。
    """
    heading_features = {
        'keywords': ['标题', 'heading', 'header', 'h1', 'h2', 'h3', 'chapter'],
        'font_size': (14, 72)
    }
    style_match = any(
        kw in (paragraph.style.name or "").lower()
        for kw in heading_features['keywords']
    )
    if style_match:
        return True
    try:
        font = paragraph.style.font or paragraph.document.styles['Normal'].font
        effective_size = font.size.pt if font.size else 12
        return heading_features['font_size'][0] <= effective_size <= heading_features['font_size'][1]
    except Exception as e:
        print(f"格式检查失败: {str(e)}")
        return False

def get_target_pic(docx_file, target_title):
    """
    从指定 docx 文件中提取目标标题下的图片。
    """
    doc = Document(docx_file)
    target_title_list = [title.strip() for title in target_title.split(",") if title.strip()]

    xml_elements = []
    found_start = False

    for par in doc.paragraphs:
        par_text = par.text.strip()
        if par_text and any(title in par_text for title in target_title_list) and is_heading_enhanced(par):
            found_start = True
            continue
        elif is_heading_enhanced(par):
            found_start = False
        if found_start:
            xml_elements.append(par._element.xml)

    rId_list = []

    def get_img_rId(root_element, target_tag, target_attribute, out_list):
        for child in root_element:
            if child.tag in target_tag and target_attribute in child.attrib:
                out_list.append(child.attrib[target_attribute])
            else:
                get_img_rId(child, target_tag, target_attribute, out_list)

    for element in xml_elements:
        root = ET.fromstring(element)

        target_tags = [
            '{urn:schemas-microsoft-com:vml}imagedata',
            '{http://schemas.openxmlformats.org/drawingml/2006/main}blip'
        ]
        target_attributes = [
            '{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed',
            '{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id'
        ]

        for target_attribute in target_attributes:
            get_img_rId(root, target_tags, target_attribute, rId_list)

    images = []
    output_dir = os.path.dirname(docx_file)

    for idx, rId in enumerate(rId_list):
        try:
            img_part = doc.part.related_parts[rId]
            img_binary = img_part.blob
            img = cv2.imdecode(np.frombuffer(img_binary, np.uint8), cv2.IMREAD_COLOR)
            if img is None:
                continue
            img_name = os.path.join(output_dir, f"extracted_image_{idx + 1}.jpg")
            cv2.imwrite(img_name, img)
            images.append(img_name)
        except Exception as e:
            print(f"图片提取失败，rId={rId}: {e}")

    return images if images else None

3.2 使用 PaddleOCR 识别提取的图片

我们使用 PaddleOCR 对提取的图片进行文字识别：

from paddleocr import PaddleOCR

# 初始化 PaddleOCR
ocr = PaddleOCR(use_angle_cls=True, lang="ch")  # 支持方向分类和中英文混合识别

def perform_ocr_with_paddle(images):
    """
    使用 PaddleOCR 对图片进行文字识别。
    """
    results = []
    for image_path in images:
        try:
            img = cv2.imread(image_path)
            ocr_result = ocr.ocr(img, cls=True)
            text_lines = [line[1][0] for line in ocr_result[0]]
            results.append((image_path, "\n".join(text_lines)))
        except Exception as e:
            results.append((image_path, f"OCR 识别失败: {e}"))
    return results

3.3 组合主流程

if __name__ == '__main__':
    # 用户输入的标题和文档路径
    target_title = '业务流程,测试用例'
    file_path = '需求文档.docx'

    # 提取图片
    print("正在提取图片...")
    image_paths = get_target_pic(file_path, target_title)
    if not image_paths:
        print("未找到目标标题下的图片")
        exit()

    print(f"提取到 {len(image_paths)} 张图片：{image_paths}")

    # OCR 识别
    print("\n正在进行 OCR 识别...")
    ocr_results = perform_ocr_with_paddle(image_paths)

    # 输出结果
    for image_path, text in ocr_results:
        print(f"\n图片路径：{image_path}")
        print(f"识别内容：\n{text}")

4. 示例运行结果

用户输入

目标标题：业务流程,测试用例
文档路径：xxx.docx

输出结果

提取到 1 张图片：
['D:\\python_project\\with_playwright\\0312\\img0.jpg']
提取到 1 张图片：['D:\\python_project\\with_playwright\\0312\\img0.jpg']

正在进行 OCR 识别...

图片路径：D:\python_project\with_playwright\0312\img0.jpg
识别内容：
专员
创建活动
发送二维码邀请
填写信息提交
生成客户信息
扫码核销签到
码
拍摄照片上传
T+7录入跟踪
信息
跟踪90天加保
信息