
Extracting Structure and Text from PPTX Files with the python-pptx Library

article 2025/2/10 2:01:56

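The python-pptx library is well suited to extracting structure and text from PPTX files, for the following reasons:

1. Built specifically for the PPTX format

python-pptx is a Python library designed for working with PowerPoint files in the .pptx format.
It can read and manipulate the internal structure of a PPTX file, including slides, shapes, text boxes, and tables.
This lets it extract both the text content and the structural information of a presentation.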
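2. In-depth parsing of the PPTX structure

A PPTX file is essentially a ZIP archive that contains XML files and other resources such as images.
python-pptx parses these XML files to expose the internal structure of the presentation.
This allows it to identify and extract the elements of a PPTX file, including text inside grouped shapes (Group components).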
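
To make the "ZIP archive of XML files" point concrete, here is a minimal sketch that lists the parts of a package using only the standard library. The function names `list_package_parts` and `slide_part_names` are made up for this illustration, and the `ppt/slides/slideN.xml` layout is the conventional naming used by PowerPoint rather than something this article guarantees.

```python
import zipfile


def list_package_parts(pptx_file):
    """Return the names of all parts stored in a .pptx package.

    `pptx_file` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(pptx_file) as package:
        return sorted(package.namelist())


def slide_part_names(pptx_file):
    """Return only the slide XML parts, e.g. 'ppt/slides/slide1.xml'.

    Assumption: slides follow the conventional 'ppt/slides/*.xml' naming.
    """
    return [
        name
        for name in list_package_parts(pptx_file)
        if name.startswith("ppt/slides/") and name.endswith(".xml")
    ]
```

Calling `list_package_parts("example.pptx")` on a real file should show entries such as `[Content_Types].xml` and `ppt/presentation.xml`, which are the XML parts python-pptx reads on your behalf.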
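3. Filling the gaps left by other libraries

Libraries such as markitdown and textract have some limitations when processing PPTX files.
They may fail to extract all of the text inside Group components, which leads to lost information.
python-pptx can fill these gaps and provide more complete and accurate extraction.

In day-to-day work and study, we often need to process large numbers of PPTX files, for example to extract their text or analyze their structure. Python is powerful and easy to pick up, and its rich ecosystem of libraries helps with these tasks. Among them, python-pptx is a solid tool built specifically for working with PPTX files.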

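In a real project, I needed to process more than 1,000 PPTX files. After repeatedly revising and debugging the code, I got the extraction of structure and text from PPTX files working. The rest of this article explains in detail how to do this with python-pptx.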

Installing python-pptx

Before starting, install the python-pptx library. It can be installed with pip; open a terminal and run:

pip install python-pptx

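Basic concepts

Before using python-pptx, it helps to understand a few basic concepts:

Presentation: represents the whole PPTX file and is the entry point for all operations.
Slide: represents a single slide in the PPTX file.
Shape: represents an element on a slide, such as a text box, table, or picture.
TextFrame: the text container that belongs to a shape; it holds the shape's paragraphs and runs and is used to read and edit the text.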
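
As a quick illustration of how these four objects nest, the sketch below walks from a presentation down to the text of each shape and also descends into grouped shapes. The function names are hypothetical, and detecting a group through its `.shapes` attribute is an assumption made for brevity (python-pptx's `GroupShape` exposes that attribute). Tables are deliberately left out here.

```python
def iter_text_shapes(shapes):
    """Yield every shape that carries a text frame, descending into groups.

    Assumption: a group is recognised by its `.shapes` attribute.
    """
    for shape in shapes:
        if hasattr(shape, "shapes"):
            # Group: recurse into its child shapes
            yield from iter_text_shapes(shape.shapes)
        elif getattr(shape, "has_text_frame", False):
            yield shape


def collect_slide_text(presentation):
    """Return one list of non-empty text strings per slide."""
    result = []
    for slide in presentation.slides:
        texts = [s.text_frame.text for s in iter_text_shapes(slide.shapes)]
        result.append([t for t in texts if t])
    return result
```

With python-pptx installed, this would be called as `collect_slide_text(Presentation("example.pptx"))`, where `Presentation` is imported from `pptx`.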

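The sections below walk through the Python code for extracting structure and text from PPTX files, one functional block at a time.

1. Importing the required libraries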

import json
import os

from pptx import Presentation
from pptx.enum.dml import MSO_FILL
from pptx.enum.shapes import MSO_SHAPE_TYPE
from pptx.shapes.group import GroupShape

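json: handles JSON data; it is used when saving the extracted PPTX information to JSON files.
os: provides operating-system functions such as checking whether files and directories exist and creating directories.
Presentation: the python-pptx class used to load and manipulate a PPTX file.
MSO_FILL: an enumeration of fill types, such as solid fill and pattern fill.
MSO_SHAPE_TYPE: an enumeration of shape types, such as text box, table, and picture.
GroupShape: the class that represents a group shape, used to handle several shapes grouped together in a PPTX file.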

2. The `extract_text_frame_info` function

def extract_text_frame_info(text_frame):
    """Extract the paragraphs, runs, and formatting of a text frame."""
    text_info = []
    if text_frame and hasattr(text_frame, 'paragraphs'):
        for paragraph in text_frame.paragraphs:
            para_info = {
                "alignment": str(paragraph.alignment),
                "level": paragraph.level,
                "line_spacing": str(paragraph.line_spacing) if hasattr(paragraph, 'line_spacing') else None,
                "space_before": str(paragraph.space_before) if hasattr(paragraph, 'space_before') else None,
                "space_after": str(paragraph.space_after) if hasattr(paragraph, 'space_after') else None,
                "runs": []
            }
            for run in paragraph.runs:
                run_info = {
                    "text": run.text,
                    "font": {
                        "name": run.font.name if hasattr(run.font, 'name') else None,
                        "size": str(run.font.size) if hasattr(run.font, 'size') else None,
                        "bold": run.font.bold if hasattr(run.font, 'bold') else None,
                        "italic": run.font.italic if hasattr(run.font, 'italic') else None,
                        "underline": run.font.underline if hasattr(run.font, 'underline') else None,
                        "color": str(run.font.color.rgb) if hasattr(run.font.color, 'rgb') and run.font.color.rgb else None
                    }
                }
                para_info["runs"].append(run_info)
            text_info.append(para_info)
    return text_info

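Purpose: extracts the content and formatting of a text frame.
Steps:
1. Check that the text_frame passed in exists and has paragraphs.
2. Iterate over each paragraph in the text frame and record its alignment, indentation level, line spacing, space before, and space after.
3. For each paragraph, iterate over its runs and record each run's text together with the font name, size, bold, italic, underline, and color.
4. Append each paragraph's information to the text_info list and return the list.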
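
The list returned by the steps above is convenient for JSON but verbose when only the words are needed. The following sketch flattens it back to plain text; the helper names `paragraph_text` and `text_info_to_plain` are invented for this example and rely only on the `runs` and `text` keys built above. Text that is not stored in runs, such as field-generated content, may be absent from the result.

```python
def paragraph_text(para_info):
    """Join the text of all runs in one paragraph dictionary."""
    return "".join(run["text"] for run in para_info.get("runs", []))


def text_info_to_plain(text_info):
    """Turn the list of paragraph dictionaries into newline-separated text."""
    return "\n".join(paragraph_text(para) for para in text_info)
```

Empty paragraphs are kept as empty lines so that the paragraph count of the original text frame is preserved.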

3. The `extract_table_info` function

def extract_table_info(table):
    """Extract the content and formatting of a table."""
    table_info = {
        "rows": len(table.rows),
        "cols": len(table.columns),
        "cells": []
    }
    for row_idx, row in enumerate(table.rows):
        for col_idx, cell in enumerate(row.cells):
            # Only solid and patterned fills have a foreground color
            fill_color = None
            if hasattr(cell.fill, 'type') and cell.fill.type in (MSO_FILL.SOLID, MSO_FILL.PATTERNED):
                fore_color = cell.fill.fore_color
                if hasattr(fore_color, 'rgb') and fore_color.rgb:
                    fill_color = str(fore_color.rgb)

            cell_info = {
                "row": row_idx,
                "col": col_idx,
                "text": extract_text_frame_info(cell.text_frame),
                "fill_color": fill_color
            }
            table_info["cells"].append(cell_info)
    return table_info

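Purpose: extracts the content and formatting of a table.
Steps:
1. Initialize a dictionary table_info that records the number of rows and columns.
2. Iterate over every row and column of the table. For each cell:
  - Check whether the cell's fill type is solid or patterned; if so, try to read the RGB value of its foreground color.
  - Call extract_text_frame_info to extract the cell's text frame.
  - Record the cell's row index, column index, text, and fill color in a cell_info dictionary and append it to the cells list of table_info.
3. Return the table_info dictionary.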
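
Because table_info stores cells as a flat list, a small post-processing step is often handy. The sketch below rebuilds a row-by-column grid of plain strings from that dictionary; `table_to_grid` is a hypothetical helper, and it assumes the `rows`, `cols`, `cells`, `row`, `col`, and `text` keys exactly as produced above.

```python
def table_to_grid(table_info):
    """Rebuild a 2D list of cell strings from the flat `cells` list."""
    grid = [[""] * table_info["cols"] for _ in range(table_info["rows"])]
    for cell in table_info["cells"]:
        # Each cell holds a list of paragraphs, each with a list of runs
        lines = [
            "".join(run["text"] for run in para["runs"])
            for para in cell["text"]
        ]
        grid[cell["row"]][cell["col"]] = "\n".join(lines)
    return grid
```

The grid can then be written out with the `csv` module or loaded into a data frame, depending on what the downstream analysis needs.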

4. The `get_fill_color` function

def get_fill_color(shape):
    if hasattr(shape, 'fill') and shape.fill:
        fill = shape.fill
        if fill.type in (MSO_FILL.SOLID, MSO_FILL.PATTERNED):
            fore_color = fill.fore_color
            if hasattr(fore_color, 'rgb') and fore_color.rgb:
                return str(fore_color.rgb)
            elif hasattr(fore_color, 'theme_color') and fore_color.theme_color:
                return str(fore_color.theme_color)
    return None

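Purpose: returns the fill color of a shape.
Steps:
1. Check whether the shape has a fill attribute.
2. If the fill type is solid or patterned, try to read the RGB value of the foreground color and return it if present; if there is no RGB value but there is a theme color, return the theme color instead.
3. If none of these conditions is met, return None.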
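
Since this function can return either an RGB string or a theme color name, code that consumes the JSON has to tell the two apart. A minimal sketch, assuming the RGB case is a six-digit hex string such as "FF8000" and that anything else is a theme color or missing; `parse_rgb_hex` is a name made up for this example.

```python
import re

_HEX_RGB = re.compile(r"[0-9A-Fa-f]{6}")


def parse_rgb_hex(value):
    """Return (r, g, b) for a six-digit hex string, otherwise None."""
    if not isinstance(value, str) or not _HEX_RGB.fullmatch(value):
        return None
    # Split 'RRGGBB' into three two-character hex components
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))
```

A `None` result means the stored value was not an explicit RGB color, so the caller may want to keep the original string for reference.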

5. The `extract_shape_info` function

def extract_shape_info(shape):
    shape_info = {
        "id": shape.shape_id,
        "name": shape.name,
        "type": str(shape.shape_type),
        "left": str(shape.left),
        "top": str(shape.top),
        "width": str(shape.width),
        "height": str(shape.height),
        "fill_type": None,
        "fill_color": None,
        "line_color": None,
        "line_width": None,
        "line_dash_style": None,
        "rotation": str(shape.rotation) if hasattr(shape, 'rotation') else None,
        "content": {}
    }
    try:
        # Only auto shapes define auto_shape_type; other shapes raise here
        shape_info['auto_shape_type'] = shape.auto_shape_type.name
    except Exception:
        pass

    if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
        return shape_info

    if hasattr(shape, 'fill') and shape.fill:
        shape_info["fill_type"] = str(shape.fill.type)
        shape_info["fill_color"] = get_fill_color(shape)
    if hasattr(shape, 'line') and shape.line:
        # color.type == 1 corresponds to MSO_COLOR_TYPE.RGB (an explicit RGB color)
        if hasattr(shape.line, 'color') and shape.line.color and shape.line.color.type == 1:
            shape_info["line_color"] = str(shape.line.color.rgb)
        if hasattr(shape.line, 'width'):
            shape_info["line_width"] = str(shape.line.width)
        try:
            if hasattr(shape.line, 'dash_style'):
                shape_info["line_dash_style"] = str(shape.line.dash_style)
        except ValueError:
            # Dash style has no predefined enum mapping
            shape_info["line_dash_style"] = None

    if shape.shape_type == MSO_SHAPE_TYPE.TABLE:
        shape_info["content"] = extract_table_info(shape.table)
    elif shape.has_text_frame:
        shape_info["content"] = extract_text_frame_info(shape.text_frame)
    elif isinstance(shape, GroupShape):
        group_shapes_info = []
        for sub_shape in shape.shapes:
            sub_shape_info = extract_shape_info(sub_shape)
            group_shapes_info.append(sub_shape_info)
        shape_info["content"] = {"group_shapes": group_shapes_info}

    return shape_info

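Purpose: extracts information about a shape, including its basic properties, fill, line, and content.
Steps:
1. Initialize a dictionary shape_info that records the shape's basic properties, such as ID, name, type, position, size, and rotation.
2. Try to read the name of the shape's auto shape type and record it in shape_info.
3. If the shape is a picture, return shape_info immediately.
4. Check whether the shape has a fill attribute; if so, record the fill type and fill color.
5. Check whether the shape has a line attribute; if so, record the line color, width, and dash style, catching a possible ValueError when reading the dash style.
6. Depending on the kind of shape, call the matching function to extract its content:
  - For a table, call extract_table_info.
  - For a shape with a text frame, call extract_text_frame_info.
  - For a group shape, call extract_shape_info recursively on each child shape.
7. Return the shape_info dictionary.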
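
The content field therefore takes one of three forms: a list of paragraphs, a table dictionary with `cells`, or a dictionary with `group_shapes`. The sketch below walks all three to collect plain text from a shape_info dictionary, including nested groups. `iter_shape_text` and `_paragraph_texts` are hypothetical helpers written against the structure described above, not part of python-pptx.

```python
def _paragraph_texts(paragraphs):
    """Yield the non-empty text of each paragraph dictionary."""
    for para in paragraphs:
        text = "".join(run["text"] for run in para["runs"])
        if text:
            yield text


def iter_shape_text(shape_info):
    """Yield all text stored in a shape_info dict, recursing into groups."""
    content = shape_info.get("content")
    if isinstance(content, list):
        # Text frame: a list of paragraph dictionaries
        yield from _paragraph_texts(content)
    elif isinstance(content, dict):
        # Table: text lives in each cell's paragraph list
        for cell in content.get("cells", []):
            yield from _paragraph_texts(cell["text"])
        # Group: each child is itself a shape_info dictionary
        for child in content.get("group_shapes", []):
            yield from iter_shape_text(child)
```

Running `list(iter_shape_text(shape_info))` on every entry of a slide's `shapes` list gives the slide's text in shape order, with text inside groups included.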

6. The `extract_pptx_info` function

def extract_pptx_info(pptx_path, output_dir):
    """Extract information for every slide and save it as JSON files."""
    # exist_ok=True makes a separate existence check unnecessary
    os.makedirs(output_dir, exist_ok=True)

    prs = Presentation(pptx_path)
    all_slides_info = []

    for slide_index, slide in enumerate(prs.slides):
        slide_info = {
            "slide_index": slide_index,
            "shapes": []
        }

        for shape in slide.shapes:
            shape_info = extract_shape_info(shape)
            slide_info["shapes"].append(shape_info)
        all_slides_info.append(slide_info)

        json_filename = os.path.join(output_dir, f"slide_{slide_index}_info.json")
        with open(json_filename, 'w', encoding='utf-8') as f:
            json.dump(slide_info, f, indent=4, ensure_ascii=False)

        print(f"Slide {slide_index} information saved to {json_filename}")

    all_slides_json_filename = os.path.join(output_dir, "all_slides_info.json")
    with open(all_slides_json_filename, 'w', encoding='utf-8') as f:
        json.dump(all_slides_info, f, indent=4, ensure_ascii=False)
    print(f"All slides information saved to {all_slides_json_filename}")

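Purpose: extracts the information of the whole PPTX file, saves each slide's information to its own JSON file, and also saves the information of all slides to one combined JSON file.
Steps:
1. Create the output directory if it does not already exist.
2. Load the PPTX file with the Presentation class.
3. Iterate over every slide in the PPTX file. For each slide:
  - Initialize a dictionary slide_info that records the slide index and the information of the shapes it contains.
  - Iterate over every shape on the slide, call extract_shape_info to extract its information, and append the result to slide_info.
  - Append slide_info to the all_slides_info list.
  - Save slide_info as a separate JSON file.
4. Save all_slides_info as the combined JSON file.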
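
One detail of the saved JSON is easy to miss: positions, sizes, and font sizes are written with `str()`, so they appear as strings of English Metric Units (EMU), and unset values appear as the string "None". A minimal conversion sketch follows, using the fixed ratios of 360000 EMU per centimeter and 12700 EMU per point. The helper names are invented here, and the code assumes the values are plain decimal strings such as "914400".

```python
EMU_PER_CM = 360000
EMU_PER_PT = 12700


def emu_str_to_int(value):
    """Parse an EMU value saved as a string; return None if it was unset."""
    if value is None or value == "None":
        return None
    return int(value)


def shape_box_cm(shape_info):
    """Return left/top/width/height of a shape_info dict in centimeters."""
    box = {}
    for key in ("left", "top", "width", "height"):
        emu = emu_str_to_int(shape_info.get(key))
        box[key] = None if emu is None else round(emu / EMU_PER_CM, 2)
    return box


def font_size_pt(size):
    """Convert a saved font size (EMU string) to points."""
    emu = emu_str_to_int(size)
    return None if emu is None else emu / EMU_PER_PT
```

For example, a saved font size of "228600" corresponds to 18 pt, and a width of "914400" corresponds to 2.54 cm, which is one inch.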

7. The `main` function

def main():
    source_file = "./example.pptx"  # path to the source PPTX file
    output_dir = "./output"  # output directory

    if not os.path.exists(source_file):
        print(f"Source file {source_file} does not exist!")
        return

    extract_pptx_info(source_file, output_dir)


if __name__ == "__main__":
    main()

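Purpose: the entry point of the program. It sets the path of the source PPTX file and the output directory, checks that the source file exists, and if so calls extract_pptx_info to process it.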
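
The entry point above handles a single file. For a whole folder of presentations, a small driver along the following lines could be used. This is a sketch under assumptions: `process_directory` is a hypothetical name, the extractor is passed in as a callable with the same `(pptx_path, output_dir)` signature as extract_pptx_info, and each file gets its own output subdirectory named after the file.

```python
from pathlib import Path


def process_directory(source_dir, output_root, extract):
    """Run `extract(pptx_path, output_dir)` for every .pptx in source_dir.

    Returns (succeeded, failed): file names that worked, and
    (file name, error text) pairs for files that raised an exception.
    """
    succeeded, failed = [], []
    for pptx_path in sorted(Path(source_dir).glob("*.pptx")):
        if pptx_path.name.startswith("~$"):
            continue  # skip Office lock files left by an open presentation
        output_dir = Path(output_root) / pptx_path.stem
        try:
            extract(str(pptx_path), str(output_dir))
            succeeded.append(pptx_path.name)
        except Exception as exc:
            # Record the failure and keep going with the remaining files
            failed.append((pptx_path.name, repr(exc)))
    return succeeded, failed
```

A call such as `process_directory("./decks", "./output", extract_pptx_info)` would then process every presentation in `./decks`; collecting failures instead of stopping matters when a few files in a large batch are corrupt.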

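With this block-by-block walkthrough, the role and implementation details of each part of the code should be clear, which makes the code easier to understand, modify, and extend.

Finally, the complete code
Below is the complete code for extracting the structure and text of a PPTX file and saving the result as JSON files: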

import json
import os

from pptx import Presentation
from pptx.enum.dml import MSO_FILL
from pptx.enum.shapes import MSO_SHAPE_TYPE
from pptx.shapes.group import GroupShape


def extract_text_frame_info(text_frame):
    """Extract the paragraphs, runs, and formatting of a text frame."""
    text_info = []
    if text_frame and hasattr(text_frame, 'paragraphs'):
        for paragraph in text_frame.paragraphs:
            para_info = {
                "alignment": str(paragraph.alignment),
                "level": paragraph.level,
                "line_spacing": str(paragraph.line_spacing) if hasattr(paragraph, 'line_spacing') else None,
                "space_before": str(paragraph.space_before) if hasattr(paragraph, 'space_before') else None,
                "space_after": str(paragraph.space_after) if hasattr(paragraph, 'space_after') else None,
                "runs": []
            }
            for run in paragraph.runs:
                run_info = {
                    "text": run.text,
                    "font": {
                        "name": run.font.name if hasattr(run.font, 'name') else None,
                        "size": str(run.font.size) if hasattr(run.font, 'size') else None,
                        "bold": run.font.bold if hasattr(run.font, 'bold') else None,
                        "italic": run.font.italic if hasattr(run.font, 'italic') else None,
                        "underline": run.font.underline if hasattr(run.font, 'underline') else None,
                        "color": str(run.font.color.rgb) if hasattr(run.font.color, 'rgb') and run.font.color.rgb else None
                    }
                }
                para_info["runs"].append(run_info)
            text_info.append(para_info)
    return text_info


def extract_table_info(table):
    """Extract the content and formatting of a table."""
    table_info = {
        "rows": len(table.rows),
        "cols": len(table.columns),
        "cells": []
    }
    for row_idx, row in enumerate(table.rows):
        for col_idx, cell in enumerate(row.cells):
            # Only solid and patterned fills have a foreground color
            fill_color = None
            if hasattr(cell.fill, 'type') and cell.fill.type in (MSO_FILL.SOLID, MSO_FILL.PATTERNED):
                fore_color = cell.fill.fore_color
                if hasattr(fore_color, 'rgb') and fore_color.rgb:
                    fill_color = str(fore_color.rgb)

            cell_info = {
                "row": row_idx,
                "col": col_idx,
                "text": extract_text_frame_info(cell.text_frame),
                "fill_color": fill_color
            }
            table_info["cells"].append(cell_info)
    return table_info


def get_fill_color(shape):
    if hasattr(shape, 'fill') and shape.fill:
        fill = shape.fill
        if fill.type in (MSO_FILL.SOLID, MSO_FILL.PATTERNED):
            fore_color = fill.fore_color
            if hasattr(fore_color, 'rgb') and fore_color.rgb:
                return str(fore_color.rgb)
            elif hasattr(fore_color, 'theme_color') and fore_color.theme_color:
                return str(fore_color.theme_color)
    return None


def extract_shape_info(shape):
    shape_info = {
        "id": shape.shape_id,
        "name": shape.name,
        "type": str(shape.shape_type),
        "left": str(shape.left),
        "top": str(shape.top),
        "width": str(shape.width),
        "height": str(shape.height),
        "fill_type": None,
        "fill_color": None,
        "line_color": None,
        "line_width": None,
        "line_dash_style": None,
        "rotation": str(shape.rotation) if hasattr(shape, 'rotation') else None,
        "content": {}
    }
    try:
        # Only auto shapes define auto_shape_type; other shapes raise here
        shape_info['auto_shape_type'] = shape.auto_shape_type.name
    except Exception:
        pass

    if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
        return shape_info

    if hasattr(shape, 'fill') and shape.fill:
        shape_info["fill_type"] = str(shape.fill.type)
        shape_info["fill_color"] = get_fill_color(shape)
    if hasattr(shape, 'line') and shape.line:
        # color.type == 1 corresponds to MSO_COLOR_TYPE.RGB (an explicit RGB color)
        if hasattr(shape.line, 'color') and shape.line.color and shape.line.color.type == 1:
            shape_info["line_color"] = str(shape.line.color.rgb)
        if hasattr(shape.line, 'width'):
            shape_info["line_width"] = str(shape.line.width)
        try:
            if hasattr(shape.line, 'dash_style'):
                shape_info["line_dash_style"] = str(shape.line.dash_style)
        except ValueError:
            # Dash style has no predefined enum mapping
            shape_info["line_dash_style"] = None

    if shape.shape_type == MSO_SHAPE_TYPE.TABLE:
        shape_info["content"] = extract_table_info(shape.table)
    elif shape.has_text_frame:
        shape_info["content"] = extract_text_frame_info(shape.text_frame)
    elif isinstance(shape, GroupShape):
        group_shapes_info = []
        for sub_shape in shape.shapes:
            sub_shape_info = extract_shape_info(sub_shape)
            group_shapes_info.append(sub_shape_info)
        shape_info["content"] = {"group_shapes": group_shapes_info}

    return shape_info


def extract_pptx_info(pptx_path, output_dir):
    """Extract information for every slide and save it as JSON files."""
    # exist_ok=True makes a separate existence check unnecessary
    os.makedirs(output_dir, exist_ok=True)

    prs = Presentation(pptx_path)
    all_slides_info = []

    for slide_index, slide in enumerate(prs.slides):
        slide_info = {
            "slide_index": slide_index,
            "shapes": []
        }

        for shape in slide.shapes:
            shape_info = extract_shape_info(shape)
            slide_info["shapes"].append(shape_info)
        all_slides_info.append(slide_info)

        json_filename = os.path.join(output_dir, f"slide_{slide_index}_info.json")
        with open(json_filename, 'w', encoding='utf-8') as f:
            json.dump(slide_info, f, indent=4, ensure_ascii=False)

        print(f"Slide {slide_index} information saved to {json_filename}")

    all_slides_json_filename = os.path.join(output_dir, "all_slides_info.json")
    with open(all_slides_json_filename, 'w', encoding='utf-8') as f:
        json.dump(all_slides_info, f, indent=4, ensure_ascii=False)
    print(f"All slides information saved to {all_slides_json_filename}")


def main():
    source_file = "./example.pptx"  # path to the source PPTX file
    output_dir = "./output"  # output directory

    if not os.path.exists(source_file):
        print(f"Source file {source_file} does not exist!")
        return

    extract_pptx_info(source_file, output_dir)


if __name__ == "__main__":
    main()