
[NumPy Core Programming Guide: Python Data Processing, Analysis, and Scientific Computing] 2.3 Structured Indexing: Record Arrays and Field Access

article 2025/2/3 19:16:38


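2.3 Structured Indexing: Record Arrays and Field Access

Contents / Outline

2.3.1 Defining structured data types
2.3.1.1 Introduction to structured data types
2.3.1.2 Defining structured data types with dtype
2.3.2 Field access optimization
2.3.2.1 Basic field access
2.3.2.2 Performance of field access
2.3.3 Memory alignment principles
2.3.3.1 The concept of memory alignment
2.3.3.2 Implementing memory alignment
2.3.4 Record array operations
2.3.4.1 Creating record arrays
2.3.4.2 Slicing and indexing record arrays
2.3.5 CSV data conversion case study
2.3.5.1 Reading a CSV file
2.3.5.2 Converting CSV data to a structured array
2.3.5.3 Case analysis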

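Article

NumPy structured arrays are a practical tool for working with record-like data. A structured array lets every element carry several named fields, which makes the data easier to organize and access. This article covers how to define structured data types, how field access behaves and how to speed it up, how memory alignment works, and how to use record arrays, and it closes with a CSV conversion case study that puts these features together.

2.3.1 Defining structured data types

2.3.1.1 Introduction to structured data types

Structured data types let each array element contain several fields, and each field can have its own data type. This is useful for data with multiple attributes, such as database records or spreadsheet rows.

How it works

Fields: each element can have several fields, and each field has a name and a data type.
Memory layout: each record is stored as one contiguous block of bytes, and the position of every field inside it follows from the field definitions (order, sizes, and optional alignment).

2.3.1.2 Defining structured data types with dtype

In NumPy, a structured data type is described by a dtype object. It can be specified as a list of (name, type) tuples, optionally with a shape, or as a dictionary that gives the field names, formats, and byte offsets.

Mathematical representation of a structured dtype: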

$\text{dtype} = \{ (name_1, type_1, offset_1), \dots, (name_n, type_n, offset_n) \}$
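
To make the (name, type, offset) triples concrete, here is a minimal sketch of the dictionary form of a dtype specification. The field names, formats, and offsets are illustrative choices, not taken from the article's dataset.

```python
import numpy as np

# Dict form: names, formats, and byte offsets are all given explicitly.
# The fields below are hypothetical and chosen only for illustration.
point_dtype = np.dtype({
    'names':    ['id', 'x', 'y'],
    'formats':  ['u2', 'f4', 'f4'],
    'offsets':  [0, 4, 8],      # leaves 2 unused bytes after 'id'
    'itemsize': 12,
})

# List-of-tuples form with the same fields: NumPy picks packed offsets.
packed_dtype = np.dtype([('id', 'u2'), ('x', 'f4'), ('y', 'f4')])

# dtype.fields maps each name to a (dtype, offset) pair.
print(point_dtype.fields['x'][1], packed_dtype.fields['x'][1])
print(point_dtype.itemsize, packed_dtype.itemsize)
```

The dictionary form is the one to reach for when the byte layout is dictated from outside, for example by a binary file format. Otherwise the list form is shorter.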

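Example: creating a data type with three fields:

import numpy as np

person_dtype = np.dtype([
    ('name', 'U32'),       # 32-character Unicode string (4 bytes per character, 128 bytes)
    ('age', 'u1'),         # unsigned 8-bit integer
    ('salary', 'f4', (3,)) # sub-array of 3 float32 values
], align=True)             # enable memory alignment

print(person_dtype.itemsize)  # Output: 144 (total size after alignment)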
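
The layout of a structured dtype can be inspected directly, which is often easier than working out sizes by hand. The sketch below rebuilds the same three fields with and without alignment and prints the byte offset of each field. The variable names `aligned` and `packed` are introduced here for illustration.

```python
import numpy as np

fields = [('name', 'U32'), ('age', 'u1'), ('salary', 'f4', (3,))]
aligned = np.dtype(fields, align=True)
packed = np.dtype(fields)          # align=False is the default

for label, dt in (('aligned', aligned), ('packed', packed)):
    # dt.fields[name] is a (dtype, offset) pair
    offsets = {name: dt.fields[name][1] for name in dt.names}
    print(label, offsets, 'itemsize =', dt.itemsize)
```

The packed layout needs 128 + 1 + 12 = 141 bytes per record. The aligned layout is padded so that each record occupies a multiple of 4 bytes.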

[Figure: memory layout diagram (Mermaid)]

Example code

import numpy as np

# Define a structured data type with three fields:
# name (string), age (integer), height (float)
dtype = np.dtype([('name', '<U10'), ('age', 'i4'), ('height', 'f4')])

# Create a structured array
structured_array = np.array([('Alice', 25, 1.65), ('Bob', 30, 1.80), ('Charlie', 22, 1.75)], dtype=dtype)
print(structured_array)  # Print the structured array

2.3.2 Field access optimization

2.3.2.1 Basic field access

Accessing fields in a structured array is straightforward: index the array with the field name.

Example code

# Create a structured array
dtype = np.dtype([('name', '<U10'), ('age', 'i4'), ('height', 'f4')])
structured_array = np.array([('Alice', 25, 1.65), ('Bob', 30, 1.80), ('Charlie', 22, 1.75)], dtype=dtype)

# Access fields
names = structured_array['name']
ages = structured_array['age']
heights = structured_array['height']

print(names)  # Output: ['Alice' 'Bob' 'Charlie']
print(ages)  # Output: [25 30 22]
print(heights)  # Output: [1.65 1.8  1.75]

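2.3.2.2 Performance of field access

Field access performance depends mainly on the following factors:

Field order: in an aligned dtype (align=True), the order of the fields determines how much padding is inserted and therefore how large each record is. In the default packed layout, reordering fields does not change the record size.
Memory alignment: aligned fields can be read and written without unaligned memory accesses, which may be faster on some platforms, at the cost of padding bytes.
Avoid unnecessary field access: indexing a field returns a strided view over the full records, so read only the fields you need and copy a field into a contiguous array if it is used repeatedly.

Example code

# Create a structured array
dtype = np.dtype([('name', '<U10'), ('age', 'i4'), ('height', 'f4')])
structured_array = np.array([('Alice', 25, 1.65), ('Bob', 30, 1.80), ('Charlie', 22, 1.75)], dtype=dtype)

# Reorder the fields: numeric fields first, string field last
dtype_optimized = np.dtype([('age', 'i4'), ('height', 'f4'), ('name', '<U10')])
structured_array_optimized = np.array([(25, 1.65, 'Alice'), (30, 1.80, 'Bob'), (22, 1.75, 'Charlie')], dtype=dtype_optimized)

# Access fields
ages_optimized = structured_array_optimized['age']
heights_optimized = structured_array_optimized['height']

print(ages_optimized)  # Output: [25 30 22]
print(heights_optimized)  # Output: [1.65 1.8  1.75]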
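
The cost of field access follows from the fact that a field is a view into the parent array, not a separate array. The following sketch makes that visible. The variable names `people` and `ages_copy` are illustrative, and it makes no claim about how large the speed difference is in practice.

```python
import numpy as np

dtype = np.dtype([('name', '<U10'), ('age', 'i4'), ('height', 'f4')])
people = np.array([('Alice', 25, 1.65), ('Bob', 30, 1.80), ('Charlie', 22, 1.75)], dtype=dtype)

ages = people['age']                   # a view, not a copy
print(np.shares_memory(ages, people))  # the view shares memory with the parent
print(ages.strides, people.itemsize)   # step between ages equals the record size

ages[0] = 26                           # writes through to the parent array
print(people[0])

# For repeated numeric work on one field, a contiguous copy packs the values tightly.
ages_copy = np.ascontiguousarray(ages)
print(ages_copy.strides)
```

Because the stride equals the record size, every record's bytes are stepped over when scanning a single field. A contiguous copy avoids that, at the price of no longer tracking changes in the parent array.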

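2.3.3 Memory alignment principles

2.3.3.1 The concept of memory alignment

Memory alignment means that data is stored at addresses that satisfy specific alignment requirements. Suitable alignment can speed up data access, for example by avoiding values that straddle cache lines.

How it works

Alignment requirement: different data types have different alignment requirements. For example, a 4-byte integer is typically aligned to 4 bytes.
Alignment mode: by default NumPy packs fields back to back with no padding (align=False). Passing align=True makes NumPy insert padding so that each field starts at a multiple of its alignment, much as a C compiler lays out a struct.

Formula for the aligned offset: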

$offset_i = \lceil \frac{current\_offset}{alignment_i} \rceil \times alignment_i$
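
The formula can be checked against NumPy itself. The sketch below applies it field by field and compares the result with the offsets NumPy chooses under align=True. The helper `aligned_offsets` and the example fields are hypothetical names introduced for this illustration. The final padding step assumes that the record size is rounded up to the largest field alignment.

```python
import numpy as np

def aligned_offsets(field_dtypes):
    """Apply offset_i = ceil(current_offset / alignment_i) * alignment_i per field."""
    offsets, current, max_align = [], 0, 1
    for dt in field_dtypes:
        align = dt.alignment
        current = -(-current // align) * align   # ceiling division, then scale back up
        offsets.append(current)
        current += dt.itemsize
        max_align = max(max_align, align)
    # Pad the record so that consecutive records stay aligned as well.
    itemsize = -(-current // max_align) * max_align
    return offsets, itemsize

spec = [('flag', 'u1'), ('value', 'f8'), ('count', 'i2')]
offsets, itemsize = aligned_offsets([np.dtype(code) for _, code in spec])

numpy_dtype = np.dtype(spec, align=True)
numpy_offsets = [numpy_dtype.fields[name][1] for name in numpy_dtype.names]

print(offsets, itemsize)                      # computed with the formula
print(numpy_offsets, numpy_dtype.itemsize)    # chosen by NumPy
```

The exact numbers depend on the platform's alignment for each type, which is why the sketch reads `dt.alignment` instead of hard-coding values.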

2.3.3.2 Implementing memory alignment

When defining a structured data type, pass align=True to request an aligned layout.

Example code

# Create an aligned structured data type
dtype_aligned = np.dtype([('name', '<U10'), ('age', 'i4'), ('height', 'f4')], align=True)
structured_array_aligned = np.array([('Alice', 25, 1.65), ('Bob', 30, 1.80), ('Charlie', 22, 1.75)], dtype=dtype_aligned)

# Access fields
names_aligned = structured_array_aligned['name']
ages_aligned = structured_array_aligned['age']
heights_aligned = structured_array_aligned['height']

print(names_aligned)  # Output: ['Alice' 'Bob' 'Charlie']
print(ages_aligned)  # Output: [25 30 22]
print(heights_aligned)  # Output: [1.65 1.8  1.75]
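
Whether align=True changes anything depends on the fields involved. As a quick check, the sketch below compares the packed and aligned versions of the dtype used in this section. The names `packed` and `aligned` are introduced here for illustration.

```python
import numpy as np

fields = [('name', '<U10'), ('age', 'i4'), ('height', 'f4')]
packed = np.dtype(fields)
aligned = np.dtype(fields, align=True)

# Every field here is a multiple of 4 bytes wide, so no padding is needed.
print(packed.itemsize, aligned.itemsize)
print(packed.isalignedstruct, aligned.isalignedstruct)
```

For this dtype the two layouts are byte-for-byte identical. The difference only appears when a narrow field, such as a 1-byte integer, precedes a wider one.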

2.3.4 Record array operations

2.3.4.1 Creating record arrays

A record array is a special form of structured array that also allows fields to be accessed with dot notation, like attributes of an object.

Example code

# Create a record array
dtype = np.dtype([('name', '<U10'), ('age', 'i4'), ('height', 'f4')])
record_array = np.rec.array([('Alice', 25, 1.65), ('Bob', 30, 1.80), ('Charlie', 22, 1.75)], dtype=dtype)

# Access fields
print(record_array.name)  # Output: ['Alice' 'Bob' 'Charlie']
print(record_array.age)  # Output: [25 30 22]
print(record_array.height)  # Output: [1.65 1.8  1.75]

2.3.4.2 Slicing and indexing record arrays

Record arrays support slicing and indexing in the same way as ordinary arrays.

Example code

# Create a record array
dtype = np.dtype([('name', '<U10'), ('age', 'i4'), ('height', 'f4')])
record_array = np.rec.array([('Alice', 25, 1.65), ('Bob', 30, 1.80), ('Charlie', 22, 1.75)], dtype=dtype)

# Slicing
sub_array = record_array[1:3]
print(sub_array)  # Output: the records for Bob and Charlie

# Indexing
print(record_array[0].name)  # Output: Alice
print(record_array[0].age)  # Output: 25
print(record_array[0].height)  # Output: 1.65
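
Beyond slices and integer indices, record arrays also accept boolean masks built from a field, and whole records can be sorted by a field. The sketch below shows both, starting from an existing structured array. The variable names and the age threshold are illustrative.

```python
import numpy as np

dtype = np.dtype([('name', '<U10'), ('age', 'i4'), ('height', 'f4')])
people = np.array([('Alice', 25, 1.65), ('Bob', 30, 1.80), ('Charlie', 22, 1.75)], dtype=dtype)

records = people.view(np.recarray)       # same memory, attribute access enabled
older = records[records.age > 24]        # boolean mask built from one field
by_age = np.sort(people, order='age')    # sort whole records by the 'age' field

print(older.name)
print(by_age['name'])
```

Viewing an existing structured array as np.recarray avoids copying the data, which is convenient when the array was built elsewhere.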

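2.3.5 CSV data conversion case study

2.3.5.1 Reading a CSV file

Read a CSV file so that its data can be converted to a NumPy structured array.

Example code

import csv

# Read a CSV file
def read_csv(filename):
    # newline='' is the mode recommended by the csv module for reading files
    with open(filename, 'r', newline='') as file:
        reader = csv.reader(file)
        header = next(reader)  # Read the header row
        data = [row for row in reader]  # Read the data rows (all values are strings)
    return header, data

header, data = read_csv('data.csv')
print(header)  # Output: ['name', 'age', 'height']
print(data)  # Output: [['Alice', '25', '1.65'], ['Bob', '30', '1.80'], ['Charlie', '22', '1.75']]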

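2.3.5.2 Converting CSV data to a structured array

Convert the CSV data read above into a NumPy structured array.

Example code

# Convert CSV data to a structured array
def convert_to_structured_array(header, data):
    # Per-column dtype code and Python converter; other columns are treated as float32
    specs = {'name': ('U10', str), 'age': ('i4', int)}
    columns = [specs.get(field, ('f4', float)) for field in header]
    dtype = np.dtype([(field, code) for field, (code, _) in zip(header, columns)])
    # Each record must be a tuple; a list of lists would be read as a 2-D array
    rows = [tuple(convert(value) for (_, convert), value in zip(columns, row)) for row in data]
    return np.array(rows, dtype=dtype)

structured_array = convert_to_structured_array(header, data)
print(structured_array)  # Print the structured array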
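
For simple files, NumPy can also do the parsing and type inference itself. As an alternative sketch, `np.genfromtxt` with `names=True` and `dtype=None` builds a structured array directly from CSV text. An in-memory string stands in for the data.csv file here, and the inferred integer and float widths are NumPy's defaults, not the 'i4' and 'f4' used above.

```python
import io
import numpy as np

# Stand-in for the contents of data.csv (assumed to match the example data)
csv_text = "name,age,height\nAlice,25,1.65\nBob,30,1.80\nCharlie,22,1.75\n"

table = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=',',
    names=True,        # take field names from the header row
    dtype=None,        # infer a type for each column
    encoding='utf-8',
)

print(table.dtype.names)
print(table['age'])
```

The hand-written converter gives full control over each field's type. `np.genfromtxt` is shorter when the inferred types are acceptable.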

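2.3.5.3 Case analysis

The examples above show how to read data from a CSV file and convert it to a NumPy structured array. Once the data is in this form, each column is available by name as a typed NumPy array, so fields can be accessed and processed without per-row Python loops.

Complete example code

import numpy as np
import csv

# Read a CSV file
def read_csv(filename):
    # newline='' is the mode recommended by the csv module for reading files
    with open(filename, 'r', newline='') as file:
        reader = csv.reader(file)
        header = next(reader)  # Read the header row
        data = [row for row in reader]  # Read the data rows (all values are strings)
    return header, data

# Convert CSV data to a structured array
def convert_to_structured_array(header, data):
    # Per-column dtype code and Python converter; other columns are treated as float32
    specs = {'name': ('U10', str), 'age': ('i4', int)}
    columns = [specs.get(field, ('f4', float)) for field in header]
    dtype = np.dtype([(field, code) for field, (code, _) in zip(header, columns)])
    # Each record must be a tuple; a list of lists would be read as a 2-D array
    rows = [tuple(convert(value) for (_, convert), value in zip(columns, row)) for row in data]
    return np.array(rows, dtype=dtype)

# Read the CSV file
header, data = read_csv('data.csv')

# Convert the CSV data to a structured array
structured_array = convert_to_structured_array(header, data)

# Access fields
names = structured_array['name']
ages = structured_array['age']
heights = structured_array['height']

print(f"Names: {names}")
print(f"Ages: {ages}")
print(f"Heights: {heights}")

# Create a record array as a view of the structured array (no copy)
record_array = structured_array.view(np.recarray)

# Access the record array
print(f"Record Array: {record_array}")
print(f"First Name: {record_array[0].name}")
print(f"First Age: {record_array[0].age}")
print(f"First Height: {record_array[0].height}")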

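Summary

This article covered how NumPy structured arrays are defined and used. A structured array gives every element several named fields, which helps organize and access complex data. Fields are accessed by name, and the cost of that access depends on the record layout, including field order and memory alignment. Record arrays are a special form of structured array that adds dot notation for more readable field access. Finally, the CSV conversion case study showed how to read real data and turn it into a structured array so that these features can be used in practice.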