PE (Processing Element) in Vitis HLS: Usage and Implementation

article 2025/3/19 10:18:03

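Basic concept of a PE

In FPGA design, a PE (Processing Element) is a basic functional module that performs one specific computing task. In Vitis HLS (high-level synthesis), a PE is not a predefined hardware structure. It is a compute unit that you describe with C/C++ code and compiler directives (pragmas), and that can run in parallel with other such units.

Key characteristics of a PE in Vitis HLS:

Modular design: a PE is usually a self-contained functional unit with clearly defined input/output interfaces
Replicability: a PE can be instantiated many times to form an array for parallel computation
Dedicated computation: each PE usually performs one specific arithmetic or logic operation, such as multiply-accumulate (MAC) or convolution
Dataflow processing: PEs support pipelined operation, which raises throughput

Creating PEs in Vitis HLS

1. Function-level parallelism

In Vitis HLS, the function is the basic unit for creating a PE. Each function can be synthesized into its own hardware module.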

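// A simple PE function
void simple_pe(float a, float b, float *c) {
    #pragma HLS INLINE off
    *c = a * b + *c; // multiply-accumulate
}

// Top-level function that instantiates several PEs
void top_function(float a[4], float b[4], float c[4]) {
    // Split the arrays into individual registers so that the four
    // calls do not compete for the same memory ports
    #pragma HLS ARRAY_PARTITION variable=a complete
    #pragma HLS ARRAY_PARTITION variable=b complete
    #pragma HLS ARRAY_PARTITION variable=c complete
    
    // Four independent PE instances; with no data dependencies
    // between them, the tool can schedule them in the same cycles
    simple_pe(a[0], b[0], &c[0]);
    simple_pe(a[1], b[1], &c[1]);
    simple_pe(a[2], b[2], &c[2]);
    simple_pe(a[3], b[3], &c[3]);
}

Key directives:

#pragma HLS INLINE off: keeps the function from being inlined, so each PE remains a separate module
#pragma HLS ARRAY_PARTITION: splits an array into smaller memories or registers so that several PEs can access it in the same cycle
#pragma HLS DATAFLOW: overlaps the execution of functions that form a producer-consumer chain (task-level pipelining). The independent calls above do not need it; it is used in the dataflow design of section 3.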

2. Loop-level parallelism

Loop optimization directives can turn a loop into an array of PEs that run in parallel:

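void loop_pe_array(float a[16], float b[16], float c[16]) {
    // Cyclic partitioning by 4 lets the 4 unrolled copies reach
    // their elements in the same cycle
    #pragma HLS ARRAY_PARTITION variable=a cyclic factor=4
    #pragma HLS ARRAY_PARTITION variable=b cyclic factor=4
    #pragma HLS ARRAY_PARTITION variable=c cyclic factor=4

    // Unrolling the loop creates several parallel PEs
    for (int i = 0; i < 16; i++) {
        #pragma HLS UNROLL factor=4
        #pragma HLS PIPELINE II=1
        c[i] = a[i] * b[i] + c[i];
    }
}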

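Key directives:

#pragma HLS UNROLL: unrolls the loop, creating several parallel copies of the loop body (PE instances); factor=4 creates four copies
#pragma HLS PIPELINE: pipelines the loop to raise throughput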
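
To make the effect of UNROLL factor=4 concrete, the following plain C++ sketch writes the partial unrolling out by hand. It is only an illustration of what the directive asks the tool to build, not tool output; the function name mac_unrolled_by_4 is made up for this example.

```cpp
// Hand-written equivalent of a 16-iteration MAC loop unrolled by 4.
// Each statement in the loop body corresponds to one PE copy in hardware.
void mac_unrolled_by_4(const float a[16], const float b[16], float c[16]) {
    for (int i = 0; i < 16; i += 4) {
        c[i]     = a[i]     * b[i]     + c[i];      // PE 0
        c[i + 1] = a[i + 1] * b[i + 1] + c[i + 1];  // PE 1
        c[i + 2] = a[i + 2] * b[i + 2] + c[i + 2];  // PE 2
        c[i + 3] = a[i + 3] * b[i + 3] + c[i + 3];  // PE 3
    }
}
```

The loop now runs 4 times instead of 16, and the four statements of one iteration have no dependencies on each other, which is what allows them to run side by side.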

3. Dataflow design

Use hls::stream together with the DATAFLOW directive to build a pipelined network of PEs:

#include "hls_stream.h"

void producer_pe(float input[100], hls::stream<float> &output) {
    for (int i = 0; i < 100; i++) {
        #pragma HLS PIPELINE II=1
        output.write(input[i]);
    }
}

void compute_pe(hls::stream<float> &input, hls::stream<float> &output) {
    for (int i = 0; i < 100; i++) {
        #pragma HLS PIPELINE II=1
        float data = input.read();
        output.write(data * 2.0f); // simple computation
    }
}

void consumer_pe(hls::stream<float> &input, float output[100]) {
    for (int i = 0; i < 100; i++) {
        #pragma HLS PIPELINE II=1
        output[i] = input.read();
    }
}

void dataflow_pe_example(float input[100], float output[100]) {
    #pragma HLS DATAFLOW
    
    hls::stream<float> stream1, stream2;
    #pragma HLS STREAM variable=stream1 depth=2
    #pragma HLS STREAM variable=stream2 depth=2
    
    producer_pe(input, stream1);
    compute_pe(stream1, stream2);
    consumer_pe(stream2, output);
}
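
The same three-stage structure can be tried out without the HLS headers. The sketch below is a software-only model: SimpleFifo is a hypothetical stand-in built on std::queue and is not the hls::stream API, and the stages simply run one after another, so it checks the data path only and says nothing about timing or FIFO depth.

```cpp
#include <queue>

// Hypothetical stand-in for hls::stream<float>, for software-only experiments.
struct SimpleFifo {
    std::queue<float> q;
    void write(float v) { q.push(v); }
    float read() { float v = q.front(); q.pop(); return v; }
    bool empty() const { return q.empty(); }
};

// Stage 1: push every input element into the FIFO.
void producer_model(const float in[100], SimpleFifo &out) {
    for (int i = 0; i < 100; i++) out.write(in[i]);
}

// Stage 2: read, double, forward.
void compute_model(SimpleFifo &in, SimpleFifo &out) {
    for (int i = 0; i < 100; i++) out.write(in.read() * 2.0f);
}

// Stage 3: drain the FIFO into the output array.
void consumer_model(SimpleFifo &in, float out[100]) {
    for (int i = 0; i < 100; i++) out[i] = in.read();
}

// The stages run to completion one after another in software;
// the FIFOs are unbounded here, unlike a depth=2 hardware FIFO.
void pipeline_model(const float in[100], float out[100]) {
    SimpleFifo s1, s2;
    producer_model(in, s1);
    compute_model(s1, s2);
    consumer_model(s2, out);
}
```

Calling pipeline_model with in[i] = i fills out with 2 * i, and both FIFOs are empty afterwards because every stage reads exactly as many elements as the previous one wrote.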

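PE design patterns

1. Basic PE structure

The simplest PE usually consists of:

Input/output interfaces
Compute logic
Optional internal state or buffering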

template<typename T>
void basic_pe(T input_a, T input_b, T *output) {
    #pragma HLS INLINE off
    #pragma HLS PIPELINE II=1
    
    // Internal register/state
    static T accumulator = 0;
    
    // Compute logic
    accumulator += input_a * input_b;
    
    // Output the result
    *output = accumulator;
}
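
In C++, a function-local static is shared by every call to the same template instantiation, which makes it awkward to reset a PE or to keep several PEs with independent state. One possible alternative, sketched here in plain C++, keeps the state in a small struct. MacPe and dot_with_pe are hypothetical names for this illustration, and how a given tool version synthesizes such a struct should be checked separately.

```cpp
// Hypothetical PE with explicit per-instance state instead of a static.
template <typename T>
struct MacPe {
    T accumulator = 0;            // per-instance register

    void reset() { accumulator = 0; }

    // One multiply-accumulate step; returns the running sum.
    T step(T a, T b) {
        accumulator += a * b;
        return accumulator;
    }
};

// Dot product of two length-n vectors using a single PE instance.
template <typename T>
T dot_with_pe(const T *x, const T *y, int n) {
    MacPe<T> pe;
    T result = 0;
    for (int i = 0; i < n; i++) {
        result = pe.step(x[i], y[i]);
    }
    return result;
}
```

With x = {1, 2, 3, 4} and y = {5, 6, 7, 8}, dot_with_pe<int>(x, y, 4) returns 70, and two MacPe objects accumulate independently of each other.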

2. PE array design

PE arrays are the key to high-performance computing on FPGAs. A PE array can be built in several ways:

Explicit instantiation

void pe_array_explicit(float a[4][4], float b[4][4], float c[4][4]) {
    #pragma HLS ARRAY_PARTITION variable=a complete dim=0
    #pragma HLS ARRAY_PARTITION variable=b complete dim=0
    #pragma HLS ARRAY_PARTITION variable=c complete dim=0
    
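    // Explicitly instantiate 16 PEs to form a 4x4 array
    // (pe_function is the PE function, defined elsewhere)
    pe_function(a[0][0], b[0][0], &c[0][0]);
    pe_function(a[0][1], b[0][1], &c[0][1]);
    // ... more PE instantiations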
    pe_function(a[3][3], b[3][3], &c[3][3]);
}

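Generation by loop unrolling

void pe_array_loop(float a[16][16], float b[16][16], float c[16][16]) {
    // Partition the column dimension so a whole row can be accessed per cycle
    #pragma HLS ARRAY_PARTITION variable=a complete dim=2
    #pragma HLS ARRAY_PARTITION variable=b complete dim=2
    #pragma HLS ARRAY_PARTITION variable=c complete dim=2

    // Unrolling the inner loop generates a row of 16 PEs automatically;
    // the pipelined outer loop feeds them one matrix row per cycle
    for (int i = 0; i < 16; i++) {
        #pragma HLS PIPELINE II=1
        for (int j = 0; j < 16; j++) {
            #pragma HLS UNROLL
            c[i][j] = a[i][j] * b[i][j] + c[i][j];
        }
    }
}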

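3. Systolic array

A systolic array is a special kind of PE array in which data moves between PEs in a regular rhythm. It is well suited to computations such as matrix multiplication: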

// Matrix multiplication on a 16x16 grid of MAC PEs. This is a simplified,
// systolic-style version: operands are broadcast along rows and columns
// instead of being passed from neighbor to neighbor.
void systolic_matrix_mult(float a[16][16], float b[16][16], float c[16][16]) {
    // Local copies
    float a_local[16][16];
    float b_local[16][16];
    float c_local[16][16] = {0};
    
    // Step k reads column k of a (all rows) and row k of b (all columns)
    #pragma HLS ARRAY_PARTITION variable=a_local complete dim=1
    #pragma HLS ARRAY_PARTITION variable=b_local complete dim=2
    #pragma HLS ARRAY_PARTITION variable=c_local complete dim=0
    
    // Load the data
    for (int i = 0; i < 16; i++) {
        for (int j = 0; j < 16; j++) {
            #pragma HLS PIPELINE II=1
            a_local[i][j] = a[i][j];
            b_local[i][j] = b[i][j];
        }
    }
    
    // Computation: pipelining the k loop unrolls the i and j loops,
    // which gives one MAC PE per output element. With float
    // accumulation the tool may report an II greater than 1.
    for (int k = 0; k < 16; k++) {
        #pragma HLS PIPELINE II=1
        for (int i = 0; i < 16; i++) {
            for (int j = 0; j < 16; j++) {
                c_local[i][j] += a_local[i][k] * b_local[k][j];
            }
        }
    }
    
    // Write out the result
    for (int i = 0; i < 16; i++) {
        for (int j = 0; j < 16; j++) {
            #pragma HLS PIPELINE II=1
            c[i][j] = c_local[i][j];
        }
    }
}
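
To show the neighbor-to-neighbor data movement that gives a systolic array its name, here is a cycle-by-cycle software model of an output-stationary array, reduced to 4x4 to keep it short. It is a plain C++ sketch for understanding the schedule, not synthesizable HLS code; kDim and systolic_model are names invented for this example. Row i of a enters at the left edge delayed by i cycles, column j of b enters at the top delayed by j cycles, and every PE forwards its operands one hop per cycle.

```cpp
const int kDim = 4;  // array size used in this sketch

void systolic_model(const float a[kDim][kDim], const float b[kDim][kDim],
                    float c[kDim][kDim]) {
    float a_reg[kDim][kDim] = {};  // operand in each PE, moving right
    float b_reg[kDim][kDim] = {};  // operand in each PE, moving down

    for (int i = 0; i < kDim; i++)
        for (int j = 0; j < kDim; j++)
            c[i][j] = 0.0f;

    // 3*kDim - 2 cycles are enough for the last operands to reach PE(kDim-1, kDim-1).
    for (int t = 0; t < 3 * kDim - 2; t++) {
        // Visit PEs from the far corner backwards so that each PE still
        // sees the previous-cycle value of its left and upper neighbor.
        for (int i = kDim - 1; i >= 0; i--) {
            for (int j = kDim - 1; j >= 0; j--) {
                int ka = t - i;  // element of row i entering at the left edge
                int kb = t - j;  // element of column j entering at the top edge
                float a_in = (j == 0)
                    ? ((ka >= 0 && ka < kDim) ? a[i][ka] : 0.0f)
                    : a_reg[i][j - 1];
                float b_in = (i == 0)
                    ? ((kb >= 0 && kb < kDim) ? b[kb][j] : 0.0f)
                    : b_reg[i - 1][j];
                a_reg[i][j] = a_in;
                b_reg[i][j] = b_in;
                c[i][j] += a_in * b_in;  // the PE's local accumulator
            }
        }
    }
}
```

At cycle t, PE(i, j) receives a[i][k] and b[k][j] with k = t - i - j, so both operands of every product arrive together and each PE accumulates exactly one element of the result.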

Communication patterns between PEs

1. Stream communication

Use hls::stream for efficient data transfer between PEs:

#include "hls_stream.h"

void stream_communication_example(float output[100]) {
    #pragma HLS DATAFLOW
    
    hls::stream<float> stream1, stream2;
    #pragma HLS STREAM variable=stream1 depth=2
    #pragma HLS STREAM variable=stream2 depth=2
    
    // PE1: produce data
    for (int i = 0; i < 100; i++) {
        #pragma HLS PIPELINE II=1
        stream1.write(i * 1.0f);
    }
    
    // PE2: process data
    for (int i = 0; i < 100; i++) {
        #pragma HLS PIPELINE II=1
        float data = stream1.read();
        stream2.write(data * 2.0f);
    }
    
    // PE3: consume data; the result is stored in an output array so
    // that the pipeline has an observable effect and is not optimized away
    for (int i = 0; i < 100; i++) {
        #pragma HLS PIPELINE II=1
        output[i] = stream2.read();
    }
}

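2. Ping-pong buffering

Double buffering lets PEs exchange data efficiently: while one buffer is being read, the other is being written. In the example here, the two buffers swap roles on every iteration, but each iteration still runs on its own. To make a producer and a consumer actually overlap, place them in a DATAFLOW region, where Vitis HLS implements an array passed between two tasks as a ping-pong buffer by default.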

void ping_pong_buffer_example(float input[1024], float output[1024]) {
    float buffer_ping[1024];
    float buffer_pong[1024];
    
    // Stage 1: fill the ping buffer
    for (int i = 0; i < 1024; i++) {
        #pragma HLS PIPELINE II=1
        buffer_ping[i] = input[i];
    }
    
    // Alternate between the two buffers
    for (int iter = 0; iter < 10; iter++) {
        if (iter % 2 == 0) {
            // Process the data in the ping buffer and write the results into the pong buffer
            for (int i = 0; i < 1024; i++) {
                #pragma HLS PIPELINE II=1
                buffer_pong[i] = buffer_ping[i] * 2.0f;
            }
        } else {
            // Process the data in the pong buffer and write the results into the ping buffer
            for (int i = 0; i < 1024; i++) {
                #pragma HLS PIPELINE II=1
                buffer_ping[i] = buffer_pong[i] * 2.0f;
            }
        }
    }
    
    // Final stage: output the final result (after 10 iterations it is in the ping buffer)
    for (int i = 0; i < 1024; i++) {
        #pragma HLS PIPELINE II=1
        output[i] = (10 % 2 == 0) ? buffer_ping[i] : buffer_pong[i];
    }
}

Key techniques for optimizing PE designs

1. Pipeline optimization

Use the PIPELINE directive to raise the throughput of a PE:

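// complex_function stands for the per-element computation and is defined elsewhere
void pipelined_pe(float input[1024], float output[1024]) {
    for (int i = 0; i < 1024; i++) {
        #pragma HLS PIPELINE II=1  // ideal initiation interval of 1
        output[i] = complex_function(input[i]);
    }
}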

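Adjusting the II (Initiation Interval) trades resource usage against performance:

II=1: a new iteration starts every clock cycle, giving the highest throughput
II>1: a new iteration starts only every II cycles, which lets hardware be shared and lowers resource usage, but throughput drops accordingly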
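
The effect of II on run time can be estimated with the usual first-order formula for a pipelined loop: cycles ≈ depth + II × (trip count − 1), where depth is the latency of a single iteration. The helper below is a small sketch of that estimate; the function name is made up, and loop entry/exit overhead is ignored.

```cpp
// Approximate cycle count of a pipelined loop.
// trip_count: number of iterations
// depth:      latency of one iteration in cycles (pipeline depth)
// ii:         initiation interval in cycles
long pipelined_loop_cycles(long trip_count, long depth, long ii) {
    if (trip_count <= 0) return 0;
    return depth + ii * (trip_count - 1);
}
```

For 1024 iterations and an assumed depth of 10 cycles, II=1 gives 1033 cycles and II=2 gives 2056 cycles. Without pipelining, each iteration waits for the previous one to finish (equivalent to II = depth), which gives 10240 cycles.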

2. Memory access optimization

Use the ARRAY_PARTITION directive to split an array into several smaller memories that can be accessed in parallel:

void memory_optimized_pe(float input[16][16], float output[16][16]) {
    #pragma HLS ARRAY_PARTITION variable=input complete dim=2
    #pragma HLS ARRAY_PARTITION variable=output complete dim=2
    
    for (int i = 0; i < 16; i++) {
        // Pipelining the row loop unrolls the inner loop, so all 16
        // partitioned columns of a row are accessed in the same cycle
        #pragma HLS PIPELINE II=1
        for (int j = 0; j < 16; j++) {
            output[i][j] = input[i][j] * 2.0f;
        }
    }
}

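Partition types:

complete: splits the array into individual elements, which are implemented as separate registers
block: splits the array into contiguous blocks; suits block-wise access patterns
cyclic: interleaves consecutive elements across the partitions; suits strided (striped) access patterns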
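
The difference between block and cyclic partitioning is easiest to see as an index mapping. The sketch below computes, for a one-dimensional array, which partition (bank) an element lands in and at which offset. It is a plain C++ illustration of the two layouts; BankAddr, cyclic_map and block_map are names made up for this example.

```cpp
struct BankAddr {
    int bank;    // which partition holds the element
    int offset;  // position inside that partition
};

// cyclic: consecutive elements go to consecutive banks.
BankAddr cyclic_map(int index, int factor) {
    return { index % factor, index / factor };
}

// block: the array is cut into `factor` contiguous chunks.
BankAddr block_map(int index, int size, int factor) {
    int block_len = (size + factor - 1) / factor;  // ceil(size / factor)
    return { index / block_len, index % block_len };
}
```

For a 16-element array and factor 4, elements 0 to 3 fall into four different banks under cyclic partitioning but all into bank 0 under block partitioning. This is why a loop unrolled by 4 over consecutive indices pairs well with cyclic partitioning.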

3. Data type optimization

Choosing a suitable data type can reduce the resources a PE uses and improve its performance:

#include "ap_fixed.h"

// Use fixed-point instead of floating-point numbers
typedef ap_fixed<16, 8> fixed_t;  // 16 bits in total, 8 of them integer bits (sign bit included)

void datatype_optimized_pe(fixed_t input[1024], fixed_t output[1024]) {
    for (int i = 0; i < 1024; i++) {
        #pragma HLS PIPELINE II=1
        output[i] = input[i] * fixed_t(2.0);
    }
}
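
To get a feel for what a 16-bit format with 8 fractional bits does to values, the following plain C++ sketch emulates it with int16_t. It is a stand-in for experimentation and not the ap_fixed implementation: the helper names are invented, rounding is done by truncation toward minus infinity (mimicking AP_TRN, the default quantization mode of ap_fixed), and overflow is not handled.

```cpp
#include <cmath>
#include <cstdint>

const int FRAC_BITS = 8;  // 8 fractional bits, as in ap_fixed<16, 8>

// Quantize a float to the raw 16-bit representation (value * 2^8, rounded down).
int16_t to_q8_8(float x) {
    return static_cast<int16_t>(std::floor(x * (1 << FRAC_BITS)));
}

// Convert a raw value back to float.
float from_q8_8(int16_t raw) {
    return static_cast<float>(raw) / (1 << FRAC_BITS);
}

// Multiply two fixed-point values; the 32-bit product has 16 fractional
// bits, so it is divided by 2^8 (rounding down) to return to 8.
int16_t mul_q8_8(int16_t a, int16_t b) {
    int32_t wide = static_cast<int32_t>(a) * b;
    int32_t q = wide / (1 << FRAC_BITS);
    if (wide % (1 << FRAC_BITS) < 0) q -= 1;  // floor for negative products
    return static_cast<int16_t>(q);           // out-of-range results are not handled
}
```

In this format 1.5 is stored exactly as 384, while 0.1 becomes 25, that is 25/256 = 0.09765625. The resolution is 1/256, which is the price paid for the much cheaper arithmetic.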

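Practical application examples

1. CNN convolution accelerator

The convolution layers of a convolutional neural network are compute-intensive and can be accelerated with a PE array: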

#include "hls_stream.h"

#define MAX_WIDTH 1024  // assumed upper bound on the image width; adjust to the design

// Convolution PE
void conv_pe(
    hls::stream<float>& input_feature,
    const float weights[3][3],  // 3x3 convolution kernel
    hls::stream<float>& output_feature,
    int width, int height
) {
    // Line buffers that implement the sliding window
    float line_buffer[2][MAX_WIDTH];
    #pragma HLS ARRAY_PARTITION variable=line_buffer complete dim=1
    
    // Sliding window
    float window[3][3];
    #pragma HLS ARRAY_PARTITION variable=window complete dim=0
    
    // Process every pixel
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            #pragma HLS PIPELINE II=1
            
            // Read the new pixel
            float pixel = input_feature.read();
            
            // Update the line buffers and the sliding window
            // [detailed implementation omitted here]
            
            // Perform the convolution once the window is completely filled
            if (row >= 2 && col >= 2) {
                float sum = 0;
                for (int i = 0; i < 3; i++) {
                    for (int j = 0; j < 3; j++) {
                        #pragma HLS UNROLL
                        sum += window[i][j] * weights[i][j];
                    }
                }
                output_feature.write(sum);
            }
        }
    }
}
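
The listing leaves out the line buffer and window update. One common way to write that step is sketched here in plain C++ so that it can be run and checked on a PC. This is an assumption about how such an update is typically done, not the original author's implementation; conv3x3_stream and MAX_W are hypothetical names, and a std::vector replaces the streams.

```cpp
#include <vector>

const int MAX_W = 64;  // assumed upper bound on the image width for this sketch

// in: row-major image of size height x width.
// Returns the (height-2) x (width-2) valid outputs in row-major order.
std::vector<float> conv3x3_stream(const std::vector<float> &in,
                                  const float weights[3][3],
                                  int width, int height) {
    std::vector<float> out;
    if (width > MAX_W) return out;     // image too wide for the buffers

    float line_buffer[2][MAX_W] = {};  // [0]: row-2, [1]: row-1
    float window[3][3] = {};

    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            float pixel = in[row * width + col];

            // Shift the window one column to the left.
            for (int i = 0; i < 3; i++) {
                window[i][0] = window[i][1];
                window[i][1] = window[i][2];
            }
            // New right-hand column: two buffered rows plus the new pixel.
            window[0][2] = line_buffer[0][col];
            window[1][2] = line_buffer[1][col];
            window[2][2] = pixel;

            // Move this column of the line buffers up by one row.
            line_buffer[0][col] = line_buffer[1][col];
            line_buffer[1][col] = pixel;

            // The window holds rows row-2..row and columns col-2..col.
            if (row >= 2 && col >= 2) {
                float sum = 0;
                for (int i = 0; i < 3; i++)
                    for (int j = 0; j < 3; j++)
                        sum += window[i][j] * weights[i][j];
                out.push_back(sum);
            }
        }
    }
    return out;
}
```

For a 4x4 image holding the values 0 to 15 and an all-ones kernel, the four outputs are 45, 54, 81 and 90, the sums of the four 3x3 neighborhoods.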

2. Matrix multiplication accelerator

Matrix multiplication is at the core of many algorithms and can be implemented efficiently with PEs:

// A single PE for matrix multiplication
void matrix_mult_pe(
    float a_val,
    float b_val,
    float &c_val
) {
    c_val += a_val * b_val;
}

// Matrix multiplication accelerator
void matrix_mult_accelerator(
    float a[16][16],
    float b[16][16],
    float c[16][16]
) {
    #pragma HLS ARRAY_PARTITION variable=a complete dim=2
    #pragma HLS ARRAY_PARTITION variable=b complete dim=1
    
    // Initialize the result matrix
    for (int i = 0; i < 16; i++) {
        for (int j = 0; j < 16; j++) {
            #pragma HLS PIPELINE II=1
            c[i][j] = 0;
        }
    }
    
    // Matrix multiplication; with the innermost loop pipelined, the PE performs one MAC per cycle
    for (int k = 0; k < 16; k++) {
        for (int i = 0; i < 16; i++) {
            for (int j = 0; j < 16; j++) {
                #pragma HLS PIPELINE II=1
                matrix_mult_pe(a[i][k], b[k][j], c[i][j]);
            }
        }
    }
}

3. FFT processor

The fast Fourier transform (FFT) is a key algorithm in signal processing; its butterfly operation can be implemented as a PE:

#include <cmath>
#include <complex>

// Butterfly PE
template<typename T>
void butterfly_pe(
    std::complex<T> &a,
    std::complex<T> &b,
    const std::complex<T> &twiddle
) {
    #pragma HLS INLINE
    
    std::complex<T> temp = b * twiddle;
    b = a - temp;
    a = a + temp;
}

// One stage of a simplified 16-point radix-2 FFT (decimation in time;
// the input is expected in bit-reversed order)
template<typename T>
void fft_stage(
    std::complex<T> data[16],
    int stage
) {
    #pragma HLS INLINE off
    
    const int pairs = 8;            // every stage has 16/2 = 8 butterflies
    const int stride = 1 << stage;  // distance between the two inputs of a butterfly
    
    for (int i = 0; i < pairs; i++) {
        #pragma HLS UNROLL
        
        int idx1 = (i / stride) * 2 * stride + (i % stride);
        int idx2 = idx1 + stride;
        
        // Compute the twiddle factor
        float angle = -2.0f * M_PI * (i % stride) / (2 * stride);
        std::complex<T> twiddle(std::cos(angle), std::sin(angle));
        
        butterfly_pe(data[idx1], data[idx2], twiddle);
    }
}
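
A single stage only becomes a complete transform once all four stages are chained and the input is reordered. The following plain C++ sketch does this for 16 points, using the same index and twiddle formulas as the stage function with 8 butterflies in every stage. The names fft16 and bit_reverse4 are made up for this example, double precision is used for easier checking, and no HLS directives are included.

```cpp
#include <cmath>
#include <complex>
#include <utility>

typedef std::complex<double> cplx;

// Reverse the lowest 4 bits of x (index permutation for 16 points).
unsigned bit_reverse4(unsigned x) {
    unsigned r = 0;
    for (int b = 0; b < 4; b++) {
        r = (r << 1) | ((x >> b) & 1u);
    }
    return r;
}

// In-place 16-point radix-2 decimation-in-time FFT.
void fft16(cplx data[16]) {
    const double pi = std::acos(-1.0);

    // Decimation in time needs the input in bit-reversed order.
    for (unsigned i = 0; i < 16; i++) {
        unsigned j = bit_reverse4(i);
        if (i < j) std::swap(data[i], data[j]);
    }

    for (int stage = 0; stage < 4; stage++) {
        const int stride = 1 << stage;
        for (int i = 0; i < 8; i++) {  // 8 butterflies in every stage
            int idx1 = (i / stride) * 2 * stride + (i % stride);
            int idx2 = idx1 + stride;
            double angle = -2.0 * pi * (i % stride) / (2 * stride);
            cplx twiddle(std::cos(angle), std::sin(angle));

            // Butterfly
            cplx temp = data[idx2] * twiddle;
            data[idx2] = data[idx1] - temp;
            data[idx1] = data[idx1] + temp;
        }
    }
}
```

Two quick sanity checks: a unit impulse at index 0 transforms to sixteen ones, and the complex tone exp(2πi·3n/16) transforms to a single peak of height 16 in bin 3, up to rounding error.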

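Design challenges and best practices

1. Balancing resources

PE design has to balance performance against resource usage:

Number and complexity of PEs: more PEs raise the degree of parallelism but consume more resources
Shared resources: several PEs can share some resources (such as multipliers), but this may lower performance
Resource allocation: use resource directives to state explicitly which kind of resource implements an operation and how many instances may be used. Recent Vitis HLS releases use BIND_OP (and BIND_STORAGE for memories) for this purpose; they replace the older RESOURCE directive

void resource_optimized_pe(float input, float *output) {
    // Limit the number of floating-point multiplier instances
    #pragma HLS ALLOCATION operation instances=fmul limit=4
    
    float product = input * input;  // PE logic (illustrative placeholder)
    // Implement this multiplication with DSP blocks; the available
    // op/impl values depend on the tool version
    #pragma HLS BIND_OP variable=product op=fmul impl=fulldsp
    *output = product;
}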