
Deploying Inference with Paddle Inference (Part 5)

article 2025/3/10 23:50:28

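Part 5: The Paddle Inference (Python) API in detail

1. Config class definition

The Config class holds the configuration used to build a Predictor object, such as the model path and whether GPU inference is enabled.
Its constructors are defined as follows: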

# Config class definition, no arguments
class paddle.inference.Config()

# Config class definition, taking another Config object
class paddle.inference.Config(config: Config)

# Config class definition, taking the model file path and the parameter file path
class paddle.inference.Config(prog_file: str, params_file: str)

Example:

# Import the Paddle Inference library
import paddle.inference as paddle_infer

# Create the config
config = paddle_infer.Config("./mobilenet.pdmodel", "./mobilenet.pdiparams")

# Create the predictor from the config
predictor = paddle_infer.create_predictor(config)
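
The two-argument constructor only receives path strings, so it can help to confirm that both files exist before building the Config. The following is a minimal sketch, not part of the Paddle API: `resolve_model_files` is a hypothetical helper, and it assumes the model and parameter files share a base name with the `.pdmodel` and `.pdiparams` suffixes used in the example above.

```python
from pathlib import Path


def resolve_model_files(model_dir, stem):
    """Return (prog_file, params_file) as strings, or raise if either is missing.

    Hypothetical helper: assumes <stem>.pdmodel and <stem>.pdiparams live in model_dir.
    """
    prog_file = Path(model_dir) / f"{stem}.pdmodel"
    params_file = Path(model_dir) / f"{stem}.pdiparams"
    for path in (prog_file, params_file):
        if not path.is_file():
            raise FileNotFoundError(f"model file not found: {path}")
    return str(prog_file), str(params_file)


# Usage (requires Paddle Inference and real model files):
#   import paddle.inference as paddle_infer
#   config = paddle_infer.Config(*resolve_model_files(".", "mobilenet"))
#   predictor = paddle_infer.create_predictor(config)
```

Failing early with a clear message is a design choice here; the helper does nothing the constructor itself requires.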

2. Setting the inference model

2.1. Loading the inference model from files

The API is defined as follows:

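# Set the model file paths; use this when the model is loaded from disk
# Args: prog_file_path - path to the model file
#       params_file_path - path to the parameter file
# Returns: None
paddle.inference.Config.set_model(prog_file_path: str, params_file_path: str)

# Set the model file path
# Args: x - path to the model file
# Returns: None
paddle.inference.Config.set_prog_file(x: str)

# Set the parameter file path
# Args: x - path to the parameter file
# Returns: None
paddle.inference.Config.set_params_file(x: str)

# Get the model file path
# Args: None
# Returns: str - path to the model file
paddle.inference.Config.prog_file()

# Get the parameter file path
# Args: None
# Returns: str - path to the parameter file
paddle.inference.Config.params_file()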

Example:

# Import the Paddle Inference library
import paddle.inference as paddle_infer

# Create the config
config = paddle_infer.Config()

# Set the model file path and the parameter file path through the API
config.set_prog_file("./mobilenet_v2.pdmodel")
config.set_params_file("./mobilenet_v2.pdiparams")

# Read the model file and parameter file paths back from the config
print(config.prog_file())
print(config.params_file())

# Create the predictor from the config
predictor = paddle_infer.create_predictor(config)
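
The example above uses the two separate setters. The definition list also includes set_model, which takes both paths in one call; a minimal sketch of that variant follows. `point_config_at` is a hypothetical helper, and `config` is assumed to be a paddle.inference.Config (any object exposing the same three methods behaves the same way here).

```python
def point_config_at(config, prog_file_path, params_file_path):
    """Set both paths in one call and read them back through the getters."""
    config.set_model(prog_file_path, params_file_path)
    return config.prog_file(), config.params_file()


# Usage (requires Paddle Inference):
#   import paddle.inference as paddle_infer
#   config = paddle_infer.Config()
#   print(point_config_at(config, "./mobilenet_v2.pdmodel", "./mobilenet_v2.pdiparams"))
```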

2.2. Loading the inference model from memory

The API is defined as follows:

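# Load the model from memory
# Args: prog_buffer - model structure data held in memory
#       prog_buffer_size - size of the model structure data
#       params_buffer - model parameter data held in memory
#       params_buffer_size - size of the model parameter data
# Returns: None
paddle.inference.Config.set_model_buffer(prog_buffer: str, prog_buffer_size: int,
                                         params_buffer: str, params_buffer_size: int)

# Check whether the model is loaded from memory
# Args: None
# Returns: bool - whether the model is loaded from memory
paddle.inference.Config.model_from_memory()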

Example:

# Import the Paddle Inference library
import paddle.inference as paddle_infer

# Create the config
config = paddle_infer.Config()

# Read the model files into memory
with open('./mobilenet_v2.pdmodel', 'rb') as prog_file:
    prog_data = prog_file.read()

with open('./mobilenet_v2.pdiparams', 'rb') as params_file:
    params_data = params_file.read()

# Load the model from memory
config.set_model_buffer(prog_data, len(prog_data), params_data, len(params_data))

# Read model_from_memory back from the config - True
print(config.model_from_memory())

# Create the predictor from the config
predictor = paddle_infer.create_predictor(config)
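
The two size arguments are the byte lengths of the buffers, which is why the example passes len(...) of data read in binary mode. The sketch below wraps the read-and-set steps in one function; `load_into_memory` is a hypothetical name, and `config` is assumed to be a paddle.inference.Config.

```python
def load_into_memory(config, prog_path, params_path):
    """Read both files as bytes and hand them to config.set_model_buffer."""
    with open(prog_path, "rb") as f:
        prog_data = f.read()
    with open(params_path, "rb") as f:
        params_data = f.read()
    # The size arguments are the byte lengths of the buffers just read.
    config.set_model_buffer(prog_data, len(prog_data), params_data, len(params_data))
    return len(prog_data), len(params_data)


# Usage (requires Paddle Inference):
#   import paddle.inference as paddle_infer
#   config = paddle_infer.Config()
#   load_into_memory(config, "./mobilenet_v2.pdmodel", "./mobilenet_v2.pdiparams")
```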

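3. Running inference on the CPU

Notes:
1. If the CPU model allows it, prefer a build with AVX and MKL when downloading or compiling the inference library.
2. You can try Intel's MKLDNN to accelerate CPU inference; MKLDNN is not enabled by default on the CPU.
3. When enough CPU cores are available, you can raise the thread count with set_cpu_math_library_num_threads; the default is 1 thread.

3.1. CPU settings

The API is defined as follows: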

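# Set the number of compute threads used by the CPU math library
# Args: cpu_math_library_num_threads - number of compute threads for the CPU math library
# Returns: None
paddle.inference.Config.set_cpu_math_library_num_threads(cpu_math_library_num_threads: int)

# Get the number of compute threads used by the CPU math library
# Args: None
# Returns: int - number of compute threads for the CPU math library
paddle.inference.Config.cpu_math_library_num_threads()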

Example:

# Import the Paddle Inference library
import paddle.inference as paddle_infer

# Create the config
config = paddle_infer.Config()

# Set the CPU math library thread count to 10
config.set_cpu_math_library_num_threads(10)

# Read the CPU math library thread count back from the config - 10
print(config.cpu_math_library_num_threads())
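
Note 3 above suggests raising the thread count when cores are available. One way to pick a value, as a sketch, is to take the core count reported by os.cpu_count() and leave one core free. `pick_cpu_threads` and its `reserve` parameter are illustrative and not part of Paddle; the best value depends on the workload and on what else runs on the machine.

```python
import os


def pick_cpu_threads(reserve=1, cores=None):
    """Return a thread count of at least 1, leaving `reserve` cores free."""
    if cores is None:
        cores = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, cores - reserve)


# Usage (requires Paddle Inference):
#   config.set_cpu_math_library_num_threads(pick_cpu_threads())
```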

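3.2. MKLDNN settings

Notes:
1. MKLDNN only takes effect when inference runs on the CPU; enabling it otherwise has no effect.
2. MKLDNN BF16 requires a CPU that supports AVX512; otherwise it cannot be enabled.
3. For set_mkldnn_cache_capacity, see the MKLDNN cache design document.

The API is defined as follows: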

# Enable MKLDNN to accelerate inference
# Args: None
# Returns: None
paddle.inference.Config.enable_mkldnn()

# Check whether MKLDNN is enabled
# Args: None
# Returns: bool - whether MKLDNN is enabled
paddle.inference.Config.mkldnn_enabled()

# Set the capacity of the MKLDNN cache kept for different input shapes
# Args: capacity - cache capacity
# Returns: None
paddle.inference.Config.set_mkldnn_cache_capacity(capacity: int=0)

# Specify the set of OPs accelerated with MKLDNN
# Args: op_list - set of OPs accelerated with MKLDNN
# Returns: None
paddle.inference.Config.set_mkldnn_op(op_list: Set[str])

# Enable MKLDNN BFLOAT16
# Args: None
# Returns: None
paddle.inference.Config.enable_mkldnn_bfloat16()

# Specify the set of OPs accelerated with MKLDNN BFLOAT16
# Args: op_list - set of OPs accelerated with MKLDNN BFLOAT16
# Returns: None
paddle.inference.Config.set_bfloat16_op(op_list: Set[str])

# Enable MKLDNN INT8
# Args: op_list - set of OPs accelerated with MKLDNN INT8
# Returns: None
paddle.inference.Config.enable_mkldnn_int8(op_list: Set[str])

Code example (1): inference with MKLDNN

# Import the Paddle Inference library
import paddle.inference as paddle_infer

# Create the config
config = paddle_infer.Config("./mobilenet.pdmodel", "./mobilenet.pdiparams")

# Enable MKLDNN for inference
config.enable_mkldnn()

# Check through the API whether MKLDNN is enabled - True
print(config.mkldnn_enabled())

# Set the MKLDNN cache capacity
config.set_mkldnn_cache_capacity(1)

# Set the list of OPs accelerated with MKLDNN
config.set_mkldnn_op({"softmax", "elementwise_add", "relu"})
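
Note 1 says MKLDNN only takes effect for CPU inference. The sketch below makes that precondition explicit by checking config.use_gpu() before enabling anything. `enable_mkldnn_on_cpu` is a hypothetical helper built only from Config methods listed in this article, and `config` is assumed to be a paddle.inference.Config.

```python
def enable_mkldnn_on_cpu(config, op_list=None, cache_capacity=None):
    """Enable MKLDNN only when the config targets the CPU; return the enabled state."""
    if config.use_gpu():
        # MKLDNN has no effect when GPU inference is enabled, so skip it.
        return False
    config.enable_mkldnn()
    if cache_capacity is not None:
        config.set_mkldnn_cache_capacity(cache_capacity)
    if op_list:
        config.set_mkldnn_op(set(op_list))
    return config.mkldnn_enabled()


# Usage (requires Paddle Inference):
#   enable_mkldnn_on_cpu(config, op_list=["softmax", "relu"], cache_capacity=1)
```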

Code example (2): inference with MKLDNN BFLOAT16

# Import the Paddle Inference library
import paddle.inference as paddle_infer

# Create the config
config = paddle_infer.Config("./mobilenet.pdmodel", "./mobilenet.pdiparams")

# Enable MKLDNN for inference
config.enable_mkldnn()

# Enable MKLDNN BFLOAT16 for inference
config.enable_mkldnn_bfloat16()

# Set the list of OPs that run with MKLDNN BFLOAT16
config.set_bfloat16_op({"conv2d"})
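
Note 2 says MKLDNN BF16 needs a CPU with AVX512. On Linux, one way to check before calling enable_mkldnn_bfloat16 is to look for the avx512f flag in /proc/cpuinfo. This is a sketch under that assumption: it is Linux-only, it tests only the AVX512 foundation flag (Paddle's own requirement may be stricter), and `has_avx512` and `read_cpuinfo` are hypothetical helpers.

```python
def has_avx512(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo text lists avx512f."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx512f" in value.split():
            return True
    return False


def read_cpuinfo(path="/proc/cpuinfo"):
    """Return the file contents, or an empty string where the file cannot be read."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


# Usage (requires Paddle Inference):
#   if has_avx512(read_cpuinfo()):
#       config.enable_mkldnn_bfloat16()
```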

Code example (3): inference with MKLDNN INT8

# Import the Paddle Inference library
import paddle.inference as paddle_infer

# Create the config
config = paddle_infer.Config("./mobilenet.pdmodel", "./mobilenet.pdiparams")

# Enable MKLDNN for inference
config.enable_mkldnn()

# Enable MKLDNN INT8 for inference
config.enable_mkldnn_int8()

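4. Running inference on the GPU

Notes:
1. Config uses the CPU by default; call enable_use_gpu to turn on GPU inference.
2. You can try enabling CUDNN and TensorRT to accelerate GPU inference.

4.1. GPU settings

The API is defined as follows: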

# Enable GPU inference
# Args: memory_pool_init_size_mb - GPU memory allocated at initialization, in MB
#       device_id - device id
#       precision_mode - inference precision, PrecisionType.Float32 by default
# Returns: None
paddle.inference.Config.enable_use_gpu(memory_pool_init_size_mb: int, device_id: int, precision_mode: PrecisionType)

# Disable GPU inference
# Args: None
# Returns: None
paddle.inference.Config.disable_gpu()

# Check whether the GPU is enabled
# Args: None
# Returns: bool - whether the GPU is enabled
paddle.inference.Config.use_gpu()

# Get the GPU device id
# Args: None
# Returns: int - the GPU device id
paddle.inference.Config.gpu_device_id()

# Get the initial GPU memory pool size
# Args: None
# Returns: int - the initial GPU memory pool size
paddle.inference.Config.memory_pool_init_size_mb()

# Fraction of total GPU memory taken by the initial pool
# Args: None
# Returns: float - fraction of total GPU memory taken by the initial pool
paddle.inference.Config.fraction_of_gpu_memory_for_pool()

# In low-precision (float16) mode, feed/fetch low-precision data directly at inference time
# Args: x - whether to feed/fetch low-precision data
# Returns: None
paddle.inference.Config.enable_low_precision_io(x : bool)

GPU settings code example:

# Import the Paddle Inference library
import paddle.inference as paddle_infer

# Create the config
config = paddle_infer.Config("./mobilenet.pdmodel", "./mobilenet.pdiparams")

# Enable GPU inference - initial GPU memory pool of 100 MB, device id 0
config.enable_use_gpu(100, 0)
# Read the GPU information back through the API
print("Use GPU is: {}".format(config.use_gpu())) # True
print("Init mem size is: {}".format(config.memory_pool_init_size_mb())) # 100
print("Init mem frac is: {}".format(config.fraction_of_gpu_memory_for_pool())) # e.g. 0.003, depends on total GPU memory
print("GPU device id is: {}".format(config.gpu_device_id())) # 0

# Disable GPU inference
config.disable_gpu()
# Read the GPU information back through the API
print("Use GPU is: {}".format(config.use_gpu())) # False
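
fraction_of_gpu_memory_for_pool() is described as the initial pool size relative to the device's total memory, so the printed value varies by GPU. As a worked example, assuming a card with 32 GiB of memory (an assumption for illustration; the article does not state which device produced its output), a 100 MB pool works out to roughly 0.003:

```python
def pool_fraction(init_size_mb, total_memory_mb):
    """Fraction of total GPU memory taken by the initial pool."""
    return init_size_mb / total_memory_mb


total_mb = 32 * 1024  # assumed 32 GiB device, expressed in MB
print(round(pool_fraction(100, total_mb), 3))  # 0.003
```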

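4.2. TensorRT settings

Notes:
1. TensorRT only takes effect when the GPU is already enabled; enabling it otherwise has no effect.
2. Models that carry LoD information, such as the NLP models BERT and ERNIE, must use dynamic shape.
3. Enabling TensorRT OSS adds support for more plugins; see TensorRT OSS for details.

For more on TensorRT, see "Inference with the Paddle-TensorRT library".

The API is defined as follows: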

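# Enable TensorRT to accelerate inference
# Args: workspace_size     - size of the workspace TensorRT uses for kernel selection while
#                            building the network; it does not affect runtime GPU memory
#                            usage. Too small a value may miss the best kernel; too large a
#                            value increases GPU memory use during initialization. Tune it
#                            for your case; 256 MB is the suggested value
#       max_batch_size     - maximum batch size; the runtime batch size must not exceed it
#       min_subgraph_size  - Paddle runs TensorRT on subgraphs; to avoid performance loss,
#                            TensorRT is used only when a subgraph has more than
#                            min_subgraph_size nodes
#       precision_mode     - TensorRT precision: FP32 (PrecisionType.Float32),
#                            FP16 (PrecisionType.Half) or INT8 (PrecisionType.Int8)
#       use_static         - if True, the TensorRT optimization info is serialized to disk
#                            when the Predictor is destroyed at the end of the first run;
#                            later runs load it instead of regenerating it, which shortens
#                            startup (requires the same hardware and TensorRT version)
#       use_calib_mode     - set to True to run TensorRT INT8 offline quantization calibration
# Returns: None
paddle.inference.Config.enable_tensorrt_engine(workspace_size: int = 1 << 20,
                                               max_batch_size: int,
                                               min_subgraph_size: int,
                                               precision_mode: PrecisionType,
                                               use_static: bool,
                                               use_calib_mode: bool)

# Check whether TensorRT is enabled
# Args: None
# Returns: bool - whether TensorRT is enabled
paddle.inference.Config.tensorrt_engine_enabled()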

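# Enable TensorRT GPU memory optimization
# Args: engine_memory_sharing - whether to turn on TensorRT memory optimization. The
#                               optimization is off by default. When it is on and one model
#                               (predictor) contains several TensorRT engines (subgraphs),
#                               they share their runtime context memory
#       sharing_identifier    - if several models (predictors) are guaranteed to run
#                               serially, this parameter lets their subgraphs share GPU
#                               memory: give the configs of the participating predictors
#                               the same value greater than 0
# Returns: None
paddle.inference.Config.enable_tensorrt_memory_optim(engine_memory_sharing: bool = True,
                                                     sharing_identifier: int = 0)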

# Set the TensorRT dynamic shapes
# Args: min_input_shape          - minimum shapes the TensorRT subgraph supports; at inference
#                                  time no dimension of an input shape may be smaller than this
#       max_input_shape          - maximum shapes the TensorRT subgraph supports; at inference
#                                  time no dimension of an input shape may be larger than this
#       optim_input_shape        - optimal shapes for the TensorRT subgraph; during kernel
#                                  selection at initialization, TensorRT chooses kernels by
#                                  their performance at these shapes
#       disable_trt_plugin_fp16  - keep TensorRT plugins from running in FP16 precision
# Returns: None
paddle.inference.Config.set_trt_dynamic_shape_info(min_input_shape: Dict[str, List[int]] = {},
                                                   max_input_shape: Dict[str, List[int]] = {},
                                                   optim_input_shape: Dict[str, List[int]] = {},
                                                   disable_trt_plugin_fp16: bool = False)

# Automatic derivation of TensorRT dynamic shapes
# Args: shape_range_info_path  - path of the file holding the collected shape information;
#                                when empty, shapes are collected at run time
#       allow_build_at_runtime - whether to allow rebuilding the TensorRT engine at run time.
#                                When True, an input shape outside the tuned range triggers
#                                a rebuild; when False, it causes an inference error
# Returns: None
paddle.inference.Config.enable_tuned_tensorrt_dynamic_shape(
                                     shape_range_info_path: str = "",
                                     allow_build_at_runtime: bool = True)

# Enable TensorRT OSS to accelerate ERNIE / BERT inference (background: https://github.com/PaddlePaddle/Paddle-Inference-Demo/tree/master/c%2B%2B/ernie-varlen )
# Args: None
# Returns: None
paddle.inference.Config.enable_tensorrt_oss()

# Check whether TensorRT OSS is enabled
# Args: None
# Returns: bool - whether TensorRT OSS is enabled
paddle.inference.Config.tensorrt_oss_enabled()

# Enable TensorRT DLA to accelerate inference
# Args: dla_core - id of the DLA device: 0, 1, ..., number of DLA devices - 1
# Returns: None
paddle.inference.Config.enable_tensorrt_dla(dla_core: int = 0)

# Check whether TensorRT DLA acceleration is enabled
# Args: None
# Returns: bool - whether TensorRT DLA acceleration is enabled
paddle.inference.Config.tensorrt_dla_enabled()
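
enable_tuned_tensorrt_dynamic_shape, listed above, reads shape ranges from a file, so a common pattern is a two-pass workflow: a first run records the shapes, and later runs reuse them. The sketch below shows that pattern. It relies on collect_shape_range_info, a Config method that this article does not list, so treat its use here as an assumption to verify against the installed Paddle version; `configure_tuned_shapes` is a hypothetical helper.

```python
import os


def configure_tuned_shapes(config, shape_range_info_path, allow_build_at_runtime=True):
    """First run: collect shape ranges. Later runs: reuse them for TensorRT.

    Returns "collect" or "tuned" to indicate which branch was taken.
    """
    if not os.path.exists(shape_range_info_path):
        # Assumption: collect_shape_range_info records the shapes seen at run
        # time into the given file; it is not part of the list in this article.
        config.collect_shape_range_info(shape_range_info_path)
        return "collect"
    config.enable_tuned_tensorrt_dynamic_shape(shape_range_info_path,
                                               allow_build_at_runtime)
    return "tuned"
```

With this split, the first run only gathers shapes from representative inputs; TensorRT is then enabled with the tuned ranges on the following runs.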

Code example (1): inference with TensorRT FP32 / FP16 / INT8

# Import the Paddle Inference library
import paddle.inference as paddle_infer

# Create the config
config = paddle_infer.Config("./mobilenet.pdmodel", "./mobilenet.pdiparams")

# Enable GPU inference - initial GPU memory pool of 100 MB, device id 0
config.enable_use_gpu(100, 0)

# Enable TensorRT to accelerate inference - FP32
config.enable_tensorrt_engine(workspace_size = 1 << 28,
                              max_batch_size = 1,
                              min_subgraph_size = 3,
                              precision_mode = paddle_infer.PrecisionType.Float32,
                              use_static = False, use_calib_mode = False)
# Check through the API whether TensorRT is enabled - True
print("Enable TensorRT is: {}".format(config.tensorrt_engine_enabled()))


# Enable TensorRT to accelerate inference - FP16
config.enable_tensorrt_engine(workspace_size = 1 << 28,
                              max_batch_size = 1,
                              min_subgraph_size = 3,
                              precision_mode = paddle_infer.PrecisionType.Half,
                              use_static = False, use_calib_mode = False)
# Check through the API whether TensorRT is enabled - True
print("Enable TensorRT is: {}".format(config.tensorrt_engine_enabled()))

# Enable TensorRT to accelerate inference - Int8
config.enable_tensorrt_engine(workspace_size = 1 << 28,
                              max_batch_size = 1,
                              min_subgraph_size = 3,
                              precision_mode = paddle_infer.PrecisionType.Int8,
                              use_static = False, use_calib_mode = False)

# Turn on TensorRT GPU memory optimization
config.enable_tensorrt_memory_optim()

# Check through the API whether TensorRT is enabled - True
print("Enable TensorRT is: {}".format(config.tensorrt_engine_enabled()))
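
Two details from the parameter notes can be made concrete. The workspace size is passed in bytes, so the suggested 256 MB is 256 << 20, the same value as the 1 << 28 used above. And because TensorRT only takes effect once the GPU is enabled, the call can be guarded with config.use_gpu(). The sketch below does both; `enable_trt_if_gpu` and its `workspace_mb` parameter are hypothetical, and `precision_mode` is expected to be a paddle_infer.PrecisionType value.

```python
def enable_trt_if_gpu(config, precision_mode, workspace_mb=256,
                      max_batch_size=1, min_subgraph_size=3):
    """Enable TensorRT only when GPU inference is already on; return the enabled state."""
    if not config.use_gpu():
        # Without the GPU enabled, turning on TensorRT has no effect.
        return False
    config.enable_tensorrt_engine(workspace_size=workspace_mb << 20,  # MB -> bytes
                                  max_batch_size=max_batch_size,
                                  min_subgraph_size=min_subgraph_size,
                                  precision_mode=precision_mode,
                                  use_static=False,
                                  use_calib_mode=False)
    return config.tensorrt_engine_enabled()


assert 256 << 20 == 1 << 28  # 256 MB expressed in bytes

# Usage (requires Paddle Inference and a GPU build):
#   config.enable_use_gpu(100, 0)
#   enable_trt_if_gpu(config, paddle_infer.PrecisionType.Half)
```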

Code example (2): inference with TensorRT dynamic shape

# Import the Paddle Inference library
import paddle.inference as paddle_infer

# Create the config
config = paddle_infer.Config("./mobilenet.pdmodel", "./mobilenet.pdiparams")

# Enable GPU inference - initial GPU memory pool of 100 MB, device id 0
config.enable_use_gpu(100, 0)

# Enable TensorRT to accelerate inference - Int8
config.enable_tensorrt_engine(workspace_size = 1 << 29,
                              max_batch_size = 1,
                              min_subgraph_size = 1,
                              precision_mode=paddle_infer.PrecisionType.Int8,
                              use_static = False, use_calib_mode = True)

# Turn on TensorRT GPU memory optimization
config.enable_tensorrt_memory_optim()

# Set the TensorRT dynamic shapes
config.set_trt_dynamic_shape_info(min_input_shape={"image": [1, 1, 3, 3]},
                                  max_input_shape={"image": [1, 1, 10, 10]},
                                  optim_input_shape={"image": [1, 1, 3, 3]})
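
The three shape dictionaries have to agree with each other: they name the same inputs, and for every dimension the minimum must not exceed the optimal value, which must not exceed the maximum. A mistake here typically surfaces only when the engine is built, so a quick check before calling set_trt_dynamic_shape_info can save time. The following is a sketch in plain Python; `check_dynamic_shapes` and `shape_in_range` are hypothetical helpers, not Paddle functions.

```python
def check_dynamic_shapes(min_shape, max_shape, opt_shape):
    """Raise ValueError unless min <= opt <= max holds per dimension for every input."""
    if not (min_shape.keys() == max_shape.keys() == opt_shape.keys()):
        raise ValueError("all three dicts must name the same inputs")
    for name in min_shape:
        lo, hi, opt = min_shape[name], max_shape[name], opt_shape[name]
        if not (len(lo) == len(hi) == len(opt)):
            raise ValueError(f"{name}: shapes must have the same rank")
        for i, (a, b, c) in enumerate(zip(lo, opt, hi)):
            if not (a <= b <= c):
                raise ValueError(f"{name}: dim {i} violates min <= opt <= max")


def shape_in_range(shape, lo, hi):
    """True if a runtime input shape lies within [lo, hi] in every dimension."""
    return len(shape) == len(lo) == len(hi) and all(
        a <= s <= b for s, a, b in zip(shape, lo, hi))


# The shapes from the example above pass the check.
check_dynamic_shapes({"image": [1, 1, 3, 3]},
                     {"image": [1, 1, 10, 10]},
                     {"image": [1, 1, 3, 3]})
```

`shape_in_range` mirrors the runtime rule from the parameter description: an input whose shape falls outside the configured minimum and maximum is not supported by the TensorRT subgraph.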

Code example (3): inference with TensorRT OSS

# Import the Paddle Inference library
import paddle.inference as paddle_infer

# Create the config
config = paddle_infer.Config("./ernie.pdmodel", "./ernie.pdiparams")

# Enable GPU inference - initial GPU memory pool of 100 MB, device id 0
config.enable_use_gpu(100, 0)

# Enable TensorRT to accelerate inference
config.enable_tensorrt_engine()

# Enable TensorRT OSS to accelerate inference
config.enable_tensorrt_oss()

# Check through the API whether TensorRT OSS is enabled - True
print("Enable TensorRT OSS is: {}".format(config.tensorrt_oss_enabled()))
