当前位置：首页 > article >正文

YOLOv8源码修改（4）- 实现YOLOv8模型剪枝（任意YOLO模型的简单剪枝）

article 2025/3/15 2:40:25

前言

1. 需修改的源码文件

1.1添加C2f_v2模块

1.2 修改模型读取方式

1.3 增加 L1 正则约束化训练

1.4 在tensorboard上增加BN层权重和偏置参数分布的可视化

1.5 增加剪枝处理文件

2. 工程目录结构

3. 源码文件修改

3.1 添加C2f_v2模块和模型读取

3.2 添加L1正则约束训练和现实约束的BN层参数分布

3.3 添加剪枝方法文件yolov8_prune.py

4. 模型训练与测试结果对比

4.1 模型结构比较

4.2 模型稀疏性对比

4.3 模型性能分析

4.4 进一步改进训练

5. 总结

前言

博主，博主，YOLO剪枝相关的实现，虽然很多，但还是太吃操作了，有没有更加简单、易上手的方法？有的兄弟，有的。

本文基于YOLOv8/YOLOv11官方源码修改，以最小改动、不需要了解剪枝算法为前提，可以实现对任意YOLO模型的剪枝、训练和微调。

本文使用的YOLO框架源码：版本为8.3.67的 ultralytics。

本文使用的剪枝框架源码：版本为1.5.1的 Torch-Pruning。

1. 需修改的源码文件

1.1添加C2f_v2模块

ultralytics/nn/modules/block.py

ultralytics/nn/modules/__init__.py

ultralytics/nn/tasks.py

1.2 修改模型读取方式

ultralytics/engine/model.py

ultralytics/cfg/default.yaml

1.3 增加 L1 正则约束化训练

ultralytics/engine/trainer.py

ultralytics/cfg/default.yaml

1.4 在tensorboard上增加BN层权重和偏置参数分布的可视化

ultralytics/engine/trainer.py

ultralytics/utils/callbacks/tensorboard.py

1.5 增加剪枝处理文件

yolov8_prune.py

2. 工程目录结构

YOLO的框架目录结构和训练启动可以参考：YOLOv8-pose（1）- 关键点检测数据集格式详解+快速训练+预测结果详解

本文修改和训练和上述文章大同小异，工程目录结构中，涉及修改的文件如下（淡蓝色）：

3. 源码文件修改

代码添加在哪，看截图中的行数。

3.1 添加C2f_v2模块和模型读取

YOLOv8模型剪枝问题主要在C2f模块，剪枝时会存在BUG：Hey Guys! I think I found the bug. When a concat module & a split module are directly connected, the index mapping system fails to compute correct idxs. I'm going to rewrite the concat & split tracing. Really thanks for this issue!

C2f_v2模块源码取自：YOLOv8 Pruning。torch-pruning官方给出的 yolov8_pruning.py 是针对 ultralytics-8.0.x版本的，在8.3.67上运行会报错（数据设备加载不一致、某些类没有初始化、自定义模型加载失败等）。这导致训练起来麻烦，所以本文就直接在8.3.67的源码上修改了，用官方的训练代码。

# C2f_v2模块，用于替换C2f模块
class C2f_v2(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv0 = Conv(c1, self.c, 1, 1)
        self.cv1 = Conv(c1, self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        # y = list(self.cv1(x).chunk(2, 1))
        y = [self.cv0(x), self.cv1(x)]
        y.extend(m(y[-1]) for m in self.m)

        return self.cv2(torch.cat(y, 1))

在ultralytics/nn/modules/block.py两处:

在ultralytics/nn/modules/__init__.py添加两处：

在ultralytics/nn/tasks.py三处:

模型加载修改：

在 model.train() 模式下，源码会读取 model.model.yaml 中保存的结构配置文件来生成模型结构，并把加载的模型参数赋值给这个结构，所以如果不修改，每次加载模型（即使是剪枝模型），都会加载原始的YOLO的模型。

而 torch.load() 和 torch.save() 本身就可以读取和保存模型结构和参数，model.load() 和 model.save() 就是对这两个方法的封装，已经可以读取和保存剪枝模型的结构和参数了。

在ultralytics/cfg/default.yaml中，添加参数 custom_model:

用于控制是使用自己定义的读取方式（True）还是官方的读取方式（False）。

网上没找到能在8.3.67版本中有效读取的方法，自己暂时像下面这样实现了一下。思路：

self.trainer 利用回调函数实现智能加载，实现各种参数的初始化，模型用 model.model.yaml 中的参数进行初始化，而在前面已经用 model.load(model) 加载过原始结构（剪枝的）的模型了，所以这里就用原始结构的模型替换新生成的。

同时YOLOv8中 DFL 层是用来处理三个不同分辨率检测头输出的，所以不需要进行梯度回传。这部分后续在 RKNN 上部署也会涉及。

增加的代码：

# ultralytics/engine/model.py 中实现自定义结构模型加载  
          if not args.get("custom_model", None):
                self.model = self.trainer.model
            else:
                print(f"\033[1;32mINFO\033[0m: custom_model is True, load custom model. ")
                for name, param in self.model.named_parameters():
                    if name == "model.22.dfl.conv.weight":
                        param.requires_grad = False  # 冻结
                    else:
                        param.requires_grad = True  # 解冻其他层
                self.trainer.model.model = self.model.model

3.2 添加L1正则约束训练和现实约束的BN层参数分布

还是在ultralytics/cfg/default.yaml中，有参数：

add_L1就是控制是（True）否（False）引入L1-norm，L1_start是起始的的L1权重，最终的是L1_start * L1_end（L1_end需要小于1，否则梯度会出错），这里参考了学习率的实现（lr0和lrf）。

add_L1: False           # (bool) use L1 norm
L1_start: 0             # (float) L1 regularization weight
L1_end: 0.1             # (float) final L1 regularization weight (L1_start * L1_end), default: 0.1

L1会使权重稀疏（服从拉普拉斯分布），L2使权重在0周围平滑（服从高斯分布），因为L1是直接对变量 x 求导（x非0时，恒为1），要使梯度减小，只能让x为0。L2是对变量的平方x^2求导（为2x），要让梯度变小，只需要x减小。

加入L2正则在优化器中实现，直接修改default.yaml中的weight_decay参数即可：

weight_decay: 0.0005 # (float) optimizer weight decay 5e-4

在ultralytics/engine/trainer.py中添加L1正则相关参数：

开启L1正则时，关闭混合精度训练（看网上都关了，我自己没关貌似也没影响），注意self.amp的判断逻辑，add_L1为True，其就为False。

在梯度回传后，优化之前，增加BN层权重和偏置的L1正则：

第一个特征提取层不添加，因为第一层作为最基本的特征提取层，后续也不进行剪枝。

上述两段代码：

        # ============================== add parameter: L1 norm  =======================================================
        self.add_L1 = self.args.add_L1 if self.args.add_L1 is not None else False
        self.L1_start = self.args.L1_start if self.args.L1_start is not None else 0.0001
        self.L1_end = self.args.L1_end if self.args.L1_end is not None else 0.1
        # ============================== add parameter: L1 norm  =======================================================

        # Check AMP
        # self.amp = torch.tensor(self.args.amp).to(self.device)  # True or False
        self.amp = torch.tensor(self.args.amp and not self.add_L1).to(self.device)  # True or False


                # ============================== add parameter: L1 regularization constraint  ==========================
                if self.add_L1:
                    l1_coefficient = self.L1_start * (1 - (1 - self.L1_end) * epoch / self.epochs)
                    for modules_name, modules_layer in self.model.named_modules():
                        if isinstance(modules_layer, nn.BatchNorm2d):
                            # 第一个特征提取层，不进行约束
                            if "model.0." in modules_name:
                                continue

                            modules_layer.weight.grad.data.add_(l1_coefficient * torch.sign(modules_layer.weight.data))
                            modules_layer.bias.grad.data.add_(l1_coefficient * torch.sign(modules_layer.bias.data))
                # ============================== add parameter: L1 regularization constraint  ==========================

在ultralytics/engine/trainer.py中添加用tensorboard查看BN层参数分布的回调函数：

在每轮训练结束前，利用回调函数 on_show_bn_weights_and_bias，计算模型中BN层的权重分布。

代码如下：

            # ============================== show bn weights and bias distribution in tensorboard ======================
            if self.add_L1:
                module_list_weight = []
                module_list_bias = []
                for modules_name, modules_layer in self.model.named_modules():
                    if isinstance(modules_layer, nn.BatchNorm2d):
                        module_list_weight.append(modules_layer.state_dict()['weight'])     # 获取BN层的gamma参数（权重）
                        module_list_bias.append(modules_layer.state_dict()['bias'])         # 获取 BN 层的 beta（偏置）

                # 获取所有 BN 层的通道数
                size_list = [idx.data.shape[0] for idx in module_list_weight]

                # 预分配 Tensor 存放 BN 权重和偏置
                self.bn_weights = torch.zeros(sum(size_list))
                self.bn_biases = torch.zeros(sum(size_list))

                index = 0
                for idx, size in enumerate(size_list):
                    self.bn_weights[index:(index + size)] = module_list_weight[idx].data.abs().clone()  # 取绝对值
                    self.bn_biases[index:(index + size)] = module_list_bias[idx].data.abs().clone()
                    index += size

                # 合并 BN 权重和偏置
                self.bn_params = torch.cat([self.bn_weights, self.bn_biases], dim=0)
                self.run_callbacks("on_show_bn_weights_and_bias")    # 触发回调
            # ============================== show bn weights and bias distribution in tensorboard ======================

在ultralytics/utils/callbacks/tensorboard.py添加回调函数：

这里需要注意numpy和protobuf的版本问题，过高会导致调用add_histogram方法失败。

代码如下：

# ============================== show bn weights and bias distribution in tensorboard ==================================
def on_show_bn_weights_and_bias(trainer):
    if WRITER:
        """
        numpy>=1.18.5,<1.24.0
        protobuf<4.21.3
        """
        show_error = True
        try:
            WRITER.add_histogram('BN/weights', trainer.bn_weights.numpy(), trainer.epoch, bins='doane')
            WRITER.add_histogram('BN/biases', trainer.bn_biases.numpy(), trainer.epoch, bins='doane')
            bn_params = torch.cat([trainer.bn_weights, trainer.bn_biases]).numpy()  # 记录合并参数 weight 和 bias
            WRITER.add_histogram('BN/params', bn_params, trainer.epoch, bins='doane')
        except Exception as e:
            if show_error:
                print(f"\033[1;31mERROR\033[0m: plot weight and bias distribution failed. Cause by: {e}")
            show_error = False
# ============================== show bn weights and bias distribution in tensorboard ==================================


callbacks = (
    {
        "on_pretrain_routine_start": on_pretrain_routine_start,
        "on_train_start": on_train_start,
        "on_fit_epoch_end": on_fit_epoch_end,
        "on_train_epoch_end": on_train_epoch_end,
        # ============================== show bn weights and bias distribution in tensorboard ==========================
        "on_show_bn_weights_and_bias": on_show_bn_weights_and_bias,
        # ============================== show bn weights and bias distribution in tensorboard ==========================
    }
    if SummaryWriter
    else {}
)

3.3 添加剪枝方法文件yolov8_prune.py

该文件相比torch-pruning官方的 yolov8_pruning.py 的改动较大，或者说，仅使用了其中的剪枝方法对稀疏化训练好的模型进行了剪枝。代码如下：

# yolov8_prune.py
# This code is adapted from Issue [#147](https://github.com/VainF/Torch-Pruning/issues/147),
# implemented by @Hyunseok-Kim0.
import math
import torch
import torch.nn as nn
import torch_pruning as tp
from copy import deepcopy
from ultralytics import YOLO
from ultralytics.nn.modules import Detect, C2f, C2f_v2
from ultralytics.utils import yaml_load
from ultralytics.utils.checks import check_yaml
from ultralytics.utils.torch_utils import initialize_weights


def infer_shortcut(bottleneck):
    # 判定 bottleneck 有没有残差连接
    c1 = bottleneck.cv1.conv.in_channels
    c2 = bottleneck.cv2.conv.out_channels

    # a) c1 == c2        表示输入通道数与输出通道数相同
    # b) hasattr(bottleneck, 'add')  检查 bottleneck 对象是否有名为 'add' 的属性
    # c) bottleneck.add  该属性为 True 时表示确实存在捷径连接
    return c1 == c2 and hasattr(bottleneck, 'add') and bottleneck.add


def transfer_weights(c2f, c2f_v2):
    # 1. 直接把 c2f 的 cv2 和 m (ModuleList) 复制给 c2f_v2
    c2f_v2.cv2 = c2f.cv2
    c2f_v2.m = c2f.m

    # 2. 获取旧模型和新模型的 state_dict（权重字典）
    state_dict = c2f.state_dict()   # 包含 c2f 的所有参数(key)->tensor 的映射
    state_dict_v2 = c2f_v2.state_dict()     # 包含 c2f_v2 的所有参数(key)->tensor 的映射

    # 3. 处理 cv1 中的卷积权重，并拆分到 c2f_v2 的 cv0、cv1
    # Transfer cv1 weights from C2f to cv0 and cv1 in C2f_v2
    old_weight = state_dict['cv1.conv.weight']  # 取出旧模型 c2f 的 'cv1.conv.weight'
    half_channels = old_weight.shape[0] // 2
    state_dict_v2['cv0.conv.weight'] = old_weight[:half_channels]   # 把前一半的卷积核(通道维度)放到新模型的 'cv0.conv.weight'
    state_dict_v2['cv1.conv.weight'] = old_weight[half_channels:]   # 把后一半的卷积核(通道维度)放到新模型的 'cv1.conv.weight'

    # 4. 同理，把 cv1 的 BatchNorm 参数(权重、偏置、均值、方差)也拆分到 cv0.bn 和 cv1.bn
    # Transfer cv1 batchnorm weights and buffers from C2f to cv0 and cv1 in C2f_v2
    for bn_key in ['weight', 'bias', 'running_mean', 'running_var']:
        old_bn = state_dict[f'cv1.bn.{bn_key}']
        state_dict_v2[f'cv0.bn.{bn_key}'] = old_bn[:half_channels]  # 前 half_channels 个用于 cv0.bn
        state_dict_v2[f'cv1.bn.{bn_key}'] = old_bn[half_channels:]  # 后 half_channels 个用于 cv1.bn

    # 5. 把剩余的权重直接复制过去(所有不是以 'cv1.' 开头的 key)，因为 'cv1.' 是我们刚才专门拆分处理的，其它的保持原样即可
    # Transfer remaining weights and buffers
    for key in state_dict:
        if not key.startswith('cv1.'):
            state_dict_v2[key] = state_dict[key]

    # 6. 复制所有非方法(non-method)的属性给新的 c2f_v2，例如一些标量、列表、或者其他需要保留的成员
    # Transfer all non-method attributes
    for attr_name in dir(c2f):
        attr_value = getattr(c2f, attr_name)
        if not callable(attr_value) and '_' not in attr_name:
            setattr(c2f_v2, attr_name, attr_value)

    # 7. 用修改完的 state_dict_v2 为 c2f_v2 加载权重，实际完成全部参数赋值
    c2f_v2.load_state_dict(state_dict_v2)


def replace_c2f_with_c2f_v2(module):
    # 1. 遍历当前 module 的所有子模块
    for name, child_module in module.named_children():
        # 2. 如果发现子模块是 C2f 类型，就要把它替换成 C2f_v2
        if isinstance(child_module, C2f):
            # 2.1 通过第一个 Bottleneck 推断它是否存在 shortcut
            # Replace C2f with C2f_v2 while preserving its parameters
            shortcut = infer_shortcut(child_module.m[0])

            # 2.2 根据旧 C2f 的输入输出通道数、Bottleneck 数量等，创建一个新的 C2f_v2
            c2f_v2 = C2f_v2(child_module.cv1.conv.in_channels,
                            child_module.cv2.conv.out_channels,
                            n=len(child_module.m),
                            shortcut=shortcut,
                            g=child_module.m[0].cv2.conv.groups,
                            e=child_module.c / child_module.cv2.conv.out_channels)

            # 2.3 调用 transfer_weights，把旧的 C2f 参数(权重、BN等)拷贝/转换到新的 c2f_v2
            transfer_weights(child_module, c2f_v2)

            # 2.4 用 setattr 把模块本身替换成新创建的 c2f_v2
            setattr(module, name, c2f_v2)
        else:
            # 3. 如果这个子模块不是 C2f，就递归地继续往下找
            replace_c2f_with_c2f_v2(child_module)


def replace_silu_with_relu(module):
    """
    Recursively replace all SiLU activation functions in the given module
    with ReLU activation functions.

    Args:
        module (nn.Module): The module to process.
    """
    # 1. 遍历当前 module 的所有子模块
    for name, child_module in module.named_children():
        # 2. 如果发现子模块是 SiLU 类型，就把它替换成 ReLU
        if isinstance(child_module, nn.SiLU):
            # 用 ReLU 替换 SiLU
            setattr(module, name, nn.ReLU(inplace=True))
        else:
            # 3. 如果子模块不是 SiLU，递归地继续处理子模块
            replace_silu_with_relu(child_module)


def replace_module_in_dict(model_dict, original_module, new_module):
    """
    Replace a specified module type in a model dictionary (e.g., 'C2f' -> 'C2f_v2').

    Args:
        model_dict (dict): Model configuration dictionary containing 'backbone' and 'head' keys.
        original_module (str): The name of the module to replace (e.g., 'C2f').
        new_module (str): The new module name to replace with (e.g., 'C2f_v2').

    Returns:
        dict: Updated model dictionary with the specified module replaced.
    """
    def replace_modules(layers, original, new):
        for layer in layers:
            if layer[2] == original:
                layer[2] = new
        return layers

    # Replace in backbone and head
    if 'backbone' in model_dict:
        model_dict['backbone'] = replace_modules(model_dict['backbone'], original_module, new_module)
    if 'head' in model_dict:
        model_dict['head'] = replace_modules(model_dict['head'], original_module, new_module)

    return model_dict


def update_model_yaml(model_dict):
    width_multiple = model_dict['width_multiple']
    for section in ['backbone', 'head']:
        for layer in model_dict[section]:
            # 检查每层的参数列表是否包含32倍数的值
            args = layer[-1]
            for i in range(len(args)):
                if isinstance(args[i], int) and args[i] % 32 == 0:
                    # 乘以 width_multiple 并四舍五入为最接近的 32 倍数
                    args[i] = round(args[i] * width_multiple / 32) * 32
    # 将 width_multiple 更新为 1.0
    model_dict['width_multiple'] = 1.0
    return model_dict


def update_pruned_model_yaml(original_yaml, pruned_model):
    """
    根据剪枝后的模型，更新原始模型的 model.model.yaml 字典。

    Args:
        original_yaml (dict): 原始模型的 model.model.yaml 字典。
        pruned_model (nn.Module): 剪枝后的模型。

    Returns:
        dict: 更新后的 model.model.yaml 字典。
    """
    # 遍历剪枝后的模型层
    pruned_layers = []
    for idx, layer in enumerate(pruned_model.model):
        pruned_layers.append(layer)

    # 遍历原始模型的 yaml，更新通道数
    updated_yaml = original_yaml.copy()
    backbone = updated_yaml["backbone"]
    head = updated_yaml["head"]

    # 更新 backbone 的通道数
    for i, layer_info in enumerate(backbone):
        module_type = layer_info[2]  # 模块类型 (如 'Conv', 'C2f_v2', 等)
        if module_type == "Conv":  # 更新 Conv 的通道数
            pruned_out_channels = pruned_layers[i].conv.out_channels
            backbone[i][3][0] = pruned_out_channels
        elif module_type.startswith("C2f"):  # 更新 C2f_v2 的通道数
            pruned_out_channels = pruned_layers[i].cv2.conv.out_channels
            backbone[i][3][0] = pruned_out_channels

    # 更新 head 的通道数
    for i, layer_info in enumerate(head):
        module_type = layer_info[2]
        if module_type == "Conv":  # 更新 Conv 的通道数
            pruned_out_channels = pruned_layers[len(backbone) + i].conv.out_channels
            head[i][3][0] = pruned_out_channels
        elif module_type.startswith("C2f"):  # 更新 C2f_v2 的通道数
            pruned_out_channels = pruned_layers[len(backbone) + i].cv2.conv.out_channels
            head[i][3][0] = pruned_out_channels

    # 返回更新后的 yaml
    updated_yaml["backbone"] = backbone
    updated_yaml["head"] = head
    return updated_yaml


def prune(prune_param: dict):
    # 加载yolo模型（以剪枝的也可以），并绑定自定义训练方法train_v2
    model = YOLO(prune_param["model"])
    model_name_part = os.path.basename(prune_param["model"]).rpartition('.')
    model_name = model_name_part[0] + f"_x{str(prune_param['target_prune_rate'])}." + model_name_part[-1]
    # model.train_v2 = train_v2.__get__(model, type(model))

    # 加载剪枝参数
    pruning_cfg = yaml_load(check_yaml(prune_param["cfg_path"]))

    # 加载模型，并把其中的C2f模块替换成C2f_v2
    model.to(pruning_cfg["device"])

    replace_c2f_with_c2f_v2(model)
    # 将SiLU替换成ReLU
    # replace_silu_with_relu(model)
    model.model.yaml = replace_module_in_dict(model.model.yaml, "C2f", "C2f_v2")
    # model.model.yaml = replace_module_in_dict(model.model.yaml, "SiLU", "ReLU")   # yaml里没有，所以不需要

    # 将yaml中C2f替换成C2f_v2
    model.model.yaml = update_model_yaml(model.model.yaml)

    initialize_weights(model.model)  # set BN.eps, momentum, ReLU.inplace

    for name, param in model.model.named_parameters():
        param.requires_grad = True

    # 初始化Torch-Pruning剪枝的相关参数
    example_inputs = torch.randn(1, 3, pruning_cfg["imgsz"], pruning_cfg["imgsz"]).to(pruning_cfg["device"])
    model.model.cuda(pruning_cfg["device"])
    base_macs, base_nparams = tp.utils.count_ops_and_params(model.model, example_inputs)

    # do validation before pruning model
    pruning_cfg['name'] = f"baseline_val"
    pruning_cfg['batch'] = 4
    validation_model = deepcopy(model)
    metric = validation_model.val(**pruning_cfg)
    init_map = metric.box.map

    print(f"Before Pruning: MACs={base_macs / 1e9: .5f} G, #Params={base_nparams / 1e6: .5f} M, mAP={init_map: .5f}")

    # prune same ratio of filter based on initial size
    # pruning_ratio = 1 - math.pow((1 - prune_param["target_prune_rate"]), 1 / prune_param["iterative_steps"])

    model.model.train()
    for name, param in model.model.named_parameters():
        param.requires_grad = True

    ignored_layers = []
    # unwrapped_parameters = []

    # 忽略含 “model.0.conv”，不剪枝最浅层卷积
    for name, module in model.named_modules():
        if "model.0.conv" in name or "dfl" in name:
            ignored_layers.append(module)

    # 不剪枝检测头
    for m in model.model.modules():
        if isinstance(m, (Detect,)):
            ignored_layers.append(m)

    example_inputs = example_inputs.to(model.device)

    pruner = tp.pruner.GroupNormPruner(
        model.model,
        example_inputs,
        importance=tp.importance.GroupNormImportance(),  # L2 norm pruning,
        iterative_steps=prune_param["iterative_steps"],
        pruning_ratio=prune_param["target_prune_rate"],
        ignored_layers=ignored_layers,
        round_to=32,
        global_pruning=prune_param["global_pruning"]
        # unwrapped_parameters=unwrapped_parameters
    )

    pruner.step()

    # pre fine-tuning validation
    pruning_cfg['name'] = f"step_pre_val"
    pruning_cfg['batch'] = 1
    validation_model.model = deepcopy(model.model)
    metric = validation_model.val(**pruning_cfg)
    pruned_map = metric.box.map
    pruned_macs, pruned_nparams = tp.utils.count_ops_and_params(pruner.model, example_inputs.to(model.device))

    current_speed_up = base_macs / pruned_macs
    print(f"After pruning: MACs={pruned_macs / 1e9} G, #Params={pruned_nparams / 1e6} M, "
          f"mAP={pruned_map}, speed up={current_speed_up}")

    model.model.yaml = update_pruned_model_yaml(model.model.yaml, model.model)
    model.save(os.path.join(pruning_cfg['name'], model_name))

    return


if __name__ == "__main__":
    import os

    os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'

    prune_param = {
        "model": r'last.pt',
        # "model": r'G:\code\yolo_v8_v11\torch_pruning/best.pt',
        "cfg_path": 'setting.yaml',     # Pruning config file. Having same format with ultralytics/cfg/default.yaml
        "iterative_steps": 3,           # Total pruning iteration step
        "target_prune_rate": 0.8,       # Target pruning rate
        "global_pruning": True,
    }

    prune(prune_param)

简单说一下这个文件实现的功能：

明面上要修改的参数：指定模型路径，模型的配置文件，剪枝迭代次数（torch-pruning内部的），剪枝比例，进行全局剪枝（从全部层来考虑某些层的重要性）。

    prune_param = {
        "model": r'last.pt',
        # "model": r'G:\code\yolo_v8_v11\torch_pruning/best.pt',
        "cfg_path": 'setting.yaml',     # Pruning config file. Having same format with ultralytics/cfg/default.yaml
        "iterative_steps": 3,           # Total pruning iteration step
        "target_prune_rate": 0.8,       # Target pruning rate
        "global_pruning": True,
    }

在一个任务目录下（如kx_s_250126）下，复制训练用的数据配置文件data.yaml和模型配置文件setting.yaml，以及S1_L1（约束化训练的结果）中的last.pt，然后运行yolov8_prune.py即可。

data.yaml如下：

配置正负样本平衡的参数negative_setting参考文章：

YOLOv8源码修改（1）- DataLoader增加负样本数据读取+平衡训练batch中的正负样本数

不修改的话就删了或者注释掉。

# path: /path/to/datasets
train: ["G:/datasets/bank_scene/00-task/yolo_det_kx_jh_1028/yolo_det_kx_jh_1028/train"]
val: ["G:/datasets/bank_scene/00-task/yolo_det_kx_jh_1028/yolo_det_kx_jh_1028/val"]

# Add negative image ---------------------------------------------------------------------------------------------------
negative_setting:
  neg_ratio: 1    # 小于等于0时，按原始官方配置训练，大于0时，控制正负样本。
  use_extra_neg: True
  extra_neg_sources: {
                      # "G:/datasets/bank_monitor/0-贵重物品遗留/贵重物品遗留三类图片/2024-07-12-qianbao": 100
                      }
  fix_dataset_length: 0    # 是否自定义每轮参与训练的图片数量

# number of classes ----------------------------------------------------------------------------------------------------
nc: 2

# Classes --------------------------------------------------------------------------------------------------------------
names:
  0: kx
  1: kx_dk

setting.yaml如下：

部分超参数需要自己调整，比如轮次、批次、学习率、参数大小等。

# Train settings -------------------------------------------------------------------------------------------------------
task: detect            # (str) YOLO task, i.e. detect, segment, classify, pose
mode: train             # (str) YOLO mode, i.e. train, val, predict, export, track, benchmark
data: ./data.yaml       # (str, optional) path to data file, i.e. coco8.yaml
epochs: 20              # (int) number of epochs to train for
batch: 4                # (int) number of images per batch (-1 for AutoBatch)
imgsz: 640              # (int | list) input images size as int for train and val modes, or list[w,h] for predict and export modes
patience: 1000          # (int) epochs to wait for no observable improvement for early stopping of training
device: 0               # (int | str | list, optional) device to run on, i.e. cuda device=0 or device=0,1,2,3 or device=cpu
workers: 0              # (int) number of worker threads for data loading (per RANK if DDP)
project: ./             # (str, optional) project name
name: S3_ft         # (str, optional) experiment name, results saved to 'project/name' directory
optimizer: SGD          # (str) optimizer to use, choices=[SGD, Adam, Adamax, AdamW, NAdam, RAdam, RMSProp, auto]
single_cls: False       # (bool) train multi-class data as single-class
rect: False             # (bool) rectangular training if mode='train' or rectangular validation if mode='val'
cos_lr: False           # (bool) use cosine learning rate scheduler
close_mosaic: 0         # (int) disable mosaic augmentation for final epochs (0 to disable)
resume: False           # (bool) resume training from last checkpoint
amp: True               # (bool) Automatic Mixed Precision (AMP) training, choices=[True, False], True runs AMP check
freeze: None            # (int | list, optional) freeze first n layers, or freeze list of layer indices during training
multi_scale: True       # (bool) Whether to use multiscale during training

# Hyperparameters ------------------------------------------------------------------------------------------------------
lr0: 0.0005               # (float) initial learning rate (i.e. SGD=1E-2, Adam=1E-3)
lrf: 0.1               # (float) final learning rate (lr0 * lrf)
momentum: 0.937         # (float) SGD momentum/Adam beta1
weight_decay: 0.0001    # (float) optimizer weight decay 5e-4
warmup_epochs: 0.0      # (float) warmup epochs (fractions ok)
warmup_momentum: 0.8    # (float) warmup initial momentum
warmup_bias_lr: 0.1     # (float) warmup initial bias lr
box: 7.5                # (float) box loss gain, default: 7.5
cls: 0.5                # (float) cls loss gain (scale with pixels), default: 0.5
dfl: 1.5                # (float) dfl loss gain, default: 1.5
degrees: 0.0            # (float) image rotation (+/- deg)
translate: 0.1          # (float) image translation (+/- fraction)
scale: 0.5              # (float) image scale (+/- gain)
fliplr: 0.5             # (float) image flip left-right (probability)
mosaic: 0.95            # (float) image mosaic (probability)

# Custom configuration, sparse training, pruning and distillation ------------------------------------------------------
custom_model: True      # (bool) load train model from .pt directory, not from model.model.yaml
add_L1: False            # (bool) use L1 norm
L1_start: 0.001         # (float) L1 regularization weight
L1_end: 0.1             # (float) final L1 regularization weight (L1_start * L1_end), default: 0.1

关于yolov8_prune.py中的部分方法：

replace_c2f_with_c2f_v2: 将原始模型中的C2f模块替换为C2f_v2。

replace_silu_with_relu: 将原始模型中的SiLU替换成ReLU。（默认没有使用）

replace_module_in_dict: 替换model.model.yaml中模块名。

update_model_yaml：根据剪枝模型每个模块的通道大小，修改原始模型model.model.yaml，并复赋值给剪枝后模型的model.model.yaml（改配置信息并不准确，只能大概计算，因为模块内部的输入、输出也会改变）

核心代码-替换模块：

# 加载模型，并把其中的C2f模块替换成C2f_v2
    model.to(pruning_cfg["device"])
    replace_c2f_with_c2f_v2(model)

    # 将SiLU替换成ReLU
    # replace_silu_with_relu(model)
    model.model.yaml = replace_module_in_dict(model.model.yaml, "C2f", "C2f_v2")
    # model.model.yaml = replace_module_in_dict(model.model.yaml, "SiLU", "ReLU")   # yaml里没有，所以不需要

    # 将yaml中C2f替换成C2f_v2
    model.model.yaml = update_model_yaml(model.model.yaml)

    initialize_weights(model.model)  # set BN.eps, momentum, ReLU.inplace

核心代码-模型剪枝：

详细使用参考 Torch-Pruning （应用不同剪枝算法，自定义通道剪枝，自定义比例等）。简单使用如下即可。

    example_inputs = example_inputs.to(model.device)

    pruner = tp.pruner.GroupNormPruner(
        model.model,
        example_inputs,
        importance=tp.importance.GroupNormImportance(),  # L2 norm pruning,
        iterative_steps=prune_param["iterative_steps"],
        pruning_ratio=prune_param["target_prune_rate"],
        ignored_layers=ignored_layers,
        round_to=32,
        global_pruning=prune_param["global_pruning"]
        # unwrapped_parameters=unwrapped_parameters
    )

忽略的层，剪枝时候忽略了第一个卷积层和检测头层已经dfl层：

    ignored_layers = []
    # unwrapped_parameters = []

    # 忽略含 “model.0.conv”，不剪枝最浅层卷积
    for name, module in model.named_modules():
        if "model.0.conv" in name or "dfl" in name:
            ignored_layers.append(module)

    # 不剪枝检测头
    for m in model.model.modules():
        if isinstance(m, (Detect,)):
            ignored_layers.append(m)

最终剪枝的加载效果如下（yolov8-s.pt）：

绿色框表明替换了模块，蓝色框表明减少了参数（不准确，没计算模块内部的剪枝），红色框表明了每个模块输入/输出通道数的减少（同样不准确）。为了方便部署和更好利用gpn/npu等，将剪枝通道的输入/输出均设置为32的倍数（剪枝时，round_to=32）。

上述模型，实际的测评（GFLOPs为半精度的）：

4. 模型训练与测试结果对比

4.1 模型结构比较

ONNX格式查看C2f和C2f_v2（去除了split操作）:

PT格式查看剪枝模型的通道数变化:

4.2 模型稀疏性对比

加入L1约束的训练（颜色越深轮次越大），可以看到0值有增加，但不多（L1_start=0.001, L1_end=0.1）。

量化对比非稀疏训练（左）的BN的0值明显小于稀疏训练的（右）。

回调后：

4.3 模型性能分析

如下表所示，剪枝后模型参数量减少了49.1%，FLOPs减少20.4%，推理速度提升10.6%（RTX3060 LAPTOP），在预测准确度上，几乎不变（相差均在1%以内）。

剪枝后的模型精度还几乎不变，可能是数据集太简单了，总共2个类别，用了2000+张训练，292张验证。后续换大数据集调参再测试。

模型名称	Img.	Inst.	P	R	mAP50	mAP75	mAP50-95	params (M)	time (ms)	Gflops
origin	292	476	0.987	0.992	0.995	0.989	0.943	11.13	4.7	28.4
origin_L1	292	476	0.988	0.996	0.994	0.992	0.943	11.13	4.7	28.4
origin_L1_p	292	476	0.982	0.996	0.994	0.989	0.929	5.66	4.2	22.6
origin_L1_p_ft	292	476	0.995	0.988	0.994	0.993	0.940	5.66	4.2	22.6