Object Detection: A Brief Reading of the Fast R-CNN Paper

article 2025/2/25 23:02:22

Preface

Fast R-CNN is an improved version of R-CNN that addresses several of R-CNN's pain points.

Fast R-CNN

Paper link

Abstract

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks.

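The paper proposes Fast R-CNN, a fast region-based convolutional network method for object detection. It builds on earlier work (R-CNN) and uses deep convolutional networks to classify object proposals efficiently.

Introduction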

Due to this complexity, current approaches (e.g., [9, 11, 19, 25]) train models in multi-stage pipelines that are slow and inelegant.

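Because object detection is such a complex task, current approaches that train in multi-stage pipelines (R-CNN, SPPnet, OverFeat, segDeepM) are slow and inelegant.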

Complexity arises because detection requires the accurate localization of objects, creating two primary challenges. First, numerous candidate object locations (often called “proposals”) must be processed. Second, these candidates provide only rough localization that must be refined to achieve precise localization.

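The complexity comes from the need to localize objects accurately, which creates two main challenges. First, a large number of candidate boxes (proposals) must be processed. Second, these candidates give only a rough location and must be refined to localize the object precisely.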

We propose a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.

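The authors propose a single-stage training algorithm that jointly learns to classify object proposals and to refine their spatial locations.

R-CNN and SPPnet

Drawbacks of R-CNN:

Training is a multi-stage pipeline
Training is expensive in both disk space and time
Object detection is slow

Drawbacks of SPPnet:

Training is a multi-stage pipeline
Training requires disk space, since features are written to disk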

But unlike R-CNN, the fine-tuning algorithm proposed in [11] cannot update the convolutional layers that precede the spatial pyramid pooling. Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.

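The SPPnet fine-tuning algorithm cannot update the convolutional layers that precede the spatial pyramid pooling layer.

Contributions of Fast R-CNN

Higher detection quality (mAP) than R-CNN and SPPnet
Single-stage training with a multi-task loss
Training can update all network layers
No disk storage is needed for feature caching

Fast R-CNN architecture and training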

A Fast R-CNN network takes as input an entire image and a set of object proposals.

Fast R-CNN takes an entire image and a set of object proposals as input.

The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.

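The whole image first passes through several conv and max pooling layers to produce a conv feature map. Then, for each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from that feature map.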

Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes.

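Each feature vector goes through a sequence of fully connected (fc) layers that finally branch into two sibling output layers. One produces softmax probability estimates over the K object classes plus a catch-all "background" class; the other outputs four real-valued numbers for each of the K object classes.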

Each set of 4 values encodes refined bounding-box positions for one of the K classes.

Each set of 4 values corresponds to one of the K classes and encodes its refined bounding-box position.
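The two sibling outputs can be sketched in NumPy as follows. This is only an illustration under assumed shapes: the function name `sibling_heads`, the weight matrices, and the sizes R, D, K are made up here, and random weights stand in for a trained network.

```python
import numpy as np

def sibling_heads(feat, w_cls, b_cls, w_box, b_box):
    """feat: (R, D) RoI feature vectors coming out of the fc layers.
    Returns softmax probabilities (R, K+1) and box outputs (R, K, 4)."""
    scores = feat @ w_cls + b_cls                 # (R, K+1), index 0 = background
    scores -= scores.max(axis=1, keepdims=True)   # subtract max for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    num_classes = w_box.shape[1] // 4             # 4 outputs per object class
    boxes = (feat @ w_box + b_box).reshape(len(feat), num_classes, 4)
    return probs, boxes

# Hypothetical sizes: 3 RoIs, 8-dim features, K = 20 object classes.
rng = np.random.default_rng(0)
R, D, K = 3, 8, 20
probs, boxes = sibling_heads(
    rng.normal(size=(R, D)),
    rng.normal(size=(D, K + 1)), np.zeros(K + 1),
    rng.normal(size=(D, 4 * K)), np.zeros(4 * K),
)
print(probs.shape, boxes.shape)  # (3, 21) (3, 20, 4)
```

Both heads read the same feature vector, which is what lets a single loss train classification and box refinement together.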

RoI pooling layer

The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI.

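The RoI pooling layer uses max pooling to turn the features inside any valid region of interest (RoI) into a small feature map with a fixed spatial extent of H × W (for example 7 × 7). H and W are hyper-parameters of the layer and do not depend on the size of any particular RoI.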

In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).

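In the paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r, c, h, w), where (r, c) is its top-left corner and (h, w) is its height and width.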

RoI max pooling works by dividing the h × w RoI window into an H × W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell.

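RoI max pooling divides the h × w RoI window into an H × W grid of sub-windows, each of approximate size h/H × w/W, and then max-pools the values of each sub-window into the corresponding output grid cell.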

Pooling is applied independently to each feature map channel, as in standard max pooling.

Pooling is done independently on each feature map channel, just like standard max pooling.

The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].

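The RoI layer is simply the special case of the SPPnet spatial pyramid pooling layer with a single pyramid level. The pooling sub-window calculation is the one given in the SPPnet paper.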
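A minimal NumPy sketch of RoI max pooling, assuming the RoI is already given in feature-map cells as (r, c, h, w). The floor/ceil bin edges are one common choice and are an assumption here, not necessarily the exact sub-window calculation of [11]; the function name `roi_max_pool` is illustrative.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h=7, out_w=7):
    """feature_map: (C, H_f, W_f); roi: (r, c, h, w) in feature-map cells.
    Returns a (C, out_h, out_w) array, pooled independently per channel."""
    r, c, h, w = roi
    channels = feature_map.shape[0]
    out = np.empty((channels, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        # Assumed scheme: floor/ceil edges, so every bin holds at least one cell.
        r0 = r + int(np.floor(i * h / out_h))
        r1 = r + int(np.ceil((i + 1) * h / out_h))
        for j in range(out_w):
            c0 = c + int(np.floor(j * w / out_w))
            c1 = c + int(np.ceil((j + 1) * w / out_w))
            out[:, i, j] = feature_map[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out

# Toy input: 2 channels, 8 x 8 map, a 4 x 6 RoI pooled to a 2 x 2 grid.
fmap = np.arange(2 * 8 * 8, dtype=float).reshape(2, 8, 8)
pooled = roi_max_pool(fmap, roi=(1, 2, 4, 6), out_h=2, out_w=2)
print(pooled[0])  # [[20. 23.] [36. 39.]]
```

Whatever the RoI size, the output is always out_h × out_w per channel, which is what makes it compatible with the fixed-size fc layers that follow.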

Initializing from pre-trained networks

First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).

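First, the last max pooling layer is replaced by an RoI pooling layer, with H and W set to be compatible with the network's first fully connected layer (for VGG16, H = W = 7).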

Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors).

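Second, the network's last fully connected layer and softmax (trained for 1000-way ImageNet classification) are replaced by the two sibling layers described earlier: a fully connected layer with softmax over K + 1 categories, and category-specific bounding-box regressors.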

Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.

Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.

[Figure: network structure at prediction time]

Fine-tuning for detection

In Fast RCNN training, stochastic gradient descent (SGD) minibatches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes.

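During Fast R-CNN training, SGD mini-batches are sampled hierarchically: first N images are sampled, then R/N RoIs from each image. The key point is that RoIs from the same image share computation and memory in the forward and backward passes.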

Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).

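Making N small reduces the computation per mini-batch. For example, with N = 2 and R = 128, the proposed scheme is roughly 64× faster than sampling one RoI from each of 128 different images (the R-CNN and SPPnet strategy).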
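A rough way to see where the 64× figure comes from is to count full-image conv passes per mini-batch. This back-of-the-envelope sketch assumes the conv pass dominates the cost and is shared by all RoIs of the same image; the variable names are illustrative.

```python
R = 128  # RoIs per mini-batch
N = 2    # images per mini-batch

# One RoI per image: every RoI needs its own full-image conv pass.
conv_passes_one_roi_per_image = R
# Hierarchical sampling: RoIs from the same image share one conv pass.
conv_passes_hierarchical = N

speedup = conv_passes_one_roi_per_image / conv_passes_hierarchical
print(speedup)  # 64.0
```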

One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.

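One might worry that this strategy slows training convergence, since RoIs from the same image are correlated, but in practice this does not appear to be an issue. With N = 2 and R = 128, the authors get good results using fewer SGD iterations than R-CNN.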

In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, mini-batch sampling strategy, back-propagation through RoI pooling layers, and SGD hyper-parameters) are described below.

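Besides hierarchical sampling, Fast R-CNN streamlines training into a single fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, instead of training a softmax classifier, SVMs, and regressors in three separate stages [9, 11]. The components of this procedure (the loss, the mini-batch sampling strategy, back-propagation through RoI pooling layers, and the SGD hyper-parameters) are described below.

At this point the network itself is trained end to end in a single stage; the object proposals are still supplied to it as input.

Multi-task loss

This part is paraphrased directly.

The network has two outputs for each RoI:

Class prediction: with K object classes, the fc output has length K+1; the extra entry is the background class. $p=(p_{0},...,p_{K})$
Bounding-box regression: $t^{k}=(t^{k}_{x},t^{k}_{y},t^{k}_{w},t^{k}_{h})$, one set of box offsets for each class $k$

Each RoI is trained jointly on class prediction and bounding-box regression: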

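$L(p,u,t^{u},v)=L_{cls}(p,u) + \lambda[u \geq 1]L_{loc}(t^{u},v)$
where:
Classification loss: $L_{cls}(p,u)=-\log p_{u}$, the multi-class cross-entropy (log) loss
Bounding-box loss: $L_{loc}(t^{u},v)= \sum\limits_{i\in\{x,y,w,h\}}\mathrm{smooth}_{L_{1}}(t^{u}_i-v_i)$,
$\mathrm{smooth}_{L_{1}}(x)=\begin{cases} 0.5x^{2} & \text{if } |x|<1\\ |x|-0.5 & \text{otherwise} \end{cases}$
Explanation:
$p$: the predicted class probabilities
$u$: the ground-truth class
$t^{u}$: the predicted bounding-box offsets for the ground-truth class $u$
$v$: the ground-truth bounding-box regression target
$[u \geq 1]$: an indicator that is 1 when $u \geq 1$ and 0 otherwise. Background RoIs are labeled $u = 0$, so they contribute no localization loss.
$\lambda$: a hyper-parameter that balances the two losses; the paper uses $\lambda=1$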
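The loss for a single RoI can be sketched in NumPy as below. The array layout is an assumption made for the example: `t` has K+1 rows so that it can be indexed by the class label directly, with row 0 (background) unused. The function names are illustrative.

```python
import numpy as np

def smooth_l1(x):
    """Elementwise smooth L1: 0.5 x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def multitask_loss(p, u, t, v, lam=1.0):
    """p: (K+1,) softmax probabilities; u: true class, 0 = background;
    t: (K+1, 4) predicted offsets per class (row 0 unused, an assumption);
    v: (4,) regression target for the true box."""
    l_cls = -np.log(p[u])
    # [u >= 1]: background RoIs have no ground-truth box, so no box loss.
    l_loc = smooth_l1(t[u] - v).sum() if u >= 1 else 0.0
    return l_cls + lam * l_loc

# Made-up numbers for K = 2 classes.
p = np.array([0.1, 0.7, 0.2])
t = np.zeros((3, 4))
t[1] = [0.5, 2.0, 0.0, 0.0]
v = np.zeros(4)
loss_fg = multitask_loss(p, 1, t, v)  # -log(0.7) + (0.125 + 1.5)
loss_bg = multitask_loss(p, 0, t, v)  # -log(0.1) only
print(loss_fg, loss_bg)
```

In the foreground case the offset error 0.5 falls in the quadratic branch and 2.0 in the linear branch, so the example exercises both halves of the smooth L1 definition.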

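Mini-batch sampling

Each mini-batch contains N = 2 images, chosen uniformly at random
R = 128 RoIs per mini-batch, 64 sampled from each image
Foreground and background: 25% of the RoIs are foreground (IoU >= 0.5 with a ground-truth box, u >= 1); the remaining 75% are background (max IoU in $[0.1, 0.5)$, u = 0)
Data augmentation: images are flipped horizontally with probability 0.5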
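The per-image RoI sampling above can be sketched as follows, assuming the maximum IoU of every proposal with the ground-truth boxes has already been computed. The function name `sample_rois`, and the choice to fill up with background when an image has too few foreground proposals, are assumptions for this example.

```python
import numpy as np

def sample_rois(max_iou, rois_per_image=64, fg_fraction=0.25, rng=None):
    """max_iou: (P,) max IoU of each proposal with any ground-truth box.
    Returns index arrays of the sampled foreground and background proposals."""
    if rng is None:
        rng = np.random.default_rng()
    fg = np.flatnonzero(max_iou >= 0.5)
    bg = np.flatnonzero((max_iou >= 0.1) & (max_iou < 0.5))
    n_fg = min(int(round(fg_fraction * rois_per_image)), len(fg))
    # Assumption: any shortfall in foreground is filled with background.
    n_bg = min(rois_per_image - n_fg, len(bg))
    fg_idx = rng.choice(fg, size=n_fg, replace=False)
    bg_idx = rng.choice(bg, size=n_bg, replace=False)
    return fg_idx, bg_idx

rng = np.random.default_rng(0)
max_iou = rng.uniform(0.0, 1.0, size=2000)  # made-up overlaps for 2000 proposals
fg_idx, bg_idx = sample_rois(max_iou, rng=rng)
print(len(fg_idx), len(bg_idx))  # 16 48
```

Running this once for each of the N = 2 images gives the R = 128 RoIs of one mini-batch.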

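Back-propagation through the RoI pooling layer

$x_{i}$: the $i$-th activation input into the RoI pooling layer
$y_{rj}$: the layer's $j$-th output from the $r$-th RoI
$\mathcal{R}(r, j)$: the index set of inputs in the sub-window over which output unit $y_{rj}$ max pools
$i$: the position index on the input feature map

$\frac{\partial{L}}{\partial{x_{i}}}=\sum\limits_{r}\sum\limits_{j}[i=i^{*}(r,j)]\frac{\partial{L}}{\partial{y_{rj}}}$
where:
$i^{*}(r,j)=\arg\max_{i'\in\mathcal{R}(r,j)}x_{i'}$
In other words, each input $x_{i}$ accumulates the gradient of every output $y_{rj}$ for which it was selected as the maximum in the forward pass.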
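The backward rule amounts to routing each output gradient back to the input that won the max. A minimal single-channel NumPy sketch, assuming the forward pass stored the flat argmax index $i^{*}(r,j)$ for every output (the function name and array layout are illustrative):

```python
import numpy as np

def roi_pool_backward(x_shape, argmax, grad_y):
    """argmax: (R, J) flat input index i*(r, j) saved in the forward pass.
    grad_y: (R, J) values of dL/dy_rj. Returns dL/dx with shape x_shape."""
    grad_x = np.zeros(int(np.prod(x_shape)))
    # np.add.at accumulates over repeated indices, which matches the
    # double sum over r and j when several outputs pick the same input.
    np.add.at(grad_x, argmax.ravel(), grad_y.ravel())
    return grad_x.reshape(x_shape)

# Two RoIs with two outputs each; both RoIs selected input 4 once.
argmax = np.array([[1, 4], [4, 5]])
grad_y = np.array([[1.0, 2.0], [3.0, 4.0]])
grad_x = roi_pool_backward((2, 3), argmax, grad_y)
print(grad_x)  # [[0. 1. 0.] [0. 5. 4.]]
```

Input 4 receives 2 + 3 = 5 because overlapping RoIs share it, which is exactly the case the sum over $r$ is there to handle.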