
[Week 30] Paper Reading: Mask R-CNN

article 2025/3/1 13:00:15

Abstract

This week I read the Mask R-CNN paper. Mask R-CNN extends Faster R-CNN into a framework for instance segmentation. A backbone network, optionally combined with a Feature Pyramid Network (FPN), extracts multi-scale feature maps, which improves detection of objects of varying sizes, and a Region Proposal Network (RPN) generates a large number of candidate regions (region proposals) on those maps. To keep features spatially aligned, Mask R-CNN replaces the RoIPool layer with RoIAlign, which uses bilinear interpolation instead of coordinate quantization to map each region of interest (RoI) to a fixed output size. For each RoI, the model performs classification and bounding-box regression and adds a parallel branch, a small Fully Convolutional Network (FCN), that predicts a pixel-level segmentation mask. The branches share convolutional features, so the mask branch adds little overhead while the model stays simple and flexible. With these changes, Mask R-CNN achieved state-of-the-art instance segmentation accuracy at the time of publication and became a standard baseline in the field.

Mask R-CNN

Title: Mask R-CNN
Author: He, KM (He, Kaiming) ; Gkioxari, G (Gkioxari, Georgia) ; Dollár, P (Dollar, Piotr) ; Girshick, R (Girshick, Ross)
Source: IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
WOS:https://webofscience.clarivate.cn/wos/alldb/full-record/WOS:000508386100011

Project repository: matterport/Mask_RCNN

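Research Background

Over the past few years, image recognition has advanced rapidly, especially in object detection (finding which objects appear in an image) and semantic segmentation (classifying every pixel in an image). Much of this progress rests on strong baseline systems, such as Fast/Faster R-CNN for object detection and the Fully Convolutional Network (FCN) for semantic segmentation. These methods are conceptually simple, flexible, and robust, and they train and run quickly.

The authors set out to build a comparable baseline system for instance segmentation. Instance segmentation is harder because it must both find every object in the image and precisely separate each object from the background and from other objects. It therefore combines two classic computer vision tasks: object detection, which classifies individual objects and localizes them with bounding boxes, and semantic segmentation, which assigns each pixel to a category without distinguishing object instances.

At first glance, such a task seems to call for a complex method. The authors instead show that a simple and efficient method can surpass the previous state-of-the-art instance segmentation results. They call it Mask R-CNN. It builds on Faster R-CNN by adding a new branch that predicts a segmentation mask for each region of interest (RoI), in parallel with the existing classification and bounding-box regression branches.

The new mask branch is a small FCN that predicts the segmentation pixel by pixel. Because Mask R-CNN is built on the Faster R-CNN framework, it is relatively simple to implement and train and adapts easily to different architecture designs. The mask branch adds only a small computational cost, so the whole system remains fast.

To preserve the exact correspondence between input pixels and output, the authors also introduce a new layer called RoIAlign. It removes the spatial quantization performed by the traditional RoIPool layer, so features are extracted at more accurate spatial locations.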

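Innovations

Combining object detection and instance segmentation: Mask R-CNN performs both tasks at once. By adding a branch that predicts a segmentation mask on each RoI, it not only recognizes the objects in an image but also delineates the outline of each one. This design lets the model handle complex scenes, such as multiple overlapping or adjacent objects.
Introducing the RoIAlign layer: to address the spatial quantization in the RoIPool layer of Faster R-CNN, Mask R-CNN introduces RoIAlign. RoIAlign uses bilinear interpolation to avoid quantization error in spatial positions, which keeps the correspondence between the input image and the output feature map more precise and improves segmentation quality.
Flexibility and extensibility: the architecture is flexible. It can be paired with different backbone networks (such as ResNet or ResNeXt), and its configuration parameters can be adjusted to suit different application scenarios.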

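Limitations

High computational cost: Mask R-CNN contains many layers and parameters, so training and inference are computationally expensive. This is a challenge for real-time applications and resource-constrained environments. Speed may also drop when processing high-resolution images or images with many objects.
Large data requirements: to perform well, Mask R-CNN needs a large amount of annotated training data. High-quality annotation is time-consuming and costly, especially when pixel-level labels are required.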

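Network Architecture

To support instance segmentation, Mask R-CNN simply adds a mask branch in parallel to Faster R-CNN and replaces RoIPooling with RoIAlign. The overall architecture is shown in the figure below:
[Figure: Mask R-CNN architecture]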

Backbone

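Mask R-CNN offers four backbone choices: ResNet50, ResNet101, ResNet50 + FPN, and ResNet101 + FPN. Depending on the backbone, the way RoIs are generated, how region proposals (RPs) are selected, and which feature map each RP is projected onto all differ, as does the size of the feature map entering the head, as shown in the figure below:

[Figure: comparison of backbone configurations]

Note that the Mask branch differs slightly between Mask R-CNN with and without FPN. With FPN, the class/box branch and the Mask branch do not share a single RoIAlign. During training, RoIAlign pools the proposals from the RPN (Region Proposal Network) to 7x7 for the class/box branch and to 14x14 for the Mask branch.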

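FPN (Feature Pyramid Network)

Basic idea: fuse the feature maps from several stages so that the result carries both high-level semantic features and low-level contour features, giving the RPN feature maps at different scales.

So how are the feature maps from different stages fused?

[Figure: feature maps C1 to C5 and their fusion]

Suppose the stage outputs are C1, C2, C3, C4, and C5, as shown above. The lateral connections apply a 1×1 convolution to change the number of channels. The top-down path on the right upsamples each map so that its size aligns with the feature map of the preceding stage, and the merged maps are used to output predictions.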

# Excerpt from the model-building code; KL is keras.layers.
# Extract the feature map of each stage
_, C2, C3, C4, C5 = resnet_graph(input_image, config.BACKBONE, stage5=True)
# Top-down pathway, starting from P5
# A 1x1 convolution outputs a 256-channel feature map
# C5=32x32x2048, P5=32x32x256
P5 = KL.Conv2D(256, (1, 1), name='fpn_c5p5')(C5)
# C4=64x64x1024, P4=64x64x256
P4 = KL.Add(name="fpn_p4add")([
    KL.UpSampling2D(size=(2, 2), name="fpn_p5upsampled")(P5),
    KL.Conv2D(256, (1, 1), name='fpn_c4p4')(C4)])
# C3=128x128x512, P3=128x128x256
P3 = KL.Add(name="fpn_p3add")([
    KL.UpSampling2D(size=(2, 2), name="fpn_p4upsampled")(P4),
    KL.Conv2D(256, (1, 1), name='fpn_c3p3')(C3)])
# C2=256x256x256, P2=256x256x256
P2 = KL.Add(name="fpn_p2add")([
    KL.UpSampling2D(size=(2, 2), name="fpn_p3upsampled")(P3),
    KL.Conv2D(256, (1, 1), name='fpn_c2p2')(C2)])

# Pass P2, P3, P4, and P5 through a 3x3x256 convolution for further feature extraction
P2 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p2")(P2)
P3 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p3")(P3)
P4 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p4")(P4)
P5 = KL.Conv2D(256, (3, 3), padding="SAME", name="fpn_p5")(P5)

# P6 is only used to generate anchors, so it needs no further processing
P6 = KL.MaxPooling2D(pool_size=(1, 1), strides=2, name="fpn_p6")(P5)

rpn_feature_maps = [P2, P3, P4, P5, P6]
mrcnn_feature_maps = [P2, P3, P4, P5]
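
To make the top-down merge easier to follow, here is a minimal NumPy sketch of the same idea. It is not the repository code: `upsample2x` and `lateral_1x1` are hypothetical helpers, the weights are random, and the spatial sizes are toy values chosen only so the shapes are easy to read.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an [H, W, C] array."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def lateral_1x1(x, out_channels, rng):
    """A 1x1 convolution is a per-pixel linear map over channels (random weights here)."""
    weights = rng.normal(scale=0.01, size=(x.shape[-1], out_channels))
    return x @ weights

rng = np.random.default_rng(0)
# Toy stage outputs: channel counts follow the comments above, spatial sizes are shrunk.
C3 = rng.normal(size=(16, 16, 512))
C4 = rng.normal(size=(8, 8, 1024))
C5 = rng.normal(size=(4, 4, 2048))

P5 = lateral_1x1(C5, 256, rng)
P4 = upsample2x(P5) + lateral_1x1(C4, 256, rng)   # top-down map + lateral map
P3 = upsample2x(P4) + lateral_1x1(C3, 256, rng)
print(P5.shape, P4.shape, P3.shape)  # (4, 4, 256) (8, 8, 256) (16, 16, 256)
```

The element-wise addition only works because the upsampled map and the lateral map share both spatial size and channel count, which is what the 1×1 convolutions guarantee.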

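Generating Default Candidate Boxes (Anchors)

Suppose an anchor has area $A$ and aspect ratio $r$, where the aspect ratio is the width divided by the height, $r = \frac{w}{h}$. For a given area $A$ we have:

$A = w \times h$

Combining this with the definition of the aspect ratio, which gives $h = \frac{w}{r}$, we can write:

$A = w \times \frac{w}{r} = \frac{w^2}{r}$

Solving this equation for the width $w$:

$w^2 = A \times r$
$w = \sqrt{A \times r}$

In the same way, the height $h$ is:

$h = \frac{w}{r} = \frac{\sqrt{A \times r}}{r} = \sqrt{\frac{A}{r}}$

In Mask R-CNN, $\text{scales}$ is the square root of the anchor's base area, that is, $\sqrt{A}$, so the formulas above simplify to:

Width: $w = \text{scales} \times \sqrt{r}$

Height: $h = \frac{\text{scales}}{\sqrt{r}}$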
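
As a quick numerical check of the derivation, the sketch below evaluates the width and height formulas for one scale and three ratios. `anchor_size` is a hypothetical helper written for this note, and the scale of 32 is only an example value.

```python
import math

def anchor_size(scale, ratio):
    """Return (height, width) of an anchor with area scale**2 and ratio = width / height."""
    width = scale * math.sqrt(ratio)
    height = scale / math.sqrt(ratio)
    return height, width

for ratio in (0.5, 1.0, 2.0):
    h, w = anchor_size(32, ratio)
    # The area stays at 32 * 32 = 1024 while the shape changes with the ratio.
    print(f"ratio={ratio}: h={h:.2f}, w={w:.2f}, area={h * w:.1f}, w/h={w / h:.2f}")
```

Every ratio keeps the area at scale squared, which is why a single scale per pyramid level is enough to control anchor size.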

import numpy as np

def generate_anchors(scales, ratios, shape, feature_stride, anchor_stride):
    """
    scales: 1D array of anchor sizes in pixels. Example: [32, 64, 128]
    ratios: 1D array of anchor ratios of width/height. Example: [0.5, 1, 2]
    shape: [height, width] spatial shape of the feature map over which
            to generate anchors.
    feature_stride: Stride of the feature map relative to the image in pixels.
    anchor_stride: Stride of anchors on the feature map. For example, if the
        value is 2 then generate anchors for every other feature map pixel.
    """
    # Build two 2-D grids that enumerate every (scale, ratio) combination
    scales, ratios = np.meshgrid(np.array(scales), np.array(ratios))
    scales = scales.flatten()
    ratios = ratios.flatten()

    # Derive heights and widths from the scale and the ratio
    heights = scales / np.sqrt(ratios)
    widths = scales * np.sqrt(ratios)

    # Anchor center positions mapped back to original-image coordinates
    shifts_y = np.arange(0, shape[0], anchor_stride) * feature_stride
    shifts_x = np.arange(0, shape[1], anchor_stride) * feature_stride
    # Enumerate every (x, y) combination
    shifts_x, shifts_y = np.meshgrid(shifts_x, shifts_y)

    # Enumerate combinations of shifts, widths, and heights
    box_widths, box_centers_x = np.meshgrid(widths, shifts_x)
    box_heights, box_centers_y = np.meshgrid(heights, shifts_y)

    # Reshape to get a list of (y, x) and a list of (h, w)
    box_centers = np.stack(
        [box_centers_y, box_centers_x], axis=2).reshape([-1, 2])
    box_sizes = np.stack([box_heights, box_widths], axis=2).reshape([-1, 2])

    # Offset the centers by half the box size to get the corner coordinates (y1, x1, y2, x2)
    boxes = np.concatenate([box_centers - 0.5 * box_sizes,
                            box_centers + 0.5 * box_sizes], axis=1)
    return boxes

def generate_pyramid_anchors(scales, ratios, feature_shapes, feature_strides,
                             anchor_stride):
    """Generate anchors at different levels of a feature pyramid. Each scale
    is associated with a level of the pyramid, but each ratio is used in
    all levels of the pyramid.

    Returns:
    anchors: [N, (y1, x1, y2, x2)]. All generated anchors in one array. Sorted
        with the same order of the given scales. So, anchors of scale[0] come
        first, then anchors of scale[1], and so on.
    """
    # Anchors
    # [anchor_count, (y1, x1, y2, x2)]
    anchors = []
    # Generate the default anchors for each scale (one pyramid level each) with all ratios
    for i in range(len(scales)):
        anchors.append(generate_anchors(scales[i], ratios, feature_shapes[i],
                                        feature_strides[i], anchor_stride))
    return np.concatenate(anchors, axis=0)

RPN

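The RPN generates high-quality candidate regions (region proposals) from the input image, that is, bounding boxes that may contain a target object. It operates on feature maps and outputs a classification score and bounding-box regression parameters for every anchor.
The RPN runs on the feature maps of different scales provided by the FPN and uses them to generate more accurate candidate regions.

The RPN applies a shared-weight 3×3 convolution to the feature map of each stage before producing its predictions:

shared = KL.Conv2D(512, (3, 3), padding='same', activation='relu',
                   strides=anchor_stride,
                   name='rpn_conv_shared')(feature_map)

The RPN finally outputs three values: logits, probs, and bbox. Here logits are the raw foreground/background scores of each anchor (before softmax), probs are the corresponding foreground/background probabilities, and bbox holds the bounding-box regression deltas of each anchor.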

    # The shared 3x3 convolution extracts features first, followed by 1x1 convolutions
    x = KL.Conv2D(2 * anchors_per_location, (1, 1), padding='valid',
                  activation='linear', name='rpn_class_raw')(shared)

    rpn_class_logits = KL.Lambda(
        lambda t: tf.reshape(t, [tf.shape(t)[0], -1, 2]))(x)

    rpn_probs = KL.Activation(
        "softmax", name="rpn_class_xxx")(rpn_class_logits)

    x = KL.Conv2D(anchors_per_location * 4, (1, 1), padding="valid",
                  activation='linear', name='rpn_bbox_pred')(shared)

    rpn_bbox = KL.Lambda(lambda t: tf.reshape(t, [tf.shape(t)[0], -1, 4]))(x)
    return [rpn_class_logits, rpn_probs, rpn_bbox]
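
The reshapes above can be traced with a small NumPy sketch. The random arrays stand in for the two 1×1 convolution outputs, and the 32×32 feature map with 3 anchors per location is an assumed example, not a value taken from the code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

batch, height, width, anchors_per_location = 1, 32, 32, 3
rng = np.random.default_rng(0)
# Stand-ins for the outputs of the two 1x1 convolutions.
class_raw = rng.normal(size=(batch, height, width, 2 * anchors_per_location))
bbox_raw = rng.normal(size=(batch, height, width, 4 * anchors_per_location))

rpn_class_logits = class_raw.reshape(batch, -1, 2)   # [batch, anchors, (bg, fg)]
rpn_probs = softmax(rpn_class_logits)                # probabilities over (bg, fg)
rpn_bbox = bbox_raw.reshape(batch, -1, 4)            # [batch, anchors, (dy, dx, log(dh), log(dw))]
print(rpn_class_logits.shape, rpn_probs.shape, rpn_bbox.shape)
```

Each spatial location contributes `anchors_per_location` rows, so a 32×32 map with 3 anchors per location yields 3,072 anchors for that level.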

ProposalLayer

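Filter the 200,000+ candidate boxes, first sorting them by foreground score
Keep the 6,000 highest-scoring boxes and apply the regression deltas predicted for each of them
Filter again with NMS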
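
To see where a figure above 200,000 can come from, the sketch below counts anchors across pyramid levels. The 1024×1024 input, the strides 4 to 64, and 3 ratios per location are assumptions made for illustration; `count_pyramid_anchors` is a hypothetical helper, not a repository function.

```python
def count_pyramid_anchors(image_size, feature_strides, ratios_per_location, anchor_stride=1):
    """Count anchors over all pyramid levels for a square image."""
    total = 0
    for stride in feature_strides:
        side = image_size // stride                       # feature map height == width
        positions = len(range(0, side, anchor_stride)) ** 2
        total += positions * ratios_per_location
    return total

n = count_pyramid_anchors(1024, [4, 8, 16, 32, 64], ratios_per_location=3)
print(n)  # 261888
```

Under these assumptions the finest level (stride 4) alone accounts for 196,608 of the anchors, which is why pre-filtering by score before NMS matters.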

        # Foreground scores from the RPN output
        scores = inputs[0][:, :, 1]
        # Box regression deltas
        deltas = inputs[1]
        deltas = deltas * np.reshape(self.config.RPN_BBOX_STD_DEV, [1, 1, 4])
        # All anchors
        anchors = self.anchors

Take the 6,000 highest-scoring candidates and gather their scores, deltas, and anchors:

        pre_nms_limit = min(6000, self.anchors.shape[0])
        ix = tf.nn.top_k(scores, pre_nms_limit, sorted=True,
                         name="top_anchors").indices
        scores = utils.batch_slice([scores, ix], lambda x, y: tf.gather(x, y),
                                   self.config.IMAGES_PER_GPU)
        deltas = utils.batch_slice([deltas, ix], lambda x, y: tf.gather(x, y),
                                   self.config.IMAGES_PER_GPU)
        anchors = utils.batch_slice(ix, lambda x: tf.gather(anchors, x),
                                    self.config.IMAGES_PER_GPU,
                                    names=["pre_nms_anchors"])

Refine the boxes:

boxes = utils.batch_slice([anchors, deltas],
                          lambda x, y: apply_box_deltas_graph(x, y),
                          self.config.IMAGES_PER_GPU,
                          names=["refined_anchors"])

In apply_box_deltas_graph, the regressed deltas are used to adjust the anchors:

def apply_box_deltas_graph(boxes, deltas):
    """Applies the given deltas to the given boxes.
    boxes: [N, 4] where each row is y1, x1, y2, x2
    deltas: [N, 4] where each row is [dy, dx, log(dh), log(dw)]
    """
    # Convert to y, x, h, w
    height = boxes[:, 2] - boxes[:, 0]
    width = boxes[:, 3] - boxes[:, 1]
    center_y = boxes[:, 0] + 0.5 * height
    center_x = boxes[:, 1] + 0.5 * width
    # Apply deltas
    center_y += deltas[:, 0] * height
    center_x += deltas[:, 1] * width
    height *= tf.exp(deltas[:, 2])
    width *= tf.exp(deltas[:, 3])
    # Convert back to y1, x1, y2, x2
    y1 = center_y - 0.5 * height
    x1 = center_x - 0.5 * width
    y2 = y1 + height
    x2 = x1 + width
    result = tf.stack([y1, x1, y2, x2], axis=1, name="apply_box_deltas_out")
    return result
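
A NumPy counterpart with one worked box may help in checking the arithmetic. This is a sketch that mirrors the function above; the anchor and delta values are made up for illustration.

```python
import numpy as np

def apply_box_deltas(boxes, deltas):
    """boxes: [N, (y1, x1, y2, x2)], deltas: [N, (dy, dx, log(dh), log(dw))]."""
    boxes = boxes.astype(np.float64)
    height = boxes[:, 2] - boxes[:, 0]
    width = boxes[:, 3] - boxes[:, 1]
    center_y = boxes[:, 0] + 0.5 * height
    center_x = boxes[:, 1] + 0.5 * width
    # Shifts are expressed as fractions of the box size, scales as log factors.
    center_y = center_y + deltas[:, 0] * height
    center_x = center_x + deltas[:, 1] * width
    height = height * np.exp(deltas[:, 2])
    width = width * np.exp(deltas[:, 3])
    y1 = center_y - 0.5 * height
    x1 = center_x - 0.5 * width
    return np.stack([y1, x1, y1 + height, x1 + width], axis=1)

anchor = np.array([[0.0, 0.0, 10.0, 20.0]])
# Move the center down by 10% of the height and double the width.
delta = np.array([[0.1, 0.0, 0.0, np.log(2.0)]])
print(apply_box_deltas(anchor, delta))  # approximately [[1, -10, 11, 30]]
```

The refined box can extend past the image (here x1 becomes negative), which is why a clipping step follows.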

clip_boxes_graph clips the boxes to the image window, removing any part that falls outside it:

height, width = self.config.IMAGE_SHAPE[:2]
window = np.array([0, 0, height, width]).astype(np.float32)
boxes = utils.batch_slice(boxes,
                          lambda x: clip_boxes_graph(x, window),
                          self.config.IMAGES_PER_GPU,
                          names=["refined_anchors_clipped"])

Normalize:

normalized_boxes = boxes / np.array([[height, width, height, width]])

Apply non-maximum suppression to remove duplicate boxes:

def nms(normalized_boxes, scores):
        indices = tf.image.non_max_suppression(
            normalized_boxes, scores, self.proposal_count,
            self.nms_threshold, name="rpn_non_max_suppression")
        proposals = tf.gather(normalized_boxes, indices)
        # Pad if needed
        padding = tf.maximum(self.proposal_count - tf.shape(proposals)[0], 0)
        proposals = tf.pad(proposals, [(0, padding), (0, 0)])
        return proposals
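
For intuition about what the suppression step does, here is a minimal greedy NMS in NumPy. It is a sketch of the general technique, not the TensorFlow implementation; `nms_numpy` is a hypothetical name, and the boxes, scores, and threshold of 0.7 are example values.

```python
import numpy as np

def nms_numpy(boxes, scores, max_outputs, iou_threshold):
    """Greedy NMS over boxes [N, (y1, x1, y2, x2)]; returns the kept indices."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)            # highest score first
    keep = []
    while order.size > 0 and len(keep) < max_outputs:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        y1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        x1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        y2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        x2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.maximum(y2 - y1, 0) * np.maximum(x2 - x1, 0)
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]  # drop boxes that overlap the kept box too much
    return keep

boxes = np.array([[0.0, 0.0, 0.5, 0.5],
                  [0.0, 0.0, 0.5, 0.45],   # IoU 0.9 with box 0
                  [0.5, 0.5, 1.0, 1.0]])   # disjoint from box 0
scores = np.array([0.9, 0.8, 0.7])
print(nms_numpy(boxes, scores, max_outputs=2000, iou_threshold=0.7))  # [0, 2]
```

Box 1 is removed because it overlaps the higher-scoring box 0 with an IoU of 0.9, above the threshold.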

DetectionTargetLayer

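The previous step produced 2,000 RoIs; remove the ones that are only padding
Some datasets contain boxes that cover several objects (crowd boxes); exclude them
Label positives and negatives from the IoU between each RoI and the GT boxes, using the default threshold of 0.5
Sample about three negatives for every positive (controlled by ROI_POSITIVE_RATIO); the total per image is set by TRAIN_ROIS_PER_IMAGE (400 in this setup)

Remove the padding: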

    proposals, _ = trim_zeros_graph(proposals, name="trim_proposals")
    gt_boxes, non_zeros = trim_zeros_graph(gt_boxes, name="trim_gt_boxes")
    gt_class_ids = tf.boolean_mask(gt_class_ids, non_zeros,
                                   name="trim_gt_class_ids")
    gt_masks = tf.gather(gt_masks, tf.where(non_zeros)[:, 0], axis=2,
                         name="trim_gt_masks")

Handle crowd annotations in the COCO dataset (a single box covering multiple instances):

    crowd_ix = tf.where(gt_class_ids < 0)[:, 0]
    non_crowd_ix = tf.where(gt_class_ids > 0)[:, 0]
    crowd_boxes = tf.gather(gt_boxes, crowd_ix)
    crowd_masks = tf.gather(gt_masks, crowd_ix, axis=2)
    gt_class_ids = tf.gather(gt_class_ids, non_crowd_ix)
    gt_boxes = tf.gather(gt_boxes, non_crowd_ix)
    gt_masks = tf.gather(gt_masks, non_crowd_ix, axis=2)

Compute the overlaps (IoU):

overlaps = overlaps_graph(proposals, gt_boxes)

Treat samples with IoU >= 0.5 as positives:

    positive_roi_bool = (roi_iou_max >= 0.5)
    positive_indices = tf.where(positive_roi_bool)[:, 0]

Treat samples with IoU < 0.5 as negatives:

negative_indices = tf.where(tf.logical_and(roi_iou_max < 0.5, no_crowd_bool))[:, 0]
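
The IoU computation itself is not shown in the excerpts, so here is a minimal NumPy sketch of a pairwise IoU matrix followed by the 0.5 split. `iou_matrix` is a hypothetical stand-in for overlaps_graph, crowd handling is left out, and the boxes are made-up examples.

```python
import numpy as np

def iou_matrix(boxes1, boxes2):
    """IoU between every box in boxes1 [N, 4] and boxes2 [M, 4], in (y1, x1, y2, x2) format."""
    b1 = boxes1[:, None, :]   # [N, 1, 4]
    b2 = boxes2[None, :, :]   # [1, M, 4]
    y1 = np.maximum(b1[..., 0], b2[..., 0])
    x1 = np.maximum(b1[..., 1], b2[..., 1])
    y2 = np.minimum(b1[..., 2], b2[..., 2])
    x2 = np.minimum(b1[..., 3], b2[..., 3])
    inter = np.maximum(y2 - y1, 0) * np.maximum(x2 - x1, 0)
    area1 = (b1[..., 2] - b1[..., 0]) * (b1[..., 3] - b1[..., 1])
    area2 = (b2[..., 2] - b2[..., 0]) * (b2[..., 3] - b2[..., 1])
    return inter / (area1 + area2 - inter)

proposals = np.array([[0.0, 0.0, 0.4, 0.4],
                      [0.1, 0.1, 0.5, 0.5],
                      [0.6, 0.6, 0.9, 0.9]])
gt_boxes = np.array([[0.0, 0.0, 0.4, 0.4]])
overlaps = iou_matrix(proposals, gt_boxes)          # shape [3, 1]
roi_iou_max = overlaps.max(axis=1)                  # best GT match per proposal
positive_indices = np.where(roi_iou_max >= 0.5)[0]
negative_indices = np.where(roi_iou_max < 0.5)[0]
print(positive_indices, negative_indices)
```

The second proposal overlaps the GT box with an IoU of about 0.39, so it falls on the negative side of the threshold even though it covers part of the object.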

Set the numbers of positives and negatives at a nominal 1:3 ratio:

positive_count = int(config.TRAIN_ROIS_PER_IMAGE *
                     config.ROI_POSITIVE_RATIO)
positive_indices = tf.random_shuffle(positive_indices)[:positive_count]
positive_count = tf.shape(positive_indices)[0]
r = 1.0 / config.ROI_POSITIVE_RATIO
negative_count = tf.cast(r * tf.cast(positive_count, tf.float32), tf.int32) - positive_count
negative_indices = tf.random_shuffle(negative_indices)[:negative_count]
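
The count arithmetic can be replayed in plain Python. The values 400 and 0.25 below are illustrative assumptions chosen to give an exact 1:3 split; the actual numbers come from config.TRAIN_ROIS_PER_IMAGE and config.ROI_POSITIVE_RATIO, and `sample_counts` is a hypothetical helper.

```python
def sample_counts(train_rois_per_image, roi_positive_ratio, num_positive_available):
    """Mirror the count arithmetic: cap the positives, then derive the negative count."""
    positive_cap = int(train_rois_per_image * roi_positive_ratio)
    positive_count = min(positive_cap, num_positive_available)
    r = 1.0 / roi_positive_ratio
    negative_count = int(r * positive_count) - positive_count
    return positive_count, negative_count

# Plenty of positives available: the cap applies.
print(sample_counts(400, 0.25, num_positive_available=500))  # (100, 300)
# Only 30 positives available: negatives shrink to keep the same ratio.
print(sample_counts(400, 0.25, num_positive_available=30))   # (30, 90)
```

Because the negative count is derived from the number of positives actually found, an image with few positives yields fewer than TRAIN_ROIS_PER_IMAGE RoIs, and the remainder is filled with zero padding later. The slice on negative_indices also means the result is further limited by how many negatives exist.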

Assign targets to the positive and negative samples:

    positive_rois = tf.gather(proposals, positive_indices)
    negative_rois = tf.gather(proposals, negative_indices)

    # Assign positive ROIs to GT boxes.
    positive_overlaps = tf.gather(overlaps, positive_indices)
    roi_gt_box_assignment = tf.argmax(positive_overlaps, axis=1)
    roi_gt_boxes = tf.gather(gt_boxes, roi_gt_box_assignment)
    roi_gt_class_ids = tf.gather(gt_class_ids, roi_gt_box_assignment)

    # Compute bbox refinement for positive ROIs
    deltas = utils.box_refinement_graph(positive_rois, roi_gt_boxes)
    deltas /= config.BBOX_STD_DEV

    # Assign positive ROIs to GT masks
    # Permute masks to [N, height, width, 1]
    transposed_masks = tf.expand_dims(tf.transpose(gt_masks, [2, 0, 1]), -1)
    # Pick the right mask for each ROI
    roi_masks = tf.gather(transposed_masks, roi_gt_box_assignment)

    # Compute mask targets
    boxes = positive_rois
    if config.USE_MINI_MASK:
        # Transform ROI coordinates from normalized image space
        # to normalized mini-mask space.
        y1, x1, y2, x2 = tf.split(positive_rois, 4, axis=1)
        gt_y1, gt_x1, gt_y2, gt_x2 = tf.split(roi_gt_boxes, 4, axis=1)
        gt_h = gt_y2 - gt_y1
        gt_w = gt_x2 - gt_x1
        y1 = (y1 - gt_y1) / gt_h
        x1 = (x1 - gt_x1) / gt_w
        y2 = (y2 - gt_y1) / gt_h
        x2 = (x2 - gt_x1) / gt_w
        boxes = tf.concat([y1, x1, y2, x2], 1)
    box_ids = tf.range(0, tf.shape(roi_masks)[0])
    masks = tf.image.crop_and_resize(tf.cast(roi_masks, tf.float32), boxes,
                                     box_ids,
                                     config.MASK_SHAPE)
    # Remove the extra dimension from masks.
    masks = tf.squeeze(masks, axis=3)

    # Threshold mask pixels at 0.5 to have GT masks be 0 or 1 to use with
    # binary cross entropy loss.
    masks = tf.round(masks)

    # Append negative ROIs and pad bbox deltas and masks that
    # are not used for negative ROIs with zeros.
    rois = tf.concat([positive_rois, negative_rois], axis=0)
    N = tf.shape(negative_rois)[0]
    P = tf.maximum(config.TRAIN_ROIS_PER_IMAGE - tf.shape(rois)[0], 0)
    rois = tf.pad(rois, [(0, P), (0, 0)])
    roi_gt_boxes = tf.pad(roi_gt_boxes, [(0, N + P), (0, 0)])
    roi_gt_class_ids = tf.pad(roi_gt_class_ids, [(0, N + P)])
    deltas = tf.pad(deltas, [(0, N + P), (0, 0)])
    masks = tf.pad(masks, [[0, N + P], (0, 0), (0, 0)])

    return rois, roi_gt_class_ids, deltas, masks

ROIAlign

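RoIAlign was proposed to fix the region misalignment of RoIPooling in Faster R-CNN. The misalignment comes from the rounding operations inside RoIPooling. RoIPooling is an essential step in Faster R-CNN: it produces a fixed-length feature vector, which is required before softmax can be applied to compute the classification loss.

Suppose an 800×800 image passes through a convolutional network with five downsampling steps, producing a 25×25 feature map. Suppose also that the RoI measures 600×500. Scaled onto the feature map, it covers $\frac{600}{32} \times \frac{500}{32} = 18.75 \times 15.625$. Since these values are not integers, RoIPooling rounds down, so the RoI's feature map becomes 18 × 15. This is the first misalignment.

The next step of RoIPooling divides the feature map into bins. If we need 7 × 7 bins, each bin measures $\frac{18}{7} \times \frac{15}{7}$, which again is not an integer. RoIPooling rounds down once more, so each bin is 2 × 2 and the RoI's feature map as a whole shrinks to 14 × 14. This is the second misalignment.

Compared with the feature map before RoIPooling, RoIPooling introduces errors of 4.75 and 1.625 along the two sides (18.75 - 14 and 15.625 - 14), measured in feature-map cells. For classification or object detection, a shift of a few cells may not affect the result much, but segmentation has to be accurate down to individual pixels, so RoIPooling is not suitable for Mask R-CNN.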
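
The two rounding steps can be replayed in a few lines of Python. This only reproduces the arithmetic of the example above; the variable names are chosen for this note.

```python
import math

stride = 32                       # five 2x downsampling steps: 2 ** 5
roi_h, roi_w = 600, 500           # RoI size in image pixels
bins = 7                          # the output grid is 7 x 7

exact = (roi_h / stride, roi_w / stride)                        # (18.75, 15.625)
quantized = (math.floor(exact[0]), math.floor(exact[1]))        # first rounding: (18, 15)
bin_size = (quantized[0] // bins, quantized[1] // bins)         # second rounding: (2, 2)
covered = (bin_size[0] * bins, bin_size[1] * bins)              # (14, 14)
error_cells = (exact[0] - covered[0], exact[1] - covered[1])    # (4.75, 1.625)
error_pixels = (error_cells[0] * stride, error_cells[1] * stride)
print(exact, quantized, bin_size, covered, error_cells, error_pixels)
```

Mapped back to the input image, the errors of 4.75 and 1.625 cells correspond to 152 and 52 pixels, which shows how large the misalignment is at mask resolution.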

To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations.

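To solve this problem, the authors propose RoIAlign as a replacement for RoIPool, which preserves more precise spatial location information.

The authors report that replacing RoIPool with RoIAlign improves mask accuracy by a relative 10% to 50%.

RoIAlign involves no rounding and works in floating point throughout. The steps are:

Compute the side lengths of the RoI without rounding;
Divide the RoI evenly into k × k bins, without rounding the bin size;
Compute the value at each sampling point in a bin by bilinear interpolation from the four nearest feature-map values;
Apply max pooling or average pooling to obtain a fixed-length feature vector.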

    # ROI Pooling
    # Shape: [batch, num_boxes, pool_height, pool_width, channels]
    x = PyramidROIAlign([pool_size, pool_size], image_shape,
                        name="roi_align_classifier")([rois] + feature_maps)
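
The RoIAlign steps listed earlier can be sketched for a single-channel feature map as follows. This is an illustrative NumPy version, not the PyramidROIAlign layer: it assumes feature values sit at integer coordinates, uses 2×2 sampling points per bin with average pooling, and the names `bilinear` and `roi_align` are hypothetical.

```python
import numpy as np

def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at the real-valued point (y, x)."""
    h, w = feature.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0, x0] * (1 - dx) + feature[y0, x1] * dx
    bottom = feature[y1, x0] * (1 - dx) + feature[y1, x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feature, roi, out_size=7, samples=2):
    """roi = (y1, x1, y2, x2) in feature-map coordinates; nothing is rounded."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size), dtype=np.float64)
    for i in range(out_size):
        for j in range(out_size):
            values = []
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sampling points inside bin (i, j).
                    py = y1 + (i + (sy + 0.5) / samples) * bin_h
                    px = x1 + (j + (sx + 0.5) / samples) * bin_w
                    values.append(bilinear(feature, py, px))
            out[i, j] = np.mean(values)   # average pooling over the sampling points
    return out

feature = np.arange(25 * 25, dtype=np.float64).reshape(25, 25)
pooled = roi_align(feature, roi=(0.0, 0.0, 18.75, 15.625))
print(pooled.shape)  # (7, 7)
```

With the 18.75 × 15.625 RoI from the RoIPooling example, each bin is about 2.68 × 2.23 cells and the full RoI is covered, instead of being cut down to 14 × 14.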

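Loss Function

The Mask R-CNN loss is the Faster R-CNN loss plus the loss of the Mask branch:
$Loss = L_{\text{rpn}} + L_{\text{faster\_rcnn}} + L_{\text{mask}}$

Here $L_{\text{faster\_rcnn}}$ denotes the classification and box-regression losses of the detection head, and the Mask branch loss $L_{\text{mask}}$ is the binary cross-entropy (BCE) loss.

The logits predicted by the network hold, for each proposal, a mask for every class (every predicted mask is 28x28). The proposals fed into this branch are all positive samples (sampled in the Fast R-CNN stage), so their GT information (box, cls) is known.

As the figure below shows, suppose the RPN yields a proposal (the black rectangle in the figure). RoIAlign extracts its features (shape 14x14xC), and the Mask branch then predicts a mask for every class, giving the logits in the figure (after the sigmoid activation, every value lies between 0 and 1). From the positive/negative matching in the Fast R-CNN branch we know that the proposal's GT class is cat, so we take the predicted mask for the cat class (shape 28x28) out of the logits. We then crop the GT mask of the original image with the proposal and resize the crop to 28x28, which gives the GT mask in the figure (1 inside the object, 0 in the background). Finally, we compute the BCELoss (BinaryCrossEntropyLoss) between the predicted cat mask and the GT mask.

[Figure: computing the mask loss]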
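
The mask loss for a single positive proposal can be sketched in NumPy as below. The class count, the class index used for "cat", and the synthetic logits are assumptions made for illustration; `mask_bce_loss` is a hypothetical helper, not a repository function.

```python
import numpy as np

def mask_bce_loss(mask_logits, gt_class_id, gt_mask):
    """mask_logits: [28, 28, num_classes]; gt_mask: [28, 28] with values 0 or 1."""
    logits = mask_logits[:, :, gt_class_id]          # keep only the GT class channel
    probs = 1.0 / (1.0 + np.exp(-logits))            # per-pixel sigmoid
    eps = 1e-7
    probs = np.clip(probs, eps, 1.0 - eps)           # avoid log(0)
    bce = -(gt_mask * np.log(probs) + (1.0 - gt_mask) * np.log(1.0 - probs))
    return float(bce.mean())

rng = np.random.default_rng(0)
num_classes, cat_id = 3, 1                           # hypothetical class index for "cat"
gt_mask = np.zeros((28, 28))
gt_mask[8:20, 8:20] = 1.0                            # object region is 1, background is 0
logits = rng.normal(size=(28, 28, num_classes))
logits[:, :, cat_id] = np.where(gt_mask == 1.0, 6.0, -6.0)   # confident, correct prediction
print(mask_bce_loss(logits, cat_id, gt_mask))
```

Only the channel of the GT class enters the loss, so the masks predicted for other classes do not compete with it. This is the reason a per-pixel sigmoid is used instead of a softmax across classes.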

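Experimental Results

[Figure: experimental results]
MNC and FCIS won the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale training/testing, horizontal-flip testing, and OHEM.

Conclusion

Mask R-CNN, proposed by Kaiming He et al. in 2017, extends Faster R-CNN into a framework for instance segmentation. It keeps the detection functions (classification and bounding-box regression) and adds a branch that predicts a pixel-level segmentation mask for each region of interest (RoI), forming a multi-task architecture. To keep features spatially aligned, it introduces the RoIAlign layer, which uses bilinear interpolation to avoid quantization error and yields more precise feature mapping. Despite the added mask branch, Mask R-CNN remains simple and flexible, is easy to implement and train, and adds only a small computational overhead. It surpassed the state-of-the-art instance segmentation results of its time on several benchmark datasets and became a standard reference method in the field. Mask R-CNN shows how a simple modification can deliver a substantial improvement, which underlines the value of simple design.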
