AugFPN
1. Core Ideas of the Paper
Building on FPN, the paper proposes three improvements, each compensating for a kind of information loss introduced by the fusion process:
1. Consistent Supervision: reduces the semantic gap between different scales (compensates for the semantic information lost when adjacent features are fused).
2. Residual Feature Augmentation: reduces the information loss in fusing (summing) features of different scales (compensates for the information the topmost level loses to channel reduction before fusion).
3. Soft RoI Selection: extracts better RoI features from the feature pyramid for classification (compensates for the information lost by taking RoI features from a single pyramid level).
2. Consistent Supervision
In FPN, feature maps at adjacent scales are adapted by convolutions and then summed directly. This fusion ignores the semantic gap between levels, which can leave the final feature pyramid suboptimal.
AugFPN therefore attaches auxiliary detection and classification heads (RPN head + R-CNN head) directly to each pre-fusion feature.
During training, roughly following the paper's formulation, the network loss becomes:
loss = lambda * (pre-fusion classification loss + beta * pre-fusion localization loss) + post-fusion classification loss + beta * post-fusion localization loss
where lambda weights the auxiliary supervision on the pre-fusion heads and beta balances localization against classification.
Moreover, the pre-fusion heads share their weights across scales, which benefits the supervision of different scales: 1) it further strengthens the correlation between features at different scales; and 2) it drives the lower-level features to learn richer semantic information (guided down from the higher levels).
At inference time, all of these shared pre-fusion heads can be removed, so Consistent Supervision adds no test-time cost.
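How the pieces combine during training can be sketched as follows; the head objects and their return values are illustrative assumptions, not the AugFPN repo's actual API (raw_laterals is what HighFPN in Section 5 returns alongside the final pyramid):
# Sketch of the Consistent Supervision loss (hypothetical helper, not the
# AugFPN repo's API). aux_head is a single R-CNN head whose weights are
# shared across all pre-fusion levels; main_head runs on the final pyramid.
def consistent_supervision_loss(aux_head, main_head, raw_laterals,
                                pyramid_feats, targets, lam=0.5, beta=1.0):
    aux_cls = aux_loc = 0.0
    for feat in raw_laterals:  # the same (shared) head on every M_i
        cls_loss, loc_loss = aux_head(feat, targets)
        aux_cls, aux_loc = aux_cls + cls_loss, aux_loc + loc_loss
    main_cls, main_loc = main_head(pyramid_feats, targets)
    # lam weights the auxiliary branch; beta balances loc vs. cls.
    return lam * (aux_cls + beta * aux_loc) + main_cls + beta * main_loc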
3. Residual Feature Augmentation
In the original FPN, M5's channel count is reduced by its 1x1 lateral convolution, yet unlike the other levels it has no higher-level feature fused into it to make up for what was lost. AugFPN proposes Residual Feature Augmentation to compensate for the information lost at M5: C5 is pooled to several scales by ratio-invariant adaptive pooling, each pooled map is projected to the common channel width and upsampled back to C5's size, and an Adaptive Spatial Fusion (ASF) module combines them into a residual M6 that is added onto M5, as sketched below.
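Condensing that step out of the full HighFPN code in Section 5 (the pool ratios and tensor flow match that code; lateral_convs and asf stand in for the per-ratio 1x1 projections and the attention stack defined there):
import torch
import torch.nn.functional as F

def residual_feature_augmentation(c5, lateral_convs, asf,
                                  ratios=(0.1, 0.2, 0.3)):
    # Ratio-invariant adaptive pooling: pool C5 at several ratios, project
    # each result to the pyramid width, and upsample back to C5's size.
    h, w = c5.size(2), c5.size(3)
    pooled = [
        F.interpolate(
            conv(F.adaptive_avg_pool2d(
                c5, (max(1, int(h * r)), max(1, int(w * r))))),
            size=(h, w), mode='bilinear', align_corners=True)
        for r, conv in zip(ratios, lateral_convs)
    ]
    # Adaptive Spatial Fusion: one sigmoid weight map per pooled scale,
    # then a weighted sum produces the residual M6 added to M5.
    weights = torch.sigmoid(asf(torch.cat(pooled, dim=1)))
    return sum(weights[:, i:i + 1] * pooled[i] for i in range(len(pooled)))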
4. Soft RoI Selection
In FPN, each RoI's features are pooled from one specific feature level, chosen heuristically from the RoI's scale (in FPN, k = floor(k0 + log2(sqrt(w*h)/224))): smaller RoIs are assigned to lower-level features and larger RoIs to higher-level features.
Soft RoI Selection instead introduces adaptive weights to better measure the importance of the features within the RoI region at each level. The final RoI feature is generated from these adaptive weights rather than by a hard selection such as level assignment or a max operation (as in PANet).
Concretely, Soft RoI Selection first pools each RoI's features from every pyramid level, then uses an Adaptive Spatial Fusion (ASF) module to fuse them adaptively: it generates a different spatial weight map for each level's RoI feature and fuses the features by weighted aggregation.
Its implementation is as follows:
import torch
import torch.nn as nn
from mmcv.cnn import xavier_init

from mmdet import ops
from ..registry import ROI_EXTRACTORS


@ROI_EXTRACTORS.register_module
class SoftRoIExtractor(nn.Module):
    """Extract RoI features from all feature map levels.

    If there are multiple input feature levels, each RoI is pooled from
    every level and the results are fused with adaptive spatial weights.

    Args:
        roi_layer (dict): Specify RoI layer type and arguments.
        out_channels (int): Output channels of RoI layers.
        featmap_strides (int): Strides of input feature maps.
        finest_scale (int): Scale threshold of mapping to level 0.
    """

    def __init__(self,
                 roi_layer,
                 out_channels,
                 featmap_strides,
                 finest_scale=56):
        super(SoftRoIExtractor, self).__init__()
        self.roi_layers = self.build_roi_layers(roi_layer, featmap_strides)
        self.out_channels = out_channels
        self.featmap_strides = featmap_strides
        self.finest_scale = finest_scale
        # ASF: squeeze the concatenated per-level RoI features with a 1x1
        # conv, then predict one spatial weight map per pyramid level.
        self.spatial_attention_conv = nn.Sequential(
            nn.Conv2d(out_channels * len(featmap_strides), out_channels, 1),
            nn.ReLU(),
            nn.Conv2d(out_channels, len(featmap_strides), 3, padding=1))

    @property
    def num_inputs(self):
        """int: Input feature map levels."""
        return len(self.featmap_strides)

    def init_weights(self):
        for m in self.spatial_attention_conv.modules():
            if isinstance(m, nn.Conv2d):
                xavier_init(m, distribution='uniform')

    def map_roi_levels(self, rois, num_levels):
        """Map rois to corresponding feature levels by scale.

        - scale < finest_scale: level 0
        - finest_scale <= scale < finest_scale * 2: level 1
        - finest_scale * 2 <= scale < finest_scale * 4: level 2
        - scale >= finest_scale * 4: level 3

        Args:
            rois (Tensor): Input RoIs, shape (k, 5).
            num_levels (int): Total level number.

        Returns:
            Tensor: Level index (0-based) of each RoI, shape (k, )
        """
        scale = torch.sqrt(
            (rois[:, 3] - rois[:, 1] + 1) * (rois[:, 4] - rois[:, 2] + 1))
        target_lvls = torch.floor(torch.log2(scale / self.finest_scale + 1e-6))
        target_lvls = target_lvls.clamp(min=0, max=num_levels - 1).long()
        return target_lvls

    def build_roi_layers(self, layer_cfg, featmap_strides):
        cfg = layer_cfg.copy()
        layer_type = cfg.pop('type')
        assert hasattr(ops, layer_type)
        layer_cls = getattr(ops, layer_type)
        roi_layers = nn.ModuleList(
            [layer_cls(spatial_scale=1 / s, **cfg) for s in featmap_strides])
        return roi_layers

    def forward(self, feats, rois):
        if len(feats) == 1:
            return self.roi_layers[0](feats[0], rois)
        out_size = self.roi_layers[0].out_size
        num_levels = len(feats)
        roi_feats = feats[0].new_zeros(rois.size(0), self.out_channels,
                                       out_size, out_size)
        # Pool every RoI from all pyramid levels (no hard level assignment).
        roi_feats_list = [
            self.roi_layers[i](feats[i], rois) for i in range(num_levels)
        ]
        concat_roi_feats = torch.cat(roi_feats_list, dim=1)
        spatial_attention_map = self.spatial_attention_conv(concat_roi_feats)
        # Fuse the per-level RoI features by sigmoid-weighted summation.
        for i in range(num_levels):
            roi_feats += (torch.sigmoid(
                spatial_attention_map[:, i, None, :, :]) * roi_feats_list[i])
        return roi_feats
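In an mmdetection-v1-style config, the extractor would be selected roughly like this (the field values are illustrative, not copied from the repo's configs):
# Illustrative mmdetection-v1-style config entry for the extractor.
bbox_roi_extractor = dict(
    type='SoftRoIExtractor',
    roi_layer=dict(type='RoIAlign', out_size=7, sample_num=2),
    out_channels=256,
    featmap_strides=[4, 8, 16, 32])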
5. Network Architecture
The HighFPN neck (AugFPN's FPN replacement, which hosts Residual Feature Augmentation and returns the pre-fusion laterals for Consistent Supervision) is implemented as follows:
import torch
import torch.nn as nn
import torch.nn.functional as F
from mmcv.cnn import xavier_init

from ..registry import NECKS
from ..utils import ConvModule


@NECKS.register_module
class HighFPN(nn.Module):

    def __init__(self,
                 in_channels,
                 out_channels,
                 num_outs,
                 pool_ratios=[0.1, 0.2, 0.3],
                 start_level=0,
                 end_level=-1,
                 add_extra_convs=False,
                 normalize=None,
                 activation=None):
        super(HighFPN, self).__init__()
        assert isinstance(in_channels, list)
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.num_ins = len(in_channels)
        self.num_outs = num_outs
        self.activation = activation
        self.with_bias = normalize is None

        if end_level == -1:
            self.backbone_end_level = self.num_ins
            assert num_outs >= self.num_ins - start_level
        else:
            # if end_level < inputs, no extra level is allowed
            self.backbone_end_level = end_level
            assert end_level <= len(in_channels)
            assert num_outs == end_level - start_level
        self.start_level = start_level
        self.end_level = end_level
        self.add_extra_convs = add_extra_convs

        self.lateral_convs = nn.ModuleList()
        self.fpn_convs = nn.ModuleList()
        for i in range(self.start_level, self.backbone_end_level):
            l_conv = ConvModule(
                in_channels[i],
                out_channels,
                1,
                padding=0,
                normalize=normalize,
                bias=self.with_bias,
                activation=self.activation,
                inplace=False)
            fpn_conv = ConvModule(
                out_channels,
                out_channels,
                3,
                padding=1,
                normalize=normalize,
                bias=self.with_bias,
                activation=self.activation,
                inplace=False)
            self.lateral_convs.append(l_conv)
            self.fpn_convs.append(fpn_conv)

        # add lateral convs for the features generated by ratio-invariant
        # adaptive pooling (Residual Feature Augmentation)
        self.adaptive_pool_output_ratio = pool_ratios
        self.high_lateral_conv = nn.ModuleList()
        self.high_lateral_conv.extend([
            nn.Conv2d(in_channels[-1], out_channels, 1)
            for k in range(len(self.adaptive_pool_output_ratio))
        ])
        # Adaptive Spatial Fusion: one spatial weight map per pool ratio
        self.high_lateral_conv_attention = nn.Sequential(
            nn.Conv2d(out_channels * len(self.adaptive_pool_output_ratio),
                      out_channels, 1),
            nn.ReLU(),
            nn.Conv2d(out_channels, len(self.adaptive_pool_output_ratio), 3,
                      padding=1))

        # add extra conv layers (e.g., RetinaNet)
        extra_levels = num_outs - self.backbone_end_level + self.start_level
        if add_extra_convs and extra_levels >= 1:
            for i in range(extra_levels):
                in_channels = (self.in_channels[self.backbone_end_level - 1]
                               if i == 0 else out_channels)
                extra_fpn_conv = ConvModule(
                    in_channels,
                    out_channels,
                    3,
                    stride=2,
                    padding=1,
                    normalize=normalize,
                    bias=self.with_bias,
                    activation=self.activation,
                    inplace=False)
                self.fpn_convs.append(extra_fpn_conv)

    # default init_weights for conv(msra) and norm in ConvModule
    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                xavier_init(m, distribution='uniform')
        for m in self.high_lateral_conv_attention.modules():
            if isinstance(m, nn.Conv2d):
                xavier_init(m, distribution='uniform')

    def forward(self, inputs):
        assert len(inputs) == len(self.in_channels)

        # build laterals
        laterals = [
            lateral_conv(inputs[i + self.start_level])
            for i, lateral_conv in enumerate(self.lateral_convs)
        ]

        # Residual Feature Augmentation
        h, w = inputs[-1].size(2), inputs[-1].size(3)
        # Ratio-invariant Adaptive Pooling: pool C5 to several scales,
        # project each to out_channels, and upsample back to (h, w)
        AdapPool_Features = [
            F.interpolate(
                self.high_lateral_conv[j](
                    F.adaptive_avg_pool2d(
                        inputs[-1],
                        output_size=(
                            max(1, int(h * self.adaptive_pool_output_ratio[j])),
                            max(1, int(w * self.adaptive_pool_output_ratio[j]))
                        ))),
                size=(h, w),
                mode='bilinear',
                align_corners=True)
            for j in range(len(self.adaptive_pool_output_ratio))
        ]
        Concat_AdapPool_Features = torch.cat(AdapPool_Features, dim=1)
        fusion_weights = self.high_lateral_conv_attention(
            Concat_AdapPool_Features)
        fusion_weights = torch.sigmoid(fusion_weights)
        adap_pool_fusion = 0
        for i in range(len(self.adaptive_pool_output_ratio)):
            adap_pool_fusion += torch.unsqueeze(
                fusion_weights[:, i, :, :], dim=1) * AdapPool_Features[i]

        # keep the pre-fusion laterals for Consistent Supervision
        raw_laterals = [laterals[i].clone() for i in range(len(laterals))]

        # build top-down path (the fused residual M6 is added to the top level)
        laterals[-1] += adap_pool_fusion
        used_backbone_levels = len(laterals)
        for i in range(used_backbone_levels - 1, 0, -1):
            laterals[i - 1] += F.interpolate(
                laterals[i], scale_factor=2, mode='nearest')

        # build outputs
        # part 1: from original levels
        outs = [
            self.fpn_convs[i](laterals[i]) for i in range(used_backbone_levels)
        ]
        # part 2: add extra levels
        if self.num_outs > len(outs):
            # use max pool to get more levels on top of outputs
            # (e.g., Faster R-CNN, Mask R-CNN)
            if not self.add_extra_convs:
                for i in range(self.num_outs - used_backbone_levels):
                    outs.append(F.max_pool2d(outs[-1], 1, stride=2))
            # add conv layers on top of original feature maps (RetinaNet)
            else:
                orig = inputs[self.backbone_end_level - 1]
                outs.append(self.fpn_convs[used_backbone_levels](orig))
                for i in range(used_backbone_levels + 1, self.num_outs):
                    # BUG: we should add relu before each extra conv
                    outs.append(self.fpn_convs[i](outs[-1]))
        return tuple(outs), tuple(raw_laterals)
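A quick shape check of the neck, assuming the mmdetection-v1 ConvModule and registry imports above resolve; the input shapes below mimic a ResNet-50 backbone on a 256x256 image:
import torch

# Smoke test: ResNet-50-style C2-C5 feature maps for a 256x256 input.
neck = HighFPN(in_channels=[256, 512, 1024, 2048], out_channels=256,
               num_outs=5)
neck.init_weights()
feats = [torch.randn(1, c, s, s)
         for c, s in zip([256, 512, 1024, 2048], [64, 32, 16, 8])]
outs, raw_laterals = neck(feats)
assert len(outs) == 5            # P2-P6 for the detector heads
assert len(raw_laterals) == 4    # pre-fusion M2-M5 for Consistent Supervision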
6. References
AugFPN: Improving Multi-scale Feature Learning for Object Detection (CVPR 2020)
AugFPN source code: https://github.com/Gus-Guo/AugFPN
Paper notes: AugFPN: Improving Multi-scale Feature Learning (Zhihu)