
LSTA: Long Short-Term Attention for Egocentric Action Recognition

article 2025/3/17 8:45:37

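Contents

Abstract
1. Introduction
2. Framework
- 2.1 LSTA
  - 2.1.2 Attention pooling
    - 2.1.2.1 Selector definition
    - 2.1.2.2 Attention update
  - 2.1.3 Output pooling
    - 2.1.3.1 Memory state filtering
    - 2.1.3.2 Output gate update
  - 2.1.4 Code for the LSTA cell
- 2.2 Dual-stream framework
  - 2.2.1 Attention in the appearance stream
  - 2.2.2 Attention in the motion stream
  - 2.2.3 Cross-modal fusion
3. Innovations and limitations
- 3.1 Innovations
- 3.2 Limitations
References
Summary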

Abstract

LSTA targets egocentric (first-person) action recognition and proposes a dual-stream attention architecture that combines global temporal modeling with a focus on local regions. The model extends the ConvLSTM cell with an attention pooling module and an output pooling module. The attention pooling module uses a class-mapping selector built on ResNet-34 to generate spatial weight maps dynamically, suppressing background interference and emphasizing the regions where the hands manipulate objects. The output pooling module filters the memory state to retain the key motion trajectories from earlier frames. A dual-stream framework processes RGB appearance and optical-flow motion separately: the appearance stream extracts object-interaction features with a pretrained ResNet-34, while the motion stream, also built on ResNet-34, adapts its input convolutional layer to the 10-channel optical-flow input by averaging the pretrained weights across channels and replicating the result. Both branches are trained in two stages. In addition, a cross-modal interaction module injects motion-stream features as a bias into the gate computation of the appearance stream, and uses 3D convolution to extract a spatiotemporal summary of appearance that guides the motion stream. Although the model performs well on egocentric action recognition, it still has high computational complexity, relies on a fixed time window for temporal modeling, and is not robust enough under complex background interference.

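1. Introduction

Recognizing human actions from video is one of the core tasks in computer vision. Most previous work analyzes video shot from a third-person viewpoint, and comparatively little addresses first-person (egocentric) video. Egocentric video analysis is nevertheless valuable, with potential applications in robotics and human-computer interaction. The central difficulty of human action recognition is that the high flexibility of human joints and the uncontrolled recording environment produce data with large intra-class variation and small inter-class variation. Large intra-class variation means that the same action, even when performed by the same person, can look very different because of how freely the limbs move; small inter-class variation means that different actions can share local similarities that make them hard for a model to tell apart. In addition, video consists of frames, which adds an extra dimension to the data and makes it harder for the model to focus on the feature regions that best discriminate between action types.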

2. Framework

2.1 LSTA

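Compared with LSTM, LSTA adds two modules: the attention pooling module (red in the figure below) and the output pooling module (green in the figure below). Without these two modules, LSTA reduces to a standard ConvLSTM, which is computed as follows:
$\begin{aligned}h_{t-1}&=o_{t-1}\odot \eta(c_{t-1})\\ (i,f,o_t,c)&=(\sigma,\sigma,\sigma,\eta)(W*[x_t,h_{t-1}])\\ c_t&=f\odot c_{t-1}+i\odot c. \end{aligned}$
where $\eta$ is the tanh function, $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, and $*$ is convolution. LSTA is computed as follows: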
$\begin{aligned}\nu_a&=\varsigma(x_t,w_a)\\ (i_a,f_a,s_t,a)&=(\sigma,\sigma,\sigma,\eta)(W_a*[\nu_a,s_{t-1}\odot\eta(a_{t-1})])\\ a_t&=f_a\odot a_{t-1}+i_a\odot a\\ s&=\text{softmax}(\nu_a+s_t\odot\eta(a_t))\\ (i_c,f_c,c)&=(\sigma,\sigma,\eta)(W_c*[s\odot x_t,o_{t-1}\odot\eta(c_{t-1})])\\ c_t&=f_c\odot c_{t-1}+i_c\odot c\\ \nu_c&=\varsigma(c_t,w_c+w_o\epsilon(s\odot x_t))\\ o_t&=\sigma(W_o*[\nu_c\odot c_t,o_{t-1}\odot\eta(c_{t-1})]). \end{aligned}$
where $a_t$ and $s_t$ have shape $N\times1$, $c_t$ and $o_t$ have shape $N\times K$, $W_a$ and $W_c$ are $3\times3$ 2D convolutions, $w_a$ and $w_c$ have shape $K\times C$, and $w_o$ has shape $C\times C$ (the paper sets $N=7\times7=49$ and $K = C = 512$).
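[Figure: structure of the LSTA cell, with the attention pooling module in red and the output pooling module in green]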
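
The standard ConvLSTM update at the start of this section can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: it assumes a 1x1 kernel, so the convolution becomes a per-position matrix product, and the function name `convlstm_step`, the weight layout `W`, and the toy sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x_t, c_prev, o_prev, W):
    """One ConvLSTM update with an assumed 1x1 kernel.

    x_t:    (N, K_in) input features at N spatial positions
    c_prev: (N, K)    previous memory state c_{t-1}
    o_prev: (N, K)    previous output gate o_{t-1}
    W:      (K_in + K, 4 * K) weights shared by all positions
    """
    h_prev = o_prev * np.tanh(c_prev)               # previous hidden state
    z = np.concatenate([x_t, h_prev], axis=1) @ W   # (N, 4K) pre-activations
    i, f, o, c_tilde = np.split(z, 4, axis=1)       # input, forget, output, candidate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c_tilde = np.tanh(c_tilde)
    c_t = f * c_prev + i * c_tilde                  # memory update
    return c_t, o

rng = np.random.default_rng(0)
N, K_in, K = 49, 8, 4                               # toy sizes, not the paper's
x_t = rng.normal(size=(N, K_in))
c_prev = np.zeros((N, K))
o_prev = np.zeros((N, K))
W = rng.normal(scale=0.1, size=(K_in + K, 4 * K))
c_t, o_t = convlstm_step(x_t, c_prev, o_prev, W)
print(c_t.shape, o_t.shape)  # (49, 4) (49, 4)
```

LSTA keeps this memory recurrence but feeds it the attention-weighted input $s\odot x_t$ and computes the output gate from the pooled memory.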

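2.1.2 Attention pooling

2.1.2.1 Selector definition

Let the input be $x_{ik}$ with shape $N\times K$, where $N=H\times W$ is the number of spatial positions in the feature map and $K$ is the number of feature channels. The paper wants to suppress the $x_i$ that are irrelevant to the recognition task, so it looks for a pooling model $\varsigma(x, w)$ with output shape $N\times1$ such that $\varsigma(x, w)\odot x$ yields highly discriminative features.
The design of $\varsigma(x,w)$ rests on the assumption that the number of categories relevant to the recognition task is limited. Because each category varies greatly in how it is carried out, $\varsigma$ must select, from a pool of learnable mappings, the best category mapping for the current input $x$, while keeping the selector and the mapping pool self-consistent.
The selector has to map the image feature $x$ into the category score space and return the highest-scoring category $c^*$, so the paper uses a selector of the form $c^*=\operatorname*{arg\,max}_{c}\pi(\epsilon(x),\theta_c)$, where $\theta_c\in w$ are the parameters that score $x$ for category $c$. If, in addition, $\pi$ is designed to be equivariant to the reduction $\epsilon$, then $\pi(\epsilon(x),\theta_c)=\epsilon(\pi(x,\theta_c))$, and $\epsilon^{\perp}(\pi(\cdot,\theta_c))$ can serve as the pool of category mappings associated with $\epsilon$, where $\epsilon^{\perp}$ is a reduction orthogonal to $\epsilon$. These ideas lead to the following pooling model:
$\begin{aligned}\varsigma(x,\{\theta_c\})&=\epsilon^{\perp}(\pi(x,\theta_{c^*}))\\ c^*&=\operatorname*{arg\,max}_{c}\pi(\epsilon(x),\theta_c).\end{aligned}$
where $\epsilon$ is spatial average pooling, which extracts a global feature; $\pi(\epsilon(x),\theta_c)$ is a linear mapping that produces the category scores; and $\epsilon^{\perp}$ acts like class activation mapping (explained in more detail in 2.2).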
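
The selector above can be illustrated with a small NumPy sketch. It assumes that $\pi$ is a plain linear map with one weight vector $\theta_c$ per category, that $\epsilon$ averages over the $N$ positions, and that $\epsilon^{\perp}$ sums over the $K$ channels; the function name `attention_pool` and the random toy data are hypothetical and not taken from the paper's code.

```python
import numpy as np

def attention_pool(x, theta):
    """CAM-style selector sketch.

    x:     (N, K) features at N spatial positions
    theta: (C, K) one weight vector per category (assumed linear pi)
    Returns the (N, 1) map of the best category, its index, and all scores.
    """
    pooled = x.mean(axis=0)            # epsilon: average over positions -> (K,)
    scores = theta @ pooled            # pi(epsilon(x), theta_c) for each c -> (C,)
    c_star = int(np.argmax(scores))    # highest-scoring category
    cam = x @ theta[c_star]            # weight channels by theta_{c*}, sum them -> (N,)
    return cam[:, None], c_star, scores

rng = np.random.default_rng(0)
N, K, C = 49, 16, 5                    # toy sizes, not the paper's
x = rng.normal(size=(N, K))
theta = rng.normal(size=(C, K))
nu, c_star, scores = attention_pool(x, theta)
print(nu.shape)  # (49, 1)
```

Under these assumptions the spatial mean of the returned map equals the score of the selected category, which is one way to read the self-consistency between the selector and the mapping pool.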

2.1.2.2 Attention update

Input: the feature $x_t$ at the current time step, from which the pooling candidate $\nu_a=\varsigma(x_t,w_a)$ is computed.
Update the attention state $a_t$:
$\begin{aligned}(i_a,f_a,s_t,a)&=(\sigma,\sigma,\sigma,\eta)(W_a*[\nu_a,s_{t-1}\odot\eta(a_{t-1})])\\ a_t&=f_a\odot a_{t-1}+i_a\odot a.\end{aligned}$
Generate the attention map:
$s=\text{softmax}(\nu_a+s_t\odot\eta(a_t)).$
$s$ is then used to weight the input features as $s\odot x_t$.
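
As a rough illustration of the last two steps, the sketch below builds the attention map with a softmax over the $N$ spatial positions and uses it to weight the features. The random arrays stand in for $\nu_a$, $s_t$, and $a_t$, which in the real model come from the selector and the attention recurrence; the helper name `spatial_softmax` and the toy sizes are assumptions.

```python
import numpy as np

def spatial_softmax(z):
    """Softmax over the N spatial positions of an (N, 1) map."""
    z = z - z.max()                    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
N, K = 49, 6                           # toy sizes, not the paper's
x_t = rng.normal(size=(N, K))          # frame features
nu_a = rng.normal(size=(N, 1))         # stand-in for the pooling candidate
s_t = 1.0 / (1.0 + np.exp(-rng.normal(size=(N, 1))))  # stand-in gate in (0, 1)
a_t = rng.normal(size=(N, 1))          # stand-in attention state

s = spatial_softmax(nu_a + s_t * np.tanh(a_t))   # attention map, sums to 1
x_att = s * x_t                                  # broadcast over the K channels
print(x_att.shape)  # (49, 6)
```

Because the map sums to one over the positions, background locations with low scores contribute little to the weighted features.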

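2.1.3 Output pooling

In a conventional LSTM, the output gate $o_t$ depends only on the current input and the previous hidden state, so it cannot effectively pick out the key information in the memory $c_t$. Output pooling uses the attention mechanism to dynamically localize the most discriminative regions of the memory.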

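2.1.3.1 Memory state filtering

Input: the updated memory state $c_t$.
Dynamic adjustment:
$\nu_c=\varsigma(c_t,w_c+w_o\epsilon(s\odot x_t)).$
where $s\odot x_t$ is the weighted input feature and $\epsilon(s\odot x_t)$ is its global feature obtained by spatial average pooling. Here $\varsigma$ omits $\epsilon^{\perp}$ so that $\nu_c$ keeps the shape $N\times K$.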
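
The filtering step can be sketched in NumPy as follows. The sketch mirrors how the listing in 2.1.4 computes it (class scores from the pooled memory plus a coupling term from the pooled attended input, then channel-wise weighting of the memory by the top category's weight vector); reading the formula this way is an assumption, and the function name `output_pool` and the toy sizes are hypothetical.

```python
import numpy as np

def output_pool(c_t, x_att, w_c, w_o):
    """Filter the memory with the weights of the top-scoring category.

    c_t, x_att: (N, K) memory state and attention-weighted input
    w_c, w_o:   (C, K) classifier weights and coupling weights
    Returns the filtered memory (N, K) and the selected category index.
    """
    logits = w_c @ c_t.mean(axis=0) + w_o @ x_att.mean(axis=0)  # (C,)
    c_star = int(np.argmax(logits))
    # No reduction over channels here, so the result stays (N, K).
    filtered = c_t * w_c[c_star]
    return filtered, c_star

rng = np.random.default_rng(0)
N, K, C = 49, 6, 4                     # toy sizes, not the paper's
c_t = rng.normal(size=(N, K))
x_att = rng.normal(size=(N, K))
w_c = rng.normal(size=(C, K))
w_o = rng.normal(size=(C, K))
filtered, c_star = output_pool(c_t, x_att, w_c, w_o)
print(filtered.shape)  # (49, 6)
```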

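2.1.3.2 Output gate update

Update formula:
$o_t=\sigma(W_o*[\nu_c\odot c_t,o_{t-1}\odot\eta(c_{t-1})]).$
where $\nu_c\odot c_t$ is the filtered memory state and $o_{t-1}\odot\eta(c_{t-1})$ is the previous hidden state. The output gate $o_t$ controls how much of the memory is exposed in the hidden state.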

2.1.4 Code for the LSTA cell

import torch
import torch.nn as nn
import torch.nn.functional as F


class MyConvLSTACell(nn.Module):
    def __init__(self, input_size, memory_size, c_cam_classes=100, kernel_size=3,
                 stride=1, padding=1, zero_init=False):
        super(MyConvLSTACell, self).__init__()
        self.input_size = input_size
        self.memory_size = memory_size
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        self.c_classifier = nn.Linear(memory_size, c_cam_classes, bias=False)
        self.coupling_fc = nn.Linear(memory_size, c_cam_classes, bias=False)
        self.avgpool = nn.AvgPool2d(7)

        # Attention params

        self.conv_i_s = nn.Conv2d(1, 1, kernel_size=kernel_size, stride=stride, padding=padding)
        self.conv_i_cam = nn.Conv2d(1, 1, kernel_size=kernel_size, stride=stride, padding=padding, bias=False)

        self.conv_f_s = nn.Conv2d(1, 1, kernel_size=kernel_size, stride=stride, padding=padding)
        self.conv_f_cam = nn.Conv2d(1, 1, kernel_size=kernel_size, stride=stride, padding=padding, bias=False)

        self.conv_a_s = nn.Conv2d(1, 1, kernel_size=kernel_size, stride=stride, padding=padding)
        self.conv_a_cam = nn.Conv2d(1, 1, kernel_size=kernel_size, stride=stride, padding=padding, bias=False)

        self.conv_o_s = nn.Conv2d(1, 1, kernel_size=kernel_size, stride=stride, padding=padding)
        self.conv_o_cam = nn.Conv2d(1, 1, kernel_size=kernel_size, stride=stride, padding=padding, bias=False)

        # Zero-initialize the attention convolutions, or use Xavier-normal weights with zero biases
        if zero_init:
            torch.nn.init.constant_(self.conv_i_s.weight, 0)
            torch.nn.init.constant_(self.conv_i_s.bias, 0)
            torch.nn.init.constant_(self.conv_i_cam.weight, 0)

            torch.nn.init.constant_(self.conv_f_s.weight, 0)
            torch.nn.init.constant_(self.conv_f_s.bias, 0)
            torch.nn.init.constant_(self.conv_f_cam.weight, 0)

            torch.nn.init.constant_(self.conv_a_s.weight, 0)
            torch.nn.init.constant_(self.conv_a_s.bias, 0)

            torch.nn.init.constant_(self.conv_o_s.weight, 0)
            torch.nn.init.constant_(self.conv_o_s.bias, 0)
            torch.nn.init.constant_(self.conv_o_cam.weight, 0)
        else:
            torch.nn.init.xavier_normal_(self.conv_i_s.weight)
            torch.nn.init.constant_(self.conv_i_s.bias, 0)
            torch.nn.init.xavier_normal_(self.conv_i_cam.weight)

            torch.nn.init.xavier_normal_(self.conv_f_s.weight)
            torch.nn.init.constant_(self.conv_f_s.bias, 0)
            torch.nn.init.xavier_normal_(self.conv_f_cam.weight)

            torch.nn.init.xavier_normal_(self.conv_a_s.weight)
            torch.nn.init.constant_(self.conv_a_s.bias, 0)
            torch.nn.init.xavier_normal_(self.conv_a_cam.weight)

            torch.nn.init.xavier_normal_(self.conv_o_s.weight)
            torch.nn.init.constant_(self.conv_o_s.bias, 0)
            torch.nn.init.xavier_normal_(self.conv_o_cam.weight)

        # Memory params

        self.conv_i_x = nn.Conv2d(input_size, memory_size, kernel_size=kernel_size, stride=stride, padding=padding)
        self.conv_i_c = nn.Conv2d(memory_size, memory_size, kernel_size=kernel_size, stride=stride, padding=padding,
                                  bias=False)

        self.conv_f_x = nn.Conv2d(input_size, memory_size, kernel_size=kernel_size, stride=stride, padding=padding)
        self.conv_f_c = nn.Conv2d(memory_size, memory_size, kernel_size=kernel_size, stride=stride, padding=padding,
                                  bias=False)

        self.conv_c_x = nn.Conv2d(input_size, memory_size, kernel_size=kernel_size, stride=stride, padding=padding)
        self.conv_c_c = nn.Conv2d(memory_size, memory_size, kernel_size=kernel_size, stride=stride, padding=padding,
                                  bias=False)

        self.conv_o_x = nn.Conv2d(input_size, memory_size, kernel_size=kernel_size, stride=stride, padding=padding)
        self.conv_o_c = nn.Conv2d(memory_size, memory_size, kernel_size=kernel_size, stride=stride, padding=padding,
                                  bias=False)

        # Same initialization scheme for the memory convolutions
        if zero_init:
            torch.nn.init.constant_(self.conv_i_x.weight, 0)
            torch.nn.init.constant_(self.conv_i_x.bias, 0)
            torch.nn.init.constant_(self.conv_i_c.weight, 0)

            torch.nn.init.constant_(self.conv_f_x.weight, 0)
            torch.nn.init.constant_(self.conv_f_x.bias, 0)
            torch.nn.init.constant_(self.conv_f_c.weight, 0)

            torch.nn.init.constant_(self.conv_c_x.weight, 0)
            torch.nn.init.constant_(self.conv_c_x.bias, 0)
            torch.nn.init.constant_(self.conv_c_c.weight, 0)

            torch.nn.init.constant_(self.conv_o_x.weight, 0)
            torch.nn.init.constant_(self.conv_o_x.bias, 0)
            torch.nn.init.constant_(self.conv_o_c.weight, 0)
        else:
            torch.nn.init.xavier_normal_(self.conv_i_x.weight)
            torch.nn.init.constant_(self.conv_i_x.bias, 0)
            torch.nn.init.xavier_normal_(self.conv_i_c.weight)

            torch.nn.init.xavier_normal_(self.conv_f_x.weight)
            torch.nn.init.constant_(self.conv_f_x.bias, 0)
            torch.nn.init.xavier_normal_(self.conv_f_c.weight)

            torch.nn.init.xavier_normal_(self.conv_c_x.weight)
            torch.nn.init.constant_(self.conv_c_x.bias, 0)
            torch.nn.init.xavier_normal_(self.conv_c_c.weight)

            torch.nn.init.xavier_normal_(self.conv_o_x.weight)
            torch.nn.init.constant_(self.conv_o_x.bias, 0)
            torch.nn.init.xavier_normal_(self.conv_o_c.weight)

    def forward(self, x, cam, state_att, state_inp, x_flow_i=0, x_flow_f=0, x_flow_c=0, x_flow_o=0):
        # state_att = [a, s]: attention memory and attention hidden state
        # state_inp = [atanh(c), o]: memory before tanh and output gate
        # x_flow_*: optional motion-stream biases added to the gates
        # (x_flow_o is accepted but not used in this listing)

        a_t_1 = state_att[0]
        s_t_1 = state_att[1]

        c_t_1 = torch.tanh(state_inp[0])
        o_t_1 = state_inp[1]

        # Attention recurrence

        i_s = torch.sigmoid(self.conv_i_s(s_t_1) + self.conv_i_cam(cam))
        f_s = torch.sigmoid(self.conv_f_s(s_t_1) + self.conv_f_cam(cam))
        o_s = torch.sigmoid(self.conv_o_s(s_t_1) + self.conv_o_cam(cam))
        a_tilde = torch.tanh(self.conv_a_s(s_t_1) + self.conv_a_cam(cam))
        a = (f_s * a_t_1) + (i_s * a_tilde)
        s = o_s * torch.tanh(a)
        u = s + cam  # hidden state + cam

        # Softmax over the 7x7 spatial positions gives the attention map
        u = F.softmax(u.view(u.size(0), -1), 1)
        u = u.view(u.size(0), 1, 7, 7)

        # Weight the input features with the attention map
        x_att = x * u.expand_as(x)

        # Memory recurrence on the attended features
        i_x = torch.sigmoid(self.conv_i_c(o_t_1 * c_t_1) + self.conv_i_x(x_att) + x_flow_i)
        f_x = torch.sigmoid(self.conv_f_c(o_t_1 * c_t_1) + self.conv_f_x(x_att) + x_flow_f)
        c_tilde = torch.tanh(self.conv_c_c(o_t_1 * c_t_1) + self.conv_c_x(x_att) + x_flow_c)
        c = (f_x * state_inp[0]) + (i_x * c_tilde)

        # Output pooling: pick the top-scoring class and filter the memory with its weights
        c_vec = self.avgpool(c).view(c.size(0), -1)
        c_logits = self.c_classifier(c_vec) + self.coupling_fc(self.avgpool(x_att).view(x_att.size(0), -1))
        c_probs, c_idxs = c_logits.sort(1, True)
        c_class_idx = c_idxs[:, 0]
        c_cam = self.c_classifier.weight[c_class_idx].unsqueeze(2).unsqueeze(2) * c
        o_x = torch.sigmoid(self.conv_o_x(o_t_1 * c_t_1) + self.conv_o_c(c_cam))

        state_att = [a, s]
        state_inp = [c, o_x]
        return state_att, state_inp, x_att

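2.2 Dual-stream framework

Like most deep learning approaches to action recognition, the paper uses a dual-stream framework: one stream encodes appearance information from RGB frames, and the other encodes motion information from optical-flow stacks.

2.2.1 Attention in the appearance stream

The paper uses a ResNet-34 pretrained on ImageNet. The output of the last convolutional layer in conv5_3 is the input to the LSTA module, and the $c_t$ in the final output of the LSTA module is the input to the final classifier. The network is trained in two stages: the first stage trains only the final classifier and the LSTA module; the second stage additionally trains the convolutional layers of the last residual block of ResNet-34 and the fully connected layer, together with all layers trained in the first stage.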

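2.2.2 Attention in the motion stream

The paper also uses ResNet-34 for the motion-stream network. Its training is likewise split into two stages: the first stage trains on action verbs, and the second stage trains for activity recognition.
Stage 1: the input is a stack of 5 optical-flow frames. Each flow frame has a horizontal and a vertical motion component, so the stack has 10 channels, whereas the first convolutional layer of the ImageNet-pretrained ResNet-34 expects 3 input channels. The fix is to average the pretrained kernel weights across the input channels to obtain a 1-channel filter, and then replicate that filter 10 times to obtain a 10-channel filter.
Stage 2: the input is the stack of 5 consecutive optical-flow frames at the center of the video. The paper applies the attention pooling of 2.1.2 to the conv5_3 output of this network and initializes the attention pooling with the weights of the fully connected layer from stage 1. The attention map produced by attention pooling is used as the input to the final classifier.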
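
The stage 1 weight adaptation can be sketched with NumPy. The random array below is only a stand-in for the pretrained first-layer weights of ResNet-34 (64 filters, 3 input channels, 7x7 kernels), and the function name `adapt_first_conv` is hypothetical; in practice the averaged and replicated tensor would be copied into a new 10-channel input convolution.

```python
import numpy as np

def adapt_first_conv(w_rgb, in_channels=10):
    """Average RGB kernels over the input channels, then replicate them.

    w_rgb: (out_channels, 3, kH, kW) pretrained kernel weights
    Returns an array of shape (out_channels, in_channels, kH, kW).
    """
    w_mean = w_rgb.mean(axis=1, keepdims=True)        # (out, 1, kH, kW)
    return np.repeat(w_mean, in_channels, axis=1)     # copy the 1-channel filter

rng = np.random.default_rng(0)
w_rgb = rng.normal(size=(64, 3, 7, 7))   # stand-in for pretrained conv1 weights
w_flow = adapt_first_conv(w_rgb, in_channels=10)
print(w_flow.shape)  # (64, 10, 7, 7)
```

Every one of the 10 input channels starts from the same averaged filter, so the network begins training on flow with kernels that still resemble the pretrained ones.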

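2.2.3 Cross-modal fusion

Changes to the appearance stream: the conv5_3 features of the motion-stream network are transformed to match dimensions and then added to the gate computations of the LSTA module in the appearance stream. Changes to the motion stream: a 3D convolution (time + space) over the conv5_3 features of the appearance-stream network produces a summary feature; a ConvLSTM is added to the motion-stream network as an embedding layer, and the summary is added as a bias term to the gate computations of that ConvLSTM.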

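3. Innovations and limitations

3.1 Innovations

The main innovation of LSTA for egocentric action recognition lies in its attention mechanism for temporal modeling. By combining long-term and short-term attention, the method addresses the difficulty conventional models have in capturing key spatiotemporal features in complex dynamic scenes: long-term attention focuses on how the overall action evolves across the video sequence, while short-term attention reinforces the instantaneous features of fine local motion, and together they strengthen the modeling of unstructured actions in egocentric video. LSTA also introduces a dynamic weighting strategy that adapts the relative contribution of long-term and short-term attention to the action category, enabling more precise feature selection across scenarios such as cooking and sports. Compared with conventional spatiotemporal models that use fixed weights, this mechanism markedly improves robustness to head-motion blur and rapid viewpoint changes in egocentric video.

3.2 Limitations

LSTA still has three limitations. First, its multi-level attention computation makes the parameter count and computational complexity high; inference is slow on high-resolution, long videos in particular, which makes real-time use difficult. Second, long-term dependencies are still modeled over a fixed time window, so action continuity spanning minutes is hard to capture. Third, performance drops noticeably under extreme occlusion or severe camera shake, which reflects the sensitivity of the attention mechanism to low-quality input and calls for more robust preprocessing or data augmentation. Compared with other action recognition models of the same period, LSTA still has room to improve in computational efficiency and adaptability to complex scenes.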

References

Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz. LSTA: Long Short-Term Attention for Egocentric Action Recognition. CVPR 2019.
Code source: https://github.com/swathikirans/LSTA

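Summary

The pipeline first splits the input video into RGB frames and optical-flow frames, which are fed to the appearance stream and the motion stream for feature extraction. The appearance stream uses a pretrained ResNet-34 to extract spatial features from individual frames; the high-level features from its convolutional layers are passed to the LSTA module, and the $c_t$ in the final output of the LSTA module is used for classification. The motion stream uses a modified ResNet-34 to process stacks of consecutive optical-flow frames, resolving the input-dimension mismatch by replicating the pretrained parameters across channels, and adds a ConvLSTM module and an attention pooling module after the high-level feature layer; the attention map produced by attention pooling is used for classification. Finally, the classification results of the two streams are combined by weighted fusion to give the final prediction. Although the model performs well on egocentric action recognition, it still has high computational complexity, relies on a fixed time window for temporal modeling, and is not robust enough under complex background interference.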
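
The final weighted fusion of the two streams might look like the sketch below. The fusion weight, the function name `fuse_scores`, and the toy logits are all assumptions for illustration; the text above only states that the two results are combined by weighted fusion, not which weights are used.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_scores(rgb_logits, flow_logits, rgb_weight=0.5):
    """Weighted average of the class probabilities of the two streams."""
    return rgb_weight * softmax(rgb_logits) + (1.0 - rgb_weight) * softmax(flow_logits)

rgb_logits = np.array([2.0, 0.5, 0.1])      # toy appearance-stream scores
flow_logits = np.array([0.2, 1.5, 0.3])     # toy motion-stream scores
fused = fuse_scores(rgb_logits, flow_logits, rgb_weight=0.6)  # hypothetical weight
pred = int(np.argmax(fused))
print(pred)  # 0
```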
