Layer Normalization (2016-07)
Layer Normalization
https://arxiv.org/abs/1607.06450
Abstract
Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feedforward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
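The abstract introduces layer normalization, an optimization technique for training deep neural networks. In brief:
- Training cost: training state-of-the-art deep neural networks is computationally expensive.
- Why normalize: normalizing the activities of the neurons is one way to reduce training time; it helps stabilize training and speeds up convergence.
- Batch normalization: for each neuron, the mean and variance of its summed input are computed over a mini-batch of training cases and then used to normalize that summed input on each training case. This significantly reduces training time in feedforward networks.
- Limitations of batch normalization: its effect depends on the mini-batch size, and it is not obvious how to apply it to recurrent neural networks (RNNs).
- Layer normalization: the mean and variance are instead computed from all of the summed inputs to the neurons in a layer on a single training case, rather than over a mini-batch.
- Adaptive bias and gain: as in batch normalization, each neuron gets its own adaptive bias and gain, applied after the normalization and before the non-linearity.
- Train/test consistency: unlike batch normalization, layer normalization performs exactly the same computation at training and test time.
- Application to RNNs: layer normalization applies directly to RNNs by computing the normalization statistics separately at each time step.
- Stability and efficiency: layer normalization is very effective at stabilizing the hidden-state dynamics of recurrent networks, and empirically it substantially reduces training time compared with previously published techniques.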
1 Introduction
Deep neural networks trained with some version of Stochastic Gradient Descent have been shown to substantially outperform previous approaches on various supervised learning tasks in computer vision [Krizhevsky et al., 2012] and speech processing [Hinton et al., 2012]. But state-of-the-art deep neural networks often require many days of training. It is possible to speed-up the learning by computing gradients for different subsets of the training cases on different machines or splitting the neural network itself over many machines [Dean et al., 2012], but this can require a lot of communication and complex software. It also tends to lead to rapidly diminishing returns as the degree of parallelization increases. An orthogonal approach is to modify the computations performed in the forward pass of the neural net to make learning easier. Recently, batch normalization [Ioffe and Szegedy, 2015] has been proposed to reduce training time by including additional normalization stages in deep neural networks. The normalization standardizes each summed input using its mean and its standard deviation across the training data. Feedforward neural networks trained using batch normalization converge faster even with simple SGD. In addition to training time improvement, the stochasticity from the batch statistics serves as a regularizer during training.
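Deep neural networks trained with some version of stochastic gradient descent (SGD) have substantially outperformed earlier approaches on supervised learning tasks in computer vision [Krizhevsky et al., 2012] and speech processing [Hinton et al., 2012], but state-of-the-art networks often take many days to train. Learning can be sped up by computing gradients for different subsets of the training cases on different machines, or by splitting the network itself across machines [Dean et al., 2012]; this can require a lot of communication and complex software, and the returns diminish quickly as the degree of parallelization grows. An orthogonal approach is to change the computation in the forward pass so that learning becomes easier. Batch normalization [Ioffe and Szegedy, 2015] does this by adding normalization stages that standardize each summed input using its mean and standard deviation across the training data. Feedforward networks trained with it converge faster even with plain SGD, and the stochasticity of the batch statistics also acts as a regularizer during training.
So batch normalization and layer normalization were proposed only about a year apart. I should read the original batch normalization paper when I have time.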
Despite its simplicity, batch normalization requires running averages of the summed input statistics. In feed-forward networks with fixed depth, it is straightforward to store the statistics separately for each hidden layer. However, the summed inputs to the recurrent neurons in a recurrent neural network (RNN) often vary with the length of the sequence so applying batch normalization to RNNs appears to require different statistics for different time-steps. Furthermore, batch normalization cannot be applied to online learning tasks or to extremely large distributed models where the minibatches have to be small.
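Simple as it is, batch normalization needs running averages of the summed-input statistics. In a feed-forward network of fixed depth these can be stored separately for each hidden layer. In a recurrent neural network (RNN), however, the summed inputs to the recurrent neurons often vary with the length of the sequence, so applying batch normalization seems to require different statistics for different time steps. Batch normalization also cannot be applied to online learning tasks, or to extremely large distributed models where the mini-batches have to be small.
A question: batch normalization normalizes the inputs during training, so does the same processing have to be applied at inference time? If not, the same input sample would look different to the network in training and in inference. Does that not matter, or is it handled in some way?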
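One common answer, which is standard practice in batch normalization implementations rather than something spelled out in this paper, is the running averages mentioned above: batch statistics are used during training, and the accumulated running mean and variance are used at inference. The sketch below illustrates this for a single hidden unit in plain Python. The class name `BatchNormUnit`, the `momentum` value, and the `eps` constant are assumptions made for illustration.

```python
import math

class BatchNormUnit:
    """Batch normalization for one hidden unit (gain and bias omitted)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum   # assumed weight of the newest batch in the running averages
        self.eps = eps             # assumed small constant to avoid division by zero
        self.running_mean = 0.0
        self.running_var = 1.0

    def forward(self, values, training):
        if training:
            # Training: normalize with the statistics of the current mini-batch
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: reuse the running averages, independent of the batch
            mean, var = self.running_mean, self.running_var
        return [(v - mean) / math.sqrt(var + self.eps) for v in values]

bn = BatchNormUnit()
train_a = bn.forward([1.0, 2.0, 3.0], training=True)   # 2.0 sits at the batch mean
train_b = bn.forward([2.0, 100.0], training=True)      # same 2.0, different batch
test_a = bn.forward([2.0], training=False)             # a batch of one works here
test_b = bn.forward([2.0, 100.0], training=False)
```

In training mode the value 2.0 is normalized differently depending on which other cases share its mini-batch. In inference mode its output depends only on the stored statistics. This train/test difference is what layer normalization avoids.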
This paper introduces layer normalization, a simple normalization method to improve the training speed for various neural network models. Unlike batch normalization, the proposed method directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer so the normalization does not introduce any new dependencies between training cases. We show that layer normalization works well for RNNs and improves both the training time and the generalization performance of several existing RNN models.
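The paper introduces layer normalization, a simple normalization method that improves the training speed of various neural network models. Unlike batch normalization, it estimates the normalization statistics directly from the summed inputs to the neurons within a hidden layer, so it introduces no new dependencies between training cases. The authors show that it works well for RNNs and improves both the training time and the generalization performance of several existing RNN models.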
2 Background
A feed-forward neural network is a non-linear mapping from an input pattern $x$ to an output vector $y$. Consider the $l$-th hidden layer in a deep feed-forward neural network, and let $a^l$ be the vector representation of the summed inputs to the neurons in that layer. The summed inputs are computed through a linear projection with the weight matrix $W^l$ and the bottom-up inputs $h^l$ given as follows:
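In other words, a feed-forward network maps an input pattern $x$ to an output vector $y$ non-linearly. For the $l$-th hidden layer, $a^l$ denotes the vector of summed inputs to that layer's neurons, obtained by a linear projection of the bottom-up inputs $h^l$ with the weight matrix $W^l$, as follows: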
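The equation referenced here did not survive in these notes. Reconstructed from the definitions in the surrounding text (so the exact layout is an assumption), Eq. (1) reads:

$$
a_i^l = {w_i^l}^\top h^l, \qquad h_i^{l+1} = f\left(a_i^l + b_i^l\right) \tag{1}
$$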
where $f(\cdot)$ is an element-wise non-linear function, $w_i^l$ is the incoming weights to the $i$-th hidden unit and $b_i^l$ is the scalar bias parameter. The parameters in the neural network are learnt using gradient-based optimization algorithms with the gradients being computed by back-propagation.
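Here $f(\cdot)$ is an element-wise non-linear function, $w_i^l$ is the vector of incoming weights to the $i$-th hidden unit, and $b_i^l$ is the scalar bias parameter. The parameters of the network are learned with gradient-based optimization, with the gradients computed by back-propagation.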
One of the challenges of deep learning is that the gradients with respect to the weights in one layer are highly dependent on the outputs of the neurons in the previous layer especially if these outputs change in a highly correlated way. Batch normalization [Ioffe and Szegedy, 2015] was proposed to reduce such undesirable “covariate shift”. The method normalizes the summed inputs to each hidden unit over the training cases. Specifically, for the ith summed input in the lth layer, the batch normalization method rescales the summed inputs according to their variances under the distribution of the data
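One challenge in deep learning is that the gradients with respect to the weights in one layer depend strongly on the outputs of the neurons in the previous layer, especially when those outputs change in a highly correlated way. Batch normalization [Ioffe and Szegedy, 2015] was proposed to reduce this undesirable "covariate shift". It normalizes the summed inputs to each hidden unit over the training cases: for the $i$-th summed input in the $l$-th layer, it rescales the summed inputs according to their variances under the data distribution.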
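The equation itself is missing from these notes. Based on the symbols defined in the next paragraph ($\bar{a}_i^l$, $g_i$, $\mu$, $\sigma$, and an expectation over the data distribution), Eq. (2) can be reconstructed as the following; the per-unit indices on $\mu$ and $\sigma$ are written out here as an assumption:

$$
\bar{a}_i^l = \frac{g_i}{\sigma_i^l}\left(a_i^l - \mu_i^l\right), \qquad
\mu_i^l = \mathbb{E}_{x \sim P(x)}\left[a_i^l\right], \qquad
\sigma_i^l = \sqrt{\mathbb{E}_{x \sim P(x)}\left[\left(a_i^l - \mu_i^l\right)^2\right]} \tag{2}
$$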
where $\bar{a}_i^l$ is the normalized summed input to the $i$-th hidden unit in the $l$-th layer and $g_i$ is a gain parameter scaling the normalized activation before the non-linear activation function. Note the expectation is under the whole training data distribution. It is typically impractical to compute the expectations in Eq. (2) exactly, since it would require forward passes through the whole training dataset with the current set of weights. Instead, $\mu$ and $\sigma$ are estimated using the empirical samples from the current mini-batch. This puts constraints on the size of a mini-batch and it is hard to apply to recurrent neural networks.
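Here $\bar{a}_i^l$ is the normalized summed input to the $i$-th hidden unit in the $l$-th layer, and $g_i$ is a gain parameter that scales the normalized activation before the non-linear activation function. The expectation is taken under the whole training data distribution. Computing the expectations in Eq. (2) exactly is usually impractical, because it would require forward passes through the entire training set with the current weights. Instead, $\mu$ and $\sigma$ are estimated from the empirical samples in the current mini-batch. This constrains the mini-batch size and makes the method hard to apply to recurrent neural networks.
What is the biggest problem with batch normalization? According to this paper, it is the strong dependence on the mini-batch. So for large models, or for data with large inputs such as in autonomous driving, batch normalization may be unusable and only layer normalization works. Is that the right way to understand it?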
3 Layer normalization
We now consider the layer normalization method which is designed to overcome the drawbacks of batch normalization.
Notice that changes in the output of one layer will tend to cause highly correlated changes in the summed inputs to the next layer, especially with ReLU units whose outputs can change by a lot. This suggests the “covariate shift” problem can be reduced by fixing the mean and the variance of the summed inputs within each layer. We, thus, compute the layer normalization statistics over all the hidden units in the same layer as follows:
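We now turn to layer normalization, which is designed to overcome the drawbacks of batch normalization.
Changes in the output of one layer tend to cause highly correlated changes in the summed inputs to the next layer, especially with ReLU units, whose outputs can change by a lot. This suggests that the "covariate shift" problem can be reduced by fixing the mean and variance of the summed inputs within each layer. The layer normalization statistics are therefore computed over all the hidden units in the same layer, as follows: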
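The equation image is missing here. From the definitions given right after it (a mean and a spread over the $H$ hidden units of layer $l$), Eq. (3) can be reconstructed as:

$$
\mu^l = \frac{1}{H}\sum_{i=1}^{H} a_i^l, \qquad
\sigma^l = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_i^l - \mu^l\right)^2} \tag{3}
$$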
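Here $\mu^l$ is the mean of the summed inputs over all hidden units in the $l$-th layer, $\sigma^l$ is their standard deviation, $H$ is the number of hidden units in the $l$-th layer, and $a_i^l$ is the summed input to the $i$-th hidden unit in the $l$-th layer.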
where $H$ denotes the number of hidden units in a layer. The difference between Eq. (2) and Eq. (3) is that under layer normalization, all the hidden units in a layer share the same normalization terms $\mu$ and $\sigma$, but different training cases have different normalization terms. Unlike batch normalization, layer normalization does not impose any constraint on the size of a mini-batch and it can be used in the pure online regime with batch size 1.
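Here $H$ is the number of hidden units in a layer. The difference between Eq. (2) and Eq. (3) is that under layer normalization all the hidden units in a layer share the same normalization terms $\mu$ and $\sigma$, while different training cases have different normalization terms. Unlike batch normalization, layer normalization places no constraint on the mini-batch size and can be used in the pure online regime with a batch size of 1.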
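To make the contrast between the two sets of statistics concrete, here is a minimal sketch in plain Python. Layer normalization computes the mean and standard deviation over the hidden units of one training case, while batch-normalization-style statistics are computed for one unit across the mini-batch. The function names and the small `eps` constant are assumptions added for illustration; the paper's formula has no `eps`.

```python
import math

def layer_norm(a, gain=None, bias=None, eps=1e-5):
    """Normalize one training case's summed inputs `a` over its H hidden units."""
    H = len(a)
    gain = gain if gain is not None else [1.0] * H   # default gain of 1
    bias = bias if bias is not None else [0.0] * H   # default bias of 0
    mu = sum(a) / H
    sigma = math.sqrt(sum((v - mu) ** 2 for v in a) / H)
    # eps is an assumed safeguard against division by zero
    return [g * (v - mu) / (sigma + eps) + b for v, g, b in zip(a, gain, bias)]

def batch_stats(batch, i):
    """Batch-norm-style statistics: mean and std of unit i across the mini-batch."""
    vals = [case[i] for case in batch]
    mu = sum(vals) / len(vals)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
    return mu, sigma

batch = [[1.0, 2.0, 3.0, 4.0], [10.0, 0.0, -10.0, 0.0]]
out_in_batch = layer_norm(batch[0])            # first case, processed as part of a batch
out_alone = layer_norm([1.0, 2.0, 3.0, 4.0])   # same case, processed on its own
stats_full = batch_stats(batch, 0)             # depends on the other case in the batch
stats_single = batch_stats([batch[0]], 0)      # batch of one: the std collapses to 0
```

The layer-normalized output of a case is the same whether or not other cases are present. The batch statistics change with the batch contents and degenerate to zero spread at batch size 1.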
3.1 Layer normalized recurrent neural networks
The recent sequence to sequence models [Sutskever et al., 2014] utilize compact recurrent neural networks to solve sequential prediction problems in natural language processing. It is common among the NLP tasks to have different sentence lengths for different training cases. This is easy to deal with in an RNN because the same weights are used at every time-step. But when we apply batch normalization to an RNN in the obvious way, we need to compute and store separate statistics for each time step in a sequence. This is problematic if a test sequence is longer than any of the training sequences. Layer normalization does not have such a problem because its normalization terms depend only on the summed inputs to a layer at the current time-step. It also has only one set of gain and bias parameters shared over all time-steps.
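Recent sequence-to-sequence models [Sutskever et al., 2014] use compact recurrent neural networks to solve sequential prediction problems in natural language processing. In NLP tasks, different training cases commonly have different sentence lengths. An RNN handles this easily because the same weights are used at every time step. But applying batch normalization to an RNN in the obvious way means computing and storing separate statistics for each time step in a sequence, which is a problem when a test sequence is longer than any training sequence. Layer normalization has no such problem, because its normalization terms depend only on the summed inputs to the layer at the current time step. It also has a single set of gain and bias parameters shared across all time steps.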
In a standard RNN, the summed inputs in the recurrent layer are computed from the current input $x^t$ and previous vector of hidden states $h^{t-1}$ which are computed as $a^t = W_{hh} h^{t-1} + W_{xh} x^t$. The layer normalized recurrent layer re-centers and re-scales its activations using the extra normalization terms similar to Eq. (3):
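That is, in a standard RNN the summed inputs of the recurrent layer are computed from the current input $x^t$ and the previous hidden-state vector $h^{t-1}$ as $a^t = W_{hh} h^{t-1} + W_{xh} x^t$. The layer-normalized recurrent layer re-centers and re-scales its activations with extra normalization terms similar to Eq. (3):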
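The equation is missing at this point. Using the symbols defined just below it ($W_{hh}$, $W_{xh}$, $\odot$, $b$, $g$) and the statistics of Eq. (3) taken at time step $t$, it can be reconstructed as follows; the square-bracket layout is an assumption:

$$
h^t = f\left[\frac{g}{\sigma^t} \odot \left(a^t - \mu^t\right) + b\right], \qquad
\mu^t = \frac{1}{H}\sum_{i=1}^{H} a_i^t, \qquad
\sigma^t = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_i^t - \mu^t\right)^2}
$$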
where $W_{hh}$ is the recurrent hidden to hidden weights and $W_{xh}$ are the bottom up input to hidden weights. $\odot$ is the element-wise multiplication between two vectors. $b$ and $g$ are defined as the bias and gain parameters of the same dimension as $h^t$.
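Here $W_{hh}$ holds the recurrent hidden-to-hidden weights and $W_{xh}$ the bottom-up input-to-hidden weights. $\odot$ denotes element-wise multiplication of two vectors. $b$ and $g$ are the bias and gain parameters, with the same dimension as $h^t$.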
In a standard RNN, there is a tendency for the average magnitude of the summed inputs to the recurrent units to either grow or shrink at every time-step, leading to exploding or vanishing gradients. In a layer normalized RNN, the normalization terms make it invariant to re-scaling all of the summed inputs to a layer, which results in much more stable hidden-to-hidden dynamics.
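In a standard RNN, the average magnitude of the summed inputs to the recurrent units tends to either grow or shrink at every time step, which leads to exploding or vanishing gradients. In a layer-normalized RNN, the normalization terms make the layer invariant to re-scaling all of its summed inputs, which gives much more stable hidden-to-hidden dynamics.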
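The re-scaling invariance described above can be checked numerically. The following is a minimal sketch of one layer-normalized recurrent step in plain Python, assuming `tanh` as the non-linearity $f$ and arbitrary illustrative weights; the helper names are hypothetical. Multiplying both weight matrices by 10 multiplies every summed input by 10, yet the new hidden state is unchanged.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def ln_rnn_step(W_hh, W_xh, h_prev, x_t, gain, bias):
    """One step of a layer-normalized RNN, with tanh assumed as f."""
    a = [r + s for r, s in zip(matvec(W_hh, h_prev), matvec(W_xh, x_t))]
    H = len(a)
    mu = sum(a) / H
    # Assumes the summed inputs are not all equal, so sigma > 0
    sigma = math.sqrt(sum((v - mu) ** 2 for v in a) / H)
    return [math.tanh(g * (v - mu) / sigma + b) for v, g, b in zip(a, gain, bias)]

def scale(W, c):
    """Multiply every entry of W by the constant c."""
    return [[c * w for w in row] for row in W]

W_hh = [[0.5, -0.3, 0.2], [0.8, 0.1, -0.6], [-0.4, 0.7, 0.3]]
W_xh = [[1.0, 0.2], [-0.7, 0.3], [0.5, -0.9]]
h_prev, x_t = [0.1, -0.2, 0.4], [1.0, 0.5]
gain, bias = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]

h = ln_rnn_step(W_hh, W_xh, h_prev, x_t, gain, bias)
h_scaled = ln_rnn_step(scale(W_hh, 10.0), scale(W_xh, 10.0), h_prev, x_t, gain, bias)
max_diff = max(abs(p - q) for p, q in zip(h, h_scaled))   # differs only by rounding
```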
4 Related work
Batch normalization has been previously extended to recurrent neural networks [Laurent et al., 2015, Amodei et al., 2015, Cooijmans et al., 2016]. The previous work [Cooijmans et al., 2016] suggests the best performance of recurrent batch normalization is obtained by keeping independent normalization statistics for each time-step. The authors show that initializing the gain parameter in the recurrent batch normalization layer to 0.1 makes significant difference in the final performance of the model. Our work is also related to weight normalization [Salimans and Kingma, 2016]. In weight normalization, instead of the variance, the L2 norm of the incoming weights is used to normalize the summed inputs to a neuron. Applying either weight normalization or batch normalization using expected statistics is equivalent to have a different parameterization of the original feed-forward neural network. Re-parameterization in the ReLU network was studied in the Pathnormalized SGD [Neyshabur et al., 2015]. Our proposed layer normalization method, however, is not a re-parameterization of the original neural network. The layer normalized model, thus, has different invariance properties than the other methods, that we will study in the following section.
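Batch normalization has previously been extended to recurrent neural networks [Laurent et al., 2015, Amodei et al., 2015, Cooijmans et al., 2016]. Cooijmans et al. [2016] suggest that recurrent batch normalization performs best when independent normalization statistics are kept for each time step, and show that initializing the gain parameter of the recurrent batch normalization layer to 0.1 makes a significant difference to the final performance of the model. This work is also related to weight normalization [Salimans and Kingma, 2016], where the L2 norm of the incoming weights, rather than the variance, is used to normalize the summed inputs to a neuron. Applying weight normalization, or batch normalization with expected statistics, is equivalent to a different parameterization of the original feed-forward network. Re-parameterization in ReLU networks was studied in Path-normalized SGD [Neyshabur et al., 2015]. The proposed layer normalization, however, is not a re-parameterization of the original network, so the layer-normalized model has different invariance properties from the other methods; these are studied in the next section.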
5 Analysis
In this section, we investigate the invariance properties of different normalization schemes.
This section investigates the invariance properties of the different normalization schemes.
5.1 Invariance under weights and data transformations
The proposed layer normalization is related to batch normalization and weight normalization. Although, their normalization scalars are computed differently, these methods can be summarized as normalizing the summed inputs ai to a neuron through the two scalars µ and σ. They also learn an adaptive bias b and gain g for each neuron after the normalization.
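The proposed layer normalization is related to batch normalization and weight normalization. Although their normalization scalars are computed differently, all three methods can be summarized as normalizing the summed input $a_i$ to a neuron through two scalars $\mu$ and $\sigma$. After the normalization they also learn an adaptive bias $b$ and gain $g$ for each neuron.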
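The summarizing equation is missing from these notes. With the symbols just introduced ($a_i$, $\mu$, $\sigma$, $b$, $g$), the common form is presumably the following; the per-neuron indices on $\mu$, $\sigma$, $g$, and $b$ are an assumption:

$$
h_i = f\left(\frac{g_i}{\sigma_i}\left(a_i - \mu_i\right) + b_i\right)
$$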
Note that for batch normalization and layer normalization, $\mu$ and $\sigma$ are computed according to Eq. (2) and Eq. (3), respectively. In weight normalization, $\mu$ is 0, and $\sigma = \|w\|_2$.
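That is, batch normalization and layer normalization compute $\mu$ and $\sigma$ according to Eq. (2) and Eq. (3), respectively, while weight normalization uses $\mu = 0$ and takes $\sigma$ to be the L2 norm of the weight vector, $\sigma = \|w\|_2$.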
Table 1 highlights the following invariance results for three normalization methods.
Table 1 highlights the following invariance results for the three normalization methods:
Weight re-scaling and re-centering: First, observe that under batch normalization and weight normalization, any re-scaling to the incoming weights $w_i$ of a single neuron has no effect on the normalized summed inputs to a neuron. To be precise, under batch and weight normalization, if the weight vector is scaled by $\delta$, the two scalars $\mu$ and $\sigma$ will also be scaled by $\delta$. The normalized summed inputs stay the same before and after scaling. So batch and weight normalization are invariant to the re-scaling of the weights. Layer normalization, on the other hand, is not invariant to the individual scaling of the single weight vectors. Instead, layer normalization is invariant to scaling of the entire weight matrix and invariant to a shift to all of the incoming weights in the weight matrix. Let there be two sets of model parameters $\theta$, $\theta'$ whose weight matrices $W$ and $W'$ differ by a scaling factor $\delta$ and all of the incoming weights in $W'$ are also shifted by a constant vector $\gamma$, that is $W' = \delta W + \mathbf{1}\gamma^\top$. Under layer normalization, the two models effectively compute the same output:
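Weight re-scaling and re-centering: under batch normalization and weight normalization, re-scaling the incoming weights $w_i$ of a single neuron has no effect on that neuron's normalized summed input. If the weight vector is scaled by $\delta$, the scalars $\mu$ and $\sigma$ are scaled by $\delta$ as well, so the normalized summed input is unchanged; both methods are therefore invariant to re-scaling of the weights. Layer normalization is not invariant to scaling an individual weight vector. It is instead invariant to scaling the entire weight matrix and to shifting all of the incoming weights in the matrix. Take two sets of model parameters $\theta$ and $\theta'$ whose weight matrices $W$ and $W'$ differ by a scaling factor $\delta$, with all incoming weights in $W'$ also shifted by a constant vector $\gamma$, that is, $W' = \delta W + \mathbf{1}\gamma^\top$. Under layer normalization the two models effectively compute the same output:
How are $\gamma$ and $\beta$ trained? Do they need to be set up explicitly in PyTorch, or does a BN or LN module add the corresponding $\gamma$ and $\beta$ parameters automatically and learn them by gradient descent?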
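The derivation itself is missing here. Under the stated assumption $W' = \delta W + \mathbf{1}\gamma^\top$, and additionally assuming $\delta > 0$, the layer statistics transform as $\mu' = \delta\mu + \gamma^\top x$ and $\sigma' = \delta\sigma$, so the equality can be reconstructed as:

$$
h' = f\left(\frac{g}{\sigma'}\left(W'x - \mu'\right) + b\right)
   = f\left(\frac{g}{\sigma'}\left(\left(\delta W + \mathbf{1}\gamma^\top\right)x - \mu'\right) + b\right)
   = f\left(\frac{g}{\sigma}\left(Wx - \mu\right) + b\right) = h
$$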
Notice that if normalization is only applied to the input before the weights, the model will not be invariant to re-scaling and re-centering of the weights.
In other words, normalizing only the input before the weights does not make the model invariant to re-scaling and re-centering of the weights.
Data re-scaling and re-centering: We can show that all the normalization methods are invariant to re-scaling the dataset by verifying that the summed inputs of neurons stay constant under the changes. Furthermore, layer normalization is invariant to re-scaling of individual training cases, because the normalization scalars $\mu$ and $\sigma$ in Eq. (3) only depend on the current input data. Let $x'$ be a new data point obtained by re-scaling $x$ by $\delta$. Then we have,
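Data re-scaling and re-centering: all of the normalization methods are invariant to re-scaling the dataset, which can be shown by verifying that the normalized summed inputs of the neurons stay constant under the change. Layer normalization is in addition invariant to re-scaling of individual training cases, because the normalization scalars $\mu$ and $\sigma$ in Eq. (3) depend only on the current input. Let $x'$ be a new data point obtained by re-scaling $x$ by $\delta$. Then: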
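The equation is missing at this point. With $x' = \delta x$ (assuming $\delta > 0$), the layer statistics become $\mu' = \delta\mu$ and $\sigma' = \delta\sigma$, so the missing step can be reconstructed as:

$$
h_i' = f\left(\frac{g_i}{\sigma'}\left(w_i^\top x' - \mu'\right) + b_i\right)
     = f\left(\frac{g_i}{\delta\sigma}\left(\delta w_i^\top x - \delta\mu\right) + b_i\right) = h_i
$$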
It is easy to see re-scaling individual data points does not change the model’s prediction under layer normalization. Similar to the re-centering of the weight matrix in layer normalization, we can also show that batch normalization is invariant to re-centering of the dataset.
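So under layer normalization, re-scaling an individual data point does not change the model's prediction. In the same way that layer normalization is invariant to re-centering of the weight matrix, batch normalization can be shown to be invariant to re-centering of the dataset.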
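The invariance claims of this subsection can be verified numerically. The sketch below, in plain Python with arbitrary illustrative values, compares the layer-normalized pre-activations (the argument of $f$; since $f$ is element-wise, equal arguments give equal outputs) under three changes: scaling and shifting the whole weight matrix, scaling a single data point, and scaling only one weight vector. The function names are hypothetical.

```python
import math

def ln_preactivation(W, x, gain, bias):
    """Layer-normalized summed inputs g/sigma * (Wx - mu) + b for one input x."""
    a = [sum(w * v for w, v in zip(row, x)) for row in W]
    H = len(a)
    mu = sum(a) / H
    sigma = math.sqrt(sum((v - mu) ** 2 for v in a) / H)
    return [g * (v - mu) / sigma + b for v, g, b in zip(a, gain, bias)]

def close(p, q, tol=1e-9):
    """True if two vectors agree element-wise up to rounding error."""
    return all(abs(u - v) < tol for u, v in zip(p, q))

W = [[0.2, -0.5], [1.0, 0.3], [-0.7, 0.8]]
x = [1.5, -2.0]
gain, bias = [1.0, 2.0, 0.5], [0.1, 0.0, -0.3]
delta, gamma = 3.0, [0.4, -1.1]

base = ln_preactivation(W, x, gain, bias)

# W' = delta * W + 1 gamma^T: scale every entry, add gamma to every row
W_shifted = [[delta * w + c for w, c in zip(row, gamma)] for row in W]
same_for_weights = close(base, ln_preactivation(W_shifted, x, gain, bias))

# x' = delta * x: re-scale a single training case
same_for_data = close(base, ln_preactivation(W, [delta * v for v in x], gain, bias))

# Scale only the first weight vector: layer normalization is NOT invariant to this
W_one_row = [[delta * w for w in W[0]]] + W[1:]
same_for_one_row = close(base, ln_preactivation(W_one_row, x, gain, bias))
```

The first two checks come out true and the third false, matching the statement that layer normalization is invariant to transformations of the whole weight matrix and of individual data points, but not to re-scaling a single neuron's weights.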
5.2 Geometry of parameter space during learning
We have investigated the invariance of the model’s prediction under re-centering and re-scaling of the parameters. Learning, however, can behave very differently under different parameterizations, even though the models express the same underlying function. In this section, we analyze learning behavior through the geometry and the manifold of the parameter space. We show that the normalization scalar σ can implicitly reduce learning rate and makes learning more stable.
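So far the analysis has covered the invariance of the model's prediction under re-centering and re-scaling of the parameters. Learning, however, can behave very differently under different parameterizations even when the models express the same underlying function. This section analyzes learning behavior through the geometry and the manifold of the parameter space, and shows that the normalization scalar $\sigma$ can implicitly reduce the learning rate and make learning more stable.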
5.2.1 Riemannian metric
The learnable parameters in a statistical model form a smooth manifold that consists of all possible input-output relations of the model. For models whose output is a probability distribution, a natural way to measure the separation of two points on this manifold is the Kullback-Leibler divergence between their model output distributions. Under the KL divergence metric, the parameter space is a Riemannian manifold.
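The learnable parameters of a statistical model form a smooth manifold consisting of all possible input-output relations of the model. For models whose output is a probability distribution, a natural way to measure the separation of two points on this manifold is the Kullback-Leibler divergence between their output distributions. Under the KL divergence metric, the parameter space is a Riemannian manifold.
KL divergence comes up all the time; I should look into it when I have time.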
The curvature of a Riemannian manifold is entirely captured by its Riemannian metric, whose quadratic form is denoted as $ds^2$. That is the infinitesimal distance in the tangent space at a point in the parameter space. Intuitively, it measures the changes in the model output from the parameter space along a tangent direction. The Riemannian metric under KL was previously studied [Amari, 1998] and was shown to be well approximated under second order Taylor expansion using the Fisher information matrix:
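The curvature of a Riemannian manifold is entirely captured by its Riemannian metric, whose quadratic form is written $ds^2$: the infinitesimal distance in the tangent space at a point in the parameter space. Intuitively, it measures how the model output changes along a tangent direction in parameter space. The Riemannian metric under KL was studied earlier [Amari, 1998] and shown to be well approximated, under a second-order Taylor expansion, by the Fisher information matrix: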
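The equation is missing from these notes. The standard second-order approximation of the KL divergence by the Fisher information matrix, which is what the text describes, reads as follows; the exact form of the expectation subscript is an assumption:

$$
ds^2 = D_{\mathrm{KL}}\left[P(y \mid x; \theta)\,\|\,P(y \mid x; \theta + \delta)\right] \approx \frac{1}{2}\,\delta^\top F(\theta)\,\delta, \qquad
F(\theta) = \mathbb{E}_{x \sim P(x),\, y \sim P(y \mid x)}\left[\frac{\partial \log P(y \mid x; \theta)}{\partial \theta}\,\frac{\partial \log P(y \mid x; \theta)}{\partial \theta}^\top\right]
$$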
where, δ is a small change to the parameters. The Riemannian metric above presents a geometric view of parameter spaces. The following analysis of the Riemannian metric provides some insight into how normalization methods could help in training neural networks.
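Here $\delta$ is a small change to the parameters. The Riemannian metric above gives a geometric view of parameter spaces, and the analysis of it below offers some insight into how normalization methods can help in training neural networks.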
5.2.2 The geometry of normalized generalized linear models
We focus our geometric analysis on the generalized linear model. The results from the following analysis can be easily applied to understand deep neural networks with block-diagonal approximation to the Fisher information matrix, where each block corresponds to the parameters for a single neuron.
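The geometric analysis focuses on the generalized linear model. Its results carry over easily to deep neural networks under a block-diagonal approximation to the Fisher information matrix, in which each block corresponds to the parameters of a single neuron.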
A generalized linear model (GLM) can be regarded as parameterizing an output distribution from the exponential family using a weight vector w and bias scalar b. To be consistent with the previous sections, the log likelihood of the GLM can be written using the summed inputs a as the following:
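A generalized linear model (GLM) can be viewed as parameterizing an output distribution from the exponential family with a weight vector $w$ and a bias scalar $b$. To stay consistent with the earlier sections, the log likelihood of the GLM is written in terms of the summed input $a$, as follows: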
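The equations are missing here. Using the functions named in the next paragraph ($f$, $f'$, $\eta$, $c$, and the constant $\phi$), the exponential-family log likelihood and its first two moments can be reconstructed as follows; this is a sketch from those definitions, not a verbatim copy:

$$
\log P(y \mid x; w, b) = \frac{(a + b)\,y - \eta(a + b)}{\phi} + c(y, \phi)
$$

$$
\mathbb{E}[y \mid x] = f(a + b) = f\left(w^\top x + b\right), \qquad
\mathrm{Var}[y \mid x] = \phi\, f'(a + b)
$$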
where $f(\cdot)$ is the transfer function that is the analog of the non-linearity in neural networks, $f'(\cdot)$ is the derivative of the transfer function, $\eta(\cdot)$ is a real valued function and $c(\cdot)$ is the log partition function. $\phi$ is a constant that scales the output variance. Assume an $H$-dimensional output vector $y = [y_1, y_2, \cdots, y_H]$ is modeled using $H$ independent GLMs and $\log P(y \mid x; W, b) = \sum_{i=1}^{H} \log P(y_i \mid x; w_i, b_i)$. Let $W$ be the weight matrix whose rows are the weight vectors of the individual GLMs, $b$ denote the bias vector of length $H$ and $\mathrm{vec}(\cdot)$ denote the Kronecker vector operator. The Fisher information matrix for the multi-dimensional GLM with respect to its parameters $\theta = [w_1^\top, b_1, \cdots, w_H^\top, b_H]^\top = \mathrm{vec}([W, b]^\top)$ is simply the expected Kronecker product of the data features and the output covariance matrix:
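The equation is missing here. Following the description in the sentence above (an expected Kronecker product of the output covariance and the data features, where the features are the input augmented with a constant 1 for the bias), it presumably reads as follows; the $\phi^2$ scaling is an assumption carried over from the GLM form:

$$
F(\theta) = \mathbb{E}_{x \sim P(x)}\left[\frac{\mathrm{Cov}[y \mid x]}{\phi^2} \otimes \begin{bmatrix} x x^\top & x \\ x^\top & 1 \end{bmatrix}\right]
$$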
We obtain normalized GLMs by applying the normalization methods to the summed inputs $a$ in the original model through $\mu$ and $\sigma$. Without loss of generality, we denote $\bar{F}$ as the Fisher information matrix under the normalized multi-dimensional GLM with the additional gain parameters $\theta = \mathrm{vec}([W, b, g]^\top)$:
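Normalized GLMs are obtained by applying the normalization methods, through $\mu$ and $\sigma$, to the summed inputs $a$ of the original model. Without loss of generality, $\bar{F}$ denotes the Fisher information matrix of the normalized multi-dimensional GLM with the additional gain parameters, $\theta = \mathrm{vec}([W, b, g]^\top)$: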
Implicit learning rate reduction through the growth of the weight vector: Notice that, compared to the standard GLM, the block $\bar{F}_{ij}$ along the weight vector $w_i$ direction is scaled by the gain parameters and the normalization scalar $\sigma_i$. If the norm of the weight vector $w_i$ grows twice as large, even though the model's output remains the same, the Fisher information matrix will be different. The curvature along the $w_i$ direction will change by a factor of $\frac{1}{2}$ because $\sigma_i$ will also be twice as large. As a result, for the same parameter update in the normalized model, the norm of the weight vector effectively controls the learning rate for the weight vector. During learning, it is harder to change the orientation of the weight vector with large norm. The normalization methods, therefore, have an implicit "early stopping" effect on the weight vectors and help to stabilize learning towards convergence.
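Implicit learning rate reduction through the growth of the weight vector: compared with the standard GLM, the block $\bar{F}_{ij}$ along the direction of the weight vector $w_i$ is scaled by the gain parameters and the normalization scalar $\sigma_i$. If the norm of $w_i$ doubles, the model's output stays the same but the Fisher information matrix changes: the curvature along the $w_i$ direction changes by a factor of $\frac{1}{2}$, because $\sigma_i$ doubles as well. So for the same parameter update in the normalized model, the norm of the weight vector effectively controls the learning rate of that weight vector, and it is harder to change the orientation of a weight vector with a large norm. Normalization methods therefore have an implicit "early stopping" effect on the weight vectors and help stabilize learning towards convergence.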
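The effect can be illustrated with the simplest of the three normalizers, weight normalization, where $\mu = 0$ and $\sigma = \|w\|_2$. The sketch below looks at the plain gradient of a single normalized unit rather than the Fisher matrix analyzed in the paper, so it is an illustration of the intuition, not of the paper's derivation. The function name and values are hypothetical. Doubling $w$ leaves the output unchanged but halves the gradient with respect to $w$, so the same step size moves the direction of a long weight vector less.

```python
import math

def wn_output_and_grad(w, x, g=1.0):
    """Output g * (w . x) / ||w|| of a weight-normalized unit and its gradient w.r.t. w."""
    norm = math.sqrt(sum(v * v for v in w))
    dot = sum(wi * xi for wi, xi in zip(w, x))
    out = g * dot / norm
    # d/dw [ (w . x) / ||w|| ] = x / ||w|| - (w . x) w / ||w||^3
    grad = [g * (xi / norm - dot * wi / norm ** 3) for wi, xi in zip(w, x)]
    return out, grad

x = [1.0, 2.0]
out_small, grad_small = wn_output_and_grad([3.0, 4.0], x)   # ||w|| = 5
out_large, grad_large = wn_output_and_grad([6.0, 8.0], x)   # same direction, ||w|| = 10

# Ratio of gradient lengths between the doubled and the original weight vector
ratio = math.sqrt(sum(v * v for v in grad_large)) / math.sqrt(sum(v * v for v in grad_small))
```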
Learning the magnitude of incoming weights: In normalized models, the magnitude of the incoming weights is explicitly parameterized by the gain parameters. We compare how the model output changes between updating the gain parameters in the normalized GLM and updating the magnitude of the equivalent weights under original parameterization during learning. The direction along the gain parameters in F¯ captures the geometry for the magnitude of the incoming weights. We show that Riemannian metric along the magnitude of the incoming weights for the standard GLM is scaled by the norm of its input, whereas learning the gain parameters for the batch normalized and layer normalized models depends only on the magnitude of the prediction error. Learning the magnitude of incoming weights in the normalized model is therefore, more robust to the scaling of the input and its parameters than in the standard model. See Appendix for detailed derivations.
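Learning the magnitude of incoming weights: in normalized models, the magnitude of the incoming weights is explicitly parameterized by the gain parameters. The comparison here is between how the model output changes when the gain parameters of the normalized GLM are updated and when the magnitude of the equivalent weights is updated under the original parameterization. The direction along the gain parameters in $\bar{F}$ captures the geometry for the magnitude of the incoming weights. For the standard GLM, the Riemannian metric along the magnitude of the incoming weights is scaled by the norm of its input, whereas learning the gain parameters in the batch-normalized and layer-normalized models depends only on the magnitude of the prediction error. Learning the magnitude of the incoming weights is therefore more robust to the scaling of the input and the parameters in the normalized model than in the standard model. See the appendix for detailed derivations.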
6 Experimental results
We perform experiments with layer normalization on 6 tasks, with a focus on recurrent neural networks: image-sentence ranking, question-answering, contextual language modelling, generative modelling, handwriting sequence generation and MNIST classification. Unless otherwise noted, the default initialization of layer normalization is to set the adaptive gains to 1 and the biases to 0 in the experiments.
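Layer normalization is evaluated on 6 tasks, with a focus on recurrent neural networks: image-sentence ranking, question answering, contextual language modelling, generative modelling, handwriting sequence generation, and MNIST classification. Unless otherwise noted, the default initialization of layer normalization in the experiments sets the adaptive gains to 1 and the biases to 0.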
6.1 Order embeddings of images and language
In this experiment, we apply layer normalization to the recently proposed order-embeddings model of Vendrov et al. [2016] for learning a joint embedding space of images and sentences. We follow the same experimental protocol as Vendrov et al. [2016] and modify their publicly available code, which utilizes Theano [Team et al., 2016], to incorporate layer normalization. Images and sentences from the Microsoft COCO dataset [Lin et al., 2014] are embedded into a common vector space, where a GRU [Cho et al., 2014] is used to encode sentences and the outputs of a pre-trained VGG ConvNet [Simonyan and Zisserman, 2015] (10-crop) are used to encode images. The order-embedding model represents images and sentences as a 2-level partial ordering and replaces the cosine similarity scoring function used in Kiros et al. [2014] with an asymmetric one.
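Here layer normalization is applied to the order-embeddings model of Vendrov et al. [2016], which learns a joint embedding space of images and sentences. The experimental protocol of Vendrov et al. [2016] is followed, and their publicly available Theano code [Team et al., 2016] is modified to incorporate layer normalization. Images and sentences from the Microsoft COCO dataset [Lin et al., 2014] are embedded into a common vector space: a GRU [Cho et al., 2014] encodes the sentences, and the outputs of a pre-trained VGG ConvNet [Simonyan and Zisserman, 2015] (10-crop) encode the images. The order-embedding model represents images and sentences as a 2-level partial ordering and replaces the cosine similarity scoring function of Kiros et al. [2014] with an asymmetric one.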
We trained two models: the baseline order-embedding model as well as the same model with layer normalization applied to the GRU. After every 300 iterations, we compute Recall@K (R@K) values on a held out validation set and save the model whenever R@K improves. The best performing models are then evaluated on 5 separate test sets, each containing 1000 images and 5000 captions, for which the mean results are reported. Both models use Adam [Kingma and Ba, 2014] with the same initial hyperparameters and both models are trained using the same architectural choices as used in Vendrov et al. [2016]. We refer the reader to the appendix for a description of how layer normalization is applied to GRU.
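Two models are trained: the baseline order-embedding model and the same model with layer normalization applied to the GRU. After every 300 iterations, Recall@K (R@K) is computed on a held-out validation set and the model is saved whenever R@K improves. The best-performing models are then evaluated on 5 separate test sets, each with 1000 images and 5000 captions, and the mean results are reported. Both models use Adam [Kingma and Ba, 2014] with the same initial hyperparameters and the same architectural choices as Vendrov et al. [2016]. See the appendix for how layer normalization is applied to the GRU.
To read up on the Adam optimizer: "Adam: A Method for Stochastic Optimization" (2014-12), D. Kingma and J. L. Ba.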
Figure 1 illustrates the validation curves of the models, with and without layer normalization. We plot R@1, R@5 and R@10 for the image retrieval task. We observe that layer normalization offers a per-iteration speedup across all metrics and converges to its best validation model in 60% of the time it takes the baseline model to do so. In Table 2, the test set results are reported from which we observe that layer normalization also results in improved generalization over the original model. The results we report are state-of-the-art for RNN embedding models, with only the structure-preserving model of Wang et al. [2016] reporting better results on this task. However, they evaluate under different conditions (1 test set instead of the mean over 5) and are thus not directly comparable.
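Figure 1 shows the validation curves of the models with and without layer normalization, plotting R@1, R@5, and R@10 for the image retrieval task. Layer normalization gives a per-iteration speedup on all metrics and reaches its best validation model in 60% of the time the baseline needs. Table 2 reports the test-set results, which show that layer normalization also generalizes better than the original model. The reported results are state of the art for RNN embedding models; only the structure-preserving model of Wang et al. [2016] reports better results on this task, but it is evaluated under different conditions (1 test set instead of the mean over 5) and is therefore not directly comparable.
Table 2: Average results across 5 test splits for caption and image retrieval. R@K is Recall@K (higher is better). Mean r is the mean rank (lower is better). Sym corresponds to the symmetric baseline, while OE indicates order-embeddings.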
6.2 Teaching machines to read and comprehend
In order to compare layer normalization to the recently proposed recurrent batch normalization [Cooijmans et al., 2016], we train a unidirectional attentive reader model on the CNN corpus, both introduced by Hermann et al. [2015]. This is a question-answering task where a query description about a passage must be answered by filling in a blank. The data is anonymized such that entities are given randomized tokens to prevent degenerate solutions, which are consistently permuted during training and evaluation. We follow the same experimental protocol as Cooijmans et al. [2016] and modify their public code, which uses Theano [Team et al., 2016], to incorporate layer normalization. We obtained the pre-processed dataset used by Cooijmans et al. [2016] which differs from the original experiments of Hermann et al. [2015] in that each passage is limited to 4 sentences. In Cooijmans et al. [2016], two variants of recurrent batch normalization are used: one where BN is only applied to the LSTM while the other applies BN everywhere throughout the model. In our experiment, we only apply layer normalization within the LSTM.
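To compare layer normalization with the recently proposed recurrent batch normalization [Cooijmans et al., 2016], a unidirectional attentive reader model is trained on the CNN corpus, both introduced by Hermann et al. [2015]. This is a question-answering task in which a query about a passage must be answered by filling in a blank. The data is anonymized: entities are given randomized tokens, consistently permuted during training and evaluation, to prevent degenerate solutions. The experimental protocol of Cooijmans et al. [2016] is followed, and their public Theano code [Team et al., 2016] is modified to incorporate layer normalization. The pre-processed dataset is the one used by Cooijmans et al. [2016], which differs from the original experiments of Hermann et al. [2015] in that each passage is limited to 4 sentences. Cooijmans et al. [2016] use two variants of recurrent batch normalization, one applying BN only to the LSTM and one applying BN throughout the model. In this experiment, layer normalization is applied only within the LSTM.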
The results of this experiment are shown in Figure 2. We observe that layer normalization not only trains faster but converges to a better validation result over both the baseline and BN variants. In Cooijmans et al. [2016], it is argued that the scale parameter in BN must be carefully chosen and is set to 0.1 in their experiments. We experimented with layer normalization for both 1.0 and 0.1 scale initialization and found that the former model performed significantly better. This demonstrates that layer normalization is not sensitive to the initial scale in the same way that recurrent BN is.
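Figure 2 shows the results. Layer normalization not only trains faster but also converges to a better validation result than both the baseline and the BN variants. Cooijmans et al. [2016] argue that the scale parameter in BN must be chosen carefully and set it to 0.1 in their experiments. Layer normalization was tried with both 1.0 and 0.1 scale initialization, and the former performed significantly better, which shows that layer normalization is not sensitive to the initial scale in the way recurrent BN is.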
6.3 Skip-thought vectors
Skip-thoughts [Kiros et al., 2015] is a generalization of the skip-gram model [Mikolov et al., 2013] for learning unsupervised distributed sentence representations. Given contiguous text, a sentence is encoded with an encoder RNN and decoder RNNs are used to predict the surrounding sentences. Kiros et al. [2015] showed that this model could produce generic sentence representations that perform well on several tasks without being fine-tuned. However, training this model is time-consuming, requiring several days of training in order to produce meaningful results.
Skip-thoughts [Kiros et al., 2015] 是对 skip-gram 模型 [Mikolov et al., 2013] 的一种泛化,用于学习无监督分布式句子表示。给定连续的文本,句子通过编码器 RNN 进行编码,并使用解码器 RNN 预测周围的上下文句子。Kiros et al. [2015] 展示了这个模型能够产生通用的句子表示,这些表示在不经过微调的情况下在多个任务上表现良好。然而,训练这个模型非常耗时,需要几天的训练才能产生有意义的结果。
In this experiment we determine to what extent layer normalization can speed up training. Using the publicly available code of Kiros et al. [2015] (footnote 4), we train two models on the BookCorpus dataset [Zhu et al., 2015]: one with and one without layer normalization. These experiments are performed with Theano [Team et al., 2016]. We adhere to the experimental setup used in Kiros et al. [2015], training a 2400-dimensional sentence encoder with the same hyperparameters. Given the size of the states used, it is conceivable that layer normalization would produce slower per-iteration updates than the model without it. However, we found that provided CNMeM (footnote 5) is used, there was no significant difference between the two models. We checkpoint both models after every 50,000 iterations and evaluate their performance on five tasks: semantic-relatedness (SICK) [Marelli et al., 2014], movie review sentiment (MR) [Pang and Lee, 2005], customer product reviews (CR) [Hu and Liu, 2004], subjectivity/objectivity classification (SUBJ) [Pang and Lee, 2004] and opinion polarity (MPQA) [Wiebe et al., 2005]. We plot the performance of both models for each checkpoint on all tasks to determine whether the performance rate can be improved with LN.
在这个实验中,我们决定研究层归一化可以如何加速训练。我们使用Kiros等人[2015]公开的代码4,在BookCorpus数据集[Zhu等人,2015]上训练两个模型:一个应用层归一化,一个不应用。这些实验使用Theano[Team等人,2016]进行。我们遵循Kiros等人[2015]中使用的实验设置,训练一个2400维的句子编码器,并使用相同的超参数。考虑到使用的状态大小,可以想象层归一化会比不使用时产生更慢的每次迭代更新。然而,我们发现,只要使用CNMeM 5,两个模型之间没有显著差异。我们在每50000次迭代后对两个模型进行检查点,并在五个任务上评估它们的性能:语义相关性(SICK)[Marelli等人,2014]、电影评论情感(MR)[Pang和Lee,2005]、顾客产品评论(CR)[Hu和Liu,2004]、主观性/客观性分类(SUBJ)[Pang和Lee,2004]和观点极性(MPQA)[Wiebe等人,2005]。我们在所有任务上为每个检查点绘制两个模型的性能,以确定是否可以通过LN提高性能比率。
The experimental results are illustrated in Figure 3. We observe that applying layer normalization results in both a speedup over the baseline and better final results after 1M iterations, as shown in Table 3. We also let the model with layer normalization train for a total of a month, resulting in further performance gains across all but one task. We note that the performance differences between the original reported results and ours are likely due to the fact that the publicly available code does not condition at each timestep of the decoder, whereas the original model does.
实验结果如 图3 所示。我们观察到,应用层归一化不仅在基线基础上实现了加速,而且在进行了100万次迭代后,最终结果也更好,如 表3 所示。我们还让应用层归一化的模型训练了总共一个月,结果在除了一个任务之外的所有任务上都取得了进一步的性能提升。我们注意到,原始报告结果与我们的结果之间的性能差异可能是由于公开可用的代码在解码器的每个时间步都不进行条件处理,而原始模型确实进行了条件处理。
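Figure 3: Performance of skip-thought vectors with and without layer normalization on downstream tasks as a function of training iterations. The original lines are the results reported in [Kiros et al., 2015]. Plots with error bars use 10-fold cross validation. Best seen in color.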
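SICK(r), SICK(MSE), MR, CR, SUBJ and MPQA denote the datasets and metrics used to evaluate the sentence representations on semantic relatedness, sentiment analysis and subjectivity/objectivity classification. A brief description of each:
1. SICK(r): Pearson correlation (r) between the predicted and human-annotated relatedness scores on the SICK (Sentences Involving Compositional Knowledge) dataset, which measures the semantic relatedness of sentence pairs.
2. SICK(MSE): The same SICK task scored by mean squared error (MSE) between the predicted and human-annotated relatedness scores.
3. MR: The Movie Review dataset, a sentiment analysis task that classifies reviews as positive or negative.
4. CR: The Customer Reviews dataset, also a sentiment analysis task, which classifies customer product reviews as positive or negative.
5. SUBJ: Subjectivity/objectivity classification, which separates subjective text expressing opinions and feelings from objective text describing facts.
6. MPQA: The opinion polarity task from the MPQA (Multi-Perspective Question Answering) corpus, which classifies the polarity of an opinion as positive or negative.
These tasks measure the quality of the learned sentence embeddings and how well they generalize to tasks they were not trained on. In this experiment they are used to compare the model with and without layer normalization.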
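Table 3: Skip-thoughts results. The first two evaluation columns indicate Pearson and Spearman correlation, the third is mean squared error, and the remaining columns indicate classification accuracy. Higher is better for all evaluations except MSE. Our models were trained for 1M iterations, with the exception of (†), which was trained for one month (approximately 1.7M iterations).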
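The three SICK columns of Table 3 (Pearson correlation, Spearman correlation and mean squared error) can be computed as in the following sketch. The score lists are made-up values on a 1 to 5 relatedness scale, not results from the paper, and the helper names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def mse(x, y):
    """Mean squared error between two equal-length lists."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

# Hypothetical human and model relatedness scores for five sentence pairs.
human = [1.0, 2.5, 3.0, 4.2, 5.0]
model = [1.4, 2.0, 3.3, 3.9, 4.8]

print(pearson(human, model), spearman(human, model), mse(human, model))
```

Higher is better for the two correlations and lower is better for MSE, which matches the reading direction stated in the Table 3 caption.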
6.4 Modeling binarized MNIST using DRAW
We also experimented with generative modeling on the MNIST dataset. Deep Recurrent Attention Writer (DRAW) [Gregor et al., 2015] has previously achieved state-of-the-art performance on modeling the distribution of MNIST digits. The model uses a differentiable attention mechanism and a recurrent neural network to sequentially generate pieces of an image. We evaluate the effect of layer normalization on a DRAW model using 64 glimpses and 256 LSTM hidden units. The model is trained with the default setting of the Adam [Kingma and Ba, 2014] optimizer and a minibatch size of 128. Previous publications on binarized MNIST have used various training protocols to generate their datasets. In this experiment, we used the fixed binarization from Larochelle and Murray [2011]. The dataset has been split into 50,000 training, 10,000 validation and 10,000 test images.
我们也在MNIST数据集上进行了生成模型的实验。深度递归注意力写入器(Deep Recurrent Attention Writer, DRAW)[Gregor et al., 2015] 之前已经在模拟MNIST数字分布方面取得了最先进的性能。该模型使用差分注意力机制和循环神经网络来顺序生成图像的各个部分。我们评估了层归一化对使用64次快速查看和256个LSTM隐藏单元的DRAW模型的影响。模型使用Adam优化器[Kingma和Ba, 2014]的默认设置和128的minibatch大小进行训练。关于二值化MNIST的先前出版物使用了各种训练协议来生成他们的数据集。在这个实验中,我们使用了Larochelle和Murray[2011]的固定二值化。数据集已经被分割成50,000个训练图像、10,000个验证图像和10,000个测试图像。
Figure 4 shows the test variational bound for the first 100 epochs. It highlights the speedup benefit of applying layer normalization: the layer normalized DRAW converges almost twice as fast as the baseline model. After 200 epochs, the baseline model converges to a variational log likelihood of 82.36 nats on the test data and the layer normalization model obtains 82.09 nats.
图4 显示了前100个训练周期的测试变分界限。它突出了应用层归一化带来的加速效益,层归一化的DRAW模型的收敛速度几乎是基线模型的两倍。在200个周期后,基线模型在测试数据上收敛到82.36纳特的变分对数似然,而层归一化模型获得了82.09纳特。
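The variational bound is commonly used to evaluate latent-variable generative models such as variational autoencoders (VAEs) and DRAW, because the exact likelihood of the data is intractable. Figure 4 reports the bound as a negative log likelihood, so lower values are better. In this experiment, layer normalization improves training efficiency: the model reaches a low bound sooner and ends slightly lower than the baseline (82.09 versus 82.36 nats).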
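A nat is a unit of information, like the bit, but based on the natural logarithm rather than the base-2 logarithm; one nat equals 1/ln 2 ≈ 1.443 bits. Log likelihoods computed with natural logarithms are therefore expressed in nats directly, without a conversion factor.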
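As a small worked example of the unit, the two test bounds quoted above (82.36 and 82.09 nats) can be converted to bits and to a per-pixel value. The helper name `nats_to_bits` is illustrative, and the per-pixel figure assumes 28×28 = 784 pixels per MNIST image.

```python
import math

def nats_to_bits(nats):
    """Convert an information quantity from nats to bits."""
    return nats / math.log(2)

baseline_nats = 82.36     # baseline DRAW, from the text above
layer_norm_nats = 82.09   # layer-normalized DRAW, from the text above
pixels_per_image = 28 * 28

gap_nats = baseline_nats - layer_norm_nats
print(nats_to_bits(baseline_nats))          # bound in bits, baseline
print(nats_to_bits(layer_norm_nats))        # bound in bits, layer norm
print(gap_nats, nats_to_bits(gap_nats))     # improvement in nats and bits
print(layer_norm_nats / pixels_per_image)   # nats per pixel
```

The final gap of 0.27 nats is small; the main benefit reported for this experiment is the faster convergence.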
Figure 4: DRAW model test negative log likelihood with and without layer normalization.
6.5 Handwriting sequence generation
The previous experiments mostly examine RNNs on NLP tasks whose lengths are in the range of 10 to 40. To show the effectiveness of layer normalization on longer sequences, we performed handwriting generation tasks using the IAM Online Handwriting Database [Liwicki and Bunke, 2005]. IAM-OnDB consists of handwritten lines collected from 221 different writers. When given the input character string, the goal is to predict a sequence of x and y pen co-ordinates of the corresponding handwriting line on the whiteboard. There are, in total, 12179 handwriting line sequences. The input string is typically more than 25 characters and the average handwriting line has a length around 700.
先前的实验主要检验了在长度范围为10到40的自然语言处理任务上的RNN。为了展示层归一化在更长序列上的效力,我们使用IAM在线手写数据库[Liwicki和Bunke, 2005]进行了手写生成任务。IAM-OnDB由来自221个不同作者的手写线组成。给定输入的字符字符串,目标是预测白板上相应手写线的x和y笔坐标序列。总共有12179个手写线序列。输入字符串通常超过25个字符,平均手写线的长度约为700。
We used the same model architecture as in Section (5.2) of Graves [2013]. The model architecture consists of three hidden layers of 400 LSTM cells, which produce 20 bivariate Gaussian mixture components at the output layer, and a size 3 input layer. The character sequence was encoded with one-hot vectors, and hence the window vectors were size 57. A mixture of 10 Gaussian functions was used for the window parameters, requiring a size 30 parameter vector. The total number of weights was increased to approximately 3.7M. The model is trained using mini-batches of size 8 and the Adam [Kingma and Ba, 2014] optimizer.
我们使用了与Graves[2013]第(5.2)节相同的模型架构。该模型架构由三个隐藏层组成,每层有400个LSTM单元,在输出层产生20个二元高斯混合分量,以及一个大小为3的输入层。字符序列使用独热向量进行编码,因此窗口向量的大小为57。用于窗口参数的10个高斯函数的混合,需要一个大小为30的参数向量。总权重数量增加到大约3.7M。模型使用大小为8的mini-batch和Adam[Kingma和Ba, 2014]优化器进行训练。
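The layer sizes quoted above can be checked with a short calculation. The per-component parameter counts below follow the parameterization in Graves [2013] as recalled here (they are not spelled out in the text above, so treat them as an assumption): each bivariate Gaussian has a mixture weight, two means, two standard deviations and one correlation, the output adds one end-of-stroke probability, and each window Gaussian has three parameters.

```python
# Assumed per-component counts, following Graves [2013]; not stated in the text above.
PARAMS_PER_BIVARIATE_GAUSSIAN = 6  # weight, 2 means, 2 std devs, 1 correlation
PARAMS_PER_WINDOW_GAUSSIAN = 3     # importance, width, location

def output_layer_size(num_components: int) -> int:
    """Mixture density parameters plus one end-of-stroke probability."""
    return num_components * PARAMS_PER_BIVARIATE_GAUSSIAN + 1

def window_param_size(num_window_gaussians: int) -> int:
    """Size of the parameter vector that controls the attention window."""
    return num_window_gaussians * PARAMS_PER_WINDOW_GAUSSIAN

print(output_layer_size(20))  # 121
print(window_param_size(10))  # 30
```

The second value agrees with the "size 30 parameter vector" stated in the text for a mixture of 10 Gaussian window functions.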
The combination of small mini-batch size and very long sequences makes it important to have very stable hidden dynamics. Figure 5 shows that layer normalization converges to a log likelihood comparable to that of the baseline model, but much faster.
小批量大小和非常长的序列的结合使得拥有非常稳定的隐藏动态非常重要。图5 显示,层归一化能够快速收敛到与基线模型相当的对数似然,但速度要快得多。
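Figure 5: Handwriting sequence generation model negative log likelihood with and without layer normalization. The models are trained with a mini-batch size of 8 and a sequence length of 500.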
6.6 Permutation invariant MNIST
In addition to RNNs, we investigated layer normalization in feed-forward networks. We show how layer normalization compares with batch normalization on the well-studied permutation invariant MNIST classification problem. From the previous analysis, layer normalization is invariant to input re-scaling, which is desirable for the internal hidden layers. But this is unnecessary for the logit outputs, where the prediction confidence is determined by the scale of the logits. We only apply layer normalization to the fully-connected hidden layers, excluding the last softmax layer.
除了RNN之外,我们还研究了前馈网络中的层归一化。我们展示了在广泛研究的MNIST分类问题中,层归一化与批量归一化相比如何,该问题要求模型对输入的排列不变性具有不变性。从先前的分析中,我们知道层归一化对输入重新缩放是不变的,这对内部隐藏层是可取的。但这对输出层的逻辑单元(logits)来说是不必要的,因为预测的置信度是由逻辑单元的规模决定的。我们只将层归一化应用于全连接的隐藏层,不包括最后的softmax层。
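The reasoning about hidden layers versus logits can be sketched in a few lines. The example below uses made-up values and illustrative function names, and it omits the gain, bias and any stability constant so that the invariance is exact up to floating-point error. It shows that normalizing a vector removes its overall scale, while softmax confidence depends on that scale, which is why normalizing the logits would discard useful information.

```python
import math

def normalize(values):
    """Subtract the mean and divide by the standard deviation (no gain, bias or eps)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical summed inputs, and the same inputs re-scaled by a factor of 10.
original = [1.0, 2.0, 4.0]
rescaled = [10.0 * v for v in original]

# Normalization gives the same result for both: the scale is removed.
print(normalize(original))
print(normalize(rescaled))

# Softmax confidence changes with the scale of the logits.
print(max(softmax(original)))
print(max(softmax(rescaled)))
```

With a nonzero stability constant the invariance holds only approximately, and a constant input vector would need separate handling because its standard deviation is zero.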
All the models were trained using 55000 training data points and the Adam [Kingma and Ba, 2014] optimizer. For the smaller batch-size, the variance term for batch normalization is computed using the unbiased estimator. The experimental results from Figure 6 highlight that layer normalization is robust to the batch size and exhibits faster training convergence compared to batch normalization that is applied to all layers.
所有模型都使用55000个训练数据点和Adam[Kingma和Ba, 2014]优化器进行训练。对于较小的批量大小,批量归一化的方差项是使用无偏估计器计算的。图6 的实验结果突出显示,层归一化对批量大小具有鲁棒性,并且与应用于所有层的批量归一化相比,展现出更快的训练收敛速度。
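The remark about the unbiased variance estimator matters mainly at the smaller batch size. The sketch below compares the biased estimator (divide by n) with the unbiased one (divide by n - 1) using Python's `statistics` module; the activation values are made up, and the batch sizes 4 and 128 are the ones quoted for Figure 6.

```python
import statistics

def correction_factor(batch_size):
    """Ratio of the unbiased to the biased variance estimate: n / (n - 1)."""
    return batch_size / (batch_size - 1)

# Hypothetical summed inputs to one hidden unit across a mini-batch of 4 cases.
batch = [0.2, -1.3, 0.7, 1.6]

biased = statistics.pvariance(batch)   # divides by n
unbiased = statistics.variance(batch)  # divides by n - 1

print(biased, unbiased)
print(correction_factor(4))    # about 1.333
print(correction_factor(128))  # about 1.008
```

At batch size 4 the two estimates differ by a factor of 4/3, while at batch size 128 the difference is under one percent.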
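Figure 6: Permutation invariant MNIST 784-1000-1000-10 model negative log likelihood and test error with layer normalization and batch normalization. (Left) The models are trained with a batch size of 128. (Right) The models are trained with a batch size of 4.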
6.7 Convolutional Networks
We have also experimented with convolutional neural networks. In our preliminary experiments, we observed that layer normalization offers a speedup over the baseline model without normalization, but batch normalization outperforms the other methods. With fully connected layers, all the hidden units in a layer tend to make similar contributions to the final prediction, and re-centering and re-scaling the summed inputs to a layer works well. However, the assumption of similar contributions is no longer true for convolutional neural networks. A large number of hidden units whose receptive fields lie near the boundary of the image are rarely turned on and thus have very different statistics from the rest of the hidden units within the same layer. We think further research is needed to make layer normalization work well in ConvNets.
我们还对卷积神经网络进行了实验。在我们的初步实验中,我们观察到层归一化比没有归一化的基线模型提供了加速,但批量归一化的其他方法表现更佳。在全连接层中,一个层中的所有隐藏单元倾向于对最终预测做出类似的贡献,因此对一个层的输入总和进行重新中心化和重新缩放效果很好。然而,对于卷积神经网络来说,这种假设不再成立。大量隐藏单元的接收场位于图像边界附近,很少被激活,因此与同一层中的其他隐藏单元的统计数据非常不同。我们认为需要进一步的研究来使层归一化在ConvNets中表现良好。
7 Conclusion
In this paper, we introduced layer normalization to speed-up the training of neural networks. We provided a theoretical analysis that compared the invariance properties of layer normalization with batch normalization and weight normalization. We showed that layer normalization is invariant to per training-case feature shifting and scaling.
在这篇论文中,我们引入了层归一化来加速神经网络的训练。我们提供了一个理论分析,比较了层归一化与批量归一化和权重归一化的不变性质。我们展示了层归一化对每个训练案例的特征移位和缩放是不变的。
Empirically, we showed that recurrent neural networks benefit the most from the proposed method especially for long sequences and small mini-batches.
从实证角度来看,我们展示了循环神经网络尤其从所提出的方法中受益最大,特别是对于长序列和小批量数据。