The Mathematical Principles and Formulas Behind Large AI Model Architectures: What Are the Key Formulas of the Transformer Architecture?
Hello everyone, I'm 微学AI. Today I'd like to walk through the mathematics behind large-model architectures. Most large models are built on the Transformer architecture, and the underlying math draws on linear algebra, probability theory, optimization theory, and more. Below are the key principles and formulas, each explained with a worked example.
The Mathematical Principles Hidden Behind Large Models
1. Linear Transformation
One of the core operations in large models is the linear transformation:
$$\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$$
- $\mathbf{x}$ is the input vector (dimension $d_{\text{in}}$).
- $\mathbf{W}$ is the weight matrix (dimension $d_{\text{out}} \times d_{\text{in}}$).
- $\mathbf{b}$ is the bias vector (dimension $d_{\text{out}}$).
- $\mathbf{y}$ is the output vector (dimension $d_{\text{out}}$).
Example:
Suppose the input vector is $\mathbf{x} = [1, 2, 3]^\top$, the weight matrix is $\mathbf{W} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}$, and the bias vector is $\mathbf{b} = [0.5, -0.5]^\top$. Then:
$$\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 0.5 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 4 \\ 2 \end{bmatrix} + \begin{bmatrix} 0.5 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 4.5 \\ 1.5 \end{bmatrix}$$
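To make this concrete, here is a minimal PyTorch sketch of the same computation. The numbers mirror the worked example; in a real model $\mathbf{W}$ and $\mathbf{b}$ would be learned parameters (e.g. inside a `torch.nn.Linear` layer).

```python
# Minimal sketch of y = Wx + b with the numbers from the worked example.
import torch

x = torch.tensor([1., 2., 3.])              # input, d_in = 3
W = torch.tensor([[1., 0., 1.],
                  [0., 1., 0.]])            # weight matrix, d_out x d_in = 2 x 3
b = torch.tensor([0.5, -0.5])               # bias, d_out = 2

y = W @ x + b
print(y)                                    # tensor([4.5000, 1.5000])
```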
2. Positional Encoding
The Transformer uses positional encodings to inject positional information into the sequence:
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)$$
- $pos$ is the position index.
- $i$ is the dimension index.
- $d$ is the embedding dimension.
Example:
Suppose $pos = 1$ and $d = 4$. Then:
$$PE_{(1, 0)} = \sin\left(\frac{1}{10000^{0/4}}\right) = \sin(1), \quad PE_{(1, 1)} = \cos\left(\frac{1}{10000^{0/4}}\right) = \cos(1)$$
$$PE_{(1, 2)} = \sin\left(\frac{1}{10000^{2/4}}\right) = \sin\left(\frac{1}{100}\right), \quad PE_{(1, 3)} = \cos\left(\frac{1}{10000^{2/4}}\right) = \cos\left(\frac{1}{100}\right)$$
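Below is a small sketch of how these encodings can be computed in PyTorch. The function name `sinusoidal_pe` and the `max_len = 8` setting are illustrative choices; the row for `pos = 1` reproduces the values derived above.

```python
# Sinusoidal positional encoding, assuming the embedding dimension d is even.
import torch

def sinusoidal_pe(max_len: int, d: int) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d, 2, dtype=torch.float32)                  # even dimensions 0, 2, ...
    angles = pos / (10000 ** (i / d))                               # (max_len, d/2)
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(angles)                                 # even indices use sin
    pe[:, 1::2] = torch.cos(angles)                                 # odd indices use cos
    return pe

pe = sinusoidal_pe(max_len=8, d=4)
print(pe[1])   # ≈ [sin(1), cos(1), sin(0.01), cos(0.01)]
```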
3. Attention Mechanism
The core of the attention mechanism is computing similarities between queries (Query), keys (Key), and values (Value):
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}$$
- $\mathbf{Q}$ is the query matrix (dimension $n \times d_k$).
- $\mathbf{K}$ is the key matrix (dimension $m \times d_k$).
- $\mathbf{V}$ is the value matrix (dimension $m \times d_v$).
- $d_k$ is the key dimension.
- $\text{softmax}$ is the normalization function.
Example:
Suppose $\mathbf{Q} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\mathbf{K} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$, $\mathbf{V} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, and $d_k = 2$. Then:
$$\mathbf{Q}\mathbf{K}^\top = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$
$$\text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{2}}\right) = \text{softmax}\left(\begin{bmatrix} 0 & 0.707 \\ 0.707 & 0 \end{bmatrix}\right) \approx \begin{bmatrix} 0.330 & 0.670 \\ 0.670 & 0.330 \end{bmatrix}$$
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \approx \begin{bmatrix} 0.330 & 0.670 \\ 0.670 & 0.330 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \approx \begin{bmatrix} 2.34 & 3.34 \\ 1.66 & 2.66 \end{bmatrix}$$
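The same computation can be checked with a short PyTorch sketch, written as a direct implementation of the formula rather than an optimized one:

```python
# Scaled dot-product attention on the example matrices.
import torch
import torch.nn.functional as F

Q = torch.tensor([[1., 0.], [0., 1.]])
K = torch.tensor([[0., 1.], [1., 0.]])
V = torch.tensor([[1., 2.], [3., 4.]])
d_k = Q.shape[-1]

scores = Q @ K.T / d_k ** 0.5          # (n, m) similarity scores
weights = F.softmax(scores, dim=-1)    # row-wise normalization
output = weights @ V
print(weights)                         # ≈ [[0.330, 0.670], [0.670, 0.330]]
print(output)                          # ≈ [[2.34, 3.34], [1.66, 2.66]]
```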
4. Multi-Head Attention
Multi-head attention computes several attention heads in parallel so they can capture different features:
$$\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h)\mathbf{W}^O$$
where each attention head is computed as:
$$\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$$
- $\mathbf{W}_i^Q, \mathbf{W}_i^K, \mathbf{W}_i^V$ are the projection matrices of each head.
- $\mathbf{W}^O$ is the output projection matrix.
- $h$ is the number of attention heads.
Example:
Suppose $h = 2$, $\mathbf{Q} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\mathbf{K} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$, $\mathbf{V} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, and the projection matrices are:
$$\mathbf{W}_1^Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \mathbf{W}_1^K = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \mathbf{W}_1^V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$
$$\mathbf{W}_2^Q = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad \mathbf{W}_2^K = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad \mathbf{W}_2^V = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$
Then:
$$\text{head}_1 = \text{Attention}(\mathbf{Q}\mathbf{W}_1^Q, \mathbf{K}\mathbf{W}_1^K, \mathbf{V}\mathbf{W}_1^V) = \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$$
$$\text{head}_2 = \text{Attention}(\mathbf{Q}\mathbf{W}_2^Q, \mathbf{K}\mathbf{W}_2^K, \mathbf{V}\mathbf{W}_2^V) = \text{Attention}\left(\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \begin{bmatrix} 2 & 1 \\ 4 & 3 \end{bmatrix}\right)$$
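A minimal two-head sketch of this example is shown below. For simplicity the output projection $\mathbf{W}^O$ is assumed to be the identity, whereas in a real Transformer it is a learned $hd_v \times d_{\text{model}}$ matrix.

```python
# Two-head attention following the worked example.
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return F.softmax(Q @ K.T / d_k ** 0.5, dim=-1) @ V

Q = torch.tensor([[1., 0.], [0., 1.]])
K = torch.tensor([[0., 1.], [1., 0.]])
V = torch.tensor([[1., 2.], [3., 4.]])

I = torch.eye(2)                        # W_1^Q = W_1^K = W_1^V = identity
P = torch.tensor([[0., 1.], [1., 0.]])  # W_2^Q = W_2^K = W_2^V = permutation

head1 = attention(Q @ I, K @ I, V @ I)
head2 = attention(Q @ P, K @ P, V @ P)

W_O = torch.eye(4)                      # assumed identity output projection (illustrative)
multihead = torch.cat([head1, head2], dim=-1) @ W_O
print(multihead.shape)                  # torch.Size([2, 4])
```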
5. Residual Connection
Residual connections help mitigate the vanishing-gradient problem:
$$\mathbf{y} = \text{Layer}(\mathbf{x}) + \mathbf{x}$$
- $\mathbf{x}$ is the input.
- $\text{Layer}(\mathbf{x})$ is the output of a given layer.
Example:
Suppose $\mathbf{x} = [1, 2]^\top$ and the layer output is $\text{Layer}(\mathbf{x}) = [0.5, -0.5]^\top$. Then:
$$\mathbf{y} = [0.5, -0.5]^\top + [1, 2]^\top = [1.5, 1.5]^\top$$
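In code the residual pattern is just an addition around an arbitrary sub-layer; the `sublayer` function below is a stand-in that returns the values from the example.

```python
# Residual connection: y = Layer(x) + x.
import torch

def sublayer(x: torch.Tensor) -> torch.Tensor:
    # stand-in for attention / feed-forward; returns the example's layer output
    return torch.tensor([0.5, -0.5])

x = torch.tensor([1., 2.])
y = sublayer(x) + x
print(y)            # tensor([1.5000, 1.5000])
```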
6. Layer Normalization
Layer normalization stabilizes the training process:
$$\text{LayerNorm}(\mathbf{x}) = \gamma \cdot \frac{\mathbf{x} - \mu}{\sigma} + \beta$$
- $\mathbf{x}$ is the input vector.
- $\mu$ is its mean and $\sigma$ its standard deviation.
- $\gamma$ and $\beta$ are learnable parameters.
Example:
Suppose $\mathbf{x} = [1, 2, 3]^\top$, so $\mu = 2$ and $\sigma = \sqrt{\frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3}} = \sqrt{\frac{2}{3}}$. With $\gamma = 1$ and $\beta = 0$:
$$\text{LayerNorm}(\mathbf{x}) = 1 \cdot \frac{[1, 2, 3] - 2}{\sqrt{2/3}} + 0 \approx [-1.225, 0, 1.225]$$
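A short PyTorch sketch reproduces this result. Like the formula above, `torch.nn.functional.layer_norm` uses the biased variance (dividing by $n$), plus a small $\epsilon$ for numerical stability.

```python
# Layer normalization with gamma = 1, beta = 0.
import torch
import torch.nn.functional as F

x = torch.tensor([1., 2., 3.])
mu = x.mean()
sigma = x.var(unbiased=False).sqrt()            # sqrt(2/3), biased (divide by n)
y = 1.0 * (x - mu) / sigma + 0.0
print(y)                                        # ≈ tensor([-1.2247, 0.0000, 1.2247])

print(F.layer_norm(x, normalized_shape=(3,)))   # built-in, same result up to eps
```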
7. GELU Activation Function
GELU (Gaussian Error Linear Unit) is a widely used activation function:
$$\text{GELU}(x) = x \cdot \Phi(x)$$
where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. A common approximation is:
$$\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right)\right)$$
Example:
Suppose $x = 1$. Then:
$$\text{GELU}(1) \approx 0.5 \cdot 1 \cdot \left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(1 + 0.044715 \cdot 1^3\right)\right)\right) \approx 0.841$$
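The sketch below compares this tanh approximation against PyTorch's built-in GELU (which evaluates $x \cdot \Phi(x)$ exactly); both give roughly 0.841 at $x = 1$.

```python
# GELU: tanh approximation vs. the exact built-in.
import math
import torch
import torch.nn.functional as F

def gelu_tanh(x: torch.Tensor) -> torch.Tensor:
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

x = torch.tensor(1.0)
print(gelu_tanh(x))     # ≈ 0.8412
print(F.gelu(x))        # exact x * Phi(x) ≈ 0.8413
```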
8. Softmax Function
The softmax function converts a vector into a probability distribution:
$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}}$$
- $\mathbf{z}$ is the input vector.
- $z_i$ is its $i$-th element.
Example:
Suppose $\mathbf{z} = [1, 2, 3]^\top$. Then:
$$\text{softmax}(\mathbf{z}) = \left[\frac{e^1}{e^1 + e^2 + e^3}, \frac{e^2}{e^1 + e^2 + e^3}, \frac{e^3}{e^1 + e^2 + e^3}\right] \approx [0.090, 0.245, 0.665]$$
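In code this is a one-liner; the manual version below subtracts the maximum first, which is the standard numerically stable formulation and does not change the result.

```python
# Softmax of z = [1, 2, 3].
import torch

z = torch.tensor([1., 2., 3.])
shifted = torch.exp(z - z.max())           # subtract max for numerical stability
print(shifted / shifted.sum())             # ≈ tensor([0.0900, 0.2447, 0.6652])
print(torch.softmax(z, dim=0))             # built-in, same result
```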
9. Loss Function
Large models are usually trained with the cross-entropy loss:
$$\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{i=1}^n y_i \log(\hat{y}_i)$$
- $\mathbf{y}$ is the ground-truth label (one-hot encoded).
- $\hat{\mathbf{y}}$ is the probability distribution predicted by the model.
Example:
Suppose the true label is $\mathbf{y} = [0, 1, 0]^\top$ and the model predicts $\hat{\mathbf{y}} = [0.1, 0.7, 0.2]^\top$. Then:
$$\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = -\left(0 \cdot \log(0.1) + 1 \cdot \log(0.7) + 0 \cdot \log(0.2)\right) = -\log(0.7) \approx 0.357$$
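A minimal sketch of this computation is shown below. It applies the formula directly to the already-normalized probabilities; PyTorch's own `F.cross_entropy`, by contrast, expects raw logits and applies softmax internally.

```python
# Cross-entropy between a one-hot target and a predicted distribution.
import torch

y = torch.tensor([0., 1., 0.])             # one-hot ground truth
y_hat = torch.tensor([0.1, 0.7, 0.2])      # predicted probabilities
loss = -(y * torch.log(y_hat)).sum()
print(loss)                                # ≈ 0.3567  (= -log 0.7)
```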
10. Dropout
Dropout is a regularization method that randomly drops a subset of neurons during training:
$$\mathbf{y} = \mathbf{m} \odot \mathbf{x}$$
- $\mathbf{m}$ is a mask vector whose entries are 0 or 1, sampled according to the dropout probability $p$.
- $\odot$ denotes element-wise multiplication.
Example:
Suppose $\mathbf{x} = [1, 2, 3]^\top$, $p = 0.5$, and the sampled mask is $\mathbf{m} = [1, 0, 1]^\top$. Then:
$$\mathbf{y} = [1, 0, 1]^\top \odot [1, 2, 3]^\top = [1, 0, 3]^\top$$
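A minimal mask-based sketch is shown below. Framework implementations (e.g. `torch.nn.functional.dropout`) additionally rescale the surviving activations by $1/(1-p)$ at training time ("inverted dropout"), which the plain formula above omits.

```python
# Dropout via an explicit 0/1 mask (no rescaling, as in the formula above).
import torch

x = torch.tensor([1., 2., 3.])
m = torch.tensor([1., 0., 1.])     # a sampled mask; entries are zeroed with probability p
print(m * x)                       # tensor([1., 0., 3.])
```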
11. Backpropagation
Backpropagation computes gradients via the chain rule:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \frac{\partial \mathbf{y}}{\partial \mathbf{W}}$$
- $\mathcal{L}$ is the loss function.
- $\mathbf{y}$ is the model output.
Example:
Suppose $\mathbf{y} = \mathbf{W}\mathbf{x}$ and $\mathcal{L} = \frac{1}{2}\|\mathbf{y} - \mathbf{t}\|^2$ (squared error against a target $\mathbf{t}$). Then:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = (\mathbf{y} - \mathbf{t}) \, \mathbf{x}^\top$$
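This gradient can be checked with autograd. The sketch below assumes the squared-error loss from the example and a small target vector `t` chosen purely for illustration.

```python
# Verify dL/dW = (y - t) x^T with autograd.
import torch

W = torch.tensor([[1., 0., 1.],
                  [0., 1., 0.]], requires_grad=True)
x = torch.tensor([1., 2., 3.])
t = torch.tensor([3., 1.])                 # illustrative target

y = W @ x                                  # forward pass
loss = 0.5 * ((y - t) ** 2).sum()
loss.backward()                            # backpropagation

analytic = torch.outer(y.detach() - t, x)  # (y - t) x^T
print(torch.allclose(W.grad, analytic))    # True
```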
12. Gradient Descent
Gradient descent optimizes the model parameters with the update rule:
$$\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$$
- $\theta$ are the model parameters.
- $\eta$ is the learning rate.
- $\nabla_\theta \mathcal{L}$ is the gradient of the loss with respect to the parameters.
Example:
Suppose the loss is $\mathcal{L}(\theta) = \theta^2$, the initial parameter is $\theta = 3$, and the learning rate is $\eta = 0.1$. Then:
$$\nabla_\theta \mathcal{L} = 2\theta = 6$$
$$\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L} = 3 - 0.1 \times 6 = 2.4$$
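The same update can be written in a couple of lines of plain Python:

```python
# One gradient-descent step on L(theta) = theta^2.
theta = 3.0
eta = 0.1

grad = 2 * theta            # dL/dtheta = 2 * theta = 6
theta = theta - eta * grad
print(theta)                # 2.4
```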
13. Adam Optimizer
The Adam optimizer combines momentum with adaptive learning rates. Its update rules are:
$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \nabla_\theta \mathcal{L}$$
$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) (\nabla_\theta \mathcal{L})^2$$
$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}$$
$$\theta_t = \theta_{t-1} - \eta \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}$$
- $\mathbf{m}_t$ and $\mathbf{v}_t$ are the first-moment (momentum) and second-moment estimates.
- $\beta_1, \beta_2$ are the decay rates.
- $\eta$ is the learning rate.
- $\epsilon$ is a smoothing term.
Example:
Suppose $\nabla_\theta \mathcal{L} = [0.1, -0.2]^\top$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\eta = 0.001$, $\epsilon = 10^{-8}$, and the initial values are $\mathbf{m}_0 = \mathbf{v}_0 = \mathbf{0}$. Then:
$$\mathbf{m}_1 = 0.9 \cdot \mathbf{0} + 0.1 \cdot [0.1, -0.2]^\top = [0.01, -0.02]^\top$$
$$\mathbf{v}_1 = 0.999 \cdot \mathbf{0} + 0.001 \cdot [0.1^2, (-0.2)^2]^\top = [0.00001, 0.00004]^\top$$
$$\hat{\mathbf{m}}_1 = \frac{[0.01, -0.02]^\top}{1 - 0.9^1} = [0.1, -0.2]^\top, \quad \hat{\mathbf{v}}_1 = \frac{[0.00001, 0.00004]^\top}{1 - 0.999^1} = [0.01, 0.04]^\top$$
$$\theta_1 = \theta_0 - 0.001 \cdot \frac{[0.1, -0.2]^\top}{\sqrt{[0.01, 0.04]^\top} + 10^{-8}} \approx \theta_0 - [0.001, -0.001]^\top$$
On this first step, because $\mathbf{m}_0 = \mathbf{v}_0 = \mathbf{0}$, the bias-corrected update is approximately $\eta$ times the sign of each gradient component.
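The sketch below performs this single Adam step with plain tensors so the intermediate quantities stay visible; in practice one would simply use `torch.optim.Adam`.

```python
# One Adam step on the example gradient.
import torch

grad = torch.tensor([0.1, -0.2])
beta1, beta2, eta, eps = 0.9, 0.999, 0.001, 1e-8
m = torch.zeros(2)
v = torch.zeros(2)
t = 1

m = beta1 * m + (1 - beta1) * grad              # [0.01, -0.02]
v = beta2 * v + (1 - beta2) * grad ** 2         # [1e-5, 4e-5]
m_hat = m / (1 - beta1 ** t)                    # [0.1, -0.2]
v_hat = v / (1 - beta2 ** t)                    # [0.01, 0.04]
update = eta * m_hat / (v_hat.sqrt() + eps)
print(update)                                   # ≈ tensor([0.0010, -0.0010])
```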
These are the core mathematical principles and formulas behind large-model architectures. Together they form the foundation of deep-learning models and, in practice, are implemented with efficient numerical computing libraries such as PyTorch, TensorFlow, and Paddle.