Machine Learning [Whiteboard Derivations] (3): Linear Classification
4. Linear Classification
4.1. Typical Linear Classification Models
- Hard classification: the output is a discrete label such as 0 or 1;
  - Perceptron
  - Linear Discriminant Analysis (Fisher)
- Soft classification: the output is a value between 0 and 1, interpreted as the probability of each class;
  - Probabilistic generative models: Gaussian Discriminant Analysis (GDA) and Naive Bayes, which model $p(\vec{x},y)$
  - Probabilistic discriminative models: logistic regression, which models $p(y|\vec{x})$
4.2. Perceptron
4.2.1. Basic Model
Model:
$$
\begin{aligned}
f(\hat{\vec{x}})&=\text{sign}(\hat{\vec{x}}^TW),\\
\text{sign}(a)&=\begin{cases}1,&a>0\\0,&a=0\\-1,&a<0\end{cases}
\end{aligned}\tag{4.1}
$$
Idea: error-driven learning. First initialize $W$ randomly, then adjust it using the misclassified samples.
Strategy: compute the loss function over the misclassified samples:
$$
\begin{aligned}
\mathcal{L}(W)&=\sum_{\hat{\vec{x}}_i\in D}-y_i\cdot \hat{\vec{x}}_i^TW,\\
D&=\{\hat{\vec{x}}_i\}_{f(\hat{\vec{x}}_i)\neq y_i}.
\end{aligned}\tag{4.2}
$$
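The error-driven update can be read directly off Eq. (4.2): a (stochastic) gradient step on $-y_i\cdot\hat{\vec{x}}_i^TW$ for a misclassified sample gives $W \leftarrow W + \eta\, y_i\,\hat{\vec{x}}_i$. The NumPy sketch below is only an illustration under that assumption; the names `X_hat` (samples with a bias column appended) and `y` (labels in $\{+1,-1\}$) are hypothetical, not from the original text.

```python
import numpy as np

def perceptron_train(X_hat, y, lr=1.0, epochs=100):
    """Minimal perceptron sketch: X_hat has a bias column appended, y in {+1, -1}."""
    W = np.zeros(X_hat.shape[1])           # initialization of W (could also be random)
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(X_hat, y):
            if y_i * (x_i @ W) <= 0:       # misclassified (or exactly on the boundary)
                W += lr * y_i * x_i        # gradient step on the loss term -y_i * x_i^T W
                errors += 1
        if errors == 0:                    # no misclassified samples left: stop
            break
    return W

# hypothetical usage on a toy linearly separable set
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
X_hat = np.hstack([X, np.ones((X.shape[0], 1))])   # append the bias term
y = np.array([1, 1, -1, -1])
W = perceptron_train(X_hat, y)
print(W, np.sign(X_hat @ W))               # predictions via Eq. (4.1)
```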
4.3. Linear Discriminant Analysis (Fisher)
4.3.1. Problem Definition
For a binary classification problem, split the samples into $X_{c_1}=\left\{ \hat{\vec{x}}_i \mid y_i=+1 \right\}$ and $X_{c_2}=\left\{ \hat{\vec{x}}_i \mid y_i=-1 \right\}$, with $N_1$ and $N_2$ samples respectively, where $N_1+N_2=N$.
We look for a projection direction $W$ such that the projections of all samples onto $W$ have a small within-class spread and a large between-class separation.
4.3.2. Derivation
The projection of a sample $\hat{\vec{x}}_i$ onto $W$ can be written as $z=\hat{\vec{x}}^T\cdot W$; taking its mean and variance over the samples gives:
$$
\begin{aligned}
\bar{z}&=\frac{1}{N}\sum_{i=1}^N z_i=\frac{1}{N}\sum_{i=1}^N\hat{\vec{x}}_i^T\cdot W\\
S_z&=\frac{1}{N}\sum_{i=1}^N(z_i-\bar{z})^2=\frac{1}{N}\sum_{i=1}^N(\hat{\vec{x}}_i^T\cdot W-\bar{z})^2.
\end{aligned}\tag{4.3}
$$
Using the above, compute the means $\bar{z}_1$ and $\bar{z}_2$ and the variances $S_{z_1}$ and $S_{z_2}$ for the two classes separately. To make the within-class spread as small as possible and the between-class spread as large as possible, define the objective as
$$
\mathcal{J}(W)=\frac{(\bar{z}_1-\bar{z}_2)^2}{S_{z_1}+S_{z_2}},\tag{4.4}
$$
so the model becomes the optimization problem:
$$
W=\argmax_W\mathcal{J}(W),\tag{4.5}
$$
Simplifying the objective:
$$
\begin{aligned}
(\bar{z}_1-\bar{z}_2)^2&=\left(\frac{1}{N_1}\sum_{i=1}^{N_1}\hat{\vec{x}}_i^T\cdot W-\frac{1}{N_2}\sum_{i=1}^{N_2}\hat{\vec{x}}_i^T\cdot W\right)^2\\
&=[(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW]^2\\
&=W^T(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW
\end{aligned}\tag{4.6}
$$
$$
\begin{aligned}
S_{z_1}&=\frac{1}{N_1}\sum_{i=1}^{N_1}(\hat{\vec{x}}_i^T\cdot W-\bar{z}_1)^2\\
&=\frac{1}{N_1}\sum_{i=1}^{N_1}W^T(\vec{x}_i-\bar{\vec{x}}_{C_1})(\vec{x}_i-\bar{\vec{x}}_{C_1})^TW\\
&=W^T\left[\frac{1}{N_1}\sum_{i=1}^{N_1}(\vec{x}_i-\bar{\vec{x}}_{C_1})(\vec{x}_i-\bar{\vec{x}}_{C_1})^T\right]W\\
&=W^TS_{C_1}W,
\end{aligned}\tag{4.7}
$$
Similarly, $S_{z_2}=W^TS_{C_2}W$. The objective therefore becomes
$$
\begin{aligned}
\mathcal{J}(W)&=\frac{(\bar{z}_1-\bar{z}_2)^2}{S_{z_1}+S_{z_2}}\\
&=\frac{W^T(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW}{W^T(S_{C_1}+S_{C_2})W}
\end{aligned}\tag{4.8}
$$
Next define the total within-class scatter $S_w$ and the between-class scatter $S_b$:
$$
\begin{aligned}
S_w&=S_{C_1}+S_{C_2}\\
S_b&=(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^T
\end{aligned}\tag{4.9}
$$
The objective can therefore be written as:
$$
\mathcal{J}(W)=\frac{W^TS_bW}{W^TS_wW},\tag{4.10}
$$
Differentiating the objective with respect to $W$:
$$
\begin{aligned}
\frac{\partial \mathcal{J}(W)}{\partial W}=&\,2S_bW(W^TS_wW)^{-1}\\
&+W^TS_bW\cdot(-1)\cdot (W^TS_wW)^{-2}\cdot 2S_wW,
\end{aligned}\tag{4.11}
$$
Setting it to zero yields
$$
\begin{aligned}
W&=\frac{W^TS_wW}{W^TS_bW}S_w^{-1}S_bW\\
&=\frac{W^TS_wW}{W^TS_bW}S_w^{-1}(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW\\
&=\frac{W^TS_wW\cdot (\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW}{W^TS_bW}S_w^{-1}(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2}),
\end{aligned}\tag{4.12}
$$
Since $W$ is taken to be a unit vector and only its direction (not its length) matters, the scalar coefficient can be dropped, giving:
$$
W\propto S_w^{-1}(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2}).\tag{4.13}
$$
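A minimal NumPy sketch of Eq. (4.13): compute the per-class means, form the within-class scatter $S_w$ as in Eq. (4.9), and solve $S_w w = \bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2}$ for the projection direction. The matrices `X1` and `X2` (one sample per row) and all other names are illustrative assumptions, not part of the original note.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return w proportional to S_w^{-1} (mean(X1) - mean(X2)), as in Eq. (4.13)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # per-class scatter: S_Ck = (1/N_k) * sum_i (x_i - mu_k)(x_i - mu_k)^T
    S1 = (X1 - mu1).T @ (X1 - mu1) / X1.shape[0]
    S2 = (X2 - mu2).T @ (X2 - mu2) / X2.shape[0]
    Sw = S1 + S2
    w = np.linalg.solve(Sw, mu1 - mu2)      # avoids forming the explicit inverse
    return w / np.linalg.norm(w)            # only the direction matters

# hypothetical usage on synthetic two-class data
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(50, 2))
print(fisher_direction(X1, X2))
```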
4.4. Logistic Regression
4.4.1. Basic Idea
A nonlinear activation function is introduced into linear regression so that the regression output is mapped to a probability, which serves as the class probability. The model is therefore viewed as modeling a conditional distribution: given an input $\hat{\vec{x}}$, it models $p(y|\hat{\vec{x}})$ and outputs a discrete value of $y$.
4.4.2. Sigmoid Activation Function
Basic formula:
$$
\sigma(z)=\frac{1}{1+e^{-z}}\tag{4.14}
$$
Special values:
- As $z\rightarrow -\infty$, $\lim \sigma(z)=0$;
- At $z=0$, $\sigma(z)=\frac{1}{2}$;
- As $z\rightarrow \infty$, $\lim \sigma(z)=1$.
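A quick numerical check of these limiting values (an illustrative sketch; the piecewise form is only a common trick to avoid overflow for large $|z|$ and is not part of the original derivation):

```python
import numpy as np

def sigmoid(z):
    # piecewise form: never exponentiates a large positive argument
    return np.where(z >= 0, 1.0 / (1.0 + np.exp(-z)), np.exp(z) / (1.0 + np.exp(z)))

print(sigmoid(np.array([-50.0, 0.0, 50.0])))   # approximately [0.0, 0.5, 1.0]
```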
Function plot: (figure of the S-shaped sigmoid curve omitted.)
4.4.3. Model Derivation
Following the idea of modeling the conditional probability:
$$
\begin{aligned}
p_1&=p(y=1|\hat{\vec{x}})=\sigma(\hat{\vec{x}}^TW)=\frac{1}{1+e^{-\hat{\vec{x}}^TW}}\\
p_0&=p(y=0|\hat{\vec{x}})=1-\sigma(\hat{\vec{x}}^TW)=\frac{e^{-\hat{\vec{x}}^TW}}{1+e^{-\hat{\vec{x}}^TW}},
\end{aligned}\tag{4.15}
$$
so the whole model can be written as
$$
p(y|\hat{\vec{x}})=p_1^y\cdot p_0^{1-y},\tag{4.16}
$$
When $y=0$, $p=p_0$; when $y=1$, $p=p_1$.
The model is estimated by maximum likelihood:
$$
\begin{aligned}
\hat{W}&=\argmax_W\log p(Y|X)\\
&=\argmax_W\sum_{i=1}^N\left[ y_i\cdot \log\sigma(\hat{\vec{x}}_i^TW)+(1-y_i)\cdot \log(1-\sigma(\hat{\vec{x}}_i^TW)) \right],
\end{aligned}\tag{4.17}
$$
Taking the gradient gives
$$
\begin{aligned}
\nabla_W\log p(Y|X) &=\sum_{i=1}^N \left[y_i\cdot (1-\sigma(\hat{\vec{x}}_i^TW))\cdot\hat{\vec{x}}_i - (1-y_i)\cdot \sigma(\hat{\vec{x}}_i^TW)\cdot\hat{\vec{x}}_i \right]\\
&=\sum_{i=1}^N \left[y_i-\sigma(\hat{\vec{x}}_i^TW) \right]\cdot\hat{\vec{x}}_i,
\end{aligned}\tag{4.18}
$$
which can be used to update the model iteratively, e.g., by gradient ascent.
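A minimal batch gradient-ascent sketch of Eqs. (4.17)–(4.18); since there is no closed-form solution, $W$ is updated repeatedly along the gradient. The arrays `X_hat` (with a bias column) and `y` (labels in $\{0,1\}$), the learning rate, and the iteration count are illustrative assumptions, not from the original text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_fit(X_hat, y, lr=0.1, iters=1000):
    """Maximize the log-likelihood (4.17) by batch gradient ascent using gradient (4.18)."""
    W = np.zeros(X_hat.shape[1])
    for _ in range(iters):
        grad = X_hat.T @ (y - sigmoid(X_hat @ W))   # sum_i (y_i - sigma(x_i^T W)) * x_i
        W += lr * grad / X_hat.shape[0]              # averaged ascent step
    return W

# hypothetical usage on synthetic data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.5, 1.0, (50, 2)), rng.normal(-1.5, 1.0, (50, 2))])
X_hat = np.hstack([X, np.ones((100, 1))])            # append the bias term
y = np.concatenate([np.ones(50), np.zeros(50)])
W = logistic_regression_fit(X_hat, y)
p1 = sigmoid(X_hat @ W)                              # p(y=1 | x) as in Eq. (4.15)
print(((p1 > 0.5) == y).mean())                      # training accuracy
```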
4.5. Gaussian Discriminant Analysis
4.5.1. Discriminative vs. Generative Probabilistic Models
A probabilistic discriminative model computes the conditional density $p(y|\vec{x})$ directly and takes the $y$ that maximizes this probability as the classification result.
A probabilistic generative model does not need the exact value of $p(y|\vec{x})$; it only compares which of $p(y=1|\vec{x})$ and $p(y=0|\vec{x})$ is larger. Using Bayes' rule $p(y|\vec{x}) = \frac{p(\vec{x}|y)\cdot p(y)}{p(\vec{x})}$, the objective becomes:
$$
\begin{aligned}
\hat{y}&=\argmax_y p(y|\vec{x})\\
&=\argmax_y p(\vec{x}|y)\cdot p(y),
\end{aligned}\tag{4.19}
$$
where, if $p(y=1)=\phi$, then $p(y=0)=1-\phi$, so the prior can be written compactly as $p(y)=\phi^y\cdot(1-\phi)^{1-y}$.
4.5.2. Gaussian Assumption
In Gaussian discriminant analysis, the class-conditional distributions are assumed to be Gaussian:
$$
\begin{aligned}
\vec{x}|y&=1\sim N(\vec{\mu}_1, \Sigma)\\
\vec{x}|y&=0\sim N(\vec{\mu}_2, \Sigma),
\end{aligned}\tag{4.20}
$$
Writing out the log-likelihood objective gives
$$
\begin{aligned}
\mathcal{L}(\theta)&=\log\prod_{i=1}^N p(\vec{x}_i,y_i)\\
&=\sum_{i=1}^N\left[\log p(\vec{x}_i|y_i)+\log p(y_i)\right]\\
&=\sum_{i=1}^N\left\{ y_i\cdot\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)-\frac{p}{2}\log 2\pi-\frac{1}{2}\log|\Sigma|\right]\right.\\
&\quad+(1-y_i)\cdot\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_2)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_2)-\frac{p}{2}\log 2\pi-\frac{1}{2}\log|\Sigma|\right]\\
&\quad\left.+\,y_i\cdot\log\phi+(1-y_i)\cdot\log(1-\phi)\right\}.
\end{aligned}\tag{4.21}
$$
4.5.3. Solving the Model
First solve for the parameter $\phi$ by taking the partial derivative of the objective:
$$
\frac{\partial \mathcal{L}(\theta)}{\partial \phi}=\sum_{i=1}^N\left(\frac{y_i}{\phi}-\frac{1-y_i}{1-\phi}\right),\tag{4.22}
$$
Setting it to zero yields
$$
\hat{\phi}=\frac{1}{N}\sum_{i=1}^{N}y_i,\tag{4.23}
$$
that is, the optimal estimate of $\phi$ is simply the empirical frequency of the positive label in the samples.
Next solve for $\vec{\mu}_1$. Since the terms that do not involve $\vec{\mu}_1$ have zero derivative with respect to it, keep only the relevant part of the objective:
$$
\begin{aligned}
\hat{\vec{\mu}}_1&=\argmax_{\vec{\mu}_1}\sum_{i=1}^N y_i\cdot\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)\right]\\
&=\argmax_{\vec{\mu}_1}\sum_{\vec{x}_i\in C_1}\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)\right],
\end{aligned}\tag{4.24}
$$
where $C_1=\left\{\vec{x}_i\right\}_{y_i=1}$. The solution is therefore the same as the maximum-likelihood estimate of a multivariate Gaussian (see Section 2.2.2 of the companion note, Machine Learning [Whiteboard Derivations] (1)), giving
$$
\begin{aligned}
\hat{\vec{\mu}}_1&=\frac{\sum_{i=1}^N y_i\cdot\vec{x}_i}{N_1}=\frac{\sum_{\vec{x}_i\in C_1}\vec{x}_i}{N_1}\\
\hat{\vec{\mu}}_2&=\frac{\sum_{i=1}^N(1-y_i)\cdot\vec{x}_i}{N_2}=\frac{\sum_{\vec{x}_i\in C_2}\vec{x}_i}{N_2},
\end{aligned}\tag{4.25}
$$
Finally, solve for $\Sigma$. Again keeping only the terms of the objective that involve it:
$$
\begin{aligned}
\hat{\Sigma}=\argmax_{\Sigma}&\sum_{\vec{x}_i\in C_1}\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)\right]\\
&+\sum_{\vec{x}_i\in C_2}\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_2)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_2)\right]-\frac{N}{2}\log|\Sigma|,
\end{aligned}\tag{4.26}
$$
Differentiating with respect to $\Sigma$ gives
$$
\begin{aligned}
\frac{\partial\mathcal{L}(\theta)}{\partial\Sigma}&=-\frac{N}{2}\Sigma^{-1}+\frac{1}{2}\sum_{\vec{x}_i\in C_1}\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}\\
&\quad+\frac{1}{2}\sum_{\vec{x}_i\in C_2}\Sigma^{-1}(\vec{x}_i-\vec{\mu}_2)(\vec{x}_i-\vec{\mu}_2)^T\Sigma^{-1},
\end{aligned}\tag{4.27}
$$
Setting it to zero gives
$$
\begin{aligned}
\hat{\Sigma}&=\frac{1}{N}\left[\sum_{\vec{x}_i\in C_1}(\vec{x}_i-\vec{\mu}_1)(\vec{x}_i-\vec{\mu}_1)^T+\sum_{\vec{x}_i\in C_2}(\vec{x}_i-\vec{\mu}_2)(\vec{x}_i-\vec{\mu}_2)^T\right]\\
&=\frac{N_1\cdot S_1+N_2\cdot S_2}{N},
\end{aligned}\tag{4.28}
$$
where $S_1$ and $S_2$ are the within-class sample covariance matrices of the two classes.
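A minimal NumPy sketch of the closed-form MLE results (4.23), (4.25), and (4.28), together with the generative decision rule (4.19). It assumes `X` is an $(N,p)$ sample matrix and `y` takes values in $\{0,1\}$ with $y=1$ corresponding to $C_1$; all names are illustrative, not from the original text.

```python
import numpy as np

def gda_fit(X, y):
    """Closed-form MLE for GDA: phi, mu1, mu2, and the shared covariance Sigma."""
    X1, X2 = X[y == 1], X[y == 0]
    N, N1, N2 = len(X), len(X1), len(X2)
    phi = N1 / N                                  # Eq. (4.23): label frequency
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)   # Eq. (4.25): class means
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1           # class C1 sample covariance
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2           # class C2 sample covariance
    Sigma = (N1 * S1 + N2 * S2) / N               # Eq. (4.28): pooled covariance
    return phi, mu1, mu2, Sigma

def gda_predict(X, phi, mu1, mu2, Sigma):
    """Pick argmax_y p(x|y) p(y) as in Eq. (4.19); shared-Sigma terms cancel."""
    inv = np.linalg.inv(Sigma)
    def log_score(mu, prior):
        d = X - mu
        return -0.5 * np.einsum('ij,jk,ik->i', d, inv, d) + np.log(prior)
    return (log_score(mu1, phi) > log_score(mu2, 1 - phi)).astype(int)
```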
4.6. Naive Bayes Classifier
4.6.1. Basic Idea
All algorithms in the naive Bayes family rest on the naive Bayes assumption, also called the conditional independence assumption: the features are assumed to be mutually independent given the class. Naive Bayes is the simplest probabilistic graphical model; its estimation procedure is close to that of Gaussian discriminant analysis and is not repeated here.
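For concreteness, a standard way to state the conditional independence assumption and the resulting decision rule (not written out in the original note) is:
$$
p(\vec{x}|y)=\prod_{j=1}^{p}p(x_j|y),\qquad
\hat{y}=\argmax_y\; p(y)\prod_{j=1}^{p}p(x_j|y),
$$
where $p$ is the feature dimension. Under a Gaussian assumption on each $p(x_j|y)$, the estimation mirrors the GDA derivation above, with the covariance restricted to be diagonal.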