Chapter 6: Introducing Optimization
Note: these notes are based on *Neural Networks from Scratch in Python* and the author's YouTube series.
I have already written several introductory posts on neural networks, so I won't repeat that material here; the links are below.
1. A Glimpse into Neural Networks
2. Introduction to Neural Networks (Part 1)
3. Introduction to Neural Networks (Part 2)
The previous chapters covered:
1. Coding Our First Neurons
2. Adding Hidden Layers
3. Activation Functions
4. Calculating Network Error with Loss
Now that the network is built, we can pass data through it and compute a loss; the next step is to figure out how to adjust the weights and biases to decrease that loss.
Finding an intelligent way to adjust the weights and biases of the neurons' inputs so as to minimize loss is the central difficulty of training a neural network.
As a first attempt, let's update the network's weights and biases with random adjustments and keep the best combination found so far.
import numpy as np
import nnfs
from nnfs.datasets import spiral_data
nnfs.init()
# Dense layer connecting two layers of neurons
class Layer_Dense:
    # Initialize weights and biases
    def __init__(self, n_inputs, n_neurons):
        # weights(2,3): index 00 is the weight between the first input neuron and the first hidden neuron
        #        0      1      2
        # 0   0.013  0.016  0.011
        # 1   0.006  0.006  0.046
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        # biases(1,3): the biases of the 3 hidden neurons
        #      0    1    2
        # 0  0.0  0.0  0.0
        self.biases = np.zeros((1, n_neurons))

    # Forward pass between the two layers: inputs*weights + biases
    def forward(self, inputs):
        # inputs(300,2) dot weights(2,3) + biases(1,3): multiplying the whole batch of 300 samples
        # by the weight matrix performs the forward pass for the entire batch at once
        # outputs(300,3): one value per neuron for each of the 300 samples
        self.outputs = np.dot(inputs, self.weights) + self.biases
# Hidden-layer activation function: ReLU
class Activation_ReLU:
    def forward(self, inputs):
        self.outputs = np.maximum(0, inputs)

# Output-layer activation function: Softmax
class Activation_Softmax:
    def forward(self, inputs):
        # Exponentiate; subtracting each row's maximum prevents the exponentials from overflowing
        exp_vals = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # Normalize by the per-row sum of the exponentials to get probabilities
        probabilities = exp_vals / np.sum(exp_vals, axis=1, keepdims=True)
        self.outputs = probabilities
# Compute the average loss over the batch
class Loss:
    def calculate(self, y_pred, y):
        # Per-sample cross-entropy losses
        sample_losses = self.forward(y_pred, y)
        # Mean loss over the batch
        average_loss = np.mean(sample_losses)
        return average_loss

# Categorical cross-entropy
class cross_entropy(Loss):
    def forward(self, y_pred, y_true):
        # Clip predictions to [1e-7, 1 - 1e-7] (0.0000001 to 0.9999999) to avoid taking log(0)
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
        # Pick out the predicted confidence for the true class of each sample,
        # handling both formats of y_true
        samples = len(y_pred)
        if len(y_true.shape) == 1:
            # y_true is a 1-D array of class indices:
            # range(samples) generates the row indices 0..299, y_true selects the column,
            # giving a 1-D array correct_confidences with each sample's confidence for its true class
            correct_confidences = y_pred_clipped[range(samples), y_true]
        elif len(y_true.shape) == 2:
            # y_true is one-hot encoded: mask out everything except the true class
            correct_confidences = np.sum(y_pred_clipped * y_true, axis=1)
        # Cross-entropy = negative log-likelihood of each sample
        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods
# Dataset
# X (300,2); y (300,) -- these labels are the len(y_true.shape) == 1 case
'''
X holds the (x, y) coordinates of each point
X      x value    y value
0      0.00299,   0.00964
1      0.01288,   0.01556
y holds the class index of each point
y      class index
0      0
1      0
2      1
'''
X, y = spiral_data(samples=100, classes=3)  # X: data, y: ground-truth labels
# Layer from the input layer to the hidden layer
dense1 = Layer_Dense(2, 3)
# Hidden-layer activation
relu = Activation_ReLU()
# Layer from the hidden layer to the output layer
dense2 = Layer_Dense(3, 3)
# Output-layer activation
softmax = Activation_Softmax()
# Instantiate the cross-entropy loss
loss_function = cross_entropy()
# Initialize the lowest loss seen so far
lowest_loss = 9999
# Copies of the weights and biases that produced the lowest loss, updated as we search
best_dense1_weights = dense1.weights.copy()
best_dense1_biases = dense1.biases.copy()
best_dense2_weights = dense2.weights.copy()
best_dense2_biases = dense2.biases.copy()
for iteration in range(9999):
    # Randomly nudge the weights and biases
    dense1.weights += 0.05 * np.random.randn(2, 3)
    dense1.biases += 0.05 * np.random.randn(1, 3)
    dense2.weights += 0.05 * np.random.randn(3, 3)
    dense2.biases += 0.05 * np.random.randn(1, 3)
    # Forward pass of the data through the hidden layer
    dense1.forward(X)
    relu.forward(dense1.outputs)
    # Feed the activated hidden-layer values into the output layer
    dense2.forward(relu.outputs)
    softmax.forward(dense2.outputs)
    # Compute the mean loss
    loss = loss_function.calculate(softmax.outputs, y)
    # Compute the accuracy
    prediction = np.argmax(softmax.outputs, axis=1)
    accuracy = np.mean(prediction == y)
    if loss < lowest_loss:
        # Print iteration number, loss, and accuracy
        print('New set of weights found, iteration:', iteration, 'loss:', loss, 'accuracy:', accuracy)
        # Save the weights and biases that produced this new lowest loss
        best_dense1_weights = dense1.weights.copy()
        best_dense1_biases = dense1.biases.copy()
        best_dense2_weights = dense2.weights.copy()
        best_dense2_biases = dense2.biases.copy()
        # Update the lowest loss
        lowest_loss = loss
    # If the current loss is higher than the lowest loss,
    # revert the network to the best weights and biases found so far
    else:
        dense1.weights = best_dense1_weights.copy()
        dense1.biases = best_dense1_biases.copy()
        dense2.weights = best_dense2_weights.copy()
        dense2.biases = best_dense2_biases.copy()
============
New set of weights found, iteration: 0 loss: 1.1008677 accuracy: 0.3333333333333333
New set of weights found, iteration: 1 loss: 1.0994315 accuracy: 0.3333333333333333
New set of weights found, iteration: 2 loss: 1.0991217 accuracy: 0.3333333333333333
New set of weights found, iteration: 3 loss: 1.0986339 accuracy: 0.3333333333333333
New set of weights found, iteration: 4 loss: 1.0986199 accuracy: 0.3333333333333333
New set of weights found, iteration: 5 loss: 1.0984716 accuracy: 0.36333333333333334
New set of weights found, iteration: 18 loss: 1.0983391 accuracy: 0.3333333333333333
New set of weights found, iteration: 27 loss: 1.0982698 accuracy: 0.3333333333333333
New set of weights found, iteration: 31 loss: 1.0982264 accuracy: 0.37333333333333335
New set of weights found, iteration: 35 loss: 1.0979562 accuracy: 0.3333333333333333
New set of weights found, iteration: 36 loss: 1.0977433 accuracy: 0.3433333333333333
New set of weights found, iteration: 37 loss: 1.0976934 accuracy: 0.3333333333333333
New set of weights found, iteration: 44 loss: 1.097596 accuracy: 0.3466666666666667
New set of weights found, iteration: 50 loss: 1.0973785 accuracy: 0.36333333333333334
New set of weights found, iteration: 51 loss: 1.0959908 accuracy: 0.3566666666666667
New set of weights found, iteration: 60 loss: 1.0959282 accuracy: 0.35333333333333333
New set of weights found, iteration: 65 loss: 1.0954362 accuracy: 0.38333333333333336
New set of weights found, iteration: 67 loss: 1.093989 accuracy: 0.4166666666666667
New set of weights found, iteration: 71 loss: 1.0926254 accuracy: 0.37666666666666665
New set of weights found, iteration: 79 loss: 1.0921575 accuracy: 0.35333333333333333
New set of weights found, iteration: 94 loss: 1.0918257 accuracy: 0.4166666666666667
New set of weights found, iteration: 101 loss: 1.0914664 accuracy: 0.38666666666666666
New set of weights found, iteration: 102 loss: 1.0909607 accuracy: 0.38333333333333336
New set of weights found, iteration: 103 loss: 1.0906307 accuracy: 0.35333333333333333
New set of weights found, iteration: 106 loss: 1.089146 accuracy: 0.4166666666666667
New set of weights found, iteration: 113 loss: 1.0891142 accuracy: 0.37666666666666665
New set of weights found, iteration: 115 loss: 1.088237 accuracy: 0.36333333333333334
New set of weights found, iteration: 120 loss: 1.0880405 accuracy: 0.39
New set of weights found, iteration: 129 loss: 1.0874124 accuracy: 0.42333333333333334
New set of weights found, iteration: 140 loss: 1.087239 accuracy: 0.4266666666666667
New set of weights found, iteration: 157 loss: 1.0870038 accuracy: 0.42
New set of weights found, iteration: 163 loss: 1.0870035 accuracy: 0.38666666666666666
New set of weights found, iteration: 172 loss: 1.0862479 accuracy: 0.4266666666666667
New set of weights found, iteration: 175 loss: 1.0861241 accuracy: 0.41
New set of weights found, iteration: 179 loss: 1.0860893 accuracy: 0.3466666666666667
New set of weights found, iteration: 186 loss: 1.0853186 accuracy: 0.37666666666666665
New set of weights found, iteration: 190 loss: 1.0852814 accuracy: 0.42
New set of weights found, iteration: 191 loss: 1.0846506 accuracy: 0.42
New set of weights found, iteration: 203 loss: 1.0842136 accuracy: 0.42333333333333334
New set of weights found, iteration: 204 loss: 1.084184 accuracy: 0.3566666666666667
New set of weights found, iteration: 214 loss: 1.0837997 accuracy: 0.37666666666666665
New set of weights found, iteration: 218 loss: 1.0836842 accuracy: 0.4166666666666667
New set of weights found, iteration: 235 loss: 1.0836092 accuracy: 0.43333333333333335
New set of weights found, iteration: 238 loss: 1.0832268 accuracy: 0.38666666666666666
New set of weights found, iteration: 241 loss: 1.0831857 accuracy: 0.4033333333333333
New set of weights found, iteration: 246 loss: 1.0826017 accuracy: 0.38333333333333336
New set of weights found, iteration: 250 loss: 1.0825759 accuracy: 0.4033333333333333
New set of weights found, iteration: 254 loss: 1.0817988 accuracy: 0.38
New set of weights found, iteration: 282 loss: 1.0817244 accuracy: 0.38
New set of weights found, iteration: 286 loss: 1.0810702 accuracy: 0.41
New set of weights found, iteration: 288 loss: 1.0806731 accuracy: 0.37333333333333335
New set of weights found, iteration: 314 loss: 1.0806231 accuracy: 0.4066666666666667
New set of weights found, iteration: 340 loss: 1.080356 accuracy: 0.4
New set of weights found, iteration: 578 loss: 1.080259 accuracy: 0.4033333333333333
New set of weights found, iteration: 630 loss: 1.0802449 accuracy: 0.4166666666666667
New set of weights found, iteration: 877 loss: 1.0801865 accuracy: 0.4166666666666667
New set of weights found, iteration: 901 loss: 1.0801494 accuracy: 0.43
New set of weights found, iteration: 935 loss: 1.0800657 accuracy: 0.41333333333333333
New set of weights found, iteration: 978 loss: 1.0799247 accuracy: 0.42
New set of weights found, iteration: 1049 loss: 1.0798801 accuracy: 0.3933333333333333
New set of weights found, iteration: 1092 loss: 1.0797858 accuracy: 0.38666666666666666
New set of weights found, iteration: 1103 loss: 1.0795524 accuracy: 0.4033333333333333
New set of weights found, iteration: 1159 loss: 1.0795078 accuracy: 0.39666666666666667
New set of weights found, iteration: 1434 loss: 1.079379 accuracy: 0.41
New set of weights found, iteration: 1944 loss: 1.0793691 accuracy: 0.42
New set of weights found, iteration: 1967 loss: 1.0792985 accuracy: 0.4066666666666667
New set of weights found, iteration: 3281 loss: 1.0792687 accuracy: 0.42
New set of weights found, iteration: 4016 loss: 1.0792656 accuracy: 0.39666666666666667
New set of weights found, iteration: 4309 loss: 1.0792212 accuracy: 0.4033333333333333
New set of weights found, iteration: 5157 loss: 1.0791875 accuracy: 0.3933333333333333
New set of weights found, iteration: 5415 loss: 1.0790575 accuracy: 0.39
From this output we can see that although the loss decreases, the accuracy fluctuates up and down. Clearly, updating weights and biases by random perturbation is not a viable strategy; we need a different method for updating them.
Chapter 7: Derivatives
Randomly changing the weights and biases and searching for the best set has little effect, mainly because the number of possible combinations is infinite; hunting for the optimal combination this way is like looking for a needle in a haystack and is far too inefficient.
Each weight and bias affects the loss to a different degree, and that effect depends on the parameters themselves as well as on the current input sample (the input to the first layer):
The inputs are multiplied by the weights, the biases are added, and an activation function applies a nonlinear mapping to the result; the output of one layer becomes the input of the next, all the way to the output layer, whose output is compared with the ground truth to produce the loss. This means that both the parameters (the weights and bias of every neuron) and the input samples influence the loss, which is why we compute a separate loss value for each input sample. The effect of a weight or bias on the loss is not necessarily linear. To know how to adjust the weights and biases, we first need to understand how each parameter affects the loss.
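To make this concrete, here is a minimal sketch (my own, not from the book) that can be run after the Chapter 6 code above. It nudges a single weight in dense1 and measures how the average loss responds; full_forward_loss is a small helper defined just for this check.

# Minimal sketch: perturb a single weight and observe the change in loss.
# Reuses dense1, relu, dense2, softmax, loss_function, X and y defined above.
def full_forward_loss():
    dense1.forward(X)
    relu.forward(dense1.outputs)
    dense2.forward(relu.outputs)
    softmax.forward(dense2.outputs)
    return loss_function.calculate(softmax.outputs, y)

base_loss = full_forward_loss()
dense1.weights[0, 0] += 0.01  # nudge one weight
new_loss = full_forward_loss()
dense1.weights[0, 0] -= 0.01  # restore it
print('change in loss from nudging weights[0, 0]:', new_loss - base_loss)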
The Impact of a Parameter on the Output
For a linear function, how do we describe the effect of the input x on the output y? With the slope:
$$\frac{Change\ in\ y}{Change\ in\ x}=\frac{\Delta y}{\Delta x}$$
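For instance, with a made-up linear function f(x) = 2x + 1 (my own example, not from the book), any two distinct points give the same ratio, so the slope fully describes the effect of x on y:

def linear_f(x):
    return 2 * x + 1  # example linear function; its slope is 2 everywhere

# For a linear function, any pair of distinct points yields the same slope
for x1, x2 in [(0, 1), (1, 3), (-2, 5)]:
    slope = (linear_f(x2) - linear_f(x1)) / (x2 - x1)
    print(f'slope between x={x1} and x={x2}:', slope)  # prints 2.0 each time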
For a nonlinear function, how do we describe the effect of the input x on the output y? The slope of a nonlinear function is not constant: a slope computed from the differences in x and y between two points only describes the behaviour between those two points. That is fine for a linear function, whose slope never changes, but a nonlinear function behaves differently at every point.
The solution is to bring the two points infinitely close together, so the slope describes an infinitesimally small interval, which we can treat as a single point.
def f(x):
    return 2 * x ** 2

# In principle, an infinitesimally small delta would approximate the exact derivative;
# in practice, delta must remain numerically stable. It cannot be too small, because it
# could be rounded to 0 given the limits of Python's floating-point precision (and, as we
# know, dividing by 0 is "illegal"). Our solution is therefore a compromise between
# approximating the derivative and staying numerically stable, which introduces
# this small but visible error.
delta = 0.0001
x1 = 1
x2 = x1 + delta
y1 = f(x1)
y2 = f(x2)
slope = (y2 - y1) / (x2 - x1)
print(slope)
=========
# The derivative of f(x)=2x^2 is 4x, so its value at x=1 is 4.
# The slope of the "secant" line approximating the tangent comes out as:
4.0001999999987845  # very close to the true value
The "derivative" tells us how strongly the function value y is affected when x changes; in other words, the derivative $f'(x)=dy/dx$ quantifies that influence.
In code, we compute the derivative by numerical differentiation: approximating the slope of the tangent line using two points that are extremely close together.
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return 2 * x ** 2

x = np.arange(0, 5, 0.0001)  # array from 0 to 5 with step 0.0001
y = f(x)
plt.plot(x, y)  # plot the function 2x^2

# Plot the (approximate) tangent line
delta = 0.0001
x1 = 2
x2 = x1 + delta
y1 = f(x1)
y2 = f(x2)
print((x1, y1), (x2, y2))
# Tangent line y = mx + b, slope m
numerical_derivative = (y2 - y1) / (x2 - x1)
# Tangent line y = mx + b, so b = y - mx
b = y2 - numerical_derivative * x2  # intercept of the tangent line
def tangent_line(i):
    return numerical_derivative * i + b  # value of the tangent line at i

to_plot = [x1 - 0.9, x1, x1 + 0.9]  # x coordinates of three points on the tangent line
plt.plot(to_plot, [tangent_line(i) for i in to_plot])  # plot connects the three points into a straight line
print('Approximate derivative for f(x)', f'where x = {x1} is {numerical_derivative}')  # approximate derivative at x1
plt.show()
========
(2, 8) (2.0001, 8.000800020000002)
Approximate derivative for f(x) where x = 2 is 8.000199999998785
Chapter 8: Gradients, Partial Derivatives, and the Chain Rule
As mentioned in Chapter 1, a neural network is essentially a deeply nested composite function, in which the output of one layer is the input of the next and the final output is compared with the ground truth to give the loss. To know how each weight and bias affects the loss, we take partial derivatives (the derivatives of a multivariate function) through this composite function layer by layer using the chain rule, working backward from the outermost function (the loss at the output layer) toward the innermost one (the input layer).
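As a toy numerical check of the chain rule (my own example, not from the book): take y = f(g(x)) with g(x) = 3x and f(u) = u², so the chain rule gives dy/dx = f'(g(x)) · g'(x) = 2·(3x)·3 = 18x.

# Toy chain-rule check: y = f(g(x)) with g(x) = 3x and f(u) = u^2
def g(x):
    return 3 * x

def f_outer(u):
    return u ** 2

x1 = 2.0
delta = 0.0001
# Numerical derivative of the composition at x1
numerical = (f_outer(g(x1 + delta)) - f_outer(g(x1))) / delta
# Chain rule: dy/dx = f'(g(x)) * g'(x) = 2*g(x) * 3 = 18*x
analytical = 18 * x1
print(numerical, analytical)  # approximately 36.0009 vs 36.0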
A partial derivative measures how much a single input affects the function's output. Partial derivatives are computed the same way as the derivatives explained in the previous chapter; we simply repeat the process for each independent input.
Partial Derivatives of Multivariate Functions
A partial derivative is a single equation, while the full derivative of a multivariate function consists of a set of such equations called the gradient. In other words, the gradient is a vector, with one entry per input, containing the partial derivative with respect to each input.
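For example (a small standalone sketch with a made-up function f(x, y) = 2x + 3y², not from the book), each partial derivative is estimated numerically while holding the other input fixed, and the gradient is just the vector of those partials:

import numpy as np

def f(x, y):
    return 2 * x + 3 * y ** 2  # example function of two inputs

x, y = 1.0, 2.0
delta = 0.0001
# Partial derivative with respect to x (y held fixed); analytically 2
df_dx = (f(x + delta, y) - f(x, y)) / delta
# Partial derivative with respect to y (x held fixed); analytically 6y = 12
df_dy = (f(x, y + delta) - f(x, y)) / delta
gradient = np.array([df_dx, df_dy])
print(gradient)  # approximately [ 2. 12.]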
The Chain Rule
In the following chapters, we will combine the gradient of the multivariate loss function with the chain rule to run gradient descent, driving the loss down and finding the weights and biases that minimize it.