
An Introduction to Parameter Optimization in Deep Learning

article 2025/3/17 3:29:59

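In deep learning, parameters matter a great deal and strongly affect the prediction results. Having the model find the most suitable parameters (weights, biases, and so on) by itself is the core meaning of the "learning" in deep learning. In this article, after a brief recap of gradient descent, I introduce several other methods for finding optimal parameters: Momentum, AdaGrad, RMSProp, and Adam.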

1. Gradient Descent

1.1 Overview

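In an earlier article I introduced gradient descent, that is, the SGD (stochastic gradient descent) algorithm, so here I simply borrow the description from "Deep Learning from Scratch" (by Koki Saitoh): stochastic gradient descent is like a blindfolded traveler descending a mountain, who feels the slope underfoot and steps in the steepest downhill direction in order to reach the lowest point. As a formula: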

$W\leftarrow W-\eta \frac{\partial L}{\partial W}$

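Here $W$ is the parameter to be updated, $\frac{\partial L}{\partial W}$ is the gradient of the loss function $L$ with respect to $W$, and $\eta$ is the learning rate.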
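As a minimal sketch of the update rule above: the function name `sgd_step` and the toy loss $L(W)=\sum W^2$ are assumptions chosen for illustration, not part of the original article.

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    """One gradient descent update: W <- W - lr * dL/dW."""
    return w - lr * grad

# Hypothetical loss L(W) = sum(W**2), whose gradient is 2 * W.
w = np.array([1.0, -2.0])
for _ in range(50):
    w = sgd_step(w, 2 * w, lr=0.1)

print(w)  # both components have shrunk toward the minimum at 0
```

With this loss and learning rate, each step multiplies $W$ by 0.8, so the parameters shrink steadily toward the minimum at the origin.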

1.2 Drawback

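Although SGD works well in many cases, its search becomes inefficient when the shape of the function is anisotropic, for example an elongated, winding valley. The root cause is that the gradient direction does not point toward the minimum.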
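To make this concrete, here is a small sketch of SGD on an elongated bowl. The function $f(x, y) = x^2/20 + y^2$, the starting point, and the learning rate are all assumptions chosen for illustration.

```python
import numpy as np

def grad_f(p):
    """Gradient of the assumed example f(x, y) = x**2 / 20 + y**2."""
    x, y = p
    return np.array([x / 10.0, 2.0 * y])

p = np.array([-7.0, 2.0])   # assumed starting point
lr = 0.95                   # assumed learning rate
path = [p.copy()]
for _ in range(30):
    p = p - lr * grad_f(p)
    path.append(p.copy())
path = np.array(path)

# Count how often y changes sign between consecutive steps (the zigzag).
y_sign_flips = int(np.sum(path[:-1, 1] * path[1:, 1] < 0))
print(y_sign_flips, path[-1])
```

Under these assumptions each update maps $y$ to $-0.9y$, so $y$ flips sign on every step, while $x$ only shrinks by a factor of 0.905 per step. The path zigzags across the valley instead of heading straight for the minimum at the origin.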

2. Momentum

2.1 Principle

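To address this problem, we can introduce a variable that acts like "inertia". The SGD update then becomes:

$v\leftarrow \alpha v-\eta \frac{\partial L}{\partial W}$

$W\leftarrow W+v$

That is, in a single step: $W\leftarrow W - \eta\frac{\partial L}{\partial W}+\alpha v$, where the $v$ on the right-hand side is the value from the previous step.

Here $v$ corresponds to velocity in physics. The first formula says that the object receives a force in the gradient direction, and this force increases its velocity. The coefficient $\alpha$ (for example 0.9) gradually slows the object down when no force acts on it, much like friction.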
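The two update formulas can be sketched directly in NumPy. The helper name `momentum_step`, the default hyperparameters, and the toy loss $L(W)=\sum W^2$ are assumptions for illustration only.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
    """v <- alpha * v - lr * grad, then W <- W + v. Returns the new (W, v)."""
    v = alpha * v - lr * grad
    return w + v, v

# Hypothetical loss L(W) = sum(W**2), whose gradient is 2 * W.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)        # the object starts at rest
w, v = momentum_step(w, v, 2 * w)
first_w, first_v = w.copy(), v.copy()

for _ in range(199):
    w, v = momentum_step(w, v, 2 * w)

print(first_w, first_v, w)
```

On the first step the velocity is just the plain SGD step, because $v$ starts at zero. On later steps the accumulated velocity carries the parameters further along directions where the gradient keeps the same sign.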

2.2 Code

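import numpy as np
import matplotlib.pyplot as plt

def momentum(f, grad, start, lr=0.1, beta=0.9, n=100, plot=True):
    x = start
    v = 0  # velocity, starts at rest
    history = [x]
    for _ in range(n):
        g = grad(x)
        # Exponential-moving-average form of momentum. It is equivalent to the
        # formulas in 2.1 with alpha = beta and eta = lr * (1 - beta).
        v = beta * v + (1 - beta) * g
        x = x - lr * v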
        history.append(x)
    momentum_path = np.array(history)

    if plot:
        plt.scatter(momentum_path, f(momentum_path), c='red', s=20, label="Momentum")
        plt.xlabel("x")
        plt.ylabel("f(x)")
        plt.legend()
        plt.title("Optimization Paths of Momentum Algorithms")
        plt.show()

    return momentum_path

3. AdaGrad

3.1 Principle

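The choice of learning rate matters: if it is too small, training takes too long; if it is too large, training diverges instead of converging and does not work correctly. One remedy is learning rate decay, which gradually reduces the learning rate as training progresses. AdaGrad develops this idea into an adaptive learning rate that is adjusted for each parameter individually; the "Ada" in AdaGrad comes from "Adaptive".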

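The formulas are:

$h\leftarrow h+\frac{\partial L}{\partial W}\odot \frac{\partial L}{\partial W}$

$W\leftarrow W - \eta \frac{1}{\sqrt{h}}\frac{\partial L}{\partial W}$

Here $\odot$ denotes element-wise multiplication of matrices. Because $h$ accumulates the squared gradients, parameters that have received large gradients get a smaller effective learning rate.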
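A short sketch can show how the step size decays. The helper name `adagrad_step`, the constant toy gradient, and the small `eps` term (added to avoid division by zero, and not part of the formula above) are assumptions for illustration.

```python
import numpy as np

def adagrad_step(w, h, grad, lr=0.1, eps=1e-8):
    """h <- h + grad * grad, then W <- W - lr * grad / (sqrt(h) + eps)."""
    h = h + grad * grad
    return w - lr * grad / (np.sqrt(h) + eps), h

# Feed a constant (hypothetical) gradient to expose the decaying step size.
w, h = np.array([0.0]), np.array([0.0])
g = np.array([1.0])
steps = []
for _ in range(4):
    w_new, h = adagrad_step(w, h, g)
    steps.append(float(abs(w_new - w)[0]))
    w = w_new

print(steps)
```

With a constant gradient of 1, the step at iteration $t$ is approximately $\eta/\sqrt{t}$, so the updates keep getting smaller even though the gradient itself never changes.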

3.2 Code

def AdaGrad(f, grad, start, lr=0.1, epsilon=1e-8, n=100, plot=True):
    x = start
    cache = 0
    history = [x]
    for _ in range(n):
        g = grad(x)
        cache += g ** 2  # accumulate squared gradients (h in 3.1)
        x = x - lr * g / (np.sqrt(cache) + epsilon)  # epsilon avoids division by zero
        history.append(x)

    adaGrad_path = np.array(history)

    if plot:
        plt.scatter(adaGrad_path, f(adaGrad_path), c='red', s=20, label="AdaGrad")
        plt.xlabel("x")
        plt.ylabel("f(x)")
        plt.legend()
        plt.title("Optimization Paths of AdaGrad Algorithms")
        plt.show()

    return adaGrad_path

4. RMSProp

4.1 Principle

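AdaGrad has a problem: the update size shrinks as training goes on, and if training continued indefinitely the update would become 0 and the parameters would stop changing altogether. To fix this, RMSProp forgets past gradients gradually, so that newer gradients carry more weight in the accumulated sum. This technique is called an "exponential moving average": the contribution of past gradients is reduced exponentially. The formulas are: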

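$h\leftarrow \rho \cdot h+(1-\rho )\cdot (\frac{\partial L}{\partial W}\odot \frac{\partial L}{\partial W})$

$W\leftarrow W - \eta \frac{1}{\sqrt{h}}\frac{\partial L}{\partial W}$

Here $\odot$ denotes element-wise multiplication of matrices, and $\rho$ is the decay rate that controls how quickly past gradients are forgotten.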
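The difference from AdaGrad can be seen by tracking $h$ under both rules. In this sketch the decay rate, the constant toy gradient, and the variable names are assumptions for illustration.

```python
import numpy as np

rho, g = 0.9, 1.0        # assumed decay rate and a constant toy gradient
h_adagrad, h_rmsprop = 0.0, 0.0
for _ in range(100):
    h_adagrad += g * g                                # keeps growing
    h_rmsprop = rho * h_rmsprop + (1 - rho) * g * g   # levels off near g**2

lr = 0.1
step_adagrad = lr * g / np.sqrt(h_adagrad)
step_rmsprop = lr * g / np.sqrt(h_rmsprop)
print(step_adagrad, step_rmsprop)
```

After 100 identical gradients, the AdaGrad step has dropped to about a tenth of the learning rate, while the RMSProp step stays close to the learning rate itself because its accumulator is bounded.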

4.2 Code

def RMSProp(f, grad, start, lr=0.1, dr=0.9, epsilon=1e-8, n=100, plot=True):
    x = start
    cache = 0
    history = [x]
    for _ in range(n):
        g = grad(x)
        cache = dr * cache + (1 - dr) * g ** 2  # exponentially weighted average of squared gradients
        x = x - lr * g / (np.sqrt(cache) + epsilon)  # update the parameter
        history.append(x)

    RMSProp_path = np.array(history)

    if plot:
        plt.scatter(RMSProp_path, f(RMSProp_path), c='red', s=20, label="RMSProp")
        plt.xlabel("x")
        plt.ylabel("f(x)")
        plt.legend()
        plt.title("Optimization Paths of RMSProp Algorithms")
        plt.show()

    return RMSProp_path

5. Adam

5.1 Principle

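Adam combines the ideas introduced above: momentum and an adaptive, decaying learning rate. First, the first moment estimate of the gradient (the momentum term) is:

$m_t\leftarrow \beta_1 m_{t-1} + (1-\beta_1 ) g_t$

Here $g_t$ denotes the gradient $\frac{\partial L}{\partial W}$ at step $t$.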

The second moment estimate of the gradient (the term that scales the learning rate) is:

$v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2) g_t\odot g_t$

Here $\odot$ denotes element-wise multiplication of matrices.

Because $m_0 = 0$ and $v_0 = 0$ at initialization, both estimates are biased toward zero during the first steps, so a bias correction is needed:

$\hat{m}_t \leftarrow \frac{m_t}{1-\beta_1^t}$

$\hat{v}_t \leftarrow \frac{v_t}{1-\beta_2^t}$

The final parameter update is:

$W_t \leftarrow W_{t-1} - \eta\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon }$

Here $\epsilon$ is a small constant that prevents division by zero.
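To see what the bias correction does, it may help to run a single step by hand. The helper name `adam_step`, its default hyperparameters, and the toy loss $L(W)=\sum W^2$ are assumptions for illustration.

```python
import numpy as np

def adam_step(w, m, v, t, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1). Returns the new (W, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # undo the pull toward m_0 = 0
    v_hat = v / (1 - beta2 ** t)   # undo the pull toward v_0 = 0
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
grad = 2 * w                       # gradient of the toy loss sum(W**2)
w1, m1, v1 = adam_step(w, m, v, 1, grad)
print(m1, w1)
```

At $t=1$ the raw estimate $m_1$ is only $(1-\beta_1)$ times the gradient, but after correction $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1 \odot g_1$. The first update therefore has magnitude close to $\eta$ in every component, regardless of the gradient's scale.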

5.2 Code

def adam(f, grad, start, lr=0.1, beta1=0.9, beta2=0.999, epsilon=1e-8, n=100, plot=True):
    x = start
    m = 0
    v = 0
    t = 0
    history = [x]
    for _ in range(n):
        t += 1
        g = grad(x)
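        m = beta1 * m + (1 - beta1) * g      # first moment estimate (momentum term)
        v = beta2 * v + (1 - beta2) * g**2   # second moment estimate
        m_hat = m / (1 - beta1**t)           # bias correction for m
        v_hat = v / (1 - beta2**t)           # bias correction for v
        x = x - lr * m_hat / (np.sqrt(v_hat) + epsilon)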
        history.append(x)

    adam_path = np.array(history)

    if plot:
        plt.scatter(adam_path, f(adam_path), c='red', s=20, label="Adam")
        plt.xlabel("x")
        plt.ylabel("f(x)")
        plt.legend()
        plt.title("Optimization Paths of Adam Algorithms")
        plt.show()

    return adam_path

That's all.
