Adam Optimizer

February 3, 2025 | 13 min read

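Adaptive Moment Estimation, or the Adam optimizer, is a sophisticated optimization technique used to train deep learning models. First proposed by **Diederik P. Kingma and Jimmy Ba** in their 2014 work, it has become one of the most popular and widely used optimization techniques thanks to its effectiveness and efficiency. The core idea of Adam is to combine the strengths of two other well-known optimizers, AdaGrad and RMSProp, into a reliable and flexible method.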

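AdaGrad is known for adapting the learning rate of each parameter according to the history of its gradients. Because it lowers the learning rate for frequently occurring features, it is well suited to sparse data and features. However, AdaGrad's learning rate keeps decreasing throughout training and can eventually become too small, which hinders convergence.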
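The shrinking learning rate described above can be seen in a minimal sketch. The helper name `adagrad_step`, its default values, and the constant gradient of 1.0 are assumptions chosen for illustration; they do not come from the article or from any library.

```python
import numpy as np


def adagrad_step(param, grad, accum, learning_rate=0.1, eps=1e-8):
    """One AdaGrad update (illustrative helper, not a library function)."""
    accum = accum + grad ** 2  # running SUM of squared gradients, never decays
    step = learning_rate * grad / (np.sqrt(accum) + eps)
    return param - step, accum


# Hypothetical run: the gradient is held at 1.0 so only the scaling changes.
param, accum = 0.0, 0.0
effective_rates = []
for _ in range(5):
    param, accum = adagrad_step(param, grad=1.0, accum=accum)
    effective_rates.append(0.1 / (np.sqrt(accum) + 1e-8))

print(effective_rates)  # each entry is smaller than the one before it
```

Because the accumulator only grows, the effective rate can only fall, which is the weakness RMSProp and Adam set out to address.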

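RMSProp, on the other hand, addresses this problem by scaling the learning rate with a moving average of squared gradients, which helps keep the learning rate from decaying too quickly. RMSProp is very useful for non-stationary objectives, where the best learning rate may fluctuate over time.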
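As a rough sketch of that idea, the snippet below replaces the running sum with an exponential moving average. The helper name `rmsprop_step`, the decay of 0.9, and the constant gradient are illustrative assumptions rather than values taken from the article.

```python
import numpy as np


def rmsprop_step(param, grad, avg_sq, learning_rate=0.1, decay=0.9, eps=1e-8):
    """One RMSProp update (illustrative helper, not a library function)."""
    # Moving AVERAGE of squared gradients: old values are gradually forgotten.
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    step = learning_rate * grad / (np.sqrt(avg_sq) + eps)
    return param - step, avg_sq


# Hypothetical run with a constant gradient of 1.0.
param, avg_sq = 0.0, 0.0
history = []
for _ in range(200):
    param, avg_sq = rmsprop_step(param, grad=1.0, avg_sq=avg_sq)
    history.append(avg_sq)

# The average levels off near grad**2 instead of growing without bound.
print(history[0], history[-1])
```

Since the denominator stops growing, the effective learning rate stays near its base value rather than decaying toward zero.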

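Adam integrates both approaches: it computes an adaptive learning rate for every parameter while taking into account both the first and second moments of the gradients. The first moment is the mean of the gradients, and the second moment is their uncentered variance. Adam keeps a moving average of each of these two quantities and updates them at every training step.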

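More specifically, the first-moment estimate (mean) and the second-moment estimate (uncentered variance) are used to compute bias-corrected estimates, which keeps the estimates accurate even early in training, when few data points have been seen. These bias-corrected estimates are then used to update the parameters, allowing Adam to adapt the learning rate of each parameter.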
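To see why the correction matters, consider a small sketch in which the moving average starts at zero. The values `beta_one = 0.9` and a constant gradient of 2.0 are assumptions picked so the effect is easy to read.

```python
beta_one = 0.9   # assumed decay rate for the first moment
gradient = 2.0   # hypothetical constant gradient

m = 0.0          # the moving average starts at zero, so it is biased toward zero
raw, corrected = [], []
for t in range(1, 6):
    m = beta_one * m + (1.0 - beta_one) * gradient
    raw.append(m)
    corrected.append(m / (1.0 - beta_one ** t))  # bias-corrected estimate

print(raw)        # starts near 0.2 and climbs slowly toward 2.0
print(corrected)  # stays at about 2.0 from the very first step
```

Without the division by `1 - beta_one ** t`, the first few estimates would badly understate the true gradient.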

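How the Adam Optimizer Works

Adam's main contribution is its ability to use the first and second moments of the gradients to compute an adaptive learning rate for each parameter. AdaGrad calibrates the learning rate using the sum of historical squared gradients and handles sparse data and features very well. Over time, however, its learning rate declines steadily and may become so small that it stalls training. RMSProp mitigates this problem by normalizing the learning rate with a moving average of squared gradients, and it is very useful for non-stationary objectives, where the best learning rate may change over time.

To combine these ideas, Adam tracks two moving averages for each parameter: the first moment, which is the mean of the gradients, and the second moment, which is their uncentered variance. At every training iteration Adam updates these moving averages, computes bias-corrected estimates of both moments, and modifies the parameters. Because the moving averages start at zero and few data points have been seen early in training, this bias correction keeps the estimates reliable.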

Code

We will now implement the Adam optimizer and its extension, AdaMax.

Importing libraries

 
import numpy as np
from math import sqrt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from tensorflow.keras.layers import Dense, Conv2D, MaxPooling2D, BatchNormalization, Flatten, Dropout
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.preprocessing import LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.metrics.pairwise import cosine_similarity   

Reading the dataset

 
(x_train,y_train) , (x_test,y_test) = tf.keras.datasets.mnist.load_data()

print(f'x_train shape : {x_train.shape}')
print(f'y_train shape : {y_train.shape}')   

Output

Cost function

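A loss function, or cost function, is a way of evaluating how well an algorithm models a dataset by measuring the difference between the predicted and actual values. During training it is expressed as a single real number that quantifies the "loss", or error. Essentially, the loss function outputs a high number when the predictions are inaccurate and a low number when they are accurate. The loss function guides the adjustments to the algorithm and indicates whether the model is improving.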
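As one concrete instance, the sketch below uses mean squared error. The article does not name a specific loss, so the choice of MSE, the helper name `mean_squared_error`, and the sample values are all assumptions made for illustration.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


targets = [1.0, 0.0, 1.0]            # hypothetical ground truth
good_predictions = [0.9, 0.1, 0.8]   # close to the targets
poor_predictions = [0.2, 0.9, 0.1]   # far from the targets

loss_good = mean_squared_error(targets, good_predictions)
loss_poor = mean_squared_error(targets, poor_predictions)
print(loss_good, loss_poor)  # the accurate predictions give the lower loss
```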

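In machine learning, examining a single neuron makes the principle easier to understand. For input data x and output H(x), the relationship between them is given by H(x) = σ(wx + b), where w is the weight, b is the bias, and σ is the activation function, usually the sigmoid function. The goal is to find the w and b that minimize the cost function. However, for neural networks the cost function is generally not convex, so it can have multiple local minima, which makes optimization challenging.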
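The single-neuron relationship H(x) = σ(wx + b) can be written out in a few lines. The weight, bias, and input values below are hypothetical and serve only to show the shape of the computation.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))


def neuron(x, w, b):
    """Single neuron: H(x) = sigmoid(w * x + b)."""
    return sigmoid(w * x + b)


w, b = 0.5, -1.0               # hypothetical weight and bias
x = np.array([0.0, 2.0, 4.0])  # hypothetical inputs
outputs = neuron(x, w, b)
print(outputs)  # values between 0 and 1 that increase with x
```

Training would adjust `w` and `b` so that these outputs move closer to the desired targets, as measured by the cost function.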

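Gradient descent

Gradient descent is an iterative first-order optimization procedure for locating a local minimum of a function (or, when run as gradient ascent, a local maximum). It can be viewed as a fixed-point iteration that stops changing once the first derivative of the cost function is 0. Although effective, the method faces challenges in complex artificial neural networks (ANNs). To address these difficulties, the Adam (Adaptive Moment Estimation) method is used.

The gradient descent algorithm optimizes an objective function by using its derivative. For a given input x, the derivative function f′(x) gives the derivative, while the objective function f(x) gives the score. Starting from an initial point (usually chosen at random), the algorithm computes the derivative and moves in the direction expected to decrease the objective function, assuming minimization.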

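Now we define a function to optimize. We restrict the allowed input range to between -1.0 and 1.0 and use a simple two-dimensional function that squares the input in each dimension and adds the results.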

 
def function(x, y):
    return x**2.0 + y**2.0

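We can generate a three-dimensional surface plot of the function described above.

By default, the input range is set to between -1.0 and 1.0; however, we can change it to view the curvature from a different perspective.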

 
r_min, r_max = -1.0, 1.0

xaxis = np.arange(r_min, r_max, 0.1)
yaxis = np.arange(r_min, r_max, 0.1)

for i in range(0,3):
    print(f'Iteration {i+1}:')
    print(f'x-axis = {xaxis[i]}')
    print(f'y-axis = {yaxis[i]}')  

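Output

As shown above, the axis advances in steps of only 0.1, which we can adjust as needed, and continues up to (but not including) 1.0. Next we build a mesh grid: a rectangular grid made from two one-dimensional arrays that represent Cartesian or matrix indexing.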

 
x, y = np.meshgrid(xaxis, yaxis)

outcomes = function(x, y)

figure = plt.figure(figsize=(15,10))
axis = figure.add_subplot(projection='3d')  # figure.gca(projection=...) was removed in newer Matplotlib
axis.plot_surface(x, y, outcomes, cmap='jet')
plt.title("bowl shape",size=22,weight='bold')
plt.show()   

Output

The figure above shows that the global minimum is at f(0, 0) = 0. Next, we can implement gradient descent.

 
def gradient_descent(gradient, start, learn_rate, number_of_iterations=50):

    vector = start

    for _ in range(number_of_iterations):

        diff = -learn_rate * gradient(vector)

        vector += diff

    return vector

gradient_descent(gradient=lambda v: 2 * v, start=10.0, learn_rate=0.2)

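Output

We use the lambda function lambda v: 2 * v to supply the gradient of v². With the learning rate set to 0.2 and a starting point of 10.0, the result is very close to zero, which is the correct minimum.

Starting from the rightmost green point (v = 10), we move toward the minimum (v = 0). Because the gradient (and slope) is large at the start, the updates are large. As we approach the minimum, the updates get smaller.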
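The shrinking updates can be traced by hand for the first few iterations. The sketch below reuses the article's settings (gradient 2v, learning rate 0.2, start at 10.0); the variable names and the choice to show only five steps are assumptions for illustration.

```python
learn_rate = 0.2
v = 10.0          # starting point
step_sizes = []

for _ in range(5):
    step = learn_rate * 2 * v   # learning rate times the gradient of v**2
    step_sizes.append(step)
    v -= step

print(step_sizes)  # every step is smaller than the one before
print(v)           # each iteration multiplies v by 0.6
```

Each update multiplies v by 1 - 0.2 * 2 = 0.6, so both the position and the step size decay geometrically toward zero.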

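Adam Optimizer

Adam is designed to speed up the optimization process and improve the final result by adjusting the step size for each input parameter according to the gradients encountered during the search. This adaptive learning rate helps reduce the number of function evaluations needed to reach the optimum.

Adam introduces two further concepts: the first moment and the second moment. The first moment is an exponentially weighted average of the gradient values from previous steps, which captures the idea of momentum. The second moment is an exponentially weighted average of the squared gradient values. The algorithm takes the ratio of the first moment to the square root of the second moment and uses it to move toward the minimum efficiently.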
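One consequence of using this ratio is that the size of the very first update depends on the learning rate rather than on the gradient's magnitude. The sketch below checks this for a tiny and a huge gradient; the helper name `adam_first_step` and the two gradient values are hypothetical, while the hyperparameters mirror those used in this article.

```python
import math


def adam_first_step(grad, learning_rate=0.02, beta_one=0.8, beta_two=0.999, eps=1e-8):
    """Size of the first Adam update, starting from zero moments (illustrative)."""
    m = (1.0 - beta_one) * grad        # first moment after one step
    v = (1.0 - beta_two) * grad ** 2   # second moment after one step
    m_hat = m / (1.0 - beta_one)       # bias correction with t = 1
    v_hat = v / (1.0 - beta_two)
    return learning_rate * m_hat / (math.sqrt(v_hat) + eps)


small = adam_first_step(0.01)    # hypothetical tiny gradient
large = adam_first_step(100.0)   # hypothetical huge gradient
print(small, large)  # both are close to the learning rate of 0.02
```

This is why Adam is often described as roughly invariant to the scale of the gradients.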

 
def function(x, y):
    return x**2.0 + y**2.0


def derivative(x, y):
    return np.asarray([x * 2.0, y * 2.0])


def Adam(bounds, number_of_iterations, learning_rate, beta_One, beta_two, eps=1e-8):

    answers = []  # the point visited at each iteration
    scores = []   # the objective value at each iteration

    # Begin the search at a random point inside the problem's boundaries.
    # `bounds` has one row per dimension and two columns giving that
    # dimension's minimum and maximum values.
    x = bounds[:, 0] + np.random.rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

    # Initialize the first and second moments to zero.
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]

    # Run the gradient descent updates for number_of_iterations steps.
    for iterr in range(number_of_iterations):

        # g(t): gradient at the current point (x[0], x[1])
        g = derivative(x[0], x[1])

        # Perform the Adam update one variable at a time.
        for i in range(bounds.shape[0]):

            # m(t) = beta_One * m(t-1) + (1 - beta_One) * g(t)
            m[i] = beta_One * m[i] + (1.0 - beta_One) * g[i]

            # v(t) = beta_two * v(t-1) + (1 - beta_two) * g(t)^2
            v[i] = beta_two * v[i] + (1.0 - beta_two) * g[i]**2

            # mhat(t) = m(t) / (1 - beta_One^t)
            mhat = m[i] / (1.0 - beta_One**(iterr+1))

            # vhat(t) = v(t) / (1 - beta_two^t)
            vhat = v[i] / (1.0 - beta_two**(iterr+1))

            # x(t) = x(t-1) - learning_rate * mhat(t) / (sqrt(vhat(t)) + eps)
            x[i] = x[i] - learning_rate * mhat / (sqrt(vhat) + eps)

        # Record the objective value and the point reached in this iteration.
        score = function(x[0], x[1])
        scores.append(score)
        answers.append(x.copy())

    return [answers, scores]

# seed the pseudo-random number generator
np.random.seed(1)

# define range for input
bounds = np.asarray([[-1.0, 1.0], [-1.0, 1.0]])

# defining the total number of  iterations
number_of_iterations = 60

# size of steps (learning rate)
alpha_1 = 0.02

# average gradient factor
beta_One = 0.8

# average squared gradient factor
beta_two = 0.999

# perform the gradient descent search with Adam
answers, scores = Adam(bounds, number_of_iterations, alpha_1, beta_One, beta_two)

print("The Score per iteration: ")
for iteration, score in enumerate(scores, start=1):
    print(f'{iteration}: {score:.5f}')

Now we will visualize Adam's search process.

 
def visualize_gradient_in_2D(answers):
    
    # sample input range uniformly at intervals of 0.1
    xaxis = np.arange(bounds[0,0], bounds[0,1], 0.1)
    yaxis = np.arange(bounds[1,0], bounds[1,1], 0.1)
    
    #from the axis, make a mesh
    x, y = np.meshgrid(xaxis, yaxis)
    # compute targets
    
    outcomes = function(x, y)
    # make a jet color scheme and fill a contour plot with 50 levels.
    
    plt.contourf(x, y, outcomes, levels=50, cmap='jet')
    # Plot the visited points as white dots joined by a line.
    
    answers = np.asarray(answers)
    plt.plot(answers[:, 0], answers[:, 1], '.-', color='w')
    
    # showing the plot
    plt.show()
    
    
visualize_gradient_in_2D(answers)   

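Output

As we can see, each solution found during the search is shown as a white dot; the dots start above the optimum and gradually move toward the optimum at the center of the plot.

Let's try applying it to a different function.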

 
def function(x):
    return (x ** 3)-(3 *(x ** 2))+7
    
def derivative(x):
    x_deriv = 3* (x**2) - (6 * (x))
    return x_deriv

def Adam(new_x, prev_x, precision, learning_rate, beta_One, beta_two, epsilon):

    '''
    Description: Starting from an initial value of x, repeatedly applies the
    Adam update until two consecutive values of x differ by no more than
    `precision`, then reports the local minimum that was reached.

    Arguments:

    new_x - the starting value of x, updated at every step

    prev_x - the value of x from the previous step

    precision - the tolerance that determines when the descent stops

    learning_rate - the learning rate (step size of each descent)

    beta_One - the first-moment (momentum) decay parameter of Adam

    beta_two - the second-moment (RMSProp) decay parameter of Adam

    epsilon - a small number that prevents division by zero when the
    second-moment estimate is extremely small

    '''

    # Lists that collect the x and y values visited at each iteration.
    list_x, y_list = [new_x], [function(new_x)]

    # Set the first and second moments' starting values to 0.
    m = 0
    v = 0

    # Step counter starts at 1 to prevent zero division in the bias correction
    t = 1

    # keep looping until the desired precision is reached
    while abs(new_x - prev_x) > precision:

        # remember the current value of x
        prev_x = new_x

        # negative derivative at the previous x (the descent direction)
        derivative_of_x = - derivative(prev_x)

        # momentum: moving average of the gradient, using beta_One
        m = beta_One * m + ((1-beta_One) * derivative_of_x)

        # RMSProp: moving average of the squared gradient, using beta_two
        v = beta_two * v + ((1-beta_two) * (derivative_of_x * derivative_of_x))

        # Apply bias correction to both moving averages
        mhat = m / (1-(beta_One)**t)
        vhat = v / (1-(beta_two)**t)

        # combine the momentum and RMSProp terms into the scaled step direction
        nderivative_of_x = mhat / np.sqrt(vhat + epsilon)

        # move from the previous x by the learning rate times the scaled direction
        new_x = prev_x + (learning_rate * nderivative_of_x)
        list_x.append(new_x)
        y_list.append(function(new_x))

        t+=1

    print ("Local minimum occurs at: "+ str(new_x))
    print ("Number of steps: " + str(len(list_x)))
    x = np.linspace(-1,3,500)
    # Plot the steps taken by the Adam optimizer
    plt.scatter(list_x,y_list,c="r")
    plt.plot(x,function(x), c="b")
    plt.title("Adam Optimizer")
    plt.show()

Adam(0.5, 0, 0.001, 0.6, 0.9, 0.99, 10e-8)

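Output

Now, let's look at an extension of the Adam optimizer.

AdaMax

AdaMax is an extension of the Adam optimizer that uses a maximum to compute the second moment, which gives a more stable method. Adam updates the weights using a scaled L2 norm (squared) of the past gradients, whereas AdaMax extends this to the infinity norm (maximum) of the past gradients. It automatically adapts a separate step size for each parameter in the optimization problem.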
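The difference between the two scaling terms is easiest to see side by side. The sketch below feeds the same hypothetical gradient sequence, which contains one spike, into Adam's bias-corrected second moment and into AdaMax's decayed maximum; the sequence and variable names are assumptions for illustration.

```python
import math

beta_two = 0.99
gradients = [1.0, 1.0, 10.0, 1.0, 1.0]  # hypothetical sequence with one spike

v = 0.0  # Adam: moving average of squared gradients
u = 0.0  # AdaMax: exponentially decayed maximum of |gradient|
adam_scale, adamax_scale = [], []

for t, g in enumerate(gradients, start=1):
    v = beta_two * v + (1.0 - beta_two) * g ** 2
    adam_scale.append(math.sqrt(v / (1.0 - beta_two ** t)))  # bias-corrected

    u = max(beta_two * u, abs(g))  # no bias correction needed for the maximum
    adamax_scale.append(u)

print(adam_scale)
print(adamax_scale)  # jumps to the spike, then decays by beta_two per step
```

AdaMax's denominator tracks the largest recent gradient directly, so after a spike it shrinks the following steps until the maximum has decayed.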

 
def function(x, y):
    return x**2.0 + y**2.0
 
# derivative of the objective function
def derivative(x, y):
    return np.asarray([x * 2.0, y * 2.0])
 
# gradient descent algorithm with adamax
def Adamax(bounds, number_of_iterations,learning_rate, beta_One, beta_two):

    # Begin the search at a random point inside the problem's boundaries.
    # `bounds` has one row per dimension and two columns giving that
    # dimension's minimum and maximum values.
    answers = []
    t = []
    x = bounds[:, 0] + np.random.rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

    # Initialize the first moment and the exponentially weighted infinity norm.
    m = [0.0 for _ in range(bounds.shape[0])]
    u = [0.0 for _ in range(bounds.shape[0])]

    # Run the gradient descent updates for number_of_iterations steps.
    for iterr in range(number_of_iterations):

        # g(t): gradient at the current point (x[0], x[1])
        g = derivative(x[0], x[1])

        # Perform the AdaMax update one variable at a time.
        for i in range(x.shape[0]):

            # m(t) = beta_One * m(t-1) + (1 - beta_One) * g(t)
            m[i] = beta_One * m[i] + (1.0 - beta_One) * g[i]

            # u(t) = max(beta_two * u(t-1), abs(g(t)))
            u[i] = max(beta_two * u[i], abs(g[i]))

            # size_step(t) = learning_rate / (1 - beta_One^t)
            size_step = learning_rate / (1.0 - beta_One**(iterr+1))

            # delta(t) = m(t) / u(t)
            delta = m[i] / u[i]

            # x(t) = x(t-1) - size_step(t) * delta(t)
            x[i] = x[i] - size_step * delta

        # Record the iteration index and the point reached.
        t.append(iterr)
        answers.append(x.copy())

    return [answers,t]
 
# random generator
np.random.seed(1)

# define range for input
bounds = np.asarray([[-1.0, 1.0], [-1.0, 1.0]])

# defining the total number of  iterations
number_of_iterations = 60

# size of steps
alpha_1 = 0.02

# average gradient factor
beta_One = 0.8

# average squared gradient factor
beta_two = 0.99

# perform the gradient descent search with adamax
answers,iterations= Adamax(bounds, number_of_iterations, alpha_1, beta_One, beta_two)   

Now we will visualize the AdaMax search process.


 
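# Loss at each visited point (answers holds the [x, y] pairs)
losses = [function(px, py) for px, py in answers]

plt.xlabel('No of iterations')
plt.ylabel('Loss')
plt.plot(iterations, losses, '.-', color='red')
plt.show()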

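Output

Here the plotted curve should settle close to 0, the global minimum of this bowl-shaped function, by the end of the 60 iterations. This will not always be the case, however, because all three optimizers (gradient descent, Adam, and AdaMax) can struggle to find the global minimum of a non-convex function and may get stuck in a local minimum.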

 
def function(x):
    return (x ** 3)-(3 *(x ** 2))+7
    
def derivative(x):
    x_deriv = 3* (x**2) - (6 * (x))
    return x_deriv

def AdaMax(new_x, prev_x, precision, learning_rate, beta_One, beta_two, epsilon):

    '''
    Description: Starting from an initial value of x, repeatedly applies the
    AdaMax update until two consecutive values of x differ by no more than
    `precision`, then reports the local minimum that was reached.

    Arguments:

    new_x - the starting value of x, updated at every step

    prev_x - the value of x from the previous step

    precision - the tolerance that determines when the descent stops

    learning_rate - the learning rate (step size of each descent)

    beta_One - the first-moment (momentum) decay parameter

    beta_two - the decay parameter of the infinity-norm term

    epsilon - a small number that prevents division by zero when the
    infinity-norm term is extremely small

    '''

    # Lists that collect the x and y values visited at each iteration.
    list_x, y_list = [new_x], [function(new_x)]

    # Set the first moment and the infinity-norm term to 0.
    m = 0
    u = 0

    # Step counter starts at 1 to prevent zero division in the bias correction
    t = 1

    # keep looping until the desired precision is reached
    while abs(new_x - prev_x) > precision:

        # remember the current value of x
        prev_x = new_x

        # negative derivative at the previous x (the descent direction)
        derivative_of_x = - derivative(prev_x)

        # momentum: moving average of the gradient, using beta_One
        m = beta_One * m + ((1-beta_One) * derivative_of_x)

        # infinity norm: decayed maximum of the gradient magnitude, using beta_two
        u = max(beta_two * u, abs(derivative_of_x))

        # bias-corrected step size; t already starts at 1, so use beta_One**t
        size_step = learning_rate / (1.0 - beta_One**t)

        # epsilon guards against division by zero when the gradient is exactly 0
        delta = m / (u + epsilon)

        # move from the previous x by the step size times the scaled direction
        new_x = prev_x + (size_step * delta)
        list_x.append(new_x)
        y_list.append(function(new_x))

        t+=1

    print ("Local minimum happens at: "+ str(new_x))
    print ("Number of steps taken: " + str(len(list_x)))

    # Plot every step that the AdaMax optimizer performed.
    x = np.linspace(-1,3,500)
    plt.scatter(list_x,y_list,c="r")
    plt.plot(x,function(x), c="b")
    plt.title("AdaMax Optimizer")
    plt.show()

AdaMax(0.5, 0, 0.001, 0.6, 0.9, 0.99, 10e-8)

Output
