Gaussian Processes in Machine Learning

February 28, 2025 | 12 min read

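Gaussian processes (GPs) are a powerful non-parametric machine learning method. They were first applied to regression and have since been used successfully for classification and for more advanced applications such as time series analysis. Their flexibility in modeling complex relationships makes them a good choice where models with a fixed functional form, such as linear regression, fall short.

A Gaussian process is a collection of random variables, any finite subset of which follows a multivariate normal distribution. In machine learning, a GP is a way to represent a probability distribution over functions. Given a set of noisy observations, a GP describes the infinite set of functions that could explain the data, together with how plausible each of them is.

Put simply, traditional regression methods such as linear or polynomial regression assume a fixed functional form, whereas a GP infers a distribution over plausible functions directly from the data. More importantly, it returns a measure of uncertainty alongside each prediction.

In other words, a GP infers a distribution over functions directly, rather than a distribution over the parameters of a parametric function. A Gaussian process defines a prior over functions. Once some function values have been observed, the prior can be converted into a posterior over functions. Inferring continuous function values in this setting is known as Gaussian process regression. Gaussian processes can also be used for classification.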

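Mathematically

A Gaussian process is a random process in which every point $x \in \mathbb{R}^d$ is assigned a random variable $f(x)$, and the joint distribution of any finite number of these variables is itself Gaussian:

$$p(\mathbf{f} \mid X) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, K) \tag{1}$$

Here,

$\mathbf{f} = (f(x_1), \ldots, f(x_N))$ is the vector of function values evaluated at the input points $X = (x_1, \ldots, x_N)$.
$\boldsymbol{\mu} = (m(x_1), \ldots, m(x_N))$ is the mean vector. The mean function $m(x)$ is usually set to zero, because a GP is flexible enough to model the mean through its covariance.
$K_{ij} = \kappa(x_i, x_j)$ are the entries of the covariance matrix $K$, where the kernel $\kappa$ is a positive definite covariance function. It describes how function values are correlated.

A Gaussian process can therefore be interpreted as a distribution over functions whose properties, such as smoothness, are determined by the kernel $\kappa$. If the kernel considers points $x_i$ and $x_j$ to be similar, the function values $f(x_i)$ and $f(x_j)$ are expected to be similar as well.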
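
To make the role of the kernel concrete, here is a minimal sketch that builds a covariance matrix for three scalar inputs. The helper name sq_exp and the chosen inputs are illustrative assumptions, and the squared exponential form is just one common choice of kernel. Two nearby inputs should get a covariance close to 1, a distant pair a covariance close to 0, and the resulting matrix should be symmetric with non-negative eigenvalues.

```python
import numpy as np

def sq_exp(x1, x2, length_scale=1.0):
    # Squared exponential covariance between two scalar inputs (illustrative helper)
    return np.exp(-0.5 * (x1 - x2) ** 2 / length_scale ** 2)

# Two inputs that are close together and one that is far away
x = np.array([0.0, 0.1, 3.0])
K = np.array([[sq_exp(a, b) for b in x] for a in x])

# A valid covariance matrix is symmetric and has non-negative eigenvalues
eigenvalues = np.linalg.eigvalsh(K)

print(np.round(K, 3))
print(eigenvalues)
```

Under a GP with this covariance, f(0.0) and f(0.1) are almost perfectly correlated, while f(0.0) and f(3.0) are nearly independent.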

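Prior and posterior distributions

We start with a GP prior $p(\mathbf{f} \mid X)$. After observing some data $\mathbf{y}$, we update this prior to obtain the GP posterior $p(\mathbf{f} \mid X, \mathbf{y})$. This posterior can then be used to predict the function values $\mathbf{f}_*$ at new inputs $X_*$.

The posterior predictive distribution is again Gaussian, with mean $\boldsymbol{\mu}_*$ and covariance $\Sigma_*$:

$$p(\mathbf{f}_* \mid X_*, X, \mathbf{y}) = \mathcal{N}(\mathbf{f}_* \mid \boldsymbol{\mu}_*, \Sigma_*) \tag{2}$$

Joint distribution of observed data and predictions

By the definition of a GP, the observed data $\mathbf{y}$ and the predictions $\mathbf{f}_*$ are jointly Gaussian:

$$\begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{pmatrix} K_y & K_* \\ K_*^T & K_{**} \end{pmatrix}\right) \tag{3}$$

where $K_y = \kappa(X, X) + \sigma_y^2 I$, $K_* = \kappa(X, X_*)$ and $K_{**} = \kappa(X_*, X_*)$. The term $\sigma_y^2 I$ accounts for noise in the observations and vanishes for noise-free data.

Sufficient statistics of the posterior predictive distribution

The sufficient statistics $\boldsymbol{\mu}_*$ and $\Sigma_*$ of the posterior predictive distribution can be computed as:

$$\boldsymbol{\mu}_* = K_*^T K_y^{-1} \mathbf{y} \tag{4}$$

$$\Sigma_* = K_{**} - K_*^T K_y^{-1} K_* \tag{5}$$

These equations are all that is needed to compute Gaussian process regression.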
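
For readers wondering where the posterior mean and covariance come from, they are an instance of the standard conditioning rule for jointly Gaussian vectors. The sketch below uses generic block names $\mathbf{a}$, $\mathbf{b}$, $A$, $B$, $C$ that are introduced here only for illustration.

```latex
% Conditioning rule for a zero-mean jointly Gaussian vector
\begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix}
\sim \mathcal{N}\left(\mathbf{0},
\begin{pmatrix} A & C \\ C^T & B \end{pmatrix}\right)
\quad\Longrightarrow\quad
\mathbf{b} \mid \mathbf{a} \sim
\mathcal{N}\left(C^T A^{-1}\mathbf{a},\; B - C^T A^{-1} C\right)

% Substituting a = y, b = f_*, A = K_y, C = K_*, B = K_{**}:
\boldsymbol{\mu}_* = K_*^T K_y^{-1}\mathbf{y},
\qquad
\Sigma_* = K_{**} - K_*^T K_y^{-1} K_*
```

Here $K_y$ is the kernel matrix of the training inputs plus the noise term, $K_*$ is the kernel matrix between training and new inputs, and $K_{**}$ is the kernel matrix of the new inputs.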

Now let's implement this in code.

import numpy as np

def kernel(X_one, X_two, l=1.0, f_sigma=1.0):
    '''
    Isotropic squared exponential kernel. Computes the covariance
    matrix between the points in X_one and the points in X_two.

    Args:
        X_one: Array of a points (a x d).
        X_two: Array of b points (b x d).
        l: Kernel length scale.
        f_sigma: Kernel signal standard deviation (vertical variation).

    Returns:
        Covariance matrix (a x b).
    '''
    # Pairwise squared distances via |x|^2 + |y|^2 - 2 x.y
    sqdist = np.sum(X_one**2, 1).reshape(-1, 1) + np.sum(X_two**2, 1) - 2 * np.dot(X_one, X_two.T)
    return f_sigma**2 * np.exp(-0.5 / l**2 * sqdist)
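
The kernel function computes all pairwise squared distances at once through the expansion |a - b|^2 = |a|^2 + |b|^2 - 2 a.b. As a sanity check, the sketch below compares that expansion with an explicit pairwise difference computed by broadcasting. The function names sqdist_expanded and sqdist_direct and the random test arrays are assumptions made for this illustration.

```python
import numpy as np

def sqdist_expanded(A, B):
    # |a - b|^2 = |a|^2 + |b|^2 - 2 a.b, evaluated for all pairs at once
    return np.sum(A**2, 1).reshape(-1, 1) + np.sum(B**2, 1) - 2 * A @ B.T

def sqdist_direct(A, B):
    # Explicit pairwise differences via broadcasting: (a, 1, d) - (1, b, d)
    diff = A[:, None, :] - B[None, :, :]
    return np.sum(diff**2, axis=-1)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
B = rng.normal(size=(4, 3))

D_expanded = sqdist_expanded(A, B)
D_direct = sqdist_direct(A, B)
max_abs_diff = np.max(np.abs(D_expanded - D_direct))

print(D_expanded.shape, max_abs_diff)
```

The expanded form avoids the (a x b x d) intermediate array, which matters for large inputs. Rounding can make it return tiny negative values for identical points, so clipping with np.maximum(D, 0) is a guard some implementations add.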

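Let's start by specifying a prior over functions: our assumption about how the functions behave before we have seen any data. Here we assume a zero mean function, meaning that on average function values fluctuate around zero, although individual samples can differ considerably. We also need a covariance matrix that tells us how function values at different input points are expected to co-vary. It is built from the kernel function, which measures the similarity between points in input space.

For our setup we use the kernel parameters l = 1 (length scale) and f_sigma = 1 (signal standard deviation). The length scale l controls smoothness: with l = 1 we expect functions to vary fairly smoothly across input space, with nearby points having highly correlated function values. The amplitude is governed by f_sigma, which measures how far functions deviate from their mean. Once the covariance matrix has been computed with this kernel, we can draw random samples from the corresponding multivariate normal distribution. Each sample is one function that is plausible under the GP prior. We draw three so that the variation in behavior is visible.

We then plot these three random samples together with the zero mean and the 95% confidence interval.

The 95% confidence interval is computed from the diagonal elements of the covariance matrix: at each input it spans 1.96 standard deviations on either side of the mean. It indicates the range in which function values are expected to lie and shows the uncertainty of the GP's predictions.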
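
As a small check of how the interval follows from the diagonal of the covariance matrix, the sketch below computes the half-width of the band for a prior with unit signal standard deviation and estimates empirically how often sampled function values fall inside it. The helper name sq_exp_kernel, the input grid and the sample count are assumptions chosen for this illustration.

```python
import numpy as np

def sq_exp_kernel(X1, X2, l=1.0, f_sigma=1.0):
    # Isotropic squared exponential kernel (illustrative helper)
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
    return f_sigma**2 * np.exp(-0.5 / l**2 * sqdist)

X = np.linspace(-3, 3, 7).reshape(-1, 1)
cov = sq_exp_kernel(X, X)

# Marginal standard deviation at each input and half-width of the 95% band
std = np.sqrt(np.diag(cov))
half_width = 1.96 * std

# Draw many prior samples; a tiny jitter keeps the covariance numerically PSD
rng = np.random.default_rng(0)
jitter = 1e-10 * np.eye(len(X))
samples = rng.multivariate_normal(np.zeros(len(X)), cov + jitter, size=20000)

# Fraction of sampled values that fall inside the band around the zero mean
coverage = np.mean(np.abs(samples) <= half_width)

print(half_width, coverage)
```

With f_sigma = 1 every diagonal element of the prior covariance is 1, so the band is the same width everywhere. After conditioning on data the diagonal shrinks near training points and the band narrows accordingly.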

import numpy as np
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older Matplotlib versions
import matplotlib.pyplot as plt


def plot_gp(mu, cov, X, train_X=None, train_Y=None, samples=()):
    # Plot the GP mean, the 95% confidence band, optional samples and training points
    X = X.ravel()
    mu = mu.ravel()
    uncertainty = 1.96 * np.sqrt(np.diag(cov))

    plt.fill_between(X, mu + uncertainty, mu - uncertainty, alpha=0.1)
    plt.plot(X, mu, label='Mean')
    for i, sample in enumerate(samples):
        plt.plot(X, sample, lw=1, ls='--', label=f'Sample {i+1}')
    if train_X is not None:
        plt.plot(train_X, train_Y, 'xr')
    plt.legend()

def plot_gp_2D(xg, yg, mu, train_X, train_Y, title, i):
    # Plot the mean surface and the training points in the i-th of two 3D subplots
    ax = plt.gcf().add_subplot(1, 2, i, projection='3d')
    ax.plot_surface(xg, yg, mu.reshape(xg.shape), cmap=cm.coolwarm, linewidth=0, alpha=0.2, antialiased=False)
    ax.scatter(train_X[:,0], train_X[:,1], train_Y, c=train_Y, cmap=cm.coolwarm)
    ax.set_title(title)

# Jupyter notebook magic; omit this line when running as a plain Python script
%matplotlib inline

# Finite number of points
X = np.arange(-10, 10, 0.4).reshape(-1, 1)

# Mean and covariance of the prior
mu = np.zeros(X.shape)
cov = kernel(X, X)

# Draw three samples from the prior
samples = np.random.multivariate_normal(mu.ravel(), cov, 3)

# Plot the GP mean, the confidence interval and the samples
plot_gp(mu, cov, X, samples=samples)

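Output

Now suppose the training data are noise-free. We want the mean and covariance of the posterior predictive distribution, which reflects our updated belief after observing the data. These are the statistics needed to make predictions at new inputs.

Both follow from the two equations given earlier, one for the mean and one for the covariance of the posterior predictive distribution. They describe how the prior belief, that is, our assumption before seeing any data, is updated with the training data to give the posterior predictive distribution.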

from numpy.linalg import inv

def predict_posterior(x_S, train_X, train_Y, l=1.0, f_sigma=1.0, y_sigma=1e-8):
    '''
    Computes the sufficient statistics of the posterior predictive
    distribution from n training points train_X, train_Y and m new inputs x_S.

    Args:
        x_S: New input locations (m x d).
        train_X: Training locations (n x d).
        train_Y: Training targets (n x 1).
        l: Kernel length scale.
        f_sigma: Kernel signal standard deviation (vertical variation).
        y_sigma: Noise standard deviation.

    Returns:
        Posterior mean vector (m x 1) and covariance matrix (m x m).
    '''
    K = kernel(train_X, train_X, l, f_sigma) + y_sigma**2 * np.eye(len(train_X))
    s_K = kernel(train_X, x_S, l, f_sigma)
    ss_K = kernel(x_S, x_S, l, f_sigma) + 1e-8 * np.eye(len(x_S))
    invert_K = inv(K)

    # Equation (4): posterior mean K_*^T K_y^-1 y
    mean_post = s_K.T.dot(invert_K).dot(train_Y)

    # Equation (5): posterior covariance K_** - K_*^T K_y^-1 K_*
    covariance_post = ss_K - s_K.T.dot(invert_K).dot(s_K)

    return mean_post, covariance_post

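Now let's apply these equations to noise-free training data, with train_X as the inputs and train_Y as the corresponding outputs. In the following example we draw three random samples from the posterior predictive distribution and plot them together with the mean, the confidence interval and the training data.

Because the model is noise-free, the variance at the training points is zero and the function values there are known exactly. Any function drawn from the posterior therefore passes through the training points.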
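
The interpolation property can be checked numerically. The sketch below is a standalone variant with assumed helper names (sq_exp_kernel, gp_posterior) that uses np.linalg.solve in place of an explicit matrix inverse. It evaluates the posterior at the training inputs themselves and reports how far the mean is from the targets and how large the remaining variance is.

```python
import numpy as np

def sq_exp_kernel(X1, X2, l=1.0, f_sigma=1.0):
    # Isotropic squared exponential kernel (illustrative helper)
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
    return f_sigma**2 * np.exp(-0.5 / l**2 * sqdist)

def gp_posterior(X_new, X_train, y_train, noise_std=1e-8):
    # Posterior mean and covariance at X_new given the training data
    K = sq_exp_kernel(X_train, X_train) + noise_std**2 * np.eye(len(X_train))
    K_s = sq_exp_kernel(X_train, X_new)
    K_ss = sq_exp_kernel(X_new, X_new)
    # Solve linear systems instead of forming the explicit inverse
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

# Noise-free training data
X_train = np.array([-4.0, -3.0, -2.0, -1.0, 1.0]).reshape(-1, 1)
y_train = np.sin(X_train)

# Evaluate the posterior at the training inputs themselves
mean_at_train, cov_at_train = gp_posterior(X_train, X_train, y_train)

max_mean_error = np.max(np.abs(mean_at_train - y_train))
max_variance = np.max(np.abs(np.diag(cov_at_train)))

print(max_mean_error, max_variance)
```

Both numbers should be zero up to floating point rounding, which is why every posterior sample passes through the training points in the noise-free case.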

# Noise-free training data
train_X = np.array([-4, -3, -2, -1, 1]).reshape(-1, 1)
train_Y = np.sin(train_X)

# Compute the mean and covariance of the posterior predictive distribution
mean_post, covariance_post = predict_posterior(X, train_X, train_Y)

# Draw three samples from the posterior and plot them
samples = np.random.multivariate_normal(mean_post.ravel(), covariance_post, 3)
plot_gp(mean_post, covariance_post, X, train_X=train_X, train_Y=train_Y, samples=samples)

Output

When the training data are noisy, the noise level is passed to predict_posterior as y_sigma. The posterior then only approximates the training points and no longer has to pass through them exactly.

noise = 0.4

# Noisy training data
train_X = np.arange(-3, 4, 1).reshape(-1, 1)
train_Y = np.sin(train_X) + noise * np.random.randn(*train_X.shape)

# Compute the mean and covariance of the posterior predictive distribution
mean_post, covariance_post = predict_posterior(X, train_X, train_Y, y_sigma=noise)

# Draw three samples from the posterior and plot them
samples = np.random.multivariate_normal(mean_post.ravel(), covariance_post, 3)
plot_gp(mean_post, covariance_post, X, train_X=train_X, train_Y=train_Y, samples=samples)

Output

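The kernel parameters and the noise parameter largely determine the behavior and the fit of a Gaussian process (GP) model. The length scale l, the signal standard deviation σf (σf² is the signal variance) and the noise level σy (σy² is the noise variance) each shape the functions a GP generates in a distinct way.

First, the smoothness of the functions is controlled by the length scale l. Higher values of l give smoother functions and therefore a coarser, more gradual approximation of the training data. Lower values of l give more wiggly functions that can follow rapid changes between data points, which usually fits the training data more tightly but may generalize worse. The effect of l is particularly visible in the confidence intervals between training points: a smaller l leads to wider intervals there.

Next, σf controls the vertical variation of the functions, that is, the amplitude of the fluctuations the model can represent. Higher values of σf give a larger spread along the vertical axis, which shows most clearly as wider confidence intervals outside the range of the training data.

Finally, σy is the noise level assumed for the training data. A larger σy gives coarser approximations with greater uncertainty and keeps the model from overfitting to noisy data points, which is useful when the data are inherently noisy because the model becomes more tolerant of outliers and random fluctuations. When σy is very small, the model fits the training data tightly and risks overfitting if the data set is noisy.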
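
These three effects can be checked numerically without plotting. The sketch below compares the posterior predictive standard deviation at selected points for different parameter settings. The helper names (sq_exp_kernel, posterior_std), the probe points and the parameter values are assumptions for this illustration. Targets are not needed, because the posterior covariance of a GP depends only on the input locations and the parameters.

```python
import numpy as np

def sq_exp_kernel(X1, X2, l, f_sigma):
    # Isotropic squared exponential kernel (illustrative helper)
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
    return f_sigma**2 * np.exp(-0.5 / l**2 * sqdist)

def posterior_std(x_new, X_train, l, f_sigma, noise_std):
    # Posterior predictive standard deviation of f at the points x_new
    K = sq_exp_kernel(X_train, X_train, l, f_sigma) + noise_std**2 * np.eye(len(X_train))
    K_s = sq_exp_kernel(X_train, x_new, l, f_sigma)
    K_ss = sq_exp_kernel(x_new, x_new, l, f_sigma)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    # Clip tiny negative values caused by rounding before taking the root
    return np.sqrt(np.clip(np.diag(cov), 0.0, None))

X_train = np.arange(-3, 4, 1.0).reshape(-1, 1)
x_mid = np.array([[0.5]])  # halfway between two training points
x_far = np.array([[8.0]])  # well outside the training range
x_at = np.array([[0.0]])   # exactly at a training point

# Length scale: uncertainty between training points
std_short_l = posterior_std(x_mid, X_train, l=0.3, f_sigma=1.0, noise_std=0.2)[0]
std_long_l = posterior_std(x_mid, X_train, l=3.0, f_sigma=1.0, noise_std=0.2)[0]

# Signal standard deviation: uncertainty far from the data
std_small_f = posterior_std(x_far, X_train, l=1.0, f_sigma=0.3, noise_std=0.2)[0]
std_large_f = posterior_std(x_far, X_train, l=1.0, f_sigma=3.0, noise_std=0.2)[0]

# Noise level: uncertainty at a training point
std_low_noise = posterior_std(x_at, X_train, l=1.0, f_sigma=1.0, noise_std=0.05)[0]
std_high_noise = posterior_std(x_at, X_train, l=1.0, f_sigma=1.0, noise_std=1.5)[0]

print(std_short_l, std_long_l)
print(std_small_f, std_large_f)
print(std_low_noise, std_high_noise)
```

In each printed pair the first value belongs to the smaller parameter setting. Far from the data the posterior reverts to the prior, so the standard deviation there is approximately f_sigma.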

import matplotlib.pyplot as plt

# Each tuple is (l, f_sigma, y_sigma); each pair of rows varies one parameter
params = [
    (0.3, 1.0, 0.2),
    (3.0, 1.0, 0.2),
    (1.0, 0.3, 0.2),
    (1.0, 3.0, 0.2),
    (1.0, 1.0, 0.05),
    (1.0, 1.0, 1.5),
]

plt.figure(figsize=(24, 10))

# One subplot per parameter combination, using the noisy training data from above
for i, (l, f_sigma, y_sigma) in enumerate(params):
    mean_post, covariance_post = predict_posterior(X, train_X, train_Y, l=l, 
                                       f_sigma=f_sigma, 
                                       y_sigma=y_sigma)
    plt.subplot(3, 2, i + 1)
    plt.subplots_adjust(top=1)
    plt.title(f'l = {l}, f_sigma = {f_sigma}, y_sigma = {y_sigma}')
    plot_gp(mean_post, covariance_post, X, train_X=train_X, train_Y=train_Y)

Output

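Good values for the kernel parameters l (length scale) and σf (signal standard deviation) can be estimated by minimizing the negative log marginal likelihood with respect to them:

$$-\log p(\mathbf{y} \mid X) = \tfrac{1}{2}\mathbf{y}^T K_y^{-1}\mathbf{y} + \tfrac{1}{2}\log|K_y| + \tfrac{N}{2}\log(2\pi)$$

Here $K_y$ is the kernel matrix of the training inputs plus $\sigma_y^2 I$, and $N$ is the number of training points. The first term measures how well the model fits the observed data and the second term penalizes model complexity, so minimizing the sum yields kernel parameters that explain the data without fitting it more closely than the assumed noise level justifies.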

from scipy.optimize import minimize
from numpy.linalg import cholesky, det, inv, lstsq


def nll_fn(train_X, train_Y, noise, naive=True):
    '''
    Returns a function that computes the negative log marginal
    likelihood for training data train_X and train_Y and a given
    noise level.

    Args:
        train_X: Training locations (N x d).
        train_Y: Training targets (N x 1).
        noise: Known noise level of train_Y.
        naive: If True, use a naive implementation of the equation;
               if False, use a numerically more stable implementation.

    Returns:
        Minimization objective.
    '''
    # Flatten the targets so that both objectives return a scalar
    y = train_Y.ravel()

    def naive_nll(theta):
        # Naive implementation of the equation. It works for the 1D example
        # in this article but is numerically less stable than stable_nll.
        K = kernel(train_X, train_X, l=theta[0], f_sigma=theta[1]) + \
            noise**2 * np.eye(len(train_X))
        return 0.5 * np.log(det(K)) + \
               0.5 * y.dot(inv(K).dot(y)) + \
               0.5 * len(train_X) * np.log(2*np.pi)

    def stable_nll(theta):
        # Numerically more stable implementation based on the Cholesky
        # decomposition K = L L^T, where 0.5 * log(det(K)) = sum(log(diag(L))).
        K = kernel(train_X, train_X, l=theta[0], f_sigma=theta[1]) + \
            noise**2 * np.eye(len(train_X))
        L = cholesky(K)
        return np.sum(np.log(np.diagonal(L))) + \
               0.5 * y.dot(lstsq(L.T, lstsq(L, y, rcond=None)[0], rcond=None)[0]) + \
               0.5 * len(train_X) * np.log(2*np.pi)

    if naive:
        return naive_nll
    else:
        return stable_nll

# Minimize the negative log marginal likelihood w.r.t. the parameters l and f_sigma.
# Ideally the minimization would be run several times with different
# initializations to avoid local minima; this is skipped here for simplicity.
resolutionx = minimize(nll_fn(train_X, train_Y, noise), [1, 1], 
               bounds=((1e-5, None), (1e-5, None)),
               method='L-BFGS-B')

# Store the optimized values in global variables so that they can be
# compared with the results of other implementations later.
length_opti, f_sigma_opt = resolutionx.x
length_opti, f_sigma_opt

# Compute the posterior predictive statistics with the optimized kernel parameters and plot the results
mean_post, covariance_post = predict_posterior(X, train_X, train_Y, l=length_opti, f_sigma=f_sigma_opt, y_sigma=noise)
plot_gp(mean_post, covariance_post, X, train_X=train_X, train_Y=train_Y)

Output
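
The code comments above mention that the minimization should ideally be repeated from several starting points. One way this could look is sketched below. It is a standalone illustration, not the article's implementation: the helper names (sq_exp_kernel, make_nll, fit_with_restarts), the log-uniform sampling of starting points and the use of scipy.linalg.cho_factor and cho_solve for the objective are all assumptions made for this sketch.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def sq_exp_kernel(X1, X2, l=1.0, f_sigma=1.0):
    # Isotropic squared exponential kernel (illustrative helper)
    sqdist = np.sum(X1**2, 1).reshape(-1, 1) + np.sum(X2**2, 1) - 2 * X1 @ X2.T
    return f_sigma**2 * np.exp(-0.5 / l**2 * sqdist)

def make_nll(X_train, y_train, noise_std):
    y = y_train.ravel()
    n = len(y)

    def nll(theta):
        l, f_sigma = theta
        K = sq_exp_kernel(X_train, X_train, l, f_sigma) + noise_std**2 * np.eye(n)
        # Cholesky factor of K; its diagonal gives 0.5 * log(det(K))
        c, lower = cho_factor(K, lower=True)
        alpha = cho_solve((c, lower), y)
        return np.sum(np.log(np.diag(c))) + 0.5 * y @ alpha + 0.5 * n * np.log(2 * np.pi)

    return nll

BOUNDS = ((1e-5, None), (1e-5, None))

def fit_with_restarts(nll, n_restarts=5, seed=0):
    # Always include the default start [1, 1], then add random starts
    # drawn log-uniformly from roughly [0.14, 7.4] for each parameter
    rng = np.random.default_rng(seed)
    starts = [np.array([1.0, 1.0])]
    starts += [np.exp(rng.uniform(-2, 2, size=2)) for _ in range(n_restarts)]
    results = [minimize(nll, s, bounds=BOUNDS, method='L-BFGS-B') for s in starts]
    # Keep the run with the lowest objective value
    return min(results, key=lambda r: r.fun)

# Noisy 1D training data similar to the example above
rng = np.random.default_rng(1)
X_train = np.arange(-3, 4, 1.0).reshape(-1, 1)
noise_std = 0.4
y_train = np.sin(X_train) + noise_std * rng.standard_normal(X_train.shape)

nll = make_nll(X_train, y_train, noise_std)
single = minimize(nll, [1.0, 1.0], bounds=BOUNDS, method='L-BFGS-B')
best = fit_with_restarts(nll)

print(single.x, single.fun)
print(best.x, best.fun)
```

Because the default starting point is among the restarts, the selected result can never be worse than the single run. Whether the restarts actually find a better optimum depends on the data.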

# The same functions also work for higher-dimensional inputs; here is a 2D example
noise_2D = 0.1

# Grid of test inputs
xr, yr = np.arange(-5, 5, 0.3), np.arange(-5, 5, 0.3)
xg, yg = np.meshgrid(xr, yr)

X_2D = np.c_[xg.ravel(), yg.ravel()]

# Noisy training data with two input dimensions, matching the columns of X_2D
X_2D_train = np.random.uniform(-8, 8, (200, 2))
Y_2D_train = np.sin(0.5 * np.linalg.norm(X_2D_train, axis=1)) + \
             noise_2D * np.random.randn(len(X_2D_train))

plt.figure(figsize=(14,7))

mean_post, _ = predict_posterior(X_2D, X_2D_train, Y_2D_train, y_sigma=noise_2D)
plot_gp_2D(xg, yg, mean_post, X_2D_train, Y_2D_train, 
           'Before parameter optimization: l=1.00 f_sigma=1.00', 1)

# Use the Cholesky-based objective here: with a few hundred training points
# det(K) in the naive version can underflow to zero
resolutionx = minimize(nll_fn(X_2D_train, Y_2D_train, noise_2D, naive=False), [1, 1], 
               bounds=((1e-5, None), (1e-5, None)),
               method='L-BFGS-B')

mean_post, _ = predict_posterior(X_2D, X_2D_train, Y_2D_train, *resolutionx.x, y_sigma=noise_2D)
plot_gp_2D(xg, yg, mean_post, X_2D_train, Y_2D_train,
           f'After parameter optimization: l={resolutionx.x[0]:.2f} f_sigma={resolutionx.x[1]:.2f}', 2)

Scikit-learn provides the GaussianProcessRegressor class for GP regression models. It can be configured with predefined kernels and with user-defined kernels, and kernels can be combined. The squared exponential kernel is called the RBF kernel in scikit-learn. The RBF kernel has only a length_scale parameter, which corresponds to the l parameter above. To obtain σf as well, we combine the RBF kernel with a ConstantKernel.

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

rbf = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=rbf, alpha=noise**2)

# Reuse the training data from the previous 1D example
gpr.fit(train_X, train_Y)

# Compute the posterior predictive mean and covariance
mean_post, covariance_post = gpr.predict(X, return_cov=True)

# Obtain the optimized kernel parameters
l = gpr.kernel_.k2.get_params()['length_scale']
f_sigma = np.sqrt(gpr.kernel_.k1.get_params()['constant_value'])

# Compare with the previous results
assert(np.isclose(length_opti, l))
assert(np.isclose(f_sigma_opt, f_sigma))

# Plot the results
plot_gp(mean_post, covariance_post, X, train_X=train_X, train_Y=train_Y)

Output

GPy is a Gaussian process framework developed by the Sheffield machine learning group. It provides the GPRegression class for GP regression models. By default, GPRegression also estimates the noise parameter σy from the data, so to reproduce the results above we have to fix() this parameter.

import GPy

rbf = GPy.kern.RBF(input_dim=1, variance=1.0, lengthscale=1.0)
gpr = GPy.models.GPRegression(train_X, train_Y, rbf)

# Fix the noise variance to the known value
gpr.Gaussian_noise.variance = noise**2
gpr.Gaussian_noise.variance.fix()

# Run the optimization
gpr.optimize();

# Display the optimized parameter values (display is available in Jupyter notebooks)
display(gpr)

Output

# Obtain the optimized kernel parameters
l = gpr.rbf.lengthscale.values[0]
f_sigma = np.sqrt(gpr.rbf.variance.values[0])

# Compare with the previous results
assert(np.isclose(length_opti, l))
assert(np.isclose(f_sigma_opt, f_sigma))

# Plot the results
gpr.plot();

Output
