Light Gradient Boosted Machine (LightGBM)

June 20, 2025 | 7 min read

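LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, which brings several advantages: faster training, lower memory usage, good accuracy, support for parallel and GPU learning, and the ability to handle large-scale datasets. Among decision-tree-based machine learning algorithms, LightGBM has become a favorite in Kaggle competitions, a space long dominated by frameworks such as XGBoost. Since its release by Microsoft, LightGBM has attracted a great deal of attention and is now preferred by many data scientists and Kaggle competitors. It is often reported to train several times faster than XGBoost (figures of up to six times are cited), although the actual speedup depends on the dataset, the hardware, and the configuration.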

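LightGBM is a relatively new algorithm with a large number of parameters, as its documentation shows. As datasets grow rapidly in size, traditional data science algorithms struggle to deliver results efficiently. LightGBM, called "Light" because of its speed and low memory requirements, is well suited to large datasets. Its accuracy and its support for GPU learning are further reasons for its wide use in data science applications. LightGBM also has limitations: it is not recommended for small datasets, where it overfits easily and may produce suboptimal results. For large problems, however, it remains a powerful and efficient tool for achieving high accuracy and performance.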

Importing Libraries

 
import numpy as np
import pandas as pd
import lightgbm as lgb

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import recall_score

# The 'seaborn' style was renamed to 'seaborn-v0_8' in Matplotlib 3.6; use 'seaborn' on older versions.
mpl.style.use('seaborn-v0_8')
np.set_printoptions(precision=4, suppress=True)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

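Below is a hands-on example of a classification task: detecting credit card fraud. The dataset is extremely imbalanced, with very few positive examples relative to negative ones. GBDTs cope well with this kind of imbalance, provided class weights are supplied. We standardize the Amount feature and drop the Time feature, which is not useful for our purposes. Finally, we split the data into training and validation sets.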

 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('../input/creditcard.csv')
dataset.head(10)  # preview the first rows (displayed only when run as the last line of a notebook cell)

# Standardize Amount into a new column, then drop the raw Amount and Time columns.
dataset['NormalizedAmount'] = StandardScaler().fit_transform(dataset['Amount'].values.reshape(-1, 1)).ravel()
dataset = dataset.drop(['Time', 'Amount'], axis=1)

# Share of positive (fraud) examples, in percent.
percent_pos = dataset[dataset['Class'] == 1].shape[0] / dataset.shape[0] * 100
print("{:.2f}% of the data are positive examples.".format(percent_pos))

X = dataset.drop('Class', axis=1)
y = dataset['Class']

train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.33)

Output

0.17% of the data are positive examples.
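
The class weight mentioned above is commonly derived from the class counts. The following is a minimal sketch, assuming a plain list of 0/1 labels. The helper name `suggest_scale_pos_weight` is hypothetical, and the negatives-to-positives ratio is a common heuristic rather than a value taken from this tutorial:

```python
def suggest_scale_pos_weight(labels):
    """Return the negatives-to-positives ratio for a list of 0/1 labels."""
    positives = sum(1 for value in labels if value == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives


# Toy labels: 2 positives among 1000 rows (0.2% positive).
toy_labels = [1, 1] + [0] * 998
weight = suggest_scale_pos_weight(toy_labels)
print(weight)  # 499.0
```

A value like this could be passed as `scale_pos_weight` in the parameter dictionary. Smaller values are often tried as well when the raw ratio over-weights the positive class.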

import lightgbm as lgb

# Wrap the training and validation sets in LightGBM Datasets.
# free_raw_data=False keeps the raw data so it can be reused for prediction later.
train_lgb = lgb.Dataset(train_X, label=train_y, free_raw_data=False)
val_lgb = lgb.Dataset(val_X, label=val_y, reference=train_lgb, free_raw_data=False)

The main parameters that affect how a gradient boosting machine is trained are listed below, each with a short description.

 
params_core = {
    'boosting_type': 'gbdt',  # Boosting type: gbdt (gradient boosted decision trees), dart, goss, or rf (random forest).

    'objective': 'binary',  # Optimization objective, e.g. binary, regression, multiclass, or xentropy.
    'learning_rate': 0.05,  # Shrinkage rate: the step size applied to each new tree.
    'num_leaves': 31,  # Maximum number of leaves in a single tree.
    'nthread': 4,  # Number of threads; ideally set to the number of physical cores.

    'metric': 'auc'  # Metric evaluated on the validation set: area under the ROC curve (AUC).
}
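
To illustrate what the `learning_rate` (shrinkage) setting does, here is a deliberately simplified sketch in plain Python. It assumes squared-error loss and replaces each tree with the simplest possible weak learner, a single constant fitted to the residuals. The function name `boost_with_constants` is hypothetical, and this is not how LightGBM is implemented:

```python
def boost_with_constants(targets, learning_rate, rounds):
    """Return the shared prediction after each boosting round."""
    prediction = 0.0
    history = []
    for _ in range(rounds):
        # Residuals are the negative gradient of the squared-error loss.
        residuals = [t - prediction for t in targets]
        # Weak learner: the constant that best fits the residuals (their mean).
        step = sum(residuals) / len(residuals)
        # Shrinkage: only a fraction of the weak learner's output is added.
        prediction += learning_rate * step
        history.append(prediction)
    return history


targets = [2.0, 4.0, 6.0]
print(boost_with_constants(targets, learning_rate=0.5, rounds=3))  # [2.0, 3.0, 3.5]
print(boost_with_constants(targets, learning_rate=1.0, rounds=3))  # [4.0, 4.0, 4.0]
```

With a smaller learning rate each round moves only part of the way toward the target, which is why a lower `learning_rate` is usually paired with more boosting rounds.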

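We can now use LightGBM to train a gradient boosted decision tree. The training call is wrapped in a function that trains the GBDT, plots the ROC curve on the validation set, and returns the model together with the validation results recorded at each iteration.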

 
def train_gbm(params, training_set, validation_set, init_gbm=None, boost_rounds=100, early_stopping_rounds=0, metric='auc'):
    # Note: `metric` is unused here; the metric reported during training comes from params['metric'].
    evaluation_result = {}  # filled with per-iteration validation scores

    # Recent LightGBM releases (4.x) configure recording, early stopping and logging through
    # callbacks instead of the older evals_result, early_stopping_rounds and verbose_eval arguments.
    callbacks = [lgb.record_evaluation(evaluation_result)]
    if early_stopping_rounds > 0:
        # Stop if the validation score has not improved for this many rounds.
        callbacks.append(lgb.early_stopping(early_stopping_rounds, verbose=False))

    gbm = lgb.train(params,  # parameter dict to use
                    training_set,
                    init_model=init_gbm,  # existing model to continue training from, if any
                    num_boost_round=boost_rounds,  # number of boosting rounds (iterations)
                    valid_sets=[validation_set],
                    callbacks=callbacks)

    # Score the validation set and compute the ROC curve.
    y_true = validation_set.label
    y_pred = gbm.predict(validation_set.data)
    fpr, tpr, threshold = roc_curve(y_true, y_pred)
    roc_auc = auc(fpr, tpr)

    plt.figure()
    plt.title("ROC Curve. Area under Curve: {:.3f}".format(roc_auc))
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    _ = plt.plot(fpr, tpr, 'r')

    return gbm, evaluation_result

model, evaluation = train_gbm(params_core, train_lgb, val_lgb)

Output

[Figure: ROC curve of the first model on the validation set]
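
The area under the ROC curve computed by `train_gbm` can also be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. Below is a minimal pure-Python sketch of that pairwise definition. The helper `pairwise_auc` is hypothetical and quadratic in the number of rows, so it is for illustration only; the tutorial itself relies on `roc_curve` and `auc` from scikit-learn:

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc_value = pairwise_auc(y_true, y_score)
print(auc_value)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, which gives 0.75.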

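Our first model does not perform well. Many parameters can be tuned to improve performance, speed up training, or reduce overfitting.

Parameters

The main model parameters are described below; several of them are then adjusted to improve the model's accuracy.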

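max_depth: Sets the maximum depth of a tree. This parameter controls overfitting; when the model shows signs of overfitting, lowering max_depth is a good first step.
min_data_in_leaf: The minimum number of records a leaf may contain. The default is 20, which is usually a reasonable value. It also helps prevent overfitting.
feature_fraction: The fraction of features used for each tree. For example, a feature_fraction of 0.8 means LightGBM randomly selects 80% of the features before building each tree. It is not limited to the random forest boosting type.
bagging_fraction: The fraction of the data used in each iteration. Because only a random subset of the rows is used, it can speed up training and reduce overfitting. It takes effect only when bagging_freq is greater than 0.
early_stopping_round: Stops training when the validation metric has not improved within the given number of rounds, which avoids unnecessary iterations and speeds up training.
lambda: The regularization strength, set through lambda_l1 and lambda_l2. Typical values lie between 0 and 1; it is used to prevent overfitting.
min_gain_to_split: The minimum gain required to make a split. It controls the number of useful splits in a tree.
max_cat_group: When there are many categories, splitting on them directly tends to overfit. LightGBM merges the categories into max_cat_group groups (64 by default) and looks for split points on the group boundaries. This parameter appears in older LightGBM releases and may not be available in current ones.
max_bin: The maximum number of bins that feature values are bucketed into.
categorical_feature: The indices of the categorical features. For example, categorical_feature=0,1,2 means columns 0, 1, and 2 are treated as categorical variables.
ignore_column: Similar to categorical_feature, but instead of treating the listed columns as categorical it ignores them completely.
save_binary: Set this to true if you are concerned about the size of the data file or its loading time. The dataset is saved as a binary file, which loads much faster the next time it is used.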
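
To make the early-stopping rule from the list above concrete, here is a minimal sketch in plain Python. It assumes a higher-is-better metric such as AUC and a precomputed list of validation scores. The function name `early_stop_iteration` is hypothetical, and this is not LightGBM's internal implementation:

```python
def early_stop_iteration(scores, patience):
    """Return (best_iteration, stop_iteration), both 1-based.

    Training stops once `patience` rounds have passed without beating
    the best score seen so far.
    """
    best_score = float("-inf")
    best_iter = 0
    for i, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            return best_iter, i
    # The scores ran out before the patience was exhausted.
    return best_iter, len(scores)


val_auc = [0.90, 0.93, 0.95, 0.94, 0.95, 0.949, 0.96]
best_iter, stop_iter = early_stop_iteration(val_auc, patience=3)
print(best_iter, stop_iter)  # 3 6
```

In this toy run training stops at round 6 and never reaches the better score at round 7, which shows the trade-off behind the choice of patience: a small value saves time but can stop too early.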

 
params_advanced = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',

    'learning_rate': 0.01,
    'num_leaves': 41,  # More leaves can improve accuracy but may cause overfitting.
    # With max_depth 5 a tree has at most 2**5 = 32 leaves, so values above 32 add nothing here.

    'max_depth': 5,  # Maximum tree depth; shallower trees reduce overfitting.
    'min_split_gain': 0,  # Minimum loss reduction (gain) required to make a split (alias of min_gain_to_split).
    'min_child_samples': 21,  # Minimum number of samples per leaf (alias of min_data_in_leaf).
    'min_child_weight': 5,  # Minimum sum of Hessians in one leaf; helps prevent overfitting.

    'lambda_l1': 0.5,  # L1 regularization
    'lambda_l2': 0.5,  # L2 regularization

    'feature_fraction': 0.5,  # Fraction of features randomly selected before building each tree;
    # reduces overfitting and speeds up training.
    'bagging_fraction': 0.5,  # Fraction of rows sampled for bagging; only takes effect when bagging_freq > 0.
    'bagging_freq': 0,  # Perform bagging every k-th iteration; 0 disables bagging.

    'scale_pos_weight': 99,  # Give positive-class examples more weight to compensate for the imbalance.

    'subsample_for_bin': 200000,  # Number of samples used to construct the histogram bins.

    'max_bin': 1000,  # Maximum number of bins that feature values are bucketed into.
    # Fewer bins save memory and time; more bins can improve accuracy.

    'nthread': 4,  # Number of threads; ideally set to the number of physical cores.

}

# Run the next two lines as a separate Jupyter cell: %%time is a cell magic and must be the first line of its cell.

%%time
model, evaluation = train_gbm(params_advanced, train_lgb, val_lgb, boost_rounds=500)

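Output

Tuning the parameters improved the model's performance significantly. The settings above favor accuracy; training could be made faster by lowering max_bin and setting bagging_freq to a positive value.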
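
As a sketch of the speed-oriented variant just described, the snippet below derives a second parameter dictionary. The stand-in `params_accurate` repeats only the relevant keys of the tuned settings, and the specific values (`max_bin` of 255, `bagging_freq` of 5) are illustrative assumptions rather than tuned settings:

```python
# Stand-in for the relevant part of the tuned parameter dictionary.
params_accurate = {
    'max_bin': 1000,
    'bagging_fraction': 0.5,
    'bagging_freq': 0,  # 0 disables bagging, so bagging_fraction is ignored
}

# Fewer histogram bins, and a fresh 50% row sample drawn every 5 iterations,
# trade some accuracy for faster training.
params_fast = {**params_accurate, 'max_bin': 255, 'bagging_freq': 5}

print(params_fast)
```

Because `max_bin` affects how a Dataset is binned, the Dataset objects may need to be rebuilt before training with the new value.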

 
model, evaluation = train_gbm(params_advanced, train_lgb, val_lgb, init_gbm=model, boost_rounds=500)   

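Output

An existing model can be passed to the training function (here through init_gbm, which is forwarded to lgb.train as init_model) to continue training it.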

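Most decision tree learning algorithms grow trees level-wise (depth-wise): all nodes at the current depth are split before the algorithm moves on to the next level, so each level of the tree is completed before the tree gets any deeper.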

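LightGBM instead grows trees leaf-wise (best-first): at each step it splits the leaf that yields the largest reduction in loss. If we grow the full tree, best-first (leaf-wise) and level-wise growth produce the same tree; the difference lies in the order in which the tree is expanded. Because we usually do not grow trees to their full depth, the order matters: early stopping criteria and pruning can lead to very different trees. Since leaf-wise splits are chosen by their contribution to the global loss rather than the loss along a particular branch, leaf-wise growth often (but not always) learns a low-error tree faster than level-wise growth. With a small number of nodes, leaf-wise growth may therefore outperform level-wise growth. As more nodes are added without stopping or pruning, the two approaches can be expected to converge to the same performance, because they end up building essentially the same tree.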
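
The effect of expansion order under a limited split budget can be illustrated with a toy example. The sketch below is not LightGBM code: the tree, its node names, and the gain values in `SPLITS` are made up, and each gain stands for the loss reduction obtained by splitting that node.

```python
import heapq
from collections import deque

# Hypothetical toy tree: node -> (gain from splitting it, children created by the split).
# Children of the depth-2 nodes are not modeled, so those lists are empty.
SPLITS = {
    "root": (10.0, ["A", "B"]),
    "A": (1.0, ["A1", "A2"]),
    "B": (6.0, ["B1", "B2"]),
    "A1": (0.5, []),
    "A2": (0.25, []),
    "B1": (4.0, []),
    "B2": (0.125, []),
}


def grow_level_wise(max_splits):
    """Split nodes in breadth-first order until the split budget is used up."""
    queue = deque(["root"])
    order, total_gain = [], 0.0
    while queue and len(order) < max_splits:
        node = queue.popleft()
        gain, children = SPLITS[node]
        order.append(node)
        total_gain += gain
        queue.extend(child for child in children if child in SPLITS)
    return order, total_gain


def grow_leaf_wise(max_splits):
    """Always split the open leaf with the largest gain (best-first)."""
    heap = [(-SPLITS["root"][0], "root")]  # max-heap via negated gains
    order, total_gain = [], 0.0
    while heap and len(order) < max_splits:
        neg_gain, node = heapq.heappop(heap)
        order.append(node)
        total_gain += -neg_gain
        for child in SPLITS[node][1]:
            if child in SPLITS:
                heapq.heappush(heap, (-SPLITS[child][0], child))
    return order, total_gain


print(grow_level_wise(3))  # (['root', 'A', 'B'], 17.0)
print(grow_leaf_wise(3))   # (['root', 'B', 'B1'], 20.0)
print(grow_level_wise(7)[1], grow_leaf_wise(7)[1])  # 21.875 21.875
```

With a budget of three splits the leaf-wise order collects more total gain, while with enough budget to split every node both orders reach the same total, matching the argument above.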
