Extra Trees Classifier

June 19, 2025 | 10 min read

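Extra Trees is an ensemble machine learning model that trains many decision trees and combines their predictions. It is closely related to the widely used random forest algorithm, often matches or exceeds its accuracy, and builds its trees with a simpler procedure. It is also easy to use: it has only a few key hyperparameters, and there are sensible heuristics for setting them.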

Extra Trees Algorithm

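Extra Trees is short for Extremely Randomized Trees, an ensemble method in machine learning.

It is an ensemble of decision trees, related to other tree-based ensemble techniques such as bootstrap aggregation (bagging) and random forest.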

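During training, the algorithm builds a large number of unpruned decision trees. For prediction:

Regression: the ensemble output is the average of the outputs of all trees.
Classification: the trees vote, and the class with the most votes is the output.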

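In other words, the Extra Trees algorithm aggregates the results of its trees into a single output, using majority voting for classification and the arithmetic mean for regression.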
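The two aggregation rules can be sketched with the standard library alone. The per-tree values below are made up for illustration, and `majority_vote` and `average` are hypothetical helper names, not part of any library. Note that scikit-learn's classifier averages the class probabilities predicted by the trees rather than counting hard votes, so this is a simplified picture.

```python
from collections import Counter
from statistics import mean

def majority_vote(votes):
    """Return the most common class label among the per-tree votes."""
    return Counter(votes).most_common(1)[0][0]

def average(outputs):
    """Return the arithmetic mean of the per-tree numeric outputs."""
    return mean(outputs)

# Hypothetical predictions from five trees for a single sample
tree_votes = [1, 0, 1, 1, 0]                     # classification: class labels
tree_outputs = [52.0, 55.5, 53.0, 54.5, 55.0]    # regression: numeric values

predicted_class = majority_vote(tree_votes)
predicted_value = average(tree_outputs)
print(predicted_class, predicted_value)
```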

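Extra Trees differs from bagging and random forest in two ways. Bagging and random forest build each decision tree from a bootstrap sample, whereas Extra Trees trains each tree on the whole training dataset. Random forest uses a greedy search to find the best split point for each candidate feature, whereas Extra Trees draws split points at random, which makes the individual trees more random and less correlated with one another.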
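The difference in sampling is visible in the scikit-learn defaults. The short check below, which assumes scikit-learn is installed, prints the default value of the `bootstrap` parameter for both ensembles.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Compare the default resampling behaviour of the two ensembles
et = ExtraTreesClassifier()
rf = RandomForestClassifier()

print('ExtraTrees bootstrap:', et.bootstrap)
print('RandomForest bootstrap:', rf.bootstrap)
```

Setting `bootstrap=True` on the Extra Trees classes is possible, but it moves the model away from the procedure described here.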

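The Extra Trees algorithm grows unpruned decision or regression trees using the classical top-down procedure. Its two main differences from the other two methods are that it splits nodes using randomly chosen cut points, keeping the best of the random candidates, and that it trains on the full sample rather than on bootstrap samples.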
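A simplified sketch of the node-splitting step may help. It assumes Gini impurity as the scoring criterion and uses hypothetical helper names (`gini`, `pick_random_split`) and a made-up dataset; it is an illustration of the idea, not the scikit-learn implementation.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def pick_random_split(X, y, k, rng):
    """Draw one random threshold for each of k randomly chosen features,
    then keep the candidate with the lowest weighted impurity."""
    n_features = len(X[0])
    best = None  # (weighted impurity, feature index, threshold)
    for f in rng.sample(range(n_features), k):
        values = [row[f] for row in X]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue  # constant feature, cannot split on it
        threshold = rng.uniform(lo, hi)  # random cut point, no search
        left = [label for row, label in zip(X, y) if row[f] < threshold]
        right = [label for row, label in zip(X, y) if row[f] >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, f, threshold)
    return best

# Tiny made-up dataset: 6 samples, 3 features (the last one is constant)
X = [[0.1, 5.0, 1.0], [0.2, 4.0, 1.0], [0.3, 6.0, 1.0],
     [0.8, 5.5, 1.0], [0.9, 4.5, 1.0], [1.0, 6.5, 1.0]]
y = [0, 0, 0, 1, 1, 1]

rng = random.Random(0)
score, feature, threshold = pick_random_split(X, y, k=2, rng=rng)
print(feature, round(threshold, 3), round(score, 3))
```

A random forest would instead search every candidate threshold for each selected feature, which is more expensive and tends to produce more similar trees.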

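Three main hyperparameters control the algorithm:

Number of trees (M): determines the strength of the variance reduction obtained by aggregating the ensemble.
Number of randomly selected features (K): determines the strength of the attribute selection process.
Minimum number of samples required to split a node (nmin): determines the strength of averaging output noise.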
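In scikit-learn these three quantities correspond, as far as the parameter descriptions suggest, to `n_estimators`, `max_features`, and `min_samples_split`. The values below are illustrative only.

```python
from sklearn.ensemble import ExtraTreesClassifier

# M    -> n_estimators       (number of trees in the ensemble)
# K    -> max_features       (features considered at each split)
# nmin -> min_samples_split  (samples required to split a node)
model = ExtraTreesClassifier(
    n_estimators=100,
    max_features='sqrt',
    min_samples_split=2,
)
params = model.get_params()
print(params['n_estimators'], params['max_features'], params['min_samples_split'])
```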

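The idea behind random split selection is that it adds more randomness to each individual tree, and this extra variance is offset by averaging over the large number of trees in the ensemble.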
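The averaging effect can be illustrated with a small simulation. It assumes each "tree" returns the true value plus independent Gaussian noise, which is a strong simplification: real trees are correlated, so the reduction in practice is smaller. The function name `ensemble_estimate` and all numbers are hypothetical.

```python
import random
from statistics import mean, pstdev

def ensemble_estimate(n_trees, rng, true_value=1.0, noise=0.5):
    """Average n_trees noisy, independent estimates of true_value."""
    return mean(rng.gauss(true_value, noise) for _ in range(n_trees))

rng = random.Random(42)
spread = {}
for n_trees in (1, 10, 100):
    # Repeat the experiment to measure how much the ensemble output varies
    estimates = [ensemble_estimate(n_trees, rng) for _ in range(300)]
    spread[n_trees] = pstdev(estimates)
    print(n_trees, round(spread[n_trees], 3))
```

The spread of the averaged output shrinks as more estimates are combined, which is why noisier individual trees can still yield a stable ensemble.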

Together, these parameters trade off the strength of attribute selection, the smoothing of noise, and the control of model variance.

Extra Trees API in Scikit-Learn

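If you are new to Extra Trees ensembles, implementing the algorithm from scratch can be difficult. Fortunately, scikit-learn provides a simple API for both classification and regression problems.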

Check the Scikit-Learn Version

Run the following code to confirm that you have a recent version of scikit-learn:

Code

# Check scikit-learn version
import sklearn
print(sklearn.__version__)

If the version is lower than 0.22.1, update the library.
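If you prefer to automate the comparison, a minimal sketch using only the standard library follows. The helpers `version_tuple` and `is_recent_enough` are hypothetical names introduced here, and the parsing is deliberately simple: it keeps the leading numeric parts and ignores suffixes such as `.dev0`.

```python
import re

def version_tuple(version):
    """Turn a version string such as '1.4.2' or '0.22.1' into a tuple
    of integers, ignoring any non-numeric suffix (e.g. '.dev0', 'rc1')."""
    parts = []
    for piece in version.split('.')[:3]:
        match = re.match(r'\d+', piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

MINIMUM = '0.22.1'  # threshold taken from the text above

def is_recent_enough(installed, minimum=MINIMUM):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)

print(is_recent_enough('0.21.3'))
print(is_recent_enough('1.4.2'))
```

In practice you would pass `sklearn.__version__` as the `installed` argument.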

Extra Trees in Scikit-Learn

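The Extra Trees ensemble is implemented in the ExtraTreesClassifier and ExtraTreesRegressor classes, which work in the same way and share most of their parameters. Because the algorithm is randomized, repeated runs can produce different results. It is therefore good practice to evaluate the model by averaging over several runs or by using repeated cross-validation.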
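When repeatable results are needed, for example while debugging, the `random_state` parameter can be fixed. The sketch below uses a small synthetic dataset chosen only for this illustration and checks that two models built with the same seed agree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Small synthetic dataset, used only for this illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=4)

# Two models with the same seed should build identical ensembles
model_a = ExtraTreesClassifier(n_estimators=20, random_state=1).fit(X, y)
model_b = ExtraTreesClassifier(n_estimators=20, random_state=1).fit(X, y)

same = (model_a.predict_proba(X) == model_b.predict_proba(X)).all()
print('Identical predictions:', same)
```

Fixing the seed makes a single run repeatable, but it does not replace repeated evaluation when estimating how well the model performs.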

1. Create the dataset

Code

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=4)
print(X.shape, y.shape)  

Output

(1000, 20) (1000,)

2. Evaluate the model

Code

from numpy import mean, std
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import ExtraTreesClassifier

# Define the model
model = ExtraTreesClassifier()

# Define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Evaluate the model
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

# Report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Output

Accuracy: 0.906 (0.029)

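Classification with Extra Trees

Once an Extra Trees model has been chosen, it can serve as a final model for predicting on new data. This involves first fitting the model on all available data and then calling the predict() function on new rows.

1. Define the dataset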

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Create synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=4)

2. Fit the model

Code

# Initialize the model
model = ExtraTreesClassifier()

# Train the model on the entire dataset
model.fit(X, y)

3. Make a prediction

# New data for prediction
row = [[-3.52169364, 4.00560592, 2.94756812, -0.09755101, -0.98835896, 
         1.81021933, -0.32657994, 1.08451928, 4.98150546, -2.53855736, 
         3.43500614, 1.64660497, -4.1557091, -1.55301045, -0.30690987, 
         -1.47665577, 6.818756, 0.5132918, 4.3598337, -4.31785495]]

# Predict class for the row
yhat = model.predict(row)

# Output predicted class
print('Predicted Class: %d' % yhat[0])

This example fits the Extra Trees model on the full dataset and then classifies a new row of input data.

Output

Predicted Class: 0
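Besides the hard class label, it can be useful to look at the class probabilities behind it. The sketch below uses `predict_proba()` on the first training row; fixing `random_state` is an addition made here only so the sketch is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=4)

# random_state is fixed here only to make the sketch repeatable
model = ExtraTreesClassifier(random_state=1)
model.fit(X, y)

# Class-membership probabilities for the first training row
proba = model.predict_proba(X[:1])
print(model.classes_, proba[0])
```

The predicted class is the one with the highest averaged probability across the trees.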

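This shows how a fitted model is applied in practice: new data is classified according to the patterns the model learned during training. Next, we demonstrate how to use the Extra Trees API for regression tasks.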

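Extra Trees for Regression

As with classification, the Extra Trees algorithm can also be applied to regression problems. The following shows how to use it on a synthetic regression problem.

Step 1: Create the regression dataset

Here, the make_regression() function creates a synthetic regression dataset with 1,000 samples and 20 input features.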

# Import necessary library
from sklearn.datasets import make_regression

# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=3)

# Print dataset shape
print(X.shape, y.shape)

Output

(1000, 20) (1000,)

Step 2: Evaluate Extra Trees for regression

We can evaluate the Extra Trees algorithm on this dataset using repeated k-fold cross-validation.

from numpy import mean, std
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Define the model
model = ExtraTreesRegressor()

# Define the evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# Evaluate the model using cross-validation
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)

# Report performance
print('Mean Absolute Error: %.3f (%.3f)' % (mean(scores), std(scores)))

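Explanation

Cross-validation: repeated k-fold (10 folds, 3 repeats) gives a more reliable estimate of model performance than a single split.

Scoring: the model is scored with negative mean absolute error (neg_mean_absolute_error). scikit-learn negates the MAE so that larger scores are always better, which is why the reported value is negative; values closer to zero indicate a better fit.

This setup estimates the performance of the Extra Trees regressor on the given dataset and how much it varies across folds. The algorithm can then be fit as a final model and used to make predictions.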

Output

Mean Absolute Error: -0.245 (0.018)
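The negative sign in the reported score follows from the scoring convention rather than from the model. A minimal sketch of the relationship, using made-up targets and predictions and a hypothetical `mean_absolute_error` helper written with the standard library:

```python
from statistics import mean

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

# Made-up targets and predictions, for illustration only
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
neg_mae = -mae  # the form a 'neg_mean_absolute_error' scorer reports
print(mae, neg_mae, abs(neg_mae))
```

To report a conventional, positive MAE from cross-validation scores, take the absolute value of the mean score.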

Extra Trees for Regression Predictions

Extra Trees can also be used as a final model for regression. It works as follows:

Code

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=3)

# Define and fit the model
model = ExtraTreesRegressor()
model.fit(X, y)

# Make a single prediction
row = [[-0.56996683, 0.80144889, 2.77523539, 1.32554027, -1.44494378, -0.80834175, -0.84142896, 0.57710245, 
         0.96235932, -0.66303907, -1.13994112, 0.49887995, 1.40752035, -0.2995842, -0.05708706, -2.08701456, 
         1.17768469, 0.13474234, 0.09518152, -0.07603207]]
yhat = model.predict(row)

print('Prediction: %.3f' % yhat[0])

Output

Prediction: 53.916

This shows how a fitted model is used to predict the target value for a new observation.

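Extra Trees Hyperparameters

Effect of the Number of Trees

The n_estimators parameter sets the number of trees in the ensemble. Performance generally improves as this value increases, up to a point, after which it levels off.

Code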

from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import ExtraTreesClassifier
from matplotlib import pyplot as plt

# Generate dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=4)

# Define models with varying tree counts
tree_counts = [10, 50, 100, 500, 1000, 5000]
models = {str(n): ExtraTreesClassifier(n_estimators=n) for n in tree_counts}

# Evaluate models
results, names = [], []
for name, model in models.items():
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))

# Plot results
plt.boxplot(results, labels=names, showmeans=True)
plt.title("Effect of Number of Trees on Extra Trees Performance")
plt.xlabel("Number of Trees")
plt.ylabel("Accuracy")
plt.show()

Output

>10 0.845 (0.043)
>50 0.899 (0.025)
>100 0.906 (0.026)
>500 0.913 (0.024)
>1000 0.911 (0.026)
>5000 0.912 (0.026)

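Key insights

Accuracy improves as the number of trees increases, although the gains become very small beyond about 100 trees.
With very large numbers of trees (500 to 5,000), the mean accuracies differ by far less than one standard deviation, so there is no meaningful difference between them.
This illustrates why n_estimators is worth tuning: past a certain point, more trees add training cost without improving accuracy.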

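Effect of the Number of Features

As with random forest, arguably the most important hyperparameter to tune for Extra Trees is the number of features randomly selected at each split.

Like random forest, the algorithm is often not very sensitive to the exact value used, but it is still worth tuning.

It is set with the max_features argument. For ExtraTreesClassifier, the default is the square root of the number of input features, √V, where V is the total number of features. For example, with the 20 features in our dataset, the default works out to 4 (the square root of 20 is about 4.47, rounded down).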
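The square-root rule is easy to compute by hand. The helper below, `sqrt_heuristic`, is a hypothetical name used for illustration; it rounds down and never returns less than one feature.

```python
import math

def sqrt_heuristic(n_features):
    """Number of features per split under the square-root rule,
    rounded down and never less than one."""
    return max(1, int(math.sqrt(n_features)))

for n in (1, 20, 100):
    print(n, sqrt_heuristic(n))
```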

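The example below varies the number of randomly selected features from 1 to 20 to see the effect on model accuracy. The square-root heuristic suggests a value near 4; in the results reported below, accuracy is lowest with a single feature and otherwise changes little across the range, with differences smaller than one standard deviation.

Code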

from numpy import mean, std  
from sklearn.datasets import make_classification  
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold  
from sklearn.ensemble import ExtraTreesClassifier  
from matplotlib import pyplot  

# Dataset creation  
def get_dataset():  
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=4)  
    return X, y  

# Model configuration  
def get_models():  
    models = dict()  
    for i in range(1, 21):  
        models[str(i)] = ExtraTreesClassifier(max_features=i)  
    return models  

# Model evaluation  
def evaluate_model(model, X, y):  
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)  
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)  
    return scores  

# Main execution  
X, y = get_dataset()  
models = get_models()  
results, names = list(), list()  

for name, model in models.items():  
    scores = evaluate_model(model, X, y)  
    results.append(scores)  
    names.append(name)  
    print(f'>{name} {mean(scores):.3f} ({std(scores):.3f})')  

pyplot.boxplot(results, labels=names, showmeans=True)  
pyplot.show()  

Output

>1 0.895 (0.029)
>2 0.903 (0.028)
>3 0.903 (0.021)
>4 0.907 (0.026)
>5 0.905 (0.027)
>6 0.909 (0.025)
>7 0.908 (0.025)
>8 0.912 (0.021)
>9 0.909 (0.027)
>10 0.908 (0.028)
>11 0.911 (0.025)
>12 0.910 (0.031)
>13 0.908 (0.025)
>14 0.913 (0.026)
>15 0.908 (0.022)
>16 0.910 (0.027)
>17 0.907 (0.026)
>18 0.907 (0.026)
>19 0.906 (0.022)
>20 0.908 (0.024)

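Effect of the Minimum Samples per Split

Another hyperparameter of the underlying decision trees is the minimum number of samples a node must contain before it can be split further.

It is set with the min_samples_split argument (default: 2) and controls how deep and detailed each tree becomes. Smaller values produce more splits and deeper, more specialized trees, which can lower the correlation between the predictions of individual trees and may improve ensemble performance.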
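The effect on tree size can be checked directly. The sketch below fits two small ensembles and compares the average number of leaves per tree; the ensemble size of 10, the fixed `random_state`, and the helper name `average_leaves` are choices made here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=4)

def average_leaves(min_samples_split):
    """Fit a small ensemble and return the mean number of leaves per tree."""
    model = ExtraTreesClassifier(n_estimators=10,
                                 min_samples_split=min_samples_split,
                                 random_state=1)
    model.fit(X, y)
    return sum(t.get_n_leaves() for t in model.estimators_) / len(model.estimators_)

leaves_small = average_leaves(2)
leaves_large = average_leaves(14)
print(leaves_small, leaves_large)
```

Trees grown with the smaller threshold keep splitting until their leaves are pure, so they end up with many more leaves than trees that stop at 14 samples.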

The following example evaluates the Extra Trees algorithm as min_samples_split varies from 2 to 14.

Code

from numpy import mean, std  
from sklearn.datasets import make_classification  
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold  
from sklearn.ensemble import ExtraTreesClassifier  
from matplotlib import pyplot  

# Dataset creation  
def get_dataset():  
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=4)  
    return X, y  

# Model configuration  
def get_models():  
    models = dict()  
    for i in range(2, 15):  
        models[str(i)] = ExtraTreesClassifier(min_samples_split=i)  
    return models  

# Model evaluation  
def evaluate_model(model, X, y):  
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)  
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)  
    return scores  

# Main execution  
X, y = get_dataset()  
models = get_models()  
results, names = list(), list()  

for name, model in models.items():  
    scores = evaluate_model(model, X, y)  
    results.append(scores)  
    names.append(name)  
    print(f'>{name} {mean(scores):.3f} ({std(scores):.3f})')  

pyplot.boxplot(results, labels=names, showmeans=True)  
pyplot.show()  

Output

>2 0.912 (0.029)
>3 0.905 (0.024)
>4 0.908 (0.030)
>5 0.904 (0.028)
>6 0.906 (0.027)
>7 0.902 (0.029)
>8 0.897 (0.027)
>9 0.898 (0.029)
>10 0.896 (0.029)
>11 0.891 (0.032)
>12 0.889 (0.026)
>13 0.888 (0.031)
>14 0.891 (0.024)