Decision Trees in Python Sklearn

March 17, 2025 | 3 min read

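A decision tree is a machine learning algorithm that lets us represent decisions and their possible consequences, including outcomes, input costs, and utility.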

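Decision tree algorithms belong to the family of supervised learning methods. They can handle both categorical and continuous output variables.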

Decision Tree Algorithm

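A decision tree resembles a flowchart: each internal node represents a variable (feature) of the dataset, each branch represents a decision rule, and each leaf node represents the outcome of a particular decision. The topmost node of the tree is the root node. The data is split according to the values of the independent features.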

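Recursive partitioning is used to split the data into progressively smaller subsets, and this is how the tree grows. The finished tree looks like a flowchart and supports decision making. It gives a diagrammatic model that closely mirrors how people reason and choose. Because of this flowchart-like structure, decision trees are easy to understand and interpret.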
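To make the flowchart analogy concrete, the short sketch below fits a small tree and prints it as nested decision rules. It assumes the iris dataset bundled with scikit-learn and uses `sklearn.tree.export_text`; the depth limit of 2 and the variable names `clf` and `rules` are arbitrary choices made only for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay short
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text renders the fitted tree as indented if/else style rules:
# each indented line is a branch, each "class:" line is a leaf
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each level of indentation in the printed text corresponds to one step down from the root node, which is one way to read the tree without plotting it.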

Decision Tree Algorithm: How Does It Work?

The basic idea behind every decision tree algorithm is as follows:

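An attribute selection measure (ASM) is used to pick the best feature for splitting the data with respect to the target variable.
That feature becomes the decision node for the branch, and the dataset is split into smaller subsets.
The process is then repeated recursively for each child node to grow the tree, until one of the following conditions is met:
- All tuples at the node share the same value of the target attribute.
- There are no more attributes left.
- There are no more instances left.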
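The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: it assumes Gini impurity as the attribute selection measure and numeric features split by a threshold, and the helper names `gini`, `best_split`, `build_tree`, and `predict_one`, as well as the toy data, are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return the (feature index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(y)
    for j in range(X.shape[1]):
        # Every distinct value except the largest is a candidate threshold
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Recursively partition the data; leaves store the majority class."""
    values, counts = np.unique(y, return_counts=True)
    majority = values[np.argmax(counts)]
    # Stop when the node is pure or the depth limit is reached
    if len(values) == 1 or depth == max_depth:
        return {"leaf": majority}
    split = best_split(X, y)
    if split is None:  # no attribute reduces the impurity any further
        return {"leaf": majority}
    j, t = split
    left = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": build_tree(X[left], y[left], depth + 1, max_depth),
            "right": build_tree(X[~left], y[~left], depth + 1, max_depth)}

def predict_one(node, row):
    """Walk from the root to a leaf and return its class."""
    while "leaf" not in node:
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

# Tiny made-up dataset: the single feature separates the classes between 2 and 3
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
toy_tree = build_tree(X, y)
print(toy_tree)
print(predict_one(toy_tree, np.array([1.5])), predict_one(toy_tree, np.array([3.5])))
```

The three stopping conditions show up as the purity check, the `None` result of `best_split` when no attribute helps, and the fact that a threshold never produces an empty child. The depth limit is an extra safeguard added for the sketch.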

Decision Tree Regression

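Decision tree regression analyzes the attributes of an object and trains a tree-structured model to predict future values as continuous output. The output is continuous because it is not restricted to a predetermined set of discrete values.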

A cricket match prediction model that predicts whether a team will win or lose a match is an example of discrete output.

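A sales forecasting model that predicts how much a company's profit will grow over a financial year, based on the company's preliminary data, is an example of continuous output.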

In cases like this, a decision tree regression algorithm is used to predict continuous values.
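As a rough illustration of continuous output, the sketch below fits scikit-learn's `DecisionTreeRegressor` to synthetic data. The noisy sine curve, the depth limit of 3, and the names `reg`, `X_new`, and `y_new` are assumptions made for this example and are not taken from the article.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic, made-up data: a noisy sine curve
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=80)

# Limiting the depth keeps the tree small (at most 8 leaves here)
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X, y)

# Each prediction is a real number: the mean target value of the matching leaf
X_new = np.array([[0.5], [2.5], [4.5]])
y_new = reg.predict(X_new)
print(y_new)
```

Because every leaf predicts a single mean value, the fitted function is piecewise constant. A deeper tree gives a finer approximation but is more likely to fit the noise.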

Having discussed sklearn decision trees, let us look step by step at how they are implemented.

Code

# Python program to implement decision tree algorithm and plot the tree

# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree

# Loading the dataset
iris = load_iris()

# Converting the data to a pandas dataframe
data = pd.DataFrame(data = iris.data, columns = iris.feature_names)

# Creating a separate column for the target variable of the iris dataset
data['Species'] = iris.target

# Replacing the integer class codes with the actual species names
# (iris.target_names[i] is the name of class i)
target_dict = dict(enumerate(iris.target_names))
data['Species'] = data['Species'].map(target_dict)

# Separating the independent and dependent variables of the dataset
x = data.drop(columns = "Species")
y = data["Species"]
names_features = x.columns
target_labels = y.unique()

# Splitting the dataset into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 93)

# Importing the Decision Tree classifier class from sklearn
from sklearn.tree import DecisionTreeClassifier

# Creating an instance of the classifier class
dtc = DecisionTreeClassifier(max_depth = 3, random_state = 93)

# Fitting the training dataset to the model
dtc.fit(x_train, y_train)

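# Plotting the Decision Tree
# Names are passed as plain lists; dtc.classes_ holds the class labels in the order the model uses
plt.figure(figsize = (30, 10), facecolor = 'b')
Tree = tree.plot_tree(dtc, feature_names = list(names_features), class_names = list(dtc.classes_), rounded = True, filled = True, fontsize = 14)
plt.show()

# Predicting the species for the test dataset
y_pred = dtc.predict(x_test)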

# Finding the confusion matrix (rows: true labels, columns: predicted labels)
# Passing labels fixes the row and column order so it matches the tick labels below
confusion_matrix = metrics.confusion_matrix(y_test, y_pred, labels = list(target_labels))
matrix = pd.DataFrame(confusion_matrix)
sns.set(font_scale = 1.3)

# Creating one figure and one axis for the heatmap
fig, axis = plt.subplots(figsize = (10, 7))

# Plotting heatmap
sns.heatmap(matrix, annot = True, fmt = "g", ax = axis, cmap = "magma")
axis.set_title('Confusion Matrix')
axis.set_xlabel("Predicted Values", fontsize = 10)
axis.set_xticklabels(list(target_labels))
axis.set_ylabel("True Labels", fontsize = 10)
axis.set_yticklabels(list(target_labels), rotation = 0)
plt.show()

Output