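Sklearn Tutorial

August 29, 2024 | 14 min read

What is Sklearn?

Scikit-learn is an open-source Python package for implementing machine learning models in Python. The library supports widely used algorithms such as KNN, random forests, gradient boosting, and SVC (XGBoost is a separate library, although it offers a scikit-learn-compatible interface). It is built on top of NumPy and SciPy. Well-known software companies and Kaggle competitions use Scikit-learn frequently. It helps with many stages of model building, such as model selection, regression, classification, clustering, and dimensionality reduction.

Scikit-learn is easy to use and performs well. However, it does not support GPU acceleration; parallelism is limited to multiple CPU cores, which many estimators expose through the n_jobs parameter. We can implement simple neural networks with sklearn, but this is not a wise choice for deep learning, especially when TensorFlow is an available option.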

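Installing Sklearn on Our System

Sklearn depends on the following libraries.

NumPy
SciPy

pip installs these dependencies automatically if they are missing, so the simplest way to install scikit-learn is to run pip install scikit-learn. If you manage packages by hand, verify that NumPy and SciPy are installed on the computer first.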
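
A quick way to confirm the installation is to import the three packages and print their versions. This is only a minimal sketch; the `versions` dictionary is an illustrative name, not something the libraries provide.

```python
import numpy
import scipy
import sklearn

# Collect the version string reported by each installed package
versions = {
    "numpy": numpy.__version__,
    "scipy": scipy.__version__,
    "scikit-learn": sklearn.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any of the imports fails with ModuleNotFoundError, that package is not installed in the active environment.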

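Importing the Dataset

In this sklearn tutorial, we will use the iris dataset. We do not need to fetch this dataset from an external server because Scikit-learn already ships with it. Before loading the dataset, we import the Scikit-learn and Pandas libraries with the following statements.

Code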

# Importing the required libraries
import sklearn
import pandas as pd

After importing sklearn, we can quickly load the iris dataset from it with the following statements.

Code

# Importing the dataset from the datasets module of sklearn
from sklearn.datasets import load_iris

# Loading the dataset
iris = load_iris()

# Creating the dataframe of the dataset
df = pd.DataFrame(iris.data, columns = iris.feature_names)
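
Before going further, it can help to look at what was loaded. The sketch below repeats the loading step so that it runs on its own, then prints the shape, the first rows, the class names, and the number of samples per class. The names `class_names` and `class_counts` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Number of samples and number of feature columns
print(df.shape)

# First five rows of the feature table
print(df.head())

# Names of the three iris species that the integer targets stand for
class_names = list(iris.target_names)
print(class_names)

# Number of samples per class, ordered by class index
class_counts = pd.Series(iris.target).value_counts().sort_index()
print(class_counts.tolist())
```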

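Splitting the Dataset

We can split the complete dataset into two parts, a training dataset and a testing dataset, to hold back some unseen data for checking the model's accuracy. Once the model is trained, the testing dataset is used to test or validate it. We can then evaluate the performance of the trained model.

This example splits the data in a 70:30 ratio, meaning 70% of the data is used to train the model and 30% is used to test it. The dataset is the same iris dataset as above.

Code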

# Importing the function to perform train test split from model_selection module
from sklearn.model_selection import train_test_split

# Separating the independent features (X) and the dependent target (y).
# The dataframe built above holds only the four feature columns, so both
# arrays are taken directly from the loaded iris object.
X = iris.data
y = iris.target

# Creating a testing dataset of size 0.3 times the whole dataset
X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size = 0.3, random_state = 1
)

# Printing the shape of the training and testing dataset
print(X_train.shape)
print(X_test.shape)

print(y_train.shape)
print(y_test.shape)

Output

(105, 4)
(45, 4)
(105,)
(45,)
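
A plain random split does not guarantee that every class appears in the same proportion in both parts. As a hedged sketch, the variant below passes the stratify argument of train_test_split so that the class proportions are preserved; the names `train_counts` and `test_counts` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions equal in the training and testing parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Count how many samples of each class ended up in each part
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print("Training samples per class:", train_counts)
print("Testing samples per class:", test_counts)
```

Because iris has 50 samples in each of its three classes, a stratified 70:30 split places 35 samples of each class in the training part and 15 in the testing part.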

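Training the Model

Next, we can use our dataset to train a predictive model. Scikit-learn offers a wide range of machine learning algorithms behind a consistent interface for fitting models, making predictions, and computing metrics such as accuracy and recall.

In this example, we will use the KNN (K-nearest neighbors) classifier. KNN assigns each new sample the class that is most common among its k closest training samples; it is a supervised classifier, not a clustering method. The code below shows how to implement this algorithm.

Code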

# Importing the required modules
import sklearn
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Loading the dataset
iris = load_iris()

# Creating the dataframe of the dataset
df = pd.DataFrame(iris.data, columns = iris.feature_names)
df['Targets'] = iris.target

# Separating the dependent and independent features
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# Importing the class to perform train test split from model_selection module
from sklearn.model_selection import train_test_split

# Creating a testing dataset of size 0.3 times the whole dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)


classifier_knn = KNeighborsClassifier(n_neighbors = 4)
classifier_knn.fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)
# Finding accuracy by comparing actual response values(y_test) with predicted response value(y_pred)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
# Providing sample data and the model will make predictions out of that data

sample = [[5, 5, 3, 2], [1, 10, 3, 5]]
preds = classifier_knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds] 
print("Predictions:", pred_species)

Output

Accuracy: 0.9777777777777777
Predictions: ['versicolor', 'setosa']
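
The value n_neighbors = 4 above is one choice among many. One common way to pick it, sketched here under the assumption that 5-fold cross-validation on the whole iris dataset is an acceptable yardstick, is to compare the mean accuracy for several values of k. The names `mean_scores` and `best_k` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each candidate k
mean_scores = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    mean_scores[k] = cross_val_score(knn, X, y, cv=5).mean()

# The k with the highest mean accuracy (the smallest such k on ties)
best_k = max(mean_scores, key=mean_scores.get)

for k, score in mean_scores.items():
    print(f"k = {k}: mean accuracy = {score:.4f}")
print("Best k:", best_k)
```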

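Linear Modeling

These are the linear models Sklearn provides for regression analysis. Note that logistic regression, despite its name, is a classifier.

No.	Model and description
1	Linear Regression: One of the best-known statistical models; it studies the relationship between a dependent variable (Y) and a given set of independent variables (X).
2	Logistic Regression: Contrary to what its name suggests, logistic regression is a classification algorithm. It uses a set of independent variables to estimate discrete values (0 or 1, yes/no, true/false).
3	Ridge Regression: Ridge regression, also called Tikhonov regularization, performs L2 regularization. It adds a penalty (shrinkage amount) equal to the sum of the squared coefficients to the loss function.
4	Bayesian Ridge Regression: By formulating linear regression with probability distributions instead of point estimates, Bayesian regression offers a natural way to cope with insufficient or poorly distributed data.
5	LASSO: LASSO performs L1 regularization. It adds a penalty (shrinkage amount) equal to the sum of the absolute values of the coefficients to the loss function.
6	Multi-task LASSO: It fits multiple regression problems jointly while forcing the selected features to be the same for every regression problem (also called a task). Sklearn provides a linear model named MultiTaskLasso, which estimates sparse coefficients for multiple regression problems at once. It is trained with a mixed L1/L2 norm for regularization.
7	Elastic-Net: The Elastic-Net regularized regression method linearly combines the L1 and L2 penalties of the Lasso and Ridge regression methods. It is useful when there are multiple correlated features.
8	Multi-task Elastic-Net: An Elastic-Net model that fits multiple regression problems jointly, forcing the selected features to be the same for all regression problems (also called tasks).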
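
The practical difference between the L2 penalty of Ridge and the L1 penalty of LASSO shows up in the fitted coefficients. The sketch below fits both on sklearn's diabetes dataset; the value alpha = 1.0 and the names `ridge_zeros` and `lasso_zeros` are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

# Same regularization strength for both models (an arbitrary example value)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# L2 shrinks coefficients toward zero; L1 can set some of them exactly to zero
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))

print("Ridge coefficients:", np.round(ridge.coef_, 1))
print("Lasso coefficients:", np.round(lasso.coef_, 1))
print("Coefficients equal to zero (Ridge, Lasso):", ridge_zeros, lasso_zeros)
```

Features whose LASSO coefficient is exactly zero are effectively dropped from the model, which is why LASSO is often used for feature selection.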

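Clustering Methods

Clustering is one of the most useful unsupervised machine learning techniques. It finds similar patterns and relationships among the samples in a dataset and then groups the samples by shared characteristics. Clustering matters because it reveals the intrinsic grouping in unlabeled data.

sklearn.cluster is the part of the Scikit-learn package used to cluster unlabeled data. Scikit-learn provides the following clustering techniques in this module:

KMeans

This algorithm computes centroids and iterates until it finds the best ones. The number of clusters must be supplied up front, so it assumes that number is already known. The basic idea is to cluster the data by separating the samples into n groups of equal variance while minimizing the inertia criterion (the within-cluster sum of squares). K-Means clustering is performed with the sklearn.cluster.KMeans class. Its fit method accepts a sample_weight parameter, which gives some samples more weight when the cluster centers and the inertia value are computed.

Code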

# Python program to perform K-Means clustering using sklearn

# Importing the required libraries
from sklearn.cluster import KMeans
import numpy as np
from sklearn.datasets import load_diabetes

# Loading the dataset
X, Y = load_diabetes(return_X_y = True)

# Performing K-Means clustering on the first 50 samples.
# No random_state is set, so the labels can differ from run to run.
cluster =  KMeans(n_clusters = 10)
cluster.fit(X[:50, :])

# labels_ holds the cluster index assigned to each sample
print("Cluster labels: ", cluster.labels_)

Output

Cluster labels:  [6 0 6 2 0 8 8 5 6 2 8 6 0 6 0 5 3 5 2 2 8 8 2 7 2 6 8 2 4 3 2 4 1 4 4 9 3
 2 5 6 5 8 6 9 1 6 2 8 0 1]
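
The inertia value and the sample_weight argument of fit can both be inspected directly. The following is a minimal sketch: the candidate cluster counts, the weights (three times the weight for the first ten samples), and the names `inertias` and `weighted` are arbitrary illustrations. random_state is fixed so that repeated runs give the same result.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_diabetes

X, _ = load_diabetes(return_X_y=True)
X = X[:50, :]

# Inertia (within-cluster sum of squares) for several cluster counts
inertias = {}
for k in (2, 4, 6, 8, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    print(f"n_clusters = {k}: inertia = {model.inertia_:.4f}")

# Give the first ten samples three times the weight of the others
weights = np.ones(X.shape[0])
weights[:10] = 3.0

weighted = KMeans(n_clusters=4, n_init=10, random_state=0)
weighted.fit(X, sample_weight=weights)

print("Weighted cluster centers shape:", weighted.cluster_centers_.shape)
print("Weighted inertia:", weighted.inertia_)
```

Plotting inertia against the number of clusters and looking for the point where the curve flattens is a common heuristic for choosing n_clusters.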

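Spectral Clustering

Before clustering, this method uses the eigenvalues (the spectrum) of the data's similarity matrix to reduce the data to fewer dimensions. It is not recommended when there are many clusters.

Code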

# Python program to perform spectral clustering using sklearn

# Importing the required libraries
from sklearn.cluster import SpectralClustering
import numpy as np
from sklearn.datasets import load_diabetes

# Loading the dataset
X, Y = load_diabetes(return_X_y = True)

# Performing Spectral clustering
cluster =  SpectralClustering(n_clusters = 10)
cluster.fit(X[:50, :])

# labels_ holds the cluster index assigned to each sample
print("Cluster labels: ", cluster.labels_)

Output

Cluster labels:  [0 2 0 8 4 3 6 4 9 1 3 0 4 6 2 8 5 4 7 1 7 6 9 5 2 8 3 9 1 3 9 5 0 5 4 5 1
 5 8 1 7 3 6 5 0 6 1 3 6 8]

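Hierarchical Clustering

This algorithm builds nested clusters by successively merging or splitting clusters. The resulting cluster hierarchy is represented as a tree, usually called a dendrogram. Hierarchical algorithms fall into two categories:

Agglomerative hierarchical algorithms: Every data point starts as its own cluster. Pairs of clusters are then merged step by step, following a bottom-up approach.

Divisive hierarchical algorithms: All data points start in one big cluster, which is then split into many smaller clusters using a top-down approach.

Code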

# Python program to perform hierarchical clustering using sklearn

# Importing the required libraries
from sklearn.cluster import AgglomerativeClustering
import numpy as np
from sklearn.datasets import load_diabetes

# Loading the dataset
X, Y = load_diabetes(return_X_y = True)

# Performing Agglomerative clustering
cluster = AgglomerativeClustering(n_clusters = 10, compute_distances = True)
cluster.fit(X[:50, :])

# labels_ holds the cluster index assigned to each sample
print("Cluster labels: ", cluster.labels_)

Output

Cluster labels:  [3 6 3 5 6 0 0 1 3 5 0 2 6 3 6 1 4 1 5 6 0 0 5 9 5 2 0 5 6 4 5 0 8 7 6 7 4
 5 1 3 1 0 2 7 8 3 0 0 3 2]
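
The bottom-up merging described above can be inspected on the fitted model. As a sketch, setting n_clusters = None together with distance_threshold = 0 asks AgglomerativeClustering to build the complete merge tree instead of stopping at a fixed number of clusters; the variable names are illustrative.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_diabetes

X, _ = load_diabetes(return_X_y=True)
X = X[:50, :]

# Build the full hierarchy: every merge is recorded, none is cut off
model = AgglomerativeClustering(n_clusters=None, distance_threshold=0)
model.fit(X)

# Each row of children_ lists the two nodes joined at that merge step;
# indices below 50 are original samples, larger ones are earlier merges
print("First merges:", model.children_[:5].tolist())

# distances_ holds the linkage distance at which each merge happened
print("First merge distances:", model.distances_[:5])
print("Number of merges:", model.children_.shape[0])
```

With 50 samples there are 49 merges in total, since each merge reduces the number of clusters by one until a single cluster remains.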

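Decision Tree Algorithm

A decision tree resembles a flowchart: each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The root node is the topmost node in the tree. The tree learns to partition the data based on attribute values, and splitting the data again and again in this way is called recursive partitioning. Because this flowchart-like structure closely mirrors how people reason through decisions, decision trees are easy to understand and interpret.

Code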

# Python program to perform classification using Decision Trees

# Importing the required libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Loading the dataset
X, Y = load_iris( return_X_y = True )

# Splitting the dataset in training and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=0)

# Creating an instance of the Decision Tree Classifier class
dtc = DecisionTreeClassifier(random_state = 0)
dtc.fit(X_train, Y_train)

# Calculating the accuracy score of the model using cross_val_score 
score = cross_val_score(dtc, X, Y, cv = 10)

# Printing the scores
print("Accuracy scores: ", score)
print("Mean accuracy score: ", np.mean(score))

Output

Accuracy scores:  [1.         0.93333333 1.         0.93333333 0.93333333 0.86666667
 0.93333333 1.         1.         1.        ]
Mean accuracy score:  0.96
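
To see the flowchart-like structure for yourself, sklearn.tree provides export_text, which prints the learned rules as indented text. The sketch below limits the tree to max_depth = 2 purely to keep the printout short; that limit and the name `rules` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a test on one feature; leaves show the predicted class
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```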

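Gradient Boosting

We can use gradient boosting for both regression and classification problems. It builds a predictive model from many smaller predictive models, usually decision trees.

A gradient boosting classifier needs a loss function to work. Gradient boosting as a method works with any differentiable loss function, including custom ones; scikit-learn's GradientBoostingClassifier selects among its built-in losses through the loss parameter.

Regression typically uses squared error, whereas classification usually uses log loss. Because each boosting stage fits the gradient of the loss, we do not have to derive a new boosting algorithm for every loss function; any differentiable loss can be plugged in.

Code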

# Python program to perform classification using Gradient Boosting

# Importing the required libraries
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Generating the dataset
X, Y = make_hastie_10_2(random_state = 10)

# Splitting the dataset in training and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state=0)

# Creating an instance of the Gradient Boosting Classifier class
gbc = GradientBoostingClassifier(n_estimators = 100, learning_rate = 1.0, max_depth = 1, random_state = 0)
gbc.fit(X_train, Y_train)

# Calculating the accuracy of the model on the test data
score = gbc.score(X_test, Y_test)

# Printing the scores
print("Accuracy scores: ", score)

Output

Accuracy scores:  0.9185416666666667
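
Because the model is built one small tree at a time, it is possible to watch the test accuracy change as stages are added. The sketch below uses the staged_predict method for this; the smaller sample size and the 50 estimators are assumptions made to keep the run short, and `stage_accuracy` is an illustrative name.

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# A smaller sample than in the example above, to keep the run short
X, y = make_hastie_10_2(n_samples=2000, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

gbc = GradientBoostingClassifier(
    n_estimators=50, learning_rate=1.0, max_depth=1, random_state=0
)
gbc.fit(X_train, y_train)

# staged_predict yields the predictions after 1, 2, ..., n_estimators stages
stage_accuracy = [
    accuracy_score(y_test, y_pred) for y_pred in gbc.staged_predict(X_test)
]

print("Accuracy after the first stage:", stage_accuracy[0])
print("Accuracy after the last stage:", stage_accuracy[-1])
```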
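
Dimensionality Reduction using PCA in Sklearn

Exact PCA

Principal component analysis (PCA) performs linear dimensionality reduction using the singular value decomposition (SVD) of the data to project it onto a lower-dimensional feature space. Before the SVD is applied, the input data is centered, but it is not scaled for each feature.

PCA lives in the sklearn.decomposition module of Scikit-learn.

PCA is used as a transformer object: its fit() method learns n components, and it can then project new data onto those components.

Code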

# Python program to show how to perform PCA using sklearn

# Importing all the required libraries
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Loading the breast cancer dataset
dataset = load_breast_cancer()
print(dataset.keys())
 
# Checking the target classes
print(dataset['target_names'])
 
# Checking the independent attributes
print(dataset['feature_names'])

# constructing a data frame of the dataset using pandas
df = pd.DataFrame(data = dataset['data'], columns = dataset['feature_names'])
 
# Performing feature engineering by performing standard scaling
scaler = StandardScaler()
 
# Using the fit_transform method
df_scaled = scaler.fit_transform(df)
 
# Setting the n_components = 3
pca = PCA(n_components = 3)
pca.fit(df_scaled)
X = pca.transform(df_scaled)
 
# Checking the dimensions of data
print("Shape of data after PCA: ", X.shape)

# Checking the values of eigenvectors
print("Components: ", pca.components_)

# checking how much variance pca can explain
print("Explained variance ratio: ", pca.explained_variance_ratio_)

Output

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
['malignant' 'benign']
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Shape of data after PCA:  (569, 3)
Components:  [[ 0.21890244  0.10372458  0.22753729  0.22099499  0.14258969  0.23928535
   0.25840048  0.26085376  0.13816696  0.06436335  0.20597878  0.01742803
   0.21132592  0.20286964  0.01453145  0.17039345  0.15358979  0.1834174
   0.04249842  0.10256832  0.22799663  0.10446933  0.23663968  0.22487053
   0.12795256  0.21009588  0.22876753  0.25088597  0.12290456  0.13178394]
 [-0.23385713 -0.05970609 -0.21518136 -0.23107671  0.18611304  0.15189161
   0.06016537 -0.03476751  0.19034877  0.36657546 -0.10555215  0.08997968
  -0.08945724 -0.15229262  0.20443045  0.23271591  0.1972073   0.13032154
   0.183848    0.28009203 -0.21986638 -0.0454673  -0.19987843 -0.21935186
   0.17230436  0.14359318  0.09796412 -0.00825725  0.14188335  0.27533946]
 [-0.00853123  0.0645499  -0.00931421  0.02869954 -0.10429182 -0.07409158
   0.00273384 -0.02556359 -0.04023992 -0.02257415  0.26848138  0.37463367
   0.26664534  0.21600656  0.30883896  0.15477979  0.17646382  0.22465746
   0.28858428  0.21150377 -0.04750699 -0.04229782 -0.04854651 -0.01190231
  -0.25979759 -0.2360756  -0.1730573  -0.17034416 -0.27131265 -0.23279135]]
Explained variance ratio:  [0.44272026 0.18971182 0.09393163]
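
Instead of fixing the number of components by hand, PCA also accepts a fraction between 0 and 1 as n_components and then keeps as many components as are needed to explain that share of the variance. The sketch below assumes a 95% target, which is an arbitrary example value; `cumulative` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

# Keep the smallest number of components explaining at least 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

# Running total of the explained variance ratio
cumulative = np.cumsum(pca.explained_variance_ratio_)

print("Components kept:", pca.n_components_)
print("Cumulative explained variance:", np.round(cumulative, 4))
```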

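Incremental PCA

Principal component analysis (PCA) supports only batch processing, which means all the data to be analyzed must fit in main memory. Incremental principal component analysis (IPCA) is used to overcome this limitation.

IncrementalPCA is part of the sklearn.decomposition module. It supports out-of-core computation in two ways: by calling its partial_fit method on chunks of data fetched one after another, or by working on a memory-mapped file created with np.memmap, which avoids loading the whole file into memory.

As with PCA, the input data is centered but not scaled for each feature before IPCA decomposes it.

Example

The following example applies sklearn.decomposition.IncrementalPCA to Sklearn's digits dataset.

Code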

# Python program to show how to perform decomposition using incremental PCA method

# Importing the required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

# Loading the digits dataset
dataset = load_digits()
print(dataset.keys())

# constructing a data frame of the dataset using pandas
df = pd.DataFrame(data = dataset['data'], columns = dataset['feature_names'])

# Checking the shape of the dataset before decomposition
print("Shape of the dataset before decomposition: ", df.shape)
 
# Performing feature engineering by performing standard scaling
scaler = StandardScaler()
 
# Using the fit_transform method
df_scaled = scaler.fit_transform(df)

# Performing the incremental PCA on the scaled data;
# fit_transform processes the data in batches of batch_size samples
ipca = IncrementalPCA(n_components = 15, batch_size = 200)
df_transformed = ipca.fit_transform(df_scaled)

# Checking the shape of the dataset after decomposition
print("Shape of the dataset after decomposition: ", df_transformed.shape)

Output

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])
Shape of the dataset before decomposition:  (1797, 64)
Shape of the dataset after decomposition:  (1797, 15)

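Here, fit() (called through fit_transform) splits the data into batches of 200 samples for us. Alternatively, we can call partial_fit() ourselves on each smaller batch of data as it becomes available.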
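
The second option, feeding the batches by hand, can be sketched as follows. The digits data is small and already in memory, so the array slices here only stand in for chunks that would normally be read from disk one at a time; `batch_size` and `X_transformed` are illustrative names.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

ipca = IncrementalPCA(n_components=15)
batch_size = 200

# Update the model with one batch at a time; each batch must contain
# at least n_components samples
for start in range(0, X.shape[0], batch_size):
    ipca.partial_fit(X[start:start + batch_size])

# Project the data onto the components learned from all batches
X_transformed = ipca.transform(X)

print("Samples seen:", ipca.n_samples_seen_)
print("Shape after decomposition:", X_transformed.shape)
```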

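Kernel PCA

Kernel principal component analysis (Kernel PCA) is an extension of PCA that uses kernel functions to perform non-linear dimensionality reduction. It supports the transform() and inverse_transform() methods; inverse_transform() is available only when the model is created with fit_inverse_transform = True.

We can use the KernelPCA class of the sklearn.decomposition module.

Example

We will use sklearn's digits dataset to demonstrate KernelPCA. The kernel we use is sigmoid.

Code

# Python program to show how to perform decomposition using kernel PCA method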

# Importing the required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA

# Loading the digits dataset
dataset = load_digits()
print(dataset.keys())

# constructing a data frame of the dataset using pandas
df = pd.DataFrame(data = dataset['data'], columns = dataset['feature_names'])

# Checking the shape of the dataset before decomposition
print("Shape of the dataset before decomposition: ", df.shape)
 
# Performing feature engineering by performing standard scaling
scaler = StandardScaler()
 
# Using the fit_transform method
df_scaled = scaler.fit_transform(df)

# Performing the kernel PCA on the scaled data
kpca = KernelPCA(n_components = 15, kernel = 'sigmoid')
df_transformed = kpca.fit_transform(df_scaled)

# Checking the shape of the dataset after decomposition
print("Shape of the dataset after decomposition: ", df_transformed.shape)

Output

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])
Shape of the dataset before decomposition:  (1797, 64)
Shape of the dataset after decomposition:  (1797, 15)
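
The inverse_transform() method mentioned above maps reduced data back to the original feature space, but only approximately. The sketch below is one way to try it; the RBF kernel, the gamma value, and the 500-sample subset are assumptions chosen for illustration, not settings from the example above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import KernelPCA

# A subset keeps the kernel matrix small
X = load_digits().data[:500]

# fit_inverse_transform=True makes the model learn an approximate inverse mapping
kpca = KernelPCA(
    n_components=15, kernel="rbf", gamma=1e-3, fit_inverse_transform=True
)
X_reduced = kpca.fit_transform(X)

# Map the 15-dimensional data back to the original 64 features
X_restored = kpca.inverse_transform(X_reduced)

# Mean squared reconstruction error between original and restored data
mse = np.mean((X - X_restored) ** 2)

print("Reduced shape:", X_reduced.shape)
print("Restored shape:", X_restored.shape)
print("Reconstruction MSE:", mse)
```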

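PCA using Randomized SVD

PCA with randomized SVD projects the data onto a lower-dimensional feature space that preserves most of the variance by dropping the singular vectors associated with the lower singular values. The randomized solver computes an approximate truncated SVD, which is faster when only a few components are needed. The sklearn.decomposition.PCA class with the svd_solver = 'randomized' parameter is useful here.

Example

The example below uses the sklearn.decomposition.PCA class with svd_solver = 'randomized' to find the first 10 principal components of sklearn's breast cancer dataset.

Code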

# Python program to show how to perform PCA through randomized solver using sklearn

# Importing all the required libraries
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

# Loading the breast cancer dataset
dataset = load_breast_cancer()

# constructing a data frame of the dataset using pandas
df = pd.DataFrame(data = dataset['data'], columns = dataset['feature_names'])

# Checking the shape of the dataset before decomposition
print("Shape of the dataset before decomposition: ", df.shape)

# Separating the features and the target.
# Note: X keeps the first 29 feature columns and leaves out the last one
# ('worst fractal dimension'); the printed output below reflects this choice.
X = df.iloc[:, :-1].values
Y = dataset['target']

# Performing feature engineering by implementing standard scaling
scaler = StandardScaler()
 
# Scaling the features only; the class labels need no scaling
scaler.fit(X)
X = scaler.transform(X)

# Implementing PCA using randomized solver
pca = PCA(n_components = 10, svd_solver = 'randomized')
pca.fit(X)
X = pca.transform(X)
 
# Checking the dimensions of data
print("Shape of data after PCA: ", X.shape)

# checking how much variance PCA can explain
print("Explained variance ratio: ", pca.explained_variance_ratio_)

Output

Shape of the dataset before decomposition:  (569, 30)
Shape of data after PCA:  (569, 10)
Explained variance ratio:  [0.45067848 0.18239963 0.09159257 0.06781847 0.05626861 0.04135939
 0.01989181 0.01637191 0.01397121 0.01209004]
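
Since the randomized solver is an approximation, it can be worth checking how close it comes to the exact solver. The sketch below fits both on all 30 scaled features of the breast cancer dataset and compares the explained variance ratios; `full`, `rand`, and `max_diff` are illustrative names, and random_state is fixed only to make the run repeatable.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)

# Exact solver and randomized solver with the same number of components
full = PCA(n_components=10, svd_solver="full").fit(X)
rand = PCA(n_components=10, svd_solver="randomized", random_state=0).fit(X)

# Largest absolute difference between the two sets of ratios
max_diff = np.max(
    np.abs(full.explained_variance_ratio_ - rand.explained_variance_ratio_)
)

print("Exact solver:     ", np.round(full.explained_variance_ratio_, 4))
print("Randomized solver:", np.round(rand.explained_variance_ratio_, 4))
print("Largest difference:", max_diff)
```

On a dataset this small the speed gain is negligible; the randomized solver pays off mainly on large matrices where only a few components are requested.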
