Sklearn Regression Models

January 12, 2025 | 13 min read

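Linear regression and logistic regression are the two best-known algorithms that machine learning uses for regression analysis, and they are among the most widely used methods. Many other regression algorithms exist, and the right one depends on the type of data and the goal of the analysis.

This tutorial describes various machine learning regression models and the situations in which each can be applied. If you are new to machine learning, it should help you understand the concepts behind regression modeling.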

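Regression Models

In the supervised machine learning paradigm, the system learns from known outcomes. Both the input and the output variables of the model are known, and both are used to train the algorithm. During training the algorithm is fed the true values, and the model is fitted to reduce its prediction error. The two main categories of supervised learning algorithms are regression (for continuous target variables) and classification (for discrete target variables).

Regression analysis is a predictive modeling method that examines the relationship between a target (dependent) variable and the independent variables of a dataset. Different regression models are used depending on whether the dependent and independent variables show a linear or nonlinear association and on whether the target takes continuous values. Regression analysis is commonly used to investigate cause-and-effect relationships, forecast trends, analyze time series, and measure the strength of predictors.

Regression returns its result as continuous data. With this feature-centered technique we can predict patterns learned from the training data. The output is a real number that does not belong to any class or category. For example, estimating the value of a property depends on many details such as the size of the house, its location, and the development rate of the area.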

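Types of Regression Analysis Techniques

There are many regression analysis methods and, as noted above, several factors determine which one to use. Examples of these factors are the shape of the regression line and the number of independent features.

The regression methods covered in this tutorial are listed below:

Linear Regression
Logistic Regression
Ridge Regression
Lasso Regression
Polynomial Regression
Bayesian Linear Regression
Elastic Net Regression
Decision Tree Regression
Support Vector Regression
Gradient Boosting Regression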

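Linear Regression

Linear regression is a machine learning method that establishes a linear relationship between one or more independent features and a dependent feature. It predicts the best value of the dependent feature by estimating the coefficients of a linear equation.

A simple linear regression model computes the best-fit line for a dependent variable (y) and a single independent variable (x). It is defined by the line equation:

y = mx + c + e

Here m is the regression coefficient. It indicates how much we expect y to change as the value of x changes. The regression model determines the values of the intercept (c) and the regression coefficient that minimize the error (e).

The algorithm finds the best parameter values by reducing the error between the actual values of the target variable y and the values the model predicts. The error is minimized using ordinary least squares, which minimizes the sum of squared residuals. The model can accommodate any number of input variables present in the dataset.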
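To make the least-squares idea concrete, here is a minimal sketch that estimates m and c for a single feature in two ways: by solving the least-squares problem directly with NumPy and by fitting sklearn's LinearRegression. The synthetic data (a line with slope 3 and intercept 2 plus noise) is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): y = 3x + 2 plus small Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=50)

# Least-squares solution for [m, c]: the column of ones carries the intercept
A = np.column_stack([x, np.ones_like(x)])
(m, c), *_ = np.linalg.lstsq(A, y, rcond=None)

# The same fit with scikit-learn
model = LinearRegression().fit(x.reshape(-1, 1), y)

print("lstsq   m, c:", m, c)
print("sklearn m, c:", model.coef_[0], model.intercept_)
```

Both approaches minimize the same sum of squared residuals, so the two pairs of values should agree up to floating-point error.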

Code

# Python program showing how to perform linear regression (simple and multiple regression)

# Importing the required libraries
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression


# Loading the diabetes dataset
diabetes = load_diabetes()

# Checking that the target feature contains continuous data
print(diabetes.target[:10])

# Creating a dataframe
dataset = pd.DataFrame(data = diabetes.data, columns = diabetes.feature_names)

# Adding the target variable to the dataset
dataset["target"] = diabetes.target

# Separating the dependent and independent features
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values


# Separating the data for training and testing the model
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 15)

# Creating an object of the Linear Regression model class
l_reg = LinearRegression()

# Fitting the training data to the linear regression model
# Creating a Simple Linear Regression model using only one independent feature of the training dataset
l_reg.fit(X_train[:, 2:3], Y_train)

# Printing the R-square for the simple linear regression model by passing the unseen or the testing data
score = l_reg.score( X_test[:, 2:3], Y_test )
print("The r-square score of the Simple Linear Regression model: ", score)

# Creating a Multiple Linear Regression model using all the independent features
l_reg.fit(X_train, Y_train)

# Finding the predicted values
Y_pred = l_reg.predict(X_test)

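# Creating a dataframe for the coefficient values of the independent features
coef = pd.DataFrame(data = dataset.iloc[:,:-1].columns, columns = ["Features"])
coef["Coefficients"] = l_reg.coef_
# Note: writing to index 0 replaces the row of the first feature ("age") with the intercept,
# so the coefficient of "age" does not appear in the table
coef.loc[0] = ["Intercept", l_reg.intercept_]
coef

Output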

[151.  75. 141. 206. 135.  97. 138.  63. 110. 310.]
The r-square score of the Simple Linear Regression model:  0.3195508704110651

	Features	Coefficients
0	Intercept	150.559849
1	sex	-207.159699
2	bmi	545.523615
3	bp	282.960253
4	s1	-1195.950976
5	s2	580.404692
6	s3	404.413845
7	s4	431.717208
8	s5	871.098497
9	s6	118.346669

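Logistic Regression

Logistic regression is another regression-based modeling technique, used when the dependent variable is discrete, such as true or false, or 0 or 1. Despite its name, it is therefore used for classification. In its basic form the dependent variable has only two possible values, and a sigmoid curve describes the association between the target variable and the input variables. The sklearn implementation also handles targets with more than two classes, as the three-class iris example below shows.

The logistic regression algorithm uses the logit function to quantify the relationship between the target variable and the input variables.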
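The link between the sigmoid curve and the logit can be checked numerically. The sketch below fits a binary LogisticRegression on a tiny made-up dataset (hours studied versus pass or fail, an assumption for illustration) and compares probabilities computed by hand with those returned by predict_proba.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map a real-valued score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary data (illustrative assumption): hours studied -> fail (0) or pass (1)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# Linear score z = w*x + b is the logit (log-odds) of the positive class
z = X @ clf.coef_.ravel() + clf.intercept_[0]
manual = sigmoid(z)
from_sklearn = clf.predict_proba(X)[:, 1]

print(np.round(manual, 3))
print(np.round(from_sklearn, 3))
```

Applying the logit, log(p / (1 - p)), to the predicted probabilities recovers the linear score z, which is why the model is linear in the log-odds.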

Code

# Python program to create a classification model using the Logistic Regression algorithm

# Importing the required modules
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Loading the iris dataset
X, Y = load_iris(return_X_y = True)

# Separating the data for training and testing the model
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.4, random_state = 15)

# Creating an object of the Logistic Regression model class
log_reg = LogisticRegression(random_state = 10)

# Fitting the training data to the logistic regression model
log_reg.fit(X_train, Y_train)

# Predicting the values for the unseen data
Y_pred = log_reg.predict(X_test)

# Computing the accuracy score of the Logistic Regression model
scores = accuracy_score(Y_test, Y_pred)
print(scores)

Output

1.0

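Ridge Regression

Ridge regression is used to predict the target variable when the independent features are highly correlated. For non-collinear variables, least squares gives an unbiased solution with acceptable variance. When collinearity is strong, however, the least-squares estimates are still unbiased but their variance becomes very large. Ridge regression therefore adds a penalty on the squared size of the coefficients (an L2 penalty) to the regression equation, accepting a small bias in exchange for much lower variance. This makes the resulting models less prone to overfitting.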
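The effect of the penalty under collinearity can be sketched with two nearly identical features. The data below is synthetic and chosen only for illustration; alpha = 1.0 is likewise an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (illustrative assumption): x2 is almost an exact copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS can split the shared effect between the twin features almost arbitrarily,
# while the L2 penalty pulls the two coefficients toward similar, smaller values
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

The ridge coefficient vector is never longer than the least-squares one, which is the shrinkage that trades a little bias for lower variance.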

Code

# Python program to create a regression model using the Ridge Regression algorithm

# Importing the required modules
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

# Loading the diabetes dataset
X, Y = load_diabetes(return_X_y = True)

# Separating the data for training and testing the model
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.4, random_state = 15)

# Creating a Ridge Regression model object. The model will use solver = 'svd'
r_reg = Ridge(solver = 'svd', random_state = 10)

# Creating another model with solver = 'lsqr'
# This solver uses the regularized least-squares routine scipy.sparse.linalg.lsqr
r_reg1 = Ridge(solver = 'lsqr', random_state = 10)

# Fitting the training data to the Ridge regression model
r_reg.fit(X_train, Y_train)
r_reg1.fit(X_train, Y_train)

# Predicting the values for the unseen data
Y_pred = r_reg.predict(X_test)
Y_pred1 = r_reg1.predict(X_test)

# Computing the R-squared score for both the models
print('Accuracy score for solver = "svd": ', r2_score(Y_test, r_reg.predict(X_test).round(5)))
print('Accuracy score for solver = "lsqr": ', r2_score(Y_test, r_reg1.predict(X_test).round(5)))

Output

Accuracy score for solver = "svd": 0.40731258229249656
Accuracy score for solver = "lsqr": 0.4073540170017558

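Lasso Regression

Lasso regression is a regression model that combines feature selection with regularization. It penalizes the sum of the absolute values of the regression coefficients (an L1 penalty). As a result, unlike ridge regression, which only shrinks coefficients toward zero, lasso can set the coefficients of some independent features exactly to zero.

Lasso regression thus performs feature selection. It picks out from the dataset the variables that influence the model more than the others. The coefficients of all features other than those needed for good predictions are set to zero, which helps prevent overfitting. When the independent variables of a dataset are strongly collinear, lasso tends to keep only one of them and shrink the coefficients of the others to zero.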
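The feature-selection behaviour can be illustrated on synthetic data in which only three of ten features influence the target. The data, the true coefficients, and alpha = 0.5 are all assumptions chosen for this sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (illustrative assumption): only the first three features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)

# Indices of the features whose coefficients were not driven to zero
selected = np.flatnonzero(lasso.coef_)

print("Coefficients:", lasso.coef_.round(2))
print("Selected feature indices:", selected)
```

The surviving coefficients are also shrunk toward zero relative to the true values, which is the price of the L1 penalty.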

Code

# Python program to create a regression model using the Lasso Regression algorithm

# Importing the required modules
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso

# Loading the diabetes dataset
X, Y = load_diabetes(return_X_y = True)

# Segregating the training and testing data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 9)

# Creating an object of the Lasso Regression model class
l_reg = Lasso(random_state = 9)

# Fitting the training data to the Lasso regression model
l_reg.fit(X_train, Y_train)

# Predicting the values for the unseen data
Y_pred = l_reg.predict(X_test)

# Computing the R-squared score of the model (the printed label calls it an accuracy score)
print('The accuracy score of the model is: ', r2_score(Y_test, l_reg.predict(X_test).round(5)))

Output

The accuracy score of the model is: 0.40131502999775714

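Polynomial Regression

Polynomial regression is another regression analysis method. It resembles multiple linear regression with a small modification: the relationship between the independent feature X and the dependent feature Y is modeled as a polynomial of degree n.

The predictor is still a linear model. We first expand the features into polynomial terms with sklearn's PolynomialFeatures transformer and then fit a linear regression on them. As in linear regression, the polynomial regression algorithm uses ordinary least squares to measure the error of the fit. The best-fit line is a curve rather than a straight line, and how closely it follows the data points depends on the power of X, that is, the value of n.

Because the model tries to reach the lowest value of the OLS objective while finding the best-fit curve, polynomial regression is prone to overfitting. It is advisable to compare curves of different degrees before settling on one, because extrapolating a higher-degree polynomial can produce strange results.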
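The following sketch shows one way to compare degrees, assuming synthetic quadratic data rather than the salary dataset. It chains PolynomialFeatures and LinearRegression with make_pipeline and records the training R-squared for each degree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (illustrative assumption): a quadratic trend plus noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 40).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - x.ravel() + rng.normal(scale=0.3, size=40)

scores = {}
for degree in (1, 2, 6):
    # Expand x into polynomial terms, then fit ordinary least squares on them
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    scores[degree] = model.score(x, y)  # R-squared on the training data
    print(f"degree {degree}: training R^2 = {scores[degree]:.3f}")
```

The training score cannot decrease as the degree grows, so a high training R-squared alone does not show that a higher degree is better; a held-out set or cross-validation is needed to detect overfitting.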

The dataset is available here - https://github.com/content-anu/dataset-polynomial-regression

Code

# Python program to perform the Polynomial Regression using sklearn

# Importing the required libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# %matplotlib inline   (IPython/Jupyter magic; uncomment only when running in a notebook)

# Importing the uploaded CSV dataset
data = pd.read_csv('Position_Salaries.csv')

# Separating the independent and dependent features of the dataset
X = data.iloc[:,1:2].values
Y = data.iloc[:,2].values

# Creating an instance of the PolynomialFeatures transformer
poly_reg = PolynomialFeatures(degree = 4)

# Fitting the transformer and expanding X into polynomial features
X_polynomial = poly_reg.fit_transform(X)

# Creating a Linear regression model instance
lreg = LinearRegression()
lreg.fit(X_polynomial, Y)

# Visualising the Polynomial regression model
plt.scatter(X, Y, color = 'red')
plt.plot(X, lreg.predict(poly_reg.transform(X)), color = 'blue')
plt.title('Polynomial Regression Model')
plt.xlabel('Position Levels')
plt.ylabel('Salary')
plt.show()

# Using a higher degree polynomial to show overfitting
poly_reg1 = PolynomialFeatures(degree = 7)
X_polynomial1 = poly_reg1.fit_transform(X)
lreg1 = LinearRegression()
lreg1.fit(X_polynomial1, Y)
plt.scatter(X, Y, color = 'red')
plt.plot(X, lreg1.predict(X_polynomial1), color = 'blue')
plt.title('Overfitted Polynomial Regression Model')
plt.show()

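Output

[Figures: scatter plots of salary against position level with the fitted curves, titled "Polynomial Regression Model" and "Overfitted Polynomial Regression Model"]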

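Bayesian Linear Regression

Bayesian regression is a regression model that uses Bayes' theorem to estimate the regression coefficients. Rather than finding a single least-squares solution, this method determines the posterior distribution of the coefficients. Bayesian linear regression resembles both linear regression and ridge regression, but tends to be more stable than simple linear regression.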
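One practical consequence of working with a posterior distribution is that predictions come with an uncertainty estimate. The sketch below, on synthetic data assumed only for illustration, uses the return_std option of BayesianRidge.predict to obtain a standard deviation alongside each predicted mean.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic data (illustrative assumption): y = 2x + 1 plus unit-variance noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=1.0, size=60)

br = BayesianRidge().fit(X, y)

# One query inside the training range and one far outside it
X_new = np.array([[5.0], [50.0]])
mean, std = br.predict(X_new, return_std=True)

for x_val, mu, sd in zip(X_new.ravel(), mean, std):
    print(f"x = {x_val:5.1f}: predicted mean = {mu:7.2f}, std = {sd:.2f}")
```

The reported standard deviation is larger for the point far from the training data, reflecting the greater uncertainty of the extrapolation.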

Code

# Python program to perform Bayesian Regression

# Importing modules that are required
# Note: load_boston was removed in scikit-learn 1.2, so this listing needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import BayesianRidge
   
# Loading the Boston dataset
X, Y = load_boston(return_X_y = True)
   
# Splitting the training and testing datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 16)
   
# Creating an instance of the model and training it
br = BayesianRidge()
br.fit(X_train, Y_train)
   
# Predicting values for the unseen data, i.e., the testing data
Y_pred = br.predict(X_test)
   
# Computing the R-square score for the model
print(f"The r2 score of the model is: {r2_score(Y_test, Y_pred)}")

Output

The r2 score of the model is: 0.6312926702997255

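Elastic Net Regression

Elastic net regression is a regularized linear regression method. It linearly combines the L1 and L2 penalties and adds them to the loss function during training. By giving each penalty an appropriate weight, it blends lasso and ridge regression, which can improve prediction accuracy.

Alpha and lambda are the two configurable hyperparameters of the elastic net. Lambda controls the overall scale of the weighted sum of the two penalties and therefore how strongly the model is regularized. The alpha parameter, in contrast, controls the weight assigned to each penalty. Note that sklearn names these differently: its alpha argument is the overall penalty strength (lambda in the description above), and its l1_ratio argument sets the mix between the L1 and L2 penalties.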
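The role of the two hyperparameters in sklearn can be sketched by varying l1_ratio while holding alpha fixed. The toy data reuses the six points from this tutorial's elastic net listing; the particular l1_ratio values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# The same toy data as in the tutorial's elastic net listing
X = np.array([[2, 2], [2, 3], [2, 4], [3, 2], [3, 3], [3, 4]], dtype=float)
y = np.array([7, 9, 11, 8, 10, 12], dtype=float)

# alpha sets the overall penalty strength; l1_ratio sets the L1/L2 mix
for l1_ratio in (0.2, 0.5, 1.0):
    model = ElasticNet(alpha=0.1, l1_ratio=l1_ratio).fit(X, y)
    print(f"l1_ratio = {l1_ratio}: coef = {model.coef_}, intercept = {model.intercept_:.4f}")

# With l1_ratio = 1.0 the L2 part vanishes and the model reduces to lasso
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
default_enet = ElasticNet(alpha=0.1).fit(X, y)  # l1_ratio defaults to 0.5
```

Lowering l1_ratio moves the model toward ridge-like behaviour, while l1_ratio = 1.0 gives exactly the lasso penalty.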

Code

# Python program to perform Elastic Net Regression algorithm

# Importing the required libraries
from sklearn.linear_model import ElasticNet
import numpy as np

# Constructing an Elastic Net regression model having the hyperparameter value of alpha = 0.1
el = ElasticNet(alpha = 0.1)

# Preparing the input data
X = np.array([[2, 2], [2, 3], [2, 4], [3, 2], [3, 3], [3, 4]])

# Preparing the target variables
y = [ 7, 9, 11, 8, 10, 12]

# Fitting the data to the model
el.fit(X, y)

# Printing the coefficients of the model
print(el.coef_)

# Printing the intercept of the best fitting line
print(el.intercept_)

# Predicting the values for sample data
print(el.predict([[4, 4]]))
print(el.predict([[0, 0]]))

Output

[0.66666667 1.79069767]
2.461240310077521
[12.29069767]
[2.46124031]

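Decision Tree Regression

A decision tree is a decision-making tool that represents decisions and all of their possible consequences, including outcomes, input costs, and utility.

Decision trees belong to the family of supervised learning methods and work for both categorical and continuous outputs.

Decision tree regression: the algorithm observes the attributes of an item and uses them as inputs to train a tree-structured model that predicts future data as a meaningful continuous output. Continuous output means the result is not discrete, that is, it is not restricted to a known, finite set of numbers or values.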
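Although the target is continuous, a fitted regression tree predicts one constant value per leaf. The sketch below, on a synthetic sine curve assumed only for illustration, limits the tree to depth 2 and lists the distinct values it can output.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption): 20 points on a sine curve
X = np.arange(0, 10, 0.5).reshape(-1, 1)
y = np.sin(X).ravel()

# A depth-2 tree has at most 2**2 = 4 leaves
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
pred = tree.predict(X)

# Each leaf predicts the mean target of the training points that fall into it
print("Number of leaves:", tree.get_n_leaves())
print("Distinct predicted values:", np.unique(pred.round(3)))
```

Deeper trees produce more leaves and a finer step function, which fits the training data more closely but also makes overfitting more likely.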

Code

# Python program to show how to employ Decision Tree Regression

# Importing the required modules
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, cross_val_score

# Loading the diabetes dataset
X, Y = load_diabetes( return_X_y = True )

# Splitting the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 10)

# Creating an object of the Decision Tree Regressor class
dtr = DecisionTreeRegressor(random_state = 10)
dtr.fit(X_train, Y_train)

# Computing the R-squared score of the Decision Tree model on each of 15 cross-validation splits
# (cross_val_score refits a fresh copy of the model on every split)
scores = cross_val_score(dtr, X, Y, cv = 15)

# Printing the scores
print("Accuracy scores of the splits: ", scores)
print("The mean accuracy score of the model: ", np.mean(scores))

Output

Accuracy scores of the splits: [-1.02298964 0.05276312 -0.74850198 -0.07277615 0.47050685 0.3024374
 -1.31209035 -0.26358347 0.43173477 -0.19109809 -0.56778646 -0.61486148
 -0.11295867 0.0408493 -0.26464188]
The mean accuracy score of the model: -0.25819978131125926

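Support Vector Regression

Support vector regression (SVR) differs from the other regression models. It uses the support vector machine (SVM), originally a classification method, to predict continuous values. Whereas a traditional linear regression model aims to minimize the difference between the estimated and actual values, SVR aims to fit the best line within a predetermined deviation, the threshold epsilon. In this respect SVR separates the data points into two groups: those that lie within the error margin (the region bounded by two parallel lines at distance epsilon from the fitted line) and those that lie outside it. Points inside the margin are ignored when the error is computed. Points on or outside the margin become the support vectors, which determine the fitted function used to predict unknown values.

The kernel is the most important parameter of SVR. It can be a Gaussian, polynomial, or linear kernel. Because the dataset contains a nonlinear relationship, either a polynomial or a Gaussian kernel would be suitable; in this case we choose the RBF (Gaussian) kernel.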
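The role of the error margin can be sketched by counting support vectors for several values of epsilon. The noisy sine data and the epsilon values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data (illustrative assumption): a noisy sine curve
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

n_sv = {}
for eps in (0.01, 0.1, 0.5):
    svr = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)
    # support_ holds the indices of the training points used as support vectors
    n_sv[eps] = len(svr.support_)
    print(f"epsilon = {eps}: {n_sv[eps]} support vectors out of {len(X)} points")
```

A wider margin leaves more points inside the tube, so fewer of them end up as support vectors and the fitted function becomes smoother.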

Code

# Python program to apply Support Vector Regression

# Importing the required modules
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Importing the dataset
data = pd.read_csv('Position_Salaries.csv')

# Separating the target and independent variables
X = data.iloc[:,1:2].values.astype(float)
y = data.iloc[:,2:3].values.astype(float)

# Performing feature scaling (y is flattened to the 1-D shape that SVR expects)
scaler = StandardScaler()
scaler1 = StandardScaler()
X = scaler.fit_transform(X)
y = scaler1.fit_transform(y).ravel()

# Creating an object of the Support Vector Regression class and fitting our dataset to construct the model
reg = SVR(kernel = 'rbf')
reg.fit(X, y)

# Predicting the salary for position level 7.5: scale the input, then undo the scaling of the prediction
Y_pred = reg.predict(scaler.transform([[7.5]]))
print(scaler1.inverse_transform(Y_pred.reshape(-1, 1)))

# Visualising the model using matplotlib plots
plt.scatter(X, y, color = 'red')
plt.plot(X, reg.predict(X), color = 'blue')
plt.title('Support Vector Regression Model')
plt.xlabel('Position levels')
plt.ylabel('Salary of the Positions')
plt.show()

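Output

[Figure: scatter plot titled "Support Vector Regression Model" with position levels on the x-axis and salary on the y-axis, overlaid with the fitted SVR curve]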

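Gradient Boosting Regression

When a single model struggles on a classification or regression problem, we can apply gradient boosting. It builds a predictive model from many smaller predictive models, which are usually decision trees.

A gradient boosting regressor needs a loss function to operate. The technique can work with a variety of predefined loss functions and, in principle, with custom ones, provided the loss function is differentiable.

Although classification methods usually use a logarithmic loss, regression techniques can use squared error. We do not have to derive a new boosting algorithm for each loss function; any differentiable loss function can be plugged into the same framework.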
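The stage-by-stage nature of boosting and the choice of loss can be sketched as follows. The hyperparameter values are assumptions for illustration; the first model uses the default squared-error loss and the second uses the predefined "huber" loss.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Default loss is squared error; each new tree is fitted to the current residuals
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=2, random_state=0)
gbr.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after each boosting stage
train_mse = [mean_squared_error(y_train, p) for p in gbr.staged_predict(X_train)]
print("Training MSE after stage 1:  ", round(train_mse[0], 1))
print("Training MSE after stage 100:", round(train_mse[-1], 1))

# Swapping in another differentiable loss only changes the loss argument
gbr_huber = GradientBoostingRegressor(loss="huber", random_state=0).fit(X_train, y_train)
print("Test MSE (squared error):", round(mean_squared_error(y_test, gbr.predict(X_test)), 1))
print("Test MSE (huber):        ", round(mean_squared_error(y_test, gbr_huber.predict(X_test)), 1))
```

The training error falls as stages are added, so the number of stages and the learning rate are usually tuned against held-out data rather than the training set.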

Code

# Python program to perform regression using Gradient Boosting Regression algorithm

# Importing the required modules
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Loading the diabetes dataset
X, Y = load_diabetes(return_X_y = True)

# Separating the training and validating datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 10)

# Creating an object of the Gradient Boosting Regressor class
gbr = GradientBoostingRegressor(n_estimators = 150, learning_rate = 1.0, max_depth = 2, random_state = 10)
gbr.fit(X_train, Y_train)

# Computing the R-squared scores of the Gradient Boosting Regression model using the cross_val_score method
scores = cross_val_score(gbr, X, Y, cv = 10)

# Printing the scores
print("Accuracy scores: ", scores)
print("The mean accuracy score is: ", np.mean(scores))

Output

The script prints the ten per-fold R-squared scores returned by cross_val_score, followed by their mean.

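Regression Can Handle Linear Dependencies

Regression is a reliable way to predict numerical variables. The machine learning techniques above include effective regression methods, and the sklearn Python library can be used to perform regression analysis and prediction with them for a wide range of machine learning tasks.

However, these regression methods are the better choice when the dataset shows a linear dependence between the independent and dependent features. Other algorithms, such as neural networks, are used to handle nonlinear relationships between features, because their activation functions let them capture nonlinearity.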
