Sklearn Linear Regression Example

March 17, 2025 | 7 min read

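Linear regression is a supervised machine learning algorithm that performs a regression task: it learns from independent variables (features) to predict a continuous target value, and it is mainly used to determine the relationship between those variables and the predicted value.

Regression models differ in the number of independent variables they use, the form of the relationship between the dependent and independent variables, and other factors. This tutorial shows how to apply linear regression to a given dataset using several Python modules. Because a model with a single feature is easier to visualize, we start with that case. In this demonstration the model is trained on the sklearn diabetes dataset.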

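How to Use Linear Regression in Sklearn?

Scikit-learn is a Python package that simplifies applying a variety of machine learning (ML) methods, including linear regression, to predictive data analysis.

Linear regression finds the straight line that best fits a set of scattered data points; we can then extrapolate that line to predict new data points. Because of its simplicity and key properties, linear regression is an important machine learning technique.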
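As a minimal sketch of what "fitting a line" means, the snippet below generates synthetic points and fits them with LinearRegression. The slope of 3, the intercept of 2, the noise level, and the seed are made-up values for illustration only; they do not come from the diabetes dataset. The result is compared against NumPy's least-squares line fit, which should agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=50)

# sklearn expects a 2-D feature matrix, hence the reshape
model = LinearRegression().fit(x.reshape(-1, 1), y)

# The same least-squares line computed with NumPy (degree-1 polynomial fit)
slope, intercept = np.polyfit(x, y, deg=1)

print("sklearn:", model.coef_[0], model.intercept_)
print("numpy:  ", slope, intercept)

# Extrapolating the fitted line to a point outside the observed range
x_new = np.array([[12.0]])
print("prediction at x = 12:", model.predict(x_new)[0])
```

Both approaches minimize the same sum of squared residuals, so the fitted slope and intercept should match up to floating-point error.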

Linear Regression Example Using Sklearn

Code

# Python code on sklearn linear regression example

# Importing required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Loading the sklearn diabetes dataset
X, Y = load_diabetes(return_X_y=True)

# Taking only one feature to perform simple linear regression
X = X[:,8].reshape(-1,1)

# Splitting the dependent and independent features of the dataset into training and testing dataset
X_train, X_test, Y_train, Y_test = train_test_split( X, Y, test_size = 0.3, random_state = 10 )

# Creating an instance for the linear regression model of sklearn
lr = LinearRegression()

# Training the model by passing the dependent and independent features of the training dataset
lr.fit( X_train, Y_train )

# Creating an array of predictions made by the model for the unseen or test dataset
Y_pred = lr.predict( X_test )

# The value of the coefficient learned for the single independent feature
print("Value of the coefficients: \n", lr.coef_)

# The value of the mean squared error
print(f"Mean square error: {mean_squared_error( Y_test, Y_pred)}")

# The value of the coefficient of determination, i.e., R-square score of the model
print(f"Coefficient of determination: {r2_score( Y_test, Y_pred )}")

# Plotting the output
plt.scatter(X_test, Y_test, color = "black", label = "original data")
plt.plot(X_test, Y_pred, color = "blue", linewidth=3, label = "regression line")
plt.xlabel("Independent Feature")
plt.ylabel("Target Values")
plt.title("Simple Linear Regression")
plt.legend()  # display the labels passed to scatter() and plot()
plt.show()

Output

Value of the coefficients: 
 [875.72247876]
Mean square error: 4254.602428877642
Coefficient of determination: 0.3276195356900222
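The two metrics printed above can be reproduced by hand, which makes their meaning clearer. The sketch below uses four made-up target and prediction values (an assumption for illustration, not values from the diabetes run) and checks the manual formulas against sklearn's functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative values only; not taken from the diabetes example
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.0, 9.5])

# Mean squared error: average of the squared residuals (1.75 / 4 = 0.4375)
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2 = 1 - SS_res / SS_tot, where SS_tot measures the spread around the mean
ss_res = np.sum((y_true - y_hat) ** 2)          # 1.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 20.0
r2_manual = 1.0 - ss_res / ss_tot               # 1 - 1.75 / 20 = 0.9125

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

An R-squared of 1 means the predictions match the targets exactly, while a value near 0 means the model does no better than always predicting the mean of the targets.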

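Sklearn Linear Regression Example Using Cross-Validation

Many machine learning models are trained on one part of the original data and then evaluated on the complementary subset. This process is called cross-validation. Use cross-validation to detect overfitting or a failure to generalize a pattern.

Cross-validation involves the following three steps:

Set aside a portion of the sample dataset.
Train the model on the remaining portion of the dataset.
Validate the model on the portion that was set aside.

With cross-validation, we generate several small train-test splits from the original training dataset and use them to train and evaluate the model that best describes the relationship between the dependent and independent variables. In standard k-fold cross-validation, we divide the original dataset into k subsets. In each iteration we train the linear regression model on k-1 of the subsets and validate it on the remaining one.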
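The k-fold procedure described above can be sketched as an explicit loop. This is a minimal illustration on the diabetes dataset; the choice of 5 folds, shuffling, and the seed are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X, Y = load_diabetes(return_X_y=True)

# 5 folds with shuffling; these settings are illustrative assumptions
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in k_fold.split(X):
    # Train a fresh model on the k-1 training folds
    model = LinearRegression()
    model.fit(X[train_idx], Y[train_idx])
    # Validate on the held-out fold; score() returns R^2 for regressors
    fold_scores.append(model.score(X[val_idx], Y[val_idx]))

print("R^2 per fold:", np.round(fold_scores, 3))
print("Mean R^2:", np.mean(fold_scores))
```

Every sample is used for validation exactly once, so the mean score is less sensitive to a lucky or unlucky single split than a one-off train-test split.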

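This lets us validate the model on data it has not seen and check whether it describes the relationship well. In this section we learn how to run a cross-validation test on a linear regression model with sklearn. We also look at a way to improve on the scores obtained from the KFold cross-validation method.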

Code

# Python program to perform kfold cross-validation test on a Linear Regression model 

#Importing the required libraries
from sklearn.datasets import load_diabetes
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
 
#Loading the diabetes dataset of sklearn as a pandas DataFrame
data = load_diabetes(as_frame = True)
dataset = data.frame

# Segregating dependent and independent variables of the dataset
X = dataset.iloc[:,:-1] 
Y = dataset.iloc[:,-1]

# Separating the dataset into training and testing dataset
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3)
 
#Implementing k-fold cross validation
k_value = 6
k_fold = KFold(n_splits = k_value, random_state = None)
Lreg = LinearRegression()

# Fitting the Linear Regression model to the training dataset
Lreg.fit(X_train, Y_train)

# Finding the score of each fold using cross_val_score
# For a regressor the default score is R-squared, not classification accuracy
scores = cross_val_score(Lreg, X_train, Y_train, cv = k_fold)

# Calculating the mean of the per-fold scores
mean_accuracy_score = sum(scores) / len(scores)

# Printing the accuracy scores
print("Accuracy score of each fold: ", scores)
print("Mean accuracy score: ", mean_accuracy_score)

Output

Accuracy score of each fold:  [0.41676766 0.45263441 0.44526044 0.43015152 0.40605028 0.41904005]
Mean accuracy score:  0.4283173940888665
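The values labelled "accuracy" above are R-squared scores, because cross_val_score falls back to the estimator's own score method when no scoring argument is given. As a hedged sketch, the same folds can also report mean squared error through the scoring parameter. The scorer returns negated errors so that larger is always better, hence the sign flip below. Running it on the full dataset rather than a training split is a simplification for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, Y = load_diabetes(return_X_y=True)
k_fold = KFold(n_splits=6)

# Default scoring for a regressor: R^2 from LinearRegression.score
r2_per_fold = cross_val_score(LinearRegression(), X, Y, cv=k_fold)

# Explicit scoring: sklearn returns the negative MSE, so negate it back
neg_mse = cross_val_score(
    LinearRegression(), X, Y, cv=k_fold, scoring="neg_mean_squared_error"
)
mse_per_fold = -neg_mse

print("R^2 per fold:", r2_per_fold)
print("MSE per fold:", mse_per_fold)
```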

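To improve on the scores from the KFold cross-validation test, we can use the stratified KFold method. Note that StratifiedKFold is intended for classification targets; in this example each distinct target value is treated as a class.

Code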

# Python program showing stratified k-fold cross-validation test on a Linear Regression model 

# Importing the required library
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split

# Loading the dataset
dataset = load_diabetes()

# Getting dependent and independent features
X = dataset.data
Y = dataset.target
print("Size of the dataset is: ", len(X))

# Separating the dataset into training and testing dataset
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3)
 
# Creating an instance of a linear regression model
Lreg = LinearRegression()

# Fitting the training dataset to train the model
Lreg.fit(X_train, Y_train)

# Performing stratified K-fold cross-validation test
stratified_kfold = StratifiedKFold(n_splits = 6)
score = cross_val_score(Lreg, X, Y, cv = stratified_kfold )

# Printing accuracy scores
print("Stratified k-fold Cross Validation Scores are: ", score )
print("Average Cross Validation score is: ", score.mean())

Output

Size of the dataset is:  442
Stratified k-fold Cross Validation Scores are:  [0.50117449 0.45486492 0.46935982 0.5599043  0.50545775 0.42289802]
Average Cross Validation score is:  0.48560988266624
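StratifiedKFold balances class labels across folds, so applying it to a continuous target is not its usual use. One common workaround, sketched here as an assumption rather than as the method this tutorial uses, is to bin the target into quantiles and stratify on the bin labels while still fitting the regression on the original target. The choice of five bins, shuffling, and the seed are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, Y = load_diabetes(return_X_y=True)

# Bin the continuous target into 5 quantile groups (illustrative choice)
edges = np.quantile(Y, [0.2, 0.4, 0.6, 0.8])
y_bins = np.digitize(Y, edges)  # integer labels 0..4

# Stratify on the bins so each fold sees a similar target distribution
skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=0)
splits = list(skf.split(X, y_bins))

# The model is still trained and scored on the original target Y
scores = cross_val_score(LinearRegression(), X, Y, cv=splits)

print("R^2 per fold:", scores)
print("Mean R^2:", scores.mean())
```

Passing a precomputed list of (train, test) index pairs as cv keeps the stratification on the bins separate from the regression target.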

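Multiple Linear Regression Using Python Sklearn

Multiple linear regression is a supervised machine learning method that uses several independent features to predict a target feature. It involves one dependent variable and several independent variables and is an extension of simple linear regression. We train the model on the independent variables to predict the outcome.

Multiple regression uses a single formula to describe how the target variable responds to simultaneous changes in several independent variables.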
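The formula referred to above is usually written as follows. The symbols are notation chosen for this sketch: p is the number of features, x_1 to x_p are the feature values, and the beta terms are the parameters learned during fitting.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
```

In sklearn's LinearRegression, \beta_0 corresponds to the intercept_ attribute and \beta_1, \dots, \beta_p to the entries of coef_.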

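Data Preprocessing

Most machine learning practitioners consider data preprocessing one of the most important stages of a regression project. There may be too many data points, reporting errors, or many other problems that prevent the algorithm from making accurate predictions on the dataset.

To avoid this, data scientists spend a lot of time cleaning, normalizing, and scaling a dataset before feeding it to a machine learning model.

Scaling functions such as the MinMax and Standard scalers are the most common tools for feature scaling. They are needed because the features in your data have different ranges, and many machine learning methods (k-nearest neighbors, for example) rely on the Euclidean distance between two data points.

By scaling every feature in the set to the same range, these functions let the algorithm compute distances accurately.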
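As a small illustration of the two scalers mentioned above, the sketch below applies MinMaxScaler and StandardScaler from sklearn.preprocessing to a toy matrix whose two columns have very different ranges. The numbers are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (illustrative): the second column is 100x larger than the first
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# MinMaxScaler maps each column to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# StandardScaler gives each column zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```

In a real project the scaler is normally fitted on the training data only and then applied to the test data with transform, so that no information from the test set leaks into training.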

We first need to import both sklearn.preprocessing and numpy.

Multiple Linear Regression Sklearn Example

Code

# Python program shows how to process data before fitting it to a Linear Regression Model. 
# Multiple Linear Regression Sklearn example

# Importing the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


# Loading the dataset 
iris_data = load_iris()
dataset = pd.DataFrame(data = iris_data.data, columns = iris_data.feature_names)

# Adding the target feature to the dataset
dataset["target"] = iris_data.target

#Printing head of dataset
print(dataset.head())

# Forward-filling any NaN or missing values in the dataset
# (DataFrame.ffill replaces the deprecated fillna(method = 'ffill'))
dataset.ffill(inplace = True)

# Dropping any rows with Nan values
dataset.dropna(inplace = True)

# Separating independent and dependent variables
# This will convert each dataframe into numpy arrays
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values


# Separating the data into training and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 10)

# Creating an instance of the Linear Regression class of the sklearn
lr = LinearRegression()

# Fitting the training data of the dataset into the model to train the model
lr.fit( X_train, Y_train )

# Printing the R-square for the trained model by passing unseen or the test data
score = lr.score( X_test, Y_test )
print("R-square score of the model: ", score)

# Storing the predicted values of the test dataset by the model in an array
Y_pred = lr.predict(X_test)

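# Creating a table of the intercept and the coefficients of the independent features
# Note: assigning to row 0 overwrites the first feature, "sepal length (cm)",
# so its coefficient does not appear in the resulting table
result = pd.DataFrame(data = dataset.iloc[:,:-1].columns, columns = ["Features"])
result["Coefficients"] = lr.coef_
result.loc[0] = ["Intercept", lr.intercept_]
result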

Output

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  
R-square score of the model:  0.9215461211058802
	Features	Coefficients
0	Intercept	0.279851
1	sepal width (cm)	-0.006845
2	petal length (cm)	0.297868
3	petal width (cm)	0.507009
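The table above has no row for "sepal length (cm)" because the intercept was written into row 0. One possible way to keep all four coefficients alongside the intercept is sketched below; it rebuilds the same split and model so that the snippet runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris_data.data, iris_data.target, test_size=0.25, random_state=10
)
lr = LinearRegression().fit(X_train, Y_train)

# Put the intercept first, then one row per feature, without overwriting any row
result = pd.DataFrame({
    "Features": ["Intercept", *iris_data.feature_names],
    "Coefficients": [lr.intercept_, *lr.coef_],
})
print(result)
```

Building both columns from lists of equal length avoids index bookkeeping and makes it easy to see that every feature has a coefficient.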
