Difference Between Linear Regression and Autoregressive Models

June 18, 2025 | 12 min read

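Linear regression and autoregressive models are among the most widely used statistical tools in predictive modeling. Both describe data through a linear mathematical relationship, but they differ markedly in purpose, structure, and application. Understanding these differences is what lets you choose the right model for a given problem.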

Linear Regression

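Linear regression is a predictive modeling technique that describes the relationship between a dependent variable and one or more independent variables, with the aim of finding the linear equation that best predicts the dependent variable. The general form of a linear regression model is $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \epsilon$, where $Y$ is the dependent variable, the $X_i$ are the independent variables, the $\beta_i$ are the coefficients, and $\epsilon$ is the error term. The model relies on the following assumptions: the relationship between the dependent and independent variables is linear; the errors are homoscedastic, that is, their variance is constant; the observations are independent; and the errors are normally distributed.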
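As a minimal sketch of the equation above, the coefficients can be estimated by ordinary least squares with NumPy. The data here is synthetic, and the "true" coefficients in beta_true are made-up values chosen only so the estimate can be compared against something known.

```python
import numpy as np

# Synthetic data: the true coefficients below are made up for illustration.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))                # two independent variables X1, X2
beta_true = np.array([1.5, -2.0, 0.5])     # [beta0, beta1, beta2]
eps = rng.normal(scale=0.1, size=n)        # error term
y = beta_true[0] + X @ beta_true[1:] + eps

# Add a column of ones for the intercept, then solve the least-squares problem.
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("estimated coefficients:", np.round(beta_hat, 3))
```

With little noise and a few hundred observations, the estimates should land close to the values used to generate the data.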

Autoregressive Model

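An autoregressive (AR) model is a time series model that predicts future values from a fixed number $p$ of the most recent past values, assuming a linear dependence on those observations. The general form of an AR model of order $p$ is

$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \epsilon_t$, where $Y_t$ is the current value, the $Y_{t-i}$ are the lagged values, the $\phi_i$ are the coefficients, and $\epsilon_t$ is the error term. The model rests on specific assumptions: the series is stationary, meaning its mean and variance stay constant over time, and the error terms are white noise, meaning they are independent and identically distributed with zero mean.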
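One way to see the link between the two models is that an AR(p) model can be fitted as a linear regression of the series on its own lagged values. The sketch below simulates an AR(2) series and recovers its coefficients with least squares; the coefficients in phi_true are hypothetical values picked to give a stationary series.

```python
import numpy as np

rng = np.random.default_rng(1)
phi_true = np.array([0.6, -0.3])   # hypothetical AR(2) coefficients (stationary)
n = 2000

# Simulate Y_t = phi_1 * Y_{t-1} + phi_2 * Y_{t-2} + eps_t
y = np.zeros(n)
for t in range(2, n):
    y[t] = phi_true[0] * y[t - 1] + phi_true[1] * y[t - 2] + rng.normal(scale=1.0)

# Build the lag matrix: column k holds Y_{t-(k+1)} for t = p, ..., n-1.
p = 2
lags = np.column_stack([y[p - k - 1 : n - k - 1] for k in range(p)])
target = y[p:]

# Ordinary least squares on the lagged columns.
phi_hat, *_ = np.linalg.lstsq(lags, target, rcond=None)
print("estimated phi:", np.round(phi_hat, 3))
```

The only structural difference from ordinary linear regression is where the predictors come from: they are earlier values of the same variable rather than separate features.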

The main differences between linear regression and autoregressive models are summarized below.

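Aspect	Linear Regression	Autoregressive Model
Purpose	Predicts a dependent variable from independent factors.	Predicts future values of a variable from its own past values.
Data type	Independent observations or cross-sectional data.	Time series data with sequentially ordered observations.
Equation structure	Involves independent and dependent variables.	Depends only on lagged values of the variable itself.
Assumptions	Requires observations to be independent.	Assumes temporal dependence and stationarity.
Applications	Predicting trends, sales, or prices from non-time-series data.	Forecasting future trends in time series data.
Time component	No inherent notion of time.	Time is essential; the present is influenced by past values.
Dependence	Relies on external predictors.	Based on the variable's own history.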

Now, to better understand their use cases, we start with a linear regression model.

Importing Libraries

# import the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# "seaborn-whitegrid" was renamed in newer matplotlib; seaborn's own style call works across versions
sns.set_style("whitegrid")
warnings.filterwarnings("ignore")

Loading the Data

# load data
datazara= "../input/insurance/insurance.csv"
data_df = pd.read_csv(datazara)

# show the first 6 rows
data_df.head(6)

Output


Variable Information

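Age: Age of the primary policyholder.
Sex: Gender of the insured individual, male or female.
BMI: Body mass index, which indicates whether a person's weight is proportionate to their height. It is computed as weight (kg) divided by height squared (m²); the ideal range is 18.5 to 24.9.
Children: Number of children or dependents covered by the health insurance policy.
Smoker: Whether the individual is a smoker or a non-smoker.
Region: The policyholder's residential region in the US: northeast, southeast, southwest, or northwest.
Charges: Medical costs billed to the individual by the health insurance provider.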

Output

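The summary shows that the average age in the data is 39, with a standard deviation of 14. In other words, a typical age lies within one standard deviation of the mean, roughly between 25 and 53. The minimum and maximum values of each column also help reveal whether the data contains outliers.

Judging by the BMI values, the dataset contains many overweight and obese people. Overall, most people in the dataset have at least one child.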

Output

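Let's look at missing values.

Missing data in a dataset can lead to incorrect results when applying machine learning, so we should find and fix any missing information.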

Output

The dataset shown above is in good shape: it contains no missing values.
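The check behind this result is not shown, so here is a minimal sketch of how such a check is commonly done with pandas. The toy_df frame and its values are made up for illustration and are not the insurance data; on the real data the equivalent call would be data_df.isnull().sum().

```python
import numpy as np
import pandas as pd

# Made-up frame with a few deliberately missing entries.
toy_df = pd.DataFrame({
    "age": [19, 33, np.nan, 45],
    "bmi": [27.9, np.nan, 31.2, 22.7],
    "smoker": ["yes", "no", "no", None],
})

# Count missing values per column.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# One simple remedy: drop every row that has at least one missing value.
clean_df = toy_df.dropna()
print(len(clean_df), "complete rows remain")
```

Dropping rows is only one option; filling gaps with a column mean or median is another, and which is appropriate depends on how much data is missing.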

Observing Inconsistencies

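Now we look for inconsistencies. A commonly quoted rule of thumb is that a machine learning project is about 95% preprocessing and 5% model selection. A model has to be trained on data that is well understood, and specific preprocessing techniques are needed to make the data ready for machine learning. Outlier analysis is one of them. An outlier is any data point whose value differs significantly from the other observations; in other words, an observation that departs markedly from the overall pattern.

Because outliers behave differently from the rest of the data and can increase the risk of overfitting, it is worth identifying them and deciding how to handle them. Several visualization techniques can reveal such inconsistent observations. One of them is the box plot: the bulk of the data is summarized inside the box and whiskers, while outliers are drawn as individual points.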
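The points a box plot draws individually are, by default in matplotlib and seaborn, those lying more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule, on made-up values rather than the insurance data:

```python
import numpy as np

# Made-up BMI-like values with one extreme reading.
values = np.array([21.0, 23.5, 24.1, 25.0, 26.3, 27.8, 29.4, 30.2, 31.0, 52.6])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences used by the default box plot whiskers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = values[(values < lower) | (values > upper)]
print("fences:", lower, upper)
print("outliers:", outliers)
```

Flagging a point this way does not mean it must be removed; it only marks it for a closer look.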

data = data_df.copy()
data = data.select_dtypes(include=["float64","int64"])
data.head()

Output

column_list = ['age', 'bmi', 'children', 'charges']
for col in column_list:
    sns.boxplot(x = data[col])
    plt.xlabel(col)
    plt.show()

Output

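The plots above show outliers in the charges and BMI values. However, these do not harm our dataset; if anything, they make the data easier to interpret. We therefore leave them untouched.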

Model

Now we build our linear regression model.

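f = plt.figure(figsize=(16,5))

# Raw charges (sns.distplot is deprecated; histplot with kde=True is its replacement)
ax = f.add_subplot(121)
sns.histplot(data_df['charges'], bins=50, color='r', kde=True, ax=ax)
ax.set_title('Distribution of insurance charges')

# Charges on a log10 scale
ax = f.add_subplot(122)
sns.histplot(np.log10(data_df['charges']), bins=40, color='b', kde=True, ax=ax)
ax.set_title('Distribution of insurance charges (log10 scale)')

plt.show()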

Output

f = plt.figure(figsize=(14,6))
ax = f.add_subplot(121)
sns.violinplot(x='sex', y='charges',data=data_df,palette='Wistia',ax=ax)
ax.set_title('Violin plot of Charges vs sex')

ax = f.add_subplot(122)
sns.violinplot(x='smoker', y='charges',data=data_df,palette='magma',ax=ax)

plt.show()

Output

sns.jointplot(x="bmi",y="charges",data=data_df,kind="reg")
plt.show()

Output

sns.jointplot(x="age",y="charges",data=data_df,kind="reg")
plt.show()

Output

sns.jointplot(x="children",y="charges",data=data_df,kind="reg")
plt.show()

Output

# importing the required packages
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy.stats import boxcox
from sklearn import metrics

data_df_encode = data_df.copy()

Output

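The so-called "dummy variable trap" arises when the independent variables are multicollinear, that is, when two or more variables are so highly correlated that one can be predicted from the others.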
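The trap can be made concrete with a small rank check. In the sketch below, a hypothetical three-level categorical column is one-hot encoded: together with an intercept, the full set of dummy columns is linearly dependent, while dropping one dummy restores full column rank. The region values are made up for illustration.

```python
import numpy as np

# Hypothetical categorical column with three levels.
region = np.array(["north", "south", "west", "south", "north", "west"])
levels = np.unique(region)                      # sorted unique levels

# Full one-hot encoding: one column per level.
one_hot = (region[:, None] == levels[None, :]).astype(float)
intercept = np.ones((len(region), 1))

full = np.hstack([intercept, one_hot])            # intercept + all 3 dummies
reduced = np.hstack([intercept, one_hot[:, 1:]])  # first dummy dropped

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print("full:", full.shape[1], "columns, rank", rank_full)
print("reduced:", reduced.shape[1], "columns, rank", rank_reduced)
```

The full design has four columns but only rank three, because the dummy columns always sum to the intercept column. That redundancy is exactly what makes the regression coefficients non-unique.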

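With pandas' get_dummies function, the whole encoding can be done in a single line. We use it to encode the sex, smoker, and region features as dummy variables. Setting drop_first=True drops the first level of each encoded variable (along with the original column), which avoids the dummy variable trap. pandas makes this easy.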

data_df_encode = pd.get_dummies(data = data_df_encode, columns = ['sex','smoker','region'], drop_first = True)
data_df_encode.head()

Output

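# Normalising the target
# Box-Cox result (transformed values, lambda, 95% CI) is computed for reference only
y_bc, lam, ci = boxcox(data_df_encode['charges'], alpha=0.05)
# The transform actually applied is the natural log
data_df_encode['charges'] = np.log(data_df_encode['charges'])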

data_df_encode.head()

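Normalising here means transforming the target so that its distribution is closer to symmetric. Insurance charges are typically right-skewed, with a long tail of very high bills, and taking the logarithm compresses that tail so the data sits closer to the normal-error assumption of linear regression. This is unrelated to database normalisation, which concerns removing redundant data from relational tables.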
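The effect of the log transform applied to charges above can be illustrated with a quick skewness comparison. This is a sketch on synthetic log-normal data, not the real insurance charges; the skewness helper and the distribution parameters are assumptions made for the example.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardised moment."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return np.mean(centred**3) / np.mean(centred**2) ** 1.5

rng = np.random.default_rng(42)
# Synthetic right-skewed "charges" (log-normal); not the real insurance data.
charges = rng.lognormal(mean=9.0, sigma=0.9, size=5000)

skew_raw = skewness(charges)
skew_log = skewness(np.log(charges))
print("skewness before log:", round(skew_raw, 2))
print("skewness after log: ", round(skew_log, 2))
```

A skewness near zero after the transform indicates a roughly symmetric target, which is the property the log transform is meant to deliver here.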

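# Full feature matrix (all encoded columns except the target)
X = data_df_encode.drop('charges',axis=1)
y = data_df_encode['charges']

train_X, test_X, train_y, test_y = train_test_split(X,y,test_size=0.3,random_state=23)

# The model below uses bmi alone, so X, y and the split are redefined here
X = data_df_encode['bmi'].values.reshape(-1,1)  # independent variable
y = data_df_encode['charges']  # dependent variable (log-transformed above)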

train_X, test_X, train_y, test_y = train_test_split(X,y,test_size=0.3,random_state=42)
Lin_regress = LinearRegression()
model = Lin_regress.fit(train_X,train_y)
pred = Lin_regress.predict(test_X)

print("intercept: ", model.intercept_)
print("coef: ", model.coef_)
print("R^2 score: ", model.score(test_X,test_y))

Output

plt.figure(figsize=(12,6))
plt.scatter(test_y,pred)
plt.show()

Output

print('MAE:', metrics.mean_absolute_error(test_y, pred))
print('MSE:', metrics.mean_squared_error(test_y, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(test_y, pred)))

Output
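For reference, the metrics reported above (MAE, MSE, RMSE, and the R² returned by score) can be written out directly from their definitions. The numbers in y_true and y_pred below are made up so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Made-up targets and predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))      # mean absolute error
mse = np.mean(errors**2)           # mean squared error
rmse = np.sqrt(mse)                # root mean squared error

# R^2 compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(errors**2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R^2:", r2)
```

MAE and RMSE are in the units of the target, so when the target has been log-transformed they describe errors on the log scale rather than in currency.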

plt.figure(figsize=(12,6))
g = sns.regplot(x=data_df_encode['bmi'],y=data_df_encode["charges"],ci=None,scatter_kws = {'color':'r','s':9})
g.set_title("Model Equation")
g.set_ylabel("charges")
g.set_xlabel('bmi')
plt.show()

Output

plt.figure(figsize=(12,6))
g = sns.regplot(x=data_df_encode['age'],y=data_df_encode["charges"],ci=None,scatter_kws = {'color':'r','s':9})
g.set_title("Model Equation")
g.set_ylabel("charges")
g.set_xlabel('age')
plt.show()

Output

Now it is time to look at a use case for the autoregressive model.

Importing Libraries

!pip install mplcursors

# Importing the required packages
# import fitz
import warnings
import matplotlib.pyplot as plt
import tensorflow as tf
import mplcursors
import os
import random
import pandas as pd
import numpy as np


%matplotlib inline

warnings.filterwarnings("ignore")

# from keras import backend as K
from sklearn.metrics import mean_squared_error
from math import sqrt
from matplotlib import pyplot
from pandas.plotting import autocorrelation_plot
from pandas import DataFrame
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import *
from statsmodels.tsa.ar_model import AutoReg
# The legacy AR class is no longer usable in recent statsmodels releases; AutoReg (imported above) replaces it
from statsmodels.tsa.arima.model import ARIMA

We verify the directory structure and make sure all expected files are present.

# Base directory of the input data
currentwd = "/kaggle/input"
print("Current working directory : {}".format(currentwd))

# Create the data files' path.
path_data = os.path.join(currentwd, "sensor-data/Data")
print("Data path : {}".format(path_data))

# Obtain every file within the directory.
files = os.listdir(path_data)
print("Total files in the directory : {}".format(len(files)))
print("Files in the directory : {}".format(files))

Output

Now we split the list of files into training and test sets using an 80-20 ratio.

# Assign training and testing files to an 80-20 split.
train_files, test_files = train_test_split(files,test_size=0.2)

# Print information for the training files
print("Total Training Files : {}".format(len(train_files)))
print("Training Data Files : {}".format(train_files))

# Print information about the test files
print("\nTotal Testing Files : {}".format(len(test_files)))
print("Testing Data : {}".format(test_files))

Output

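The following function preprocesses a sensor data file with some basic cleaning, transformation, and visualization steps. It first logs the file type and index being processed, along with the file name and the initial column name. It reads the CSV file into a pandas DataFrame and drops any rows with missing values. It renames the first column to "Sensor Value", then iterates over that column to find non-numeric entries and drops the invalid rows. The sensor values are converted to a floating-point type so that numeric operations can be applied. The function then prints a summary of the DataFrame: its shape, column names, and data types. For visualization, it draws a line plot to show the trend, a lag plot to check whether the data is autocorrelated, and an autocorrelation plot to show the correlation at different lags. Finally, it returns the cleaned DataFrame, ready for further analysis or machine learning tasks.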

def Filepreprocess(filename, file_type, index):
    
    # Print the file name that is currently being considered.
    print("{} File index being processed : {}".format(file_type, index))
    print("File name : {}".format(filename))
    
    # Loading the file
    file_path = os.path.join(path_data, filename)

    # Structuring the Pandas DataFrame
    data_df = pd.read_csv(file_path)

    # We will drop the NA values
    data_df = data_df.dropna()

    # Extracting the data column names
    column_name = data_df.columns[0]
    print("Initial Column Name : {}".format(column_name))
    
    # Rename the column
    data_df = data_df.rename(columns={column_name: "Sensor Value"})

    # Get the bad records (Not having numeric sensor data)
    records_to_drop = []
    for i in range(0, len(data_df)):
        try:
            # Only test whether the value converts; the actual cast happens below
            float(data_df.iloc[i]["Sensor Value"])
        except (TypeError, ValueError):
            records_to_drop.append(i)
    
    # taking out the bad data from the dataframe
    rows = data_df.index[records_to_drop]
    data_df.drop(rows, inplace=True)

    # we will change the sensor data to float 
    data_df["Sensor Value"] = pd.to_numeric(data_df["Sensor Value"], downcast="float")
    
    # Obtaining the shape of the dataframe 
    print("Shape of the Dataframe : {}".format(data_df.shape))

    # Obtaining the column types from the dataframe
    print("Column types : {}".format(data_df.dtypes))
    
    # obtaining the column names from the dataframe
    print("Columns : {}".format(data_df.columns))

    # Plotting the data
    print("\nData Plot")
    data_df.plot()
    plt.show()
    
    # To check the correlation we need to plot the lagplot
    print("\nLag Plot")
    pd.plotting.lag_plot(data_df["Sensor Value"])
    plt.show()
    
    # We will plot the auto-correlation
    print("\nAutocorrelation Plot")
    pd.plotting.autocorrelation_plot(data_df["Sensor Value"])
    plt.show()
    
    return data_df
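The autocorrelation plot drawn at the end of this function shows, for each lag, how strongly the series correlates with a shifted copy of itself. A minimal sketch of that quantity with NumPy, using two synthetic series (white noise and a hypothetical AR(1) process with coefficient 0.9) in place of real sensor data:

```python
import numpy as np

def autocorr(series, lag):
    """Sample autocorrelation of a 1-D series at a positive lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    return np.sum(x[lag:] * x[:-lag]) / np.sum(x * x)

rng = np.random.default_rng(7)
noise = rng.normal(size=1000)        # no memory between observations

# AR(1) series: each value carries over 90% of the previous one.
ar1 = np.zeros(1000)
for t in range(1, 1000):
    ar1[t] = 0.9 * ar1[t - 1] + noise[t]

r_noise = autocorr(noise, 1)
r_ar1 = autocorr(ar1, 1)
print("lag-1 autocorrelation, white noise:", round(r_noise, 3))
print("lag-1 autocorrelation, AR(1):      ", round(r_ar1, 3))
```

A series whose autocorrelation stays near zero at every lag gives an autoregressive model nothing to work with, whereas clearly non-zero values at small lags suggest that past values carry predictive information.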

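The CreateTrainTestData function splits a dataset into training and test sets at a given breakpoint. The last `breakpoint` elements form the test set, and every element before them goes into the training set.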

def CreateTrainTestData(data, breakpoint):
    
    # Hold out the last `breakpoint` observations as the test set.
    train = data[:len(data)-breakpoint]
    test = data[len(data)-breakpoint:]
    
    print("\nLength of the total set : {}".format(len(data)))
    print("\nLength of the training set : {}".format(len(train)))
    print("\nLength of the testing set : {}".format(len(test)))
    
    return train, test
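The same tail split can be shown on a tiny list. The split_tail helper below is a hypothetical stand-in written for this illustration (it adds a bounds check and omits the printing), and the readings are made up.

```python
def split_tail(series, n_test):
    """Hypothetical helper: the last n_test items form the test set, the rest the training set."""
    if not 0 < n_test < len(series):
        raise ValueError("n_test must be between 1 and len(series) - 1")
    return series[:-n_test], series[-n_test:]

readings = [10, 11, 13, 12, 14, 15, 17, 16]
train, test = split_tail(readings, 3)

print(train)  # [10, 11, 13, 12, 14]
print(test)   # [15, 17, 16]
```

Unlike the shuffled split used for the insurance data, the order is preserved here: a time series model must be tested on observations that come after the ones it was trained on.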

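The initializeRunModel function initializes, trains, and inspects an AR model on the supplied training data. It first builds the model from the training set and then fits it. After training, it prints the lag order used and the fitted parameters (params), the key components of an AR model.

def initializeRunModel(training_data, lags=5):

    # Initialize the model. AutoReg replaces the legacy AR class, which
    # recent statsmodels releases no longer support; lags=5 is an assumed
    # default matching the rolling example further down.
    model = AutoReg(training_data, lags=lags)

    # Train the model
    model = model.fit()

    # Report the lag order used
    print("\nThe lag order used is: {}".format(lags))
    
    # Get the coefficients
    print("\nCoefficients: {}".format(model.params))
    
    return model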

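The predict function uses the trained AR model to make predictions for the test set. The forecast range starts immediately after the training data and runs to the end of the test data. It then prints the expected and predicted values side by side and reports the test RMSE.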

def predict(model, data_train, data_test):

    pred = model.predict(start=len(data_train), end=len(data_train)+len(data_test)-1, dynamic=False)

    print("\nLength of Testing Set : {}".format(len(data_test)))
    print("Length of Prediction Set : {}".format(len(pred)))
    
    print("\n")
    for i in range(len(pred)):
        print("Expected = {}, Predicted = {}".format(data_test[i], pred[i]))

    rmse = sqrt(mean_squared_error(data_test, pred))
    print("\nTest RMSE = {}".format(rmse))
    
    return pred

Next, we define a function that plots the test data against the predictions.

def data_plot(test_y, prediction):
    print("\n")
    range_future = len(prediction)
    plt.plot(np.arange(range_future), test_y, label = "Test data")
    plt.plot(np.arange(range_future), prediction, label = "Predicted Data")    
    plt.title("Test data vs Predicted Data")
    plt.legend(loc = "upper left")
    plt.xlabel("T")
    plt.ylabel("Sensor Value")

Next, we load the individual files and build the combined data.

# Initialising the combined data
data = []

index = 0

for f in train_files[:1]:
    # We will call up the Filepreprocess Function here
    data_file = Filepreprocess(f, "Training", index)
    for value_sensor in data_file.values:
        data.append(value_sensor)
    index += 1

Output

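The rolling_Autoregression function implements rolling, or walk-forward, validation for an autoregressive time series model. It first splits the given dataset into training and test sets at the supplied breakpoint. It then initializes a history list holding the past observations and an empty list for the model's predictions. At each step it refits the model on the history, forecasts one step ahead, and appends the true observation to the history.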

def rolling_Autoregression(data, breakpoint):
    
    # Making training and testing set of the data
    data_train, data_test = CreateTrainTestData(data, breakpoint=breakpoint)

    # history
    history = [x for x in data_train]

    # Initialising the pred
    pred = []

    # Walk-forward validation
    print("\n")
    for t in range(len(data_test)):
        model = AutoReg(history, lags=5)
        fitting_model = model.fit()
        yhat = fitting_model.predict(start=len(history), end=len(history)+1-1)
        obs = data_test[t]
        print("Predicted = {}, Expected = {}".format(yhat, obs))
        history.append(obs)
        pred.append(yhat)

    # Evaluating the  forecasts
    rmse = sqrt(mean_squared_error(data_test, pred))
    print("\nTest RMSE =  {}".format(rmse))

    # Compare pred with actual results.
    print("\n")
    plt.plot(data_test)
    plt.plot(pred, color = "red")
    plt.title("Test data vs Predicted Data")
    plt.xlabel("T")
    plt.ylabel("Sensor Value")
    plt.show()

# Using the function
rolling_Autoregression(data, 5)

Output
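The walk-forward loop can also be sketched without statsmodels, which makes each step explicit: refit on the history, forecast one step, then reveal the true value. The version below fits the AR model by plain least squares, so it is an illustration of the procedure rather than a reimplementation of AutoReg. The helper names fit_ar, forecast_next, and walk_forward are hypothetical, and the series is synthetic.

```python
import numpy as np

def fit_ar(history, p):
    """Least-squares AR(p) fit with intercept; returns [c, phi_1, ..., phi_p]."""
    h = np.asarray(history, dtype=float)
    # Each row holds the p values preceding time t, most recent first.
    rows = [h[t - p : t][::-1] for t in range(p, len(h))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    coef, *_ = np.linalg.lstsq(X, h[p:], rcond=None)
    return coef

def forecast_next(history, coef):
    """One-step-ahead forecast from the last p observations."""
    p = len(coef) - 1
    recent = np.asarray(history[-p:], dtype=float)[::-1]
    return coef[0] + coef[1:] @ recent

def walk_forward(series, n_test, p=2):
    history = list(series[:-n_test])
    test = list(series[-n_test:])
    preds = []
    for obs in test:
        coef = fit_ar(history, p)              # refit on everything seen so far
        preds.append(forecast_next(history, coef))
        history.append(obs)                    # reveal the true value
    rmse = float(np.sqrt(np.mean((np.array(test) - np.array(preds)) ** 2)))
    return preds, rmse

# Synthetic AR(1) series standing in for sensor readings.
rng = np.random.default_rng(3)
series = np.zeros(300)
for t in range(1, 300):
    series[t] = 0.8 * series[t - 1] + rng.normal(scale=0.5)

preds, rmse = walk_forward(series, n_test=20, p=2)
print("one-step forecasts:", len(preds))
print("test RMSE:", round(rmse, 3))
```

Because every forecast is made using only data available before that time step, the resulting RMSE is an honest estimate of one-step-ahead performance.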

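The rolling_ARIMA function applies the same walk-forward validation with a rolling ARIMA model. It first splits the data into training and test sets at the supplied breakpoint, then creates an empty prediction list to hold the model's outputs.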

def rolling_ARIMA(data, breakpoint):

    # Make training and testing sets from the data
    data_train, data_test = CreateTrainTestData(data, breakpoint=breakpoint)

    # history
    history = [x for x in data_train]

    # Initialising the pred
    pred = list()

    # Walk-forward validation
    print("\n")
    for t in range(len(data_test)):
        model = ARIMA(history, order=(5,1,0))
        fitting_model = model.fit()
        output = fitting_model.forecast()
        yhat = output[0]
        pred.append(yhat)
        obs = data_test[t]
        history.append(obs)
        print("Predicted = {}, Expected = {}".format(yhat, obs))

    # Evaluating the forecasts
    rmse = sqrt(mean_squared_error(data_test, pred))
    print("\nTest RMSE =  {}".format(rmse))

    # Compare pred with actual results.
    print("\n")
    plt.plot(data_test)
    plt.plot(pred, color = "red")
    plt.title("Test data vs Predicted Data")
    plt.xlabel("T")
    plt.ylabel("Sensor Value")
    plt.show()

# Using the function
rolling_ARIMA(data, 5)

Output
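In order=(5,1,0), the middle value d=1 means the autoregressive part (here with 5 lags) is fitted to the first differences of the series rather than to its levels, and q=0 means there is no moving-average term. The sketch below shows only the differencing step and how a forecast of the next difference maps back to a level. The values are made up, and the mean of the differences stands in for the AR forecast purely to keep the example short.

```python
import numpy as np

# Made-up trending series.
y = np.array([100.0, 102.0, 105.0, 109.0, 114.0, 120.0])

# d = 1: model the changes between consecutive observations.
dy = np.diff(y)
print("first differences:", dy)

# Differencing is reversible: the levels come back by cumulative summation.
reconstructed = np.concatenate([[y[0]], y[0] + np.cumsum(dy)])

# Placeholder forecast of the next difference (a real ARIMA would use its AR terms here).
next_diff_forecast = dy.mean()

# Map the forecast difference back to a level.
next_level_forecast = y[-1] + next_diff_forecast
print("forecast of next level:", next_level_forecast)
```

Differencing is what lets ARIMA handle series with a trend, where the stationarity assumption of a plain AR model would not hold for the raw levels.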

Now we turn to the OhioT1DM dataset, which we load and preprocess.

# Create the data files' path.
file_path = os.path.join(currentwd, "ohio-data/ohio_t1dm.csv")
print("File path : {}".format(file_path))

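# Create the DataFrame (the squeeze argument was removed in pandas 2.0)
data_df = pd.read_csv(file_path, parse_dates=True)

# Take out the columns (drop no longer accepts a positional axis in pandas 2.0)
data_df = data_df.drop(columns="basis_gsr")
data_df = data_df.drop(columns="basis_skin_temperature")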

# Taking records from a particular year
data_df["time"] = pd.to_datetime(data_df["time"])
data_df = data_df[data_df["time"].dt.year == 2027]
data_df = data_df[data_df["time"].dt.month == 6]

# Setting the index
data_df.set_index("time", inplace=True)

# Printing the Dataframe
print(data_df)

# Plotting the Dataframe
data_df.plot()
plt.show()

# Plotting the lag plot to check for the correlation
print("\nLag Plot")
pd.plotting.lag_plot(data_df["level_glucose"])
plt.show()
    
# Plotting the autocorrelation
print("\nAutocorrelation Plot")
pd.plotting.autocorrelation_plot(data_df["level_glucose"])
plt.show()

Output

# Take the glucose readings out of the DataFrame.
glucose_data = data_df["level_glucose"].values

# Running the Model
rolling_Autoregression(glucose_data, 5)

Output

# Run the Model
rolling_ARIMA(glucose_data, 5)

Output
