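PySpark Logistic Regression

January 31, 2025 | 6 min read

Introduction

PySpark is the Python API for Apache Spark. It lets data scientists and engineers use distributed computing for large-scale data processing. One of its key components is the machine learning library (MLlib), which provides scalable machine learning algorithms, including logistic regression. Logistic regression is a core algorithm for binary classification problems. In this article, we look closely at logistic regression in PySpark, covering its implementation, several examples, and the key concepts.

What Is Logistic Regression?

Logistic regression is a statistical technique for predicting a binary outcome (1/0, true/false, yes/no) from a set of independent variables. Unlike linear regression, which predicts a continuous outcome, logistic regression predicts a probability bounded between 0 and 1. It uses the logistic function (also called the sigmoid function) to model the binary dependent variable.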

The logistic function is defined as

σ(z) = 1 / (1 + e^(−z))

where

z is a linear combination of the input features.
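
As a quick illustration of the logistic function, the sketch below evaluates it in plain Python. The function name `sigmoid` and the sample inputs are chosen here for illustration only; the two branches are algebraically the same formula, split to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^(-z)); the result lies between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form e^z / (1 + e^z), which avoids overflow in exp(-z) for very negative z
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Large negative inputs approach 0, large positive inputs approach 1
for z in (-5.0, 0.0, 5.0):
    print(z, round(sigmoid(z), 4))
```

An input of 0 maps to exactly 0.5, which is why 0.5 is the natural decision threshold between the two classes.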

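Logistic Regression Formula

The logistic regression model can be written as

P(Y=1∣X) = σ(β0 + β1X1 + β2X2 + … + βnXn)

where

P(Y=1∣X) is the probability that the dependent variable Y equals 1, given the input features X.
β0 is the intercept term.
β1, β2, …, βn are the coefficients corresponding to the features X1, X2, …, Xn.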
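
To make the formula concrete, here is a minimal sketch that computes P(Y=1∣X) for one observation. The intercept, coefficients, and feature values are hypothetical numbers picked for illustration, not values from a trained model.

```python
import math

def predict_proba(features, coefficients, intercept):
    """Return P(Y=1 | X) = sigma(beta0 + sum(beta_i * x_i))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values, chosen only for illustration
beta0 = -1.0
betas = [0.8, -0.5]
x = [2.0, 1.0]

# z = -1.0 + 0.8 * 2.0 + (-0.5) * 1.0 = 0.1
p = predict_proba(x, betas, beta0)
print(round(p, 4))
```

A fitted PySpark model exposes the same two ingredients as `coefficients` and `intercept`, so the probability it reports for a row can be checked by hand in this way.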

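PySpark and MLlib

PySpark's MLlib provides tools for large-scale machine learning tasks. It supports a range of algorithms for classification, regression, clustering, and collaborative filtering. Logistic regression is one of the classification algorithms MLlib provides.

Setting Up PySpark

Before we dive into logistic regression, let's set up PySpark. Make sure you have Java and Spark installed. You can install PySpark with pip:

pip install pyspark

Next, let's start a PySpark session: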

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Logistic Regression Example") \
    .getOrCreate()   

Data Preparation

PySpark expects the data as a DataFrame, with the features assembled into a single vector-typed column. We can use VectorAssembler from pyspark.ml.feature for this.

from pyspark.ml.feature import VectorAssembler
data = [
    (0, 1.0, 0.5, 1.0),
    (1, 2.0, 1.5, 0.0),
    (0, 1.0, 1.2, 1.0),
    (1, 3.0, 2.0, 0.0),
]
columns = ["label", "feature1", "feature2", "feature3"]
df = spark.createDataFrame(data, columns)
# Assemble features into a single vector
assembler = VectorAssembler(inputCols=["feature1", "feature2", "feature3"], outputCol="features")
df = assembler.transform(df)
df.show()  

Logistic Regression Model

Now let's build and train a logistic regression model with PySpark's LogisticRegression class.

from pyspark.ml.classification import LogisticRegression
# Initialize LogisticRegression
lr = LogisticRegression(featuresCol="features", labelCol="label")
# Fit the model
lr_model = lr.fit(df)
print(f"Coefficients: {lr_model.coefficients}")
print(f"Intercept: {lr_model.intercept}")  

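Model Evaluation

Common metrics for evaluating a classifier include accuracy, precision, recall, and F1 score. For binary classification tasks, PySpark provides BinaryClassificationEvaluator, which reports the area under the ROC curve by default.

from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Make predictions
predictions = lr_model.transform(df)
# Initialize evaluator (default metric: areaUnderROC)
evaluator = BinaryClassificationEvaluator(labelCol="label")
# Calculate the area under the ROC curve
auc = evaluator.evaluate(predictions)
print(f"Area under ROC: {auc}")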
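
The metrics named above (accuracy, precision, recall, F1) all follow from the four confusion-matrix counts. As a plain-Python sketch, assuming the labels and predictions have been collected into two lists, they can be computed as follows; the function name and the sample lists are made up for illustration.

```python
def classification_metrics(labels, predictions):
    """Compute accuracy, precision, recall, and F1 for binary 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative data: 2 true positives, 2 true negatives, 1 false positive, 1 false negative
labels = [1, 0, 1, 1, 0, 0]
predictions = [1, 0, 0, 1, 1, 0]

metrics = classification_metrics(labels, predictions)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On a large Spark DataFrame you would normally compute these with an evaluator or an aggregation rather than by collecting rows to the driver; the sketch only shows the arithmetic.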

Detailed Example 1

Let's walk through a more detailed example with a larger dataset. We will use the popular Iris dataset, restricted to two species so that it becomes a binary classification problem.

Step 1: Load and Prepare the Data

First, load the Iris dataset and prepare it for binary classification.

from sklearn.datasets import load_iris
import pandas as pd

# Load Iris dataset
iris = load_iris()
df_iris = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df_iris['label'] = iris.target
# Convert to binary classification (setosa = 0 vs. versicolor = 1) by dropping virginica (2)
df_iris = df_iris[df_iris['label'] != 2].copy()
df_iris['label'] = df_iris['label'].apply(lambda x: 1 if x == 1 else 0)
# Convert to Spark DataFrame
df_spark = spark.createDataFrame(df_iris)
# Assemble features
assembler = VectorAssembler(inputCols=iris.feature_names, outputCol="features")
df_spark = assembler.transform(df_spark).select("features", "label")
df_spark.show()

Step 2: Train the Logistic Regression Model

# Split data into training and test sets
train_data, test_data = df_spark.randomSplit([0.7, 0.3])

# Initialize and train the logistic regression model
lr = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(train_data)

# Make predictions on test data
predictions = lr_model.transform(test_data)
predictions.select("features", "label", "prediction", "probability").show()  

Step 3: Evaluate the Model

# Evaluate model performance. The default rawPredictionCol ("rawPrediction") holds the
# model's scores, which give a more informative ROC curve than hard 0/1 predictions.
evaluator = BinaryClassificationEvaluator(labelCol="label")
auc = evaluator.evaluate(predictions)
print(f"Area under ROC: {auc}")
# Confusion matrix
predictions.groupBy("label", "prediction").count().show()
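
BinaryClassificationEvaluator reports the area under the ROC curve by default. One way to read that number: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way in plain Python on made-up scores; Spark computes the curve differently internally, so results on real data may differ slightly.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Illustrative labels and predicted probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Three of the four positive/negative pairs are ordered correctly
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one, independent of any particular decision threshold.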

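Advanced Topics

Regularization

Regularization helps prevent overfitting by penalizing large coefficients. PySpark's logistic regression supports L1 (Lasso) and L2 (Ridge) regularization: regParam sets the strength of the penalty, and elasticNetParam selects the mix (1.0 for pure L1, 0.0 for pure L2).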

# L1 Regularization
lr_l1 = LogisticRegression(featuresCol="features", labelCol="label", regParam=0.1, elasticNetParam=1.0)
lr_model_l1 = lr_l1.fit(train_data)
print(f"L1 Coefficients: {lr_model_l1.coefficients}")
# L2 Regularization
lr_l2 = LogisticRegression(featuresCol="features", labelCol="label", regParam=0.1, elasticNetParam=0.0)
lr_model_l2 = lr_l2.fit(train_data)
print(f"L2 Coefficients: {lr_model_l2.coefficients}")  
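
To show how regParam and elasticNetParam interact, here is a small sketch of the elastic-net penalty term, regParam × (α‖β‖₁ + (1 − α)/2 ‖β‖₂²) with α = elasticNetParam, which is the form given in the Spark MLlib documentation for linear methods. The coefficient values are hypothetical, and the sketch ignores details such as Spark's default feature standardization.

```python
def elastic_net_penalty(coefficients, reg_param, elastic_net_param):
    """Penalty added to the loss: reg_param * (alpha * L1 + (1 - alpha) / 2 * L2 squared)."""
    l1 = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) * 0.5 * l2_squared)

# Hypothetical coefficients (the intercept is not penalized)
coefs = [0.5, -2.0, 0.0]

print("L1 only:", elastic_net_penalty(coefs, 0.1, 1.0))
print("L2 only:", elastic_net_penalty(coefs, 0.1, 0.0))
print("Mixed:  ", elastic_net_penalty(coefs, 0.1, 0.5))
```

The L1 term grows linearly with each coefficient, which tends to push small coefficients to exactly zero, while the L2 term penalizes large coefficients more heavily and shrinks them smoothly.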

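Cross-Validation

Cross-validation helps select the best hyperparameters by splitting the data into several folds, training on all but one fold, validating on the held-out fold, and averaging the metric across folds.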

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Create parameter grid
param_grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.01, 0.1, 1.0]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

# Initialize CrossValidator
crossval = CrossValidator(estimator=lr, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

# Fit cross-validator
cv_model = crossval.fit(train_data)

# Best model
best_model = cv_model.bestModel
print(f"Best Model Coefficients: {best_model.coefficients}")
print(f"Best Model Intercept: {best_model.intercept}")   
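
It helps to see how much work a grid search like this implies. The sketch below rebuilds the same 3 × 3 grid in plain Python and shows one simple way to form 5 folds. The `k_fold_indices` helper is hypothetical and assigns rows by stride, whereas Spark's CrossValidator assigns rows to folds randomly.

```python
from itertools import product

reg_params = [0.01, 0.1, 1.0]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 5

# Every combination of the two hyperparameters
grid = [
    {"regParam": r, "elasticNetParam": e}
    for r, e in product(reg_params, elastic_net_params)
]

def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

print("parameter combinations:", len(grid))
print("model fits during the search:", len(grid) * num_folds)

for train, validation in k_fold_indices(10, num_folds):
    print("validate on", validation, "train on", train)
```

After the search, CrossValidator refits the estimator once more on the full training set with the best parameter combination, so the cost grows quickly with grid size and fold count.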

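Pipelines

In PySpark, a machine learning pipeline lets you chain multiple transformers and estimators into a single workflow.

from pyspark.ml import Pipeline

# The pipeline runs the assembler itself, so start from the raw (unassembled) columns
raw_df = spark.createDataFrame(df_iris)
raw_train, raw_test = raw_df.randomSplit([0.7, 0.3])

# Create pipeline
pipeline = Pipeline(stages=[assembler, lr])

# Fit pipeline
pipeline_model = pipeline.fit(raw_train)

# Make predictions
pipeline_predictions = pipeline_model.transform(raw_test)
pipeline_predictions.select("features", "label", "prediction", "probability").show()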
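
The key idea behind a pipeline is the split between estimators, which learn something in fit and return a fitted model, and transformers, which only apply a transformation. The toy sketch below mimics that contract in plain Python; every class and function in it is hypothetical and far simpler than the pyspark.ml API.

```python
class MeanCenterer:
    """Hypothetical estimator: learns the mean in fit() and returns a fitted model."""
    def fit(self, rows):
        return MeanCentererModel(sum(rows) / len(rows))

class MeanCentererModel:
    """Fitted model: subtracts the mean learned from the training data."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, rows):
        return [x - self.mean for x in rows]

class Doubler:
    """Hypothetical transformer: needs no fitting."""
    def transform(self, rows):
        return [2 * x for x in rows]

def fit_pipeline(stages, rows):
    """Fit each stage in order, feeding transformed data to the next stage."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)
        fitted.append(model)
    return fitted

def transform_pipeline(fitted, rows):
    """Apply already-fitted stages to new data without refitting anything."""
    for model in fitted:
        rows = model.transform(rows)
    return rows

train = [1.0, 2.0, 3.0]
fitted = fit_pipeline([MeanCenterer(), Doubler()], train)

# New data is centered with the training mean (2.0), not its own mean
print(transform_pipeline(fitted, [4.0]))
```

The same rule applies in PySpark: fit the pipeline once on training data, then reuse the fitted model's transform on test or new data.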

Detailed Example 2

Titanic Survival Prediction

Step 1: Set Up PySpark

First, make sure PySpark is installed and create a Spark session.

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("Titanic Logistic Regression") \
    .getOrCreate()  

Step 2: Load and Explore the Dataset

We load the Titanic dataset from a CSV file and do some initial data exploration.

# Load the Titanic dataset
titanic_df = spark.read.csv("path/to/titanic.csv", header=True, inferSchema=True)

# Display the schema
titanic_df.printSchema()

# Show the first few rows
titanic_df.show(5)  

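Step 3: Data Preprocessing

Preprocessing involves handling missing values, converting categorical features to numeric ones, and assembling the features into a single vector.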

from pyspark.sql.functions import col, when
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.feature import OneHotEncoder

# Fill missing values
titanic_df = titanic_df.fillna({"Age": titanic_df.select("Age").na.drop().approxQuantile("Age", [0.5], 0.0)[0], 
                                "Embarked": "S"})

# Convert categorical columns to numerical
indexers = [
    StringIndexer(inputCol="Sex", outputCol="SexIndex"),
    StringIndexer(inputCol="Embarked", outputCol="EmbarkedIndex")
]

# One-hot encode the categorical columns
encoders = [
    OneHotEncoder(inputCol="SexIndex", outputCol="SexVec"),
    OneHotEncoder(inputCol="EmbarkedIndex", outputCol="EmbarkedVec")
]

# Assemble all features into a single vector
assembler = VectorAssembler(
    inputCols=["Pclass", "Age", "SibSp", "Parch", "Fare", "SexVec", "EmbarkedVec"],
    outputCol="features"
)

# Apply indexers and encoders
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=indexers + encoders + [assembler])
titanic_df = pipeline.fit(titanic_df).transform(titanic_df)

# Select the final set of columns
titanic_df = titanic_df.select("features", col("Survived").alias("label"))

# Show the transformed data
titanic_df.show(5)  
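
To clarify what the indexing and one-hot steps do to a categorical column, here is a plain-Python sketch of the same idea. It assumes the default behaviors as I understand them: StringIndexer orders categories by descending frequency, and OneHotEncoder drops the last category so that it is represented by an all-zero vector. The helper names and the sample column are made up.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: i for i, category in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """One-hot encode an index into num_categories - 1 slots; the last category is all zeros."""
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0
    return vec

# Illustrative Embarked column
embarked = ["S", "C", "S", "Q", "S", "C"]
mapping = fit_string_index(embarked)
print(mapping)

for category in ("S", "C", "Q"):
    print(category, one_hot_drop_last(mapping[category], len(mapping)))
```

Because both the mapping and the vector length are learned from the data, they depend on which rows the indexer was fitted on, so the fitted preprocessing should be reused unchanged at prediction time.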

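Step 4: Split the Data into Training and Test Sets

# Split data into training and test sets (a 70/30 split is assumed here, as in the Iris example)
train_data, test_data = titanic_df.randomSplit([0.7, 0.3])

Step 5: Train the Logistic Regression Model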

from pyspark.ml.classification import LogisticRegression
# Initialize LogisticRegression
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Fit the model on the training data
lr_model = lr.fit(train_data)

# Print coefficients and intercept
print(f"Coefficients: {lr_model.coefficients}")
print(f"Intercept: {lr_model.intercept}")   

Step 6: Evaluate the Model

Evaluate the model's performance on the test data.

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Make predictions on the test data
predictions = lr_model.transform(test_data)

# Evaluate the model on the raw scores (the default rawPredictionCol) rather than hard 0/1 predictions
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
roc_auc = evaluator.evaluate(predictions)
print(f"ROC AUC: {roc_auc}")

# Show confusion matrix
predictions.groupBy("label", "prediction").count().show()   

Step 7: Tune Hyperparameters with Cross-Validation

Use cross-validation to find the best hyperparameters.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Create parameter grid for cross-validation
param_grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.01, 0.1, 1.0]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

# Initialize CrossValidator
crossval = CrossValidator(estimator=lr, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5)

# Fit cross-validator on the training data
cv_model = crossval.fit(train_data)

# Best model
best_model = cv_model.bestModel
print(f"Best Model Coefficients: {best_model.coefficients}")
print(f"Best Model Intercept: {best_model.intercept}")

# Evaluate the best model on test data
best_predictions = best_model.transform(test_data)
best_roc_auc = evaluator.evaluate(best_predictions)
print(f"Best ROC AUC: {best_roc_auc}")   

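Step 8: Make Predictions on New Data

Once the model is trained and evaluated, you can use it to make predictions on new data.

# Example of new data
new_data = spark.createDataFrame([
    (3, 22.0, 1, 0, 7.25, "male", "S"),
    (1, 38.0, 1, 0, 71.2833, "female", "C")
], ["Pclass", "Age", "SibSp", "Parch", "Fare", "Sex", "Embarked"])

# Apply the same transformations as the training data. The indexers and encoders must be
# fitted on the training data, not on new_data; otherwise the category indices and vector
# sizes can differ from what the model was trained on. Step 3 overwrote titanic_df, so
# this sketch refits the preprocessing pipeline on the raw CSV.
raw_df = spark.read.csv("path/to/titanic.csv", header=True, inferSchema=True).fillna({"Embarked": "S"})
preprocess_model = pipeline.fit(raw_df)
new_data = preprocess_model.transform(new_data)
new_data = new_data.select("features")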

# Make predictions
new_predictions = best_model.transform(new_data)
new_predictions.select("features", "prediction", "probability").show()   

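Conclusion

In this guide, we explored logistic regression in PySpark for binary classification tasks. Starting with an introduction to logistic regression and its mathematical basis, we moved on to installing PySpark, preparing data, and building logistic regression models with PySpark's MLlib. We covered the key steps, including data preprocessing, model training, evaluation with metrics such as accuracy and ROC AUC, and advanced topics such as regularization, cross-validation, and machine learning pipelines. The detailed examples, including the one on the Titanic dataset, showed one way to handle real-world data, transform it, and apply logistic regression for predictive modeling.

PySpark's framework and scalable architecture make it an effective tool for large-scale data processing and machine learning tasks. With it, data scientists and engineers can build, evaluate, and monitor logistic regression models for complex classification problems. Whether you are working with standard datasets or applying these techniques to specific business needs, understanding logistic regression in PySpark equips you to draw meaningful insights from large datasets and make data-driven decisions.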
