PySpark unionAll

January 31, 2025 | 3 min read

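In PySpark, `unionAll` is a transformation that merges DataFrames with the same schema into a single DataFrame by appending the rows of one DataFrame to another. It corresponds to SQL's `UNION ALL` and keeps duplicate rows. This guide explains `unionAll` in PySpark in detail, covering its syntax, usage, and practical examples.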

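Understanding `unionAll`

`unionAll` is a basic PySpark operation for combining DataFrames vertically: it concatenates the rows of multiple DataFrames into one. Note that in PySpark, `union` does not remove duplicates either. `unionAll` is an alias of `union`, and both keep every row from both inputs, including duplicates. To get SQL `UNION` semantics, where duplicates are removed, call `distinct()` on the result.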

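Syntax

The syntax of `unionAll` in PySpark is simple: `df1.unionAll(df2)`

Here `df1` and `df2` are the DataFrames to combine. For the operation to succeed, both DataFrames must have the same number of columns, with compatible types in the same positions. Columns are matched by position, not by name.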

Using `unionAll`

Example 1: Combining Two DataFrames

Consider two DataFrames that hold employee records:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a SparkSession
spark = SparkSession.builder \
    .appName("UnionAllExample") \
    .getOrCreate()

# Define the schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Department", StringType(), True)
])

# Sample data
data1 = [("Alice", 30, "HR"), ("Bob", 25, "Engineering")]
data2 = [("Charlie", 35, "Marketing")]

# Create DataFrames
df1 = spark.createDataFrame(data1, schema)
df2 = spark.createDataFrame(data2, schema)

# Combine DataFrames using unionAll
combined_df = df1.unionAll(df2)

# Show the combined DataFrame
combined_df.show()

Output

Name	Age	Department
Alice	30	HR
Bob	25	Engineering
Charlie	35	Marketing

In this example, `unionAll` combines `df1` and `df2` into a single DataFrame that contains the rows of both.

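Example 2: UnionAll with Different Column Names

Although `unionAll` expects both DataFrames to have the same schema, you can still use it with DataFrames whose column names differ by renaming the columns so that they line up. Because columns are matched by position, the result takes its column names from the first DataFrame; the rename mainly makes the intended alignment explicit.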

# Define schemas with different column names
schema1 = StructType([
    StructField("EmployeeName", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Department", StringType(), True)
])

schema2 = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Dept", StringType(), True)
])

# Create DataFrames with different schemas
data1 = [("Alice", 30, "HR"), ("Bob", 25, "Engineering")]
data2 = [("Charlie", 35, "Marketing"), ("David", 40, "Finance")]

df1 = spark.createDataFrame(data1, schema1)
df2 = spark.createDataFrame(data2, schema2)

# Rename columns to align schemas
df2 = df2.withColumnRenamed("Name", "EmployeeName").withColumnRenamed("Dept", "Department")

# Combine DataFrames using unionAll
combined_df = df1.unionAll(df2)

# Show the combined DataFrame
combined_df.show()

Output

EmployeeName	Age	Department
Alice	30	HR
Bob	25	Engineering
Charlie	35	Marketing
David	40	Finance

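Example 3: UnionAll with a Different Column Order

`unionAll` matches columns by position, not by name. If two DataFrames contain the same columns in a different order, the values of the second DataFrame end up under the wrong columns.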

# Define schemas with different column orders
schema1 = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Department", StringType(), True)
])

schema2 = StructType([
    StructField("Age", IntegerType(), True),
    StructField("Department", StringType(), True),
    StructField("Name", StringType(), True)
])

# Each data set follows the column order of its own schema
data1 = [("Alice", 30, "HR"), ("Bob", 25, "Engineering")]
data2 = [(35, "Marketing", "Charlie"), (40, "Finance", "David")]

df1 = spark.createDataFrame(data1, schema1)
df2 = spark.createDataFrame(data2, schema2)

# unionAll matches columns by position, so df2's values land in the wrong columns
combined_df = df1.unionAll(df2)
combined_df.show()

Output

Name	Age	Department
Alice	30	HR
Bob	25	Engineering
35	Marketing	Charlie
40	Finance	David

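In this case, `df1` and `df2` have the same column names but in a different order. Because `unionAll` matches by position, the rows of `df2` are placed under the wrong columns. Spark also has to reconcile the mismatched column types, so depending on the Spark version and its ANSI settings, the call may coerce the values to strings, as in the output above, or raise an error.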

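Best Practices and Performance Considerations

Schema alignment: Make sure both DataFrames have the same columns in the same order, or use `unionByName` to match columns by name, so that values are not misaligned.
Data consistency: Check that the data in the DataFrames is consistent, especially when combining information from different sources.
Performance: `unionAll` does not shuffle data across partitions; it simply concatenates the partitions of its inputs. The result therefore has as many partitions as the inputs combined, so after combining many DataFrames consider `coalesce` or `repartition` to keep the partition count manageable.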

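Conclusion

`unionAll` is an important PySpark operation for combining DataFrames vertically. It concatenates the rows of multiple DataFrames while keeping duplicates. By understanding its syntax, usage, and best practices, you can use `unionAll` correctly to simplify your data processing pipelines.

With the examples in this guide, you have seen how to use `unionAll` in several situations. Whether you are combining DataFrames that already share a schema or first aligning column names and column order, `unionAll` provides a flexible and efficient mechanism for data integration in PySpark.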
