Splitting a PySpark DataFrame

17 Mar 2025 | 5 min read

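When a dataset is very large, it can be helpful to split it into equal parts and process each resulting DataFrame separately. This is only possible when the operations on the DataFrame do not depend on relationships between rows. Each chunk, that is, each equally sized DataFrame, can then be processed in parallel, making efficient use of the available resources. In this article, we discuss how to split a PySpark DataFrame into parts with an equal number of rows. A DataFrame can also be split by columns, but this article focuses on rows.

Let's create a DataFrame for demonstration purposes.

First, we import the required module and then import SparkSession from the pyspark.sql module. We then create a SparkSession and give it an application name.

Next, we define the column names and the row data for the DataFrame.

Finally, we create the DataFrame from these values and display it.

Code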

# Import the pyspark package
import pyspark

# Import SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# Create the SparkSession and give the app a name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Column names for the DataFrame
columns = ["Brand", "Product"]

# Row data for the DataFrame
data = [
    ("HP", "Laptop"),
    ("Lenovo", "Mouse"),
    ("Dell", "Keyboard"),
    ("Samsung", "Monitor"),
    ("MSI", "Graphics Card"),
    ("Asus", "Motherboard"),
    ("Gigabyte", "Motherboard"),
    ("Zebronics", "Cabinet"),
    ("Adata", "RAM"),
    ("Transcend", "SSD"),
    ("Kingston", "HDD"),
    ("Toshiba", "DVD Writer")
]

# Create the DataFrame from the values above
prod_df = spark.createDataFrame(data=data,
                                schema=columns)

# Display the DataFrame
prod_df.show()

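Output

The code above defines the DataFrame's schema and supplies the sample data. The resulting DataFrame has two string columns and 12 records.

Now let's look at a few examples of splitting this PySpark DataFrame.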

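Example 1: Splitting the DataFrame using 'DataFrame.limit()'

In this example, we use the limit() method, together with subtract(), to create 'n' equal DataFrames.

Syntax

DataFrame.limit(num)

Here, limit() restricts the result to at most the specified number of rows, num.

In this code, we first define the number of splits we want. We then compute the number of rows in each split and assign the original DataFrame to a working variable, copy_df. Next, we iterate once per split: we take the first each_len rows with limit(), remove those rows from copy_df with subtract(), and display the split. Finally, we increment the split counter.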
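The row count per split is computed with integer division, so when the total is not a multiple of the number of splits, the leftover rows are never placed in any split (for example, 14 rows and 4 splits give 4 x 3 = 12 rows, leaving 2 out). The following is a minimal plain-Python sketch of one way to size the splits so that every row is accounted for. The helper name split_sizes is hypothetical and is not part of PySpark or of the example in this article.

```python
def split_sizes(total_rows: int, n_splits: int) -> list:
    """Return a row count for each split, spreading any remainder over the first splits."""
    if n_splits <= 0:
        raise ValueError("n_splits must be positive")
    base, remainder = divmod(total_rows, n_splits)
    # The first `remainder` splits take one extra row so that no row is dropped.
    return [base + 1 if i < remainder else base for i in range(n_splits)]


print(split_sizes(12, 4))  # [3, 3, 3, 3]
print(split_sizes(14, 4))  # [4, 4, 3, 3]
```

Each size could then be passed to limit() in turn instead of a single fixed each_len.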

Code

# Number of splits we want
n_splits = 4

# Number of rows in each split (integer division; any remainder rows are left out)
each_len = prod_df.count() // n_splits

# Working reference to the original DataFrame; DataFrames are immutable,
# so reassigning copy_df below does not modify prod_df
copy_df = prod_df

# Iterate once per split
i = 0
while i < n_splits:

    # Take the first `each_len` rows; without an explicit ordering,
    # Spark does not guarantee which rows limit() returns
    temp_df = copy_df.limit(each_len)

    # Remove the rows fetched into `temp_df` from `copy_df`;
    # subtract() is a set difference, so duplicate rows are also dropped
    copy_df = copy_df.subtract(temp_df)

    # Display the split
    temp_df.show(truncate=False)

    # Increment the split counter
    i += 1

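Output

With 12 rows and n_splits = 4, the loop displays four DataFrames of three rows each. Which rows land in which split depends on the order in which Spark returns them.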

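Example 2: Splitting the DataFrame, applying an operation, and concatenating the results

In this example, we split the DataFrame into equal parts, apply an operation to each part separately, and concatenate the results into result_df. It extends the previous code to show how to run a separate DataFrame operation on each split and then append the individual results to produce a new DataFrame with the same number of rows as the original.

We first define the number of splits and compute the number of rows in each split. We then assign the original DataFrame to a working variable and define a function that modifies the columns of each individual split. Next, we create an empty DataFrame to hold the concatenated results and iterate once per split. Inside the loop, we repeat the steps from the previous code: take the first each_len rows and remove them from the working DataFrame. Finally, we apply the operation to the new split, union it into result_df, and increment the split counter.

Code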

# Imports for the schema definition and the column functions
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import concat, col, lit

# Number of splits we want
n_splits = 4

# Number of rows in each split
each_len = prod_df.count() // n_splits

# Working reference to the original DataFrame
copy_df = prod_df


# Function that modifies the columns of each individual split:
# joins Brand and Product into a single string column.
# The alias names the output column to match the result schema below.
def modify_dataframe(data):
    return data.select(
        concat(col("Brand"), lit(" - "),
               col("Product")).alias("Brand - Product")
    )


# Empty DataFrame for storing the concatenated results
schema = StructType([
    StructField('Brand - Product', StringType(), True)
])
result_df = spark.createDataFrame(data=[],
                                  schema=schema)

# Iterate once per split
i = 0
while i < n_splits:

    # Take the first `each_len` rows
    temp_df = copy_df.limit(each_len)

    # Remove the rows fetched into `temp_df` from `copy_df`
    copy_df = copy_df.subtract(temp_df)

    # Apply the operation to the new split and display it
    temp_df_mod = modify_dataframe(data=temp_df)
    temp_df_mod.show(truncate=False)

    # Append the modified split to the result (union matches columns by position)
    result_df = result_df.union(temp_df_mod)

    # Increment the split counter
    i += 1

# Display the combined result
result_df.show(truncate=False)

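Output

Each of the four splits is displayed as a single concatenated string column with three rows (values of the form HP - Laptop), followed by result_df, which contains 12 rows.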

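Conclusion

In this article, we learned how to split a PySpark DataFrame and when doing so is useful: mainly with large datasets that you want to divide into equal chunks and process separately.

We created a demonstration DataFrame and worked through two examples. In the first, we split the DataFrame using DataFrame.limit(). In the second, we split the DataFrame, applied an operation to each split, and concatenated the results. We then looked at the corresponding output.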

