使用Python进行Apache Beam，构建数据管道

2025年1月4日 | 阅读7分钟

引言

Apache Beam 是一个开源 SDK，可用于构建分布式或直接的数据管道，这些管道基于批处理或流式集成。您可以为每个管道添加不同的转换。然而，Beam 的真正优势在于它不依赖于任何一个计算引擎，因此具有平台独立性。为了计算您的转换，您需要指定要使用的“运行器”。例如，您可以选择 Spark 引擎或 Cloud Dataflow，而不是默认使用本地计算资源。

通过利用 Beam 的 Python SDK，开发人员可以轻松且可扩展地构建复杂的数据管道。Apache Beam 抽象了执行层，允许用户在不更改管道代码的情况下，在 Google Cloud Dataflow、Apache Flink 和 Apache Spark 等多个执行引擎上运行其管道。关键概念包括 PCollections（数据）、PTransforms（操作）以及用于各种数据源和接收器的 I/O 连接器。Apache Beam 的灵活性使其成为处理大规模数据处理任务的强大工具，提供实时和批处理功能，因此对于现代数据工程和分析至关重要。

安装

截至本文撰写时（Apache Beam 2.8.1），仅支持 Python 2.7；Python 3 版本应该很快会发布。Beam 已知会在安装 Python Snappy 时崩溃，但这将在 Beam 2.9 中修复。

基本管道

Python 中的基本 Apache Beam 管道涉及三个主要步骤：读取数据、转换数据和写入结果。您首先定义一个 `Pipeline` 对象，然后创建一个 `PCollection` 来保存输入数据。接下来，应用 `PTransforms` 来处理数据，例如过滤、映射或聚合。最后，将转换后的数据写入输出接收器。Apache Beam 的灵活性允许在 Google Cloud Dataflow 或 Apache Flink 等不同后端上运行相同的管道，使其成为批处理和流处理的通用工具。这种方法以可扩展、高效的执行简化了复杂的数据工作流。

示例

 
import apache_beam as beam

# Define a function to transform the data
def transform_data(element):
    return element.upper()

with beam.Pipeline() as pipeline:
    # Read data from a text file
    input_data = pipeline | 'Read' >> beam.io.ReadFromText('input.txt')
    
    transformed_data = input_data | 'Transform' >> beam.Map(transform_data)
    
    
    transformed_data | 'Write' >> beam.io.WriteToText('output.txt')

result = pipeline.run()
result.wait_until_finish()

with open('output.txt-00000-of-00001', 'r') as f:
    output = f.read()

print(output)   

输出

 
HELLO WORLD
APACHE BEAM
DATA PROCESSING
PIPELINE EXAMPLE

说明

此示例演示了一个基本的数据转换 Apache Beam 管道。如果尚未安装，它将首先安装 Apache Beam 库。随后，定义一个管道，该管道执行三个主要任务：从 `input.txt` 接收文本数据，将每一行转换为大写，并将更改后的行发布到 `output.txt`。管道异步运行，并在配置时等待完成。处理后，Beam 可能会将输出文件分成多个部分（例如，{output.txt-00000-of-00001}），脚本会读取并输出这些部分的内容以确认转换。这说明了 Beam 使用简单但高效的代码来管理复杂数据处理作业的能力。

Beam 中的转换原理

在 Apache Beam 中，PCollection 对象表示要处理的数据集合。要开始处理数据，您首先通过使用读取操作将其摄取到 PCollection 中，读取操作本身就是一个转换。此转换从各种数据源读取数据，例如 CSV 文件、数据库或其他存储系统。

读取操作：这是将数据摄取到 PCollection 中的初始步骤。Read 转换从指定的数据源（例如 CSV 文件）读取并创建一个 PCollection。

应用转换：数据进入 PCollection 后，您可以应用各种转换来处理它。转换是修改或分析数据的操作，例如过滤、映射或聚合。

 
transformed_data = input_data | 'MapToUppercase' >> beam.Map(lambda x: x.upper())   

写入操作：应用转换后，通常将结果写入输出接收器，例如文件或数据库。

遵循这些准则，您可以创建一个数据处理管道，其中每个操作（读取、转换、写入）都表示为应用于 PCollection 的转换。通过这种模块化方法，可以使数据处理工作流灵活且可扩展。

示例

这是一个全面的示例，展示了如何使用 Apache Beam 从 CSV 文件读取数据、对其进行转换以及写入结果。此示例涉及的步骤包括读取 CSV 文件、将每一行转换为大写，然后将更新后的行写入输出文件。

首先，创建一个名为 data.csv 的示例 CSV 文件

 
# Create a sample CSV file
with open('data.csv', 'w') as f:
    f.write("hello world\n")
    f.write("apache beam\n")
    f.write("data processing\n")
    f.write("pipeline example\n")
Pipeline Code
!pip install apache-beam

import apache_beam as beam
def transform_data(element):
    return element.upper()
with beam.Pipeline() as pipeline:
    # Read data from a CSV file
    input_data = pipeline | 'ReadFromCSV' >> beam.io.ReadFromText('data.csv')
    
    transformed_data = input_data | 'MapToUppercase' >> beam.Map(transform_data)
    
    # Write the transformed data to an output text file
    transformed_data | 'WriteToText' >> beam.io.WriteToText('output.txt')

# Run the pipeline and wait until it's done
result = pipeline.run()
result.wait_until_finish()

# Read and print the results from the output file
with open('output.txt-00000-of-00001', 'r') as f:
    output = f.read()

print(output)
Read Data

input_data = pipeline | 'ReadFromCSV' >> beam.io.ReadFromText('data.csv')

Transform Data
transformed_data = input_data | 'MapToUppercase' >> beam.Map(transform_data)

Write Data
transformed_data | 'WriteToText' >> beam.io.WriteToText('output.txt')   

运行管道

 
result = pipeline.run()
result.wait_until_finish()   

打印输出

 
with open('output.txt-00000-of-00001', 'r') as f:
    output = f.read()
print(output)   

输出

输出文件 output.txt-00000-of-00001 将包含

 
HELLO WORLD
APACHE BEAM
DATA PROCESSING
PIPELINE EXAMPLE

说明

该代码演示了一个 Apache Beam 管道，该管道从 CSV 文件读取数据，将每一行转换为大写，并将结果写入输出文件。它首先创建一个示例 `data.csv` 文件。管道以 `ReadFromCSV` 转换开始，将 CSV 数据读取到 `PCollection` 中。然后使用 `beam.Map` 应用 `MapToUppercase` 转换，将每一行转换为大写。转换后的数据使用 `WriteToText` 转换写入 `output.txt`。通过 `pipeline.run()` 执行管道并等待其完成。最后，读取输出文件并打印结果，演示了 Beam 管道的基本步骤：读取、转换和写入数据。

Apache Beam

Apache Beam 是一个开源软件开发工具包，可通过批处理或流式集成，让您创建各种数据管道并直接或间接运行它们。每个管道都允许您添加不同的转换。然而，Beam 的真正优势在于其平台独立性，这源于它不依赖于任何特定的计算引擎。您需要指定要应用哪个“运行器”来计算转换。默认情况下使用本地计算机资源，您可以选择其他运行器，例如 Cloud Dataflow 或 Spark 引擎。

安装

要运行此示例，您需要安装 Apache Beam。如果您使用的是 Python，可以通过 pip 安装。

示例

 
import apache_beam as beam

def split_words(line):
    """Splits a line of text into individual words."""
    return line.split()

def count_words(elements):
    """Counts the frequency of each word in the list of words."""
    word_counts = {}
    for word in elements:
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts.items()

def format_result(word_count):
    """Formats the result for output."""
    word, count = word_count
    return f'{word}: {count}'

def run():
    # Define the pipeline
    with beam.Pipeline() as pipeline:
        # Read input data from a file
        lines = (
            pipeline
            | 'Read' >> beam.io.ReadFromText('input.txt')
        )
        
        # Split lines into words
        words = (
            lines
            | 'Split' >> beam.FlatMap(split_words)
        )
        
        # Count the words
        word_counts = (
            words
            | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
            | 'GroupAndSum' >> beam.CombinePerKey(sum)
        )
        
        # Format the results for output
        formatted_results = (
            word_counts
            | 'FormatResults' >> beam.Map(format_result)
        )
        
        # Write results to an output file
        formatted_results | 'Write' >> beam.io.WriteToText('output.txt')

if __name__ == '__main__':
    run()   

输出

 
hello: 2
world: 2
beam: 2

说明

Apache Beam 脚本对文本数据执行词频统计操作。它首先从 `input.txt` 文件读取行，并使用 `split_words` 函数将每一行拆分为单词。然后将每个单词映射到键值对 `(word, 1)`，接着使用 `beam.CombinePerKey(sum)` 对这些对进行分组并求和。使用 `format_result` 函数将结果格式化为易读的字符串。最后，将格式化的词频统计结果写入 `output.txt`。此管道处理文本、统计单词出现次数并保存结果，如果部署在 Apache Flink 或 Google Cloud Dataflow 等平台上，则可以分布式地处理数据。

结论

总之，我们介绍了如何使用 Apache Beam 创建数据管道，重点关注一个基本示例，即从 CSV 文件获取数据、对其进行处理并将其结果发布到输出文件。我们回顾了构建管道的基础知识，包括如何使用 `Read` 进行数据摄取、使用 `Map` 进行转换以及使用 `Write` 进行输出。该过程展示了 Beam 的模块化设计如何促进对大规模数据处理的高效管理。通过执行管道和管理结果，我们展示了 Beam 在执行分布式数据活动方面的适应性和强大功能，其代码既易于理解又可读。

下一主题Python 3 如何将字节读取为流

使用Python进行Apache Beam，构建数据管道

引言

安装

基本管道

示例

说明

Beam 中的转换原理

示例

说明

Apache Beam

安装

示例

说明

结论

联系信息

关注我们

教程

面试题

在线编译器

Python

Java

.Net Framework

AI, ML and Data Science

Cloud Technology

B.Tech and MCA

Web Technology

PHP

Software Testing

Technical Interview

Java Interview

Python

Web Interview

Database Interview

B.Tech / MCA

Important Interview

Software Testing Interview

Company Interviews

Online Compilers

Multiple Choice Questions

其他

使用Python进行Apache Beam，构建数据管道

引言

安装

基本管道

示例

说明

Beam 中的转换原理

示例

说明

Apache Beam

安装

示例

说明

结论

相关帖子

Python中的ctime

Python的Sublime Text

在线性代数中返回矩阵的Frobenius范数（Python）

Python中的numpy.cumprod()

Python中的Matplotlib.pyplot.show()

设置Python中的路径

Python的Google API客户端

如何在Matplotlib中绘制对数坐标轴

在Python列表中查找中位数

Python程序：获取字典的第一个和最后一个元素

订阅 Tpoint Tech

联系信息

关注我们

教程

面试题

在线编译器