Kafka - 分布式流媒体平台

2025 年 1 月 23 日 | 阅读 13 分钟

分布式流处理平台简介

分布式流处理平台是一个旨在跨多台机器或节点实时处理、分析和处理连续数据流的系统。这些平台通过将工作负载分布到计算机集群中，实现对大量数据的有效处理。它们支持可伸缩性、容错性和实时处理等各种功能，使其对于需要即时洞察的应用程序至关重要，例如金融交易、监控系统和实时分析。

Kafka - A Distributed Streaming Platform

使用 Apache Pulsar 的示例程序

Apache Pulsar 是一个开源的分布式消息和流处理平台。它为实时数据流提供了一个可伸缩、低延迟且持久的解决方案。下面是一个示例程序，演示如何使用 Apache Pulsar 从流中生产和消费消息。

设置 Apache Pulsar

1. 在本地模式下启动 Pulsar

在运行程序之前，请确保 Pulsar 正在运行。您可以使用以下命令在本地模式下启动 Pulsar

# Download Pulsar binary
wget https://archive.apache.org/dist/pulsar/pulsar-2.8.1/apache-pulsar-2.8.1-bin.tar.gz

# Extract the archive
tar xzf apache-pulsar-2.8.1-bin.tar.gz
cd apache-pulsar-2.8.1

# Start Pulsar in standalone mode
bin/pulsar standalone

2. 创建 Pulsar 主题

您可以使用 Pulsar CLI 创建主题

生产者和消费者示例

Pulsar 生产者 Python 代码

from pulsar import Client
# Create a Pulsar client instance
client = Client('pulsar://:6650')

# Create a producer on the topic
producer = client.create_producer('persistent://public/default/my-topic')

# Send messages to the topic
for i in range(10):
    message = f'Message {i}'
    producer.send(message.encode('utf-8'))
    print(f'Sent: {message}')

# Close the producer and client
producer.close()
client.close()

输出

Pulsar 消费者 Python 代码

from pulsar import Client, ConsumerType

# Create a Pulsar client instance
client = Client('pulsar://:6650')

# Create a consumer on the topic
consumer = client.subscribe('persistent://public/default/my-topic', 'my-subscription', consumer_type=ConsumerType.Shared)

# Receive messages from the topic
while True:
    msg = consumer.receive()
    print(f'Received message: {msg.data().decode("utf-8")}')
    consumer.acknowledge(msg)

# Close the consumer and client (in practice, this would be done on termination)
consumer.close()
client.close()

说明

生产者代码
- 创建一个连接到本地 Pulsar 代理的 Pulsar 客户端。
- 向主题 my-topic 生产 10 条消息。
- 每条消息都发送到 Pulsar 并打印到控制台。
消费者代码
- 创建一个 Pulsar 客户端和一个订阅 my-topic 主题的消费者。
- 持续接收主题中的消息并打印。
- 处理后确认每条消息，以确保不会重新投递。

此示例演示了 Apache Pulsar 生产和消费消息的简单用例，展示了分布式流处理平台在实际场景中的核心功能。

流处理模型

流处理模型旨在实时处理和分析连续数据流。这些模型定义了数据在系统中流动时如何处理，无论是应用简单的转换、按时间窗口聚合数据还是将流连接起来。

1. 无状态与有状态处理

无状态处理

无状态处理是指不依赖任何先前数据或上下文的操作。每个事件都是独立处理的，这意味着处理特定事件的结果不依赖于事件本身之外的任何信息。

示例：从整数流中过滤偶数。

使用 Apache Kafka Streams API（通过 Faust）的 Python 程序示例

import faust

app = faust.App('stateless-example', broker='kafka://:9092')

# Define the stream topic
topic = app.topic('numbers', value_type=int)

# Define the output stream topic
even_numbers_topic = app.topic('even-numbers', value_type=int)

@app.agent(topic)
async def process_numbers(numbers):
    async for number in numbers:
        if number % 2 == 0:  # Stateless processing: check if the number is even
            await even_numbers_topic.send(value=number)

if __name__ == '__main__':
    app.main()

说明

该程序处理来自 numbers 主题的数字流。
它过滤掉偶数并将它们发送到 even-numbers 主题。
每个数字的处理都是独立的，没有维护状态。

有状态处理

有状态处理涉及维护跨多个事件的某些状态或上下文的操作。这允许事件的处理依赖于先前事件或累积状态。

示例：计算文本消息流中每个单词的出现次数。

使用 Apache Kafka Streams API（通过 Faust）的 Python 程序示例

import faust

app = faust.App('stateful-example', broker='kafka://:9092')

# Define the stream topic
topic = app.topic('text-messages', value_type=str)

# Define a table (stateful) to keep the count of words
word_count_table = app.Table('word-counts', default=int)

@app.agent(topic)
async def count_words(messages):
    async for message in messages:
        words = message.split()
        for word in words:
            word_count_table[word] += 1
            print(f'Word: {word}, Count: {word_count_table[word]}')

if __name__ == '__main__':
    app.main()

输出

说明

该程序从 text-messages 主题读取文本消息流。
它使用表 (word_count_table) 维护所有消息中每个单词的有状态计数。
状态（单词计数）会随着新消息的到来而更新和使用。

2. 窗口和聚合

窗口是一种将数据流分成有限时间块（称为窗口）的方法，以应用聚合等操作，否则这些操作在无限流上是不可能实现的。

翻滚窗口

翻滚窗口是固定大小、不重叠的窗口。每个事件都恰好落入一个窗口。

示例：计算 10 秒窗口中的事件数。

使用 Apache Flink 和 PyFlink API 的 Python 程序示例

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import Time
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

# Set up the environment
env = StreamExecutionEnvironment.get_execution_environment()
settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
table_env = StreamTableEnvironment.create(env, environment_settings=settings)

# Define the stream source (for example, a socket stream)
input_stream = env.socket_text_stream('localhost', 9999)

# Apply a tumbling window of 10 seconds and count the events
windowed_counts = (
    input_stream
    .time_window_all(Time.seconds(10))
    .reduce(lambda i1, i2: int(i1) + int(i2))
)

windowed_counts.print()

# Execute the program
env.execute('Tumbling Window Example')

输出

说明

该程序处理输入流（例如，来自套接字）。
它应用 10 秒的翻滚窗口，其中每个窗口内的所有事件都将被计数。
每个窗口都独立处理，结果在每个窗口结束时打印。

滑动窗口

滑动窗口与先前的窗口重叠。每个事件都可以是多个窗口的一部分。

示例：计算每 5 秒滑动一次的 10 秒滑动窗口中的事件。

使用 Apache Flink 和 PyFlink API 的 Python 程序示例

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import Time
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

# Set up the environment
env = StreamExecutionEnvironment.get_execution_environment()
settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
table_env = StreamTableEnvironment.create(env, environment_settings=settings)

# Define the stream source (for example, a socket stream)
input_stream = env.socket_text_stream('localhost', 9999)

# Apply a sliding window of 10 seconds that slides every 5 seconds
windowed_counts = (
    input_stream
    .time_window_all(Time.seconds(10), Time.seconds(5))
    .reduce(lambda i1, i2: int(i1) + int(i2))
)

windowed_counts.print()

# Execute the program
env.execute('Sliding Window Example')

输出

说明

该程序处理一个输入流，该输入流具有一个 10 秒的滑动窗口，每 5 秒前进一次。
每个事件都可能对多个窗口的结果有所贡献。

3. 事件时间处理

事件时间处理是指根据事件生成的时间（事件时间）而不是接收时间（处理时间）来处理事件。这在事件可能延迟或乱序到达的场景中至关重要。

示例：使用事件时间处理事件并处理延迟数据。

使用 Apache Flink 和 PyFlink API 的 Python 程序示例

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import ProcessFunction, KeyedProcessFunction
from pyflink.datastream.time_characteristic import TimeCharacteristic
from pyflink.datastream.window import Time
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

# Set up the environment with event time
env = StreamExecutionEnvironment.get_execution_environment()
env.set_stream_time_characteristic(TimeCharacteristic.EventTime)
settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
table_env = StreamTableEnvironment.create(env, environment_settings=settings)

# Define the stream source (for example, a socket stream)
input_stream = env.socket_text_stream('localhost', 9999)

# Assume the input stream events have a timestamp and a value
# For example: "1622547800000,5" where 1622547800000 is the timestamp in milliseconds

def extract_timestamp(line):
    timestamp, value = map(int, line.split(','))
    return timestamp, value

timestamped_stream = input_stream.map(extract_timestamp)

# Apply event time processing with a window of 10 seconds
windowed_counts = (
    timestamped_stream
    .assign_timestamps_and_watermarks(lambda x: x[0])  # Assign event time
    .time_window_all(Time.seconds(10))
    .reduce(lambda i1, i2: (i1[0], i1[1] + i2[1]))  # Sum values within the window
)

windowed_counts.print()

# Execute the program
env.execute('Event Time Processing Example')

输出

说明

该程序处理一个输入流，其中每个事件都有一个时间戳。
事件时间处理用于通过为事件分配时间戳来正确处理延迟数据。
基于事件时间而不是处理时间应用 10 秒的窗口。

4. 流连接和复杂事件处理 (CEP)

流连接允许根据相关事件组合多个流。复杂事件处理 (CEP) 涉及检测跨流的模式或事件序列。

流连接

示例程序

# Define the environment and sources as before
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import Time
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
table_env = StreamTableEnvironment.create(env, environment_settings=settings)

# Define the stream sources
click_stream = env.socket_text_stream('localhost', 9998)
purchase_stream = env.socket_text_stream('localhost', 9999)

# Assume the streams have user_id and event information
# For example: "user1,click" and "user1,purchase"

def parse_event(line):
    user_id, event_type = line.split(',')
    return user_id, event_type

parsed_clicks = click_stream.map(parse_event)
parsed_purchases = purchase_stream.map(parse_event)

# Perform a windowed join between clicks and purchases on user_id
joined_stream = (
    parsed_clicks.join(parsed_purchases)
    .where(lambda x: x[0])  # Join on user_id
    .equal_to(lambda x: x[0])
    .window(Time.seconds(10))  # Join events within a 10-second window
    .apply(lambda click, purchase: (click[0], click[1], purchase[1]))  # Combine click and purchase information
)

joined_stream.print()

# Execute the program
env.execute('Stream Join Example')

输出

说明

该程序根据 user_id 在两个流之间执行连接。
连接在 10 秒窗口内执行，这意味着如果来自两个流的事件发生在此时间范围内，则它们会匹配。
结果包括用户 ID、点击事件和购买事件。

复杂事件处理 (CEP)

复杂事件处理 (CEP) 涉及识别随时间推移的事件模式和序列。这对于检测欺诈、异常或操作序列等模式非常有用。

示例：使用 Apache Flink CEP 检测“登录后购买”模式。

使用 Apache Flink 和 PyFlink API 的 Python 程序示例

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream import PatternStream
from pyflink.datastream.functions import ProcessFunction
from pyflink.table import EnvironmentSettings, StreamTableEnvironment
from pyflink.datastream import Pattern

# Set up the environment
env = StreamExecutionEnvironment.get_execution_environment()
settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
table_env = StreamTableEnvironment.create(env, environment_settings=settings)

# Define the stream source
event_stream = env.socket_text_stream('localhost', 9998)

# Assume the stream has event information
# For example: "user1,login" and "user1,purchase"

def parse_event(line):
    user_id, event_type = line.split(',')
    return user_id, event_type

parsed_events = event_stream.map(parse_event)

# Define a pattern to detect "login followed by purchase"
pattern = Pattern.begin('start').where(lambda x: x[1] == 'login').followed_by('next').where(lambda x: x[1] == 'purchase')

# Apply the pattern to the event stream
pattern_stream = PatternStream(parsed_events, pattern)

# Define a function to process matched patterns
class PatternMatch(ProcessFunction):
    def process_element(self, value, ctx: 'ProcessFunction.Context'):
        print(f"Detected pattern: {value}")

# Apply the processing function
pattern_stream.process(PatternMatch())

# Execute the program
env.execute('CEP Example')

输出

说明

该程序从流中读取事件，每个事件都包含 user_id 和 event_type。
它使用 Apache Flink 的 CEP 库来定义登录事件后跟购买事件的模式。
当检测到此模式时，它会触发处理函数，在此示例中，该函数只是打印匹配的模式。

高级分布式流处理平台

随着对实时数据处理的需求持续增长，分布式流处理平台正在不断发展，以应对新的挑战并利用新兴技术。以下是一些未来趋势

无服务器流处理架构
边缘计算和物联网流处理
AI 和机器学习与流数据集成

1. 无服务器流处理架构

无服务器架构使开发人员能够专注于编写代码，而无需担心底层基础设施。在流处理平台上下文中，无服务器架构抽象了服务器管理、扩展和配置问题。

示例：AWS Lambda 与 Kinesis Data Streams

AWS Lambda 可用于处理 Kinesis Data Streams 中的事件，而无需管理服务器。

Python 程序示例

1. 创建 Lambda 函数

创建一个 Python Lambda 函数来处理 Kinesis 流中的记录

import json
def lambda_handler(event, context):
    for record in event['Records']:
        # Decode the Kinesis record payload
        payload = json.loads(record['kinesis']['data'])
        print(f"Processed record: {payload}")
    
    return {
        'statusCode': 200,
        'body': json.dumps('Successfully processed records')
    }

说明

Lambda 函数由 Kinesis Data Streams 触发。
它处理流中的每个记录，解码有效负载并打印。
您无需管理服务器；AWS Lambda 处理扩展和执行。

2. 配置 Lambda 函数

在 AWS 管理控制台中设置 Kinesis Data Stream。
创建一个新的 Lambda 函数并将其配置为使用 Python 运行时。
将函数的触发器设置为您创建的 Kinesis Data Stream。

2. 边缘计算和物联网流处理

边缘计算涉及在数据生成的地方（网络边缘）附近处理数据，以减少延迟和带宽使用。物联网流处理需要实时处理和分析来自众多物联网设备的数据。

示例：在 IoT 边缘设备上使用 Apache Flink

Python 程序示例

1. 使用 Flink 设置边缘处理

对于边缘处理，您可以在物联网设备上部署轻量级 Flink 集群。下面是一个在边缘设备上运行的 Flink 作业的概念示例

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import Time
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

# Set up the environment
env = StreamExecutionEnvironment.get_execution_environment()
settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
table_env = StreamTableEnvironment.create(env, environment_settings=settings)

# Define the stream source from IoT sensors (e.g., temperature data)
iot_stream = env.socket_text_stream('localhost', 9998)

def parse_sensor_data(line):
    sensor_id, temperature = line.split(',')
    return sensor_id, float(temperature)

parsed_data = iot_stream.map(parse_sensor_data)

# Apply a sliding window to analyze temperature data
windowed_avg = (
    parsed_data
    .key_by(lambda x: x[0])  # Key by sensor_id
    .time_window(Time.minutes(5), Time.minutes(1))  # 5-minute window sliding every minute
    .reduce(lambda acc, x: (x[0], (acc[1] + x[1]) / 2))  # Calculate average temperature
)

windowed_avg.print()

# Execute the program
env.execute('Edge Processing Example')

说明

该程序从物联网传感器读取温度数据。
它使用滑动窗口计算平均温度。
处理在边缘设备上进行，以最大程度地减少延迟和带宽使用。

3. AI 和机器学习与流数据集成

将 AI 和机器学习与流数据集成，可实现实时分析、预测和决策。此趋势涉及将模型应用于流数据，以执行异常检测、推荐系统等任务。

示例：使用 Apache Flink 和 TensorFlow 进行实时异常检测

Python 程序示例

1. 准备机器学习模型

假设您有一个用于异常检测的预训练 TensorFlow 模型。

import tensorflow as tf

# Load pre-trained model
model = tf.keras.models.load_model('path_to_your_model')

def predict_anomaly(data):
    # Preprocess data
    data = preprocess(data)
    prediction = model.predict(data)
    return prediction
 

2. 与 Flink 集成以进行实时预测

将模型与 Apache Flink 集成以应用实时预测。

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import ProcessFunction
import tensorflow as tf

# Load the pre-trained model
model = tf.keras.models.load_model('path_to_your_model')

class AnomalyDetection(ProcessFunction):
    def process_element(self, value, ctx: 'ProcessFunction.Context'):
        # Preprocess data
        data = preprocess(value)
        # Predict anomaly
        prediction = model.predict(data)
        if prediction > threshold:
            print(f"Anomaly detected: {value}")

# Set up the environment
env = StreamExecutionEnvironment.get_execution_environment()
settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
table_env = StreamTableEnvironment.create(env, environment_settings=settings)

# Define the data source
data_stream = env.socket_text_stream('localhost', 9999)

# Apply anomaly detection
anomaly_stream = data_stream.process(AnomalyDetection())

# Execute the program
env.execute('Real-Time Anomaly Detection Example')

输出

说明

使用预训练的 TensorFlow 模型进行异常检测。
该模型与 Flink 作业集成，其中流中的数据实时处理。
对传入数据进行预测，并检测和处理异常。

分布式流处理平台中的数据管理

分布式流处理平台中的数据管理涉及几个关键组件

数据摄取和集成
流存储
序列化格式
模式演进和管理

让我们通过示例程序来探讨这些组件。

1. 数据摄取和集成

数据摄取是将数据收集并引入流处理平台的过程。集成涉及将来自各种来源的数据组合成统一的流。

示例：用于数据摄取的 Kafka Connect

Kafka Connect 是一个用于将各种来源的数据可伸缩、可靠地摄取到 Kafka 的工具。

使用文件源连接器设置 Kafka Connect

1. 创建连接器配置

创建文件源连接器配置 (file-source-connector.properties)

name=file-source-connector
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/path/to/input/file.txt
topic=test-topic

2. 启动 Kafka Connect

使用配置启动 Kafka Connect

bin/connect-standalone.sh config/connect-standalone.properties config/file-source-connector.properties

说明

文件源连接器：从文件读取数据并将其流式传输到 Kafka 主题。
配置：指定要读取的文件和要写入的 Kafka 主题。

2. 流存储

流存储涉及数据如何在流处理平台内存储和管理。例如，Kafka 将数据存储在日志中。

示例：Kafka 日志存储

Kafka 将数据存储在主题分区中，每个分区都是一个日志文件。每个日志文件都由一系列记录组成。

使用 Kafka 写入和读取日志的程序

1. Kafka 生产者 (Python)

from kafka import KafkaProducer

# Create a Kafka producer
producer = KafkaProducer(bootstrap_servers='localhost:9092')

# Send messages to the topic 'test-topic'
for i in range(10):
    message = f"Log message {i}"
    producer.send('test-topic', value=message.encode('utf-8'))
    print(f"Sent: {message}")

# Close the producer
producer.close()

2. Kafka 消费者 (Python)

from kafka import KafkaConsumer

# Create a Kafka consumer
consumer = KafkaConsumer(
    'test-topic',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    group_id='test-group'
)

# Receive messages from the topic
for message in consumer:
    print(f"Received: {message.value.decode('utf-8')}")

说明

生产者：将日志消息写入 Kafka 主题。
消费者：从 Kafka 主题读取日志消息。

3. 序列化格式

序列化格式用于在通过网络发送数据之前对其进行编码。常见的格式包括 Avro 和 Protobuf。

示例：Avro 序列化

1. 定义 Avro 模式

Create a schema file (user.avsc):
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}

2. Avro 序列化的 Python 代码

Install the fastavro library:
pip install fastavro
Producer Example:
import fastavro
from kafka import KafkaProducer
from io import BytesIO

# Define Avro schema
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"}
    ]
}

# Create Kafka producer
producer = KafkaProducer(bootstrap_servers='localhost:9092')

# Serialize data using Avro
def serialize_avro(record):
    buf = BytesIO()
    writer = fastavro.writer(buf, schema, [record])
    return buf.getvalue()

# Produce Avro serialized data
record = {"name": "Alice", "age": 30}
producer.send('avro-topic', value=serialize_avro(record))

消费者示例

import fastavro
from kafka import KafkaConsumer
from io import BytesIO

# Define Avro schema
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"}
    ]
}

# Create Kafka consumer
consumer = KafkaConsumer(
    'avro-topic',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    group_id='avro-group'
)

# Deserialize Avro data
def deserialize_avro(data):
    buf = BytesIO(data)
    reader = fastavro.reader(buf, schema)
    return list(reader)[0]

# Receive Avro serialized messages
for message in consumer:
    record = deserialize_avro(message.value)
    print(f"Received: {record}")

说明

Avro 序列化：使用 Avro 格式序列化和反序列化数据，提供紧凑快速的数据编码。
生产者：将 Avro 编码数据发送到 Kafka。
消费者：从 Kafka 接收和解码 Avro 数据。

4. 模式演进和管理

模式演进允许数据模式随时间变化而不会破坏兼容性。这对于管理分布式系统中的数据更改至关重要。

示例：带有 Avro 的模式注册表

1. 启动模式注册表

假设您已安装 Confluent 模式注册表

# Start Schema Registry
bin/schema-registry-start config/schema-registry.properties

2. 注册 Avro 模式

使用模式注册表 REST API 注册模式

# Register schema with Schema Registry
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
    --data '{"schema": "{ \"type\": \"record\", \"name\": \"User\", \"fields\": [ { \"name\": \"name\", \"type\": \"string\" }, { \"name\": \"age\", \"type\": \"int\" } ] }"}' \
    https://:8081/subjects/avro-topic/versions

说明

模式注册表：管理和提供数据序列化和反序列化的模式。
模式注册：向模式注册表注册模式，以确保兼容性并管理模式演进。

下一主题Kafka-高级流处理

Python

Java

.Net Framework

AI, ML and Data Science

Cloud Technology

B.Tech and MCA

Web Technology

PHP

Software Testing

Technical Interview

Java Interview

Python

Web Interview

Database Interview

B.Tech / MCA

Important Interview

Software Testing Interview

Company Interviews

Online Compilers

Multiple Choice Questions

Kafka 教程

Kafka 安装

Kafka CLI

Kafka 编程

实时示例

Kafka 监控

Kafka Connect

Kafka Streams

杂项

Kafka - 分布式流媒体平台

分布式流处理平台简介

使用 Apache Pulsar 的示例程序

设置 Apache Pulsar

生产者和消费者示例

Pulsar 消费者 Python 代码

流处理模型

1. 无状态与有状态处理

无状态处理

有状态处理

2. 窗口和聚合

翻滚窗口

滑动窗口

3. 事件时间处理

4. 流连接和复杂事件处理 (CEP)

流连接

复杂事件处理 (CEP)

高级分布式流处理平台

1. 无服务器流处理架构

1. 创建 Lambda 函数

2. 配置 Lambda 函数

2. 边缘计算和物联网流处理

1. 使用 Flink 设置边缘处理

3. AI 和机器学习与流数据集成

1. 准备机器学习模型

2. 与 Flink 集成以进行实时预测

分布式流处理平台中的数据管理

1. 数据摄取和集成

1. 创建连接器配置

2. 启动 Kafka Connect

2. 流存储

1. Kafka 生产者 (Python)

2. Kafka 消费者 (Python)

3. 序列化格式

1. 定义 Avro 模式

2. Avro 序列化的 Python 代码

4. 模式演进和管理

1. 启动模式注册表

相关帖子

Apache Kafka - 集群架构

Kafka 中的系统监控和警报

系统迁移中的 Kafka 组件

Kafka 消费者重新平衡

Kafka 中的集群迁移

Apache Kafka vs Apache Storm

Kafka 日志压缩

Apache Kafka vs RabbitMQ

Kafka 复制

使用 Kafka 进行地理分布式事件流

订阅 Tpoint Tech

联系信息