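Dead Letter Queues for Kafka to Cosmos DB Failures

May 16, 2025 | 9 min read

Introduction to Kafka Dead Letter Queues (DLQ)

What Is a Dead Letter Queue (DLQ)?

A dead letter queue (DLQ) is a dedicated Kafka topic that stores messages that could not be processed, for reasons such as:

Message format problems (invalid JSON, missing fields)
Serialization/deserialization failures
Downstream service failures (for example, Cosmos DB is unavailable)
Expired time-sensitive data

Rather than dropping failed messages or retrying them forever, a DLQ isolates the bad messages so they can be debugged, reprocessed, or used to trigger alerts.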
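
The failure categories above can be told apart in code before a message is routed. The following is a minimal sketch, assuming a hypothetical schema whose required fields are id, name, and rating (borrowed from the sample messages used later in this article). The function name and the reason labels are illustrative and are not part of any Kafka API.

```python
import json

# Hypothetical required fields, chosen to match the sample song messages.
REQUIRED_FIELDS = ("id", "name", "rating")

def find_failure_reason(raw_value: bytes):
    """Return a short failure label for a raw record value, or None if it looks valid."""
    try:
        payload = json.loads(raw_value.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return "deserialization_error"
    if not isinstance(payload, dict):
        return "invalid_format"
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        return "missing_fields:" + ",".join(missing)
    return None

print(find_failure_reason(b'{"id": 1, "name": "Song A", "rating": 4.5}'))
print(find_failure_reason(b'{"id": 2, "name": "Song B"}'))
print(find_failure_reason(b'not json'))
```

A label like this can travel with the message into the DLQ, which makes later triage much easier than inspecting raw payloads.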

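Why Does a Kafka to Cosmos DB Pipeline Need a DLQ?

When Kafka is integrated with Azure Cosmos DB, failures can occur at several points. A DLQ provides:

Data integrity: failed messages are not lost.
Error recovery: failed messages can be replayed once the problem is fixed.
Operational efficiency: fewer retries, and healthy messages are not blocked behind failing ones.

DLQ Architecture Overview for Kafka to Cosmos DB

Kafka producer: publishes messages to the main topic.
Kafka consumer: reads from the main topic and writes to Cosmos DB.
Failure detection: if a message fails, it is pushed to the DLQ topic.
DLQ processor: later, messages in the DLQ are retried or inspected manually.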
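
The flow above can be sketched without a broker. This is a simplified in-memory model, not the pipeline itself: plain Python lists stand in for the main topic, the DLQ topic, and Cosmos DB, and the names run_pipeline and validate_rating are hypothetical.

```python
def run_pipeline(main_topic, process):
    """Send each message to the sink, or to the DLQ when processing fails."""
    sink, dlq = [], []
    for message in main_topic:
        try:
            sink.append(process(message))
        except Exception as error:
            # Failure detection: keep the original message and record why it failed.
            dlq.append({"message": message, "error": str(error)})
    return sink, dlq

def validate_rating(message):
    """Stand-in for the Cosmos DB write: rejects non-numeric ratings."""
    if not isinstance(message["rating"], (int, float)):
        raise ValueError("Invalid rating data type")
    return message

main_topic = [
    {"id": 1, "rating": 4.5},
    {"id": 2, "rating": "invalid_rating"},
    {"id": 3, "rating": 5.0},
]
sink, dlq = run_pipeline(main_topic, validate_rating)
print(len(sink), "written,", len(dlq), "dead-lettered")
```

The important property is that one bad message does not stop the loop: the healthy messages before and after it still reach the sink.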

Code Example: Setting Up a Basic Kafka DLQ

We will create:

A Kafka producer that sends messages.
A Kafka consumer that processes messages and simulates failures.
A DLQ producer that stores failed messages in the DLQ.

Step 1: Start Kafka and Create the Topics

Run these commands to set up the Kafka topics:

# Create main topic
kafka-topics.sh --create --topic main-topic --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1

# Create DLQ topic
kafka-topics.sh --create --topic dlq-topic --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

Step 2: Kafka Producer (Sending Messages to Kafka)

This producer sends messages to main-topic.

producer.py

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

messages = [
    {"id": 1, "name": "Song A", "rating": 4.5},
    {"id": 2, "name": "Song B", "rating": "invalid_rating"},  # Invalid data type
    {"id": 3, "name": "Song C", "rating": 5.0}
]

for message in messages:
    producer.send('main-topic', message)
    print(f"Sent: {message}")

producer.flush()
producer.close()

Step 3: Kafka Consumer (Processing Messages & Handling Failures)

This consumer reads from main-topic, simulates a failure, and sends failed messages to the DLQ.

consumer.py

from kafka import KafkaConsumer, KafkaProducer
import json

# Kafka consumer
consumer = KafkaConsumer(
    'main-topic',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    group_id='consumer-group-1',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

# DLQ producer
dlq_producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def process_message(message):
    """Simulates message processing and raises an error if data is invalid."""
    try:
        if not isinstance(message["rating"], (int, float)):
            raise ValueError("Invalid rating data type")
        print(f"Processed: {message}")
    except Exception as e:
        print(f"Error: {e}, sending to DLQ")
        dlq_producer.send('dlq-topic', message)

# Read messages from main topic
for msg in consumer:
    process_message(msg.value)

consumer.close()
dlq_producer.close()
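
The consumer above forwards only the original payload, so the reason for the failure is lost by the time someone inspects the DLQ. One common remedy is to wrap the payload in an envelope before sending it. The sketch below shows one way to do that; the field names (payload, error_type, error_message, source, failed_at) are assumptions made for illustration and are not a Kafka convention.

```python
import json
import time

def build_dlq_record(payload, error, topic, partition, offset, now=None):
    """Wrap a failed payload with the context needed to debug or replay it."""
    return {
        "payload": payload,
        "error_type": type(error).__name__,
        "error_message": str(error),
        "source": {"topic": topic, "partition": partition, "offset": offset},
        "failed_at": time.time() if now is None else now,
    }

try:
    raise ValueError("Invalid rating data type")
except ValueError as exc:
    record = build_dlq_record(
        {"id": 2, "name": "Song B", "rating": "invalid_rating"},
        exc, topic="main-topic", partition=0, offset=1, now=1700000000.0,
    )

# The envelope is plain JSON, so the existing value_serializer can encode it.
print(json.dumps(record, indent=2))
```

In consumer.py, the topic, partition, and offset attributes of each consumed record could supply the source fields.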

Step 4: DLQ Consumer (Inspecting Failed Messages)

This consumer reads messages from dlq-topic so that we can debug the failures.

dlq_consumer.py

from kafka import KafkaConsumer
import json

# DLQ consumer
dlq_consumer = KafkaConsumer(
    'dlq-topic',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    group_id='dlq-group',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

print("DLQ Messages:")
for msg in dlq_consumer:
    print(f"Failed message: {msg.value}")

dlq_consumer.close()

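Step 5: Run the Pipeline

Run each of these commands in a separate terminal.

1. Start Kafka and create the topics

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

2. Run the producer

python producer.py

3. Run the consumer

python consumer.py

4. Run the DLQ consumer

python dlq_consumer.py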

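Implementing a DLQ for Cosmos DB Failures in Kafka

Why Retry Before Sending to the DLQ?

Instead of pushing a failed message to the dead letter queue (DLQ) immediately:

Retry the operation a few times (network problems, for example, are often temporary).
Send the message to the DLQ only after every retry has failed.

This reduces the number of messages in the DLQ and keeps transient errors from being recorded as failures.

Handling Cosmos DB Write Failures

Common causes of Cosmos DB write failures:

Rate limit exceeded (request units, RU)
Transient network problems
Schema validation errors
Duplicate ID constraint violations

Instead of dropping the message, we will:

Retry the message (for transient problems).
Log the error (for permanent problems).
Send the message to the DLQ (if every retry fails).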
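
Choosing between these actions usually comes down to the HTTP status code of the failed write. The sketch below assumes the commonly documented Cosmos DB codes: 429 for rate limiting, 408 and 503 for timeouts and unavailability, with anything else (such as 400 or 409) treated as permanent. The names TRANSIENT_STATUS_CODES and next_step are hypothetical, logging is left out, and the exact code list should be checked against the Cosmos DB documentation. With the azure-cosmos SDK, the code should be available as the status_code attribute of the raised exception.

```python
# Status codes treated as transient in this sketch (worth retrying).
TRANSIENT_STATUS_CODES = {408, 429, 503}

def next_step(status_code, attempt, max_attempts=3):
    """Return 'retry' or 'dlq' for a failed write (attempt is 1-based)."""
    if status_code in TRANSIENT_STATUS_CODES and attempt < max_attempts:
        return "retry"
    # Permanent errors and exhausted retries both end up in the DLQ.
    return "dlq"

for status_code, attempt in [(429, 1), (429, 3), (409, 1)]:
    print(status_code, attempt, next_step(status_code, attempt))
```

Sending permanent errors straight to the DLQ avoids spending the backoff time on a write that cannot succeed.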

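Code Implementation: Kafka Consumer with Retries (Before Sending to the DLQ)

We will modify our Kafka consumer to:

Retry the Cosmos DB write before sending the message to the DLQ
Log the reason for each failure
Use exponential backoff between retries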
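
To make the backoff concrete, the sketch below computes the wait before each retry. The 2-second initial delay and the doubling factor match the consumer in this section; the max_delay cap and the function name backoff_delays are added assumptions.

```python
def backoff_delays(retries=3, initial_delay=2.0, factor=2.0, max_delay=60.0):
    """Return the wait (in seconds) before each retry, capped at max_delay."""
    delays = []
    delay = initial_delay
    for _ in range(retries):
        delays.append(min(delay, max_delay))
        delay *= factor
    return delays

print(backoff_delays())                   # [2.0, 4.0, 8.0]
print(backoff_delays(6, max_delay=30.0))  # [2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

A cap matters once the retry count grows, because an uncapped doubling delay quickly exceeds any reasonable processing deadline.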

Step 1: Install Dependencies

pip install kafka-python azure-cosmos

Step 2: Kafka Producer (Sending Messages to Kafka)

producer.py

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

messages = [
    {"id": "1", "name": "Song A", "rating": 4.5},
    {"id": "2", "name": "Song B", "rating": "invalid_rating"},  # Invalid data type
    {"id": "3", "name": "Song C", "rating": 5.0}
]

for message in messages:
    producer.send('main-topic', message)
    print(f"Sent: {message}")

producer.flush()
producer.close()

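Step 3: Kafka Consumer (Retry, Then Send to the DLQ)

This consumer:

Reads from main-topic
Attempts to write to Cosmos DB
Makes up to 3 write attempts (with exponential backoff)
Sends the message to the DLQ if every attempt fails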

consumer.py

from kafka import KafkaConsumer, KafkaProducer
from azure.cosmos import CosmosClient, exceptions
import json
import time

# Kafka consumer
consumer = KafkaConsumer(
    'main-topic',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    group_id='consumer-group',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

# Kafka producer for DLQ
dlq_producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Cosmos DB Setup
COSMOS_DB_URL = "https://your-cosmos-db.documents.azure.com:443/"
COSMOS_DB_KEY = "your-primary-key"
DATABASE_NAME = "MusicDB"
CONTAINER_NAME = "Songs"

client = CosmosClient(COSMOS_DB_URL, COSMOS_DB_KEY)
database = client.get_database_client(DATABASE_NAME)
container = database.get_container_client(CONTAINER_NAME)

def write_to_cosmos(message):
    """Attempts to write to Cosmos DB with retries."""
    retries = 3
    delay = 2  # Initial delay in seconds

    for attempt in range(retries):
        try:
            container.create_item(body=message)
            print(f"Successfully written to Cosmos DB: {message}")
            return True
        except exceptions.CosmosHttpResponseError as e:
            print(f"Cosmos DB Error (Attempt {attempt + 1}): {e}")
            time.sleep(delay)
            delay *= 2  # Exponential backoff

    return False  # If all retries fail

# Consume messages
for msg in consumer:
    message = msg.value
    success = write_to_cosmos(message)

    if not success:
        print(f"Message failed after retries. Sending to DLQ: {message}")
        dlq_producer.send('dlq-topic', message)

consumer.close()
dlq_producer.close()
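
The retry loop above retries every Cosmos DB error and also sleeps after the final attempt. A possible refinement is sketched below, using only the standard library so it can be run without a broker or a database. The exception classes TransientError and PermanentError are hypothetical stand-ins for whatever distinction your error handling draws, and the sleep function is injectable so the timing can be checked without waiting.

```python
import time

class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as a throttled write."""

class PermanentError(Exception):
    """Hypothetical stand-in for a failure that retrying cannot fix."""

def write_with_retries(write, message, retries=3, delay=2.0, sleep=time.sleep):
    """Call write(message) up to `retries` times; return True on success."""
    for attempt in range(1, retries + 1):
        try:
            write(message)
            return True
        except PermanentError:
            return False  # Retrying the same payload would fail again.
        except TransientError:
            if attempt < retries:
                sleep(delay)  # No sleep after the final attempt.
                delay *= 2    # Exponential backoff
    return False

# Usage: a fake writer that is throttled twice and then succeeds.
calls = {"count": 0}

def flaky_write(message):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("throttled")

waits = []
ok = write_with_retries(flaky_write, {"id": "1"}, sleep=waits.append)
print(ok, waits)
```

A False return value plays the same role as in the consumer above: the caller sends the message to dlq-topic.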

Step 4: DLQ Consumer (Handling Failed Messages)

dlq_consumer.py

from kafka import KafkaConsumer
import json

dlq_consumer = KafkaConsumer(
    'dlq-topic',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    group_id='dlq-group',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

print("DLQ Messages:")
for msg in dlq_consumer:
    print(f"Failed message: {msg.value}")

dlq_consumer.close()

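Step 5: Run the Pipeline

Run each of these commands in a separate terminal.

1. Start Kafka and create the topics

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

2. Run the producer

python producer.py

3. Run the consumer (retry, then send to the DLQ)

python consumer.py

4. Run the DLQ consumer

python dlq_consumer.py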

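Monitoring and Automating DLQ Handling for Cosmos DB Failures in Kafka

Why Monitor the DLQ?

A DLQ is not a final destination. It stores failed messages, but we still have to:

Track how many messages enter the DLQ.
Identify the most common causes of failure.
Reprocess messages once the error has been fixed.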
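
Identifying the common failure causes can start as a simple tally. The sketch below assumes each DLQ record carries an error_type field (the same field the classification step later in this article reads); the sample records and the function name summarize_failures are made up for illustration.

```python
from collections import Counter

def summarize_failures(dlq_records):
    """Count DLQ records per error type, most common first."""
    counts = Counter(record.get("error_type", "Unknown") for record in dlq_records)
    return counts.most_common()

dlq_records = [
    {"id": "2", "error_type": "SchemaValidationError"},
    {"id": "7", "error_type": "RateLimitExceeded"},
    {"id": "9", "error_type": "RateLimitExceeded"},
    {"id": "11"},
]
for error_type, count in summarize_failures(dlq_records):
    print(f"{error_type}: {count}")
```

A summary dominated by one error type usually points to a single root cause, which is a better use of debugging time than inspecting messages one by one.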

Setting Up DLQ Monitoring with Prometheus & Grafana

We will integrate Prometheus to collect Kafka DLQ metrics and use Grafana to visualize them.

Step 1: Install Prometheus & Grafana

sudo apt update
sudo apt install prometheus grafana

Step 2: Configure Prometheus to Monitor Kafka

Edit the Prometheus configuration (prometheus.yml):

global:
  scrape_interval: 15s  

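scrape_configs:
  - job_name: 'kafka'
    static_configs:
      # Scrape the metrics endpoint exposed by dlq_metrics.py (port 8000).
      # The Kafka broker port 9092 does not serve Prometheus metrics.
      - targets: ['localhost:8000']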

Start Prometheus

Setting Up the DLQ Metrics Exporter

We will create a Kafka consumer that:

Reads DLQ messages
Counts the failures
Exposes the metric to Prometheus

dlq_metrics.py

from kafka import KafkaConsumer
from prometheus_client import start_http_server, Counter
import json

# Define Prometheus metric
failed_messages_counter = Counter('failed_messages_total', 'Total number of failed messages')

# Kafka consumer for DLQ
consumer = KafkaConsumer(
    'dlq-topic',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    group_id='dlq-monitoring-group',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

# Start Prometheus server
start_http_server(8000)

print("Monitoring DLQ messages...")
for msg in consumer:
    print(f"DLQ Message: {msg.value}")
    failed_messages_counter.inc()  # Increment metric

consumer.close()

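Step 3: View DLQ Metrics in Prometheus

With dlq_metrics.py running, open the /metrics path on port 8000 of that machine (http://localhost:8000/metrics for a local run).

You should see: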

# TYPE failed_messages_total counter
failed_messages_total 1

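Step 4: Set Up a Grafana Dashboard for Kafka DLQ Monitoring

1. Open Grafana (port 3000 by default; http://localhost:3000 for a local install)

2. Add Prometheus as a data source

3. Create a new dashboard

4. Add a graph panel

5. Enter a PromQL query on the failed_messages_total counter

6. Click Save and Apply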

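Code: Automated DLQ Replay

The replay script:

Consumes DLQ messages
Attempts to write them to Cosmos DB again
On success, moves on; the committed consumer offset marks the message as handled (Kafka does not delete individual records from a topic)
On failure, publishes the message back to the DLQ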

dlq_replayer.py

from kafka import KafkaConsumer, KafkaProducer
from azure.cosmos import CosmosClient, exceptions
import json
import time

# Kafka consumer for DLQ
consumer = KafkaConsumer(
    'dlq-topic',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    group_id='dlq-replayer-group',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
COSMOS_DB_URL = "https://your-cosmos-db.documents.azure.com:443/"
COSMOS_DB_KEY = "your-primary-key"
DATABASE_NAME = "MusicDB"
CONTAINER_NAME = "Songs"

client = CosmosClient(COSMOS_DB_URL, COSMOS_DB_KEY)
database = client.get_database_client(DATABASE_NAME)
container = database.get_container_client(CONTAINER_NAME)

def write_to_cosmos(message):
    """Attempts to write to Cosmos DB."""
    try:
        container.create_item(body=message)
        print(f"Reprocessed successfully: {message}")
        return True
    except exceptions.CosmosHttpResponseError as e:
        print(f"Still failing: {message} - Error: {e}")
        return False

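# Replaying DLQ messages
for msg in consumer:
    message = msg.value
    success = write_to_cosmos(message)

    if not success:
        # Caution: the message is re-published to the topic this consumer reads,
        # so a permanently failing message will be replayed again indefinitely.
        print(f"Resending message to DLQ: {message}")
        producer.send('dlq-topic', message)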

consumer.close()
producer.close()

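Step 5: Run the Replay Process

Run this script after the underlying problem has been fixed (for example, after correcting a schema issue in Cosmos DB).

Step 6: Expected Output

Before the problem is fixed (first run), each write fails again: the script reports "Still failing" and resends the message to the DLQ.

After the problem is fixed (second run), the script reports "Reprocessed successfully" for each message.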
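
The difference between the two runs can be modelled in memory. This is only an illustration of the replay logic: strict_write is a hypothetical stand-in for the Cosmos DB write, and the "fix" here is a corrected rating value rather than a real schema change.

```python
def replay(dlq_messages, write):
    """Try to write each DLQ message; return (reprocessed, still_failing)."""
    reprocessed, still_failing = [], []
    for message in dlq_messages:
        try:
            write(message)
            reprocessed.append(message)
        except Exception:
            still_failing.append(message)
    return reprocessed, still_failing

def strict_write(message):
    """Stand-in for a container that rejects non-numeric ratings."""
    if not isinstance(message["rating"], (int, float)):
        raise ValueError("Invalid rating data type")

dlq_messages = [{"id": "2", "name": "Song B", "rating": "invalid_rating"}]

# First run: the underlying problem is still present.
done, failing = replay(dlq_messages, strict_write)
print("first run:", len(done), "reprocessed,", len(failing), "still failing")

# Second run: the data has been corrected before replaying.
fixed = [dict(message, rating=4.0) for message in failing]
done, failing = replay(fixed, strict_write)
print("second run:", len(done), "reprocessed,", len(failing), "still failing")
```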

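Scaling Dead Letter Queues (DLQ) for Large Workloads

Challenges in Large-Scale DLQ Processing

As data volume grows, the DLQ can become a bottleneck if it is not processed efficiently.

Key challenges include:

High failure rates - thousands of messages enter the DLQ.
Reprocessing delays - slow retries cause a backlog.
Single-consumer bottleneck - one consumer may not be able to keep up.
Storage overhead - the DLQ grows without bound.

Optimizing the DLQ for High-Throughput Failures

To improve performance, we can:

Partition the DLQ - spread the load across multiple consumers.
Process in parallel - use a Kafka consumer group for replay.
Filter intelligently with Kafka Streams - classify errors automatically.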

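Step 1: Partition the DLQ for Efficiency

Instead of a single-partition DLQ topic, we can give the topic several partitions so that processing can run in parallel.

Create a partitioned DLQ topic

bin/kafka-topics.sh --create --topic dlq-topic --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092

If dlq-topic already exists with one partition from the earlier setup, --create fails; increase the partition count with --alter instead:

bin/kafka-topics.sh --alter --topic dlq-topic --partitions 3 --bootstrap-server localhost:9092

Kafka now spreads failed messages across 3 partitions, so they can be processed faster.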
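
How a message lands on a partition depends on its key: keyed records are assigned by hashing the key modulo the partition count, while the producers in this article send no key, so their records are spread without regard to content. The sketch below illustrates the keyed case only. It uses zlib.crc32 as a stand-in hash, which is an assumption for illustration and is not the hash Kafka clients actually use, so the partition numbers will differ from a real cluster.

```python
import zlib

def pick_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index. crc32 is a stand-in, not Kafka's real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

for key in ["song-1", "song-2", "song-3", "song-1"]:
    print(key, "->", pick_partition(key, 3))
```

The property that carries over to Kafka is that the same key always maps to the same partition, which keeps repeated failures for one item in order on a single partition.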

Step 2: Update the DLQ Consumer to Handle Partitions

We will modify our DLQ consumer to:

Use a Kafka consumer group
Process messages in parallel

dlq_consumer_partitioned.py

from kafka import KafkaConsumer
import json
import threading

def process_dlq_partition(consumer):
    """Processes messages from a single DLQ partition."""
    for msg in consumer:
        print(f"Processing DLQ message from partition {msg.partition}: {msg.value}")

# Create multiple consumers
consumers = [
    KafkaConsumer(
        'dlq-topic',
        bootstrap_servers='localhost:9092',
        group_id='dlq-processing-group',
        auto_offset_reset='earliest',
        enable_auto_commit=True,
        value_deserializer=lambda x: json.loads(x.decode('utf-8'))
    )
    for _ in range(3)  # One consumer per partition
]

# Start threads for each consumer
threads = []
for consumer in consumers:
    thread = threading.Thread(target=process_dlq_partition, args=(consumer,))
    thread.start()
    threads.append(thread)

# Wait for threads to complete
for thread in threads:
    thread.join()
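
Within a consumer group, each partition is read by at most one consumer at a time, which is why the script starts one consumer per partition. The sketch below is a simplified round-robin model of that assignment, not the assignor Kafka actually runs, and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions across consumers (a simplified model of group assignment)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment

print(assign_partitions([0, 1, 2], ["c1", "c2", "c3"]))
print(assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"]))
```

In the second call the fourth consumer receives no partitions, so adding consumers beyond the partition count does not add throughput.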

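Step 3: Advanced DLQ Processing with Kafka Streams

Instead of retrying blindly, we can analyze each failure before retrying it.

Kafka Streams is designed for this kind of filter-and-route topology. It lets us:

Separate transient errors from permanent errors
Route transient errors to automatic retry
Send permanent errors to a separate archive

Step 3.1: Classifying DLQ Messages

Kafka Streams itself is a Java library, so the Python script below reproduces the same consume, classify, and route pattern with a plain kafka-python consumer and two producers. The processor:

Checks the error type
Retries only transient failures
Archives permanent failures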

dlq_streams_processor.py

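from kafka import KafkaProducer, KafkaConsumer
import json

# Define error categories.
# Assumes each DLQ message carries an "error_type" field; the consumers shown
# earlier forward the raw payload, so they would need to add it before sending.
TRANSIENT_ERRORS = ["TimeoutException", "RateLimitExceeded"]
PERMANENT_ERRORS = ["SchemaValidationError", "InvalidDataError"]  # Anything not transient is archived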

# Kafka consumer for DLQ
consumer = KafkaConsumer(
    'dlq-topic',
    bootstrap_servers='localhost:9092',
    group_id='dlq-streams-group',
    auto_offset_reset='earliest',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

# Kafka producers
retry_producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

archive_producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

def classify_and_route(message):
    """Classifies DLQ messages and routes them accordingly."""
    error_type = message.get("error_type", "Unknown")

    if error_type in TRANSIENT_ERRORS:
        print(f"Retrying message: {message}")
        retry_producer.send('retry-topic', message)
    else:
        print(f"Archiving message: {message}")
        archive_producer.send('archive-topic', message)

# Process messages from DLQ
for msg in consumer:
    classify_and_route(msg.value)

consumer.close()
retry_producer.close()
archive_producer.close()
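
Because the retry consumer in the next step moves failed retries back to dlq-topic, and this processor routes transient errors to retry-topic again, a message that keeps failing can circulate between the two topics. One way to bound that loop is sketched below. The retry_rounds field and the MAX_RETRY_ROUNDS limit are assumptions introduced for illustration; the counter travels inside the message itself.

```python
TRANSIENT_ERRORS = {"TimeoutException", "RateLimitExceeded"}
MAX_RETRY_ROUNDS = 3  # Hypothetical cap on DLQ -> retry-topic round trips.

def choose_topic(message):
    """Return (topic, message) for a DLQ message, counting retry rounds."""
    rounds = message.get("retry_rounds", 0)
    if message.get("error_type") in TRANSIENT_ERRORS and rounds < MAX_RETRY_ROUNDS:
        # Copy the message so the caller's dict is not modified.
        return "retry-topic", dict(message, retry_rounds=rounds + 1)
    return "archive-topic", message

print(choose_topic({"id": "7", "error_type": "RateLimitExceeded"}))
print(choose_topic({"id": "7", "error_type": "RateLimitExceeded", "retry_rounds": 3}))
print(choose_topic({"id": "2", "error_type": "SchemaValidationError"}))
```

Once the cap is reached, a transient error is archived like a permanent one, so it can be reviewed by hand instead of consuming retry capacity.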

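Step 4: Implement the Automatic Retry Consumer

retry-topic contains messages that are safe to retry.

We will create a consumer that:

Retries sending the data to Cosmos DB
Moves failed retries back to the DLQ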

retry_consumer.py

from kafka import KafkaConsumer, KafkaProducer
from azure.cosmos import CosmosClient, exceptions
import json
import time

# Kafka consumer for retry topic
consumer = KafkaConsumer(
    'retry-topic',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    group_id='retry-group',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

# Kafka producer for failed retries (back to DLQ)
dlq_producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Cosmos DB Setup
COSMOS_DB_URL = "https://your-cosmos-db.documents.azure.com:443/"
COSMOS_DB_KEY = "your-primary-key"
DATABASE_NAME = "MusicDB"
CONTAINER_NAME = "Songs"

client = CosmosClient(COSMOS_DB_URL, COSMOS_DB_KEY)
database = client.get_database_client(DATABASE_NAME)
container = database.get_container_client(CONTAINER_NAME)

def write_to_cosmos(message):
    """Makes a single attempt to write the message to Cosmos DB."""
    try:
        container.create_item(body=message)
        print(f"Successfully retried: {message}")
        return True
    except exceptions.CosmosHttpResponseError as e:
        print(f"Retry failed: {message} - Error: {e}")
        return False

# Process retry messages
for msg in consumer:
    message = msg.value
    success = write_to_cosmos(message)

    if not success:
        print(f"Moving message back to DLQ: {message}")
        dlq_producer.send('dlq-topic', message)

consumer.close()
dlq_producer.close()

Step 5: Run the Optimized DLQ Pipeline

1. Start Kafka and create the topics

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
bin/kafka-topics.sh --create --topic dlq-topic --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092
bin/kafka-topics.sh --create --topic retry-topic --partitions 2 --replication-factor 1 --bootstrap-server localhost:9092
bin/kafka-topics.sh --create --topic archive-topic --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092
