??????? Event-Driven Architecture — Engineering Handbook
EDA
Event-Driven Architecture
EDA-HB-2024 v1.0
Engineering Handbook

Event-Driven
Architecture

The backbone of modern enterprise scaling
"Don't call us, we'll call you — the Hollywood Principle applied to distributed systems."

The definitive field guide to building systems that react, not poll. Covers the full spectrum: Kafka partitions to RabbitMQ routing, event sourcing to CQRS, saga orchestration to schema evolution — everything required to design, build, and operate event-driven systems at scale.

Apache Kafka Apache Pulsar RabbitMQ Event Sourcing Pub/Sub
01

What is Event-Driven Architecture

// FUNDAMENTALS & MOTIVATIONS

Event-Driven Architecture (EDA) is a design paradigm in which system components communicate by producing and consuming events — immutable records of things that have happened — rather than calling each other directly. Producers have no knowledge of consumers. Consumers have no knowledge of producers. The broker sits in between and is the only coupling point.

This inversion is the source of EDA's power: temporal decoupling (producer and consumer need not be available simultaneously), spatial decoupling (they need not know each other's address), and semantic decoupling (they agree only on the event schema, not on processing logic).

Temporal Decoupling Core

Producer and consumer do not need to be online at the same time. Events persist in the broker until consumed. Services scale and deploy independently without coordination windows.

Spatial Decoupling Core

Producers publish to a topic, not to a specific consumer endpoint. Adding new consumers requires zero changes to producers. Services discover data through subscriptions, not service registries.

Scale & Resilience Benefit

Consumers scale horizontally against the event backlog. Slow consumers don't block producers. Back-pressure is managed by the broker, not by synchronous timeouts cascading through a call chain.

EDA vs REST vs RPC: REST/RPC is request/response — caller blocks waiting for a reply, both parties must be up, and the caller must know the callee's address. EDA is fire-and-forget at the producer side. For read-heavy query paths, REST is still appropriate. EDA shines on write/mutation flows and anywhere you need fan-out, replay, or audit.

When EDA is the right choice

ScenarioEDA FitWhy
Order placed → multiple downstream actions (fulfillment, email, analytics)EXCELLENTFan-out without producer knowing consumers; consumers added independently
Real-time fraud detection on transactionsEXCELLENTStream processing on event log; millisecond latency with Kafka Streams/Flink
Audit trail / compliance logEXCELLENTImmutable event log is the audit record; replay for compliance queries
Simple CRUD API, single consumerPOORAdds operational complexity with no benefit; use REST
Interactive query (user asks for their profile data)POORSynchronous response required; use REST + query model (CQRS read side)
Cross-service saga / distributed transactionGOODChoreography or orchestration via events; avoids 2PC distributed locking
02

Core Messaging Patterns

// THE FOUR BUILDING BLOCKS

Every EDA system is built from four primitive patterns. These compose into the larger architectures — pub/sub, event sourcing, stream processing — described in later sections. Understanding the primitives clarifies why different brokers make different design choices.

PATTERN 01 — MESSAGE QUEUE
Point-to-Point Queue

Each message is consumed by exactly one consumer. Multiple consumers compete for messages from the same queue — load balancing with guaranteed single delivery. Classic use: background job processing, task distribution. Work queue model.

PATTERN 02 — PUBLISH/SUBSCRIBE
Topic Fan-Out

Each message is delivered to all subscribers. Producers publish to a topic; every subscription receives a copy. Classic use: event notification, audit events, data replication. N consumers all see every event.

PATTERN 03 — EVENT STREAMING
Durable Ordered Log

Events are written to an immutable, ordered, partitioned log. Consumers track their own offset. Any consumer can re-read history. New consumers can start from the beginning. Classic use: Kafka, Kinesis. Time-travel is the superpower.

PATTERN 04 — REQUEST/REPLY
Async RPC over Events

Caller publishes a request event with a correlationId and a replyTo queue. Responder publishes the reply to that queue. Caller awaits. Bridges synchronous-seeming semantics over async infrastructure. Use sparingly.

Kafka blurs the queue/pub-sub boundary using consumer groups: all consumers in the same group share partitions (queue semantics — each message processed once); consumers in different groups each receive all messages (pub-sub semantics). This single primitive covers both use cases.
03

Event Anatomy

// WHAT AN EVENT IS (AND ISN'T)

An event is an immutable record of a fact that occurred in the past. Not a command (don't say "please process this"). Not a request. A fact: "Order 8271 was placed by user 903 at 14:22:05 UTC containing 3 items totalling $142.00." Events are named in past tense.

✕ Commands (avoid as events)
  • ProcessOrder — imperative, implies a specific handler
  • SendEmail — prescribes implementation
  • UpdateInventory — tells receiver what to do
  • ChargeCustomer — couples business logic to event
  • Present/future tense names
  • Contains processing instructions
✓ Domain Events (correct)
  • OrderPlaced — past tense, describes reality
  • PaymentSucceeded — fact, consumer decides what to do
  • InventoryReserved — records what happened
  • CustomerCharged — consequence, not instruction
  • Past tense names always
  • Contains only the data of what happened

Canonical Event Envelope

JSON / CloudEvents v1.0
// CloudEvents v1.0 — industry standard envelope { // ─── Required: CloudEvents attributes ─────────────────── "specversion": "1.0", "id": "01HWZR5F4P7TKMXX9BQ2GV1ERD", // ULID — sortable UUID "source": "/orders-service/production", "type": "com.acme.orders.OrderPlaced", // reverse-DNS + past tense "time": "2024-09-14T14:22:05.182Z", // ISO 8601 UTC always // ─── Standard optional attributes ─────────────────────── "datacontenttype": "application/json", "dataschema": "https://schemas.acme.com/orders/v2/order-placed.json", "subject": "order/8271", // resource identifier // ─── Custom extensions ─────────────────────────────────── "correlationid": "sess_abc123", // trace correlation "causationid": "cmd_xyz789", // triggering command "tenantid": "acme-corp", // multi-tenancy "version": 2, // schema version // ─── Domain payload ────────────────────────────────────── "data": { "orderId": "8271", "customerId": "cust_903", "placedAt": "2024-09-14T14:22:05.182Z", "currency": "USD", "totalAmount": 142.00, "lineItems": [ { "sku": "PROD-441", "qty": 2, "unitPrice": 49.99 }, { "sku": "PROD-109", "qty": 1, "unitPrice": 42.02 } ], "shippingAddress": { /* ... */ } } }
Fat vs Thin events: Fat events include full payload (the order data above). Thin events include only the ID and type — consumers must fetch full state if needed. Fat events are self-contained but expensive at scale. Thin events require consumers to have read access to the source system. Prefer fat events for loose coupling; use thin events when payloads are very large or contain regulated data.
04

Publish / Subscribe

// DECOUPLED FAN-OUT AT SCALE

In a pub/sub model, producers (publishers) emit events to a topic. Subscribers express interest in topics and receive all matching events. The broker is responsible for delivering the event to all registered subscribers. Neither party knows the other exists — only the topic name is shared.

Producer
Order
Service
Topic
orders.placed
Consumer 1
Email
Service
Consumer 2
Fulfillment
Service
Consumer 3
Analytics
Service
Each consumer receives every event independently — adding Consumer 4 requires zero changes to Order Service
Strengths Use When
  • Multiple downstream reactions to a single business event
  • Adding new consumers without touching producers
  • Event notification — "this happened, react however you want"
  • Cross-domain data propagation (CDC events, domain events)
Limitations Watch Out
  • No built-in ordering guarantee across consumers
  • Consumer failures need independent dead-letter handling
  • Complex debugging — causality chains are non-obvious
  • Schema changes propagate to all consumers simultaneously
05

Event Sourcing

// STATE AS A FUNCTION OF HISTORY

Event Sourcing is a persistence pattern where the state of an entity is never stored directly. Instead, only the sequence of events that led to the current state is persisted. Current state is derived by replaying events from the beginning (or from a snapshot). The event log is the database.

✕ Traditional CRUD State Store
  • Store current state only — history is lost
  • UPDATE overwrites — no audit trail
  • Temporal queries impossible (what did order look like last Tuesday?)
  • Can't replay to fix a bug in processing logic
  • Simple read path: SELECT * WHERE id = X
✓ Event Sourced Store
  • Append-only event log — full history preserved
  • Audit trail is free — every change recorded
  • Time travel: replay to any point in time
  • Fix bugs: replay with corrected logic to rebuild projections
  • Write is fast (append); reads need projection rebuild
TypeScript — Event Sourced Aggregate
// Events are the source of truth — append only type OrderEvent = | { type: 'OrderPlaced'; orderId: string; customerId: string; total: number } | { type: 'ItemAdded'; orderId: string; sku: string; qty: number } | { type: 'PaymentReceived'; orderId: string; amount: number; txId: string } | { type: 'OrderShipped'; orderId: string; trackingNo: string } | { type: 'OrderCancelled'; orderId: string; reason: string }; // State is a pure function of events (fold/reduce) function applyEvent(state: OrderState, event: OrderEvent): OrderState { switch (event.type) { case 'OrderPlaced': return { ...state, status: 'pending', total: event.total, customerId: event.customerId }; case 'PaymentReceived': return { ...state, status: 'paid', paidAmount: event.amount }; case 'OrderShipped': return { ...state, status: 'shipped', trackingNo: event.trackingNo }; case 'OrderCancelled': return { ...state, status: 'cancelled', cancelReason: event.reason }; default: return state; } } // Reconstitute state from event history async function loadOrder(orderId: string): Promise<OrderState> { const events = await eventStore.load(orderId); // or from snapshot return events.reduce(applyEvent, INITIAL_STATE); } // Append a new event — never UPDATE, only INSERT async function shipOrder(orderId: string, trackingNo: string, expectedVersion: number) { const event: OrderEvent = { type: 'OrderShipped', orderId, trackingNo }; await eventStore.append(orderId, event, expectedVersion); // optimistic concurrency await eventBus.publish('orders.shipped', event); // fan-out to consumers }

Snapshots — Performance at Scale

Replaying thousands of events to reconstitute state on every read is expensive. Snapshots periodically capture current state alongside the event sequence number. On load, restore from the latest snapshot and replay only subsequent events.

Event sourcing is not a silver bullet. The write model is simple; the read model is complex. You need projections (read models) for every query pattern. Schema evolution of historical events is a significant engineering challenge. Use event sourcing when audit trail, time-travel, or event replay are genuine business requirements — not by default.
06

CQRS — Command Query Responsibility Segregation

// SEPARATE READ AND WRITE PATHS

CQRS separates the model used for writes (commands) from the model used for reads (queries). They can use different databases, different schemas, and scale independently. Commands mutate state and publish events; queries read from purpose-built projections optimized for the specific view.

Client
API
Command
PlaceOrder
CancelOrder
Write DB
Event Store
PostgreSQL
Event Bus
Kafka
Query
GetOrder
ListOrders
Read DB
Elasticsearch
Redis
Event bus (Kafka) feeds projectors that maintain purpose-built read models — denormalized, query-optimized, polyglot
Read Model Design Rules
  • One read model per query pattern — no compromise
  • Denormalize ruthlessly — joins at read time are expensive
  • Optimize for the exact query shape the UI needs
  • Rebuild-ability is a requirement: projectors must be replayable
  • Eventual consistency with the write model — always
Projection Technology Choices
  • Elasticsearch: full-text search, faceted navigation
  • Redis: hot read paths, leaderboards, counters
  • PostgreSQL (materialized view): complex relational queries
  • DynamoDB: single-table design, predictable latency
  • ClickHouse / BigQuery: analytics queries over large datasets
07

Stream Processing

// COMPUTE OVER CONTINUOUS EVENT FLOWS

Stream processing is real-time computation over unbounded event sequences. Unlike batch processing (which operates on finite datasets at rest), stream processing applies transformations, aggregations, and joins continuously as events arrive — with sub-second latency.

FrameworkModelLatencyBest ForManaged Option
Kafka Streams Library (no cluster) ms Kafka-native, small-medium topologies, microservice-embedded Confluent Cloud
Apache Flink Cluster, stateful DAG ms Complex CEP, large state, exactly-once, SQL over streams AWS Kinesis Analytics, Confluent
Apache Spark Streaming Micro-batch seconds ML pipelines, batch+stream unified, large cluster analytics Databricks, EMR, Dataproc
ksqlDB SQL over Kafka ms SQL-native stream queries, materialized tables, no-code streaming Confluent Cloud
Pulsar Functions Serverless functions ms Pulsar-native, simple transform/filter pipelines StreamNative, Astra Streaming
Java — Kafka Streams: Real-Time Order Analytics
StreamsBuilder builder = new StreamsBuilder(); // 1. Source stream from Kafka topic KStream<String, OrderEvent> orders = builder .stream("orders.placed", Consumed.with(Serdes.String(), orderSerde)); // 2. Filter: only paid orders over $100 KStream<String, OrderEvent> highValue = orders .filter((key, order) -> order.getTotal() > 100.0 && order.isPaid()); // 3. Group by region, count in 5-minute tumbling window KTable<Windowed<String>, Long> countByRegion = highValue .groupBy((key, order) -> order.getRegion()) .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5))) .count(Materialized.as("high-value-orders-by-region")); // queryable state store // 4. Join enrichment: fetch product catalog (KTable) KTable<String, Product> products = builder .table("products.catalog", Consumed.with(Serdes.String(), productSerde)); KStream<String, EnrichedOrder> enriched = orders .join(products, (order, product) -> new EnrichedOrder(order, product), Joined.with(Serdes.String(), orderSerde, productSerde)); // 5. Sink to output topic enriched.to("orders.enriched", Produced.with(Serdes.String(), enrichedSerde)); // 6. Build and start topology KafkaStreams streams = new KafkaStreams(builder.build(), config); streams.start(); // Runs embedded in your microservice — no separate cluster needed
08

Apache Kafka

// DISTRIBUTED COMMIT LOG AT PLANETARY SCALE

Kafka is a distributed, partitioned, replicated commit log. It is not a message queue in the traditional sense — it is a durable, ordered, replayable event log that producers append to and consumers read from at their own pace. Kafka's architecture makes it the dominant choice for high-throughput event streaming at enterprise scale.

Core Architecture Concepts

Topics & Partitions

A topic is split into N partitions. Each partition is an ordered, immutable sequence of records. Partitions are the unit of parallelism — you can have at most as many active consumers as partitions in a group. Choose partition count based on target throughput.

Offsets & Consumer Groups

Each message has a monotonically increasing offset per partition. Consumer groups track their committed offset per partition — Kafka never deletes data based on consumption. Multiple groups can read the same topic independently at their own offsets.

Replication & Durability

Each partition has a leader and N-1 follower replicas. acks=all + min.insync.replicas=2 ensures no data loss under broker failure. Replication factor 3 is standard for production. KRaft (since 3.3) eliminates ZooKeeper dependency.

Properties — Kafka Producer (High Durability)
# Producer configuration for financial-grade durability # Durability: wait for all in-sync replicas to acknowledge acks=all # 0=fire-forget | 1=leader only | all=full durability min.insync.replicas=2 # Set on topic — broker rejects if fewer ISRs available enable.idempotence=true # Exactly-once semantics per partition (dedup by sequence no.) # Retry behavior retries=2147483647 # Retry indefinitely (with idempotence, safe to do so) max.in.flight.requests.per.connection=5 # Must be ≤5 with idempotence delivery.timeout.ms=120000 # 2-min total retry window retry.backoff.ms=1000 # Batching for throughput (tune for your latency/throughput trade-off) batch.size=65536 # 64KB batch size (default 16KB — too small for high throughput) linger.ms=10 # Wait up to 10ms to fill batch (0 = lowest latency) compression.type=snappy # snappy (balanced) | lz4 (fast) | zstd (best ratio) buffer.memory=67108864 # 64MB producer buffer # Serialization (use Schema Registry for schema evolution) key.serializer=org.apache.kafka.common.serialization.StringSerializer value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer schema.registry.url=https://schema-registry:8081
Properties — Kafka Consumer (Exactly-Once Processing)
# Consumer configuration for reliable at-least-once (default) group.id=fulfillment-service-v2 auto.offset.reset=earliest # earliest | latest — new group starts from beginning enable.auto.commit=false # NEVER use auto-commit for business-critical consumers # Commit manually AFTER successful processing # Failure resilience max.poll.records=100 # Process 100 records per poll — size to your DB batch capacity max.poll.interval.ms=300000 # 5 min — consumer kicked from group if poll takes longer session.timeout.ms=45000 # Heartbeat timeout before rebalance heartbeat.interval.ms=15000 # Should be 1/3 of session.timeout # Fetch tuning (throughput) fetch.min.bytes=1048576 # 1MB minimum before broker responds (reduces round trips) fetch.max.wait.ms=500 # Max wait if fetch.min.bytes not met # Commit after successful processing pattern: # consumer.poll() → process batch → commitSync() → repeat # On exception: do NOT commit → messages re-delivered after rebalance

Kafka Topic Design Rules

DecisionRecommendationRationale
Partition countStart with 6–12 per topic; multiply expected consumer group size by 2–3Cannot decrease partitions; over-partition rather than under-partition
Replication factor3 for production; 1 for dev onlySurvives 1 broker failure with RF=3
Retention7 days default; event store topics: 1 year or compact+deleteLog compaction keeps last value per key (good for state topics)
Message keyUse entity ID (orderId, userId) — ensures ordering per entityAll events for the same key land in the same partition
Topic naming{domain}.{entity}.{event-type} e.g. orders.order.placedHierarchical, discoverable, schema-registry friendly
09

Apache Pulsar

// MULTI-TENANCY, GEO-REPLICATION, UNIFIED QUEUING

Pulsar separates compute (brokers) from storage (Apache BookKeeper), enabling instant elastic scaling of brokers without data rebalancing. It supports both streaming and queuing natively, built-in multi-tenancy, and geo-replication across data centers — features that require significant additional tooling in Kafka.

Pulsar's Architectural Advantages Unique
  • Separated storage: Scale brokers independently from BookKeeper storage nodes
  • Built-in multi-tenancy: Namespaces, tenants, and quotas are first-class primitives
  • Native geo-replication: Async or sync replication across regions configured declaratively
  • Unified model: Topic supports streaming + queue + pub-sub in a single API
  • Functions: Serverless compute embedded in the broker (Pulsar Functions)
When Kafka Still Wins Trade-offs
  • Kafka ecosystem is vastly larger (Kafka Connect, ksqlDB, 200+ connectors)
  • Kafka Streams is more mature than Pulsar's streaming APIs
  • Operational expertise far more available for Kafka
  • BookKeeper adds operational complexity (3-tier: ZK + BK + Broker)
  • Schema registry is third-party (not built-in like Pulsar's)
Python — Pulsar Producer & Consumer
import pulsar from pulsar.schema import JsonSchema, Record, String, Float # Schema definition — Pulsar has a built-in schema registry class OrderPlaced(Record): order_id = String(required=True) customer_id = String(required=True) total = Float(required=True) # Producer client = pulsar.Client('pulsar://broker:6650') producer = client.create_producer( topic='persistent://acme/orders/order-placed', # tenant/namespace/topic schema=JsonSchema(OrderPlaced), producer_name='orders-service-prod', send_timeout_millis=30000, block_if_queue_full=True, batching_enabled=True, batching_max_publish_delay_ms=10, compression_type=pulsar.CompressionType.Snappy ) producer.send(OrderPlaced(order_id='8271', customer_id='903', total=142.0)) # Consumer (shared subscription = queue semantics among consumers) consumer = client.subscribe( topic='persistent://acme/orders/order-placed', subscription_name='fulfillment-service', # durable subscription name consumer_type=pulsar.ConsumerType.Shared, # Shared|Exclusive|Failover|KeyShared schema=JsonSchema(OrderPlaced), receiver_queue_size=1000 ) while True: msg = consumer.receive(timeout_millis=5000) try: process_order(msg.value()) consumer.acknowledge(msg) # Ack = message removed from sub except Exception as e: consumer.negative_acknowledge(msg) # Nack = redeliver after backoff
10

RabbitMQ

// FLEXIBLE ROUTING, CLASSIC MESSAGING

RabbitMQ is a traditional message broker implementing AMQP. Its power lies in its flexible routing model: messages flow through exchanges that apply routing rules to deliver to one or more queues. This makes complex routing topologies (fanout, topic-based, header-based, direct) easy to configure without writing code. RabbitMQ prioritizes low-latency delivery and a rich broker-side feature set over raw throughput.

Exchange Types — RabbitMQ's Routing Power

Exchange TypeRouting RuleUse CaseExample Key
Direct Routes to queues where binding key = routing key (exact match) Task queues, error routing, targeted delivery order.created
Topic Routing key with wildcards: * (one word) and # (zero or more) Selective subscriptions, hierarchical event routing orders.*.placed, orders.#
Fanout Broadcasts to ALL bound queues, ignores routing key Pub/sub, notifications, cache invalidation — (ignored)
Headers Routes based on message header attributes (x-match: all/any) Content-based routing, multi-attribute filtering {region: eu, priority: high}
Python — RabbitMQ Topic Exchange Pattern
import pika, json connection = pika.BlockingConnection(pika.ConnectionParameters( host='rabbitmq', port=5672, credentials=pika.PlainCredentials('app', 'secret'), heartbeat=600, blocked_connection_timeout=300 )) channel = connection.channel() # Declare the topic exchange (idempotent) channel.exchange_declare( exchange='domain.events', exchange_type='topic', durable=True # Survives broker restart ) # ─── PRODUCER: publish with routing key ─────────────────────── def publish_event(routing_key: str, payload: dict): channel.basic_publish( exchange='domain.events', routing_key=routing_key, # e.g. "orders.eu.placed" body=json.dumps(payload), properties=pika.BasicProperties( content_type='application/json', delivery_mode=2, # 2 = persistent (survives restart) message_id='uuid-here', correlation_id='trace-id' ) ) # ─── CONSUMER: bind queue with wildcard pattern ──────────────── channel.queue_declare(queue='fulfillment.eu.orders', durable=True) channel.queue_bind( queue='fulfillment.eu.orders', exchange='domain.events', routing_key='orders.eu.*' # matches orders.eu.placed, orders.eu.cancelled, etc. ) channel.basic_qos(prefetch_count=10) # Process 10 in-flight max — back-pressure def on_message(ch, method, properties, body): try: event = json.loads(body) process(event) ch.basic_ack(delivery_tag=method.delivery_tag) # explicit ack except Exception: ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False) # → DLX channel.basic_consume('fulfillment.eu.orders', on_message) channel.start_consuming()
11

Broker Selection Guide

// CHOOSING THE RIGHT TOOL FOR THE JOB
Kafka
Apache / Confluent
Throughput Millions/sec
Latency 2–10ms
Model Log / Streaming
Retention Days–Forever
Replayable Yes — offsets
Best For High throughput streaming, event sourcing, audit log
Pulsar
Apache / StreamNative
Throughput High
Latency 5–15ms
Model Streaming + Queue
Retention Configurable
Replayable Yes — cursors
Best For Multi-tenant SaaS, geo-replication, unified messaging
RabbitMQ
Broadcom / CNCF
Throughput ~50K/sec
Latency <1ms
Model Message Queue
Retention Until consumed
Replayable No (shovel plugin)
Best For Complex routing, task queues, low-latency, RPC
If you need…ChooseWhy
Millions of events/sec, durable replay, event sourcingKafkaLog-based, partitioned, designed for this workload
Complex routing rules, fanout, priority queuesRabbitMQExchange model, broker-side routing logic
Multi-tenant SaaS, built-in geo-replicationPulsarFirst-class tenant/namespace primitives
Simple task queue / background jobsRabbitMQSimpler ops, lower overhead for this use case
SQL-based stream queries, BI on event streamsKafka + ksqlDBksqlDB provides ANSI SQL over Kafka topics
Starting fresh, limited ops expertiseManaged KafkaConfluent Cloud / MSK / Aiven — outsource ops complexity
12

Schema & Event Contracts

// SCHEMA EVOLUTION WITHOUT BREAKING CONSUMERS

Schema management is the hardest long-term problem in EDA. Producers and consumers are deployed independently — a schema change can break a consumer that hasn't been updated. Schema registries (Confluent Schema Registry, AWS Glue Schema Registry) enforce compatibility rules at publish time, preventing breaking changes from reaching the broker.

Backward Compatibility Preferred

Old consumers can read new events. You can add optional fields with defaults. You cannot remove fields or change types. New producers publish v2 events; old consumers still process them using default values for new fields.

Forward Compatibility

New consumers can read old events. Old producers publish v1; new consumers handle missing new fields gracefully. Allows consumer upgrades before producer upgrades.

Avro Schema — OrderPlaced v1 → v2 (Backward Compatible)
// v1 — original schema { "type": "record", "name": "OrderPlaced", "namespace": "com.acme.orders.v1", "fields": [ { "name": "orderId", "type": "string" }, { "name": "customerId", "type": "string" }, { "name": "total", "type": "double" } ] } // v2 — adding optional fields with defaults (BACKWARD COMPATIBLE ✓) { "type": "record", "name": "OrderPlaced", "namespace": "com.acme.orders.v2", "fields": [ { "name": "orderId", "type": "string" }, { "name": "customerId", "type": "string" }, { "name": "total", "type": "double" }, { "name": "currency", "type": "string", "default": "USD" }, // NEW — has default ✓ { "name": "region", "type": ["null", "string"], "default": null } // NEW — nullable ✓ ] } // BREAKING CHANGES — never do these with BACKWARD compatibility mode: // ❌ Remove existing field (orderId, customerId, total) // ❌ Change field type (double → string) // ❌ Rename field without alias // ❌ Add required field without default // Schema Registry enforces compatibility before accepting new schema version // curl -X POST -H "Content-Type: application/json" \ // --data '{"schema": "..."}' \ // http://schema-registry:8081/subjects/orders.placed-value/versions
13

Ordering & Delivery Guarantees

// THE HOLY TRINITY OF DISTRIBUTED MESSAGING

Every broker makes trade-offs across three guarantees: ordering, delivery semantics, and throughput. Understanding these is essential for designing correct systems — especially under failure conditions.

GuaranteeDefinitionKafkaPulsarRabbitMQ
At-Most-Once Messages may be lost, never duplicated acks=0 Non-persistent sub auto-ack, noAck
At-Least-Once No loss, but duplicates possible on retry Default acks=1 Default ack-based Default manual ack
Exactly-Once No loss, no duplicates — hardest guarantee Transactions + idempotence Transactions (limited) No (use idempotent consumers)
Exactly-once is expensive and rarely necessary. The practical target for most systems is at-least-once delivery + idempotent consumers. Design your consumer to handle duplicate delivery safely (using a deduplication key), and you get effectively-once semantics at much lower overhead than true exactly-once transactions.

Ordering Guarantees in Kafka

Kafka guarantees ordering within a partition. All messages with the same key go to the same partition — so all events for orderId=8271 are processed in order. Events across different partitions (different orderIds) may interleave. If global ordering across entities is required, use a single partition — but this eliminates parallelism.

14

Error Handling & Dead Letter Queues

// FAILURES ARE EVENTS TOO

In EDA, processing failures must not block the message pipeline. A Dead Letter Queue (DLQ) captures messages that failed processing after all retry attempts. The DLQ preserves failed messages for investigation and re-processing without stalling the main consumer.

Python — Retry with Exponential Backoff + DLQ
import time, math, json, logging from dataclasses import dataclass @dataclass class RetryPolicy: max_attempts: int = 5 base_delay_ms: int = 200 max_delay_ms: int = 30_000 jitter: bool = True def backoff_delay(attempt: int, policy: RetryPolicy) -> float: """Exponential backoff: 200ms, 400ms, 800ms, 1.6s, 3.2s, ...""" delay = policy.base_delay_ms * (2 ** attempt) delay = min(delay, policy.max_delay_ms) if policy.jitter: delay *= (0.5 + math.random() * 0.5) # ±25% jitter prevents thundering herd return delay / 1000 # convert to seconds def process_with_retry(consumer, dlq_producer, policy: RetryPolicy): while True: msg = consumer.poll(timeout_ms=5000) if not msg: continue attempt = 0 while attempt < policy.max_attempts: try: process_event(msg.value()) consumer.commit() break except TransientError as e: # DB timeout, downstream 503, etc. attempt += 1 if attempt >= policy.max_attempts: send_to_dlq(dlq_producer, msg, e, attempt) consumer.commit() # commit to advance — don't re-process else: time.sleep(backoff_delay(attempt, policy)) except PoisonPillError as e: # Malformed message — don't retry send_to_dlq(dlq_producer, msg, e, attempt=0) consumer.commit() break def send_to_dlq(producer, original_msg, error, attempt): dlq_event = { "originalTopic": original_msg.topic(), "originalPartition": original_msg.partition(), "originalOffset": original_msg.offset(), "payload": original_msg.value(), "error": str(error), "errorType": type(error).__name__, "failedAt": datetime.utcnow().isoformat(), "attempts": attempt } producer.produce('orders.placed.dlq', json.dumps(dlq_event)) logging.error(f"Message sent to DLQ after {attempt} attempts: {error}")
15

Idempotency

// SAFE TO PROCESS TWICE — ALWAYS

At-least-once delivery means your consumer will receive the same message more than once — guaranteed. A consumer that can be safely re-executed with the same message (same result as processing it once) is idempotent. Idempotency is the practical alternative to expensive exactly-once transactions.

Database-Level Idempotency

Use the event's unique ID (eventId or messageId) as a deduplication key. Store processed IDs in a processed_events table with a unique constraint. INSERT … ON CONFLICT DO NOTHING makes re-processing a no-op. Prune old records after your retention window.

Natural Idempotency

Design operations to be naturally idempotent where possible: SET balance = 142.00 is idempotent; INCREMENT balance BY 10 is not. UPSERT is idempotent; INSERT is not. Use version fields and optimistic locking to detect and discard stale re-deliveries.

SQL — Idempotent Event Processing Pattern
-- Deduplication table (partitioned by month for pruning) CREATE TABLE processed_events ( event_id VARCHAR(64) NOT NULL, consumer_group VARCHAR(64) NOT NULL, processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), PRIMARY KEY (event_id, consumer_group) ) PARTITION BY RANGE (processed_at); -- Idempotent event handling in a single transaction BEGIN; -- 1. Record that we're processing this event (dedup check) INSERT INTO processed_events (event_id, consumer_group) VALUES ('01HWZR5F4P7TKMXX9BQ2GV1ERD', 'fulfillment-service') ON CONFLICT (event_id, consumer_group) DO NOTHING; -- 2. Check if row was inserted (affected rows = 0 → already processed) GET DIAGNOSTICS v_affected = ROW_COUNT; IF v_affected = 0 THEN ROLLBACK; -- skip silently — duplicate delivery RETURN; END IF; -- 3. Apply business logic (now guaranteed to run exactly once) UPDATE orders SET status = 'fulfilling', updated_at = NOW() WHERE id = v_order_id AND status = 'paid'; -- state machine guard prevents double-fulfillment COMMIT;
16

Observability

// DISTRIBUTED TRACING ACROSS EVENT HOPS

Debugging an event-driven system without proper observability is nearly impossible — causality chains span multiple services, brokers, and asynchronous hops. The three pillars of EDA observability are distributed tracing (propagate trace context through event headers), consumer lag monitoring (the key health metric for any Kafka consumer), and event lineage (which events caused which downstream events).

Consumer Lag — The Primary SLA

Consumer lag = messages produced but not yet consumed. Rising lag = consumer can't keep up. Alert when lag > N events or lag grows for >5 minutes. Monitor per-partition. Tools: Kafka Exporter + Prometheus, Confluent Control Center, Burrow.

Distributed Tracing

Propagate traceparent (W3C Trace Context) in event headers. Consumers extract trace context and start a child span. Kafka instrumentation libraries (OpenTelemetry) handle this automatically. Platforms: Jaeger, Tempo, Honeycomb.

Key Metrics to Alert On
  • Consumer lag per group per partition
  • DLQ message count (should be zero)
  • Processing error rate per consumer
  • End-to-end event latency (produce → consume)
  • Broker disk utilization and ISR shrink events
17

Saga Pattern

// DISTRIBUTED TRANSACTIONS WITHOUT 2PC

A Saga is a sequence of local transactions, each publishing an event or message that triggers the next transaction. If any step fails, compensating transactions undo the preceding steps. Sagas replace distributed ACID transactions (2PC) — which don't scale and are unavailable under network partition — with eventual consistency and explicit compensation logic.

Choreography-based Saga
  • Each service reacts to events from the previous step
  • No central coordinator — fully decentralized
  • Simple for linear flows; complex for branching
  • Hard to visualize the overall saga state
  • Good for: simple, stable flows with few steps
Orchestration-based Saga
  • Central saga orchestrator issues commands, awaits events
  • Saga state is explicit and queryable
  • Easier to add steps, error handling, timeouts
  • Saga orchestrator is a single point of failure (mitigated by persistence)
  • Good for: complex flows, audit requirements, many services
TypeScript — Orchestration Saga: Order Fulfillment
// Saga steps: Reserve Inventory → Charge Payment → Notify Fulfillment // Compensations (on failure): Release Inventory ← Refund Payment ← Cancel Fulfillment enum SagaState { STARTED, INVENTORY_RESERVED, PAYMENT_CHARGED, COMPLETED, COMPENSATING, FAILED } class OrderFulfillmentSaga { async execute(orderId: string) { const saga = await sagaStore.create({ orderId, state: SagaState.STARTED }); // Step 1: Reserve inventory try { await commandBus.send('inventory', { type: 'ReserveItems', orderId }); await saga.transition(SagaState.INVENTORY_RESERVED); } catch { return this.fail(saga, []); } // Step 2: Charge payment try { await commandBus.send('payments', { type: 'ChargeOrder', orderId }); await saga.transition(SagaState.PAYMENT_CHARGED); } catch { return this.compensate(saga, [ () => commandBus.send('inventory', { type: 'ReleaseItems', orderId }) ]); } // Step 3: Trigger fulfillment await commandBus.send('fulfillment', { type: 'FulfillOrder', orderId }); await saga.transition(SagaState.COMPLETED); await eventBus.publish('orders.fulfilled', { orderId }); } async compensate(saga, compensations: Array<() => Promise<void>>) { await saga.transition(SagaState.COMPENSATING); for (const comp of compensations.reverse()) { await retry(() => comp()); // compensations must also be retried until success } await saga.transition(SagaState.FAILED); await eventBus.publish('orders.fulfillment-failed', { orderId: saga.orderId }); } }
18

EDA Anti-Patterns

// WHAT NOT TO DO — LESSONS LEARNED EXPENSIVELY
Event Spaghetti Arch Smell

Every service subscribes to every other service's events, creating a tangle of implicit dependencies. No one can trace what triggers what. Solution: domain-driven topic boundaries, bounded contexts, explicit event catalogs.

God Topics Avoid

A single events topic carrying all event types from all services. Consumers must filter; schema evolution breaks everyone. Solution: one topic per event type or per aggregate root.

Synchronous Expectations Mindset

Treating events as synchronous RPC — expecting an immediate response, timing out if not received in 200ms. EDA is inherently asynchronous. Design UX around eventual consistency with progress indicators, not blocking spinners.

No Schema Contract Critical

Publishing raw JSON with no schema registry. Works fine until the producer team renames a field and 8 downstream consumers silently start failing. Always register schemas. Always version them.

Storing Business State in Kafka Misuse

Using Kafka as the primary queryable state store. Kafka is not a database. Consumers need their own materialized views (CQRS read models) for queries. Use Kafka for the event log; project into queryable stores.

Non-Idempotent Consumers Bug

Assuming each message is delivered exactly once. At-least-once delivery is the guarantee. Any consumer that double-charges a customer or double-books a slot on redelivery has a production bug waiting to happen.

19

Migration Roadmap

// FROM MONOLITH OR REST TO EDA

Migrating to event-driven architecture is a multi-phase journey. The strangler fig pattern — gradually replacing direct calls with event flows without a big-bang rewrite — is the most reliable approach. Begin at the edges with high-value, high-volume flows before tackling the core domain.

Phase 0 — Months 1–2
Observe & Instrument
  • Map existing data flows — which services call which, volumes, latency
  • Identify top fan-out candidates (one call triggers N downstream side effects)
  • Choose broker and deploy in non-critical path first
  • Establish schema registry and naming conventions
  • Define consumer group naming, DLQ topology, and on-call runbooks
Phase 1 — Months 2–5
Dual-Write Strangler
  • Augment existing writes to also publish domain events (transactional outbox pattern)
  • New consumers read from events; existing consumers from DB — run in parallel
  • Validate event-derived state matches DB state for 30 days
  • Migrate notification, analytics, and audit consumers to events first (low risk)
  • Implement DLQ monitoring and reprocessing tooling
Phase 2 — Months 5–10
Choreography Cutover
  • Replace synchronous service-to-service calls with event-driven choreography
  • Implement saga orchestration for multi-step flows
  • Remove direct DB coupling between services — consume events, not shared tables
  • Introduce CQRS read models for query-heavy paths
  • Establish SLAs: max consumer lag, DLQ response time
Phase 3 — Months 10–18
Stream Processing & Optimization
  • Introduce real-time stream processing (Kafka Streams / Flink) for analytics and enrichment
  • Migrate high-throughput audit and CDC flows to Kafka Connect
  • Evaluate event sourcing for core domain aggregates where audit/replay is critical
  • Implement schema evolution governance and consumer group ownership registry
  • Continuous maturity assessment against EDA pillars

Transactional Outbox — Safe Dual-Write

The hardest problem when introducing events into an existing system: how to atomically update your database and publish an event. A crash between the two creates inconsistency. The Transactional Outbox pattern solves this by writing the event to an outbox table in the same database transaction. A separate relay process (Debezium CDC) reads the outbox and publishes to Kafka — guaranteed, ordered, and recoverable.

SQL + Config — Transactional Outbox + Debezium
-- 1. Outbox table (created alongside your domain tables) CREATE TABLE outbox_events ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), aggregate_id VARCHAR(64) NOT NULL, event_type VARCHAR(128) NOT NULL, payload JSONB NOT NULL, created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), published BOOLEAN NOT NULL DEFAULT FALSE ); -- 2. Write to domain table AND outbox in ONE transaction (atomic) BEGIN; INSERT INTO orders (id, customer_id, total, status) VALUES ('8271', '903', 142.00, 'pending'); INSERT INTO outbox_events (aggregate_id, event_type, payload) VALUES ('8271', 'com.acme.orders.OrderPlaced', '{"orderId":"8271","customerId":"903","total":142.00}'); COMMIT; -- Either both succeed or both rollback — no inconsistency possible -- 3. Debezium connector reads PostgreSQL WAL (Change Data Capture) -- and publishes outbox events to Kafka automatically -- (debezium-connector-postgres with outbox.event.router SMT) -- Debezium Kafka Connect config snippet: -- { -- "connector.class": "io.debezium.connector.postgresql.PostgresConnector", -- "database.hostname": "postgres", "database.dbname": "orders", -- "table.include.list": "public.outbox_events", -- "transforms": "outbox", -- "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter", -- "transforms.outbox.table.field.event.type": "event_type", -- "transforms.outbox.route.topic.replacement": "${routedByValue}" -- }
Reference implementations & further reading: Martin Fowler's Event Sourcing pattern (martinfowler.com) · Chris Richardson's microservices.io (Saga, Outbox, CQRS patterns) · Confluent Developer docs (Kafka Streams, Schema Registry) · CloudEvents spec (cloudevents.io) · Lightbend Akka (reactive event-driven microservices) · Axon Framework (Java CQRS/ES framework).