Engineering Handbook

Event-Driven Architecture

The backbone of modern enterprise scaling

"Don't call us, we'll call you — the Hollywood Principle applied to distributed systems."

The definitive field guide to building systems that react, not poll. Covers the full spectrum: Kafka partitions to RabbitMQ routing, event sourcing to CQRS, saga orchestration to schema evolution — everything required to design, build, and operate event-driven systems at scale.

Apache Kafka · Apache Pulsar · RabbitMQ · Event Sourcing · Pub/Sub

What is Event-Driven Architecture

// FUNDAMENTALS & MOTIVATIONS

Event-Driven Architecture (EDA) is a design paradigm in which system components communicate by producing and consuming events — immutable records of things that have happened — rather than calling each other directly. Producers have no knowledge of consumers. Consumers have no knowledge of producers. The broker sits in between and is the only coupling point.

This inversion is the source of EDA's power: temporal decoupling (producer and consumer need not be available simultaneously), spatial decoupling (they need not know each other's address), and semantic decoupling (they agree only on the event schema, not on processing logic).

Temporal Decoupling (Core)

Producer and consumer do not need to be online at the same time. Events persist in the broker until consumed. Services scale and deploy independently without coordination windows.

Spatial Decoupling (Core)

Producers publish to a topic, not to a specific consumer endpoint. Adding new consumers requires zero changes to producers. Services discover data through subscriptions, not service registries.

Scale & Resilience (Benefit)

Consumers scale horizontally against the event backlog. Slow consumers don't block producers. Back-pressure is managed by the broker, not by synchronous timeouts cascading through a call chain.

EDA vs REST vs RPC: REST/RPC is request/response — caller blocks waiting for a reply, both parties must be up, and the caller must know the callee's address. EDA is fire-and-forget at the producer side. For read-heavy query paths, REST is still appropriate. EDA shines on write/mutation flows and anywhere you need fan-out, replay, or audit.

When EDA is the right choice

Scenario	EDA Fit	Why
Order placed → multiple downstream actions (fulfillment, email, analytics)	EXCELLENT	Fan-out without producer knowing consumers; consumers added independently
Real-time fraud detection on transactions	EXCELLENT	Stream processing on event log; millisecond latency with Kafka Streams/Flink
Audit trail / compliance log	EXCELLENT	Immutable event log is the audit record; replay for compliance queries
Simple CRUD API, single consumer	POOR	Adds operational complexity with no benefit; use REST
Interactive query (user asks for their profile data)	POOR	Synchronous response required; use REST + query model (CQRS read side)
Cross-service saga / distributed transaction	GOOD	Choreography or orchestration via events; avoids 2PC distributed locking

Core Messaging Patterns

// THE FOUR BUILDING BLOCKS

Every EDA system is built from four primitive patterns. These compose into the larger architectures — pub/sub, event sourcing, stream processing — described in later sections. Understanding the primitives clarifies why different brokers make different design choices.

PATTERN 01 — MESSAGE QUEUE

Point-to-Point Queue

Each message is consumed by exactly one consumer. Multiple consumers compete for messages from the same queue — load balancing with guaranteed single delivery. Classic use: background job processing, task distribution. Work queue model.

PATTERN 02 — PUBLISH/SUBSCRIBE

Topic Fan-Out

Each message is delivered to all subscribers. Producers publish to a topic; every subscription receives a copy. Classic use: event notification, audit events, data replication. N consumers all see every event.

PATTERN 03 — EVENT STREAMING

Durable Ordered Log

Events are written to an immutable, ordered, partitioned log. Consumers track their own offset. Any consumer can re-read history. New consumers can start from the beginning. Classic use: Kafka, Kinesis. Time-travel is the superpower.

PATTERN 04 — REQUEST/REPLY

Async RPC over Events

Caller publishes a request event with a correlationId and a replyTo queue. Responder publishes the reply to that queue. Caller awaits. Bridges synchronous-seeming semantics over async infrastructure. Use sparingly.
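The request/reply flow can be sketched without a broker. The example below is a minimal in-process illustration using Python's standard `queue` and `threading` modules. The `responder` function, the `priceCents` field, and passing a queue object as `replyTo` are assumptions made for the sketch; with a real broker, `replyTo` would carry a queue name.

```python
import queue
import threading
import uuid

requests = queue.Queue()   # request queue the responder listens on


def responder():
    """Hypothetical pricing responder: replies on the queue given in replyTo."""
    while True:
        msg = requests.get()
        if msg is None:                      # sentinel: shut down
            break
        reply = {
            "correlationId": msg["correlationId"],   # echo the id back unchanged
            "priceCents": 4999 * msg["qty"],
        }
        msg["replyTo"].put(reply)


def request_price(qty, timeout_s=2.0):
    """Publish a request, then block until the matching reply arrives."""
    reply_queue = queue.Queue()              # private replyTo queue for this caller
    correlation_id = str(uuid.uuid4())
    requests.put({"correlationId": correlation_id, "replyTo": reply_queue, "qty": qty})
    while True:
        reply = reply_queue.get(timeout=timeout_s)     # raises queue.Empty on timeout
        if reply["correlationId"] == correlation_id:   # skip replies to other requests
            return reply


worker = threading.Thread(target=responder, daemon=True)
worker.start()
result = request_price(qty=2)
requests.put(None)
worker.join()
```

The timeout is the important design choice: the caller is blocked on an asynchronous channel, so it needs an explicit bound on how long it waits.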

Kafka blurs the queue/pub-sub boundary using consumer groups: all consumers in the same group share partitions (queue semantics — each message processed once); consumers in different groups each receive all messages (pub-sub semantics). This single primitive covers both use cases.
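A toy model can make the consumer-group behaviour concrete. The sketch below is an illustration only: the partitioner and the round-robin assignment are simplified stand-ins, not Kafka's actual algorithms, and the group and member names are made up.

```python
from collections import defaultdict

NUM_PARTITIONS = 4


def partition_for(key):
    # Toy partitioner for illustration; Kafka's default hashes the key bytes with murmur2.
    return sum(key.encode()) % NUM_PARTITIONS


def assign(members):
    """Spread partitions round-robin over the members of one group."""
    return {p: members[p % len(members)] for p in range(NUM_PARTITIONS)}


def deliver(keys, groups):
    """Return, per group, which member received which keys."""
    received = {group: defaultdict(list) for group in groups}
    for group, members in groups.items():
        assignment = assign(members)
        for key in keys:
            owner = assignment[partition_for(key)]   # exactly one owner per partition
            received[group][owner].append(key)
    return {group: dict(by_member) for group, by_member in received.items()}


events = ["order-1", "order-2", "order-3", "order-4", "order-5"]
groups = {"fulfillment": ["f-1", "f-2"], "analytics": ["a-1"]}
result = deliver(events, groups)
```

Within `fulfillment`, each event goes to exactly one of the two members (queue semantics). The `analytics` group independently receives all five events (pub-sub semantics).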

Event Anatomy

// WHAT AN EVENT IS (AND ISN'T)

An event is an immutable record of a fact that occurred in the past. Not a command (don't say "please process this"). Not a request. A fact: "Order 8271 was placed by user 903 at 14:22:05 UTC containing 3 items totalling $142.00." Events are named in past tense.

✕ Commands (avoid as events)

ProcessOrder — imperative, implies a specific handler
SendEmail — prescribes implementation
UpdateInventory — tells receiver what to do
ChargeCustomer — couples business logic to event
Present/future tense names
Contains processing instructions

✓ Domain Events (correct)

OrderPlaced — past tense, describes reality
PaymentSucceeded — fact, consumer decides what to do
InventoryReserved — records what happened
CustomerCharged — consequence, not instruction
Past tense names always
Contains only the data of what happened

Canonical Event Envelope

JSON / CloudEvents v1.0

// CloudEvents v1.0 — industry standard envelope
{
  // ─── Core CloudEvents attributes (specversion, id, source, type are required; time is optional but always set here) ───
  "specversion": "1.0",
  "id":          "01HWZR5F4P7TKMXX9BQ2GV1ERD",       // ULID: lexicographically sortable unique ID
  "source":      "/orders-service/production",
  "type":        "com.acme.orders.OrderPlaced",         // reverse-DNS + past tense
  "time":        "2024-09-14T14:22:05.182Z",            // RFC 3339 timestamp, UTC always

  // ─── Standard optional attributes ───────────────────────
  "datacontenttype": "application/json",
  "dataschema":      "https://schemas.acme.com/orders/v2/order-placed.json",
  "subject":         "order/8271",                        // resource identifier

  // ─── Custom extensions ───────────────────────────────────
  "correlationid": "sess_abc123",                      // trace correlation
  "causationid":   "cmd_xyz789",                       // triggering command
  "tenantid":      "acme-corp",                        // multi-tenancy
  "version":       2,                                  // schema version

  // ─── Domain payload ──────────────────────────────────────
  "data": {
    "orderId":     "8271",
    "customerId":  "cust_903",
    "placedAt":    "2024-09-14T14:22:05.182Z",
    "currency":    "USD",
    "totalAmount": 142.00,
    "lineItems": [
      { "sku": "PROD-441", "qty": 2, "unitPrice": 49.99 },
      { "sku": "PROD-109", "qty": 1, "unitPrice": 42.02 }
    ],
    "shippingAddress": { /* ... */ }
  }
}
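As a rough sketch of how a producer might assemble such an envelope, the helper below builds the dictionary and checks the four attributes that CloudEvents 1.0 marks as required. The function names `make_event` and `validate` are hypothetical, and a UUID stands in for the ULID used above.

```python
import json
import uuid
from datetime import datetime, timezone

# Context attributes that CloudEvents 1.0 marks as required
REQUIRED = ("specversion", "id", "source", "type")


def make_event(event_type, source, subject, data):
    """Wrap a domain payload in a CloudEvents-style envelope."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),          # any id unique per source works
        "source": source,
        "type": event_type,
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "subject": subject,
        "data": data,
    }


def validate(event):
    """Return the names of required attributes that are missing or empty."""
    return [name for name in REQUIRED if not event.get(name)]


evt = make_event(
    "com.acme.orders.OrderPlaced",
    "/orders-service/production",
    "order/8271",
    {"orderId": "8271", "customerId": "cust_903"},
)
wire_format = json.dumps(evt)             # what would be published to the broker
```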

Fat vs Thin events: Fat events include full payload (the order data above). Thin events include only the ID and type — consumers must fetch full state if needed. Fat events are self-contained but expensive at scale. Thin events require consumers to have read access to the source system. Prefer fat events for loose coupling; use thin events when payloads are very large or contain regulated data.

Publish / Subscribe

// DECOUPLED FAN-OUT AT SCALE

In a pub/sub model, producers (publishers) emit events to a topic. Subscribers express interest in topics and receive all matching events. The broker is responsible for delivering the event to all registered subscribers. Neither party knows the other exists — only the topic name is shared.

[Diagram: Producer (Order Service) → Topic (orders.placed) → Consumer 1 (Email Service), Consumer 2 (Fulfillment Service), Consumer 3 (Analytics Service)]

Each consumer receives every event independently. Adding Consumer 4 requires zero changes to Order Service.
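The fan-out property can be shown with a few lines of in-memory code. This is a sketch for illustration, not a broker: `InMemoryBroker` is a hypothetical class, delivery is synchronous, and nothing is persisted.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy topic-based broker: every subscriber of a topic receives every event."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        handlers = self._subscribers[topic]
        for handler in handlers:
            handler(dict(event))          # each subscriber gets its own copy
        return len(handlers)


broker = InMemoryBroker()
emails, shipments, metrics = [], [], []
broker.subscribe("orders.placed", emails.append)
broker.subscribe("orders.placed", shipments.append)

delivered = broker.publish("orders.placed", {"orderId": "8271"})

# A new consumer is added later; the publishing code above does not change.
broker.subscribe("orders.placed", metrics.append)
broker.publish("orders.placed", {"orderId": "8272"})
```

The publisher only knows the topic name. The third subscriber starts receiving events from the moment it subscribes, which is also why a plain pub/sub topic, unlike a log, cannot serve it the earlier event.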

Strengths (Use When)

Multiple downstream reactions to a single business event
Adding new consumers without touching producers
Event notification — "this happened, react however you want"
Cross-domain data propagation (CDC events, domain events)

Limitations (Watch Out)

No built-in ordering guarantee across consumers
Consumer failures need independent dead-letter handling
Complex debugging — causality chains are non-obvious
Schema changes propagate to all consumers simultaneously

Event Sourcing

// STATE AS A FUNCTION OF HISTORY

Event Sourcing is a persistence pattern where the state of an entity is never stored directly. Instead, only the sequence of events that led to the current state is persisted. Current state is derived by replaying events from the beginning (or from a snapshot). The event log is the database.

✕ Traditional CRUD State Store

Store current state only — history is lost
UPDATE overwrites — no audit trail
Temporal queries impossible (what did order look like last Tuesday?)
Can't replay to fix a bug in processing logic
Simple read path: SELECT * WHERE id = X

✓ Event Sourced Store

Append-only event log — full history preserved
Audit trail is free — every change recorded
Time travel: replay to any point in time
Fix bugs: replay with corrected logic to rebuild projections
Write is fast (append); reads are served from projections that must be built and kept up to date

TypeScript — Event Sourced Aggregate

// Events are the source of truth — append only
type OrderEvent =
  | { type: 'OrderPlaced';     orderId: string; customerId: string; total: number }
  | { type: 'ItemAdded';       orderId: string; sku: string;        qty: number  }
  | { type: 'PaymentReceived'; orderId: string; amount: number;      txId: string  }
  | { type: 'OrderShipped';    orderId: string; trackingNo: string            }
  | { type: 'OrderCancelled'; orderId: string; reason: string                 };

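// Aggregate state; the field set is inferred from the events handled below
interface OrderState {
  status: 'new' | 'pending' | 'paid' | 'shipped' | 'cancelled';
  customerId?: string;
  total?: number;
  paidAmount?: number;
  trackingNo?: string;
  cancelReason?: string;
}
const INITIAL_STATE: OrderState = { status: 'new' };

// State is a pure function of events (fold/reduce)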
function applyEvent(state: OrderState, event: OrderEvent): OrderState {
  switch (event.type) {
    case 'OrderPlaced':
      return { ...state, status: 'pending', total: event.total, customerId: event.customerId };
    case 'PaymentReceived':
      return { ...state, status: 'paid', paidAmount: event.amount };
    case 'OrderShipped':
      return { ...state, status: 'shipped', trackingNo: event.trackingNo };
    case 'OrderCancelled':
      return { ...state, status: 'cancelled', cancelReason: event.reason };
    default: return state;
  }
}

// Reconstitute state from event history
async function loadOrder(orderId: string): Promise<OrderState> {
  const events = await eventStore.load(orderId);              // or from snapshot
  return events.reduce(applyEvent, INITIAL_STATE);
}

// Append a new event — never UPDATE, only INSERT
async function shipOrder(orderId: string, trackingNo: string, expectedVersion: number) {
  const event: OrderEvent = { type: 'OrderShipped', orderId, trackingNo };
  await eventStore.append(orderId, event, expectedVersion); // optimistic concurrency
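  // Append and publish are two separate writes: a crash between them loses the
  // notification, so production systems often relay from the store (outbox or CDC).
  await eventBus.publish('orders.shipped', event);           // fan-out to consumers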
}

Snapshots — Performance at Scale

Replaying thousands of events to reconstitute state on every read is expensive. Snapshots periodically capture current state alongside the event sequence number. On load, restore from the latest snapshot and replay only subsequent events.
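A minimal sketch of the snapshot-then-replay load path follows. It assumes the snapshot is simply a saved state object whose `version` field records how many events it already includes; the `OrderState` shape and event dictionaries are illustrative.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OrderState:
    status: str = "new"
    total_cents: int = 0
    version: int = 0          # sequence number of the last applied event


def apply_event(state, event):
    """Fold one event into the state and bump the version."""
    kind = event["type"]
    if kind == "OrderPlaced":
        state = replace(state, status="pending", total_cents=event["total_cents"])
    elif kind == "PaymentReceived":
        state = replace(state, status="paid")
    elif kind == "OrderShipped":
        state = replace(state, status="shipped")
    return replace(state, version=state.version + 1)


def load(events, snapshot=None):
    """Restore from the snapshot (if any) and replay only the later events."""
    state = snapshot if snapshot is not None else OrderState()
    for event in events[state.version:]:   # skip events already in the snapshot
        state = apply_event(state, event)
    return state


history = [
    {"type": "OrderPlaced", "total_cents": 14200},
    {"type": "PaymentReceived"},
    {"type": "OrderShipped"},
]
full_replay = load(history)
snapshot = load(history[:2])               # snapshot taken after two events
from_snapshot = load(history, snapshot)    # replays only the third event
```

Both paths must produce the same state. That equality is a useful invariant to test whenever `apply_event` changes.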

Event sourcing is not a silver bullet. The write model is simple; the read model is complex. You need projections (read models) for every query pattern. Schema evolution of historical events is a significant engineering challenge. Use event sourcing when audit trail, time-travel, or event replay are genuine business requirements — not by default.

CQRS — Command Query Responsibility Segregation

// SEPARATE READ AND WRITE PATHS

CQRS separates the model used for writes (commands) from the model used for reads (queries). They can use different databases, different schemas, and scale independently. Commands mutate state and publish events; queries read from purpose-built projections optimized for the specific view.

[Diagram: Client (API) → Command (PlaceOrder, CancelOrder) → Write DB (Event Store, PostgreSQL) → Event Bus (Kafka) → Query (GetOrder, ListOrders) → Read DB (Elasticsearch, Redis)]

The event bus (Kafka) feeds projectors that maintain purpose-built read models: denormalized, query-optimized, polyglot.

Read Model Design Rules

One read model per query pattern — no compromise
Denormalize ruthlessly — joins at read time are expensive
Optimize for the exact query shape the UI needs
Rebuild-ability is a requirement: projectors must be replayable
Eventual consistency with the write model — always

Projection Technology Choices

Elasticsearch: full-text search, faceted navigation
Redis: hot read paths, leaderboards, counters
PostgreSQL (materialized view): complex relational queries
DynamoDB: single-table design, predictable latency
ClickHouse / BigQuery: analytics queries over large datasets
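The "one read model per query pattern" and "replayable projector" rules can be sketched as a pure function over the event log. The example below is an in-memory illustration; the event shapes and the two read models (one for GetOrder, one for ListOrders by customer) are assumptions, and a real projector would write to one of the stores listed above.

```python
from collections import defaultdict


def project(events):
    """Rebuild both read models from scratch by replaying the event log."""
    order_by_id = {}                         # serves GetOrder(orderId)
    orders_by_customer = defaultdict(list)   # serves ListOrders(customerId)
    for event in events:
        if event["type"] == "OrderPlaced":
            # Denormalized: everything the order view needs lives in one record
            order_by_id[event["orderId"]] = {
                "orderId": event["orderId"],
                "customerId": event["customerId"],
                "status": "pending",
                "total_cents": event["total_cents"],
            }
            orders_by_customer[event["customerId"]].append(event["orderId"])
        elif event["type"] == "OrderShipped":
            order_by_id[event["orderId"]]["status"] = "shipped"
    return order_by_id, dict(orders_by_customer)


log = [
    {"type": "OrderPlaced", "orderId": "8271", "customerId": "cust_903", "total_cents": 14200},
    {"type": "OrderPlaced", "orderId": "8272", "customerId": "cust_903", "total_cents": 5000},
    {"type": "OrderShipped", "orderId": "8271"},
]
orders, by_customer = project(log)
rebuilt = project(log)    # replaying the same log yields the same read models
```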

Stream Processing

// COMPUTE OVER CONTINUOUS EVENT FLOWS

Stream processing is real-time computation over unbounded event sequences. Unlike batch processing (which operates on finite datasets at rest), stream processing applies transformations, aggregations, and joins continuously as events arrive — with sub-second latency.

Framework	Model	Latency	Best For	Managed Option
Kafka Streams	Library (no cluster)	ms	Kafka-native, small-medium topologies, microservice-embedded	Confluent Cloud
Apache Flink	Cluster, stateful DAG	ms	Complex CEP, large state, exactly-once, SQL over streams	Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics), Confluent
Apache Spark Streaming	Micro-batch	seconds	ML pipelines, batch+stream unified, large cluster analytics	Databricks, EMR, Dataproc
ksqlDB	SQL over Kafka	ms	SQL-native stream queries, materialized tables, no-code streaming	Confluent Cloud
Pulsar Functions	Serverless functions	ms	Pulsar-native, simple transform/filter pipelines	StreamNative, Astra Streaming

Java — Kafka Streams: Real-Time Order Analytics

StreamsBuilder builder = new StreamsBuilder();

// 1. Source stream from Kafka topic
KStream<String, OrderEvent> orders = builder
    .stream("orders.placed", Consumed.with(Serdes.String(), orderSerde));

// 2. Filter: only paid orders over $100
KStream<String, OrderEvent> highValue = orders
    .filter((key, order) -> order.getTotal() > 100.0 && order.isPaid());

// 3. Group by region, count in 5-minute tumbling window
// groupBy changes the key, so pass serdes explicitly for the repartition topic
KTable<Windowed<String>, Long> countByRegion = highValue
    .groupBy((key, order) -> order.getRegion(), Grouped.with(Serdes.String(), orderSerde))
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .count(Materialized.as("high-value-orders-by-region"));   // queryable state store

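// 4. Join enrichment: look up the product catalog (KTable)
// A stream-table join matches on record key, so re-key orders by product ID first.
// getProductId() is an assumed accessor on OrderEvent.
KTable<String, Product> products = builder
    .table("products.catalog", Consumed.with(Serdes.String(), productSerde));

KStream<String, EnrichedOrder> enriched = orders
    .selectKey((key, order) -> order.getProductId())
    .join(products,
        (order, product) -> new EnrichedOrder(order, product),
        Joined.with(Serdes.String(), orderSerde, productSerde));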

// 5. Sink to output topic
enriched.to("orders.enriched", Produced.with(Serdes.String(), enrichedSerde));

// 6. Build and start topology
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();   // Runs embedded in your microservice — no separate cluster needed
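The filter and tumbling-window count in steps 2 and 3 can be illustrated without Kafka. The sketch below is plain Python over an in-memory list, with assumed field names (`total`, `paid`, `region`, `ts_ms`). It shows only the windowing arithmetic, not the state stores, repartitioning, or late-event handling that Kafka Streams provides.

```python
from collections import Counter

WINDOW_MS = 5 * 60 * 1000    # 5-minute tumbling windows


def window_start(ts_ms):
    """Align a timestamp to the start of its tumbling window."""
    return ts_ms - (ts_ms % WINDOW_MS)


def count_high_value_by_region(orders):
    """Count paid orders over 100, keyed by (region, window start)."""
    counts = Counter()
    for order in orders:
        if order["total"] > 100 and order["paid"]:
            counts[(order["region"], window_start(order["ts_ms"]))] += 1
    return dict(counts)


orders = [
    {"region": "eu", "total": 150, "paid": True,  "ts_ms": 0},
    {"region": "eu", "total": 250, "paid": True,  "ts_ms": 299_999},   # same window
    {"region": "eu", "total": 120, "paid": True,  "ts_ms": 300_000},   # next window
    {"region": "eu", "total": 500, "paid": False, "ts_ms": 1_000},     # filtered: unpaid
    {"region": "us", "total": 90,  "paid": True,  "ts_ms": 2_000},     # filtered: too small
]
counts = count_high_value_by_region(orders)
```

Tumbling windows do not overlap, so each event falls into exactly one window, and the window boundary is a pure function of the event timestamp.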

Apache Kafka

// DISTRIBUTED COMMIT LOG AT PLANETARY SCALE

Kafka is a distributed, partitioned, replicated commit log. It is not a message queue in the traditional sense — it is a durable, ordered, replayable event log that producers append to and consumers read from at their own pace. Kafka's architecture makes it the dominant choice for high-throughput event streaming at enterprise scale.

Core Architecture Concepts

Topics & Partitions

A topic is split into N partitions. Each partition is an ordered, immutable sequence of records. Partitions are the unit of parallelism — you can have at most as many active consumers as partitions in a group. Choose partition count based on target throughput.

Offsets & Consumer Groups

Each message has a monotonically increasing offset per partition. Consumer groups track their committed offset per partition — Kafka never deletes data based on consumption. Multiple groups can read the same topic independently at their own offsets.

Replication & Durability

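Each partition has one leader and RF-1 follower replicas, where RF is the replication factor. With acks=all and min.insync.replicas=2, an acknowledged write exists on at least two brokers, so it survives the loss of one. Replication factor 3 is standard for production. KRaft (production-ready since Kafka 3.3) replaces ZooKeeper; Kafka 4.0 removed ZooKeeper mode entirely.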

Properties — Kafka Producer (High Durability)

# Producer configuration for financial-grade durability
# Comments sit on their own lines: in a .properties file, a '#' after a value becomes part of the value.

# Durability: wait for all in-sync replicas to acknowledge
# acks: 0 = fire-and-forget | 1 = leader only | all = every in-sync replica
acks=all
# Idempotent producer: the broker deduplicates retries by sequence number,
# giving exactly-once writes per partition within one producer session
enable.idempotence=true
# min.insync.replicas=2 is a topic/broker setting, not a producer property.
# Set it on the topic so the broker rejects writes when fewer ISRs are available.

# Retry behavior
# Retry until delivery.timeout.ms expires (safe with idempotence)
retries=2147483647
# Must be <= 5 with idempotence
max.in.flight.requests.per.connection=5
# 2-minute total window for send plus retries
delivery.timeout.ms=120000
retry.backoff.ms=1000

# Batching for throughput (tune for your latency/throughput trade-off)
# 64 KB batches (default is 16 KB, too small for high throughput)
batch.size=65536
# Wait up to 10 ms to fill a batch (0 = lowest latency)
linger.ms=10
# snappy (balanced) | lz4 (fast) | zstd (best ratio)
compression.type=snappy
# 64 MB producer buffer
buffer.memory=67108864

# Serialization (use Schema Registry for schema evolution)
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
schema.registry.url=https://schema-registry:8081

Properties — Kafka Consumer (At-Least-Once Processing)

# Consumer configuration for reliable at-least-once processing
# Comments sit on their own lines: .properties files do not support inline comments.
group.id=fulfillment-service-v2
# earliest | latest: where a group with no committed offset starts reading
auto.offset.reset=earliest
# Never use auto-commit for business-critical consumers;
# commit manually AFTER successful processing
enable.auto.commit=false

# Failure resilience
# Records per poll; size to your DB batch capacity
max.poll.records=100
# 5 min: the consumer is removed from the group if the gap between polls is longer
max.poll.interval.ms=300000
# Missed-heartbeat timeout before a rebalance
session.timeout.ms=45000
# Should be 1/3 of session.timeout.ms
heartbeat.interval.ms=15000

# Fetch tuning (throughput)
# 1 MB minimum before the broker responds (reduces round trips)
fetch.min.bytes=1048576
# Max wait if fetch.min.bytes is not met
fetch.max.wait.ms=500

# Commit-after-processing pattern:
#   consumer.poll() → process batch → commitSync() → repeat
#   On exception: do NOT commit. The batch is re-delivered after a restart or rebalance;
#   to retry immediately, seek back to the last committed offset.

Kafka Topic Design Rules

Decision	Recommendation	Rationale
Partition count	Start with 6–12 per topic; multiply expected consumer group size by 2–3	Cannot decrease partitions; over-partition rather than under-partition
Replication factor	3 for production; 1 for dev only	With RF=3 and min.insync.replicas=2, the topic stays writable through 1 broker failure
Retention	7 days default; event store topics: 1 year or compact+delete	Log compaction keeps last value per key (good for state topics)
Message key	Use entity ID (orderId, userId) — ensures ordering per entity	All events for the same key land in the same partition
Topic naming	`{domain}.{entity}.{event-type}` e.g. `orders.order.placed`	Hierarchical, discoverable, schema-registry friendly

Apache Pulsar

// MULTI-TENANCY, GEO-REPLICATION, UNIFIED QUEUING

Pulsar separates compute (brokers) from storage (Apache BookKeeper), enabling instant elastic scaling of brokers without data rebalancing. It supports both streaming and queuing natively, built-in multi-tenancy, and geo-replication across data centers — features that require significant additional tooling in Kafka.

Pulsar's Architectural Advantages (Unique)

Separated storage: Scale brokers independently from BookKeeper storage nodes
Built-in multi-tenancy: Namespaces, tenants, and quotas are first-class primitives
Native geo-replication: Async or sync replication across regions configured declaratively
Unified model: Topic supports streaming + queue + pub-sub in a single API
Functions: Serverless compute embedded in the broker (Pulsar Functions)

When Kafka Still Wins (Trade-offs)

Kafka ecosystem is vastly larger (Kafka Connect, ksqlDB, 200+ connectors)
Kafka Streams is more mature than Pulsar's streaming APIs
Operational expertise far more available for Kafka
BookKeeper adds operational complexity (3-tier: ZK + BK + Broker)
Caveat on the Kafka side: its schema registry is a separate component (e.g., Confluent Schema Registry), whereas Pulsar's is built in

Python — Pulsar Producer & Consumer

import pulsar
from pulsar.schema import JsonSchema, Record, String, Float

# Schema definition — Pulsar has a built-in schema registry
class OrderPlaced(Record):
    order_id    = String(required=True)
    customer_id = String(required=True)
    total       = Float(required=True)

# Producer
client   = pulsar.Client('pulsar://broker:6650')
producer = client.create_producer(
    topic='persistent://acme/orders/order-placed',  # tenant/namespace/topic
    schema=JsonSchema(OrderPlaced),
    producer_name='orders-service-prod',
    send_timeout_millis=30000,
    block_if_queue_full=True,
    batching_enabled=True,
    batching_max_publish_delay_ms=10,
    compression_type=pulsar.CompressionType.SNAPPY
)
producer.send(OrderPlaced(order_id='8271', customer_id='903', total=142.0))

# Consumer (shared subscription = queue semantics among consumers)
consumer = client.subscribe(
    topic='persistent://acme/orders/order-placed',
    subscription_name='fulfillment-service',      # durable subscription name
    consumer_type=pulsar.ConsumerType.Shared,       # Shared|Exclusive|Failover|KeyShared
    schema=JsonSchema(OrderPlaced),
    receiver_queue_size=1000
)
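while True:
    try:
        msg = consumer.receive(timeout_millis=5000)
    except pulsar.Timeout:                         # no message within 5 s (exception name as exposed by recent client releases)
        continue
    try:
        process_order(msg.value())
        consumer.acknowledge(msg)                  # Ack = message removed from sub
    except Exception:
        consumer.negative_acknowledge(msg)          # Nack = redeliver after the negative-ack delay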

RabbitMQ

// FLEXIBLE ROUTING, CLASSIC MESSAGING

RabbitMQ is a traditional message broker whose core protocol is AMQP 0-9-1. Its power lies in its flexible routing model: messages flow through exchanges that apply routing rules to deliver to one or more queues. This makes complex routing topologies (fanout, topic-based, header-based, direct) a matter of declaring exchanges and bindings rather than writing routing code. RabbitMQ prioritizes low-latency delivery and a rich broker-side feature set over raw throughput.

Exchange Types — RabbitMQ's Routing Power

Exchange Type	Routing Rule	Use Case	Example Key
Direct	Routes to queues where binding key = routing key (exact match)	Task queues, error routing, targeted delivery	`order.created`
Topic	Routing key with wildcards: `*` (one word) and `#` (zero or more)	Selective subscriptions, hierarchical event routing	`orders.*.placed`, `orders.#`
Fanout	Broadcasts to ALL bound queues, ignores routing key	Pub/sub, notifications, cache invalidation	— (ignored)
Headers	Routes based on message header attributes (x-match: all/any)	Content-based routing, multi-attribute filtering	`{region: eu, priority: high}`
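The wildcard rules for topic exchanges are easy to get subtly wrong, so a small matcher helps. The function below is a sketch of the matching semantics described in the table (`*` matches exactly one word, `#` matches zero or more words). It is not RabbitMQ's implementation, and `topic_matches` is a hypothetical name.

```python
def topic_matches(pattern, routing_key):
    """Return True if a topic-exchange binding pattern matches a routing key."""
    p = pattern.split(".")
    k = routing_key.split(".")

    def match(i, j):
        if i == len(p):                    # pattern exhausted: key must be too
            return j == len(k)
        if p[i] == "#":                    # '#' absorbs zero or more words
            return any(match(i + 1, n) for n in range(j, len(k) + 1))
        if j < len(k) and p[i] in ("*", k[j]):   # '*' or a literal word match
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


checks = {
    ("orders.*.placed", "orders.eu.placed"): topic_matches("orders.*.placed", "orders.eu.placed"),
    ("orders.*.placed", "orders.placed"): topic_matches("orders.*.placed", "orders.placed"),
    ("orders.#", "orders"): topic_matches("orders.#", "orders"),
    ("orders.#", "orders.eu.placed"): topic_matches("orders.#", "orders.eu.placed"),
    ("orders.*", "orders.eu.placed"): topic_matches("orders.*", "orders.eu.placed"),
}
```

Note the difference between the last two cases: `orders.#` matches any depth below `orders`, while `orders.*` matches exactly one more word.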

Python — RabbitMQ Topic Exchange Pattern

import pika, json

connection = pika.BlockingConnection(pika.ConnectionParameters(
    host='rabbitmq', port=5672,
    credentials=pika.PlainCredentials('app', 'secret'),
    heartbeat=600,
    blocked_connection_timeout=300
))
channel = connection.channel()

# Declare the topic exchange (idempotent)
channel.exchange_declare(
    exchange='domain.events',
    exchange_type='topic',
    durable=True  # Survives broker restart
)

# ─── PRODUCER: publish with routing key ───────────────────────
def publish_event(routing_key: str, payload: dict):
    channel.basic_publish(
        exchange='domain.events',
        routing_key=routing_key,           # e.g. "orders.eu.placed"
        body=json.dumps(payload),
        properties=pika.BasicProperties(
            content_type='application/json',
            delivery_mode=2,              # 2 = persistent (survives restart)
            message_id='uuid-here',
            correlation_id='trace-id'
        )
    )

# ─── CONSUMER: bind queue with wildcard pattern ────────────────
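channel.queue_declare(
    queue='fulfillment.eu.orders',
    durable=True,
    # Route rejected messages to a dead-letter exchange. The DLX name is an
    # assumption; declare that exchange and bind a queue to it separately.
    arguments={'x-dead-letter-exchange': 'domain.events.dlx'}
)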
channel.queue_bind(
    queue='fulfillment.eu.orders',
    exchange='domain.events',
    routing_key='orders.eu.*'  # matches orders.eu.placed, orders.eu.cancelled, etc.
)

channel.basic_qos(prefetch_count=10)  # Process 10 in-flight max — back-pressure

def on_message(ch, method, properties, body):
    try:
        event = json.loads(body)
        process(event)
        ch.basic_ack(delivery_tag=method.delivery_tag)    # explicit ack
    except Exception:
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)  # → DLX

channel.basic_consume('fulfillment.eu.orders', on_message)
channel.start_consuming()

Broker Selection Guide

// CHOOSING THE RIGHT TOOL FOR THE JOB

Broker	Vendor	Throughput	Latency	Model	Retention	Replayable	Best For
Kafka	Apache / Confluent	Millions/sec	2–10ms	Log / Streaming	Days–Forever	Yes (offsets)	High throughput streaming, event sourcing, audit log
Pulsar	Apache / StreamNative	High	5–15ms	Streaming + Queue	Configurable	Yes (cursors)	Multi-tenant SaaS, geo-replication, unified messaging
RabbitMQ	Broadcom (VMware)	~50K/sec	<1ms	Message Queue	Until consumed	No for classic and quorum queues; stream queues (3.9+) can be re-read	Complex routing, task queues, low-latency, RPC

If you need…	Choose	Why
Millions of events/sec, durable replay, event sourcing	Kafka	Log-based, partitioned, designed for this workload
Complex routing rules, fanout, priority queues	RabbitMQ	Exchange model, broker-side routing logic
Multi-tenant SaaS, built-in geo-replication	Pulsar	First-class tenant/namespace primitives
Simple task queue / background jobs	RabbitMQ	Simpler ops, lower overhead for this use case
SQL-based stream queries, BI on event streams	Kafka + ksqlDB	ksqlDB provides a SQL-like dialect over Kafka topics (not full ANSI SQL)
Starting fresh, limited ops expertise	Managed Kafka	Confluent Cloud / MSK / Aiven — outsource ops complexity

Schema & Event Contracts

// SCHEMA EVOLUTION WITHOUT BREAKING CONSUMERS

Schema management is the hardest long-term problem in EDA. Producers and consumers are deployed independently — a schema change can break a consumer that hasn't been updated. Schema registries (Confluent Schema Registry, AWS Glue Schema Registry) enforce compatibility rules at publish time, preventing breaking changes from reaching the broker.

Backward Compatibility (Preferred)

Consumers on the new schema can read events written with the old schema, so consumers can be upgraded before producers. You can add optional fields with defaults, and you can delete fields. You cannot add a field without a default or change types. BACKWARD is the default mode in Confluent Schema Registry.

Forward Compatibility

Consumers on the old schema can read events written with the new schema, so producers can be upgraded before consumers. New producers publish v2 events; old consumers ignore the added fields and keep working. You can add fields, but you can delete only fields that have defaults.

Avro Schema — OrderPlaced v1 → v2 (Backward Compatible)

// v1 — original schema
{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "com.acme.orders.v1",
  "fields": [
    { "name": "orderId",    "type": "string" },
    { "name": "customerId", "type": "string" },
    { "name": "total",      "type": "double" }
  ]
}

// v2 — adding optional fields with defaults (BACKWARD COMPATIBLE ✓)
{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "com.acme.orders.v2",
  "fields": [
    { "name": "orderId",    "type": "string" },
    { "name": "customerId", "type": "string" },
    { "name": "total",      "type": "double" },
    { "name": "currency",   "type": "string",  "default": "USD" },  // NEW — has default ✓
    { "name": "region",    "type": ["null", "string"], "default": null }  // NEW — nullable ✓
  ]
}

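// BREAKING CHANGES that Schema Registry compatibility checks reject:
// ❌  Change field type (double → string): breaks BACKWARD and FORWARD
// ❌  Rename field without alias: breaks BACKWARD and FORWARD
// ❌  Add field without default: breaks BACKWARD (new readers cannot fill it from old data)
// ❌  Remove field that has no default: breaks FORWARD (old readers still expect it)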

// Schema Registry enforces compatibility before accepting new schema version
// curl -X POST -H "Content-Type: application/json" \
//   --data '{"schema": "..."}' \
//   http://schema-registry:8081/subjects/orders.placed-value/versions
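The direction of each compatibility mode is easier to see with a small checker. The sketch below models only two Avro resolution rules: a reader field missing from the writer needs a default, and shared fields must have identical types. It ignores type promotion, aliases, and nested records, so it is a teaching aid rather than a replacement for the registry's check. `can_read` and the `V3` schema are hypothetical.

```python
def fields(schema):
    return {f["name"]: f for f in schema["fields"]}


def can_read(reader, writer):
    """List the problems a consumer on `reader` hits decoding data written with `writer`."""
    r, w = fields(reader), fields(writer)
    problems = []
    for name, field in r.items():
        if name not in w:
            if "default" not in field:
                problems.append(f"{name}: missing in writer and has no default")
        elif field["type"] != w[name]["type"]:
            problems.append(f"{name}: type {w[name]['type']} -> {field['type']}")
    return problems                       # fields only the writer has are ignored


V1 = {"fields": [
    {"name": "orderId", "type": "string"},
    {"name": "customerId", "type": "string"},
    {"name": "total", "type": "double"},
]}
V2 = {"fields": V1["fields"] + [
    {"name": "currency", "type": "string", "default": "USD"},
    {"name": "region", "type": ["null", "string"], "default": None},
]}
V3 = {"fields": V1["fields"][:2]}         # hypothetical version that drops `total`

backward_v2 = can_read(reader=V2, writer=V1)   # new consumer, old data
forward_v2 = can_read(reader=V1, writer=V2)    # old consumer, new data
backward_v3 = can_read(reader=V3, writer=V1)
forward_v3 = can_read(reader=V1, writer=V3)
```

Adding fields with defaults (v2) passes in both directions. Dropping `total` (V3) still passes the backward check but fails the forward one, because old consumers expect a field that new data no longer carries.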

Ordering & Delivery Guarantees

// THE HOLY TRINITY OF DISTRIBUTED MESSAGING

Every broker makes trade-offs across three guarantees: ordering, delivery semantics, and throughput. Understanding these is essential for designing correct systems — especially under failure conditions.

Guarantee	Definition	Kafka	Pulsar	RabbitMQ
At-Most-Once	Messages may be lost, never duplicated	`acks=0`	Non-persistent topic	auto-ack, `noAck`
At-Least-Once	No loss, but duplicates possible on retry	`acks=all` (producer default since 3.0) + commit after processing	Default ack-based	Manual ack + publisher confirms
Exactly-Once	No loss, no duplicates — hardest guarantee	Transactions + idempotence	Transactions (limited)	No (use idempotent consumers)

Exactly-once is expensive and rarely necessary. The practical target for most systems is at-least-once delivery + idempotent consumers. Design your consumer to handle duplicate delivery safely (using a deduplication key), and you get effectively-once semantics at much lower overhead than true exactly-once transactions.

Ordering Guarantees in Kafka

Kafka guarantees ordering within a partition. With the default partitioner, all messages with the same key go to the same partition as long as the partition count does not change, so all events for orderId=8271 are processed in order. Events across different partitions (different orderIds) may interleave. If global ordering across entities is required, use a single partition, but this eliminates parallelism.
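Per-key ordering follows directly from hashing the key to pick a partition. The sketch below illustrates this with an in-memory log; `zlib.crc32` is used only as a stable stand-in hash (Kafka's default partitioner uses murmur2), and the event names are illustrative.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # crc32 is a stand-in for illustration; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append_all(events, num_partitions):
    """Append (key, payload) pairs to per-partition lists, preserving arrival order."""
    log = defaultdict(list)
    for key, payload in events:
        log[partition_for(key, num_partitions)].append((key, payload))
    return dict(log)


events = [
    ("order-8271", "OrderPlaced"),
    ("order-9000", "OrderPlaced"),
    ("order-8271", "PaymentSucceeded"),
    ("order-9000", "OrderCancelled"),
    ("order-8271", "OrderShipped"),
]
log = append_all(events, num_partitions=3)
home = partition_for("order-8271", 3)
history_8271 = [payload for key, payload in log[home] if key == "order-8271"]
```

Every event for `order-8271` lands in one partition in arrival order. Because the partition is `hash % num_partitions`, changing the partition count can move a key to a different partition, which is why ordering holds only while the count is stable.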

Error Handling & Dead Letter Queues

// FAILURES ARE EVENTS TOO

In EDA, processing failures must not block the message pipeline. A Dead Letter Queue (DLQ) captures messages that failed processing after all retry attempts. The DLQ preserves failed messages for investigation and re-processing without stalling the main consumer.

Python — Retry with Exponential Backoff + DLQ

import json, logging, random, time
from dataclasses import dataclass
from datetime import datetime, timezone

# Consumer and producer calls follow the confluent-kafka client API (poll, commit, produce).
# process_event is the application's handler, defined elsewhere.

class TransientError(Exception):
    """Retryable failure: DB timeout, downstream 503, etc."""

class PoisonPillError(Exception):
    """Non-retryable failure: malformed message."""

@dataclass
class RetryPolicy:
    max_attempts: int = 5
    base_delay_ms: int = 200
    max_delay_ms: int = 30_000
    jitter: bool = True

def backoff_delay(attempt: int, policy: RetryPolicy) -> float:
    """Exponential backoff after the Nth failure (1-based): 200ms, 400ms, 800ms, 1.6s, ..."""
    delay = policy.base_delay_ms * (2 ** (attempt - 1))
    delay = min(delay, policy.max_delay_ms)
    if policy.jitter:
        delay *= 0.75 + random.random() * 0.5      # ±25% jitter prevents thundering herd
    return delay / 1000                            # convert to seconds

def process_with_retry(consumer, dlq_producer, policy: RetryPolicy):
    while True:
        msg = consumer.poll(timeout=5.0)           # seconds; returns None when idle
        if msg is None or msg.error():
            continue

        attempt = 0
        while attempt < policy.max_attempts:
            try:
                process_event(msg.value())
                consumer.commit(message=msg, asynchronous=False)
                break
            except TransientError as e:            # DB timeout, downstream 503, etc.
                attempt += 1
                if attempt >= policy.max_attempts:
                    send_to_dlq(dlq_producer, msg, e, attempt)
                    consumer.commit(message=msg, asynchronous=False)  # advance past the failed message
                else:
                    time.sleep(backoff_delay(attempt, policy))
            except PoisonPillError as e:           # Malformed message: don't retry
                send_to_dlq(dlq_producer, msg, e, attempt=0)
                consumer.commit(message=msg, asynchronous=False)
                break

def send_to_dlq(producer, original_msg, error, attempt):
    dlq_event = {
        "originalTopic": original_msg.topic(),
        "originalPartition": original_msg.partition(),
        "originalOffset": original_msg.offset(),
        "payload": original_msg.value().decode("utf-8", errors="replace"),  # bytes are not JSON-serializable
        "error": str(error),
        "errorType": type(error).__name__,
        "failedAt": datetime.now(timezone.utc).isoformat(),
        "attempts": attempt
    }
    producer.produce('orders.placed.dlq', json.dumps(dlq_event))
    producer.flush()                               # ensure the DLQ write is sent before the offset is committed
    logging.error(f"Message sent to DLQ after {attempt} attempts: {error}")

Idempotency

// SAFE TO PROCESS TWICE — ALWAYS

At-least-once delivery means your consumer can receive the same message more than once, and in any long-running system it eventually will. A consumer is idempotent if processing the same message again produces the same result as processing it once. Idempotency is the practical alternative to expensive exactly-once transactions.

Database-Level Idempotency

Use the event's unique ID (eventId or messageId) as a deduplication key. Store processed IDs in a processed_events table with a unique constraint. INSERT … ON CONFLICT DO NOTHING makes re-processing a no-op. Prune old records after your retention window.

Natural Idempotency

Design operations to be naturally idempotent where possible: SET balance = 142.00 is idempotent; INCREMENT balance BY 10 is not. UPSERT is idempotent; INSERT is not. Use version fields and optimistic locking to detect and discard stale re-deliveries.
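The difference between the two styles can be sketched in a few lines. This is an illustration only: `Account`, `apply_increment`, and `apply_set_balance` are hypothetical names, and the version field stands in for whatever sequence number or version your events carry.

```python
from dataclasses import dataclass


@dataclass
class Account:
    balance: float = 0.0
    version: int = 0  # version of the last event applied


def apply_increment(account: Account, amount: float) -> None:
    """Not idempotent: every redelivery changes the balance again."""
    account.balance += amount


def apply_set_balance(account: Account, new_balance: float, event_version: int) -> bool:
    """Idempotent: sets an absolute value, guarded by a version check.

    Returns False when the event is a duplicate or a stale redelivery.
    """
    if event_version <= account.version:
        return False  # this version (or a newer one) was already applied
    account.balance = new_balance
    account.version = event_version
    return True


naive = Account(balance=132.00)
apply_increment(naive, 10)
apply_increment(naive, 10)   # duplicate delivery is applied a second time
print(naive.balance)         # 152.0

safe = Account(balance=132.00)
print(apply_set_balance(safe, 142.00, event_version=1))  # True
print(apply_set_balance(safe, 142.00, event_version=1))  # False
print(safe.balance)                                      # 142.0
```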

SQL — Idempotent Event Processing Pattern

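-- Deduplication table (processed_at is indexed so old rows can be pruned in batches)
CREATE TABLE processed_events (
    event_id        VARCHAR(64)  NOT NULL,
    consumer_group  VARCHAR(64)  NOT NULL,
    processed_at    TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    PRIMARY KEY (event_id, consumer_group)
);
-- Not range-partitioned on processed_at: PostgreSQL requires the partition key in
-- every unique constraint, which would weaken the (event_id, consumer_group) guarantee.
CREATE INDEX processed_events_processed_at_idx ON processed_events (processed_at);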

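-- Idempotent event handling as a PL/pgSQL function (GET DIAGNOSTICS and IF are
-- PL/pgSQL, not plain SQL). The call runs atomically inside one transaction.
-- handle_order_paid is an illustrative name; orders.id is assumed to be a string type.
CREATE OR REPLACE FUNCTION handle_order_paid(p_event_id VARCHAR, p_order_id VARCHAR)
RETURNS BOOLEAN AS $$
DECLARE
    v_affected INTEGER;
BEGIN
    -- 1. Record that we're processing this event (dedup check)
    INSERT INTO processed_events (event_id, consumer_group)
    VALUES (p_event_id, 'fulfillment-service')
    ON CONFLICT (event_id, consumer_group)
    DO NOTHING;

    -- 2. Check if row was inserted (affected rows = 0 → already processed)
    GET DIAGNOSTICS v_affected = ROW_COUNT;
    IF v_affected = 0 THEN
        RETURN FALSE;          -- skip silently: duplicate delivery
    END IF;

    -- 3. Apply business logic (runs at most once per event and consumer group)
    UPDATE orders
    SET status = 'fulfilling', updated_at = NOW()
    WHERE id = p_order_id
      AND status = 'paid';    -- state machine guard prevents double-fulfillment

    RETURN TRUE;
END;
$$ LANGUAGE plpgsql;

SELECT handle_order_paid('01HWZR5F4P7TKMXX9BQ2GV1ERD', '8271');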

Observability

// DISTRIBUTED TRACING ACROSS EVENT HOPS

Debugging an event-driven system without proper observability is nearly impossible — causality chains span multiple services, brokers, and asynchronous hops. The three pillars of EDA observability are distributed tracing (propagate trace context through event headers), consumer lag monitoring (the key health metric for any Kafka consumer), and event lineage (which events caused which downstream events).

Consumer Lag — The Primary SLA

Consumer lag = messages produced but not yet consumed. Rising lag = consumer can't keep up. Alert when lag > N events or lag grows for >5 minutes. Monitor per-partition. Tools: Kafka Exporter + Prometheus, Confluent Control Center, Burrow.
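A minimal sketch of the lag arithmetic and the two alert rules above, assuming the end offsets and committed offsets have already been fetched from the broker. `partition_lag` and `lag_alerts` are hypothetical helpers, and the samples are assumed to be taken at a fixed interval (for example, one per minute).

```python
from typing import Dict, List


def partition_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = log end offset minus committed offset, floored at 0."""
    return {p: max(end - committed.get(p, 0), 0) for p, end in end_offsets.items()}


def lag_alerts(samples: List[Dict[int, int]], max_lag: int) -> List[str]:
    """Check lag snapshots (oldest first) against a threshold and a growth rule."""
    alerts = []
    for p, lag in sorted(samples[-1].items()):
        history = [s.get(p, 0) for s in samples]
        # "Growing" here means strictly increasing across every sample in the window.
        growing = len(history) > 1 and all(a < b for a, b in zip(history, history[1:]))
        if lag > max_lag:
            alerts.append(f"partition {p}: lag {lag} exceeds {max_lag}")
        elif growing:
            alerts.append(f"partition {p}: lag growing across {len(history)} samples")
    return alerts


latest = partition_lag({0: 1500, 1: 900}, {0: 1450, 1: 100})
samples = [{0: 10, 1: 200}, {0: 30, 1: 500}, latest]
for alert in lag_alerts(samples, max_lag=500):
    print(alert)
```

Evaluating per partition matters: a single stuck partition can hide behind a healthy group-level total.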

Distributed Tracing

Propagate traceparent (W3C Trace Context) in event headers. Consumers extract trace context and start a child span. Kafka instrumentation libraries (OpenTelemetry) handle this automatically. Platforms: Jaeger, Tempo, Honeycomb.
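To make the header propagation concrete, here is a hand-rolled sketch of the W3C `traceparent` format (`version-traceid-parentid-flags`, all lowercase hex). In practice an OpenTelemetry instrumentation library would do this for you; `new_traceparent` and `child_traceparent` are hypothetical helpers, and only version `00` is handled.

```python
import re
import secrets
from typing import Dict, Optional

# 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(trace_id: Optional[str] = None, sampled: bool = True) -> str:
    """Build a traceparent value with a fresh span id (and trace id if none is given)."""
    trace_id = trace_id or secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(headers: Dict[str, str]) -> str:
    """Consumer side: continue the trace found in the event headers, or start a new one."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return new_traceparent(trace_id, sampled=(int(flags, 16) & 1) == 1)


# Producer attaches the header; the consumer derives a child span from it.
event_headers = {"traceparent": new_traceparent()}
consumer_span = child_traceparent(event_headers)
print(event_headers["traceparent"])
print(consumer_span)  # same trace id, new span id
```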

Key Metrics to Alert On

Consumer lag per group per partition
DLQ message count (should be zero)
Processing error rate per consumer
End-to-end event latency (produce → consume)
Broker disk utilization and ISR shrink events

Saga Pattern

// DISTRIBUTED TRANSACTIONS WITHOUT 2PC

A Saga is a sequence of local transactions, each publishing an event or message that triggers the next transaction. If any step fails, compensating transactions undo the preceding steps. Sagas replace distributed ACID transactions (2PC), which scale poorly and block whenever the coordinator or a participant is unreachable, with eventual consistency and explicit compensation logic.

Choreography-based Saga

Each service reacts to events from the previous step
No central coordinator — fully decentralized
Simple for linear flows; complex for branching
Hard to visualize the overall saga state
Good for: simple, stable flows with few steps

Orchestration-based Saga

Central saga orchestrator issues commands, awaits events
Saga state is explicit and queryable
Easier to add steps, error handling, timeouts
Saga orchestrator is a single point of failure (mitigated by persistence)
Good for: complex flows, audit requirements, many services
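The choreography-based variant can be sketched with an in-memory event bus standing in for the broker. This is an illustration under stated assumptions: the `EventBus` class, the `run_saga` function, and the event names are hypothetical, and delivery is synchronous, which a real broker would not be.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, str]], None]


class EventBus:
    """In-memory stand-in for a broker: delivers each event to its subscribers."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.log: List[str] = []  # event types in publication order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict[str, str]) -> None:
        self.log.append(event_type)
        for handler in self._handlers[event_type]:
            handler(payload)


def run_saga(payment_succeeds: bool) -> List[str]:
    bus = EventBus()

    # Inventory service: reserves on OrderPlaced, compensates on PaymentFailed.
    bus.subscribe("OrderPlaced", lambda e: bus.publish("InventoryReserved", e))
    bus.subscribe("PaymentFailed", lambda e: bus.publish("InventoryReleased", e))

    # Payment service: reacts to the reservation and reports the outcome.
    def charge(event: Dict[str, str]) -> None:
        bus.publish("PaymentCharged" if payment_succeeds else "PaymentFailed", event)

    bus.subscribe("InventoryReserved", charge)

    # Fulfillment service: reacts to a successful charge.
    bus.subscribe("PaymentCharged", lambda e: bus.publish("OrderFulfilled", e))

    bus.publish("OrderPlaced", {"orderId": "8271"})
    return bus.log


print(run_saga(payment_succeeds=True))
print(run_saga(payment_succeeds=False))
```

No component holds the overall saga state here; it can only be reconstructed from the event log, which is the visibility trade-off listed above.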

TypeScript — Orchestration Saga: Order Fulfillment

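// Saga steps: Reserve Inventory → Charge Payment → Notify Fulfillment
// Compensations (on failure): Release Inventory ← Refund Payment ← Cancel Fulfillment
// sagaStore, commandBus, eventBus, and retry are application-provided. commandBus.send is
// assumed to resolve on the service's success reply and reject on failure.

enum SagaState {
  STARTED, INVENTORY_RESERVED, PAYMENT_CHARGED,
  COMPLETED, COMPENSATING, FAILED
}

interface Saga {
  orderId: string;
  transition(state: SagaState): Promise<void>;
}
type Compensation = () => Promise<void>;

declare const sagaStore: { create(init: { orderId: string; state: SagaState }): Promise<Saga> };
declare const commandBus: { send(service: string, command: object): Promise<void> };
declare const eventBus: { publish(topic: string, event: object): Promise<void> };
declare function retry(fn: Compensation): Promise<void>;

class OrderFulfillmentSaga {
  async execute(orderId: string) {
    const saga = await sagaStore.create({ orderId, state: SagaState.STARTED });
    const releaseItems: Compensation = () => commandBus.send('inventory', { type: 'ReleaseItems', orderId });
    // 'RefundOrder' is an assumed command name for the payment compensation
    const refundPayment: Compensation = () => commandBus.send('payments', { type: 'RefundOrder', orderId });

    // Step 1: Reserve inventory (nothing to undo if this fails)
    try {
      await commandBus.send('inventory', { type: 'ReserveItems', orderId });
      await saga.transition(SagaState.INVENTORY_RESERVED);
    } catch { return this.compensate(saga, []); }

    // Step 2: Charge payment
    try {
      await commandBus.send('payments', { type: 'ChargeOrder', orderId });
      await saga.transition(SagaState.PAYMENT_CHARGED);
    } catch { return this.compensate(saga, [releaseItems]); }

    // Step 3: Trigger fulfillment (undo payment and inventory if it fails)
    try {
      await commandBus.send('fulfillment', { type: 'FulfillOrder', orderId });
    } catch { return this.compensate(saga, [releaseItems, refundPayment]); }
    await saga.transition(SagaState.COMPLETED);
    await eventBus.publish('orders.fulfilled', { orderId });
  }

  // Compensations are passed in forward step order and run in reverse.
  async compensate(saga: Saga, compensations: Compensation[]) {
    await saga.transition(SagaState.COMPENSATING);
    for (const comp of [...compensations].reverse()) {   // copy: reverse() mutates in place
      await retry(() => comp());   // compensations must also be retried until success
    }
    await saga.transition(SagaState.FAILED);
    await eventBus.publish('orders.fulfillment-failed', { orderId: saga.orderId });
  }
}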

EDA Anti-Patterns

// WHAT NOT TO DO — LESSONS LEARNED EXPENSIVELY

Event Spaghetti Arch Smell

Every service subscribes to every other service's events, creating a tangle of implicit dependencies. No one can trace what triggers what. Solution: domain-driven topic boundaries, bounded contexts, explicit event catalogs.

God Topics Avoid

A single events topic carrying all event types from all services. Consumers must filter; schema evolution breaks everyone. Solution: one topic per event type or per aggregate root.

Synchronous Expectations Mindset

Treating events as synchronous RPC — expecting an immediate response, timing out if not received in 200ms. EDA is inherently asynchronous. Design UX around eventual consistency with progress indicators, not blocking spinners.

No Schema Contract Critical

Publishing raw JSON with no schema registry. Works fine until the producer team renames a field and 8 downstream consumers silently start failing. Always register schemas. Always version them.

Storing Business State in Kafka Misuse

Using Kafka as the primary queryable state store. Kafka is not a database. Consumers need their own materialized views (CQRS read models) for queries. Use Kafka for the event log; project into queryable stores.

Non-Idempotent Consumers Bug

Assuming each message is delivered exactly once. At-least-once delivery is the guarantee. Any consumer that double-charges a customer or double-books a slot on redelivery has a production bug waiting to happen.

Migration Roadmap

// FROM MONOLITH OR REST TO EDA

Migrating to event-driven architecture is a multi-phase journey. The strangler fig pattern — gradually replacing direct calls with event flows without a big-bang rewrite — is the most reliable approach. Begin at the edges with high-value, high-volume flows before tackling the core domain.

Phase 0 — Months 1–2

Observe & Instrument

Map existing data flows — which services call which, volumes, latency
Identify top fan-out candidates (one call triggers N downstream side effects)
Choose broker and deploy in non-critical path first
Establish schema registry and naming conventions
Define consumer group naming, DLQ topology, and on-call runbooks

Phase 1 — Months 2–5

Dual-Write Strangler

Augment existing writes to also publish domain events (transactional outbox pattern)
New consumers read from events; existing consumers from DB — run in parallel
Validate event-derived state matches DB state for 30 days
Migrate notification, analytics, and audit consumers to events first (low risk)
Implement DLQ monitoring and reprocessing tooling

Phase 2 — Months 5–10

Choreography Cutover

Replace synchronous service-to-service calls with event-driven choreography
Implement saga orchestration for multi-step flows
Remove direct DB coupling between services — consume events, not shared tables
Introduce CQRS read models for query-heavy paths
Establish SLAs: max consumer lag, DLQ response time

Phase 3 — Months 10–18

Stream Processing & Optimization

Introduce real-time stream processing (Kafka Streams / Flink) for analytics and enrichment
Migrate high-throughput audit and CDC flows to Kafka Connect
Evaluate event sourcing for core domain aggregates where audit/replay is critical
Implement schema evolution governance and consumer group ownership registry
Continuous maturity assessment against EDA pillars

Transactional Outbox — Safe Dual-Write

The hardest problem when introducing events into an existing system: how to atomically update your database and publish an event. A crash between the two creates inconsistency. The Transactional Outbox pattern solves this by writing the event to an outbox table in the same database transaction. A separate relay process (for example, Debezium CDC) reads the outbox and publishes to Kafka with at-least-once delivery, in commit order, and resumes from its last position after a crash. Because a crash can cause the relay to publish an event twice, downstream consumers must still be idempotent.

SQL + Config — Transactional Outbox + Debezium

-- 1. Outbox table (created alongside your domain tables)
CREATE TABLE outbox_events (
    id           UUID         PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_id VARCHAR(64)  NOT NULL,
    event_type   VARCHAR(128) NOT NULL,
    payload      JSONB        NOT NULL,
    created_at   TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    published    BOOLEAN      NOT NULL DEFAULT FALSE
);

-- 2. Write to domain table AND outbox in ONE transaction (atomic)
BEGIN;
INSERT INTO orders (id, customer_id, total, status)
    VALUES ('8271', '903', 142.00, 'pending');

INSERT INTO outbox_events (aggregate_id, event_type, payload)
    VALUES ('8271', 'com.acme.orders.OrderPlaced',
        '{"orderId":"8271","customerId":"903","total":142.00}');
COMMIT;  -- Either both succeed or both rollback — no inconsistency possible

-- 3. Debezium connector reads PostgreSQL WAL (Change Data Capture)
--    and publishes outbox events to Kafka automatically
--    (debezium-connector-postgres with outbox.event.router SMT)

-- Debezium Kafka Connect config snippet:
-- {
--   "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
--   "database.hostname": "postgres", "database.dbname": "orders",
--   "table.include.list": "public.outbox_events",
--   "transforms": "outbox",
--   "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
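--   "transforms.outbox.table.field.event.type": "event_type",
--   "transforms.outbox.table.field.event.key": "aggregate_id",
--   "transforms.outbox.route.by.field": "event_type",
--   "transforms.outbox.route.topic.replacement": "${routedByValue}"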
-- }
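Where CDC is not available, a polling relay is a common alternative to the Debezium setup above. The sketch below uses an in-memory SQLite database so it can run on its own; the column types are SQLite equivalents of the PostgreSQL ones, `place_order` and `relay_once` are hypothetical names, and `publish` is a placeholder for a real Kafka producer call.

```python
import json
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE orders (
    id TEXT PRIMARY KEY, customer_id TEXT NOT NULL,
    total REAL NOT NULL, status TEXT NOT NULL
);
CREATE TABLE outbox_events (
    id TEXT PRIMARY KEY, aggregate_id TEXT NOT NULL,
    event_type TEXT NOT NULL, payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
"""


def place_order(conn, order_id, customer_id, total):
    """Write the order and its event in one transaction."""
    payload = json.dumps({"orderId": order_id, "customerId": customer_id, "total": total})
    with conn:  # commits both inserts together, or rolls both back on error
        conn.execute(
            "INSERT INTO orders (id, customer_id, total, status) VALUES (?, ?, ?, 'pending')",
            (order_id, customer_id, total),
        )
        conn.execute(
            "INSERT INTO outbox_events (id, aggregate_id, event_type, payload) VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), order_id, "com.acme.orders.OrderPlaced", payload),
        )


def relay_once(conn, publish):
    """Publish unpublished outbox rows in insertion order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, aggregate_id, event_type, payload FROM outbox_events "
        "WHERE published = 0 ORDER BY rowid"
    ).fetchall()
    sent = 0
    for event_id, aggregate_id, event_type, payload in rows:
        publish(event_type, aggregate_id, payload)  # if this raises, the row stays unpublished
        with conn:
            conn.execute("UPDATE outbox_events SET published = 1 WHERE id = ?", (event_id,))
        sent += 1
    return sent


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
place_order(conn, "8271", "903", 142.00)

published = []
print(relay_once(conn, lambda t, key, body: published.append((t, key))))  # 1
print(relay_once(conn, lambda t, key, body: published.append((t, key))))  # 0
```

The relay marks a row as published only after `publish` returns, so a crash between the two steps re-sends the event on the next pass. That is at-least-once delivery, which is why consumers still need to be idempotent.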

▸

Reference implementations & further reading: Martin Fowler's Event Sourcing pattern (martinfowler.com) · Chris Richardson's microservices.io (Saga, Outbox, CQRS patterns) · Confluent Developer docs (Kafka Streams, Schema Registry) · CloudEvents spec (cloudevents.io) · Lightbend Akka (reactive event-driven microservices) · Axon Framework (Java CQRS/ES framework).
