Event-Driven
Architecture
The definitive field guide to building systems that react, not poll. Covers the full spectrum: Kafka partitions to RabbitMQ routing, event sourcing to CQRS, saga orchestration to schema evolution — everything required to design, build, and operate event-driven systems at scale.
What is Event-Driven Architecture
Event-Driven Architecture (EDA) is a design paradigm in which system components communicate by producing and consuming events — immutable records of things that have happened — rather than calling each other directly. Producers have no knowledge of consumers. Consumers have no knowledge of producers. The broker sits in between and is the only coupling point.
This inversion is the source of EDA's power: temporal decoupling (producer and consumer need not be available simultaneously), spatial decoupling (they need not know each other's address), and semantic decoupling (they agree only on the event schema, not on processing logic).
Producer and consumer do not need to be online at the same time. Events persist in the broker until consumed. Services scale and deploy independently without coordination windows.
Producers publish to a topic, not to a specific consumer endpoint. Adding new consumers requires zero changes to producers. Services discover data through subscriptions, not service registries.
Consumers scale horizontally against the event backlog. Slow consumers don't block producers. Back-pressure is managed by the broker, not by synchronous timeouts cascading through a call chain.
When EDA is the right choice
| Scenario | EDA Fit | Why |
|---|---|---|
| Order placed → multiple downstream actions (fulfillment, email, analytics) | EXCELLENT | Fan-out without producer knowing consumers; consumers added independently |
| Real-time fraud detection on transactions | EXCELLENT | Stream processing on event log; millisecond latency with Kafka Streams/Flink |
| Audit trail / compliance log | EXCELLENT | Immutable event log is the audit record; replay for compliance queries |
| Simple CRUD API, single consumer | POOR | Adds operational complexity with no benefit; use REST |
| Interactive query (user asks for their profile data) | POOR | Synchronous response required; use REST + query model (CQRS read side) |
| Cross-service saga / distributed transaction | GOOD | Choreography or orchestration via events; avoids 2PC distributed locking |
Core Messaging Patterns
Every EDA system is built from four primitive patterns. These compose into the larger architectures — pub/sub, event sourcing, stream processing — described in later sections. Understanding the primitives clarifies why different brokers make different design choices.
Each message is consumed by exactly one consumer. Multiple consumers compete for messages from the same queue — load balancing with guaranteed single delivery. Classic use: background job processing, task distribution. Work queue model.
Each message is delivered to all subscribers. Producers publish to a topic; every subscription receives a copy. Classic use: event notification, audit events, data replication. N consumers all see every event.
Events are written to an immutable, ordered, partitioned log. Consumers track their own offset. Any consumer can re-read history. New consumers can start from the beginning. Classic use: Kafka, Kinesis. Time-travel is the superpower.
Caller publishes a request event with a correlationId and a replyTo queue. Responder publishes the reply to that queue. Caller awaits. Bridges synchronous-seeming semantics over async infrastructure. Use sparingly.
Event Anatomy
An event is an immutable record of a fact that occurred in the past. Not a command (don't say "please process this"). Not a request. A fact: "Order 8271 was placed by user 903 at 14:22:05 UTC containing 3 items totalling $142.00." Events are named in past tense.
ProcessOrder— imperative, implies a specific handlerSendEmail— prescribes implementationUpdateInventory— tells receiver what to doChargeCustomer— couples business logic to event- Present/future tense names
- Contains processing instructions
OrderPlaced— past tense, describes realityPaymentSucceeded— fact, consumer decides what to doInventoryReserved— records what happenedCustomerCharged— consequence, not instruction- Past tense names always
- Contains only the data of what happened
Canonical Event Envelope
Publish / Subscribe
In a pub/sub model, producers (publishers) emit events to a topic. Subscribers express interest in topics and receive all matching events. The broker is responsible for delivering the event to all registered subscribers. Neither party knows the other exists — only the topic name is shared.
Service
Service
Service
Service
- Multiple downstream reactions to a single business event
- Adding new consumers without touching producers
- Event notification — "this happened, react however you want"
- Cross-domain data propagation (CDC events, domain events)
- No built-in ordering guarantee across consumers
- Consumer failures need independent dead-letter handling
- Complex debugging — causality chains are non-obvious
- Schema changes propagate to all consumers simultaneously
Event Sourcing
Event Sourcing is a persistence pattern where the state of an entity is never stored directly. Instead, only the sequence of events that led to the current state is persisted. Current state is derived by replaying events from the beginning (or from a snapshot). The event log is the database.
- Store current state only — history is lost
- UPDATE overwrites — no audit trail
- Temporal queries impossible (what did order look like last Tuesday?)
- Can't replay to fix a bug in processing logic
- Simple read path: SELECT * WHERE id = X
- Append-only event log — full history preserved
- Audit trail is free — every change recorded
- Time travel: replay to any point in time
- Fix bugs: replay with corrected logic to rebuild projections
- Write is fast (append); reads need projection rebuild
Snapshots — Performance at Scale
Replaying thousands of events to reconstitute state on every read is expensive. Snapshots periodically capture current state alongside the event sequence number. On load, restore from the latest snapshot and replay only subsequent events.
CQRS — Command Query Responsibility Segregation
CQRS separates the model used for writes (commands) from the model used for reads (queries). They can use different databases, different schemas, and scale independently. Commands mutate state and publish events; queries read from purpose-built projections optimized for the specific view.
CancelOrder
PostgreSQL
ListOrders
Redis
- One read model per query pattern — no compromise
- Denormalize ruthlessly — joins at read time are expensive
- Optimize for the exact query shape the UI needs
- Rebuild-ability is a requirement: projectors must be replayable
- Eventual consistency with the write model — always
- Elasticsearch: full-text search, faceted navigation
- Redis: hot read paths, leaderboards, counters
- PostgreSQL (materialized view): complex relational queries
- DynamoDB: single-table design, predictable latency
- ClickHouse / BigQuery: analytics queries over large datasets
Stream Processing
Stream processing is real-time computation over unbounded event sequences. Unlike batch processing (which operates on finite datasets at rest), stream processing applies transformations, aggregations, and joins continuously as events arrive — with sub-second latency.
| Framework | Model | Latency | Best For | Managed Option |
|---|---|---|---|---|
| Kafka Streams | Library (no cluster) | ms | Kafka-native, small-medium topologies, microservice-embedded | Confluent Cloud |
| Apache Flink | Cluster, stateful DAG | ms | Complex CEP, large state, exactly-once, SQL over streams | AWS Kinesis Analytics, Confluent |
| Apache Spark Streaming | Micro-batch | seconds | ML pipelines, batch+stream unified, large cluster analytics | Databricks, EMR, Dataproc |
| ksqlDB | SQL over Kafka | ms | SQL-native stream queries, materialized tables, no-code streaming | Confluent Cloud |
| Pulsar Functions | Serverless functions | ms | Pulsar-native, simple transform/filter pipelines | StreamNative, Astra Streaming |
Apache Kafka
Kafka is a distributed, partitioned, replicated commit log. It is not a message queue in the traditional sense — it is a durable, ordered, replayable event log that producers append to and consumers read from at their own pace. Kafka's architecture makes it the dominant choice for high-throughput event streaming at enterprise scale.
Core Architecture Concepts
A topic is split into N partitions. Each partition is an ordered, immutable sequence of records. Partitions are the unit of parallelism — you can have at most as many active consumers as partitions in a group. Choose partition count based on target throughput.
Each message has a monotonically increasing offset per partition. Consumer groups track their committed offset per partition — Kafka never deletes data based on consumption. Multiple groups can read the same topic independently at their own offsets.
Each partition has a leader and N-1 follower replicas. acks=all + min.insync.replicas=2 ensures no data loss under broker failure. Replication factor 3 is standard for production. KRaft (since 3.3) eliminates ZooKeeper dependency.
Kafka Topic Design Rules
| Decision | Recommendation | Rationale |
|---|---|---|
| Partition count | Start with 6–12 per topic; multiply expected consumer group size by 2–3 | Cannot decrease partitions; over-partition rather than under-partition |
| Replication factor | 3 for production; 1 for dev only | Survives 1 broker failure with RF=3 |
| Retention | 7 days default; event store topics: 1 year or compact+delete | Log compaction keeps last value per key (good for state topics) |
| Message key | Use entity ID (orderId, userId) — ensures ordering per entity | All events for the same key land in the same partition |
| Topic naming | {domain}.{entity}.{event-type} e.g. orders.order.placed | Hierarchical, discoverable, schema-registry friendly |
Apache Pulsar
Pulsar separates compute (brokers) from storage (Apache BookKeeper), enabling instant elastic scaling of brokers without data rebalancing. It supports both streaming and queuing natively, built-in multi-tenancy, and geo-replication across data centers — features that require significant additional tooling in Kafka.
- Separated storage: Scale brokers independently from BookKeeper storage nodes
- Built-in multi-tenancy: Namespaces, tenants, and quotas are first-class primitives
- Native geo-replication: Async or sync replication across regions configured declaratively
- Unified model: Topic supports streaming + queue + pub-sub in a single API
- Functions: Serverless compute embedded in the broker (Pulsar Functions)
- Kafka ecosystem is vastly larger (Kafka Connect, ksqlDB, 200+ connectors)
- Kafka Streams is more mature than Pulsar's streaming APIs
- Operational expertise far more available for Kafka
- BookKeeper adds operational complexity (3-tier: ZK + BK + Broker)
- Schema registry is third-party (not built-in like Pulsar's)
RabbitMQ
RabbitMQ is a traditional message broker implementing AMQP. Its power lies in its flexible routing model: messages flow through exchanges that apply routing rules to deliver to one or more queues. This makes complex routing topologies (fanout, topic-based, header-based, direct) easy to configure without writing code. RabbitMQ prioritizes low-latency delivery and a rich broker-side feature set over raw throughput.
Exchange Types — RabbitMQ's Routing Power
| Exchange Type | Routing Rule | Use Case | Example Key |
|---|---|---|---|
| Direct | Routes to queues where binding key = routing key (exact match) | Task queues, error routing, targeted delivery | order.created |
| Topic | Routing key with wildcards: * (one word) and # (zero or more) |
Selective subscriptions, hierarchical event routing | orders.*.placed, orders.# |
| Fanout | Broadcasts to ALL bound queues, ignores routing key | Pub/sub, notifications, cache invalidation | — (ignored) |
| Headers | Routes based on message header attributes (x-match: all/any) | Content-based routing, multi-attribute filtering | {region: eu, priority: high} |
Broker Selection Guide
| If you need… | Choose | Why |
|---|---|---|
| Millions of events/sec, durable replay, event sourcing | Kafka | Log-based, partitioned, designed for this workload |
| Complex routing rules, fanout, priority queues | RabbitMQ | Exchange model, broker-side routing logic |
| Multi-tenant SaaS, built-in geo-replication | Pulsar | First-class tenant/namespace primitives |
| Simple task queue / background jobs | RabbitMQ | Simpler ops, lower overhead for this use case |
| SQL-based stream queries, BI on event streams | Kafka + ksqlDB | ksqlDB provides ANSI SQL over Kafka topics |
| Starting fresh, limited ops expertise | Managed Kafka | Confluent Cloud / MSK / Aiven — outsource ops complexity |
Schema & Event Contracts
Schema management is the hardest long-term problem in EDA. Producers and consumers are deployed independently — a schema change can break a consumer that hasn't been updated. Schema registries (Confluent Schema Registry, AWS Glue Schema Registry) enforce compatibility rules at publish time, preventing breaking changes from reaching the broker.
Old consumers can read new events. You can add optional fields with defaults. You cannot remove fields or change types. New producers publish v2 events; old consumers still process them using default values for new fields.
New consumers can read old events. Old producers publish v1; new consumers handle missing new fields gracefully. Allows consumer upgrades before producer upgrades.
Ordering & Delivery Guarantees
Every broker makes trade-offs across three guarantees: ordering, delivery semantics, and throughput. Understanding these is essential for designing correct systems — especially under failure conditions.
| Guarantee | Definition | Kafka | Pulsar | RabbitMQ |
|---|---|---|---|---|
| At-Most-Once | Messages may be lost, never duplicated | acks=0 |
Non-persistent sub | auto-ack, noAck |
| At-Least-Once | No loss, but duplicates possible on retry | Default acks=1 |
Default ack-based | Default manual ack |
| Exactly-Once | No loss, no duplicates — hardest guarantee | Transactions + idempotence | Transactions (limited) | No (use idempotent consumers) |
Ordering Guarantees in Kafka
Kafka guarantees ordering within a partition. All messages with the same key go to the same partition — so all events for orderId=8271 are processed in order. Events across different partitions (different orderIds) may interleave. If global ordering across entities is required, use a single partition — but this eliminates parallelism.
Error Handling & Dead Letter Queues
In EDA, processing failures must not block the message pipeline. A Dead Letter Queue (DLQ) captures messages that failed processing after all retry attempts. The DLQ preserves failed messages for investigation and re-processing without stalling the main consumer.
Idempotency
At-least-once delivery means your consumer will receive the same message more than once — guaranteed. A consumer that can be safely re-executed with the same message (same result as processing it once) is idempotent. Idempotency is the practical alternative to expensive exactly-once transactions.
Use the event's unique ID (eventId or messageId) as a deduplication key. Store processed IDs in a processed_events table with a unique constraint. INSERT … ON CONFLICT DO NOTHING makes re-processing a no-op. Prune old records after your retention window.
Design operations to be naturally idempotent where possible: SET balance = 142.00 is idempotent; INCREMENT balance BY 10 is not. UPSERT is idempotent; INSERT is not. Use version fields and optimistic locking to detect and discard stale re-deliveries.
Observability
Debugging an event-driven system without proper observability is nearly impossible — causality chains span multiple services, brokers, and asynchronous hops. The three pillars of EDA observability are distributed tracing (propagate trace context through event headers), consumer lag monitoring (the key health metric for any Kafka consumer), and event lineage (which events caused which downstream events).
Consumer lag = messages produced but not yet consumed. Rising lag = consumer can't keep up. Alert when lag > N events or lag grows for >5 minutes. Monitor per-partition. Tools: Kafka Exporter + Prometheus, Confluent Control Center, Burrow.
Propagate traceparent (W3C Trace Context) in event headers. Consumers extract trace context and start a child span. Kafka instrumentation libraries (OpenTelemetry) handle this automatically. Platforms: Jaeger, Tempo, Honeycomb.
- Consumer lag per group per partition
- DLQ message count (should be zero)
- Processing error rate per consumer
- End-to-end event latency (produce → consume)
- Broker disk utilization and ISR shrink events
Saga Pattern
A Saga is a sequence of local transactions, each publishing an event or message that triggers the next transaction. If any step fails, compensating transactions undo the preceding steps. Sagas replace distributed ACID transactions (2PC) — which don't scale and are unavailable under network partition — with eventual consistency and explicit compensation logic.
- Each service reacts to events from the previous step
- No central coordinator — fully decentralized
- Simple for linear flows; complex for branching
- Hard to visualize the overall saga state
- Good for: simple, stable flows with few steps
- Central saga orchestrator issues commands, awaits events
- Saga state is explicit and queryable
- Easier to add steps, error handling, timeouts
- Saga orchestrator is a single point of failure (mitigated by persistence)
- Good for: complex flows, audit requirements, many services
EDA Anti-Patterns
Every service subscribes to every other service's events, creating a tangle of implicit dependencies. No one can trace what triggers what. Solution: domain-driven topic boundaries, bounded contexts, explicit event catalogs.
A single events topic carrying all event types from all services. Consumers must filter; schema evolution breaks everyone. Solution: one topic per event type or per aggregate root.
Treating events as synchronous RPC — expecting an immediate response, timing out if not received in 200ms. EDA is inherently asynchronous. Design UX around eventual consistency with progress indicators, not blocking spinners.
Publishing raw JSON with no schema registry. Works fine until the producer team renames a field and 8 downstream consumers silently start failing. Always register schemas. Always version them.
Using Kafka as the primary queryable state store. Kafka is not a database. Consumers need their own materialized views (CQRS read models) for queries. Use Kafka for the event log; project into queryable stores.
Assuming each message is delivered exactly once. At-least-once delivery is the guarantee. Any consumer that double-charges a customer or double-books a slot on redelivery has a production bug waiting to happen.
Migration Roadmap
Migrating to event-driven architecture is a multi-phase journey. The strangler fig pattern — gradually replacing direct calls with event flows without a big-bang rewrite — is the most reliable approach. Begin at the edges with high-value, high-volume flows before tackling the core domain.
- Map existing data flows — which services call which, volumes, latency
- Identify top fan-out candidates (one call triggers N downstream side effects)
- Choose broker and deploy in non-critical path first
- Establish schema registry and naming conventions
- Define consumer group naming, DLQ topology, and on-call runbooks
- Augment existing writes to also publish domain events (transactional outbox pattern)
- New consumers read from events; existing consumers from DB — run in parallel
- Validate event-derived state matches DB state for 30 days
- Migrate notification, analytics, and audit consumers to events first (low risk)
- Implement DLQ monitoring and reprocessing tooling
- Replace synchronous service-to-service calls with event-driven choreography
- Implement saga orchestration for multi-step flows
- Remove direct DB coupling between services — consume events, not shared tables
- Introduce CQRS read models for query-heavy paths
- Establish SLAs: max consumer lag, DLQ response time
- Introduce real-time stream processing (Kafka Streams / Flink) for analytics and enrichment
- Migrate high-throughput audit and CDC flows to Kafka Connect
- Evaluate event sourcing for core domain aggregates where audit/replay is critical
- Implement schema evolution governance and consumer group ownership registry
- Continuous maturity assessment against EDA pillars
Transactional Outbox — Safe Dual-Write
The hardest problem when introducing events into an existing system: how to atomically update your database and publish an event. A crash between the two creates inconsistency. The Transactional Outbox pattern solves this by writing the event to an outbox table in the same database transaction. A separate relay process (Debezium CDC) reads the outbox and publishes to Kafka — guaranteed, ordered, and recoverable.