2025-12-16
Transactional Outbox Pattern: Reliable Event Publishing in Distributed Systems
Learn how the Transactional Outbox Pattern solves the dual-write problem in distributed systems, with practical implementations using PostgreSQL, DynamoDB, and CDC tools.
Abstract
The dual-write problem affects nearly every event-driven system I’ve worked with. When you need to update a database and publish an event atomically, you face an impossible choice: which operation fails when things go wrong? The Transactional Outbox Pattern provides a proven solution by writing both operations to the same database within a single transaction, then using a separate process to reliably publish events. This post covers practical implementations using polling publishers, Change Data Capture (CDC), and AWS serverless patterns.
The Dual-Write Problem
Here’s a scenario I’ve encountered repeatedly: an order service needs to save an order to the database and publish an OrderCreated event. The naive approach looks like this:
async function createOrder(orderData: Order) {
// Step 1: Save to database
await db('orders').insert(orderData);
// Step 2: Publish event
await messageQueue.publish('OrderCreated', orderData);
}
What could go wrong? Everything.
Failure Scenario 1: Database succeeds, event publish fails
- Network timeout to message broker
- Message broker temporarily down
- Your service crashes after database write
- Result: Order exists in database, but inventory service never receives the event. Stock is never reserved.
Failure Scenario 2: Event publish succeeds, database fails
- Database write violates constraint
- Transaction rolled back due to deadlock
- Database connection lost
- Result: Inventory service receives event and reserves stock, but order doesn’t exist. Data inconsistency.
Why Not Use Two-Phase Commit (2PC)?
You might ask: “Can’t we use distributed transactions?” Technically yes, but the trade-offs make it impractical:
- Performance overhead: Coordinating transactions across systems adds significant latency
- Reduced availability: If any participant is down, the entire operation fails
- Complexity: Implementing XA transactions correctly is difficult
- Limited support: Many message brokers don’t support 2PC
- Coupling: Violates microservices independence principles
Working with distributed systems taught me that avoiding distributed transactions is better than trying to make them work reliably.
Understanding the Outbox Pattern
The Transactional Outbox Pattern solves the dual-write problem through a simple insight: instead of writing to two separate systems (database + message broker), write to two tables in the same database within a single ACID transaction.
Core Components
- Outbox Table: Stores events to be published, lives in the same database as your business data
- Business Transaction: Single ACID transaction writing to both business tables and outbox
- Message Relay: Separate process reads outbox and publishes to message broker
- Idempotent Consumers: Downstream services handle duplicate events correctly
How It Works
The key insight: either both the business data and event are committed, or neither is. This guarantees atomicity between your state changes and event publishing.
Implementation Approach 1: Polling Publisher
The simplest approach polls the outbox table periodically. Here’s what works in practice:
Basic Implementation
// 1. Outbox table schema (PostgreSQL)
CREATE TABLE outbox (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
aggregate_type VARCHAR(100) NOT NULL,
aggregate_id VARCHAR(100) NOT NULL,
event_type VARCHAR(100) NOT NULL,
payload JSONB NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
published BOOLEAN DEFAULT FALSE
);
-- Critical index for efficient polling
CREATE INDEX idx_outbox_unpublished
ON outbox(created_at)
WHERE published = false;
Producer: Write to Outbox
async function createOrder(orderData: Order) {
await db.transaction(async (trx) => {
// Insert order
const order = await trx('orders').insert({
id: orderData.id,
customer_id: orderData.customerId,
total: orderData.total,
status: 'PENDING'
}).returning('*');
// Insert event to outbox IN SAME TRANSACTION
await trx('outbox').insert({
id: uuid(),
aggregate_type: 'Order',
aggregate_id: order[0].id,
event_type: 'OrderCreated',
payload: {
orderId: order[0].id,
customerId: orderData.customerId,
total: orderData.total,
items: orderData.items
},
created_at: new Date()
});
// Both succeed or both fail - atomicity guaranteed
});
}
Publisher: Poll and Publish
async function publishOutboxEvents() {
// Use FOR UPDATE SKIP LOCKED to prevent concurrent processing
const events = await db.raw(`
SELECT * FROM outbox
WHERE published = false
ORDER BY created_at
LIMIT 100
FOR UPDATE SKIP LOCKED
`);
for (const event of events.rows) {
try {
// Publish to message broker
await messageQueue.publish(event.event_type, {
messageId: event.id, // Important for deduplication
aggregateId: event.aggregate_id,
payload: event.payload
});
// Mark as published
await db('outbox')
.where('id', event.id)
.update({ published: true });
} catch (error) {
console.error('Failed to publish event:', error);
// Will retry on next poll - at-least-once delivery
}
}
}
// Run publisher every 5 seconds
setInterval(publishOutboxEvents, 5000);
The FOR UPDATE SKIP LOCKED clause is critical: it prevents multiple publisher instances from processing the same events, enabling horizontal scaling.
When to Use Polling
Pros:
- Simple to implement and understand
- No additional infrastructure required
- Works with any database
- Easy to debug with SQL queries
Cons:
- Polling adds database load
- Latency depends on poll interval (5-10 seconds typical)
- Less efficient than CDC for high volumes
Use polling when:
- Low to medium event volumes (< 1000 events/minute)
- Getting started quickly
- Simple architectures
- Your database doesn’t support CDC
Implementation Approach 2: Change Data Capture (CDC)
For production systems at scale, CDC eliminates polling overhead by monitoring the database transaction log directly.
How CDC Works
Instead of polling the outbox table, CDC tools like Debezium monitor the database’s Write-Ahead Log (PostgreSQL) or Binary Log (MySQL). When an outbox event is written, the CDC tool detects it and publishes to your message broker automatically.
PostgreSQL + Debezium Setup
-- 1. Enable logical replication (requires PostgreSQL restart)
ALTER SYSTEM SET wal_level = 'logical';
-- Note: PostgreSQL must be restarted for wal_level change to take effect
-- 2. Create publication for outbox table
CREATE PUBLICATION outbox_publication FOR TABLE outbox;
-- 3. Grant replication rights to Debezium user
ALTER USER debezium_user WITH REPLICATION;
Debezium Configuration
{
"name": "outbox-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres.example.com",
"database.port": "5432",
"database.user": "debezium_user",
"database.password": "${DB_PASSWORD}",
"database.dbname": "orders_db",
"database.server.name": "orders",
"table.include.list": "public.outbox",
"plugin.name": "pgoutput",
"transforms": "outbox",
"transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
"transforms.outbox.table.field.event.type": "event_type",
"transforms.outbox.table.field.event.key": "aggregate_id",
"transforms.outbox.table.field.payload": "payload"
}
}
Producer Code (Identical to Polling)
The beauty of CDC: your application code doesn’t change. You still write to the outbox table in the same transaction. Debezium handles the publishing.
// Same code as polling approach - no changes needed
async function createOrder(orderData: Order) {
await db.transaction(async (trx) => {
await trx('orders').insert(orderData);
await trx('outbox').insert({
aggregate_type: 'Order',
aggregate_id: orderData.id,
event_type: 'OrderCreated',
payload: orderData
});
});
// Debezium automatically detects the new outbox row and publishes
}
When to Use CDC
Pros:
- Near real-time event publishing (< 1 second)
- Minimal database overhead (reads WAL, not tables)
- Scales to high volumes (100K+ events/sec)
- Preserves event order per partition
Cons:
- Complex infrastructure (Kafka Connect, Debezium)
- Requires operational expertise
- Database-specific setup (WAL configuration)
- More expensive than serverless options
Use CDC when:
- High event volumes (> 1000 events/minute)
- Low latency requirements (< 1 second)
- Production systems at scale
- Already using Kafka ecosystem
AWS Implementation: DynamoDB + EventBridge Pipes
AWS provides a serverless outbox implementation using DynamoDB Streams and EventBridge Pipes. This is my preferred approach for AWS-native architectures.
Architecture
Implementation
// 1. Write both items in single transaction
// Note: DynamoDB transactions have limits - max 100 items, 4MB aggregate size
async function createOrder(orderData: Order) {
await dynamodb.transactWrite({
TransactItems: [
{
Put: {
TableName: 'Orders',
Item: {
orderId: { S: orderData.id },
customerId: { S: orderData.customerId },
total: { N: orderData.total.toString() },
status: { S: 'PENDING' }
}
}
},
{
Put: {
TableName: 'Outbox',
Item: {
eventId: { S: uuid() },
aggregateType: { S: 'Order' },
aggregateId: { S: orderData.id },
eventType: { S: 'OrderCreated' },
payload: { S: JSON.stringify(orderData) },
timestamp: { N: Date.now().toString() }
}
}
}
]
}).promise();
}
Infrastructure as Code (AWS CDK)
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as pipes from 'aws-cdk-lib/aws-pipes';
import * as events from 'aws-cdk-lib/aws-events';
// 1. Create outbox table with streams enabled
const outboxTable = new dynamodb.Table(this, 'OutboxTable', {
partitionKey: { name: 'eventId', type: dynamodb.AttributeType.STRING },
stream: dynamodb.StreamViewType.NEW_IMAGE, // Critical: stream new items
billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
removalPolicy: cdk.RemovalPolicy.DESTROY
});
// 2. Create event bus
const eventBus = new events.EventBus(this, 'OrderEventBus', {
eventBusName: 'order-events'
});
// 3. Create EventBridge Pipe (NO Lambda needed!)
new pipes.CfnPipe(this, 'OutboxPipe', {
source: outboxTable.tableStreamArn!,
target: eventBus.eventBusArn,
roleArn: pipeRole.roleArn,
sourceParameters: {
dynamoDbStreamParameters: {
startingPosition: 'LATEST',
batchSize: 10,
maximumRetryAttempts: 3, // Note: Default is -1 (infinite retry)
deadLetterConfig: {
arn: dlqQueue.queueArn
}
}
},
targetParameters: {
eventBridgeEventBusParameters: {
detailType: 'OutboxEvent',
source: 'outbox.publisher'
}
}
});
Why This Approach Works
No Lambda code for publishing: EventBridge Pipes automatically reads DynamoDB Streams and publishes to EventBridge. This eliminates:
- Cold start latency
- Lambda billing for publisher
- Code to maintain for the relay
Built-in reliability: Pipes include retry logic, dead-letter queues, and monitoring out of the box.
Cost efficiency: You only pay for events processed, not for idle publisher infrastructure.
Cost Analysis
Based on a system processing 10 million events per month:
- DynamoDB Streams: Free (included with DynamoDB)
- EventBridge Pipes: 4.00/month
- EventBridge Event Bus: 10.00/month
- Total: ~$14/month for 10M events
Compare to Lambda polling approach:
- Lambda invocations: 43,200/month (every minute) = ~$0.01
- Lambda duration: 100ms avg × 43,200 = ~$0.50
- RDS queries: Adds load to database
- Total: Similar cost but higher operational complexity
Handling Ordering and Idempotency
Ordering Guarantees
The outbox pattern preserves ordering per partition, not globally across all events.
// Ensure events for same aggregate are ordered
await kafka.producer.send({
topic: 'order-events',
messages: [{
key: event.aggregateId, // All events for ORDER-123 go to same partition
value: JSON.stringify(event.payload)
}]
});
For DynamoDB Streams, use the aggregate ID as the partition key:
await dynamodb.put({
TableName: 'Outbox',
Item: {
aggregateId: 'ORDER-123', // Partition key - ensures ordering
eventId: uuid(), // Sort key
eventType: 'OrderCreated',
timestamp: Date.now()
}
});
The Inbox Pattern: Consumer-Side Idempotency
The outbox pattern guarantees at-least-once delivery, which means events may be delivered multiple times. Consumers must handle duplicates.
The Inbox Pattern provides idempotent processing:
async function handleOrderCreatedEvent(event: OrderCreatedEvent) {
await db.transaction(async (trx) => {
// 1. Check if already processed
const existing = await trx('inbox')
.where('message_id', event.messageId)
.first();
if (existing) {
console.log('Duplicate message, skipping:', event.messageId);
return; // Idempotent - safe to skip
}
// 2. Process the event (your business logic)
await trx('inventory')
.where('product_id', event.productId)
.decrement('quantity', event.quantity);
// 3. Record as processed IN SAME TRANSACTION
await trx('inbox').insert({
message_id: event.messageId,
event_type: event.type,
processed_at: new Date()
});
// Either all three operations succeed, or all fail
});
// ACK to message broker only after successful commit
await messageQueue.ack(event.messageId);
}
Inbox table schema:
CREATE TABLE inbox (
message_id UUID PRIMARY KEY,
event_type VARCHAR(100),
processed_at TIMESTAMP DEFAULT NOW(),
payload JSONB -- Optional: for debugging
);
-- Cleanup old processed messages (run daily)
DELETE FROM inbox
WHERE processed_at < NOW() - INTERVAL '7 days';
Complete Pattern: Outbox + Inbox
Performance Considerations
Database Performance
Outbox table growth: Without cleanup, the outbox table grows indefinitely. I’ve seen this cause significant performance degradation.
-- Strategy 1: Delete immediately after publish
DELETE FROM outbox WHERE id = $1 AND published = true;
-- Strategy 2: Batch cleanup (run daily via cron)
DELETE FROM outbox
WHERE published = true
AND created_at < NOW() - INTERVAL '7 days';
-- Strategy 3: Table partitioning (PostgreSQL 10+)
CREATE TABLE outbox_2025_12 PARTITION OF outbox
FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');
-- Drop old partitions (much faster than DELETE)
DROP TABLE outbox_2025_11;
Index optimization: The partial index only indexes unpublished events, saving space:
CREATE INDEX idx_outbox_unpublished
ON outbox(created_at)
WHERE published = false;
Polling Publisher Tuning
Poll interval trade-offs:
- 1 second: Low latency, high database load
- 5 seconds: Balanced (recommended for most cases)
- 10+ seconds: Low overhead, higher latency
Batch size:
// Too small: many queries, inefficient
const batchSize = 10;
// Too large: long transactions, lock contention
const batchSize = 10000;
// Optimal: balance efficiency and transaction length
const batchSize = 100; // Recommended starting point
CDC Performance
Monitor replication lag to ensure Debezium keeps up:
-- Check replication slot lag
SELECT slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as lag
FROM pg_replication_slots
WHERE slot_type = 'logical';
If lag grows, your WAL files accumulate and can fill disk. This is a real operational concern I’ve dealt with.
Common Pitfalls and Solutions
Pitfall 1: Unbounded Table Growth
Problem: Outbox table grows indefinitely, queries slow down.
Solution: Implement automatic cleanup in your publisher:
async function publishAndCleanup() {
// Publish events
await publishOutboxEvents();
// Cleanup old published events (every 100 iterations)
if (cleanupCounter++ % 100 === 0) {
await db('outbox')
.where('published', true)
.where('created_at', '<', db.raw("NOW() - INTERVAL '7 days'"))
.delete();
}
}
Pitfall 2: Message Relay Failure Goes Unnoticed
Problem: Publisher crashes, events pile up unpublished.
Solution: Monitor outbox age metrics:
async function checkOutboxHealth() {
const result = await db('outbox')
.where('published', false)
.min('created_at as oldest')
.first();
if (!result.oldest) return; // No unpublished events
const ageMs = Date.now() - new Date(result.oldest).getTime();
const ageMinutes = ageMs / 60000;
if (ageMinutes > 5) {
alerting.trigger('OUTBOX_LAG_HIGH', {
ageMinutes,
message: 'Outbox events not being published'
});
}
}
// Run health check every minute
setInterval(checkOutboxHealth, 60000);
Pitfall 3: CDC Replication Slot Filling Disk
Problem: Debezium connector goes down, PostgreSQL WAL accumulates.
Solution: Monitor replication slots and set retention limits:
-- Set WAL retention limit
ALTER SYSTEM SET wal_keep_size = '10GB';
-- Monitor slot status
SELECT slot_name, active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as lag
FROM pg_replication_slots;
Alert if a slot is inactive for more than 5 minutes, indicating a publisher failure.
Comparison with Other Patterns
Outbox vs. Event Sourcing
| Aspect | Outbox Pattern | Event Sourcing |
|---|---|---|
| Purpose | Reliable event publishing | Events as source of truth |
| Event Lifetime | Short-lived (deleted after publish) | Permanent append-only log |
| State Storage | Current state in tables | Derived from events |
| Complexity | Low | High |
| Query Model | Direct database queries | Requires projections/CQRS |
| Best For | E-commerce orders, workflows | Banking, audit systems |
Key difference: In event sourcing, events are the permanent record. In outbox, events are a communication mechanism.
Outbox vs. Saga Pattern
The outbox pattern complements the saga pattern. Use outbox within each service participating in a saga:
// Order Service publishes OrderCreated via outbox
await db.transaction(async (trx) => {
await trx('orders').insert(order);
await trx('outbox').insert({ event_type: 'OrderCreated', payload: order });
});
// Saga Orchestrator receives OrderCreated, publishes commands via its own outbox
await db.transaction(async (trx) => {
await trx('saga_state').insert({ saga_id: orderId, step: 'INVENTORY_PENDING' });
await trx('outbox').insert({ event_type: 'ReserveInventory', payload: { orderId } });
});
Decision Framework
Use this framework to choose the right implementation:
Choose Polling when:
- Event volume < 1000/minute
- Getting started quickly
- Simple architecture preferred
- Database doesn’t support CDC
Choose CDC when:
- Event volume > 1000/minute
- Need < 1 second latency
- Production system at scale
- Already using Kafka
Choose DynamoDB + EventBridge when:
- Building on AWS
- Want serverless architecture
- Minimal operational overhead desired
- Cost-effective for moderate volumes
Production Readiness Checklist
Before deploying the outbox pattern to production:
- Cleanup strategy: Automated deletion of published events
- Monitoring: Outbox age, backlog size, publisher health
- Alerting: Lag exceeds threshold, publisher failures
- Idempotency: Inbox pattern or idempotency keys implemented
- Ordering: Partition key strategy for event ordering
- Dead Letter Queue: Failed events routed for investigation
- Schema versioning: Event payload versioning strategy
- Load testing: Verified at expected throughput
- Runbook: Documented recovery procedures
- Backup strategy: For outbox and inbox tables
Key Takeaways
Working with the outbox pattern across multiple systems taught me these lessons:
-
Start simple: Begin with polling publishers. Move to CDC only when you need the performance.
-
Monitor lag aggressively: The time between event creation and publishing is your most important metric. If this grows, your system is degrading.
-
Idempotency is non-negotiable: At-least-once delivery means duplicates will happen. Design for it from day one.
-
Clean up ruthlessly: Outbox tables that grow unbounded will eventually cause production issues. Automate cleanup.
-
Partition wisely: Event ordering within a partition is guaranteed. Use aggregate IDs as partition keys.
-
AWS makes it easier: DynamoDB + EventBridge Pipes provides a production-ready outbox with minimal code.
The outbox pattern isn’t just theory; it’s a battle-tested solution to the dual-write problem that I’ve relied on for building reliable event-driven systems. The implementations shown here are production-ready patterns you can adapt to your specific requirements.
Further Reading
Related posts
Learn multi-account AWS architecture patterns for building resilient event-driven systems. Explore account structure, EventBridge routing, cross-service communication, and operational challenges in distributed systems.
Named signals that justify a Kafka migration from a managed event bus, and a four-phase outbox-anchored playbook to move without rip-and-replace.
A platform-engineering default for multi-team AWS orgs: one event, many consumers, each in its own account with its own SQS and DLQ, fan-out lives in the event bus layer.
A pragmatic guide for designers working with async backends: three interaction patterns, when to use each, and four anti-patterns to push back against.
A practical introduction to idempotency for developers building APIs, payment flows, and message consumers. Covers HTTP method semantics, idempotency keys, database upserts, and common pitfalls with working Node.js examples.