2025-12-16

Transactional Outbox Pattern: Reliable Event Publishing in Distributed Systems

Learn how the Transactional Outbox Pattern solves the dual-write problem in distributed systems, with practical implementations using PostgreSQL, DynamoDB, and CDC tools.

Abstract

The dual-write problem affects nearly every event-driven system I’ve worked with. When you need to update a database and publish an event atomically, you face an impossible choice: which operation fails when things go wrong? The Transactional Outbox Pattern provides a proven solution by writing both operations to the same database within a single transaction, then using a separate process to reliably publish events. This post covers practical implementations using polling publishers, Change Data Capture (CDC), and AWS serverless patterns.

The Dual-Write Problem

Here’s a scenario I’ve encountered repeatedly: an order service needs to save an order to the database and publish an OrderCreated event. The naive approach looks like this:

async function createOrder(orderData: Order) {
  // Step 1: Save to database
  await db('orders').insert(orderData);

  // Step 2: Publish event
  await messageQueue.publish('OrderCreated', orderData);
}

What could go wrong? Everything.

Failure Scenario 1: Database succeeds, event publish fails

Network timeout to message broker
Message broker temporarily down
Your service crashes after database write
Result: Order exists in database, but inventory service never receives the event. Stock is never reserved.

Failure Scenario 2: Event publish succeeds, database fails

Database write violates constraint
Transaction rolled back due to deadlock
Database connection lost
Result: Inventory service receives event and reserves stock, but order doesn’t exist. Data inconsistency.

Why Not Use Two-Phase Commit (2PC)?

You might ask: “Can’t we use distributed transactions?” Technically yes, but the trade-offs make it impractical:

Performance overhead: Coordinating transactions across systems adds significant latency
Reduced availability: If any participant is down, the entire operation fails
Complexity: Implementing XA transactions correctly is difficult
Limited support: Many message brokers don’t support 2PC
Coupling: Violates microservices independence principles

Working with distributed systems taught me that avoiding distributed transactions is better than trying to make them work reliably.

Understanding the Outbox Pattern

The Transactional Outbox Pattern solves the dual-write problem through a simple insight: instead of writing to two separate systems (database + message broker), write to two tables in the same database within a single ACID transaction.

Core Components

Outbox Table: Stores events to be published, lives in the same database as your business data
Business Transaction: Single ACID transaction writing to both business tables and outbox
Message Relay: Separate process reads outbox and publishes to message broker
Idempotent Consumers: Downstream services handle duplicate events correctly

How It Works

The key insight: either both the business data and event are committed, or neither is. This guarantees atomicity between your state changes and event publishing.

Implementation Approach 1: Polling Publisher

The simplest approach polls the outbox table periodically. Here’s what works in practice:

Basic Implementation

// 1. Outbox table schema (PostgreSQL)
CREATE TABLE outbox (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  aggregate_type VARCHAR(100) NOT NULL,
  aggregate_id VARCHAR(100) NOT NULL,
  event_type VARCHAR(100) NOT NULL,
  payload JSONB NOT NULL,
  created_at TIMESTAMP DEFAULT NOW(),
  published BOOLEAN DEFAULT FALSE
);

-- Critical index for efficient polling
CREATE INDEX idx_outbox_unpublished
ON outbox(created_at)
WHERE published = false;

Producer: Write to Outbox

async function createOrder(orderData: Order) {
  await db.transaction(async (trx) => {
    // Insert order
    const order = await trx('orders').insert({
      id: orderData.id,
      customer_id: orderData.customerId,
      total: orderData.total,
      status: 'PENDING'
    }).returning('*');

    // Insert event to outbox IN SAME TRANSACTION
    await trx('outbox').insert({
      id: uuid(),
      aggregate_type: 'Order',
      aggregate_id: order[0].id,
      event_type: 'OrderCreated',
      payload: {
        orderId: order[0].id,
        customerId: orderData.customerId,
        total: orderData.total,
        items: orderData.items
      },
      created_at: new Date()
    });

    // Both succeed or both fail - atomicity guaranteed
  });
}

Publisher: Poll and Publish

async function publishOutboxEvents() {
  // Use FOR UPDATE SKIP LOCKED to prevent concurrent processing
  const events = await db.raw(`
    SELECT * FROM outbox
    WHERE published = false
    ORDER BY created_at
    LIMIT 100
    FOR UPDATE SKIP LOCKED
  `);

  for (const event of events.rows) {
    try {
      // Publish to message broker
      await messageQueue.publish(event.event_type, {
        messageId: event.id,  // Important for deduplication
        aggregateId: event.aggregate_id,
        payload: event.payload
      });

      // Mark as published
      await db('outbox')
        .where('id', event.id)
        .update({ published: true });

    } catch (error) {
      console.error('Failed to publish event:', error);
      // Will retry on next poll - at-least-once delivery
    }
  }
}

// Run publisher every 5 seconds
setInterval(publishOutboxEvents, 5000);

The FOR UPDATE SKIP LOCKED clause is critical: it prevents multiple publisher instances from processing the same events, enabling horizontal scaling.

When to Use Polling

Pros:

Simple to implement and understand
No additional infrastructure required
Works with any database
Easy to debug with SQL queries

Cons:

Polling adds database load
Latency depends on poll interval (5-10 seconds typical)
Less efficient than CDC for high volumes

Use polling when:

Low to medium event volumes (< 1000 events/minute)
Getting started quickly
Simple architectures
Your database doesn’t support CDC

Implementation Approach 2: Change Data Capture (CDC)

For production systems at scale, CDC eliminates polling overhead by monitoring the database transaction log directly.

How CDC Works

Instead of polling the outbox table, CDC tools like Debezium monitor the database’s Write-Ahead Log (PostgreSQL) or Binary Log (MySQL). When an outbox event is written, the CDC tool detects it and publishes to your message broker automatically.

PostgreSQL + Debezium Setup

-- 1. Enable logical replication (requires PostgreSQL restart)
ALTER SYSTEM SET wal_level = 'logical';
-- Note: PostgreSQL must be restarted for wal_level change to take effect

-- 2. Create publication for outbox table
CREATE PUBLICATION outbox_publication FOR TABLE outbox;

-- 3. Grant replication rights to Debezium user
ALTER USER debezium_user WITH REPLICATION;

Debezium Configuration

{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.example.com",
    "database.port": "5432",
    "database.user": "debezium_user",
    "database.password": "${DB_PASSWORD}",
    "database.dbname": "orders_db",
    "database.server.name": "orders",
    "table.include.list": "public.outbox",
    "plugin.name": "pgoutput",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.table.field.event.type": "event_type",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.table.field.payload": "payload"
  }
}

Producer Code (Identical to Polling)

The beauty of CDC: your application code doesn’t change. You still write to the outbox table in the same transaction. Debezium handles the publishing.

// Same code as polling approach - no changes needed
async function createOrder(orderData: Order) {
  await db.transaction(async (trx) => {
    await trx('orders').insert(orderData);
    await trx('outbox').insert({
      aggregate_type: 'Order',
      aggregate_id: orderData.id,
      event_type: 'OrderCreated',
      payload: orderData
    });
  });
  // Debezium automatically detects the new outbox row and publishes
}

When to Use CDC

Pros:

Near real-time event publishing (< 1 second)
Minimal database overhead (reads WAL, not tables)
Scales to high volumes (100K+ events/sec)
Preserves event order per partition

Cons:

Complex infrastructure (Kafka Connect, Debezium)
Requires operational expertise
Database-specific setup (WAL configuration)
More expensive than serverless options

Use CDC when:

High event volumes (> 1000 events/minute)
Low latency requirements (< 1 second)
Production systems at scale
Already using Kafka ecosystem

AWS Implementation: DynamoDB + EventBridge Pipes

AWS provides a serverless outbox implementation using DynamoDB Streams and EventBridge Pipes. This is my preferred approach for AWS-native architectures.

Architecture

Implementation

// 1. Write both items in single transaction
// Note: DynamoDB transactions have limits - max 100 items, 4MB aggregate size
async function createOrder(orderData: Order) {
  await dynamodb.transactWrite({
    TransactItems: [
      {
        Put: {
          TableName: 'Orders',
          Item: {
            orderId: { S: orderData.id },
            customerId: { S: orderData.customerId },
            total: { N: orderData.total.toString() },
            status: { S: 'PENDING' }
          }
        }
      },
      {
        Put: {
          TableName: 'Outbox',
          Item: {
            eventId: { S: uuid() },
            aggregateType: { S: 'Order' },
            aggregateId: { S: orderData.id },
            eventType: { S: 'OrderCreated' },
            payload: { S: JSON.stringify(orderData) },
            timestamp: { N: Date.now().toString() }
          }
        }
      }
    ]
  }).promise();
}

Infrastructure as Code (AWS CDK)

import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as pipes from 'aws-cdk-lib/aws-pipes';
import * as events from 'aws-cdk-lib/aws-events';

// 1. Create outbox table with streams enabled
const outboxTable = new dynamodb.Table(this, 'OutboxTable', {
  partitionKey: { name: 'eventId', type: dynamodb.AttributeType.STRING },
  stream: dynamodb.StreamViewType.NEW_IMAGE,  // Critical: stream new items
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
  removalPolicy: cdk.RemovalPolicy.DESTROY
});

// 2. Create event bus
const eventBus = new events.EventBus(this, 'OrderEventBus', {
  eventBusName: 'order-events'
});

// 3. Create EventBridge Pipe (NO Lambda needed!)
new pipes.CfnPipe(this, 'OutboxPipe', {
  source: outboxTable.tableStreamArn!,
  target: eventBus.eventBusArn,
  roleArn: pipeRole.roleArn,
  sourceParameters: {
    dynamoDbStreamParameters: {
      startingPosition: 'LATEST',
      batchSize: 10,
      maximumRetryAttempts: 3,  // Note: Default is -1 (infinite retry)
      deadLetterConfig: {
        arn: dlqQueue.queueArn
      }
    }
  },
  targetParameters: {
    eventBridgeEventBusParameters: {
      detailType: 'OutboxEvent',
      source: 'outbox.publisher'
    }
  }
});

Why This Approach Works

No Lambda code for publishing: EventBridge Pipes automatically reads DynamoDB Streams and publishes to EventBridge. This eliminates:

Cold start latency
Lambda billing for publisher
Code to maintain for the relay

Built-in reliability: Pipes include retry logic, dead-letter queues, and monitoring out of the box.

Cost efficiency: You only pay for events processed, not for idle publisher infrastructure.

Cost Analysis

Based on a system processing 10 million events per month:

DynamoDB Streams: Free (included with DynamoDB)
EventBridge Pipes: $0.40/million events =$ 4.00/month
EventBridge Event Bus: $1.00/million events =$ 10.00/month
Total: ~$14/month for 10M events

Compare to Lambda polling approach:

Lambda invocations: 43,200/month (every minute) = ~$0.01
Lambda duration: 100ms avg × 43,200 = ~$0.50
RDS queries: Adds load to database
Total: Similar cost but higher operational complexity

Handling Ordering and Idempotency

Ordering Guarantees

The outbox pattern preserves ordering per partition, not globally across all events.

// Ensure events for same aggregate are ordered
await kafka.producer.send({
  topic: 'order-events',
  messages: [{
    key: event.aggregateId,  // All events for ORDER-123 go to same partition
    value: JSON.stringify(event.payload)
  }]
});

For DynamoDB Streams, use the aggregate ID as the partition key:

await dynamodb.put({
  TableName: 'Outbox',
  Item: {
    aggregateId: 'ORDER-123',  // Partition key - ensures ordering
    eventId: uuid(),  // Sort key
    eventType: 'OrderCreated',
    timestamp: Date.now()
  }
});

The Inbox Pattern: Consumer-Side Idempotency

The outbox pattern guarantees at-least-once delivery, which means events may be delivered multiple times. Consumers must handle duplicates.

The Inbox Pattern provides idempotent processing:

async function handleOrderCreatedEvent(event: OrderCreatedEvent) {
  await db.transaction(async (trx) => {
    // 1. Check if already processed
    const existing = await trx('inbox')
      .where('message_id', event.messageId)
      .first();

    if (existing) {
      console.log('Duplicate message, skipping:', event.messageId);
      return;  // Idempotent - safe to skip
    }

    // 2. Process the event (your business logic)
    await trx('inventory')
      .where('product_id', event.productId)
      .decrement('quantity', event.quantity);

    // 3. Record as processed IN SAME TRANSACTION
    await trx('inbox').insert({
      message_id: event.messageId,
      event_type: event.type,
      processed_at: new Date()
    });

    // Either all three operations succeed, or all fail
  });

  // ACK to message broker only after successful commit
  await messageQueue.ack(event.messageId);
}

Inbox table schema:

CREATE TABLE inbox (
  message_id UUID PRIMARY KEY,
  event_type VARCHAR(100),
  processed_at TIMESTAMP DEFAULT NOW(),
  payload JSONB  -- Optional: for debugging
);

-- Cleanup old processed messages (run daily)
DELETE FROM inbox
WHERE processed_at < NOW() - INTERVAL '7 days';

Complete Pattern: Outbox + Inbox

Performance Considerations

Database Performance

Outbox table growth: Without cleanup, the outbox table grows indefinitely. I’ve seen this cause significant performance degradation.

-- Strategy 1: Delete immediately after publish
DELETE FROM outbox WHERE id = $1 AND published = true;

-- Strategy 2: Batch cleanup (run daily via cron)
DELETE FROM outbox
WHERE published = true
  AND created_at < NOW() - INTERVAL '7 days';

-- Strategy 3: Table partitioning (PostgreSQL 10+)
CREATE TABLE outbox_2025_12 PARTITION OF outbox
  FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');

-- Drop old partitions (much faster than DELETE)
DROP TABLE outbox_2025_11;

Index optimization: The partial index only indexes unpublished events, saving space:

CREATE INDEX idx_outbox_unpublished
ON outbox(created_at)
WHERE published = false;

Polling Publisher Tuning

Poll interval trade-offs:

1 second: Low latency, high database load
5 seconds: Balanced (recommended for most cases)
10+ seconds: Low overhead, higher latency

Batch size:

// Too small: many queries, inefficient
const batchSize = 10;

// Too large: long transactions, lock contention
const batchSize = 10000;

// Optimal: balance efficiency and transaction length
const batchSize = 100;  // Recommended starting point

CDC Performance

Monitor replication lag to ensure Debezium keeps up:

-- Check replication slot lag
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as lag
FROM pg_replication_slots
WHERE slot_type = 'logical';

If lag grows, your WAL files accumulate and can fill disk. This is a real operational concern I’ve dealt with.

Common Pitfalls and Solutions

Pitfall 1: Unbounded Table Growth

Problem: Outbox table grows indefinitely, queries slow down.

Solution: Implement automatic cleanup in your publisher:

async function publishAndCleanup() {
  // Publish events
  await publishOutboxEvents();

  // Cleanup old published events (every 100 iterations)
  if (cleanupCounter++ % 100 === 0) {
    await db('outbox')
      .where('published', true)
      .where('created_at', '<', db.raw("NOW() - INTERVAL '7 days'"))
      .delete();
  }
}

Pitfall 2: Message Relay Failure Goes Unnoticed

Problem: Publisher crashes, events pile up unpublished.

Solution: Monitor outbox age metrics:

async function checkOutboxHealth() {
  const result = await db('outbox')
    .where('published', false)
    .min('created_at as oldest')
    .first();

  if (!result.oldest) return;  // No unpublished events

  const ageMs = Date.now() - new Date(result.oldest).getTime();
  const ageMinutes = ageMs / 60000;

  if (ageMinutes > 5) {
    alerting.trigger('OUTBOX_LAG_HIGH', {
      ageMinutes,
      message: 'Outbox events not being published'
    });
  }
}

// Run health check every minute
setInterval(checkOutboxHealth, 60000);

Pitfall 3: CDC Replication Slot Filling Disk

Problem: Debezium connector goes down, PostgreSQL WAL accumulates.

Solution: Monitor replication slots and set retention limits:

-- Set WAL retention limit
ALTER SYSTEM SET wal_keep_size = '10GB';

-- Monitor slot status
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as lag
FROM pg_replication_slots;

Alert if a slot is inactive for more than 5 minutes, indicating a publisher failure.

Comparison with Other Patterns

Outbox vs. Event Sourcing

Aspect	Outbox Pattern	Event Sourcing
Purpose	Reliable event publishing	Events as source of truth
Event Lifetime	Short-lived (deleted after publish)	Permanent append-only log
State Storage	Current state in tables	Derived from events
Complexity	Low	High
Query Model	Direct database queries	Requires projections/CQRS
Best For	E-commerce orders, workflows	Banking, audit systems

Key difference: In event sourcing, events are the permanent record. In outbox, events are a communication mechanism.

Outbox vs. Saga Pattern

The outbox pattern complements the saga pattern. Use outbox within each service participating in a saga:

// Order Service publishes OrderCreated via outbox
await db.transaction(async (trx) => {
  await trx('orders').insert(order);
  await trx('outbox').insert({ event_type: 'OrderCreated', payload: order });
});

// Saga Orchestrator receives OrderCreated, publishes commands via its own outbox
await db.transaction(async (trx) => {
  await trx('saga_state').insert({ saga_id: orderId, step: 'INVENTORY_PENDING' });
  await trx('outbox').insert({ event_type: 'ReserveInventory', payload: { orderId } });
});

Decision Framework

Use this framework to choose the right implementation:

Choose Polling when:

Event volume < 1000/minute
Getting started quickly
Simple architecture preferred
Database doesn’t support CDC

Choose CDC when:

Event volume > 1000/minute
Need < 1 second latency
Production system at scale
Already using Kafka

Choose DynamoDB + EventBridge when:

Building on AWS
Want serverless architecture
Minimal operational overhead desired
Cost-effective for moderate volumes

Production Readiness Checklist

Before deploying the outbox pattern to production:

Key Takeaways

Working with the outbox pattern across multiple systems taught me these lessons:

Start simple: Begin with polling publishers. Move to CDC only when you need the performance.
Monitor lag aggressively: The time between event creation and publishing is your most important metric. If this grows, your system is degrading.
Idempotency is non-negotiable: At-least-once delivery means duplicates will happen. Design for it from day one.
Clean up ruthlessly: Outbox tables that grow unbounded will eventually cause production issues. Automate cleanup.
Partition wisely: Event ordering within a partition is guaranteed. Use aggregate IDs as partition keys.
AWS makes it easier: DynamoDB + EventBridge Pipes provides a production-ready outbox with minimal code.

The outbox pattern isn’t just theory; it’s a battle-tested solution to the dual-write problem that I’ve relied on for building reliable event-driven systems. The implementations shown here are production-ready patterns you can adapt to your specific requirements.