2025-09-04

Circuit Breaker Pattern: Building Resilient Microservices That Don't Cascade Failures

Real-world implementation of the Circuit Breaker pattern with proven strategies for preventing cascading failures in distributed systems

When a payment service fails slowly rather than quickly, it can take down an entire platform. Each request taking 30 seconds to time out creates a traffic jam that backs up through other services. This cascading failure pattern is common in distributed systems. Here’s how the Circuit Breaker pattern addresses this problem, with lessons learned from working through these incidents.

The Problem: When Slow is Worse Than Dead

Picture this: Your payment provider’s API starts responding slowly. Not down, just taking 20-30 seconds per request instead of the usual 200ms. Your service dutifully waits. Meanwhile, incoming requests pile up. Thread pools exhaust. Memory consumption spikes. Eventually, your healthy service becomes unhealthy, and the infection spreads upstream.

This pattern can kill entire platforms. The challenging part? Monitoring shows all services are “up” - they’re just not responding.

Circuit Breaker: Your System’s Safety Valve

The Circuit Breaker pattern acts like an electrical circuit breaker in your house. When things go wrong, it trips, preventing damage from spreading. But unlike your home’s breaker, this one is smart - it can test if the problem is fixed and automatically recover.

The Three States

enum CircuitState {
  CLOSED = 'CLOSED',  // Normal operation, requests flow through
  OPEN = 'OPEN',  // Circuit tripped, requests fail immediately
  HALF_OPEN = 'HALF_OPEN' // Testing if service recovered
}

Think of it like a bouncer at a club:

CLOSED: “Come on in, everything’s fine”
OPEN: “Nobody gets in, there’s a problem inside”
HALF_OPEN: “Let me check with one person if it’s safe now”
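
The bouncer's rules can also be written down as a small transition function. This is only a sketch to make the allowed moves explicit: the event names are invented for illustration and are not part of the implementation that follows.

```typescript
// Illustrative only: the event names below are assumptions made for this sketch.
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';
type BreakerEvent =
  | 'failure_threshold_reached' // too many failures while CLOSED
  | 'reset_timeout_elapsed'     // waited long enough while OPEN
  | 'probes_succeeded'          // enough trial requests succeeded
  | 'probe_failed';             // a trial request failed

function nextState(state: State, event: BreakerEvent): State {
  switch (state) {
    case 'CLOSED':
      return event === 'failure_threshold_reached' ? 'OPEN' : state;
    case 'OPEN':
      return event === 'reset_timeout_elapsed' ? 'HALF_OPEN' : state;
    case 'HALF_OPEN':
      if (event === 'probes_succeeded') return 'CLOSED';
      if (event === 'probe_failed') return 'OPEN';
      return state;
  }
}
```

Note that there is no direct path from OPEN back to CLOSED: recovery always goes through HALF_OPEN.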

Real Implementation: What Actually Works

Here’s a circuit breaker implementation that addresses these challenges. This pattern has proven reliable across services handling high request volumes:

interface CircuitBreakerConfig {
  failureThreshold: number;  // Failures before opening
  successThreshold: number;  // Successes to close from half-open
  timeout: number;  // Request timeout in ms
  resetTimeout: number;  // Time before trying half-open
  volumeThreshold: number;  // Min requests before evaluating
  errorThresholdPercentage: number; // Error % to trip
}

class CircuitBreaker<T> {
  private state: CircuitState = CircuitState.CLOSED;
  private failureCount = 0;  // Consecutive failures
  private successCount = 0;  // Successes while half-open
  private lastFailureTime?: Date;
  private window = new RollingWindow(10000); // 10 second window

  constructor(private readonly config: CircuitBreakerConfig) {}

  // The protected call is passed per invocation, so one breaker can guard many calls
  async execute(protectedFunction: () => Promise<T>): Promise<T> {
    // Check if we should attempt half-open
    if (this.state === CircuitState.OPEN) {
      if (this.shouldAttemptReset()) {
        this.state = CircuitState.HALF_OPEN;
      } else {
        throw new CircuitOpenError('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await this.executeWithTimeout(protectedFunction);
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private async executeWithTimeout(protectedFunction: () => Promise<T>): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<T>((_, reject) => {
      timer = setTimeout(() => reject(new TimeoutError()), this.config.timeout);
    });
    try {
      // The race only stops waiting; it does not cancel the underlying call
      return await Promise.race([protectedFunction(), timeout]);
    } finally {
      clearTimeout(timer); // Don't leave a pending timer after the call settles
    }
  }

  private onSuccess(): void {
    this.failureCount = 0;
    this.window.recordSuccess();

    if (this.state === CircuitState.HALF_OPEN) {
      this.successCount++;
      if (this.successCount >= this.config.successThreshold) {
        this.state = CircuitState.CLOSED;
        this.successCount = 0;
      }
    }
  }

  private onFailure(): void {
    this.failureCount++;
    this.lastFailureTime = new Date();
    this.window.recordFailure();

    if (this.state === CircuitState.HALF_OPEN) {
      this.state = CircuitState.OPEN;
      this.successCount = 0;
      return;
    }

    // Check both absolute and percentage thresholds, but only once the
    // window has seen at least volumeThreshold requests
    const stats = this.window.getStats();
    if (stats.totalRequests >= this.config.volumeThreshold) {
      const errorRate = (stats.failures / stats.totalRequests) * 100;
      if (errorRate >= this.config.errorThresholdPercentage ||
          this.failureCount >= this.config.failureThreshold) {
        this.state = CircuitState.OPEN;
      }
    }
  }

  private shouldAttemptReset(): boolean {
    return this.lastFailureTime !== undefined &&
      Date.now() - this.lastFailureTime.getTime() >= this.config.resetTimeout;
  }
}
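
The class above relies on `RollingWindow`, `CircuitOpenError` and `TimeoutError` without showing them. Here is a minimal sketch of what they could look like, assuming the window only needs to count requests and failures over the last N milliseconds. The injectable `now` clock and the `WindowStats` name are additions for this sketch, not something the original code defines.

```typescript
// Hypothetical helpers: the post uses these names but does not show their definitions.
class CircuitOpenError extends Error {
  constructor(message = 'Circuit breaker is OPEN') {
    super(message);
    this.name = 'CircuitOpenError';
    // Keeps instanceof working when compiling to ES5
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class TimeoutError extends Error {
  constructor(message = 'Request timed out') {
    super(message);
    this.name = 'TimeoutError';
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

interface WindowStats {
  totalRequests: number;
  failures: number;
}

class RollingWindow {
  private events: { at: number; failed: boolean }[] = [];

  constructor(
    private readonly windowMs: number,
    private readonly now: () => number = Date.now // injectable clock, for tests
  ) {}

  recordSuccess(): void {
    this.record(false);
  }

  recordFailure(): void {
    this.record(true);
  }

  getStats(): WindowStats {
    this.prune();
    return {
      totalRequests: this.events.length,
      failures: this.events.filter((e) => e.failed).length,
    };
  }

  private record(failed: boolean): void {
    this.prune();
    this.events.push({ at: this.now(), failed });
  }

  // Drop events that have aged out of the window.
  private prune(): void {
    const cutoff = this.now() - this.windowMs;
    while (this.events.length > 0 && this.events[0].at <= cutoff) {
      this.events.shift();
    }
  }
}
```

Storing one entry per request is fine for a sketch. At high request rates a bucketed counter (one bucket per second, for example) would use far less memory.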

Lessons from Production: What the Tutorials Don’t Tell You

1. Timeout is Your Most Important Setting

Analysis of incident patterns shows most failures (around 70%) are caused by slow responses, not complete failures. Setting timeouts aggressively helps:

const config = {
  timeout: 3000,  // 3 seconds - our P99 is 1.2s, so this catches problems
  // NOT 30000!  // This killed us. Waiting 30s = thread exhaustion
};

Example timing from a payment service:

Normal P50: 180ms
Normal P99: 1.2s
Circuit breaker timeout: 3s
Result: Significant reduction in cascading failures
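
In the numbers above the timeout sits at 2.5x the normal P99. As a rough sketch of that arithmetic, the helper below derives a suggested timeout from a list of latency samples. The nearest-rank percentile method, the 2.5 multiplier as a default and the rounding to 100 ms are assumptions chosen to reproduce the example, not a general rule.

```typescript
// Nearest-rank percentile over a list of latency samples (ms).
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error('no samples');
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Assumption: timeout = multiplier x P99, rounded up to the next 100 ms.
function suggestTimeoutMs(samplesMs: number[], multiplier = 2.5): number {
  const p99 = percentile(samplesMs, 99);
  return Math.ceil((p99 * multiplier) / 100) * 100;
}
```

With a P99 of 1200 ms this gives 1200 x 2.5 = 3000 ms, the value used in the config above.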

2. The Half-Open State Gotcha

Early on, the breaker would move to half-open, send one request, succeed, close the circuit, then immediately fail again under full traffic. The fix: require multiple successes before closing.

// Don't do this
if (testRequest.succeeded) {
  this.state = CircuitState.CLOSED; // Boom! Full traffic returns
}

// Do this instead
if (++this.successCount >= this.config.successThreshold) {
  this.state = CircuitState.CLOSED; // Gradual recovery
}
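
A related gap: the implementation shown earlier lets every caller through once the breaker is half-open, so a recovering service can still be hit with full traffic during the test phase. One possible way to limit that is a small gate that admits only a few trial requests at a time. `HalfOpenGate` and `maxProbes` are hypothetical names for this sketch.

```typescript
// Hypothetical helper: limits how many trial requests run at once in HALF_OPEN.
class HalfOpenGate {
  private inFlight = 0;

  constructor(private readonly maxProbes = 1) {}

  // Returns true if the caller may send a probe; false means "reject fast".
  tryAcquire(): boolean {
    if (this.inFlight >= this.maxProbes) return false;
    this.inFlight++;
    return true;
  }

  // Call from a finally block once the probe settles.
  release(): void {
    if (this.inFlight > 0) this.inFlight--;
  }
}
```

In a breaker, `tryAcquire()` would be checked before running a request in the half-open state, and callers that do not get a slot would receive the same fast rejection as when the circuit is open.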

3. Combine with Retry Logic (But Carefully)

Circuit breakers and retries can create feedback loops. Here’s a reliable combination:

class ResilientClient {
  private circuitBreaker: CircuitBreaker<any>;

  async callWithResilience(request: Request): Promise<Response> {
    // Circuit breaker wraps retry logic, not vice versa
    return this.circuitBreaker.execute(async () => {
      return await this.retryWithBackoff(request, {
        maxAttempts: 3,
        backoffMs: [100, 200, 400],
        shouldRetry: (error) => {
          // Don't retry circuit breaker errors
          if (error instanceof CircuitOpenError) return false;
          // Don't retry client errors
          if (error.statusCode >= 400 && error.statusCode < 500) return false;
          return true;
        }
      });
    });
  }
}
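
`retryWithBackoff` is not shown above. A minimal standalone sketch might look like the following, assuming it takes the operation as a function rather than a request object, and that the backoff schedule is reused from its last entry when it is shorter than `maxAttempts`. Both are choices made for this example.

```typescript
interface RetryOptions {
  maxAttempts: number;
  backoffMs: number[];
  shouldRetry: (error: unknown) => boolean;
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical standalone version: takes the operation directly.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  options: RetryOptions
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      const isLastAttempt = attempt === options.maxAttempts;
      if (isLastAttempt || !options.shouldRetry(error)) break;
      // Reuse the last delay if the schedule is shorter than maxAttempts.
      const index = Math.min(attempt - 1, options.backoffMs.length - 1);
      await sleep(options.backoffMs[index] ?? 0);
    }
  }
  throw lastError;
}
```

Because the breaker wraps the retries, its `timeout` covers the whole sequence: all attempts plus the backoff delays have to fit inside it, otherwise the breaker records a timeout even if a later attempt would have succeeded.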

4. Monitor the Right Metrics

What to track (in order of importance):

Circuit state changes - Alert immediately on OPEN
Reset attempt results - Failed resets = ongoing problem
Request rejection rate - Business impact metric
Time in OPEN state - Helps tune reset timeout

The custom metrics behind our CloudWatch dashboard:

// Custom metrics we push
await cloudwatch.putMetricData({
  Namespace: 'CircuitBreakers',
  MetricData: [
    {
      MetricName: 'StateChange',
      Value: 1,
      Unit: 'Count',
      Dimensions: [
        { Name: 'ServiceName', Value: this.serviceName },
        { Name: 'FromState', Value: oldState },
        { Name: 'ToState', Value: newState }
      ]
    },
    {
      MetricName: 'RejectedRequests',
      Value: rejectedCount,
      Unit: 'Count',
      Dimensions: [{ Name: 'ServiceName', Value: this.serviceName }]
    }
  ]
});
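
The `oldState` and `newState` values have to come from somewhere. One option, sketched here under the assumption that the breaker routes every state assignment through a single helper, is a small observable wrapper that notifies listeners only on real transitions. `ObservableState` is a hypothetical name, not part of the implementation above.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';
type StateChangeListener = (from: BreakerState, to: BreakerState) => void;

// Hypothetical helper: holds the current state and reports transitions.
class ObservableState {
  private listeners: StateChangeListener[] = [];

  constructor(private current: BreakerState = 'CLOSED') {}

  get value(): BreakerState {
    return this.current;
  }

  onChange(listener: StateChangeListener): void {
    this.listeners.push(listener);
  }

  set(next: BreakerState): void {
    if (next === this.current) return; // no metric for non-transitions
    const previous = this.current;
    this.current = next;
    for (const listener of this.listeners) listener(previous, next);
  }
}
```

A listener registered with `onChange` is then the natural place to push the `StateChange` metric and to record timestamps for measuring time spent in the OPEN state.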

Advanced Patterns: Beyond Basic Circuit Breaking

Bulkheading: Isolated Circuit Breakers

Don’t use one circuit breaker for an entire service. Isolate critical paths:

class PaymentService {
  private readonly chargeBreaker = new CircuitBreaker(chargeConfig);
  private readonly refundBreaker = new CircuitBreaker(refundConfig);
  private readonly queryBreaker = new CircuitBreaker(queryConfig);

  async chargeCard(request: ChargeRequest): Promise<ChargeResponse> {
    // Charging failures don't affect refunds
    return this.chargeBreaker.execute(() => this.api.charge(request));
  }

  async refundPayment(request: RefundRequest): Promise<RefundResponse> {
    // Refunds stay available even if charges are failing
    return this.refundBreaker.execute(() => this.api.refund(request));
  }
}

This pattern proves valuable during high-traffic periods when one endpoint becomes overwhelmed while others remain available.

Fallback Strategies

Not all failures are equal. Sometimes you can degrade gracefully:

async getProductRecommendations(userId: string): Promise<Product[]> {
  try {
    return await this.recommendationBreaker.execute(
      () => this.mlService.getRecommendations(userId)
    );
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Fallback to simple popularity-based recommendations
      return this.getPopularProducts();
    }
    throw error;
  }
}
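
If several call sites need the same try/catch shape, it can be pulled into a small generic helper. This is a sketch, not part of the breaker itself: `withFallback` and its `shouldFallback` predicate are hypothetical names, and the predicate is where a check such as `error instanceof CircuitOpenError` would go.

```typescript
// Hypothetical helper: run `primary`, and use `fallback` only for errors
// that the caller has declared safe to degrade on.
async function withFallback<T>(
  primary: () => Promise<T>,
  fallback: () => Promise<T> | T,
  shouldFallback: (error: unknown) => boolean
): Promise<T> {
  try {
    return await primary();
  } catch (error) {
    if (shouldFallback(error)) return fallback();
    throw error; // anything else still surfaces to the caller
  }
}
```

Keeping the predicate explicit avoids silently masking errors that should page someone.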

Circuit Breaker Inheritance

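For microservices calling other microservices, propagate circuit state upstream: if a dependency’s breaker is open, reject requests to the services that rely on it before doing any work: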

// API Gateway
if (paymentServiceBreaker.state === CircuitState.OPEN) {
  // Don't even try to call order service which depends on payment
  return { error: 'Payment service unavailable', status: 503 };
}
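
With more than one dependency, the same check can be driven from a lookup table. The sketch below assumes the gateway has some way to read each breaker's state by name. The route names, the `routeDependencies` map and the `blockedBy` function are all made up for illustration.

```typescript
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Hypothetical dependency map: route -> downstream services it needs.
const routeDependencies: Record<string, string[]> = {
  '/orders': ['payment', 'inventory'],
  '/catalog': ['inventory'],
};

// Returns the first dependency whose breaker is OPEN,
// or undefined if the route may proceed.
function blockedBy(
  route: string,
  breakerStates: Map<string, State>
): string | undefined {
  const dependencies = routeDependencies[route] ?? [];
  return dependencies.find((name) => breakerStates.get(name) === 'OPEN');
}
```

A gateway handler would call `blockedBy` first and return a 503 naming the unavailable dependency when it gets a result back.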

Real-World Configuration Examples

Here’s what actually works in production for different service types:

// External API (payment providers, third-party services)
const externalAPIConfig: CircuitBreakerConfig = {
  failureThreshold: 5,  // 5 consecutive failures
  successThreshold: 2,  // 2 successes to recover
  timeout: 5000,  // 5 second timeout
  resetTimeout: 30000,  // Try recovery after 30s
  volumeThreshold: 10,  // Need 10 requests minimum
  errorThresholdPercentage: 50  // 50% error rate trips
};

// Internal microservice
const internalServiceConfig: CircuitBreakerConfig = {
  failureThreshold: 10,  // More tolerant
  successThreshold: 3,
  timeout: 3000,  // Faster timeout
  resetTimeout: 10000,  // Faster recovery attempts
  volumeThreshold: 20,
  errorThresholdPercentage: 30  // More sensitive to error rates
};

// Database connections
const databaseConfig: CircuitBreakerConfig = {
  failureThreshold: 3,  // Quick to trip
  successThreshold: 5,  // Slow to recover
  timeout: 1000,  // Very fast timeout
  resetTimeout: 5000,  // Quick retry
  volumeThreshold: 5,
  errorThresholdPercentage: 20  // Very sensitive
};
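
Since these values are easy to mistype (a timeout of 30 instead of 3000, a percentage of 0.5 instead of 50), a small sanity check at startup can help. The sketch below repeats the `CircuitBreakerConfig` interface so it stands alone. `validateConfig` and the specific rules are assumptions for illustration.

```typescript
interface CircuitBreakerConfig {
  failureThreshold: number;
  successThreshold: number;
  timeout: number;
  resetTimeout: number;
  volumeThreshold: number;
  errorThresholdPercentage: number;
}

// Hypothetical check: returns a list of problems, empty if the config looks sane.
function validateConfig(config: CircuitBreakerConfig): string[] {
  const problems: string[] = [];
  const positiveFields: (keyof CircuitBreakerConfig)[] = [
    'failureThreshold',
    'successThreshold',
    'timeout',
    'resetTimeout',
    'volumeThreshold',
  ];
  for (const field of positiveFields) {
    if (!(config[field] > 0)) problems.push(`${field} must be greater than 0`);
  }
  const pct = config.errorThresholdPercentage;
  if (!(pct > 0 && pct <= 100)) {
    problems.push('errorThresholdPercentage must be in (0, 100]');
  }
  return problems;
}
```

One interaction worth knowing when tuning: in the implementation shown earlier, `failureThreshold` is only evaluated once the rolling window holds at least `volumeThreshold` requests, so with the values above the consecutive-failure check cannot trip the circuit before that volume is reached.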

Testing Circuit Breakers: Chaos Engineering

You can’t trust a circuit breaker you haven’t tested. Here’s our chaos testing approach:

describe('Circuit Breaker Chaos Tests', () => {
  it('should handle gradual degradation', async () => {
    const scenarios = [
      { latency: 100, errorRate: 0 },  // Normal
      { latency: 500, errorRate: 0.1 },  // Slight degradation
      { latency: 2000, errorRate: 0.3 }, // Major degradation
      { latency: 5000, errorRate: 0.7 }, // Near failure
    ];

    for (const scenario of scenarios) {
      mockService.setScenario(scenario);
      await runLoadTest(1000); // 1000 requests

      const metrics = await breaker.getMetrics();
      if (scenario.errorRate > 0.5) {
        expect(breaker.state).toBe(CircuitState.OPEN);
      }
    }
  });
});
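
The test relies on a `mockService` that can be told how slow and how flaky to be. A minimal sketch of such a stand-in, assuming `setScenario` takes the same `{ latency, errorRate }` shape used above, could look like this. The `call` method and the injectable random source are additions for the example.

```typescript
interface Scenario {
  latency: number;   // artificial delay in ms
  errorRate: number; // probability of failure, 0..1
}

// Hypothetical stand-in for the `mockService` used in the test above.
class MockService {
  private scenario: Scenario = { latency: 0, errorRate: 0 };

  // The random source is injectable so tests can be deterministic.
  constructor(private readonly random: () => number = Math.random) {}

  setScenario(scenario: Scenario): void {
    this.scenario = scenario;
  }

  async call(): Promise<string> {
    const { latency, errorRate } = this.scenario;
    await new Promise<void>((resolve) => setTimeout(resolve, latency));
    if (this.random() < errorRate) throw new Error('injected failure');
    return 'ok';
  }
}
```

Passing a fixed random function makes the failure pattern repeatable, which keeps assertions about when the breaker opens from turning flaky.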

In production, we use AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) to randomly inject failures and verify our circuit breakers respond correctly.

The Mistakes That Cost Us

Mistake 1: Client-Side Only Circuit Breaking

We initially implemented circuit breakers only in clients. When the server itself had issues, it couldn’t protect itself:

// Bad: Client protects itself but server still overwhelmed
class Client {
  private breaker = new CircuitBreaker();
  async call() { return this.breaker.execute(() => fetch('/api')); }
}

// Good: Server also protects itself
class Server {
  private downstreamBreaker = new CircuitBreaker();
  async handleRequest(req, res) {
    try {
      const data = await this.downstreamBreaker.execute(() =>
        this.database.query(req.query)
      );
      res.json(data);
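    } catch (error) {
      if (error instanceof CircuitOpenError) {
        res.status(503).json({ error: 'Service temporarily unavailable' });
      } else {
        // Don't leave the request hanging on other errors
        res.status(500).json({ error: 'Internal server error' });
      }
    }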
  }
}

Mistake 2: Sharing Circuit Breakers Across Unrelated Operations

We had one circuit breaker for “database operations”. When writes failed, reads were also blocked:

// Bad: One breaker for everything
class UserService {
  private dbBreaker = new CircuitBreaker();

  async getUser(id) {
    return this.dbBreaker.execute(() => db.query('SELECT...'));
  }

  async createUser(data) {
    return this.dbBreaker.execute(() => db.query('INSERT...'));
  }
}

// Good: Separate breakers for different operations
class UserService {
  private readBreaker = new CircuitBreaker(readConfig);
  private writeBreaker = new CircuitBreaker(writeConfig);

  async getUser(id) {
    return this.readBreaker.execute(() => db.query('SELECT...'));
  }

  async createUser(data) {
    return this.writeBreaker.execute(() => db.query('INSERT...'));
  }
}

Mistake 3: Not Considering Business Impact

We treated all services equally. Then we blocked payment processing while letting metrics collection through. Learned that lesson quickly.

The Implementation Checklist

When implementing circuit breakers, here’s a useful checklist:

Final Thoughts: It’s About Failing Fast

A key insight: sometimes the best thing a service can do is fail immediately. A 503 response in 10ms is far better than a timeout after 30 seconds. Users can retry quickly, and systems can recover. Thread exhaustion leads to much more serious problems.

Circuit breakers aren’t about preventing failures - they’re about preventing failures from spreading. They’re about maintaining enough system health that when the problem is fixed, you can actually recover.

Implementing circuit breakers before you encounter problems makes crisis response much smoother.
