2025-09-08

From RFC to Production: What They Don't Tell You About Implementation

An honest take on the gap between beautiful RFC designs and messy production reality, featuring real-world lessons from implementing notification systems at scale

Abstract

RFCs rarely survive contact with production unchanged, and that’s not necessarily a problem. Through examining notification system implementations, we can learn how elegant designs evolve when they meet organizational constraints, timeline pressures, and unexpected requirements. This exploration reveals patterns that help bridge the gap between theoretical design and practical implementation.

Situation: The Beautiful RFC vs. Production Reality

You know that feeling when you’re reading through a beautifully crafted RFC, nodding along to the elegant architecture diagrams, and thinking “This is it, this is the design that will finally work perfectly”? Then six months later you’re knee-deep in production issues, the timeline has doubled, and that pristine database schema looks like it went through a blender?

This pattern emerges repeatedly across system implementations. The gap between RFC and production isn’t a bug - it’s a feature of building complex systems with teams under business pressures. Understanding this gap helps us plan more effectively and set realistic expectations.

Note: The following examples are adapted from multiple notification system implementations across different organizations. While specific details may vary, the patterns and challenges described are representative of common experiences in this domain.

Task: Building a Notification System from RFC to Reality

The task seemed straightforward from the RFC's perspective: a comprehensive notification system with clean architecture diagrams, well-planned database schemas, and a phased rollout strategy. The specifications looked thorough and the timeline appeared conservative:

// The RFC specifications
interface NotificationSystemGoals {
  deliveryTime: '<100ms for in-app, <5s for email',
  throughput: '10,000+ notifications per second',
  uptime: '99.9% availability',
  timeline: '12 weeks with 2 developers',
  budget: '$120,000-180,000'
}

// What emerged in production
interface ProductionReality {
  deliveryTime: '2-3s for in-app on good days, 30s+ during peaks',
  throughput: 'Started at 500/sec, took 6 months to reach 5,000/sec',
  uptime: '97% first quarter, 99% after year one',
  timeline: '8 months with 4 developers plus 2 contractors',
  budget: '$400,000+ and still counting maintenance costs'
}

The RFC appeared comprehensive, covering rate limiting, deduplication, preference management, and user experience considerations like quiet hours. The phased approach seemed reasonable - core infrastructure in 4 weeks felt achievable.
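To make the gap concrete, the ratios between the two specifications above can be worked out directly. This is only a rough sketch: the helper name `overrunFactor` and the 52/12 weeks-per-month conversion are assumptions for illustration, and the $400,000 figure is a lower bound, so the real budget ratios are at least these values.

```typescript
// Hypothetical helper: how many times larger the actual value is than the plan.
function overrunFactor(planned: number, actual: number): number {
  if (planned <= 0) throw new RangeError('planned must be positive');
  return actual / planned;
}

// Assumption: an average month is 52 / 12 weeks (about 4.33).
const WEEKS_PER_MONTH = 52 / 12;

// 12 weeks planned versus 8 months actual.
const timelineOverrun = overrunFactor(12, 8 * WEEKS_PER_MONTH);
// $400,000+ actual against the top and bottom of the planned budget range.
const budgetOverrunLow = overrunFactor(180000, 400000);
const budgetOverrunHigh = overrunFactor(120000, 400000);
// Fraction of the 10,000/sec throughput goal reached after six months.
const throughputReached = 5000 / 10000;

console.log(`timeline: ${timelineOverrun.toFixed(1)}x`); // 2.9x
console.log(`budget: ${budgetOverrunLow.toFixed(1)}x to ${budgetOverrunHigh.toFixed(1)}x`); // 2.2x to 3.3x
console.log(`throughput reached: ${throughputReached * 100}% of goal`); // 50% of goal
```

Headcount is left out on purpose: the team was not fully staffed for the whole period, so a person-week comparison would need data the summary does not give.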

Action: Implementation Challenges and Adaptations

Database Schema Evolution

The initial database schema design emphasized clean normalization with proper foreign keys and constraints:

-- Initial RFC schema design
CREATE TABLE notification_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id) ON DELETE CASCADE,
    notification_type VARCHAR(100) NOT NULL,
    template_id UUID REFERENCES notification_templates(id),
    data JSONB DEFAULT '{}',
    status VARCHAR(20) DEFAULT 'pending',
    sent_at TIMESTAMP,
    delivered_at TIMESTAMP,
    read_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW()
);

Three months into production, the schema had evolved significantly:

-- Schema after production adaptations
CREATE TABLE notification_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID, -- Foreign key removed due to performance issues
    notification_type VARCHAR(100),
    notification_type_v2 VARCHAR(255), -- Migration in progress
    template_id UUID,
    template_id_v2 BIGINT, -- Different team used different ID type
    data JSONB DEFAULT '{}',
    data_compressed BYTEA, -- Added when JSONB got too large
    status VARCHAR(20) DEFAULT 'pending',
    status_v2 VARCHAR(50), -- More statuses than expected
    priority INTEGER DEFAULT 0, -- Not in RFC, critical for production
    retry_count INTEGER DEFAULT 0, -- Not in RFC, essential for debugging
    channel VARCHAR(50), -- Denormalized for query performance
    correlation_id UUID, -- Added for distributed tracing
    partition_key INTEGER, -- Added for sharding
    sent_at TIMESTAMP,
    delivered_at TIMESTAMP,
    read_at TIMESTAMP,
    failed_at TIMESTAMP, -- Not in RFC, very much needed
    expires_at TIMESTAMP, -- Not in RFC, prevented infinite growth
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW() -- Added after debugging nightmares
);

-- Plus 15 indexes we didn't anticipate
CREATE INDEX CONCURRENTLY idx_notification_events_user_created ON notification_events(user_id, created_at DESC) WHERE status != 'deleted';
CREATE INDEX CONCURRENTLY idx_notification_events_correlation ON notification_events(correlation_id) WHERE correlation_id IS NOT NULL;
-- ... and 13 more

Each schema change addressed production incidents, performance bottlenecks, or requirements that emerged during implementation. These adaptations reflect the natural evolution from theoretical design to operational system.
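While columns such as `status` and `status_v2` coexist, every reader needs the same rule for which one wins. Here is a minimal sketch of one such rule, assuming the `_v2` column is authoritative once it has been populated. The source does not state the rule these teams used, and `NotificationRow`, `preferV2`, `effectiveStatus`, and `effectiveType` are hypothetical names.

```typescript
// Hypothetical subset of a notification_events row during the migration.
interface NotificationRow {
  status: string | null;
  status_v2: string | null;
  notification_type: string | null;
  notification_type_v2: string | null;
}

// Assumption: the _v2 column wins once it holds a non-empty value;
// otherwise fall back to the legacy column.
function preferV2(v2: string | null, legacy: string | null): string | null {
  return v2 !== null && v2 !== '' ? v2 : legacy;
}

function effectiveStatus(row: NotificationRow): string {
  // 'pending' mirrors the column default in the schema above.
  return preferV2(row.status_v2, row.status) ?? 'pending';
}

function effectiveType(row: NotificationRow): string | null {
  return preferV2(row.notification_type_v2, row.notification_type);
}
```

Keeping this rule in one place means the dual-column period can end by deleting a helper rather than hunting through every query.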

WebSocket Connection Management Complexity

The RFC specified WebSocket-based delivery for optimal performance. The initial implementation approach was straightforward:

// RFC's WebSocket implementation
class NotificationWebSocketManager {
  private connections: Map<string, WebSocket> = new Map();
  
  async sendNotification(userId: string, notification: NotificationEvent) {
    const connection = this.connections.get(userId);
    if (connection && connection.readyState === WebSocket.OPEN) {
      connection.send(JSON.stringify({
        type: 'notification',
        data: notification
      }));
    }
  }
}

Production requirements revealed additional complexity. After addressing connection management challenges during mobile app deployments, the implementation evolved:

// Production implementation addressing edge cases
class NotificationWebSocketManager {
  private connections: Map<string, Set<WebSocketConnection>> = new Map();
  private connectionMetadata: Map<string, ConnectionMetadata> = new Map();
  private healthChecks: Map<string, NodeJS.Timeout> = new Map();
  private rateLimiters: Map<string, RateLimiter> = new Map();
  private deadLetterQueue: Queue<FailedNotification>;
  private circuit: CircuitBreaker;
  
  async sendNotification(userId: string, notification: NotificationEvent) {
    // 200+ lines of defensive programming
    const connections = this.connections.get(userId);
    if (!connections || connections.size === 0) {
      await this.queueForLaterDelivery(userId, notification);
      return;
    }
    
    // Handle multiple connections per user (mobile + web + tablet)
    const results = await Promise.allSettled(
      Array.from(connections).map(async (conn) => {
        try {
          // Check connection health
          if (!this.isConnectionHealthy(conn)) {
            await this.reconnectOrEvict(conn);
            throw new Error('Unhealthy connection');
          }
          
          // Rate limiting per connection
          const limiter = this.getRateLimiter(conn.id);
          if (!await limiter.tryAcquire()) {
            await this.backpressure(conn, notification);
            return;
          }
          
          // Circuit breaker for cascading failures
          return await this.circuit.fire(async () => {
            // Message size validation (learned this the hard way)
            const message = this.serializeNotification(notification);
            if (message.length > MAX_MESSAGE_SIZE) {
              const chunks = this.chunkMessage(message);
              for (const chunk of chunks) {
                await this.sendChunk(conn, chunk);
              }
            } else {
              await this.sendMessage(conn, message);
            }
          });
        } catch (error) {
          await this.handleDeliveryFailure(conn, notification, error);
        }
      })
    );
    
    // Track delivery metrics
    await this.recordDeliveryMetrics(userId, notification, results);
  }
  
  // Plus 50+ other methods for handling edge cases
}

Each addition addressed specific production challenges: circuit breakers for cascading failures, message chunking for large payloads, and sophisticated rate limiting for notification storms. These patterns emerge consistently when simple designs meet complex operational requirements.
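For readers who have not built one, the circuit breaker mentioned above can be sketched in a few dozen lines. This is not the `CircuitBreaker` used in the production class, whose implementation is not shown. It is a simplified, synchronous illustration, and `SimpleCircuitBreaker`, its thresholds, and the injectable clock are all assumptions.

```typescript
type BreakerState = 'closed' | 'open' | 'half_open';

class SimpleCircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly resetTimeoutMs: number,
    // Injectable clock so the behavior can be tested without waiting.
    private readonly now: () => number = () => Date.now(),
  ) {}

  getState(): BreakerState {
    // An open breaker becomes half-open once the cool-down has elapsed.
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'half_open';
    }
    return this.state;
  }

  // Runs `action` unless the breaker is open; returns undefined when short-circuited.
  call<T>(action: () => T): T | undefined {
    if (this.getState() === 'open') return undefined;
    try {
      const result = action();
      // Any success closes the breaker and clears the failure count.
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch (error) {
      this.failures += 1;
      // A failed trial call in half-open state reopens immediately.
      if (this.state === 'half_open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

A production version would wrap asynchronous sends and await them. The state machine stays the same: consecutive failures open the breaker, a cool-down allows one trial call, and a success closes it again.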

Timeline and Scope Evolution

The RFC outlined a structured development approach:

Phase 1 (Weeks 1-4): Core Infrastructure
Phase 2 (Weeks 5-8): Advanced Features
Phase 3 (Weeks 9-12): Integration & Optimization

The implementation timeline revealed different patterns:

Weeks 1-4: Infrastructure Foundation Challenges

Environment setup and capacity planning consumed more time than anticipated. Database throughput requirements exceeded initial assumptions, and competing production priorities affected team availability.

Weeks 5-12: Scope Expansion

Early demonstrations generated enthusiasm and additional requirements. Channel diversity expanded beyond initial specifications as business needs emerged during development.

// Original scope
const originalChannels = ['in_app', 'email', 'push'];

// Actual scope by week 20
const actualChannels = [
  'in_app', 
  'email', 
  'push', 
  'sms',  // Added week 6
  'slack',  // Added week 8
  'teams',  // Added week 10
  'webhook',  // Added week 11
  'discord',  // Added week 14 (yes, we were already late)
  'voice_call'  // Added week 20 (for critical security alerts)
];

Months 4-6: Integration Complexity

The clean API design assumed consistent authentication patterns across services. Production revealed three different authentication systems (JWT, OAuth2, and a legacy session scheme), plus service accounts and anonymous users, all requiring unified notification support.

// RFC assumption
interface AuthContext {
  userId: string;
  token: string;
}

// Production reality
type AuthContext = 
  | { type: 'jwt'; userId: string; token: string; claims: JWTClaims }
  | { type: 'oauth2'; userId: string; accessToken: string; refreshToken: string; expiresAt: Date }
  | { type: 'legacy'; sessionId: string; userId?: string; cookieData: LegacyCookie }
  | { type: 'service_account'; serviceId: string; apiKey: string }
  | { type: 'anonymous'; temporaryId: string; ipAddress: string };

// Each authentication pattern required specialized handling:
// rate limiting, security validation, and audit requirements
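One way to keep that specialized handling honest is an exhaustive `switch` over the union, so the compiler flags any variant that is added later without a handler. The sketch below derives a rate-limiting key per caller. `rateLimitKey` and its key formats are hypothetical, and `JWTClaims` and `LegacyCookie` are placeholder shapes because the real ones are not shown.

```typescript
// Placeholder shapes; the real claim and cookie types are not shown in the source.
type JWTClaims = Record<string, unknown>;
type LegacyCookie = Record<string, string>;

type AuthContext =
  | { type: 'jwt'; userId: string; token: string; claims: JWTClaims }
  | { type: 'oauth2'; userId: string; accessToken: string; refreshToken: string; expiresAt: Date }
  | { type: 'legacy'; sessionId: string; userId?: string; cookieData: LegacyCookie }
  | { type: 'service_account'; serviceId: string; apiKey: string }
  | { type: 'anonymous'; temporaryId: string; ipAddress: string };

// Hypothetical helper: derive a stable key for per-caller rate limiting.
function rateLimitKey(ctx: AuthContext): string {
  switch (ctx.type) {
    case 'jwt':
    case 'oauth2':
      return `user:${ctx.userId}`;
    case 'legacy':
      // Legacy sessions may not carry a userId, so fall back to the session.
      return ctx.userId ? `user:${ctx.userId}` : `session:${ctx.sessionId}`;
    case 'service_account':
      return `service:${ctx.serviceId}`;
    case 'anonymous':
      return `anon:${ctx.temporaryId}`;
    default: {
      // Compile-time exhaustiveness check: adding a sixth variant without
      // handling it above makes this assignment a type error.
      const unhandled: never = ctx;
      throw new Error(`Unhandled auth context: ${JSON.stringify(unhandled)}`);
    }
  }
}
```

The same pattern applies to security validation and audit logging: one function per concern, each forced to account for every authentication type.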

Months 7-8: Performance Optimization

While functional, the system required significant performance work to meet throughput requirements. Template rendering emerged as an unexpected bottleneck, with personalization features requiring multiple API calls per notification.
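A common mitigation for this kind of bottleneck is to fetch personalization data once per user per batch instead of once per notification. It is offered here as an assumption, not as what these teams shipped. The sketch shows only the grouping step; `PendingNotification`, `groupByUser`, and `lookupsSaved` are hypothetical names.

```typescript
interface PendingNotification {
  id: string;
  userId: string;
  templateId: string;
}

// Group a batch by user so personalization data can be fetched once per user
// and reused for every notification addressed to that user.
function groupByUser(batch: PendingNotification[]): Map<string, PendingNotification[]> {
  const groups = new Map<string, PendingNotification[]>();
  for (const notification of batch) {
    const existing = groups.get(notification.userId);
    if (existing) {
      existing.push(notification);
    } else {
      groups.set(notification.userId, [notification]);
    }
  }
  return groups;
}

// Number of per-notification lookups avoided by fetching once per user.
function lookupsSaved(batch: PendingNotification[]): number {
  return batch.length - groupByUser(batch).size;
}
```

The renderer would then make one profile request per key in the returned map. How much that saves depends on how often a batch contains several notifications for the same user.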

Team Scaling and Organizational Changes

The RFC specified “2 developers for 12 weeks.” The implementation team evolved differently:

2 senior engineers (supposed to be full-time, averaged 60% due to production support)
1 junior engineer (added month 2, spent month 3 learning the codebase)
2 contractors (added month 4 for “quick wins,” spent month 5 fixing their code)
1 DevOps engineer (supposedly “consulting,” became full-time by month 3)
1 database expert (brought in month 5 for performance crisis)
Product manager (changed twice during the project)
3 different engineering managers (reorg happened in month 6)

Team changes introduced context transfer challenges and architectural reviews. Contractor contributions required additional integration work, and organizational restructuring prompted design reassessment that affected project momentum.

Monitoring Requirements Discovery

The RFC monitoring section covered standard metrics: delivery rate, response time, error rate, and throughput. Production operation revealed additional observability requirements:

// RFC monitoring plan
const plannedMetrics = [
  'delivery_rate',
  'response_time', 
  'error_rate',
  'throughput'
];

// What we actually monitor
const productionMetrics = [
  // Basic metrics (from RFC)
  'delivery_rate_by_channel_by_priority_by_user_segment',
  'response_time_p50_p95_p99_p999',
  'error_rate_by_type_by_service_by_retry_count',
  
  // The metrics that actually matter
  'template_render_time_by_template_by_variables_count',
  'database_connection_pool_wait_time',
  'redis_operation_time_by_operation_type',
  'webhook_retry_backoff_effectiveness',
  'notification_staleness_at_delivery',
  'user_preference_cache_hit_rate',
  'deduplication_effectiveness_by_time_window',
  'rate_limit_rejection_by_reason',
  'circuit_breaker_state_transitions',
  'message_size_distribution_by_channel',
  'websocket_reconnection_storms',
  'push_token_invalidation_rate',
  'email_bounce_classification',
  'notification_feedback_loop_latency',
  'cost_per_notification_by_channel',
  'regulatory_compliance_audit_completeness',
  
  // The weird ones we needed after specific incidents
  'mobile_app_version_vs_notification_compatibility',
  'timezone_calculation_accuracy',
  'emoji_rendering_failures_by_client',
  'notification_delivery_during_database_failover',
  'memory_leak_in_template_cache',
  'thundering_herd_detection'
];

Each additional metric addresses specific operational challenges that emerged during production use, highlighting the difference between design-time and runtime observability needs.
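A metric like `response_time_p50_p95_p99_p999` replaces a single average with several points on the latency distribution. As a rough illustration, here is a nearest-rank percentile over raw samples. Real monitoring stacks usually compute this from histograms or sketches rather than sorted arrays, and `percentile` and `latencySummary` are hypothetical helpers.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new RangeError('no samples');
  if (p <= 0 || p > 100) throw new RangeError('p must be in (0, 100]');
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Summarize response times (in milliseconds) at the four tracked percentiles.
function latencySummary(samplesMs: number[]) {
  return {
    p50: percentile(samplesMs, 50),
    p95: percentile(samplesMs, 95),
    p99: percentile(samplesMs, 99),
    p999: percentile(samplesMs, 99.9),
  };
}
```

With small sample counts the upper percentiles all collapse onto the maximum, which is one reason tail metrics only become meaningful at production volume.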

Technical Debt Accumulation Patterns

Technical debt considerations weren’t explicit in the RFC. By month 8, several patterns had emerged:

Template System Complexity

Multiple template engines emerged to support different team requirements, creating a hybrid system that required ongoing maintenance.

// Multi-engine template management complexity
class NotificationTemplateManager {
  private mustacheTemplates: Map<string, MustacheTemplate>;  // Original system
  private handlebarsTemplates: Map<string, HandlebarsTemplate>; // Added for marketing
  private reactEmailTemplates: Map<string, ReactEmailTemplate>; // Added for pretty emails
  
  async render(templateId: string, data: any): Promise<string> {
    // 150 lines of logic to figure out which template engine to use,
    // handle edge cases, maintain backwards compatibility,
    // and work around bugs we can't fix without breaking production
    
    // This comment has been here since month 4:
    // TODO: Unify template systems (estimated: 2 weeks)
    // Actual estimate after investigation: 3 months + migration plan
  }
}
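Much of that "which engine is this?" logic goes away if each template records its engine when it is saved, so nothing has to be inferred at render time. This is a sketch of the idea, not the team's actual fix. `TemplateDispatcher`, `StoredTemplate`, and the stand-in renderer are hypothetical, and no real Mustache, Handlebars, or React Email API is called.

```typescript
type EngineName = 'mustache' | 'handlebars' | 'react_email';

interface StoredTemplate {
  id: string;
  engine: EngineName; // recorded when the template is saved, not inferred at render time
  source: string;
}

type RenderFn = (source: string, data: Record<string, string>) => string;

class TemplateDispatcher {
  private engines = new Map<EngineName, RenderFn>();

  register(engine: EngineName, render: RenderFn): void {
    this.engines.set(engine, render);
  }

  render(template: StoredTemplate, data: Record<string, string>): string {
    const engine = this.engines.get(template.engine);
    if (!engine) {
      throw new Error(`No renderer registered for engine "${template.engine}"`);
    }
    return engine(template.source, data);
  }
}

// Stand-in renderer for illustration only; a real deployment would wrap the
// actual template library behind the same RenderFn signature.
const naiveInterpolate: RenderFn = (source, data) =>
  source.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => data[key] ?? '');
```

Unifying the engines is still a multi-month job, but an explicit tag at least makes the migration measurable: count templates per engine and watch the number fall.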

Schema Migration Challenges

The evolution from initial to optimized schema required careful migration planning. Running parallel schemas during transition introduced synchronization complexity.

-- The migration nightmare
BEGIN;
  -- Step 1 of 47 in the migration plan
  INSERT INTO notification_events_v2 
  SELECT 
    id,
    user_id,
    -- 50 lines of complex transformation logic
    CASE 
      WHEN notification_type IN ('old_type_1', 'old_type_2') THEN 'new_type_1'
      WHEN notification_type LIKE 'legacy\_%' THEN REPLACE(notification_type, 'legacy_', 'classic_') -- \_ matches a literal underscore
      -- 20 more WHEN clauses
    END as notification_type_v2,
    -- More transformations...
  FROM notification_events 
  WHERE created_at > NOW() - INTERVAL '1 hour'
    AND status != 'migrated'
    AND NOT EXISTS (
      SELECT 1 FROM notification_events_v2 
      WHERE notification_events_v2.id = notification_events.id
    );
  
  -- Update migration status
  UPDATE migration_status 
  SET last_run = NOW(), 
      records_migrated = records_migrated + row_count,
      estimated_completion = NOW() + (remaining_records / current_rate * INTERVAL '1 second')
  WHERE migration_name = 'notification_schema_v2';
  
  -- Check for conflicts
  -- Handle rollback scenarios
  -- Update monitoring metrics
  -- 100 more lines...
COMMIT;
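Transformation rules buried in a long `CASE` expression are hard to test. Mirroring them in application code makes each rule checkable on its own. The sketch below covers only the two visible `WHEN` clauses and does not guess at the elided ones. `mapNotificationType` is a hypothetical name.

```typescript
// Mirrors the visible WHEN clauses only; the elided clauses are not reproduced.
function mapNotificationType(oldType: string): string | null {
  if (oldType === 'old_type_1' || oldType === 'old_type_2') {
    return 'new_type_1';
  }
  // Treats 'legacy_' as a literal prefix. In SQL LIKE, an unescaped
  // underscore matches any single character, so the two can differ.
  if (oldType.startsWith('legacy_')) {
    // SQL REPLACE substitutes every occurrence, so split/join does the same.
    return oldType.split('legacy_').join('classic_');
  }
  // A CASE with no matching WHEN and no ELSE yields NULL.
  return null;
}
```

A table of input and expected output pairs run against both this function and the SQL is a cheap way to catch drift between the two during a long migration.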

Result: Lessons from Implementation Experience

The RFC specified technical success criteria: 99.9% uptime, sub-100ms delivery, and 10,000 notifications per second. Even once those targets were met, it became clear that user and business metrics were equally important.

What actually mattered:

User happiness: We had a 99% delivery rate, but users hated the notifications because they were poorly timed
Developer productivity: Other teams couldn’t integrate with our “clean” API without extensive hand-holding
Operational burden: The system required constant babysitting despite all our automation
Business value: Marketing couldn’t use half the features because they were too complex

// What we optimized for (from RFC)
const technicalMetrics = {
  uptime: 99.9,
  deliveryTime: 95, // ms
  throughput: 10000, // per second
  errorRate: 0.1 // percent
};

// What actually mattered
const businessMetrics = {
  userNotificationDisableRate: 45, // percent - way too high
  developerIntegrationTime: 3, // weeks - should be hours
  supportTicketsPerWeek: 150, // related to notifications
  marketingCampaignSetupTime: 2, // days - should be minutes
  monthlyOperationalCost: 25000, // dollars - 5x the estimate
  engineersPagedPerWeek: 12 // times - unsustainable
};

Key Implementation Insights

Several patterns emerge consistently across notification system implementations:

1. RFCs as Starting Hypotheses

Treating RFCs as initial hypotheses rather than fixed specifications enables better adaptation. Documents should evolve with implementation learning rather than remaining static reference points.

2. Planning for Emergent Requirements

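Budget a significant buffer for requirements nobody has thought of yet. Doubling the estimate and adding contingency on top leaves room for what the team discovers during development. In the example above, even a doubled estimate of 24 weeks would have fallen short of the actual 8 months.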

3. Evolution-Ready Design

Systems inevitably require migration, versioning, and compatibility features. Building these capabilities early reduces future technical debt and operational complexity.
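One concrete form of this is to version stored payloads from day one and upgrade old shapes at the point of reading them. The sketch below is illustrative only: both payload shapes are invented for the example, and `upgrade` is a hypothetical name.

```typescript
// Invented payload shapes for illustration; the real ones are not in the source.
interface PayloadV1 { version: 1; message: string }
interface PayloadV2 { version: 2; title: string; body: string }
type AnyPayload = PayloadV1 | PayloadV2;

// Upgrade any stored payload to the latest shape so the rest of the code
// only ever handles PayloadV2.
function upgrade(payload: AnyPayload): PayloadV2 {
  switch (payload.version) {
    case 1:
      // Assumption: v1 carried a single message field, which becomes the body.
      return { version: 2, title: '', body: payload.message };
    case 2:
      return payload;
    default: {
      // Compile-time check that every known version is handled above.
      const unknownVersion: never = payload;
      throw new Error(`Unknown payload version: ${JSON.stringify(unknownVersion)}`);
    }
  }
}
```

Adding a third version then means writing one more `case` rather than scattering `_v2` and `_v3` checks through the codebase.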

4. Edge Cases as Core Requirements

Edge cases raised in design reviews tend to show up in production. Handling them in the initial implementation is cheaper than fixing them reactively after an incident.

5. Organizational Context Integration

Technical design success depends on organizational alignment. Team changes, restructuring, and varying stakeholder priorities affect implementation more than architectural elegance.

6. Operational Observability Focus

Effective monitoring is built around what responders need during an incident, not around what the design document listed. Metrics covering business impact, user experience, and operational detail are more useful for debugging than headline numbers alone.

Bridging Design and Implementation

Several strategies help minimize the RFC-to-production gap:

Progressive Feature Development

Starting with well-executed core functionality enables better iteration than comprehensive initial implementation. Perfect email notifications provide a stronger foundation than basic multi-channel support.

Adaptability Over Optimization

Systems designed for graceful evolution handle changing requirements better than those optimized for predicted scenarios. Flexibility often proves more valuable than initial perfection.

Developer Experience Investment

Easy integration and operation drive adoption more effectively than raw performance. API usability often determines system success more than technical specifications.

Documentation Evolution

Maintaining documentation as living artifacts rather than historical records improves team understanding. Sections for original design, current implementation, and learned insights provide comprehensive context.

Comprehensive Feedback Integration

Feedback loops across user experience, operational metrics, and developer workflow enable rapid iteration. Quick learning cycles accelerate problem identification and resolution.

Conclusion: Embracing Implementation Reality

Learning to work with implementation evolution rather than against it improves outcomes. Pristine RFCs naturally become complex as they address user needs. Beautiful architectures develop practical extensions. Clean codebases accumulate necessary technical debt. This represents successful problem-solving rather than design failure.

The RFC-to-production gap requires management rather than elimination. Effective engineering adapts to emerging reality while maintaining system coherence and user value.

Across notification system implementations, final systems rarely match their initial designs. They’re typically more complex and take longer to build, but they’re also more capable and solve problems that weren’t apparent during initial planning.

When writing RFCs, remember: you’re starting a conversation with implementation reality rather than defining fixed specifications. This perspective enables better planning and more realistic expectations.
