2025-09-08
From RFC to Production: What They Don't Tell You About Implementation
An honest take on the gap between beautiful RFC designs and messy production reality, featuring real-world lessons from implementing notification systems at scale
Abstract
RFCs rarely survive contact with production unchanged, and that’s not necessarily a problem. Through examining notification system implementations, we can learn how elegant designs evolve when they meet organizational constraints, timeline pressures, and unexpected requirements. This exploration reveals patterns that help bridge the gap between theoretical design and practical implementation.
Situation: The Beautiful RFC vs. Production Reality
You know that feeling when you’re reading through a beautifully crafted RFC, nodding along to the elegant architecture diagrams, and thinking “This is it, this is the design that will finally work perfectly”? Then six months later you’re knee-deep in production issues, the timeline has doubled, and that pristine database schema looks like it went through a blender?
This pattern emerges repeatedly across system implementations. The gap between RFC and production isn’t a bug - it’s a feature of building complex systems with teams under business pressures. Understanding this gap helps us plan more effectively and set realistic expectations.
Note: The following examples are adapted from multiple notification system implementations across different organizations. While specific details may vary, the patterns and challenges described are representative of common experiences in this domain.
Task: Building a Notification System from RFC to Reality
The task seemed straightforward from the RFC perspective: a comprehensive notification system with clean architecture diagrams, well-planned database schemas, and a phased rollout strategy. The specifications looked thorough and the timeline appeared conservative:
// The RFC specifications
interface NotificationSystemGoals {
  deliveryTime: '<100ms for in-app, <5s for email',
  throughput: '10,000+ notifications per second',
  uptime: '99.9% availability',
  timeline: '12 weeks with 2 developers',
  budget: '$120,000-180,000'
}

// What emerged in production
interface ProductionReality {
  deliveryTime: '2-3s for in-app on good days, 30s+ during peaks',
  throughput: 'Started at 500/sec, took 6 months to reach 5,000/sec',
  uptime: '97% first quarter, 99% after year one',
  timeline: '8 months with 4 developers plus 2 contractors',
  budget: '$400,000+ and still counting maintenance costs'
}
The RFC appeared comprehensive, covering rate limiting, deduplication, preference management, and user experience considerations like quiet hours. The phased approach seemed reasonable - core infrastructure in 4 weeks felt achievable.
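The size of the gap is easier to see as ratios. The following is a minimal sketch that derives overrun factors from the figures quoted above; the function name and the month-to-week conversion (52 / 12 weeks per month) are assumptions made for illustration.

```typescript
// Overrun factors derived from the figures quoted in the two blocks above.
// Assumption: one month is treated as 52 / 12 weeks.
const WEEKS_PER_MONTH = 52 / 12;

// Ratio of what happened to what was planned.
function overrunFactor(planned: number, actual: number): number {
  return actual / planned;
}

// Timeline: 12 weeks planned, 8 months actual.
const timelineOverrun = overrunFactor(12, 8 * WEEKS_PER_MONTH);

// Budget: $120,000-180,000 planned, $400,000 or more actual.
const budgetOverrunLow = overrunFactor(180000, 400000);
const budgetOverrunHigh = overrunFactor(120000, 400000);

// Throughput: 10,000/sec targeted, 5,000/sec reached after 6 months.
const throughputAchieved = overrunFactor(10000, 5000);

console.log({
  timelineOverrun: timelineOverrun.toFixed(1),     // "2.9"
  budgetOverrunLow: budgetOverrunLow.toFixed(1),   // "2.2"
  budgetOverrunHigh: budgetOverrunHigh.toFixed(1), // "3.3"
  throughputAchieved,                              // 0.5
});
```

Under these assumptions the timeline ran close to three times the plan, and the budget landed between roughly two and three times the original range.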
Action: Implementation Challenges and Adaptations
Database Schema Evolution
The initial database schema design emphasized clean normalization with proper foreign keys and constraints:
-- Initial RFC schema design
CREATE TABLE notification_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID REFERENCES users(id) ON DELETE CASCADE,
  notification_type VARCHAR(100) NOT NULL,
  template_id UUID REFERENCES notification_templates(id),
  data JSONB DEFAULT '{}',
  status VARCHAR(20) DEFAULT 'pending',
  sent_at TIMESTAMP,
  delivered_at TIMESTAMP,
  read_at TIMESTAMP,
  created_at TIMESTAMP DEFAULT NOW()
);
Three months into production, the schema had evolved significantly:
-- Schema after production adaptations
CREATE TABLE notification_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID, -- Foreign key removed due to performance issues
  notification_type VARCHAR(100),
  notification_type_v2 VARCHAR(255), -- Migration in progress
  template_id UUID,
  template_id_v2 BIGINT, -- Different team used different ID type
  data JSONB DEFAULT '{}',
  data_compressed BYTEA, -- Added when JSONB got too large
  status VARCHAR(20) DEFAULT 'pending',
  status_v2 VARCHAR(50), -- More statuses than expected
  priority INTEGER DEFAULT 0, -- Not in RFC, critical for production
  retry_count INTEGER DEFAULT 0, -- Not in RFC, essential for debugging
  channel VARCHAR(50), -- Denormalized for query performance
  correlation_id UUID, -- Added for distributed tracing
  partition_key INTEGER, -- Added for sharding
  sent_at TIMESTAMP,
  delivered_at TIMESTAMP,
  read_at TIMESTAMP,
  failed_at TIMESTAMP, -- Not in RFC, very much needed
  expires_at TIMESTAMP, -- Not in RFC, prevented infinite growth
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW() -- Added after debugging nightmares
);

-- Plus 15 indexes we didn't anticipate
CREATE INDEX CONCURRENTLY idx_notification_events_user_created
  ON notification_events(user_id, created_at DESC)
  WHERE status != 'deleted';
CREATE INDEX CONCURRENTLY idx_notification_events_correlation
  ON notification_events(correlation_id)
  WHERE correlation_id IS NOT NULL;
-- ... and 13 more
Each schema change addressed production incidents, performance bottlenecks, or requirements that emerged during implementation. These adaptations reflect the natural evolution from theoretical design to operational system.
WebSocket Connection Management Complexity
The RFC specified WebSocket-based delivery for optimal performance. The initial implementation approach was straightforward:
// RFC's WebSocket implementation
class NotificationWebSocketManager {
  private connections: Map<string, WebSocket> = new Map();

  async sendNotification(userId: string, notification: NotificationEvent) {
    const connection = this.connections.get(userId);
    if (connection && connection.readyState === WebSocket.OPEN) {
      connection.send(JSON.stringify({
        type: 'notification',
        data: notification
      }));
    }
  }
}
Production requirements revealed additional complexity. After addressing connection management challenges during mobile app deployments, the implementation evolved:
// Production implementation addressing edge cases
class NotificationWebSocketManager {
  private connections: Map<string, Set<WebSocketConnection>> = new Map();
  private connectionMetadata: Map<string, ConnectionMetadata> = new Map();
  private healthChecks: Map<string, NodeJS.Timeout> = new Map();
  private rateLimiters: Map<string, RateLimiter> = new Map();
  private deadLetterQueue: Queue<FailedNotification>;
  private circuit: CircuitBreaker;

  async sendNotification(userId: string, notification: NotificationEvent) {
    // 200+ lines of defensive programming
    const connections = this.connections.get(userId);
    if (!connections || connections.size === 0) {
      await this.queueForLaterDelivery(userId, notification);
      return;
    }

    // Handle multiple connections per user (mobile + web + tablet)
    const results = await Promise.allSettled(
      Array.from(connections).map(async (conn) => {
        try {
          // Check connection health
          if (!this.isConnectionHealthy(conn)) {
            await this.reconnectOrEvict(conn);
            throw new Error('Unhealthy connection');
          }

          // Rate limiting per connection
          const limiter = this.getRateLimiter(conn.id);
          if (!await limiter.tryAcquire()) {
            await this.backpressure(conn, notification);
            return;
          }

          // Circuit breaker for cascading failures
          return await this.circuit.fire(async () => {
            // Message size validation (learned this the hard way)
            const message = this.serializeNotification(notification);
            if (message.length > MAX_MESSAGE_SIZE) {
              const chunks = this.chunkMessage(message);
              for (const chunk of chunks) {
                await this.sendChunk(conn, chunk);
              }
            } else {
              await this.sendMessage(conn, message);
            }
          });
        } catch (error) {
          await this.handleDeliveryFailure(conn, notification, error);
        }
      })
    );

    // Track delivery metrics
    await this.recordDeliveryMetrics(userId, notification, results);
  }

  // Plus 50+ other methods for handling edge cases
}
Each addition addressed specific production challenges: circuit breakers for cascading failures, message chunking for large payloads, and sophisticated rate limiting for notification storms. These patterns emerge consistently when simple designs meet complex operational requirements.
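To make the circuit breaker idea concrete, here is a minimal sketch of the state machine behind it. It is not the `CircuitBreaker` used in the class above, whose implementation the original does not show. The class name `SimpleCircuitBreaker`, its thresholds, and the injectable clock are all assumptions for illustration.

```typescript
// Minimal circuit breaker state machine (illustrative only).
type BreakerState = 'closed' | 'open' | 'half_open';

class SimpleCircuitBreaker {
  private state: BreakerState = 'closed';
  private consecutiveFailures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly resetTimeoutMs: number,
    // Injectable clock so the behavior can be tested without waiting.
    private readonly now: () => number = () => Date.now(),
  ) {}

  // Returns true if a send attempt may proceed right now.
  allowRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      // After the cool-down, allow trial requests until an outcome is recorded.
      this.state = 'half_open';
    }
    return this.state !== 'open';
  }

  // A success closes the breaker and clears the failure count.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = 'closed';
  }

  // A failure during a trial, or too many in a row, opens the breaker.
  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.state === 'half_open' || this.consecutiveFailures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

A caller would check `allowRequest()` before each send and report the outcome afterwards. While the breaker is open, sends are skipped instead of piling onto a failing downstream, which is how this pattern limits cascading failures.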
Timeline and Scope Evolution
The RFC outlined a structured development approach:
- Phase 1 (Weeks 1-4): Core Infrastructure
- Phase 2 (Weeks 5-8): Advanced Features
- Phase 3 (Weeks 9-12): Integration & Optimization
The implementation timeline revealed different patterns:
Weeks 1-4: Infrastructure Foundation Challenges
Environment setup and capacity planning consumed more time than anticipated. Database throughput requirements exceeded initial assumptions, and competing production priorities affected team availability.
Weeks 5-12: Scope Expansion
Early demonstrations generated enthusiasm and additional requirements. Channel diversity expanded beyond initial specifications as business needs emerged during development.
// Original scope
const originalChannels = ['in_app', 'email', 'push'];

// Actual scope by week 20
const actualChannels = [
  'in_app',
  'email',
  'push',
  'sms', // Added week 6
  'slack', // Added week 8
  'teams', // Added week 10
  'webhook', // Added week 11
  'discord', // Added week 14 (yes, we were already late)
  'voice_call' // Added week 20 (for critical security alerts)
];
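One way to absorb this kind of channel growth is to put every channel behind the same small contract, so a new channel is a registration instead of a change to the dispatch path. The sketch below assumes this approach; `ChannelAdapter`, `ChannelRegistry`, and `OutboundNotification` are hypothetical names, not part of the system described here.

```typescript
// Hypothetical payload shape shared by all channels.
interface OutboundNotification {
  userId: string;
  title: string;
  body: string;
}

// Hypothetical contract: each channel implements the same interface.
interface ChannelAdapter {
  readonly channel: string;
  send(notification: OutboundNotification): Promise<void>;
}

class ChannelRegistry {
  private readonly adapters = new Map<string, ChannelAdapter>();

  // Registering the same channel twice is treated as a configuration error.
  register(adapter: ChannelAdapter): void {
    if (this.adapters.has(adapter.channel)) {
      throw new Error(`Channel already registered: ${adapter.channel}`);
    }
    this.adapters.set(adapter.channel, adapter);
  }

  channels(): string[] {
    return Array.from(this.adapters.keys());
  }

  // Looks up the adapter by name and delegates delivery to it.
  async dispatch(channel: string, notification: OutboundNotification): Promise<void> {
    const adapter = this.adapters.get(channel);
    if (!adapter) {
      throw new Error(`No adapter registered for channel: ${channel}`);
    }
    await adapter.send(notification);
  }
}

// Adding a channel later is one register() call, not a change to dispatch().
const registry = new ChannelRegistry();
const sent: string[] = [];
for (const channel of ['in_app', 'email', 'push', 'sms']) {
  registry.register({
    channel,
    // Stand-in for a real delivery call.
    send: async (n) => { sent.push(`${channel}:${n.userId}`); },
  });
}
```

This does not remove the per-channel work (credentials, rate limits, payload formats), but it keeps that work out of the core delivery code.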
Months 4-6: Integration Complexity
The clean API design assumed consistent authentication patterns across services. Production revealed three different authentication systems requiring unified notification support.
// RFC assumption
interface AuthContext {
  userId: string;
  token: string;
}

// Production reality
type AuthContext =
  | { type: 'jwt'; userId: string; token: string; claims: JWTClaims }
  | { type: 'oauth2'; userId: string; accessToken: string; refreshToken: string; expiresAt: Date }
  | { type: 'legacy'; sessionId: string; userId?: string; cookieData: LegacyCookie }
  | { type: 'service_account'; serviceId: string; apiKey: string }
  | { type: 'anonymous'; temporaryId: string; ipAddress: string };

// Each authentication pattern required specialized handling:
// rate limiting, security validation, and audit requirements
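A discriminated union like this pairs well with an exhaustive `switch`, so the compiler flags every handler that has not been updated when a sixth authentication type appears. The sketch below assumes placeholder definitions for `JWTClaims` and `LegacyCookie`, which the original does not define, and `recipientKey` is a hypothetical helper.

```typescript
// Placeholder shapes: the original does not define these two types.
type JWTClaims = Record<string, unknown>;
type LegacyCookie = Record<string, string>;

type AuthContext =
  | { type: 'jwt'; userId: string; token: string; claims: JWTClaims }
  | { type: 'oauth2'; userId: string; accessToken: string; refreshToken: string; expiresAt: Date }
  | { type: 'legacy'; sessionId: string; userId?: string; cookieData: LegacyCookie }
  | { type: 'service_account'; serviceId: string; apiKey: string }
  | { type: 'anonymous'; temporaryId: string; ipAddress: string };

// Hypothetical helper: derive a stable recipient key for routing and rate limiting.
function recipientKey(ctx: AuthContext): string {
  switch (ctx.type) {
    case 'jwt':
    case 'oauth2':
      return `user:${ctx.userId}`;
    case 'legacy':
      // Legacy sessions may not carry a userId, so fall back to the session.
      return ctx.userId ? `user:${ctx.userId}` : `session:${ctx.sessionId}`;
    case 'service_account':
      return `service:${ctx.serviceId}`;
    case 'anonymous':
      return `anon:${ctx.temporaryId}`;
    default: {
      // Compile-time exhaustiveness check: a new variant fails to type-check here.
      const unreachable: never = ctx;
      throw new Error(`Unhandled auth context: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The `never` assignment in the `default` branch is the important part: it turns a forgotten case into a build error instead of a production incident.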
Months 7-8: Performance Optimization
While functional, the system required significant performance work to meet throughput requirements. Template rendering emerged as an unexpected bottleneck, with personalization features requiring multiple API calls per notification.
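A common mitigation when rendering needs several lookups per notification is to share each lookup across notifications that ask for the same key. The sketch below is one possible shape for that, not the fix this system used; `PromiseCache` and the profile fetcher are hypothetical.

```typescript
// Hypothetical: cache in-flight and completed lookups so each key is fetched once.
class PromiseCache<T> {
  private readonly entries = new Map<string, Promise<T>>();

  constructor(private readonly fetcher: (key: string) => Promise<T>) {}

  get(key: string): Promise<T> {
    let entry = this.entries.get(key);
    if (!entry) {
      entry = this.fetcher(key).catch((error) => {
        // Do not cache failures; the next caller triggers a fresh fetch.
        this.entries.delete(key);
        throw error;
      });
      this.entries.set(key, entry);
    }
    return entry;
  }
}

// Stand-in for a real personalization API call.
let profileCalls = 0;
const profiles = new PromiseCache(async (userId: string) => {
  profileCalls += 1;
  return { userId, firstName: 'Ada' };
});

// Rendering two notifications for the same user triggers a single fetch.
async function renderGreeting(userId: string): Promise<string> {
  const profile = await profiles.get(userId);
  return `Hi ${profile.firstName}`;
}
```

Because the cache stores promises, concurrent renders for the same user wait on one request instead of issuing several. This sketch never evicts entries, so a real version would need a size or time bound.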
Team Scaling and Organizational Changes
The RFC specified “2 developers for 12 weeks.” The implementation team evolved differently:
- 2 senior engineers (supposed to be full-time, averaged 60% due to production support)
- 1 junior engineer (added month 2, spent month 3 learning the codebase)
- 2 contractors (added month 4 for “quick wins,” spent month 5 fixing their code)
- 1 DevOps engineer (supposedly “consulting,” became full-time by month 3)
- 1 database expert (brought in month 5 for performance crisis)
- Product manager (changed twice during the project)
- 3 different engineering managers (reorg happened in month 6)
Team changes introduced context transfer challenges and architectural reviews. Contractor contributions required additional integration work, and organizational restructuring prompted design reassessment that affected project momentum.
Monitoring Requirements Discovery
The RFC monitoring section covered standard metrics: delivery rate, response time, and error rate. Production operation revealed additional observability requirements:
// RFC monitoring plan
const plannedMetrics = [
  'delivery_rate',
  'response_time',
  'error_rate',
  'throughput'
];

// What we actually monitor
const productionMetrics = [
  // Basic metrics (from RFC)
  'delivery_rate_by_channel_by_priority_by_user_segment',
  'response_time_p50_p95_p99_p999',
  'error_rate_by_type_by_service_by_retry_count',

  // The metrics that actually matter
  'template_render_time_by_template_by_variables_count',
  'database_connection_pool_wait_time',
  'redis_operation_time_by_operation_type',
  'webhook_retry_backoff_effectiveness',
  'notification_staleness_at_delivery',
  'user_preference_cache_hit_rate',
  'deduplication_effectiveness_by_time_window',
  'rate_limit_rejection_by_reason',
  'circuit_breaker_state_transitions',
  'message_size_distribution_by_channel',
  'websocket_reconnection_storms',
  'push_token_invalidation_rate',
  'email_bounce_classification',
  'notification_feedback_loop_latency',
  'cost_per_notification_by_channel',
  'regulatory_compliance_audit_completeness',

  // The weird ones we needed after specific incidents
  'mobile_app_version_vs_notification_compatibility',
  'timezone_calculation_accuracy',
  'emoji_rendering_failures_by_client',
  'notification_delivery_during_database_failover',
  'memory_leak_in_template_cache',
  'thundering_herd_detection'
];
Each additional metric addresses specific operational challenges that emerged during production use, highlighting the difference between design-time and runtime observability needs.
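Two of these metrics are simple to define precisely: response-time percentiles and staleness at delivery. The sketch below uses the nearest-rank method for percentiles. The sample values are invented for illustration, and production systems typically compute percentiles in the metrics backend, not in application code.

```typescript
// Nearest-rank percentile over a sample of durations in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('No samples');
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Staleness at delivery: how old the notification was when it reached the user.
function stalenessMs(createdAt: Date, deliveredAt: Date): number {
  return deliveredAt.getTime() - createdAt.getTime();
}

// Invented sample: mostly fast deliveries with two slow outliers.
const deliveryTimesMs = [120, 80, 2500, 95, 30000, 110, 90, 100, 105, 85];

const p50 = percentile(deliveryTimesMs, 50); // 100
const p95 = percentile(deliveryTimesMs, 95); // 30000
console.log({ p50, p95 });
```

The median looks healthy while the tail shows a 30-second delivery. This is why the production list tracks p95, p99, and p999 in place of a single response time.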
Technical Debt Accumulation Patterns
Technical debt considerations weren’t explicit in the RFC. By month 8, several patterns had emerged:
Template System Complexity
Multiple template engines emerged to support different team requirements, creating a hybrid system that required ongoing maintenance.
// Multi-engine template management complexity
class NotificationTemplateManager {
  private mustacheTemplates: Map<string, MustacheTemplate>; // Original system
  private handlebarsTemplates: Map<string, HandlebarsTemplate>; // Added for marketing
  private reactEmailTemplates: Map<string, ReactEmailTemplate>; // Added for pretty emails

  async render(templateId: string, data: any): Promise<string> {
    // 150 lines of logic to figure out which template engine to use,
    // handle edge cases, maintain backwards compatibility,
    // and work around bugs we can't fix without breaking production

    // This comment has been here since month 4:
    // TODO: Unify template systems (estimated: 2 weeks)
    // Actual estimate after investigation: 3 months + migration plan
  }
}
Schema Migration Challenges
The evolution from initial to optimized schema required careful migration planning. Running parallel schemas during transition introduced synchronization complexity.
-- The migration nightmare
BEGIN;

-- Step 1 of 47 in the migration plan
INSERT INTO notification_events_v2
SELECT
  id,
  user_id,
  -- 50 lines of complex transformation logic
  CASE
    WHEN notification_type IN ('old_type_1', 'old_type_2') THEN 'new_type_1'
    -- Underscore escaped: a bare _ in LIKE matches any single character
    WHEN notification_type LIKE 'legacy\_%' THEN REPLACE(notification_type, 'legacy_', 'classic_')
    -- 20 more WHEN clauses
  END AS notification_type_v2,
  -- More transformations...
FROM notification_events
WHERE created_at > NOW() - INTERVAL '1 hour'
  AND status != 'migrated'
  AND NOT EXISTS (
    SELECT 1 FROM notification_events_v2
    WHERE notification_events_v2.id = notification_events.id
  );

-- Update migration status
UPDATE migration_status
SET last_run = NOW(),
    records_migrated = records_migrated + row_count,
    estimated_completion = NOW() + (remaining_records / current_rate * INTERVAL '1 second')
WHERE migration_name = 'notification_schema_v2';

-- Check for conflicts
-- Handle rollback scenarios
-- Update monitoring metrics
-- 100 more lines...
COMMIT;
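Transformation rules like the `CASE` expression above are easier to trust when they also exist as a small, testable function. The sketch below mirrors only the two `WHEN` branches that are shown; the elided branches are not reproduced, and `mapNotificationType` is a hypothetical name. In SQL `LIKE`, an unescaped `_` matches any single character, so the sketch assumes a literal `legacy_` prefix was intended.

```typescript
// Mirrors the two WHEN branches shown in the SQL; elided branches are omitted.
function mapNotificationType(oldType: string): string | null {
  if (oldType === 'old_type_1' || oldType === 'old_type_2') {
    return 'new_type_1';
  }
  if (oldType.startsWith('legacy_')) {
    // SQL REPLACE substitutes every occurrence, not only the prefix.
    return oldType.split('legacy_').join('classic_');
  }
  // A CASE with no matching WHEN and no ELSE yields NULL.
  return null;
}

console.log(mapNotificationType('legacy_digest')); // "classic_digest"
```

Running the same inputs through this function and through the SQL on a sample of rows is a cheap way to catch mapping mistakes before step 1 of 47.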
Result: Lessons from Implementation Experience
The RFC specified technical success criteria: 99.9% uptime, sub-100ms delivery, and 10,000 notifications per second. Working toward these targets revealed that user and business metrics mattered at least as much.
What actually mattered:
- User happiness: We had a 99% delivery rate, but users hated the notifications because they were poorly timed
- Developer productivity: Other teams couldn’t integrate with our “clean” API without extensive hand-holding
- Operational burden: The system required constant babysitting despite all our automation
- Business value: Marketing couldn’t use half the features because they were too complex
// What we optimized for (from RFC)
const technicalMetrics = {
  uptime: 99.9,
  deliveryTime: 95, // ms
  throughput: 10000, // per second
  errorRate: 0.1 // percent
};

// What actually mattered
const businessMetrics = {
  userNotificationDisableRate: 45, // percent - way too high
  developerIntegrationTime: 3, // weeks - should be hours
  supportTicketsPerWeek: 150, // related to notifications
  marketingCampaignSetupTime: 2, // days - should be minutes
  monthlyOperationalCost: 25000, // dollars - 5x the estimate
  engineersPagedPerWeek: 12 // times - unsustainable
};
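Poor timing is the kind of problem the RFC's quiet hours feature was meant to prevent, and the check itself is small. The sketch below shows one way to write it, including windows that wrap past midnight. It assumes the caller has already converted the current time to the user's local minutes since midnight, and the `QuietHours` shape is hypothetical.

```typescript
// Minutes since local midnight; the window may wrap past midnight.
interface QuietHours {
  startMinute: number; // inclusive
  endMinute: number;   // exclusive
}

function isWithinQuietHours(localMinute: number, quiet: QuietHours): boolean {
  const { startMinute, endMinute } = quiet;
  if (startMinute === endMinute) return false; // treated as an empty window
  if (startMinute < endMinute) {
    // Same-day window, for example 13:00 to 14:00.
    return localMinute >= startMinute && localMinute < endMinute;
  }
  // Window wraps midnight, for example 22:00 to 07:00.
  return localMinute >= startMinute || localMinute < endMinute;
}

// 22:00 to 07:00 expressed in minutes since midnight.
const overnight: QuietHours = { startMinute: 22 * 60, endMinute: 7 * 60 };
console.log(isWithinQuietHours(23 * 60 + 30, overnight)); // true
console.log(isWithinQuietHours(12 * 60, overnight));      // false
```

The timezone conversion is the harder part in practice, which is consistent with `timezone_calculation_accuracy` showing up among the production metrics.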
Key Implementation Insights
Several patterns emerge consistently across notification system implementations:
1. RFCs as Starting Hypotheses
Treating RFCs as initial hypotheses rather than fixed specifications enables better adaptation. Documents should evolve with implementation learning rather than remaining static reference points.
2. Planning for Emergent Requirements
Requirements that surface only during implementation are the norm, so the plan needs explicit buffer for them. Doubling estimates and adding contingency on top leaves room for that discovery.
3. Evolution-Ready Design
Systems inevitably require migration, versioning, and compatibility features. Building these capabilities early reduces future technical debt and operational complexity.
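One lightweight form of evolution-ready design is to version stored payloads from day one and upgrade old versions at read time. The sketch below assumes this approach; the payload shapes, field names, and the default title are invented for illustration.

```typescript
// Hypothetical payload versions; field names are illustrative.
interface PayloadV1 { version: 1; message: string }
interface PayloadV2 { version: 2; title: string; body: string }
type AnyPayload = PayloadV1 | PayloadV2;

// Upgrade any stored payload to the latest shape at read time.
function upcast(payload: AnyPayload): PayloadV2 {
  switch (payload.version) {
    case 1:
      // V1 had a single message field; use it as the body with a default title.
      return { version: 2, title: 'Notification', body: payload.message };
    case 2:
      return payload;
    default: {
      // Compile-time check that every version is handled.
      const unreachable: never = payload;
      return unreachable;
    }
  }
}

const upgraded = upcast({ version: 1, message: 'Your export is ready' });
console.log(upgraded.body); // "Your export is ready"
```

With an explicit `version` field, consumers only ever handle the latest shape, and old rows can be rewritten gradually without parallel `_v2` columns.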
4. Edge Cases as Core Requirements
Scenarios raised as edge cases during design reviews typically manifest in production. Handling them in the initial implementation proves more efficient than fixing them reactively.
5. Organizational Context Integration
Technical design success depends on organizational alignment. Team changes, restructuring, and varying stakeholder priorities affect implementation more than architectural elegance.
6. Operational Observability Focus
Effective monitoring serves incident response rather than mirroring the design document. Metrics tied to business impact, user experience, and operational detail are the ones that help most during debugging.
Bridging Design and Implementation
Several strategies help minimize the RFC-to-production gap:
Progressive Feature Development
Starting with well-executed core functionality enables better iteration than comprehensive initial implementation. Perfect email notifications provide a stronger foundation than basic multi-channel support.
Adaptability Over Optimization
Systems designed for graceful evolution handle changing requirements better than those optimized for predicted scenarios. Flexibility often proves more valuable than initial perfection.
Developer Experience Investment
Easy integration and operation drive adoption more effectively than raw performance. API usability often determines system success more than technical specifications.
Documentation Evolution
Maintaining documentation as living artifacts rather than historical records improves team understanding. Sections for original design, current implementation, and learned insights provide comprehensive context.
Comprehensive Feedback Integration
Feedback loops across user experience, operational metrics, and developer workflow enable rapid iteration. Quick learning cycles accelerate problem identification and resolution.
Conclusion: Embracing Implementation Reality
Learning to work with implementation evolution rather than against it improves outcomes. Pristine RFCs naturally become complex as they address user needs. Beautiful architectures develop practical extensions. Clean codebases accumulate necessary technical debt. This represents successful problem-solving rather than design failure.
The RFC-to-production gap requires management rather than elimination. Effective engineering adapts to emerging reality while maintaining system coherence and user value.
Reflecting on notification system implementations, final systems rarely match initial designs. They’re typically more complex and take longer to build, but they’re also more capable and solve problems that weren’t apparent during initial planning.
When writing RFCs, remember: you’re starting a conversation with implementation reality rather than defining fixed specifications. This perspective enables better planning and more realistic expectations.