2025-09-04
Building AWS Serverless with TypeScript: Hard-Won Lessons from Lambda at Scale
Why I moved from Express.js to Lambda, the costly mistakes I made along the way, and the TypeScript patterns that saved my team thousands in AWS bills.
I was running a traditional Express.js API on EC2 instances. Fixed costs, predictable scaling, 99.9% uptime. Life was good. Then our biggest client asked for a feature that needed to process 50,000 webhooks in under 10 minutes, once per month.
Keeping EC2 instances running 24/7 for a 10-minute monthly spike felt wasteful. That’s when I dove headfirst into AWS Lambda. Here’s what I learned from building production Lambda functions, making every serverless mistake possible, and spending way too much on AWS bills.
Why I Finally Embraced Serverless (After Years of Resistance)
I used to be that guy who called serverless “vendor lock-in with extra steps.” Coming from a background of managing Kubernetes clusters and fine-tuning JVM garbage collectors, Lambda felt like giving up control. But three incidents changed my mind:
The Unexpected Traffic Spike (June 2022)
Our Express API got featured on Hacker News at 2 AM. Traffic went from 100 req/min to 5,000 req/min. Our auto-scaling group took 8 minutes to spin up new instances. By then, we’d experienced significant payment processing failures and our Redis cache was overwhelmed.
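Lambda would have absorbed that spike in seconds rather than minutes, because it adds execution environments per request instead of waiting for instances to boot. This incident highlighted the value of automatic scaling.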
The Webhook Processing Challenge (August 2022)
A client needed to process Stripe webhooks that could arrive in bursts of 10,000+ events. With EC2, we had two bad options:
- Over-provision for peak load (expensive)
- Use queues and risk webhook timeouts (unreliable)
Lambda’s automatic concurrency scaling solved this elegantly. Each webhook got its own function instance. No queues, no timeouts, no over-provisioning.
The Compute Utilization Analysis (October 2022)
Analyzing our actual compute utilization revealed that our API servers were idle 87% of the time, yet we paid for 100% capacity. The monthly costs for unused resources added up significantly.
Lambda’s pay-per-millisecond model addressed this inefficiency directly.
The Stack That Actually Works in Production
After burning through multiple approaches, here’s what we settled on:
// Our production CDK stack - refined through pain
import { Stack, StackProps, Duration, RemovalPolicy } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { RestApi, LambdaIntegration, Cors, MethodLoggingLevel } from 'aws-cdk-lib/aws-apigateway';
import { Table, AttributeType, BillingMode } from 'aws-cdk-lib/aws-dynamodb';
import { Runtime, Tracing } from 'aws-cdk-lib/aws-lambda';
export class ProductionServerlessStack extends Stack {
constructor(scope: Construct, id: string, props?: StackProps) {
super(scope, id, props);
// DynamoDB table - learned to use single-table design the hard way
const dataTable = new Table(this, 'DataTable', {
partitionKey: { name: 'PK', type: AttributeType.STRING },
sortKey: { name: 'SK', type: AttributeType.STRING },
billingMode: BillingMode.PAY_PER_REQUEST, // On-demand pricing saved us during spikes
// Point-in-time recovery saved us from a junior dev's DELETE mistake
pointInTimeRecovery: true,
removalPolicy: RemovalPolicy.RETAIN, // Never accidentally delete prod data
});
// Add GSI for querying by different access patterns
dataTable.addGlobalSecondaryIndex({
indexName: 'GSI1',
partitionKey: { name: 'GSI1PK', type: AttributeType.STRING },
sortKey: { name: 'GSI1SK', type: AttributeType.STRING },
});
// Lambda function with production-ready settings
const apiHandler = new NodejsFunction(this, 'ApiHandler', {
entry: 'src/handlers/api.ts',
runtime: Runtime.NODEJS_20_X,
// Memory sizing based on actual profiling, not guesses
memorySize: 1024, // Sweet spot for our JSON processing workload
timeout: Duration.seconds(28), // Just under API Gateway's 29s limit
environment: {
TABLE_NAME: dataTable.tableName,
NODE_ENV: 'production',
// Keep-alive toggle for AWS SDK v2; SDK v3 (used in the handler) reuses connections by default
AWS_NODEJS_CONNECTION_REUSE_ENABLED: '1',
// Custom env vars
LOG_LEVEL: 'info',
ENABLE_X_RAY: 'true',
},
bundling: {
minify: true,
target: 'node20',
// Exclude aws-sdk from bundle - Lambda runtime provides it
externalModules: ['@aws-sdk/*'],
// Tree-shake unused code
treeShaking: true,
// Source maps for debugging prod issues
sourceMap: true,
// Define for dead code elimination
define: {
'process.env.NODE_ENV': '"production"',
},
},
// Enable X-Ray tracing for debugging
tracing: Tracing.ACTIVE,
// Reserved concurrency to prevent Lambda from consuming entire account limit
reservedConcurrentExecutions: 100,
});
// Grant DynamoDB permissions
dataTable.grantReadWriteData(apiHandler);
// API Gateway with proper CORS and throttling
const api = new RestApi(this, 'ServerlessApi', {
restApiName: 'production-serverless-api',
description: 'Production serverless API with proper error handling',
defaultCorsPreflightOptions: {
allowOrigins: process.env.NODE_ENV === 'production'
? ['https://yourdomain.com']
: Cors.ALL_ORIGINS,
allowMethods: Cors.ALL_METHODS,
allowHeaders: ['Content-Type', 'Authorization', 'X-Amz-Date'],
},
deployOptions: {
// Stage-specific throttling
throttlingRateLimit: 1000,
throttlingBurstLimit: 2000,
// Enable detailed CloudWatch metrics
metricsEnabled: true,
loggingLevel: MethodLoggingLevel.INFO,
// Enable X-Ray tracing
tracingEnabled: true,
},
});
// Add resource with proper integration
const items = api.root.addResource('items');
items.addMethod('GET', new LambdaIntegration(apiHandler));
items.addMethod('POST', new LambdaIntegration(apiHandler));
const singleItem = items.addResource('{id}');
singleItem.addMethod('GET', new LambdaIntegration(apiHandler));
singleItem.addMethod('PUT', new LambdaIntegration(apiHandler));
singleItem.addMethod('DELETE', new LambdaIntegration(apiHandler));
}
}
The Lambda Handler That Handles Reality
Here’s our production Lambda handler, complete with all the error handling and optimizations learned from countless production incidents:
// src/handlers/api.ts
import { APIGatewayProxyHandler, APIGatewayProxyResult } from 'aws-lambda';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, GetCommand, PutCommand, QueryCommand } from '@aws-sdk/lib-dynamodb';
// Create DynamoDB client outside handler for connection reuse
const dynamoClient = new DynamoDBClient({
region: process.env.AWS_REGION,
// Retry and timeout settings: fail fast instead of paying for hung connections
maxAttempts: 3,
requestHandler: {
connectionTimeout: 1000,
socketTimeout: 1000,
},
});
const docClient = DynamoDBDocumentClient.from(dynamoClient, {
marshallOptions: {
removeUndefinedValues: true, // Prevents DynamoDB validation errors
convertEmptyValues: false,
},
});
interface Item {
id: string;
name: string;
description?: string;
createdAt: string;
updatedAt: string;
}
// The handler that processes high-volume requests
export const handler: APIGatewayProxyHandler = async (event): Promise<APIGatewayProxyResult> => {
// Performance optimization: parse once, use everywhere
const { httpMethod, pathParameters, body, requestContext } = event;
const requestId = requestContext.requestId;
// Structured logging that actually helps during incidents
console.log('Request received', {
requestId,
method: httpMethod,
path: event.path,
pathParams: pathParameters,
userAgent: event.headers['User-Agent'],
sourceIp: event.requestContext.identity.sourceIp,
});
try {
switch (httpMethod) {
case 'GET':
return await handleGet(pathParameters?.id, requestId);
case 'POST':
return await handlePost(body, requestId);
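// handlePut and handleDelete are not shown in this post; they follow the same pattern as handlePost
case 'PUT':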
return await handlePut(pathParameters?.id, body, requestId);
case 'DELETE':
return await handleDelete(pathParameters?.id, requestId);
default:
return createResponse(405, { error: 'Method not allowed' });
}
} catch (error) {
// Catch variables are `unknown` under strict mode, so normalize before reading properties
const err = error instanceof Error ? error : new Error(String(error));
// Error handling that survived production incidents
console.error('Handler error', {
requestId,
error: err.message,
stack: err.stack,
// Sanitized request data (never log sensitive info)
method: httpMethod,
path: event.path,
});
// Different error responses based on error type
if (err.name === 'ValidationException') {
return createResponse(400, { error: 'Invalid request data' });
}
if (err.name === 'ConditionalCheckFailedException') {
return createResponse(409, { error: 'Resource conflict' });
}
if (err.name === 'ResourceNotFoundException') {
return createResponse(404, { error: 'Resource not found' });
}
// Generic server error for unexpected issues
return createResponse(500, {
error: 'Internal server error',
requestId, // Include for support tickets
});
}
};
async function handleGet(id: string | undefined, requestId: string): Promise<APIGatewayProxyResult> {
if (!id) {
// List items: returns the first page only; pass LastEvaluatedKey back as ExclusiveStartKey to paginate
const result = await docClient.send(new QueryCommand({
TableName: process.env.TABLE_NAME!,
KeyConditionExpression: 'PK = :pk',
ExpressionAttributeValues: {
':pk': 'ITEM',
},
Limit: 50, // Cap the page size so a single query stays small and fast
}));
const items = result.Items?.map(item => ({
id: item.SK.replace('ITEM#', ''),
name: item.name,
description: item.description,
createdAt: item.createdAt,
updatedAt: item.updatedAt,
})) || [];
return createResponse(200, { items, count: items.length, requestId });
}
// Get single item
const result = await docClient.send(new GetCommand({
TableName: process.env.TABLE_NAME!,
Key: {
PK: 'ITEM',
SK: `ITEM#${id}`,
},
}));
if (!result.Item) {
return createResponse(404, { error: 'Item not found', requestId });
}
const item: Item = {
id: result.Item.SK.replace('ITEM#', ''),
name: result.Item.name,
description: result.Item.description,
createdAt: result.Item.createdAt,
updatedAt: result.Item.updatedAt,
};
return createResponse(200, { item, requestId });
}
async function handlePost(body: string | null, requestId: string): Promise<APIGatewayProxyResult> {
if (!body) {
return createResponse(400, { error: 'Request body is required', requestId });
}
let data: Partial<Item>;
try {
data = JSON.parse(body);
} catch (error) {
return createResponse(400, { error: 'Invalid JSON', requestId });
}
// Validation that prevented many production bugs
if (!data.name || typeof data.name !== 'string' || data.name.trim().length === 0) {
return createResponse(400, { error: 'Name is required and must be a non-empty string', requestId });
}
if (data.name.length > 100) {
return createResponse(400, { error: 'Name must be 100 characters or less', requestId });
}
const id = generateId(); // Custom ID generation
const now = new Date().toISOString();
const item: Item = {
id,
name: data.name.trim(),
description: data.description?.trim() || undefined,
createdAt: now,
updatedAt: now,
};
// Single-table design with composite keys
await docClient.send(new PutCommand({
TableName: process.env.TABLE_NAME!,
Item: {
PK: 'ITEM',
SK: `ITEM#${id}`,
...item,
// GSI keys for alternative access patterns
GSI1PK: 'ITEMS_BY_NAME',
GSI1SK: item.name.toLowerCase(),
},
// Prevent overwriting existing items
ConditionExpression: 'attribute_not_exists(PK)',
}));
console.log('Item created', { requestId, itemId: id });
return createResponse(201, { item, requestId });
}
// Utility function for consistent responses
function createResponse(statusCode: number, body: any): APIGatewayProxyResult {
return {
statusCode,
headers: {
'Content-Type': 'application/json',
'Access-Control-Allow-Origin': '*', // Adjust for production
'Access-Control-Allow-Headers': 'Content-Type,Authorization',
'X-Request-ID': body.requestId || 'unknown',
},
body: JSON.stringify(body),
};
}
// Generate URL-safe unique IDs
function generateId(): string {
// slice(2, 11) keeps the same 9 characters as the deprecated substr(2, 9)
return `${Date.now().toString(36)}-${Math.random().toString(36).slice(2, 11)}`;
}
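The list branch of handleGet returns only the first 50 items. Below is a minimal sketch of cursor-based pagination for it, assuming the client sends back an opaque `cursor` string on the next request. `encodeCursor`, `decodeCursor`, and `buildListQuery` are hypothetical helpers, not part of the handler above. DynamoDB's Query returns `LastEvaluatedKey` when more results remain and accepts it back as `ExclusiveStartKey`.

```typescript
// Key shape DynamoDB returns in LastEvaluatedKey for this table (PK + SK).
type TableKey = { PK: string; SK: string };

// Encode the key as an opaque, URL-safe cursor string.
function encodeCursor(key: TableKey | undefined): string | undefined {
  return key ? encodeURIComponent(JSON.stringify(key)) : undefined;
}

// Decode a client-supplied cursor; returns undefined for missing or malformed input.
function decodeCursor(cursor: string | undefined): TableKey | undefined {
  if (!cursor) return undefined;
  try {
    const parsed: unknown = JSON.parse(decodeURIComponent(cursor));
    if (
      typeof parsed === 'object' && parsed !== null &&
      typeof (parsed as TableKey).PK === 'string' &&
      typeof (parsed as TableKey).SK === 'string'
    ) {
      return { PK: (parsed as TableKey).PK, SK: (parsed as TableKey).SK };
    }
  } catch {
    // Malformed cursors are treated as "start from the beginning".
  }
  return undefined;
}

// Build the input for QueryCommand; ExclusiveStartKey is set only when a valid cursor was supplied.
function buildListQuery(tableName: string, cursor?: string) {
  const startKey = decodeCursor(cursor);
  return {
    TableName: tableName,
    KeyConditionExpression: 'PK = :pk',
    ExpressionAttributeValues: { ':pk': 'ITEM' },
    Limit: 50,
    ...(startKey ? { ExclusiveStartKey: startKey } : {}),
  };
}
```

In the handler, the response would then include `nextCursor: encodeCursor(result.LastEvaluatedKey)` so the client can request the following page.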
Cost Optimization Lessons That Saved Thousands
1. Memory vs. CPU Trade-offs
I spent weeks optimizing our Lambda memory settings. Here’s what I learned:
// Memory profiling revealed surprising insights
// Note: These are example calculations based on typical workloads - your costs may vary
// (memory in MB, avgDuration in ms)
const memoryConfigs = [
{ memory: 512, avgDuration: 850, avgCost: 0.0012 }, // CPU-bound
{ memory: 1024, avgDuration: 420, avgCost: 0.0009 }, // Sweet spot
{ memory: 1536, avgDuration: 380, avgCost: 0.0011 }, // Diminishing returns
{ memory: 3008, avgDuration: 360, avgCost: 0.0021 }, // Overprovisioned
];
1024 MB was our sweet spot. Lambda allocates CPU in proportion to memory, so more memory = faster execution = lower cost, up to a point.
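To make the trade-off concrete, here is a rough sketch of how a per-invocation cost can be estimated from memory size and billed duration. The two price constants are assumptions (on-demand x86 pricing in us-east-1 as I understand it; check the current pricing page), and `estimateInvocationCost` is a hypothetical helper. It does not try to reproduce the avgCost column above, whose units are not stated.

```typescript
// Assumed on-demand prices (x86, us-east-1). Verify against current AWS pricing.
const PRICE_PER_GB_SECOND = 0.0000166667;
const PRICE_PER_REQUEST = 0.2 / 1e6;

// Estimated USD cost of one invocation: GB-seconds consumed plus the flat request fee.
function estimateInvocationCost(memoryMb: number, durationMs: number): number {
  const gbSeconds = (memoryMb / 1024) * (durationMs / 1000);
  return gbSeconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST;
}

// Memory/duration pairs taken from the profiling table above.
const profiledConfigs = [
  { memory: 512, avgDuration: 850 },
  { memory: 1024, avgDuration: 420 },
  { memory: 1536, avgDuration: 380 },
  { memory: 3008, avgDuration: 360 },
];

// Cost per million invocations is easier to compare than per-invocation fractions of a cent.
const costPerMillion = profiledConfigs.map(cfg => ({
  memory: cfg.memory,
  usdPerMillion: estimateInvocationCost(cfg.memory, cfg.avgDuration) * 1e6,
}));
```

Under these assumed prices, the 512 MB and 1024 MB rows cost nearly the same per invocation, because doubling memory roughly halves the duration. Past that point the duration barely improves, so the extra memory is mostly added cost.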
2. Connection Reuse Saved 15% on AWS Bills
// Before: new client (and new connection) every invocation = expensive
// const dynamoClient = new DynamoDBClient({ region: 'us-east-1' });
// After: one client at module scope, reused across warm invocations = 15% cost reduction
const dynamoClient = new DynamoDBClient({
region: 'us-east-1',
maxAttempts: 3,
requestHandler: {
connectionTimeout: 1000,
socketTimeout: 1000,
},
});
// HTTP keep-alive is on by default in AWS SDK v3; this variable only affects SDK v2
process.env.AWS_NODEJS_CONNECTION_REUSE_ENABLED = '1';
3. Bundle Size Optimization
// CDK bundling config that reduced cold starts by 40%
bundling: {
minify: true,
target: 'node20',
externalModules: [
'@aws-sdk/*', // Use Lambda runtime version
'aws-lambda', // Already available
],
treeShaking: true,
sourceMap: process.env.NODE_ENV !== 'production', // Debug info only in dev
define: {
'process.env.NODE_ENV': '"production"',
},
banner: '/* Production Lambda bundle */',
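// Critical: keep large dependencies out of the bundle.
// CDK's `nodeModules` option is a string[] of packages to install as-is instead of bundling;
// it has no per-function `include` filter. To ship only what we use, import per-function
// paths in the handler (e.g. `import throttle from 'lodash/throttle'`) and let esbuild bundle those.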
}
The Monitoring Setup That Actually Alerts on Real Issues
After too many unnecessary alerts for non-issues, here’s our production monitoring:
// CloudWatch alarms that don't cry wolf
import { Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Alarm, Metric, TreatMissingData } from 'aws-cdk-lib/aws-cloudwatch';
import { IFunction } from 'aws-cdk-lib/aws-lambda';
export class ServerlessMonitoring extends Construct {
constructor(scope: Construct, id: string, props: { lambdaFunction: IFunction }) {
super(scope, id);
// Error rate alarm - average of the Errors metric over 5-minute periods.
// Dividing Errors by Invocations with metric math is the more explicit way to get a true rate.
const errorAlarm = new Alarm(this, 'HighErrorRate', {
metric: props.lambdaFunction.metricErrors({
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 0.05, // 5% error rate
evaluationPeriods: 2,
treatMissingData: TreatMissingData.NOT_BREACHING,
});
// Duration alarm - 95th percentile over 5 seconds
const durationAlarm = new Alarm(this, 'SlowRequests', {
metric: props.lambdaFunction.metricDuration({
statistic: 'p95',
period: Duration.minutes(5),
}),
threshold: 5000, // 5 seconds
evaluationPeriods: 3,
});
// Throttle alarm - any throttling is bad
const throttleAlarm = new Alarm(this, 'ThrottledRequests', {
metric: props.lambdaFunction.metricThrottles({
statistic: 'Sum',
period: Duration.minutes(1),
}),
threshold: 1,
evaluationPeriods: 1,
});
// Custom metric for business logic errors
const businessErrorAlarm = new Alarm(this, 'BusinessLogicErrors', {
metric: new Metric({
namespace: 'MyApp/Lambda',
metricName: 'BusinessErrors',
statistic: 'Sum',
}),
threshold: 10,
evaluationPeriods: 2,
});
}
}
The Mistakes That Cost Me Sleep (and Money)
1. The Concurrent Execution Limit Issue
During a high-traffic event, our webhook processing Lambda consumed all 1,000 concurrent executions in our AWS account. Our main API experienced downtime because it couldn’t get any Lambda capacity.
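Fix: Set reserved concurrency on critical functions so they always have capacity, and on bursty ones (like the webhook processor) so they cannot take the whole account limit:
reservedConcurrentExecutions: 100, // Guarantees this much capacity and also caps the function at it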
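Reserved concurrency is carved out of the account-wide pool, so every reservation shrinks what is left for functions without one. The sketch below checks a reservation plan against the 1,000 limit from the incident above, assuming Lambda's rule that at least 100 executions must stay unreserved. `planConcurrency` and the function names in the example are hypothetical.

```typescript
// Lambda requires at least this much concurrency to remain unreserved in the account.
const MIN_UNRESERVED = 100;

interface Reservation {
  fn: string;        // function name, for readability only
  reserved: number;  // reservedConcurrentExecutions for that function
}

interface ConcurrencyPlan {
  unreserved: number; // shared pool left for functions without a reservation
  valid: boolean;     // false if the plan dips below MIN_UNRESERVED
}

// Sum the reservations and report what remains of the account limit.
function planConcurrency(accountLimit: number, reservations: Reservation[]): ConcurrencyPlan {
  const totalReserved = reservations.reduce((sum, r) => sum + r.reserved, 0);
  const unreserved = accountLimit - totalReserved;
  return { unreserved, valid: unreserved >= MIN_UNRESERVED };
}

// Example: guarantee 100 for the API and cap the webhook processor at 500.
const examplePlan = planConcurrency(1000, [
  { fn: 'api', reserved: 100 },
  { fn: 'webhooks', reserved: 500 },
]);
```

With this plan the webhook processor can burst to 500 at most, the API always has its 100, and 400 remain for everything else.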
2. The DynamoDB Hot Partition Problem
Sequential IDs for DynamoDB partition keys caused all traffic to hit one partition. Read/write throttling significantly degraded performance.
Fix: Distributed partition keys:
// Bad: Sequential IDs create hot partitions
PK: `USER#${sequentialId}`
// Good: UUID or timestamp + random
PK: `USER#${uuid.v4()}`
// Or: Use current hour + random for time-based access
PK: `USER#${new Date().getHours()}-${Math.random().toString(36)}`
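When a logical key is hot by nature (a single collection key that every write lands on), a common DynamoDB technique is write sharding: add a calculated suffix so writes spread across several partition key values. Here is a minimal sketch, assuming 10 shards and an FNV-1a-style hash over UTF-16 code units. Both are illustrative choices, and `shardedPk` and `allShardPks` are hypothetical helpers.

```typescript
const SHARD_COUNT = 10; // illustrative; size this to your write throughput

// 32-bit FNV-1a-style hash: deterministic and dependency-free, good enough for spreading keys.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// The same id always maps to the same shard, so single-item reads can recompute the key.
function shardedPk(prefix: string, id: string): string {
  return `${prefix}#${fnv1a(id) % SHARD_COUNT}`;
}

// Listing everything means querying every shard and merging the results.
function allShardPks(prefix: string): string[] {
  return Array.from({ length: SHARD_COUNT }, (_, shard) => `${prefix}#${shard}`);
}
```

The trade-off is on the read side: point lookups stay cheap because the shard is derived from the id, but list queries fan out to `SHARD_COUNT` partitions.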
3. The 15-Minute Timeout Discovery
Our Lambda functions were timing out after exactly 15 minutes. I initially suspected a memory leak, then discovered that Lambda has a hard 15-minute maximum execution time. We were processing large batches synchronously in a single invocation.
Fix: Batch processing with pagination:
// Process in smaller chunks
const BATCH_SIZE = 100;
const MAX_EXECUTION_TIME = 14 * 60 * 1000; // 14 minutes
const startTime = Date.now();
for (let i = 0; i < items.length; i += BATCH_SIZE) {
if (Date.now() - startTime > MAX_EXECUTION_TIME) {
// Schedule continuation via SQS
await scheduleRemainingWork(items.slice(i));
break;
}
const batch = items.slice(i, i + BATCH_SIZE);
await processBatch(batch);
}
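A variation on the loop above: instead of a hard-coded 14 minutes, ask the Lambda context how much time is left, which keeps working if the function's timeout setting changes. This is a sketch under assumptions: `TimeBudget` is a hypothetical interface covering only the `getRemainingTimeInMillis()` method of the real context object, and the one-minute safety margin is arbitrary.

```typescript
// Minimal slice of the Lambda context object used here.
interface TimeBudget {
  getRemainingTimeInMillis(): number;
}

const SAFETY_MARGIN_MS = 60000; // stop with a minute to spare for the hand-off

// Processes batches until the time budget runs low; returns the unprocessed remainder.
async function processWithinBudget<T>(
  items: T[],
  batchSize: number,
  context: TimeBudget,
  processBatch: (batch: T[]) => Promise<void>,
): Promise<T[]> {
  for (let i = 0; i < items.length; i += batchSize) {
    if (context.getRemainingTimeInMillis() < SAFETY_MARGIN_MS) {
      return items.slice(i); // caller re-queues these, e.g. via SQS
    }
    await processBatch(items.slice(i, i + batchSize));
  }
  return [];
}
```

The caller passes the handler's `context` argument straight in and forwards any returned remainder to `scheduleRemainingWork`.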
TypeScript Patterns That Saved My Sanity
1. Strict Event Type Definitions
// Custom type definitions for better IntelliSense
interface StrictAPIGatewayEvent extends APIGatewayProxyEvent {
pathParameters: { [key: string]: string }; // Never null in our setup
body: string; // Always present for POST/PUT
}
// Type guards for runtime safety
function isValidItemData(data: any): data is Partial<Item> {
return typeof data === 'object' &&
data !== null &&
(data.name === undefined || typeof data.name === 'string');
}
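The guard above accepts `any`. A stricter variant takes `unknown`, which forces every property check to be explicit, and it pairs naturally with `JSON.parse`. This is a minimal sketch; `ItemInput`, `isItemInput`, and `parseItemInput` are hypothetical names rather than part of the handler shown earlier.

```typescript
// Fields a client may send when creating or updating an item.
interface ItemInput {
  name?: string;
  description?: string;
}

// Type guard over `unknown`: rejects non-objects, arrays, and wrongly typed fields.
function isItemInput(data: unknown): data is ItemInput {
  if (typeof data !== 'object' || data === null || Array.isArray(data)) return false;
  const candidate = data as Record<string, unknown>;
  return (
    (candidate.name === undefined || typeof candidate.name === 'string') &&
    (candidate.description === undefined || typeof candidate.description === 'string')
  );
}

// Parse a request body and validate its shape in one step; null means "reject with 400".
function parseItemInput(body: string | null): ItemInput | null {
  if (!body) return null;
  try {
    const parsed: unknown = JSON.parse(body);
    return isItemInput(parsed) ? parsed : null;
  } catch {
    return null; // malformed JSON
  }
}
```

A handler can call `parseItemInput(event.body)` and return a 400 response when it gets `null`, so no unvalidated `any` leaks into the business logic.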
2. Environment Variable Validation
// Validate environment at startup, not runtime
interface Environment {
TABLE_NAME: string;
LOG_LEVEL: 'debug' | 'info' | 'warn' | 'error';
NODE_ENV: 'development' | 'production';
}
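const LOG_LEVELS = ['debug', 'info', 'warn', 'error'] as const;
const NODE_ENVS = ['development', 'production'] as const;
function validateEnvironment(): Environment {
const env = process.env;
if (!env.TABLE_NAME) {
throw new Error('TABLE_NAME environment variable is required');
}
// Fall back to defaults when unset, but reject values outside the allowed sets
const logLevel = env.LOG_LEVEL || 'info';
const nodeEnv = env.NODE_ENV || 'development';
if (!(LOG_LEVELS as readonly string[]).includes(logLevel)) {
throw new Error(`Invalid LOG_LEVEL: ${logLevel}`);
}
if (!(NODE_ENVS as readonly string[]).includes(nodeEnv)) {
throw new Error(`Invalid NODE_ENV: ${nodeEnv}`);
}
return {
TABLE_NAME: env.TABLE_NAME,
LOG_LEVEL: logLevel as Environment['LOG_LEVEL'],
NODE_ENV: nodeEnv as Environment['NODE_ENV'],
};
}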
// Validate once at module load
const ENV = validateEnvironment();
3. Result Types for Error Handling
// Rust-inspired Result type for clean error handling
type Result<T, E = Error> =
| { success: true; data: T }
| { success: false; error: E };
async function getItem(id: string): Promise<Result<Item, string>> {
try {
const result = await docClient.send(new GetCommand({
TableName: ENV.TABLE_NAME,
Key: { PK: 'ITEM', SK: `ITEM#${id}` },
}));
if (!result.Item) {
return { success: false, error: 'Item not found' };
}
return { success: true, data: transformDynamoItem(result.Item) };
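} catch (error) {
// `error` is `unknown` under strict mode, so narrow before reading .message
return { success: false, error: error instanceof Error ? error.message : String(error) };
}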
}
// Usage
const result = await getItem(id);
if (!result.success) {
return createResponse(404, { error: result.error });
}
// TypeScript knows result.data is Item
const item = result.data;
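One limitation of the usage above is that every failure becomes a 404, including DynamoDB errors caught inside getItem. A possible refinement is to make the error side a discriminated union so the caller can pick a status code per case. This is a sketch only: `Outcome` mirrors the Result type under a different name to keep the example self-contained, and `GetItemError`, `statusFor`, and `describeOutcome` are hypothetical.

```typescript
// Same shape as the Result type above, renamed so this example stands alone.
type Outcome<T, E> = { success: true; data: T } | { success: false; error: E };

// Each failure mode is its own variant, so nothing gets lumped into one string.
type GetItemError =
  | { kind: 'not_found' }
  | { kind: 'unexpected'; message: string };

// The compiler checks that every variant is handled.
function statusFor(error: GetItemError): number {
  switch (error.kind) {
    case 'not_found':
      return 404;
    case 'unexpected':
      return 500;
  }
}

// Turn an outcome into a status code and body without leaking internal messages.
function describeOutcome<T>(outcome: Outcome<T, GetItemError>): { statusCode: number; body: string } {
  if (outcome.success) {
    return { statusCode: 200, body: JSON.stringify({ item: outcome.data }) };
  }
  const publicMessage = outcome.error.kind === 'not_found' ? 'Item not found' : 'Internal server error';
  return { statusCode: statusFor(outcome.error), body: JSON.stringify({ error: publicMessage }) };
}
```

The `unexpected` variant keeps the original message for logging, while the response body stays generic.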
Performance Insights from Production Data
After 18 months in production with detailed monitoring:
Cold Start Analysis
- Average cold start: 850ms
- P95 cold start: 1,200ms
- Bundle size impact: 10MB bundle = +400ms cold start
- Memory impact: 1024MB vs 512MB = -200ms cold start
Cost Breakdown (Monthly)
- Lambda execution: $89/month (8M invocations)
- API Gateway: $28/month (8M requests)
- DynamoDB: $67/month (pay-per-request)
- CloudWatch logs: $12/month
- Total: $196/month (vs. $800/month for EC2 equivalent)
Reliability Metrics
- Uptime: 99.97% (vs. 99.9% on EC2)
- Error rate: 0.02% (mostly client errors)
- P95 response time: 180ms
When NOT to Use Serverless
Serverless isn’t always the answer. Here’s when I stick with containers:
- Long-running processes - Video encoding, large batch jobs
- WebSocket-heavy apps - Real-time gaming, chat apps
- Legacy applications - Complex deployment requirements
- Stateful workloads - In-memory caches, sessions
- Cold start sensitive - Sub-100ms response requirements
The Deployment Pipeline That Doesn’t Break
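// CDK pipeline for zero-downtime deployments
import { Stack } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { CodePipeline, CodePipelineSource, ShellStep, ManualApprovalStep } from 'aws-cdk-lib/pipelines';
// ServerlessStage (not shown) is a Stage subclass that deploys the stack and exposes `apiUrl` as a CfnOutput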
export class ServerlessPipeline extends Stack {
constructor(scope: Construct, id: string) {
super(scope, id);
const pipeline = new CodePipeline(this, 'Pipeline', {
synth: new ShellStep('Synth', {
input: CodePipelineSource.gitHub('yourorg/repo', 'main'),
commands: [
'npm ci',
'npm run build',
'npm run test',
'npx cdk synth',
],
}),
});
// Staged deployments: test first, then prod behind a manual approval
const testStage = new ServerlessStage(this, 'Test', {
stageName: 'test',
});
const prodStage = new ServerlessStage(this, 'Prod', {
stageName: 'prod',
});
pipeline.addStage(testStage, {
post: [
new ShellStep('IntegrationTests', {
commands: [
'npm run test:integration',
],
envFromCfnOutput: {
API_URL: testStage.apiUrl,
},
}),
],
});
pipeline.addStage(prodStage, {
pre: [
new ManualApprovalStep('PromoteToProd'),
],
post: [
new ShellStep('SmokeTests', {
commands: [
'npm run test:smoke',
],
}),
],
});
}
}
Final Thoughts
Serverless with TypeScript transformed how our team ships features. We went from weekly deployments to daily deployments. Our AWS costs decreased significantly. Our uptime improved to 99.97%.
The biggest benefit? Reduced operational overhead. Fewer emergency calls about server crashes, minimal capacity planning, and no operating system patching.
The serverless learning curve is steep, but the productivity gains are measurable. Start small, implement comprehensive monitoring from day one, and expect to make mistakes during the learning process.
Ready to dive in? Start with a simple CRUD API, add proper monitoring from day one, and build incrementally as you learn the platform’s characteristics.