2026-03-01

Building a Scalable GitHub Actions Platform for a Large-Scale Microservices Architecture

A practical guide to building an org-level shared GitHub Actions platform covering architecture decisions, security governance, adoption strategy, and the 7 biggest mistakes we made along the way.

Abstract

When CI/CD pipelines grow organically across dozens of repositories, you end up with duplicated YAML, inconsistent security practices, and a constant stream of support requests. This post documents how we built an organization-level shared GitHub Actions platform for a large e-commerce platform running around 20 microservices across multiple teams. We cover the architecture decisions, security governance model, adoption strategy, and the concrete metrics that resulted: build times dropping from ~45 minutes to ~12 minutes, a 70% reduction in CI-related support tickets, and 85%+ adoption within six months. We also share the 7 biggest mistakes we made, because those taught us more than the things that went right.

Introduction

GitHub Actions is deceptively simple to get started with. Copy a YAML file, add a few steps, and you have a working pipeline. The problem is that this simplicity does not scale. Once you have dozens of repositories for different microservices, each maintained by different teams with their own flavor of build, test, and deploy workflows, you end up with a maintenance burden that quietly consumes engineering capacity.

We found ourselves in exactly this situation: hundreds of workflow files with subtle differences, inconsistent security practices, build times that varied wildly, and a growing backlog of CI-related support tickets. The platform engineering team was spending more time answering “how do I do X in GitHub Actions?” questions than building platform capabilities.

This post documents how we designed, built, and rolled out an org-level shared actions platform. The goal is not to prescribe a single correct approach but to share what worked, what did not, and the trade-offs behind every decision.

Why We Needed an Org-Level Shared Actions Platform

The symptoms were clear before we even started measuring:

  • Duplication everywhere: The average workflow file was ~500 lines of YAML, with roughly 80% of it identical across repositories. Teams copy-pasted from each other and diverged over time.
  • Inconsistent security posture: Some repos pinned action versions by SHA, others used @latest. Some configured OIDC for AWS, others still used long-lived access keys stored as secrets.
  • Slow builds: Average build time was around 45 minutes. Teams had added steps over time without considering caching, parallelism, or runner selection.
  • Support burden: The platform team received roughly 30 CI-related tickets per week, mostly about configuration, debugging failures, and “works on my machine” issues.
  • Onboarding friction: New projects took days to set up CI/CD because there was no standard template, and the tribal knowledge lived in Slack threads.

We needed a platform that would give teams a “golden path” for CI/CD while preserving the flexibility to customize when necessary.

Note: A “golden path” is a well-supported, opinionated default. Teams can deviate, but the supported path should cover 80%+ of use cases with minimal configuration.

Architecture Decisions & Trade-offs

Every architecture decision involved trade-offs. Here is how we evaluated the major ones.

Composite Actions vs. Reusable Workflows vs. Workflow Templates

This was the first and most consequential decision. GitHub Actions offers three mechanisms for sharing CI/CD logic, and they serve different purposes:

| Feature | Composite Actions | Reusable Workflows | Workflow Templates |
| --- | --- | --- | --- |
| Abstraction level | Single step or group of steps | Entire job or workflow | Starting point for new repos |
| Inputs/Outputs | Full support | Full support | Manual copy, then customize |
| Secrets access | No secrets context; the caller passes values as inputs | Explicit secrets: inherit or named | N/A (copied into repo) |
| Nesting | Can call other composites | Can call composites; up to 10 levels deep, 50 total calls | N/A |
| Versioning | Git tags / SHA | Git tags / SHA | Snapshot at copy time |
| Drift prevention | Centrally updated | Centrally updated | None after copy |
| Visibility into steps | Collapsed in UI | Separate job in UI | Full visibility |

Our decision: We use all three, each for its purpose:

  • Composite actions for reusable building blocks (setup Node.js with caching, run linting, build Docker images)
  • Reusable workflows for standardized pipelines (build-test-deploy for a Node.js service, deploy-to-ECS)
  • Workflow templates for bootstrapping new repositories with a sensible starting configuration

The key insight: composite actions compose well. We build reusable workflows from composite actions, so the workflow itself is thin orchestration logic while the actions contain the implementation.

Monorepo vs. Multi-Repo for Shared Actions

| Aspect | Monorepo | Multi-Repo |
| --- | --- | --- |
| Discoverability | All actions in one place | Scattered across repos |
| Cross-cutting changes | Single PR updates everything | Multiple PRs across repos |
| Versioning | Shared release cycle | Independent versions |
| CODEOWNERS | Single file, path-based rules | Per-repo configuration |
| CI for actions | Test everything together | Independent test pipelines |
| Blast radius | A bad release affects all actions | Isolated failures |

Our decision: Monorepo. The discoverability and cross-cutting change benefits outweigh the blast radius concern, especially when combined with strict branch protection and automated testing. We mitigate the blast radius by releasing individual actions with independent semver tags.

Repository Structure

shared-actions/
├── actions/
│   ├── setup-node/
│   │   ├── action.yml
│   │   └── README.md
│   ├── docker-build/
│   │   ├── action.yml
│   │   └── README.md
│   ├── deploy-ecs/
│   │   ├── action.yml
│   │   └── README.md
│   └── security-scan/
│       ├── action.yml
│       └── README.md
├── tests/
│   ├── setup-node.test.yml
│   └── docker-build.test.yml
├── .github/
│   ├── CODEOWNERS
│   └── workflows/
│       ├── node-service.yml        # reusable workflow (must live in .github/workflows)
│       ├── python-service.yml      # reusable workflow
│       ├── deploy-production.yml   # reusable workflow
│       ├── test-actions.yml
│       └── release.yml
└── docs/
    ├── CONTRIBUTING.md
    └── MIGRATION.md

Versioning Strategy

Versioning is where security and developer experience collide. We use a layered approach:

  • SHA pinning (abc1234...): most secure, immutable
  • Semver tag (v2.1.3): balanced, predictable
  • Major tag (v2): most convenient, auto-updates
  • @main: living edge

Our policy:

  • External third-party actions: SHA pinning required. No exceptions. Dependabot handles update PRs.
  • Our own shared actions: Semver tags for production, major tags for development environments.
  • Never @main: Even for internal actions, referencing a branch directly is not permitted in production workflows.

Warning: Referencing a third-party action through a mutable ref, whether a branch such as @main or a tag that the maintainer can move, is a supply chain attack vector. A compromised upstream repository can inject malicious code into every workflow that references it. Always pin by SHA for external actions.
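In a workflow file, the policy above comes down to three reference styles. The fragment below is an illustrative sketch: the commit SHA is a placeholder rather than a real revision, and v2.1.3 simply reuses the example version from this section.

```yaml
# Illustrative job steps showing the reference styles the policy allows.
# <full-commit-sha> is a placeholder for a 40-character commit SHA, not a real value.
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Third-party action: pinned to a commit SHA, with the released version kept as a comment
      - uses: actions/checkout@<full-commit-sha> # v4

      # Internal shared action in a production workflow: exact semver tag
      - uses: our-org/shared-actions/actions/setup-node@v2.1.3

      # Internal shared action in a development workflow: floating major tag
      # - uses: our-org/shared-actions/actions/setup-node@v2
```

Keeping the version in a trailing comment means a reviewer can still tell at a glance which release a pinned SHA corresponds to.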

Self-Hosted vs. GitHub-Hosted Runners

| Dimension | GitHub-Hosted | Self-Hosted |
| --- | --- | --- |
| Maintenance | Zero | Patching, scaling, monitoring |
| Cost at scale | Per-minute billing adds up | Fixed infra cost, better at high volume |
| Security | Ephemeral, clean environment | Persistent unless you manage cleanup |
| Network access | Public internet only | VPC access, private registries |
| Customization | Limited to available images | Full control over tooling |
| Startup time | ~20-40s (warm) | ~5-10s (pre-warmed) |
| GPU/Specialized | Limited options | Full control |

Our decision: Hybrid. GitHub-hosted larger runners for most workloads, self-hosted runners in our VPC for jobs that need private network access (integration tests against staging databases, deployments to private ECS clusters). We use ephemeral self-hosted runners on ECS Fargate to avoid the stale-environment problem.
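In practice the hybrid model is a per-job choice of runs-on. The sketch below assumes a hypothetical vpc-staging runner label and a hypothetical test:integration npm script; self-hosted runners are matched by whatever labels they were registered with.

```yaml
# Sketch of the hybrid model: the runner is chosen per job.
# The "vpc-staging" label and the test:integration script are illustrative assumptions.
jobs:
  unit-tests:
    # No private network access needed, so a GitHub-hosted runner is enough
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:unit

  integration-tests:
    needs: unit-tests
    # Needs to reach the staging database inside the VPC
    runs-on: [self-hosted, linux, vpc-staging]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:integration
```

A job only lands on a self-hosted runner that carries every label in the list, which keeps VPC-connected capacity reserved for the jobs that need it.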

Implementation Deep Dive

Composite Action Example: Node.js Setup with Caching

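This action replaces roughly 30 lines of duplicated YAML across repositories with a single step. The listings in this post reference external actions by version tag for readability; under the policy above, real workflows pin them by SHA: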

# actions/setup-node/action.yml
name: "Setup Node.js with Caching"
description: "Sets up Node.js, restores npm cache, and installs dependencies"
inputs:
  node-version:
    description: "Node.js version to use"
    required: false
    default: "20"
  working-directory:
    description: "Directory containing package.json"
    required: false
    default: "."

runs:
  using: "composite"
  steps:
    - name: Setup Node.js
      uses: actions/setup-node@v4
      with:
        node-version: ${{ inputs.node-version }}

    - name: Cache npm dependencies
      uses: actions/cache@v4
      id: npm-cache
      with:
        path: ~/.npm
        key: npm-${{ runner.os }}-${{ hashFiles(format('{0}/package-lock.json', inputs.working-directory)) }}
        restore-keys: |
          npm-${{ runner.os }}-

    - name: Install dependencies
      shell: bash
      working-directory: ${{ inputs.working-directory }}
      run: npm ci

Reusable Workflow: Node.js Service Pipeline

This is the “golden path” workflow for Node.js services. It composes multiple shared actions and reduces per-repo pipeline YAML from ~500 lines to ~50:

# .github/workflows/node-service.yml
name: Node.js Service Pipeline

on:
  workflow_call:
    inputs:
      node-version:
        type: string
        default: "20"
      deploy-environment:
        type: string
        required: true
      aws-region:
        type: string
        default: "eu-central-1"
      run-e2e:
        type: boolean
        default: false
    secrets:
      AWS_ROLE_ARN:
        required: true

permissions:
  id-token: write
  contents: read

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: our-org/shared-actions/actions/setup-node@v2
        with:
          node-version: ${{ inputs.node-version }}

      - name: Lint
        run: npm run lint

      - name: Unit tests
        run: npm run test:unit -- --coverage

      - name: Build
        run: npm run build

      - uses: our-org/shared-actions/actions/security-scan@v2

  deploy:
    needs: build-and-test
    runs-on: ubuntu-latest
    environment: ${{ inputs.deploy-environment }}
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ inputs.aws-region }}

      - uses: our-org/shared-actions/actions/deploy-ecs@v2
        with:
          environment: ${{ inputs.deploy-environment }}

What a Consumer Repository Looks Like

This is the entire CI/CD configuration for a typical Node.js service. Compare this to the 500-line files we started with:

# .github/workflows/ci.yml (in consumer repo)
name: CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
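  pipeline:
    # A called workflow cannot elevate permissions, so the caller grants what it requests
    permissions:
      id-token: write
      contents: read
    uses: our-org/shared-actions/.github/workflows/node-service.yml@v2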
    with:
      node-version: "20"
      deploy-environment: ${{ github.ref == 'refs/heads/main' && 'production' || 'staging' }}
      run-e2e: ${{ github.ref == 'refs/heads/main' }}
    secrets:
      AWS_ROLE_ARN: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}

That is roughly 20 lines of YAML. The team gets build caching, security scanning, OIDC-based AWS authentication, and a standardized deploy process without configuring any of it.

Automated Release Pipeline

We use a release workflow in the shared-actions monorepo that creates semver tags for individual actions when changes are merged to main:

# .github/workflows/release.yml
name: Release Actions

on:
  push:
    branches: [main]
    paths:
      - "actions/**"
      - ".github/workflows/**"

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      changed-actions: ${{ steps.changes.outputs.actions }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

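      - id: changes
        run: |
          # List the action directories touched by the last commit.
          # jq -c emits a single-line JSON array; a multi-line value would break the key=value output format.
          changed=$(git diff --name-only HEAD~1 HEAD | { grep '^actions/' || true; } | cut -d'/' -f2 | sort -u | jq -R . | jq -sc .)
          echo "actions=$changed" >> "$GITHUB_OUTPUT"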

  release:
    needs: detect-changes
    if: needs.detect-changes.outputs.changed-actions != '[]'
    runs-on: ubuntu-latest
    permissions:
      contents: write  # Required to push tags
    strategy:
      matrix:
        action: ${{ fromJson(needs.detect-changes.outputs.changed-actions) }}
    steps:
      - uses: actions/checkout@v4

      - name: Determine version bump
        id: version
        run: |
          # Placeholder: derive the version from action.yml metadata or conventional commits
          echo "version=v2.1.3" >> "$GITHUB_OUTPUT"

      - name: Create release tag
        env:
          # Passed through env so the values are not interpolated into the script text
          TAG: ${{ matrix.action }}/${{ steps.version.outputs.version }}
        run: |
          git tag "$TAG"
          git push origin "$TAG"

Security & Governance Layer

Security at scale is not optional. We enforce it through multiple layers so that individual teams do not need to think about it.

OIDC for AWS Authentication

Long-lived AWS credentials stored as GitHub secrets are a liability. We replaced all of them with OIDC federation, scoped to specific repositories and environments:

# IAM trust policy (JSON document, attached to the role via Terraform)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:our-org/service-*:environment:production"
        }
      }
    }
  ]
}

The sub claim condition is critical. It restricts which repositories and environments can assume the role. A repository named our-org/random-fork cannot assume production roles, even if it somehow obtained the workflow configuration.
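The sub claim is derived from the job itself, which is why the environment key matters. The fragment below is a sketch under assumptions: the repository name our-org/service-orders and the secret name are illustrative, and the region reuses the default from the reusable workflow.

```yaml
# Sketch: how a job's configuration shapes the OIDC sub claim.
# The repository name our-org/service-orders and the secret name are illustrative assumptions.
jobs:
  deploy:
    runs-on: ubuntu-latest
    # With an environment set, GitHub issues
    #   sub = repo:our-org/service-orders:environment:production
    # which matches the StringLike condition in the trust policy.
    # Without it, a push to main would produce
    #   sub = repo:our-org/service-orders:ref:refs/heads/main
    # and AssumeRoleWithWebIdentity would be denied.
    environment: production
    permissions:
      id-token: write   # Allows the job to request an OIDC token
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: eu-central-1

      # Prints the assumed identity, a quick way to confirm the federation works
      - run: aws sts get-caller-identity
```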

Supply Chain Security

We implemented a multi-layer supply chain security strategy:

  • Prevention layer: SHA pinning (all external actions), minimal permissions (permissions: {}), OIDC authentication (no stored credentials)
  • Detection layer: Dependabot (action update PRs), OpenSSF Scorecard (supply chain health), StepSecurity Harden Runner (runtime monitoring)
  • Enforcement layer: CODEOWNERS (required reviews), branch protection (no direct pushes), repository rulesets (security scans on all repos)

Key configurations:

  • StepSecurity Harden Runner: Monitors outbound network calls during workflow execution. If a compromised action tries to exfiltrate secrets to an unknown endpoint, it gets flagged.
  • Dependabot for actions: Automatically creates PRs when pinned action SHAs have newer versions. This keeps us up to date without sacrificing security.
  • OpenSSF Scorecard: Runs weekly on the shared-actions repo to surface security weaknesses in our own practices.
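As a rough illustration of the first two items, the fragment below shows a minimal Dependabot configuration for action references and a job that starts with Harden Runner in audit mode. The weekly interval and the grouping are assumptions, not necessarily the settings we run.

```yaml
# .github/dependabot.yml
# Opens grouped weekly PRs for outdated action references in workflow files.
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
    groups:
      actions:
        patterns:
          - "*"
---
# Excerpt from a workflow job: Harden Runner goes first so it observes every later step.
# Version tags are shown for readability; the policy above would pin them to commit SHAs.
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: step-security/harden-runner@v2
        with:
          egress-policy: audit   # Log outbound calls without blocking them
      - uses: actions/checkout@v4
```

Starting in audit mode gives a baseline of expected endpoints before any blocking policy is considered.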

CODEOWNERS and Branch Protection

The shared-actions repository has strict governance:

# .github/CODEOWNERS
# Platform team owns everything by default
* @our-org/platform-engineering

# Security team must review security-related actions
actions/security-scan/ @our-org/security-team @our-org/platform-engineering
.github/workflows/deploy-*.yml @our-org/security-team @our-org/platform-engineering

# Individual teams own their contributed actions
actions/mobile-build/ @our-org/mobile-team @our-org/platform-engineering

Branch protection rules:

  • Require 2 approving reviews (at least 1 from platform team)
  • Require status checks to pass (all action tests must succeed)
  • Require signed commits
  • No force pushes, no deletion of main
  • Dismiss stale reviews on new pushes

Minimal Token Permissions

Every workflow starts with the most restrictive permissions and explicitly opts in to what it needs:

# Default: no permissions
permissions: {}

# Then grant only what's needed per job
jobs:
  deploy:
    permissions:
      id-token: write  # For OIDC
      contents: read  # For checkout

Tip: Set permissions: {} at the workflow level and then grant only what each job needs. This follows the principle of least privilege and makes the security posture auditable at a glance.

Adoption & Measurement Strategy

Building a platform is the easy part. Getting multiple teams across a large microservices organization to use it is the real challenge.

Inner Source Contribution Model

We explicitly chose an inner source model over a top-down mandate. The platform team maintains core actions, but any engineer can contribute:

Propose (open RFC issue) → Develop (fork & implement) → Review (platform team + CODEOWNERS) → Release (automated semver tag) → Adopt (teams consume new action)

The contribution process:

  1. RFC issue: Describe the problem and proposed action. Platform team provides feedback on scope, naming, and existing overlap.
  2. Implementation: Contributor opens a PR with the action, tests, and documentation.
  3. Review: Platform team reviews for consistency, security, and composability. CODEOWNERS ensures the right people review.
  4. Release: Merged PRs trigger automated releases with proper semver tags.
  5. Announcement: New actions are announced in the engineering Slack channel with a usage example.

This model was critical for adoption. When the mobile team contributed a mobile-build action, their peers adopted it far more readily than if the platform team had built it.

Migration Playbook

We created a structured migration guide. The key was not forcing teams to migrate everything at once:

  1. Phase 1: Replace credential management with OIDC (security win; the only workflow change is the credentials step plus an id-token: write permission)
  2. Phase 2: Adopt setup-node or setup-python composite actions (easy swap, immediate caching benefits)
  3. Phase 3: Move to reusable workflows for standard service pipelines
  4. Phase 4: Adopt repository rulesets for security scanning

Each phase was independently valuable, which meant teams could migrate incrementally.

DORA Metrics Dashboard

We track the core DORA metrics plus platform-specific KPIs:

| Metric | Before Platform | After Platform | Change |
| --- | --- | --- | --- |
| Deployment Frequency | ~2 per week per team | ~8 per week per team | +300% |
| Lead Time for Changes | ~4 days | ~1.5 days | -62% |
| Change Failure Rate | ~18% | ~8% | -56% |
| Failed Deployment Recovery Time | ~3 hours | ~45 minutes | -75% |
| Avg Build Time | ~45 minutes | ~12 minutes | -73% |
| CI Support Tickets/Week | ~30 | ~9 | -70% |
| Pipeline YAML per Repo | ~500 lines | ~50 lines | -90% |

Note: These improvements did not come solely from the shared actions platform. Caching, runner optimization, and parallelism contributed significantly. The platform made it easy to adopt all these optimizations consistently.

Lessons Learned & The 7 Biggest Mistakes

These are the mistakes that cost us the most time. Each one is something we would do differently if starting over.

Mistake 1: Building Too Much Before Getting Feedback

We spent weeks building a comprehensive set of shared actions before any team used them. When we finally shipped, the abstractions did not match how teams actually structured their projects. We had to rewrite several actions after real usage revealed incorrect assumptions.

What works instead: Ship the smallest useful action first. We should have started with setup-node alone, gotten 5 teams using it, and then expanded.

Mistake 2: Overly Abstract Reusable Workflows

Our first reusable workflows tried to handle every possible configuration through inputs. The node-service.yml workflow had 23 inputs. Teams found it harder to understand than writing their own YAML.

What works instead: Fewer inputs, more opinionated defaults. Our current workflows have 4-6 inputs. If a team needs significantly different behavior, they compose from our actions rather than parameterizing the workflow.

Mistake 3: Ignoring Workflow Debugging Experience

When a reusable workflow fails, the error appears in the calling workflow’s logs, but the actual steps are in the reusable workflow’s definition. This confused teams during debugging, especially when they could not see the intermediate steps clearly.

What works instead: Add verbose logging to composite actions with clear step names. Use ::group:: and ::endgroup:: log commands to create collapsible sections. Include the shared action version in the log output so debugging can identify exactly which version is running.
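A minimal sketch of what this can look like inside a composite action follows. It is not our exact implementation; it assumes github.action_ref is populated for the action (GitHub's documentation advises reading it through env in composite actions), which may not hold for every way an action can be referenced.

```yaml
# Excerpt from a composite action's action.yml (sketch, not the exact implementation)
runs:
  using: "composite"
  steps:
    - name: Install dependencies (shared-actions/setup-node)
      shell: bash
      working-directory: ${{ inputs.working-directory }}
      env:
        # Read the action ref through env rather than inline in the run script
        ACTION_REF: ${{ github.action_ref }}
      run: |
        echo "setup-node version: ${ACTION_REF}"
        echo "::group::npm ci"
        npm ci
        echo "::endgroup::"
```

If npm ci fails, the step stops before the endgroup command runs; the log is still readable because the failing output is the last thing in the open group.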

Mistake 4: No Breaking Change Policy

We shipped a v2 of setup-node that changed the caching strategy without realizing it would break repositories with non-standard node_modules locations. This caused failures across 15 repos simultaneously.

What works instead: Semantic versioning with a documented breaking change policy. Major version bumps require a migration guide and a two-week deprecation notice. We now run an automated compatibility check that tests new action versions against a sample of consumer repositories before releasing.
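The compatibility check can be approximated with a matrix job that runs the candidate action against real consumer code. The sketch below is hedged throughout: the repository names and the CONSUMER_READ_TOKEN secret are hypothetical, and a real check would likely run more than a build.

```yaml
# Sketch of a pre-release compatibility check; repository and secret names are hypothetical.
name: Compatibility Check

on:
  pull_request:
    paths:
      - "actions/setup-node/**"

permissions:
  contents: read

jobs:
  consumer-sample:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false   # Report every incompatible consumer, not just the first
      matrix:
        repo: [our-org/service-orders, our-org/service-catalog]
    steps:
      # The candidate version of the shared actions (this PR)
      - uses: actions/checkout@v4
        with:
          path: shared-actions

      # A sample consumer repository to test against
      - uses: actions/checkout@v4
        with:
          repository: ${{ matrix.repo }}
          path: consumer
          token: ${{ secrets.CONSUMER_READ_TOKEN }}

      # Run the action from the local checkout instead of a released tag
      - uses: ./shared-actions/actions/setup-node
        with:
          working-directory: consumer

      - run: npm run build
        working-directory: consumer
```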

Mistake 5: Underestimating Runner Costs

We initially defaulted all jobs to ubuntu-latest-16core runners for speed. The GitHub Actions bill grew much faster than anticipated. Not every job benefits from larger runners; dependency installation is often network-bound, not CPU-bound.

What works instead: Default to standard runners and opt in to larger runners per-job with documented justification. We profile new actions to determine whether larger runners actually improve build times before recommending them.
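One way to encode that default is an input on the reusable workflow. This is a sketch, and the input name runner is an assumption rather than the name used in our workflows.

```yaml
# Sketch of a reusable workflow that defaults to a standard runner.
# The input name "runner" is an assumption.
on:
  workflow_call:
    inputs:
      runner:
        description: "Runner label; override only with a documented justification"
        type: string
        default: "ubuntu-latest"

jobs:
  build:
    runs-on: ${{ inputs.runner }}
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run build
```

A team that has profiled its build and seen a real gain would pass runner: ubuntu-latest-16core from the calling workflow, which keeps the more expensive choice visible in that repository's own configuration.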

Mistake 6: Making Security Annoying Instead of Invisible

Our first security scanning implementation added 8 minutes to every pipeline and produced noisy reports with false positives. Teams started adding if: false conditions to skip the security steps, which defeated the entire purpose.

What works instead: Security scanning should be fast and have low false-positive rates. We moved to incremental scanning (only scan changed files on PRs, full scan on main), tuned the rulesets to eliminate persistent false positives, and got scanning time under 90 seconds. Adoption went from ~40% to 95% once it stopped being a bottleneck.
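The incremental approach can be sketched as a step that selects scan targets by event type. The run-scanner wrapper and its --files-from flag are hypothetical stand-ins for whatever scanner CLI is in use.

```yaml
# Sketch: choose the scan scope by event. "run-scanner" is a hypothetical wrapper script.
jobs:
  security-scan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # Full history so the diff against the base branch resolves

      - name: Collect scan targets
        env:
          EVENT_NAME: ${{ github.event_name }}
          BASE_REF: ${{ github.base_ref }}
        run: |
          if [ "$EVENT_NAME" = "pull_request" ]; then
            # Files added or modified relative to the merge base; deleted files are skipped
            git diff --name-only --diff-filter=d "origin/${BASE_REF}...HEAD" > scan-targets.txt
          else
            # Pushes to main scan every tracked file
            git ls-files > scan-targets.txt
          fi
          echo "Scanning $(wc -l < scan-targets.txt) files"

      - name: Run scanner
        run: ./scripts/run-scanner --files-from scan-targets.txt
```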

Mistake 7: No Deprecation Path for Old Patterns

When we released the shared platform, we did not have a plan for removing the old workflow files from repositories. Some repos ran both old and new pipelines for months, wasting compute and creating confusion about which results to trust.

What works instead: Create a migration CLI tool that can detect old patterns, generate migration PRs, and track migration progress across the organization. We built a simple script that opens automated PRs to remove deprecated workflow files once the new pipeline is confirmed working.
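The detection half of such a tool can start as a scheduled workflow built on the gh CLI. The sketch below only reports; the legacy file name, the ORG_READ_TOKEN secret, and the 200-repository limit are assumptions for illustration.

```yaml
# Sketch of a weekly migration report; file and secret names are hypothetical.
name: Migration Report

on:
  schedule:
    - cron: "0 6 * * 1"
  workflow_dispatch:

permissions: {}

jobs:
  report:
    runs-on: ubuntu-latest
    env:
      GH_TOKEN: ${{ secrets.ORG_READ_TOKEN }}
      LEGACY_FILE: .github/workflows/legacy-ci.yml
    steps:
      - name: List repositories that still carry the legacy workflow
        run: |
          gh repo list our-org --limit 200 --json name --jq '.[].name' |
          while read -r repo; do
            # The contents API returns 404 (non-zero exit) when the file is absent
            if gh api "repos/our-org/${repo}/contents/${LEGACY_FILE}" --silent < /dev/null 2> /dev/null; then
              echo "legacy workflow still present: ${repo}"
            fi
          done
```

Opening the cleanup PRs automatically is a natural next step once the report has proven accurate for a few weeks.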

Results, Metrics & Future Roadmap

Quantified Outcomes

After six months of incremental rollout:

  • 85% adoption rate: 34 of 40 repositories migrated to shared actions. The remaining 6 have legitimate reasons for custom pipelines (specialized hardware, non-standard build systems).
  • Build time reduction: Average dropped from ~45 minutes to ~12 minutes, primarily through standardized caching, parallelized test execution, and right-sized runners.
  • 70% reduction in CI support tickets: From ~30 to ~9 per week. The remaining tickets are mostly about genuinely novel requirements rather than “how do I configure caching.”
  • Pipeline YAML reduction: From ~500 lines per repository to ~50 lines. This is the metric teams feel most directly because it reduces their cognitive load.
  • Security posture: 100% of active repositories use OIDC for AWS authentication. Zero long-lived AWS credentials in GitHub secrets.

Architecture Overview

  • Consumer repositories (40+): ~50 lines of YAML each, referencing the shared workflows via workflow_call
  • shared-actions monorepo, composite actions: setup-node, docker-build, security-scan, deploy-ecs
  • shared-actions monorepo, reusable workflows: node-service, python-service, deploy-production
  • shared-actions monorepo, action tests: automated validation
  • Infrastructure: hybrid runners (GitHub-hosted + self-hosted), AWS OIDC (least-privilege roles), metrics dashboard (DORA + platform KPIs)

Future Roadmap

We are investing in three areas:

  1. Dynamic pipeline generation: Instead of static YAML, generate workflow configurations based on repository metadata (language, deployment target, compliance requirements). This could further reduce per-repo configuration to near-zero.
  2. Ephemeral environment per PR: Using the shared deploy action to spin up a preview environment for every pull request, with automatic cleanup after merge.
  3. Cost attribution: Tagging GitHub Actions minutes by team, service, and workflow type to give engineering managers visibility into their CI/CD spend and help identify optimization opportunities.

Starting This Journey

For teams considering a similar effort, here is the sequence that worked for us:

  1. Start with one high-value action (caching or security scanning) and get 3-5 teams using it.
  2. Measure before and after: build times, support tickets, adoption rate. Numbers drive organizational buy-in.
  3. Invest in the contribution model early. If only the platform team can modify shared actions, you have created a bottleneck.
  4. Security should be invisible, not an obstacle. If teams work around your security controls, the controls are failing.
  5. Plan for deprecation from day one. Every v1 will eventually become a v2, and you need a path to get there.

The shared actions platform has been one of the highest-leverage investments our platform engineering team has made. The upfront effort was significant, but the compounding returns in developer productivity, security consistency, and operational reliability have more than justified it.
