From Nitro to Junction: Testing in Production at Scale
In 2012, I faced an intimidating challenge: rebuild EC2's core virtualization stack—without disrupting the world's largest cloud. By 2016, my team had written hundreds of thousands of lines of C code running on half a million servers. The experience taught me something crucial: even the most rigorous pre-production testing must be paired with validation using real production traffic. That idea—treating production as part of the test loop, with the right safeguards—became the foundation of AWS Nitro’s reliability.
But since leaving AWS, I’ve found it frustrating how inaccessible this approach remains, especially on Kubernetes. That’s a large part of why I was excited to co-found Junction: to deliver the network capabilities that bring production-first testing to any team, without the cost and complexity of sidecar-based service meshes.
In this post, I’ll explain what Nitro is, why traditional testing wasn't sufficient, and how our four-stage deployment pipeline enabled safe production validation. I’ll also show how Junction offers these capabilities to modern engineering teams—without the operational complexity.
What is AWS Nitro?

AWS Nitro was a fundamental redesign of EC2's virtualization model. Before Nitro, each host ran its virtualization software on its main CPUs—wasting compute and memory, creating noisy-neighbor problems, and introducing security risks. We fixed this by offloading those responsibilities to a dedicated hardware card with its own CPU. That meant rewriting all the hypervisor software to be optimized for the much more constrained (but also much more accelerated) specialized hardware.
Nitro transformed EC2 by allowing it to match the performance and isolation of bare metal hardware. But to enable that, we were writing hundreds of thousands of lines of low-level systems code powering storage and networking for millions of EC2 instances, spread over hundreds of thousands of physical hosts and multiple hardware generations. Deploying changes carried enormous risk, yet we needed to maintain a monthly release cadence to support new features.
The Limits of Integration Testing
We invested heavily in testing—correctness tests, performance tests, fuzz tests. But they fell short for three key reasons:
Coverage: For systems like Nitro, the gap between integration testing and production was simply too wide. We were deploying low-level code on the hot path for storage and networking across millions of instances. It was entirely possible to pass every CI check—unit tests, integration tests, fuzzers—and still ship a change that silently broke disk I/O under a specific OS version and workload pattern (I'm looking at you, O_DIRECT). No test environment could reflect the full combinatorial reality of EC2’s scale.
Flakiness: EC2 started with a comprehensive integration suite that validated every aspect of the hypervisor. This worked beautifully when EC2 was 10 people in Cape Town. But as the organization grew, and test count scaled with engineer count, these tests became constant friction points. A test might fail 0.1% of the time—but when that failure blocked someone on another team, 12 time zones away, working on unrelated code, the motivation to troubleshoot it plummeted.
The rewrite trap: This was the hardest to accept. I watched three separate integration test platform rewrites during my EC2 tenure, including one in my own team. Each promised cleaner orchestration, faster debugging, better flake handling. But as more teams joined, the same problems emerged. When someone's code breaks someone else's test in non-obvious ways, diagnosis becomes burdensome and ownership unclear. Small teams easily establish "you break it, you fix it" norms, but that doesn't scale to organizations with hundreds of engineers. EC2 eventually moved to team-specific integration testing rather than running the entire org's test suite against every change. This wasn't a technical limitation—it was social. But understanding it meant accepting that testing will have blind spots.
The Real Cost of Testing Blind Spots
In 2015, we pushed what seemed like a minor update to Nitro's network virtualization layer that sailed through integration testing. Within hours in production, monitoring flagged crashes in the networking stack on certain hosts, each crash causing a brief network outage while the stack restarted.
The issue affected just one instance type, only under high-throughput connection loads, and only when a specific race condition occurred—less than 0.0001% of packets triggered the crash. However, at our scale, had we deployed fleet-wide simultaneously, thousands of customers running mission-critical workloads would have been impacted.
Instead, our canary deployment caught it early, limiting the blast radius to a handful of instances in one availability zone. The fix took nearly three weeks to develop, as the bug only appeared under timing conditions impossible to recreate in testing.
This hammered home a crucial point: the next critical bug wouldn't match any pattern we'd seen before. No pre-production test suite could anticipate every failure mode in a system this complex. We needed a deployment pipeline that acknowledged testing limitations and embraced safe production validation.
The Nitro Deployment Pipeline
Clare Liguori has an excellent post on how AWS handles safe deployments, and I don’t have anything to add at the conceptual level. But here I want to go into more detail on what this actually looked like for Nitro’s 2015 deployments:

1. Synthetic Monitors
First, we deployed changes to specialized hosts running continuous synthetic workloads—servers dedicated to simulating customer traffic patterns. These were critical because they bridged the gap between narrow integration testing and using customers as unwitting testers.
AWS invested deeply in synthetic monitoring—not as a replacement for integration tests, but as production monitoring infrastructure: always-on, end-to-end probes catching broad, dashboard-level regressions before customers did. This framing made all the difference. As production infrastructure rather than dev tooling, the monitors had clear ownership, strict reliability requirements, and real-time alerting. While they lacked integration tests' depth, they provided broad coverage of the stack, and their owners prioritized keeping them reliable and high-signal. That reliability made them ideal first targets for production testing.
Our process was:
- Deploy: Roll out to synthetic test hosts
- Pause: Observe for a full 24-hour cycle to catch periodic patterns
- Validate: An engineer reviewed detailed metrics and logs for anomalies before approving the next stage
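To make that concrete, here is a minimal sketch of what a synthetic monitor can look like: a standalone probe that continuously runs a representative request against a target, measures latency, and alerts when results drift. This is not the actual Nitro tooling (the endpoint, interval, and thresholds below are illustrative), but it captures the shape of an always-on, end-to-end probe.

```go
// synthetic_probe.go: a minimal sketch of a synthetic monitor.
// It continuously exercises a target endpoint, records latency, and raises
// an alert when errors occur or latency exceeds a budget. The endpoint,
// interval, and thresholds are all illustrative.
package main

import (
	"fmt"
	"net/http"
	"time"
)

const (
	target        = "http://synthetic-target.internal/healthz" // hypothetical probe target
	interval      = 5 * time.Second
	latencyBudget = 250 * time.Millisecond
)

// probeOnce issues a single request and returns how long it took.
func probeOnce(client *http.Client) (time.Duration, error) {
	start := time.Now()
	resp, err := client.Get(target)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 500 {
		return 0, fmt.Errorf("server error: %s", resp.Status)
	}
	return time.Since(start), nil
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	for {
		latency, err := probeOnce(client)
		switch {
		case err != nil:
			// A real monitor would emit a metric and page, not just print.
			fmt.Println("ALERT: probe failed:", err)
		case latency > latencyBudget:
			fmt.Println("ALERT: latency over budget:", latency)
		default:
			fmt.Println("ok:", latency)
		}
		time.Sleep(interval)
	}
}
```

The point of the sketch is the framing above: the probe is a small, always-on production service with its own alerting, not a test that runs once in CI.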
2. One-Box Test (also called Canary)
Next, we expanded to a small production slice, large enough for meaningful signals but contained enough to limit potential damage. By 2015, a "one-box" test in EC2 actually required far more than a single host to cover a meaningful share (though still < 0.001%) of the fleet:
- Deploy: Roll out to approximately 100 production hosts
- Pause: Another 24-hour observation period
- Validate: Deep manual inspection of low-level metrics—storage latency, network packet statistics, hypervisor performance counters
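The manual review at this stage can also be backed by automated comparisons. Below is a rough sketch, not taken from the actual Nitro tooling, of the kind of gate involved: compare a latency percentile from the canary hosts against the baseline fleet and refuse to proceed if the canary regresses beyond a tolerance. The sample values and the 5% tolerance are made up for illustration.

```go
// canary_gate.go: a sketch of an automated canary comparison.
// It compares the canary's p99 latency against the baseline fleet's and
// fails the gate if the canary regresses beyond a relative tolerance.
// All sample values and the tolerance are illustrative.
package main

import (
	"fmt"
	"sort"
)

// p99 returns the 99th-percentile value of a sample set.
func p99(samples []float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	return s[int(0.99*float64(len(s)-1))]
}

// canaryHealthy allows the canary's p99 to exceed the baseline's by at most
// the given relative tolerance (0.05 == 5%).
func canaryHealthy(baseline, canary []float64, tolerance float64) bool {
	return p99(canary) <= p99(baseline)*(1+tolerance)
}

func main() {
	baseline := []float64{410, 420, 415, 430, 900, 425, 418} // e.g. storage latency in µs
	canary := []float64{405, 422, 419, 428, 950, 430, 416}
	if canaryHealthy(baseline, canary, 0.05) {
		fmt.Println("canary gate: PASS, proceed to next stage")
	} else {
		fmt.Println("canary gate: FAIL, halt deployment")
	}
}
```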
3. Single AZ Rollout
With initial validation complete, we carefully expanded to one availability zone, creating a controlled experiment with real customer workloads:
- Deploy: Gradually roll out across a single EC2 availability zone over several hours
- Safeguards: Deployment system continuously monitored key metrics with automatic circuit breakers; any anomaly triggered an immediate deployment halt and paged an engineer.
- Pause: Another 24 hours of stable operation
- Validate: Conduct final metrics review before wider rollout
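As an illustration of the circuit-breaker idea, here is a simplified sketch of a zonal rollout loop. It is not the real deployment system: deployBatch, errorRate, and page are hypothetical stand-ins for pushing software, querying monitoring, and paging an on-call engineer, and the batch size, threshold, and bake time are invented for the example.

```go
// az_rollout.go: a sketch of a zonal rollout with an automatic circuit breaker.
// Hosts are upgraded in small batches; after each batch a key metric is checked,
// and any anomaly halts the rollout and pages an engineer.
package main

import (
	"fmt"
	"time"
)

const (
	batchSize    = 20
	maxErrorRate = 0.001 // 0.1%, illustrative threshold
	bakePerBatch = 15 * time.Minute
)

// Hypothetical stand-ins for the real deployment and monitoring systems.
func deployBatch(hosts []string) error { /* push new software to hosts */ return nil }
func errorRate(zone string) float64    { /* query monitoring for the zone */ return 0.0002 }
func page(msg string)                  { fmt.Println("PAGE:", msg) }

// rolloutZone upgrades one availability zone batch by batch, tripping the
// circuit breaker (halt + page) as soon as the error rate looks anomalous.
func rolloutZone(zone string, hosts []string) error {
	for i := 0; i < len(hosts); i += batchSize {
		end := i + batchSize
		if end > len(hosts) {
			end = len(hosts)
		}
		if err := deployBatch(hosts[i:end]); err != nil {
			page(fmt.Sprintf("deploy failed in %s: %v", zone, err))
			return err
		}
		time.Sleep(bakePerBatch) // let the batch bake before judging it
		if rate := errorRate(zone); rate > maxErrorRate {
			page(fmt.Sprintf("error rate %.4f in %s exceeds threshold; halting", rate, zone))
			return fmt.Errorf("circuit breaker tripped in %s", zone)
		}
	}
	return nil
}

func main() {
	hosts := make([]string, 200) // placeholder host list for one AZ
	if err := rolloutZone("us-east-1a", hosts); err != nil {
		fmt.Println("rollout halted:", err)
		return
	}
	fmt.Println("zone rollout complete; begin 24h observation")
}
```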
4. Phased Global Rollout
Only after passing all previous gates would we begin the carefully orchestrated global deployment:
- Deploy: Methodically expand across regions over 1-2 weeks
- Phasing: Deploy zone by zone to maintain isolation between availability zones and regions
- Safeguards: The same automatic circuit breakers as in the single-AZ stage; any anomaly triggered an immediate deployment halt and paged an engineer.
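Finally, the phasing itself can be thought of as an ordered plan of waves: one zone at a time, with earlier regions finishing before later ones start, so any regression stays contained to a single zone. The sketch below shows one way to express such a plan; the region and zone names, ordering, and 24-hour bake time are illustrative rather than a description of EC2's actual wave schedule.

```go
// global_waves.go: a sketch of a phased global rollout expressed as ordered
// waves, one zone per wave, region by region, with a bake period between waves.
// Region names, ordering, and bake time are illustrative.
package main

import (
	"fmt"
	"time"
)

type wave struct {
	region string
	zone   string
	bake   time.Duration
}

// buildWaves produces the ordered deployment plan: every zone is its own wave,
// and earlier regions fully finish before later ones start.
func buildWaves(regions map[string][]string, order []string, bake time.Duration) []wave {
	var plan []wave
	for _, region := range order {
		for _, zone := range regions[region] {
			plan = append(plan, wave{region: region, zone: zone, bake: bake})
		}
	}
	return plan
}

func main() {
	regions := map[string][]string{
		"us-east-1": {"us-east-1a", "us-east-1b", "us-east-1c"},
		"eu-west-1": {"eu-west-1a", "eu-west-1b"},
	}
	order := []string{"us-east-1", "eu-west-1"} // in practice, smallest blast radius first
	for i, w := range buildWaves(regions, order, 24*time.Hour) {
		fmt.Printf("wave %d: deploy %s (%s), then bake %s\n", i+1, w.zone, w.region, w.bake)
	}
}
```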
The lengthy waiting periods and human validation were necessary for Nitro given the code's impact surface, deployment scale, and business criticality. But even for less critical services, these four steps—synthetic testing, canary deployment, zonal rollout, and phased global deployment—form the backbone of safe production testing.
You’re Already Testing in Production. Are You Doing It Safely?
Building and shipping Nitro taught us that integration testing has inherent limitations. At scale, the only fully reliable tests are those that happen under real load in production. But production testing isn't reckless when you have proper safeguards: narrow deployments, strong observability, automatic halts, and progressive rollout.
This approach requires tooling that puts controls directly in developers' hands, so they can send a portion of production traffic to newly deployed software. For teams running Kubernetes at scale today, the building block for doing so is not DNS tricks—nor multi-year "it will be worth the wait when we finally get there" migrations to sidecar-based service meshes.
Junction provides a lightweight control plane and client SDK that work together to make progressive delivery simple. No sidecars. No complex policy engines. No waiting on a platform team to wire everything up. Paired with a framework like Argo Rollouts, Junction puts production-first testing within reach of any team—not just hyperscalers.
Production is part of the test loop. Junction enables your developers to make it safe.