What I learned running 50 A/B tests in one year
Running A/B tests was never something I thought I'd be doing this much. But at my current gig, we ship new experiments almost every week. After doing this for about a year, I've picked up a few things I wish I'd known going in.
The baseline is everything
Before you even think about what you're testing, you need to have a really solid understanding of your baseline metrics. CTR, viewability, time on page — whatever matters for your feature. Without a clean baseline, every result you get is suspect.
I made the mistake of launching experiments before we had stable metric collection. The data looked great, but it was noise. It took us three weeks to realize the logging pipeline had a bug. Always audit your logging first.
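A logging audit doesn't have to be fancy. Here's a minimal sketch of the kind of invariant checks I mean, assuming a hypothetical event shape (the field names are illustrative, not from any real pipeline):

```typescript
// Hypothetical event shape for illustration only.
interface AdEvent {
  id: string;
  type: 'impression' | 'click';
  adId: string;
}

// Basic invariants: no duplicate event IDs, and no clicks
// for ads that never logged an impression.
function auditEvents(events: AdEvent[]): string[] {
  const problems: string[] = [];

  const seen = new Set<string>();
  for (const e of events) {
    if (seen.has(e.id)) problems.push(`duplicate event id: ${e.id}`);
    seen.add(e.id);
  }

  const impressions = new Set(
    events.filter(e => e.type === 'impression').map(e => e.adId)
  );
  for (const e of events) {
    if (e.type === 'click' && !impressions.has(e.adId)) {
      problems.push(`click without impression for ad: ${e.adId}`);
    }
  }
  return problems;
}
```

Running something like this against a day of raw events before launch catches a surprising number of pipeline bugs.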
Statistical significance isn't the whole story
A result can be statistically significant and still be meaningless. A 0.1% CTR lift with p < 0.05 in a massive sample is technically significant. But does it translate to anything real?
I started asking myself: if this change shipped to 100% of users for a year, would it matter? That question filters out a lot of fake wins.
The hardest part is not the engineering
Honestly, the hardest part of A/B testing at scale is aligning people. Engineers want to ship, PMs want wins, and data scientists want clean experiments. These goals conflict more often than you'd think.
We had situations where a PM wanted to layer three experiments on top of each other because "we need results fast." That kind of stacking kills experiment validity. Building the culture around rigorous testing takes more energy than building the tooling.
Ramp slowly, watch closely
Always ramp new experiments gradually. Start at 1%, then 5%, then 25% before going wide. This has caught bugs that would have been catastrophic at full traffic. We once found a layout issue at 5% ramp that would have dropped CTR significantly if we'd gone to 100%.
We automate the ramp schedule now and hook it into our monitoring alerts. Any metric drop over a threshold triggers an automatic pause. Should have done that from day one.
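The core of that automation fits in a few lines. Here's a simplified sketch of the decision step, assuming some monitoring query feeds it guardrail results (the shapes and thresholds below are made up for illustration):

```typescript
// Ramp stages from the post: 1% -> 5% -> 25% -> 100%.
const RAMP_STAGES = [0.01, 0.05, 0.25, 1.0];

// Stand-in for whatever your monitoring system returns.
interface GuardrailResult {
  metric: string;
  relativeDrop: number; // e.g. 0.03 = metric is down 3% vs control
}

// Decide whether to advance to the next ramp stage or pause.
// maxDrop is the pause threshold (hypothetical default: 2%).
function nextRampAction(
  currentStage: number,
  guardrails: GuardrailResult[],
  maxDrop = 0.02
): { action: 'advance' | 'pause'; traffic: number } {
  const breached = guardrails.some(g => g.relativeDrop > maxDrop);
  if (breached) {
    return { action: 'pause', traffic: 0 };
  }
  const next = Math.min(currentStage + 1, RAMP_STAGES.length - 1);
  return { action: 'advance', traffic: RAMP_STAGES[next] };
}
```

In practice you'd run this on a schedule and require a minimum soak time at each stage, but the pause-on-drop logic is the part that pays for itself.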
TypeScript really helps here
On the frontend side, having strong types around the experiment configuration has been a lifesaver. We define each experiment variant as a discriminated union, so the component tree only receives the props it actually needs. No more runtime surprises about which variant is active.
type AdVariant =
  | { type: 'control' }
  | { type: 'bold-title'; titleSize: 'lg' | 'xl' }
  | { type: 'image-first'; imagePosition: 'top' | 'left' };

function AdComponent({ variant }: { variant: AdVariant }) {
  if (variant.type === 'image-first') {
    return <ImageFirstAd position={variant.imagePosition} />;
  }
  // ...
}
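A related trick the discriminated union buys you: compile-time exhaustiveness. Here's a sketch using a `never` check (the `variantLabel` helper is hypothetical, not from our codebase):

```typescript
type AdVariant =
  | { type: 'control' }
  | { type: 'bold-title'; titleSize: 'lg' | 'xl' }
  | { type: 'image-first'; imagePosition: 'top' | 'left' };

// If a fourth variant is ever added to AdVariant, the `never`
// assignment in the default branch becomes a compile error,
// so no variant can be silently unhandled.
function variantLabel(variant: AdVariant): string {
  switch (variant.type) {
    case 'control':
      return 'Control';
    case 'bold-title':
      return `Bold title (${variant.titleSize})`;
    case 'image-first':
      return `Image first (${variant.imagePosition})`;
    default: {
      const unreachable: never = variant;
      return unreachable;
    }
  }
}
```

This is especially handy when experiments get cleaned up: deleting a variant from the union immediately flags every branch that still references it.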
Sounds obvious in hindsight, but the previous codebase passed around a big untyped config object and did if (config.experimentName === 'foo') checks everywhere. Not fun to debug.
Takeaway
A/B testing at scale is as much a discipline as a tool. The engineering is the easy part. The hard part is having the patience to run clean experiments, the judgment to interpret results correctly, and the communication skills to push back when things get sloppy.