Statistical significance
Imagine flipping a coin 10 times and getting 6 heads. Does this mean the coin favors heads? Probably not — this small difference could be random chance. But if you flip it 1000 times and get 600 heads, that's more convincing evidence of a real pattern. Statistical significance works the same way in A/B testing.
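For a concrete sense of the numbers, here is a minimal Python sketch using SciPy's binomtest on the coin counts from the example above; it asks how surprising each result would be if the coin were actually fair:

```python
from scipy.stats import binomtest

# How surprising is each result if the coin is fair (50% heads)?
small = binomtest(k=6, n=10, p=0.5)      # 6 heads out of 10 flips
large = binomtest(k=600, n=1000, p=0.5)  # 600 heads out of 1000 flips

print(f"6/10 heads:     p-value = {small.pvalue:.3f}")  # ~0.75: easily explained by chance
print(f"600/1000 heads: p-value = {large.pvalue:.1e}")  # tiny: strong evidence of a biased coin
```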
The p-value measures how surprising your result would be if there were no real difference at all. A p-value of 0.05 (commonly described as 95% confidence) means that if the A and B versions actually performed the same, you would see a difference this large only 5% of the time by random chance. Think of it like a weather forecast: if there's only a 5% chance of rain, you probably don't need an umbrella. Similarly, if there's only a 5% chance that random noise alone would produce the difference you observed (p-value = 0.05), you can feel reasonably confident about implementing the winning version.
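To see where that 5% risk comes from, here is a small simulation sketch (the visitor counts and conversion rate are hypothetical, and statsmodels' proportions_ztest stands in for whatever testing tool you use). It runs many A/B tests where the two versions truly perform identically and counts how often the result still looks "significant" at p < 0.05:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(42)
n = 5_000            # visitors per variant (hypothetical)
true_rate = 0.10     # both variants convert at the same 10% rate
trials = 2_000
false_positives = 0

for _ in range(trials):
    conv_a = rng.binomial(n, true_rate)   # conversions for variant A
    conv_b = rng.binomial(n, true_rate)   # conversions for variant B
    _, p = proportions_ztest([conv_a, conv_b], [n, n])
    if p < 0.05:
        false_positives += 1

# With a 0.05 threshold, roughly 5% of these no-difference tests still
# come out "significant" -- that is exactly the risk the p-value quantifies.
print(f"False positive rate: {false_positives / trials:.3f}")
```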
Most product teams aim for a p-value below 0.05, meaning at least 95% confidence in the results. A higher p-value such as 0.2 means you'd see a result that large 20% of the time from random noise alone, which is too risky for important product decisions. Running tests longer (collecting more data) and testing bigger changes both make real differences easier to detect, which typically leads to lower p-values and more reliable results.[1]
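The effect of running a test longer can be seen in a quick sketch: the same observed lift, a hypothetical 10% to 11% conversion rate, is inconclusive with a small sample but clears the 0.05 bar with a larger one (again using statsmodels' proportions_ztest as a stand-in for your testing tool):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical numbers: the same 10% -> 11% lift at two sample sizes.
for visitors_per_variant in (1_000, 20_000):
    conversions = [int(0.10 * visitors_per_variant),   # variant A
                   int(0.11 * visitors_per_variant)]   # variant B
    _, p_value = proportions_ztest(conversions, [visitors_per_variant] * 2)
    print(f"{visitors_per_variant:>6} visitors per variant: p = {p_value:.3f}")

# 1,000 visitors:  p is well above 0.05, so the lift could easily be noise.
# 20,000 visitors: the same lift gives p below 0.05 -- more data, more confidence.
```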