Harnessing Statistical Power for Test Results You Can Trust
By Philip Cross
Years ago, when I first started split-testing, I thought every test was worth running. It didn’t matter if it was changing a button color or a headline—I wanted to run that test.
My enthusiastic, yet misguided, belief was that I simply needed to find aspects to optimize, set up the tool, and start the test. After that, I thought, it was just a matter of awaiting the infamous 95% statistical significance.
I was wrong.
After implementing “statistically significant” variations, I saw no lift in sales because there was no true lift—the measured difference was imaginary. Many of those tests were doomed at inception: I was committing common statistical errors, like not testing for a full business cycle or neglecting to take effect size into consideration.
I also failed to consider another possibility: that an “underpowered” test could cause me to miss changes that would have generated a true lift.
Understanding statistical power, or the “sensitivity” of a test, is an essential part of pre-test planning and will help you implement more revenue-generating changes to your site.
What is statistical power?
Statistical power is the probability of observing a statistically significant result at level alpha (α) if a true effect of a certain magnitude is present. It’s your ability to detect a difference between test variations when a difference actually exists.
Statistical power is what makes the hard work you put into conversion research and properly prioritized treatments pay off. This is why power is so important—it increases your ability to find and measure differences between a treatment and a control when they’re actually there.
Statistical power (1 – β) holds an inverse relationship with the Type II error rate (β); it’s how you control the risk of false negatives. We want to lower the risk of Type I errors to an acceptable level while retaining sufficient power to detect improvements if test treatments are actually better.
Finding the right balance, as detailed later, is both art and science. If one of your variations is better, a properly powered test makes it likely that the improvement is detected. If your test is underpowered, you have an unacceptably high risk of failing to reject a false null hypothesis—a Type II error.
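To make this concrete, here is a minimal sketch of a power calculation for a two-proportion z-test, using only the Python standard library. The baseline conversion rate, target lift, and sample size below are illustrative numbers, not figures from the article:

```python
from statistics import NormalDist

def two_prop_power(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two
    conversion rates, with n visitors in each variation."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)  # critical value for the test
    # Standard error of the difference between the two observed rates
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n) ** 0.5
    z_effect = abs(p2 - p1) / se
    # Probability the observed z-statistic clears the critical value
    return nd.cdf(z_effect - z_crit)

# e.g. 5% baseline conversion, hoping to detect a lift to 6%
print(round(two_prop_power(0.05, 0.06, 5000), 2))   # ~0.59: underpowered
print(round(two_prop_power(0.05, 0.06, 20000), 2))  # larger n, more power
```

With 5,000 visitors per variation, a real 5% → 6% lift would be detected only about 59% of the time—well below the 80% power that is conventionally considered acceptable.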
Before we go into the components of statistical power, let’s review the errors we’re trying to account for.
Type I and Type II errors
Type I errors
A Type I error, or false positive, rejects a null hypothesis that is actually true. Your test measures a difference between variations that, in reality, does not exist. The observed difference—that the test treatment outperformed the control—is illusory and due to chance or error.
The probability of a Type I error, denoted by the Greek alpha (α), is the level of significance for your A/B test. If you test with a 95% confidence level, it means you have a 5% probability of a Type I error (1.0 – 0.95 = 0.05).
If 5% is too high, you can lower your probability of a false positive by increasing your confidence level from 95% to 99%—or even higher. This, in turn, would drop your …
Read more here: B2CMarketingInsider