It feels like I’ve made an art form of updating my blog only once per year. Except that these days I publish all my articles on Medium.com. Either way, here’s my review of experimentation in 2016.
This is my annual review of everything I tweeted about A/B testing in 2015. It’s a couple of years old, but mostly holds up very well!
This is simply the 3rd instalment of my annual summary of everything I tweeted about shaping behavior over the year. The difference this time, however, is that the article itself is on Medium. It’s just a much easier platform!
Tripline lets you create trips on maps. I just had to test how embedding their maps works, so this blog now has a little Tripline map of the trip I did through Morocco, Spain and Portugal with @angelarhodes in April/May. Check it out – it’s a cool little thing to play with.
Angela’s blogpost: Morocco – Assault on the Senses.
I’ve always felt that the idea of repeated significance testing errors and false positive rates is a bit of a pedantic academic exercise. And I’m not the only one: some A/B testing frameworks let you automatically stop or conclude an experiment the moment it hits significance, and there is blessedly little discussion of false positive rates online. For anyone running A/B tests, there’s also little incentive to control your false positives. Why make it harder for yourself to show successful changes, just to meet some standard no one cares about anyway?
But it’s not that easy, because it actually matters – and matters a lot – if you care about your A/B experiments, and not least about what you learn from them. Evan Miller has written a thorough article on the subject, How Not To Run An A/B Test, but it’s a bit too advanced to illustrate the effect vividly on its own. To demonstrate how much it matters, I ran a simulation of how much impact you should expect repeat testing errors to have on your success rate.
Here’s how the simulation works:
- It runs 1,000 experiments, each with 200,000 simulated participants divided randomly between two experiment variants.
- The conversion rate is 3% in both variants.
- Each individual “participant” is randomly assigned to a variant, and to the “hit” or “miss” group based on the conversion rate.
- After each participant, a G-test significance test is run, testing whether the distribution differs between the two variants.
- I then count every experiment that hit significance at the 90% and 95% levels, and separately every experiment that reached significance at any point along the way.
- As the G-test doesn’t handle low counts well, I didn’t check significance during the first 1,000 participants of each experiment.
- You can download the script and alter the variables to fit your metrics.
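The steps above can be sketched roughly like this. This is a scaled-down illustration of the idea, not the actual downloadable script: the participant and experiment counts are reduced so it runs quickly, significance is checked every 100 participants rather than after every single one, and all names are my own. The G-test is implemented directly against the chi-square critical values for one degree of freedom.

```python
import math
import random

def g_statistic(hits_a, n_a, hits_b, n_b):
    """G-test statistic (log-likelihood ratio) for a 2x2 table, 1 df."""
    table = [[hits_a, n_a - hits_a], [hits_b, n_b - hits_b]]
    total = n_a + n_b
    g = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            row = table[i][0] + table[i][1]
            col = table[0][j] + table[1][j]
            expected = row * col / total
            if observed > 0:
                g += 2 * observed * math.log(observed / expected)
    return g

# Chi-square critical values for 1 degree of freedom
CRIT_90, CRIT_95 = 2.706, 3.841

def run_experiment(participants=20_000, conversion=0.03,
                   burn_in=1_000, check_every=100):
    """A/A test: returns whether it *ever* looked significant at 90%/95%."""
    hits = [0, 0]
    n = [0, 0]
    hit_90 = hit_95 = False
    for i in range(participants):
        v = random.randrange(2)                   # random variant assignment
        n[v] += 1
        hits[v] += random.random() < conversion   # "hit" or "miss"
        if i >= burn_in and i % check_every == 0:
            g = g_statistic(hits[0], n[0], hits[1], n[1])
            hit_90 |= g > CRIT_90
            hit_95 |= g > CRIT_95
    return hit_90, hit_95

random.seed(1)
results = [run_experiment() for _ in range(100)]  # scaled down from 1,000
print(sum(h90 for h90, _ in results), "of 100 hit 90% significance at some point")
print(sum(h95 for _, h95 in results), "of 100 hit 95% significance at some point")
```

Because this version peeks far less often and at far fewer participants than the full simulation, its numbers will land below the full script’s, but the same pattern shows up: a surprisingly large share of no-difference experiments look significant at some point.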
So what’s the outcome? Keep in mind that these are 1,000 controlled experiments where it is known that there is no difference between the variants.
- 771 experiments out of 1,000 reached 90% significance at some point
- 531 experiments out of 1,000 reached 95% significance at some point
This means that if you’ve run 1,000 experiments and didn’t control for repeat testing error in any way, a rate of successful positive experiments of up to 25% might be explained by false positives alone. And you’ll see a temporary significant effect in around half of your experiments!
Fortunately, there’s an easy fix: select your sample size or decision point in advance, and make your decision only then. These are the false positive rates when the decision is made only at the end of each experiment:
- 100 experiments out of 1,000 were significant at 90%
- 51 experiments out of 1,000 were significant at 95%
So you still get a false positive rate you shouldn’t ignore, but nowhere near as serious as when you don’t control it correctly. And this is exactly what you should expect at these significance levels – a 95% significance level means a 5% false positive rate by design – and at this point you can talk about real hypothesis testing.
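The fixed-horizon version of the simulation is simple to sketch: generate all the participants first, then test exactly once at the end. The sketch below (my own illustration, with reduced counts so it runs quickly) uses a two-proportion z-test in place of the G-test; for a 2x2 table the two are closely related, and either one gives you the designed-in false positive rate when you test only once.

```python
import math
import random

def z_two_proportions(hits_a, n_a, hits_b, n_b):
    """Two-proportion z statistic with a pooled standard error."""
    p_pool = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (hits_a / n_a - hits_b / n_b) / se

Z_95 = 1.96  # two-sided 95% threshold
N_EXPERIMENTS, N_PER_VARIANT, CONVERSION = 500, 5_000, 0.03

random.seed(2)
false_positives = 0
for _ in range(N_EXPERIMENTS):
    # A/A test: both variants draw from the same 3% conversion rate
    hits_a = sum(random.random() < CONVERSION for _ in range(N_PER_VARIANT))
    hits_b = sum(random.random() < CONVERSION for _ in range(N_PER_VARIANT))
    # One decision, made only at the pre-selected sample size
    if abs(z_two_proportions(hits_a, N_PER_VARIANT, hits_b, N_PER_VARIANT)) > Z_95:
        false_positives += 1

# Expect roughly 5% of experiments to be (falsely) significant
print(false_positives, "false positives out of", N_EXPERIMENTS)
```

The single test at the end is the whole trick: with no peeking, the false positive rate collapses back to the nominal 5% that the 95% significance level promises.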