I’ve always felt that worrying about repeated significance testing errors and false positive rates is a bit of a pedantic academic exercise. And I’m not the only one: some A/B frameworks let you automatically stop or conclude at the moment of significance, and there is blessed little discussion of false positive rates online. Anyone running A/B tests also has little incentive to control their false positives. Why make it harder for yourself to show successful changes, just to meet some standard no one cares about anyway?

It’s not that easy. It actually matters, and matters a lot, if you care about your A/B experiments and, not least, about what you *learn* from them. Evan Miller has written a thorough article on the subject, How Not To Run An A/B Test, but it is too advanced to illustrate the effect very well. To demonstrate how much it matters, I’ve run a simulation of how much impact you should expect repeated testing errors to have on your success rate.

Here’s how the simulation works:

- It runs 1.000 experiments, each with 200.000 fake participants divided randomly into two experiment variants.
- The conversion rate is 3% in both variants.
- Each individual “participant” gets randomly assigned to a variant and either the “hit” or “miss” group based on the conversion rate.
- After each participant, a G-test significance test is run, testing whether the distribution of hits and misses differs between the two variants.
- I then count every occasion where an experiment hit significance at the 90% and 95% levels, then count every experiment that reached significance at any point.
- As the G-test doesn’t like low numbers, I didn’t check significance during the first 1.000 participants in each experiment.
- You can download the script and alter the variables to fit your metrics.
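The downloadable script is not reproduced here. As a rough illustration of the steps listed above, here is a minimal Python sketch, not the original code. The names `g_statistic`, `run_experiment` and `simulate` are hypothetical, the critical values are the standard chi-square thresholds for one degree of freedom, and the defaults are scaled far down from 1.000 experiments of 200.000 participants so that pure Python finishes in a few seconds.

```python
import math
import random

# Chi-square critical values for 1 degree of freedom (90% and 95% levels).
CRITICAL = {0.90: 2.706, 0.95: 3.841}


def g_statistic(hits_a, n_a, hits_b, n_b):
    """G-test statistic for the 2x2 table of variant (A/B) by outcome (hit/miss)."""
    total = n_a + n_b
    hits = hits_a + hits_b
    cells = (
        (hits_a, n_a, hits),
        (n_a - hits_a, n_a, total - hits),
        (hits_b, n_b, hits),
        (n_b - hits_b, n_b, total - hits),
    )
    g = 0.0
    for observed, row_total, col_total in cells:
        if observed > 0:  # an empty cell contributes nothing to the sum
            expected = row_total * col_total / total
            g += observed * math.log(observed / expected)
    return 2.0 * g


def run_experiment(rng, participants, rate, warmup):
    """Run one A/A experiment; return (levels hit at any peek, levels hit at the end)."""
    n = [0, 0]
    hits = [0, 0]
    peeked = set()
    for i in range(1, participants + 1):
        variant = rng.randrange(2)  # random assignment to A or B
        n[variant] += 1
        if rng.random() < rate:  # same conversion rate in both variants
            hits[variant] += 1
        if i > warmup:  # skip the test while counts are still small
            g = g_statistic(hits[0], n[0], hits[1], n[1])
            peeked.update(level for level, crit in CRITICAL.items() if g > crit)
    final_g = g_statistic(hits[0], n[0], hits[1], n[1])
    final = {level for level, crit in CRITICAL.items() if final_g > crit}
    return peeked, final


def simulate(experiments=100, participants=10_000, rate=0.03, warmup=1_000, seed=1):
    """Count experiments that were significant at any peek versus only at the end."""
    rng = random.Random(seed)
    counts = {
        "peeked": {level: 0 for level in CRITICAL},
        "final": {level: 0 for level in CRITICAL},
    }
    for _ in range(experiments):
        peeked, final = run_experiment(rng, participants, rate, warmup)
        for level in peeked:
            counts["peeked"][level] += 1
        for level in final:
            counts["final"][level] += 1
    return counts


if __name__ == "__main__":
    print(simulate())
```

The `"peeked"` counts correspond to stopping at the first sign of significance, the `"final"` counts to deciding only once at the end. With shorter experiments than the original, the exact numbers will differ from those reported in this post.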

**So what’s the outcome?** Keep in mind that these are 1.000 controlled experiments where it’s known that there is **no** difference between the variants.

- **771** experiments out of **1.000** reached 90% significance at some point
- **531** experiments out of **1.000** reached 95% significance at some point

This means that if you’ve run 1.000 experiments and didn’t control for repeated testing error in any way, a rate of successful positive experiments of up to 25% might be explained by false positives. *But you’ll see a temporarily significant effect in around half of your experiments!*

Fortunately, there’s an easy fix. Select your sample size or decision point in advance, and make your decision then. These are the false positive rates when making the decision **only** at the end of the experiment:

- 100 experiments out of 1.000 were significant at 90%
- 51 experiments out of 1.000 were significant at 95%

So you still get a false positive rate you should not ignore, but it is nowhere near as serious as when you don’t control for repeated testing. And this is what you should expect when running with significance levels like these: roughly 10% false positives at 90% and roughly 5% at 95% is exactly what those levels imply, and at this point you can talk about real hypothesis testing.
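Fixing the sample size in advance requires a way to pick it. One common approximation, which is not taken from the simulation script, is the normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided test. The function name `sample_size_per_variant` and its parameters are hypothetical, and `min_effect` is the smallest absolute difference in conversion rate you want to be able to detect.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(base_rate, min_effect, alpha=0.05, power=0.80):
    """Approximate participants needed per variant for a two-sided two-proportion test.

    base_rate:  conversion rate of the control variant, e.g. 0.03
    min_effect: smallest absolute difference worth detecting, e.g. 0.003
    alpha:      accepted false positive rate (0.05 means 95% significance)
    power:      probability of detecting a true effect of size min_effect
    """
    p1 = base_rate
    p2 = base_rate + min_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: 3% baseline, detect an absolute lift of 0.3 points.
    print(sample_size_per_variant(0.03, 0.003))
```

The required sample grows with the inverse square of `min_effect`, so halving the effect you want to detect roughly quadruples the number of participants. The result is an approximation and should be treated as a planning figure, not an exact threshold.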

Interesting. What I would want to see as a follow-up to this, then, is:

1) Distribution over time of false positives. I’d assume they are clustered towards the start of the tests, so it would be valuable to know the prevalence of late false positives.

2) How to accurately calculate a suitable sample size or decision point (preferably weighted to also work with enormous datasets, ahem) 😉

Thanks for the comment, this is what I’d expect on your two points:

1) I don’t think the false positives would be clustered at the beginning. Quite the opposite: the significance test should have the same sensitivity regardless of where in the experiment it is run. I can run the simulation again and have a look, though.

2) Check out Evan Miller’s article. For enormous datasets you probably have to do it yourself; all the online calculators I’ve found seem to be capped at however many first-year college students a sane Ph.D. candidate can conceivably expect to invite to a lab.

I’ve replicated your script as a (slightly interactive) webpage. A bit of a hack, but no download or Perl skills required.

http://www.lukasvermeer.nl/projects/significance/

[…] My colleague Mats has an excellent piece on the topic of repeated significance testing on his blog. […]

[…] Mats Einarsen’s blog post on A/B testing, he demonstrates the problem with not allowing enough time or not repeat […]

[…] this: One thousand A/A tests (two identical pages tested against each other) were […]

Interesting article, and it takes some time to grasp. It’s important to have these ideas in place when running any experiment and before drawing any conclusions. Sample size is important, and knowing the test length and number of visitors needed is very important. I am still a little puzzled.. 🙂

This is a good article. Unfortunately, we’re three years smarter and virtually nothing has been done to correct the problem of statistically bunk A/B tests. In fact, the problem seems to have gotten worse.

To Mr. Pennell’s comments:

1) I’ve never come across any sort of false positives vs time comparison, but I think he may be on to something here. It makes statistical sense that false positives would occur more frequently at the beginning of tests (smaller sample sizes lead to lower test power, etc). Then again, the built-in cutoff point in your tests should address that problem. It would still be interesting to see the distribution though.

2) As mentioned, there are a number of online sample size calculators. Unfortunately, most A/B tests don’t have enormous data sets to work with.

[…] this: When one thousand A/A tests (two identical pages tested against each other) were […]

[…] you did a fake test, with the same version of a page, an A/A test, you’d have more than 70% chance that your test will reach 95% significance level at some […]

[…] test may turn out statistically significant at some point, but it is not the reason to stop the test as soon as it reaches the desired confidence […]

[…] Mats Stafseng Einarsen put together a really interesting simulation where he launched an A/A test (the same variation against each other). He ran 1,000 experiments, each with 200,000 fake participants divided randomly into the two variants. […]

[…] you the full picture – you’re only seeing a small percentage of your visitors and your data can’t be counted on for statistical validity. (You want uniform sampling to avoid The Simpson’s […]
