A/B testing may seem straightforward, but the truth is that there are a lot of “little-known” factors that can shift how your tests perform.
After you’ve done your pre-test research, formulated a hypothesis and have a good idea of which test variations to create to maximize your chances of producing a winning test, it’s easy to think that the job is done. Thing is, the steps to a successful A/B test don’t stop there.
In fact, Qubit Research found that at least 80% of winning tests are completely worthless.
You might have heard about sample sizes, statistical significance, and the time periods required to properly test a landing page.
But what about the History Effect? The Novelty Effect? The Instrumentation Effect? Statistical Regression, or the Selection Effect?
The above may be the reason why your winning test variation showed a 50% increase in conversions, but after implementation, you barely saw your revenue rise.
Keep reading to find out why the above threats could invalidate your A/B tests (and how to keep them from skewing your data).
The History Effect is a big one – and it can happen when an event from the outside world skews your testing data. Let me explain…
Let’s say your company just launched a new influencer marketing campaign while you’re running a test. This might result in increased attention, or even press coverage, which in turn results in an unusual spike of traffic for your website.
The difference here is that a traffic spike that’s a direct result of an unusual event means there’s a high probability that those visitors differ from your usual, targeted traffic. In other words, they might have different needs, wants and browsing behaviours.
Now, because this traffic is only temporary, your test data could shift completely during this event, and one of your variations could end up winning when in reality, with your regular traffic, it would have lost.
Maybe this is cliché, but the key to avoidance is prevention. If you’re aware of a major event coming up that could impact your test results, it’s a good idea not to test hypotheses and funnel steps that are likely to be affected.
For example, if you’re testing a new product page layout, and planning on getting a lot of unusual traffic for a week, make sure to run your test for longer – at least 3 to 4 weeks in order to get some of your usual traffic into the mix as well.
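To see why the extra weeks help, here’s a quick back-of-the-envelope calculation (the traffic numbers are made up purely for illustration) showing how much of your sample that spike week ends up representing:

```python
# Rough illustration (made-up numbers): how much of your test sample the
# unusual spike week represents, depending on how long the test runs.
usual_weekly_visitors = 10_000
spike_week_visitors = 30_000  # influencer campaign triples traffic for one week

for total_weeks in (1, 2, 4):
    total = spike_week_visitors + usual_weekly_visitors * (total_weeks - 1)
    share = spike_week_visitors / total
    print(f"{total_weeks}-week test: spike traffic is {share:.0%} of the sample")

# 1-week test: spike traffic is 100% of the sample
# 2-week test: spike traffic is 75% of the sample
# 4-week test: spike traffic is 50% of the sample
```

The unusual visitors never disappear from your data, but the longer the test runs, the less they dominate it.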
Be aware that you’ll always have traffic fluctuations and outside events that affect your test data; they can never be completely avoided. In this case, the #1 thing you can do to minimize the negative impact the History Effect has on your testing program is simply to be aware of the fluctuations and differences in your traffic.
When you’re aware of what’s happening, you can dig deeper in Google Analytics to analyze your variations’ performance, and then recognize if your winning variation is indeed a winner.
Another solution is to launch a test that’s only targeting traffic coming from certain traffic sources. Let’s say you have partnered with Instagram influencers to advertise your product for a week. If you have enough traffic, you could launch an A/B test that targets visitors from Facebook ads only, excluding the most variable traffic source from the test entirely.
At the end of the day, the key thing to remember is to never analyze a test solely by using your testing tool. The analytics of your testing tool – whether you use VWO, Optimizely, or something else – don’t allow you to dig as deeply into your analysis as with Google Analytics.
In Google Analytics, you’ll be able to analyze your test results using different audience segments, purchase paths, traffic sources and so on.
In short, use your testing tool for running your tests. Use Google Analytics to analyze them.
The Instrumentation Effect is quite possibly one of the most frequent validity threats among companies new to testing. It happens when problems with your testing tool or test variations cause your data to be flawed.
A common example is when the code of one or more of your variations doesn’t function properly across all devices or browser types – often without the company running the test even being aware of it.
It can be a big problem: let’s say variation C of your test isn’t displaying properly in Firefox… this means a portion of your visitors will be served a problematic page.
As you can imagine, variation C is at a disadvantage in this case – its chances of winning are slim.
The thing is… if variation C had been coded and tested properly without any bugs, it may have won by a large margin!
Before launching ANY tests, you should always do rigorous Quality Assurance (QA) checks such as performing cross-browser and cross-device testing on your new variations, and trying out your variations under multiple different user scenarios.
“Quality Assurance (QA) is a critical part of any web or application development project. QA helps to verify that a project has met the project’s requirements and technical specifications without bugs or other defects. The aim is to identify issues prior to product launch.” – Catriona Shedd via InspireUX
The good news is, many testing tools such as Optimizely and VWO have browser testing features integrated. The hiccup is that simply opening your test variation in a new browser is far from enough to ensure you don’t have bugs appearing mid-test that will ruin your experiment.
This means you’ll still have to test different browsers on different devices in addition to trying out all of your test variations’ features and elements to ensure they work in every possible test scenario.
To test on different devices and browsers, you can use online tools such as BrowserStack (which runs your pages in real browsers and on real devices rather than emulators) or Browserling.
Then, combine the use of these tools with different devices: iPad, iPhone, Android phones, other tablets, and so on. If you don’t have access to multiple different devices at your disposal, search for a Device Lab near you on opendevicelab.com.
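If you’d rather script part of this instead of clicking through every page by hand, here’s a minimal QA sketch using Selenium. The URLs, preview parameters and CSS selector below are hypothetical placeholders for whatever your testing tool and pages actually use, and this only checks that a key element renders – it doesn’t replace walking through full user scenarios.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Hypothetical preview links exposed by your testing tool, one per variation.
VARIATION_URLS = {
    "control":     "https://example.com/product?preview=control",
    "variation_b": "https://example.com/product?preview=b",
    "variation_c": "https://example.com/product?preview=c",
}
KEY_SELECTOR = "#add-to-cart"  # hypothetical element every variation must render

# Load every variation in each browser and flag any page where the key element is missing.
for browser_name, driver in (("chrome", webdriver.Chrome()), ("firefox", webdriver.Firefox())):
    try:
        for variation, url in VARIATION_URLS.items():
            driver.get(url)
            found = driver.find_elements(By.CSS_SELECTOR, KEY_SELECTOR)
            print(f"{browser_name} / {variation}: {'OK' if found else 'MISSING KEY ELEMENT'}")
    finally:
        driver.quit()
```

Services like BrowserStack can expose remote browsers you could plug into the same kind of loop to cover devices you don’t own.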
Don’t use your A/B testing tool’s drag-and-drop editor
Testing tools have every reason in the world to convince you launching an A/B test is a 3-click process, that it’s super easy, and that everyone can do it. That’s also the reason why most A/B testing tools have drag-and-drop editors that allow you to move and modify elements of your variations rather quickly. Who doesn’t like drag-and-drop?
Beware – you’ll be shooting yourself in the foot. Using these to create your test variations is a big mistake.
When you’re using the testing tools’ drag-and-drop editors, your variations’ code is being auto-generated. This makes the code messy and quite frequently incompatible with some browsers. In the end, your variations will be prone to bugs and browser issues. The Instrumentation Effect in action.
Instead, unless you’re just changing a line of text, get a developer to code your test variations. Clean code and a rigorous quality assurance process will greatly reduce the risk of your variations breaking mid-test and rendering your test data useless.
There’s a lot of bad conversion optimization advice out there, and I’m not afraid to call it out…
One erroneous piece of advice I hear far too often is the following: “If you don’t have enough traffic to test one of your pages, temporarily send paid traffic to it for the duration of the test”.
Please, don’t do this.
This “piece of advice” assumes that traffic coming from your paid traffic channel will have the same needs, wants and behaviours as your regular traffic. And that’s a false assumption.
I recently had a client that used the same landing page for both email traffic and Facebook ads. Of course, traffic sources were tracked and analyzed… and the result? Facebook traffic converted at 6%, and email at 43%. HUGE difference, and this is massively common.
Each traffic source brings its own type of visitors, and you can’t assume that paid traffic from a few ads and one channel mirrors the behaviors, context, mindset and needs of the totality of your usual traffic.
“So if you run an A/B test and the traffic sample is not a good representative of the average visitors to your website then you are not going to get an accurate insight on how your website visitors respond to different landing page variations (unless, of course, you are running your test only for a particular traffic segment).” via OptimizeSmart
Simple: Be aware of your different traffic sources when running a test. When you’re analyzing the test results, make sure to segment by sources in order to see the real data that lies behind averages.
If you don’t segment and analyze your results in Google Analytics, your testing tool could tell you your “Control” (original version of what you’re testing) won, but you might discover it won only because of one specific traffic source that doesn’t represent your usual traffic well.
Don’t be fooled by averages. Compare the performance of each traffic source that led visitors to your test.
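As a rough sketch of what that segmentation can look like outside your testing tool (the file and column names here – test_sessions.csv, variation, source, converted – are made-up placeholders for whatever your analytics export contains):

```python
import pandas as pd

# Hypothetical export: one row per session, with the variation served,
# the traffic source, and a 0/1 converted flag.
sessions = pd.read_csv("test_sessions.csv")

# Blended conversion rate per variation: the averages that can hide the Selection Effect.
blended = sessions.groupby("variation")["converted"].mean()

# Conversion rate and sample size per variation *per traffic source*.
by_source = (
    sessions
    .groupby(["variation", "source"])["converted"]
    .agg(conversion_rate="mean", sessions="count")
)

print("Blended averages:\n", blended, "\n")
print("Segmented by traffic source:\n", by_source)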
This effect is more likely to come into play if a large portion of your traffic comes from returning visitors rather than brand-new visitors (the kind landing pages with paid traffic tend to attract), so please be aware of it when making drastic changes to a webpage for a test.
Let me explain: The Novelty Effect happens when the engagement and interaction with one of your variations is substantially higher than previously, but only temporarily – giving you a false positive.
For example, if you launch a new checkout page and test it against the previous version, people will need to figure out how to use the new design. They’ll click around, spend more time figuring it out, and ultimately, it can give the impression that your variation is performing better than it really is.
Truth is, there’s still a chance the variation will perform worse in the long run – the lift you saw was simply the result of your changes being novel to users during the testing period.
Because the Novelty Effect is temporary, if you’re testing a variation that dramatically impacts your users’ flow, it is critical that you run your test for at least 4 weeks. In most cases, 4 weeks is enough time for the novelty to start wearing off and for the test results to stabilize.
If you have a variation that wins and you decide to implement it, be sure to keep tracking its performance in your analytics to ensure its long-term performance. And make sure to analyze session recordings and run usability tests in order to understand the user behavior that’s happening on your site.
Adobe recommends the following method to distinguish the Novelty Effect from a genuinely losing test variation.
“To determine if the new offer underperforms because of a Novelty Effect or because it’s truly inferior, you can segment your visitors into new and returning visitors and compare the conversion rates.
If it’s just the Novelty Effect, the new offer will win with new visitors. Eventually, as returning visitors get accustomed to the new changes, the offer will win with them, too.”
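Here’s a toy illustration of that check, with entirely made-up numbers: compute the challenger’s lift over the control separately for new and returning visitors. A large gap between the two segments suggests the reaction you’re seeing is tied to familiarity with the old design rather than to the design itself.

```python
# Made-up segment data:
# segment: (control_conversions, control_visitors, variation_conversions, variation_visitors)
segments = {
    "new visitors":       (180, 5_000, 230, 5_000),
    "returning visitors": (260, 5_000, 210, 5_000),
}

for segment, (c_conv, c_n, v_conv, v_n) in segments.items():
    control_rate = c_conv / c_n
    variation_rate = v_conv / v_n
    lift = (variation_rate - control_rate) / control_rate
    print(f"{segment}: control {control_rate:.1%}, variation {variation_rate:.1%}, lift {lift:+.0%}")
```

In this fictional example the variation wins with new visitors but loses with returning ones – exactly the pattern Adobe describes, where the returning-visitor dip is likely a temporary adjustment rather than proof the variation is inferior.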
Have you ever launched an A/B test and noticed wild fluctuations during the first few days of it being live?
That’s what we call Statistical Regression, also known as regression to the mean. It’s defined as “the phenomenon that if a variable is extreme on its first measurement, it will tend to be closer to the average on its second measurement.”
What this means is that if you end a test too early, or based only on reaching statistical significance, you’ll likely see a false positive. You may declare Variation A your winner, but if the test was stopped too early, Variation A may only have been the winner for those first few days. Had you let the test run longer, that variation might have shown no clear difference in performance compared to your original version, or it might even have lost.
Statistical Regression can’t be avoided. You’ll see large fluctuations in the early days no matter what.
But what can be avoided is letting it ruin your A/B test results and decisions that follow. The trick is to go against the grain of what your A/B testing tool may tell you:
Don’t end your test solely based on when you reach statistical significance.
It’s likely that your testing tool will tell you that you have a winning variation as soon as you hit your pre-determined statistical significance level.
Before you end a test, make sure you’ve reached a large enough sample size. You can determine this before testing with a tool like Optimizely’s sample size calculator.
There’s no magic number of conversions that guarantees you can end your A/B test, but as a general rule, don’t end a test before you have at least 100 conversions per variation.
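If you want to sanity-check what those calculators are doing, here’s a minimal sketch of the standard two-proportion sample-size formula they’re typically built on. The exact number you get will differ slightly from tool to tool depending on the assumptions each one bakes in.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variation to detect a given relative lift
    over a baseline conversion rate (two-sided test)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for 95% significance
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# e.g. a 3% baseline conversion rate and a 20% relative lift you want to detect
print(sample_size_per_variation(0.03, 0.20))  # roughly 13,900 visitors per variation
```

Notice how quickly the required sample grows when the baseline rate is low or the lift you want to detect is small – which is exactly why “it looks significant after three days” isn’t a stopping rule.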
Noah Lorang, a data analyst at Basecamp, has a great example of why sample size is important:
“If you stop your test as soon as you see “significant” differences, you might not have actually achieved the outcome you think you have.
As a simple example of this, imagine you have two coins, and you think they might be weighted. If you flip each coin 10 times, you might get heads on one all of the time, and tails on the other all of the time. If you run a statistical test comparing the portion of flips that got you heads between the two coins after these 10 flips, you’ll get what looks like a statistically significant result—if you stop now, you’ll think they’re weighted heavily in different directions.
If you keep going and flip each coin another 100 times, you might now see that they are in fact balanced coins and there is no statistically significant difference in the number of heads or tails.”
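If you’d like to see that point in numbers, here’s a quick simulation (assuming two perfectly fair “coins”, i.e. no real difference at all) comparing a “peek every few flips and stop at the first significant result” strategy against a single test at a fixed sample size:

```python
import random
from statistics import NormalDist

def z_test_two_proportions(h1, n1, h2, n2):
    """Two-sided p-value for a difference between two proportions (pooled z-test)."""
    p_pool = (h1 + h2) / (n1 + n2)
    se = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
    if se == 0:
        return 1.0
    z = (h1 / n1 - h2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(42)
experiments, peeking_false_positives, fixed_false_positives = 2_000, 0, 0

for _ in range(experiments):
    heads_a = heads_b = 0
    declared_early = False
    for flip in range(1, 1001):
        heads_a += random.random() < 0.5  # both coins are fair: no real difference exists
        heads_b += random.random() < 0.5
        # "Peek" every 50 flips and stop at the first p < 0.05
        if flip % 50 == 0 and not declared_early:
            if z_test_two_proportions(heads_a, flip, heads_b, flip) < 0.05:
                declared_early = True
    peeking_false_positives += declared_early
    fixed_false_positives += z_test_two_proportions(heads_a, 1000, heads_b, 1000) < 0.05

print(f"Peeking every 50 flips: {peeking_false_positives / experiments:.0%} false positives")
print(f"Single test at 1,000 flips: {fixed_false_positives / experiments:.0%} false positives")
```

Run it and you should see the single fixed-sample test call a false winner around the expected 5% of the time, while the peeking strategy does so several times more often – even though the coins are identical.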
If your testing tool is telling you that you have a winner a few days into your test (while statistical regression is still happening), should you end your test if you have reached your required sample size?
The answer is no. In most cases, you should run your tests for at least 3 to 4 weeks. You want your test results to normalize (read: to stop wildly fluctuating) before ending the test. And remember how the other validity threats such as the Novelty and History Effects can influence your tests?
Running your tests for longer will help smooth out the few days of unusual user behavior these effects can cause.
The History, Instrumentation, Selection and Novelty Effects, along with Statistical Regression, are five validity threats that could invalidate your A/B test data, giving you the illusion that one variation won when in reality, it lost.
Keep them in mind when analyzing your test data, and don’t forget to analyze your results in Google Analytics (or your favorite analytics tool) to see what truly lies behind averages, and spot the signs of flawed tests.
Implementing changes from an A/B test that you think won but that actually lost can be a major mistake. If we’re talking about higher-risk tests – such as those you might run on the cart page or in the checkout process – a drop in performance means a direct hit to your revenue.
The process of quality assurance and analyzing your results against validity threats may seem tedious, but at the end of the day, it’s worth it – and it’s completely counterproductive to skip it.
If you don’t test properly, why bother A/B testing at all?