8 annoying A/B testing mistakes every engineer should know

2023-06-16 05:30:41

1. Including unaffected users in your experiment

The first common mistake in A/B testing is including users in your experiment who aren't actually affected by the change you are testing. This dilutes your experiment results, making it harder to determine the impact of your changes.

Say you are testing a new feature in your app that rewards users for completing a certain action. You mistakenly include users who've already completed the action in the experiment. Since they aren't affected by the change, any metrics related to this action don't change, and thus the results for this experiment may not show a statistically significant change.

To avoid this mistake, make sure to filter out ineligible users in your code before including them in your experiment. Below is an example of how to do this:

// Incorrect. Will include unaffected users
function showNewChanges(user) {
  if (posthog.getFeatureFlag('experiment-key') === 'control') {
    return false
  }

  if (user.hasCompletedAction) {
    return false
  }

  // other checks

  return true
}

// Correct. Will exclude unaffected users
function showNewChanges(user) {
  if (user.hasCompletedAction) {
    return false
  }

  // other checks

  if (posthog.getFeatureFlag('experiment-key') === 'control') {
    return false
  }

  return true
}

2. Only viewing results in aggregate (aka Simpson's paradox)

It's possible for an experiment to show one outcome when analyzed at an aggregate level, but another when the same data is analyzed by subgroup.

For example, suppose you are testing a change to your sign-up and onboarding flow. The change affects both desktop and mobile users. Your experiment results show the following:

Variant | Visitors | Conversions | Conversion rate
Control | 5,000    | 500         | 10%
Test    | 5,000    | 1,000       | 20%

At first glance, the test variant seems to be the clear winner. However, breaking down the results into the desktop and mobile subgroups reveals:

Device  | Variant | Visitors | Conversions | Conversion rate
Desktop | Control | 2,000    | 400         | 20%
Desktop | Test    | 2,000    | 100         | 5%
Mobile  | Control | 3,000    | 100         | 3.3%
Mobile  | Test    | 3,000    | 900         | 30%

It's now clear the test variant performed better for mobile users, but it decreased desktop conversions – an insight we missed when we combined these metrics! This phenomenon is known as Simpson's paradox.
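To make the paradox concrete, here's a small sketch in plain JavaScript using the numbers from the tables above, showing how the same data yields opposite conclusions at the aggregate and subgroup level:

```javascript
// Experiment results, matching the tables above
const results = [
  { device: 'desktop', variant: 'control', visitors: 2000, conversions: 400 },
  { device: 'desktop', variant: 'test',    visitors: 2000, conversions: 100 },
  { device: 'mobile',  variant: 'control', visitors: 3000, conversions: 100 },
  { device: 'mobile',  variant: 'test',    visitors: 3000, conversions: 900 },
];

// Conversion rate for one variant over a set of rows
function rate(rows, variant) {
  const subset = rows.filter((r) => r.variant === variant);
  const visitors = subset.reduce((sum, r) => sum + r.visitors, 0);
  const conversions = subset.reduce((sum, r) => sum + r.conversions, 0);
  return conversions / visitors;
}

// Aggregate: test looks like the clear winner (20% vs. 10%)
console.log(rate(results, 'control')); // 0.1
console.log(rate(results, 'test'));    // 0.2

// By subgroup: test actually loses badly on desktop (5% vs. 20%)
const desktop = results.filter((r) => r.device === 'desktop');
console.log(rate(desktop, 'control')); // 0.2
console.log(rate(desktop, 'test'));    // 0.05
```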

Depending on your app and experiment, here's a list of aggregate metrics you should break down by:

  • User tenure
  • Geographic location
  • Subscription or pricing tier
  • Business size, e.g., small, medium, or large
  • Device type, e.g., desktop or mobile, iOS or Android
  • Acquisition channel, e.g., organic search, paid ads, or referrals
  • User role or job function, e.g., manager, executive, or individual contributor

3. Running an experiment without a predetermined duration

Starting an experiment without deciding how long it should last can cause issues. You may fall victim to the "peeking problem": checking the intermediate results for statistical significance, making decisions based on them, and ending your experiment too early. Without knowing how long your experiment should run, you cannot differentiate between intermediate and final results.

On the other hand, if you don't have enough statistical power (i.e., not enough users to obtain a significant result), you may waste weeks waiting for results. This is especially common in group-targeted experiments.

The solution is to use an A/B test running time calculator to determine whether you have the required statistical power to run your experiment, and for how long you should run it. This is built into PostHog.
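As a rough illustration of what such a calculator does (this is a back-of-the-envelope sketch using the common two-proportion approximation for alpha = 0.05 and 80% power, not PostHog's exact implementation), you can estimate the required duration up front:

```javascript
// Approximate required run time for a two-variant conversion experiment.
// Uses n per variant ≈ 16 * p(1-p) / delta^2, the standard rule of thumb
// for a 5% significance level and 80% power.
function requiredDaysPerVariant(baselineRate, relativeLift, dailyVisitorsPerVariant) {
  const delta = baselineRate * relativeLift;  // absolute effect size to detect
  const p = baselineRate + delta / 2;         // midpoint rate estimate
  const nPerVariant = Math.ceil((16 * p * (1 - p)) / (delta * delta));
  return Math.ceil(nPerVariant / dailyVisitorsPerVariant);
}

// e.g. 10% baseline conversion, aiming to detect a 10% relative lift,
// with 500 visitors per variant per day:
console.log(requiredDaysPerVariant(0.1, 0.1, 500)); // 31 (about a month)
```

If the answer comes back as months rather than weeks, that's a signal to test a bigger change or a higher-traffic surface instead.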

Setting up a new experiment in PostHog includes a recommended running time calculator.

4. Running an experiment without testing it first

Sometimes we're so eager to get results from our experiments that we jump straight to running them with all our users. This may be okay if everything is set up correctly, but if you've made a mistake, you may be unable to rerun your experiment. Why? Let me explain.

Imagine you're running an experiment with a 50/50 split between control and test. You roll out the experiment to all your users, but a day after launch, you discover that your change is causing the app to crash for all test users. You immediately stop the experiment and fix the root cause of the crash. However, restarting the experiment now will produce unreliable results, since many users have already seen your change.


To avoid this problem, you should first test your experiment with a small rollout (e.g., 5% of users) for a few days. Once you're confident everything works correctly, you can start the experiment with the rest of your users.

Here's a list of what to check during your test rollout:

  • Logging is working correctly
  • No increase in crashes or other errors
  • Use session replays to ensure your app is behaving as expected
  • Users are assigned to the control and test groups in the ratio you expect (e.g., 50/50)
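For the last item, a sample ratio mismatch (SRM) check is a common way to verify the split. Here's a minimal sketch using a chi-square goodness-of-fit test; the counts are illustrative:

```javascript
// Chi-square statistic for whether the observed control/test split
// deviates from the expected assignment ratio (default 50/50).
function srmChiSquare(controlCount, testCount, expectedControlShare = 0.5) {
  const total = controlCount + testCount;
  const expectedControl = total * expectedControlShare;
  const expectedTest = total * (1 - expectedControlShare);
  return (
    (controlCount - expectedControl) ** 2 / expectedControl +
    (testCount - expectedTest) ** 2 / expectedTest
  );
}

// With 1 degree of freedom, a statistic above ~3.84 means the split
// deviates from expectation at p < 0.05 — investigate before scaling up.
console.log(srmChiSquare(5000, 5100) > 3.84); // false: split looks fine
console.log(srmChiSquare(5000, 5500) > 3.84); // true: likely an assignment bug
```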

5. Neglecting counter metrics

Counter metrics measure unintended negative side effects of your experiments. If you don't monitor them, you may unintentionally roll out changes that result in a worse user experience.

For example, say you're testing a change to your sign-up page. While the number of sign-ups may increase, you notice that time spent in your app decreases. In this case, it may indicate that your new sign-up page is misleading users about what your app does, resulting in more sign-ups but also more churn.
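One hypothetical way to encode this as a ship/no-ship guardrail (the function name and regression threshold here are illustrative, not a PostHog API):

```javascript
// Ship only if the primary metric improves AND no counter metric
// regresses beyond an agreed tolerance (here, a 2% relative drop).
function safeToShip(primaryLift, counterMetricLifts, maxRegression = -0.02) {
  return primaryLift > 0 && counterMetricLifts.every((lift) => lift >= maxRegression);
}

// Sign-ups up 15%, but time-in-app down 10%: don't ship blindly.
console.log(safeToShip(0.15, [-0.10])); // false

// Sign-ups up 15%, time-in-app roughly flat: fine to ship.
console.log(safeToShip(0.15, [-0.01])); // true
```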

6. Not accounting for seasonality

Seasonal periods can cause significant changes in user behavior. People may be taking time off or focusing on different tasks. In a B2B context, seasonality can also affect the decision-making processes of businesses. For example, companies often have specific times of the year when they allocate budgets, review contracts, or make purchasing decisions. Conducting A/B tests during these periods can result in biased results that may not represent typical user behavior.

Here's a list of seasonal periods to be mindful of:

  • Holiday seasons – e.g., Christmas and New Year's, July and August summer holidays.
  • Fiscal year-end – Companies typically review budgeting decisions and contract renewals during this period.
  • Seasonal sales cycles – Some industries, like retail or agriculture, have seasonal sales cycles. Be aware of these cycles if your target customers operate in such industries.

7. Testing an unclear hypothesis

Having a clear definition of what you are testing and why will help you determine which metrics to measure and analyze. To understand this better, let's look at an example of a bad hypothesis:

Bad hypothesis: Changing the color of the "Proceed to checkout" button will increase purchases.

This is bad since it's unclear why we're testing this change and why we expect it to increase purchases. As a result, it's not clear what we need to measure in this experiment. Do we only need to count button clicks? Or is there something else we need to measure?

Here's an improved version:

Good hypothesis: User research showed that users are unsure how to proceed to the checkout page. Changing the button's color will lead to more users noticing it, and thus more people will proceed to the checkout page. This will then lead to more purchases.

It's now clear that we need to measure the following:

  • Button clicks, to show whether the color change leads to more people noticing the button
  • Number of purchases, since more people arriving at the checkout page should mean more purchases

This also makes it easier to analyze the experiment results, especially when they aren't what we expected. For example:

  • If the color change didn't increase button clicks, the page may have a different issue.
  • If button clicks increased but purchases didn't, there may be an issue with the checkout page.

8. Relying too much on A/B tests for decision-making

Not everything that can be measured matters, and not everything that matters can be measured. It's important to remember that there can be reasons for shipping things besides metric changes, such as fixing user pain points or creating enjoyable user experiences.

Raquel, one of our growth engineers here at PostHog, shares an example:

"We ran an experiment on our sign-up page to make our social login buttons more prominent (e.g., 'Sign up with Google' and 'Sign up with GitHub') as an alternative to signing up with email and password. While more people signed up using Google and GitHub, overall sign-ups didn't increase, and neither did activation. Ultimately, we decided to ship the change since we felt that social login lowers friction and provides a better user experience."

You should now have a better understanding of common experimentation pitfalls to avoid. To quote Emily Robinson: Producing numbers is easy; producing numbers you should trust is hard!

To read more about experimentation, check out our other guides.
