Is your A/B Testing program failing to translate into business impact?

Have you ever had two analysts pull data in perfectly accurate, logical ways, only to reach different conclusions? A clear, holistic understanding of your strategy is necessary to maximize your optimization potential. Ask yourself these questions to get trustworthy, reproducible results.

Who is my “normal” customer? Does this audience include irregular customer behavior that should be excluded?

Outlier Identification and Exclusion Principles

While a common best practice is excluding internal IP addresses to remove site QA activity, many businesses overlook the exclusion of rare or odd customer behavior. In some instances, unusual customer cases can dramatically skew test results.

For example, are you seeing certain customers adding 200+ of the same item to their cart within a 2-hour period? If your primary KPI is Cart Additions/Customer, you might want to consider excluding these rare super users from your test read. This isn’t to say you shouldn’t optimize for your most valued customers, but do so with intention and recognize they may be a small minority where personalization could be especially useful.
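As a minimal sketch of this idea, the snippet below computes Cart Additions/Customer with and without a super-user exclusion. The customer IDs, counts, and the 200-addition threshold are illustrative assumptions, not data from any real test:

```python
from statistics import mean

# Hypothetical per-customer cart-addition counts from one test variant.
cart_adds = {"c1": 3, "c2": 5, "c3": 2, "c4": 250, "c5": 4}

OUTLIER_THRESHOLD = 200  # e.g. 200+ additions of the same item in a 2-hour window

def cart_adds_per_customer(counts, exclude_outliers=False):
    """Cart Additions / Customer, optionally excluding rare super users."""
    values = [v for v in counts.values()
              if not (exclude_outliers and v >= OUTLIER_THRESHOLD)]
    return mean(values)

print(cart_adds_per_customer(cart_adds))                        # 52.8 (skewed)
print(cart_adds_per_customer(cart_adds, exclude_outliers=True)) # 3.5
```

One super user shifts the KPI from 3.5 to 52.8, which is the kind of distortion that can flip a test read.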

Is it important that the customer participated in this behavior, or is it important that this behavior occurred multiple times? 

Multiple Instances of KPI Actions vs. KPI Participation 

Another way to ensure metrics are not skewed by unusual activity is to build them on a participation basis rather than as a sum of all instances of the action. The formula would be: Sessions where X action occurred / Sessions, or Visitors who performed X action / Visitors.

This works best in instances where the test promotes an initial action occurrence, rather than repeated or incremental actions.

For example, a test based on customer acquisition might consider Customers who Order / Customers rather than Sum of Orders / Customers, as the concern is not how many times the customer ordered, but if they ordered at all.
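The acquisition example above can be sketched as follows. The order counts are hypothetical, chosen only to show how the two formulas diverge on the same data:

```python
# Hypothetical order counts per customer in one test variant.
orders_per_customer = {"a": 0, "b": 1, "c": 5, "d": 0, "e": 1}

customers = len(orders_per_customer)

# Participation basis: Customers who Order / Customers
participation_rate = sum(1 for n in orders_per_customer.values() if n > 0) / customers

# Instance basis: Sum of Orders / Customers
orders_per_capita = sum(orders_per_customer.values()) / customers

print(participation_rate)  # 0.6 -- 3 of 5 customers ordered
print(orders_per_capita)   # 1.4 -- inflated by one repeat buyer
```

For an acquisition test, the participation rate (0.6) answers the actual question, while the per-capita sum (1.4) lets one repeat buyer dominate the read.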

Should the level of granularity of my test read reflect visitor-level data or session-level data?

Audience vs. Session-Level Test Evaluation

When running a high-risk test, it’s important to ensure the test read accurately represents the customer engagement and financial impact on the business. Different ways of looking at the data can generate opposing results, affecting your business’s long-term optimization potential and your understanding of user preferences and behavior patterns.

The nature of the test should determine how the response should be analyzed.

Does the primary KPI of this test take place over a short conversion period, or, in other words, drive an activity that would be expected to take place within a single visit? Or is it dependent upon long-term client behaviors – farsighted activities that build across multiple visits by a single visitor, such as customer loyalty or high-impact purchases?

If the primary motive of the test is the promotion of short-term actions, such as coupon redemption or habitual micro-purchases (ex. food delivery, audiobook rentals, household goods), the test should be evaluated at the Visit or Session level.

On the other hand, if the primary motive of the test is the promotion of long-term behaviors, dissonance reduction, or complex buying behaviors (expensive or impactful purchases), requiring more frequent visits over a longer planning period, KPIs should be evaluated at the Visitor level.
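To show how the two levels of granularity can disagree on identical data, here is a small sketch. The event log is hypothetical, structured as (visitor, session, converted) tuples:

```python
# Hypothetical event log: (visitor_id, session_id, converted) per session.
events = [
    ("v1", "s1", True),
    ("v1", "s2", False),
    ("v1", "s3", True),
    ("v2", "s4", False),
    ("v2", "s5", False),
    ("v3", "s6", True),
]

# Session-level read: Sessions with a conversion / Sessions
session_rate = sum(1 for _, _, c in events if c) / len(events)

# Visitor-level read: Visitors with any conversion / Visitors
converted_visitors = {v for v, _, c in events if c}
all_visitors = {v for v, _, _ in events}
visitor_rate = len(converted_visitors) / len(all_visitors)

print(session_rate)  # 0.5  -- 3 of 6 sessions converted
print(visitor_rate)  # 2 of 3 visitors converted
```

The same six sessions yield a 50% session-level rate but a roughly 67% visitor-level rate, which is why the nature of the test should dictate the level of analysis.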


To validate a potential improvement in an interaction between your business and your end user, the KPIs, goals, and target audience must be clearly defined. If these are not accurately represented when evaluating results, the learnings can be misinterpreted, or worse, features that actually produced a negative impact could be rolled out.

To further ensure validity of results with regard to high risk experiments, you may consider adjusting your confidence and power levels for statistical significance. 
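One concrete consequence of tightening confidence and power is a larger required sample. The sketch below uses the standard normal-approximation sample-size formula for a two-sided two-proportion z-test; the baseline and target conversion rates are illustrative assumptions:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a move from
    p1 to p2 with a two-sided two-proportion z-test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for confidence level
    z_power = z.inv_cdf(power)           # critical value for desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Hypothetical test: lifting conversion from 10% to 12%.
standard = sample_size_per_variant(0.10, 0.12)                         # 95% conf, 80% power
stricter = sample_size_per_variant(0.10, 0.12, alpha=0.01, power=0.90) # 99% conf, 90% power
print(standard, stricter)
```

Moving from 95% confidence / 80% power to 99% confidence / 90% power substantially increases the visitors needed per variant, a trade-off worth making for high-risk experiments.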

An organization that synchronizes its evaluation practices to produce consistent, universally accepted results will accelerate its A/B testing practice, freeing up time and resources and delivering the most customer-friendly and profitable experience.

Photo credit: Aleks Marinkovic via Unsplash

Written By

Hannah Alexander

Hannah Alexander, Associate Director of Experimentation & Strategy, leads the Experimentation & Strategy practice at Evolytics, inclusive of A/B testing, personalization, conversion rate optimization, and strategy planning for clients such as Vail, Sephora, and HSBC. She is an Adobe Analytics certified expert who is further known at Evolytics for having developed the Evolytics Hybrid Analytics Workforce Team, designed to help new-to-analytics employees develop a personalized analytics career path.