How to Perform A/B Testing Analysis With Statistical Tools

A/B testing compares two versions of a web page, email, or app screen to determine which performs better on a specific metric. Running the test is only half the work; analyzing the results correctly is what separates a reliable experiment from a misleading one. This article explains the statistical methods behind A/B test analysis and the tools you can use to perform them.
Understanding the Hypothesis Framework
Every A/B test starts with a hypothesis. For example: "Changing the call-to-action button from green to orange will increase the click-through rate." The null hypothesis (H0) states that there is no difference between the two versions. The alternative hypothesis (H1) states that the orange button has a different click-through rate than the green button. This two-sided form, which also detects a decrease, is the usual default even when you expect an improvement. Your analysis determines whether the data provides enough evidence to reject the null hypothesis.
Before running the test, you need to define your primary metric (click-through rate, conversion rate, revenue per visitor), your minimum detectable effect (the smallest difference you care about), your significance level (typically 0.05, meaning you accept a 5 percent chance of a false positive), and your statistical power (typically 0.80, meaning you have an 80 percent chance of detecting a real effect if one exists). These parameters determine the sample size you need.
Calculating Sample Size Before the Test
Running a test with too few visitors leaves it underpowered: a real difference can exist and the test will still fail to detect it. Running it with too many visitors wastes time and resources. Use a sample size calculator to determine the required sample size before launching the test. For a conversion rate test where the current rate is 5 percent, the minimum detectable effect is a 10 percent relative improvement (from 5 percent to 5.5 percent), the significance level is 0.05 (two-sided), and the power is 0.80, you need roughly 31,000 visitors per variation. The exact figure varies slightly with the approximation a given calculator uses.

Online calculators like Evan Miller's Sample Size Calculator or Optimizely's sample size calculator, or the statsmodels.stats.power module in Python, can compute this for you. Enter your baseline conversion rate, minimum detectable effect, significance level, and power, and the calculator outputs the required sample size per variation.
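The arithmetic behind such a calculator can be sketched with the Python standard library alone. This is a minimal sketch, assuming the common normal-approximation formula for a two-sided test on two proportions; the function name sample_size_per_variation and its parameters are illustrative, and real calculators may use a slightly different approximation.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation (two-sided, two proportions)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)          # rate we want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of the two Bernoulli variances
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)


# Scenario from the text: 5 percent baseline, 10 percent relative lift.
n = sample_size_per_variation(0.05, 0.10)
print(n)
```

For the scenario above, the result lands close to the per-variation figure quoted in the text. Halving the minimum detectable effect roughly quadruples the required sample, which is why the effect size deserves careful thought before launch.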
Analyzing Results: The Chi-Square Test
For conversion rate comparisons (binary outcomes: converted or not), the chi-square test is the standard method. You construct a 2x2 contingency table with the counts of conversions and non-conversions for each variation. The test calculates a chi-square statistic and a p-value. If the p-value is below your significance level (0.05), you reject the null hypothesis and conclude that the difference between variations is statistically significant.
In Python, use scipy.stats.chi2_contingency to run this test. Pass a 2D array with your conversion counts: chi2, p_value, dof, expected = chi2_contingency([[conversions_A, non_conversions_A], [conversions_B, non_conversions_B]]). For a 2x2 table the function applies Yates' continuity correction by default; pass correction=False if you want the uncorrected statistic. The p-value tells you the probability of observing a difference this large or larger if there were no real difference between the versions.
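To see what the test computes, here is a minimal standard-library sketch of the uncorrected chi-square statistic for a 2x2 table. The counts are hypothetical, and the function chi_square_2x2 is an illustration rather than a replacement for scipy. Because it skips the continuity correction, its result will differ slightly from chi2_contingency called with default settings.

```python
from math import erfc, sqrt


def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Uncorrected chi-square test on a 2x2 table of conversions vs. non-conversions."""
    observed = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, (n_a - conv_a) + (n_b - conv_b)]
    total = n_a + n_b

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if conversion did not depend on the variation.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 500 of 10,000 visitors convert on A, 580 of 10,000 on B.
chi2, p_value = chi_square_2x2(500, 10_000, 580, 10_000)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```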
Analyzing Results: The T-Test for Continuous Metrics
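For continuous metrics like revenue per visitor, average order value, or time on page, use a two-sample t-test instead of chi-square. The t-test compares the means of two groups and accounts for the variance within each group. In Python, use scipy.stats.ttest_ind(group_A, group_B). By default this function assumes both groups have equal variance; pass equal_var=False to run Welch's t-test, which is the safer choice when the two variances may differ. The output includes the t-statistic and the p-value.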
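The calculation behind Welch's version of the test can be sketched as follows. This is a standard-library illustration on simulated data, not the scipy implementation: the function welch_t is hypothetical, the statistic is computed as B minus A, and the p-value uses a normal approximation that is only reasonable for the large samples typical of A/B tests. A library such as scipy uses the t-distribution instead.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic, its degrees of freedom, and a large-sample p-value."""
    n_a, n_b = len(group_a), len(group_b)
    va, vb = variance(group_a) / n_a, variance(group_b) / n_b  # squared standard errors
    t_stat = (mean(group_b) - mean(group_a)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    # ASSUMPTION: normal approximation to the t-distribution (fine when dof is large).
    p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, dof, p_approx


# Simulated order values (hypothetical): skewed data with means near 20 and 22.
rng = random.Random(42)
group_a = [rng.expovariate(1 / 20) for _ in range(2000)]
group_b = [rng.expovariate(1 / 22) for _ in range(2000)]

t_stat, dof, p_approx = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, dof = {dof:.0f}, p (approx) = {p_approx:.4f}")
```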

Always report the confidence interval alongside the p-value. A p-value tells you whether the difference is significant, but the confidence interval tells you the range of plausible values for the true difference. For example, "The orange button increased click-through rate by 1.2 percentage points (95 percent CI: 0.3 to 2.1 percentage points)" is more informative than "The orange button increased click-through rate (p = 0.01)."
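A confidence interval for the difference between two conversion rates can be sketched with the standard library. This is a minimal sketch, assuming the simple Wald (normal-approximation) interval with unpooled variances; the counts are hypothetical and the function name diff_confidence_interval is illustrative. Statistical libraries offer more refined intervals for small samples or extreme rates.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the difference between two proportions.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: 500 of 10,000 visitors convert on A, 580 of 10,000 on B.
diff, low, high = diff_confidence_interval(500, 10_000, 580, 10_000)
print(f"difference = {diff * 100:.2f} pp, "
      f"95% CI: {low * 100:.2f} to {high * 100:.2f} pp")
```

An interval that excludes zero agrees with a significant two-sided test at the matching level, and its width shows how precisely the effect has been measured.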
Tools for A/B Test Analysis
Dedicated A/B testing platforms like Optimizely and VWO handle both test execution and statistical analysis. (Google Optimize, once a popular free option, was discontinued in September 2023; it was not folded into Google Analytics 4, which instead integrates with third-party testing tools.) These platforms track visitor assignments and calculate results in real time, and some use sequential testing methods that allow you to peek at results without inflating the false positive rate. Optimizely's Stats Engine, for example, uses a sequential testing approach that controls the false discovery rate across multiple metrics and variations.
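If you analyze test results manually, use R or Python. In R, the prop.test() function performs a proportion test (equivalent to chi-square for binary outcomes), and the t.test() function handles continuous metrics; both report a confidence interval in their output. In Python, the statsmodels library provides statsmodels.stats.proportion.proportions_ztest() for conversion rates and statsmodels.stats.weightstats.ttest_ind() for continuous metrics. These two functions return only the test statistic and p-value (plus degrees of freedom for the t-test), so compute confidence intervals separately, for example with statsmodels.stats.proportion.confint_proportions_2indep() or the tconfint_diff() method of statsmodels.stats.weightstats.CompareMeans.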
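The equivalence between the proportion test and the chi-square test can be checked numerically. The sketch below assumes the pooled-variance form of the two-proportion z-test and no continuity correction; under those assumptions the squared z statistic equals the chi-square statistic of the 2x2 table. Counts and function names are hypothetical.

```python
from math import sqrt


def pooled_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic using the pooled conversion rate."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (conv_b / n_b - conv_a / n_a) / se


def chi_square_2x2(conv_a, n_a, conv_b, n_b):
    """Uncorrected chi-square statistic via the 2x2 shortcut formula."""
    a, b = conv_a, n_a - conv_a
    c, d = conv_b, n_b - conv_b
    total = n_a + n_b
    return total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


z = pooled_z(500, 10_000, 580, 10_000)
chi2 = chi_square_2x2(500, 10_000, 580, 10_000)
print(f"z^2 = {z ** 2:.6f}, chi2 = {chi2:.6f}")  # the two values agree
```

Because the statistics coincide, the two-sided p-values do too; differences between tools usually come from continuity corrections or from pooled versus unpooled variance.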
Common Pitfalls to Avoid
Peeking at results before reaching the required sample size is the most common mistake. If you check the results repeatedly and stop the test as soon as the p-value dips below your threshold, you give chance many opportunities to produce a "significant" result, and the real false positive rate climbs well above the nominal 5 percent. This is called "optional stopping" and it invalidates the p-value. Either commit to a fixed sample size before the test or use a sequential testing method that accounts for interim analyses.
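The effect of optional stopping can be illustrated with a small A/A simulation, in which both arms share the same true conversion rate so that every "significant" result is a false positive. This is a sketch under assumed settings (500 simulated tests, 1,000 visitors per arm, a look every 100 visitors, a 10 percent conversion rate, a pooled z-test); all names and numbers are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_significant(conv_a, conv_b, n, z_crit):
    """Pooled two-proportion z-test with n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # no variation in the data, nothing to test
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_b - conv_a) / n / se > z_crit


def simulate(n_sims=500, n_per_arm=1000, look_every=100, rate=0.10,
             alpha=0.05, seed=0):
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        peeker_rejected = False
        for i in range(1, n_per_arm + 1):
            conv_a += rng.random() < rate  # both arms have the same true rate
            conv_b += rng.random() < rate
            if i % look_every == 0 and not peeker_rejected:
                # The peeker declares a winner at the first significant look.
                peeker_rejected = z_significant(conv_a, conv_b, i, z_crit)
        peeking_hits += peeker_rejected
        # The disciplined analyst tests once, at the planned sample size.
        fixed_hits += z_significant(conv_a, conv_b, n_per_arm, z_crit)
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = simulate()
print(f"false positive rate with peeking: {peeking_rate:.3f}")
print(f"false positive rate, single look: {fixed_rate:.3f}")
```

The single-look rate should sit near the nominal 5 percent, while the peeking rate is noticeably higher, and it grows with the number of looks.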

Other pitfalls include testing too many variations simultaneously (which requires multiple testing corrections like the Bonferroni adjustment), ignoring segmentation effects (a variation might work well for new visitors but poorly for returning visitors), and conflating statistical significance with practical significance. A 0.1 percent increase in conversion rate might be statistically significant with a large enough sample, but it may not be worth the engineering effort to implement the change.
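The Bonferroni adjustment mentioned above is simple enough to sketch directly: divide the significance level by the number of comparisons and test each p-value against that stricter threshold. The p-values below are hypothetical, and the function name bonferroni_significant is illustrative.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which comparisons remain significant after a Bonferroni adjustment."""
    threshold = alpha / len(p_values)  # stricter per-comparison threshold
    return [p <= threshold for p in p_values]


# Hypothetical p-values for three variations, each compared against the control.
p_values = [0.012, 0.030, 0.200]
flags = bonferroni_significant(p_values)
print(flags)  # [True, False, False]: only 0.012 is below 0.05 / 3
```

Note that the second comparison would have passed an unadjusted 0.05 threshold. Bonferroni is conservative, so with many variations a less strict correction may be preferable.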
Building a Testing Culture
Effective A/B testing requires more than statistical tools. It requires a culture where decisions are based on data rather than opinions. Start with a hypothesis, design the test carefully, run it to completion, and accept the results even if they contradict your expectations. Document every test: what you tested, why, what the results were, and what you learned. This documentation builds institutional knowledge and prevents the organization from repeating the same experiments.
Prioritize your tests based on expected impact and ease of implementation. A test on your checkout page (high traffic, direct revenue impact) should take priority over a test on your about page (low traffic, indirect impact). Use a testing roadmap to plan your experiments for the quarter, and track the cumulative impact of all tests on your conversion rate. Over time, the compounding effect of many small improvements can significantly increase your overall conversion rate.