ab-test-analysis
Rigorous A/B test statistical analysis. Use when analyzing experiment results, calculating statistical significance, checking for sample ratio mismatch, or validating test design before launch.
When & Why to Use This Skill
This Claude skill provides a comprehensive framework for rigorous A/B test statistical analysis, enabling data-driven decision-making with scientific precision. It automates complex statistical workflows including significance testing (p-values, z-scores), power analysis, and Sample Ratio Mismatch (SRM) detection to ensure experiment integrity. By evaluating both primary success metrics and technical guardrails, it transforms raw experiment data into actionable rollout recommendations while minimizing the risk of false positives or technical bias.
Use Cases
- Post-Experiment Analysis: Calculating statistical significance and relative uplift for conversion rates or revenue metrics to decide whether to ship a new feature.
- Data Integrity Validation: Detecting Sample Ratio Mismatch (SRM) to identify underlying technical issues in randomization or tracking before interpreting results.
- Pre-Launch Test Design: Estimating required sample sizes and minimum detectable effects (MDE) to ensure the experiment is sufficiently powered to yield meaningful results.
- Guardrail Monitoring: Analyzing secondary metrics like page load speed or error rates to ensure that a 'winning' variant doesn't cause unintended negative side effects.
- Stakeholder Reporting: Generating professional, visualized A/B test reports including confidence intervals and business impact estimates for executive review.
A/B Test Analysis
Quick Start
Analyze experiment results with statistical rigor, including significance testing, power analysis, sample ratio checks, and actionable recommendations for rolling out changes.
Context Requirements
Before analyzing the test, I need:
- Test Design: What was tested (variants, hypothesis, metric)
- Test Data: Results from each variant
- Randomization Unit: User, session, page view, etc.
- Primary Metric: Success metric for decision-making
- Guardrail Metrics (optional): Metrics that shouldn't degrade
- Test Duration: When test started/ended
Context Gathering
For Test Design:
"Tell me about the experiment:
What did you test?
- Control (baseline): Current experience
- Treatment (variant): What changed?
- Example: 'Control: Blue button' vs 'Treatment: Green button'
What's your hypothesis?
- Example: 'Green button will increase conversions by 10%'
Randomization level:
- User-level (recommended): Each user always sees same variant
- Session-level: User might see different variants across sessions
- Page-view level: Randomize every page load
Which did you use?"
For Test Data:
"I need the results data. Provide:
Option 1 - Summary Stats:
Control: Users: 10,000 | Conversions: 1,200 (12.0%)
Treatment: Users: 10,000 | Conversions: 1,350 (13.5%)
Option 2 - User-Level Data:
user_id | variant | converted | revenue | ...
123 | control | TRUE | 50.00 | ...
456 | treatment | FALSE | 0 | ...
Option 3 - Daily Aggregates:
date | variant | users | conversions
2024-12-01 | control | 500 | 60
2024-12-01 | treatment | 500 | 68
Which format works for you?"
For Metrics:
"What metrics are you tracking?
Primary Metric (decision metric):
- Conversion rate, revenue per user, time on site, etc.
- This determines success/failure
Secondary Metrics (nice to know):
- Supporting metrics that provide context
Guardrail Metrics (must not degrade):
- Page load time, error rate, support tickets
- Treatment must not worsen these
What's your primary metric?"
For Test Parameters:
"To calculate statistical significance, I need:
Minimum Detectable Effect (MDE):
- What % improvement would make it worth rolling out?
- Industry standard: 2-5% for conversion rates
Significance Level (α):
- Standard: 0.05 (5% false positive rate)
- Use default unless you have specific requirements
Power (1-β):
- Standard: 0.80 (80% chance to detect real effect)
- Use default unless you have specific requirements
Should we use standard parameters (5% significance, 80% power, 2% MDE)?"
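If the user wants to sanity-check these parameters before launch, here is a minimal pre-launch sizing sketch (normal approximation for two proportions). The 12% baseline rate and 50,000 users/day of traffic are illustrative assumptions, not values this skill supplies:
import numpy as np
from scipy import stats

def required_sample_size(baseline_rate, mde_relative, alpha=0.05, power=0.80):
    """Per-variant sample size to detect a relative lift of mde_relative."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return int(np.ceil(n))

n_per_variant = required_sample_size(baseline_rate=0.12, mde_relative=0.02)  # assumed 12% baseline, 2% relative MDE
daily_traffic = 50_000  # assumed users/day entering the experiment
days_needed = np.ceil(2 * n_per_variant / daily_traffic)
print(f"Required per variant: {n_per_variant:,} users (~{days_needed:.0f} days at {daily_traffic:,} users/day)")
Small relative MDEs drive the sample size up quickly, which is why a 2% MDE often implies a multi-week test.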
For Sample Ratio:
"How were users split between variants?
Target Allocation:
- 50/50 (most common)
- 90/10 (if testing risky change)
- 33/33/34 (three variants)
Actual Allocation:
- I'll check if actual split matches target
- Sample Ratio Mismatch (SRM) indicates technical issues"
Workflow
Step 1: Load and Validate Test Data
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
# Load test data
test_data = pd.read_csv('ab_test_results.csv')
print(f"📊 Test Data Loaded:")
print(f" Total Users: {len(test_data):,}")
print(f" Control: {(test_data['variant'] == 'control').sum():,}")
print(f" Treatment: {(test_data['variant'] == 'treatment').sum():,}")
print(f" Primary Metric: conversion_rate")
Checkpoint: "Data loaded. Sample sizes look reasonable?"
Step 2: Sample Ratio Mismatch (SRM) Check
def check_sample_ratio_mismatch(test_data, expected_ratio=0.5):
"""
Check if actual variant split matches expected
SRM indicates technical issues with randomization
"""
control_count = (test_data['variant'] == 'control').sum()
treatment_count = (test_data['variant'] == 'treatment').sum()
total = len(test_data)
# Expected counts
expected_control = total * expected_ratio
expected_treatment = total * (1 - expected_ratio)
# Chi-square test
chi2_stat = (
(control_count - expected_control)**2 / expected_control +
(treatment_count - expected_treatment)**2 / expected_treatment
)
# Critical value for 1 degree of freedom at α=0.001 (very strict)
critical_value = 10.828 # chi2.ppf(0.999, df=1)
p_value = 1 - stats.chi2.cdf(chi2_stat, df=1)
srm_detected = chi2_stat > critical_value
results = {
'control_count': control_count,
'treatment_count': treatment_count,
'control_pct': control_count / total * 100,
'treatment_pct': treatment_count / total * 100,
'expected_ratio': f"{expected_ratio*100:.0f}/{(1-expected_ratio)*100:.0f}",
'chi2_stat': chi2_stat,
'p_value': p_value,
'srm_detected': srm_detected
}
return results
srm = check_sample_ratio_mismatch(test_data, expected_ratio=0.5)
print(f"\n🔍 Sample Ratio Mismatch Check:")
print(f" Expected: {srm['expected_ratio']}")
print(f" Actual: {srm['control_pct']:.1f}% / {srm['treatment_pct']:.1f}%")
print(f" Chi-square: {srm['chi2_stat']:.2f}")
print(f" P-value: {srm['p_value']:.4f}")
if srm['srm_detected']:
print(f" ⚠️ SRM DETECTED - Investigate randomization issue!")
else:
print(f" ✅ No SRM - Randomization looks good")
Step 3: Calculate Metrics by Variant
def calculate_variant_metrics(test_data, metric_col='converted'):
"""Calculate key metrics for each variant"""
variants = {}
for variant_name in ['control', 'treatment']:
variant_data = test_data[test_data['variant'] == variant_name]
n = len(variant_data)
successes = variant_data[metric_col].sum()
success_rate = successes / n
# Standard error for proportion
se = np.sqrt(success_rate * (1 - success_rate) / n)
# 95% confidence interval
ci_lower = success_rate - 1.96 * se
ci_upper = success_rate + 1.96 * se
variants[variant_name] = {
'n': n,
'successes': successes,
'rate': success_rate,
'se': se,
'ci_lower': ci_lower,
'ci_upper': ci_upper
}
return variants
metrics = calculate_variant_metrics(test_data, 'converted')
print(f"\n📊 Variant Performance:")
for variant_name, variant_stats in metrics.items():  # renamed to avoid shadowing scipy's stats module
print(f"\n {variant_name.upper()}:")
print(f"   Sample Size: {variant_stats['n']:,}")
print(f"   Conversions: {variant_stats['successes']:,}")
print(f"   Rate: {variant_stats['rate']:.3%}")
print(f"   95% CI: [{variant_stats['ci_lower']:.3%}, {variant_stats['ci_upper']:.3%}]")
Step 4: Statistical Significance Test
def test_statistical_significance(control, treatment):
"""
Two-proportion z-test for statistical significance
"""
# Pool the proportions
pooled_p = (control['successes'] + treatment['successes']) / (control['n'] + treatment['n'])
pooled_se = np.sqrt(pooled_p * (1 - pooled_p) * (1/control['n'] + 1/treatment['n']))
# Calculate z-score
diff = treatment['rate'] - control['rate']
z_score = diff / pooled_se
# Two-tailed p-value
p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
# Effect size (relative uplift)
relative_uplift = (treatment['rate'] - control['rate']) / control['rate']
# Absolute uplift
absolute_uplift = treatment['rate'] - control['rate']
# Confidence interval for the difference
se_diff = np.sqrt(
control['rate'] * (1 - control['rate']) / control['n'] +
treatment['rate'] * (1 - treatment['rate']) / treatment['n']
)
ci_lower = diff - 1.96 * se_diff
ci_upper = diff + 1.96 * se_diff
results = {
'absolute_uplift': absolute_uplift,
'relative_uplift': relative_uplift,
'z_score': z_score,
'p_value': p_value,
'significant': p_value < 0.05,
'ci_lower': ci_lower,
'ci_upper': ci_upper
}
return results
sig_test = test_statistical_significance(metrics['control'], metrics['treatment'])
print(f"\n📈 Statistical Significance Test:")
print(f" Absolute Uplift: {sig_test['absolute_uplift']:.3%}")
print(f" Relative Uplift: {sig_test['relative_uplift']:+.1%}")
print(f" 95% CI: [{sig_test['ci_lower']:.3%}, {sig_test['ci_upper']:.3%}]")
print(f" Z-score: {sig_test['z_score']:.2f}")
print(f" P-value: {sig_test['p_value']:.4f}")
if sig_test['significant']:
print(f" ✅ STATISTICALLY SIGNIFICANT (p < 0.05)")
if sig_test['relative_uplift'] > 0:
print(f" 📈 Treatment WINS")
else:
print(f" 📉 Treatment LOSES")
else:
print(f" ❌ NOT SIGNIFICANT - No clear winner")
Step 5: Power Analysis
def calculate_achieved_power(control, treatment, alpha=0.05):
"""
Calculate the statistical power achieved in the test
"""
# Effect size (Cohen's h for proportions)
p1 = control['rate']
p2 = treatment['rate']
effect_size = 2 * (np.arcsin(np.sqrt(p2)) - np.arcsin(np.sqrt(p1)))
# Critical z-value for two-tailed test
z_crit = stats.norm.ppf(1 - alpha/2)
# Standard error under alternative hypothesis
n = control['n'] # assuming equal sample sizes
se_alt = np.sqrt(p1*(1-p1)/n + p2*(1-p2)/n)
# Non-centrality parameter
ncp = (p2 - p1) / se_alt
# Power calculation
power = 1 - stats.norm.cdf(z_crit - abs(ncp)) + stats.norm.cdf(-z_crit - abs(ncp))
return {
'effect_size': effect_size,
'power': power,
'sample_size_per_variant': n
}
power_analysis = calculate_achieved_power(metrics['control'], metrics['treatment'])
print(f"\n⚡ Power Analysis:")
print(f" Effect Size (Cohen's h): {power_analysis['effect_size']:.3f}")
print(f" Achieved Power: {power_analysis['power']:.1%}")
print(f" Sample Size per Variant: {power_analysis['sample_size_per_variant']:,}")
if power_analysis['power'] < 0.80:
print(f" ⚠️ UNDERPOWERED - Results less reliable")
else:
print(f" ✅ Well-powered test")
Step 6: Guardrail Metrics Check
def check_guardrail_metrics(test_data, guardrail_metrics=['page_load_time', 'error_rate']):
"""
Ensure treatment doesn't degrade important guardrail metrics
"""
print(f"\n🛡️ Guardrail Metrics Check:")
guardrail_results = []
for metric in guardrail_metrics:
if metric not in test_data.columns:
continue
control_data = test_data[test_data['variant'] == 'control'][metric]
treatment_data = test_data[test_data['variant'] == 'treatment'][metric]
# T-test for continuous metrics
t_stat, p_value = stats.ttest_ind(treatment_data, control_data)
control_mean = control_data.mean()
treatment_mean = treatment_data.mean()
change = ((treatment_mean - control_mean) / control_mean) * 100
# Check if treatment is worse (degraded): increases are bad for time/error/bounce
# metrics, decreases are bad for score metrics
lower_is_better = any(k in metric.lower() for k in ['time', 'error', 'bounce'])
degraded = (change > 0 and lower_is_better) or \
(change < 0 and 'score' in metric.lower())
print(f"\n {metric}:")
print(f" Control: {control_mean:.2f}")
print(f" Treatment: {treatment_mean:.2f}")
print(f" Change: {change:+.1f}%")
if degraded and p_value < 0.05:
print(f" ⚠️ DEGRADED significantly (p={p_value:.4f})")
elif degraded:
print(f" ⚠️ Degraded but not significant")
else:
print(f" ✅ No degradation")
guardrail_results.append({
'metric': metric,
'control_mean': control_mean,
'treatment_mean': treatment_mean,
'change_pct': change,
'p_value': p_value,
'degraded': degraded and p_value < 0.05
})
return guardrail_results
guardrails = check_guardrail_metrics(test_data, ['page_load_time', 'bounce_rate'])
Step 7: Visualize Results
def plot_ab_test_results(metrics, sig_test):
"""Create comprehensive visualization of test results"""
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
# Plot 1: Conversion rates with confidence intervals
variants = ['Control', 'Treatment']
rates = [metrics['control']['rate'], metrics['treatment']['rate']]
ci_lower = [metrics['control']['ci_lower'], metrics['treatment']['ci_lower']]
ci_upper = [metrics['control']['ci_upper'], metrics['treatment']['ci_upper']]
x = np.arange(len(variants))
colors = ['#3498db', '#2ecc71' if sig_test['relative_uplift'] > 0 else '#e74c3c']
bars = ax1.bar(x, rates, color=colors, alpha=0.7, width=0.6)
ax1.errorbar(x, rates,
yerr=[np.array(rates) - np.array(ci_lower),
np.array(ci_upper) - np.array(rates)],
fmt='none', color='black', capsize=5, capthick=2)
# Add value labels
for i, (variant, rate) in enumerate(zip(variants, rates)):
ax1.text(i, rate + 0.01, f'{rate:.2%}', ha='center', fontweight='bold')
ax1.set_ylabel('Conversion Rate')
ax1.set_title('Conversion Rate by Variant\n(with 95% Confidence Intervals)')
ax1.set_xticks(x)
ax1.set_xticklabels(variants)
ax1.set_ylim(0, max(rates) * 1.2)
# Plot 2: Effect size visualization
uplift = sig_test['relative_uplift']
ci_lower_pct = (sig_test['ci_lower'] / metrics['control']['rate'])
ci_upper_pct = (sig_test['ci_upper'] / metrics['control']['rate'])
ax2.barh(['Effect'], [uplift * 100], color='green' if uplift > 0 else 'red', alpha=0.7)
ax2.errorbar([uplift * 100], ['Effect'],
xerr=[[uplift * 100 - ci_lower_pct * 100], [ci_upper_pct * 100 - uplift * 100]],
fmt='none', color='black', capsize=5, capthick=2)
# Add significance indicator
sig_text = "✅ Significant" if sig_test['significant'] else "❌ Not Significant"
ax2.text(uplift * 100, 0, f" {uplift*100:+.1f}%\n {sig_text}",
va='center', fontweight='bold')
ax2.axvline(0, color='black', linestyle='--', alpha=0.5)
ax2.set_xlabel('Relative Uplift (%)')
ax2.set_title(f'Treatment Effect\n(p-value: {sig_test["p_value"]:.4f})')
plt.tight_layout()
plt.savefig('ab_test_results.png', dpi=300, bbox_inches='tight')
plt.show()
plot_ab_test_results(metrics, sig_test)
Step 8: Generate Decision Recommendation
def generate_recommendation(sig_test, guardrails, power_analysis, srm, metrics):
"""
Provide clear recommendation based on all checks
"""
print(f"\n{'='*60}")
print("🎯 RECOMMENDATION")
print('='*60)
# Check for blockers
blockers = []
if srm['srm_detected']:
blockers.append("Sample Ratio Mismatch detected - randomization issue")
if power_analysis['power'] < 0.70:
blockers.append(f"Underpowered ({power_analysis['power']:.0%}) - results unreliable")
degraded_guardrails = [g for g in guardrails if g['degraded']]
if degraded_guardrails:
blockers.append(f"Guardrail metrics degraded: {[g['metric'] for g in degraded_guardrails]}")
# Make recommendation
if blockers:
print(f"\n❌ DO NOT SHIP - Critical Issues Found:\n")
for blocker in blockers:
print(f" • {blocker}")
recommendation = "DO_NOT_SHIP"
elif sig_test['significant'] and sig_test['relative_uplift'] > 0:
print(f"\n✅ RECOMMEND SHIPPING")
print(f"\n Treatment shows {sig_test['relative_uplift']:+.1%} improvement")
print(f" Statistically significant (p={sig_test['p_value']:.4f})")
print(f" No guardrail issues detected")
# Estimate impact
if 'control' in metrics:
baseline_rate = metrics['control']['rate']
sample_size = metrics['control']['n']
print(f"\n Expected Impact:")
print(f" If applied to {sample_size:,} users monthly:")
print(f" Additional conversions: {sample_size * sig_test['absolute_uplift']:+,.0f}/month")
recommendation = "SHIP"
elif sig_test['significant'] and sig_test['relative_uplift'] < 0:
print(f"\n❌ DO NOT SHIP")
print(f"\n Treatment shows {sig_test['relative_uplift']:.1%} degradation")
print(f" Statistically significant negative impact")
recommendation = "DO_NOT_SHIP"
else:
print(f"\n⚠️ NO CLEAR WINNER")
print(f"\n Treatment shows {sig_test['relative_uplift']:+.1%} change")
print(f" But NOT statistically significant (p={sig_test['p_value']:.4f})")
print(f"\n Options:")
print(f" 1. Ship if change is low-risk and directionally positive")
print(f" 2. Run longer to gather more data")
print(f" 3. Redesign with larger expected effect")
recommendation = "INCONCLUSIVE"
return recommendation
recommendation = generate_recommendation(sig_test, guardrails, power_analysis, srm, metrics)
Context Validation
Before proceeding, verify:
- Test ran long enough to reach the planned sample size (statistical power)
- Randomization was properly implemented
- No SRM (sample ratio mismatch) detected
- Primary metric is clearly defined
- Have baseline data for power calculations
- Understand minimum detectable effect needed
Output Template
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
A/B TEST ANALYSIS REPORT
Test: Green Button vs Blue Button
Period: Dec 1-15, 2024
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ RECOMMENDATION: SHIP TREATMENT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 RESULTS SUMMARY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Primary Metric: Conversion Rate
Control:
Sample: 10,000 users
Conversions: 1,200
Rate: 12.0% (95% CI: 11.3% - 12.7%)
Treatment:
Sample: 10,000 users
Conversions: 1,350
Rate: 13.5% (95% CI: 12.8% - 14.2%)
Effect:
Absolute: +1.5 percentage points
Relative: +12.5%
95% CI (absolute): [+0.6pp, +2.4pp]
Statistical Significance:
Z-score: 3.18
P-value: 0.0015
Result: ✅ SIGNIFICANT (p < 0.05)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ VALIDATION CHECKS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Sample Ratio: ✅ PASS
Expected: 50/50
Actual: 50.0% / 50.0%
No randomization issues detected
Statistical Power: ✅ PASS
Achieved: 89%
Effect Size: 0.045 (Cohen's h)
Guardrail Metrics: ✅ PASS
Page Load Time: No degradation
Bounce Rate: -2.1% (improved)
Error Rate: No change
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
💰 EXPECTED IMPACT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Monthly Volume: 300,000 users
Additional Conversions: +4,500/month
Revenue Impact: +$225,000/month
(assuming $50 avg order value)
Confidence: High
Risk: Low
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📋 NEXT STEPS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. ✅ Ship treatment to 100% of users
2. Monitor for 1 week post-launch
3. Verify the conversion rate stays elevated
4. Iterate: Test other button colors?
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📁 FILES GENERATED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✓ ab_test_results.png (visualization)
✓ statistical_analysis.csv (detailed metrics)
✓ power_analysis.txt (power calculations)
✓ guardrail_check.csv (all guardrail metrics)
Common Scenarios
Scenario 1: "Should we ship this new feature?"
→ Run full significance test → Check guardrail metrics → Calculate expected business impact → Provide clear ship/don't ship recommendation → Quantify confidence level
Scenario 2: "Test is inconclusive after 2 weeks"
→ Calculate achieved power → Determine if more time would help → Estimate time needed to reach significance → Recommend: run longer, redesign, or make business decision
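One way to answer "would more time help" is to treat the effect observed so far as if it were the true effect (an optimistic assumption) and size the test for it. A sketch with illustrative current results and an assumed enrollment rate:
import numpy as np
from scipy import stats

def additional_sample_needed(p_control, p_treatment, n_current, alpha=0.05, power=0.80):
    """Extra users per variant to detect the observed difference with the given power."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    var_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n_required = (z_alpha + z_beta) ** 2 * var_sum / (p_treatment - p_control) ** 2
    return max(0, int(np.ceil(n_required - n_current)))

extra_per_variant = additional_sample_needed(0.120, 0.123, n_current=10_000)  # illustrative observed rates
daily_per_variant = 1_000  # assumed enrollment rate per variant
print(f"~{extra_per_variant:,} more users per variant (~{extra_per_variant / daily_per_variant:.0f} days)")
If the implied duration is impractical, that argues for redesigning the test or making a judgment call rather than simply running longer.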
Scenario 3: "Validate test design before launching"
→ Calculate required sample size → Estimate test duration → Review randomization approach → Check guardrail metrics are defined → Prevent common pitfalls
Scenario 4: "Multiple variants to compare"
→ Use ANOVA or pairwise comparisons → Apply Bonferroni correction for multiple testing → Identify best performing variant → Check if any significantly better than control
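For the pairwise route, a minimal sketch using two-proportion z-tests against control with a Bonferroni-adjusted alpha; the variant counts below are placeholders:
import numpy as np
from scipy import stats

variants = {
    'control':     {'n': 10_000, 'successes': 1_200},
    'treatment_a': {'n': 10_000, 'successes': 1_290},
    'treatment_b': {'n': 10_000, 'successes': 1_350},
}
alpha = 0.05
comparisons = [v for v in variants if v != 'control']
adjusted_alpha = alpha / len(comparisons)  # Bonferroni correction

c = variants['control']
for name in comparisons:
    t = variants[name]
    p_pool = (c['successes'] + t['successes']) / (c['n'] + t['n'])
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / c['n'] + 1 / t['n']))
    z = (t['successes'] / t['n'] - c['successes'] / c['n']) / se
    p = 2 * (1 - stats.norm.cdf(abs(z)))
    verdict = 'significant' if p < adjusted_alpha else 'not significant'
    print(f"{name}: z={z:.2f}, p={p:.4f} ({verdict} at adjusted α={adjusted_alpha:.3f})")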
Scenario 5: "Test shows improvement but stakeholders skeptical"
→ Show statistical rigor (significance, power, CI) → Rule out SRM and other technical issues → Demonstrate guardrails not degraded → Provide expected business impact → Build confidence with data
Handling Missing Context
User shares results without test design: "To properly analyze, I need to know:
- What was tested (control vs treatment)
- How users were randomized
- What metric we're measuring
- Expected/desired effect size
Can you share the test plan?"
User doesn't know if sample size is enough: "Let me calculate the required sample size based on:
- Baseline conversion rate
- Desired uplift to detect
- Acceptable error rates
Then compare to what you have."
User concerned about p-value close to 0.05: "P-value of 0.049 is technically significant, but borderline. Let's:
- Check confidence interval (does it cross zero?)
- Review statistical power
- Consider practical significance
- Possibly run longer for more confidence"
User wants to peek at results mid-test: "Peeking increases false positive rate. If we must:
- Apply alpha spending function
- Use sequential testing methods
- Or just note results are preliminary"
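As a rough illustration of how strict early looks should be, here is a sketch of an O'Brien-Fleming-style alpha spending schedule (Lan-DeMets spending function) for four equally spaced looks. Exact group-sequential boundaries require a dedicated recursion or package; treat the per-look z thresholds as approximations:
import numpy as np
from scipy import stats

alpha = 0.05
looks = np.array([0.25, 0.50, 0.75, 1.00])  # assumed information fractions (4 equally spaced looks)
z_final = stats.norm.ppf(1 - alpha / 2)

cumulative_alpha = 2 - 2 * stats.norm.cdf(z_final / np.sqrt(looks))  # O'Brien-Fleming spending function
incremental_alpha = np.diff(np.concatenate([[0.0], cumulative_alpha]))

for t, cum_a, inc_a in zip(looks, cumulative_alpha, incremental_alpha):
    approx_boundary = z_final / np.sqrt(t)  # rough per-look z threshold
    print(f"Look at {t:.0%} of data: cumulative α spent ≈ {cum_a:.4f} "
          f"(this look: {inc_a:.4f}), z boundary ≈ {approx_boundary:.2f}")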
Advanced Options
After basic analysis, offer:
Bayesian Analysis: "Want probability that treatment is better? A Bayesian approach gives you 'P(treatment > control)'" (see the sketch at the end of this list)
Sequential Testing: "Planning to check results multiple times? I can adjust for peeking using sequential testing methods"
Heterogeneous Treatment Effects: "Want to see if treatment works better for certain user segments? I can analyze by subgroup"
Long-term Impact Estimation: "I can estimate sustained lift accounting for novelty effect and regression to mean"
Multi-Armed Bandit: "For continuous optimization, consider switching to bandit algorithm instead of fixed A/B test"
Sample Size Calculator: "Planning your next test? I can calculate required sample size for desired power and effect"
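For the Bayesian option above, a minimal sketch that estimates P(treatment > control) with Beta posteriors (uniform priors) and the summary counts from this report's example:
import numpy as np

rng = np.random.default_rng(42)
samples = 200_000

# Beta(1 + successes, 1 + failures) posteriors under a uniform Beta(1, 1) prior
control_post = rng.beta(1 + 1_200, 1 + 10_000 - 1_200, samples)
treatment_post = rng.beta(1 + 1_350, 1 + 10_000 - 1_350, samples)

p_treatment_better = (treatment_post > control_post).mean()
expected_lift = ((treatment_post - control_post) / control_post).mean()
print(f"P(treatment > control) ≈ {p_treatment_better:.1%}")
print(f"Expected relative lift ≈ {expected_lift:+.1%}")
This complements the frequentist result rather than replacing it; a posterior probability is often easier for stakeholders to interpret than a p-value.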