ab-testing-statistician
Expert in statistical analysis for blind A/B and ABX audio testing. Validates randomization, calculates statistical significance, and ensures proper experimental design. Use when implementing A/B test features or analyzing test results.
When & Why to Use This Skill
The A/B Testing Statistician is a specialized Claude skill designed for the rigorous design and analysis of audio comparison tests. It provides expert guidance on Blind AB and ABX testing methodologies, ensuring experimental integrity through proper randomization validation, p-value calculations using binomial tests, and loudness compensation to eliminate psychoacoustic bias.
Use Cases
- Case 1: Audibility Testing - Using ABX protocols to determine if users can statistically distinguish between high-resolution audio and compressed formats.
- Case 2: Preference Validation - Conducting Blind AB tests to evaluate which EQ preset or audio processing algorithm users prefer without confirmation bias.
- Case 3: Experimental Design - Calculating required sample sizes and trial counts so audio tests can reach statistical significance at the 95% or 99% confidence level.
- Case 4: Software Implementation - Integrating robust Rust-based randomization and loudness normalization logic into audio testing applications.
- Case 5: Result Interpretation - Analyzing raw trial data to calculate p-values and determine if test results are statistically significant or merely due to chance.
A/B Testing Statistician
Specialized agent for designing and validating blind audio comparison tests (A/B, Blind AB, ABX) with proper statistical analysis.
Overview of Audio A/B Testing
Test Modes
| Mode | Description | User Knows? | Purpose |
|---|---|---|---|
| AB | Switch between A and B | Yes | Quick comparison, training |
| Blind AB | A and B randomly mapped to Options 1 and 2 | No | Unbiased preference detection |
| ABX | X is secretly either A or B, user guesses | No | Audibility testing (can you hear the difference?) |
Why Blind Testing Matters
Confirmation Bias: Listeners tend to prefer what they expect to be better.
Example:
Non-blind: "This expensive cable sounds clearer!" (placebo effect)
Blind: "I can't tell the difference" (objective reality)
Session Management
Session State (Rust)
```rust
use serde::{Deserialize, Serialize};

#[derive(Clone, Serialize, Deserialize)]
pub struct ABSession {
    pub mode: ABTestMode,          // AB, BlindAB, or ABX
    pub preset_a_name: String,
    pub preset_b_name: String,
    pub trim_db: f32,              // Loudness compensation for B
    pub total_trials: usize,
    pub current_trial: usize,
    pub hidden_mapping: Vec<bool>, // For BlindAB: true = Option1 is A
    pub x_is_a: Vec<bool>,         // For ABX: true = X is A
    pub answers: Vec<ABAnswer>,    // User responses
}

// PartialEq is needed for the `session.mode ==` comparisons in the export code
#[derive(Clone, PartialEq, Serialize, Deserialize)]
pub enum ABTestMode {
    AB,      // Non-blind switching
    BlindAB, // Blind preference test
    ABX,     // Blind audibility test
}

#[derive(Clone, Serialize, Deserialize)]
pub struct ABAnswer {
    pub trial: usize,
    pub selected_option: String, // "A", "B", "1", "2", or "X"
    pub timestamp: u64,          // Milliseconds since session start
}
```
Randomization (Critical!)
BlindAB Mode: Each trial randomly maps A/B to Options 1/2:
```rust
pub fn create_blind_ab_session(
    preset_a: String,
    preset_b: String,
    num_trials: usize,
    trim_db: f32,
) -> ABSession {
    use rand::Rng;
    let mut rng = rand::thread_rng();

    // Randomize each trial independently
    let hidden_mapping: Vec<bool> = (0..num_trials)
        .map(|_| rng.gen_bool(0.5)) // 50% chance Option1 = A
        .collect();

    ABSession {
        mode: ABTestMode::BlindAB,
        preset_a_name: preset_a,
        preset_b_name: preset_b,
        trim_db,
        total_trials: num_trials,
        current_trial: 0,
        hidden_mapping,
        x_is_a: vec![],
        answers: vec![],
    }
}
```
ABX Mode: X is randomly set to A or B for each trial:
```rust
pub fn create_abx_session(
    preset_a: String,
    preset_b: String,
    num_trials: usize,
    trim_db: f32,
) -> ABSession {
    use rand::Rng;
    let mut rng = rand::thread_rng();

    // Randomize X for each trial
    let x_is_a: Vec<bool> = (0..num_trials)
        .map(|_| rng.gen_bool(0.5)) // 50% chance X = A
        .collect();

    ABSession {
        mode: ABTestMode::ABX,
        preset_a_name: preset_a,
        preset_b_name: preset_b,
        trim_db,
        total_trials: num_trials,
        current_trial: 0,
        hidden_mapping: vec![],
        x_is_a,
        answers: vec![],
    }
}
```
Critical Rule: Randomize PER TRIAL, not once for all trials!
❌ Wrong:
```rust
// Same mapping reused for every trial — listeners can learn the pattern
let option1_is_a = rng.gen_bool(0.5);
```
✅ Correct:
```rust
// Fresh coin flip for every trial
let hidden_mapping: Vec<bool> = (0..num_trials)
    .map(|_| rng.gen_bool(0.5))
    .collect();
```
Loudness Compensation (Trim Parameter)
Problem: Louder audio is reliably perceived as "better" (equal-loudness / Fletcher-Munson curves)
Solution: Level-match the presets before testing
Auto-Calculate Trim
```rust
pub fn calculate_auto_trim(
    bands_a: &[ParametricBand],
    preamp_a: f32,
    bands_b: &[ParametricBand],
    preamp_b: f32,
) -> f32 {
    use crate::audio_math::calculate_peak_gain;

    let peak_a = calculate_peak_gain(bands_a, preamp_a);
    let peak_b = calculate_peak_gain(bands_b, preamp_b);

    // Adjust B to match A's peak level
    peak_a - peak_b
}
```
Apply Trim to Preset B
```rust
pub fn apply_preset_with_trim(
    bands: &[ParametricBand],
    preamp: f32,
    trim_db: f32,
) -> Result<(), String> {
    let adjusted_preamp = preamp + trim_db;

    // Apply to EqualizerAPO
    write_eapo_config(bands, adjusted_preamp)?;
    Ok(())
}
```
Example:
```text
Preset A: Peak gain = -2 dB
Preset B: Peak gain = +1 dB
Trim = -2 - (+1) = -3 dB
Apply Preset B with -3 dB trim → Both have -2 dB peak
```
Statistical Analysis
Preference Analysis (BlindAB)
Count how many times each preset was preferred:
```rust
pub struct PreferenceResults {
    pub a_selected: usize,
    pub b_selected: usize,
    pub total_trials: usize,
    pub a_percentage: f64,
    pub b_percentage: f64,
    pub p_value: f64, // Statistical significance
}

pub fn analyze_blind_ab(session: &ABSession) -> PreferenceResults {
    let mut a_count = 0;
    let mut b_count = 0;

    for (i, answer) in session.answers.iter().enumerate() {
        let option1_is_a = session.hidden_mapping[i];
        let selected_a = match answer.selected_option.as_str() {
            "1" => option1_is_a,
            "2" => !option1_is_a,
            _ => continue,
        };
        if selected_a {
            a_count += 1;
        } else {
            b_count += 1;
        }
    }

    let total = a_count + b_count;
    let a_pct = (a_count as f64 / total as f64) * 100.0;
    let b_pct = (b_count as f64 / total as f64) * 100.0;

    // Binomial test: is this significantly different from 50/50?
    let p_value = binomial_test(a_count, total, 0.5);

    PreferenceResults {
        a_selected: a_count,
        b_selected: b_count,
        total_trials: total,
        a_percentage: a_pct,
        b_percentage: b_pct,
        p_value,
    }
}
```
ABX Analysis (Audibility Test)
Count correct vs incorrect identifications:
```rust
pub struct ABXResults {
    pub correct: usize,
    pub incorrect: usize,
    pub total_trials: usize,
    pub accuracy: f64,
    pub p_value: f64,
}

pub fn analyze_abx(session: &ABSession) -> ABXResults {
    let mut correct = 0;
    let mut incorrect = 0;

    for (i, answer) in session.answers.iter().enumerate() {
        let x_is_a = session.x_is_a[i];
        let guessed_a = match answer.selected_option.as_str() {
            "A" => true,
            "B" => false,
            _ => continue,
        };
        if guessed_a == x_is_a {
            correct += 1;
        } else {
            incorrect += 1;
        }
    }

    let total = correct + incorrect;
    let accuracy = (correct as f64 / total as f64) * 100.0;

    // Binomial test: is this better than 50% guessing?
    let p_value = binomial_test(correct, total, 0.5);

    ABXResults {
        correct,
        incorrect,
        total_trials: total,
        accuracy,
        p_value,
    }
}
```
Binomial Test (P-Value)
Null Hypothesis: User is guessing randomly (50% chance)
P-Value: Probability of seeing a result at least this extreme purely by chance, assuming the null hypothesis is true
```rust
fn binomial_test(successes: usize, trials: usize, p_null: f64) -> f64 {
    use statrs::distribution::{Binomial, Discrete};

    let dist = Binomial::new(p_null, trials as u64).unwrap();
    let observed = successes as u64;

    // Two-tailed exact test ("small p-values" method): sum the
    // probabilities of all outcomes at most as likely as the observed one
    let p_observed = dist.pmf(observed);
    let mut p_value = p_observed;
    for k in 0..=trials as u64 {
        let p_k = dist.pmf(k);
        if p_k <= p_observed && k != observed {
            p_value += p_k;
        }
    }
    p_value.min(1.0)
}
```
Interpretation:
- p < 0.05: Significant - unlikely to be chance (95% confidence)
- p < 0.01: Highly significant - very unlikely to be chance (99% confidence)
- p >= 0.05: Not significant - could be random guessing
Example:
```text
ABX Test: 15/20 correct (75% accuracy)
P-value = 0.041
Interpretation: Statistically significant at the 95% level.
User can reliably hear the difference.
```
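For reporting, a small helper along these lines can map a p-value to the labels used above (a sketch; the thresholds mirror the interpretation table, not an existing API):

```rust
/// Map a p-value to the significance labels used in the interpretation table.
fn interpret_p_value(p: f64) -> &'static str {
    if p < 0.01 {
        "highly significant (99% confidence)"
    } else if p < 0.05 {
        "significant (95% confidence)"
    } else {
        "not significant - could be random guessing"
    }
}
```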
Sample Size Requirements
How many trials needed for reliable results?
Rule of Thumb:
- Small effect: 50+ trials
- Medium effect: 20-30 trials
- Large effect: 10-15 trials
Formula (ABX test, 80% power):
n = (Z_α/2 + Z_β)² * p(1-p) / (p - 0.5)²
Where:
- Z_α/2 = 1.96 (for α = 0.05, two-tailed)
- Z_β = 0.84 (for 80% power)
- p = expected accuracy
Example:
```text
Expected accuracy: 70%
n = (1.96 + 0.84)² * 0.7 * 0.3 / (0.7 - 0.5)²
  = 7.84 * 0.21 / 0.04
  ≈ 41.2 → round up to 42 trials
```
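The formula translates directly into code. A minimal sketch (the function name and hard-coded z-values for α = 0.05 / 80% power are illustrative, not part of the session API):

```rust
/// Approximate number of ABX trials needed to detect a given expected
/// accuracy at alpha = 0.05 (two-tailed) with 80% power.
fn required_trials(expected_accuracy: f64) -> usize {
    let z_alpha = 1.96; // two-tailed, alpha = 0.05
    let z_beta = 0.84;  // 80% power
    let p = expected_accuracy;
    let n = (z_alpha + z_beta).powi(2) * p * (1.0 - p) / (p - 0.5).powi(2);
    n.ceil() as usize // always round up
}

// required_trials(0.70) == 42
```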
Recommended Trial Counts
```rust
pub fn recommended_trial_count(expected_accuracy: f64) -> usize {
    if expected_accuracy <= 0.55 {
        100 // Very subtle difference
    } else if expected_accuracy <= 0.65 {
        50 // Small difference
    } else if expected_accuracy <= 0.75 {
        25 // Medium difference
    } else {
        15 // Large difference
    }
}
```
Results Export
CSV Format
```rust
pub fn export_to_csv(session: &ABSession) -> String {
    let mut csv = String::from("Trial,Option1,Option2,Selected,Timestamp\n");

    for (i, answer) in session.answers.iter().enumerate() {
        // Both branches must yield (&str, &str), so convert the preset
        // names with as_str()
        let (opt1, opt2) = if session.mode == ABTestMode::BlindAB {
            if session.hidden_mapping[i] {
                (session.preset_a_name.as_str(), session.preset_b_name.as_str())
            } else {
                (session.preset_b_name.as_str(), session.preset_a_name.as_str())
            }
        } else {
            ("A", "B")
        };
        csv.push_str(&format!(
            "{},{},{},{},{}\n",
            i + 1,
            opt1,
            opt2,
            answer.selected_option,
            answer.timestamp
        ));
    }
    csv
}
```
Output:
```text
Trial,Option1,Option2,Selected,Timestamp
1,Flat,Boosted,1,1234
2,Boosted,Flat,2,2456
3,Flat,Boosted,1,3789
```
JSON Format
```rust
pub fn export_to_json(
    session: &ABSession,
    results: &PreferenceResults,
) -> String {
    let export = serde_json::json!({
        "mode": session.mode,
        "presets": {
            "a": session.preset_a_name,
            "b": session.preset_b_name,
        },
        "trim_db": session.trim_db,
        "trials": session.total_trials,
        "results": {
            "a_selected": results.a_selected,
            "b_selected": results.b_selected,
            "a_percentage": results.a_percentage,
            "b_percentage": results.b_percentage,
            "p_value": results.p_value,
            "significant": results.p_value < 0.05,
        },
        "answers": session.answers,
    });
    serde_json::to_string_pretty(&export).unwrap()
}
```
Experimental Design Best Practices
1. Counterbalancing
Ensure equal distribution of A and B across trials:
```rust
pub fn validate_counterbalancing(hidden_mapping: &[bool]) -> f64 {
    let a_count = hidden_mapping.iter().filter(|&&x| x).count();
    let total = hidden_mapping.len();
    let ratio = a_count as f64 / total as f64;

    // Should be close to 0.5; returns the absolute deviation
    (ratio - 0.5).abs()
}
```
Warning threshold:
```rust
if validate_counterbalancing(&session.hidden_mapping) > 0.15 {
    println!("Warning: Unbalanced randomization (>15% deviation from 50/50)");
}
```
2. Trial Independence
Each trial should be independent:
- ✅ Randomize per trial
- ❌ Use patterns (ABABAB...)
- ❌ Fixed order
3. Rest Breaks
Prevent listener fatigue:
```javascript
if (currentTrial % 10 === 0 && currentTrial !== totalTrials) {
    showRestBreakDialog();
}
```
4. Reference Switching
Allow listeners to switch between options multiple times before answering:
```javascript
let switchCount = 0;

function handleSwitch() {
    switchCount++;
    applyOpposite();
}

// Log switch count as a quality metric
```
Common Pitfalls
❌ Volume Mismatch
```javascript
// WRONG: Apply presets without level matching
applyPresetA();
applyPresetB();

// CORRECT: Apply with trim
applyPreset(presetA, 0);
applyPreset(presetB, trimDb);
```
❌ Non-Random Patterns
```rust
// WRONG: Alternating pattern
let hidden_mapping = vec![true, false, true, false /* ... */];

// CORRECT: True randomization
let hidden_mapping: Vec<bool> = (0..trials)
    .map(|_| rng.gen_bool(0.5))
    .collect();
```
❌ Ignoring P-Value
```text
// WRONG: Report raw percentages without significance
"Preset A preferred 55% of the time"

// CORRECT: Include statistical context
"Preset A preferred 55% (p=0.42, not significant)"
```
❌ Too Few Trials
```javascript
// WRONG: Only 5 trials
const trials = 5; // Unreliable!

// CORRECT: Adequate sample size
const trials = 20; // Minimum for medium effects
```
Validation Tests
```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_randomization_distribution() {
        let session = create_blind_ab_session("A".into(), "B".into(), 1000, 0.0);
        let a_count = session.hidden_mapping.iter().filter(|&&x| x).count();
        let ratio = a_count as f64 / 1000.0;

        // With 1000 trials, should be very close to 0.5
        assert!((ratio - 0.5).abs() < 0.05, "Randomization biased: {}", ratio);
    }

    #[test]
    fn test_trial_independence() {
        let session = create_blind_ab_session("A".into(), "B".into(), 100, 0.0);

        // Count runs (consecutive same values)
        let mut runs = 1;
        for i in 1..session.hidden_mapping.len() {
            if session.hidden_mapping[i] != session.hidden_mapping[i - 1] {
                runs += 1;
            }
        }

        // Expected runs ≈ n/2 for random data
        let expected_runs = 50.0;
        let deviation = (runs as f64 - expected_runs).abs() / expected_runs;
        assert!(deviation < 0.3, "Trials may not be independent");
    }

    #[test]
    fn test_binomial_test() {
        // 20/20 correct should be highly significant
        let p = binomial_test(20, 20, 0.5);
        assert!(p < 0.001);

        // 10/20 correct should not be significant (random guessing)
        let p = binomial_test(10, 20, 0.5);
        assert!(p > 0.05);
    }
}
```
Reference Materials
- references/statistical_tests.md - Detailed statistical methods
- references/experimental_design.md - Best practices for audio testing
- references/sample_size_calculator.md - Power analysis formulas