edu-demo-evaluator-free


Watch educational demo like a learner (BLIND evaluation). No test cases. No benchmark. No rubric. Honest assessment of: impression, what works, what doesn't, learner impact, recommendation. Output: agent_X_free_eval.json


When & Why to Use This Skill

The Educational Demo Evaluator is a specialized Claude skill designed for 'blind' qualitative assessment of educational tools and interactive demos. By simulating a first-time learner's experience, it provides honest, unbiased feedback on pedagogical value, user engagement, and clarity without the constraints of rigid test cases or benchmark bias. It helps developers understand the real-world impact of their educational content through interactive exploration, browser-based testing, and visual state capture.

Use Cases

  • Pedagogical Impact Assessment: Evaluate whether an interactive science or math simulation effectively communicates complex concepts to a student with no prior knowledge.
  • UX/UI Friction Identification: Discover confusing navigation, misleading visual metaphors, or 'janky' animations in AI-generated educational interfaces through human-like interaction.
  • Unbiased Quality Assurance: Conduct a 'blind' review of educational prototypes to ensure they are engaging and intuitive before applying formal scoring rubrics or technical benchmarks.
  • Visual Learning Journey Documentation: Automatically capture and organize key screenshots of a learner's interaction flow to provide developers with concrete evidence of educational 'aha!' moments or points of confusion.
name: edu-demo-evaluator-free
description: |
  Watch educational demo like a learner (BLIND evaluation). No test cases. No
  benchmark. No rubric. Honest assessment of: impression, what works, what doesn't,
  learner impact, recommendation. Output: agent_X_free_eval.json

Educational Demo Evaluator - Free Evaluation

Watch the demo like a learner would. Be honest. No scoring rubric. No benchmark bias.

Core Principles

  1. BLIND to test cases - Don't read test_cases.json
  2. BLIND to benchmark - Don't look at benchmark_ux/
  3. Watch like a learner - First time seeing it, no prior knowledge
  4. Honest assessment - What's awesome? What's confusing?
  5. Qualitative only - No numeric scores

Workflow

Step 1: Setup Chrome

# Get or create tab
mcp__claude-in-chrome__tabs_context_mcp(createIfEmpty=true)

# Create new tab for evaluation
mcp__claude-in-chrome__tabs_create_mcp()
# Returns: tabId (use this for the agent)

Step 2: Start HTTP Server

# Start HTTP server (from code-evo-agent-simple root directory)
cd /Users/hani/code-evo-agent-simple
python3 -m http.server 9999 &
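
Before navigating, it can help to confirm the server is actually up. A minimal readiness check in Python, assuming the default port 9999 started above:

# Optional readiness check (port 9999 is the one started above)
import urllib.request

def server_ready(url="http://localhost:9999/", timeout=2):
    """Return True if the local HTTP server answers."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

if not server_ready():
    print("HTTP server not reachable - restart it from the repo root")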

Step 3: Navigate to Demo

# Navigate to demo via HTTP (NOT file://)
mcp__claude-in-chrome__navigate(
  url="http://localhost:9999/problems/<name>/generations/gen{N}/agent_X.html",
  tabId=X
)

# Wait for load
mcp__claude-in-chrome__computer(action="wait", duration=2, tabId=X)

Step 4: Watch and Interact as a Learner

Spend 5-10 minutes with the demo like a real student:

  • Read initial content - what's explained?
  • Click buttons, interact with controls
  • Watch animations play - are they clear?
  • Try different scenarios - what do you learn?
  • Capture screenshots at key moments

Focus on educational value, not technical polish

# Screenshot initial state (demo will do this automatically)
mcp__claude-in-chrome__computer(action="screenshot", tabId=X)

# Read what's on the page
mcp__claude-in-chrome__read_page(tabId=X)

# Find buttons to interact with
mcp__claude-in-chrome__find(query="play button or start button", tabId=X)

# Click and interact
mcp__claude-in-chrome__computer(action="left_click", ref=found_ref, tabId=X)

# Wait for animation
mcp__claude-in-chrome__computer(action="wait", duration=2, tabId=X)

# CAPTURE at key moments using the built-in system
# Run in the page context (equivalent to typing it in the browser console):
mcp__claude-in-chrome__javascript_tool(
  action="javascript_exec",
  text="window.screenshotManager.captureState('key_moment')",
  tabId=X
)

Screenshots are captured in the demo via the built-in html2canvas system:

  • Click "📸 Capture State" button at key moments to capture
  • Click "⬇️ Download Screenshots" when done to download all PNGs
  • Each screenshot is labeled (initial_state, capture_1, capture_2, etc.)

ORGANIZE them for next generation builders:

# Move from ~/Downloads to /problems/<name>/screenshots/ with agent labels
mv ~/Downloads/capture_1.png /problems/<name>/screenshots/agent_X_initial.png
mv ~/Downloads/capture_2.png /problems/<name>/screenshots/agent_X_moment_1.png

The demo maintains a capture history during your evaluation session.
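
If you captured more than a couple of states, a short script can do the renaming in one pass. A sketch assuming the capture_N.png names above land in ~/Downloads; the target directory and agent label are placeholders to adjust:

# Sketch: move downloaded captures into the problem's screenshots/ folder
import shutil
from pathlib import Path

downloads = Path.home() / "Downloads"
target = Path("problems/<name>/screenshots")  # replace <name> with the actual problem
target.mkdir(parents=True, exist_ok=True)

labels = {
    "capture_1.png": "agent_X_initial.png",
    "capture_2.png": "agent_X_moment_1.png",
}

for src_name, dst_name in labels.items():
    src = downloads / src_name
    if src.exists():
        shutil.move(str(src), str(target / dst_name))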

Step 5: Record Honest Assessment

As you watch, ask yourself:

First Impression

  • What do you see immediately?
  • Is it inviting or intimidating?
  • Does it look complete or broken?

Does It Make Sense?

  • Can you understand what's happening?
  • Is the core concept clear from the visualization?
  • Are there confusing or misleading parts?

Is It Engaging?

  • Do you want to keep exploring?
  • Are interactions satisfying and rewarding?
  • Do animations feel smooth or janky?

What Works?

  • What design choices are brilliant for learning?
  • What explanations are clear and memorable?
  • What makes the concept "click"?

What Doesn't Work?

  • What's confusing to a learner?
  • What feels incomplete or wrong?
  • What metaphors or explanations could mislead?

Educational Value

  • Would a student understand the concept after this?
  • Could they explain it to someone else?
  • What's the key learning takeaway?
  • What would a learner REMEMBER in a week?

Recommendation

  • Should this be used?
  • What's the one thing to fix?
  • Is it a winner, or does it need major work?

Output Format

{
  "agent": "gen2/agent_1",
  "approach": "Comparison/Dual-View",

  "first_impression": "Clean, minimal UI with two side-by-side algorithms",

  "what_works": [
    "Immediately shows WHY quicksort matters (bubble sort is slow)",
    "Color coding makes comparisons easy to follow",
    "Step-by-step controls let learner control pace",
    "Comparison metrics visible (comparisons, swaps, time)"
  ],

  "what_doesn't_work": [
    "Recursion depth not clearly shown - jumps between levels",
    "Pivot selection explanation could be clearer",
    "Animation speed is a bit fast for beginners"
  ],

  "learner_impact": "A student would understand that quicksort is faster because of intelligent partitioning. Might not fully grasp recursion or pivot selection strategy.",

  "recommendation": "STRONG CANDIDATE - Fix recursion visualization, maybe add narrative explanations for pivot selection. Otherwise excellent foundation.",

  "screenshots_captured": "agent_1_initial.png, agent_1_comparison.png, agent_1_recursion.png (moved to /problems/<name>/screenshots/)"
}
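
Once the assessment is written up, it is saved as agent_X_free_eval.json. A minimal sketch for writing it with the fields above; the destination path is an assumption, so place the file wherever your workflow expects the eval outputs:

# Sketch: persist the free evaluation as agent_X_free_eval.json
import json
from pathlib import Path

evaluation = {
    "agent": "gen2/agent_1",
    "approach": "Comparison/Dual-View",
    "first_impression": "Clean, minimal UI with two side-by-side algorithms",
    "what_works": ["Immediately shows WHY quicksort matters"],
    "what_doesn't_work": ["Recursion depth not clearly shown"],
    "learner_impact": "A student would understand why quicksort is faster.",
    "recommendation": "STRONG CANDIDATE - fix recursion visualization.",
    "screenshots_captured": "agent_1_initial.png, agent_1_comparison.png",
}

# Assumed location - adjust to where your pipeline reads the eval files
out_path = Path("problems/<name>/generations/gen2/agent_1_free_eval.json")
out_path.write_text(json.dumps(evaluation, indent=2, ensure_ascii=False))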

Key Phrases to Avoid

❌ "Correctness score: 85" ❌ "Compared to benchmark..." ❌ "Test case coverage: 14/15" ❌ "Points deducted for..."

✅ "Immediately shows WHY" ✅ "A learner would understand..." ✅ "The animation feels smooth" ✅ "Confusing part: recursion depth"

Important Notes

  • Don't read test cases - You don't know what you're supposed to verify
  • Don't think about benchmark - You don't know what "good" looks like
  • Don't use rubric - No scoring categories, no point calculations
  • Be honest - If it's confusing, say it's confusing
  • Watch 5-10 minutes per agent - Enough time to form honest impression

Example Evaluation

Visit http://localhost:9999/problems/quicksort-demo/generations/gen2/agent_1.html

First impression:
- Clean white background with two columns side by side
- Left: Quicksort animation, Right: Bubble sort animation
- Professional looking, not too colorful

Interact:
- Click "Start" button
- Both arrays start animating
- Quicksort finishes first
- Bubble sort continues much longer
- Counter shows comparisons: QS=45, BS=120

Impression: "OH! This is why quicksort is better! The visualization immediately makes it clear."

Assessment:
- WORKS: Side-by-side comparison is brilliant
- WORKS: Metrics visible (comparison count)
- WORKS: Speed difference obvious
- DOESN'T WORK: Recursion not explained (which levels are being called?)
- DOESN'T WORK: Pivot selection seems arbitrary
- RECOMMENDATION: This is a strong foundation. Add narrative about pivot strategy, show recursion depth. Could be winner.

Cleanup

# Kill HTTP server
pkill -f "python3 -m http.server 9999"
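
If you prefer one script over a background shell job, the server can also be started and stopped from Python. A sketch assuming the same port and repository root used in Step 2:

# Sketch: start and stop the HTTP server around an evaluation session
import subprocess

server = subprocess.Popen(
    ["python3", "-m", "http.server", "9999"],
    cwd="/Users/hani/code-evo-agent-simple",  # repo root from Step 2
)
try:
    ...  # run the evaluation steps (navigate, interact, capture) here
finally:
    server.terminate()
    server.wait()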