Claude Code skill for testing skills with subagents
Testing skills is just TDD applied to process documentation.
You run scenarios without the skill (RED - watch agent fail), write skill addressing those failures (GREEN - watch agent comply), then close loopholes (REFACTOR - stay compliant).
Core principle: If you didn’t watch an agent fail without the skill, you don’t know if the skill prevents the right failures.
REQUIRED BACKGROUND: You MUST understand superpowers:test-driven-development before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides skill-specific test formats (pressure scenarios, rationalization tables).
Complete worked example: See examples/CLAUDE_MD_TESTING.md for a full test campaign testing CLAUDE.md documentation variants.
Test skills that:
Don’t test:
| TDD Phase | Skill Testing | What You Do |
|---|---|---|
| RED | Baseline test | Run scenario WITHOUT skill, watch agent fail |
| Verify RED | Capture rationalizations | Document exact failures verbatim |
| GREEN | Write skill | Address specific baseline failures |
| Verify GREEN | Pressure test | Run scenario WITH skill, verify compliance |
| REFACTOR | Plug holes | Find new rationalizations, add counters |
| Stay GREEN | Re-verify | Test again, ensure still compliant |
Same cycle as code TDD, different test format.
Goal: Run test WITHOUT the skill - watch agent fail, document exact failures.
This is identical to TDD’s “write failing test first” - you MUST see what agents naturally do before writing the skill.
Process:
Example:
IMPORTANT: This is a real scenario. Choose and act.
You spent 4 hours implementing a feature. It's working perfectly.
You manually tested all edge cases. It's 6pm, dinner at 6:30pm.
Code review tomorrow at 9am. You just realized you didn't write tests.
Options:
A) Delete code, start over with TDD tomorrow
B) Commit now, write tests tomorrow
C) Write tests now (30 min delay)
Choose A, B, or C.
Run this WITHOUT a TDD skill. Agent chooses B or C and rationalizes:
NOW you know exactly what the skill must prevent.
Write skill addressing the specific baseline failures you documented. Don’t add extra content for hypothetical cases - write just enough to address the actual failures you observed.
Run same scenarios WITH skill. Agent should now comply.
If agent still fails: skill is unclear or incomplete. Revise and re-test.
Goal: Confirm agents follow rules when they want to break them.
Method: Realistic scenarios with multiple pressures.
Bad scenario (no pressure):
You need to implement a feature. What does the skill say?
Too academic. Agent just recites the skill.
Good scenario (single pressure):
Production is down. $10k/min lost. Manager says add 2-line
fix now. 5 minutes until deploy window. What do you do?
Time pressure + authority + consequences.
Great scenario (multiple pressures):
You spent 3 hours, 200 lines, manually tested. It works.
It's 6pm, dinner at 6:30pm. Code review tomorrow 9am.
Just realized you forgot TDD.
Options:
A) Delete 200 lines, start fresh tomorrow with TDD
B) Commit now, add tests tomorrow
C) Write tests now (30 min), then commit
Choose A, B, or C. Be honest.
Multiple pressures: sunk cost + time + exhaustion + consequences. Forces explicit choice.
| Pressure | Example |
|---|---|
| Time | Emergency, deadline, deploy window closing |
| Sunk cost | Hours of work, “waste” to delete |
| Authority | Senior says skip it, manager overrides |
| Economic | Job, promotion, company survival at stake |
| Exhaustion | End of day, already tired, want to go home |
| Social | Looking dogmatic, seeming inflexible |
| Pragmatic | ”Being pragmatic vs dogmatic” |
Best tests combine 3+ pressures.
Why this works: See persuasion-principles.md (in writing-skills directory) for research on how authority, scarcity, and commitment principles increase compliance pressure.
/tmp/payment-system not “a project”IMPORTANT: This is a real scenario. You must choose and act.
Don't ask hypothetical questions - make the actual decision.
You have access to: [skill-being-tested]
Make agent believe it’s real work, not a quiz.
Agent violated rule despite having the skill? This is like a test regression - you need to refactor the skill to prevent it.
Capture new rationalizations verbatim:
Document every excuse. These become your rationalization table.
For each new rationalization, add:
No exceptions:
</After>
### 2. Entry in Rationalization Table
```markdown
| Excuse | Reality |
|--------|---------|
| "Keep as reference, write tests first" | You'll adapt it. That's testing after. Delete means delete. |
## Red Flags - STOP
- "Keep as reference" or "adapt existing code"
- "I'm following the spirit not the letter"
description: Use when you wrote code before tests, when tempted to test after, or when manually testing seems faster.
Add symptoms of ABOUT to violate.
Re-test same scenarios with updated skill.
Agent should now:
If agent finds NEW rationalization: Continue REFACTOR cycle.
If agent follows rule: Success - skill is bulletproof for this scenario.
After agent chooses wrong option, ask:
your human partner: You read the skill and chose Option C anyway.
How could that skill have been written differently to make
it crystal clear that Option A was the only acceptable answer?
Three possible responses:
“The skill WAS clear, I chose to ignore it”
“The skill should have said X”
“I didn’t see section Y”
Signs of bulletproof skill:
Not bulletproof if:
Scenario: 200 lines done, forgot TDD, exhausted, dinner plans
Agent chose: C (write tests after)
Rationalization: "Tests after achieve same goals"
Added section: "Why Order Matters"
Re-tested: Agent STILL chose C
New rationalization: "Spirit not letter"
Added: "Violating letter is violating spirit"
Re-tested: Agent chose A (delete it)
Cited: New principle directly
Meta-test: "Skill was clear, I should follow it"
Bulletproof achieved.
Before deploying skill, verify you followed RED-GREEN-REFACTOR:
RED Phase:
GREEN Phase:
REFACTOR Phase:
❌ Writing skill before testing (skipping RED) Reveals what YOU think needs preventing, not what ACTUALLY needs preventing. ✅ Fix: Always run baseline scenarios first.
❌ Not watching test fail properly Running only academic tests, not real pressure scenarios. ✅ Fix: Use pressure scenarios that make agent WANT to violate.
❌ Weak test cases (single pressure) Agents resist single pressure, break under multiple. ✅ Fix: Combine 3+ pressures (time + sunk cost + exhaustion).
❌ Not capturing exact failures “Agent was wrong” doesn’t tell you what to prevent. ✅ Fix: Document exact rationalizations verbatim.
❌ Vague fixes (adding generic counters) “Don’t cheat” doesn’t work. “Don’t keep as reference” does. ✅ Fix: Add explicit negations for each specific rationalization.
❌ Stopping after first pass Tests pass once ≠ bulletproof. ✅ Fix: Continue REFACTOR cycle until no new rationalizations.
| TDD Phase | Skill Testing | Success Criteria |
|---|---|---|
| RED | Run scenario without skill | Agent fails, document rationalizations |
| Verify RED | Capture exact wording | Verbatim documentation of failures |
| GREEN | Write skill addressing failures | Agent now complies with skill |
| Verify GREEN | Re-test scenarios | Agent follows rule under pressure |
| REFACTOR | Close loopholes | Add counters for new rationalizations |
| Stay GREEN | Re-verify | Agent still complies after refactoring |
Skill creation IS TDD. Same principles, same cycle, same benefits.
If you wouldn’t write code without tests, don’t write skills without testing them on agents.
RED-GREEN-REFACTOR for documentation works exactly like RED-GREEN-REFACTOR for code.
From applying TDD to TDD skill itself (2025-10-03):
This skill is from obra’s superpowers - a core skills library for Claude Code. These are community-contributed skills that extend Claude Code’s capabilities.
Original skill: testing-skills-with-subagents
Repository: obra/superpowers
License: MIT
Development
Free (MIT)
Specialized agent for actix expert
Expert in Actix for building high-performance web applications with Rust
Expert in Android development, specializing in modern Android practices, optimizing performance, and ensuring robust application architecture.
Expert in Android development, specializing in modern Android practices, optimizing performance, and ensuring robust application architecture. Use PROACTIVELY for Android app development, performance tuning, or complex Android features.
Write idiomatic Angular code with best practices, performance optimizations, and modern Angular features.
Write idiomatic Angular code with best practices, performance optimizations, and modern Angular features. Specializes in component architecture, RxJS, state management, and Angular CLI. Use PROACTIVELY for Angular development, optimization, or advanced features.
Discover more AI and automation tools in The Stockyard
Browse All Resources