Yue Song

Why I Wrote a Skill to Review Skills

Mar 27, 2026

If stable AI performance depends on reusable skills, then skills themselves need a review method before they can be trusted.

I started writing skills for one reason: I want stable performance from AI.

A good one-off prompt is not enough for that. If I expect an AI to handle the same kind of work repeatedly, I need something more reusable and more disciplined. That is what a skill gives me. A skill is not just instructions for one task. It is a way to make future behavior more consistent.

But once I started thinking that way, another problem showed up. If skills are supposed to improve reliability, then the skills themselves need to be reviewed before I trust them.

I do not want to try a skill blindly and only discover later that it is vague, structurally wrong, or easy for another AI to misuse. If a skill is meant to make AI behavior more stable, then the skill itself has to be something I can evaluate in a repeatable way.

That question pushed me toward a more serious view of skill design. I was no longer asking only, “does this read well?” I was asking whether the skill was built in the right shape for the job.

Skills Are About Stability, Not Just Reuse

Once you care about reliability, reuse stops being a convenience and starts being part of the operating model.

The value of a skill is not only that I can run it again later. The value is that I can expect roughly the same behavior the next time an AI faces the same kind of task. A useful skill narrows ambiguity. It tells the agent what kind of job this is, what to inspect first, what order to follow, and what kind of output counts as correct.

That is what makes skills important to me. They are not prompt snippets. They are reusable constraints for future behavior.

But that also means a bad skill is worse than no skill at all. A vague or structurally confused skill does not stabilize anything. It just hides risk behind nicer formatting.

Reviewing a Skill Needs More Than “Good Writing”

The turning point for me was Lavini Gama’s article on skill design patterns:

5 Agent Skill Design Patterns Every ADK Developer Should Know

That article gave me a better way to think about skills. It framed them as different kinds of tools, not just different kinds of writing. The five patterns that mattered most to me were:

  • Tool Wrapper
  • Generator
  • Reviewer
  • Inversion
  • Pipeline

That changed the review question. Instead of asking only whether a skill was well written, I started asking whether it was the right kind of skill in the first place.

Is this supposed to wrap tools in a safer way? Generate a concrete artifact? Invert the flow so the agent inspects before asking? Review another object against a standard? Or orchestrate a sequence of steps across stages?

That lens catches structural mistakes early. A skill can be clear at the sentence level and still be the wrong pattern for the job.
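The five diagnostic questions above can be sketched as a simple lookup. This is a hypothetical illustration only; the names and the helper function are my own invention here, and the actual pattern assessment in reviewing-skills is prose-based, not a script:

```python
# Hypothetical sketch: each design pattern from Lavini Gama's article paired
# with the diagnostic question it answers. Illustration only, not part of
# the real skill.
PATTERN_QUESTIONS = {
    "Tool Wrapper": "Is this supposed to wrap tools in a safer way?",
    "Generator": "Is this supposed to generate a concrete artifact?",
    "Reviewer": "Is this supposed to review another object against a standard?",
    "Inversion": "Is this supposed to invert the flow so the agent inspects before asking?",
    "Pipeline": "Is this supposed to orchestrate a sequence of steps across stages?",
}

def first_matching_pattern(answers):
    """Return the first pattern whose question was answered 'yes', else None."""
    for pattern, question in PATTERN_QUESTIONS.items():
        if answers.get(question):
            return pattern
    return None
```

The point of the lookup is the review question it encodes: a skill should match exactly one of these intents, and if none of the questions gets a clear yes, the skill is probably the wrong shape.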

The “Other AI” Test

The second review method turned out to be just as important.

When I review a skill, I imagine another competent AI trying to use it with no extra coaching from me.

Would that AI know when to use the skill? Would it know what to read first? Would it know what order to follow? Would it produce the right output? Would it skip something important because the wording was too soft?

That perspective catches a lot of issues that ordinary editing misses.

As the author, I already know what I meant. Another AI does not. It only has the words in front of it. If the trigger conditions are weak, the order is unclear, the hard rules are buried, or the expected output is underspecified, another agent may still produce something plausible while quietly doing the wrong thing.

That is exactly the kind of failure I want to prevent.

So I Wrote reviewing-skills

Once I combined the design-pattern lens with the “other AI” lens, the next step became obvious: I should write a skill that reviews skills with both methods every time.

So I did.

The result is reviewing-skills.

Its purpose is simple. It reviews a target skill by checking two things:

  • whether the skill matches the right design pattern and follows sound skill-design principles
  • whether another competent AI could use it correctly and stably without extra coaching

That second point matters a lot to me. I do not just want a skill that looks reasonable to the author. I want a skill that survives contact with another agent.

In practice, that means looking for things like:

  • unclear trigger conditions
  • ambiguous step ordering
  • weak or optional-sounding instructions where the behavior really needs to be mandatory
  • outputs that are implied but not specified
  • references that are technically present but not easy for another agent to follow

If the goal is stable AI behavior, then the review process has to be explicit about those risks.
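To make those risks concrete, here is a minimal lint-style sketch of the kind of surface signals the checklist looks for. This is not the reviewing-skills skill itself, which is an AI-driven review rather than a script; the section names and the list of weak modal words are assumptions for illustration:

```python
import re

# Hedged words that make a rule sound optional when it should be mandatory.
# The word list is an assumption for illustration.
WEAK_MODALS = re.compile(r"\b(may|might|could|consider|ideally|if possible)\b", re.I)

def surface_findings(skill_text):
    """Flag surface-level risks from the checklist: missing trigger
    conditions, unspecified outputs, and optional-sounding wording.
    A rough heuristic sketch, not a real reviewer."""
    findings = []
    lowered = skill_text.lower()
    if "when to use" not in lowered and "use this skill when" not in lowered:
        findings.append("unclear trigger conditions: no 'when to use' section")
    if "output" not in lowered:
        findings.append("outputs implied but not specified: no 'output' section")
    for line in skill_text.splitlines():
        if WEAK_MODALS.search(line):
            findings.append(f"optional-sounding instruction: {line.strip()!r}")
    return findings
```

Even a crude check like this makes the failure modes explicit: a skill that passes the author's eye can still come back with three findings another agent would trip over.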

The Useful Part Was Letting It Review Itself

Then I used reviewing-skills on itself.

That self-review loop turned out to be especially useful. It helped me tighten the output format, clarify the review flow, make the pattern assessment more explicit, strengthen the wording in places where another AI might otherwise improvise, and generally make the skill more reliable.

That was the part I found most interesting. The skill did not only review other skills. It also gave me a way to improve itself using the same review logic it applies elsewhere.

In other words, the skill became part of a feedback loop:

  1. Draft the reviewing-skills skill.
  2. Review it against the design-pattern lens myself.
  3. Review it through the “other AI” lens myself.
  4. Fix the problems.
  5. Ask AI to use the skill to review itself.
  6. Repeat step 5 until no more issues are found.
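The self-review rounds at the end of that loop are a simple fixed-point iteration. In this sketch, `review` and `revise` stand in for AI calls; the function names, the round cap, and the toy demonstration are all assumptions for illustration:

```python
def self_review_loop(skill_text, review, revise, max_rounds=6):
    """Review the skill and revise it until the review comes back
    clean, or a round cap is hit. 'review' and 'revise' stand in
    for AI calls in the real workflow."""
    for rounds in range(max_rounds):
        findings = review(skill_text)
        if not findings:
            return skill_text, rounds  # converged: no more issues found
        skill_text = revise(skill_text, findings)
    return skill_text, max_rounds  # cap reached; needs a human look

# Toy demonstration: flag the word "maybe", then revise it away.
toy_review = lambda text: ["optional-sounding wording"] if "maybe" in text else []
toy_revise = lambda text, findings: text.replace("maybe", "always")
final, rounds = self_review_loop("maybe check inputs first", toy_review, toy_revise)
# final == "always check inputs first", rounds == 1
```

The round cap matters: a review that keeps finding new issues forever is itself a signal that the skill, or the reviewer, is underspecified.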

That loop feels much closer to engineering than to prompt tweaking.

What I Took Away

The main idea I have come away with is simple.

If I care about stable AI performance, I need good skills. And if I care about good skills, I need a good way to review them before I rely on them. Writing reviewing-skills was my answer to that.

The broader lesson is that reliability does not come from writing longer prompts. It comes from building reusable structures, checking whether they are the right structures, and testing whether another agent could actually use them as intended.

That is the standard I want for skills now. Not just “does this sound smart?” but “is this the kind of thing another AI can use correctly, repeatedly, and with minimal drift?”

That is why I wrote a skill to review skills.

Acknowledgments

The ideas in this post are mine; Codex helped me write it.
