In my last article, I wrote about building a 100+ agent AI system. It wasn’t by grand design, but by the same instinct that produces good code. This article goes deeper into the piece that turned out to matter most: skills.
But first, the question I keep getting. Why 100+ agents?
Here's a better question: why do you have 100 utility functions?
Nobody plans to write 100 utility functions. You write one to avoid repeating a date formatter. Then another to normalize API responses. Then another, because three components need the same validation logic. Each utility exists because you noticed a repeated pattern and extracted it. That's DRY - Don't Repeat Yourself. That's not complexity. That's good engineering.
My agent system reached 100+ specialists the same way. Not by design. By refactoring. Every agent started as a repeated prompt that was expensive to re-explain. Each one exists because the alternative - re-prompting the same context every session - violated something I started calling DRYP: Don't Repeat Your Prompt. Same principle as DRY, applied to the new medium. When you find yourself pasting the same context into a conversation for the third time, that's a prompt begging to become a skill file.
The agents are the functions. And like functions, they don't all run in parallel - they're organized in a hierarchy. My orchestrator calls my designer, which calls my theme specialist. Just like main() calls buildUI() calls applyTheme(). But the skills - the domain knowledge files each agent loads before starting work - turned out to be the more important piece. The skill is the product. The agent is just the runtime.
That line sounds like a slogan. It's actually the core finding of a recent benchmark study - and it matches what I've seen across 100+ skill files in production.
What Skills Are and Why They Work
In my system, a skill is a structured markdown file that loads domain knowledge into an agent's context window before it starts working. It's not a system prompt. It's not a generic instruction set. It's precompiled expertise for a specific type of work.
Here's the difference in practice. Without a skill, I'd start a session like this:
I need you to review this DeFi integration. The protocol uses a singleton architecture with sub-accounts. Interest rates accrue per-second using ray math. The liquidation engine uses a Dutch auction. Check for reentrancy at these specific entry points...
Every session. Re-typed or pasted from notes. Slightly different each time. The model would get most of it, miss the edges, and I'd spend the first ten minutes correcting its assumptions.
With a skill, I load a single file. The agent reads the protocol's architecture, the specific risk patterns to check, the verification sequence, and the output format. The session starts at step one of the actual work, not step one of the explanation.
That's DRYP in action. The third time I pasted that DeFi review context, I extracted it into a skill file. Same instinct as extracting a utility function - just applied to prompts instead of code.
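The mechanics of that extraction are trivial, which is the point. Here's a minimal sketch - the file path, the message format, and the function names are hypothetical placeholders, not my actual implementation:

```python
from pathlib import Path

def load_skill(skill_path: str) -> str:
    """Read a skill file's markdown from disk."""
    return Path(skill_path).read_text(encoding="utf-8")

def start_session(skill_md: str, task: str) -> list:
    """Open a session with the skill as the system context and the task
    as the first user message - the context I used to re-paste by hand."""
    return [
        {"role": "system", "content": skill_md},
        {"role": "user", "content": task},
    ]
```

The third paste becomes something like `start_session(load_skill("skills/defi-review.md"), "Review this integration")` (hypothetical path), and everything after that is the actual review.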
The Data
This isn't just my observation. Look at the chart above.
SkillsBench tested 7,308 agent runs across 84 tasks, 11 domains, and 7 model configurations. The chart plots every model with and without skills. The entire "with skills" frontier sits above and to the right of "without skills." Every model. No exceptions.
Three findings from the study that matched what I'd been seeing:
- 2-3 focused skills is the sweet spot. More isn't better. Load four or more and the gains shrink. Provide "comprehensive" documentation and performance actually falls below baseline - worse than giving the model nothing. A function with three clear parameters outperforms one with fifteen optional arguments and a sprawling docstring. Same principle.
- Self-generated skills are net harmful. Models that wrote their own skill descriptions performed worse than models with no skills at all. They couldn't extract the right procedural patterns from their own experience. Human curation was the differentiator.
- Domain expertise matters most where pre-training is weakest. Healthcare saw the largest gains. Software engineering saw the smallest. The more specialized your domain, the more skills help. For me, that's DeFi - a domain where the model's pretraining barely scratches the surface.
Making 100+ Skills Manageable
So skills work. Now you have a problem: you've written 20, then 50, then 100+ of them. How do you not drown?
The answer is the same as in any large codebase: hierarchy. My system isn't 100+ agents running in parallel. It's a tree. Agents manage other agents, the same way managers manage teams.
I talk to maybe 8-10 skills directly, my "inner circle." The orchestrator, the designer, the implementer. When I tell my orchestrator to build a feature, it delegates to specialists I never interact with: the wagmi hook expert, the TypeScript type specialist, the QA sub-agents. Those specialists might delegate further. The protocol librarian routes to Morpho, Aave, or Uniswap specialists depending on the task.
I don't remember all 100+ skills. I don't need to. I talk to the inner circle. They know who to call.
This is the distinction between skills and agents. A skill is a persona I converse with - back-and-forth, iterative, collaborative. An agent is a worker that gets dispatched by another skill or agent. It takes a task, returns a result, no conversation needed. Most of the 100+ are agents. The inner circle are skills.
The skill files are what make the whole hierarchy work. Every node in the tree - whether I talk to it directly or it's five levels deep - loads the same kind of focused, domain-specific markdown file. The skill is the knowledge. The agent is just the thing that carries it.
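The tree itself is a few lines of routing. Every name below (the orchestrator, the protocol librarian, the keyword-matching rule) is illustrative - the real system routes on richer signals than substring matching:

```python
# Toy sketch of the hierarchy: every node loads its own skill file and
# either answers or delegates. Names and routing rule are illustrative.

class Agent:
    def __init__(self, name, skill, reports=None):
        self.name = name
        self.skill = skill            # the markdown knowledge this node loads
        self.reports = reports or {}  # sub-agents it can dispatch to

    def handle(self, task: str) -> str:
        # Route to a specialist whose key appears in the task; otherwise
        # this node does the work itself (stubbed as a string here).
        for key, agent in self.reports.items():
            if key in task.lower():
                return agent.handle(task)
        return f"{self.name} handled: {task}"

# Leaf specialists (several levels deep in the real tree; two here).
protocol_librarian = Agent("protocol-librarian", "routing.md", {
    "morpho": Agent("morpho-specialist", "morpho.md"),
    "aave": Agent("aave-specialist", "aave.md"),
})

# The inner-circle node I actually talk to.
orchestrator = Agent("orchestrator", "orchestrator.md", {
    "protocol": protocol_librarian,
})

print(orchestrator.handle("protocol question: morpho vault caps"))
# routed two levels down without the top-level caller knowing the leaf exists
```

The caller never names the leaf specialist - it names the problem, and each node routes one level down. That's the "I talk to the inner circle, they know who to call" property in code.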
Building a Useful Skill: The Person-Risk-Analyzer
Let me walk through one skill that illustrates the full lifecycle - from solving my own problem to becoming a tool the whole company uses.
The Problem
In DeFi, the biggest risk isn't usually the smart contract. It's the person on the other end of a Telegram message or a LinkedIn connection request. Compromised accounts, fake identities, social engineering - these are daily occurrences, and the consequences are financial.
I needed a way to vet people before engaging with them. Not a background check service - something that could investigate the way I would if I had unlimited time and paranoia.
The Skill Design
The person-risk-analyzer skill is an adversarial investigation agent. Its default stance: guilty until proven innocent. This isn't a personality quirk - it's a design decision. In security contexts, false negatives (missing a scam) are catastrophically worse than false positives (being too cautious about a legitimate contact).
The skill is a structured markdown file. It defines the default stance, a ranked trust hierarchy for information sources, a multi-phase research methodology, web search strategies, output format templates, and protocols for edge cases like hacked accounts. Not "investigate the person" - a complete investigative framework with explicit ordering, decision criteria, and structured deliverables. You can read the full file.
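To make the shape concrete, here's a toy rendering of what such a file looks like. The section names are invented to mirror the description above - this is not the actual person-risk-analyzer file:

```python
# Invented skill-file layout mirroring the description in the text,
# not the real person-risk-analyzer.
SKILL = """\
# Person Risk Analyzer

## Default Stance
Guilty until proven innocent. A false negative costs more than a false positive.

## Trust Hierarchy
1. On-chain history  2. Verified company pages  3. Cross-referenced socials
4. Self-reported claims (lowest trust)

## Methodology
Phase 1: identity verification. Phase 2: timeline consistency.
Phase 3: cross-references. Phase 4: edge cases (hacked accounts).

## Output Format
Risk score, red flags with evidence, green flags with evidence, next steps.
"""

def sections(skill_md: str) -> list:
    """List the ## headers - the skill's 'function signature'."""
    return [line[3:].strip() for line in skill_md.splitlines()
            if line.startswith("## ")]

print(sections(SKILL))
```

The headers work like a function signature: an agent, or a human reviewer, can see at a glance what the skill promises to cover and in what order.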
What This Looks Like in Practice
Load the skill. Give the agent a name and a context ("This person contacted us about a partnership on Telegram"). The agent runs the investigation: searches public records, cross-references social profiles, checks for consistency in employment claims, examines the timeline of account creation versus claimed experience.
The output is a structured report: identity verification status, red flags found (with evidence), green flags found (with evidence), overall risk score, and recommended next steps.
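That report maps naturally onto a typed structure. The field names below are guesses from the description, not the skill's actual output schema:

```python
from dataclasses import dataclass, field

@dataclass
class RiskReport:
    # Fields mirror the report described above; illustrative only.
    subject: str
    identity_verified: bool
    red_flags: list = field(default_factory=list)    # (flag, evidence) pairs
    green_flags: list = field(default_factory=list)
    risk_score: int = 0                              # e.g. 0-100, higher = riskier
    next_steps: list = field(default_factory=list)

report = RiskReport(
    subject="Example Contact",
    identity_verified=False,
    red_flags=[("Account created last month", "profile metadata")],
    green_flags=[("Employer confirmed", "company team page")],
    risk_score=72,
    next_steps=["Request a video call before sharing any documents"],
)
```

Forcing every flag to carry its evidence is what makes the report auditable rather than a vibe.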
The critical part: the skill's procedural structure means even a smaller, cheaper model produces useful investigations. The skill is doing the expert reasoning. The model is doing the execution. A cheaper runtime runs the same program.
Where It Got Interesting
I didn't plan what happened next.
The first version worked fine for me. I stopped re-explaining my vetting process every session. Consistent methodology. Repeatable results. DRYP, problem solved.
Then I did what any engineer does when a function gets unwieldy - I cleaned it up. Clearer section headers, modular investigation steps, explicit decision criteria. I wasn't thinking about reuse. I was just making it less messy. But by making it cleaner, I accidentally made it portable. Any agent could consume this skill, not just the one I built it for.
Then came the part I really didn't see coming. I plugged it into our company's Slack. And suddenly the person who handles our partnerships - no engineering background, no prompt crafting - was using it. She messages the bot with a name, reads the report, decides whether to proceed. She doesn't know or care what model is running it. She just knows it catches things she'd miss.
The Three-Stage Payoff
This pattern - custom skill, engineering artifact, business tool - is the arc I didn't see coming but now build toward deliberately:
- Custom skills lead to better outcomes. This is the individual productivity gain. Every practitioner who's moved from ad-hoc prompting to structured skills has seen it. You stop re-explaining context. The agent starts at the work, not at the orientation.
- Refactoring the skill creates a reusable artifact. This is the engineering gain. Once you treat a skill like code - with structure, modularity, and clear interfaces - it becomes composable. Other agents import it. New workflows incorporate it. DRYP is just DRY applied to the new medium - the same instinct that produces good utility functions produces good skills.
- Reusable skills become accessible to non-technical people. This is the business gain, and the step most agent builders haven't taken yet. A well-structured skill doesn't need an engineer to operate it. Wrap it in a simple interface - a Slack bot, a web form, a scheduled job - and domain experts use AI capabilities without ever writing a prompt.
An engineer using a skill saves their own time. A non-engineer using a skill does something they couldn't do before. That's capability expansion, not just efficiency.
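The wrapping step is thinner than it sounds. Here's its shape, with the model call stubbed out and the Slack plumbing omitted - every name here is hypothetical:

```python
def run_model(prompt: str) -> str:
    # Stub standing in for the actual LLM call. Any model works here,
    # because the skill carries the procedure.
    return f"[report generated from {len(prompt)} chars of context]"

def vet_contact(skill_md: str, name: str, context: str) -> str:
    """What the bot does when a teammate messages it a name."""
    prompt = (
        f"{skill_md}\n\n"
        f"Subject: {name}\n"
        f"Context: {context}\n"
        "Produce the structured report defined in the Output Format section."
    )
    return run_model(prompt)

report = vet_contact(
    "# Person Risk Analyzer\n## Output Format\nRisk score, flags, next steps.",
    "Example Contact",
    "Contacted us about a partnership on Telegram",
)
print(report)
```

The non-engineer's interface is the two plain-text arguments; the skill and the model are invisible behind them. That's the whole trick.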
What This Means
Skills are programs written in English. They follow the same engineering principles as code: modularity, reuse, separation of concerns, single responsibility.
Focused, human-curated skills are the highest-leverage intervention you can make in an agent system. Not better models. Not more context. Not more sophisticated agent frameworks. Two or three focused skill files that encode exactly the procedural knowledge the task requires.
And the payoff extends past the terminal. A well-built skill is a business tool that anyone in the organization can use. Not someday. Now. The Slack bot running a person-risk-analyzer skill is not a demo - it's a tool a non-technical team member uses before every partnership call.
Start with one. The skill you wish you had last week for the task you keep re-explaining. Write the procedure, not the documentation. Keep it focused on two or three responsibilities. Let the model handle the rest. A sample of the system is open source at Web3-Claude if you want to see what skill files look like in practice.
The skill is the product. The model is just the runtime.
Next in Part 3: what happens when your context window becomes the bottleneck - context rot at scale, and how to give agents memory that survives across sessions.
