AI agent skills boost task performance by 50% — if humans write them
- Giving AI agents skills doubles success rate in complex tasks.
- Small cheap AI models outperform expensive versions running without skills.
Think of an AI agent like a smart new hire. Brilliant in many ways, but on their first day at a hospital, a factory floor, or a trading desk, they’ll struggle without the right briefing. Tell them exactly how things work in that environment – the procedures, the tools, the shortcuts that matter – and suddenly they perform like a seasoned professional.
That is essentially what AI agent skills do. And according to a new benchmark study published in February 2026, the difference between an agent with good skills and one without can be enormous.
The study, SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks, was produced by 40 researchers from institutions including Amazon, Carnegie Mellon University, Stanford, UC Berkeley, and the University of Oxford. They tested seven leading AI agent configurations – tools like Anthropic’s Claude Code, Google’s Gemini CLI, and OpenAI’s Codex CLI – across 84 real-world tasks and more than 7,300 individual attempts.
Three conditions were tested for each task: agents given no guidance, agents given expert-written skills, and agents asked to write their own skills before attempting the task.
What are AI agent skills?
An AI agent is a model, like Claude or GPT, that has been given access to tools and software, letting it carry out tasks on its own, step by step, rather than just answering questions. These agents are increasingly being used to handle complex work across industries: analysing financial reports, processing medical data, managing cybersecurity checks, and more.
The problem is that, however capable these agents are in general, they often lack the specific know-how needed for specialised work. They might know how to analyse data, but not the exact method a hospital uses to harmonise lab results across different systems, or the specific engineering approach required to optimise a factory’s production schedule.
AI agent skills bridge that gap. Each skill is essentially a structured briefing document – a set of instructions, code examples, and reference material – that tells an agent how to approach a particular type of task in a particular domain. No retraining needed. The agent just reads the skill and applies it.
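To make the idea concrete, here is a minimal sketch of a skill as a data structure. The `Skill` class, its field names, and the `build_prompt` helper are hypothetical, chosen only for illustration; they are not the format used in the study or by any particular agent tool.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical container for one skill; field names are illustrative."""
    name: str
    domain: str
    instructions: list[str]          # ordered, step-by-step guidance
    code_example: str = ""           # a short working example, if any
    references: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Flatten the skill into plain text the agent can read."""
        lines = [f"Skill: {self.name} ({self.domain})", "Steps:"]
        lines += [f"{i}. {step}" for i, step in enumerate(self.instructions, 1)]
        if self.code_example:
            lines += ["Example:", self.code_example]
        if self.references:
            lines += ["References:"] + [f"- {ref}" for ref in self.references]
        return "\n".join(lines)


def build_prompt(task: str, skills: list[Skill]) -> str:
    """Prepend rendered skills to the task text; the model is not retrained."""
    briefing = "\n\n".join(skill.render() for skill in skills)
    return f"{briefing}\n\nTask: {task}" if briefing else f"Task: {task}"
```

In this sketch the skill is just text placed in front of the task, which is why no retraining is involved: swapping the briefing changes what the agent reads, not the model itself.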
The numbers for the Asia Pacific
Across all the AI agent configurations tested, giving agents expert-written skills improved their average success rate by 16.2 percentage points. But some sectors saw far greater gains than others – and the pattern is directly relevant to industries driving enterprise AI adoption in the region.
Healthcare tasks saw a 51.9-point improvement with skills. Manufacturing jumped by 41.9 points. Cybersecurity gained 23.2 points, and energy sector tasks improved by 17.9 points.
The reason is intuitive. AI models are trained on vast amounts of general data from the internet and published sources. But the step-by-step procedures used in clinical data management, factory scheduling, or power grid operations are not published online – they live in the heads of specialists and in internal documentation. Without a skill that captures that knowledge, an agent is essentially improvising.
One example from the study makes this vivid. Agents tasked with flood risk analysis – identifying which water monitoring stations are at risk based on streamflow data – achieved a success rate of just 2.9% when left to their own devices.
When given a skill that specified the correct statistical method (the Log-Pearson Type III distribution, which is the standard approach used by US water management authorities) along with guidance on how to apply it, the success rate jumped to 80%. Same task. Same AI model. The only difference was a well-written briefing document.
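For readers curious what that method involves, the following is a simplified sketch of a Log-Pearson Type III fit, not the procedure used in the study's task. It assumes a method-of-moments fit on the base-10 logs of annual peak flows and uses the Wilson-Hilferty approximation for the frequency factor. The function name and the peak-flow figures are hypothetical, and refinements found in formal guidelines, such as regional skew weighting and outlier handling, are left out.

```python
import math
from statistics import NormalDist, mean, stdev


def log_pearson3_quantile(annual_peaks, return_period_years):
    """Estimate the flow for a given return period (must be > 1 year)
    from a Log-Pearson Type III fit to annual peak flows."""
    logs = [math.log10(q) for q in annual_peaks]
    n = len(logs)
    m = mean(logs)
    s = stdev(logs)  # sample standard deviation of the log flows
    # Sample skew coefficient of the log flows.
    g = n * sum((x - m) ** 3 for x in logs) / ((n - 1) * (n - 2) * s ** 3)

    # Standard normal quantile for the annual non-exceedance probability.
    z = NormalDist().inv_cdf(1 - 1 / return_period_years)

    # Frequency factor K via the Wilson-Hilferty approximation;
    # with near-zero skew the fit reduces to a log-normal (K = z).
    if abs(g) < 1e-6:
        k = z
    else:
        k = (2 / g) * ((1 + g * z / 6 - g ** 2 / 36) ** 3 - 1)

    return 10 ** (m + k * s)


# Hypothetical annual peak flows for one station (illustrative values only).
peaks = [1200, 950, 1800, 1430, 2100, 880, 1650, 1320, 2600, 1100, 1540, 1980]

q2 = log_pearson3_quantile(peaks, 2)
q10 = log_pearson3_quantile(peaks, 10)
q100 = log_pearson3_quantile(peaks, 100)
print(f"2-year: {q2:.0f}, 10-year: {q10:.0f}, 100-year: {q100:.0f}")
```

An agent that does not know this is the expected method may reach for a plain average or a normal fit instead, which is the kind of improvisation the skill is meant to prevent.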
Smaller models can beat bigger ones – with the right skills
One of the study’s most practically useful findings is what skills do to the cost equation.
AI models vary in cost. Larger, more capable models charge more per use. For enterprises running agents at scale, that adds up. The assumption has generally been that if you want better results, you pay for a bigger model.
The SkillsBench data complicates that assumption. Anthropic’s Claude Haiku 4.5 – the smallest, most affordable model tested – achieved a 27.7% success rate when given curated skills. Without skills, it managed just 11%. The skills-equipped Haiku result beats Claude Opus 4.5 – a more expensive model – running without skills, which achieved 22%.
In other words, a well-briefed smaller model outperformed a larger model left to its own devices.
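The comparison can be checked with a few lines of arithmetic. This sketch uses only the three success rates quoted above; the variable names are illustrative.

```python
# Success rates (percent) as quoted in the article.
haiku_no_skills = 11.0
haiku_with_skills = 27.7
opus_no_skills = 22.0

# Absolute gain in percentage points and the ratio to the baseline.
point_gain = round(haiku_with_skills - haiku_no_skills, 1)
relative_gain = round(haiku_with_skills / haiku_no_skills, 2)

# How far the skills-equipped small model sits above the larger one.
lead_over_opus = round(haiku_with_skills - opus_no_skills, 1)

print(f"Haiku gain from skills: {point_gain} points ({relative_gain}x baseline)")
print(f"Lead over Opus without skills: {lead_over_opus} points")
```

The gain of 16.7 points is about two and a half times the baseline success rate, which shows how a modest-sounding point gain can still mean success more than doubling for a given model.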
Keep it focused, not exhaustive
The study also settles a design question many teams face when building out agent workflows: how much information should a skill contain? The answer, it turns out, is less than most people assume. Tasks equipped with two or three focused skill modules showed the best results, improving success rates by an average of 18.6 percentage points.
Tasks given four or more skills saw that figure drop to just 5.9 points. And tasks given very long, comprehensive documentation packages actually performed worse than those given shorter, more targeted guidance.
The authors put it plainly: overly long skills “can consume context budgets without providing actionable guidance.” An agent drowning in documentation faces the same problem as a new employee handed a 300-page manual on their first morning – it is harder, not easier, to know what to do.
Concise, step-by-step guidance with a working example consistently outperformed exhaustive documentation.
Why human expertise still sets the ceiling
The final condition tested was the one with the most implications for how enterprises think about AI autonomy. Agents were asked to generate their own skills before attempting tasks – essentially, to self-brief before getting to work.
On average, this made things marginally worse, not better. Self-generated skills produced a change of -1.3 percentage points compared to agents given no skills at all. Only one model showed any meaningful improvement.
Most either generated vague, imprecise guidance or failed to recognise that they needed specialised knowledge in the first place and simply pushed ahead with general approaches.
The conclusion the researchers draw is significant: “Effective skills require human-curated domain expertise that models cannot reliably self-generate.”
AI agents can do remarkable things when properly equipped. The equipping itself – the work of capturing institutional knowledge, domain procedures, and specialist expertise in a structured, reusable form – is where human judgement remains essential.
For enterprises in the Asia Pacific, where deep operational knowledge often sits with experienced teams rather than in formal documentation, building that knowledge-encoding ability is increasingly a competitive question.
The organisations that get this right will have a structural advantage that is genuinely difficult to replicate.

