Building Skills That Actually Work (Claude → Codex)
Based on The Complete Guide to Building Skills for Claude (includes January 2026 distribution notes).
If you keep finding yourself typing the same “here’s how I like this done” instructions over and over, you’re exactly who skills are for.
A skill is basically a small, packaged playbook: a folder of instructions (and optional helper scripts/docs) that teaches an AI how to run a specific workflow consistently. Instead of re-explaining your process every chat, you teach it once and reuse it forever.
This guide walks through the full lifecycle:
- How skills are structured (and why the structure matters)
- How to design skills around real use cases
- The non-negotiable technical rules (so the platform actually accepts your skill)
- How to test and iterate without guessing
- How to distribute a skill to other people
- Patterns that show up in “good” skills (plus common failure modes)
You’ll also get a practical translation at the end: how these same ideas map to Codex skills on your machine.
Table of contents
- Quick start: build your first skill in ~30 minutes
- Fundamentals
- Planning and design
- Technical requirements (non-negotiables)
- Writing SKILL.md instructions (make it executable)
- Testing and iteration
- Distribution and sharing
- Patterns and troubleshooting
- Appendix A: Quick checklist
- Appendix B: YAML frontmatter reference
- Appendix C: Where to find full examples
- How this maps to Codex skills (practical translation)
Quick start: build your first skill in ~30 minutes
If you want the “skip the theory, ship something” path, do this:
- Pick one repeatable workflow you do a lot (not three, not ten).
- Write 2–3 concrete use cases (example user asks + what “done” looks like).
- Draft frontmatter that’s ruthlessly specific about when to use the skill.
- Write SKILL.md instructions as a step-by-step checklist (with validation gates).
- Add 6–12 test prompts:
- obvious triggers
- paraphrases
- “should NOT trigger” decoys
- Run the tests, then tighten:
- undertriggering → make the description more specific and add trigger phrases
- overtriggering → add negative triggers and narrow scope
- execution issues → tighten steps, add validation, add troubleshooting
That’s it. You can always expand later.
Fundamentals
What a skill is (and what’s inside)
At minimum, a skill is a folder with one required file:
`SKILL.md` (required): Markdown instructions with YAML frontmatter at the top
Optionally, you can include:
- `scripts/`: code that makes parts of the workflow deterministic (validation, formatting, conversions)
- `references/`: docs the model can consult when needed (API guides, error codes, examples)
- `assets/`: templates and reusable artifacts (report skeletons, brand style guides, prompt templates)
Why this matters: skills are easiest to maintain when you separate the “how” (steps in SKILL.md) from the “details” (docs in `references/`) and the “must be correct” parts (scripts in `scripts/`).
Practical tip: if you ever catch yourself writing a 400-line “rules” section, that’s usually a sign you need either (a) a reference file, or (b) a script.
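To make the “must be correct” bucket concrete, here’s a minimal sketch of what a bundled validation script could look like. The file name, column set, and CLI behavior are all hypothetical; the point is that a header check belongs in code, not prose.

```python
# Hypothetical scripts/validate.py helper: a deterministic CSV schema check.
import csv

REQUIRED_COLUMNS = {"id", "title", "estimate"}  # example schema; adjust per skill

def missing_columns(path: str) -> set:
    """Return the required columns absent from the CSV header row."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return REQUIRED_COLUMNS - {h.strip() for h in header}

# A thin CLI wrapper would call missing_columns(sys.argv[1]) and exit nonzero
# when the set is non-empty, so the model gets an unambiguous pass/fail signal.
```

A script like this replaces a paragraph of “make sure the CSV has the right columns” with a check that cannot be skimmed past.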
Core design principle #1: Progressive disclosure
The skill system is designed to avoid shoving everything into context all the time. It’s a layered approach:
- Frontmatter (always visible): tiny metadata that helps the model decide when to load the skill.
- The SKILL.md body (loaded when relevant): the full workflow instructions.
- Linked/bundled files (loaded only if needed): deep docs, templates, examples, scripts.
Why this matters: most skills fail because they do too much too early. Keeping “heavy” content out of the always-loaded layer makes everything faster and more reliable.
Practical tip: treat frontmatter like a router. It shouldn’t teach the workflow; it should only say “use this when the user wants X and says Y.”
Core design principle #2: Composability
Models can load multiple skills at once. Your skill should behave nicely with neighbors.
Why this matters: in the real world, people enable lots of skills. If your skill assumes it owns the whole conversation (“always do X, never do Y”), you’ll get conflicts and weird behavior.
Practical tip: write your instructions so they still make sense if another skill is active (for example, a “docx” skill or a “spreadsheet” skill). Use clear scope statements like “This skill is only for…” and “If the user asks for…, hand off to…”.
Core design principle #3: Portability
Skills are meant to work across different surfaces (chat UI, coding environments, API-driven agents) as long as the environment supports the dependencies your skill needs.
Why this matters: if your skill requires a system binary or an API key, portability isn’t automatic. You have to declare it (and design fallbacks).
Practical tip: add a short “Requirements” section in SKILL.md when scripts are involved:
- what’s required
- how to install it
- what the fallback behavior should be if it’s missing
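The same “required / fallback” idea can live in a helper script too. This is a sketch under assumptions: `pandoc` stands in for whatever system binary your skill needs, and the function name is made up.

```python
import shutil

def pdf_conversion_plan(doc_path: str) -> str:
    """Pick the command to run, or the fallback message, based on what's installed."""
    if shutil.which("pandoc"):  # the declared requirement (pandoc is just an example)
        return f"pandoc {doc_path} -o output.pdf"
    # Fallback behavior when the dependency is missing: degrade, don't fail silently
    return "pandoc not installed: ask the user to install it, or deliver Markdown instead"
```

Checking for the dependency up front means the skill degrades predictably instead of failing halfway through a run.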
Skills + MCP (the “tools vs recipes” idea)
If MCP is how an agent gets access to tools and live data, skills are the “how to use those tools well” layer.
One way to think about it:
- MCP gives the agent a workshop full of tools.
- Skills are the build manual: order of operations, defaults, validations, and best practices.
Why this matters for connector builders: without skills, users may connect your tools and still not know what to do. Skills turn tool access into repeatable outcomes.
Deep cut: why frontmatter quality dominates
Most “my skill doesn’t work” issues aren’t about the instructions. They’re about the skill not loading at the right time.
Frontmatter is always present in the system prompt. That means:
- It has outsized influence on whether your skill activates.
- Small wording changes can dramatically change triggering behavior.
If you only have time to perfect one thing, perfect the description.
Planning and design
Start with use cases (not filenames)
Before you write anything, define 2–3 concrete use cases. Your goal is to answer: “What does the user want to accomplish, and what steps reliably get them there?”
Use this template:
Use Case: (short name)
Trigger: what the user might say (phrases, nouns, file types)
Steps: 3–8 steps in order (include any tools)
Result: what “done” looks like
Example (MCP-flavored):
Use Case: Sprint planning
Trigger: “plan this sprint”, “create sprint tasks”, “prioritize tickets”
Steps:
1. Pull current project state (tool call)
2. Estimate capacity/velocity (logic)
3. Propose priorities (output)
4. Create tasks with labels/estimates (tool calls)
Result: a sprint plan with tasks created and linked
Why this matters: if you can’t write the steps, you can’t write the skill. Skills are just “repeatable steps, packaged.”
The 3 common use case categories
From the patterns in the PDF, most skills fall into one of these:
- Document & asset creation: use when you want consistent output quality (docs, designs, code artifacts). Key techniques: style guides, templates, quality checklists.
- Workflow automation: use when a multi-step process benefits from one consistent method every time. Key techniques: step ordering, validation gates, iterative loops.
- MCP enhancement: use when tools exist, but users need a “best practices workflow” layered on top. Key techniques: coordinating tool calls, passing data between steps, robust error handling.
Define success criteria (so you can iterate sanely)
You want both quantitative and qualitative targets. Think “benchmarks,” not perfect scientific metrics.
Quantitative examples:
- Trigger accuracy: does it load for ~90% of relevant prompts in your test set?
- Tool-call efficiency: does it use fewer tool calls/tokens than a baseline?
- Reliability: does the workflow run without failed tool calls in typical runs?
Qualitative examples:
- Low steering: users don’t have to keep asking “what next?”
- Low correction: fewer “no, do it this way” interventions
- Consistency: similar results across repeated runs and across users
How to measure triggering quickly:
- Write 10–20 prompts that should trigger.
- Write 10 decoys that shouldn’t.
- Track activation behavior and adjust the description.
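The tracking can be as simple as a table you fill in by hand after each run. As a sketch (the prompts and observations are illustrative), one way to turn it into a single accuracy number:

```python
# Trigger-accuracy tally: each row records a test prompt, whether the skill
# SHOULD trigger for it, and whether it actually DID (observed manually or from logs).
results = [
    # (prompt, should_trigger, did_trigger)
    ("plan a sprint from this backlog", True,  True),
    ("prioritize these tickets",        True,  True),
    ("what's the weather in Paris?",    False, False),
    ("kick off sprint planning",        True,  False),  # undertrigger: fix the description
]

def trigger_accuracy(rows):
    """Fraction of prompts where observed behavior matched expectation."""
    correct = sum(1 for _, should, did in rows if should == did)
    return correct / len(rows)

print(f"trigger accuracy: {trigger_accuracy(results):.0%}")
```

Re-run the same table after every description tweak; if the number moves the wrong way, revert.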
Technical requirements (non-negotiables)
File structure
Use a simple structure like this:
your-skill-name/
├── SKILL.md # required
├── scripts/ # optional
│ ├── process_data.py
│ └── validate.sh
├── references/ # optional
│ ├── api-guide.md
│ └── examples/
└── assets/ # optional
└── report-template.md
Critical rules
These are the “don’t fight the system” rules:
- `SKILL.md` must be named exactly `SKILL.md` (case-sensitive).
- The skill folder name should be kebab-case:
  - ✅ notion-project-setup
  - ❌ Notion Project Setup
  - ❌ notion_project_setup
  - ❌ NotionProjectSetup
- Don’t put a `README.md` inside the skill folder. Put human-facing docs in the repo-level README, not in the skill folder.
YAML frontmatter: the most important part
Frontmatter is how the model decides whether the skill is relevant.
Minimal required format:
---
name: your-skill-name
description: What it does. Use when the user asks for X, says Y, or uploads Z.
---
Field rules to follow:
name (required):
- kebab-case only
- no spaces, no capitals
- should match the folder name
description (required):
- must include what the skill does and when to use it
- must include realistic trigger phrases users might say
- keep it under the platform limit (the guide calls out 1024 characters)
- mention file types if that’s part of the trigger (for example: “when user uploads a .fig file”)
Optional fields you’ll commonly see:
- `license`: if you’re open-sourcing the skill
- `compatibility`: environment expectations (product/surface, required packages, network needs)
- `metadata`: any custom fields (author, version, associated MCP server, etc.)
- `allowed-tools`: a tool access allowlist (when supported) to limit what the skill can invoke
Frontmatter security restrictions (take these seriously)
Because frontmatter is loaded into the system prompt:
- Don’t include `<` or `>` in frontmatter.
- The guide's quick checklist also recommends avoiding `<` and `>` anywhere in the skill (safest default).
- Don’t include “claude” or “anthropic” in the skill `name` (reserved, per the guide).
Frontmatter patterns (how to actually get triggering right)
Bad descriptions (too vague):
description: Helps with projects.
Better descriptions (specific “what” + specific “when”):
description: Creates sprint plans from a backlog and team capacity. Use when the user says "plan a sprint", "prioritize tickets", or asks to "create sprint tasks".
Add negative triggers when you overtrigger:
description: Performs advanced CSV analysis (regression, clustering, hypothesis tests). Do NOT use for simple "open this CSV" or basic charting.
Why this works: you’re shaping a decision boundary. The model needs both positive and negative examples to route correctly.
Writing SKILL.md instructions (make it executable)
Recommended structure
This structure is simple, predictable, and easy to test:
- Goal / scope
- Instructions (step-by-step)
- Examples (common user requests)
- Troubleshooting (error → cause → fix)
- References (what files to consult and when)
Here’s a starter template you can adapt:
---
name: your-skill-name
description: ...
---
# Your Skill Name
## Scope
- What this skill is for
- What it is NOT for
## Instructions
### Step 1: ...
### Step 2: ...
### Step 3: ...
## Examples
### Example: ...
User says: "..."
Actions:
1. ...
2. ...
Result: ...
## Troubleshooting
### Error: "..."
Cause: ...
Fix:
1. ...
2. ...
## References
- If you need pagination patterns, read `references/pagination.md`
Best practices that make skills “stick”
Be specific and actionable
- Good: “Run `scripts/validate.py --input {file}`; if it fails, fix missing columns and retry.”
- Weak: “Validate the data.”
Add validation gates
If a step can go wrong, say what “good” looks like before moving on:
- “Before creating tickets, confirm the project name is non-empty and the user chose a label set.”
Reference bundled resources by name
Make it easy to find the right doc:
- “Before writing API queries, check `references/api-errors.md` for rate limits and retry rules.”
Keep SKILL.md focused
Put deep documentation in references/ and link to it. This keeps the core flow readable
and keeps the model from drowning in details.
Deep cut: when to write scripts (and when not to)
Language instructions are flexible. Scripts are deterministic.
Use a script when:
- you must validate a file format correctly (CSV headers, JSON schema, date formats)
- you want a repeatable quality check (lint a document, verify required sections)
- you want to transform data (extract, normalize, summarize) reliably
Don’t write a script when:
- the user’s intent needs negotiation (“what do you mean by ‘clean up’?”)
- the workflow is mostly judgment and writing, not verification
Good skills often blend both: instructions for the human-ish parts, scripts for the brittle parts.
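A useful first script is often one that checks the skill itself. Here’s a sketch of a “repeatable quality check” in that spirit; the required section names are assumptions borrowed from the template earlier in this guide.

```python
import re

# Sections the template earlier in this guide recommends (illustrative list)
REQUIRED_SECTIONS = ["## Scope", "## Instructions", "## Examples", "## Troubleshooting"]

def lint_skill_md(text: str) -> list:
    """Deterministic findings: missing sections and angle brackets in frontmatter."""
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in text]
    parts = text.split("---")
    frontmatter = parts[1] if len(parts) >= 3 else ""
    if re.search(r"[<>]", frontmatter):
        problems.append("frontmatter contains angle brackets")
    return problems
```

Run it on every edit; the list of findings either shrinks to empty or tells you exactly what to fix.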
Testing and iteration
Skills can be tested at different levels depending on how much you care about correctness and how many users will rely on it.
Level 1: manual tests (fastest)
Run prompts in the target environment and observe:
- did the skill load?
- did it follow the steps?
- did it ask sensible clarifying questions?
Level 2: scripted tests (repeatability)
In a coding environment, you can keep a “test prompts” file and run the same set after every change.
Level 3: programmatic eval suites (scale)
If you’re using an API-driven agent setup, build a test harness:
- fixed test set
- expected outputs (or scoring rubric)
- automatic reporting over time
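The harness doesn’t need to be elaborate. A minimal shape, assuming you have some way to invoke the agent (the `run_agent` placeholder and the rubric below are both hypothetical):

```python
def run_agent(prompt: str) -> str:
    """Placeholder: wire this to however you actually invoke the agent (API, CLI, SDK)."""
    raise NotImplementedError

TEST_SET = [
    {"prompt": "plan a sprint for 3 engineers",
     # crude scoring rubric: 1 point per expected property of the output
     "score": lambda out: int("sprint" in out.lower()) + int("task" in out.lower())},
]

def run_suite(agent=run_agent):
    """Run every case and return (prompt, score) pairs for reporting over time."""
    return [(case["prompt"], case["score"](agent(case["prompt"]))) for case in TEST_SET]
```

Keeping the test set fixed is what makes the scores comparable from one skill revision to the next.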
Using a “skill-creator” helper (optional)
The PDF calls out a skill-creator skill that can help you draft and refine skills faster:
- generate a first draft from a plain-English description
- suggest trigger phrases and a sane structure
- review an existing skill for common problems (vague description, missing triggers, messy structure)
The important constraint: it helps you design, but it won’t magically produce quantitative eval results for you. You still need your own test prompts (and, for serious deployments, an eval harness).
Recommended coverage: the 3-part suite
1) Triggering tests
- Should trigger:
- obvious asks (“set up a new workspace”, “plan a sprint”, “generate a report”)
- paraphrases (“kick off sprint planning”, “turn this backlog into tasks”)
- Should NOT trigger:
- unrelated questions (weather, generic coding help, random Q&A)
2) Functional tests
Pick a concrete scenario and check:
- output structure is correct
- tool calls succeed (if applicable)
- error handling actually helps (not just “try again”)
- edge cases don’t explode
Example functional test:
Test: create a project with 5 tasks
Given: project name + 5 task descriptions
Then: project exists, tasks exist, tasks are linked, and no tool errors occurred
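That scenario translates directly into code if you stub the tool layer. A sketch, with `FakeProjectClient` standing in for real MCP tool calls (every name here is hypothetical):

```python
class FakeProjectClient:
    """Stub for the real tool integration, so the scenario is testable offline."""
    def __init__(self):
        self.projects, self.tasks = {}, []
    def create_project(self, name):
        self.projects[name] = []
        return name
    def create_task(self, project, desc):
        self.tasks.append((project, desc))
        self.projects[project].append(desc)

def test_create_project_with_tasks():
    client = FakeProjectClient()
    pid = client.create_project("Website revamp")
    for desc in ["design", "copy", "build", "QA", "launch"]:
        client.create_task(pid, desc)
    assert pid in client.projects            # project exists
    assert len(client.tasks) == 5            # all tasks exist
    assert all(p == pid for p, _ in client.tasks)  # tasks are linked

test_create_project_with_tasks()
```

Swapping the fake for the real client turns the same assertions into an integration test.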
3) Performance comparison
Compare to baseline (no skill):
- fewer back-and-forth messages
- fewer failed tool calls
- fewer tokens consumed
- fewer user corrections
How to iterate without losing your mind
The guide's best advice here is simple: start with one hard task and iterate until it works. Then generalize.
Common failure signals and what to do:
Undertriggering (skill doesn’t load when it should)
- Fix: expand the description with more real trigger phrases and domain terms.
Overtriggering (skill loads for unrelated things)
- Fix: narrow the description, add “Do NOT use when…” clauses, clarify scope.
Execution issues (skill loads but behaves badly)
- Fix: tighten step order, add validation gates, add troubleshooting, replace fuzzy rules with scripts where appropriate.
Distribution and sharing
Current distribution model (January 2026)
As of January 2026, the guide describes a user flow roughly like:
- download the skill folder
- zip it if required by the UI
- upload it in the product settings (or drop it into the local skills directory in a coding environment)
For org-level rollouts, it calls out admin-managed deployment shipped December 18, 2025 (workspace-wide distribution with centralized management and automatic updates).
Skills as an open standard
The guide frames “Agent Skills” as an open, portable standard: write once, reuse across platforms (with the usual caveat that dependencies and tool availability still matter).
Using skills via API (when it’s worth it)
If you’re building a product or pipeline around skills:
- API usage gives you versioning and programmatic control.
- The guide notes that API-based skills require a secure code execution environment (it mentions a code execution tool beta).
Key capabilities the guide calls out:
- A `/v1/skills` endpoint for listing and managing skills
- Adding skills to API requests via a `container.skills` parameter
- Version control/management via a console
- Compatibility with an agent SDK
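Purely as an illustration of the `container.skills` idea, a request body might be assembled like this. The field names other than `container.skills` are assumptions; check the official API reference for the real request shape before using anything like this.

```python
def build_request(prompt: str, skill_ids: list) -> dict:
    """Assemble a request body that attaches skills via container.skills.
    Everything except the container.skills parameter is an ASSUMED shape."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "container": {"skills": [{"skill_id": sid} for sid in skill_ids]},
    }
```

The practical point: in API-driven setups, which skills are active becomes an explicit, versionable parameter instead of a UI toggle.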
When to use skills via API vs UI surfaces (from the guide's framing):
| Use case | Best surface |
|---|---|
| End users interacting with skills directly | Claude.ai / Claude Code |
| Manual testing and iteration during development | Claude.ai / Claude Code |
| Individual, ad-hoc workflows | Claude.ai / Claude Code |
| Applications using skills programmatically | API |
| Production deployments at scale | API |
| Automated pipelines and agent systems | API |
Recommended rollout approach (practical, not fancy)
- Host the skill on GitHub (public if open source).
- Put a repo-level README for humans:
- what outcomes the skill enables
- installation steps
- screenshots / examples
- Link the skill from your MCP docs (if you ship an MCP server).
- Provide a quick install + test guide.
Positioning tip: sell outcomes, not internal mechanics.
- Better: “Creates a full sprint plan in minutes.”
- Worse: “Contains YAML and calls tools.”
Patterns and troubleshooting
Problem-first vs tool-first (the “store” analogy)
One framing that helps: some users show up with a goal (“help me set up a workspace”), others show up with tool access (“I connected Notion; now what?”).
- Problem-first skills: user describes the outcome; the skill chooses and sequences tools.
- Tool-first skills: user already has tool access; the skill teaches best practices and patterns.
Knowing which one you’re building keeps your instructions clean.
Pattern 1: Sequential workflow orchestration
Use when: the process must happen in a strict order.
Template:
## Workflow: Onboard a new customer
### Step 1: Create account
- Call tool: `create_customer`
- Validate: email present, company present
### Step 2: Set up payment
- Call tool: `setup_payment_method`
- Wait for: verification
### Step 3: Create subscription
- Call tool: `create_subscription`
- Needs: customer_id from Step 1
### Step 4: Send welcome email
- Call tool: `send_email` using `assets/welcome_email.md`
What makes it work:
- explicit ordering
- dependencies spelled out
- validation at each step
- rollback guidance when things fail
Pattern 2: Multi-MCP coordination
Use when: one outcome requires multiple services.
Template idea:
- Phase 1: export from Design tool
- Phase 2: store assets in Drive
- Phase 3: create tickets in Issue tracker
- Phase 4: notify in Chat tool
What makes it work:
- clear phase boundaries
- passing outputs forward (links, IDs)
- centralized error handling
Pattern 3: Iterative refinement loops
Use when: quality improves with “generate → check → fix → repeat.”
Template:
## Iterative report generation
1. Draft report
2. Run `scripts/check_report.py`
3. Fix issues found
4. Repeat until checks pass
5. Finalize formatting and save
What makes it work:
- explicit “quality bar”
- deterministic checks where possible
- a clear stop condition
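The loop driver itself is tiny; the value is in making the stop condition explicit. A sketch, where `draft_fn`, `check_fn`, and `fix_fn` stand in for skill steps or bundled scripts:

```python
# Generate → check → fix loop with an explicit stop condition (max_rounds).
def refine(draft_fn, check_fn, fix_fn, max_rounds=5):
    doc = draft_fn()
    for _ in range(max_rounds):
        issues = check_fn(doc)
        if not issues:          # quality bar met: stop
            return doc, True
        doc = fix_fn(doc, issues)
    return doc, False           # hit the round limit: surface remaining issues
```

Returning a pass/fail flag alongside the document keeps “it gave up” distinguishable from “it finished.”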
Pattern 4: Context-aware tool selection
Use when: same goal, different best tool depending on file type/size/context.
Example decision criteria:
- large binaries → cloud storage
- collaborative docs → docs system
- code → git hosting
- temporary → local
What makes it work:
- explicit decision rules
- fallbacks
- transparency: explain why you chose what you chose
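Decision rules like these are easiest to keep honest when written as one ordered function. The extensions, threshold, and destinations below are illustrative, not prescriptive:

```python
def choose_destination(filename: str, size_mb: float, temporary: bool) -> str:
    """Ordered decision rules for 'where should this file go?', with a fallback."""
    if temporary:
        return "local"
    if filename.endswith((".py", ".ts", ".go")):
        return "git hosting"
    if filename.endswith((".doc", ".docx", ".md")):
        return "docs system"
    if size_mb > 100:               # large binaries
        return "cloud storage"
    return "local"                  # fallback when no rule matches
```

Because the rules are ordered, the “why” is always answerable: the first matching condition is the explanation.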
Pattern 5: Domain-specific intelligence
Use when: your value is expertise, not just tool access.
Example: compliance, security checks, policy rules, data governance.
What makes it work:
- “rules before action”
- audit trail / logging guidance
- clear escalation path when rules fail
Troubleshooting (common failure modes)
Skill won’t upload
- Error like “SKILL.md not found” → file name is wrong. It must be exactly `SKILL.md`.
- “Invalid frontmatter” → YAML syntax issue. Always keep the `---` delimiters.
- “Invalid skill name” → use kebab-case, no spaces/caps.
Skill doesn’t trigger
- Your description is too generic.
- You didn’t include realistic trigger phrases.
- You didn’t mention key domain terms or file types.
Debug trick: ask the model “When would you use this skill?” If it can’t clearly answer from the description, your router is broken.
Skill triggers too often
- Add negative triggers (“Do NOT use when…”).
- Narrow the scope (be explicit about the one thing you handle).
MCP calls fail
Checklist:
- Confirm the server/connector is actually connected.
- Confirm auth is valid and has required permissions.
- Call a tool directly without the skill (to isolate the problem).
- Check tool names and casing match the integration docs.
Instructions aren’t followed
Common causes:
- too verbose (important bits get lost)
- critical rules buried (move them up and repeat them)
- ambiguous language (“validate properly” means nothing)
If something must be enforced, write it like a pre-flight checklist:
- “Before calling `create_project`, verify A, B, C. If any fail, ask the user to fix inputs.”
Model “laziness” / rushing
Sometimes you need explicit priorities:
- “Take your time.”
- “Don’t skip validation.”
- “Quality matters more than speed.”
(In practice, putting this in the user prompt often helps more than burying it deep in the skill file.)
Large context issues
If responses get slow or quality degrades:
- move deep docs into `references/`
- link to them instead of pasting everything into SKILL.md
- keep SKILL.md reasonably sized (the guide suggests staying under ~5,000 words)
- reduce how many skills are enabled at once (the guide flags that once you’re in the ~20–50+ range, it’s worth switching to selective enablement)
Appendix A: Quick checklist
Use this as a pre-flight before you share a skill.
Before you start
- I identified 2–3 concrete use cases.
- I know which tools are involved (built-in and/or MCP).
- I reviewed similar example skills.
- I planned a folder structure (`scripts/`, `references/`, `assets/` as needed).
During development
- Folder name is kebab-case.
- `SKILL.md` exists and is spelled exactly.
- Frontmatter uses `---` delimiters.
- `name` is kebab-case (no spaces/caps).
- `description` includes what + when + trigger phrases.
- No XML angle brackets (`<` or `>`) anywhere in the skill.
- Instructions are step-by-step and actionable.
- Error handling / troubleshooting exists.
- Examples exist (at least 2).
- References are linked clearly (not pasted inline forever).
Before upload / release
- I tested “obvious trigger” prompts.
- I tested paraphrases.
- I tested decoys (“should NOT trigger”).
- Functional scenario tests pass.
- Tool integration works (if applicable).
- If needed, I zipped the folder correctly.
After release
- I tested in real conversations.
- I watched for under/over-triggering.
- I collected feedback.
- I iterated on description and instructions.
- I bumped a version field in metadata (if I’m tracking versions).
Appendix B: YAML frontmatter reference
Required minimal frontmatter
---
name: skill-name-in-kebab-case
description: What it does and when to use it. Include realistic trigger phrases.
---
Full example (with common optional fields)
---
name: projecthub-sprint-planning
description: Plans a sprint end-to-end (prioritization, task creation, labels, estimates). Use when the user says "plan a sprint", "create sprint tasks", or "prioritize backlog". Do NOT use for general project chat.
license: MIT
compatibility: Requires network access and the ProjectHub MCP server connected; intended for Claude.ai and local coding environments.
allowed-tools: "Bash(python:*) WebFetch"
metadata:
author: ExampleCo
version: 1.0.0
mcp-server: projecthub
category: productivity
tags: [project-management, automation]
---
Security notes recap:
- Avoid `<` and `>` in frontmatter.
- Don’t try to smuggle instructions into metadata. Keep it clean and descriptive.
Appendix C: Where to find full examples
The PDF points to official docs and public example repositories. The highlights:
Official docs and guides (names as referenced in the guide):
- Best Practices Guide
- Skills Documentation
- API Reference
- MCP Documentation
Example skills:
- GitHub repository: `anthropics/skills` (the guide also references `anthropics/skills/issues` for bug reports)
- Partner skills directories (skills from popular integrations like issue trackers, design tools, monitoring tools, etc.)
If you’re stuck, the fastest path is usually:
- Find the closest example skill.
- Copy the structure.
- Swap in your domain rules and triggers.
How this maps to Codex skills (practical translation)
Everything above is the conceptual model. Here’s how it typically lands in Codex.
Where Codex skills live
Codex skills are usually stored in your user profile under:
C:\Users\{username}\.codex\skills\{skill-name}\SKILL.md
For example, if you installed a skill named pdf, it would live at:
C:\Users\{username}\.codex\skills\pdf\SKILL.md
A high-quality skill demonstrates “progressive disclosure”:
- A short frontmatter `name` and `description` that clearly says when to use it.
- A body that gives a concrete workflow (dependencies, conventions, quality expectations).
Codex frontmatter: same idea, similar pattern
Codex follows the same basic routing concept:
- Frontmatter is the router (when to load)
- Body is the playbook (how to do the work)
- Linked files are depth (references/templates/scripts)
Note: the exact validation rules and supported fields are platform-specific, but the “small router + bigger playbook” architecture carries over cleanly.
Minimal Codex skill frontmatter (example):
---
name: csv-audit
description: Audits CSV files for schema issues and generates a short data-quality report. Use when the user uploads a .csv or asks to "validate a CSV", "check required columns", or "find data issues". Do NOT use for basic charting.
---
A quick “Claude-skill → Codex-skill” adaptation checklist
- Tighten `description` until it’s a good router (what + when + realistic phrases).
- Add negative triggers if you see overtriggering.
- Keep SKILL.md steps modular and easy to follow.
- Add validation gates and troubleshooting for predictable failure points.
- Push bulky docs into `references/` and link them (don’t bloat the main file).
- When correctness matters, prefer scripts for validation over “English-only” rules.
One last practical tip
If you’re building multiple skills, keep a shared “test prompt bank” per skill:
- 10 prompts that should trigger
- 10 that shouldn’t
- 3–5 functional scenarios
It turns iteration from vibes into a loop you can actually control.