This guide presents an engineering-focused Skill framework for AI Agents. Its core capability is to structure, standardize, and close the loop on production 5xx incident troubleshooting by integrating tools into the workflow. It addresses hard-to-maintain prompts, insufficient troubleshooting evidence, and drifting analysis outputs. Keywords: Agent Skill, incident troubleshooting, Claude Code.
The technical specification snapshot captures the overall design
| Parameter | Description |
|---|---|
| Primary form | Markdown rules + reference materials + examples + scripts |
| Runtime environment | Claude Code / OpenClaw-style Agent environments |
| Core languages | Markdown, JSON, Shell/Python (scripts are extensible) |
| Typical protocols | Tool Call / Function Call / filesystem interaction |
| Applicable scenarios | AIOps, 5xx incident analysis, standardized diagnostic workflows |
| Core dependencies | SKILL.md, references/, examples/, scripts/ |
A Skill is not a prompt but a maintainable engineering structure
Many teams treat a Skill as just a longer prompt, but that view is too narrow. The real value of a Skill is that it breaks down “trigger conditions, execution order, references, output format, and behavioral constraints” into a stable structure, giving AI a reusable workflow.
The original example focuses on production incident troubleshooting. It does not aim for a one-shot answer. Instead, it expects the Agent to collect evidence along a defined path, identify anomalies, produce conclusions, and continue calling tools when evidence is incomplete.
A minimal Skill directory usually includes four layers
```
.
├── SKILL.md
├── references/
├── examples/
└── scripts/
```
This layout defines the minimal engineering skeleton of a Skill: the main entry file owns the rules, the references directory owns the knowledge base, the examples directory calibrates behavior, and the scripts directory handles execution.
Each directory must have clearly separated responsibilities
SKILL.md is the main entry point. It should explicitly state what problem the Skill solves, when it should trigger, which materials to read first, what execution order to follow, what the output must look like, and which behaviors to avoid.
references/ stores accumulated knowledge and rules. This keeps the main file from growing into an unmaintainable monolith. When you later refine troubleshooting strategies, you only need to update the relevant reference documents rather than the main entry file.
The directory split in the practical example is closer to a production setup
```
.
├── SKILL.md
├── references/
│   ├── triage-playbook.md
│   ├── metrics-checklist.md
│   └── output-template.md
├── examples/
│   ├── alert-input.json
│   └── expected-analysis.md
└── scripts/
```
This structure decouples the process playbook, metrics checklist, output template, input examples, and scripting capabilities so they can evolve independently.
This Skill structure is especially well suited to production incident troubleshooting
In AIOps, troubleshooting is naturally staged: first inspect symptoms, then check logs, then compare metrics, and finally narrow down the root cause. If you hard-code this logic into a single prompt, it becomes nearly impossible to maintain and difficult to keep outputs consistent.
The file mapping in the example is critical: triage-playbook.md defines the troubleshooting order, metrics-checklist.md specifies required metrics, output-template.md constrains the output structure, and examples/ provides aligned input-output samples for the model.
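To make this decoupling concrete, here is a minimal sketch of how an Agent runtime could load these reference files before analysis. The function name load_references and the path layout are illustrative assumptions, not part of the original Skill.

```python
# Illustrative sketch: load the reference documents named above so the playbook,
# checklist, and output template can evolve independently of the main entry file.
from pathlib import Path

REFERENCE_FILES = ["triage-playbook.md", "metrics-checklist.md", "output-template.md"]

def load_references(skill_root: str = ".") -> dict:
    """Read each reference document and return it keyed by file name."""
    refs_dir = Path(skill_root) / "references"
    return {name: (refs_dir / name).read_text(encoding="utf-8") for name in REFERENCE_FILES}
```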
A standard input can significantly improve analytical completeness
```json
{
  "service": "order-service",
  "env": "prod",
  "time_window": "2026-04-24 14:05 ~ 14:12",
  "alert_title": "Order service 5xx error rate elevated",
  "symptom": "Error rate on /api/order/create rose from 0.3% to 18%",
  "logs": [
    "2026-04-24T14:06:13 ERROR order-service create order failed: dial tcp 10.21.4.15:3306: i/o timeout",
    "2026-04-24T14:06:14 ERROR order-service query inventory failed: dial tcp 10.21.4.15:3306: i/o timeout"
  ],
  "metrics": {
    "5xx_rate": "0.3% -> 18%",
    "p95_latency": "120ms -> 4.8s",
    "db_connection_timeout": "continuously rising",
    "cpu": "no significant anomaly",
    "memory": "no significant anomaly"
  }
}
```
This input provides the service, environment, time window, symptoms, logs, and metrics all at once, allowing the Skill to complete a structured diagnosis through an evidence chain.
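As a small illustration of why a standard input matters, the sketch below checks which of the fields from the example are present before analysis begins. The field list and the function name check_evidence_completeness are assumptions for illustration, not part of the original Skill.

```python
# Minimal sketch: verify that an incoming alert carries the evidence fields
# shown in the example input above before a structured diagnosis starts.
REQUIRED_FIELDS = ["service", "env", "time_window", "symptom", "logs", "metrics"]

def check_evidence_completeness(alert: dict) -> dict:
    """Report which evidence fields are present and which are still missing."""
    missing = [field for field in REQUIRED_FIELDS if not alert.get(field)]
    return {
        "complete": not missing,
        "missing": missing,
        # The Skill can use this to decide whether to diagnose now or gather more data
        "recommendation": "proceed with structured diagnosis" if not missing
                          else "request missing evidence before concluding",
    }
```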
Non-standard input reveals the Skill’s evidence dependency boundaries
The original article intentionally shows a more realistic case: the user provides only a vague alert and a single error log. In that case, the Skill can only provide a preliminary attribution, not a definitive root cause, because the evidence chain is still far from complete.
This shows that the value of a Skill is not that it “guesses more like a human,” but that it preserves process discipline even when evidence is insufficient. It tells you clearly which data is missing instead of answering just for the sake of answering.
When only a single log line is available, the output should remain a pending diagnosis
```python
def analyze_incident(log_line: str) -> dict:
    # Extract known facts from the log first
    if "3306" in log_line and "i/o timeout" in log_line:
        return {
            "preliminary_judgement": "suspected database connection timeout",  # Start with a preliminary judgment
            "confidence": "medium",  # Evidence is insufficient, so do not declare root cause directly
            "next_actions": [
                "Check MySQL connectivity",
                "Review connection pool and timeout metrics",
                "Compare p95 latency and 5xx changes within the same time window"
            ]
        }
    return {"preliminary_judgement": "insufficient information"}
```
This code demonstrates a conservative analysis strategy under insufficient evidence: provide an initial judgment first, then clearly define the next evidence-gathering actions.
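For instance, calling it with the first log line from the standard input above (an illustrative usage, not part of the original article) returns the preliminary judgment together with the follow-up actions:

```python
# Illustrative call using the first log line from the standard input example
result = analyze_incident(
    "2026-04-24T14:06:13 ERROR order-service create order failed: "
    "dial tcp 10.21.4.15:3306: i/o timeout"
)
print(result["preliminary_judgement"])  # suspected database connection timeout
print(result["next_actions"])           # the evidence-gathering steps to run next
```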
A Skill forms a true closed-loop workflow only after it integrates tools
When static input is not enough, the Skill must actively fetch data. In the original example, this is done by adding a query tool under scripts/, such as get_mysql_state, which reports whether MySQL is currently reachable and whether its health indicators have degraded.
At the same time, SKILL.md should declare the available tools, while references/tools.md should describe each tool’s purpose, input parameters, invocation method, and return structure. Only then does the model know when to query, what to query, and how to use the results afterward.
Tool declarations should be written as executable constraint interfaces
```markdown
## Available tools
- Tool name: get_mysql_state
- Purpose: Check MySQL connectivity, connection count, and basic health state
- Trigger conditions: Logs show 3306 timeouts, database connection errors, or connection pool anomalies
- Output requirements: Return status, error messages, timestamp, and key metrics
- Prohibited actions: Do not invoke blindly when there is no database-related evidence
```
These rules upgrade the Skill from “able to analyze” to “able to decide when to gather new evidence.”
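As a sketch of what the corresponding scripts/ tool could look like, the following get_mysql_state uses a plain TCP reachability probe and returns the declared fields. The probe approach, host, port, and field names are assumptions; a real implementation would query the database or its monitoring API.

```python
# Minimal sketch of a scripts/ tool behind the get_mysql_state declaration.
# Host, port, and field names are illustrative assumptions.
import socket
from datetime import datetime, timezone

def get_mysql_state(host: str = "10.21.4.15", port: int = 3306, timeout: float = 2.0) -> dict:
    """Return status, error message, timestamp, and key metrics, as the declaration requires."""
    result = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "target": f"{host}:{port}",
        "status": "unknown",
        "error": None,
        "metrics": {},
    }
    try:
        # Probe basic connectivity only; richer health checks would go here
        with socket.create_connection((host, port), timeout=timeout):
            result["status"] = "reachable"
    except OSError as exc:
        result["status"] = "unreachable"
        result["error"] = str(exc)
    return result
```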
The boundary between Skills and function calls is not a conflict but a layered collaboration
The original comparison is accurate: function calls are better suited for foundational tool construction, can be deployed independently, and look more like components of a traditional engineering system. A Skill acts more like the orchestration layer of an Agent, responsible for process, judgment, context organization, and output standardization.
In other words, function calls answer “what can be done,” while Skills answer “when to do it, in what order to do it, and how to present the result.” They are not substitutes. They are upper and lower layers of the same system.
A few lines of pseudocode explain how they work together
```python
def skill_workflow(alert):
    # Read rules and checklists first to determine troubleshooting order
    context = load_references()
    # Decide whether to call lower-level tools based on symptoms
    mysql_state = None
    if needs_mysql_check(alert):
        mysql_state = get_mysql_state(alert)
    # Merge new evidence into the context and generate a structured conclusion
    return render_analysis(context, alert, mysql_state)
```
This code shows that the Skill handles orchestration, while the function call handles execution. The combination is what makes complex incident scenarios manageable.
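The helpers load_references, needs_mysql_check, and render_analysis are placeholders in that pseudocode. One way needs_mysql_check could be sketched, reusing the trigger conditions from the tool declaration, is shown below; the keyword list is an assumption.

```python
# Illustrative gate for the pseudocode above, based on the trigger conditions
# declared for get_mysql_state. The keyword list is an assumption, not a spec.
DB_EVIDENCE_KEYWORDS = ("3306", "i/o timeout", "connection pool", "too many connections")

def needs_mysql_check(alert: dict) -> bool:
    """Call the database tool only when the logs contain database-related evidence."""
    logs = " ".join(alert.get("logs", []))
    return any(keyword in logs for keyword in DB_EVIDENCE_KEYWORDS)
```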
*(Figure: a troubleshooting case illustration showing the Agent's analysis results, evidence retrieval, and state feedback after tool invocation, underscoring that a Skill is not static documentation but a mechanism that builds a continuous diagnostic chain from logs, metrics, and tool-derived evidence.)*
The key to operationalizing a Skill is not writing longer documents but fixing the diagnostic protocol
If you want to use this kind of Skill in production, the priority is not adding more background explanation. The priority is to continuously improve three classes of assets: troubleshooting playbooks, metrics checklists, and tool interfaces. These determine whether the Skill is stable, portable, and auditable.
For teams, the best practice is to first build templated Skills for high-frequency incidents, then gradually extend specialized references/ and scripts/ for databases, caches, networks, and message queues, eventually forming a composable troubleshooting asset library.
FAQ in a structured Q&A format
Q1: What is the essential difference between a Skill and a normal prompt?
A1: A prompt is usually a one-off expression, while a Skill is engineered orchestration. A Skill manages rules, materials, examples, tools, and output templates in separate layers, so it is much better suited for long-term maintenance and team reuse.
Q2: Why does a Skill avoid giving a direct root cause under non-standard input?
A2: Because incident analysis depends on an evidence chain. If metrics, context, or tool return values are missing, declaring a root cause directly increases the risk of hallucination. A high-quality Skill should identify information gaps first and then guide further evidence collection.
Q3: When should you prioritize a Skill, and when should you prioritize a function call?
A3: Prioritize a function call when you need reusable low-level capabilities, independent deployment, or system integration. Prioritize a Skill when you need multi-step reasoning, reference constraints, output standardization, and Agent orchestration. In production, a common pattern is for a Skill to invoke function calls.
Core summary
This article reconstructs a practical Agent Skill case based on Claude Code. It systematically explains the responsibility boundaries of SKILL.md, references/, examples/, and scripts/, shows how to turn incident troubleshooting into a maintainable, iterative, tool-callable engineering capability, and compares the applicability boundaries between Skills and function calls.