How to Let an LLM Choose the Right Function: Multi-Tool Function Calling for AIOps Incident Troubleshooting

This article shows how to let an LLM decide which function to call when multiple tools are available, and how it can gather evidence step by step like an operations engineer to complete incident diagnosis. The core value is replacing rigid if/else rules and improving automation in AIOps troubleshooting. Keywords: Function Calling, AIOps, tool selection.

The technical specification snapshot provides the key context

Primary language: Python
Invocation pattern: LLM Function Calling / Tool Calling
Typical scenarios: AIOps, intelligent troubleshooting, log analysis, monitoring diagnostics
Protocols / interfaces involved: Kubernetes, Prometheus, log platform APIs
Model invocation mode: tool_choice="auto"
Core dependencies: litellm, json, datetime
Publishing platform: Blog Garden
Original article view count: 78

This article explains the real decision logic in multi-tool scenarios

After many teams integrate large language models, their first instinct is to attach multiple functions and expect the model to “become smarter automatically.” The real issue is not the number of functions. It is whether the model understands what layer of evidence each tool is responsible for.

In AIOps scenarios, incidents often span three layers: Kubernetes, application logs, and node resources. If you continue to hard-code branches with if/else, the rules quickly spiral out of control. The value of Function Calling is that it hands tool selection over to the model.
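To make the contrast concrete, here is a hedged sketch of the rigid rule-based routing that Function Calling replaces. The keyword lists and the `route_by_rules` helper are illustrative assumptions, not code from the original article.

```python
def route_by_rules(question: str) -> str:
    """Hypothetical hard-coded router: one keyword branch per symptom."""
    q = question.lower()
    if "restart" in q or "crashloopbackoff" in q:
        return "get_pod_status"
    if "500" in q or "exception" in q or "timeout" in q:
        return "get_business_log"
    if "slow" in q or "cpu" in q or "memory" in q:
        return "get_system_metrics"
    return "unknown"  # every new symptom needs yet another branch
```

Note how a composite incident such as "500 errors and the Pod keeps restarting" hits only the first matching branch; the model's semantic tool selection sidesteps exactly this rigidity.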

Three functions each handle a different category of evidence

This example defines three functions: get_pod_status, get_business_log, and get_system_metrics. They correspond to container state, business errors, and system resources, forming a typical layered troubleshooting design.

from datetime import datetime, timedelta

def get_pod_status(namespace: str = "default", pod_name: str = ""):
    return {
        "namespace": namespace,
        "pod_name": pod_name,
        "status": "CrashLoopBackOff",  # The Pod is currently stuck in a repeated restart state
        "restart_count": 5,  # Total number of restarts so far
        "node": "10.10.1.23",
        "events": [
            "2026-04-20 11:42:11 Readiness probe failed: connection refused",
            "2026-04-20 11:42:25 Container restarted",
            "2026-04-20 11:42:25 Last State: Terminated, Reason: OOMKilled"
        ]
    }

This code provides Pod runtime facts to help the model determine whether probe failures, OOM events, or abnormal restarts are present.

def get_business_log(service_name: str = "", start_time: str = "", end_time: str = ""):
    if not start_time:
        start_time = (datetime.now() - timedelta(minutes=30)).strftime("%Y-%m-%d %H:%M:%S")
    if not end_time:
        end_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    return {
        "service_name": service_name,
        "time_range": {
            "start_time": start_time,
            "end_time": end_time
        },
        "logs": [
            "2026-04-20 11:41:02 [ERROR] order-service create order failed: java.lang.NullPointerException",
            "2026-04-20 11:41:03 [ERROR] order-service db timeout when inserting order record",
            "2026-04-20 11:41:04 [WARN] order-service retry failed, return HTTP 500"
        ]
    }

This code returns business log evidence to help the model locate HTTP 500 errors, null pointer exceptions, and database timeouts.
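The article names a third function, get_system_metrics, but does not show its body in this excerpt. Following the same mock pattern as the two functions above, it might look like the sketch below; the field names and values are assumptions, and a real version would query Prometheus or node-exporter.

```python
def get_system_metrics(node_ip: str = ""):
    # Mock node-level resource evidence; a real implementation would
    # query Prometheus / node-exporter for these values.
    return {
        "node_ip": node_ip,
        "cpu_usage_percent": 91.5,
        "memory_usage_percent": 96.2,
        "disk_io_wait_percent": 12.4,
        "load_average_1m": 18.7,
    }
```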

Tool descriptions determine whether the model chooses the right function

In multi-function scenarios, the model is not primarily memorizing function names. It is performing semantic matching. If the user says, “The Pod keeps restarting,” the model is more likely to match the Pod status tool first. If the user says, “The API keeps returning 500,” it is more likely to hit the log tool.

That means description is not a decorative field. It is the tool manual. The more specific it is, the more accurately the model can decide. The more abstract it is, the more likely the model is to choose the wrong tool.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_pod_status",
            "description": "Get the runtime status, restart count, and event information of a Kubernetes Pod for troubleshooting abnormal restarts, startup failures, probe failures, CrashLoopBackOff, and related issues.",
            "parameters": {
                "type": "object",
                "properties": {
                    "namespace": {"type": "string", "description": "The namespace where the Pod is located. Defaults to default."},
                    "pod_name": {"type": "string", "description": "The Pod name or workload name"}
                },
                "required": ["pod_name"]
            }
        }
    }
]

This configuration defines tool boundaries so the model knows what kind of problem should trigger which function.
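The tools list above shows only the get_pod_status entry. Following the same schema pattern and the article's advice on specific descriptions, the entries for the other two functions might look like this; the description wording is an assumption, intended to be appended to the tools list.

```python
extra_tools = [
    {
        "type": "function",
        "function": {
            "name": "get_business_log",
            "description": "Query business logs of a service within a time range, for investigating API errors such as HTTP 500, code exceptions, and database timeouts.",
            "parameters": {
                "type": "object",
                "properties": {
                    "service_name": {"type": "string", "description": "The service name, e.g. order-service"},
                    "start_time": {"type": "string", "description": "Start of the query window, format YYYY-MM-DD HH:MM:SS"},
                    "end_time": {"type": "string", "description": "End of the query window, same format"},
                },
                "required": ["service_name"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_system_metrics",
            "description": "Get CPU, memory, disk I/O, and load metrics for a node, for investigating node slowness and resource exhaustion.",
            "parameters": {
                "type": "object",
                "properties": {
                    "node_ip": {"type": "string", "description": "The node IP, e.g. 10.10.1.23"},
                },
                "required": ["node_ip"],
            },
        },
    },
]
```

In a full script these would be appended to the tools list shown above, e.g. tools += extra_tools.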

The model completes the evidence chain through iterative calls

The core of this example is not a single invocation but a loop: pass the user question and tool list to the model; if the model decides to call a function, execute it and feed the result back; if the evidence is still insufficient, continue to the next round of tool calls until the model produces a conclusion.

This pattern closely matches real operational troubleshooting: first determine the incident layer, then collect evidence, then cross-check it, and finally produce a root cause and remediation guidance.

The core agent execution loop is very short

import json
from litellm import completion

# Map tool names to the local implementations defined above.
# get_system_metrics, the article's third tool, would be registered the same way.
TOOL_REGISTRY = {
    "get_pod_status": get_pod_status,
    "get_business_log": get_business_log,
}

def run_agent(user_query: str):
    messages = [
        {
            "role": "system",
            "content": (
                "You are an online incident troubleshooting assistant. "
                "Autonomously choose the most appropriate tool based on the user's question. "
                "If one tool is not enough to identify the issue, continue calling other tools. "
                "Base your answers on facts returned by the tools, and do not fabricate information."
            )
        },
        {"role": "user", "content": user_query}
    ]

    while True:
        response = completion(
            model="doubao-seed-2.0-pro",
            messages=messages,
            tools=tools,
            tool_choice="auto"  # Let the model decide whether to call tools on its own
        )

        message = response.choices[0].message
        tool_calls = message.tool_calls

        if not tool_calls:
            return message.content  # If there are no new tool calls, return the final conclusion directly

        messages.append(message)

        # Execute each requested tool and feed the result back to the model;
        # without this step the loop would spin forever with no new evidence.
        for call in tool_calls:
            func = TOOL_REGISTRY.get(call.function.name)
            args = json.loads(call.function.arguments or "{}")
            result = func(**args) if func else {"error": f"unknown tool {call.function.name}"}
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result, ensure_ascii=False)
            })
This code implements a multi-round tool decision framework and serves as the minimum closed loop for applying Function Calling in AIOps.

Real tests show the model selects different tools based on the problem layer

The example includes four categories of requests: Pod restarts, API 500 errors, node slowness, and a combined issue of “500 + restart.” The results show that the model does not choose functions randomly. It triggers the corresponding tools based on the semantics of the issue.

A single issue usually calls only one function. A composite issue calls multiple functions in sequence, linking Pods, logs, and node resources into a complete evidence chain. This shows that the model already has the ability to collect layered evidence.

One composite incident makes the strategy easy to see

test_query = "order-service returns 500, and the Pod also keeps restarting. Help me analyze the issue end to end."
answer = run_agent(test_query)
print(answer)  # Output an analysis conclusion summarized from evidence across multiple tools

This test verifies that for composite problems, the model automatically chains multiple tools together to complete an overall analysis.

Improving function selection accuracy depends more on design than on the model itself

First, tool descriptions must explicitly state what kind of issues they are used to troubleshoot. A description like “get logs” has almost no decision value, while “used to investigate API errors, database timeouts, and code exceptions” directly improves hit rate.

Second, function boundaries must be clear. If two tools both claim they can “check everything,” the model will hesitate. The best design is one function per evidence category: no overlap and no ambiguity.

Parameter naming also directly affects tool selection

Abstract parameters such as id, name, and target are easy for the model to misinterpret. By contrast, pod_name, service_name, and node_ip provide much stronger semantic constraints and help the model fill in arguments automatically.
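A side-by-side sketch makes the difference visible. Both schema fragments below are hypothetical examples written for this comparison, not schemas from the original article.

```python
# Abstract names: the model must guess what "id" and "target" mean.
vague_params = {
    "type": "object",
    "properties": {
        "id": {"type": "string", "description": "Identifier"},
        "target": {"type": "string", "description": "Target"},
    },
}

# Domain-specific names: the semantics constrain argument filling.
specific_params = {
    "type": "object",
    "properties": {
        "pod_name": {
            "type": "string",
            "description": "The Pod or workload name, e.g. order-service",
        },
        "node_ip": {
            "type": "string",
            "description": "The node IP to inspect, e.g. 10.10.1.23",
        },
    },
    "required": ["pod_name"],
}
```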

Third, the system prompt must clearly tell the model that it may call tools multiple times, that it must answer based on facts, and that it should continue querying when evidence is insufficient. Without these constraints, the model often becomes too conservative or fabricates conclusions.

system_prompt = (
    "You are an online incident troubleshooting assistant. "
    "Check the facts first, then draw conclusions. "
    "If one tool is not enough to identify the issue, continue calling other tools. "
    "All answers must be based on data returned by the tools. Do not fabricate information."
)

This prompt defines the model’s behavioral boundaries and has a significant impact on multi-tool reasoning quality.

Function Calling still requires engineering safeguards in production

The article also highlights a key reality: Function Calling is only the decision entry point. It does not automatically make a system production-ready. To deploy it in production, you still need auditing, access control, timeouts, retries, rate limiting, and result traceability.

In other words, the model can replace if/else decision logic, but it cannot replace engineering governance. A good multi-tool system is fundamentally the combination of model decision-making, tool boundaries, and engineering constraints.
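One way to picture those engineering constraints is a guard layer between the model's decision and the actual tool execution. The sketch below is a hypothetical illustration of allow-listing, audit logging, retries, and a soft timeout check; a production deployment would add hard timeouts, rate limiting, and tracing.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool-audit")

# Hypothetical allow-list, matching the article's three tools.
ALLOWED_TOOLS = {"get_pod_status", "get_business_log", "get_system_metrics"}

def guarded_call(registry, name, arguments_json, soft_timeout_s=5.0, retries=1):
    """Execute one tool call with allow-listing, auditing, and retries (sketch)."""
    if name not in ALLOWED_TOOLS or name not in registry:
        raise PermissionError(f"tool {name!r} is not allow-listed")
    args = json.loads(arguments_json or "{}")
    last_exc = None
    for _attempt in range(retries + 1):
        start = time.monotonic()
        try:
            result = registry[name](**args)
            elapsed = time.monotonic() - start
            log.info("tool=%s args=%s elapsed=%.3fs", name, args, elapsed)
            if elapsed > soft_timeout_s:
                log.warning("tool=%s exceeded soft timeout %.1fs", name, soft_timeout_s)
            return result
        except Exception as exc:  # retry once on transient tool failures
            last_exc = exc
    raise last_exc
```

Slotting this between the model's tool_calls and the real functions keeps the decision layer (the LLM) separate from the governance layer (the guard).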

FAQ structured Q&A

Q1: Why does the model sometimes choose the wrong tool even with the same function list?

A1: The most common reasons are tool descriptions that are too short, overlapping boundaries, or abstract parameter names. The model relies on semantic matching to make decisions, so the clearer the tool documentation is, the higher the hit rate.

Q2: Is a multi-tool setup always better than a single-tool setup?

A2: Not necessarily. If the scenario is simple, a single tool is usually more stable. Multi-tool calling significantly improves diagnosis efficiency only when the problem spans multiple evidence layers, such as Pods, logs, and system monitoring.

Q3: What is most often overlooked when moving this into production?

A3: The most commonly overlooked areas are access control, invocation timeouts, audit logging, and rate limiting. Without these mechanisms, even if the model chooses the right tool, it can still create security and stability risks.

Core Summary: Based on a real incident troubleshooting case, this article reconstructs the core mechanism behind multi-function Function Calling: how an LLM uses tool descriptions, parameter semantics, and system prompts to autonomously choose functions such as Pod status, business logs, and system monitoring, then completes incident diagnosis through iterative evidence gathering. This pattern applies to multi-tool decision scenarios such as AIOps, intelligent customer support, and data analysis.