Production-Ready AI Tool System Design: Unified Protocols, Access Control, and Observable Execution in Go

In production AI tool systems, the real challenge is not whether the model can call tools, but whether the protocol is unified, permissions are enforceable, and execution is observable. Based on Go engineering practices, this article breaks down Tool protocols, a layered architecture, and a dual-authorization design. Keywords: Function Calling, access governance, observability.

Technical Specification Snapshot

  • Language: Go
  • Protocol: custom lightweight Tool protocol inspired by JSON Schema
  • Use cases: AI Agents / Function Calling / tool orchestration
  • Architecture: service / domain / infra (three layers)
  • Permission model: self_only / OrgCapability / SuperAdminOnly
  • Core dependencies: Eino, GORM, context

A production-grade AI tool system must solve three categories of problems

In AI applications, the LLM defines the upper bound of understanding and reasoning, but the tool system determines whether the system can actually perform work. Without tools, the model remains limited to text generation. With tools, the model gains the ability to access business capabilities, trigger real operations, and return structured results.

Many articles focus only on Function Calling parameter definitions, while overlooking the parts that matter more in production: protocol consistency, permission boundaries, and execution visibility. Once any of these are missing, the system quickly evolves into a high-risk module that is hard to extend, hard to govern, and hard to audit.

The core pain points in production

  1. Tool definitions are inconsistent, and protocols become fragmented as new tools are added.
  2. The model may become aware of tools that should not be exposed, creating a risk of unauthorized calls.
  3. Tool execution becomes a black box, leaving both the frontend and end users unable to understand system state.

All three problems share one answer: a single abstraction that every tool must implement.

// The Tool defines a unified protocol rather than a specific business function
// The core goal is to constrain metadata, invocation format, and return structure
type Tool interface {
    Spec() ToolSpec
    Call(ctx context.Context, call ToolCall, callCtx ToolCallContext) (ToolResult, error)
}

The value of this code lies in binding definition and execution to the same abstract interface.
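
The Call signature also takes a ToolCallContext, which the original code does not define. A minimal sketch, assuming it carries the caller's resolved identity and capability set (field names here are illustrative), might look like this:

// ToolCallContext carries the caller's resolved identity and capability set
// into every execution (a sketch; field names are illustrative)
type ToolCallContext struct {
    UserID       string
    OrgID        string
    IsSuperAdmin bool
    Capabilities map[string]struct{}
}

// HasCapability reports whether the caller holds the named capability
func (c ToolCallContext) HasCapability(name string) bool {
    _, ok := c.Capabilities[name]
    return ok
}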

A unified protocol should be established before the number of tools grows

If the system starts without protocol constraints, tools will appear in different formats across teams and modules. Governance introduced later usually costs far more than making the abstraction early.

The original design adopts a lightweight custom protocol that borrows the expressive ideas of JSON Schema without adopting it wholesale. This is not a stance against standards; the priorities are engineering readability, controllability, and speed of evolution.

The protocol should cover at least four categories of objects

  • ToolSpec: describes the tool name, capability, and parameters
  • ToolCall: describes a single invocation target and input arguments
  • ToolResult: standardizes output, summary, and detailed content
  • Tool: provides the unified definition and execution entry point

In Go, these objects take the following shape (the Tool interface itself appeared above):

type ToolParameter struct {
    Name        string
    Type        ToolParameterType
    Description string
    Required    bool
    Enum        []string
    Properties  []ToolParameter
    Items       *ToolParameter
}

type ToolSpec struct {
    Name        string
    Description string
    Parameters  []ToolParameter
}

type ToolCall struct {
    ID            string
    Name          string
    ArgumentsJSON string // Use a JSON string consistently to carry invocation arguments
}

type ToolResult struct {
    Output         string
    Summary        string
    DetailMarkdown string // Makes it easy for the frontend to render execution details directly
}

These structures establish unified semantics for tool metadata, invocation payloads, and result presentation.
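
To make the protocol concrete, here is a hypothetical tool description expressed in these structures (the query_orders tool and the ToolParameterType constants are illustrative, not from the original design):

// A hypothetical tool expressed in the unified protocol
// (query_orders and the ToolParameterType constants are illustrative)
var queryOrdersSpec = ToolSpec{
    Name:        "query_orders",
    Description: "Query orders within the caller's organization",
    Parameters: []ToolParameter{
        {
            Name:        "status",
            Type:        ToolParameterTypeString,
            Description: "Filter orders by status",
            Enum:        []string{"pending", "paid", "shipped"},
        },
        {
            Name:        "limit",
            Type:        ToolParameterTypeInteger,
            Description: "Maximum number of orders to return",
            Required:    true,
        },
    },
}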

A three-layer architecture solves both business decoupling and framework replacement

The key question in a tool system is not only how to invoke tools, but also where business rules should live and where technical details should live. If permissions, prompts, tool filtering, and framework bindings are all piled into one layer, the system will eventually suffer from the classic God Service problem.

This design splits the system into service, domain, and infra layers. The domain layer handles protocols and domain abstractions. The service layer orchestrates business rules. The infra layer integrates Eino, databases, and other infrastructure.

Layer responsibilities should be clearly fixed

  • domain: defines the Tool protocol, capability model, and boundary interfaces
  • service: filters the set of tools visible in the current turn based on user, organization, and role
  • infra: implements tool calling, storage, logging, and framework integration

The first-round filtering in the service layer looks like this:

func (s *ToolService) BuildVisibleTools(ctx context.Context, user UserContext) []Tool {
    // Perform the first-round filtering based on identity and capabilities
    caps := s.permissionRepo.ListCapabilities(ctx, user)

    // Return only the tools visible in the current turn to avoid full exposure
    return s.registry.FilterByCapabilities(caps)
}

This logic reflects a simple principle: filter by business rules first, then let the model choose.

Access governance should use dual validation rather than a single gate

The most dangerous misconception in AI tool systems is relying on prompts alone to constrain the model. Prompts are only a soft constraint and can never replace real permission control.

In the original design, permissions are not hard-coded into identity branches. Instead, they are abstracted into a capability model. The system first resolves the user’s user_id, org_id, and admin status, and then maps them to different capability sets.
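
A sketch of that resolution step, with illustrative capability names and assumed UserContext fields:

// ResolveCapabilities maps a resolved identity onto a capability set
// (a sketch; capability names and UserContext fields are illustrative)
func ResolveCapabilities(user UserContext) []string {
    caps := []string{"self_only"} // every authenticated user may touch personal resources

    if user.OrgID != "" {
        caps = append(caps, "org.read", "org.write") // OrgCapability scope
    }
    if user.IsSuperAdmin {
        caps = append(caps, "resource.delete") // SuperAdminOnly scope
    }
    return caps
}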

The permission model should be capability-oriented rather than hard-coded by role

  • self_only: allows access to personal resources
  • OrgCapability: allows operations within the organization scope
  • SuperAdminOnly: allows operational or high-risk administrative actions

The first validation happens before tools are exposed: it reduces token usage and keeps tools the user cannot access out of the model's awareness entirely. The second validation happens at execution time: it prevents malformed or adversarial model-generated calls from crossing the permission boundary.

func (t *DeleteResourceTool) Call(ctx context.Context, call ToolCall, callCtx ToolCallContext) (ToolResult, error) {
    // Second authorization check: final guard before execution
    if !callCtx.HasCapability("resource.delete") {
        return ToolResult{}, errors.New("permission denied for this tool")
    }

    // Enter real business logic only after authorization succeeds
    // (the actual deletion is elided here)
    return ToolResult{Summary: "Deleted successfully"}, nil
}

This kind of pre-execution authorization is a non-negotiable security baseline in production.

An observable execution chain significantly improves frontend experience and troubleshooting efficiency

For users, the worst AI experience is not slowness, but silent slowness. If the tool system does not emit stage events, the frontend can only wait for the final answer and cannot show whether the system is querying, writing, or retrying.

This design uses events such as tool_call_started and tool_call_finished so that the frontend can continuously display execution progress, tool names, and stage results. This level of observability improves user experience and also makes troubleshooting and audit trails much easier.
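
The original names the events but not their transport. One possible shape, assuming a simple EventSink that the frontend subscribes to over SSE or WebSocket, is a wrapper around Tool.Call:

// ToolEvent is pushed to the frontend at each execution stage
// (a sketch; EventSink and the field layout are assumptions)
type ToolEvent struct {
    Type     string // "tool_call_started" or "tool_call_finished"
    ToolName string
    CallID   string
    Summary  string // stage result, filled in on finish
}

type EventSink interface {
    Emit(ctx context.Context, ev ToolEvent)
}

// CallWithEvents wraps a Tool so every execution emits start and finish events
func CallWithEvents(ctx context.Context, t Tool, call ToolCall, callCtx ToolCallContext, sink EventSink) (ToolResult, error) {
    sink.Emit(ctx, ToolEvent{Type: "tool_call_started", ToolName: call.Name, CallID: call.ID})

    res, err := t.Call(ctx, call, callCtx)

    finished := ToolEvent{Type: "tool_call_finished", ToolName: call.Name, CallID: call.ID}
    if err == nil {
        finished.Summary = res.Summary
    }
    sink.Emit(ctx, finished)
    return res, err
}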

Pre-validation and retry strategies should distinguish error types

Formatting errors can be retried in a limited way within a ReAct flow. Missing required parameters should stop the process early and ask the user for more information. This reduces wasted turns and prevents the model from blindly continuing in an invalid context.

// Sentinel errors produced by argument pre-validation
var (
    ErrBadFormat            = errors.New("tool arguments are not valid JSON")
    ErrMissingRequiredField = errors.New("a required tool argument is missing")
)

func HandleToolError(err error) string {
    // Invalid argument format: allow the model to fix and retry
    if errors.Is(err, ErrBadFormat) {
        return "retry"
    }
    // Missing required fields: ask the user directly
    if errors.Is(err, ErrMissingRequiredField) {
        return "ask_user"
    }
    // Anything else is treated as unrecoverable in this turn
    return "abort"
}

The goal of this strategy is to upgrade error handling from uniform failure to type-based recovery.
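
In the orchestration loop, these outcomes might be consumed as follows (retryWithModel, askUser, and abortTurn are hypothetical helpers):

// handleFailedCall routes a failed tool call to the matching recovery path
// (retryWithModel, askUser, and abortTurn are hypothetical helpers)
func handleFailedCall(ctx context.Context, call ToolCall, err error) (ToolResult, error) {
    switch HandleToolError(err) {
    case "retry":
        // Feed the error back so the model can repair its arguments, with a bounded retry count
        return retryWithModel(ctx, call, err)
    case "ask_user":
        // Stop early and ask the user for the missing information
        return askUser(ctx, err)
    default:
        // Unrecoverable within this turn: abort and surface the failure
        return abortTurn(ctx, err)
    }
}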

Low-cost extensibility depends on registration-based development rather than a giant switch

Once the protocol, permissions, and layering are stable, integrating a new tool becomes an incremental task rather than a structural rewrite. In most cases, developers only need to add three parts: metadata, access policy, and execution logic.

This means a new tool does not need to be squeezed into a bloated switch-case, and permission changes only require updating capability assignments instead of rewriting the invocation chain. For continuously evolving Agent systems, this maintainability matters more than one-time implementation speed.
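
A minimal sketch of such a registry, assuming capabilities are plain strings and each tool is registered together with the capability required to see it:

// registeredTool pairs a tool with the capability needed to see it
// (a sketch; the pairing and string capabilities are assumptions)
type registeredTool struct {
    tool               Tool
    requiredCapability string
}

// ToolRegistry is the single integration point for new tools:
// no switch-case, just one Register call per tool
type ToolRegistry struct {
    tools []registeredTool
}

func (r *ToolRegistry) Register(t Tool, requiredCapability string) {
    r.tools = append(r.tools, registeredTool{tool: t, requiredCapability: requiredCapability})
}

// FilterByCapabilities returns only the tools the caller may see this turn
func (r *ToolRegistry) FilterByCapabilities(caps []string) []Tool {
    allowed := make(map[string]struct{}, len(caps))
    for _, c := range caps {
        allowed[c] = struct{}{}
    }

    var visible []Tool
    for _, rt := range r.tools {
        if _, ok := allowed[rt.requiredCapability]; ok {
            visible = append(visible, rt.tool)
        }
    }
    return visible
}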

Performance optimization should start by exposing fewer tools

Many systems expose the full tool set to the model and let the model choose on its own. This wastes tokens, increases the risk of unauthorized access, and raises the failure rate of calls.

A better strategy is to expose the minimum visible tool set for each turn. If the user has no tool permissions in the current context, the system should fall back directly to pure text mode without running ReAct or a more complex orchestration path. This is the most direct and stable form of performance optimization.
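
A sketch of that fallback, with plainChat and runReAct as hypothetical entry points on the service:

// Answer routes a turn to ReAct or pure text depending on visible tools
// (plainChat and runReAct are hypothetical entry points)
func (s *ToolService) Answer(ctx context.Context, user UserContext, prompt string) (string, error) {
    tools := s.BuildVisibleTools(ctx, user)
    if len(tools) == 0 {
        // No tool permissions this turn: skip ReAct entirely and answer in text mode
        return s.plainChat(ctx, prompt)
    }
    return s.runReAct(ctx, prompt, tools)
}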

FAQ

Why not use full JSON Schema directly?

If the current goal is to quickly establish a unified protocol, permission control, and an observable execution chain, a lightweight abstraction is usually more efficient. If external compatibility becomes necessary later, you can add an adapter layer without taking on too much design complexity at the beginning.
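
Such an adapter can start as a single conversion function. A sketch that maps ToolParameter onto standard JSON Schema keywords, assuming ToolParameterType values use JSON Schema type names:

// ToJSONSchema converts a ToolParameter into a JSON Schema fragment
// (a sketch; assumes ToolParameterType's string form matches JSON Schema type names)
func ToJSONSchema(p ToolParameter) map[string]any {
    schema := map[string]any{
        "type":        string(p.Type),
        "description": p.Description,
    }
    if len(p.Enum) > 0 {
        schema["enum"] = p.Enum
    }
    if len(p.Properties) > 0 { // object parameters
        props := map[string]any{}
        var required []string
        for _, child := range p.Properties {
            props[child.Name] = ToJSONSchema(child)
            if child.Required {
                required = append(required, child.Name)
            }
        }
        schema["properties"] = props
        if len(required) > 0 {
            schema["required"] = required
        }
    }
    if p.Items != nil { // array parameters
        schema["items"] = ToJSONSchema(*p.Items)
    }
    return schema
}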

Why should permissions be validated twice?

The first validation controls what the model can see. The second validation controls what the model can actually do. The former solves the exposure-surface problem, while the latter solves execution safety. Neither can replace the other.

What is the biggest benefit of the service, domain, and infra split?

The core benefit is decoupling. When business rules change, update service. When the protocol evolves, update domain. When you swap Eino for LangChain or replace a storage implementation, change only infra. This significantly reduces long-term maintenance costs.

AI Readability Summary

This article reconstructs a production-ready AI tool system design approach from practical Go-based AI Agent engineering experience. It focuses on a unified Tool protocol, a three-layer architecture, dual permission checks, an observable execution chain, and performance optimization strategies.