The core of Intelligent Operations 2.0 is not simply connecting large language models to IT operations. It is about rebuilding a data foundation that AI can truly understand and reason over, then using multiple agents to create a closed loop of perception, reasoning, and execution. This approach addresses the weak implementation, high noise, and slow root cause analysis that often limit traditional AIOps. Keywords: Intelligent Operations 2.0, AI-Native Data Governance, AI Observability.
Technical Specification Snapshot
| Parameter | Details |
|---|---|
| Domain | Intelligent Operations / AIOps / SRE |
| Core Paradigm | Multi-agent collaboration with large models and small models |
| Telemetry Model | MELT+E (Metrics, Events, Logs, Traces, Evals) |
| Data Foundation | Alerts, logs, traces, CMDB, knowledge base |
| Core Dependencies | LLM, RAG, workflow engine, time-series anomaly detection, tagging system |
| Implementation Focus | Root cause analysis, change assessment, automated remediation |
Intelligent Operations 2.0 represents a large-scale paradigm shift toward AI-driven decision-making
Traditional Intelligent Operations 1.0 essentially treated AI as an auxiliary tool for alert compression, anomaly detection, or localized recommendations. It improved efficiency, but it could not handle cross-domain judgment and continuous decision-making in complex systems.
Intelligent Operations 2.0 elevates AI into a decision brain. The priority is no longer single-model accuracy at a single point. The goal is to enable AI to understand object relationships, business context, and historical remediation experience, and ultimately build executable closed-loop operations.
The capability gap between 1.0 and 2.0 is clear
| Dimension | 1.0 Paradigm | 2.0 Paradigm |
|---|---|---|
| Positioning | Auxiliary tool | Decision brain |
| Model Architecture | Single small model | Collaboration between large and small models |
| Data Form | Raw data accumulation | Semantically governed and tagged data |
| Human Role | SREs respond passively | SREs define strategy and coordinate execution |
| Output Capability | Point detection | Causal inference and autonomous troubleshooting |
Massive alert volume -> Small model denoising -> Large model reasoning -> Execution engine remediation -> Result feedback loop
# Core logic: compress noise first, then reason over semantic data, and finally enter an automated execution loop
This chain shows that Intelligent Operations 2.0 is not about replacing humans with models. It is about upgrading data, decision-making, and execution at the same time.
AI-native data governance determines whether an operations system can truly be understood by AI
Many AIOps projects fail not because the model is too weak, but because the input data remains fragmented, raw, and devoid of context. AI may see the logs, but it still does not know which service they came from, which business path they affect, or whether they correlate with a change window.
For that reason, the goal of data governance must evolve from correlation-ready to reasoning-ready. Only when logs, alerts, traces, and CMDB records are semantically packaged can models make stable decisions based on context.
In practice, you should prioritize three governance steps
The first step is preprocessing. Use log template extraction, alert clustering, and dimensionality reduction to compress repetitive patterns into a small set of structured templates. This reduces unnecessary reading costs for large models.
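As a minimal sketch of this preprocessing step, the code below masks variable fields in raw log lines to collapse them into templates with occurrence counts. The masking regexes and function name are illustrative assumptions; production systems typically use dedicated template-mining algorithms such as Drain.

import re
from collections import Counter
from typing import List, Tuple

# Illustrative masking rules; real pipelines use purpose-built template miners
VARIABLE_PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),  # IPv4 addresses
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),              # hex identifiers
    (re.compile(r"\b\d+\b"), "<NUM>"),                     # plain numbers
]

def extract_templates(lines: List[str]) -> List[Tuple[str, int]]:
    # Collapse repetitive raw log lines into a small set of templates with counts
    counter: Counter = Counter()
    for line in lines:
        template = line
        for pattern, placeholder in VARIABLE_PATTERNS:
            template = pattern.sub(placeholder, template)
        counter[template] += 1
    return counter.most_common()

Feeding a large model the template list plus counts, instead of millions of raw lines, is what keeps its reading costs bounded.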
The second step is intelligent tagging. Add labels such as incident category, impact scope, business attributes, and environment tier to operational objects, so AI does not just see data, but also understands the business context around it.
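A rule-based tagger is often enough to bootstrap this step. The sketch below attaches category, business, and environment labels to an operational object; the specific rules and field names are hypothetical examples, not a prescribed schema.

from typing import Dict, List

def tag_operational_object(obj: Dict) -> List[str]:
    # Attach semantic labels so downstream agents see business context, not raw fields
    tags: List[str] = []
    if "payment" in obj.get("service_name", ""):
        tags.append("biz:payment-critical")          # hypothetical business attribute
    if obj.get("env") == "prod":
        tags.append("tier:production")               # environment tier
    if obj.get("alert_type") in {"oom", "disk_full"}:
        tags.append("category:resource-saturation")  # incident category
    return tags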
The third step is building a unified data service catalog. Package metrics, logs, CMDB, and distributed tracing capabilities into standard APIs that agents can call on demand, which avoids repeated collection and multi-source conflicts.
from typing import Dict, List

def build_alarm_payload(alarm: Dict, cmdb: Dict, biz_tags: List[str]) -> Dict:
    # Merge raw alerts with CMDB data and business tags into a unified input for AI reasoning
    return {
        "alarm_title": alarm.get("title"),
        "severity": alarm.get("level"),  # Alert severity
        "service": cmdb.get("service_name"),  # Related service
        "owner": cmdb.get("owner"),  # Responsible team
        "change_window": alarm.get("change_window", False),  # Whether the alert falls within a change window
        "biz_tags": biz_tags,  # Inject business semantic tags
    }
This code shows how to transform a raw alert into a context-rich semantic input that can be consumed consistently by an LLM or a rule engine.
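To make the third governance step concrete as well, here is a minimal sketch of a unified data service catalog: each data domain is registered behind one interface that agents query on demand. The class, domain names, and fetcher signatures are assumptions for illustration, not a specific product API.

from typing import Callable, Dict

class DataServiceCatalog:
    # Register each data domain (metrics, logs, CMDB, traces) behind one uniform interface
    def __init__(self) -> None:
        self._services: Dict[str, Callable[..., Dict]] = {}

    def register(self, domain: str, fetcher: Callable[..., Dict]) -> None:
        self._services[domain] = fetcher

    def query(self, domain: str, **params) -> Dict:
        # Agents call the catalog instead of scraping sources directly,
        # which avoids repeated collection and multi-source conflicts
        if domain not in self._services:
            raise KeyError(f"No data service registered for domain: {domain}")
        return self._services[domain](**params)

catalog = DataServiceCatalog()
catalog.register("cmdb", lambda service: {"service_name": service, "owner": "sre-team"})
print(catalog.query("cmdb", service="checkout"))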
AI observability is the prerequisite for keeping intelligent decision systems under control
Once AI enters the core operations workflow, the system introduces a new black box: whether the model retrieved the right knowledge, which tools it called, and at which step it produced high latency or hallucinations. All of this must be traceable.
As a result, traditional MELT is no longer sufficient. It must expand into MELT+E, where E stands for Evals, meaning continuous evaluation of agent outputs. Without evaluation, Intelligent Operations cannot enter a governable state.
An executable observability framework should cover five layers
At the data layer, collect Metrics, Events, Logs, Traces, and Evals in a unified way. At the tracing layer, connect the full invocation chain across Session, Trace, LLM calls, RAG retrieval, and Tool Calls.
At the metrics layer, focus on P95 latency, error rate, time to first byte, and retrieval hit rate. At the evaluation layer, combine LLM-as-a-judge, regression testing, and human spot checks to measure accuracy, hallucination rate, and intent drift.
def evaluate_agent_run(p95_latency: float, hit_rate: float, hallucination_rate: float) -> str:
    # Evaluate current agent quality based on key observability metrics
    if p95_latency > 3000:  # milliseconds
        return "Latency is too high. Optimize the retrieval path."
    if hit_rate < 0.7:
        return "Knowledge retrieval hit rate is too low. Add indexes or improve tagging."
    if hallucination_rate > 0.1:
        return "Hallucination rate is too high. Tighten prompts and output constraints."
    return "Run is stable"
This code provides a minimal evaluation approach for quickly assessing agent quality during staged rollout.
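The tracing layer described above can be grounded in a simple span record. The sketch below, with assumed field names, shows how a single LLM call, RAG retrieval, or tool call might be captured so the Session -> Trace -> step chain stays reconstructable and Evals scores can be attached later.

import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentSpan:
    # One step in the agent invocation chain: llm_call, rag_retrieval, or tool_call
    session_id: str
    trace_id: str
    step_type: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    latency_ms: float = 0.0
    eval_score: Optional[float] = None  # filled in later by the Evals pipeline

span = AgentSpan(session_id="sess-42", trace_id="tr-7", step_type="rag_retrieval")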
A multi-agent architecture is better suited to production operations than a single large model
Operations requests in production environments are noisy, highly real-time, and strongly constrained. A single large model is both expensive and unstable in this setting. A more practical architecture assigns perception to small models, reasoning to large models, and execution to a workflow engine.
This layered design controls cost while improving explainability. Small models are effective at anomaly detection, time-series analysis, and rapid filtering. Large models are better suited for root cause inference, knowledge synthesis, and strategy generation.
The responsibilities of three agent types should remain clearly bounded
| Agent | Technical Carrier | Primary Responsibility | Typical Output |
|---|---|---|---|
| Perception Agent | Clustering, time-series detection, small models | Denoising, scoping, anomaly identification | Key alert set |
| Reasoning Agent | LLM, RAG | Tag inference, causal analysis, root cause recommendation | Top 3 root causes with confidence scores |
| Execution Agent | Workflow engine, script platform | Ticket creation, inspection, auto-remediation | Task flow, report, execution result |
In engineering practice, 95% of noise should be filtered out at the perception layer whenever possible. Otherwise, large models will be overwhelmed by irrelevant context, increasing both cost and misjudgment rates.
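A minimal orchestration sketch of this three-layer division of labor might look like the following. The agent interfaces and the 0.8 confidence threshold are illustrative assumptions; a real system would back them with actual detection models, an LLM service, and a workflow engine.

from typing import Dict, List

def run_ops_pipeline(raw_alerts: List[Dict], perceive, reason, execute) -> Dict:
    # Perception agent: small models filter noise down to a key alert set
    key_alerts = perceive(raw_alerts)
    if not key_alerts:
        return {"status": "no_action", "reason": "all alerts filtered as noise"}
    # Reasoning agent: LLM + RAG produce ranked root causes with confidence scores
    root_causes = reason(key_alerts)
    # Execution agent: workflow engine acts only on high-confidence findings
    top = root_causes[0]
    if top["confidence"] >= 0.8:
        return execute(top)
    return {"status": "escalate_to_human", "candidates": root_causes[:3]}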
A three-phase build-as-you-use path is better aligned with real enterprise adoption
Many teams want to finish data governance first and connect AI later, but that usually leads to long project cycles and delayed returns. A more realistic path anchors on high-value scenarios while applying, governing, and packaging capabilities in parallel.
Phase one should start with high-frequency, high-loss incident scenarios
Prioritize scenarios such as end-to-end troubleshooting, business change assessment, and overnight on-call bots. These scenarios have concentrated pain points and clear returns, making it easier to establish the first closed-loop pilot within three to six weeks.
Phase two should govern only the data that strongly relates to the scenario
Do not try to connect every data domain at once. Focus only on alerts, logs, CMDB, traces, and change records for targeted governance, then quickly form templates, labels, and unified interfaces.
Phase three should package proven capabilities into reusable services
Once the first scenario proves effective, consolidate data capabilities, prompt templates, evaluation rules, and execution flows into standardized services that other scenarios can reuse, forming an enterprise-grade intelligent operations platform.
def rollout_plan(stage: int) -> str:
    # Three-phase implementation guidance
    plans = {
        1: "Choose a high-frequency incident scenario and build a pilot closed loop",
        2: "Govern related data and complete tag-based packaging",
        3: "Consolidate into reusable APIs and standard workflows",
    }
    return plans.get(stage, "Invalid stage")
This code maps directly to the minimal implementation model behind the build-as-you-use strategy, and works well as a communication template for phased project planning.
Three pilot scenarios are most likely to produce measurable ROI first
Intelligent change assessment is well suited to reducing incidents caused by changes. By estimating impact before the change, monitoring during the change, and validating after the change, teams can significantly lower the rate of change-related failures.
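A sketch of the before/during/after gating logic might look like this; the thresholds and field names are assumptions for illustration only.

from typing import Dict

def assess_change(pre_risk: float, live_metrics: Dict, post_checks: Dict) -> str:
    # Before the change: block if the estimated impact score is too high
    if pre_risk > 0.7:
        return "block: predicted blast radius too large"
    # During the change: roll back on error-rate regression against baseline
    baseline = live_metrics.get("baseline_error_rate", 0.01)
    if live_metrics.get("error_rate", 0.0) > 2 * baseline:
        return "rollback: error rate regression detected"
    # After the change: pass only when all validation checks succeed
    if not all(post_checks.values()):
        return "hold: post-change validation failed"
    return "pass: change validated"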
An intelligent incident response loop is ideal for reducing MTTR. The system first uses small models to filter noise, then uses a large model to generate root causes and recommendations, and finally triggers predefined actions through a workflow engine, reducing the need to recall experts during the night.
Intelligent daily operations iteration is effective for releasing human capacity. Repetitive tasks such as inspections, reporting, and script orchestration can be generated and executed through natural language, allowing engineers to spend more time on architecture optimization and risk governance.
FAQ
Q1: Why do many AIOps projects still deliver limited results after integrating large models?
Because the issue usually is not the model itself, but the input data. If alerts, logs, and CMDB records lack unified tags, relationship mapping, and business semantics, the model can only summarize text and will struggle to perform stable root cause reasoning.
Q2: Do enterprises need to fine-tune large models to build Intelligent Operations 2.0?
In most cases, no. Prioritizing prompt engineering, scenario-specific data governance, knowledge retrieval, and output constraints usually delivers faster results and is easier to maintain than fine-tuning.
Q3: Which scenario is the best starting point for Intelligent Operations 2.0?
Start with scenarios that are high-frequency, measurable, and process-clear, such as change assessment, alert attribution, or overnight on-call bots. These scenarios are the easiest places to produce verifiable ROI and organizational alignment.
Core Summary
This article systematically reconstructs the methodology and implementation path for Intelligent Operations 2.0. It focuses on AI-native data governance, AI observability, multi-agent collaboration between large and small models, and a three-phase build-as-you-use rollout strategy, helping enterprises build an autonomous operations system with measurable ROI from alert denoising and root cause analysis to automated remediation.