Project Overview
The SRE Agent is an autonomous multi-agent system designed to automate Incident Response in Kubernetes environments. By leveraging Large Language Models (LLMs) and a Divide & Conquer strategy, it significantly reduces the Mean Time to Resolution (MTTR) for complex microservice faults.
This system integrates with AIOpsLab for realistic fault injection and uses a custom Model Context Protocol (MCP) server to interface with observability tools (Prometheus, Jaeger, Kubernetes API) securely and efficiently.
Architecture & Key Features

The agent implements a parallel multi-agent workflow to diagnose faults efficiently.
Core Components
- Triage Agent (Hybrid): Explicitly detects symptoms. It combines deterministic heuristics, based on Latency, Errors, and Saturation from the Four Golden Signals, with LLM reasoning. This hybrid approach grounds the diagnosis in hard evidence to minimize hallucinations.
- Planner Agent (Topology-Aware): Plans the investigation strategy. It uses the graph-based Datagraph to understand cluster topology (dependencies, upstream services) and generates a deduplicated, prioritized list of RCA Tasks, each with a specific investigation goal and target resource.
- RCA Workers (Parallel Execution): Execute the investigation using a Divide & Conquer approach. Multiple workers run in parallel, each handling a specific task. They use MCP tools (Logs, Traces, Metrics) to gather evidence and produce a diagnostic report. A deterministic RCA Router manages task dispatching.
- Supervisor Agent: The Final Decision Maker. Aggregates worker reports to synthesize a final Root Cause Analysis. It can either finalize the diagnosis or trigger a feedback loop to schedule pending tasks if more evidence is needed.
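The flow from triage to parallel investigation can be illustrated with a minimal sketch. It is not the project's implementation: the thresholds, metric keys, the `RCATask` fields, and the `run_worker` stub are all hypothetical, and the LLM and MCP calls are replaced by placeholders. It only shows deterministic symptom detection, deduplicated and prioritized task planning, and parallel dispatch.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical thresholds; real values would be tuned per service.
LATENCY_P99_MS = 500.0
ERROR_RATE = 0.05
CPU_SATURATION = 0.90


@dataclass(frozen=True)
class RCATask:
    target: str    # resource to investigate, e.g. a service name
    goal: str      # what the worker should look into
    priority: int  # lower number means investigate first


def triage(metrics: dict) -> list:
    """Deterministic symptom detection from per-service metrics."""
    symptoms = []
    for service, m in sorted(metrics.items()):
        if m.get("latency_p99_ms", 0.0) > LATENCY_P99_MS:
            symptoms.append((service, "latency"))
        if m.get("error_rate", 0.0) > ERROR_RATE:
            symptoms.append((service, "errors"))
        if m.get("cpu_utilization", 0.0) > CPU_SATURATION:
            symptoms.append((service, "saturation"))
    return symptoms


def plan(symptoms: list, dependencies: dict) -> list:
    """Turn symptoms into a deduplicated, prioritized task list."""
    tasks = set()  # frozen dataclasses are hashable, so duplicates collapse
    for service, signal in symptoms:
        tasks.add(RCATask(service, f"explain {signal}", priority=0))
        for dep in dependencies.get(service, []):
            tasks.add(RCATask(dep, "check dependency health", priority=1))
    return sorted(tasks, key=lambda t: (t.priority, t.target, t.goal))


def run_worker(task: RCATask) -> dict:
    """Stand-in for an LLM worker that would gather evidence via MCP tools."""
    return {"target": task.target, "goal": task.goal, "evidence": "placeholder"}


def investigate(tasks: list, max_workers: int = 4) -> list:
    """Run one worker per task in parallel; reports keep task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_worker, tasks))
```

A supervisor step would then read the returned reports and either finalize the diagnosis or schedule any remaining tasks for another round.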
Key Features
- Datagraph: A graph representation of the cluster topology (Infrastructure & Data dependencies) that guides the agent, preventing irrelevant resource exploration.
- Custom MCP Server: Standardizes tool interaction and performs “pre-digestion” of data (e.g., retrieving only relevant metrics or error logs) to optimize context window usage and reduce token costs.
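Both features can be sketched in a few lines. The example below is an illustration under assumptions, not the repository's code: the topology in `DATAGRAPH`, the `max_hops` limit, and the error markers are hypothetical. The first function bounds the investigation to resources reachable from a symptomatic service; the second shows the "pre-digestion" idea by returning only recent error-like log lines.

```python
from collections import deque

# Hypothetical topology: service -> services it depends on.
DATAGRAPH = {
    "frontend": ["search", "reservation"],
    "search": ["geo", "rate"],
    "reservation": ["mongodb-reservation"],
    "recommendation": ["mongodb-recommendation"],
}


def investigation_scope(graph: dict, start: str, max_hops: int = 2) -> list:
    """Breadth-first walk returning resources within max_hops of start."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append((dep, hops + 1))
    return order


# Hypothetical markers for lines worth sending to the model.
ERROR_MARKERS = ("ERROR", "FATAL", "panic")


def predigest_logs(lines: list, max_lines: int = 20) -> list:
    """Keep only the most recent error-like lines to save context tokens."""
    relevant = [ln for ln in lines if any(m in ln for m in ERROR_MARKERS)]
    return relevant[-max_lines:]
```

With this topology, an investigation starting at `search` never touches `recommendation`, which is the kind of irrelevant exploration the Datagraph is meant to prevent.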
Automated Evaluation Pipeline
The repository includes a robust pipeline for automated experimentation and benchmarking.
Framework
- Integration: Built on top of AIOpsLab to deploy testbeds (Hotel Reservation, Social Network) and inject realistic faults (Network delays, Pod failures, Misconfigurations).
- Batch Execution: automated_experiment.py orchestrates end-to-end batch runs: Cluster Setup → Fault Injection → Agent Execution → Evaluation → Cleanup.
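The stage ordering can be pictured with a small sketch. This is not the contents of automated_experiment.py: the function names and the callable-based interface are assumptions made for illustration. The point it shows is that cleanup runs even when a stage fails, so one broken scenario does not leave the cluster dirty for the next one.

```python
def run_experiment(fault, setup, inject, run_agent, evaluate, cleanup):
    """Run one fault scenario; cleanup happens even if a stage raises."""
    try:
        setup()
        inject(fault)
        diagnosis = run_agent()
        return evaluate(fault, diagnosis)
    finally:
        cleanup()


def run_batch(faults, **stages):
    """Run every scenario, recording failures instead of aborting the batch."""
    results = []
    for fault in faults:
        try:
            outcome = run_experiment(fault, **stages)
            results.append({"fault": fault, "result": outcome, "error": None})
        except Exception as exc:  # keep going so later scenarios still run
            results.append({"fault": fault, "result": None, "error": str(exc)})
    return results
```

Passing the stages as callables keeps the orchestration logic testable without a live cluster; in a real run they would wrap the testbed deployment and fault-injection calls.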
Metrics
The system is evaluated on:
- Detection Accuracy: Correct identification of an anomaly.
- Localization Accuracy: Correct identification of the root cause resource (Service/Pod).
- RCA Score: Semantic evaluation of the diagnosis using LLM-as-a-Judge (1-5 scale with rationale).
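As a rough illustration of how these three metrics could be aggregated over a batch, consider the sketch below. The `RunResult` fields are hypothetical, and it assumes every run has a fault injected, so a detection counts as correct whenever an anomaly was reported. The judge score is taken as given; the LLM-as-a-Judge call itself is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    anomaly_detected: bool             # did the agent report an anomaly?
    faulty_resource: str               # ground-truth root cause resource
    predicted_resource: Optional[str]  # resource named by the agent, if any
    rca_score: int                     # 1-5 score assigned by the judge


def summarize(runs: list) -> dict:
    """Aggregate per-run results into the three reported metrics."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    return {
        "detection_accuracy": sum(r.anomaly_detected for r in runs) / n,
        "localization_accuracy": sum(
            r.predicted_resource == r.faulty_resource for r in runs
        ) / n,
        "mean_rca_score": sum(r.rca_score for r in runs) / n,
    }
```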
Repository Structure
SRE-agent/
├── sre-agent/ # Main Multi-Agent System implementation (LangGraph)
├── MCP-server/ # Custom Model Context Protocol server for observability tools
├── notebooks/ # Jupyter notebooks for analysis and development
├── Results/ # Experiment outputs, logs, and reports
├── archive/ # Archive of previous project iterations
└── assets/ # Diagrams and static assets
Links & Resources
- GitHub Repository: Complete source code and implementation.