SRE Agent: Autonomous Kubernetes Operations

Technologies used:

Python, LangChain, LangGraph, Kubernetes, Prometheus, Jupyter Notebook, Poetry, Docker, Model Context Protocol

Go to the GitHub Repository

🚀 Project Overview

The SRE Agent is an autonomous multi-agent system that automates incident response in Kubernetes environments. By combining Large Language Models (LLMs) with a Divide & Conquer strategy, it aims to reduce the Mean Time to Resolution (MTTR) for complex microservice faults.

This system integrates with AIOpsLab for realistic fault injection and uses a custom Model Context Protocol (MCP) server to interface with observability tools (Prometheus, Jaeger, Kubernetes API) securely and efficiently.

🤖 Architecture & Key Features

SRE Agent Architecture

The agent implements a parallel multi-agent workflow to diagnose faults efficiently.

Core Components

  • ๐Ÿ” Triage Agent (Hybrid): Detects symptoms explicitly. It combines deterministic heuristics (based on the Four Golden Signals: Latency, Errors, Saturation) with LLM reasoning. This hybrid approach grounds the diagnosis in hard evidence to minimize hallucinations.
  • ๐Ÿ“‹ Planner Agent (Topology-Aware): Strategies the investigation. Uses a Graph-Based Datagraph to understand cluster topology (dependencies, upstream services). It generates a deduplicated, prioritized list of RCA Tasks, assigning specific investigation goals and target resources.
  • 🔬 RCA Workers (Parallel Execution): Execute the investigation using a Divide & Conquer approach. Multiple workers run in parallel, each handling a specific task. They use MCP tools (Logs, Traces, Metrics) to gather evidence and produce a diagnostic report. A deterministic RCA Router manages task dispatching.
  • 👔 Supervisor Agent: The Final Decision Maker. Aggregates worker reports to synthesize a final Root Cause Analysis. It can either finalize the diagnosis or trigger a feedback loop to schedule pending tasks if more evidence is needed.
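
The flow described by the four components can be sketched in plain Python. This is a minimal illustration, not the repository's LangGraph implementation: the thresholds, the `RCATask` fields, the function names, and the placeholder worker are all assumptions made for the example, and the LLM calls are left out entirely.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class RCATask:
    priority: int  # lower value = investigated first (hypothetical field)
    target: str    # resource to inspect, e.g. a service name
    goal: str      # what the worker should find out


def triage(metrics):
    """Deterministic half of triage: flag signals that cross fixed thresholds."""
    symptoms = []
    for service, m in sorted(metrics.items()):
        if m.get("error_rate", 0.0) > 0.05:        # assumed threshold
            symptoms.append(f"{service}: high error rate")
        if m.get("p99_latency_ms", 0.0) > 500:     # assumed threshold
            symptoms.append(f"{service}: high latency")
    return symptoms


def plan(symptoms):
    """Build a deduplicated, prioritized task list (one task per service)."""
    tasks = {}
    for symptom in symptoms:
        service, _, issue = symptom.partition(": ")
        priority = 0 if "error" in issue else 1
        current = tasks.get(service)
        if current is None or priority < current.priority:
            tasks[service] = RCATask(priority, service, f"explain {issue}")
    return sorted(tasks.values(), key=lambda t: (t.priority, t.target))


def rca_worker(task):
    """Placeholder: a real worker would query logs, traces, and metrics via MCP."""
    return {"target": task.target, "finding": f"investigated: {task.goal}"}


def supervise(reports, pending):
    """Finalize when no tasks are pending; otherwise request another round."""
    return {"final": not pending, "reports": reports}


def run_incident(metrics, max_workers=4, batch_size=2):
    pending = plan(triage(metrics))
    reports = []
    while True:
        # Router step: dispatch the next batch of tasks to parallel workers.
        batch, pending = pending[:batch_size], pending[batch_size:]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            reports.extend(pool.map(rca_worker, batch))
        verdict = supervise(reports, pending)
        if verdict["final"]:
            return verdict
```

Calling `run_incident({"frontend": {"error_rate": 0.2}, "search": {"p99_latency_ms": 700}})` would investigate `frontend` first, since error symptoms are given the higher priority in this sketch. The loop around `supervise` stands in for the feedback loop: the supervisor keeps scheduling pending tasks until none remain.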

Key Features

  • Datagraph: A graph representation of the cluster topology (Infrastructure & Data dependencies) that guides the agent, preventing irrelevant resource exploration.
  • Custom MCP Server: Standardizes tool interaction and performs “pre-digestion” of data (e.g., retrieving only relevant metrics or error logs) to optimize context window usage and reduce token costs.
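
Both features can be illustrated with a small sketch. The topology below, the hop limit, and the keyword filter are assumptions chosen for the example; the actual Datagraph schema and MCP tool behavior are not described here and may differ.

```python
from collections import deque

# Hypothetical topology: service -> services it depends on.
DATAGRAPH = {
    "frontend": ["search", "profile"],
    "search": ["geo", "rate"],
    "profile": ["mongodb-profile"],
    "geo": ["mongodb-geo"],
    "rate": ["mongodb-rate"],
}


def related_resources(graph, start, max_hops=2):
    """Breadth-first walk: resources within max_hops of the symptomatic one."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append((dep, hops + 1))
    return order


def predigest_logs(lines, keywords=("error", "exception", "timeout"), limit=50):
    """Keep only failure-looking lines, capped at the most recent `limit`."""
    hits = [ln for ln in lines if any(k in ln.lower() for k in keywords)]
    return hits[-limit:] if limit > 0 else []
```

Restricting the investigation to `related_resources(DATAGRAPH, "search")` keeps the agent away from unrelated services such as `profile`, and `predigest_logs` returns a short list of suspicious lines instead of the full log stream, which is the kind of reduction that saves context window space.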

🧪 Automated Evaluation Pipeline

The repository includes a robust pipeline for automated experimentation and benchmarking.

Framework

  • Integration: Built on top of AIOpsLab to deploy testbeds (Hotel Reservation, Social Network) and inject realistic faults (Network delays, Pod failures, Misconfigurations).
  • Batch Execution: automated_experiment.py orchestrates end-to-end batch runs: Cluster Setup → Fault Injection → Agent Execution → Evaluation → Cleanup.
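
The five stages of a batch run can be sketched as follows. This is not the content of automated_experiment.py: the function names and the stage callables are hypothetical, and the real script drives AIOpsLab rather than injected functions.

```python
from typing import Callable


def run_experiment(
    scenario: str,
    setup: Callable[[str], None],
    inject: Callable[[str], None],
    run_agent: Callable[[str], dict],
    evaluate: Callable[[str, dict], dict],
    cleanup: Callable[[str], None],
) -> dict:
    """Run one scenario through all stages; cleanup runs even on failure."""
    try:
        setup(scenario)
        inject(scenario)
        diagnosis = run_agent(scenario)
        return evaluate(scenario, diagnosis)
    finally:
        cleanup(scenario)


def run_batch(scenarios, **stages):
    """Run scenarios one after another, recording errors instead of aborting."""
    results = {}
    for scenario in scenarios:
        try:
            results[scenario] = run_experiment(scenario, **stages)
        except Exception as exc:
            results[scenario] = {"error": str(exc)}
    return results
```

Putting cleanup in a `finally` block means a failed injection or a crashed agent does not leave a testbed behind for the next scenario, and catching exceptions per scenario lets a long batch continue after a single failure.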

Metrics

The system is evaluated on:

  1. Detection Accuracy: Correct identification of an anomaly.
  2. Localization Accuracy: Correct identification of the root cause resource (Service/Pod).
  3. RCA Score: Semantic evaluation of the diagnosis using LLM-as-a-Judge (1-5 scale with rationale).
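
As an illustration of how the three metrics might be aggregated over a batch, here is a small sketch. The `RunResult` fields are hypothetical, and it assumes every run contains an injected fault, so detection counts as correct whenever the agent reports an anomaly. The judge score is taken as given; the LLM-as-a-Judge call itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    anomaly_detected: bool   # did the agent report an anomaly?
    faulty_resource: str     # ground truth from the fault injector
    predicted_resource: str  # resource named in the agent's diagnosis
    rca_score: int           # 1-5 score assigned by the judge


def summarize(results):
    """Aggregate per-run outcomes into the three reported metrics."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    detected = sum(r.anomaly_detected for r in results)
    localized = sum(r.predicted_resource == r.faulty_resource for r in results)
    return {
        "detection_accuracy": detected / n,
        "localization_accuracy": localized / n,
        "mean_rca_score": sum(r.rca_score for r in results) / n,
    }
```

Localization is scored by exact match on the resource name in this sketch; a real evaluation may need to normalize names first (for example, mapping a Pod to its owning Service).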

๐Ÿ“ Repository Structure Link to heading

SRE-agent/
├── sre-agent/        # 🧠 Main Multi-Agent System implementation (LangGraph)
├── MCP-server/       # 🔌 Custom Model Context Protocol server for observability tools
├── notebooks/        # 📓 Jupyter notebooks for analysis and development
├── Results/          # 📊 Experiment outputs, logs, and reports
├── archive/          # 📦 Archive of previous project iterations
└── assets/           # 🖼️ Diagrams and static assets