SRE Agent: Autonomous Kubernetes Operations

Technologies used: Python, LangChain, LangGraph, Kubernetes, Prometheus, ChromaDB, Jupyter Notebook, Poetry, Docker, Model Context Protocol

Go to the GitHub Repository

🚀 Project Overview

The SRE Agent is an autonomous system that applies Large Language Models (LLMs) and workflow orchestration to Site Reliability Engineering (SRE). It automates Kubernetes incident detection, diagnosis, and mitigation without human intervention.

Developed as part of the Multidisciplinary Project course at Politecnico di Milano, this system integrates cutting-edge AI technologies with cloud-native infrastructure to create a comprehensive AIOps solution.

🏗️ Architecture & Key Features

Core Components

  • 🧠 LLM Agents: GPT-5-mini for intelligent reasoning and decision-making
  • 🔧 Model Context Protocol (MCP): Provides kubectl, Prometheus, and ChromaDB tools
  • 📊 LangGraph: Orchestrates investigation workflows with state management
  • 🔁 ReAct Loop: Autonomous Agent → Tools → Agent cycle until diagnosis completion
  • ☸️ Kubernetes Integration: Direct cluster access for real-time analysis
  • 📝 Structured Output: Token-efficient state management and detailed reporting
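
The Agent → Tools → Agent cycle listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's LangGraph implementation: `AgentStep`, `react_loop`, `scripted_agent`, and `fake_kubectl` are hypothetical names, the agent is a scripted stand-in for the LLM, and the tool returns canned strings instead of calling a cluster.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class AgentStep:
    """One agent decision: either call a tool or finish with a diagnosis."""
    tool: Optional[str] = None
    tool_input: str = ""
    diagnosis: Optional[str] = None


def react_loop(
    agent: Callable[[List[str]], AgentStep],
    tools: Dict[str, Callable[[str], str]],
    alert: str,
    max_steps: int = 10,
) -> Tuple[str, List[str]]:
    """Alternate between the agent and its tools until a diagnosis is returned."""
    history = [f"alert: {alert}"]
    for _ in range(max_steps):
        step = agent(history)
        if step.diagnosis is not None:
            return step.diagnosis, history
        # Run the requested tool and feed the observation back to the agent.
        observation = tools[step.tool](step.tool_input)
        history.append(f"{step.tool}({step.tool_input}) -> {observation}")
    return "undetermined: step budget exhausted", history


def scripted_agent(history: List[str]) -> AgentStep:
    """Hypothetical stand-in for the LLM: picks the next action from history length."""
    if len(history) == 1:
        return AgentStep(tool="kubectl", tool_input="get pods")
    if len(history) == 2:
        return AgentStep(tool="kubectl", tool_input="logs checkout-7d9f")
    return AgentStep(diagnosis="checkout pod is crash-looping on a missing env var")


def fake_kubectl(command: str) -> str:
    """Canned outputs for illustration only; no cluster is contacted."""
    canned = {
        "get pods": "checkout-7d9f CrashLoopBackOff",
        "logs checkout-7d9f": "KeyError: 'DB_URL'",
    }
    return canned.get(command, "no output")


diagnosis, history = react_loop(
    scripted_agent, {"kubectl": fake_kubectl}, "checkout 5xx rate high"
)
print(diagnosis)
```

The `max_steps` budget is an assumption added here so the loop always terminates even if the agent never commits to a diagnosis.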

Advanced Capabilities

  • Autonomous Cluster Analysis: Automatically examines deployments, pods, services, and logs
  • Root Cause Detection: Identifies issues through intelligent investigation patterns
  • Structured Reporting: Generates comprehensive diagnostic reports with mitigation plans
  • RAG-Enhanced Learning: Retrieves and learns from previous incidents using ChromaDB
  • Token Optimization: Multiple approaches to reduce LLM computational costs

🛠️ Workflow Variants

The project implements five distinct workflow approaches, each optimized for different scenarios:

1. Baseline ReAct Agent

  • Context: Full message history approach
  • Pros: Complete context awareness for complex investigations
  • Cons: Token usage grows with the length of the investigation

2. Reduced Context Agent

  • Context: Limited to the last 7 messages for efficiency
  • Pros: Balanced efficiency and contextual understanding
  • Cons: Potential loss of historical investigation context
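
A minimal sketch of the message window described above, assuming the history is a plain list of message dictionaries. The names `WINDOW` and `last_n_messages` are hypothetical and the project's actual trimming logic may differ.

```python
from typing import Dict, List

WINDOW = 7  # number of most recent messages passed to the model


def last_n_messages(messages: List[Dict[str, str]], n: int = WINDOW) -> List[Dict[str, str]]:
    """Return only the n most recent messages; older ones are dropped."""
    if n <= 0:
        return []
    return messages[-n:]


# Twelve tool observations; only the last seven survive the trim.
history = [{"role": "tool", "content": f"observation {i}"} for i in range(1, 13)]
window = last_n_messages(history)
print([m["content"] for m in window])
```

Everything before the window is invisible to the model on the next turn, which is where the loss of historical context listed under Cons comes from.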

3. Structured Schema Agent

  • Context: Optimized state management with insights/steps structure
  • Pros: Minimal token usage while maintaining critical information
  • Cons: Additional complexity in state orchestration
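
One way to picture the insights/steps structure is a small state object that replaces the raw message history in the prompt. The sketch below is an assumption about the general shape, not the project's schema: `InvestigationState`, `record`, and `to_prompt` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InvestigationState:
    """Compact investigation state: what was done and what was learned."""
    alert: str
    steps: List[str] = field(default_factory=list)
    insights: List[str] = field(default_factory=list)

    def record(self, step: str, insight: Optional[str] = None) -> None:
        """Log a completed step and, if present, a new (non-duplicate) insight."""
        self.steps.append(step)
        if insight and insight not in self.insights:
            self.insights.append(insight)

    def to_prompt(self) -> str:
        """Render the state as a short text block for the next LLM call."""
        lines = [f"Alert: {self.alert}", "Steps taken:"]
        lines += [f"- {s}" for s in self.steps]
        lines.append("Insights so far:")
        lines += [f"- {i}" for i in self.insights]
        return "\n".join(lines)


state = InvestigationState(alert="checkout 5xx rate high")
state.record("kubectl get pods", "checkout-7d9f is in CrashLoopBackOff")
state.record("kubectl logs checkout-7d9f", "container exits with KeyError: 'DB_URL'")
state.record("kubectl describe pod checkout-7d9f")  # a step that yields no new insight
print(state.to_prompt())
```

The prompt then carries one line per step and per insight rather than full tool outputs, which is the kind of saving the Pros bullet refers to; the cost is the extra logic needed to keep the state up to date.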

4. Mitigation Plan Agent

  • Context: Complete investigation workflow with automated remediation planning
  • Pros: End-to-end SRE workflow from detection to resolution
  • Cons: Highest system complexity and resource requirements

5. RAG-Enhanced Mitigation Agent

  • Context: Full workflow with incident knowledge retrieval
  • Features:
    • Queries ChromaDB for similar historical incidents
    • Reuses proven mitigation strategies when applicable
    • Automatically stores new incidents for future reference
    • Marks recurring incidents for pattern analysis
  • Pros: Complete SRE workflow with organizational learning
  • Cons: Higher computational overhead due to vector similarity search
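
The retrieve, reuse, store, and mark-as-recurring behaviour listed above can be illustrated with a small in-memory stand-in. This sketch does not use ChromaDB: it swaps embedding search for a bag-of-words cosine similarity so that it runs with the standard library only. `IncidentMemory`, `handle`, and the 0.5 threshold are hypothetical choices made for the example.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple


def _vector(text: str) -> Counter:
    """Bag-of-words vector; a crude stand-in for a real text embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class IncidentMemory:
    """In-memory incident store (assumption: the real store is a ChromaDB collection)."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold
        self.incidents: List[Dict] = []

    def find_similar(self, symptoms: str) -> Optional[Dict]:
        """Return the most similar stored incident, if it clears the threshold."""
        query = _vector(symptoms)
        best, best_score = None, 0.0
        for incident in self.incidents:
            score = cosine(query, _vector(incident["symptoms"]))
            if score > best_score:
                best, best_score = incident, score
        return best if best_score >= self.threshold else None

    def handle(self, symptoms: str, plan_fn: Callable[[str], str]) -> Tuple[str, bool]:
        """Reuse a known mitigation if one matches; otherwise plan and store a new one."""
        match = self.find_similar(symptoms)
        if match is not None:
            match["recurrences"] += 1  # mark the incident as recurring
            return match["mitigation"], True
        mitigation = plan_fn(symptoms)
        self.incidents.append(
            {"symptoms": symptoms, "mitigation": mitigation, "recurrences": 0}
        )
        return mitigation, False


memory = IncidentMemory()
first = memory.handle(
    "pod checkout crashloopbackoff missing env var",
    lambda s: "restore the missing env var and roll out again",
)
second = memory.handle(
    "checkout pod in crashloopbackoff missing env var DB_URL",
    lambda s: "unused: a stored mitigation should be reused",
)
print(first, second)
```

The first call finds nothing and stores a new incident; the second call is similar enough to reuse the stored mitigation and increments its recurrence counter. With a real vector database the similarity search would run over embeddings, which is the overhead named under Cons.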

📊 Technical Implementation

Development Stack

  • Language: Python 3.13+ with Poetry dependency management
  • AI Framework: LangChain & LangGraph for agent orchestration
  • LLM Integration: OpenAI GPT models with optional Google Gemini support
  • Vector Database: ChromaDB for incident knowledge storage
  • Monitoring: Prometheus integration for metrics collection
  • Development Environment: Jupyter Notebooks with LangGraph Studio