SRE Agent: Autonomous Kubernetes Operations

2024 - 2025

Technologies used:

Go to the GitHub Repository

🚀 Project Overview Link to heading

The SRE Agent is an autonomous system designed to revolutionize Site Reliability Engineering (SRE) by leveraging Large Language Models (LLMs) and advanced workflow orchestration. This project enables automated Kubernetes incident detection, diagnosis, and mitigation without human intervention.

Developed as part of the Multidisciplinary Project course at Politecnico di Milano, this system integrates cutting-edge AI technologies with cloud-native infrastructure to create a comprehensive AIOps solution.

🏗️ Architecture & Key Features Link to heading

Core Components Link to heading

🧠 LLM Agents: GPT-5-mini for intelligent reasoning and decision-making
🔧 Model Context Protocol (MCP): Provides kubectl, Prometheus, and ChromaDB tools
📊 LangGraph: Orchestrates investigation workflows with state management
🔍 ReAct Loop: Autonomous Agent → Tools → Agent cycle until diagnosis completion
☸️ Kubernetes Integration: Direct cluster access for real-time analysis
📝 Structured Output: Token-efficient state management and detailed reporting

Advanced Capabilities Link to heading

Autonomous Cluster Analysis: Automatically examines deployments, pods, services, and logs
Root Cause Detection: Identifies issues through intelligent investigation patterns
Structured Reporting: Generates comprehensive diagnostic reports with mitigation plans
RAG-Enhanced Learning: Retrieves and learns from previous incidents using ChromaDB
Token Optimization: Multiple approaches to reduce LLM computational costs

🛠️ Workflow Variants Link to heading

The project implements five distinct workflow approaches, each optimized for different scenarios:

1. Baseline ReAct Agent Link to heading

Context: Full message history approach
Pros: Complete context awareness for complex investigations
Cons: Higher token usage for extensive operations

2. Reduced Context Agent Link to heading

Context: Limited to last 7 messages for efficiency
Pros: Balanced efficiency and contextual understanding
Cons: Potential loss of historical investigation context

3. Structured Schema Agent Link to heading

Context: Optimized state management with insights/steps structure
Pros: Minimal token usage while maintaining critical information
Cons: Additional complexity in state orchestration

4. Mitigation Plan Agent Link to heading

Context: Complete investigation workflow with automated remediation planning
Pros: End-to-end SRE workflow from detection to resolution
Cons: Highest system complexity and resource requirements

5. RAG-Enhanced Mitigation Agent Link to heading

Context: Full workflow with incident knowledge retrieval
Features:
- Queries ChromaDB for similar historical incidents
- Reuses proven mitigation strategies when applicable
- Automatically stores new incidents for future reference
- Marks recurring incidents for pattern analysis
Pros: Complete SRE workflow with organizational learning
Cons: Higher computational overhead due to vector similarity search

📊 Technical Implementation Link to heading

Development Stack Link to heading

Language: Python 3.13+ with Poetry dependency management
AI Framework: LangChain & LangGraph for agent orchestration
LLM Integration: OpenAI GPT models with optional Google Gemini support
Vector Database: ChromaDB for incident knowledge storage
Monitoring: Prometheus integration for metrics collection
Development Environment: Jupyter Notebooks with LangGraph Studio

🔗 Links & Resources Link to heading

GitHub Repository: Complete source code and implementation
Project Report (PDF): Detailed technical documentation