Technologies used:
π Project Overview Link to heading
The SRE Agent is an autonomous system designed to revolutionize Site Reliability Engineering (SRE) by leveraging Large Language Models (LLMs) and advanced workflow orchestration. This project enables automated Kubernetes incident detection, diagnosis, and mitigation without human intervention.
Developed as part of the Multidisciplinary Project course at Politecnico di Milano, this system integrates cutting-edge AI technologies with cloud-native infrastructure to create a comprehensive AIOps solution.
ποΈ Architecture & Key Features Link to heading
Core Components Link to heading
- π§ LLM Agents: GPT-5-mini for intelligent reasoning and decision-making
- π§ Model Context Protocol (MCP): Provides kubectl, Prometheus, and ChromaDB tools
- π LangGraph: Orchestrates investigation workflows with state management
- π ReAct Loop: Autonomous Agent β Tools β Agent cycle until diagnosis completion
- βΈοΈ Kubernetes Integration: Direct cluster access for real-time analysis
- π Structured Output: Token-efficient state management and detailed reporting
Advanced Capabilities Link to heading
- Autonomous Cluster Analysis: Automatically examines deployments, pods, services, and logs
- Root Cause Detection: Identifies issues through intelligent investigation patterns
- Structured Reporting: Generates comprehensive diagnostic reports with mitigation plans
- RAG-Enhanced Learning: Retrieves and learns from previous incidents using ChromaDB
- Token Optimization: Multiple approaches to reduce LLM computational costs
π οΈ Workflow Variants Link to heading
The project implements five distinct workflow approaches, each optimized for different scenarios:
1. Baseline ReAct Agent Link to heading
- Context: Full message history approach
- Pros: Complete context awareness for complex investigations
- Cons: Higher token usage for extensive operations
2. Reduced Context Agent Link to heading
- Context: Limited to last 7 messages for efficiency
- Pros: Balanced efficiency and contextual understanding
- Cons: Potential loss of historical investigation context
3. Structured Schema Agent Link to heading
- Context: Optimized state management with insights/steps structure
- Pros: Minimal token usage while maintaining critical information
- Cons: Additional complexity in state orchestration
4. Mitigation Plan Agent Link to heading
- Context: Complete investigation workflow with automated remediation planning
- Pros: End-to-end SRE workflow from detection to resolution
- Cons: Highest system complexity and resource requirements
5. RAG-Enhanced Mitigation Agent Link to heading
- Context: Full workflow with incident knowledge retrieval
- Features:
- Queries ChromaDB for similar historical incidents
- Reuses proven mitigation strategies when applicable
- Automatically stores new incidents for future reference
- Marks recurring incidents for pattern analysis
- Pros: Complete SRE workflow with organizational learning
- Cons: Higher computational overhead due to vector similarity search
π Technical Implementation Link to heading
Development Stack Link to heading
- Language: Python 3.13+ with Poetry dependency management
- AI Framework: LangChain & LangGraph for agent orchestration
- LLM Integration: OpenAI GPT models with optional Google Gemini support
- Vector Database: ChromaDB for incident knowledge storage
- Monitoring: Prometheus integration for metrics collection
- Development Environment: Jupyter Notebooks with LangGraph Studio
π Links & Resources Link to heading
- GitHub Repository: Complete source code and implementation
- Project Report (PDF): Detailed technical documentation