This project is a prototype of a simple AI-driven "self-healing" system that autonomously detects anomalies in simulated system telemetry and triggers predefined recovery actions. It demonstrates a proactive approach to operations, with the goal of reducing Mean Time To Recovery (MTTR) and minimizing human intervention when common infrastructure failures occur.
The system is composed of several microservices that communicate via Redis streams:
- Telemetry Generator (telemetry_generator.py): Simulates system metrics (CPU, memory, etc.) for multiple services and injects anomalies. It publishes this data to a Redis stream.
- Anomaly Detector (anomaly_detector.py): Ingests telemetry data from Redis, uses an unsupervised machine learning model (Isolation Forest) to detect anomalies, and publishes alerts to another Redis stream.
- Self-Healing Orchestrator (self_healing_orchestrator.py): Subscribes to anomaly alerts. Based on a simple rulebook, it triggers mock recovery actions (e.g., "restarting a service") and logs these actions to a Redis stream.
- Dashboard (dashboard.py): A Streamlit application that provides a real-time view of the system. It visualizes telemetry, highlights detected anomalies, and displays a log of recovery actions.
- Redis: Acts as the message broker and data store for telemetry, anomalies, and actions.
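
The orchestrator's rulebook can be pictured as a lookup from the kind of anomaly to a mock recovery action. The sketch below is an illustration only, not the code in self_healing_orchestrator.py: the alert fields (service, metric), the rulebook entries, and the action names are all hypothetical.

```python
import time

# Hypothetical rulebook: maps the metric named in an alert to a mock action.
RULEBOOK = {
    "cpu": "restart_service",
    "memory": "clear_cache",
}
# Fallback when no rule matches the alert.
DEFAULT_ACTION = "notify_operator"


def choose_action(alert: dict) -> dict:
    """Return a log record describing the mock recovery action for an alert."""
    action = RULEBOOK.get(alert.get("metric"), DEFAULT_ACTION)
    return {
        "service": alert.get("service", "unknown"),
        "action": action,
        "timestamp": time.time(),
    }


if __name__ == "__main__":
    # No real service is touched; the record is only printed.
    print(choose_action({"service": "auth", "metric": "cpu"}))
```

Keeping the rules in a plain dictionary means a new recovery action can be added without touching the dispatch logic.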
graph TD
subgraph "Containerized Services"
A[Telemetry Generator] -->|Telemetry Data| R(Redis Streams)
R -->|Telemetry Data| B[Anomaly Detector]
B -->|Anomaly Alerts| R
R -->|Anomaly Alerts| C[Self-Healing Orchestrator]
C -->|Healing Actions| R
R -->|All Data| D[Dashboard]
end
style A fill:lightcoral,stroke:#000,stroke-width:2px
style B fill:lightblue,stroke:#000,stroke-width:2px
style C fill:lightgreen,stroke:#000,stroke-width:2px
style D fill:gold,stroke:#000,stroke-width:2px
style R fill:salmon,stroke:#000,stroke-width:2px
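
Each arrow in the diagram is a publish to, or a read from, a Redis stream. The following sketch shows that pattern under stated assumptions: the stream name telemetry and the field names are hypothetical, and InMemoryStreams is a small stand-in written for this example so the code runs without a Redis server. It mimics only the xadd and xread calls of the redis-py client.

```python
class InMemoryStreams:
    """Stand-in for a Redis client; mimics only xadd and xread."""

    def __init__(self):
        self._streams = {}

    def xadd(self, name, fields):
        entries = self._streams.setdefault(name, [])
        # Real Redis IDs look like "<milliseconds>-<sequence>"; a counter is enough here.
        entry_id = f"{len(entries) + 1}-0"
        entries.append((entry_id, dict(fields)))
        return entry_id

    def xread(self, streams, count=None):
        result = []
        for name, last_id in streams.items():
            last_seq = int(last_id.split("-")[0])
            newer = [
                entry for entry in self._streams.get(name, [])
                if int(entry[0].split("-")[0]) > last_seq
            ]
            if count is not None:
                newer = newer[:count]
            if newer:
                result.append([name, newer])
        return result


def publish(client, stream, fields):
    """Append one entry to a stream and return its ID."""
    return client.xadd(stream, fields)


def consume(client, stream, last_id="0", count=10):
    """Read entries newer than last_id; return (entries, new_last_id)."""
    entries = []
    for _name, messages in client.xread({stream: last_id}, count=count):
        for entry_id, fields in messages:
            entries.append(fields)
            last_id = entry_id
    return entries, last_id


if __name__ == "__main__":
    client = InMemoryStreams()
    publish(client, "telemetry", {"service": "auth", "cpu": "93.5"})
    print(consume(client, "telemetry"))
```

Against a real server, a redis-py client created with decode_responses=True should be usable in place of the stand-in, since publish and consume call only xadd and xread. Tracking last_id lets each consumer resume where it left off.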
- Docker
- Docker Compose
- Clone the repository.
- Navigate to the Self-Healing-POC-for-Distributed-Systems directory.
- Run the following command to build and start all services:
    docker-compose up --build
- Open a web browser and go to http://localhost:8501 (Streamlit) to view the dashboard.
- Python: For its simplicity.
- Scikit-learn: For the IsolationForest anomaly detection model.
- Streamlit: For rapid development of the interactive dashboard.
- Redis: For lightweight and fast inter-service communication.
- Docker & Docker Compose: For containerizing the application, ensuring a consistent and reproducible environment.
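
To make the role of Scikit-learn's IsolationForest concrete, here is a minimal sketch of unsupervised anomaly detection on telemetry-like data. It is not the code in anomaly_detector.py: the two features (CPU % and memory %), the synthetic training data, and the is_anomaly helper are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=0)

# Synthetic "normal" telemetry (assumed): columns are CPU % and memory %.
normal = rng.normal(loc=[40.0, 50.0], scale=[5.0, 5.0], size=(500, 2))

# Fit on unlabeled data; no anomaly labels are needed.
model = IsolationForest(n_estimators=100, random_state=0)
model.fit(normal)


def is_anomaly(cpu: float, memory: float) -> bool:
    """predict() returns -1 for outliers and 1 for inliers."""
    return bool(model.predict([[cpu, memory]])[0] == -1)


if __name__ == "__main__":
    print("spike:", is_anomaly(99.0, 98.0))
    print("typical:", is_anomaly(40.0, 50.0))
```

Isolation Forest scores a point by how few random splits are needed to isolate it, so a reading far from the bulk of the training data is flagged without any labeled failures.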