[Diagram: member-1 through member-n push via Alloy to the central LGTM+ control room: Mimir (metrics), Loki (logs), Tempo (traces), Pyroscope (profiles). The Huginn Agent runs a LangGraph query loop over these backends.]
[Example Telegram thread, #incident-auth-service. Alert fired: auth-service p95 latency crossed 2.4s. Hypothesis: latency aligns with Redis saturation after deploy canary-42. Suggested action: scale redis-session to two replicas, then watch p95 for 15 minutes.]
A self-hosted agent layer for the LGTM+ stack.
Huginn keeps the familiar Grafana data plane and adds a reviewable agent loop on top. Alerts become cited RCA threads, operators can ask follow-up questions in chat, and remediation stays behind explicit approval until trust is earned.
Reactive RCA
Alertmanager webhooks trigger an agent that queries metrics, logs, traces, profiles and topology before posting a hypothesis.
Conversational on-call
Telegram threads keep incident context alive so operators can ask why a service is slow without switching across query languages.
Approval-gated remediation
Runbooks become action cards with approve or reject decisions, audit logs and an opt-in path to automation.
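As a rough illustration of the Reactive RCA entry point, the sketch below filters an Alertmanager webhook payload down to the firing alerts that would start an investigation. The top-level `alerts` list and the per-alert `status`, `labels`, `annotations` and `startsAt` fields follow Alertmanager's webhook format; the `FiringAlert` type, the `firing_alerts` helper, the `service` label and the example payload are assumptions made for this example, not Huginn's actual code.

```python
import json
from dataclasses import dataclass


@dataclass
class FiringAlert:
    name: str
    service: str
    summary: str
    starts_at: str


def firing_alerts(payload: dict) -> list[FiringAlert]:
    """Extract firing alerts from an Alertmanager webhook payload."""
    alerts = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts do not start an investigation
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        alerts.append(
            FiringAlert(
                name=labels.get("alertname", "unknown"),
                service=labels.get("service", "unknown"),  # assumed label name
                summary=annotations.get("summary", ""),
                starts_at=alert.get("startsAt", ""),
            )
        )
    return alerts


# Hypothetical payload, trimmed to the fields used above.
EXAMPLE = json.loads("""
{
  "version": "4",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "HighP95Latency", "service": "auth-service"},
      "annotations": {"summary": "auth-service p95 latency crossed 2.4s"},
      "startsAt": "2024-01-01T12:00:00Z"
    },
    {
      "status": "resolved",
      "labels": {"alertname": "DiskFull", "service": "batch"},
      "annotations": {},
      "startsAt": "2024-01-01T10:00:00Z"
    }
  ]
}
""")

for alert in firing_alerts(EXAMPLE):
    print(alert.name, alert.service, alert.summary)
```

Only the firing alert survives the filter; a real handler would hand each one to the agent loop instead of printing it.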
From signal to cited hypothesis.
The product shape follows the main operational pain: correlating four observability signals under pressure, without hiding the evidence behind a black box.
01. Collect
Alloy ships metrics, logs, traces and profiles from member nodes into the central LGTM+ stack.
02. Investigate
The LangGraph agent iterates over PromQL, LogQL, TraceQL, Pyroscope and topology tools.
03. Cite
Every hypothesis links back to Grafana Explore with the exact query and time window that produced it.
04. Decide
Telegram action cards capture approve, reject and false-positive decisions before any command can run.
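The Cite step above can be pictured as a small data structure that keeps each query next to the time window it ran over. This is a minimal sketch under assumptions: `Citation`, `Hypothesis`, `window_around`, the 15-minute window and the metric name in the example query are all hypothetical, and the Grafana Explore deep link itself is left out because its URL encoding depends on the Grafana version.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Citation:
    language: str  # e.g. "PromQL", "LogQL", "TraceQL"
    query: str
    start: datetime
    end: datetime

    def render(self) -> str:
        """Render the citation as one line for a chat message."""
        window = f"{self.start.isoformat()} to {self.end.isoformat()}"
        return f"[{self.language}] {self.query} ({window})"


@dataclass
class Hypothesis:
    text: str
    citations: list[Citation]

    def is_postable(self) -> bool:
        # Assumed rule: a hypothesis without evidence is held back.
        return len(self.citations) > 0


def window_around(alert_time: datetime, minutes: int = 15) -> tuple[datetime, datetime]:
    """Return a symmetric time window around the alert timestamp."""
    delta = timedelta(minutes=minutes)
    return alert_time - delta, alert_time + delta


fired = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
start, end = window_around(fired)
cite = Citation(
    language="PromQL",
    # Hypothetical metric name, for illustration only.
    query='histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service="auth-service"}[5m])))',
    start=start,
    end=end,
)
hypothesis = Hypothesis(
    text="Latency aligns with Redis saturation after deploy canary-42.",
    citations=[cite],
)
print(hypothesis.text)
print(cite.render())
```

Storing the window alongside the query is what lets an operator rerun exactly the same request in Grafana and check the agent's reading of it.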
Centralized data, reviewable control.
Huginn separates the data plane from the agent control plane. The agent never reads raw object storage; it uses the same APIs an operator would verify in Grafana.
Data plane
Mimir, Loki, Tempo, Pyroscope and MinIO store the four signals behind one tenant header.
Control plane
Alertmanager triggers the Python agent. Telegram carries the thread, citations, action card and approvals.
BYOK runtime
OpenAI-compatible, Anthropic, Ollama or local endpoints. No bundled model and no phone-home telemetry by default.
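To make the "same APIs an operator would verify" idea concrete, here is a sketch of a tenant-scoped range query against Mimir's Prometheus-compatible HTTP API. It assumes Mimir's default `/prometheus` path prefix and the `X-Scope-OrgID` tenant header; the `build_range_query` helper, the host name, the tenant ID and the query are made up for the example. The request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_range_query(base_url: str, tenant: str, promql: str,
                      start: int, end: int, step: str = "30s") -> Request:
    """Build a GET request for a PromQL range query scoped to one tenant."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    # Assumes the default Mimir prefix; adjust if the deployment overrides it.
    url = f"{base_url.rstrip('/')}/prometheus/api/v1/query_range?{params}"
    return Request(url, headers={"X-Scope-OrgID": tenant})


# Hypothetical host, tenant and query.
req = build_range_query(
    "https://mimir.example.internal",
    tenant="homelab",
    promql='up{job="auth-service"}',
    start=1700000000,
    end=1700000900,
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would issue the call. Because the agent goes through the query API rather than object storage, every result it cites can be reproduced with the same query and tenant.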
Two modes, one config schema.
The same system can start as a single-instance homelab deployment and grow into central plus member nodes without changing its mental model.
Single-instance
One host runs collectors, LGTM+ backends, MinIO, Grafana, Alertmanager and the agent for local tenant operations.
Central + members
Member nodes push over HTTPS with a bearer token and an X-Scope-OrgID header, so environments behind NAT or with dynamic IPs stay workable.
Trust boundary
PII redaction, local-only KB files, sandboxed commands and command whitelists keep automation reviewable.
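The combination of approval gating and a command whitelist might look roughly like the sketch below. Everything here is an assumption for illustration: the `ActionCard` type, the `runnable_argv` helper, the contents of `ALLOWED_COMMANDS` and the `kubectl` command are hypothetical, and sandboxing is not shown. The point is only that a command becomes runnable when it is both approved by an operator and on the whitelist.

```python
import shlex
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical whitelist of executables the agent may ever invoke.
ALLOWED_COMMANDS = {"kubectl", "systemctl"}


@dataclass
class ActionCard:
    command: str
    decision: str = "pending"  # "pending", "approved" or "rejected"
    audit_log: list[str] = field(default_factory=list)

    def decide(self, operator: str, decision: str) -> None:
        """Record an operator decision and append it to the audit log."""
        if decision not in {"approved", "rejected"}:
            raise ValueError(f"unknown decision: {decision}")
        self.decision = decision
        self.audit_log.append(f"{operator}:{decision}")


def runnable_argv(card: ActionCard) -> Optional[list[str]]:
    """Return the argv to execute, or None if the card must not run."""
    if card.decision != "approved":
        return None
    argv = shlex.split(card.command)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        return None
    return argv


# Hypothetical remediation command for the redis-session example.
card = ActionCard("kubectl scale statefulset redis-session --replicas=2")
print(runnable_argv(card))  # not approved yet, so nothing is runnable
card.decide("alice", "approved")
print(runnable_argv(card))
```

Splitting with `shlex` and returning an argv list, rather than a shell string, keeps the whitelist check on the actual executable and avoids shell interpolation.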