Huginn

Self-hosted LGTM+ observability with AI-assisted RCA.

Open Source

huginn.central/incident/rca

Telemetry intake (tenant: member_3): member-1 through member-n, each pushing via Alloy.

Signals: Mimir (metrics), Loki (logs), Tempo (traces), Pyroscope (profiles).

Central (LGTM+ control room): Mimir, Loki, Tempo and Pyroscope on a MinIO / S3 object store.

Huginn Agent: LangGraph query loop over PromQL, LogQL, TraceQL and KB/topology.

Telegram thread: #incident-auth-service (RCA)

Alert fired: auth-service p95 latency crossed 2.4s

Hypothesis: latency aligns with Redis saturation after deploy canary-42. Grafana Explore citation attached.

Suggested action: Scale redis-session to two replicas, then watch p95 for 15 minutes.

Approve & Execute | Mark False Positive

Operations intelligence

A self-hosted agent layer for the LGTM+ stack.

Huginn keeps the familiar Grafana data plane and adds a reviewable agent loop on top. Alerts become cited RCA threads, operators can ask follow-up questions in chat, and remediation stays behind explicit approval until trust is earned.

Reactive RCA

Alertmanager webhooks trigger an agent that queries metrics, logs, traces, profiles and topology before posting a hypothesis.
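
As a rough illustration of the trigger, the sketch below parses an Alertmanager webhook body and picks out the firing alerts an agent would investigate. The payload fields (`alerts`, `status`, `labels`, `startsAt`) follow Alertmanager's webhook format; the function name `firing_alerts`, the `service` label and the sample values are assumptions for illustration, not Huginn's actual code.

```python
import json


def firing_alerts(body: str) -> list[dict]:
    """Return one summary dict per firing alert in an Alertmanager webhook body."""
    payload = json.loads(body)
    summaries = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts do not need a new investigation
        labels = alert.get("labels", {})
        summaries.append({
            "alertname": labels.get("alertname", "unknown"),
            # The "service" label is an assumption; use whatever your rules emit.
            "service": labels.get("service", "unknown"),
            "started_at": alert.get("startsAt"),
        })
    return summaries


# Made-up sample payload, trimmed to the fields used above.
SAMPLE = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [
        {"status": "firing",
         "labels": {"alertname": "HighP95Latency", "service": "auth-service"},
         "startsAt": "2024-01-01T12:00:00Z"},
        {"status": "resolved",
         "labels": {"alertname": "DiskFull", "service": "loki"},
         "startsAt": "2024-01-01T09:00:00Z"},
    ],
})

if __name__ == "__main__":
    print(firing_alerts(SAMPLE))
```

Only the firing alert survives the filter, so a resolved notification would not start a second investigation.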

Conversational on-call

Telegram threads keep incident context alive so operators can ask why a service is slow without switching across query languages.

Approval-gated remediation

Runbooks become action cards with approve or reject decisions, audit logs and an opt-in path to automation.
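
One way to picture an action card is as a small state machine that refuses to run anything until a decision is recorded. The sketch below is a minimal version under that assumption; the class name `ActionCard`, its fields and the decision strings are hypothetical, not Huginn's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ActionCard:
    """A proposed remediation that stays inert until an operator decides."""
    command: str
    status: str = "pending"
    audit_log: list[dict] = field(default_factory=list)

    def decide(self, operator: str, decision: str) -> None:
        """Record exactly one approve/reject decision and log who made it."""
        if self.status != "pending":
            raise ValueError(f"card already {self.status}")
        if decision not in {"approved", "rejected"}:
            raise ValueError(f"unknown decision: {decision}")
        self.status = decision
        self.audit_log.append({
            "operator": operator,
            "decision": decision,
            "command": self.command,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def can_execute(self) -> bool:
        # Only an explicit approval unlocks execution.
        return self.status == "approved"


if __name__ == "__main__":
    card = ActionCard(command="scale redis-session to two replicas")
    print(card.can_execute())   # pending cards never run
    card.decide("alice", "approved")
    print(card.can_execute(), card.audit_log)
```

Making the card single-use keeps the audit log unambiguous: each entry maps to one decision on one proposed command.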

Incident loop

From signal to cited hypothesis.

The product shape follows the main operational pain: correlate four observability pillars under pressure, without hiding the evidence behind a black box.

01. Collect: Alloy ships metrics, logs, traces and profiles from member nodes into the central LGTM+ stack.
02. Investigate: The LangGraph agent iterates over PromQL, LogQL, TraceQL, Pyroscope and topology tools.
03. Cite: Every hypothesis links back to Grafana Explore with the exact query and time window that produced it.
04. Decide: Telegram action cards capture approve, reject and false-positive decisions before any command can run.
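
The investigate and cite steps above can be sketched as a loop in which every tool call returns its finding together with the query and time window behind it. This is a plain-Python stand-in, not LangGraph code; the names `Citation`, `Finding` and `investigate`, the stub tools and the metric name in the PromQL string are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Citation:
    language: str   # e.g. "PromQL", "LogQL", "TraceQL"
    query: str      # the exact query that was run
    start: str      # time window start (ISO 8601)
    end: str        # time window end (ISO 8601)


@dataclass(frozen=True)
class Finding:
    summary: str
    citation: Citation


# A tool takes a time window and returns a cited finding, or None if nothing stood out.
Tool = Callable[[str, str], Optional[Finding]]


def investigate(tools: dict[str, Tool], start: str, end: str,
                max_steps: int = 4) -> list[Finding]:
    """Run up to max_steps tools and keep only findings that carry a citation."""
    findings: list[Finding] = []
    for step, tool in enumerate(tools.values()):
        if step >= max_steps:
            break
        finding = tool(start, end)
        if finding is not None:
            findings.append(finding)
    return findings


def stub_metrics_tool(start: str, end: str) -> Optional[Finding]:
    # Hypothetical query; a real tool would call the metrics API here.
    query = 'histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service="auth-service"}[5m])))'
    return Finding("p95 latency above threshold", Citation("PromQL", query, start, end))


def stub_logs_tool(start: str, end: str) -> Optional[Finding]:
    return None  # nothing notable in this stub


if __name__ == "__main__":
    tools = {"metrics": stub_metrics_tool, "logs": stub_logs_tool}
    for f in investigate(tools, "2024-01-01T12:00:00Z", "2024-01-01T12:15:00Z"):
        print(f.summary, "|", f.citation.language, "|", f.citation.query)
```

Because a `Finding` cannot be constructed without a `Citation`, a hypothesis with no supporting query never reaches the thread in this sketch.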

Architecture

Centralized data, reviewable control.

Huginn separates the data plane from the agent control plane. The agent never reads raw object storage; it uses the same APIs an operator would verify in Grafana.
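
To make the "same APIs" point concrete, the sketch below builds the kind of tenant-scoped range queries an operator could also run by hand. It assumes Mimir's default `/prometheus` path prefix for the Prometheus-compatible API and Loki's `/loki/api/v1/query_range` endpoint; the function names and the `central:8080` host are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

TENANT_HEADER = "X-Scope-OrgID"


def mimir_range_query(base_url: str, tenant: str, promql: str,
                      start: int, end: int, step: str = "60s") -> tuple[str, dict[str, str]]:
    """Build a PromQL range query URL and headers (start/end in Unix seconds)."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url}/prometheus/api/v1/query_range?{params}", {TENANT_HEADER: tenant}


def loki_range_query(base_url: str, tenant: str, logql: str,
                     start_ns: int, end_ns: int, limit: int = 100) -> tuple[str, dict[str, str]]:
    """Build a LogQL range query URL and headers (start/end in Unix nanoseconds)."""
    params = urlencode({"query": logql, "start": start_ns, "end": end_ns, "limit": limit})
    return f"{base_url}/loki/api/v1/query_range?{params}", {TENANT_HEADER: tenant}


if __name__ == "__main__":
    url, headers = mimir_range_query("http://central:8080", "member_3", "up",
                                     1700000000, 1700000900)
    print(url)
    print(headers)
```

Since the agent would go through the query APIs rather than raw object storage, the URL it used is something a reviewer can replay with the same tenant header.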

Data plane

Mimir, Loki, Tempo and Pyroscope store the four signals in MinIO object storage, all scoped by one tenant header.

Mimir, Loki, Tempo, Pyroscope, MinIO

Control plane

Alertmanager triggers the Python agent. Telegram carries the thread, citations, action card and approvals.

Alertmanager, LangGraph, Telegram

BYOK runtime

OpenAI-compatible, Anthropic, Ollama or local endpoints. No bundled model and no phone-home telemetry by default.
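
A bring-your-own-key runtime usually comes down to resolving a provider, a base URL and an optional key from local configuration. The sketch below shows one possible shape; the `HUGINN_LLM_*` variable names, the `ModelEndpoint` type and the validation rules are assumptions for illustration, not Huginn's documented settings.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Default base URLs; an OpenAI-compatible endpoint has no default and must be set explicitly.
DEFAULT_BASE_URLS: dict[str, Optional[str]] = {
    "openai-compatible": None,
    "anthropic": "https://api.anthropic.com",
    "ollama": "http://localhost:11434",
}


@dataclass(frozen=True)
class ModelEndpoint:
    provider: str
    base_url: str
    api_key: Optional[str]


def resolve_endpoint(env: Mapping[str, str]) -> ModelEndpoint:
    """Pick the model endpoint from user-supplied settings; nothing is bundled."""
    provider = env.get("HUGINN_LLM_PROVIDER", "ollama")  # hypothetical variable names
    if provider not in DEFAULT_BASE_URLS:
        raise ValueError(f"unsupported provider: {provider}")
    base_url = env.get("HUGINN_LLM_BASE_URL") or DEFAULT_BASE_URLS[provider]
    if base_url is None:
        raise ValueError("openai-compatible endpoints need an explicit base URL")
    api_key = env.get("HUGINN_LLM_API_KEY")
    if provider == "anthropic" and not api_key:
        raise ValueError("anthropic requires a user-supplied API key")
    return ModelEndpoint(provider, base_url, api_key)


if __name__ == "__main__":
    print(resolve_endpoint(os.environ))
```

Passing the environment in as a mapping keeps the function easy to test and makes it obvious that the key comes from the operator, not from the package.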

BYOK, Ollama, Anthropic, OpenAI-compatible

Deployment

Two modes, one config schema.

The same system can start as a single-instance homelab deployment and grow into central plus member nodes without changing its mental model.
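
A single schema covering both modes could treat the member list as the only difference, with an empty list meaning single-instance. The sketch below assumes that reading; `HuginnConfig`, `Member` and the field names are hypothetical, not the project's real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    tenant_id: str  # value sent as X-Scope-OrgID


@dataclass(frozen=True)
class HuginnConfig:
    central_url: str
    members: tuple[Member, ...] = ()

    @property
    def mode(self) -> str:
        # members = 0 means everything runs on one host.
        return "single-instance" if len(self.members) == 0 else "central+members"


if __name__ == "__main__":
    homelab = HuginnConfig(central_url="https://huginn.example")
    fleet = HuginnConfig(
        central_url="https://huginn.example",
        members=(Member("member-1", "member_1"), Member("member-3", "member_3")),
    )
    print(homelab.mode, fleet.mode)
```

Deriving the mode from the data, rather than storing it as a separate flag, means growing from homelab to fleet is just a matter of adding members.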

members = 0

Single-instance

One host runs collectors, LGTM+ backends, MinIO, Grafana, Alertmanager and the agent for local tenant operations.

X-Scope-OrgID

Central + members

Member nodes push over HTTPS with a bearer token and an X-Scope-OrgID header, so NAT and dynamic IP environments stay workable.
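
Because members initiate the connection, the central side only needs to check the scheme, the token and the tenant header on each push. A minimal sketch of the member-side request setup follows; the helper names and the example URL are hypothetical, and in practice Alloy's own configuration would carry these values.

```python
from urllib.parse import urlsplit


def push_headers(token: str, tenant_id: str) -> dict[str, str]:
    """Headers a member attaches to every outbound push."""
    if not token or not tenant_id:
        raise ValueError("both a bearer token and a tenant ID are required")
    return {
        "Authorization": f"Bearer {token}",
        "X-Scope-OrgID": tenant_id,
    }


def require_https(url: str) -> str:
    """Reject plaintext endpoints so the bearer token never travels unencrypted."""
    if urlsplit(url).scheme != "https":
        raise ValueError(f"push endpoint must use https: {url}")
    return url


if __name__ == "__main__":
    endpoint = require_https("https://huginn.example/api/v1/push")  # hypothetical URL
    print(endpoint, push_headers("s3cr3t", "member_3"))
```

Outbound-only pushes are what keep NAT and dynamic-IP members workable: the central node never has to reach back into the member's network.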

audit log

Trust boundary

PII redaction, local-only KB files, sandboxed commands and command whitelists keep automation reviewable.
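
Two of these controls are easy to picture in code: redacting obvious PII before text reaches a model, and checking a proposed command against a whitelist before it is even offered for approval. The sketch below is illustrative only; the regular expressions are deliberately simple, and the `kubectl` entries in `ALLOWED_COMMANDS` are assumed examples, not Huginn's shipped list.

```python
import re
import shlex

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholders."""
    text = EMAIL_RE.sub("[email]", text)
    return IPV4_RE.sub("[ip]", text)


# Hypothetical whitelist keyed on (program, subcommand).
ALLOWED_COMMANDS = {("kubectl", "scale"), ("kubectl", "rollout")}
SHELL_METACHARACTERS = set(";|&$`><")


def is_allowed(command: str) -> bool:
    """Allow only whitelisted program/subcommand pairs with no shell metacharacters."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar
        return False
    if len(argv) < 2:
        return False
    return (argv[0], argv[1]) in ALLOWED_COMMANDS


if __name__ == "__main__":
    print(redact("login failed for bob@example.com from 10.0.0.7"))
    print(is_allowed("kubectl scale deployment redis-session --replicas=2"))
    print(is_allowed("kubectl scale deployment x; rm -rf /"))
```

A whitelist that fails closed, rejecting anything it cannot parse, keeps the set of runnable commands small enough for a human to review.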