Health Sweeps

The /health command runs an infrastructure health check. It scans incidents, alarms, error logs, metrics, and recent deployments over a lookback window and produces a structured report.

Usage

/health          # 1-hour lookback (default)
/health 4h       # 4-hour lookback
/health 24h      # 24-hour lookback (full day)
/health 15m      # 15-minute lookback (quick check)
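
The window argument in the examples above is a number followed by a unit, with 1 hour as the default when no argument is given. As an illustration only, the Python sketch below shows how such an argument could be interpreted. The helper name parse_lookback is hypothetical, and the sketch accepts only the m and h units that appear in the examples. How NeuBird parses the argument internally, and whether other units are supported, is not stated here.

```python
import re
from datetime import timedelta
from typing import Optional

# Units that appear in the usage examples; support for others is unknown.
_UNIT_SECONDS = {"m": 60, "h": 3600}

# Documented default: 1-hour lookback when no argument is given.
DEFAULT_LOOKBACK = timedelta(hours=1)


def parse_lookback(arg: Optional[str]) -> timedelta:
    """Convert a window such as '15m' or '4h' to a timedelta (hypothetical helper)."""
    if arg is None or not arg.strip():
        return DEFAULT_LOOKBACK
    match = re.fullmatch(r"(\d+)([mh])", arg.strip())
    if match is None:
        raise ValueError(f"unrecognized lookback window: {arg!r}")
    amount = int(match.group(1))
    if amount == 0:
        raise ValueError("lookback window must be positive")
    return timedelta(seconds=amount * _UNIT_SECONDS[match.group(2)])
```

Under these assumptions, parse_lookback("4h") yields a four-hour window and parse_lookback(None) falls back to the one-hour default. Rejecting unknown units with an error, rather than guessing, keeps a typo from silently widening or narrowing the sweep.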

What it checks

Category	Data Sources	What NeuBird Looks For
Incidents	PagerDuty, OpsGenie, ServiceNow	Active incidents, severity, acknowledgment status
Alarms	CloudWatch, Datadog monitors	Currently firing alarms, alarm history
Error logs	Grafana Loki, Splunk	Error rate by service, new error patterns
Metrics	Prometheus, CloudWatch, Datadog	Resource saturation, latency trends, anomalies
Deployments	GitHub, GitLab, ArgoCD	Recent merges, deployment status, risky changes

Report structure

The health report groups findings by status and closes with prioritized next steps:

Section	Description
Bad	Active problems with evidence, blast radius, and timeline
Good	Services operating normally, with evidence to prove it
Ugly	Not broken yet, but trending the wrong way
Recommended Actions	Prioritized next steps ranked by urgency

Each finding includes the specific data sources and queries that produced the evidence.
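
For readers who want to post-process a report, the structure described above can be modeled roughly as follows. This is a sketch under stated assumptions, not NeuBird's actual schema: the class names Finding and HealthReport, their field names, and the render layout are all hypothetical and chosen only to mirror the sections and the per-finding evidence described in this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    """One report entry; field names are illustrative, not an official schema."""
    summary: str
    data_sources: List[str]  # integrations that supplied the evidence
    queries: List[str]       # queries that produced the evidence


@dataclass
class HealthReport:
    bad: List[Finding] = field(default_factory=list)    # active problems
    good: List[Finding] = field(default_factory=list)   # operating normally
    ugly: List[Finding] = field(default_factory=list)   # trending the wrong way
    recommended_actions: List[str] = field(default_factory=list)  # most urgent first

    def render(self) -> str:
        """Emit the sections in the same order as the table above."""
        lines: List[str] = []
        sections = (("Bad", self.bad), ("Good", self.good), ("Ugly", self.ugly))
        for title, findings in sections:
            lines.append(title)
            for finding in findings:
                sources = ", ".join(finding.data_sources)
                lines.append(f"  - {finding.summary} (sources: {sources})")
        lines.append("Recommended Actions")
        for rank, action in enumerate(self.recommended_actions, start=1):
            lines.append(f"  {rank}. {action}")
        return "\n".join(lines)
```

Keeping the data sources and queries on each finding, rather than on the report as a whole, matches the statement that every finding carries its own evidence trail.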