Complete Onboarding Guide¶
Step-by-step guide to set up Hawkeye MCP and start investigating incidents.
Overview¶
This guide walks you through the complete onboarding process for Hawkeye MCP:
The Onboarding Journey¶
graph LR
A[Install MCP] --> B[Add Connections]
B --> C[Create Project]
C --> D[Run Test Investigation]
D --> E[Find & Investigate Alerts]
E --> F[Configure Instructions]
F --> G[Optimize & Refine]
Phases at a Glance¶
📦 Prerequisites & Installation (5-10 minutes)
- Install Hawkeye MCP Server and configure your AI client
- See Installation Guide for details
🔗 Phase 1: Add Connections (10-15 minutes)
- Connect your cloud providers (AWS, Azure, GCP)
- Link monitoring tools (Datadog, PagerDuty, etc.)
- Wait for initial data sync
📁 Phase 2: Create Project (5 minutes)
- Set up your first project
- Link connections to the project
- Set as default project
🔍 Phase 3: Run First Investigation (5-10 minutes)
- Create a manual investigation to test the system
- Review the RCA (Root Cause Analysis)
- Understand the investigation output
🎯 Phase 4: Find & Investigate Real Alerts (10-15 minutes)
- List uninvestigated alerts from your systems
- Investigate a real incident
- Execute corrective actions
📋 Phase 5: Configure Instructions (15-20 minutes)
- Add SYSTEM instructions for context
- Create FILTER instructions to reduce noise
- Test RCA instructions on past incidents
Total time: 50-75 minutes
Phased Approach
You don't need to complete everything in one sitting! Many teams split this across multiple sessions:
- Day 1: Install + Connections + Project (20-30 min)
- Day 2: First investigations + Instructions (30-45 min)
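The time estimates above can be checked by summing the per-phase ranges. The following is only a small arithmetic sketch; the dictionary and function names are made up for illustration and are not part of Hawkeye.

```python
# Phase durations in minutes, copied from "Phases at a Glance" as (min, max).
PHASES = {
    "Prerequisites & Installation": (5, 10),
    "Phase 1: Add Connections": (10, 15),
    "Phase 2: Create Project": (5, 5),
    "Phase 3: Run First Investigation": (5, 10),
    "Phase 4: Find & Investigate Real Alerts": (10, 15),
    "Phase 5: Configure Instructions": (15, 20),
}


def total_range(names):
    """Sum the lower and upper bounds for the given phases."""
    low = sum(PHASES[name][0] for name in names)
    high = sum(PHASES[name][1] for name in names)
    return low, high


names = list(PHASES)
print(total_range(names))      # whole guide -> (50, 75)
print(total_range(names[:3]))  # Day 1: install + connections + project -> (20, 30)
print(total_range(names[3:]))  # Day 2: investigations + instructions -> (30, 45)
```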
Prerequisites¶
Before starting, ensure you have:
- Hawkeye account (book a demo if needed)
- Hawkeye MCP installed (see Installation Guide)
- AI client configured (Claude Desktop, Claude Code, Cursor, or GitHub Copilot)
- Cloud provider access (AWS, Azure, or GCP credentials)
- Monitoring tool credentials (Datadog, PagerDuty, New Relic, etc.)
Installation Required
If you haven't installed Hawkeye MCP yet, follow the Installation Guide first, then return here.
Phase 1: Add Connections¶
Connect your cloud providers and monitoring tools to Hawkeye.
Step 1.1: Understand Connections¶
Hawkeye needs access to your infrastructure to investigate incidents:
| Connection Type | What It Provides | Required? |
|---|---|---|
| AWS | CloudWatch logs/metrics, EC2, RDS, Lambda, etc. | ✅ Recommended |
| Azure | Azure Monitor, App Insights, VMs | ✅ If using Azure |
| GCP | Cloud Logging, Monitoring, Compute | ✅ If using GCP |
| Grafana | Prometheus, Loki, Tempo | ✅ If using Grafana with Kubernetes |
| Datadog | Logs, metrics, traces, APM | 🟡 Optional but helpful |
| Dynatrace | APM, infrastructure monitoring | 🟡 Optional |
| PagerDuty | Alert management, on-call schedules | 🟡 Optional |
| ServiceNow | Alert management, on-call schedules | 🟡 Optional |
| FireHydrant | Alert management, on-call schedules | 🟡 Optional |
| Incident.io | Alert management, on-call schedules | 🟡 Optional |
Start minimal
Begin with your primary cloud provider + one monitoring tool + one incident management tool. You can add more connections later.
Step 1.2: Create AWS Connection¶
Ask Claude:
Claude will guide you through providing:
- IAM Role ARN and ExternalId
- Regions to monitor
- Connection name (e.g., "Production AWS")
Example:
✓ Created AWS connection: Production AWS
Role ARN: arn:aws:iam::123456789012:role/neubird-hawkeye-readonly
Status: Syncing (this may take 5-10 minutes)
ExternalID: <your-external-id>
Regions: us-east-1, us-west-2
Learn more about creating your AWS role here: AWS Connection Setup
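For orientation, a cross-account IAM role that is assumed with an ExternalId usually carries a trust policy with an `sts:ExternalId` condition. The sketch below builds such a policy document in Python. The trusted principal ARN is a placeholder (this guide does not state which account assumes the role), and the helper name is hypothetical, so take the real values from the AWS Connection Setup page.

```python
import json


def build_trust_policy(trusted_principal_arn: str, external_id: str) -> dict:
    """Return a standard IAM trust policy that requires an ExternalId."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Placeholder: the principal allowed to assume this role.
                "Principal": {"AWS": trusted_principal_arn},
                "Action": "sts:AssumeRole",
                # The role can only be assumed when the caller passes this ExternalId.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


policy = build_trust_policy(
    "arn:aws:iam::<trusted-account-id>:root",  # placeholder, not a real account
    "<your-external-id>",
)
print(json.dumps(policy, indent=2))
```

Permissions for the role itself (read-only access, as the example role name suggests) are attached separately and are not shown here.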
Step 1.3: Wait for Connection Sync¶
Check sync status:
Response:
Connection: Production AWS
Status: SYNCED ✓
Last sync: 2 minutes ago
Resources discovered:
- 45 EC2 instances
- 12 RDS databases
- 23 Lambda functions
- 156 CloudWatch alarms
First sync takes time
Initial sync can take 2-10 minutes depending on resource count. You can proceed to the next phase while it syncs.
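If you script around the sync wait instead of asking Claude again by hand, the pattern is a plain poll-until-ready loop. This is a minimal sketch under assumptions: `get_status` is a hypothetical callable standing in for however you obtain the connection status, and only the status strings ("Syncing", "SYNCED") come from the examples in this guide.

```python
import time
from typing import Callable


def wait_for_sync(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_status() until it returns "SYNCED" or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "SYNCED":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Stand-in status source: reports "Syncing" twice, then "SYNCED".
_fake_statuses = iter(["Syncing", "Syncing", "SYNCED"])
print(wait_for_sync(lambda: next(_fake_statuses), sleep=lambda s: None))  # True
```

The `sleep` parameter is injectable only so the example runs instantly; in real use the default `time.sleep` applies.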
Step 1.4: Add Monitoring Tools (Optional)¶
Add Datadog:
Learn more about creating your Datadog API keys here: Datadog Connection Setup
Add PagerDuty:
Learn more about creating your PagerDuty API keys here: PagerDuty Connection Setup
Step 1.5: Verify Connections¶
Expected output:
Found 2 connections:
1. Production AWS (AWS)
Status: SYNCED and TRAINED ✅
Telemetry Types: Alert, Config, Log, Metric
2. Production Datadog (Datadog)
Status: SYNCED and TRAINED ✅
Telemetry Types: Alert, Log, Metric
Phase 2: Create Project¶
Projects organize your connections, instructions, and investigation history.
Step 2.1: Understand Projects¶
A Hawkeye project organizes:
- Connections - Which cloud/monitoring tools to use
- Instructions - How to investigate incidents
- Sessions - Investigation history
Most teams start with one project per environment (Production, Staging, etc.).
Step 2.2: Create Project¶
Ask Claude:
Claude will use hawkeye_create_project:
✓ Created project "Production"
UUID: abc-123-def-456
Status: Active
Next steps:
1. Add connections
2. Configure instructions
3. Start investigating
Save the Project UUID - You'll need it for later steps.
Step 2.3: Set as Default Project¶
Set this project as your default to avoid specifying project_uuid in every command:
Or use the UUID directly:
Claude will confirm:
✓ Default project set to: Production
UUID: abc-123-def-456
All operations will now use this project by default.
Benefits:
- 🎯 No need to specify project_uuid in commands
- 🔄 Easy switching between environments later
- 💬 Use natural language: "Switch to Staging project"
Multiple Projects
If you create multiple projects (Production, Staging, Dev), you can quickly switch between them using hawkeye_set_default_project. The default persists for your entire MCP session.
Step 2.4: Link Connections to Project¶
Claude will use hawkeye_add_connection_to_project:
✓ Added 2 connections to Production project
- Production AWS (AWS)
- Production Datadog (Datadog)
Project is now ready for investigations!
Step 2.5: Verify Project Setup¶
This uses hawkeye_get_project_details:
Production Project Details
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
UUID: abc-123-def-456
Status: Active
Created: Just now
Connections (2):
- Production AWS (AWS) - SYNCED
- Production Datadog (Datadog) - SYNCED
Instructions: None yet
Sessions: None yet
Phase 3: Run First Investigation¶
Test the system with a manual investigation before waiting for real alerts.
Step 3.1: Create a Test Investigation¶
Create a manual investigation to verify everything works:
Investigate potential memory leak in user-api pods that I noticed this morning.
Memory usage increased from 500MB to 1.2GB between 8am-10am UTC today.
No alerts fired yet but trending upward.
Claude will use hawkeye_create_manual_investigation:
✓ Created manual investigation
Session UUID: xyz-789-abc-123
Status: Running
Investigation will complete in 2-5 minutes.
Analyzing:
- CloudWatch metrics for memory usage
- Pod restart patterns
- Application logs
- Resource allocation changes
Step 3.2: Wait for Completion¶
You can check status:
Or wait 2-5 minutes and then get the RCA.
Step 3.3: Review the RCA¶
Claude uses hawkeye_get_rca:
Root Cause Analysis
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Incident: Memory Leak - user-api
Severity: P2 (High)
Duration: Ongoing (2 hours)
Status: Active
ROOT CAUSE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Memory leak caused by unclosed database connections
in the user session handler. Connection pool reached
maximum capacity, preventing cleanup of old connections.
TIMELINE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
08:00 UTC - Normal memory usage (500MB)
08:15 UTC - Gradual increase begins
09:00 UTC - Memory at 800MB (60% increase)
09:30 UTC - Connection pool at 85% capacity
10:00 UTC - Memory at 1.2GB (140% increase)
10:15 UTC - Connection pool at max (100 connections)
CORRECTIVE ACTIONS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Immediate:
1. Restart affected pods to clear connections
kubectl rollout restart deployment/user-api
2. Temporarily increase connection pool timeout
kubectl set env deployment/user-api DB_POOL_TIMEOUT=30000
Long-term fixes:
1. Fix connection leak in code (user-session.js:45)
2. Implement connection pool monitoring
3. Add automated cleanup for stale connections
TIME SAVED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Manual investigation time: ~30 minutes
Hawkeye investigation time: 2 minutes
Time saved: 28 minutes ⚡
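The percentages in the sample timeline and the time-saved figure are simple arithmetic on the numbers shown. As a quick sanity check (an illustrative sketch only; the function name is made up here):

```python
def percent_increase(baseline: float, current: float) -> float:
    """Percentage growth of `current` relative to `baseline`."""
    return (current - baseline) * 100 / baseline


print(percent_increase(500, 800))   # 09:00 vs 08:00 baseline -> 60.0
print(percent_increase(500, 1200))  # 10:00, taking 1.2GB as 1200MB -> 140.0
print(30 - 2)                       # manual minus Hawkeye time -> 28 minutes
```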
Step 3.4: Understanding the Output¶
Every RCA includes:
- Incident Summary - What happened and severity
- Root Cause - Why it happened with technical details
- Timeline - Chronological event sequence
- Corrective Actions - Ready-to-execute fixes
- Time Savings - Efficiency metrics
Test Complete!
If you got a detailed RCA like above, your setup is working correctly! The system successfully:
- Connected to your cloud provider
- Analyzed metrics and logs
- Generated actionable insights
Phase 4: Find & Investigate Real Alerts¶
Now that the system is working, investigate real incidents from your infrastructure.
Step 4.1: List Uninvestigated Alerts¶
This uses hawkeye_list_sessions with only_uninvestigated=true:
Found 3 uninvestigated incidents:
1. High API Latency - user-service
Alert ID: /aws/cloudwatch/alerts/latency-spike-123
Severity: P1
Time: 2 hours ago
Source: CloudWatch
2. Database Connection Pool Exhausted
Alert ID: /datadog/alerts/db-pool-456
Severity: P2
Time: 4 hours ago
Source: Datadog
3. Lambda Cold Start Timeout
Alert ID: /aws/cloudwatch/alerts/lambda-789
Severity: P3
Time: 6 hours ago
Source: CloudWatch
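Under the hood, an MCP client invokes a tool by sending a JSON-RPC `tools/call` request. Your AI client builds this for you, so the sketch below is only meant to show roughly what the call in this step looks like on the wire. The tool name and the `only_uninvestigated` argument come from this guide; any other arguments the tool accepts are not documented here and are not shown.

```python
import json


def tools_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": tool, "arguments": arguments},
        }
    )


request = tools_call_request(
    1, "hawkeye_list_sessions", {"only_uninvestigated": True}
)
print(request)
```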
Step 4.2: Investigate an Alert¶
Claude will:
1. Extract the alert_id
2. Call hawkeye_investigate_alert
3. Wait for completion (30-90 seconds)
4. Retrieve RCA automatically
🔍 Starting investigation...
⏳ Analyzing CloudWatch logs... (20s)
⏳ Checking application traces... (15s)
⏳ Reviewing database metrics... (10s)
⏳ Correlating events... (15s)
✓ Investigation complete! (60s total)
Step 4.3: Review and Act¶
The RCA will provide:
- Root cause explanation
- Timeline of events
- Corrective actions (with bash commands)
- Preventive measures
Execute the recommended actions:
Step 4.4: Ask Follow-Up Questions¶
Claude uses hawkeye_continue_investigation to provide deeper insights.
Phase 5: Configure Instructions¶
Fine-tune how Hawkeye investigates incidents by adding instructions.
Step 5.1: Understand Instruction Types¶
| Type | Purpose | When to Use | Example |
|---|---|---|---|
| SYSTEM | Provide context about your architecture | Always (start with 1-2) | "We use microservices on Kubernetes" |
| FILTER | Reduce noise by filtering low-priority alerts | When getting too many alerts | "Only investigate P1 and P2 incidents" |
| RCA | Guide investigation steps for specific scenarios | For common incident types | "For database issues, check slow queries first" |
| GROUPING | Group related alerts together | When seeing duplicate alerts | "Group alerts from same service within 5 min" |
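To make the GROUPING example in the table concrete, here is one possible reading of "group alerts from same service within 5 min", written as plain Python. It is an illustration of the rule's intent, not Hawkeye's actual grouping logic, which this guide does not describe. The `billing` service name is made up for the example.

```python
from datetime import datetime, timedelta


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Group (service, timestamp) alerts.

    An alert joins the most recent group for its service if it arrives
    within `window` of that group's first alert; otherwise it starts a
    new group.
    """
    groups = []
    latest_group = {}  # service -> index of its most recent group
    for service, ts in sorted(alerts, key=lambda a: a[1]):
        idx = latest_group.get(service)
        if idx is not None and ts - groups[idx][0][1] <= window:
            groups[idx].append((service, ts))
        else:
            latest_group[service] = len(groups)
            groups.append([(service, ts)])
    return groups


t0 = datetime(2024, 1, 1, 9, 0)
alerts = [
    ("user-api", t0),
    ("user-api", t0 + timedelta(minutes=2)),
    ("billing", t0 + timedelta(minutes=3)),   # hypothetical service
    ("user-api", t0 + timedelta(minutes=9)),  # outside the 5-minute window
]
print([len(g) for g in group_alerts(alerts)])  # [2, 1, 1]
```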
Step 5.2: Add SYSTEM Instruction¶
Provide high-level context:
Create a SYSTEM instruction for my Production project:
"Our infrastructure runs on AWS EKS with 15 microservices.
We use PostgreSQL for databases and Redis for caching.
Peak traffic is 9am-5pm EST. We have auto-scaling enabled
for all services with min 2, max 10 replicas."
Claude uses hawkeye_create_project_instruction:
✓ Created SYSTEM instruction
Type: SYSTEM
Status: Active
This context will be used in all future investigations
to provide more relevant analysis.
Step 5.3: Add FILTER Instruction¶
Reduce noise:
Create a FILTER instruction for my Production project:
"Only investigate incidents with severity P1 (Critical)
or P2 (High). Ignore P3 and P4 alerts unless they occur
more than 5 times in 10 minutes."
This prevents low-priority alerts from creating unnecessary investigations.
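It can help to spell out the rule you are asking for before writing the instruction. The sketch below encodes the example FILTER text as a plain function so the edge cases are explicit ("more than 5" means 6 or more occurrences inside the 10-minute window). It models the wording of the instruction only; how Hawkeye interprets FILTER instructions internally is not specified in this guide.

```python
from datetime import datetime, timedelta


def should_investigate(
    severity: str,
    occurrences: list,
    now: datetime,
    window: timedelta = timedelta(minutes=10),
    threshold: int = 5,
) -> bool:
    """Apply the example rule: always P1/P2; P3/P4 only when repeated."""
    if severity in ("P1", "P2"):
        return True
    # Count occurrences that fall inside the trailing window.
    recent = [t for t in occurrences if now - window <= t <= now]
    return len(recent) > threshold


now = datetime(2024, 1, 1, 12, 0)
six_in_window = [now - timedelta(minutes=m) for m in range(6)]
print(should_investigate("P1", [], now))                 # True
print(should_investigate("P3", six_in_window, now))      # True (6 > 5)
print(should_investigate("P3", six_in_window[:5], now))  # False (5 is not > 5)
```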
Step 5.4: Add RCA Instruction¶
Guide investigation for common scenarios:
Create an RCA instruction for my Production project:
"For database-related incidents:
1. Check for slow queries (>1 second)
2. Review connection pool usage
3. Analyze index usage with EXPLAIN
4. Check for table locks or deadlocks
5. Suggest specific index improvements"
Step 5.5: Test Instructions on Past Incidents¶
Before deploying instructions broadly, test them:
I want to test this new RCA instruction on the database
incident from yesterday. Apply it to that session and rerun
the investigation.
Claude will:
1. Apply instruction to specific session
2. Rerun the investigation
3. Compare new RCA vs original
4. Help you decide if it improved the results
See Testing Instructions Guide for detailed workflow.
Step 5.6: Verify Instructions¶
Found 3 instructions:
1. Architecture Context (SYSTEM)
Status: Active
Created: 10 minutes ago
2. Priority Filter (FILTER)
Status: Active
Created: 5 minutes ago
3. Database Investigation Steps (RCA)
Status: Active
Created: 2 minutes ago
Optimize & Refine¶
Continuous improvement tips for getting the most from Hawkeye.
Monitor Investigation Quality¶
Review metrics:
- MTTR (Mean Time To Resolution)
- Time Saved vs manual investigation
- Investigation Quality Scores
- Noise Reduction from filtering
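If you track these numbers yourself, MTTR is the average resolution time across incidents, and time saved is the sum of the per-incident differences between manual and Hawkeye investigation time. The sketch below uses made-up sample values (only the 30-minute and 2-minute pair echoes the sample RCA earlier in this guide); how Hawkeye computes its own reported metrics is not described here.

```python
from statistics import mean


def mttr_minutes(resolution_minutes):
    """Mean Time To Resolution: average of per-incident resolution times."""
    return mean(resolution_minutes)


def total_time_saved(pairs):
    """Sum of (manual - automated) investigation minutes per incident."""
    return sum(manual - automated for manual, automated in pairs)


resolutions = [42, 18, 30]                   # hypothetical incidents, in minutes
investigations = [(30, 2), (45, 3), (20, 1)]  # (manual, Hawkeye) minutes

print(mttr_minutes(resolutions))
print(total_time_saved(investigations))
```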
Refine Instructions¶
Based on investigation results:
- Too many investigations? → Add/adjust FILTER instructions
- Missing context? → Update SYSTEM instructions
- Incomplete RCA? → Add specific RCA instructions
- Duplicate alerts? → Configure GROUPING instructions
Add More Connections¶
As your needs grow:
Create Environment-Specific Projects¶
Each project can have different:
- Connections (staging vs prod resources)
- Instructions (different investigation depth)
- Alert thresholds
Next Steps¶
- Daily Workflows: Common investigation patterns for day-to-day use
- Testing Instructions: Learn to validate and test instructions safely
- Advanced Workflows: Power user techniques and optimization
Summary Checklist¶
By the end of this guide, you should have:
- Installed and configured Hawkeye MCP
- Added cloud and monitoring connections (AWS, Datadog, etc.)
- Created your first project
- Set it as your default project
- Run a test investigation
- Investigated real alerts
- Configured investigation instructions
- Reviewed RCA and executed corrective actions
- Tested and refined instructions
Congratulations! You're now ready to use Hawkeye MCP for autonomous incident investigation.
Getting Help¶
- Inline Help: Ask Claude "How do I..." and use the guidance system
- Documentation: Browse other guides
- Examples: See real-world examples
- Support: Contact NeuBird for assistance