Running Investigations

Master the art of AI-powered incident investigation with Hawkeye MCP. This guide covers both alert-based and manual investigations.

Quick Reference

graph LR
    A[List Uninvestigated] --> B[Investigate Alert]
    B --> C[Monitor Progress]
    C --> D[Get RCA]
    D --> E[Implement Actions]
    E --> F[Follow-up Questions]
    F --> G[Close Incident]

Finding Alerts to Investigate

List Uninvestigated Alerts

Ask Claude:

Show me uninvestigated alerts from the last 24 hours

Uses hawkeye_list_sessions with only_uninvestigated=true.

Filtering options:

Show me uninvestigated P1 alerts from the last 7 days
Find uninvestigated database alerts
Show me alerts from api-service in the last hour

After listing investigations, you can request a direct link to view them in the Hawkeye web interface:

By list number:

Link for #2

By session ID:

Link for sessionID abc-123-def-456

This provides a URL to open the investigation directly in your browser, useful for:

  • Sharing investigations with team members
  • Viewing investigations in the full web interface
  • Bookmarking specific investigations
  • Including in incident reports or documentation

Understanding Alert Information

Each alert shows:

| Field | Description |
| --- | --- |
| Alert ID | Unique identifier for investigation |
| Title | Incident description |
| Severity | P1 (critical) to P4 (low) |
| Timestamp | When alert fired |
| Source | Monitoring tool (CloudWatch, Datadog, etc.) |

Important: The alert_id (shown as incident_info.id) is what you use to start an investigation.

Starting an Investigation

Basic Investigation

Investigate alert ID: /subscriptions/.../alerts/cpu-spike-123

Or simply:

Investigate the first alert

Claude will:

  1. Extract the alert_id
  2. Call hawkeye_investigate_alert
  3. Monitor progress using hawkeye_get_investigation_status
  4. Retrieve the RCA when complete
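
If you are wiring Hawkeye MCP into your own tooling rather than chatting with an agent, the same flow can be driven directly over MCP. The sketch below uses the official MCP Python SDK; the hawkeye-mcp launch command and the alert_id/session_uuid argument names are illustrative assumptions, while the tool names and only_uninvestigated come from this guide.

# Minimal sketch of the alert-based flow over MCP (Python SDK).
# Assumptions: the server launch command and every argument name except
# only_uninvestigated are guesses for illustration, not the documented API.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def run_alert_investigation() -> None:
    params = StdioServerParameters(command="hawkeye-mcp")  # hypothetical command
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # 1. List uninvestigated alerts (documented parameter).
            alerts = await session.call_tool(
                "hawkeye_list_sessions", arguments={"only_uninvestigated": True}
            )
            print(alerts)

            # 2. Start an investigation (alert_id key is an assumption).
            await session.call_tool(
                "hawkeye_investigate_alert",
                arguments={"alert_id": "<incident_info.id from the listing>"},
            )

            # 3. Poll status, then 4. fetch the RCA (session_uuid key is an assumption).
            status = await session.call_tool(
                "hawkeye_get_investigation_status", arguments={"session_uuid": "<uuid>"}
            )
            rca = await session.call_tool(
                "hawkeye_get_rca", arguments={"session_uuid": "<uuid>"}
            )
            print(status, rca)


asyncio.run(run_alert_investigation())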

Manual Investigations

Create investigations from custom prompts without needing an existing alert.

When to Use Manual Investigations

  • Proactive analysis - Investigate potential issues before they become alerts
  • Historical research - Analyze past incidents not captured as alerts
  • What-if scenarios - Test hypothetical situations
  • Training - Create example investigations for documentation
  • Testing - Validate system behavior without waiting for real alerts

Creating a Manual Investigation

Good prompt (specific):

Investigate high latency in payment-service between 2pm-3pm EST on Jan 15, 2025.
Users reported checkout taking 30+ seconds instead of the usual 2 seconds.

Bad prompt (too vague):

Something went wrong with the payment service

Claude will use hawkeye_create_manual_investigation:

✓ Created manual investigation
Session UUID: abc-123-def-456
Status: Running

Investigation will complete in 2-5 minutes.
Use hawkeye_get_rca to retrieve results.

Prompt Quality Guidelines

✅ Include these details:

  1. Service/Resource names - Which components are involved?
  2. Time frame - When did it occur, or what period should be analyzed?
  3. Symptoms - What behavior are you seeing?
  4. Context - Any additional relevant information?

❌ Avoid:

  • Vague descriptions ("something broke")
  • No timeframe ("recently", "earlier")
  • Missing service names ("the API", "the database")

Example prompts:

Investigate memory leak in user-api pods in production namespace.
Started around 8am UTC today, memory usage climbing from 500MB to 2GB.

Analyze database connection pool exhaustion in postgres-primary.
Between 1pm-2pm EST yesterday. Connection count hit max of 100.

Check for cascading failures in microservices during deployment.
January 10, 2025 at 3:15pm UTC. Multiple services returned 503 errors.

Agent Behavior

If your prompt is too vague, Claude will ask clarifying questions:

You: Investigate the payment issue

Claude: I need more details to create an effective investigation:
- Which payment service? (payment-api, payment-processor, etc.)
- When did this occur? (specific date/time or time range)
- What symptoms did you observe? (errors, slow response, failures?)
- Any other context? (deployment, traffic spike, etc.)

Workflow

Complete manual investigation workflow:

1. Create investigation
   Investigate API latency spike in checkout-service on Jan 15 between 2-3pm EST

2. Wait for completion (2-5 minutes)
   Check status with: What's the status of this investigation?

3. Get RCA
   Show me the root cause analysis

4. Ask follow-ups
   What caused the latency spike?
   Were there any related infrastructure changes?

Parameters

The tool accepts:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | required | Investigation description (min 10 chars) |
| project_uuid | string | optional | Uses default project if not specified |
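
For reference, here is a minimal Python sketch of this call, assuming an already-connected MCP ClientSession (see the connection example earlier). The parameter names are the documented ones; the UUID value is a placeholder.

# Sketch: creating a manual investigation through an existing MCP session.
from mcp import ClientSession


async def create_manual_investigation(session: ClientSession) -> None:
    prompt = (
        "Investigate high latency in payment-service between 2pm-3pm EST on "
        "Jan 15, 2025. Checkout took 30+ seconds instead of the usual 2 seconds."
    )
    assert len(prompt) >= 10  # documented minimum prompt length

    result = await session.call_tool(
        "hawkeye_create_manual_investigation",
        arguments={
            "prompt": prompt,
            # Optional; omit to fall back to your default project.
            "project_uuid": "00000000-0000-0000-0000-000000000000",  # placeholder
        },
    )
    print(result)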

Using Default Project

If you've set a default project with hawkeye_set_default_project, you don't need to specify project_uuid. Manual investigations will automatically use your default project.

Comparison with Alert-Based Investigations

| Feature | Alert-Based | Manual |
| --- | --- | --- |
| Trigger | Existing alert | Custom prompt |
| Use Case | React to alerts | Proactive analysis |
| Alert ID | Required | Not needed |
| Prompt | Auto-generated | You provide |
| Flexibility | Limited to alert data | Any scenario |

Understanding the RCA

RCA Structure

Every RCA includes:

1. Incident Summary

  • What happened
  • When it happened
  • Current status

2. Timeline

  • Chronological event sequence
  • Key state changes
  • System behaviors

3. Root Cause

  • Why it happened
  • Contributing factors
  • Technical details

4. Corrective Actions

  • Immediate fixes (often auto-executed)
  • Manual actions needed
  • Ready-to-execute bash scripts

5. Time Savings

  • Manual investigation time estimate
  • Actual Hawkeye time
  • Time saved

Asking Follow-Up Questions

Common Follow-Ups

Understanding the cause:

Why did this happen?
Has this happened before?
What changed recently?

Investigation details:

What data sources were checked?
Show me the logs that were analyzed
What queries were run?

Prevention:

How can we prevent this?
What monitoring should we add?
What tests would catch this?

Similar incidents:

Have we seen similar issues?
Is this related to recent deployments?
Are other services affected?

Follow-Up Workflow

# After getting RCA
Show me the chain of thought for this investigation

Uses hawkeye_get_chain_of_thought to show reasoning steps.

What data sources were consulted?

Uses hawkeye_get_investigation_sources.

What queries were executed?

Uses hawkeye_get_investigation_queries.

What are suggested follow-up questions?

Uses hawkeye_get_follow_up_suggestions.
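
To pull all of this detail in one pass programmatically, a small sketch follows; the tool names come from this guide, while the session_uuid argument name is an assumption.

# Sketch: collecting follow-up detail for a completed investigation.
from mcp import ClientSession

FOLLOW_UP_TOOLS = [
    "hawkeye_get_chain_of_thought",       # reasoning steps
    "hawkeye_get_investigation_sources",  # data sources consulted
    "hawkeye_get_investigation_queries",  # queries executed
    "hawkeye_get_follow_up_suggestions",  # suggested next questions
]


async def dump_follow_up_detail(session: ClientSession, session_uuid: str) -> None:
    for tool in FOLLOW_UP_TOOLS:
        result = await session.call_tool(tool, arguments={"session_uuid": session_uuid})
        print(f"--- {tool} ---")
        print(result)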

Implementing Corrective Actions

One of the most powerful features of Hawkeye MCP is that you can ask your AI coding agent to directly implement the recommended fixes. Your agent has full context of both your codebase and Hawkeye's analysis.

The Power of MCP Integration

Because Hawkeye runs as an MCP server, your coding agent (Claude Code, Claude Desktop, Cursor, etc.) can:

  1. Get the RCA with detailed corrective actions
  2. Understand your codebase through file access
  3. Implement fixes directly in your code
  4. Execute commands through its shell access
  5. Verify the fix by checking logs and metrics

This creates a seamless incident response workflow where investigation and remediation happen in the same conversation.

Workflow: From Investigation to Implementation

Step 1: Get the RCA

Show me the RCA for this investigation

Step 2: Ask your agent to implement the fix

Please implement the corrective actions from this RCA.
Start with the highest priority items.

Step 3: Your agent will:

  • Read the relevant code files
  • Make the necessary changes
  • Run tests to verify
  • Execute deployment commands (if appropriate)
  • Confirm the changes

Example: Implementing a Database Index

Investigation finds: Missing index causing slow queries

RCA provides:

CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);

You ask your agent:

Please add this database index. Use a migration file
following our project's migration pattern.

Your agent will:

  1. Check existing migration files to understand the pattern
  2. Create a new migration file (e.g., 20250124_add_orders_user_id_index.sql), as sketched below
  3. Add the CREATE INDEX statement with proper syntax
  4. Add a corresponding DROP INDEX for rollback
  5. Update the migration tracking
  6. Suggest running the migration command
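
If your project happens to use Alembic for migrations, the generated file might look roughly like this sketch. The tooling choice, revision identifiers, and rollback details are assumptions; CONCURRENTLY must run outside a transaction, hence the autocommit block.

# Sketch of an Alembic migration for the suggested index (assumed tooling).
from alembic import op

revision = "20250124_add_orders_user_id_index"
down_revision = "<previous revision id>"  # placeholder


def upgrade() -> None:
    # CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
    with op.get_context().autocommit_block():
        op.create_index(
            "idx_orders_user_id",
            "orders",
            ["user_id"],
            postgresql_concurrently=True,
        )


def downgrade() -> None:
    # Matching rollback, as the agent would add.
    with op.get_context().autocommit_block():
        op.drop_index(
            "idx_orders_user_id",
            table_name="orders",
            postgresql_concurrently=True,
        )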

Example: Fixing a Configuration Issue

Investigation finds: Connection pool too small for traffic

RCA provides:

kubectl set env deployment/api-service MAX_CONNECTIONS=200

You ask your agent:

Please update the MAX_CONNECTIONS configuration to 200.
Update both the deployment manifest and our docs.

Your agent will:

  1. Find the deployment YAML file
  2. Update the environment variable
  3. Update documentation to reflect the change
  4. Suggest applying with kubectl apply -f deployment.yaml
  5. Offer to add monitoring for connection pool usage

Example: Implementing a Code Fix

Investigation finds: Memory leak from unclosed database connections

RCA explains: Connections not being released in error paths

You ask your agent:

Please fix the connection leak in the payment service.
Make sure all code paths properly close connections.

Your agent will:

  1. Read the payment service code
  2. Identify all database connection usage
  3. Add proper try/finally blocks or context managers (illustrated below)
  4. Ensure connections are closed in error paths
  5. Add tests to verify the fix
  6. Run existing tests to ensure no regressions
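
As a concrete (and deliberately generic) illustration of the change involved, here is a minimal before/after sketch; the psycopg2 driver, DSN, and query are stand-ins, not the payment service's actual code.

# Hypothetical illustration of the leak and the fix; driver and query are placeholders.
from contextlib import closing

import psycopg2


def fetch_order_amount_leaky(dsn: str, order_id: int):
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    cur.execute("SELECT amount FROM orders WHERE id = %s", (order_id,))
    row = cur.fetchone()
    # If execute() or fetchone() raises, conn is never closed -- the leak.
    conn.close()
    return row


def fetch_order_amount_fixed(dsn: str, order_id: int):
    # closing() guarantees conn.close() on every code path, including errors.
    with closing(psycopg2.connect(dsn)) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT amount FROM orders WHERE id = %s", (order_id,))
            return cur.fetchone()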

Preventive Measures

After fixing the immediate issue, implement preventive measures:

Please implement the preventive measures from the RCA:
1. Add monitoring for connection pool usage
2. Add alerts for connection pool >80% full
3. Update our runbook with this scenario

Best Practices for Agent-Driven Remediation

✅ Do:

  • Review the RCA thoroughly before asking for implementation
  • Be specific about your project's patterns and conventions
  • Ask the agent to run tests after making changes
  • Request documentation updates alongside code changes
  • Have the agent explain what it's doing
  • Use staging environments for testing changes

❌ Don't:

  • Blindly apply changes to production without review
  • Skip testing steps
  • Implement changes you don't understand
  • Ignore the agent's warnings or questions
  • Rush through validation steps

Manual Review Required

Some actions require human judgment and shouldn't be fully automated:

Deployment decisions:

  • Rolling back to previous versions
  • Scaling production resources
  • Database schema changes
  • Infrastructure modifications

For these, ask your agent to:

Please prepare the rollback command but don't execute it yet.
Show me what will happen and wait for my approval.

Bash Scripts in RCA

RCAs include ready-to-run commands for immediate fixes:

# Example: Add database index
psql -h prod-db.rds.amazonaws.com -U admin -d production

CREATE INDEX CONCURRENTLY idx_orders_user_id
  ON orders(user_id);

You can execute these directly or ask your agent to integrate them into your workflow:

Please add this as a migration following our standard process.

Validation Checklist

Before executing any fix, verify:

  • Commands are safe (no destructive operations)
  • Correct environment (prod vs staging)
  • Proper permissions and access
  • Backup/rollback plan in place
  • Tests pass after changes
  • Changes follow project conventions
  • Documentation is updated

Advanced Investigation Techniques

Filtering by Severity

Show me only P1 and P2 uninvestigated alerts

Filtering by Date Range

Show me uninvestigated alerts from January 1-15

Searching by Keyword

Find alerts related to "database timeout"

Uses search_term parameter in hawkeye_list_sessions.

Compact Mode

For large result sets:

Show me all uninvestigated alerts in compact format

Returns minimal details for faster browsing.

Investigation Status Tracking

Check ongoing investigations:

Show me the status of session abc-123-def-456

Uses hawkeye_get_investigation_status.

Real-Time Progress Monitoring

Hawkeye provides live streaming updates during investigations, giving you visibility into what's happening at each step.

Understanding Progress Updates

When an investigation is running, hawkeye_get_investigation_status provides:

Current Progress:

  • progress_percentage: 0-100, indicating completion based on completed steps
  • current_step: Human-readable description of what's happening now
  • status: Investigation state (in_progress, completed, failed)

Investigation Breakdown:

  • total_steps: Estimated number of investigation steps
  • completed_steps: How many steps have finished
  • unique_sources: All data sources consulted (logs, metrics, alarms)

Step Details: Each step includes:

  • step_id: Unique identifier for detailed lookups
  • step_number: Sequential step number
  • description: What question is being answered
  • category: Type of investigation step (see below)
  • status: Step state (in_progress, completed, possible_question)
  • sources_consulted: Which data sources contributed to this step

Investigation Step Categories

Each step is categorized by its purpose:

| Category | Purpose | Example |
| --- | --- | --- |
| discovery | Initial fact-finding and data gathering | "What was the system.load.1 metric value?" |
| analysis | Examining patterns and correlations | "Analyzing correlation between CPU and memory" |
| diagnosis | Root cause identification | "Identifying why the process failed" |
| remediation | Solution recommendations | "Suggesting corrective actions" |
| validation | Verification of findings | "Confirming the root cause hypothesis" |

Monitoring Recommendations

Polling frequency:

  • Check status every 10-15 seconds for updates
  • More frequent polling won't speed up the investigation
  • Less frequent polling means you might miss interim progress

What to watch:

Show me the investigation status

# Look for:
# - progress_percentage: How far along (0-100)
# - current_step: What's happening right now
# - unique_sources: Which data sources have been checked
# - completed_steps / total_steps: Step progress
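
A simple polling loop built around those documented fields might look like this sketch; get_status stands in for however your client calls hawkeye_get_investigation_status and parses the result into a dict.

# Sketch: poll every 10-15 seconds until the investigation finishes.
import time
from typing import Callable, Dict


def wait_for_completion(get_status: Callable[[], Dict], poll_seconds: int = 15) -> Dict:
    while True:
        status = get_status()
        summary = status.get("investigation_summary", {})
        print(
            f"{status.get('progress_percentage', 0):>3}% | "
            f"{summary.get('completed_steps', 0)}/{summary.get('total_steps', '?')} steps | "
            f"{status.get('current_step', '')}"
        )
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)  # faster polling won't speed up the investigation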

Data Source Attribution

The unique_sources field shows exactly which monitoring systems contributed to the investigation:

Example sources:

unique_sources: [
  "log_datadog.datadog_logs",
  "monitor_datadog.monitor_events",
  "alarm_aws_poc_sandbox.alarm_history",
  "log_aws_poc_sandbox.log_aws_containerinsights_eks_htm_prod",
  "metric_aws_poc_sandbox.cloudwatch_metrics"
]

Why this matters:

  • Verify comprehensive coverage - Ensure all relevant systems were checked
  • Understand thoroughness - More sources means a more thorough analysis
  • Identify gaps - Missing expected sources may indicate sync issues
  • Debug investigations - See exactly what data was available
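
A quick coverage check along those lines, as a sketch; the expected source names here are just the examples shown above, so adjust them to your own connections.

# Sketch: flag expected data sources that never appear in unique_sources.
EXPECTED_SOURCES = {
    "log_datadog.datadog_logs",
    "metric_aws_poc_sandbox.cloudwatch_metrics",
}


def missing_sources(unique_sources: list[str]) -> set[str]:
    return EXPECTED_SOURCES - set(unique_sources)


consulted = [
    "log_datadog.datadog_logs",
    "monitor_datadog.monitor_events",
    "alarm_aws_poc_sandbox.alarm_history",
]
print(missing_sources(consulted))  # {'metric_aws_poc_sandbox.cloudwatch_metrics'}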

Progress Example

Starting an investigation:

{
  "session_uuid": "abc123...",
  "status": "in_progress",
  "progress_percentage": 0,
  "current_step": "Initializing investigation...",
  "investigation_summary": {
    "total_steps": 5,
    "completed_steps": 0,
    "total_sources_consulted": 0,
    "unique_sources": []
  }
}

Mid-investigation:

{
  "session_uuid": "abc123...",
  "status": "in_progress",
  "progress_percentage": 45,
  "current_step": "🔍 Discovery: Analyzing system.load.1 metrics across production hosts",
  "investigation_summary": {
    "total_steps": 5,
    "completed_steps": 2,
    "total_sources_consulted": 8,
    "unique_sources": [
      "log_datadog.datadog_logs",
      "monitor_datadog.monitor_events",
      "alarm_aws_poc_sandbox.alarm_history"
    ],
    "steps": [
      {
        "step_id": "69247c6b517e7056d602abd1",
        "step_number": 1,
        "description": "What was the system.load.1 metric value and which hosts experienced load average above 2.0?",
        "category": "discovery",
        "status": "completed",
        "sources_consulted": [
          "monitor_datadog.monitor_events",
          "log_datadog.datadog_logs"
        ]
      },
      {
        "step_id": "69247c6b517e7056d602abd2",
        "step_number": 2,
        "description": "Analyzing CPU and memory usage patterns during the incident",
        "category": "analysis",
        "status": "in_progress",
        "sources_consulted": [
          "alarm_aws_poc_sandbox.alarm_history"
        ]
      }
    ]
  }
}

Typical Investigation Flow

Phase 1: Discovery (0-30%)

  • Gathering initial facts
  • Identifying affected resources
  • Collecting metrics and logs
  • Timeline construction

Phase 2: Analysis (30-60%)

  • Pattern detection
  • Correlation analysis
  • Comparing with baselines
  • Examining related events

Phase 3: Diagnosis (60-85%)

  • Root cause identification
  • Validating hypotheses
  • Determining contributing factors
  • Impact assessment

Phase 4: Remediation (85-100%)

  • Generating corrective actions
  • Creating preventive measures
  • Calculating time savings
  • Finalizing RCA report

Common Progress Patterns

Fast start, then slower:

0% → 40% (first minute) → 60% (next 2 minutes) → 100%
Initial discovery is quick, detailed analysis takes longer.

Steady progress:

0% → 20% → 40% → 60% → 80% → 100% (evenly paced)
Indicates straightforward incident with clear data.

Stuck at percentage:

0% → 45% (stays here for 1-2 minutes) → 100%
Normal for complex data queries or correlation analysis.

Investigation Quality

Quality Scoring

Every investigation gets scored on:

  • Accuracy (Root cause correctness, impact analysis)
  • Completeness (Data coverage, remediation steps)

Check quality:

Show me the quality score for this investigation

Uses hawkeye_get_rca_score.

Good scores: 85-100 (Excellent)
Acceptable: 70-84 (Good)
Needs improvement: below 70 (Add more instructions)
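
If you track scores programmatically, the thresholds above reduce to a simple bucket, sketched here.

# Sketch: bucket an RCA quality score using the thresholds above.
def score_bucket(score: int) -> str:
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    return "Needs improvement - add more instructions"


print(score_bucket(92))  # Excellent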

Improving Quality

Add context instructions:

Create a SYSTEM instruction about our architecture

Add investigation steps:

Create an RCA instruction for database issues

Provide feedback: If an RCA is incorrect, add an instruction to guide future investigations.

Common Investigation Patterns

Pattern 1: API Latency

Symptoms: Slow API responses, timeouts

Investigation approach:

  1. Check APM traces (Datadog, New Relic)
  2. Review database query performance
  3. Check external API dependencies
  4. Examine cache hit rates
  5. Review resource utilization

Common causes:

  • Missing database indexes
  • Inefficient queries
  • External API slowdowns
  • Cache misses
  • Insufficient resources

Pattern 2: Memory Leaks

Symptoms: Increasing memory, eventual OOM crash

Investigation approach:

  1. Review memory growth timeline
  2. Check for unclosed connections
  3. Look for accumulating caches
  4. Examine heap dumps
  5. Review recent code changes

Common causes:

  • Unclosed database connections
  • Event listener leaks
  • Growing in-memory caches
  • Circular references

Pattern 3: Database Issues

Symptoms: Slow queries, connection timeouts

Investigation approach:

  1. Check slow query logs
  2. Review connection pool metrics
  3. Examine table locks
  4. Check index usage
  5. Review database resource metrics

Common causes:

  • Missing indexes
  • Lock contention
  • Connection pool exhaustion
  • Long-running transactions
  • Inefficient queries

Pattern 4: Deployment Failures

Symptoms: Service crashes after deployment

Investigation approach:

  1. Compare deployment manifests
  2. Review container startup logs
  3. Check environment variables
  4. Examine health check failures
  5. Review resource limits

Common causes:

  • Configuration errors
  • Missing environment variables
  • Insufficient resources
  • Database migration failures
  • Incompatible dependencies

Workflow Tips

Morning Review

# Check overnight incidents
Show me uninvestigated alerts from the last 12 hours

# Investigate critical ones
Investigate all P1 alerts

Incident Response

# During active incident
Show me alerts from the last 30 minutes

# Quick investigation
Investigate this alert and wait for results

# Get actionable fixes
Show me the corrective actions

Post-Mortem

# Review incident
Show me the complete RCA for session abc-123

# Understand what happened
Show me the timeline and chain of thought

# Prevent recurrence
What preventive measures are recommended?

Troubleshooting Investigations

Investigation Taking Too Long

Normal duration: 30-90 seconds

If longer:

  • First investigation may take 5-10 minutes (syncing connections)
  • Complex incidents with many data sources take longer
  • Check connection sync status

Incomplete RCA

If the RCA lacks details:

  1. Add more SYSTEM instructions with context
  2. Add RCA instructions with investigation steps
  3. Ensure connections are properly synced
  4. Check data source availability

Incorrect Root Cause

If the RCA misidentifies the cause:

  1. Use follow-up questions to guide the investigation
  2. Add an instruction with the correct approach
  3. Test the instruction on a past session
  4. Add it to the project if results improve

Next Steps