Using Instructions¶
Learn to create, test, and manage investigation instructions to guide Hawkeye's behavior.
What Are Instructions?¶
Instructions are guidelines you provide to Hawkeye that customize how it investigates incidents. They help Hawkeye understand your infrastructure, follow your investigation patterns, filter out noise, and focus on what matters for your organization.
Think of instructions as teaching Hawkeye about your specific environment and how you want it to work.
Instruction Types¶
Hawkeye supports four types of instructions, each serving a specific purpose:
FILTER Instructions¶
Purpose: Reduce noise by filtering out low-value alerts
When to use:
- You're getting too many minor alerts
- Certain alert sources are unreliable
- You want to focus on critical incidents only
Example:
Only investigate incidents with:
- Severity P1 or P2
- Affecting production environment
- Not from load testing systems
- Not during scheduled maintenance windows
Impact: Alerts that don't meet these criteria won't be investigated, reducing noise and focusing attention on important issues.
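Conceptually, a FILTER instruction behaves like a predicate over incoming alerts. The sketch below expresses the example criteria in Python; the alert fields (`severity`, `environment`, `source`, `in_maintenance_window`) are illustrative assumptions, not Hawkeye's actual alert schema:

```python
# Hypothetical alert shape; field names are illustrative, not Hawkeye's schema.
def should_investigate(alert: dict) -> bool:
    """Return True if the alert passes the example FILTER criteria above."""
    return (
        alert["severity"] in {"P1", "P2"}
        and alert["environment"] == "production"
        and alert["source"] != "load-testing"
        and not alert["in_maintenance_window"]
    )

alert = {"severity": "P1", "environment": "production",
         "source": "pagerduty", "in_maintenance_window": False}
print(should_investigate(alert))  # → True
```

An alert from staging, from a load-testing source, or during a maintenance window would return False and be skipped.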
SYSTEM Instructions¶
Purpose: Provide architectural context and infrastructure details
When to use:
- Hawkeye needs to understand your architecture
- Investigations lack context about your systems
- You want better root cause analysis
- You need service-specific guidance
Example:
Our infrastructure:
- Microservices architecture on AWS EKS
- api-service: Handles REST API requests
- payment-service: Processes payments via Stripe
- user-service: Authentication and user management
- notification-service: Email and SMS notifications
- worker-service: Background job processing
- Databases:
- PostgreSQL RDS (primary + 2 read replicas)
- Redis ElastiCache for session storage and caching
- Monitoring:
- Datadog APM for distributed tracing
- CloudWatch for AWS metrics
- PagerDuty for alerting
- Traffic patterns:
- Peak: 1000 req/sec between 2-6pm EST
- Normal: 200 req/sec outside peak hours
- Scheduled deployments: Tuesdays 10am EST
Impact: Hawkeye uses this context to make better inferences about root causes and dependencies.
GROUPING Instructions¶
Purpose: Group related alerts together to avoid duplicate investigations
When to use:
- Multiple alerts fire for the same underlying issue
- Cascading failures create alert storms
- You want to investigate grouped incidents once
Example:
Group incidents when:
- Same service and error type within 15 minutes
- Same root cause indicator (deployment, database outage)
- Cascading failures from single upstream service
- Auto-scaling events triggering multiple alerts
Impact: Related alerts are grouped into a single investigation, reducing duplicate work.
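One way to picture the grouping logic: alerts that share a key (same service and error type) within a time window collapse into a single investigation. This is a minimal sketch of that idea, not Hawkeye's internal grouping implementation; the field names and the 15-minute window come from the example above:

```python
from datetime import datetime, timedelta

# Illustrative only: field names and the 15-minute window mirror the
# example instruction, not Hawkeye's internal grouping algorithm.
WINDOW = timedelta(minutes=15)

def group_alerts(alerts):
    """Group alerts by (service, error_type) when they occur within WINDOW."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        for group in groups:
            head = group[0]
            if (head["service"] == alert["service"]
                    and head["error_type"] == alert["error_type"]
                    and alert["timestamp"] - head["timestamp"] <= WINDOW):
                group.append(alert)
                break
        else:
            groups.append([alert])
    return groups

alerts = [
    {"service": "api", "error_type": "timeout", "timestamp": datetime(2024, 1, 1, 14, 0)},
    {"service": "api", "error_type": "timeout", "timestamp": datetime(2024, 1, 1, 14, 5)},
    {"service": "api", "error_type": "timeout", "timestamp": datetime(2024, 1, 1, 15, 0)},
]
print(len(group_alerts(alerts)))  # → 2
```

The first two alerts fall inside one 15-minute window and merge; the third arrives an hour later and starts a new group.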
RCA Instructions¶
Purpose: Standardize RCA reports by defining required sections, language preferences, and how findings are presented across all investigations
When to use:
- You need standardized RCA report formatting for your organization
- Reports must be readable for both technical and non-technical stakeholders
- You want consistent documentation structure across all incidents
- Compliance or organizational standards require specific report sections
- You need RCAs formatted for specific audiences (leadership, regulators, etc.)
Example - Standardized RCA Report Format:
Prompt:
Create an RCA instruction for my Test Production project:
"All RCA reports must include these sections:
1. Executive Summary - A 2-3 sentence overview suitable for leadership
2. Impact Assessment - Include affected services, user impact percentage, and estimated revenue impact if applicable
3. Timeline - Use UTC timestamps and include detection time, first response, and resolution time
4. Root Cause - Explain in both technical and non-technical terms
5. Corrective Actions - Separate into 'Immediate' (within 24h) and 'Long-term' (within 30 days)
6. Prevention Measures - Include specific monitoring thresholds to add
Format the report in markdown with clear headers. Keep technical jargon to a minimum in the Executive Summary section."
What this instruction does:
- Ensures consistent report structure across all investigations
- Makes RCAs readable for both technical and non-technical stakeholders
- Provides clear action items with timelines
- Includes forward-looking prevention measures
Impact: All RCA reports follow your organization's documentation standards and include the sections required by your stakeholders.
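Because the instruction enumerates required sections, one quick way to see whether it took effect is to check a generated report against that checklist. A minimal sketch using the section names from the example (the report text here is hypothetical):

```python
# Section names taken from the example RCA instruction above.
REQUIRED_SECTIONS = [
    "Executive Summary", "Impact Assessment", "Timeline",
    "Root Cause", "Corrective Actions", "Prevention Measures",
]

def missing_sections(report_markdown: str) -> list:
    """Return the required section headers absent from an RCA report."""
    return [s for s in REQUIRED_SECTIONS if s not in report_markdown]

# Hypothetical partial report for illustration.
report = "## Executive Summary\n...\n## Timeline\n..."
print(missing_sections(report))
# → ['Impact Assessment', 'Root Cause', 'Corrective Actions', 'Prevention Measures']
```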
Why Test Instructions?¶
Problem: Bad instructions affect ALL future investigations.
Solution: Test instructions on past investigations before adding to your project.
Testing ensures your instruction:
- Actually improves the investigation quality
- Doesn't introduce false positives or noise
- Works with your actual data and alert patterns
- Produces actionable recommendations
graph LR
A[Write Instruction] --> B[Validate]
B --> C[Apply to Test Session]
C --> D[Rerun Investigation]
D --> E[Compare RCAs]
E -->|Better| F[Add to Project]
E -->|Worse| G[Refine & Retry]
Quick Start: Testing Workflow¶
5-Step Testing Process¶
1. Pick a past investigation as test case
2. Validate your instruction
3. Apply instruction to that session
4. Rerun the investigation
5. Compare new RCA with original
Only add to project if the new RCA is better!
Step-by-Step Testing Guide¶
Step 1: Choose a Test Session¶
Find a past investigation to test against:
Pick one that represents the type of incident your instruction targets.
Example:
Save the session_uuid from the results.
Step 2: Write Your Instruction¶
Draft the instruction content:
Example - API Latency Investigation:
For API latency or timeout incidents:
1. Check database query performance in slow query logs
2. Review connection pool metrics (active, idle, waiting)
3. Examine API endpoint traces in Datadog APM
4. Check for downstream service latency
5. Verify cache hit rates in Redis
6. Look for recent deployments or configuration changes
7. Analyze request rate patterns and traffic spikes
8. Provide specific optimization recommendations with commands
Step 3: Validate the Instruction¶
Validate this RCA instruction:
"For API latency or timeout incidents:
1. Check database query performance in slow query logs
2. Review connection pool metrics (active, idle, waiting)
3. Examine API endpoint traces in Datadog APM
4. Check for downstream service latency
5. Verify cache hit rates in Redis
6. Look for recent deployments or configuration changes
7. Analyze request rate patterns and traffic spikes
8. Provide specific optimization recommendations with commands"
Uses hawkeye_validate_instruction:
✓ Instruction validated successfully
Generated name: "API Latency Investigation Methodology"
Type: RCA
Refined content: [AI-improved version]
The instruction is ready to test.
Step 4: Apply to Test Session¶
Uses hawkeye_apply_session_instruction:
✓ Instruction applied to test session
The instruction has been added as a session-specific
override. It will ONLY affect this session when rerun.
Next step: Rerun the investigation to see the impact.
Step 5: Rerun the Investigation¶
Uses hawkeye_rerun_session:
🔍 Rerunning investigation...
⏳ Applying new instruction... (5s)
⏳ Re-analyzing data... (30s)
⏳ Generating new RCA... (15s)
✓ Investigation complete! (50s total)
Step 6: Compare Results¶
Uses hawkeye_get_rca:
Root Cause Analysis (UPDATED)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[NEW] Database Query Analysis:
Identified slow query in checkout endpoint:
Query: SELECT * FROM orders WHERE user_id = ?
Execution time: 850ms (vs 5ms baseline)
[NEW] Missing Index Identified:
CREATE INDEX CONCURRENTLY idx_orders_user_id
ON orders(user_id);
Estimated improvement: 850ms → 5ms (99.4% faster)
[NEW] Connection Pool Analysis:
- Current utilization: 95/100 connections
- Peak during incident: 100/100 (exhausted)
- Recommendation: Increase pool size to 150
[NEW] Cache Performance:
- Cache hit rate: 45% (baseline: 85%)
- Cache invalidation spike detected at incident start
- Related to recent deployment at 2:15pm
[NEW] Deployment Correlation:
Deployment of checkout-service v2.3.1 at 2:15pm
introduced N+1 query pattern in order lookup.
The new RCA includes specific actionable insights
that were missing in the original investigation!
Step 7: Decision Time¶
Compare with original RCA:
| Aspect | Original RCA | New RCA |
|---|---|---|
| Root cause identified | ✓ Generic | ✓ Specific |
| Specific queries shown | ✗ No | ✓ Yes |
| Index suggestion | ✗ Generic | ✓ Specific SQL |
| Performance estimate | ✗ No | ✓ Yes |
| Connection pool analysis | ✗ No | ✓ Yes |
| Cache analysis | ✗ No | ✓ Yes |
| Deployment correlation | ✗ No | ✓ Yes |
Verdict: New RCA is significantly better!
Step 8: Add to Project¶
Uses hawkeye_create_project_instruction:
✓ Created RCA instruction for Production project
Name: "API Latency Investigation Methodology"
Status: Active
All future API latency investigations will use this
instruction to provide detailed performance analysis.
Best Practices¶
Writing Effective Instructions¶
✅ Do:
- Be specific and actionable
- Reference actual tools and systems you use
- Include commands or queries when relevant
- Focus on outcomes (what to find, not how)
- Use clear, numbered steps for RCA instructions
- Include context about your environment
- Test before deploying
❌ Don't:
- Write vague instructions ("check performance")
- Reference tools you don't have ("check EXPLAIN output" if query plans aren't logged)
- Create instructions for rare edge cases
- Make instructions too long (keep under 300 words)
- Add instructions without testing
- Duplicate information across instructions
Testing Strategies¶
Test multiple scenarios:
1. Test on the incident type you're targeting
2. Test on a related but different incident type
3. Test on an unrelated incident type
This ensures your instruction helps the right cases and doesn't hurt others.
Iterate based on results:
- First version too broad? Make it more specific
- Missing important checks? Add more steps
- Too prescriptive? Make it more flexible
- Not improving results? Refine or discard
Managing Instructions¶
Review regularly:
Disable underperforming instructions:
Update instructions as your system evolves:
- New services added? Update SYSTEM instructions
- New monitoring tools? Update RCA instructions
- Alert patterns changed? Update FILTER instructions
Common Instruction Patterns¶
Pattern 1: Service-Specific Investigation¶
For incidents affecting the payment-service:
1. Check Stripe API response times and error rates
2. Review payment transaction logs for failures
3. Examine database connection pool for payment DB
4. Check for PCI compliance logging issues
5. Verify webhook delivery status
6. Look for rate limiting from Stripe
Pattern 2: Time-Based Context¶
SYSTEM instruction:
Scheduled maintenance:
- Database backups: Daily 2-3am EST
- Deployment windows: Tuesday/Thursday 10am EST
- Cache warmup: After each deployment (5-10 min)
- Traffic patterns: Peak 2-6pm EST weekdays
Consider timing when analyzing incident causes.
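Timing context like this lends itself to a simple check: does an incident fall inside a known maintenance or traffic window? A sketch using the example windows above, expressed as naive EST times (the window names and incident field are illustrative assumptions):

```python
from datetime import datetime, time

# Windows from the example SYSTEM instruction above, in EST.
# Names are illustrative; a real check would also handle time zones and weekdays.
WINDOWS = {
    "database_backup": (time(2, 0), time(3, 0)),   # daily 2-3am EST
    "traffic_peak": (time(14, 0), time(18, 0)),    # weekdays 2-6pm EST
}

def active_windows(incident_time: datetime) -> list:
    """Return the named windows an incident timestamp falls inside."""
    t = incident_time.time()
    return [name for name, (start, end) in WINDOWS.items() if start <= t < end]

print(active_windows(datetime(2024, 1, 2, 2, 30)))  # → ['database_backup']
```

An incident at 2:30am overlapping the backup window is a strong hint to check backup-related load before anything else.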
Pattern 3: Cascading Failure Detection¶
GROUPING instruction:
Group incidents when they occur within 15 minutes and:
- Multiple services report connection timeouts
- Database or Redis metrics show anomalies
- Gateway/load balancer shows elevated errors
- Upstream service degradation detected
Likely cascading failure from shared dependency.
Pattern 4: Noise Reduction¶
FILTER instruction:
Do NOT investigate:
- Alerts from dev/staging environments
- Load test results from automated testing
- Synthetic monitoring checks (exclude alert tag: synthetic)
- Auto-scaling events (normal operational behavior)
- Alerts during maintenance windows (2-3am EST)
Troubleshooting Instructions¶
Instruction Not Improving RCA¶
Possible causes:
1. Instruction too vague
2. Data sources don't support the checks
3. Instruction targets wrong incident type
4. System context missing
Solution: Refine and test again with more specific guidance.
Instruction Causing Worse Results¶
Possible causes:
1. Too prescriptive (limiting Hawkeye's analysis)
2. Incorrect assumptions about infrastructure
3. Conflicts with other instructions
Solution: Disable, refine, or remove the instruction.
Instruction Not Being Applied¶
Check:
1. Instruction is active (not disabled)
2. Instruction type matches use case
3. No conflicting instructions
4. Project has the instruction
Next Steps¶
- Run Investigations: Apply your instructions to real incidents
- Manage Connections: Connect data sources for investigations
- Advanced Workflows: Power user techniques