Alerts & Troubleshooting
What each Slack alert means and decision trees for common issues
What This Does
The platform sends alerts to Slack when things go wrong. Alerts are routed by severity:
- Critical → `#maverick-alerts-critical` (immediate action needed)
- Warning → `#maverick-alerts` (investigate when possible)
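
The severity-to-channel mapping above can be illustrated with a minimal sketch. The channel names are taken from the list; the function name `channel_for` and the dictionary are hypothetical and are not the platform's actual routing configuration.

```python
# Channel names come from the routing list above.
# SEVERITY_CHANNELS and channel_for are hypothetical names for illustration.
SEVERITY_CHANNELS = {
    "critical": "#maverick-alerts-critical",
    "warning": "#maverick-alerts",
}


def channel_for(severity: str) -> str:
    """Return the Slack channel for an alert severity (case-insensitive)."""
    key = severity.strip().lower()
    if key not in SEVERITY_CHANNELS:
        raise ValueError(f"unknown severity: {severity!r}")
    return SEVERITY_CHANNELS[key]
```

For example, `channel_for("Critical")` returns `#maverick-alerts-critical`, and an unrecognised severity raises `ValueError` rather than silently dropping the alert.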
Alert Reference
API Health
| Alert | Severity | Trigger | What To Do |
|---|---|---|---|
| `APIInstanceDown` | Critical | Prometheus can't reach API for >2 min | Check Status Page → Core Services. If down, SSH to the server and run `docker compose -f docker-compose.production.yml restart api` |
| `APIHighErrorRateWarning` | Warning | 5xx rate >1% over 5 min | Check Status Page → Logs → api for error patterns |
| `APIHighErrorRateCritical` | Critical | 5xx rate >5% over 3 min | Likely a code bug or DB issue. Check Logs → api for stack traces. May need rollback |
| `APIHighLatencyWarning` | Warning | P95 latency >1s | Check database latency on Status Page → Database module |
| `APIHighLatencyCritical` | Critical | P95 latency >3s | Database or Redis likely overloaded. Check System Resources for CPU/memory |
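
To make the warning and critical bands in the table concrete, here is a small sketch that classifies a measured value against the thresholds listed above. The threshold values come from the table; the `classify` helper and constant names are hypothetical, and the sketch ignores the evaluation windows (5 min, 3 min) that the real alert rules apply.

```python
from typing import Optional

# Thresholds copied from the API Health table.
ERROR_RATE_WARNING = 0.01   # 5xx rate >1%
ERROR_RATE_CRITICAL = 0.05  # 5xx rate >5%
LATENCY_WARNING_S = 1.0     # P95 latency >1s
LATENCY_CRITICAL_S = 3.0    # P95 latency >3s


def classify(value: float, warning: float, critical: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a value against two thresholds.

    Comparisons are strict, matching the '>' used in the table.
    """
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return None


if __name__ == "__main__":
    # A 2% 5xx rate falls in the warning band; a 3.5s P95 is critical.
    print(classify(0.02, ERROR_RATE_WARNING, ERROR_RATE_CRITICAL))
    print(classify(3.5, LATENCY_WARNING_S, LATENCY_CRITICAL_S))
```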
Celery Workers and Queues
| Alert | Severity | Trigger | What To Do |
|---|---|---|---|
| `EmailEventsQueueDepthHigh` | Warning | `email_events` queue >200 | Webhook backlog building. Check if the `email_bison` worker is running |
| `EmailEventsQueueDepthCritical` | Critical | `email_events` queue >1000 | Worker likely crashed. Restart: `docker restart celery-email-bison` |
| `VerificationQueueDepthHigh` | Warning | verification queue >5000 | Large batch submitted. Normal: takes ~2h to clear at 3K/hour |
| `ScrapingQueueDepthHigh` | Warning | scraping queue >10 | Multiple scraping jobs queued. Only 1 runs at a time (browser automation). Queue will drain slowly |
| `CeleryHighFailureRateWarning` | Warning | >10% task failures | Check Logs for failing task patterns |
| `CeleryHighFailureRateCritical` | Critical | >25% task failures | Systemic issue. Check DB connectivity, Redis, external APIs |
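
The "~2h to clear at 3K/hour" estimate for the verification queue is simple division, sketched below. The 3,000 items/hour rate comes from the table; the function name `hours_to_drain` is hypothetical, and the estimate assumes constant throughput with no new items arriving while the queue drains.

```python
def hours_to_drain(queue_depth: int, rate_per_hour: float = 3000.0) -> float:
    """Estimate hours needed to clear a queue at a constant processing rate.

    Assumes nothing new is enqueued while draining (a simplification).
    """
    if queue_depth < 0:
        raise ValueError("queue_depth must be non-negative")
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return queue_depth / rate_per_hour


if __name__ == "__main__":
    # At the 5,000-item warning threshold: about 1.7 hours.
    print(round(hours_to_drain(5000), 1))
    # A 15,000-item backlog at the same rate: 5.0 hours.
    print(hours_to_drain(15000))
```

If the measured depth is not shrinking roughly in line with this estimate, that suggests the worker has slowed or new batches are still arriving.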
System Resources
| Alert | Severity | Trigger | What To Do |
|---|---|---|---|
| `DiskSpaceWarning` | Warning | Disk >70% full | Check Docker images and logs consuming space. Run `docker system prune` on the server |
| `DiskSpaceCritical` | Critical | Disk >85% full | Urgent: clean up immediately or services will crash |
| `HighMemoryUsageWarning` | Warning | RAM >80% | Check which containers are using the most memory: `docker stats` |
| `HighCPULoad` | Warning | 5-min load avg >6 | Usually a scraping or verification spike. Should resolve on its own |
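
For a quick manual check of where a server sits relative to the disk thresholds, a sketch along these lines may help. It uses Python's standard `shutil.disk_usage`; the 70% and 85% thresholds come from the table, while the function names are hypothetical. The percentage may differ slightly from what `df` reports because of filesystem reserved blocks, and it is not necessarily the same calculation the alert rule uses.

```python
import shutil
from typing import Optional, Tuple

# Thresholds copied from the System Resources table.
DISK_WARNING_PCT = 70.0
DISK_CRITICAL_PCT = 85.0


def severity_for_pct(used_pct: float) -> Optional[str]:
    """Map a disk-used percentage to 'critical', 'warning', or None."""
    if used_pct > DISK_CRITICAL_PCT:
        return "critical"
    if used_pct > DISK_WARNING_PCT:
        return "warning"
    return None


def check_disk(path: str = "/") -> Tuple[float, Optional[str]]:
    """Return (percent used, severity) for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    used_pct = usage.used / usage.total * 100
    return used_pct, severity_for_pct(used_pct)


if __name__ == "__main__":
    pct, severity = check_disk("/")
    print(f"disk {pct:.1f}% used, severity={severity or 'ok'}")
```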
Pipeline SLOs
| Alert | Severity | Trigger | What To Do |
|---|---|---|---|
| `PipelineSuccessRateSLOBreach` | Warning | Pipeline success rate <95% over 1h | Check which pipeline stage is failing in the Celery Workers module |
| `VerificationBacklogSLOBreach` | Warning | Verification queue >15K | Massive batch that will take 4+ hours to clear. No action needed unless it keeps growing |
| `BisonAPIHighErrorRate` | Warning | Bison API error rate >10% | Check Bison status. May be rate limiting or a token issue |
| `BisonAPIDown` | Critical | Bison API error rate >75% | Bison API is likely down. Check send.maverickmarketingllc.com. No action needed until Bison resolves it |
| `DebounceAPIHighErrorRate` | Warning | Debounce API error rate >10% | Likely rate limiting. Verification will slow but continue |
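
As an illustration of the 95% success-rate SLO in the table above, the sketch below computes a success rate from task counts and checks it against the threshold. The 95% figure comes from the table; the function names and the idea of feeding in raw succeeded/failed counts for the last hour are assumptions, not a description of how the platform computes the metric.

```python
from typing import Optional

# SLO threshold from the Pipeline SLOs table: below 95% success is a breach.
SUCCESS_RATE_SLO = 0.95


def success_rate(succeeded: int, failed: int) -> Optional[float]:
    """Return the fraction of successful tasks, or None if there were no tasks."""
    total = succeeded + failed
    if total == 0:
        return None
    return succeeded / total


def slo_breached(succeeded: int, failed: int) -> bool:
    """True when the success rate is strictly below the SLO threshold.

    A window with no tasks is treated as not breached (an assumption).
    """
    rate = success_rate(succeeded, failed)
    return rate is not None and rate < SUCCESS_RATE_SLO
```

With 940 successes and 60 failures in the window the rate is 94%, so `slo_breached(940, 60)` is `True`; at exactly 95% it is `False`.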