Alerts & Troubleshooting

What This Does

The platform sends alerts to Slack when things go wrong. Alerts are routed by severity:

Critical → #maverick-alerts-critical (immediate action needed)
Warning → #maverick-alerts (investigate when possible)
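
The routing rule above is a plain severity-to-channel lookup. As a minimal sketch (the channel names come from the list above; the function name `channel_for` and the dictionary are hypothetical and not taken from the platform's code), it could look like this:

```python
# Illustrative only: channel names are from the routing list above,
# but this mapping and function are assumptions, not the platform's code.
SEVERITY_CHANNELS = {
    "critical": "#maverick-alerts-critical",  # immediate action needed
    "warning": "#maverick-alerts",            # investigate when possible
}


def channel_for(severity: str) -> str:
    """Return the Slack channel for an alert severity (case-insensitive)."""
    key = severity.strip().lower()
    if key not in SEVERITY_CHANNELS:
        raise ValueError(f"unknown severity: {severity!r}")
    return SEVERITY_CHANNELS[key]
```

For example, `channel_for("Critical")` returns `#maverick-alerts-critical`. An unknown severity raises an error rather than silently falling back to a channel.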

Alert Reference

API Health

Alert	Severity	Trigger	What To Do
`APIInstanceDown`	Critical	Prometheus can't reach API for >2 min	Check Status Page → Core Services. If down, SSH to server and run `docker compose -f docker-compose.production.yml restart api`
`APIHighErrorRateWarning`	Warning	5xx rate >1% over 5 min	Check Status Page → Logs → `api` for error patterns
`APIHighErrorRateCritical`	Critical	5xx rate >5% over 3 min	Likely a code bug or DB issue. Check Logs → `api` for stack traces. May need rollback
`APIHighLatencyWarning`	Warning	P95 latency >1s	Check database latency on Status Page → Database module
`APIHighLatencyCritical`	Critical	P95 latency >3s	Database or Redis likely overloaded. Check System Resources for CPU/memory
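
To make the thresholds in this table concrete, here is a rough sketch that maps already-computed metrics to the alert names above. The function names and the idea of passing in pre-aggregated rates are assumptions for illustration; the real rules are evaluated by Prometheus, and a warning and a critical alert may fire at the same time, whereas this sketch reports only the most severe one.

```python
from typing import Optional


def error_rate_alert(rate_5m: float, rate_3m: float) -> Optional[str]:
    """Classify the 5xx error rate (fractions, e.g. 0.02 means 2%).

    rate_5m is the rate over the last 5 minutes, rate_3m over the last 3.
    """
    if rate_3m > 0.05:   # >5% over 3 min
        return "APIHighErrorRateCritical"
    if rate_5m > 0.01:   # >1% over 5 min
        return "APIHighErrorRateWarning"
    return None


def latency_alert(p95_seconds: float) -> Optional[str]:
    """Classify P95 latency, given in seconds."""
    if p95_seconds > 3.0:
        return "APIHighLatencyCritical"
    if p95_seconds > 1.0:
        return "APIHighLatencyWarning"
    return None
```

Both thresholds are strict, so a P95 of exactly 1s or an error rate of exactly 1% does not trigger anything in this sketch.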

Celery Workers and Queues

Alert	Severity	Trigger	What To Do
`EmailEventsQueueDepthHigh`	Warning	email_events queue >200	Webhook backlog building. Check if email_bison worker is running
`EmailEventsQueueDepthCritical`	Critical	email_events queue >1000	Worker likely crashed. Restart: `docker restart celery-email-bison`
`VerificationQueueDepthHigh`	Warning	verification queue >5000	Large batch submitted. Normal — takes ~2h to clear at 3K/hour
`ScrapingQueueDepthHigh`	Warning	scraping queue >10	Multiple scraping jobs queued. Only 1 runs at a time (browser automation). Queue will drain slowly
`CeleryHighFailureRateWarning`	Warning	>10% task failures	Check Logs for failing task patterns
`CeleryHighFailureRateCritical`	Critical	>25% task failures	Systemic issue — check DB connectivity, Redis, external APIs
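
The "~2h to clear at 3K/hour" estimate for the verification queue is simple division. A small sketch of that arithmetic follows; the function name is hypothetical, and it assumes a steady 3,000 tasks per hour with no new tasks arriving while the queue drains.

```python
def hours_to_drain(queue_depth: int, tasks_per_hour: float = 3000.0) -> float:
    """Estimate hours to clear a queue at a constant processing rate.

    Assumes no new tasks are enqueued while the backlog drains.
    """
    if tasks_per_hour <= 0:
        raise ValueError("tasks_per_hour must be positive")
    return queue_depth / tasks_per_hour


print(round(hours_to_drain(5000), 2))  # 1.67, i.e. roughly 2 hours
print(hours_to_drain(15000))           # 5.0
```

If the queue depth keeps rising instead of falling, the constant-rate assumption no longer holds and the worker itself is worth checking.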

System Resources

Alert	Severity	Trigger	What To Do
`DiskSpaceWarning`	Warning	Disk >70% full	Check Docker images and logs consuming space. Run `docker system prune` on server
`DiskSpaceCritical`	Critical	Disk >85% full	Urgent — clean up immediately or services will crash
`HighMemoryUsageWarning`	Warning	RAM >80%	Check which containers are using most memory: `docker stats`
`HighCPULoad`	Warning	5-min load avg >6	Usually scraping or verification spike. Should resolve on its own
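
For a quick manual check of disk usage against the thresholds above, something like the following sketch can be run on the server. It uses Python's standard `shutil.disk_usage`; the function names are hypothetical, and the number it reports may differ slightly from what the monitoring stack computes, since the alert's exact expression is not documented here.

```python
import shutil
from typing import Optional


def disk_alert(used_fraction: float) -> Optional[str]:
    """Map a disk usage fraction (0.0 to 1.0) to the alert names above."""
    if used_fraction > 0.85:
        return "DiskSpaceCritical"
    if used_fraction > 0.70:
        return "DiskSpaceWarning"
    return None


def disk_used_fraction(path: str = "/") -> float:
    """Return used/total for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # guard against pseudo-filesystems reporting no size
        return 0.0
    return usage.used / usage.total


if __name__ == "__main__":
    fraction = disk_used_fraction("/")
    print(f"disk used: {fraction:.0%}, alert: {disk_alert(fraction)}")
```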

Pipeline SLOs

Alert	Severity	Trigger	What To Do
`PipelineSuccessRateSLOBreach`	Warning	Success rate <95% over 1h	Check which pipeline stage is failing in the Celery Workers module
`VerificationBacklogSLOBreach`	Warning	Verification queue >15K	Massive batch. Will take 5+ hours to clear at 3K/hour. No action needed unless it keeps growing
`BisonAPIHighErrorRate`	Warning	Bison >10% errors	Check Bison status. May be rate limiting or token issue
`BisonAPIDown`	Critical	Bison >75% errors	Bison API is likely down. Check send.maverickmarketingllc.com. No action needed on our side until Bison resolves it
`DebounceAPIHighErrorRate`	Warning	Debounce >10% errors	Likely rate limiting. Verification will slow but continue
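
As a worked example of the 95% success-rate threshold, the sketch below computes a success rate from task counts in a one-hour window. The function names are hypothetical, and treating an empty window as "no breach" is an assumption, not documented platform behavior.

```python
def success_rate(succeeded: int, failed: int) -> float:
    """Fraction of pipeline runs that succeeded in the window."""
    total = succeeded + failed
    if total == 0:
        return 1.0  # assumption: no runs in the window means no breach
    return succeeded / total


def breaches_slo(succeeded: int, failed: int, target: float = 0.95) -> bool:
    """True when the success rate falls below the target."""
    return success_rate(succeeded, failed) < target


print(breaches_slo(94, 6))  # True: 94% is below the 95% target
print(breaches_slo(95, 5))  # False: exactly 95% is not below the target
```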

General Troubleshooting Decision Tree

Something seems wrong
├── Is the Status Page showing any red modules?
│   ├── Yes → Click the red module for details
│   │   ├── Core Services down → restart API container
│   │   ├── Celery down → check which queue, restart that worker
│   │   ├── Database down → check Supabase status
│   │   └── System Resources critical → check disk/memory
│   └── No → Check Slack for recent alerts
│
├── Is a specific pipeline slow?
│   ├── Check Celery Workers → Active Tasks tab for what's running
│   ├── Check queue depths — high depth = backlog, not failure
│   └── Check Logs module for the relevant worker
│
└── Is data not updating?
    ├── Campaign data → wait for hourly sync (check Beat schedule)
    ├── Pipeline data → check if the stage completed (jobs table)
    └── Dashboard → hard refresh (Ctrl+Shift+R)
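
The tree above can also be read as a lookup from symptom to first action. The following is only a sketch of that reading: the function and its parameters are hypothetical, and checking the three questions in a fixed priority order is an assumption, since the tree itself presents them as alternatives.

```python
from typing import Optional

# Actions copied from the decision tree above.
RED_MODULE_ACTIONS = {
    "Core Services": "restart API container",
    "Celery": "check which queue, restart that worker",
    "Database": "check Supabase status",
    "System Resources": "check disk/memory",
}

STALE_DATA_ACTIONS = {
    "campaign": "wait for hourly sync (check Beat schedule)",
    "pipeline": "check if the stage completed (jobs table)",
    "dashboard": "hard refresh (Ctrl+Shift+R)",
}

FALLBACK = "check Slack for recent alerts"


def next_step(red_module: Optional[str] = None,
              pipeline_slow: bool = False,
              stale_data: Optional[str] = None) -> str:
    """Suggest the first troubleshooting action for a symptom."""
    if red_module is not None:
        return RED_MODULE_ACTIONS.get(red_module, "click the red module for details")
    if pipeline_slow:
        return "check Active Tasks, queue depths, and the relevant worker's logs"
    if stale_data is not None:
        # Unknown data types fall back to Slack; this fallback is an assumption.
        return STALE_DATA_ACTIONS.get(stale_data.lower(), FALLBACK)
    return FALLBACK
```

For example, `next_step(red_module="Database")` returns `check Supabase status`, and calling `next_step()` with no symptoms returns the Slack fallback.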
