Operations
Scraping & Pipeline
How contact scraping works, monitoring pipeline runs, and restarting stuck jobs
What This Does
Scraping is the first stage of the contact pipeline. It uses Playwright (headless browser automation) to log into Xpressdocs and export insurance contact lists by ZIP code and renewal month.
The full pipeline flows in strict order: scraping → filtering → verification → batching → upload to Email Bison.
Each stage reads from the previous stage's output. Never skip stages.
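The ordering rule can be sketched as a small guard. This is a minimal illustration, not the actual implementation: the `Stage` names are assumptions inferred from the stat labels on the Overview page (raw, filtered, verified, batched, uploaded), and `next_stage` and `can_run` are hypothetical helpers.

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    """Pipeline stages in execution order (names are illustrative assumptions)."""
    SCRAPE = 1
    FILTER = 2
    VERIFY = 3
    BATCH = 4
    UPLOAD = 5


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after the last stage."""
    if current is Stage.UPLOAD:
        return None
    return Stage(current + 1)


def can_run(stage: Stage, last_completed: Optional[Stage]) -> bool:
    """A stage may run only if the stage directly before it has completed."""
    if last_completed is None:
        # Nothing has run yet, so only the first stage is allowed.
        return stage is Stage.SCRAPE
    return next_stage(last_completed) is stage
```

With this guard, `can_run(Stage.VERIFY, Stage.SCRAPE)` is false because filtering has not produced output yet, while `can_run(Stage.FILTER, Stage.SCRAPE)` is true.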
Pipeline Month Semantics
The stored month is the renewal month. Each phase runs at a fixed offset from it:
- Pull phase (renewal month - 2): contacts are scraped
- Email phase (renewal month - 1): campaigns are sent
- Renewal (the stored month): insurance renewals happen
Example: month 2026-06 means contacts are pulled in April, emailed in May, and renew in June.
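The offsets can be worked out mechanically, including across a year boundary. The sketch below assumes the month is stored as a `YYYY-MM` string, as in the example; `shift_month` and `pipeline_phases` are hypothetical helper names, not functions from the codebase.

```python
def shift_month(year: int, month: int, offset: int) -> tuple[int, int]:
    """Shift a (year, month) pair by `offset` months, handling year rollover."""
    index = year * 12 + (month - 1) + offset
    return index // 12, index % 12 + 1


def pipeline_phases(renewal_month: str) -> dict[str, str]:
    """Map a stored renewal month ('YYYY-MM') to its pull, email, and renewal months."""
    year, month = (int(part) for part in renewal_month.split("-"))
    pull = shift_month(year, month, -2)   # pull phase: renewal month - 2
    email = shift_month(year, month, -1)  # email phase: renewal month - 1
    return {
        "pull": f"{pull[0]:04d}-{pull[1]:02d}",
        "email": f"{email[0]:04d}-{email[1]:02d}",
        "renewal": f"{year:04d}-{month:02d}",
    }
```

For `"2026-06"` this gives a pull month of `2026-04` and an email month of `2026-05`. For `"2026-01"` the pull month falls in the previous year, `2025-11`.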
How To Use It
Starting a Scraping Job
- Go to Scraping in the sidebar
- Select a workspace from the dropdown
- Click Create Job — choose the target month and ZIP codes
- The job appears in the jobs table with status pending
- A Celery worker picks it up within seconds (status → running)
Monitoring Progress
- The jobs table shows status, created time, and contact counts
- The pipeline progress bar shows counts at each stage
- On the Status Page, the Celery Workers module shows active scraping tasks
Pipeline Stats
The Overview page shows pipeline stats per workspace per month:
- raw: Total contacts scraped
- filtered: After filter rules applied
- verified: After email verification
- batched: Grouped for upload
- uploaded: Pushed to Email Bison (this is what counts toward the monthly target)
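One way to read these counts is as a funnel. The sketch below is illustrative only: `PipelineStats`, `target_progress`, and `retention` are hypothetical names, and the only rule taken from this page is that uploaded contacts are what count toward the monthly target.

```python
from dataclasses import dataclass

LATER_STAGES = ("filtered", "verified", "batched", "uploaded")


@dataclass
class PipelineStats:
    """Contact counts at each stage for one workspace and one renewal month."""
    raw: int
    filtered: int
    verified: int
    batched: int
    uploaded: int

    def target_progress(self, monthly_target: int) -> float:
        """Fraction of the monthly target met; only uploaded contacts count."""
        if monthly_target <= 0:
            raise ValueError("monthly_target must be positive")
        return self.uploaded / monthly_target

    def retention(self) -> dict[str, float]:
        """Share of raw contacts that survive to each later stage."""
        if self.raw == 0:
            return {name: 0.0 for name in LATER_STAGES}
        return {name: getattr(self, name) / self.raw for name in LATER_STAGES}
```

A large drop between two adjacent stages in `retention()` points at the stage to inspect first, for example overly strict filter rules or a high rate of unverifiable emails.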
Xpressdocs Constraints
- 10K record limit per export — the system automatically paginates
- One credential at a time — only one scraping task can use a login simultaneously
- Browser automation — runs Playwright in a Docker container, so it's slower than API calls
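The first two constraints can be sketched as follows. This is a sketch under stated assumptions, not the system's actual mechanism: it assumes exports can be split by record offset, and it uses an in-process lock to stand in for whatever enforces one task per login. `export_pages` and `run_with_credential` are hypothetical names.

```python
import threading
from collections import defaultdict
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")

EXPORT_LIMIT = 10_000  # per-export record cap stated above

# One lock per credential so only one scraping task uses a login at a time.
# An in-process lock is an illustration only: workers running in separate
# processes or containers would need a lock they can all see.
_credential_locks = defaultdict(threading.Lock)


def export_pages(total_records: int, limit: int = EXPORT_LIMIT) -> Iterator[tuple[int, int]]:
    """Yield (offset, count) pairs that cover total_records, each within limit."""
    for offset in range(0, total_records, limit):
        yield offset, min(limit, total_records - offset)


def run_with_credential(credential_id: str, export_fn: Callable[[], T]) -> T:
    """Run export_fn while holding the lock for credential_id."""
    with _credential_locks[credential_id]:
        return export_fn()
```

Under these assumptions, a 25,000-record result set would be fetched as three exports of 10,000, 10,000, and 5,000 records, all serialized behind the same credential.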
Common Issues
| Symptom | Cause | Fix |
|---|---|---|
| Job stuck at running for 30+ min | Playwright browser hung or Xpressdocs session expired | Check Celery logs on Status Page → Logs module. If the browser is hung, the job will eventually time out (6h limit) |
| Job failed with login_failed | Xpressdocs password changed or account locked | Update credentials in Settings → Xpressdocs Config |
| 0 contacts scraped | No contacts available for that ZIP/month combo, or Xpressdocs returned empty export | Verify that ZIP assignments exist for the workspace and check that the month is correct |
| Pipeline stops after scraping | Downstream tasks not triggered | Check the jobs table — each stage creates its own job. If filtering didn't start, check Celery queue depth on Status Page |
| Duplicate contacts across months | Same person has multiple renewal dates | Expected behavior — dedup happens at the verification stage via email uniqueness |
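The stuck-job row can be turned into a simple check. The sketch below uses the 30-minute and 6-hour figures from the table; the function name and the returned labels are hypothetical, and it assumes a job's start time is available.

```python
from datetime import datetime, timedelta

STUCK_AFTER = timedelta(minutes=30)  # "stuck" threshold from the table above
HARD_TIMEOUT = timedelta(hours=6)    # job time limit from the table above


def classify_running_job(started_at: datetime, now: datetime) -> str:
    """Classify a job in the running state by how long it has been running."""
    elapsed = now - started_at
    if elapsed >= HARD_TIMEOUT:
        # Past the time limit: the job should already have been terminated.
        return "past_timeout"
    if elapsed >= STUCK_AFTER:
        # Worth checking the Celery logs for a hung browser or expired session.
        return "possibly_stuck"
    return "ok"
```

A job that started 45 minutes ago is classified as `possibly_stuck`, which is the point at which checking the logs becomes worthwhile.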
Related Alerts
- ScrapingQueueDepthHigh: More than 10 scraping tasks queued — workers may be overloaded
- CeleryHighFailureRate: >10% task failures across all queues
- PipelineSuccessRateSLOBreach: Pipeline success rate below 95% over 1 hour
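The three thresholds can be summarized in code. This is only a sketch of the conditions as described above, not how the alerts are actually defined in the monitoring system; the `Metrics` fields and `firing_alerts` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    scraping_queue_depth: int     # scraping tasks currently queued
    task_failure_rate: float      # failed / total across all queues, 0.0 to 1.0
    pipeline_success_rate: float  # success rate over the last hour, 0.0 to 1.0


def firing_alerts(m: Metrics) -> list[str]:
    """Return the names of the alerts whose thresholds are crossed."""
    alerts = []
    if m.scraping_queue_depth > 10:
        alerts.append("ScrapingQueueDepthHigh")
    if m.task_failure_rate > 0.10:
        alerts.append("CeleryHighFailureRate")
    if m.pipeline_success_rate < 0.95:
        alerts.append("PipelineSuccessRateSLOBreach")
    return alerts
```

All three comparisons are strict, so values sitting exactly on a threshold (10 queued tasks, a 10% failure rate, a 95% success rate) do not fire.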