Scraping & Pipeline

What This Does

Scraping is the first stage of the contact pipeline. It uses Playwright (headless browser automation) to log into Xpressdocs and export insurance contact lists by ZIP code and renewal month.

The full pipeline flows in strict order:

Scrape → Filter → Verify → Batch → Upload

Each stage reads from the previous stage's output. Never skip stages.
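
The strict ordering can be sketched as a small state check. This is only an illustration: the `Stage` enum and the `next_stage` and `can_run` helpers are hypothetical names, not the application's actual code.

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    """Pipeline stages in the order they must run (hypothetical names)."""
    SCRAPE = 1
    FILTER = 2
    VERIFY = 3
    BATCH = 4
    UPLOAD = 5


def next_stage(completed: Optional[Stage]) -> Optional[Stage]:
    """Return the only stage allowed to run next, or None when the pipeline is done."""
    if completed is None:
        return Stage.SCRAPE          # nothing has run yet
    if completed is Stage.UPLOAD:
        return None                  # last stage already finished
    return Stage(completed + 1)


def can_run(stage: Stage, completed: Optional[Stage]) -> bool:
    """A stage may run only if it directly follows the last completed stage."""
    return next_stage(completed) == stage
```

With this check, `can_run(Stage.VERIFY, Stage.SCRAPE)` is false because filtering has not produced its output yet.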

Pipeline Month Semantics

The stored month is always the renewal month. The earlier phases are offset from it:

Pull phase (renewal month - 2): contacts are scraped
Email phase (renewal month - 1): campaigns are sent
Renewal (stored month): insurance renewals happen

Example: Month 2026-06 means contacts are pulled in April 2026, emails go out in May 2026, and renewals happen in June 2026.
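
The offsets can be worked out with simple month arithmetic. The sketch below assumes the stored month is a `YYYY-MM` string; the function names `shift_month` and `phases` are made up for illustration.

```python
from datetime import date
from typing import Dict


def shift_month(year: int, month: int, delta: int) -> date:
    """Move a (year, month) pair by delta months, handling year boundaries."""
    index = year * 12 + (month - 1) + delta
    return date(index // 12, index % 12 + 1, 1)


def phases(stored_month: str) -> Dict[str, str]:
    """Map a stored renewal month (assumed 'YYYY-MM') to its pull and email months."""
    year, month = (int(part) for part in stored_month.split("-"))
    return {
        "pull": shift_month(year, month, -2).strftime("%Y-%m"),
        "email": shift_month(year, month, -1).strftime("%Y-%m"),
        "renewal": shift_month(year, month, 0).strftime("%Y-%m"),
    }


print(phases("2026-06"))
# {'pull': '2026-04', 'email': '2026-05', 'renewal': '2026-06'}
```

A January renewal crosses the year boundary: `phases("2026-01")` gives a pull month of 2025-11 and an email month of 2025-12.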

How To Use It

Starting a Scraping Job

Go to Scraping in the sidebar
Select a workspace from the dropdown
Click Create Job — choose the target month and ZIP codes
The job appears in the jobs table with status pending
A Celery worker picks it up within seconds (status → running)

Monitoring Progress

The jobs table shows status, created time, and contact counts
The pipeline progress bar shows counts at each stage
On the Status Page, the Celery Workers module shows active scraping tasks

Pipeline Stats

The Overview page shows pipeline stats per workspace per month:

raw: Total contacts scraped
filtered: After filter rules applied
verified: After email verification
batched: Grouped for upload
uploaded: Pushed to Email Bison (this is what counts toward the monthly target)
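
One way to read these counters is as a funnel, where each stage keeps some fraction of the previous stage's contacts. The sketch below assumes the stats arrive as a plain dictionary keyed by the stage names above. The helper names and the sample numbers are hypothetical.

```python
from typing import Dict

STAGES = ["raw", "filtered", "verified", "batched", "uploaded"]


def retention(stats: Dict[str, int]) -> Dict[str, float]:
    """Fraction of contacts kept at each stage relative to the stage before it."""
    rates = {}
    for prev, cur in zip(STAGES, STAGES[1:]):
        rates[cur] = stats[cur] / stats[prev] if stats[prev] else 0.0
    return rates


def target_progress(stats: Dict[str, int], monthly_target: int) -> float:
    """Only uploaded contacts count toward the monthly target."""
    return stats["uploaded"] / monthly_target


# Made-up numbers, for illustration only.
sample = {"raw": 1000, "filtered": 800, "verified": 600, "batched": 600, "uploaded": 300}
print(retention(sample))
# {'filtered': 0.8, 'verified': 0.75, 'batched': 1.0, 'uploaded': 0.5}
```

A sharp drop at one stage points at where to look first, for example filter rules that are too strict or a verification backlog.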

Xpressdocs Constraints

10K record limit per export — the system automatically paginates
One credential at a time — only one scraping task can use a login simultaneously
Browser automation — runs Playwright in a Docker container, so it's slower than API calls
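
The first two constraints can be illustrated with a rough sketch. How the system actually paginates and how it serializes logins is not described here, so everything below is an assumption: `export_windows` splits a result set into offset-based windows of at most 10,000 records, and `CredentialLocks` uses in-process locks, whereas a real multi-worker deployment would need a lock shared across workers.

```python
import threading
from typing import Dict, List, Tuple

EXPORT_LIMIT = 10_000  # per-export record limit stated above


def export_windows(total: int, limit: int = EXPORT_LIMIT) -> List[Tuple[int, int]]:
    """Split `total` records into (offset, count) windows no larger than `limit`."""
    return [(offset, min(limit, total - offset)) for offset in range(0, total, limit)]


class CredentialLocks:
    """Allow at most one task per login at a time (in-process illustration only)."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._locks: Dict[str, threading.Lock] = {}

    def try_acquire(self, login: str) -> bool:
        # Create the per-login lock on first use, then try to take it without waiting.
        with self._guard:
            lock = self._locks.setdefault(login, threading.Lock())
        return lock.acquire(blocking=False)

    def release(self, login: str) -> None:
        self._locks[login].release()


print(export_windows(25_000))
# [(0, 10000), (10000, 10000), (20000, 5000)]
```

A second task that calls `try_acquire` for a login already in use gets `False` and has to wait or retry, which is why jobs sharing one credential run one after another.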

Common Issues

Symptom	Cause	Fix
Job stuck at `running` for 30+ min	Playwright browser hung or Xpressdocs session expired	Check Celery logs on Status Page → Logs module. If the browser is hung, the job eventually times out (6h limit)
Job failed with `login_failed`	Xpressdocs password changed or account locked	Update credentials in Settings → Xpressdocs Config
0 contacts scraped	No contacts available for that ZIP/month combo, or Xpressdocs returned empty export	Verify ZIP assignments exist for the workspace, check the month is correct
Pipeline stops after scraping	Downstream tasks not triggered	Check the jobs table — each stage creates its own job. If filtering didn't start, check Celery queue depth on Status Page
Duplicate contacts across months	Same person has multiple renewal dates	Expected behavior — dedup happens at the verification stage via email uniqueness
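
The email-uniqueness dedup mentioned in the last row can be sketched as follows. This is an assumption-laden illustration: it treats contacts as dictionaries with an `email` key, normalizes by trimming and lowercasing, and keeps the first occurrence. The real verification stage may use different rules.

```python
from typing import Dict, List


def dedupe_by_email(contacts: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first contact seen for each email address (case-insensitive)."""
    seen = set()
    unique = []
    for contact in contacts:
        key = contact["email"].strip().lower()
        if key in seen:
            continue  # same person already kept, e.g. from another renewal month
        seen.add(key)
        unique.append(contact)
    return unique


# Hypothetical records: the same person appears under two renewal months.
contacts = [
    {"email": "pat@example.com", "month": "2026-06"},
    {"email": "Pat@Example.com ", "month": "2026-09"},
    {"email": "sam@example.com", "month": "2026-06"},
]
print(len(dedupe_by_email(contacts)))
# 2
```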

Related Alerts

ScrapingQueueDepthHigh: More than 10 scraping tasks queued — workers may be overloaded
CeleryHighFailureRate: >10% task failures across all queues
PipelineSuccessRateSLOBreach: Pipeline success rate below 95% over 1 hour
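
The three alert conditions above can be read as simple threshold checks. The sketch below only restates those thresholds in code. The function name and its inputs are hypothetical, and the real alerts are presumably evaluated by the monitoring stack rather than by application code like this.

```python
from typing import List


def firing_alerts(
    queued_scraping_tasks: int,
    failed_tasks: int,
    total_tasks: int,
    pipeline_success_rate_1h: float,
) -> List[str]:
    """Return the names of the alerts whose thresholds are currently exceeded."""
    alerts = []
    if queued_scraping_tasks > 10:
        alerts.append("ScrapingQueueDepthHigh")
    if total_tasks and failed_tasks / total_tasks > 0.10:
        alerts.append("CeleryHighFailureRate")
    if pipeline_success_rate_1h < 0.95:
        alerts.append("PipelineSuccessRateSLOBreach")
    return alerts
```

Note that the thresholds are strict: exactly 10 queued tasks or exactly 10% failures does not fire anything.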
