Scraping & Pipeline

How contact scraping works, monitoring pipeline runs, and restarting stuck jobs

What This Does

Scraping is the first stage of the contact pipeline. It uses Playwright (headless browser automation) to log into Xpressdocs and export insurance contact lists by ZIP code and renewal month.

The full pipeline flows in strict order:

Scrape → Filter → Verify → Batch → Upload

Each stage reads from the previous stage's output. Never skip stages.
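
The ordering rule can be sketched as a small guard that returns the next stage to run and rejects any state where a later stage is marked complete before an earlier one. This is an illustrative sketch only: the `Stage` enum and `next_stage` helper are hypothetical names, not part of the actual pipeline code.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Stage(IntEnum):
    """Pipeline stages in their required order (names are illustrative)."""
    SCRAPE = 1
    FILTER = 2
    VERIFY = 3
    BATCH = 4
    UPLOAD = 5


def next_stage(completed: Iterable[Stage]) -> Optional[Stage]:
    """Return the first stage that has not run yet, or None if all are done.

    Raises ValueError if a stage is marked complete while an earlier
    stage is still pending, i.e. a stage was skipped.
    """
    done = set(completed)
    pending = None
    for stage in Stage:  # iterates in definition order
        if stage not in done:
            if pending is None:
                pending = stage
        elif pending is not None:
            raise ValueError(
                f"{stage.name} is marked complete but {pending.name} is not"
            )
    return pending
```

For example, `next_stage([Stage.SCRAPE, Stage.FILTER])` returns `Stage.VERIFY`, while `next_stage([Stage.FILTER])` raises because scraping was skipped.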

Pipeline Month Semantics

The month stored on a job is the renewal month. Each pipeline phase is offset from it:

  • Pull phase (renewal month - 2): contacts are scraped
  • Email phase (renewal month - 1): campaigns are sent
  • Renewal (the stored month): insurance renewals take place

Example: a stored month of 2026-06 means contacts are pulled in April 2026, emailed in May 2026, and renew in June 2026.
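
The offsets can be worked out with simple month arithmetic, which matters when the renewal month falls early in the year and the pull phase lands in the previous year. The sketch below assumes the stored month is a `YYYY-MM` string, as in the example; `pipeline_timeline` is a hypothetical helper, not a function from the codebase.

```python
def _shift_month(year: int, month: int, delta: int) -> tuple:
    """Shift a (year, month) pair by delta months, handling year rollover."""
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1


def pipeline_timeline(renewal_month: str) -> dict:
    """Map a stored renewal month ('YYYY-MM') to its pull and email months."""
    year_text, month_text = renewal_month.split("-")
    year, month = int(year_text), int(month_text)
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month: {renewal_month!r}")

    def fmt(delta: int) -> str:
        y, m = _shift_month(year, month, delta)
        return f"{y:04d}-{m:02d}"

    return {
        "pull": fmt(-2),     # renewal month - 2
        "email": fmt(-1),    # renewal month - 1
        "renewal": fmt(0),   # the stored month itself
    }
```

`pipeline_timeline("2026-06")` gives a pull month of `2026-04` and an email month of `2026-05`; for `2026-01` the pull month rolls back to `2025-11`.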

How To Use It

Starting a Scraping Job

  1. Go to Scraping in the sidebar
  2. Select a workspace from the dropdown
  3. Click Create Job — choose the target month and ZIP codes
  4. The job appears in the jobs table with status pending
  5. A Celery worker picks it up within seconds (status → running)

Monitoring Progress

  • The jobs table shows status, created time, and contact counts
  • The pipeline progress bar shows counts at each stage
  • On the Status Page, the Celery Workers module shows active scraping tasks

Pipeline Stats

The Overview page shows pipeline stats per workspace per month:

  • raw: Total contacts scraped
  • filtered: After filter rules applied
  • verified: After email verification
  • batched: Grouped for upload
  • uploaded: Pushed to Email Bison (this is what counts toward the monthly target)
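
When reading these stats, it often helps to look at how many contacts survive each stage rather than at the raw counts. The following is a minimal sketch that assumes the stats arrive as a dictionary keyed by the five stage names above; the function names and the sample numbers are made up for illustration.

```python
STAGES = ("raw", "filtered", "verified", "batched", "uploaded")


def stage_retention(stats: dict) -> dict:
    """Fraction of contacts kept at each stage, relative to the previous stage.

    A value is None when the previous stage has no contacts, so the
    ratio is undefined.
    """
    rates = {}
    for prev, cur in zip(STAGES, STAGES[1:]):
        prev_count = stats.get(prev, 0)
        rates[cur] = stats.get(cur, 0) / prev_count if prev_count else None
    return rates


def target_progress(stats: dict, monthly_target: int) -> float:
    """Progress toward the monthly target; only 'uploaded' counts."""
    if monthly_target <= 0:
        raise ValueError("monthly_target must be positive")
    return stats.get("uploaded", 0) / monthly_target


# Hypothetical numbers for one workspace and month.
example = {"raw": 10000, "filtered": 8000, "verified": 6000,
           "batched": 6000, "uploaded": 3000}
```

A sharp drop between two adjacent stages points at the stage to investigate first, for example an overly strict filter rule or a batch that has not been uploaded yet.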

Xpressdocs Constraints

  • 10K record limit per export — the system automatically paginates
  • One credential at a time — only one scraping task can use a login simultaneously
  • Browser automation — runs Playwright in a Docker container, so it's slower than API calls
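
The first two constraints can be pictured with a short sketch: splitting a large export into pages that respect the 10K limit, and guarding a credential so only one task uses it at a time. Both helpers are hypothetical and simplified. In particular, the lock below only works inside a single process; workers running in separate processes or containers would need a shared lock, and how the real system enforces this is not described here.

```python
import threading
from collections import defaultdict

EXPORT_LIMIT = 10_000  # per-export record limit from the constraints above


def export_pages(total_records: int, limit: int = EXPORT_LIMIT) -> list:
    """Split an export into (offset, count) pages of at most `limit` records."""
    if total_records < 0:
        raise ValueError("total_records must not be negative")
    return [
        (offset, min(limit, total_records - offset))
        for offset in range(0, total_records, limit)
    ]


# One lock per credential (in-process only; an assumption for illustration).
_credential_locks = defaultdict(threading.Lock)


def try_acquire_credential(credential_id: str) -> bool:
    """Return True if the credential was free and is now held by the caller."""
    return _credential_locks[credential_id].acquire(blocking=False)


def release_credential(credential_id: str) -> None:
    """Release a credential previously acquired with try_acquire_credential."""
    _credential_locks[credential_id].release()
```

For instance, `export_pages(25_000)` yields three pages: two of 10,000 records and one of 5,000.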

Common Issues

| Symptom | Cause | Fix |
| --- | --- | --- |
| Job stuck at running for 30+ min | Playwright browser hung or Xpressdocs session expired | Check Celery logs on Status Page → Logs module. If hung, the job will eventually time out (6h limit) |
| Job failed with login_failed | Xpressdocs password changed or account locked | Update credentials in Settings → Xpressdocs Config |
| 0 contacts scraped | No contacts available for that ZIP/month combo, or Xpressdocs returned an empty export | Verify that ZIP assignments exist for the workspace and that the month is correct |
| Pipeline stops after scraping | Downstream tasks not triggered | Check the jobs table: each stage creates its own job. If filtering didn't start, check Celery queue depth on Status Page |
| Duplicate contacts across months | Same person has multiple renewal dates | Expected behavior: dedup happens at the verification stage via email uniqueness |
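
The two time thresholds for a running job (30 minutes before it is worth investigating, 6 hours before it times out) can be expressed as a small classifier. This is only a sketch for reasoning about the thresholds; `running_job_state` and its return values are hypothetical and do not correspond to real job statuses.

```python
from datetime import datetime, timedelta

STUCK_AFTER = timedelta(minutes=30)  # worth checking the Celery logs
HARD_TIMEOUT = timedelta(hours=6)    # job timeout limit


def running_job_state(started_at: datetime, now: datetime) -> str:
    """Classify a running job by how long it has been running."""
    elapsed = now - started_at
    if elapsed >= HARD_TIMEOUT:
        return "timed_out"
    if elapsed >= STUCK_AFTER:
        return "possibly_stuck"
    return "ok"
```

A job in the `possibly_stuck` range is not necessarily broken, since browser automation is slow, which is why the first step is to check the logs rather than to restart it.
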

Alerts

  • ScrapingQueueDepthHigh: More than 10 scraping tasks queued; workers may be overloaded
  • CeleryHighFailureRate: >10% task failures across all queues
  • PipelineSuccessRateSLOBreach: Pipeline success rate below 95% over 1 hour
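
The three thresholds above can be summarized in one function, which is useful for checking by hand whether a given set of numbers would trigger an alert. This is a sketch of the threshold logic only: the function and parameter names are hypothetical, and time windows (such as the 1-hour window for the success rate) are assumed to be applied before the values are passed in.

```python
def firing_alerts(queued_scraping_tasks: int,
                  failed_tasks: int,
                  total_tasks: int,
                  pipeline_success_rate: float) -> list:
    """Return the names of alerts whose thresholds are exceeded.

    pipeline_success_rate is a fraction in [0, 1], assumed to be
    measured over the last hour.
    """
    alerts = []
    if queued_scraping_tasks > 10:
        alerts.append("ScrapingQueueDepthHigh")
    # Skip the failure-rate check when no tasks ran, to avoid dividing by zero.
    if total_tasks > 0 and failed_tasks / total_tasks > 0.10:
        alerts.append("CeleryHighFailureRate")
    if pipeline_success_rate < 0.95:
        alerts.append("PipelineSuccessRateSLOBreach")
    return alerts
```

For example, 12 queued scraping tasks with 5 failures out of 100 tasks and a 97% success rate would trigger only `ScrapingQueueDepthHigh`.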
