Two-engine Python toolkit for scraping any paginated HTML or WordPress AJAX business directory — config-driven, checkpoint-resumable, concurrent fetching, Cloudflare email decode, and professionally formatted three-sheet Excel export. Point either engine at a new directory with a single config file change. No code edits required.
Found this useful? A ⭐ on GitHub helps other developers find it.
- Preview
- What It Does
- Engines
- Use Cases
- Features
- Performance
- What Data You Get
- Quick Start
- Choosing an Engine
- Configuration Reference
- Output
- Runtime Controls
- Auto-Protection Features
- Resuming a Run
- Prerequisites
- Project Structure
- Running Tests
- Troubleshooting
- Known Limitations
- Part of the B2B Lead Toolkit
- License
| Terminal — live progress bar | Excel Output — three sheets |
|---|---|
| ![]() | ![]() |
See assets/Sample_Output.csv for realistic sample output rows.
- Reads config.yaml — one file controls the target URL, CSS selectors (HTML engine) or AJAX parameters (WordPress engine), geographic filter, category mapping, rate limits, and output format.
- Discovers all listings — paginates through every search-results page (HTML engine) or AJAX POST page (WordPress engine) to collect every profile URL or business marker.
- Fetches profile pages concurrently — up to --workers N threads (default 3) fetch profile pages in parallel, decoding Cloudflare-obfuscated emails automatically.
- Filters by geography — HTML engine: regex on location text. WordPress engine: lat/lng bounding box from AJAX map markers.
- Deduplicates — by URL (HTML) or name + postcode composite key (WordPress).
- Exports to Excel — a styled Data sheet (clean records), a Flagged sheet (excluded records with reason), and a Summary sheet (run metadata).
Everything is config-driven. The Python source files contain zero site-specific strings — every URL, selector, and parameter lives in config.yaml.
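The Cloudflare email decoding mentioned above follows a widely documented scheme: the data-cfemail attribute holds a hex string whose first byte is an XOR key for the remaining bytes. The sketch below illustrates that scheme only; it is not the toolkit's own parser code, and the names decode_cfemail and encode_cfemail are hypothetical.

```python
def decode_cfemail(encoded: str) -> str:
    """Decode a Cloudflare-obfuscated email from its hex string."""
    data = bytes.fromhex(encoded)
    key = data[0]  # the first byte is the XOR key
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


def encode_cfemail(email: str, key: int = 0x42) -> str:
    """Inverse of decode_cfemail; included only to build sample input."""
    return bytes([key] + [b ^ key for b in email.encode("utf-8")]).hex()


sample = encode_cfemail("info@example.com")
print(decode_cfemail(sample))  # info@example.com
```

Because XOR is its own inverse, the same key both hides and reveals the address, which is why no secret is needed to decode it.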
| Engine | Target platform | Entry point | Unique technique |
|---|---|---|---|
| HTML Scraper | Any paginated HTML directory | engines/html/scraper.py | CSS selectors, listing→profile two-phase crawl, Cloudflare XOR email decode |
| WordPress Scraper | WordPress admin-ajax.php directories | engines/wordpress/scraper.py | Nonce extraction + mid-run refresh, AJAX POST pagination, manual gzip decompression |
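The manual decompression listed for the WordPress engine can be sketched as a magic-byte check: gzip streams start with the bytes 1F 8B, and zlib streams start with a two-byte header that is a multiple of 31. This is a minimal sketch under those assumptions, not the toolkit's actual fetcher code; decode_ajax_body and looks_like_zlib are hypothetical names.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def looks_like_zlib(body: bytes) -> bool:
    """True when the first two bytes form a valid zlib (RFC 1950) header."""
    if len(body) < 2:
        return False
    cmf, flg = body[0], body[1]
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0


def decode_ajax_body(body: bytes) -> str:
    """Decompress a gzip or zlib payload if detected, then decode as UTF-8."""
    if body[:2] == GZIP_MAGIC:
        body = gzip.decompress(body)
    elif looks_like_zlib(body):
        try:
            body = zlib.decompress(body)
        except zlib.error:
            pass  # header matched by coincidence; treat as plain text
    return body.decode("utf-8", errors="replace")
```

Plain JSON starts with `{` or `[`, neither of which passes either check, so uncompressed responses go straight to decoding.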
| Who uses it | What they do | Which engine |
|---|---|---|
| Lead gen teams | Scrape member directories to build B2B contact lists filtered to a city or region | Either, depending on the directory tech stack |
| Property data analysts | Extract letting agents, property managers, and sales agents from industry directories | WordPress engine (TPOS, similar) or HTML engine (Propertymark, similar) |
| Market researchers | Pull structured business datasets from trade association directories | HTML engine for static directories |
| CRM admins | Automate monthly refreshes of contact data from a membership directory | Either engine with cron / Task Scheduler |
| Developers | Adapt to any new directory in minutes using a new config.yaml | Copy config.yaml.example, fill in 5 fields, run |
| Feature | Detail |
|---|---|
| Zero-code retargeting | Point at any compatible directory by editing config.yaml — no Python changes needed |
| Two-phase HTML crawl | Phase 1: paginate listing pages and collect profile URLs. Phase 2: visit each profile for full contact details |
| WordPress AJAX POST | Posts directly to admin-ajax.php — the same endpoint the site's own search form uses |
| Cloudflare XOR email decode | Transparently decodes /cdn-cgi/l/email-protection and data-cfemail obfuscated emails |
| Concurrent profile fetching | --workers N (default 3, max 8) fetches profile pages in parallel — ~3× throughput vs sequential |
| WordPress nonce auto-extraction | Scrapes the nonce security token from the register page before the main loop; refreshes mid-run if expired |
| Manual gzip/zlib decompression | Sniffs magic bytes and decompresses AJAX responses that lack Content-Encoding headers |
| Geographic filtering | HTML engine: configurable regex on location text. WordPress engine: lat/lng bounding box from map markers |
| Configurable postcode/ZIP extraction | postcode_regex config key — works for UK postcodes, US ZIP codes, AU postcodes, or any pattern |
| Atomic checkpoint writes | Write-to-temp-then-rename — checkpoint survives a crash mid-write |
| Three-sheet Excel export | Data · Flagged · Summary — styled, dated, named from config |
| Keyboard controls | P pause · R resume · S stop · W stats — no Enter needed, event-driven (responds instantly) |
| Auto-protection | Stop time · low-disk guard · circuit breaker · ConnectTimeout instant skip on dead sites |
| Configurable header colour | WordPress engine: set output.header_color in config to distinguish scraper outputs at a glance |
| 123 tests | Full offline test suite across both engines — no network calls, no API key needed |
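The atomic checkpoint write named in the table (write to a temp file, then rename) can be illustrated with the standard library. This is a sketch of the general technique, assuming a JSON checkpoint as the project structure suggests; save_checkpoint and load_checkpoint are hypothetical names, not the CheckpointManager API.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # swaps the file in a single step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> Optional[dict]:
    """Return the saved state, or None when no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

A crash before the rename leaves the previous checkpoint untouched, so a reader only ever sees a complete old file or a complete new one.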
| Mode | Throughput | Notes |
|---|---|---|
| v1.0.0 sequential | ~12 records/min | One profile fetch at a time |
| v1.1.0 concurrent (3 workers) | ~35 records/min | ~3× improvement — default setting |
| v1.1.0 concurrent (5 workers) | ~50 records/min | Diminishing returns above 5 workers |
Tuning guidance:
- Use --workers 3 (default) for polite scraping on shared-hosting directories.
- Use --workers 5 for faster runs on directories with robust infrastructure.
- Never exceed --workers 8 (enforced by a hard cap in scraper.py).
- Increase delay_min / delay_max in config if you receive HTTP 429 responses.
- profile_timeout_seconds: 6 eliminates stalls on dead company websites; ConnectTimeout returns instantly.
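The worker cap and parallel fetch described in the tuning guidance above can be sketched with the standard library's ThreadPoolExecutor. Whether the toolkit uses this executor is an assumption; MAX_WORKERS, clamp_workers, and fetch_all are hypothetical names, and fetch_one stands in for whatever function downloads a single profile page.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
MAX_WORKERS = 8  # mirrors the hard cap stated in the guidance above


def clamp_workers(requested: int) -> int:
    """Keep the worker count between 1 and MAX_WORKERS."""
    return max(1, min(requested, MAX_WORKERS))


def fetch_all(urls: Iterable[str], fetch_one: Callable[[str], T],
              workers: int = 3) -> List[T]:
    """Run fetch_one over urls in parallel, returning results in input order."""
    with ThreadPoolExecutor(max_workers=clamp_workers(workers)) as pool:
        return list(pool.map(fetch_one, urls))
```

Executor.map yields results in the order of the inputs, so each result can be matched back to its listing without extra bookkeeping.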
Expected runtime formula:
minutes ≈ total_records / (workers × 12) × delay_avg
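Read literally, the formula divides the record count by workers × 12 (the sequential rate of about 12 records/min from the table above) and scales the result by delay_avg. The source does not define delay_avg; the sketch below assumes it is a unitless multiplier where 1.0 reproduces the table's figures, which is an interpretation and not something this README states. The function name estimate_minutes is hypothetical.

```python
def estimate_minutes(total_records: int, workers: int, delay_avg: float = 1.0) -> float:
    """Rough runtime estimate using the formula exactly as written above."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return total_records / (workers * 12) * delay_avg


# 360 records with 3 workers: 360 / 36 = 10 minutes at delay_avg = 1.0
print(estimate_minutes(360, 3))  # 10.0
```

The estimate is linear in the worker count, so it will be optimistic above 5 workers, where the table reports diminishing returns.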
| Field | Example |
|---|---|
| Company | Acme Property Consultants Ltd |
| Email | info@acmeproperty.co.uk |
| Phone | 02071234567 |
| Website | https://www.acmeproperty.co.uk |
| Location | SW1A 1AA |
| Category | Residential Sales |
| Source | Directory |
See assets/Sample_Output.csv for realistic sample output.
git clone https://github.com/FAAQJAVED/html-directory-scrapers.git
cd html-directory-scrapers

# HTML engine
cd engines/html
pip install -r requirements.txt

# WordPress engine
cd engines/wordpress
pip install -r requirements.txt

cp config.yaml.example config.yaml

Edit config.yaml: set your target URL, CSS selectors (HTML) or AJAX parameters (WordPress), and geographic filter. Every option is annotated in the example file.

cp .env.example .env

Paste session cookies as SCRAPER_COOKIES=... (HTML) or SCRAPER_COOKIES_RAW=... (WordPress). Never stored in config.yaml.
# Standard run
python scraper.py
# Use a different config file
python scraper.py --config my_config.yaml
# Start fresh (clears any saved checkpoint)
python scraper.py --fresh
# Increase concurrent workers for faster scraping
python scraper.py --workers 5

pip install -e ../../  # from either engine folder
html-scraper --config config.yaml
wp-scraper --config config.yaml

Use the HTML engine if the directory renders its listings directly in the HTML page source: you can see company names and links when you use View Source in your browser.
Use the WordPress engine if you see POST requests to /wp-admin/admin-ajax.php in browser DevTools → Network tab when you trigger a search. The response will be a JSON object containing markers and cards data.
HTML engine:

| Key | Type | Default | Description |
|---|---|---|---|
| base_url | string | — | Root URL of the target directory (required) |
| list_path | string | — | Path to the listing/search-results page (required) |
| categories | list | — | Category name objects {name: ...} (required) |
| selectors | dict | — | CSS selectors for card and profile elements (required) |
| all_services | list | [] | Service slugs to iterate; empty = single pass with no filter |
| location_filter_regex | string | "" | Regex applied to profile page text; empty = no filter |
| delay_min / delay_max | float | 1.0 / 2.5 | Per-request random delay range in seconds |
| profile_timeout_seconds | int | 6 | Timeout in seconds for company profile page fetches |
| verify_email | bool | false | SMTP RCPT handshake per extracted email (slow) |
| stop_at | string | "" | 24-hour time to auto-stop, e.g. "23:00" |
| postcode_regex | string | "" | Regex to extract postcode/ZIP from card meta text |
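To make the two regex keys concrete, the sketch below shows how a location_filter_regex and a postcode_regex value might be applied to text. The patterns are simplified illustrations (not full postcode validation), case-insensitive matching is an assumption, and extract_postcode and passes_location_filter are hypothetical names, not functions from the engine.

```python
import re

# Illustrative patterns only; real directories may need stricter ones.
UK_POSTCODE = r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b"
US_ZIP = r"\b\d{5}(?:-\d{4})?\b"


def extract_postcode(text: str, postcode_regex: str) -> str:
    """Return the first match of postcode_regex in text, or '' if none or unset."""
    if not postcode_regex:
        return ""
    match = re.search(postcode_regex, text, flags=re.IGNORECASE)
    return match.group(0).upper() if match else ""


def passes_location_filter(text: str, location_filter_regex: str) -> bool:
    """An empty regex means no filter, matching the table's default."""
    if not location_filter_regex:
        return True
    return re.search(location_filter_regex, text, flags=re.IGNORECASE) is not None


print(extract_postcode("10 Downing St, London SW1A 2AA", UK_POSTCODE))  # SW1A 2AA
```

Swapping UK_POSTCODE for US_ZIP is the only change needed to handle a US directory, which is the point of keeping the pattern in config.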
WordPress engine:

| Key | Type | Default | Description |
|---|---|---|---|
| base_url | string | — | WordPress directory root URL (required) |
| register_path | string | — | Path to the search page for nonce extraction (required) |
| ajax_action | string | — | WordPress AJAX action name (required) |
| sectors | list | — | [{name: ..., category: ...}] objects (required) |
| geo_bounds | dict | none | lat_min/max, lng_min/max bounding box; omit to disable |
| crawl_websites | bool | true | Visit company websites to find emails |
| postcode_regex | string | "" | Regex to extract postcode/ZIP from AJAX card addresses |
| skip_domains | list | [] | Website domains to exclude from company URL extraction |
| junk_domains | list | [...] | Email domains to reject |
| output.header_color | string | "1F4E79" | Hex colour for Excel header rows (no #) |
Full option documentation is in config.yaml.example inside each engine folder.
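As an illustration of the geo_bounds option, the sketch below splits map markers into kept and flagged lists using a lat/lng bounding box. The bound keys follow the table above; the marker shape (dicts with name, lat, and lng) and the function names are assumptions, and the sample coordinates are approximate.

```python
from typing import Dict, List, Optional, Tuple

Marker = Dict[str, object]


def in_bounds(lat: float, lng: float, bounds: Optional[Dict[str, float]]) -> bool:
    """True when the point lies inside the box; no bounds means no filtering."""
    if not bounds:
        return True
    return (bounds["lat_min"] <= lat <= bounds["lat_max"]
            and bounds["lng_min"] <= lng <= bounds["lng_max"])


def split_by_bounds(markers: List[Marker],
                    bounds: Optional[Dict[str, float]]) -> Tuple[List[Marker], List[Marker]]:
    """Return (kept, flagged) so out-of-area records can go to a Flagged sheet."""
    kept, flagged = [], []
    for marker in markers:
        # float() tolerates coordinates delivered as strings
        inside = in_bounds(float(marker["lat"]), float(marker["lng"]), bounds)
        (kept if inside else flagged).append(marker)
    return kept, flagged


# Rough box around Greater London, for illustration only.
LONDON = {"lat_min": 51.28, "lat_max": 51.70, "lng_min": -0.51, "lng_max": 0.33}
```

Keeping the rejected markers instead of discarding them matches the Flagged sheet behaviour described in the Output section.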
| Sheet | Contents |
|---|---|
| Data | Clean, validated records — frozen header row, navy header, alternating row shading, auto-width columns |
| Flagged | Records excluded by geographic filter or failed profile fetches, each with a Flag Reason column |
| Summary | Run metadata: date, source, record counts, email/phone/website hit rates, elapsed time, status |
Timestamped console + file logging. Full DEBUG to file, clean INFO to console.
While the scraper is running, press a key — no Enter needed on any platform. Controls are event-driven and respond instantly:
| Key | Action |
|---|---|
| P | Pause after current page completes |
| R | Resume from pause |
| S | Save checkpoint and stop cleanly |
| W | Print live stats snapshot (saved, flagged, email rate, ETA) |
HTML engine only — write to command.txt in the engine folder (useful for headless / cron runs):
echo pause > command.txt
echo resume > command.txt
echo stop > command.txt
echo fresh > command.txt

| Feature | Trigger | Behaviour |
|---|---|---|
| Stop time | Configured HH:MM | Saves checkpoint and exits cleanly |
| Low disk guard | Free disk < min_free_disk_mb | Auto-pauses; resume with R or echo resume > command.txt |
| Circuit breaker | 3 consecutive HTTP failures | Auto-pauses main session; resumes on R |
| ConnectTimeout skip | TCP-level host unreachable | Returns immediately; dead sites never stall the run |
| One retry on errors | ReadTimeout or connection error | Retries once with backoff before moving on |
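The circuit breaker row can be pictured as a counter of consecutive failures that trips at a threshold and clears on any success. The class below is a minimal sketch of that idea, assuming the threshold of 3 from the table; CircuitBreaker and its methods are hypothetical and do not describe the engine's internal implementation.

```python
class CircuitBreaker:
    """Trips after `threshold` consecutive failures; a success resets the count."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0
        self.open = False  # True means the run should pause

    def record(self, ok: bool) -> bool:
        """Record one request outcome and return whether the breaker is open."""
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True
        return self.open

    def reset(self) -> None:
        """Called when the operator resumes (for example by pressing R)."""
        self.failures = 0
        self.open = False
```

Counting only consecutive failures means an occasional bad response on an otherwise healthy site never pauses the run.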
Both engines save a checkpoint after every listing page. If a run is interrupted for any reason, simply re-run — the checkpoint is detected and the run continues from exactly where it stopped.
python scraper.py # automatically resumes if checkpoint exists
python scraper.py --fresh    # discard checkpoint and start fresh

| Requirement | Version | Notes |
|---|---|---|
| Python | ≥ 3.9 | Tested on 3.9, 3.10, 3.11, 3.12 |
| pip | any current | Bundled with Python |
| Git | any | For cloning |
| OS | Windows / Linux / macOS | Windows and macOS have CI smoke tests |
No browser driver (Playwright / Selenium) needed. Both engines use pure HTTP requests only.
html-directory-scrapers/
├── assets/ ← Preview images and sample output
│ ├── Terminal_Preview.png # Live terminal screenshot
│ ├── Output_Preview.png # Excel output screenshot
│ └── Sample_Output.csv # Sample data rows
├── engines/
│ ├── html/ ← HTML Directory Scraper
│ │ ├── scraper.py # Thin orchestrator (CLI entry point)
│ │ ├── config.py # YAML loader + env var injection
│ │ ├── fetcher.py # httpx client, safe_get, progress, SMTP
│ │ ├── parser.py # parse_cards, scrape_profile, CF email
│ │ ├── exporter.py # 3-sheet Excel workbook
│ │ ├── checkpoint.py # CheckpointManager (atomic JSON)
│ │ ├── controls.py # command.txt watcher, InputController
│ │ ├── config.yaml.example
│ │ ├── .env.example
│ │ └── requirements.txt
│ └── wordpress/ ← WordPress Directory Scraper
│ ├── scraper.py # Thin orchestrator (CLI entry point)
│ ├── config.py # YAML loader + env var injection
│ ├── fetcher.py # Session, AJAX POST, nonce, safe_decode
│ ├── parser.py # parse_cards, filter_by_bounds, profile
│ ├── exporter.py # 3-sheet Excel workbook
│ ├── checkpoint.py # CheckpointManager (atomic JSON)
│ ├── controls.py # InputController (P/R/S/W keyboard)
│ ├── config.yaml.example
│ ├── .env.example
│ └── requirements.txt
├── tests/
│ ├── conftest.py
│ ├── html/
│ │ ├── conftest.py
│ │ └── test_html_engine.py # 59 tests
│ └── wordpress/
│ ├── conftest.py
│ └── test_wordpress_engine.py # 64 tests
├── CHANGELOG.md
├── CONTRIBUTING.md
├── LICENSE
├── pyproject.toml
└── README.md
pip install pytest pytest-cov
pytest tests/ -v --cov=engines --cov-report=term-missing

123 tests across both engines. No network calls. No API key required. Full suite runs offline in under 10 seconds.
"Config file not found: config.yaml"
Copy the annotated example and fill in your values:
cp config.yaml.example config.yaml

Scraper runs but saves zero records
- Open the target URL in your browser and press Ctrl+U (View Source). If you see JavaScript but no company cards, the directory is JS-rendered and not compatible; see Known Limitations.
- Check DevTools → Network tab for POST requests to /wp-admin/admin-ajax.php. If present, you need the WordPress engine, not the HTML engine.
- Verify selectors.card_container matches at least one element on the listing page.
Keyboard controls not responding
Press W first — if the stats line prints, the listener is working and the scraper is running normally (a slow site may take 10–30 s per page). If no response, check that your terminal window has focus. On macOS, grant Accessibility permissions: System Settings → Privacy & Security → Accessibility → add your terminal app.
HTTP 429 Too Many Requests
Increase delay_min and delay_max in config.yaml, or reduce --workers to 1:
delay_min: 3.0
delay_max: 6.0

Excel output locked / PermissionError
Close the previous output file in Excel before running — Excel holds an exclusive lock on open .xlsx files.
WordPress engine: "Nonce not found"
The scraper visits register_path to extract the nonce. Check that register_path is the correct path to the directory's search page (not the homepage). Open the page in a browser and search for "nonce" in the HTML source to confirm it is embedded there.
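When checking a page source for the nonce by hand, it can help to know what to look for. The sketch below searches for two common embedding styles: a "nonce" key in an inline JSON object and a hidden form input. Both patterns are assumptions about typical WordPress themes and plugins, not the patterns the engine uses, and find_nonce is a hypothetical name.

```python
import re
from typing import Optional

# Assumed embedding styles; a given site may use a different key or markup.
NONCE_PATTERNS = [
    r'"nonce"\s*:\s*"([0-9a-f]+)"',                       # inline JS object
    r'name="[^"]*nonce[^"]*"\s+value="([0-9a-f]+)"',      # hidden form input
]


def find_nonce(html: str) -> Optional[str]:
    """Return the first nonce-like token found in html, or None."""
    for pattern in NONCE_PATTERNS:
        match = re.search(pattern, html, flags=re.IGNORECASE)
        if match:
            return match.group(1)
    return None
```

If find_nonce returns None for the saved source of the page at register_path, the path most likely points at the wrong page or the site embeds the token under another key.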
- JavaScript-rendered directories are not supported. If listings only appear after JavaScript executes, neither engine will find any cards. Confirm via View Source before configuring.
- WordPress nonce expiry on very long runs. Nonces typically expire after 24 hours. The engine detects expiry and refreshes automatically. If a run exceeds 24 hours and stops finding records, restart with --fresh.
- Rate limiting is per-worker, not per-run. With --workers 3 and delay_min: 1.0, up to 3 concurrent requests fire per second during profile fetches. Reduce workers if you hit 429 responses.
- SMTP email verification is slow. verify_email: true adds a DNS + SMTP handshake per email. On a 500-record run this adds 10–20 minutes. Use only when email accuracy is critical.
This toolkit is one component of a broader B2B lead generation pipeline.
| Repo | What it does |
|---|---|
| HTML Directory Scrapers ← you are here | Two-engine toolkit for HTML and WordPress AJAX directories |
| JSON Directory Harvester | Configurable harvester for any JSON directory API |
| Google Maps Business Scraper | Extracts and enriches business listings from Google Maps |
| Email Phone Enrichment Tool | Converts a website list into a verified email + phone database |
| LeadHunter Pro | Multi-engine search scraper with HOT/WARM/COLD lead scoring |
| Trustpilot Business Scraper | Extracts business contact data from Trustpilot search results |
All tools share the same three-sheet Excel output schema (Data · Flagged · Summary) — results combine directly in Excel or import together into a CRM.

