The [crawler] section controls how squirrelscan discovers and fetches pages.
Configuration
Crawl Limits
max_pages
Type: number
Default: 500
Range: 1 to 10000 (capped by CLI)
Maximum number of pages to crawl per audit.
Examples:
Small site:
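A cap like this might suit a small site (the value is illustrative):

```toml
[crawler]
max_pages = 100   # stop after 100 pages
```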
timeout_ms
Type: number
Default: 30000 (30 seconds)
Range: 1000 to 60000 recommended
Timeout for each page request in milliseconds.
Examples:
Fast timeout for quick sites:
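For quick, well-behaved sites, a lower timeout might look like this (value illustrative):

```toml
[crawler]
timeout_ms = 5000   # give up on a page after 5 seconds
```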
When a request times out:
- Page marked as failed
- Crawl continues with the next URL
- Failure logged in error output
Rate Limiting
delay_ms
Type: number
Default: 100 (100ms)
Base delay between requests in milliseconds.
Examples:
Fast crawl (be careful):
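A sketch of a faster crawl; the value is illustrative and only appropriate for servers you control:

```toml
[crawler]
delay_ms = 25   # 25ms between requests
```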
See also the per_host_delay_ms and concurrency settings below, which also affect the effective request rate.
per_host_delay_ms
Type: number
Default: 200 (200ms)
Minimum delay between requests to the same host.
Ensures politeness even with high concurrency.
Examples:
Aggressive (use cautiously):
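An aggressive setting might look like this (illustrative; use only where you are sure the host can handle it):

```toml
[crawler]
per_host_delay_ms = 50   # only 50ms between requests to the same host
```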
With per_host_concurrency = 2 and per_host_delay_ms = 200:
- At most 2 concurrent requests to same host
- At least 200ms between requests to same host
- Other hosts can be fetched simultaneously
concurrency
Type: number
Default: 5
Range: 1 to 20 recommended
Maximum number of concurrent requests globally.
Examples:
Sequential (single request at a time):
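A fully sequential crawl might be configured like this:

```toml
[crawler]
concurrency = 1   # one request in flight at a time
```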
Trade-offs:
- Higher = faster crawls
- Higher = more server load
- Per-host requests remain bounded by per_host_concurrency
per_host_concurrency
Type: number
Default: 2
Range: 1 to 5 recommended
Maximum number of concurrent requests per host.
Prevents overwhelming a single server even with high global concurrency.
Examples:
One request per host at a time:
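Limiting to one in-flight request per host might look like this:

```toml
[crawler]
per_host_concurrency = 1   # at most one concurrent request per host
```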
Interaction with the global concurrency setting, e.g. concurrency = 10 and per_host_concurrency = 2:
- Up to 10 total concurrent requests
- At most 2 concurrent requests to any single host
- Can fetch from up to 5 different hosts simultaneously
URL Filtering
include
Type: string[]
Default: [] (empty = include all URLs from seed domain)
URL patterns to include. If set, only matching URLs are crawled.
Pattern Syntax:
Uses glob syntax:
- * - Match anything except /
- ** - Match anything including /
- ? - Match a single character
- [abc] - Match a character set
If include is set, it overrides the domains setting in [project].
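For instance, a crawl restricted to a couple of sections might use patterns like these (paths are illustrative):

```toml
[crawler]
include = ["/blog/**", "/docs/**"]   # only crawl these sections
```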
exclude
Type: string[]
Default: [] (empty = exclude nothing)
URL patterns to exclude from crawling.
Takes precedence over include - if a URL matches both, it’s excluded.
Examples:
Exclude admin areas:
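One possible exclusion list (patterns illustrative):

```toml
[crawler]
exclude = ["/admin/**", "/wp-admin/**", "*?preview=*"]
```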
Pattern Matching Examples
| Pattern | Matches | Doesn’t Match |
|---|---|---|
| /blog/* | /blog/post | /blog/post/comment |
| /blog/** | /blog/post, /blog/post/comment | /about |
| *.pdf | /file.pdf, /docs/guide.pdf | /file.html |
| *?preview=* | /page?preview=true | /page |
| /api/*/users | /api/v1/users | /api/v1/v2/users |
Query Parameters
allow_query_params
Type: string[]
Default: [] (empty = drop all query params for deduplication)
Query parameters to preserve during URL deduplication.
Why this matters:
URLs are deduplicated before crawling:
/page?id=1&utm_source=google → /page?id=1 (utm_source dropped)
Parameters that should survive deduplication must be listed in allow_query_params.
Examples:
Preserve pagination:
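Assuming category and page parameters matter for this site, the allow list might look like this (parameter names illustrative):

```toml
[crawler]
allow_query_params = ["category", "page"]   # keep these during deduplication
```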
- /products?category=shoes ✓
- /products?category=shoes&page=2 ✓
- /products?utm_source=google ✗ (becomes /products)
- /products?gclid=abc123 ✗ (becomes /products)
drop_query_prefixes
Type: string[]
Default: ["utm_", "gclid", "fbclid"]
Query parameter prefixes to always drop, even if in allow_query_params.
Default tracking params dropped:
- utm_* - Google Analytics (utm_source, utm_medium, etc.)
- gclid - Google Ads
- fbclid - Facebook Ads
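To extend the defaults, a config might add further prefixes; the extra entry here is illustrative:

```toml
[crawler]
drop_query_prefixes = ["utm_", "gclid", "fbclid", "mc_"]   # defaults plus Mailchimp-style params
```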
Crawl Strategy
breadth_first
Type: boolean
Default: true
Use breadth-first crawling for better site coverage.
Breadth-first (default):
- Crawls level-by-level
- Discovers homepage, then all links from homepage, then all links from those pages
- Better site coverage
- Avoids getting stuck in deep paths
Depth-first (false):
- Crawls as deep as possible before backtracking
- Can get stuck in deep sections
- Less even coverage
Recommended: keep true (default) for most sites.
max_prefix_budget
Type: number
Default: 0.25 (25%)
Range: 0.1 to 1.0
Maximum percentage of crawl budget for any single path prefix.
Prevents the crawler from spending all pages on one section (e.g., /blog/ with 1000+ posts).
How it works:
With max_pages = 500 and max_prefix_budget = 0.25:
- At most 125 pages (25%) from any single path prefix
- Ensures diverse coverage across site sections
For a /blog/ section with 1000+ posts, with max_prefix_budget = 0.25 and max_pages = 500:
- At most 125 blog posts crawled
- Remaining budget for other sections
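Expressed as configuration, that scenario might look like this:

```toml
[crawler]
max_pages = 500
max_prefix_budget = 0.25   # no single prefix (e.g. /blog/) gets more than 125 pages
```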
Request Configuration
request_method
Type: "standard" | "browser_impersonate"
Default: "browser_impersonate"
Request method for fetching pages.
Options:
"browser_impersonate" (default):
- Uses TLS fingerprinting to appear as a real browser
- Better success rate with bot protection
- Bypasses basic bot detection
- Slightly slower
"standard":
- Standard HTTP fetch
- Faster
- May be blocked by aggressive bot protection
impersonate_browser
Type: "chrome_131" | "chrome_120" | "firefox_133" | "firefox_120" | "safari_ios_18_0"
Default: "chrome_131"
Browser to impersonate when using request_method = "browser_impersonate".
Options:
"chrome_131"- Chrome 131 (latest)"chrome_120"- Chrome 120"firefox_133"- Firefox 133 (latest)"firefox_120"- Firefox 120"safari_ios_18_0"- Safari on iOS 18
request_method = "browser_impersonate".
user_agent
Type: string
Default: "" (empty = random browser user agent per crawl)
Custom user agent string.
Default behavior (empty string):
A random modern browser user agent is used, refreshed on each crawl.
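To pin a specific string instead, set user_agent explicitly; the value below is a hypothetical placeholder:

```toml
[crawler]
user_agent = "MyAuditBot/1.0 (+https://example.com/bot)"   # hypothetical custom UA
```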
follow_redirects
Type: boolean
Default: true
Follow HTTP 3xx redirects.
When true (default):
- Follows redirects automatically
- Crawls final destination URL
- Redirect chains tracked for analysis
When false:
- Stops at redirect
- Does not fetch redirect destination
- Useful for debugging redirect issues
Keep true (default) for normal audits.
Robots.txt
respect_robots
Type: boolean
Default: true
Obey robots.txt rules and crawl-delay directives.
When true (default):
- Fetches and parses robots.txt
- Respects Disallow: rules
- Honors Crawl-delay: directives
- Polite and ethical
When false:
- Ignores robots.txt
- Crawls all URLs (including disallowed)
- Use only for your own sites
Keep true (default) when crawling third-party sites.
Complete Examples
Fast Local Development
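A sketch of settings that might suit a fast local run; all values are illustrative:

```toml
[crawler]
max_pages = 100
timeout_ms = 5000
delay_ms = 10
per_host_delay_ms = 10
concurrency = 10
request_method = "standard"
respect_robots = false   # only for environments you own
```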
Polite Production Crawl
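A polite production crawl might look something like this (values illustrative):

```toml
[crawler]
max_pages = 500
delay_ms = 250
per_host_delay_ms = 500
concurrency = 3
per_host_concurrency = 1
respect_robots = true
follow_redirects = true
```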
High-Volume Crawl
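For larger sites, one might push limits toward the upper ends of the recommended ranges (values illustrative):

```toml
[crawler]
max_pages = 10000
concurrency = 20
per_host_concurrency = 4
delay_ms = 50
per_host_delay_ms = 100
breadth_first = true
max_prefix_budget = 0.25
```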
Focused Blog Crawl
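A crawl focused on the blog might combine include, exclude, and query handling like this (patterns illustrative):

```toml
[crawler]
max_pages = 300
include = ["/blog/**"]
exclude = ["*?preview=*"]
allow_query_params = ["page"]
```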
E-commerce Site
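An e-commerce crawl might preserve faceting and pagination parameters while skipping transactional areas (patterns and values illustrative):

```toml
[crawler]
max_pages = 2000
exclude = ["/cart/**", "/checkout/**", "/account/**"]
allow_query_params = ["category", "page"]
drop_query_prefixes = ["utm_", "gclid", "fbclid"]
max_prefix_budget = 0.25
```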
Related
- Project Settings - Domains configuration
- Rules Configuration - Which rules to run
- Examples - More configuration examples