The [crawler] section controls how squirrelscan discovers and fetches pages.

Configuration

[crawler]
max_pages = 500
delay_ms = 100
timeout_ms = 30000
concurrency = 5
per_host_concurrency = 2
per_host_delay_ms = 200
include = []
exclude = []
allow_query_params = []
drop_query_prefixes = ["utm_", "gclid", "fbclid"]
respect_robots = true
breadth_first = true
max_prefix_budget = 0.25
request_method = "browser_impersonate"
impersonate_browser = "chrome_131"
user_agent = ""
follow_redirects = true

Crawl Limits

max_pages

Type: number
Default: 500
Range: 1 to 10000 (capped by CLI)
Maximum number of pages to crawl per audit.
Examples: Small site:
[crawler]
max_pages = 50
Large site:
[crawler]
max_pages = 2000
CLI override:
squirrel audit https://example.com -m 100
Note: The CLI enforces a hard cap (currently 10,000 pages) regardless of config.

timeout_ms

Type: number
Default: 30000 (30 seconds)
Range: 1000 to 60000 recommended
Timeout for each page request, in milliseconds.
Examples: Fast timeout for quick sites:
[crawler]
timeout_ms = 10000  # 10 seconds
Slow sites or APIs:
[crawler]
timeout_ms = 45000  # 45 seconds
When a request exceeds the timeout:
  • Page marked as failed
  • Crawl continues with next URL
  • Logged in error output

Rate Limiting

delay_ms

Type: number
Default: 100 (100ms)
Base delay between requests, in milliseconds.
Examples: Fast crawl (be careful):
[crawler]
delay_ms = 50
Polite crawl:
[crawler]
delay_ms = 500
No delay (local development only):
[crawler]
delay_ms = 0
Note: Actual delays depend on per_host_delay_ms and concurrency settings.

per_host_delay_ms

Type: number
Default: 200 (200ms)
Minimum delay between requests to the same host. Ensures politeness even with high concurrency.
Examples: Aggressive (use cautiously):
[crawler]
per_host_delay_ms = 100
Very polite:
[crawler]
per_host_delay_ms = 1000  # 1 second
How it works: With per_host_concurrency = 2 and per_host_delay_ms = 200:
  • At most 2 concurrent requests to same host
  • At least 200ms between requests to same host
  • Other hosts can be fetched simultaneously
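A rough sketch of the per-host delay idea (illustrative only; names and structure are assumptions, not squirrelscan's actual scheduler):
import asyncio
import time
from urllib.parse import urlparse

PER_HOST_DELAY_MS = 200            # per_host_delay_ms
_next_slot: dict[str, float] = {}  # host -> earliest time the next request may start
_lock = asyncio.Lock()

async def wait_for_host(url: str) -> None:
    """Sleep just long enough to keep at least 200ms between requests to one host."""
    host = urlparse(url).netloc
    async with _lock:
        now = time.monotonic()
        start = max(now, _next_slot.get(host, 0.0))
        _next_slot[host] = start + PER_HOST_DELAY_MS / 1000
    await asyncio.sleep(start - now)  # zero if the host has been idle long enough
Requests to different hosts never wait on each other here, which matches the last bullet above.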

concurrency

Type: number
Default: 5
Range: 1 to 20 recommended
Maximum number of concurrent requests globally.
Examples: Sequential (single request at a time):
[crawler]
concurrency = 1
Moderate parallelism:
[crawler]
concurrency = 10
High parallelism (use cautiously):
[crawler]
concurrency = 20
Impact:
  • Higher = faster crawls
  • Higher = more server load
  • Per-host parallelism is still capped by per_host_concurrency

per_host_concurrency

Type: number
Default: 2
Range: 1 to 5 recommended
Maximum number of concurrent requests per host. Prevents overwhelming a single server even with high global concurrency.
Examples: One request per host at a time:
[crawler]
per_host_concurrency = 1
Allow more parallel requests:
[crawler]
per_host_concurrency = 4
How it interacts with concurrency:
[crawler]
concurrency = 10
per_host_concurrency = 2
  • Up to 10 total concurrent requests
  • At most 2 concurrent requests to any single host
  • At 2 requests per host, up to 5 different hosts can be fetched simultaneously
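One way to picture this interaction is as two nested limits; a minimal sketch (assumed semantics, not the real implementation):
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

GLOBAL_LIMIT = 10        # concurrency
PER_HOST_LIMIT = 2       # per_host_concurrency

global_slots = asyncio.Semaphore(GLOBAL_LIMIT)
host_slots: dict[str, asyncio.Semaphore] = defaultdict(lambda: asyncio.Semaphore(PER_HOST_LIMIT))

async def fetch(url: str) -> None:
    host = urlparse(url).netloc
    # Take the per-host slot first so a slow host doesn't tie up global capacity.
    async with host_slots[host]:
        async with global_slots:
            ...  # perform the HTTP request here
With these numbers, no more than 10 requests run at once, and no more than 2 of them target the same host.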

URL Filtering

include

Type: string[]
Default: [] (empty = include all URLs from the seed domain)
URL patterns to include. If set, only matching URLs are crawled.
Pattern Syntax: Uses glob syntax:
  • * - Match anything except /
  • ** - Match anything including /
  • ? - Match single character
  • [abc] - Match character set
Examples: Only crawl blog:
[crawler]
include = ["/blog/**"]
Multiple sections:
[crawler]
include = ["/blog/**", "/docs/**", "/products/**"]
Specific file types:
[crawler]
include = ["*.html", "*.htm"]
Important: When include is set, it overrides the domains setting in [project].

exclude

Type: string[]
Default: [] (empty = exclude nothing)
URL patterns to exclude from crawling. Takes precedence over include - if a URL matches both, it’s excluded.
Examples: Exclude admin areas:
[crawler]
exclude = ["/admin/**", "/wp-admin/**"]
Exclude file types:
[crawler]
exclude = ["*.pdf", "*.zip", "*.tar.gz"]
Exclude API endpoints:
[crawler]
exclude = ["/api/**", "/v1/**"]
Exclude query parameters:
[crawler]
exclude = ["*?preview=*", "*?draft=*"]
Common exclusions:
[crawler]
exclude = [
  "/admin/**",
  "/wp-admin/**",
  "/wp-content/**",
  "/api/**",
  "*.pdf",
  "*.zip",
  "*.jpg",
  "*.png",
  "*?preview=*",
  "*?print=*"
]

Pattern Matching Examples

Pattern         Matches                           Doesn’t Match
/blog/*         /blog/post                        /blog/post/comment
/blog/**        /blog/post, /blog/post/comment    /about
*.pdf           /file.pdf, /docs/guide.pdf        /file.html
*?preview=*     /page?preview=true                /page
/api/*/users    /api/v1/users                     /api/v1/v2/users
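To make these semantics concrete, here is a small, hypothetical matcher (not the one squirrelscan ships) that reproduces the table above. It assumes patterns starting with / are anchored to the start of the path, other patterns may match anywhere before the end, and it omits [abc] character sets for brevity:
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    parts, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")        # ** crosses path separators
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # * stops at /
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")      # ? matches a single character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    anchor = "^" if pattern.startswith("/") else ""
    return re.compile(anchor + "".join(parts) + "$")

def should_crawl(path: str, include: list[str], exclude: list[str]) -> bool:
    if any(glob_to_regex(p).search(path) for p in exclude):
        return False                  # exclude wins when a URL matches both
    if not include:
        return True                   # empty include list = crawl everything
    return any(glob_to_regex(p).search(path) for p in include)

assert should_crawl("/blog/post/comment", ["/blog/**"], ["*.pdf"])
assert not should_crawl("/docs/guide.pdf", [], ["*.pdf"])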

Query Parameters

allow_query_params

Type: string[]
Default: [] (empty = drop all query params for deduplication)
Query parameters to preserve during URL deduplication.
Why this matters: URLs are deduplicated before crawling:
  • /page?id=1&utm_source=google becomes /page?id=1 (utm_source dropped)
During deduplication, all query params are dropped except those listed in allow_query_params.
Examples: Preserve pagination:
[crawler]
allow_query_params = ["page"]
Preserve filters:
[crawler]
allow_query_params = ["category", "sort", "filter", "q"]
Preserve all query params:
[crawler]
allow_query_params = ["*"]
Use case: E-commerce site with filters:
[crawler]
allow_query_params = ["category", "price", "brand", "page"]
This preserves:
  • /products?category=shoes
  • /products?category=shoes&page=2
This drops:
  • /products?utm_source=google ✗ (becomes /products)
  • /products?gclid=abc123 ✗ (becomes /products)
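A rough illustration of this normalization (hypothetical code, not squirrelscan's; it also applies drop_query_prefixes, covered next):
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

ALLOW = {"category", "price", "brand", "page"}   # allow_query_params
DROP_PREFIXES = ("utm_", "gclid", "fbclid")      # drop_query_prefixes

def dedup_key(url: str) -> str:
    """Keep only allowed query params; always strip tracking prefixes."""
    parts = urlsplit(url)
    kept = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if ("*" in ALLOW or k in ALLOW) and not k.startswith(DROP_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))

print(dedup_key("https://shop.example/products?category=shoes&page=2"))
# https://shop.example/products?category=shoes&page=2
print(dedup_key("https://shop.example/products?utm_source=google"))
# https://shop.example/products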

drop_query_prefixes

Type: string[]
Default: ["utm_", "gclid", "fbclid"]
Query parameter prefixes to always drop, even if in allow_query_params.
Default tracking params dropped:
  • utm_* - Google Analytics (utm_source, utm_medium, etc.)
  • gclid - Google Ads
  • fbclid - Facebook Ads
Examples: Drop more tracking params:
[crawler]
drop_query_prefixes = [
  "utm_",
  "gclid",
  "fbclid",
  "mc_",       # Mailchimp
  "_ga",       # Google Analytics
  "ref",       # Referrer
  "source"     # Generic source tracking
]
Drop nothing:
[crawler]
drop_query_prefixes = []

Crawl Strategy

breadth_first

Type: boolean
Default: true
Use breadth-first crawling for better site coverage.
Breadth-first (default):
  • Crawls level-by-level
  • Discovers homepage, then all links from homepage, then all links from those pages
  • Better site coverage
  • Avoids getting stuck in deep paths
Depth-first (false):
  • Crawls as deep as possible before backtracking
  • Can get stuck in deep sections
  • Less even coverage
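The difference comes down to how the URL frontier is ordered; a toy sketch (not the real crawler) of the two strategies:
from collections import deque

def crawl_order(start, links, breadth_first=True):
    """Visit order for a toy link graph: queue for breadth-first, stack for depth-first."""
    frontier, seen, order = deque([start]), {start}, []
    while frontier:
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order

site = {"/": ["/blog", "/about"], "/blog": ["/blog/post-1", "/blog/post-2"]}
print(crawl_order("/", site, True))   # ['/', '/blog', '/about', '/blog/post-1', '/blog/post-2']
print(crawl_order("/", site, False))  # ['/', '/about', '/blog', '/blog/post-2', '/blog/post-1']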
Example: Disable breadth-first:
[crawler]
breadth_first = false
Recommendation: Keep true (default) for most sites.

max_prefix_budget

Type: number
Default: 0.25 (25%)
Range: 0.1 to 1.0
Maximum fraction of the crawl budget for any single path prefix. Prevents the crawler from spending all pages on one section (e.g., /blog/ with 1000+ posts).
How it works: With max_pages = 500 and max_prefix_budget = 0.25:
  • At most 125 pages (25%) from any single path prefix
  • Ensures diverse coverage across site sections
Examples: More strict (better coverage):
[crawler]
max_prefix_budget = 0.15  # Max 15% per prefix
More lenient (deeper coverage):
[crawler]
max_prefix_budget = 0.5   # Max 50% per prefix
Disable budget (not recommended):
[crawler]
max_prefix_budget = 1.0   # No limit
Use case: Site with large blog:
/blog/post-1
/blog/post-2
...
/blog/post-5000
/about
/contact
With max_prefix_budget = 0.25 and max_pages = 500:
  • At most 125 blog posts crawled
  • Remaining budget for other sections
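A minimal sketch of that arithmetic (assuming, purely for illustration, that the prefix is the first path segment; the real grouping may differ):
from urllib.parse import urlparse

MAX_PAGES = 500
MAX_PREFIX_BUDGET = 0.25
prefix_cap = int(MAX_PAGES * MAX_PREFIX_BUDGET)   # 125 pages per prefix
crawled: dict[str, int] = {}

def within_budget(url: str) -> bool:
    path = urlparse(url).path.strip("/")
    prefix = "/" + path.split("/")[0] if path else "/"
    if crawled.get(prefix, 0) >= prefix_cap:
        return False      # e.g. /blog has used its 25% share; skip this URL
    crawled[prefix] = crawled.get(prefix, 0) + 1
    return True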

Request Configuration

request_method

Type: "standard" | "browser_impersonate"
Default: "browser_impersonate"
Request method for fetching pages.
Options:
"browser_impersonate" (default):
  • Uses TLS fingerprinting to appear as a real browser
  • Better success rate with bot protection
  • Bypasses basic bot detection
  • Slightly slower
"standard":
  • Standard HTTP fetch
  • Faster
  • May be blocked by aggressive bot protection
Examples: Use standard for local development:
[crawler]
request_method = "standard"
Use browser impersonation (default):
[crawler]
request_method = "browser_impersonate"
impersonate_browser = "chrome_131"

impersonate_browser

Type: "chrome_131" | "chrome_120" | "firefox_133" | "firefox_120" | "safari_ios_18_0"
Default: "chrome_131"
Browser to impersonate when using request_method = "browser_impersonate".
Options:
  • "chrome_131" - Chrome 131 (latest)
  • "chrome_120" - Chrome 120
  • "firefox_133" - Firefox 133 (latest)
  • "firefox_120" - Firefox 120
  • "safari_ios_18_0" - Safari on iOS 18
Examples: Impersonate Firefox:
[crawler]
request_method = "browser_impersonate"
impersonate_browser = "firefox_133"
Impersonate mobile Safari:
[crawler]
request_method = "browser_impersonate"
impersonate_browser = "safari_ios_18_0"
Note: Only applies when request_method = "browser_impersonate".

user_agent

Type: string
Default: "" (empty = random browser user agent per crawl)
Custom user agent string.
Default behavior (empty string): Random modern browser user agent, refreshed per crawl:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
Examples: Custom user agent:
[crawler]
user_agent = "SquirrelScan Bot (https://squirrelscan.com)"
Mobile user agent:
[crawler]
user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 18_0 like Mac OS X) AppleWebKit/605.1.15"
Recommendation: Leave empty (default) for best results with bot protection.

follow_redirects

Type: boolean
Default: true
Follow HTTP 3xx redirects.
When true (default):
  • Follows redirects automatically
  • Crawls final destination URL
  • Redirect chains tracked for analysis
When false:
  • Stops at redirect
  • Does not fetch redirect destination
  • Useful for debugging redirect issues
Example: Disable redirect following:
[crawler]
follow_redirects = false
Recommendation: Keep true (default) for normal audits.

Robots.txt

respect_robots

Type: boolean
Default: true
Obey robots.txt rules and crawl-delay directives.
When true (default):
  • Fetches and parses robots.txt
  • Respects Disallow: rules
  • Honors Crawl-delay: directive
  • Polite and ethical
When false:
  • Ignores robots.txt
  • Crawls all URLs (including disallowed)
  • Use only for your own sites
Example: Ignore robots.txt (testing only):
[crawler]
respect_robots = false
Recommendation: Always keep true (default) when crawling third-party sites.

Complete Examples

Fast Local Development

[crawler]
max_pages = 50
delay_ms = 0
per_host_delay_ms = 0
concurrency = 10
request_method = "standard"
respect_robots = false

Polite Production Crawl

[crawler]
max_pages = 500
delay_ms = 200
per_host_delay_ms = 500
concurrency = 5
per_host_concurrency = 2
respect_robots = true
request_method = "browser_impersonate"
impersonate_browser = "chrome_131"

High-Volume Crawl

[crawler]
max_pages = 2000
delay_ms = 100
per_host_delay_ms = 200
concurrency = 10
per_host_concurrency = 3
breadth_first = true
max_prefix_budget = 0.2

Focused Blog Crawl

[crawler]
max_pages = 200
include = ["/blog/**"]
exclude = ["*.pdf", "/blog/drafts/**"]
allow_query_params = ["page"]

E-commerce Site

[crawler]
max_pages = 1000
include = ["/products/**", "/categories/**"]
exclude = ["/cart/**", "/checkout/**", "/account/**"]
allow_query_params = ["category", "sort", "page", "filter"]
drop_query_prefixes = ["utm_", "gclid", "fbclid", "ref"]