The [crawler] section controls how squirrelscan discovers and fetches pages.

Configuration

[crawler]
max_pages = 500
delay_ms = 100
timeout_ms = 30000
concurrency = 5
per_host_concurrency = 2
per_host_delay_ms = 200
include = []
exclude = []
allow_query_params = []
drop_query_prefixes = ["utm_", "gclid", "fbclid"]
respect_robots = true
breadth_first = true
max_prefix_budget = 0.25
request_method = "browser_impersonate"
impersonate_browser = "chrome_131"
user_agent = ""
follow_redirects = true

Crawl Limits

max_pages

Type: number
Default: 500
Range: 1 to 10000 (capped by CLI)
Maximum number of pages to crawl per audit.
Examples: Small site:
[crawler]
max_pages = 50
Large site:
[crawler]
max_pages = 2000
CLI override:
squirrel audit https://example.com -m 100
Note: The CLI enforces a hard cap (currently 10,000 pages) regardless of config.

timeout_ms

Type: number
Default: 30000 (30 seconds)
Range: 1000 to 60000 recommended
Timeout for each page request, in milliseconds.
Examples: Fast timeout for quick sites:
[crawler]
timeout_ms = 10000  # 10 seconds
Slow sites or APIs:
[crawler]
timeout_ms = 45000  # 45 seconds
When a request exceeds the timeout:
  • Page marked as failed
  • Crawl continues with next URL
  • Logged in error output

Rate Limiting

delay_ms

Type: number
Default: 100 (100ms)
Base delay between requests, in milliseconds.
Examples: Fast crawl (be careful):
[crawler]
delay_ms = 50
Polite crawl:
[crawler]
delay_ms = 500
No delay (local development only):
[crawler]
delay_ms = 0
Note: Actual delays depend on per_host_delay_ms and concurrency settings.

per_host_delay_ms

Type: number
Default: 200 (200ms)
Minimum delay between requests to the same host. Ensures politeness even with high concurrency.
Examples: Aggressive (use cautiously):
[crawler]
per_host_delay_ms = 100
Very polite:
[crawler]
per_host_delay_ms = 1000  # 1 second
How it works: With per_host_concurrency = 2 and per_host_delay_ms = 200:
  • At most 2 concurrent requests to same host
  • At least 200ms between requests to same host
  • Other hosts can be fetched simultaneously
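A rough sketch of the per-host delay idea (illustrative only; names and structure are assumptions, not squirrelscan's actual scheduler):
import asyncio
import time
from urllib.parse import urlparse

PER_HOST_DELAY_MS = 200            # per_host_delay_ms
_next_slot: dict[str, float] = {}  # host -> earliest time the next request may start
_lock = asyncio.Lock()

async def wait_for_host(url: str) -> None:
    """Sleep just long enough to keep at least 200ms between requests to one host."""
    host = urlparse(url).netloc
    async with _lock:
        now = time.monotonic()
        start = max(now, _next_slot.get(host, 0.0))
        _next_slot[host] = start + PER_HOST_DELAY_MS / 1000
    await asyncio.sleep(start - now)  # zero if the host has been idle long enough
Requests to different hosts never wait on each other here, which matches the last bullet above.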

concurrency

Type: number
Default: 5
Range: 1 to 20 recommended
Maximum number of concurrent requests globally.
Examples: Sequential (single request at a time):
[crawler]
concurrency = 1
Moderate parallelism:
[crawler]
concurrency = 10
High parallelism (use cautiously):
[crawler]
concurrency = 20
Impact:
  • Higher = faster crawls
  • Higher = more server load
  • Per-host parallelism is still capped by per_host_concurrency

per_host_concurrency

Type: number
Default: 2
Range: 1 to 5 recommended
Maximum number of concurrent requests per host. Prevents overwhelming a single server even with high global concurrency.
Examples: One request per host at a time:
[crawler]
per_host_concurrency = 1
Allow more parallel requests:
[crawler]
per_host_concurrency = 4
How it interacts with concurrency:
[crawler]
concurrency = 10
per_host_concurrency = 2
  • Up to 10 total concurrent requests
  • At most 2 concurrent requests to any single host
  • At 2 requests per host, up to 5 different hosts can be fetched simultaneously
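One way to picture this interaction is as two nested limits; a minimal sketch (assumed semantics, not the real implementation):
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

GLOBAL_LIMIT = 10        # concurrency
PER_HOST_LIMIT = 2       # per_host_concurrency

global_slots = asyncio.Semaphore(GLOBAL_LIMIT)
host_slots: dict[str, asyncio.Semaphore] = defaultdict(lambda: asyncio.Semaphore(PER_HOST_LIMIT))

async def fetch(url: str) -> None:
    host = urlparse(url).netloc
    # Take the per-host slot first so a slow host doesn't tie up global capacity.
    async with host_slots[host]:
        async with global_slots:
            ...  # perform the HTTP request here
With these numbers, no more than 10 requests run at once, and no more than 2 of them target the same host.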

URL Filtering

include

Type: string[]
Default: [] (empty = include all URLs from the seed domain)
URL patterns to include. If set, only matching URLs are crawled.
Pattern Syntax: Uses glob syntax:
  • * - Match anything except /
  • ** - Match anything including /
  • ? - Match single character
  • [abc] - Match character set
Examples: Only crawl blog:
[crawler]
include = ["/blog/**"]
Multiple sections:
[crawler]
include = ["/blog/**", "/docs/**", "/products/**"]
Specific file types:
[crawler]
include = ["*.html", "*.htm"]
Important: When include is set, it overrides the domains setting in [project].

exclude

Type: string[]
Default: [] (empty = exclude nothing)
URL patterns to exclude from crawling. Takes precedence over include - if a URL matches both, it’s excluded.
Examples: Exclude admin areas:
[crawler]
exclude = ["/admin/**", "/wp-admin/**"]
Exclude file types:
[crawler]
exclude = ["*.pdf", "*.zip", "*.tar.gz"]
Exclude API endpoints:
[crawler]
exclude = ["/api/**", "/v1/**"]
Exclude query parameters:
[crawler]
exclude = ["*?preview=*", "*?draft=*"]
Common exclusions:
[crawler]
exclude = [
  "/admin/**",
  "/wp-admin/**",
  "/wp-content/**",
  "/api/**",
  "*.pdf",
  "*.zip",
  "*.jpg",
  "*.png",
  "*?preview=*",
  "*?print=*"
]

Pattern Matching Examples

Pattern         Matches                           Doesn’t Match
/blog/*         /blog/post                        /blog/post/comment
/blog/**        /blog/post, /blog/post/comment    /about
*.pdf           /file.pdf, /docs/guide.pdf        /file.html
*?preview=*     /page?preview=true                /page
/api/*/users    /api/v1/users                     /api/v1/v2/users
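To make these semantics concrete, here is a small, hypothetical matcher (not the one squirrelscan ships) that reproduces the table above. It assumes patterns starting with / are anchored to the start of the path, other patterns may match anywhere before the end, and it omits [abc] character sets for brevity:
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    parts, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")        # ** crosses path separators
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # * stops at /
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")      # ? matches a single character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    anchor = "^" if pattern.startswith("/") else ""
    return re.compile(anchor + "".join(parts) + "$")

def should_crawl(path: str, include: list[str], exclude: list[str]) -> bool:
    if any(glob_to_regex(p).search(path) for p in exclude):
        return False                  # exclude wins when a URL matches both
    if not include:
        return True                   # empty include list = crawl everything
    return any(glob_to_regex(p).search(path) for p in include)

assert should_crawl("/blog/post/comment", ["/blog/**"], ["*.pdf"])
assert not should_crawl("/docs/guide.pdf", [], ["*.pdf"])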

Query Parameters

allow_query_params

Type: string[]
Default: [] (empty = drop all query params for deduplication)
Query parameters to preserve during URL deduplication.
Why this matters: URLs are deduplicated before crawling:
  • /page?id=1&utm_source=google becomes /page?id=1 (utm_source dropped)
During deduplication, all query params are dropped except those listed in allow_query_params.
Examples: Preserve pagination:
[crawler]
allow_query_params = ["page"]
Preserve filters:
[crawler]
allow_query_params = ["category", "sort", "filter", "q"]
Preserve all query params:
[crawler]
allow_query_params = ["*"]
Use case: E-commerce site with filters:
[crawler]
allow_query_params = ["category", "price", "brand", "page"]
This preserves:
  • /products?category=shoes
  • /products?category=shoes&page=2
This drops:
  • /products?utm_source=google ✗ (becomes /products)
  • /products?gclid=abc123 ✗ (becomes /products)
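A rough illustration of this normalization (hypothetical code, not squirrelscan's; it also applies drop_query_prefixes, covered next):
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

ALLOW = {"category", "price", "brand", "page"}   # allow_query_params
DROP_PREFIXES = ("utm_", "gclid", "fbclid")      # drop_query_prefixes

def dedup_key(url: str) -> str:
    """Keep only allowed query params; always strip tracking prefixes."""
    parts = urlsplit(url)
    kept = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if ("*" in ALLOW or k in ALLOW) and not k.startswith(DROP_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))

print(dedup_key("https://shop.example/products?category=shoes&page=2"))
# https://shop.example/products?category=shoes&page=2
print(dedup_key("https://shop.example/products?utm_source=google"))
# https://shop.example/products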

drop_query_prefixes

Type: string[]
Default: ["utm_", "gclid", "fbclid"]
Query parameter prefixes to always drop, even if in allow_query_params.
Default tracking params dropped:
  • utm_* - Google Analytics (utm_source, utm_medium, etc.)
  • gclid - Google Ads
  • fbclid - Facebook Ads
Examples: Drop more tracking params:
[crawler]
drop_query_prefixes = [
  "utm_",
  "gclid",
  "fbclid",
  "mc_",       # Mailchimp
  "_ga",       # Google Analytics
  "ref",       # Referrer
  "source"     # Generic source tracking
]
Drop nothing:
[crawler]
drop_query_prefixes = []

Crawl Strategy

breadth_first

Type: boolean
Default: true
Use breadth-first crawling for better site coverage.
Breadth-first (default):
  • Crawls level-by-level
  • Discovers homepage, then all links from homepage, then all links from those pages
  • Better site coverage
  • Avoids getting stuck in deep paths
Depth-first (false):
  • Crawls as deep as possible before backtracking
  • Can get stuck in deep sections
  • Less even coverage
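The difference comes down to how the URL frontier is ordered; a toy sketch (not the real crawler) of the two strategies:
from collections import deque

def crawl_order(start, links, breadth_first=True):
    """Visit order for a toy link graph: queue for breadth-first, stack for depth-first."""
    frontier, seen, order = deque([start]), {start}, []
    while frontier:
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order

site = {"/": ["/blog", "/about"], "/blog": ["/blog/post-1", "/blog/post-2"]}
print(crawl_order("/", site, True))   # ['/', '/blog', '/about', '/blog/post-1', '/blog/post-2']
print(crawl_order("/", site, False))  # ['/', '/about', '/blog', '/blog/post-2', '/blog/post-1']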
Example: Disable breadth-first:
[crawler]
breadth_first = false
Recommendation: Keep true (default) for most sites.

max_prefix_budget

Type: number
Default: 0.25 (25%)
Range: 0.1 to 1.0
Maximum fraction of the crawl budget for any single path prefix. Prevents the crawler from spending all pages on one section (e.g., /blog/ with 1000+ posts).
How it works: With max_pages = 500 and max_prefix_budget = 0.25:
  • At most 125 pages (25%) from any single path prefix
  • Ensures diverse coverage across site sections
Examples: More strict (better coverage):
[crawler]
max_prefix_budget = 0.15  # Max 15% per prefix
More lenient (deeper coverage):
[crawler]
max_prefix_budget = 0.5   # Max 50% per prefix
Disable budget (not recommended):
[crawler]
max_prefix_budget = 1.0   # No limit
Use case: Site with large blog:
/blog/post-1
/blog/post-2
...
/blog/post-5000
/about
/contact
With max_prefix_budget = 0.25 and max_pages = 500:
  • At most 125 blog posts crawled
  • Remaining budget for other sections
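A minimal sketch of that arithmetic (assuming, purely for illustration, that the prefix is the first path segment; the real grouping may differ):
from urllib.parse import urlparse

MAX_PAGES = 500
MAX_PREFIX_BUDGET = 0.25
prefix_cap = int(MAX_PAGES * MAX_PREFIX_BUDGET)   # 125 pages per prefix
crawled: dict[str, int] = {}

def within_budget(url: str) -> bool:
    path = urlparse(url).path.strip("/")
    prefix = "/" + path.split("/")[0] if path else "/"
    if crawled.get(prefix, 0) >= prefix_cap:
        return False      # e.g. /blog has used its 25% share; skip this URL
    crawled[prefix] = crawled.get(prefix, 0) + 1
    return True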

Request Configuration

request_method

Type: "standard" | "browser_impersonate"
Default: "browser_impersonate"
Request method for fetching pages.
Options:
"browser_impersonate" (default):
  • Uses TLS fingerprinting to appear as a real browser
  • Better success rate with bot protection
  • Bypasses basic bot detection
  • Slightly slower
"standard":
  • Standard HTTP fetch
  • Faster
  • May be blocked by aggressive bot protection
Examples: Use standard for local development:
[crawler]
request_method = "standard"
Use browser impersonation (default):
[crawler]
request_method = "browser_impersonate"
impersonate_browser = "chrome_131"

impersonate_browser

Type: "chrome_131" | "chrome_120" | "firefox_133" | "firefox_120" | "safari_ios_18_0"
Default: "chrome_131"
Browser to impersonate when using request_method = "browser_impersonate".
Options:
  • "chrome_131" - Chrome 131 (latest)
  • "chrome_120" - Chrome 120
  • "firefox_133" - Firefox 133 (latest)
  • "firefox_120" - Firefox 120
  • "safari_ios_18_0" - Safari on iOS 18
Examples: Impersonate Firefox:
[crawler]
request_method = "browser_impersonate"
impersonate_browser = "firefox_133"
Impersonate mobile Safari:
[crawler]
request_method = "browser_impersonate"
impersonate_browser = "safari_ios_18_0"
Note: Only applies when request_method = "browser_impersonate".

user_agent

Type: string
Default: "" (empty = random browser user agent per crawl)
Custom user agent string.
Default behavior (empty string): Random modern browser user agent, refreshed per crawl:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
Examples: Custom user agent:
[crawler]
user_agent = "SquirrelScan Bot (https://squirrelscan.com)"
Mobile user agent:
[crawler]
user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 18_0 like Mac OS X) AppleWebKit/605.1.15"
Recommendation: Leave empty (default) for best results with bot protection.

follow_redirects

Type: boolean
Default: true
Follow HTTP 3xx redirects.
When true (default):
  • Follows redirects automatically
  • Crawls final destination URL
  • Redirect chains tracked for analysis
When false:
  • Stops at redirect
  • Does not fetch redirect destination
  • Useful for debugging redirect issues
Example: Disable redirect following:
[crawler]
follow_redirects = false
Recommendation: Keep true (default) for normal audits.

Robots.txt

respect_robots

Type: boolean
Default: true
Obey robots.txt rules and crawl-delay directives.
When true (default):
  • Fetches and parses robots.txt
  • Respects Disallow: rules
  • Honors Crawl-delay: directive
  • Polite and ethical
When false:
  • Ignores robots.txt
  • Crawls all URLs (including disallowed)
  • Use only for your own sites
Example: Ignore robots.txt (testing only):
[crawler]
respect_robots = false
Recommendation: Always keep true (default) when crawling third-party sites.

Complete Examples

Fast Local Development

[crawler]
max_pages = 50
delay_ms = 0
per_host_delay_ms = 0
concurrency = 10
request_method = "standard"
respect_robots = false

Polite Production Crawl

[crawler]
max_pages = 500
delay_ms = 200
per_host_delay_ms = 500
concurrency = 5
per_host_concurrency = 2
respect_robots = true
request_method = "browser_impersonate"
impersonate_browser = "chrome_131"

High-Volume Crawl

[crawler]
max_pages = 2000
delay_ms = 100
per_host_delay_ms = 200
concurrency = 10
per_host_concurrency = 3
breadth_first = true
max_prefix_budget = 0.2

Focused Blog Crawl

[crawler]
max_pages = 200
include = ["/blog/**"]
exclude = ["*.pdf", "/blog/drafts/**"]
allow_query_params = ["page"]

E-commerce Site

[crawler]
max_pages = 1000
include = ["/products/**", "/categories/**"]
exclude = ["/cart/**", "/checkout/**", "/account/**"]
allow_query_params = ["category", "sort", "page", "filter"]
drop_query_prefixes = ["utm_", "gclid", "fbclid", "ref"]