SquirrelScan uses a smart crawling system that balances thoroughness with efficiency. This page explains how crawling works under the hood.
## How It Works
When you run `squirrel audit https://example.com`, the crawler:
- Fetches robots.txt to respect site rules
- Seeds the frontier with your starting URL
- Discovers links by parsing each page’s HTML
- Crawls breadth-first to prioritize important pages
- Stores everything in a local SQLite database
```bash
squirrel audit https://example.com
```
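The breadth-first loop above can be sketched as follows (a simplified synchronous model; `crawlBfs` and `fetchPage` are illustrative names, not SquirrelScan's actual API):

```typescript
// Breadth-first frontier sketch (synchronous model; the real crawler
// fetches asynchronously). `fetchPage` stands in for fetch + HTML parsing
// and returns the links discovered on a page.
function crawlBfs(
  seed: string,
  maxPages: number,
  fetchPage: (url: string) => string[],
): string[] {
  const frontier: string[] = [seed]; // FIFO queue => breadth-first order
  const seen = new Set<string>([seed]);
  const crawled: string[] = [];

  while (frontier.length > 0 && crawled.length < maxPages) {
    const url = frontier.shift()!;
    crawled.push(url);
    for (const link of fetchPage(url)) {
      if (!seen.has(link)) { // enqueue each URL only once
        seen.add(link);
        frontier.push(link);
      }
    }
  }
  return crawled;
}
```

Because the frontier is a FIFO queue, pages closer to the seed (usually the most important ones) are crawled first.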
## Coverage Modes
SquirrelScan supports three coverage modes to balance thoroughness with speed:
| Mode | Default Pages | Behavior | Use Case |
|------|---------------|----------|----------|
| `quick` | 25 | Seed + sitemaps only, no link discovery | CI checks, fast health check |
| `surface` | 100 | One sample per URL pattern | General audits (default) |
| `full` | 500 | Crawl everything up to limit | Deep analysis |
```bash
# Quick health check (25 pages, no link discovery)
squirrel audit https://example.com -C quick

# Default surface crawl (100 pages, pattern sampling)
squirrel audit https://example.com

# Full comprehensive audit (500 pages)
squirrel audit https://example.com -C full

# Override page limit for any mode
squirrel audit https://example.com -C surface -m 200
```
## Surface Mode Pattern Detection
Surface mode is smart about detecting URL patterns. When it sees `/blog/my-first-post`, `/blog/another-post`, and `/blog/third-post`, it recognizes these as the same pattern (`/blog/{slug}`) and only crawls one sample.
Detected Patterns:
- Numeric IDs: `/products/12345` → `/products/{id}`
- UUIDs: `/doc/a1b2c3d4-e5f6-...` → `/doc/{id}`
- Dates: `/blog/2024/01/15` → `/blog/{date}/{date}/{date}`
- Slugs: `/blog/my-awesome-post` → `/blog/{slug}`
This means a blog with 10,000 posts gets sampled efficiently without wasting crawl budget on duplicate templates.
Surface mode is the default and recommended for most audits. It gives you comprehensive coverage of unique page templates while avoiding over-crawling repetitive content like blog archives or product listings.
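A rough sketch of this kind of pattern collapsing (the per-segment rules here are assumptions modeled on the examples above, not SquirrelScan's exact heuristics):

```typescript
// Hypothetical URL pattern classifier. Segment rules are illustrative:
// the real heuristics may weigh segment position, frequency, etc.
const UUID_RE =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function classifySegment(seg: string): string {
  if (UUID_RE.test(seg)) return "{id}";
  // Year-like (4 digits) or day/month-like (1-2 digits) => date component
  if (/^\d{4}$/.test(seg) || /^\d{1,2}$/.test(seg)) return "{date}";
  if (/^\d+$/.test(seg)) return "{id}"; // longer numeric IDs
  if (/^[a-z0-9]+(-[a-z0-9]+)+$/.test(seg)) return "{slug}"; // hyphenated slug
  return seg; // literal segment like "blog"
}

function urlPattern(path: string): string {
  return path
    .split("/")
    .map((seg) => (seg === "" ? seg : classifySegment(seg)))
    .join("/");
}
```

Two URLs with the same pattern string are treated as the same template, so only one sample per pattern enters the crawl frontier.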
## Redirect Following
SquirrelScan automatically follows both HTTP and client-side redirects when starting an audit. This ensures you audit the correct final destination, even through complex redirect chains.
### Supported Redirects
- HTTP redirects (301, 302, 303, 307, 308) - handled by native `fetch`
- Meta refresh - `<meta http-equiv="refresh" content="0;url=...">`
- JavaScript redirects - `window.location`, `window.location.href`, `location.href`
### How It Works
Before crawling begins, SquirrelScan:
- Follows HTTP redirect chains automatically
- Fetches the target page and checks for client-side redirects
- Continues following redirects up to 10 hops
- Detects and prevents redirect loops
- Uses the final URL as the crawl base URL
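The client-side detection step might look roughly like this (regexes simplified for illustration; the real parser handles more quoting and whitespace edge cases):

```typescript
// Sketch: detect a meta-refresh or JavaScript redirect target in HTML.
// Both regexes are simplified for illustration.
const META_REFRESH_RE =
  /<meta[^>]+http-equiv=["']refresh["'][^>]*content=["']\s*\d+\s*;\s*url=([^"']+)["']/i;
const JS_REDIRECT_RE = /(?:window\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']/;

function findClientRedirect(html: string, baseUrl: string): string | null {
  const m = META_REFRESH_RE.exec(html) ?? JS_REDIRECT_RE.exec(html);
  return m ? new URL(m[1], baseUrl).href : null; // resolve relative targets
}
```

When a redirect is found, the crawler would fetch the new target and repeat, up to the 10-hop limit, tracking visited URLs to break loops.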
### Example: Geo-Targeted Redirects
Many sites redirect based on location. SquirrelScan handles this intelligently:
```bash
squirrel audit gymshark.com

# Following redirect: https://gymshark.com/ → https://www.gymshark.com/
# SQUIRRELSCAN REPORT
# https://www.gymshark.com • 500 pages • 88/100 (B)
```
Behind the scenes:
1. HTTP redirect: `gymshark.com` → `us.checkout.gymshark.com`
2. Client-side redirect: `us.checkout.gymshark.com` → `www.gymshark.com`
3. Final crawl target: `www.gymshark.com`
The original and final URLs are stored in the crawl session for reference. This is useful for sites with A/B testing, geo-targeting, or domain migrations.
## Crawl Sessions
Each audit creates a crawl session with a unique ID. Sessions are stored per-domain in `~/.squirrel/projects/<domain>/project.db`.
### Session Behavior
| Scenario | What Happens |
|----------|--------------|
| First audit | Creates new crawl session |
| Re-run audit | Creates new session (old preserved for history) |
| Interrupted (Ctrl+C) | Session paused, can be resumed |
| Resume interrupted | Continues from where it left off |
Old crawl sessions are preserved for historical comparison. Future versions will support crawl diffs to track changes over time.
## Conditional GET (304 Caching)
SquirrelScan is smart about re-crawling. When fetching a URL:
- Checks if we’ve seen this URL before (any previous crawl)
- Sends `If-None-Match` (ETag) or `If-Modified-Since` headers
- If the server returns 304 Not Modified, uses cached content
- Otherwise fetches fresh content
This makes re-crawling fast - unchanged pages are nearly instant.
```bash
# First crawl: fetches all pages fresh
squirrel audit https://example.com -m 50

# Second crawl: 304s for unchanged pages, much faster
squirrel audit https://example.com -m 100
```
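Conditional request construction can be sketched as follows (the `CachedPage` shape and function names are hypothetical, not SquirrelScan's internal types):

```typescript
// Sketch: build conditional request headers from a previously cached
// response. The `CachedPage` shape is illustrative.
interface CachedPage {
  etag?: string;
  lastModified?: string;
}

function conditionalHeaders(cached: CachedPage | undefined): Record<string, string> {
  const headers: Record<string, string> = {};
  if (cached?.etag) headers["If-None-Match"] = cached.etag;
  if (cached?.lastModified) headers["If-Modified-Since"] = cached.lastModified;
  return headers;
}

// On response: a 304 means "reuse the cached body".
function shouldUseCache(status: number): boolean {
  return status === 304;
}
```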
## URL Normalization
URLs are normalized before crawling to avoid duplicates:
- Lowercased scheme and host
- Sorted query parameters
- Removed default ports (80, 443)
- Removed trailing slashes
- Decoded percent-encoding where safe
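A sketch of these rules using the WHATWG `URL` parser, which already lowercases the host and scheme and drops default ports:

```typescript
// Sketch of URL normalization. The URL parser handles case and default
// ports; query sorting and trailing-slash removal are applied on top.
function normalizeUrl(raw: string): string {
  const url = new URL(raw); // lowercases host/scheme, drops :80 / :443
  url.hash = "";            // fragments never reach the server
  url.searchParams.sort();  // stable query parameter order
  if (url.pathname !== "/" && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1); // drop trailing slash
  }
  return url.href;
}
```

Normalizing before enqueueing means `HTTP://Example.COM:80/blog/?b=2&a=1` and `http://example.com/blog?a=1&b=2` count as one URL, not two.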
### Query Parameter Handling
By default, query parameters are stripped except those in your allowlist:
```toml
[crawler]
# Keep these query params (e.g., for pagination)
allow_query_params = ["page", "sort"]

# Drop tracking params (default)
drop_query_prefixes = ["utm_", "gclid", "fbclid"]
```
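The filtering this configures can be sketched as (function name illustrative; keeps only allowlisted parameters and drops anything matching a drop prefix):

```typescript
// Sketch: strip query parameters that are not allowlisted, or that
// match a tracking prefix like "utm_".
function filterQuery(raw: string, allow: string[], dropPrefixes: string[]): string {
  const url = new URL(raw);
  for (const key of [...url.searchParams.keys()]) { // copy before deleting
    const dropped = dropPrefixes.some((p) => key === p || key.startsWith(p));
    if (dropped || !allow.includes(key)) url.searchParams.delete(key);
  }
  return url.href;
}
```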
## Scope Control
Control which URLs get crawled with include/exclude patterns:
```toml
[crawler]
# Only crawl blog pages
include = ["/blog/*"]

# Skip admin and api routes
exclude = ["/admin/*", "/api/*", "*.pdf"]
```
Changing `include`, `exclude`, `allow_query_params`, or `drop_query_prefixes` creates a new crawl session since these affect which URLs are in scope.
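Include/exclude matching can be sketched with a minimal glob-to-regex conversion (assuming `*` matches any run of characters, as the patterns above suggest; exclusions win over inclusions):

```typescript
// Sketch: tiny glob matcher where "*" matches any run of characters.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*/g, ".*");                 // then expand the glob star
  return new RegExp(`^${escaped}$`);
}

function inScope(path: string, include: string[], exclude: string[]): boolean {
  if (exclude.some((g) => globToRegExp(g).test(path))) return false;
  if (include.length === 0) return true; // no include list => allow all
  return include.some((g) => globToRegExp(g).test(path));
}
```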
## Multi-Domain Crawling
By default, only the seed domain is crawled. To allow additional domains:
```toml
[project]
domains = ["example.com", "blog.example.com", "cdn.example.com"]
```
## User-Agent
By default, SquirrelScan uses a random browser user-agent for each crawl session. This helps avoid bot detection and ensures your audit sees the same content real users would see.
### Default Behavior
Each crawl session generates a random user-agent from real browser fingerprints (Chrome, Firefox, Safari, Edge) across desktop, mobile, and tablet devices. The same user-agent is used for all requests within a single crawl.
### Custom User-Agent
To override the random user-agent with a fixed value:
```toml
[crawler]
# Use a specific user-agent
user_agent = "MyBot/1.0 (+https://example.com/bot)"

# Or use the SquirrelScan bot identifier
user_agent = "SquirrelScan/2.0 (+https://squirrelscan.com/bot)"
```
Set a custom `user_agent` if you need to:
- Whitelist the crawler in your WAF or firewall
- Test how your site responds to specific browsers
- Identify SquirrelScan requests in your server logs
## Rate Limiting
SquirrelScan is polite by default:
```toml
[crawler]
concurrency = 5           # Total concurrent requests
per_host_concurrency = 2  # Max concurrent per host
delay_ms = 100            # Base delay between requests
per_host_delay_ms = 200   # Min delay per host
```
This prevents overloading servers while still crawling efficiently.
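The per-host delay can be modeled as a small scheduler (a sketch only; the real crawler also enforces the global and per-host concurrency limits):

```typescript
// Sketch: track the last request time per host and compute how long
// to wait before the next request, honoring per_host_delay_ms.
class HostThrottle {
  private lastRequest = new Map<string, number>();
  constructor(private perHostDelayMs: number) {}

  // Milliseconds to wait before fetching from `host` at time `now`.
  waitFor(host: string, now: number): number {
    const last = this.lastRequest.get(host);
    const wait = last === undefined ? 0 : Math.max(0, last + this.perHostDelayMs - now);
    this.lastRequest.set(host, now + wait); // reserve the next slot
    return wait;
  }
}
```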
## Robots.txt
By default, SquirrelScan respects `robots.txt`:
```toml
[crawler]
respect_robots = true  # default
```
The crawler:
- Fetches `/robots.txt` before crawling
- Honors `Disallow` rules for the `SquirrelScan` and `*` user agents
- Discovers sitemaps from `Sitemap:` directives
Set `respect_robots = false` only for sites you own or have permission to audit fully.
## Data Storage
Crawl data is stored in SQLite databases organized by domain:
```
~/.squirrel/projects/
├── example-com/
│   └── project.db      # All crawl sessions for example.com
├── blog-example-com/
│   └── project.db      # Separate for subdomains
```
Each database contains:
- `crawls` - Session metadata and config
- `pages` - HTML content, headers, timing
- `links` - Internal and external links
- `images` - Image metadata
- `frontier` - URL queue state
## Resuming Interrupted Crawls
If a crawl is interrupted (Ctrl+C, crash, etc.), it can be resumed:
```bash
# Interrupted at 30/100 pages
squirrel audit https://example.com -m 100
# ^C

# Resume - continues from page 31
squirrel audit https://example.com -m 100
```
The crawler detects the incomplete session and picks up where it left off.
## Fresh Crawl (`--refresh`)
To ignore the cache and fetch all pages fresh:
```bash
squirrel audit https://example.com --refresh
```
This skips conditional GET and re-downloads everything. Useful when:
- Debugging caching issues
- Testing after major site changes
- Verifying server responses
## Crawler Stats
After each crawl, stats are stored:
| Stat | Description |
|------|-------------|
| `pagesTotal` | Total pages in crawl |
| `pagesFetched` | Pages fetched fresh (200 responses) |
| `pagesUnchanged` | Pages from cache (304 responses) |
| `pagesFailed` | Failed fetches |
| `pagesSkipped` | Skipped (out of scope, robots.txt) |
| `avgLoadTimeMs` | Average page load time |
| `bytesTotal` | Total bytes downloaded |
## Timing Data
Each page records timing information:
- `loadTimeMs` - Total request time
- `ttfb` - Time to first byte
- `downloadTime` - Body download time
This data feeds into performance rules like `perf/ttfb`.
## Performance

SquirrelScan uses several techniques to crawl efficiently:
### Parallel URL Fetching
URLs are fetched in parallel batches respecting concurrency limits:
```toml
[crawler]
concurrency = 5           # Total concurrent requests
per_host_concurrency = 2  # Max concurrent per host
```
The crawler pops multiple URLs from the frontier and processes them concurrently, significantly speeding up crawls compared to sequential fetching.
### Content Caching

HTML and JavaScript content is stored in a global content cache (`~/.squirrel/content-store.db`) with:
- Gzip compression - Typically 80-90% space savings
- Content deduplication - Identical content stored once
- LRU eviction - Old entries pruned when cache is full
This means:
- Repeated crawls of unchanged pages are instant
- CDN scripts shared across sites are cached once
- Large crawl sessions use less disk space
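Content-addressed deduplication plus gzip can be sketched as follows (an in-memory `Map` stands in for the SQLite store; LRU eviction omitted):

```typescript
import { createHash } from "node:crypto";
import { gzipSync, gunzipSync } from "node:zlib";

// Sketch of a content-addressed store: identical bodies hash to the
// same key, so duplicates are stored once, gzip-compressed.
const store = new Map<string, Buffer>();

function put(content: string): string {
  const key = createHash("sha256").update(content).digest("hex");
  if (!store.has(key)) store.set(key, gzipSync(content)); // dedupe + compress
  return key;
}

function get(key: string): string | undefined {
  const blob = store.get(key);
  return blob ? gunzipSync(blob).toString("utf8") : undefined;
}
```

Because the key is derived from the content itself, a CDN script embedded on fifty pages occupies one compressed entry, not fifty.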
### Smart Resource Limits
Script scanning automatically scales with site size:
| Site Size | Scripts Scanned |
|-----------|-----------------|
| < 100 pages | 10 scripts |
| 100-500 pages | 10-50 scripts |
| > 500 pages | 50 scripts (cap) |
This ensures small sites get thorough scanning while large sites don’t waste time on excessive script analysis.
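If the ramp between 100 and 500 pages is linear (an assumption; the table only gives the endpoints), the limit could be computed as:

```typescript
// Sketch: scale the script-scan limit with site size, ramping linearly
// from 10 scripts at 100 pages to the 50-script cap at 500 pages.
function scriptsToScan(pages: number): number {
  if (pages < 100) return 10;
  if (pages > 500) return 50;
  return Math.round(10 + ((pages - 100) * 40) / 400); // linear interpolation
}
```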
### Database Optimizations
SQLite databases use WAL mode and optimized indexes for:
- Fast frontier operations (URL queue)
- Efficient link counting
- Quick page lookups