The [external_links] section controls how squirrelscan validates outbound links during crawls.

Configuration

[external_links]
enabled = true
cache_ttl_days = 7
timeout_ms = 10000
concurrency = 5

Options

enabled

Type: boolean
Default: true

Enable external link checking during crawls. When enabled, squirrelscan validates all external links found during crawling to detect broken outbound links (404s, timeouts, DNS failures).

Examples:

Enable (default):
[external_links]
enabled = true
Disable for faster crawls:
[external_links]
enabled = false
When to disable:
  • Local development (localhost URLs)
  • Network restrictions (firewalls, VPNs)
  • Speed priority over external link validation
  • Large sites with many external links
Impact when disabled:
  • Faster crawls (no external HTTP requests)
  • links/broken-external-links rule won’t report issues
  • Outbound link quality not validated

cache_ttl_days

Type: number
Default: 7 (days)
Range: 1 to 365 recommended

How long to cache external link check results, in days. External link checks are cached globally per URL to avoid repeatedly checking the same external resources across multiple crawls.

How caching works:
  1. First crawl checks https://example.com/article
  2. Result cached for 7 days (default)
  3. Next crawl within 7 days reuses cached result
  4. After 7 days, link is re-checked
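Conceptually, the freshness check is a timestamp comparison. A minimal sketch in Python (not squirrelscan’s actual code; the cache entry fields are assumptions):

import time

def is_fresh(entry: dict, ttl_days: int = 7) -> bool:
    # Reuse the cached result while it is younger than the TTL.
    age_seconds = time.time() - entry["checked_at"]  # "checked_at" is an assumed field name
    return age_seconds < ttl_days * 86400

# Checked 3 days ago: still fresh under the 7-day default.
entry = {"url": "https://example.com/article", "status": 200, "checked_at": time.time() - 3 * 86400}
print(is_fresh(entry))  # True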
Examples:

Short cache (1 day):
[external_links]
cache_ttl_days = 1
Long cache (30 days):
[external_links]
cache_ttl_days = 30
No caching (always fresh):
[external_links]
cache_ttl_days = 0  # Not recommended
Recommendations:
| Use Case | Recommended TTL | Reason |
| --- | --- | --- |
| Daily crawls | 1-2 days | Fresh data daily |
| Weekly crawls | 7 days (default) | Balance freshness/speed |
| Monthly crawls | 14-30 days | Reduce external requests |
| CI/CD pipeline | 1 day | Catch issues quickly |
Cache location:
~/.squirrel/cache/external-links/
Clear cache:
rm -rf ~/.squirrel/cache/external-links/

timeout_ms

Type: number
Default: 10000 (10 seconds)
Range: 1000 to 60000 (1-60 seconds)

Timeout for external link checks, in milliseconds. External link checks use HEAD requests by default (faster, no body download). If a site doesn’t respond within this timeout, the link is marked as failed.

Examples:

Fast timeout (5 seconds):
[external_links]
timeout_ms = 5000
Slow sites tolerance (30 seconds):
[external_links]
timeout_ms = 30000
Very aggressive (3 seconds):
[external_links]
timeout_ms = 3000
When a request exceeds the timeout:
  • Link marked as “timeout”
  • Reported as broken external link
  • Counted in links/broken-external-links rule
Recommendations:
| Scenario | Timeout | Reason |
| --- | --- | --- |
| Most sites | 10s (default) | Balance speed/reliability |
| Fast CDN links | 5s | CDNs are fast |
| Slow sites | 20-30s | Allow slow responses |
| CI/CD | 5-10s | Fail fast |

concurrency

Type: number
Default: 5
Range: 1 to 20 recommended

Maximum number of concurrent external link checks. Controls how many external URLs are validated simultaneously during crawling.

Examples:

Sequential (one at a time):
[external_links]
concurrency = 1
Moderate parallelism (default):
[external_links]
concurrency = 5
High parallelism:
[external_links]
concurrency = 15
Impact:

Higher concurrency:
  • ✓ Faster external link validation
  • ✓ Better for sites with many external links
  • ✗ More network connections
  • ✗ May trigger rate limits
Lower concurrency:
  • ✓ Polite to external sites
  • ✓ Less network overhead
  • ✗ Slower external link validation
Recommendations:
| Use Case | Concurrency | Reason |
| --- | --- | --- |
| Most sites | 5 (default) | Good balance |
| Many external links (100+) | 10-15 | Speed up validation |
| Slow network | 2-3 | Avoid overload |
| Rate-limited | 1-2 | Avoid 429 errors |
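The concurrency limit behaves like a semaphore: at most N checks are in flight at once. A minimal sketch of the pattern (illustrative only, not squirrelscan’s source; the sleep stands in for a real HTTP request):

import asyncio

CONCURRENCY = 5  # mirrors the concurrency option

async def check(url: str, sem: asyncio.Semaphore) -> tuple[str, str]:
    async with sem:  # at most CONCURRENCY checks run at the same time
        await asyncio.sleep(0.1)  # stand-in for the actual link check
        return url, "ok"

async def main(urls: list[str]) -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    print(await asyncio.gather(*(check(u, sem) for u in urls)))

asyncio.run(main([f"https://example.com/page{i}" for i in range(20)]))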

Request Strategy

  1. HEAD request first
    • Faster (no body download)
    • Checks if URL responds
    • Most efficient
  2. GET fallback
    • If HEAD fails/not supported
    • Downloads full response
    • Slower but more reliable
  3. User agent
    • Uses configured crawler user agent
    • Browser impersonation if enabled
    • Respects request_method setting
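The HEAD-then-GET strategy can be sketched with Python’s requests library (a simplified illustration of the idea, not squirrelscan’s implementation; the fallback trigger codes are assumptions):

import requests

def check_url(url: str, timeout_s: float = 10.0):
    # 1. Try HEAD first: cheap, no response body.
    try:
        resp = requests.head(url, timeout=timeout_s, allow_redirects=True)
        if resp.status_code in (405, 501):
            # 2. Some servers reject HEAD; fall back to a full GET.
            resp = requests.get(url, timeout=timeout_s, allow_redirects=True)
        return resp.status_code
    except requests.exceptions.Timeout:
        return "timeout"      # reported as a broken link
    except requests.exceptions.ConnectionError:
        return "dns_failure"  # DNS/connection errors also count as broken

print(check_url("https://example.com"))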

Status Detection

| Status | Meaning | Reported As |
| --- | --- | --- |
| 200-299 | Success | Working link |
| 300-399 | Redirect | Working (followed) |
| 400-499 | Client error | Broken link |
| 500-599 | Server error | Broken link |
| Timeout | No response | Broken link |
| DNS failure | Domain not found | Broken link |
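In code, the table reduces to a small mapping. A sketch (only the reporting decision is shown):

def report_as(result) -> str:
    # Timeouts and DNS failures carry no status code.
    if result in ("timeout", "dns_failure"):
        return "broken link"
    if 200 <= result < 300:
        return "working link"
    if 300 <= result < 400:
        return "working (followed)"
    return "broken link"  # 4xx client errors and 5xx server errors

print(report_as(301))  # working (followed)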

Caching Behavior

Cached for TTL period:
  • 200-299 (success)
  • 404 (not found)
  • Redirects (with final destination)
Not cached:
  • Timeouts (may be transient)
  • Server errors (5xx - may be temporary)
  • DNS failures (may recover)
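The cache/no-cache split can be expressed as a single predicate. A sketch mirroring the lists above (how 4xx codes other than 404 are treated is an assumption):

def should_cache(result) -> bool:
    if result in ("timeout", "dns_failure"):
        return False  # transient: re-check next crawl
    if 500 <= result <= 599:
        return False  # server errors may be temporary
    if 200 <= result < 300 or result == 404:
        return True   # stable outcomes, cached for the TTL period
    if 300 <= result < 400:
        return True   # redirects, cached with their final destination
    return False      # other 4xx: conservatively re-checked (assumption)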

Configuration Examples

Local Development

For local development or quick audits:
[external_links]
enabled = false
Skips all external link validation.
Comprehensive Validation

For comprehensive link validation:
[external_links]
enabled = true
cache_ttl_days = 1        # Fresh daily
timeout_ms = 30000        # 30s tolerance
concurrency = 15          # High parallelism
Use cases:
  • Link quality audits
  • Outbound link monitoring
  • Content freshness validation

Respectful Crawling

Conservative settings for respectful crawling:
[external_links]
enabled = true
cache_ttl_days = 30       # Cache longer
timeout_ms = 15000        # 15s tolerance
concurrency = 2           # Low parallelism
Use cases:
  • Many external links
  • Avoid rate limits
  • Network restrictions

CI/CD Pipeline

Fast feedback with fresh data:
[external_links]
enabled = true
cache_ttl_days = 1        # Fresh each run
timeout_ms = 5000         # Fail fast
concurrency = 10          # Speed up validation
Use cases:
  • Automated testing
  • PR checks
  • Daily builds

Performance Impact

External link checking adds overhead to crawls. Typical impact:

With external links enabled (default):
100 internal pages
50 unique external links
= 100 page fetches + 50 external checks (if not cached)
= ~150 total HTTP requests
With external links disabled:
100 internal pages
= 100 page fetches
= ~100 total HTTP requests
Cache impact:

First crawl:
50 external links ÷ 5 concurrent = 10 batches × 10s timeout = up to ~100s worst case
Second crawl (within cache TTL):
50 external links cached = ~0s overhead
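The worst case can be estimated with a quick calculation (a rough model that assumes every check hits the timeout; real links usually answer much faster):

import math

links, concurrency, timeout_s = 50, 5, 10
batches = math.ceil(links / concurrency)  # 10 batches of 5 parallel checks
print(batches * timeout_s)                # up to 100s if every check times out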

Related Rules

External link configuration affects these rules:

links/broken-external-links

What it checks:
  • External links returning 4xx/5xx errors
  • Timeout failures
  • DNS resolution failures
Requires:
[external_links]
enabled = true  # Must be enabled
Configuration:
[external_links]
enabled = true
timeout_ms = 10000

[rules]
enable = ["links/broken-external-links"]

links/https-downgrade

What it checks:
  • HTTPS page linking to HTTP external URL
  • Security downgrade warnings
Requires:
[external_links]
enabled = true  # To validate external URLs

Troubleshooting

External links timing out

Symptoms:
  • Many “timeout” failures
  • External link checks slow
Solutions:

Increase timeout:
[external_links]
timeout_ms = 20000  # 20 seconds
Reduce concurrency:
[external_links]
concurrency = 2  # Fewer parallel requests
Check network:
curl -I https://example.com  # Test external access

HEAD requests blocked

Cause: Some sites block HEAD requests or require browser user agents.

Solution: The external link checker automatically falls back to GET requests.

Verify crawler user agent:
[crawler]
request_method = "browser_impersonate"
impersonate_browser = "chrome_131"

Slow crawls with many external links

Symptoms:
  • Crawl takes long time
  • Hundreds of external links
Solutions:

Disable external links:
[external_links]
enabled = false
Or increase concurrency:
[external_links]
concurrency = 15
timeout_ms = 5000

Cache not working

Verify cache location:
ls -la ~/.squirrel/cache/external-links/
Clear and rebuild:
rm -rf ~/.squirrel/cache/external-links/
squirrel audit https://example.com  # Rebuild cache

Complete Example

[project]
name = "mysite"

[crawler]
max_pages = 500
request_method = "browser_impersonate"

[external_links]
# Enable external link validation
enabled = true

# Cache results for 7 days
cache_ttl_days = 7

# 15 second timeout for slow sites
timeout_ms = 15000

# Check 10 links concurrently
concurrency = 10

[rules]
# Enable broken external link detection
enable = ["*"]
disable = ["ai/*"]
Running this config:
  • Crawls up to 500 pages
  • Validates all external links
  • Caches results for 7 days
  • Allows 15s per external link
  • Checks 10 external links in parallel
  • Reports broken external links in audit