The [external_links] section controls how squirrelscan validates outbound links during crawls.
Configuration
[external_links]
enabled = true
cache_ttl_days = 7
timeout_ms = 10000
concurrency = 5
Options
enabled
Type: boolean
Default: true
Enable external link checking during crawl.
When enabled, squirrelscan validates all external links found during crawling to detect broken outbound links (404s, timeouts, DNS failures).
Examples:
Enable (default):
[external_links]
enabled = true
Disable for faster crawls:
[external_links]
enabled = false
When to disable:
- Local development (localhost URLs)
- Network restrictions (firewalls, VPNs)
- Speed priority over external link validation
- Large sites with many external links
Impact when disabled:
- Faster crawls (no external HTTP requests)
- links/broken-external-links rule won’t report issues
- Outbound link quality not validated
cache_ttl_days
Type: number
Default: 7 (days)
Range: 1 to 365 recommended
How long to cache external link check results in days.
External link checks are cached globally per URL to avoid repeatedly checking the same external resources across multiple crawls.
How caching works:
- First crawl checks https://example.com/article
- Result cached for 7 days (default)
- Next crawl within 7 days reuses cached result
- After 7 days, link is re-checked
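The freshness rule in the steps above can be sketched in a few lines of Python. This is only an illustration, not squirrelscan's actual implementation: the function name `is_cache_fresh` and its parameters are hypothetical, and treating a result that is exactly `cache_ttl_days` old as expired is an assumption.

```python
from datetime import datetime, timedelta, timezone


def is_cache_fresh(checked_at: datetime, now: datetime, cache_ttl_days: int = 7) -> bool:
    """Return True if a cached check result can still be reused."""
    if cache_ttl_days <= 0:
        # cache_ttl_days = 0 means "always re-check"
        return False
    # Assumption: a result exactly cache_ttl_days old counts as expired
    return now - checked_at < timedelta(days=cache_ttl_days)


first_check = datetime(2024, 1, 1, tzinfo=timezone.utc)
three_days_later = first_check + timedelta(days=3)
eight_days_later = first_check + timedelta(days=8)
reuse_after_3_days = is_cache_fresh(first_check, three_days_later)   # reused
reuse_after_8_days = is_cache_fresh(first_check, eight_days_later)   # re-checked
```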
Examples:
Short cache (1 day):
[external_links]
cache_ttl_days = 1
Long cache (30 days):
[external_links]
cache_ttl_days = 30
No caching (always fresh):
[external_links]
cache_ttl_days = 0 # Not recommended
Recommendations:
| Use Case | Recommended TTL | Reason |
|---|---|---|
| Daily crawls | 1-2 days | Fresh data daily |
| Weekly crawls | 7 days (default) | Balance freshness/speed |
| Monthly crawls | 14-30 days | Reduce external requests |
| CI/CD pipeline | 1 day | Catch issues quickly |
Cache location:
~/.squirrel/cache/external-links/
Clear cache:
rm -rf ~/.squirrel/cache/external-links/
timeout_ms
Type: number
Default: 10000 (10 seconds)
Range: 1000 to 60000 (1-60 seconds)
Timeout for external link checks in milliseconds.
External link checks use HEAD requests by default (faster, no body download). If a site doesn’t respond within this timeout, it’s marked as failed.
Examples:
Fast timeout (5 seconds):
[external_links]
timeout_ms = 5000
Slow sites tolerance (30 seconds):
[external_links]
timeout_ms = 30000
Very aggressive (3 seconds):
[external_links]
timeout_ms = 3000
When a request exceeds the timeout:
- Link marked as “timeout”
- Reported as broken external link
- Counted in links/broken-external-links rule
Recommendations:
| Scenario | Timeout | Reason |
|---|---|---|
| Most sites | 10s (default) | Balance speed/reliability |
| Fast CDN links | 5s | CDNs are fast |
| Slow sites | 20-30s | Allow slow responses |
| CI/CD | 5-10s | Fail fast |
concurrency
Type: number
Default: 5
Range: 1 to 20 recommended
Maximum number of concurrent external link checks.
Controls how many external URLs are validated simultaneously during crawling.
Examples:
Sequential (one at a time):
[external_links]
concurrency = 1
Moderate parallelism (default):
[external_links]
concurrency = 5
High parallelism:
[external_links]
concurrency = 15
Impact:
Higher concurrency:
- ✓ Faster external link validation
- ✓ Better for sites with many external links
- ✗ More network connections
- ✗ May trigger rate limits
Lower concurrency:
- ✓ Polite to external sites
- ✓ Less network overhead
- ✗ Slower external link validation
Recommendations:
| Use Case | Concurrency | Reason |
|---|---|---|
| Most sites | 5 (default) | Good balance |
| Many external links (100+) | 10-15 | Speed up validation |
| Slow network | 2-3 | Avoid overload |
| Rate-limited | 1-2 | Avoid 429 errors |
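To make the effect of the concurrency cap concrete, here is a minimal Python sketch of a bounded worker pool. It is not how squirrelscan is implemented: `check_all` and the `check` callback are hypothetical names, and a thread pool is just one possible way to limit parallel checks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def check_all(
    urls: Iterable[str],
    check: Callable[[str], str],
    concurrency: int = 5,
) -> Dict[str, str]:
    """Run check(url) for each unique URL, at most `concurrency` at a time."""
    unique_urls = list(dict.fromkeys(urls))  # de-duplicate, keep first-seen order
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map yields results in the same order as unique_urls
        return dict(zip(unique_urls, pool.map(check, unique_urls)))
```

With `concurrency = 1` the pool degrades to sequential checking; raising the value shortens the wall-clock time but opens more simultaneous connections, which is the trade-off described above.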
How External Link Checking Works
Request Strategy
- HEAD request first
  - Faster (no body download)
  - Checks if URL responds
  - Most efficient
- GET fallback
  - If HEAD fails/not supported
  - Downloads full response
  - Slower but more reliable
- User agent
  - Uses browser-like headers and user agent
  - Detects WAF/bot protection (Cloudflare, Akamai, etc.)
  - WAF-blocked 403s reported separately from broken links
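The HEAD-then-GET strategy above can be sketched as follows. This is an illustration under stated assumptions, not squirrelscan's code: `probe` and the injectable `send` callback are hypothetical, and the exact condition that counts as "HEAD fails" (here: no response, or any status of 400 and above) is a guess.

```python
from typing import Callable, Optional

# send(method, url, timeout_seconds) returns an HTTP status code,
# or None when there was no response (timeout, DNS failure).
Sender = Callable[[str, str, float], Optional[int]]


def probe(url: str, send: Sender, timeout_ms: int = 10000) -> Optional[int]:
    """Try a cheap HEAD request first, then fall back to GET."""
    timeout_s = timeout_ms / 1000
    status = send("HEAD", url, timeout_s)
    # Assumption: any missing or error response to HEAD triggers the fallback
    if status is None or status >= 400:
        status = send("GET", url, timeout_s)
    return status
```

Passing the transport in as a callback keeps the sketch independent of any particular HTTP library and makes the fallback easy to exercise with a stub.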
Status Detection
| Status | Meaning | Reported As |
|---|---|---|
| 200-299 | Success | Working link |
| 300-399 | Redirect | Working (followed) |
| 400-499 | Client error | Broken link (WAF-blocked 403s are reported separately) |
| 500-599 | Server error | Broken link |
| Timeout | No response | Broken link |
| DNS failure | Domain not found | Broken link |
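The table maps onto a small classification function. The sketch below is illustrative only: `classify`, its parameters, and the returned labels are hypothetical names, and the `waf_blocked` flag stands in for whatever WAF detection the checker performs.

```python
from typing import Optional


def classify(status: Optional[int], waf_blocked: bool = False) -> str:
    """Map a check outcome to how it is reported.

    status is None when there was no HTTP response (timeout, DNS failure).
    """
    if status is None:
        return "broken"
    if 200 <= status <= 399:
        # Success, or a redirect that was followed
        return "working"
    if status == 403 and waf_blocked:
        # Bot protection, not necessarily a dead link
        return "unverifiable"
    # Remaining 4xx/5xx responses; other codes are not covered by the table
    return "broken"
```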
Caching Behavior
Cached for TTL period:
- 200-299 (success)
- 404 (not found)
- Redirects (with final destination)
Not cached:
- Timeouts (may be transient)
- Server errors (5xx - may be temporary)
- DNS failures (may recover)
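Expressed as code, the caching rule might look like the sketch below. `should_cache` is a hypothetical helper, and how client errors other than 404 are handled is not stated above, so leaving them uncached here is an assumption.

```python
from typing import Optional


def should_cache(status: Optional[int]) -> bool:
    """Decide whether a check result is stored for the TTL period.

    status is None for timeouts and DNS failures.
    """
    if status is None:
        return False  # transient: timeout or DNS failure
    if 200 <= status <= 399:
        return True   # success or redirect
    # Assumption: among error responses, only 404 is stable enough to cache
    return status == 404
```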
Configuration Examples
Fast Crawl (Disable External Links)
For local development or quick audits:
[external_links]
enabled = false
Skips all external link validation.
Aggressive External Link Checking
For comprehensive link validation:
[external_links]
enabled = true
cache_ttl_days = 1 # Fresh daily
timeout_ms = 30000 # 30s tolerance
concurrency = 15 # High parallelism
Use cases:
- Link quality audits
- Outbound link monitoring
- Content freshness validation
Polite External Link Checking
Conservative settings for respectful crawling:
[external_links]
enabled = true
cache_ttl_days = 30 # Cache longer
timeout_ms = 15000 # 15s tolerance
concurrency = 2 # Low parallelism
Use cases:
- Many external links
- Avoid rate limits
- Network restrictions
CI/CD Pipeline
Fast feedback with fresh data:
[external_links]
enabled = true
cache_ttl_days = 1 # Fresh each run
timeout_ms = 5000 # Fail fast
concurrency = 10 # Speed up validation
Use cases:
- Automated testing
- PR checks
- Daily builds
External link checking adds overhead to crawls. Typical impact:
With external links enabled (default):
100 internal pages
50 unique external links
= 100 page fetches + 50 external checks (if not cached)
= ~150 total HTTP requests
With external links disabled:
100 internal pages
= 100 page fetches
= ~100 total HTTP requests
Cache impact:
First crawl:
50 external links ÷ 5 concurrent × 10s timeout = up to ~100s total (worst case: every link times out)
Second crawl (within cache TTL):
50 external links cached = 0s overhead
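For other site sizes, the same back-of-the-envelope arithmetic can be wrapped in a helper. This is a rough sketch: `estimate_overhead` is a hypothetical name, and the time figure models only the pessimistic case where every uncached link runs to the full timeout, with checks processed `concurrency` at a time and no GET fallback.

```python
import math
from typing import Tuple


def estimate_overhead(
    internal_pages: int,
    external_links: int,
    timeout_ms: int = 10000,
    concurrency: int = 5,
    cached: int = 0,
) -> Tuple[int, float]:
    """Return (total HTTP requests, pessimistic external-check time in seconds)."""
    uncached = max(external_links - cached, 0)
    total_requests = internal_pages + uncached
    # Links are checked `concurrency` at a time; each may use the full timeout
    batches = math.ceil(uncached / concurrency)
    return total_requests, batches * timeout_ms / 1000


first_crawl = estimate_overhead(100, 50)              # nothing cached yet
second_crawl = estimate_overhead(100, 50, cached=50)  # everything cached
```

In practice most links answer in well under the timeout, so real overhead is usually far below the pessimistic figure.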
External Link Rules
External link configuration affects these rules:
links/broken-external-links
What it checks:
- External links returning 4xx/5xx errors
- Timeout failures
- DNS resolution failures
Requires:
[external_links]
enabled = true # Must be enabled
Configuration:
[external_links]
enabled = true
timeout_ms = 10000
[rules]
enable = ["links/broken-external-links"]
links/https-downgrade
What it checks:
- HTTPS page linking to HTTP external URL
- Security downgrade warnings
Requires:
[external_links]
enabled = true # To validate external URLs
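The downgrade condition itself is simple to state in code. The sketch below is only an illustration of the check described above: `is_https_downgrade` is a hypothetical helper, and treating "external" as "different hostname" is an assumption.

```python
from urllib.parse import urlparse


def is_https_downgrade(page_url: str, link_url: str) -> bool:
    """True when an HTTPS page links to an external plain-HTTP URL."""
    page, link = urlparse(page_url), urlparse(link_url)
    return (
        page.scheme == "https"
        and link.scheme == "http"
        # Assumption: "external" means the hostnames differ
        and page.hostname != link.hostname
    )
```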
Troubleshooting
External links all time out
Symptoms:
- Many “timeout” failures
- External link checks are slow
Solutions:
Increase timeout:
[external_links]
timeout_ms = 20000 # 20 seconds
Reduce concurrency:
[external_links]
concurrency = 2 # Fewer parallel requests
Check network:
curl -I https://example.com # Test external access
False positives (working links marked broken)
Cause: Some sites block HEAD requests or use WAF/bot protection.
Solution: The external link checker automatically:
- Falls back to GET requests when HEAD fails
- Detects WAF/bot protection (Cloudflare, Akamai, etc.)
- Reports WAF-blocked 403s as “unverifiable” (not broken)
If a site truly blocks bot traffic, the link will be reported as WAF-blocked with the provider name.
Too slow with many external links
Symptoms:
- Crawl takes a long time
- Hundreds of external links
Solutions:
Disable external links:
[external_links]
enabled = false
Or increase concurrency:
[external_links]
concurrency = 15
timeout_ms = 5000
Cache not working
Verify cache location:
ls -la ~/.squirrel/cache/external-links/
Clear and rebuild:
rm -rf ~/.squirrel/cache/external-links/
squirrel audit https://example.com # Rebuild cache
Complete Example
[project]
name = "mysite"
[crawler]
max_pages = 500
[external_links]
# Enable external link validation
enabled = true
# Cache results for 7 days
cache_ttl_days = 7
# 15 second timeout for slow sites
timeout_ms = 15000
# Check 10 links concurrently
concurrency = 10
[rules]
# Enable all rules (including links/broken-external-links), except ai/*
enable = ["*"]
disable = ["ai/*"]
Running this config:
- Crawls up to 500 pages
- Validates all external links
- Caches results for 7 days
- Allows 15s per external link
- Checks 10 external links in parallel
- Reports broken external links in audit