Bot Detection & Visitor Deduplication
smo1 intermediate 5 min read
ELI5
Not every visitor to a short link is a real person. Search engines, social media preview bots, and automated scripts also follow links. The edge worker acts like a nightclub bouncer with a checklist: if you are on the bot list, you get in (redirected) but you are not counted in the crowd statistics. For real people, the bouncer stamps your hand with a special ink that only lasts 24 hours — click twice in one day and you only count as one unique visitor.
Technical Deep Dive
Bot Detection Flow
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#e8f4f8', 'primaryTextColor': '#2d3748', 'primaryBorderColor': '#90cdf4', 'lineColor': '#718096', 'secondaryColor': '#f0fff4', 'tertiaryColor': '#fefcbf'}}}%%
flowchart TD
    A[Incoming Request] --> B[Extract User-Agent]
    B --> C{Matches bot<br/>pattern list?}
    C -->|Yes| D[Set isBot = true]
    C -->|No| E[Set isBot = false]
    D --> F[Skip analytics tracking]
    E --> G[Compute visitor hash]
    F --> H[Resolve link]
    G --> H
    H --> I[Return redirect/proxy]
Bot Pattern List
zoomies-edge matches the User-Agent header (case-insensitive) against 36+ patterns:
Search engines & crawlers:
googlebot, bingbot, yandex, baiduspider, duckduckbot
Social preview bots:
facebookexternalhit, twitterbot, linkedinbot, slackbot, discordbot, telegrambot, whatsapp
Development tools & headless browsers:
curl, wget, python-requests, httpie, postman, headlesschrome, phantomjs, selenium
Monitoring & health checks:
uptimerobot, pingdom, datadog
The list is maintained as a RegExp array in zoomies-edge/src/bot-detection.ts (or equivalent). New patterns can be added without redeploying the backend.
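As an illustration of how the categories above could be matched, here is a minimal sketch that groups the listed patterns into one case-insensitive RegExp per category and reports which category matched. The names `BotCategory`, `BOT_PATTERNS`, and `classifyUserAgent` are hypothetical and not taken from zoomies-edge; the real list is longer and may be organized differently.

```typescript
// Hypothetical grouping of the patterns listed above (not the actual zoomies-edge source).
type BotCategory = "search" | "social" | "tool" | "monitoring";

const BOT_PATTERNS: Record<BotCategory, RegExp> = {
  search: /googlebot|bingbot|yandex|baiduspider|duckduckbot/i,
  social: /facebookexternalhit|twitterbot|linkedinbot|slackbot|discordbot|telegrambot|whatsapp/i,
  tool: /curl|wget|python-requests|httpie|postman|headlesschrome|phantomjs|selenium/i,
  monitoring: /uptimerobot|pingdom|datadog/i,
};

// Returns the first matching category, or null when no pattern matches.
function classifyUserAgent(userAgent: string | null): BotCategory | null {
  // Assumption: a missing User-Agent header is treated as "not a known bot" here.
  if (!userAgent) return null;
  for (const category of Object.keys(BOT_PATTERNS) as BotCategory[]) {
    if (BOT_PATTERNS[category].test(userAgent)) return category;
  }
  return null;
}
```

Because the patterns are substring matches, a User-Agent only needs to contain one of the tokens anywhere in the string to be classified as a bot.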
Visitor Deduplication
For non-bot requests, a visitor hash is computed to identify unique individuals within a 24-hour window:
visitor_id = SHA-256( IP_address + "|" + User-Agent + "|" + YYYY-MM-DD )
Components:
- IP address: From CF-Connecting-IP header (Cloudflare-provided, not the direct connection IP)
- User-Agent: Full UA string
- Date: YYYY-MM-DD in UTC; the hash resets at midnight, so the same person tomorrow is a new unique visitor
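To make the three components concrete, the following sketch builds the string that is fed to the hash. `utcDateKey` and `visitorHashInput` are hypothetical helper names, and the IP and User-Agent values are placeholders; the only details taken from the text above are the "|" separator and the UTC `YYYY-MM-DD` date.

```typescript
// Format a timestamp as YYYY-MM-DD in UTC; Date.prototype.toISOString() always reports UTC.
function utcDateKey(now: Date): string {
  return now.toISOString().slice(0, 10);
}

// Join the three components with "|" as in the formula above (hypothetical helper).
function visitorHashInput(ip: string, userAgent: string, now: Date): string {
  return `${ip}|${userAgent}|${utcDateKey(now)}`;
}

// Same client, two minutes apart, on either side of midnight UTC.
const beforeMidnight = visitorHashInput(
  "203.0.113.7", "ExampleBrowser/1.0", new Date("2024-12-31T23:59:00Z"));
const afterMidnight = visitorHashInput(
  "203.0.113.7", "ExampleBrowser/1.0", new Date("2025-01-01T00:01:00Z"));
```

The two inputs differ only in the date suffix, which is enough to produce two different visitor IDs once hashed.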
Implementation:
- Cloudflare Workers: Web Crypto API (crypto.subtle.digest('SHA-256', ...))
- Fallback: DJB2 hash for synchronous contexts where Web Crypto is unavailable
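For the synchronous fallback, a DJB2 hash could look roughly like the sketch below. This is the classic 32-bit DJB2 (seed 5381, multiply by 33, add each character code); whether zoomies-edge uses exactly this variant or this output encoding is an assumption, and `fallbackVisitorId` is a hypothetical name.

```typescript
// Classic DJB2: start at 5381, then hash = hash * 33 + charCode for each character.
// (hash << 5) + hash equals hash * 33; ">>> 0" keeps the value an unsigned 32-bit integer.
function djb2(input: string): number {
  let hash = 5381;
  for (let i = 0; i < input.length; i++) {
    hash = ((hash << 5) + hash + input.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Hypothetical fallback: same "ip|ua|date" input as the SHA-256 path,
// encoded as an 8-character hex string (an assumed output format).
function fallbackVisitorId(ip: string, userAgent: string, date: string): string {
  const hex = djb2(`${ip}|${userAgent}|${date}`).toString(16);
  return ("00000000" + hex).slice(-8);
}
```

DJB2 yields only 32 bits and is not collision-resistant, so IDs produced this way are more likely to collide than SHA-256 digests.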
Why Deduplicate?
Raw click counts alone would be misleading. With deduplication, the two metrics diverge as follows:
- A user refreshing the page 10 times = 10 clicks, but 1 unique visitor
- A user clicking the same link on mobile and desktop = 2 unique visitors (different UA)
- A user clicking today and tomorrow = 2 unique visitors (date changes)
ClickHouse stores every raw click event, but dashboard queries use uniqExact(visitor_id) to count unique individuals.
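The difference between raw clicks and unique visitors can be sketched in memory. This is only an analogue of what a `uniqExact(visitor_id)` query computes, not the actual dashboard query; `ClickEvent` and `summarize` are hypothetical names.

```typescript
// Hypothetical shape of a stored click event; real events carry more fields.
interface ClickEvent {
  slug: string;
  visitorId: string;
}

// Total clicks = number of rows; unique visitors = number of distinct visitor IDs.
function summarize(events: ClickEvent[]): { clicks: number; uniqueVisitors: number } {
  const distinct = new Set(events.map((e) => e.visitorId));
  return { clicks: events.length, uniqueVisitors: distinct.size };
}

// Ten refreshes from one browser plus one click from a second device.
const events: ClickEvent[] = [];
for (let i = 0; i < 10; i++) {
  events.push({ slug: "abc", visitorId: "desktop-hash" });
}
events.push({ slug: "abc", visitorId: "mobile-hash" });

const summary = summarize(events);
```

Here `summary.clicks` is 11 while `summary.uniqueVisitors` is 2, matching the first two cases in the list above.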
Bot Requests Still Redirect
Important: bot detection only skips analytics tracking. The link is still resolved and the redirect (or proxy) still happens. This ensures:
- Search engines can follow and index redirected content
- Social media preview bots can fetch OpenGraph metadata
- Monitoring tools verify link health
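A minimal sketch of this "always redirect, conditionally track" behavior, with the link lookup and analytics call injected as plain functions so the control flow is visible. `Deps`, `Outcome`, and `handleClick` are hypothetical names; the 302 status and the choice to track only resolved slugs are assumptions, not details confirmed by the source.

```typescript
// Hypothetical dependencies; in a real worker these would wrap storage and the analytics POST.
interface Deps {
  isBot: (userAgent: string) => boolean;
  resolve: (slug: string) => string | null;
  track: (slug: string) => void;
}

type Outcome = { status: 302; location: string } | { status: 404 };

function handleClick(slug: string, userAgent: string, deps: Deps): Outcome {
  const target = deps.resolve(slug);
  if (target === null) return { status: 404 };
  // Bots skip analytics only; the redirect below happens either way.
  if (!deps.isBot(userAgent)) deps.track(slug);
  return { status: 302, location: target };
}

// Usage with in-memory stand-ins.
const tracked: string[] = [];
const deps: Deps = {
  isBot: (ua) => /curl|googlebot/i.test(ua),
  resolve: (slug) => (slug === "abc" ? "https://example.com/" : null),
  track: (slug) => { tracked.push(slug); },
};

const botResult = handleClick("abc", "curl/8.4.0", deps);
const trackedAfterBot = tracked.length;
const humanResult = handleClick("abc", "Mozilla/5.0 ExampleBrowser/1.0", deps);
const trackedAfterHuman = tracked.length;
```

Both calls return the same redirect; only the second one records an analytics event.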
Code Snippet (zoomies-edge)
// Simplified bot detection
const botPatterns = [
  /googlebot/i, /bingbot/i, /facebookexternalhit/i, /twitterbot/i,
  /linkedinbot/i, /curl/i, /wget/i, /python-requests/i,
  /headlesschrome/i, /phantomjs/i,
  // ... 26 more patterns
];

function isBot(userAgent: string): boolean {
  return botPatterns.some(p => p.test(userAgent));
}

// Visitor hash computation
async function computeVisitorId(ip: string, ua: string, date: string): Promise<string> {
  const data = new TextEncoder().encode(`${ip}|${ua}|${date}`);
  const hashBuffer = await crypto.subtle.digest('SHA-256', data);
  return btoa(String.fromCharCode(...new Uint8Array(hashBuffer)));
}

Key Terms
- User-Agent (UA) → HTTP header identifying the browser, OS, and device making the request
- CF-Connecting-IP → Cloudflare header containing the original client IP (not the proxy IP)
- SHA-256 → Cryptographic hash function producing a 256-bit digest; collision-resistant
- DJB2 → Simple non-cryptographic hash used as a fallback when async Web Crypto is unavailable
- uniqExact → ClickHouse function that counts distinct values exactly (not approximately)
- Fire-and-forget → Analytics POST where the worker does not wait for a response
Q&A
Q: Why not block bots entirely? A: Blocking would break search engine indexing, social media link previews, and monitoring. The goal is accurate analytics, not bot exclusion.
Q: Can a malicious actor inflate unique visitor counts? A: To count as multiple unique visitors, an attacker must vary IP, User-Agent, or date. IP rotation is the main vector, but Cloudflare’s IP reputation and rate limiting mitigate this. The system is designed for “reasonable accuracy,” not fraud-proofing.
Q: Why include User-Agent in the hash instead of just IP? A: Multiple people behind the same NAT (office, coffee shop) share an IP. Including UA distinguishes different browsers/devices while still grouping repeated clicks from the same browser.
Q: What happens on January 1st at 00:00 UTC? A: The date component changes, so all visitor hashes reset. A user who clicked at 23:59 on December 31st and again at 00:01 on January 1st counts as two unique visitors. This is acceptable because daily unique visitor counts are meant to be daily.
Examples
Think of a concert venue:
- Bot detection is the security guard with a list of delivery trucks and media vans — they get in through the service entrance (redirected) but are not counted in attendance
- Visitor hash is the wristband stamp — it uses invisible ink that only lasts until the venue closes (midnight). Come back tomorrow and you get a new stamp
- IP + UA + date is the stamp formula: your face (IP), your height (UA), and today’s date. Two people who look alike but are different heights get different stamps
neighbors on the map
- Click Tracking Pipeline: debugging missing or duplicate click counts
- Request Routing & Edge Resolution: debugging why a slug returns 404 instead of redirecting