CRUMB a card from devarno-cloud

Bot Detection & Visitor Deduplication

smo1 intermediate 5 min read

ELI5

Not every visitor to a short link is a real person. Search engines, social media preview bots, and automated scripts also follow links. The edge worker acts like a nightclub bouncer with a checklist: if you are on the bot list, you get in (redirected) but you are not counted in the crowd statistics. For real people, the bouncer stamps your hand with ink that washes off at midnight UTC: click twice on the same day and you still count as one unique visitor.

Technical Deep Dive

Bot Detection Flow

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#e8f4f8', 'primaryTextColor': '#2d3748', 'primaryBorderColor': '#90cdf4', 'lineColor': '#718096', 'secondaryColor': '#f0fff4', 'tertiaryColor': '#fefcbf'}}}%%
flowchart TD
A[Incoming Request] --> B[Extract User-Agent]
B --> C{Matches bot<br/>pattern list?}
C -->|Yes| D[Set isBot = true]
C -->|No| E[Set isBot = false]
D --> F[Skip analytics tracking]
E --> G[Compute visitor hash]
F --> H[Resolve link]
G --> H
H --> I[Return redirect/proxy]

Bot Pattern List

zoomies-edge matches the User-Agent header (case-insensitive) against 36+ patterns. Representative examples by category:

Search engines & crawlers: googlebot, bingbot, yandex, baiduspider, duckduckbot

Social preview bots: facebookexternalhit, twitterbot, linkedinbot, slackbot, discordbot, telegrambot, whatsapp

Development tools & headless browsers: curl, wget, python-requests, httpie, postman, headlesschrome, phantomjs, selenium

Monitoring & health checks: uptimerobot, pingdom, datadog

The list is maintained as a RegExp array in zoomies-edge/src/bot-detection.ts (or equivalent). Because the list lives in the edge worker, new patterns can be added by updating the worker alone, without redeploying the backend.

Visitor Deduplication

For non-bot requests, a visitor hash is computed to identify unique individuals within a single UTC calendar day:

visitor_id = SHA-256( IP_address + "|" + User-Agent + "|" + YYYY-MM-DD )

Components:

  • IP address: From CF-Connecting-IP header (Cloudflare-provided, not the direct connection IP)
  • User-Agent: Full UA string
  • Date: YYYY-MM-DD in UTC — hash resets at midnight, so the same person tomorrow is a new unique visitor
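
The components above can be assembled as in the following minimal sketch. The helper names utcDateKey and visitorHashInput are hypothetical and not taken from zoomies-edge, and the IP and User-Agent values are placeholders. The sketch only shows how using the UTC calendar date makes the hash input change at midnight.

```typescript
// Hypothetical helpers (names are illustrative, not from zoomies-edge).

// Format a timestamp as YYYY-MM-DD in UTC. toISOString() always reports UTC,
// so its first 10 characters are the UTC calendar date.
function utcDateKey(now: Date): string {
  return now.toISOString().slice(0, 10);
}

// Join the three components with "|" in the same order as the formula.
function visitorHashInput(ip: string, userAgent: string, now: Date): string {
  return `${ip}|${userAgent}|${utcDateKey(now)}`;
}

// Two clicks two minutes apart, on either side of midnight UTC,
// produce different inputs and therefore different visitor IDs.
const beforeMidnight = visitorHashInput(
  "203.0.113.7", "ExampleBrowser/1.0", new Date("2024-12-31T23:59:00Z"));
const afterMidnight = visitorHashInput(
  "203.0.113.7", "ExampleBrowser/1.0", new Date("2025-01-01T00:01:00Z"));

console.log(beforeMidnight); // 203.0.113.7|ExampleBrowser/1.0|2024-12-31
console.log(afterMidnight);  // 203.0.113.7|ExampleBrowser/1.0|2025-01-01
```

Passing the timestamp in as a parameter, rather than calling new Date() inside the helper, keeps the rollover behavior easy to test.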

Implementation:

  • Cloudflare Workers: Web Crypto API (crypto.subtle.digest('SHA-256', ...))
  • Fallback: DJB2 hash for synchronous contexts where Web Crypto is unavailable
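
For illustration, a DJB2 fallback might look like the sketch below. It assumes the common 32-bit variant (start at 5381, multiply by 33, add each character code) applied to UTF-16 code units; the actual zoomies-edge fallback may differ in width, encoding, or output format. The name fallbackVisitorId is hypothetical.

```typescript
// Minimal DJB2 sketch. Assumption: 32-bit variant over UTF-16 code units;
// the real zoomies-edge fallback may be implemented differently.
function djb2(input: string): number {
  let hash = 5381;
  for (let i = 0; i < input.length; i++) {
    // hash * 33 + charCode, wrapped to an unsigned 32-bit integer
    hash = (Math.imul(hash, 33) + input.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Hypothetical synchronous visitor ID using the same input layout
// (IP | User-Agent | date) as the SHA-256 path, rendered as 8 hex digits.
function fallbackVisitorId(ip: string, ua: string, date: string): string {
  return djb2(`${ip}|${ua}|${date}`).toString(16).padStart(8, "0");
}

console.log(djb2(""));  // 5381
console.log(djb2("a")); // 177670
```

A 32-bit digest collides far more readily than SHA-256, so a fallback of this kind trades accuracy for the ability to run synchronously.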

Why Deduplicate?

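Without deduplication, every refresh would look like a new visitor and reach would be overstated. With the visitor hash, clicks and unique visitors are counted separately: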

  • A user refreshing the page 10 times = 10 clicks, but 1 unique visitor
  • A user clicking the same link on mobile and desktop = 2 unique visitors (different UA)
  • A user clicking today and tomorrow = 2 unique visitors (date changes)

ClickHouse stores every raw click event, but dashboard queries use uniqExact(visitor_id) to count unique individuals.
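
The difference between raw clicks and unique visitors can be illustrated in memory. The sketch below is not a ClickHouse query; it uses a Set to mimic an exact distinct count. The ClickEvent shape is hypothetical, and readable strings stand in for real hashes.

```typescript
// Hypothetical click record; the field name is illustrative.
interface ClickEvent {
  visitorId: string; // stands in for the hash of IP | User-Agent | UTC date
}

// Count raw clicks and exact distinct visitor IDs.
function summarize(events: ClickEvent[]): { clicks: number; uniqueVisitors: number } {
  const unique = new Set(events.map(e => e.visitorId));
  return { clicks: events.length, uniqueVisitors: unique.size };
}

const events: ClickEvent[] = [
  // Same desktop browser refreshing three times on one day: one visitor ID
  { visitorId: "ip1|desktop-ua|2025-01-01" },
  { visitorId: "ip1|desktop-ua|2025-01-01" },
  { visitorId: "ip1|desktop-ua|2025-01-01" },
  // Same person on a phone: different User-Agent, so a different ID
  { visitorId: "ip1|mobile-ua|2025-01-01" },
  // Same desktop browser the next day: the date component changed
  { visitorId: "ip1|desktop-ua|2025-01-02" },
];

console.log(summarize(events)); // { clicks: 5, uniqueVisitors: 3 }
```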

Bot Requests Still Redirect

Important: bot detection only skips analytics tracking. The link is still resolved and the redirect (or proxy) still happens. This ensures:

  • Search engines can follow and index redirected content
  • Social media preview bots can fetch OpenGraph metadata
  • Monitoring tools verify link health
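
The rule above can be sketched as a pure decision function: the redirect target is always set, and only the tracking flag depends on bot detection. The Decision type and decide function are hypothetical, the pattern list is shortened to three entries, and treating a missing User-Agent as a non-bot is an assumption that the source does not confirm.

```typescript
// Hypothetical shape of the worker's decision for one request.
interface Decision {
  redirectTo: string;      // always set: bots are redirected too
  trackAnalytics: boolean; // false only when the request looks like a bot
}

// Shortened, illustrative subset of the bot pattern list.
const sampleBotPatterns: RegExp[] = [/googlebot/i, /facebookexternalhit/i, /uptimerobot/i];

function decide(userAgent: string | null, targetUrl: string): Decision {
  // Assumption: a missing User-Agent header is treated as an empty string,
  // which matches no pattern and is therefore tracked as a regular visitor.
  const ua = userAgent ?? "";
  const isBot = sampleBotPatterns.some(p => p.test(ua));
  return { redirectTo: targetUrl, trackAnalytics: !isBot };
}

const crawler = decide(
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
  "https://example.com/landing");
const person = decide(
  "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
  "https://example.com/landing");

console.log(crawler); // { redirectTo: 'https://example.com/landing', trackAnalytics: false }
console.log(person);  // { redirectTo: 'https://example.com/landing', trackAnalytics: true }
```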

Code Snippet (zoomies-edge)

// Simplified bot detection
const botPatterns = [
  /googlebot/i, /bingbot/i, /facebookexternalhit/i,
  /twitterbot/i, /linkedinbot/i, /curl/i, /wget/i,
  /python-requests/i, /headlesschrome/i, /phantomjs/i,
  // ... 26 more patterns
];

// Returns true if any pattern matches the User-Agent string
function isBot(userAgent: string): boolean {
  return botPatterns.some(p => p.test(userAgent));
}

// Visitor hash computation: SHA-256 over "ip|ua|date", base64-encoded
async function computeVisitorId(ip: string, ua: string, date: string): Promise<string> {
  const data = new TextEncoder().encode(`${ip}|${ua}|${date}`);
  const hashBuffer = await crypto.subtle.digest('SHA-256', data);
  return btoa(String.fromCharCode(...new Uint8Array(hashBuffer)));
}

Key Terms

  • User-Agent (UA) → HTTP header identifying the browser, OS, and device making the request
  • CF-Connecting-IP → Cloudflare header containing the original client IP (not the proxy IP)
  • SHA-256 → Cryptographic hash function producing a 256-bit digest; collision-resistant
  • DJB2 → Simple non-cryptographic hash used as a fallback when async Web Crypto is unavailable
  • uniqExact → ClickHouse function that counts distinct values exactly (not approximately)
  • Fire-and-forget → Analytics POST where the worker does not wait for a response

Q&A

Q: Why not block bots entirely? A: Blocking would break search engine indexing, social media link previews, and monitoring. The goal is accurate analytics, not bot exclusion.

Q: Can a malicious actor inflate unique visitor counts? A: To count as multiple unique visitors, an attacker must vary IP, User-Agent, or date. IP rotation is the main vector, but Cloudflare’s IP reputation and rate limiting mitigate this. The system is designed for “reasonable accuracy,” not fraud-proofing.

Q: Why include User-Agent in the hash instead of just IP? A: Multiple people behind the same NAT (office, coffee shop) share an IP. Including UA distinguishes different browsers/devices while still grouping repeated clicks from the same browser.

Q: What happens on January 1st at 00:00 UTC? A: The date component changes, so all visitor hashes reset. A user who clicked at 23:59 on December 31st and again at 00:01 on January 1st counts as two unique visitors. This is acceptable because daily unique visitor counts are meant to be daily.

Examples

Think of a concert venue:

  • Bot detection is the security guard with a list of delivery trucks and media vans — they get in through the service entrance (redirected) but are not counted in attendance
  • Visitor hash is the wristband stamp — it uses invisible ink that only lasts until the venue closes (midnight). Come back tomorrow and you get a new stamp
  • IP + UA + date is the stamp formula: your face (IP), your height (UA), and today’s date. Two people who look alike but are different heights get different stamps
