Click Tracking Pipeline
smo1 intermediate 7 min read
ELI5
When someone clicks a short link, the system sends them to their destination instantly — it does not wait to finish writing down “someone clicked this.” The “writing it down” part happens in the background, like a restaurant waiter who seats you immediately and then rings up your order on the POS later. This keeps links fast while still collecting every click for statistics.
Technical Deep Dive
Pipeline Sequence
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#e8f4f8', 'primaryTextColor': '#2d3748', 'primaryBorderColor': '#90cdf4', 'lineColor': '#718096', 'secondaryColor': '#f0fff4', 'tertiaryColor': '#fefcbf'}}}%%sequenceDiagram autonumber actor U as User participant Z as zoomies-edge participant KV as Cloudflare KV participant API as purr-api participant CH as ClickHouse participant PG as PostgreSQL
U->>Z: GET smo1.io/:slug Z->>Z: Bot detection (UA check) alt Is bot Z-->>U: 302 Redirect (no tracking) else Not bot Z->>KV: Lookup link:{slug} alt Cache miss Z->>API: GET /api/links/slug/:slug API-->>Z: Link data Z->>KV: Write link:{slug} (TTL 300s) end Z-->>U: 302 Redirect (immediate) par Async Tracking Z->>API: POST /api/analytics/track Note over Z,API: Fire-and-forget - response ignored and Async Analytics Engine Z->>Z: Write to CF Analytics Engine (if bound) end API->>API: Parse UA (browser, OS, device) API->>API: Hash visitor_id = SHA-256(ip + UA + date) API->>CH: INSERT link_clicks (JSONEachRow) API->>PG: UPDATE links.total_clicks += 1 API->>PG: UPDATE users.total_clicks += 1 endWhy Async?
The redirect (step 8) is returned before the analytics POST (step 9) is even sent. This guarantees:
- Sub-100ms redirect latency even if purr-api or ClickHouse is slow
- No user-facing failures from analytics backend outages
- Fire-and-forget semantics: the worker does not wait for the analytics API to respond
Bot Detection Gate
Before any tracking occurs, the worker checks the User-Agent against 36+ bot patterns (googlebot, bingbot, facebookexternalhit, curl, wget, python-requests, headlesschrome, etc.). Bot requests get redirected but produce zero analytics events.
Visitor Deduplication
To count “unique visitors” rather than “raw clicks”:
visitor_id = SHA-256( IP | User-Agent | YYYY-MM-DD )- IP: From
CF-Connecting-IPheader (Cloudflare strips the true client IP) - User-Agent: Full UA string
- Date:
YYYY-MM-DD— resets the hash at midnight UTC, so the same person clicking twice on the same day counts as one unique visitor
ClickHouse stores visitor_id per event. Dashboard queries use uniqExact(visitor_id) for unique counts.
IP Anonymization
For privacy compliance, the analytics pipeline anonymizes IP addresses before storage:
- IPv4: last octet zeroed (
203.0.113.42→203.0.113.0) - IPv6: masked to /48 (
2001:db8:abcd:ef01::1→2001:db8:abcd::)
ClickHouse Schema (link_clicks)
| Column | Type | Description |
|---|---|---|
event_id | UUID | Unique event identifier |
link_id | UUID | Foreign key to links table |
user_id | String | Link owner (denormalized) |
clicked_at | DateTime64(3) | Timestamp with millisecond precision |
visitor_id | String | SHA-256 deduplication hash |
country | String | GeoIP country from cf.country |
city | String | GeoIP city from cf.city |
referer | String | HTTP Referer header |
user_agent | String | Full User-Agent string |
browser | String | Parsed browser name |
os | String | Parsed operating system |
device | String | Parsed device type |
utm_source | String | Inherited from link configuration |
Dual Write Strategy
The edge worker sends analytics to two destinations in parallel:
- purr-api — canonical event store (ClickHouse + Postgres counters)
- Cloudflare Analytics Engine — optional edge-native dataset (only if
ANALYTICSbinding is present)
This provides redundancy: if purr-api is temporarily unreachable, Analytics Engine still captures raw events for later reconciliation.
Key Terms
- Fire-and-forget → An async HTTP call where the caller does not wait for or check the response
- uniqExact → ClickHouse aggregate function that counts distinct values exactly (not approximately)
- JSONEachRow → ClickHouse input format where each line is a JSON object; used for bulk inserts
- CF-Connecting-IP → Cloudflare header containing the original client IP address
- Visitor hash → SHA-256 of
IP + User-Agent + datefor 24-hour deduplication
Q&A
Q: Why use SHA-256 instead of a simpler hash for visitor deduplication? A: SHA-256 is collision-resistant and available in both Web Crypto API (Workers) and Go’s standard library. The input is low-entropy (IP + UA + date), so a cryptographic hash prevents intentional collisions.
Q: What happens if ClickHouse is down? A: The analytics POST to purr-api may fail, but because it is fire-and-forget, the user already received their redirect. Counters in PostgreSQL are updated synchronously within purr-api, but ClickHouse inserts are best-effort. Operational monitoring should alert on ClickHouse insert failures.
Q: Do proxy-mode links track clicks differently? A: No. The tracking logic runs before the redirect/proxy decision. Both modes record the same analytics event.
Q: How are UTM parameters handled in tracking?
A: The link’s stored UTM fields (utm_source, utm_medium, etc.) are copied into the ClickHouse event row. This lets dashboards attribute clicks to campaigns even if the destination URL did not originally contain UTM params.
Examples
Think of the click pipeline like a toll booth on a highway:
- The driver (user) approaches the booth
- The camera (bot detection) checks if it is a self-driving truck (bot) — if yes, the gate opens without taking a ticket
- The gate (redirect) opens immediately so the driver never waits
- A sensor (async tracking) counts the axle as the car passes, but the driver is already gone
- The count is sent to a central office (ClickHouse) and a local ledger (Postgres) at the same time
- If the central office is closed, the local ledger still records it, and the sensor has a backup tape (Analytics Engine)
neighbors on the map
- Bot Detection & Visitor Deduplication investigating inflated click counts
- Multi-Layer Caching Strategy debugging stale link data
- Database Architecture designing a query that spans transactional and analytical data