Realtime Hub & Redis Presence
tektree intermediate 6 min read
ELI5
The realtime service is a switchboard with a 5-minute paper-tape: every connected user has a slip on the wall (presence:<userId>) that gets refreshed while they are around. Pull the slip and they’re “offline.” The switchboard also keeps a list of every phone (browser tab) the user has open, so a single message rings them all.
Technical Deep Dive
services/realtime-service runs on :8085 and combines two responsibilities: a gorilla/websocket Hub for live message fan-out and a Redis-backed presence tracker.
Component Class Diagram
classDiagram class Hub { +map[string][]*Client userMap +chan []byte broadcast +Run() +Register(c *Client) +Unregister(c *Client) +Broadcast(msg []byte) +SendToUser(userID string, msg []byte) +GetConnectedUsers() []string +Shutdown() } class Client { +string ID +string UserID +*websocket.Conn Conn +chan []byte Send } class Tracker { +redis.Client rdb +string Prefix +Duration TTL +SetOnline(userID) +SetOffline(userID) +IsOnline(userID) bool +RefreshPresence(userID) +GetOnlineUsers() []string } Hub "1" --> "*" Client : owns Hub <.. Tracker : separate but co-deployedConnection Lifecycle
sequenceDiagram autonumber participant Client participant GW as api-gateway participant RT as realtime-service :8085 participant Hub participant Redis Client->>GW: GET /api/v1/realtime/ws + Bearer JWT GW->>RT: forward + X-User-ID, ?user_id=... RT->>RT: gorilla.Upgrader.Upgrade (CheckOrigin permissive) RT->>Hub: Register(Client{UUID, userID, conn, send}) RT->>Redis: SET presence:<userID> ts EX 300 par read pump Client-->>RT: pong (every ~54s) RT->>Redis: EXPIRE presence:<userID> 300 RT->>RT: setReadDeadline +60s and write pump Hub-->>RT: SendToUser bytes RT-->>Client: WriteMessage end Client--xRT: disconnect / 60s pong miss RT->>Hub: Unregister RT->>Redis: DEL presence:<userID>Buffers and Timings
| Setting | Value | Source |
|---|---|---|
| Read/Write buffer | 1024 B | gorilla Upgrader |
| Send channel buffer | 256 messages | Hub.Client |
| Max message size | 512 KB | handlers.go:24-28 |
| Pong wait | 60 s | read deadline |
| Ping period | ~54 s | (≈ pongWait * 0.9) |
| Presence TTL | 5 min | presence.go presenceTTL |
| Presence key prefix | presence: | presence.go presenceKeyPrefix |
Multi-Tab Fan-Out
Hub.userMap is map[userID][]*Client. When the same user opens three tabs, three Clients are appended under the same key; SendToUser iterates the slice and writes to every Send channel. Each tab has its own UUID Client.ID. GetConnectedUsers returns the unique user keys.
Why Two Stores
The Hub is in-memory per-pod; Redis presence is cross-pod. A user connected to pod A still appears IsOnline=true on pod B because both write to the same Redis key. Cross-pod fan-out is not wired today (Hub.SendToUser only walks the local userMap); a pod-fanned message bus is the natural next step before scaling beyond one realtime pod.
Auth Surface
The websocket handler reads ?user_id= from the upgrade request — the gateway is expected to set or echo it from the JWT. There is no in-service JWT validation today; if a request reaches :8085 directly with any user_id, it will be trusted. Keep the port internal.
Key Terms
- Hub → in-memory connection registry, per realtime pod.
- Client → one websocket connection; multiple per user is normal (multi-tab).
- Presence TTL → 5-minute Redis expiry; refreshed on every pong.
- Pong wait → server side read deadline; missed pong → disconnect.
Q&A
Q: A user closes their last tab but appears online for ~5 minutes. Why?
A: Hub Unregister does call Tracker.SetOffline (DEL the key) on graceful close. If the close was unclean (network drop without close frame), DEL never fires and the key naturally expires after the 5-min TTL.
Q: Two realtime pods are deployed. Pod A’s Hub knows the user; pod B does not. How does a notification reach the user via pod B?
A: Today it does not — SendToUser is local to one Hub. The deployment is effectively single-pod for fan-out. Cross-pod requires a Redis pub/sub channel keyed by user_id that every Hub subscribes to.
Q: Why is CheckOrigin permissive?
A: It is the default scaffold value (returns true). Production must constrain it to the public web origin (or rely on CORS at the gateway) to prevent CSRF-via-WebSocket from arbitrary pages.
Examples
Sending a system notice to a user across all open tabs: hub.SendToUser("user_42", []byte("{\"type\":\"notice\",\"body\":\"Welcome to Pro\"}")). Each tab’s read pump dispatches it to the page’s WS event handler.
neighbors on the map
- WebSocket Session Lifecycle adding a new privileged WS handler
- Presence Broadcast Channels diagnosing missing cursor updates in the editor