CRUMB a card from devarno-cloud

Realtime Hub & Redis Presence

tektree intermediate 6 min read

ELI5

The realtime service is a switchboard with a 5-minute paper-tape: every connected user has a slip on the wall (presence:<userId>) that gets refreshed while they are around. Pull the slip and they’re “offline.” The switchboard also keeps a list of every phone (browser tab) the user has open, so a single message rings them all.

Technical Deep Dive

services/realtime-service runs on :8085 and combines two responsibilities: a gorilla/websocket Hub for live message fan-out and a Redis-backed presence tracker.

Component Class Diagram

classDiagram
class Hub {
+map[string][]*Client userMap
+chan []byte broadcast
+Run()
+Register(c *Client)
+Unregister(c *Client)
+Broadcast(msg []byte)
+SendToUser(userID string, msg []byte)
+GetConnectedUsers() []string
+Shutdown()
}
class Client {
+string ID
+string UserID
+*websocket.Conn Conn
+chan []byte Send
}
class Tracker {
+redis.Client rdb
+string Prefix
+Duration TTL
+SetOnline(userID)
+SetOffline(userID)
+IsOnline(userID) bool
+RefreshPresence(userID)
+GetOnlineUsers() []string
}
Hub "1" --> "*" Client : owns
Hub <.. Tracker : separate but co-deployed

Connection Lifecycle

sequenceDiagram
autonumber
participant Client
participant GW as api-gateway
participant RT as realtime-service :8085
participant Hub
participant Redis
Client->>GW: GET /api/v1/realtime/ws + Bearer JWT
GW->>RT: forward + X-User-ID, ?user_id=...
RT->>RT: gorilla.Upgrader.Upgrade (CheckOrigin permissive)
RT->>Hub: Register(Client{UUID, userID, conn, send})
RT->>Redis: SET presence:<userID> ts EX 300
par read pump
Client-->>RT: pong (every ~54s)
RT->>Redis: EXPIRE presence:<userID> 300
RT->>RT: setReadDeadline +60s
and write pump
Hub-->>RT: SendToUser bytes
RT-->>Client: WriteMessage
end
Client--xRT: disconnect / 60s pong miss
RT->>Hub: Unregister
RT->>Redis: DEL presence:<userID>

Buffers and Timings

SettingValueSource
Read/Write buffer1024 Bgorilla Upgrader
Send channel buffer256 messagesHub.Client
Max message size512 KBhandlers.go:24-28
Pong wait60 sread deadline
Ping period~54 s(≈ pongWait * 0.9)
Presence TTL5 minpresence.go presenceTTL
Presence key prefixpresence:presence.go presenceKeyPrefix

Multi-Tab Fan-Out

Hub.userMap is map[userID][]*Client. When the same user opens three tabs, three Clients are appended under the same key; SendToUser iterates the slice and writes to every Send channel. Each tab has its own UUID Client.ID. GetConnectedUsers returns the unique user keys.

Why Two Stores

The Hub is in-memory per-pod; Redis presence is cross-pod. A user connected to pod A still appears IsOnline=true on pod B because both write to the same Redis key. Cross-pod fan-out is not wired today (Hub.SendToUser only walks the local userMap); a pod-fanned message bus is the natural next step before scaling beyond one realtime pod.

Auth Surface

The websocket handler reads ?user_id= from the upgrade request — the gateway is expected to set or echo it from the JWT. There is no in-service JWT validation today; if a request reaches :8085 directly with any user_id, it will be trusted. Keep the port internal.

Key Terms

  • Hub → in-memory connection registry, per realtime pod.
  • Client → one websocket connection; multiple per user is normal (multi-tab).
  • Presence TTL → 5-minute Redis expiry; refreshed on every pong.
  • Pong wait → server side read deadline; missed pong → disconnect.

Q&A

Q: A user closes their last tab but appears online for ~5 minutes. Why? A: Hub Unregister does call Tracker.SetOffline (DEL the key) on graceful close. If the close was unclean (network drop without close frame), DEL never fires and the key naturally expires after the 5-min TTL.

Q: Two realtime pods are deployed. Pod A’s Hub knows the user; pod B does not. How does a notification reach the user via pod B? A: Today it does not — SendToUser is local to one Hub. The deployment is effectively single-pod for fan-out. Cross-pod requires a Redis pub/sub channel keyed by user_id that every Hub subscribes to.

Q: Why is CheckOrigin permissive? A: It is the default scaffold value (returns true). Production must constrain it to the public web origin (or rely on CORS at the gateway) to prevent CSRF-via-WebSocket from arbitrary pages.

Examples

Sending a system notice to a user across all open tabs: hub.SendToUser("user_42", []byte("{\"type\":\"notice\",\"body\":\"Welcome to Pro\"}")). Each tab’s read pump dispatches it to the page’s WS event handler.

neighbors on the map