LatencyMetric EMA Algorithm

weave intermediate 5 min read

ELI5

The EMA latency tracker is like a taxi driver’s mental average of how long a route takes: each new trip updates the estimate, but old experience still counts for 80%. After about 85 trips the driver is considered “reliable” (confidence ≥ 80). The driver is considered “fast” if the route consistently takes ≤ 5 minutes.

Technical Deep Dive

EMA Formula

new_latency = (1 − α) × old_latency + α × measured_ms
            = 0.8 × old_latency + 0.2 × measured_ms

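Alpha α = 0.2 is hardcoded in LatencyMetric::update() (line 81, network.rs). This is a conservative smoothing factor: each sample moves the estimate only 20% of the way toward the new measurement, and it takes 8 samples before new measurements account for more than 80% of the estimate's total weight (1 − 0.8^8 ≈ 0.83).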
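
The update step can be sketched as a single function. This is a minimal illustration, assuming latency is stored as whole milliseconds and the fractional part is truncated; the name `ema_update` and the integer form `(4 * old + measured) / 5` are assumptions for this sketch, not the actual body of LatencyMetric::update().

```rust
/// Hypothetical helper (not the actual body of `LatencyMetric::update()`):
/// one EMA step with α = 0.2 on integer milliseconds.
///
/// (4 * old + measured) / 5 equals 0.8 * old + 0.2 * measured, with the
/// fractional part truncated by integer division.
pub fn ema_update(old_latency_ms: u32, measured_ms: u32) -> u32 {
    // Widen to u64 so that 4 * old cannot overflow for large inputs.
    let blended = (4 * u64::from(old_latency_ms) + u64::from(measured_ms)) / 5;
    // The blend never exceeds the larger input, so it fits back into u32.
    blended as u32
}
```

Doing the blend in integer arithmetic avoids floating-point rounding surprises, at the cost of always rounding the estimate down.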

Confidence Growth

confidence = min(sample_count, 100) × 95 / 100   (integer division, rounds down)

sample_count	confidence
0	0
1	0
85	80 (is_reliable threshold)
100	95 (maximum)

Confidence is capped at 95, not 100 — this leaves a permanent uncertainty margin so no link is treated as infallible.
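
The confidence formula above can be sketched as follows. The function name `confidence` and the `u8` return type are assumptions made for illustration; only the formula itself comes from this card.

```rust
/// Hypothetical helper mirroring the formula above.
/// Integer division truncates, so 1 sample yields 0 and 85 samples yield 80.
pub fn confidence(sample_count: u32) -> u8 {
    // Clamp the count at 100 first, then scale to the 0..=95 range.
    (sample_count.min(100) * 95 / 100) as u8
}
```

Because of the truncation, 84 samples give 79, which is why 85 is the first count that reaches the is_reliable threshold of 80.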

Convergence to True Latency

xychart-beta
title "EMA convergence: true latency = 10ms, initial prior = 12ms"
x-axis "Sample count" [1, 5, 10, 20, 30, 50]
y-axis "Estimated latency (ms)" 8 --> 14
line [11.6, 10.66, 10.21, 10.02, 10.00, 10.00]

(Computed from the EMA formula in real arithmetic: after n samples at the true latency, the estimate is 10 + 2 × 0.8^n ms. Integer truncation of the stored value is not modeled.)
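
The convergence curve can be reproduced with a short sketch that compares the iterative update against the closed form. Both function names are hypothetical, and the sketch uses `f64` rather than whatever integer type the real metric stores.

```rust
const ALPHA: f64 = 0.2;

/// Closed form: after `n` samples that all equal `true_ms`, the gap between
/// the prior and the true value has shrunk by a factor of (1 - α)^n.
pub fn ema_estimate(prior_ms: f64, true_ms: f64, n: u32) -> f64 {
    true_ms + (prior_ms - true_ms) * (1.0 - ALPHA).powi(n as i32)
}

/// Same quantity, obtained by applying the EMA update `n` times.
pub fn ema_iterate(prior_ms: f64, true_ms: f64, n: u32) -> f64 {
    let mut estimate = prior_ms;
    for _ in 0..n {
        estimate = (1.0 - ALPHA) * estimate + ALPHA * true_ms;
    }
    estimate
}
```

For a 12 ms prior and a 10 ms true latency, `ema_estimate(12.0, 10.0, 1)` gives 11.6 ms, the first point on the chart.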

Status Flags

Flag	Method	Condition
Fast	`is_fast()`	`latency_ms <= 5`
Reliable	`is_reliable()`	`confidence >= 80` (≈ 85 samples)

The score formula in Transport::reselect_transport() works from the same underlying values, latency_ms and confidence, rather than from the boolean flags themselves. A link can be fast but not reliable (few samples), or reliable but not fast (well-measured slow link).
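
The two flags can be sketched as simple predicates. The struct `LinkStats` and its field types are assumptions for this illustration; only the method names and thresholds come from the table above.

```rust
/// Hypothetical snapshot of the two values the flags are derived from.
pub struct LinkStats {
    pub latency_ms: u32,
    pub confidence: u8,
}

impl LinkStats {
    /// Fast: the smoothed latency is at most 5 ms.
    pub fn is_fast(&self) -> bool {
        self.latency_ms <= 5
    }

    /// Reliable: confidence has reached 80 (about 85 samples).
    pub fn is_reliable(&self) -> bool {
        self.confidence >= 80
    }
}
```

The flags are independent: a fresh 3 ms link is fast but not reliable, and a well-measured 30 ms link is reliable but not fast.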

Integration with Transport Scoring

The score, latency_ms as i32 − confidence as i32 (lower is better), means:

A fresh fast link (3 ms, confidence 0) scores 3.
A mature fast link (3 ms, confidence 80) scores −77.
A mature slow link (30 ms, confidence 80) scores −50.

The mature fast link always wins once it accumulates samples. During cold-start, all links score near their priors.
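
The three cases above can be checked with a small sketch. It assumes the lowest score wins, which is what the examples imply; `score` and `pick_best` are hypothetical names, not the actual Transport::reselect_transport() code.

```rust
/// Score as described above: smoothed latency minus confidence.
pub fn score(latency_ms: u32, confidence: u8) -> i32 {
    latency_ms as i32 - confidence as i32
}

/// Hypothetical selector: returns the name of the link with the lowest score.
/// Each entry is (name, latency_ms, confidence).
pub fn pick_best(links: &[(&'static str, u32, u8)]) -> Option<&'static str> {
    links
        .iter()
        .min_by_key(|link| score(link.1, link.2))
        .map(|link| link.0)
}
```

Casting to `i32` before subtracting is what allows the score to go negative once confidence exceeds latency.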

Key Terms

EMA (Exponential Moving Average) → Smoothing filter: new = (1−α)×old + α×sample; weights recent samples more than old ones
alpha (α) → Smoothing factor; 0.2 in WEAVE — retains 80% of prior estimate per sample
confidence → Integer 0–95 derived from sample count; subtracted from latency_ms in transport scoring, so it acts as a bonus of up to 95 points
is_fast → latency_ms <= 5; informational flag; not directly used in scoring
is_reliable → confidence >= 80; requires ≈ 85 samples; indicates stable measurement base

Q&A

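Q: How many samples until a link beats the cold-start prior of another transport? A: A BLE link starts at a 3 ms prior with confidence 0, so it scores 3 − 0 = 3, and it still scores 3 after one sample that matches the prior. A QUIC link holding at its 12 ms prior scores 12 − confidence, which drops below 3 once confidence reaches 10, that is, after 11 samples (11 × 95 / 100 = 10 with integer division). After 85 samples (confidence 80) it scores 12 − 80 = −68. QUIC wins the score competition even though it is slower in absolute terms, which shows that sample count can outweigh raw latency while the competing link is still cold.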

Q: What prevents a temporary spike from permanently degrading a link’s preference? A: The EMA with α=0.2 heavily smooths spikes. A single spike at 3× the true value shifts the estimate by only 20% of the spike’s deviation. Recovery follows the same EMA rate: each later sample removes 20% of what remains, so roughly 3 samples halve the spike’s residual effect (0.8^3 ≈ 0.51) and about 11% is left after 10 samples.

Q: Why cap confidence at 95 instead of 100? A: The cap models inherent measurement uncertainty — network latency is never perfectly stable. Reaching confidence = 100 would mean the score formula could produce scores far into the negatives, potentially causing spurious transport flaps when a minor measurement fluctuation temporarily raises latency_ms.
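
For the spike question above, the decay can be worked through numerically. This is a sketch in `f64` under the assumption that every sample after the spike equals the true latency; `ema_step` and `residual_fraction` are hypothetical names.

```rust
const ALPHA: f64 = 0.2;

/// One EMA update in real arithmetic.
pub fn ema_step(estimate: f64, measured: f64) -> f64 {
    (1.0 - ALPHA) * estimate + ALPHA * measured
}

/// Fraction of a spike-induced offset that remains after `n` further
/// samples at the true latency.
pub fn residual_fraction(n: u32) -> f64 {
    (1.0 - ALPHA).powi(n as i32)
}
```

With a true latency of 10 ms, one 30 ms spike moves the estimate to 14 ms, which is 20% of the 20 ms deviation; `residual_fraction(3)` is about 0.51, so the 4 ms offset is roughly halved three samples later.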

Examples

Simulating 3 BLE measurements matching the test in network.rs lines 243–250:

Initial: latency_ms = 3 (prior), confidence = 0, sample_count = 0
After update(10): latency_ms = 0.8*3 + 0.2*10 = 4.4 → 4, confidence = 0, count = 1
After update(10): latency_ms = 0.8*4 + 0.2*10 = 5.2 → 5, confidence = 1, count = 2
After update(10): latency_ms = 0.8*5 + 0.2*10 = 6, confidence = 2, count = 3
assert!(latency_ms <= 12)  ← test passes

This matches the assertion at line 250: after 3 samples converging toward 10 ms, the estimate is still ≤ 12 ms.

neighbors on the map

Multi-Underlay Transport Selection → diagnosing why a peer keeps falling back to WebRTC when BLE should be available
Spanning Tree Election & Broadcast → debugging why the broadcast root keeps changing unexpectedly under topology churn
FNP Observability & Prometheus Metrics → monitoring FNP systems
In-Process Rate-Limit Bucket → investigating ingest 429s