Pinned versions and tied clocks: $resend, FHIR history ordering, and replay under load

2026-05-11

#healthcare #fhir #medplum #distributed-systems #subscriptions #audit #open-source

I spend a lot of time thinking about FHIR Subscriptions and the small pattern they encourage: load previousVersion and the current resource, diff or hash them, and decide whether downstream work should run. It is a pragmatic idempotency guard: skip duplicate noise, still react to real changes.
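A minimal sketch of that guard, under some assumptions: the `Resource` shape is trimmed down, the names `contentFingerprint` and `shouldRunDownstream` are hypothetical, and the comparison ignores `meta` because it changes on every write. A real handler might hash the fingerprint or compare only the fields it cares about.

```typescript
// Hypothetical minimal resource shape; real FHIR resources carry many more fields.
interface Resource {
  resourceType: string;
  id: string;
  meta?: { versionId?: string; lastUpdated?: string };
  [key: string]: unknown;
}

// Serialize with sorted object keys so that key order does not affect the comparison.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .filter((k) => obj[k] !== undefined)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Fingerprint of the content only; meta is dropped because it differs between versions.
function contentFingerprint(resource: Resource): string {
  const content: Record<string, unknown> = { ...resource };
  delete content.meta;
  return stableStringify(content);
}

// Decide whether downstream work should run for this (current, previous) pair.
function shouldRunDownstream(current: Resource, previous: Resource | undefined): boolean {
  if (!previous) {
    return true; // first version: nothing to compare against
  }
  return contentFingerprint(current) !== contentFingerprint(previous);
}
```

The guard is only as good as the pair it receives: given the wrong `previous`, it returns a confident but wrong answer.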

That pattern only works if I can reconstruct the same logical moment the first delivery attempt would have seen. If a handler failed after the server accepted an update and the resource then changed again, a naive “replay against current” path hands the handler a drifted (current, previous) pair. The handler may fire when it should stay quiet, or no-op when it should fire. Neither outcome is something I want to explain in operational or clinical-adjacent workflows.

What I put in PR #9062

In medplum/medplum#9062 I proposed extending POST /[Type]/[id]/$resend with an optional versionId. The motivation is straightforward: when redriving a failed delivery, the server should load the historical version the handler missed, derive previousVersion from versioned history, and reconstruct the (version, previousVersion) tuple from the failure point instead of from whatever is newest today.

I also added Repository.readPreviousVersion, aimed at a tight SQL path for “the row before this versionId” instead of ad-hoc paginated history scans. When versionId is present and interaction is omitted, the server infers create for the first version and update otherwise.
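The inference rule can be sketched as a small pure function. This is an illustration, not the code from the PR: `ResendOptions`, `resolveInteraction`, and the `hasPrevious` flag are hypothetical names, and the unpinned case is left undefined because it is outside what is described here.

```typescript
// Hypothetical option shape for a pinned resend; not the server's actual types.
interface ResendOptions {
  versionId?: string;
  interaction?: string;
}

// `hasPrevious` stands for "a version exists before the pinned versionId",
// i.e. the predecessor lookup returned a row.
function resolveInteraction(options: ResendOptions, hasPrevious: boolean): string | undefined {
  if (options.interaction) {
    return options.interaction; // an explicit value always wins
  }
  if (!options.versionId) {
    return undefined; // unpinned resend: not covered by this sketch
  }
  // Pinned and no explicit interaction: first version is a create, anything later an update.
  return hasPrevious ? "update" : "create";
}
```

For example, `resolveInteraction({ versionId: "v1" }, false)` yields `"create"`, while the same call with `hasPrevious` set to `true` yields `"update"`.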

That is the shape I want for event-style replay: a stable handle to the payload slice I mean to re-execute, not “whatever the document looks like now.”

Where “sort by lastUpdated” stopped being enough for me

To walk backward one step, the implementation naturally reaches for meta.lastUpdated. In Medplum’s history tables, ordering is tied to that field. Values often come from millisecond-precision timestamps, and privileged clients may also set meta.lastUpdated directly, so distinct history rows for the same logical resource can share an identical timestamp.

FHIR does not hand you a spec-guaranteed monotonic sequence. meta.versionId is server-assigned and opaque, with no required ordering semantics across servers. History bundles are ordered “most recent first,” but nothing defines a deterministic tie-break when lastUpdated matches. I started a thread with other implementers on chat.fhir.org because I wanted that gap spelled out somewhere durable.

While I was wiring readPreviousVersion, I opened medplum/medplum#9112 to capture the underlying problem: I need a deterministic predecessor when timestamps collide. Querying our CDC stream of FHIR resource changes, I found collisions the same day where a (resourceType, id, lastUpdated) tuple maps to more than one versionId. Rare at a glance, but not hypothetical, and it gets easier to hit as write throughput rises or as more actors can stamp lastUpdated.
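A collision check of that kind can be sketched over an in-memory list of change records. The `ChangeRecord` shape and the function name `findTimestampCollisions` are assumptions for illustration; the actual CDC query is not shown here. Timestamps are assumed to be valid ISO 8601 strings and are normalized to epoch milliseconds before grouping.

```typescript
// Hypothetical shape of one change event; field names are assumptions.
interface ChangeRecord {
  resourceType: string;
  id: string;
  versionId: string;
  lastUpdated: string; // ISO 8601 instant
}

interface Collision {
  resourceType: string;
  id: string;
  lastUpdatedMs: number;
  versionIds: string[];
}

// Group by (resourceType, id, lastUpdated in ms) and keep groups with more than one versionId.
function findTimestampCollisions(records: ChangeRecord[]): Collision[] {
  const groups = new Map<string, Collision>();
  for (const r of records) {
    const lastUpdatedMs = new Date(r.lastUpdated).getTime();
    const key = `${r.resourceType}/${r.id}@${lastUpdatedMs}`;
    let group = groups.get(key);
    if (!group) {
      group = { resourceType: r.resourceType, id: r.id, lastUpdatedMs, versionIds: [] };
      groups.set(key, group);
    }
    if (group.versionIds.indexOf(r.versionId) === -1) {
      group.versionIds.push(r.versionId); // ignore duplicate deliveries of the same version
    }
  }
  const collisions: Collision[] = [];
  groups.forEach((group) => {
    if (group.versionIds.length > 1) {
      collisions.push(group);
    }
  });
  return collisions;
}
```

Counting the returned groups over time is one cheap way to see whether collisions grow with write throughput.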

The mitigation in the PR is the engineering I could ship without a schema revolution: use <= against the target instant, fetch enough rows to detect ties, and warn when two candidates share the same millisecond. The trade-off is that this flags ambiguity but does not fully characterize ties of three or more rows unless I add extra queries or metrics. Matt Willer asked exactly that on review; I agreed the scope of ambiguity is not fully captured and that analytics or metrics may be the better place to watch it. Longer term I still want an explicit disambiguation strategy rather than leaning on lastUpdated alone.
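The shape of that mitigation can be modeled in memory. This is a sketch under assumptions, not the SQL path from the PR: `HistoryRow`, `findPrevious`, and the `ambiguous` flag are hypothetical, timestamps are epoch milliseconds, and the limit of two candidates stands in for "fetch enough rows to detect ties."

```typescript
// Hypothetical history row for a single resource; lastUpdated is epoch milliseconds.
interface HistoryRow {
  versionId: string;
  lastUpdated: number;
}

interface PredecessorResult {
  previous?: HistoryRow;
  ambiguous: boolean; // true when a timestamp tie makes the choice of predecessor uncertain
}

function findPrevious(rows: HistoryRow[], targetVersionId: string): PredecessorResult {
  const target = rows.find((r) => r.versionId === targetVersionId);
  if (!target) {
    throw new Error(`unknown versionId: ${targetVersionId}`);
  }
  // "<=" against the target instant, excluding the target row itself, newest first.
  const candidates = rows
    .filter((r) => r.versionId !== targetVersionId && r.lastUpdated <= target.lastUpdated)
    .sort((a, b) => b.lastUpdated - a.lastUpdated)
    .slice(0, 2); // two rows are enough to notice a tie, not to size it
  const [previous, next] = candidates;
  if (!previous) {
    return { ambiguous: false }; // target is the first version
  }
  const tiesWithTarget = previous.lastUpdated === target.lastUpdated;
  const tiesWithNext = next !== undefined && next.lastUpdated === previous.lastUpdated;
  return { previous, ambiguous: tiesWithTarget || tiesWithNext };
}
```

Because only two candidates are inspected, the flag cannot distinguish a two-way tie from a larger one, which is the limitation raised on review.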

Why I keep mapping this to distributed systems, audit trails, and replay

To me this is the same family of problems as partial orders in distributed logs. Wall-clock timestamps are convenient for humans and indexes; they are not a global total order unless you invest in one (synthetic sequence numbers, hybrid logical clocks, or a single-writer append log). Two events at the “same time” on the clock are not necessarily unrelated in causality, and conversely, skewed clocks can reorder unrelated events. When the audit narrative is “we can always walk history,” the implementation detail underneath is still a storage ordering choice. Under bursty writes, that choice shows seams.
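To illustrate the synthetic sequence number option, here is a toy single-writer log for one resource. It is a sketch of the general technique, not a proposal for Medplum’s schema: `SequencedHistory` and its fields are hypothetical, and a real system would assign the sequence inside the database.

```typescript
// Hypothetical entry: seq is assigned by the single writer at append time.
interface SequencedEntry {
  seq: number;
  versionId: string;
  lastUpdated: number; // epoch milliseconds, kept for humans and indexes
}

class SequencedHistory {
  private readonly entries: SequencedEntry[] = [];
  private nextSeq = 1;

  append(versionId: string, lastUpdated: number): SequencedEntry {
    const entry: SequencedEntry = { seq: this.nextSeq++, versionId, lastUpdated };
    this.entries.push(entry);
    return entry;
  }

  // Predecessor is defined by seq alone, so identical timestamps cannot create ambiguity.
  predecessor(versionId: string): SequencedEntry | undefined {
    const target = this.entries.find((e) => e.versionId === versionId);
    if (!target) {
      throw new Error(`unknown versionId: ${versionId}`);
    }
    let best: SequencedEntry | undefined;
    for (const e of this.entries) {
      if (e.seq < target.seq && (!best || e.seq > best.seq)) {
        best = e;
      }
    }
    return best;
  }
}
```

Three appends with the same `lastUpdated` still have a well-defined chain of predecessors, because the order comes from the writer rather than the clock.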

Audit and compliance conversations often assume a neat timeline. Production prefers throughput, batching, retries, and concurrent writers. The tension is not philosophical: if replay, diffing, or legal-style reconstruction depends on “the row before this one,” then undefined order under ties is a correctness hazard, not a cosmetic bug.

Event replay has the same dependency. Idempotency keys and content hashes help, but healthcare-style subscriptions still want versioned semantics tied to the server’s own history. Pinning versionId for $resend is how I say “replay this committed state,” analogous to replaying a partition log from a specific offset rather than from “whatever the topic tail is now.”

Same stack, different axis

Concurrency in the browser layer is a different problem (cross-tab OAuth refresh under rotating tokens), but the rhyme is the same: assumptions that hold in one process break when reality is parallelized. I wrote about Medplum’s client-side Web Locks approach in /blog/medplum-multi-tab-token-refresh-web-locks; the PR there is medplum/medplum#9113.
