Add duplicate publisher determinism proof
parent
5d0f3077d3
commit
91dad67fc2
18 changed files with 21569 additions and 595 deletions
|
|
@ -0,0 +1,334 @@
|
|||
# ECP-0156: Duplicate Publisher Deterministic Data Layer
|
||||
|
||||
Status: Draft
|
||||
|
||||
## Context
|
||||
|
||||
Two publisher nodes may broadcast the same logical channel at the same time. The archive and relay
|
||||
layers need this for resilience, but duplicate publishers currently risk looking like conflicting
|
||||
streams instead of convergent copies of the same media.
|
||||
|
||||
## Decision
|
||||
|
||||
Duplicate publishers are valid for a published channel. The data layer dedupes and verifies media by
|
||||
content identity, not by publisher envelope identity:
|
||||
|
||||
- CMAF init and media segment bytes for the same input, ladder profile, and chunk cadence must be
|
||||
byte-for-byte identical.
|
||||
- BLAKE3 media hashes and per-rung Merkle roots are the shared data identity.
|
||||
- Publisher manifests may carry different `stream_id`, `epoch_id`, `created_unix_ms`, signatures,
|
||||
locators, and manifest ids.
|
||||
- The archive must treat matching media hashes from different publishers as corroborating sources.
|
||||
- Archive records must carry source identity. Two copied buffers with the same `source_node` are not
|
||||
duplicate-publisher proof, even when their BLAKE3 hashes match.
|
||||
- Divergent hashes for the same logical channel, rendition, and media time are misses that must be
|
||||
measured before the data is promoted as redundant.
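
The dedupe rule above can be sketched in Rust. This is a minimal illustration under stated assumptions, not the `ec-node` implementation: `ArchiveRecord`, `Verdict`, and `classify` are hypothetical names, and hashes are plain strings standing in for BLAKE3 hex digests.

```rust
use std::collections::BTreeSet;

/// Hypothetical archive record: only the fields the dedupe rule needs.
#[derive(Clone, Debug)]
pub struct ArchiveRecord {
    pub source_node: String,
    pub media_hash: String, // stand-in for a BLAKE3 hex digest
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Fewer than two distinct source nodes: not duplicate-publisher proof.
    SingleSource,
    /// Two or more distinct source nodes agree on one hash.
    Corroborated,
    /// More than one hash for the same logical slot: a miss to be measured.
    Divergent,
}

/// Classify all records that share one (channel, rendition, media time) key.
pub fn classify(records: &[ArchiveRecord]) -> Verdict {
    let hashes: BTreeSet<&str> = records.iter().map(|r| r.media_hash.as_str()).collect();
    if hashes.len() > 1 {
        return Verdict::Divergent;
    }
    let sources: BTreeSet<&str> = records.iter().map(|r| r.source_node.as_str()).collect();
    if sources.len() >= 2 {
        Verdict::Corroborated
    } else {
        Verdict::SingleSource
    }
}
```

Note that two records with the same `source_node` and the same hash classify as `SingleSource`, which mirrors the rule that copied buffers are not duplicate-publisher proof.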
|
||||
|
||||
## Verification
|
||||
|
||||
The proof path has two stages:
|
||||
|
||||
1. Single-node duplicate-publisher tests produce the same ladder twice with different publisher
|
||||
identities and assert byte-for-byte BLAKE3 equality for every generated init and media segment.
|
||||
The `duplicate_publishers_same_input_produce_identical_cmaf_ladder_bytes` test is part of the
|
||||
default Rust test path when ffmpeg is present; it is not an ignored E2E.
|
||||
2. Production verification runs the same channel on two real publishers long enough to measure
|
||||
duplicate media convergence, hash divergence, missing objects, and backfill behavior in Grafana.
|
||||
|
||||
The goal is not just "two publishers are online." Success requires elapsed production time behind the
|
||||
numbers and dashboards that show duplicate hits, misses, and archive repair.
|
||||
|
||||
## Consequences
|
||||
|
||||
Manifest ids cannot be used as the archive dedupe key for duplicate publishers. Operators get a
|
||||
clear signal when two publishers produce identical bytes versus merely announcing the same channel.
|
||||
If encoder determinism changes, the single-node test fails before production redundancy silently
|
||||
degrades.
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- Dedupe by manifest id. This preserves envelope identity but misses the resilience property because
|
||||
duplicate publishers necessarily produce different envelopes.
|
||||
- Dedupe by logical channel and time only. This can hide encoder divergence and promote bad
|
||||
redundancy before byte-level media equality is proven.
|
||||
- Disable duplicate publishers until the scheduler is perfect. This avoids conflict handling but
|
||||
weakens live resilience and leaves the archive data layer untested.
|
||||
|
||||
## Rollout/teardown
|
||||
|
||||
Roll forward by landing the local deterministic test, adding miss/duplicate metrics to the archive
|
||||
scrape surface, then running two publishers for one logical channel in production. Roll back by
|
||||
disabling duplicate scheduling for that channel; existing content-addressed archive objects remain
|
||||
valid.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
The node-agent archive scrape now exposes duplicate-source and miss gauges without placing hashes in
|
||||
labels. Per node, role, broadcast, rendition, and track, it reports duplicate matching hash sources,
|
||||
duplicate hash sequences, divergent hash sequences, and missing hash records. Grafana shows those
|
||||
next to archive ladder coverage so the production duplicate-publisher run has an operator-visible
|
||||
convergence and miss signal.
|
||||
|
||||
`ec-node archive-convergence` is the primary proof surface for duplicate media identity. It compares
|
||||
named archive manifest roots directly inside the Rust node binary, groups records by logical stream,
|
||||
rendition, track, and sequence, and only returns `ok` when every expected sequence has matching
|
||||
duplicate source hashes with no missing or divergent sequence. It also requires archive records to
|
||||
carry at least two distinct `source_node` values, so mirrored global-origin manifests cannot pass as
|
||||
independent publishers. This keeps the media-data invariant in the already-shipped Rust artifact
|
||||
instead of extending the Python node-agent. Rollout gates should use
|
||||
`ec-node archive-convergence --require-ok`; the command emits the JSON report either way, but
|
||||
`--require-ok` exits non-zero unless duplicate convergence is actually proven.
|
||||
`ec-node archive-convergence --prometheus` renders the same Rust convergence report as scrapeable
|
||||
`every_channel_archive_*` gauges for duplicate source records, duplicate sequences, divergent
|
||||
sequences, source-local divergence, missing hashes, missing source identity, media timing conflicts,
|
||||
record source count, and pass/fail state. This gives Grafana a Rust-owned proof metric path while
|
||||
the older node-agent ladder metrics remain available during migration.
|
||||
`ec-node archive-convergence-serve` keeps that proof path live for Prometheus: it serves `/health`
|
||||
and `/metrics`, recomputes convergence on each scrape, and emits `scrape_ok=0` metrics instead of
|
||||
disappearing when manifests are missing or not ready. Production Grafana can therefore distinguish a
|
||||
healthy metrics target from an unproven duplicate-publisher run.
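
The "emit `scrape_ok=0` instead of disappearing" behaviour can be sketched as below. This is an assumption-laden illustration: `ConvergenceSummary`, `render_metrics`, the `proof` label, and the exact metric names after the `every_channel_archive_` prefix are hypothetical, and the label value is assumed not to need escaping.

```rust
/// Hypothetical summary of one convergence computation.
pub struct ConvergenceSummary {
    pub duplicate_sequences: u64,
    pub divergent_sequences: u64,
    pub ok: bool,
}

/// Render Prometheus text exposition for one scrape. On failure the target
/// still answers, with scrape_ok=0, so "unproven" stays visible in Grafana.
pub fn render_metrics(proof: &str, result: &Result<ConvergenceSummary, String>) -> String {
    let gauge = |name: &str, value: u64| {
        format!("every_channel_archive_{}{{proof=\"{}\"}} {}\n", name, proof, value)
    };
    match result {
        Ok(s) => [
            gauge("scrape_ok", 1),
            gauge("duplicate_sequences", s.duplicate_sequences),
            gauge("divergent_sequences", s.divergent_sequences),
            gauge("convergence_ok", u64::from(s.ok)),
        ]
        .concat(),
        // Manifests missing or not ready: keep the series alive, report failure.
        Err(_) => [gauge("scrape_ok", 0), gauge("convergence_ok", 0)].concat(),
    }
}
```

Keeping the series present on failure lets an alert distinguish "target down" from "target up, proof not established".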
|
||||
The Nix `services.every-channel.ec-node.archive.convergence.proofs` option turns those Rust proof
|
||||
servers into named systemd units. Each proof must name at least two `NAME=PATH` sources and gets a
|
||||
dedicated listen address, so operators can add one Prometheus scrape target per duplicate channel
|
||||
without resurrecting the Python node-agent as the proof oracle.
|
||||
Forge enables an initial `la-kcop-publisher-origin` proof target on `127.0.0.1:7812` and Prometheus
|
||||
scrapes it alongside the other local every.channel targets. Until two real publisher manifest roots
|
||||
are mounted or fetched into Forge, the target intentionally uses the Forge manifest root as a
|
||||
placeholder peer and must report unproven convergence rather than green duplicate-publisher proof.
|
||||
Forge also exposes a static two-NUC `la-kcet-remote-publisher-origin` proof target once that channel
|
||||
is the live converged duplicate sample. Dynamic Headscale file-SD remains useful for discovery, but
|
||||
it can include relays and stale nodes; duplicate-publisher proof should use an explicit publisher
|
||||
pair or future scheduler group labels so unrelated agents do not turn a passing channel red.
|
||||
This static proof exports its own Rust convergence gauges rather than gating on broad legacy
|
||||
Prometheus aggregates, because older node-agent archive metrics do not yet carry enough proof-role
|
||||
labels to avoid summing stale divergence from unrelated scrape targets.
|
||||
|
||||
`ec-node archive-convergence-measure` is the primary production proof harness. It fetches named
|
||||
node-agent `/v1/archive-manifest` samples or direct manifest JSONL URLs, writes bounded temporary
|
||||
manifest roots, reuses the Rust `archive-convergence` report, and optionally queries Prometheus for
|
||||
the Grafana-facing duplicate/miss series. A production run only counts as complete when the report
|
||||
has elapsed samples, matching duplicate media hashes, zero divergent hash sequences, and live
|
||||
Prometheus series for the duplicate/miss gauges. The measurement groups records by archive record
|
||||
source identity, not by the URL used to fetch a manifest, and reports source identity failures when
|
||||
the sample is too weak to prove independent publisher data. The older
|
||||
`scripts/measure-duplicate-publishers.py` stays compatibility-only until live operators and Forge
|
||||
jobs are switched to the Rust command.
|
||||
The convergence report carries bounded divergent-sequence samples with per-source hash, byte size,
|
||||
receive time, source node/session, CAS path, and media timing when present, so a red proof is
|
||||
immediately actionable without fetching full manifests by hand.
|
||||
It also reports a non-blocking media-timing-missing count and Prometheus gauge; hash equality can
|
||||
still prove duplicate bytes, but missing timing means a divergent proof cannot yet classify whether
|
||||
the mismatch is a phase/windowing problem or an encoder byte problem.
|
||||
Publisher service builders must pass proof cadence explicitly. Both the node-agent publisher
|
||||
supervisor and Nix systemd publisher module set `--publisher-archive-segment-duration-ms` and
|
||||
`--publisher-start-boundary-ms` by default, so netbooted NUCs do not depend on stale hotpatch CLI
|
||||
defaults when aligning duplicate publisher proof windows.
|
||||
|
||||
`ec-node archive-convergence-measure-serve` turns that production proof harness into a live
|
||||
Prometheus target. Each `/metrics` scrape fetches one fresh sample from node-agent or direct JSONL
|
||||
manifest URLs, keeps a bounded in-memory sample window, and only reports measurement `ok` after the
|
||||
configured elapsed window has passed. This avoids blocking Prometheus scrapes for the measurement
|
||||
duration while still preventing two immediate samples from looking like a real production run.
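
A minimal sketch of that elapsed-window rule follows. `SampleWindow` and its methods are hypothetical names, and the real service tracks more than timestamps; the sketch only shows how a bounded window can refuse to report `ok` until enough wall-clock time separates its samples.

```rust
use std::collections::VecDeque;

/// Bounded in-memory window of sample timestamps (Unix milliseconds).
pub struct SampleWindow {
    samples: VecDeque<u64>,
    max_samples: usize,
    required_elapsed_ms: u64,
}

impl SampleWindow {
    pub fn new(max_samples: usize, required_elapsed_ms: u64) -> Self {
        Self {
            samples: VecDeque::new(),
            max_samples: max_samples.max(2), // at least two samples are needed
            required_elapsed_ms,
        }
    }

    /// Record the sample fetched during one scrape; never blocks the scrape.
    pub fn push(&mut self, taken_unix_ms: u64) {
        if self.samples.len() == self.max_samples {
            self.samples.pop_front(); // drop the oldest sample to stay bounded
        }
        self.samples.push_back(taken_unix_ms);
    }

    /// Time between the oldest and newest retained samples.
    pub fn elapsed_ms(&self) -> u64 {
        match (self.samples.front(), self.samples.back()) {
            (Some(first), Some(last)) => last.saturating_sub(*first),
            _ => 0,
        }
    }

    /// True only when two or more samples span the configured window.
    pub fn elapsed_ok(&self) -> bool {
        self.samples.len() >= 2 && self.elapsed_ms() >= self.required_elapsed_ms
    }
}
```

In this sketch the window capacity must be large enough that `max_samples` scrape intervals cover `required_elapsed_ms`; otherwise eviction keeps the window permanently short.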
|
||||
The service emits measurement-level gauges for fetch success, source record counts, invalid records,
|
||||
elapsed seconds, Prometheus series presence, and reasons, then appends the same
|
||||
`every_channel_archive_*` convergence gauges from the latest sample. The service can also read
|
||||
Prometheus file-SD JSON from Forge's Headscale node-agent discovery and turn each discovered target
|
||||
into a sampled node-agent manifest source. The Nix
|
||||
`services.every-channel.ec-node.archive.convergence.remoteProofs` option creates these remote proof
|
||||
services as systemd units from either static `NAME=URL` endpoints or dynamic file-SD inputs. Forge
|
||||
now exposes `la-kcop-remote-publisher-origin` on `127.0.0.1:7813` using the live
|
||||
`/var/lib/prometheus/every-channel-node-agents.json` inventory. It must stay red until that
|
||||
inventory contains at least two independent publisher node-agents whose `publisher.m4s` records
|
||||
converge.
|
||||
|
||||
When archive-serve ports are not reachable from the proof runner, the node-agent exposes a bounded,
|
||||
tailnet-authenticated `/v1/archive-manifest` sample endpoint. The harness can use that endpoint for
|
||||
each named publisher, compare local manifest records directly, and still require at least two elapsed
|
||||
samples before declaring success.
|
||||
|
||||
Production duplicate proof also requires archive-buffer freshness on each participating publisher.
|
||||
During mixed-generation rollouts, the current node-agent may supervise an older installed
|
||||
`archive-hot-sync` helper. The agent must probe helper flag support and omit optional arguments such
|
||||
as `--link-mode` when an older helper lacks them, because a silently failing archive-buffer sync can
|
||||
leave one publisher with healthy live streams but stale manifests.
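
One way to probe flag support is to inspect the helper's `--help` text before building the argument list. The sketch below assumes that approach; the source only says the agent must probe. `helper_args` is a hypothetical name, and the flag values in the usage example are invented for illustration.

```rust
/// Build argv for a helper, dropping optional flags that the installed
/// helper does not advertise in its `--help` output.
pub fn helper_args(
    help_text: &str,
    required: &[&str],
    optional: &[(&str, &str)],
) -> Vec<String> {
    let mut args: Vec<String> = required.iter().map(|s| s.to_string()).collect();
    for (flag, value) in optional {
        // Match whole tokens so `--link` does not match `--link-mode`.
        let advertised = help_text
            .split(|c: char| c.is_whitespace() || c == ',' || c == '=')
            .any(|token| token == *flag);
        if advertised {
            args.push(flag.to_string());
            args.push(value.to_string());
        }
    }
    args
}
```

For example, `helper_args(help, &["--source", "/a"], &[("--link-mode", "hardlink")])` includes `--link-mode hardlink` only when `help` mentions `--link-mode`, so an older helper is never started with an argument that would make it exit immediately.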
|
||||
|
||||
The publisher buffer refresh is freshness-first: the node-managed sync must mirror full manifests
|
||||
without origin object fetch before running the slower cache fill/prune pass. This lets convergence
|
||||
checks, Grafana scrape surfaces, and demand fetch see current BLAKE3 indexes even when proactive CAS
|
||||
object backfill is still catching up.
|
||||
|
||||
`wt-archive` stamps each archive index record with `source_node` and `source_session`. The Nix
|
||||
archive launcher passes the runtime hostname as `--source-node`; explicit CLI users can override it.
|
||||
Older records without this identity continue to parse, but proof commands and production measurement
|
||||
mark them incomplete instead of accepting them as independent publisher evidence.
|
||||
|
||||
Publisher-origin proof must be captured before relay/archive mirroring can collapse source identity.
|
||||
When node-agent archive buffering is enabled, supervised `wt-publish` processes pass
|
||||
`--publisher-archive-output-dir`, `--publisher-archive-manifest-dir`, and
|
||||
`--publisher-archive-source-node`. `wt-publish` now supervises the Rust
|
||||
`publisher-proof-archive-source` worker for that archive track. The worker splits the MPEG-TS source
|
||||
by source-clock windows, fresh-encodes each bounded window with the deterministic proof profile,
|
||||
stores the resulting media fragments under `publisher.m4s` in the same CAS/index format, and stamps
|
||||
them with node-agent source identity. The relay playback encoder remains continuous for watchability,
|
||||
but it is no longer the BLAKE3 data identity for duplicate-publisher proof. The source identity is
|
||||
explicit override first, then hostname plus a short hash of machine-id, with boot-id only as a
|
||||
fallback; hostname alone is not enough because publisher images can share names like `ec-node`.
|
||||
Production duplicate verification can therefore compare `publisher.m4s` from two publisher buffers
|
||||
without treating copied relay-origin manifests as independent sources.
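
The source-identity precedence described above (explicit override, then hostname plus a short machine-id hash, then boot-id as fallback) can be sketched as follows. The hash function and the 12-hex-digit suffix length are assumptions: FNV-1a is used only as a dependency-free stand-in, and `source_identity` is a hypothetical name.

```rust
/// FNV-1a 64-bit: a stand-in for whatever hash the real code uses.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Returns None when nothing distinguishes this host beyond its hostname,
/// because hostname alone is not accepted as publisher identity.
pub fn source_identity(
    explicit: Option<&str>,
    hostname: &str,
    machine_id: Option<&str>,
    boot_id: Option<&str>,
) -> Option<String> {
    // 1. An explicit, non-empty override always wins.
    if let Some(name) = explicit.filter(|s| !s.is_empty()) {
        return Some(name.to_string());
    }
    // 2. Prefer machine-id; 3. fall back to boot-id.
    let id = machine_id.filter(|s| !s.is_empty()).or(boot_id)?;
    let short = fnv1a64(id.as_bytes()) & 0xffff_ffff_ffff; // low 48 bits
    Some(format!("{}-{:012x}", hostname, short))
}
```

Two hosts that share the hostname `ec-node` but have different machine-ids then produce different identities, which is the property the proof commands depend on.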
|
||||
|
||||
Proof tooling defaults to `publisher.m4s`. The relay video track `0.m4s` is useful playback data,
|
||||
but it is not duplicate-publisher proof: a publisher buffer may hold relay/cache records on `0.m4s`
|
||||
that have no publisher source identity. Production convergence checks that sample `0.m4s` should be
|
||||
treated as playback/archive-cache diagnostics, not byte-for-byte duplicate publisher evidence.
|
||||
|
||||
The first live publisher-origin measurements on 2026-06-08 showed correct distinct source labels but
|
||||
zero matching duplicate sequences for `la-nbc4`, `la-pbs-socal`, and `la-kcet`. The failure is
|
||||
useful: independent `wt-publish` processes currently start their fragment sequence and encoder chunk
|
||||
phase at local process start, so sequence `0` from two publishers is not necessarily the same
|
||||
broadcast moment. Duplicate-publisher proof therefore requires a shared chunk clock or
|
||||
scheduler-controlled aligned encoder phase before byte-for-byte archive convergence can pass in
|
||||
production.
|
||||
|
||||
Publisher-origin `publisher.m4s` records now require timed fMP4 fragments for global proof and map
|
||||
those fragments onto observed wall-clock epoch buckets instead of local process counters. The Rust
|
||||
writer learns track timescales from the init `moov` box, reads fragment
|
||||
`moof/traf/tfhd+tfdt` decode timestamps to reject untimed proof when possible, then assigns
|
||||
`group_sequence = observed_epoch_bucket * bucket_stride + fragment_slot`. Fragments that lack usable
|
||||
timing still fall back to the previous local counter so publishing does not fail hard on malformed
|
||||
metadata, but duplicate-publisher proof should use timed fragments. The `wt-publish` ffmpeg path
|
||||
also preserves source timestamps and uses closed-GOP, single-threaded x264 settings with forced
|
||||
keyframe cadence so independent publishers have a real chance of producing identical bytes for the
|
||||
same media time window.
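
The sequence formula above can be written out as a small worked sketch. The function names are hypothetical, and the 1001 ms cadence and stride of 4096 used in the example are assumptions: they are chosen because the sample sequence `7287381184512` quoted later in this document equals 1779145797 × 4096, which is consistent with, but does not confirm, those values.

```rust
/// Wall-clock bucket for an observed Unix time, given the proof cadence.
pub fn epoch_bucket(observed_unix_ms: u64, segment_duration_ms: u64) -> Option<u64> {
    if segment_duration_ms == 0 {
        None
    } else {
        Some(observed_unix_ms / segment_duration_ms)
    }
}

/// group_sequence = observed_epoch_bucket * bucket_stride + fragment_slot.
/// Returns None when the slot does not fit inside the stride or on overflow,
/// since either case would let two buckets collide.
pub fn group_sequence(bucket: u64, bucket_stride: u64, fragment_slot: u64) -> Option<u64> {
    if fragment_slot >= bucket_stride {
        return None;
    }
    bucket.checked_mul(bucket_stride)?.checked_add(fragment_slot)
}
```

Under those assumed values, two publishers that observe the same wall-clock bucket compute the same base sequence regardless of when each process started, and fragments inside the bucket differ only by slot.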
|
||||
|
||||
A later live run on 2026-06-08 found a stricter local invariant before cross-publisher byte equality:
|
||||
each publisher must produce at most one hash for a given `source_node` and `group_sequence`.
|
||||
Production `publisher.m4s` samples for `la-kcop` and `la-ktla` showed multiple hashes from the same
|
||||
source in the same sequence bucket because real fMP4 fragments can arrive faster than the configured
|
||||
proof segment duration, and the writer rounded decode time into repeated buckets. The writer now
|
||||
uses a fixed per-epoch bucket stride and increments an in-bucket fragment slot when multiple timed
|
||||
fragments arrive inside the same proof duration. This keeps source-local manifests unique while
|
||||
allowing independently restarted publishers to align on the same observed wall-clock bucket.
|
||||
`ec-node archive-convergence` reports this separately as `source_local_divergent_sequences` so
|
||||
operator tooling can distinguish a self-contradicting publisher from two publishers that simply
|
||||
disagree about the same sequence.
|
||||
Because bucket-strided proof sequences intentionally leave numeric gaps, archive convergence uses
|
||||
the observed sparse sequence union for publisher-origin manifests. Dense contiguous sequence ranges
|
||||
remain available in the simulation layer when a model explicitly expects every integer sequence.
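
Both checks in this paragraph can be sketched together: source-local divergence, and missing sequences computed against the sparse union rather than a dense range. `ProofRecord` and the function names are hypothetical; `source_local_divergent_sequences` in the real report may be computed differently.

```rust
use std::collections::{BTreeMap, BTreeSet};

pub struct ProofRecord {
    pub source_node: String,
    pub group_sequence: u64,
    pub hash: String, // stand-in for a BLAKE3 hex digest
}

/// (source_node, group_sequence) pairs where one source wrote more than
/// one distinct hash, i.e. the publisher contradicts itself.
pub fn source_local_divergent(records: &[ProofRecord]) -> Vec<(String, u64)> {
    let mut seen: BTreeMap<(&str, u64), BTreeSet<&str>> = BTreeMap::new();
    for r in records {
        seen.entry((r.source_node.as_str(), r.group_sequence))
            .or_default()
            .insert(r.hash.as_str());
    }
    seen.into_iter()
        .filter(|(_, hashes)| hashes.len() > 1)
        .map(|((node, seq), _)| (node.to_string(), seq))
        .collect()
}

/// Sequences some source observed that `source_node` lacks. The expected
/// set is the sparse union, so numeric gaps between buckets are not misses.
pub fn missing_for_source(records: &[ProofRecord], source_node: &str) -> Vec<u64> {
    let union: BTreeSet<u64> = records.iter().map(|r| r.group_sequence).collect();
    let own: BTreeSet<u64> = records
        .iter()
        .filter(|r| r.source_node == source_node)
        .map(|r| r.group_sequence)
        .collect();
    union.difference(&own).copied().collect()
}
```

With sequences 4096 and 8192 observed, a dense range would report thousands of missing IDs in between; the union reports only sequences that at least one publisher actually wrote.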
|
||||
|
||||
The 2026-06-08 live `la-kcet/publisher.m4s` sample from Forge confirmed that both publishers now
|
||||
emit distinct source identities (`ec-node-c3546fa5abc3` and `ec-node-72cf1c3aa196`) with no missing
|
||||
source identity records on the sampled publisher-origin manifests. It also confirmed the remaining
|
||||
bug: 156 shared publisher-origin sequences had zero byte-for-byte BLAKE3 matches and 156 divergent
|
||||
hashes. The next production fix must align the publisher chunk clock and encoded fMP4 byte stream,
|
||||
not merely improve scrape or Grafana plumbing.
|
||||
|
||||
After the wall-clock bucket hotpatch, the same live proof no longer has fake sparse-range missing
|
||||
IDs: `la-kcet/publisher.m4s` reported 376 observed proof sequences, zero missing source identities,
|
||||
zero source-local divergent sequences, and 234 divergent shared sequences. A byte-level sample for
|
||||
sequence `7287381184512` had different sizes, different BLAKE3 hashes, different `tfdt`
|
||||
base-media-decode-times (`210210` versus `0`), and different `mdat` payload prefixes. Across that
|
||||
sampled window there were zero common fragment hashes even when sequence IDs were ignored, proving
|
||||
that the remaining failure was independent-encoder media phase and fMP4 payload determinism, not an
|
||||
archive manifest identity bug.
|
||||
|
||||
A later `la-kcop/publisher.m4s` sample exposed a stricter live-source bug: source-window proof
|
||||
records were using unsynced MPEG-TS PCR chunk indexes as `group_sequence` when the OTA UTC clock was
|
||||
unavailable, causing restart-dependent jumps such as 93M, 135M, 341M, and 390M. The source-proof
|
||||
writer now uses the chunk UTC start only when the chopper reports synced timing; otherwise it falls
|
||||
back to the local wall-clock window start, and rewrites fMP4 `tfdt` onto that shared window before
|
||||
hashing. The live HTTP proof worker also retries transient source opens/reader failures in unbounded
|
||||
live mode, so a tuner `503` or malformed TS burst is skipped/retried instead of killing the
|
||||
publisher proof process.
|
||||
|
||||
The synced source-window clock must use the chopper's exact global chunk index, not integer UTC
|
||||
seconds. A 1001 ms proof cadence makes whole-second UTC start metadata lossy: adjacent source
|
||||
windows can share the same `utc_start_unix`, which caused one publisher to write several different
|
||||
hashes under the same source-local `group_sequence`. Synced chunks therefore use
|
||||
`ChunkTiming.chunk_index` directly; only unsynced chunks fall back to local wall-clock receipt.
|
||||
The live source-window proof writer also keeps subfragment slot allocation as stream state instead
|
||||
of per-chunk state. Real source windows can be emitted in more than one proof chunk for the same
|
||||
media timing sequence; resetting the slot counter for every chunk reused the same
|
||||
`group_sequence` and made one healthy publisher look self-divergent. The counter is bounded so the
|
||||
long-running live worker does not grow state without bound.
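
A sketch of slot allocation held as stream state, with a bound on how many sequences are tracked, is shown below. `SlotAllocator` is a hypothetical name and the eviction policy (drop the lowest tracked sequence) is an assumption; the source only states that the counter is stream-scoped and bounded.

```rust
use std::collections::BTreeMap;

/// Per-stream slot allocator. Slots for a media sequence keep increasing
/// across proof chunks instead of resetting to zero for every chunk.
pub struct SlotAllocator {
    next_slot: BTreeMap<u64, u64>,
    max_tracked: usize,
}

impl SlotAllocator {
    pub fn new(max_tracked: usize) -> Self {
        Self { next_slot: BTreeMap::new(), max_tracked: max_tracked.max(1) }
    }

    /// Return the next unused slot for `media_sequence`.
    pub fn allocate(&mut self, media_sequence: u64) -> u64 {
        let slot = self.next_slot.entry(media_sequence).or_insert(0);
        let allocated = *slot;
        *slot += 1;
        // Evict the oldest sequences so a long-running worker stays bounded.
        while self.next_slot.len() > self.max_tracked {
            self.next_slot.pop_first();
        }
        allocated
    }

    pub fn tracked(&self) -> usize {
        self.next_slot.len()
    }
}
```

A second proof chunk for the same media sequence then receives slot 1 rather than reusing slot 0, which is what previously made a healthy publisher look self-divergent.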
|
||||
|
||||
`wt-publish` now has an explicit Unix-epoch start boundary, defaulting to the publisher-origin proof
|
||||
cadence. After relay setup and immediately before spawning ffmpeg it waits until the next boundary,
|
||||
so a newly restarted duplicate publisher starts its forced-keyframe clock on the same global cadence
|
||||
as already-running publishers.
|
||||
This does not by itself prove byte equality; it removes the local-process-start phase error from the
|
||||
live publisher path and gives rollout measurement a deterministic knob (`--publisher-start-boundary-ms
|
||||
0` disables it). The live ffmpeg argument plan is factored into a Rust unit-testable helper so
|
||||
future timestamp/keyframe changes are pinned in `ec-node` instead of being inferred from node-agent
|
||||
process strings or production samples.
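
The start-boundary wait reduces to one modulo computation. The sketch below is illustrative only: `delay_to_next_boundary_ms` is a hypothetical name, and treating a start that lands exactly on a boundary as "no wait" is an assumption.

```rust
/// Milliseconds to wait so that the encoder starts on the next multiple of
/// `boundary_ms` on the Unix-epoch clock. A boundary of 0 disables the wait.
pub fn delay_to_next_boundary_ms(now_unix_ms: u64, boundary_ms: u64) -> u64 {
    if boundary_ms == 0 {
        return 0; // mirrors `--publisher-start-boundary-ms 0`
    }
    let into_window = now_unix_ms % boundary_ms;
    if into_window == 0 {
        0 // already on a boundary
    } else {
        boundary_ms - into_window
    }
}
```

Because the boundary is derived from the Unix epoch rather than process start, two publishers restarted at different moments compute delays that land on the same global cadence.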
|
||||
|
||||
The first post-start-clock live sample still failed duplicate byte identity: both publishers landed
|
||||
in the same wall-clock proof bucket, but one fragment carried `tfdt=390390` while the other carried
|
||||
`tfdt=30030`, matching the staggered restart gap. Their `mdat` prefixes differed too, which means a
|
||||
continuous x264 encoder keeps enough local history that a later restart cannot prove byte equality
|
||||
merely by joining the same wall-clock cadence. The live profile therefore enables x264
|
||||
`stitchable=1` alongside closed GOP, no scenecut, no B-frames, no lookahead, and one thread. If that
|
||||
still does not converge in production, the next fix is a deliberately stateless per-fragment encode
|
||||
or a Rust-owned media clock/segmenter that resets encoder history at each proof boundary.
|
||||
|
||||
The follow-up production hotpatch moved the start-boundary wait to immediately before ffmpeg spawn,
|
||||
enabled `stitchable=1`, and restarted both publisher nodes in the same batch. The latest `la-kcet`
|
||||
sample still reported zero matching duplicate hashes with no missing source identity and no
|
||||
source-local divergence. A final sampled shared sequence differed by hundreds of milliseconds of
|
||||
receive time and by media size (`439737` versus `270283` bytes for the video fragment), so the
|
||||
remaining mismatch is not just MP4 timestamp metadata. Production duplicate proof now needs a
|
||||
stateless fragment boundary: either encode each proof segment from the same bounded source window
|
||||
with fresh encoder state, or make the Rust media pipeline own exact frame-window capture before
|
||||
calling ffmpeg/x264.
|
||||
|
||||
Archive manifests now carry optional fMP4 media timing for publisher-origin fragments. The
|
||||
`archive-convergence` gate treats equal archive group sequence IDs with different media sequence or
|
||||
decode-time metadata as `media_sequence_conflict`, even if the byte hash happens to match. This keeps
|
||||
production proof aligned with the Rust simulation model: a duplicate publisher only proves the same
|
||||
broadcast moment when the archive sequence and media window agree.
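
The ordering of that check matters: timing disagreement is evaluated before hash equality. A minimal sketch, with hypothetical type and function names and with missing timing treated as non-blocking, could look like this:

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MediaTiming {
    pub media_sequence: u64,
    pub decode_time: u64,
}

pub struct TimedRecord {
    pub group_sequence: u64,
    pub hash: String, // stand-in for a BLAKE3 hex digest
    pub timing: Option<MediaTiming>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PairVerdict {
    Match,
    Divergent,
    MediaSequenceConflict,
}

/// Compare two records from different publishers. Returns None when the
/// records do not share a group sequence and so are not comparable.
pub fn compare_pair(a: &TimedRecord, b: &TimedRecord) -> Option<PairVerdict> {
    if a.group_sequence != b.group_sequence {
        return None;
    }
    // Timing disagreement wins even when the bytes hash the same.
    if let (Some(ta), Some(tb)) = (a.timing, b.timing) {
        if ta != tb {
            return Some(PairVerdict::MediaSequenceConflict);
        }
    }
    Some(if a.hash == b.hash { PairVerdict::Match } else { PairVerdict::Divergent })
}
```

Checking timing first prevents an accidental hash match from being counted as proof of the same broadcast moment.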
|
||||
|
||||
The first stateless proof primitives are now in `ec-node`. `publisher-proof-segment` takes one
|
||||
bounded MPEG-TS source-clock window, runs a fresh deterministic x264/AAC fMP4 encode, splits the
|
||||
result into init bytes and media fragments, and emits BLAKE3 hashes for each. `publisher-proof-windows`
|
||||
uses the Rust MPEG-TS source-clock splitter first, then fresh-encodes each bounded window and reports
|
||||
per-window source TS, init, and media hashes. Proof windows carry explicit MPEG-TS decoder context
|
||||
with `--preroll-packets`, defaulting to the repo-owned `WT_PUBLISH_PROOF_PREROLL_PACKETS` budget, so
|
||||
mid-GOP windows do not silently depend on best-effort decoder recovery. Focused Rust tests
|
||||
fresh-encode the same bounded input and the same finite source-window campaign twice and assert
|
||||
byte-for-byte identical proof hashes.
|
||||
|
||||
`publisher-proof-duplicates` is the single-node duplicate-publisher gate for the stateless path. It
|
||||
runs `publisher-proof-windows` independently under at least two publisher identity labels, defaults
|
||||
to `publisher-a` and `publisher-b`, and compares source TS, init, and media fragment BLAKE3 hashes
|
||||
for every source-clock window. `--require-ok` exits non-zero unless every compared window matches,
|
||||
and duplicate publisher labels are rejected so the proof cannot accidentally collapse to one source
|
||||
identity. `publisher-proof-compare` is the cross-machine stateless proof gate: each publisher can run
|
||||
`publisher-proof-windows` against the same bounded source TS file locally, copy the JSON report back
|
||||
to the operator host, and compare the reports by named publisher. It rejects mismatched chunk cadence,
|
||||
missing windows, source TS hash mismatches, init hash mismatches, media fragment hash mismatches, and
|
||||
empty media windows.
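
The rejection rules listed for `publisher-proof-compare` can be sketched as a pure comparison over two reports. The report shape here is hypothetical (the real JSON schema is not shown in this document), as are the `Mismatch` variants and `compare_reports`.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical per-window proof output.
pub struct WindowReport {
    pub source_ts_hash: String,
    pub init_hash: String,
    pub media_hashes: Vec<String>,
}

/// Hypothetical per-publisher report keyed by source-clock window index.
pub struct ProofReport {
    pub chunk_ms: u64,
    pub windows: BTreeMap<u64, WindowReport>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Mismatch {
    ChunkCadence,
    MissingWindow(u64),
    SourceTs(u64),
    Init(u64),
    Media(u64),
    EmptyMedia(u64),
}

/// An empty result means every compared window matched.
pub fn compare_reports(a: &ProofReport, b: &ProofReport) -> Vec<Mismatch> {
    let mut out = Vec::new();
    if a.chunk_ms != b.chunk_ms {
        out.push(Mismatch::ChunkCadence);
    }
    let indexes: BTreeSet<u64> = a.windows.keys().chain(b.windows.keys()).copied().collect();
    for i in indexes {
        match (a.windows.get(&i), b.windows.get(&i)) {
            (Some(x), Some(y)) => {
                if x.source_ts_hash != y.source_ts_hash {
                    out.push(Mismatch::SourceTs(i));
                }
                if x.init_hash != y.init_hash {
                    out.push(Mismatch::Init(i));
                }
                if x.media_hashes.is_empty() || y.media_hashes.is_empty() {
                    out.push(Mismatch::EmptyMedia(i));
                } else if x.media_hashes != y.media_hashes {
                    out.push(Mismatch::Media(i));
                }
            }
            // Present in only one report.
            _ => out.push(Mismatch::MissingWindow(i)),
        }
    }
    out
}
```

Rejecting empty media windows explicitly keeps two reports with no fragments from passing as "equal".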
|
||||
|
||||
`publisher-proof-remote-compare` is the production operator harness for that cross-machine gate. It
|
||||
copies one bounded `.ts` proof input to each named SSH target, runs `ec-node publisher-proof-windows`
|
||||
on the target, stores each returned JSON report under the local output directory, writes a
|
||||
`compare.json`, and returns the existing compare report with upload/proof timing. Remote labels use
|
||||
the same single-component validation as publisher identities, remote proof roots are constrained to
|
||||
`/tmp/every-channel-*`, and cleanup is opt-in so the generated proof files remain inspectable unless
|
||||
the operator explicitly requests removal. This keeps the live proof path in Rust without making the
|
||||
Python node-agent a new oracle. It proves the machine/runtime/compiler boundary without requiring
|
||||
the two NUCs to share a live tuner at the exact same instant.
|
||||
|
||||
`publisher-proof-archive-source` is the live archive implementation of the same proof model. It can
|
||||
read local source files directly, read plain HTTP MPEG-TS bodies directly for HDHomeRun-style
|
||||
sources, or fall back to an ffmpeg MPEG-TS copy reader for other inputs. Each emitted source-clock
|
||||
window is encoded with fresh proof state, archived as CAS-backed `publisher.m4s` records, and mapped
|
||||
to source-clock group sequences with explicit media timing metadata. A focused Rust regression now
|
||||
archives the same bounded TS input as two source nodes, then runs `archive-convergence` against the
|
||||
two manifest roots and requires full duplicate convergence with zero divergent or source-local
|
||||
divergent sequences.
|
||||
|
||||
Forge `ci-gates` now runs the `publisher_proof` and `archive_convergence` Rust filters before the
|
||||
distributed simulator campaign, so single-node byte-for-byte determinism, source-window archive
|
||||
proof semantics, and duplicate archive convergence are checked before production rollout evidence is
|
||||
considered. The next production step is to deploy the updated node binary and let fresh
|
||||
`publisher.m4s` source-window records age into the Grafana scrape window so live duplicate metrics
|
||||
can replace the older continuous-encoder divergence.
|
||||
158
evolution/proposals/ECP-0157-rust-simulation-testing.md
Normal file
|
|
@ -0,0 +1,158 @@
|
|||
# ECP-0157: Rust Simulation Testing
|
||||
|
||||
Status: Draft
|
||||
|
||||
## Context
|
||||
|
||||
Production is now fast enough to expose distributed bugs quickly, but it is still the wrong first
|
||||
place to discover scheduler, archive, and duplicate-publisher invariants. The Python node-agent also
|
||||
made this worse by putting core control behavior outside the already-built Rust node binary.
|
||||
|
||||
## Decision
|
||||
|
||||
Add a small deterministic simulation layer in `ec-core` and use it for distributed media invariants:
|
||||
|
||||
- `ec-node` remains the runtime owner for node behavior.
|
||||
- Tests model logical time, delayed delivery, backfill, duplicate publishers, and archive
|
||||
convergence in Rust.
|
||||
- Simulation scenarios are seed-replayable and include deterministic jitter, transient drops,
|
||||
partition windows, publisher outage/restart windows, backfill retries, and encoder drift faults.
|
||||
- A failing simulation must print or carry a replay hint so the exact schedule can be rerun.
|
||||
- Simulation reports include deterministic execution history so a failure has an ordered event trace,
|
||||
not only a final assertion.
|
||||
- Simulation campaigns run many seed schedules in one fast test and preserve the first failing seed,
|
||||
invariant report, and final state as the failure artifact.
|
||||
- Campaign execution has a reusable seeded runner so new models can share replay/failure accounting
|
||||
instead of copying bespoke loops.
|
||||
- First failures are automatically shrunk where the model supports it. For duplicate publishers the
|
||||
shrinker removes irrelevant partitions, publisher outages, timing jitter, transient drops, and
|
||||
excess media sequence range while keeping the original invariant unchanged.
|
||||
- Invariants are explicit checks, not implicit test prose: duplicate source count, missing
|
||||
sequences, divergent hashes, missing media timing, conflicting media timing, complete duplicate
|
||||
coverage, and convergence-deadline budgets.
|
||||
- Media identity is checked by BLAKE3 hashes for stream, rendition, track, sequence, profile, and
|
||||
source-material identity.
|
||||
- Media timing is part of the proof model. Matching hashes are not considered a complete duplicate
|
||||
proof unless both publishers also expose a shared logical media clock for the chunk.
|
||||
- Source-material identity is separate from stream metadata. Two publishers can advertise the same
|
||||
channel, sequence, timing, and encoder profile while still encoding different RF/source windows;
|
||||
that must fail in simulation before production archive comparisons burn wall-clock time.
|
||||
- Publisher-origin archive `group_sequence` is derived from parsed media-time identity plus stable
|
||||
track id, not local receive time. Receive time is telemetry; it is not proof that two publishers
|
||||
archived the same broadcast moment.
|
||||
- Live publisher archive proof normalizes fMP4 `tfdt` to the Unix media slot before hashing a
|
||||
fragment. The first fragment for each track anchors the process-local media clock to wall-clock
|
||||
time; later fragments preserve ffmpeg's media cadence from that origin. ffmpeg still runs with
|
||||
wall-clock timestamp input enabled where possible, but the Rust archive writer is the authority
|
||||
for the proof clock when source MPEG-TS timestamps are process-relative.
|
||||
- Archive `group_sequence` includes a stable subfragment slot inside each `(track_id,
|
||||
media_sequence)` pair, because audio can legitimately emit multiple fragments within one media
|
||||
slot and those must compare in order instead of colliding as source-local divergences.
|
||||
- Duplicate-publisher scenarios model publisher content phase separately from advertised archive
|
||||
sequence. A publisher that starts its local encoder at a different content phase must fail fast in
|
||||
simulation, because production fragments with the same local sequence are not proof of the same
|
||||
broadcast moment unless the chunk clock is shared.
|
||||
- `ec-node sim-duplicate-publishers` runs the same campaign model from the compiled Rust binary and
|
||||
emits JSON suitable for CI artifacts and rollout gates.
|
||||
- `ec-node sim-duplicate-publishers --failure-artifact <path>` writes the first failing campaign as
|
||||
a replayable JSON artifact with the shrunk scenario, invariant report, event trace, shrink steps,
|
||||
and a command hint for replaying `replay_scenario` through `--scenario-json -`.
|
||||
- `ec-node sim-duplicate-publishers --scenario-json <path-or->` replays an exact serialized
|
||||
`DuplicatePublisherScenario`, so a shrunk failure from CI or production investigation can be rerun
|
||||
without reconstructing command-line flags.
|
||||
- `ec-node sim-duplicate-publishers` can inject timing faults directly with
|
||||
`--missing-media-timing-publisher NODE` and `--publisher-media-time-offset NODE:OFFSET_MS`, so
|
||||
the current production proof class can be reproduced without hand-writing scenario JSON.
|
||||
- `ec-node sim-duplicate-publishers` and `ec-node sim-system` can inject source-window faults with
|
||||
`--publisher-source-material NODE:MATERIAL_ID`. Any campaign with multiple source-material ids
|
||||
reports source-material mismatch observations instead of leaving operators to infer that class
|
||||
from divergent hashes.
|
||||
- `ec-node archive-convergence` reads existing archive manifest JSONL and applies the same
|
||||
convergence semantics to real duplicate publisher outputs.
|
||||
- Control-plane simulation models logical nodes, seeded gossip fanout, delivery jitter, transient
|
||||
drops, node-specific partitions, node outages, duplicate deliveries, and propagation deadlines.
|
||||
- `ec-node sim-control-plane` runs the control-plane model from the compiled Rust binary and emits
|
||||
replayable JSON with the first failing seed, scenario, invariant report, and ordered trace.
|
||||
- Control-plane campaign reports track max propagation time, max delivery time, dropped messages,
|
||||
partition-delayed messages, outage-delayed messages, and duplicate messages, so production rollout
|
||||
measurements have a fast simulation baseline.
|
||||
- System simulation composes control-plane propagation with duplicate-publisher media production.
|
||||
Control gossip produces per-publisher activation times; the media workload then proves that delayed
|
||||
schedule propagation still converges when publishers use the global media sequence clock and fails
|
||||
when they derive chunk identity from local activation time.
|
||||
- `ec-node sim-system` runs that composed workload from the deployed node binary. Its default
|
||||
campaign models the current publisher topology class and can switch `--sequence-clock` between
|
||||
`global` and `local-activation` to reproduce the exact class of duplicate-publisher phase bug
|
||||
before waiting for production samples.
|
||||
- `ec-node sim-system --fault-profile foundationdb` uses a FoundationDB-style fault profile: each
|
||||
seed generates a different but replayable cluster schedule with randomized control partitions, node
|
||||
outages, transient gossip drops, duplicate messages, media partitions, publisher outages, and
|
||||
archive backfill pressure.
|
||||
- The FoundationDB-style profile must also have an explicit negative regression for
|
||||
`local-activation` sequence clocks, so the model proves the current production failure class is
|
||||
caught in Rust before any rollout waits for live fragments.
|
||||
- `ec-node sim-system --failure-artifact <path>` writes the first failing composed system schedule
|
||||
as replayable JSON, including the exact control/media scenario, invariant report, ordered trace,
|
||||
and command hint for rerunning `--scenario-json -`.
|
||||
- System campaign reports must include fault coverage counters, not just pass/fail. A fast campaign
|
||||
is only useful if it proves that the simulated run actually exercised the failure modes operators
|
||||
care about.
|
||||
- System campaign reports also aggregate publisher phase-offset observations. A production-like
|
||||
divergence caused by local activation clocks should identify itself as a phase bug in the campaign
|
||||
JSON instead of requiring operators to infer that only from divergent hashes.
|
||||
- System campaign reports also aggregate source-material mismatch observations. A production-like
|
||||
divergence caused by independent tuner/source windows should identify itself as a source-material
|
||||
bug in the campaign JSON instead of being confused with codec nondeterminism.
|
||||
- System and duplicate-publisher reports aggregate missing media-timing records and media-timing
|
||||
conflicts, so the live failure class where fragments arrive without a usable media clock is visible
|
||||
in fast Rust simulation output.
|
||||
- FoundationDB-profile `sim-system` campaigns require that coverage by default: control transient
|
||||
drops, partition delays, node outage delays, duplicate messages, media transient drops, media
|
||||
partition delays, publisher outages, backfill, and observed convergence timing must all appear in
|
||||
the campaign report. A campaign that passes invariants but misses these classes is reported as a
|
||||
weak simulation, not a green rollout gate.
|
||||
- FoundationDB-profile coverage is breadth-gated, not only boolean-gated. By default at least
|
||||
`max(2, iterations / 32)` seeds must exercise every required distributed fault class; operators
|
||||
can raise that floor with `--min-fault-seed-coverage` for longer scientific campaigns.
|
||||
- Campaign reports track both event totals and seed counts per fault class, plus a bounded list of
|
||||
the slowest system schedules with replay hints. This makes green runs inspectable: operators can
|
||||
see how broadly the randomized schedule space was exercised and which seeds define the current
|
||||
latency tail.
|
||||
- System campaign reports also aggregate deterministic simulated convergence time and trace event
|
||||
counts. `ec-node sim-system` stamps wall-clock execution telemetry around the campaign so a run
|
||||
reports iterations per second, simulated system seconds per wall second, and trace events per
|
||||
second without putting wall-clock data into the replayed scenario itself.
|
||||
- `sim-system --failure-artifact <path>` writes an artifact for weak coverage as well as invariant
|
||||
failures, so CI can preserve evidence when a campaign was too small or too narrow to exercise the
|
||||
required distributed faults.
|
||||
- Forge `ci-gates` runs the Rust system simulator tests and a 1024-seed
|
||||
`sim-system --fault-profile foundationdb` campaign from the compiled `ec-node` binary before web
|
||||
build/deploy gates. This keeps the fast randomized check ahead of production rollout evidence.
|
||||
- Simulation failures must be actionable before any matching production rollout is considered
|
||||
healthy.
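The convergence rules in the list above (source identity, shared media clock, matching hashes) can be sketched as a pairwise comparison. This is a minimal illustration, not the `ec-node` implementation: `ArchiveRecord`, `Comparison`, `compare`, and the order of the checks are all hypothetical, and the BLAKE3 digest is assumed to be computed elsewhere and passed in as raw bytes.

```rust
/// Hypothetical archive record; field names are assumptions for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ArchiveRecord {
    pub source_node: String,
    /// BLAKE3 digest of the media bytes, computed elsewhere.
    pub media_hash: [u8; 32],
    /// Shared logical media clock for the chunk, if the publisher exposed one.
    pub media_time_ms: Option<u64>,
}

/// Hypothetical outcome of comparing two records for the same logical chunk.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Comparison {
    /// Different publishers, shared media time, identical hash.
    Corroborated,
    /// Same `source_node` on both sides: a copied buffer, not duplicate-publisher proof.
    SameSource,
    /// At least one side has no usable media clock.
    MissingMediaTiming,
    /// Both sides have a media clock but they disagree.
    MediaTimingConflict,
    /// Timing agrees but the media bytes differ.
    DivergentHash,
}

/// Compare two records that claim the same channel, rendition, and sequence.
/// The check order is an assumption: identity first, then timing, then hashes.
pub fn compare(a: &ArchiveRecord, b: &ArchiveRecord) -> Comparison {
    if a.source_node == b.source_node {
        return Comparison::SameSource;
    }
    let (time_a, time_b) = match (a.media_time_ms, b.media_time_ms) {
        (Some(x), Some(y)) => (x, y),
        _ => return Comparison::MissingMediaTiming,
    };
    if time_a != time_b {
        return Comparison::MediaTimingConflict;
    }
    if a.media_hash != b.media_hash {
        return Comparison::DivergentHash;
    }
    Comparison::Corroborated
}
```

Checking source identity and timing before hashes means a hash match is never reported as proof unless both preconditions from the list hold.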
|
||||
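The `group_sequence` bullets above describe a key built from media-time identity, a stable track id, and a subfragment slot. A hedged sketch of that shape follows; `GroupSequenceKey`, `assign_keys`, the millisecond units, and the fixed slot duration are assumptions made for illustration, and the real archive writer may encode the key differently.

```rust
/// Hypothetical ordering key for archive fragments. Field order gives the
/// derived `Ord`: media slot first, then track, then position inside the slot.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct GroupSequenceKey {
    pub media_sequence: u64,
    pub track_id: u32,
    pub subfragment_slot: u32,
}

/// Assign keys to one track's fragments from their parsed media times.
/// Assumes `fragment_media_times_ms` is already in media-time order.
/// Local receive time is deliberately not an input.
pub fn assign_keys(
    track_id: u32,
    slot_duration_ms: u64,
    fragment_media_times_ms: &[u64],
) -> Vec<GroupSequenceKey> {
    assert!(slot_duration_ms > 0, "slot duration must be non-zero");
    let mut keys = Vec::with_capacity(fragment_media_times_ms.len());
    let mut previous_sequence: Option<u64> = None;
    let mut subfragment_slot = 0u32;
    for &media_time_ms in fragment_media_times_ms {
        let media_sequence = media_time_ms / slot_duration_ms;
        if previous_sequence == Some(media_sequence) {
            // Another fragment in the same media slot (common for audio).
            subfragment_slot += 1;
        } else {
            subfragment_slot = 0;
        }
        previous_sequence = Some(media_sequence);
        keys.push(GroupSequenceKey { media_sequence, track_id, subfragment_slot });
    }
    keys
}
```

Two publishers that parse the same media times produce the same keys regardless of when each fragment arrived, so multiple audio fragments in one slot compare in order rather than colliding.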
|
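The `tfdt` normalization bullet above says the first fragment per track anchors the media clock to wall-clock time and later fragments keep ffmpeg's cadence from that origin. One way to picture that, as a sketch only: `TrackClock` and its methods are hypothetical, the output is assumed to be Unix milliseconds, and any snapping of the origin to a slot boundary is left out because the source does not specify it.

```rust
/// Hypothetical per-track proof clock.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TrackClock {
    origin_unix_ms: u64,
    first_tfdt: u64,
    timescale: u32,
}

impl TrackClock {
    /// Anchor the clock when the first fragment of a track is seen.
    /// `wall_clock_unix_ms` is read once here and never again.
    pub fn anchor(first_tfdt: u64, timescale: u32, wall_clock_unix_ms: u64) -> Self {
        assert!(timescale > 0, "fMP4 timescale must be non-zero");
        TrackClock { origin_unix_ms: wall_clock_unix_ms, first_tfdt, timescale }
    }

    /// Map a later fragment's `tfdt` (in track timescale ticks) to Unix
    /// milliseconds by adding the media-time delta to the anchored origin.
    pub fn media_time_unix_ms(&self, tfdt: u64) -> u64 {
        let elapsed_ticks = tfdt.saturating_sub(self.first_tfdt) as u128;
        let elapsed_ms = (elapsed_ticks * 1000 / self.timescale as u128) as u64;
        self.origin_unix_ms + elapsed_ms
    }
}
```

Because only the first fragment touches the wall clock, later timestamps follow the encoder's media cadence rather than local receive jitter.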
||||
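The breadth gate described above, `max(2, iterations / 32)` seeds per required fault class, is small enough to write out. The sketch below assumes integer division and assumes that `--min-fault-seed-coverage` can only raise the default floor; the function names are hypothetical.

```rust
/// Minimum number of seeds that must exercise each required fault class.
/// `override_floor` stands in for `--min-fault-seed-coverage`; treating it as
/// raise-only is an assumption.
pub fn min_fault_seed_coverage(iterations: u64, override_floor: Option<u64>) -> u64 {
    let default_floor = std::cmp::max(2, iterations / 32);
    match override_floor {
        Some(floor) => default_floor.max(floor),
        None => default_floor,
    }
}

/// Return the fault classes whose seed count falls below the floor.
/// A non-empty result would mark the campaign as a weak simulation.
pub fn weak_fault_classes<'a>(seed_counts: &[(&'a str, u64)], floor: u64) -> Vec<&'a str> {
    seed_counts
        .iter()
        .filter(|(_, seeds)| *seeds < floor)
        .map(|(name, _)| *name)
        .collect()
}
```

For the 1024-seed CI campaign this gives a default floor of 32 seeds per fault class, while very short campaigns still need at least 2.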
## Consequences
|
||||
|
||||
We get FoundationDB-style pressure in a much smaller shape: many deterministic failure schedules can
|
||||
run as normal Rust tests without booting machines. The first media model covers duplicate publisher
|
||||
convergence, network partitions, transient loss, publisher restart/backfill, convergence latency,
|
||||
encoder drift, and publisher phase alignment, and the first runtime command applies it to archive
|
||||
manifests. The first control model covers gossip propagation across relays and nodes under dropped,
|
||||
delayed, duplicated, partitioned, and outage-delayed control messages. The shrink/replay path makes
|
||||
supported failures small enough to debug before they become production event archaeology; exact
|
||||
scenario JSON is the replay contract. Later models can add tuner scheduling, relay cache eviction,
|
||||
and image rollout state machines. The composed system model is the first workload-level step: it
|
||||
checks the boundary between control-plane speed and media determinism, which is where production
|
||||
duplicate publishers are currently most fragile.
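The boundary between control-plane speed and media determinism comes down to which clock a publisher uses to number chunks. The following sketch contrasts the two `--sequence-clock` modes under simplifying assumptions: `SequenceClock`, `chunk_sequence`, millisecond units, and a fixed chunk duration are illustrative names and choices, not the simulator's actual types.

```rust
/// Hypothetical mirror of the `global` and `local-activation` clock modes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SequenceClock {
    Global,
    LocalActivation,
}

/// Derive a chunk sequence number for a fragment at `media_time_ms`.
/// `activation_time_ms` is when control gossip activated this publisher.
pub fn chunk_sequence(
    clock: SequenceClock,
    media_time_ms: u64,
    activation_time_ms: u64,
    chunk_duration_ms: u64,
) -> u64 {
    assert!(chunk_duration_ms > 0, "chunk duration must be non-zero");
    match clock {
        // Same broadcast moment yields the same sequence on every publisher.
        SequenceClock::Global => media_time_ms / chunk_duration_ms,
        // Sequence depends on when gossip reached this publisher.
        SequenceClock::LocalActivation => {
            media_time_ms.saturating_sub(activation_time_ms) / chunk_duration_ms
        }
    }
}
```

With two publishers activated 3 seconds apart, the global mode assigns the same sequence to the same broadcast moment, while the local-activation mode assigns different ones; that difference is the phase bug the negative regression is meant to catch.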
|
||||
|
||||
## Alternatives considered
|
||||
|
||||
- Keep writing production probes only. Rejected because probes prove what happened once, not what
|
||||
should happen across many fault schedules.
|
||||
- Extend the Python node-agent as the simulation oracle. Rejected because the image should get
|
||||
thinner and the runtime behavior belongs in the Rust node.
|
||||
|
||||
## Rollout/teardown
|
||||
|
||||
Roll forward by adding simulation tests next to each new distributed invariant. Roll back by keeping
|
||||
the production probes; the simulation module is library-only and has no runtime service impact.
|
||||