Add duplicate publisher determinism proof

2026-06-10 03:28:55 -07:00 · 91dad67fc2
commit 91dad67fc2
parent 5d0f3077d3
18 changed files with 21569 additions and 595 deletions
--- a/evolution/proposals/ECP-0157-rust-simulation-testing.md
+++ b/evolution/proposals/ECP-0157-rust-simulation-testing.md
@@ -0,0 +1,158 @@
+# ECP-0157: Rust Simulation Testing
+
+Status: Draft
+
+## Context
+
+Production is now fast enough to expose distributed bugs quickly, but it is still the wrong first
+place to discover scheduler, archive, and duplicate-publisher invariants. The Python node-agent also
+made this worse by putting core control behavior outside the already-built Rust node binary.
+
+## Decision
+
+Add a small deterministic simulation layer in `ec-core` and use it for distributed media invariants:
+
+- `ec-node` remains the runtime owner for node behavior.
+- Tests model logical time, delayed delivery, backfill, duplicate publishers, and archive
+  convergence in Rust.
+- Simulation scenarios are seed-replayable and include deterministic jitter, transient drops,
+  partition windows, publisher outage/restart windows, backfill retries, and encoder drift faults.
+- A failing simulation must print or carry a replay hint so the exact schedule can be rerun.
+- Simulation reports include deterministic execution history so a failure has an ordered event trace,
+  not only a final assertion.
+- Simulation campaigns run many seed schedules in one fast test and preserve the first failing seed,
+  invariant report, and final state as the failure artifact.
+- Campaign execution has a reusable seeded runner so new models can share replay/failure accounting
+  instead of copying bespoke loops.
+- First failures are automatically shrunk where the model supports it. For duplicate publishers the
+  shrinker removes irrelevant partitions, publisher outages, timing jitter, transient drops, and
+  excess media sequence range while keeping the original invariant unchanged.
+- Invariants are explicit checks, not implicit test prose: duplicate source count, missing
+  sequences, divergent hashes, missing media timing, conflicting media timing, complete duplicate
+  coverage, and convergence-deadline budgets.
+- Media identity is checked by BLAKE3 hashes for stream, rendition, track, sequence, profile, and
+  source-material identity.
+- Media timing is part of the proof model. Matching hashes are not considered a complete duplicate
+  proof unless both publishers also expose a shared logical media clock for the chunk.
+- Source-material identity is separate from stream metadata. Two publishers can advertise the same
+  channel, sequence, timing, and encoder profile while still encoding different RF/source windows;
+  that must fail in simulation before production archive comparisons burn wall-clock time.
+- Publisher-origin archive `group_sequence` is derived from parsed media-time identity plus stable
+  track id, not local receive time. Receive time is telemetry; it is not proof that two publishers
+  archived the same broadcast moment.
+- Live publisher archive proof normalizes fMP4 `tfdt` to the Unix media slot before hashing a
+  fragment. The first fragment for each track anchors the process-local media clock to wall-clock
+  time; later fragments preserve ffmpeg's media cadence from that origin. ffmpeg still runs with
+  wall-clock timestamp input enabled where possible, but the Rust archive writer is the authority
+  for the proof clock when source MPEG-TS timestamps are process-relative.
+- Archive `group_sequence` includes a stable subfragment slot inside each `(track_id,
+  media_sequence)` pair, because audio can legitimately emit multiple fragments within one media
+  slot and those must compare in order instead of colliding as source-local divergences.
+- Duplicate-publisher scenarios model publisher content phase separately from advertised archive
+  sequence. A publisher that starts its local encoder at a different content phase must fail fast in
+  simulation, because production fragments with the same local sequence are not proof of the same
+  broadcast moment unless the chunk clock is shared.
+- `ec-node sim-duplicate-publishers` runs the same campaign model from the compiled Rust binary and
+  emits JSON suitable for CI artifacts and rollout gates.
+- `ec-node sim-duplicate-publishers --failure-artifact <path>` writes the first failing campaign as
+  a replayable JSON artifact with the shrunk scenario, invariant report, event trace, shrink steps,
+  and a command hint for replaying `replay_scenario` through `--scenario-json -`.
+- `ec-node sim-duplicate-publishers --scenario-json <path-or->` replays an exact serialized
+  `DuplicatePublisherScenario`, so a shrunk failure from CI or production investigation can be rerun
+  without reconstructing command-line flags.
+- `ec-node sim-duplicate-publishers` can inject timing faults directly with
+  `--missing-media-timing-publisher NODE` and `--publisher-media-time-offset NODE:OFFSET_MS`, so
+  the current production proof class can be reproduced without hand-writing scenario JSON.
+- `ec-node sim-duplicate-publishers` and `ec-node sim-system` can inject source-window faults with
+  `--publisher-source-material NODE:MATERIAL_ID`. Any campaign with multiple source-material ids
+  reports source-material mismatch observations instead of leaving operators to infer that class
+  from divergent hashes.
+- `ec-node archive-convergence` reads existing archive manifest JSONL and applies the same
+  convergence semantics to real duplicate publisher outputs.
+- Control-plane simulation models logical nodes, seeded gossip fanout, delivery jitter, transient
+  drops, node-specific partitions, node outages, duplicate deliveries, and propagation deadlines.
+- `ec-node sim-control-plane` runs the control-plane model from the compiled Rust binary and emits
+  replayable JSON with the first failing seed, scenario, invariant report, and ordered trace.
+- Control-plane campaign reports track max propagation time, max delivery time, dropped messages,
+  partition-delayed messages, outage-delayed messages, and duplicate messages, so production
+  rollout measurements have a fast simulation baseline.
+- System simulation composes control-plane propagation with duplicate-publisher media production.
+  Control gossip produces per-publisher activation times; the media workload then proves that delayed
+  schedule propagation still converges when publishers use the global media sequence clock and fails
+  when they derive chunk identity from local activation time.
+- `ec-node sim-system` runs that composed workload from the deployed node binary. Its default
+  campaign models the current publisher topology class and can switch `--sequence-clock` between
+  `global` and `local-activation` to reproduce the exact class of duplicate-publisher phase bug
+  before waiting for production samples.
+- `ec-node sim-system --fault-profile foundationdb` uses a FoundationDB-style fault profile: each
+  seed generates a different but replayable cluster schedule with randomized control partitions, node
+  outages, transient gossip drops, duplicate messages, media partitions, publisher outages, and
+  archive backfill pressure.
+- The FoundationDB-style profile must also have an explicit negative regression for
+  `local-activation` sequence clocks, so the model proves the current production failure class is
+  caught in Rust before any rollout waits for live fragments.
+- `ec-node sim-system --failure-artifact <path>` writes the first failing composed system schedule
+  as replayable JSON, including the exact control/media scenario, invariant report, ordered trace,
+  and command hint for rerunning `--scenario-json -`.
+- System campaign reports must include fault coverage counters, not just pass/fail. A fast campaign
+  is only useful if it proves that the simulated run actually exercised the failure modes operators
+  care about.
+- System campaign reports also aggregate publisher phase-offset observations. A production-like
+  divergence caused by local activation clocks should identify itself as a phase bug in the campaign
+  JSON instead of requiring operators to infer it from divergent hashes alone.
+- System campaign reports also aggregate source-material mismatch observations. A production-like
+  divergence caused by independent tuner/source windows should identify itself as a source-material
+  bug in the campaign JSON instead of being confused with codec nondeterminism.
+- System and duplicate-publisher reports aggregate missing media-timing records and media-timing
+  conflicts, so the live failure class where fragments arrive without a usable media clock is visible
+  in fast Rust simulation output.
+- FoundationDB-profile `sim-system` campaigns require that coverage by default: control transient
+  drops, partition delays, node outage delays, duplicate messages, media transient drops, media
+  partition delays, publisher outages, backfill, and observed convergence timing must all appear in
+  the campaign report. A campaign that passes invariants but misses these classes is reported as a
+  weak simulation, not a green rollout gate.
+- FoundationDB-profile coverage is breadth-gated, not only boolean-gated. By default at least
+  `max(2, iterations / 32)` seeds must exercise every required distributed fault class; operators
+  can raise that floor with `--min-fault-seed-coverage` for longer scientific campaigns.
+- Campaign reports track both event totals and seed counts per fault class, plus a bounded list of
+  the slowest system schedules with replay hints. This makes green runs inspectable: operators can
+  see how broadly the randomized schedule space was exercised and which seeds define the current
+  latency tail.
+- System campaign reports also aggregate deterministic simulated convergence time and trace event
+  counts. `ec-node sim-system` stamps wall-clock execution telemetry around the campaign so a run
+  reports iterations per second, simulated system seconds per wall second, and trace events per
+  second without putting wall-clock data into the replayed scenario itself.
+- `sim-system --failure-artifact <path>` writes an artifact for weak coverage as well as invariant
+  failures, so CI can preserve evidence when a campaign was too small or too narrow to exercise the
+  required distributed faults.
+- Forge `ci-gates` runs the Rust system simulator tests and a 1024-seed
+  `sim-system --fault-profile foundationdb` campaign from the compiled `ec-node` binary before web
+  build/deploy gates. This keeps the fast randomized check ahead of production rollout evidence.
+- Simulation failures must be actionable before any matching production rollout is considered
+  healthy.
+
+## Consequences
+
+We get FoundationDB-style pressure at a much smaller scale: many deterministic failure schedules can
+run as normal Rust tests without booting machines. The first media model covers duplicate publisher
+convergence, network partitions, transient loss, publisher restart/backfill, convergence latency,
+encoder drift, and publisher phase alignment, and the first runtime command applies it to archive
+manifests. The first control model covers gossip propagation across relays and nodes under dropped,
+delayed, duplicated, partitioned, and outage-delayed control messages. The shrink/replay path makes
+supported failures small enough to debug before they become production event archaeology; exact
+scenario JSON is the replay contract. Later models can add tuner scheduling, relay cache eviction,
+and image rollout state machines. The composed system model is the first workload-level step: it
+checks the boundary between control-plane speed and media determinism, which is where production
+duplicate publishers are currently most fragile.
+
+## Alternatives considered
+
+- Keep writing production probes only. Rejected because probes prove what happened once, not what
+  should happen across many fault schedules.
+- Extend the Python node-agent as the simulation oracle. Rejected because the image should get
+  thinner and the runtime behavior belongs in the Rust node.
+
+## Rollout/teardown
+
+Roll forward by adding simulation tests next to each new distributed invariant. Roll back by keeping
+the production probes; the simulation module is library-only and has no runtime service impact.
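
The seeded campaign runner and the breadth-gated coverage floor described in the Decision section can be sketched roughly as follows. This is a minimal illustration, not the `ec-core` implementation: `SplitMix64`, `SeedOutcome`, `Verdict`, `run_campaign`, `toy_model`, and the three-entry `FAULT_CLASSES` list are hypothetical names chosen for this example. Only the `max(2, iterations / 32)` floor, the "keep the first failing seed" rule, and the "weak simulation" outcome come from the proposal text. Shrinking and trace capture are left out.

```rust
// Hypothetical sketch: none of these names are taken from ec-core.

/// SplitMix64: a tiny deterministic PRNG, so a given seed always replays the same schedule.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Illustrative subset of the fault classes a campaign must exercise.
const FAULT_CLASSES: [&str; 3] = ["transient_drop", "partition_delay", "publisher_outage"];

/// What one seeded schedule observed.
struct SeedOutcome {
    invariants_hold: bool,
    faults_seen: [bool; 3],
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Failed { first_failing_seed: u64 },
    WeakCoverage { fault_class: &'static str, seeds: u64, floor: u64 },
}

/// Default breadth floor from the proposal: max(2, iterations / 32).
fn min_fault_seed_coverage(iterations: u64) -> u64 {
    std::cmp::max(2, iterations / 32)
}

/// Runs seeds base_seed..base_seed + iterations in order, stops at the first
/// failing seed, and otherwise checks per-fault-class seed counts against `floor`.
fn run_campaign(
    base_seed: u64,
    iterations: u64,
    floor: u64,
    model: impl Fn(u64) -> SeedOutcome,
) -> Verdict {
    let mut seeds_per_fault = [0u64; 3];
    for seed in base_seed..base_seed + iterations {
        let outcome = model(seed);
        if !outcome.invariants_hold {
            return Verdict::Failed { first_failing_seed: seed };
        }
        for (count, seen) in seeds_per_fault.iter_mut().zip(outcome.faults_seen) {
            *count += u64::from(seen);
        }
    }
    // Invariants held everywhere; a campaign that under-exercised a fault class is weak, not green.
    for (i, &seeds) in seeds_per_fault.iter().enumerate() {
        if seeds < floor {
            return Verdict::WeakCoverage { fault_class: FAULT_CLASSES[i], seeds, floor };
        }
    }
    Verdict::Pass
}

/// Toy model: which faults fire depends only on the seed, so reruns are identical.
fn toy_model(seed: u64) -> SeedOutcome {
    let bits = SplitMix64(seed).next_u64();
    SeedOutcome {
        invariants_hold: true,
        faults_seen: [bits & 1 != 0, bits & 2 != 0, bits & 4 != 0],
    }
}
```

A caller would pass something like `run_campaign(0, 1024, min_fault_seed_coverage(1024), toy_model)` and treat anything other than `Verdict::Pass` as a reason to write a failure artifact. Keeping wall-clock data out of `SeedOutcome` is what lets the same seed reproduce the same verdict.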