ECP-0157: Rust Simulation Testing

Status: Draft

Context

Production is now fast enough to expose distributed bugs quickly, but it is still the wrong first place to discover violations of scheduler, archive, and duplicate-publisher invariants. The Python node-agent made this worse by placing core control behavior outside the Rust node binary that is already built.

Decision

Add a small deterministic simulation layer in ec-core and use it for distributed media invariants:

ec-node remains the runtime owner for node behavior.
Tests model logical time, delayed delivery, backfill, duplicate publishers, and archive convergence in Rust.
Simulation scenarios are seed-replayable and include deterministic jitter, transient drops, partition windows, publisher outage/restart windows, backfill retries, and encoder drift faults.
A failing simulation must print or carry a replay hint so the exact schedule can be rerun.
Simulation reports include deterministic execution history so a failure has an ordered event trace, not only a final assertion.
Simulation campaigns run many seed schedules in one fast test and preserve the first failing seed, invariant report, and final state as the failure artifact.
Campaign execution has a reusable seeded runner so new models can share replay/failure accounting instead of copying bespoke loops.
First failures are automatically shrunk where the model supports it. For duplicate publishers the shrinker removes irrelevant partitions, publisher outages, timing jitter, transient drops, and excess media sequence range while keeping the original invariant unchanged.
Invariants are explicit checks, not implicit test prose: duplicate source count, missing sequences, divergent hashes, missing media timing, conflicting media timing, complete duplicate coverage, and convergence-deadline budgets.
Media identity is checked by BLAKE3 hashes for stream, rendition, track, sequence, profile, and source-material identity.
Media timing is part of the proof model. Matching hashes are not considered a complete duplicate proof unless both publishers also expose a shared logical media clock for the chunk.
Source-material identity is separate from stream metadata. Two publishers can advertise the same channel, sequence, timing, and encoder profile while still encoding different RF/source windows; that must fail in simulation before production archive comparisons burn wall-clock time.
Publisher-origin archive group_sequence is derived from parsed media-time identity plus stable track id, not local receive time. Receive time is telemetry; it is not proof that two publishers archived the same broadcast moment.
Live publisher archive proof normalizes fMP4 tfdt to the Unix media slot before hashing a fragment. The first fragment for each track anchors the process-local media clock to wall-clock time; later fragments preserve ffmpeg's media cadence from that origin. ffmpeg still runs with wall-clock timestamp input enabled where possible, but the Rust archive writer is the authority for the proof clock when source MPEG-TS timestamps are process-relative.
Archive group_sequence includes a stable subfragment slot inside each (track_id, media_sequence) pair, because audio can legitimately emit multiple fragments within one media slot and those must compare in order instead of colliding as source-local divergences.
Duplicate-publisher scenarios model publisher content phase separately from advertised archive sequence. A publisher that starts its local encoder at a different content phase must fail fast in simulation, because production fragments with the same local sequence are not proof of the same broadcast moment unless the chunk clock is shared.
ec-node sim-duplicate-publishers runs the same campaign model from the compiled Rust binary and emits JSON suitable for CI artifacts and rollout gates.
ec-node sim-duplicate-publishers --failure-artifact <path> writes the first failing campaign as a replayable JSON artifact with the shrunk scenario, invariant report, event trace, shrink steps, and a command hint for replaying replay_scenario through --scenario-json -.
ec-node sim-duplicate-publishers --scenario-json <path-or-> replays an exact serialized DuplicatePublisherScenario, so a shrunk failure from CI or production investigation can be rerun without reconstructing command-line flags.
ec-node sim-duplicate-publishers can inject timing faults directly with --missing-media-timing-publisher NODE and --publisher-media-time-offset NODE:OFFSET_MS, so the current production proof class can be reproduced without hand-writing scenario JSON.
ec-node sim-duplicate-publishers and ec-node sim-system can inject source-window faults with --publisher-source-material NODE:MATERIAL_ID. Any campaign with multiple source-material ids reports source-material mismatch observations instead of leaving operators to infer that class from divergent hashes.
ec-node archive-convergence reads existing archive manifest JSONL and applies the same convergence semantics to real duplicate publisher outputs.
Control-plane simulation models logical nodes, seeded gossip fanout, delivery jitter, transient drops, node-specific partitions, node outages, duplicate deliveries, and propagation deadlines.
ec-node sim-control-plane runs the control-plane model from the compiled Rust binary and emits replayable JSON with the first failing seed, scenario, invariant report, and ordered trace.
Control-plane campaign reports track max propagation time, max delivery time, dropped messages, partition-delayed messages, outage-delayed messages, and duplicate messages, so prod rollout measurements have a fast simulation baseline.
System simulation composes control-plane propagation with duplicate-publisher media production. Control gossip produces per-publisher activation times; the media workload then proves that delayed schedule propagation still converges when publishers use the global media sequence clock and fails when they derive chunk identity from local activation time.
ec-node sim-system runs that composed workload from the deployed node binary. Its default campaign models the current publisher topology class and can switch --sequence-clock between global and local-activation to reproduce the exact class of duplicate-publisher phase bug before waiting for production samples.
ec-node sim-system --fault-profile foundationdb uses a FoundationDB-style fault profile: each seed generates a different but replayable cluster schedule with randomized control partitions, node outages, transient gossip drops, duplicate messages, media partitions, publisher outages, and archive backfill pressure.
The FoundationDB-style profile must also have an explicit negative regression for local-activation sequence clocks, so the model proves the current production failure class is caught in Rust before any rollout waits for live fragments.
ec-node sim-system --failure-artifact <path> writes the first failing composed system schedule as replayable JSON, including the exact control/media scenario, invariant report, ordered trace, and command hint for rerunning --scenario-json -.
System campaign reports must include fault coverage counters, not just pass/fail. A fast campaign is only useful if it proves that the simulated run actually exercised the failure modes operators care about.
System campaign reports also aggregate publisher phase-offset observations. A production-like divergence caused by local activation clocks should identify itself as a phase bug in the campaign JSON instead of requiring operators to infer it from divergent hashes alone.
System campaign reports also aggregate source-material mismatch observations. A production-like divergence caused by independent tuner/source windows should identify itself as a source-material bug in the campaign JSON instead of being confused with codec nondeterminism.
System and duplicate-publisher reports aggregate missing media-timing records and media-timing conflicts, so the live failure class where fragments arrive without a usable media clock is visible in fast Rust simulation output.
FoundationDB-profile sim-system campaigns require that coverage by default: control transient drops, partition delays, node outage delays, duplicate messages, media transient drops, media partition delays, publisher outages, backfill, and observed convergence timing must all appear in the campaign report. A campaign that passes invariants but misses these classes is reported as a weak simulation, not a green rollout gate.
FoundationDB-profile coverage is breadth-gated, not only boolean-gated. By default at least max(2, iterations / 32) seeds must exercise every required distributed fault class; operators can raise that floor with --min-fault-seed-coverage for longer scientific campaigns.
Campaign reports track both event totals and seed counts per fault class, plus a bounded list of the slowest system schedules with replay hints. This makes green runs inspectable: operators can see how broadly the randomized schedule space was exercised and which seeds define the current latency tail.
System campaign reports also aggregate deterministic simulated convergence time and trace event counts. ec-node sim-system stamps wall-clock execution telemetry around the campaign so a run reports iterations per second, simulated system seconds per wall second, and trace events per second without putting wall-clock data into the replayed scenario itself.
sim-system --failure-artifact <path> writes an artifact for weak coverage as well as invariant failures, so CI can preserve evidence when a campaign was too small or too narrow to exercise the required distributed faults.
Forge ci-gates runs the Rust system simulator tests and a 1024-seed sim-system --fault-profile foundationdb campaign from the compiled ec-node binary before web build/deploy gates. This keeps the fast randomized check ahead of production rollout evidence.
Simulation failures must be actionable before any matching production rollout is considered healthy.
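The breadth gate described above for FoundationDB-profile campaigns can be illustrated with a minimal sketch. It is not the ec-node implementation: the type `GateVerdict`, the functions `default_min_fault_seed_coverage` and `evaluate_gate`, and the fault-class labels are hypothetical names chosen for illustration. The sketch assumes integer division for `iterations / 32` and assumes an operator override can only raise the floor, never lower it.

```rust
use std::collections::BTreeMap;

/// Outcome of a campaign once invariants and coverage are both considered.
/// (Hypothetical type; the real report format is not shown in this document.)
#[derive(Debug, PartialEq, Eq)]
enum GateVerdict {
    /// Invariants held and every required fault class met the seed floor.
    Green,
    /// Invariants held, but the listed fault classes were hit by too few seeds.
    WeakSimulation(Vec<String>),
    /// At least one seed violated an invariant.
    InvariantFailure,
}

/// Default floor from the text: max(2, iterations / 32).
/// Integer division is an assumption of this sketch.
fn default_min_fault_seed_coverage(iterations: u64) -> u64 {
    std::cmp::max(2, iterations / 32)
}

/// Classify a finished campaign from per-class seed counts.
fn evaluate_gate(
    iterations: u64,
    invariant_failures: u64,
    seeds_per_fault_class: &BTreeMap<&str, u64>,
    required: &[&str],
    floor_override: Option<u64>,
) -> GateVerdict {
    if invariant_failures > 0 {
        return GateVerdict::InvariantFailure;
    }
    let default_floor = default_min_fault_seed_coverage(iterations);
    // Assumption: an override may raise the floor but not lower it.
    let floor = floor_override.map_or(default_floor, |f| f.max(default_floor));
    let weak: Vec<String> = required
        .iter()
        .filter(|class| seeds_per_fault_class.get(*class).copied().unwrap_or(0) < floor)
        .map(|class| class.to_string())
        .collect();
    if weak.is_empty() {
        GateVerdict::Green
    } else {
        GateVerdict::WeakSimulation(weak)
    }
}

fn main() {
    // Hypothetical fault-class labels and seed counts.
    let mut seeds = BTreeMap::new();
    seeds.insert("backfill", 40u64);
    seeds.insert("publisher_outage", 5u64);
    let required = ["backfill", "publisher_outage"];

    // A 1024-seed campaign has a default floor of 1024 / 32 = 32 seeds per class.
    let verdict = evaluate_gate(1024, 0, &seeds, &required, None);
    println!("floor = {}", default_min_fault_seed_coverage(1024));
    println!("verdict = {:?}", verdict);
}
```

With these counts the 1024-seed campaign is weak rather than green, because one class was exercised by fewer seeds than the floor even though no invariant failed.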

Consequences

We get FoundationDB-style pressure in a much smaller shape: many deterministic failure schedules can run as normal Rust tests without booting machines. The first media model covers duplicate publisher convergence, network partitions, transient loss, publisher restart/backfill, convergence latency, encoder drift, and publisher phase alignment, and the first runtime command applies it to archive manifests. The first control model covers gossip propagation across relays and nodes under dropped, delayed, duplicated, partitioned, and outage-delayed control messages. The shrink/replay path makes supported failures small enough to debug before they become production event archaeology; exact scenario JSON is the replay contract. Later models can add tuner scheduling, relay cache eviction, and image rollout state machines. The composed system model is the first workload-level step: it checks the boundary between control-plane speed and media determinism, which is where production duplicate publishers are currently most fragile.
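As a rough illustration of how many deterministic schedules can run as an ordinary Rust test, the following sketch runs a seeded campaign over a toy two-publisher model and keeps the first failing seed as a replay hint. Everything here is an assumption made for illustration, not the ec-core model: the names `SequenceClock`, `Failure`, `run_seed`, and `run_campaign`, the 2000 ms slot length, the 10 s activation window, and the use of SplitMix64 as the seeded generator. The toy model only captures the idea that a global sequence clock is independent of activation time while a local-activation clock is not.

```rust
/// SplitMix64: a small deterministic generator, used here so that a seed
/// fully determines the schedule without external crates.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SequenceClock {
    /// Sequence derived from the shared logical media time.
    Global,
    /// Sequence derived from time since the publisher's own activation.
    LocalActivation,
}

const SLOT_MS: u64 = 2_000; // hypothetical media slot length
const MAX_ACTIVATION_DELAY_MS: u64 = 10_000; // hypothetical activation window

/// Failure artifact: enough state to rerun and inspect the exact schedule.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Failure {
    seed: u64,
    activation_ms: [u64; 2],
    moment_ms: u64,
    sequences: [u64; 2],
}

/// Run one seeded schedule; both publishers must assign the same sequence
/// to the same broadcast moment.
fn run_seed(seed: u64, clock: SequenceClock) -> Result<(), Failure> {
    let mut rng = seed;
    let activation_ms = [
        splitmix64(&mut rng) % MAX_ACTIVATION_DELAY_MS,
        splitmix64(&mut rng) % MAX_ACTIVATION_DELAY_MS,
    ];
    for step in 0..16u64 {
        // Every sampled moment is after both publishers have activated.
        let moment_ms = MAX_ACTIVATION_DELAY_MS + step * 750;
        let sequences = activation_ms.map(|activated_at| match clock {
            SequenceClock::Global => moment_ms / SLOT_MS,
            SequenceClock::LocalActivation => (moment_ms - activated_at) / SLOT_MS,
        });
        if sequences[0] != sequences[1] {
            return Err(Failure { seed, activation_ms, moment_ms, sequences });
        }
    }
    Ok(())
}

/// Run consecutive seeds and preserve only the first failure.
fn run_campaign(first_seed: u64, iterations: u64, clock: SequenceClock) -> Option<Failure> {
    (first_seed..first_seed + iterations).find_map(|seed| run_seed(seed, clock).err())
}

fn main() {
    println!("global: {:?}", run_campaign(0, 256, SequenceClock::Global));
    if let Some(failure) = run_campaign(0, 256, SequenceClock::LocalActivation) {
        println!("local-activation failed: {:?}", failure);
        println!("replay hint: rerun seed {}", failure.seed);
    }
}
```

Because the schedule is a pure function of the seed, rerunning the reported seed reproduces the same failure record, which is the property the replay contract depends on.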

Alternatives considered

Keep writing production probes only. Rejected because probes prove what happened once, not what should happen across many fault schedules.
Extend the Python node-agent as the simulation oracle. Rejected because the image should get thinner and the runtime behavior belongs in the Rust node.

Rollout/teardown

Roll forward by adding simulation tests next to each new distributed invariant. Roll back by keeping the production probes; the simulation module is library-only and has no runtime service impact.
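A simulation test placed next to a new invariant could take roughly the following shape. This is a sketch under stated assumptions rather than the ec-core API: `ChunkRecord`, `Violation`, `Archive`, and `check_duplicate_publishers` are hypothetical names, the publishers are labeled "a" and "b" for illustration, and the 32-byte array stands in for a BLAKE3 digest without computing one. It covers four of the explicit checks named in the decision: missing sequences, divergent hashes, missing media timing, and conflicting media timing.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One archived chunk as seen by a publisher (hypothetical shape).
#[derive(Clone, Debug, PartialEq, Eq)]
struct ChunkRecord {
    /// Stand-in for a BLAKE3 digest; no hashing is performed in this sketch.
    content_hash: [u8; 32],
    /// Shared logical media clock for the chunk, if the publisher exposed one.
    media_time_ms: Option<u64>,
}

#[derive(Debug, PartialEq, Eq)]
enum Violation {
    MissingSequence { publisher: &'static str, sequence: u64 },
    DivergentHash { sequence: u64 },
    MissingMediaTiming { publisher: &'static str, sequence: u64 },
    ConflictingMediaTiming { sequence: u64 },
}

type Archive = BTreeMap<u64, ChunkRecord>;

/// Compare two publisher archives and report every violation in sequence order.
fn check_duplicate_publishers(a: &Archive, b: &Archive) -> Vec<Violation> {
    let mut out = Vec::new();
    let sequences: BTreeSet<u64> = a.keys().chain(b.keys()).copied().collect();
    for sequence in sequences {
        match (a.get(&sequence), b.get(&sequence)) {
            (Some(ra), Some(rb)) => {
                if ra.content_hash != rb.content_hash {
                    out.push(Violation::DivergentHash { sequence });
                }
                // Matching hashes are not a complete proof without shared timing.
                match (ra.media_time_ms, rb.media_time_ms) {
                    (Some(ta), Some(tb)) => {
                        if ta != tb {
                            out.push(Violation::ConflictingMediaTiming { sequence });
                        }
                    }
                    (ta, tb) => {
                        if ta.is_none() {
                            out.push(Violation::MissingMediaTiming { publisher: "a", sequence });
                        }
                        if tb.is_none() {
                            out.push(Violation::MissingMediaTiming { publisher: "b", sequence });
                        }
                    }
                }
            }
            (Some(_), None) => out.push(Violation::MissingSequence { publisher: "b", sequence }),
            (None, Some(_)) => out.push(Violation::MissingSequence { publisher: "a", sequence }),
            (None, None) => unreachable!("sequence came from one of the two archives"),
        }
    }
    out
}

/// Helper for building test records with a repeated-byte placeholder hash.
fn record(hash_byte: u8, media_time_ms: Option<u64>) -> ChunkRecord {
    ChunkRecord { content_hash: [hash_byte; 32], media_time_ms }
}

fn main() {
    let a: Archive = [(1, record(7, Some(2_000))), (2, record(8, Some(4_000)))].into();
    let b: Archive = [(1, record(7, None)), (3, record(9, Some(6_000)))].into();
    for violation in check_duplicate_publishers(&a, &b) {
        println!("{:?}", violation);
    }
}
```

Returning a list of typed violations instead of asserting inside the model keeps the invariant explicit and lets a campaign report aggregate each failure class separately.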
