every.channel/evolution/proposals/ECP-0157-rust-simulation-testing.md

# ECP-0157: Rust Simulation Testing

Status: Draft

## Context

Production is now fast enough to expose distributed bugs quickly, but it is still the wrong first
place to discover violations of scheduler, archive, and duplicate-publisher invariants. The Python
node-agent made this worse by putting core control behavior outside the already-built Rust node
binary.

## Decision

Add a small deterministic simulation layer in `ec-core` and use it for distributed media invariants:

- `ec-node` remains the runtime owner for node behavior.
- Tests model logical time, delayed delivery, backfill, duplicate publishers, and archive
  convergence in Rust.
- Simulation scenarios are seed-replayable and include deterministic jitter, transient drops,
  partition windows, publisher outage/restart windows, backfill retries, and encoder drift faults.
- A failing simulation must print or carry a replay hint so the exact schedule can be rerun.
- Simulation reports include deterministic execution history so a failure has an ordered event trace,
  not only a final assertion.
- Simulation campaigns run many seed schedules in one fast test and preserve the first failing seed,
  invariant report, and final state as the failure artifact.
- Campaign execution has a reusable seeded runner so new models can share replay/failure accounting
  instead of copying bespoke loops.
- First failures are automatically shrunk where the model supports it. For duplicate publishers the
  shrinker removes irrelevant partitions, publisher outages, timing jitter, transient drops, and
  excess media sequence range while keeping the original invariant unchanged.
- Invariants are explicit checks, not implicit test prose: duplicate source count, missing
  sequences, divergent hashes, missing media timing, conflicting media timing, complete duplicate
  coverage, and convergence-deadline budgets.
- Media identity is checked by BLAKE3 hashes for stream, rendition, track, sequence, profile, and
  source-material identity.
- Media timing is part of the proof model. Matching hashes are not considered a complete duplicate
  proof unless both publishers also expose a shared logical media clock for the chunk.
- Source-material identity is separate from stream metadata. Two publishers can advertise the same
  channel, sequence, timing, and encoder profile while still encoding different RF/source windows;
  that must fail in simulation before production archive comparisons burn wall-clock time.
- Publisher-origin archive `group_sequence` is derived from parsed media-time identity plus stable
  track id, not local receive time. Receive time is telemetry; it is not proof that two publishers
  archived the same broadcast moment.
- Live publisher archive proof normalizes fMP4 `tfdt` to the Unix media slot before hashing a
  fragment. The first fragment for each track anchors the process-local media clock to wall-clock
  time; later fragments preserve ffmpeg's media cadence from that origin. ffmpeg still runs with
  wall-clock timestamp input enabled where possible, but the Rust archive writer is the authority
  for the proof clock when source MPEG-TS timestamps are process-relative.
- Archive `group_sequence` includes a stable subfragment slot inside each `(track_id,
  media_sequence)` pair, because audio can legitimately emit multiple fragments within one media
  slot and those must compare in order instead of colliding as source-local divergences.
- Duplicate-publisher scenarios model publisher content phase separately from advertised archive
  sequence. A publisher that starts its local encoder at a different content phase must fail fast in
  simulation, because production fragments with the same local sequence are not proof of the same
  broadcast moment unless the chunk clock is shared.
- `ec-node sim-duplicate-publishers` runs the same campaign model from the compiled Rust binary and
  emits JSON suitable for CI artifacts and rollout gates.
- `ec-node sim-duplicate-publishers --failure-artifact <path>` writes the first failing campaign as
  a replayable JSON artifact with the shrunk scenario, invariant report, event trace, shrink steps,
  and a command hint for replaying `replay_scenario` through `--scenario-json -`.
- `ec-node sim-duplicate-publishers --scenario-json <path-or->` replays an exact serialized
  `DuplicatePublisherScenario`, so a shrunk failure from CI or production investigation can be rerun
  without reconstructing command-line flags.
- `ec-node sim-duplicate-publishers` can inject timing faults directly with
  `--missing-media-timing-publisher NODE` and `--publisher-media-time-offset NODE:OFFSET_MS`, so
  the current production proof class can be reproduced without hand-writing scenario JSON.
- `ec-node sim-duplicate-publishers` and `ec-node sim-system` can inject source-window faults with
  `--publisher-source-material NODE:MATERIAL_ID`. Any campaign with multiple source-material ids
  reports source-material mismatch observations instead of leaving operators to infer that class
  from divergent hashes.
- `ec-node archive-convergence` reads existing archive manifest JSONL and applies the same
  convergence semantics to real duplicate publisher outputs.
- Control-plane simulation models logical nodes, seeded gossip fanout, delivery jitter, transient
  drops, node-specific partitions, node outages, duplicate deliveries, and propagation deadlines.
- `ec-node sim-control-plane` runs the control-plane model from the compiled Rust binary and emits
  replayable JSON with the first failing seed, scenario, invariant report, and ordered trace.
- Control-plane campaign reports track max propagation time, max delivery time, dropped messages,
  partition-delayed messages, outage-delayed messages, and duplicate messages, so prod rollout
  measurements have a fast simulation baseline.
- System simulation composes control-plane propagation with duplicate-publisher media production.
  Control gossip produces per-publisher activation times; the media workload then proves that delayed
  schedule propagation still converges when publishers use the global media sequence clock and fails
  when they derive chunk identity from local activation time.
- `ec-node sim-system` runs that composed workload from the deployed node binary. Its default
  campaign models the current publisher topology class and can switch `--sequence-clock` between
  `global` and `local-activation` to reproduce the exact class of duplicate-publisher phase bug
  before waiting for production samples.
- `ec-node sim-system --fault-profile foundationdb` uses a FoundationDB-style fault profile: each
  seed generates a different but replayable cluster schedule with randomized control partitions, node
  outages, transient gossip drops, duplicate messages, media partitions, publisher outages, and
  archive backfill pressure.
- The FoundationDB-style profile must also have an explicit negative regression for
  `local-activation` sequence clocks, so the model proves the current production failure class is
  caught in Rust before any rollout waits for live fragments.
- `ec-node sim-system --failure-artifact <path>` writes the first failing composed system schedule
  as replayable JSON, including the exact control/media scenario, invariant report, ordered trace,
  and command hint for rerunning `--scenario-json -`.
- System campaign reports must include fault coverage counters, not just pass/fail. A fast campaign
  is only useful if it proves that the simulated run actually exercised the failure modes operators
  care about.
- System campaign reports also aggregate publisher phase-offset observations. A production-like
  divergence caused by local activation clocks should identify itself as a phase bug in the campaign
  JSON instead of requiring operators to infer it from divergent hashes alone.
- System campaign reports also aggregate source-material mismatch observations. A production-like
  divergence caused by independent tuner/source windows should identify itself as a source-material
  bug in the campaign JSON instead of being confused with codec nondeterminism.
- System and duplicate-publisher reports aggregate missing media-timing records and media-timing
  conflicts, so the live failure class where fragments arrive without a usable media clock is visible
  in fast Rust simulation output.
- FoundationDB-profile `sim-system` campaigns require fault coverage by default: control transient
  drops, partition delays, node outage delays, duplicate messages, media transient drops, media
  partition delays, publisher outages, backfill, and observed convergence timing must all appear in
  the campaign report. A campaign that passes invariants but misses any of these classes is reported
  as a weak simulation, not a green rollout gate.
- FoundationDB-profile coverage is breadth-gated, not only boolean-gated. By default at least
  `max(2, iterations / 32)` seeds must exercise every required distributed fault class; operators
  can raise that floor with `--min-fault-seed-coverage` for longer scientific campaigns.
- Campaign reports track both event totals and seed counts per fault class, plus a bounded list of
  the slowest system schedules with replay hints. This makes green runs inspectable: operators can
  see how broadly the randomized schedule space was exercised and which seeds define the current
  latency tail.
- System campaign reports also aggregate deterministic simulated convergence time and trace event
  counts. `ec-node sim-system` stamps wall-clock execution telemetry around the campaign so a run
  reports iterations per second, simulated system seconds per wall second, and trace events per
  second without putting wall-clock data into the replayed scenario itself.
- `ec-node sim-system --failure-artifact <path>` writes an artifact for weak coverage as well as
  for invariant failures, so CI can preserve evidence when a campaign was too small or too narrow to
  exercise the required distributed faults.
- Forge `ci-gates` runs the Rust system simulator tests and a 1024-seed
  `sim-system --fault-profile foundationdb` campaign from the compiled `ec-node` binary before web
  build/deploy gates. This keeps the fast randomized check ahead of production rollout evidence.
- Simulation failures must be actionable before any matching production rollout is considered
  healthy.
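
The breadth gate described above can be sketched in a few lines of Rust. This is a minimal illustration, not the `ec-core` implementation: `GateVerdict`, `gate_campaign`, and the fault-class names are hypothetical, and the sketch assumes `--min-fault-seed-coverage` can only raise the default floor of `max(2, iterations / 32)`, never lower it.

```rust
use std::collections::BTreeMap;

/// Default breadth floor stated in this proposal: max(2, iterations / 32).
fn default_min_fault_seed_coverage(iterations: u64) -> u64 {
    std::cmp::max(2, iterations / 32)
}

/// Hypothetical verdict type; the real report shape is not specified here.
#[derive(Debug, PartialEq, Eq)]
enum GateVerdict {
    Green,
    WeakSimulation { under_covered: Vec<String> },
    InvariantFailure,
}

/// `seeds_per_fault` maps a fault-class name to the number of seeds that
/// exercised it. `min_override` stands in for `--min-fault-seed-coverage`.
fn gate_campaign(
    iterations: u64,
    invariants_passed: bool,
    required: &[&str],
    seeds_per_fault: &BTreeMap<String, u64>,
    min_override: Option<u64>,
) -> GateVerdict {
    if !invariants_passed {
        return GateVerdict::InvariantFailure;
    }
    let default_floor = default_min_fault_seed_coverage(iterations);
    // Assumption: an operator override raises the floor but never lowers it.
    let floor = min_override.map_or(default_floor, |m| m.max(default_floor));
    let under_covered: Vec<String> = required
        .iter()
        .filter(|class| seeds_per_fault.get(**class).copied().unwrap_or(0) < floor)
        .map(|class| class.to_string())
        .collect();
    if under_covered.is_empty() {
        GateVerdict::Green
    } else {
        // Invariants passed, but the schedule space was not exercised broadly.
        GateVerdict::WeakSimulation { under_covered }
    }
}
```

For the 1024-seed CI campaign the default floor works out to 32 seeds per required fault class, so a fault class seen in only 5 seeds would mark the run as weak even though every invariant held.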

## Consequences

We get FoundationDB-style pressure in a much smaller shape: many deterministic failure schedules can
run as normal Rust tests without booting machines. The first media model covers duplicate publisher
convergence, network partitions, transient loss, publisher restart/backfill, convergence latency,
encoder drift, and publisher phase alignment, and the first runtime command applies it to archive
manifests. The first control model covers gossip propagation across relays and nodes under dropped,
delayed, duplicated, partitioned, and outage-delayed control messages.

The shrink/replay path makes supported failures small enough to debug before they become production
event archaeology; exact scenario JSON is the replay contract. Later models can add tuner
scheduling, relay cache eviction, and image rollout state machines. The composed system model is the
first workload-level step: it checks the boundary between control-plane speed and media determinism,
which is where production duplicate publishers are currently most fragile.
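
To make the "many deterministic failure schedules as normal Rust tests" idea concrete, here is a minimal sketch of a reusable seeded campaign runner. All names (`SplitMix64`, `Failure`, `CampaignReport`, `run_campaign`, the toy model) are illustrative assumptions rather than the `ec-core` API, and the replay hint is plain text rather than a real `ec-node` command line. SplitMix64 is used only because it is a small, well-known PRNG that needs no external crate.

```rust
/// SplitMix64: a tiny deterministic PRNG, so every seed replays identically.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// First-failure artifact: seed, violated invariant, ordered trace, replay hint.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Failure {
    seed: u64,
    invariant: String,
    trace: Vec<String>,
    replay_hint: String,
}

#[derive(Debug, PartialEq, Eq)]
struct CampaignReport {
    seeds_run: u64,
    first_failure: Option<Failure>,
}

/// Runs `model` once per seed and stops at the first invariant failure.
/// The model receives a seeded PRNG and a trace buffer it appends events to.
fn run_campaign<F>(base_seed: u64, iterations: u64, mut model: F) -> CampaignReport
where
    F: FnMut(&mut SplitMix64, &mut Vec<String>) -> Result<(), String>,
{
    for i in 0..iterations {
        let seed = base_seed.wrapping_add(i);
        let mut rng = SplitMix64(seed);
        let mut trace = Vec::new();
        if let Err(invariant) = model(&mut rng, &mut trace) {
            let replay_hint = format!("replay: base_seed={} iterations=1", seed);
            return CampaignReport {
                seeds_run: i + 1,
                first_failure: Some(Failure { seed, invariant, trace, replay_hint }),
            };
        }
    }
    CampaignReport { seeds_run: iterations, first_failure: None }
}

/// Toy model: two publishers deliver one chunk with seeded jitter in logical
/// ticks; the invariant is a convergence deadline.
fn toy_duplicate_publisher_model(
    rng: &mut SplitMix64,
    trace: &mut Vec<String>,
) -> Result<(), String> {
    const DEADLINE_TICKS: u64 = 90;
    for publisher in &["a", "b"] {
        let delay = rng.next_u64() % 100;
        trace.push(format!("publisher {} delivered at tick {}", publisher, delay));
        if delay > DEADLINE_TICKS {
            return Err(format!("convergence deadline exceeded by publisher {}", publisher));
        }
    }
    Ok(())
}
```

Because the model draws all randomness from the seeded PRNG and uses logical ticks instead of wall-clock time, calling `run_campaign(failure.seed, 1, toy_duplicate_publisher_model)` reproduces the same invariant failure and the same ordered trace.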

## Alternatives considered

- Keep writing production probes only. Rejected because probes prove what happened once, not what
  should happen across many fault schedules.
- Extend the Python node-agent as the simulation oracle. Rejected because the image should get
  thinner and the runtime behavior belongs in the Rust node.

## Rollout/teardown

Roll forward by adding simulation tests next to each new distributed invariant. Roll back by keeping
the production probes; the simulation module is library-only and has no runtime service impact.
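
As an illustration of what "a simulation test next to a new invariant" could look like, the following sketch expresses two of the invariants named in the Decision section (missing sequences and divergent hashes) as an explicit check that returns a report. The record shape, `InvariantReport`, and `check_duplicate_publishers` are hypothetical, and hashes are plain strings here rather than real BLAKE3 digests.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::ops::RangeInclusive;

/// Hypothetical report: each field is one explicit invariant class.
#[derive(Debug, Default, PartialEq, Eq)]
struct InvariantReport {
    /// (publisher, media_sequence) pairs that were expected but never archived.
    missing_sequences: Vec<(String, u64)>,
    /// Media sequences where publishers archived different content hashes.
    divergent_hashes: Vec<u64>,
}

impl InvariantReport {
    fn is_clean(&self) -> bool {
        self.missing_sequences.is_empty() && self.divergent_hashes.is_empty()
    }
}

/// `records` are (publisher, media_sequence, content_hash) tuples.
fn check_duplicate_publishers(
    publishers: &[&str],
    expected: RangeInclusive<u64>,
    records: &[(&str, u64, &str)],
) -> InvariantReport {
    // Index records by sequence, then by publisher.
    let mut by_seq: BTreeMap<u64, BTreeMap<&str, &str>> = BTreeMap::new();
    for &(publisher, seq, hash) in records {
        by_seq.entry(seq).or_default().insert(publisher, hash);
    }
    let mut report = InvariantReport::default();
    for seq in expected {
        let seen = by_seq.get(&seq);
        // Invariant: every publisher archived every expected sequence.
        for &publisher in publishers {
            if seen.map_or(true, |m| !m.contains_key(publisher)) {
                report.missing_sequences.push((publisher.to_string(), seq));
            }
        }
        // Invariant: all publishers that archived a sequence agree on its hash.
        if let Some(m) = seen {
            let distinct: BTreeSet<&str> = m.values().copied().collect();
            if distinct.len() > 1 {
                report.divergent_hashes.push(seq);
            }
        }
    }
    report
}
```

A test placed beside a new invariant would build the records from a simulated schedule and assert `report.is_clean()`, so a failure names the violated invariant class instead of only tripping a final assertion.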