# ECP-0157: Rust Simulation Testing

Status: Draft

## Context

Production is now fast enough to expose distributed bugs quickly, but it is still the wrong first
place to discover scheduler, archive, and duplicate-publisher invariants. The Python node-agent also
made this worse by putting core control behavior outside the already-built Rust node binary.

## Decision

Add a small deterministic simulation layer in `ec-core` and use it for distributed media invariants:

- `ec-node` remains the runtime owner for node behavior.
- Tests model logical time, delayed delivery, backfill, duplicate publishers, and archive
  convergence in Rust.
- Simulation scenarios are seed-replayable and include deterministic jitter, transient drops,
  partition windows, publisher outage/restart windows, backfill retries, and encoder drift faults.
- A failing simulation must print or carry a replay hint so the exact schedule can be rerun.
- Simulation reports include deterministic execution history so a failure has an ordered event trace,
  not only a final assertion.
- Simulation campaigns run many seed schedules in one fast test and preserve the first failing seed,
  invariant report, and final state as the failure artifact.
- Campaign execution has a reusable seeded runner so new models can share replay/failure accounting
  instead of copying bespoke loops.
- First failures are automatically shrunk where the model supports it. For duplicate publishers the
  shrinker removes irrelevant partitions, publisher outages, timing jitter, transient drops, and
  excess media sequence range while keeping the original invariant unchanged.
- Invariants are explicit checks, not implicit test prose: duplicate source count, missing
  sequences, divergent hashes, missing media timing, conflicting media timing, complete duplicate
  coverage, and convergence-deadline budgets.
- Media identity is checked by BLAKE3 hashes for stream, rendition, track, sequence, profile, and
  source-material identity.
- Media timing is part of the proof model. Matching hashes are not considered a complete duplicate
  proof unless both publishers also expose a shared logical media clock for the chunk.
- Source-material identity is separate from stream metadata. Two publishers can advertise the same
  channel, sequence, timing, and encoder profile while still encoding different RF/source windows;
  that must fail in simulation before production archive comparisons burn wall-clock time.
- Publisher-origin archive `group_sequence` is derived from parsed media-time identity plus stable
  track id, not local receive time. Receive time is telemetry; it is not proof that two publishers
  archived the same broadcast moment.
- Live publisher archive proof normalizes fMP4 `tfdt` to the Unix media slot before hashing a
  fragment. The first fragment for each track anchors the process-local media clock to wall-clock
  time; later fragments preserve ffmpeg's media cadence from that origin. ffmpeg still runs with
  wall-clock timestamp input enabled where possible, but the Rust archive writer is the authority
  for the proof clock when source MPEG-TS timestamps are process-relative.
- Archive `group_sequence` includes a stable subfragment slot inside each
  `(track_id, media_sequence)` pair, because audio can legitimately emit multiple fragments within
  one media slot and those must compare in order instead of colliding as source-local divergences.
- Duplicate-publisher scenarios model publisher content phase separately from advertised archive
  sequence. A publisher that starts its local encoder at a different content phase must fail fast in
  simulation, because production fragments with the same local sequence are not proof of the same
  broadcast moment unless the chunk clock is shared.
- `ec-node sim-duplicate-publishers` runs the same campaign model from the compiled Rust binary and
  emits JSON suitable for CI artifacts and rollout gates.
- `ec-node sim-duplicate-publishers --failure-artifact <path>` writes the first failing campaign as
  a replayable JSON artifact with the shrunk scenario, invariant report, event trace, shrink steps,
  and a command hint for replaying `replay_scenario` through `--scenario-json -`.
- `ec-node sim-duplicate-publishers --scenario-json <path-or->` replays an exact serialized
  `DuplicatePublisherScenario`, so a shrunk failure from CI or production investigation can be rerun
  without reconstructing command-line flags.
- `ec-node sim-duplicate-publishers` can inject timing faults directly with
  `--missing-media-timing-publisher NODE` and `--publisher-media-time-offset NODE:OFFSET_MS`, so
  the current production proof class can be reproduced without hand-writing scenario JSON.
- `ec-node sim-duplicate-publishers` and `ec-node sim-system` can inject source-window faults with
  `--publisher-source-material NODE:MATERIAL_ID`. Any campaign with multiple source-material ids
  reports source-material mismatch observations instead of leaving operators to infer that class
  from divergent hashes.
- `ec-node archive-convergence` reads existing archive manifest JSONL and applies the same
  convergence semantics to real duplicate publisher outputs.
- Control-plane simulation models logical nodes, seeded gossip fanout, delivery jitter, transient
  drops, node-specific partitions, node outages, duplicate deliveries, and propagation deadlines.
- `ec-node sim-control-plane` runs the control-plane model from the compiled Rust binary and emits
  replayable JSON with the first failing seed, scenario, invariant report, and ordered trace.
- Control-plane campaign reports track max propagation time, max delivery time, dropped messages,
  partition-delayed messages, outage-delayed messages, and duplicate messages, so production
  rollout measurements have a fast simulation baseline.
- System simulation composes control-plane propagation with duplicate-publisher media production.
  Control gossip produces per-publisher activation times; the media workload then proves that delayed
  schedule propagation still converges when publishers use the global media sequence clock and fails
  when they derive chunk identity from local activation time.
- `ec-node sim-system` runs that composed workload from the deployed node binary. Its default
  campaign models the current publisher topology class and can switch `--sequence-clock` between
  `global` and `local-activation` to reproduce the exact class of duplicate-publisher phase bug
  before waiting for production samples.
- `ec-node sim-system --fault-profile foundationdb` uses a FoundationDB-style fault profile: each
  seed generates a different but replayable cluster schedule with randomized control partitions, node
  outages, transient gossip drops, duplicate messages, media partitions, publisher outages, and
  archive backfill pressure.
- The FoundationDB-style profile must also have an explicit negative regression for
  `local-activation` sequence clocks, so the model proves the current production failure class is
  caught in Rust before any rollout waits for live fragments.
- `ec-node sim-system --failure-artifact <path>` writes the first failing composed system schedule
  as replayable JSON, including the exact control/media scenario, invariant report, ordered trace,
  and command hint for rerunning `--scenario-json -`.
- System campaign reports must include fault coverage counters, not just pass/fail. A fast campaign
  is only useful if it proves that the simulated run actually exercised the failure modes operators
  care about.
- System campaign reports also aggregate publisher phase-offset observations. A production-like
  divergence caused by local activation clocks should identify itself as a phase bug in the campaign
  JSON instead of requiring operators to infer that only from divergent hashes.
- System campaign reports also aggregate source-material mismatch observations. A production-like
  divergence caused by independent tuner/source windows should identify itself as a source-material
  bug in the campaign JSON instead of being confused with codec nondeterminism.
- System and duplicate-publisher reports aggregate missing media-timing records and media-timing
  conflicts, so the live failure class where fragments arrive without a usable media clock is visible
  in fast Rust simulation output.
- FoundationDB-profile `sim-system` campaigns require that coverage by default: control transient
  drops, partition delays, node outage delays, duplicate messages, media transient drops, media
  partition delays, publisher outages, backfill, and observed convergence timing must all appear in
  the campaign report. A campaign that passes invariants but misses these classes is reported as a
  weak simulation, not a green rollout gate.
- FoundationDB-profile coverage is breadth-gated, not only boolean-gated. By default at least
  `max(2, iterations / 32)` seeds must exercise every required distributed fault class; operators
  can raise that floor with `--min-fault-seed-coverage` for longer scientific campaigns.
- Campaign reports track both event totals and seed counts per fault class, plus a bounded list of
  the slowest system schedules with replay hints. This makes green runs inspectable: operators can
  see how broadly the randomized schedule space was exercised and which seeds define the current
  latency tail.
- System campaign reports also aggregate deterministic simulated convergence time and trace event
  counts. `ec-node sim-system` stamps wall-clock execution telemetry around the campaign so a run
  reports iterations per second, simulated system seconds per wall second, and trace events per
  second without putting wall-clock data into the replayed scenario itself.
- `sim-system --failure-artifact <path>` writes an artifact for weak coverage as well as invariant
  failures, so CI can preserve evidence when a campaign was too small or too narrow to exercise the
  required distributed faults.
- Forge `ci-gates` runs the Rust system simulator tests and a 1024-seed
  `sim-system --fault-profile foundationdb` campaign from the compiled `ec-node` binary before web
  build/deploy gates. This keeps the fast randomized check ahead of production rollout evidence.
- Simulation failures must be actionable before any matching production rollout is considered
  healthy.
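
The seeded campaign runner and the breadth gate from the list above can be sketched in a few lines
of Rust. This is a minimal illustration under stated assumptions, not the `ec-core` implementation:
`SplitMix64`, `SeedOutcome`, `FirstFailure`, `CampaignReport`, `run_campaign`, and the fault-class
strings are hypothetical names, and the replay hint is a plain string rather than a real `ec-node`
flag. Only the `max(2, iterations / 32)` floor is taken from this ECP.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// SplitMix64: a tiny deterministic PRNG, so one seed fully determines one schedule.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Result of running one seeded schedule (hypothetical shape).
pub struct SeedOutcome {
    /// Fault classes this seed actually exercised.
    pub faults_exercised: Vec<&'static str>,
    /// Name of the violated invariant, if any.
    pub violated_invariant: Option<String>,
}

pub struct FirstFailure {
    pub seed: u64,
    pub invariant: String,
    pub replay_hint: String,
}

pub struct CampaignReport {
    pub iterations_run: u64,
    pub first_failure: Option<FirstFailure>,
    /// Number of distinct seeds that exercised each fault class.
    pub seeds_per_fault: BTreeMap<&'static str, u64>,
    /// Required fault classes that stayed below the coverage floor.
    pub weak_fault_classes: Vec<&'static str>,
}

/// Default breadth floor from this ECP: max(2, iterations / 32).
pub fn default_min_fault_seed_coverage(iterations: u64) -> u64 {
    (iterations / 32).max(2)
}

/// Runs up to `iterations` seeds and stops at the first invariant violation.
pub fn run_campaign<F>(
    base_seed: u64,
    iterations: u64,
    required_faults: &[&'static str],
    mut run_seed: F,
) -> CampaignReport
where
    F: FnMut(&mut SplitMix64) -> SeedOutcome,
{
    let mut seeds_per_fault: BTreeMap<&'static str, u64> = BTreeMap::new();
    let mut first_failure = None;
    let mut iterations_run = 0;

    for i in 0..iterations {
        let seed = base_seed.wrapping_add(i);
        let outcome = run_seed(&mut SplitMix64::new(seed));
        iterations_run += 1;

        // Count each fault class at most once per seed.
        let distinct: BTreeSet<&'static str> = outcome.faults_exercised.into_iter().collect();
        for fault in distinct {
            *seeds_per_fault.entry(fault).or_insert(0) += 1;
        }

        // Preserve the first failing seed and stop the campaign.
        if let Some(invariant) = outcome.violated_invariant {
            first_failure = Some(FirstFailure {
                seed,
                invariant,
                replay_hint: format!("rerun with seed {}", seed),
            });
            break;
        }
    }

    // Breadth gate: every required class needs at least `floor` seeds.
    let floor = default_min_fault_seed_coverage(iterations);
    let weak_fault_classes = required_faults
        .iter()
        .copied()
        .filter(|fault| seeds_per_fault.get(fault).copied().unwrap_or(0) < floor)
        .collect();

    CampaignReport { iterations_run, first_failure, seeds_per_fault, weak_fault_classes }
}
```

In this sketch a campaign that hits an invariant violation keeps the failing seed in
`first_failure`, while a campaign that passes but leaves a required class under the floor ends with
a non-empty `weak_fault_classes`, which corresponds to the weak-simulation outcome described above.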

## Consequences

We get FoundationDB-style pressure in a much smaller shape: many deterministic failure schedules can
run as normal Rust tests without booting machines. The first media model covers duplicate publisher
convergence, network partitions, transient loss, publisher restart/backfill, convergence latency,
encoder drift, and publisher phase alignment, and the first runtime command applies it to archive
manifests. The first control model covers gossip propagation across relays and nodes under dropped,
delayed, duplicated, partitioned, and outage-delayed control messages. The shrink/replay path makes
supported failures small enough to debug before they become production event archaeology; exact
scenario JSON is the replay contract. Later models can add tuner scheduling, relay cache eviction,
and image rollout state machines. The composed system model is the first workload-level step: it
checks the boundary between control-plane speed and media determinism, which is where production
duplicate publishers are currently most fragile.
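
To make the shrink step concrete, here is a minimal greedy shrinker sketch. It assumes a
hypothetical `Scenario` with a fault list and a media sequence count, and a `check` closure that
returns the name of the violated invariant; none of these names come from `ec-core`, and the real
`DuplicatePublisherScenario` is certainly richer. The sketch only shows the rule stated in this
ECP: a reduction is kept only when the original invariant still fails.

```rust
#[derive(Clone, Debug, PartialEq)]
pub enum Fault {
    Partition,
    PublisherOutage,
    TimingJitter,
    TransientDrop,
    ContentPhaseOffset,
}

#[derive(Clone, Debug, PartialEq)]
pub struct Scenario {
    pub faults: Vec<Fault>,
    /// Number of media sequences the scenario simulates.
    pub media_sequences: u64,
}

/// Greedily shrinks `scenario` while `check` keeps reporting `invariant`.
/// `check` returns the name of the violated invariant, or `None` on success.
/// Returns the shrunk scenario plus a human-readable list of shrink steps.
pub fn shrink<F>(mut scenario: Scenario, invariant: &str, check: F) -> (Scenario, Vec<String>)
where
    F: Fn(&Scenario) -> Option<String>,
{
    let mut steps = Vec::new();

    // Pass 1: try dropping each fault; keep the removal only if the
    // original invariant still fails without it.
    let mut index = 0;
    while index < scenario.faults.len() {
        let mut candidate = scenario.clone();
        let removed = candidate.faults.remove(index);
        if check(&candidate).as_deref() == Some(invariant) {
            steps.push(format!("removed fault {:?}", removed));
            scenario = candidate;
        } else {
            index += 1;
        }
    }

    // Pass 2: halve the media sequence range while the failure reproduces.
    while scenario.media_sequences > 1 {
        let mut candidate = scenario.clone();
        candidate.media_sequences /= 2;
        if check(&candidate).as_deref() != Some(invariant) {
            break;
        }
        steps.push(format!("media_sequences -> {}", candidate.media_sequences));
        scenario = candidate;
    }

    (scenario, steps)
}
```

Because every accepted step re-runs the same check against the same invariant name, the shrunk
scenario fails for the same reason as the original, and the recorded steps can travel with the
failure artifact.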

## Alternatives considered

- Keep writing production probes only. Rejected because probes prove what happened once, not what
  should happen across many fault schedules.
- Extend the Python node-agent as the simulation oracle. Rejected because the image should get
  thinner and the runtime behavior belongs in the Rust node.

## Rollout/teardown

Roll forward by adding simulation tests next to each new distributed invariant. Roll back by keeping
the production probes; the simulation module is library-only and has no runtime service impact.
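
As an illustration of what a simulation test placed next to a new invariant might check, the sketch
below models the `global` versus `local-activation` sequence-clock distinction from the Decision
section. All names (`SequenceClock`, `SLOT_MS`, `publish`, `divergent_sequences`) and the 2-second
slot are assumptions made for this example, and a broadcast slot index stands in for the BLAKE3
content hash.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SequenceClock {
    /// Sequence derived from shared media time.
    Global,
    /// Sequence counted from the publisher's own activation time.
    LocalActivation,
}

/// Hypothetical chunk duration in milliseconds.
pub const SLOT_MS: u64 = 2_000;

/// Maps each advertised sequence to the broadcast slot it actually contains.
/// The slot index stands in for a content hash.
pub fn publish(clock: SequenceClock, activation_ms: u64, until_ms: u64) -> BTreeMap<u64, u64> {
    let mut chunks = BTreeMap::new();
    // First slot boundary at or after activation.
    let mut media_time_ms = ((activation_ms + SLOT_MS - 1) / SLOT_MS) * SLOT_MS;
    while media_time_ms < until_ms {
        let sequence = match clock {
            SequenceClock::Global => media_time_ms / SLOT_MS,
            SequenceClock::LocalActivation => (media_time_ms - activation_ms) / SLOT_MS,
        };
        chunks.insert(sequence, media_time_ms / SLOT_MS);
        media_time_ms += SLOT_MS;
    }
    chunks
}

/// Invariant check: any sequence advertised by both publishers must carry
/// the same content. Returns the sequences that violate this, in order.
pub fn divergent_sequences(
    clock: SequenceClock,
    activation_a_ms: u64,
    activation_b_ms: u64,
    until_ms: u64,
) -> Vec<u64> {
    let a = publish(clock, activation_a_ms, until_ms);
    let b = publish(clock, activation_b_ms, until_ms);
    let mut divergent = Vec::new();
    for (sequence, content) in &a {
        if let Some(other) = b.get(sequence) {
            if other != content {
                divergent.push(*sequence);
            }
        }
    }
    divergent
}
```

With publishers activated at 0 ms and 4000 ms and a 10-second horizon, the `Global` clock produces
no divergent sequences, while `LocalActivation` reports sequences 0, 1, and 2 as divergent. A test
next to the invariant would assert the first result and keep the second as the negative regression.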