Add duplicate publisher determinism proof
parent 5d0f3077d3
commit 91dad67fc2
18 changed files with 21569 additions and 595 deletions
158
evolution/proposals/ECP-0157-rust-simulation-testing.md
Normal file
@@ -0,0 +1,158 @@
# ECP-0157: Rust Simulation Testing

Status: Draft

## Context

Production is now fast enough to expose distributed bugs quickly, but it is still the wrong first
place to discover violations of scheduler, archive, and duplicate-publisher invariants. The Python
node-agent also made this worse by putting core control behavior outside the already-built Rust
node binary.

## Decision

Add a small deterministic simulation layer in `ec-core` and use it for distributed media invariants:

- `ec-node` remains the runtime owner for node behavior.
- Tests model logical time, delayed delivery, backfill, duplicate publishers, and archive
  convergence in Rust.
- Simulation scenarios are seed-replayable and include deterministic jitter, transient drops,
  partition windows, publisher outage/restart windows, backfill retries, and encoder drift faults.
- A failing simulation must print or carry a replay hint so the exact schedule can be rerun.
- Simulation reports include deterministic execution history so a failure has an ordered event trace,
  not only a final assertion.
- Simulation campaigns run many seed schedules in one fast test and preserve the first failing seed,
  invariant report, and final state as the failure artifact.
- Campaign execution has a reusable seeded runner so new models can share replay/failure accounting
  instead of copying bespoke loops.
- First failures are automatically shrunk where the model supports it. For duplicate publishers the
  shrinker removes irrelevant partitions, publisher outages, timing jitter, transient drops, and
  excess media sequence range while keeping the original invariant unchanged.
- Invariants are explicit checks, not implicit test prose: duplicate source count, missing
  sequences, divergent hashes, missing media timing, conflicting media timing, complete duplicate
  coverage, and convergence-deadline budgets.
- Media identity is checked by BLAKE3 hashes for stream, rendition, track, sequence, profile, and
  source-material identity.
- Media timing is part of the proof model. Matching hashes are not considered a complete duplicate
  proof unless both publishers also expose a shared logical media clock for the chunk.
- Source-material identity is separate from stream metadata. Two publishers can advertise the same
  channel, sequence, timing, and encoder profile while still encoding different RF/source windows;
  that must fail in simulation before production archive comparisons burn wall-clock time.
- Publisher-origin archive `group_sequence` is derived from parsed media-time identity plus stable
  track id, not local receive time. Receive time is telemetry; it is not proof that two publishers
  archived the same broadcast moment.
- Live publisher archive proof normalizes fMP4 `tfdt` to the Unix media slot before hashing a
  fragment. The first fragment for each track anchors the process-local media clock to wall-clock
  time; later fragments preserve ffmpeg's media cadence from that origin. ffmpeg still runs with
  wall-clock timestamp input enabled where possible, but the Rust archive writer is the authority
  for the proof clock when source MPEG-TS timestamps are process-relative.
- Archive `group_sequence` includes a stable subfragment slot inside each
  `(track_id, media_sequence)` pair, because audio can legitimately emit multiple fragments within
  one media slot and those must compare in order instead of colliding as source-local divergences.
- Duplicate-publisher scenarios model publisher content phase separately from advertised archive
  sequence. A publisher that starts its local encoder at a different content phase must fail fast in
  simulation, because production fragments with the same local sequence are not proof of the same
  broadcast moment unless the chunk clock is shared.
- `ec-node sim-duplicate-publishers` runs the same campaign model from the compiled Rust binary and
  emits JSON suitable for CI artifacts and rollout gates.
- `ec-node sim-duplicate-publishers --failure-artifact <path>` writes the first failing campaign as
  a replayable JSON artifact with the shrunk scenario, invariant report, event trace, shrink steps,
  and a command hint for replaying `replay_scenario` through `--scenario-json -`.
- `ec-node sim-duplicate-publishers --scenario-json <path-or->` replays an exact serialized
  `DuplicatePublisherScenario`, so a shrunk failure from CI or production investigation can be rerun
  without reconstructing command-line flags.
- `ec-node sim-duplicate-publishers` can inject timing faults directly with
  `--missing-media-timing-publisher NODE` and `--publisher-media-time-offset NODE:OFFSET_MS`, so
  the current production proof class can be reproduced without hand-writing scenario JSON.
- `ec-node sim-duplicate-publishers` and `ec-node sim-system` can inject source-window faults with
  `--publisher-source-material NODE:MATERIAL_ID`. Any campaign with multiple source-material ids
  reports source-material mismatch observations instead of leaving operators to infer that class
  from divergent hashes.
- `ec-node archive-convergence` reads existing archive manifest JSONL and applies the same
  convergence semantics to real duplicate publisher outputs.
- Control-plane simulation models logical nodes, seeded gossip fanout, delivery jitter, transient
  drops, node-specific partitions, node outages, duplicate deliveries, and propagation deadlines.
- `ec-node sim-control-plane` runs the control-plane model from the compiled Rust binary and emits
  replayable JSON with the first failing seed, scenario, invariant report, and ordered trace.
- Control-plane campaign reports track max propagation time, max delivery time, dropped messages,
  partition-delayed messages, outage-delayed messages, and duplicate messages, so prod rollout
  measurements have a fast simulation baseline.
- System simulation composes control-plane propagation with duplicate-publisher media production.
  Control gossip produces per-publisher activation times; the media workload then proves that delayed
  schedule propagation still converges when publishers use the global media sequence clock and fails
  when they derive chunk identity from local activation time.
- `ec-node sim-system` runs that composed workload from the deployed node binary. Its default
  campaign models the current publisher topology class and can switch `--sequence-clock` between
  `global` and `local-activation` to reproduce the exact class of duplicate-publisher phase bug
  before waiting for production samples.
- `ec-node sim-system --fault-profile foundationdb` uses a FoundationDB-style fault profile: each
  seed generates a different but replayable cluster schedule with randomized control partitions, node
  outages, transient gossip drops, duplicate messages, media partitions, publisher outages, and
  archive backfill pressure.
- The FoundationDB-style profile must also have an explicit negative regression for
  `local-activation` sequence clocks, so the model proves the current production failure class is
  caught in Rust before any rollout waits for live fragments.
- `ec-node sim-system --failure-artifact <path>` writes the first failing composed system schedule
  as replayable JSON, including the exact control/media scenario, invariant report, ordered trace,
  and command hint for rerunning `--scenario-json -`.
- System campaign reports must include fault coverage counters, not just pass/fail. A fast campaign
  is only useful if it proves that the simulated run actually exercised the failure modes operators
  care about.
- System campaign reports also aggregate publisher phase-offset observations. A production-like
  divergence caused by local activation clocks should identify itself as a phase bug in the campaign
  JSON instead of requiring operators to infer it from divergent hashes alone.
- System campaign reports also aggregate source-material mismatch observations. A production-like
  divergence caused by independent tuner/source windows should identify itself as a source-material
  bug in the campaign JSON instead of being confused with codec nondeterminism.
- System and duplicate-publisher reports aggregate missing media-timing records and media-timing
  conflicts, so the live failure class where fragments arrive without a usable media clock is visible
  in fast Rust simulation output.
- FoundationDB-profile `sim-system` campaigns require fault coverage by default: control transient
  drops, partition delays, node outage delays, duplicate messages, media transient drops, media
  partition delays, publisher outages, backfill, and observed convergence timing must all appear in
  the campaign report. A campaign that passes invariants but misses any of these classes is reported
  as a weak simulation, not a green rollout gate.
- FoundationDB-profile coverage is breadth-gated, not only boolean-gated. By default at least
  `max(2, iterations / 32)` seeds must exercise every required distributed fault class; operators
  can raise that floor with `--min-fault-seed-coverage` for longer scientific campaigns.
- Campaign reports track both event totals and seed counts per fault class, plus a bounded list of
  the slowest system schedules with replay hints. This makes green runs inspectable: operators can
  see how broadly the randomized schedule space was exercised and which seeds define the current
  latency tail.
- System campaign reports also aggregate deterministic simulated convergence time and trace event
  counts. `ec-node sim-system` stamps wall-clock execution telemetry around the campaign so a run
  reports iterations per second, simulated system seconds per wall second, and trace events per
  second without putting wall-clock data into the replayed scenario itself.
- `sim-system --failure-artifact <path>` writes an artifact for weak coverage as well as invariant
  failures, so CI can preserve evidence when a campaign was too small or too narrow to exercise the
  required distributed faults.
- Forge `ci-gates` runs the Rust system simulator tests and a 1024-seed
  `sim-system --fault-profile foundationdb` campaign from the compiled `ec-node` binary before web
  build/deploy gates. This keeps the fast randomized check ahead of production rollout evidence.
- Simulation failures must be actionable before any matching production rollout is considered
  healthy.
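
The seed-replayable campaign and reusable runner bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the `ec-core` implementation: `SeededRng`, `CampaignFailure`, `CampaignReport`, and `run_campaign` are hypothetical names, SplitMix64 stands in for whatever deterministic source the real runner uses, and the replay hint text is invented for the example.

```rust
/// Minimal SplitMix64 generator: the same seed always yields the same sequence.
/// (Assumption: the real runner may use a different deterministic generator.)
#[derive(Clone, Debug)]
pub struct SeededRng(u64);

impl SeededRng {
    pub fn new(seed: u64) -> Self {
        SeededRng(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Value in `0..bound`; `bound` must be non-zero. Modulo bias is ignored here.
    pub fn below(&mut self, bound: u64) -> u64 {
        self.next_u64() % bound
    }
}

/// Hypothetical failure artifact: the first failing seed plus a replay hint.
#[derive(Debug, PartialEq, Eq)]
pub struct CampaignFailure {
    pub seed: u64,
    pub report: String,
    pub replay_hint: String,
}

#[derive(Debug, PartialEq, Eq)]
pub struct CampaignReport {
    pub iterations_run: u64,
    pub first_failure: Option<CampaignFailure>,
}

/// Runs `iterations` seeds starting at `base_seed` and stops at the first
/// failing one, so every model shares the same replay/failure accounting.
pub fn run_campaign<F>(base_seed: u64, iterations: u64, mut scenario: F) -> CampaignReport
where
    F: FnMut(&mut SeededRng) -> Result<(), String>,
{
    for i in 0..iterations {
        let seed = base_seed.wrapping_add(i);
        let mut rng = SeededRng::new(seed);
        if let Err(report) = scenario(&mut rng) {
            return CampaignReport {
                iterations_run: i + 1,
                first_failure: Some(CampaignFailure {
                    seed,
                    report,
                    // Illustrative wording only, not the real CLI hint.
                    replay_hint: format!("rerun with base seed {} and 1 iteration", seed),
                }),
            };
        }
    }
    CampaignReport { iterations_run: iterations, first_failure: None }
}
```

Because each iteration builds its generator from the seed alone, rerunning a single iteration with the reported seed reproduces the same draws, which is what makes the first failing seed a usable artifact.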
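
The explicit invariant checks listed in the Decision section (missing sequences, divergent hashes, missing and conflicting media timing) might look like the sketch below for two publishers. All names here (`ChunkRecord`, `InvariantReport`, `check_duplicate_publishers`) are hypothetical, the 32-byte array only stands in for a BLAKE3 digest and is never computed, and duplicate source count and convergence-deadline budgets are left out to keep the example short.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One archived chunk as seen by a single publisher (hypothetical shape).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ChunkRecord {
    pub sequence: u64,
    /// Placeholder for the chunk's BLAKE3 digest; only compared, never computed.
    pub hash: [u8; 32],
    /// Shared logical media clock for the chunk, if the publisher exposed one.
    pub media_time_ms: Option<u64>,
}

/// Sequences that violate each invariant, kept sorted for stable output.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct InvariantReport {
    pub missing_sequences: BTreeSet<u64>,
    pub divergent_hashes: BTreeSet<u64>,
    pub missing_media_timing: BTreeSet<u64>,
    pub conflicting_media_timing: BTreeSet<u64>,
}

impl InvariantReport {
    /// Matching hashes alone are not enough: timing must also be present and equal.
    pub fn is_complete_duplicate_proof(&self) -> bool {
        self.missing_sequences.is_empty()
            && self.divergent_hashes.is_empty()
            && self.missing_media_timing.is_empty()
            && self.conflicting_media_timing.is_empty()
    }
}

fn index(records: &[ChunkRecord]) -> BTreeMap<u64, &ChunkRecord> {
    records.iter().map(|r| (r.sequence, r)).collect()
}

/// Compares two publishers' chunk records sequence by sequence.
pub fn check_duplicate_publishers(a: &[ChunkRecord], b: &[ChunkRecord]) -> InvariantReport {
    let a = index(a);
    let b = index(b);
    let all: BTreeSet<u64> = a.keys().chain(b.keys()).copied().collect();
    let mut report = InvariantReport::default();
    for seq in all {
        match (a.get(&seq), b.get(&seq)) {
            (Some(x), Some(y)) => {
                if x.hash != y.hash {
                    report.divergent_hashes.insert(seq);
                }
                match (x.media_time_ms, y.media_time_ms) {
                    (Some(tx), Some(ty)) if tx != ty => {
                        report.conflicting_media_timing.insert(seq);
                    }
                    (Some(_), Some(_)) => {}
                    // At least one side has no usable media clock.
                    _ => {
                        report.missing_media_timing.insert(seq);
                    }
                }
            }
            // Present on only one publisher.
            _ => {
                report.missing_sequences.insert(seq);
            }
        }
    }
    report
}
```

Reporting each class as its own set, rather than a single pass/fail flag, keeps a timing problem from being misread as a hash divergence.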
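
The breadth gate can be made concrete with a small worked example. The default floor `max(2, iterations / 32)` comes from the text above; everything else in this sketch is an assumption: the function and type names are hypothetical, the fault class labels are placeholders, and treating `--min-fault-seed-coverage` as something that can only raise the floor is one reading of "operators can raise that floor".

```rust
use std::collections::BTreeMap;

/// Default floor from the proposal: at least `max(2, iterations / 32)` seeds
/// must exercise every required fault class.
pub fn default_min_fault_seed_coverage(iterations: u64) -> u64 {
    std::cmp::max(2, iterations / 32)
}

#[derive(Debug, PartialEq, Eq)]
pub enum CoverageVerdict {
    Covered,
    /// Required fault classes whose seed count fell below the floor.
    Weak(Vec<String>),
}

/// Classifies a campaign as covered or weak from per-class seed counts.
/// Assumption: an operator override never lowers the default floor.
pub fn judge_coverage(
    iterations: u64,
    floor_override: Option<u64>,
    required: &[&str],
    seeds_per_fault_class: &BTreeMap<String, u64>,
) -> CoverageVerdict {
    let default_floor = default_min_fault_seed_coverage(iterations);
    let floor = floor_override.map_or(default_floor, |o| o.max(default_floor));
    let weak: Vec<String> = required
        .iter()
        .filter(|class| seeds_per_fault_class.get(**class).copied().unwrap_or(0) < floor)
        .map(|class| (*class).to_string())
        .collect();
    if weak.is_empty() {
        CoverageVerdict::Covered
    } else {
        CoverageVerdict::Weak(weak)
    }
}
```

For the 1024-seed campaign mentioned above, the default floor works out to 32 seeds per fault class; for any campaign of fewer than 96 iterations it stays at the minimum of 2.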

## Consequences

We get FoundationDB-style pressure in a much smaller shape: many deterministic failure schedules can
run as normal Rust tests without booting machines. The first media model covers duplicate publisher
convergence, network partitions, transient loss, publisher restart/backfill, convergence latency,
encoder drift, and publisher phase alignment, and the first runtime command applies it to archive
manifests. The first control model covers gossip propagation across relays and nodes under dropped,
delayed, duplicated, partitioned, and outage-delayed control messages. The shrink/replay path makes
supported failures small enough to debug before they become production event archaeology; exact
scenario JSON is the replay contract. Later models can add tuner scheduling, relay cache eviction,
and image rollout state machines. The composed system model is the first workload-level step: it
checks the boundary between control-plane speed and media determinism, which is where production
duplicate publishers are currently most fragile.
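
The boundary between control-plane speed and media determinism can be illustrated with a toy model of the two `--sequence-clock` modes. This is a sketch under assumptions, not the simulator's code: `SequenceClock`, `media_sequence`, and `publishers_agree` are hypothetical names, and a fixed slot length in milliseconds is assumed purely for the arithmetic.

```rust
/// Mirrors the `global` and `local-activation` values of `--sequence-clock`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SequenceClock {
    Global,
    LocalActivation,
}

/// Sequence a publisher assigns to a chunk captured at `media_time_ms`.
/// `slot_ms` must be non-zero. With the local clock, the publisher counts
/// slots from its own activation time instead of the shared media clock.
pub fn media_sequence(
    clock: SequenceClock,
    media_time_ms: u64,
    activation_ms: u64,
    slot_ms: u64,
) -> u64 {
    match clock {
        SequenceClock::Global => media_time_ms / slot_ms,
        SequenceClock::LocalActivation => media_time_ms.saturating_sub(activation_ms) / slot_ms,
    }
}

/// True when every publisher assigns the same sequence to the same media instant.
pub fn publishers_agree(
    clock: SequenceClock,
    activations_ms: &[u64],
    media_time_ms: u64,
    slot_ms: u64,
) -> bool {
    let mut sequences = activations_ms
        .iter()
        .map(|&activation| media_sequence(clock, media_time_ms, activation, slot_ms));
    match sequences.next() {
        Some(first) => sequences.all(|s| s == first),
        None => true,
    }
}
```

With 2000 ms slots and a chunk at media time 10 000 ms, publishers activated at 1000 ms and 3500 ms both compute sequence 5 under the global clock, but compute 4 and 3 under local activation. Gossip delay therefore leaks directly into chunk identity only in the local-activation mode, which is the phase bug the negative regression is meant to catch.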

## Alternatives considered

- Keep writing production probes only. Rejected because probes prove what happened once, not what
  should happen across many fault schedules.
- Extend the Python node-agent as the simulation oracle. Rejected because the image should get
  thinner and the runtime behavior belongs in the Rust node.

## Rollout/teardown

Roll forward by adding simulation tests next to each new distributed invariant. Roll back by keeping
the production probes; the simulation module is library-only and has no runtime service impact.