Add duplicate publisher determinism proof

every.channel 2026-06-10 03:28:55 -07:00
parent 5d0f3077d3
commit 91dad67fc2
18 changed files with 21569 additions and 595 deletions


@@ -0,0 +1,158 @@
# ECP-0157: Rust Simulation Testing
Status: Draft
## Context
Production is now fast enough to expose distributed bugs quickly, but it is still the wrong first
place to discover scheduler, archive, and duplicate-publisher invariants. The Python node-agent also
made this worse by putting core control behavior outside the already-built Rust node binary.
## Decision
Add a small deterministic simulation layer in `ec-core` and use it for distributed media invariants:
- `ec-node` remains the runtime owner for node behavior.
- Tests model logical time, delayed delivery, backfill, duplicate publishers, and archive
convergence in Rust.
- Simulation scenarios are seed-replayable and include deterministic jitter, transient drops,
partition windows, publisher outage/restart windows, backfill retries, and encoder drift faults.
- A failing simulation must print or carry a replay hint so the exact schedule can be rerun.
- Simulation reports include deterministic execution history so a failure has an ordered event trace,
not only a final assertion.
- Simulation campaigns run many seed schedules in one fast test and preserve the first failing seed,
invariant report, and final state as the failure artifact.
- Campaign execution has a reusable seeded runner so new models can share replay/failure accounting
instead of copying bespoke loops.
- First failures are automatically shrunk where the model supports it. For duplicate publishers the
shrinker removes irrelevant partitions, publisher outages, timing jitter, transient drops, and
excess media sequence range while keeping the original invariant unchanged.
- Invariants are explicit checks, not implicit test prose: duplicate source count, missing
sequences, divergent hashes, missing media timing, conflicting media timing, complete duplicate
coverage, and convergence-deadline budgets.
- Media identity is checked by BLAKE3 hashes for stream, rendition, track, sequence, profile, and
source-material identity.
- Media timing is part of the proof model. Matching hashes are not considered a complete duplicate
proof unless both publishers also expose a shared logical media clock for the chunk.
- Source-material identity is separate from stream metadata. Two publishers can advertise the same
channel, sequence, timing, and encoder profile while still encoding different RF/source windows;
that must fail in simulation before production archive comparisons burn wall-clock time.
- Publisher-origin archive `group_sequence` is derived from parsed media-time identity plus stable
track id, not local receive time. Receive time is telemetry; it is not proof that two publishers
archived the same broadcast moment.
- Live publisher archive proof normalizes fMP4 `tfdt` to the Unix media slot before hashing a
fragment. The first fragment for each track anchors the process-local media clock to wall-clock
time; later fragments preserve ffmpeg's media cadence from that origin. ffmpeg still runs with
wall-clock timestamp input enabled where possible, but the Rust archive writer is the authority
for the proof clock when source MPEG-TS timestamps are process-relative.
- Archive `group_sequence` includes a stable subfragment slot inside each `(track_id,
media_sequence)` pair, because audio can legitimately emit multiple fragments within one media
slot and those must compare in order instead of colliding as source-local divergences.
- Duplicate-publisher scenarios model publisher content phase separately from advertised archive
sequence. A publisher that starts its local encoder at a different content phase must fail fast in
simulation, because production fragments with the same local sequence are not proof of the same
broadcast moment unless the chunk clock is shared.
- `ec-node sim-duplicate-publishers` runs the same campaign model from the compiled Rust binary and
emits JSON suitable for CI artifacts and rollout gates.
- `ec-node sim-duplicate-publishers --failure-artifact <path>` writes the first failing campaign as
a replayable JSON artifact with the shrunk scenario, invariant report, event trace, shrink steps,
and a command hint for replaying `replay_scenario` through `--scenario-json -`.
- `ec-node sim-duplicate-publishers --scenario-json <path-or->` replays an exact serialized
`DuplicatePublisherScenario`, so a shrunk failure from CI or production investigation can be rerun
without reconstructing command-line flags.
- `ec-node sim-duplicate-publishers` can inject timing faults directly with
`--missing-media-timing-publisher NODE` and `--publisher-media-time-offset NODE:OFFSET_MS`, so
the current production proof class can be reproduced without hand-writing scenario JSON.
- `ec-node sim-duplicate-publishers` and `ec-node sim-system` can inject source-window faults with
`--publisher-source-material NODE:MATERIAL_ID`. Any campaign with multiple source-material ids
reports source-material mismatch observations instead of leaving operators to infer that class
from divergent hashes.
- `ec-node archive-convergence` reads existing archive manifest JSONL and applies the same
convergence semantics to real duplicate publisher outputs.
- Control-plane simulation models logical nodes, seeded gossip fanout, delivery jitter, transient
drops, node-specific partitions, node outages, duplicate deliveries, and propagation deadlines.
- `ec-node sim-control-plane` runs the control-plane model from the compiled Rust binary and emits
replayable JSON with the first failing seed, scenario, invariant report, and ordered trace.
- Control-plane campaign reports track max propagation time, max delivery time, dropped messages,
  partition-delayed messages, outage-delayed messages, and duplicate messages, so production rollout
  measurements have a fast simulation baseline to compare against.
- System simulation composes control-plane propagation with duplicate-publisher media production.
  Control gossip produces per-publisher activation times. The media workload then proves two things:
  publishers still converge under delayed schedule propagation when they use the global media
  sequence clock, and convergence fails when they derive chunk identity from local activation time.
- `ec-node sim-system` runs that composed workload from the deployed node binary. Its default
  campaign models the current publisher topology class and can switch `--sequence-clock` between
  `global` and `local-activation` to reproduce the exact class of duplicate-publisher phase bug
  without waiting for production samples.
- `ec-node sim-system --fault-profile foundationdb` uses a FoundationDB-style fault profile: each
seed generates a different but replayable cluster schedule with randomized control partitions, node
outages, transient gossip drops, duplicate messages, media partitions, publisher outages, and
archive backfill pressure.
- The FoundationDB-style profile must also have an explicit negative regression for
`local-activation` sequence clocks, so the model proves the current production failure class is
caught in Rust before any rollout waits for live fragments.
- `ec-node sim-system --failure-artifact <path>` writes the first failing composed system schedule
as replayable JSON, including the exact control/media scenario, invariant report, ordered trace,
and command hint for rerunning `--scenario-json -`.
- System campaign reports must include fault coverage counters, not just pass/fail. A fast campaign
is only useful if it proves that the simulated run actually exercised the failure modes operators
care about.
- System campaign reports also aggregate publisher phase-offset observations. A production-like
divergence caused by local activation clocks should identify itself as a phase bug in the campaign
JSON instead of requiring operators to infer that only from divergent hashes.
- System campaign reports also aggregate source-material mismatch observations. A production-like
divergence caused by independent tuner/source windows should identify itself as a source-material
bug in the campaign JSON instead of being confused with codec nondeterminism.
- System and duplicate-publisher reports aggregate missing media-timing records and media-timing
conflicts, so the live failure class where fragments arrive without a usable media clock is visible
in fast Rust simulation output.
- FoundationDB-profile `sim-system` campaigns require fault coverage by default: control transient
  drops, partition delays, node outage delays, duplicate messages, media transient drops, media
  partition delays, publisher outages, backfill, and observed convergence timing must all appear in
  the campaign report. A campaign that passes invariants but misses any of these classes is reported
  as a weak simulation, not a green rollout gate.
- FoundationDB-profile coverage is breadth-gated, not only boolean-gated. By default at least
`max(2, iterations / 32)` seeds must exercise every required distributed fault class; operators
can raise that floor with `--min-fault-seed-coverage` for longer scientific campaigns.
- Campaign reports track both event totals and seed counts per fault class, plus a bounded list of
the slowest system schedules with replay hints. This makes green runs inspectable: operators can
see how broadly the randomized schedule space was exercised and which seeds define the current
latency tail.
- System campaign reports also aggregate deterministic simulated convergence time and trace event
counts. `ec-node sim-system` stamps wall-clock execution telemetry around the campaign so a run
reports iterations per second, simulated system seconds per wall second, and trace events per
second without putting wall-clock data into the replayed scenario itself.
- `sim-system --failure-artifact <path>` writes an artifact for weak coverage as well as invariant
failures, so CI can preserve evidence when a campaign was too small or too narrow to exercise the
required distributed faults.
- Forge `ci-gates` runs the Rust system simulator tests and a 1024-seed
`sim-system --fault-profile foundationdb` campaign from the compiled `ec-node` binary before web
build/deploy gates. This keeps the fast randomized check ahead of production rollout evidence.
- Simulation failures must be actionable before any matching production rollout is considered
healthy.
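
The campaign rules above (seed-replayable schedules, first failing seed preserved, and the
`max(2, iterations / 32)` breadth floor) can be sketched as follows. This is a minimal illustration
and not the `ec-core` implementation: `SplitMix64`, `SeedOutcome`, `Verdict`, `run_campaign`, and the
wording of the replay hint are all hypothetical names chosen for the example, and a real runner would
track one counter per fault class rather than a single flag.

```rust
/// SplitMix64: a tiny deterministic PRNG, so one seed fully determines one schedule.
/// (Assumption: the real runner may use a different generator.)
#[derive(Clone, Debug)]
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// What one simulated schedule reports back (hypothetical shape).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SeedOutcome {
    /// First invariant violation observed, if any.
    pub violation: Option<String>,
    /// Whether this seed actually exercised the required fault class.
    pub exercised_fault: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Green,
    /// The first failing seed is preserved together with a replay hint.
    Failed { seed: u64, violation: String, replay_hint: String },
    /// Invariants held, but too few seeds exercised the fault class.
    WeakCoverage { seeds_with_fault: u64, required: u64 },
}

/// Default breadth floor from the decision text: max(2, iterations / 32).
pub fn min_fault_seed_coverage(iterations: u64) -> u64 {
    std::cmp::max(2, iterations / 32)
}

/// Runs `iterations` schedules, one per seed, and stops at the first failure.
pub fn run_campaign<F>(base_seed: u64, iterations: u64, mut model: F) -> Verdict
where
    F: FnMut(&mut SplitMix64) -> SeedOutcome,
{
    let mut seeds_with_fault = 0;
    for i in 0..iterations {
        let seed = base_seed.wrapping_add(i);
        let mut rng = SplitMix64::new(seed);
        let outcome = model(&mut rng);
        if outcome.exercised_fault {
            seeds_with_fault += 1;
        }
        if let Some(violation) = outcome.violation {
            let replay_hint = format!("replay seed {}", seed);
            return Verdict::Failed { seed, violation, replay_hint };
        }
    }
    let required = min_fault_seed_coverage(iterations);
    if seeds_with_fault < required {
        Verdict::WeakCoverage { seeds_with_fault, required }
    } else {
        Verdict::Green
    }
}
```

Because the model only sees the seeded generator, rerunning the same seed reproduces the same
schedule. With the default floor, a 1024-seed campaign needs at least 32 seeds that exercise the
fault class before it counts as green.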
## Consequences
We get FoundationDB-style pressure in a much smaller shape: many deterministic failure schedules can
run as normal Rust tests without booting machines. The first media model covers duplicate publisher
convergence, network partitions, transient loss, publisher restart/backfill, convergence latency,
encoder drift, and publisher phase alignment, and the first runtime command applies it to archive
manifests. The first control model covers gossip propagation across relays and nodes under dropped,
delayed, duplicated, partitioned, and outage-delayed control messages. The shrink/replay path makes
supported failures small enough to debug before they become production event archaeology; exact
scenario JSON is the replay contract. Later models can add tuner scheduling, relay cache eviction,
and image rollout state machines. The composed system model is the first workload-level step: it
checks the boundary between control-plane speed and media determinism, which is where production
duplicate publishers are currently most fragile.
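
The shrink step can be pictured as a greedy loop that drops one fault at a time and keeps the
smaller scenario only when the same invariant still fails. The sketch below is an assumption about
the general technique, not the actual shrinker: `Fault`, `Scenario`, and `shrink` are hypothetical
names, and the real `DuplicatePublisherScenario` also shrinks timing jitter and media sequence
range, which this example leaves out.

```rust
/// Hypothetical fault kinds; the names mirror the prose, not the real `ec-core` types.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Fault {
    Partition { start_ms: u64, end_ms: u64 },
    PublisherOutage { node: u32 },
    TransientDrop { sequence: u64 },
    PhaseOffset { node: u32, offset_ms: i64 },
}

/// Hypothetical scenario: a seed plus the faults injected into the schedule.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Scenario {
    pub seed: u64,
    pub faults: Vec<Fault>,
}

/// Single greedy pass: remove one fault, keep the candidate only if `still_fails`
/// reports that the original invariant is still violated. Returns the shrunk
/// scenario and a human-readable list of shrink steps. The result is smaller,
/// but a single pass does not guarantee it is minimal.
pub fn shrink<F>(original: &Scenario, still_fails: F) -> (Scenario, Vec<String>)
where
    F: Fn(&Scenario) -> bool,
{
    let mut current = original.clone();
    let mut steps = Vec::new();
    let mut i = 0;
    while i < current.faults.len() {
        let mut candidate = current.clone();
        let removed = candidate.faults.remove(i);
        if still_fails(&candidate) {
            // The fault was irrelevant; index `i` now points at the next fault.
            steps.push(format!("removed {:?}", removed));
            current = candidate;
        } else {
            // The fault is needed to reproduce the failure; keep it and move on.
            i += 1;
        }
    }
    (current, steps)
}
```

The invariant predicate is passed in unchanged, which matches the requirement that shrinking must
not alter the invariant being checked. The shrunk scenario and the recorded steps are the kind of
data a failure artifact would carry.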
## Alternatives considered
- Keep writing production probes only. Rejected because probes prove what happened once, not what
should happen across many fault schedules.
- Extend the Python node-agent as the simulation oracle. Rejected because the image should get
thinner and the runtime behavior belongs in the Rust node.
## Rollout/teardown
Roll forward by adding simulation tests next to each new distributed invariant. Roll back by keeping
the production probes; the simulation module is library-only and has no runtime service impact.
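
As an illustration of what an explicit invariant check placed next to a simulation test might look
like, the sketch below compares two publishers' archives per sequence. It is a simplified
assumption rather than the `ec-core` code: `ChunkRecord`, `Violation`, `Archive`, and
`check_duplicate_publishers` are hypothetical names, the 32-byte array stands in for a BLAKE3
digest, and a single `media_time_ms` field stands in for the shared logical media clock.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One archived chunk as seen by one publisher (hypothetical shape).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ChunkRecord {
    /// Stand-in for the BLAKE3 digest of the chunk identity.
    pub hash: [u8; 32],
    /// Logical media time; `None` models a fragment without a usable media clock.
    pub media_time_ms: Option<u64>,
}

/// Explicit invariant violations, mirroring the classes named in the decision.
#[derive(Debug, PartialEq, Eq)]
pub enum Violation {
    MissingSequence { sequence: u64, publisher: &'static str },
    DivergentHash { sequence: u64 },
    MissingMediaTiming { sequence: u64 },
    ConflictingMediaTiming { sequence: u64 },
}

/// Archive keyed by media sequence, ordered so reports are deterministic.
pub type Archive = BTreeMap<u64, ChunkRecord>;

/// Compares two publisher archives and returns violations in ascending sequence order.
pub fn check_duplicate_publishers(a: &Archive, b: &Archive) -> Vec<Violation> {
    let mut out = Vec::new();
    let sequences: BTreeSet<u64> = a.keys().chain(b.keys()).copied().collect();
    for sequence in sequences {
        match (a.get(&sequence), b.get(&sequence)) {
            (Some(x), Some(y)) => {
                if x.hash != y.hash {
                    out.push(Violation::DivergentHash { sequence });
                }
                // Matching hashes alone are not a complete proof: both sides
                // must also expose the same media time for the chunk.
                match (x.media_time_ms, y.media_time_ms) {
                    (Some(tx), Some(ty)) if tx != ty => {
                        out.push(Violation::ConflictingMediaTiming { sequence })
                    }
                    (Some(_), Some(_)) => {}
                    _ => out.push(Violation::MissingMediaTiming { sequence }),
                }
            }
            (Some(_), None) => out.push(Violation::MissingSequence { sequence, publisher: "b" }),
            (None, Some(_)) => out.push(Violation::MissingSequence { sequence, publisher: "a" }),
            (None, None) => unreachable!("sequence came from one of the two archives"),
        }
    }
    out
}
```

A simulation test can then assert that the returned list is empty for a converged run, or that it
contains exactly the expected violation class for a negative regression.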