every.channel/evolution/proposals/ECP-0122-publisher-source-locks-and-cgroup-cleanup.md
Conrad Kramer cfc4902016
Harden LA publishers and add multi-relay guide
2026-06-10 01:28:15 -07:00

# ECP-0122: Publisher Source Locks And Cgroup Cleanup
Status: Draft
## Problem statement
LA channels disappeared when stale proof/archive publisher helpers kept HDHomeRun tuner HTTP streams
open after the managed publishers restarted. The restarted publishers received `503 Service Unavailable`
from the tuners and stopped refreshing the public stream directory, so the guide entries expired and
the guide went empty.
## Constraints
- A publisher restart must not leave child media processes holding tuners.
- A duplicate publisher on the same node must not open the same physical source URL.
- Keep rollback simple and deployment-owned; normal recovery should not require source-device firmware
changes or a manual tuner reset.
## Decision
The NixOS publisher wrapper now takes a non-blocking per-source lock under
`/run/every-channel/source-locks` before launching `ec-node`. If another managed publisher on the
same node is already reading that input URL, the duplicate launch logs the conflict and skips the
launch instead of opening a second tuner stream.
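
The lock-then-launch behaviour can be illustrated with a minimal Python sketch. It is not the wrapper's actual code: the function names, the choice of `flock`, and the scheme of hashing the source URL into a file name are assumptions made for illustration. Only the lock directory comes from this proposal.

```python
import fcntl
import hashlib
import os
from typing import Optional

LOCK_DIR = "/run/every-channel/source-locks"  # directory named in this proposal


def lock_path(source_url: str, lock_dir: str = LOCK_DIR) -> str:
    # Hypothetical naming scheme: hash the URL so it is safe as a file name.
    digest = hashlib.sha256(source_url.encode("utf-8")).hexdigest()
    return os.path.join(lock_dir, digest + ".lock")


def try_lock_source(source_url: str, lock_dir: str = LOCK_DIR) -> Optional[int]:
    """Return an open fd that holds the lock, or None if the source is busy."""
    os.makedirs(lock_dir, exist_ok=True)
    fd = os.open(lock_path(source_url, lock_dir), os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Another holder already has this source: log and skip in the caller.
        os.close(fd)
        return None
    return fd
```

A caller would keep the returned descriptor open for as long as the publisher reads the source. Because `flock` locks are released when the descriptor is closed or the holding process exits, a crashed holder does not leave a stale lock behind.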
Publisher and archive worker services also set explicit `KillMode=control-group`,
`TimeoutStopSec=10s`, and `SendSIGKILL=true`, and archive auto-workers terminate tracked children on
shutdown before systemd's cgroup cleanup runs. The async `wt-publish` and `nbc-wt-publish` ffmpeg
children are marked kill-on-drop so cancelled Rust futures do not strand encoder children.
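
The "terminate tracked children on shutdown" step can be sketched in Python as follows. This is an analogy, not the project's Rust implementation: `ChildTracker`, its methods, and the grace period are hypothetical names and values. The context-manager exit plays a role similar to kill-on-drop, in that leaving the scope for any reason stops the children.

```python
import subprocess
from typing import List


class ChildTracker:
    """Remembers spawned children so they can all be stopped on shutdown."""

    def __init__(self) -> None:
        self.children: List[subprocess.Popen] = []

    def spawn(self, argv: List[str]) -> subprocess.Popen:
        child = subprocess.Popen(argv)
        self.children.append(child)
        return child

    def shutdown(self, grace_seconds: float = 5.0) -> None:
        # Ask every live child to exit first (SIGTERM on POSIX).
        for child in self.children:
            if child.poll() is None:
                child.terminate()
        # Wait up to grace_seconds per child, then force-kill stragglers.
        for child in self.children:
            try:
                child.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                child.kill()
                child.wait()
        self.children.clear()

    def __enter__(self) -> "ChildTracker":
        return self

    def __exit__(self, *exc_info) -> None:
        # Runs on normal exit and on exceptions, so children are not stranded.
        self.shutdown()
```

In this sketch the in-process cleanup is the first line of defence; the systemd cgroup settings above remain the backstop for anything the worker fails to track.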
## Alternatives considered
- Rely on operator cleanup only. Rejected because the failure silently empties the public guide after
TTL expiry.
- Run duplicate publishers for redundancy. Rejected because OTA tuner capacity is the scarce resource;
redundancy should happen after one source read, via publisher fanout and relay mirroring.
- Add only systemd cgroup cleanup. Rejected because it does not prevent two managed units from
intentionally opening the same source at the same time.
## Rollout / teardown plan
Deploy the NixOS module update to every publisher node. Confirm no stale proof/archive helpers remain,
all managed publisher units are active, and `/api/public-streams` lists the expected channels.
Rollback is reverting this module change and redeploying; source locks are runtime files under `/run`
and disappear on reboot.