every.channel/evolution/proposals/ECP-0122-publisher-source-locks-and-cgroup-cleanup.md
Conrad Kramer cfc4902016
Harden LA publishers and add multi-relay guide
2026-06-10 01:28:15 -07:00

ECP-0122: Publisher Source Locks And Cgroup Cleanup

Status: Draft

Problem statement

LA channels disappeared when stale proof/archive publisher helpers kept HDHomeRun tuner HTTP streams open after the managed publishers restarted. The restarted publishers received 503 Service Unavailable from the tuners and stopped refreshing the public stream directory; once the existing entries expired, the guide was empty.

Constraints

  • A publisher restart must not leave child media processes holding tuners.
  • A duplicate publisher on the same node must not open the same physical source URL.
  • Keep rollback simple and deployment-owned; no source-device firmware or manual tuner reset should be required for normal recovery.

Decision

The NixOS publisher wrapper now takes a non-blocking per-source lock under /run/every-channel/source-locks before launching ec-node. If another managed publisher on the same node is already reading that input URL, the duplicate launch logs the conflict and skips the launch instead of opening a second tuner stream.
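
The locking behavior described above can be sketched in Python as a rough illustration. This is not the wrapper's actual code: the function names, the hashed lock file naming, and the use of flock are all assumptions made for the example, and the lock directory is passed in rather than hard-coded.

```python
import fcntl
import hashlib
import os


def lock_path_for(source_url: str, lock_dir: str) -> str:
    """Map a source URL to a stable lock file path (hypothetical naming scheme)."""
    digest = hashlib.sha256(source_url.encode("utf-8")).hexdigest()[:16]
    return os.path.join(lock_dir, f"{digest}.lock")


def try_acquire_source_lock(source_url: str, lock_dir: str):
    """Return an open file that holds the lock, or None if the source is taken."""
    os.makedirs(lock_dir, exist_ok=True)
    handle = open(lock_path_for(source_url, lock_dir), "a")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the holder.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Another open file description already holds this source: skip the launch.
        handle.close()
        return None
    # The caller must keep this handle open for as long as it reads the source.
    return handle
```

A launcher built this way would call try_acquire_source_lock before starting the publisher, log and exit when it returns None, and otherwise keep the returned handle open for the publisher's lifetime. Because the kernel drops an flock once every descriptor for that open file is closed, the lock is released when the holder and any children that inherited the descriptor have exited, so a crashed publisher leaves no lock that needs manual cleanup.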

Publisher and archive worker services also set explicit KillMode=control-group, TimeoutStopSec=10s, and SendSIGKILL=true, and archive auto-workers terminate tracked children on shutdown before systemd's cgroup cleanup runs. The async wt-publish and nbc-wt-publish ffmpeg children are marked kill-on-drop so cancelled Rust futures do not strand encoder children.
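
As a loose illustration of the "terminate tracked children on shutdown" pattern, the Python sketch below tracks spawned processes and stops them when the worker exits. It is an analogy only: the ChildTracker class, its method names, and the grace period are hypothetical, and the real helpers are Rust services that rely on kill-on-drop and systemd's cgroup cleanup rather than this code.

```python
import subprocess
import sys
import time


class ChildTracker:
    """Track spawned children and stop all of them when the worker shuts down."""

    def __init__(self, grace_seconds: float = 5.0):
        self.grace_seconds = grace_seconds
        self.children = []

    def spawn(self, argv):
        child = subprocess.Popen(argv)
        self.children.append(child)
        return child

    def shutdown(self):
        # Ask every live child to exit first (SIGTERM on POSIX).
        for child in self.children:
            if child.poll() is None:
                child.terminate()
        # All children share one deadline; anything still running is force-killed.
        deadline = time.monotonic() + self.grace_seconds
        for child in self.children:
            remaining = max(0.0, deadline - time.monotonic())
            try:
                child.wait(timeout=remaining)
            except subprocess.TimeoutExpired:
                child.kill()
                child.wait()
        self.children.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs on normal exit and when an exception unwinds the worker.
        self.shutdown()
        return False
```

Used as a context manager, the tracker stops its children even when the worker body raises, which is the same guarantee the proposal wants before the service manager's own cleanup takes over. In-process cleanup like this cannot run if the worker itself is killed with SIGKILL, which is one reason to keep the cgroup-level cleanup as a backstop.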

Alternatives considered

  • Rely on operator cleanup only. Rejected because the failure silently empties the public guide after TTL expiry.
  • Run duplicate publishers for redundancy. Rejected because OTA tuner capacity is the scarce resource; redundancy should happen after one source read, via publisher fanout and relay mirroring.
  • Add only systemd cgroup cleanup. Rejected because it does not prevent two managed units from intentionally opening the same source at the same time.

Rollout / teardown plan

Deploy the NixOS module update to every publisher node. Confirm no stale proof/archive helpers remain, all managed publisher units are active, and /api/public-streams lists the expected channels. Rollback is reverting this module change and redeploying; source locks are runtime files under /run and disappear on reboot.