Wire HDHomeRun observations and recover Forge OP Stack
parent 8065860449
commit 0d86104762
18 changed files with 1613 additions and 58 deletions
@ -0,0 +1,56 @@
# ECP-0109: Local HDHomeRun publishers submit observation rail commitments

Status: Draft

## Problem / context

`ecp-forge` has the Ethereum / OP Stack direction and observation ledger contracts, while local
nodes have the HDHomeRun tuners and can already produce verified manifests. The missing bridge is a
publisher path that can run on the local LAN, observe real tuner-derived epochs, and submit compact
observation headers to the remote chain without moving media bytes on chain.

## Decision

Add an optional observation-rail sink to `ec-node moq-publish`:

- each published manifest epoch can become one `EveryChannelObservationLedger.ObservationHeader`,
- `streamHash` is `keccak256(stream_id)`,
- `epochHash` is `keccak256(epoch_id)`,
- `dataRoot` is the manifest's Ethereum data-root commitment,
- `locatorHash` commits to a compact JSON locator for the manifest and MoQ broadcast,
- `observedUnixMs` and `sequence` come from the manifest body, and
- submission uses a configured RPC URL, ledger address, and witness private key.

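
The header fields listed above can be sketched as a pure derivation. This is a minimal illustration in Python, not the `ec-eth` implementation: the manifest field names (`eth_data_root`, `body`), the UTF-8 encoding of `stream_id` and `epoch_id`, and the sorted, whitespace-free JSON used for the locator are all assumptions. The Keccak-256 function is passed in by the caller, because Python's `hashlib.sha3_256` is NIST SHA-3 and does not produce Ethereum's `keccak256` output.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Caller-supplied Keccak-256: bytes in, 32 bytes out.
Hash32 = Callable[[bytes], bytes]


@dataclass(frozen=True)
class ObservationHeader:
    stream_hash: bytes
    epoch_hash: bytes
    data_root: bytes
    locator_hash: bytes
    observed_unix_ms: int
    sequence: int


def compact_json(value: dict) -> bytes:
    """Serialize with sorted keys and no whitespace (assumed canonical form)."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def derive_header(manifest: dict, locator: dict, keccak256: Hash32) -> ObservationHeader:
    """Map one manifest epoch to one observation header.

    Field names on `manifest` are hypothetical placeholders.
    """
    body = manifest["body"]
    return ObservationHeader(
        stream_hash=keccak256(body["stream_id"].encode("utf-8")),
        epoch_hash=keccak256(body["epoch_id"].encode("utf-8")),
        data_root=bytes.fromhex(manifest["eth_data_root"].removeprefix("0x")),
        locator_hash=keccak256(compact_json(locator)),
        observed_unix_ms=int(body["observed_unix_ms"]),
        sequence=int(body["sequence"]),
    )
```
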
The sink is disabled unless explicitly configured. It is intended for a local publisher talking to
the remote every.channel chain through the remote host's local-only RPC surface, typically via an
SSH tunnel. The OP Stack L2 RPC uses a distinct local port from the full Ethereum nodes on the same
host so publisher submissions do not accidentally target mainnet or Sepolia L1 RPC.

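
One way to read "disabled unless explicitly configured" is sketched below in Python. It is an illustration only: the environment variable names are hypothetical, and treating a partial configuration as an error (rather than silently disabling the sink) is an assumed design choice, not something this ECP specifies.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ObservationSink:
    rpc_url: str
    ledger_address: str
    witness_key: str


# Hypothetical environment fallbacks; the real names are not given in this ECP.
ENV_NAMES = {
    "rpc_url": "EC_OBSERVATION_RPC_URL",
    "ledger_address": "EC_OBSERVATION_LEDGER_ADDRESS",
    "witness_key": "EC_OBSERVATION_WITNESS_KEY",
}


def resolve_sink(
    flags: Mapping[str, Optional[str]],
    env: Mapping[str, str] = os.environ,
) -> Optional[ObservationSink]:
    """Return a sink config, or None when nothing was configured."""
    values = {}
    for field, env_name in ENV_NAMES.items():
        # A command-line flag wins over its environment fallback.
        value = flags.get(field) or env.get(env_name)
        if value:
            values[field] = value
    if not values:
        return None  # sink stays disabled
    missing = sorted(set(ENV_NAMES) - set(values))
    if missing:
        raise ValueError("observation sink partially configured; missing: " + ", ".join(missing))
    return ObservationSink(**values)
```

With no flags and no environment values, `resolve_sink({}, {})` returns `None`, so manifest publication proceeds exactly as it did before the sink existed.
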
## Consequences

- Local HDHomeRun boxes can act as reality witnesses without running the full chain locally.
- The chain stores compact observation commitments only; media segments and full manifests remain
  on MoQ / iroh / archive storage.
- The first implementation uses Foundry `cast` for transaction submission so the repo can validate
  end-to-end with Anvil before committing to an embedded Rust transaction signer.
- A quorum greater than one still requires additional witnesses to attest; the local publisher only
  proposes and self-attests when the configured key is a registry witness.

## Alternatives considered

- Run the full chain locally next to the HDHomeRuns. Rejected because the desired validation target
  is the remote every.channel chain, and a local chain would hide remote reachability/configuration
  failures.
- Push full media or manifests on chain. Rejected because the observation rail only needs compact
  commitments and locators.
- Add an embedded Rust transaction signer immediately. Deferred until the end-to-end rail proves
  useful with Foundry tooling.

## Rollout / teardown

1. Add manifest-to-observation derivation in `ec-eth`.
2. Add optional `ec-node moq-publish` flags and environment fallbacks for observation submission.
3. Add an ignored HDHomeRun + Anvil E2E test and a wrapper script.
4. Point local publishers at the remote RPC once the remote chain is reachable.

Teardown is simply disabling the observation options; local manifest publication remains unchanged.
@ -0,0 +1,50 @@
# ECP-0110: `ecp-forge` Hetzner Robot recovery wrapper

Status: Draft

## Problem / context

`git.every.channel` is a single dedicated Hetzner host. When SSH and HTTPS are both unreachable,
the blockchain and Forgejo validation path stalls before repo-owned deployment tools can connect.
Robot can recover the host, but browser-only recovery is hard to repeat and easy to lose across
agent handoffs.

## Decision

Add a repo-local Robot wrapper for `ecp-forge` recovery:

- default to server `2800441` / `95.216.114.54`,
- read Robot Webservice credentials from environment variables or the existing 1Password item at
  runtime,
- avoid storing Robot passwords in git or shell profiles,
- expose explicit status, rescue, reset, recover, and reachability-probe commands, and
- mask Robot-generated rescue passwords unless the operator explicitly opts into printing them.

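
The masking rule in the last bullet can be illustrated with a short Python sketch. The wrapper itself is a shell script, so this only shows the intended behaviour: the `password` key and the shape of the result dictionary are assumptions, and the mask has a fixed width so that it does not reveal the password length.

```python
MASK = "********"


def mask_secret(secret: str, reveal: bool = False) -> str:
    """Return the secret only when the operator explicitly opted in."""
    if reveal:
        return secret
    return MASK if secret else ""


def printable_rescue_result(result: dict, reveal: bool = False) -> dict:
    """Copy a rescue result for display, masking the generated password.

    The "password" key is a hypothetical field name for this sketch.
    """
    printable = dict(result)  # shallow copy; the caller's dict is untouched
    if printable.get("password") is not None:
        printable["password"] = mask_secret(printable["password"], reveal)
    return printable
```
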
The wrapper treats rescue activation and reset as operational recovery steps, not deployment. Once
the host is reachable again, `scripts/deploy-ecp-forge.sh` remains the source of truth for the
NixOS system state.

## Consequences

- Future agents can recover the Forge after a local 1Password CLI sign-in without asking for pasted
  Robot secrets.
- The host identity and Robot server number are documented in the repo instead of being rediscovered
  from the browser UI.
- Recovery actions remain explicit commands; ordinary probes never mutate Robot state.

## Alternatives considered

- Continue browser-only Robot recovery. Rejected because it is too stateful for repeated agent
  handoffs and does not leave a repo-owned runbook.
- Store Robot credentials in a repo-local file. Rejected because Robot credentials are operational
  secrets and should stay in 1Password or the caller's environment.
- Move recovery into the deploy script. Rejected because Robot rescue/reset is a host-recovery action,
  while `deploy-ecp-forge.sh` should remain the NixOS deployment entrypoint.

## Rollout / teardown

1. Add `scripts/hetzner-robot-forge.sh`.
2. Document the emergency path in `docs/DEPLOY_ECP_FORGE.md`.
3. Use `probe` first, then `status`, then `recover` only when the Forge is unreachable.

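
The ordering in step 3 can be sketched as a read-only probe followed by a decision. This Python illustration is not the wrapper's code: the function names are hypothetical, and the handling of a partially reachable host (check `status`, do not `recover`) is an assumption drawn from the problem statement, which describes recovery for the case where SSH and HTTPS are both unreachable.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Read-only probe: can a TCP connection be opened to host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refusals, timeouts, and DNS failures
        return False


def next_step(ssh_ok: bool, https_ok: bool) -> str:
    """Pick the next wrapper command from the probe results."""
    if ssh_ok and https_ok:
        return "none"  # Forge is reachable; nothing to recover
    if ssh_ok or https_ok:
        return "status"  # partial outage: inspect, do not reset
    return "status-then-recover"  # fully unreachable: recovery is justified


def plan(host: str) -> str:
    """Probe SSH (22) and HTTPS (443), then decide; never mutates Robot state."""
    return next_step(tcp_reachable(host, 22), tcp_reachable(host, 443))
```
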
Teardown is removing the wrapper and returning to browser-only Robot operations.
@ -0,0 +1,46 @@
# ECP-0111: Disable Host Mullvad for Forge Public Recovery

Status: Draft

## Problem / context

`git.every.channel` must stay reachable on public SSH and HTTPS so blockchain validation, deploys,
and Forgejo review can proceed. The current `ecp-forge` boot reaches Forgejo, Caddy, and SSH socket
activation, but the host becomes unreachable once the host-wide Mullvad daemon connects and applies
its firewall policy.

## Decision

Disable host-wide Mullvad on `ecp-forge` and stop making forge NBC workers wait for host Mullvad.
The public Forge host stays on the Hetzner interface. NBC egress that needs Mullvad should be
reintroduced through a process-scoped or namespace-scoped design that does not install a host-wide
kill switch.

## Consequences

- `git.every.channel` can serve SSH, HTTPS, and ACME challenges on the public Hetzner address.
- Forge recovery no longer depends on manual Mullvad split-tunnel state.
- Forge NBC Philadelphia publishing loses the host-wide Mullvad egress assumption until a narrower
  worker-only egress path lands.

## Alternatives considered

- Keep host-wide Mullvad and rely on split-tunnel exceptions. Rejected because production logs show
  public SSH and HTTPS time out while Mullvad's firewall policy is active.
- Keep Mullvad enabled but mask only Caddy or SSH from the tunnel. Rejected because the daemon's
  firewall policy still governs inbound public reachability at the host level.
- Disable the whole `ec-node` service. Rejected because archive and blockchain workers should remain
  independent of the NBC egress incident.

## Rollout / teardown

1. From Rescue, inspect the previous boot and confirm Forgejo/Caddy start before Mullvad applies its
   firewall policy.
2. If Mullvad rewrites its cached target state back to `secured`, temporarily append
   `systemd.mask=mullvad-daemon.service systemd.mask=mullvad-early-boot-blocking.service` to the
   default GRUB entry and reboot production.
3. Deploy the NixOS config that keeps host-wide Mullvad disabled, which regenerates the bootloader
   without the emergency mask.
4. Verify `ssh`, `https://git.every.channel/`, Forgejo, and Caddy.

Teardown is re-enabling host Mullvad only after a tested design preserves public inbound Forge
traffic.
@ -0,0 +1,39 @@
# ECP-0112: Match Nested OP Deployer Intent Schema

Status: Draft

## Problem / context

`ecp-forge` OP Stack bootstrap failed with `missing key id` even though
`/var/lib/every-channel/op-stack/deployer/.deployer/intent.toml` contained an `id` field. After that
was repaired, bootstrap also found a placeholder `state.json` whose deployment fields were still
null. The current `op-deployer` intent format writes chain and role values under nested TOML
sections, while the bootstrap helper only matched keys at the start of a line and treated any
`state.json` as completed state.

## Decision

Update the OP Stack bootstrap helper to replace TOML keys after optional indentation, preserve that
indentation when writing the replacement value, and run `op-deployer apply` unless the state file has
non-null applied deployment fields.

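
The indentation-preserving replacement can be sketched with a line-anchored regular expression. This is a Python illustration of the matching rule, not the bootstrap helper itself: the function name and the sample intent fragment are hypothetical, and the approach is deliberately line-oriented rather than a real TOML parser, so it rewrites every line whose key matches regardless of which table it sits in.

```python
import re


def set_toml_key(text: str, key: str, value: str) -> tuple[str, int]:
    """Replace `key = ...` lines, keeping each line's leading indentation.

    `value` must already be a TOML literal, for example '"0x1a4"' or '42'.
    Returns the new text and the number of lines replaced.
    """
    # Optional spaces or tabs are captured so they can be written back.
    pattern = re.compile(rf"^([ \t]*){re.escape(key)}[ \t]*=.*$", re.MULTILINE)
    # A function replacement avoids backslash processing in `value`.
    return pattern.subn(lambda m: f"{m.group(1)}{key} = {value}", text)


# Hypothetical nested intent fragment; not copied from a real deployment.
INTENT = """configType = "custom"
[[chains]]
  id = ""
  [chains.roles]
    batcher = ""
"""

updated, count = set_toml_key(INTENT, "id", '"0x1a4"')
```

Here `count` is 1 and the `id` line keeps its two-space indent. A count of 0 tells the caller that the key is genuinely absent, which separates a real `missing key id` condition from a key that was present but indented.
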
## Consequences

- The existing `op-deployer/v0.6.0-rc.3` intent file can be repaired in place.
- The bootstrap service can generate sequencer, batcher, proposer, challenger, and dispute monitor
  runtime config from the existing deployment state.
- Placeholder `state.json` files no longer block the apply step.
- The change stays compatible with flat TOML keys if `op-deployer` changes the layout again.

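
The placeholder check behind the third consequence can be sketched as follows. This Python illustration assumes hypothetical field names in `REQUIRED_FIELDS`; the actual keys that the helper inspects in `state.json` are not listed in this ECP.

```python
import json

# Hypothetical names for the "applied deployment fields"; adjust to the real schema.
REQUIRED_FIELDS = ("appliedIntent", "superchainDeployment", "opChainDeployments")


def deployment_applied(state_text: str, required=REQUIRED_FIELDS) -> bool:
    """True only when every required field is present and non-empty.

    A missing file body, invalid JSON, or null fields all count as "not applied",
    so the caller still runs `op-deployer apply`.
    """
    try:
        state = json.loads(state_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(state, dict):
        return False
    return all(state.get(name) not in (None, [], {}) for name in required)
```

A helper built this way would skip the apply step only when `deployment_applied(...)` returns `True`; the mere existence of `state.json` is no longer treated as proof of a completed deployment.
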
## Alternatives considered

- Regenerate the deployment state from scratch. Rejected because a surgical config repair is safer
  for an already deployed OP Stack root.
- Keep matching only top-level keys. Rejected because it does not match the live `op-deployer`
  schema on `ecp-forge`.

## Rollout / teardown

Deploy the updated bootstrap helper, restart `every-channel-op-stack-bootstrap.service`, and then
restart the dependent OP Stack containers. Teardown is reverting this helper change and regenerating
the OP Stack root with a known-flat intent schema.
@ -0,0 +1,53 @@
# ECP-0113: Keep OP Stack Runtime Compatible With Forge Host Services

Status: Draft

## Problem / context

`ecp-forge` now runs the OP Stack bootstrap far enough to produce `deployment.json`, `genesis.json`,
and `rollup.json`, but the runtime containers still failed to stay up. `op-geth` tried to bind the
default Ethereum P2P port `30303`, already owned by the host full Ethereum node. The pinned
`op-node:v1.13.5` rejected current `op-deployer/v0.6.0-rc.3` rollup fields such as `minBaseFee`.
After aligning to `op-node:v1.14.0`, that image still rejected the newer
`genesis.system_config.daFootprintGasScalar` field. The generated rollup config also carried
`eip1559Params = 0x0000000000000000` even though the genesis `extraData` and chain config encode
denominator `250` and elasticity `6`; that zero value caused `op-geth` to panic when the sequencer
requested the first payload. `op-batcher:v1.14.0` also no longer accepts `--batch-inbox-address`.
Isolated compatibility probes showed `op-node:v1.16.6` paired with `op-geth:v1.101702.0-rc.1` can
run against the generated genesis hash and produce L2 blocks.

## Decision

Assign `op-geth` a repo-owned L2 P2P port in the existing `285xx` range, align `op-node` to the
probed `v1.16.6` runtime, move `op-geth` to the probed `v1.101702.0-rc.1` image, remove the stale
batcher inbox-address flag, delete only `genesis.system_config.daFootprintGasScalar` from generated
rollup configs, and replace an all-zero `eip1559Params` with the value derived from the generated
`chain_op_config`.

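
The two rollup JSON changes can be sketched as one narrow normalization pass. This Python illustration is not the bootstrap helper: it assumes that `eip1559Params` is the 8-byte big-endian concatenation of a 4-byte denominator and a 4-byte elasticity, that it lives under `genesis.system_config` next to `daFootprintGasScalar`, and that `chain_op_config` exposes `eip1559DenominatorCanyon` and `eip1559Elasticity`. Those key names and the choice of the Canyon denominator are assumptions to verify against the generated files.

```python
import copy

ZERO_PARAMS = "0x" + "00" * 8


def encode_eip1559_params(denominator: int, elasticity: int) -> str:
    """Pack denominator and elasticity as two big-endian uint32 values."""
    return "0x" + denominator.to_bytes(4, "big").hex() + elasticity.to_bytes(4, "big").hex()


def normalize_rollup(rollup: dict) -> dict:
    """Return a copy with only the two targeted repairs applied."""
    fixed = copy.deepcopy(rollup)
    system_config = fixed.get("genesis", {}).get("system_config", {})
    # Remove exactly one field; every other generated field is left alone.
    system_config.pop("daFootprintGasScalar", None)
    # Repair only the all-zero value; a non-zero value is trusted as-is.
    if system_config.get("eip1559Params") == ZERO_PARAMS:
        chain = fixed["chain_op_config"]
        system_config["eip1559Params"] = encode_eip1559_params(
            chain["eip1559DenominatorCanyon"], chain["eip1559Elasticity"]
        )
    return fixed
```

Under these assumptions, denominator `250` and elasticity `6` encode to `0x000000fa00000006`, matching the values that the problem statement says the genesis `extraData` and chain config already carry.
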
## Consequences

- The host Ethereum node can keep `30303` without blocking OP Stack startup.
- The OP Stack RPC and P2P port assignments stay documented in repo config.
- Runtime image compatibility is explicit in Nix config.
- The rollup JSON normalization is intentionally narrow: it removes the exact field rejected by the
  older `op-node:v1.14.0` parser and repairs only the zero EIP-1559 params that caused the live
  `op-geth` payload panic.
- The `op-geth` image is an explicit release-candidate tag because the previously pinned image
  panicked against the current deployer output.

## Alternatives considered

- Stop the host full Ethereum node. Rejected because the OP Stack should coexist with the existing
  Ethereum services.
- Strip all newer-looking fields from `rollup.json`. Rejected because `op-node:v1.14.0` accepts the
  other generated fields tested during recovery; broad deletion would hide schema drift.
- Leave zero `eip1559Params` in place. Rejected because the live sequencer/geth pair panicked before
  the first L2 block could be built.
- Keep `op-geth:v1.101511.1`. Rejected because it reproducibly panics on first payload construction
  for this generated chain config.

## Rollout / teardown

Deploy the updated NixOS module and bootstrap helper, reset failed OP Stack units, and verify L2 RPC
and rollup RPC locally on `ecp-forge`. Teardown is reverting the port assignment and rollup JSON
normalization, then regenerating runtime files with a mutually compatible deployer/runtime image set.