ECP-0110: ecp-forge Hetzner Robot recovery wrapper

Status: Draft

Problem / context

git.every.channel is a single dedicated Hetzner host. When SSH and HTTPS are both unreachable, the blockchain and Forgejo validation path stalls before repo-owned deployment tools can connect. Robot can recover the host, but browser-only recovery is hard to repeat and easy to lose across agent handoffs.

Decision

Add a repo-local Robot wrapper for ecp-forge recovery:

  • default to server 2800441 / 95.216.114.54,
  • read Robot Webservice credentials from environment variables or the existing 1Password item at runtime,
  • avoid storing Robot passwords in git or shell profiles,
  • expose explicit status, rescue, reset, recover, and reachability-probe commands, and
  • mask Robot-generated rescue passwords unless the operator explicitly opts into printing them.
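
The credential and masking rules above could be sketched in shell roughly as follows. This is a minimal illustration, not the wrapper itself. The variable names ROBOT_OP_ITEM and ROBOT_SHOW_PASSWORD, the 1Password field names, and both function names are assumptions made for the example; only `op read` is an existing 1Password CLI command.

```shell
# Hypothetical sketch: every variable and function name here is an assumption.

# Print a credential: use the caller's environment value if it is set,
# otherwise read the named field from 1Password at runtime.
resolve_secret() {
  # $1 = value taken from the environment (may be empty)
  # $2 = field name inside the 1Password item
  if [ -n "$1" ]; then
    printf '%s\n' "$1"
  elif [ -n "${ROBOT_OP_ITEM:-}" ] && command -v op >/dev/null 2>&1; then
    # ROBOT_OP_ITEM is assumed to look like "op://<vault>/<item>".
    op read "${ROBOT_OP_ITEM}/$2"
  else
    echo "no credential in environment and no 1Password item configured" >&2
    return 1
  fi
}

# Print a Robot-generated rescue password masked, unless the operator
# has explicitly opted in with ROBOT_SHOW_PASSWORD=1.
mask_secret() {
  if [ "${ROBOT_SHOW_PASSWORD:-0}" = "1" ]; then
    printf '%s\n' "$1"
  else
    printf '%s\n' '********'
  fi
}
```

A caller might write `user=$(resolve_secret "${ROBOT_USER:-}" username)`, so the secret only ever lives in the process environment or in 1Password, never in a file tracked by git.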

The wrapper treats rescue activation and reset as operational recovery steps, not deployment. Once the host is reachable again, scripts/deploy-ecp-forge.sh remains the source of truth for the NixOS system state.

Consequences

  • Future agents can recover the Forge after a local 1Password CLI sign-in without asking for pasted Robot secrets.
  • The host identity and Robot server number are documented in the repo instead of being rediscovered from the browser UI.
  • Recovery actions remain explicit commands; ordinary probes never mutate Robot state.
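
One way to keep mutating actions explicit is to route every subcommand through a single dispatcher and classify it up front. The sketch below is hedged accordingly: the function names are hypothetical, `recover` is assumed to mean rescue activation followed by a reset (the proposal does not spell this out), and the endpoint paths and form fields reflect the Robot Webservice API as commonly documented and should be checked against the current Hetzner reference before use.

```shell
# Hypothetical sketch: function names and the ROBOT_USER / ROBOT_PASSWORD
# variable names are assumptions, not the wrapper's actual interface.
ROBOT_API="https://robot-ws.your-server.de"
SERVER_NUMBER="${SERVER_NUMBER:-2800441}"

# Succeeds only for commands that change Robot state.
# probe and status are read-only and must never appear here.
is_mutating() {
  case "$1" in
    rescue|reset|recover) return 0 ;;
    *) return 1 ;;
  esac
}

# $1 = HTTP method, $2 = API path, $3 = optional form data.
# Note: -u exposes credentials to local process listings; a real
# wrapper might pass them through `curl --config -` instead.
robot_call() {
  if [ -n "${3:-}" ]; then
    curl -fsS -u "${ROBOT_USER}:${ROBOT_PASSWORD}" -X "$1" --data "$3" "${ROBOT_API}$2"
  else
    curl -fsS -u "${ROBOT_USER}:${ROBOT_PASSWORD}" -X "$1" "${ROBOT_API}$2"
  fi
}

# Map each explicit subcommand to exactly one Robot request.
run_command() {
  case "$1" in
    status) robot_call GET "/server/${SERVER_NUMBER}" ;;
    rescue) robot_call POST "/boot/${SERVER_NUMBER}/rescue" "os=linux" ;;
    reset)  robot_call POST "/reset/${SERVER_NUMBER}" "type=hw" ;;
    *) echo "unknown command: $1" >&2; return 2 ;;
  esac
}
```

Keeping `is_mutating` separate from `run_command` lets a caller demand an extra confirmation flag for rescue and reset without touching the read-only paths.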

Alternatives considered

  • Continue browser-only Robot recovery. Rejected because it is too stateful for repeated agent handoffs and does not leave a repo-owned runbook.
  • Store Robot credentials in a repo-local file. Rejected because Robot credentials are operational secrets and should stay in 1Password or the caller's environment.
  • Move recovery into the deploy script. Rejected because Robot rescue/reset is a host-recovery action, while deploy-ecp-forge.sh should remain the NixOS deployment entrypoint.

Rollout / teardown

  1. Add scripts/hetzner-robot-forge.sh.
  2. Document the emergency path in docs/DEPLOY_ECP_FORGE.md.
  3. Use probe first, then status, then recover only when the Forge is unreachable.
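
The probe-first ordering in step 3 could look roughly like the sketch below. It assumes "unreachable" means that both SSH (port 22) and HTTPS (port 443) fail a plain TCP connect, matching the problem statement. The function names and the FORGE_HOST variable are hypothetical, and `nc -z -w` is assumed to be available (flag support varies between netcat variants).

```shell
# Hypothetical sketch: names are assumptions, not the wrapper's interface.
FORGE_HOST="${FORGE_HOST:-95.216.114.54}"

# Read-only TCP connect test; it never talks to the Robot API.
probe_port() {
  nc -z -w 5 "$FORGE_HOST" "$1" >/dev/null 2>&1
}

# Print the suggested next step. Returns 1 only when both SSH and HTTPS
# are unreachable, which is the only case where recovery is considered.
next_step() {
  ssh_ok=no
  https_ok=no
  if probe_port 22; then ssh_ok=yes; fi
  if probe_port 443; then https_ok=yes; fi

  if [ "$ssh_ok" = no ] && [ "$https_ok" = no ]; then
    echo "unreachable: run status, then recover if Robot confirms the host is down"
    return 1
  fi
  echo "reachable (ssh=$ssh_ok https=$https_ok): no recovery needed"
}
```

Because `next_step` only reports and never calls Robot, it is safe to run repeatedly; the mutating recover command stays a separate, deliberate invocation.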

To tear down, remove the wrapper and return to browser-only Robot operations.