# Worker self-update plan {#worker-self-update}

## Overview

This follow-up plan describes the work required after Phase A (artifact packaging) to enable ByteBiota workers to download and install updates automatically. It assumes the packaging workflow defined in auto-upgrade.md is in place and that versioned binaries plus checksums are published for each platform.

## Guiding principles

- **Opt-in rollout first.** Operators should be able to toggle self-update per deployment or per worker pool while we gain confidence.
- **Reuse existing infrastructure.** Extend FastAPI services and the worker CLI without introducing parallel frameworks or redundant schedulers.
- **Auditable behaviour.** Every upgrade attempt must emit structured logs so we can trace version adoption across the fleet.

## Phase breakdown

### Phase B1 – Server-side update orchestration {#phase-b1-server}

Objective: expose the authoritative manifest that workers consult before attempting an upgrade.

| Task | Description | Key files / modules | Exit criteria |
| --- | --- | --- | --- |
| B1.1 | Create update service scaffold | `src/bytebiota/server/update_api_service.py`, `src/bytebiota/server/app.py` | FastAPI routes `/api/worker-updates/latest` and `/api/worker-updates/report` registered with shared auth dependencies. |
| B1.2 | Define manifest schema | `src/bytebiota/server/schemas/update_manifest.py`, `wiki/operations/environment.md` | Pydantic models enforce required fields (version, platform triples, checksum, download URL); schema documented in the wiki. |
| B1.3 | Manifest storage + caching | `scripts/build_worker.sh`, new S3/GitHub Release publication script | Build pipeline uploads `worker-release.json`; server reads from storage with a 5-minute cache and exposes an ETag header. |
| B1.4 | Observability | `src/bytebiota/server/logging.py` | Each manifest fetch, report, and error is logged with correlation IDs. |
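
To make the manifest contract concrete, here is a minimal sketch of the shape B1.2 would enforce. The field names and validation rules are illustrative assumptions; the real service would use Pydantic models as the table specifies, not stdlib dataclasses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """One per-platform entry in worker-release.json (illustrative shape)."""

    version: str   # e.g. "1.4.0"
    platform: str  # platform triple, e.g. "x86_64-unknown-linux-gnu"
    sha256: str    # hex digest of the published binary
    url: str       # download URL for this platform's artifact

    def __post_init__(self) -> None:
        # Reject obviously malformed entries before workers ever see them.
        if len(self.sha256) != 64:
            raise ValueError("sha256 must be a 64-character hex digest")
        if not self.url.startswith("https://"):
            raise ValueError("download URL must use HTTPS")


entry = ManifestEntry(
    version="1.4.0",
    platform="x86_64-unknown-linux-gnu",
    sha256="a" * 64,
    url="https://releases.example.com/worker-1.4.0",
)
```

Validating at parse time means a worker can never act on a manifest entry with a truncated checksum or a plain-HTTP download URL.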

Dependencies: Phase A manifest artifacts, secrets for storage bucket access, existing auth middleware.

### Phase B2 – Worker updater client {#phase-b2-worker}

Objective: teach the worker process how to check for, download, and apply updates safely.

| Task | Description | Key files / modules | Exit criteria |
| --- | --- | --- | --- |
| B2.1 | Version embedding | `worker.spec`, `scripts/build_worker.sh`, `src/bytebiota/worker/worker.py` | `bytebiota-worker --version` returns the packaged version and build metadata. |
| B2.2 | Updater module | `src/bytebiota/worker/updater.py` | Module exports `check_for_update(config)`, `download_update(manifest_entry)`, and `apply_update(path)` with unit tests covering success and failure paths. |
| B2.3 | CLI integration | `src/bytebiota/worker/worker.py` | New CLI flags `--no-auto-update` and `--force-update`; environment variable `AUTO_UPDATE=0` documented and respected. |
| B2.4 | Download + verification | `src/bytebiota/worker/updater.py`, `tests/worker/test_updater.py` | Stream download to a temp path, verify SHA-256, atomically swap the binary (Windows fallback uses a helper script). |
| B2.5 | Rollback safeguards | `src/bytebiota/worker/updater.py` | On failed checksum or apply, restore the previous binary and emit a structured `worker_update_failure` log with the reason. |
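
The verify-then-swap step in B2.4 can be sketched with stdlib pieces. The function names echo the planned `updater.py` exports but are assumptions, not the final API, and the Windows helper-script fallback is omitted.

```python
import hashlib
import os
import shutil


def verify_sha256(path: str, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the downloaded file and compare its SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex


def apply_update(new_binary: str, current_binary: str) -> None:
    """Back up the running binary, then swap in the verified download.

    os.replace is atomic on POSIX when both paths are on the same
    filesystem; on Windows a helper script would perform the swap
    after the process exits.
    """
    shutil.copy2(current_binary, current_binary + ".bak")  # rollback copy
    os.replace(new_binary, current_binary)
```

Keeping the `.bak` copy alongside the binary gives B2.5 a trivially restorable rollback target without a separate staging directory.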

Dependencies: Phase B1 manifest API availability, packaging version metadata, OS-specific file replacement strategy (documented below).

### Phase B3 – Handshake and policy enforcement {#phase-b3-policy}

Objective: exchange version data between workers and server so we can coordinate staged rollouts and minimum supported versions.

| Task | Description | Key files / modules | Exit criteria |
| --- | --- | --- | --- |
| B3.1 | Handshake payload update | `src/bytebiota/worker/connection_manager.py`, `src/bytebiota/server/schemas/worker_registration.py` | Worker registration includes `current_version`; server responds with `minimum_version` and `recommended_version`. |
| B3.2 | Policy enforcement | `src/bytebiota/worker/worker.py` | Worker refuses to start when the server reports a higher `minimum_version` unless `--force-start` is provided; the event is logged. |
| B3.3 | Feature flag plumbing | `src/bytebiota/config/worker_config.py`, `wiki/operations/environment.md` | New config fields `auto_update_enabled` and `auto_update_channel`, documented and defaulted safely. |
| B3.4 | Metrics + dashboards | `src/bytebiota/worker/metrics.py`, `wiki/analytics/metrics-and-observability.md` | Emit metrics (`worker_update_attempt`, `worker_update_success`, `worker_update_failure`, `worker_version_current`); Grafana panel spec added to the analytics wiki. |
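
The B3.2 startup gate reduces to a version comparison. The sketch below assumes dotted-integer version strings and a hypothetical `may_start` helper; a real implementation would likely parse with `packaging.version` instead.

```python
def parse_version(v: str) -> tuple:
    """Turn "1.4.0" into (1, 4, 0) so tuples compare numerically."""
    return tuple(int(part) for part in v.split("."))


def may_start(current: str, minimum: str, force_start: bool = False) -> bool:
    """Refuse startup when the server's minimum_version exceeds ours,
    unless the operator passed --force-start."""
    if parse_version(current) >= parse_version(minimum):
        return True
    return force_start
```

Numeric tuple comparison avoids the classic string-compare trap where `"1.10.0" < "1.9.0"` lexicographically.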

Dependencies: Telemetry stack online (Prometheus exporters, logging sinks), update API accessible from worker networks.

### Phase B4 – Integration + rollout {#phase-b4-rollout}

Objective: validate the end-to-end flow and provide operators with runbooks for safe deployment.

| Task | Description | Key files / modules | Exit criteria |
| --- | --- | --- | --- |
| B4.1 | Integration tests | `tests/integration/test_worker_update_flow.py` | Simulated manifest server delivers a new version; test asserts download, apply, and restart behaviour. |
| B4.2 | Canary deployment automation | `.github/workflows/canary-worker-update.yml`, `scripts/promote_worker_release.py` | Workflow promotes the manifest to the canary channel only; manual approval gates the production channel. |
| B4.3 | Operator documentation | `DEPLOYMENT.md`, `wiki/operations/runbooks/worker-update.md` | Runbooks cover enabling updates, forcing upgrades, rolling back, and reading metrics. |
| B4.4 | Post-deploy monitoring | `wiki/analytics/metrics-and-observability.md` | Alert thresholds defined for failure rate and version skew. |
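
The core of the B4.1 integration test can be sketched with a stubbed manifest. Everything here, the dict-based manifest and the `check_for_update` helper, is illustrative rather than the real test harness.

```python
import hashlib
from typing import Optional


def check_for_update(current: str, manifest: dict) -> Optional[dict]:
    """Return the manifest when it advertises a newer version, else None."""
    latest = tuple(int(p) for p in manifest["version"].split("."))
    running = tuple(int(p) for p in current.split("."))
    return manifest if latest > running else None


# Stubbed "server": the test controls both the payload and its checksum.
payload = b"new worker binary bytes"
stub_manifest = {
    "version": "1.5.0",
    "sha256": hashlib.sha256(payload).hexdigest(),
}

decision = check_for_update("1.4.0", stub_manifest)
assert decision is not None                               # upgrade offered
assert hashlib.sha256(payload).hexdigest() == decision["sha256"]
assert check_for_update("1.5.0", stub_manifest) is None   # already current
```

The real test would serve `stub_manifest` over HTTP and assert the download, apply, and restart steps against a scratch binary.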

Dependencies: Prior phases complete, staging environment available for dry runs, release engineering sign-off.

## Telemetry and safety requirements

- Emit structured logs and metrics for each update attempt (`worker_update_attempt`, `worker_update_success`, `worker_update_failure`).
- Add feature flag configuration to `WorkerConfig` and document environment variables in `wiki/operations/environment.md`.
- Write integration tests covering a simulated update cycle (download stub file, checksum mismatch, rollback path).
- Define the rollback process: store the previous binary before applying an update, revert on failure, and document the manual rollback command sequence.
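
One possible shape for these structured events is sketched below; the field names (`reason`, `from_version`, `to_version`) are assumptions about the event schema, not a finalized contract.

```python
import json
import time


def log_update_event(event: str, **fields) -> str:
    """Emit one JSON line per update event so log sinks can index it."""
    record = {"event": event, "ts": int(time.time()), **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)  # real code would route this through the logging module
    return line


line = log_update_event(
    "worker_update_failure",
    reason="checksum_mismatch",
    from_version="1.4.0",
    to_version="1.5.0",
)
```

One JSON object per line keeps the events greppable on a single host while remaining parseable by whatever sink aggregates fleet-wide version adoption.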

## Deployment considerations

- Start with staged rollouts: enable auto-update on canary worker pools and observe metrics before broad rollout.
- Document operator runbooks for forcing updates, rolling back to a previous version, and troubleshooting failed downloads.
- Update `DEPLOYMENT.md` once the feature ships behind a flag in production.
- Coordinate release timing with server maintenance windows so workers never download manifests that reference unpublished binaries.