enqueuing deployment syncs #242
Blocks
Depends on
#289 key features improving user experience supported
fediversity/fediversity
#368 API available
fediversity/fediversity
#228 [D2.3] brought into production [2027-11-01]
fediversity/fediversity
Reference
fediversity/fediversity#242
As a user of a Fediversity web front-end,
I want to be able to enqueue distinct configuration deployment syncs,
so that my UI could focus on a single application at a time.
implementation notes
panel?): rio-build - planet nix 2026 talk by Bernardo Meurer Costa: The Missing Part of Nix (and where to find it). Nix-aware build queue exposing a gRPC API. It provides the information needed by our API's operations:
details vis-a-vis our API
deployment.submit: SchedulerService.SubmitBuild. nix build --store ssh-ng://rio; the gateway speaks Nix wire protocol. Build submitted with tenant + priority class.
deployment.get: SchedulerService.QueryBuildStatus
deployment.cancel: SchedulerService.CancelBuild
build.logs: AdminService.GetBuildLogs. since_line offset. Also streamed live via WatchBuild.
SchedulerService.WatchBuild: BuildEvent messages with sequence numbers for resumption: BuildStarted, BuildProgress, DerivationEvent, BuildCompleted/Failed/Cancelled.
AdminService.ListBuilds: ClusterStatus gives aggregate queue depth.
Replace semantics: rio-build has no built-in "supersede a queued build." The API service implements replace mode by calling CancelBuild on the prior build before submitting the new one.
Multi-tenancy: built-in tenant support with per-tenant GC retention and storage caps.
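The replace semantics described above (cancel the prior build, then submit the new one) could look roughly like the following. This is a minimal sketch, not rio-build's client API: the `BuildQueue` interface, the `InMemoryQueue` stand-in, the `DeploymentService` name and the `"queue"` default mode are all assumptions made for illustration. Only `mode: "replace"` and the cancel-then-submit order come from the notes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol


class BuildQueue(Protocol):
    """Hypothetical minimal queue interface (not rio-build's real client)."""

    def submit(self, payload: dict) -> str: ...

    def cancel(self, build_id: str, reason: str) -> None: ...


@dataclass
class InMemoryQueue:
    """Stand-in backend, used only to exercise the logic below."""

    submitted: List[str] = field(default_factory=list)
    cancelled: List[str] = field(default_factory=list)

    def submit(self, payload: dict) -> str:
        build_id = f"build-{len(self.submitted) + 1}"
        self.submitted.append(build_id)
        return build_id

    def cancel(self, build_id: str, reason: str) -> None:
        self.cancelled.append(build_id)


@dataclass
class DeploymentService:
    queue: BuildQueue
    # deployment_id -> id of the most recently submitted build
    active: Dict[str, str] = field(default_factory=dict)

    def submit(self, deployment_id: str, payload: dict, mode: str = "queue") -> str:
        prior: Optional[str] = self.active.get(deployment_id)
        # Replace mode: the backend has no "supersede", so cancel first.
        if mode == "replace" and prior is not None:
            self.queue.cancel(prior, reason="superseded")
        build_id = self.queue.submit(payload)
        self.active[deployment_id] = build_id
        return build_id
```

A real service would also clear `active` entries once a build finishes, so that a completed build is never cancelled after the fact.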
Here's how the Fediversity API specification compares with rio-build's gRPC API:
Mapping summary
The spec doc already contains a mapping table (lines 99-113 of api-specification.md). Here's the fuller picture:
What maps cleanly
┌────────────────────────┬───────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Fediversity API │ rio-build gRPC │ Gap │
├────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ deployment.submit │ SchedulerService.SubmitBuild │ rio-build takes a derivation DAG + tenant/priority; our API takes a high-level configuration + deployment_type. A translation layer is needed to │
│ │ │ evaluate Nix, produce the DAG, then submit. │
├────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ deployment.get │ SchedulerService.QueryBuildStatus │ rio-build returns derivation-level counters (total/completed/cached/running/failed/queued) that our DeploymentState doesn't surface -- we only have │
│ │ │ coarse states (building, deploying, etc.). │
├────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ deployment.cancel │ SchedulerService.CancelBuild │ Close match. rio-build includes a reason field; ours doesn't. │
├────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ build.logs │ AdminService.GetBuildLogs │ rio-build supports per-derivation filtering and streaming; our API has only build_id + offset -- simpler but less granular. │
├────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PubSub builds. │ SchedulerService.WatchBuild │ rio-build streams structured BuildEvent messages with sequence numbers for reconnection. Our PubSub events lack sequence numbers for resumption. │
│ events │ │ │
├────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Queue position │ AdminService.ListBuilds │ rio-build has pagination + status/tenant filtering. Our API has no list-all-builds operation -- only per-deployment pending_builds. │
└────────────────────────┴───────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
What our API has that rio-build doesn't
┌─────────────────────────────────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────┐
│ Fediversity API │ Notes │
├─────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ deployment.delete (teardown) │ rio-build is a build queue, not a deployment manager. Teardown is our concern. │
├─────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ deployment.get state machine (idle/active/deploying/destroying) │ rio-build tracks build states, not deployment lifecycle. │
├─────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ schema.get │ NixOS-specific; rio-build has no configuration schema concept. │
├─────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ deployments. state-change PubSub │ Deployment-level events are ours; rio-build only does build-level. │
├─────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ Dynamic auth (auth.authorize, auth.authenticate) │ rio-build has tenant support but no pluggable auth callback model like WAMP dynamic auth. │
├─────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ Replace mode (mode: "replace") │ rio-build lacks built-in supersede; the spec notes we must CancelBuild + re-submit. │
└─────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────┘
What rio-build has that our API doesn't expose
┌──────────────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ rio-build capability │ Notes │
├──────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Derivation-level granularity │ Build progress with per-derivation status, DAG structure, derivation events. Our progress events are flat log lines. │
├──────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Store services (PutPath, GetPath, FindMissingPaths, chunking) │ The entire NAR store layer. Our API doesn't expose store operations. │
├──────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Worker management (ListWorkers, DrainWorker, Heartbeat, bloom filter cache locality) │ Infrastructure-level; intentionally not in our operator-facing API. │
├──────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Admin operations (ClusterStatus, TriggerGC, ClearPoison) │ Hosting-provider internal tooling; not surfaced to operators. │
├──────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Multi-tenant management (ListTenants, CreateTenant) │ rio-build has this built in; our API delegates tenancy to the web app via auth callbacks. │
├──────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Sequence numbers on build event streams │ Allows reconnection without re-fetching; our PubSub has no resumption mechanism beyond build.logs offset. │
├──────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Priority classes on build submission │ Our API has no priority concept. │
└──────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Key gaps to watch
completed, failed) could be missed.
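The sequence-number resumption that the tables above credit to rio-build's event stream can be sketched as follows. This is an illustration only: the `Event` type and the `fetch_since` callable are hypothetical, and nothing here reflects the actual WatchBuild message layout. The point is that a client which remembers the last sequence number it saw can reconnect without losing or repeating terminal events.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Event:
    seq: int   # monotonically increasing per build (assumed)
    kind: str  # e.g. "BuildStarted", "BuildCompleted"


def resume(
    fetch_since: Callable[[int], Iterable[Event]],
    last_seen: int,
) -> Iterator[Event]:
    """Yield events newer than last_seen, skipping any replayed duplicates."""
    for event in fetch_since(last_seen):
        if event.seq <= last_seen:
            # The server may replay the boundary event; drop it.
            continue
        last_seen = event.seq
        yield event
```

Without a sequence number on the PubSub side, the equivalent client would have to re-fetch state via deployment.get after every reconnect.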
alternate options
yensid: HAProxy-based SSH proxy for Nix remote builders offering load balancing, but lacks build awareness: no queue, no status tracking, no logs, no cancellation; as per above talk lacks smart features like knowing which nodes may make builds faster based on what they already had in cache
django-tasks - as per docs sounds like this uses manual calls to check for results, which sounds clunky. not sure about querying to say list tasks. does support external back-ends.
celery: distributed task queue that supports external back-ends and result stores. allows checking whether a task has started.
gradient servers function as a build (c.f. #366) queue
as listed by enqueue: WAMP (c.f. #334)?
What's a "sync"?
@fricklerhandwerk currently our button i think said deploy, but that's a bit of a misnomer, given pressing it instead syncs configuration, which may simply alter settings for an existing service.
Ah so you're saying that what we today call deployment should be async? Yes, it should be that in any case, and it's implied in the later user stories where one gets a notification once a deployment is done, even when leaving the page early.
not just async. this story came from the recent meeting with the designers, who envisioned services to be separated across UI sections, each of which presumably altered independently. following such a UI, it would make sense for a deployment system to support that, e.g.:
Title changed from enqueue deployment syncs to enqueuing deployment syncs
kiara referenced this issue 2025-06-08 13:09:56 +02:00
kiara referenced this issue 2025-06-09 18:56:22 +02:00
`rio-build` is looking unwieldy, which seems a barrier given e.g. `yensid` doesn't offer a queue
rio-build is a 14-crate Rust workspace meant for K8s deployment with Postgres + S3 + Raft (?) + (bring your own) nix-eval-jobs.
hm, rio-build does not even do the eval part, raising questions on to what extent that might even solve our problem.
edit: hm, fwiw on a cached (ish?) deploy, i got 0-ish sec eval 14.66 sec deploy (time nix run ".#dev-ssh-forgejo-actions-runner")
should maybe further consider gradient for this (also see #362)
rechecking options
src/api-specification.md currently nominates https://github.com/lovesegfault/rio-build as the queue behind the WAMP fediversity.deployment.* RPCs. #242 flags rio-build as heavy: it presumes k8s, and its headline features (distributed scheduling, cache-locality-aware builder assignment via bloom filters, critical-path DAG priorities) are nice-to-haves rather than blockers for our small-to-medium hosting-provider target. The spec itself says "the queue backend is a deployment choice; the API abstracts over it" (L99), so swapping it is supported by design.
The queue's real requirements (spec L87-118):
Language-neutral evaluation: Python-nativeness is not a priority (panel is Python today but the backend may change). What matters operationally: existing NixOS module (or
trivial-to-package) and a programmatic API we can drive from the WAMP layer.
#362 already enumerates many CI systems with a Nix/concurrency/secrets matrix — this plan focuses specifically on the
queue-API question that #242 is asking.
Critical constraint: jobs are impure deployments, not pure builds
Our unit of work is not a pure Nix build. A deployment:
This rules against schedulers whose data model is "evaluate a flake, build store paths, push to cache." Those treat env vars as anti-features (impurity breaks substituter
guarantees) and have no first-class way to express "after the build, run this effectful step with these secrets and stream its output." In Nix's vocabulary this is what
Hercules CI calls an effect; outside Hercules it's barely modeled.
Concretely, this weakens Gradient and buildbot-nix as candidates and strengthens generic CI (Forgejo Actions, Jenkins) and generic queues (River, Celery, Temporal) whose
workers just run a script with whatever env they're given. The "Nix-aware" tier mainly helps with the build sub-phase (parallel eval, per-drv progress, cache push); we'd still
need to bolt on an effect-runner for the apply step.
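The "build, then effectful apply with secrets" shape described above could be sketched like this. It assumes a generic worker that just runs commands with whatever env it is given. The function name, the status strings and the injectable `run` callable are hypothetical, and the build and apply commands are passed in by the caller instead of being hard-coded.

```python
import os
import subprocess
from typing import Callable, Mapping, Optional, Sequence

# A runner takes a command and an environment and returns an exit code.
Runner = Callable[[Sequence[str], Mapping[str, str]], int]


def run_subprocess(cmd: Sequence[str], env: Mapping[str, str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd), env=dict(env), check=False).returncode


def deploy(
    build_cmd: Sequence[str],
    apply_cmd: Sequence[str],
    secrets: Mapping[str, str],
    run: Runner = run_subprocess,
    base_env: Optional[Mapping[str, str]] = None,
) -> str:
    """Run the pure build phase, then the impure apply phase."""
    base = dict(os.environ if base_env is None else base_env)
    # Build phase: no secrets in the environment.
    if run(build_cmd, base) != 0:
        return "build_failed"
    # Apply phase (the "effect"): secrets are injected only here.
    if run(apply_cmd, {**base, **secrets}) != 0:
        return "apply_failed"
    return "deployed"
```

Keeping the secrets out of the build phase mirrors the split drawn above: a Nix-aware tier can own the first call, while the second needs a runner that tolerates impurity.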
Nix-aware candidates
Gradient (https://github.com/Wavelens/Gradient, AGPL-3.0, v1.1.1 May 2026, Rust)
over WebSocket (zstd-precompressed).
tracked at https://github.com/wavelens/gradient/issues/25.
start new eval. Same pattern as rio-build's "cancel + resubmit" but built on coarser primitives.
minimises network traffic between builders" (substitution-aware assignment). Built-in binary cache (S3-backable). Interactive dependency graph.
and should expect to file bugs upstream (per-build cancel chief among them).
inside a Gradient build means either smuggling them into an impure derivation (__noChroot/__impureHostDeps, fragile) or running them outside Gradient and using Gradient only
for the upstream evaluation+build phase — at which point Gradient is just a fancy build cache and we still need a separate runner for the deploy. The API also has no notion of
passing per-submit secrets to a build.
buildbot-nix (https://github.com/nix-community/buildbot-nix, Python+Nix, v1.1.0 Sep 2025)
scheduling one build per attribute. Its data model is "this flake has these checks, build them on push" — pull-request CI. To use it for "deploy tenant X with runtime-submitted
config Y via WAMP" the realistic paths are:
a. Generate a per-deployment flake attribute — write config to a flake, push to a branch, let buildbot-nix discover it. Slow, conflates git with the request queue.
b. Bypass nix-eval-jobs entirely — write a custom Buildbot BuilderConfig + scheduler. At which point you're using Buildbot, not buildbot-nix; the module's added value drops
to ~zero and you inherit Buildbot's master.cfg learning curve.
c. Long-lived control flake re-evaluated on each submit — racy on the shared flake, re-eval cost on every request.
None are deal-breakers; each costs real engineering. The honest read: buildbot-nix is optimised for git-push triggers CI; our shape is web-API triggers one-shot deploy. Picking
it likely means option (b): using buildbot-nix as a hint and writing custom Buildbot config underneath.
herculesCI.onPush..outputs.effects derivations and wires them in as ordinary Buildbot builders (/run-effect). It does not depend on hercules-ci-effects at runtime —
it bundles a minimal effects-lib.nix. Secrets are injected hercules-style via HERCULES_CI_SECRETS_JSON=/run/secrets.json (docs/EFFECTS.md); knobs include
effects_per_repo_secrets, effects_branches, effects_extra_sandbox_paths. Effects are cancellable/observable via REST like any other build, and pipeline stages distinguish
nix-eval / nix-build / nix-effects. Open caveats: effects run in parallel per push (#587 wants sequential), may not re-run on cached builds (#295), and the sandbox needs
allow-listing for SSH/TF state dirs. The "bend the flake/check vocabulary" tradeoff applies only to the build sub-phase; the impure deploy fits natively as an effect.
but buildbot-nix doesn't configure one out of the box — we'd need to add it via Buildbot config. Real but contained.
Jenkins (https://www.jenkins.io/, MIT, in nixpkgs as services.jenkins)
Listed in #362 with cancellation ✅. Worth surfacing here because the queue primitives match well:
(progressiveText).
values are per-run.
Forgejo Actions (already running in our infra)
We already host forgejo-ci and king runners (see CLAUDE.md). Forgejo Actions supports:
marks it as ✅ for concurrency.
lowest-new-software path of anything on the list.
runner has full network access for tofu apply and SSH activation. Nix-awareness is supplied by the script itself (the job calls nix build / nix-fast-build), which is the right
separation of concerns for a job that's half pure-build and half impure-effect.
pre-declared in on.workflow_dispatch.inputs.
nix-fast-build (https://github.com/Mic92/nix-fast-build)
Not a queue — CLI parallelizer combining nix-eval-jobs + nix-output-monitor with JSON/JUnit output. Keep as the builder invoked by whichever queue we pick; it gives us
per-attribute progress for free.
Dropped Nix-aware candidates
mutation).
submit-with-payload shape, or SaaS-only, or much smaller adoption than the four CI options above. Not pursued; reconsider only if a specific gap surfaces.
Generic queue backends
The API spec is backend-agnostic, so a non-Nix queue is viable — the worker shells out to nix build / nix-fast-build and we stream output ourselves. Cost: we own Nix-progress
reporting (parse nix-output-monitor JSON).
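With a generic queue, "the worker shells out and we stream output ourselves" might look like the sketch below. It only forwards raw lines with an offset (the same addressing build.logs uses) and does not attempt to parse nix-output-monitor JSON. `stream_lines` and the `publish` callback are hypothetical names. In the real service, `publish` would forward to PubSub.

```python
import subprocess
import sys
from typing import Callable, Sequence


def stream_lines(cmd: Sequence[str], publish: Callable[[int, str], None]) -> int:
    """Run cmd, publish each output line with its offset, return the exit code."""
    proc = subprocess.Popen(
        list(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr, as a build log would
        text=True,
    )
    assert proc.stdout is not None
    with proc.stdout:
        for offset, line in enumerate(proc.stdout):
            publish(offset, line.rstrip("\n"))
    return proc.wait()


if __name__ == "__main__":
    # Stand-in for a build command, so the sketch runs without Nix installed.
    demo = [sys.executable, "-c", "print('evaluating'); print('building')"]
    stream_lines(demo, lambda offset, line: print(offset, line))
```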
Project: River (https://github.com/riverqueue/river)
Lang / store: Go + Postgres
Nix packaging: Pkg in nixpkgs; no module yet
Replace: First-class UniqueOpts (args+queue+state)
Cancel: Native
Log streaming: Telemetry subs only; we build on top
Notes: Cleanest unique/cancel primitives. Postgres reuses panel's store.
Project: NATS JetStream
Lang / store: Go single binary
Nix packaging: services.nats module in nixpkgs
Replace: Native via AllowRollup + per-subject dedup (deployment_id as subject)
Cancel: Out-of-band only
Log streaming: Build on KV/extra stream
Notes: Best native replace; weakest cancellation+log story. Worthwhile if NATS earns its keep elsewhere.
Project: Temporal
Lang / store: Go + Postgres/Cassandra, multi-service
Nix packaging: pkgs.temporal exists; no NixOS module
Replace: WorkflowIDReusePolicy=TerminateIfRunning is exactly our semantic
Cancel: Native, plus signals/queries
Log streaming: Via activities
Notes: Semantic match is unbeatable; 5+ services is disproportionate at our scale. Revisit if orchestration grows multi-step DAGs.
Project: Celery + Redis / Procrastinate
Lang / store: Python + Redis or Postgres
Nix packaging: In nixpkgs with modules
Replace: Redis/Postgres lock + revoke prior task
Cancel: Native (SIGTERM)
Log streaming: Custom: subprocess stdout → pub/sub → WAMP
Notes: Lowest engineering cost today if we stay Python; locks us harder to Python than we want long-term.
Dropped: RQ (strict subset of Celery), pg-boss (Node mismatch), django-q2 (subset of Celery feature set, same Python lock-in).
Ranked recommendation
Reordering after the impurity + external-trigger checks: generic CI gains, Gradient loses, buildbot-nix is salvaged by its first-class hercules-style effects support.
are first-class API calls; secrets and env vars are first-class; runner runs nix build and tofu apply in one job. Only gap is log streaming via public API; runner-side scrape
or sidecar shipper closes it. Lowest new software footprint and best fit for half-pure-half-impure jobs.
/run/secrets.json. Effects are observable/cancellable via Buildbot's REST API. Cost: write a small Buildbot config to add a ForceScheduler (stock buildbot-nix is git-push
driven) and decide between the three "flake vocabulary" patterns for submit-time configs. Pick this if per-drv visibility and cache push (Cachix/Attic/Harmonia) are real
product requirements.
we own Nix-progress reporting and packaging a NixOS module. Best choice if we want the queue conceptually decoupled from any CI tool.
cost.
deploy with secrets. No effects concept; we'd use it for the build sub-phase only and need a separate runner for apply. Reconsider if it grows effects support or once the
apply phase can itself be modeled as a derivation.
forgejo-actions spike #973 (+ CI runs), in its attempt to use dispatch.py + .forgejo/workflows/deploy.yml as a starting point for WAMP fediversity.deployment.* RPCs, turned up these gaps:
dispatch.py find-run is broken. Forgejo's GET .../actions/runs listing does not return inputs in a usable form (trigger_event is a string, not the structured object the driver expects). find-run and replace-test cannot locate runs by deployment_id. The WAMP wiring needs another correlation strategy (return the run id from a wrapping API, or write the deployment_id into a tag/branch we can list on).
/tmp/build-<name>.log files on forgejo-ci only exist for jobs that go through the local nix build wrapper, so the spike workflow doesn't produce them. Mitigations to consider: tee to a known path inside the workflow step, or wait for upstream PRs (#8873 / #11330).
Only GET endpoints exist on /repos/{owner}/{repo}/actions/runs/... or .../jobs endpoints in the swagger. The WAMP deployment.cancel RPC will need to trigger cancel indirectly - e.g. dispatch a no-op run with the same concurrency.group so cancel-in-progress does the work.
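One of the correlation strategies mentioned above, returning the run id from a wrapping API, could be sketched as a small registry that records which run belongs to which deployment_id at dispatch time, so nothing depends on the runs listing exposing inputs. Everything here is hypothetical: `RunRegistry` is an illustrative name, and `dispatch` stands for whatever call triggers the workflow and is assumed to be able to return a run id.

```python
from typing import Callable, Dict, List, Mapping, Optional


class RunRegistry:
    """Remembers which run ids were dispatched for each deployment_id."""

    def __init__(self, dispatch: Callable[[Mapping[str, str]], int]) -> None:
        # `dispatch` is a hypothetical callable: it triggers a run with the
        # given inputs and returns that run's id.
        self._dispatch = dispatch
        self._runs: Dict[str, List[int]] = {}

    def submit(self, deployment_id: str, inputs: Mapping[str, str]) -> int:
        """Dispatch a run and record it under deployment_id."""
        run_id = self._dispatch({**inputs, "deployment_id": deployment_id})
        self._runs.setdefault(deployment_id, []).append(run_id)
        return run_id

    def latest(self, deployment_id: str) -> Optional[int]:
        """Return the most recent run id for a deployment, if any."""
        runs = self._runs.get(deployment_id)
        return runs[-1] if runs else None
```

An in-memory map is lost on restart. A real wrapper would persist it, for example in the panel's existing database.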