enqueuing deployment syncs #242

Open
opened 2025-03-12 13:32:37 +01:00 by kiara · 8 comments
Owner

As a user of a Fediversity web front-end,
I want to be able to enqueue distinct configuration deployment syncs,
so that my UI could focus on a single application at a time.

implementation notes

  • this comes down to using a queue, potentially backed by multiple builders (#366)
    • ideally dropping outdated syncs for a deployment in favor of the newer one
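  • nix-specific builder queues seem an open challenge still, with underlying work including work on the nix protocol (https://www.tweag.io/blog/2024-04-25-nix-protocol-in-rust/)
  • deployment triggers could then distinguish replace vs. enqueue operations, but let's default to replace.
  • we may need to further be able to query such a queue, whether to indicate progress (#184), or support CRUD-like end-points to introspect, enqueue, replace or cancel builds (c.f. #334).
  • c.f. opentofu resource targeting (https://opentofu.org/docs/cli/commands/plan/#resource-targeting)
  • queue options (for panel?):
    • rio-build (https://github.com/lovesegfault/rio-build), c.f. the planet nix 2026 talk by Bernardo Meurer Costa: The Missing Part of Nix (and where to find it) (https://youtube.com/live/HAguopYb46c?si=Z9Fik6l1CrXXt794&t=16775). Nix-aware build queue exposing a gRPC API. It provides the information needed by our API's operations; see "details vis-a-vis our API".
  • building on #900, this should further switch the deployment jobs in the django model to such a proper queue, and implement catch-up (https://git.fediversity.eu/fediversity/fediversity/pulls/900/files#diff-172bdb2d72d8eef002cb4cf4725eae12b923fc9d).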

details vis-a-vis our API

| API operation | rio-build gRPC | Notes |
|---|---|---|
| `deployment.submit` | `SchedulerService.SubmitBuild` | Clients connect via `nix build --store ssh-ng://rio`; the gateway speaks Nix wire protocol. Build submitted with tenant + priority class. |
| `deployment.get` | `SchedulerService.QueryBuildStatus` | Returns state (pending/active/succeeded/failed/cancelled), derivation counts (total/completed/cached/running/failed/queued), timestamps. |
| `deployment.cancel` | `SchedulerService.CancelBuild` | Per-build, with reason tracking. Propagates cancel signals to workers. |
| `build.logs` | `AdminService.GetBuildLogs` | Streaming, with per-derivation filtering and `since_line` offset. Also streamed live via `WatchBuild`. |
| PubSub progress | `SchedulerService.WatchBuild` | Streams `BuildEvent` messages with sequence numbers for resumption: `BuildStarted`, `BuildProgress`, `DerivationEvent`, `BuildCompleted`/`Failed`/`Cancelled`. |
| Queue position | `AdminService.ListBuilds` | Paginated listing with status/tenant filtering. `ClusterStatus` gives aggregate queue depth. |
| Builder assignment | Built-in | Cache-locality-aware scheduling (workers report cached paths via bloom filter), critical-path priority within DAGs. |

Replace semantics: rio-build has no built-in "supersede a queued build." The API service implements replace mode by calling CancelBuild on the prior build before submitting the new one.
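
A minimal sketch of how an API service might layer replace mode on top of a backend that only offers submit and cancel. The `Scheduler` protocol, `FakeScheduler` and `ReplacingSubmitter` are hypothetical names for illustration, not rio-build's client API; the in-memory fake stands in for the real gRPC calls.

```python
import itertools
import threading
from typing import Dict, List, Optional, Protocol


class Scheduler(Protocol):
    """Hypothetical stand-in for a backend exposing only submit and cancel."""

    def submit_build(self, deployment_id: str, config: dict) -> str: ...
    def cancel_build(self, build_id: str, reason: str) -> None: ...


class FakeScheduler:
    """In-memory fake so the sketch runs without a real queue backend."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.cancelled: List[str] = []

    def submit_build(self, deployment_id: str, config: dict) -> str:
        return f"build-{next(self._ids)}"

    def cancel_build(self, build_id: str, reason: str) -> None:
        self.cancelled.append(build_id)


class ReplacingSubmitter:
    """Replace mode: cancel the deployment's prior build, then submit the new one."""

    def __init__(self, scheduler: Scheduler) -> None:
        self._scheduler = scheduler
        self._current: Dict[str, str] = {}  # deployment_id -> latest build_id
        # The lock keeps cancel + submit atomic within this one process only.
        self._lock = threading.Lock()

    def submit(self, deployment_id: str, config: dict) -> str:
        with self._lock:
            prior: Optional[str] = self._current.get(deployment_id)
            if prior is not None:
                self._scheduler.cancel_build(prior, reason="superseded")
            build_id = self._scheduler.submit_build(deployment_id, config)
            self._current[deployment_id] = build_id
            return build_id


scheduler = FakeScheduler()
submitter = ReplacingSubmitter(scheduler)
first = submitter.submit("mastodon", {"rev": 1})
second = submitter.submit("mastodon", {"rev": 2})  # supersedes the first build
other = submitter.submit("pixelfed", {"rev": 1})   # different deployment, nothing cancelled
```

The lock only serialises callers inside a single API process; with several API instances, or a backend that acknowledges a cancel before the prior build has actually stopped, the window between cancel and submit remains.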

Multi-tenancy: built-in tenant support with per-tenant GC retention and storage caps.

Here's how the Fediversity API specification compares with rio-build's gRPC API:

Mapping summary

The spec doc already contains a mapping table (lines 99-113 of api-specification.md). Here's the fuller picture:

What maps cleanly

┌────────────────────────┬───────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Fediversity API │ rio-build gRPC │ Gap │
├────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ deployment.submit │ SchedulerService.SubmitBuild │ rio-build takes a derivation DAG + tenant/priority; our API takes a high-level configuration + deployment_type. A translation layer is needed to │
│ │ │ evaluate Nix, produce the DAG, then submit. │
├────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ deployment.get │ SchedulerService.QueryBuildStatus │ rio-build returns derivation-level counters (total/completed/cached/running/failed/queued) that our DeploymentState doesn't surface -- we only have │
│ │ │ coarse states (building, deploying, etc.). │
├────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ deployment.cancel │ SchedulerService.CancelBuild │ Close match. rio-build includes a reason field; ours doesn't. │
├────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ build.logs │ AdminService.GetBuildLogs │ rio-build supports per-derivation filtering and streaming; our API has only build_id + offset -- simpler but less granular. │
├────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PubSub builds.<id> events │ SchedulerService.WatchBuild │ rio-build streams structured BuildEvent messages with sequence numbers for reconnection. Our PubSub events lack sequence numbers for resumption. │
├────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Queue position │ AdminService.ListBuilds │ rio-build has pagination + status/tenant filtering. Our API has no list-all-builds operation -- only per-deployment pending_builds. │
└────────────────────────┴───────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

What our API has that rio-build doesn't

┌─────────────────────────────────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────┐
│ Fediversity API │ Notes │
├─────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ deployment.delete (teardown) │ rio-build is a build queue, not a deployment manager. Teardown is our concern. │
├─────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ deployment.get state machine (idle/active/deploying/destroying) │ rio-build tracks build states, not deployment lifecycle. │
├─────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ schema.get │ NixOS-specific; rio-build has no configuration schema concept. │
├─────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ deployments.<id> state-change PubSub │ Deployment-level events are ours; rio-build only does build-level. │
├─────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ Dynamic auth (auth.authorize, auth.authenticate) │ rio-build has tenant support but no pluggable auth callback model like WAMP dynamic auth. │
├─────────────────────────────────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ Replace mode (mode: "replace") │ rio-build lacks built-in supersede; the spec notes we must CancelBuild + re-submit. │
└─────────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────┘

What rio-build has that our API doesn't expose

┌──────────────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ rio-build capability │ Notes │
├──────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Derivation-level granularity │ Build progress with per-derivation status, DAG structure, derivation events. Our progress events are flat log lines. │
├──────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Store services (PutPath, GetPath, FindMissingPaths, chunking) │ The entire NAR store layer. Our API doesn't expose store operations. │
├──────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Worker management (ListWorkers, DrainWorker, Heartbeat, bloom filter cache locality) │ Infrastructure-level; intentionally not in our operator-facing API. │
├──────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Admin operations (ClusterStatus, TriggerGC, ClearPoison) │ Hosting-provider internal tooling; not surfaced to operators. │
├──────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Multi-tenant management (ListTenants, CreateTenant) │ rio-build has this built in; our API delegates tenancy to the web app via auth callbacks. │
├──────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Sequence numbers on build event streams │ Allows reconnection without re-fetching; our PubSub has no resumption mechanism beyond build.logs offset. │
├──────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Priority classes on build submission │ Our API has no priority concept. │
└──────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Key gaps to watch

  1. No event resumption -- if a client disconnects from builds.<id>, there's no sequence number to resume from. The build.logs RPC with offset partially covers this for log lines, but structured events (started, completed, failed) could be missed.
  2. No derivation-level progress -- operators see "building" but not "142/300 derivations done". rio-build provides this via QueryBuildStatus counters and DerivationEvent in the stream.
  3. No build listing -- there's no way to list all builds across deployments (rio-build's AdminService.ListBuilds). Could be useful for hosting provider dashboards.
  4. Replace semantics are synthetic -- documented as needing a CancelBuild + resubmit. Worth noting this creates a race window.
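
To illustrate gap 1, here is a minimal sketch of sequence-numbered events with replay after a reconnect. `EventLog` and its methods are hypothetical names; this is neither rio-build's WatchBuild implementation nor a WAMP router feature, just the general technique of tagging each event with a monotonically increasing number.

```python
from collections import deque
from typing import Deque, List, Tuple


class EventLog:
    """Bounded per-build event history keyed by an increasing sequence number."""

    def __init__(self, max_events: int = 1000) -> None:
        # Oldest events fall off once max_events is exceeded.
        self._events: Deque[Tuple[int, str]] = deque(maxlen=max_events)
        self._next_seq = 1

    def publish(self, event: str) -> int:
        """Store the event and return the sequence number assigned to it."""
        seq = self._next_seq
        self._next_seq += 1
        self._events.append((seq, event))
        return seq

    def replay_after(self, last_seen_seq: int) -> List[Tuple[int, str]]:
        """Return every retained event newer than the client's last seen number."""
        return [(seq, ev) for seq, ev in self._events if seq > last_seen_seq]


log = EventLog()
log.publish("started")
last_seen = log.publish("progress 142/300")
# The client disconnects here and misses the next two events.
log.publish("progress 300/300")
log.publish("completed")
missed = log.replay_after(last_seen)
```

On reconnect the client sends the last sequence number it saw and receives only what it missed, so structured events such as completion are not lost as long as they are still within the retained history.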
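
alternate options

  • yensid (https://github.com/garnix-io/yensid): HAProxy-based SSH proxy for Nix remote builders offering load balancing, but lacks build awareness: no queue, no status tracking, no logs, no cancellation; as per the above talk, it lacks smart features like knowing which nodes may make builds faster based on what they already have in cache
  • django-tasks (https://github.com/RealOrangeOne/django-tasks): as per the docs (https://docs.djangoproject.com/en/6.0/topics/tasks/) this sounds like it uses manual calls to check for results, which sounds clunky. not sure about querying to, say, list tasks. does support external back-ends (https://docs.djangoproject.com/en/6.0/topics/tasks/#third-party-backends).
  • celery (https://github.com/celery/celery): distributed task queue that supports external back-ends (https://docs.celeryq.dev/en/stable/getting-started/first-steps-with-celery.html#choosing-a-broker) and result stores (https://docs.celeryq.dev/en/stable/getting-started/introduction.html#celery-is). allows checking whether a task has started (https://docs.celeryq.dev/en/stable/userguide/tasks.html#task-track-started).
  • gradient (https://github.com/Wavelens/Gradient) servers function as a build (c.f. #366) queue
  • ~~as listed by enqueue (https://github.com/php-enqueue/enqueue-dev): WAMP (c.f. #334)?~~
    • counterpoint: wamp-stream-service (https://hub.docker.com/r/unzel/wamp-stream-service) makes the following note, clarifying WAMP functions more like fire-and-forget (PubSub) than like a queue:

      The queue worker connects to the router in order to send wamp messages. So the router must be started before the queue worker.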


What's a "sync"?

Author
Owner

@fricklerhandwerk currently our button i think said deploy, but that's a bit of a misnomer, given pressing it instead syncs configuration, which may simply alter settings for an existing service.


Ah so you're saying that what we today call deployment should be async? Yes, it should be that in any case, and it's implied in the later user stories where one gets a notification once a deployment is done, even when leaving the page early.

Author
Owner

not just async. this story came from the recent meeting with the designers, who envisioned services to be separated across UI sections, each of which presumably altered independently. following such a UI, it would make sense for a deployment system to support that, e.g.:

  1. edit mastodon config
  2. edit pixelfed config
  3. start editing peertube config
  4. receive notification mastodon sync succeeded
  5. finish editing peertube config
  6. receive notification pixelfed sync succeeded
  7. receive notification peertube sync succeeded
kiara changed title from enqueue deployment syncs to enqueuing deployment syncs 2025-06-01 16:57:43 +02:00
Author
Owner
`rio-build` is looking unwieldy, which seems a barrier given e.g. `yensid` doesn't offer a queue

rio-build is a 14-crate Rust workspace meant for K8s deployment with Postgres + S3 + Raft (?) + (bring your own) nix-eval-jobs.

Author
Owner

hm, rio-build does not even do the eval part (https://github.com/lovesegfault/rio-build/blob/main/docs/src/decisions/002-external-evaluation.md#decision), raising questions on to what extent that might even solve our problem.

edit: hm, fwiw on a cached (ish?) deploy, i got 0-ish sec eval, 14.66 sec deploy (`time nix run ".#dev-ssh-forgejo-actions-runner"`)
Author
Owner

should maybe further consider gradient (https://discourse.nixos.org/t/gradient-call-for-testers/77549) for this (also see #362)
Author
Owner

rechecking options

src/api-specification.md currently nominates https://github.com/lovesegfault/rio-build as the queue behind the WAMP fediversity.deployment.* RPCs. #242 flags rio-build as heavy: it presumes k8s, and its headline features (distributed scheduling, cache-locality-aware builder assignment via bloom filters, critical-path DAG priorities) are nice-to-haves rather than blockers for our small-to-medium hosting-provider target. The spec itself says "the queue backend is a deployment choice; the API abstracts over it" (L99), so swapping it is supported by design.

The queue's real requirements (spec L87-118):

  • Per-deployment_id serialization with replace semantics (default) and opt-in enqueue mode
  • Best-effort cancellation of pending/running builds
  • Status state machine + derivation counts + live progress events + log streaming with offset/resume
  • Multi-tenant priority + capacity bound (queue_full)
  • Plays with WAMP event_history retention on the router side (nexusd/Crossbar)
  • Single-host deploy fine today; distributed multi-builder is a growth path, not a launch blocker
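
The first and fourth requirements can be sketched as a small in-memory structure. This is an illustration only, assuming hypothetical names (`DeploymentQueue`, `QueueFull`, `next_job`, `finish`); it covers pending jobs and does not cancel a build that is already running.

```python
from collections import deque
from typing import Deque, Dict, Optional, Set, Tuple


class QueueFull(Exception):
    """Raised when the capacity bound is hit (the spec's queue_full condition)."""


class DeploymentQueue:
    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        # deployment_id -> pending configs, oldest first; dict keeps arrival order.
        self._pending: Dict[str, Deque[dict]] = {}
        self._running: Set[str] = set()
        self._size = 0  # total number of pending jobs

    def submit(self, deployment_id: str, config: dict, mode: str = "replace") -> None:
        if mode not in ("replace", "enqueue"):
            raise ValueError(f"unknown mode: {mode}")
        jobs = self._pending.setdefault(deployment_id, deque())
        if mode == "replace" and jobs:
            # Drop outdated pending syncs in favour of the newer one.
            self._size -= len(jobs)
            jobs.clear()
        if self._size >= self._capacity:
            raise QueueFull(deployment_id)
        jobs.append(config)
        self._size += 1

    def next_job(self) -> Optional[Tuple[str, dict]]:
        """Hand out the oldest job of a deployment that has nothing running."""
        for deployment_id, jobs in self._pending.items():
            if jobs and deployment_id not in self._running:
                self._size -= 1
                self._running.add(deployment_id)
                return deployment_id, jobs.popleft()
        return None

    def finish(self, deployment_id: str) -> None:
        """Mark the deployment idle so its next pending job may start."""
        self._running.discard(deployment_id)


queue = DeploymentQueue(capacity=3)
queue.submit("mastodon", {"rev": 1})
queue.submit("mastodon", {"rev": 2})                  # replaces rev 1
queue.submit("pixelfed", {"rev": 1})
queue.submit("pixelfed", {"rev": 2}, mode="enqueue")  # keeps both
first = queue.next_job()
second = queue.next_job()
third = queue.next_job()   # pixelfed is busy, mastodon has nothing pending
queue.finish("pixelfed")
fourth = queue.next_job()
```

Keeping at most one running job per deployment_id is what gives serialization; different deployments still proceed independently, which matches the per-service UI flow described earlier in the thread.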

Language-neutral evaluation: Python-nativeness is not a priority (panel is Python today but the backend may change). What matters operationally is an existing (or trivial-to-package) NixOS module and a programmatic API we can drive from the WAMP layer.

#362 already enumerates many CI systems with a Nix/concurrency/secrets matrix; this plan focuses specifically on the queue-API question that #242 is asking.

Critical constraint: jobs are impure deployments, not pure builds

Our unit of work is not a pure Nix build. A deployment:

  • Evaluates a NixOS configuration (pure).
  • Builds derivations (pure).
  • Runs tofu apply / activation over SSH (impure: side effects on remote infra).
  • Consumes env-var secrets (Proxmox API tokens, SSH keys, Netbox tokens, agenix material) — passed to TF providers and to nix copy / activation.
  • Mutates external state (TF backend, target hosts, DNS, S3).

This rules against schedulers whose data model is "evaluate a flake, build store paths, push to cache." Those treat env vars as anti-features (impurity breaks substituter
guarantees) and have no first-class way to express "after the build, run this effectful step with these secrets and stream its output." In Nix's vocabulary this is what
Hercules CI calls an effect; outside Hercules it's barely modeled.

Concretely, this weakens Gradient and buildbot-nix as candidates and strengthens generic CI (Forgejo Actions, Jenkins) and generic queues (River, Celery, Temporal) whose
workers just run a script with whatever env they're given. The "Nix-aware" tier mainly helps with the build sub-phase (parallel eval, per-drv progress, cache push); we'd still
need to bolt on an effect-runner for the apply step.
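As a rough illustration of what "workers just run a script with whatever env they're given" means for our job shape, the sketch below runs a list of phases in order, hands the secrets to each child process through its environment, and stops at the first failing phase. The function name `run_phases`, the phase names and the `DEMO_TOKEN` variable are placeholders invented for this example; real phases would invoke nix build, tofu apply and the activation step.

```python
import os
import subprocess
import sys


def run_phases(phases, secrets):
    """Run (name, argv) phases in order and stop at the first failure.

    Secrets reach the children only through their environment.
    Returns a list of (name, returncode, stdout) for the phases that ran.
    """
    env = {**os.environ, **secrets}
    results = []
    for name, argv in phases:
        proc = subprocess.run(argv, env=env, capture_output=True, text=True)
        results.append((name, proc.returncode, proc.stdout))
        if proc.returncode != 0:
            break  # never run the impure apply step after a failed build
    return results


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere Python does.
    py = sys.executable
    demo = [
        ("build", [py, "-c", "print('built')"]),
        ("apply", [py, "-c", "import os; print(len(os.environ['DEMO_TOKEN']))"]),
    ]
    for name, code, out in run_phases(demo, {"DEMO_TOKEN": "placeholder"}):
        print(name, code, out.strip())
```

A build-only scheduler has no natural place for the second phase, which is the friction the paragraph above describes.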

Nix-aware candidates

Gradient (https://github.com/Wavelens/Gradient, AGPL-3.0, v1.1.1 May 2026, Rust)

  • Footprint: drop-in flake with nixosModules.default + nixosModules.deploy; auto-configures Postgres + nginx/caddy + JWT. No k8s. Distributed gradient-worker instances connect
    over WebSocket (zstd-precompressed).
  • API (verified against docs/gradient-api.yaml):
      • Triggering a build: POST /projects/{org}/{project}/evaluate (returns evaluation UUID); per-deployment direct builds via POST /builds.
      • Cancellation: POST /evals/{evaluation} with method: "abort" cancels all in-progress and queued builds in that evaluation. No per-build cancel endpoint exists; the feature
        is tracked at https://github.com/wavelens/gradient/issues/25.
      • Log streaming: POST /builds/{build}/log (NDJSON) and POST /evals/{evaluation}/builds (stream logs for all currently-building drvs in an eval).
      • Status: GET /builds/{build}, GET /evals/{evaluation}/builds (pagination, status filter). No dedicated queue/scheduler endpoint.
  • Replace semantics: absent at the API level. Workable pattern: map each deployment_id 1:1 to a Gradient project, each submit to a new evaluation; replace = abort prior eval +
    start new eval. Same pattern as rio-build's "cancel + resubmit" but built on coarser primitives.
  • Nix awareness: distributed evaluation + builds; differentiator per the discourse https://discourse.nixos.org/t/gradient-call-for-testers/77549 is "smart scheduling that
    minimises network traffic between builders" (substitution-aware assignment). Built-in binary cache (S3-backable). Interactive dependency graph.
  • Multi-tenancy: organizations with independent workers + access control; API-Key + OAuth2/OIDC.
  • Caveats: "quite young", maintainer's own framing is "deployable but cannot replace Hydra yet"; scheduler statistics need real-world data to improve. We'd be early adopters
    and should expect to file bugs upstream (per-build cancel chief among them).
  • Impurity fit: poor. Gradient is a build scheduler + binary cache; the "build" abstraction is a Nix derivation, not a script-with-env. Running tofu apply and SSH activation
    inside a Gradient build means either smuggling them into an impure derivation (__noChroot/__impureHostDeps, fragile) or running them outside Gradient and using Gradient only
    for the upstream evaluation+build phase — at which point Gradient is just a fancy build cache and we still need a separate runner for the deploy. The API also has no notion of
    passing per-submit secrets to a build.
  • External trigger: native — POST /projects/{org}/{project}/evaluate and POST /builds accept structured payloads, return IDs.
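The "abort prior eval + start new eval" pattern from the replace-semantics bullet can be written down as a plan of HTTP calls. This is only a sketch: the two paths and the `method: "abort"` body are the ones quoted above, while the function name, the tuple layout and the empty body for the evaluate call are assumptions made for illustration. Nothing is sent over the network.

```python
def gradient_replace_plan(org, project, prior_evaluation=None):
    """Return the (http_method, path, json_body) calls for one replace-style submit.

    Assumes one Gradient project per deployment_id, as suggested above.
    """
    calls = []
    if prior_evaluation is not None:
        # Coarse cancel: aborts every queued and in-progress build of that evaluation.
        calls.append(("POST", f"/evals/{prior_evaluation}", {"method": "abort"}))
    # Body of the evaluate call is left as None because its schema is not quoted here.
    calls.append(("POST", f"/projects/{org}/{project}/evaluate", None))
    return calls
```

The caller would store the evaluation UUID returned by the second call and pass it as `prior_evaluation` on the next submit for the same deployment.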

buildbot-nix (https://github.com/nix-community/buildbot-nix, Python+Nix, v1.1.0 Sep 2025)

  • Footprint: systemd via NixOS module. SQLite default, Postgres optional. Stable underlying Buildbot (15+ years).
  • API: Buildbot REST + WebSocket — cancellation, log streaming, queue listing all first-class and well-documented.
  • Nix awareness: parallel eval via nix-eval-jobs, per-drv builds with GC roots, cache push to Cachix/Attic/Harmonia. Workers can be tagged; no cache-locality scheduler.
  • Cancel/replace: REST cancel is native; replace = cancel-prev + resubmit at the trigger layer, with Buildbot Lock primitives to enforce per-deployment_id serialization.
  • Tradeoff in detail ("bend the flake/check vocabulary"): buildbot-nix discovers work by evaluating .#checks.<system> (or a configured attr) of a flake using nix-eval-jobs,
    scheduling one build per attribute. Its data model is "this flake has these checks, build them on push" — pull-request CI. To use it for "deploy tenant X with runtime-submitted
    config Y via WAMP" the realistic paths are:
    a. Generate a per-deployment flake attribute — write config to a flake, push to a branch, let buildbot-nix discover it. Slow, conflates git with the request queue.
    b. Bypass nix-eval-jobs entirely — write a custom Buildbot BuilderConfig + scheduler. At which point you're using Buildbot, not buildbot-nix; the module's added value drops
    to ~zero and you inherit Buildbot's master.cfg learning curve.
    c. Long-lived control flake re-evaluated on each submit — racy on the shared flake, re-eval cost on every request.

None are deal-breakers; each costs real engineering. The honest read: buildbot-nix is optimised for "git push triggers CI"; our shape is "web API triggers one-shot deploy". Picking
it likely means option (b): using buildbot-nix as a hint and writing custom Buildbot config underneath.

  • Impurity fit: actually good, via first-class effects. buildbot-nix ships a dedicated buildbot-effects subpackage that consumes hercules-style
    herculesCI.onPush.<x>.outputs.effects derivations and wires them in as ordinary Buildbot builders (<project>/run-effect). It does not depend on hercules-ci-effects at runtime;
    it bundles a minimal effects-lib.nix. Secrets are injected hercules-style via HERCULES_CI_SECRETS_JSON=/run/secrets.json (docs/EFFECTS.md); knobs include
    effects_per_repo_secrets, effects_branches, effects_extra_sandbox_paths. Effects are cancellable/observable via REST like any other build, and pipeline stages distinguish
    nix-eval / nix-build / nix-effects. Open caveats: effects run in parallel per push (#587 wants sequential), may not re-run on cached builds (#295), and the sandbox needs
    allow-listing for SSH/TF state dirs. The "bend the flake/check vocabulary" tradeoff applies only to the build sub-phase; the impure deploy fits natively as an effect.
  • External-trigger gap: stock buildbot-nix only wires git-push/PR-driven schedulers. Upstream Buildbot has POST /api/v2/scheduler/{id} with custom properties (ForceScheduler),
    but buildbot-nix doesn't configure one out of the box — we'd need to add it via Buildbot config. Real but contained.

Jenkins (https://www.jenkins.io/, MIT, in nixpkgs as services.jenkins)

Listed in #362 with cancellation ✅. Worth surfacing here because the queue primitives match well:

  • API: REST API can trigger builds (POST /job/<name>/build), cancel queued (/queue/cancelItem) and running (/job/.../stop) builds, list queue, stream console output
    (progressiveText).
  • Replace: GitHub-style concurrency comes via the "Throttle Concurrent Builds" / "Lockable Resources" plugins, or implemented at the trigger layer.
  • NixOS: services.jenkins + services.jenkinsSlave modules; nixpkgs has a test (nixos/tests/jenkins.nix).
  • Concern: runtime is JVM-heavy and configuration ergonomics are dated; multi-tenant model is plugin-driven. Probably overkill for our scale but a known-good fallback.
  • Impurity fit: good. Generic CI; jobs are shell scripts with env vars and credential injection (withCredentials). No conceptual friction with impure deploys.
  • External trigger: POST /job/<name>/buildWithParameters returns a queue-item URL pollable for the eventual build URL. Caveat: parameters must be pre-declared on the job;
    values are per-run.

Forgejo Actions (already running in our infra)

We already host forgejo-ci and king runners (see CLAUDE.md). Forgejo Actions supports:

  • Trigger via API: workflow_dispatch event over the Forgejo REST API (POST /repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches).
  • Native replace semantics: GitHub-style concurrency.group: "deployment-${{ inputs.deployment_id }}" + cancel-in-progress: true, which is exactly our requirement. #362 already
    marks it as ✅ for concurrency.
  • Cancellation: API POST /repos/{owner}/{repo}/actions/runs/{run_id}/cancel.
  • Logs: no public API yet (per CLAUDE.md note on Forgejo 16). We'd need to scrape from runner /tmp/build-*.log or wait on the upstream log-API PRs.
  • Caveat: the log gap is the only significant blocker. If we can either (a) live with runner-side log scraping or (b) sidecar a log shipper to WAMP, this is the
    lowest-new-software path of anything on the list.
  • Impurity fit: excellent. The model is run this script in a job runner with these secrets — exactly our shape. secrets: are first-class, env vars are the default, and the
    runner has full network access for tofu apply and SSH activation. Nix-awareness is supplied by the script itself (the job calls nix build / nix-fast-build), which is the right
    separation of concerns for a job that's half pure-build and half impure-effect.
  • External trigger: native — POST /repos/{owner}/{repo}/actions/workflows/{filename}/dispatches with an inputs: map, returns 201. Caveat: inputs are stringly-typed and must be
    pre-declared in on.workflow_dispatch.inputs.
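A minimal sketch of the external trigger described above, using only the standard library. It builds the dispatch request without sending it. The path is the one quoted in the bullet (prefixed with Forgejo's `/api/v1`); the `ref` field in the body, the `token` authorization scheme as used here, and all argument values are assumptions to be checked against the instance's swagger. Inputs are coerced to strings because, as noted, they are stringly-typed.

```python
import json
import urllib.request


def build_dispatch_request(base_url, owner, repo, workflow, ref, inputs, token):
    """Build (but do not send) a workflow_dispatch request for one deployment."""
    url = (
        f"{base_url.rstrip('/')}/api/v1/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow}/dispatches"
    )
    body = {
        "ref": ref,  # assumption: branch or tag holding the workflow file
        "inputs": {key: str(value) for key, value in inputs.items()},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would perform the dispatch; the `deployment_id` input is what the workflow's concurrency.group would key on.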

nix-fast-build (https://github.com/Mic92/nix-fast-build)

Not a queue — CLI parallelizer combining nix-eval-jobs + nix-output-monitor with JSON/JUnit output. Keep as the builder invoked by whichever queue we pick; it gives us
per-attribute progress for free.

Dropped Nix-aware candidates

  • Hydra — jobset/git-poll centric, no log-streaming API, can't cancel cleanly; no external-trigger path for our deployment_id + config blob shape (would require racy jobset
    mutation).
  • Hercules CI — controller is proprietary SaaS; only the agent is OSS.
  • eka-ci, typhon — per #362, "unfinished" / "abandoned-ish".
  • botanix, argunix — too new to assess as a backend (worth tracking).
  • nix-fast-build — not a queue at all; keep as the builder invoked inside whichever runner we pick.
  • Concourse, Woodpecker, Dagger, Agola, Tangled Spindle, Buildkite, Garnix, nix-ci (from #362 survey) — either no clear externally-triggerable job API for our
    submit-with-payload shape, or SaaS-only, or much smaller adoption than the four CI options above. Not pursued; reconsider only if a specific gap surfaces.

Generic queue backends

The API spec is backend-agnostic, so a non-Nix queue is viable — the worker shells out to nix build / nix-fast-build and we stream output ourselves. Cost: we own Nix-progress
reporting (parse nix-output-monitor JSON).
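"We stream output ourselves" could look roughly like the sketch below: the worker reads the child's output line by line, stores it in an append-only buffer, and lets readers resume from a line offset (the since_line idea from the API table). `LogBuffer`, `stream_command` and the `publish` callback are hypothetical names; in our setup `publish` would be where a WAMP event gets emitted. Parsing Nix progress output is not attempted here.

```python
import subprocess
import sys


class LogBuffer:
    """Append-only line store; readers resume from a line offset."""

    def __init__(self):
        self.lines = []

    def append(self, line):
        self.lines.append(line)
        return len(self.lines)  # sequence number, usable as the next since_line

    def since(self, since_line=0):
        return self.lines[since_line:]


def stream_command(argv, buffer, publish=lambda seq, line: None):
    """Run argv and feed each output line into the buffer as it arrives."""
    with subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        for raw in proc.stdout:
            line = raw.rstrip("\n")
            publish(buffer.append(line), line)
    # Leaving the with-block waits for the child, so returncode is set here.
    return proc.returncode


if __name__ == "__main__":
    buf = LogBuffer()
    # Stand-in for nix build / nix-fast-build.
    stream_command([sys.executable, "-c", "print('eval'); print('build')"], buf, print)
    print(buf.since(1))
```

A client that reconnects after having seen N lines calls `since(N)` and then continues with live events, which is the resume behaviour the requirements ask for.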

Project: River (https://github.com/riverqueue/river)
Lang / store: Go + Postgres
Nix packaging: Pkg in nixpkgs; no module yet
Replace: First-class UniqueOpts (args+queue+state)
Cancel: Native
Log streaming: Telemetry subs only; we build on top
Notes: Cleanest unique/cancel primitives. Postgres reuses panel's store.

Project: NATS JetStream
Lang / store: Go single binary
Nix packaging: services.nats module in nixpkgs
Replace: Native via AllowRollup + per-subject dedup (deployment_id as subject)
Cancel: Out-of-band only
Log streaming: Build on KV/extra stream
Notes: Best native replace; weakest cancellation+log story. Worthwhile if NATS earns its keep elsewhere.

Project: Temporal
Lang / store: Go + Postgres/Cassandra, multi-service
Nix packaging: pkgs.temporal exists; no NixOS module
Replace: WorkflowIDReusePolicy=TerminateIfRunning is exactly our semantic
Cancel: Native, plus signals/queries
Log streaming: Via activities
Notes: Semantic match is unbeatable; 5+ services is disproportionate at our scale. Revisit if orchestration grows multi-step DAGs.

Project: Celery + Redis / Procrastinate
Lang / store: Python + Redis or Postgres
Nix packaging: In nixpkgs with modules
Replace: Redis/Postgres lock + revoke prior task
Cancel: Native (SIGTERM)
Log streaming: Custom: subprocess stdout → pub/sub → WAMP
Notes: Lowest engineering cost today if we stay Python; locks us harder to Python than we want long-term.

Dropped: RQ (strict subset of Celery), pg-boss (Node mismatch), django-q2 (subset of Celery feature set, same Python lock-in).

Ranked recommendation

Reordering after the impurity + external-trigger checks: generic CI gains, Gradient loses, buildbot-nix is salvaged by its first-class hercules-style effects support.

  1. Forgejo Actions — already running on our infra; native replace via concurrency.group: "deployment-${{ inputs.deployment_id }}" + cancel-in-progress: true; trigger+cancel
    are first-class API calls; secrets and env vars are first-class; runner runs nix build and tofu apply in one job. Only gap is log streaming via public API; runner-side scrape
    or sidecar shipper closes it. Lowest new software footprint and best fit for half-pure-half-impure jobs.
  2. buildbot-nix (with buildbot-effects) — the Nix-aware option that actually models our shape: pure builds via nix-eval-jobs + impure deploy via a hercules-style effect with
    /run/secrets.json. Effects are observable/cancellable via Buildbot's REST API. Cost: write a small Buildbot config to add a ForceScheduler (stock buildbot-nix is git-push
    driven) and decide between the three "flake vocabulary" patterns for submit-time configs. Pick this if per-drv visibility and cache push (Cachix/Attic/Harmonia) are real
    product requirements.
  3. River (or Procrastinate) — generic Postgres-backed queue; cleanest unique/cancel primitives of the generic tier; worker is a plain process so impure deploys are natural;
    we own Nix-progress reporting and packaging a NixOS module. Best choice if we want the queue conceptually decoupled from any CI tool.
  4. Jenkins — known-good fallback; documented queue API, buildWithParameters for external trigger, mature withCredentials, in nixpkgs. JVM weight and dated ergonomics are the
    cost.
  5. Gradient — rio-build's spiritual successor and the only "Nix CI" with rio-build-like distributed scheduling, but its data model is pure Nix build + cache, not impure
    deploy with secrets. No effects concept; we'd use it for the build sub-phase only and need a separate runner for apply. Reconsider if it grows effects support or once the
    apply phase can itself be modeled as a derivation.
  6. NATS JetStream — only if NATS earns its keep elsewhere; replace primitive is elegant but cancellation and log streaming are work we'd own.
  7. Temporal — correct semantics, operational weight disproportionate at our scale today.

The forgejo-actions spike #973 (+ CI runs) tried to use dispatch.py + .forgejo/workflows/deploy.yml as a starting point for the WAMP fediversity.deployment.* RPCs, and turned up the following gaps:

  • #973 dispatch.py find-run is broken. Forgejo's GET .../actions/runs listing does not return inputs in a usable form (trigger_event is a string, not the structured object the driver expects). find-run and replace-test cannot locate runs by deployment_id. The WAMP wiring needs another correlation strategy (return the run id from a wrapping API, or write the deployment_id into a tag/branch we can list on).
  • No log-streaming API (one would help with #184), and runner-side scraping is limited. Job stdout is only on the Forgejo server (no public endpoint). The /tmp/build-<name>.log files on forgejo-ci only exist for jobs that go through the local nix build wrapper, so the spike workflow doesn't produce them. Mitigations to consider: tee to a known path inside the workflow step, or wait for upstream PRs (#8873 / #11330).
  • No cancel endpoint in Forgejo 16 (gitea: #37590): the swagger lists only GET endpoints under /repos/{owner}/{repo}/actions/runs/... and .../jobs. The WAMP deployment.cancel RPC will need to trigger the cancel indirectly, e.g. by dispatching a no-op run with the same concurrency.group so that cancel-in-progress does the work.
kiara referenced this issue from a commit 2026-05-23 15:49:18 +02:00