NixOS tests break when CI runs them in parallel #362

Open
opened 2025-06-04 19:38:54 +02:00 by kiara · 0 comments
Owner

As per @Niols, our NixOS tests often break because CI runs them all at the same time. A common symptom seems to be `Input/output error`. Options to address this seem to include:

- [x] As a work-around to get these to work again for now, we could limit the runner to `capacity` 1 so that it only ever takes one job. While slow, this would be better than CI failing altogether.
- [ ] Isolate the CI runs. Note that this poses the challenge of providing enough isolation to resolve the problem without unnecessarily affecting performance. Options include:
  - have many runners (e.g. Nix containers, see #405) running on shared resources, while trying not to pay too much for the isolation. With our hardware machine Forgejo-CI this might be our best option.
  - ~~many Proxmox VMs that each only accept one job, tho this might be excruciatingly slow.~~
- [ ] If we manage the above and builds no longer come from the same store, we should use a Nix cache (#92) shared between our runners and ourselves, as some of this stuff is expensive to build, yet sharing the Nix daemon doesn't work well.
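For reference, the `capacity` work-around could look roughly like the following NixOS module fragment. This is a minimal sketch, not our actual configuration: it assumes the runner is deployed through the nixpkgs `services.gitea-actions-runner` module, and the instance name, runner name, token path and label are placeholders made up for illustration.

```nix
# Sketch of a runner host module; every concrete value below is a placeholder.
{ pkgs, ... }:
{
  services.gitea-actions-runner = {
    # Assumption: the Forgejo runner package is called `forgejo-runner` in the
    # nixpkgs revision in use (older revisions call it `forgejo-actions-runner`).
    package = pkgs.forgejo-runner;

    instances.default = {
      enable = true;
      name = "forgejo-ci"; # hypothetical runner name
      url = "https://git.fediversity.eu";
      tokenFile = "/run/secrets/forgejo-runner-token"; # hypothetical path
      labels = [ "native:host" ]; # hypothetical label for jobs run directly on the host

      # `settings` is rendered into the runner's config file;
      # `runner.capacity` is the number of jobs the runner takes concurrently.
      settings.runner.capacity = 1;
    };
  };
}
```

With `capacity = 1`, jobs queue up behind each other on this runner rather than competing for the same store and daemon, which is where the slowness mentioned above comes from.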

cf. CI resource issues (#13)
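Once runners are isolated, pointing them (and developer machines) at a shared binary cache might look like the fragment below. This is only a sketch under assumptions: the cache URL and public key are placeholders, and how the cache gets populated (which is what #92 is about) is left out entirely.

```nix
# Sketch of cache settings to import on each runner and developer machine.
{
  nix.settings = {
    # Hypothetical cache endpoint; NixOS keeps its default cache.nixos.org
    # entry in addition to what is listed here.
    substituters = [ "https://cache.example.org" ];

    # Placeholder: replace with the real public key of the cache's signing key.
    trusted-public-keys = [ "cache.example.org-1:REPLACE_WITH_PUBLIC_KEY" ];
  };
}
```

Each isolated runner would then fetch expensive builds from the cache instead of rebuilding them, without having to share a Nix daemon with the other runners.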

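### CI comparison

| CI | license | nix integration | [forgejo integration](https://git.fediversity.eu/fediversity/fediversity/settings/hooks) | secret scoping | concurrency groups (to cancel previous / overridden commits) | parallel runs (#362) |
|-|-|-|-|-|-|-|
| [`gitea`](https://docs.gitea.com/next/usage/actions/act-runner/) / [`forgejo` runner](https://forgejo.org/docs/latest/admin/actions/) | ✅ GPLv3+ | no special support but allow native runners | ✅ | [✅](https://forgejo.org/docs/next/user/actions/reference/#onpull_request) | [functionality present](https://forgejo.org/docs/next/user/actions/reference/#concurrencygroup), tho i did not get it to cancel on push-force | |
| [`buildbot`](https://github.com/buildbot/buildbot) | ✅ GPL2.0 | ✅ [`buildbot-nix`](https://github.com/nix-community/buildbot-nix/) (MIT) | ✅ [thru `buildbot-nix`](https://github.com/nix-community/buildbot-nix/?tab=readme-ov-file#step-3-configure-buildbot-nix) or (probably) [gitea plugin](https://github.com/lab132/buildbot-gitea) | [secrets](https://docs.buildbot.net/current/manual/secretsmanagement.html) incl. [effects](https://github.com/nix-community/buildbot-nix/#experimental-hercules-ci-effects) | [✅](https://buildbot.readthedocs.io/en/stable/manual/configuration/services/old_build_canceller.html) | [test](https://github.com/NixOS/nixpkgs/blob/master/nixos/tests/buildbot.nix) |
| [`jenkins`](https://www.jenkins.io/) | ✅ MIT | ✅ [wiki](https://wiki.nixos.org/wiki/Jenkins), [test](https://github.com/NixOS/nixpkgs/blob/master/nixos/tests/jenkins.nix), [`services.jenkins`](https://search.nixos.org/options?channel=unstable&show=services.jenkins.jobBuilder.jsonJobs&query=services.jenkins.) / [`services.jenkinsSlave`](https://search.nixos.org/options?channel=unstable&show=services.jenkins.jobBuilder.jsonJobs&query=services.jenkinsSlave.) | [✅](https://github.com/go-gitea/gitea/issues/18299#issuecomment-1044210813) | [✅](https://www.jenkins.io/doc/developer/security/secrets/) | [✅](https://community.jenkins.io/t/how-to-cancel-redundant-builds-on-branches-pulls-but-not-master/5381/2) | nodes added [manually](http://localhost:8080/manage/computer/) or on demand by [cloud providers](http://localhost:8080/manage/pluginManager/available?filter=Cloud+Providers) incl. nomad/k8s/docker/mesos/vsphere/jclouds/libvirt/cloudify/buildstash/sharing/proxmox |
| [`buildkite`](https://github.com/buildkite/agent) | ❌ runner MIT, web UI proprietary - ~~may be acceptable using gerrit (#302)?~~ tokens [depend on their proprietary platform](https://buildkite.com/docs/agent/v3/tokens#create-a-token) | ✅ [thru](https://git.snix.dev/snix/snix/src/branch/canon/nix/buildkite/default.nix) [nix](https://discourse.nixos.org/t/announcing-nixkite-buildkite-pipelines-using-the-nixos-module-system/7266) [generators](https://github.com/hackworthltd/nix-buildkite-plugin/) | ✅ by [`buildkite-forgejo-webhook`](https://github.com/rscorer/buildkite-forgejo-webhook) | ✅ [thru policies](https://buildkite.com/docs/pipelines/security/secrets) | [✅](https://buildkite.com/docs/pipelines/configure/workflows/controlling-concurrency) (with [env vars](https://buildkite.com/docs/pipelines/configure/environment-variables#variable-interpolation)) | |
| [`garnix`](https://garnix.io/) | ❌ proprietary | ✅ | | | | |
| [`hercules-ci`](https://hercules-ci.com/) | ❌ proprietary | ✅ | | | | |
| [`nix-ci`](https://nix-ci.com/) | ❌ proprietary | ✅ | | | | |
| [`hydra`](https://github.com/NixOS/hydra/) [`services.hydra`](https://search.nixos.org/options?channel=unstable&query=services.hydra) | ✅ GPL-3.0 | ✅ made for nixpkgs | [hooks](https://github.com/NixOS/hydra/blob/master/doc/manual/src/webhooks.md) / ✅ [tokens](https://discourse.nixos.org/t/hydra-integration-without-exposing-authentication-tokens-to-the-nix-store/13117) | ❌ | | |
| [`gradient`](https://github.com/Wavelens/Gradient) (supports OIDC, not yet LDAP) | ✅ AGPL-3.0 | ✅ | [TODO](https://github.com/wavelens/gradient/issues/25) | [thru deploy `apiKeyFiles`](https://petstore.swagger.io/?url=https://raw.githubusercontent.com/wavelens/gradient/master/docs/gradient-api.yaml) | | |
| [`eka-ci`](https://github.com/ekala-project/eka-ci) | ✅ AGPL-3.0 | ? (made for eka over vanilla nix) | | ❌ (also just seems unfinished) | | |
| [`typhon`](https://typhon-ci.org/) | ✅ AGPL3.0 | ✅ | | ❌ (plus abandoned-ish) | | |
| [`woodpecker`](https://woodpecker-ci.org/) / [crow](https://www.reddit.com/r/opensource/comments/1ittoe9/crow_ci_dronewoodpecker_fork/) | ✅ Apache-2.0 | e.g. [`flake-pipeliner`](https://github.com/pinpox/woodpecker-flake-pipeliner) | ✅ | ✅ | [❌](https://github.com/woodpecker-ci/woodpecker/issues/1461) | ✅ |
| [`dagger`](https://dagger.io/) | ✅ Apache-2.0 | ? | ? | ? | ❌ | |
| [`agola`](https://agola.io/) | ✅ Apache-2.0 | ? | ✅ | ? | ❌ | |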
kiara added this to the Fediversity project 2025-06-18 14:33:07 +02:00
kiara self-assigned this 2025-08-07 15:13:32 +02:00
kiara closed this issue and added the 2 points label 2025-12-02 19:11:52 +01:00
kiara 2025-12-15 15:45:05 +01:00
Blocks
#224 automated dev-ops workflows
fediversity/fediversity
Reference: fediversity/fediversity#362