Introduce test for deploying all services with nixops4 apply #329

Merged
fricklerhandwerk merged 19 commits from Niols/Fediversity:integration-test-multiple-rebased into main 2025-05-19 02:18:56 +02:00
Owner

Closes Fediversity/Fediversity#276

This PR adds a CLI deployment test. It builds on top of Fediversity/Fediversity#323. This test features a deployer node and four target nodes. The deployer node runs `nixops4 apply` on a deployment built with our actual code in `deployment/default.nix`, which pushes onto the four target machines combinations of Garage/Mastodon/Peertube/Pixelfed depending on a JSON payload. We check that the expected services are indeed deployed on the machines. Getting there involved reworking the existing basic test to extract common patterns, and adding support for ACME certificate negotiation inside the NixOS test.

What works:

  • deployer successfully runs nixops4 apply with various payloads
  • target machines indeed get the right services pushed onto them and removed
  • services on target machines successfully negotiate ACME certificates
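
The comparison behind the first two points can be sketched roughly as below. This is an illustration only, not the code in this PR: the payload shape (`{"mastodon": {"enable": true}}`) and the helpers `expected_services` and `mismatches` are assumptions made up for the example, and the real test inspects the target machines rather than a Python set.

```python
import json

# The four services named in the description. Treating them as top-level
# payload keys is an assumption of this sketch.
SERVICES = ("garage", "mastodon", "peertube", "pixelfed")


def expected_services(payload_json: str) -> set[str]:
    """Return the services a payload asks for (assumed payload shape)."""
    payload = json.loads(payload_json)
    return {s for s in SERVICES if payload.get(s, {}).get("enable", False)}


def mismatches(payload_json: str, observed: set[str]) -> dict[str, set[str]]:
    """Compare what the payload requests with what is observed on the targets."""
    expected = expected_services(payload_json)
    return {
        "missing": expected - observed,     # requested but not deployed
        "unexpected": observed - expected,  # deployed but not requested, e.g. not removed
    }


if __name__ == "__main__":
    payload = json.dumps({"mastodon": {"enable": True}, "peertube": {"enable": False}})
    print(mismatches(payload, {"mastodon", "pixelfed"}))
```

Both sets being empty corresponds to "the right services pushed onto them and removed"; a non-empty `unexpected` set would point at a service that was not cleaned up after a later payload disabled it.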

What does not work: the services themselves depend a lot on DNS and that is not taken care of at all, so they are probably very broken. Still, this is a good milestone.

Future work:

  • Working on DNS in the test, making sure the services run correctly, actually testing the services (e.g. with Selenium scripts)
  • A deployment of some services (Fediversity? dummy?) triggered from the panel (← this is what I will work on next).
  • A deployment of some services triggered from Terraform (@kiara is on it, I believe).

Test it yourself by running `nix build .#checks.x86_64-linux.deployment-basic -vL` and `nix build .#checks.x86_64-linux.deployment-cli -vL`. On the very beefy machine that I am using, the basic test runs in ~4 minutes and the CLI test in ~17 minutes. We know from Fediversity/Fediversity#323 that the basic test runs in ~12 minutes on the CI runner, so maybe about an hour for the CLI test?
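
The "about an hour" guess is a simple ratio of the timings quoted above. As a back-of-the-envelope sketch, assuming the CI runner slows both tests down by the same factor (which is only an assumption; `estimate_ci_minutes` is a name made up for this example):

```python
# Timings quoted above, in minutes.
basic_local = 4   # basic test, local machine
basic_ci = 12     # basic test, CI runner (from #323)
cli_local = 17    # CLI test, local machine


def estimate_ci_minutes(local: float, ref_local: float, ref_ci: float) -> float:
    """Scale a local timing by the CI/local ratio of a reference test."""
    return local * (ref_ci / ref_local)


cli_ci_estimate = estimate_ci_minutes(cli_local, basic_local, basic_ci)
print(f"estimated CLI test time on CI: ~{cli_ci_estimate:.0f} minutes")
```

With these inputs the slowdown factor is 3, so the estimate comes out at roughly 51 minutes.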

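You can check the rendering of the documentation (in particular for Mermaid diagrams) [here](https://git.fediversity.eu/Niols/Fediversity/src/branch/integration-test-multiple-rebased/deployment/README.md).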

I believe the commits are good as they are and should be fast-forwarded only.

Niols added 3 commits 2025-05-08 09:57:22 +02:00
Introduce CLI deployment test
Some checks failed
/ check-pre-commit (pull_request) Successful in 11s
/ check-peertube (pull_request) Failing after 10s
/ check-panel (pull_request) Successful in 1m14s
/ check-deployment-basic (pull_request) Successful in 44m58s
/ check-deployment-cli (pull_request) Successful in 44m56s
2bfca55a07
fricklerhandwerk reviewed 2025-05-08 10:08:58 +02:00
@ -7,0 +15,4 @@
$ nix build .#checks.<system>.deployment-<name> -vL
```
## Basic deployment check

### Basic deployment check

Author
Owner

Done in cc0c5d0519.

Niols marked this conversation as resolved
fricklerhandwerk reviewed 2025-05-08 10:11:47 +02:00
@ -7,0 +14,4 @@
``` console
$ nix build .#checks.<system>.deployment-<name> -vL
```

Maybe worth mentioning here that since `nixops4 apply` operates on a flake, the tests take this repository's flake as a template, and this is also why there are some dummy files that will be overwritten inside the test. Although why are we even keeping the unused files? They could just as well all be *created* during the test, no?
Author
Owner

Done in cc0c5d0519. I did remove most of the files, but some are still necessary when building the VMs initially. I would like to get rid of them, though. We could also imagine getting rid of flakes but creating a whole flake in the test.

Niols marked this conversation as resolved
fricklerhandwerk reviewed 2025-05-08 10:27:04 +02:00
@ -0,0 +236,4 @@
in
''
deployer.copy_from_host("${targetNetworkJSON}", "/root/target-network.json")
deployer.succeed("mv /root/target-network.json work/${pathFromRoot}/${tm}-network.json")

Why not copy it to the final location directly?

Author
Owner

That's how it was done in nixops4-nixos; I assumed there was a reason and never checked. Done in cc0c5d0519.
Niols marked this conversation as resolved
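
For readers skimming this thread, the difference between the two patterns can be sketched outside the test driver. `RecordingMachine` below is a hypothetical stand-in that only records calls; it mimics the `copy_from_host` and `succeed` methods used in the snippet above but does nothing else, and the paths are placeholders. Whether the real driver needs the target directory to exist beforehand is not something this sketch checks.

```python
class RecordingMachine:
    """Hypothetical stand-in for a test-driver machine; it only records calls."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, ...]] = []

    def copy_from_host(self, source: str, target: str) -> None:
        self.calls.append(("copy_from_host", source, target))

    def succeed(self, command: str) -> str:
        self.calls.append(("succeed", command))
        return ""


def copy_then_move(machine: RecordingMachine, source: str, final: str) -> None:
    # Pattern questioned in the review: copy to a scratch path, then move.
    machine.copy_from_host(source, "/root/target-network.json")
    machine.succeed(f"mv /root/target-network.json {final}")


def copy_directly(machine: RecordingMachine, source: str, final: str) -> None:
    # Suggested pattern: a single call, no intermediate file.
    machine.copy_from_host(source, final)


if __name__ == "__main__":
    machine = RecordingMachine()
    copy_directly(machine, "/hypothetical/source.json", "/hypothetical/final/network.json")
    print(machine.calls)
```

The direct variant issues one call instead of two and leaves no intermediate file to keep track of.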
fricklerhandwerk reviewed 2025-05-08 10:34:20 +02:00
@ -0,0 +35,4 @@
./minimalTarget.nix
(lib.modules.importJSON (pathToRoot + "/${pathFromRoot}/${tm}-network.json"))
]
++ optional enableAcme (makeAcmeClientModule {

It would be a lot simpler and much more idiomatic if you enabled acme right there in the module. Then we wouldn't need the extra file, which makes it harder to track down what's happening.

```nix
{
  # ...
  nixos.module = nixos: {
    # ...
    security.acme = {
      acceptTerms = enableAcme;
      # ...
    };
    networking.extraHosts = mkIf nixos.config.security.acme.acceptTerms "${acmeNodeIP acme.test}";
    # ...
  };
}
```
Author
Owner

Done in b03973603e.

Niols marked this conversation as resolved
kiara reviewed 2025-05-08 10:42:53 +02:00
@ -0,0 +92,4 @@
extraTestScript = ''
with subtest("Run deployment with no services enabled"):
deployer.succeed("cd work && nixops4 apply check-deployment-cli-nothing --show-trace --no-interactive 1>&2")
Owner

where does this `work` directory come from?

It's created in the `unpacking` subtest defined in the [test module generator](https://git.fediversity.eu/Fediversity/Fediversity/pulls/329/files#diff-708976230d3a93d725a457b5a9407dc82ec3710b).
Author
Owner

I think this shows a lack of documentation and that's a good point. Maybe we can even unpack the flake in the current working directory such that we don't need to `cd work`. I have to check whether there are things in the current working directory that would clash with that.
Author
Owner

Done in cc0c5d0519.

Niols marked this conversation as resolved

TODOs from verbal sync:

  • modularise all test parameters, it's the main barrier to readability
  • try to fold the `fake` node into a nested derivation (not a blocker)
  • try to address the minor FIXMEs such as the dangling `peertube` and `gixy` dependencies (nice to have)
kiara approved these changes 2025-05-08 11:03:15 +02:00
Niols added 1 commit 2025-05-08 14:33:46 +02:00
Apply suggestions from code review
Some checks failed
/ check-pre-commit (pull_request) Successful in 11s
/ check-peertube (pull_request) Failing after 12s
/ check-panel (pull_request) Successful in 1m9s
/ check-deployment-basic (pull_request) Successful in 6s
/ check-deployment-cli (pull_request) Successful in 41m49s
cc0c5d0519
Niols force-pushed integration-test-multiple-rebased from 0c90805339 to cc0c5d0519 2025-05-08 15:01:50 +02:00 Compare
Niols added 2 commits 2025-05-09 08:50:24 +02:00
Get rid of fake node
Some checks failed
/ check-pre-commit (pull_request) Successful in 12s
/ check-peertube (pull_request) Failing after 11s
/ check-panel (pull_request) Successful in 1m13s
/ check-deployment-basic (pull_request) Successful in 43m41s
/ check-deployment-cli (pull_request) Successful in 43m39s
dfa1f8cf9e
Author
Owner

Added two commits. The first one re-modularises the common parts of the test. This is a whole reorganisation. The second one gets rid of the fake node concept and makes a temporary NixOS system to achieve the same goal. I am not sure it makes sense to review these commits separately as they really touch everything.

Author
Owner

TODOs for oral sync:

  • Type and document all options
  • Merge `common/acme/*` into `common/*`
  • Make `testCerts` an option with default `import ...`
  • Simplify node.acme

    ```
    imports = [ "${inputs.nixpkgs}/nixos/tests/common/acme/server" ];
    systemd.services.pebble.environment.PEBBLE_VA_ALWAYS_VALID = "1";
    ```
Niols added 1 commit 2025-05-09 15:51:13 +02:00
Apply suggestions from code review
Some checks failed
/ check-pre-commit (pull_request) Successful in 11s
/ check-peertube (pull_request) Failing after 11s
/ check-panel (pull_request) Successful in 1m11s
/ check-deployment-basic (pull_request) Successful in 42m50s
/ check-deployment-cli (pull_request) Successful in 42m48s
1e930fd23e
Fediversity/Fediversity#329 (comment)
Author
Owner

@Niols wrote in https://git.fediversity.eu/Fediversity/Fediversity/pulls/329#issuecomment-6532:

> TODOs for oral sync:
>
>   • Type and document all options
>   • Merge `common/acme/*` into `common/*`
>   • Make `testCerts` an option with default `import ...`
>   • Simplify node.acme
>
>     ```text
>     imports = [ "${inputs.nixpkgs}/nixos/tests/common/acme/server" ];
>     systemd.services.pebble.environment.PEBBLE_VA_ALWAYS_VALID = "1";
>     ```

All done. What do we think of the current status of things?
Author
Owner

With this last commit, CI should be a happy green. Is there anything that remains before we can merge? I only think of things that can be left out and added in another PR, except maybe for the name of this “CLI” deployment test - now would be a good time to find a better name.

Owner

@Niols i'm good, thanks for finishing this!
@fricklerhandwerk?

fricklerhandwerk changed title from Introduce CLI deployment test to Introduce test for deploying all services with nixops4 apply 2025-05-15 11:35:04 +02:00
fricklerhandwerk reviewed 2025-05-15 11:35:34 +02:00
@ -7,0 +56,4 @@
### CLI deployment check
The CLI deployment check factors out the panel by running a direct invokation of

invocation

Author
Owner

Done in a328818f1d7c8a5e055b90eb372be82e3fccd252.

Niols marked this conversation as resolved
fricklerhandwerk reviewed 2025-05-15 11:36:20 +02:00
@ -7,0 +54,4 @@
deployer -->|deploys| target_machines
```
### CLI deployment check
### Service deployment check using `nixops4 apply`

How about this?

Author
Owner

Sounds good. And for the name in the code, `deployment-cli` -> `deployment-nixops4`?
Author
Owner

Done in a328818f1d7c8a5e055b90eb372be82e3fccd252.

fricklerhandwerk reviewed 2025-05-15 11:37:03 +02:00
@ -7,0 +124,4 @@
### [WIP] Panel deployment check
This is a full deployment check running the panel on the deployer machine,
deploying some services and checking that they are indeed on the target

deploying some services through the panel, and...

Author
Owner

Done in a328818f1d7c8a5e055b90eb372be82e3fccd252.

Niols marked this conversation as resolved

This looks pretty good from afar, but I'd like to take another full pass and then merge.

Niols force-pushed integration-test-multiple-rebased from a328818f1d to 0b3b252292 2025-05-16 12:21:23 +02:00 Compare
Author
Owner

Rebased on recent `main`.
Owner

ci fail :(

Author
Owner

Ah dayum :-(

Author
Owner

I guess the culprit is the freshly merged Fediversity/Fediversity#330; we should have waited for this test, probably.

Niols force-pushed integration-test-multiple-rebased from 27634f2294 to 9dcd0e360e 2025-05-17 09:12:01 +02:00 Compare
fricklerhandwerk added 1 commit 2025-05-18 19:35:05 +02:00
readme: one line per sentence, small wording fixes
All checks were successful
/ check-pre-commit (pull_request) Successful in 11s
/ check-peertube (pull_request) Successful in 18s
/ check-panel (pull_request) Successful in 1m7s
/ check-deployment-basic (pull_request) Successful in 39m40s
/ check-deployment-cli (pull_request) Successful in 39m38s
8d67e49ba1
fricklerhandwerk added 2 commits 2025-05-18 22:20:27 +02:00
simplify path finding
All checks were successful
/ check-pre-commit (pull_request) Successful in 11s
/ check-peertube (pull_request) Successful in 18s
/ check-panel (pull_request) Successful in 1m5s
/ check-deployment-basic (pull_request) Successful in 41m54s
/ check-deployment-cli (pull_request) Successful in 41m52s
76802f96a7
fricklerhandwerk added 2 commits 2025-05-19 01:24:41 +02:00
don't use dummy files
All checks were successful
/ check-pre-commit (pull_request) Successful in 11s
/ check-peertube (pull_request) Successful in 17s
/ check-panel (pull_request) Successful in 1m8s
/ check-deployment-basic (pull_request) Successful in 42m16s
/ check-deployment-cli (pull_request) Successful in 42m14s
9710c72f58
seems they're not needed after all
fricklerhandwerk approved these changes 2025-05-19 02:17:22 +02:00
fricklerhandwerk left a comment
Owner

Pretty good, awesome that it's all green now! Let's merge it like that, and keep cleaning it up. It's still a bit too hard to follow for my taste, because some unnecessary abstraction obscures the actual test cases. Ideally test cases would contain pretty much a (simplified) deployment as one would actually write it for production, and next to it a script that interacts with it to check invariants.

fricklerhandwerk merged commit ee5c2b90b7 into main 2025-05-19 02:18:56 +02:00
fricklerhandwerk deleted branch integration-test-multiple-rebased 2025-05-19 02:18:56 +02:00
Reference: fediversity/fediversity#329