Beefier Forgejo actions machines #13

Open
opened 2024-11-18 12:33:43 +01:00 by Niols · 10 comments
Owner

from [#7 (comment)](https://git.fediversity.eu/fediversity/fediversity/pulls/7#issuecomment-3201):

> The actions runners take forever to run my tests. On the Tweag builders, they run in under 2 minutes. On my laptop, they run in 15 to 30 minutes, depending on whether I'm also in a call at the same time, for instance. Here, they take nearly an hour! Is there any way we could get beefier machines? I know we can add more cores and more RAM, but the effect is only marginal; I'd rather have faster cores, but I don't know if that is possible.

Author
Owner

It was discussed today in the stand-up that this wouldn't be so easy to provide.

@koen and @kevin are supposed to look into providing faster machines (I think they had i3 cores?), or seeing if something can be made faster without changing the machines (maybe CPU options in Proxmox? no idea about any of this)

I am supposed to look into making Selenium more lenient. The problem is that I know how to tell Selenium to wait potentially hours between each step, but I don't know how to tell it to wait for the browser to start at the very beginning.

It was mentioned to look at using the Firefox driver instead of the Chrome driver. I don't think that this would change anything about the situation, and we initially switched to the Chrome driver because the Firefox one was missing a feature (probably JS console log extraction; we're not quite sure anymore). Still, I will have a look.

Niols self-assigned this 2024-11-18 16:14:11 +01:00


@Niols What exactly does Selenium need to wait for here? You can `server.wait_for_unit()` and the like, but that seems to already happen. Or is it supposed to wait for the browser engine itself to start up?
Author
Owner

> is it supposed to wait for the browser engine itself to start up?

I think that's it, yes: Selenium probably starts the browser and then waits for it to answer a certain API on a certain port. I suppose this operation reaches a timeout if things are too slow.
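The per-step waits mentioned earlier are the easy part (Selenium's `WebDriverWait` with a generous timeout); the startup wait is exactly the port polling described above. As a minimal pure-Python sketch of that kind of polling with a configurable timeout (the helper `wait_for_port` and its parameters are my own illustration, not Selenium API):

```python
import socket
import time

def wait_for_port(host, port, timeout=300.0, interval=0.5):
    """Poll until something accepts TCP connections on (host, port).

    Hypothetical helper: returns the elapsed seconds on success,
    raises TimeoutError if nothing answers within `timeout`.
    """
    start = time.monotonic()
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            if time.monotonic() - start > timeout:
                raise TimeoutError(f"{host}:{port} not up after {timeout}s")
            time.sleep(interval)
```

If Selenium's own driver-startup timeout turns out not to be configurable from the test, wrapping the driver launch behind a poll like this is one way to make slow VMs survivable.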
Owner


our runner seems to be an [Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz](https://www.cpubenchmark.net/cpu.php?cpu=Intel+Xeon+E5-2690+v3+%40+2.60GHz&id=2364), i.e. in the [i3 range](https://www.cpubenchmark.net/desktop.html). @Niols, did you mean i3 as the desired spec or the current one?
Author
Owner

We had chased this a while back with @kevin and concluded that the runners were fine but that the layer of virtualisation was quite costly. Also, IIRC, the problem wasn't actually CPU but IO. @kevin had managed to find some QEMU flags that would boost the CPU of the VMs and made some difference, but the performance was never near the runner's in terms of IO.

This information is probably lost deep in the Matrix history, but maybe some stand-up notes from back then still have some of it?

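For reference, the QEMU-level tweak alluded to above is most likely the CPU model: Proxmox defaults VMs to a generic CPU type, and switching it to `host` passes the host CPU's flags through to the guest. A sketch of what that looks like, assuming a VM id of 100 (the id is hypothetical):

```shell
# Proxmox CLI: pass the host CPU model through to the VM (VM id 100 is hypothetical).
qm set 100 --cpu host

# Roughly equivalent flag when invoking QEMU by hand:
#   qemu-system-x86_64 -enable-kvm -cpu host ...
```

Note that this mainly helps CPU-bound work; an IO gap would need different tuning (disk bus, cache mode, and so on).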
Owner

Not sure I could read old Matrix messages, but the [notes contain](https://git.fediversity.eu/Fediversity/meta/src/commit/36f3332b69ad0956e53dab7115dcb4f8655c4256/meeting-notes/2024-11-19%20standup%20notes.md?display=source#L19):

> Kevin: Worked on providing more resources to the CI runner. May have to rack a machine with a faster CPU
Author
Owner

Sounds quite vague. IIRC we did try with a beefier Proxmox node, but it made little difference. At some point, we tried with a bare metal machine and that was very efficient, but obviously a bit more annoying to maintain than VMs, and less robust.

Owner

@niols I did some digging in our DMs and here is a bit of a summary.

The original CI machine was very slow and was running on our old VM environment.

We made a new one on the Proxmox and it was faster-ish; after some tweaking and tuning for nested VMs/containers it got a bit better, but still not good enough: CI took something like 15–20 min.

This was the testing from back then. The difference in IO compared to a physical node was huge, while the CPU and VM stats were similar:

```
[root@forgejo-ci:~]#  nix run --extra-experimental-features 'nix-command flakes' nixpkgs#stress-ng -- --cpu 1 --iomix 1 --vm 1 --vm-bytes 1G --timeout 60 --metrics-brief
stress-ng: info:  [3868] setting to a 1 min run per stressor
stress-ng: info:  [3868] dispatching hogs: 1 cpu, 1 iomix, 1 vm
stress-ng: metrc: [3868] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: metrc: [3868]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: metrc: [3868] cpu               73191     60.00     59.39      0.01      1219.77        1232.12
stress-ng: metrc: [3868] iomix             33049     60.06      1.16      5.85       550.28        4712.94
stress-ng: metrc: [3868] vm              4412533     61.50     49.34     11.35     71747.88       72716.44
stress-ng: info:  [3868] skipped: 0
stress-ng: info:  [3868] passed: 3: cpu (1) iomix (1) vm (1)
stress-ng: info:  [3868] failed: 0
stress-ng: info:  [3868] metrics untrustworthy: 0
stress-ng: info:  [3868] successful run completed in 1 min, 1.50 sec
```

This is the test on the physical CI machine:

```
[procolix@forgejo-ci:~]$ sudo nix run --extra-experimental-features 'nix-command flakes' nixpkgs#stress-ng -- --cpu 1 --iomix 1 --vm 1 --vm-bytes 1G --timeout 60 --metrics-brief
stress-ng: info:  [2164954] setting to a 1 min run per stressor
stress-ng: info:  [2164954] dispatching hogs: 1 cpu, 1 iomix, 1 vm
stress-ng: metrc: [2164954] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: metrc: [2164954]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: metrc: [2164954] cpu               82577     60.00     59.87      0.00      1376.20        1379.13
stress-ng: metrc: [2164954] iomix           2824468     60.01     22.72     51.36     47069.92       38125.72
stress-ng: metrc: [2164954] vm              4830124     60.06     49.67     10.23     80426.69       80639.59
stress-ng: info:  [2164954] skipped: 0
stress-ng: info:  [2164954] passed: 3: cpu (1) iomix (1) vm (1)
stress-ng: info:  [2164954] failed: 0
stress-ng: info:  [2164954] metrics untrustworthy: 0
stress-ng: info:  [2164954] successful run completed in 1 min
```

And after running the CI on those machines, we noticed that the higher the iomix score was, the faster the CI ran.

Hope that this helps a bit!
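To make the comparison concrete, here are the ratios computed from the bogo ops quoted in the two runs above (numbers copied verbatim from the thread): CPU and VM throughput differ only marginally, while iomix is more than 85 times higher on bare metal.

```python
# Bogo ops from the two stress-ng runs quoted above (60 s each).
vm_run   = {"cpu": 73191, "iomix": 33049,   "vm": 4412533}  # Proxmox VM
physical = {"cpu": 82577, "iomix": 2824468, "vm": 4830124}  # bare metal

for stressor in vm_run:
    ratio = physical[stressor] / vm_run[stressor]
    print(f"{stressor:>5}: physical is {ratio:.1f}x the VM")
# cpu and vm come out around 1.1x; iomix around 85.5x.
```

Which matches the observation that IO, not CPU, was the bottleneck.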

Author
Owner

It does help very much!

Owner

given we are on the physical CI runner now, is this ticket still relevant?


Blocks
#224 automated dev-ops workflows
fediversity/fediversity
Reference
fediversity/fediversity#13