There's this annoying pattern where the NFS mount fails on boot
and then the worker services all start up and take jobs, but they
instafail because the share isn't there.
Ideally we could handle this very easily with Restart= directives
but systemd has...*opinions* about this:
https://github.com/systemd/systemd/issues/4468https://github.com/systemd/systemd/issues/1312
so we have to do some fairly awkward hacks to just express:
* Retry the NFS mount if it fails
* Don't start the workers unless the NFS mount is up
* Retry the workers after a while if they were blocked
It's ugly, but in testing this same config on one worker it seems
to work...
Signed-off-by: Adam Williamson <awilliam@redhat.com>
This reverts commit 4dc01bc892 and
a follow-up commit. I'm having trouble getting things to work
and want to see if it works if we go back to having the openQA
bridge be br0, and rename the bridge used for the system's bonded
network connection to something else instead.
On the new rdu3 worker hosts, br0 already exists and is the main
system 'interface' (it's a bridge on two bonded physical interfaces
connected to different switches, to make networking upgrades
easier). So we can't call our openvswitch bridge 'br0' any more.
Let's try calling it 'openqabr0' and see if anything explodes.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
This is an awful hack to deal with
https://github.com/os-autoinst/os-autoinst/issues/2549 while we
try and fix it properly. This finds stuck qemu processes by
parsing the journal messages of the workers, and kills them.
workers stuck in the broken state should then recover on the
next checkin with the server. I tested this manually on all the
worker hosts and it...seemed to work, mostly. I'll keep an eye on
things after deploying it.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
It really needs to be called exactly 60-block-scheduler.rules
as it's overriding a file of the same name in `/usr`.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
The background to this is
https://bugzilla.redhat.com/show_bug.cgi?id=2073414 , in response
to which git was changed to die if a user runs git commands
on a repo which it doesn't own. In openQA, the test directory
is a git repo and openQA itself likes to run git commands on it,
but this is often going to be as a different user than the owner
of the directory. In fact on the worker hosts, the user that owns
the directory (geekotest on the server box) doesn't even exist.
This just sets the config by copying a file in place rather than
running a git command (which is hard to get to be idempotent) and
uses `/etc/gitconfig` so we don't wind up with a file in the
_openqa-worker user's home directory, which is meant to be empty.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
This unifies prod and stg onto the ways of doing things for the
latest packages, and rejigs the swtpm stuff a bit to tear down
more (we shouldn't need the custom SELinux policy any more).
Signed-off-by: Adam Williamson <awilliam@redhat.com>
This is because swtpm is designed not to be persistent, it's
sort of tied to a single "system" (VM in this case). We can't
expect an instance will stick around after it's been "used", it
doesn't do that, it exits successfully. So we need to restart it
when that happens.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
swtpm is a TPM emulator we want to use for testing Clevis on
IoT (and potentially other things in future). We're implementing
this by having os-autoinst just add the qemu args but expect
swtpm itself to be running already - that's counted as the
sysadmin's responsibility. My approach to this is to have openQA
tap worker hosts also be tpm worker hosts, meaning they run one
instance of swtpm per worker instance (as a systemd service) and
are added to a 'tpm' worker class which tests can use to ensure
they run on a suitably-equipped worker. This sets up all of that.
We need a custom SELinux policy module to allow systemd to run
swtpm - this is blocked by default.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
It shouldn't need anything but 10.0.2.*, and hopefully this will
stop it interfering with the rest of the infra network...
Signed-off-by: Adam Williamson <awilliam@redhat.com>
This provides a mechanism for deploying scratch builds, and also
for controlling whether or not to install openQA and os-autoinst
from updates-testing.
I have been doing the scratch build thing for years already, just
manually by ssh'ing into the boxes. This is getting tiring now
we have like 15 worker hosts.
The scratch build mechanism isn't properly idempotent, but fixing
that would be hard and I really only intend to use it transiently
when I'm updating the packages, so I don't think it's worth the
effort.
This also adds a notification for restarting openQA worker
services when the packages or config are updated, and fixes the
worker playbook to enable the last worker service.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
For some reason /dev/kvm has 0600 perms after boot on the ppc64
worker host. Also, qemu won't run unless SMT is turned off, on
ppc64. I've just been doing this manually every time the box got
restarted, but that's dumb, so let's make it happen on boot with
a script and a service to run it.
Signed-off-by: Adam Williamson <awilliam@redhat.com>