network-scripts-openvswitch was removed in f40 and network-scripts
is going away in f41; we really need to get off using them.
This attempts to implement the same setup using NetworkManager,
based on a few different NM/ovs references, and the source of
openQA upstream's os-autoinst-setup-multi-machine. It might
need a bit of tweaking, so for now, we make it a separate task
and use it only on p09-worker01 for testing. This doesn't handle
tearing down the old network-scripts-based config as that's
pretty complex and will only need to happen once; I'll do it
manually before trying this out.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
We don't want it there - see earlier commits - but I didn't
notice it's actually explicitly listed here for all arches,
which breaks stuff on aarch64 now we told dnf to exclude it.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
This works around an annoying problem where, for some reason, we
sometimes just miss sending completed test results to resultsdb.
I've never been able to figure out why this happens, but this
should band-aid it by looking, daily, for updates stuck in
waiting gating status, checking for cases where a test finished
but we didn't send a result, and sending it.
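The daily check boils down to something like the sketch below. It is
only an illustration: the `Update` and `Job` records, their fields and
the `find_missed_results` helper are all made up for the example, and
nothing here talks to Bodhi, openQA or resultsdb.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical record of one openQA job run for an update."""
    job_id: int
    finished: bool
    result_reported: bool


@dataclass
class Update:
    """Hypothetical record of an update and its gating status."""
    update_id: str
    gating_status: str
    jobs: list = field(default_factory=list)


def find_missed_results(updates):
    """Return (update_id, job_id) pairs for finished jobs whose result was
    never reported, looking only at updates still waiting on gating."""
    missed = []
    for update in updates:
        if update.gating_status != "waiting":
            continue
        for job in update.jobs:
            if job.finished and not job.result_reported:
                missed.append((update.update_id, job.job_id))
    return missed
```

Each pair returned is a result that would then be re-sent.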
Signed-off-by: Adam Williamson <awilliam@redhat.com>
Using the same approach as we do for the tests and fedora_openqa.
I wish I'd done this *before* I ran the playbook on lab and it
wiped every...single...goddamn...disk image.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
Encoding with ffmpeg rather than os-autoinst's built-in encoder
gives us less broken videos, but on aarch64 it seems to cause
problems, especially on stg's old, busted worker hosts - I think
it's more CPU-intensive and they just can't handle the load. So,
let's block it.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
I'm adding this as a Recommends: for os-autoinst, but want to
get it on the workers now. Having it installed gives us better
videos of test runs (the internal video encoder is a bit wonky
and produces videos that have errors which make jumping around
within the video not work properly).
Signed-off-by: Adam Williamson <awilliam@redhat.com>
It's overall simpler and more idempotent to just use a side repo
maintained outside of ansible than re-create one on each system
on each run of the plays.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
It is extremely slow to run, and we figured out that the problem
on openqa01 was excessive space being used by Netapp snapshots,
so we don't need this any more. It was actually deleting old
jobs before their time, because it had already wiped every
video file and didn't know what else to do...
Signed-off-by: Adam Williamson <awilliam@redhat.com>
We're having issues with test results eating up all the disk
space we can throw at them (prod is over 4T, stg is over 2T -
I don't know why prod is bigger, but it may be an odd effect
of having more arches on stg: maybe aarch64 and ppc64le tests
generally have smaller videos, or something).
This config setting should make openQA keep the space usage
on the partition at a max of 85%, by deleting videos from older
tests as required.
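The idea behind the setting can be sketched like this. This is not
openQA's actual implementation, just an illustration of "delete the
oldest videos until usage is back under the limit"; the function name
and the (age, size) tuples are assumptions made for the example.

```python
def videos_to_delete(videos, used_bytes, total_bytes, max_percent=85):
    """Pick videos to delete, oldest first, until usage is within the limit.

    `videos` is a list of (age_in_days, size_in_bytes) tuples.
    """
    limit = total_bytes * max_percent / 100
    doomed = []
    # Sorting in reverse puts the largest age (the oldest video) first.
    for age, size in sorted(videos, reverse=True):
        if used_bytes <= limit:
            break
        doomed.append((age, size))
        used_bytes -= size
    return doomed
```

If usage is already at or under the limit, nothing is selected.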
Signed-off-by: Adam Williamson <awilliam@redhat.com>
nirik and I went around and around a bit today and ended up back
where we started, but with a clearer understanding of where that
is. This explains it a bit better, and makes what's actually
going on in various places clearer with the use of appropriate
shared variables. This should not actually *change* anything at
all when deployed.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
I'm going to try splitting the tap jobs across multiple worker
hosts. We have quite a lot of tap jobs now, and I have sometimes
seen a situation where all non-tap jobs have been run and
the non-tap worker hosts are sitting idle, but the single tap
worker host has a long queue of tap jobs to get through.
We can't just put multiple hosts per instance into the tap
class, because then we might get a case where job A from a tap
group is run on one host and job B from a tap group is run on
a different host, and they can't communicate. It's actually
possible to set this up so it works, but it needs yet more
complex networking stuff I don't want to mess with. So instead
I'm just gonna split the tap job groups across two classes,
'tap' and 'tap2'. That way we can have one 'tap' worker host
and one 'tap2' worker host per instance and arch, and they will
each get about half the tap jobs.
Unfortunately since we only have one aarch64 worker for lab it
will still have to run all the jobs, but for all other cases we
do have at least two workers, so we can split the load.
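As a rough sketch of the constraint: each tap job group gets exactly
one worker class, so every job in the group lands on a host of the same
class. The helper below is hypothetical and only illustrates that
mapping; it is not how the split is actually configured, and the
round-robin choice is just one way to get about half in each class.

```python
def assign_worker_classes(job_groups, classes=("tap", "tap2")):
    """Map each tap job group to a single worker class, round-robin.

    Every job in a group uses that group's class, so the whole group
    is scheduled onto hosts of one class and its jobs can communicate.
    """
    assignment = {}
    for index, group in enumerate(sorted(job_groups)):
        assignment[group] = classes[index % len(classes)]
    return assignment
```

With one host per class, each host ends up with about half the groups.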
Signed-off-by: Adam Williamson <awilliam@redhat.com>
This reverts commit 892453da7e.
openQA still had problems with the very long request, so I just
did an ugly hack to get the request under the limit instead.
The openQA job scheduler was hitting 414 errors today because
an update has so many builds that the POST request is more than
8190 characters long (the default limit). Let's bump the limit
to 16000.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
It really needs to be called exactly 60-block-scheduler.rules
as it's overriding a file of the same name in `/usr`.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
This applies only within Fedora infra for now, as we're not sure
whether worker hosts on different hardware hit this bug. It's
intended to work around:
https://bugzilla.redhat.com/show_bug.cgi?id=2009585
a bug which results in the infra worker hosts hanging after a
short time when running kernels newer than 5.11.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
Upstream implemented a feature that we can use to do the same
thing using just a test variable, so we're switching to that.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
Ugh, we delegate for the assetsize stuff too and there's tons of
that, splitting it would be awful. Let's try a different approach
with a new optional variable for the delegate target.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
Using the machine's own hostname works for the ansible delegate
stuff but doesn't work for openQA itself (if you try and access
the DB by hostname like this, postgres denies access; you have
to use 'localhost' for postgres to allow it). Using 'localhost'
works for postgres but doesn't do the right thing for delegation.
Let's use 'localhost' and split the two play steps into
delegated and non-delegated versions.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
The config file should treat these as optional; not every openQA
instance wants to report results.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
We don't want to include this section if the vars aren't set.
Not every openQA server has to be an AMQP publisher.
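The logic amounts to "emit the section only when the vars are set".
A minimal sketch in Python, rather than the template language the play
really uses; the function name, its parameters and the exact keys shown
are assumptions for illustration.

```python
def render_amqp_section(url=None, exchange=None):
    """Return the AMQP config section, or an empty string if either
    value is unset, so non-publisher servers get no section at all."""
    if not url or not exchange:
        return ""
    return f"[amqp]\nurl = {url}\nexchange = {exchange}\n"
```

A server with neither variable defined gets an empty string, so the
rendered config file simply has no such section.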
Signed-off-by: Adam Williamson <awilliam@redhat.com>
The background to this is
https://bugzilla.redhat.com/show_bug.cgi?id=2073414 , in response
to which git was changed to die if a user runs git commands
on a repo which it doesn't own. In openQA, the test directory
is a git repo and openQA itself likes to run git commands on it,
but this is often going to be as a different user than the owner
of the directory. In fact on the worker hosts, the user that owns
the directory (geekotest on the server box) doesn't even exist.
This just sets the config by copying a file in place rather than
running a git command (which is hard to get to be idempotent) and
uses `/etc/gitconfig` so we don't wind up with a file in the
_openqa-worker user's home directory, which is meant to be empty.
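For illustration, the file contents can be generated like this. It
assumes the relevant git setting is `safe.directory`; the helper name
and the example path in the usage note are assumptions, not taken from
the actual file.

```python
def render_gitconfig(safe_directories):
    """Build git config text that marks the given directories as safe
    for any user to run git commands in."""
    lines = ["[safe]"]
    for directory in safe_directories:
        lines.append(f"\tdirectory = {directory}")
    return "\n".join(lines) + "\n"
```

Calling `render_gitconfig(["/var/lib/openqa/share/tests/fedora"])`
gives text that could be dropped in as `/etc/gitconfig`. Shipping a
static file keeps the task idempotent, since the result is the same on
every run.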
Signed-off-by: Adam Williamson <awilliam@redhat.com>
We were hiding these because in the past the only ISO assets
were those from the compose under test, and we wanted to avoid
people downloading them from openQA when we'd rather they get
them from dl.fp.o or the mirror system. But these days we have
tests that generate ISOs (update netinst and live image build
tests) and we often want to download the generated images to
test them locally.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
This sets up the openQA lab instance to report to the new stg
instance of resultsdb, and use authentication. The scheduler
config file is now mode 0600 because it has a password in it.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
We need to treat it and the x86_64 update group separately to
do this, but it really doesn't need 200G. We have images from
three weeks ago, and we don't need that kind of buffer, and space
is a bit tight.
Note: there is no aarch64 updates group as we do not currently
run updates tests on aarch64.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
We've been using the httpd_can_network_connect boolean for years
to allow httpd to connect to the openQA server processes. This
is an unnecessarily large hammer when we only need it to be
able to connect to exactly the two openQA ports. This uses a
custom SELinux policy to allow connecting to those ports only,
and ensures the boolean is set back to off.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
Several of these requirements are old ones that were only needed
for createhdds, when we ran createhdds on the servers. All of
those can go. Also make the list line-by-line for easier git
blame tracking in future (and add comments for the remaining
entries so we know why they're there).
Signed-off-by: Adam Williamson <awilliam@redhat.com>
This just cleans up the mess left by the bad parameter from
earlier. That run of the play broke halfway through, and we need
to do the remaining half without choking on this part.
Signed-off-by: Adam Williamson <awilliam@redhat.com>
This unifies prod and stg onto the ways of doing things for the
latest packages, and rejigs the swtpm stuff a bit to tear down
more (we shouldn't need the custom SELinux policy any more).
Signed-off-by: Adam Williamson <awilliam@redhat.com>