There are no fedmsg notifications for the live respin composes,
so we just try scheduling them every hour; when we've already
tested the current compose this will not create any new jobs,
and when a new compose shows up, this will test it.
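
The hourly run is harmless because scheduling is effectively idempotent per compose. A rough sketch of that idea follows; `schedule_if_new`, the `tested` set and the `create_jobs` callable are hypothetical stand-ins, not the real scheduler's API:

```python
def schedule_if_new(compose_id, tested, create_jobs):
    """Create jobs for compose_id unless it has already been tested.

    `tested` is a set of compose IDs that were already scheduled and
    `create_jobs` is a callable returning the new job IDs. Both are
    illustrative assumptions, not taken from the real scheduler.
    """
    if compose_id in tested:
        # Current compose already tested: the hourly run is a no-op.
        return []
    jobs = create_jobs(compose_id)
    tested.add(compose_id)
    return jobs
```

Calling this every hour only produces jobs the first time a given compose ID is seen.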
relvalconsumer is the fedmsg consumer bot that creates Wikitcms
release validation test events. Up till now it's just been
running on one of my personal boxes; we should really move it
to infra. Run it on the openQA servers for now, as there's
nowhere more obviously correct, and I have root access there to
fix problems.
We add new inventory groups because it's *really important*
that there be exactly one (no more, no less) production consumer
at any given time. I don't want to just use the 'openqa' group
for this because it's vaguely possible it could contain more
than one host in future, and we really wouldn't want that to
result in there being two production relvalconsumers running.
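
The "exactly one host" rule could be checked mechanically. A minimal sketch, assuming the inventory has been loaded into a plain dict mapping group names to host lists (the function name and the dict shape are hypothetical):

```python
def sole_host(groups, group_name):
    """Return the only host in group_name, or raise if there is not exactly one.

    `groups` is assumed to be a dict like {"some_group": ["host1"]};
    this shape is illustrative, not how the inventory is really stored.
    """
    hosts = groups.get(group_name, [])
    if len(hosts) != 1:
        raise ValueError(
            f"group {group_name!r} must contain exactly one host, "
            f"found {len(hosts)}"
        )
    return hosts[0]
```

A dedicated group makes this check meaningful: adding a second openQA host later would not silently add a second consumer.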
The capability to handle a variance between prod and staging
here is just temporary while I'm testing the new fixed asset
handling stuff by deploying it on staging. Once it's tested
and merged we'll just have prod and staging do the same thing.
But for now we need to cleanly handle them having the static
disk images in different places.
I've enhanced `createhdds check` to exit 1 if all images are
present but some are old, and 2 if any images are missing. Here
in the play we use this to create images only if some are
missing; we rely on the daily cron job to rebuild old images.
This is kind of a band-aid for a weird issue on openqa01 where
virt-install runs just don't seem to work properly after the
box has been running for a while, so createhdds doesn't actually
work and any playbook run gets hung up on it for a long time.
This doesn't fix that, but does at least mean we can run the
playbook without being bothered by it. To get createhdds to run
properly and actually regenerate the outdated images, we have
to reboot the system and run it right away; it seems to work
fine right after the system boots up.
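
The exit code logic amounts to roughly the following. This is only a sketch: exit code 0 meaning "everything present and current" is my assumption (it isn't stated above), and the function names are hypothetical.

```python
import subprocess

# Exit codes of `createhdds check` as described above.
ALL_CURRENT = 0    # assumption: not stated above, just the usual convention
SOME_OLD = 1       # all images present, but some are outdated
SOME_MISSING = 2   # at least one image is missing


def should_create_images(returncode):
    """Only missing images trigger creation; old ones are left to cron."""
    return returncode == SOME_MISSING


def run_check(cmd=("createhdds", "check")):
    """Run the check and return its exit code.

    The bare command name is an assumption; the real play may call it
    with a different path or working directory.
    """
    return subprocess.run(list(cmd)).returncode
```

So `should_create_images(run_check())` is true only when something is actually missing, which keeps playbook runs from hanging on rebuilds of merely old images.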
We currently can't tell openQA to download the ARM kernel and
initramfs with a filename unique to the build being tested, so
they just get downloaded as `vmlinuz` and `initrd.img`, which
means that when the next compose is tested, we won't download
them again; we'll just use the existing copies (which are no
longer the right ones). Because of this our current 'F25' and
'Rawhide' ARM tests are actually still using some F24 kernel
image. Until the openQA bug which prevents us giving the files
unique names is resolved, here's a hacky workaround: a script
which wipes the files every hour if no openQA jobs are pending.
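
In outline the workaround does something like this. It is a sketch, not the actual script: how the pending job count is obtained is left to the caller, and the function name and `asset_dir` parameter are hypothetical.

```python
from pathlib import Path

# The fixed names the ARM kernel and initramfs get downloaded as.
ARM_FILES = ("vmlinuz", "initrd.img")


def wipe_stale_arm_files(asset_dir, pending_jobs):
    """Delete the cached ARM files if no jobs are pending.

    Returns the names that were removed. Nothing is touched while jobs
    are pending, since they may still need the current copies.
    """
    if pending_jobs > 0:
        return []
    removed = []
    for name in ARM_FILES:
        path = Path(asset_dir) / name
        if path.exists():
            path.unlink()
            removed.append(name)
    return removed
```

With the files gone, the next compose's jobs have to download fresh copies instead of reusing stale ones.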
Since resultsdb submission was added to the scheduler, we must
disable it here for now (as we don't want to use it yet), and
also update the name of the config directive that controls wiki
result submission.
We need to install some additional packages for the revised
createhdds (but we no longer need pexpect), and ensure libvirtd
is running before running createhdds.
It's not really fatal when it fails (except on first deployment)
and nothing else later depends on it, so we can go ahead and
continue the run even if it fails.
Create the disk images on a regular schedule instead of just
relying on that happening when we do an ansible run, since
that's intermittent and it's annoying when you want to do an
ansible run and it sits there for
hours creating disk images. This way we'll know they'll
get updated regularly and ansible runs should never get
blocked on image creation, though we still do it in the
ansible plays just in case (and for initial deployment).
This should now be safe, with the recent changes to make it
time out gracefully and run atomically. We also use withlock
to make sure we don't stack jobs.
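
For illustration, the "don't stack jobs" idea is just a non-blocking exclusive lock. The sketch below uses Python's `fcntl.flock` to show the general technique; it is not how withlock itself is implemented, and `try_lock` and the lock path are hypothetical.

```python
import fcntl


def try_lock(lock_path):
    """Try to take an exclusive lock without waiting.

    Returns the open file object on success (keep it open for as long
    as the job runs; closing it releases the lock), or None if another
    run already holds the lock.
    """
    handle = open(lock_path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock: skip this run rather than queue up.
        handle.close()
        return None
    return handle
```

If a previous image creation run is still going when the next one starts, the new run gets `None` and can simply exit.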
This will only work with the new openqa package builds I just
did, but won't break anything with older ones. With a new enough
openQA package, it'll prevent the web UI from showing download
links for ISOs and HDD files.
OK, this GRE crap ain't working. Let's give up! Instead let's
have one tap-capable host per openQA deployment, so all the
tap jobs will go to it. This...should achieve that. Let's see
what blows up.
We have a big mismatch between prod and stg at the moment (stg
has 4 workers, prod has 18). Let's make it 14 (prod) vs. 8 (stg),
and also give stg two worker hosts so we can test
multi-worker-host scenarios.
NetworkManager entirely ignores the openvswitch devices; the
integration only works with network.service. So turn that on.
Apparently we can have both services enabled and things don't
explode...so far...