Another tale of a rawhide compose bug

nirik

2022-04-21 14:45

Astute observers will have noticed recently that Fedora rawhide composes stopped after 2022-04-13 and didn't resume until 2022-04-19. A particularly odd bug was to blame. This is the story of that bug and my investigations of it.

A bit of background first on how nightly images are made currently. There's a cron job that calls a script (called nightly.sh) in the pungi-fedora pagure.io git repo. It does a number of small associated things (like sending message bus messages of status, sending report emails, copying results around, etc) but the big thing it does is to call pungi. pungi in turn looks at it's config (in that same pungi-fedora repo) and does all the heavy lifting of the compose, calling mostly out to koji to do things, but also doing some things locally. For images, pungi calls koji with some parameters (use this repo and name for kickstart file, use these compose repos for packages, etc). koji then calls different tools depending on what the image is. For livemedia the tool is livemedia-creator (in the lorax package), for qcow2/raw images it calls ImageFactory/oz to do the actual image build on a builder.

The compose on the 14th failed making a aarch64 Cloud-Base image. https://koji.fedoraproject.org/koji/taskinfo?taskID=85654099 with an odd traceback at the very very end of the build:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x85 in position 17480: invalid start byte

So, my first go-to on these sorts of things is to look at what packages changed between the successfull compose on the 13th and the failed one on the 14th. That day of changes was pretty small, being a thursday in a week when everyone was focused on Fedora 36 final testing. The only thing that seemed like it could affect booting at all was a grub2 update. That update only added a 'read' module though, which normally I wouldn't think could cause any issue, but perhaps it was a odd aarch64 toolchain issue? So, I tried some composes with grub2 untagged. No luck, still failed the same way.

Next I tried to get some more information about exactly what was not utf-8 that it was complaining about. I did this by doing some koji scratch image builds. But of course my scratch builds never seemed to fail, they all worked fine.

Of course then it was easter weekend, I got busy around the house and didn't really dig back in to it until the following monday. This time I did some more scratch builds and finally managed to get it to fail correctly and with some added debug in oz on the builder to print out the data it was trying to convert to utf-8. At the very end of making the image (while it's shutting down), it was getting a bunch of:

cpio: Malformed number <bunch of junk encoded weird stuff>

This of course was not something python wanted to convert to utf-8. It looked like a totally different encoding in there, so oz marked it a failure.

A sidebar here on how oz works. oz takes its inputs and tells libvirt to fire off a vm for it, running the installer with the kickstart provided, and serial console pointed to a socket. It then starts polling (by default every 10 seconds). If there's disk or network activity on the vm, it gets data from the console and waits. If it hits the total timeout it kills the vm and returns a failure. If it doesnt see any disk or network activity for (default 300 seconds) it assumes the install has stalled/failed, and kills the vm and returns a failure. If the vm has shutdown, it assumes it's successfull and returns. The output of the console is returned back to koji so you can see the log file in the koji task. However, as Adam Williamson pointed out: That last 10 second window never actually gets collected by oz for converting/adding to logs. If the install shuts down right after oz polls, the next time it will poll the vm is shutdown and it just returns.

So, what is this cpio output? Well, when you shut down a Fedora system, dracut calls "/usr/lib/dracut/dracut-initramfs-restore". This script runs at the very end of the shutdown cycle and is supposed to copy out the initramfs to memory and then pivot your install to using that as / so it can unmount your real root directory cleanly. The script uses a wrapper called 'skipcpio' thats shipped with dracut. This wrapper skips past the first part of the initramfs, as this is just a small archive with the microcode to load that first thing when you boot. The next archive is the actual initramfs that does the heavy lifting. The script has no idea how you have compressed your initramfs (if you have), so it just blindly tries: not compressed? gzipped? bzipped? xz? lz4? lzop? zstd? The first try (not compressed) is the thing that results in the error spew. cpio bravely tries to unpack the zstd compressed initramfs and... has problems.

So, the flow is:

nightly.sh calls pungi
pungi calls koji to make Cloud-Base image
Oz calls libvirt to do install, starts polling.
Install boots, installs and finishes.
On shutdown dracut-initramfs-restore runs and because it tries uncompressed first, spews cpio errors
Oz polls again after the errors start, but before the vm is completely off.
Oz tries to convert the log to utf-8 and blows up.

Upstream dracut as it turns out just merged a PR to change the order it tries these things in (moving zstd to the first one) because it resulted in several seconds slower shutdowns. So, I just added that to the rawhide dracut package and everything started composing again.

So, what caused this to start happening now? Well, queue back to oz not getting the last bit of console output. My theory is that this has been happening for a long time, it's just since it was at the very end of the output, it always got dropped on the floor and oz never saw it. For some reason now we are shutting down faster, or slower and it's getting caught in that last poll that does have data.

Always fun digging into these sorts of things. :)