Fedora Infra musings for the second week of July 2024
This week started out fun with some oral surgery on Monday. Luckily it all went very well: I went to sleep, woke up when they were done, and had a bunch of pain medication on board. I'm getting pretty sick of 'soft' foods, however.
Tuesday we found our log server 100% full. Turns out a toddler (a thing that takes actions on message bus messages) was crashing in a loop. When a toddler crashes, it puts the message back on the queue and tries again. That works fine for some kind of transitory error it can process after a short while, but it doesn't work well at all if the message needs intervention. So, 350GB of syslog later, we disabled that toddler until we can fix it. We had some discussion about this problem, and it seems like the way to go might be to have the entire pod crash on these things. That way it would alert us and require intervention instead of looping on something it can't ever process. Also, right now the toddlers are just generic pods that run all the handlers, but we are looking at a 'poddlers' setup where each handler has its own pod. That way a crash of one won't block all the rest. Interesting stuff.
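The 'crash instead of requeue forever' idea is simple enough to sketch. This is just an illustration of the policy, not actual toddlers code (which is Python): retry a handler a few times for transient failures, then exit nonzero so the platform restarts the pod and alerting fires, instead of putting the message back forever.

```shell
# Sketch of the proposed policy: retry a handler a few times, then
# give up loudly rather than requeueing the message indefinitely.
# Purely illustrative; the real consumers are Python services.
process_with_retries() {
  local max=$1; shift
  local n=0
  until "$@"; do
    n=$((n+1))
    if [ "$n" -ge "$max" ]; then
      echo "giving up after $n tries; crashing so someone gets paged" >&2
      return 1   # in a real pod: exit 1 so the orchestrator notices
    fi
    sleep 0      # real code would back off between attempts
  done
}

process_with_retries 3 true && echo "handled"   # prints "handled"
```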
Our new, updated mailman instance has been having memory pressure problems. We were finally able to track it down to the 'full text search' causing memory spikes in gunicorn workers. It's rebuilding its indexes, but it hasn't been able to finish yet, and without those indexes search is really memory intensive. So, we are going to disable search until the indexing is all caught up. This seems to have really helped. Fingers crossed.
This week was a mass update/reboot cycle. We try to do these every few months to pick up non-security updates (security updates get applied daily). So, on Tuesday I did all the staging hosts and various other hosts I could do without causing any outages for users/maintainers. Wednesday was the big event and all the rest were done. Ansible does make this pretty reasonable to do, but of course there are always things that don't apply right, don't reboot right, or break somehow. There was a fair share of those this time:
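For the curious, the core of an update/reboot pass boils down to something like the following ad-hoc commands. We actually drive this from playbooks, and the 'staging' group name here is just illustrative, not our real inventory layout:

```shell
# Roughly what a mass update/reboot cycle looks like with ad-hoc
# ansible. Group name 'staging' is made up for illustration; the
# real thing is playbook-driven with serial batches and checks.
ansible staging -b -m dnf -a "name='*' state=latest"   # apply all updates
ansible staging -b -m reboot                           # reboot and wait
ansible staging -m ping                                # confirm they're back
```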
- All of our old Lenovo eMAG aarch64 buildhw machines wouldn't reboot. (See below.)
- The koji hub's fedora-messaging plugin wasn't working. Turns out the hardening in the F40 httpd service file prevented it from working. I've overridden that for now, but we should fix the plugin to not need that override.
- Our staging OpenShift cluster had a node with a disk that died. That disk was used for storage, so the upgrade couldn't continue. I finally got it to delete that and continue today.
- Flatpak builds were broken: moving to F40 builders switched us to createrepo_c 1.0, and thus zstd-compressed repodata by default. The Flatpak SIG folks have fixes in the pipeline.
- epel8 builds were broken by F40's dnf no longer downloading filelists. RHEL 8 packages have requirements on /usr/libexec/platform-python that wouldn't resolve anymore, so no builds. I've just added platform-python to the koji epel8 build groups for now. Perhaps there will be a larger fix in mock.
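On the httpd item: the override itself is just a systemd drop-in. Something like the sketch below, though which hardening directive actually needs relaxing depends on what the plugin requires; PrivateTmp here is purely an example, not necessarily the real culprit:

```shell
# Relax one hardening directive for httpd via a drop-in.
# The directive shown is illustrative only; pick whichever
# hardening option actually blocks the plugin.
sudo systemctl edit httpd
# ...then in the editor that opens, add:
#   [Service]
#   PrivateTmp=false
sudo systemctl restart httpd
```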
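On the createrepo_c 1.0 item: you can tell whether a repo's metadata went zstd just by looking at the file names repomd.xml references. A self-contained sketch using a sample snippet standing in for a real repo's repomd.xml:

```shell
# Check whether repomd.xml points at zstd-compressed metadata.
# The here-doc below is a fake sample standing in for a real file.
cat > repomd-sample.xml <<'EOF'
<repomd>
  <data type="primary">
    <location href="repodata/abc123-primary.xml.zst"/>
  </data>
</repomd>
EOF

if grep -q '\.zst"' repomd-sample.xml; then
  echo "zstd metadata"    # prints this for the sample above
else
  echo "gz/xz metadata"
fi
```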
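And on the epel8 workaround: adding a package to a koji build group is a one-liner with the koji CLI. The tag and group names below are illustrative of a typical epel8 build tag setup, not necessarily our exact ones:

```shell
# Add platform-python to the build group so buildroots ship
# /usr/libexec/platform-python again. Tag/group names illustrative.
koji add-group-pkg epel8-build build platform-python
koji list-groups epel8-build    # verify the group contents
```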
So, we have a number of old Lenovo eMAGs. They have been our primary aarch64 builders for ages (since about 2019 or so). They are now no longer under warranty, and we have slowly been replacing them with newer boxes. Now they will no longer boot at all. It seems like it has to be a shim or grub problem, but I can't get it working even with older versions, so I am now thinking it might be a firmware problem. There is actually a (slightly) newer firmware, if I can get a copy. Failing that, we may have to accelerate our retirement plans for these machines. They really served long and well, and are actually pretty nice hardware, but all things must end. Anyhow, I'm looking for the new firmware to try before giving up.
I've been dealing with this bug in rawhide kernels lately. The last two days I have come in in the morning to find my laptop completely unresponsive. A few other times I have hit the kswapd storm, and backups have been taking many hours. I sure hope the fix lands upstream soon; if it doesn't, I might go back to F40 kernels. I know I could just build my own kernel, but... I've done that enough in my life.
Till next week, be kind to others!