Bits from late jan 2025

Scrye into the crystal ball

January has gone by pretty fast. Here's some longer form thoughts about a few things that happened this last week.

Mass updates/reboots

We did a mass update/reboot cycle on (almost) all our instances. The last one was about 2 months ago (before the holidays), so we were due. We do apply security updates every weekday (i.e., Monday through Friday), but we don't apply bugfix updates except during these scheduled windows. Rebooting everything makes sure everything is on the latest kernel versions and also ensures that if we had to reset something for other reasons it would come back up in the correct/desired/working state. I did explore the idea of having things set up so we could do these sorts of things without an outage window at all, but at the time (a few years ago) the sticking point was database servers. It was very possible to set up replication, but it was all very fragile and required manual intervention to make sure failover/failback worked right. There's been a lot of progress in that area though, so later this year it might be time to revisit that.

We also use these outage windows to do some reinstalls and dist-upgrades. This time I moved a number of things from f40 to f41, and reinstalled the last vmhost we had still on rhel8. That was a tricky one as it had our dhcp/tftp vm and our kickstarts/repos vm. So, I live migrated them to another server, did the reinstall and migrated them back. It all went pretty smoothly.

There was some breakage with secure boot signing after these upgrades, but it turned out to be completely my fault. In the _last_ upgrade cycle, opensc changed the name of our token from 'OpenSC Card (Fedora Signer)' to 'OpenSC Card'. The logic upstream being "Oh, if you only have one card, you don't need to know the actual token name". Which is bad for a variety of reasons, like if you suddenly add another card, or swap cards. In any case I failed to fix my notes on that and was trying the old name and getting a confusing and bad error message. Once I managed to figure it out, everything was working again.

Just for fun, here's our top 5 os versions by number:

  • 237 Fedora 41

  • 108 RHEL 9.5

  • 31 Fedora 40

  • 23 RHEL 8.10

  • 4 RHEL 7.9

The 7.9 ones will go away once fedmsg and github2fedmsg are finally retired. (Hopefully soon).
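For anyone who wants those numbers as percentages, here's a quick Python sketch. The counts are copied from the list above; since that list is only the top 5 versions, the total here is the sum of those five and not necessarily the size of the whole fleet. The `shares` helper is just a name I made up for illustration.

```python
# Counts copied from the top 5 list above. Only those five versions are
# included, so the total is NOT necessarily the whole fleet.
counts = {
    "Fedora 41": 237,
    "RHEL 9.5": 108,
    "Fedora 40": 31,
    "RHEL 8.10": 23,
    "RHEL 7.9": 4,
}


def shares(counts):
    """Return each version's percentage of the listed total, rounded to 0.1."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total = sum(counts.values())
percentages = shares(counts)

# Print one row per version: name, count, share of the listed total.
for name, n in counts.items():
    print(f"{name:<10} {n:>4} {percentages[name]:>5.1f}%")
print(f"{'total':<10} {total:>4}")
```

Under that assumption, Fedora 41 alone accounts for a bit more than half of the listed instances.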

Datacenter Move

I did a bunch of planning work on the upcoming datacenter move. I'm hoping next week to work a lot more on a detailed plan. Also, we in infrastructure should kick off some discussions about whether there's anything we can change/do while doing this move. Of course adding in too much change could be bad given the short timeline, but there might be some things to consider.

I also powered off 11 of our old arm servers. They had been disabled in the buildsystem for a while to confirm we didn't really need them, so I powered them off and saved us some energy usage.

riscv-koji secondary hub

The riscv-koji secondary hub is actually installed and up now. However, there's still a bunch of things to do:

  • Need to set up authentication so people/I can log in to it.

  • Need to install some buildvm-x86 builders to do newrepos, etc

  • Need to install a composer to build images and such on

  • At next week's riscv sig meeting hopefully we can discuss steps after that. Probably we would just set up tags/targets/etc and import a minimal set of rpms for a buildroot

  • Need to figure out auth for builders and add some.

Overall, progress finally. Sorry I have been a bottleneck on it, but soon I think lots of other folks can start working on it.

power9 lockups

We have been having annoying lockups of our power9 hypervisors. I filed https://bugzilla.redhat.com/show_bug.cgi?id=2343283 on that. In the meantime I have been moving them back to the 6.11.x kernels, which don't seem to have the problem. I did briefly try a 6.13.0 kernel, but the network wasn't working there. I still need to file a bug on that when I can take one down and gather debugging info. It was the i40e module not being able to load due to some kind of memory error. ;(

Chat and work items

One thing that was bugging me last year is that I get a lot of notifications on chat platforms (in particular slack and matrix) where someone asks me something or wants me to do something. That's perfectly fine, I'm happy to help. However, when I sit down in the morning, I usually want to look at what's going on and prioritize things, not get sidetracked into replying/working on something that's not the most important issue. This resulted in me trying to remember which things needed responses and sometimes missing going back to them or getting distracted by them.

So, a few weeks ago I started actually noting things like that down as I came to them, then after higher pri things were taken care of, I had a nice list to go back through and hopefully not miss anything.

It's reduced my stress, and I'd recommend it for anyone with similar workflows.

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/113930147248831003