Fedora Infra musings for the third week of July
Another week has raced by (time flies when you're having fun?). Flock to Fedora is coming up really fast now. It's Aug 7th to 10th in Rochester, NY. Looking forward to meeting up with everyone there and having some great discussions. I have a talk (which I still need to write up) on Matrix, which should be fun, and then an Infrastructure and Release Engineering hackfest which I need to work on organizing a bit more. Look for more info on discussion.
On Monday I managed to get updated firmware for our aarch64 eMAGs. I got them all updated, reinstalled, and re-added as builders just barely in time for the mass rebuild. This sort of thing takes a really tremendous amount of time, and I'd like to explain why for those that haven't had this sort of fun before. There are a lot of parts of this process where you need to wait for something to happen and then do something in reaction to it, i.e., wait for one firmware (there are 3 on these aarch64 machines) to finish updating, then reload and upload the next one. For some reason I couldn't force them to PXE boot in all cases, so that meant: log in to the serial console, watch the server boot, and when it gets to a specific point hit esc-shift-1 to PXE boot. If you miss it, you have to start over. You might think you could do other things while this is happening, but when you do, you always miss the window to hit the key and have to keep doing it over and over.

The next bit of fun with these was that they have 4 interfaces, and for some unknown reason they are all active on various vlans, and which one gets the 'default' route is somewhat random. If it's not the actual builder network, the machine can't reach resources and the install fails. Sometimes one or more of the interfaces wouldn't come up at all, with a cryptic error; if that was the main network, you had to reboot and try again. Once they were PXE booted, the kickstart install and ansiblizing was easy. Hopefully they will just keep working now until we retire them.
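If I end up doing this dance again, the console-babysitting part is the piece I'd most like to script away. Something like the sketch below is the general idea: have pexpect watch the serial console and fire the key sequence the instant the window opens. This is purely illustrative, not our actual tooling; the console command, credentials, banner strings, and key sequence are all placeholders/guesses.

```python
# Minimal sketch (assumptions, not our real setup): babysit a serial console
# and hit the boot-menu key at the right moment instead of a human doing it.
import pexpect

# Hypothetical: attach to the box's serial-over-LAN console via its BMC.
# Host, user, and password here are placeholders.
console = pexpect.spawn(
    "ipmitool -I lanplus -H bmc.example.org -U admin -P sekret sol activate",
    timeout=600,
)

# Wait for the firmware banner that shows up just before the boot menu.
# The exact string varies by firmware; this one is made up.
console.expect("Press ESC for boot options")

# Send the escape sequence right when the window opens.
# (My reading of "esc-shift-1" is ESC followed by '!'.)
console.send("\x1b!")

# From here the machine should PXE boot and the kickstart takes over.
console.expect("Starting installer", timeout=1800)
print("PXE boot caught; install under way")
```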
Our resultsdb app's pods have been restarting. It's not super clear what the cause is. They hit max threads, the health check fails, and then they restart, but the reason for hitting max threads isn't clear. Is it somehow getting blocked on database writes so requests pile up? Or is it just getting too many requests at once to handle properly? I looked at it some, but the resultsdb image that we use was made by the factory2 team (which no longer exists), and it's not very easy to enable debugging in. Will look at it more next week.
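One thing I'd like to try is getting some kind of thread dump out of the running pods, so we can see whether all the worker threads are parked waiting on the database or just swamped with requests. Below is a rough sketch of the sort of hook I mean; it's not something the resultsdb image actually ships, and the route name and wiring are made up for illustration.

```python
# Sketch only: a Flask route that dumps a stack trace for every live thread,
# so you can see where threads are stuck when the pod is near max threads.
import sys
import threading
import traceback

from flask import Flask

app = Flask(__name__)

@app.route("/_threads")
def thread_dump():
    # Map thread ids back to names so the output is readable.
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = [f"{threading.active_count()} active threads"]
    for ident, frame in sys._current_frames().items():
        lines.append(f"\n--- {names.get(ident, ident)} ---")
        # Format the current stack of that thread; blocked DB calls would
        # show up here as frames sitting in the database driver.
        lines.extend(entry.rstrip() for entry in traceback.format_stack(frame))
    return "\n".join(lines), 200, {"Content-Type": "text/plain"}
```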
Overall this week I didn't feel like I got much done. Too many things that are difficult or require a lot of time, and it's hard to feel progress on them. Hopefully next week will go better!