
Early Mid April infra bits 2025

Scrye into the crystal ball

Another week has gone by, and here are some more things I'd like to highlight from the last week.

Datacenter Move

I wrote up a draft community blog post with updates for the community. Hopefully it will be up early next week, and I will also send a devel-announce list post and start a discussion thread.

We had a bit of a snafu around network cards: we missed getting 10G NICs for the new aarch64 boxes, so we are working to acquire those soon. The plan in the new datacenter is to have everything on dual 10G NICs connected to different switches, so the networking folks can update them without causing us any outages.

Some new power10 machines have arrived. I'm hopeful we might be able to switch to them as part of the move. We will know more about them once we are able to get in and start configuring them.

Next week I am hoping to get out of band management access to our new hardware in the new datacenter. This should allow us to start configuring firmware and storage and possibly do initial installs to start bootstrapping things up.

Exciting times. I hope we have enough time to get everything lined up before the June switcharoo date. :)

Fun with databases

We have been having a few applications crash/loop and others behave somewhat sluggishly of late. I finally took a good look at our main postgres database server (hereafter called db01). It's always been somewhat busy, as it has a number of things using it, but once I looked at i/o: yikes. (htop's i/o tab or iotop are very handy for this sort of thing.) It showed that a mailman process was using vast amounts of i/o and basically causing the machine to be at 100% all the time. A while back I set db01 to log slow queries, and looking at that log showed that what it was doing was searching the mailman.bounceevents table for all entries where 'processed' was 'f'. That table is 50GB, with bounce events going back at least 5 or 6 years. Searching around I found a 7 year old bug filed by my co-worker Aurélien: https://gitlab.com/mailman/mailman/-/issues/343

That was fixed: bounces are processed. However, nothing currently ever cleans up this table. So, I proposed we just truncate the table, but others made a good case that the less invasive change (we are in freeze after all) would be to just add an index.

So, I did some testing in staging and then made the change in production. The queries went from ~300 seconds to pretty much 0. I/O was still high, but around the 20-30% range most of the time.
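If you're curious what that kind of change looks like, here's a minimal sketch (not the exact commands we ran; the connection details, index name and the partial-index choice are illustrative, only the table and column come from the story above) of adding such an index with psycopg2:

    # Sketch only: assumes a "bounceevents" table with a boolean "processed"
    # column as described above; the DSN and index name are made up.
    import psycopg2

    conn = psycopg2.connect("dbname=mailman")  # hypothetical connection string
    conn.autocommit = True  # CREATE INDEX CONCURRENTLY can't run in a transaction

    with conn.cursor() as cur:
        # A partial index only covers the rows the slow query actually wants
        # (processed = false), so it stays small even on a 50GB table.
        cur.execute(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS bounceevents_unprocessed_idx "
            "ON bounceevents (processed) WHERE processed = false"
        )
        # Confirm the planner now uses the index instead of a sequential scan.
        cur.execute("EXPLAIN SELECT * FROM bounceevents WHERE processed = false")
        for (line,) in cur.fetchall():
            print(line)

    conn.close()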

It's amazing what indexes will do.

Fedora 42 go for next week!

Amazingly, we made a first RC for Fedora 42 and... it was GO! I think we have done this once before in all of Fedora history, but it's sure pretty rare. So, look for the new release out Tuesday.

I am a bit sad in that there's a bug/issue around the Xfce spin and initial setup not working. Xfce isn't a blocking deliverable, so we just have to work around it. https://bugzilla.redhat.com/show_bug.cgi?id=2358688 I am not sure what's going on with it, but you can probably avoid it by making sure to create a user/set up root in the installer.

I upgraded my machines here at home and... nothing at all broke. I didn't even have anything to look at.

comments? additions? reactions?

As always, comment on mastodon.

Early April infra bits 2025

Scrye into the crystal ball

Another week gone by and it's Saturday morning again. We are in final freeze for Fedora 42 right now, so things have been a bit quieter as folks (hopefully) are focusing on quashing release blocking bugs, but there was still a lot going on.

Unsigned packages in images (again)

We had some rawhide/branched images show up again with unsigned packages. This was due to my upgrading the koji packages and dropping a patch we had that tells koji to never use the (unsigned) buildroot repo for packages when making images, and to instead use the compose repo.

I thought this was fixed upstream, but it was not. So, the fix for now was a quick patch and update of koji. I need to talk to koji upstream about a longer term fix, or perhaps the fix is better in pungi. In any case, it should be fixed now.

Amusing idempotentness issue

In general, we try and make sure our ansible playbooks are idempotent. That is, if you run a playbook once, it puts things in the desired state, and if you run it again (or as many times as you want), it shouldn't change anything at all, since things are already in the desired state.

There are all sorts of reasons why this doesn't happen, sometimes easy to fix and sometimes more difficult. We do run a daily ansible-playbook run over all our playbooks with '--check --diff', that is, check what (if anything) would change and what the change would be.

I noticed on this report that all our builders were showing a change in the task that installs required packages. On looking more closely, it turns out the playbook was downgrading linux-firmware every run, and dnf-automatic was upgrading it (because the new one was marked as a security update). This was due to us specifying "kernel-firmware" as the package name, but only the older linux-firmware package provided that name, not the new one. Switching that to the new/correct 'linux-firmware' cleared up the problem.
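A quick way to spot that kind of mismatch (a sketch, not something our playbooks actually do) is to ask dnf which real package provides each name you are installing:

    # Sketch: check which packages provide the names a playbook installs.
    # Only kernel-firmware / linux-firmware come from the story above.
    import subprocess

    for name in ["kernel-firmware", "linux-firmware"]:
        out = subprocess.run(
            ["dnf", "repoquery", "--whatprovides", name],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        print(f"{name!r} is provided by:\n{out or '(nothing)'}\n")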

AI scraper update

I blocked a ton of networks last week, but then I spent some time looking more closely at what they were scraping. Turns out there were 2 mirrors of projects (one linux kernel and one git) that the scrapers were really, really interested in. Since those mirrors had zero commits or updates in the 5 years since they were initially created, I just made them both return 403 in apache and... the load is dramatically better. Almost back to normal. I have no idea why they wanted to crawl those old copies of things already available elsewhere, and I doubt this will last, but for now it gives us a bit of time to explore other options (because I am sure they will be back).
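Figuring out what they were actually after was mostly just staring at access logs. Something like this little sketch (the log path and the apache combined log format are assumptions) is enough to surface which paths are getting hammered:

    # Sketch: count the most requested URL paths in an apache access log.
    # The log location is an assumption.
    from collections import Counter

    hits = Counter()
    with open("/var/log/httpd/access_log") as log:
        for line in log:
            parts = line.split('"')
            if len(parts) < 2:
                continue
            request = parts[1].split()          # e.g. GET /some/repo HTTP/1.1
            if len(request) >= 2:
                # Group by the first two path components to spot whole repos.
                hits["/".join(request[1].split("/")[:3])] += 1

    for path, count in hits.most_common(20):
        print(f"{count:8d}  {path}")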

Datacenter Move

I'm going to likely be sending out a devel-announce / community blog post next week, but for anyone who is reading this a sneak preview:

We are hopefully going to gain at least some network on our new hardware around April 16th or so. This will allow us to get in and configure firmware, decide setup plans and start installing enough machines to bootstrap things up.

The plan currently is still to do the 'switcharoo' (as I am calling it) the week of June 16th. That's the week after devconf.cz and two weeks after flock.

For Fedora Linux users, there shouldn't be much to notice. Mirrorlists will all keep working, and websites, etc. should keep going fine. pagure.io will not be directly affected (it's moving later in the year).

For Fedora contributors, Monday and Tuesday we plan to "move" the bulk of applications and services. I would suggest just trying to avoid doing much on those days, as services may be moving around or broken in various ways. Starting Wednesday, we hope to make sure everything is switched and fix problems or issues. In some ideal world we could just relax then, but if not, Thursday and Friday will continue stabilization work.

The following week, the newest of the old machines in our current datacenter will be shipped to the new one. We will bring those up and add capacity on them (many of them will add openqa or builder resources).

That is at least the plan currently.

Spam on matrix

There's been another round of spam on matrix this last week. It's not just Fedora that's being hit, but many other communities that are on Matrix. It's also not like older communication channels (IRC) didn't have spammers on them at times in the past either. The particularly disturbing part on the matrix end is that the spammers post _very_ disturbing images. So, if you happen to look before they get redacted/deleted it's quite shocking (which is of course what the spammer wants). We have had a bot in place for a long while and it usually redacts things pretty quickly, but there is sometimes a lag in matrix federation, so folks on some servers still see the images until their server gets the redaction events.

There are various ideas floated to make this better, but due to the way matrix works, along with wanting to allow new folks to ask questions and interact, there are no simple answers. It may take some adjustments to the matrix protocol.

If you are affected by this spam, you may want to set your client to not 'preview' images (so it won't load them until you click on them), and be patient as our bot bans/kicks/redacts offenders.

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114286697832557392

Late March infra bits 2025

Scrye into the crystal ball

Another week, another Saturday blog post.

Mass updates/reboots

We did another mass update/reboot cycle. We try and do these every so often, as the fedora release schedule permits. We usually do all our staging hosts on a Monday, then on Tuesday a bunch of hosts that we can reboot without anyone really noticing (ie, we have HA/failover/other paths or the service is just something that we consume, like backups), and finally on Wednesday we do everything else (hosts that do cause outages).

Things went pretty smoothly this time; I had several folks helping out and that's really nice. I have done them all by myself before, but it takes a while. We also fixed a number of minor issues with hosts: serial consoles not working right, nbde not running correctly, and zabbix users not being set up correctly locally. There was also a hosted server where reverse dns was wrong, causing ansible to have the wrong fqdn and messing up our update/reboot playbook. Thanks James, Greg and Pedro!

I also used this outage to upgrade our proxies from Fedora 40 to Fedora 41.

After that, our distribution of instances (by ansible_distribution_version) is:

  • 252 Fedora 41

  • 105 RHEL 9.5

  • 21 RHEL 8.10

  • 8 Fedora 40

  • 2 RHEL 9

  • 1 Fedora 43

It's interesting that we now have 2.5x as many Fedora instances as RHEL, although that's mostly because all the builders are Fedora.

The Fedora 40 GA compose breakage

Last week we got very low on space on our main fedora_koji volume. This was mostly caused by the storage folks syncing all the content to the new datacenter, which meant that it kept snapshots as it was syncing.

In an effort to free space (before I found out there was nothing we could do but wait), I removed an old composes/40/ compose. This was the final compose for Fedora 40 before it was released, and the reason we kept it in the past was to allow us to make delta rpms more easily. It's the same content as the base GA stuff, but it's in one place instead of split between the fedora and fedora-secondary trees. Unfortunately, there were some other folks using this: internally it was being used for some things, and IoT was also using it to make their daily image updates.

Fortunately, I didn't actually fully delete it, I just copied it to an archive volume, so I was able to point the old location at the archive and everyone should be happy now.

Just goes to show that if you set up something for yourself, others will often find it helpful as well (unknown to you), so retiring things is hard. :(

New pagure.io DDoS

For the most part we are handling load ok now on pagure.io. I think this is mostly due to us adding a bunch of resources, tuning things to handle higher load and blocking some larger abusers.

However, on Friday we got a new fun one: a number of IPs were crawling an old (large) git repo, grabbing git blame on every rev of every file. This wasn't causing a problem on the webserver or bandwidth side, but instead causing problems for the database/git workers. Since they had to query the db for every one of those requests and pull a bunch of old historical data, it saturated the cpus pretty handily. I blocked access to that old repo (that's not even used anymore) and that seemed to be that, but they may come back again doing the same thing. :(

We do have an investigation open for what we want to do long term. We are looking at anubis, rate limiting, mod_qos and other options.

I really suspect these folks are just gathering content which they plan to resell to AI companies for training. Then the AI company can just say they bought it from Bob's scraping service and 'openwash' the issues. No proof of course, just a suspicion.

Final freeze coming up

Finally, the final freeze for Fedora 42 starts next Tuesday, so we have been trying to land anything last minute. If you're a maintainer or contributor working on Fedora 42, do make sure you get everything lined up before the freeze!

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114247602988630824

Mid Late March infra bits 2025

Scrye into the crystal ball

Fedora 42 Beta released

Fedora 42 Beta was released on Tuesday. Thanks to everyone in the Fedora community who worked so hard on it. It looks to be a pretty nice release, with lots of things in it and working pretty reasonably already. Do take it for a spin if you like: https://fedoramagazine.org/announcing-fedora-linux-42-beta/

Of course with the Beta out the door, our infrastructure freeze is lifted, so I merged 11 PRs that were waiting for that on Wednesday. Also, next week we are going to get in a mass update/reboot cycle before the final freeze the week after.

Ansible galaxy / collections fun

One of the things I wanted to clean up was the ansible collections that were installed on our control host. We have a number that are installed via rpm (from EPEL). Those are fine, we know they are there and what version, etc. Then, we have some that are installed via ansible: we have a requirements.txt file and running the playbook on the control host installs those exact versions of roles/collections from ansible galaxy. Finally, we had a few collections installed manually. I wanted to get those moved into ansible so we would always know what we have installed and what version it was. So, simple right? Just put them in requirements.txt. I added them in there and... it said they were just not found.

The problem turned out to be that we had roles in there but no collections anymore, so I had not added a 'collections:' section, and it was trying to find 'roles' with those collection names. The error "not found" was 100% right, but it took me a bit to realize why they were not found. :)
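In case it saves someone else the head scratching, the shape of the fix looks roughly like this (a sketch with made-up role and collection names, loaded here just to show the structure a galaxy requirements file expects):

    # Sketch: collections need to live under their own 'collections:' key,
    # otherwise every entry is treated as a role. Names below are made up.
    import yaml

    requirements = yaml.safe_load("""
    roles:
      - name: example.somerole        # hypothetical role from galaxy
        version: 1.2.3
    collections:
      - name: community.general       # hypothetical collection
        version: ">=8.0.0"
    """)

    print("roles:", [r["name"] for r in requirements["roles"]])
    print("collections:", [c["name"] for c in requirements["collections"]])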

More A.I. Scrapers

AI scrapers hitting open source projects is getting a lot of buzz. I hope that some of these scraper folks will realize it's counterproductive to scrape things at a rate that makes them not work, but I'm not holding my breath.

We ran into some very heavy traffic and I ended up having to block Brazil from pagure.io for a while. We also added some CPUs and adjusted things to handle higher load. So far we are handling things ok now and I removed the Brazil block. But there's no telling when they will be back. We may well have to look at something like anubis, but I fear the scrapers would just adjust to not be something it can catch. Time will tell.

That's it for this week folks...

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114207270082200302

Mid March infra bits 2025

Scrye into the crystal ball

AI Scraper scourge

The AI scraper scourge (I can only assume that's what they are) continued, and intensified in the last week. This time they were hitting pagure.io really quite hard. We blocked a bunch of subnets, but it's really hard to block everything without impacting legit users, and indeed, we hit several cases where we blocked legit users. Quickly reverted, but still troublesome. On Thursday and Friday it got even worse. I happened to notice that most of the subnets/blocks were from .br (Brazil). So, in desperation, I blocked .br entirely and that brought things back to being more responsive. I know that's not a long term solution, so I will lift that block as soon as I see the traffic diminish (which I would think it would once they realize it's not going to work). We definitely need a better solution here. I want to find the time to look into mod_qos, where we could at least make sure important networks aren't blocked and other networks get low priority. I also added a bunch more cpus to the pagure.io vm. That also seemed to help some.

F42 Beta on the way

Fedora 42 Beta is going to be released Tuesday! Shaping up to be another great release. Do download and test if you wish.

Datacenter Move

The datacenter move we are doing later this year has shifted a bit later still: due to some logistics we are moving from the May window to a mid June window. That does give us a bit more time, but it's still going to be a lot of work in a short window. It's also going to be right after flock. We hope to have access to the new hardware in a few weeks here so we can start to install and set things up. The actual 'switcharoo' in June will be over 3 or so days, then fixing anything that was broken by the move, and hopefully we'll be all set before the F43 mass rebuild.

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114167827757899998

Early March infra bits 2025

Scrye into the crystal ball

Here we are Saturday morning again. This week was shorter than normal for me work wise, as I took Thursday and Friday off, but there was still a lot going on.

Atomic desktops / iot / coreos caching issues

I spent a lot of time looking into some odd issues that ostree users were hitting. It was really hard to track down what was broken. No errors on our end; I invalidated cloudfront a few times, did a bunch of tweaks to our backend varnish caches and... the problem turned out to be caused by: me.

Turns out we are getting hit really hard all the time by what I can only assume are crawlers working to fuel LLMs. It's not just us; see for example this excellent lwn article on the problem.

We use Amazon CloudFront to serve ostree content to users, since it allows them to hit endpoints in their local region, so it's much faster and reduces load on our primary cache machines. CloudFront in turn hits our cache machines to get the content it caches.

How does this relate to ostree issues, you might ask? Well, I blocked a bunch of IPs that were hitting our kojipkgs servers particularly hard. It turns out some of those IPs were cloudfront, so just _some_ of the cloudfront endpoints didn't have access to the backend and their caches were out of date. I assume cloudfront also has multiple distributions at each region and it was only _some_ of those.
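One way to avoid repeating that mistake (a sketch, not something we actually run) is to check a suspect address against the IP ranges Amazon publishes before blocking it:

    # Sketch: check whether an address belongs to CloudFront before blocking it.
    # The example address is made up; ip-ranges.json is Amazon's published list.
    import ipaddress
    import json
    import urllib.request

    URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"
    with urllib.request.urlopen(URL) as resp:
        ranges = json.load(resp)

    cloudfront_nets = [
        ipaddress.ip_network(p["ip_prefix"])
        for p in ranges["prefixes"]
        if p["service"] == "CLOUDFRONT"
    ]

    suspect = ipaddress.ip_address("203.0.113.7")   # made-up example address
    if any(suspect in net for net in cloudfront_nets):
        print(f"{suspect} is a CloudFront address, don't block it!")
    else:
        print(f"{suspect} does not appear to be CloudFront")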

Removing all those blocks got everything working for everyone again (but of course the AI bots are ever present). I also enabled a thing called 'origin shield' which means cloudfront should only pull from one region and sync to the others, reducing load on our caches.

Longer term we probably need to split up our kojipkgs cache, add more nodes, or rearrange how things are hit.

I'm deeply sorry about this issue; I know many users were frustrated. I sure was too. Lesson learned to be careful in blocking bots.

s390x caching problems

And related to that issue, our s390x builders have been having problems pulling packages for builds. They have a local cache that in turn pulls from our primary one. Sometimes, sporadically, it's getting partial downloads or the like. I've still not fully figured out the cause here, but I did make a number of changes to the local cache there that seem to have reduced the problem.

Longer term here we probably should separate out this cache to hit an internal-only one, so the load on the main one doesn't matter.

Coffee machine fun

Friday my coffee machine (a delonghi magnifica) got stuck in the middle of a cycle. It had ground the beans, but then stopped before the water step. So, I looked around at repair videos and then took it apart. It was actually pretty cool how it was put together, and I was able to basically manually turn a pulley to move it down to unlocked, then I could remove the brew group, clean everything up and put it back together. Working great again. Kudos also to the iFixit pro toolkit that I got a while back. A weird screw? No problem.

Home assistant

Been having a lot of fun tinkering with home assistant.

After looking around, I decided that zigbee networking is better than bluetooth and less power hungry than wifi, so I picked up a Zigbee gateway and it works just fine. At one point I thought I had accidentally flashed it with the esp32 builder, but it seems it didn't work, so whew.

Got some smart plugs ( Amazon link to smart plugs ) and these little things are great! pair up fine, HA can manage their firmware versions/update them, lots of stats. I put one on the plug my car charges on, another on a plug that has a fridge and a freezer on, one on the plug my main server UPS is on, and kept one for 'roaming'. It's cool to see how much power the car charging takes in a nice graph.

Got some cheap temp sensors ( Amazon link to temp / humidity sensors ) They seem to be working well. I put one in my computer closet, one in our living room, one in the garage and one outside. (The living room seems to have a 4 degree change from day to night)

I had some old deako smart switches along with a gateway for them. They use a bluetooth mesh to talk to each other and an app, but the gateway is needed for them to be on wifi. I never bothered to set up the gateway until now, but HA needs it to talk to the switches. So I tried to set it up, but it would just fail at the last setup step. So, I mailed Deako and... they answered really quickly and explained that the gateway is no longer supported, but they would be happy to send me some of their new smart switches (which have wifi built in and can act as a gateway for the old ones) free of charge! I got those on Thursday, set them up and they worked just dandy.

But then I tripped over Chesterton's Fence. The 3 old smart switches were all controlling the same light. That seemed silly to me. Why not just have one on that light, use two 'dumb' switches for the other two places for that light, and then move the other smart ones to other lights? Well, turns out there are several problems with that: the 'dumb' switches have a physical position, so if you did that one could be 'on' with the light off, another 'off', etc. But the biggest problem is that the smart switch is needed to route power around. If you turn the light 'off' on a 'dumb' switch, the one smart switch can end up with no power and it doesn't do anything at all. So, after messing them up I figured out how to factory reset them and re-pair them. For anyone looking, the process is:

Resetting:

  • plug in and while it 'boots', press and hold the switch.

  • it should come up with 3, 2, 1 buttons to press.

  • press each in turn

Pairing (you have to pair switches that all control the same lights):

  • unplug all switches

  • plug one in, then press and hold the switch; it should come up with a flashing 1

  • If nothing happens, try each of the other two in turn. Only one has 'power'

  • press 1 on the first switch.

  • Repeat on switch 2 and press 2 on the first switch

  • Repeat on the last switch and press 3 on the first switch

I could have saved myself a bunch of time if I had just left it the way it was. Oh well.

Finally I got some reolink cameras. We have a small game camera we put out from time to time to make sure the local feral cats are ok, and to tell how many raccoons are trying to eat the cats' food. It's kind of a pain because you have to go put it outside, wait a few days, remember to bring it back in, and then pull the movies off its sdcard.

So replacing that with something that HA could manage and we didn't need to mess with sounded like a win. I picked up a bundle with a Home Hub and two Argus Eco Ultra and 2 solar panels for them.

The Home hub is just a small wifi ap with sdcard slots. You plug it in and set it up with an app. Then, you pair the cameras to it and HA talks to the cameras via the hub. There's no external account needed, setup is all local, and you can even firewall off reolink if you don't want them to auto upgrade firmware, etc. I've not yet set the cameras up outside, but a few impressions: the cameras have REALLY LOUD audio. When you first power them on they greet you in a bunch of languages and tell you how to set them up. That's fine, but when I did this people in my house were sleeping. Not cool. Even powering them off causes a chirp that is SUPER LOUD. The cameras have a 'siren' control that I have been afraid to try. :) Anyhow, more on these as I get them set up.

I had 2 UPSes here. One for my main server and critical loads and another one for less important stuff. With all the home assistant stuff I ran out of battery backed plugs, so I picked up a 3rd UPS. The new one was easy to add to nut, but I had a long standing problem with the two I already had: They are exactly the same model and product and don't provide a serial number on the usb port, so nut can't tell them apart. Finally I dug around and figured out that while I was specifying bus, port and device, it wasn't working until I moved one of them to another USB plug (and thus another bus). I then got all 3 of them added to HA. One thing that confused me there is that since all 3 of them are on the same nut server and are using the same upsmon user, how do I add more than 1 of them in HA? Well, it turns out if you go to the nut integration, add device, enter the same host/user/pass it will pop up a screen that asks you which one to add. So you can add each in turn.

So, lots of fun hacking on this stuff.

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114128294620402197

Misc bits from late February 2025

Scrye into the crystal ball

Here's another batch of misc bits from Fedora Infra land. I missed last weekend as I was off on some vacation. I've got a few more days coming up that I need to use before they go away. :)

Personal Stuff

We had a big wind storm here on Monday. It knocked out power for about 5 hours (which wasn't so bad, since we have a generator, but was annoying). One of our large (150' Douglas fir) trees got knocked down by the wind. Luckily it fell away from the house and a fence, and managed to hit a thicket of blackberries I was in the process of trying to remove. I'll still need to take care of it, as it's in the way of our path. Also, our dishwasher died (it's gonna be a few weeks before someone can look at it). It's sad that repairing something takes a few weeks, but I bet I could buy a new one in a few days.

Day of learning on friday

Red Hat does these quarterly "day of learning" days, where everyone is encouraged to go learn something. Work related, not work related, interesting, useful or not. It's really a great thing. This time I decided to play around with Home Assistant some more and figure out how it does things. Adam Williamson mentioned it in a matrix discussion, and I had been meaning to look into it too, so it seemed like a great time. I picked up a Home Assistant Green (which is basically a small arm box that has Home Assistant (HA) all installed on it and ready to go). Initial setup was easy, no issues.

Several of my devices are using bluetooth, so I also picked up some little esp32 boards to use as a bluetooth proxy. It's pretty amazing how small these little guys are. I did make one mistake ordering though: I got some with microusb, so I had to dig up some old cables. I think I am going to replace them with ones that have usbc. So, I flashed the bluetooth proxy on one, it joined my wifi and... then it didn't work. Took me a while to find out that since my wireless and wired networks are completely separate, I needed to run a mdns-repeater on my gateway/firewall box to repeat the mdns advertisements. After doing that it saw the proxy just fine, along with a printer that was on wifi.

I connected up my ups server (nut) with no problems, so now I have nice charts and graphs of power usage, battery state, etc.

I have some 'smart' light switches from Deako here too. They connect and work with an android app via bluetooth, so I thought they were using bluetooth for HA integration too, but it turns out it requires wifi to connect to them. I do have an old dongle to connect them to wifi, and I tried to set that up, but it seems to just hang at the end when it says it's preparing it. They no longer make that Deako Connect dongle, so I mailed them about it. Perhaps the provisioning service for it is no longer even there.

Managed to add my car in with not much trouble. Kinda cool to have all its sensors available easily on the dashboard.

This stuff is a giant rabbit hole. Some things I want to do when time/energy permits:

  • There's a way to connect an esp32 device to the serial port on the hot water heater and get all kinds of info from it.

  • There's supposedly a way to connect to an internal connector on the heat pumps I have and get a ton of info/control them.

  • I'd like to figure a way to monitor my water holding tank. That one is going to be tricky, as the pumphouse is down a hill and not in line of sight of the house. Seems like people do this one of two ways: a pressure sensor dropped in the bottom of the tank, or a distance sensor at the top showing where the water line is.

  • After looking, I think bluetooth is too much of a mess for things usually, so I ordered a Zigbee thing for the HA server and some zigbee based power outlets and some temp sensors to try that out.

Lots of fun.

caching issues

So, starting on Tuesday there were some issues with our caching setup. First it was s390x builds: they would request some package/repodata from their local cache, it would in turn request it from the main ones, and sometimes it would just get a partial read or an incorrect size. I couldn't find anything that seemed wrong. We are in freeze for Beta, so no real changes have been made to infrastructure. So, I was left with trying various things. I updated and rebooted the s390x cache machine. I blocked a bunch of ip's that were hitting the main cache too much. I updated and rebooted the main cache machines, including a sidetrack where I noticed they were running an old kernel. Turns out 'sdubby' got installed there somehow and it was installing new kernels as efi; the vm wasn't booting efi, so it just kept booting on its old kernel. After all that it seemed like the issue was gone, or perhaps happening much less often?

After that however, some folks are seeing problems with ostree downloads. All the objects I see failing don't exist on the cache hosts either, so I am not sure what's happening. I'm very sorry for the issue, as I know it's affecting a number of people. Will keep trying to track it down.

riscv secondary

Managed to mostly sort out builder authentication, I think. There are some limitations for external builders, but I think we can live with that. Hopefully next week we will start onboarding builders, then we will need to import builds, and then it will be time to start building!

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114089126937486625

DC move, riscv, AI and more for early February 2025

Scrye into the crystal ball

Hey, another week, another blog. Still going, still hope things are of interest to you, the reader.

PTO

I realized I had some PTO (Paid time off, basically what used to be called 'vacation') that I needed to use before it disappeared, so I have planned a number of vacation days in the coming month. Including 2 in this last week. :)

I'll also point to an old post of mine about what I personally think about vacations/pto when you are working in a community: time off when you are paid to work in a community

The biggest thing for me about this isn't completely disconnecting or ignoring everything; in fact most days I get up at the same time (cats have to be fed!) and sit at my laptop like I usually do. It's that I don't have to attend meetings, and I can do things I fully enjoy. Sometimes that's still working in the community, but sometimes it's like what I did on Friday: look into a battery backup for the house and tradeoffs/ideas around that. In the end I decided not to do anything right now, but I had fun learning about it.

My next pto days are next Thursday (2025-02-20), then next Friday is a "recharge" day at Red Hat, then I have Monday off (2025-02-24), then March 6th and 7th, and finally March 11th.

Datacenter Move

More detailed planning is ramping up for me. I have been working on when and how to move particular things, how to shuffle resources around and so forth. I plan to work on my doc more next week and then open things up to feedback from everyone who wants to provide it.

A few things to note about this move:

  • There's going to be a week (tentatively in May) where we do the 'switcharoo'. That is, take down services in IAD2 and bring them up in RDU3. This is going to be disruptive, but I'm hoping we can move blocks of things each day, avoid too much outage time, and minimize the impact.

  • Once the switcharoo week is over and we are switched, there will be either no staging env at all, or a limited one. This will persist until hardware has been shipped from IAD2 to RDU3 and we can shuffle things around to bring staging entirely back up.

  • Once all this is over, we will be in a much better place and with much newer/faster hardware and I might sleep for a month. :)

riscv secondary koji

Slow progress being made. Thanks to some help from abompard, auth is now working correctly. It was of course a dumb typo I made in a config, causing it to try and use a principal that didn't exist. Oops. Now, I just need to finish the compose host, then sort out keytabs for builders, and hopefully the riscv SIG can move forward on populating it and next steps.

AI

Oh no! This blog has AI in it? Well, not really. I wanted to talk about something from this past week that's AI related, but first, some background. I personally think AI does have some good / interesting uses if carefully crafted for that use. It's a more useful hype cycle than say cryptocoins or blockchain or 'web3', but less useful than virtual machines, containers or clouds. Like absolutely anything else, when someone says "hey, let's add this AI thing here" you have to look at it and decide if it's actually worth doing. I think my employer, Red Hat, has done well here. We provide tools for running your own AI things, we try and make open source AI models and tools, and we add it in limited ways where it actually makes sense to existing tools (ansible lightspeed, etc).

Recently, Christian Schaller posted his regular 'looking ahead' desktop blog post. He's done this many times in the past to highlight desktop things his team is hoping to work on. It's great information. In this post: looking ahead to 2025 and fedora workstation and jobs on offer he had a small section on AI. If you haven't seen it, go ahead and read it. It's short and at the top of the post. ;)

Speaking for myself, I read this as the same sort of approach that Red Hat is taking. Namely, work on open source AI tooling and integrations, provide those for users that want to build things with them, and see if there are any other places where it could make sense to add integration points.

I've seen a number of people read this as "Oh no, they are shoving AI in all parts of Fedora now, I'm going to switch to another distro". I don't think that is at all the case. Everything here is being done the Open Source way. If you don't care to use those tools, don't. If AI integration is added it will be in the open and after tradeoffs and feedback about being able to completely disable it.

ansible lint

We had set up ansible-lint to run on our ansible playbooks years ago. Unfortunately, due to a bug it was always saying "ok". We fixed that a while back and now it's running, but it has some pretty opinionated ideas about how things should be. The latest case of this was names: it wants any play name to start with a capital letter. Handlers in ansible are just plays that get notified when another thing changes. If you change the name of, say, "restart httpd" to "Restart httpd" in the handler, you then have to change every single place that notifies it too. This caused an annoying mess for a few weeks. Hopefully we have them all changed now, but this rule seems a bit random to me.
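To take some of the pain out of that kind of rename, a tiny check along these lines (made-up playbook content, just to illustrate the coupling between notify and handler names) can flag notifications that no longer match any handler:

    # Sketch: flag notify entries that don't exactly match a handler name.
    # The playbook content is made up; real playbooks may also use 'listen'.
    import yaml

    playbook = yaml.safe_load("""
    - hosts: proxies
      tasks:
        - name: Install httpd config
          ansible.builtin.copy:
            src: httpd.conf
            dest: /etc/httpd/conf/httpd.conf
          notify: restart httpd          # stale lowercase name
      handlers:
        - name: Restart httpd            # renamed to please ansible-lint
          ansible.builtin.service:
            name: httpd
            state: restarted
    """)

    for play in playbook:
        handlers = {h["name"] for h in play.get("handlers", [])}
        for task in play.get("tasks", []):
            notify = task.get("notify", [])
            if isinstance(notify, str):
                notify = [notify]
            for target in notify:
                if target not in handlers:
                    print(f"task {task['name']!r} notifies missing handler {target!r}")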

fedmsg retirement

In case you didn't see it, we finally retired our old fedmsg bus! We switched to fedora-messaging a while back, but kept a bridge between them to keep both sides working. With the retirement of the old github2fedmsg service we were finally able to retire it.

πŸŽ‰πŸŽ‰πŸŽ‰πŸŽ‰πŸŽ‰πŸŽ‰πŸŽ‰πŸŽ‰πŸŽ‰πŸŽ‰πŸŽ‰πŸŽ‰πŸŽ‰πŸŽ‰

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114009437084125263

Bits from early February 2025

Scrye into the crystal ball

Let's keep the blogging rolling. This week went by really fast, but a lot of it for me was answering emails and pull requests and meetings. Those are all important, but sometimes it makes it seem like not much was actually accomplished in the week.

riscv secondary koji hub

I got some x86 buildvms set up. These are to do tasks that don't need to be done on a riscv builder, like createrepo/newrepos or the like. I'm still having an issue with auth on them however, which is related to the auth issue with the web interface. Will need to get that sorted out next week.

f42 branching day

Tuesday was the f42 branching day. It went pretty smoothly this cycle I think, but there's always a small number of things to sort out. It's really the most complex part of the release cycle for releng: so many moving parts and disparate repos and configs needing changes. This time I tried to stay out of actually doing anything, in favor of just providing info or review for Samyak, who was doing all the work. I mostly managed to do that.

Datacenter move

Planning for the datacenter move is moving along. I've been working on internal documents around the stuff that will be shipped after we move, and next week I am hoping to start a detailed plan for the logical migration itself. It's a pretty short timeline, but I am hoping it will all go smoothly in the end. We definitely will be in a better place with better hardware once we are done, so I am looking forward to that.

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/113969409712070764

Bits from late jan 2025

Scrye into the crystal ball

January has gone by pretty fast. Here's some longer form thoughts about a few things that happened this last week.

Mass updates/reboots

We did a mass update/reboot cycle on (almost) all our instances. The last one was about 2 months ago (before the holidays), so we were due. We do apply security updates on weekdays (ie, Monday - Friday), but we don't apply bugfix updates except during these scheduled windows. Rebooting everything makes sure everything is on the latest kernel versions, and also ensures that if we had to reset something for other reasons it would come back up in the correct/desired/working state. I did explore the idea of having things set up so we could do these sorts of things without an outage window at all, but at the time (a few years ago) the sticking point was database servers. It was very possible to set up replication, but it was all very fragile and required manual intervention to make sure failover/failback worked right. There's been a lot of progress in that area though, so later this year it might be time to revisit that.

We also use these outage windows to do some reinstalls and dist-upgrades. This time I moved a number of things from f40 to f41, and reinstalled the last vmhost we had still on rhel8. That was a tricky one as it had our dhcp/tftp vm and our kickstarts/repos vm. So, I live migrated them to another server, did the reinstall and migrated them back. It all went pretty smoothly.

There was some breakage with secure boot signing after these upgrades, but it turned out to be completely my fault. The _last_ upgrade cycle, opensc changed the name of our token from 'OpenSC Card (Fedora Signer)' to 'OpenSC Card'. The logic upstream being "oh, if you only have one card, you don't need to know the actual token name", which is bad for a variety of reasons, like if you suddenly add another card, or swap cards. In any case, I had failed to fix my notes on that and was trying the old name and getting a confusing and bad error message. Once I sorted it out, everything was working again.

Just for fun, here's our top 5 os versions by number:

  • 237 Fedora 41

  • 108 RHEL 9.5

  • 31 Fedora 40

  • 23 RHEL 8.10

  • 4 RHEL 7.9

The 7.9 ones will go away once fedmsg and github2fedmsg are finally retired. (Hopefully soon).

Datacenter Move

A bunch of planning work on the upcoming datacenter move. I'm hoping next week to work a lot more on a detailed plan. Also, we in infrastructure should kick off some discussions around if there's anything we can change/do while doing this move. Of course adding in too much change could be bad given the short timeline, but there might be some things to consider.

I also powered off 11 of our old arm servers. They had been disabled in the buildsystem for a while to confirm we didn't really need them, so I powered them off and saved us some energy usage.

riscv-koji secondary hub

The riscv-koji secondary hub is actually installed and up now. However, there's still a bunch of things to do:

  • Need to set up authentication so people (and I) can log in to it.

  • Need to install some buildvm-x86 builders to do newrepos, etc

  • Need to install a composer to build images and such on

  • At next week's riscv sig meeting hopefully we can discuss steps after that. Probably we would just set up tags/targets/etc and import a minimal set of rpms for a buildroot

  • Need to figure out auth for builders and add some.

Overall, progress finally. Sorry I have been a bottleneck on it, but soon I think lots of other folks can start working on it.

power9 lockups

We have been having annoying lockups of our power9 hypervisors. I filed https://bugzilla.redhat.com/show_bug.cgi?id=2343283 on that. In the meantime I have been moving them back to the 6.11.x kernels, which don't seem to have the problem. I did briefly try a 6.13.0 kernel, but the network wasn't working there. I still need to file a bug on that when I can take one down and gather debugging info. It was the i40e module not being able to load due to some kind of memory error. ;(

Chat and work items

One thing that was bugging me last year is that I get a lot of notifications on chat platforms (in particular slack and matrix) where someone asks me something or wants me to do something. That's perfectly fine, I'm happy to help. However, when I sit down in the morning, I usually want to look at what's going on and prioritize things, not get sidetracked into replying/working on something that's not the most important issue. This resulted in me trying to remember which things needed responses, and sometimes missing going back to them or getting distracted by them.

So, a few weeks ago I started actually noting things like that down as I came to them. Then, after higher priority things were taken care of, I had a nice list to go back through and hopefully not miss anything.

It's reduced my stress, and I'd recommend it for anyone with similar workflows.

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/113930147248831003