Late March infra bits 2025

Another week, another Saturday blog post.
Mass updates/reboots
We did another mass update/reboot cycle. We try to do these every so often, as the Fedora release schedule permits. We usually do all our staging hosts on a Monday, then on Tuesday a bunch of hosts that we can reboot without anyone really noticing (i.e., we have HA/failover/other paths, or the service is just something that we consume, like backups), and finally on Wednesday we do everything else (hosts that do cause outages).
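As a rough illustration, the scheduling boils down to bucketing hosts by how disruptive a reboot is. Here's a minimal Python sketch of that logic; the group labels and host names are hypothetical, not our actual inventory:

```python
# A minimal sketch of bucketing hosts into update/reboot "waves".
# Group labels and hosts here are hypothetical, not our real inventory.

REBOOT_WAVES = {
    "monday": {"staging"},                      # staging hosts go first
    "tuesday": {"ha", "failover", "consumer"},  # reboots nobody should notice
    "wednesday": set(),                         # everything else (causes outages)
}

def wave_for(host_groups: set[str]) -> str:
    """Return the day a host should be updated and rebooted."""
    for day, groups in REBOOT_WAVES.items():
        if host_groups & groups:
            return day
    return "wednesday"  # default: assume a reboot will be noticed

print(wave_for({"staging"}))        # -> monday
print(wave_for({"ha", "proxies"}))  # -> tuesday
print(wave_for({"koji"}))           # -> wednesday
```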
Things went pretty smoothly this time, and I had several folks helping out, which is really nice. I have done them all by myself before, but it takes a while. We also fixed a number of minor issues with hosts: serial consoles not working right, NBDE not running correctly, and Zabbix users not being set up correctly locally. There was also a hosted server where reverse DNS was wrong, causing Ansible to pick up the wrong FQDN and messing up our update/reboot playbook. Thanks James, Greg and Pedro!
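For the reverse DNS problem, the quickest check is to compare each host's expected name against what the PTR record actually returns. A minimal sketch, with a hypothetical host list standing in for the real inventory:

```python
# A minimal sketch for catching reverse DNS mismatches like the one that
# gave Ansible the wrong FQDN. The host list here is hypothetical.
import socket

hosts = ["host01.example.org", "host02.example.org"]

for name in hosts:
    try:
        addr = socket.gethostbyname(name)            # forward (A) lookup
        ptr_name, _, _ = socket.gethostbyaddr(addr)  # reverse (PTR) lookup
    except OSError as e:
        print(f"{name}: lookup failed: {e}")
        continue
    if ptr_name != name:
        print(f"{name}: PTR for {addr} is {ptr_name!r}, expected {name!r}")
```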
I also used this outage to upgrade our proxies from Fedora 40 to Fedora 41.
After that our distribution of instances is:
number / ansible_distribution_version
252 41
105 9.5
21 8.10
8 40
2 9
1 43
It's interesting that we now have roughly twice as many Fedora instances as RHEL, although that's mostly because all the builders run Fedora.
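For the curious, a tally like that can be produced from cached Ansible facts. A minimal sketch, assuming JSON fact files in a local cache directory (the path is an assumption, not our actual setup):

```python
# A minimal sketch of tallying ansible_distribution_version from a JSON
# fact cache. The cache path is an assumption, not our actual layout.
import json
from collections import Counter
from pathlib import Path

counts = Counter()
for fact_file in Path("/var/cache/ansible/facts").iterdir():
    if not fact_file.is_file():
        continue
    facts = json.loads(fact_file.read_text())
    counts[facts.get("ansible_distribution_version", "unknown")] += 1

for version, count in counts.most_common():
    print(f"{count:>4} {version}")
```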
The Fedora 40 GA compose breakage
Last week we got very low on space on our main fedora_koji volume. This was mostly caused by the storage folks syncing all the content to the new datacenter, which meant snapshots were being kept while the sync was in progress.
In an effort to free space (before I found out there was nothing we could do but wait), I removed an old composes/40/ compose. This was the final compose for Fedora 40 before it was released, and the reason we kept it in the past was to allow us to make delta RPMs more easily. It's the same content as the base GA stuff, but it's in one place instead of split between the fedora and fedora-secondary trees. Unfortunately, there were some other folks using this: internally it was being used for some things, and IoT was also using it to make their daily image updates.
Fortunately, I didn't actually fully delete it; I had just copied it to an archive volume, so I was able to point the old location at the archive, and everyone should be happy now.
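The fix itself amounts to little more than a symlink from the old compose location to its new home on the archive volume. Roughly, with illustrative paths rather than the exact layout:

```python
# Roughly the shape of the fix: point the old compose location at the
# archived copy. Paths here are illustrative, not the exact layout.
import os

old_path = "/mnt/koji/compose/composes/40"
archive_path = "/mnt/archive/composes/40"

if not os.path.exists(old_path):
    os.symlink(archive_path, old_path)  # old consumers keep working
```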
Just goes to show that if you set up something for yourself, others often find it helpful as well, unknown to you, so retiring things is hard. :(
New pagure.io DDoS
For the most part we are handling load ok now on pagure.io. I think this is mostly due to us adding a bunch of resources, tuning things to handle higher load and blocking some larger abusers.
However, on Friday we got a new fun one: a number of IPs were crawling an old (large) git repo, grabbing git blame output for every rev of every file. This wasn't causing a problem on the webserver or bandwidth side, but it was causing problems for the database/git workers. Since they had to query the database for every one of those requests and pull up a bunch of old historical data, it saturated the CPUs pretty handily. I blocked access to that old repo (that's not even used anymore) and that seemed to be that, but they may come back again doing the same thing. :(
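One way to spot this kind of pattern before the CPUs saturate is to scan the access logs for IPs hammering blame URLs. A minimal sketch; the log path, URL pattern, and threshold are all assumptions:

```python
# A minimal sketch for finding IPs hammering git blame URLs in an Apache
# access log. Log path, URL pattern, and threshold are all assumptions.
import re
from collections import Counter

LOG_PATH = "/var/log/httpd/access.log"
BLAME_RE = re.compile(r'^(\S+) .* "GET \S*/blame/\S* HTTP')

hits = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = BLAME_RE.match(line)
        if match:
            hits[match.group(1)] += 1

for ip, count in hits.most_common():
    if count > 1000:  # arbitrary cutoff for "abusive"
        print(ip, count)
```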
We do have an investigation open for what we want to do long term. We are looking at Anubis, rate limiting, mod_qos, and other options.
I really suspect these folks are just gathering content which they plan to resell to AI companies for training. Then the AI company can just say they bought it from Bob's scraping service and 'openwash' the issues. No proof of course, just a suspicion.
Final freeze coming up
Finally, the final freeze for Fedora 42 starts next Tuesday, so we have been trying to land anything last minute. If you're a maintainer or contributor working on Fedora 42, do make sure you get everything lined up before the freeze!
comments? additions? reactions?
As always, comment on mastodon: https://fosstodon.org/@nirik/114247602988630824