
Mid Late March infra bits 2025


Fedora 42 Beta released

Fedora 42 Beta was released on Tuesday. Thanks to everyone in the Fedora community who worked so hard on it. It looks to be a pretty nice release, with lots of things in it, and it's already working pretty reasonably. Do take it for a spin if you like: https://fedoramagazine.org/announcing-fedora-linux-42-beta/

Of course, with the Beta out the door, our infrastructure freeze is lifted, so I merged 11 PRs that were waiting for that on Wednesday. Also, next week we are going to get in a mass update/reboot cycle before the final freeze the week after.

Ansible galaxy / collections fun

One of the things I wanted to clean up was the ansible collections that were installed on our control host. We have a number that are installed via rpm (from EPEL). Those are fine; we know they are there and what version, etc. Then, we have some that are installed via ansible: we have a requirements.txt file, and running the playbook on the control host installs those exact versions of roles/collections from ansible galaxy. Finally, we had a few collections installed manually. I wanted to get those moved into ansible so we would always know what we have installed and what version it was. So, simple, right? Just put them in requirements.txt. I added them in there and... it said they were just not found.

The problem turned out to be that we had roles in there, but no collections anymore, so I had not added a 'collections:' section, and it was trying to find 'roles' with those collection names. The error "not found" was 100% right, but it took me a while to realize why they were not found. :)
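
For anyone else who trips over this, the shape of the requirements file matters: roles and collections go under separate keys, and anything not under a 'collections:' key gets looked up as a role. A minimal sketch of the layout (the names and versions here are just placeholders, not what we actually pin):

    roles:
      - name: example.some_role        # placeholder role name
        version: 1.2.3

    collections:
      - name: community.general        # placeholder collection name
        version: 9.2.0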

More A.I. Scrapers

AI scrapers hitting open source projects are getting a lot of buzz. I hope that some of these scraper folks will realize it's counterproductive to scrape things at a rate that makes them not work, but I'm not holding my breath.

We ran into some very heavy traffic, and I ended up having to block Brazil from pagure.io for a while. We also added some CPUs and adjusted things to handle higher load. So far we are handling things ok now, and I removed the Brazil block. But there's no telling when they will be back. We may well have to look at something like Anubis, but I fear the scrapers would just adjust to not be something it can catch. Time will tell.

That's it for this week, folks...

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114207270082200302

Mid March infra bits 2025


AI Scraper scourge

The AI scraper (I can only assume that's what they are) scourge continued, and intensified in the last week. This time they were hitting pagure.io really quite hard. We blocked a bunch of subnets, but it's really hard to block everything without impacting legit users, and indeed, we hit several cases where we blocked legit users. Quickly reverted, but still troublesome. On Thursday and Friday it got even worse. I happened to notice that most of the subnets/blocks were from .br (Brazil). So, in desperation, I blocked .br entirely and that brought things back to being more responsive. I know that's not a long-term solution, so I will lift that block as soon as I see the traffic diminish (which I would think it would once they realize it's not going to work). We definitely need a better solution here. I want to find the time to look into mod_qos, where we could at least make sure important networks aren't blocked and other networks get low priority. I also added a bunch more CPUs to the pagure.io VM. That also seemed to help some.
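
For the record, here's roughly the shape of what I have in mind with mod_qos, dropped in via ansible like everything else we manage. This is only a sketch to illustrate the idea, not a tested config: the path regex and limits are made up, the handler name is hypothetical, and the directive names are from the mod_qos docs and would need verifying before use:

    - name: Install a mod_qos tuning config for pagure.io (illustrative only)
      ansible.builtin.copy:
        dest: /etc/httpd/conf.d/qos.conf
        content: |
          # cap concurrent connections from any single IP
          QS_SrvMaxConnPerIP 30
          # throttle the expensive repository-browsing paths
          QS_LocRequestLimitMatch "^/.+/(blame|blob|raw)/" 50
      notify: reload httpd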

F42 Beta on the way

Fedora 42 Beta is going to be released Tuesday! Shaping up to be another great release. Do download and test if you wish.

Datacenter Move

The datacenter move we are going to be doing later this year has moved a bit later in the year. Due to some logistics, we are moving from the May window to a mid-June window. That does give us a bit more time, but it's still going to be a lot of work in a short window. It's also going to be right after Flock. We hope to have access to the new hardware in a few weeks so we can start to install and set things up. The actual 'switcharoo' in June will be over 3 or so days, then we'll be fixing anything that was broken by the move, hopefully all set before the F43 mass rebuild.

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114167827757899998

Early March infra bits 2025


Here we are, Saturday morning again. This week was shorter than normal for me work-wise, as I took Thursday and Friday off, but there was still a lot going on.

Atomic desktops / iot / coreos caching issues

I spent a lot of time looking into some odd issues that ostree users were hitting. It was really hard to track down what was broken. No errors on our end; I invalidated CloudFront a few times, did a bunch of tweaks to our backend varnish caches and... the problem was caused by: me.

Turns out we are getting hit really hard all the time by what I can only assume are crawlers working to fuel LLMs. It's not just us; see for example this excellent LWN article on the problem.

We use Amazon CloudFront to serve ostree content to users, since it allows them to hit endpoints in their local region, so it's much faster and reduces load on our primary cache machines. CloudFront in turn hits our cache machines to get the content it caches.

How does this relate to ostree issues, you might ask? Well, I blocked a bunch of IPs that were hitting our kojipkgs servers particularly hard. It turns out some of those IPs were CloudFront, so just _some_ of the CloudFront endpoints didn't have access to the backend, and their caches were out of date. I assume CloudFront also has multiple distributions at each region and it was only _some_ of those.

Removing all those blocks got everything working for everyone again (but of course the AI bots are ever present). I also enabled a thing called 'origin shield', which means CloudFront should only pull from one region and sync to the others, reducing load on our caches.

Longer term, we probably need to split up our kojipkgs cache, add more nodes, or rearrange how things are hit.

I'm deeply sorry about this issue; I know many users were frustrated. I sure was too. Lesson learned: be careful when blocking bots.

s390x caching problems

And related to that issue, our s390x builders have been having problems pulling packages for builds. They have a local cache that in turn pulls from our primary one. Sometimes, sporadically, it's getting partial downloads or the like. I've still not fully figured out the cause here, but I did make a number of changes to the local cache there that seem to have reduced the problem.

Longer term here, we probably should separate out this cache to hit an internal-only one, so the load on the main one doesn't matter.

Coffee machine fun

Friday my coffee machine (a DeLonghi Magnifica) got stuck in the middle of a cycle. It had ground the beans, but then stopped before the water step. So, I looked around at repair videos and then took it apart. It was actually pretty cool how it was put together, and I was able to basically manually turn a pulley to move it down to unlocked, then I could remove the brew group, clean everything up and put it back together. Working great again. Kudos also to the iFixit Pro toolkit that I got a while back. A weird screw? No problem.

Home assistant

Been having a lot of fun tinkering with home assistant.

After looking around, I decided that Zigbee networking is better than bluetooth and less power hungry than wifi, so I picked up a Zigbee gateway and it works just fine. At one point I thought I had accidentally flashed it with the esp32 builder, but it seems that didn't work, so whew.

Got some smart plugs ( Amazon link to smart plugs ) and these little things are great! They pair up fine, HA can manage their firmware versions/update them, and there are lots of stats. I put one on the plug my car charges on, another on a plug that has a fridge and a freezer on it, one on the plug my main server UPS is on, and kept one for 'roaming'. It's cool to see how much power the car charging takes in a nice graph.

Got some cheap temp sensors ( Amazon link to temp / humidity sensors ) They seem to be working well. I put one in my computer closet, one in our living room, one in the garage and one outside. (The living room seems to have a 4 degree change from day to night)

I had some old Deako smart switches along with a gateway for them. They use a bluetooth mesh to talk to each other and an app, but the gateway is needed for them to be on wifi. I never bothered to set up the gateway until now, but HA needs it to talk to the switches. So I tried to set it up, but it would just fail at the last setup step. So, I mailed Deako and... they answered really quickly and explained that the gateway is no longer supported, but they would be happy to send me some of their new smart switches (which have wifi built in and can act as a gateway for the old ones) free of charge! I got those on Thursday, set them up, and they worked just dandy.

But then I tripped over Chesterton's Fence. The 3 old smart switches were all controlling the same light. That seemed silly to me. Why not just have one on that light, use two 'dumb' switches for the other two places for that light, and then move the other smart ones to other lights? Well, it turns out there are several problems with that: the 'dumb' switches have a physical position, so if you did that, one could be 'on' with the light off, another 'off', etc. But the biggest problem is that the smart switch is needed to route power around. If you turn the light 'off' on a 'dumb' switch, you can leave the one smart switch with no power, and then it doesn't do anything at all. So, after messing them up, I figured out how to factory reset them and re-pair them. For anyone looking, the process is:

Resetting:

  • plug in and while it 'boots', press and hold the switch.

  • it should come up with 3, 2, 1 buttons to press.

  • press each in turn

Pairing (you have to pair switches that all control the same lights):

  • unplug all switches

  • plug one in and press and hold the switch; it should come up with a flashing 1

  • If nothing happens, try each of the other two in turn. Only one has 'power'

  • press 1 on the first switch.

  • Repeat on switch 2 and press 2 on the first switch

  • Repeat on the last switch and press 3 on the first switch

I could have saved myself a bunch of time if I had just left it the way it was. Oh well.

Finally, I got some Reolink cameras. We have a small game camera we put out from time to time to make sure the local feral cats are ok, and to tell how many raccoons are trying to eat the cats' food. It's kind of a pain because you have to go put it outside, wait a few days, then remember to bring it back in, then pull the movies off its sdcard.

So replacing that with something that HA could manage and we didn't need to mess with sounded like a win. I picked up a bundle with a Home Hub, two Argus Eco Ultra cameras, and two solar panels for them.

The Home Hub is just a small wifi AP with sdcard slots. You plug it in and set it up with an app. Then, you pair the cameras to it and HA talks to the cameras via the hub. There's no external account needed, setup is all local, and you can even firewall off Reolink if you don't want them to auto-upgrade firmware, etc. I've not yet set the cameras up outside, but a few impressions: the cameras have REALLY LOUD AUDIO. When you first power them on they greet you in a bunch of languages and tell you how to set them up. That's fine, but when I did this people in my house were sleeping. Not cool. Even powering them off causes a chirp that is SUPER LOUD. The cameras have a 'siren' control that I have been afraid to try. :) Anyhow, more on these as I get them set up.

I had 2 UPSes here: one for my main server and critical loads and another for less important stuff. With all the home assistant stuff I ran out of battery-backed plugs, so I picked up a 3rd UPS. The new one was easy to add to nut, but I had a long-standing problem with the two I already had: they are exactly the same model and product and don't provide a serial number on the usb port, so nut can't tell them apart. Finally I dug around and figured out that while I was specifying bus, port and device, it wasn't working until I moved one of them to another USB plug (and thus another bus). I then got all 3 of them added to HA. One thing that confused me there: since all 3 of them are on the same nut server and are using the same upsmon user, how do I add more than one of them in HA? Well, it turns out if you go to the nut integration, add device, and enter the same host/user/pass, it will pop up a screen that asks you which one to add. So you can add each in turn.

So, lots of fun hacking on this stuff.

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114128294620402197

Misc bits from late February 2025


Here's another batch of misc bits from Fedora Infra land. I missed last weekend as I was off on some vacation. I've got a few more days coming up that I need to use before they go away. :)

Personal Stuff

We had a big wind storm here on Monday. It knocked out power for about 5 hours (which wasn't so bad, since we have a generator, but it was annoying). One of our large (150' Douglas fir) trees got knocked down by the wind. Luckily it fell away from the house and a fence, and managed to hit a thicket of blackberries I was in the process of trying to remove. We'll still need to take care of it, as it's in the way of our path. Also, our dishwasher died (it's going to be a few weeks before someone can look at it). It's sad that repairing something takes a few weeks, but I bet I could buy a new one in a few days.

Day of learning on Friday

Red Hat does these quarterly "day of learning" days, where everyone is encouraged to go learn something. Work related or not, interesting, useful or not. It's really a great thing. This time I decided to play around with Home Assistant some more and figure out how it does things. Adam Williamson mentioned it in a Matrix discussion, and I had been meaning to look into it too, so it seemed like a great time. I picked up a Home Assistant Green, which is basically a small arm box that has Home Assistant (HA) all installed on it and ready to go. Initial setup was easy, no issues.

Several of my devices use bluetooth, so I also picked up some little esp32 boards to use as a bluetooth proxy. It's pretty amazing how small these little guys are. I did make one mistake ordering though: I got some with micro-USB. Had to dig up some old cables. I think I am going to replace them with ones that have USB-C. So, I flashed the bluetooth proxy onto one, it joined my wifi and... then it didn't work. It took me a while to find out that since my wireless and wired networks are completely separate, I needed to run an mdns-repeater on my gateway/firewall box to repeat the mdns advertisements. After doing that, HA saw it just fine, along with a printer that was on wifi.

I connected up my ups server (nut) with no problems, so now I have nice charts and graphs of power usage, battery state, etc.

I have some 'smart' light switches from Deako here too. They connect and work with an Android app via bluetooth, so I thought they were using bluetooth for HA integration too, but it turns out it requires wifi to connect to them. I do have an old dongle to connect them to wifi, and I tried to set that up, but it seems to just hang at the end when it says it's preparing it. They no longer make that Deako Connect dongle, so I mailed them about it. Perhaps the provisioning service for it is no longer even there.

Managed to add my car in with not much trouble. Kinda cool to have all its sensors available easily on the dashboard.

This stuff is a giant rabbit hole. Some things I want to do when time/energy permits:

  • There's a way to connect an esp32 device to the serial port on the hot water heater and get all kinds of info from it.

  • There's supposedly a way to connect to an internal connector on the heat pumps I have and get a ton of info / control them.

  • I'd like to figure a way to monitor my water holding tank. That one is going to be tricky, as the pumphouse is down a hill and not in line of sight of the house. Seems like people do this one of two ways: a pressure sensor dropped in the bottom of the tank, or a distance sensor at the top showing where the water line is.

  • After looking, I think bluetooth is usually too much of a mess for these things, so I ordered a Zigbee thing for the HA server and some Zigbee-based power outlets and some temp sensors to try that out.

Lots of fun.

caching issues

So, starting on Tuesday there were some issues with our caching setup. First it was s390x builds. They would request some package/repodata from their local cache, it would in turn request it from the main ones, and sometimes it would just get a partial read or an incorrect size. I couldn't find anything that seemed wrong. We are in freeze for Beta, so no real changes have been made to infrastructure. So, I was left with trying various things. I updated and rebooted the s390x cache machine. I blocked a bunch of IPs that were hitting the main cache too much. I updated and rebooted the main cache machines. Including a sidetrack where I noticed they were running an old kernel: turns out 'sdubby' got installed there somehow and it was installing new kernels as EFI. The VM wasn't booting EFI, so it just kept booting its old kernel. After all that it seemed like the issue was gone, or perhaps happening much less often?

After that, however, some folks are seeing problems with ostree downloads. All of the ones I've looked at don't exist on the cache hosts either, so I am not sure what's happening. I'm very sorry for the issue, as I know it's affecting a number of people. Will keep trying to track it down.

riscv secondary

Managed to mostly sort out builder authentication, I think. There are some limitations for external builders, but I think we can live with that. Hopefully next week we will start onboarding builders, then we will need to import builds, and then it will be time to start building!

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114089126937486625

DC move, riscv, AI and more for early February 2025


Hey, another week, another blog. Still going, still hope things are of interest to you, the reader.

PTO

I realized I had some PTO (Paid time off, basically what used to be called 'vacation') that I needed to use before it disappeared, so I have planned a number of vacation days in the coming month. Including 2 in this last week. :)

I'll also point to an old post of mine about what I personally think about vacations/pto when you are working in a community: time off when you are paid to work in a community

The biggest thing for me about this isn't completely disconnecting or ignoring everything; in fact, most days I get up at the same time (cats have to be fed!) and sit at my laptop like I usually do. It's that I don't have to attend meetings, and I can do things I fully enjoy. Sometimes that's still working in the community, but sometimes it's like what I did on Friday: look into a battery backup for the house and tradeoffs/ideas around that. In the end I decided not to do anything right now, but I had fun learning about it.

My next PTO days are next Thursday (2025-02-20), then next Friday is a "recharge" day at Red Hat, then I have Monday off (2025-02-24), then March 6th and 7th, and finally March 11th.

Datacenter Move

More detailed planning is ramping up for me. I have been working on when and how to move particular things, how to shuffle resources around and so forth. I plan to work on my doc more next week and then open things up to feedback from everyone who wants to provide it.

A few things to note about this move:

  • There's going to be a week (tentatively in May) where we do the 'switcharoo'. That is, take down services in IAD2 and bring them up in RDU3. This is going to be disruptive, but I'm hoping we can move blocks of things each day and avoid too much outage time; we will try to minimize the disruption as much as we can.

  • Once the switcharoo week is over and we are switched, there will be either no staging env at all, or a limited one. This will persist until hardware has been shipped from IAD2 to RDU3 and we can shuffle things around to bring staging entirely back up.

  • Once all this is over, we will be in a much better place and with much newer/faster hardware and I might sleep for a month. :)

riscv secondary koji

Slow progress being made. Thanks to some help from abompard, auth is now working correctly. It was of course a dumb typo I made in a config, causing it to try to use a principal that didn't exist. Oops. Now, I just need to finish the compose host, then sort out keytabs for builders, and hopefully the riscv SIG can move forward on populating it and next steps.

AI

Oh no! This blog has AI in it? Well, not really. I wanted to talk about something from this past week that's AI-related, but first, some background. I personally think AI does have some good / interesting uses if carefully crafted for that use. It's a more useful hype cycle than, say, cryptocoins or blockchain or 'web3', but less useful than virtual machines, containers or clouds. Like absolutely anything else, when someone says "hey, let's add this AI thing here" you have to look at it and decide if it's actually worth doing. I think my employer, Red Hat, has done well here. We provide tools for running your own AI things, we try to make open source AI models and tools, and we add it in limited ways to existing tools where it actually makes sense (Ansible Lightspeed, etc).

Recently, Christian Schaller posted his regular 'looking ahead' desktop blog post. He's done this many times in the past to highlight desktop things his team is hoping to work on. It's great information. In this post: looking ahead to 2025 and fedora workstation and jobs on offer he had a small section on AI. If you haven't seen it, go ahead and read it. It's short and at the top of the post. ;)

Speaking for myself, I read this as the same sort of approach that Red Hat is taking. Namely, work on open source AI tooling and integrations, provide those for users who want to build things with them, and see if there are any other places where it could make sense to add integration points.

I've seen a number of people read this as "Oh no, they are shoving AI in all parts of Fedora now, I'm going to switch to another distro". I don't think that is at all the case. Everything here is being done the Open Source way. If you don't care to use those tools, don't. If AI integration is added, it will be in the open, after tradeoffs and feedback, and with the ability to completely disable it.

ansible lint

We had set up ansible-lint to run on our ansible playbooks years ago. Unfortunately, due to a bug it was always saying "ok". We fixed that a while back and now it's running, but it has some rather opinionated ideas about how things should be. The latest case of this was names: it wants any task name to start with a capital letter. Handlers in ansible are just tasks that get notified when another task changes something. If you change the name of, say, 'restart httpd' to 'Restart httpd' in the handler, you then have to change every single place that notifies it too. This caused an annoying mess for a few weeks. Hopefully we have them all changed now, but this rule seems a bit arbitrary to me.
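
To make the coupling concrete, here's a trimmed-down sketch (not our actual playbook; the module arguments are illustrative) of why the rename ripples everywhere: the notify string has to match the handler name exactly:

    # handlers file
    - name: Restart httpd                 # renamed from 'restart httpd' for ansible-lint
      ansible.builtin.service:
        name: httpd
        state: restarted

    # any task that wants the restart
    - name: Deploy httpd config
      ansible.builtin.template:
        src: httpd.conf.j2
        dest: /etc/httpd/conf/httpd.conf
      notify: Restart httpd               # every one of these had to be updated too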

fedmsg retirement

In case you didn't see it, we finally retired our old fedmsg bus! We switched to fedora-messaging a while back, but kept a bridge between them to keep both sides working. With the retirement of the old github2fedmsg service we were finally able to retire it.

🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉🎉

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/114009437084125263

Bits from early February 2025

Scrye into the crystal ball

Let's keep the blogging rolling. This week went by really fast, but a lot of it for me was answering emails and pull requests and meetings. Those are all important, but sometimes it makes it seem like not much was actually accomplished in the week.

riscv secondary koji hub

I got some x86 buildvms set up. These are to do tasks that don't need to be done on a riscv builder, like createrepos/newrepos. I'm still having an issue with auth on them, however, which is related to the auth issue with the web interface. Will need to get that sorted out next week.

f42 branching day

Tuesday was the f42 branching day. It went pretty smoothly this cycle I think, but there's always a small number of things to sort out. It's really the most complex part of the release cycle for releng: so many moving parts and disparate repos and configs needing changing. This time I tried to stay out of actually doing anything, in favor of just providing info or review for Samyak, who was doing all the work. I mostly managed to do that.

Datacenter move

Planning for the datacenter move is moving along. I've been working on internal documents around the stuff that will be shipped after we move, and next week I am hoping to start a detailed plan for the logical migration itself. It's a pretty short timeline, but I am hoping it will all go smoothly in the end. We definitely will be in a better place with better hardware once we are done, so I am looking forward to that.

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/113969409712070764

Bits from late jan 2025

Scrye into the crystal ball

January has gone by pretty fast. Here's some longer form thoughts about a few things that happened this last week.

Mass updates/reboots

We did a mass update/reboot cycle on (almost) all our instances. The last one was about 2 months ago (before the holidays), so we were due. We do apply security updates on weekdays (i.e., Monday - Friday), but we don't apply bugfix updates except during these scheduled windows. Rebooting everything makes sure everything is on the latest kernel versions, and also ensures that if we had to reset something for other reasons it would come back up in the correct/desired/working state. I did explore the idea of having things set up so we could do these sorts of things without an outage window at all, but at the time (a few years ago) the sticking point was database servers. It was very possible to set up replication, but it was all very fragile and required manual intervention to make sure failover/failback worked right. There's been a lot of progress in that area though, so later this year it might be time to revisit that.
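
For flavor, the heart of one of these cycles is really just "update everything, reboot, wait for it to come back", batched so we don't take down too much at once. A minimal sketch, not our actual playbook (the host group and serial value are made up):

    - name: Mass update and reboot
      hosts: all_servers                  # placeholder group name
      serial: 10                          # only touch a batch of hosts at a time
      tasks:
        - name: Apply all pending updates
          ansible.builtin.dnf:
            name: "*"
            state: latest
        - name: Reboot and wait for the host to come back
          ansible.builtin.reboot:
            reboot_timeout: 900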

We also use these outage windows to do some reinstalls and dist-upgrades. This time I moved a number of things from f40 to f41, and reinstalled the last vmhost we had still on rhel8. That was a tricky one as it had our dhcp/tftp vm and our kickstarts/repos vm. So, I live migrated them to another server, did the reinstall and migrated them back. It all went pretty smoothly.

There was some breakage with secure boot signing after these upgrades, but it turned out to be completely my fault. The _last_ upgrade cycle, opensc changed the name of our token from 'OpenSC Card (Fedora Signer)' to 'OpenSC Card'. The logic upstream being "Oh, if you only have one card, you don't need to know the actual token name". Which is bad for a variety of reasons, like if you suddenly add another card, or swap cards. In any case, I had failed to fix my notes on that, so I was trying the old name and getting a confusing and bad error message. Once I managed to figure it out, everything was working again.

Just for fun, here's our top 5 OS versions by count:

  • 237 Fedora 41

  • 108 RHEL 9.5

  • 31 Fedora 40

  • 23 RHEL 8.10

  • 4 RHEL 7.9

The 7.9 ones will go away once fedmsg and github2fedmsg are finally retired. (Hopefully soon).

Datacenter Move

A bunch of planning work happened on the upcoming datacenter move. I'm hoping next week to work a lot more on a detailed plan. Also, we in infrastructure should kick off some discussions around whether there's anything we can change/do while doing this move. Of course adding in too much change could be bad given the short timeline, but there might be some things to consider.

I also powered off 11 of our old arm servers. They had been disabled in the buildsystem for a while to confirm we didn't really need them, so I powered them off and saved us some energy usage.

riscv-koji secondary hub

The riscv-koji secondary hub is actually installed and up now. However, there are still a bunch of things to do:

  • Need to set up authentication so people/I can log in to it.

  • Need to install some buildvm-x86 builders to do newrepos, etc

  • Need to install a composer to build images and such on

  • At next week's riscv SIG meeting hopefully we can discuss steps after that. Probably we would just set up tags/targets/etc and import a minimal set of rpms for a buildroot

  • Need to figure out auth for builders and add some.

Overall, progress finally. Sorry I have been a bottleneck on it, but soon I think lots of other folks can start working on it.

power9 lockups

We have been having annoying lockups of our power9 hypervisors. I filed https://bugzilla.redhat.com/show_bug.cgi?id=2343283 on that. In the meantime I have been moving them back to the 6.11.x kernels, which don't seem to have the problem. I did briefly try a 6.13.0 kernel, but the network wasn't working there. I still need to file a bug on that when I can take one down and gather debugging info. It was the i40e module not being able to load due to some kind of memory error. ;(

Chat and work items

One thing that was bugging me last year is that I get a lot of notifications on chat platforms (in particular Slack and Matrix) where someone asks me something or wants me to do something. That's perfectly fine, I'm happy to help. However, when I sit down in the morning, I usually want to look at what's going on and prioritize things, not get sidetracked into replying/working on something that's not the most important issue. This resulted in me trying to remember which things needed responses, and sometimes missing going back to them or getting distracted by them.

So, a few weeks ago I started actually noting things like that down as I came to them; then, after higher-priority things were taken care of, I had a nice list to go back through and hopefully not miss anything.

It's reduced my stress, and I'd recommend it for anyone with similar workflows.

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/113930147248831003

Bits from mid late jan 2025


Hello again everyone! Three weeks in a row now. Go me. I'm not really sure what to name these posts, especially given that 'early/mid/late' does not work when there are 4 weeks in a month. I could just put the dates in the title, but then it's difficult to see at a glance what's going on. Perhaps I just need to pick out some highlights.

mass rebuild completed

The f42 mass rebuild was completed. This time submitting from the mass rebuild script seemed slower than normal, possibly due to the scraper item below. However, things mostly worked out. Instead of submitting everything and koji churning through the backlog, submissions were usually just handled as they came in, leaving koji with really no backlog.

A.I. Scrapers

The A.I. scrapers continue to pound away at things. I mean, I'm assuming that's what they are, as they crawl all the content they can get. They are pretty hard to block due to using massive piles of IPs, different user agent strings every time, and random access of content. This sort of thing is hitting lots of open source communities too. I think short-term we probably need to look at deploying mod_qos at least, so we can prioritize users/things that need access and let the scrapers get much lower levels of request answering. No idea if that will really stop them though.

Annoying outage on Thursday

We had a very annoying outage on Thursday. Our rabbitmq cluster started not processing things sometimes and just in general not working right. Restarting it, and then a bunch of consumers that connect to it, finally got things working again, but it's unclear what the cause was, and cleaning up a bunch of builds and such that were not processed due to missing messages was not at all fun. Currently our rabbitmq cluster is on RHEL8 and an older version; we probably want to move it to RHEL9 and start using the version made by the CentOS messaging SIG. If newer/older can talk to each other in a cluster it should be pretty easy. If they can't, it will need some downtime.

Mass update/reboot next week

As we do from time to time, there's going to be a mass update/reboot cycle next week. Monday will be staging, then Tuesday will be a bunch of things that don't cause any outages (where we can take things out, update/reboot, and put them back in), and finally Wednesday will have the actual outage.

riscv secondary koji hub progress

The riscv hub is installed and mostly configured; I still need to sort out a few packages it needs so the ansible playbook can complete. Next I need to test the web interface, then install some x86 builders (that will do newrepos) and a compose host (for composes). Then will come the hard work of importing content / setting up actual riscv builders (which will be external for now, until we can find hardware to use locally).

Datacenter move

We announced that we are going to be moving from our main datacenter (IAD2 in Ashburn, VA) to a new one (RDU3 in Raleigh, NC). It's going to be a ton of work (and lots has already happened; I've been spending a lot of time on it already), but when it's done we should be in a much better place. More space, better machines, and hopefully things like IPv6 support (which we have wanted for many years).

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/113891021464841039

Bits from mid jan 2025


Hello again, here's some longer form doings and thoughts from mid January 2025 in and around Fedora.

rawhide repodata change

Rawhide repodata has moved over to the createrepo_c default: zstd. This shouldn't affect any dnf use, or Fedora createrepo_c use, but if you are using EL8/EL9, createrepo_c there currently doesn't understand zstd. There's an issue to add that in a minor release: https://issues.redhat.com/browse/RHEL-67689 and in the meantime, if you are an EL8/9 user, there's a copr: https://copr.fedorainfracloud.org/coprs/amatej/createrepo_c/

dist-repos retention

On the dist-repos space issue I mentioned last week, it turns out that the expectation with dist-repos is that you sync them somewhere you want to serve them from, not serve them directly from koji. So our use case was misaligned a bit from upstream here. However, they did adjust things to keep the latest repos. That should be in an upcoming koji release. Thanks, koji folks!

fun email infrastructure mixup between ipa and postfix

There was a pretty curious email issue that came up this last week. fedoraproject contributors (that is, folks with an account that is in at least one non-system group) are set up with an email alias of theiraccountlogin@fedoraproject.org. This is just a very simple alias: we accept the email and forward it to their real email address. There's no mailbox here or authentication or anything, just a simple alias. We got an alert that disk space was getting low on our mail hub, so I took a look and found that users who were not contributors were getting emails to theiraccountlogin@fedoraproject.org delivered locally to /var/spool/mail on the hub! When we switched away from fas2 to our current IPA-based setup, no one realized that sssd/ipa enumerates all users, even if they do not have access to actually log in or do anything. There are good reasons for this, but somehow I at least didn't realize that it worked that way. So, since all users 'existed' there, and postfix's default for local users is:

proxy:unix:passwd.byname $alias_maps

It correctly looks them up and thinks they are local. Simply changing this to just be $alias_maps fixes the issue. There wasn't a bug here in postfix or ipa; they were just doing things as designed. The issue was our misunderstanding of how these things interacted.
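
In postfix terms, that value is the default of local_recipient_maps, so the fix is overriding that one parameter. A minimal sketch of the change as an ansible task (in reality it lives in our postfix templates; lineinfile and the handler name here are just for illustration):

    - name: Only accept aliases as local recipients, not every enumerated user
      ansible.builtin.lineinfile:
        path: /etc/postfix/main.cf
        regexp: '^local_recipient_maps'
        line: 'local_recipient_maps = $alias_maps'
      notify: reload postfix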

f42 mass rebuild underway

The mass rebuild for f42 started (albeit a bit later than planned, due to an issue getting golang to work on i686 when compiled with gcc 15). This time it seems like our build submission is much slower than before. In past years, pretty much everything was submitted in a few days, then koji chewed on the backlog. This time koji is easily keeping up with the submissions and we are only in the 'p's after 3.5 days. Oh well, hopefully we will finish Monday-ish, which is in line with past mass rebuilds.

forgejo kickoff meeting/discussions

There was a kickoff meeting about forgejo in Fedora infra. I'm looking forward to this, but I have so many things going on I am not sure how much work I can do on the deployment. Lots of good ideas/plans were discussed. I think standing up a staging instance is not going to be a huge amount of work, but integrating with all our workflows will take a lot more effort. Time will tell.

riscv secondary koji hub

I did finally submit my work-in-progress PR for the riscv-koji hub: https://pagure.io/fedora-infra/ansible/pull-request/2435 Hopefully I can finish things off and start deploying next week.

bugzilla and needinfo

I asked on mastodon what folks thought about bugzilla needinfo requests and what they meant. There were a number of opinions: https://fosstodon.org/@nirik/113822583672492457 I think in the end it's a thing that people will use for their own use cases, and those will sometimes mismatch with what recipients expect. Unless we want to try and make some community-wide norms or guidelines (but of course even then not everyone will see those).

Xfce-4.20 and wayland testing

News from a while ago: Xfce 4.20 was released and it's got a bunch of wayland support for various things. However, it doesn't have xfwm4 / a compositor of its own, so by default you get an X session. If you want to play with the wayland sessions and you are running a rawhide instance, you can install xfce4-session-wayland-session, which will by default pull in labwc as your compositor. You can manually modify the session file to use wayfire if you prefer that compositor. See the testing section at https://wiki.xfce.org/releng/wayland_roadmap I tried both out and they did indeed work, but there are still a bunch of rough edges. Still, great progress!

comments? additions? reactions?

As always, comment on mastodon: https://fosstodon.org/@nirik/113850897869803740

Bits from early jan 2025


Welcome to 2025. It's going to be a super busy year in Fedora land; all kinds of things are going to be worked on and hopefully rolled out in 2025. Back from the holiday break, things started out somewhat slow (which is great) but have already started to ramp up.

First up I fixed a few issues that were reported during the break:

  • Our infrastructure.fedoraproject.org host wasn't redirecting http to https (as all our other hosts do). Turns out we disabled this many years ago because virt-install couldn't handle https for some reason. I think this was in the RHEL6 days, and we just never went back and fixed it. This did end up breaking some provisioning kickstarts that had http links in them, but easy (and good!) to fix.

  • Some redirects we had for sites were redirecting to just {{ target }} variables, but if that was examplea.fedoraproject.org////exampleb.com it would redirect to examplea.fedoraproject.org.exampleb.com. A totally different domain. I fixed that by making sure there was a / at the end of the redirect. Unfortunately, I also broke the codecs site for basically everyone. ;( It was broken for a few hours, but easy to fix up after I was made aware.

There have been a ton of f42 changes flowing before/during/after the holidays. Lots of exciting things. Hopefully it can all land and work out ok.

I finally started in on work for a riscv-koji secondary arch hub. It would have been really easy, but we dropped all the secondary arch things from ansible after the last secondary arch hub went primary, so I am having to go through and adjust a lot of things. Making good progress, however. Hopefully something to show next week. This secondary hub will allow maintainers to log in and do scratch builds, and get things one step closer to primary. There's still a long road for it though, as we need actual local builders and proof of keeping up with primary.

Next, I cleaned up some space on our netapp (see below for why).

  • I archived some iot composes and marked a bunch more for deletion. As soon as I get the ack on those, that should free up around 5TB.

  • I noticed our dist-repos were really pretty large. This turns out to be two issues. First, we were keeping 6 months of them. We did that because we use these for deployments, and previously, if all the repos were older than the last change to them, they would just be missing. I am pretty sure now that kojira keeps the latest one, so this is no longer a factor. I set them to 1 week (the default). This should free up many TBs. Secondly, the flatpak apps repos were not using just the latest (they were pulling in everything). Adjusted that, and it should save us some space as well.

  • Finally, I nuked about 35TB of old db backups. There's no reason to keep daily database dumps going back to 2022. I kept one from each month, but we have never had to go back years for data. In particular, the datanommer and koji DBs are... gigantic, even compressed. Unfortunately it will be a while before this space is actually freed, as we have a lengthy snapshot retention policy on the backup volume. Still, it should help a lot down the road.

With some freed up space, I could now make another iscsi volume and move our ppc64le buildvms off to it. It took far longer than I would like to admit to get things working (turns out I had logged in on the client before setting everything up on the netapp, and it just didn't see the LUN until I ran a refresh). I expected there to be a pretty vast speed improvement, and the VMs are indeed a lot more responsive and install much faster, but I am not sure builds are really that much faster. Will need some more poking around to find out. The local disks on those are 7200 rpm spinning sata drives. The iscsi is an ssd-backed netapp over a 10G network. Unfortunately, we have also seen instability in the hosts in the last week, which is likely a kernel issue since I updated them to 6.12. Hopefully I can get everything sorted out before the f42 mass rebuild, which is next Wednesday.

comment/reply on mastodon: https://fosstodon.org/@nirik/113811216382678546