Infra and Releng workshop at Flock 2024
Last Friday at Flock, we had an Infrastructure and Release Engineering workshop/hackfest. It ran from 9am to 1pm, so 4 hours, and we used them all. We did take a couple of breaks, but overall we powered through discussing the entire agenda.
Before the workshop we brainstormed a bunch of discussion items at: https://discussion.fedoraproject.org/t/planning-for-infra-and-releng-hackfest-at-flock-2024/110244 and created a hackmd document to record notes in: https://hackmd.io/HxpzTNpITfu0OYmOGRApiw
I'm going to list here each topic, some notes about it and then any action items from that.
- "Standards for OpenShift app deployments" - We run, but don't develop, a number of applications in our OpenShift cluster. Right now the deployment methods are all over the map. Some apps use a source-to-image (s2i) setup with production and staging branches, others just pull an image from quay.io where it's unclear how that image is made or could be adjusted, others build local images, and still others do something else entirely. This makes it hard for us to debug or know what base images are in use. Also, some playbooks automatically fire off builds or deployments and they shouldn't. We should split this out to manual playbooks if we need it, but normally OpenShift will just do whatever is needed.
- ACTION: create comments in each app playbook that explain how it's deployed
- ACTION: with OpenShift 4.16 we will need to move all our apps that still use a DeploymentConfig over to a Deployment.
- ACTION: Look at deploying ACS (Advanced Cluster Security) to gain more visibility into out-of-date or vulnerable images.
- ACTION: create a "best practices" guide (next to our development guide) that explains the way we consider best to deploy apps in our clusters. All of humaton, zlopez, smiller, dkirwan, abompard, lachmanfrantisek, lsm5, mohanboddu expressed interest in helping on this.
- "Infra SIG packages" - We have a packaging group called "infra-sig" that maintains a bunch of packages that we use (or used to use). The group doesn't have too many active members these days and we really need to look at what packages are in it and orphan ones we don't use/need/want anymore.
- ACTION: Find someone(s) to propose packages to orphan / add
- ACTION: Onboard them with packit to help us reduce maintenance. We can get packit folks a list and they can mass onboard them for us
- ACTION: look at list of folks in the sig and remove those who are no longer around/interested.
- "Discuss Releng packages"
- ACTION: come up with list of releng packages that are owned directly by release engineering and add them to infra sig
- "Discuss proxy network: move to nginx? change things? or keep?" - We had a bit of discussion about moving away from httpd to nginx or gunicorn. In the end we didn't really come to much consensus on this one; it needs further discussion. We do have a lot of ansible playbooks that are Apache-dependent and things are broadly working ok with the setup we have. HTTP/3 support would be nice, as would better perf, but neither is a requirement.
- "Discuss making aws more ansiblized/managed, or not?" - We didn't really come to much conclusion on this one either. One problem is that our main amazon account is a subaccount of the amazon community account, so we can't divide it any further, and lots of groups use it, so we can't fully manage it very easily anyhow. This one also needs more thought I think.
- "Discuss onboarding, what we can do to make it better" - we had a pretty nice discussion on this one, including some folks that are not involved right now with some great perspectives.
- ACTION: kevin to post outline of docs changes and submit WIP PR for them for people to add to.
- ACTION: after docs are in better shape, look at marketing to potential contributors
- ACTION: after each release look at having a 'Hello' day where new folks can join and ask questions and learn about the setup.
- OpenShift apps deployment info - Did a quick tutorial on how we deploy apps for all those present. Should be folded into the docs above.
- "Look ahead: gitforge, bugzilla, matrix server" - This was just a discussion on all these things that are coming in the next year. It's going to be a ton of work.
- "Retire wiki pages / migrate to docs" - We talked about where end user docs might live as opposed to contributor/member docs. We talked about all the wiki pages that we want to migrate _somewhere_.
- ACTION: some more discussion about where end user docs should go. Perhaps talk to the docs team?
- ACTION: Someone(s) to look at the docs in https://fedoraproject.org/wiki/Category:Infrastructure and archive/delete or migrate them all.
- "Datagrepper access" - This was a discussion about the commops team wanting to do database queries on datagrepper for community metrics. It's logistically difficult to get access to the actual database from anywhere the tools they want to run are. So, after a bit of requirements gathering, we brainstormed a solution: set up a database in AWS using RDS, load a recent dump from datagrepper into it, and then set up some datanommer instances in communishift (or wherever) that listen to our message bus and just insert new messages as they come. This way it should be up to date, but cause no load on the main datagrepper instance (it would be completely separate!). We now have tickets pending to do this work for them.
- ACTION: infra folks to work tickets to get things set up alongside commops folks
- ACTION: commops to install and use whatever frontends they want to query the RDS db.
- ARA in infra - This would be nice reporting for us, although there was some discussion that if we get AWX set up it would have much of the same reporting in it. We left this I think as sort of a "If someone had time and wanted to look at setting it up they could".
- AWX deployment - We talked about issues/roadblocks on AWX. It isn't really set up to handle the way our ansible repo is laid out (with a public and a private repo). We should be able to move it forward for a proof of concept tho and can then decide how we want to redo our repos or if we do want to. Reworking things to be more standard would also allow us to have example values for secrets so people could test/deploy/use our playbooks more easily in CI or other places.
- ACTION: kevin to check on status and see if we can stand up the POC
- ACTION: once that's in place, discuss redoing things or other options.
- "zabbix checkin/testing/planning" - We have a zabbix setup that's pretty far along, and we want to move it forward so we can retire nagios. Talked about the current status and ideas on moving things forward.
- ACTION: Set up a bot channel that sends zabbix alerts so we can see what it's alerting on in order to adjust settings.
- ACTION: adjust alerts based on above and based on when nagios alerts and zabbix doesn't.
- ACTION: see about moving to the next LTS version, which has some improvements.
- We then went on to look at our GitHub repos for the fedora-infra group. We archived a bunch of old projects, a great way to end things!
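To make the datagrepper mirror idea above a bit more concrete, here is a minimal Python sketch of the "insert new messages as they come" part. It is not datanommer's actual code: sqlite3 stands in for the RDS database so the example runs anywhere, the `messages` table, the `open_store` and `store_message` helpers, and the example topic names are all made up for illustration, and messages are plain dicts rather than real message bus objects. The one design point it tries to show is that inserts are keyed on the message id, so a message that is in both the loaded dump and the live stream only gets stored once.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) messages table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS messages (
            msg_id TEXT PRIMARY KEY,
            topic  TEXT NOT NULL,
            body   TEXT NOT NULL
        )
        """
    )
    return conn


def store_message(conn, message):
    """Insert one message; return True if it was new, False if already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO messages (msg_id, topic, body) VALUES (?, ?, ?)",
        (
            message["id"],
            message["topic"],
            json.dumps(message["body"], sort_keys=True),
        ),
    )
    conn.commit()
    # rowcount is 0 when the primary key already existed and the row was ignored
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = open_store()

    # Stand-in for the recent dump loaded at setup time (made-up data).
    dump = [
        {"id": "msg-1", "topic": "org.example.build.done", "body": {"pkg": "a"}},
        {"id": "msg-2", "topic": "org.example.build.done", "body": {"pkg": "b"}},
    ]
    # Stand-in for the live bus; msg-2 overlaps with the dump on purpose.
    live = [
        {"id": "msg-2", "topic": "org.example.build.done", "body": {"pkg": "b"}},
        {"id": "msg-3", "topic": "org.example.update.push", "body": {"pkg": "c"}},
    ]

    for message in dump + live:
        store_message(conn, message)

    total = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    print(f"{total} unique messages stored")
```

In the real setup the datanommer instances would be doing this against the RDS database, and commops would point whatever query frontend they like at the same tables.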
I do wish we had had a way to let remote folks interact with the workshop. We tried a Google Meet, but the hotel network was not kind to us on Friday. So, there are a lot of actions above, and we need to find people to match to them! Let us know if you have interest in helping us out.
All in all a great workshop and we used all our time and had some great discussions!