
Changing how we work

As those of you who read the https://communityblog.fedoraproject.org/state-of-the-community-platform-engineering-team/ blog know, we are looking at changing workflows and organization in the Community Platform Engineering team (of which I am a member). So, I thought I would share a few thoughts from my perspective and hopefully shed some more light for the community on why we are changing things and what that might look like.

First, let me preface my remarks with a disclaimer: I am speaking for myself, not our entire team or anyone else in it.

So what are the reasons we are looking for change? Well, there are a number of them, some of them interrelated:

  • I know I spend more time on my job than any 'normal' person would. That's great, but we don't want burnout or heroic efforts all the time. It's just not sustainable. We want to get things done more efficiently, but also have time to relax and not have tons of stress.
  • We maintain/run too many things for the number of people we have. Some of our services don't need much attention, but even so, we have added lots of things over the years and retired very few.
  • Humans suck at multitasking. Study after study shows that for the vast majority of people, it is MUCH more efficient to do one task at a time, finish it, and then move on. Our team gets constant interruptions, and we currently handle them poorly.
  • It's unclear where big projects are in our backlog. When other teams approach us with big items, it's hard to show them when we might work on the thing they want, what's ahead of it, or what priority things have.
  • We have a lot of 'silos'. Just the way the team has worked, one person usually takes the lead on each specific application or area and knows it quite well. This, however, means no one else does, no one else can help, they can never win the lottery, etc.
  • Things without a 'driver' sometimes just languish. If there is not someone (one of our team or even a requestor) pressing a work item forward, sometimes it just never gets done. Look at some of the old tickets in the fedora-infrastructure tracker. We totally want to do many of those, but they never get someone scheduling them and doing them.
  • There's likely more...

So, what have we done lately to help with these issues? We have been looking a lot at other similar teams and how they became more efficient. We have been looking at various 'agile' processes, although I personally do not want to cargo-cult anything: if a process calls for a ceremony that makes no sense for us, we should not do it.

  • We set up an 'oncall' person (switched weekly). This person listens for pings on IRC, tickets, or emails to anyone on the team and tries to intercept and triage them. This allows the rest of the team to focus on whatever they are working on (unless the oncall person deems something serious enough to bother them). Even stopping just to tell someone you don't have time and are busy on something else forces a context switch, and the time to swap that out and back in already makes things much worse for you. We of course will still be happy to work with people on IRC; just schedule time in advance in the associated ticket.
  • 'Ticket or it doesn't exist.' We are still somewhat bad about this, but the idea is that every work item should be a ticket. Why? So we can keep track of the things we do, so oncall can triage them and assign priority, and so people can look at tickets when they have finished a task rather than being interrupted in the middle of one. So we can hand off items that are still being worked on and coordinate. So we know who is doing what. And on and on.
  • We are moving our 'big project' items to be handled by teams that assemble for that project. This includes a gathering info phase, priority, who does what, estimated schedule, etc. This ensures that there's no silo (multiple people working on it), that it has a driver so it gets done and so on. Setting expectations is key.
  • We are looking to retire, outsource, or hand off to community members some of the things we 'maintain' today. There are a few things that just make sense to drop because they aren't used much, or because we can just point at some better alternative. There's also a group of things that we could run, but could instead outsource to another company that focuses on that application and have them do it. Finally, there are things we really like and want to grow, but we just don't have any time to work on them. If we hand them off to people who are passionate about them, hopefully they will grow much better than if we were still the bottleneck.

Finally, where are we looking at getting to?

  • We will probably be setting up a new tracker for work (which may not mean retiring our existing trackers; we may just sync from those to the new one). This is to allow us to gather lots more metrics and have a better way of tracking all this stuff. This is all still handwavy, but we will of course take input on it as we go and adjust.
  • Have an ability to look and see what everyone is working on right at a point in time.
  • Much more 'planning ahead' and seeing all the big projects on the list.
  • Have an ability for stakeholders to see where their thing is and who is higher priority and be able to negotiate to move things around.
  • Be able to work on single tasks to completion, then grab the next one from the backlog.
  • Be able to work "normal" amounts of time... no heroics!

I hope everyone will be patient with us as we do these things, provide honest feedback to us so we can adjust and help us get to a point where everyone is happier.

Attention epel6 and epel7 ppc64 users

If you are an epel6 or epel7 user on the ppc64 platform, I have some sad news for you. If you aren't, feel free to read on for a tale of EOL architectures.

ppc64 (the big endian version of POWER) was shipped with RHEL6 and RHEL7, and with Fedora until Fedora 28. It's been replaced by the ppc64le (little endian) version in Fedora and RHEL8.

The Fedora build system (koji) runs Fedora on all its builders. We moved to doing this because we often needed new features in rpm or other low-level packages like that. For example, when rpm switched to xz compression, we needed an rpm new enough to understand that on all the builders, so we either had to backport this support to the RHEL version or just switch to Fedora. We found it more supportable to just switch to Fedora.

However, since Fedora stopped supporting ppc64 in Fedora 29, all our current ppc64 builders are Fedora 28 (the last release with support for ppc64). Fedora 28 is about to go end of life and we had to decide what to do about epel6/epel7 since they still support ppc64.

epel6 ppc64 users may be sticking with that platform because their hardware doesn't support ppc64le. epel7 users could move to ppc64le, but they might be keeping ppc64 instances to remain compatible, or not want to change. ppc64 users are a very low number (around 100 check-ins to our mirroring system per day, next to 1.5 million for x86_64). Additionally, I would expect few of those ppc64 installs are new deployments or even directly on the internet.

We could make RHEL7 builders just for epel6/7 ppc64. However, our ansible playbooks for builder deployment have counted on them being Fedora for a long time now, so it would be a pretty big retooling effort. Additionally, those builders would sit around mostly idle, taking up resources we could use for other ones.

So, in the end, I think we are going to look at retiring ppc64 in epel6/7 next week when Fedora 28 goes end of life. The old packages would still be available in koji, but the repos would disappear. If there's some case we haven't thought of here, do bring it up to the epel steering committee: https://pagure.io/epel/issue/57 or on the epel-devel list.

A preliminary review of /e/

I've been running LineageOS on my phone for a while now (and CyanogenMod before that) and have been reasonably happy overall. Still, even LineageOS is pretty intertwined with the google ecosystem, which worries me, especially given that google is first and foremost an ad company.

I happened to run across a mention of /e/ somewhere, and since LineageOS did a jump from 15.1 to 16.0 which required a new install anyhow, I decided to check it out.

As you may have gathered from the above, /e/ is a phone OS and platform, forked off from LineageOS 14.1. It's run by a non-profit foundation based in France (https://e.foundation), headed by Gaël Duval, whom Linux folks may know from his Mandrake/Mandriva days. The foundation has a lot of noble goals, starting with "/e/’s first mission is to provide everyone knowledge and good practices around personal data and privacy." They also have a slogan: "Your data is your data!"

I downloaded and installed a 0.5 version. Since I already had my phone unlocked and TWRP recovery set up, I just backed up my existing LineageOS install (to my laptop), wiped the phone, and installed /e/. The install was painless, and since (of course) there are no google connections wanted, I didn't even have to download a gapps bundle. The install worked just fine and I was off and exploring:

The good:

  • Most everything worked fine. Basically if it worked in LineageOS 14.1, it works here (phone, wifi, bluetooth, etc)
  • Many of the apps I use with my phone seem fine: freeotp, signal, twidere, tiny tiny rss reader, revolution irc are all the same apps I am used to using and are installable from f-droid just fine.
  • There is of course no google maps anymore, but this was a great chance to try out OsmAnd, which has come a very long way. It's completely usable except for one thing: The voice navigation uses TTS voices and it sounds like a bad copy of Stephen Hawking is talking to you. Otherwise it's great!
  • My normal ebook reader app is available: fbreader, but I decided to look around as it's getting a bit long in the tooth. I settled so far on KOReader, which was originally a Kobo app, but works pretty nicely on this OS as well.
  • For podcasts I had been using dogcatcher, but now I am trying out AntennaPod.
  • The security level of the image I got was March 2019, so they are keeping up with at least the "android" security updates now.

The meh:

  • The fdroid app isn't pre-installed, but it's easy to install it. They plan to have their own store for apps that will just show additional information over the play store, etc.
  • There is 'fennec' in f-droid. You can't seem to install firefox as all download links lead to the play store.
  • I had been using google photos to store backups/easy web access versions of pictures and movies I took, but of course now I just need to look into alternatives. Perhaps syncthing.

The bad:

  • A few apps I was using are of course non-free and not available in f-droid: tello, vizio smartcast, various horrible IOT smart things apps, my credit union's silly app, etc. Tello works fine if you can find an apk not on the play store. Vizio smartcast seems to fail asking for location services (which should work, but oh well).
  • Untappd doesn't seem to have an .apk easily available, so I guess twitter will be spared my beer drinking adventures. :)
  • Some infosec folks looked closely and there was still some traffic to google: https://infosec-handbook.eu/blog/e-foundation-first-look/#e-foundation but they had a very reasonable reply I thought (not trying to reject or ignore anything): https://hackernoon.com/leaving-apple-google-how-is-e-actually-google-free-1ba24e29efb9

The install is all set up with microG, "a free-as-in-freedom re-implementation of Google’s proprietary Android user space apps and libraries." It does a pretty good job pretending to be google for apps that need some google bits.

In addition to the OS, /e/ folks have a server side setup as well. I didn't play with it too much as I am waiting for their promised containerized versions of the server side so I can run them myself. These provide replacements for google drive, notes, address book, mail, etc.

The name /e/ is a bit strange to try and pronounce, or search for. Turns out they had another name at first, but someone else was using it and took exception. There is some mention that they are going to rename things before the magic 1.0 comes.

All in all I think I am going to keep using /e/ for now. Keeping up on security and the ability to make me look at open source alternatives to the various apps I use seems pretty nice to me. I do hope it catches on and more folks start to use it.

CPE meetings and devconf2019

I recently went to Brno, CZ for CPE (Community Platform Engineering) meetings and then devconfcz 2019 and thought I would share my take on both of them.

Travel to and from Brno is always a long one for me. I'm currently based in Oregon, US, so my journey is:

  • Drive to portland airport (PDX)
  • Flight from Portland to Amsterdam (AMS) (a 9-11 hour flight)
  • Flight to Prague (PRG) (usually a 1-2 hour flight)
  • Bus to train station (30-40min)
  • Train to Brno (2-3 hours)

And then the same in reverse on the way back, with all the associated timezone issues. :) I am very happy about the direct amsterdam flight, so I don't have to change planes in london or frankfurt or something.

A short word about the CPE team. We are a team in Red Hat that works on Fedora and CentOS (formerly Fedora Engineering). We have some application developer folks who write and fix our custom applications (bodhi, pagure, release-monitoring, etc) as well as a number of Operations folks who keep the Fedora and CentOS infrastructures running smoothly.

We spent the week of Jan 21st meeting up and discussing plans for the year as well as ways we could be more responsive to the community and better handle our (large) workflow.

  • 2019-01-21: Brian Stinson went over the CentOS CI setup we have, and we identified projects that we care about that didn't have any CI and worked on fixing them up. We got a bunch more projects with (albeit simple) tests running.
  • 2019-01-22: We talked about ways to be more efficient with our workload. We decided to try pairing an ops person with a dev person on deployments to avoid delays. We talked about doing more pair work. We talked about changing our status reports. Then we wrote up all the planned work we know of in the coming year, prioritized it, and gave each item an owner to write it up. We should have this info up on the wiki before too long (or somewhere).
  • 2019-01-23: We talked about rawhide gating and changed our plan to be simpler than it had been. We went over the fedmsg to fedora-messaging changeover. We moved some apps to openshift and fedora-messaging. More to come.
  • 2019-01-24: We had some meetings with some internal Red Hat teams on how we could help each other by doing things first in Fedora and how best to do that. We worked some more on priorities and upcoming tasks.

Then it was time for devconfcz. Always a great conference. Tons of talks to see and tons of people to talk to in the hallway track. A few of the talks I really wanted to go to I got to too late and they were already full, but I did see some interesting ones.

There was a lot of discussion about EPEL8 in the hallway track, but luckily we had a number of the people who know how modularity works there to quash plans that wouldn't work and to propose ones that would. At this point the plan is to make an EPEL8 beta that is just the "ursine" packages and test that out while working on modular EPEL8. For modular EPEL8 we are going to look at something that takes the modular RHEL repos and splits them out into one repo per module. Then we can hopefully get mbs to use these external modules when it needs them as build requirements, and we can also decide what modules we want in the 'ursine' buildroot. This is all handwavy and subject to change, but it is a plan. :)

Smooge and I gave our EPEL talk and I think it went pretty well. There were a lot of folks there at any rate and we used up the time no problem.

As always after a chance to meet up with my co-workers and see tons of interesting talks I'm really looking forward to the next few months. Lots and lots of work to do, but we will get it done!

Rawhide notes from the trail, mid december 2018

Just a few notes from the trail this week:

  • zchunk repodata should be in place as of today's compose. Feedback on how much repodata you need to download now, or on any other issues with it, would be good to get fixed up before we branch f30 off of rawhide. Ideally people will be downloading a LOT less repodata now.
  • AdamW set up the openqa reports that go to the devel and test lists to also note which tests would be gating and, if we were gating, what we would have hit. This is prep work for gating landing, so we can fix those tests/issues and start with a GO.
  • Not directly rawhide, but related: bugzilla was finally updated to bugzilla 5. Overall things went fine, but there are a few issues: bodhi is having trouble updating bugs sometimes, and things that use libreport (anaconda and abrt) are no longer just sending one email on new bugs, but an email for every attachment. These issues are being worked on.
I hope everyone has a relaxing holiday season.

OpenShift in Fedora Infrastructure

I thought I would write up a quick post to fill folks in on what our OpenShift setup is in Fedora Infrastructure, what we are doing with it now, and what we hope to do with it in coming years.

For those that are not aware, OpenShift is the Red Hat version of OKD, an open source container application platform. That is, it's a way to deploy and manage application containers. Each of your applications can use a known framework to define how it is built, managed, and run. It's pretty awesome. If you need to move your application somewhere else, you can just export it and import it into another OpenShift/OKD and away you go. Recent versions also include monitoring and logging frameworks. There is also a very rich permissions model, so you can give as much control to a particular application as you like. This means the developer(s) of an application can also deploy/debug/manage it without needing any ops folks around for that.

Right now in Fedora Infrastructure we are running two separate OpenShift instances: one in our staging env and one in production. You may note that OpenShift changes the idea of needing a staging env, since you can run a separate staging instance or just test one container of a new version before using it for all of production. However, our main use for the staging OpenShift is not staging applications so much as having another OpenShift cluster to upgrade and test changes in.

In our production instance we have a number of applications already: bodhi (the web part of it; there is still a separate backend for updates pushes), fpdc, greenwave, release-monitoring, the silverblue web site, and waiverdb. There are more in staging that are working on getting ready for production.

One of the goals we had from the sysadmin side of things was to be able to easily and completely re-install the cluster and all applications, so we have made some setup choices that differ from what others might do. First, in order to deploy the cluster, we have in our ansible playbooks one that creates and provisions a 'control' host. On this control host we pull an exact version of the openshift-ansible git repository and run ansible from the control host with an inventory we generate and that specific openshift-ansible repo. This allows us to provision a cluster exactly the same way every time. Once the cluster is set up, our ansible repo has the needed definitions for every application and can provision them all with a few playbook runs. Of course this means no containers with persistent storage in them (or very few, using NFS), but so far that's fine. Most of our applications store their state in a database, and we just run that outside the cluster.

Short term, we plan to move as many applications as we can (or as make sense) to OpenShift, as it's a much easier way to manage and deploy things. We also intend to set things up so our prod cluster can run staging containers (and talk to all the right things, etc). We also hope to run a development instance in our new private cloud. This instance we hope to open more widely to contributors for developing applications or proofs of concept. We would like to get some persistent storage set up for our clusters, but it's unclear right now what that would be. Longer term, we hope to run other clusters in other locations so we can move applications around as makes sense, and also for disaster recovery.

I'd have to say that dealing with OpenShift has been very nice. There have been issues, but they are all logical and easy to track down, and the way things are set up just makes sense. Looking forward to 4.0!
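To give a concrete flavor of the day-to-day workflow, here is a hypothetical sketch of deploying and later exporting an application on an OpenShift 3.x cluster with the oc client. The project name, git URL, and cluster hostname below are made-up placeholders, not our actual setup:

```shell
# Hypothetical example: all names and URLs are placeholders.
oc login https://openshift.example.com:8443                 # authenticate to the cluster
oc new-project myapp                                        # a namespace for the application
oc new-app https://git.example.com/myapp.git --name=myapp   # source-to-image build and deploy
oc expose svc/myapp                                         # create an external route
oc get pods                                                 # watch the rollout
# Dump the application's objects so they can be re-created in another cluster:
oc get dc,svc,route,is --output=yaml > myapp-objects.yaml
```

In our case the object definitions live in ansible and get applied by playbook runs rather than by hand, which is what makes the 'reinstall everything from scratch' goal workable.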

Rawhide notes from the trail, the late November issue

Greetings everyone! Let's take a look at notable things from the rawhide trail in the last week:

  • We had 2 DOOMED composes and 5 FINISHED_INCOMPLETE
  • The DOOMED ones failed because of broken deps in gnome-contacts (making the Workstation live media fail to compose). This was actually fixed very quickly (just needed a rebuild), but for some reason got stuck in the signing queue, so it still wasn't fixed when we thought it was.
  • I just updated pungi on rawhide-composer to 4.31.1, which should make composes a little faster. See this post by pungi developer Lsedlar
  • Some folks have been having problems with the dbus->dbus-broker change. Do make sure that dbus-broker is enabled to start on boot if you run into strange boot issues.
  • There's a good deal of high level talk on the devel list about pushing the f31 release out to next year to allow for more tooling and possibly longer lifecycles. Do read and contribute if you have thoughts on the matter. I'd definitely like to see us improve things, but so far most of the discussion has been very high level and handwavy. Can't wait to get into details.
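For the dbus-broker item above: if you hit strange boot issues, checking and enabling the new units is quick. This assumes a current rawhide with the dbus-broker package installed; these are the units dbus-broker upstream documents:

```shell
# Enable dbus-broker for both the system bus and the user/session bus:
sudo systemctl enable dbus-broker.service
systemctl --user enable dbus-broker.service
# Verify which implementation is actually backing the bus:
systemctl status dbus.service
```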
Otherwise it's been a pretty typical week. Between 2018-11-23 and 2018-11-30: Added packages: 19, Removed packages: 12, Modified packages: 1293.

Rawhide notes from the trail: The thanksgiving edition

Sheesh, it's been a while since I blogged anything again. I'm going to try to do at least 1-2 posts a week moving forward. So, what's been happening in the rawhide world? Let's look back on this last month:

  • 2018-11-23 (today): dbus-broker switched to being the default dbus (an f29 change that didn't make it and is planned for f30 now). Unfortunately, this broke anaconda, as it launches its own bus and no longer had the right package installed to do that. This meant that, well... installing did not work at all. Adamw quickly submitted a PR and a new fixed build, and another rawhide compose is running now.
  • 2018-11-22: dnf maintainers were updating the dnf stack of things and happened to be in the middle of that when rawhide compose started, this meant that everything had broken buildroots and failed to compose. As a reminder, rawhide composes start at 05:15 UTC every day. Please avoid breaking things right before that.
  • 2018-11-21: finished (incomplete)
  • 2018-11-20: finished (incomplete)
  • 2018-11-19: finished (incomplete)
  • 2018-11-18: finished (incomplete)
  • 2018-11-17: finished (incomplete)
  • 2018-11-16 and 2018-11-15: The 14th compose got stuck, so the 15th never happened.
  • 2018-11-14: A change in an ostree file meant that pungi didn't properly emit a fedmsg so the ostree could be signed. So, it got stuck there waiting. A fix was attempted, but (my mistake) not applied correctly (I didn't update pungi on the composer after I built a patched one).
  • 2018-11-13: failed due to libreoffice deps being broken and the workstation live thus not building.
  • 2018-11-12: finished (incomplete)
  • 2018-11-11: finished (incomplete)
  • 2018-11-10: finished (incomplete)
  • 2018-11-09: finished (incomplete)
  • 2018-11-08: failed due to pungi-gather segfaults. This has proven very difficult to track down. In case it was memory related (and because we had more memory handy), I increased the rawhide composer's memory by a ton. That seems to have avoided the issue, at least; we haven't seen it since this day.
  • 2018-11-07: failed due to pungi-gather segfaults.
  • 2018-11-06: failed due to pungi-gather segfaults.
  • 2018-11-05: finished (incomplete)
  • 2018-11-04: failed due to broken kde deps
  • 2018-11-03: failed due to two issues: 1) aarch64 had a issue with networking that caused the aarch64 cloud image to fail and 2) there was a weird issue with python3 and anaconda.
  • 2018-11-02: failed (see 11-03)
  • 2018-11-01: failed (see 11-03)
Most of the more annoying failures would have been blocked if we had gating in place (no broken deps would have crept in, and if we tried a base image with anaconda/python/etc we would have found the breakage earlier). I sure hope we can get that in place someday. :)

A short note about posting links to facebook

Just a short note to everyone out there who posts links to facebook pages or posts: something you may not realize is that for those of us who do not have (or do not want to use) a facebook account, the experience is pretty subpar. You go to the page (without being logged in to facebook) and the page or post loads; you start to read it and then... BAM, a gigantic popup appears, blocking all the page contents. It's a dialog asking you to log in to facebook or create a new account, or (in small print on the bottom) "Not now". So, not wanting to log in (for whatever reason), you click 'Not Now'. The dialog disappears... but is replaced by a wide bar on the bottom of the page asking you to log in or create a new account. There is no close button on this bar, and it blocks about the bottom 25% of the content, so you get to page up and down and try to read around it. I know all the reasons facebook is doing things this way, but I just wanted to mention it to those of you who do log in to facebook and perhaps didn't even know about this. So, do think, next time you share a link: is there some more free place to share it?

backups⁉️

A nice long weekend (in the US at least) is a great time to deal with your Backups. You do have backups right? They work right? For a number of years now I have been using rdiff-backup for my backups. Unfortunately, a week or so before flock my backups started erroring out and I put off looking into it for various reasons until now. rdiff-backup has a lot going for it, but these days it also has a lot against it:

  • It's in python2 with no python3 support, and as we all know, python2 is going away before too long.
  • There's not really any upstream development. A few years ago development was handed off to a new team, but they haven't really been very active either.
  • No encryption/compression support
  • slow, especially when you have a lot of snapshots.
So, I decided it was time to look over the current crop of backup programs and see what was available. My criteria, which may be very different from yours:
  • Packaged in Fedora (bonus points for epel also. I don't have any RHEL/CentOS boxes at home, but I'd like to be able to use whatever I find in Fedora Infrastructure too, where we are also using rdiff-backup).
  • Encryption/Compression support.
  • Active and responsive upstream
I went and looked at: BackupPC, zbackup, borgbackup, burp, bacula, obnam, amanda, restic, duplicity, and bup (and possibly others I don't remember). From those I narrowed things down to restic and borgbackup. Both have very active upstreams and are packaged in Fedora, but restic doesn't have compression support yet (it's written in go and they are waiting on a native implementation). Also, restic isn't (yet) packaged in EPEL.

So, with that, I took a closer look at borgbackup. Upstream is quite active, and there's various encryption and compression support. You can store an encryption key in the repository or outside it, and use zlib, lz4, or lzma compression. There is an interesting 'append-only' mode: you can set it in the ssh key a particular client uses to contact your backup server, and then that client can only append, not delete, any backup data. That might be nice if you are backing up a bunch of clients to the same repository (and thus getting the deduplication savings). Of course, you can only run one backup at a time on the same repository, so you would need to spread them out or keep them retrying or something. Likely not a big deal for my home network, as I only have a laptop and a main server to back up here, but in larger setups it could be a problem.

So, after a bit of cleanup on my laptop (I had some old copies of emails in several places, photos in a few places, junk I no longer needed/wanted, etc), I fired off an initial borg backup to my storage server. It's still running now, but it seems to be going along pretty nicely. As soon as I have a full backup, I'll try a restore of a few random files to make sure all is well.

The final part of my backups is moving the backups from my storage server 'off site'. In past years I copied my backups to an encrypted external drive and gave it to a friend to store, but I don't have such a local person here, so I will need to investigate amazon glacier or other options. Look for another post on that soon!
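For the curious, the basic borg workflow looks something like this. A sketch only: the hostname, repository path, directories, and the archive name in the extract step are placeholders, not my real setup:

```shell
# Create an encrypted repository on the storage server (key stored in the repo):
borg init --encryption=repokey ssh://backupserver/srv/backups/laptop.borg

# Run a backup with lz4 compression; '{hostname}-{now}' is borg's built-in
# placeholder syntax for naming archives:
borg create --stats --compression lz4 \
    ssh://backupserver/srv/backups/laptop.borg::'{hostname}-{now}' \
    ~/Documents ~/Photos ~/Mail

# List archives, then spot-check a restore of a single file
# (extract paths are given relative, without a leading slash):
borg list ssh://backupserver/srv/backups/laptop.borg
borg extract ssh://backupserver/srv/backups/laptop.borg::laptop-2019-05-27T09:00 \
    home/user/Documents/somefile.txt
```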
If you are reading this, perhaps you could take a minute to think about your backups, make sure they exist, are running ok and work.