
Fedora Infrastructure release day retrospective

Fedora 21 was released yesterday. (If you haven't already, go get it: https://getfedora.org ) This release was not as smooth for infrastructure as previous releases have been, for which I apologize. Here's what happened:

For the last few weeks we had been seeing sporadic slowdowns in the bodhi application, but had been unable to isolate what was causing them. This last week was the Fedora Infrastructure Mirrormanager 2 / Ansible FAD, and there we added some more debugging, but still couldn't see where the problem was. It wasn't in bodhi itself, but somewhere in its integration with the authentication system and getting to that via proxy01 (our main datacenter proxy). Proxy01 seemed busier than usual, but it gets a lot of traffic anyhow. We bumped up memory on it to make sure it could better cope with release day.

Then, release day: proxy02 (a server in England) became unable to cope with the load and we removed it from DNS. Then proxy01 started having problems. Since most services were slow in any case, we updated our status page to say it was release day and to expect slowdowns. Most services (aside from bodhi) were actually up and fine, just slower than normal. Some folks took this to mean we were completely down, but that was not the case. Next release we will probably put up a special banner telling people it's release day and to expect things to be slow, but up and working.

Finally, this morning Patrick discovered a problem in our DNS setup. It had been there all along, but the amount of traffic we had been seeing in the last few weeks, and especially on release day, made it much worse: only proxy02 and proxy01 were available for EU DNS. This meant that EU folks would always get those 2 proxies, and with one out, always get that single one. There were 2 other proxies that should have been in DNS for EU, but were not. We quickly added them, added proxy02 back in, and things have been very quiet since then. With proxy01 not having to handle all of the EU traffic, bodhi was happy again, and with 2 more proxies closer to the EU, EU users should be happy again. Many thanks to Patrick for finally tracking this down.

Sorry for the slowdowns and issues on release day. Everything should be back to normal now, and we should not have this problem on the next release. In the last week, our master mirrors have pushed out around 50TB of data. Not bad.
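As an aside on the DNS fix: a quick way to sanity-check what a nameserver is handing out for a round-robin pool like this is to just ask it directly. Something like the following (the server name here is illustrative, not necessarily our exact setup), run from both a US and an EU vantage point, would have shown the EU view only had the two proxies in it:

    # ask an authoritative nameserver which proxy IPs it returns;
    # run from an EU host to see what the EU GeoDNS view hands out
    dig +short @ns02.fedoraproject.org fedoraproject.org A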

Mirrormanager and ansible FAD (day 5)

Last day to get things done. Tomorrow folks head home (on release day morning, even). I cleaned up our postgres server ansible role to handle things more nicely. Ralph kept slugging away at the proxy playbooks. Patrick worked on single sign-out, then openid for bodhi. Pingou got mirrormanager2 rpms built and playbooks set up to install them in staging. I looked over keepalived, and it should work for our limited needs. So, moving forward, we need to:

  • Finish getting proxy setup in staging and test it a bunch before rolling out to production.
  • Finish mirrormanager2 deployment in stg and test a bunch and roll to production.
  • Next week we will likely schedule some outages and migrate some machines that need downtime: fas servers, db servers, virthosts, etc.
  • mailman3 and bodhi2 need to land somewhat soonish, before the f22 alpha, so we don't have to port the old bodhi1/mailman2 servers to ansible.
  • We need to finish koji in staging and finish keepalived (see the sketch after this list).
  • Some more misc playbooks need doing.
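On the keepalived item above, here's a minimal sketch of the sort of config we have in mind: two proxies share a virtual IP, and if the MASTER drops off, the BACKUP takes over. The interface, VIP, and password below are placeholders, not our actual setup.

    vrrp_instance proxy_vip {
        state MASTER             # the other node would say BACKUP
        interface eth0
        virtual_router_id 51
        priority 100             # higher priority wins the election
        advert_int 1
        authentication {
            auth_type PASS
            auth_pass examplepw  # placeholder
        }
        virtual_ipaddress {
            192.0.2.10/24        # documentation-range placeholder VIP
        }
    }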
We then got the mirrormanager2 frontend, backend and crawler working in stg. We still need mirrorlists and a lot more testing. Then we helped out with some release setup for redirects and such for getfedora.org tomorrow. Tomorrow we all fly back home. It's been a great FAD and we got a ton of things done... I'm looking forward to the next one. :)

Mirrormanager and ansible FAD (day 4)

Two more days of FAD to go: today and tomorrow, and then folks head home (on Fedora 21 release day). Today we really dug into figuring out our proxy setup in ansible. We divided up some things and got to work. I made a proxy02.stg instance that we could configure with ansible and then compare against the proxy01.stg made by puppet. Ralph worked on the templates for the websites, and Pingou worked on varnish and haproxy ansible roles. Patrick worked on migrating all our nameservers from puppet to ansible. Luke worked on test coverage for mirrormanager 2 and got it really going on the mirrorlists setup. Smooge worked on migrating people and planet to ansible. David finished getting fedimg all working in stg.

Later in the afternoon we all took a break to go up to Durham and visit Seth Vidal's ghost bike. It was a somber occasion. I still very much miss him. ;( After that we met up for dinner with Tom Callaway and Ruth, then everyone headed to sleep. Last day tomorrow. Hope we can get the proxy setup mostly done. :)

Mirrormanager and ansible FAD (day 3)

Today we started by looking at the big picture of where we are in our migration to ansible and what we need to do to complete it. There are 67 hosts still in puppet right now. 27 are all ready to move; we just need to schedule downtime for them and do them. 27 more just need some playbooks written, then they can be migrated. The rest are special in one way or another: lockbox01 needs to be the last one we do, as it's the puppet master, and the proxy hosts are tricky as they have a LOT of apache config on them. We discussed various ways of moving forward on it.

I started working on landing the playbooks for fas (fedora account system), which turned out to be a bit tricky, but I got our staging instance going. We can do the production ones as soon as we schedule a downtime. Pingou made several playbooks for hosts and got them ready to go. Patrick got the nameservers sorted out and ready to move. Smooge got all our non-logging hosts logging! This was great... of course, as a reward, we got some error messages from a host that he then had to go and fix up, but it's great to have them all logging away correctly again. Ralph and David did some work to get fedimg all humming along in ansible in staging. Luke got most of the work done to get atomic composes going with bodhi updates pushes. Hopefully we can land that next week after release.

We talked about our existing proxy setup (httpd, varnish, haproxy, applications, memcached) and determined that we should change how we are doing memcached. Instead of some shared instances, we'll just give applications a local one that only they talk to. This eliminates the external instances as a point of failure and should help all the apps. Tomorrow, more ansible. Hopefully we can at least prototype the proxy setup.
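On that memcached change: a hedged sketch (not our actual role) of what giving an app host its own local memcached could look like in ansible, bound to localhost so nothing external can reach it:

    - name: install memcached
      yum:
        name: memcached
        state: present

    - name: bind memcached to localhost only
      lineinfile:
        dest: /etc/sysconfig/memcached
        regexp: '^OPTIONS='
        line: 'OPTIONS="-l 127.0.0.1"'

    - name: start and enable memcached
      service:
        name: memcached
        state: started
        enabled: yes

The app config then just points at 127.0.0.1:11211, and there is no shared cache left to fail or to tangle apps together.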

Mirrormanager and ansible FAD (day 2)

The second day of the FAD was also devoted to Mirrormanager work (as we only had Matt Domsch around through today). I created 4 new staging instances for us: a mirrorlist server, a frontend (to run the web frontend), a backend (to see new content, make mirrorlists, etc.) and a crawler (to crawl mirrors and find out-of-date ones). It took a bit of work to get the production nfs mount set up (read only, of course) on the new backend.

Next was fun with databases for me: I copied the production mirrormanager 1 db over to staging and got staging mirrormanager 1 working (it had a config issue). Then I made a mirrormanager 2 db from the mirrormanager 1 dump, and Pingou took that and migrated it to the mm2 schema. Smooge looked over our stats and found some strange issues, for which we added a bunch of logging to try and isolate what was going on and what people were really getting. There was a lot of discussion around atomic trees and how to add them properly to mirrormanager.

Then it was off for a quick stop by the Red Hat holiday party that happened to be going on this evening, and off to dinner with Greg from ansible. It was awesome that the ansible folks took us out for dinner; they are great and you should all be using ansible. ;) Tomorrow, on to the ansible part of our hackfest.
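The db copy itself was nothing fancy; roughly this shape (host and db names here are illustrative, not our real ones):

    pg_dump -h db-prod mirrormanager > mm1.sql   # dump production
    createdb -h db-stg mirrormanager             # fresh staging db
    psql -h db-stg mirrormanager < mm1.sql       # load the dump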

Mirrormanager and ansible FAD (days 0 and 1)

On Wednesday afternoon I flew out to Raleigh for our Mirrormanager2 and Ansible FAD. Luckily this time I got a direct flight (from Denver), so no connections to miss. My flight ended up being about 40 minutes late in the end (possibly due to fog; they never said) and I got into the hotel after midnight and crashed.

A bit of background on this FAD: we had planned it many months ago, hoping it would be after the Fedora 21 release (alas, due to slips this was not the case). Also, we had hoped to work on FAS3 instead of ansible, but we couldn't pull in all the folks who work on FAS3, so we decided to use the second half of the FAD for ansible work that we would really like to get done sooner rather than later. Mirrormanager2 is a rewrite of our aging mirrormanager1 setup, and we wanted a chance to work with the author of MirrorManager1 (Matt Domsch) to get the historical reasoning behind choices and see where we can redo the design.

Thursday was the first day of the FAD (which is hosted over at the Red Hat Tower in downtown Raleigh). We all gathered and did a bit of an overview of what was implemented so far in mirrormanager2 (basically the user interface, db and backend) and what we still needed to work on (all the scripts, the crawler, the mirrorlists server, etc.). Then us sysadmin types stepped away for a few hours for some meetings with folks we work with in Red Hat IT. There was some great discussion about plans for datacenters and making things better for our Fedora users via Red Hat infrastructure we can leverage. Finally we got back to the FAD, went over more information from the past, and code started to flow. We ended the day by going out to an Irish pub for dinner, then back to the hotel for some much-needed sleep.

30 days! whew!

Well, it looks like I managed to do it: a blog post every day for 30 days (all of November). :) There were a few times, especially toward the end, where I was at a loss for topics. I didn't want to repeat myself or go over the same things, yet still wanted the posts to be interesting. I hope folks found them interesting, thought-provoking, or at least amusing. I think coming out of this I will probably try to blog more, but will cut back from daily for a while at least.

Some handy systemd/journal tips

I thought I would share some nifty little systemd/journal commands I have run into over the past year.

  • easy way to make sure something restarts on fail: mkdir /etc/systemd/system/nameofthing.service.d, then make a nameofthing-override.conf in that dir with: [Service] Restart=always, and run systemctl daemon-reload (full example after this list).
  • systemd-delta - Gives you the list of things that have changed from the 'default' units. Ones overridden, diff of any that were edited, etc.
  • journalctl --list-boots - Gives you a list of boots (with also the number you can use with the next command) and times/dates.
  • journalctl -b N - where N is the boot you want information from. I use this all the time with -1 or -2 to see how things changed in the previous 2 boots.
  • systemctl status PID - you can pass this any PID and it will give you the status of the unit that started it/controls it. Very handy to see what some random process is a part of.
  • systemctl suspend -i - Ignore any inhibitors and just suspend. This is like a --force, so be careful if you are in the middle of something like a package update.
  • systemd-inhibit --list - shows you all active inhibitor locks (suspend, shutdown, etc.) and who asked for them.
  • journalctl --disk-usage - show usage of journal logs on disk. You can tune them then in /etc/systemd/journald.conf
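To make the first tip concrete, here's the whole dance for a hypothetical nameofthing.service (the unit name is just a placeholder):

    mkdir -p /etc/systemd/system/nameofthing.service.d
    cat > /etc/systemd/system/nameofthing.service.d/nameofthing-override.conf <<EOF
    [Service]
    Restart=always
    EOF
    systemctl daemon-reload   # make systemd pick up the new drop-in

After that, systemd-delta (the second tip) will list the drop-in as an extension of the original unit.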
Hope you all find some of these as handy as I did. ;)

Fedora 21 RC1

For once we managed to make an RC before the last minute for a release. ;) So, if you have a few minutes this weekend and want to help us out, please go and test Fedora 21 Release Candidate 1.

  • See the test announce list post: https://lists.fedoraproject.org/pipermail/test-announce/2014-November/000958.html for lots of links and more information.
  • See #fedora-qa on chat.freenode.net to discuss findings with other QA folks.
  • See the blockerbugs application to see where things stand: https://qa.fedoraproject.org/blockerbugs/milestone/21/final/buglist
Fedora 21 looks like it's going to be a nice stable release. ;)

Thanks

I'll take today's blog post as a chance to co-opt a silly holiday and turn it into a chance to be thankful to some folks:

A big thanks to my co-workers: it's awesome working with all of you to make Fedora and the world a better place.

A big thanks to the Fedora community: I've met many wonderful people via Fedora and it's great anytime I interact with them. The Fedora community is like a large extended family (including crazy uncles, great cooks, world travelers, and everyone in between).

A thanks to my company, for paying me to work on something I love working on, and for having an open source culture that makes me welcome.

And finally, of course, a thank you to my family: my Girl and my dogs.

Hope everyone out there has a chance to think and be thankful for the people they have in their life.