Fedora Infrastructure Outages (and the one on 2016-10-08)

nirik

2016-10-08 14:03

Astute folks may have noticed that we had a pretty severe outage this morning that lasted just about an hour. This seems like a good time to talk about outages in general and this one in specific. In general these days our outages fall into roughly 3 'buckets':

Planned outages. We do these every few months to reboot servers and apply updates, or when we have to migrate something that will take some downtime. This is by far the largest bucket.
Unplanned outages, but that are small in scope. Typically this might be one of our many datacenters having network problems or a single service having some issue. These typically aren't noticed by very many people (in most cases only the infrastructure noc folks who disable a proxy or restart a service or whatever is needed).
Unplanned outages that are large in scope. These are thankfully rare. These are were we loose connectivity to our main datacenter (that has the majority of our hardware in it), or severe hardware or software failure (storage servers not working, database needs reloading, etc). Luckily these are pretty rare.

So what happens in an outage? Normally nagios lets us know about it pretty quickly and someone starts taking a look. Usually until we have some idea, we treat things as being in bucket 2 if they aren't planned. That is, we assume it's a minor issue until we investigate and see it's larger. Once the scope of the issue is determined, we update https://status.fedoraproject.org/ This is out status indicator. If you are wondering if there is a known outage, you can look at this. Note that this is MANUALLY updated after initial innvestigation, so do give a few minutes after seeing something before expecting status to be updated. Next, folks work on the outage, this would be anyone in the sysadmin-noc group and the primary infrastructure admins. Status is then updated when the outage is over. Today (2016-10-08) at approximately 16:15UTC our primary datacenter (PHX2) became unreachable. This turned out to be due to a firewall upgrade that did not go as planned. Service was fully restored at approximately 17:15UTC. Right now when our primary datacenter isn't reachable most things are also not working as we have our primary vpn hub located there, and thats what our various proxies use to talk to applications. Unfortunately this includes our most active service: mirrorlists. We have planned to move that service to a container and run it at each datacenter, but just have not gotten to deploying that yet. Once we do, at least mirrorlists will stay up if the primary datacenter is down. Additionally, we have plans to eliminate the first type of outage or reduce it greatly. For the most part our applications are deployed with pairs (or more) of instances so no service needs to go down, however the sticking point is rebooting database servers. I've been working on setting up some replication so hopefully we don't need to schedule any outages for that either.