On reboots

nirik

2011-06-26 08:36

Rebooting machines is an interesting study in the varied opinions of the Linux community. At one end are folks who use ksplice or simply avoid rebooting for any reason short of a hardware failure. At the other are desktop users who reboot their machines daily. I'm somewhere in the middle: a server should be rebooted whenever a security update to the kernel or glibc applies to it. I usually reboot my laptop only when there's a reason (I want to test something related to the boot process, there's a security update, etc). Rebooting servers regularly (and rebooting for security updates seems to accomplish this) has several other advantages:

You can schedule your rebooting. Power cycling or rebooting a machine sometimes puts stress on the hardware. If it fails during a scheduled reboot, you are able to call for service, etc. If it fails at 2am on a Sunday morning when you only have 9 to 5 business day coverage, you are in much worse shape.
Even with configuration management, there are times when things are added to a server but not set to start on boot, or need some config change to start properly. Better to fix those in a maintenance window than to have them fail to come up after an unattended reboot.
By rebooting your servers, you can find out a lot about how they interrelate and where the points of failure are.
Do you recall how to get to the serial console / IPMI / PDU / KVM for that server? A scheduled reboot is a great time to make sure your console access works and shows the boot process.

Fedora Infrastructure currently has about 150 machine instances to manage. Until recently, the mass reboot process was pretty much just scheduling a block of time and powering through the reboots. We would often run over our window, as that is a lot of machines to reboot and confirm are back up properly. So I worked on changing our process for mass reboots. Now all hosts are put into 3 different buckets:

The "C" group. These are machines that only infrastructure folks will notice are down, or machines with redundant resources, so they can be rebooted anytime as long as failover or DNS changes are made first so that no live traffic is still hitting them.
The "B" group. These are machines associated with Fedora contributors and package maintainers. End users won't notice these being down, but contributors and package maintainers will. These reboots need to be scheduled, but since there are few machines in this group, the outage is small.
The "A" group. These are machines that end users may notice being down or slow to respond. Database servers or mailing list hubs would be in this group. Again the number is very small, so outages would be very short.
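
The bucket scheme above can be sketched in a few lines of Python. This is only an illustration, not the tooling Fedora Infrastructure uses: the host names, the HOSTS inventory, and both function names are invented for the example. The only things taken from the text are the three group letters and the rule that "A" and "B" reboots need a scheduled outage while "C" hosts can be rebooted anytime once traffic is moved away.

```python
from collections import defaultdict

# Hypothetical inventory: host names and assignments are made up for illustration.
HOSTS = {
    "proxy01": "C",
    "proxy02": "C",
    "pkgs01": "B",
    "db01": "A",
}

VALID_GROUPS = ("C", "B", "A")


def reboot_plan(hosts):
    """Bucket hosts by group, listing the least user-visible group ("C") first."""
    buckets = defaultdict(list)
    for host, group in sorted(hosts.items()):
        if group not in VALID_GROUPS:
            raise ValueError(f"unknown reboot group {group!r} for host {host}")
        buckets[group].append(host)
    return {group: buckets[group] for group in VALID_GROUPS}


def needs_outage_window(group):
    """'A' and 'B' reboots are scheduled; 'C' can happen as time permits."""
    return group in ("A", "B")


if __name__ == "__main__":
    for group, members in reboot_plan(HOSTS).items():
        label = "scheduled outage" if needs_outage_window(group) else "anytime"
        print(f"{group} ({label}): {', '.join(members)}")
```

Keeping the group assignment in one place like this also makes it easy to see how many hosts are still left in "A" and "B".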

Over time, we are working on moving all the hosts in the "A" and "B" groups into "C", which would let us reboot things as time permits with no need for any scheduled outage. Even now, with the above setup, outages are much shorter and less frantic. We used the new method and planning for the last set of reboots, and I think it went much more smoothly. We also discovered some additional points of failure:

A nameserver reboot caused some internal machines to stop processing, which allowed us to revamp our nameserver setup and make sure all machines were set up to fail over properly to another nameserver.
An NFS server reboot in the "B" group affected some web servers in the "A" group. This allowed us to revisit why there's a dependent NFS mount on those web servers.
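
Surprises like the NFS one are the kind of thing a simple check could flag before a reboot window, provided the dependencies are written down somewhere. The following is a minimal sketch under that assumption. The GROUPS and DEPENDS_ON tables and all host names are hypothetical, and in practice the hard part is discovering the dependencies in the first place, which this code does not attempt.

```python
# Hypothetical data: which group each host is in, and which hosts it depends on.
GROUPS = {"nfs01": "B", "pkgs01": "B", "web01": "A", "web02": "A"}
DEPENDS_ON = {
    "web01": ["nfs01"],   # e.g. an NFS mount on a web server
    "web02": ["nfs01"],
    "pkgs01": ["nfs01"],
}


def cross_group_dependencies(groups, depends_on):
    """List (dependency, its group, dependent host, its group) for every
    dependency that crosses a reboot-group boundary."""
    findings = []
    for host in sorted(depends_on):
        for dep in sorted(depends_on[host]):
            if groups[host] != groups[dep]:
                findings.append((dep, groups[dep], host, groups[host]))
    return findings


if __name__ == "__main__":
    for dep, dep_group, host, host_group in cross_group_dependencies(GROUPS, DEPENDS_ON):
        print(f'rebooting {dep} ("{dep_group}") will affect {host} ("{host_group}")')
```

With the sample data, the two web servers in "A" are reported as depending on the NFS server in "B", while pkgs01 is not reported because it shares a group with the server it depends on.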

True to the Fedora way, the details are available at: Mass_Upgrade_Infrastructure_SOP.