Spurred by a completely broken server at one of our clients, we are re-evaluating our own backups and disaster recovery plans. As we all know, most people give little thought to backups, and even less to how to use them for recovery. In the meantime our servers quietly hum along and then act up at the most inopportune time. It is not a question of whether a server will fail. It will. The only questions are when it will fail and what you will do to recover from the failure. In this post we will examine strategies to mitigate the pain and delays after a catastrophic failure.
Note that I use a very narrow definition of disaster recovery: we are not dealing with large-scale floods or fires here, just localized service disruptions. For coping with the bigger issues you can peruse a recent Computer World article.
I will point out also that Sage Tree Solutions can help you work through the steps outlined below and put your mind at ease. Just give us a call or use the contact link above.
Your first line of defense is knowing your site, identifying single points of failure, and estimating how likely each one is to fail. Then you figure out how much mitigation you need to reduce your exposure to risk.
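To make that exercise a bit more concrete, here is a minimal sketch of how you could rank points of failure by expected downtime per year. Everything in it is an assumption for illustration: the `FailurePoint` class, the failure rates, and the recovery times are hypothetical, and you would replace them with estimates for your own site.

```python
from dataclasses import dataclass


@dataclass
class FailurePoint:
    """One single point of failure with rough, hand-estimated numbers."""
    name: str
    failures_per_year: float  # your own estimate, not a measured value
    hours_to_recover: float   # how long the outage lasts when it happens

    @property
    def expected_downtime_hours(self) -> float:
        # Expected yearly downtime = how often it fails * how long it hurts.
        return self.failures_per_year * self.hours_to_recover


def rank_by_exposure(points):
    """Return the failure points sorted from highest to lowest exposure."""
    return sorted(points, key=lambda p: p.expected_downtime_hours, reverse=True)


# Hypothetical numbers, purely for illustration.
points = [
    FailurePoint("disk failure", failures_per_year=0.25, hours_to_recover=8.0),
    FailurePoint("database corruption", failures_per_year=0.5, hours_to_recover=6.0),
    FailurePoint("only admin unavailable", failures_per_year=0.1, hours_to_recover=24.0),
]

ranked = rank_by_exposure(points)
total_hours = sum(p.expected_downtime_hours for p in points)

for p in ranked:
    print(f"{p.name}: about {p.expected_downtime_hours:.1f} hours per year")
print(f"total: about {total_hours:.1f} hours per year")
```

The item at the top of the list is where mitigation buys you the most. Note that a personnel issue can easily outrank a hardware issue once you put numbers on it.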
I won't go into detail about the points of failure on your server; you know the server better than I do. I will note, however, that you should also think about personnel and procedures as single points of failure, in addition to hardware and software. And don't think that the cloud isolates you from such things. Our client's server was on the Amazon cloud. It failed utterly and completely, and had to be rebuilt from scratch.
Code and databases are the most prone to break or otherwise get corrupted. Fortunately we have easy solutions in code repositories and backups. The harder problem is how you will make use of those tools to recover from a failure, but we will address this later. For now, make sure you create backups on a regular basis and that at least some of these backups do not live on the same server. Also make sure that you have backups for everything: the server OS and its configuration, your site code, and the database(s).
For hardware your options vary widely. They range from having spare parts available for repairs, through redundant setups for components that fail often (such as a RAID array for storage), to a second server on standby, all the way to a fully redundant set of servers, load balancers, network switches, etc. Again, do not expect for a minute that a cloud server isolates you from hardware issues. It will simply fail less often, but when it is affected by failing hardware it will typically make your life miserable. And in the cloud, using hardware for risk mitigation is replaced with making backups.
Risk mitigation at its most basic boils down to your comfort level and willingness to endure downtime. Figure that one out and everything else will follow.
Or maybe we should rather call this "the importance of off-server backups". Regardless, you absolutely need to make sure most of the backups are not stored on the server itself. If the server crashes completely and takes its storage with it, you need those off-server backups. Believe me, it happens! I have seen it many times, most recently last week. So be absolutely sure there are backups elsewhere and that they are recent enough for your comfort level. You will need them some day.
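Checking that the off-server copies are recent enough is easy to automate. The sketch below assumes your off-server backups show up as `.tar.gz` files in a directory you can reach from another machine. The function names and the file pattern are my own inventions for this example, not part of any particular backup tool.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_hours(
    backup_dir: Path,
    pattern: str = "*.tar.gz",  # assumed naming scheme for backup archives
    now: Optional[float] = None,
) -> Optional[float]:
    """Return the age in hours of the newest matching file, or None if there is none."""
    files = [p for p in Path(backup_dir).glob(pattern) if p.is_file()]
    if not files:
        return None
    newest_mtime = max(p.stat().st_mtime for p in files)
    current = time.time() if now is None else now
    return (current - newest_mtime) / 3600.0


def backups_are_fresh(
    backup_dir: Path,
    max_age_hours: float,
    pattern: str = "*.tar.gz",
    now: Optional[float] = None,
) -> bool:
    """True only if at least one backup exists and the newest is young enough."""
    age = newest_backup_age_hours(backup_dir, pattern, now)
    return age is not None and age <= max_age_hours
```

You could call `backups_are_fresh(Path("/mnt/offsite-backups"), max_age_hours=24)` from a daily job and raise an alert when it returns `False`. The path and the 24 hour limit are placeholders; pick the limit that matches your own comfort level.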
Having backups and spare parts in hand is well and good, but you want to be able to do something with them. Moreover, when you need them you will be very stressed or, worse, not be there yourself. You need a recovery procedure, fully documented and ready to go. Take your time with this; you will be glad you did. Make sure all steps are covered, that someone only vaguely familiar with the setup can understand it, and most importantly, that it contains information on how to access passwords and other security features in a secure manner. Your goal is to empower whoever will perform the recovery procedure with knowledge and resources so they can go about the job confidently and swiftly.
Your recovery procedure should include an estimate of how long it will take to perform. If you designed and wrote it well, this will be straightforward. Now is the time to check with all stakeholders and make sure the recovery time meets your business needs, i.e., your tolerance for downtime.
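If each step in the written procedure carries its own time estimate, the total falls out by simple addition and can be compared with the downtime your stakeholders will tolerate. Here is a small sketch of that arithmetic. The steps and the minutes are hypothetical examples, not a recommended procedure.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStep:
    description: str
    minutes: int  # estimated duration, ideally measured during a test run


def total_recovery_minutes(steps) -> int:
    """Add up the estimated duration of all steps."""
    return sum(step.minutes for step in steps)


def meets_tolerance(steps, tolerated_downtime_minutes: int) -> bool:
    """True if the whole procedure fits within the tolerated downtime."""
    return total_recovery_minutes(steps) <= tolerated_downtime_minutes


# Hypothetical procedure, purely for illustration.
steps = [
    RecoveryStep("Provision a replacement server", 30),
    RecoveryStep("Restore the OS configuration from backup", 45),
    RecoveryStep("Check out the site code from the repository", 15),
    RecoveryStep("Restore the database from the latest off-server backup", 60),
    RecoveryStep("Check the site and review the logs", 30),
]

print(f"Estimated recovery time: {total_recovery_minutes(steps)} minutes")
print(f"Fits a 4 hour tolerance: {meets_tolerance(steps, 240)}")
```

If the answer is no, you either shorten the procedure (for example with a standby server) or renegotiate the tolerance.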
Finally, you want to be sure you can actually get a server back with the help of your recovery procedure and the necessary parts and backups. At the very least, review the recovery procedure once or twice a year and check it for accuracy against the current server configuration, and look through your backups to ensure they are readable and can be unpacked. Better still, perform the basic recovery steps and thus test whether it all works as planned. You can stop short of pointing DNS to your newly cloned server, but I would suggest you check the Drupal site and look through logs for obvious failures.
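The "readable and can be unpacked" check is a good candidate for a script. The sketch below assumes your backups are tar archives (plain or compressed) and reads every file in the archive from start to finish without writing anything to disk. The function names are made up for this example. It only proves the archive is intact, not that its contents are what you need, so it does not replace a real test restore.

```python
import tarfile
import zlib
from pathlib import Path


def count_readable_files(archive: Path) -> int:
    """Read every regular file in the archive completely; return how many were read.

    Raises an exception if the archive is truncated or corrupted.
    """
    files_read = 0
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz) itself.
    with tarfile.open(archive, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            stream = tar.extractfile(member)
            # Reading in chunks forces full decompression of the member.
            while stream.read(1024 * 1024):
                pass
            files_read += 1
    return files_read


def is_backup_readable(archive: Path) -> bool:
    """True if the archive can be read in full and contains at least one file."""
    try:
        return count_readable_files(archive) > 0
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        return False
```

Run `is_backup_readable(...)` against a sample of your archives as part of the yearly review, and treat a `False` as a reason to fix the backup job right away.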
If you have followed all the steps above: Congratulations, you are a real hero! And no, I am not quite there yet myself, so excuse me while I get to work ...
In the meantime don't hesitate to give us a call if you would like to discuss your recovery options and we will gladly ease your worries.