Wednesday, March 16, 2011

Down Time

There are certain words, or combinations of them, that can cause any web development team to cringe. "Server" and "Crash" are, ideally, to be separated at all times and uttered in the same labored breath as rarely as possible.

A server going down usually translates into justifiably unhappy clients, an over-stressed web team, a very long night for all of the above, and, in many cases, a loss of confidence internally and externally.

Unfortunately, the scenario of a tech team pounding away at keyboards in the middle of the night, slightly resembling stereotypical geek characters from a TV drama (I’m picturing Chloe from 24), is not rare enough. Despite the efforts of the best technical minds out there, it happens. Taxes will be collected, we will all grow old (even with the best plastic surgeon), and your site will go down. The real trick is expedient recovery and, in the worst case, restoration of data on a new server. For some companies, the website is an extension or tool of marketing. For others, it's a hub of vital business logic and process. At Cascade, we fall under both categories and treat each accordingly.

In September of 2010, JPMorgan Chase, the second-largest bank in the world’s largest economy, with an IT budget and staff that would rival some small nations, suffered a complete outage that lasted not for an hour, not for half a day, but for THREE DAYS! For three long days, customers were unable to log in, transfer money, make payments, or even check their balances. In January of 2009, the then-popular blogging platform JournalSpace actually ceased operation. Not only could they not get the site up; they literally could not retrieve any data. For an unknown number of users, six years of witty family reenactments and how-to posts were instantly gone. (Could you imagine the uprising of Mom bloggers if this happened today? Horrifying.) After weeks of failed attempts at data recovery (reportedly costing the company more money than it had made the prior year), users were informed that the site would cease operation.

Cascade has been fortunate enough to have impressive collective uptime, but we don't take this for granted. When we have had problems, we've generally been able to isolate the issue (for example, a site going down does not mean you stop getting email) and bring things back up in a comparatively expedient manner. The fact that we back up early and often is what allows us to sleep at night. When we do encounter an issue, our first priority is to do whatever it takes to get things back up and running. Our second priority is to evaluate exactly what went wrong and how we can prevent a recurrence. If prevention is not possible, we ask how we can minimize the risk or, at the very least, improve recovery time. This is the silver lining for us as a smart web company. Having an outage is a huge problem, but what we learn from it almost always makes us more stable in the long run.

After hosting our clients' websites for almost a decade, we have a robust set of protocols for prevention and recovery, but we will never call it “good enough”. Regardless of whether you are hosting with cascade web//development or anyone else, realize that the most important questions are not whether your site will go down (it will), but how fast it will come back up and whether the proper backups are in place to ensure it will come back up at all.