Disaster Recovery: When Redundancy Fails

August 16, 2017 by Alex Collins

You're able to read this sentence right now because a lot of things between the server and your laptop screen repeat themselves, repeat themselves.

Redundancy – the duplication of critical components to ensure continuing operations – is baked deep into every unit of infrastructure with lives on the line. Airlines. Power systems. Computer networks. Redundancy as a principle has been so successful, in fact, that it only makes the news when it fails.

Two major airlines – Delta and British Airways (BA) – got the short end of the redundancy stick recently.

“Uncontrolled and uncommanded”

We've covered Delta's outage previously: a malfunctioning power control module caused a loss of power that killed its Atlanta command center, leading to over 1,800 canceled flights, thousands of passengers sleeping on airport floors, and over $120 million in lost revenue.

A similar but more recent outage took down BA's data center at Heathrow Airport last month. It was caused by an engineer who disconnected the data center's power supply and then reconnected it, as Willie Walsh, CEO of BA's parent company IAG, later put it, in “an uncontrolled and uncommanded fashion that caused physical damage to the servers and distribution panels.”

Analysts estimate the total cost of the outage to run anywhere between $102 million and $138 million.

Highly questionable failures

You don't have to be a billion-dollar company to feel the devastating effects of a data center outage. A recent Cost of Data Center Outages study found that the average data center outage costs about $740,000, a 38 percent increase from 2010. The study also found that power system failure is the top cause of data center outages – something Delta and BA learned the hard way.

BA's lack of any effective built-in redundancy has analysts scratching their heads, considering the news column inches dedicated to Delta's near-identical failure the previous year.

“It seems highly questionable why similar incidents with major US carriers in the last year have failed to see IAG move to ensure its airlines had plans in place to mitigate this risk… and also to have contingency plans in place,” wondered Damian Brewer, an analyst at RBC Capital Markets.

“It appears that BA management have seemingly not taken account of IT risk precedent already seen and already known at other carriers.”

Here's how BA could have acted differently (and saved up to $138 million in the process):

They could have built redundancy into the system. The error that caused the deadly power surge represented a single point of failure that could have been accounted for if BA had analyzed its entire end-to-end system to find it. BA's vulnerability in its uninterruptible power supply would not have been fatal by itself if redundant structures had been factored into the design.

Why wasn't BA able to fail over between redundant data centers when the surge killed the servers? Was it a lack of redundancy, or a lack of staff training for exactly this situation, that caused the problem to escalate? Or was it something akin to the Southwest outage of 2016, which that airline's CEO attributed to faulty redundancy: "We have redundant systems that should have kicked in place, and they didn't"?
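To make that failover question concrete, here's a minimal sketch of the kind of automated health check that decides when traffic should move to a standby site. This is illustrative only: the hostnames, port, and promote_secondary() hook are hypothetical placeholders, not a description of BA's actual systems.

```python
# Hypothetical sketch: watch a primary data center endpoint and trigger a
# failover action after repeated failures. Names and thresholds are made up.
import socket
import time

PRIMARY = "dc-primary.example.internal"      # hypothetical primary site endpoint
SECONDARY = "dc-secondary.example.internal"  # hypothetical standby site
CHECK_PORT = 443
FAILURES_BEFORE_FAILOVER = 3


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def promote_secondary() -> None:
    # Placeholder: in practice this step would update DNS, a load balancer,
    # or a traffic manager to point clients at the standby site.
    print(f"Failing over to {SECONDARY}")


def monitor() -> None:
    failures = 0
    while True:
        if is_reachable(PRIMARY, CHECK_PORT):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_FAILOVER:
                promote_secondary()
                break
        time.sleep(10)  # poll every 10 seconds


if __name__ == "__main__":
    monitor()
```

In a real deployment the promotion step would typically update DNS records or a load balancer, and the watchdog itself would be made redundant too, so that the monitoring doesn't become a new single point of failure.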

They could have insulated their data center from budget cuts. BA has long been under pressure from escalating costs, and many critics point to the airline's budget cuts as one of the main causes of the crisis.

“Reducing the redundancy of builds is one of the first places they look and when they do that, they put themselves at risk,” Uptime Institute president Lee Kirby told Computer Weekly. “It’s really [down to] management decisions early on not to prop up the infrastructure, not to put up the training programmes to run 24/7.”

They could have backed up offsite. Both stricken data centers were located at a single Heathrow site; the story might have been very different had BA been able to fail over to an offsite backup. (For example, All Covered's clients, with smaller IT budgets than BA's, can count on our world-class SSAE-16 / SAS 70 Type II compliant data centers located throughout the U.S., ensuring their cloud IT infrastructure runs optimally at all times. Not to toot our own horn.)
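For illustration only, here's a minimal sketch of what a scheduled offsite backup job might look like. The paths and remote host are hypothetical placeholders, and a production setup would add encryption, retention policies, and regular restore testing.

```python
# Hypothetical sketch: mirror a data directory to an offsite host over SSH
# using rsync, and report success or failure. All names are placeholders.
import subprocess
from datetime import datetime, timezone

SOURCE_DIR = "/var/lib/app-data/"                        # hypothetical data to protect
OFFSITE_TARGET = "backup@offsite.example.com:/backups/"  # hypothetical remote site


def run_offsite_backup() -> None:
    """Mirror the source directory to an offsite host and log the outcome."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    result = subprocess.run(
        ["rsync", "-az", "--delete", SOURCE_DIR, OFFSITE_TARGET],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        print(f"{stamp} offsite backup completed")
    else:
        # Surface the failure so monitoring (or a human) can react before disaster strikes.
        print(f"{stamp} offsite backup FAILED: {result.stderr.strip()}")


if __name__ == "__main__":
    run_offsite_backup()
```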

These are all “could haves” that even BA might not have had the wherewithal to implement: small to medium businesses have a little more wiggle room to seek a third-party evaluation of their potential weak spots, something a behemoth like BA might not easily do.

Smaller businesses can call on trusted service providers like All Covered to implement coherent, comprehensive oversight of their IT systems, oversight that never sleeps. Our team of experts works in shifts 24x7 to manage and support our clients' networks, servers, and PCs, and our 24x7 Remote Monitoring (RMON) team and Security Operations Center (SOC) monitor servers and key network elements and respond to threats and issues in real time.

Your business may never be as big as BA's, but you'll never fall into a $138 million hole due to a busted data center – not if All Covered can help it. To get started with expert, tailored advice, contact our IT experts at All Covered at 866-446-1133.