Air Scare: Business Continuity Lessons from Delta's Systems Outage

September 27, 2016 by Alex Collins, IT Services Consultant

Big systems outages and business continuity failures tend to hog the headlines, and the Delta Air Lines downtime of August 2016 was no exception.

In the words of Delta COO Gil West, “a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power.” The failure took Delta's entire Atlanta command center down, kicking off a series of unfortunate events that ended with over 1,800 canceled flights, millions of man-hours of delays, and thousands of passengers sleeping on airport floors. Total cost to Delta: over $120 million in revenue.

After a loss this crippling, and knowing that such outages can strike any business, including yours, what lessons should cautious companies draw from Delta's downtime disaster?

Redundancy should be part of the plan. Many experts believe the outage was caused by a lack of redundancy in the system.

“Delta Airlines computer systems responsible for online check-in, kiosks, flight dispatching, crew scheduling, airport-departure information displays, ticket sales, frequent-flier programs and flight info displays are all located in a single datacenter located in Atlanta, Georgia,” surmises cloud and virtualization expert Marcel van den Berg. “Most likely for cost reasons Delta Airlines decided not to operate a twin datacenter concept.”

A Reddit commenter suggests that the cause was not a datacenter failure as such: an automated generator test set off a cascading series of problems that crippled both power sources, a statistically unlikely event that was simply no match for Murphy's Law.

“What I think we're seeing is a failure to learn from history and allowing single points of failure in the system,” explains Alan Woodward, computer science professor at the University of Surrey. “Whilst safety critical systems have failure modes analyzed, operational systems clearly are not undergoing the same degree of analysis. The result is not fatal, but nevertheless its impact can be enormous.”

There should always be a plan in place. Good business continuity practice relies on sound procedures that ensure the shortest route from downtime back to business as usual. The larger and more complex the organization, the harder this is to implement, which may explain why Delta struggled to return to normal operations in a reasonable time span.

Delta's existing business continuity procedures may not have taken the scale of the downtime into account. In desperation, some Delta employees resorted to writing out boarding passes by hand!

If Delta could be caught flat-footed, your business is even more likely to be caught out if it has no process in place. According to a recent Symantec study, 57 percent of small and mid-sized businesses lack a recovery plan to cope with data loss, power outages, or other IT-related disasters.

Be ready to pull in outside help. For many growing companies, an effective business continuity plan may be too complex to pull off entirely in-house.

Additional capacity to cope with a major systems outage may have to come from outside. IT managers who already have relationships with outside suppliers can draw on that capacity, and on the expertise of business continuity specialists, when it is needed.

Consider what you might gain from a pre-existing partnership with All Covered: you can build your business continuity plan from the ground up, ensuring minimal downtime even in Delta Airlines-level situations.

In worst-case scenarios, All Covered can virtualize your entire business remotely, providing access to your IT environment even when your own servers are physically destroyed.

Learn more about our Business Continuity Services or call us at 866-446-1133 to speak with an All Covered IT consultant in your area.