By Jim Grey
It was never a coordinated “cyber-attack,” as several news outlets speculated.
It was simple coincidence that several separate systems failed on the same day, last Wednesday, July 8: the trading system at the New York Stock Exchange, many systems at United Airlines, and the Web site of The Wall Street Journal.
Technology fails all the time. You just don’t usually recognize it. Have you ever noticed a page on a site loading unusually slowly? Or have you ever been unceremoniously logged out? I’m sure that as long as the screen finished loading, or you were able to log right back in, you shrugged it off and moved on. It might have been random Internet gremlins or lousy Wi-Fi. But it could also have been a failure in the service. Perhaps monitoring software noticed it and quietly performed a restart. Or maybe a few minutes of high drama unfolded in some technical operations center somewhere as technicians righted the situation.
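For the curious, here’s a rough sketch of what that quiet recovery can look like: a watchdog loop that polls a health check and restarts the service when it stops answering. The endpoint and restart command below are made up purely for illustration.

```python
# A minimal sketch of the kind of watchdog loop monitoring software might run.
# The health-check URL and restart command are hypothetical, for illustration only.
import subprocess
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/health"          # hypothetical health-check endpoint
RESTART_CMD = ["systemctl", "restart", "myservice"]  # hypothetical service name

def is_healthy(timeout_seconds=5):
    """Return True if the service answers its health check in time."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout_seconds) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False

while True:
    if not is_healthy():
        # Quietly restart the service; most users only ever notice a slow page load.
        subprocess.run(RESTART_CMD, check=False)
    time.sleep(30)  # poll every 30 seconds
```

When a loop like this does its job, all you see is that one sluggish page load. When it doesn’t, the humans in the operations center get paged.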
But why do such systems fail? Several reasons:
Legacy systems patched and updated for so many years that the code has become sclerotic. Big, old companies like United Airlines are bursting with old systems. I wouldn’t be surprised if some part of their reservation system involves a mainframe! Systems like these have been repaired and extended for years upon years, and by now none of the original programmers and technicians still work there. The code has become difficult to restabilize after any change. It’s prohibitively expensive to build a new system from scratch, and even if you could afford it, you’d just introduce a whole host of new problems anyway.
System integrations and data migrations gone wrong. Company A buys Company B. There’s a lot of overlap in the technologies they use, so they integrate them or migrate the data from one to the other. In any such project, a thousand edge cases lurk that, when triggered, can cause failure. Even a crack project team will miss some. There’s never time and money to find them all anyway. Missed edge cases are just ticking time bombs.
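To make that concrete, here’s a contrived sketch of one such time bomb: a migration routine that handles every record anyone thought to test, until a handful of very old records show up in a format nobody expected. The field names and formats are invented for the example.

```python
# A contrived example of the kind of edge case that lurks in a data migration.
# The field names and date formats are hypothetical.
from datetime import datetime

def migrate_customer(record_b):
    """Convert a Company B customer record to Company A's schema."""
    # Company B stores dates as MM/DD/YYYY; Company A expects ISO 8601.
    signup = datetime.strptime(record_b["signup_date"], "%m/%d/%Y")
    return {
        "name": record_b["full_name"],
        "signup_date": signup.date().isoformat(),
    }

# Works for every record anyone tested with...
migrate_customer({"full_name": "Pat Smith", "signup_date": "07/08/2015"})

# ...but a few ancient records store the date as "8-Jul-99", and that
# blows up in production months after the migration ships.
try:
    migrate_customer({"full_name": "Lee Jones", "signup_date": "8-Jul-99"})
except ValueError as error:
    print("Missed edge case:", error)
```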
Poor original engineering. Software engineering is still a nascent discipline, and we’re figuring out how best to do it. Every methodology has challenges and limitations. Smart engineers do the best they can to design a system that will work well, but are always limited by time and money. Sometimes revenue pressure leads engineers to favor fast over good. And even then, it’s very hard to imagine all the demands that will be placed on a system over time.
One of my past employers had a Web service that pumped customers’ backend-system data into our database. It was fast and reliable until we sold the product to a customer that wanted to blast in 10 years of historic data. We’d never done that before, nobody checked with the engineers first, and sure enough it made the Web service fall flat on its face. All of our customers experienced an outage.
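In hindsight, something as simple as the sketch below might have softened the blow: feed the historic data in small batches with a pause between them, instead of all at once. The function names and numbers here are hypothetical, not how our actual service worked.

```python
# A sketch of one way to keep a bulk historic import from overwhelming a
# web service: send the records in small batches with a pause between them.
# The send_batch function, batch size, and pause are hypothetical.
import time

def send_batch(batch):
    """Placeholder for the call that pushes one batch to the web service."""
    ...

def import_history(records, batch_size=500, pause_seconds=1.0):
    """Push records in batches instead of blasting them all in at once."""
    for start in range(0, len(records), batch_size):
        send_batch(records[start:start + batch_size])
        time.sleep(pause_seconds)  # give the service room to serve everyone else
```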
Good old-fashioned hardware failure. United blamed its July 8 outage on a failed router. Some years ago, squirrels brought down NASDAQ by chewing through some power lines. These things happen, and most companies hedge against it with redundant hardware. But even then, sometimes a failure gets through.
Imperfect failure planning. Almost every company has failure plans in place. Most of them use as much automated failure recovery as they can. But some situations evade even the best plans and the best automation.
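Here’s a rough sketch of the kind of automated recovery most failure plans lean on: retry with backoff, then fall back to something else. It helps, but the fallback is just more code that can fail in ways nobody planned for. The function names are placeholders.

```python
# A sketch of a common automated-recovery pattern: retry with exponential
# backoff, then fall back. The primary and fallback callables are placeholders.
import time

def call_with_recovery(primary, fallback, attempts=3, base_delay=0.5):
    """Try the primary operation a few times, then resort to the fallback."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:  # broad on purpose; a sketch, not production code
            time.sleep(base_delay * (2 ** attempt))  # wait longer each time
    # If the primary keeps failing, the plan says to use the fallback.
    # But the fallback can fail too, in ways nobody planned for.
    return fallback()
```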
Perfect technology is a myth. Occasional failure is certain.