It’s remarkable that the “Great Glitch of July 8” doesn’t happen more often

By Jim Grey

It was never a coordinated “cyber-attack,” as several news outlets speculated.

It was simple coincidence that several separate systems failed on the same day, last Wednesday, July 8: the trading system at the New York Stock Exchange, many systems at United Airlines, and the Web site of The Wall Street Journal.

Technology fails all the time. You just don’t usually recognize it. Have you ever noticed a page on a site loading unusually slowly? Or have you ever been unceremoniously logged out? I’m sure that as long as the screen finished loading, or you were able to log right back in, you shrugged it off and moved on. It might have been random Internet gremlins or lousy Wi-Fi. But it could also have been a failure in the service. Perhaps monitoring software noticed it and quietly performed a restart. Or maybe a few minutes of high drama unfolded in some technical operations center somewhere as technicians righted the situation.
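
That quiet restart is less magic than it sounds. Here’s a minimal sketch of the kind of watchdog I mean, in Python. The health URL, the thresholds, and the restart command are all invented for illustration; real shops use purpose-built monitoring, but the shape is the same.

    import subprocess
    import time
    import urllib.request

    SERVICE_URL = "http://localhost:8080/health"  # hypothetical health endpoint
    CHECK_INTERVAL = 10  # seconds between checks
    FAILURE_LIMIT = 3    # consecutive failures before we restart

    def healthy() -> bool:
        try:
            with urllib.request.urlopen(SERVICE_URL, timeout=5) as resp:
                return resp.status == 200
        except OSError:  # connection refused, timed out, HTTP error
            return False

    failures = 0
    while True:
        failures = 0 if healthy() else failures + 1
        if failures >= FAILURE_LIMIT:
            # The quiet restart: users see one slow page, not an outage.
            subprocess.run(["systemctl", "restart", "myservice"])
            failures = 0
        time.sleep(CHECK_INTERVAL)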

But why do such systems fail? Several reasons:

Legacy systems patched and updated for so many years that the code has become sclerotic. Big, old companies like United Airlines are bursting with old systems. I wouldn’t be surprised if some part of their reservation system involves a mainframe! Systems like these have been repaired and extended for years upon years, and by now none of the original programmers and technicians still work there. The code has become difficult to restabilize after any change. It’s prohibitively expensive to build a new system from scratch, and even if you could afford it, you’d just introduce a whole host of new problems anyway.
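
If you’ve never watched code go sclerotic, here’s a contrived Python sketch of the shape it takes. Every fare code, date, and comment below is invented, but anyone who has maintained a fifteen-year-old system will recognize the pattern: every branch is a patch somebody is now afraid to remove.

    def fare_class(code: str) -> str:
        # All fare codes and history here are made up for illustration.
        if code == "Y":
            return "economy"
        if code == "YN":          # added for a 2003 partner feed
            return "economy"
        if code == "BX":          # purpose forgotten; changing this broke
            return "business"     # billing once, so now it stays forever
        if code.startswith("B"):  # later workaround layered on the BX rule
            return "economy"
        return "unknown"          # the original author's catch-all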

System integrations and data migrations gone wrong. Company A buys Company B. There’s a lot of overlap in the technologies they use, so they integrate them or migrate the data from one to the other. In any such project, a thousand edge cases lurk that, when triggered, can cause failure. Even the most crack project team will miss some. There’s never time and money to find them all anyway. Missed edge cases are just ticking time bombs.
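
Here’s a contrived Python example of the kind of edge case I mean: a migration step that handles every record the project team thought to test. The field names and formats are invented.

    from datetime import datetime

    def migrate_customer(row: dict) -> dict:
        """Map a Company B record onto Company A's schema (hypothetical)."""
        return {
            "name": row["full_name"].strip(),
            # Fine for "1999-04-21". But Company B's oldest records used
            # "04/21/99", and a handful have no birth date at all.
            "birth_date": datetime.strptime(row["birth_date"], "%Y-%m-%d"),
            # Assumes one phone number per customer; some legacy rows pack
            # in two, separated by a semicolon.
            "phone": row["phone"],
        }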

Poor original engineering. Because software engineering is still a nascent discipline, we’re still figuring out how best to do it. Every methodology has challenges and limitations. Smart engineers do the best they can to design a system that will work well, but are always limited by time and money. Sometimes revenue pressure leads engineers to favor fast over good. And even then, it’s very hard to imagine all the demands that will be placed on a system over time.

One of my past employers had a Web service that pumped customers’ backend-system data into our database. It was fast and reliable until we sold the product to a customer that wanted to blast in 10 years of historic data. We’d never done that before, nobody checked with the engineers first, and sure enough it made the Web service fall right onto its face. All of our customers experienced an outage.
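
In hindsight the fix was straightforward: stream the records in batches and refuse jobs above an agreed cap, instead of swallowing the whole payload at once. A Python sketch of that idea, with every name invented:

    from typing import Iterable, List

    MAX_RECORDS_PER_JOB = 100_000  # hypothetical cap, agreed with engineering

    def ingest(records: Iterable[dict], insert_many, batch_size: int = 1000) -> int:
        """Stream records into the database in batches; refuse oversized jobs."""
        batch: List[dict] = []
        total = 0
        for record in records:
            total += 1
            if total > MAX_RECORDS_PER_JOB:
                raise ValueError("Job too large; split historic loads into chunks")
            batch.append(record)
            if len(batch) >= batch_size:
                insert_many(batch)
                batch = []
        if batch:
            insert_many(batch)
        return total

    # Usage, with a list standing in for the real database client:
    stored: List[dict] = []
    ingest(({"id": i} for i in range(5000)), stored.extend)
    print(len(stored))  # 5000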

Good old-fashioned hardware failure. United blamed its July 8 outage on a failed router. Some years ago, squirrels brought down NASDAQ by chewing through some power lines. These things happen, and most companies hedge against it with redundant hardware. But even then, sometimes a failure gets through.
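
At its simplest, that hedge looks like the Python sketch below: try the primary, fall back to the standby. The URLs are made up, and in practice this logic usually lives in load balancers and routing gear rather than application code. The point is the last line, which is how a failure still gets through: sometimes every redundant path is down at once.

    import urllib.request

    ENDPOINTS = [
        "http://primary.example.com/quote",  # the normal path
        "http://standby.example.com/quote",  # the redundant hardware
    ]

    def fetch_quote() -> bytes:
        last_error = None
        for url in ENDPOINTS:
            try:
                with urllib.request.urlopen(url, timeout=2) as resp:
                    return resp.read()
            except OSError as err:
                last_error = err  # this endpoint is down; try the next one
        raise RuntimeError("all endpoints failed") from last_error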

Imperfect failure planning. Almost every company has failure plans in place. Most of them use as much automated failure recovery as they can. But some situations evade even the best plans and the best automation.

Perfect technology is a myth. Occasional failure is certain.


“Software engineering” might be an oxymoron

By Jim Grey

I’ve said it to my test teams many times: Making software isn’t quite engineering. Building a bridge – now that’s engineering. You determine how long the bridge needs to be, how much load it needs to carry, and what kind of bridge to build (steel truss, concrete arch, etc.), and from there it’s mostly mathematics and physics. Just run the calculations and you’re good.

We have bridge-building down. With a couple of notable exceptions, such as the Tacoma Narrows Bridge, which heaved and twisted and finally collapsed, new bridges seldom fail. Old bridges fail sometimes, but it’s usually due to accident or neglect.

[Photo: The S Bridge at Blaine]

My apologies to any civil engineers who stumble upon this post. I’m sure you’re cringing that I’m overlooking many subtleties of your discipline.

There’s nothing subtle, however, about how often software fails. Our users aren’t happy about it, but they aren’t surprised by it, either.

For anything you ask a software developer to build, there will be a whole bunch of valid ways to do it, each with its own unique ways of creating failures. This is especially true when that developer enhances existing software that he or she didn’t make in the first place. It’s tough to predict exactly how the enhancements will affect the rest of the software. The more lines of legacy code, the more time and analysis it takes to think that through.
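
A toy Python example of what I mean: two perfectly defensible ways to split a customer’s full name, each with its own distinct way of failing.

    def split_name_strict(full_name: str):
        # Fails loudly: raises ValueError on a one-word name like "Prince".
        first, last = full_name.split(" ", 1)
        return first, last

    def split_name_lenient(full_name: str):
        # Never raises, but quietly drops the middle of "Mary Jo Smith",
        # returning ("Mary", "Smith").
        parts = full_name.split()
        first = parts[0] if parts else ""
        last = parts[-1] if len(parts) > 1 else ""
        return first, last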

If a developer had unlimited time and money, it might be possible to deliver perfect software. Ah, a developer can dream! But here’s where bridge-building and making software have an important thing in common: time and money are never unlimited.

I sympathize with the folks who call software a craft. People who make software use tools and knowledge in its design and construction. These are hallmarks of craft.

Another way that software is like craft is that it’s difficult to fully separate the design from the making. Even when one person designs the software and another writes the code, the coder has to make a bunch of lower-level design decisions along the way.

The software craftsmanship movement meets corporate resistance because revenue and profit ride on what we build. Our companies need to sell features to meet revenue projections, or deliver bug fixes to retain customers. That’s why timely delivery is so important: if you wait too long to deliver, the opportunity to grow or retain revenue begins to shrink.

Feeling pressure to deliver, yet knowing that if we deliver junk we’ll be in an even worse pickle, we tend to manage software-development projects like engineering projects. I think we feel like we have better control when we manage them that way. But that feeling of control can’t mask it: no matter how tightly you plan a software project, no matter how you shape your development and delivery processes to mitigate risk, no matter how much you try to predict the troubles you’ll encounter, you will discover things along the way that can seriously derail those plans. It happens in two-week scrum sprints just as it does in ten-month waterfall projects. Discovery is simply endemic to software development.

As a software project manager, I try to build in buffers for the unknown. I also steer projects daily based on what we discover, adjusting plans and communicating impacts to whoever needs to know. I try to make sure our development practices deliver the best possible code to test, and then I try to arrange testing to find the worst bugs first so that near the hoped-for end, only minor bugs remain. Despite all that, important bugs still sometimes reach the user.

We ship when the software is good enough. What “good enough” means varies from context to context, but it is unfailingly short of perfect. Shipping at good enough means you succeeded.

If I delivered bridges that way, I’d never drive over one I built.