It was never a coordinated “cyber-attack,” as several news outlets speculated.
It was simple coincidence that several separate systems failed on the same day, last Wednesday, July 8: the trading system at the New York Stock Exchange, many systems at United Airlines, and the Web site of The Wall Street Journal.
Technology fails all the time. You just don’t usually recognize it. Have you ever noticed a page on a site loading unusually slowly? Or have you ever been unceremoniously logged out? I’m sure that as long as the screen finished loading, or you were able to log right back in, you shrugged it off and moved on. It might have been random Internet gremlins or lousy Wi-Fi. But it could also have been a failure in the service. Perhaps monitoring software noticed it and quietly performed a restart. Or maybe a few minutes of high drama unfolded in some technical operations center somewhere as technicians righted the situation.
But why do such systems fail? Several reasons:
Legacy systems patched and updated for so many years that the code has become sclerotic. Big, old companies like United Airlines are bursting with old systems. I wouldn’t be surprised if some part of their reservation system involves a mainframe! Systems like these have been repaired and extended for years upon years, and by now none of the original programmers and technicians still work there. The code has become difficult to restabilize after any change. It’s prohibitively expensive to build a new system from scratch, and even if you could afford it, you’d just introduce a whole host of new problems anyway.
System integrations and data migrations gone wrong. Company A buys Company B. There’s a lot of overlap in the technologies they use, so they integrate them or migrate the data from one to the other. In any such project, a thousand edge cases lurk that, when triggered, can cause failure. Even the most crack project team will miss some. There’s never time and money to find them all anyway. Missed edge cases are just ticking time bombs.
Poor original engineering. Because software engineering is still a nascent discipline, we’re still figuring out how best to do it. Every methodology has challenges and limitations. Smart engineers do the best they can to design a system that will work well, but are always limited by time and money. Sometimes revenue pressure leads engineers to favor fast over good. And even then, it’s very hard to imagine all the demands that will be placed on a system over time.
One of my past employers had a Web service that pumped customers’ backend-system data into our database. It was fast and reliable until we sold the product to a customer that wanted to blast in 10 years of historical data. We’d never done that before, nobody checked with the engineers first, and sure enough it made the Web service fall right onto its face. All of our customers experienced an outage.
Good old-fashioned hardware failure. United blamed its July 8 outage on a failed router. Some years ago, squirrels brought down NASDAQ by chewing through some power lines. These things happen, and most companies hedge against them with redundant hardware. But even then, sometimes a failure gets through.
Imperfect failure planning. Almost every company has failure plans in place. Most of them use as much automated failure recovery as they can. But there are just situations that evade even the best plans and the best automation.
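To make redundancy and automated recovery a little more concrete, here is a minimal sketch in Python. It is not how United or anyone else named above actually does it; the function names and the two pretend routers are hypothetical, invented purely for illustration. The idea is simply to try each redundant backend in turn and give up only when every one of them has failed.

```python
from typing import Callable, List, Sequence


class AllBackendsFailed(Exception):
    """Raised when every redundant backend has failed."""


def call_with_failover(backends: Sequence[Callable[[], str]]) -> str:
    """Try each redundant backend in order and return the first success."""
    errors: List[Exception] = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    # Nothing worked: this is the failure that "gets through."
    raise AllBackendsFailed(f"{len(errors)} backend(s) failed: {errors}")


# Hypothetical stand-ins for a primary and a standby piece of hardware.
def failed_router() -> str:
    raise ConnectionError("primary router is down")


def standby_router() -> str:
    return "routed via standby"


result = call_with_failover([failed_router, standby_router])
print(result)
```

Notice that the sketch still has a failure path: if the standby is down too, the caller gets an exception, and a human has to step in.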
Perfect technology is a myth. Occasional failure is certain.
When I started my personal blog, which is largely about film photography using vintage cameras, I found a great use for my languishing Flickr account: hosting most of the photos for my blog. Flickr has been a great tool for sharing my photography everywhere on the Internet.
The other day, I uploaded my 10,000th photo to Flickr. That’s a lot of photos! It’s so many that finding one particular photo on my computer is nigh onto impossible. From the beginning, I should have used the photo organizer that came with my copy of Photoshop Elements. But I’ve let too much water pass under the bridge: years and years of photos remain unindexed in folders on my hard drive. It would be a big, unpleasant job to organize them now.
It turns out that the easiest way for me to find one of my photographs is to search for it on Flickr. I’ve left enough bread crumbs in the titles, descriptions, and tags that with a few words in Flickr’s search box I can find anything I’ve uploaded.
It also turns out that I was inadvertently leading the way. Flickr recently made some changes to the site that make it easier than ever to store all of your photos and find any of them in an instant. I think these smart improvements reposition Flickr well in the new world of photo storage and sharing, and give it a solid chance at remaining relevant and vital.
And it’s not a moment too soon. Flickr had been geared toward people interested in photography who wanted to share and talk about their work. Many users appeared to carefully curate their photostreams, sharing only their best photos. It remained wonderful for this purpose. But in the meantime not only have digital cameras almost entirely supplanted film cameras, but camera phones have also largely supplanted dedicated digital cameras. People were taking pictures on their phones just so they could share them on Facebook and Instagram — and Flickr was getting none of that action. It was falling behind.
Flickr finally awoke from its slumber in 2013 with a new, more modern user interface, plus one terabyte of storage (upwards of a half million photos) for anyone, for free. Flickr’s mission had shifted: please do dump all of your photos here. And then last month Flickr rolled out yet another new user interface, and added several powerful new features meant to make the site the only photo storage and sharing site you’ll ever need:
Automatic photo uploading. Flickr can now automatically upload every photo from your computer and your phone — every past photo and every new photo you take. Flickr marks them all as private, so only you can see them, until you choose to make them public. To enable this, you have to download the new Flickr app to your phone and download a new “Uploadr” application for your computer. But after you do, you may never again lose a photograph to a crashed hard drive or to a lost or stolen phone. And if you do have such a mishap, Flickr now lets you download any or all of your photos en masse.
Image recognition and automatic tagging. Flickr now uses image-recognition technology to guess what’s in each of your photos, and adds descriptive tags to them. You’ve always been able to tag your photos manually; those tags appear with a gray background. Flickr’s automatic tags have a white background. These tags make photos easier to find in search. It’s not perfect — a photo I took of a construction site was mistakenly tagged with “seaside” and “shore.” But it works remarkably well overall, and Flickr promises that they will keep improving the technology.
Camera roll and Magic View. Flickr has introduced an iOS-style camera roll as the main way you interact with your own photos now. Flickr has been criticized for stealing this concept from Apple. But they’ve gone Apple one better by adding Magic View, which organizes photos by their tags, including the automatically generated ones. It gives you astonishing views into your photos, grouping them smartly. Finally, all of my bridge photos are in one place, and I didn’t have to lift a finger!
Improved searchability. All these new tags make Flickr even more searchable. You can find any of your photos in seconds on Flickr.
All of this makes Flickr a compelling place to store all of your photographs and easily find them. They’re stored on Yahoo! servers and are always backed up. With a couple of clicks or taps, you can share them from there to most of the popular social media sites, including Facebook, Instagram (but only on your phone), and Twitter.
The best thing: You can still use Flickr for everything you could before. You can share your best photographs and have conversations about them. You can explore the beautiful photographs others have taken. You can geotag your photos and save them to albums and groups. And if you want nothing to do with Flickr’s new features, you can just ignore them.
I’m astonished by how well Flickr has shifted to its new mission without leaving legacy users behind. As someone who has made software for more than a quarter century, I can tell you: it is enormously difficult to do this.
Still, many of Flickr’s longtime users feel alienated. They’re expressing far less paint-peeling rage than they did after the 2013 changes, thank goodness, but they’re still quite upset. The leading complaint: there’s no way to opt out of automatic tagging, and no way to delete at once all the tags already generated. Longtime users who have carefully chosen their tags find Flickr’s automatic tags to be an unwelcome intrusion.
Flickr should probably address that. But first, they should congratulate themselves. They’ve done journeyman work.
… A slightly revised version of this is cross-posted today to my personal blog, Down the Road.
I was not surprised when I heard that the Obamacare Web site, healthcare.gov, crashed and burned right out of the gate.
But I was disappointed. Regardless of what I think of the Affordable Care Act, it’s the law. I wanted its implementation, including healthcare.gov, to go well.
Still, I wasn’t surprised because I know how government software gets made.
Several years ago I worked in middle management for a company that built a government Web application related to health-care customer service. I was in charge of testing it to make sure it worked. It is probably not going out on a limb to say that the people who built healthcare.gov experienced many of the same kinds of things I experienced on that project.
Let me be plain up front: I was a poor fit for government software development. I was too free-wheeling and entrepreneurial for the control-and-compliance environment that government contracting encourages. I find it difficult to write about the experience without showing my frustrations with its realities. But I think I understand those realities well and objectively.
The government doesn’t know how to do anything. They hire it all out, and then they manage and administer the process. As a result, on this project they relied heavily on compliance with “best practices,” as if those practices contained some sort of magic that delivered quality software. They don’t, of course; the government was shocked when Version 1.0 of our software had typical quality problems right out of the gate. Those practices served primarily to leave an audit trail the government could follow.
In the end, the project was a success. Despite Version 1.0’s glitches, which we quickly fixed, the software was immediately put to use and led to productivity improvements over an older, green-screen system. I spoke with many of the software’s users, and despite a few grumbles most of them liked using it.
But this was one mighty expensive piece of software to build, from winning the contract to defining what the software should do to building and maintaining the software. Here’s why.
The bid process
I was hired after we won the contract, but I heard stories about the bid process. We had no experience building software on this scale, but we wanted into the lucrative cost-plus government contracting business for its guaranteed profit margins. So we offered a lowball bid aimed at getting the government’s attention, not at what it actually was going to take to build the software. And then to our surprise we won the business. After the elation wore off, we were left with an “oh shit” feeling – we needed to actually build the software for that amount. How the heck would we pull that off?
We finished Version 1.0 on my watch, but I don’t know whether we delivered it within budget. It seemed to me, however, that the bid process encouraged underbidding and overspending.
The requirements process
When you make something for the government, they want to know exactly what they’re getting, in excruciating detail. So we started by writing the biggest, thickest requirements document I’ve ever seen. We weren’t building this software from scratch – we bought what was then the leading customer-relationship-management software platform and used its software-development toolkit to heavily customize it for our needs. But we had to write highly detailed specifications anyway.
Requirements gathering was more about navigating choppy political waters and brokering compromise than about specifying usable, stable, and scalable software. To develop the requirements, we flew in representatives from every company that would use the software and put them into a big room to hash out how it would work. But all of these companies were themselves government contractors. Their people all knew each other – and, frequently, competed against each other for contracts. Some of them had competed against us to win this contract. The room was thick with mistrust and agenda.
The building process
The government lives in constant fear of being screwed by its contractors. It goes back to Abraham Lincoln’s time, when rampant fraud among suppliers threatened the Civil War effort. (Seriously. Gunpowder cut with sawdust. Uniforms that dissolved in the rain. Read about it here.)
So not only did the government hire us to build the software, they hired another firm to watch us do it. This is called independent verification and validation, or IV&V. Their job was to make sure that we followed software-development “best practices” and that we built what we said we were going to build. Making matters worse, the company that won the IV&V contract, I’m told, had also bid on the project to build the software in the first place. It always seemed clear to me that they wanted to show us to be fools so that they could take over the project. They ran us ragged over every last minor detail.
The level of perfectionism in terms of “best practice” adherence was intense. Yet when we delivered the software, it had several usability challenges and outright bugs. Worse, it struggled to keep up with the load users placed on it. If you’ve ever built software, you know that these are typical challenges with Version 1.0 of anything. But the government was shocked, dismayed, and appalled. We spent the next several months issuing update releases to make it perform as it needed to. Of course, IV&V ran roughshod over us the whole way – but they were in the hot seat too because their “best practices” had failed to prevent these problems.
The process overhead
Process is tricky to apply well. Too little leads to chaos; too much adds needless cost and delay. I’m not anti-process – rather, I’ve built a career on bringing just the right level of process into a software development environment to make it effective. But most of the process we had to follow involved documenting our work to prove to the government that we had actually done it. This frequently hindered our ability to deliver software cost-effectively, and sometimes stood in the way of quality.
We bought a well-known software product that stored requirements and linked them to the code and the test cases so we could prove that we built and tested each requirement. This involved tracing every requirement to every line of code and every test case, an enormous task in and of itself. I personally created a traceability report each quarter and sent it to the government. All of this required a lot of work from skilled technical people, but in my judgment did not materially help us better build or test the software.
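For readers who have never seen requirements traceability, here is a minimal sketch of the idea in Python. It is not the commercial product we used, and every name in it (the `Requirement` class, the `untraced` function, the IDs and file names) is hypothetical. It only shows the shape of the bookkeeping: each requirement is linked to the code and test cases that cover it, and the report flags requirements that are missing a link.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Requirement:
    """One requirement and the artifacts traced to it (all IDs are made up)."""
    req_id: str
    code_units: Set[str] = field(default_factory=set)
    test_cases: Set[str] = field(default_factory=set)


def untraced(requirements: List[Requirement]) -> Dict[str, List[str]]:
    """Map each requirement ID to the kinds of links it is missing."""
    gaps: Dict[str, List[str]] = {}
    for req in requirements:
        missing = []
        if not req.code_units:
            missing.append("code")
        if not req.test_cases:
            missing.append("test")
        if missing:
            gaps[req.req_id] = missing
    return gaps


requirements = [
    Requirement("REQ-001", {"login.py"}, {"TC-001", "TC-002"}),
    Requirement("REQ-002", {"search.py"}),  # built, but never tested
    Requirement("REQ-003"),                 # neither built nor tested
]
report = untraced(requirements)
print(report)
```

Even in this toy form you can see where the labor goes: somebody has to keep every one of those links current as the code and the tests change.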
Our test cases were contractually required to be documented in such detail that a trained monkey could execute them. They were at the level of “Step 1. Type your username into the Login box. Expected result: Your username appears in the Login box. Step 2. Type your password into the Password box. Expected result: A row of asterisks appears in the Password box.” A test case that took fifteen minutes to execute could have taken two hours to write and could have been a dozen pages long. We had hundreds of test cases. Many test cases were not appropriate to be added to the regression test suite and be executed every release, so we spent a lot of time writing them to execute them a small handful of times.
It was supposed to be against the rules to write a bug report that had no associated test case. Testers would often stumble upon a bug by accident or find one while doing ad-hoc testing – and then find themselves in a conundrum. Writing the test case that led to the bug and tracing it back to requirements took time we frequently lacked at that point in the game. When the bug was serious enough, everybody looked the other way when it wasn’t associated with a test case. I wonder whether any of the testers avoided writing test cases by falsely associating the bug with an existing test case.
We did get one big break. We lobbied for, and to our astonishment successfully won, an exception to a standard practice: we did not have to print screen shots of the results of every test step. Other projects for which we had contracts had to do this. As you can imagine, managing all that paper slowed progress considerably. Those projects collected those screen shots into boxes, which were sent to offsite storage.
The mounting costs
All of these process steps meant spending more money, mostly in the form of human effort. There were other ways in which the government’s way of making software added costs to the project. Here’s a short, incomplete list:
Frequent, ongoing training about compliance with standards, which, amusingly, is where I learned about the Civil War fraud.
Entering time worked into three separate software systems – one for project management, one for government accounting, and one my employer used to manage time off. I spent an hour a week entering time.
A prohibition on open-source software. The government wanted all software used to be “supported,” meaning that there had to be a phone number to call for help. So we spent money on commercial tools that sometimes weren’t as capable as open-source versions. In a couple cases, the only tool or component available for a task was open source, and we couldn’t build the application without it. We did get the government to bend the rule for us in those cases, but it took heavily documented justifications and layers of approvals to make it happen.
Strict separation of duties to protect the government against sabotage by a rogue contract employee. This meant, for example, that I couldn’t restart the computers we used for testing when they needed it. I knew how to do it, but I was not allowed. I had to write a request for an infrastructure engineer to do it, and then wait, sometimes for days, for it to reach the top of his priority list.
As you can see, there was nothing easy or inexpensive about this project. Yet we got it done and the software worked. It’s still in use today. We showed that it’s possible – just slow and expensive – to build software the government’s way.
So I have great empathy for those who built healthcare.gov. No doubt about it: the site failed, and they built it. They must feel tremendous pressure right now as they scramble both to handle the heat they’re getting from the government and to rush fixes to the site so that it works well enough. But if their experience building that site was anything like my experience building government software, it’s hardly shocking that it launched with challenges.
I’ve said it to my test teams many times: Making software isn’t quite engineering. Building a bridge – now that’s engineering. You determine how long the bridge needs to be, how much load it needs to carry, and what kind of bridge to build (steel truss, concrete arch, etc.), and from there it’s mostly mathematics and physics. Just run the calculations and you’re good.
We have bridge-building down. With a couple of notable exceptions, such as the Tacoma Narrows Bridge, which heaved and twisted and finally collapsed (video here), new bridges seldom fail. Old bridges fail sometimes, but it’s reliably due to accident or neglect.
My apologies to any civil engineers who stumble upon this post. I’m sure you’re cringing that I’m overlooking many subtleties of your discipline.
There’s nothing subtle, however, about how often software fails. Our users aren’t happy about it, but they aren’t surprised by it, either.
For anything you ask a software developer to build, there will be a whole bunch of valid ways to do it, each with its own unique ways of creating failures. This is especially true when that developer enhances existing software that he or she didn’t make in the first place. It’s tough to predict exactly how the enhancements will affect the rest of the software. The more lines of legacy code, the more time and analysis it takes to think that through.
If a developer had unlimited time and money, it might be possible to deliver perfect software. Ah, a developer can dream! But here’s where bridge-building and making software have an important thing in common: time and money are never unlimited.
I sympathize with the folks who call software a craft. People who make software use tools and knowledge in its design and construction. These are hallmarks of craft.
Another way that software is like craft is that it’s difficult to fully separate the design from the making. Even when one person designs the software and another writes the code, the coder has to make a bunch of lower-level design decisions along the way.
The software craftsmanship movement meets corporate resistance because revenue and profit ride on what we build. Our companies need to sell features to meet revenue projections, or deliver bug fixes to retain customers. That’s why timed delivery is so important: if you wait too long to deliver, the opportunity to grow or retain revenue begins to shrink.
Feeling pressure to deliver, yet knowing that if we deliver junk we’ll be in an even worse pickle, we tend to manage software-development projects like engineering projects. I think we feel like we have better control when we manage them that way. But that feeling of control can’t mask it: no matter how tightly you plan a software project, no matter how you shape your development and delivery processes to mitigate risk, no matter how much you try to predict the troubles you’ll encounter, you will discover things along the way that can seriously derail those plans. It happens in two-week scrum sprints just as it does in ten-month waterfall projects. Discovery is simply endemic to software development.
As a software project manager, I try to build in buffers for the unknown. I also steer projects daily based on what we discover, adjusting plans and communicating impacts to whoever needs to know. I try to make sure our development practices deliver the best possible code to test, and then I try to arrange testing to find the worst bugs first so that near the hoped-for end, only minor bugs remain. Despite all that, important bugs still sometimes reach the user.
We ship when the software is good enough. What “good enough” means varies from context to context, but it is unfailingly short of perfect. Shipping at good enough means you succeeded.
If I delivered bridges that way, I’d never drive over one I built.
I was shocked when I logged into Flickr last week and found an entirely new interface.
My shock turned to disappointment and sadness when I saw that some of my contacts were super angry about the change, had left strongly worded comments on their photostreams, and had immediately moved their photos to other services.
I make software products for a living; I’ve seen firsthand how interface changes can alienate users. They become comfortable with a product’s features and usage, even when they’re flawed. They don’t want to learn anything new (which often masks a fear that they can’t learn something new).
At the same time, Flickr (and Facebook and any other thing you do on the Web) is a product, built by a company that is trying to make money in an ever-changing landscape.
I’ve seen it often, and it’s happened at companies where I’ve worked: A company builds a good product that takes off. Success causes the company to grow or to be sold to a larger company. And then some scrappy startup company builds a product in an overlapping market that becomes a new darling. By then, the big company is so invested in what it’s always done that it struggles to adapt to the shifting market.
From where I sit, it looks like all of this happened to Flickr. Founded in 2004, Flickr quickly became arguably the king of the hill among photo-sharing sites. Web giant Yahoo! quickly noticed and, in 2005, bought the fledgling company. Success!
But consider all that’s happened in photography and on the Web since 2005. Back then, most people had only just discarded their film cameras for digital cameras. Soon cameras in phones became good enough for casual, everyday use; many of them are now very good. Users found it easy to share their photos across any number of the social networks that had emerged: primarily Facebook, which was also founded in 2004, but also upstart Instagram. Today, the three cameras that take the most photos uploaded to Flickr are all iPhones.
The market has shifted. It was a matter of time before Flickr either responded or became a niche product of ever decreasing importance. This new interface is its bid to stay relevant. I’m impressed with Yahoo! for moving Flickr so boldly.
I think that if people give the new interface a chance, it will work for most of them. I’ve heard complaints about slowness; I advise patience as Yahoo! would be foolish not to address legitimate performance problems. I’ve heard complaints about how crowded the interface feels; I’m also sure Yahoo! will tweak the new interface over time for better usability.
Another source of uproar is that advertising now adorns Flickr pages. I hate Web ads too, but really, they are the major way many Web products make money.
I sympathize a little with one complaint: all of us who bought Flickr Pro accounts for unlimited photo uploads now feel kind of let down, given that everybody gets a terabyte of storage now. That much storage might as well be unlimited; you could upload one photo a day for the rest of your life and never run out of space. But Flickr is letting us cancel our Pro accounts with a pro-rated refund, or keep Pro at its rate of $25 per year and never see an ad. Anybody who doesn’t have Pro already will have to pay $50 per year for that same privilege. I think this is a reasonable trade.
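The one-photo-a-day claim is easy to check with back-of-the-envelope arithmetic. Here is a minimal sketch in Python. The 5 MB average photo size and the 80-year span are my assumptions, not Flickr’s numbers, and I’m using the decimal definition of a terabyte.

```python
TERABYTE = 10**12        # bytes, decimal definition (assumption)
PHOTO_BYTES = 5 * 10**6  # assumed average photo size: 5 MB
YEARS = 80               # assumed span for "the rest of your life"

# How many photos of that size fit in one terabyte.
photos_that_fit = TERABYTE // PHOTO_BYTES

# One photo a day, ignoring leap days.
photos_uploaded = 365 * YEARS
bytes_used = photos_uploaded * PHOTO_BYTES

print(f"Photos that fit in 1 TB: {photos_that_fit:,}")
print(f"Photos uploaded over {YEARS} years: {photos_uploaded:,}")
print(f"Share of the terabyte used: {bytes_used / TERABYTE:.1%}")
```

Under those assumptions, a lifetime of daily uploads uses only a small fraction of the space. Bigger files or a heavier upload habit change the numbers, but it takes a lot to change the conclusion.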
Flickr’s real mistake might be in underestimating how attached its users were to the old interface. But if my experience is any indication, perhaps that mistake won’t be fatal. About five percent of my contacts have moved to other services. I’ll miss seeing their photos. I wonder if they’ll soon miss the rest of the Flickr community.