Braking on ice — handling slippery software delivery issues

I grew up in Colorado, where the roads get dangerously icy during the winter. When you hit an icy patch and your car starts losing control, your immediate instinctive reaction is to slam on the brakes. But it turns out that is exactly the opposite of what you should do. Your tires have nothing to grip; the brakes lock the wheels and just help the car spin out of control.

When it comes to engineering practices, our instincts are very often similarly misguided.

At one of my previous jobs we released every two weeks. As we grew, the releases got more complex, with more changes packed into each release. One day our QA manager went to our VP of Engineering and said that we were finding and fixing bugs right up until we shipped, and it made him very nervous.

The VP of Engineering announced that in order to give QA more time, we were going to start releasing monthly so that we had more time to stabilize things.

It turns out that’s exactly the wrong thing to do. If you don’t have enough time to test your releases, you need to release more frequently, not less. Our releases were starting to spin out of control, and he decided to slam on the brakes.

Addressing testing and quality issues by releasing more frequently seems so counter-intuitive. You would think it’s important to slow down if things are moving too fast. But when you think about it further, it actually makes a lot of sense:

  • Releasing more frequently means you have fewer changes in each release. Each change has a chance to interact with the other changes in unexpected ways, and the number of possible combinations of changes grows exponentially with the number of changes.
  • When you have fewer changes in each release, it is much easier to identify the root cause of a problem when it occurs, so it is quicker and easier to fix.
  • Releasing more frequently reduces the pressure to get a change out for a particular release, as the change will come out sooner in the next release. This reduces shortcuts and hacks performed to shove a feature into the next release. So you end up with higher quality code, reducing the testing cycle overhead.
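To put rough numbers on the first point, here is a minimal sketch that counts how many ways the changes in one release could interact. The function names are made up for illustration, and the model is a simplifying assumption: it treats every pair or group of changes as a potential interaction, which overstates things for real codebases but shows the shape of the growth.

```python
from math import comb


def pairwise_interactions(changes: int) -> int:
    """Number of distinct pairs among the changes (n choose 2)."""
    return comb(changes, 2)


def possible_combinations(changes: int) -> int:
    """Number of groups of two or more changes (2^n - n - 1)."""
    return 2 ** changes - changes - 1


# Compare a small release with progressively larger ones.
for n in (5, 10, 20):
    print(f"{n} changes: {pairwise_interactions(n)} pairs, "
          f"{possible_combinations(n)} possible combinations")
```

Pairs alone grow with the square of the number of changes, and the count of larger combinations roughly doubles with every change added. Halving the size of a release cuts the space of things that can go wrong by far more than half.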

It turns out this is a common pattern. Our initial intuitions around improving software delivery are often not useful and can have a significant negative impact.

One counterproductive intuition I see frequently is the desire to exercise control when things aren’t going well. If there are too many bugs, we require approvals from either QA or management or both before a release. To avoid security outages, we require each release go through security approval. Tickets need to be tracked; time needs to be tracked. We hit the brakes. When I come into a new team and see what processes are in place, I can usually figure out what kind of bad thing happened in the past. Engineering processes are like scar tissue over a wound.

However, research shows that heavyweight change approvals reduce both speed and stability of software delivery. You slam on the brakes, you spin out of control even more. It seems crazy, but it’s true.

If you study scalable software architectures, the reasons for this become apparent.

Teams delivering software are very much like processors in a software system. As you add more teams, you want to get a proportional increase in the features delivered for the business. This is called “linear scaling” — you double the number of teams, you get twice as much output.

In 1967 Gene Amdahl presented a paper at the Spring Joint Computer Conference of the American Federation of Information Processing Societies (AFIPS). His analysis covered parallel systems, where multiple processors run at the same time. Just like with teams, ideally you want linear scaling as you add more processors.

His analysis showed how significantly synchronization impacts scaling (synchronization is when two processors have to “sync up” with each other before they can keep going).

Even if only five percent of the time is spent synchronizing, the maximum speedup you can ever get is 20x, no matter how many processors you add.
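The 20x ceiling falls straight out of the usual statement of Amdahl's law. The sketch below assumes that "time spent synchronizing" plays the role of the serial fraction in that formula; the function names are illustrative only.

```python
def amdahl_speedup(sync_fraction: float, processors: int) -> float:
    """Speedup over one processor when `sync_fraction` of the work
    cannot be parallelized (Amdahl's law)."""
    return 1.0 / (sync_fraction + (1.0 - sync_fraction) / processors)


def max_speedup(sync_fraction: float) -> float:
    """Upper bound on speedup as the processor count grows without limit."""
    return 1.0 / sync_fraction


# With 5% synchronization the speedup creeps toward 20x but never reaches it.
for n in (10, 100, 1000):
    print(f"{n} processors: {amdahl_speedup(0.05, n):.1f}x")
print(f"ceiling: {max_speedup(0.05):.0f}x")
```

Going from 100 to 1000 processors buys less than three extra multiples of speedup here, because the synchronized five percent dominates once everything else has been parallelized.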

Synchronization also impacts how quickly you can get that speedup. The more synchronization, the more processors it takes to get to a certain level of improved productivity.
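To see how steeply the cost rises, you can solve Amdahl's law for the number of processors needed to hit a target speedup. This is a sketch under the same assumption as before (synchronization time is treated as the serial fraction), with a hypothetical helper name. It uses exact fractions so that rounding does not nudge a result up by one.

```python
from fractions import Fraction
from math import ceil


def processors_needed(sync_percent: int, target_speedup: int) -> int:
    """Smallest processor count that reaches `target_speedup` when
    `sync_percent` percent of the time is spent synchronizing."""
    s = Fraction(sync_percent, 100)
    target = Fraction(target_speedup)
    if s > 0 and target >= 1 / s:
        raise ValueError("target is at or beyond the maximum possible speedup")
    # Rearranged from: target = 1 / (s + (1 - s) / n)
    return ceil((1 - s) / (1 / target - s))


# Processors required for a 10x speedup at different synchronization levels.
for pct in (1, 5, 9):
    print(f"{pct}% sync: {processors_needed(pct, 10)} processors")
# 1% sync: 11 processors
# 5% sync: 19 processors
# 9% sync: 91 processors
```

At ten percent synchronization a 10x speedup is out of reach entirely, which is why the function raises an error rather than returning a number.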

So now you can see why heavyweight change controls are to be avoided — they are essentially synchronization points. The more you have, the more things lock up, and you end up spinning out of control.

So what to do? Breathe in deeply, and as you breathe out, release that desire to implement more controls when things are starting to look problematic. There are other approaches you can take to improve quality and flow: eliminating queues and bottlenecks, automating, and applying DevOps and Lean principles.

For a good start, take a look at the book Accelerate, where Dr. Nicole Forsgren, Jez Humble and Gene Kim summarize four years of research identifying what specific practices improve both speed and stability of software delivery. Another great place to look, if you don’t mind a little math, is Don Reinertsen’s excellent book The Principles of Product Development Flow, where he digs into the math and science behind such key principles as small batches, minimizing work in progress, fast feedback and so on.

I’ll be sharing my specific insights and practical applications of these principles in future articles, so stay tuned!

Architect at eBay, but still learning who I really am