How jam makes you toast — the danger of being busy

You’re driving smoothly down the freeway, and suddenly, you’re in stop-and-go traffic. After half an hour, you creep by an accident on the other side of the freeway. Just from people slowing down to see what’s going on, your direction has jammed up. Intuitively, it makes sense that when a freeway is close to capacity, any tiny slowdown can have a huge impact.

With software, these jams are invisible except for their effects. At most companies, we can’t see the cars lined up. All we see is that not a lot of stuff is being shipped, and people don’t seem to be very busy. Since they’re waiting around, we give them more features to work on.

But this actually makes things dramatically worse, and the math behind queuing theory shows why.

Unlike manufacturing, units of work in software vary in their level of effort and the rate at which they arrive. Queuing theory models systems like this with random (Markovian) arrival and service times, and the resulting wait time follows a well-established mathematical formula. If you graph this formula, it looks like this:

Notice the “hockey stick” behavior of this curve. As you approach 100% utilization, the time a job sits waiting to be worked on grows without bound. Just like a freeway close to capacity, a software delivery system (like a scrum team) is quite unstable at high utilization, and small disruptions can cause massive wait times.
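To put rough numbers on that curve, here is a minimal sketch. It assumes the simplest textbook model, a single-server M/M/1 queue, which the article does not name explicitly; the function name and the choice to measure waits in "job-lengths" are illustrative only.

```python
def mm1_wait_in_queue(utilization: float, service_time: float = 1.0) -> float:
    """Mean time a job waits before work starts, assuming an M/M/1 queue.

    Standard result: W_q = service_time * rho / (1 - rho), where rho is
    the utilization (fraction of time the server is busy).
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time * utilization / (1.0 - utilization)


# With service_time = 1.0 the wait is expressed in average job-lengths.
for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    wait = mm1_wait_in_queue(rho)
    print(f"{rho:.0%} busy -> waits about {wait:.1f} job-lengths")
```

Under this model, going from 80% to 90% busy roughly doubles the wait, and going from 90% to 99% multiplies it by about eleven. Real teams are messier than a single-server queue, so treat the numbers as a shape, not a forecast.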

The other serious problem is that once you have a backup, it takes superhuman effort and lots of time to unwind it. Take a look at this experiment, where you toss a coin a thousand times, adding 1 for each head and subtracting 1 for each tail:

Notice how it slowly drifts away from zero. It turns out that once you drift like this, it is exceedingly hard to get back to zero. If you’re at 10 or -10, the chance of getting straight back to zero in the next ten tosses is about 1 in 1000 (it takes ten matching tosses in a row, which is 1 in 1,024), and a random walk can wander for a very long time before it returns. The same thing happens with backlogs: once they get long, they stay long without some kind of intervention.
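The coin experiment is easy to try yourself. The sketch below is one way to run it; the function name and the seed value are arbitrary choices for illustration, and the exact path will differ with every seed.

```python
import random
from typing import List, Optional


def coin_walk(tosses: int, seed: Optional[int] = None) -> List[int]:
    """Running total after each toss: +1 for heads, -1 for tails."""
    rng = random.Random(seed)
    position = 0
    path = [position]
    for _ in range(tosses):
        position += 1 if rng.random() < 0.5 else -1
        path.append(position)
    return path


path = coin_walk(1000, seed=42)  # seed is arbitrary, only for repeatability
zero_visits = sum(1 for p in path[1:] if p == 0)

print("final position:", path[-1])
print("furthest from zero:", max(abs(p) for p in path))
print("times back at zero:", zero_visits)
print("chance of ten matching tosses in a row:", 0.5 ** 10)
```

Running it with a few different seeds shows how often the total spends long stretches far from zero, even though the coin is fair.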

Finally, it’s important to understand that the value of a feature decays over time. All the time it is not available to customers, you have lost value. You might even miss a key deadline such as a shopping season. Delays can also increase risk (such as from competition or a security breach) or prevent you from taking advantage of new opportunities.

What this all means: when you are running close to capacity, work can back up quickly and cause huge delays. Once work is backed up, it will most likely stay that way for a long time, causing significant loss of value for your business. And giving people more work makes it dramatically worse.

So how do you address this? You do it by managing your queues and keeping them within a reasonable range.

Why is this? Look at that graph again. Wait time is a function of utilization, and both are tied to the length of your queue: the longer the queue, the longer each new item waits.

This means if you control your queue length, you control your utilization, and you prevent your wait times from leaping up that hockey stick. This is called maintaining flow.
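The link between queue length and wait time can be made concrete with Little's Law, a standard queuing result that the article does not cite by name: average wait equals average queue length divided by average throughput. The sketch below uses made-up numbers (a team finishing 5 items a week) purely for illustration.

```python
def average_wait(queue_length: float, throughput: float) -> float:
    """Little's Law: average wait = average queue length / average throughput."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return queue_length / throughput


items_per_week = 5.0  # hypothetical team throughput

for backlog in (30, 10):  # hypothetical queue lengths, before and after a cap
    weeks = average_wait(backlog, items_per_week)
    print(f"{backlog} items queued -> about {weeks:.0f} weeks of waiting")
```

With the same throughput, capping the queue at 10 items instead of 30 cuts the average wait from about six weeks to about two, without anyone working faster.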

Traffic engineers know this. This is why there are metering lights on freeway onramps.

Enrolling your leadership

As incredibly effective as this technique is (it is called managing Work In Progress, or WIP), it is very rare to see it implemented in the field. I suspect the reason is that it requires commitment and support from your leadership. If they don’t understand this principle, which is very non-intuitive, you will ultimately get pushback to take on more work. At one company where I worked, we had over forty different “top priorities” for the year.

I recognize it is not easy to convince business leaders that limiting work in progress is key to producing a constant high rate of business value. Here is what I have found useful:

  • Start with the somewhat obvious fact that delays degrade business value. The same features delivered later bring fewer dollars in.
  • Show them the math, which makes clear how being too busy can cause massive delays that are hard to recover from.
  • Use the metaphor of metering lights and freeways to demonstrate the importance of limiting WIP.
  • If there is still hesitance, ask for their support to pilot this with a particular value stream (like a product line or scrum team).

Implementing WIP controls

When it comes to actually implementing limits to your queues, there is a ton of literature on managing WIP. I highly recommend you do some reading and exploration on this topic.

But very quickly, the basic idea is this: anywhere you pass work from one group to another, there is going to be a queue. Set a limit on that queue. Make it smaller than you think it should be. Then watch the queues. If the queues are staying reasonable, you’re good. If a queue is starting to pile up, make the WIP limit smaller. If it’s always at zero, make it slightly bigger.

Once you have these limits in place, pay attention to where the WIP limit is getting triggered. The first queue whose WIP limit is getting triggered is where your current bottleneck is — the step in the process that is slowing everything else down. Maybe it’s your QA process, or maybe it’s a system that is really hard to change.
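The two ideas above, a limit on each hand-off queue and watching which limit trips first, can be sketched in a few lines. Everything here is hypothetical: the class, the stage names, and the limits are invented for illustration, and real teams usually enforce this on a board or in a tracking tool, not in code.

```python
from collections import deque
from typing import Deque, List, Optional


class WipLimitedQueue:
    """A hand-off queue that refuses new work once it reaches its WIP limit."""

    def __init__(self, name: str, limit: int) -> None:
        self.name = name
        self.limit = limit
        self.items: Deque[str] = deque()
        self.times_blocked = 0  # how often the limit has been triggered

    def try_push(self, item: str) -> bool:
        """Accept the item if there is room; otherwise record a blocked attempt."""
        if len(self.items) >= self.limit:
            self.times_blocked += 1
            return False
        self.items.append(item)
        return True

    def pull(self) -> Optional[str]:
        """Take the oldest item, or None if the queue is empty."""
        return self.items.popleft() if self.items else None


def first_blocked(stages: List[WipLimitedQueue]) -> Optional[WipLimitedQueue]:
    """Return the earliest queue in the flow whose limit has been triggered."""
    for stage in stages:
        if stage.times_blocked > 0:
            return stage
    return None


ready_for_dev = WipLimitedQueue("ready for dev", limit=3)
ready_for_qa = WipLimitedQueue("ready for QA", limit=2)

for feature in ("A", "B", "C"):
    accepted = ready_for_qa.try_push(feature)
    print(feature, "accepted" if accepted else "blocked: help QA instead")

bottleneck = first_blocked([ready_for_dev, ready_for_qa])
print("limit triggered at:", bottleneck.name if bottleneck else "none yet")
```

The important design choice is that a full queue pushes back on the upstream team instead of silently growing, which is what makes the bottleneck visible.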

Now you can figure out how to address it. Maybe you need to invest in refactoring a system, or maybe your QA process needs automation, or you are too dependent on unstable testing environments.

But here’s the important point: until the bottleneck is addressed, don’t hand more work to the team that is idle because they are blocked waiting. You can have them help fix the bottleneck, but don’t give them more features to work on. Keep those metering lights on. Otherwise, things will just pile up even more behind the bottleneck, and you will quickly have a mess on your hands.

Architect at eBay, but still learning who I really am