Software Architecture Rule of Thumb — Failing to Think about Failure

If you were to ask me what coding anti-pattern gets me riled up more than anything, it’s this:

String myMethod() {
    try {
        return doSomething();
    } catch (Exception e) {
        logger.warn("Something funny happened");
        return null;
    }
}

This just drives me crazy. Why would anyone think it’s OK to essentially swallow an error like that? Some would say “but I’m logging a warning!” Yes, but you have completely removed all the contextual information contained in the exception’s stack trace, and you’re returning null when something actually went wrong.
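
For contrast, here is a minimal sketch of what I would rather see: either let the exception propagate, or, if you must catch it, wrap it with context so the stack trace survives (the wrapping exception type and message here are illustrative, not a prescription):

String myMethod() {
    try {
        return doSomething();
    } catch (Exception e) {
        // Chain the original exception as the cause so the full stack trace
        // is preserved; the caller decides what to do, and nothing is swallowed.
        throw new IllegalStateException("doSomething failed in myMethod", e);
    }
}

Now when something does go wrong, the log shows you exactly where and why, instead of a mysterious null propagating through the system.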

This is a Java-specific complaint, but it’s a reflection of an overall tendency I see, and that is to not want to think about or really deal with failures.

Thinking about Failure

You see this particularly with engineers who have never had to do production support. Once you find yourself up overnight trying to debug something while three different managers keep slacking you for updates, and three engineers are standing behind you, you learn how important thinking about failure is.

It’s not just at the code level. It’s even more important to think about this when designing systems. Almost every time I review a design I have to ask “what happens when this component fails?” or “what happens if the caller sends you bad data?” or “what happens if you suddenly get 10 times more traffic?” and a common response is “uh…”

When you are exercising your design with scenarios, be ruthless about including failure scenarios. It’s your responsibility as a professional. Come up with as many failure scenarios as you can think of, and understand how the system will handle each one.

Mean Time to Recovery

You also need to ask yourself, how will someone quickly identify and fix a problem when (not if, but when) something goes wrong?

More and more we all have to deal with complex distributed systems. Distributed systems mean lots of interacting components, which means the likelihood of failure increases until you reach a point where failure is almost guaranteed. Leslie Lamport, one of the seminal thinkers on distributed systems theory, once said:

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable

Sometimes we build overly complex and slow development processes in an attempt to prevent any failure from happening. But no matter what you do, things are going to fail, particularly in distributed systems. So it’s really important to also focus on reducing mean time to recovery (MTTR).

Picture yourself hunched over your keyboard, everyone standing behind you, as you struggle to figure out what happened and why. That is when you feel the value of a low MTTR.

Recovering from a problem has the following phases:

  • Detect the problem
  • Identify root cause
  • Determine and deploy the solution

Detecting the Problem

This is all about monitoring and alerting. If you are building a distributed system such as a microservices architecture, you absolutely have to have solid monitoring in place.

There is of course system monitoring: CPU, memory, disk (and hopefully you have elastic systems that can respond automatically).

You also need to be monitoring all the services for health, latency and error rates, and raise an alert as soon as there is an anomaly.

Then, each service team needs to think about what other pieces they should be monitoring — queue depth, cache miss rate, and so on.
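
As a minimal sketch of what that can look like, assuming a Java service that uses Micrometer as its metrics library (the class and metric names here are made up for illustration):

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class WorkerMetrics {

    private final Queue<Runnable> workQueue = new ConcurrentLinkedQueue<>();
    private final Counter cacheHits;
    private final Counter cacheMisses;

    public WorkerMetrics(MeterRegistry registry) {
        // Gauge: sampled whenever metrics are scraped, so queue depth is always current
        registry.gauge("queue.depth", workQueue, Queue::size);
        // Counters: alert when the miss rate (misses / (hits + misses)) drifts from its baseline
        this.cacheHits = registry.counter("cache.requests", "result", "hit");
        this.cacheMisses = registry.counter("cache.requests", "result", "miss");
    }

    public void recordCacheLookup(boolean hit) {
        (hit ? cacheHits : cacheMisses).increment();
    }
}

These are cheap to add while you are writing the feature, and very hard to reconstruct after the fact when the queue is already backing up in production.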

One difficult area to monitor is when you are returning bad data or the UI is broken for a particular browser or device. In this case your only warning is that people are leaving your application or site, so if possible you want to be monitoring user behavior as well.

Identifying Root Cause

This is the area that is commonly not given nearly enough attention and priority.

If, when something goes wrong, your answer to identifying root cause is to log on to a production system, find the error log, and start poking around, you have some work to do. That may work on a dev box, but imagine you have five or ten services involved and a cluster of ten nodes, with thousands of requests a second coming in. Now imagine pulling your hair out.

Anybody doing distributed systems such as microservices must have the following:

  • A log collection and analysis tool like ELK or Splunk
  • An Application Performance Monitoring (APM) tool like NewRelic, AppDynamics or AppOptics [we use AppOptics at Castlight and it’s really quite good]
  • A request correlation id that is included in every log message. This is essential. In the Spring Cloud world you can use Spring Cloud Sleuth (see the sketch after this list).
  • If you are using Kafka, I have found it very useful to have tools that allow you to search topics for particular strings. I wasn’t successful finding off-the-shelf tools for this, although some suggest KSQL. I ended up building some tools myself.
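
Here is a minimal sketch of the correlation id idea referenced above, assuming a plain servlet-based Java service with SLF4J logging (Spring Cloud Sleuth does this, and more, for you automatically; the X-Correlation-Id header name is just a convention for the example):

import java.io.IOException;
import java.util.UUID;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

import org.slf4j.MDC;

// Assumes the Servlet 4.0 API, where Filter's init() and destroy() have default implementations.
public class CorrelationIdFilter implements Filter {

    private static final String HEADER = "X-Correlation-Id";
    private static final String MDC_KEY = "correlationId";

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        // Reuse the caller's id if one was sent, otherwise mint a new one
        String id = ((HttpServletRequest) req).getHeader(HEADER);
        if (id == null || id.isEmpty()) {
            id = UUID.randomUUID().toString();
        }
        MDC.put(MDC_KEY, id);        // every log line for this request now carries the id
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.remove(MDC_KEY);     // don't leak the id to the next request on this thread
        }
    }
}

With the id on the MDC, you add %X{correlationId} to your log pattern, and a single search for that id in Splunk or Kibana pulls together every log line for one request. You also need to forward the header on any outgoing calls, which is exactly the kind of plumbing Sleuth takes care of across services.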

This is just the beginning. When you are designing a system, think of the tooling you’ll need to make root cause analysis as fast as possible, and get the work to add that tooling into your backlog.

Determine and Deploy a Solution

This is not really the focus of this article, but this is where having a solid CI/CD pipeline can really help, as it allows you to rapidly roll out changes in a controlled and tested way. But all too often I see some kind of change control process where you have to open tickets and get approvals and carefully test in preprod and so on.

It also really helps to be able to reproduce the problem on your dev machine, as this allows you to iterate quickly on the fix-and-test cycle as you try to come up with a solution.

Iterate on Tooling

As much as you do your best to think of the tools and information you will need for a feature or system you are designing, you won’t be able to think of all the possible failure scenarios, and you don’t necessarily want to build for every imagined scenario until after you have a chance to see how the system actually behaves (and fails).

What I have found helpful is to carve out some time on a regular basis for building new tools and mechanisms to make it easier to identify, resolve and fix issues.

One way to do this that I have seen work really well is to create a rotating role on the scrum team for someone who owns maintenance. This is both bugfixing and tooling. This person is taken out of the capacity for feature sprint planning, and works on open bugs. The rest of the team does not work on bugs at all. This creates a very effective flow-control for bug management, and it also keeps the team motivated to improve their MTTR, so they don’t have to allocate even more people to bug capacity.

Failing to Plan for Failures Is Planning to Fail

Haha, that’s a mouthful…
