A modest testing strategy proposal

Traditionally we have been taught that we should strive for high coverage on unit tests, and build some integration tests for end-to-end flows. I’d like to refine that a little bit in this article on testing principles and strategy.

Principles of Testing

There are some key principles that drive the way I think about testing. It’s important to call these out because it helps you understand why you’re testing in a certain way, and what testing approaches are not as useful as you might think at first…

Enable empowered and independent teams

Independent teams are one of the most significant contributors to software delivery performance. Our testing strategy should allow teams to confidently change and deploy their systems independently, without centralized approval or coordination. Here are the measures of an empowered team from the Accelerate book:

  • Can make large-scale design changes without external permission
  • Can make large-scale design changes without significant impact to other teams
  • Can complete work without communicating and coordinating externally
  • Can deploy/release on demand without requiring an integrated test environment
  • Can perform deployments during normal business hours with negligible downtime
  • We can and do deploy our application independent of other applications/services it depends on

Have an economic view of tests

A test delivers value in the form of information. The more information a test provides, and the more likely it is to provide useful information, the more valuable the test. In general, the most valuable test is one that has a 50% likelihood of failing. If a test is very likely to pass, maybe it’s not worth writing it (for example, writing tests on accessor methods). You probably have also noticed that if a test is always failing, it is essentially ignored because it’s not very useful.
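One way to make the 50% figure concrete, assuming you model each test run as a pass/fail outcome with a fixed probability of failing, is Shannon’s binary entropy, which peaks at exactly 50%. This is only an illustrative sketch of that model; the function name `expected_information_bits` is hypothetical.

```python
import math


def expected_information_bits(p_fail: float) -> float:
    """Binary entropy: expected information (in bits) from one pass/fail result."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if p_fail in (0.0, 1.0):
        # A test that always passes or always fails tells you nothing new.
        return 0.0
    p_pass = 1.0 - p_fail
    return -(p_fail * math.log2(p_fail) + p_pass * math.log2(p_pass))


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.5, 0.99, 1.0):
        print(f"p_fail={p:.2f} -> {expected_information_bits(p):.3f} bits")
```

Under this model a test that fails half the time yields a full bit per run, while a test that fails 1% of the time (or 99% of the time) yields less than a tenth of a bit.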

You also need to balance the value of a test with a cost. If you are constantly maintaining a test or the infrastructure needed to run a test, the economic value of the test decreases. So you need to do a cost-benefit analysis when considering what tests to write and how they are run.

Enable fast detection, root cause analysis, and resolution

One of the key metrics for software delivery performance is Mean Time to Recovery (MTTR). There are three aspects of this: detection, identifying root cause, and resolution.

Also, most continuous delivery pipelines will prevent a change set from being deployed to production if there are test failures. So the longer a test failure goes unresolved, the longer the lead time, another crucial measure of software delivery performance. When deploys are blocked due to test failures, changes get backed up like cars in a traffic jam. The longer changes wait, the more risk is introduced, and deliveries end up in batch mode rather than incremental.

For these reasons, when developing tests, we need to have a clear and validated approach for detecting failures, identifying the problem, and resolving it.

It’s very important to note that when you deliver changes more frequently, in smaller batches, this significantly improves your ability to recover from a test failure quickly.

Testing Strategy

With these principles in mind, here are some recommendations I would like to offer around testing strategy…

Unit Tests

Unit tests run fast and are focused on individual units: classes or functions. For them to be valid unit tests, any dependencies these units have must be replaced with mocks, so that you can focus on the unit under test and, when there is a failure, you know it’s in the unit under test and not in some other interaction. This makes it easier to quickly identify root cause and fix the problem. Mocking out dependencies also allows the tests to run faster and to be run easily on a developer’s machine. Unit tests are your first line of defense and are usually the first stage of testing to be run in a continuous integration/delivery pipeline.
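Here is a minimal sketch of what that looks like with Python’s standard `unittest.mock`. The `OrderService` class and its payment gateway dependency are hypothetical, invented only to illustrate the shape of a unit test with a mocked dependency.

```python
import unittest
from unittest.mock import Mock


class OrderService:
    """Hypothetical unit under test; it depends on a payment gateway."""

    def __init__(self, payment_gateway):
        self._gateway = payment_gateway

    def place_order(self, amount_cents: int) -> str:
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        approved = self._gateway.charge(amount_cents)
        return "confirmed" if approved else "declined"


class OrderServiceTest(unittest.TestCase):
    def test_confirms_order_when_charge_is_approved(self):
        gateway = Mock()  # stands in for the real gateway; no network involved
        gateway.charge.return_value = True
        self.assertEqual(OrderService(gateway).place_order(1250), "confirmed")
        gateway.charge.assert_called_once_with(1250)

    def test_declines_order_when_charge_is_rejected(self):
        gateway = Mock()
        gateway.charge.return_value = False
        self.assertEqual(OrderService(gateway).place_order(1250), "declined")

    def test_rejects_non_positive_amount_without_charging(self):
        gateway = Mock()
        with self.assertRaises(ValueError):
            OrderService(gateway).place_order(0)
        gateway.charge.assert_not_called()
```

You can run these with `python -m unittest`. If one of them fails, the only place to look is `OrderService` itself.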

Some organizations strive for 100% test coverage. But I recommend using your best judgement here: are more tests adding good incremental value in terms of their likelihood to provide additional information? One risk with too many tests is that they slow down your ability to make changes because the tests will have to be rewritten when you change the interface or behavior of the unit. So you have to take that cost into consideration as well.

The fact that unit tests can break when you change your system also means you want to avoid testing the internal algorithm of a unit, but instead test its contract. If you find yourself having to build methods on your class that are needed just for testing, this is usually a design smell that either your unit is too large and complex, has a poorly defined interface, or you are getting caught up in testing the internals of the unit.

I personally really like Test Driven Development. There are a couple of key reasons I like it:

  • It’s fairly easy to write a test that passes because it doesn’t actually test what you think it does. With TDD you make sure the test fails first, then add the functionality, and then see that the test goes green.
  • TDD forces you to build only what you need based on the contract you require for the story you are building
  • TDD generally forces you to build a more decomposed design with small, testable units. You can’t build large, complex, and ultimately untestable classes if you’re writing the tests first.

That said, care has to be taken not to be too bottom-up with this approach. When you write a bunch of tests for one unit at a time, you can get lost in the trees and not see the forest, and only later do you lift your head up and realize that all this work has been for classes that don’t work together to solve the actual story.

To help avoid this, what I prefer to do is write the contract test first (more on this below), make sure it fails, and use that to drive the overall design of an API or user flow.

Contract Tests

There are a lot of terms that seem vaguely similar and this ends up with a lot of confusion: integration tests, end-to-end tests, API tests, contract tests, and so on. The general goal is you want to test something larger than an individual unit (class or function), where pieces are working together, to make sure they integrate well.

What I have landed on is the term contract test to identify the next level of test above unit tests. A contract test exercises an interface contract independent of the rest of the system. These are normally API or event contracts, and the term contract is important. It means this is the way you are exposing your component to the rest of the system. Developers or users outside of your little world are interacting with your component through this contract. It’s normally documented in some way, and in general you are providing strong backward compatibility guarantees.

These compatibility guarantees are super important, because a contract that only evolves in compatible ways is one of the foundational elements of a loosely coupled system, and a loosely coupled system is a prerequisite to independent, empowered teams. So your contract tests are essentially making sure that your contract is behaving as expected and promised as the underlying implementation evolves.

Just as with unit tests, you want to test the component that provides a contract in isolation, so that a failure gives you clear information and it is easier to identify root cause. This means that components that you depend on or which depend on your component (API or event consumers) are stubbed or mocked out. There are many tools to provide mocking of API calls that you can take advantage of.
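As a rough sketch of the idea, here is a contract test for a made-up account lookup. Everything in it is hypothetical (the `ACCOUNT_CONTRACT_V1` field list, the `get_account` entry point, and the stubbed profile service); a real contract would more likely live in an OpenAPI document or an event schema, but the shape of the test is the same.

```python
from typing import Any, Callable, Dict, List

# Hypothetical published contract for an account lookup: field name -> type.
ACCOUNT_CONTRACT_V1 = {"id": str, "email": str, "active": bool}


def get_account(
    account_id: str, fetch_profile: Callable[[str], Dict[str, Any]]
) -> Dict[str, Any]:
    """Hypothetical component entry point; fetch_profile is a downstream dependency."""
    profile = fetch_profile(account_id)
    return {
        "id": account_id,
        "email": profile["email"],
        "active": profile.get("status") == "active",
        "plan": profile.get("plan", "free"),  # newer, additive field
    }


def contract_violations(response: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """Return the ways the response breaks the contract (empty means compliant)."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(f"wrong type for {field}: {type(response[field]).__name__}")
    # Extra fields are allowed: additive changes stay backward compatible.
    return problems


def stub_profile_service(account_id: str) -> Dict[str, Any]:
    """Stub standing in for the real downstream profile service."""
    return {"email": "test@example.com", "status": "active"}


def test_account_response_honors_v1_contract():
    response = get_account("acct-1", stub_profile_service)
    assert contract_violations(response, ACCOUNT_CONTRACT_V1) == []
```

Note that the check tolerates the extra `plan` field. The test fails only when a promised field disappears or changes type, which is exactly the kind of change that breaks consumers.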

To align with the measures of an empowered team, these contract tests need to be able to be run without requiring an integration test environment, and should be runnable on a developer’s machine so they can identify and reproduce issues.

Normally contract tests are the second stage of a CI/CD pipeline.

The question of value arises again when you are thinking about how many contract tests you want and what kind of tests you should write. If unit tests are about code coverage, I think of contract tests as interface coverage — can you say with confidence that your component will behave in compliance with the contract?

One strategy I like, which I see happen very often, is to have teams consuming your component write and contribute contract tests. This provides two different advantages. First of all, it ensures that the aspects important to that team are being tested. It also helps validate that the interface you are providing is actually useful and correct — often when a different team writes these tests, unexpected misunderstandings and assumptions are exposed and can then be resolved.

End-to-End Tests

Unit tests and contract tests are all well and good, but often where the rubber meets the road is full end-to-end tests of a system. How to write these tests, how many to write, and where they should run is often a huge point of contention and debate within an organization.

Where to run end-to-end tests

A common strategy is to build a “staging” environment where everything in the system runs in a state as close to production as possible, including some form of replication. But this strategy has a number of issues, particularly when we keep our testing principles in mind:

  • If a staging environment is shared, it becomes a point of contention and coordination across teams. You can’t deploy your code to staging whenever you want because someone else may be running their end-to-end tests. A key measure of independent and empowered teams is “can deploy/release on demand without requiring an integrated test environment,” and for good reason. More often than not an integrated test environment blocks delivery, slows everyone down, and is a point of frustration and exhaustion.
  • A staging environment impacts the economic value of tests because it is so difficult to maintain. Every component you add to your system increases the complexity and failure rate of a staging environment exponentially.
  • A staging environment impacts the economic value of tests because staging is not production, so a passed test in staging doesn’t necessarily tell you that the system will succeed in production.
  • A staging environment impacts the economic value of tests because it can be very difficult if not impossible to replicate behavior in staging that you need for certain tests.

For these reasons I strongly recommend running tests in production. There are a number of ways you can test in production.

Use canary deploys to test with the organic traffic in production

One of the best ways to test a system in production is to let the natural flow of events exercise your system. But that doesn’t mean you just drop a new change into production and hope for the best. You can test updates in a controlled manner where you reduce the blast radius and make sure if there is an error you can recover very quickly. The best strategy I have seen is canary deploys. Essentially this lets you slowly roll out a new release of a component. You start with only a small portion of traffic being routed to the new version of the component. Then you monitor closely and as soon as you see any errors, you roll back the deploy and raise an alert. If all looks good, you route a little more traffic to the new version, and a little more, until all traffic is going to the new version.
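The control loop behind a canary deploy can be sketched in a few lines. This is a simplified simulation, not how any particular tool implements it: `run_canary`, the traffic steps, and the error-rate threshold are all hypothetical, and `observe_error_rate` stands in for whatever your monitoring reports.

```python
from typing import Callable, List, Sequence, Tuple


def run_canary(
    traffic_steps: Sequence[int],
    observe_error_rate: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[str, List[int]]:
    """Walk through traffic percentages, rolling back on the first bad reading.

    observe_error_rate receives the current canary traffic percentage and
    returns the observed error rate as a fraction between 0.0 and 1.0.
    """
    completed: List[int] = []
    for percent in traffic_steps:
        # In a real system this is where you would shift traffic, then wait
        # long enough to collect a meaningful sample before reading metrics.
        error_rate = observe_error_rate(percent)
        if error_rate > max_error_rate:
            # Roll back: send all traffic to the old version and raise an alert.
            return "rolled_back", completed
        completed.append(percent)
    return "promoted", completed


def healthy_release(percent: int) -> float:
    """Simulated monitoring for a release that behaves well at any load."""
    return 0.001


def release_that_breaks_under_load(percent: int) -> float:
    """Simulated monitoring for a release that degrades at 25% traffic."""
    return 0.001 if percent < 25 else 0.08


if __name__ == "__main__":
    steps = [1, 5, 25, 50, 100]
    print(run_canary(steps, healthy_release, max_error_rate=0.01))
    print(run_canary(steps, release_that_breaks_under_load, max_error_rate=0.01))
```

In the second run the problem shows up when a quarter of the traffic is on the new version, so three quarters of your users never see it.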

In order for you to be successful with a canary deploy, you need to have the following in place:

  • Solid, reliable monitoring of your component. A great way to structure your monitoring is setting up a Service Level Objective with Service Level Indicators.
  • A mechanism for routing an ever-increasing amount of traffic to the new version of your component
  • Backward-compatibility and forward-compatibility of your component contract so that the old and new can run in tandem
  • A way to very quickly roll back the new version if an error is detected

There are tools on the market to support this process. Kubernetes can be used for canary deploys, and AWS Route 53 offers weighted routing.

Acceptance testing

A common practice is when a new feature is released, a combination of QA and product go through a series of manual acceptance tests to make sure everything looks OK. It is my strong recommendation that as much as possible these tests are run in production.

Note that for some acceptance tests you need to exercise journeys in such a way that it doesn’t impact the experience of production for regular users or cause real impacts to the business. For example, you can’t use a real credit card to submit an order, and you can’t impact metrics around active users, order frequency, etc. Normally this requires test accounts and a way to create and use test data. I discuss test data below.

But how can you verify a new feature in production before it goes live? Doesn’t “in production” mean “live?” No, it doesn’t, if you use feature flags. A feature flag allows you to control whether a feature is live or not in production, and who it’s live for. So for example you can enable a new feature just for test accounts or specific key users. Two popular feature flag frameworks are split.io and LaunchDarkly.
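To show the mechanics, here is a toy in-memory feature flag. It is not the API of split.io or LaunchDarkly; the `FeatureFlag` class, the `checkout_page` function, and the account names are hypothetical, and real frameworks evaluate much richer targeting rules.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FeatureFlag:
    """Hypothetical in-memory flag, for illustration only."""

    name: str
    enabled_for_everyone: bool = False
    allowed_user_ids: Set[str] = field(default_factory=set)

    def is_enabled(self, user_id: str) -> bool:
        return self.enabled_for_everyone or user_id in self.allowed_user_ids


def checkout_page(user_id: str, new_checkout: FeatureFlag) -> str:
    # The new code is deployed to production but only reachable behind the flag.
    if new_checkout.is_enabled(user_id):
        return "new checkout"
    return "old checkout"


if __name__ == "__main__":
    flag = FeatureFlag("new-checkout", allowed_user_ids={"qa-test-account"})
    print(checkout_page("qa-test-account", flag))  # test account sees the feature
    print(checkout_page("regular-user", flag))     # everyone else does not
    flag.enabled_for_everyone = True               # go live without a deploy
    print(checkout_page("regular-user", flag))
```

The deploy and the release become two separate events: the code ships dark, QA and product verify it with test accounts, and going live is just a flag change.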

Heartbeat tests of key user journeys

There are some key user journeys or flows that are really important. In these cases, it is very useful to have a test that runs on a regular heartbeat that exercises this flow. This strategy is very powerful but also requires test accounts and test data, which I discuss below.

You want to be thoughtful about how many of these tests you want to run. You’re trying to make sure everything is behaving well end-to-end. You shouldn’t use these tests to try and accomplish full API or code coverage. You don’t need to test every edge case or every data scenario.
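A heartbeat test can be as simple as the loop below. This is a minimal sketch: `run_heartbeat`, the `journey` callable, and the `alert` hook are hypothetical names, and in practice you would more likely hand the scheduling to cron or your monitoring platform than to a sleeping loop.

```python
import time
from typing import Callable


def run_heartbeat(
    journey: Callable[[], None],
    alert: Callable[[str], None],
    beats: int,
    interval_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the journey `beats` times, alerting on each failure.

    Returns the number of failed beats. `sleep` is injectable so the loop
    can be exercised without actually waiting.
    """
    failures = 0
    for beat in range(beats):
        try:
            journey()  # e.g. log in with a test account and place a test order
        except Exception as exc:  # any failure in a key journey should alert
            failures += 1
            alert(f"heartbeat {beat} failed: {exc}")
        if beat < beats - 1:
            sleep(interval_seconds)
    return failures
```

The journey itself should stay small: one happy path per key flow is usually enough to tell you the system is alive end-to-end.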

Data validations

Most of our systems are distributed systems. Very often components need to share data, and this is done through various data pipelines such as replication, batch processing or events on a pub/sub channel like Kafka. These kinds of channels are necessary and useful, but you never know when something can fail, and data gets lost undetected.

So it’s often useful to ask what key data invariants need to exist, such as “if I have 2256 accounts in System A I should have 2256 accounts in System B”, and then build tests that regularly run validations to make sure those invariants are being met.

Be careful with these however — I have seen many teams run these directly against their production transaction DBs and cause real problems due to heavy load on the database. Use read replicas or run the validations late at night, and don’t select the full data set, just look at the data that’s changed since the last time you ran validations.
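Here is a sketch of an incremental validation along those lines, using SQLite from the Python standard library as a stand-in for read replicas of System A and System B. The `accounts` table, its `updated_at` column (an ISO-8601 string), and the `missing_in_b` function are assumptions made for the example.

```python
import sqlite3
from typing import Set


def missing_in_b(
    replica_a: sqlite3.Connection, replica_b: sqlite3.Connection, since: str
) -> Set[int]:
    """Accounts changed in System A after `since` that never reached System B.

    Only rows changed since the last validation run are examined, and System B
    is asked only about those IDs, never for its full data set.
    """
    changed = [
        row[0]
        for row in replica_a.execute(
            "SELECT id FROM accounts WHERE updated_at > ?", (since,)
        )
    ]
    if not changed:
        return set()
    # One placeholder per ID; a real job would batch large ID lists.
    placeholders = ",".join("?" for _ in changed)
    found = {
        row[0]
        for row in replica_b.execute(
            f"SELECT id FROM accounts WHERE id IN ({placeholders})", changed
        )
    }
    return set(changed) - found
```

A non-empty result is your signal that something was lost between the two systems since the last run.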

Chaos testing

Chaos tests introduce failures into production using carefully controlled experiments. This was popularized by Netflix’s Chaos Monkey. There are other companies that provide chaos testing frameworks and guidance, such as Gremlin.
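To give a feel for what an experiment looks like, here is a small in-process sketch of fault injection. Tools like Chaos Monkey operate at the infrastructure level (terminating instances, for example), so treat this as an analogy rather than a description of how they work; `with_injected_failures` and `fetch_recommendations` are hypothetical names.

```python
import random
from typing import Callable, List


def with_injected_failures(
    call: Callable[[], List[str]], failure_rate: float, rng: random.Random
) -> Callable[[], List[str]]:
    """Wrap a dependency call so it fails a controlled fraction of the time."""

    def chaotic_call() -> List[str]:
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure (chaos experiment)")
        return call()

    return chaotic_call


def fetch_recommendations(call_service: Callable[[], List[str]]) -> List[str]:
    """Hypothetical caller under experiment.

    The hypothesis being tested: when the recommendation service is down,
    the page still renders, just with an empty recommendations list.
    """
    try:
        return call_service()
    except ConnectionError:
        return []
```

The experiment is controlled in two ways: the failure rate bounds how much traffic is affected, and the hypothesis tells you in advance what “healthy” looks like, so you know when to stop.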

Chaos tests have really good economics as they can result in a ton of information about system behaviors that are very hard to uncover any other way. And with a good toolset, they are not too difficult to set up.

I have noticed, however, that organizations are very nervous about doing these. It’s funny to me, because working to introduce these into your system can only serve to provide very useful information and ultimately significantly improve the reliability and quality of your system.

Test data

In order to run actual tests in production, you need to do it in such a way that it doesn’t create visible impact to your real users or real business. For example, you can’t charge an order with a real credit card, or impact metrics on active users, revenue reports, and so on. And your tests need to be repeatable — if you are testing a registration flow, you need to be able to register the same test user over and over again.

Some people try the strategy of not actually creating the data, or going to mock services, but this defeats the purpose of running in production, which is that you want to exercise the real system, live in production.

Another approach is to try and delete the data that was inserted as a result of running a test. If you can do this reliably and safely that can work. But in some systems I have worked with, a transaction will generate an event, which then goes downstream to a bunch of consuming systems. Tracking down all the systems that have consumed this data and trying to remove data from those systems is complex, time-consuming, dangerous, and requires constant maintenance.

A better approach is to add support for test accounts and test data, and then have systems that use impacted data sets be able to recognize and correctly handle test data. The approach we are working on right now is to have every key aggregate in the system have an additional field called is_test, and then impacted systems can filter out, delete or otherwise handle test data in a way that makes sense to them.

Another way to reduce the blast radius of test data is to apply a filter at the point we emit an event to filter out test data. That’s acceptable if we agree that the scope of the test is to test the transaction, not the downstream processing of events.
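Both options, handling is_test in a downstream system and filtering test data at the point of emission, can be sketched as follows. The `Order` aggregate, the revenue report, and the event shape are hypothetical; only the is_test field comes from the approach described above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Order:
    order_id: str
    amount_cents: int
    is_test: bool = False  # the flag carried by every key aggregate


def revenue_cents(orders: Iterable[Order]) -> int:
    """Downstream handling: the revenue report ignores test orders."""
    return sum(order.amount_cents for order in orders if not order.is_test)


def events_to_emit(orders: Iterable[Order]) -> List[Dict[str, str]]:
    """Emit-side filter: test orders never reach downstream consumers."""
    return [
        {"type": "order_placed", "order_id": order.order_id}
        for order in orders
        if not order.is_test
    ]


if __name__ == "__main__":
    orders = [
        Order("o-1", 1000),
        Order("o-2", 2500, is_test=True),  # created by a production test
        Order("o-3", 400),
    ]
    print(revenue_cents(orders))
    print(events_to_emit(orders))
```

The first option keeps the whole pipeline under test but asks every consumer to understand the flag; the second is cheaper but narrows the scope of the test to the transaction itself.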

Correctly setting up test data requires effort and care, but the payoff is significant. It allows you to reduce or eliminate the need for an integration test environment, which is not only an ongoing effort, but doesn’t deliver nearly as much value as production tests.

Should my end-to-end tests use the UI or just be against an API?

This is a great question, and is often a source of great debate.

You can argue UI tests are best because they are exercising the full experience. And you definitely need some of that. But UI tests are also brittle and prone to false failures, and can become a way of ossifying your user interface because any UI change can break your beautiful suite of tests. Now the tail is wagging the dog.

So my general recommendation is:

  • Build most of your tests using APIs because they are more stable and don’t ossify your UI
  • Build some key tests using your UI to verify your UI is working

Conclusion

The testing strategy above is new to a lot of people. It goes against traditional ways of thinking about QA. But it is driven by the key principles I mentioned in the beginning. When I have worked with teams that have rolled out a strategy like this, the result is what you would hope: more independent teams, more agility, better software delivery performance, higher quality, and engineers who feel more in control and more engaged. It is not a small thing.

Architect at eBay, but still learning who I really am