Testing Microservices in Production? Canaries Can Help

I recently gave a talk at a Meetup about testing microservices in production, and how canaries can assist harried engineers with this once-stressful process. I also helped author the Microservices for Startups book.

Moving from a monolith to microservices has changed not only how software is delivered and integrated, but also how it is tested. The Continuous Integration/Continuous Delivery (CI/CD) required in today’s world-- where deployment at scale happens several times a day, and sometimes multiple times an hour-- demands both new tools and automation.

The testing pyramid is anchored by Unit Tests, then Component Tests, Integration Tests, API Tests, and finally GUI tests at the top. Testing gets more costly as you go up the pyramid. It’s too costly for a human to do regression testing across all services, so what other tools can we use to minimize the risk?

Canary deployments are a new tool in the expanded toolbox needed to test today’s more complicated, distributed software delivery platforms. The name comes from the traditional miners’ practice of taking a canary with them into coal mines: mining can release odorless toxic gases, and the canary would die first-- which would suck for the canary, but was great if you were a human miner.

Similarly, a canary deployment launches one node into production, sees how it affects the system as a whole, and discovers whether the feature introduces more risk into the production environment. If risk is discovered, the automated canaries roll back the launch without human intervention. So using canaries is a lot cheaper than traditional manual QA testing for assessing risk.
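As a rough sketch of what that loop looks like-- not Armory's actual implementation-- here it is in Python. The deploy, collect_metrics, looks_healthy, promote, and rollback callables are hypothetical stand-ins for whatever your deployment platform and monitoring stack actually provide:

```python
import time

CANARY_DURATION_SECONDS = 600   # observe the canary for 10 minutes (illustrative)
CHECK_INTERVAL_SECONDS = 60

def run_canary(deploy, collect_metrics, looks_healthy, promote, rollback):
    """Launch one canary node, watch it, and either promote or roll back automatically.

    All five callables are hypothetical stand-ins for your platform's real operations.
    """
    canary = deploy(node_count=1)                # one node into production
    deadline = time.time() + CANARY_DURATION_SECONDS

    while time.time() < deadline:
        metrics = collect_metrics(canary)        # CPU, memory, error rate, ...
        if not looks_healthy(metrics):
            rollback(canary)                     # no human in the loop
            return "rolled back"
        time.sleep(CHECK_INTERVAL_SECONDS)

    promote(canary)                              # canary passed; continue the rollout
    return "promoted"
```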

We’re starting to see more people using canaries because they’re realizing there’s no other way to assess risk at scale. Armory's Platform automates the process and makes it actionable.

It’s not just about velocity. It’s about having a seatbelt on when you’re going really, really fast.

What Makes a Good Canary?

First off, it’s fully automated. Data and metrics are good, but automated canaries make them actionable. Automation is what’s been missing from software delivery today. We take the engineers out of the process and get them back to doing what they do best and actually enjoy: innovating and writing code.

Second, it’s a scientific, repeatable approach vs. a human's best-efforts approach. A lot of canary usage today is unscientific. For example, an engineer will set up a canary, but then will go to their Datadog or New Relic dashboard, look at 30 or 40 metrics, and decide they look good-- or at least close enough-- and give the test a pass, especially when a product manager is breathing down their neck. But in some cases, a 5% deviation in CPU is actually catastrophic at scale.

So our approach is to automate everything and remove manual judgments from the deployment process. We start by applying statistical analysis to all the metrics associated with a typical application, e.g. CPU, memory, disk, bytes in/out, queue sizes, and so on. System-level, engineering-focused metrics.
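As an illustration of what "statistical analysis" can mean here-- this is a generic sketch, not Armory's algorithm-- you might compare each metric's canary samples against samples from a baseline node using a nonparametric test, and fail the canary if any metric deviates significantly. The metric names and significance threshold below are made up:

```python
from scipy.stats import mannwhitneyu

SIGNIFICANCE_LEVEL = 0.01  # illustrative threshold, not a recommendation

def judge_metrics(baseline_samples, canary_samples):
    """Return the metrics where the canary's distribution deviates significantly.

    Both arguments map a metric name to a list of samples collected over the
    canary window, e.g. {"cpu": [...], "queue_size": [...]}.
    """
    failed = []
    for metric, baseline in baseline_samples.items():
        canary = canary_samples[metric]
        # Mann-Whitney U test: does the canary's distribution differ from the baseline's?
        _, p_value = mannwhitneyu(baseline, canary, alternative="two-sided")
        if p_value < SIGNIFICANCE_LEVEL:
            failed.append(metric)
    return failed
```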

What you don’t get from a human reviewing a Datadog dashboard is that even when all those engineering metrics look good, a new launch can still hurt revenue or cause an increase in call center traffic. We would consider that a bad deployment because it impacted the business negatively, but the engineering team isn’t looking at those metrics.

So we roll data from ServiceNow or Jira into the canaries to determine how the launch affected the business, and automatically roll it back if the effect on the business side is negative.
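A sketch of how those business signals might be folded into the verdict; the two counting callables are hypothetical stand-ins for real ServiceNow and Jira API queries, and the thresholds are made up:

```python
def business_verdict(count_incidents, count_tickets, release_tag, window_minutes=30):
    """Fail the canary if support or incident volume spikes during the canary window.

    count_incidents and count_tickets are hypothetical callables; in practice they
    would query the ServiceNow and Jira REST APIs for items tagged to this release.
    """
    incidents = count_incidents(release_tag, window_minutes)
    tickets = count_tickets(release_tag, window_minutes)

    # Illustrative thresholds: any new incident, or more than a handful of tickets,
    # is treated as a negative business impact and triggers a rollback.
    return "fail" if incidents > 0 or tickets > 5 else "pass"
```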

You simply can’t know how many things a code change is going to affect. You still need the canary to catch the unexpected changes the code introduces.

You want as many sources of data as possible to get a clear signal when a deployment is bad.

For the Continuous Deployment system itself, the canaries need to be reproducible, transparent and reliable. If you run the same canary twice in production, you should get roughly the same result. Because it is production, there will be slight variations in the data, but they should be very slight.

You have to be able to see into the results to determine why the canaries failed. It’s the only way the engineers are going to know what to fix. At what exact point in the process did the failure occur? Why did it decide to roll back?

The last piece is reliability. If it’s not reliable, engineers will go rogue and build their own, so eventually you end up with three broken canary systems instead of one, and none of them are reproducible.

How Canaries Help with Outages Caused by Testing in Production

There are two places we’ve found to be a big cause of outages. The first is that a human does a code review, sees only one or two lines of code, and passes it on. It looks trivial, but somehow, those are the biggest source of outages. You simply don’t know. It’s impossible for a human to assess the effect of those two lines of code on dozens, hundreds, or thousands of services.

The second cause of outages is user input. Especially at scale, and especially when you’re dealing with humans. You can have every piece of code correct and behaving as expected, but unexpected user input can cause a change for some reason unknown to you and take your system down.

For example, your engineers have an expected set of input data that tests out perfectly, but then you ship the code and you’re dealing with input from, say, Dubai, and all of a sudden your system is handling characters you weren’t expecting, and things start to fall over.

These are the kinds of things canaries can protect against. Instead of having a full system outage, it’s contained and automatically rolled back.

Canaries allow you to limit your blast radius. Instead of impacting your entire user base, you can test with a small portion of users.
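Limiting the blast radius usually comes down to weighted routing. Here's a minimal sketch assuming you control the routing layer yourself (a service mesh or load balancer would normally do this for you); the backend names and the 5% split are illustrative:

```python
import random

CANARY_TRAFFIC_PERCENT = 5  # only a small slice of requests ever sees the canary

def choose_backend(request):
    """Route a small, random portion of production traffic to the canary node."""
    if random.uniform(0, 100) < CANARY_TRAFFIC_PERCENT:
        return "canary-node"   # hypothetical backend name
    return "stable-pool"       # hypothetical backend name
```

In practice you would hash on a user or session ID rather than rolling the dice per request, so a given user gets a consistent experience during the test.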

The cost-- because everything has a cost-- is that you are disrupting some of your production traffic. And that’s where other tools can help.

Canary vs. Feature Flags

Feature flags are a part of that toolbox, and both are used by engineers at Armory. The two look similar, and you can use one or the other when you are starting your journey into microservices. But once you scale, it’s helpful to have that differentiation.

Canarying is more of an engineering exercise, and feature flags are more of a product and marketing exercise.

Feature flags provide the ability to expose a particular feature on a particular date to a particular group of people. The flags limit who gets affected by the launch of a new feature. You can turn flags on so that only your manager sees the changes, or a group of your peers, or only your beta customers, so you can assess the impact before deploying to all production users.
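A toy illustration of that targeting logic; the flag name, groups, and in-memory store are all made up-- real flag systems (LaunchDarkly, Unleash, a homegrown service) keep this configuration in a central place so it can be flipped without a deploy:

```python
# Made-up, in-memory flag definition for illustration only.
FLAGS = {
    "new-checkout-flow": {
        "enabled_groups": {"internal", "beta-customers"},
        "enabled_user_ids": {"my-manager"},
    }
}

def is_enabled(flag_name, user):
    """Expose the feature only to the targeted groups or individual users."""
    flag = FLAGS.get(flag_name)
    if flag is None:
        return False
    return user["id"] in flag["enabled_user_ids"] or bool(
        set(user["groups"]) & flag["enabled_groups"]
    )

# is_enabled("new-checkout-flow", {"id": "alice", "groups": ["beta-customers"]})  # True
```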

At Armory, when we build features, we build with both in conjunction. Whatever we’re building, we start off with a feature flag. We start with a scaffold, really just a white page. Then we grab an image of the product, create a baseline, and deploy the canary. Now we can compare one node to one node of the same size, in case we’ve affected something we weren’t aware of that isn’t hidden behind the feature flag. From there, we’re constantly iterating behind the feature flag. Once it’s done, we send it into production using canaries and flip the flag so one customer can see it.

With traditional waterfall software development, the assumption is that you’re going to reduce all of the risk up front before you deliver anything which, of course, takes a lot of time.

With the agile process, we break up the work and assess the risk at smaller intervals. The problem is that your QA team now has to do manual regression testing many more times. That’s why automation is so critical: without it, you’ve got the worst of both worlds, testing services with monolith-era methods, only doing more of it because you’ve broken up your monolith.