Nir Tarcic, Site Reliability Engineer at Waze, gave a fantastic talk at Google Next titled "Waze: migrating a multi million user app to Google Cloud Platform." Nir shared how Spinnaker has been key to enabling Waze to do multi-cloud deployments between AWS and GCP using GCP's Interconnect functionality. Nir talked about how Spinnaker helps Waze take a user-first approach to deploy to the cloud that's best for the user, and how it enables:

  • Automatic pipelines: Upgrade the binary and config in both AWS and GCP with one click.
  • Cross-Cloud Canary & Rollback: Each pipeline has canary testing and fast rollback if there's a problem with the canary.
  • Multi-Cloud Redundancy: Nir shared how Waze "barely survived" the 2015 AWS outage, but was "just fine" after the latest S3 outage. "If one goes down [Spinnaker] just launches instances in the next one."

In fact, Nir said that multi-cloud deployments would not have been possible without Spinnaker, and that "Spinnaker saved us 1,000 people."

If you'd like to dig in more, Nir wrote a great blog on Waze using Spinnaker for multi-cloud deployments. I also highly recommend watching Steven Kim's related talk, "Continuous Delivery on GCP: A Deep Dive on how Google Leverages Spinnaker". This is the single best talk I've found to describe the value of Spinnaker, and how it solves huge pain points for companies with massive userbases like Waze.

Here's the Talk Summary: Wondering how to migrate a large-scale service to Google Cloud Platform? Waze's migration architect shares their migration story to GCP from another cloud provider.

Here's the Transcript:

Nir Tarcic: Hi, everyone. I’m Nir Tarcic. I’m a Site Reliability Engineer for Waze. Picking a cloud provider is not an easy task, and there is a perception that the cost of switching providers is high. I’m here to show you it’s not always the case and switching cloud providers is a doable task with the right set of tools, procedures, and measures. For example, we migrated large portions of our system from AWS into GCP in just a matter of months with just a team of two people.

How many of you use Waze today? Raise your hands. Wow, okay [laughs], thank you, awesome. For those of you who are not familiar with Waze, we provide turn-by-turn navigation with real-time traffic data. We currently have over 80 million active users with an average usage of 582 minutes every month. That’s a lot of traffic jams. You know, I really love working for Waze. I think it has an amazing culture for engineering. It’s so incredible to work on a product that helps a lot of people on a daily basis. Let’s watch a quick clip on what Waze does for its users.

Di-Ann Eisnor – Director of Growth, Waze: When Waze first launched, its core was about outsmarting traffic together. GCP is the next natural evolution of Waze. Now we’re really talking about saving time for millions of people.

Paige Fitzgerald – Connected Citizens Program Manager, Waze: Our drivers are out on the streets every hour of every day, reporting the most accurate information on closures and congestion. We are taking that information and delivering it to the government officials who can address incidents in real time.

Julie Mossler – Head of Global Communications, Waze: This is a real program with real impact. We’re actually sitting down with partners and having measurable results every day.

Connor McKay – Data Scientist – Boston, MA: Austin is a city of around 600,000 residents. But during a typical work day, we grow to over a million people. Waze has allowed us to rapidly iterate through our experiments to see what sort of effects interventions have on congestion and traffic.

Chris Lambert – Systems Consultant – Kentucky DOT: We had a major snow last March. Seeing the data come in from Waze every two minutes allows us to grab user reports, and then we can respond to those reports in a timely manner.

Phil Burks – President & CEO, Genesis Group: Genesis is a software company that provides solutions to help first responders by putting Waze inside our Genesis PULSE product. Our first responders are saving five minutes, and that saves lives.

Di-Ann Eisnor: We didn’t expect that so quickly we would have cities redistributing traffic personnel, that we would have 911 systems relying on this data.

Paige Fitzgerald: As more data comes online, Waze is proud to bring together innovative transportation leaders from all around the world to develop the next chapter in traffic management.

Di-Ann Eisnor: At the core, GCP has the same mission as Waze, which is to save people time every day. It’s showing us what our new goals are going to be, new forms of safety, and new forms of urban mobility, helping cities become a lab for making lives better.

Nir Tarcic: All right. So now that we understand what we do in order to improve everyone’s lives, let’s take a second and look at our scale. What does it take to power up Waze to serve over 80 million users? Let’s start with the basics. Each front-end serving stack is behind a load balancer and has auto-scaling enabled to shrink and grow as needed. That way we allow ourselves to have the flexibility in terms of rush hour or save cost when it’s the middle of the night and no one is driving.
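
Editor's sketch (not part of the talk): the grow-and-shrink behaviour described above can be illustrated with a simple target-utilization rule. The function name, the 60 percent target, and the instance limits are all hypothetical; the talk does not say which scaling policy Waze uses.

```python
def desired_instances(current: int, avg_cpu_pct: int, target_cpu_pct: int,
                      min_instances: int, max_instances: int) -> int:
    """Return the instance count that would bring average CPU near the target.

    Percentages are whole numbers (e.g. 90 for 90%) so the arithmetic stays exact.
    """
    if current <= 0 or target_cpu_pct <= 0:
        raise ValueError("current and target_cpu_pct must be positive")
    # Ceiling division: round up so the group is never under-provisioned.
    raw = -(-(current * avg_cpu_pct) // target_cpu_pct)
    # Clamp to the configured floor and ceiling of the group.
    return max(min_instances, min(max_instances, raw))


# Rush hour: 10 instances running at 90% CPU against a 60% target.
print(desired_instances(10, 90, 60, min_instances=2, max_instances=50))  # 15
# Middle of the night: 10 instances running at 12% CPU.
print(desired_instances(10, 12, 60, min_instances=2, max_instances=50))  # 2
```

The floor keeps a minimum of capacity available at night, and the ceiling caps spend if a metric misbehaves.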

On the database side, we use Cassandra. We have over 800 nodes of Cassandra. Each cluster also contains a Memcache cluster. That Memcache is used to store temporary data or Cassandra data. Because Cassandra is not the most scalable solution in terms of speed, we allow our Memcache to store the data for us to grow fast if needed.
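
Editor's sketch (not part of the talk): one common way to put a cache in front of a database is the cache-aside pattern, shown here with plain dictionaries standing in for Memcache and Cassandra. The class and key names are hypothetical, and the talk does not confirm that Waze uses exactly this pattern.

```python
from typing import Callable, Dict, Optional


class CacheAsideStore:
    """Read-through helper: check the cache first, fall back to the database."""

    def __init__(self, load_from_db: Callable[[str], Optional[bytes]]):
        self._cache: Dict[str, bytes] = {}  # stand-in for a Memcache cluster
        self._load_from_db = load_from_db   # stand-in for a Cassandra read
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[bytes]:
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._load_from_db(key)
        if value is not None:
            # Populate the cache so the next read skips the database.
            self._cache[key] = value
        return value

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


database = {"tile:1": b"png-bytes"}  # hypothetical stored record
store = CacheAsideStore(database.get)
store.get("tile:1")  # miss: loaded from the database, then cached
store.get("tile:1")  # hit: served from the cache
print(store.hit_ratio())  # 0.5
```

A real deployment would also need expiry or invalidation so that cached entries do not outlive the data they mirror.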

We currently have over a hundred microservices just like this one. Each microservice is spread across multiple availability zones and across regions, and the most critical ones are even spread across multiple providers, Amazon and Google, to provide ourselves the best redundancy we can for our users.

For internal communications, we have developed our own discovery system and communication. This is done through a peer-to-peer system so servers connect directly. They do not have any single point of failures or any bottlenecks for traffic.

To power up all these 100+ microservices and Cassandra clusters and everything else, we currently have over 35,000 cores at any given time, growing up to double that amount in case of rush hours. I hope you can see this is very large scale, and as such, the thought of changing providers might even be a crazy one. Who does that, right? But as one of the folks who actually did the migration, I can say it’s very easy. It’s easier than you think. It’s very interesting because most migration talks you’ll hear are Boolean, either this provider or that. A partial migration lets you mix and match provider solutions, and you can decide to remain split among providers long-term. And you can enjoy the benefit of both worlds at the same time and experiment while you do so.

For example, in 2015, Amazon had an outage that lasted eight hours. This was before our GCP migration. We wanted to spin up our GCP cluster sooner than what we wanted, but sadly, we couldn’t do it just because we weren’t ready. Our engineers worked a lot of time to make sure Waze would not crash that day, and I’m happy to say we didn’t, but it was very, very close. This was one of the triggers for us to actually realize we need a multi-cloud solution and not just rely on one provider.

Waze was acquired by Google in 2013, as you probably know. Now if you’re owned by a company who provides cloud services, it makes sense that we’ll look at it and not just say, “We’ll stick with what we have.” That doesn’t mean that we’ll actually migrate to that if it doesn’t make sense. We had the privilege of picking for ourselves what provider we want to run on. Nothing was forced upon us. It’s also very hard to choose a provider because there are so many of those, and they look different, right? There’s this cloud and even this cloud. And how about this pretty one? And there’s even an actual cloud. So how do you choose one?

So why GCP? When you choose to migrate to another provider, you have to set yourselves some key points that you want to achieve because you have to measure that. And if you do not measure that, how do you know it’s a successful move or if it even makes sense for you to do that? In our case, when we evaluated GCP, we chose cost, resilience, and end-user performance as what matters to us. So let’s take a look at each one of those.

Cost: If you want to get the best price for your VM on other providers, you have to reserve them in advance and figure out how many you want to reserve. Because our whole system uses auto-scaling, that number was never right. We either underestimated or overestimated. Doesn’t matter how you look at it. We lost money. Not to mention that this process was very time-consuming. Sadly, I did it so I know. We wasted days of work every quarter just to figure out what is that percentage that we need to buy. Not to mention that reserving an instance means that you’re locked into that one. Let’s say you have a new CPU family being released. You want to use that. Sadly, you can’t. You’re stuck for either one year or three years. What can you do? Or maybe you have redesigned your application, and now you actually don’t need those 200 nodes of Cassandra and you actually need 3. In GCP, we get that price out of the box because GCP has automatic sustained use discounts. For us, it made sense that all our expensive workloads should move to GCP as soon as we can because that will lower our cost significantly.
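
Editor's sketch (not part of the talk): the reservation problem is easy to see with a little arithmetic. The demand figures and the prices (in cents per instance-hour) below are invented for illustration and are not Waze's or any provider's real numbers.

```python
from typing import Sequence


def reservation_cost(hourly_demand: Sequence[int], reserved: int,
                     reserved_cents: int, on_demand_cents: int) -> int:
    """Total cost in cents when `reserved` instances are paid for every hour
    and any demand above that is served at the on-demand price."""
    total = 0
    for needed in hourly_demand:
        overflow = max(0, needed - reserved)
        # Reserved capacity is billed whether or not it is used.
        total += reserved * reserved_cents + overflow * on_demand_cents
    return total


# Hypothetical auto-scaled demand: two quiet hours, two rush hours.
demand = [10, 10, 40, 40]

print(reservation_cost(demand, reserved=0, reserved_cents=6, on_demand_cents=10))   # 1000
print(reservation_cost(demand, reserved=10, reserved_cents=6, on_demand_cents=10))  # 840
print(reservation_cost(demand, reserved=40, reserved_cents=6, on_demand_cents=10))  # 960
```

In this toy example, reserving for the peak (40) pays for idle machines during the quiet hours, and reserving nothing pays full price all day. The cheapest choice depends on knowing the demand curve in advance, which is exactly what an auto-scaled system makes hard.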

Next up, we have custom machine types. For those of you who are not familiar with custom machine types, they allow you to pick any flavor you want, take X cores and Y memory, and if you need, add a few local SSDs. But if you don’t, don’t add them. Each one, you can do whatever you want. At first, it may not seem that special. But when you’re dealing with the flaw of the [unclear] microservices, it actually makes sense. You have to find a just balance between cost and performance. Here on the screen, we can see an example of a Cassandra cluster we have moved from Amazon to Google. That Cassandra cluster had six nodes. On the Amazon side, it was an I2.4XL, which [inaudible] price for it is almost $2500. But we actually picked that one just because we needed the CPU. We didn’t need all that local SSD. We also didn’t need all that RAM. But you couldn’t do anything about that because that’s what you get. We used custom VMs to pick exactly what we needed. We needed 20 virtual CPUs, 30 gigabytes of memory, and just one local SSD, which in Google is 375 gigabytes. As you can see, the price difference is huge. Now of course we could reserve those instances in Amazon, but still you can see there is a huge difference. If we combine both of those, reserving instances and custom VMs, we have managed to save Waze over 20 percent every year just because of those, and that’s a lot.
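
Editor's sketch (not part of the talk): Compute Engine names custom machine types in the form `custom-<vCPUs>-<memory in MB>`. The helper below builds that name for the 20 vCPU, 30 GB shape mentioned above. The function itself is hypothetical, and the vCPU rule it checks (1 or an even number) is stated from memory as an assumption; check the current Compute Engine documentation for the exact limits, including memory-per-vCPU bounds that are not checked here.

```python
def custom_machine_type(vcpus: int, memory_gb: int) -> str:
    """Build a Compute Engine custom machine type name.

    Assumption (verify against current docs): vCPU count must be 1 or even.
    """
    if vcpus < 1 or memory_gb < 1:
        raise ValueError("vcpus and memory_gb must be positive")
    if vcpus != 1 and vcpus % 2 != 0:
        raise ValueError("vCPU count is assumed to be 1 or an even number")
    memory_mb = memory_gb * 1024  # the name encodes memory in megabytes
    return f"custom-{vcpus}-{memory_mb}"


# The Cassandra node shape described in the talk: 20 vCPUs and 30 GB of memory.
print(custom_machine_type(20, 30))  # custom-20-30720
```

The single 375 GB local SSD from the talk would be attached separately when the instance is created; it is not part of the machine type name.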

The next metric we have was resilience. GCP’s flat network provides us a seamless, cross-region transport. A VM can communicate to a VM in the same region just as easily as it can talk to another region. You don’t have to worry about firewall rules, external IPs, encryption, or so forth. It’s as simple as that. This is something we couldn’t do in AWS, which prevented us from moving easily into a multi-region solution.

And lastly, we have latency. Google’s load balancer is globally deployed in all these locations, giving the best latency and performance for our end users. This was a critical path for us because Waze currently serves over 100 countries worldwide. It’s not just U.S. where you want the best latency, but you have to care about those South America customers or those APAC customers. Everyone matters in our case.

For example, let’s take a service that we’ve migrated into GCP. This service basically loads up the map that you see on the screen. Whenever you use Waze, you get those images. It’s a very simple application. It’s a standalone one that has NGINX on the frontend and Tomcat on the backend, and it’s being served by Cassandra and Memcache in order to have [inaudible] cache there. Memcache currently has over 99 percent hit ratio because we don’t change the map every second. We change it every day. Using the same design we had in Amazon, we have managed to improve our latency by 10 percent. That means 10 percent lower latency for all our customers just by moving. We didn’t do any change. And just like this service, we have dozens more that benefit from this improvement.

Once we’d decided that we want to migrate, we set ourselves some principles on how we plan to migrate. First of all, we wanted to make sure that we shield our developers from this migration. Our team, [inaudible] team shouldn’t make the developer’s job harder because it’s more efficient to run on GCP. They should be working and not stressed about the migration. That’s not a healthy engineer.

Next you should always have an escape plan. In this case, I’m not sure if he has one. In case we have any issues with the migration, we wanted to have a fast rollback. This gave us more confidence in going fast because we always knew there’s a way back. We never let it go.

And lastly, we’re doing this for our users. If we’re impacting our users just because we want to migrate, there’s no reason to migrate. Let’s say you guys are finishing a long day’s work. All you want to do is get home as quickly as possible and make sure that Waze actually works for you. Otherwise, you might be stuck in a traffic jam, and you won’t get home to spend time with your family, friends, or just relax with your favorite pet. Or maybe you want a cuddly pet, up to you, but you want to get home.

[inaudible] decided on how we are going to migrate. We need to design how we will do this because we’ve set ourselves some principles of fast rollback. We need a solution where we can run both services at the same time. We decided to use Amazon’s and Google’s managed VPN solutions. It made sense. Out-of-the-box solution? Why not? Sadly, that didn’t work for us. We quickly understood that Amazon’s VPN solution has some faults. It doesn’t scale as much as we need. And when you transmit large amounts of packets, it also has a lot of packet loss. Both are things we couldn’t tolerate.

We sat down with Google engineers to try and figure out what we can do. So we’ve built a solution that looks like this. On the Amazon side, we’ve purchased Direct Connect fibers that basically go into an Equinix data center. There we’ve purchased a rack that has off-the-shelf VPN appliances, multiple of those, of course, for redundancy. From those VPN appliances, we’ve established a connection to Google’s managed VPN solution because that one is scalable and doesn’t have any issues. This allowed us to have both providers look like it’s just one provider. A VM can talk to a VM just as it talks to a VM that’s in its zone, its region. And if it’s in GCP, it doesn’t know. It has another IP address, but for the VM, it’s an internal IP. It just works. This has later evolved into a managed solution that Google is providing customers. And it has improved since we designed this. If you guys want to use this, you don’t need those VPN appliances. Google’s solution just allows you to take a fiber cable from Amazon and from Google and connect them, and that’s it. That product is called Private Interconnect.

So now that we have a reliable and a low-latency connectivity, we could start migrating our services. But we needed to come up with a way to achieve our other principles. How do we shield those engineers from being frustrated because we migrate? They shouldn’t care if it’s in Amazon or in Google. They just want to deploy their applications.

We’ve done a lot of research, and we found out that Spinnaker – which is an open-source application developed by Netflix and maintained also by multiple other companies – is the right choice for us. Spinnaker supports multi-cloud setup. It supports Amazon, Google, and even Azure if you want one. It’s maintained by Google itself, and we even have engineers from Google, Microsoft, and other companies working on it here. As you can see here, we have an application. This is real [inaudible] traffic. And in one page, we can see our application running both in Amazon and in Google. We don’t need to worry about it. We don’t need to figure out where it’s running. And we can also see that 1 instance out of those 120 shards we have in Amazon is misbehaving. So we can terminate it easily or auto-scaling will do it automatically. We don’t need to worry about it.

Spinnaker also allows us to have automatic pipelines. In one single pipeline, we can upgrade the binary and configuration in both environments. We don’t need to worry about it. It’s just one click. That pipeline produces or actually executes more pipelines underneath. Each pipeline has canary testing and fast rollback in case there is a problem in the canary. That way our developers can deploy fast, without worrying about whether the code is stable or not. If you want to learn more about continuous deployment and how we use Spinnaker in our system traffic, I recommend you sign up for the session that my colleague is giving on Friday that talks about Spinnaker.
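
Editor's sketch (not part of the talk): the canary-then-rollback flow can be pictured as below. This is plain Python for illustration only; it does not use Spinnaker's API, and `run_pipeline`, `canary_ok`, and the version strings are all hypothetical.

```python
from typing import Callable, Dict, List


def run_pipeline(new_version: str, clouds: List[str], live: Dict[str, str],
                 canary_ok: Callable[[str, str], bool]) -> bool:
    """Roll `new_version` out cloud by cloud, gated by a canary check.

    If any canary fails, every cloud already upgraded in this run is put
    back on the version it had before, and the function returns False.
    """
    previous = dict(live)  # snapshot used for rollback
    upgraded: List[str] = []
    for cloud in clouds:
        if not canary_ok(cloud, new_version):
            for done in upgraded:
                live[done] = previous[done]  # fast rollback
            return False
        live[cloud] = new_version
        upgraded.append(cloud)
    return True


live = {"aws": "v1", "gcp": "v1"}

# A healthy release: every canary passes, both clouds move to v2.
print(run_pipeline("v2", ["aws", "gcp"], live, lambda cloud, v: True), live)

# A bad release: the canary fails on gcp, so aws is rolled back to v2.
print(run_pipeline("v3", ["aws", "gcp"], live, lambda cloud, v: cloud != "gcp"), live)
```

The point of the snapshot is that rollback needs no decision-making at failure time: the previous known-good state is already recorded before anything changes.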

Earlier I talked about the outage Amazon had in 2015. I’m sure most of you will recognize the headline. This was just last week. They just gave me more material for my work, I guess. In 2015, we barely survived that outage. Our engineers worked very hard to maintain ourselves. But here we were just fine because we’ve implemented those principles we’ve defined for ourselves. We have multi-cloud, and we are not binding ourselves to a single provider. So in case one goes down, we’re just launching instances into the next one. A lot of companies did go down but we didn’t.
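
Editor's sketch (not part of the talk): "launching instances into the next one" amounts to a placement decision driven by provider health. The function and the health map below are hypothetical and only illustrate the idea; the talk does not describe how Waze detects an outage or chooses the fallback.

```python
from typing import Dict, List


def choose_provider(preferred_order: List[str], healthy: Dict[str, bool]) -> str:
    """Return the first provider in preference order that is currently healthy."""
    for provider in preferred_order:
        if healthy.get(provider, False):
            return provider
    raise RuntimeError("no healthy provider available")


# Normal day: the preferred provider is healthy.
print(choose_provider(["aws", "gcp"], {"aws": True, "gcp": True}))   # aws
# Outage at the preferred provider: capacity is launched in the next one.
print(choose_provider(["aws", "gcp"], {"aws": False, "gcp": True}))  # gcp
```

This only works if the second provider can actually run the service, which is why the network link and the cross-cloud deployment pipelines described earlier come first.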

I want to summarize a few things that we’ve learned along the way. I think those will be valuable to you. Don’t look at it as a huge migration of a multimillion company or a multimillion-user company. Look at it on a service-by-service basis. You might find that you have standalone services that you can just as easily migrate without worrying about those big blocks that have 500 dependencies or dozens of dependencies. And you can start saving up just because you’ve moved that part. And if you find that you evaluate GCP and you can’t run that service on GCP, just wait. It might get better. And I know it because we’ve done this for the past three years, and we’ve been constantly evaluating GCP until it has reached the point where we’re saying to ourselves, “This is the right moment. It’s ready for us. Let’s move.”

So what’s next for us? We are continuing our migration, and we’re hoping to finish it this year. We’re also looking at how we can actually utilize GCP’s infrastructure to better our service, to improve our service for the end users, and our redundancy. And as I said earlier, our users matter. Eventually, we’re doing it for them, not because we want to run on GCP or on AWS.

I want to finish up with this image that I found, which kind of reminds me of how I feel with this migration. Imagine you’re going to the same arcade shop every day for the past years, playing the same video games every day. And then there’s a new arcade shop across the street. Who wouldn’t want to try that? It has [inaudible] and different games. So you’ll enjoy that. But that doesn’t mean you have to only use that new, shiny arcade shop. You can use both. What happens if there’s a power failure at the new one? You can go back to the other one. You should just think of your company and your users. What seems to be the most logical thing to do? And go with your heart because that’s what we did, and we feel more confident that our service is more redundant today and resilient to any outage that we’ll have.

I want to thank you for your time, for being here, and for learning everything on GCP. So thank you.