Demetri Mouratis, Senior Staff Site Reliability Engineer at Box, agreed to give a talk about the use cases, the potential for overlap, and how the two tools Terraform and Spinnaker compare. We previously covered this topic in another blog post with specifics about the two tools, but readers may find that Demetri's experience as a Box engineer provides valuable insight as well.

Demetri covers the following:

  • Box's perspective with Terraform: a great tool for "doing the plumbing - the kind of stuff that you only need to set up once"
  • The importance of using DevOps tools in his line of work
  • How Terraform's use at Box has changed over time, and how some use cases have shifted over to Spinnaker
  • How Box is using Packer and Jenkins, two tools integral to how Spinnaker works
  • "It's almost as if, the way we were using those tools was in the same spirit as what Spinnaker eventually came to Box to do."
  • Box's journey to immutable infrastructure
  • Box's approach to immutability with its own home-grown tooling
  • The pain of using mutable machines (Demetri was not a fan of mutable infrastructure)
  • The pain of tracking logs with Puppet
  • Demetri notes, however, that Box still uses Puppet once at build time, after which it is turned off
  • Demetri's experience on what Terraform was not good at, and why Spinnaker is good at filling the gap
  • The competitive advantage of not spending time on "undifferentiated heavy-lifting"
  • Spinnaker's UI visualization of individual nodes and how this streamlined his ability to survey his deployments
  • Deployment strategies offered by Spinnaker
  • Taking advantage of "the fact that other people have done it before you"

Here's a transcript of the talk:

Armory Okay, we are live. Hello. This is…

Dimitri Hello.

Armory [...00:05] technical writer.

Isaac I’m Isaac, CTO and co-founder of Armory.

Armory And we’re here with Dimitri. Dimitri, do you want to introduce yourself?

Dimitri Hey, Dimitri here from Box, good morning.

Armory Today we’re here to talk about Spinnaker and Terraform and how to [...00:30] tools play with each other. And Dimitri, I graciously agree that [...00:34]. Would you like to dive right in? Or do you want to talk about any background information for the DevOps [...00:41] audience?

Dimitri Sure. From my understanding of the question at hand, what we’re trying to figure out here is for those maybe familiar with one tool or both tools, how do you think about the use cases and the potential for overlap and compare and contrast the two tools, in this case, Terraform, which is an open-source tool from HashiCorp, and Spinnaker, which is another open-source tool primarily from Netflix and Google among others.

Armory That’s right. Definitely, you can go back into the pre-Spinnaker world. How were you guys using Terraform? What tools did you guys build around Terraform, and what did the world look like afterwards?

Dimitri Yeah. So Box is currently using both tools, both Terraform and Spinnaker. And we’re using them sort of in different stages of our progression into a more immutable server world. From our perspective, Terraform is a really great tool for doing what I call the plumbing. So it’s the kind of stuff that you need to set up once. In this case, you can think about things in AWS such as security groups or VPCs or load balancers, ELBs. All those types of things are sort of [...02:22]. Those things are all required in order for Spinnaker to do its work. So something has to set those up. You can certainly set those all up manually. And if you wanted to, you can point and click your way around the AWS console, and certainly you could get the job done that way. But given that infrastructure as code is kind of the rallying cry as of late and the way people prefer to do these things, doing things kind of ad hoc through the UI is not the greatest way to go. It’s not reproducible. It’s not [...03:03]. It’s not stored. So there are a lot of risks associated with not using some type of tool. And of the tools out there [...03:11], Terraform was both the most feature-rich and the most well-documented, the best-supported tool for doing that kind of one-time configuration setup. So when we first introduced Terraform [...03:33] Box, we were basically using it to do a combination of two disparate things. One was the kind of one-time setup stuff that I was just discussing, and second was the actual deployments of your application. So we’re using another open-source tool [...03:48] Packer. And Packer is integral to the way Spinnaker works. So we were using both of those tools in conjunction and in conjunction with Jenkins as well, which is also fairly integral to the way Spinnaker works.
So you can think of our approach sort of pre-Spinnaker as cobbling together our own build pipeline and deployment pipeline out of the same tools that eventually got wrapped up and put behind the [...04:21] micro services that Spinnaker represents. So it’s almost as if the way we were using those tools was in the same spirit as what Spinnaker eventually came to Box to do but more kind of a do-it-yourself, tying these tools together with some shell scripts and some Jenkins jobs and so on and so forth. So that’s how we kind of got started with this.
And I think to be honest, the reality is for us, that was a big enough step forward for us that was difficult for people to kind of understand even immutable servers in the first place. So it’s kind of a reasonable place for us to land in terms of how we wanted to get started with our journey. And we’ve been at this for a year and a half, going on two years, maybe, from the very first kind of [...05:16] using that style of deployments, all the way through till now. And that takes us to Spinnaker. So as we were investigating all of this, we knew about Spinnaker. I definitely have been a big fan of Netflix’s blog and a lot of their open-source tools. I’ve contributed to a few of them. So I was aware and I heard of Spinnaker, had seen some of the early videos. I was aware that people were doing it. But it wasn’t really at the point where I could just jump straight to kind of a fully managed micro services based approach. So I kind of [...05:59] up and put a link and just some documentation there. [...06:04] actually [inaudible] want to come back to this. But let’s go off and do this in a more DIY fashion. And then we’ll get as far as we can get, and we’ll find out what are the limitations with the approach that we’ve put together. And then and only then should we solve those problems. Premature optimization is not the right way to go. You should first feel some of the pain and then try to go out to the community and find what are the tools to take away that pain. So we weren’t even feeling the pain portion yet when we started off on this journey. We were just kind of trying to transition ourselves from a non-immutable or [...06:39] servers in place kind of approach, making the transition to doing immutable. So that’s kind of how the two tools came together for us.

Armory I have a few questions. One is more of a fundamental question that I think people coming from the mutable world, coming from places like Chef to [...07:05]. You were talking about Terraform and Packer. And of course Spinnaker supports the immutable frameworks or I guess methodology. Why would anybody want to make that move? There is a baking cost associated with that. But why do you want to go there instead of just doing it the way you did before?

Dimitri Yeah, that’s a great question. I think it gets to the root of where the power lies in all this. So again, it’s the situation where you kind of need to feel that pain. And here are some examples of the pain of using what’s called the long-lived, as contrasted with the immutable or ephemeral, machines approach, which was prevalent, say, going back 10 years. And in this approach [...07:49] maybe it was on [...07:50] configuration management tool. Maybe it’s Chef. Maybe it’s Puppet. Maybe it’s Ansible. And you would use this sort of on again, off again. So Box is a Puppet [...08:01]. We have Puppet daemons or agents that run say every hour. And their job is to basically look at the machine, find out any configuration that’s on that machine that has deviated or drifted from the desired configuration [...08:20] some centralized [inaudible] and then reconverge it back to the desired state. So an example is new user, John Smith, [...08:29]. They get an entry into the Puppet database to say [...08:34]. Well, the first time that that Puppet run happens on a machine that’s been in Box for a long period of time, it will detect [...08:42] user John Smith doesn’t exist [inaudible] create this user. I need to create [...08:45] files, maybe [inaudible] etc. etc. And it will do a sort of a [inaudible] configuration.

Armory [...08:56].

Dimitri Yeah. What are the problems of this approach? One problem is it’s very poll-based. By that I mean the Puppet agents are running on many hundreds or thousands of machines, and they are phoning home to some Puppet master, let’s say, on a period of like one hour. But they’re not doing it [...09:14], everyone at exactly the same second. They’re doing it in some type of offset. So that means that to make a change, you need to make the change, say in your configuration management system’s Git repo. And you need to wait for all those Puppet runs across all the fleet to happen. So it’s not synchronous. It’s not designed to be synchronous. But that’s one challenge. It’s poll-based, which means you have a delay period from the time that you make a change to the time that that change is effectuated across the board. That’s going to be the same regardless of what approach you take, but it’s just one byproduct of a poll-based system like Puppet.
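
Editor's note: the delay described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not how Puppet schedules its runs: the function name, the uniform random offset per agent, and the fleet size of 1,000 are all hypothetical choices made for illustration.

```python
import random


def propagation_delay_minutes(num_agents, interval_minutes=60, seed=0):
    """Return (worst_case, average) wait before agents pick up a change.

    Assumption: the change is committed at minute 0, and each agent's next
    run falls at an independent, uniformly random offset within the interval.
    """
    rng = random.Random(seed)
    # Minutes each agent waits until its next scheduled run.
    waits = [rng.uniform(0, interval_minutes) for _ in range(num_agents)]
    return max(waits), sum(waits) / len(waits)


worst, average = propagation_delay_minutes(num_agents=1000)
print(f"last agent converges after {worst:.1f} min; average agent after {average:.1f} min")
```

The point of the sketch is that the fleet is only fully updated when the slowest agent has run, so with a large fleet the effective delay approaches the full polling interval.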

Second, [...09:57] thing that happens is not everything on the machine is managed by Puppet. Some things come in just from the base OS. So you have this [...10:06] question of did Puppet manage this file, did Puppet put this file in my home directory, did I put this file in my home directory. And it’s very difficult. People have written tools trying to solve this problem. It’s very difficult for the uninitiated user to decide, to determine which is which. So you have this kind of… There’s the superset of all the configuration which is some stuff done by hand, some stuff done by the operating system, some stuff done by Puppet. But Puppet is only responsible for and only able to manage a subset of that.

Another type of problem that comes from this non-immutable approach is [...10:45] remove something. Let’s say I’ve provisioned a new machine and I decide I want to run Apache as a Web server on there. And then later on, I decide for whatever reason that Apache wasn’t the right choice here. I want to deploy Nginx. Well, when you deploy Puppet and you [...11:04] manage this Apache configuration, it now is responsible for that and it will converge and say, “Oh, you don’t have your Apache server on. Let me turn it on for you.” But now if I make a change and I want to change from Apache to Nginx, it’s not a simple matter of just deleting the Apache code and replacing it with Nginx code. If I delete the Apache code, Puppet no longer manages that. It doesn’t mean that it’s turning it off. It just means it’s not managing it anymore. So you could have a situation where you originally deployed Apache and then you deleted that code, and you’re like, “Oh, I want to deploy Nginx here. It should just work.” But now these two are fighting over the same port because you can only have one listener at port 80 and one listener at port 443, for example. So now, by just silently removing what you thought was what was configuring your Apache Web server, what you fail to realize is that you have to explicitly tell Puppet to remove that configuration, and now that configuration lives forever because every new machine that comes on has to have Puppet turned off of managing Nginx or Apache. And any machine that has not [...12:15] Puppet needs that removable block to go and de-configure. So it’s like you accumulate this [cruft – 12:22] over time which is like all the decisions that you’ve made, combined with all the changes. And it’s not a simple matter of just [...12:28] configuration. So a configuration kind of grows linearly up and to the right. And over time, you just end up with amazing cruft, this accretion of bad decisions or changes of mind that happen over time.
And because these machines are not immutable, you have to carry that forward kind of indefinitely. There are a number of other things that have happened that are all kind of directly traceable to the same phenomenon. Like what happens if a user wants to do some testing on the machine and they pause Puppet. Now you don’t know when Puppet is going to come back online. Or what happens if a machine goes down for hardware maintenance and comes back? Well, many versions of software could’ve been released from the [...13:11] went down to the time it came back up. And if it comes back up in an unclean state, it might be running an old version of your application. And you don’t have any way to guarantee that that’s… You don’t have a good way to guarantee that that’s going to happen before the machine comes back up. So that’s the kind of class of problems that comes up when you have machines running. And we’re talking about machines here running for years.
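
Editor's note: the Apache-to-Nginx example can be modeled in a few lines. This is a toy model of declarative convergence on a long-lived machine, not Puppet's actual behavior or syntax; the `converge` function, the manifest format, and the `"absent"` marker are hypothetical names chosen for the sketch.

```python
def converge(machine, manifest):
    """Apply a declarative manifest to a long-lived machine, in place.

    machine:  dict of service name -> "running" (actual state)
    manifest: dict of service name -> "running" or "absent" (desired state)
    Only services named in the manifest are touched; anything else on the
    machine is simply left as it was.
    """
    for service, desired in manifest.items():
        if desired == "running":
            machine[service] = "running"
        elif desired == "absent":
            machine.pop(service, None)
    return machine


def web_listeners(machine):
    """Services currently running, i.e. competing for ports 80 and 443."""
    return sorted(name for name, state in machine.items() if state == "running")


# A machine that was first converged when Apache was the chosen web server.
old_machine = converge({}, {"apache": "running"})

# Naive change: delete the Apache code and add Nginx code.
converge(old_machine, {"nginx": "running"})
conflict = web_listeners(old_machine)  # Apache is unmanaged but still running

# The removal has to be stated explicitly, and kept for as long as any
# machine that once ran Apache might still exist.
converge(old_machine, {"apache": "absent", "nginx": "running"})
fixed = web_listeners(old_machine)

print(conflict, fixed)
```

The explicit `"absent"` entry is the cruft Dimitri describes: it must stay in the configuration for as long as any old machine might still carry the earlier decision.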

Armory I assume all this stuff happened to you [...13:39] Puppet, from a place of experience, I assume.

Dimitri I have seen every manner of this problem, and I’m aware of the class of problems because I’ve seen it in more than one company. So it’s just frustrating when you know that this is lurking. But [...13:57] people who haven’t been bitten by this aren’t aware and aren’t as motivated as you are to kind of make the transition because again, they sort of need to feel that pain. And then after two or three incidents, they’re like, “Oh, I see why this approach that we thought was the correct approach, has these limitations.” And I’m open and willing to learn about a new approach. So what’s interesting about our transition was that we still use Puppet, but we use Puppet once. We use Puppet at build time. So we’ll provision a machine, say a VM in EC2, run Puppet once to get it in the state that we want, to create that home John Smith user account, for example. But then Puppet gets turned off. And the idea there is when it’s only running once, there’s only so much damage that it can do. Either it’s right or it’s wrong. But it’s reproducible. It’s the same every time. It’s [...14:48] GitHub. You have an [...14:51] to see what exactly happened. You’re not worried about this convergence, drift, re-convergence problem because you’re not perpetually running Puppet. So I think configuration management still has a role to play. I think you can drastically reduce the number of things you needed to handle. But I think our transition was such that we wanted to make the minimum amount of changes to our infrastructure and get the maximum benefit. So the way that we did that was by [...15:18] Puppet on top of Packer and letting it do the builds to do all this convergence once and then saying, “Okay, let’s say John Smith leaves the company.” Now all we have to do is delete the John Smith entry from our Puppet Masters. And the next time we provision a new machine, it [kind of – 15:38] won’t be there.
We don’t have to go in and say, “Make sure user John Smith is deleted,” and every person that’s ever [...15:44] company doesn’t have to have [inaudible] of all of their user accounts running indefinitely because we don’t know how far back the oldest machine has been provisioned with the company. So the fundamental change there is that you take a little bit of the pain and say, “Okay, if I make a fundamental change like a person gets added or person gets deleted, then I need to build a new image.” But the trade-off there is you know that that image is the same across the entire fleet, and you don’t have to worry about all of the configuration drift from time immemorial.
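
Editor's note: the bake-once workflow can be sketched as follows. This is a simplified illustration, not Box's tooling or the Packer API; `bake_image`, `roll_fleet`, the user names, and the version labels are hypothetical.

```python
def bake_image(version, users):
    """Hypothetical build step: run configuration once and freeze the result."""
    return {"version": version, "users": frozenset(users)}


def roll_fleet(fleet_size, image):
    """Replace every instance with one launched from the same image."""
    return [image for _ in range(fleet_size)]


directory = {"alice", "jsmith"}
fleet = roll_fleet(3, bake_image("v1", directory))

# John Smith leaves: delete the entry, bake a new image, replace the fleet.
directory.discard("jsmith")
fleet = roll_fleet(3, bake_image("v2", directory))

# Every instance carries exactly the same user set; no removal rule is needed.
users_per_instance = {instance["users"] for instance in fleet}
print(users_per_instance)
```

The cost is a rebuild for every change; the benefit is that no instance can still carry state from an earlier configuration.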

Armory Right. So Box is currently on [...16:19] immutable.

Dimitri No, I wouldn’t go that far. We’re on the journey.

Armory You’re on the journey. How far would you say you are?

Dimitri I would say I personally have sort of optimized for… We have different server types just like most companies do. Say an upload machine is [...16:38] Box, a download machine. We optimize for turning [...16:44] immutable servers, the server types that have the larger server count. So we call them big rocks. Go after the big rocks. Go after the ones where you could do the least amount of effort and get the most amount of return. So we’ve turned, say, the top 10 server types into immutable servers. There are maybe a hundred if you [...17:08]. Those are just [inaudible] entirely accurate. But the point is we have done the work and made the work extensible so that now, to take a new server type and turn it into an immutable image, it’s just a single argument to like a Jenkins job. So we can turn into immutable [...17:24] that we want. And different teams have done this. The problem is they’re not yet over that hump where they understand immutable servers and all the changes in the way that we operate to make that top of mind for them. So we’re kind of in this hybrid where we could [...17:39] immutable for [inaudible]. If we could get all of our tooling updated and if we could get all the processes updated and get everybody on board for doing it that way. So it’s a transition period. It’s a very difficult period because people are confused, rightfully so, about how do I know which is which. So we’re trying to minimize that pain and we’re trying to bring in the correct tooling and the correct approaches to facilitate the transition, which takes us back to Spinnaker.

Armory So I think the important thing here is to note that you think that immutable is the future for Box.

Dimitri Yeah, I do and many other people do as well. You can kind of take a distribution or artifact agnostic approach and say, “It doesn’t really much matter if we’re talking about an AMI or an OpenStack image or a GCP image or a Docker container.” The fact is all of these artifact types all support and are built on top of this concept of immutable images. So the reason why [...18:47] another really large effort at Box here to get us on to containerization, to get us running Kubernetes, for example. So there are a lot of similarities to the approach that we took in EC2, which is immutable VMs and the approach that another team or other teams here have taken with regard to the same style but using containers as the artifacts. So they’re similarly pushing towards that goal. So we’re totally aligned on that. It’s something that’s pretty pervasive across all of Box. But the transition has been slow. It’s just been [...19:23] to make this all happen rapidly. So there are a lot of applications, a lot of engineers, a lot of processes, a lot of tooling that needs to come along for the ride.

Armory Yeah. Back to the Terraform stuff that we were talking about earlier, you said you had deployments with some shell scripts and you had it working with Terraform. Why [...19:49] work that you have done, that you invested in to going to Spinnaker? What changed there?

Dimitri The approach that we took got the job done for us, but it left us wanting more introspection and more control. So from my experience, Terraform is very good at [...20:11] stuff. It’s not very good at frequently deployed applications. It’s not that great at managing the applications themselves. I mean provisioning, scaling instances. So that’s where Spinnaker really shines. So we felt some pain around not having good tools or good UIs or good APIs, with good ability to kind of interact over the long haul. And that’s where I think Spinnaker is really, really strong. So my answer to that question is that we felt some pain.

Armory [...20:51].

Dimitri Well…

Armory [...20:52] trying to do and just couldn’t get done or couldn’t do well with your other [...20:56].

Dimitri Yeah. Nobody could really tell, for example, what were all the instances [...21:01] built inside of Terraform because Terraform kind of sets up these auto-scaling groups and then it kind of goes away. It’s just kind of a command-line tool. It’s not perpetually monitoring or available to monitor the health of what’s out there. So you kind of like set it and forget it. And the forgetting part made it difficult because now we had to go back to the AWS console that we were trying to get people off of in order to see an enumerated list of all the machines. So you’re kind of not really there if you have to use the console still. Your sign that you’re kind of [...21:34] is that you’re able to do all of your work without ever touching the console. And until you’re there, you know you have some pain or some leftover remnants of the old way of doing things that are just not quite gone.

Armory [...21:50] that gap, [inaudible] continue building [inaudible] on top, of filling the gap such that you would never have to get into the console [...21:59].

Dimitri We’re a software company. We can write software. But the question becomes is this software that we would write going to deliver direct value to our customers or not. And I think the answer is pretty obviously not. Our customers don’t much care how we deploy, how frequently we deploy, how many instances it takes. They care about using their application. They care about using the Box service. So this is sort of what Amazon’s CTO sometimes calls undifferentiated heavy lifting, the stuff that you should not spend your time working on. So undifferentiated in the sense that it’s not what’s giving you competitive advantage in the marketplace. It’s heavy lifting in that it’s like hard work, doing all this stuff that’s actually a lot of work. Writing this application code if we chose to do it would mean we have to support it and we have to write documentation, we have to train people on it. And by saying, “Look, there are other companies out there that are much bigger than we are and have much more experience with this such as Google, Netflix, and others,” we can effectively [...23:00] and say, “Okay, you guys have the hundreds and thousands of engineers behind this. We have a couple of hundred. You have run this problem for many years. We have not. So let’s let the community solve this problem. Let’s join that community and be a part of it and contribute in that way rather than reinventing the wheel ourselves.” So that’s the calculus that you go through. I think smaller companies and some engineers that are maybe more junior kind of always like to challenge and like the opportunity to go off and solve those problems from scratch. But it takes a certain amount of discipline to recognize which are the right ones to solve and which aren’t. And I think that the framework that I gave, which is, is this adding [...23:41] customer, yes or no, is how you should force yourself to make these decisions.
And when we make that decision along those lines, writing that [tool in – 23:50] ourselves does not add that value to our customer. So [...23:53] approaches.

Armory Obviously, we agree with that. So in terms of [...24:04] the console with the old tooling basically to visualize what’s there, can you explain a little bit about how Spinnaker allows you to visualize what’s there in your deployments and how it eased the pain [...24:19].

Dimitri Yeah. Spinnaker has this awesome kind of chiclet representation of [...24:24] with a series of squares or grids that represent these individual nodes and a color associated with each. That was a style of representation of visualization of data that we are very familiar with. So being able to quickly look at that at a glance and see how many machines are up and running, which are running the correct version, which are healthy, and then being able to drill down on each one of those if you so choose to find out more information such as the actual instance ID, the IP number, tags if you’ve created them, all that kind of stuff from a place that gives you a certain amount of control. There are certain things you can actually do that are not read-only inside of Spinnaker but not that many things, not as many things [...25:12] like the AWS console at large. It’s kind of creating a way to channel the user and give them the right amount of control but actually hide some of the stuff that you don’t want people to mess with. So it’s creating a series of guardrails. It’s creating a visualization. It’s creating a read-only tool with some amount of write that gives you the right level of control to do what you need to do but doesn’t arm you with the full suite of destructive actions that you could take such as setting the auto-scaling groups to zero accidentally inside of the UI and just [...25:52] service. So for me, that balance of primarily a read-only tool, primarily Spinnaker, I think, from the UI has been primarily through reading. But there are actions you can take. You can do deployments. You can set up pipelines. You can do all that kind of stuff. And then eventually, I think the goal for us is going to be to remove ourselves from using the Spinnaker UI and start treating that as an API as well.
So it’s the same cycle over again but just a pretty large step forward in terms of functionality and ease of use and [...26:25] step down in terms of complexity, number of things, number of buttons you can push and knobs you can twist and just focusing on the core, essential building blocks that make up your ephemeral services.

Armory Sorry, you’re breaking up a little bit there at the end.

Dimitri Sorry. So just giving you the right amount of control and removing some of the extraneous things that can kind of confuse you if you’re not familiar with the AWS UI or console.

Armory So what about things like [...27:11] versus Spinnaker? What about higher-order kind of deployment strategies like Red/Black deployment [...27:21] more commonly known [inaudible] strategies? How did you handle that before? Because I know that Jenkins 2.0 has pipelines. I’ve used them and we use them still. [...27:37] difference there between what you have before with Terraform and what you have now.

Dimitri Yeah. So Terraform, again, is not a purpose-built tool for deployments. It’s kind of a more general-purpose tool. And as a result, some of these more exotic deployment strategies such as the Highlander approach or the Blue/Green or Red/Black approach are first-class citizens in a tool like Spinnaker and may be representable inside of Jenkins too. I haven’t explored that too carefully. But because you have these best practices that come from operational experience, running many large services over a long period of time, those distill down to very specific, very predictable, and [named – 28:25] approaches like Blue/Green or Red/Black or Highlander approach that you can just avail yourself of without having to design it yourself and without having to craft it from scratch. So I think certainly, you could get to the same functionality [...28:42] then be writing what Spinnaker already kind of has built for you. So that’s the same discussion we had before. Are you really adding value by building more tooling on top of the primitives that you have? Or aren’t you better off by saying [...28:58]. Let me just take what they have built and conform my use case with their best practices.

So I think there are two things. One is the absence of a requirement to write the software yourself, which is always a bonus. And second is the shared learnings of the larger community and how those distill down to named deployment strategies that you can then understand and use. Those two are huge benefits because you can read the books. You can go out and read continuous integration, continuous deployment. You can follow the blog posts. You can understand all this stuff. And it makes sense conceptually. But in terms of rubber meeting the road, it’s really hard to build all that tooling and get all that operational knowledge yourself. And you can take advantage of the fact that other people have done that before you. So that’s again the calculus that we use and why Spinnaker is so attractive to us.
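
Editor's note: to make the named strategies concrete, here is a deliberately simplified model of what Red/Black and Highlander do to a set of server groups. It is a sketch for illustration, not Spinnaker's API or implementation; the `deploy` function and the dictionary layout are hypothetical, and real deployments also involve health checks and traffic shifting that are omitted here.

```python
def deploy(server_groups, new_version, strategy):
    """Return the server groups that exist after deploying new_version.

    server_groups: list of dicts such as {"version": "v1", "enabled": True}
    strategy:
      "red/black"  - bring up the new group, keep old groups but disable
                     them, so a rollback only needs to re-enable one.
      "highlander" - bring up the new group and destroy every older one,
                     so only one group remains.
    The input list is not modified.
    """
    new_group = {"version": new_version, "enabled": True}
    if strategy == "red/black":
        previous = [dict(group, enabled=False) for group in server_groups]
        return previous + [new_group]
    if strategy == "highlander":
        return [new_group]
    raise ValueError(f"unknown strategy: {strategy}")


before = [{"version": "v1", "enabled": True}]
red_black = deploy(before, "v2", "red/black")
highlander = deploy(before, "v2", "highlander")
print(red_black)
print(highlander)
```

Red/Black is also commonly called Blue/Green, which is why the two names appear interchangeably in the conversation.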

Armory We’re losing you on this computer. Why don’t you hit mute here, unmute, and then turn off the video here? That will turn it on here. Dimitri.

Dimitri Yeah, okay. [...30:41].

Armory Hey, can you hear us [...30:53]?

Dimitri Yup.

Armory Okay. All right, I think that’s where the majority of the questions we had really around Terraform and what you guys were doing and [...31:05]. I think the answer you gave is really what we advise people to do. It is undifferentiated, heavy lifting. I think people make the assumption that oh, it’s just a deployment. All you do is change the AMI [...31:22] Terraform ASG, and Terraform will handle everything for you. But there’s a lot more that goes into a deployment than just changing an AMI. And even if [...31:35] was simple, like you said, it’s still undifferentiated heavy lifting or any type of lifting and why not just apply that to your customer’s needs and requirements. So that’s really great feedback, and I might end up [...31:48].

Dimitri I [...31:53] and I make no apologies about doing so. I think that’s a great way to think about it. And I also have heard that phrase or that approach from other people of the community who I admire and respect and have learned a lot from. So I think that’s very good.

Armory Okay. Well, I don’t want to take up more of your time. I know you’re really busy. Hopefully, [...32:14] everything over to Spinnaker. But thanks a lot for the time, and we’ll see each other soon.

Dimitri Take care, guys. Thanks so much for your time today.

Armory All right, thank you. Bye.