A common saying is that "people leave managers, not companies." Even if that explains half of all departures, this post isn't about them; it's about keeping the other half happy and productive.
We’ve reached a point in the software industry where money is no longer the main reason engineers decide to join and stay at a company. Engineers are paid well because their skills are in high demand and short supply, now more than ever: every company has become a software company.
How the hierarchy of needs applies to Engineers
Maslow’s hierarchy of needs is a theory of human motivation which states that the lowest levels of the hierarchy must be satisfied before a person is motivated by higher-level needs.
The two lower levels, physiological and safety, cover life necessities and survival, and can often be met with money. Thanks to high salaries resulting from all-time-high demand for software developers, most software engineers are past levels 1 and 2 in the hierarchy.
The third level, (Social in the diagram above), includes the need for love & belonging. Within companies, we more commonly call it "culture". It’s why every company that desperately wants to hire engineers has a blog post to express its values and work style. Engineers who are looking for jobs can read the blog posts and see if they like the culture of that company; it’s a large reason why people choose to join a company. It’s the reason why I’m writing this post right now - to express what “love and belonging” means to me, as someone who works at Armory.
Let's apply the hierarchy of needs to a hypothetical example: If a candidate had offers from both Facebook and Google, their decision is less about compensation because both satisfy levels 1 & 2. Therefore the career choice between the two companies comes down to level 3 (the company culture). I have friends who turned down offers from Facebook as well as Google, basing their decision on where they would feel most “love and belonging”.
This leaves us with levels 4 & 5: esteem and self-actualization. These cannot be bought with more money, false praise, or any combination of words; they must be earned through personal achievement, accomplishment, and impact. It’s why so many people leave large corporations that fulfill physiological, safety, and love & belonging needs and instead join startups.
When I ask job candidates why they want to join Armory, they almost always answer with the same phrase: “I want to make an impact”. So how do software engineers make an impact? They deliver software that creates immense value for customers. It’s that simple. And yet, many engineering processes seem intent on preventing this very thing from happening.
There is a misconception that processes like review boards, manual QA, and management sign-offs reduce risk. People assume these initiatives work because the system in question doesn't change, and therefore no new risk is introduced. In reality, these processes not only stifle innovation but also anger and frustrate the very people they're intended to help. In effect, they reduce an engineer's ability to make that "huge impact".
The effects of these entrenched processes are staggering, as engineering turnover rates show. According to a study by SHRM, the average tenure in tech is three years, lagging behind the service industry's five-year average.
The problems of an in-house deployment solution
Engineers are happy when they can deliver valuable code to their customers in a transparent, reproducible, safe, and timely manner. Adding manual steps accomplishes none of these goals. In fact, manual steps push the deployment process in the opposite direction.
It’s easy to think of the software delivery process as the button that gets pressed to deliver code, but it’s much more than that. It includes everything that happens leading up to that point and moments afterward. So much time is lost in manual testing, reproducing stage environments, flipping flags, creating cloud resources, performing cumbersome integration testing, and communicating changes across the whole organization. For an engineer, every moment spent not writing code is time not impacting the world.
The most-used deployment system in many companies is "Jenkins plus X", where X is a solution that was developed in-house. Often, the engineer who built the in-house solution has since moved on to another company. Inevitably, the deployment system is some form of spaghetti code: glue scripts that hold all the moving parts together. Documentation is lacking or non-existent. It's essentially a black box that no one understands and no one wants to work on.
This is why any deployment solution must be transparent. Whatever deployment system you use, it must have the right amount of verbosity, providing enough information for the engineer performing the deployment to locate any issue easily and resolve it promptly. Black-box deployment solutions do the opposite: they hide vital information instead of being open about their deployment steps.
The second problem with in-house solutions that lack current ownership is instability. Sometimes they work as expected, and sometimes they have hiccups. That unreliability makes engineers nervous every time they deploy their code. The fear of a broken deployment is constant, especially in companies where rollback is a painful and lengthy process.
The deployment tool is also not seen as strategic, so it gets hardly any resources and is left riddled with bugs. Deployment code, like any code, needs to be constantly updated to accommodate new features and technologies, and issues with the deployment code itself should be found and fixed promptly.
This is where open source is king. Linus's Law states:
Given enough eyeballs, all bugs are shallow.
Relying on a large open source community to find and fix bugs is not only more valuable to your company, but it also allows an engineer to focus on delivering value to the customer instead of fixing custom tooling. In the majority of cases, the collective power of all the companies using an open source product greatly outperforms the expertise of an individual company maintaining an in-house solution.
In-house solutions, by definition, only benefit the expertise of the company that created them. Open source solutions, on the other hand, can be improved upon by any user of the software, which means that the potential contributor pool is much larger than a closed source/in-house solution.
Have you ever tried to reproduce a bug in a staging environment only to find that the staging environment doesn’t match production and you can’t reproduce it without days of effort?
Everyone has run into this issue. It's maddening for engineers. Having the staging environment be an exact clone of production is one of the core tenets of a sound delivery strategy. In practice, however, it is easier said than done, which is why a lot of companies have staging environments that are vastly different from their production clusters.
Embracing immutable infrastructure is another important factor in reproducible builds. Building your application and deployment pipeline to be immutable from the very beginning goes a long way toward a reliable deployment process. Your deployment system should be able to reproduce infrastructure quickly and deterministically, and your whole infrastructure should be stored in a code repository so it is always versioned and can be code reviewed. The more automation you have in place, the easier it is to reproduce, in staging, the issues found in production.
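To make this concrete, here's a minimal sketch (all names and values hypothetical) of infrastructure described as plain, version-controlled code: the entire environment is a deterministic function of its declared parameters, so staging can be rebuilt as an exact, scaled-down clone of production.

```python
# Hypothetical infrastructure-as-code sketch: every environment is a
# deterministic function of declared parameters, so rebuilds are
# reproducible and staging differs from production only where we say so.

BASE_SPEC = {
    "image": "myapp:1.4.2",        # pinned, immutable artifact
    "port": 8080,
    "health_check": "/healthz",
}

ENV_PARAMS = {
    "production": {"replicas": 12, "cpu": "2000m"},
    "staging":    {"replicas": 2,  "cpu": "500m"},
}

def render(env: str) -> dict:
    """Return the full spec for an environment. Identical inputs always
    produce identical output, which is what makes rebuilds deterministic."""
    if env not in ENV_PARAMS:
        raise ValueError(f"unknown environment: {env}")
    return {**BASE_SPEC, **ENV_PARAMS[env], "env": env}

# Staging and production run the exact same artifact:
assert render("production")["image"] == render("staging")["image"]
```

Because the spec lives in a repository, every change to it is versioned and code-reviewed like any other change, which is the property the paragraph above argues for.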
Bad deployments happen to everybody - they just happen more often to some than others. The deployment system should be able to predict whether the deployment is likely to succeed or fail. If it’s likely to fail, deployments should be halted before actually affecting the application that runs in production.
Even though assessing deployment risk is a difficult problem on its own, there are many simple ways to reduce this risk through automation:
- Automated canary deployments - The key word here is “automated”. While manual canaries do reduce risk, they’re also time-consuming and inefficient. It’s unlikely that a human can look at 100 datasets in Datadog and accurately assess whether to proceed or not.
- Code Coverage - Is unit test coverage increasing or decreasing? Decreasing coverage is cause for concern, and deployments to production should be reconsidered until coverage has recovered.
- Deployment Windows - It’s probably a bad idea to deploy at 5 p.m. on Friday. While it may be a low traffic period, it also means that nobody is around during the weekend in case of a failed deployment. Conversely, deploying during a high traffic period would also be a bad idea as the negative effect on users would be very disruptive.
- SLA - Calculating a score that your team can agree on is a great way to understand whether your code and deployments are getting better or worse. A downtrend in SLA points to a deeper issue and should be addressed before continuing to deploy at the current rate.
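The four checks above can be combined into a single automated gate that runs before every deployment. Here's a hedged sketch of what that might look like; the thresholds, metric names, and function signatures are illustrative assumptions, not a real implementation.

```python
# Hypothetical pre-deployment risk gate combining the four automated
# checks above. All thresholds and inputs are illustrative only.
from datetime import datetime

def canary_ok(baseline_error_rate: float, canary_error_rate: float,
              tolerance: float = 0.05) -> bool:
    """Automated canary analysis: fail if the canary's error rate is
    meaningfully worse than the baseline's."""
    return canary_error_rate <= baseline_error_rate * (1 + tolerance)

def coverage_ok(previous: float, current: float) -> bool:
    """Block the deploy when unit test coverage is decreasing."""
    return current >= previous

def window_ok(now: datetime) -> bool:
    """Avoid Friday-evening and off-hours deploys."""
    if now.weekday() == 4 and now.hour >= 16:   # Friday after 4 p.m.
        return False
    return 9 <= now.hour < 17                   # business hours only

def sla_ok(sla_scores: list[float]) -> bool:
    """Block the deploy when the recent SLA trend is downward."""
    return len(sla_scores) < 2 or sla_scores[-1] >= sla_scores[-2]

def should_deploy(baseline_err, canary_err, prev_cov, cur_cov,
                  now, sla_scores) -> bool:
    """Every check must pass before the deployment proceeds."""
    return (canary_ok(baseline_err, canary_err)
            and coverage_ok(prev_cov, cur_cov)
            and window_ok(now)
            and sla_ok(sla_scores))
```

The point isn't the specific thresholds: it's that each check is cheap to automate, and together they halt a risky deployment before it ever touches production.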
Deployments should deliver code to the environment as quickly as possible. For software engineers to be productive, they need a fast feedback loop. If we could plot engineering happiness over the day, I suspect we'd see dips whenever an engineer is waiting to see their changes in a given environment. Historically, engineers have short-circuited this process by manually making changes in an environment to see results faster, causing drift that hurts reproducibility.
There are many tips and tricks for reducing deployment time while staying immutable and safe, but that's a topic for another post.
Deployments Don’t Have To Be Terrible
The software industry should no longer tolerate unsafe, slow, and unreliable deployments. In fact, we at Armory believe that the word “deployment" should not be used at all; software upgrades should instead happen continuously in the background. When a new version of the application is complete, a customer should get it as soon as possible. No more waiting for VP approval, daunting late-night deployments, or other processes that add a false sense of security. It’s time to start valuing the software delivery process as the strategic process it is - a catalyst for innovation that ultimately keeps engineers happy and productive.