Deep Dive into Clouddriver

Many of the questions we get from customers are really about Clouddriver: how to scale it and how to diagnose errors or performance issues. We're sharing an overview of the service (no walk through the source code, I promise) and some tips to operate Clouddriver at scale, in the hope that it will help the Spinnaker community. This is the first in a series of posts on Clouddriver.

What is Clouddriver used for?

When deploying your app, Clouddriver will create server groups, change load balancers, and inform the rest of the services of what's out there. It is the service that discovers the state of the world and changes it.

Clouddriver works by polling your cloud infrastructure at a regular interval and storing the result in a shared cache (more on that later).

It is used by the following services:

  • Orca to ensure the cache is up to date, and to query and modify cloud resources
  • Deck (via Gate) to display (cached) information to users
  • Fiat to query the permissions of “cloud” accounts and to get applications from cloud resources (e.g. name + clusters + tagging in AWS)
  • Igor to retrieve Docker image names and tags
  • Rosco to retrieve artifacts when baking manifests with Helm

Clouddriver itself initiates communication with:

  • Fiat when checking whether a user (as in Spinnaker user or service account) has the rights to change a cloud resource
  • Front50 when searching for app definitions or referencing Spinnaker-specific artifacts

How Clouddriver works

Clouddriver defines cloud providers (such as AWS, Azure, GCP, CloudFoundry, Oracle, DC/OS, Kubernetes, Docker). Each provider can have accounts (such as a Kubernetes cluster or an AWS account).

There are two main functional areas in Clouddriver: caching and mutating operations.

Caching in detail

Caching agents query your cloud infrastructure for resources and store the results in a cache store. Each provider has its own set of caching agents that are instantiated per account and sometimes per region. Each caching agent is specialized in one type of resource such as server groups, load balancers, security groups, instances, etc.

In reality, the number of caching agents varies greatly between providers and with your Clouddriver configuration.

For instance, AWS might have between 16 and 20 agents per region, caching things like the status of IAM roles, instances, and VPCs, plus some agents that operate globally for tasks such as cleaning up detached instances. Kubernetes (v2), on the other hand, might have a few agents per cluster, caching things like custom resources and Kubernetes manifests. We'll go over some of these specifics in a later post.
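
To make the agent model a bit more concrete, here is a minimal sketch in Python (Clouddriver itself runs on the JVM, so this is an illustration and not its code). The names `CachingAgent`, `InMemoryCacheStore`, and `fake_list_load_balancers` are hypothetical. The only point is that each agent is bound to one provider, account, region, and resource type, and writes whatever it finds to the cache store.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class InMemoryCacheStore:
    """Hypothetical stand-in for the cache store (Redis, SQL, ...)."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, dict]] = {}

    def put_all(self, agent_type: str, items: Dict[str, dict]) -> None:
        # Replace the previous snapshot written by this agent.
        self._data[agent_type] = dict(items)

    def get_all(self, agent_type: str) -> Dict[str, dict]:
        return dict(self._data.get(agent_type, {}))


@dataclass(frozen=True)
class CachingAgent:
    """One agent per provider/account/region/resource type (assumed layout)."""

    provider: str
    account: str
    region: str
    resource_type: str
    fetch: Callable[[str, str], List[dict]]  # (account, region) -> resources

    @property
    def agent_type(self) -> str:
        return f"{self.provider}/{self.account}/{self.region}/{self.resource_type}"

    def run(self, cache: InMemoryCacheStore) -> int:
        # Query the cloud provider, then store the result under this agent's key.
        resources = self.fetch(self.account, self.region)
        cache.put_all(self.agent_type, {r["id"]: r for r in resources})
        return len(resources)


def fake_list_load_balancers(account: str, region: str) -> List[dict]:
    # Made-up data standing in for a real cloud API call.
    return [{"id": f"{account}-{region}-lb-1"}, {"id": f"{account}-{region}-lb-2"}]


cache = InMemoryCacheStore()
agents = [
    CachingAgent("aws", "prod", region, "loadBalancers", fake_list_load_balancers)
    for region in ("us-west-2", "us-east-1")
]
for agent in agents:
    agent.run(cache)
```

With two regions configured, the sketch ends up with two independent agents, each owning its own slice of the cache.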

The cache store is where Clouddriver stores cloud resources. It comes in different flavors:

  • Redis - the default and most popular implementation
  • SQL - the new and still experimental SQL-backed store
  • Dynomite - Netflix's key value store
  • an in-memory cache that is not used for actual Spinnaker deployments

All these stores - with the exception of the in-memory store - work across multiple Clouddriver instances.

The agent scheduler is in charge of running caching agents at regular intervals across all Clouddriver instances. There are 5 types of schedulers:

  • the Redis-backed scheduler that will lock agents by reading/writing a key to Redis
  • the Redis-backed sort scheduler that does the same as above but can manage the order of agents being executed
  • the SQL-backed scheduler that locks agents by inserting a row in a table with a unique constraint - not very efficient, prefer other schedulers
  • the Dynomite-backed scheduler (and its sort variant) that is similar to its Redis counterpart but uses Dynomite as its store
  • the default scheduler that doesn't lock. Don't use it if you expect more than one Clouddriver to run.

Note that the cache store does not dictate the type of agent scheduler. For instance, you could use the SQL cache store along with the Redis-backed scheduler.
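
The idea of locking an agent by inserting a row under a unique constraint can be illustrated with Python's built-in sqlite3 module. This is a sketch under assumptions: the table name `agent_locks`, its columns, and the helper `try_lock` are made up for the example and are not Clouddriver's actual schema. Because the agent name is the primary key, only one insert can succeed until the row expires.

```python
import sqlite3
import time
from typing import Optional


def create_lock_table(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per locked agent.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS agent_locks ("
        " agent_name TEXT PRIMARY KEY,"
        " owner TEXT NOT NULL,"
        " expires_at REAL NOT NULL)"
    )


def try_lock(
    conn: sqlite3.Connection,
    agent_name: str,
    owner: str,
    ttl_seconds: float,
    now: Optional[float] = None,
) -> bool:
    """Return True if this owner obtained the lock, False if someone else holds it."""
    now = time.time() if now is None else now
    with conn:  # commits on normal exit, rolls back on an unhandled exception
        # Drop the lock first if it has expired.
        conn.execute(
            "DELETE FROM agent_locks WHERE agent_name = ? AND expires_at <= ?",
            (agent_name, now),
        )
        try:
            conn.execute(
                "INSERT INTO agent_locks (agent_name, owner, expires_at) VALUES (?, ?, ?)",
                (agent_name, owner, now + ttl_seconds),
            )
        except sqlite3.IntegrityError:
            # The primary key constraint rejected the insert: the lock is taken.
            return False
    return True


conn = sqlite3.connect(":memory:")
create_lock_table(conn)
agent = "aws/prod/us-west-2/loadBalancers"
first = try_lock(conn, agent, "clouddriver-1", 30, now=1000.0)
second = try_lock(conn, agent, "clouddriver-2", 30, now=1010.0)  # lock still held
third = try_lock(conn, agent, "clouddriver-2", 30, now=1031.0)   # lock has expired
```

A Redis-backed scheduler follows the same pattern with a key and a TTL instead of a row and an expiry column.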

If you read Clouddriver source code, you'll see references to cats (aka Cache All The Stuff), which is the framework that manages agent scheduler + agents + cache store.

Putting it all together

Now that we have all the primitives, the startup sequence should be intuitive: Clouddriver inspects its configuration and instantiates the cache store and the agent scheduler. For each provider enabled, agents are instantiated per account/region and added to the scheduler.

When the scheduler runs:

  • Agents not currently running on this instance of Clouddriver are considered.
  • The scheduler attempts to obtain a lock on the agent type/account/region to ensure a single Clouddriver caches a given resource at any given time.
  • If successful, the agent is then run in a separate thread.
  • When the agent finishes, the lock TTL is updated to match the next desired execution time.
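
The four steps above can be sketched as follows. It's a simplification with made-up names (`LockStore`, `Scheduler`, `run_cycle`), and an in-process dictionary stands in for the lock store that real Clouddriver instances would share (Redis, SQL, and so on).

```python
import threading
import time
from typing import Callable, Dict, List, Set


class LockStore:
    """Hypothetical shared lock store: agent name -> lock expiry time."""

    def __init__(self) -> None:
        self._expiry: Dict[str, float] = {}
        self._mutex = threading.Lock()

    def try_acquire(self, name: str, ttl: float) -> bool:
        now = time.monotonic()
        with self._mutex:
            if self._expiry.get(name, 0.0) > now:
                return False  # another instance holds the lock
            self._expiry[name] = now + ttl
            return True

    def set_ttl(self, name: str, ttl: float) -> None:
        with self._mutex:
            self._expiry[name] = time.monotonic() + ttl


class Scheduler:
    """One scheduler per Clouddriver instance (assumed structure)."""

    def __init__(self, locks: LockStore, agents: Dict[str, Callable[[], None]],
                 interval: float, timeout: float) -> None:
        self.locks = locks
        self.agents = agents
        self.interval = interval  # desired delay before the next execution
        self.timeout = timeout    # how long a running agent may hold the lock
        self._running: Set[str] = set()
        self._mutex = threading.Lock()

    def run_cycle(self) -> List[threading.Thread]:
        started = []
        for name, agent in self.agents.items():
            with self._mutex:
                if name in self._running:  # 1. skip agents already running here
                    continue
            if not self.locks.try_acquire(name, self.timeout):  # 2. try to lock
                continue
            with self._mutex:
                self._running.add(name)
            thread = threading.Thread(target=self._run_agent, args=(name, agent))
            thread.start()  # 3. run the agent in a separate thread
            started.append(thread)
        return started

    def _run_agent(self, name: str, agent: Callable[[], None]) -> None:
        try:
            agent()
        finally:
            # 4. keep the lock until the next desired execution time
            self.locks.set_ttl(name, self.interval)
            with self._mutex:
                self._running.discard(name)


# Two "instances" share one lock store; only one of them runs the agent.
locks = LockStore()
runs: List[str] = []
agents = {"aws/prod/us-west-2/loadBalancers": lambda: runs.append("ran")}
instance_a = Scheduler(locks, agents, interval=30.0, timeout=300.0)
instance_b = Scheduler(locks, agents, interval=30.0, timeout=300.0)
for thread in instance_a.run_cycle() + instance_b.run_cycle():
    thread.join()
```

Note how the lock doubles as the schedule: since the TTL is reset to the interval when the agent finishes, no instance can pick the agent up again before it is due.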

Operations in detail

Clouddriver has the concept of atomic operations - a single unit of work. Spinnaker pipeline tasks trigger these operations to mutate cloud resources.

There are more than 200 atomic operations available in Clouddriver, such as creating a server group, terminating EC2 instances, or deploying Kubernetes manifests.

Operation statuses are saved in a task repository that can be backed by Redis, SQL, an in-memory store, or a "dual" repository used to migrate from one store to the next seamlessly.

Operation Execution

  • If you have enabled authorization in your Spinnaker install, Clouddriver will first check that the user is authorized to write to the account and to the application (if the operation is part of an app's pipeline).
  • Clouddriver then sends the atomic operations to a thread pool, and returns a reference to a task ID.
  • The client (Orca) can then query the status of the tasks (also known as kato tasks) with the task ID.

Note that atomic operations that are sent together are immediately executed together in the same thread.
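
As a rough illustration of this flow, here is a small Python sketch. `TaskRepository`, `submit_operations`, and the two dummy operations are hypothetical names, and the authorization check is left out. Operations sent together run one after the other in the same worker thread, and the caller gets a task ID back right away.

```python
import threading
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


class TaskRepository:
    """Hypothetical in-memory task repository (Redis or SQL in a real install)."""

    def __init__(self) -> None:
        self._tasks: Dict[str, dict] = {}
        self._mutex = threading.Lock()

    def create(self) -> str:
        task_id = str(uuid.uuid4())
        with self._mutex:
            self._tasks[task_id] = {"history": [], "completed": False, "failed": False}
        return task_id

    def update(self, task_id: str, phase: str, status: str) -> None:
        with self._mutex:
            self._tasks[task_id]["history"].append({"phase": phase, "status": status})

    def finish(self, task_id: str, failed: bool) -> None:
        with self._mutex:
            self._tasks[task_id].update(completed=True, failed=failed)

    def get(self, task_id: str) -> dict:
        with self._mutex:
            return dict(self._tasks[task_id])


def submit_operations(pool: ThreadPoolExecutor, repo: TaskRepository,
                      operations: List[Callable[[], None]]) -> Tuple[str, Future]:
    """Queue a batch of operations and return the task ID immediately."""
    task_id = repo.create()

    def run_all() -> None:
        repo.update(task_id, "ORCHESTRATION", "Initializing Orchestration Task...")
        try:
            for op in operations:  # same thread, in order
                repo.update(task_id, "ORCHESTRATION", f"Processing op: {op.__name__}")
                op()
        except Exception as exc:
            repo.update(task_id, "ORCHESTRATION", f"Failed: {exc}")
            repo.finish(task_id, failed=True)
        else:
            repo.finish(task_id, failed=False)

    # The Future is returned only so this demo can wait; a real client polls by ID.
    return task_id, pool.submit(run_all)


def disable_asg() -> None:
    pass  # placeholder for a real cloud call


def resize_asg() -> None:
    pass  # placeholder for a real cloud call


repo = TaskRepository()
with ThreadPoolExecutor(max_workers=4) as pool:
    task_id, future = submit_operations(pool, repo, [disable_asg, resize_asg])
    future.result()  # wait for the worker thread to finish
task = repo.get(task_id)
```

The client side then boils down to calling something like `repo.get(task_id)` until `completed` is true.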

Atomic operations vary greatly in their complexity. They generally try to be atomic, but they aren't always (e.g. deploying multiple Kubernetes manifests). We won't cover how atomic operations are implemented here, but if you're interested, check out Clouddriver's code.

Looking into Clouddriver tasks

From a user perspective, Clouddriver tasks are not very visible. You can however spot these tasks in the source link of stages:


Each stage and task contains the history of Clouddriver executions under the kato.tasks key:

"context": {
  ...
  "kato.last.task.id": {
    "id": "1f24bc99-2b96-451a-807e-0c459fed12eb"
  },
  "kato.task.firstNotFoundRetry": -1,
  "kato.task.notFoundRetryCount": 0,
  "kato.tasks": [{
    "history": [{
      "phase": "ORCHESTRATION",
      "status": "Initializing Orchestration Task..."
    }, {
      "phase": "ORCHESTRATION",
      "status": "Processing op: DisableAsgAtomicOperation"
    }, {
      "phase": "DISABLE_ASG",
      "status": "Initializing Disable ASG operation for [us-west-2:deploy-preprod-v015]..."
    },
    ...],
    "id": "1f24bc99-2b96-451a-807e-0c459fed12eb",
    "resultObjects": [],
    "status": {
      "completed": true,
      "failed": false
    }
  }],
  ...


The history lists each task's status changes along with any output. It's quite useful for understanding what Spinnaker is actually doing under the hood and for troubleshooting potential issues.
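
If you have the JSON of a stage at hand, a few lines of Python are enough to turn that history into something readable. This is a sketch: the sample document below is a trimmed, completed version of the excerpt above (I'm assuming the stage object carries a top-level "context" key), and `summarize_kato_tasks` is a helper written for this post, not part of Spinnaker.

```python
import json
from typing import List

# Trimmed sample based on the excerpt above; the elided parts are left out.
stage_json = """
{
  "context": {
    "kato.last.task.id": {"id": "1f24bc99-2b96-451a-807e-0c459fed12eb"},
    "kato.tasks": [{
      "history": [
        {"phase": "ORCHESTRATION", "status": "Initializing Orchestration Task..."},
        {"phase": "ORCHESTRATION", "status": "Processing op: DisableAsgAtomicOperation"},
        {"phase": "DISABLE_ASG", "status": "Initializing Disable ASG operation for [us-west-2:deploy-preprod-v015]..."}
      ],
      "id": "1f24bc99-2b96-451a-807e-0c459fed12eb",
      "resultObjects": [],
      "status": {"completed": true, "failed": false}
    }]
  }
}
"""


def summarize_kato_tasks(stage: dict) -> List[str]:
    """Return one line per Clouddriver task, followed by its history entries."""
    lines = []
    for task in stage.get("context", {}).get("kato.tasks", []):
        status = task.get("status", {})
        if status.get("failed"):
            state = "FAILED"
        elif status.get("completed"):
            state = "COMPLETED"
        else:
            state = "RUNNING"
        lines.append(f"task {task.get('id')}: {state}")
        for entry in task.get("history", []):
            lines.append(f"  [{entry.get('phase')}] {entry.get('status')}")
    return lines


summary = summarize_kato_tasks(json.loads(stage_json))
print("\n".join(summary))
```

The last history entry of a failed task is usually the first place to look when a stage goes wrong.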

On-demand caching agents

We now have the main pieces of the puzzle:

  • When Spinnaker needs to execute a pipeline, it will run its stages. Each stage can be made of multiple tasks, and each task can trigger one or more Clouddriver atomic operations.
  • When Spinnaker needs information, it queries Clouddriver, which looks into its cache. The cache itself is kept up to date out of band by one or more instances of Clouddriver.

However, most cloud mutating operations are not synchronous. For instance, when Clouddriver sends a request to AWS to launch a new EC2 instance, the API call will return successfully but the instance will take a while before it's ready. Even in Kubernetes, a submitted manifest is accepted right away, but it can take a few seconds before the resource is considered ready. This is when Spinnaker uses on-demand caching agents.

On-demand caching agents are - as their name implies - created on demand by the client (Orca) in tasks such as Force Cache Refresh or Wait for Up Instances. They are used to ensure cache freshness and to know when a resource is created or effectively deleted.

The main gotcha is that when using a cache store that works across multiple Clouddrivers (like Redis), Clouddriver will wait for the next regular caching agent of the same type to run before declaring the cache consistent. It gives the cache store one more chance to replicate its state (to other replicas in the case of Redis).
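
Here is one way to picture that rule, as a small Python sketch. The class `OnDemandTracker` and its bookkeeping are assumptions made for illustration and do not mirror Clouddriver's data model. In particular, treating a single-instance store as immediately consistent is my reading of the paragraph above, not something taken from the code.

```python
from typing import Dict


class OnDemandTracker:
    """Hypothetical bookkeeping for on-demand cache refreshes."""

    def __init__(self, multi_instance_store: bool = True) -> None:
        self.multi_instance_store = multi_instance_store
        self._on_demand_at: Dict[str, float] = {}
        self._regular_run_at: Dict[str, float] = {}

    def record_on_demand(self, agent_type: str, at: float) -> None:
        # An on-demand agent wrote fresh data for this agent type.
        self._on_demand_at[agent_type] = at

    def record_regular_run(self, agent_type: str, at: float) -> None:
        # The regular (scheduled) agent of the same type completed a run.
        self._regular_run_at[agent_type] = at

    def is_consistent(self, agent_type: str) -> bool:
        requested = self._on_demand_at.get(agent_type)
        if requested is None:
            return True  # no on-demand refresh pending
        if not self.multi_instance_store:
            return True  # assumption: nothing to wait for with a local store
        last_regular = self._regular_run_at.get(agent_type)
        # Consistent only once a regular run happened after the on-demand write.
        return last_regular is not None and last_regular > requested


tracker = OnDemandTracker()
agent_type = "aws/prod/us-west-2/serverGroups"
tracker.record_on_demand(agent_type, at=100.0)
before = tracker.is_consistent(agent_type)       # no regular run yet
tracker.record_regular_run(agent_type, at=90.0)  # a run from before the request
stale = tracker.is_consistent(agent_type)
tracker.record_regular_run(agent_type, at=130.0)  # the next regular run
after = tracker.is_consistent(agent_type)
```

The practical consequence is that the interval of your regular caching agents puts a floor on how long tasks like Force Cache Refresh can take.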

Other operations

Clouddriver handles a couple more important functions that aren't described above:

  • Downloading artifacts from Jenkins, S3, GitHub, etc. for a pipeline.
  • Running jobs: you can run jobs in Kubernetes, DC/OS, and Titus. Jobs are a useful way to extend Spinnaker functionality.

And voilà! We're now equipped to understand potential bottlenecks and troubleshoot issues. We'll cover that in the next post.