Leveraging Node Groups to Segregate Workloads

Limian Wang
4 min read · Dec 10, 2021

Deploying workloads onto Kubernetes allows teams to significantly increase their deployment frequency. But as an organization scales, infrastructure teams increasingly need ways to manage Kubernetes clusters without depending on their customers, the service teams, for every change.

Let’s begin with a story, one that truly depicts what infrastructure teams have to work with, and one that hopefully illustrates just how much of a rollercoaster ride the journey can be.

A true story

On what seemed to be a normal evening, right before Christmas of 2020, an on-call engineer on my team was paged due to an increase in latency on a production service. It was almost 8PM EST. As always, we dug through the reported issue and saw nothing out of the norm that would indicate a failure of our network stack or underlying infrastructure. In parallel, the customer re-deployed their application, and the latency dropped (for now at least…).

In less than an hour, the symptom resurfaced, only this time with more services impacted.

I hopped on a call with my team and we began to slice and dice the logs and metrics we had in Datadog. We realized that a particular service was misbehaving, taking up all of the resources on its node (EC2 instance) and making every workload deployed on that node unresponsive (hence the latency increase).

Now that we knew the culprit, we paged the team that owned the suspected service and asked them to hop on the bridge. However, given the late hour, they were unable to push out a fix quickly with the level of confidence they needed. We (the infra engineers) were stumped. What were we going to do? How were we going to tell our customers that their services were likely to experience a network failure because of another team that, in most cases, had nothing to do with them? From the customers’ standpoint, they had deployed their services onto our infrastructure and were experiencing a network degradation that impacted our end users. This was definitely unacceptable.

Taking matters into our own hands

This incident gave us a clear goal: to own our own destiny. We needed to be able to control the overall health of our infrastructure regardless of who deployed what kinds of workloads onto it, without tenants affecting one another.

The need to manage workloads across various tenants creates the need for multi-tenancy. Perhaps you need to differentiate access levels between frontend and backend applications; perhaps you need to grant teams elevated access to their own deployments and resources, and theirs only.

However, multi-tenancy alone does not solve our incident. After all, as mentioned before, the underlying compute is still shared. Resource Quotas can cap how much each tenant consumes in aggregate, but they do not isolate a noisy neighbor from the other workloads on its node. What we needed was the ability to dynamically move workloads from one node group to another, thus reducing the blast radius.
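For reference, a per-namespace Resource Quota looks roughly like this (the namespace name and numbers are made up for illustration):

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-quota
      namespace: team-a            # hypothetical tenant namespace
    spec:
      hard:
        requests.cpu: "20"         # aggregate CPU the namespace may request
        requests.memory: 40Gi
        limits.cpu: "40"
        limits.memory: 80Gi

It caps aggregate consumption per namespace, but it says nothing about which nodes the pods end up on.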

Kubernetes offers several declarative options for deploying applications as multiple tenants across multiple node groups. This can be achieved through a combination of the following:

  • namespaces
  • taints and tolerations
  • node affinity and anti-affinity

All of these options work, and they work well. But they share a common downfall: they all require the owners of the Kubernetes manifests to be aware of the underlying infrastructure (they may depend on a namespace being created, or need to know which node labels and taints are available to them, etc.).
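As a concrete sketch of that coupling (the service name, node group, label key, and taint below are assumptions, not our actual setup), a team’s Deployment manifest targeting a dedicated node group ends up looking something like this:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: payments-api                    # hypothetical service
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: payments-api
      template:
        metadata:
          labels:
            app: payments-api
        spec:
          # The team has to know the taint the infra team put on the node group...
          tolerations:
            - key: node-group
              operator: Equal
              value: backend
              effect: NoSchedule
          # ...and the node labels the infra team happens to expose.
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: node-group
                        operator: In
                        values:
                          - backend
          containers:
            - name: payments-api
              image: payments-api:latest    # placeholder image

Every one of those placement details is infrastructure knowledge leaking into an application manifest.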

This also introduces significant operational overhead for upgrading and managing nodes, since every change requires updating the services’ Kubernetes manifests and waiting for the service teams to deploy them.

Enter Kyverno

Kyverno is a configuration-based policy engine built for Kubernetes. Based on the policies you define, it can mutate Kubernetes resources when they are created or updated.

Why is this so powerful?

The main premise that makes this so desirable is that the policies are created and maintained by us, the infrastructure engineers. We offer an execution substrate that works and is always up, and we strongly believe that our customers do not need to be involved in how it is run. We should be able to migrate workloads from one node group to another under the hood, or specify which types of EC2 instances workloads should be deployed onto (CPU-optimized vs. memory-optimized vs. general-purpose), and so on.
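As a rough sketch of what such a policy can look like (this is not our exact policy; the label key, node group name, and taint are assumptions), an infra-owned Kyverno mutation rule can inject the placement details so that service manifests never mention them:

    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: route-to-isolated-node-group
    spec:
      rules:
        - name: add-node-placement
          match:
            any:
              - resources:
                  kinds:
                    - Deployment
                  selector:
                    matchLabels:
                      node-group: isolated          # opt-in label controlled by infra
          mutate:
            patchStrategicMerge:
              spec:
                template:
                  spec:
                    nodeSelector:
                      node-group: isolated
                    tolerations:
                      - key: node-group
                        operator: Equal
                        value: isolated
                        effect: NoSchedule

The policy lives with us, the placement logic is applied at admission time, and the service team’s manifest stays untouched.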

Resolution to the incident

Luckily, we were already in the process of adopting Kyverno, so we immediately provisioned a new set of node groups in our clusters. We then created a policy that would place the problematic service onto the newly created node group. The mutation of the Kubernetes configuration happens under the hood, on the deployment event. Once the policy was applied to our production clusters, all we had to do was add a label to the deployments of the problematic app, and it was migrated off of the main node group. As soon as this was done, we saw the network latency drop.
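With a policy shaped like the sketch above, the migration itself boils down to a single infra-side command (the deployment name and namespace here are hypothetical):

    kubectl -n team-b label deployment noisy-service node-group=isolated

The label update triggers the mutation on admission, the Deployment’s pod template gains the node selector and toleration, and the subsequent rollout reschedules the pods onto the isolated node group.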

One of the core traits of an infrastructure engineer is the ability to build up a set of tools in the toolbox, and this is one of them. Being able to decouple infrastructure changes from the services allows great flexibility and agility in making the infrastructure robust.
