
Charles Guebels

Kubernetes nodes auto-scaling with Karpenter


When an EKS cluster is set up, it is common to use EC2 node groups to provision the compute capacity of the Kubernetes cluster. The challenge is to dynamically adapt these node groups to the computing demand of the hosted pods. The number of pods can already scale with demand thanks to HPA (Horizontal Pod Autoscaler) strategies, but the number of EC2 nodes backing the cluster is not dynamic. Karpenter’s goal is therefore to make the number and type of EC2 nodes dynamic. In other words, HPA scales the number of pods and Karpenter scales the number of EC2 nodes hosting these pods.
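As an illustration of the pod-scaling half of that picture, a minimal HPA manifest could look like the sketch below; the Deployment name my-app and the 70% CPU target are placeholder values. When the added replicas no longer fit on the existing nodes, the pods stay Pending, and that is where Karpenter steps in to provision additional EC2 capacity.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                 # placeholder Deployment to scale
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # add replicas when average CPU usage exceeds 70%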

In Amazon Elastic Kubernetes Service (EKS) there are mainly two ways to autoscale Kubernetes nodes: Cluster Autoscaler on AWS and Karpenter. This blog post focuses on the latter, Karpenter.

The main goal of a dynamic scaling strategy is of course to keep the cost to a minimum while increasing computing capacity as necessary.

As we will see later, Karpenter is highly configurable. We will not present all the Karpenter features and settings in this article, but rather give a good overview.

It is important to mention that, at the time of writing, Karpenter is still in beta.

Installation

Installing Karpenter requires the creation of two types of resources: Kubernetes resources and AWS resources. Indeed, Karpenter works using Kubernetes resources but also needs AWS resources (mainly permissions) to be able to manage the EC2 node instances, for example. We will not provide a complete installation guide, but rather explain the resources needed and how they can be deployed.

AWS resources

It is easy to guess that Karpenter needs IAM permissions to be able to interact with the EKS cluster, create EC2 instances, etc. In addition to these IAM permissions, Karpenter also needs a queuing resource such as an SQS queue. The queue receives events from the EC2 service about the status of the EC2 instances, so Karpenter knows when an instance will be stopped, retired, etc. and can react as quickly as possible.

The required AWS resources are documented in detail on this documentation page.

Karpenter maintains a CloudFormation template that creates all these AWS resources; it can be downloaded using this command:

export KARPENTER_VERSION=v0.33.1
curl https://raw.githubusercontent.com/aws/karpenter-provider-aws/"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml > cloudformation.yaml

Kubernetes resources

The recommended way to create the Karpenter Kubernetes resources is Helm, using the following commands:

# Logout of helm registry to perform an unauthenticated pull against the public ECR
helm registry logout public.ecr.aws

export KARPENTER_NAMESPACE=kube-system
export KARPENTER_VERSION=v0.33.1
export CLUSTER_NAME="karpenter-demo"
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "${KARPENTER_VERSION}" --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

But it is also possible to install it with plain old Kubernetes manifests by rendering the Helm chart, for example:

# Logout of helm registry to perform an unauthenticated pull against the public ECR
helm registry logout public.ecr.aws

export KARPENTER_NAMESPACE=kube-system
export KARPENTER_VERSION=v0.33.1
export CLUSTER_NAME="karpenter-demo"
helm template karpenter oci://public.ecr.aws/karpenter/karpenter --version "${KARPENTER_VERSION}" --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  > karpenter.yaml
kubectl apply -f karpenter.yaml

The list of possible Helm configuration values is available here.

Configuration

Once the Karpenter core resources are installed, it is time to configure Karpenter to fit our scaling strategy. To do this we need to configure and deploy two additional types of Kubernetes resources:

  • NodePools: This configuration defines the type of EC2 instance that Karpenter can provision and which pods can be scheduled on it.
  • NodeClasses: This configuration is more about configuring these provisioned EC2 instances (subnets, security groups, etc.).

Multiple Node Pools and Node Classes can be created in a Kubernetes cluster. A Node Class can be shared by multiple Node Pools, but a Node Pool can reference only one Node Class, as illustrated in the sketch below.
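For illustration, here is a minimal sketch of this relationship: two Node Pools, one for spot and one for on-demand capacity, both referencing the same Node Class named myNodeClass. The resource names, requirements and the v1beta1 API versions shipped with v0.33 are assumptions to adapt to your setup.

# Illustrative sketch: two NodePools sharing the EC2NodeClass "myNodeClass".
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-pool
spec:
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: myNodeClass        # shared Node Class
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot"]
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-pool
spec:
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: myNodeClass        # same shared Node Class
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]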

The main configuration of these two resources will be described in the sections below.

Node Pool configuration

A Node Pool defines attributes of the EC2 instances that can be provisioned, such as instance type, instance family and capacity type (on-demand or spot), but also which pods can be scheduled on these EC2 instances thanks to the taints and tolerations mechanism.

A Node Pool manifest template is available at the top of the documentation page, and some examples are available on GitHub. Below is a good overview of the possible Node Pool settings.

spec.template.spec.nodeClassRef

spec:
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: myNodeClass

This parameter defines which Node Class this Node Pool is attached to. See the next section for more information about Node Classes.

spec.template.spec.requirements

spec:
  template:
    spec:
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["m"]
        - key: "karpenter.k8s.aws/instance-cpu"
          operator: In
          values: ["4", "8"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["eu-north-1a", "eu-north-1b", "eu-north-1c"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]

This parameter defines filters that Karpenter uses to select the EC2 instance types it is allowed to provision. For example, with the configuration above, Karpenter is allowed to provision EC2 instances with instance category “m” AND with 4 OR 8 vCPUs AND that can be started in AZ eu-north-1a OR eu-north-1b OR eu-north-1c. This means that Karpenter can provision EC2 instances of type, among others, m5.xlarge, m5.2xlarge, m7g.xlarge, m7g.2xlarge, etc.

Regarding the capacity-type attribute, if we add both “spot” and “on-demand” as values, Karpenter will first try to get a spot instance (at the best price) and, if none is available, it will start an on-demand instance.

If we know exactly what instance types we want, we can do something simpler like:

spec:
  template:
    spec:
      requirements:
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["m5.xlarge", "m5.2xlarge"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]

All the instance types available and their attribute values are documented here.

The Karpenter documentation explains this parameter in detail on this page. How Karpenter chooses which EC2 instance configuration to provision is explained here.

spec.limits

spec:
  limits:
    cpu: 100
    memory: 200Gi

This setting defines the total amount of resources Karpenter is allowed to provision. It helps to control costs, but it can also be considered a safeguard that keeps resource provisioning under control in case of misconfiguration or attack. Note that this limit only applies to the resources provisioned by Karpenter as part of this Node Pool; it does not take into account resources provisioned in another way or through another Node Pool.

spec.weight

spec:
  weight: 10

If a pod can be scheduled by multiple matching Node Pools, Karpenter chooses the one with the highest weight.

Labels

spec:
  template:
    metadata:
      labels:
        monitored: "false"

This configuration allows specific Kubernetes labels to be added to each node provisioned by Karpenter.

One use case could be, for example, to use this label to prevent a specific DaemonSet from being scheduled on nodes provisioned by Karpenter. For this, we can add the above snippet to the NodePool manifest and, in the DaemonSet manifest itself, add the affinity configuration below:

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: monitored
                operator: NotIn
                values:
                - "false"

This affinity configuration means that the DaemonSet can be scheduled only on nodes for which the label named monitored is not set to false.

Taints

spec:
  template:
    spec:
      taints:
        - key: karpentable
          value: "true"
          effect: NoSchedule

The taints parameter defines which pods can be scheduled on the nodes provisioned by Karpenter. It is based on the Kubernetes taints and tolerations mechanism. In the example above, only the pods having the toleration karpentable=true:NoSchedule are allowed to be scheduled on the node. This means that in the deployment manifest of these pods we expect the snippet below:

spec:
  template:
    spec:
      tolerations:
        - key: karpentable
          operator: Equal
          value: 'true'
          effect: NoSchedule

Note that this kind of taint configuration also prevents DaemonSets from being scheduled on the node, which can cause very unwanted behavior. The trick to allow DaemonSets is to add the toleration below to the DaemonSet manifest:

spec:
  template:
    spec:
      tolerations:
        - operator: Exists

Node Class configuration

As specified above, the Node Class resource is used to configure the EC2 node instances that will be started by Karpenter when there is a capacity demand. A Node Class manifest template is available at the top of the documentation page, and a minimal sketch is shown below. The sections that follow give a good overview of the possible Node Class configuration.
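Here is a hedged sketch of a complete, minimal Node Class manifest; the karpenter.sh/discovery tag values and the role name are assumptions based on a cluster named karpenter-demo:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: myNodeClass
spec:
  amiFamily: AL2                               # Amazon Linux 2 based AMIs
  role: "KarpenterNodeRole-karpenter-demo"     # assumed node IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: karpenter-demo # subnets tagged for discovery (assumed tag)
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: karpenter-demo # security groups tagged for discovery (assumed tag)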

spec.amiFamily

spec:
  amiFamily: AL2

This parameter generates default values for the EC2 AMI, the user-data script and the attached block volume. Possible values are AL2, Bottlerocket, Ubuntu, Windows2019, Windows2022 and Custom. The Custom value means that no default values are generated for the AMI, user-data and block volume configurations.

The default AMI value is the latest AMI available for this family. The default user-data script, which depends on the chosen family, is documented here. The default block volume configuration is documented here.

These AMI, user-data and block volume default values can be overridden by the parameters spec.amiSelectorTerms, spec.userData and spec.blockDeviceMappings respectively, as in the sketch below.
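As an illustrative, hedged example of overriding the AMI and user-data defaults; the tag, AMI id and script content are placeholders:

spec:
  amiFamily: AL2
  # Restrict the AMIs Karpenter may select instead of using the latest AL2 AMI;
  # terms can match on tags, ids or names (values below are placeholders).
  amiSelectorTerms:
    - tags:
        Team: platform
    - id: ami-0123456789abcdef0
  # Additional bootstrap commands passed through the node user-data.
  userData: |
    #!/bin/bash
    echo "node provisioned by Karpenter" > /etc/motd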

spec.subnetSelectorTerms

spec:
  subnetSelectorTerms:
    - tags:
        Type: private
        Environment: prod
    - id: subnet-08facd8d9f7e5d87b

This parameter assigns subnets to the EC2 instances provisioned by Karpenter. It supports subnet discovery and selects all the subnets that match the configuration. For example, with the configuration above, Karpenter will assign to the EC2 instances all the subnets which have the AWS tags Type: private and Environment: prod OR whose subnet id is subnet-08facd8d9f7e5d87b. Several ways of filtering the subnets are possible, more details here.

spec.securityGroupSelectorTerms

spec:
  securityGroupSelectorTerms:
    - tags:
        Type: k8s
        Environment: prod
    - id: sg-06cfce69210ae20dc

Similarly to subnets, Karpenter allows specifying filters to discover all the security groups to attach to the provisioned EC2 instances. For example, with the configuration above, Karpenter will attach to the EC2 instances all the security groups which have the AWS tags Type: k8s and Environment: prod OR whose security group id is sg-06cfce69210ae20dc. More information here.

spec.role and spec.instanceProfile

These two parameters are mutually exclusive: exactly one of them must be set.

The goal of these parameters is to choose which IAM role or EC2 instance profile must be assigned to the provisioned EC2 node instances.
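For example, a hedged sketch using spec.role, assuming the node role created by the CloudFormation template for a cluster named karpenter-demo:

spec:
  # IAM role from which Karpenter derives the instance profile
  # attached to the provisioned nodes (the name is an assumption).
  role: "KarpenterNodeRole-karpenter-demo"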

More information about these parameters here (role) and here (instance profile).

spec.tags

spec:
  tags:
    BillingTeam: team-A
    Environment: prod

This parameter specifies which AWS tags Karpenter must add to each AWS resource created by this Node Class. This mainly concerns EC2 instances but can also apply to EBS volumes, for example. It can be very useful to add the specific tags used in billing reports, for example.

Note that some default AWS tags are always added by Karpenter. More information can be found here.

spec.blockDeviceMappings

spec:
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 75Gi
        volumeType: gp3
        iops: 15000
        encrypted: true
        kmsKeyID: "arn:aws:kms:eu-north-1:123456789123:key/1234abcd-12ab-34cd-56ef-1234567890ab"
        deleteOnTermination: true
        throughput: 125

This parameter overrides the default configuration of the block volume attached to the provisioned EC2 node instances. As we can see in the example above, we can fully configure the EBS volume, including encryption. Please note that in case of encryption with a KMS key, the AWS IAM policy attached to the Karpenter Kubernetes service account must be updated to allow the use of this KMS key.

Go further

This blog post is just an overview of Karpenter capabilities and configuration; it would not be very relevant to expand it with more specific and complex configurations. However, here are two advanced topics for whoever wants to deepen their knowledge of Karpenter:

Conclusion

Karpenter meets a real need for scaling Kubernetes nodes. It allows setting up a minimum capacity, automatically increasing it when needed and reducing it when the load decreases. Karpenter is highly configurable, and the good documentation makes configuration easy. Even though Karpenter is still in beta, it can already be deployed in non-production environments, where it can reduce costs, both in terms of provisioned capacity and cluster maintenance.
