13 Jun 2024

Nicolas Labrot
Developer, Cloud Architect

Optimizing Application Resilience: A Deep Dive into Kubernetes Pod Disruption Budgets and Rollout Strategies

Kubernetes is a powerful tool for managing containerized applications, but maintaining stability during updates and disruptions is crucial. This article explores two key components that help achieve this:

Pod Disruption Budgets (PDB).
Rollout Strategies.

Table of Contents

Pod Disruption Budget (PDB)
- Why Is PDB Important?
- How to Define and Use PDB
Kubernetes Rollout Strategy
- Rollout Strategies for Deployment
  - Recreate Strategy
  - RollingUpdate Strategy
- Rollout Strategies for StatefulSets
  - OnDelete Strategy
  - RollingUpdate Strategy
Conclusion
Sources

Pod Disruption Budget (PDB)

A PDB is a Kubernetes resource that specifies the minimum number or percentage of replicas of a pod that must be up and running at any given time. This budget ensures that the application can tolerate certain disruptions without a significant impact on its availability.

PDBs are designed to constrain the number of pods that can be disrupted through the eviction API, which is used by Kubernetes operations such as node draining (common during cluster upgrades) and other maintenance tasks. A PDB is a constraint over the actual number of replicas; it does not modify the desired replica count of a ReplicaSet or StatefulSet. Instead, it specifies the conditions under which pods can be disrupted, either by ensuring that a certain number of pods remain available or by limiting the number of pods that can be unavailable at any given time.

By setting a PDB, you inform Kubernetes of the acceptable level of disruption for your application, so that eviction requests are granted only when they keep the application within that budget.

A PDB does not constrain the delete API or Deployment updates:

Delete API: The act of deleting a pod manually is considered an administrative task and falls outside the scope of PDBs. It’s assumed that an administrator manually deleting pods understands the implications of such actions.
Deployment Updates: Deployment updates are managed through Kubernetes rollout strategies.

While PDBs don’t directly constrain deployment updates, proper rollout strategies (such as rolling updates) can be configured to minimize disruption. This will be detailed in the subsequent section.

Why Is PDB Important?

PDBs play a crucial role in maintaining the reliability and availability of applications in a Kubernetes cluster. Here are some key reasons why they are important:

Maintaining Service Availability: PDBs ensure that a sufficient number of pod replicas are always running, helping to maintain the availability of your application even during disruptions.
Facilitating Safe Node Upgrades and Maintenance: During cluster maintenance or node upgrades, PDBs help coordinate the draining of nodes by limiting the number of pods that can be disrupted at any time. This ensures that the application continues to function smoothly.
Improving Resilience to Failures: PDBs cannot prevent involuntary disruptions such as node failures, but those disruptions count against the budget. While the application is already degraded, fewer voluntary evictions are allowed, which improves overall stability.

How to Define and Use PDB

Creating a PDB in Kubernetes involves defining a YAML file that specifies the desired budget. The specification accepts two mutually exclusive parameters:

maxUnavailable: Indicates the maximum number of pods that can be unavailable at any time.
minAvailable: Indicates the minimum number of pods that must be available at any time.

Both parameters accept either an absolute number or a percentage.

A percentage applies to the actual number of configured replicas of the ReplicaSet or StatefulSet, including any changes made by the horizontal pod autoscaler (HPA). Kubernetes rounds the resulting value up to the nearest integer.
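To make the rounding rule concrete, here is a minimal Python sketch of how a budget value could be turned into a pod count. It is an illustration, not Kubernetes code: the function name `resolve_budget` is hypothetical, and the only rule it encodes is the one described above (percentages are taken against the current replica count and rounded up).

```python
from typing import Union


def resolve_budget(value: Union[int, str], replicas: int) -> int:
    """Turn a PDB value (absolute number or "N%") into a pod count.

    Hypothetical helper for illustration; percentages are rounded up.
    """
    if isinstance(value, str):
        if not value.endswith("%"):
            raise ValueError(f"expected a percentage such as '50%', got {value!r}")
        percent = int(value[:-1])
        # Integer ceiling division avoids floating-point rounding surprises.
        return (percent * replicas + 99) // 100
    return value


if __name__ == "__main__":
    # The same percentage yields a different pod count as an HPA scales the workload.
    for replicas in (4, 7, 10):
        print(replicas, "replicas ->", resolve_budget("50%", replicas), "pods")
```

With 7 replicas, a value of "50%" resolves to 4 pods rather than 3, because 3.5 is rounded up. An absolute value such as 2 is returned unchanged regardless of the replica count.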

Here’s an example of a PDB definition:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app

In this example:

maxUnavailable: Indicates the maximum number of pods that can be unavailable at any time.
selector: Matches the pods to which the PDB applies based on the selector.

When to Use `maxUnavailable`

High Availability: If your primary concern is ensuring that no more than a specific number of pods can be disrupted at any given time, use maxUnavailable. This is particularly useful for applications that can tolerate a defined number of pods being unavailable without significant impact.
Simplicity: For many applications, especially those with a relatively stable number of replicas, specifying maxUnavailable can be simpler and more intuitive.

Example Scenario: If you have an application running 10 replicas and you set maxUnavailable: 2, you ensure that no more than 2 pods can be evicted at the same time. This configuration can be beneficial during a node drain, for example as part of a cluster upgrade, to avoid excessive downtime.

When to Use `minAvailable`

Critical Minimum Availability: Use minAvailable when it’s critical to maintain a minimum number of running pods for your application to function correctly. This approach is particularly useful for applications that have strict availability requirements or a minimum number of pods to meet a quorum.
Dynamic Scaling: minAvailable is beneficial when the number of replicas can vary, such as in auto-scaling scenarios with the HPA. It ensures a certain level of availability regardless of the total number of pods.
Failover: In scenarios where certain components must always be available to handle failover or redundancy, minAvailable ensures that the necessary pods remain running.

Consideration

A maxUnavailable of 0 (or 0%) will prevent any pods from being evicted.
A minAvailable of 1 with a replica count of 1 will prevent the pod from being evicted. This can be problematic when an HPA is configured with a minimum of 1 replica; there is an ongoing discussion about this on GitHub.
A minAvailable of 100% will prevent any pods from being evicted.
A maxUnavailable of 1 is often a good starting point for many applications, but the optimal value depends on the specific needs and size of the application.
Because a PDB constrains the actual number of replicas, it does not support surge configurations (i.e., temporarily running more pods than the configured replicas). This means that for any eviction to occur under a PDB, the budget must allow at least one pod to be unavailable.
A PDB can block evictions but cannot ensure availability when the replica count is 1; for high availability, multiple replicas are required.
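The considerations above can be checked with a small Python model of how many evictions a budget leaves room for. This is a simplified sketch under stated assumptions, not the actual controller logic: `disruptions_allowed` and `_to_count` are hypothetical names, `replicas` stands for the configured replica count, and `healthy` for the number of pods currently ready.

```python
from typing import Optional, Union

IntOrPercent = Union[int, str]


def _to_count(value: IntOrPercent, replicas: int) -> int:
    """Resolve an absolute number or "N%" into a pod count, rounding percentages up."""
    if isinstance(value, str):
        return (int(value.rstrip("%")) * replicas + 99) // 100
    return value


def disruptions_allowed(
    replicas: int,
    healthy: int,
    min_available: Optional[IntOrPercent] = None,
    max_unavailable: Optional[IntOrPercent] = None,
) -> int:
    """Number of pods that may be evicted right now (simplified model)."""
    # The two parameters are mutually exclusive: exactly one must be set.
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = _to_count(min_available, replicas)
    else:
        desired_healthy = replicas - _to_count(max_unavailable, replicas)
    # Pods that are already unhealthy consume the budget, so never go below zero.
    return max(0, healthy - desired_healthy)


if __name__ == "__main__":
    print(disruptions_allowed(3, 3, max_unavailable=0))       # evictions blocked
    print(disruptions_allowed(1, 1, min_available=1))         # single replica blocked
    print(disruptions_allowed(5, 5, min_available="100%"))    # evictions blocked
    print(disruptions_allowed(3, 3, max_unavailable=1))       # one eviction possible
```

In this model the first three calls return 0 and the last returns 1. Note also that `disruptions_allowed(3, 2, max_unavailable=1)` returns 0: a pod that is already unhealthy uses up the budget, so no further eviction is granted.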

Best Practices for Using PDBs

Understand Your Application’s Tolerance for Disruption: Before setting a PDB, it’s essential to understand how much disruption your application can tolerate without impacting its functionality. This requires thorough testing and analysis.
Monitor and Adjust PDBs Regularly: As your application evolves, its tolerance for disruption may change. Regularly monitor the performance and adjust the PDB settings to reflect the current needs of your application.
Combine PDBs with Other Resiliency Strategies: PDBs are just one part of a comprehensive resiliency strategy. Combine them with other Kubernetes features like ReplicaSets, Deployments, and HPA to ensure robust application availability.
Consider Different PDBs for Different Components: Complex applications often consist of multiple components, each with different availability requirements. Define separate PDBs for each component to fine-tune the availability settings based on their specific needs.

Kubernetes Rollout Strategy

Kubernetes offers rollout strategies to update applications gradually, minimizing risk. From rolling updates to canary releases, these strategies ensure smooth and reliable updates.

A Rollout Strategy in Kubernetes lets you configure how updates to applications (deployment, StatefulSet) are managed and deployed. This is crucial for ensuring that new versions of applications are rolled out smoothly, with minimal disruption to users.

Minimized Downtime: By using a strategic approach to updates, Rollout Strategy ensures that downtime is minimized during the deployment of new application versions.
Controlled Deployment: Rollout strategies allow for controlled and gradual deployment of new versions, which helps in identifying and mitigating issues early in the process.
Improved User Experience: By avoiding abrupt and complete shutdowns of services, rollout strategies contribute to a seamless user experience.

The rollout strategies of a Deployment and a StatefulSet are different and will be detailed in the following sections.

Rollout Strategies for Deployment

Deployments are used for stateless applications where the state is not stored in the pods themselves but in external storage. In a Deployment, pods are interchangeable and do not have unique, persistent identities. The rollout strategy of a Deployment is tailored to those characteristics and supports two strategies: Recreate and RollingUpdate.

Recreate Strategy

The Recreate strategy is the simplest form of rollout. It works by terminating all existing pods of the current version before deploying the new version. This approach guarantees that only one version of the application is running at any time.

Pros:

Simple to implement.
Ensures complete replacement of the old version. It can be particularly useful when the new version of an application and its external storage are not compatible with the previous one (e.g., a database schema that must be migrated to the newest version).

Cons:

Causes downtime as old pods are stopped before new ones start.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  strategy:
    type: Recreate

RollingUpdate Strategy

This is the default and most common strategy. It gradually replaces old pods with new ones. This method ensures that the application remains available during the update process.

Pros:

Zero downtime during updates.
Allows for controlled and gradual updates.

Cons:

It requires that the versions (and datastore) are backward compatible.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1

Key Parameters:

maxSurge: Specifies the maximum number of pods that can be created over the desired number of replicas during the update. maxSurge helps to ensure the availability and performance of the applications during the rollout process. It accepts either a number or a percentage.
maxUnavailable: Indicates the maximum number of pods, relative to the desired number of replicas, that can be unavailable during the update process. A higher maxUnavailable can speed up the rollout. It also accepts either a number or a percentage.

Unlike a PDB, the rollout strategy supports surge: maxSurge lets the Deployment temporarily run more pods than the configured number of replicas. This parameter, in conjunction with maxUnavailable, ensures that there are enough pods running and ready during the update. However, this requires that your cluster has enough capacity to handle the additional pods.

For example, considering a deployment with 4 replicas:

With maxUnavailable: 1 and maxSurge: 0, during the rollout, while 1 pod is being updated, 3 pods will remain ready. The capacity will be degraded.
With maxUnavailable: 0 and maxSurge: 1, during the rollout, an additional pod is created and must become ready before an old pod is removed, so 4 pods will remain ready. The capacity will be preserved.
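The arithmetic behind these scenarios can be sketched in a few lines of Python. This is a simplified model, not the Deployment controller itself: `rolling_update_bounds` is a hypothetical helper that returns the minimum number of ready pods and the maximum total number of pods during a rollout. It assumes that a maxSurge percentage is rounded up, that a maxUnavailable percentage is rounded down, and that both default to 25%.

```python
from typing import Tuple, Union

IntOrPercent = Union[int, str]


def _resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Resolve an absolute number or "N%" into a pod count using integer math."""
    if isinstance(value, str):
        scaled = int(value.rstrip("%")) * replicas
        return (scaled + 99) // 100 if round_up else scaled // 100
    return value


def rolling_update_bounds(
    replicas: int,
    max_surge: IntOrPercent = "25%",
    max_unavailable: IntOrPercent = "25%",
) -> Tuple[int, int]:
    """Return (minimum ready pods, maximum total pods) during a rolling update."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # With neither surge nor unavailability the rollout could not progress.
        # Simplification: this sketch just rejects the case.
        raise ValueError("max_surge and max_unavailable cannot both be 0")
    return replicas - unavailable, replicas + surge


if __name__ == "__main__":
    for surge, unavailable in ((0, 1), (1, 0), (1, 1)):
        low, high = rolling_update_bounds(4, surge, unavailable)
        print(f"maxSurge={surge} maxUnavailable={unavailable}: "
              f"at least {low} ready, at most {high} pods")
```

For 4 replicas, the model gives (3, 4) with maxSurge 0 and maxUnavailable 1, (4, 5) with maxSurge 1 and maxUnavailable 0, and (3, 5) when both are set to 1. Only a maxUnavailable of 0 keeps the lower bound at the full replica count; maxSurge raises the upper bound and therefore the capacity the cluster must be able to schedule.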

Rollout Strategies for StatefulSets

StatefulSets are used for applications that require persistent storage and stable network identities. Their rollout strategy is tailored to those characteristics and differs from that of Deployments. It supports two strategies: OnDelete and RollingUpdate.

OnDelete Strategy

The OnDelete strategy for StatefulSets requires manual intervention to delete old pods. New pods with the updated version are created only after the old pods are deleted by the user.

Pros:

Provides full control over the update process.
Useful for scenarios where updates need to be carefully managed or where automated updates are undesirable.

Cons:

Requires manual intervention, which can be error-prone and time-consuming.
Slower rollout process due to manual steps.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-stateful-app
spec:
  updateStrategy:
    type: OnDelete

RollingUpdate Strategy

This is the default and most common strategy. It updates pods incrementally, ensuring that a certain number of old pods are kept running while new ones are brought up. This minimizes downtime and allows for a more seamless transition.

The update process is slightly different from Deployments. Pods are updated sequentially, one at a time, in the order defined by their ordinal index (from highest to lowest, i.e., from N-1 down to 0).

NOTE: The podManagementPolicy, which defaults to OrderedReady, only affects scaling operations; it does not change how updates are applied. The number of pods updated at a time is controlled by rollingUpdate.maxUnavailable.

Pros:

Maintains application availability with zero downtime.
Ensures sequential and controlled updates, critical for stateful applications.

Cons:

Slower update process due to sequential pod updates.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-stateful-app
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      partition: 0

Key Parameters:

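maxUnavailable: Indicates the maximum number of pods that can be unavailable during the update. Depending on the Kubernetes version, this field may require the MaxUnavailableStatefulSet feature gate to be enabled.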
partition: All Pods with an ordinal that is greater than or equal to the partition will be updated.

StatefulSets, due to their stateful nature, do not support the maxSurge parameter.

As stated in the Kubernetes documentation, in most cases you will not need to use a partition, but they are useful if you want to stage an update, roll out a canary, or perform a phased rollout.

For example, to perform a 50%/50% canary deployment on a StatefulSet configured with 4 replicas:

First, configure the partition to 2.
Trigger the update of the StatefulSet.
The pods with ordinals 2 and 3 will be updated in the order: 3 first, then 2.
Once the new application version is validated on pods 2 and 3, configure the partition to 0.
The pods with ordinals 0 and 1 will be updated in the order: 1 first, then 0.
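The steps above follow directly from the partition rule. As a minimal sketch (the function name `statefulset_update_order` is hypothetical and only models the ordering described in this section), the ordinals touched by a rolling update can be computed like this:

```python
from typing import List


def statefulset_update_order(replicas: int, partition: int = 0) -> List[int]:
    """Ordinals updated by a rolling update, in the order they are updated.

    Only pods with an ordinal >= partition are updated, from highest to lowest.
    """
    if replicas < 0 or partition < 0:
        raise ValueError("replicas and partition must be non-negative")
    return list(range(replicas - 1, partition - 1, -1))


if __name__ == "__main__":
    # Canary phase: partition 2 on a 4-replica StatefulSet.
    canary = statefulset_update_order(4, partition=2)
    print("canary phase:", canary)

    # Full rollout: partition 0; pods updated during the canary are already current.
    remaining = [o for o in statefulset_update_order(4, partition=0) if o not in canary]
    print("remaining phase:", remaining)
```

With 4 replicas, a partition of 2 yields the ordinals 3 and 2, and lowering the partition to 0 leaves 1 and 0 to be updated. A partition equal to or greater than the replica count yields an empty list, meaning no pod is updated.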

Conclusion

In Kubernetes, PDB and rollout strategies address different types of disruptions to ensure application stability and availability:

PDBs focus on maintaining a minimum number of operational pods during voluntary disruptions, such as node maintenance or cluster upgrades, to keep critical services running smoothly.
Rollout strategies, on the other hand, manage the deployment of new application versions, reducing risks and ensuring seamless updates through methods like rolling updates and canary releases.

By effectively implementing both PDBs and rollout strategies, you can enhance the resilience and reliability of your Kubernetes-managed applications, ensuring they remain stable and available even during disruption and updates.

Aspect	Pod Disruption Budgets (PDB)	Rollout Strategies
Purpose	Constrains the number of pods that must stay operational during disruptions	Manages how new application versions are deployed
Focus	High availability and resilience during maintenance	Smooth and reliable update process
Main Functionality	Constrains the number of pods that must stay operational	Gradually introduces new versions using methods like rolling updates and canary releases
Type of Disruption	Voluntary disruptions (e.g., maintenance, node or cluster upgrades)	Application updates and version releases
Difference	It is a constraint over the replicas during an eviction	It is a constraint over the replicas during an update, but it can also change the number of running and ready pods (surge)
Limitation	Constrains evictions but, by design, does not prevent fewer pods from being ready than the number of replicas	/
