k8s-autoscaler-benchmarker
The k8s-autoscaler-benchmarker helps administrators and developers optimize the scaling behavior of their EKS clusters. It offers a streamlined process for benchmarking the performance of Karpenter and Cluster Autoscaler on EKS workloads.
By providing timing metrics for EC2 instance initiation, node registration, pod readiness, node deregistration, and instance termination, it enables users to quickly test autoscaler settings such as consolidateAfter for Karpenter and scale-down-unneeded-time, node-delete-delay-after-taint, scale-down-delay-after-add, etc. for Cluster Autoscaler.
This tool also supports customization through a variety of parameters, ensuring that users can adapt the benchmarking process to their specific environment while also providing defaults for quick testing.
Currently Supported Features
- Benchmarking Metrics: The tool currently tracks the following metrics for a given autoscaler (Karpenter or Cluster Autoscaler):
  - Total time for EC2 instances to initiate their boot process after failed pod scheduling.
  - Total time for EC2 instances to register with the k8s API after initiating their boot process.
  - Total time for the deployment's pods to become ready after EC2 instances are registered with the k8s API.
  - Total time for EC2 instances to deregister from the k8s API after scaling a deployment to 0.
  - Total time for EC2 instances to terminate after scaling a deployment to 0.
- Customizable Parameters: A wide array of input parameters allows for customization of the k8s deployment used for benchmarking. Supply your own deployment by providing its name and namespace, or use an autogenerated deployment that is customizable via parameters.
- Clear Results Summary: Benchmark outcomes are concisely summarized to stdout.
- Flexible Environment Configuration: Supports optional parameters for specifying kubeconfig paths and AWS profiles with default values for ease of use.
Demo
* Note: Instance IDs and IP addresses have been redacted with x's in the demo videos below. The real program output will show your actual resource IDs.
Karpenter Example
The example below uses a user-provided Docker image passed in as a parameter. Because the user does not supply an existing deployment, a new deployment is created. Once the program is terminated, the deployment is deleted.
./k8s-autoscaler-benchmarker --nodepool k8s-autoscaler-benchmarker --replicas 2 --container-name redis --container-image redis/redis-stack
https://github.com/moebaca/k8s-autoscaler-benchmarker/assets/12791848/d80173ef-7d7b-426a-a0b1-7d2c82feb619
Benchmarks Summary
--------------------------------------------
Instance Initiation Time: 3.65 seconds
Instance Registration Time: 40.22 seconds
Pod Readiness Time: 31.46 seconds
Instance Deregistration Time: 20.12 seconds
Instance Termination Time: 96.24 seconds
--------------------------------------------
Cluster Autoscaler Example
The example below neither uses an existing deployment nor passes in a custom Docker image. A new deployment is created using the default "inflate" deployment. Once the program is terminated, the deployment is deleted.
./k8s-autoscaler-benchmarker --node-group k8s-autoscaler-benchmarker-ng --replicas 2
https://github.com/moebaca/k8s-autoscaler-benchmarker/assets/12791848/6578dfcb-6c88-49e3-bfe1-3407f82ff9b1
Benchmarks Summary
--------------------------------------------
Instance Initiation Time: 21.63 seconds
Instance Registration Time: 34.21 seconds
Pod Readiness Time: 3.45 seconds
Instance Deregistration Time: 133.70 seconds
Instance Termination Time: 132.05 seconds
--------------------------------------------
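To compare runs such as the two summaries above, the stdout summary can be parsed into numbers. The sketch below assumes the summary keeps the "<name>: <value> seconds" line shape shown in these examples. ParseSummary is a hypothetical helper written for this illustration and is not part of the tool.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseSummary extracts "<name>: <value> seconds" lines from a benchmark
// summary and returns the values in seconds, keyed by metric name.
// Lines that do not match (the heading and separators) are skipped.
func ParseSummary(out string) map[string]float64 {
	results := make(map[string]float64)
	for _, line := range strings.Split(out, "\n") {
		parts := strings.SplitN(strings.TrimSpace(line), ": ", 2)
		if len(parts) != 2 || !strings.HasSuffix(parts[1], " seconds") {
			continue
		}
		secs, err := strconv.ParseFloat(strings.TrimSuffix(parts[1], " seconds"), 64)
		if err != nil {
			continue
		}
		results[parts[0]] = secs
	}
	return results
}

func main() {
	// Values copied from the two example summaries in this README.
	karpenter := ParseSummary("Instance Initiation Time: 3.65 seconds\nPod Readiness Time: 31.46 seconds")
	clusterAutoscaler := ParseSummary("Instance Initiation Time: 21.63 seconds\nPod Readiness Time: 3.45 seconds")

	for _, name := range []string{"Instance Initiation Time", "Pod Readiness Time"} {
		fmt.Printf("%s: karpenter=%.2fs cluster-autoscaler=%.2fs\n",
			name, karpenter[name], clusterAutoscaler[name])
	}
}
```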
Prerequisites
- An active EKS cluster
- AWS CLI configured with access to the EKS Cluster
- kubectl configured with access to the EKS Cluster
- Go 1.16 or later installed on your machine
- For Karpenter:
  - Install Karpenter in the EKS Cluster
  - Set up a NodePool and EC2NodeClass similar to this NodePool example (Note: The eks.autify.com/k8s-autoscaler-benchmarker label and taint are required with their respective values for the default values to function correctly unless overridden via parameters)
- For Cluster Autoscaler:
  - Install Cluster Autoscaler in the EKS Cluster
  - Create a Managed Node Group similar to this Managed Node Group example (Note: The eks.autify.com/k8s-autoscaler-benchmarker label and taint are required with their respective values for the default values to function correctly unless overridden via parameters)
Installation
To use the k8s-autoscaler-benchmarker, clone this repository and build the tool using Go:
git clone https://github.com/moebaca/k8s-autoscaler-benchmarker.git && cd k8s-autoscaler-benchmarker
go build -o k8s-autoscaler-benchmarker
Usage
Execute the tool with the following command, providing any desired options:
./k8s-autoscaler-benchmarker [options]
| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| nodepool | The Karpenter node pool tag value to monitor. One of nodepool or node-group must be provided. | string | N/A | Yes* |
| node-group | The ASG node group name to monitor. One of nodepool or node-group must be provided. | string | N/A | Yes* |
| kubeconfig | Path to the kubeconfig file to use for CLI requests. | string | (uses default kubeconfig path) | No |
| aws-profile | The AWS profile to use for accessing EC2 services. | string | default | No |
| deployment | The name of the deployment to benchmark. If not supplied, one will be created automatically. This deployment WILL NOT be deleted upon program termination. | string | N/A | No |
| namespace | The namespace of the deployment. | string | default | No |
| replicas | The number of replicas to scale the deployment to. | int | 1 | No |
| container-name | The name of the container AND generated deployment if an existing deployment isn't supplied. This deployment WILL be deleted upon program termination. | string | inflate | No |
| container-image | The image of the container in the generated deployment if an existing deployment isn't supplied. | string | public.ecr.aws/eks-distro/kubernetes/pause:3.7 | No |
| cpu-request | The CPU request for the container in the generated deployment if an existing deployment isn't supplied. | string | 1 | No |
| toleration-key | The toleration key for the generated deployment if an existing deployment isn't supplied. | string | eks.autify.com/k8s-autoscaler-benchmarker | No |
| toleration-value | The toleration value for the generated deployment if an existing deployment isn't supplied. | string | N/A | No |
| node-selector-key | The node selector key for the generated deployment if an existing deployment isn't supplied. | string | eks.autify.com/k8s-autoscaler-benchmarker | No |
| node-selector-value | The node selector value for the generated deployment if an existing deployment isn't supplied. | string | true | No |
* Note: Either nodepool (for Karpenter) or node-group (for Cluster Autoscaler) is required for the tool to function correctly. Do not provide both parameters at the same time.
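The "exactly one of nodepool or node-group" rule in the note above could be enforced with Go's standard flag package roughly as follows. This is a sketch under assumptions, not the tool's actual code: parseTarget is a hypothetical helper, the error messages are invented, and only the two flag names are taken from the options table.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// parseTarget parses only the two target flags and checks that exactly one
// of them is set. It is a hypothetical helper written for illustration.
func parseTarget(args []string) (nodepool, nodeGroup string, err error) {
	fs := flag.NewFlagSet("k8s-autoscaler-benchmarker", flag.ContinueOnError)
	fs.StringVar(&nodepool, "nodepool", "", "Karpenter node pool to monitor")
	fs.StringVar(&nodeGroup, "node-group", "", "ASG node group name to monitor")
	if err = fs.Parse(args); err != nil {
		return "", "", err
	}
	switch {
	case nodepool == "" && nodeGroup == "":
		return "", "", errors.New("one of --nodepool or --node-group must be provided")
	case nodepool != "" && nodeGroup != "":
		return "", "", errors.New("--nodepool and --node-group cannot be used together")
	}
	return nodepool, nodeGroup, nil
}

func main() {
	// Three argument lists: one valid, one with both flags, one with neither.
	for _, args := range [][]string{
		{"--nodepool", "k8s-autoscaler-benchmarker"},
		{"--nodepool", "a", "--node-group", "b"},
		{},
	} {
		np, ng, err := parseTarget(args)
		fmt.Printf("args=%q nodepool=%q node-group=%q err=%v\n", args, np, ng, err)
	}
}
```

Using a dedicated FlagSet with ContinueOnError returns parse failures as errors instead of exiting the process, which keeps the check easy to unit test.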
Examples
Running a benchmark with all default settings:
With Karpenter:
./k8s-autoscaler-benchmarker --nodepool k8s-autoscaler-benchmarker
or with Cluster Autoscaler:
./k8s-autoscaler-benchmarker --node-group k8s-autoscaler-benchmarker
Benchmarking with Karpenter using an existing deployment in a custom namespace with 3 replicas:
./k8s-autoscaler-benchmarker --nodepool k8s-autoscaler-benchmarker --deployment my-deployment --namespace my-namespace --replicas 3
Benchmarking with Karpenter using a custom Docker image with 2 replicas:
./k8s-autoscaler-benchmarker --nodepool k8s-autoscaler-benchmarker --replicas 2 --container-name redis --container-image redis/redis-stack
Troubleshooting
- If the program prompts you about a timeout during the scaling of the deployment, please check for pod errors before exiting with 'no':
  - There may be an issue with taints/tolerations or labels not matching between the deployment and the node group/nodepool.
  - Cluster Autoscaler may not scale the node group right after its creation. I've found that manually setting min size and desired capacity to 1 and then back to 0 fixes this (only required right after initial creation).
- If you find the program stalls with only partial pod startup during the scaling of the deployment, the autoscaler may not be able to scale the entire deployment due to node group limits (e.g. maximum size of the node group reached). Use fewer replicas or increase the node group max size to fix this. Always restart the benchmark after making changes to the node group.
- If you find the program stalls with 0 pods starting up, check that no containers are in CrashLoopBackOff.
Contributing
Contributions to the k8s-autoscaler-benchmarker project are welcome. Please submit pull requests to the main repository.
TODO
More test coverage.
License
This project is licensed under the MIT License - see the LICENSE file for details.