Auto-scaling in Kubernetes
In Kubernetes, a HorizontalPodAutoscaler automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of automatically scaling the workload to match demand.
Horizontal scaling means that the response to increased load is to deploy more Pods. This is different from vertical scaling, which for Kubernetes would mean assigning more resources (for example: memory or CPU) to the Pods that are already running for the workload.
If the load decreases, and the number of Pods is above the configured minimum, the HorizontalPodAutoscaler instructs the workload resource (the Deployment, StatefulSet, or other similar resource) to scale back down.
In this post, I will show how to use the Horizontal Pod Autoscaler (HPA) to automatically scale your application out and in, which keeps it performant while minimizing your costs.
How does a HorizontalPodAutoscaler work?
Kubernetes implements horizontal pod autoscaling as a control loop that runs intermittently (it is not a continuous process). The interval is set by the --horizontal-pod-autoscaler-sync-period parameter to the kube-controller-manager (and the default interval is 15 seconds).
Once during each period, the controller manager queries the resource utilization against the metrics specified in each HorizontalPodAutoscaler definition. The controller manager finds the target resource defined by the scaleTargetRef, then selects the pods based on the target resource's .spec.selector labels, and obtains the metrics from either the resource metrics API (for per-pod resource metrics) or the custom metrics API (for all other metrics).
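You can inspect the same per-pod resource metrics the controller manager uses, assuming the metrics-server add-on is installed in your cluster:

# Human-readable view of per-pod CPU and memory usage (requires metrics-server)
kubectl top pods

# The raw resource metrics API that the controller manager queries
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods"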
Horizontal Pod Autoscaler (HPA)
The Horizontal Pod Autoscaler automatically scales the number of Kubernetes pods based on resource utilization such as CPU. For example, if you target 30% CPU utilization for your pods but they are running at 60%, the HPA will automatically create new pods. If the CPU utilization falls below the target, for example to 20%, the HPA terminates pods. This ensures that you always run enough pods to keep your customers happy and, at the same time, helps you avoid wasting money by running too many pods. You can also scale on custom metrics such as response time, queue length, or hits per second.
With the HPA, you can configure the minimum and maximum number of pods. The maximum prevents the HPA from creating new pods indefinitely (until you run out of resources), while the minimum guarantees a baseline for high availability. By default, the HPA checks the metrics every 15 seconds; as mentioned above, you can configure this interval with the --horizontal-pod-autoscaler-sync-period flag.
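Before templating anything, you can try this behavior imperatively: kubectl ships an autoscale command that creates an HPA for an existing Deployment. The deployment name below is just an example:

# Create an HPA targeting 40% average CPU, between 1 and 8 replicas
kubectl autoscale deployment productmicroservice --cpu-percent=40 --min=1 --max=8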
How to create a Horizontal Pod Autoscaler
In the Helm chart's templates folder, create a new YAML file named hpa.yaml and paste the following code into it:
{{- if .Values.hpa.enabled -}}
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: {{ template "productmicroservice.fullname" . }}
spec:
  maxReplicas: {{ .Values.hpa.maxReplicas }}
  minReplicas: {{ .Values.hpa.minReplicas }}
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ template "productmicroservice.fullname" . }}
  targetCPUUtilizationPercentage: {{ .Values.hpa.averageCpuUtilization }}
{{- end }}
The above template creates a HorizontalPodAutoscaler only if the hpa.enabled flag is set to true in the values.yaml file. It configures the spec with the maximum and minimum number of replicas, a reference to the Deployment to scale, and finally the target metric, in this case CPU utilization. All values starting with .Values are provided by the values.yaml file.
Now add the following hpa section to the values.yaml file; its enabled flag turns the autoscaler on or off:
hpa:
  enabled: true
  minReplicas: 1
  maxReplicas: 8
  averageCpuUtilization: 40
With these values filled in, the hpa template renders to the following manifest (the name still comes from the fullname template):
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: {{ template "productmicroservice.fullname" . }}
spec:
  maxReplicas: 8
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ template "productmicroservice.fullname" . }}
  targetCPUUtilizationPercentage: 40
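If you want to verify the rendered output yourself, Helm can print it without installing anything. The release and chart names below match the ones used later in this post:

# Render only the HPA template and print the result to stdout
helm template product productmicroservice --show-only templates/hpa.yaml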
Note: you should never run only one pod for a production application. I would recommend running at least three pods to ensure high availability.
Deploy the Horizontal Pod Autoscaler
You can deploy the HPA either to a local Kubernetes cluster or to an Azure Kubernetes Service (AKS) cluster. You can check my previous posts: Deploy microservice to Kubernetes using Helm Charts or Local Kubernetes Cluster.
After the deployment is finished, check that the HPA got deployed correctly. You can use kubectl or the Octant dashboard to verify that the HPA values are set as expected.
Here I am using Kubernetes locally. After creating the hpa.yaml file and adding the hpa section to the values.yaml file, run the following command:
helm upgrade product productmicroservice
To check the HPA via kubectl, open a command prompt and run the following command:
kubectl get hpa
Then you can see the information about the HPA you configured above.
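The output looks roughly like the following; the name, target values, and age shown here are illustrative and will differ in your cluster:

NAME                  REFERENCE                        TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
productmicroservice   Deployment/productmicroservice   5%/40%    1         8         1          2m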
The other way is via the Octant dashboard: start Octant, navigate to Discovery and Load Balancing > Horizontal Pod Autoscalers, and you can see the Targets, Minimum Pods, Maximum Pods, and Replicas.
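To see the autoscaler react, you can generate some artificial load against the service and watch the replica count change. The service name productmicroservice is an assumption here; replace it with whatever service your chart exposes:

# Hammer the service with requests from a throwaway busybox pod
kubectl run load-generator --rm -it --image=busybox:1.28 --restart=Never \
  -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://productmicroservice; done"

# In a second terminal, watch the HPA scale the deployment up
kubectl get hpa --watch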
Additional Horizontal Pod Autoscaler features
Scaling Policies
Scaling policies let you control how fast the HPA may change the replica count while scaling. Note that the behavior field requires the autoscaling/v2 API version (autoscaling/v2beta2 on older clusters). The following code is an example for scaleDown:
behavior:
  scaleDown:
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
    - type: Percent
      value: 10
      periodSeconds: 60
periodSeconds indicates the length of time in the past for which the policy must hold true. The first policy (Pods) allows at most 4 replicas to be scaled down in one minute. The second policy (Percent) allows at most 10% of the current replicas to be scaled down in one minute. When multiple policies are specified, the policy that allows the greatest amount of change is selected by default.
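Because behavior lives inside the HPA spec, the autoscaling/v1 manifest from earlier has to be upgraded to autoscaling/v2 to use it. Below is a sketch of what that could look like for the deployment from this post; the names and numbers are carried over as assumptions:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: productmicroservice   # assumed name, taken from this post's chart
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: productmicroservice
  minReplicas: 1
  maxReplicas: 8
  metrics:
  # In autoscaling/v2 the CPU target moves into the metrics list
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 40
  behavior:
    scaleDown:
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
      - type: Percent
        value: 10
        periodSeconds: 60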
Stabilization window
The stabilization window is used to restrict the flapping of the replica count (rapid scaling up and down) when the metrics used for scaling keep fluctuating. The autoscaling algorithm uses this window to infer a previous desired state and avoid unwanted changes to the workload scale.
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
When the metrics indicate that the target should be scaled down, the algorithm looks into previously computed desired states and uses the highest value from the specified interval. In the above example, all desired states from the past 5 minutes will be considered.
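Stabilization windows and scaling policies can be combined. A sketch of a conservative scale-down paired with an immediate scale-up could look like this; the numbers are illustrative:

behavior:
  scaleDown:
    # Only scale down based on the highest recommendation of the last 5 minutes
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 10
      periodSeconds: 60
  scaleUp:
    # React to load spikes immediately
    stabilizationWindowSeconds: 0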
Conclusion
In this post, I have described the Horizontal Pod Autoscaler and how to configure automatic scaling: scaling out (scaleUp) to increase the throughput and performance of your application, or scaling in (scaleDown) to reduce the resources used and therefore the costs. Scaling can be performed on simple metrics like CPU utilization or on more complex custom metrics like response time or hits per second. We have also looked at other features like scaling policies and the stabilization window.
You can find the code of the demo on my GitHub.
In my next post, I will talk about the configuration of probes in Kubernetes.
This post is part of “Kubernetes step by step”.
Recommended reading: Kubernetes Horizontal Pod Autoscaling