KEDA (Kubernetes-based Event-Driven Autoscaling) is an open-source project built by Microsoft in collaboration with Red Hat. It provides event-driven autoscaling for containers running on AKS (Azure Kubernetes Service), EKS (Amazon Elastic Kubernetes Service), GKE (Google Kubernetes Engine) or on-premises Kubernetes clusters 😉
KEDA allows for fine-grained autoscaling (including to/from zero) for event-driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition. Most of the time, we scale systems (manually or automatically) based on metrics that cross a threshold.
For example, if CPU > 60% for 5 minutes, scale our app service out to a second instance. The catch is that by the time we’ve raised the trigger and completed the scale-out, the burst of traffic/events has often passed. KEDA, on the other hand, exposes rich event data, such as the length of an Azure Queue, to the Horizontal Pod Autoscaler so that it can manage the scale-out for us. Once one or more pods have been deployed to meet the event demands, events (or messages) can be consumed directly from the source, which in our example is the Azure Queue.
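To make the queue example concrete, here is a minimal sketch of a KEDA ScaledObject that scales a Deployment on Azure Queue length. It assumes KEDA 2.x field names; the Deployment name, queue name and environment variable are hypothetical placeholders, so check them against the KEDA version you have installed.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-consumer-scaler          # hypothetical name
  namespace: default
spec:
  scaleTargetRef:
    name: queue-consumer               # hypothetical Deployment that reads the queue
  minReplicaCount: 0                   # allow scaling down to zero when the queue is empty
  maxReplicaCount: 10
  triggers:
  - type: azure-queue
    metadata:
      queueName: orders                # hypothetical queue name
      queueLength: "5"                 # target number of messages per replica
      # Name of an env var on the target container that holds the
      # storage account connection string (assumed to exist).
      connectionFromEnv: STORAGE_CONNECTION_STRING
```

With minReplicaCount set to 0, the consumer only runs while there are messages waiting, which is the to/from zero behaviour mentioned above.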
Thinking that you managed to escape security? Guess again.
In this post, I will focus on containers with Azure Kubernetes in mind, but these examples apply to all the other worlds out there: EKS, GKE, and on-premises.
Containers gained popularity because they made publishing and updating applications a straightforward process. You can go from build to release very fast with little overhead, whether it’s the cloud or on-premises, but there’s still more to do.
The adoption of containers in recent years has skyrocketed in such a way that we see enterprises in banking, the public sector, and health either looking into containers or already deploying them. The main difference is that these companies have a more extensive list of security requirements, and those need to be applied to everything. Containers included.
I’ve discussed AKS before, but recently I have been doing a lot of production deployments of AKS, and the most recent one was with NVIDIA GPUs.
This blog post will take you through what I learned from a deployment of this type because, boy, some things are not as simple as they look.
The first problems come after deploying the cluster. Most of the time, if not always, the NVIDIA driver doesn’t get installed and you cannot deploy any type of GPU-constrained resources. The solution is basically to install an NVIDIA daemon and go from there, but that also depends on the AKS version.
For example, if your AKS is running version 1.10 or 1.11, then the NVIDIA device plugin must be 1.10 or 1.11, or whichever tag matches your version, located here
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    kubernetes.io/cluster-service: "true"
  name: nvidia-device-plugin
  namespace: gpu-resources
spec:
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure. This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
      # This, along with the annotation above marks this pod as a critical add-on.
      - key: CriticalAddonsOnly
        operator: Exists
      containers:
      - image: nvidia/k8s-device-plugin:1.10 # Update this tag to match your Kubernetes version
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
      nodeSelector:
        beta.kubernetes.io/os: linux
        accelerator: nvidia
The snippet above creates a DaemonSet that runs the NVIDIA device plugin on every node matched by the node selector, which is what makes the GPUs schedulable in your cluster. So for three GPU nodes, you will have three NVIDIA pods.
The problem that can appear is when you upgrade your cluster. You go to Azure and upgrade the cluster and guess what, you forgot to update the yaml file and everything that relies on those GPUs dies on you.
The best example I can give is the TensorFlow Serving container which crashed with a very “informative” error that the Nvidia version was wrong.
Another problem that appears is monitoring. How can I monitor GPU usage? What tools should I use?
Here you have a good solution which can be deployed via Helm. If you do a helm search for prometheus-operator you will find the best solution to monitor your cluster and your GPU 🙂
The prometheus-operator chart comes with Prometheus, Grafana and Alertmanager, but out of the box you will not get the GPU metrics that are required for monitoring. The cause is an error in the Helm chart, which scrapes the cAdvisor metrics over HTTPS. The solution is to set the kubelet exporter’s https value to false.
kubelet:
  enabled: true
  namespace: kube-system
  serviceMonitor:
    ## Scrape interval. If not set, the Prometheus default scrape interval is used.
    ##
    interval: ""
    ## Enable scraping the kubelet over https. For requirements to enable this see
    ## https://github.com/coreos/prometheus-operator/issues/926
    ##
    https: false
And import the dashboard required to monitor your GPUs which you can find here: https://grafana.com/dashboards/8769/revisions and set it up as a configmap.
In most cases, you will want to monitor your cluster from outside, and for that you will need to install or upgrade the prometheus-operator chart with grafana.ingress.enabled set to true and grafana.ingress.hosts={domain.tld}.
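If you prefer a values file over command-line flags, the same two settings can be sketched as the fragment below. It only covers the values named above; domain.tld is a placeholder for your own hostname, and the file name you pass to Helm is up to you.

```yaml
# Values fragment for the prometheus-operator chart (sketch).
grafana:
  ingress:
    enabled: true        # expose Grafana through an ingress
    hosts:
    - domain.tld         # placeholder: replace with your real hostname
```

Keeping these in a file makes it easier to carry them along with the kubelet https override from earlier when you upgrade the chart.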
Next in line, you have to deploy your actual containers that use the GPU. As a rule, a container cannot use part of a GPU, only a whole GPU, so tread carefully when you’re sizing your cluster because you can only scale horizontally as of now.
When you’re defining the POD, add in the container spec the following snip below:
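A minimal sketch of such a pod, assuming the NVIDIA device plugin from earlier is running so that the nvidia.com/gpu resource is advertised on the node. The pod name is hypothetical, and the TensorFlow Serving GPU image is only used here as an example workload.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo                          # hypothetical name
spec:
  containers:
  - name: gpu-demo
    image: tensorflow/serving:latest-gpu  # example image; use your own GPU workload
    resources:
      limits:
        nvidia.com/gpu: 1                 # whole GPUs only; fractions are not allowed
```

The important part is the resources.limits entry. Without it, the scheduler does not know that the container needs a GPU.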
What happens if everything blows up and nothing is working?
In some rare cases, the NVIDIA driver may blow up your data nodes. Yes, that happened to me and I needed to solve it.
The manifestation looks like this: the ingress controller works randomly, cluster resources show as evicted, the NVIDIA device plugin restarts frequently, and your GPU containers are stuck in pending.
The way to fix it is to first delete the evicted / error status pods by running this command:
In a previous blog post, I talked about how excellent the managed Kubernetes service is in Azure and in another blog post I spoke about Azure Container Instances. In this blog post, we will be combining them so that we get the best of both worlds.
We know that we can use ACI for some simple scenarios like task automation, CI/CD agents like VSTS agents (Windows or Linux), simple web servers and so on, but it’s another thing that we need to manage. Even though ACI has almost no strings attached (no VM management, custom resource sizing and fast startup), we still may want to control it from a single pane of glass.
ACI doesn’t provide you with auto-scaling, rolling upgrades, load balancing or affinity/anti-affinity; that’s the work of a container orchestrator. So if we want the best of both worlds, we need an ACI connector.
The ACI Connector is a virtual kubelet that gets installed on your AKS cluster, and from there you can deploy containers just by referencing the node.
If you’re interested in the project, you can take a look here.
To install the ACI Connector, we need to cover some prerequisites.
The first thing that we need to do is to create a service principal for the ACI connector. You can follow this document here on how to do it.
When you’ve created the SPN, grant it contributor rights on your AKS Resource Group and then continue with the setup.
I won’t be covering the Windows Subsystem for Linux or any other bash system as those have different prerequisites. What I will cover in this blog post is how to get started using the Azure Cloud Shell.
So pop open an Azure Cloud Shell and (assuming you already have an AKS cluster) get the credentials.
az aks get-credentials -g RG -n AKSNAME
After that, you will need to install helm and upgrade tiller. For that, you will run the following.
helm init
helm init --upgrade
The reason that you need to initialize helm and upgrade tiller is not very clear to me but I believe that helm and tiller should be installed and upgraded to the latest version every time.
Once those are installed, you’re ready to install the ACI connector as a virtual kubelet. Azure CLI installs the connector using a helm chart. Type in the command below using the SPN you created.
az aks install-connector -g <AKS RG> -n <AKS name> --connector-name aciconnector --location westeurope --service-principal <applicationID> --client-secret <applicationSecret> --os-type both
As you can see in the command above, I typed both for --os-type. ACI supports Windows and Linux containers, so there’s no reason not to get both 🙂
After the install, you can query the Kubernetes cluster for the ACI Connector.
kubectl --namespace=default get pods -l "app=aciconnector-windows-virtual-kubelet-for-aks" # Windows
kubectl --namespace=default get pods -l "app=aciconnector-linux-virtual-kubelet-for-aks" # Linux
Now that the kubelet is installed, all you need to do is run kubectl create -f with your YAML file, and you’re done 🙂
If you want to target the ACI Connector with the YAML file, you need to reference a nodeName of virtual-kubelet-ACICONNECTORNAME-linux or windows.
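A minimal sketch of such a manifest, assuming the connector was installed with the name aciconnector as in the command above and that you are targeting the Linux connector. The pod name and the nginx image are hypothetical choices for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: aci-nginx-demo                          # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx                                # example image
    ports:
    - containerPort: 80
  # Pin the pod to the virtual kubelet node so it is provisioned as an ACI.
  nodeName: virtual-kubelet-aciconnector-linux
```

For a Windows container, the nodeName would end in -windows instead and the image would need to be a Windows one.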
Once you apply a manifest like that, the AKS cluster will provision an ACI for you.
What you should know
The ACI connector allows the Kubernetes cluster to connect to Azure and provision Container Instances for you. That doesn’t mean that it will provision the containers in the same VNET as the Kubernetes cluster, so it is best suited to burst processing and similar workloads. This is, let’s say, an alpha concept which is being built upon, and new ways of using it are being presented every day. People have asked me what the purpose of this thing is when they cannot connect to it, but the fact is that you cannot expect that much from a preview product. I have given suggestions on how to improve it, and I suggest you do too.
Well that’s it for today. As always have a good one!
Visual Studio Team Services or VSTS is Microsoft’s cloud offering that provides a complete set of tools and services that ease the life of small teams or enterprises when they are developing software.
I don’t want to get into a VSTS introduction in this blog post, but what we need to know about VSTS is that it’s the most integrated CI/CD system with Azure. The beautiful part is that Microsoft has a marketplace with lots of excellent add-ons that extend the functionality of VSTS.
Creating a CI/CD pipeline in VSTS to deploy containers to Kubernetes is quite easy. I will show in this blog post a straightforward pipeline design to build the container and deploy it to the AKS cluster.
The prerequisites are the following:
VSTS Tenant and Project – Create for free here with a Microsoft Account that has access to the Azure subscription
VSTS Task installed – Replace Tokens Task
AKS Cluster
Azure Container Registry
Before we even start building the VSTS pipeline, we need to get some connection prerequisites out of the way. To deploy containers to the Kubernetes cluster, we need to have a working connection with it.
Open a Cloud Shell in Azure and type in:
az aks get-credentials -g AKS_RG -n AKS_NAME
It will tell you that the current context is located in “/home/NAME/.kube/config.”
Now open the /home/NAME/.kube/config with nano or cat and paste everything from there in a notepad. You need that wall of text to establish the connection to the cluster using VSTS.
Let’s go to VSTS where we will create a service endpoint to our Kubernetes cluster.
At the project dashboard, press on the wheel icon and press on Services.
Press on the New Service Endpoint and select Kubernetes.
Paste in the details from the .kube/config file in the kubeconfig box and the cluster address, in the form https://aksdns, in the server URL box.
Create a repository and add the following files and contents to it:
*I know it would be easier to clone from my GitHub repo, but when I’m learning I like copying and pasting stuff into VSCode, analysing it and then uploading it.
server {
    listen       80;
    server_name  localhost;
    #charset koi8-r;
    #access_log  /var/log/nginx/host.access.log  main;
    location / {
        root   /usr/share/nginx/html;
        index  index.html index.htm;
    }
    #error_page  404              /404.html;
    # redirect server error pages to the static page /50x.html
    #
    error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   /usr/share/nginx/html;
    }
    # proxy the PHP scripts to Apache listening on 127.0.0.1:80
    #
    #location ~ \.php$ {
    #    proxy_pass   http://127.0.0.1;
    #}
    # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
    #
    #location ~ \.php$ {
    #    root           html;
    #    fastcgi_pass   127.0.0.1:9000;
    #    fastcgi_index  index.php;
    #    fastcgi_param  SCRIPT_FILENAME  /scripts$fastcgi_script_name;
    #    include        fastcgi_params;
    #}
    # deny access to .htaccess files, if Apache's document root
    # concurs with nginx's one
    #
    #location ~ /\.ht {
    #    deny  all;
    #}
}
You will need to add an “index.html” with whatever you want written in it. I went with “One does not simply push changes to containers. Said no one ever.”
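The pipeline later publishes a deploy.yaml from this repository. As a minimal sketch of what that file could contain, the manifest below assumes the Replace Tokens task is left on its default #{ }# token pattern and that the ACR_DNS and BUILD_ID variables are the ones defined in the build definition. The resource names, labels and the LoadBalancer service are assumptions for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginxdemo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginxdemo
  template:
    metadata:
      labels:
        app: nginxdemo
    spec:
      containers:
      - name: nginxdemo
        # Quoted on purpose: an unquoted value starting with # would be read as a YAML comment.
        # Replace Tokens swaps in the registry DNS name and the build ID at build time.
        image: "#{ACR_DNS}#/nginxdemo:#{BUILD_ID}#"
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginxdemo
spec:
  type: LoadBalancer          # gives the service a public IP
  selector:
    app: nginxdemo
  ports:
  - port: 80
    targetPort: 80
```

Tagging the image with the build ID instead of latest means every build produces a changed manifest, so kubectl apply actually rolls out the new container.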
You’re done building the repository; now it’s time to set up the build definition.
Go to Build and Releases, Press on New and select the new Git repository that you just created.
At the template screen press on Container and press Apply.
On the new screen, go to variables and create two new variables
ACR_DNS with the value of your ACR registry link in the form of name.azurecr.io
BUILD_ID with the value $(Build.BuildId)
Now go back to the Tasks pane, press the cross next to Phase 1, and add the Replace Tokens task and the Publish Artifacts task. The result should look like the screenshot below.
For each task fill in the following:
Build an Image
Container Registry Type = Azure Container Registry
Azure Subscription = Your Subscription
Azure Container Registry = Select what you created
Action = Build an Image
Docker File = **\Dockerfile
Use Default Build Context = Checked
ImageName = nginxdemo
Qualify Image Name = Checked
Additional Image Tags = $(Build.BuildId)
Push an Image
Container Registry Type = Azure Container Registry
Azure Subscription = Your Subscription
Azure Container Registry = Select what you created
Action = Push an image
ImageName = nginxdemo
Qualify Image Name = Checked
Additional Image Tags = $(Build.BuildId)
Publish Artifact
Path to publish = deploy.yaml
Artifact name = deploy
Artifact publish location = Visual Studio Team Services/TFS
Now go to triggers, select Continuous integration and check “Enable continuous integration” then press the arrow on Save & queue and press save.
The build has been defined; now we need to create a release.
Go to Build and Releases and press on Release
Press on the cross and then on the “Create release definition”
In the New Release Definition pane, select the “Deploy to Kubernetes Cluster” template and press on Apply
Now that the template is pre-populated to deploy to the Kubernetes Cluster, you need to add an artifact, select the Build Definition and add it.
Now it’s time to enable Continuous deployment so press on the lightning bolt that’s located in the upper right corner of the artifact and enable the CD trigger.
Now go to the Tasks tab located near the Pipeline and modify the kubectl apply command.
kubectl apply
Kubernetes Service Connection = Select the K8 connection that you created
Command = Apply
Use Configuration files = Checked
Configuration File = press on the three dots and reference the deploy.yaml or copy what is below.
$(System.DefaultWorkingDirectory)/K8Demo/deploy/deploy.yaml
Now press save, queue a new build and wait for the container to get deployed. When it’s done, just type kubectl get services in the Azure Cloud Shell and the IP will pop up.
Final Thoughts
So you finished configuring the CI/CD pipeline and deployed your first container to an AKS cluster. This might seem complicated at first, but once you do this a couple of times, you will be a pro at it, and the problems you will face will be about how to make it more modular. I do similar things at clients most of the time when I’m automating application deployments for cloud-ready or legacy applications. This type of CI/CD deployment is quite easy to set up; when you want to automate a full-blown microservices infrastructure, you will have a lot more tasks to configure. My most significant CI/CD pipeline consisted of 150 tasks that were needed to automate a legacy application.
What I would consider a best practice for CI/CD pipelines in VSTS or any other CI/CD tool is to never hard-code parameters into tasks and to make use of variables/variable groups instead. Tasks like the “Replace Tokens” one let you reference those variables, so when one changes or you create one dynamically, it just gets filled into the code. This is very useful when your release pipeline deploys to more than one environment, because you can have global variables and environment-specific variables.