LetsEncrypt Certificates in AKS

I remember deploying my first Kubernetes cluster years ago and spending days figuring out how to secure ingresses quickly without much toil. Back then, it felt like teaching quantum physics to your cat; today, with LetsEncrypt and cert-manager, it's more like boiling an egg. At the time, the service was still called Azure Container Services (ACS). That's where I first dipped my hands into the Kubernetes realm, and where I got my first grey hairs.

Certificate management is an age-old problem. Usually, it's a once-a-year problem: a certificate is about to expire and has to be renewed. There are so many ways to manage certificates, yet we still have problems renewing and applying them.

Let's be honest: we would all be happier if we didn't need to manage certificates, but from time to time we have to. In Kubernetes, whether it's AKS, EKS, GKE, or minikube, you have a solution called cert-manager that integrates with LetsEncrypt and gets you almost every type of certificate you need with a few lines of simple code.

What's LetsEncrypt and Why Should You Care?

LetsEncrypt is a free, automatable, open certificate authority that provides digital certificates to secure HTTPS connections. If you've been in the cloud space long enough, you'll remember when obtaining SSL certificates was (and with many commercial CAs still is) expensive and time-consuming, with a renewal headache every year.

The key word in that sentence is free. LetsEncrypt is run as a non-profit, sponsored by many companies, with a mission to keep the internet secure and make plain HTTP a thing of the past.

The main benefits that make it perfect for AKS deployments are:

  • It's completely free (yes, even for production workloads)
  • All major browsers trust its certificates
  • It's designed for automation, which pairs perfectly with Kubernetes
  • With cert-manager, the certificates auto-renew, eliminating those "certificate expired" emergencies
  • The certificates are short-lived; they come with a 90-day expiration date
    • Hence the automation part. Want to do certificate renewals manually every 90 days? Oh no, we don't want that.

Setting Up cert-manager in AKS

To start using LetsEncrypt in AKS, we need to install a Helm chart that contains cert-manager, a Kubernetes operator that handles certificate issuance and renewal. I've done this step so many times that it has become a reflex: even if I don't need certificates, I'm still going to install it.

Let's start by installing cert-manager using Helm. This will create a new namespace and install all the required components:

# Add the Jetstack Helm repository
helm repo add jetstack https://charts.jetstack.io --force-update

# Update your local Helm chart repository cache
helm repo update

# The CRDs are installed by the chart itself via crds.enabled=true below.
# Do not also kubectl apply cert-manager.crds.yaml beforehand: Helm refuses to
# adopt CRDs it did not create, and the install fails on ownership metadata.

# Install cert-manager together with its CRDs
helm install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.16.2 \
  --set crds.enabled=true

I can't tell you how many times I've forgotten to include the CRDs and then spent hours debugging an issue that shouldn't have existed in the first place. So never skip the CRDs; learn from me, a veteran in doing stupid things 😄
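
Before moving on, it's worth confirming that the CRDs and the cert-manager deployments are actually there. Here is a minimal sketch of such a check, assuming the cert-manager namespace from the install above. The function name verify_cert_manager is made up for illustration.

```shell
# Hypothetical helper: confirm cert-manager is healthy before creating issuers.
# Assumes the release lives in the "cert-manager" namespace unless told otherwise.
verify_cert_manager() {
  local ns="${1:-cert-manager}"

  # Without these CRDs, Certificate and ClusterIssuer manifests are rejected.
  kubectl get crd certificates.cert-manager.io clusterissuers.cert-manager.io || return 1

  # Block until every deployment in the namespace reports Available, or time out.
  kubectl wait --namespace "$ns" --for=condition=Available deployment --all --timeout=120s
}
```

Run verify_cert_manager right after the Helm install. A non-zero exit code means either the CRDs are missing or a deployment did not become Available within the timeout.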

My suggestion here is to create an IaC pipeline that acts as a desired state configuration manager. Whenever you deploy a new cluster, this step should kick in and do all this work. I automate so much of this stuff that I rarely remember how to do it manually.

Once we've confirmed that cert-manager is working properly, we need to create a ClusterIssuer resource that tells cert-manager how to request certificates from LetsEncrypt. LetsEncrypt has two environments: staging and production. I always recommend starting with staging to avoid hitting rate limits while testing.

Some years ago, a client insisted on going straight to production. "We don't have time to create a staging environment," they said. After multiple failed attempts and configuration errors, they hit LetsEncrypt's rate limits and were facing a week-long wait before they could try again. We found a workaround that solved it in less than 24 hours, but it involved a lot of work, and I won't share it here as it's quite a hack.

The moral of the story? Never test in production. Where did I read or hear that before?

Here's how to create a staging ClusterIssuer:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # LetsEncrypt staging server
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: you@example.com  # placeholder: use a real mailbox you monitor
    privateKeySecretRef:
      name: letsencrypt-staging
    solvers:
    - http01:
        ingress:
          class: nginx

And then the production version, once you've confirmed everything works:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # LetsEncrypt production server
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com  # placeholder: use a real mailbox you monitor
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx

Apply these configurations with kubectl apply -f clusterissuer.yaml (if both issuers live in the same file, separate them with a --- line).

When you set your e-mail, make it a real one and preferably one where you can get notifications from LetsEncrypt.
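
After applying an issuer, you can check that it registered with LetsEncrypt before requesting anything from it. The following is a small sketch that reads the issuer's Ready condition; the helper name issuer_ready_status is hypothetical.

```shell
# Hypothetical helper: print the Ready condition ("True"/"False") of a ClusterIssuer.
issuer_ready_status() {
  kubectl get clusterissuer "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}
```

For example, issuer_ready_status letsencrypt-staging should print True once the ACME account has been registered. If it prints False or nothing, kubectl describe clusterissuer letsencrypt-staging usually shows the reason.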

Requesting A Certificate

Now that the infrastructure is in place, let's request a certificate for your application. This is done by creating a Certificate resource:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-app-tls
  namespace: default
spec:
  secretName: my-app-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  commonName: app.example.com
  dnsNames:
  - app.example.com

Once applied, cert-manager will automatically start the certificate issuance process. You can check the status with kubectl describe certificate my-app-tls.
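
If the certificate does not become ready, it helps to look at the intermediate resources cert-manager creates for an ACME issuer (CertificateRequest, Order, and Challenge). Here is a minimal sketch, assuming the default namespace used in the manifest above. The function name cert_debug is made up.

```shell
# Hypothetical helper: show a Certificate and the ACME resources behind it.
cert_debug() {
  local name="$1" ns="${2:-default}"

  # READY=True means the Secret holds a signed, valid certificate.
  kubectl get certificate "$name" --namespace "$ns"

  # If it stays not-ready, the failing step is usually visible in one of these.
  kubectl get certificaterequest,order,challenge --namespace "$ns"
}
```

Calling cert_debug my-app-tls lists everything in one go. A Challenge stuck in pending is the usual sign that validation is not reaching the cluster.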

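The certificate will be stored in a Kubernetes Secret named my-app-tls, which you can then reference in your Ingress resource. The cert-manager.io/cluster-issuer annotation asks cert-manager to create a Certificate for the Ingress automatically, so in practice you need either the explicit Certificate above or the annotation, not both: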

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - app.example.com
    secretName: my-app-tls
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-service
            port:
              number: 80
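
To see which CA actually signed what ended up in the Secret, and when it expires, you can decode it locally. This is a sketch that assumes openssl and base64 are installed on your machine; the helper name secret_cert_info is hypothetical.

```shell
# Hypothetical helper: print issuer and expiry of the certificate in a TLS Secret.
secret_cert_info() {
  local secret="$1" ns="${2:-default}"

  # The Secret stores the PEM certificate base64-encoded under the key tls.crt.
  kubectl get secret "$secret" --namespace "$ns" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | openssl x509 -noout -issuer -enddate
}
```

Running secret_cert_info my-app-tls is a quick way to tell a staging certificate from a production one. Staging certificates are signed by a test CA that browsers do not trust, so a browser warning with the staging issuer is expected.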

The Traffic Manager Challenge: When LetsEncrypt says Nope!

I remember the day I deployed my first multi-cluster AKS setup with Traffic Manager. Everything seemed perfect until I failed over the cluster, and that's when I learned an important lesson.

Azure Traffic Manager routes traffic across multiple endpoints which can be different AKS clusters. When you try to use cert-manager with LetsEncrypt in this setup, you'll encounter a very interesting problem: the HTTP-01 validation challenge.

Here's what happens: when you request a certificate, LetsEncrypt verifies that you control the domain by fetching a token from a specific URL on that domain (the HTTP-01 challenge), and cert-manager serves that token from the cluster that made the request. With Traffic Manager in front, the validation request might land on ANY of your clusters in performance mode, or only on one of your clusters in priority mode, and that cluster is not necessarily the one that requested the certificate. The result is validation failures and certificates that remain pending forever.
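
You can reproduce what LetsEncrypt sees by requesting the challenge URL through the public hostname yourself. Here is a rough sketch; the two function names are made up, and the token has to be taken from a pending challenge in your cluster.

```shell
# Build the URL that is fetched during an HTTP-01 challenge.
acme_http01_url() {
  local domain="$1" token="$2"
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$domain" "$token"
}

# Hypothetical probe: print the HTTP status code returned for that URL.
# A 404 usually means the request reached a cluster that does not hold the token.
probe_http01() {
  curl --silent --output /dev/null --write-out '%{http_code}\n' \
    "$(acme_http01_url "$1" "$2")"
}
```

While a challenge is pending, its token should be available in the Challenge resource (kubectl get challenges --all-namespaces -o yaml, under spec.token). Running probe_http01 app.example.com <token> a few times then shows whether Traffic Manager consistently routes to the right cluster.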

To make matters worse, if multiple clusters independently request certificates for the same domain, you'll quickly hit LetsEncrypt's rate limits (only 5 duplicate certificates per week). And trust me, that's not a call you want to make to your team: "Sorry folks, such is life."

The Challenge: HTTP01 vs DNS01

As I said above, LetsEncrypt needs to verify that you control the domain before issuing a certificate. There are two primary challenge types: HTTP01 and DNS01.

HTTP01 is simpler but requires your AKS cluster to accept HTTP traffic on port 80. DNS01 is more complex but doesn't require exposing HTTP endpoints, and it is the only challenge type LetsEncrypt accepts for wildcard certificates.

For HTTP01 (the simpler option), the configuration is shown above. For DNS01 with Azure DNS, you'll need something more like this:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com  # placeholder: use a real mailbox you monitor
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - dns01:
        azureDNS:
          resourceGroupName: dns-resource-group
          subscriptionID: your-subscription-id
          hostedZoneName: example.com
          environment: AzurePublicCloud

Of course, this also requires setting up the proper Azure credentials for cert-manager to access Azure DNS, but that's a topic for another post.
With this method, cert-manager creates and cleans up the validation DNS records itself, end to end, so you don't have to bother. The same works with other DNS providers that cert-manager supports, such as Cloudflare.
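
With DNS01, the proof of control is a TXT record under _acme-challenge for the name being validated, and a wildcard is validated at its base domain. Here is a small sketch for working out which record to look for; the function name acme_dns01_record is hypothetical.

```shell
# Return the TXT record name used to validate a DNS name via DNS-01.
acme_dns01_record() {
  local domain="${1#\*.}"   # strip a leading "*." so wildcards map to the base domain
  printf '_acme-challenge.%s\n' "$domain"
}
```

While a challenge is in progress, something like dig +short TXT "$(acme_dns01_record '*.example.com')" shows whether cert-manager managed to create the record in your zone. This assumes dig is available on your machine.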

Solution for Cross-Cluster Certificate Synchronization

After a few sleepless nights (and a lot of coffee), I found an approach that works most of the time, including on multiple production clusters that have HA enabled.

The approach is to designate one cluster as the "certificate manager." This cluster is responsible for all certificate requests and renewals. You then sync the resulting certificates to your other clusters using a pipeline or a similar mechanism. I used a pipeline that synchronizes the certificate every time it runs, and it runs more than once during that 90-day period.

This approach works well because:

  • Only one cluster talks to LetsEncrypt, avoiding rate limit issues
  • You don't have validation problems since all requests come from the same place
  • Your certificates stay in sync across all clusters
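
The sync step itself can be as small as copying the TLS Secret from the issuing cluster to the others. Here is a rough sketch, assuming both clusters exist as contexts in your kubeconfig. The function name and the context names in the usage line are hypothetical, and this is not necessarily how the pipeline described above is built.

```shell
# Hypothetical helper: copy a TLS Secret from one cluster (context) to another.
sync_tls_secret() {
  local src_ctx="$1" dst_ctx="$2" secret="$3" ns="${4:-default}"
  local tmp
  tmp="$(mktemp -d)" || return 1

  # Pull the certificate and the private key out of the source cluster's Secret.
  kubectl --context "$src_ctx" --namespace "$ns" get secret "$secret" \
    -o jsonpath='{.data.tls\.crt}' | base64 -d > "$tmp/tls.crt"
  kubectl --context "$src_ctx" --namespace "$ns" get secret "$secret" \
    -o jsonpath='{.data.tls\.key}' | base64 -d > "$tmp/tls.key"

  # Stop early if either part is missing, so an empty Secret is never pushed.
  if [ ! -s "$tmp/tls.crt" ] || [ ! -s "$tmp/tls.key" ]; then
    rm -rf "$tmp"
    return 1
  fi

  # Render a fresh Secret manifest locally, then create or update it on the target.
  kubectl create secret tls "$secret" --namespace "$ns" \
    --cert="$tmp/tls.crt" --key="$tmp/tls.key" \
    --dry-run=client -o yaml \
    | kubectl --context "$dst_ctx" apply -f -
  local rc=$?

  rm -rf "$tmp"   # the private key should not linger on disk
  return $rc
}
```

A pipeline job could then call sync_tls_secret aks-primary aks-secondary my-app-tls on a schedule. Because the manifest is rebuilt from scratch, cluster-specific metadata from the source Secret is not carried over.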

Lessons Learned and Best Practices

After spending countless hours on this challenge, here are some lessons I learned:

  1. Start with the staging issuer: Always use LetsEncrypt's staging environment first to avoid hitting rate limits during testing.
  2. Monitor connectivity: Ensure your cert-manager pods can reach LetsEncrypt's API. I once spent hours troubleshooting only to find out a network policy was blocking egress traffic.
  3. DNS01 over HTTP01 for multi-cluster: For Traffic Manager setups, DNS01 validation is significantly more reliable than HTTP01.
  4. Watch certificate renewal: LetsEncrypt certificates expire after 90 days. While cert-manager handles renewals automatically, ensure your sync mechanism is working before renewal time comes.
  5. Wildcard certificates: If you're using multiple subdomains, wildcard certificates can and usually will make your life easier.
  6. Check for clock synchronization: Time skew between your nodes can cause validation failures. Ensure your nodes have properly synchronized clocks.

Understand renewal: LetsEncrypt certificates are valid for 90 days, and cert-manager automatically renews them around 30 days before expiry. Make sure your cluster is stable when renewal time comes.
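
cert-manager records both dates on the Certificate itself, which is handy when you schedule a sync job around renewals. Here is a minimal sketch; the helper name cert_renewal_info is made up.

```shell
# Hypothetical helper: print when a Certificate expires and when renewal is planned.
cert_renewal_info() {
  local name="$1" ns="${2:-default}"
  kubectl get certificate "$name" --namespace "$ns" \
    -o jsonpath='expires: {.status.notAfter}{"\n"}renews:  {.status.renewalTime}{"\n"}'
}
```

For a 90-day certificate with the default settings, the renewal time printed by cert_renewal_info my-app-tls should fall roughly 60 days after issuance.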

Network requirements: Ensure your cluster has the necessary network access to reach LetsEncrypt's API and for LetsEncrypt to reach your ingress controller (for HTTP01) or DNS servers (for DNS01).
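
A quick way to test the outbound half of that is a throwaway pod that calls the ACME directory URL from inside the cluster. This is a sketch under assumptions: it uses the public curlimages/curl image, and the function and pod names are hypothetical.

```shell
# Hypothetical helper: check egress to LetsEncrypt from inside the cluster.
check_acme_egress() {
  local ns="${1:-cert-manager}"

  # Start a one-off pod, run curl once, and delete the pod afterwards.
  kubectl run acme-egress-check --namespace "$ns" \
    --rm -i --restart=Never \
    --image=curlimages/curl --command -- \
    curl --silent --show-error --fail --max-time 10 --output /dev/null \
    https://acme-v02.api.letsencrypt.org/directory
}
```

A zero exit code from check_acme_egress means the API was reachable from that namespace. Keep in mind that a NetworkPolicy selecting only the cert-manager pods by label would not apply to this throwaway pod, so a passing check does not fully rule out a policy problem.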

Resource requests: For production environments, ensure cert-manager has adequate CPU and memory resources. I've seen issues with certificate renewals failing in resource-constrained environments.
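
Requests can be set through the same Helm chart. The numbers below are illustrative assumptions rather than recommendations, and the function name is hypothetical; size them from what your pods actually consume.

```shell
# Hypothetical helper: set CPU/memory requests on the cert-manager controller.
# The defaults (50m, 128Mi) are placeholder values for illustration only.
set_cert_manager_requests() {
  local cpu="${1:-50m}" memory="${2:-128Mi}"
  helm upgrade cert-manager jetstack/cert-manager \
    --namespace cert-manager \
    --version v1.16.2 \
    --reuse-values \
    --set resources.requests.cpu="$cpu" \
    --set resources.requests.memory="$memory"
}
```

For example, set_cert_manager_requests 100m 256Mi. The top-level resources value only covers the controller; the webhook and cainjector have their own webhook.resources and cainjector.resources values in the chart.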

Last month, I was troubleshooting a certificate issue where everything looked correct, but certificates weren't being issued. After an hour of head-scratching, we discovered that a fresh network policy was blocking egress traffic to LetsEncrypt's API.


Moving to LetsEncrypt certificates in AKS has been one of those rare win-win situations in IT: it saves money while also improving automation and reducing maintenance overhead. No more calendar reminders to renew certificates, no more midnight calls about expired TLS, and no more budget discussions about certificate costs.

I remember a time, maybe six years ago, when I spent an entire weekend recovering from an expired certificate in a production environment. That painful memory makes me appreciate the current state of automation even more.
If you're managing certificates manually in your AKS clusters, do yourself a favour and invest the few hours it takes to set up this automation. Your future self will thank you when you enjoy your weekend instead of dealing with certificate renewals.

Lastly, when choosing between HTTP01 and DNS01 challenges, pick the latter: HTTP01 is easier to set up, but DNS01 gives you more flexibility with wildcards and doesn't require opening port 80. It's like choosing between Sweden and West Europe for your Azure region: sometimes the less obvious choice offers better benefits in the long run.

That being said, have a great one!