I remember the first time I looked at an AKS invoice and thought, what exactly am I paying for here? The cluster wasn't doing anything crazy; it was a standard setup with a system pool, two user pools, some queue workers, and a handful of APIs. But the bill came in way higher than I expected, and the worst part was that nobody on the team could immediately explain why. We opened the Azure Cost Management blade, stared at the compute line items, looked at some CPU graphs with a lot of white space, and the initial reaction was "the platform is wasteful by default." That's a nice theory, but boy, it turned out to be completely wrong.
What I learned over the following weeks, and what I've seen repeated at enough other places that it stopped being coincidence, is that the spend that hurts in AKS comes from architecture choices that looked reasonable when they were made, which then calcified into an operating model nobody revisited. The team wants a standard platform, so they build one cluster. Then they add multiple user pools because isolation sounds responsible, then they overprovision baseline nodes since scale-up latency is scary, then every background processor becomes a permanently running Deployment, which is simpler than rethinking workload shape, then observability gets turned on with broad defaults since nobody wants to explain missing logs during an incident. Six months later the invoice arrives and it's doing exactly what the design told it to do, it's just doing it expensively.
Every blog post I see about Kubernetes cost says the same thing: resize that pool, delete orphaned disks, lower this retention value. Those things matter, but they're cleanup activities. I wanted to write about the actual design decisions that determine whether your AKS bill makes sense or not, the kind of stuff you'd discuss with your finance partner, your platform team, and the developers who have to live with the resulting constraints. Anyway, let's get into it.
Most Kubernetes Overspend Is Architectural
Here's what I keep telling teams: your pricing tier isn't the problem, your reservation strategy isn't the problem, the problem is the shape of your workloads and the boundaries your platform creates. If a platform team builds a cluster that assumes constant demand, fixed node shapes, broad isolation domains, and permanent workload residency, the bill is just doing what the architecture told it to do. Whatever you think is causing the spend, look at the design first.
Overprovisioning Is a Confidence Problem
Idle AKS spend exists since the platform team doesn't trust one of three things: workload declarations, autoscaler behavior, or application resilience during rescheduling. When teams don't trust requests and limits, they size nodes with extra padding. When they don't trust scaling, they keep bigger warm pools. When they don't trust their own services to survive disruption, they keep more replicas on more nodes than the steady state requires. Each of those reactions is understandable, but together they create a cluster that's permanently paying for anxiety.
I've built clusters like this myself. I had a production cluster a couple of years ago where the autoscaler took around four minutes to bring up a new node during a traffic spike since we were using a VM SKU that had limited regional availability. After that incident, I doubled the minimum node count and didn't revisit it for six months. Boy, the invoice was something :) The difference between buying headroom with intention and buying headroom out of fear is that the first one shows up in your design docs and the second one just shows up on the invoice.
Node Pool Sprawl Looks Responsible Until You Pay for It
You start with system and user pools, then someone wants a GPU pool (fair enough), then a data team wants a memory-heavy pool, then an integration workload needs a small isolated pool. Before long, the cluster resembles a compromise written by committee (and every enterprise has at least one).
The cost problem goes deeper than the number of pools. Every extra pool introduces a minimum node count to stay operational, lower packing efficiency since workloads are restricted to fewer nodes, upgrade overhead across more pools, and more temptation to keep idle buffer in each pool separately. I know, I know, isolation sounds responsible. But isolate by meaningful workload difference, not by organizational preference. Forcing two identical services into separate pools creates lower utilization with a story attached.
Why Are Your Batch Workloads Running as Services?
A team has background processing, queue-driven enrichment, or integration syncs. Instead of designing it as event-driven work that can scale down hard, they build it like a service. One replica becomes two for resilience, which traces back to somebody getting paged at 3 AM once. CPU usage hovers near nothing for the bulk of the day, but the pods remain resident, the nodes remain warm, and the bill remains confident.
This is one of the easiest places to waste money in AKS, and the platform will happily host inefficient workload shape forever. The scheduler has no moral stance on whether your queue worker should exist 24 hours a day, it'll just schedule it and move on. I had a queue worker once that processed about 200 messages a day, all between 9 AM and 11 AM, but it ran 24/7 with two replicas, which is just how it was originally deployed. Nobody questioned it for over a year.
Observability Spend Is a Signal Discipline Problem
In a lot of AKS estates, logs and metrics become expensive for boring reasons: debug and info logs shipped from everything in every environment, duplicate collection paths, high-cardinality labels that nobody challenged, full log retention applied to low-value namespaces, scrape frequency that makes sense for debugging (and makes zero sense at steady state). The problem compounds because each of those decisions seems reasonable in isolation, but the aggregate bill tells a different story. Teams turn on everything during an incident, which is the right call in the moment, and then nobody turns it back off when the incident resolves.
I've seen teams pay more for their log ingestion than for the compute that generated those logs, and the logs were 90% debug-level noise that nobody ever queried.
What Cost-Aware AKS Design Actually Means
Alright, so what does it actually mean to be cost-aware? For me it came down to one thing: can I explain every line item on the invoice? A healthy platform has a cost shape where you can point at each bucket and say "that's the baseline for the system pool, that's elastic compute for the queue workers, that's Prometheus ingestion for the signals we actually alert on, and that's the premium we pay for multi-zone APIs." If you can't do that, you're paying for things you don't understand, which is how you end up deleting monitoring to save money and then spending three times as much on an incident you can't diagnose (I've seen this happen twice).
Every workload has an implicit contract with the platform, some need predictable latency, some need high availability, some can disappear and retry, and some are happy to wait 90 seconds for capacity. Turn that implicit contract into explicit placement and scaling rules. If you don't, everything drifts toward on-demand, always-on, broadly observed, and more expensive than necessary. Check the AKS pricing page to understand what each capacity type actually costs you.
Now, some idle capacity is worth paying for: headroom in the system pool for upgrades, a warm baseline for latency-sensitive traffic, extra capacity before predictable peak windows. Other idle capacity is just comfort you can't justify, permanent extra nodes in queue-worker pools with daily hours of idleness, separate underutilized pools for teams that could safely share, overly large requests that force low packing density. Use Azure Cost Management to actually see where it lives. That being said, this doesn't mean starving the platform. You still pay for reliability where the workload needs it, you just pay consciously.
And please don't go running everything on spot instances just from reading this article. Spot is a pricing model tied to interruption risk, not a virtue. Chasing 100 percent utilization is a trap, a cluster pegged at perfect utilization means you've got no safe margin for disruption, upgrades, or burst, which is a little terrifying during upgrade day. And whatever you do, don't cut telemetry blindly. Shape your signals, don't deny them.
Cluster Baseline Design and Cost Shape
AKS has a cost floor, and the spend it represents is unavoidable: node baseline for system components, load balancers, disks, private networking, observability plumbing. You don't get to zero this out. The question is whether your baseline is proportionate and whether you're adding optional baselines on top of it without noticing.
Start With a Small, Defensible System Pool
Your system pool exists to host things that the cluster needs to function: CoreDNS, CNI components, CSI drivers, metrics and policy agents, ingress controllers. These components are lightweight but they need to be reliable, which means on-demand nodes with mainstream VM families. D-series works for system pool needs in the vast majority of cases. So, keep enough headroom for upgrades and daemonsets, and taint the system pool so random application workloads don't drift onto it. If you skip the taint, you'll eventually find some developer's test pod sitting on a system node because the scheduler had room and nobody told it not to.
Do You Actually Need That Extra Node Pool?
The best default for a lot of clusters is simpler than people think: one fixed system pool, one general application capacity class, one interruption-tolerant capacity class. You can add more when a workload truly needs a different node OS, a different accelerator profile, a different compliance boundary, or a different disruption model. But every new pool should answer a specific question. If the answer is "this team asked for it," you don't have a scheduling requirement yet, you just have a meeting outcome.
Separate Baseline Capacity From Burst Capacity
The thing that actually made a difference in my clusters was drawing a hard line between what must always exist and what only needs to exist when demand appears. Baseline capacity includes system nodes, steady API services with hard availability requirements, and ingress plus essential platform components. You pay for this around the clock and that's fine because it's the minimum viable platform.
Burst capacity is the other side: queue workers, scheduled processing, recommendation engines with asynchronous pipelines, internal reporting jobs, model inference workers with variable demand. This is where you can save real money by letting capacity appear and disappear with the workload itself, and where KEDA and NAP earn their keep.
Have You Right-Sized Requests Before Talking Node Economics?
If your requests are fiction, your cost review is fiction. Everything in Kubernetes schedules on requests, which means autoscalers reason from scheduling pressure. A useful pattern: start conservative but not absurd, observe steady-state and burst usage, resize requests toward realistic working levels, apply limits where they make behavioral sense, and repeat quarterly since workloads drift. This isn't glamorous, but it's where a lot of money lives.
Understand the Cost Shape of Replicas
Three replicas of a small service may be exactly right for a multi-zone production API. Three replicas of a dormant internal worker is architecture theater. Replicas interact with topology spread, PDBs, node packing, cluster scale-down feasibility, and upgrade behavior. The question you should be asking is: what availability promise is this replica count buying, and does the workload actually need it?
Node Auto-Provisioning and Dynamic Capacity
graph TD
A["Queue Backlog Rises"] --> B["KEDA ScaledObject\nDetects Metric"]
B --> C["Kubernetes Scheduler\nPending Pods"]
C --> D["NAP / Karpenter\nProvisions Node"]
D --> E["Pods Start\nProcess Work"]
E --> F["Backlog Falls"]
F --> G["Scale-Down Timer\nconsolidationPolicy"]
G --> H["NAP Removes Node"]
H -.->|"Cycle Repeats"| A
I wrote about Node Auto Provisioning before, and I keep coming back to it since it forces a mental shift. Traditional AKS cluster design begins with preselected pool shapes, the platform team decides what VM SKUs exist, how many nodes each pool has, and what autoscaling bounds they get, and workloads then try to fit into that catalog. NAP, which is built on Karpenter, changes the conversation completely. Instead of starting with fixed shapes and asking workloads to adapt, you define constraints and let the platform provision node shapes in response to pending pod requirements.
The Cost Impact of Dynamic Capacity
With classic fixed pools, teams overbuild the catalog. NAP reduces that pressure by letting you describe capacity classes rather than every exact pool you might need. Over time, this tends to improve bin packing efficiency and the willingness to remove static slack. That being said, NAP doesn't magically fix bad requests, bad workload shape, or bad policy. It does, however, remove one major source of waste: manually curating too many fixed VM assumptions.
The Real Shift: Stop Thinking in Named Pools
Teams using NAP retreat into old thinking: designing a small zoo of node pools (old habits die hard). The more useful framing is defining capacity classes, deciding what workloads belong in each class, what constraints matter, and what resource ceilings should keep NAP from becoming too enthusiastic. You move from pool inventory management to capacity policy.
One thing to keep in mind, NAP doesn't remove the need for fixed capacity entirely. Keep one small fixed system pool for boring reliability. NAP is best used to make user workload capacity elastic, not to make the whole cluster feel improvisational.
Example: Enabling NAP on a New Cluster
RESOURCE_GROUP=rg-aks-cost-aware
CLUSTER_NAME=aks-cost-aware-prod
LOCATION=westeurope
az group create \
--name $RESOURCE_GROUP \
--location $LOCATION
az aks create \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--location $LOCATION \
--node-provisioning-mode Auto \
--network-plugin azure \
--network-plugin-mode overlay \
--network-dataplane cilium \
--enable-managed-identity \
--enable-oidc-issuer \
--enable-workload-identity \
--generate-ssh-keys
Pay attention to the hard ceiling and the labels, those are what keep NAP from going wild. The generic scope is deliberate too, since you want to avoid fragmenting capacity classes unless you have a real reason.
This is where cost-awareness becomes real. You've declared a cheaper capacity class with a behavioral contract. Only workloads that tolerate interruption should land there.
Spot Pools and Interruption-Aware Workload Design
Spot is one of the best cost levers in AKS and one of the easiest to misuse. The deal is simple: spot is cheap since Azure can take it back at any time. If the workload can handle that contract, you get meaningful savings, sometimes 60-80% off depending on the SKU and region. If it can't handle eviction, you're outsourcing availability to luck, and that never ends well.
What Makes a Good Spot Candidate?
Spot works well when a workload is stateless, queue-driven or backlog-driven, idempotent, retry-tolerant, and can tolerate interruption in minutes. Typical examples: document processing workers, image transformation jobs, asynchronous enrichment pipelines, indexers, and crawlers. The common thread is that these workloads don't hold state that can't be reconstructed, and the consumers of their output already expect asynchronous delivery. If a pod dies mid-operation, the message goes back on the queue and another pod picks it up.
What Should Stay Off Spot?
Single-instance stateful components, user-facing latency-sensitive APIs without enough on-demand baseline, sticky session services with weak reconnection behavior, and components where interruption cascades into noisy downstream failures. A lot of teams draw this line in the wrong place, and it's worth spending time getting it right, because one bad spot experience during an incident tends to scare an entire organization off the feature permanently.
Interruption Awareness Is an Application Property
Adding a toleration doesn't make a workload spot-ready. Designing for loss and retry does, which means idempotent message handling, checkpointing progress, graceful shutdown hooks, and external state instead of node-local assumptions. If your application can't handle being killed mid-operation, no amount of Kubernetes configuration will make spot safe for it. I learned this the hard way with a worker that stored intermediate results in a local temp directory, every eviction meant reprocessing the entire batch from scratch.
The labels, the priority class, the toleration, the affinity, when you look at this YAML you can immediately tell why the pod is on spot and what happens if it gets evicted. That's the point.
What Happens When Spot Is Scarce?
Not every spot workload needs on-demand fallback. Some truly can wait. But other workloads are cheap to run on spot during normal conditions and still important enough that you want a safety valve when spot capacity dries up (think overnight batch processing that feeds morning dashboards).
Possible patterns: separate worker deployments for spot-first and on-demand fallback, one KEDA-triggered deployment on spot plus manual switch when queue age crosses a threshold, a general on-demand capacity class that tolerates temporary overflow for critical backlog burn-down. Whatever approach you pick, fallback is a business choice, not an autoscaler accident.
KEDA as a Cost Control Decision
KEDA gets introduced as a scaling feature, which is how the product story starts. In cost-aware AKS design, I think it's more useful to treat it as an architecture decision. Specifically, KEDA forces you to answer whether a workload should exist when there's no work. That's not a small question, and teams rarely ask it.
It's the difference between a cluster that keeps a background estate permanently alive and a cluster that pays for work when work exists.
Is CPU Scaling Even the Right Model for Backlog Work?
HPA is fine for request-driven services, but AKS background workloads aren't fundamentally CPU-triggered, they're backlog-triggered: Service Bus queues, Event Hubs consumers, storage queues, cron-triggered processing windows, custom metrics from external systems. If a worker wakes up only because there's backlog, scaling on CPU usage is indirect at best. It also tends to keep too many replicas alive, since the system needs resident pods to produce the metric that will later justify more pods, which means you're paying for pods to exist so they can generate the metrics that prove they should exist. Circular, right?
KEDA short-circuits that by letting the event source drive scale. I wrote about this in more detail in my KEDA in production post if you want the full story.
Scaling to Zero Is an Economic Statement
Whenever a workload can safely scale to zero, you should take that option seriously. Not because zero is beautiful in theory, but because every permanently resident worker creates pressure on node baseline, daemonset footprint, observability volume, upgrade disruption surface, and scheduling complexity.
Teams resist scale-to-zero since they imagine cold starts as inherently unacceptable. In practice they rarely measure the real user impact, which is negligible when the workload is asynchronous and the queue already decouples the experience.
The Important Tuning Knobs Are Economic
People focus on KEDA for its ability to scale workloads, but the important operational questions are about how aggressively and how long. You need to think about: minReplicaCount, which controls whether the workload keeps a warm baseline; maxReplicaCount, which determines how much backlog burn-down you actually want to buy; pollingInterval, which sets how frequently it's worth checking; and cooldownPeriod, which governs how long replicas linger after demand falls.
Those aren't just performance settings, they shape your bill.
The key choices here: zero warm replicas when the queue is empty (why pay for idle?), enough max replicas to actually drain a backlog, and a 3-minute cooldown so pods don't thrash up and down after a short burst. I tweaked these numbers for about two weeks on my invoice-worker before landing on something that worked, your mileage will vary, but the principle is the same.
Workload Right-Sizing and Autoscaler Interactions
I used to think about HPA, KEDA, and NAP as separate things. Turns out the invoice doesn't care about your mental model. It sees one system, and if any part of the chain is misconfigured, the whole thing spends more than it should.
Are Your Requests Telling the Truth?
Requests are the inputs that determine placement, pending state, provisioning pressure, and ultimately node cost. They're not documentation, they're how the scheduler decides what to buy. If a service asks for 2 vCPU and 4 Gi when it really uses 250 millicores and 700 Mi, that request does real damage: the scheduler sees fewer valid packing options, NAP or the node autoscaler may need larger or more numerous nodes, scale-down becomes harder, and service pad calculations become more pessimistic.
The defense teams give for oversized requests is that the application once misbehaved under pressure, so they padded the numbers to stay safe (I've done this too, no judgment). That's understandable for a week. It's a bad permanent strategy.
Right-Sizing Needs Context, Not Just Percentiles
There's a lazy version of right-sizing that says "use the 95th percentile and move on." That can help, but it's not enough. Steady APIs need you to look at concurrency patterns, latency under sustained load, and GC behavior, keep enough CPU headroom for burst without turning every pod into a tiny dedicated node reservation. Background workers are a different animal, what matters there is cost per unit of work, how CPU and memory change with batch size, and whether bigger workers reduce queue latency enough to justify larger nodes.
Anyway, scheduled jobs are yet another story, where you need to separate startup overhead from real processing demand and figure out whether shorter, denser runs are cheaper than long, shallow execution windows. The details vary by workload, but the principle is always the same: give the platform accurate numbers so it buys the right capacity at the right moment.
What's the Cheapest Optimization Teams Avoid?
There are three broad workload shapes that matter for cost: always-on services, event-driven workers, and scheduled or finite jobs. The architectural mistake that shows up everywhere is treating all three as always-on services, which happens because Deployments are familiar and nobody stops to ask whether the workload actually needs permanent residency.
An event worker deployed as a permanent service keeps nodes warm when no work exists. A scheduled job implemented as a permanent service introduces unnecessary background telemetry and restart surface. A truly always-on API forced into a purely event-driven posture may suffer latency or warm-up issues it didn't need. So, stop defaulting to Deployments for everything and figure out what residency model the workload actually requires.
The Scaling Chain in Practice
Here's the simplified runtime story for a queue-driven workload:
Backlog appears in Service Bus
KEDA notices the backlog and increases desired replicas
The scheduler tries to place those pods using the requests you declared
If the cluster lacks fitting capacity, NAP or node autoscaling provisions nodes
The pods run and drain work
Backlog falls
KEDA reduces replicas
Empty or underused nodes become candidates for consolidation or scale-down
Every one of those steps can distort cost if the inputs are bad. Bad requests make steps 3 and 4 expensive. Broad affinity rules make step 3 restrictive. Aggressive max replicas make step 2 overshoot downstream dependencies. Long cooldown makes step 7 lazy. Strict disruption controls make step 8 sluggish. This is why reading about each feature in isolation doesn't help, the features interact, and the interactions are where the money goes.
Heuristics I Actually Trust
If a workload processes a queue and nobody notices a 30 to 90 second startup delay, push hard toward KEDA and scale-to-zero. If a workload carries user-facing latency and steady traffic, keep an on-demand baseline and use HPA or carefully tuned KEDA where it actually helps. If a workload has high startup cost, fix the startup cost before deciding elastic design is impossible, which is the real blocker more than people realize. If a workload's requests exceed observed steady use by multiples for months, treat that as technical debt, not safety.
If you can't explain why a placement rule exists, remove or challenge it. And if scale-down is consistently slower than scale-up, you've got a cost leak.
Managed Prometheus, Grafana, and Telemetry Cost
Observability is where a lot of respectable teams accidentally build a second bill. I've written about Managed Prometheus and Grafana before, and I love the managed stack, but the managed part only removes operational burden, it doesn't remove the need for discipline. Teams want logs, metrics, dashboards, and alerts. They want good debugging during incidents. All valid. The problem is that teams never turn those goals into signal policies and just enable collection and let volume become culture.
Metrics are valuable when they help you understand platform health, saturation, error behavior, and backlog. They become expensive when label cardinality explodes or you scrape everything at a fine interval on the theory that more detail feels safer.
A few practices help: keep scrape targets intentional, avoid unbounded labels (like request IDs in metrics, seriously, don't), aggregate at the right level for dashboards, reserve high-frequency scraping for signals where it actually changes response quality, and review custom exporters before they become permanent.
Logs Need Shaping, Not Reverence
The fastest way to grow telemetry cost is to treat all logs as equally valuable. Platform-critical namespaces need enough collection for real diagnosis. General application namespaces need useful application and stderr output, but debug logging shouldn't be a permanent religion. Batch namespaces should bias toward outcome logs, failures, and coarse progress rather than streaming every heartbeat forever. And development or ephemeral environments should get aggressively reduced retention and verbosity. There's no heroism in paying production-grade log retention for workloads that exist mainly to prove somebody remembered how Helm works.
So, the simplest telemetry cost review is monthly: which namespaces produce the most logs, which labels drive the highest cardinality, which dashboards are actually used during incidents, and which alerts fire without resulting in action? If you can't answer those, the observability estate is more expensive than it needs to be. The Managed Prometheus pricing will make all of this very clear once you see the per-metric-sample costs add up :)
Policy and Guardrails
Alright, none of this matters if you can't enforce it. I've learned this the hard way, you can design the most cost-efficient cluster architecture in the world, and six months later it'll be back to overspending if there's nothing preventing teams from shipping unbounded requests and permanent replicas for idle workers.
Require Requests and Limits
This is foundational and everything else builds on it.
Prevent Service Workloads From Landing on Spot by Accident
One practical approach is to enforce a label contract. Only allow cost-profile=interruptible workloads to tolerate the burst taint. Without this, it's just a matter of time before someone's production API ends up on a spot node (and then you get the incident that makes everyone afraid of spot forever).
Enforce Telemetry Profiles by Namespace
Namespace labels such as telemetry-profile=critical|standard|shaped give you a hook for log collection rules, retention classes, and alerting policy. This means you can enforce different collection behavior per namespace instead of treating the whole cluster like it's equally important.
Review Requests Against Reality Periodically
You can't policy-gate everything, so some things need a review loop. I like a monthly or quarterly report that flags: namespaces where requested CPU is consistently far above actual use, workloads with zero utilization but permanent replicas, and KEDA workloads whose cooldown keeps them alive too long after demand ends. The review doesn't need to be long, but it needs to exist. Without it, every cost-aware decision you made at design time will erode within a quarter.
Runbook for Cost Investigations
When the bill jumps, the useful questions are simple: which cost bucket actually moved (compute, disks, networking, or observability), did the baseline go up or did the peak behavior widen, and are requests still aligned with actual usage? If you can answer those three, you've got the shape of the problem before you touch anything in Kubernetes.
Inspect Capacity Behavior
az aks nodepool list \
--resource-group $RESOURCE_GROUP \
--cluster-name $CLUSTER_NAME \
-o table
kubectl get nodes -L workload-tier,capacity-class,karpenter.sh/nodepool
kubectl top nodes
kubectl top pods -A --containers
Compare Requests Versus Usage
This is where Prometheus earns its keep. Requested CPU by namespace:
sum by (namespace) (
kube_pod_container_resource_requests{resource="cpu"}
)
Actual CPU usage by namespace over 5 minutes:
sum by (namespace) (
rate(container_cpu_usage_seconds_total{image!=""}[5m])
)
If you graph requested versus used resources side by side, cost conversations stop being emotional much faster.
The Bill Is Usually Telling the Truth
The clusters I run now are not cheaper because I found some magic setting, they're cheaper since I stopped buying comfort I couldn't justify. Extra nodes nobody checked, retention policies nobody questioned, pools nobody uses, that stuff adds up fast.
Node image retirement is not an emergency if you treat it as a predictable operating model. How to migrate images without chaos, what to test, and how to govern the decision.
APIM isn't just a gateway. It's a governance layer that enforces consistency across AKS, Container Apps, and other platforms. When to use it and when to keep things simple.
Network policy is not theoretical; Cilium and eBPF make it practical. Learn when segmentation actually matters, how to observe before you enforce, and why most teams get it wrong at first.