You have a few options in Azure when you want a financially backed SLA for your VM deployments. When you can run in a distributed model, you can get 99.95%; when you can’t, you have the option of getting a 99.9% SLA by using Premium Disks. But what if I want more?
If you want more, then it’s going to cost you more, but before we jump into solutions, let’s understand what the numbers mean and why we should care.
You’ve probably heard of the N-nines SLA: three nines, four nines, five, six. To explain what that means, the table below shows what those numbers mean in actual downtime.
In Azure, for IaaS deployments, we have the option of getting a 99.9% or a 99.95% SLA. 99.9% translates into an acceptable downtime of about 8.76 hours per year, while 99.95% translates into around 4.38 hours per year. Now does this mean that we will have 4 or 8 hours of downtime for all of our IaaS deployments? Of course not, but it might happen, and that’s why you need to take all the necessary precautions so that your business-critical application stays online all the time. We didn’t have the option of receiving a financially backed SLA for single VMs until recently, so this is a big plus.
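If you want to check the arithmetic yourself, here is a minimal sketch in Python. It assumes a 365-day year (8,760 hours), and the function name is my own, not anything from Azure. Under that assumption 99.9% leaves roughly 8.76 hours per year and 99.95% roughly 4.38 hours.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year of 8,760 hours


def allowed_downtime_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget, in hours, that an SLA percentage leaves over a period."""
    return period_hours * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    # Print the yearly downtime budget for the SLA levels discussed in this post.
    for sla in (99.9, 99.95, 99.99):
        hours = allowed_downtime_hours(sla)
        print(f"{sla}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

Pass a different `period_hours` (for example `30 * 24`) if you would rather see the budget per month instead of per year.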
Recently, Microsoft announced at Ignite the public preview of Availability Zones, which boost the SLA number to 99.99%, lowering the downtime to around 52 minutes a year. But what are they exactly?
Availability Zones are the actual datacenters within a single region. All regions start with three zones, but during this preview you might not be able to deploy services to all of them. If we’re talking about West Europe, then this region has three datacenters that are physically separated for all intents and purposes. For Microsoft to financially back a 99.99% SLA, the datacenters in a region have different power, network, and cooling providers, so that if something happens to one of those providers you won’t have a full region outage. They are also 30 km apart from each other, so they are protected from physical faults as well.
With Availability Zones, Microsoft also released zone-aware SKUs for some services, like the Standard Load Balancer and Standard Public IP. At the time of writing, we can deploy VMs, VMSS, Managed Disks and IPs into an Availability Zone, while SQL DB, Cosmos DB, Web Apps and Application Gateway already span three zones.
If you want to benefit from the four-nines SLA, then you either deploy directly into Availability Zones or you redeploy your existing VMs into them.
As you can see from the above diagram, you need to use services that span zones, and after that, you need to deploy them in pairs just as you would do with Availability Sets. You clone your deployments, implement them in different zones, and you benefit from the 99.99% SLA.
*Preview Service: You have no guaranteed SLA while this service is in preview. Once it goes GA, you will receive a financially backed SLA.
Achieving the SLA.
We have a couple of SLA numbers in our heads; now let’s understand how to obtain them.
99.9% SLA – Single VMs – All your single VM deployments have to be backed by premium storage. That means that both the OS and data disks have to be SSDs. You cannot mix and match and still qualify for the financially backed SLA. The best candidates for single VMs are the ones running relational databases or systems that cannot run in a distributed model. I wouldn’t recommend running web servers on a single VM; you have App Services for that.
99.95% SLA – Availability Sets – All your distributed systems should run in Availability Sets to benefit from the 99.95% SLA and, unlike single VM deployments, it doesn’t matter if you’re running Standard or Premium storage on them. AV Sets work nicely for web servers or other types of applications that are stateless or keep their state somewhere else. If your application has to keep its state on the actual VM, then your options are limited to the Load Balancer, which can be set to use sticky sessions, but you will have problems in the long run. For stateful applications, it’s best to keep their state in a Redis Cache, a database or Azure Files shares. This type of deployment works very well for most apps out there.
99.99% SLA – Availability Zones – This is the strongest SLA you can get at this time for your IaaS VMs. Availability Zones are similar in concept to the Availability Set deployment; you need to be aware of which candidates you’re deploying to the zones, from an application standpoint and also from a financial standpoint. I’m saying financial because you need to use zone-spanning services like the Standard SKU for the Public Load Balancer and Public IP. The Standard Load Balancer is not free like the Basic one: you pay for the number of load balancing rules you have, and you also pay for the data it processes.
Financially backed SLA
Now that we have a basic understanding of SLAs, we have to understand what financially backed means for any cloud provider. When providers say that an SLA is financially backed, they mean that if something on the provider’s side causes an SLA breach, they will reimburse the running costs of the VM for the period when the downtime occurred.
The formula looks like this:
Multiple VMs in Availability Sets
Monthly Uptime % = (Maximum Available Minutes – Downtime) / Maximum Available Minutes x 100
Maximum Available Minutes – This is the total number of runtime minutes for two or more VMs in a month.
Downtime – This is the total number of minutes where there was no connectivity to any of the VMs in the AV Set.
This means that if the Monthly Uptime percentage is lower than 99.95%, you can ask Microsoft to grant you service credits.
Single VMs with Premium Storage
Monthly Uptime % = (Minutes in the Month – Downtime) / Minutes in the Month x 100
Minutes in the Month – Total number of minutes in a month.
Downtime – Total number of downtime minutes out of the Minutes in the Month.
This means that if the calculated Monthly Uptime percentage is lower than 99.9%, then you can ask Microsoft to grant you service credits.
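Both formulas have the same shape, so you can sanity-check your own numbers with a few lines of Python. This is only a sketch: the function names and the downtime figures in the example are made up for illustration, and the real calculation is whatever Microsoft applies when you file a claim.

```python
def monthly_uptime_percent(total_minutes: float, downtime_minutes: float) -> float:
    """(total minutes - downtime) / total minutes x 100, as in the formulas above."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - downtime_minutes) / total_minutes * 100.0


def breaches_sla(uptime_percent: float, sla_percent: float) -> bool:
    """True when the measured uptime falls below the SLA threshold."""
    return uptime_percent < sla_percent


if __name__ == "__main__":
    minutes_in_month = 30 * 24 * 60  # hypothetical 30-day month: 43,200 minutes

    # Hypothetical single VM on Premium Storage with 50 minutes of downtime.
    single_vm = monthly_uptime_percent(minutes_in_month, 50)
    print(f"Single VM: {single_vm:.3f}% breach={breaches_sla(single_vm, 99.9)}")

    # Hypothetical Availability Set with 20 minutes of downtime.
    av_set = monthly_uptime_percent(minutes_in_month, 20)
    print(f"AV Set:    {av_set:.3f}% breach={breaches_sla(av_set, 99.95)}")
```

In a 30-day month the 99.9% threshold leaves about 43 minutes of downtime and the 99.95% threshold about 21 minutes, so the first example would be a breach and the second would not.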
You might ask: how do I know that I had an SLA breach?
Well, you need to measure the uptime of your application. In the end, you might not care if one VM from your Availability Set is down for, say, 10 minutes, but you will care if somebody calls you because the website is down. You have multiple options out there to measure the availability of your application, like UptimeRobot, Monitis, Pingdom, etc. You can also do measurements in Azure with Azure Monitor, but that doesn’t give you application uptime, so you need the best of both worlds to have an accurate view of the situation.
I configure both because I want to know when something happens to a VM, and I also want to know if the application is up and healthy. The reason is that if you’re using, say, VMs and PaaS services, you need to know which one caused the downtime and whether it was a human error. Microsoft will not pay for your mistakes, so you need to have self-healing systems in place to avoid human error. There are a lot of Configuration Management systems out there, like DSC / Chef / Puppet, which make sure that your configuration stays the way you defined it. Azure has Desired State Configuration integrated into it, for example, which gives you the ability to enforce states on VMs based on a configuration manifest.
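To give you an idea of what an external check does, here is a rough sketch of a home-made probe using only the Python standard library. It is not how UptimeRobot, Pingdom or Azure Monitor work internally; the function names, the fixed check interval and the URL in the usage note are assumptions for illustration.

```python
import urllib.error
import urllib.request
from typing import Iterable


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # HTTP errors (4xx/5xx), DNS failures, refused connections and timeouts land here.
        return False


def downtime_minutes(results: Iterable[bool], interval_minutes: float = 1.0) -> float:
    """Estimate downtime by counting failed checks taken at a fixed interval."""
    failed_checks = sum(1 for ok in results if not ok)
    return failed_checks * interval_minutes
```

You would call something like `is_up("https://www.example.com/health")` from a scheduler every minute, store the results, and feed them to `downtime_minutes` at the end of the month. Keep in mind that a probe like this only tells you the application was unreachable from where the probe runs, not whose fault it was.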
That being said, gaining a financially backed SLA in Azure is not rocket science. I hope you obtained some useful information from this post 🙂
Have a good one!