This month Proximity Placement Groups have been announced as a public preview offer and this post is here to tell you about them.
For a long time, we’ve been using availability sets to bring our applications as close together as possible and get the lowest possible latency. However, this couldn’t always be achieved, because the cloud is shared with multiple customers and you’re deploying resources alongside other people. This means that if you’re looking to run a latency-sensitive application in Azure, Availability Sets or Availability Zones are not always the answer.
I’ve been holding off on this post for a while now to gather more information and figure out how things can be done better. This post is a culmination of the problems I’ve had with Service Fabric and how I solved them, and hopefully it helps you solve yours before you have a disaster on your hands.
Starting from the beginning. What is Service Fabric?
Service Fabric is a distributed systems platform that makes it easy to package, deploy, and manage scalable and reliable microservices and containers. Service Fabric also addresses the significant challenges in developing and managing cloud native applications. Developers and administrators can avoid complex infrastructure problems and focus on implementing mission-critical, demanding workloads that are scalable, reliable, and manageable. Service Fabric represents the next-generation platform for building and managing these enterprise-class, tier-1, cloud-scale applications running in containers.
Source: Azure docs
That being said, I have a small list of recommendations that should be enforced in practice so that you don’t repeat the mistakes I had to repair.
There’s no shortage of solutions when it comes to NGFW in the cloud, but they all come at a hefty price, have a steep learning curve and require continuous maintenance from the ops teams. We have solutions from Barracuda, Fortigate, Checkpoint, Cisco and so on, but in the end they are Linux Virtual Machines running third-party software, with or without built-in HA. Azure Firewall is here to provide another solution that can solve some of the issues that come from NVAs deployed in the cloud…but not all of them.
Let’s start off with what Azure Firewall can and cannot do at this moment:
Azure Firewall is:
A stateful firewall as a service
Has built-in high availability
Can do FQDN filtering
It has support for FQDN tags – At the time of writing we have support for Windows Update, ASE and Azure Backup
You can add network traffic filtering rules
Has outbound SNAT support
Has inbound DNAT support
You can centrally create, enforce, and log application and network connectivity policies across Azure subscriptions and VNETs
Azure Firewall is NOT:
An Intrusion Prevention System (IPS)
An Intrusion Detection System (IDS)
If you compare Azure Firewall with any NGFW solution from the marketplace, you will see that it lacks a lot of features and might not appear to solve any of today’s issues, but stay a while and listen 🙂
Think of this. The current third-party firewalls started in the on-premises environment as physical appliances and then slowly evolved into virtual appliances, so most (not all) of them have features that are useless in the cloud (and you pay for them). Another thing is that you have to manage them end to end and even back them up. They are not a managed service that you license from a provider and just consume; each one is a full-blown IaaS machine, and the list can go on.
What is Azure Firewall for?
Azure Firewall is a cloud-native stateful firewalling service that is not deployed as a VM. It’s a fully managed security service by Microsoft that scales automatically and requires no maintenance from the user (hence the fully managed part), and the only thing that you need to do is to configure it correctly.
At the time of writing this post, Azure Firewall blocks all inbound/outbound traffic by default, with the possibility to allow IP addresses, FQDNs or CIDR blocks. It deploys a UDR in the VNET it creates to redirect the 0/0 traffic through it, just like an NVA. It also plugs into Azure Monitor, and I suspect that it will plug into Traffic Analytics and ASC because it makes sense in the long term.
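To make the deny-by-default idea concrete, here is a minimal sketch in Python. It is a toy model only: the AllowList class and its method names are made up for illustration and have nothing to do with the actual Azure Firewall API or with how the service processes rules internally. It just shows that everything is rejected unless an IP address, CIDR block or FQDN has been explicitly allowed.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class AllowList:
    """Toy default-deny filter (hypothetical, not the Azure Firewall API)."""

    networks: list = field(default_factory=list)  # allowed IPs / CIDR blocks
    fqdns: set = field(default_factory=set)       # allowed host names

    def allow_cidr(self, cidr: str) -> None:
        # A single IP such as "10.0.0.5" is stored as a /32 network.
        self.networks.append(ipaddress.ip_network(cidr))

    def allow_fqdn(self, fqdn: str) -> None:
        self.fqdns.add(fqdn.lower().rstrip("."))

    def permits_ip(self, address: str) -> bool:
        # Denied unless the address falls inside an allowed network.
        ip = ipaddress.ip_address(address)
        return any(ip in network for network in self.networks)

    def permits_fqdn(self, fqdn: str) -> bool:
        # Denied unless the exact host name was allowed.
        return fqdn.lower().rstrip(".") in self.fqdns


if __name__ == "__main__":
    rules = AllowList()
    rules.allow_cidr("10.0.1.0/24")
    rules.allow_fqdn("example.com")
    for target in ("10.0.1.7", "10.0.2.7"):
        print(target, "allowed" if rules.permits_ip(target) else "blocked")
```

With an empty allow list every lookup comes back as blocked, which is the behaviour you get from the service before you add any rules.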
Deploying an Azure Firewall is pretty simple and it doesn’t require too much configuration and a reference architecture looks something like this:
The best practices around Azure Firewall show that it should be deployed in a hub & spoke architecture, where the hub holds your core / shared services and the spokes connect through it. The main reason for this is that the entry price is 780 EUR per scaling unit. The way I see it, Azure Firewall in combination with NSGs, App Gateway WAF and other services like DDoS Protection Standard would add more value to the enterprise client than anything else.
Finally I would like to add that from my point of view, Azure Firewall is still a work in progress but a very welcome addition to the cloud security offering that Microsoft adds in Azure.
You have some options in Azure when you want to have a financially backed SLA for your VM deployments. When you can go into a distributed model, you can get 99.95%; when you can’t, you have the option of getting a 99.9% SLA by using Premium Disks. But what if I want more?
If you want more, then it’s going to cost you more but before we jump into solutions, let’s understand what the numbers mean and why we should care.
You probably heard of the N nines SLA; three nines, four nines, five, six. To explain what that means, down below we have an excellent table which illustrates to us what those numbers mean in actual downtime.
In Azure for IaaS deployments, we have the option of gaining a 99.9% or a 99.95% SLA. 99.9% translates into an acceptable downtime of around 8.76 hours per year, while 99.95% translates into around 4.38 hours per year. Now does this mean that we will have 4 or 8 hours of downtime for all of our IaaS deployments? Of course not, but it might happen, and that’s why you need to take all the necessary precautions so that your business-critical application stays online all the time. We didn’t have the option of receiving a financially backed SLA for single VMs until recently, so this is a big plus.
Recently, Microsoft announced at Ignite the public preview of Availability Zones, which boost the SLA number to 99.99%, lowering the downtime to around 52 minutes in a year. But what are they exactly?
Availability Zones are physically separate datacenters within a single region. All regions start with three zones, but during this preview you might not be able to deploy services to all of them. If we’re talking about West Europe, then this region has three datacenters that are physically separated for all intents and purposes. In order for Microsoft to financially back you for a 99.99% SLA, the datacenters in a region have different power, network, and cooling providers, so that if something happens to one provider you won’t have a full region downtime. They are also 30 km apart from each other, so they are protected from physical faults as well.
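If you want to check the downtime numbers yourself, the arithmetic is simple: take the share of time the SLA does not cover and multiply it by the hours in a year. Below is a minimal sketch in Python; the function name is my own and it assumes a non-leap year of 8,760 hours.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(sla_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget, in hours, that an SLA leaves over a period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_hours * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    for sla in (99.9, 99.95, 99.99):
        hours = allowed_downtime_hours(sla)
        print(f"{sla}% -> {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

Passing a different period_hours (for example 30 * 24 for a month) gives you the monthly budget instead of the yearly one.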
With Availability Zones, they also released Zone aware SKUs for some services like the Standard Load Balancer and Standard Public IP. At the time of writing we have the possibility of deploying VMs, VMSS, Managed Disks and IPs in an Availability Zone and SQL DB, Cosmos DB, Web Apps and Application Gateway already span three zones.
If you want to benefit from the four-nines SLA, then you either deploy directly into Availability Zones or you redeploy your existing VMs into them.
As you can see from the above diagram, you need to use services that span zones, and after that, you need to deploy them in pairs just as you would do with Availability Sets. You clone your deployments, implement them in different zones, and you benefit from the 99.99% SLA.
*Preview Service: You have no guaranteed SLA while this service is in a preview. Once it goes GA, you will receive a financially backed SLA.
Achieving the SLA.
We have a couple of SLA numbers in our head, let’s now understand how to obtain them.
99.9% SLA – Single VMs – All your single VM deployments have to be backed by premium storage. That means that both the OS and Data disks have to be SSDs. We cannot mix and match and still qualify for the financially backed SLA. The best candidates for single VMs are the ones running relational databases or systems that cannot run in a distributed model. I wouldn’t recommend running web servers on a single VM; you have App Services for that.
99.95% SLA – Availability Sets – All your distributed systems should run in Availability Sets to benefit from the 99.95% SLA and compared to single VM deployments, it doesn’t matter if you’re running Standard or Premium storage on them. AV Sets work nicely for Web Servers or other types of applications that are stateless or keep their state somewhere else. If your application has to keep its state on the actual VM, then your options are limited to the Load Balancer which can be set to have Sticky Sessions, but you will have problems in the long run. For stateful applications, it’s best to keep their state in a Redis Cache, Database or Azure Files Shares. This type of deployment works very well for most apps out there.
99.99% SLA – Availability Zones – This is the strongest SLA you can get at this time for your IaaS VMs. Availability Zones are similar in concept to the Availability Set deployment; you need to be aware of what candidates you’re deploying to the zones from an application standpoint and also from a financial standpoint. I’m saying financial because you need to use zone-spanning services like the Standard SKU for the Public Load Balancer and Public IP. The Standard Load Balancer is not free like the Basic one; you pay for the number of load balancing rules you have, and you also pay for the data processed by it.
Financially backed SLA
Now that we have a basic understanding of SLAs, we have to understand what financially backed means regarding any cloud provider. When they say that the SLAs are financially supported, they mean that if something on the provider’s side causes an SLA breach, they will reimburse the running costs of the VM when the downtime occurred.
The formula looks like this:
Multiple VMs in Availability Sets
Monthly Uptime % = (Maximum Available Minutes-Downtime) / Maximum Available Minutes X 100
Maximum Available Minutes – This is the total number of runtime minutes for two or more VMs in a month.
Downtime – This is the total number of minutes where there was no connectivity on any of the VMs in the AV Set.
This means that if the Monthly Uptime percentage is lower than 99.95%, you can ask Microsoft to grant you service credits.
Single VMs with Premium Storage
Monthly Uptime % = (Minutes in the Month – Downtime) / Minutes in the Month X 100
Minutes in the Month – Total number of minutes in the month.
Downtime – Total number of downtime minutes out of the Minutes in the Month.
This means that if the calculated Monthly Uptime percentage is lower than 99.9%, then you can ask Microsoft to grant you service credits.
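Both formulas above have the same shape, so they can be sketched with one small Python function. The function names are mine, and the example month (30 days, 30 minutes of downtime) is made up purely to show the comparison against the two thresholds.

```python
def monthly_uptime_percent(available_minutes: float,
                           downtime_minutes: float) -> float:
    """Monthly Uptime % = (available - downtime) / available x 100."""
    if available_minutes <= 0:
        raise ValueError("available_minutes must be positive")
    if not 0 <= downtime_minutes <= available_minutes:
        raise ValueError("downtime_minutes must be within the available minutes")
    return (available_minutes - downtime_minutes) / available_minutes * 100.0


def sla_breached(uptime_percent: float, sla_percent: float) -> bool:
    """True when the measured uptime is below the SLA threshold."""
    return uptime_percent < sla_percent


if __name__ == "__main__":
    minutes_in_month = 30 * 24 * 60  # hypothetical 30-day month
    uptime = monthly_uptime_percent(minutes_in_month, downtime_minutes=30)
    print(f"Monthly uptime: {uptime:.4f}%")
    print("Below 99.95% (Availability Sets):", sla_breached(uptime, 99.95))
    print("Below 99.9% (single VM):", sla_breached(uptime, 99.9))
```

In this made-up month, the same 30 minutes of downtime puts you below the 99.95% threshold but not below the 99.9% one, which is why you need to know which SLA applies to the deployment that was down.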
You might ask; How do I know that I had an SLA breach?
Well, you need to measure the uptime of your application. In the end, you might not care if one VM from your Availability Set is down for, say, 10 minutes, but you will care if somebody calls you when the website is down. You have multiple options out there to measure the availability of your application, like UptimeRobot, Monitis, Pingdom, etc. You also have the possibility of doing measurements in Azure with Azure Monitor, but that doesn’t give you application uptime, so you need the best of both worlds to have an accurate view of the situation. I configure both because I want to know when something happens to a VM, and I also want to know if the application is up and healthy. The reason is that if you’re using, say, VMs and PaaS services, you need to know which one caused the downtime and whether it was a human error. Microsoft will not pay for your mistakes, so you need to have self-healing systems in place to avoid human error. There are a lot of Configuration Management systems out there, like DSC / Chef / Puppet, which ensure that your configuration stays the way you defined it. Azure, for example, has Desired State Configuration integrated into it, which grants you the ability to enforce states on VMs based on a configuration manifest.
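If you want to see what an external availability check boils down to, here is a minimal sketch in Python using only the standard library. It is not how UptimeRobot, Pingdom or Azure Monitor work internally; the function names and the idea of keeping the results as a list of up/down samples are assumptions made for illustration.

```python
import urllib.error
import urllib.request
from typing import Sequence


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors, DNS failures, timeouts and malformed URLs count as down.
        return False


def measured_uptime_percent(samples: Sequence[bool]) -> float:
    """Share of probes that succeeded, as a percentage."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(samples) / len(samples) * 100.0


if __name__ == "__main__":
    # In practice you would probe your own endpoint on a schedule
    # (for example once a minute) and store every result.
    samples = [True, True, True, False]  # made-up probe results
    print(f"Measured uptime: {measured_uptime_percent(samples):.1f}%")
```

A probe like this only tells you what the user sees. You still need the VM-level signals to figure out which component caused the outage.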
That being said, gaining a financially backed SLA in Azure is not rocket science. I hope you obtained some useful information from this post 🙂