Plan and implement a high availability and disaster recovery environment

Describe high availability and disaster recovery strategies

Describe recovery time objective and recovery point objective

Recovery Time Objective

Recovery Time Objective (RTO) is the maximum amount of time available to bring resources online after an outage or problem.

Recovery Point Objective

Recovery Point Objective (RPO) is the point in time to which a database should be recovered and equates to the maximum amount of data loss that the business is willing to accept.

Defining Recovery Time and Recovery Point Objectives

One of the aspects crucial for both RTO and RPO is knowing the cost of downtime

Explore high availability and disaster recovery options

Infrastructure-as-a-Service versus Platform-as-a-Service

SQL Server HADR Features for Azure Virtual Machine

The features available in SQL Server are shown in the table below


Feature Name


Always On Failover Cluster Instance (FCI)


Always On Availability Group (AG)


Log Shipping


An instance of SQL Server is the entire installation of SQL Server (binaries, all the objects inside the instance including things like logins, SQL Server Agent jobs, and databases).

Instance-level protection means that the entire instance is accounted for in the availability feature.

Database-level protection means that anything that is in the database, or is captured in the transaction log for a user or application database, is accounted for as part of the availability feature.

Anything that exists outside of the database or that is not captured as part of the transaction log such as SQL Server Agent jobs and linked servers must be manually dealt with to ensure the destination server can function like the primary if there is a planned or unplanned failover event.

Always On Failover Cluster Instances

An FCI is configured when SQL Server is installed. A standalone instance of SQL Server cannot be converted to an FCI.

The FCI is assigned a unique name as well as an IP address that is different from the underlying servers, or nodes, participating in the cluster.

The name and IP address must also be different from the underlying cluster mechanism.

Applications and end users would use the unique name of the FCI for access.

This abstraction enables applications to not have to know where the instance is running.

One major difference between Azure-based FCIs versus on-premises FCIs, is that for Azure, an internal load balancer (ILB) is required.

The ILB is used to help ensure applications and end users can connect to the FCI’s unique name.

FCIs using Standard Edition of SQL Server can have up to two nodes.

Always On availability groups

The primary replica is the instance participating in an AG that contains the read/write databases.

A secondary replica is where the primary sends transactions over the log transport to keep it synchronized.

Data movement between a primary replica can be synchronous or asynchronous

An AG in Standard Edition can have at most two replicas (one primary, one secondary) whereas Enterprise Edition supports up to nine (one primary, eight secondary).

An AG provides abstraction with the listener.

The listener functions like the unique name assigned to an FCI and has its own name and IP address that is different from anything else (WSFC, node, etc.).

Log shipping

The feature is based on backup, copy, and restore and is one of the simplest methods of achieving HADR for SQL Server.

Log shipping is primarily used for disaster recovery, but it could also be used to enhance local availability.

Describe Azure high availability and disaster recovery features for Azure Virtual Machines

Azure provides three main options to enhance availability for IaaS deployments:

All three of these options are external to the virtual machine (VM) and do not know what kind of workload is running inside of it.

· Availability Sets

§ Availability sets are separated into both fault domains and update domains to support both updates to the underlying Azure Infrastructure.

§ Fault domains are sets of servers within a data center, which use the same power source and network.

§ There can be up to three fault domains in a data center as depicted in the image below by FD 0, 1, and 2.

§ Update domains, denoted by UD in the image below, indicate groups of virtual machines and underlying physical hardware that can be rebooted at the same time

§ Availability sets is not the only way to separate IaaS VMs.

§ Azure also provides Availability Zones, but the two cannot be combined. You can pick one or the other.


· Availability Zones

o Availability zones account for data center-level failure in Azure.

o Each Azure region consists of many data centers with low latency network connections between them.

o When you deploy VM resources in a region that supports Availability Zones, you have the option to deploy those resources into Zone 1,2, or 3.

o A zone is a unique physical location, that is, a data center, within an Azure region.

· Azure Site Recovery

o Azure Site Recovery provides enhanced availability for VMs at the Azure level and can work with VMs hosting SQL Server.

o Azure Site Recovery replicates a VM from one Azure region to another to create a disaster recovery solution for that VM.

o The disks of a VM configured to use Azure Site Recovery are replicated to another region.

Describe high availability and disaster recovery options for PaaS deployments

PaaS is different when it comes to availability; you can only configure the options that Azure provides.

For the SQL Server-based options of Azure SQL Database and Azure SQL Database Managed Instance, the options are active geo-replication (Azure SQL Database only) and autofailover groups (Azure SQL Database or Azure SQL Database Managed Instance).

Azure Database for MySQL has a service level agreement, which guarantees availability of 99.99, meaning nearly no downtime should be encountered.

Azure Database for PostgreSQL uses a similar model to MySQL in its standard deployment model; however, Azure PostgreSQL also offers a scale-out hyperscale solution called Citus.

Citus provides both scale-out and additional high availability for a server group.

If enabled, a standby replica is configured for every node of a server group, which would also increase cost since it would double the number of servers in the group

Explore an IaaS high availability and disaster recovery solution

Distributed availability group

A distributed AG is an Enterprise Edition only feature introduced in SQL Server 2016.

It is different than a traditional AG.

Instead of having one underlying WSFC where all of nodes contain replicas participating in one AG as described in the previous example, a distributed AG is made up of multiple AGs.

The primary replica containing the read write database is known as the global primary.

The primary of the second AG is known as a forwarder and keeps the secondary replica(s) of that AG in sync. In essence, this is an AG of AGs.

Describe hybrid solutions

While an architecture can be deployed in one or more Azure regions, many organizations will need or want to have solutions that span both on premises and Azure, or possibly Azure to another public cloud.

This type of architecture is known as a hybrid solution.

Explore IaaS and PaaS platform tools for high availability and disaster recovery

Describe failover clusters in Windows Server

For all availability configurations of availability groups (AGs), an underlying cluster is required.

There are many aspects of setting up a cluster that you need to be aware of.

One of the most important aspects is deciding what to use for a witness resource.

Witness is a core component of the quorum mechanism.

Quorum is what helps ensure that everything in the WSFC stays up and running.

If you lose quorum, the WSFC will go down taking an AG or FCI with it.

The witness resource can be a disk, file share (SMB 2.0 or later), or cloud witness.

The next consideration is the Microsoft Distributed Transaction Coordinator (DTC or MSDTC).

Most WSFC deployments require the use of both AD DS and DNS;

A typical Azure-based WSFC will only require a single virtual network card (vNIC).

For a WSFC to be considered supported, it must pass cluster validation.

To create a WSFC properly in Azure, you cannot use the Wizard in Failover Cluster Manager for FCIs or AGs deployed using Windows Server 2016 and earlier

Due to the DHCP issue mentioned earlier, currently the only way to create the WSFC is to use PowerShell and specify the IP address.

For a configuration that has shared storage, use the following syntax:

New-Cluster -Name MyWSFC -Node Node1,Node2,…,NodeN -StaticAddress w.x.y.z

Configure Always-on availability groups

Same as the WSFC itself, you cannot reserve the listener’s IP address in Azure so you need to ensure something else does not come along and grab it otherwise there could be a conflict on the network, which in turn could cause availability headaches.

All virtual machines (VMs) that are participating in an AG should have the same storage configuration.

Before an AG can be configured, the AG feature must be enabled. This can be done in SQL Server Configuration Manager .

Once the listener is created, an internal load balancer (ILB) must be used.

Without configuring an ILB, applications, end users, administrators, and others cannot use the listener unless they were connected to the VM that hosts an AG’s primary replica.

Another consideration for the load balancer is the probe port.

Without the probe port, the listener will not work properly as it is not enough just to create the load balancer.

Each IP address that will use the load balancer requires a unique probe port.

If there are going to be two listeners, there must be two probe ports.

Probe ports are high numbers such as 59999.

If you have a multi-subnet configuration, a load balancer will need to be configured in each subnet (whether or not the other subnet is deployed to different region) and the probe port for that region associated with the IP resource for that subnet in the WSFC.

Describe temporal tables in Azure SQL Database

Azure SQL Database allows you to track and analyze the changes to your data using a feature called Temporal Tables.

This feature requires that the tables themselves be converted to be temporal, which means the table will have special properties and will also have a corresponding history table.

The Temporal Table feature allows you to use the history table to recover data that may have been deleted or updated.

Recovering data from the history table is a manual process involving Transact-SQL, but could be helpful in certain scenarios such as if a user accidentally deletes important data that the business needs.

Describe active geo-replication for Azure SQL Database

Another method to increase availability for Azure SQL Database is to use active geo-replication.

Active geo-replication creates a replica of the database in another region that is asynchronously kept up to date.

That replica is also readable, similar to an AG in IaaS.

Explore auto-failover groups for Azure SQL Database and Azure SQL Database Managed Instance

An auto-failover group is an availability feature that can be used with both Azure SQL Database and Azure SQL Database Managed Instance.

Autofailover groups let you manage how databases on an Azure SQL Database server or databases in Azure SQL Database Managed Instance are replicated to another region, and let you manage how failover could happen.

The name assigned to the autofailover group must be unique within the * domain.

Azure SQL Database Managed Instance only supports one autofailover group.

Autofailover groups provide AG-like functionality called a listener, which allows both read-write and read-only activity

There are two different kinds of listeners: one for read-write and one for read-only traffic.

Failovers can be performed manually even if automatic failover is allowed. Depending on the type of failover, there could be data loss.

Unplanned failovers could result in data loss if forced and the secondary is not fully synchronized with the primary.

Configuring GracePeriodWithDataLossHours controls how long Azure waits before failing over.

The default is one hour. If you have a tight RPO and cannot afford much data loss, set the value higher so Azure will wait longer before failing over, hopefully resulting in less data loss.

One autofailover group can contain one or more databases.

The database size and edition will be the same on both the primary and secondary.

The database is created automatically on the secondary through a process called seeding.

Depending on the size of the database, this may take some time.

Ensure that you plan accordingly and that you take into account things like the speed of the network.

Back up and restore databases

Back up and restore SQL Server running on Azure virtual machines

SQL Server has two types of databases: system and user.

System databases are the ones used by SQL Server such as master and msdb.

User databases are the ones created by users that store the data for applications

The most common types of backups generated for SQL Server installations are full, differential, and transaction log.

There are other backup options such as copy-only, file, filegroup, partial, and more.

Back up a SQL Server virtual machine

Azure Backup can back up VMs that contain SQL Server.

These backups would contain not just SQL Server databases; they would everything that is in the VM so it could be restored as a whole.

While this option may not be right for everyone, it can potentially protect against problems like ransomware.

Backup to URL requires an Azure storage account and uses the Azure blob storage service.

The following statement would back up a transaction log.

BACKUP LOG contoso

TO URL = ‘'

The following statement would restore a full database backup without recovering it, so that a differential or transaction log backups could be applied.




A SQL Server Credential can be composed of the Azure storage account name and access key authentication or a Shared Access Signature.

If the former is used, the backup will be stored as a page blob and if the latter, it will be stored as a block blob.

Any IaaS VM that has SQL Server installed can use the SQL Server resource provider.

One of its options is the ability to configure automated backups so Azure takes care of backing up SQL Server databases.

It requires the use of a storage account.

The automated backup option is currently only available for Windows Server-based SQL Server installations.

Back up and restore a database using Azure SQL Database

Back up and restore on SQL Server PaaS offering work differently than on IaaS.

Backups are generated automatically for Azure SQL Database, and Azure SQL Managed Instance.

A full backup is created once a week, a differential every 12 hours, and transaction log backups every 5–10 minutes.

All backups are located in read-access, geo-redundant (RA-GRS) blobs replicated to a datacenter that is paired based on Azure rules.

That means backups are safe from an outage in a single data center.

If the server containing the database is deleted, all backups will be deleted at the same time, and there is no way to recover them.

If the server is not deleted but the database is, you can restore the database normally.

Within Azure SQLMI you can only generate a COPY_ONLY backup since SQL Database Managed Instance is maintaining the log chain.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Setumo Raphela

Setumo Raphela


Entrepreneur | Data Scientist | AI | Jet Skier | Author |Oracle