Symptoms : Always On now reads as ‘not synchronized’, indicating loss of quorum for the database.

Impact : Critical

Each node and witness in a cluster has one vote towards the quorum, and the database is only viewed as ‘healthy’ when there is a majority of ‘online’ votes in the quorum.

Expected behavior :

The database should never lose quorum. Any occurrence needs to be investigated.

Possible causes

Hardware failure   Priority : Critical
The Cluster service on this node may have stopped, or the availability replica may have transitioned to the resolving role. The cluster node was removed from the active failover cluster membership.
Recommended action :
All nodes and witnesses are located on specific hardware. Failure of this hardware will result in loss of the vote and may cause the quorum to be lost. Track which nodes are still functioning, to determine the point of failure. Take appropriate steps to bring the missing node back up.
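
As a first pass at tracking which nodes are still functioning, a simple reachability check can help narrow down the point of failure. The following is a minimal sketch, not a replacement for the cluster's own tooling. The function names are hypothetical, and it assumes each SQL Server node listens on the default port 1433; adjust the port if your instances use a different one. A successful TCP connection only shows that the host answers on that port, not that it is an active cluster member.

```python
import socket
from typing import Dict, Iterable


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def survey_nodes(hosts: Iterable[str], port: int = 1433) -> Dict[str, bool]:
    """Map each host name to whether it answered on the given port."""
    return {host: is_reachable(host, port) for host in hosts}


# Hypothetical usage; replace the names with your own cluster members:
# for host, up in survey_nodes(["sqlnode1", "sqlnode2"]).items():
#     print(f"{host}: {'reachable' if up else 'NOT reachable'}")
```

Any node reported as not reachable is a candidate for the missing vote and should be checked first.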

Network failure  Priority : High
Either a networking or a firewall issue exists. The cluster node was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster.
Recommended action :
Run the ‘Validate a Configuration’ wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Background

AlwaysOn is a term used in the context of Microsoft’s ‘high-availability and disaster recovery’ solutions. It has two modes:

  • AlwaysOn Failover Cluster Instances (FCI) – basically, traditional clusters;
  • AlwaysOn Availability Groups (AG) – basically, mirroring.

The foundation of both modes is to use multiple SQL Server hosts to distribute workload and also to address issues of disaster recovery, but the exact ways they enable high availability and disaster recovery are quite different.

  • Failover Cluster Instances use clustering technology to create two or more nodes (teamed hosts) that coordinate with the domain controller to specify which of those physical nodes will control a virtual instance (IP address, Virtual Network Name, and a virtual instance), but the instance can be active on only one physical host at a time. There is only a single copy of the databases;
  • AlwaysOn Availability Groups use mirroring (server-to-server communication via endpoints) to keep synchronized copies of data on multiple hosts and keep AG listeners pointed at a read/write replica for normal database interactions. This also allows read-only replicas, which can be used for scale-out.

The health of a cluster is defined via a voting mechanism. The cluster as a whole is represented as a QUORUM, consisting of a number of host nodes and a witness of one of two possible types (disk witness or file share witness).

Each node has a VOTE, and so does the witness. The cluster must have a majority of “online” votes to be viewed as operational. How the votes are counted depends on the quorum scheme. In a static scheme, the total number of votes is constant, so nodes that are shut off are counted as “offline”: they still count toward the total and therefore work against the majority. In dynamic quorum schemes, nodes that are shut down are removed from the count altogether.
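
The difference between the two schemes can be illustrated with a small calculation. This is a simplified model of the description above, not the actual algorithm used by Windows Server Failover Clustering. The state names and the `has_quorum` function are hypothetical, and the sketch assumes that only members shut down deliberately are dropped from a dynamic count, while members that failed unexpectedly still count as offline votes.

```python
from typing import Dict

ONLINE = "online"      # member is up and voting
FAILED = "failed"      # member went away unexpectedly (assumption: still counted)
SHUTDOWN = "shutdown"  # member was shut down deliberately


def has_quorum(members: Dict[str, str], dynamic: bool = False) -> bool:
    """Return True if online votes form a strict majority of counted votes."""
    online = sum(1 for state in members.values() if state == ONLINE)
    if dynamic:
        # Dynamic scheme: members that were shut down leave the count.
        total = sum(1 for state in members.values() if state != SHUTDOWN)
    else:
        # Static scheme: every configured member always counts.
        total = len(members)
    # Integer form of "online > total / 2".
    return 2 * online > total


# Four nodes plus a witness: five votes in total.
cluster = {
    "node1": ONLINE,
    "node2": FAILED,
    "node3": SHUTDOWN,
    "node4": SHUTDOWN,
    "witness": ONLINE,
}

print("static :", has_quorum(cluster, dynamic=False))
print("dynamic:", has_quorum(cluster, dynamic=True))
```

In this example the static scheme sees 2 online votes out of 5 and loses quorum, while the dynamic scheme sees 2 online votes out of the 3 that remain counted and keeps it.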

Clusters are more stable with an odd number of votes, so it is good practice to configure the quorum that way.
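
The arithmetic behind this recommendation can be sketched as follows, assuming a static scheme with a strict majority rule. The function name is hypothetical.

```python
def tolerated_failures(total_votes: int) -> int:
    """Number of votes that can go offline while a strict majority remains."""
    majority = total_votes // 2 + 1
    return total_votes - majority


# Compare odd and even vote counts side by side.
for votes in range(2, 8):
    print(f"{votes} votes -> tolerates {tolerated_failures(votes)} offline")
```

Three votes and four votes both tolerate only one offline vote, so an even-numbered extra vote adds no resilience. With an even count, an equal split also leaves neither half with a majority.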