SQL Server clustering is a high-availability technology for SQL Server instances. It involves sharing server resources between two or more nodes (servers), which have one or more shared disks grouped into logical units called resource groups. A resource group containing at least an IP address, a network name and a disk resource is called a virtual server. The cluster service arbitrates ownership of the resource groups; only a single node can own a resource group and its associated resources at any given time.
Each virtual server appears on the network as a complete system. When the virtual server contains SQL Server resources, clients connected to it access resources on its current host node. While the terms "active" and "passive" are often used in this context, they are not fixed roles; all nodes in a cluster are interchangeable. Should the current host, sometimes designated the primary, fail, the resource group is transferred to another node (the secondary) in the cluster, which then becomes the primary and hosts the virtual server. With clusters of more than two nodes or two instances, it is important to set the failover order by choosing the preferred node ownership order for each instance. Active client connections are broken during failover, but they can reconnect to the virtual server now hosted by the new node. Unless the application handles reconnection, clients must reconnect manually, and work in progress is lost during the failover; most commercial applications now handle this reconnection task seamlessly.
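The reconnect-after-failover behavior described above can be sketched as a simple retry loop with exponential backoff. This is an illustrative sketch, not a specific client library's API: `connect_with_retry`, `fake_connect` and the `ConnectionError` raised mid-failover are all stand-ins for whatever your data-access layer provides.

```python
import time

def connect_with_retry(connect, max_attempts=5, base_delay=0.5):
    """Call `connect` until it succeeds, backing off between attempts.

    `connect` is any zero-argument callable that returns a connection
    object or raises ConnectionError; it stands in for your client
    library's connect call against the virtual server name.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # failover took too long; surface the error
            # Exponential backoff gives the failover time to complete.
            time.sleep(base_delay * (2 ** attempt))

# A fake connection that fails twice (as it would mid-failover),
# then succeeds once the virtual server is hosted by the new node.
attempts = {"n": 0}
def fake_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("virtual server not yet online")
    return "connected"
```

In a real application, the attempt count and backoff ceiling would be tuned to your cluster's typical failover time.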
The goal of clustering is to provide increased availability to clients by having a hot standby system with an automatic failover mechanism. SQL Server clustering is not a load-sharing or scale-out technology. On all clusters during a failure there will be a brief database server interruption. On large clusters with multiple nodes and instances, clients may experience degraded performance during a failure event but they will not lose database availability.
There are four types of cluster topologies — or arrangements of nodes in a cluster:
- Single instance
In this case, one node in a cluster owns all resource groups at any one time and the other nodes are offline. Should the primary node owning the resources fail, the resource groups are transferred to the secondary node, which comes online and assumes ownership of the resource groups, typically the disks containing your database files and transaction logs. SQL Server then starts up on the virtual server and recovers the database, rolling committed transactions in the transaction log forward and rolling uncommitted transactions back.
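The recovery step can be illustrated with a toy write-ahead-log replay. The log format and the `recover` function below are invented for illustration; real SQL Server recovery is far more involved, but the roll-forward/roll-back decision is the same in spirit:

```python
def recover(log):
    """Toy crash recovery over a write-ahead log.

    `log` is a list of (txn_id, op, ...) records, where op is "set"
    (followed by a key and value) or "commit". Changes belonging to
    transactions that reached a "commit" record are rolled forward;
    changes from transactions that never committed are discarded,
    i.e. rolled back.
    """
    committed = {txn for txn, op, *_ in log if op == "commit"}
    state = {}
    for txn, op, *rest in log:
        if op == "set" and txn in committed:
            key, value = rest
            state[key] = value
    return state
```

Running this over a log where transaction 1 committed but transaction 2 did not leaves only transaction 1's changes in the recovered state.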
This topology was formerly called active-passive. Single-instance clustering is most frequently used for mission-critical applications, where the cost of downtime far outweighs the cost of the wasted hardware resources of the secondary node sitting idle while offline.
- Multiple instance

In this situation, one virtual server in a cluster owns some of the resource groups and another virtual server owns other resource groups. At any one time, the virtual servers can be hosted by a single node or by different nodes, and they appear to clients as named instances of a single server. In fact, they are named instances of a virtual server, hence the name multiple instance. With multiple-instance clustering, previously called active-active, the hardware requirements of each individual node are greater, as each node may at any one time host two (or more) virtual servers.
You should consider multiple-instance clusters to be more cost effective than single-instance clusters as there are no nodes offline or waiting. However, should one node host more than one virtual server, performance for clients is typically degraded. Your best bet is to use multiple instances when you require high availability but not high performance.
- N+1

This is a modification of the multiple-instance topology in which two or more nodes share the same failover node. The failover node needs enough hardware capacity to support the load of all N servers should they all fail over simultaneously. You can achieve cost savings when multiple clustered instances share the same failover node; however, the cost of an individual node tends to be small in comparison to other related clustering costs, such as storage.
Many people consider N+1 to be more cost effective than multiple-instance clustering because only one secondary node sits offline (or waiting) for several active nodes. However, depending on the hardware configuration of the failover node, it may not offer the performance of multiple-instance clustering. Use N+1 in environments where cost constraints force you to reduce the number of failover nodes and you need high availability but not high performance.
- N+M

In this topology you have two or more working nodes in a cluster along with two or more standby nodes. It is typically deployed as an eight-node cluster with six working nodes and two standby nodes, or five working nodes and three standby nodes.
N+M offers some of the cost benefits of N+1, but it has a lower chance of performance degradation during a multiple failure event than N+1 since the failover node(s) do not have to support the entire load of the failed nodes. Use N+M in environments where cost constraints force you to reduce the number of failover nodes and at the same time provide a high level of performance.
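The capacity reasoning behind N+1 and N+M can be reduced to simple arithmetic. This sketch assumes load is expressed in arbitrary units and that, under N+M, the failed nodes' load spreads evenly across the standbys; both are simplifying assumptions:

```python
def standby_capacity_needed(active_loads, standby_count, simultaneous_failures=None):
    """Worst-case load each standby node must be sized for.

    `active_loads` lists each active node's load in arbitrary units.
    By default every active node is assumed to fail at once (the N+1
    worst case); pass `simultaneous_failures` to size an N+M cluster
    that only has to survive that many concurrent failures.
    """
    failures = simultaneous_failures or len(active_loads)
    # Worst case: the most heavily loaded nodes are the ones that fail.
    worst_case = sum(sorted(active_loads, reverse=True)[:failures])
    return worst_case / standby_count
```

For example, three active nodes each carrying 10 units need a single N+1 standby sized for all 30 units, while an N+2 design that only has to survive two simultaneous failures needs 10 units of capacity per standby.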
SQL Server clustering has several dependencies:
Clustering requires a private network among all nodes in a cluster. Clustering services use a private communication channel on each node to keep in sync with each other. This allows the cluster to communicate and act appropriately even if the public network is offline. Looks-Alive and Is-Alive checks — used by cluster services to determine if a cluster resource group is “up” — connect over the public networks to best emulate a client connection process.
Clustering requires specialized hardware and software. To share disk resources between nodes, you need specialized disk controllers, and you must have a second set of network cards to provide the private network between cluster nodes. Clustering hardware must be certified by Microsoft to meet the requirements of clustering.
To benefit from clustering services, you need specialized versions of the operating system (Windows 2000 and 2003 Enterprise or Data Center editions). You will also need SQL Server 2000 Enterprise Edition, SQL Server 2005 Standard Edition (up to two nodes) or SQL Server 2005 Enterprise Edition (up to eight nodes).
Clustering best practices
What follows is a list of clustering best practices. I have broken these down according to dependencies.
Network best practices
The public network and the private network in a cluster require two different, and in places opposite, configurations.
Ensure the private network is private. Clustering requires a ping response time of 150 ms or less; if your private network is saturated or congested with other network traffic, you may find your clusters failing over unexpectedly. On your private network, consider isolating traffic by implementing a VLAN (virtual LAN) or a separate subnet, or by using a crossover cable for single-instance (two-node) clusters. The actual traffic generated by cluster communication is small, so high-bandwidth networks are unnecessary; they must, however, be low latency and reliable. Make sure the following points are established on the private network:
- Use TCP/IP as the only protocol bound to the NIC.
- No default gateway is configured.
- No DNS servers are configured unless the cluster nodes are DNS servers, in which case 127.0.0.1 should be configured.
- No DNS registration or DNS suffix is configured.
- No WINS servers are configured.
- Static IP addresses are used for all nodes.
- NetBIOS over TCP/IP is disabled.
- No NIC teaming is used, where two network interface cards are aggregated together to act as a single NIC card.
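One way to sanity-check the ping-response requirement on the private network is to time a number of round trips and compare the worst case against the threshold. The `probe` callable here is a placeholder for whatever round trip you can perform (an ICMP ping, a TCP echo); the function itself is an illustrative sketch:

```python
import time

def check_latency(probe, samples=5, threshold_ms=150.0):
    """Time `probe` and report the worst round trip observed.

    `probe` is any zero-argument callable performing one round trip
    over the private network. Returns (worst_ms, within_threshold).
    """
    worst = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        probe()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        worst = max(worst, elapsed_ms)
    return worst, worst < threshold_ms
```

Taking the worst case rather than the average matters here: a single slow heartbeat round trip is what triggers an unexpected failover.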
For your public network, use at least two WINS or DNS servers on your cluster network segment or VLAN. While installing your cluster you will have to resolve cluster, DC (domain controller) and virtual server names, so you must have a name server on your network. A name server on the local segment also decreases the time required for a node to fail over.
Use at least two DCs on your network. Clustering requires DCs not only during setup but also for normal functioning and failover.
If you use NIC teaming for greater bandwidth throughput and reliability, do not configure it while building the cluster; add NIC teaming as a last step before final testing. Be prepared to undo NIC teaming as an early troubleshooting step: Microsoft Customer Support Services (CSS) will likely direct you to disable teaming as a first diagnostic step, so be ready.
Ensure that your network card settings are identical for every server in your cluster and that they are not configured to automatically detect network settings.
Software best practices
Ensure applications are cluster aware and will not lose work or fail to meet the SLA during a cluster failover.
Ensure transactions are as small as possible in your application and on any jobs that may run on your clustered SQL Servers. Long-running transactions increase the length of time required to apply the transaction log on the failover node and consequently increase the amount of time for failover.
Do not run antivirus software on cluster nodes. If you must run antivirus software, be sure the quorum disk and database files are excluded from the scans. Even in this configuration, there have been reports of antivirus drivers interfering with cluster disk resource failover. Test your setup and make sure it fails over as expected; select another antivirus product if yours causes problems.
Make sure there are no password expiration policies in use for any of the cluster-related accounts. Cluster accounts should:
- be the same on all nodes in the cluster;
- be domain accounts (but not domain admin accounts) with local administrative rights on each node in the cluster. SQL Server 2005 forces you to set up domain-level groups for these accounts and then grants the appropriate rights to those groups;
- have the least privileges necessary, to minimize the damage that could be done to the node or to other servers on your network should the password be compromised or the account be hijacked by a buffer overflow.
Ensure all software components are the same version (i.e., SQL Server 2005 Standard), same architecture (i.e., 64 bit for all OS and SQL Server components) and at the same service pack and hot fix level. The exception is that individual SQL Server instances can be at different releases, editions and hotfix levels.
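A simple way to audit this consistency is to collect each node's component versions (however you gather them, e.g. via WMI, registry queries or `@@VERSION`) and diff the results. The data layout and function below are invented for illustration:

```python
def version_mismatches(node_versions):
    """Return components whose versions differ across cluster nodes.

    `node_versions` maps node name -> {component: version string}.
    A component missing on one node shows up as None, which also
    counts as a mismatch.
    """
    components = set()
    for versions in node_versions.values():
        components.update(versions)
    mismatched = {}
    for component in sorted(components):
        seen = {node: versions.get(component)
                for node, versions in node_versions.items()}
        if len(set(seen.values())) > 1:
            mismatched[component] = seen
    return mismatched
```

Run such a check after every patch cycle; a node that missed a hot fix is exactly the kind of drift this practice is meant to catch.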
Ensure all external software dependencies (COM components, file paths, binaries) are either cluster aware or installed on all nodes in a cluster. MSDTC (Microsoft Distributed Transaction Coordinator) is the most common external dependency in a cluster. While it is not necessary, many people install it before installing SQL Server because installing it later is much harder.
When installing a cluster, consider installing a single-node cluster and adding nodes to the cluster as required. This way, if the cluster setup fails while adding a single node, you are left with a working cluster (although it could be a single-node cluster).
When applying a hot fix or service pack that requires a system reboot, apply it to the primary (the current instance host), fail over to the secondary, reboot the primary, fail back to the primary and then reboot the secondary. Typically, hot fixes and service packs are cluster aware and install on all cluster nodes simultaneously.
Hardware best practices
Ensure that your cluster is approved by the vendor and that it is part of the Microsoft Windows Catalog with a specific endorsement for clustering.
Ensure each node in your cluster has identical hardware and components.
Regularly check vendor Web sites for potential hardware problems, fixes and BIOS patches for each component in your cluster.
Use the appropriate RAID technology to ensure that your disk array is fault tolerant. Be as proactive as possible in replacing failed or marginal disks. A disk failure will put a greater load on the remaining disks in an array and may cause other marginal disks to fail. Depending on your RAID technology, your RAID array may not be tolerant to more than one disk failure per array.
Ensure you have properly conditioned or charged batteries on any array controller; this prevents data loss or corruption in the event of a power failure.
Use uninterruptible power supplies (UPS) and be sure you have redundancy in your power supplies.
Use Hot-Add Memory if it’s supported by your SQL Server version, operating system and hardware. Hot-Add Memory is a hardware technology that allows you to add memory to a running system; the OS detects and uses the additional memory. Windows Server 2003, Enterprise and Data Center Editions, as well as SQL Server 2005 Enterprise Edition can take advantage of Hot-Add Memory. Read about Hot-Add Memory Support in Windows Server 2003.
Use ECC (Error Correction Code) memory chips, which store parity information used to reconstruct original data when errors are detected in data held in memory.
Use fault-tolerant NICs and network devices (switches).
Clustering is a relatively new technology and has a reputation for being fragile. SQL Server 2000 clustering is far simpler than the earlier versions and has proven to be much more reliable. Today, clustering on SQL Server 2000 and SQL Server 2005 is a highly reliable technology, but it still has many dependencies that can prevent it from meeting your high-availability goals. Foremost among these dependencies is a trained and knowledgeable staff. Running a close second is having operating processes and procedures designed to work specifically with a SQL Server cluster. Ensure that you address all of your clustering dependencies to deliver high availability with SQL Server clustering.