Digital Equipment Corporation
Created: February 1995
Revised: April 1996
The explosive growth of Microsoft® Windows NT® as an enterprise-level operating system has generated demand for new high-availability tools and system management features. Such tools have long been available on UNIX and other operating systems in well-established computing environments. Digital Equipment Corporation has begun to deliver these advanced Windows NT solutions today.
Clustering technology is well understood by today’s UNIX and OpenVMS system administrators. Multiple systems are grouped together to appear as a single system to users and to other processes. The advantages of clustering are:
- Higher availability through redundant processors, storage, and data paths
- Incremental scalability as workloads and storage needs grow
- Simpler management, because multiple systems are administered as a single system
Digital Clusters for Windows NT introduces the benefits of clustering technology to today’s PC client/server LAN environments.
Today, many customers are looking to Windows NT clusters as the next logical step in PC LAN evolution. Digital has responded to this need with Digital Clusters for Windows NT. Digital’s goal in developing this product is to provide a low-cost, high-availability solution for PC client/server LANs. Digital’s open systems strategy, based on industry-standard hardware and software as well as its strategic alliance with Microsoft, offers a solution that is ready now. The first version of Digital Clusters for Windows NT focuses on minimizing the downtime caused by software, network, and system failures, and on reducing the downtime associated with planned maintenance and upgrades.
Digital Clusters for Windows NT version 1.0 is a general purpose, high-availability solution for today’s PC client/server LANs. It enables two Windows NT systems to be joined together, via a shared SCSI bus, to create a single system environment, or cluster. End-user clients have access to all the cluster resources, such as shared disks, file shares, and database applications, without having to know the names of the individual servers in the cluster. In the event that one system fails, the second server in the cluster immediately assumes its workload, reconnects clients, and migrates shared storage and file shares.
Digital Clusters for Windows NT software design is extensible, flexible, and hardware independent. The first release of the product delivers critical, high-availability features to Windows NT Servers. Due to its extensible architecture, advanced clustering capabilities can be easily added over time, built on the same core product functionality. Software control of clustering gives users flexibility in integrating clustering with their existing application environments.
Enterprises have come to rely on high levels of availability from their information systems. They are unwilling and unable to tolerate downtime as they re-engineer their businesses and expand their global operations. Performance statistics show that the best current option, Windows NT Server on a state-of-the-art SMP system, is still likely to incur about 90 hours of downtime per year. Given the improvements that have been made to hardware, hardware failures are no longer the major reason for system downtime. Instead, software failures, software maintenance, and planned upgrades are currently the major sources.
Digital Clusters for Windows NT, with proper power management, supply engineering, and redundant communications, can reduce the average downtime to less than 12 hours per year and, in some cases, to less than one hour per year. It can keep end users productive on business-critical applications and databases, and still allow for growth and flexibility.
Digital Clusters for Windows NT provides system-level high availability at very low cost using industry-standard, commodity components. A cluster is addressed by clients as if it were a single server. Similarly, the cluster configuration is managed as if it were a single server. Clustering provides high levels of availability through redundant CPUs, storage, and data paths.
High availability is made possible through failover capabilities. Simply stated, failover quickly redirects interrupted services and resources to clients using a backup path. Digital Clusters for Windows NT uses a common cluster name, which makes the functioning of the cluster invisible to the end user. The client does not need to know how the cluster is configured or how the workload is divided among servers. The benefit to users is that they can focus on tasks rather than technology.
Enterprises can leverage their application software investments because Digital Clusters for Windows NT works with both packaged and in-house applications. Existing client applications can make use of the failover function without modification. Depending on the application, either the user sees a message requesting a retry or the application continues uninterrupted. In-house applications can also take advantage of cluster failover capabilities by using the generic application failover feature.
The level of availability provided by Digital Clusters for Windows NT compares well with fault-tolerant systems. Digital Clusters for Windows NT is more cost-effective because it does not require a complete mirrored backup of the primary system. Although the “hot standby” in fault-tolerant systems provides nonstop availability, it does so at the cost of a backup system that does not add any computing capacity. Digital Clusters for Windows NT, on the other hand, allows users to partition workloads and use both servers. You have the benefit of greater capacity, and you maximize your investment in current resources. It reduces your investment in new resources because it does not depend on custom hardware or proprietary interconnects.
Digital Clusters for Windows NT is a highly scalable solution. You can add capacity to a cluster in several dimensions. I/O and storage can be added incrementally to efficiently and cost-effectively meet the dynamic needs of an enterprise.
The scalability of Digital Clusters for Windows NT rests with the partitioned data model of its software architecture. This model delivers numerous benefits:
- Both servers perform useful work; neither is an idle mirrored backup
- I/O and storage capacity can be added incrementally as needs grow
- Workloads can be re-partitioned across the servers, the same way PC LANs scale today
The underlying design of Digital Clusters for Windows NT enables scalability. In contrast, both the mirrored backup and the high degree of synchronization of fault-tolerant systems undermine scalability.
Digital Clusters for Windows NT provides investment protection. Because it is designed for industry-standard hardware, software, interconnects, and protocols, it offers numerous advantages over other solutions. And because it supports both Alpha and Intel architectures, enterprises can leverage their current hardware investments and feel secure in their ongoing choices.
Digital Clusters for Windows NT supports a variety of off-the-shelf hardware components such as RAID subsystems, SCSI-2 disks, and SCSI adapters. These off-the-shelf choices reduce cost in the short term and add to investment protection in the long term.
Similarly, supported clients for automatic failover include the most widely used desktops running Windows NT, Windows for Workgroups, or Windows 95. Other clients, such as MS-DOS®, Mac OS, and OS/2® operating systems, can also be used with manual reconnects. Another way that Digital Clusters for Windows NT provides investment protection is through support of industry-standard networking protocols between the clustered servers and clients. Digital Clusters for Windows NT supports all the Windows NT supported network protocols: TCP/IP, NetBEUI, and IPX/SPX.
Digital Clusters for Windows NT is fully compatible with Windows NT Server version 3.51. Unlike other solutions, it is not a “port” of older technology to a new operating system; it was designed from the ground up for the Windows NT client/server environment. An SNMP agent is included that enables industry-standard SNMP management tools, such as ServerWORKS, to work with Digital Clusters for Windows NT.
In a typical client/server LAN environment, a single-server system provides file, print, and application services to a group of desktop clients. In a cluster client/server configuration, the notion of a single server serving clients is extended to include multiple server systems. The collection of servers, called a cluster, is viewed by clients as a single server system. This is accomplished via cluster software, which performs the management, integration, and synchronization of the servers in the cluster, or cluster members. Work assigned to the cluster is partitioned across the two nodes with, for example, file services provided by one node and database services by the other.
In a clustered environment, users (clients) have access to the combined resources of the entire cluster. Like the single server environment, a cluster provides a single management environment. Clients view resources and services in the cluster as if they were local. A major advantage of clustering in a LAN environment is the ability to add system components incrementally in order to build in component redundancy for higher availability.
As customers deploy client/server solutions in their enterprise, they are concerned about system reliability and a cost-effective growth path for the future—attributes that are critical to supporting their user community and running their business. Digital’s cluster technology on Windows NT is well suited to address these concerns by enhancing the availability, scalability, and management of data and key services within a client/server LAN environment.
Digital Clusters for Windows NT supports both Intel and Alpha Windows NT systems as cluster members. Both members of a single cluster must have the same processor architecture: either Intel–Intel or Alpha–Alpha. The two cluster members do not need to be identically configured, and one or both may be SMP systems. Each clustered server must have its own local system disk, which may also be used to store data or run applications.
Digital Clusters for Windows NT is a LAN server-based cluster solution. End-user clients are not members of the cluster. All Windows clients—Windows NT, Windows 95, and Windows for Workgroups—with LAN connections to the clustered servers are fully supported and can use the common cluster name to access the cluster. Other clients—such as Mac OS, OS/2, or MS-DOS—can access the cluster and benefit from the cluster resources. However, non-Windows-based clients must know the names of the clustered servers and must manually reconnect in the event of a failover. Manual reconnection is also required for clients that connect to the clustered servers over a WAN (that is, through a router).
End-user clients access and manage Digital Clusters for Windows NT as a single system via the common cluster name. They don’t need to know the individual names of the servers they are connecting to. The cluster name service software directs the clients to the correct disk or file share. The cluster configuration can be changed from any one of the servers in the cluster.
The two servers in the cluster are connected by up to three physical connections:
- A shared SCSI bus (or buses) connecting both servers to the common storage
- An enterprise network (LAN) connection, shared with clients
- An optional private Ethernet connection, used for cluster communication between the servers
Digital Clusters for Windows NT allows partitioning of the workload to the disk level by assigning disks on the shared bus to one server at a time for management and control. Moving disk resources from one server to another is simply a matter of reconfiguration via the graphical cluster administration tool—no need to power down or re-cable hardware components.
Digital Clusters for Windows NT supports the use of standard Windows NT management tools, such as the File Manager and net use commands, to manage cluster resources. The intuitive cluster administration tool allows easy configuration and re-configuration of the cluster. The cluster state and configuration are remotely accessible by any SNMP browser, through the standard SNMP agent and cluster MIB included with the cluster software.
There are two data access models that are used in clustering solutions: the partitioned data model and the symmetric data model. Digital Clusters for Windows NT is designed around the partitioned data model. Different workloads are distributed between the members of the cluster and operated on independently. Although the disks on the shared SCSI subsystem are connected to both cluster nodes at the same time, only one cluster member is allowed to access any one disk at any particular time. This partitioned workload distribution is culturally compatible with existing client/server PC LAN environments because PC LANs scale by re-partitioning the workload and adding more servers and storage to the LAN.
The symmetric (or shared everything) data access model is used by VMS clusters. In this model, the same workload is distributed across multiple server systems, executing in parallel. Concurrent access to shared disks is managed by a distributed lock manager. While the symmetric data access model may have benefits for future versions of Digital Clusters for Windows NT, it is not supported in the current version.
The specific events associated with failover, and the total time for failover, depend on what type of failover is occurring. There are two types of failovers:
- Involuntary failover, initiated automatically when the cluster software detects a server, software, or SCSI path failure
- Voluntary failover, initiated deliberately by the system administrator, for example to rebalance the workload or to take a server down for maintenance
The following activities take place in the event of a failover:
- The failure is detected, either through loss of communication between the Failover Managers or through accumulated disk I/O errors
- Ownership of the shared disks in the affected failover groups is transferred to the surviving server
- File systems are made available and services, such as databases, are restarted on the surviving server
- Clients reconnect to their resources through the common cluster name
Detection time for an involuntary failover begins when the two Failover Managers (FMs) lose communication with each other or when too many errors are logged against a disk. During normal cluster operation, a network connection with a “keep-alive” protocol is maintained between the two FMs. If communication between the servers is lost, the initial failover detection time is the amount of time that an FM will wait to re-establish the network connection with the other FM before assuming that the other has failed and initiating a failover. Although this network communication time-out may be set as low as one second, the default setting is 30 seconds; setting this interval very low risks false or premature failovers. SCSI failures are detected by accumulated I/O errors, either when users access the disk or as a result of the cluster software’s periodic test of the disk.
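The time-out arithmetic behind this keep-alive detection can be illustrated with a short C sketch. This is not the product’s code; the constant and variable names are illustrative assumptions, and in practice a keep-alive message arriving over the network connection described above would refresh the heartbeat timestamp.

    /* keepalive.c -- illustrative sketch of the FM keep-alive time-out.
     * All names here are assumptions, not product interfaces. */
    #include <windows.h>
    #include <stdio.h>

    #define KEEPALIVE_TIMEOUT_MS (30 * 1000)    /* default time-out: 30 seconds */

    int main(void)
    {
        DWORD lastHeartbeat = GetTickCount();   /* last message from the peer FM */

        for (;;) {
            Sleep(1000);                        /* check once per second */

            /* A real FM would refresh lastHeartbeat on each keep-alive
             * message; only the time-out logic is shown here. */
            DWORD silent = GetTickCount() - lastHeartbeat;

            if (silent >= KEEPALIVE_TIMEOUT_MS) {
                printf("peer silent for %lu ms: assume it failed, begin failover\n",
                       (unsigned long)silent);
                break;                          /* hand off to failover processing */
            }
        }
        return 0;
    }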
When evaluating overall failover performance and time-out settings, a tradeoff is made between fast failover times and cluster stability. The default parameters for the network communication time-out, the stabilization period, and SCSI bus arbitration are set conservatively so that the cluster software can make “wise” decisions about the need for failover. They can be set to shorter times for quicker failover, but the cluster resources may then be subject to unnecessary migrations. For example, if the network connection time-out is lowered below the threshold at which a busy server can respond within the allotted time, the cluster becomes more subject to spurious failovers than if the time-outs were left at higher settings.
Digital Clusters for Windows NT provides high availability for two types of failover objects: cluster services (for example, database applications) and cluster resources (for example, network shares). In the event of a failure in the primary server, the cluster software automatically provides the same resource or service to end-user clients through a backup path: the secondary server. This transparent relocation of the failover object is called failover. The failover model is extensible in the sense that new failover objects may be added in the future, for example printers, TCP/IP addresses, or customer-defined objects.
To flexibly define and manage failover objects, the system administrator creates logical groups of cluster services and resources that are called failover groups. Resources in a failover group will move together; either they all are online or they are all offline. Resources in different failover groups move independently. Resources in a failover group are ordered such that more primitive services are started first when going online, and the order is reversed when going offline.
Failover groups are easily defined with the graphical cluster administration tool. They tend to be collections of storage devices and applications that are used together. For example, SQL Server (the application) and the disks used to store the SQL Server database may be specified in the same failover group. A group can also be made up of one or more disks without an associated application, as in the case of file service failover. If the Failover Manager detects a software or system failure, the whole failover group is failed over to the alternate system.
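The ordering rule described above (primitive resources first when going online, reverse order when going offline) can be sketched in a few lines of C. This is a conceptual sketch only; the resource names and functions below are illustrative assumptions, not the product’s interfaces.

    /* group_order.c -- sketch: bring a failover group online in dependency
     * order (most primitive resource first) and offline in reverse order.
     * All names are illustrative. */
    #include <stdio.h>

    typedef struct {
        const char *name;
        void (*online)(void);
        void (*offline)(void);
    } Resource;

    static void disk_on(void)  { puts("mount shared disk"); }
    static void disk_off(void) { puts("dismount shared disk"); }
    static void db_on(void)    { puts("start database service"); }
    static void db_off(void)   { puts("stop database service"); }

    /* The disk is more primitive than the database that lives on it. */
    static Resource group[] = {
        { "shared disk", disk_on, disk_off },
        { "database",    db_on,   db_off   },
    };
    static const int count = sizeof(group) / sizeof(group[0]);

    static void group_online(void)
    {
        for (int i = 0; i < count; i++)       /* primitive services first */
            group[i].online();
    }

    static void group_offline(void)
    {
        for (int i = count - 1; i >= 0; i--)  /* reverse order going offline */
            group[i].offline();
    }

    int main(void)
    {
        group_online();   /* disk, then database */
        group_offline();  /* database, then disk */
        return 0;
    }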
Each failover group is associated with a failover policy that is defined by the system administrator using the cluster administration tool. The ability to define a failover policy for each individual failover group gives the system administrator more flexibility in workload balancing. Using the cluster administration tool, the system administrator can define:
- The server on which each failover group normally runs (its primary server)
- Whether the group automatically fails back when the primary server returns to service
- The scripts, if any, to be run when the group goes online or offline
The administrator may determine that the cluster workload must be rebalanced; for example, a database application must be moved from one node to another because of changes in client service requirements. Rebalancing is easily accomplished through software reconfiguration of the failover group policy for the cluster. The administrator can override the failover group policy by:
- Manually migrating a failover group from one server to the other
- Taking a failover group offline, or bringing it back online, on a chosen server
The “brains” of the cluster software is the Failover Manager (FM). The FM ensures that access to shared resources (such as disks on the shared SCSI bus or buses) is synchronized, identifies external or management events that may cause a failover, and executes the appropriate failover policy for all failover groups. The Failover Managers running on the two nodes in a cluster are synchronized through message-passing communications.
The Storage Shim Driver (SSD) is a small driver “wedged” into the SCSI I/O stack. It is a kernel-mode component of Digital Clusters for Windows NT and uses standard Windows NT APIs. The SSD is a software switch that controls access to the disks on the shared SCSI buses. Its main goal is to ensure that the two clustered servers never access a disk at the same time. If there is any uncertainty about the ownership of a disk, both SSDs will put the disk offline rather than risk its integrity.
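The SSD’s gatekeeping rule (deny access unless ownership is certain) can be summarized in a few lines of C. This is a conceptual, user-mode sketch, not kernel driver code; the enumeration and function names are illustrative assumptions.

    /* ssd_rule.c -- conceptual sketch of the Storage Shim Driver's rule:
     * pass an I/O to the disk only when this node verifiably owns it.
     * Names are illustrative, not the product's interfaces. */
    #include <stdio.h>

    typedef enum { OWNED_LOCAL, OWNED_REMOTE, OWNERSHIP_UNKNOWN } DiskOwner;

    /* Reject on any uncertainty: it is safer to take the disk offline
     * than to risk both servers writing to it at once. */
    static int ssd_allow_io(DiskOwner owner)
    {
        return owner == OWNED_LOCAL;    /* 1 = pass through, 0 = reject */
    }

    int main(void)
    {
        printf("owned locally:  %s\n", ssd_allow_io(OWNED_LOCAL)       ? "allow" : "reject");
        printf("owned remotely: %s\n", ssd_allow_io(OWNED_REMOTE)      ? "allow" : "reject");
        printf("uncertain:      %s\n", ssd_allow_io(OWNERSHIP_UNKNOWN) ? "allow" : "reject");
        return 0;
    }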
All information about the state of the cluster, including disk ownership, failover objects, groups, and policies, is stored in the Cluster Failover Manager Database (CFMD), which is layered on top of the Windows NT registry. The CFMD implements transactional semantics for its data access operations to guarantee that both nodes in the cluster will always have consistent and synchronized views of key cluster data.
The cluster FMs on each cluster node communicate over the network in order to coordinate ownership of resources in the cluster. The secondary private Ethernet connection provides redundant communication in the cluster between FMs in order to make sure that the two cluster nodes can maintain communications. The enterprise network is used for cluster node communication where the private network has failed or no private network is installed. The secondary network is highly recommended to allow uninterrupted FM communication in the case of an overloaded or partitioned enterprise network.
The cluster Name Service supports a common cluster name, or alias, which clients can use to transparently reference cluster resources or services, regardless of which server is providing the requested resource to the client. The Name Service has both a server component and a client component. The client name service component must be installed on any client that is to take advantage of the common cluster name. The cluster name service is supported on all Windows clients: Windows NT version 3.51, Windows 95, and Windows for Workgroups version 3.11.
The client name service intercepts all client universal naming convention (UNC) requests (that is, \\name\share syntax) made to the redirector. If a request is for a cluster alias, the client name service asks the server name service to translate the cluster alias into the actual server UNC address. The server name service responds with the UNC name of the cluster member exporting the requested cluster resource, and the client name service passes that UNC name back to the client redirector. This address translation is done when the client connects to a cluster share or opens a file or named pipe to a cluster resource.
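The translation step can be illustrated with a short C sketch. The cluster alias, share name, and the resolve_cluster_alias() lookup below are illustrative assumptions; in the product, the server name service performs this lookup over the network.

    /* alias.c -- sketch of the client name service's UNC translation.
     * Names and the lookup function are illustrative assumptions. */
    #include <stdio.h>

    /* Hypothetical lookup: ask which cluster member currently exports
     * shares behind the alias (hard-wired here for illustration). */
    static const char *resolve_cluster_alias(const char *alias)
    {
        (void)alias;                /* a real lookup would query the server */
        return "\\\\SERVER1";
    }

    int main(void)
    {
        const char *request = "\\\\MYCLUSTER\\SALES";   /* client's UNC request */
        char translated[256];

        /* Substitute the exporting member's name for the alias, then hand
         * the translated UNC back to the redirector. */
        snprintf(translated, sizeof(translated), "%s\\%s",
                 resolve_cluster_alias("MYCLUSTER"), "SALES");

        printf("%s -> %s\n", request, translated);
        return 0;
    }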
Digital Clusters for Windows NT can recognize and initiate failover in the event of system software, server, and SCSI controller failures. Failure of the Windows NT Server operating system will trigger a failover (initiated by the surviving cluster member), as will server failure caused by a system hang or power failure. The cluster software has the ability to monitor availability of supported applications; in version 1.0, Microsoft SQL Server failures are reported directly to the FM and cause a failover of defined database application failover groups. Failure in the SCSI controller or cable to the disk subsystem will trigger a failover by accumulated disk I/O errors.
In the current version of Digital Clusters for Windows NT, partitions of the enterprise network connection do not automatically cause a failover. The cluster has no way of knowing whether a network failure was caused by a failed network adapter at the clustered server or by a cable break in the network, away from the clustered server. If the system administrator recognizes a network problem that would be improved by failing services over from one cluster node to the other, the administrator can manually migrate cluster resources, as long as the private cluster network connection is operational. If the private network connection is not installed, it is not possible to manually migrate cluster resources in the case of a network partition.
Failback describes what happens when the server causing the failover returns to an operational status again. If failback is enabled, the failover groups automatically migrate back to the primary host in their original configuration. The cluster software ensures that any stale data on the primary host regarding the disk and its file systems are cleaned up. If failback is not enabled, the failover groups remain on the secondary host. Failback is important to restore the cluster to full operation and restore the static load balancing of the cluster.
Digital Clusters for Windows NT will provide failover capabilities for:
- Disks on the shared SCSI bus and NTFS file shares
- Microsoft SQL Server databases
- Oracle databases
- Generic applications, through administrator-supplied failover scripts
Failover of client connections to cluster resources is supported; however, failover of open files is not. This restriction also applies to client applications that use named pipes to access cluster resources. Named pipes are a network communication mechanism: clients talk to servers by addressing a particular named pipe by name. The cluster name service provides a means for a cluster-wide named pipe name to refer first to one server and then to another.
Generic application failover is supported by allowing the system administrator to provide scripts that execute when a failover group comes online and when a failover group goes offline. The path of the command-line script file is provided via the cluster administration tool. Generic application failover scripts are most commonly used to start and stop applications that are not directly supported by the cluster software, such as custom server applications that use data on a shared cluster disk. Another use of failover scripts is to send a notification message to the system administrator when a failover event occurs.
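As an illustration, an “online” action for a custom server application might simply start its Windows NT service. The sketch below uses the standard Service Control Manager APIs; the service name MyServerApp is an assumption, and in practice the resulting executable (or a command file that invokes it) would be named as the group’s online script.

    /* online.c -- illustrative "online" action for a generic application
     * failover group: start a custom NT service on the surviving server.
     * The service name "MyServerApp" is an assumption. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        SC_HANDLE scm = OpenSCManager(NULL, NULL, SC_MANAGER_CONNECT);
        if (scm == NULL) {
            fprintf(stderr, "OpenSCManager failed: %lu\n", GetLastError());
            return 1;
        }

        SC_HANDLE svc = OpenService(scm, TEXT("MyServerApp"), SERVICE_START);
        if (svc == NULL) {
            fprintf(stderr, "OpenService failed: %lu\n", GetLastError());
            CloseServiceHandle(scm);
            return 1;
        }

        if (!StartService(svc, 0, NULL))    /* no arguments passed */
            fprintf(stderr, "StartService failed: %lu\n", GetLastError());

        CloseServiceHandle(svc);
        CloseServiceHandle(scm);
        return 0;
    }

A matching “offline” action would stop the same service (for example, via ControlService) so that the group can be restarted cleanly on the other node.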
No provision for preserving application context is made by any Windows NT high-availability product—including Digital Clusters for Windows NT. This is true for both database and NTFS file service failover. This is exactly the same situation as when a server fails and is brought back online today. Digital Clusters for Windows NT neither improves nor worsens the situation. The difference is that in a cluster the failed “server” is returned to service significantly faster than today’s normal repair time. Think of clusters as the world’s fastest field service!
Digital Clusters for Windows NT provides high availability to the database. If a client application is reading or writing data in a database on a disk (or system) that fails, the database is failed over to the backup server. Where the application resumes the operation is the responsibility of the application. For SQL Server and Oracle, any in-progress transaction must be rolled back and restarted; providing transactional semantics around database operations is one of the primary functions of database software. This requirement is the same for all high-availability products for Windows NT.
In cases where there is additional server-side software (for example, a custom in-house server application), that software needs to be failed over by mechanisms similar to those used by the underlying database. The generic application failover capability addresses many of these server applications by allowing users to provide command-line scripts for starting and stopping them.
The client applications connecting to a database that fails over will receive an I/O error or a connection lost error. It is the responsibility of the application to re-establish the connection (this requirement is the same for any Windows NT high-availability product). If the client has established the connection to the cluster alias, then the client simply reconnects to the cluster alias and the cluster name service automatically routes requests to the secondary server.
Microsoft SQL Server 6.5 also provides failover APIs for applications connecting to a specific cluster server using DBLIB or ODBC. In this case, failover happens as above, but it is not controlled by the cluster software.
If a client application is connected to a cluster resource containing an open file or named pipe, which then fails over, any subsequent read or write operation will fail. It is up to the client application to handle this error appropriately. “Well-behaved” client applications will close and reopen the file or named pipe upon receipt of this I/O error. If the client application is connected to the cluster using the common cluster name, the subsequent reopening of the file or named pipe will establish a connection to the failover server without requiring the end user to manually reconnect.
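A “well-behaved” client of this kind can be sketched in a few lines of Win32 C: on an I/O error, close the handle, reopen the same path through the cluster alias, and resume from a safe point. The UNC path below is an illustrative assumption.

    /* reopen.c -- sketch of a "well-behaved" client reopening a file
     * through the cluster alias after a failover. The path is an
     * illustrative assumption. */
    #include <windows.h>
    #include <stdio.h>

    #define CLUSTER_PATH TEXT("\\\\MYCLUSTER\\SALES\\report.dat")

    int main(void)
    {
        HANDLE h = CreateFile(CLUSTER_PATH, GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE)
            return 1;

        char buf[4096];
        DWORD got;

        if (!ReadFile(h, buf, sizeof(buf), &got, NULL)) {
            /* Connection lost: a failover is likely in progress. */
            fprintf(stderr, "read failed (%lu); closing and reopening\n",
                    GetLastError());
            CloseHandle(h);

            /* Reopening by the alias reconnects to the surviving server;
             * the application repositions to its own safe point. */
            h = CreateFile(CLUSTER_PATH, GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
            if (h == INVALID_HANDLE_VALUE)
                return 1;
            SetFilePointer(h, 0, NULL, FILE_BEGIN);   /* safe point: file start */
            ReadFile(h, buf, sizeof(buf), &got, NULL);
        }

        CloseHandle(h);
        return 0;
    }

In a real application, a retry loop with a delay would be appropriate, since the reopen can fail until the failover completes (see the timing discussion below).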
No assumptions should be made about the amount of data transferred before failover. When a file or named pipe is open during failover, it is the client application’s responsibility to maintain its own context and roll back to a safe point before continuing. Although the client file operation may need to be reissued, file system integrity is maintained.
From the client’s viewpoint, failover times can vary depending on the situation. Clients cannot connect to the cluster during a cluster failover. If a new connection is initiated during failover, the client will attempt to connect to the cluster for 15 seconds before returning an error to the client application. Once the failover is complete, subsequent attempts to connect will succeed.
However, clients with open files to the cluster are a different case. The first time the client attempts to access an open file during or after a cluster failover, the client receives an error message from the redirector: the redirector times out trying to access the open file using the translated UNC address of the cluster member that failed. It may take up to a minute for the client name service to discover, and be redirected to, the server now providing the requested cluster service. Following the error message, the client must close and reopen the file or named pipe.
Windows clients accessing the cluster using the cluster alias will see minimal disruption to their work in the case of a failover. If the end user is accessing network file shares, or running a “well-behaved” client application that accesses the cluster via named pipes or SMB, the end user may experience no disruption at all, provided the user does not access the cluster until the failover is complete. If the user accesses the cluster during the failover, the user receives an I/O or connection-lost error message. All the end user must do is click “retry” to re-establish the connection to the cluster.
Other clients that cannot access the cluster via the cluster alias can still benefit from high availability provided by Digital Clusters for Windows NT. However, these end users must know the names of the two clustered servers. In the case of a failover, the end-user client must manually reconnect to the secondary cluster server in order to continue access to the cluster service.
The minimum cluster configuration consists of:
- Two supported server systems of the same processor architecture (Alpha or Intel)
- A supported SCSI adapter in each server, connected by a shared SCSI bus
- One or more disks or RAID subsystems on the shared SCSI bus
- A LAN connection between the servers and their clients
Each server must be running:
- Microsoft Windows NT Server version 3.51
- Digital Clusters for Windows NT server software
Optional server software:
- Microsoft SQL Server version 6.5, for database failover
- Oracle database server software, for database failover
Client software is included with the standard software kit and can be installed on any of the following client desktops to support cluster aliasing and other cluster features:
- Windows NT version 3.51
- Windows 95
- Windows for Workgroups version 3.11
Digital Clusters for Windows NT 1.0 supports the two most popular hardware architectures: Alpha and Intel. Both servers in the cluster must be of the same architecture, either Alpha or Intel. However, any models within each architecture can be combined in a cluster. For example, a Prioris HX server system could be used with a Prioris XL server system in a cluster. Table 1 below lists the supported servers.
Table 1. Supported Server Systems
AlphaServer        | Prioris Server
-------------------|--------------------------
AlphaServer 1000   | Prioris ZX (Pentium)
AlphaServer 1000A  | Prioris ZX (Pentium Pro)
AlphaServer 400    | Prioris HX
AlphaServer 2000   | Prioris XL
AlphaServer 2100   |
AlphaServer 2100A  |
AlphaServer 4100   |
In addition to the internal bus adapter shipped with the system, each server system must have a supported SCSI adapter installed on the system expansion bus. Refer to the Cluster Hardware Compatibility List, on Digital’s World Wide Web site at www.windowsnt.digital.com, for detailed guidelines on supported SCSI subsystems.
Adapter
An integrated circuit expansion board that communicates with and controls a device or system.
Bus
A collection of wires in a cable, or copper traces on a circuit board, used to transmit data, status, and control signals. EISA, PCI, and SCSI are examples of buses.
Cluster
A computing environment created by connecting two servers via a shared SCSI bus for the purpose of ensuring system availability in the event of a failure. Storage devices such as disks or RAID subsystems connected to the SCSI bus are assigned to one of the servers in a cluster. To end user clients, a cluster appears as a single system.
Cluster Failover Manager Database (CFMD)
A database, layered on the Windows NT registry, that stores cluster configuration information and keeps both servers in the cluster synchronized by using transactional semantics.
Device driver
A software module that provides an interface for communication between the operating system and system hardware (for example, a SCSI controller).
Failback
The automatic migration of failover groups to the primary host after the (primary) system causing an initial failover returns to operational status.
Failover
When a cluster system fails, the relocation of cluster services (such as database applications) or cluster resources (such as file shares) to the surviving server, so that end-user clients can continue working through a backup path.
Failover group
A logical group of failover objects. Groups are typically made up of storage devices and applications. For example, SQL Server (the application) and the disks used to store the SQL Server database would form a logical failover group. A group can also be made up of one or more disks without an associated application.
Failover Manager
The software on each cluster server that draws information from, and manages, failover objects. It is responsible for keeping track of events (such as a SCSI adapter or system failure) and determining whether a failover should occur. It then determines what object or failover groups need to be failed over.
Failover object
Any cluster service or resource for which you want to ensure availability in the event of a system failure. Failover objects can be disks, database applications on the servers, or file shares.
Fault tolerance
A method of ensuring the availability of a computing environment by using a backup system that mirrors the primary system. The backup system is typically called a “hot standby.” The backup system does not provide any additional computing capacity; it is only available for use in the event of a failure of the primary system. For this reason, it is a costly method of ensuring availability.
Idle standby
A system that provides dormant backup to the primary server in a cluster. Its job is to wait for the other server to fail.
Load balancing
The ability to partition the workload by assigning resources (such as disks and databases) to servers in the cluster and defining failover/failback policies. This provides an efficient and effective use of server resources.
Name server
Software installed on the cluster servers that works with the client software to create the illusion of a single system through aliases. Using aliases, the client is unaware of the name of each server or how the cluster workload is distributed.
NTFS
NT File System. The standard file system for the Windows NT operating system.
RAID
Redundant Array of Independent Disks. A collection of storage devices configured to provide higher data transfer rates and data recovery.
Redundancy
A method of protecting against failures by building in extra, backup components to a system.
Scalability
In a system, the ability to add capacity as needed.
SCSI
Small Computer System Interface. An intelligent bus for transmitting data and commands between a variety of devices.
SCSI-2
The second generation of SCSI, which includes many improvements to SCSI-1, including fast, wide SCSI.
SCSI-3
The third generation of SCSI, which introduces improvements to the parallel bus and high-speed bus architectures.
Storage Shim Driver
Cluster software that works with the operating system and the drivers in the I/O stack. It acts as a switch that mediates access to disks and ensures that user data is not corrupted.
Symmetric multiprocessing (SMP)
A method of adding computing capacity to a system by adding processors.
UNC
Universal naming convention, used by the LAN Manager (LANMAN) protocol. This is the traditional \\server\share syntax.