Site Server 3.0 Membership Directory Guidelines for Failsafe Operations

April 1999

Microsoft Corporation

Related Documents:

Purpose

This document addresses procedures for reducing or eliminating downtime in a Microsoft® Site Server version 3.0 environment. Hardware failure, data corruption, and physical site destruction all pose threats to a 24 x 7 operation. The goal of this document is to help you learn to minimize downtime in the Site Server environment

Overview

Keeping the Site Server system running in a failsafe environment means addressing potential failures at each of the three tiers: application servers, LDAP servers, and servers running Microsoft® SQL Server™. In the case of the application servers, round robin Domain Name System (DNS) provides an acceptable answer. Hardware solutions are also available from a variety of vendors. These solutions, in conjunction with Dynamic Directory replication, also provide a strong solution for LDAP server redundancy. Data stored on the server running SQL Server must be shared between the LDAP servers. If any of the SQL Server components fail, all LDAP servers using that store are compromised. Since all of the application servers are dependent upon these LDAP servers, those components that use the Site Server Membership Directory, such as Site Server 3.0 Personalization & Membership, will also be compromised.

Tier 1
Fault Tolerance:
Low state, no unique data,  inexpensive machines
Use round robin DNS, hardware load balancing
Tier 2
Fault Tolerance:
Low state, no data
Use round robin DNS, hardware load balancing
Tier 3
Fault Tolerance:
Holds system state, lots of unique data

Because of this critical single point of failure, this document will focus on methods that provide automatic fail-over, hot swap, or warm backup capabilities for the databases that support the Site Server Membership Directory. This document will cover four methods of protection against the single point of failure. The optimal solution is site dependent.

The recommended solutions for failsafe operations are identified in the following table. The remainder of this document provides additional information about these solutions.

Type of Failure Failsafe Detection Failover
Application or application server failure Multiple copies running with round-robin DNS or hardware Monitoring tools DNS: script removal from DNS

Hardware: automatic

LDAP application or LDAP server failure Multiple copies running with round-robin DNS or hardware Monitoring tools DNS: script removal from DNS

Hardware: automatic

SQL Server computer Multiple servers running SQL Server with Windows NT Clustering Services Automatic Automatic
SQL Server Disk RAID5 Automatic Automatic
Data corruption Warm backup SQL Server on separate server Monitoring tools Switch to use warm backup computer

SQL Server 6.5 Replication

SQL 6.5 replication allows administrators to configure their SQL Servers such that all modifications to data in one database are automatically propagated to another, allowing a copy of the data in one database to be present in another.

This method is limited in a Site Server Membership Directory environment by:

SQL 7.0 Replication

SQL Server 7.0 replication is much like SQL Server 6.5 replication, but addresses many of the limitations in SQL Server 6.5. Specifically, SQL Server 7.0 greatly improves data transfer rates, simplifies replication setup through wizards, and requires no special handling for image data. In addition, SQL Server 7.0 has the capability of replicating changes made to subscribers back to the master database, offering improved hot swap capabilities. SQL Server 7.0 continues to present the potential for replication of user-corrupted data, and is further limited by the fact that it is still in beta and hence, not useful in the short term.

Microsoft Windows NT Clustering Services

Microsoft® Windows NT® Clustering Services (also known as Microsoft Cluster Server) allows two computers to act as standby computers for each other, sharing only their disk array. This system allows one computer’s CPU, memory, operating system, or other critical components to fail without causing a service outage. Using RAID5 protects against physical media corruption (drive failures).

Clustering in conjunction with a failsafe storage device system represents a true fail-over solution for everything but data corruption. It is limited by expense of hardware, difficulty of configuration, shared media, controller failure potential, and the fact that it offers no protection against database corruption.

Keeping a Warm Backup SQL Server

The best data protection strategy is a combination of full backups and backups from transaction logs. If a redundant computer is continuously being restored from a master computer, service may be restored with excellent data integrity. Intelligent management of warm backup computers will lead to nominal downtimes.

SQL Server 6.5 Replication

SQL Server 6.5 replication is a poor solution for failsafe operations in the Site Server environment. It is primarily intended for geographic distribution of data and for query distribution in environments that are query driven (involving very few writes, modifies, or deletes). The following information is presented in the interest of correcting misconceptions and detailing what would be involved in using SQL Server 6.5 replication for failsafe operations.

Advantages

Limitations

Option 1: Full Replication

While SQL Server 6.5 replication is not a recommended solution for most deployments, there are a few situations where it can provide added value. The first of these is in a very high-read, low-write environment. In such an environment, replication of data for geographic distribution or query load balancing can be advantageous.

All tables are replicated in a straightforward manner, with the exception of the Object_Attributes table, whose image column should not be replicated directly, as discussed previously.

Detailed instructions about how to handle the image data are beyond the scope of this document. Briefly, however, the following steps are necessary to replicate the Object_Attributes table without incurring database bloat due to NULL image field replication.

  1. In the subscriber, the Object_Attributes table should be renamed to Object_Attributes_No_Image, and the img_Val column removed.

  2. A new table should be created to contain just the key data and the image. Call this table Object_Attributes_Image, and give it only two columns: I_Sequence (int), and img_Val (image). Create an index on the I_Sequence column of this table.

  3. A view should be created on these two tables. This view should be called “Object_Attributes” and consist of all the columns from Object_Attributes_No_Image joined by I_Sequence to the Object_Attributes_Image table.

  4. Appropriate indexes should be created on the Object_Attributes view to mimic those present in the standard Object_Attributes table.

This configuration allows you to use the replica as a read-only database while you are restoring the master database from backups or log files. If no backups are kept, or they are somehow lost or not maintained, restoring write capabilities from this database will be challenging. In addition to addressing the normal issues involved in promoting a subscriber to a publisher, the following steps must be taken:

  1. Alter the table on Object_Attributes_No_Image to add the img_val column (data type Image).

  2. Update the Object_Attributes_No_Image table to take the data from Object_Attributes_Image as follows: “UPDATE Object_Attributes_No_Image A, Object_Attributes_Image B SET A.img_val = B.img_Val where A.i_Sequence = B.i_Sequence.”

  3. Establish identity properties on the object_lookup (I_DSID), object_attributes (I_Sequence), attributes (I_AID), and classes (I_CLSID) tables.

  4. Build indexes on Object_Attributes_No_Image to mimic those on the normal Object_Attributes table.

  5. Delete the Object_Attributes view.

  6. Rename the Object_Attributes_No_Image table to Object_Attributes.

  7. Delete the Object_Attributes_Image table.

Option 2: Replication Without Indexes

By dropping indexes on the subscriber, the replication throughput can be increased somewhat, allowing a higher write, modify, and delete rate than is available in Option 1. Requirements and limitations are effectively the same, except that searches on the subscriber are limited by which indexes remain. For example, if you have dropped the indexes ind_dt_aid (date data index) and ind_int_aid (integer data index), any search on this information will almost certainly fail with a timeout.

If your system is used exclusively for authentication or personalization, the majority of the indexes can be dropped; only the clustered index on the Object_Attributes table is required.

Option 3: Partial Attribute Replication

A second way of improving replication throughput in a specialized system is to replicate only those attributes that are mission critical. In a typical Site Server deployment, the LDAP servers are using the SQL Server computers to fulfill many roles, including personalization, authentication, and targeted mailing. It is primarily the personalization functionality that induces the high transaction rate, causing problems for SQL Server replication.

By choosing to replicate only those attributes needed for authentication and mail resolution, you can reduce the add, modify, and delete rates so they can more readily be handled by the SQL Server replication system. These attributes change less frequently, so the replication engine has an easier time keeping up with the load.

In the event of a failure on the master, only the replicated attributes will be accessible. Attributes you did not replicate will be unrecoverable from the subscriber. Read operations, including authentication, mail routing, and mail name resolution, can be performed from the subscriber while you repair the publisher. Alternatively, the subscriber can be promoted to a publisher, as detailed above.

Other requirements and limitations are effectively the same as in Option 1, except that operations on the subscriber are limited by which attributes were replicated.

Performance Summary

  Hardware Failure Software Failure Data Corruption
Read Operations Good if pointers are swapped to the replica. As hardware Minimal to no protection. Corruption will either be replicated or stop the replication process.
Write Operations None until replica is converted to master, this conversion is error prone and will take considerable time depending on amount of data in store. As hardware Write operations against a corrupted database will be suspect at best.
System Restoration Complex, time consuming, prohibitive in a very high scale environment. Fail over to subscriber for read operations while rebooting master. No guaranteed recovery method.

SQL Server 7.0 Replication

SQL Server 7.0 replication offers significant benefits over SQL Server 6.5 replication. This allows SQL Server 7.0 to play a much more significant role in failsafe operations.

Advantages

Limitations

Option 1: Full Replication

This is the simplest option. Replicate all tables to provide a warm standby copy of the data. If there is a hardware or software failure on the publisher, promote a subscriber. If there is a hardware or software failure on the subscriber, synchronize a new subscriber. Note that this option offers no protection against data corruption.

Option 2: Leverage Writable Subscribers

SQL Server 7.0 has a replication mode called Merge Replication. This allows writes to a subscriber database to be periodically propagated back to the publisher, where they can in turn be pushed out to the other subscribers. This mode can be leveraged by configuring a one-to-one mapping of LDAP server sets to SQL Server sets.

A failure on a given SQL Server subscriber set only drops the affiliated LDAP server set; this failure can be repaired after taking that LDAP server set out of DNS, or redirecting the LDAP set to a SQL Server replica that is still functional.

If a failure occurs on a publisher, an LDAP server may be redirected to a functioning subscriber while the publisher is repaired or a subscriber is promoted to a publisher. Writes to subscribers are possible in this model, and will be distributed to all subscribers once a publisher is established and re-synchronized.

Like all SQL Server replication-based solutions, this offers little or no protection from user corruption of data.

Performance Summary

  Hardware Failure Software Failure Data Corruption
Read Operations Good. Read operations only affected on those LDAP servers that leverage the failed SQL server; recovery involves switching to the replica. As hardware Minimal to no protection. Corruption will either be replicated or stop the replication process.
Write Operations Poor until replica is converted to master. This conversion is error prone and will take considerable time depending on amount of data in store. As hardware Write operations against a corrupted database will be suspect at best.
System Restoration Fairly simple promotion of subscriber. Most operations recovered in minutes; failure of publisher implies tens of minutes down time. Fail over to subscriber for read operations while rebooting master. No guaranteed recovery method.

Microsoft Windows NT Clustering Services

Microsoft Windows NT Clustering Services (clustering) offers a great degree of protection from system failures, either at the hardware or software level. This protection is automatic and “lights out” in most cases, offering a very good administration/protection tradeoff. It offers no protection from data corruption and still holds a single point of failure in the disk controller. For this reason, hardware solutions for failsafe storage devices are recommended.

Advantages

Limitations

Option 1: Clustering with Shared Disk Array

Clustering with a single shared disk array is the standard scenario. This provides protection from all catastrophic hardware and software problems. Through use of RAID 5, all other single points of failure are eliminated. Recovery from other hardware or software failures is automated and imposes no downtime on the system.

As previously mentioned, this solution does not offer protection from data corruption.

Performance Summary

  Hardware Failure Software Failure Data Corruption
Read Operations Excellent. Excellent. Automatic fail-over. Minimal to no protection. Read operation may be possible against corrupted database.
Write Operations As-read operations. Excellent. Automatic fail-over. Write operations against a corrupted database will be suspect at best.
System Restoration Fix hardware. Reboot failed computer. No guaranteed recovery method.

Keeping a Warm Backup SQL Server

Using this strategy, a backup computer is created from the primary SQL server by taking a database dump from the source computer, then incrementally testing the source computer for data corruption and applying incremental backups (backing up from transaction logs). If the primary computer fails or becomes corrupted, the backup computer is available as a standby.

A well thought out and executed backup strategy is unquestionably the best single solution for failsafe operations. Maintaining a hot standby server from backups allows the best compromise between minimal downtime, data integrity, and reasonable administrative overhead. It requires constant monitoring and administrator involvement, but most (if not all) of these tasks can be automated through scripting.

Advantages

Limitations

Option 1: Restore on Demand

This solution requires low administrative overhead, and involves making an occasional full backup from the master database and storing a sequence of transaction log deltas. Once each interval, where the length of the interval is determined from the needs of the site, the full backup snapshot is taken again and the intermediate logs deleted.

In the event of corruption, software or hardware failure, deltas are incrementally tested for corruption and applied to the full backup to achieve a “known good” state.  Although this approach requires a much lower administrative overhead, it results in a longer interval to system recovery.

Option 2: Restore to Warm Backup Computer

Another option that greatly reduces downtime at the expense of administrative overhead involves maintaining a backup computer in a “warm” state. This implies priming the backup with a full snapshot of the source database, then applying transaction logs as follows:

  1. Use DBCC to test source database for corruption.

  2. Capture transaction log to backup computer.

  3. If no corruption is detected, apply transaction log to backup computer.

  4. If irreparable corruption is detected, discard transaction log and switch servers to use backup computer. Transactions since last good snapshot are lost if not recoverable.

  5. If the source database is irreparably corrupted, back up the old backup computer (now the source) to the old source computer (now the backup) and restart the cycle in the opposite direction.

Note: This option provides very low downtime at the expense of constant administrative monitoring as detailed above.  It is unquestionably the best solution available for failsafe data center operations.

Performance Summary

  Hardware Failure Software Failure Data Corruption
Read Operations Operator intervention required; minimal downtime while SQL server pointer is switched. As hardware Read operations will be unpredictable against a corrupt database.
Write Operations Operator intervention required; minimal downtime while SQL server pointer is switched. As hardware Write operations against a corrupted database will be suspect at best.
System Restoration Switch to warm backup computer. Switch to warm backup computer. Only guaranteed recovery method. Switch to warm backup; data involved in the corruption process will be lost.

Recommended Approach: Combined Solution

The ideal approach to achieving failsafe operations in the SQL tier of the Site Server architecture is the combination of clustering technology and maintaining a warm backup.

Clustering was previously detailed under Microsoft Windows NT Clustering Services, Option 1. This will provide protection against virtually all failure scenarios in hardware and software in a failsafe, “lights-out” manner. Data corruption is the only remaining vulnerability.

If the approach detailed in “Backing Up from Transactions” is used, data corruption can be detected and, if the corruption is irreparable, the warm backup can provide an immediately available, valid data store to work from.

This solution demands a high investment in both hardware and administrative overhead, but offers true, failsafe, 24-hour operations in the Site Server environment.

Where to Go for More Information

SQL Server Replication

http://www.microsoft.com/sql/default.htm

SQL 6.5 Replication

Microsoft SQL Server 6.0 Administrator’s Companion, Part 6

SQL 7.0 Replication

Microsoft SQL Server 7.0 Administrator’s Companion

http://www.microsoft.com/sql/default.htm

SQL Server Use of Microsoft Windows NT Clustering Services

http://www.microsoft.com/sql/default.htm

SQL Backup Strategies

Microsoft SQL Server 6.0 Administrator’s Companion, Appendix E

Microsoft SQL Server 7.0: http://www.microsoft.com/sql/default.htm

Summary

This document describes the options available to Site Server administrators who wish to operate their data center in a failsafe manner. This information should be applied in a way that makes the most sense for each individual environment.

For the site with sufficient hardware and administrative resources, combining clustering with the maintenance of a warm backup SQL Server is recommended, to protect against hardware, software, and data failures.

Note that the technical implementation details for the solutions outlined in this paper are only touched upon here. The administrator needing more information is strongly encouraged to refer to the supporting documentation for Microsoft Windows NT Clustering Services and Microsoft SQL Server, as well as documentation from individual hardware vendors.

The Site Serverteam is committed to providing you the best possible server products for the Internet. We welcome your feedback about how we can better meet your needs. You may contact the Microsoft support team by visiting: http://support.microsoft.com/support.

Information in this document, including URL and other Internet web site references, is subject to change without notice.  The entire risk of the use or the results of the use of this resource kit remains with the user.  This resource kit is not supported and is provided as is without warranty of any kind, either express or implied.  The example companies, organizations, products, people and events depicted herein are fictitious.  No association with any real company, organization, product, person or event is intended or should be inferred.  Complying with all applicable copyright laws is the responsibility of the user.  Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document.  Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

© 1999-2000 Microsoft Corporation.  All rights reserved.

Microsoft, Windows, and Windows NT are either registered trademarks or trademarks of Microsoft Corporation in the U.S.A. and/or other countries/regions.

The names of actual companies and products mentioned herein may be the trademarks of their respective owners.