Fault Tolerance

Graceful failover and fault tolerance are vital for mission-critical applications that require high availability. Such resilience is usually achieved through a number of hardware, operating system, and application software mechanisms.

DCOM provides basic support for fault tolerance at the protocol level. A sophisticated pinging mechanism, described in Section 0, "Shared Connection Management Between Applications," detects network and client-side hardware failures. If the network recovers before the timeout interval, DCOM reestablishes connections automatically.

DCOM makes it easy to implement fault tolerance. One technique is the referral component introduced in the previous section. When clients detect the failure of a component, they reconnect to the same referral component that established the first connection. The referral component has information about which servers are no longer available and automatically provides the client with a new instance of the component running on another machine. Applications will, of course, still have to deal with error recovery at higher levels (consistency, loss of information, etc.).

With DCOM's ability to split a component into a server side and a client side, connecting and reconnecting to components, as well as consistency, can be made completely transparent to the client.

Example: Microsoft's Transaction Server ("Viper") provides a generic mechanism for handling consistency at the application level. Combining multiple method invocations into atomic transactions guarantees consistency and makes it easier for applications to avoid loss of information.

Figure 15 - Distributed component for fault-tolerance

Another technique is commonly referred to as "hot backup." Two copies of the same server component run in parallel on different machines, processing the same information. Clients can explicitly connect to both machines simultaneously. DCOM's "distributed components," make this action completely transparent to the client application by injecting server code on the client-side, which handles the fault-tolerance. Another approach would use a coordinating component running on a separate machine, which issues the client requests to both server components on behalf of the client.

A failover attempts to "migrate" a server component from one machine to the other when errors occur. This approach is used by the first release of Windows NT Clusters, but it can also be implemented at the application level. DCOM's "distributed components" make it easier to implement this functionality and shields clients from the details.

DCOM makes implementing sophisticated fault-tolerance techniques easier. Details of the solution can be hidden from clients using DCOM's "distributed components," which run part of the component in the client process. Developers can enhance their distributed application with fault-tolerance features without changing the client component or even reconfiguring the client machine.