Contingency Planning

This section identifies topics you should consider in your contingency planning.

Identifying Costs of a Failure

There are several ways to measure costs. Some costs are easy to understand and to calculate, such as the following:

But measuring the cost of server downtime to a department or your company can be more difficult.

And how do you measure the cost of a server failure on:

If you have kept records of failures, you might find them useful in your contingency planning. You can investigate ways to avoid each failure, or to minimize the downtime associated with the failure. If you have cost information for the failures, you can then compare the cost of each failure to the cost of preventing or minimizing the failure.

Here are two examples:

Example 1

Failure: File server in sales department down, network card failure.
Effect: Lost sales.
Total downtime last year: 3 hours.
Costs of failure per hour: $10,000
Total downtime costs last year: $30,000
Possible resolution or workaround: 3 spare network cards @ $500 each.
Expected costs of resolution or workaround: $1,500
Estimated savings during first year with resolution in place: $28,500

Example 2

Failure: Router failure between development and testing department.
Effect: Lost productivity of employees.
Total downtime last year: 16 hours.
Costs of failure per hour: Average hourly wage of 10 affected employees is $18/hr.
Total downtime costs last year: $2,880
Possible resolution or workaround: Put an alternate router in place or obtain a spare router.
Expected costs of resolution or workaround: $500 - $2,000
Estimated savings during first year with resolution in place: $880 - $2,380
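
The arithmetic behind the two examples can be sketched in a few lines of Python. This is only an illustration: the function names and parameters are hypothetical, and the figures are the ones given in the examples. The sketch assumes that the resolution eliminates all of the downtime seen last year, which is the same simplification the examples make.

```python
def downtime_cost(hours_down, cost_per_hour):
    """Total cost of downtime: hours lost times the hourly cost of the failure."""
    return hours_down * cost_per_hour


def first_year_savings(hours_down, cost_per_hour, resolution_cost):
    """Savings if the resolution prevents all of last year's downtime (an assumption)."""
    return downtime_cost(hours_down, cost_per_hour) - resolution_cost


# Example 1: sales file server, 3 hours down at $10,000 per hour,
# resolved with 3 spare network cards at $500 each.
print(first_year_savings(3, 10_000, 3 * 500))      # 28500

# Example 2: router failure, 16 hours down, 10 employees at $18 per hour.
hourly_cost = 10 * 18
print(first_year_savings(16, hourly_cost, 2_000))  # 880  (most expensive resolution)
print(first_year_savings(16, hourly_cost, 500))    # 2380 (least expensive resolution)
```

If a resolution is expected to reduce downtime rather than remove it, the savings shrink accordingly, so it is worth rerunning the comparison with a more conservative estimate.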


What Components in the Computer Might Fail

A server failure is typically the most costly kind of failure for a corporation's business, whether the server is a file server, a print server, or an application server. This section discusses the components within the computer to help you decide how to configure it. You should also regularly run diagnostics on the individual components.

Motherboard and CPU (Central Processing Unit)

Motherboards consist of electronics that can and do fail, yet the motherboard and the CPU are among the more reliable computer components. There is not a lot you can do to avoid a motherboard failure or CPU fault, except to regularly run system checks that ensure they are functioning correctly. Some vendors provide systems with built-in diagnostics that operate with Windows NT.

RAM

There are three major types of RAM, in the sense of error detection and correction.

There are a few vendors that supply products that you can use to check the RAM in a computer.

Video Cards

Video cards drive the screen and render images for display. Video cards rarely cause a computer to crash. Instead, they might cause the computer to behave erratically, which can be confusing to diagnose. More often, video cards cause screen redraw problems, application page faults, and the like. These problems are usually not critical enough to require you to shut down the computer. To minimize video problems, be sure that your computer is running the most recent release of a supported video driver from Microsoft or from the third-party vendor selling that card.

Disks and Disk Controllers

You should investigate IDE, EIDE, and SCSI technologies, because each offers different benefits with respect to fault tolerance and recovery. You have many choices for your disk configuration, including fault-tolerant configurations. The mean time between failures (MTBF) gives you a measure of the expected reliability of disks and controllers.

Be sure to run disk and controller diagnostics during every preventive maintenance period. Diagnostics should be available from your hardware vendor. Windows NT automatically runs its Chkdsk program every time you start up, and you can run a surface scan of the disks by specifying chkdsk /r.

Backup Devices

Verifying your backups by doing test restores is the best way to make sure that your backup devices and media are working correctly.
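
One way to check a test restore is to compare checksums of the original files with the restored copies. The following Python sketch is a minimal illustration and is not tied to any particular backup product. The function names `file_digest` and `compare_restore` are hypothetical, and the sketch assumes that the original files have not changed since the backup was taken.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_restore(original_dir, restored_dir):
    """Return (relative path, problem) pairs for files missing or different after a restore."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file():
            problems.append((str(relative), "missing"))
        elif file_digest(source) != file_digest(restored):
            problems.append((str(relative), "content differs"))
    return problems
```

Calling `compare_restore` with the original directory and the directory that received the test restore returns an empty list when every file was restored intact. Files that were modified after the backup ran are also reported as different, so the comparison is most meaningful when run against data that has not changed since the backup.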

Network Cards

FDDI, CDDI, and ATM network cards can have dual-channel connections. If one channel goes down, the other channel is automatically used.

Ethernet and token ring network cards do not have a dual-channel capability. If the manufacturer provides a diagnostic program, you should run diagnostics on the network cards during scheduled preventive maintenance or downtime periods.

You can evaluate network segments with network packet trace programs, called sniffers. The Windows NT Network Monitor can check for:

Power Considerations

You need to have consistent, reliable power to be able to run your computers. Power failures, power surges, and power sags can cause the computers to crash and can damage the electronics. There are different situations that affect the power supplied to your computer.

Power Supply

Be sure your computer has a high quality power supply that can simultaneously support all components attached to it. It is possible to overload a power supply by adding too many power consuming devices to a computer. It is also possible to overload a circuit by having too many computer components on it.
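
A simple power budget makes the overload check concrete. The Python sketch below is illustrative only: the function name `power_headroom`, the 20 percent reserve margin, and all of the wattage figures are assumptions, and real figures should come from your hardware vendors.

```python
def power_headroom(supply_watts, component_watts, safety_margin=0.2):
    """Return the watts left after the load and a reserve margin are subtracted.

    A negative result means the planned load exceeds the usable capacity.
    The 20 percent default margin is an illustrative assumption.
    """
    usable = supply_watts * (1 - safety_margin)
    return usable - sum(component_watts.values())


# Hypothetical component draws in watts; use the vendor's figures in practice.
server_load = {
    "motherboard and CPU": 60,
    "RAM": 10,
    "disk 1": 15,
    "disk 2": 15,
    "tape drive": 20,
    "network card": 5,
}

print(power_headroom(250, server_load))  # positive: load fits within the supply
print(power_headroom(150, server_load))  # negative: the supply would be overloaded
```

The same calculation applies to an electrical circuit: replace the supply rating with the circuit's capacity and list every device plugged into that circuit.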

Power Outages

To protect against damage and loss of data from temporary power outages, consider purchasing an uninterruptible power supply (UPS). Windows NT supports different kinds of UPSs and can send messages to users to save their data and to log off as soon as the UPS device signals an impending shutdown. For more information about the Windows NT UPS service, see the section titled "Avoiding Single Points of Failure," presented later in this chapter.

Power Surges

To protect against power surges that can destroy your data, obtain quality power surge suppressors.

What Components in the Network Might Fail

If users cannot connect to your computers running Windows NT Server, you do not have a fault-tolerant configuration. You need to consider what can fail in the connections between your computers as well as the individual computers themselves.

Network Cabling

If your company is growing, the connections between your computers could become saturated with network traffic. You should evaluate network traffic regularly to determine whether you need to upgrade with more equipment. You can also use newer technologies such as FDDI, CDDI, ATM, Fast Ethernet (100BaseT), and the like.

Intermediary Devices

Devices that connect different segments of your network, such as routers, bridges, hubs, and switches, can also be bottlenecks and points of failure. You should have UPS protection for these devices.

For each of these devices, find out whether you can get the vendor support you need. What standards are supported by each of the devices, and does the vendor have a migration path for new standards? You should also find out if there are frequent software changes.

External Network Connections

If you lease an X.25, ISDN, or T1 line to connect to another building, branch, or subsidiary of your company, verify that your line vendor has recovery procedures in place that can guarantee minimal downtime for the line.

Electric Wiring

Be sure the wiring in your building is capable of supplying your company with enough power as the demands in electricity increase. This is especially important if the building is older or if there are other companies in the building that are putting more demands on the supply of electricity.

What Else Might Fail

Climate Control

If the weather at the location of your company requires heating or cooling within the building to keep your computers and network devices within required operating temperatures, consider making the climate control system fault tolerant as well.

Software Failures

Does your software vendor provide the support you need in case of software failure? Does your company have a technical support group to assist users when there are software problems?

How Likely is a Failure

The mean time between failures (MTBF) information supplied by some manufacturers of equipment such as hard disks is unlikely to provide useful information without extensive analysis and modeling according to the variables that exist in your company's usage pattern. You can, however, use the MTBF as a relative measure of reliability.
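
As a rough illustration of using MTBF as a relative measure, the Python sketch below converts an MTBF rating into the probability of at least one failure during a year of continuous operation. It assumes a constant failure rate (an exponential model), which ignores usage patterns, aging, and environment, so the results are only useful for comparing one device with another. The function name and the MTBF ratings are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days of continuous operation


def annual_failure_probability(mtbf_hours, hours=HOURS_PER_YEAR):
    """Probability of at least one failure in the period, assuming a constant failure rate."""
    return 1 - math.exp(-hours / mtbf_hours)


# Hypothetical MTBF ratings for two disk models.
for name, mtbf in [("disk A", 300_000), ("disk B", 500_000)]:
    probability = annual_failure_probability(mtbf)
    print(f"{name}: {probability:.1%} chance of at least one failure per year")
```

The absolute percentages should not be taken at face value, but the ordering of the two devices is the kind of relative comparison that MTBF figures can support.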

More useful are records of past failures and their causes, because you can use this information to help you in your planning. You can categorize failures by their type, such as:

These are some questions to ask about failures:

Fewer computers can be easier to manage than many computers. However, the relative downtime impact is higher when you have more users connected to a smaller number of servers.
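
The trade-off can be illustrated with a small calculation. The Python sketch below uses a hypothetical function name and a hypothetical site of 600 users, and it assumes that users are spread evenly across the servers.

```python
def users_affected_per_outage(total_users, server_count):
    """Average number of users who lose service when one server fails.

    Assumes users are spread evenly across servers (a simplification).
    """
    return total_users / server_count


# Hypothetical site with 600 users.
for servers in (2, 6, 12):
    affected = users_affected_per_outage(600, servers)
    print(f"{servers} servers: about {affected:.0f} users affected per server failure")
```

Consolidating onto fewer servers reduces the number of computers to manage, but each failure then interrupts a larger share of your users.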

Importance of Training and Technical Support

Having trained personnel can reduce the likelihood of failures and reduce their severity. However, you need to determine if the cost of the training will be worth the expected benefit.

There are several ways you can train your personnel and provide them with technical support:

You can also contract with Microsoft, your hardware vendors, and third-party consultants for support.

See the Windows NT Server Start Here book for a description of Microsoft's AnswerPoint Information Services. Microsoft's support offerings range from no-cost and low-cost online information services (available 24 hours a day, 7 days a week) to annual support plans.

Importance of Testing

Testing is an important component of your contingency planning. You can use testing to try to predict failure situations and to practice recovery procedures. Be sure to do stress testing and test all functionality.

The following list identifies some of the failures that you should test:

These are some of the stress tests that you should set up: