Contingency Planning

This section identifies topics you should consider in your contingency planning.

Identifying Costs of a Failure

There are several ways to measure costs. Some costs are easy to understand and to calculate, such as the following:

Replacement costs for file servers, mail servers, or print servers.
Replacement costs for servers running applications such as Microsoft SQL Server or Systems Management Server (SMS).
Replacement costs for gateway servers running Microsoft RAS, SNA, Proxy, or NetWare.
Workstation replacement costs for personnel in different departments.
Replacement costs for individual computer components, such as a hard disk or a network card.

But measuring the cost of server downtime to a department or your company can be more difficult.

For example, how do you measure the cost of a server failure in terms of the following?

Lost sales.
Lost customer goodwill.
Lost employee productivity and confidence.
Increased costs because of makeup time.
Missed contractual obligations or possible legal liabilities.
Perishable products going to waste.
Loss of competitiveness.

If you have kept records of failures, you might find them useful in your contingency planning. You can investigate ways to avoid each failure, or to minimize the downtime associated with the failure. If you have cost information for the failures, you can then compare the cost of each failure to the cost of preventing or minimizing the failure.

Here are two examples:

Failure	File server in sales department down because of a network card failure	Router failure between the development and testing departments
Effect	Lost sales	Lost employee productivity
Total downtime last year	3 hours	16 hours
Cost of failure per hour	$10,000	$180 (10 affected employees at an average wage of $18 per hour)
Total downtime cost last year	$30,000	$2,880
Possible resolution or workaround	3 spare network cards at $500 each	Put an alternate router in place or obtain a spare router
Expected cost of resolution or workaround	$1,500	$500 to $2,000
Estimated savings during first year with resolution in place	$28,500	$880 to $2,380
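
The arithmetic behind these two examples can be sketched in a few lines of Python. The function and parameter names are illustrative assumptions and are not part of any Windows NT tool; the figures are the ones given in the examples.

```python
def first_year_savings(downtime_hours, cost_per_hour, resolution_cost):
    """Return (downtime_cost, savings) for one year of recorded downtime."""
    downtime_cost = downtime_hours * cost_per_hour
    return downtime_cost, downtime_cost - resolution_cost

# Sales file server: 3 hours down at $10,000 per hour, 3 spare cards at $500 each.
sales_cost, sales_savings = first_year_savings(3, 10_000, 3 * 500)

# Router: 16 hours down, 10 employees at $18 per hour, router costs $500 to $2,000.
router_cost, router_best = first_year_savings(16, 10 * 18, 500)
_, router_worst = first_year_savings(16, 10 * 18, 2_000)

print(sales_cost, sales_savings)                # 30000 28500
print(router_cost, router_worst, router_best)   # 2880 880 2380
```

A negative savings figure would indicate that the workaround costs more than the downtime it is meant to prevent, at least in the first year.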

What Components in the Computer Might Fail

A server failure is typically the most costly with respect to a corporation's business, whether it is a file server, a print server, or an applications server. This section discusses the components within the computer to help you decide how to configure it. You should also regularly run diagnostics on the individual components.

Motherboard and CPU (Central Processing Unit)

Motherboards consist of electronics that can and do fail, yet the motherboard and the CPU are among the more reliable computer components. There is little you can do to avoid a motherboard failure or CPU fault, except to regularly run system checks that ensure they are functioning correctly. Some vendors provide systems with built-in diagnostics that operate with Windows NT.

RAM

There are three major types of RAM, in the sense of error detection and correction.

Parity RAM. Parity RAM has an extra bit that indicates if each byte in the RAM is good or faulty. When parity RAM detects a parity difference, it signals the CPU through a Non-Maskable Interrupt (NMI). Depending on where and when this happens, Windows NT determines if this is an I/O board parity error, memory bus error, or some other kind of parity error. Windows NT can also report I/O channel parity errors from cards in slots. You get an error message in these cases, and sometimes the computer stops.
Error-correcting code (ECC) RAM. High-end systems often use ECC RAM, which can detect a two-bit failure and correct a single-bit failure in the system memory. Windows NT continues to run in spite of a single-bit failure. Depending on the hardware vendor's design, there might or might not be a report of this corrective action.
Non-parity RAM. If you are using non-parity RAM, Windows NT has no way to detect memory problems, and your computer might crash randomly and inconsistently. Non-parity RAM is cheaper, and parity RAM is not available for all computers. If you do not have parity RAM in your computers, ask your vendors whether the computers can support parity RAM.
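
To illustrate why parity RAM can detect a fault but cannot correct it, here is a minimal Python sketch of a single parity bit protecting one byte. It assumes even parity purely for illustration; actual memory hardware may use odd parity, and the function names are hypothetical.

```python
def parity_bit(byte):
    """Return the even-parity bit for an 8-bit value (1 if the count of 1 bits is odd)."""
    return bin(byte & 0xFF).count("1") % 2

def parity_ok(byte, stored_parity):
    """Return True if the byte still matches the parity bit stored with it."""
    return parity_bit(byte) == stored_parity

original = 0b10110010                    # four 1 bits, so the parity bit is 0
stored = parity_bit(original)

one_bit_error = original ^ 0b00000100    # one bit flipped
two_bit_error = original ^ 0b00000110    # two bits flipped

print(parity_ok(original, stored))       # True
print(parity_ok(one_bit_error, stored))  # False: the error is detected
print(parity_ok(two_bit_error, stored))  # True: the error goes undetected
```

A single parity bit reports only that something in the byte changed, not which bit, and it misses any even number of flipped bits. ECC RAM stores additional check bits so that a single-bit error can be located and corrected.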

A few vendors supply products that you can use to check the RAM in a computer.

Video Cards

Video cards drive the screen and render images for display. Video cards rarely cause a computer to crash, but they might cause the computer to behave erratically, which can be confusing to diagnose. More often, video cards cause screen redraw problems, application page faults, and the like. These problems are usually not critical enough to require you to shut down the computer. To minimize video problems, be sure that your computer is running the most recent release of a supported video driver from Microsoft or from the third-party vendor selling that card.

Disks and Disk Controllers

You should investigate IDE, EIDE, and SCSI technologies, because each offers different benefits with respect to fault tolerance and recovery. You have many choices for your disk configuration, including fault-tolerant configurations. The mean time between failures (MTBF) gives you a measure of the expected reliability of disks and controllers.

Be sure to run disk and controller diagnostics during every preventive maintenance period. Diagnostics should be available from your hardware vendor. Windows NT automatically runs its Chkdsk program every time you start up, and you can run a surface scan of the disks by specifying chkdsk /r.

Backup Devices

Verifying your backups by doing test restores is the best way to make sure that your backup devices and media are working correctly.

Network Cards

FDDI, CDDI, and ATM network cards can have dual-channel connections. If one channel goes down, the other channel is automatically used.

Ethernet and token ring network cards do not have a dual-channel capability. If the manufacturer provides a diagnostic program, you should run diagnostics on the network cards during scheduled preventive maintenance or downtime periods.

You can evaluate network segments with network packet trace programs, called sniffers. The Windows NT Network Monitor can check for:

Bad cyclic redundancy checks (CRC).
Corrupted packets.
Bandwidth saturation caused by a broadcast-intensive network card.

Power Considerations

You need to have consistent, reliable power to be able to run your computers. Power failures, power surges, and power sags can cause the computers to crash and can damage the electronics. There are different situations that affect the power supplied to your computer.

Power Supply

Be sure your computer has a high-quality power supply that can simultaneously support all components attached to it. It is possible to overload a power supply by adding too many power-consuming devices to a computer. It is also possible to overload an electrical circuit by having too many computer components on it.

Power Outages

To protect against damage and loss of data from temporary power outages, consider purchasing an uninterruptible power supply (UPS). Windows NT supports different kinds of UPSs and can send messages to users to save their data and to log off as soon as the UPS device signals an impending shutdown. For more information about the Windows NT UPS service, see the section titled "Avoiding Single Points of Failure," presented later in this chapter.

Power Surges

To protect against power surges that can destroy your data, obtain quality power surge suppressors.

What Components in the Network Might Fail

If users cannot connect to your computers running Windows NT Server, you do not have a fault-tolerant configuration. You need to consider what can fail in the connections between your computers as well as the individual computers themselves.

Network Cabling

If your company is growing, the connections between your computers could become saturated with network traffic. You should evaluate network traffic regularly to determine whether you need to upgrade or add equipment. You can also use newer technologies such as FDDI, CDDI, ATM, and Fast Ethernet (100BaseT).

Intermediary Devices

Devices that connect different segments of your network, such as routers, bridges, hubs, and switches, can also be bottlenecks and points of failure. You should have UPS protection for these devices.

For each of these devices, find out whether you can get the vendor support you need. What standards are supported by each of the devices, and does the vendor have a migration path for new standards? You should also find out if there are frequent software changes.

External Network Connections

If you lease an X.25, ISDN, or T1 line to connect to another building, branch, or subsidiary of your company, verify that your line vendor has recovery procedures in place that can guarantee minimum downtime for the line.

Electric Wiring

Be sure the wiring in your building is capable of supplying your company with enough power as the demands in electricity increase. This is especially important if the building is older or if there are other companies in the building that are putting more demands on the supply of electricity.

What Else Might Fail

Climate Control

If the weather at the location of your company requires heating or cooling within the building to keep your computers and network devices within required operating temperatures, consider making the climate control system fault tolerant as well.

Software Failures

Does your software vendor provide the support you need in case of software failure? Does your company have a technical support group to assist users when there are software problems?

How Likely Is a Failure

The mean time between failures (MTBF) information supplied by some manufacturers of equipment such as hard disks is unlikely to be useful on its own without extensive analysis and modeling based on your company's usage patterns. You can, however, use the MTBF as a relative measure of reliability when you compare similar devices.
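
One way to compare devices by MTBF is to convert each figure into an approximate chance of failure within a year. The Python sketch below assumes a constant failure rate (an exponential model) and continuous operation, which is a simplification and not something the manufacturer's figure guarantees. The MTBF values and the count of 20 disks are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation

def annual_failure_probability(mtbf_hours):
    """Probability that one device fails within a year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

def any_failure_probability(mtbf_hours, units):
    """Probability that at least one of `units` identical, independent devices fails in a year."""
    return 1.0 - (1.0 - annual_failure_probability(mtbf_hours)) ** units

# Hypothetical MTBF ratings for two disk models, and a site with 20 disks.
for mtbf in (300_000, 500_000):
    single = annual_failure_probability(mtbf)
    fleet = any_failure_probability(mtbf, 20)
    print(f"MTBF {mtbf} h: one disk {single:.2%}, any of 20 disks {fleet:.2%}")
```

Even a device with a very large MTBF has a noticeable chance of failing in a given year once you operate many of them, which is why the figure is more useful for ranking alternatives than for predicting when a particular device will fail.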

More useful are records of past failures and their causes, because you can use this information to help you in your planning. You can categorize failures by their type, such as:

Hardware failure of a server, client, or network component.
Software failure of the operating system on the server or client, or an applications failure.
Administrative error or oversight.
User error.
Deliberate damage, such as sabotage or viruses.
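
If you keep failure records in a structured form, summarizing them by category is straightforward. The following Python sketch is one possible layout, not a prescribed format: the record fields, category labels, and the third log entry are hypothetical, while the first two entries reuse the figures from the cost examples earlier in this section.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class FailureRecord:
    category: str          # for example "hardware", "software", "admin", "user", "deliberate"
    description: str
    downtime_hours: float
    cost_per_hour: float

def summarize(records):
    """Return {category: (incident_count, total_hours, total_cost)}."""
    totals = defaultdict(lambda: [0, 0.0, 0.0])
    for record in records:
        entry = totals[record.category]
        entry[0] += 1
        entry[1] += record.downtime_hours
        entry[2] += record.downtime_hours * record.cost_per_hour
    return {category: tuple(values) for category, values in totals.items()}

log = [
    FailureRecord("hardware", "Sales file server network card", 3, 10_000),
    FailureRecord("hardware", "Router between development and testing", 16, 180),
    FailureRecord("admin", "Backup job misconfigured (hypothetical)", 2, 500),
]

# List the most expensive categories first.
for category, (count, hours, cost) in sorted(summarize(log).items(), key=lambda kv: -kv[1][2]):
    print(f"{category}: {count} incidents, {hours} hours, ${cost:,.0f}")
```

Sorting by total cost puts the categories that deserve attention first when you decide where prevention money is best spent.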

These are some questions to ask about failures:

Have you taken any actions to reduce the likelihood of each failure occurring in your business?
What was done or could be done to fix the problem?
How long would it or did it take?
What would it or did it cost?
What changes have you made that might result in more or fewer failures? Consider changes in the following:
- Number of servers.
- Number of clients.
- Number of users.
- Number of administrators.
- Number of intermediary devices.
- Number of external connections.
- Physical size of LAN(s) or WAN.

Fewer computers can be easier to manage than many computers. However, the relative downtime impact is higher when you have more users connected to a smaller number of servers.

Importance of Training and Technical Support

Having trained personnel can reduce the likelihood of failures and reduce their severity. However, you need to determine if the cost of the training will be worth the expected benefit.

There are several ways you can train your personnel and provide them with technical support:

Use self-study courses.
Subscribe to TechNet.
Use the Internet to access Microsoft and vendor information.
Take vendor-approved or third-party courses.
Have personnel become certified in the use, administration, and troubleshooting of system hardware and software.
Have a technical library available for personnel.
Install computers to be used specifically for training and testing.
Develop your own training courses.

You can also contract with Microsoft, your hardware vendors, and third-party consultants for support.

See the Windows NT Server Start Here book for a description of Microsoft's AnswerPoint Information Services. Microsoft's support offerings range from no-cost and low-cost online information services (available 24 hours a day, 7 days a week) to annual support plans.

Importance of Testing

Testing is an important component of your contingency planning. You can use testing to try to predict failure situations and to practice recovery procedures. Be sure to do stress testing and test all functionality.

The following list identifies some of the failures that you should test:

Individual computer components such as hard disks and controllers, processors, and RAM.
External components such as routers, bridges, switches, cabling, and connectors.

These are some of the stress tests that you should set up:

Heavy network loads.
Heavy disk I/O to the same disk.
Heavy use of file, print, and applications servers.
Large number of simultaneous logons.
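
As an illustration of the "heavy disk I/O to the same disk" test, the following Python sketch has several threads write, flush, and read back files in one directory at the same time. It is a minimal sketch and not a substitute for vendor diagnostics or a dedicated stress tool; the function names, file sizes, and worker count are assumptions chosen for the example.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

MEGABYTE = 1024 * 1024

def write_and_read(directory, worker_id, size_mb=4):
    """Write size_mb megabytes to a file, force it to disk, read it back, return bytes read."""
    path = os.path.join(directory, f"stress_{worker_id}.bin")
    block = os.urandom(MEGABYTE)
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())      # ask the operating system to commit the data to disk
    bytes_read = 0
    with open(path, "rb") as f:
        while chunk := f.read(MEGABYTE):
            bytes_read += len(chunk)
    os.remove(path)
    return bytes_read

def disk_stress(directory, workers=8, size_mb=4):
    """Run write_and_read in parallel; return (total_bytes_read, elapsed_seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda i: write_and_read(directory, i, size_mb), range(workers)))
    return sum(results), time.perf_counter() - start

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        total, seconds = disk_stress(tmp)
        print(f"{total / MEGABYTE:.0f} MB written and read back in {seconds:.2f} s")
```

To exercise a specific disk, pass a directory on that disk instead of a temporary directory, and raise the worker count and file size until the load resembles your heaviest expected use.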