This section identifies topics you should consider in your contingency planning.
There are several ways to measure costs. Some costs are easy to understand and to calculate, such as the following:
But measuring the cost of server downtime to a department or your company can be more difficult.
And how do you measure the cost of a server failure on:
If you have kept records of failures, you might find them useful in your contingency planning. You can investigate ways to avoid each failure, or to minimize the downtime associated with the failure. If you have cost information for the failures, you can then compare the cost of each failure to the cost of preventing or minimizing the failure.
Here are two examples:
Failure | File server in sales department down because of a network card failure | Router failure between the development and testing departments |
Effect | Lost sales | Lost employee productivity |
Total downtime last year | 3 hours | 16 hours |
Cost of failure per hour | $10,000 | $180 (10 affected employees at an average wage of $18/hr) |
Total downtime cost last year | $30,000 | $2,880 |
Possible resolution or workaround | 3 spare network cards @ $500 each | Put an alternate router in place or obtain a spare router |
Expected cost of resolution or workaround | $1,500 | $500 - $2,000 |
Estimated savings during first year with resolution in place | $28,500 | $880 - $2,380 |
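The arithmetic behind the two examples in the table can be reproduced with a short calculation. The following is a minimal sketch in Python; the function name first_year_savings and its parameters are illustrative only, and it assumes, as the table does, that the resolution eliminates all of the downtime recorded last year.

```python
def first_year_savings(downtime_hours, cost_per_hour, resolution_cost):
    """Return (downtime cost, estimated first-year savings).

    Assumes the resolution removes all of last year's downtime,
    which is the same simplification the table makes.
    """
    downtime_cost = downtime_hours * cost_per_hour
    return downtime_cost, downtime_cost - resolution_cost


# File server example: 3 hours at $10,000/hour, 3 spare cards at $500 each.
server_cost, server_savings = first_year_savings(3, 10_000, 3 * 500)

# Router example: 16 hours, 10 employees at $18/hour,
# with a router costing between $500 and $2,000.
router_cost, router_savings_low = first_year_savings(16, 10 * 18, 2_000)
_, router_savings_high = first_year_savings(16, 10 * 18, 500)

print(server_cost, server_savings)
print(router_cost, router_savings_low, router_savings_high)
```

If the savings figure is negative, the resolution costs more than the downtime it is expected to prevent during the first year.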
A server failure is typically the most costly failure with respect to a corporation's business, whether it is a file server, a print server, or an application server. This section discusses the components within the computer to help you decide how to configure it. You should also regularly run diagnostics on the individual components.
Motherboards consist of electronics that can and do fail, yet the motherboard and the CPU are among the more reliable computer components. There is not a lot you can do to avoid a motherboard failure or CPU fault, except to regularly run system checks that ensure they are functioning correctly. Some vendors provide systems with built-in diagnostics that operate with Windows NT.
There are three major types of RAM, in the sense of error detection and correction.
There are a few vendors that supply products that you can use to check the RAM in a computer.
Video cards drive the screen as well as render images for display. Video cards rarely cause a computer to crash, but they might cause the computer to behave erratically, which can be confusing to diagnose. More often, video cards cause screen redraw problems, application page faults, and the like. These problems are usually not critical enough to require you to shut down the computer. To minimize video problems, be sure that your computer is running the most recent release of a supported video driver from Microsoft or from the third-party vendor selling that card.
You should investigate IDE, EIDE, and SCSI technologies, because each offers different benefits with respect to fault tolerance and recovery. You have many choices for your disk configuration, including fault-tolerant configurations. The mean time between failures (MTBF) gives you a measure of the expected reliability of disks and controllers.
Be sure to run disk and controller diagnostics during every preventive maintenance period. Diagnostics should be available from your hardware vendor. Windows NT automatically runs its Chkdsk program every time you start up, and you can run a surface scan of the disks by specifying chkdsk /r.
Verifying your backups by doing test restores is the best way to make sure that your backup devices and media are working correctly.
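One way to check a test restore is to compare the restored files against the originals. The following Python sketch is an illustration only and is not part of any Windows NT backup tool; the function names file_digest and compare_restore are hypothetical. It assumes the original files have not changed since the backup was taken, because any file modified afterward is reported as different.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_restore(original_dir, restored_dir):
    """Return (relative path, problem) pairs for files that are
    missing from the restore or whose contents differ."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file():
            problems.append((relative.as_posix(), "missing"))
        elif file_digest(source) != file_digest(restored):
            problems.append((relative.as_posix(), "differs"))
    return problems
```

Calling compare_restore with the original directory and the directory you restored into returns an empty list when every original file was restored with identical contents.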
FDDI, CDDI, and ATM network cards can have dual-channel connections. If one channel goes down, the other channel is automatically used.
Ethernet and Token Ring network cards do not have a dual-channel capability. If the manufacturer provides a diagnostic program, you should run diagnostics on the network cards during scheduled preventive maintenance or downtime periods.
You can evaluate network segments with network packet trace programs, called sniffers. The Windows NT Network Monitor can check for:
You need to have consistent, reliable power to be able to run your computers. Power failures, power surges, and power sags can cause the computers to crash and can damage the electronics. There are different situations that affect the power supplied to your computer.
Be sure your computer has a high quality power supply that can simultaneously support all components attached to it. It is possible to overload a power supply by adding too many power consuming devices to a computer. It is also possible to overload a circuit by having too many computer components on it.
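A simple power budget can show whether a power supply is close to being overloaded. The following Python sketch is illustrative only: the function name power_headroom, the 20 percent safety margin, and all of the wattage figures are assumptions made for the example. Use the ratings published by your hardware vendors for a real calculation.

```python
def power_headroom(supply_watts, component_watts, margin_percent=20):
    """Return the watts left after reserving a safety margin.

    A negative result means the listed components exceed the budget.
    The default 20% margin is an assumption for illustration only.
    """
    usable = supply_watts * (100 - margin_percent) / 100
    return usable - sum(component_watts.values())


# Hypothetical component ratings, in watts.
components = {
    "motherboard and CPU": 60,
    "disks (4)": 48,
    "tape drive": 15,
    "adapters": 25,
}

print(power_headroom(250, components))
```

The same kind of budget can be kept for a building circuit by listing every device plugged into it against the circuit's rated capacity.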
To protect against damage and loss of data from temporary power outages, consider the purchase of an uninterruptible power supply (UPS). Windows NT supports different kinds of UPS devices and can send messages to users to save their data and to log off as soon as the UPS device signals an impending shutdown. For more information about the Windows NT UPS service, see the section titled "Avoiding Single Points of Failure," presented later in this chapter.
To protect against power surges that can destroy your data, obtain quality surge suppressors.
If users cannot connect to your computers running Windows NT Server, you do not have a fault-tolerant configuration. You need to consider what can fail in the connections between your computers as well as the individual computers themselves.
If your company is growing, the connections between your computers could become saturated with network traffic. You should evaluate network traffic regularly to determine whether you need to upgrade with more equipment. You can also use newer technologies such as FDDI, CDDI, ATM, 100BaseT Fast Ethernet, and the like.
Devices that connect different segments of your network, such as routers, bridges, hubs, and switches, can also be bottlenecks and points of failure. You should have UPS protection for these devices.
For each of these devices, find out whether you can get the vendor support you need. What standards are supported by each of the devices, and does the vendor have a migration path for new standards? You should also find out if there are frequent software changes.
If you lease an X.25, ISDN, or T1 line to connect to another building, branch, or subsidiary of your company, verify that your line vendor has recovery procedures in place that can guarantee minimal downtime for the line.
Be sure the wiring in your building is capable of supplying your company with enough power as the demand for electricity increases. This is especially important if the building is older or if other companies in the building are putting more demands on the supply of electricity.
If the weather at the location of your company requires heating or cooling within the building to keep your computers and network devices within required operating temperatures, consider making the climate control system fault tolerant as well.
Does your software vendor provide the support you need in case of software failure? Does your company have a technical support group to assist users when there are software problems?
The mean time between failures (MTBF) information supplied by some manufacturers of equipment such as hard disks is unlikely to be useful without extensive analysis and modeling according to the variables that exist in your company's usage pattern. You can, however, use the MTBF as a relative measure of reliability.
More useful are records of past failures and their causes, because you can use this information to help you in your planning. You can categorize failures by their type, such as:
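To illustrate using MTBF as a relative measure, the following Python sketch compares two disks. It assumes a constant failure rate (an exponential model), which is a common simplification and not something a manufacturer's figure guarantees. The function names and the MTBF values are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365


def annual_failure_probability(mtbf_hours):
    """Probability that one device fails within a year of continuous use,
    assuming a constant failure rate of 1 / MTBF."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(device_count, mtbf_hours):
    """Expected number of failures per year across identical devices
    that are replaced as they fail, under the same assumption."""
    return device_count * HOURS_PER_YEAR / mtbf_hours


# Hypothetical MTBF figures for two disk models, in hours.
for name, mtbf in (("disk A", 300_000), ("disk B", 500_000)):
    print(name,
          round(annual_failure_probability(mtbf), 4),
          round(expected_failures_per_year(50, mtbf), 2))
```

Treat the results as a way to rank alternatives, not as a prediction of how many failures you will actually see.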
These are some questions to ask about failures:
Fewer computers can be easier to manage than many computers. However, the relative downtime impact is higher when you have more users connected to a smaller number of servers.
Having trained personnel can reduce the likelihood of failures and reduce their severity. However, you need to determine if the cost of the training will be worth the expected benefit.
There are several ways you can train your personnel and provide them with technical support:
You can also contract with Microsoft, your hardware vendors, and third-party consultants for support.
See the Windows NT Server Start Here book for a description of Microsoft's AnswerPoint Information Services. Microsoft's support offerings range from no-cost and low-cost online information services (available 24 hours a day, 7 days a week) to annual support plans.
Testing is an important component of your contingency planning. You can use testing to try to predict failure situations and to practice recovery procedures. Be sure to do stress testing and test all functionality.
The following list identifies some of the failures that you should test:
These are some of the stress tests that you should set up: