Now that you have created floppy disks to start the computer, created an Emergency Repair Disk, and backed up the Master Boot Record and Partition Boot Sector, use those utilities and floppy disks to practice recovering from problems. Going through the recovery process encourages you to be diligent about making backups (Emergency Repair Disk, Master Boot Record, Partition Boot Sector, and data). It also shows you how long each procedure takes.
You should have a computer that you can use to do the following:
If you have multiple disks in your configuration or have more than one installation of Windows NT, you should thoroughly test using your Windows NT startup floppy disk to start the computer from each boot partition. This is especially critical on x86-based computers when your configuration has both SCSI and IDE or EIDE disks. Even when the multi() syntax works for starting from a SCSI hard disk, you sometimes must modify the Boot.ini file on the Windows NT startup floppy disk to use the scsi() syntax instead.
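When you maintain several startup floppy disks, it can help to list the ARC paths in each Boot.ini file and see at a glance which entries use multi() and which use scsi(). The following is a minimal Python sketch of that idea, not a Windows NT tool. The function names and the sample Boot.ini text (including its entry labels) are hypothetical and exist only for illustration. The sketch only splits a path into its fields; it does not interpret them or check that they match your hardware.

```python
import re
from typing import List, NamedTuple

# Matches the multi() and scsi() ARC path forms discussed in this section.
ARC_RE = re.compile(
    r"^(?P<kind>multi|scsi)\((?P<adapter>\d+)\)"
    r"disk\((?P<disk>\d+)\)"
    r"rdisk\((?P<rdisk>\d+)\)"
    r"partition\((?P<partition>\d+)\)"
    r"(?P<directory>\\.*)?$",
    re.IGNORECASE,
)


class ArcPath(NamedTuple):
    kind: str        # "multi" or "scsi"
    adapter: int
    disk: int
    rdisk: int
    partition: int
    directory: str   # for example "\WINNT"; empty if the path has none


def parse_arc_path(text: str) -> ArcPath:
    """Split one ARC path into its fields, or raise ValueError."""
    match = ARC_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a multi()/scsi() ARC path: {text!r}")
    return ArcPath(
        kind=match["kind"].lower(),
        adapter=int(match["adapter"]),
        disk=int(match["disk"]),
        rdisk=int(match["rdisk"]),
        partition=int(match["partition"]),
        directory=match["directory"] or "",
    )


def arc_paths_in_boot_ini(boot_ini_text: str) -> List[ArcPath]:
    """Return the ARC paths listed under [operating systems]."""
    paths = []
    in_os_section = False
    for raw_line in boot_ini_text.splitlines():
        line = raw_line.strip()
        if line.startswith("["):
            in_os_section = line.lower() == "[operating systems]"
            continue
        if in_os_section and "=" in line:
            candidate = line.split("=", 1)[0].strip()
            try:
                paths.append(parse_arc_path(candidate))
            except ValueError:
                pass  # skip non-ARC entries such as C:\
    return paths


# Hypothetical Boot.ini contents, for illustration only.
SAMPLE_BOOT_INI = r'''
[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINNT
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Primary boot partition"
scsi(0)disk(1)rdisk(0)partition(1)\WINNT="Second boot partition"
C:\="MS-DOS"
'''

if __name__ == "__main__":
    for path in arc_paths_in_boot_ini(SAMPLE_BOOT_INI):
        print(path)
```

A listing like this only tells you what each floppy disk claims to start. Whether an entry actually works still has to be confirmed by starting the computer from that floppy disk.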
Be sure to test recovery procedures before bringing a new computer or server into production. Every server operator should have both primary and refresher training in recovering from the most common causes of unexpected server downtime. Testing should include:
If you configure your boot partition as a mirror set, be sure to test that the path to the shadow partition on your Windows NT startup floppy disk is correct.
The following test is sufficient to determine whether the ARC path is correct. The Stop error indicates that the Kernel loaded successfully from the specified ARC path. If the ARC path is wrong, the computer cannot load the Kernel, and you get a message to that effect instead.
1. Shut down Windows NT. Power down disks and controllers, if necessary, to test the path to the shadow boot partition.
2. Use the Windows NT startup floppy disk to start the computer.
3. If your boot selection correctly specifies the alternate ARC path to the shadow boot partition, your computer should get part way through the startup sequence, and then fail with the following Kernel STOP error:
If the computer has software fault-tolerant volumes (a mirror set or a stripe set with parity), you should test the failure and replacement of one of the disks. Even though fault-tolerant volumes continue to work when one disk has failed, there is no fault tolerance until you install a replacement disk. A second disk failure during this interval results in loss of data, because the redundancy was lost when the first disk failed. If a spare disk is not available on site, you should know the business cost of the resulting downtime.
If you have hardware RAID arrays, your vendor's documentation should describe how to recover from disk or controller failures. Be sure to test that their procedures work for your installation.
If a network card or other network component fails on the primary domain controller (PDC), the server operator should be familiar with the procedure for promoting a backup domain controller (BDC) and demoting the failed server. Someone who is familiar with the procedure for reinstalling and reconfiguring the network card should also be available.
If a data volume fails, the server operator must be able to restore the data from the backup tape quickly and efficiently. The restore procedure should be tested frequently, both to ensure the skill of the operator and to test the quality of the backup tapes. The only way to test the quality of backup tapes is to do a full restore, which confirms that the data is up to date and of consistent quality.
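One way to make a test restore measurable is to compare the restored files against a reference copy taken at the time of the backup. The following Python sketch illustrates that check under stated assumptions: it is not part of Windows NT Backup, the function names are hypothetical, and it assumes that both directory trees are reachable as ordinary file system paths.

```python
import hashlib
import os
from typing import Dict, List


def tree_digests(root: str) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full_path = os.path.join(folder, name)
            relative = os.path.relpath(full_path, root)
            hasher = hashlib.sha256()
            with open(full_path, "rb") as handle:
                # Read in 1 MB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    hasher.update(chunk)
            digests[relative] = hasher.hexdigest()
    return digests


def compare_restore(reference_root: str, restored_root: str) -> Dict[str, List[str]]:
    """Report files that are missing, extra, or changed after a restore."""
    reference = tree_digests(reference_root)
    restored = tree_digests(restored_root)
    shared = reference.keys() & restored.keys()
    return {
        "missing": sorted(set(reference) - set(restored)),
        "extra": sorted(set(restored) - set(reference)),
        "changed": sorted(p for p in shared if reference[p] != restored[p]),
    }
```

An empty report means every reference file came back with identical contents. Files that legitimately changed after the backup was made also appear as "changed" if you compare against live data, so the comparison is only meaningful against a copy that matches the backup.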
If your backup procedures involve the use of other computers running Windows NT Server, you must also verify that those backup and restore procedures work as expected. For information about what to back up and how to back it up, see "Backup Strategy" in Chapter 4, "Planning a Reliable Configuration."