Planning a Reliable Configuration
The skill and experience of support personnel are crucial in getting failed systems back online with minimal disruption to your business. Support personnel need to be trained to troubleshoot problems and to implement recovery procedures when problems occur.
In preparing a recovery plan, start by imagining some typical scenarios. Your plan needs to answer the following questions:
Efficient recovery from system failures requires practice. Schedule drills several times a year that simulate computer crashes and disk failures.
Computers that have recently been taken out of service or are being prepared for production service can be used for training, or you can configure computers specifically for testing and training. Use training sessions and drills to update and document recovery procedures.
Testing is an important component of your contingency planning. You can use testing to try to predict failure situations and to practice recovery procedures. Be sure to stress test all functionality.
The following list identifies some of the failures that you need to test:
The following are some useful situations to simulate in your stress tests:
Once you have created a set of Windows 2000 Setup floppy disks, a Windows 2000 startup floppy disk, and an Emergency Repair Disk (ERD), and have backed up the system state data, use the floppy disks, the ERD, safe mode, and the Recovery Console to practice recovering from problems. Regular practice helps you stay diligent about backing up the system state data and user data, and it shows you how long each procedure takes to complete.
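One way to capture how long each practice procedure takes is to time the steps of a drill and keep the results with your recovery documentation. The following is a minimal sketch in Python, not a Windows 2000 tool: the `DrillLog` class, its method names, and the step names are all hypothetical and chosen only for illustration.

```python
import time
from contextlib import contextmanager


class DrillLog:
    """Collect how long each step of a recovery drill takes (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the log can be exercised without waiting.
        self._clock = clock
        self.durations = {}  # step name -> elapsed seconds

    @contextmanager
    def step(self, name):
        """Time everything that happens inside the `with` block."""
        started = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the step raised an error.
            self.durations[name] = self._clock() - started

    def total(self):
        """Return the combined duration of all recorded steps, in seconds."""
        return sum(self.durations.values())

    def slowest(self):
        """Return the name of the step that took the longest."""
        return max(self.durations, key=self.durations.get)
```

An operator could wrap each manual stage of a drill (for example, restarting from the Setup floppy disks, then restoring the system state data) in `with log.step("..."):` and compare `log.total()` across drills to see whether recovery times are improving.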
Your testing needs to help you determine the best recovery procedure for a particular situation. Determine when to use the set of Windows 2000 Setup floppy disks, the Windows 2000 startup floppy disk, safe mode, and the Recovery Console to restart your computer and when to use the ERD and Backup to replace files.
Your test computer needs to allow you to conduct the following tests:
Be sure to test recovery procedures before bringing a new computer or server into production. Every operator needs to have both primary and refresher training in recovering from the most common causes of unexpected downtime. Testing needs to include:
If a network adapter or other network component fails on the domain controller, the server operator needs to be familiar with the procedure for promoting a member server to be a domain controller, and demoting the failed server. Someone who is familiar with the procedure for reinstalling and reconfiguring the network adapter also needs to be available.
If a data volume fails, the operator must be able to restore the data from backup quickly and efficiently. The restore procedure needs to be tested frequently, both to ensure the skill of the operator and to test the quality of the backup tapes. The only reliable way to test the quality of backup tapes is to do a full restore and confirm that the restored data is complete, up to date, and consistent.
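After a test restore, one way to check the result is to compare checksums of the restored files against a manifest recorded when the backup was made. The following Python sketch assumes that you restore to a scratch directory and that you saved such a manifest at backup time. The function names `build_manifest` and `compare_manifests` are hypothetical, and this check is not a feature of the Windows 2000 Backup tool.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def compare_manifests(expected, restored):
    """Report files that are missing, unexpected, or changed after a restore."""
    missing = sorted(set(expected) - set(restored))
    unexpected = sorted(set(restored) - set(expected))
    changed = sorted(
        name
        for name in set(expected) & set(restored)
        if expected[name] != restored[name]
    )
    return {"missing": missing, "unexpected": unexpected, "changed": changed}
```

An empty result in all three lists suggests that the tape was readable and the restore was complete. Any entry under `missing` or `changed` is a reason to investigate the tape or the backup job before you need it in a real failure.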
If your backup procedures involve the use of other computers running Windows 2000 Server or Windows 2000 Professional, verify that those backup and restore procedures work as expected.
You need to develop step-by-step procedures for recovering from a variety of potential failures. You can use these procedures for:
Update your documentation when you make configuration changes to your computers or network, especially when you install a new operating system or change the utilities that you use to maintain your system.