I recently dealt with an unexpected power failure in one of the data centers I look after. I’m inheriting a lot of existing issues, as we all do in this business, and spent six hours that weekend working through many of them.
A recurring theme in disaster recovery is the lack of useful documentation. When things are fubared, that expensive bit of kit that hauls your bytes around the network turns into a black box of mystery. If you didn’t build it, or you don’t deal with its configuration frequently, reverse-engineering what it ought to be doing takes far too much time and effort. And if it’s suffering from configuration amnesia, you’re screwed.
Most companies have some sort of disaster recovery documentation, mandated by the risk management team or simply by good sense. The documentation is rarely updated, and probably doesn’t match the production state – but it acts as a roadmap and gives your successors an idea of what it should probably look like when it works.
I think that idea of a higher-level roadmap is a good one, and practical for most staff-strapped shops. Something along the lines of “in case of emergency.” Obviously such documentation should be easy to get to when someone is cleaning up a mess.
In my experience, most disaster recovery documentation is extremely detailed by necessity. It is intended for a virtual “replay” of the configuration of a system from the ground up, and is useful if you’re expecting complete strangers to provision systems out of new boxes and make them behave identically. Such detailed documentation is not so practical for the seasoned system administrator who merely wants to remove the wrench from the gears and go home with services restored.
So, my point made: I believe the documentation most useful in an emergency is a collection of network diagrams and short notes that answer a handful of questions. Which device routes which networks (especially helpful for those of us who are not seasoned Cisco folks)? Which host normally provides “dial tone” services such as DHCP and DNS? Where is the firewall, and what services is it expected to pass, and to whom? How is the SAN configured? Which devices get powered on, and in what order? And who do you call for access?
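For what it’s worth, a roadmap like this doesn’t have to live only in a diagram. Here is a minimal sketch in Python of what a machine-readable version might look like, with the power-on order worked out from a “needs” list per device. Every device name, role, and dependency below is made up for illustration; none of it describes a real environment, and a text file or wiki page would do the job just as well.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical roadmap. Each entry says what a device is responsible for
# and which other devices must already be up before it is powered on.
ROADMAP = {
    "core-switch": {"role": "routes the server and office networks", "needs": []},
    "firewall": {"role": "passes the expected services to the DMZ", "needs": ["core-switch"]},
    "san": {"role": "shared storage for the hosts", "needs": ["core-switch"]},
    "infra-host": {"role": "dial tone: DHCP and DNS", "needs": ["core-switch", "san"]},
    "app-hosts": {"role": "production services", "needs": ["infra-host", "san", "firewall"]},
}


def power_on_order(roadmap):
    """Return device names ordered so that each device comes after everything it needs."""
    # TopologicalSorter expects a mapping of node -> the nodes that must precede it.
    graph = {name: set(entry["needs"]) for name, entry in roadmap.items()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for step, name in enumerate(power_on_order(ROADMAP), start=1):
        print(f"{step}. {name}: {ROADMAP[name]['role']}")
```

Devices that don’t depend on each other (the firewall and the SAN in this made-up example) can come out in either order, which is fine: the point is only that nothing gets powered on before the things it relies on.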
A roadmap approach allows the seasoned administrator to quickly theorize and investigate issues. It need not be too detailed nor constantly updated, as long as it gives the administrator enough to figure out how things are expected to fit together. Then the sysadmin can stop reverse-engineering the infrastructure and start taking positive steps to recover it.