Level 3 and 4 disasters: Data loss and critical system failures

By Mike Talon, TechRepublic
29 November 2005 01:29 PM
Mike Talon discusses the primary considerations in dealing with data loss and system failures, and the scope of these types of events within his disaster recovery classification system.

In previous columns, I laid out a classification system for the most common types of disaster recovery (DR) situations, and last time I focused primarily on what to do when there is proof of a network intruder. Now let's deal with what happens in more traditional DR situations:

  • Level 3 - You lose minor amounts of data or a non-critical system fails
  • Level 4 - You lose a large amount of non-critical data or a critical system fails
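The two definitions above can be written down as a small triage function. This is only a sketch: the `Incident` fields and the `classify` name are hypothetical, and it assumes that a large loss of critical data falls outside Levels 3 and 4, since neither definition covers it.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # All field names are illustrative, not part of the column's system.
    data_loss: str                 # "none", "minor", or "large"
    data_is_critical: bool
    critical_system_down: bool
    noncritical_system_down: bool


def classify(incident):
    """Return 3 or 4 per the definitions above, or None if neither applies."""
    # Assumption: large loss of critical data is beyond Level 4.
    if incident.data_loss == "large" and incident.data_is_critical:
        return None
    # Level 4: a critical system fails, or a lot of non-critical data is lost.
    if incident.critical_system_down or incident.data_loss == "large":
        return 4
    # Level 3: minor data loss, or a non-critical system fails.
    if incident.data_loss == "minor" or incident.noncritical_system_down:
        return 3
    return None
```

Checking the Level 4 conditions before the Level 3 ones means an incident that matches both is treated as the more urgent case.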

Level 3 disasters involve minor data loss, perhaps due to an incomplete restore from backup tape, or the loss of non-critical systems. When this type of disaster occurs, speed is usually less of an issue. End users can continue to do their jobs without the affected data or systems, but your staff still has to get them back up and running and find out what was lost. First, figure out what went wrong and make sure the damage is contained.

This may mean verifying the backup systems that protect your other data systems, running test restorations of controlled, previously backed-up data, and determining what caused the system failures. Your goal here is to make sure that you will not lose any further data or suffer the long-term loss of a critical system.
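One way to run a test restoration of controlled data is to compare what came back against checksums recorded when the data was known to be good. The sketch below assumes you keep such a manifest (relative path mapped to a SHA-256 digest); the manifest format and the `verify_test_restore` name are illustrative, not something any particular backup product provides.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large restores do not exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_test_restore(manifest, restore_dir):
    """Compare restored files against known-good checksums.

    manifest: dict of relative path -> expected SHA-256 hex digest
              (a hypothetical format, recorded at backup time).
    Returns the files that are missing and the files whose contents differ.
    """
    restore_dir = Path(restore_dir)
    missing, corrupt = [], []
    for relative_path, expected in sorted(manifest.items()):
        restored = restore_dir / relative_path
        if not restored.is_file():
            missing.append(relative_path)
        elif sha256_of(restored) != expected:
            corrupt.append(relative_path)
    return {"missing": missing, "corrupt": corrupt}
```

An empty result in both lists suggests the backup set restores cleanly; anything else tells you which backup jobs to investigate before you rely on them.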

Once you have contained the problem, you can begin to address it. This may mean rebuilding the impacted systems as quickly as possible and restoring all known-good data, running anti-virus and/or other security measures to clean the systems and data, and performing other measures to bring your systems back.

Level 4 disasters are more time sensitive. These are cases in which large-scale data loss is discovered, or one or more critical systems are taken offline. You don't have time for a slow, exhaustive investigation, but you must still proceed with extreme care. Failure to do so could result in a recurrence of whatever caused the disaster in the first place, which only leads to more downtime.

You will be forced to immediately restore any and all data that you can confirm is not corrupt, and, if you have some form of high-availability solution, you must allow your critical data systems to fail over and resume operation. Initially, you will be acting fast to restore as much of your data and services as you can, so that end users can resume working with those systems while you find out what went wrong. In Level 4 disasters, you do not carry out a complete investigation until after service has been restored.

That being said, you must be as careful as possible while restoring services. Moving too fast could not only lead to a recurrence of the disaster because your staff missed some critical fault, but could actually compound the problem. If you rush, misconfigurations or accidents could cause even more damage. Move quickly, but stay in control of the situation at all times, no matter how loudly the executives are screaming to get everything back up immediately.

If you have failover systems, perform a quick check to ensure that you have a stable platform at your DR site, and then restore operations. If the platform isn't stable, make the changes necessary to stabilize it and begin the data-restoration process before you return to service. Either way, this emergency calls for an acute awareness of your systems' health as you move forward.
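The quick check before failover can be reduced to a simple gate: resume at the DR site only if every check passes, otherwise repair first. The following is a minimal sketch; the check names and the `plan_failover` function are hypothetical, and what counts as a "stable platform" will depend on your environment.

```python
def plan_failover(health_checks):
    """Decide the next step from a dict of check name -> passed (bool).

    The check names are whatever your DR runbook defines; none are
    prescribed here. An empty dict is rejected so that "no checks run"
    is never mistaken for "all checks passed".
    """
    if not health_checks:
        raise ValueError("run at least one health check before failing over")
    failed = sorted(name for name, passed in health_checks.items() if not passed)
    if not failed:
        return {"action": "resume_at_dr_site", "fix_first": []}
    # Unstable platform: repair it and restore data before returning to service.
    return {"action": "repair_then_restore", "fix_first": failed}
```

For example, `plan_failover({"storage_mounted": True, "database_up": False})` returns the repair action with `database_up` listed as the item to fix first.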

For both Level 3 and Level 4 disasters, after you deal with the initial emergency, you will have to determine exactly how much data was lost so that end users can begin the job of manual recovery where possible. This may mean re-entering data from hard copy, alerting clients to the loss, and preparing the proper regulatory reports. None of that can happen, however, until you figure out what was lost and what is still recoverable.
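Working out what was lost and what is still recoverable is, at its simplest, a set comparison. The sketch below assumes you can list record identifiers from three hypothetical sources: what should exist, what the restore actually brought back, and what is available on hard copy. The function and key names are illustrative.

```python
def loss_report(expected_ids, recovered_ids, hard_copy_ids):
    """Split expected records into recovered, re-enterable, and unrecoverable."""
    expected = set(expected_ids)
    # Ignore restored records that were never expected in the first place.
    recovered = set(recovered_ids) & expected
    lost = expected - recovered
    # Lost records that exist on paper can be re-entered manually.
    reenterable = lost & set(hard_copy_ids)
    return {
        "recovered": sorted(recovered),
        "re_enter_from_hard_copy": sorted(reenterable),
        "unrecoverable": sorted(lost - reenterable),
    }
```

The `unrecoverable` list is the one that drives client notifications and regulatory reports, while the `re_enter_from_hard_copy` list becomes the work queue for end users.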

Data-loss disasters are never easy to deal with. The urgency generally pressed upon IT staff during such outages can make for more mistakes, allow intruders to get back into the network, and generally open the door to new disasters. Working quickly and methodically while all around you goes haywire may sound like the toughest job in the world, but it will ensure that you get your systems back up and running, and that you are able to restore as much data and service as you can in the end.

TechRepublic is the online community and information resource for all IT professionals, from support staff to executives. We offer in-depth technical articles written for IT professionals by IT professionals. In addition to articles on everything from Windows to e-mail to firewalls, we offer IT industry analysis, downloads, management tips, discussion forums, and e-newsletters.

©2005 TechRepublic, Inc.
