In disaster recovery planning, many organisations include plans to restore data from tape and other point-in-time copies to new hardware.
Likewise, many larger firms' DR plans include preparations to fail over operations to another data center if the primary facility becomes inoperable.
However, very few organisations actively prepare to handle the loss of a single critical system--one that has a recovery time objective (RTO) of less than two hours--when the remainder of the production facility continues to function. DR plans that fail to address limited but critical disasters like this risk letting the organisation down over an outage that is often avoidable.
As with most aspects of DR planning, there's more than one way to handle this type of outage. Most possible solutions fall into one of two categories: local High Availability (HA) or off-site Remote Availability (RA) systems.
By design, HA solutions allow one server or system to stand in for another almost immediately. Failover typically completes within a few minutes of the outage being recognised.
These systems offer much faster recovery times than a restore to new hardware--but at the cost of flexibility. HA almost always means failing a system over within the same physical location, which is necessary to preserve the IP subnet and other settings required for immediate failover.
You can configure some applications for many-to-one failover locally, and a multitude of clustering solutions also support failover within the same physical site. Because one standby can cover several production systems, this helps you stay within your budget while still offering protection against limited-scale disasters.
RA solutions offer the same type of recovery, but these solutions generally refer to systems that allow failover to another physical location. Since this usually also means different networks and subnets, you won't be able to fail over every application within a two-hour RTO using RA systems, but you can protect the majority of technology solutions.
RA provides failover options for both single-system failures and data-center-wide disasters, which limits the amount of money you'll need to spend on protection. However, keep in mind that end users will have to access data over slower WAN links, and they may need to reconfigure client-side applications in the event of a failure, even if the remainder of your production facility is still functioning.
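As a rough illustration of the trade-off between the two categories, the sketch below suggests a protection approach from a system's RTO and from whether the application can tolerate moving to a different subnet. The function name, its inputs, and the decision rules are assumptions made for illustration only; just the two-hour threshold comes from the discussion above.

```python
from enum import Enum


class Protection(Enum):
    TAPE_RESTORE = "restore from tape or another point-in-time copy"
    LOCAL_HA = "local High Availability"
    REMOTE_RA = "off-site Remote Availability"


# Systems with an RTO below this value are treated as critical (from the text).
CRITICAL_RTO_HOURS = 2.0


def suggest_protection(rto_hours: float, tolerates_subnet_change: bool) -> Protection:
    """Hypothetical rule of thumb, not a sizing or purchasing tool."""
    if rto_hours >= CRITICAL_RTO_HOURS:
        # A longer RTO may leave enough time for a conventional restore.
        return Protection.TAPE_RESTORE
    if tolerates_subnet_change:
        # RA also covers a site-wide disaster, at the price of WAN access.
        return Protection.REMOTE_RA
    # Applications tied to their IP subnet need a standby at the same site.
    return Protection.LOCAL_HA


print(suggest_protection(0.5, tolerates_subnet_change=False).value)
```

In practice the choice also depends on budget and on how each application handles client reconnection, which this sketch deliberately ignores.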
Restoring critical systems when a complete failure hasn't occurred is a balancing act. You must recover many of these systems within a short RTO that usually won't allow for a tape restore to new hardware.
But protecting each system both locally and remotely may prove too expensive for your budget. Remember that you can phase in these systems over time, beginning with the most critical data systems and working outward.
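One simple way to plan such a phased rollout is to rank systems by RTO and protect the ones with the shortest RTO first. The sketch below assumes a hypothetical inventory; the system names and RTO values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    rto_hours: float


def phase_in_order(systems):
    """Return systems sorted most critical first (shortest RTO first)."""
    return sorted(systems, key=lambda s: s.rto_hours)


# Hypothetical inventory, not taken from any real environment.
inventory = [
    System("file-server", 8.0),
    System("order-database", 0.5),
    System("mail", 2.0),
    System("payment-gateway", 1.0),
]

for rank, system in enumerate(phase_in_order(inventory), start=1):
    print(f"Phase {rank}: {system.name} (RTO {system.rto_hours} h)")
```

A real plan would likely weigh cost and dependencies between systems as well, but ordering by RTO gives a reasonable starting point.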

Hey Mike,
I don't think it's as simple as high availability and remote availability... but you're spot on with the recovery time objective, which, among other things, is sometimes called MTTR (mean time to repair).
Many companies only really look at perceived availability and don't count the cost of recovering hardware from a DR situation, and that's where your staffing cost and your risk increase. So you could have 100% perceived availability but 5000 hours per year of downtime, and this isn't good!
And I think the META Group said once that 50% of downtime is a result of scheduled downtime going wrong! Surely that's the result of complicated HA systems being put into place.
There are some out-of-band solutions which help administrators quickly and securely access downed systems, decreasing MTTR and taking away some of the frantic 'how do we get to [system]' questions in those times of need.
What's more, with a well-planned OOBI (out-of-band infrastructure, courtesy of Wikipedia) you can effectively (read: cheaply) minimise your MTTR and perhaps even decrease the need for complicated and expensive HA systems.
Less after-hours work is probably also of interest to both bean counters and staff.
I reckon those vendors who sell console, KVM and remote power solutions have some good ideas on this.