HARDWARE: Redundancy
Hardware problems cause as little as 10 percent of downtime, according to figures quoted by Marcus and Stern. Even so, some servers, storage units, firewalls, and other devices are designed to tolerate the failure of individual subsystems such as power supplies.
Tandem is one of the longest-established names in fault-tolerant hardware. Tandem was purchased by Compaq and is now part of Hewlett-Packard. Nicholas Lynch, HP's market development manager for industry standard systems, points to a variety of technologies that make up the company's "adaptive infrastructure" strategy.
Many subsystems are designed for hot swapping in the event of failure. Fans and power supplies are relatively easy to duplicate so that systems can keep running. In the case of RAID storage, a faulty disk can be replaced and the data automatically recreated.
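How a replacement disk gets its data back is easiest to see with single-parity RAID. The article does not say which RAID level is in use, so the following is only a minimal sketch that assumes a RAID 5-style XOR parity scheme; the disk contents and the xor_blocks helper are invented for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, each holding one block of a stripe.
data_disks = [b"ACCT", b"0042", b"PAID"]

# The parity block is the XOR of all the data blocks in the stripe.
parity_disk = xor_blocks(data_disks)

# Suppose the second disk fails. XOR-ing the surviving data blocks with
# the parity block yields the missing block, which is what the array
# writes onto the replacement drive.
survivors = [data_disks[0], data_disks[2], parity_disk]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_disks[1])
```

Because XOR is its own inverse, the same routine both computes the parity and recovers a lost block. A single parity block covers only one failed drive per stripe.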
A combination of hardware and software that HP calls advanced data guarding goes one step further and copes with the failure of two drives without loss of data.
Memory can also be duplicated: in low-end HP servers, one or two memory modules can be reserved to take over if the ECC (error correction code) detects a faulty chip. Midrange models feature hot-pluggable mirrored memory. Mirroring means each byte is written into two separate modules, so no data is lost in the event of a failure, and processing can continue while the failed module is replaced.
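The mirroring idea can be modelled in a few lines. This is a toy sketch rather than a description of HP's hardware: the MirroredMemory class and its method names are hypothetical, and real mirroring is done by the memory controller, not by software.

```python
class MirroredMemory:
    """Toy model: every write goes to two modules; reads use a healthy one."""

    def __init__(self, size):
        self.modules = [bytearray(size), bytearray(size)]
        self.healthy = [True, True]

    def write(self, addr, value):
        # Each byte is written into every module that is still working.
        for index, module in enumerate(self.modules):
            if self.healthy[index]:
                module[addr] = value

    def read(self, addr):
        # Serve the read from the first healthy module.
        for index, module in enumerate(self.modules):
            if self.healthy[index]:
                return module[addr]
        raise RuntimeError("both modules have failed")

    def fail(self, index):
        self.healthy[index] = False

    def replace(self, index):
        # Hot-plug a new module and copy the surviving mirror onto it.
        partner = 1 - index
        if not self.healthy[partner]:
            raise RuntimeError("no healthy mirror to copy from")
        self.modules[index] = bytearray(self.modules[partner])
        self.healthy[index] = True


memory = MirroredMemory(size=4)
memory.write(0, 0x5A)
memory.fail(0)        # module 0 dies; processing continues on module 1
memory.write(1, 7)    # writes still succeed while degraded
memory.replace(0)     # the new module is resynchronised from the mirror
print(memory.read(0), memory.read(1))
```

The price of this simplicity is capacity: half of the installed memory holds a second copy of the other half.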
Forthcoming high-end servers will have what Lynch described as hot-pluggable RAID memory: an extra DIMM in each bank provides redundancy, so operation continues after a failure without the entire memory having to be duplicated.
Among its features, HP's Insight Manager software provides predictive failure monitoring of processors, memory, and hard disks in HP servers. If the software warns of a critical problem, HP's pre-failure warranty means the company will provide a replacement part for hot swapping before the failure actually happens. HP is "the only vendor that offers [pre-failure warranty] on all three components," claims Lynch.
It's all very well for hardware manufacturers to provide redundant power supplies, but unless you plug them into separate circuits, the only thing you are protected against is a PSU failure. Using separate mains circuits helps keep the device running if a fuse blows or a circuit breaker trips, or if someone isolates the wrong circuit at the switchboard.
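Some rough arithmetic shows how much the separate circuits matter. The availability figures below are hypothetical, picked only to make the comparison visible, and the sketch assumes that PSU and circuit failures are independent of one another, which will not always hold in practice.

```python
# Hypothetical availability figures, chosen only for illustration.
PSU_AVAILABILITY = 0.999
CIRCUIT_AVAILABILITY = 0.995

def both_fail(availability_a, availability_b):
    """Probability that two independent components are down together."""
    return (1 - availability_a) * (1 - availability_b)

def shared_circuit(psu, circuit):
    # The device is up only when the single circuit is live AND
    # at least one of the two power supplies is working.
    return circuit * (1 - both_fail(psu, psu))

def separate_circuits(psu, circuit):
    # Each PSU and its own circuit form an independent power path;
    # the device is up when at least one path is working.
    path = psu * circuit
    return 1 - both_fail(path, path)

shared = shared_circuit(PSU_AVAILABILITY, CIRCUIT_AVAILABILITY)
separate = separate_circuits(PSU_AVAILABILITY, CIRCUIT_AVAILABILITY)

print(f"shared circuit:    {shared:.6f}")
print(f"separate circuits: {separate:.6f}")
```

On a shared circuit, the result can never exceed the availability of that one circuit, however reliable the power supplies are.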
In any case, the mains supply to the building will probably fail at some stage. The risk can be reduced by arranging power feeds from two separate substations, but if you need to keep operating in a blackout, you'll need an uninterruptible power supply and--depending on the load and anticipated outage duration--perhaps a generator set. But first, make sure the basics of reliable power are already in place.
"The starting point and necessary steps are the same today as they were 15 years ago. After determining what level of uptime is required to meet the needs of the business, a full site audit is completed," says Russell Perry, national channel manager at Emerson Network Power (formerly Liebert Corporation).
"Up to 80 percent of remedial actions and costs flowing from such an audit will not be associated with either UPS or diesel generators," he asserts. Basic issues such as grounding and bonding, and the condition of wiring, must be addressed first. "Unless these fundamental issues are addressed, the efficiency of any UPS or genset system will be severely compromised," Perry explains.
UPS models designed for enterprise applications--such as Emerson's Nfinity--sport features including hot-swappable redundant modules, smooth transfers between power sources, and multiple communications paths for monitoring and control.
If you do go to the extent of installing a generator, make sure you've got plenty of fuel on hand. Don't laugh--while researching this article we heard of an organisation (no names, no pack drill) that installed a fancy generator, but it wasn't until the power went off that staff realised they hadn't filled the tank.