This story was printed from ZDNet Australia.
High availability: Keeping it up

By Stephen Withers
July 31, 2002
URL: http://www.zdnet.com.au/news/business/soa/High-availability-Keeping-it-up/0,139023166,120267015,00.htm




High availability means much more than five nines or 24 x 7 operation. It's about getting your hardware, networks, software, policies, and people all working together smoothly.

What does high availability mean to you? How do you calculate how much it is worth spending to achieve it? In this feature, we look at strategies for hardware, software, and network configuration, as well as policy and people considerations, for keeping your systems running. The key concept behind high availability is that a system should be available when it is needed.

Evan Marcus, data availability maven at Veritas, draws an analogy with the rocket motor on the Apollo lunar excursion module. The rocket was only needed for about five minutes during each mission, but during those five minutes it had to be available.

But what does "available" mean? Exact definitions will vary between organisations, but in most contexts the only realistic measure is that people can use their applications and receive results with an acceptable delay. If users can't get their work done on time, the system is down.

As Niall Gallagher, VP, Intelligent Internet, Nortel Networks Asia, puts it, "A technical equation for calculating network availability exists, but availability should really be viewed from the user's perspective. The user does not know about the individual components in a network, but instead is limited to his/her experience of its performance. In this context, the maximum availability of the end-to-end network is paramount."

We have all heard about "five nines", but what does that actually mean? There are 525,600 minutes in a year, so 99.999 percent availability equates to a little over five minutes downtime per year. So for five nines, if the system comes down just once a year--planned or unplanned--you need to get it back up in less than six minutes. In their book Blueprints for High Availability, Evan Marcus and Hal Stern wrote, "Downtimes of less than 10 minutes per year (about 99.998 percent) are probably achievable, but it would be difficult to get much less than that."
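The arithmetic behind these figures is easy to check. The following is a minimal Python sketch that converts an availability percentage into a yearly downtime budget, assuming a 365-day year as in the text. The function name `downtime_minutes_per_year` is ours, chosen for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, as quoted in the text


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime allowed per year for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for pct in (99.9, 99.99, 99.998, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct} percent -> {minutes:.1f} minutes per year")
```

Five nines works out at roughly 5.3 minutes a year, and the 99.998 percent figure quoted by Marcus and Stern at roughly 10.5 minutes.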

In reality, it is only in rare cases that systems need to be available 24 hours a day. If the operational window for the system is 18 hours per day rather than 24, that gives you a handy window for planned downtime. Down is down, whether it is deliberate or unexpected, but it only counts against you when users need access to the systems.

For most organisations, one day isn't exactly like another. If you were an e-tailer, would you rather your system was down for a couple of hours on an evening in mid-December, or in the small hours of a February morning? If you're running an inbound call centre, would you rather have an outage on the day your biggest client starts a major advertising campaign, or the day before?

A system therefore can be described as highly available if it meets or exceeds the availability requirements. "You buy a computer to do something [and] you expect value back," says Marcus.

"A technical definition does not directly translate into the business impact of high availability. Mission critical networks, upon which vital revenue streams depend, must be highly robust and able to sustain an availability that meets the highest standards," says Gallagher.

It's also important to remember that downtime isn't over as soon as the failed subsystem is restored. If your desktop PC suffers a disk crash, an in-house technician might be able to replace it within 10 or 15 minutes if a spare is available and you have sufficient clout to demand that level of service. It's going to take a lot longer to reinstall the operating system, applications, and your data.

When you're talking about high availability, time to repair can be a significant part of the equation. If you need to reload a multi-terabyte database from removable media, that time is measured in hours, not minutes. And even when it is reloaded, you'll need more time to apply the changes that have accumulated in the journal files since that backup was made. Clearly, high availability calls for different approaches.

Gaining attention


Why is high availability getting so much attention?

The arrival of e-business has brought a 24 x 7 x 365 mentality to many organisations, which were previously used to eight-hour working days that left 16 hours for backup, overnight processing, and system maintenance.

Lengthened business days, the growth of B2B and B2C online transactions, and business process reengineering have led to a general expectation that IT systems should be as reliable as the electricity supply and the phone service.

Costs vs benefits

Expenditure on high-availability systems can be seen in a similar light to insurance: you wouldn't pay AU$5000 to insure a AU$5000 car against loss or damage, so why pay an extra AU$2 million to protect against an outage that would only cost you AU$500,000?

The cost of an outage is something that must be estimated from a business perspective. In addition to any immediate cost due to transactions that cannot be performed, issues such as loss of reputation and permanently lost customers must also be considered.

Commenting on a systems outage affecting air traffic control in the UK, John Holden, research analyst at Butler Group, says: "Bad news has always sold newspapers, and it will continue to do so. The fact that 100 percent of flights left on time on a particular day is not news. Similarly, the fact that a major organisation's Web site provided its customers with an uninterrupted 24 x 7 service for 99 days out of 100 is not news, but if it is out of action on the hundredth day, this will probably be news and will cause the organisation to lose current and future business."

Realistically, you're not going to get 100 percent uptime. Outages are going to occur. In assessing costs and benefits, you need to consider how often the system will fail and how long it will take to restore normal operation. It could be that a one-minute failure that occurs about once a week is acceptable, but a 45-minute outage once a year will not be tolerated. That's one reason why expressing availability as a percentage can be misleading--in this case, 99.990 percent uptime is better than 99.991 percent.
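To see why the percentages point the wrong way, it helps to work the two scenarios through. The sketch below assumes 52 one-minute failures a year for the weekly case and a 365-day year; the function and variable names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def uptime_percent(outages_per_year: int, minutes_per_outage: float) -> float:
    """Uptime as a percentage of the year, given outage count and length."""
    downtime = outages_per_year * minutes_per_outage
    return 100 * (1 - downtime / MINUTES_PER_YEAR)


weekly_blips = uptime_percent(outages_per_year=52, minutes_per_outage=1)
annual_outage = uptime_percent(outages_per_year=1, minutes_per_outage=45)

print(f"One minute a week:      {weekly_blips:.3f} percent")
print(f"45 minutes once a year: {annual_outage:.3f} percent")
```

The weekly one-minute failures add up to 52 minutes of downtime against 45 for the single long outage, so the scenario users can live with scores the lower percentage.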

David Solsky, enterprise storage director at SecureData, talks about the investment/availability curve. Each step provides increasing availability and each builds on the previous steps, but each step increases the costs.

Eric Keser, principal of Ernst & Young's technology and security risk services, points out that a business's original estimate of the longest acceptable outage should not be considered set in stone. High-availability expenditure shows diminishing returns--each marginal improvement costs more than the last. By combining the technology cost of providing particular recovery times with the business costs of suffering an outage of that duration, you may find that the lowest total cost is achieved by extending the maximum acceptable outage.
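Keser's point can be illustrated with a few lines of Python. Every figure below is hypothetical and chosen only to show the shape of the calculation: a yearly technology cost for guaranteeing each recovery time, an assumed business cost per hour of outage, and an assumed number of outages per year.

```python
# Hypothetical figures for illustration only (AU$); none come from the article.
# Maximum outage in hours -> yearly technology cost of guaranteeing it.
technology_cost = {1: 2_000_000, 4: 900_000, 8: 450_000, 24: 200_000}

# Assumed business cost per hour of outage and expected outages per year.
COST_PER_OUTAGE_HOUR = 50_000
OUTAGES_PER_YEAR = 2


def total_cost(max_outage_hours: int) -> int:
    """Technology cost plus worst-case business cost for one year."""
    business_cost = COST_PER_OUTAGE_HOUR * max_outage_hours * OUTAGES_PER_YEAR
    return technology_cost[max_outage_hours] + business_cost


cheapest = min(technology_cost, key=total_cost)
for hours in technology_cost:
    print(f"{hours:>2}h maximum outage: AU${total_cost(hours):,}")
print(f"Lowest total cost at a {cheapest}-hour maximum outage")
```

With these made-up numbers, a business that started out demanding one-hour recovery would spend less overall by accepting an eight-hour maximum, while relaxing the target further to 24 hours pushes the total back up.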

HARDWARE: Redundancy


Although hardware problems cause as little as 10 percent of downtime according to figures quoted by Marcus and Stern, some servers, storage units, firewalls, and other devices are designed to tolerate the failure of individual subsystems such as power supplies.

Tandem is one of the longest established names in fault tolerant hardware. Tandem was purchased by Compaq and is now part of Hewlett-Packard. Nicholas Lynch, HP's market development manager for industry standard systems, points to a variety of technologies that make up the company's "adaptive infrastructure" strategy.

Many subsystems are designed for hot swapping in the event of failure. Fans and power supplies are relatively easy to duplicate so that systems can keep running. In the case of RAID storage, a faulty disk can be replaced and the data automatically recreated.

A combination of hardware and software that HP calls advanced data guarding goes one step further and copes with the failure of two drives without loss of data.

Memory can also be duplicated: in low-end HP servers, one or two memory modules can be reserved to take over if the ECC (error correction code) detects a faulty chip. Midrange models feature hot-pluggable mirrored memory. Mirroring means each byte is written into two separate modules, so no data is lost in the event of a failure, and processing can continue while the failed module is replaced.

Forthcoming high-end servers will have what Lynch described as hot-pluggable RAID memory--an extra DIMM in each bank provides redundancy and continued operation when a failure occurs without having to duplicate the entire memory.

Among its features, HP's Insight Manager software provides predictive failure monitoring of processors, memory, and hard disks in HP servers. If warning is given of a critical problem, the company's pre-failure warranty means the company will provide a replacement part for hot-swapping before the failure actually happens. HP is "the only vendor that offers [pre-failure warranty] on all three components," claims Lynch.

It's all very well for hardware manufacturers to provide redundant power supplies, but unless you plug them into separate circuits, the only thing you are protected against is a PSU failure. Using separate mains circuits helps keep the device running if a fuse blows or a circuit breaker trips, or if someone isolates the wrong circuit at the switchboard.

In any case, the mains supply to the building will probably fail at some stage. The risk can be reduced by arranging power feeds from two separate substations, but if you need to keep operating in a blackout, you'll need an uninterruptible power supply and--depending on the load and anticipated outage duration--perhaps a generator set. But first, make sure the basics of reliable power are already in place.

"The starting point and necessary steps are the same today as they were 15 years ago. After determining what level of uptime is required to meet the needs of the business, a full site audit is completed," says Russell Perry, national channel manager at Emerson Network Power (formerly Liebert Corporation).

"Up to 80 percent of remedial actions and costs flowing from such an audit will not be associated with either UPS or diesel generators," he asserts. Basic issues such as grounding and bonding, and the condition of wiring must be addressed first. "Unless these fundamental issues are addressed, the efficiency of any UPS or genset system will be severely compromised," Perry explains.

UPS models designed for enterprise applications--such as Emerson's Nfinity--sport features including hot-swappable redundant modules, smooth transfers between power sources, and multiple communications paths for monitoring and control.

If you do go to the extent of installing a generator, make sure you've got plenty of fuel on hand. Don't laugh--while researching this article we heard of an organisation (no names, no pack drill) that installed a fancy generator, but it wasn't until the power went off that staff realised they hadn't filled the tank.

HARDWARE: Server farms


The idea of using multiple low-cost servers with a load-sharing front end has caught on for Web farms and application servers, but can it be extended to other areas?

Kevin McIsaac, program director of server infrastructure strategies at META Group, believes it can.

This approach is relatively easy to apply to Web servers as they are essentially stateless--a browser requests some data, the server provides it, and the task is over. As soon as transactions are involved, states are maintained in the back-end database.

Typically, the database runs on high-end symmetrical multiprocessing (SMP) servers that are clustered in pairs. This is expensive, and the second server is generally a hot standby.

But the arrival of Oracle Real Application Clusters (RAC) means "in principle . . . we'll be able to have multiple nodes for the back-end database server running in a cluster with a shared disk, and it will look like a single machine," says McIsaac.

This arrangement cuts the total bill in two ways. Firstly, cheaper hardware can be used--a pair of eight-way SMP servers is cheaper than one 16-way server.

Secondly, less backup hardware is needed: three eight-way servers could replace two 16-way units for example, and this will give better performance as the spare can be active at all times. Clusters of three to five servers with shared storage will strike "a good balance between the additional availability and complexity," says McIsaac. This type of clustering can reduce costs by 25-40 percent and "I think [that] will get some real interest from people".
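The comparison between the two configurations can be sketched in Python. This is a rough model that assumes capacity scales linearly with processor count, which real clusters will not quite achieve; the class and field names are ours.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    servers: int  # machines purchased
    ways: int     # processors per machine
    active: int   # machines doing useful work in normal operation

    def total_processors(self) -> int:
        return self.servers * self.ways

    def normal_capacity(self) -> int:
        return self.active * self.ways

    def capacity_after_one_failure(self) -> int:
        # One machine is lost; every survivor, standby included, takes the load.
        return (self.servers - 1) * self.ways


standby_pair = Cluster("two 16-way, one hot standby", servers=2, ways=16, active=1)
active_trio = Cluster("three eight-way, all active", servers=3, ways=8, active=3)

for c in (standby_pair, active_trio):
    print(f"{c.name}: buy {c.total_processors()} processors, "
          f"run on {c.normal_capacity()}, "
          f"keep {c.capacity_after_one_failure()} after a failure")
```

Under that assumption the three-server cluster needs 24 processors rather than 32, uses all 24 in normal running, and still has 16 available if one server fails.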

McIsaac says this analysis assumes RAC is ready for "prime time"--he thinks it will work in real-life implementations, but is waiting for proof. The concept has been around for years in the mainframe and minicomputer worlds, but Oracle is now working with Red Hat and Dell. He suggests this relatively cheap approach could have a major impact in 12-18 months.

Where server farms are used, it can still be important to get a failed unit back in commission quickly to maintain good overall performance and as a precaution against an outage caused by another unit failing. Blade servers are becoming popular as they offer high rack densities. When an individual blade fails in an HP server, the "rip and replace" feature means its replacement is automatically reconfigured, says Lynch.

HARDWARE: Consolidation


The idea of server farms seems to work against recent trends to server consolidation (replacing multiple distributed servers with a smaller number of more powerful models to gain economies of scale and reduce management costs).

According to McIsaac, this works well when you consolidate multiple homogeneous servers running a single application such as Exchange, Notes, or file and print services.

Twenty old machines can be replaced by perhaps five new ones due to the increased power of more recent hardware. It's harder to gain the benefits if you try to consolidate multiple applications or multiple databases onto one server.

McIsaac says he recently saw a quotation offering the customer a choice between two large servers or 12 smaller ones, with the former costing AU$5 million more. Given the total cost of employing highly skilled staff and a generous five-year useful life for the hardware, he suggests the customer would need to cut five staff to compensate for the increased hardware cost. "I don't see five people-worth of savings--you might save half a person," he says.

HARDWARE: Storage

Consolidation is paying off in the storage market. "There is more external storage being sold than internal," says Abie Gelbart, product manager at EMC. "People are realising storage needs to be separated from the server."

The main advantage of this arrangement is that it simplifies storage management. Instead of making the various system administrators worry about storage, responsibility can be transferred to one team that can look after all the storage resources with a single piece of management software.

But if you're going to put all your eggs in one basket, "you need to make sure that basket is very strong," says Gelbart. External storage units--such as those from EMC--can be designed so any single component can fail and be replaced while all data remains available.

For example, disk drives are usually configured as RAID pairs. If one drive fails, data is automatically copied from the surviving member of the pair to a spare drive.

In practice, it may not be necessary to read the data from disk at all, as such units are equipped with large memory caches. In any case, the copying is done as a background process and need not have any impact on performance.

High availability also requires redundancy in the SAN so there are at least two paths from the storage unit to each of the servers it supports.

"We try to focus on 100 percent 24 x 7 availability," says Gelbart. Scheduled downtime accounts for 85-90 percent of outages, he says, and while the storage units can be repaired or upgraded on the fly, events such as building work that requires the power to be disconnected or the air conditioning shut down can force downtime. "There are devices out there that have been up continuously for over a year," he says.

While mirroring can ensure the availability of files, backups are still important to protect against software failures that write incorrect data into the files, says Gelbart. Frequent backups will reduce the recovery time if this should occur.

Disk-to-disk backups are becoming more popular due to a speed advantage over tape, he adds. Remote copies of a database can be split from the main system according to a backup schedule, and then used to restore the data when necessary.

Even a low-end server benefits from using external hard drives. If the system fails, a relatively unskilled person can swap in a spare machine and reconnect the drives.

This can be especially useful for branch offices that are too small to justify on-site support staff.

Separating processing from storage can be advantageous even when the application seems bound up with storage. For example, mail servers such as Exchange and Notes can be called upon to handle huge amounts of data, and asking users to handle archiving for themselves is inefficient in terms of storage costs and person-hours.

StorageTek's Email Xcelerator suite moves messages and attachments into a separate database, replacing the originals with pointers to the new copies. The Exchange or Notes database is therefore much smaller, so backup and restore times are improved. According to Michael Palermo, director of StorageTek's ASM business group, restoring the pointers to a 10TB database of messages and attachments can be done in a matter of minutes.

The external database is managed by StorageTek's Application Storage Manager (ASM), which can spread data across multiple storage units for reliability and across a storage hierarchy (eg, all messages might stay on disk for 45 days, then on high-performance tape for a year before being transferred to archive tape for seven years).
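An age-based hierarchy like the one in that example can be expressed in a few lines. The Python sketch below is not ASM's configuration format; it simply encodes the example thresholds, treating the periods as consecutive, and the tier names and the final "expired" state are our assumptions.

```python
from datetime import date, timedelta

# Thresholds from the example policy in the text; tier names are illustrative.
DISK_DAYS = 45
FAST_TAPE_DAYS = DISK_DAYS + 365             # a further year on fast tape
ARCHIVE_DAYS = FAST_TAPE_DAYS + 7 * 365      # a further seven years in archive


def storage_tier(received: date, today: date) -> str:
    """Pick a storage tier from the age of a message."""
    age_days = (today - received).days
    if age_days <= DISK_DAYS:
        return "disk"
    if age_days <= FAST_TAPE_DAYS:
        return "high-performance tape"
    if age_days <= ARCHIVE_DAYS:
        return "archive tape"
    return "expired"  # assumption: the text does not say what happens next


today = date(2002, 7, 31)
for age in (10, 100, 1000, 4000):
    received = today - timedelta(days=age)
    print(f"{age:>4} days old -> {storage_tier(received, today)}")
```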

So "what is it that you need to back up and restore?" asked Palermo. Even if the original Email Xcelerator database has been destroyed, the restored Exchange or Notes database will reconnect to the additional copies created by ASM.

Network attached storage is sometimes used as a quick and easy way of providing workgroups or applications with extra storage without having to reconfigure the server.

Where SANs address the needs of large databases and moving blocks of data around, NAS focuses on sharing. One way of ensuring the availability of data stored on NAS is to use NAS servers that do not contain their own disk drives but instead connect to a SAN in order to access enterprise-class storage units.

This simplifies the storage environment and therefore makes it more manageable. Data availability is further improved by clustering these NAS servers with a spare unit to provide automatic failover, Gelbart explains.

Communications infrastructure


High availability is like a jigsaw puzzle--it isn't complete until all the pieces are in place. Even 100 percent uptime for the servers and storage does you no good if the rest of the organisation cannot connect to them.

"By simplifying the network architecture, without losing functionality, the risk of outage can be reduced. Network outage caused by operator errors is [more frequent] in a highly complex network environment, driving the need for risk mitigation in the design phase. Intelligent network elements and management systems will be able to detect human errors and automatically correct error or ask for confirmation of change," says Nortel's Gallagher.

Individual elements in the network may use some of the techniques discussed above to ensure high availability, but there are also some specific issues that must be addressed. Rather than go through each kind of network hardware, we will take firewalls as a representative of the category.

Mike Lee, senior product marketing manager for Check Point, explains there are several ways of implementing high-availability firewalls. Firewall appliances and servers can be built with redundant power supplies and other components to maximise uptime.

Running two or more firewalls in parallel as a cluster goes a step further, as it provides coverage if a software failure affects one of them.

This can be achieved by putting a network switch in front of the firewalls, which may also provide load sharing, but while it is easy to re-route packets to the second firewall, information about existing connections (such as authentication) will be lost. Lee suggests switching without maintaining connections does not really count as high availability.

Check Point's Cluster XL high-availability add-on performs state synchronisation between the clustered firewalls, so connections can fail over without interruption. This sounds straightforward, but when you are dealing with hundreds of thousands of connections, "it gets pretty tricky," says Lee.
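The difference between simple re-routing and failover with state synchronisation can be shown with a toy model. The Python sketch below is purely illustrative and says nothing about how any vendor's product works internally; the class, method, and connection names are invented.

```python
class Firewall:
    """Toy firewall that remembers which connections it has authenticated."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.connections: dict[str, str] = {}  # connection id -> user

    def authenticate(self, conn_id: str, user: str) -> None:
        self.connections[conn_id] = user

    def forward(self, conn_id: str) -> str:
        if conn_id in self.connections:
            return f"{self.name}: forwarded {conn_id}"
        return f"{self.name}: unknown connection {conn_id}, reauthenticate"


def synchronise(source: Firewall, target: Firewall) -> None:
    """Copy connection state so the target can take over mid-session."""
    target.connections.update(source.connections)


primary = Firewall("fw1")
cold_spare = Firewall("fw2")
synced_spare = Firewall("fw3")

primary.authenticate("conn-42", "alice")
synchronise(primary, synced_spare)

# Simulate the primary failing: traffic is re-routed to a spare.
print(cold_spare.forward("conn-42"))
print(synced_spare.forward("conn-42"))
```

The unsynchronised spare receives the packets but has to ask the user to authenticate again, while the synchronised one carries on with the existing connection. Doing the same for hundreds of thousands of live connections is where the difficulty lies.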

Firewall failover to a remote site with synchronisation is not generally practical, he says. "It requires an ultra high-speed communications medium" between the firewalls, and while it has been done by some military users, "that's unreasonable for most enterprise users."

In any case, if the problem is big enough to switch to a backup site, forcing users to reauthenticate themselves is no big deal in most cases, he suggests.

However, Stonesoft has developed a clustering technology that allows firewalls to be kept in synchronisation without replicating all the traffic on the internal and external ports.

This arrangement means that multiple firewalls at the same location can fail over to each other, and if the entire subcluster comes to a halt, it fails over to the other location.

This still requires substantial bandwidth with low latency, but it can be done over 45km of single mode fibre, and one Australian customer uses this arrangement for failover between its primary data centre and the backup site 26km away, says senior network specialist Mathew Butler. Simple primary to secondary failover can be done via a 9600bps serial line: "it wasn't pretty, but it did work," says Butler.

Another advantage is that the nodes in the cluster only have to be running the same operating system. A mix of hardware can be used, such as an old 1GHz Pentium III with a new Athlon XP 2000+. "We've avoided the forklift upgrade process," says Butler.

Stonesoft's clustering technology isn't limited to its own firewalls--it also works with third-party firewalls, plus Web and proxy servers and content scanners such as MIMESweeper. The same GUI is used to manage all these applications.

Lee also points out that changing firewall configurations can lead to outages. "You can make mistakes that bring a firewall down, but it's not super-common," he says. Check Point aims to make its products as easy to use as possible, and helps customers prepare and test appropriate configurations.

Redundant communication links are also important, as you don't want to be taken offline by a single backhoe accident. If your systems are located in a hosting centre, redundant connections to the Internet are probably part of the service, but if you are operating from your own premises you may need multiple connections through different ISPs to ensure high availability.

Stonesoft's StoneGate high-availability firewall and VPN product includes the ability to treat these links as a single virtual connection even if the various servers' public IP addresses have been allocated from different ranges. Outbound traffic is distributed according to which connections are the best performing at the time, Butler says.

Outsourced data centres


Maintaining a physical and operational environment appropriate for high availability isn't easy, and it isn't cheap. Even if you don't want to outsource your IT requirements completely, it may be worth considering an outsourced operations centre.

"We offer 99.9 percent availability at 70 percent of the cost of doing it in-house," says Craig Allen, managing director of hosting provider Virtual Offis. "What's more, service levels are part of the contract and if they're not met, the customer doesn't pay."

Virtual Offis' data centre is a utility-grade facility created for failed pay-TV operator Galaxy. The data centre has six high-speed connections to its backbone network.

Part of the story comes from economies of scale. Virtual Offis can afford high-quality infrastructure because the cost is spread over many customers. For example, its AU$60,000 investment in redundant firewalls represents a few hundred dollars per server. The same applies to routers, antivirus measures, automated backup, a tertiary domain name server, and so on.

In terms of the ability to communicate with the outside world, Virtual Offis claims an impressive figure of 21 seconds planned and unplanned downtime in 18 months.

At least one of the systems in the facility has been running non-stop during that period, and it was only rebooted that long ago because of a software upgrade.

Apart from providing the right environment and infrastructure, the company chooses high-quality hardware for customer systems and ensures sufficient stocks of spare machines and parts are on hand to meet a 60-minute recovery guarantee. "We get very good prices," says Allen.

Virtual Offis uses IBM Director to monitor the operation of customers' systems. Apart from predicting component failure before it occurs, this also allows the detection of increasing server loads so customers can be advised of potential problems.

The company also takes care of the tedious side of server administration such as keeping all the clocks synchronised and applying operating system patches promptly (but without rushing into anything); things that may be overlooked by harried in-house administrators but are essential for reliable and secure operation.

If disaster should strike the centre, Virtual Offis has a standby data centre on Sydney's North Shore. This site is partially populated--some customers have chosen extended service and operate from both sites to guarantee availability, but few think it is worth the extra cost, says Allen.

Agreements with distributors are in place to ensure equipment can be transferred from their warehouses directly to the standby centre when necessary. For worst-case scenarios, mutual support relationships have been forged with other hosting providers to provide each other with rack space in emergencies.

"In essence, we are a virtual operations department," says Allen.

Desktops


Don't overlook the desktop or notebook computers used by your staff to access critical systems. If you have a standard configuration and operating environment, it is relatively easy to plug in a replacement system in the event of an isolated failure. The faulty equipment can then be repaired at leisure.

But what happens in a disaster situation when hundreds or thousands of units need to be replaced simultaneously? Sourcing and installing that many units is hard enough--especially in temporary premises--but what about loading all that software on each of them? One way around this problem is to deploy critical applications via Citrix servers.

Navin Rajapakse, assistant vice president at Lehman Brothers, says that when the company lost its New York office in the September 11 attack on the World Trade Centre, temporary workplaces had to be found for 7000 employees.

Two thousand traders were relocated to a temporary site in New Jersey, but the remaining 5000 staff worked from hotel rooms, home, or other temporary offices and relied on the Internet to access IT resources.

While Compaq and Sun quickly provided the company with replacement servers after September 11 and IBM did the same with 5000 notebooks (the dot-com bust meant there was plenty of hardware inventory around the country), Rajapakse says there wasn't time to load a standardised system image on each computer, so they just installed the Citrix client and ran the applications on the servers.

This approach was successful, and the company plans to retain the facility for ongoing remote access purposes. According to Rajapakse, Lehman Brothers is still investing in disaster preparation but wants to be able to make use of its investment during normal operations, for example by using it to deliver applications to clients. "We're trying to make the best use of our infrastructure," he says.

Software

Software is said to be the major cause of downtime. Marcus and Stern point out that the increasing reliability of hardware coupled with technologies that reduce planned downtime means software-related downtime is set to rise proportionately.

Software is growing more complex, and all other things being equal, that is likely to reduce reliability. The question is whether improvements in software development, debugging, and testing are sufficient to decrease software-related downtime.

High-availability hardware can be an expensive way of protecting against buggy software. It may not even be effective: if failover occurs when a buggy program encounters a certain value in a particular data field (eg, attempting to divide by zero with no error trapping), the secondary system will be brought to a halt by the same bug.
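A few lines of Python make the point. In this sketch the function `average_order_value` stands in for the buggy program, and `run_with_failover` is a hypothetical harness that retries the same request on each node in turn. The harness catches the error only so the demonstration can continue; in the scenario described, each node would simply crash.

```python
def average_order_value(total: float, orders: int) -> float:
    # Bug: no check for orders == 0, so this raises ZeroDivisionError.
    return total / orders


def run_with_failover(record: dict, nodes: list[str]) -> str:
    """Try the same request on each node in turn, as a failover would."""
    for node in nodes:
        try:
            value = average_order_value(record["total"], record["orders"])
            return f"{node} answered {value:.2f}"
        except ZeroDivisionError:
            # Stand-in for the node crashing; the request moves to the next one.
            print(f"{node} failed on this record, failing over")
    return "all nodes failed"


print(run_with_failover({"total": 120.0, "orders": 4}, ["primary", "secondary"]))
print(run_with_failover({"total": 0.0, "orders": 0}, ["primary", "secondary"]))
```

The healthy record is answered by the primary. The record with zero orders takes down the primary, and the secondary, running identical code on identical data, fails in exactly the same way.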

Modern applications rely on multiple software layers. To achieve high availability from the end-user's perspective, they all need to be running. Even a relatively simple architecture may involve a server operating system, database, application, network routing, PC operating system, web browser, and a Java applet.

While most organisations would regard their ERP or accounting systems as business critical, it is important to examine the entire software portfolio to see what is essential to that particular environment. "Many companies could not operate without e-mail," says Wissam Raffoul, senior program director, service management strategies at META Group.

It's not only e-mail: a consulting firm could be completely paralysed without its Exchange or Notes server. Staff wouldn't know which client they were supposed to be attending, where and when meetings were being held, and so on.

Another important consideration is the nature of dependencies between systems. Once a particular application is defined as critical, all the systems and subsystems it relies on must also be considered critical.

For example, a short-term capacity problem might result in a certain database being deployed on what was notionally a development server. If that data is used by a critical application, the health of the development server has suddenly become business-critical as well.
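Tracing such dependencies by hand is error-prone, but the rule itself is simple to express. The Python sketch below walks a hypothetical dependency map (all system names are invented) and returns everything a critical application relies on, directly or indirectly.

```python
# Hypothetical dependency map: each system lists what it relies on directly.
depends_on = {
    "order entry": ["app server", "customer database"],
    "app server": ["network"],
    "customer database": ["development server", "SAN"],
    "development server": ["network"],
    "intranet wiki": ["web server"],
}


def critical_systems(critical_apps: list[str]) -> set[str]:
    """Return the apps plus everything they depend on, directly or indirectly."""
    critical: set[str] = set()
    to_visit = list(critical_apps)
    while to_visit:
        system = to_visit.pop()
        if system in critical:
            continue  # already handled; also guards against circular dependencies
        critical.add(system)
        to_visit.extend(depends_on.get(system, []))
    return critical


for name in sorted(critical_systems(["order entry"])):
    print(name)
```

Because the customer database has been parked on the development server, that server turns up in the critical list even though nobody labelled it as such. The intranet wiki and its web server do not.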

People


"Forty percent of availability issues are associated with change [management]," according to Raffoul. Keser agrees: "don't worry about the external threats...IT outages are self-inflicted in most cases," he says.

While Keser includes hardware and software failures in the self-inflicted category (as we have seen, there are ways of designing systems so that such failures do not necessarily cause an outage), around 30 percent of Australian respondents to an international Ernst & Young survey said operational error was one of the top causes of outages.

Others see people problems as a less significant cause of downtime. Marcus and Stern quote figures that put humans as the cause of just 15 percent of outages.

Whichever figure you accept, it's clear that people account for a meaningful share of downtime. Whether or not they are the major cause, the old adage that prevention is better than cure applies here, as Solsky pointed out: "Good system administration practices reduce the risk of downtime."

The right software can mitigate some systems management problems. Jeff Hyde, NetIQ's regional director for Australia, New Zealand, and India, says the company was founded on the idea that "disciplines from the mainframe environment are tried and tested" and could be brought to distributed platforms.

NetIQ's AppManager collects data ranging from server hardware status such as fan failures through the operating system layer and on to server software including Exchange, Oracle, and SAP.

The operating system alone generates more events than human operators can handle, he explained. AppManager therefore filters out those that aren't really relevant and responds automatically to those it can handle: for example, by archiving particular files when a disk gets too full, killing non-essential tasks that aren't supposed to be run during peak hours, or rebooting a server before it crashes. It then reports what it has done.
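The filter, respond, and report pattern described here can be sketched in a few lines. This is not AppManager's interface or rule language, which this story does not describe; the event kinds, thresholds, and actions below are all made up for illustration, and the actions only return a description instead of touching a real system. Passing anything that is neither ignored nor handled on to a human operator is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str         # hypothetical server name
    kind: str           # hypothetical event type
    value: float = 0.0  # optional reading, e.g. fraction of disk used


# Hypothetical event kinds that are treated as noise and dropped.
IGNORED_KINDS = {"heartbeat", "user_login"}


def archive_old_files(event):
    return f"archived old files on {event.source}"


def kill_offpeak_task(event):
    return f"killed non-essential task on {event.source}"


def reboot_server(event):
    return f"scheduled reboot of {event.source}"


# Each rule pairs a condition with the automatic response to it.
RULES = [
    (lambda e: e.kind == "disk_usage" and e.value >= 0.90, archive_old_files),
    (lambda e: e.kind == "offpeak_task_running", kill_offpeak_task),
    (lambda e: e.kind == "memory_leak_suspected", reboot_server),
]


def triage(events):
    """Drop noise, respond to what a rule covers, escalate the rest, and report."""
    report = []
    for event in events:
        if event.kind in IGNORED_KINDS:
            continue
        for matches, action in RULES:
            if matches(event):
                report.append(("handled", action(event)))
                break
        else:
            # No rule matched, so leave it for a human operator.
            report.append(("escalated", f"{event.kind} on {event.source}"))
    return report


events = [
    Event("web01", "heartbeat"),
    Event("db01", "disk_usage", 0.95),
    Event("app02", "fan_failure"),
]
for status, detail in triage(events):
    print(status, "-", detail)
```

The point of the design is that the operator sees a short report of what was done and what still needs attention, not the raw event stream.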

Thorough testing before going live is another important tool for ensuring availability--remember that 40 percent of downtime is reportedly due to software problems. If you are buying software, what sort of reputation does the vendor have for quality assurance? Does it use a recognised development methodology?

For example, e-commerce infrastructure services company eFunds is proud of the fact that its software development centres in Chennai, India, have achieved a Level 4 ("Managed") rating on the Software Engineering Institute's five-level Capability Maturity Model.

According to the SEI, CMM 4 means "Detailed measures of the software process and product quality are collected. Both the software process and products are quantitatively understood and controlled."

But every configuration is unique, so internal testing is also important, whether it is performed by your own staff or by outside specialists on your behalf. Financial institutions and other large organisations generally have this under control, but smaller installations may lack the budget to do it properly. Remember that the costs of achieving high availability must be compared with the benefits it is expected to yield.

"Even more important [than retrospective analysis of availability] are proactive steps to ensuring the highest availability, by predicting how the systems will respond to real-world conditions. This is achieved by thorough cycles of performance testing and tuning," says Peter Lilley, field marketing manager, Mercury Interactive.

He gave the example of a system that seemed to be working well despite a misconfigured router. Eventually, traffic reached a level where the router caused a significant problem, but finding the cause took time.

"Understanding this issue before going live would have resulted in a five-minute fix at nearly no cost, as opposed to hours of downtime, customer impact and damage to reputation," he says.

Mercury advocates an ongoing testing and tuning program that continues after the system goes into production so that changes cause minimal disruption.

"Tuning the production system reverses the usual way of managing problems, which is akin to finding the needle in the haystack. With Tuning in Production, we're taking a metal detector to the haystack to find all 40 needles that may be in there, and resolve many issues that haven't yet emerged as problems to affect availability," says Lilley.

Now what?

Veritas' Marcus can give you a long list of advice about making systems highly available, but his number one tip is "KISS"--keep it simple, stupid. By this he means doing away with extraneous hardware, non-critical applications, non-essential users, and non-critical network connections.

Some downtime will always be due to honest mistakes, so he recommends reducing the opportunities for them to happen. Removing complexity leads to fewer mistakes and therefore less downtime.

Even simple tricks like giving systems sensible names can help: if you were asked to restart a particular server, would you be more likely to get the right one if you were looking for Zeus or HA4PV56A?

Although this may seem a trivial example, it illustrates the point that high availability doesn't just happen: it requires thought and careful planning.


