People
"Forty percent of availability issues are associated with change [management]," according to Raffoul. Keser agrees: "Don't worry about the external threats...IT outages are self-inflicted in most cases."
While Keser includes hardware and software failures in the self-inflicted category (as we have seen, there are ways of designing systems so that such failures do not necessarily cause an outage), around 30 percent of Australian respondents to an international Ernst & Young survey said operational error was one of the top causes of outages.
Others see people problems as a less significant cause of downtime. Marcus and Stern quote figures that put humans as the cause of just 15 percent of outages.
Whichever figure you accept, it's clear that people are a significant cause of downtime, even if they are not the largest one. As Solsky pointed out, the old adage that prevention is better than cure applies here: "Good system administration practices reduce the risk of downtime."
The right software can mitigate some systems management problems. Jeff Hyde, NetIQ's regional director for Australia, New Zealand, and India, says the company was founded on the idea that "disciplines from the mainframe environment are tried and tested" and could be brought to distributed platforms.
NetIQ's AppManager collects data ranging from server hardware status such as fan failures through the operating system layer and on to server software including Exchange, Oracle, and SAP.
The operating system alone generates more events than human operators can handle, Hyde explained, so AppManager filters out those that aren't relevant and responds automatically to those it can handle. Examples include archiving particular files when a disk gets too full, killing non-essential tasks that aren't supposed to run during peak hours, or rebooting a server before it crashes. It then reports what it has done.
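The filter, respond, and report cycle described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not NetIQ's implementation or AppManager's API: the `Event` type, the `triage` function, the event kinds, and the handler are all hypothetical names invented for the example, and the handler merely returns a description rather than touching a real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Event:
    """A monitoring event (hypothetical shape, for illustration only)."""
    source: str       # host that raised the event
    kind: str         # e.g. "disk_full", "fan_failure"
    detail: str = ""


# A handler takes an event and returns a description of what it did.
Handler = Callable[[Event], str]


def triage(
    events: Iterable[Event],
    relevant: Set[str],
    handlers: Dict[str, Handler],
) -> Tuple[List[str], List[Event]]:
    """Filter out irrelevant events, auto-respond where a handler exists,
    and return (report of actions taken, events left for a human)."""
    report: List[str] = []
    escalated: List[Event] = []
    for event in events:
        if event.kind not in relevant:
            continue  # noise: dropped without bothering an operator
        handler = handlers.get(event.kind)
        if handler is None:
            escalated.append(event)  # relevant, but needs a person
        else:
            report.append(f"{event.source}: {handler(event)}")
    return report, escalated


def archive_old_files(event: Event) -> str:
    # Placeholder: a real handler would archive files here.
    return "archived old files to free disk space"


events = [
    Event("mail01", "disk_full"),
    Event("mail01", "heartbeat"),     # not relevant, filtered out
    Event("db02", "fan_failure"),     # relevant, no automatic response
]
report, escalated = triage(
    events,
    relevant={"disk_full", "fan_failure"},
    handlers={"disk_full": archive_old_files},
)
```

Here the heartbeat is discarded, the full disk is handled automatically and recorded in `report`, and the fan failure lands in `escalated` for an operator. Keeping the handlers in a plain dictionary means new automatic responses can be added without changing the triage loop.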
Thorough testing before going live is another important tool for ensuring availability--remember 40 percent of downtime is reportedly due to software problems. What sort of reputation does the vendor have for quality assurance? Does it use a recognised development methodology?
For example, e-commerce infrastructure services company eFunds is proud of the fact that its software development centres in Chennai, India, have achieved a Level 4 ("Managed") rating on the Software Engineering Institute's five-level Capability Maturity Model.
According to the SEI, CMM 4 means "Detailed measures of the software process and product quality are collected. Both the software process and products are quantitatively understood and controlled."
But every configuration is unique, so internal testing is also important whether it is performed by your own staff or outside specialists on your behalf. Financial institutions and other large organisations generally have this under control, but smaller installations may lack the budget to do it properly. Remember that the costs of achieving high availability must be compared with the benefits it is expected to yield.
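The cost-versus-benefit comparison can be made concrete with some simple arithmetic. The sketch below is one possible way to frame it, not a method taken from any of the people quoted here; the function names are invented, and the figures in the example call are made-up placeholders that you would replace with your own estimates.

```python
HOURS_PER_YEAR = 24 * 365


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def net_benefit(
    current: float,
    target: float,
    cost_per_hour: float,
    annual_spend: float,
) -> float:
    """Downtime cost avoided by moving from `current` to `target`
    availability, minus the yearly spend needed to get there."""
    hours_avoided = annual_downtime_hours(current) - annual_downtime_hours(target)
    return hours_avoided * cost_per_hour - annual_spend


# Hypothetical inputs: 99% -> 99.9% availability, downtime costing
# 1,000 per hour, and 50,000 per year spent on testing and redundancy.
result = net_benefit(current=0.99, target=0.999,
                     cost_per_hour=1_000, annual_spend=50_000)
print(f"Net annual benefit: {result:,.0f}")
```

A positive result suggests the spending pays for itself on these assumptions; a negative one suggests a smaller installation may be better served by a more modest availability target.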
"Even more important [than retrospective analysis of availability] are proactive steps to ensuring the highest availability, by predicting how the systems will respond to real-world conditions. This is achieved by thorough cycles of performance testing and tuning," says Peter Lilley, field marketing manager, Mercury Interactive.
He gave the example of a system that seemed to be working well despite a misconfigured router. Eventually, traffic reached a level where the router caused a significant problem, but finding the cause took time.
"Understanding this issue before going live would have resulted in a five-minute fix at nearly no cost, as opposed to hours of downtime, customer impact and damage to reputation," he says.
Mercury advocates an ongoing testing and tuning program that continues after the system goes into production so that changes cause minimal disruption.
"Tuning the production system reverses the usual way of managing problems, which is akin to finding the needle in the haystack. With Tuning in Production, we're taking a metal detector to the haystack to find all 40 needles that may be in there, and resolve many issues that haven't yet emerged as problems to affect availability," says Lilley.
Now what?
Veritas' Marcus can give you a whole list of advice about making systems highly available, but his number one tip is "KISS"--keep it simple, stupid. By this he means doing away with extraneous hardware, non-critical applications, non-essential users, and non-critical network connections.
Some downtime will always be due to honest mistakes, so he recommends reducing the opportunities for them to happen. Removing complexity leads to fewer mistakes and therefore less downtime.
Even simple tricks like giving systems sensible names can help: if you were asked to restart a particular server, would you be more likely to get the right one if you were looking for Zeus or HA4PV56A?
Although this may seem a trivial example, it illustrates the point that high availability doesn't just happen: it requires thought and careful planning.