We have all fought the reliability battle.
Our clients complain about how unstable the system is while we struggle to keep servers, routers, and switches humming along at a reasonable level of stability. Meanwhile, the clients bombard the help desk with calls, then turn around and blame the staff for every hiccup in the system. When senior executives finally step in, they take the clients' side, because they feel the same pain in the environment. As the blame starts to fall on various people at random, the question becomes: What is stability, and who measures it? I learnt this lesson the hard way while working on a support procedure improvement project for a client.
My client, a mid-sized (3,000 or so nodes) group with locations in 20 states (in the US) and four countries, called my company in to help with their constant "reliability" problems. They wanted us to assess their environment and give them specific technical and procedural solutions to "key problem areas." I went in as a junior member of a relatively small team.
After two weeks of assessment, we identified a few obvious problems. The server team had installed MS Exchange and MS SQL Server on the same disk array in the satellite offices. The network team demonstrated a bizarre tendency to ignore the foreign offices when scheduling core router outages. Three of the clients were "frequent flyers" when it came to locking themselves out of their security domain; their lockout frequency was two orders of magnitude higher than that of any other user. We recommended splitting the arrays to resolve the disk contention caused by the dual-purpose servers, keeping scheduled downtime out of the European order entry period, and providing additional training for the users who locked themselves out. The client thanked us profusely and then scheduled a six-month follow-up to verify their success.
We came back fully expecting to find the reliability problems resolved. Technically, they were. The client's IT team had, a bit reluctantly, implemented our suggestions. The uptime data from the equipment indicated our changes had the desired effect. Servers no longer crashed at predictable intervals. The links to Europe stayed up during their order entry periods. Account lockouts were down over 80 percent.
Unfortunately, the clients still regularly complained about network stability. The dramatic improvements in technical stability had not translated into a noticeable improvement in client satisfaction with the system. Why?
Measuring reliability: What do we measure?
While the IT team smugly looked on, we set about trying to ferret out the reasons behind the problem. The architect assigned to the project, the man who mentored me through my early years, knew something unusual lurked in the wings. We made a number of phone calls, sometimes posing as potential customers while watching the system to track the data stream.
Two weeks of work later, we dug out the following points of contention:
The executive sponsor thought we completed our work with this basic analysis. My mentor disagreed. He prepared a report outlining how the company was creating continual problems for itself by measuring three disparate things and then trying to compare them:
In order to address these issues, and avoid a repeat call, our team suggested the IT staff and the executives take a more active role in their initial data analysis. Rather than relying on canned reports, we designed four basic survey instruments they could use to query the user community and correlate IT service data with known client issue patterns.
That last survey instrument proved key to the corporate IT team's future success. Once the corporate team and the executives had to correlate outage times with the text of client problem reports, they uncovered a host of potential issues. More importantly, the exercise forced the corporate team to understand their clients' needs rather than force technical solutions down their throats.
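As a rough illustration of that correlation step, the sketch below pairs each client problem report with any outage window it falls inside, then lists the reports that no recorded outage explains. It is only a minimal sketch under stated assumptions: the names Outage, Ticket, match_tickets, and unexplained, the 30-minute grace period, and the sample data are all hypothetical and are not taken from the survey instruments described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Outage:
    """One recorded outage window (hypothetical record layout)."""
    component: str
    start: datetime
    end: datetime


@dataclass(frozen=True)
class Ticket:
    """One client problem report (hypothetical record layout)."""
    opened: datetime
    text: str


def match_tickets(tickets, outages, grace=timedelta(minutes=30)):
    """Pair each ticket with the components whose outage it may describe.

    A ticket matches an outage if it was opened between the outage start
    and `grace` after the outage end. The grace period is an assumption:
    users often report a problem some time after it happens.
    """
    matched = []
    for ticket in tickets:
        hits = [
            o.component
            for o in outages
            if o.start <= ticket.opened <= o.end + grace
        ]
        matched.append((ticket, hits))
    return matched


def unexplained(matched):
    """Return tickets that no recorded outage accounts for."""
    return [ticket for ticket, hits in matched if not hits]


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    outages = [
        Outage("core-router", datetime(2004, 3, 1, 9, 0), datetime(2004, 3, 1, 9, 45)),
    ]
    tickets = [
        Ticket(datetime(2004, 3, 1, 9, 10), "Cannot reach order entry"),
        Ticket(datetime(2004, 3, 2, 14, 0), "Network is unstable again"),
    ]
    for ticket in unexplained(match_tickets(tickets, outages)):
        print(ticket.opened.isoformat(), ticket.text)
```

The unexplained tickets are the interesting ones: they are complaints that the equipment's uptime data cannot account for, so their text is where to look for what the clients actually mean by "unstable."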
©2004 TechRepublic, Inc.









