Horror story: Qld Health datacentre disaster

On 20 May, a brief electricity brown-out struck a Queensland Health datacentre, starting a chain of incidents that resulted in serious outages of over 20 health applications.


The datacentre, located on the campus of Herston hospital, is believed to be one of three datacentres Queensland Health operates. It lost power for only a fraction of a second, when two flooded Energex transformers failed at around 5:00pm on that day, according to a source close to the incident. Uninterruptible power supplies kicked in to keep servers up.

However, the brown-out tripped the chilled water system, cutting chilled water to the hospital campus. Because the chilled water system wasn't monitored, the datacentre support team didn't notice the loss. A datacentre employee came on scene to check that everything was running but, satisfied that nothing was wrong, he left.

Only two of the 10 air-conditioning units within the datacentre were able to fall back on refrigerant gas if chilled water wasn't available, meaning that although the other eight units were operating, they weren't cooling. The temperature in the datacentre began to rise.

Although people were called in to investigate the temperature rise, the chilled water problem wasn't found. Because of a DNS change made the day before the problems began, no alerts were being sent to tell staff of server problems. Four hours after the brown-out, services began to suffer. On-call hospital staff were affected and complained. Soon after, a server shut down.
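A silent alerting path of the kind described above is often guarded with a "dead man's switch": a test message is sent through the whole alert chain at intervals, and a separate watchdog raises the alarm if those messages stop arriving. The following is a minimal sketch of that idea only. The class name, the 15-minute window and the timestamps are assumptions for illustration, not details of Queensland Health's systems.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AlertPathWatchdog:
    """Hypothetical watchdog that flags an alert pipeline that has gone quiet."""

    max_silence: timedelta
    last_heartbeat: Optional[datetime] = None

    def record_heartbeat(self, at: datetime) -> None:
        # Call this each time a test alert is confirmed as delivered end to end.
        self.last_heartbeat = at

    def is_silent(self, now: datetime) -> bool:
        # No heartbeat ever, or none within the window: treat the path as broken.
        if self.last_heartbeat is None:
            return True
        return now - self.last_heartbeat > self.max_silence


# Illustrative timestamps only.
watchdog = AlertPathWatchdog(max_silence=timedelta(minutes=15))
start = datetime(2000, 1, 1, 17, 0)
watchdog.record_heartbeat(start)

print(watchdog.is_silent(start + timedelta(minutes=10)))  # False
print(watchdog.is_silent(start + timedelta(hours=4)))     # True
```

The point of the design is that the watchdog must run independently of the path it checks; otherwise a single change, such as the DNS change in this incident, could silence both.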

Many staff members did not know the whereabouts of the air-conditioning specialist who had been called in, and he didn't answer his phone. It had taken the engineer three hours to arrive on site. Five hours after the systems failed, staff discovered that the chilled water pumps had not been operating, as more servers shut down with temperatures over 50 degrees. The problem was believed to be fixed.

Because the remote access system wasn't working, staff had to wait until they arrived at the datacentre before they could begin shutting down servers. When they arrived, they started to move systems over to an alternate datacentre, which in some cases caused brief user inconvenience. Some systems, however, could not be moved, since their servers had no ability to fail over and Queensland Health's architecture for virtual machines didn't allow them to be moved to a second datacentre.

The hospital's Cerner electronic medical record (patient administration) system was shut down by the hospital staff.

Six hours after the brown-out, the air conditioning was still not working. Although staff believed they had found the problem, more systems including iPharmacy shut down until 75 per cent of applications were down and the datacentre reached 45 degrees.

Eight hours after the brown-out, chilled water was finally brought back up. Nine hours after, the datacentre was back to normal and the services could be restored. By nine o'clock the morning after the brown-out, all services were restored.

Over the course of the problems, 12 applications suffered significant impact, with another 12 suffering minor impact. Three years ago the datacentre was forced to shut down for the same reasons. Afterwards, the team had been told it could not happen again.

When queried on the incident, Queensland Health acting CIO Ray Brown did not respond to a question on what facilities around the state the downed applications provided services to. However, it is believed that Queensland Health's three datacentres provide services around the state to multiple locations.

He denied that there had been more than one incident over the past three years at the datacentre.

According to Brown, since several applications were relocated to the other datacentre, there was "minimal disruption" to services. "The majority of services impacted were available by 2:30am and all Queensland Health systems categorised as critical remained operational during this incident," he said.

"In the face of a severe weather event, the IT staff involved were outstanding in their response to minimise the impact of this incident. The ability of staff to physically attend the site was severely hampered by flooding in the area."

Lessons had been learned, according to Brown. Queensland Health was exploring options to remove reliance on chilled water. It also intended to replace the remote access system by the third quarter of this year. It is undertaking a review of management tools and is examining the crisis management plan.

Queensland Health has lost several chief information officers over the past several years. Long-time CIO Paul Summergreene had his contract terminated by the department in July 2008. Dr Richard Ashby filled his shoes for a short time, before leaving the chair vacant, with Brown currently leading the department's IT function in an acting capacity.

The news also comes as the Queensland Government flagged in the last state budget its intent to splurge hundreds of millions of dollars on health IT systems to support its e-health capability.

Talkback 9 comments

    CITEC, anyone? Anonymous -- 02/07/09

    Anyone fancy moving back to CITEC?

    DIY Data Centres are not the future Anonymous -- 03/07/09

    Their data centre is ill conceived and reeks of DIY. Time to move forward with technology and co-lo it in a proper DC or better yet get the servers hosted so somebody else can worry about it.

    DR planning disaster Anonymous -- 03/07/09

    What's more concerning is that they thought they had the ability to v-motion the virtual machines, but when the time came, they couldn't.

    What happened to the DR planning? Should this have been tested after the platform was rolled out?

    soft skills Anonymous -- 05/07/09

    dont worry, soft skills will fix this,we've got our best looking sales chick on the job

    good read Archie -- 06/07/09

    Nice article thanks Suzanne

    indeed Anonymous -- 06/07/09 (in reply to #320147777)

    indeed

    Very little facts...... Anonymous -- 08/07/09

    There were no application outages as there were active redundant servers at the second data centre that took over when the failure occurred. Another fact is that there is no third data centre. Looks as if you should check your sources.

    A Severe Weather Event? Anonymous -- 30/07/09

    Two failed transformers does not equal a "Severe Weather Event", that is just bullshit.

    There are a million reasons why a transformer might fail, and when you are managing a critical datacentre it is not a question of "procedures for _if_ we lose power", but "procedures for WHEN we lose power".

    If only for gross dishonesty, Ray Brown should be sacked immediately.

    Agreed.. Sack Him Anonymous -- 20/10/09 (in reply to #320167417)

    Sounds like a "Spin Doctor". Anyone in that position should have tested their DR procedure thoroughly, instead of relying on a vendor's sales pitch that it will work.

    I wouldn't let him anywhere near a Microsoft Small Business Server, let alone a Data Centre
