Horror story: Qld Health datacentre disaster

On 20 May, a brief electricity brown-out struck a Queensland Health datacentre, starting a chain of incidents that resulted in serious outages of over 20 health applications. Read our blow by blow account of an event that constitutes every CIO's nightmare scenario.
Written by Suzanne Tindal, Contributor

[Image: CERN Datacentre, CERN, Geneva; photo by Cory Doctorow, CC2.0]

The datacentre, located on the campus of Herston hospital, is believed to be one of three datacentres Queensland Health operates. According to a source close to the incident, it lost power for only a fraction of a second when two flooded Energex transformers failed at around 5:00pm that day. Uninterruptible power supplies kicked in to keep the servers running.

However, the brown-out tripped the chilled water system, cutting off chilled water to the hospital campus. Because the system wasn't monitored, the datacentre support team didn't notice the loss of chilled water. A datacentre employee came on site to check that everything was running but, satisfied that nothing was wrong, he left.

Only two of the 10 air-conditioning units in the datacentre could switch to refrigerant gas when chilled water wasn't available. Although the other units kept running, they weren't cooling, and the temperature in the datacentre began to rise.

Although people were called in to investigate the temperature rise, the chilled water problem wasn't found. Because of a DNS change made the day before the problems began, no alert messages were being sent to notify staff of server problems. Four hours after the brown-out, services began to suffer. On-call hospital staff were affected and complained. Soon after, a server shut down.

The whereabouts of the air-conditioning specialist who had been called in were unknown to many staff members, and he didn't answer his phone. It had taken the engineer three hours to arrive on site. Five hours after the systems failed, as more servers shut down with temperatures over 50 degrees, staff discovered that the chilled water pumps had not been operating. The problem was believed to be fixed.

Because the remote access system wasn't working, staff could not begin shutting down servers until they arrived at the datacentre. Once there, they started moving systems over to an alternate datacentre, which in some cases caused brief inconvenience to users. Some systems, however, could not be moved: their servers had no failover capability, and Queensland Health's virtual machine architecture didn't allow them to be moved to a second datacentre.

The hospital's Cerner electronic medical record (patient administration) system was shut down by the hospital staff.

Six hours after the brown-out, the air-conditioning was still not working. Although staff believed they had found the problem, more systems, including iPharmacy, shut down, until 75 per cent of applications were down and the datacentre temperature had reached 45 degrees.

Eight hours after the brown-out, chilled water was finally restored. Nine hours after, the datacentre was back to normal and services could be brought back up. By 9:00am the morning after the brown-out, all services had been restored.

Over the course of the incident, 12 applications suffered significant impact, with another 12 suffering minor impact. Three years ago, the datacentre was forced to shut down for the same reason. Afterwards, the team was told it could not happen again.

When asked about the incident, Queensland Health acting CIO Ray Brown did not answer a question about which facilities around the state the affected applications served. However, Queensland Health's three datacentres are believed to provide services to multiple locations around the state.

He denied that there had been more than one incident over the past three years at the datacentre.

According to Brown, since several applications were relocated to the other datacentre, there was "minimal disruption" to services. "The majority of services impacted were available by 2:30am and all Queensland Health systems categorised as critical remained operational during this incident," he said.

"In the face of a severe weather event, the IT staff involved were outstanding in their response to minimise the impact of this incident. The ability of staff to physically attend the site was severely hampered by flooding in the area."

Lessons had been learned, according to Brown. Queensland Health was exploring options to remove its reliance on chilled water. It also intended to replace the remote access system by the third quarter of this year, and was reviewing its management tools and examining its crisis management plan.

Queensland Health has lost several chief information officers over the past several years. Long-time CIO Paul Summergreene had his contract terminated by the department in July 2008. Dr Richard Ashby filled his shoes for a short time, before leaving the chair vacant, with Brown currently leading the department's IT function in an acting capacity.

The news also comes as the Queensland Government flagged in the last state budget its intent to splurge hundreds of millions of dollars on health IT systems to support its e-health capability.

Editorial standards