Internode resorts to disaster recovery

The email accounts of Internode users were stranded over the weekend as the internet service provider battled a major storage infrastructure failure and was forced to fall back to its disaster recovery centre to restore lost services.
Written by Suzanne Tindal, Contributor

Simon Hackett
(Credit: Internode)

Customer email, corporate Web hosting, personal Web hosting, Web mail and customer Web tools took a hit on Friday when some of the company's systems were taken down by a major hardware failure affecting multiple servers.

Corporate Web hosting was restored early Friday afternoon, although other services remained down until later. Email was the last to be restored, with the percentage of customers whose services had been fixed creeping up until Sunday afternoon, when the company announced all services were up and running. The company insisted no email was lost.

Internode managing director Simon Hackett wrote on broadband forum Whirlpool that the outage could have ended four hours after it began, but once the recovery process had been completed, the system began crashing within minutes of being fed production traffic. The entire file system, holding more than 22 million files, then had to be rebuilt.

Complications aside, the outage never should have happened, Hackett said in another post.

"We have a very large investment in a very high end dual-site/fully redundant storage area network system that just isn't supposed to do this — ever. Clearly, it has — and yes, the vendor of that system has been involved (from 20min after the initial failure) in being a part of 'the solution' here, too," he said.

The failure also involved the server operating system. When things were working again, there would be an investigation into exactly what caused the disaster, Hackett said.

"The wrap-up here is going to involve two separate vendors (SAN and server OS) debugging some failure modes neither of them has seen before, some changes in approach in handling the mail cluster to avoid the restoration process (in the unlikely event it's ever needed again) from taking so long, and a variety of other related measures," he said in another post.

"This is a rare and extremely annoying thing for us, as well as for you — and we're absolutely determined to avoid it becoming a habit," he said.

Internode did not respond this morning to an emailed request for comment.
