Red ring of death is closer than you think

Recently, Microsoft's problem with the Xbox 360's infamous Red Ring Of Death resulted in a billion-dollar bill. The consoles just died after a while, an issue that seemed to be linked to heat, but the company was reluctant to disclose exactly what was wrong.

Now we know — the graphics chip, designed in-house, chronically overheated and eventually gave up the ghost.

It can seem hard to believe that a company with so many resources can make such an expensive mistake. Yet in electronics design, there is no shortage of hidden problems that can elude every reasonable effort to find them before launch. Chip design is not the exact science you might imagine.

I've been there myself. Here's how it can go wrong. In the late 1980s, I worked for a small company with big ambitions. We started off by building a cheap PC network — non-standard, but built around a few low-cost off-the-shelf chips used in an ingenious way.

That sold well enough that it was decided to make a higher-performance version around a custom chip design. Our hardware designer (and co-owner) was a very experienced, creative and effective engineer, one of the most capable people I know: the project seemed very doable.

The prototyping went well. At the time, chips were designed in four main stages. First, you design the actual circuit in a CAD package, which outputs a netlist, effectively a script that describes which logic gates to use and how they're connected. Then, you run the netlist through a software simulator that applies electrical rules as if the circuit were running: you feed it a file of fake signals and check the output against what you expect.
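To make the netlist-and-simulator idea concrete, here is a minimal sketch in Python. It is not the tooling we used: the netlist format, the half-adder circuit and the names `simulate` and `run_vectors` are all invented for illustration, and it models pure logic only, with none of the timing or electrical rules a real simulator applies.

```python
# Two-input gate functions operating on 0/1 values.
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
}

# Hypothetical netlist for a half adder: (output net, gate type, input nets).
HALF_ADDER = [
    ("sum", "XOR", ("a", "b")),
    ("carry", "AND", ("a", "b")),
]

# The "file of fake signals": each entry pairs a stimulus with expected outputs.
VECTORS = [
    ({"a": 0, "b": 0}, {"sum": 0, "carry": 0}),
    ({"a": 0, "b": 1}, {"sum": 1, "carry": 0}),
    ({"a": 1, "b": 0}, {"sum": 1, "carry": 0}),
    ({"a": 1, "b": 1}, {"sum": 0, "carry": 1}),
]


def simulate(netlist, inputs):
    """Evaluate every gate once; assumes gates are listed in dependency order."""
    nets = dict(inputs)
    for out, gate, (x, y) in netlist:
        nets[out] = GATES[gate](nets[x], nets[y])
    return nets


def run_vectors(netlist, vectors):
    """Return a list of (stimulus, net, expected, actual) for every mismatch."""
    failures = []
    for stimulus, expected in vectors:
        nets = simulate(netlist, stimulus)
        for net, want in expected.items():
            if nets[net] != want:
                failures.append((stimulus, net, want, nets[net]))
    return failures
```

An empty list from `run_vectors(HALF_ADDER, VECTORS)` means the design matched every vector it was given, which says nothing about the conditions nobody thought to write down.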

Because simulators are always very slow compared to hardware, you can only check a small subset of possible conditions before building a hardware prototype. This can be a collection of many standard logic chips, perhaps hundreds, wired together by hand to mimic your design's internals: it's slow to build, hard to get exactly right and difficult to copy, let alone plug into a PC.

Or you can take the fast and expensive path and go for an e-beam lithography prototype: this is a way of building a full custom chip by firing a carefully steered beam of electrons at a properly prepared bit of silicon. You feed your netlist into the e-beam process at one end and end up with a fully functioning (you hope) real working prototype, same size and speed as the final part.

These are far too expensive for production — e-beam is the equivalent of hand-lettering an illuminated manuscript, as opposed to the printing press of standard chip fabrication — but a great way of creating final test systems that work exactly as the finished design.

Our e-beam litho prototypes came back from the fab, we plugged them in, held our breath, turned on the PCs and loaded the software. There's absolutely nothing like that moment: months of past work and an entire future hang on it.

It worked just fine. All we had to do then was send the netlist to a company that made proper Asics (Application Specific Integrated Circuits). These are made in large numbers very cheaply; they cost a lot more to set up than e-beam litho, but when that's done you can churn them out like so many sausages. We knew the circuit worked; the Asic was just another way to build something we'd now tested in many different ways.

And at first, all went to plan. The chips were made, the network boards produced, software finished (well, I say finished...), the product launched and we started to take the punters' money.

Then reports started to come in from the field that there was an uncommon but far too frequent failure mode where PCs locked up solid in mid-network transaction. We were still a small company with very limited resources: it doesn't matter how smart you are, once things start going wrong you can only do so much firefighting. But time is tight: it's at this point that you learn by heart the number of every local late-night fast food delivery service.

At first, we couldn't even replicate the problem; everything ran fine in the lab. It transpired after a while that certain kinds of PC were more vulnerable than others: we collected examples.

The next problem was finding a way of making the error happen repeatedly and often enough for us to investigate it. That took a while: our collection of Sancho's pizza boxes grew to mountainous proportions before we had a sequence of network transactions that could crash the bleeder on command. There didn't seem to be anything special about that sequence, but at least we could hook up our rather meagre collection of test equipment and start gathering real data.

It's worth remembering what the state of PC hardware was in the late 1980s, when the 8086 and 80286 ran the show and the 80386 was just coming onto the market. There were hundreds of different brands, many of them with custom motherboards, each trying with more or less success to emulate the IBM PC standard. Compatibility was a big issue: most (but by no means all) clones worked well out of the box. What happened when you plugged in an expansion card was a different matter.

The original IBM PC design was remarkable for a largely forgotten fact: in both hardware and software, it was open source. PC-DOS wasn't: that was Microsoft's. But a listing of the Bios and all the circuit diagrams were available in the IBM PC Technical Manual. You couldn't just go and replicate them bit-for-bit, of course, as IBM jealously guarded its copyright. But you could make your own with a high degree of confidence that they worked as described in the book.

One of the key parts of the equation was the expansion bus, the signals that fed interface cards such as the graphics adaptors, disk interfaces and network devices such as our own. That became known as the ISA — Industry Standard Architecture — and its expanded variant, the EISA bus. On the surface, this looked like a perfectly normal chunk of engineering: all the signals, their timings and voltage levels, were described with lots of nice clean graphs in the Technical Manual.

Unfortunately, that was the only place you'd see lots of nice clean graphs. Reality is far messier. Signals were late or early, voltages were never quite what you'd expect, and everything could change depending on what else was plugged into the bus alongside your bit. And, of course, all those hundreds of different makes of PC had different variations.

If you designed something to the book, chances are it wouldn't work very well. Experienced designers know this, and are very conservative in what they expect. Our designer was certainly experienced, and had done a good job of the ISA interface part of the chip. Our e-beam litho prototypes worked perfectly well. It had to be something to do with the Asic.

In the end, after months of extreme pain, cost and pizza overdose, the problem was revealed in a chance conversation at a conference in an "Oh, we had that problem..." way.

One of the abiding sins of the ISA bus — indeed, any bus that used the rather simple-minded signalling circuitry of the time — is called undershoot. If you rapidly change a signal at one end of the bus from five volts to zero volts, you would expect it to stop at zero all the way along the bus — it's just a bit of wire. But a combination of factors to do with basic physics means that further down the bus, some of the energy in that transition drives the voltage past zero, into negative numbers.
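For a feel of the numbers, here is a rough sketch that treats the bus line as a single lumped series RLC circuit, which is a big simplification: a real bus is a transmission line with reflections, and the component values in the usage note are illustrative guesses, not measurements from any PC. The function name `peak_undershoot` is made up for this example. It uses the standard result that an underdamped RLC step from V to zero first bottoms out at -V·exp(-απ/ωd).

```python
import math


def peak_undershoot(v_high, r_ohm, l_henry, c_farad):
    """Most negative voltage at the far end after a v_high -> 0 V step.

    Lumped series-RLC approximation (an assumption, not a bus model).
    Returns 0.0 when the circuit is critically damped or overdamped,
    because the voltage then settles to zero without crossing it.
    """
    alpha = r_ohm / (2.0 * l_henry)               # damping rate, 1/s
    omega0 = 1.0 / math.sqrt(l_henry * c_farad)   # natural frequency, rad/s
    if alpha >= omega0:
        return 0.0
    omega_d = math.sqrt(omega0 ** 2 - alpha ** 2)  # damped ringing frequency
    # The first minimum occurs half a ringing period after the edge.
    return -v_high * math.exp(-alpha * math.pi / omega_d)
```

Calling `peak_undershoot(5.0, 10.0, 50e-9, 50e-12)` with those made-up values gives a negative result, while raising the resistance to 100 ohms damps the ringing away and returns zero. The less damping a line has, the further below zero it swings.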

This is very bad news. The transistors in the logic circuits can lock up or even get permanently damaged by even a slight negative input. From time immemorial, chip designers have guarded against this with clamp diodes, fast-acting switches on the input lines that turn on as soon as a negative voltage appears and effectively short-circuit it to zero. Imagine a horde of Vikings rampaging towards a richly appointed town: the clamp diodes are trapdoors that open under the feet of the naughty Nordics and funnel them off to a big underground pit.

Normally, this just works. A negative spike appears, the clamp diodes turn on and the energy flows through them to ground. Nobody sees a thing. But on the Asics we were using, 'ground' wasn't quite as good as it should have been. If a really large undershoot happened on lots of input lines simultaneously — something that happened only when a certain data pattern appeared on the bus, and then only on particular designs — then the diodes turned on fine but the energy so diverted couldn't get back onto the main bus fast enough. The result was that the whole chip then went negative. The pits full of Vikings overfilled and burst up through the floors of the townsfolk.
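The arithmetic behind that overflow can be sketched as follows. This is a back-of-envelope model under loud assumptions: every clamp diode is taken to carry the same current, the chip's ground connection is reduced to one shared resistance plus an optional inductance, and the function name and all values are hypothetical rather than anything from the Asic's data sheet.

```python
def chip_ground_shift(n_lines, clamp_current_a, ground_resistance_ohm,
                      ground_inductance_h=0.0, edge_time_s=1e-9):
    """Estimate how far the chip's internal ground is pulled below board ground.

    Assumes n_lines clamp diodes conduct at once, each carrying
    clamp_current_a, and that all of that current shares one ground path.
    """
    total_current = n_lines * clamp_current_a
    resistive_drop = total_current * ground_resistance_ohm
    # Crude L * di/dt term: the current is assumed to ramp up linearly
    # over edge_time_s.
    inductive_drop = ground_inductance_h * total_current / edge_time_s
    return -(resistive_drop + inductive_drop)
```

With one line clamping, `chip_ground_shift(1, 0.02, 0.5)` is a negligible shift; with sixteen lines clamping together the same path sees sixteen times the current and sixteen times the shift. That scaling is why the fault needed a particular data pattern on the bus before it showed itself.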

It's an analogue quirk in a digital device, and not one that was specified in the design guide for the Asic, which, after all, was being driven way outside its nominal specification. We fixed it, if memory serves, by adding an extra bus interface chip between the Asic and the PC; this soaked up the undershoot without complaining and we moved on to the next design.

How could we have avoided this? As a small company, we couldn't afford the time or the money to go out and buy every make of PC before launch and go through the saturation testing which would have revealed the problem — but that's an issue that still plagues even the biggest outfits. We got our design right. The only thing that might have saved us was if we'd been far better plugged into the experiences of other companies working at the same problems; in those pre-internet days, that was by no means a simple job for a 10-person outfit in a converted warehouse in the East End of London.

As it was, we learned a great deal the hard way. It happens. It didn't cost us a billion dollars or earn us scalding headlines; we got off lighter than Microsoft's Red Ring Of Death. The complexity of modern IT, especially when you factor in millions of users and all their variations, is such that you can't know everything in advance. The world is not as it appears in technical manuals, marketing slides or engineers' heads — and all you can do is learn as much about it as you can before making the leap of adding to it.
