Intel's Core architecture now underlies mobile, desktop and server chips, and is a major departure from the Pentium 4's NetBurst design.
It is hard to overstate the importance of the Core micro-architecture to Intel, and thus to the rest of the industry. The product of a major debate within Intel (which Pat Gelsinger, General Manager of the Digital Enterprise Group, called 'speed freaks versus brainiacs'), it marks the victory of those who felt that extra performance was best achieved not by constantly upping the processor's clock speed, but by building ever more parallel systems with much finer control of performance versus power consumption.
With the Core 2 Duo and Core 2 Extreme, the Core architecture is being applied to mobile (Merom) and desktop (Conroe) processors, which now join last month's Xeon 5100 (Woodcrest) to put basically the same chip in everything from notebooks to servers.
All modern processors work by reading a stream of instructions that tell them where data is in memory and what to do with it. Originally, processors fetched one instruction from memory and took as long as they needed to execute it fully before starting on the next. Each clock tick -- of which there are a million a second with a 1MHz processor, a billion with 1GHz -- moved the instruction one stage further through the processor. Some instructions could take four or more ticks.
Now, processors move blocks of hundreds or thousands of instructions into cache before executing them in groups of four or more at a time, trying to execute even the most complex instruction in one tick. To this end, each generation of processor uses all the best tricks from previous designs and adds some of its own: the Core architecture is mostly a mix of Pentium M (Banias and Dothan) ideas with those from the Pentium Pro/P6. It's a major departure from the Pentium 4's NetBurst design.
Core details
Core is a pipelined architecture, where instructions move through a number of internal stages between entering the processor and being completed -- 'retired', in the jargon. As an instruction exits a stage another can enter, minimising the idle time for each internal component. Core has around fourteen stages in its pipeline: as with most modern architectures, there are a number of complications, such as early completion and out-of-order execution, which make it hard to define exactly how many stages there are.
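To put rough numbers on why pipelining matters, here is a minimal sketch in Python. It is an idealised model and not a description of Core itself: it assumes one instruction enters the pipeline per tick, with no stalls and no out-of-order tricks, and the function names and the 1,000-instruction workload are made up for illustration.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    """The first instruction needs n_stages ticks to fill the pipeline;
    after that, one instruction retires per tick (no stalls assumed)."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


STAGES = 14          # approximate pipeline depth quoted above
INSTRUCTIONS = 1000  # arbitrary illustrative workload (an assumption)

slow = cycles_unpipelined(INSTRUCTIONS, STAGES)
fast = cycles_pipelined(INSTRUCTIONS, STAGES)
print(slow, fast, round(slow / fast, 1))  # prints: 14000 1013 13.8
```

In this simplified model the speed-up approaches the number of stages as the instruction stream gets longer, which is why keeping every stage busy matters so much.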

Intel's Core micro-architecture.
The front end of the machine fetches instructions and does preliminary analysis and reconstruction work on them. Core is a four-wide machine, with some portions five or six wide, meaning it can execute at least four instructions at once. That's wider than any previous x86 architecture. Internally, Core has its own microcode, and the first stage in dealing with x86 instructions is translating them to micro-ops in that microcode while working out which instructions can be safely combined into single operations -- 'macrofusion'.
Like all chip designers, Intel spends a lot of time analysing software, looking for common combinations of instructions -- for example, a mathematical comparison followed by a switch to a different section of code depending on the result of that comparison. By fusing those two x86 operations into a single micro-op, the chip can complete them much faster. This happens on average once every ten x86 instructions.
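The compare-and-branch case can be sketched in a few lines of Python. This is a toy model only: the real rules for which instruction pairs Core will fuse are more restrictive, and the `COMPARES` set, the helper names and the sample program below are all assumptions made for illustration.

```python
# Hypothetical set of compare-style mnemonics that this toy model will fuse.
COMPARES = {"cmp", "test"}


def is_conditional_jump(mnemonic: str) -> bool:
    """Treat any j-prefixed mnemonic except the unconditional jmp as conditional."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def macrofuse(instructions: list[str]) -> list[str]:
    """Merge each compare that is immediately followed by a conditional jump."""
    fused = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if (
            current.split()[0] in COMPARES
            and nxt is not None
            and is_conditional_jump(nxt.split()[0])
        ):
            fused.append(f"{current} + {nxt}")  # one fused operation
            i += 2
        else:
            fused.append(current)
            i += 1
    return fused


program = [
    "mov eax, [esi]",
    "add eax, ebx",
    "cmp eax, 10",
    "jne loop_top",
    "mov [edi], eax",
]
ops = macrofuse(program)
print(len(program), "->", len(ops))  # prints: 5 -> 4
```

Here five x86 instructions become four operations for the rest of the pipeline to handle, because the comparison and the conditional jump travel together as one.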
Core also does 'microfusion', which is similar but applies when a single x86 instruction translates into multiple micro-ops. Where possible, the processor binds two of those micro-ops together and treats them as one; again, this can reduce the number of processing steps by around ten percent in some cases.
Once we've got streams of micro-ops rattling through the pipelines, considerable performance gains can be achieved by spotting those instructions that'll take some time to complete and starting them as early as possible. Typically, these involve reads or writes to memory: if you know that ten steps down the pipeline you'll need to load some information in, it's best to send the request out to the relatively slow memory system as early in the pipeline as possible. Unfortunately, instructions already in the pipeline may change the data at the memory location that you've preloaded, making your version out of date by the time it comes into play.
Core copes with this by using prediction hardware that allows a read from memory to happen even if there's a write already in progress, provided the predictor thinks that the write is unlikely to cause a problem. A check afterwards catches the times when this prediction is wrong, which triggers a relatively slow process of recovering the right information; however, on balance the gains from guessing right outweigh the losses from guessing wrong.
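The guess-then-check idea can be illustrated with a small Python sketch. The article does not say how Core's predictor works internally, so the saturating counter used here is a swapped-in, textbook-style stand-in; the class name, the threshold of three and the sample history are all assumptions.

```python
class LoadPredictor:
    """Toy predictor: one saturating counter per load instruction address."""

    def __init__(self, max_count: int = 3):
        self.max_count = max_count            # assumed confidence threshold
        self.counters: dict[int, int] = {}

    def may_hoist(self, load_pc: int) -> bool:
        """Allow the early read only once the counter has saturated."""
        return self.counters.get(load_pc, 0) == self.max_count

    def update(self, load_pc: int, conflicted: bool) -> None:
        """After the check: reset on a conflict, otherwise gain confidence."""
        if conflicted:
            self.counters[load_pc] = 0
        else:
            count = self.counters.get(load_pc, 0)
            self.counters[load_pc] = min(self.max_count, count + 1)


def simulate(history: list[bool], predictor: LoadPredictor, load_pc: int = 0x40):
    """Return (early reads attempted, early reads that turned out wrong)."""
    hoisted = mispredicted = 0
    for conflicted in history:
        if predictor.may_hoist(load_pc):
            hoisted += 1
            if conflicted:
                mispredicted += 1  # the slow recovery path would run here
        predictor.update(load_pc, conflicted)
    return hoisted, mispredicted


# True means an earlier write really did touch the address being read.
history = [False] * 5 + [True] + [False] * 6
print(simulate(history, LoadPredictor()))  # prints: (6, 1)
```

In this made-up run the read goes early six times and is wrong once, which is the trade the hardware is betting on: frequent small gains against an occasional expensive recovery.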








