[aDOC Services]

Technology Briefing


Four New Offerings on the CPU Market Since January This Year.
Summary of the Features of the Newcomers.

August 5, 1997.


Quick links

Introduction
A bit of History
A good old veteran - The Pentium CPU
The Cyrix 6x86 (formerly code-named M1)
The Pentium with MMX extensions
The Cyrix 6x86MX (formerly code-named M2)
The Pentium Pro
The Pentium II (formerly code-named Klamath)
The AMD K6
Related Stories
What is MMX exactly ?
Glossary
Glossary - Dynamic Execution

Related Stories

Dynamic Execution Slide Show
MMX enhanced software
 
Intel Pentium Pro processor
Pentium Pro: a Tour of the Microarchitecture
Intel Pentium II processor
 
AMD introduces sixth-generation AMD-K6 MMX processor (news release)
AMD reduces prices on AMD-K6 MMX enhanced processor family (news release)
AMD K6 MMX Enhanced Technical Brief
 
Cyrix 6x86™ Processor Brief
Cyrix 6x86MX Processor Brief

Given the pace of recent developments in the CPU market, we felt it appropriate to issue a technology briefing presenting the new offerings from Intel and other vendors. Fourteen months ago, Intel released the Pentium 200, which offered top performance at the time; the chip will be phased out at the end of this year. Looking only at chips introduced since the beginning of 1997, there are already 4 new CPUs: the Pentium with MMX extensions, the K6 from AMD, the Pentium-II and the Cyrix 6x86MX (in chronological order).

This briefing first presents the history of the x86 CPUs for the most important vendors. Then, the salient features of the various present processors (including those four new offerings) are outlined.

Preliminary remark: chips from AMD, Cyrix and others become available after the corresponding chips from Intel, so they often use more recent technologies and, as such, can support higher clock rates and/or deliver better performance at the same core frequency. A remarkable example is the 5th generation chip from AMD, which uses a 0.35 micron manufacturing process, a RISC core and out-of-order execution; Intel used such techniques only from the Pentium Pro onwards. That is why other vendors have developed what they call the P-rating, or PR-rating, which names the equivalent chip in the Pentium range - for example, the 110 MHz 6x86 from Cyrix is called a P133+, meaning it runs at least as fast as a Pentium 133.

Note: The chips featuring MMX extensions are written in green.

A bit of History
Note: The dates given in this section are the dates of production of the first CPUs. With recent generations of chips, the first PC containing a given chip often appears shortly after the chip first goes into production, but this was not the case before. For example, the first PCs were available in 1981 (chip produced since 1979), the PC/AT in 1984 (80286 produced since 1982), and PCs based on the 386 in 1986 (chip produced since 1985).
By contrast, Intel introduced the Pentium-II in May (with a motherboard able to support it), and less than 3 months later, many clone makers are already producing compatible motherboards.
11/71 Intel announces the 4004, the first microcomputer chip. It consists of 2300 transistors running at 108 kHz, giving a processing power of 0.06 MIPS. The manufacturing process is at 10 micron.
04/74 Intel introduces the 8080, consisting of 6000 transistors running at 2 MHz, giving 0.64 MIPS. That increase in speed is allowed by a manufacturing process at 6 micron. The 8080 is at the core of the first personal computer, the Altair.
06/78 Introduction of the 8086, featuring 29,000 transistors. The first version runs at 4.77 MHz, giving 0.33 MIPS, but later revisions go up to 8 MHz (0.66 MIPS), then 10 MHz (0.75 MIPS).
06/79 Introduction of the 8088, similar to the 8086, except for an 8-bit external bus (instead of 16 bits for the 8086). This is the first popular CPU in the PC, then the PC/XT.
02/82 Intel introduces the 80286, changing the PC/XT into a PC/AT. Its 134,000 transistors at 1.5 micron run at 6 MHz (0.9 MIPS), then 10 MHz (1.5 MIPS), and finally 12 MHz (2.66 MIPS). The external bus is back up to 16 bits.
10/85 Introduction of the 80386DX, running at 16 MHz (5 to 6 MIPS). The 3rd generation chip includes 275,000 transistors at 1.5 micron. The bus width, as well as the internal architecture, is now 32 bits. This is the first CPU that allows programmers to create a series of virtual x86 machines, the first step towards real multi-tasking.
In February 1987, it is followed by a version at 20 MHz (6 to 7 MIPS), and another at 25 MHz (8.5 MIPS) in April 1988. The last version, clocked at 33 MHz (11.4 MIPS), becomes available in April 1989, on the day when Intel also introduces the 486, the 4th generation x86 processor.
1988 Harris introduces the 80C286 running at 16 and 20 MHz. Its performance is comparable to that of an 80386 at the same frequency.
06/88 Introduction of the 80386SX, running at 16 MHz (2.5 MIPS). This chip is similar to the 386DX, except for a narrower 16-bit external bus. The chip is targeted at the low-end market. Higher frequency versions follow in January 1989, at 20 MHz (2.5 MIPS), then 25 MHz (2.7 MIPS). Finally, a 33 MHz (2.9 MIPS) revision is introduced in October 1992 to counter AMD chips.
04/89 Intel announces the 486DX, which passes the 1 million transistor mark with 1.2 million transistors (1 micron). For the first time, the chip includes a math coprocessor unit and an 8K internal L1 cache. It's also the first chip to process instructions in several stages, in a pipeline. The performance at 25 MHz (20 MIPS) is roughly 50 times that of an 8088.
Faster versions run at 33 MHz (27 MIPS) in May 1990; then an improvement in the manufacturing process (down from 1 to 0.8 micron) allows for a higher clock rate: 50 MHz (41 MIPS) in June 1991.
10/90 Intel introduces the 386SL, the first CPU specifically designed for portables. Running at 20 MHz (4.21 MIPS), it consists of 855,000 transistors. It is comparable to the 386SX, with a 32-bit internal architecture but a 16-bit external bus. It is highly integrated, including the cache as well as the bus and memory controllers. Later, in September 1991, Intel introduces a 25 MHz (4.21 MIPS) chip.
03/91 AMD introduces the Am386, challenging the 386DX from Intel. The chip runs at frequencies of up to 40 MHz, making it an appealing solution for the low-end market, though power users have already been enjoying the 486 for two years. The chip includes 200,000 transistors at 0.8 micron. The Intel 386 by now uses a 1 micron manufacturing process, and stops at 33 MHz.
04/91 Introduction of the 486SX running at 16 MHz (13 MIPS). This is a copy of the 486DX chip, but without the math coprocessor. The bus width is 32 bits, as for the 486DX. It features 1,185,000 transistors at 1 micron. The manufacturing process is later improved to 0.8, then 0.6 micron. Like the 486DX, it uses an 8K internal L1 cache. Later versions run at 20 MHz (16.5 MIPS), then 25 MHz (20 MIPS) in September 1991, and 33 MHz in September 1992.
03/92 Intel announces the 486DX2, the first chip that uses a core frequency different from the external frequency: 50 MHz in the CPU, 25 MHz on the bus. It consists of 1.2 million transistors at 0.8 micron. Its performance reaches 41 MIPS. It is rapidly followed, in August 1992, by another version running at 66 MHz, with 33 MHz on the bus (54 MIPS).
1992 Cyrix, once a math coprocessor maker that was founded in 1988, releases its version of the 486. For the production facilities, Cyrix relies on a foundry that belongs to IBM.
11/92 Intel introduces the 486SL, the 486 targeted at portables. It first runs at 20 MHz (15.4 MIPS), then 25 MHz (19 MIPS), and finally 33 MHz (25 MIPS). The chip consists of 1.4 million transistors at 0.8 micron.
03/93 Intel introduces the Pentium 60 and 66 MHz, both on the same date (100 and 112 MIPS resp.). This is the first superscalar chip, meaning it can process two instructions in parallel. The number of transistors now reaches 3.1 million, at 0.8 micron (BiCMOS). The bus is now 64 bits for data, and 32 bits for addresses; the internal architecture is 32 bits. External speed is at 60 or 66 MHz. The L1 cache is 8K for instructions, 8K for data. The Pentium is a significant step forward from the 486, providing twice the performance for integer processing, and a huge fivefold improvement for floating point calculations. It is also the first chip to require efficient cooling (the Pentium 66 has a mean power consumption of 13W, while peaks reach 16W). Although lower voltages later alleviated the problem, heating concerns have never disappeared since that time.
Intel broke with the logic behind the names 286, 386,... because it couldn't enforce trademarks on names that are simply numbers. The chip is therefore called Pentium, or P5 (its former code-name).
Later, in March 1994, Intel introduces on the same day the Pentium 90 and 100 (resp. 149.8 and 166.3 MIPS), with 3.2 million transistors at 0.6 micron (BiCMOS). The power supply goes down from 5V to 3.3V.
04/93 One month after the release of the Pentium, AMD launches the Am486, using 1 million transistors at 0.7 micron on 3 layers. That process will be enhanced several times, first to 0.5 micron, then 0.35 micron, allowing for higher clock rates. The Am486 will go up to 120 MHz (with a 3 times multiplier), whereas Intel stops the evolution of the 486DX4 at 100 MHz, which again makes the AMD chip an interesting solution for the low-end market. The L1 cache is 8K.
03/94 The 486DX4 uses the same technique as the DX2, but this time, the bus frequency is multiplied by 3 inside the microprocessor. The first version runs at 75 MHz (core frequency), giving 53 MIPS. It is followed by a 100 MHz version. The chip uses 1.6 million transistors at 0.6 micron. To limit the memory bottleneck at those frequencies, the L1 cache is doubled, to 16KB.
10/94 The Pentium 75 (126.5 MIPS) follows the previous Pentiums. It features 3.2 million transistors, and is manufactured with a 0.6 micron process. This is a low-cost and low-power alternative to higher-end Pentiums. The multiplier is at 1.5, meaning the external speed is at 50 MHz.
03/95 The Pentium 120 now reaches 203 MIPS. Later, the manufacturing process goes from 0.6 to 0.35 micron. This allows for higher frequencies, and, in June 1995, Intel announces the Pentium 133 MHz (218.9 MIPS), including 3.3 million transistors. Both use a multiplier of 2 (resp. 60 and 66 MHz external frequency). In January 1996, the multiplier goes even higher, to 2.5, for two new versions, running at 150 and 166 MHz core frequency. Five months later, the multiplier climbs to 3, giving a core frequency of 200 MHz.
06/95 Cyrix announces its 5x86, a 4th generation chip, first running at 100 MHz. A later version, released in October 1995, runs at 120 MHz and delivers performance comparable to that of a Pentium 90. The L1 cache is a unified 16K cache.
10/95 Cyrix introduces the 6x86, first running at 100 MHz. To better define its performance, Cyrix, with AMD, IBM and SGS-Thomson, defines the so-called P-rating, which expresses a chip's speed as the frequency of the Pentium that delivers the same performance. The 100 MHz 6x86 is thus called a P120+. The L1 cache is a unified 16K cache.
In February 1996, the 6x86 range is completed by versions running at 110 MHz (P133+), 120 MHz (P150+) and 133 MHz (P166+). In June 1996 - the month when Intel releases its P200 - a P200+ (150 MHz - bus speed at 75 MHz) is added to that range.
Later, that CPU range is enhanced by the 6x86L, a low power version of the chip.
11/95 Intel introduces the Pentium Pro processor, its 6th generation chip. This is the first x86 CPU to use a superscalar architecture with a RISC core (explanation below). For the first time, the chip includes an internal L2 cache, which is clocked at the same frequency as the core of the CPU. With an internal frequency of 150 MHz (256K L2 cache), 166 MHz (512K L2 cache), 180 MHz (256K L2 cache) or 200 MHz (256 or 512K L2 cache), the chip is a dual cavity PGA, which uses 5.5 million transistors in the CPU cavity, and another 15.5 million in the 256K L2 cache cavity or 31 million for 512K L2 cache versions. The L1 cache is still 8K for instructions, and 8K for data. The CPU itself is manufactured at 0.6 micron at 150 MHz, or 0.35 micron at higher frequencies; the cache is manufactured at 0.6 micron for the 256K versions, and at 0.35 micron for the 512K versions. The bus is 64 bits, plus 64 bits to the L2 cache. The external bus runs at 60 or 66 MHz (multiplier 2.5 or 3).
The chip is mainly optimized for 32-bit applications, which, at the time the PPro was designed, Intel expected to be ubiquitous. For 16-bit applications, the PPro performs slightly slower than Pentium CPUs running at the same core frequency.
12/95 The Am5x86 is a redesign of the Am486 that keeps the same architecture. It is in fact a 4th generation design that delivers Pentium 75-like performance. The core frequency is 133 MHz, with a 4 times multiplier. The L1 cache is now a 16K unified cache. However, it comes much too late to get the market's attention. Besides, the chip is slow to reach mass production.
05/96 AMD releases the K5, a full 5th generation design: a 4-issue superscalar chip with a RISC core, featuring the same kind of enhancements that were introduced by Intel in its Pentium Pro: out-of-order execution, speculative execution with branch prediction, register renaming,... It features a 16K L1 cache for instructions, and 8K for data. The chip uses 4.3 million transistors at 0.35 micron. The K5 includes versions with ratings of 75 and 90 MHz, then 100 (Q3'96), 120 and 133 (Q4'96), and finally 166 MHz (Q1'97).
However, it is already late, and competitors like NexGen or Cyrix have had working alternatives for a long time. The chip is not really a success.
01/97 Intel adds 57 new instructions to the x86 instruction set, giving the Pentium CPU with MMX Technology. The purpose of that extension is to speed up calculations typically found in multimedia applications. The number of transistors climbs to 4.5 million (0.35 micron CMOS). The internal L1 cache is doubled from the Pentium, from 16K to 32K. The first versions of the processor are a 166 and a 200 MHz chip. Unfortunately, Intel could not introduce the chip - which is mainly targeted at the home market - in time for Christmas last year. A later revision, available in June 1997, runs at 233 MHz.
02/97 AMD introduces a 6th generation chip, the K6, which features MMX extensions and builds on the RISC experience AMD gained with the K5. The chip includes a whopping 8.8 million transistors at 0.35 micron. The L1 cache is 32K for instructions, and 32K for data. The chip delivers performance comparable to that of the Pentium-II - which is not yet available - and therefore looks set for a nice success, helped by the pricing policy of AMD, which claimed it would sell it for 25% less than its Pentium-II counterpart.
05/97 Intel merges the Pentium with MMX Technology and the Pentium Pro to create the Pentium-II, a better Pentium Pro, with the MMX set of instructions, and optimized both for 16- and for 32-bit applications. Three versions are announced at the same time, 233, 266 and 300 MHz, though mass production of the 300 MHz version begins only during Q3'97. The chip features 7.5 million transistors at 0.35 micron with a 512K L2 cache. The L1 cache is doubled from the Pentium Pro, from 16K to 32K (16K for instructions, 16K for data). The bus width is 64 bits (with ECC), plus another 64 bits to the L2 cache (optionally with ECC). For the first time, the CPU is packaged in the SEC, the Single Edge Contact cartridge, which Intel plans to use for future CPUs as well.
05/97 The same month, Cyrix introduces the 6x86MX, adding MMX extensions to the 6x86. It also quadruples its cache, with a 64K unified L1 cache. The manufacturing process is at 0.35 micron, with core frequencies at 150 (PR-166), 166 (PR-200) and 188 MHz (PR-233), with bus speeds at 60, 66 and 75 MHz resp.
1998 In the future, higher frequency revisions of the Pentium-II may become available. Later, Intel will introduce the 7th generation x86 chip, which it is currently developing with HP.

Current situation

Intel                    AMD                  Cyrix
Pentium                                       6x86
Pentium MMX [MMX]
                         AMD K5 [RISC]        6x86MX [MMX]
Pentium Pro [RISC]
Pentium II [MMX, RISC]   AMD K6 [MMX, RISC]

Legend: [MMX] = MMX enhanced CPU (otherwise standard set of instructions); [RISC] = RISC core (otherwise conventional core).

A good old veteran - the Pentium CPU
The Pentium is the first Intel chip to be superscalar, with two 5-stage pipelines; this means the CPU is able to execute two integer instructions or one floating point instruction simultaneously. The data bus is 64 bits wide. The processor contains a 2-way set associative 16K (8K for data, 8K for instructions) L1 cache. To reduce the pipeline stalls caused by jumps, the Pentium design includes dynamic branch prediction. Instructions are loaded into two prefetch buffers (one buffer following the branch prediction algorithm and another for instructions loaded in linear order).
Introduced in March 1993 and due to retire at the end of this year, the Pentium CPU, once at the high end of the market, now looks like a good old veteran. The latest Intel price cuts, which set the MMX CPUs at the same price as the non-MMX Pentium chips running at the same frequency, make the chip seem even more old-fashioned. For the record, the chip exists in frequencies going from 60 to 200 MHz (60, 66 MHz run at 5V, while 75, 90, 100, 120, 133, 150, 166 and 200 MHz run at a cooler 3.3V).
The Cyrix 6x86 (formerly code-named M1)
The 6x86 is essentially Cyrix's counterpart of the Pentium. It is a two-way superscalar unit, using register renaming, out-of-order completion, data dependency removal, branch prediction and speculative execution (see the glossary below).
[6x86 photo] Featuring a 16K unified L1 cache, it runs at frequencies from 100 MHz, introduced in October 95, up to 150 MHz (June 96). It is pin-compatible with the Pentium MMX processor, using a socket 7 architecture. Beware, though, that the clock multiplier is fixed at 2, so you need a motherboard that supports a 75 MHz bus to use a 150 MHz 6x86 (P200+). That family has recently been extended with the 6x86L, a low-power version of the CPU.
Thanks to the use of advanced techniques, the 6x86 performs better than a Pentium running at the same frequency. In order to define its performance precisely, Cyrix, with AMD, IBM and SGS-Thomson, created the P-rating, a rating based on real-world applications that expresses the speed as the frequency of the Pentium that delivers the same performance. The P-ratings as a function of the physical core frequency are as follows:

Clock Speed   Bus Speed   P-rating
100 MHz       50 MHz      P120+
110 MHz       55 MHz      P133+
120 MHz       60 MHz      P150+
133 MHz       66 MHz      P166+
150 MHz       75 MHz      P200+
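
The clock speeds in this table follow from the bus speeds and the fixed multiplier of 2 mentioned above. A minimal Python sketch of that arithmetic; the names P_RATINGS and core_clock_mhz are illustrative assumptions, not anything published by Cyrix:

```python
# Nominal bus speeds (MHz) and P-ratings copied from the 6x86 table.
# The 6x86 multiplier is fixed at 2, so core clock = 2 x bus clock.
MULTIPLIER = 2

P_RATINGS = {
    50: "P120+",
    55: "P133+",
    60: "P150+",
    66: "P166+",
    75: "P200+",
}

def core_clock_mhz(bus_mhz, multiplier=MULTIPLIER):
    """Return the core frequency for a given bus frequency."""
    return bus_mhz * multiplier

for bus, rating in P_RATINGS.items():
    # The nominal 66 MHz bus gives 132 MHz here; the chip is marketed as 133 MHz.
    print(f"{bus} MHz bus -> {core_clock_mhz(bus)} MHz core, sold as {rating}")
```

The same function with a multiplier of 2.5 gives the core clocks of the 6x86MX from its bus speeds.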
The Pentium with MMX extensions
The "Pentium CPU with MMX extensions" - the official name for Intel's much touted new version of its 5th generation chip - was the first CPU to include MMX extensions, the largest evolution in the x86 family since the 386, according to Intel. Besides the MMX extensions (refer to the side bar for a more detailed explanation), the Pentium with MMX extensions also features a larger 4-way set associative L1 cache (32K - 16K for instructions, 16K for data), which improves its speed by about 10% compared to a standard Pentium chip at the same frequency. With MMX-enabled applications, the speed improvement is typically around 60%.

[Pentium MMX photo]

From the point of view of the microarchitecture of the CPU, the Pentium MMX also adds an MMX processing unit next to the two integer units and the floating point unit already found in the standard Pentium. It is also superscalar of degree 2. Other changes include larger write buffers for better performance of the memory, and a better branch prediction algorithm, with 4 prefetch buffers. The Pentium MMX is a split voltage design (2.8V for the core, 3.3V for the I/Os) that fits in a socket 7 architecture.
The Pentium MMX comes in three flavours, running at 166, 200 and 233 MHz (bus frequency is at 66 MHz). An upcoming version at 266 MHz is expected at the beginning of next year.

What is MMX exactly ?
MMX is usually read as MultiMedia eXtensions, although Intel does not officially expand the name. These extensions are a set of 57 new instructions first introduced by Intel in January this year to enhance the performance of multimedia and graphics-intensive applications.
One of the core technologies used in MMX is SIMD (single instruction multiple data). The idea behind that name is to apply the same instruction in parallel to several data items, within the same execution unit. To provide large enough registers, the MMX instructions use the 80-bit floating point registers, of which they use 64 bits. To fill those 64 bits, MMX also provides four new 64-bit data types (corresponding to 8 8-bit values, 4 16-bit values, 2 32-bit values or 1 64-bit value). Although MMX uses the floating point registers, its instructions are integer instructions, which means the CPU can execute as many MMX instructions per clock cycle as integer instructions (2 for the Pentium, 3 for the Pentium-II). For example, suppose the CPU loads 4 16-bit values into one of those registers; it can then multiply the contents of that register by another operand and accumulate the results (the basic step of a convolution), all in a single clock cycle.
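
To make the packed data types more concrete, here is a minimal Python sketch that models one 64-bit register holding four 16-bit values, a lane-by-lane addition, and the multiply-and-accumulate step described above. It is only a model of the idea: the function names are illustrative assumptions, and the real MMX instructions differ in detail (signed values, saturation, how the sums are grouped).

```python
MASK16 = 0xFFFF  # one 16-bit lane

def pack4x16(values):
    """Pack four unsigned 16-bit values into one 64-bit integer (lane 0 in the low bits)."""
    assert len(values) == 4
    word = 0
    for lane, value in enumerate(values):
        word |= (value & MASK16) << (16 * lane)
    return word

def unpack4x16(word):
    """Split a 64-bit integer back into its four 16-bit lanes."""
    return [(word >> (16 * lane)) & MASK16 for lane in range(4)]

def packed_add16(a, b):
    """Add lane by lane with wrap-around; a carry never spills into the next lane."""
    sums = [(x + y) & MASK16 for x, y in zip(unpack4x16(a), unpack4x16(b))]
    return pack4x16(sums)

def multiply_accumulate(a, b):
    """Multiply matching lanes and add up the four products."""
    return sum(x * y for x, y in zip(unpack4x16(a), unpack4x16(b)))

samples = pack4x16([1, 2, 3, 4])
weights = pack4x16([10, 20, 30, 40])
print(unpack4x16(packed_add16(samples, weights)))  # [11, 22, 33, 44]
print(multiply_accumulate(samples, weights))       # 1*10 + 2*20 + 3*30 + 4*40 = 300
```

In Python the lanes are processed one after the other; the point of SIMD hardware is that all four lanes are handled by a single instruction.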
From now on, all the new designs from Intel will incorporate the MMX features. Other chips providing MMX extensions include the K6, and the 6x86MX.
The main applications that take advantage of the MMX extensions so far are games, and some video-conferencing or image retouching packages. You can find a list of such applications on the Intel site (link available in the side bar about related stories).
The Cyrix 6x86MX (formerly code-named M2)
Launched on May 30, 1997, the 6x86MX, like the K6, relies on the ubiquitous socket 7 architecture. It adds 57 MMX instructions to the former generation, the 6x86 chip, thereby qualifying it as an MMX-enhanced CPU.
[6x86MX photo] Compared to the 6x86, the 6x86MX is a split voltage design that also adds a more flexible multiplier, with settings at 2, 2.5, 3 and 3.5 (the only value used up to now is 2.5). It features a huge 64K unified L1 cache. Unlike other recent processors, the 6x86MX does not split the x86 instructions into small RISC-like instructions prior to processing. However, it does use advanced features like a superscalar architecture, with register renaming, data dependency removal, multi-branch prediction, speculative execution, superpipelining and out-of-order completion, delivering performance comparable (the so-called PR-rating) to Pentium-IIs running at 166, 200 or 233 MHz. A PR-266 is expected to be introduced in Q4'97. Those chips rely on a 0.35 micron process, which Cyrix expects to replace with a 0.25 micron process at the beginning of next year, allowing for a PR-300 version in the same timeframe.
However, with only 35,000 units sold in Q2'97, the chip has had only limited availability so far. A PR-233 is expected to be priced around 17500 BEF (±13000 BEF for a PR-200) when it becomes readily available. The table of the PR-ratings is as follows:

Clock Speed   Bus Speed   PR-rating
150 MHz       60 MHz      PR-166
166 MHz       66 MHz      PR-200
188 MHz       75 MHz      PR-233
Glossary
L1/L2 cache: Memory accesses have had a dramatic impact on performance for all CPUs since the 20 MHz barrier was broken. Therefore, all vendors typically provide an intermediate level of fast, but expensive (and therefore small) memory to store frequently accessed data and instructions. This is a cache. That design is often pushed further by providing two levels, arranged hierarchically, of increasingly fast but expensive memory: the level 1 cache is typically inside the CPU, very fast but limited to somewhere between 8 and 64 KB, whereas the L2 cache is typically outside the CPU, on the motherboard; it is a bit slower but larger, with a typical size between 256 and 1024 KB.
IPC: The number of instructions executed per clock cycle.
Pipelining: Every instruction goes through a series of elementary steps when it is processed. This allows for some parallelism: while instruction 1 is decoded, for example, instruction 2 is already being loaded; then instruction 1 is executed, while instruction 2 is decoded and instruction 3 is fetched. The purpose of pipelining is to achieve an IPC as close to one as possible.
Superpipelining: A technique that consists in increasing the number of stages in the pipeline; each stage becomes simpler, which reduces the execution time of each stage, i.e. allows the frequency to be increased without added complexity.
Superscalar architecture: Since the Pentium, all new x86 CPUs can process several instructions in parallel for higher performance. In the case of the Pentium, the CPU can process two instructions at a time; the Pentium Pro is superscalar of degree 3, meaning it can process up to 3 instructions at any given time. The purpose of a superscalar architecture is to get an IPC higher than 1.
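
The IPC targets mentioned in these definitions can be checked with a small idealised model. The Python sketch below assumes a 5-stage pipeline (as in the Pentium), a run of 1000 instructions and no stalls at all; the function names and the instruction count are assumptions made for the illustration.

```python
import math

def cycles_unpipelined(n_instr, stages):
    """Each instruction goes through all stages before the next one starts."""
    return n_instr * stages

def cycles_pipelined(n_instr, stages, width=1):
    """Ideal pipeline without stalls: the first group of `width` instructions
    needs `stages` cycles, then one more group completes every cycle."""
    groups = math.ceil(n_instr / width)
    return stages + groups - 1

def ipc(n_instr, cycles):
    """Instructions executed per clock cycle."""
    return n_instr / cycles

N, STAGES = 1000, 5
cases = {
    "no pipeline": cycles_unpipelined(N, STAGES),
    "pipeline": cycles_pipelined(N, STAGES),
    "superscalar, degree 2": cycles_pipelined(N, STAGES, width=2),
    "superscalar, degree 3": cycles_pipelined(N, STAGES, width=3),
}
for label, cycles in cases.items():
    print(f"{label:22s} {cycles:5d} cycles  IPC = {ipc(N, cycles):.2f}")
```

In this model the IPC is 0.2 without a pipeline, just under 1 with one, and just under 2 and 3 for superscalar designs of degree 2 and 3. Real programs stay below these figures because of cache misses, data dependencies and mispredicted jumps.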
The Pentium Pro
Originally defined as a chip that should surpass the Pentium while using the same manufacturing process, the Pentium Pro is the first chip that implemented a series of innovative features, like a 3-way superscalar RISC-like core architecture, with dynamic execution. The purpose of dynamic execution is to preserve an improvement when adding a third execution pipeline to the two already present in the Pentium, by preventing pipeline stalls. The three execution pipelines consist each of 12 decoupled stages.
[Pentium Pro photo] At the microarchitectural level, a decoding unit translates 3 x86 instructions at a time into RISC-like instructions, the micro-operations (micro-ops), to be processed by an out-of-order core. The core of the CPU consists of 6 processing units (2 integer units, 1 floating point unit, 1 jump unit and 2 address generation units), which are controlled by a scheduler able to simultaneously dispatch up to 5 micro-ops from the Re-Order Buffer (ROB) to the processing units, and retire up to 3 micro-ops. The scheduler chooses micro-ops that are ready to be processed (i.e. whose operands are loaded), taking into account the data dependencies and the idle processing units. Another key element of dynamic execution is speculative execution: when a cache miss occurs, the processor goes on executing the following instructions, without waiting for the cache miss to be serviced. The temporary results of speculatively executed instructions are also stored in the ROB, from where they are forwarded to other instructions in the instruction pool via the bus interface unit. If the branch prediction turns out to be wrong, the temporary results in the ROB are discarded, and execution resumes at the new address. The accuracy of the branch prediction algorithm, i.e. the proportion of correct predictions, is more than 90%, according to Intel. Finally, the retirement unit commits the results of (at most 3) completed instructions residing in the ROB to memory, in program order.
Considering the dramatic impact of memory accesses on overall performance, the Pentium Pro designers added an internal 256 or 512K L2 cache inside the CPU, running at full core speed. However, this design makes the Pentium Pro an expensive chip (think of the price tag of a 512K L2 cache version compared to a 256K chip and you understand how costly a fast cache can prove). The cache is serviced by an independent bus (DIB architecture). Next to the internal L2 cache, the Pentium Pro also provides a more conventional 8+8K L1 cache.
Well known for its relatively poor performance in 16-bit applications (at the time of the Pentium Pro design, Intel couldn't know most of us would still live in a 16-bit world today, and hence decided to produce a chip mainly targeted at 32-bit applications), the Pentium Pro is still - until the Deschutes becomes available in Q2'98 - the best choice for server applications and for 32-bit operating systems (one of the reasons being its support for up to 4 glueless parallel processors). The Pentium Pro fits in a socket 8 motherboard and runs at frequencies ranging from 150 to 200 MHz, with 256, 512 or 1024K L2 cache (not all combinations are available).
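
The interplay between out-of-order dispatch and in-order retirement can be illustrated with a toy model. The Python sketch below is an assumption-laden simplification, not a description of the actual Pentium Pro hardware: the MicroOp class, the simulate function, the register names and the latencies are all hypothetical, register renaming is assumed (only true data dependencies count), and processing units are not modelled.

```python
from dataclasses import dataclass

@dataclass
class MicroOp:
    name: str
    dest: str            # register written by the micro-op
    sources: tuple = ()  # registers read by the micro-op
    latency: int = 1     # cycles before the result is available

def simulate(program, issue_width=5, retire_width=3):
    """Return (dispatch order, retirement order) as lists of micro-op names."""
    # A micro-op depends on the most recent earlier writer of each source register.
    last_writer, deps = {}, []
    for i, op in enumerate(program):
        deps.append([last_writer[s] for s in op.sources if s in last_writer])
        last_writer[op.dest] = i

    finish = [None] * len(program)  # cycle at which each result becomes available
    dispatched, retired_names = [], []
    retired = cycle = 0
    while retired < len(program):
        # Retirement: strictly in program order, at most retire_width per cycle.
        count = 0
        while (retired < len(program) and count < retire_width
               and finish[retired] is not None and finish[retired] <= cycle):
            retired_names.append(program[retired].name)
            retired += 1
            count += 1
        # Dispatch: any micro-op whose operands are ready, oldest first.
        issued = 0
        for i, op in enumerate(program):
            if issued == issue_width:
                break
            ready = all(finish[d] is not None and finish[d] <= cycle for d in deps[i])
            if finish[i] is None and ready:
                finish[i] = cycle + op.latency
                dispatched.append(op.name)
                issued += 1
        cycle += 1
    return dispatched, retired_names

program = [
    MicroOp("load", dest="r1", latency=4),      # slow, e.g. a cache miss
    MicroOp("add", dest="r2", sources=("r1",)), # must wait for the load
    MicroOp("mul", dest="r3", sources=("r4",)), # independent of the load
    MicroOp("sub", dest="r5", sources=("r3",)), # depends on the mul only
]
dispatch_order, retire_order = simulate(program)
print(dispatch_order)  # ['load', 'mul', 'sub', 'add']
print(retire_order)    # ['load', 'add', 'mul', 'sub']
```

The mul and sub micro-ops are dispatched while the load is still outstanding, but their results are only retired after the load and the add, so the program still appears to run in order.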
The Pentium II (formerly code-named Klamath)
The ideas at the core of the Pentium-II design are threefold:
  • incorporate the MMX extensions into the Pentium Pro
  • get a more economical design
  • optimize the chip both for 16- and for 32-bit applications.
Basically a Pentium Pro with MMX extensions, the Pentium-II is also characterised by a more economical design. In particular, the costly internal L2 cache of the Pentium Pro is replaced by a cache housed in the Single Edge Contact cartridge and accessed at half the core frequency - still much faster than other CPUs that use an external cache residing on the motherboard. To compensate for that, the Pentium-II is equipped with a larger L1 cache: 16K for instructions + 16K for data. So far, the Pentium-II is only available with 512K of L2 cache, but Intel may introduce a 256K version at a lower price. Like the Pentium Pro, the Pentium II uses a dual bus architecture to cache and to memory; while ECC is always performed on the system bus, it is an optional feature on the L2 cache bus. Another improvement over the Pentium Pro architecture is in the branch prediction algorithm. The Pentium-II supports two-way glueless SMP configurations. A low-power version of the Pentium-II (code-named Deschutes), using a 0.25 micron manufacturing process, is planned to be released in Q2'98, at 400 MHz, using an impressive external bus speed of 100 MHz. The Pentium II itself will evolve to a 333 MHz version, then to 350 and 400 MHz, both using the 100 MHz external bus. Although not the cheapest in this round-up, the Pentium-II is still the chip to beat when it comes to floating point intensive calculations, and for 32-bit operating systems, for which it has kept the talent of its ancestor, the Pentium Pro.
[SEC cartridge photo]
Glossary - Dynamic Execution
RISC core Many vendors are now using a technique by which they split the CISC instructions found in x86 code into several pieces of small RISC-like instructions, that are easier to process efficiently by the scheduler unit inside the out-of-order core. These instructions are typically processed in a single clock cycle.
Data Dependency Removal / Data Forwarding Provide instruction results to all execution pipelines simultaneously, so that no pipeline can be stalled.
Branch Prediction A technique by which the CPU tries to know when - and to what location! - a jump will occur in a program, allowing it to already begin processing instructions at the target location before finishing all the instructions before the jump, in order to avoid pipeline stalls. This is typically done using a Branch Target Buffer, a small associative (=content addressable) memory that caches the location of the previous jumps, and the address to which the execution was transfered. Optionally, a small memory may be provided to cache also the instructions at the target location.
The accuracy of the algorithm is a key element in the overall performance of the CPU, as jumps are typically found every 5 or 6 instructions.
Speculative Execution When processing any instruction - e.g. an instruction that produces a cache miss - following instructions are already executed speculatively, including instructions beyond jumps, in order to keep the execution pipelines as full as possible. The result of the following instructions is kept in a temporary storage until the previous instructions complete. If the results of the branch prediction algorithm were correct, the results are definitively stored in memory. This accelerates the overall execution speed.
Register Renaming The idea of register renaming is to provide alternative storage for data produced by speculatively executed instructions that would otherwise have to be written to a register that is still occupied.
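The following Python sketch shows one way to picture renaming. It assumes an unlimited pool of physical registers named `p0`, `p1`, ... and a simple three-operand instruction format; both are illustrative assumptions, as is the `rename` helper itself.

```python
def rename(instructions, arch_regs):
    """Rewrite (dest, src1, src2) tuples to use physical register names.

    Every write gets a fresh physical register, so a later write to the
    same architectural register never has to wait for earlier readers.
    """
    mapping = {reg: f"p{i}" for i, reg in enumerate(arch_regs)}
    next_free = len(arch_regs)
    renamed = []
    for dest, src1, src2 in instructions:
        s1, s2 = mapping[src1], mapping[src2]  # read current mappings first
        mapping[dest] = f"p{next_free}"        # fresh register for the result
        next_free += 1
        renamed.append((mapping[dest], s1, s2))
    return renamed
```

In the sequence `eax = ebx op ecx`, `ebx = eax op ecx`, `eax = ecx op ecx`, the two writes to `eax` end up in different physical registers (`p3` and `p5`), so the third instruction no longer conflicts with the second one, which still reads `p3`.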
Out-of-Order Execution Typically, only the first step of the pipeline - fetching the instructions - and the last step - definitively storing the results in memory, called retirement - are processed in order. Once the x86 instructions have been split into series of smaller instructions, execution is carried out out of order. The scheduling unit tracks the data dependencies between instructions that are processed in parallel.
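A small Python model makes the difference between completion and retirement visible. It assumes an unlimited number of execution units and made-up latencies in clock cycles; the `simulate` function and the instruction names are hypothetical.

```python
def simulate(instructions):
    """Model out-of-order completion with in-order retirement.

    instructions: list of (name, latency, deps) in program order, where
    deps names the earlier instructions whose results are needed.
    Returns (finish_times, retire_times) as dicts of cycle numbers.
    """
    finish_times = {}
    retire_times = {}
    last_retire = 0
    for name, latency, deps in instructions:
        # An instruction starts as soon as all its inputs are available.
        start = max((finish_times[d] for d in deps), default=0)
        finish_times[name] = start + latency
        # Retirement never overtakes an earlier instruction.
        last_retire = max(last_retire, finish_times[name])
        retire_times[name] = last_retire
    return finish_times, retire_times
```

With a 10-cycle `load`, an `add` that depends on it and an independent 3-cycle `mul`, the `mul` finishes at cycle 3, well before the `load`, but it only retires at cycle 11, after the two instructions that precede it in program order.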
Dual Independent Bus (DIB) An architecture first introduced by Intel in its Pentium Pro, which uses two separate buses, one for memory and one for cache. This architecture is also used in the Pentium II processors.
Single-Edge Contact (SEC) cartridge A plastic and metal cartridge used for the Pentium-II, which contains both the CPU and the L2 cache. This design makes it possible to build the L2 cache from industry-standard cache memories, while still accessing the information at high clock rates. This cartridge fits in the so-called slot 1 on the motherboard (illustration: see the section about the Pentium-II).
For a really nice and clear explanation of dynamic execution, please refer to the slide show on Intel's web site (link available in the side bar about related stories). That slide show also illustrates how crucial the performance of the branch prediction algorithm can be for the efficiency of superscalar processors.
The AMD K6
After the disaster of its 5th generation chip, AMD bought NexGen - which had also released its own 5th generation chip design - and hired the former head of the Pentium project at Intel for the K6 design team. They produced an efficient 4-issue superscalar design with a RISC core, which AMD claims it will always sell for 25% less than the corresponding alternatives from Intel.
[AMD K6 photo] The chip, equipped with a huge 64K L1 cache (32K for instructions, 32K for data), is manufactured in a 5-layer 0.35 micron process. It is compatible with the now venerable socket 7 architecture, which according to AMD is the most efficient design that does not impose new development costs on motherboard manufacturers. Unfortunately, that design still relies on an external L2 cache, typically accessed via the system bus at 66 MHz. Don't be fooled by that design, however: its performance makes it definitely a (good) competitor to the Pentium II. Tests conducted in house showed its performance to be almost twice that of a standard Pentium CPU running at the same core frequency, and only slightly below that of the Pentium-II. Also, although the name "K6" doesn't say anything about MMX, the K6 does feature MMX extensions. Intel sued AMD for using the MMX moniker, but the court finally ruled that AMD should be allowed to use the "MMX" name, provided it is specified to be a trademark of Intel.
Like the K5 and the Pentium Pro, the K6 first splits the x86 instructions into series of small RISC-like instructions - which AMD calls RISC86, the equivalent of the "micro-operations" in the Pentium Pro - prior to processing. This is done by a series of dedicated decoders, with the help of predecoding information (which includes, among other things, the length in bytes of the x86 instructions, allowing several of them to be decoded in parallel). These RISC86 instructions are then processed by a RISC core with techniques like multi-level branch prediction, speculative execution, register renaming, out-of-order execution and data forwarding. This core consists of 7 processing units (1 load unit, 1 store unit, 2 integer units, 1 FPU, 1 multimedia unit for MMX instructions and 1 branch unit), which are controlled by a scheduler able to issue up to 6 and retire up to 4 RISC86 instructions per clock cycle, making it a 4-issue superscalar chip. The branch prediction unit uses an 8,192-entry branch history table to predict jumps, yielding correct predictions over 95% of the time. The target addresses would consume too much memory to be stored, and are instead calculated on the fly by special decoding units. The instructions located at the target locations are stored in 16 buffers of 16 bytes each, so that, for frequently accessed target locations, at least one complete x86 instruction reaches the decoding logic without waiting for an access to external memory (either another cache or main memory). As in the Pentium Pro, instructions are always retired in order.
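To see why knowing the instruction lengths in advance helps parallel decoding, consider the Python sketch below. It only shows the principle: once the lengths are known, the start offset of every instruction in a fetched block follows directly, so several decoders can each be pointed at their own instruction. The `instruction_starts` helper and the example lengths are hypothetical and say nothing about how the K6 actually encodes its predecode bits.

```python
from itertools import accumulate


def instruction_starts(lengths):
    """Start offset of each instruction, given predecoded lengths in bytes."""
    # Running sum of the lengths, shifted by one position (Python 3.8+).
    return list(accumulate(lengths, initial=0))[:-1]
```

For four instructions of 1, 3, 2 and 5 bytes, the start offsets are 0, 1, 4 and 6. Without the lengths, the second instruction could not be located before the first one had been at least partly decoded.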
Like other manufacturers, AMD is building a new silicon wafer fab to support a 0.25 micron process in the near future, allowing for higher clock rates. Support for a 100 MHz external bus speed is also expected in the near future.

The war in the CPU arena is more intense than ever. Sure, Intel has already been challenged many times in the past by CPUs from AMD, Cyrix, or others, but for the very first time, AMD now has both a competitive CPU at the high end of the market and a large production capacity to meet demand. Besides, AMD announced it will always sell that chip for 25% less than Intel chips of equivalent performance. In a market with slumping prices, it might be able to shoot real bullets this time. Analysts foresee that AMD could take a market share of up to 30%, if Intel decides not to sacrifice part of its profit margin to protect its market share.
Cyrix also seems to have a new appealing offering, with its 6x86MX.
Be sure you seize that opportunity...



© August 1997, J.-M. Mangen; no copy, whether on electronic or on paper support, whether complete or not, shall be made without the prior written consent of the author.
Credits: Illustrations come from the web sites of the respective manufacturers.