Fall Processor Forum 2004 Wrap-Up

We trust you’ve read all of Mark Hachman’s excellent news reports on our Fall Processor Forum main page. Mark provided insights on the upcoming AMD dual-core Opteron, Transmeta’s just-released TM8800 (aka “Efficeon 2”), and VIA/Centaur’s 64-bit-capable “CM” processor due in early 2006, along with various embedded processor news. In this wrap-up report, we’ll discuss the overarching themes at the show, review a fascinating technology keynote speech from IBM’s brilliant chief chip technologist Dr. Bernard Meyerson, highlight some very cool embedded processors announced at the show, and dig a bit deeper into the upcoming dual-core AMD Opteron.

Before we get started, let’s clear up any possible confusion regarding the origins of the Fall Processor Forum. You may recall the old “Microprocessor Forum” held every October for the past 16 years. Show operator In-Stat/MDR decided to combine Microprocessor Forum with its Embedded Processor Forum, broadening the scope of the show, and running the event semiannually. Thus we’ll now see a “Fall Processor Forum” and “Spring Processor Forum” on an ongoing basis. Moderated by Microprocessor Report analysts, and including presentations from numerous microprocessor architects, the Forums are clearly great venues for learning details of future microprocessor architectures along with market positioning. While other shows such as ISSCC or Hot Chips might deliver more technical depth, Processor Forum strikes a good balance between technology insight and market factors.

The show included technical presentations of many upcoming embedded and server processors, and even a few chips used in supercomputer applications, but there weren’t many new mainstream processors (likely a major reason the show was more lightly attended than in the past), and Intel again skipped presenting at the show (see below).

The common architectural theme in nearly all presentations was a move away from building complicated, higher-megahertz single-core processors and toward embedding multiple existing cores on a single die. Over the past few months, AMD, Intel and others have disclosed future “dual-core” designs for server, workstation, and even mainstream desktop and mobile processors. Other key design trends included system-on-a-chip (SoC) architectures for embedded applications, and more aggressively managed power utilization for processor cores.

Why the move to dual/multi-core? Because it’s getting increasingly difficult to build ever more complex, out-of-order processors that squeeze more parallelism out of an instruction stream (though certain processor architectures might do a better job than others, and/or work more closely with compilers to expose more parallelism). Equally important, it’s hard to scale clock frequencies into the stratosphere, because power requirements become unwieldy, and generated heat becomes extremely difficult to dissipate. A more readily attainable way to higher performance for many application scenarios is to include multiple lower-clocked cores on a single die.
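To put rough numbers on the power argument, here is a minimal sketch using the textbook approximation that dynamic CMOS power scales with switched capacitance times voltage squared times frequency. The voltages, clock speeds, and capacitance ratios below are purely illustrative assumptions, not figures from any presentation at the show, and the model ignores leakage power entirely.

```python
def dynamic_power(cap, volts, freq_ghz):
    """Relative dynamic CMOS power, using the approximation P ~ C * V^2 * f."""
    return cap * volts ** 2 * freq_ghz


# Hypothetical single core pushed to a high clock, which needs a higher voltage.
single = dynamic_power(cap=1.0, volts=1.4, freq_ghz=4.0)

# Hypothetical dual-core part: twice the switched capacitance, but each core
# runs at a lower clock and a lower voltage.
dual = dynamic_power(cap=2.0, volts=1.1, freq_ghz=2.8)

# Aggregate clock cycles available per second, relative to the single core.
aggregate_clock_ratio = (2 * 2.8) / 4.0

print(f"single core power (relative): {single:.2f}")
print(f"dual core power (relative):   {dual:.2f}")
print(f"dual/single power ratio:      {dual / single:.2f}")
print(f"dual/single aggregate clock:  {aggregate_clock_ratio:.2f}")
```

Under these assumed values, the dual-core configuration offers more aggregate clock cycles for less dynamic power, which is the trade-off the presenters were describing. Whether software can actually use both cores is a separate question.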

You probably recall Intel cancelled the Prescott follow-on processor codenamed Tejas, and the Xeon (Nocona) follow-on codenamed Jayhawk, because of the “thermal wall” encountered at high GHz levels (beyond 4GHz) in the 90nm fabrication process, according to Intel President Paul Otellini earlier this year. In fact, moving to dual-core designs over the next year was an abrupt “right-turn” in Intel’s roadmap per Otellini.

Dual-core, quad-core, and even higher numbers of processor cores embedded on specialized chips will provide significant performance boosts for multithreaded applications and multiprocessing scenarios. Many server workloads and certain multithreaded and multiprocessing workstation applications will reap immediate benefits from multiple cores. A whole bunch of multimedia and imaging applications running in a variety of devices will benefit from multi-core embedded processors and DSP designs, as will network switches.
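How much a given workload gains from extra cores depends on how much of it can run in parallel. A common back-of-the-envelope model is Amdahl’s law, sketched below. The parallel fractions are illustrative assumptions rather than measurements of any application mentioned at the show.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup under Amdahl's law for a workload whose
    parallelizable share is parallel_fraction (0.0 to 1.0)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical workloads: mostly serial, well threaded, nearly fully parallel.
for fraction in (0.5, 0.9, 0.99):
    results = ", ".join(
        f"{cores} cores: {amdahl_speedup(fraction, cores):.2f}x"
        for cores in (2, 4, 16)
    )
    print(f"{fraction:.0%} parallel -> {results}")
```

The model ignores memory contention and synchronization overhead, so real gains will generally be lower. It does show why server and multimedia workloads with many independent threads stand to benefit first.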

Back to Intel: Unfortunately, the company’s been notably absent the past few years at the (Micro)Processor Forum, with the excuse that its own developer forums are held too close in time to the Processor Forum, and there’s nothing much new to present. This may be true to a certain extent, but there’s always new stuff, and one could speculate that Intel is uncomfortable disclosing upcoming processor details to a roomful of highly technical competitors, especially with all its product delivery hiccups over the past few years. We wish Intel would reconsider for the Spring Processor Forum, but we won’t get our hopes up.

In the following pages we’ll take a look at some of the show highlights, starting with the excellent keynote.

Dr. Bernard Meyerson, chief technologist and VP of IBM Microelectronics Systems and Technology Group, kicked off the conference with an interesting exploration into the factors making it much more difficult to attain a doubling of transistor densities every 18 months (Moore’s Law), while also keeping power consumption and current leakage manageable and devices possible to manufacture (gate oxide thicknesses are down to a handful of atoms).

The following slide depicts the typical silicon roadmap, showing the expected transitions to smaller manufacturing processes and commensurate improvements in performance. Dr. Meyerson says that we can’t assume performance will scale with process technology shrinks, given the aforementioned challenges.

Per Dr. Meyerson, performance scaling is not directly implied by Moore’s Law, which relates to transistor feature size and densities on a chip. To obtain a doubling in performance every 18 months requires various other semiconductor and system design techniques.

As mentioned above, the gate oxide is now so thin that it’s only about 6 atoms thick. Any manufacturing variability that introduces even a single-atom defect on either side of the gate oxide changes the thickness by up to two atoms out of six, or about 33%. This is a huge problem and can cause significant gate leakage.

Further, as seen in the past with bipolar transistor designs in mainframes, the power density approached about 3X that of a steam iron, and CMOS is heading in the same direction, which is a big problem.

Innovative new techniques are required to scale performance beyond standard transistor scaling. Dr. Meyerson showed the following chart depicting the percentage performance gains resulting from process technology versus other innovations. Note that the current 90nm and future 65nm generations both get more than 50% of their performance improvements from innovations beyond pure fabrication scaling. (Sorry the chart is blurry, but you’ll get the idea.)

Today, current Intel and IBM processors use strained silicon technology to increase electron mobility and boost performance by about 15% without requiring a move to a new process generation. The following chart shows a few other innovations Dr. Meyerson foresees, though he explicitly stated he’s not necessarily presenting specific IBM roadmap information. Again, the chart is hard to read, but it shows ultra-thin SOI (Silicon-on-Insulator) technology, high-K gate dielectrics, double-gate CMOS, and FinFETs (fin field-effect transistors):

Meyerson also described new interconnect technologies deemed critical to performance scaling (like ultra low-K dielectrics). He also covered power management, and how chips must include many independent “voltage islands” driven by separate power feeds, with multiple “power domains” within each island, that can be separately controlled to allow turning on and off sections of chip logic and memory based on active usage. Another area he addressed was dynamic reconfigurability of multi-core processors, using so-called “eFuses”. Imagine a situation where many cores exist on a chip and a single core goes bad. It would be nice to dynamically take that core offline and bring a spare core online, spreading the workload accordingly to maintain expected system performance.

Performance scaling is really a system problem today, not just a chip problem. The following chart shows many system elements that can be improved to increase performance:

Multi-core chips and highly integrated SoC designs are both key to performance enhancements in various applications, per Meyerson. The following graphic shows a few multi-core designs already in place today from IBM, and many others underway from different vendors:

While we’ve had software virtualization technologies available for years, such as VMware, we’ll see processors adding features to improve software virtualization technologies, better utilize multiple cores, and increase overall system performance. Meyerson noted that with Power5, 64 physical processors can appear as 1280 virtual processors. We’d add that system security and reliability can also be increased by virtualization and processor partitioning technologies. All in all, an excellent presentation!

We’ll next take a look at some slides showing off designs of various upcoming embedded processors and even a high-performance supercomputing solution.

In no particular order, we’d like to give you a glimpse of some of the interesting processor designs we saw at the show that were geared to embedded, server, and even supercomputer applications.

First up is a quad-core BCM-1480 system-on-a-chip (SoC) architecture from Broadcom which might be used in network line cards, or even specialized multi-processor computer systems. Note the dotted lines surrounding Core 2 and Core 3, which reflect the 1480 model versus the dual-core 1280 in the slide below. Also, you can see a very high bandwidth interconnect bus to tie components together (yes, that’s 128Gbps) and HT (HyperTransport) for switched-fabric interconnection and handling of data packets and other I/O:

The following slide shows four Broadcom BCM-1480 chips used to create a 16-way system (that’s four cores per chip):

And here we can see an individual Broadcom SB-1 processor core based on MIPS64 architecture:

MIPS was touting a new set of DSP hardware and instruction-set extensions to its MIPS32 and MIPS64 embedded processor line that increases die size by less than 6%, but will increase performance by an average of 2X in many embedded applications. See the chart below. (We like their tagline “At the core of the user experience”. How true!)

In an example set-top box SoC design below, you can see that the DSP extensions permit many discrete hardware processing components to be removed from the SoC device, and instead run in software by the MIPS processor core. Also, performing such functions with the DSP allows the SoC to run at a lower clock speed, saving power, generating less heat, and improving reliability.

MIPS claims a large percentage of market penetration in many embedded application areas.

Cavium Networks was showing its soon-to-be-sampling (early 2005) Octeon Network Services processor. It’s a highly integrated SoC design targeted at devices providing various combinations of network services, such as firewalling, authentication, mail, VoIP, IPSec/SSL, AntiVirus, content filtering, load balancing, switching, storage networking, and so on. You can see it will have plenty of horsepower with support for up to 16 MIPS processor cores per die!

Here’s a look at the Octeon core’s feature-set.

IBM’s presentation of its BlueGene/L project for massively parallel supercomputing was quite cool. Originally announced in 1999, BlueGene/L will contain 64K (65,536) multi-core SoC PowerPC-based processors as the main processing elements. IBM claims such processors reduce design complexity, reduce time to market (the PowerPC technology is well understood), and reduce power consumption. The two processor cores included in each node will have split duties in most large computational problems. One will be used for computation and the other for managing message passing between nodes. In some problems, both cores can be used for computation and the entire system can generate up to 360 teraflops (trillion floating point operations per second) peak performance. But most of the time “only” 180 teraflops will be the peak performance. No matter the type of application, that’s some serious floating-point ability, folks. Oh, and a fully loaded system will use 16 terabytes (TB) of DDR memory!
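Dividing those system-wide figures by the node count gives a feel for what each node contributes. The short calculation below uses only the numbers quoted above. The one assumption is that “16 terabytes” is meant in binary units (2^40 bytes per terabyte), which the presentation did not specify.

```python
NODES = 64 * 1024            # 65,536 dual-core compute nodes
CORES_PER_NODE = 2
PEAK_BOTH_CORES = 360e12     # flops, when both cores compute
PEAK_ONE_CORE = 180e12       # flops, when one core handles messaging
TOTAL_MEMORY = 16 * 2 ** 40  # bytes, assuming binary terabytes

# Peak throughput per node and per core, in gigaflops.
per_node_gflops = PEAK_BOTH_CORES / NODES / 1e9
per_core_gflops = per_node_gflops / CORES_PER_NODE

# Memory available to each node, in megabytes (2**20 bytes).
memory_per_node_mb = TOTAL_MEMORY // NODES // 2 ** 20

print(f"nodes:                 {NODES}")
print(f"peak per node:         {per_node_gflops:.2f} GFLOPS")
print(f"peak per core:         {per_core_gflops:.2f} GFLOPS")
print(f"memory per node:       {memory_per_node_mb} MB")
print(f"one-core mode vs peak: {PEAK_ONE_CORE / PEAK_BOTH_CORES:.0%}")
```

Each node is modest on its own. The headline performance comes from the sheer number of nodes and the networks tying them together.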

IBM is building the monstrous BlueGene/L system in partnership with the Lawrence Livermore National Lab, with expected completion in the next year or so. Five different network architectures tie together all the components and pass data and control information between the various boards and compute nodes. The supercomputer is intended to focus on numerically intensive scientific problems. The system is logically structured as a 64x32x32 three-dimensional torus of compute nodes. Here’s a slide depicting how the 64K nodes are physically arranged:

And this is a diagram of each dual-core SoC chip. Note each processor has 32K instruction and 32K data L1 caches (not cache coherent) and a separate small 2K L2 (size not listed in diagram). A large 4MB L3 cache is shared by both cores. You can also see some of the different network interfaces at the bottom (Gbit Ethernet, Torus, Collective, and Global Barrier). For more details see IBM’s Research Site.
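To illustrate the 64x32x32 torus organization mentioned earlier, here is a minimal sketch of how nearest neighbors and hop counts work in a three-dimensional torus with wraparound links. The (x, y, z) addressing and the function names are assumptions for illustration only. The presentation did not describe BlueGene/L’s actual addressing or routing scheme.

```python
DIMS = (64, 32, 32)  # torus dimensions quoted in the presentation


def torus_neighbors(node, dims=DIMS):
    """Return the six nearest neighbors of node (x, y, z), wrapping
    around at the edges so every node has a neighbor in each direction."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % size
            neighbors.append(tuple(coord))
    return neighbors


def torus_hops(a, b, dims=DIMS):
    """Minimum hop count between two nodes, taking the shorter way
    around the ring in each dimension."""
    return sum(
        min(abs(p - q), size - abs(p - q))
        for p, q, size in zip(a, b, dims)
    )


print("total nodes:", DIMS[0] * DIMS[1] * DIMS[2])
print("neighbors of corner node:", torus_neighbors((0, 0, 0)))
print("hops to opposite corner:", torus_hops((0, 0, 0), (63, 31, 31)))
print("worst-case hops:", torus_hops((0, 0, 0), (32, 16, 16)))
```

The wraparound links are what distinguish a torus from a plain mesh. The opposite corner is only three hops away, and the worst case is half the ring length summed over each dimension.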

Texas Instruments had a great presentation on its new C5000 DSP architecture’s ability to manage power at a very granular level, using both hardware and software techniques. The following chart symbolically depicts chip-level power management controls with large on/off switches (yes, they really are on the chip... OK, just kidding), which permit various functional areas on the DSP to be individually power managed:

The following two graphics show the savings that can be realized by shutting down various unused areas of the DSP via the application, operating system, or BIOS control:

AMD provided more details of its previously announced dual-core Opteron at the show. The first dual-core Opteron chips are expected to ship in mid-2005, followed later in 2005 by desktop dual-core Athlon 64 parts, and the following die photo shows the actual layout. You can see the power consumption is quite reasonable at 95 watts (though the dual-core parts will be clocked lower than the single-core version) and the chip still fits into the existing Socket 940. Mark Hachman delivered tech specs, marketing and basic performance details in his news story, AMD Tips Dual-Core Details, Performance.

And this next slide shows the basic block diagram. You can see the shared northbridge (Opteron’s northbridge was built from the start to handle two cores) and the same three HyperTransport ports as in the current Opteron. AMD’s director of its server and workstation business unit, Barry Crume, told us that the HT ports are underutilized in most situations with a single core, and the dual-core architecture should generally not be penalized. We questioned the single memory controller, and again Crume stated that the latency improvements of having an onboard controller help with the dual-core designs, and that while some contention may exist, the performance impact is fairly small with most workloads.

Mark Hachman adds: One thing Barry also noted was that the proximity of the memory controller to the processor cores eliminated the need to connect to next-generation memory — the bus isn’t close to being saturated yet. In other words, Crume said, expect to see AMD’s first dual-core processors use DDR-1 400 memory. A DDR-2 controller won’t be added until 2006.

Clearly, if both cores are working on streaming data that is read from or written to DRAM, there will be contention issues. Future versions may have more cores, more memory controllers, use DDR2, DDR3, or FB-DIMM (similar to Intel’s server roadmap), and include improved power management.

The new dual-core Opterons will provide SSE3 support with 10 new instructions. Note that when questioned about various server and workstation software vendors charging more money based on the number of perceived processors in a system, Crume said much progress is being made in this area. Crume thinks such software vendors will soon only be charging extra licensing fees based on the number of physical processors in a system, not based on how many cores are in the system. If this were not the case, you could imagine a $25,000 system of the future with eight 16-core processors (128 cores total), and a $2000 charge per processor core for the software, yielding a $256,000 software cost for a $25,000 hardware system. Not likely. In any case, AMD is boldly moving ahead and the company again showed operational dual-core silicon at the show, similar to demos at the recent IDF.
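The licensing arithmetic in that hypothetical is easy to check, and it shows how far apart per-core and per-processor pricing would end up. The sketch below uses the figures from the example above. The function name and its parameters are made up for illustration and do not reflect any vendor’s actual pricing scheme.

```python
def software_cost(sockets, cores_per_socket, fee, per_core):
    """Total license cost when the fee is charged per core (per_core=True)
    or per physical processor (per_core=False)."""
    units = sockets * cores_per_socket if per_core else sockets
    return units * fee


HARDWARE_COST = 25_000  # the hypothetical future system from the example

per_core_total = software_cost(8, 16, 2_000, per_core=True)
per_socket_total = software_cost(8, 16, 2_000, per_core=False)

print(f"per-core licensing:      ${per_core_total:,}")    # $256,000
print(f"per-processor licensing: ${per_socket_total:,}")  # $16,000
print(f"per-core cost vs hardware: {per_core_total / HARDWARE_COST:.1f}x")
```

Charging per physical processor keeps the software bill below the hardware cost in this scenario, which is the outcome Crume expects vendors to settle on.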

The conference ended with a panel discussion that was supposed to focus on embedded x86 processor benchmarking. It was also supposed to address concerns expressed in an excellent paper written recently in Microprocessor Report by MDR senior analyst Tom Halfhill. Of course, similar to 99% of the benchmark panels I’ve witnessed over the past 25 years, it turned into a free-for-all, with little useful information or guidance, and a bunch of complaining about the state of benchmarks. The same issues regarding benchmark effectiveness, usefulness of the results, misleading users, etc. etc. have been brought up over and over and over throughout the years, no matter what type of computer system benchmarks were being discussed. But it was quite an entertaining panel discussion nonetheless!

The highlight was when Glenn Henry from VIA directed a salvo at Erik Salo (hey, that rhymes) from AMD. Henry claimed that Salo’s defense of AMD’s recent questionable benchmarking practices that showed AMD embedded chips outperforming older VIA chips tested in older platforms was “a bunch of marketing bullshit,” to be exact. It drew much applause . . . I can’t wait for the follow-up act!

Here’s a relaxed shot of the panelists before the fireworks, including from left to right, Erik Salo, Director of Marketing, AMD Personal Connectivity Solutions; Markus Levy, President, EEMBC; Glenn Henry, President, Centaur Division of VIA; Dave Ditzel, CTO, Transmeta; Alan Weiss, CEO, Synchromesh Computing; and Kevin Krewell, Editor in Chief, Microprocessor Report, who tried to keep order, clearly an impossible chore:

Nick Stam
