Posts Tagged ‘Intel’

Shaking Up The Green500

Thursday, January 17th, 2013

By Barry Pangrle
Last September, I wrote about the efficiency of IBM’s Power7+ architecture in my blog. This past June, IBM’s Sequoia supercomputer (a BlueGene/Q system) had just shot to the top of the Supercomputing Top500 chart, clocking in at 16.32 petaflop/s on the Linpack benchmark. Other systems built on the same platform (listed as “BlueGene/Q, Power BQC 16C 1.60GHz, Custom”) were also dominating the top of the Green500 list, with Sequoia itself placing 20th in the megaflop/s/watt category.

Well, that was September and the November 2012 Top500 chart now shows a new leader at the top with a score of 17.59 petaflop/s on Linpack. It is the Oak Ridge National Laboratory’s Titan, a Cray XK7 with 18,688 nodes with each node containing an AMD 16-Core Opteron 6274 and an NVIDIA Tesla K20X.

Oak Ridge National Labs’ Titan Supercomputer (Photo courtesy of ORNL)

I alluded to the potential energy efficiency of GPU-based computing in the September blog, and heterogeneous systems have suddenly jumped to the top of the November 2012 Green500 list. The top four on the Green500 list are now all heterogeneous systems. Titan currently sits in 3rd place, bettering Sequoia’s opening position of 20th (Sequoia has since fallen to 29th). The next-fastest supercomputer, RIKEN’s K Computer, is now in 85th place on the Green500 list, so among the fastest machines in the world, the heterogeneous Cray XK7 Titan ranks far higher on energy efficiency.

How about the other two systems above Titan on the green list? In first place is a mixed Intel Xeon E5-2670- and Intel Xeon Phi 5110P-based system that is good for 0.112 petaflop/s, and in second is a mixed Intel Xeon E5-2650 and AMD FirePro S10000 system that checks in at 0.421 petaflop/s. It will be interesting to see if and how well these systems scale up to performance an order of magnitude or two higher to possibly compete for the top of the Top500 list. We might just get some insight into that soon with the Texas Advanced Computing Center’s Stampede supercomputer, which just went online after two years of development. It is supposed to hit 10 petaflop/s (peak) and is a Dell PowerEdge C8220 cluster filled with Intel Xeon Phi co-processors. It is currently estimated to need more than 6 megawatts, including cooling. Based on the preliminary numbers, it looks unlikely to challenge for the top of the Green500 list, but we should wait for official numbers before drawing any firm conclusions.

Texas Advanced Computing Center’s Stampede Supercomputer (Photo Courtesy TACC)
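
As a rough sketch of why the preliminary Stampede numbers look modest on a megaflop/s-per-watt basis, the quoted figures can be plugged into the Green500-style metric. The function name is made up for illustration, and the inputs are the peak rating and the "more than 6 megawatts, including cooling" estimate quoted above rather than measured Linpack results, so the output is only an optimistic upper bound.

```python
def mflops_per_watt(flops_per_second: float, watts: float) -> float:
    """Green500-style efficiency: megaflop/s delivered per watt consumed."""
    return flops_per_second / 1e6 / watts


# Figures quoted in the text above (peak, not Linpack; power includes cooling).
stampede_peak_flops = 10e15   # 10 petaflop/s (peak)
stampede_power_watts = 6e6    # "more than 6 megawatts", so this is a lower bound

# Peak flops over a lower bound on power gives an upper bound on efficiency.
stampede_upper_bound = mflops_per_watt(stampede_peak_flops, stampede_power_watts)
print(f"Stampede upper bound: {stampede_upper_bound:.0f} megaflop/s per watt")
```

With these inputs the bound works out to about 1,667 megaflop/s per watt. A measured Linpack score is lower than peak, which would pull the real figure down further, while the official power measurement may not be taken the same way as this estimate.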

The fourth and fifth place positions on the Green500 list belong to a Cray XK7 and an IBM BlueGene system, respectively. It will be interesting to see how the list changes again this year. The push to hit the exascale mark is on and the competition promises to be fierce.

—Barry Pangrle is a senior power methodology engineer at NVIDIA. The views expressed in this article are his own and not necessarily those of NVIDIA.

Return To Claremont

Thursday, October 11th, 2012

By Barry Pangrle
Intel’s Gregory Ruhl gave an update presentation on Intel’s Claremont IA-32 near-threshold voltage (NTV) and wide dynamic range processor at Hot Chips 24. (I’ve also written an earlier article about Claremont here.) There are many challenges in building a part that operates across a broad range of voltages and Intel listed reduced ratios of on-current vs. off-current, reduced noise margins and the impact of variability resulting in circuit functional failures. The power/performance profile also becomes extremely sensitive to PVT (process, voltage and temperature) variation. The potential for much improved energy efficiency is the lure for going down a path where the tools and methodologies are not yet mature for this type of design.

Claremont is based on Intel’s legacy P54C IA-32 core (circa 1994). It’s a superscalar, in-order pipeline architecture with a pipelined floating-point unit. It incorporates dynamic branch prediction and separate instruction and data caches. Figure 1 below shows the normalized energy per cycle for the design at different operating points. The savings are quite substantial at lower voltages. It should be noted that this design was manufactured using Intel’s 32nm High-K Metal Gate process with one poly layer and 9 layers of metal (Cu). The core is 1.96 mm² and uses 6 million transistors.


Figure 1: ~5x Energy Reduction Achievable (Source: Intel)

To give you a better idea of what the designers are up against, Figure 2 (from a UC Berkeley lecture found here) shows how a typical MOSFET’s drain current is impacted at a given drain-source voltage when the gate-source voltage varies. Clearly the drive strength falls off (exponentially) below the threshold voltage, which also exacerbates any variability issues. (I previously wrote about the impact of body bias on threshold voltages here.)


Figure 2. ID vs. VGS for a Given VDS
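
To put a number on the exponential fall-off, here is a minimal sketch using the standard textbook sub-threshold model, in which drain current scales as exp(V_GS / (n·kT/q)) below threshold. The slope factor n = 1.5 and the 300 K temperature are illustrative assumptions, not values taken from Intel or from the Berkeley lecture.

```python
import math

BOLTZMANN = 1.380649e-23            # J/K
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temp_kelvin: float = 300.0) -> float:
    """kT/q in volts (about 26 mV at room temperature)."""
    return BOLTZMANN * temp_kelvin / ELEMENTARY_CHARGE


def subthreshold_current_ratio(delta_vgs: float, n: float = 1.5,
                               temp_kelvin: float = 300.0) -> float:
    """Factor by which sub-threshold drain current changes for a V_GS shift.

    n is the sub-threshold slope factor (assumed value, process dependent).
    """
    return math.exp(delta_vgs / (n * thermal_voltage(temp_kelvin)))


def subthreshold_swing_mv(n: float = 1.5, temp_kelvin: float = 300.0) -> float:
    """Gate-voltage change (mV) needed for a 10x change in drain current."""
    return n * thermal_voltage(temp_kelvin) * math.log(10) * 1e3


swing = subthreshold_swing_mv()
print(f"Swing: {swing:.0f} mV/decade")
print(f"A 50 mV V_GS (or Vt) shift changes current by "
      f"{subthreshold_current_ratio(0.05):.1f}x")
```

The point of the sketch is the sensitivity: in this regime a threshold shift of a few tens of millivolts from process variation changes the drive current by a multiple rather than a few percent.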

The part was designed to hit three target operating points: 0.5V/66MHz, 0.75V/333MHz and 1.05V/525MHz. In fact, Intel reported the results shown in Figure 3 below where they were actually able to achieve a range of 0.38V/10MHz to 1.1V/741MHz using 1.5mW and 445mW of power respectively.


Figure 3. Power/Performance Characteristics (Source: Intel)
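
The two measured endpoints quoted above can be converted to energy per clock cycle by dividing power by frequency. This is a small sketch using only the numbers reported in the text; the function name is made up for illustration.

```python
def energy_per_cycle_pj(power_watts: float, freq_hz: float) -> float:
    """Average energy per clock cycle in picojoules (power / frequency)."""
    return power_watts / freq_hz * 1e12


# Endpoints reported by Intel, as quoted in the text.
low_v = energy_per_cycle_pj(1.5e-3, 10e6)     # 0.38 V, 10 MHz, 1.5 mW
high_v = energy_per_cycle_pj(445e-3, 741e6)   # 1.1 V, 741 MHz, 445 mW

print(f"0.38 V: {low_v:.0f} pJ/cycle")
print(f"1.1 V:  {high_v:.0f} pJ/cycle")
print(f"Ratio:  {high_v / low_v:.1f}x")
```

This gives about 150 pJ per cycle at the low end versus about 600 pJ at the high end, roughly a 4x difference. The ratio between these two particular endpoints need not equal the ~5x in Figure 1, which may be taken between different operating points.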

In order to get to this point, Intel reported performing variation-aware library pruning to ensure reliable operation. The team limited transistor stacks to three and used no wide transmission-gate muxes and no contention circuits. They pruned minimum-sized and low-drive-strength cells, redesigned sequentials with interruptible and upsized keepers, used a 10-transistor single-ended transmission-gate register file cell topology, and still maintained legacy full-swing 3.3V I/O support. All in all it’s an impressive effort with promising results, and it will be interesting to see where Intel takes this technology going forward.

—Barry Pangrle is an independent power architect in Silicon Valley.

Dealing With Variability

Thursday, July 12th, 2012

By Barry Pangrle
Process, voltage and temperature, a.k.a. PVT, are well known to designers who are working to complete “signoff” for their designs. In order for a design to be production-ready, it’s necessary to ensure that the design is going to yield parts at a sufficiently high percentage for profitability and that it will still operate within the expected variation of the process and environment.

In the quest for ever-more energy-efficient designs, lowering operating voltages is a promising approach. Because dynamic power is proportional to the voltage squared, it may be surprising that Vdd has not really dropped very much for most designs since we crossed the 100nm threshold almost a decade ago. For performance reasons, Vdd has generally been set at roughly 3x-4x the threshold voltage. In order to keep leakage in check, threshold voltages have stopped scaling, so the question is: how much performance do we lose by continuing to scale down Vdd with relatively unchanged threshold voltages, and what other factors come into play?
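
One way to get a feel for that trade-off is the widely used alpha-power law, which approximates gate delay as proportional to Vdd / (Vdd - Vt)^alpha. The sketch below is only an approximation under assumed values: Vt = 0.3 V and alpha = 1.3 are picked for illustration and do not come from any of the papers discussed here, and the model stops being meaningful as Vdd approaches Vt.

```python
def relative_fmax(vdd: float, vt: float, alpha: float = 1.3) -> float:
    """Alpha-power-law speed estimate (arbitrary units): (Vdd - Vt)^alpha / Vdd.

    Only meaningful above threshold; vt and alpha are assumed, not measured.
    """
    if vdd <= vt:
        raise ValueError("alpha-power-law sketch only applies for vdd > vt")
    return (vdd - vt) ** alpha / vdd


VT = 0.3              # assumed threshold voltage in volts
V_REF = 4 * VT        # reference supply at 4x the threshold voltage
f_ref = relative_fmax(V_REF, VT)

for multiple in (4.0, 3.0, 2.0, 1.5):
    vdd = multiple * VT
    speed = relative_fmax(vdd, VT) / f_ref
    energy = (vdd / V_REF) ** 2   # dynamic energy per operation scales with V^2
    print(f"Vdd = {multiple:.1f} x Vt: speed {speed:.2f}x, "
          f"dynamic energy/op {energy:.2f}x")
```

Under these assumptions, dropping the supply from 4x to 2x the threshold voltage cuts dynamic energy per operation to a quarter while roughly halving the clock rate, and the speed penalty grows quickly below that.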

In a paper published in the January 2009 IEEE Journal of Solid-State Circuits, Himanshu Kaul et al. from Intel described a 320 mV implementation of a motion estimation accelerator in 65nm CMOS. Some of the material was then also presented by Greg Taylor at the 2009 EPEPS and can be found here.


Figure 1. Frequency variation with temperature. Source: Intel

Figure 1 shows how performance is impacted by voltage and temperature. Note that when operating at 1.2 V, the difference in Fmax from 50°C is only ± 5% when the temperature varies from 0°C to 110°C, whereas at 0.320 V the variation is ± 2x. This is a huge difference to have to compensate for during the design process and is too large to just try to “margin” into the design. Another point to take into account here is the slope of the curves and how much steeper they are at lower voltages. If I just eyeball the graph for a rough estimate, it looks like about a ± 0.05V difference around 0.320 V at 50°C will also lead to about the same ± 2x variation in performance. These designs are incredibly sensitive to any voltage fluctuations.


Figure 2. Frequency variations across fast-slow process skews. Source: Intel

So Figure 1 gives us an indication of the voltage and temperature impact, but we haven’t looked at process variation. Figure 2 shows how process variation affects performance. Again we see that the impact on performance due to variation is “magnified” at lower voltages. Process variation when operating at 1.2 V accounts for a ± 18% change in performance whereas at 0.320 V it accounts for another ± 2x difference in Fmax. It should be clearer now why everyone wasn’t immediately rushing to run at ultra-low voltages. Designing complex chips is hard and designing them to run at really low voltages is harder.

How about the promise of new process technologies? The most radically different new process technology in high-volume production today is Intel’s 22 nm Tri-Gate CMOS and I’ve referenced the figure below in earlier articles (here and here).


Figure 3. 22nm Tri-Gate vs. 32nm Planar. Source: Intel

Figure 3 certainly provides hope that there may be some promise for reduced voltage levels in the newer process technology. Perhaps unsurprisingly though for such a radically new process, there are questions about variability. In a blog on the GSS site, there are diagrams and simulations based in part on the TEMs here from Dick James’ Chipworks blog.

Figure 4 below, from Chipworks, clearly shows the process variation between transistors with some being more rectangular and others more triangular. According to the GSS simulations and Professor Asen Asenov, the rectangular transistors perform better and to my mind seem much more like a true “Tri-Gate” transistor. Professor Asenov also is quoted here as saying, “I think Intel just survived at 22nm. I think bulk FinFETs will be difficult to scale to 16nm or 14nm. I think that SOI will help the task of scaling FinFETs to 16nm and 11nm.” So before victory is declared, it appears that there is going to be plenty of work to keep the process engineers busy going forward. The additional complexity will certainly impact the economics of these newer nodes as well.


Figure 4. TEM Image of NMOS Gate and Fin Structure. Source: Chipworks

The road to near-threshold and sub-threshold operation also requires a lot of work and ingenuity on the design side. Circuits can be designed to have better characteristics for withstanding variation but often at a cost in area or performance or both. Of course, if the design techniques reduce the need for margining then the practical useable performance should improve. There will be a lot more study in these areas to help bring near and sub-threshold designs to market and to deal with variability. From a process standpoint, it appears that variability will continue to be an important issue for some time.

—Barry Pangrle is a solutions architect for low power design and verification at Mentor Graphics.

Intel’s Hot New Tri-Gate Processors

Thursday, May 10th, 2012

By Barry Pangrle
Intel announced its newest third-generation Core processors on April 23rd. There has been much anticipation surrounding these new chips from Intel, largely because of their new 22nm tri-gate process technology used to fabricate these devices.

Figure 1, from the presentation entitled, “Intel’s Revolutionary 22nm Transistor Technology,” by Mark Bohr and Kaizad Mistry, shows the dramatic improvement in performance at lower voltages for the new tri-gate technology.

Fig. 1. Delay vs. Voltage for 22 nm Tri-Gate and 32 nm Planar.

Intel is calling its new third-generation processors a “tick-plus” in its tick-tock model. Typically, a “tick” represents the move of a previous architecture to a new process technology and then the “tock” is a new architecture on that same newer technology node. In this case, the CPU architecture for the new “Ivy Bridge” parts is mostly the same as for the older “Sandy Bridge” parts, but the graphics portion of the newer chips has received notable enhancements over the previous version; hence the tick-plus designation.

The various markets for processors have been carved up roughly based on the Thermal Design Power (TDP) of the processors. Table 1 below shows a typical breakdown for the different applications and the TDP ranges for those parts. Manufacturers often will try to squeeze the most performance out of their designs within the power envelope for the target market segment.

We’re going to take a look at the top-end high-performance desktop parts, which Intel calls Core i7, to compare the old 32nm Sandy Bridge parts to the new 22nm tri-gate Ivy Bridge parts. Table 2 shows two 32nm parts and three new 22nm parts. A more extensive comparison table is available here. One thing that immediately stands out is that the CPU clock speeds for the fastest 32nm and 22nm parts are the same and that the GPU clock is actually slower in the newer 22nm parts. The architectural improvements of the new HD Graphics 4000 vs. HD Graphics 3000, and the boost in memory bandwidth from 21 GB/s to 25.6 GB/s, enable the new parts to outperform the older parts in graphics and slightly from a CPU standpoint. Still, I find it somewhat surprising that the clock speeds weren’t increased.

So, other than scaling, what has the new technology brought? Well the new top end part (i7 3770K), with about 20% more transistors, is now rated at a TDP that is about 19% lower. So in this case, the new technology has really primarily been used to provide more energy efficient parts and it’s clearly a nod to power as the primary driver.

Another interesting note from Table 2 is that while the TDP has been reduced by about 19%, the area has been reduced by closer to 26%, which means that the power density, or the amount of power that needs to be dissipated per square millimeter, has actually gone up. This could bring up some interesting cooling issues.
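
The power-density claim follows directly from the two percentages. As a quick sketch using only the rounded figures quoted above (not the exact TDP and die-area values from Table 2):

```python
def power_density_change(tdp_reduction: float, area_reduction: float) -> float:
    """Fractional change in W/mm^2 when TDP and die area both shrink."""
    return (1.0 - tdp_reduction) / (1.0 - area_reduction) - 1.0


# Rounded reductions quoted in the text: TDP down ~19%, area down ~26%.
change = power_density_change(tdp_reduction=0.19, area_reduction=0.26)
print(f"Power density change: {change:+.1%}")
```

With those inputs the new part has to shed roughly 9 to 10 percent more power per square millimeter at its rated TDP, even though its total power is lower.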

Figure 2 below actually is from Intel’s Desktop 3rd Generation Intel Core Processor Family and LGA1155 Socket Thermal Mechanical Specifications and Design Guidelines (TMSD) (Figure 2.1) available here.

Fig. 2. Processor Package Assembly Sketch. (Source: Intel)

A number of online sites (notably here and here) have mentioned that Ivy Bridge seems to heat up rather quickly as the voltage is increased. This is probably of little concern to most customers, but for those who buy high-end parts for overclocking, this has raised some eyebrows. It appears that at least a partial answer to this issue may have been found here. As diagrammed in Figure 2, IHS stands for Integrated Heat Spreader and TIM for Thermal Interface Material. Looking at packaged parts that had the IHS pried apart from the die and the substrate revealed that rather than using a fluxless solder approach, as used in Sandy Bridge, a TIM paste with presumably a much lower (order of magnitude) thermal conductivity has been used, which would severely impact the ability to cool these parts. It will be interesting to see if Intel stays with the TIM paste or goes back to fluxless solder. In the meantime, it appears that the serious overclockers will have to play a waiting game or perhaps try even more extreme measures to push these new parts to ever-higher limits.

—Barry Pangrle is a solutions architect for low power design and verification at Mentor Graphics.

FinFET Vs. Tri-Gate

Thursday, April 5th, 2012

By Barry Pangrle
A large portion of the Common Platform Technology Forum, recently held in Santa Clara, was dedicated to presentations about 14nm process technologies and FinFETs. If you missed the event and are interested, many of the presentations are available from a link off of the Common Platform home page. Dick James wrote a nice article about GlobalFoundries’ claim that its FinFETs are better than Intel’s. A number of things struck me as being interesting about this statement, but the foremost was that Intel would likely claim it isn’t even using FinFETs. Its new technology uses “Tri-Gate” transistors. So, you may be asking yourself, what’s the difference? In a paper authored by Robert Chau and others at Intel, they discussed different types of CMOS transistors and included a diagram similar to the one below.

The device in (a) is representative of a FinFET while (b) is representative of a Tri-Gate FET and (c) is a planar FET device. The FinFET includes a spacer at the top of the fin and is considered a dual-gated device with a gate on two sides of the channel. The Tri-Gate FET, on the other hand, is gated on three sides of the channel and hence the name “Tri-Gate.” The authors claimed that the Tri-Gate requirements were the most relaxed and allowed for improved manufacturability. I wrote a blog briefly discussing Intel’s Tri-Gate technology here.

I also wrote a blog about the SEMICON West panel session on FETs discussing different FET devices and mentioned the overview presentation on FinFETs given by Serge Biesemans of IMEC. One of the points made in Serge’s presentation is that there are a host of new device architectures that are aimed at fully depleted channels for better short-channel control (with FinFETs just being one of them). Serge also pointed out in his presentation that as fins get thinner, there is less control over the threshold voltage (Vt) of the devices. Another way to control the threshold voltage though is through work-function tuning of the metal gate process. This would seem to imply the need for the availability of multiple work-function gate types in order to provide a multi-Vt solution and along with it, additional process steps to manufacture the devices.

So, this brings us back to Subramani “Subi” Kengeri’s claim that GlobalFoundries has a better FinFET process for mobile SoCs. As was mentioned last summer at the SEMICON West conference panel, mobile SoC designers have become reliant upon multi-Vt processes to help produce better energy-efficient designs. The ability to choose between cells built with different threshold-voltage level transistors allows designers to optimize performance and power by using slower and less leaky transistors off of the critical paths. Some questions that will need to be answered before Subi’s claim can be verified are: 1) How much control is available over the threshold voltages; 2) What if any impact is there on process variation (i.e. how tightly controlled is the Vt process); 3) How much impact is there over the leakage of a FinFET by changing its threshold voltage, and 4) What additional complexities does this bring to the manufacturing process and what if any impact is there on yields?

If GlobalFoundries can produce a highly manufacturable multi-Vt process that still gives designers another knob for controlling power, this could be a real benefit for both them and their customers. Given that 28nm is the most recent technology node put into production at GlobalFoundries, we will have some time to wait before we know for sure.

–Barry Pangrle is a solutions architect for low power design and verification at Mentor Graphics.

Intel vs. AMD: Who’s Right?

Thursday, March 8th, 2012

By Barry Pangrle
It’s all about the system. One energy-efficient component doesn’t an energy-efficient system make. There were two big announcements recently made by the industry’s two x86 designers. One was by Intel announcing its new Sandy Bridge Xeon Processor E5-2600 product family, and the other one was by AMD announcing its planned acquisition of SeaMicro.

Both of these announcements emphasized the power savings that these technologies bring to the marketplace, and it looks as if Intel is continuing to extend its technological lead in the high-end server processor market with the new E5-2600. Still, the SeaMicro deal looks like an interesting play to address overall server energy-efficiency by means other than only improvements to the CPU. Both of these announcements also mentioned the importance of the fast-growing server market to address cloud computing.

As we’ve discussed before, choosing the right architecture for a given task is absolutely essential for generating a power- or energy-efficient design. World-renowned computer architect, Seymour Cray, famously quipped “If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?” Well, if the field changes, 1024 (or in SeaMicro’s case, maybe 768) chickens look pretty good.

In a talk reported last year, Dileep Bhandarkar, distinguished engineer at Microsoft, called for 16-core SoCs based on Intel Atom or AMD Bobcat cores that would also integrate all of the core logic and I/O functions currently placed in separate chips. He presented the chart below to show how components in a system need to be balanced to get the best power, performance and cost outcome. Bhandarkar stated,

“The conclusion was clear—the best balance of performance, power, and price was found in the lower-power processors.”

In a whitepaper published by SeaMicro, SeaMicro states that, “while much of the industry discussion centers around the CPU, the CPU consumes only one-third of the power used by a server. The remaining two-thirds is consumed by hundreds of other components. If one seeks to reduce server power consumption by 75%, one needs to focus on the non-CPU power-drawing components first, and only then on the CPU.”
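
The arithmetic behind that statement can be sketched as follows. The one-third CPU share and the 75% target are the whitepaper figures quoted above; the function and the CPU-reduction scenarios are hypothetical and only illustrate how the two budgets interact.

```python
def required_non_cpu_reduction(target_total_reduction: float,
                               cpu_share: float,
                               cpu_reduction: float) -> float:
    """Fractional cut needed in non-CPU power to hit a total-power target.

    A result above 1.0 means the target is unreachable with that CPU reduction.
    """
    non_cpu_share = 1.0 - cpu_share
    remaining_budget = 1.0 - target_total_reduction       # allowed total power
    remaining_cpu = cpu_share * (1.0 - cpu_reduction)      # CPU power still drawn
    remaining_non_cpu = remaining_budget - remaining_cpu   # left for everything else
    return 1.0 - remaining_non_cpu / non_cpu_share


CPU_SHARE = 1.0 / 3.0   # CPU share of server power, per the whitepaper
TARGET = 0.75           # desired reduction in total server power

# Hypothetical scenarios for how much CPU power is cut.
for cpu_cut in (0.0, 0.5, 0.75, 1.0):
    needed = required_non_cpu_reduction(TARGET, CPU_SHARE, cpu_cut)
    verdict = "unreachable" if needed > 1.0 else f"{needed:.0%} cut in non-CPU power"
    print(f"CPU power cut by {cpu_cut:.0%}: {verdict}")
```

Even a CPU that drew no power at all would save only one-third of the total, so the other components would still have to shed well over half of their power to reach a 75% reduction.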

SeaMicro also realized that the challenge in the data center had turned into a problem of handling huge volumes of relatively modest computational workloads rather than solving a few complex problems (chickens vs. oxen). These workloads in part are generated by the millions of users wanting to perform searches, view Web pages, check e-mail, and read the news. Put simply, large, complex, high-speed, multi-core, multi-socket CPUs are “overkill,” and the mismatch between the CPU and the primary workload in the data center is also a fundamental underlying cause of the data-center power issue.

Pushing data around is expensive from an energy standpoint. Whether you are sending data across a chip, multiple chips or multiple boards, greater parasitic capacitance means greater energy to move that data. Energy-efficient systems are all about the efficient choreography of the movement of data within the system.

SeaMicro addressed the energy-efficiency issue by:

  1. Inventing technology that eliminates 90% of the components from the motherboard, leaving only three components: CPU, DRAM, and the SeaMicro Freedom ASIC, shrinking the motherboard to the size of a business card;
  2. Inventing technology that makes the necessary remaining components more efficient by reducing the power consumed by the CPU and its associated chipset by consolidating functions and by powering down unused CPU and chipset features;
  3. Inventing technology that ties together hundreds of the ultra-efficient mini-motherboards using the SeaMicro Freedom fabric, and
  4. Inventing technology that dynamically ensures that the system runs at its most efficient level by combining CPU management and load balancing, allowing workloads to be allocated dynamically to specific CPUs on the basis of power use.

SeaMicro claims that this combination of technologies produces a system that uses one-fourth the power and takes one-sixth the space of the best-in-class competition. The SeaMicro Freedom fabric is also independent of the CPU instruction set architecture (ISA), so that the use of other ISAs, like ARM, is still an open possibility. For applications that do need more compute power, SeaMicro announced the production of new products using more powerful Intel Xeon processors, too. Going forward, the obvious expectation is that SeaMicro will transition to parts designed by its acquiring company, which also opens some interesting possibilities for future APU functionality.

–Barry Pangrle is a solutions architect for low-power design and verification at Mentor Graphics.

Speed Demons

Thursday, December 1st, 2011

By Barry Pangrle
For extreme world record performance levels, the required power levels are also typically extreme. It’s that age-old battle against diminishing returns to squeeze out every last drop of performance versus practical limits and wallets.

For example, a top fuel dragster can consume about six gallons of fuel for a quarter-mile run down the strip. As has previously been shown here, x86 clock frequencies have gone from doubling nearly every year to practically stagnant over the past seven. Of course, power has been a major player in putting a ceiling on those frequencies. So in honor of that high-speed clock demon that was released back in November 2004, the Intel Pentium 4 HT 570J, we will look at recent new records in x86 (over) clocking frequencies (“overclocking” is a term used for pushing a chip past its specified clock frequency).

AMD recently released its new “Bulldozer” based x86 parts that are composed of modules with two separate integer paths that share floating-point resources. Roughly speaking, the concept is to get two “cores” into one module that uses slightly more area than a standard single core. Breaking from the current trend of not designing for higher clock speeds, Bulldozer in fact seems to have been architected to run at somewhat higher clock frequencies. This new part presented itself as an opportunity for the over-clocking community to jump in and try to break the previous record held by an Intel Celeron D 352 clocked at 8308.94 MHz.

Heat kills chips, and if the clock is pushed past a part’s rated frequency, dynamic power typically increases linearly with the clock frequency so that the heat produced from the chip also will increase. One trick used by overclockers is to also increase the supply voltage to help increase the chip’s ability to run at a higher clock frequency. As you might imagine, because the dynamic power increases quadratically with the voltage, this can really increase the amount of heat being generated. Overclockers typically use more exotic cooling than the standard heatsink fan often provided with a boxed part. The cooling solutions range from larger heatsinks made of metals that have good thermal conductivity, like copper, to elaborate liquid cooling systems that on the far end of the spectrum included liquid nitrogen and liquid helium.

As part of the launching of these new parts, AMD held an extreme overclocking session to go after the record. There’s an entertaining video here on YouTube (it’s just a bit over two minutes long and looks pretty good at 1080p). The result was what looked like a fairly stable 8.429 GHz—assuming you kept the liquid helium flowing. The power for the part wasn’t reported, but that top fuel dragster comes to mind.

Since then an overclocker in Taiwan named Andre Yang has twice reported faster results, and I’ve included them in the table below along with an image of the setup shown in the AMD video. It’s interesting to note that the first three entries are all running within 16 mV of 2.0 V. Even the P4 570J ran at only 1.425 V and that was in a 90 nm technology. It looks like Andre was able to crack 8.5 GHz by pushing the voltage to 2.064 V. It could possibly be that the max voltage before part failure ultimately will limit how high these clock frequencies could go.
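
To see why that dragster analogy is apt, the linear-in-frequency and quadratic-in-voltage relationships described earlier can be combined into a single scaling factor. In the sketch below the overclocked point uses the roughly 2.0 V and 8.429 GHz figures from the text, while the stock operating point (1.3 V at 3.6 GHz) is an assumption chosen only for illustration. Leakage and the effect of cryogenic temperatures are ignored.

```python
def dynamic_power_ratio(v_new: float, v_old: float,
                        f_new: float, f_old: float) -> float:
    """Relative dynamic power, assuming P is proportional to V^2 * f."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Assumed stock operating point (illustrative only, not from the article).
STOCK_VOLTS, STOCK_GHZ = 1.3, 3.6
# Overclocked point taken from the figures quoted in the text.
OC_VOLTS, OC_GHZ = 2.0, 8.429

ratio = dynamic_power_ratio(OC_VOLTS, STOCK_VOLTS, OC_GHZ, STOCK_GHZ)
print(f"Dynamic power vs. assumed stock point: about {ratio:.1f}x")
```

Under these assumptions the voltage increase and the frequency increase each contribute a factor of a little over two, for a combined dynamic power several times the stock level.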

Not to be left out, the memory community has also gotten into the game of setting new records recently and you can find more here about Corsair’s DDR3-3467 memory. Coincidentally or not, it was also run with an FX-8150 liquid nitrogen-cooled PC.

So, you might be wondering, if it’s possible to take these parts and bump up the voltage to run them faster, if you were to run them slower would it be possible to reduce the voltage levels to further reduce power? The answer is of course “yes,” and we might just look at that in a future blog too.

Best Wishes for a Happy, Healthy and Prosperous 2012!

–Barry Pangrle is a solutions architect for low-power design and verification at Mentor Graphics.

Intel’s Claremont Near-Threshold Voltage IA Core

Thursday, October 6th, 2011

By Barry Pangrle
Intel announced many new technologies at its recent Intel Developer Forum (IDF) held from Sept. 13-15 in San Francisco, but the one announcement that jumped out at me was the unveiling of its work on a near-threshold voltage (NTV) processor named “Claremont.” For this exercise, Intel chose an older Pentium design to help minimize the number of variables the engineers would have to deal with in order to get the chip working at the voltage levels they were targeting. Intel CEO Paul Otellini mentioned the Claremont chip in his keynote and then CTO Justin Rattner’s keynote (minutes 46-53) went into more depth with Shekhar Borkar and Sriram Vangal also taking the stage to help with the demonstrations. Sriram has also provided more information about Claremont in his blog.

It has long been known that reducing VDD leads to significant energy savings, but there are numerous issues with pushing the supply voltage down to near-threshold levels. The threshold voltage of a transistor is roughly thought of as the gate voltage relative to the source voltage where a transistor turns “on.” In fact, CMOS transistors are not perfect switches: current flows even before the threshold voltage is reached. This sub-threshold leakage current is typically thought of as an undesirable characteristic of CMOS devices.

Designing memories becomes more challenging, and differential sense amplifiers, which typically sense about a 100 mV difference, become less effective as VDD decreases to near-threshold voltages. Claremont is reported to run at voltages as low as 400 to 500 millivolts. Another challenge to running at such low voltages is the variation in the threshold voltages of the transistors themselves on a chip. If the variation is large enough and the supply is very close to the nominal threshold voltage, some transistors might actually fall into the sub-threshold range, adversely impacting the timing of the design. Overcoming these challenges to produce a working x86 processor design at near-threshold voltages is a significant accomplishment.

The slide shown below, from Justin Rattner’s keynote presentation, diagrams an energy efficiency curve over a range of operating voltages.

While the slide shows approximately a 5X gain in efficiency that matches Intel’s results with Claremont, Shekhar Borkar had claimed about 8X possible gains using NTV. Shekhar explained that the results showed only a 5X gain due to the use of an old Pentium architecture. In fact, the Intel engineers had to go “dumpster diving” and looking on eBay to find motherboards that would work with the chip. Again it’s interesting to note the impact that the architecture makes on the overall energy efficiency of the design. Shekhar later clarified that ~8-10x should be possible with architectural improvements.

Another important note about this chip was that it was also designed to work at higher voltages and frequencies. Claremont can run at clock frequencies that are 10x those at near-threshold voltages, giving it a dynamic range capable of running at much higher performance levels and then dropping back down to very low power levels. The demonstration showed Claremont running on a small solar cell and at a claimed power of less than 10 milliwatts, all in all a pretty impressive display. While Claremont is a proof of concept vehicle and won’t be sold as a product, expect to see the fruits of this research showing up in future Intel processors.


–Barry Pangrle is a solutions architect for low-power design and verification at Mentor Graphics.

Interconnect Power II

Thursday, September 8th, 2011

Barry Pangrle
After submitting last month’s blog, I read a very interesting article by Deepak Sekar analyzing Intel’s 22 nm FinFET technology versus a hypothetical planar 22 nm CMOS technology. Beyond the advantage of being able to use a 140 mV reduction in the supply voltage for the trigate technology, Deepak did a breakdown analysis of the predicted power across a representative microprocessor mobile logic core design. Deepak also references a SLIP’04 paper entitled “Interconnect-Power Dissipation in a Microprocessor” by Magen et al., describing results for the analysis of an Intel Pentium M processor (0.13 um with 77 million transistors), with more detail shown in their presentation. The first two pie charts below are from Magen’s SLIP’04 paper and the last bar chart is from Deepak’s article.

The analysis for the Pentium M includes a detailed breakdown of the dynamic power. The authors model three types of capacitance: interconnect (metal wires), gate (actual gates of the MOS transistors) and diffusion (the effective capacitance of the junction between the diffusion region used to form the source and drain and the well or substrate).


Chart 1

As can be seen in Chart 1, most of the dynamic power is used in the Interconnect, which shouldn’t be too surprising. The astute reader may also be wondering what happened to the short-circuit (or crowbar) power. The authors claimed that it was later added by using an overall factor of about 10% and that the focus of the paper was on energy dissipation due to the switching of interconnection capacitances (gate and diffusion included).
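
The three-capacitance breakdown can be sketched as a small first-order model. This is my own reading of the description above, not the authors’ code: the `SwitchedCap` class and `dynamic_power` function are hypothetical names, the flat 10% uplift is one interpretation of the short-circuit factor, and the capacitance values are placeholders rather than Pentium M data.

```python
from dataclasses import dataclass, asdict


@dataclass
class SwitchedCap:
    """Effective switched capacitance per clock cycle, in farads.

    Activity factors are assumed to be folded into each value.
    """
    interconnect: float
    gate: float
    diffusion: float

    def total(self) -> float:
        return self.interconnect + self.gate + self.diffusion

    def shares(self) -> dict:
        """Fraction of switching power attributable to each capacitance type."""
        total = self.total()
        return {name: value / total for name, value in asdict(self).items()}


def dynamic_power(cap: SwitchedCap, vdd: float, freq_hz: float,
                  short_circuit_factor: float = 0.10) -> float:
    """Switching power C * V^2 * f, scaled up by a flat short-circuit allowance."""
    switching = cap.total() * vdd ** 2 * freq_hz
    return switching * (1.0 + short_circuit_factor)


# Placeholder values chosen only to exercise the model; not Pentium M data.
cap = SwitchedCap(interconnect=5e-9, gate=3e-9, diffusion=2e-9)
print(cap.shares())
print(f"{dynamic_power(cap, vdd=1.0, freq_hz=1e9):.2f} W")
```

Because voltage and frequency multiply every capacitance term equally, the power shares equal the capacitance shares, which is why a pie chart of switched capacitance doubles as a pie chart of dynamic power.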


Chart 2

Chart 2 shows the total dynamic power used for clocks and signals both locally and globally. In this case, 58% of the dynamic power is in the signals and 42% is in the clocks. It should be remembered though that this part was designed as an efficient “low-power” processor and that higher performance parts could easily have a higher percentage of power in the clocks. Jan Rabaey has a nice chart in slide 1.30 of his book, Low Power Design Essentials, showing the variation in clock and logic power over four different processor designs.


Chart 3

So we’ve had a chance to look at some data from an older processor. How do things shape up when projected to 22 nm? Clearly, a big chunk of the power is still in the clock and wires, and this seems to hold whether we’re looking at trigate or planar transistors. It looks like a reduction in the dielectric constant for the interconnect layers will still be welcome at 22 nm. Another interesting point about the projected power savings of approximately 28% for trigate versus planar transistors is that Deepak used a supply voltage of 0.82 V for the planar devices and 0.68 V for the trigate devices. A straight V-squared analysis, (0.68/0.82)² ≈ 0.69, would yield about a 31% savings in dynamic power alone. This may leave the door open for a planar process that improves threshold variation enough to let its supply voltage drop further as well.
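
For readers who want to check the V-squared arithmetic, here is a small sketch. It also inverts the relation to ask what planar supply voltage would give a 28% saving on a dynamic-power-only basis; that second step is my own illustration and not something taken from Deepak’s article, which projects total power.

```python
import math

V_PLANAR = 0.82   # supply voltage used for planar 22 nm in the article
V_TRIGATE = 0.68  # supply voltage used for trigate 22 nm in the article


def v_squared_saving(v_ref: float, v_new: float) -> float:
    """Fractional dynamic-power saving from a supply change, all else equal."""
    return 1.0 - (v_new / v_ref) ** 2


def supply_for_saving(v_ref: float, saving: float) -> float:
    """Supply voltage that yields the given V-squared saving relative to v_ref."""
    return v_ref * math.sqrt(1.0 - saving)


print(f"V-squared saving: {v_squared_saving(V_PLANAR, V_TRIGATE):.1%}")
print(f"Planar supply for a 28% saving: {supply_for_saving(V_PLANAR, 0.28):.3f} V")
```

On this simplified basis, a planar process would need to get its supply down to just under 0.70 V to match a 28% saving, which gives a sense of how much threshold-variation improvement would be required.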

–Barry Pangrle is a solutions architect for low-power design and verification at Mentor Graphics.

Intel’s New Machine

Thursday, May 12th, 2011

By Barry Pangrle
Power is one of those product characteristics that touches on every phase of the design and verification process, all the way from the system architecture down to the fabrication process used for the actual IC implementation. In this month’s blog we take a look at process technology, and here it appears that the technology rich are getting richer.

On May 4, 2011, Intel announced that their next generation “Ivy Bridge” processor will be built using 22nm 3D tri-gate transistors (also commonly referred to as FinFETs). These transistors will also continue the use of high-k metal gates (HKMG), so the new process builds on the proven HKMG process capabilities already demonstrated by Intel. In their presentation entitled “Intel’s Revolutionary 22nm Transistor Technology”, Mark Bohr and Kaizad Mistry include the diagram shown below in Figure 1.

Figure 1. Transistor Gate Delay

The diagram shows that the 22nm tri-gate process can deliver the same gate delay as the 32nm planar process with a 0.2 V reduction in the operating voltage. At the “fast” end of the curve shown above, that looks like a reduction from approximately 1.0 V to 0.8 V, which to first order (active power scales with the square of the supply voltage) should be good for a 36% decrease in active power. At the other end of the spectrum, the reduction is from approximately 0.8 V to 0.6 V, or roughly a 44% decrease in active power. Bohr and Mistry claim a >50% reduction in active power with good performance. Taking into account that there are additional effects on the power based on the technology, their claims appear completely plausible and quite outstanding.
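
The first-order estimates above come from the usual switching-power relation, P = a·C·V²·f. A minimal sketch, assuming activity, capacitance and frequency are held constant and using voltages read approximately off Figure 1 (the function name is mine):

```python
def active_power_reduction(v_old: float, v_new: float) -> float:
    """Fractional drop in switching power when only the supply voltage changes.

    First-order model: P = a * C * V**2 * f, with activity a, capacitance C
    and frequency f held constant, so the ratio depends on voltage alone.
    """
    return 1.0 - (v_new / v_old) ** 2


# Voltage pairs read approximately off the gate-delay curve (Figure 1).
for label, v_old, v_new in [("fast end", 1.0, 0.8), ("slow end", 0.8, 0.6)]:
    saving = active_power_reduction(v_old, v_new)
    print(f"{label}: {v_old:.1f} V -> {v_new:.1f} V saves {saving:.0%}")
```

The same 0.2 V step buys a larger relative saving at the low-voltage end because it is a bigger fraction of the starting voltage.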

I’ve included some diagrams for SoC consumer portable power trends from the 2008 Update, 2009, and 2010 Update ITRS Reports to help put this into better context. I’ve used these diagrams in a number of presentations to help illustrate the trend in power going forward. I’m often asked about the two dips that occur in 2014 and 2019. In the 2008 Update, there is a push-out of the metrics for FDSOI (Fully Depleted Silicon on Insulator) and MG (Multi-Gate) and an extension of the metrics for bulk planar devices. The start of the metrics for FDSOI is listed as 2013 and for MG as 2015, and the really noticeable impacts on the diagrams appear to show up in 2014 and 2019. The basic shape of the plots remains relatively similar through all three reports. (One really noticeable change is in the Y-axis of the plots, where the tops of the charts are labeled at 3.5 W, 14.0 W, and 8.0 W respectively.) Essentially, what Intel has announced is that it will be shipping production parts by early 2012 using an MG process. This is well ahead of the ITRS roadmap.

Figure 2. SoC Consumer Portable Power Trend [Source: ITRS, 2008 Update]

Figure 3. SoC Consumer Portable Power Trend [Source: ITRS, 2009]

Figure 4. SoC Consumer Portable Power Trend [Source: ITRS, 2010 Update]

The reaction in the press to the announcement has been interesting, notably regarding its implications in the marketplace for Intel’s ability to compete with ARM processors in low-power applications. Readers who have been following the trends in power efficiency have heard a lot about the importance of the front-end or architectural part of the design process in creating efficient implementations. That continues to hold true here as well. But one thing is clear: the designers who will be using Intel’s new P1270 process appear to have a great process technology to build upon.

–Barry Pangrle is a solutions architect for low-power design and verification at Mentor Graphics.