Posts Tagged ‘AMD’

Shaking Up The Green500

Thursday, January 17th, 2013

Barry Pangrle
Last September, I wrote in my blog about the efficiency of IBM’s Power7+ architecture. That June, IBM’s Sequoia supercomputer (a BlueGene/Q system) had just shot to the top of the Top500 list, clocking in at 16.32 petaflop/s on the Linpack benchmark. Other systems built around the same IBM platform (listed as “BlueGene/Q, Power BQC 16C 1.60GHz, Custom”) were also dominating the top of the Green500 list, with Sequoia itself placing 20th in the megaflop/s/watt ranking.

Well, that was September and the November 2012 Top500 chart now shows a new leader at the top with a score of 17.59 petaflop/s on Linpack. It is the Oak Ridge National Laboratory’s Titan, a Cray XK7 with 18,688 nodes with each node containing an AMD 16-Core Opteron 6274 and an NVIDIA Tesla K20X.

Oak Ridge National Labs’ Titan Supercomputer (Photo courtesy of ORNL)

I alluded to the potential energy efficiency of GPU-based computing in the September blog, and heterogeneous systems have suddenly jumped to the top of the November 2012 Green500 list. The top 4 on the Green500 list are now all heterogeneous systems. Titan currently sits in 3rd place, well ahead of Sequoia’s debut position of 20th (Sequoia has since fallen to 29th). The next fastest supercomputer, RIKEN’s K Computer, is now in 85th place on the Green500 list, so among the very fastest supercomputers, the heterogeneous Cray XK7 Titan ranks a good number of spots higher on the Green500 list than its nearest rivals in raw performance.

How about the other two systems above Titan on the green list? In first place is a mixed Intel Xeon E5-2670 and Intel Xeon Phi 5110P system that is good for 0.112 petaflop/s, and in second is a mixed Intel Xeon E5-2650 and AMD FirePro S10000 system that checks in at 0.421 petaflop/s. It will be interesting to see if and how well these systems scale up to performance an order of magnitude or two higher to possibly compete for the top of the Top500 list. We might just get some insight into that soon with the Texas Advanced Computing Center’s Stampede supercomputer, which just went online after two years of development. It is supposed to hit 10 petaflop/s (peak) and is a Dell PowerEdge C8220 cluster filled with Intel Xeon Phi co-processors. It is currently estimated to need more than 6 megawatts, including cooling. Based on the preliminary numbers, it looks unlikely to challenge for the top of the Green500 list, but we should wait for official numbers before drawing any firm conclusions.
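For a rough sense of scale, the preliminary Stampede figures can be turned into the megaflop/s-per-watt metric the Green500 uses. This is only a back-of-the-envelope sketch: the function name is illustrative, and the inputs are peak performance and facility power including cooling, whereas the Green500 ranks measured Linpack performance against measured system power, so the result is not directly comparable to a list entry.

```python
def mflops_per_watt(petaflops, megawatts):
    """Convert a petaflop/s rate and a megawatt power draw to megaflop/s per watt."""
    megaflops = petaflops * 1e9   # 1 petaflop/s = 1e9 megaflop/s
    watts = megawatts * 1e6       # 1 MW = 1e6 W
    return megaflops / watts


# Preliminary Stampede figures from the text: 10 petaflop/s peak, more than 6 MW
# including cooling, so this ratio is an upper bound under those numbers.
print(f"At most about {mflops_per_watt(10.0, 6.0):.0f} megaflop/s per watt")
```

Because the power figure is a lower bound and the performance figure is a peak, the real efficiency under these preliminary numbers would come in below this ratio.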

Texas Advanced Computing Center’s Stampede Supercomputer (Photo Courtesy TACC)

The fourth and fifth place positions on the Green500 list belong to a Cray XK7 and an IBM BlueGene system, respectively. It will be interesting to see how the list changes again this year. The push to hit the exascale mark is on and the competition promises to be fierce.

—Barry Pangrle is a senior power methodology engineer at NVIDIA. The views expressed in this article are his own and not necessarily those of NVIDIA.

AMD’s Bobcat Processor

Thursday, August 9th, 2012

Barry Pangrle
The International Symposium on Low Power Electronics and Design (ISLPED) was held last week in Redondo Beach, California. There were many good presentations and keynote addresses, and a topic that’s close to my heart, near-threshold voltage computing, was often discussed along with how best to handle (or not handle) variability.

One paper out of many that caught my attention was The Core-C6 (CC6) Sleep State of the AMD Bobcat x86 Microprocessor, by Aaron Rogers, David Kaplan, Eric Quinnell & Bill Kwan. What was notable about this paper was the publication of power and performance data for the processor. Since the “Bobcat” core is about to be replaced by the “Jaguar” core, AMD has allowed its engineers to publish some of the data regarding the older 40nm TSMC bulk CMOS design. Figure 1 below shows a die photo of the Ontario SoC that not only includes 2 Bobcat cores and cache (left side of figure), but significant GPU and other capabilities, as well.


Figure 1. De-processed die photo of 40 nm Ontario SoC. Source: AMD

Figure 2 below shows a graph of the dynamic power for one core operating at 90°C and at various frequency and voltage points. For people interested in creating higher-level power models to approximate processors running at different operating points, these graphs should be of interest.


Figure 2. Power vs. Voltage. Source: AMD


Figure 3. Power vs. Voltage. Data Source: AMD

Figure 3 shows the best and worst case points presented across 5 different operating frequencies for the core. It’s interesting to note that there appears to be about a factor of 2 power increase in the worst case compared to the best case listed.


Figure 4. Power vs. Clock Frequency. Data Source: AMD

Given the data, I couldn’t help but play around to see what else I could possibly glean from it. Figure 4 shows some plots of power versus clock frequency for given voltage levels. Where a voltage level had fewer than two data points, I left it off this plot for clarity (and even just two points make for a really nice straight line). For dynamic power, we’d expect the power to increase linearly with clock frequency at a constant voltage, and the plotted data in Figure 4 seems to bear this out. The only slightly odd point is the jump at 1700 MHz for 1.1 V, which can also be seen in Figure 2.

All in all though, these plots look like fairly straight lines. So if the dynamic power seems to be tracking linearly with the clock frequency (as expected), we might guess that we should also be able to factor in a voltage squared term to make a predictive model. Figure 5 below shows plots for the predicted dynamic power levels starting with a base at the 500 MHz point and then factoring in the voltage squared component and linear component for the increase in clock frequency.


Figure 5. Predicted Power vs. Voltage. Data Source: AMD

The predictive models for the best-case and worst-case curves are not too bad. The worst-case curve was surprisingly close, in my opinion.
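The scaling behind those predicted curves can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual first-order relation in which dynamic power is proportional to voltage squared times clock frequency. The function name, the base point and the operating points below are hypothetical placeholders, not values taken from the AMD paper.

```python
def predict_dynamic_power(p_base_w, v_base, f_base_mhz, v, f_mhz):
    """Scale a measured base point by (V / V_base)^2 * (f / f_base)."""
    return p_base_w * (v / v_base) ** 2 * (f_mhz / f_base_mhz)


# Hypothetical base point at 500 MHz; substitute the measured value from the paper.
P_BASE_W, V_BASE, F_BASE_MHZ = 0.5, 0.8, 500.0

# Hypothetical (voltage, frequency) operating points to predict.
for v, f_mhz in [(0.8, 500.0), (0.9, 800.0), (1.0, 1000.0), (1.1, 1700.0)]:
    p = predict_dynamic_power(P_BASE_W, V_BASE, F_BASE_MHZ, v, f_mhz)
    print(f"{f_mhz:6.0f} MHz @ {v:.2f} V -> {p:.2f} W predicted")
```

Anchoring the model to a single measured point keeps it simple, but it also means any error in that one point carries through to every prediction.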

The authors go into some depth about the static power component and their sleep mechanism for reducing static power (actually the majority of the paper). Thanks to AMD for making this information available to the low-power community and I’d like to encourage anyone who is interested in low-power/energy-efficient design to check out the conference proceedings.

—Barry Pangrle is a solutions architect for low-power design and verification at Mentor Graphics.

Intel vs. AMD: Who’s Right?

Thursday, March 8th, 2012

By Barry Pangrle
It’s all about the system. One energy-efficient component doesn’t an energy-efficient system make. There were two big announcements recently made by the industry’s two x86 designers. One was by Intel announcing its new Sandy Bridge Xeon Processor E5-2600 product family, and the other one was by AMD announcing its planned acquisition of SeaMicro.

Both of these announcements emphasized the power savings that these technologies bring to the marketplace, and it looks as if Intel is continuing to extend its technological lead in the high-end server processor market with the new E5-2600. Still, the SeaMicro deal looks like an interesting play to address overall server energy-efficiency by means other than only improvements to the CPU. Both of these announcements also mentioned the importance of the fast-growing server market to address cloud computing.

As we’ve discussed before, choosing the right architecture for a given task is absolutely essential for generating a power- or energy-efficient design. World-renowned computer architect Seymour Cray famously quipped, “If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?” Well, if the field changes, 1024 (or in SeaMicro’s case, maybe 768) chickens look pretty good.

In a talk reported last year, Dileep Bhandarkar, distinguished engineer at Microsoft, called for 16-core SoCs based on Intel Atom or AMD Bobcat cores that would also integrate all of the core logic and I/O functions currently placed in separate chips. He presented the chart below to show how components in a system need to be balanced to get the best power, performance and cost outcome. Bhandarkar stated,

“The conclusion was clear—the best balance of performance, power, and price was found in the lower-power processors.”

In a whitepaper, SeaMicro states that “while much of the industry discussion centers around the CPU, the CPU consumes only one-third of the power used by a server. The remaining two-thirds is consumed by hundreds of other components. If one seeks to reduce server power consumption by 75%, one needs to focus on the non-CPU power-drawing components first, and only then on the CPU.”
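The arithmetic behind that claim is easy to check. The sketch below takes the one-third CPU share from the whitepaper quote; the function name and the two scenarios are illustrative only.

```python
def total_reduction(cpu_share, cpu_cut, other_cut):
    """Fractional drop in server power for given fractional cuts to CPU and non-CPU power."""
    return cpu_share * cpu_cut + (1.0 - cpu_share) * other_cut


CPU_SHARE = 1.0 / 3.0  # CPU share of server power, per the SeaMicro quote

# Scenario 1: CPU power eliminated entirely, everything else untouched.
print(f"CPU eliminated entirely: {total_reduction(CPU_SHARE, 1.0, 0.0):.0%}")

# Scenario 2: both the CPU and the remaining components cut by 75%.
print(f"75% cut everywhere:      {total_reduction(CPU_SHARE, 0.75, 0.75):.0%}")
```

Even removing CPU power altogether cannot save more than the CPU's one-third share, so a 75% target is out of reach without deep cuts to the other two-thirds.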

SeaMicro also realized that the challenge in the data center had turned into a problem of handling huge volumes of relatively modest computational workloads rather than solving a few complex problems (chickens vs. oxen). These workloads in part are generated by the millions of users wanting to perform searches, view Web pages, check e-mail, and read the news. Put simply, large, complex, high-speed, multi-core, multi-socket CPUs are “overkill,” and the mismatch between the CPU and the primary workload in the data center is also a fundamental underlying cause of the data-center power issue.

Pushing data around is expensive from an energy standpoint. Whether you are sending data across a chip, multiple chips or multiple boards, greater parasitic capacitance means greater energy to move that data. Energy-efficient systems are all about the efficient choreography of the movement of data within the system.

SeaMicro addressed the energy-efficiency issue by:

  1. Inventing technology that eliminates 90% of the components from the motherboard, leaving only three components: CPU, DRAM, and the SeaMicro Freedom ASIC, shrinking the motherboard to the size of a business card;
  2. Inventing technology that makes the necessary remaining components more efficient by reducing the power consumed by the CPU and its associated chipset by consolidating functions and by powering down unused CPU and chipset features;
  3. Inventing technology that ties together hundreds of the ultra-efficient mini-motherboards using the SeaMicro Freedom fabric; and
  4. Inventing technology that dynamically ensures that the system runs at its most efficient level by combining CPU management and load balancing, allowing workloads to be allocated dynamically to specific CPUs on the basis of power use.

SeaMicro claims that this combination of technologies produces a system that uses one-fourth the power and takes one-sixth the space of the best-in-class competition. The SeaMicro Freedom fabric is also independent of the CPU instruction set architecture (ISA), so that the use of other ISAs, like ARM, is still an open possibility. For applications that do need more compute power, SeaMicro announced the production of new products using more powerful Intel Xeon processors, too. Going forward, the obvious expectation is that SeaMicro will transition to parts designed by its acquiring company, which also opens some interesting possibilities for future APU functionality.

–Barry Pangrle is a solutions architect for low-power design and verification at Mentor Graphics.

Undervolting & Underclocking

Thursday, January 12th, 2012

By Barry Pangrle
Last month we looked at record-breaking clock frequencies accompanied by voltage levels over 2V for some high-speed x86 processors. This month we’re going to go in the opposite direction—reducing the voltage and clock frequency to reduce power.

Our processor of choice is the AMD A8-3850, a 100W, 2.9 GHz, quad-core, x86 processor that also incorporates 400 “Radeon cores” (Radeon HD 6550D Graphics) on the same piece of silicon. AMD also makes a 2.4 GHz (2.7 GHz Turbo Core) A8-3800 that’s rated at 65W but has been very hard to obtain. The planned application is a Home Theatre Personal Computer (HTPC). Since this is a box that will be used to record broadcasts, it will be on 24×7, and it’s imperative that it be rock stable.

With my local power service provider, 1W that is on 24x7x52 costs me about $1.58 a year. So, for example, a 35W reduction would translate into approximately a $55/year savings.
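The savings estimate is simple enough to script. This sketch uses the $1.58 per always-on watt per year quoted above; the function names are illustrative, and the per-kWh rate is backed out from that figure rather than taken from a utility bill.

```python
HOURS_PER_YEAR = 24 * 7 * 52   # 8736 hours in 52 weeks of round-the-clock operation
COST_PER_WATT_YEAR = 1.58      # dollars per always-on watt per year (author's figure)


def annual_savings(watts_saved, cost_per_watt_year=COST_PER_WATT_YEAR):
    """Dollars saved per year for a constant reduction in power draw."""
    return watts_saved * cost_per_watt_year


def implied_rate_per_kwh(cost_per_watt_year=COST_PER_WATT_YEAR):
    """Back out the electricity rate in dollars per kWh."""
    kwh_per_watt_year = HOURS_PER_YEAR / 1000.0
    return cost_per_watt_year / kwh_per_watt_year


print(f"35 W saved: ${annual_savings(35):.2f} per year")
print(f"Implied rate: ${implied_rate_per_kwh():.3f} per kWh")
```

The implied rate works out to roughly 18 cents per kWh, so readers on a different tariff can swap in their own cost per watt-year.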

Figure 1. AMD A8-3850 Processor In Box

The components used in this system were the following:
• AMD A8-3850 Llano 2.9GHz Socket FM1 100W Quad-Core Desktop APU with DirectX 11 Graphic AMD Radeon HD 6550D AD3850WNGXBOX
• GIGABYTE GA-A75M-UD2H FM1 AMD A75 (Hudson D3) SATA 6Gb/s USB 3.0 HDMI Micro ATX AMD Motherboard
• CORSAIR Vengeance 8GB (2 x 4GB) 240-Pin DDR3 SDRAM DDR3 1866 (PC3 15000) Desktop Memory Model CMZ8GX3M2A1866C9B
• SeaSonic X series SS-400FL Active PFC F3 400W ATX12V Fanless 80 PLUS GOLD Certified Modular Active PFC Power Supply
• (2 x) HITACHI Deskstar 5K3000 HDS5C3020ALA632 (0F12117) 2TB SATA 6.0Gb/s 3.5″ Internal Hard Drive
• SONY Black Blu-ray Burner SATA BD-5300S-0B – OEM
• Ceton InfiniTV 4 Quad-tuner Card for Watching Digital Cable TV on the PC, PCI-Express x1 Interface
• LIAN LI Black Aluminum PC-C39 Micro ATX Media Center / HTPC Case

Power measurements were made with a P3 International P4460 Kill A Watt EZ Electricity Usage Monitor.

Marcus Pollice wrote an article for Bright Side of News [1] and used Prime95 and MSI Kombustor to stress test his system for stability and max power usage. I’ve used the same two programs for testing here. Probably the trickiest part of this setup was finding the necessary DirectX libraries to run Kombustor, and you can find those here. Marcus reported that he was able to get a 32% reduction in power by lowering the voltage from 1.4 V to 1.136 V: under stress testing, the peak power dropped from 209 W to 142 W. That’s a whopping 67 W reduction in power at load.
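Those reported figures can be sanity-checked against a simple voltage-squared estimate. The sketch below assumes dynamic power scales with the square of the supply voltage at a fixed clock frequency; that is a first-order assumption of mine, not something stated in Pollice's article, and the helper name is illustrative.

```python
def percent_reduction(before, after):
    """Percentage drop going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Figures reported by Pollice.
power_drop = percent_reduction(209.0, 142.0)   # system power under load, in watts
voltage_drop = percent_reduction(1.4, 1.136)   # CPU core voltage, in volts

# Assumption: CPU dynamic power scales with V^2 at a fixed clock frequency.
v2_drop = percent_reduction(1.4 ** 2, 1.136 ** 2)

print(f"Measured system power reduction: {power_drop:.1f}%")
print(f"Voltage reduction:               {voltage_drop:.1f}%")
print(f"V^2-only estimate for the CPU:   {v2_drop:.1f}%")
```

The V^2 estimate comes out at about 34%, close to the measured 32%. The measured figure is for the whole system at the wall rather than the CPU alone, so the agreement should be read loosely.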

Figure 2. “Husky” 32 nm x86 Core [2]

Figure 3. Llano Die Photo with Annotations [3]

Figures 2 and 3 show photos for the x86 “Husky” core and the Llano die, respectively. The GPU portion runs on its own separate voltage plane. Using the IFFT stress test to load all 4 cores simultaneously, I was only able to reduce the voltage to 1.225 V from the 1.4 V stock voltage. Reducing the voltage further to 1.2 V caused errors to occur during the Prime95 test.

Figure 4. 32 nm Cores: Frequency vs Voltage [4]

Figure 4 shows the relationship between frequency and voltage for the Bulldozer core and a “legacy” 32nm core (presumably our “Husky” core for Llano). It’s clear that reducing the clock frequency should enable a lower stable voltage. In fact, by reducing the clock frequency to 2.4 GHz it was possible to get a stable system at 1.2 V. Another really nice benefit was that under Prime95 testing the reported CPU temperature dropped from ~74°C to ~51°C. This also meant that the stock heat-sink fan didn’t have to spin at its max RPM, keeping the system quieter.

So, what was the impact on the final power? At the wall plug, the system routinely idles around 47W with CPU temperatures in the 22 to 23 degrees Celsius range. Just running the Prime95 stress testing, the power increases to about 105W, and adding the Kombustor GPU stress to the mix pushes the power up to about 125W max. So, while I didn’t get the voltage reduction reported by Pollice, the overall power is lower (125W vs. 142W). There are probably numerous reasons for this, and perhaps the primary one is a more efficient power supply unit. If the PSU is 8% more efficient, that would be good for about 10W. The roughly 17% lower clock frequency (2.4 GHz vs. 2.9 GHz) should also help to keep the power lower. On the other hand, this system does have two hard drives, a cable tuner card and another 4 GB of RAM that are not part of the Brightside system.
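To get a feel for how much the lower clock offsets the higher voltage, here is a rough comparison of the operating points. It is a sketch under two assumptions that do not come from any measurement: that CPU dynamic power scales as V^2 times f, and that Pollice's undervolted system stayed at the stock 2.9 GHz clock. The function name is illustrative.

```python
def relative_dynamic_power(v, f_ghz, v_ref, f_ref_ghz):
    """Ratio of dynamic power at (v, f) to that at (v_ref, f_ref), assuming P ~ V^2 * f."""
    return (v / v_ref) ** 2 * (f_ghz / f_ref_ghz)


# This build: 1.2 V at 2.4 GHz. Stock A8-3850: 1.4 V at 2.9 GHz.
vs_stock = relative_dynamic_power(1.2, 2.4, 1.4, 2.9)

# Assumption: Pollice's 1.136 V system was still running at 2.9 GHz.
vs_pollice = relative_dynamic_power(1.2, 2.4, 1.136, 2.9)

clock_cut = 100.0 * (2.9 - 2.4) / 2.9

print(f"Clock reduction vs. stock:        {clock_cut:.1f}%")
print(f"Dynamic power vs. stock:          {vs_stock:.2f}x")
print(f"Dynamic power vs. Pollice (est.): {vs_pollice:.2f}x")
```

Under these assumptions the 1.2 V, 2.4 GHz point lands at about 0.61 times stock CPU dynamic power and slightly below the estimate for the 1.136 V system. This covers only CPU core switching power, not leakage, the GPU, drives or PSU losses.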

The GPU ran at the stock 1.15 V for the graphics logic. Max GPU temperature was reported to be only about 40°C. The reported system temperature was 48°C and didn’t seem to vary much whether the system was idling or being stressed.

Figure 5 shows the HTPC case with all of the components mounted. The Ceton card is hidden underneath the center support for the case.

I suspect that we’ll see more of these types of systems being discussed, and SemiAccurate [5] has even recently reported on tweaking software for AMD’s Brazos platform here.

Figure 5. Components Mounted Inside of HTPC Case

References:

[1] Marcus Pollice, AMD APU Undervolting: Reduce Power Consumption by 32%!, Bright Side of News, 7/6/2011.

[2] R. Jotwani, S. Sundaram, S. Kosonocky, A. Schaefer, V.F. Andrade, A. Novak and S. Naffziger, An x86-64 Core in 32 nm SOI CMOS, IEEE Journal of Solid-State Circuits, Vol. 46, Issue 1, 2011, pp. 162–172.

[3] Denis Foley, Maurice Steinman, Alex Branover, Antonio Asaro, Ljubisa Bajic, Swamy Punyamurtula and Greg Smaus, AMD’s Llano Fusion APU, Hot Chips 23, August 19, 2011.

[4] H. McIntyre, S. Arekapudi, E. Busta, T. Fischer, M. Golden, A. Horiuchi, T. Meneghini, S. Naffziger and J. Vinh, Design of the Two-Core x86-64 AMD “Bulldozer” Module in 32 nm SOI CMOS, IEEE Journal of Solid-State Circuits, Vol. 47, Issue 1, 2012, pp. 164–176.

[5] Thomas Ryan, Spotlight: BrazosTweaker, An Under-volting Tool for AMD APUs, SemiAccurate, Jan 9, 2012.

–Barry Pangrle is a solutions architect for low-power design and verification at Mentor Graphics.

Speed Demons

Thursday, December 1st, 2011

By Barry Pangrle
For extreme world record performance levels, the required power levels are also typically extreme. It’s that age-old battle against diminishing returns to squeeze out every last drop of performance versus practical limits and wallets.

For example, a top fuel dragster can consume about six gallons of fuel for a quarter-mile run down the strip. As has previously been shown here, x86 clock frequencies have gone from doubling nearly every year to practically stagnant over the past seven years. Of course, power has been a major player in putting a ceiling on those frequencies. So in honor of that high-speed clock demon that was released back in November 2004, the Intel Pentium 4 HT 570J, we will look at recent new records in x86 (over)clocking frequencies (“overclocking” is a term used for pushing a chip past its specified clock frequency).

AMD recently released its new “Bulldozer” based x86 parts that are composed of modules with two separate integer paths that share floating-point resources. Roughly speaking, the concept is to get two “cores” into one module that uses slightly more area than a standard single core. Breaking from the current trend of not designing for higher clock speeds, Bulldozer in fact seems to have been architected to run at somewhat higher clock frequencies. This new part presented itself as an opportunity for the over-clocking community to jump in and try to break the previous record held by an Intel Celeron D 352 clocked at 8308.94 MHz.

Heat kills chips. Dynamic power typically increases linearly with clock frequency, so pushing the clock past a part’s rated frequency also increases the heat the chip produces. One trick used by overclockers is to also increase the supply voltage, which helps the chip run at a higher clock frequency. As you might imagine, because the dynamic power increases quadratically with the voltage, this can really increase the amount of heat being generated. Overclockers typically use more exotic cooling than the standard heatsink fan often provided with a boxed part. The cooling solutions range from larger heatsinks made of metals that have good thermal conductivity, like copper, to elaborate liquid cooling systems that, at the far end of the spectrum, include liquid nitrogen and liquid helium.
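In symbols, the standard first-order model for switching power captures both dependencies. The activity factor α and the switched capacitance C are introduced here for completeness and are not quantities discussed in this post:

```latex
P_{\text{dyn}} = \alpha \, C \, V_{DD}^{2} \, f
```

Under this model, raising the clock by 10% at a fixed voltage raises dynamic power by about 10%, while raising the voltage by 10% at a fixed clock raises it by about 21%.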

As part of the launching of these new parts, AMD held an extreme overclocking session to go after the record. There’s an entertaining video here on YouTube (it’s just a bit over two minutes long and looks pretty good at 1080p). The result was what looked like a fairly stable 8.429 GHz—assuming you kept the liquid helium flowing. The power for the part wasn’t reported, but that top fuel dragster comes to mind.

Since then an overclocker in Taiwan named Andre Yang has twice reported faster results, and I’ve included them in the table below along with an image of the setup shown in the AMD video. It’s interesting to note that the first three entries are all running within 16 mV of 2.0 V. Even the P4 570J ran at only 1.425 V and that was in a 90 nm technology. It looks like Andre was able to crack 8.5 GHz by pushing the voltage to 2.064 V. It could possibly be that the max voltage before part failure ultimately will limit how high these clock frequencies could go.

Not to be left out, the memory community has also recently gotten into the game of setting new records, and you can find more here about Corsair’s DDR3-3467 memory. Coincidentally or not, it was also run in a liquid nitrogen-cooled FX-8150 PC.

So, you might be wondering, if it’s possible to take these parts and bump up the voltage to run them faster, if you were to run them slower would it be possible to reduce the voltage levels to further reduce power? The answer is of course “yes,” and we might just look at that in a future blog too.

Best Wishes for a Happy, Healthy and Prosperous 2012!

–Barry Pangrle is a solutions architect for low-power design and verification at Mentor Graphics.