Power vs. Accuracy
By Barry Pangrle
So, how much energy are you willing to expend to be accurate?
The question is one that chip designers face more often than they probably realize. The first question really is, ‘How accurate do you need to be?’ Whether the measure is test coverage, verification coverage, signal-to-noise ratio, or the strength of an error-correcting code, the list of places where the question comes up is seemingly endless. Variability in the environment, and even in the products themselves, can cause errors. Typically these are handled in a probabilistic manner, because the cost of ensuring there are never any errors is far beyond what the economics of manufacturing any part can support, especially for complex systems that contain billions of devices.
As an example, let’s briefly look at error detection and correction codes. Simply adding a parity bit to a string of bits allows the detection of a single-bit error in that string. You can tell there’s (at least) one bit error, but you won’t know which bit is wrong. What happens if two bits change? Well, that case isn’t covered: the two flips cancel in the parity sum and the error goes undetected. In general, parity catches any odd number of flipped bits and misses any even number. But plenty of systems use only parity checking.
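The parity behavior described above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample data byte are made up for the example, and even parity is assumed.

```python
def parity_bit(bits):
    """Return the even-parity bit for a list of 0/1 values."""
    return sum(bits) % 2

def encode(bits):
    """Append the parity bit so the whole word has an even number of 1s."""
    return bits + [parity_bit(bits)]

def check(word):
    """True if the word passes the even-parity check."""
    return sum(word) % 2 == 0

data = [1, 0, 1, 1, 0, 0, 1, 0]   # arbitrary example byte
word = encode(data)

one_flip = word.copy()
one_flip[3] ^= 1                  # a single bit error

two_flips = word.copy()
two_flips[3] ^= 1                 # two bit errors
two_flips[6] ^= 1

print(check(word))       # True: clean word passes
print(check(one_flip))   # False: single error is detected
print(check(two_flips))  # True: double error slips through
```

Note that a failed check says only that something is wrong, not where; nothing in the parity bit identifies the flipped position.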
A step up would be to use a single-error-correction, double-error-detection (SECDED) code. Now if you have a single-bit error, not only can you detect it, you can actually fix it. If there are two errors, you can tell that there are two errors, but you won’t know which bits to fix.
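One common way to get this behavior is an extended Hamming code. The sketch below uses a small (8,4) version purely for illustration; the function names and the 4-bit word size are choices made for this example, and real memory systems typically protect much wider words.

```python
from functools import reduce

DATA_POSITIONS = (3, 5, 6, 7)     # indices 1, 2, 4 hold Hamming parity bits

def _syndrome(word):
    """XOR of the indices (1..7) of all set bits; 0 for a valid codeword."""
    return reduce(lambda a, b: a ^ b, (i for i in range(1, 8) if word[i]), 0)

def encode(data4):
    """Encode 4 data bits into an 8-bit word; index 0 is the overall parity."""
    word = [0] * 8
    for pos, bit in zip(DATA_POSITIONS, data4):
        word[pos] = bit
    s = _syndrome(word)
    for p in (1, 2, 4):           # set parity bits so the syndrome becomes 0
        if s & p:
            word[p] = 1
    word[0] = sum(word) % 2       # make the total number of 1s even
    return word

def decode(word):
    """Return (status, data4); status is 'ok', 'corrected' or 'double_error'."""
    word = list(word)
    s = _syndrome(word)
    overall_ok = sum(word) % 2 == 0
    if s == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: assume one. The syndrome is its index
        # (0 means the overall parity bit itself was hit).
        word[s] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but even overall parity: two flips, location unknown.
        return "double_error", None
    return status, [word[i] for i in DATA_POSITIONS]

word = encode([1, 0, 1, 1])

single = list(word)
single[5] ^= 1
double = list(word)
double[2] ^= 1
double[5] ^= 1

print(decode(word))     # ('ok', [1, 0, 1, 1])
print(decode(single))   # ('corrected', [1, 0, 1, 1])
print(decode(double))   # ('double_error', None)
```

The decoder assumes that an odd-parity failure means exactly one flipped bit; three or more simultaneous errors can be miscorrected, which is the same probabilistic trade-off the article describes.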
What’s the cost? Each additional level of detection and correction requires additional bits, plus more complex circuitry to handle the encoding, decoding, and correcting. If you perform parity on a byte-by-byte basis, the overhead is already 12.5% in additional memory (one extra bit for every eight), not including the extra circuitry for parity generation and checking. The overhead increases with more complex schemes. All of this additional circuitry requires more chip area and more switching activity, and that means more power and more energy.
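To put rough numbers on the storage overhead, the sketch below compares one parity bit per word with a Hamming-style SECDED code. The check-bit count uses the standard Hamming condition (2^r >= m + r + 1) plus one overall parity bit; the function names and the word widths chosen are just for illustration.

```python
def secded_check_bits(data_bits):
    """Check bits for a Hamming-based SECDED code over `data_bits` data bits."""
    r = 0
    while 2 ** r < data_bits + r + 1:   # Hamming condition for single-error correction
        r += 1
    return r + 1                        # one more bit for double-error detection

def overhead(check_bits, data_bits):
    """Extra storage as a fraction of the data bits."""
    return check_bits / data_bits

for width in (8, 32, 64):
    parity = overhead(1, width)
    secded = overhead(secded_check_bits(width), width)
    print(f"{width:2d}-bit word: parity {parity:.1%}, SECDED {secded:.1%}")
```

Under these assumptions, protecting a single byte with SECDED takes 5 check bits rather than 1, which is why wider words are usually protected as a unit: 8 check bits on a 64-bit word brings the storage overhead back to the same 12.5% as byte parity, though with more encode and decode logic.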
Determining the necessary level of accuracy can have interesting ramifications on the implementation of a design. Take graphics cards, for example. GPGPU computing is growing in popularity. Some users, such as those in the financial community, require double-precision floating point and error-correcting code (ECC) memory. Clearly, if one part can be sold not only to the graphics community but also to the high-performance computing community, then there are economic advantages in using the same part for both markets.
The downside is that the “extra baggage” for the higher accuracy carries over into the graphics applications. Do you care if there’s a single bit error in some calculation that affects a frame that’s on the screen for maybe 1/24th of a second or less? I don’t know about you, but I probably don’t. In fact, if you have an HDTV and sometimes play standard definition video on it, you might notice that the picture doesn’t look as clear as an HD picture. But try stopping on a single frame. You may be amazed at how awful it looks. Your eyes and brain do a remarkably good job of smoothing it all out when playing in real time, so a few bad pixels out of a couple million in a single frame probably aren’t going to ruin your viewing experience.
Certainly there are applications where every bit doesn’t have to be perfect. Video happens to be one of them. So if we know this up front and want to build a mobile graphics processor that uses really low power, could we trade off some of the video quality for battery lifetime? The answer is yes. Many researchers are looking at running circuits at reduced voltage levels, knowing that the timing on some critical paths will start to fail. As the voltage is reduced further, more paths fail timing, more errors are injected into the computations, and the image quality continues to degrade, eventually past the point of acceptable viewing. Design “tricks” can be used to alter the paths that are likely to fail first, so that a higher-quality image is maintained as timing paths start to fail. The advantage is that dynamic power scales with the square of the supply voltage, so running at a lower voltage gives a quadratic savings.
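The quadratic relationship comes from the standard first-order model of CMOS switching power. As a rough worked example, take a hypothetical drop from 1.0 V to 0.8 V at an unchanged clock frequency (the voltages are illustrative, not taken from any particular design):

```latex
P_{\text{dyn}} = \alpha \, C \, V_{dd}^{2} \, f
\qquad\Rightarrow\qquad
\frac{P_{\text{dyn}}(0.8\,\text{V})}{P_{\text{dyn}}(1.0\,\text{V})}
  = \left(\frac{0.8}{1.0}\right)^{2} = 0.64
```

Here \alpha is the switching activity factor, C the switched capacitance, V_dd the supply voltage, and f the clock frequency. In this model a 20% voltage reduction cuts dynamic power by about 36%, before accounting for any cost of the errors it introduces.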
Okay, so outside of graphics are there other applications? Again the answer is “yes”. Even systems that generally have to be very accurate, such as CPUs, can have parts that tolerate an occasional error. We mentioned earlier how errors in the memory system can be handled, but that requires additional circuitry. Are there any areas where an error could just be tolerated? As it turns out, there are, and this also plays into testing and reliability.
Professor Mel Breuer from USC gave a talk at the SoC conference in which he described a CPU that could tolerate errors and still perform reasonably well, provided the errors occurred in the branch-prediction unit. It is a prediction unit, after all, so it sometimes “predicts” wrong even when it is working perfectly. If there are errors in the unit, the predictor typically just loses accuracy. The net effect is that CPU performance degrades a bit, because the processor goes down the wrong branch a little more often than it otherwise would. This also creates an opportunity to save power by reducing the voltage on the branch-prediction unit. Whether that is a net energy savings requires a calculation: the energy saved by running the predictor at the reduced voltage has to be compared with the extra energy spent on the work done for the additional mispredicted branches. In any event, I’ll predict that we’ll see more research in this area, and actual products that use some of these power-saving techniques.
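The net-energy comparison for the branch predictor can be sketched as a back-of-the-envelope model. Everything here is an assumption made for illustration: the function name, the energy units, the misprediction rates, and the simplification that predictor energy scales with the square of the voltage. None of the numbers come from the talk.

```python
def net_energy_saved(e_predictor, v_nominal, v_reduced,
                     branches, miss_rate_nominal, miss_rate_reduced,
                     e_per_mispredict):
    """Positive result: the reduced-voltage predictor saves energy overall."""
    # Assumed: predictor energy scales with the square of its supply voltage.
    saved = e_predictor * (1 - (v_reduced / v_nominal) ** 2)
    # Each additional misprediction wastes the energy of the discarded work.
    extra = branches * (miss_rate_reduced - miss_rate_nominal) * e_per_mispredict
    return saved - extra

# Hypothetical workload: energies in arbitrary units.
mild = net_energy_saved(e_predictor=100.0, v_nominal=1.0, v_reduced=0.8,
                        branches=1000, miss_rate_nominal=0.05,
                        miss_rate_reduced=0.06, e_per_mispredict=2.0)

severe = net_energy_saved(e_predictor=100.0, v_nominal=1.0, v_reduced=0.8,
                          branches=1000, miss_rate_nominal=0.05,
                          miss_rate_reduced=0.08, e_per_mispredict=2.0)

print(f"mild accuracy loss:   net saved {mild:+.1f}")
print(f"severe accuracy loss: net saved {severe:+.1f}")
```

With these made-up inputs the same voltage reduction is a win when the misprediction rate rises by one percentage point and a loss when it rises by three, which is the point of doing the calculation rather than assuming the lower voltage always pays off.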
–Barry Pangrle is a solutions architect for low-power design and verification at Mentor Graphics.