Making an FPGA hot

Guys

We have just laid out a board and want to put the thermal analysis to bed (it's conduction cooled, so there is not much room for error). If the Xilinx estimator says we are going to use 25 watts, does anyone know the best way to code an FPGA so that it will get nice and hot?

The estimator is just that, but is there a more accurate way of writing some code so that a particular clock input will generate a particular amount of heat? A serial chain of 2000 D-type flip-flops, each toggling on every clock, which blinks an LED is obviously one way, but it doesn't seem very elegant.

We have wired up the internal temp sense diode to take a look at the result (and yes, we know how noisy and inaccurate they are).

Any experiences?

Colin

Reply to
colin

If your goal is just to generate heat, use all the LUTs as SRL's, make use of all the BRAM's, and drive all the I/O's with a nice high current drive strength.

Marc

Reply to
Marc Randolph

Colin,

Just make a huge shift register, or all DFFs toggling, and then vary the clock input (or the shifted-in data pattern, from ...000001 to 101010..., etc.).

That is what we do.
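To put a rough number on how the data pattern sets the activity, here is a minimal Python sketch. It is not from the thread: the function name is made up, and it assumes the pattern repeats endlessly through a long shift register. Under that assumption, a flip-flop changes state on a clock whenever the bit arriving differs from the bit it holds, so the fraction of flip-flops toggling per clock equals the fraction of adjacent bit pairs in the pattern that differ.

```python
def toggle_fraction(pattern):
    """Fraction of flip-flops that change state per clock.

    Assumes `pattern` (a string of '0'/'1') repeats endlessly through a
    long shift register, so the pattern is treated as circular.
    """
    n = len(pattern)
    differing = sum(pattern[i] != pattern[(i + 1) % n] for i in range(n))
    return differing / n


if __name__ == "__main__":
    for p in ("10", "00000001", "0000"):
        print(p, toggle_fraction(p))
```

With the alternating pattern every flip-flop toggles on every clock; with a single 1 in eight bits only a quarter of them do, which is why changing the pattern (or the clock rate) gives a coarse dial on dynamic power.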

Austin

Reply to
Austin Lesea

Well, I've found the diode isn't particularly noisy nor especially inaccurate! It gives repeatable and consistent (between parts) results, certainly good enough for your application. You have routed its connections together and away from big switching currents, I presume?!

I use copper sheet to move heat to where I can get rid of it. Cu is 400 W/m/K, about twice as good as Aluminium. Don't use copper alloys. Very useful if you've got boards stacked closely together, you can get the heat out from between the boards. I've never tried heat pipes, but they're meant to be very good indeed.

Finally, you'll find that the FPGAs work at elevated temperature for a long time. I recall a thread on CAF all about FPGAs down boreholes where they were running for weeks at 175C. You might be enlightened by a quick trawl of CAF in Google Groups. So, what's the lifetime of your product? How long will you be working for that company? All part of the engineering compromise!!

Good luck, Syms.
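For a rough feel of what a copper sheet buys you, here is a minimal Python sketch of one-dimensional conduction along a flat strap (thermal resistance = length / (conductivity x cross-section)). The 25 W comes from the original question and the 400 W/m/K from the post above. The strap dimensions, the function name, and the 200 W/m/K used for aluminium (simply half the copper figure, per "about twice as good") are illustrative assumptions. Interface and spreading resistances are ignored, so treat the result as an order-of-magnitude check only.

```python
def conduction_delta_t(power_w, length_m, k_w_per_m_k, width_m, thickness_m):
    """Temperature drop (K) along a flat strap conducting power_w end to end."""
    area_m2 = width_m * thickness_m                 # cross-section the heat flows through
    r_theta = length_m / (k_w_per_m_k * area_m2)    # thermal resistance in K/W
    return power_w * r_theta


# Hypothetical strap: 50 mm heat path, 40 mm wide, 1 mm thick, carrying 25 W.
copper = conduction_delta_t(25.0, 0.050, 400.0, 0.040, 0.001)
aluminium = conduction_delta_t(25.0, 0.050, 200.0, 0.040, 0.001)

if __name__ == "__main__":
    print(f"copper:    {copper:.1f} K")
    print(f"aluminium: {aluminium:.1f} K")
```

With these made-up dimensions the drop along a 1 mm copper sheet is already tens of kelvin, which shows how sensitive a conduction-cooled design is to sheet thickness and path length.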

Reply to
Symon

ROFL !!

thanx for the chuckle - Mike T

Reply to
Mikeandmax

Ahhh, that explains the issues with the ANT then... ;-)

Mark

Reply to
Mark Smith

Hi Colin,

Below I try to give some insight into how to make a hot design, though I do question the motivation for doing so. A simple FF chain comes nowhere close to achieving a high (or even average) core power.

All of the phenomena I describe below are modeled in the recently released Quartus II 4.2 software via its PowerPlay Power Analyzer. Target Stratix II or Max II and you'll get very accurate estimates of how all these factors affect your power consumption. You can try out the Power Analyzer in the Quartus II 4.2 Web Edition software available from

formatting link

If you're trying to figure out if a given design will work on your board after it's been made, the best bet is to try the chip out in the lab using stimulus (vectors) that reflect the worst-case operating conditions for the chip. I can make you a design that will burn many many Watts of power, but that doesn't mean your design will. A dynamic power measurement from the lab is the most accurate estimate possible -- just remember to use the manufacturer's spec for worst-case static power (at worst-case temperature) since the unit you have on your board is likely NOT worst-case.

There are many factors that affect overall dynamic power consumption of an FPGA design. I will highlight a few critical ones below, and make suggestions along the way to build a design to turn your FPGA into the hot-plate you desire. It is *not* as simple as making one big shift-register...

(0) Transition Density. You want to toggle as much as possible every cycle. Toggle FFs and shift registers achieve this, as do XOR functions (if you want to utilize the LUT too).

(1) Routing Utilization. The routing buffers, multiplexers, and wiring in an FPGA can add up to a large amount of switching capacitance and short-circuit (crowbar) current. To maximize dynamic power, you must use a lot of routing. A simple FF chain will actually use very little routing, unless you purposely make the placement very bad by using region constraints such as LogicLock regions. You could, for example, constrain the even bits of your chain to one-half the chip and the odd bits to the other half, and this will greatly increase routing utilization. Or use something other than FFs to increase the number and fanout of the routed wires. Of course, you'll need to experiment a little to find the right balance between high utilization and still being able to route!

(2) LUT Configuration. A LUT configured as an AND gate does not burn nearly as much power as one configured as an XOR. This difference is due to the number of internal nodes in the circuit that toggle state when an input signal toggles. On top of this, the output of an XOR will toggle upon the toggle of any input -- so chaining together XORs will result in a cascade of glitching (if there are no pipeline registers), which can further increase your power. To get the most accurate estimate of LUT power, you must consider the functionality of the LUT -- Quartus II can do this for you.

(3) Clock Network. The vast majority of power on a high-fanout clock will be burned *inside* the LABs (on the LAB-wide clock), not on the global clock network. If you distribute a clock such that it fans out to one FF (out of 16) in every LAB of the device, this will maximize this internal LAB clock network power. You can achieve this through location constraints applied to these FFs. And the more clocks you use, the more you will burn. You can use the PLLs to step up the clock frequency to help increase the toggle rate.

(4) RAMs. A RAM can burn significant power if you perform reads & writes every cycle (keep the clock enable asserted). Just hook up all the RAMs in the device to be in dual-port mode writing & reading random data every cycle, and you've got some more power.

(5) I/Os. You can burn an arbitrary amount of power with your I/Os, depending on external termination resistance, contention, I/O standard, drive strength, load capacitance, etc. Let's just pretend you don't have I/Os to make life easier.

Hopefully that gives you some ideas of where to go to burn some power. If you're using a Xilinx chip, I'm sure similar techniques will apply, though their tools may not be able to fully predict the results you will see.

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis (at home)

Hi Paul, Comments/Questions below!

Could you explain that a little more? I thought that the LUT was just a 16x1 RAM. Is the extra power consumed only when two inputs change? E.g. 00 => 11 into the XOR would still have 0 as its output, but it might transition through the 1 output state? I understand that XOR gates are more likely to transition, but you seem to be saying there's some additional internal reason why they consume power.

Cheers, Syms.

Reply to
Symon

--

--Ray Andraka, P.E. President, the Andraka Consulting Group, Inc.

401/884-7930 Fax 401/884-7950 email snipped-for-privacy@andraka.com
formatting link

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759

Reply to
Ray Andraka

Hi Symon,

While logically a LUT is just a 16x1 ROM, physically it is not built the same way as a RAM.

A traditional RAM is built with a 2D-array of bits, where a row is selected by decoding the address, and a pair of differential bit lines per cell is precharged and then the cell pulls one side down which is amplified by a sense-amplifier to speed things up (gross simplification). In that structure, regardless of what you are reading, you burn the same power since the reads are differential, and you burn power on each read, regardless of the previously read value, since all that precharge, pull-down and sensing happens every read.

A LUT however is traditionally built as a multiplexor tree. You have 16 SRAM cells feeding a tree of 2:1 muxes. The 4 inputs of the LUT each control one level of the tree. There is a diagram below for a 2-LUT.

Let's take a 2-LUT implementing an XOR as an example (see diagram). We have x = A?1:0 and y = A?0:1, and f = B?y:x. Let's say A switches from 0-->1 (and B = 0). Node x toggles from a 0 to 1. Node y toggles from a 1 to a 0. And node f toggles from a 0 to a 1 (with x). So you have not only the output of the LUT toggling, but also the internal stages. If you extend the example to an N-LUT, you'll see that a toggle on input A results in 2^(N-1) first stage nodes toggling, 2^(N-2) second stage, etc. or 2^N - 1 nodes toggling *internal* to the LUT. If you look at an AND instead, you'll see that only one first stage node toggles state with a change in A.

SRAM cell      stage 1 (select A)      stage 2 (select B)

   0 ---\
         mux(A) --- x ---\
   1 ---/                 \
                           mux(B) --- f
   1 ---\                 /
         mux(A) --- y ---/
   0 ---/

So in conclusion, an XOR not only results in a higher output switching probability (which should be modeled by your simulation vectors or assumed toggle rate), but also results in higher *internal* switching activity. Hence the power of a LUT is not constant across LUT masks. In fact, it also changes as a function of the "static probabilities" of each input, i.e. the % of the time those inputs are 1 or 0, since asymmetric LUT masks result in asymmetric internal states as a function of input values.
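The node counting above can be checked with a small Python model of the multiplexer tree. This is only a sketch under the simplification described in the post (an N-LUT as 2^N SRAM bits feeding N levels of 2:1 muxes, with input A selecting at the first level). The function names are made up for illustration, and real LUT circuits will differ in detail.

```python
def mux_tree_nodes(mask, inputs):
    """Values of every mux output in the tree, first stage first.

    mask   -- list of 2**N SRAM bits (the LUT contents)
    inputs -- N select bits; inputs[0] drives the stage nearest the SRAM cells
    """
    level = list(mask)
    nodes = []
    for sel in inputs:
        # Each 2:1 mux picks one of two neighbouring values from the previous level.
        level = [level[2 * j + sel] for j in range(len(level) // 2)]
        nodes.extend(level)
    return nodes


def toggles_when_a_flips(mask, other_inputs):
    """Count mux outputs (internal nodes plus f) that change when input A goes 0 -> 1."""
    before = mux_tree_nodes(mask, [0] + list(other_inputs))
    after = mux_tree_nodes(mask, [1] + list(other_inputs))
    return sum(b != a for b, a in zip(before, after))


def xor_mask(n):
    return [bin(i).count("1") & 1 for i in range(2 ** n)]


def and_mask(n):
    return [int(i == 2 ** n - 1) for i in range(2 ** n)]


if __name__ == "__main__":
    print("2-input XOR:", toggles_when_a_flips(xor_mask(2), [0]))        # 3
    print("4-input XOR:", toggles_when_a_flips(xor_mask(4), [0, 0, 0]))  # 15
    print("2-input AND:", toggles_when_a_flips(and_mask(2), [0]))        # 1
    print("4-input AND:", toggles_when_a_flips(and_mask(4), [0, 0, 0]))  # 1
```

In this model the XOR counts match the 2^N - 1 figure, while the AND count depends on the other inputs: with all of them at 1, the 4-input AND toggles one node per stage (4 in total), still far fewer than the XOR.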

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis (at home)

Hi Ray et al:

Good point on glitching. On a related note, this glitching also makes power analysis difficult. Even with good-quality simulation vectors for a design, the resulting gate-level simulation results will contain glitches. Are the glitches real? If so, then they should count towards power. But sufficiently short glitches will never propagate through the routing, or even through the gate.

This is why we recommend that our users employ glitch filtering on simulation results. This can be done with the Quartus II 4.2 simulator or with 3rd party simulators (via the control file emitted by Quartus II). We find that very glitchy designs do not correlate well unless this glitch filtering is used. In addition, the resulting VCD files produced by 3rd party sims need to be further filtered by Quartus in order to improve accuracy further.
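As an illustration of the idea only (this is not the algorithm Quartus II uses, and the function name and threshold are assumptions), a glitch filter can be sketched in Python as dropping any pair of transitions on a net that forms a pulse narrower than some minimum width. Times are integer picoseconds to avoid floating-point surprises.

```python
def filter_glitches(transition_times_ps, min_pulse_ps):
    """Remove pulses shorter than min_pulse_ps from a sorted list of transition times.

    Each entry is the time at which the net changes state. Two transitions
    closer together than min_pulse_ps are treated as a glitch and cancel out.
    """
    kept = []
    for t in transition_times_ps:
        if kept and t - kept[-1] < min_pulse_ps:
            kept.pop()      # this edge and the previous one form a too-short pulse
        else:
            kept.append(t)
    return kept


if __name__ == "__main__":
    raw = [0, 10_000, 10_100, 10_200, 20_000]
    print(filter_glitches(raw, 1_000))
```

The number of surviving transitions is what a power estimate would count, so a glitchy net can look much more active before filtering than after.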

For further information on power analysis, the Quartus II PowerPlay Power Analyzer and glitch filtering specifically, please see

formatting link

And yes, pipelining is an excellent way to reduce glitching and thus dynamic power. At some point, the pipeline registers and additional clock routing will add more power than the glitches removed, but for glitch-heavy designs (anything with XORs, such as adders, multipliers, and parity trees, and "randomizing" circuits such as encryption) pipelining will help a lot.

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis (at home)

Hmm, that's very interesting. I wonder if the FPGA vendors have got their SLICEs back to front? I.e. the FFs should feed directly into the LUTs within the SLICEs, instead of the other way round that exists now. If it saved even

Reply to
Symon

Hi Paul, That's interesting too! I think what you're saying is that some inputs to the LUT are more power thirsty than others. So, in your example, the A input controls more muxes than the B input. This means that you could reduce power by taking this into account. If you had a LUT structure with four inputs A, B, C, D then A would feed 8 muxes, B feeds 4, C feeds 2, and D feeds just one. For any two-input function, only two inputs are used, and the P & R tools could prefer to use the C and D inputs for the least amount of internal switching of nodes. Also, the net that changes most frequently should be on the D input. Correct? Thanks, Syms.

Reply to
Symon

Paul Leventis (at home) wrote: (snip regarding power, XOR trees, and FPGAs)

That sounds more like a DRAM or SDRAM. Traditional SRAMs were completely combinatorial, such that the output changed the appropriate propagation delay after the address changed. Wouldn't the precharging require a clock? I would have thought a 2D array, where a row is decoded and the outputs from the selected row, either differential or not, are supplied to a multiplexer to select the appropriate bits to output. At 16 cells the advantage of 2D decoding might not be worthwhile.

I wonder how 16-bit SRAMs were built? As far as I understand it, the first semiconductor memory for a commercial computer was the storage protection keys for the IBM 360/91, built out of 16-bit SRAM chips.

-- glen

Reply to
glen herrmannsfeldt

As I understand it (!) Stephen Trimberger (Xilinx and much distinguished previous work) presented a paper on this fairly recently.

Reply to
Tim

You mean put four FF's on the LUT inputs, instead of one on the output? I suppose that reduces glitching inside the LUT (RAM), but it still leaves glitches through the routing. Also, four FF's are likely to take more power than one.

-- glen

Reply to
glen herrmannsfeldt

Hi Tim, Thanks for the heads up. Googling Mr. Trimberger's name, I found this EE Times article:-

formatting link
Quote:

Xilinx has already taken the first steps to raise the awareness of power issues by disclosing a study on the hot spots in its latest Virtex 2 architecture. In the paper, the company showed that 60 percent of the power consumption in the Virtex 2 family is from routing while logic and clocking account for 16 and 14 percent, respectively. Additionally, Xilinx found that the cluster of LUTs, flip-flops and other circuitry that make up its configurable-logic blocks take up 5.9 microwatts per MHz for a typical design. But this is just for "typical" designs; actual power consumption within the configurable logic blocks (CLBs) can change wildly depending on the switching activity. This can occur frequently in synchronous circuits, where the inputs to the LUTs come in at different times during the same clock cycle. This "glitching" effect could contribute up to 70 percent of the power dissipation in a CMOS circuit, whether it's an ASIC or FPGA.
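As back-of-envelope arithmetic on the quoted figure, here is a minimal Python sketch. It reads the 5.9 microwatts per MHz as a per-CLB number for a "typical" Virtex 2 design, which is an interpretation of the article rather than something it states outright. The CLB count and clock rate are made-up inputs and the function name is hypothetical. It covers CLB logic only, not the routing and clocking shares mentioned in the same quote.

```python
CLB_UW_PER_MHZ = 5.9  # from the quoted article; assumed here to be per CLB


def clb_power_w(n_clbs, f_mhz, uw_per_mhz=CLB_UW_PER_MHZ):
    """Estimated power in watts of n_clbs 'typical' CLBs clocked at f_mhz."""
    return n_clbs * f_mhz * uw_per_mhz * 1e-6


if __name__ == "__main__":
    # Hypothetical design: 10,000 CLBs at 200 MHz.
    print(f"{clb_power_w(10_000, 200):.1f} W")
```

Because the article notes that actual CLB power can vary widely with switching activity, a figure like this is a starting point for sizing a test design, not a prediction.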

Reply to
Symon

I didn't try to figure all possibilities, but it would be a rare design that used a FF on each LUT output, so I would expect some LUT without FF's on the inputs. The arrival time will be different for the different inputs, so there may (depending on logic) still be glitches left.

I do agree, though, that for many designs it could greatly reduce glitches propagating through logic. I have done designs with at most two LUTs between FFs, highly pipelined for high speed.

I do agree that it could be an interesting addition to FPGA architecture, and you might want to patent it. (If you do, it will probably never get into any FPGA's though.)

-- glen

Reply to
glen herrmannsfeldt

Hi Symon,

You'd have to consider the cost of having 4 flops (if I understand correctly) vs. 1. How often will 4 flops be used? What if you instead spent that same silicon area on other things (other power reduction circuitry, etc.)? How much more wiring cap will there be due to increased size of a LE? How much more power are you burning by replicating clocks and other signals?

One thing I should point out is that you *can* put an FF in front of the LUT in Stratix/Cyclone/Max II/Cyclone II/Stratix II. There is only one FF, but it can directly feed the LUT instead of the other way around.

Regards,

Paul Leventis Altera Corp.

Reply to
Paul Leventis (at home)
