fastest FPGA

On the largest parts, I think this is probably true, because the clock tree skew is wider than the clock period. I'm still not entirely convinced that slightly smaller parts in the smallest packages are actually worst-case stable, given higher than average clock currents in some cases.

Not true ... it's the same micro-optimization of temp/voltage/process, with maybe a slightly different sweet spot ... simply optimizing the environmentals and hand screening parts, as overclockers do for CPUs and memory. For all the same reasons: to avoid the worst case margin based on worst case process, temp and voltage.

After using Totally_Lost for more than 20 years to draw out technical bigots, I'm surprised the bait still works. You would think that smart people would realize that few technical people making strong statements are clueless, or offering to be rightfully skewered as idiots.

Flaming posters based on name, origin, place of work/school, and other plain ignorance factors is pretty poor form ... and at minimum violates expected standards of civility and code of conduct for this, and most forums.

As I noted, the largest parts have a high clock tree skew, slightly smaller parts do not, and I suspect the "THUMP" you are talking about is the same peak pin current problem I have repeatedly asked about, which is not specified by Xilinx, nor are tools available to calculate the resulting voltage transients at the die.

Reply to
fpga_toys

Totally_lost is fpga_toys? I just saw recent posts by both. Do people use sock puppets in this newsgroup?

-Dave

--
David Ashley                http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
Reply to
David Ashley

Yes, and Yes (if you call handles sock puppets).

You will find that I used Totally_Lost as my handle on sf.net to create the FpgaC project, and used that same handle here to announce the start of the FpgaC project last October. I have used it on eBay since March 1998, and in other forums, BBSs, and lists since the late 1970s.

Fpga_Toys is relatively new, dating from last spring.

There are other posters who post from multiple email addresses, often work/home, some with different usernames/handles at the different sites. Not all expose their full legal name on both.

Reply to
fpga_toys

As a reference point, bit toggle rates of high density bit serial LUT SRL designs are SIGNIFICANTLY higher and, depending on the statistical distribution of 1's and 0's in the data, easily hover around 75% or better. Doing designs which ignore the data-specific toggle rates, by simply assuming toggle rates based on traditional parallel designs, is folly at best.

The RC5 design was interleaved and pipelined with NO parallel operations. The nature of the RC5 data stream after the first stage is highly random, i.e. 50% of the data changes state on each clock, 25% of the data changes state at half the clock rate, and 12.5% of the data changes state at 1/4 the clock rate ... to an average of near 75%, with statistically frequent short-term variances much higher than that.

The design was interleaved word-wise because the barrel shifters, when implemented in SRLs, required a full word latency, which also doubled the number of SRLs needed to retime the 26-stage SBOX delay for the pipeline. As a result, the design ran with a toggle rate that was well over your offered typical of "usually well under 50%". Most bit serial brute force crypto designs will see the same high average toggle rates.

The same was true of the high density heat simulation model based on LUT SRL's and bit serial MAC's (inspired by your web page). The floating point data, with an assumed leading one, and normalized, has a nearly random toggle rate in the bit serial stream. It was necessary again to word interleave the engine to facilitate parallel loading of a multiplier, which shifted the ratio of LUT SRL's up to retime the interleaved data. And again, toggle rates overall were well above 50%, actually near 75%, and nothing like your "usually well under 50%" guideline.

Reflecting on these two "normal" bit serial designs, I would suggest that heavily pipelined bit serial crypto and floating point math engines which are hand packed, and highly replicated, will easily exceed the 50% toggle rates. This is especially compounded when considering the serial shift costs of LUT SRLs.

Replacing LUT SRLs with LUT RAMs certainly helps by removing the SRL bit shifting power, but doesn't lower the design toggle rates below 50% ... as they remain around 75% unless there are significant data dependencies that cause significant runs of zeros or ones.
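As a quick way to see where a particular data stream falls between the "usually well under 50%" rule of thumb and the worst case, here is a small Python sketch (my own, illustrative only) that measures the per-node toggle rate of a serial stream:

import random

def toggle_rate(bits):
    # Fraction of clock cycles on which a serial node changes state.
    transitions = sum(a != b for a, b in zip(bits, bits[1:]))
    return transitions / (len(bits) - 1)

# A memoryless random stream toggles a node about half the time ...
random_stream = [random.getrandbits(1) for _ in range(100000)]
print(f"random data: {toggle_rate(random_stream):.2f}")

# ... while the worst case 101010... pattern toggles on every clock.
print(f"101010...  : {toggle_rate([i % 2 for i in range(1000)]):.2f}")

The raw per-node number for random data comes out near 50%; the higher design-wide averages quoted above also fold in the serial shift costs of the LUT SRLs discussed earlier.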

The problem is that 75% becomes the average power, and there are certainly valid data patterns that will occur which are higher for brief periods (clocks for several word latencies). This is particularly true where the whole design is synchronized by the same data seed, or a single worst-case variable is shared by all engines ... 1010101, at which point the number of bits changing state goes to nearly 100% for that word length in bits, over many clocks ... and possibly a far higher power demand than the device, package, or PCB decoupling can support if designed based on a "usually well under 50%" rule of thumb. Cooling design can float through this; power design can not.
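To put rough numbers on that, a first-order sketch using the standard CMOS dynamic power relation P = alpha * C * V^2 * f. Every component value here is hypothetical, chosen only to show the ratios; none comes from any Xilinx datasheet:

def dynamic_power(alpha, c_switched, vccint, f_hz):
    # First-order CMOS dynamic power for average activity factor alpha.
    return alpha * c_switched * vccint**2 * f_hz

C_SW = 50e-9    # hypothetical total switched capacitance, farads
VCC = 1.5       # hypothetical VCCINT, volts
F_CLK = 150e6   # hypothetical clock, Hz

for alpha in (0.25, 0.50, 0.75, 1.00):
    print(f"alpha = {alpha:.2f}: {dynamic_power(alpha, C_SW, VCC, F_CLK):5.1f} W")

The absolute watts depend entirely on the assumed capacitance; the ratio is the point. A power system budgeted for "well under 50%" activity is roughly 2x short the moment the data degenerates to 1010101.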

I do worst case design when at all possible ... that is difficult with Xilinx parts, as the full specification provided doesn't come close to answering the questions regarding peak VCCINT pin currents in the short period following a clock transition for a specific design and data, or how those currents translate into VCCINT voltage drops at the die.

Best case, for a large FPGA in a large package using a single clock, the global clock network skew will diffuse the current peaks. That, however, also slows the design down, prompting the designer to segment the clock network into multiple domains, and in the process remove the skew and increase the number of gates transitioning right after the clock, so that the majority are well away from the next clock edge. This implies that the 100W of peak average power is now time-compressed into a small number of time points clustered around typical LUT propagation times and typical high density regular routing propagation times. This time synchronization, a natural side effect of optimizing the design for performance, also increases the probability of instantaneous current spikes that are several times the long term average ... many by as much as a factor of 10.
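A crude sketch of that time-compression effect, using the 100W figure above and otherwise hypothetical numbers. This is the charge the local decoupling must source each cycle, not a steady pin current:

AVG_POWER = 100.0   # W, the peak average power figure above
VCCINT = 1.5        # V, hypothetical core supply
F_CLK = 150e6       # Hz, hypothetical clock

# Charge delivered per clock cycle at that average power.
CHARGE_PER_CLOCK = AVG_POWER / (VCCINT * F_CLK)

def peak_current(window_s):
    # Crude peak if the cycle's charge is drawn inside window_s seconds.
    return CHARGE_PER_CLOCK / window_s

diffused = peak_current(2.0e-9)     # transitions spread over ~2 ns of skew
compressed = peak_current(0.2e-9)   # clustered into ~200 ps after the edge

print(f"diffused  : {diffused:7.1f} A peak")
print(f"compressed: {compressed:7.1f} A peak ({compressed / diffused:.0f}x higher)")

The window widths are guesses; the factor-of-10 ratio between them is the point.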

As Austin puts it ... a power "THUMP" that may leave the device unstable, and which should be part of doing good worst case FPGA design from my perspective. Especially for reconfigurable computing FPGA engines, where worst case designs are probable in this respect, and certainly not "unlikely".

Reply to
fpga_toys

Sock puppet, from wikipedia:

formatting link

Sockpuppet (sometimes known also as a mule, glove puppet, or joke account) is an additional account created by an existing member of an Internet community pretending to be a separate person. This is done so as to manufacture the illusion of support in a vote or argument or to act without social effect on one's "main" account. This behaviour is often seen as dishonest by online communities and as a result these individuals are often labeled as trolls.

I only recently became aware of the term myself.

-Dave

--
David Ashley                http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
Reply to
David Ashley

I certainly try to make sure I log in with the same handle for a particular thread; sometimes I may not. I've not seen much here that indicates people are purposefully manipulating the discussion by concurrently using different handles.

Reply to
fpga_toys

I have multiple accounts myself, but I list my name in all. Frankly I find it childish to make up new names for yourself. Always makes me wonder what you're trying to hide.

As for your use of handles elsewhere, why exactly would we know or care?

Tommy

Reply to
Tommy Thorn

Clearly not hiding ... as I've been clear, using both handles, that I'm also the primary developer for FpgaC. As for the choice of a handle, Totally_Lost has always been useful in some threads.

Some posters here state things strongly on the reputation of their name, knowing that few will question their position simply because of that reputation, accepting whatever crap as gospel.

Totally_Lost has exactly the opposite effect: no matter what position I take, somebody will step up to the plate and, with strong moral authority, refute it, assuming I'm clueless and lost.

In some discussions that side effect of the choice of handle is useful; in others it would detract from the quality of the discussion by prompting unnecessary flame wars.

That should always be true, and we should never see shit heads jumping on a poster just because they think the poster is helpless.

On the other hand, we have human nature at its worst ....

Reply to
fpga_toys

Just as another reference point on the speed/parallel processing question, this news from NEC (sounds like a real device):

formatting link ?articleID=192300291

Claims this:

"Imapcar has 128 processing elements, each with embedded memory. The 128 parallel processing elements use the SIMD (single instruction stream multiple data stream) method. Each element processes four instructions per cycle. Thus, total performance was 100 Gops running at 100 MHz, enabling real-time image recognition at 30 frames per second."

and this:

formatting link

Can't see any mention of how much embedded memory?

-jg

Reply to
Jim Granville

It will be interesting to see if that ends up in NEC's next generation supercomputers too. If the chip has reasonable on-board memory, and lots of off-board bandwidth, it's surely a monster :)

I'm still waiting to see how the Cell processor matures into other product lines besides gaming.

Reply to
fpga_toys

Yes, it seems a very good idea, in the specialised research niches, to watch closely the chip output of the large revenue areas, like gaming, and now automotive vision - after all, their R&D spend makes the FPGAs look like toys...

-jg

Reply to
Jim Granville

Yep ... fpga_toys :)

And they are really successful companies, capable of eating Xilinx with chump change and an afterthought.

Reply to
fpga_toys

I almost threw up when I read John Bass explain how he uses a deliberately dumb-sounding name plus naive questions to light some fire and create controversial flames.

Really destroys my enthusiasm for this newsgroup. It's like a rich guy pretending to be homeless, and then hitting you in the groin while you are reaching for your wallet. Scum is the word describing that kind of behavior...

I'll try to get over this, and will fly to Madrid for the European FPL conference. Hopefully more civilized discussions there. BTW, I'll have an ML501 board with me, just to prove a point or two.

Peter Alfke

Reply to
Peter Alfke

Peter,

Don't be offended; I don't think he meant it in any destructive way. I for one appreciate your presence, and that of other Xilinx/Altera/etc. individuals. On the internet, and usenet in particular, one has to keep some distance as well as have a thick skin.

In the little time I've been here I've realized these groups are a very valuable resource, and questions I've posted and received answers to have gotten me over a roadblock. It's worthwhile spending the time and contributing, even from a business perspective, IMO. Stick around!

I'm off on vacation through Sept 1, in case people post and I don't follow up.

-Dave

--
David Ashley                http://www.xdr.com/dash
Embedded linux, device drivers, system architecture
Reply to
David Ashley

...snip...

Is it possible that your problem is something other than heat? Power decoupling is more of a science than an art or guessing game. It is entirely possible that your decoupling was not adequate in spite of the augmentation. To fully decouple power for fast chips, the power planes must be built with a lot of internal capacitance, along with a range of different capacitor values, to allow parallel resonances to be bypassed and a low impedance to be obtained over a wide frequency range.
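For what it's worth, the usual first-order model behind that last sentence treats each bypass cap as a series R-L-C and the network as their parallel combination. A quick Python sketch, with assumed (not recommended) component values:

import math

def z_cap(f, c, esl, esr):
    # Series R-L-C impedance of one bypass capacitor at frequency f.
    w = 2 * math.pi * f
    return complex(esr, w * esl - 1 / (w * c))

def z_network(f, caps):
    # Parallel combination: sum the admittances, invert.
    return abs(1 / sum(1 / z_cap(f, *cap) for cap in caps))

# (C farads, ESL henries, ESR ohms) -- same package parasitics throughout
bank = [(1e-7, 0.8e-9, 0.05)] * 4 + [(1e-8, 0.8e-9, 0.05)] * 8

for f in (1e6, 1e7, 1e8, 3e8):
    print(f"{f / 1e6:6.0f} MHz: {z_network(f, bank) * 1000:8.2f} mohm")

This ignores the plane capacitance entirely, so it is only the capacitor half of the story.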

Reply to
rickman

Rick, You may find this enlightening!

formatting link
A range of values (in the same package size) is no help; it just bloats your BOM. Cheers, Syms.

Reply to
Symon

Not heat; the XCV2000E rolled over right after loading, with a room temp die.

You know, it's always possible.

I stacked 0805/0603 caps right across the back side of the BG560 power/ground pads ... 0.01uF, 0.1uF, and 1.0uF on top of each other in a second test, and it just delayed the failure for a few seconds. It took abandoning the LUT SRLs to get any stability, plus backing the clock off.

I personally don't think there is any way to fully load a large BG560 XCV2000E device with a dense bit/digit serial computational engine. I did notice that both Austin and Peter were not that quick to defend that it should be possible.

The only bright spot is that Austin says large XC4V and XC5V parts can handle a fully loaded 1010101 design, which I would love to verify ... but I'm certainly not going to spend $10K to test it from my own funds.

My personal belief is that the ground ring/net is probably shared with the I/O pads, and is probably stable. I also suspect that it's the VCCINT pads that are dropping, which would make the worst case not 1010101 ... but a large number of 0->1 transitions in the fabric. Bit serial doesn't do that easily, or likely, but it does happen too: when a MAC is cleared, the first add value for the next term is all ones. The problem is complicated when you also have heavy concurrent pin I/O in progress ... which I didn't. At least Xilinx is quick to rate the max number of up/down transitions for that problem.

This mantra about 25% transitions is total BS for dense bit serial designs, SRL's or not.

Reply to
fpga_toys

And just to correct the post, John_H wrote: ...snip...

To which Austin replied: Ray, ...snip...

and both of you end up eating crow for it. That's your problem; take responsibility for it.

Reply to
fpga_toys

Yes, I see what you are saying. But the point is that you and Dr. Johnson make these statements with absolutely no supporting evidence. I was in the class, I saw the impedance plots, and I even asked the exact same question about using the same package with the same parasitics. The issue is not the inductance; the issue is SRF. It took me a while to understand why this is important. It was especially hard for me to grasp when I looked at the data for three different value caps superimposed, showing that the inductive reactance region was nearly identical. But the smaller caps had a higher SRF. So combining a small number of 0.1 uF, a few more 0.01 uF, and more of the 0.001 uF parts seems to provide the best solution.

The impedance plot of the bare board with well coupled ground and power planes (no parts) shows the impedance dropping with frequency until it hits an SRF. The impedance goes up a bit and then seems to oscillate in the very high frequency region. Adding a single value of caps lowers the impedance at the low end, but actually makes the impedance higher at 150 MHz, from parallel resonance between the caps and the plane. It is not clear to me that any reasonable number of additional caps would reduce this peak to an acceptable level. Adding some caps of a smaller value added a new minimum in the impedance and raised the frequency of the parallel resonance. Adding a third value of caps makes the impedance plot quite acceptable, with peaks less than about 100 mohm up to 3 GHz and the bulk of the graph under 25 mohm below 300 MHz. The graph is very clear, with the capacitor-related minimums at about 20, 50 and 200 MHz.
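The arithmetic behind those minimums is easy to check: with the mounted inductance held roughly constant by the package, SRF = 1 / (2 * pi * sqrt(L * C)). A quick sketch, where the 0.8 nH is my assumption for a small ceramic plus its mounting:

import math

ESL = 0.8e-9  # henries, assumed mounted inductance, same package each time

for c in (0.1e-6, 0.01e-6, 0.001e-6):
    srf = 1 / (2 * math.pi * math.sqrt(ESL * c))
    print(f"{c * 1e6:7.3f} uF -> SRF ~ {srf / 1e6:5.0f} MHz")

With 0.8 nH those land near 18, 56 and 178 MHz, which lines up nicely with the 20, 50 and 200 MHz minimums in the plots.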

This guy challenged some of my very deeply held beliefs about how to design power systems. I could not find a single thing wrong with anything he said. Others in the class would argue with him about their beliefs. But in the end it was beliefs vs. facts ... no contest!

There are two things that Dr. Johnson said that I want to address. "In addition, you have introduced the possibility of a resonance occurring, as you point out, between the lead inductance of the larger capacitor and the capacitance of the second, smaller-valued component." I think this is pretty clearly nonsense and I am surprised he said it. He refers to the two values of capacitors not being useful because they have the same inductance. Now he is saying that the inductance of one cap can interact with the capacitance of the other. Isn't that a contradiction? Wouldn't that produce the same result as the second capacitor's self resonance?

The other thing he said is, "The best method for controlling the resonances between sections of the power system is to buy cheap, low-Q bypass components in the first place". Even though this seems counterintuitive, the data bears it out. The low quality (i.e. high ESR) reduces the amplitude of the parallel resonance peaks and makes the impedance more even. It sounds odd, but it is correct; again, the data bears it out.
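The low-Q claim falls out of the same series R-L-C model: at the anti-resonance between two different-valued caps, the peak impedance is roughly X^2 / (2 * ESR), so a higher ESR flattens the peak. A sketch with assumed values:

import math

def z_parallel(f, c1, c2, esl, esr):
    # |Z| of two series R-L-C bypass caps in parallel at frequency f.
    w = 2 * math.pi * f
    z1 = complex(esr, w * esl - 1 / (w * c1))
    z2 = complex(esr, w * esl - 1 / (w * c2))
    return abs(z1 * z2 / (z1 + z2))

C1, C2, ESL = 0.1e-6, 0.001e-6, 0.8e-9

# Anti-resonance: the big cap's inductive reactance cancels the small
# cap's capacitive reactance (both share the same package ESL).
f_anti = math.sqrt((1 / C1 + 1 / C2) / (2 * ESL)) / (2 * math.pi)

for esr in (0.005, 0.05, 0.5):
    z = z_parallel(f_anti, C1, C2, ESL, esr)
    print(f"ESR = {esr * 1000:5.1f} mohm -> |Z| at anti-resonance = {z:6.3f} ohm")

Two isolated caps with no plane exaggerate the peak, but the trend is the one the data showed: the lossier parts damp it.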

Reply to
rickman
