Multi-FPGA PCB data aggregation?

Hello! I'm trying to build a 20-FPGA (spartan-3s, XC3S400-TQ144s) board for a class project to investigate the use of FPGA arrays for accelerating scientific computation. The idea was to have a 66 MHz 16-bit single-ended shared TX bus sending 125 MBps to each FPGA, and a shared 66 MHz 16-bit data aggregation bus where a bus controller would poll each FPGA sequentially to place its output data onto that bus.

After discussions with some signal-integrity-leery friends, I'm no longer convinced that a 12"x12" 20-IC board at 66 MHz with single-ended buses is such a good idea. I've been reading the various datasheets on doing LVDS and DDR signaling. Multidrop LVDS is still a bit tricky, evidently, but to cut down on trace count I might be able to go to 125 MHz DDR 4-pair LVDS for the TX bus; the problem I'm having is with the data aggregation bus.

Could the aggregate data bus be structured in a similar manner, with:
a.) all FPGAs connected to the 4 DDR LVDS pairs, and
b.) a single master, with a separate output-enable line to each of the 20 FPGAs?

Such that when FPGA n is output-enabled, it would drive its m nibbles of data onto the output aggregation bus. But this would require FPGA n to drive its pins within 4 ns; this sounds nearly impossible.

Are there any common solutions I'm missing? One thought was to aggregate all of the data from the FPGAs via dedicated serial links and do clock recovery at the bus master; this would require recovering 20 separate clocks, alas, and with the spartan 3s we don't have quite that many DCMs.

An obvious solution is "do an IBIS simulation, duh" but we don't have access to the sort of high-end signal integrity simulation software that this would require.

Can anyone with spartan-3 serial interconnect experience offer suggestions as to how to make this work, either through different LVDS configurations or interconnect topologies?

Thanks for any advice you can give, ...Eric



Howdy Eric,

I don't think it necessarily would be a bad idea, except that there is no telling how long it will take all the reflections to settle out (highly dependent upon PCB routing). One way to cut down on all the reflections would be to put a weak termination pull resistor, or maybe AC termination, at every couple of FPGAs. Don't use the built-in pull resistors - you can't tune them, and they are too weak to be used as anything but a pure weak pull-up/down. But too much termination will affect rise time.


You haven't said how many spare pins the aggregator FPGA will have, but my first thought was to do as you mention and use dedicated serial links for each FPGA, but without the clock recovery. Instead, just keep the whole board synchronous: distribute the clock so that it reaches every FPGA at the same time and put the aggregator FPGA in the middle. Use the IOB flops for both output and input, and I think you'll have plenty of time left over for PCB data prop delay.

Good luck,

Marc Randolph

Marc, thanks for the answers!

As I understand it, reflections are still a problem with LVDS signalling, so you're right, this could be an issue. If I've got a shared (multidrop) LVDS bus with 20 inputs and one driver, I'm also worried about the input capacitance slowing rise time. Would it be possible to use the DCI on the Spartan-3s to help cut down on reflections?

20 FPGAs * 4 DDR LVDS Pairs = 160 wires, which isn't something I can pull off with any components I can solder by hand. :( Plus this configuration is a bit of overkill, as I really only need ~125 MB/sec aggregate bandwidth (I'm reusing a gigabit ethernet design I did for off-board IO... UDP/IP in vhdl isn't fun, but it's fast).

I agree that keeping things synchronous would be best, but then I get worried about the combined effects of propagation delay and clock skew, which would seem to easily swamp the 4 ns window for DDR 125 MHz. Is there any way to use the DCMs on the Spartan-3s to help reduce the effects of clock skew across the 20 FPGAs? I have to confess that I've never quite understood how to use the DCMs, even after reading the app notes.

Thanks again for all the help, ...Eric


How about daisy-chaining these dedicated serial lines in some fashion, rather than running them all to the bus master? The output of each non-master FPGA goes to the next non-master FPGA, and only the last one goes to the bus master. This is not without its intricacies, but certainly worth considering.

--
Phil Short

Why not use LVDS to connect each FPGA to a central FPGA? Make a data packet header that indicates where data should be routed and where it is from. This way data can route to/from any pair of FPGAs, or even to multiple FPGAs (broadcast header).
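
As a rough illustration of the header idea (in Python, just as a bit-packing sketch; the field names and widths here are only an example I made up, not a spec):

BROADCAST = 0x1F  # reserved destination id meaning "all FPGAs"

def pack_header(src, dst, length_words):
    # 16-bit header: [15:11] source FPGA, [10:6] destination (or BROADCAST),
    # [5:0] payload length in 16-bit words. Field widths are illustrative only.
    assert 0 <= src < 20 and (0 <= dst < 20 or dst == BROADCAST)
    assert 0 <= length_words < 64
    return (src << 11) | (dst << 6) | length_words

def unpack_header(hdr):
    return (hdr >> 11) & 0x1F, (hdr >> 6) & 0x1F, hdr & 0x3F

# Example: FPGA 3 broadcasts a 12-word packet through the central FPGA.
hdr = pack_header(3, BROADCAST, 12)
print(hex(hdr), unpack_header(hdr))   # 0x1fcc (3, 31, 12)

The central FPGA then only has to look at the destination field to decide which link(s) to forward the packet onto.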

This way you have point-to-point links, which are relatively easy to route on your PCB. Depending on your data needs, you can use multiple links to increase the data rate, or use quadrant routing (or whatever suits).

Usually with LVDS you should consider whether you are going to send a clock with the data. Simplest is to send a clock pair alongside the data pairs so that skew between data and clock is minimised. At your data rates you may be able to use a central, or common, clock approach, but watch for differences in trace lengths causing skew.

John Adair Enterpoint Ltd. - Home of Broaddown2. The Ultimate Spartan3 Development Board.

formatting link



Howdy Eric,

Sorry, I didn't really address the differential situation. Multi-drop LVDS should take care of most of the reflections, but not having done it myself, I can't address the pros and cons - and I especially don't know what the smallest usable bit time is.


To save pins, my favorite idea is a variation on Phil Short's response. You could do a tree network: the four chips in each corner would feed one of four sub-aggregators. The sub-aggregator would combine the four streams plus its own, and forward the resulting data to the main aggregator (or maybe the sub-aggregators could feed each other to do a distributed aggregation?). Of course, this assumes you can stand the pipeline delay, and have enough space in those four sub-aggregators for a 5:1 function. This would keep the pin count down enough that you could go source-synchronous if you wanted.
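
Back-of-the-envelope on why the tree helps the aggregator's pin count (Python; I'm assuming each serial link is 4 DDR LVDS data pairs plus a clock pair, i.e. 10 pins - just an illustrative figure carried over from the earlier DDR LVDS discussion, not something I've sized for real):

PINS_PER_LINK = (4 + 1) * 2   # (4 data pairs + 1 clock pair) * 2 pins per pair

flat_star_main = 20 * PINS_PER_LINK        # every FPGA with its own link to one aggregator
tree_main      = 4 * PINS_PER_LINK         # only the four sub-aggregator uplinks
tree_per_sub   = (4 + 1) * PINS_PER_LINK   # four corner inputs + one uplink per sub

print(flat_star_main, tree_main, tree_per_sub)   # 200 40 50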


If the clock nets going to each target device are kept the same length, clock skew should not be a concern. The window is more of a challenge...

Approximate worst case timing:

Tickofdcm:  1.75 ns        (basically clk to out)
prop delay: ~1.45 ns       (approx prop delay of 8 inches of trace)
Tiopick:    1.89 + 0.75 ns (setup time for LVDS)
---------------------
total:      5.84 ns

WAG at best case (just to have some numbers to throw around):

Tickofdcm:  1.25 ns  (basically clk to out)
prop delay: ~1 ns    (prop delay of approx 6 inches of trace)
Tiopick:    2 ns     (setup time for LVDS)
---------------------
total:      4.25 ns

So if these numbers are close to correct (they might not be), the data would appear at the next chip as if it had been delayed by a clock cycle (although the best case WAG cuts the margin very close).


I believe the original use of DLL/DCM/PLL devices is actually for that - system-synchronous clocking. In their default mode, they remove the delay so that the rising edge internal to the FPGA occurs at the same time as the rising edge outside the FPGA (oftentimes, a tad BEFORE) - so the whole board, inside the FPGA and out, is transitioning at basically the same time.

All you have to do (famous last words) is take the clock to out time, prop delay, and setup time, and you can compute your timing budget against your period. Of course, if you have a range of net lengths and your IO timing is somewhat variable due to temperature, voltage, process, etc, system synchronous timing can be a challenge for higher clock rates - starting at around the speed you're at.
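
To put rough numbers on that budget (a quick Python sketch using the approximate figures above - they're WAGs, not datasheet values; the 4 ns comparison is the 125 MHz DDR bit window you mentioned):

BIT_PERIOD_NS = 4.0   # 125 MHz DDR -> 4 ns per bit window

def budget(tckofdcm_ns, prop_ns, setup_ns):
    # clock-to-out + trace prop delay + input setup, compared against the bit window
    total = tckofdcm_ns + prop_ns + setup_ns
    return total, total - BIT_PERIOD_NS

# Worst case: ~1.75 + ~1.45 + (1.89 + 0.75)
print(budget(1.75, 1.45, 1.89 + 0.75))   # ~5.84 ns total, ~1.84 ns into the next bit window
# Best-case WAG: ~1.25 + ~1.0 + 2.0
print(budget(1.25, 1.0, 2.0))            # ~4.25 ns total, still the next window, but barely

As long as the total stays above one bit window across all corners, the data simply shows up one window late everywhere; the trouble starts if the fast corner dips below the window while the slow corner doesn't.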

Good luck,

Marc


Eric, why do you want to use many small FPGAs in bad packages, when you could reduce your device count by a factor of 8 and use a package that guarantees better signal integrity? Spartan-3 has lower performance than Virtex-II, and the TQ144 is probably the worst available package from a signal-integrity point of view.

I would use a few Virtex-II devices (perhaps eight 2V2000 or two 2V6000 chips), and avoid most of the interconnect hassles that you mentioned. "Keep most of the routing on chip!"

If this is a university research project, contact Xilinx University Support. They can be quite helpful...

Peter Alfke, Xilinx Applications

Peter, Xilinx University support has already been very helpful, supplying me with an ISE license and the XC3S FPGAs I'm planning on using in this project. The XC3S400 has 3584 slices, so for a 20-FPGA board I'd have 71k slices available. I can get these for $400 total (nuhorizons). The XC2V2000-4BG575C (again, nuho) costs $450, and has 10k slices. So to get the same number of slices I'd be looking at $3150 in FPGA hardware. With the XC2V6000, nuhorizons doesn't list a price, but I can imagine the expense would be similar. I agree with your advertising: spartan-3 is the cheapest Xilinx technology you can get.

Plus, there are additional costs when using the BGA packages -- the need for 6-mil traces and 10-mil vias, coupled with the high layer count needed just to get at the inner balls, rules pcbexpress.com and friends right out. Even better, you can't exactly attach a BGA component to a PCB by hand (and I don't think I'd want to try with a $450 component and a toaster oven). I know there are places that will do this for you, but they have high setup fees (which for a one-time project like this really kills the budget).

I know my approach is in many ways sub-optimal but I'll be able to build it all for around $600. That's a lot of logic to play with for a relatively small amount of cash. And it will let me play around with some algorithms and whatnot to determine if it's really worth it. And if I can find some places where it will let me speed up my daily computations, maybe I'll have time to consider investing in the high-end FPGAs for the next generation.

Thanks, ...Eric


Eric, the economics of a student project and a commercial project are obviously very different. In industry, a cost difference of $ 1000 or even $ 2000 would never jeopardize a research project. Industry values one engineer-day as at least $ 500, and I would bet that the larger parts and the reduced pc-board trouble would save many days of work and achieve more impressive performance. But when it's educational, and the labor is "free", the picture is obviously different. I just wanted to save you some headache with the TQ144s.

Peter Alfke


uh... unless this $1k-2k removes the research subject altogether ;o) Say, someone's interested in constructing arrays of FPGAs, and once happy with the results intends to move into larger-IC territory?

Lukasz Salwinski

Peter Alfke wrote: [...]

Howdy Peter,

Could you expand on this? I know that Xilinx likes to push Virtex as the performance family and advertises V2 as being faster than S3 (and on a few things it is)... but looking at the detailed speedprint numbers, on average they look to be quite comparable, and on some parameters S3 is noticeably faster - presumably due to 90 nm.

Agreed. If he can somehow swing getting to a BGA, it could possibly solve his I/O problem as well.

Marc


Marc, Peter, and everyone else who has been helping me, I think I've come up with a solution.

My goal was to be able to get 1 Gbps TO each FPGA and to aggregate 1 Gbps total from all the FPGAs, and Marc's suggestions tipped me off as to how I might do this: arrange them in a ring.

I have enough pins (barely) that each FPGA n can have 14 DDR LVDS pairs in from the n-1 FPGA, and 14 DDR LVDS pairs out to the n+1 FPGA. We partition these in the following way:

1. 5 in / 5 out are used for the broadcast data bus
2. 5 in / 5 out are used for the aggregate data bus
3. 4 in / 4 out are used for direct inter-FPGA communication

So I'll have this 20 FPGA "ring" with one extra link: the source/sink FPGA which will also talk to my gig-E nic.

Now, to broadcast data to all of the FPGAs, the master FPGA first sends the data to FPGA 0, which makes a copy in an internal buffer and then sends it on to FPGA 1, etc. We use the 5 high-speed pairs: 4 for data and one for an EN pulse.
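
Sanity-checking the bandwidth (my arithmetic, assuming 125 MHz DDR on each pair and 4 of the 5 pairs carrying data):

CLOCK_MHZ  = 125
DDR_FACTOR = 2      # two bits per pair per clock
DATA_PAIRS = 4      # the fifth pair carries the EN pulse, not data

gbps = DATA_PAIRS * CLOCK_MHZ * DDR_FACTOR / 1000.0
print(gbps, "Gbps per hop")   # 1.0 Gbps, which matches the 1 Gbps goal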

To aggregate data from the FPGAs, the bus master FPGA first streams an empty packet to FPGA 0; FPGA 0 passes this packet unchanged to FPGA 1, but notes that "packet -1 just went by", and so next sends its own output to FPGA 1 (or a null packet, saying "I have no data to send"). FPGA 1 waits until it has passed FPGA 0's packet through before sending its own; and so on. In this way, FPGA n waits until n-1 has sent, and then sends its own data on to n+1. At the end, the bus master FPGA collects all of this data and then restarts the cycle.
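
Here's a quick behavioral model of one aggregation pass, just to convince myself the sequencing works (Python; the packet representation is made up purely for illustration, nothing like the real hardware):

N = 20

def aggregation_cycle(outputs):
    # outputs[n] is FPGA n's payload this cycle (None = nothing to send).
    stream = ["empty"]   # the master kicks things off with an empty packet
    for n in range(N):
        forwarded = list(stream)   # FPGA n passes along everything from n-1 unchanged...
        # ...and, having just seen packet n-1 (the master's empty packet when n == 0),
        # appends its own data or an explicit null packet.
        forwarded.append(("fpga", n, outputs[n]) if outputs[n] is not None else ("null", n))
        stream = forwarded
    return stream   # what the master collects back from FPGA 19

collected = aggregation_cycle({n: ("data%d" % n) if n % 3 == 0 else None for n in range(N)})
print(collected[:4])   # ['empty', ('fpga', 0, 'data0'), ('null', 1), ('null', 2)]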

There are a number of advantages here:

  1. Routing: each FPGA has a 2-3" LVDS connection with its neighbors; that's it. ~700 ps delay there on FR4 board (if I'm remembering my High Speed Digital Design numbers right).
  2. Clock skew. The clock skew between two adjacent FPGAs is likely to be very very small, because even if they're fed by a long clock line from the clock source, the differential in that clock line between them is probably small.
  3. Cost: I can route almost all of these traces on the top layer of my board; I can probably get away with using a 4-layer board now.
  4. Flexibility: If I have an application where, say, what I really want is to have a long chain of FPGAs, with n passing computed data to n+1 for further processing, I can do it this way.
  5. physical geometry:

I can snake them around my board like:

 4   5  14  15
 3   6  13  16
 2   7  12  17
 1   8  11  18
 0   9  10  19
MASTER     NIC

which means that the master FPGA is still relatively close to the two FPGAs it needs to talk to, 0 and 19.

Unfortunately, for the above configuration the WAG would likely be much shorter, and thus it might not work. The worst case would be

Tickofdcm:  1.75 ns
prop delay: ~0.6 ns
Tiopick:    1.89 + 0.75 ns
--------------------
total:      ~5 ns

and the WAG best would be 3.85 ns. Ouch. And this is with using all the tricks I know, like registers on all the IOBs, etc.

Given that, and the very short distances (compared to before) these traces are going, would it be wise to abandon LVDS and DDR and just have adjacent links be 125 MHz single-ended?

Oh, my kingdom for an IBIS simulator!

Thanks again to everyone for the help!

...Eric


You may want to do a google search on the following two keywords: splash FPGA

If you look at the sixth hit, page 3, you see a bunch of FPGAs in a configuration similar to yours (from about 12 years ago):

formatting link

There is lots more out there on Splash and Splash2.

Another research FPGA project that may interest you would start with the google search of: PAM FPGA

Philip

Philip Freidin Fliptronics


Forgive my curiosity, but why are you going to use a cluster of FPGAs to perform scientific computations? Wouldn't a bunch of high-speed floating-point DSP processors be better? FP calculations, which are not well supported by the low-cost FPGA chips (huge amount of shifting etc.), are the core of most scientific computations.

Best regards Piotr Wyderski


Well, at this point it's all about experimentation. This is for a class on parallelism and supercomputing (Beowulfs and the like), and I've wanted to build some arrays of FPGA-based hardware for a while. There are some applications for which floating point is unnecessary (signal processing, crypto, etc.), and it also seems that at least single-precision FP might be possible (I want to address that as well).

My day job revolves around large-scale pattern detection (think lots of FSMs) in neural data, so there's not much FP at all... ...Eric



Howdy Eric,

Aside from having no idea how badly the TQ package will muck the signals up, 125 MHz single-ended shouldn't be a problem for the PCB or the die. With a BGA package, even single-ended 125 MHz DDR should work fine. To get around the I/O timing problem that comes with system-synchronous, you can go back to the idea of using a source-synchronous clock to flop it on-chip (either use local routing, which should finally be available with 7.1 on an S3, or a DCM).

Good luck,

Marc


 4   5  12  13
 3   6  11  14
 2   7  10  15
 1   8   9  16
 0  19  18  17
--
Tobias Weingartner

Why not use a single bus for both? Simpler and higher bandwidth (or lower clock rate). For example, you can allocate fixed bandwidth to each FPGA by allocating slots in the data stream. The slot that is used to transfer data on the ring from FPGA N to the master FPGA will be unused between the master and FPGA N, so you can use the same slot to transfer data both ways.

This static scheduling actually looks a little bit like a JTAG shift register: the master shifts data into the N FPGAs. After N cycles, each FPGA reads the data from its register and replaces it with its own data, which is then sent to the master during the next shifting operation.
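
A little model of that shift-register view (Python; the word format and the two separate shift phases are simplifications I made for illustration - in the real scheme the return shift would overlap with the next downstream shift):

N = 4   # small N so the printout stays readable

def shift_cycle(master_words, fpga_words):
    # master_words[n] is destined for FPGA n; fpga_words[n] is FPGA n's reply.
    stages = [None] * N
    # Phase 1: the master shifts its N words in; after N shifts, word n sits in stage n.
    for word in reversed(master_words):
        stages = [word] + stages[:-1]
    received = list(stages)      # each FPGA reads the word sitting in its slot...
    stages = list(fpga_words)    # ...and replaces it with its own data
    # Phase 2: N more shifts push the replies out the far end, back to the master.
    replies = []
    for _ in range(N):
        replies.append(stages[-1])
        stages = [None] + stages[:-1]
    return received, replies

rx, tx = shift_cycle(["m->%d" % n for n in range(N)], ["%d->m" % n for n in range(N)])
print(rx)   # ['m->0', 'm->1', 'm->2', 'm->3']
print(tx)   # ['3->m', '2->m', '1->m', '0->m']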

Kolja Sulimma


I'd suggest picking a particular problem and working it through far enough that you can estimate the speedup relative to a simple PC or DSP chip. That way you can find the hot spots in your design and/or see if the economics make sense.

--
Hal Murray

Peter Alfke wrote:
: Eric, why do you want to use many small FPGAs in bad packages, when you
: could reduce your device count by a factor of 8 and use a package that
: guarantees better signal integrity?

Peter, not the OP's reason by the looks of things, but I can see one advantage of more, smaller chips over fewer, larger chips in a related area.

Modular reconfiguration of number-crunching black boxes! The synthesis times for large chip designs tend to be quite at odds with the timescales software people are used to, and pretty much preclude run-time-modifiable RC acceleration in more than specialised "we know all options in advance" capacities.

The number of times I've heard different vendors in the area say "... our product converts this number crunching to edif/vhdl/... in seconds. Then we set the Xilinx tools off and go to the pub."

I realise there is support from Xilinx for reconfigurability on a sub-chip modular scale, but as I understand it, it is reliant on the emulated tristate bus feature that is lacking in newer devices.

I suspect the first vendor to come up with a decent approach for sane HDL -> bitstream toolflows on a sub-chip modular scale will be very attractive to the HW acceleration / RC folks.

Also, I could imagine the reduced PAR etc. times from being able to partition a chip and only re-syn+PAR parts of it would make interactive testing / debugging a lot more of an 'interactive' experience :-)

Finally, it would provide people with a way of breaking very memory-hungry PAR runs into more manageable (and parallelisable!) jobs. Obviously the module boundaries need to be carefully thought out.

Okay I guess the reconfigurable computing world isn't that big for Xilinx sales *yet...*

(This is from my Dear Santa note :-)

Cheers, Chris


c d saunter
