Multi-FPGA Interconnection: latest techniques

Hi Experts,

In FPGA prototyping/emulation flows, multi-FPGA partitioning limits performance because of the limited number of IO pins. What are the latest multi-FPGA interconnection techniques available today? How much performance improvement can be expected by using multi-gigabit transceivers?

Thanks in advance, Parth

Reply to
partha sarathy

How much performance do you want? There are transceivers upwards of 56Gbps these days. Questions:

- How many transceivers can you get at that speed?
- How do you route an N Gbps signal from one place to another?
- How many transceivers can you successfully route, and at what speed?
- How do you make that reliable in the face of bit errors, packet loss and other errors?
- What end-to-end bandwidth can you actually achieve?
- What latency impact does all that extra processing have?
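
On the bandwidth question, a quick feasibility check is easy to do; the sketch below (Python) only shows the arithmetic, and the transceiver count, line rate and protocol efficiency are assumptions, not figures for any particular board:

# Back-of-envelope: how many cut (inter-FPGA) design signals can a set of
# transceivers carry at a given target system clock? All figures are assumptions.

NUM_TRANSCEIVERS = 16        # assumed usable links between two FPGAs
LINE_RATE_GBPS = 25.0        # assumed per-lane line rate
ENCODING_EFFICIENCY = 64/66  # 64b/66b line coding overhead
PROTOCOL_EFFICIENCY = 0.9    # assumed framing/CRC/retransmission overhead

def signals_supported(target_sys_clock_mhz):
    """Each cut signal needs one bit per system-clock cycle."""
    payload_gbps = (NUM_TRANSCEIVERS * LINE_RATE_GBPS
                    * ENCODING_EFFICIENCY * PROTOCOL_EFFICIENCY)
    return int(payload_gbps * 1e3 / target_sys_clock_mhz)   # Gbps -> Mbps per signal

for f in (10, 50, 100):
    print(f"{f:>3} MHz system clock -> ~{signals_supported(f)} cut signals")

Raw bandwidth is rarely the limiting factor; the latency and reliability questions above are where the real cost sits.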

Relevant paper of mine:

formatting link

Theo

Reply to
Theo

Hi Theo,

Thanks a lot for the reply.

On a Xilinx UltraScale board with 8 FPGAs, using automatic FPGA partitioning tools (which use muxes for pin multiplexing, i.e. HSTDM multiplexing), the maximum system performance achieved is only 10-15 MHz. An individual FPGA may run at up to 100 MHz, but overall performance is limited to 10-15 MHz because the tool inserts pin muxes with ratios of 8:1, 16:1 and so on.
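
For a rough sense of why the mux ratio caps the system clock, here is a back-of-envelope sketch in Python; the IO clock speed and framing overhead are assumptions for illustration, not measured values from any particular tool:

# Rough model of how pin-multiplexed (HSTDM-style) interconnect caps the
# system clock. All figures below are assumptions for illustration only.

IO_CLOCK_MHZ = 200.0      # assumed TDM clock on the inter-FPGA pins
OVERHEAD_CYCLES = 2       # assumed framing/sync overhead per TDM frame

def max_system_clock_mhz(mux_ratio, io_clock_mhz=IO_CLOCK_MHZ,
                         overhead=OVERHEAD_CYCLES):
    """Each design-level signal gets one slot per TDM frame, so the system
    clock cannot run faster than io_clock / (mux_ratio + overhead)."""
    return io_clock_mhz / (mux_ratio + overhead)

for ratio in (4, 8, 16, 32):
    print(f"{ratio:>2}:1 mux -> ~{max_system_clock_mhz(ratio):.1f} MHz system clock")

With 8:1 and 16:1 ratios this lands in roughly the 10-20 MHz range the partitioning tools report.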

Is there any interconnect technology that can achieve 70-100 MHz on a 4-8 FPGA board? Whether partitioning is done manually or by an auto-partitioning tool, can the BLUELINK interconnect or GTX transceivers achieve 70-100 MHz speeds? Interconnect logic area overhead can be tolerated.

Reply to
partha sarathy

Are you sure you aren't doing something wrong? The purpose of pin muxing would seem to be to increase the data rate. But I assume this will incur pipeline delays. Or do I not understand how this is being used?

--

  Rick C. 

  - Get 1,000 miles of free Supercharging 
  - Tesla referral code - https://ts.la/richard11209
Reply to
Rick C

Hi Rick, thanks for the detailed reply. Does the pipeline delay inserted by the gigabit transceiver amount to more than 20 ns, say, for 50 MHz FPGA clocks?

Best Regards Parth

Reply to
partha sarathy

Sorry, I'm not at all clear about what you are doing.

Maybe I misunderstood what you meant by pin muxing. Are they using fewer pins and sending data for multiple signals over each pin? That would definitely slow things down.

Using SERDES (the gigabit transceiver you mention) should speed that up, but might include some pipeline delay. I'm not that familiar with their operation, but I assume you have to parallel load a register that is shifted out at high speed and loaded into a shift register on the receiving end, then parallel loaded into another register to be presented to the rest of the circuitry. If that is how they are working, it would indeed take a full clock cycle of latency.
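
To put rough numbers on that pipeline, here is a hedged latency-budget sketch in Python; the stage cycle counts and word width are illustrative assumptions, not values from any transceiver datasheet:

# Rough latency budget for one FPGA-to-FPGA hop over a gigabit transceiver.
# Cycle counts are illustrative assumptions, not GTX/GTY datasheet values.

LINE_RATE_GBPS = 10.0
PARALLEL_WIDTH = 32                                       # assumed fabric-side word width
WORD_CLOCK_MHZ = LINE_RATE_GBPS * 1e3 / PARALLEL_WIDTH    # 312.5 MHz word clock

stages = {
    "tx parallel load + gearbox":   4,   # assumed cycles of the word clock
    "tx serializer":                2,
    "wire + rx sampling (steady)":  1,
    "rx deserializer + aligner":    4,
    "rx elastic buffer / FIFO":     4,
}

total_cycles = sum(stages.values())
latency_ns = total_cycles / WORD_CLOCK_MHZ * 1e3
print(f"~{total_cycles} word-clock cycles ~= {latency_ns:.0f} ns per hop")

At a 50 MHz design clock (20 ns period) that is already a couple of design-clock cycles per hop, before any error handling or retransmission is added.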

--

  Rick C. 

  + Get 1,000 miles of free Supercharging 
  + Tesla referral code - https://ts.la/richard11209
Reply to
Rick C

That's right - you get a parallel FIFO interface. There's no guarantee that what you put in will get to the other end reliably (if the BER is, say, 10^-9 and your bit rate is 10 Gbps, that's one error every 0.1 s). So to be reliable, these kinds of links need some kind of error correction or retransmission. In the Bluelink case, that overhead was hundreds of ns.
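
The error-rate arithmetic is easy to reproduce; a quick sketch in Python using the figures from the post:

# Reproduce the back-of-envelope error-rate arithmetic from the post.
BER = 1e-9                # bit error rate quoted in the post
LINE_RATE_BPS = 10e9      # 10 Gbps line rate

errors_per_second = BER * LINE_RATE_BPS
seconds_per_error = 1.0 / errors_per_second
print(f"{errors_per_second:.0f} errors/s -> one error every {seconds_per_error:.1f} s")

That is 10 errors per second, i.e. one bit error every 0.1 s of wall-clock time, so an unprotected link corrupts emulation state almost immediately. That is why a CRC plus retransmission (or FEC) layer is needed, and that layer is where the extra hundreds of ns of latency comes from.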

Basically you end up with something approaching a full radio stack, just over wires.

Theo

Reply to
Theo

Hi Rick, thanks for the clarifications. It is obvious now that the SERDES is not suitable for pin muxing.

Regards Parth

Reply to
partha sarathy

Multi-Gigabit Transceiver (MGT): Configurable hard-macro MGTs are implemented for inter-FPGA communication. The data rate can be as high as ~10 Gbps [MGT, 2014]. Nevertheless, the MGT has a high latency (~30 fast clock cycles) that limits the system clock frequency, and only a few are available. When the TDM ratio is 4, the system clock frequency is ~7 MHz [Tang et al., 2014]. In addition, the communication between MGTs is not error-free; they come with a non-null bit error rate (BER). Therefore, at this moment, the MGT is not used as the inter-FPGA communication architecture in multi-FPGA prototyping.
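
The ~7 MHz figure can be rationalised with a simple model in which each system-clock period has to cover one TDM frame plus the full MGT latency. This is only a sketch: the latency and TDM ratio come from the quote, but the fast-clock frequency below is an assumption, not a number from the cited sources:

# Hedged model of why MGT-based TDM caps the system clock at a few MHz.
# Latency (~30 fast clock cycles) and TDM ratio (4) come from the quote
# above; the fast-clock frequency is an assumption for illustration.

FAST_CLOCK_MHZ = 250.0     # assumed MGT fabric-interface (fast) clock
MGT_LATENCY_CYCLES = 30    # ~30 fast clock cycles, per the quote

def system_clock_mhz(tdm_ratio, latency_cycles=MGT_LATENCY_CYCLES,
                     fast_clock_mhz=FAST_CLOCK_MHZ):
    """Assume each system-clock period must fit one TDM frame plus the
    full transceiver latency before the next evaluation can start."""
    period_ns = (latency_cycles + tdm_ratio) * 1e3 / fast_clock_mhz
    return 1e3 / period_ns

print(f"TDM 4:1  -> ~{system_clock_mhz(4):.1f} MHz system clock")   # ~7 MHz
print(f"TDM 16:1 -> ~{system_clock_mhz(16):.1f} MHz system clock")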
Reply to
partha sarathy

It really depends on what you mean by 'prototyping'. If you have interconnect which is tolerant of latency, such that the system doesn't mind that messages take several cycles to get from one place to another (typical of a network-on-chip implementing say AXI), then using MGT with a reliability layer is fine for functional verification.

If you mean dumping a hairball of an RTL netlist across multiple FPGAs and slowing the clock until everything works in a single cycle, then they're not right for that job.

They're both prototyping, but at different levels of abstraction.

Theo

Reply to
Theo
