High-bandwidth Ethernet communication

- eliben

Wed, Sep 5, 2007 2:59 PM

Hello,

In our application we have to receive and merge several proprietary serial channels (200 MHz) over fibers, and send all the data over Gigabit Ethernet. The bandwidth is ~60 MByte/s, sustained.

While sending this amount of data over Gbit Ethernet is generally possible, doing so in an embedded system isn't easy, because the data has to go out as UDP or TCP, and that requires a software TCP/UDP/IP stack.

Since the translation of the proprietary format will certainly be done in an FPGA, I tried to work out how to implement the whole process in an FPGA. For example, I can take an Altera Stratix II GX (with a built-in Gbit Ethernet PHY), add Altera's MAC and use a TCP/IP stack running on the Nios II soft-core processor. Unfortunately, as Altera's appnote 440 shows, the maximum bandwidth attainable this way is only 15-17 MByte/s. For the sake of comparison, benchmarks of Gbit Ethernet adapters on PCs show a maximum bandwidth of 80-90 MByte/s.
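
To put those figures side by side, here is a rough sketch of the arithmetic. The 125 MByte/s raw line rate is simply 1 Gbit/s divided by 8, before any framing overhead; the function and variable names are made up for illustration.

```python
# Rough comparison of the required rate with the figures quoted in the post.
GIGE_RAW_MBYTE_S = 1000 / 8  # 1 Gbit/s line rate = 125 MByte/s, ignoring framing overhead


def fraction_of_line_rate(mbyte_per_s: float) -> float:
    """Return the share of the raw Gigabit Ethernet line rate a throughput uses."""
    return mbyte_per_s / GIGE_RAW_MBYTE_S


required = 60.0    # sustained requirement from the post
nios_best = 17.0   # upper end of the Nios II figure quoted from the appnote
pc_low = 80.0      # lower end of the PC adapter benchmarks

print(f"required: {fraction_of_line_rate(required):.0%} of raw line rate")
print(f"shortfall versus the Nios II stack: {required / nios_best:.1f}x")
print(f"margin of PC adapters over the requirement: {pc_low / required:.2f}x")
```

So the link itself has about half its raw capacity to spare; the gap is in the soft-core stack, which would need to be roughly 3.5 times faster.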

However, I would rather not build a Pentium into the embedded system. Any suggestions or recommendations on how to solve the problem?

Thanks in advance

- Tim Wescott

Wed, Sep 5, 2007 3:13 PM

My first choice would be some other, more embeddable, processor running off to the side. A PowerPC from Freescale, or an ARM processor from just about anybody, comes to mind. I suspect that even a modest such processor would get some pretty high speeds if that's all it was doing.

You may have to bite the bullet and write your own stack that's shared between a processor and the FPGA. I know practically nothing about TCP/IP, but I'm willing to bet that once you've signed up to writing or modifying your own stack there are some obvious things to put into the FPGA to speed things up.

Slapping a big, limited temperature range, power hungry Pentium into an embedded product would not be my first choice either.

--
Tim Wescott
Control systems and communications consulting
http://www.wescottdesign.com

Need to learn how to apply control theory in your embedded system?
"Applied Control Theory for Embedded Systems" by Tim Wescott
Elsevier/Newnes, http://www.wescottdesign.com/actfes/actfes.html

- Gabor

Wed, Sep 5, 2007 3:26 PM

Check out Stretch.

They have processors with 4 gigabit ethernet ports and a hardware-assisted stack that can keep up at full gigE bandwidth on at least 3 at the same time. Getting data into the Stretch processor memory can be via the "coprocessor bus" interface or by using one of the MAC interfaces as a simple FIFO.

- John McCaskill

Wed, Sep 5, 2007 3:53 PM

If you have the choice between UDP and TCP, UDP is much simpler and fits an FPGA well. The big issue in choosing between the two is whether you require the guaranteed delivery of TCP, or can tolerate the potential packet loss of UDP.

As an example, we make a card that acquires real time data in a custom protocol that is wrapped in UDP. We use a Xilinx Virtex-4 FX60, and a protocol offload engine that uses the Xilinx PicoBlaze soft processor to deal with the protocol stack. The PicoBlaze is an 8-bit soft processor. It looks at each incoming packet and reads the header to see if it is one of the real time streams we are trying to offload. If it is, it sends the header to one circular buffer in memory and the data to another circular buffer. If it is not, it sends it to a kernel buffer and we let the Linux network stack deal with it.

With this setup, we can consume data at over 90 MB/sec per Gigabit Ethernet port. The data part of the packet is 1024 bytes, and each GigE port has its own PicoBlaze dedicated to it.
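
As a sanity check on that figure, here is a minimal sketch of the upper bound on UDP payload throughput for back-to-back frames. It assumes plain UDP over IPv4 with no IP options and no VLAN tag (a tag would add 4 bytes per frame), and it treats the 1024 bytes as the whole UDP payload; none of those details are stated in the post.

```python
# Per-frame overhead for a UDP/IPv4 datagram on Ethernet, in bytes.
PREAMBLE_SFD = 8      # preamble plus start-of-frame delimiter
ETH_HEADER = 14       # destination MAC, source MAC, EtherType (assumes no VLAN tag)
IPV4_HEADER = 20      # assumes no IP options
UDP_HEADER = 8
FCS = 4               # Ethernet frame check sequence
INTERFRAME_GAP = 12   # minimum idle time between frames, in byte times


def max_payload_rate(payload_bytes: int, line_rate_bytes_s: float = 125e6) -> float:
    """Upper bound on UDP payload throughput (bytes/s) for back-to-back frames."""
    wire_bytes = (PREAMBLE_SFD + ETH_HEADER + IPV4_HEADER + UDP_HEADER
                  + payload_bytes + FCS + INTERFRAME_GAP)
    return line_rate_bytes_s * payload_bytes / wire_bytes


rate = max_payload_rate(1024)
print(f"{rate / 1e6:.1f} MB/s")  # theoretical ceiling for 1024-byte payloads
```

Under these assumptions the ceiling is a little over 117 MB/s, so "over 90 MB/sec" leaves some headroom below the theoretical maximum.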

I did notice that you want to send GigE instead of receive it like we are doing, but this method should work for sending a custom protocol wrapped in UDP with some minor changes.

How is the GigE that you are sending the data over connected? Is it point to point, a dedicated network, or something else?

The network that carries the data we deal with is set up as multicast data on VLANs. The VLANs are allocated to guarantee the required bandwidth through the switches, and data sources use traffic shaping to put out their packets at a steady rate. In that environment, UDP works just fine.

This is the card that I am talking about:

formatting link

Regards,

John McCaskill

- Siva Velusamy

Wed, Sep 5, 2007 5:02 PM

Try the Xilinx GSRD appnote. Performance of over 700 Mbps is possible with the embedded PowerPC.

/Siva

- eliben

Thu, Sep 6, 2007 5:20 AM

Thanks, this is the design I'm now leaning towards. However, I have a concern regarding the high-speed communication between the FPGA and the external processor. How is it done? Using some high-speed external DDR memory?

Eli

- eliben

Thu, Sep 6, 2007 5:22 AM

So where is the actual UDP communication implemented? In Linux? What processor is it running on? Is it an external CPU or the built-in PPC of the Virtex-4?

We can assume for the sake of discussion that it is point to point, since the network is small and we're likely to use a fast switch to ensure exclusive links between systems.

Eli

- Tim Wescott

Thu, Sep 6, 2007 5:51 AM

I've always seen it implemented as a plain ol' asynchronous memory interface, as seen in the '70's and '80's. Most processors support it, and it's not too shabby.

If you have to have the processor fondling the data bits then mapping the FPGA as synchronous static RAM or SDRAM may be quicker, but it'll certainly be more problematic.

- Hal Murray

Thu, Sep 6, 2007 9:40 AM

I suggest that you go back and read John McCaskill's response. It would probably help to discuss things with a network wizard. You want a low level protocol geek, not a web designer. (especially one who knows something about hardware)

I think the real question is what happens when a packet gets lost? If you are using TCP, you have to buffer all the data until it gets ACKed. If you are using UDP, you drop some data.

UDP in send-only mode doesn't really require a stack.

If I was doing this (or something like what I think you are doing), I would try to do all the UDP in the FPGA. The header is just a bunch of constants. You probably want a sequence number in your payload. Then you have to compute the CRC. The whole thing is well specified and you can be sure it will run fast enough as long as the network doesn't get congested. No ACKs, just fire and forget.
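
To make the "header is just a bunch of constants" point concrete, here is a software sketch of the frame such logic would have to emit. The MAC addresses, IP addresses, ports and the 32-bit sequence number at the start of the payload are all placeholders, not values from this thread. The sketch computes the IPv4 header checksum and the Ethernet CRC-32 (FCS) and leaves the UDP checksum at zero, which IPv4 permits; many MAC cores can append the FCS themselves. Padding of short frames is not handled.

```python
import struct
import zlib


def ipv4_checksum(header: bytes) -> int:
    """Ones' complement sum of 16-bit words, as used by the IPv4 header checksum."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_frame(seq: int, data: bytes) -> bytes:
    """Build one Ethernet/IPv4/UDP frame, including the trailing FCS."""
    # All addresses and ports are hypothetical placeholders.
    dst_mac = bytes.fromhex("020000000002")
    src_mac = bytes.fromhex("020000000001")
    src_ip = bytes([192, 168, 0, 1])
    dst_ip = bytes([192, 168, 0, 2])
    src_port, dst_port = 5000, 5001

    payload = struct.pack("!I", seq & 0xFFFFFFFF) + data  # sequence number leads the payload
    udp_len = 8 + len(payload)
    # A UDP checksum of 0 means "not computed", which is allowed over IPv4.
    udp = struct.pack("!HHHH", src_port, dst_port, udp_len, 0) + payload

    # Version/IHL, TOS, total length, ID, flags/fragment, TTL, protocol 17 (UDP), checksum 0
    ip_no_csum = struct.pack("!BBHHHBBH", 0x45, 0, 20 + udp_len, 0, 0x4000, 64, 17, 0)
    ip_no_csum += src_ip + dst_ip
    csum = ipv4_checksum(ip_no_csum)
    ip = ip_no_csum[:10] + struct.pack("!H", csum) + ip_no_csum[12:]

    frame = dst_mac + src_mac + struct.pack("!H", 0x0800) + ip + udp
    # Ethernet FCS is CRC-32 over the frame, transmitted least significant byte first.
    return frame + struct.pack("<I", zlib.crc32(frame))


frame = build_frame(seq=7, data=bytes(1024))
print(len(frame))  # frame length in bytes, excluding preamble and interframe gap
```

With a fixed payload length every header field is constant from frame to frame, so in hardware the IP checksum can be precomputed and only the sequence number, data and FCS change.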

An alternative approach is to get the data into a PC somehow, and do the UDP/whatever work from that PC.

As somebody else already suggested, one "easy" way to get the data into a PC would be to use Ethernet on a point to point link. You don't even need a CRC. This is easy for the hardware. The software guys might not like it. They have to steal the ethernet port from the software stack and write a driver. You might look at tcpdump and see how it handles packets with CRC errors.

There are various PCI boards with an FPGA on them. If you can get your data on to one of those cards, then you can DMA it into memory. That still needs software but it is a slightly different type of software.

--
These are my opinions, not necessarily my employer's.  I hate spam.

- Brian Drummond

Thu, Sep 6, 2007 1:53 PM

If the FPGA was a Virtex-IIPro, or V4FX or so, the PowerPC wouldn't be external.

Mind you, after designing logic, anything running on the PowerPC seems painfully slow...

- Brian

- John McCaskill

Thu, Sep 6, 2007 2:45 PM

We are using one of the embedded PowerPCs to run Linux, and one PicoBlaze soft processor per EMAC in the design.

As each packet exits the EMAC, its header is examined by software that is running on the PicoBlaze. The PicoBlaze is running a very simple stack written in assembly language. As it looks at each layer of the header, it makes a decision to do one of several things. At the Ethernet level, it is deciding if it should just throw the packet away, or pass it on to the next layer. At the IP and UDP layers, it is deciding if the packet belongs to the protocol that we are offloading, and is a stream that we have requested. If the packet does belong to the protocol that we are offloading, the PicoBlaze sets up a PLB DMA engine to send it down one data path. If the packet does not belong to the protocol we are offloading, then the PicoBlaze sends it down another data path and it is given to the Linux kernel to deal with.
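
The layer-by-layer decision described above can be sketched in software as follows. This is an illustration only, not the PicoBlaze code: the function name, the three return values and the idea of matching requested streams on (destination IP, destination UDP port) are assumptions, and the sketch expects an untagged frame with the FCS already stripped.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17


def classify(frame: bytes, my_mac: bytes, wanted: set) -> str:
    """Return 'drop', 'offload' or 'kernel' for one received frame."""
    if len(frame) < 14:
        return "drop"
    dst_mac = frame[0:6]
    is_group = bool(dst_mac[0] & 0x01)        # group bit: multicast or broadcast
    if dst_mac != my_mac and not is_group:
        return "drop"                          # Ethernet level: not addressed to us
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_IPV4 or len(frame) < 14 + 20:
        return "kernel"                        # not IPv4: let the normal stack handle it
    ihl = (frame[14] & 0x0F) * 4               # IPv4 header length in bytes
    if frame[23] != IPPROTO_UDP or len(frame) < 14 + ihl + 8:
        return "kernel"                        # IP level: not UDP
    dst_ip = frame[30:34]
    (dst_port,) = struct.unpack("!H", frame[14 + ihl + 2:14 + ihl + 4])
    # UDP level: offload only the streams that were requested.
    return "offload" if (dst_ip, dst_port) in wanted else "kernel"
```

Everything that is not a requested stream takes the short path to "kernel", which is how IGMP and other housekeeping traffic would still reach the Linux stack.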

So for the protocol that we are offloading, the entire Ethernet, IP, UDP, and custom protocol stack are implemented in PicoBlaze assembly code. For everything else, the stack is in Linux. Since the data we are offloading is multicast data, this lets us have the PicoBlaze deal with the simple but high speed UDP packets, and the PowerPC running Linux deal with the IGMP messaging required to join, leave and maintain membership in a multicast group.

The PicoBlaze is just running a single threaded loop of code that polls for input or output data, and then deals with it. Its program is loaded from the PowerPC taking advantage of the fact that the BRAM holding its program is dual ported. The PowerPC also tells the PicoBlaze what streams of data it is supposed to be acquiring, and what buffers in DDR2 memory to write the data to. The PicoBlaze will generate an interrupt once it has written a certain number of packets to the buffer, so the PowerPC just sees big chunks of data showing up in the buffers and does not have to deal with each packet. The PowerPC only needs to spend 1 or 2 percent of its compute to deal with acquiring each stream of data.

With this setup, you should be able to have an error rate that is incredibly low. As long as the possibility of a lost packet is not catastrophic, UDP would be a very good match. The protocol we offload has sequence numbers in it, so we know if we have lost a packet. The data is being used for realtime signal processing, so losing a packet just looks like a burst of noise.
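
Detecting a lost packet from the sequence numbers can be as simple as the sketch below. The 16-bit counter width and the class name are assumptions (the post does not describe the protocol's fields), and the sketch assumes packets arrive in order, which is plausible on a tightly controlled network but would miscount duplicates or reordering.

```python
class LossCounter:
    """Count missing packets from a wrapping sequence number."""

    def __init__(self, bits: int = 16):
        self.modulus = 1 << bits   # sequence numbers wrap at 2**bits (width is assumed)
        self.expected = None       # next sequence number we expect to see
        self.lost = 0              # running total of skipped packets

    def update(self, seq: int) -> int:
        """Feed one received sequence number; return how many packets were skipped before it."""
        if self.expected is None:
            gap = 0                # first packet seen: nothing to compare against
        else:
            gap = (seq - self.expected) % self.modulus
        self.lost += gap
        self.expected = (seq + 1) % self.modulus
        return gap


counter = LossCounter()
for seq in (65534, 65535, 1, 2):   # packet 0 is missing across the wrap
    counter.update(seq)
print(counter.lost)
```

The modulo arithmetic keeps the gap calculation correct when the counter wraps from its maximum value back to zero.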

Regards,

John McCaskill

- eliben

Thu, Sep 6, 2007 6:25 PM

At what frequency does the PicoBlaze run? It must be pretty fast to deal with packets at this bandwidth. Or does the fact that it only examines the frame headers save you from needing high speed?

Thanks Eli

- John McCaskill

Thu, Sep 6, 2007 6:47 PM

We run the EMAC 8 bits wide at 125 MHz, and the PicoBlaze at 62.5 MHz using a divided-by-two version of the EMAC clock. The PicoBlaze takes two cycles per instruction, and the packets we are offloading are a bit over 1KB, so we have about 512 instructions to deal with an offloaded packet and other overhead. Dealing with a non-offloaded packet takes the shortest path through the code to keep the number of packets per second we can handle up. The network the data is on is tightly controlled, so there is very little on it that is not the protocol we are offloading, mostly just IGMP packets for dealing with the multicast groups, and they are at a very low rate.
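
A back-of-envelope version of that budget, using only the clock figures quoted above, looks like this. The function name is made up, and the 1090-byte case assumes standard Ethernet/IPv4/UDP overhead plus preamble and interframe gap around a 1024-byte payload, which is not stated in the post.

```python
def instruction_budget(wire_bytes: int,
                       emac_clock_hz: float = 125e6,    # EMAC moves one byte per clock
                       cpu_clock_hz: float = 62.5e6,    # PicoBlaze clock from the post
                       cycles_per_instruction: int = 2) -> float:
    """Instructions the soft processor can execute while one frame arrives."""
    frame_time_s = wire_bytes / emac_clock_hz
    return frame_time_s * cpu_clock_hz / cycles_per_instruction


print(instruction_budget(1024))   # counting the 1024-byte payload only
print(instruction_budget(1090))   # payload plus assumed headers, FCS, preamble and gap
```

Under these assumptions the result is roughly 256 to 270 instructions per packet, which suggests the 512 quoted above may be counting PicoBlaze clock cycles rather than instructions. Either way, it is an order-of-magnitude budget for the longest code path.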

You are correct in that we do not look at the entire packet with the PicoBlaze, just the header. Once it has determined that it wants to offload that packet, it then has a little bit more work to do to calculate addresses and load them into the DMA engine. To make sure that we do not drop packets, we just need to make sure that the longest path through the code takes less time than it takes to receive a packet. We have a FIFO between the EMAC and the DMA engine, so we can smooth things out a bit.

Regards,

John McCaskill

- eliben

Fri, Sep 7, 2007 4:55 AM

Thanks for the information. I must say I'm impressed with your design: a great interoperability of logic, cores, small and large CPUs and software.

Eli

- David Brown

Fri, Sep 7, 2007 10:44 AM

There are a number of things that can be used to speed up the Ethernet communication (I've read about these, but not tried them - but they might give you a clue).

On the software side, there are a number of different TCP/IP stacks available, and the particular implementation can make a lot of difference.

In the FPGA, you can make sure you are using DMA for memory transfers rather than CPU memory accesses. You can also use the FPGA to accelerate things like CRC calculations enormously. Perhaps you can get these ready-written, or make one yourself, and modify the stack to use it.

There are also several different Ethernet MACs available, with widely different throughputs. Have a look at the OpenCores lists and try some out (I gather the prices are not insignificant, but they may be worth the money).

mvh.,

David

- John McCaskill

Fri, Sep 7, 2007 3:09 PM

Thanks for the compliment!

I really like the PicoBlaze. It makes a great complement to the PowerPC, and is very small. It is very good in this application of being an IO processor.

Regards,

John McCaskill