High bandwidth Ethernet communication

Hello,

In our application we have to receive and merge several proprietary 200 MHz serial channels arriving over fibers, and send all the data out over Gigabit Ethernet. The required bandwidth is ~60 MByte/s, sustained.

While sending this amount of data over Gbit Ethernet is generally possible, doing so in an embedded system isn't easy, because the data has to go out as UDP or TCP, which requires a TCP/UDP/IP stack in software.

Since the translation of the proprietary format will certainly be done in an FPGA, I tried to work out how to implement the whole process in an FPGA. For example, I could take an Altera Stratix II GX (with a built-in Gbit Ethernet PHY), add Altera's MAC, and run a TCP/IP stack on the Nios II soft-core processor. Unfortunately, as Altera's appnote 440 shows, the maximal bandwidth attainable this way is only 15-17 MByte/s. For comparison, benchmarks of Gbit Ethernet adapters in PCs show a maximal bandwidth of 80-90 MByte/s.

However, I would rather not build a Pentium into the embedded system. Any suggestions or recommendations on how to solve the problem?

Thanks in advance

Reply to
eliben

My first choice would be some other, more embeddable, processor running off to the side. A PowerPC from Freescale, or an ARM processor from just about anybody, comes to mind. I suspect that even a modest such processor would get some pretty high speeds if that's all it was doing.

You may have to bite the bullet and write your own stack that's shared between a processor and the FPGA. I know practically nothing about TCP/IP, but I'm willing to bet that once you've signed up to writing or modifying your own stack there are some obvious things to put into the FPGA to speed things up.

Slapping a big, limited temperature range, power hungry Pentium into an embedded product would not be my first choice either.

--
Tim Wescott
Control systems and communications consulting
http://www.wescottdesign.com

Need to learn how to apply control theory in your embedded system?
"Applied Control Theory for Embedded Systems" by Tim Wescott
Elsevier/Newnes, http://www.wescottdesign.com/actfes/actfes.html
Reply to
Tim Wescott

Check out Stretch

formatting link

They have processors with 4 Gigabit Ethernet ports and a hardware-assisted stack that can keep up at full GigE bandwidth on at least 3 of them at the same time. Getting data into the Stretch processor memory can be done via the "coprocessor bus" interface or by using one of the MAC interfaces as a simple FIFO.

Reply to
Gabor

If you have the choice between UDP and TCP, UDP is much simpler and fits an FPGA well. The big issue in choosing between the two is whether you require the guaranteed delivery of TCP, or can tolerate the potential packet loss of UDP.

As an example, we make a card that acquires real-time data in a custom protocol wrapped in UDP. We use a Xilinx Virtex-4 FX60, and a protocol offload engine that uses the Xilinx PicoBlaze, an 8-bit soft processor, to deal with the protocol stack. It looks at each incoming packet and reads the header to see if it is one of the real-time streams we are trying to offload. If it is, it sends the header to one circular buffer in memory and the data to another circular buffer. If it is not, it sends the packet to a kernel buffer and we let the Linux network stack deal with it.
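To make the header/data split concrete, here is a rough C sketch of the two circular buffers. The ring sizes, field names and the ring_put() helper are invented for illustration; they are not taken from our actual firmware.

#include <stdint.h>
#include <string.h>

#define HDR_RING_ENTRIES  4096          /* assumed ring depth */
#define HDR_ENTRY_BYTES   64            /* Eth+IP+UDP+custom header */
#define DATA_RING_BYTES   (16u << 20)   /* assumed 16 MB payload ring */
#define DATA_ENTRY_BYTES  1024          /* payload size quoted above */

struct stream_rings {
    uint8_t  *hdr_base;   /* circular buffer for packet headers */
    uint8_t  *data_base;  /* circular buffer for 1024-byte payloads */
    uint32_t  hdr_wr;     /* producer indices, advanced by the offload engine */
    uint32_t  data_wr;
    uint32_t  hdr_rd;     /* consumer indices, advanced by the host CPU */
    uint32_t  data_rd;
};

/* Offload-engine side: store one received packet, header and payload
 * going to their separate rings. */
static void ring_put(struct stream_rings *r,
                     const uint8_t *hdr, const uint8_t *payload)
{
    memcpy(r->hdr_base  + (r->hdr_wr  % HDR_RING_ENTRIES) * HDR_ENTRY_BYTES,
           hdr, HDR_ENTRY_BYTES);
    memcpy(r->data_base + (r->data_wr % (DATA_RING_BYTES / DATA_ENTRY_BYTES))
                        * DATA_ENTRY_BYTES,
           payload, DATA_ENTRY_BYTES);
    r->hdr_wr++;
    r->data_wr++;
}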

With this setup, we can consume data at over 90 MB/sec per Gigabit Ethernet port. The data part of the packet is 1024 bytes, and each GigE port has its own PicoBlaze dedicated to it.

I did notice that you want to send GigE instead of receive it like we are doing, but this method should work for sending a custom protocol wrapped in UDP with some minor changes.

How is the GigE that you are sending the data over connected? Is it point to point, a dedicated network, or something else?

The network carrying the data we deal with is set up as multicast data on VLANs. The VLANs are allocated to guarantee the required bandwidth through the switches, and data sources use traffic shaping to put out their packets at a steady rate. In that environment, UDP works just fine.

This is the card that I am talking about:

formatting link

Regards,

John McCaskill,

formatting link

Reply to
John McCaskill

Try the Xilinx GSRD appnote. Performance of over 700 Mbps is possible with the embedded PowerPC.

/Siva

Reply to
Siva Velusamy

eliben wrote: (snip)

TCP is somewhat complicated, but UDP is pretty simple. You might want external software to handle ARP and routing tables (if needed), but it should not be too hard to create and send a UDP packet in an FPGA. You don't say anything about receiving. Among the complications of IP is reassembling fragmented packets. You should be able to avoid that (if you control both ends, and the path in between).
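To illustrate how little is actually variable, here is the header such a send-only UDP source has to prepend, written out as a C struct. The addresses are made up; on a point-to-point link everything except the two length fields, the IP ID and the IP checksum can be hard-wired as constants. All multi-byte fields are big-endian on the wire.

#include <stdint.h>

#pragma pack(push, 1)
struct eth_ip_udp_hdr {
    /* Ethernet (14 bytes) */
    uint8_t  dst_mac[6];         /* fixed: learned once, e.g. via ARP */
    uint8_t  src_mac[6];         /* fixed */
    uint16_t ethertype;          /* 0x0800 = IPv4, constant */
    /* IPv4, no options (20 bytes) */
    uint8_t  ver_ihl;            /* 0x45, constant */
    uint8_t  tos;                /* 0, constant */
    uint16_t total_len;          /* 20 + 8 + payload, per packet */
    uint16_t id;                 /* may increment, or stay 0 with DF set */
    uint16_t flags_frag;         /* 0x4000 = don't fragment, constant */
    uint8_t  ttl;                /* e.g. 64, constant */
    uint8_t  protocol;           /* 17 = UDP, constant */
    uint16_t ip_checksum;        /* recomputed per packet (cheap) */
    uint32_t src_ip, dst_ip;     /* fixed for a point-to-point link */
    /* UDP (8 bytes) */
    uint16_t src_port, dst_port; /* fixed */
    uint16_t udp_len;            /* 8 + payload, per packet */
    uint16_t udp_checksum;       /* 0 = "not computed" is legal over IPv4 */
};
#pragma pack(pop)

Leaving the UDP checksum at zero is legal for IPv4 and saves the FPGA from having to touch the payload at all.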

-- glen

Reply to
glen herrmannsfeldt

We are using the MPC8349E at 400 MHz core and got only 480 Mbit/s sustained UDP data rate. These processors are marketed as communications processors but only have low-level HW support (at the Ethernet layer). All the upper-level IP and UDP protocols are handled in software (when running Linux), so it takes up CPU time. The same setup on two desktop PCs running Linux yields 840 Mbit/s sustained UDP rate.

Reply to
Janaka

Thanks, this is the design I'm now leaning towards. However, I have a concern regarding the high-speed communication between the FPGA and the external processor. How is it done - using some high-speed external DDR memory?

Eli

Reply to
eliben

So where is the actual UDP communication implemented? In Linux? What processor is it running on? Is it an external CPU or the built-in PPC of the Virtex-4?

We can assume for the sake of discussion that it is point to point, since the network is small and we're likely to use a fast switch to ensure exclusive links between systems.

Eli

Reply to
eliben

I've always seen it implemented as a plain ol' asynchronous memory interface, as seen in the '70s and '80s. Most processors support it, and it's not too shabby.

If you have to have the processor fondling the data bits, then mapping the FPGA as synchronous static RAM or SDRAM may be quicker, but it'll certainly be more problematic.
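For what it's worth, from the processor side such an interface just looks like memory. A hypothetical C sketch, with the base address and register layout invented for illustration:

#include <stdint.h>

#define FPGA_BASE   0x40000000u                       /* assumed chip-select window */
#define FPGA_FIFO   (*(volatile uint32_t *)(FPGA_BASE + 0x0)) /* data FIFO register */
#define FPGA_STATUS (*(volatile uint32_t *)(FPGA_BASE + 0x4))
#define FIFO_EMPTY  (1u << 0)

static uint32_t fpga_read_word(void)
{
    while (FPGA_STATUS & FIFO_EMPTY)
        ;                         /* spin until the FPGA has data */
    return FPGA_FIFO;             /* one bus read per 32-bit word */
}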

--
Tim Wescott
Control systems and communications consulting
http://www.wescottdesign.com

Need to learn how to apply control theory in your embedded system?
"Applied Control Theory for Embedded Systems" by Tim Wescott
Elsevier/Newnes, http://www.wescottdesign.com/actfes/actfes.html
Reply to
Tim Wescott

If the OP only requires dedicated point-to-point connectivity, why bother with the IP wrapper? Just send raw Ethernet frames with MAC addressing.

Apparently that MPC has some modern version of the QUICC co-processor (as found on the MC68360), in which it is quite easy to set up one BD (buffer descriptor) for the (possibly fixed) header and another for the actual data. The co-processor assembles the frames from the fragments, sends them autonomously, appends the CRC and then searches for the next ready frame to be sent, without any further main processor intervention.

The hard thing is to get the transmit data into the transmit buffers fast enough, but for direct port-to-port copying, there should not be much need to move the actual data around in memory.
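A rough sketch of the two-fragment scheme in C, in the spirit of the CPM buffer descriptor layout (16-bit status, 16-bit length, 32-bit buffer pointer). The bit positions below are from memory and must be checked against the manual of the actual part before use:

#include <stdint.h>

#define BD_READY 0x8000  /* CPM owns the descriptor */
#define BD_WRAP  0x2000  /* marks the final BD of the ring (omitted below) */
#define BD_LAST  0x0800  /* closes the frame */
#define BD_TC    0x0400  /* append CRC after this fragment */

struct cpm_bd {
    volatile uint16_t status;
    volatile uint16_t length;
    volatile uint32_t buffer;
};

/* One frame = fixed header fragment + payload fragment. */
static void queue_frame(struct cpm_bd bd[2],
                        const void *hdr, uint16_t hdr_len,
                        const void *payload, uint16_t pay_len)
{
    bd[1].buffer = (uint32_t)(uintptr_t)payload;
    bd[1].length = pay_len;
    bd[1].status = BD_READY | BD_LAST | BD_TC;  /* close frame, add CRC */

    bd[0].buffer = (uint32_t)(uintptr_t)hdr;
    bd[0].length = hdr_len;
    bd[0].status = BD_READY;  /* set the first BD ready last, so the CPM
                                 never sees a half-built frame */
}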

Paul

Reply to
Paul Keinanen

I suggest that you go back and read John McCaskill's response. It would probably help to discuss things with a network wizard. You want a low-level protocol geek (especially one who knows something about hardware), not a web designer.

I think the real question is what happens when a packet gets lost? If you are using TCP, you have to buffer all the data until it gets ACKed. If you are using UDP, you drop some data.

UDP in send-only mode doesn't really require a stack.

If I was doing this (or something like what I think you are doing), I would try to do all the UDP in the FPGA. The header is just a bunch of constants. You probably want a sequence number in your payload. Then you have to compute the CRC. The whole thing is well specified and you can be sure it will run fast enough as long as the network doesn't get congested. No ACKs, just fire and forget.
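Both checksums involved are straightforward. Here is a bit-serial C reference model of the Ethernet CRC-32 (an FPGA would implement the same polynomial as a parallel LFSR, one byte or more per clock), plus the IPv4 header checksum:

#include <stdint.h>
#include <stddef.h>

/* IEEE 802.3 CRC-32, reflected, polynomial 0xEDB88320. */
uint32_t eth_crc32(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
    }
    return ~crc;   /* transmitted least-significant byte first */
}

/* IPv4 header checksum: 16-bit ones'-complement sum over the header,
 * with the checksum field zeroed beforehand. words = 10 for a plain
 * 20-byte header. */
uint16_t ip_checksum(const uint16_t *hdr, size_t words)
{
    uint32_t sum = 0;
    while (words--)
        sum += *hdr++;
    sum = (sum & 0xFFFF) + (sum >> 16);
    sum = (sum & 0xFFFF) + (sum >> 16);   /* fold the carries back in */
    return (uint16_t)~sum;
}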

An alternative approach is to get the data into a PC somehow, and do the UDP/whatever work from that PC.

As somebody else already suggested, one "easy" way to get the data into a PC would be to use Ethernet on a point to point link. You don't even need a CRC. This is easy for the hardware. The software guys might not like it. They have to steal the ethernet port from the software stack and write a driver. You might look at tcpdump and see how it handles packets with CRC errors.

There are various PCI boards with an FPGA on them. If you can get your data on to one of those cards, then you can DMA it into memory. That still needs software but it is a slightly different type of software.

--
These are my opinions, not necessarily my employer's.  I hate spam.
Reply to
Hal Murray

I wondered about that, actually. But working on the MAC level is very inflexible. For example:

1) What if the client computer gets replaced by an equivalent computer? Each NIC has a unique MAC address, so I'd have to reconfigure my sender, or set up some manual MAC discovery protocol.

2) If the client is a PC of some sort, working at the MAC packet level isn't too simple, as the networking APIs don't provide that level. A separate driver for the NIC would be needed.

3) If I want to advance to a more complicated network, such as one with a few clients, working at the IP level is much more convenient, as I can set up a router with all the niceties it brings: multicasts, groups, etc.

Eli

Reply to
eliben

The "manual MAC discovery protocol" could be ARP, which is simple to implement (manually creating the request IP header) and you get the MAC address of the other partner. After that, you do not have to bother about any IP addresses in the message headers in the actual high speed data transfers. Only if you send the data to some hot standby redundant system, in which the MAC address can change at any time, but again, you just would have to repeat the ARP protocol query.

I haven't written any raw Ethernet protocols in two decades, but in those days setting the receiver into Promiscuous mode was all that was needed.

I still assume that current Ethernet cards support Promiscuous mode, since there are a lot of Ethernet and TCP/UDP/IP analysing programs working with standard Ethernet adapters. Are these analysing programs using some dedicated driver stacks?

With the cost of the system that the OP asked about, there would not be a cost issue in installing an extra network card in the receiving PC. Thus, one NIC could handle the fast traffic in Promiscuous mode, while the other NIC(s) could handle ordinary network traffic.

MAC broadcasts work well with hubs. This kind of MAC broadcast is used in some producer/consumer model Ethernet based industrial networks these days.

Paul

Reply to
Paul Keinanen

If the FPGA were a Virtex-II Pro, or V4 FX or so, the PowerPC wouldn't be external.

Mind you, after designing logic, anything running on the PowerPC seems painfully slow...

- Brian

Reply to
Brian Drummond

It's easy:

$ man packet

Sure they do. See above.
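For example, a minimal Linux packet(7) receiver bound to one interface looks like this (the interface name is an example; it needs root or CAP_NET_RAW):

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>
#include <arpa/inet.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_ll sll = { 0 };
    sll.sll_family   = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex  = if_nametoindex("eth1");   /* example interface */
    if (bind(fd, (struct sockaddr *)&sll, sizeof sll) < 0) {
        perror("bind"); return 1;
    }

    unsigned char frame[1518];
    for (;;) {
        ssize_t n = recv(fd, frame, sizeof frame, 0);
        if (n > 14)        /* 14-byte Ethernet header, then payload */
            printf("got %zd-byte frame\n", n);
    }
}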

Of course it sucks trying to do it under Windows, but it sucks trying to do _anything_ under Windows. ;)

Yup. One of the products I work on started out with MAC level networking. It's fast and has very low overhead, but there are always going to be customers who want IP networking. So now the product will do either MAC networking or TCP networking (or both, actually).

Beware of relying on the Ethernet CRC. I've run across two different uController/MAC combinations where it wasn't reliable.

--
Grant Edwards                   grante             Yow! Wait ... is this a FUN
                                  at               THING or the END of LIFE in
                               visi.com            Petticoat Junction??
Reply to
Grant Edwards

There's no need for promiscuous mode. None of the MAC packet level products I've worked on used promiscuous mode at all.

I don't see what promiscuous mode has to do with it. The MAC level protocols I worked with were all still unicast.

--
Grant Edwards                   grante             Yow! FOOLED you!  Absorb
                                  at               EGO SHATTERING impulse
                               visi.com            rays, polyester poltroon!!
Reply to
Grant Edwards

We are using one of the embedded PowerPCs to run Linux, and one PicoBlaze soft processor per EMAC in the design.

As each packet exits the EMAC, its header is examined by software that is running on the PicoBlaze. The PicoBlaze is running a very simple stack written in assembly language. As it looks at each layer of the header, it makes a decision to do one of several things. At the Ethernet level, it is deciding if it should just throw the packet away, or pass it on to the next layer. At the IP and UDP layers, it is deciding if the packet belongs to the protocol that we are offloading, and is a stream that we have requested. If the packet does belong to the protocol that we are offloading, the PicoBlaze sets up a PLB DMA engine to send it down one data path. If the packet does not belong to the protocol we are offloading, then the PicoBlaze sends it down another data path and it is given to the Linux kernel to deal with.

So for the protocol that we are offloading, the entire Ethernet, IP, UDP, and custom protocol stack are implemented in PicoBlaze assembly code. For everything else, the stack is in Linux. Since the data we are offloading is multicast data, this lets us have the PicoBlaze deal with the simple but high speed UDP packets, and the PowerPC running Linux deal with the IGMP messaging required to join, leave and maintain membership in a multicast group.

The PicoBlaze is just running a single-threaded loop of code that polls for input or output data, and then deals with it. Its program is loaded by the PowerPC, taking advantage of the fact that the BRAM holding its program is dual ported. The PowerPC also tells the PicoBlaze what streams of data it is supposed to be acquiring, and what buffers in DDR2 memory to write the data to. The PicoBlaze generates an interrupt once it has written a certain number of packets to the buffer, so the PowerPC just sees big chunks of data showing up in the buffers and does not have to deal with each packet. The PowerPC only needs to spend 1 or 2 percent of its compute time to deal with acquiring each stream of data.
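Rendered as C pseudocode, the per-packet decision looks roughly like this. The real thing is PicoBlaze assembly, and the two helper functions stand in for the tables the PowerPC loads:

#include <stdint.h>
#include <stdbool.h>

enum path { DROP, OFFLOAD_DMA, LINUX_KERNEL };

/* Hypothetical helpers standing in for state loaded by the PowerPC. */
extern bool mac_matches(const uint8_t *frame);
extern bool stream_requested(uint32_t dst_ip, uint16_t dst_port);

static enum path classify(const uint8_t *f)   /* f = start of Ethernet frame */
{
    /* Ethernet layer: throw the packet away, or pass it up? */
    if (!mac_matches(f))
        return DROP;
    uint16_t ethertype = (uint16_t)(f[12] << 8 | f[13]);
    if (ethertype != 0x0800)
        return LINUX_KERNEL;   /* non-IP (e.g. ARP): let Linux handle it */

    /* IP layer: only UDP (protocol 17) can be an offloaded stream. */
    const uint8_t *ip = f + 14;
    if (ip[9] != 17)
        return LINUX_KERNEL;

    /* UDP layer: is this one of the streams we were told to acquire? */
    const uint8_t *udp = ip + 4 * (ip[0] & 0x0F);
    uint32_t dst_ip   = (uint32_t)ip[16] << 24 | (uint32_t)ip[17] << 16
                      | (uint32_t)ip[18] << 8  | ip[19];
    uint16_t dst_port = (uint16_t)(udp[2] << 8 | udp[3]);
    return stream_requested(dst_ip, dst_port) ? OFFLOAD_DMA : LINUX_KERNEL;
}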

With this setup, you should be able to have an error rate that is incredibly low. As long as the possibility of a lost packet is not catastrophic, UDP would be a very good match. The protocol we offload has sequence numbers in it, so we know if we have lost a packet. The data is being used for realtime signal processing, so losing a packet just looks like a burst of noise.

Regards,

John McCaskill

formatting link

Reply to
John McCaskill

At what frequency does the PicoBlaze run? It must be pretty fast to deal with packets at this bandwidth. Or does the fact that it only examines the frame headers save you from needing high speed?

Thanks Eli

Reply to
eliben

We run the EMAC 8 bits wide at 125 MHz, and the PicoBlaze at 62.5 MHz using a divided-by-two version of the EMAC clock. The PicoBlaze takes two cycles per instruction, and the packets we are offloading are a bit over 1 KB, so we have about 512 instructions to deal with an offloaded packet and other overhead. Dealing with a non-offloaded packet takes the shortest path through the code, to keep up the number of packets per second we can handle. The network the data is on is tightly controlled, so there is very little on it that is not the protocol we are offloading, mostly just IGMP packets for dealing with the multicast groups, and they are at a very low rate.

You are correct in that we do not look at the entire packet with the PicoBlaze, just the header. Once it has determined that it wants to offload a packet, it has a little bit more work to do to calculate addresses and load them into the DMA engine. To make sure that we do not drop packets, we just need to make sure that the longest path through the code takes less time than it takes to receive a packet. We have a FIFO between the EMAC and the DMA engine, so we can smooth things out a bit.

Regards,

John McCaskill

formatting link

Reply to
John McCaskill
