Handling high UDP throughput

Hello,

We have a system that has to process data arriving over a GbE link in UDP packets. The throughput is ~40 MB/s.

I'm looking for solutions on how to process this data. The data processing can be done in an FPGA, but I'd like to do the UDP/IP in a CPU.

Is there a CPU that can handle such a load from a GbE interface, even minimally, by writing all the data into some fast memory that can be accessed by another CPU/FPGA for processing? Or perhaps a custom network chip that can help me?

Besides writing UDP packets to memory, this interface will have to register for UDP multicast with IGMP and answer ARP requests, for the router to know where it's located.
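For reference, on a CPU that runs an OS with a BSD-style socket API, the multicast registration described above is a single socket option: the host's IP stack then sends the IGMP membership report and answers ARP on its own. The following is only a minimal sketch under that assumption. The group address 239.1.2.3 and port 5000 are made-up placeholders, and a bare-metal design without such a stack would have to build the IGMP and ARP packets itself.

```python
import socket

def make_mreq(group: str, iface: str = "0.0.0.0") -> bytes:
    """Pack the 8-byte ip_mreq structure: group address, then local interface."""
    return socket.inet_aton(group) + socket.inet_aton(iface)

def open_multicast_receiver(group: str, port: int,
                            iface: str = "0.0.0.0") -> socket.socket:
    """Open a UDP socket bound to `port` and join the multicast `group`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Joining the group is what makes the OS send the IGMP membership report.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    make_mreq(group, iface))
    return sock

# Hypothetical usage (blocks until a datagram arrives):
#   rx = open_multicast_receiver("239.1.2.3", 5000)
#   data, sender = rx.recvfrom(65535)
```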

Thanks in advance

Reply to
eliben
Any ideas?

Reply to
eliben

Any CPU fast enough would seem to suffice. You should be able to sit down and do some calculations to figure it out. In fact, you must already have good ideas about the memory speeds needed, since you imply you have this working with an FPGA.

It's just a question of money in the end. Ed

Reply to
Ed Prochak

You need to worry about UDP and IP issues only when setting up the transfer. After that, think of the actual data transfer as raw Ethernet frames with some garbage (the IP and UDP headers) at the beginning of each frame. As long as the Ethernet chip supports DMA transfers into multiple buffers, this should be doable. However, this may complicate the further processing in the FPGA.

If jumbo frames can be used, this will reduce the number of interrupts needed at the end of each frame.
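To put rough numbers on the jumbo-frame point, here is a back-of-the-envelope sketch. It assumes that 40 MB/s means 40e6 bytes of UDP payload per second, that every frame is filled to the MTU, that IPv4 is used without options, and that there is one interrupt per frame. None of these figures come from the original poster.

```python
IP_UDP_HEADERS = 20 + 8        # IPv4 header without options + UDP header, bytes
ETH_FRAMING = 14 + 4 + 8 + 12  # Ethernet header, FCS, preamble, inter-frame gap

def frames_per_second(payload_rate: float, mtu: int) -> float:
    """Frames needed per second when each frame carries (mtu - 28) payload bytes."""
    return payload_rate / (mtu - IP_UDP_HEADERS)

def wire_bits_per_second(payload_rate: float, mtu: int) -> float:
    """Bit rate on the wire, including all per-frame Ethernet overhead."""
    return frames_per_second(payload_rate, mtu) * (mtu + ETH_FRAMING) * 8

rate = 40e6  # assumed payload rate in bytes/s
for mtu in (1500, 9000):
    print(mtu,
          round(frames_per_second(rate, mtu)),
          round(wire_bits_per_second(rate, mtu) / 1e6), "Mbit/s")
```

Under these assumptions a standard 1500-byte MTU needs about 27,000 frames per second and a 9000-byte MTU about 4,500, roughly a sixfold reduction in per-frame events. In both cases the wire rate stays well below the 1 Gbit/s line rate.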

Paul

Reply to
Paul Keinanen

What processing is required?

The IP/UDP stack at 40 MB/s is a substantial computing load. You need a ~GHz-class CPU with appropriate memory and DMA subsystems.

Vladimir Vassilevsky DSP and Mixed Signal Design Consultant

Reply to
Vladimir Vassilevsky

Take a look at the QUICC coprocessors on Freescale MPC8xxx processors. These should be able to capture the raw Ethernet frames into memory without main CPU intervention (or with an interrupt after each frame or burst of frames). Once the communication has been set up, processing the UDP data is no harder than processing raw Ethernet frames.

However, the other question is how are you going to do the actual application data processing, since you have only 25-100 instructions/data byte even on a multi-GHz processor. It might be more economical to use several low speed processors in parallel, each processing every nth frame or each receiving a burst of frames, doing the application data processing, while the other processors capture the other bursts of data.
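The 25-100 figure follows from dividing the clock rate by the data rate. A small sketch of that arithmetic, assuming about one instruction per clock cycle and 40 MB/s = 40e6 bytes/s (both simplifications, not measured values):

```python
def cycles_per_byte(clock_hz: float, data_rate_bytes: float) -> float:
    """Clock cycles available per incoming byte at a given data rate."""
    return clock_hz / data_rate_bytes

def per_cpu_rate(data_rate_bytes: float, n_cpus: int) -> float:
    """Data rate each CPU sees if frames are dealt out round-robin to n CPUs."""
    return data_rate_bytes / n_cpus

rate = 40e6  # assumed incoming rate in bytes/s
print(cycles_per_byte(1e9, rate))   # 1 GHz CPU
print(cycles_per_byte(4e9, rate))   # 4 GHz CPU
# Four hypothetical 400 MHz CPUs, each handling every 4th frame:
print(cycles_per_byte(400e6, per_cpu_rate(rate, 4)))
```

A 1 GHz CPU gets 25 cycles per byte and a 4 GHz CPU gets 100. Splitting the stream across four 400 MHz processors gives each of them 40 cycles per byte, which illustrates the trade-off suggested above.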

Paul

Reply to
Paul Keinanen

How can a multiple-buffer DMA be employed here?

Reply to
eliben

I don't have a problem with the processing, actually, because the "heavy" data stream will be decimated considerably by the FPGA prior to being sent to a DSP.

What I worry about is that since this FPGA has to receive the packets, I wonder how to make some other entity (not the FPGA) handle the IGMP/ ARP stuff. I know UDP/IP packets can be easily "unshelled" and processed in a FPGA, but not IGMP/ARP.

Ideally the solution would be some smart chip that can process IGMP/ ARP and quickly dump everything else by DMA to my FPGA. Can the QUICC do that?
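The split being asked for comes down to a classification on two header fields: the EtherType at byte offset 12 of the frame and, for IPv4, the protocol field at offset 9 of the IP header. The sketch below shows only that decision logic in software, assuming untagged Ethernet II frames (no VLAN tag). Whether a QUICC or any other part can do this filtering in hardware is not established here.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
IPPROTO_IGMP = 2
IPPROTO_UDP = 17
ETH_HEADER_LEN = 14

def classify_frame(frame: bytes) -> str:
    """Return 'arp', 'igmp', 'udp' or 'other' for an untagged Ethernet II frame."""
    if len(frame) < ETH_HEADER_LEN:
        return "other"
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype == ETHERTYPE_ARP:
        return "arp"      # control path: answer on the CPU
    if ethertype != ETHERTYPE_IPV4 or len(frame) < ETH_HEADER_LEN + 20:
        return "other"
    protocol = frame[ETH_HEADER_LEN + 9]   # IPv4 protocol field
    if protocol == IPPROTO_IGMP:
        return "igmp"     # control path: handle on the CPU
    if protocol == IPPROTO_UDP:
        return "udp"      # data path: forward to the FPGA
    return "other"
```

In a design like the one described, frames classified as 'arp' or 'igmp' would stay with the processor, and 'udp' frames would be passed on for decimation.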

Eli

Reply to
eliben

Probably yes. A definitive answer would require working out in detail how it will be done, of course; anyway, the QUICC and the like (Freescale also has other powerful processors with just 1 or 2 10/100/1000-capable Ethernet ports) are likely your best bet.

Didi

------------------------------------------------------ Dimiter Popoff Transgalactic Instruments

Reply to
didi

You can do 60 MB/s easily with TCP to a 500 MHz PowerPC, even using a WinXP PC as the host. A lot depends on what OS and TCP/IP stack are used on the device, what is done with the data once received, and how much time you can put into optimizing the system.

I'm not just saying you can do this because I think you can - I've done it.

Bill A.

Reply to
Bill A.

Only if you just send the same packet over and over in a dummy loop and do nothing else.

I've done 100 Mbit Tx/Rx with a BlackFin at 600 MHz. Even 12/12 MB/s of UDP traffic is a considerable load. Copying between the different buffers, calculating checksums, cache thrashing, etc. are not free; all of that hogs the bus and the CPU.

Vladimir Vassilevsky DSP and Mixed Signal Design Consultant

Reply to
Vladimir Vassilevsky

Actually, my tests sending data and doing nothing with the data got me over 920 Mbit/s. You can't just throw together a system and do this. You won't get that with Linux or any other OS. The product that uses this sustains 540 Mbit/s with a 38 kHz interrupt running that uses more than half the processor's power, so a lot goes on in the system but a lot of time is still available for TCP/IP. The Ethernet driver was optimized, the memory movement was optimized (just using an inline memcpy that does a DMA transfer adds 30% to the effective speed), the IP checksum was in assembly, and a zero-copy TCP/IP stack was required.

This was with the Freescale QUICC 8349 so I concur with the other post - this processor can do it - it's designed as a communications processor.

I didn't say it was easy. I didn't say a system like you used could do it. I'm only saying it is possible in an embedded device with a reasonable processor - you don't need ~GHz as you claimed.

What OS did you use? What stack? How many TX buffers did you have? How fast could the processor get the data to the MAC? Did you do zero-copy TCP/IP (it's very hard to do this with sockets)? The QUICC buffer descriptor memory makes it very easy to send lots of data without processor intervention. Oh, I forgot: the Ethernet driver I wrote wasn't even interrupt driven. At those interrupt rates there was no improvement over simply polling for data. This may have been because, when polling, the processor cache wasn't constantly being replaced by the Ethernet interrupt service routine.
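For readers unfamiliar with buffer descriptors, the general idea of a polled descriptor ring can be modelled in a few lines. This is a pure software simulation with hypothetical names (`BufferDescriptor`, `RxRing`, `hw_receive`, `poll`). It is not the actual QUICC descriptor layout or the driver described above, and the default ring size is only an illustrative value.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BufferDescriptor:
    buffer: bytearray
    length: int = 0
    empty: bool = True   # True: owned by the "hardware", waiting to be filled

class RxRing:
    def __init__(self, count: int = 4, size: int = 1536):
        self.descriptors = [BufferDescriptor(bytearray(size)) for _ in range(count)]
        self.hw_index = 0   # next descriptor the simulated MAC fills
        self.sw_index = 0   # next descriptor the driver inspects

    def hw_receive(self, frame: bytes) -> bool:
        """Simulated MAC/DMA side: store a frame in the next empty buffer."""
        bd = self.descriptors[self.hw_index]
        if not bd.empty or len(frame) > len(bd.buffer):
            return False    # ring full (overrun) or frame too large: dropped
        bd.buffer[:len(frame)] = frame
        bd.length = len(frame)
        bd.empty = False    # hand ownership to software
        self.hw_index = (self.hw_index + 1) % len(self.descriptors)
        return True

    def poll(self) -> Optional[bytes]:
        """Driver side: return the next received frame, or None if none is ready."""
        bd = self.descriptors[self.sw_index]
        if bd.empty:
            return None
        frame = bytes(bd.buffer[:bd.length])
        bd.empty = True     # give the buffer back to the hardware
        self.sw_index = (self.sw_index + 1) % len(self.descriptors)
        return frame
```

The driver loop calls `poll()` repeatedly instead of waiting for an interrupt. The ring absorbs bursts as long as software empties buffers before the hardware index wraps around onto a descriptor that is still full.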

Bill A.

Reply to
Bill A.

Typo:

make that receive

Reply to
Bill A.

Fortunately, the IP checksum is calculated only over the IP header (a few bytes), and you don't have to use the UDP checksum at all: RFC 768 allows setting it to 0, so you can simply rely on the 32-bit MAC-level CRC.
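As an illustration of how little work the header checksum is, here is a straightforward (unoptimized) version of the 16-bit one's-complement Internet checksum, plus a check for the "no UDP checksum" case. The function names are made up for this sketch. Note that a zero UDP checksum is permitted only over IPv4.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit big-endian words, complemented."""
    if len(data) % 2:
        data += b"\x00"               # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def ipv4_header_ok(header: bytes) -> bool:
    """A received header is valid if the checksum over all of it is zero."""
    return internet_checksum(header) == 0

def udp_checksum_present(udp_header: bytes) -> bool:
    """Bytes 6-7 of the UDP header are all zero when the sender omitted it."""
    return udp_header[6:8] != b"\x00\x00"

# 20-byte example header with the checksum field (bytes 10-11) set to zero:
hdr = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(hdr)))    # 0xb861
```

For a 20-byte header without options this is ten additions per packet, which is negligible next to touching every payload byte.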

I still wonder how the OP would interface the FPGA to the QUICC I/O coprocessor. It might be possible to use the TSA (Time Slot Assigner) in some creative way.

Paul

Reply to
Paul Keinanen

Our own OS, our own stack and MAC driver, 4/4 Rx/Tx buffers, 100/100 full duplex. We found that there is generally no advantage in using more than 4 buffers, while fewer than 4 buffers decreases the throughput.

That is done by DMA. The speed depends on many factors.

No, it has to copy the data. You have to do that not just because of sockets, but also because BlackFin has no automatic means of ensuring cache/DMA coherency.

So, the driver is blocking. No multitasking.

Interrupt servicing at rates up to hundreds of kHz isn't a big problem on BlackFin. The context switch overhead is only ~200 ns, and the interrupt code and data are located in L1, so there is no stalling because of the cache.

Vladimir Vassilevsky DSP and Mixed Signal Design Consultant

Reply to
Vladimir Vassilevsky
