TEMAC Performance Issues with Virtex 4FX

- R
- ryufrank
  
posted
16 years ago

Tue, Aug 7, 2007 11:49 AM

Hello. I have designed a small application on a Virtex-4 FX based on the Xilinx lwIP Echo Server example. I haven't made any dramatic changes to the example; in general terms, I modified the project's socket.c to receive UDP packets on the TEMAC from a source IP and retransmit them to a destination IP.

After some time I managed to make everything work as I planned. But now I'm facing a problem that I don't know how to solve.

I can only transmit data through the board at a maximum rate of 1 Mbps. Anything more than that is lost, e.g. when I transmit at 1.4 Mbps, 400 kbps of data is lost. And when I try to transmit at data rates over 3 Mbps, HyperTerminal shows an error message saying that the Rx FIFO is full. The data buffer I use for the data transmission is declared as a normal array variable. Is there an easy way to improve the performance of my application?

Thank you in advance.

- P
- PFC
  
posted
16 years ago

Tue, Aug 7, 2007 1:03 PM

You mean "to the board", not "through", yes ?

Well, I have never used Virtex4, but I have used an FPGA board (Suzaku) with LAN91c111 MAC chip. This, running Microblaze/ucLinux, achieved an ethernet bandwidth of... 2 Mbps. Yes, that's a bit more than 200 kilobytes per second, ie. ridiculous.

TCP handles retransmissions, so full FIFOs causing lost packets will slow down your throughput, but the packets will be retransmitted. If, however, you need more throughput, or use UDP, you have a problem.

If you only need a few megabits/s with UDP, you still have a problem : the PC will sometimes pause for a few tens of milliseconds (disk access, whatever), and then send the backlog of packets at full wire speed, which will zap your RX buffers.
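To put rough numbers on that burst, here is a minimal sketch in C. The function names and the simple fluid model (backlog arrives at wire speed while the board drains at a constant rate) are assumptions made for illustration, not something measured on the original board.

```c
#include <stdint.h>

/* Bytes a sender accumulates when it stalls for pause_ms while the
 * application keeps producing data at stream_bps (bits per second). */
uint64_t backlog_bytes(uint64_t stream_bps, uint32_t pause_ms)
{
    return stream_bps * pause_ms / 1000 / 8;
}

/* Peak RX buffer occupancy when that backlog arrives at wire_bps and the
 * receiver drains it at drain_bps. Simplified model: constant rates,
 * no per-packet overhead. */
uint64_t peak_rx_bytes(uint64_t backlog, uint64_t wire_bps, uint64_t drain_bps)
{
    if (drain_bps >= wire_bps)
        return 0; /* receiver keeps up with the wire: nothing piles up */
    return backlog - backlog * drain_bps / wire_bps;
}
```

Under these assumptions, a 3 Mbps stream stalled for 30 ms produces an 11250-byte backlog, and a receiver that drains at 1 Mbps on a 100 Mbps link has to buffer almost all of it.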

So if you need UDP without packet loss, you need to be able to absorb the full wire speed (100M or 1G depending on your application). If you need less than 100 Mbps throughput, it can be useful to configure the ethernet on 100 Mbits instead of 1 Gbits (but then the switch might zap the packets if you use a switch !)

UDP has no transmission guarantee ; however if your receiving hardware can pull the packets from the FIFO faster than they come (ie. handles 100 Mbps without sweating), and your network has no funky topology like inter-switch bottlenecks etc, you will find that you can run the thing for 10 hours straight and not lose a single packet. Wired Ethernet is very reliable.

Your problem is that lwIP is a TCP/IP stack designed for simple embedded applications, to add TCP-IP to a microcontroller, at low bandwidth, with a small code footprint. It is not at all designed for high throughput !

Embedded Linux has another problem : it has too many features, so a lot of CPU is used to process the received network data. This is not a problem when you have a 3 GHz Core 2 CPU. However, on a 50 MHz Microblaze, that's a different story.

So, if you intend to use UDP for high throughput data transfer, you'll need to ditch lwIP and write your own very simple network stack.

If your TEMAC supports scatter/gather DMA, set it up with a large RX ring buffer, preferably a few megabytes in SDRAM, so that it can receive many packets and write them to memory very fast.

Then, to handle a received packet, write simple C code like this :

- If packet is ARP query, answer query
- If packet is UDP with destination = myself, parse packet
- All other cases, drop packet

This will take about 3 pages of C code, very simple, very efficient.
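The three-way decision above could be sketched as follows. This is only an illustration: `classify_frame`, the `rx_action` enum and the assumption of untagged Ethernet II frames carrying IPv4 are mine, and the actual ARP reply and UDP payload handling are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result codes for the receive path. */
enum rx_action { RX_DROP, RX_ANSWER_ARP, RX_HANDLE_UDP };

/* Read a big-endian 16-bit field. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Decide what to do with one received Ethernet II frame (no VLAN tag).
 * my_ip is the board's IPv4 address in network byte order. */
enum rx_action classify_frame(const uint8_t *f, size_t len, const uint8_t my_ip[4])
{
    if (len < 14)
        return RX_DROP;

    uint16_t ethertype = be16(f + 12);

    if (ethertype == 0x0806) {                 /* ARP */
        const uint8_t *arp = f + 14;
        if (len < 14 + 28)
            return RX_DROP;
        /* Ethernet/IPv4 ARP request whose target address is ours. */
        if (be16(arp) == 1 && be16(arp + 2) == 0x0800 &&
            arp[4] == 6 && arp[5] == 4 && be16(arp + 6) == 1 &&
            memcmp(arp + 24, my_ip, 4) == 0)
            return RX_ANSWER_ARP;
        return RX_DROP;
    }

    if (ethertype == 0x0800) {                 /* IPv4 */
        const uint8_t *ip = f + 14;
        if (len < 14 + 20)
            return RX_DROP;
        size_t ihl = (size_t)(ip[0] & 0x0F) * 4;
        if ((ip[0] >> 4) != 4 || ihl < 20 || len < 14 + ihl + 8)
            return RX_DROP;
        if ((be16(ip + 6) & 0x3FFF) != 0)      /* fragments: not handled here */
            return RX_DROP;
        if (ip[9] == 17 && memcmp(ip + 16, my_ip, 4) == 0)
            return RX_HANDLE_UDP;              /* UDP addressed to us */
    }

    return RX_DROP;                            /* everything else */
}
```

The function only reads the frame where the DMA engine left it, so it fits the no-copy approach: the caller answers the ARP query, processes the UDP payload, or simply returns the buffer to the ring.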

If you can't use hardware for UDP checksums, ignore them on receive, set them to 0 on sends. UDP checksums are useless on LAN anyway, since there is Ethernet CRC for data integrity check.

Do not copy packet data at least 3 times like lwIP does ! Do everything in-place, recycle your buffers, etc.

If you need to copy data, use a DMA core, but first ask yourself, do you really need this copy ? You could do some clever buffer recycling instead.
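As an illustration of doing everything in place, the sketch below retargets a received UDP frame without copying the payload. `retarget_udp_frame` and its parameters are hypothetical names, and it assumes an untagged Ethernet II frame carrying IPv4. Note that the IPv4 header checksum still has to be recomputed when the addresses change; only the UDP checksum may be set to 0 (which means "not computed" for UDP over IPv4).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Internet checksum (one's complement sum) over an IPv4 header. */
static uint16_t ip_header_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Rewrite addresses of a received frame in place so it can be sent on to
 * dst_mac/dst_ip. Returns 0 on success, -1 if the frame is too short. */
int retarget_udp_frame(uint8_t *f, size_t len,
                       const uint8_t my_mac[6], const uint8_t dst_mac[6],
                       const uint8_t my_ip[4], const uint8_t dst_ip[4])
{
    if (len < 14 + 20 + 8)
        return -1;

    uint8_t *ip = f + 14;
    size_t ihl = (size_t)(ip[0] & 0x0F) * 4;
    if (ihl < 20 || len < 14 + ihl + 8)
        return -1;
    uint8_t *udp = ip + ihl;

    memcpy(f, dst_mac, 6);          /* Ethernet destination */
    memcpy(f + 6, my_mac, 6);       /* Ethernet source */
    memcpy(ip + 12, my_ip, 4);      /* IPv4 source */
    memcpy(ip + 16, dst_ip, 4);     /* IPv4 destination */

    ip[10] = 0;                     /* clear, then recompute header checksum */
    ip[11] = 0;
    uint16_t csum = ip_header_checksum(ip, ihl);
    ip[10] = (uint8_t)(csum >> 8);
    ip[11] = (uint8_t)(csum & 0xFF);

    udp[6] = 0;                     /* UDP checksum 0: not computed (IPv4 only) */
    udp[7] = 0;
    return 0;
}
```

Only about 30 header bytes are touched, so the cost is independent of payload size; the same buffer can then be handed to the TX DMA.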

- R
- ryufrank
  
posted
16 years ago

Tue, Aug 7, 2007 1:59 PM

Thank you for your reply PFC. Really helpful information.

I forgot to mention some information about my application: it's using UDP, with a Gigabit Ethernet controller and Xilinx's Microkernel (XMK).

I knew that lwIP and XMK were going to cause me delays, but I never expected them to be so great! Unfortunately there is no time for me to redesign the application to achieve good throughput, but I have learnt my lesson.

- R
- ryufrank
  
posted
16 years ago

Tue, Aug 7, 2007 2:14 PM

Sorry, I forgot to make reference to this part of your reply:

The application I made for the board receives UDP data from a source PC, and retransmits it to a destination PC.

- S
- Sylvain Munaut
  
posted
16 years ago

Tue, Aug 7, 2007 3:02 PM

You could do that with just hardware, no need for a microblaze.

Sylvain

- M
- morphiend
  
posted
16 years ago

Tue, Aug 7, 2007 3:03 PM

Without "re-designing", your biggest speed increase will come from increasing your receive buffer size. Odds are you are just overrunning your receive queue. The V4FX TEMAC has support for gathering these statistics. Complete w/ the PLB TEMAC, you should be able to read out the registers from your software to see what's going on. If it follows what you posted, your design should be adequate to achieve a much higher speed than you have.

I have a V4FX60-10 w/ the PowerPC running at 100MHz and using the MPMC2 w/ the CDMAC. With this I am able to achieve speeds >100Mb/s. (don't know about 1000Mb yet, haven't pushed it that far).

-- Mike

- R
- ryufrank
  
posted
16 years ago

Tue, Aug 7, 2007 3:52 PM

Thank you for the reply. Just wanted to clarify that I'm using the PPC processor at 100MHz, not a microblaze.

It would be great if I could achieve such speed as your V4FX60 Mike, but I am out of luck (and skills, and time) :) I will try to play around with the buffer size, but as I recall I have already tried that with no big difference in the throughput. What makes me curious is that the maximum throughput I get is EXACTLY 1 Mbps whatever I do. Could that be some kind of limitation implemented in the original example project's architecture?

- P
- PFC
  
posted
16 years ago

Tue, Aug 7, 2007 4:17 PM

You're welcome ;)

Well, using a TCP stack and OS (like ucLinux for MicroBlaze, I don't know about XMK), adds a lot of complexity. Recently I looked into the Atmel AVR32 CPU with Linux ; this small CPU is very nice but it only does 2 MBytes/s on Ethernet (1 Mbyte/s on full duplex) with Linux. This is very good for this kind of embedded CPU. Basically, from the kernel sources, Linux does this :

- Ethernet MAC uses DMA to copy the received packet to RAM ring buffer

The DMA engine is very smart and supports a linked list of buffers. So, here, a compromise can be made :

  - either use 1.5kB buffers, which can hold a full packet, but waste space for small packets
  - or use several small buffers per packet (wastes less space, but needs a copy). This is what is done in the driver.

- Allocating a SKB (Socket Buffer) for the full packet and copying the packet into it
- UDP checksum computation and verification
- Going through the TCP stack (possibly one more copy)
- Copying the packet data to user space
- Application manipulates the packet and sends it
- Copy packet from user space to skb in kernel
- Build Ethernet and UDP headers
- Compute UDP checksum
- Queue skb for DMA send
- DMA copy to MAC

So you have 2 DMA copies (which is the minimum), plus at least 4 processor copies, which obliterate your performance.

An optimum configuration for a UDP forwarder would be, without OS :

- Have a pool of buffers
- DMA write received packet to a buffer
- CPU examines packet, messes with headers, does its thing
- DMA to MAC for send

This is zero copy and much faster, but you can't do this with an OS unless you really hack the drivers, TCP stack, and run without separate user/kernel memory spaces.
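A buffer pool of the kind described can be very small. The sketch below is an assumption-laden illustration: the names, the pool size of 8 and the 1536-byte buffers are made up, and on real hardware the buffers would also need the alignment and cache handling required by the DMA engine, plus protection against the interrupt handler if both contexts touch the pool.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_BUFFERS 8      /* hypothetical pool size */
#define BUFFER_BYTES 1536   /* room for one full Ethernet frame */

struct buffer_pool {
    uint8_t  storage[POOL_BUFFERS][BUFFER_BYTES];
    uint8_t *free_list[POOL_BUFFERS];   /* stack of currently free buffers */
    size_t   free_count;
};

/* Mark every buffer as free. */
void pool_init(struct buffer_pool *p)
{
    for (size_t i = 0; i < POOL_BUFFERS; i++)
        p->free_list[i] = p->storage[i];
    p->free_count = POOL_BUFFERS;
}

/* Take a buffer to hand to the RX DMA; NULL when the pool is exhausted. */
uint8_t *pool_get(struct buffer_pool *p)
{
    if (p->free_count == 0)
        return NULL;
    return p->free_list[--p->free_count];
}

/* Recycle a buffer once the TX DMA has finished sending it. */
void pool_put(struct buffer_pool *p, uint8_t *buf)
{
    if (p->free_count < POOL_BUFFERS)
        p->free_list[p->free_count++] = buf;
}
```

A packet then lives in one buffer from RX DMA to TX DMA; only the pointer moves between the free list and the DMA descriptors.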

Is this for a school project ?

You also have to consider that MicroBlaze only runs at maximum speed when code and data come from LMB BRAM or I/D Cache. Executing code from SDRAM, or loading data from SDRAM, is HUMONGOUSLY SLOW : if you use plain opb_sdram without cache links, you get something like 15 cycles per access, so your 50 MHz CPU becomes slower than some 8 bit dinosaur from 1980. And a large OS and TCP stack does not fit in fast BRAM or cache, plus it's full of branches/tests which kill the cache prefetching.

If you want to handle 1 Gbps with your Virtex-4, at full throttle, keep in mind that with 1024 bytes per packet, this is more than 100,000 packets per second ! 10 microseconds per packet ! A Core 2 CPU will process a few tens of thousands of instructions in 10 microseconds, but your 50 MHz Microblaze will only have about 500 cycles to process a packet. You can't do a lot of things in 500 cycles... especially with 12-15 cycle SDRAM access latency...
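The budget is easy to recompute for other clock speeds and packet sizes. The helper names below are made up for this sketch, and it ignores preamble, inter-frame gap and header overhead, so it slightly overestimates the packet rate.

```c
#include <stdint.h>

/* Packets per second on a link of line_rate_bps when every packet carries
 * packet_bytes (framing overhead ignored). */
uint32_t packets_per_second(uint64_t line_rate_bps, uint32_t packet_bytes)
{
    if (packet_bytes == 0)
        return 0;
    return (uint32_t)(line_rate_bps / ((uint64_t)packet_bytes * 8));
}

/* CPU cycles available per packet at that packet rate. */
uint32_t cycles_per_packet(uint32_t cpu_hz, uint64_t line_rate_bps,
                           uint32_t packet_bytes)
{
    uint32_t pps = packets_per_second(line_rate_bps, packet_bytes);
    return pps ? cpu_hz / pps : cpu_hz;
}
```

With these simplifications, 1 Gbps and 1024-byte packets give roughly 122,000 packets per second and about 410 cycles per packet at 50 MHz, the same order of magnitude as the estimate above.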

That's why routers like WRT54G can afford to use Linux, because a 200 MHz MIPS CPU can handle the very slow speeds of WiFi, but all the 10$ fast Ethernet switches on the market are basically just a chip with a hardware packet processing engine and a slow microcontroller which only has to tell the hardware "send the packet to port #2".

I put my UDP code, and all the MAC driver code, in a BRAM block sitting on the LMB, with 1 cycle access time ; I could do this because it is so small ; a full TCP stack would never fit.

- P
- PFC
  
posted
16 years ago

Tue, Aug 7, 2007 4:38 PM

Hint :

Try it with different packet sizes, at constant throughput, ie. lots of small packets versus fewer packets, but bigger. So, you can measure two different things : the time it takes to process a packet (independent of length), and the time it takes to handle the data in it (which depends on length).
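One way to turn such a measurement into numbers is to assume a linear model, time per packet = fixed cost + per-byte cost × length, and solve it from two measurements. The model and the names `fit_packet_cost` and `packet_cost` are assumptions for this sketch; the time per packet could for instance be taken as 1 / (maximum packets per second without loss) at each size.

```c
/* Result of the two-point fit, in microseconds. */
struct packet_cost {
    double per_packet_us;   /* fixed cost, independent of length */
    double per_byte_us;     /* cost that scales with payload length */
};

/* Solve t = a + b * len from two measurements (len1, t1) and (len2, t2).
 * Returns 0 on success, -1 if both measurements use the same length. */
int fit_packet_cost(double len1, double t1_us, double len2, double t2_us,
                    struct packet_cost *out)
{
    if (len1 == len2)
        return -1;
    out->per_byte_us = (t2_us - t1_us) / (len2 - len1);
    out->per_packet_us = t1_us - out->per_byte_us * len1;
    return 0;
}
```

A large fixed cost points at interrupt, stack and OS overhead; a large per-byte cost points at copies and checksums.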

Would you by any chance use a timer as an interrupt source, instead of the MAC "I received a packet" interrupt ? (check your interrupt wiring...) Can your interrupt handler process ALL the pending packets ?

Hint : since all your processing will probably be in an interrupt handler, run a simple free CPU time performance meter in the non-interrupt code path (main() function).

I did it this way :

I know on an idle system, microblaze can process N iterations per second of a simple for () { i += 1 }. I use a timer, and see how many iterations are actually done in one second, and print it on the serial port. So, I can see how much time is spent in the interrupt handler, in real time. If I nuke it with too many 1 byte packets, it never gets out of the MAC interrupt handler, and it stops displaying the free CPU percentage.
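The bookkeeping for such a meter could look like the sketch below. The function names are hypothetical, and the timer setup and serial printing are platform specific and left out; `idle_baseline` is the iteration count measured once on an otherwise idle system.

```c
#include <stdint.h>

/* Incremented by the background loop; read and reset once per second. */
static volatile uint32_t idle_iterations;

/* Body of the background loop in main(): one unit of "free" CPU time. */
void idle_spin_once(void)
{
    idle_iterations++;
}

/* Call once per second (for example from a timer tick). Returns the share
 * of CPU time left for the background loop, in percent, and restarts the
 * count. */
uint32_t free_cpu_percent(uint32_t idle_baseline)
{
    uint32_t n = idle_iterations;
    idle_iterations = 0;
    if (idle_baseline == 0)
        return 0;
    if (n > idle_baseline)
        n = idle_baseline;              /* clamp measurement jitter */
    return (uint32_t)(((uint64_t)n * 100) / idle_baseline);
}
```

In use, main() would call idle_spin_once() in an endless loop, and the once-per-second tick would print the value returned by free_cpu_percent().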

- J
- John Williams
  
posted
16 years ago

Tue, Aug 7, 2007 11:29 PM

Be fair - a major problem in Suzaku's configuration is the lack of DMA on the ethernet MAC. A simple fix would be to add an opb_dma controller to the system, and reconfigure the ethernet driver to use it.

With the Xilinx EMAC core, full DMA, data realignment engine and checksum offload we see sustained 50Mbps throughput on MicroBlaze Linux systems at 100MHz.

You are right that there is OS overhead, however the Linux kernel does as little packet copy as possible - once off the MAC to main memory (unavoidable), then once from kernel to user space. If you use sendfile() then it's zero copy.

No operating system will get high performance on a badly bottlenecked hardware architecture!

Regards,

John

- G
- Guru
  
posted
16 years ago

Wed, Aug 8, 2007 11:45 AM

I agree with John: when your HW suits your needs, then you can concentrate on the SW part. The best HW solution for you is a modified Xilinx Gigabit System Reference Design (aka GSRD2), which was originally built for the ML403 board. It uses a Multiport Memory Controller and a fast LL DMA engine for the TEMAC connection. With this system running on an Avnet V4FX12 MiniModule (pretty much the same as your board) I have achieved performance of 740 Mbit/s with 1.5k packets and 850 Mbit/s using 7k packets when streaming raw ethernet data to a PC. Some Japanese guy put a Linux 2.6 on this design and achieved 350 Mbit/s TCP performance. The problem I am facing now is a PPC data caching erratum: when caching is turned off the performance is significantly lower, when it is turned on errors in DMA descriptors appear. I still do not know what to do about it. If you need a GSRD2 HW design do not hesitate to contact me.

Cheers,

Guru

- R
- ryufrank
  
posted
16 years ago

Sun, Aug 12, 2007 11:51 AM

Thank you all for your help. I managed to get a slight performance improvement through some minor modifications of the configuration. I guess you are right, the platform design is my main problem. Unfortunately I was unable to find one that works directly with my board, as the GSRD2 design provided by Xilinx won't work on my Memec V4FX12 LC board without modifications. Thank you for offering to send it to me, Guru; unfortunately for me it's a bit late now, but if you could upload it for me anyway for my future designs, I'd be grateful. Will it work on my Memec board? Or do you know what kind of changes it will require to work? Thanks in advance

- T
- Torsten Landschoff
  
posted
16 years ago

Tue, Aug 14, 2007 11:22 AM

Data caching errata? Do you have a pointer for those? So far I did not run into any problems with data caching. I still have the hope it stays that way.

Greetings, Torsten