Hello. I have designed a small application on a Virtex-4 FX board, based on the Xilinx lwIP Echo Server example. I haven't made any dramatic changes to the example; in general terms, I modified the project's socket.c to receive UDP packets on the TEMAC from a source IP and retransmit them to a destination IP.
After some time I managed to make everything work as I planned. But now I'm facing a problem that I don't know how to solve.
I can only transmit data through the board at a maximum rate of 1 Mbps; anything more than that is lost. E.g., when I transmit at 1.4 Mbps, 400 Kbps of data is lost, and so on. And when I try to transmit at data rates over 3 Mbps, I get an error message on HyperTerminal that the Rx FIFO is full. The data buffer I use for the data transmission is declared as a normal array[] variable. Is there an easy way to improve the performance of my application?
Thank you for your reply PFC. Really helpful information.
I forgot to mention some information about my application: it's using UDP, with a Gigabit Ethernet controller and Xilinx's MicroKernel (XMK).
I knew that lwIP and XMK were going to cause me delays, but I never expected them to be so great! Unfortunately there is no time for me to redesign the application to achieve good throughput, but I have learnt my lesson.
Without "re-designing", your biggest speed increase will come from increasing your receive buffer size. Odds are you are just overrunning your receive queue. The V4FX TEMAC has support for gathering these statistics; combined with the PLB TEMAC, you should be able to read the registers out from your software to see what's going on. If it matches what you posted, your design should be adequate to achieve a much higher speed than you are seeing.
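In a stock lwIP port, the receive buffering is controlled by a handful of options in lwipopts.h. A sketch of the knobs to try raising is below; the option names are standard lwIP (from opt.h), but the values here are purely illustrative and have to be tuned to the BRAM/DDR you actually have:

```c
/* lwipopts.h fragment -- illustrative values only, not a drop-in config */

#define MEM_SIZE            (64 * 1024) /* lwIP heap, used for TX pbufs      */
#define MEMP_NUM_PBUF       64          /* pbuf structs for PBUF_ROM/REF     */
#define PBUF_POOL_SIZE      64          /* pool pbufs available for RX;      */
                                        /* too few of these = dropped frames */
#define PBUF_POOL_BUFSIZE   1600        /* each one holds a full Ethernet    */
                                        /* frame without chaining            */
#define MEMP_NUM_UDP_PCB    8           /* concurrent UDP "connections"      */
```

If the Rx FIFO-full message goes away but packets still vanish, the bottleneck has moved from the MAC into the stack or the application loop.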
I have a V4FX60-10 w/ the PowerPC running at 100MHz and using the MPMC2 w/ the CDMAC. With this I am able to achieve speeds >100Mb/s. (don't know about 1000Mb yet, haven't pushed it that far).
Thank you for the reply. Just wanted to clarify that I'm using the PPC processor at 100 MHz, not a MicroBlaze.
It would be great if I could achieve such speed as your V4FX60, Mike, but I am out of luck (and skills, and time) :) I will try to play around with the buffer size, but as I recall I have already tried that without any big difference in throughput. What makes me curious is that the maximum throughput I get is EXACTLY 1 Mbps, whatever I do. Could that be some kind of limitation implemented in the original example project's architecture?
Well, using a TCP stack and an OS (like uClinux for MicroBlaze; I don't know about XMK) adds a lot of complexity. Recently I looked into the Atmel AVR32 CPU with Linux; this small CPU is very nice, but it only does 2 MBytes/s on Ethernet (1 MByte/s full duplex) with Linux, which is very good for this kind of embedded CPU. Basically, from the kernel sources, this is what Linux does:
- Ethernet MAC uses DMA to copy the received packet to RAM ring buffer
The DMA engine is very smart and supports a linked list of buffers, so a compromise can be made here:
- either use 1.5 kB buffers, which can hold a full packet but waste space for small packets,
- or use several small buffers per packet (wastes less space, but needs a copy). This is what is done in the driver.
- Allocating an SKB (socket buffer) for the full packet and copying the packet into it
- UDP checksum computation and verification
- Going through the TCP/IP stack (possibly one more copy)
- Copying the packet data to user space
- Application manipulates the packet and sends it
- Copy the packet from user space to an skb in the kernel
- Build the Ethernet and UDP headers
- Compute the UDP checksum
- Queue the skb for DMA send
- DMA copy to the MAC
So you have 2 DMA copies (which is the minimum), plus at least 4 processor copies, which obliterates your performance.
An optimum configuration for a UDP forwarder, without an OS, would be:
- Have a pool of buffers
- DMA-write the received packet to a buffer
- The CPU examines the packet, messes with the headers, does its thing
- DMA the same buffer to the MAC for send
This is zero-copy and much faster, but you can't do this with an OS unless you really hack the drivers and TCP stack, and run without separate user/kernel memory spaces.
Is this for a school project?
You also have to consider that MicroBlaze only runs at maximum speed when code and data come from LMB BRAM or the I/D caches. Executing code from SDRAM, or loading data from SDRAM, is HUMONGOUSLY SLOW: if you use plain opb_sdram without cache links, you get something like 15 cycles per access, so your 50 MHz CPU becomes slower than some 8-bit dinosaur from 1980. And a large OS and TCP stack does not fit in fast BRAM or cache; plus, it's full of branches/tests which kill the cache prefetching.
If you want to handle 1 Gbps with your Virtex-4 at full throttle, keep in mind that with 1024 bytes per packet, this is more than 100,000 packets per second: about 10 microseconds per packet! A Core 2 CPU will execute a few tens of thousands of instructions in 10 microseconds, but your 50 MHz MicroBlaze will only have about 500 cycles to process a packet. You can't do a lot of things in 500 cycles... especially with 12-15 cycle SDRAM access latency...
That's why routers like the WRT54G can afford to use Linux: a 200 MHz ARM CPU can handle the very slow speeds of WiFi. But all the $10 Fast Ethernet switches on the market are basically just a chip with a hardware packet-processing engine and a slow microcontroller which only has to tell the hardware "send the packet to port #2".
I put my UDP code, and all the MAC driver code, in a BRAM block sitting on the LMB, with 1-cycle access time; I could do this because it is so small. A full TCP stack would never fit.
Be fair - a major problem in Suzaku's configuration is the lack of DMA on the Ethernet MAC. A simple fix would be to add an opb_dma controller to the system and reconfigure the Ethernet driver to use it.
With the Xilinx EMAC core (full DMA, data realignment engine, and checksum offload) we see sustained 50 Mbps throughput on MicroBlaze Linux systems at 100 MHz.
You are right that there is OS overhead; however, the Linux kernel does as little packet copying as possible: once off the MAC into main memory (unavoidable), then once from kernel to user space. If you use sendfile() then it's zero-copy.
No operating system will get high performance on a badly bottlenecked hardware architecture!
I agree with John: when your HW suits your needs, then you can concentrate on the SW part. The best HW solution for you is a modified Xilinx Gigabit System Reference Design (aka GSRD2), which was originally built for the ML403 board. It uses the Multi-Port Memory Controller and a fast LL DMA engine for the TEMAC connection. With this system running on an Avnet V4FX12 MiniModule (pretty much the same as your board), I have achieved 740 Mbit/s with 1.5k packets and 850 Mbit/s with 7k packets, streaming raw Ethernet data to a PC. Some Japanese guy put Linux 2.6 on this design and achieved 350 Mbit/s TCP performance. The problem I am facing now is a PPC data caching errata: when caching is turned off, the performance is significantly lower; when turned on, errors appear in the DMA descriptors. I still do not know what to do about it. If you need the GSRD2 HW design, do not hesitate to contact me.
Thank you all for your help. I managed to get a slight performance improvement with some minor modifications to the configuration. I guess you are right: the platform design is my main problem. Unfortunately I was unable to find one that works directly with my board, as the GSRD2 design provided by Xilinx won't work on my Memec V4FX12 LC board without modifications. Thank you for offering to send it to me, Guru; unfortunately for me it's a bit late now, but if you could upload it anyway for my future designs, I'd be grateful. Will it work on my Memec board? Or do you know what kind of changes it will require? Thanks in advance.
Data caching errata? Do you have a pointer to those? So far I have not run into any problems with data caching; I hope it stays that way.