Xilinx FFT C-sim model

- P
- Pete
  
  Contact options for registered users
posted
16 years ago

Wed, Apr 9, 2008 8:19 PM

I've noticed that the Xilinx FFT bit-accurate c simulation calls are very very slow. Anyone else notice this?

I am working on hybrid fixed/floating-point digital signal processing application, and I frequently make calls to the bit-accurate simulation function (anywhere on the order of 1,000 times, to

1,000,000 times per run). As I'm attempting to run longer simulations, the runtime is becoming my critical path.

The program is single-threaded right now, and I intend to write a multi-threaded version soon. But first, I decided to use a profiler on my single-threaded code to see what kind of speedup I should expect. I wasn't very surprised to find that 99% of my execution time was taking place inside the xilinx_ip_xfft_v5_0_bitacc_simulate() function. I WAS surprised to find that 55.6% of that function's execution time is being spent in malloc() and free() calls. I have some nice call graphs and an excel spreadsheet I'd be willing to share if someone from Xilinx would like to take a look.

Why not move all of the memory allocations into the "create" and "destroy" state calls of the program? If this were done, all memory allocations could be performed once, when the state were created. A speedup of over 2x could be realized with those changes to the function. On top of that, I fear that multi-threading my code won't lead to very much speedup, due to the spin-locks in the heap allocation kernel calls.

Now before anyone replies to this thread and says, "but that's a lot of memory allocations hanging around!" My reply: If a user is worried about running out of memory, they can create and destroy the state as needed. The memory is going to get allocated anyway. The only difference will be better control of how often the costly heap access kernel functions are called. I could see the memory possibly being wasted when simulating a core with a reconfigurable transform length. A pipelined 64K-point FFT would need roughly 1GB. At that point, however, the math complexity will probably start to eclipse the memory allocation penalties.

I'd be more than happy to make the changes to the source myself, if someone from Xilinx were kind enough to share it.

Cheers. Petersen Curt EM Photonics, Inc. Newark, DE

- M
- Marty Ryba
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Thu, Apr 10, 2008 3:00 AM

I don't have direct experience with the Xilinx simulation, but a couple observations:

-- malloc() and free() don't belong in an FFT core; most optimized DSP library calls make the user pre-build the "twiddle factors" table (via an API call) and then the FFT routine can reuse that without needing to recreate it every time. Saves a lot of CPU-expensive sine() and cosine() calls as well. Xilinx could easily alter their API to provide that functionality.

-- Bit-true emulation is expensive since you no longer can call the IEEE-754 functions built into your FPU in the microprocessor core but have to do everything the "hard way".

I've done one simulation using the Matlab Fixed Point Toolbox. It worked, and was pretty close to the Blackfin DSP I was emulating, but it ran

10 times slower than the floating point implementation. One problem with the Fixed Point Toolbox is that they only provide an "example" code that implements a radix-two transform. If you don't do it exactly that way, you'll have to code the FFT routine yourself. I was using a radix-4 algorithm on the Blackfin, and some of the source code was in assembler. I truncated and rounded my results after the FFT and did the other critical calculations in the Toolbox. Maybe Xilinx could generate a reference FFT function for the Matlab toolbox, but that might give away some IP on their implementation.

- P
- Patrick Dubois
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Apr 11, 2008 1:56 PM

I'm using the Matlab mex file function and noticed the slowness as well. Not only that, my input data is real only and the function issues a warning in the Matlab console in that case. It's really annoying because it can't be turned off.

As for the speed issue, maybe you could consider using a non bit- accurate model that you would build yourself and use the bit-accurate model only when really needed. That's what I did.

Patrick