System Generator pcore I/O performance results

eejw 19 years ago

Hello all:

I have a question regarding using SysGen to create a co-processor that's used in a microblaze design. I'm using EDK v9.1 through the base system builder wizard to create a design used on a Xilinx ML401 dev. board.

I've already generated a simple pcore and connected that to the microblaze proc. in EDK. Data are being passed from MB -> pcore and pcore -> MB through shared memory (using the "from register" and "to register" in SysGen).

Using the provided function calls for communicating from MB -> pcore, I do the following:

    findavg_sm_0_Write(FINDAVG_SM_0_D0, FINDAVG_SM_0_D0_DIN, datasamp[0]);
    findavg_sm_0_Write(FINDAVG_SM_0_D1, FINDAVG_SM_0_D1_DIN, datasamp[1]);
    findavg_sm_0_Write(FINDAVG_SM_0_D2, FINDAVG_SM_0_D2_DIN, datasamp[2]);
    etc.

To check performance, I start a timer, call the function that writes to shared memory, then read the value from the timer.

So it's just:

    //start timer
    findavg_sm_0_Write(FINDAVG_SM_0_D0, FINDAVG_SM_0_D0_DIN, datasamp[0]);
    //read count register
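A measurement taken this way also counts the cycles spent starting and reading the timer itself. The following is a minimal sketch of subtracting that overhead, assuming a free-running 32-bit up-counter and a separate calibration run in which the timer is started and read with nothing in between. The function names and the `overhead` parameter are hypothetical and not part of any Xilinx driver.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles elapsed between two reads of a free-running 32-bit up-counter
 * (assumed). Unsigned subtraction stays correct across one wraparound. */
uint32_t elapsed_cycles(uint32_t start, uint32_t stop)
{
    return stop - start;
}

/* Average cycles per operation with the measurement overhead removed.
 * `overhead` is the count obtained by starting and reading the timer
 * back to back (hypothetical calibration step). Integer division
 * truncates any fractional cycle. */
uint32_t cycles_per_op(uint32_t start, uint32_t stop,
                       uint32_t overhead, uint32_t n_ops)
{
    assert(n_ops > 0);
    uint32_t total = elapsed_cycles(start, stop);
    /* Clamp to zero instead of underflowing if calibration > measurement. */
    uint32_t net = (total > overhead) ? total - overhead : 0;
    return net / n_ops;
}
```

If the calibration run returned, say, 8 counts (an illustrative figure, not one measured in this thread), a raw reading of 28 would correspond to 20 cycles for the write call itself.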

I'm seeing that it takes 28 clock cycles to pass a 16-byte word from MB -> pcore in this way. This seems *way* too long.

To improve performance, the API documents that were generated when I created the pcore suggest removing this line from the xparameters.h file:

#define FINDAVG_SM_0_SG_ENABLE_FSL_ERROR_CHECK

I did that, but it doesn't help.

I didn't do anything special regarding connecting my pcore to the MB. Just added it through the Hardware -> Configure coprocessor... tool in EDK which connects the pcore to MB through an FSL.

Has anyone investigated this and can share any words of wisdom?

thanks, Joel

eejw 19 years ago

Sorry...typo

16-bit word (not "16-byte word")

Newman 19 years ago

You could start the timer, do 4 writes to different locations, then read the elapsed value and divide it by 4 manually.

It would be interesting to see if the value is still 28 clocks. Does the MB have a cache? ChipScope or simulation would highlight what's going on.

Newman

Newman 19 years ago

Also, disassemble the write function to see how efficiently the compiler translated it. I would think that it should be around 1 assembly op.

eejw 19 years ago

Newman,

Thanks for writing back.

I tried:

1. starting the timer
2. writing 8 samples
3. reading the timer
4. dividing the timer result by 8

This gave me an average write time of 20 cc's. So it did lower it some.
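One way to read the two figures (28 cycles for a single write, an average of 20 over 8 writes) is as two points on a line: a fixed cost paid once per measurement plus a cost per write. The sketch below fits that model. It assumes both runs used the same build and that the cost really is linear in the number of writes, neither of which is confirmed in this thread. The struct and function names are hypothetical.

```c
#include <assert.h>

/* Two-point fit of the assumed model: total(n) = fixed + n * per_op. */
struct timing_fit {
    double fixed;   /* cycles that do not scale with the number of writes */
    double per_op;  /* cycles added by each additional write */
};

struct timing_fit fit_two_points(unsigned n1, double total1,
                                 unsigned n2, double total2)
{
    assert(n1 != n2);  /* two distinct sample counts are required */
    struct timing_fit f;
    f.per_op = (total2 - total1) / ((double)n2 - (double)n1);
    f.fixed  = total1 - (double)n1 * f.per_op;
    return f;
}
```

Calling `fit_two_points(1, 28.0, 8, 160.0)` (160 being 8 writes at an average of 20) gives roughly 18.9 cycles per write and about 9.1 cycles of fixed cost. Under this model most of the time is in the write call itself rather than in the timer handling.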

It's interesting...I'm finding that it takes 21 cc's to read/write data from/to external SRAM. I would think that the FSL link should be *much* faster since it's accessing memory on-chip. In fact, the mb_ref_guide states a latency of 2 cc's for using non-blocking "put" and "get" operations for transferring data over FSL. Blocking accesses stall until there is space available on the FSL. What I am doing is a very simple design, and there shouldn't be any blocking, at least not from the program I am implementing. There must be some way to get better performance than what I'm seeing.
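The difference between blocking and non-blocking accesses can be illustrated with a small host-side model of an FSL FIFO. This is only a sketch for reasoning about stalls: it is not the MicroBlaze API, the depth of 16 is an assumption made for the model, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 16  /* assumed depth, for this model only */

struct fsl_model {
    uint32_t data[FIFO_DEPTH];
    unsigned head;   /* index of the oldest word */
    unsigned count;  /* number of words currently stored */
};

/* Non-blocking put: fails immediately if the FIFO is full. */
bool model_nput(struct fsl_model *f, uint32_t word)
{
    if (f->count == FIFO_DEPTH)
        return false;
    f->data[(f->head + f->count) % FIFO_DEPTH] = word;
    f->count++;
    return true;
}

/* Non-blocking get: fails immediately if the FIFO is empty. */
bool model_nget(struct fsl_model *f, uint32_t *word)
{
    if (f->count == 0)
        return false;
    *word = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}

/* Stand-in for the consumer side: removes one word if one is present. */
void drain_one(struct fsl_model *f)
{
    uint32_t discarded;
    (void)model_nget(f, &discarded);
}

/* Blocking put modelled as retrying until space is available.
 * Returns the number of failed attempts, i.e. the stalls in this model.
 * `drain` is called once per stall to let the consumer make progress. */
unsigned model_bput(struct fsl_model *f, uint32_t word,
                    void (*drain)(struct fsl_model *))
{
    unsigned stalls = 0;
    while (!model_nput(f, word)) {
        stalls++;
        drain(f);
    }
    return stalls;
}
```

In this model a blocking put into a FIFO that has room returns with zero stalls, so blocking and non-blocking writes cost the same whenever the consumer keeps up. Stalls only appear once the FIFO is full.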

I'm not implementing cache with this design.

I looked at main.s and couldn't really make much sense of the assembly code. I did searches for put, get, fsl and found nothing. I would be interested to know how the compiler is translating to machine code as well...is there some option for seeing c-code interspersed with related assembly? I set compiler options to no optimization and create symbols for assembly.

Joel

eejw 19 years ago

Just a couple more data points to add regarding performance of FSL...

I created a 2-processor microblaze design connected by FSL links.

With a simple program and using these functions:

    microblaze_bwrite_datafsl(data[index], 0);
    microblaze_bread_datafsl(result, 0);

from mb_interface.h to pass data from one processor to the other and back, and using the counter to measure performance, I found:

takes 9 cc's to write a data sample to FSL (doesn't matter if it's 1 or 99 samples and dividing count result by 99)

takes 10 cc's to read a data sample from FSL

I tried the "non-blocking" functions as well and found the same results.

Göran Bilski 19 years ago

Hi,

The actual FSL PUT or GET instruction takes 2 clock cycles in MicroBlaze v4 and 1 clock cycle in MicroBlaze v5. Where are your code and data located? The macros you are using also read/write the data that you want to use for the FSL link.

Just disassemble the .elf file and look how the macro is implemented.

Göran

Newman 19 years ago

A while back, I stepped through code on the target using the gnu debug tool. There is an option to compile the code and libraries for single step debug that shows the mixed assembly with the C code and one can single step at either level.

As mentioned, one can disassemble the elf, but the jtag gnu debugger is more intuitive if you can suffer the pain to get it working.

Newman
