Could someone tell me NIOS II/MB performance on this benchmark?

I'm trying to get a feel for how the performance of my (so far unoptimized) soft-core stacks up against the established competition, so it would be a great help if people with convenient access to Nios II or MicroBlaze would compile and time this little app:

formatting link
(It's an Othello endgame solver. I didn't write it) and tell me the configuration.

In case anyone cares, mine finished this in 100 seconds in this configuration: 8 KiB I$, 16 KiB D$, 48 MHz clock frequency, async sram. (My Mac finished this in ~ 0.5 sec :-)

Thanks,
Tommy

Reply to
Tommy Thorn

Hi,

I did a quick test with MicroBlaze. With 125 MHz and 64kbyte of local memory, it takes MicroBlaze 6.8s to run the benchmark.

I added two defines to the program:

#define printf xil_printf
#define double float

The first gives a smaller code footprint, since the default printf is bloated and no floating-point values are printed. The second makes the compiler use the MicroBlaze FPU's single-precision floating-point compare and conversion instructions. Neither define changes the program's result, since there are no actual floating-point calculations, just compares and conversions.
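As a rough illustration of what the second define does, here is a minimal C sketch. It is not taken from the benchmark source: `score_beats` is a hypothetical stand-in for the solver's compare-and-convert usage, and the ±64 range is an assumption based on an 8x8 Othello board. It only shows that the macro retypes every later `double` token and that small integer values compare the same way in single precision.

```c
#include <assert.h>
#include <stddef.h>

/* Size of the real type, captured before the macro takes effect. */
static const size_t real_double_size = sizeof(double);

/* From here on the preprocessor rewrites every "double" token to "float". */
#define double float

/* Same expression as above, but it now measures float. */
static const size_t retyped_double_size = sizeof(double);

/* Hypothetical helper: convert two small integer scores and compare them.
   Only a conversion and a comparison are involved, no arithmetic.
   Assumption: scores stay within -64..64 (disc difference on 64 squares),
   which float represents exactly. */
static int score_beats(int score, int best)
{
    assert(score >= -64 && score <= 64);
    assert(best >= -64 && best <= 64);
    double s = (double)score;   /* compiled as float because of the macro */
    double b = (double)best;
    return s > b;
}
```

Because integers of this size are exactly representable in single precision, the comparison gives the same answer as it would with the original type.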

Actually, the program prints a relatively large number of characters. If I remove the printf statement inside the loop, the program executes in 6.1 s. The baud rate affects the execution time if too many prints occur in the timed section.
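To get a feel for how much serial output can cost, here is a small C sketch of the arithmetic. The thread does not state the baud rate, so any value passed in is an assumption. The sketch also assumes 8N1 framing (10 bits on the wire per character) and a blocking, unbuffered UART; both function names are made up for illustration.

```c
#include <assert.h>

/* Seconds needed to send n_chars over a UART.
   Assumption: 8N1 framing, i.e. 1 start + 8 data + 1 stop = 10 bits/char. */
static double uart_seconds(unsigned long n_chars, unsigned long baud)
{
    assert(baud > 0);
    return (double)n_chars * 10.0 / (double)baud;
}

/* Inverse: roughly how many characters fit into a given time budget. */
static unsigned long uart_chars(double seconds, unsigned long baud)
{
    assert(seconds >= 0.0);
    return (unsigned long)(seconds * (double)baud / 10.0);
}
```

For example, `uart_chars(0.7, baud)` estimates how many characters would account for the 0.7 s difference between the 6.8 s and 6.1 s runs at a given baud rate, provided the UART really is the only thing that changed.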

Göran

Reply to
Göran Bilski

Hi,

Actually, the use of floating-point at all seems unnecessary in the program. I think it is a legacy of PC programs, where using double (or float) is not as performance-critical as it is on a CPU without an FPU.

I think it's safe to change double to int in the program without any change in the result. The program would not run faster on a Mac/PC with this change, but it will have a drastic effect on your CPU.

Göran

Reply to
Göran Bilski

Thanks Göran,

that's very impressive. You are right about the double precision and the output. With the patch below applied, I now clock in at 42.5 s. Could you try it again? (I assume your numbers were with floats.)

Using local memory however doesn't make for an apples to apples comparison as this benchmark is memory heavy and local memory (as opposed to cache + slow memory) will give MB a large advantage.
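Since the two systems run at different clock rates, one way to line the figures up is to convert run time into clock cycles. The C sketch below does only that arithmetic. The helper name is made up, and the comparison it enables is rough at best, because the memory systems and the data types differ between the two runs.

```c
#include <assert.h>

/* Clock cycles consumed = run time in seconds * clock frequency in Hz. */
static double cycles(double seconds, double mhz)
{
    assert(seconds >= 0.0 && mhz > 0.0);
    return seconds * mhz * 1e6;
}
```

With the figures reported in this thread, `cycles(42.5, 48.0)` is 2.04e9 cycles for my core and `cycles(6.1, 125.0)` is about 0.76e9 cycles for MicroBlaze, so the gap per clock cycle is roughly 2.7x rather than the 7x that the wall-clock times suggest.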

Thanks,
Tommy

PS: Which FPGA was this on?

Reply to
Tommy Thorn

Forgot the patch. I'm sure Google Groups will mangle it for me.

Tommy

diff --git a/testcases/demos/smith-weill-gunnar-endgame.c b/testcases/demos/smith-weill-gunnar-endgame.c
index 55f02d5..55a92db 100644
--- a/testcases/demos/smith-weill-gunnar-endgame.c
+++ b/testcases/demos/smith-weill-gunnar-endgame.c
@@ -168,2 +168,4 @@ additional 1.5 or so.
 
+#define double long
+
 /* #define WINDOWS_TIMING */
@@ -989,3 +991,3 @@ int main( void ){
   }
-  printf("%3d (emp=%2d wc=%2d bc=%2d) %s\n",
+  if (0) printf("%3d (emp=%2d wc=%2d bc=%2d) %s\n",
          val, emp,wc,bc, bds[i] );

Reply to
Tommy Thorn

Hi Tommy,

It depends on how you want to benchmark: using only features that your CPU has (that is, without a large local memory)? The code footprint with the optimized printf is around 50k including data. Using a processor with an 8 kbyte icache and a 16 kbyte dcache on an application that is just twice that size doesn't seem valid. Cache efficiency is more likely to show when the code is at least 10-50x the size of the cache. Using a cache also brings the external memory type and the memory controller into the benchmark numbers, and I guess those are not apples to apples between you and me either: fast async SRAM as external memory is not the same as SDRAM.
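The point about cache plus external memory versus local memory can be sketched with the usual average memory access time model (hit time plus miss rate times miss penalty). The C function below is only that textbook formula. No hit rates or miss penalties were measured in this thread, so any numbers fed into it are hypothetical.

```c
#include <assert.h>

/* Average memory access time in cycles:
   AMAT = hit time + miss rate * miss penalty.
   Local memory can be modelled as miss_rate = 0. */
static double amat_cycles(double hit_cycles, double miss_rate,
                          double miss_penalty_cycles)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    assert(hit_cycles >= 0.0 && miss_penalty_cycles >= 0.0);
    return hit_cycles + miss_rate * miss_penalty_cycles;
}
```

The miss penalty term is where the external memory type and the memory controller enter the result, which is why async SRAM and SDRAM back ends would give different benchmark numbers for the same cache.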

Yes, my results were with float instead of double. I don't think you need to set the type to long, since the values seem to be well within a byte.

I used the board that was connected to my laptop, an ML505 (Virtex-5, slowest speed grade), and I didn't push the clock frequency.

Göran

Reply to
Göran Bilski
