Could someone tell me NIOS II/MB performance on this benchmark?

- Tommy Thorn
  

Tue, Apr 29, 2008 7:13 AM

I'm trying to get a feel for how the performance of my (so far unoptimized) soft-core stacks up against the established competition, so it would be a great help if people with convenient access to Nios II or MicroBlaze would compile and time this little app:

formatting link

(It's an Othello endgame solver. I didn't write it) and tell me the configuration.

In case anyone cares, mine finished this in 100 seconds in this configuration: 8 KiB I$, 16 KiB D$, 48 MHz clock frequency, async sram. (My Mac finished this in ~ 0.5 sec :-)

Thanks,
Tommy

- Göran Bilski
  

Tue, Apr 29, 2008 11:29 AM

Hi,

I did a quick test with MicroBlaze. At 125 MHz with 64 kbyte of local memory, MicroBlaze takes 6.8 s to run the benchmark.

I added two defines in the program:

    #define printf xil_printf
    #define double float

The first define gives a smaller code footprint, since the default printf is bloated and no floating-point values are printed. The second define makes the compiler use the MicroBlaze FPU single-precision floating-point compare and conversion instructions. Neither define changes the program result, since there are no actual floating-point calculations, just compares and conversions.

Actually the program prints out a relatively large number of characters, and if I remove the printf statement that is part of the loop, the program executes in 6.1 s. The baud rate will affect the execution speed if there are too many prints in the timed section.

Göran

- Göran Bilski
  

Tue, Apr 29, 2008 12:31 PM

Hi,

Actually, using floating-point at all seems unnecessary in this program. I think it is a legacy of the program's PC origins, where using double (or float) is not performance critical the way it is on a CPU without an FPU.

I think it's safe to change double to int in the program without any change in the result. The program would not run faster on a Mac/PC with this change, but it will have a drastic effect on your CPU.

Göran

- Tommy Thorn
  

Tue, Apr 29, 2008 5:02 PM

Thanks Göran,

that's very impressive. You are right about both the double precision and the output. With the patch below applied, I now clock in at 42.5 s. Could you try it again? (I assume your numbers were with floats.)

Using local memory however doesn't make for an apples to apples comparison as this benchmark is memory heavy and local memory (as opposed to cache + slow memory) will give MB a large advantage.

Thanks,
Tommy

PS: Which FPGA was this on?


- Tommy Thorn
  

Tue, Apr 29, 2008 5:13 PM

Forgot the patch. I'm sure Google Groups will mangle it for me.

Tommy

diff --git a/testcases/demos/smith-weill-gunnar-endgame.c b/testcases/demos/smith-weill-gunnar-endgame.c
index 55f02d5..55a92db 100644
--- a/testcases/demos/smith-weill-gunnar-endgame.c
+++ b/testcases/demos/smith-weill-gunnar-endgame.c
@@ -168,2 +168,4 @@ additional 1.5 or so.
+#define double long
 /* #define WINDOWS_TIMING */
@@ -989,3 +991,3 @@ int main( void ){
 }
- printf("%3d (emp=%2d wc=%2d bc=%2d) %s\n",
+ if (0) printf("%3d (emp=%2d wc=%2d bc=%2d) %s\n",
 val, emp,wc,bc, bds[i] );

- Göran Bilski
  

Wed, Apr 30, 2008 8:46 AM

Hi Tommy,

It depends on how you want to benchmark: using only features that your CPU has (that is, without a large local memory)? The code footprint with the optimized printf is around 50k including data. Using a processor with an 8 kbyte icache and a 16 kbyte dcache on an application that is just twice that size doesn't seem valid. Cache efficiency is more likely to show when there is at least a 10-50x factor between cache size and code size. Using caches also brings the external memory type and the memory controller into the benchmark numbers, and I guess those are not apples to apples between you and me: fast async SRAM as external memory is not the same as SDRAM.

Yes, my results were with float instead of double. I don't think you need to set the type to long, since the values seem to be well within a byte.

I used the board that was connected to my laptop, an ML505 (Virtex-5, slowest speed grade), and I didn't push the clock frequency.

Göran
