A new benchmark suitable for small systems: stdcbench

For benchmarking C implementations, there are a few benchmarks, but
they all have their problems. Many benchmarks have memory requirements
that are far too high or need functionality not necessarily available.
Some are quite one-sided in what they measure (e.g. Whetstone,
Dhrystone, Coremark).

So, I decided to write a new benchmark, stdcbench. I wanted it to be
suitable for small systems (4 KB of RAM, about 32 KB of Flash). There is
a trade-off here, since all the data and code will fit easily into
caches on bigger systems, but IMO it is worth it.

The current version consists of 2 modules, which on typical systems
should contribute about equally to the score.

c90base:
It benchmarks a commonly-implemented subset of what the standard
requires for freestanding implementations of C90. It consists of three
submodules:
1) Huffman/RLE decompression (adapted from real-world code)
2) Integer matrix multiplication (synthetic)
3) Insertion sort (adapted from real-world code)
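
To give a feel for the kind of kernel two of these submodules exercise, here is a minimal C90-style sketch of an insertion sort and an integer matrix multiplication. It is not taken from the stdcbench sources; the function names, the flat row-major layout and the use of plain int are assumptions made only for illustration.

```c
#include <stddef.h>

/* Illustrative only: these are not the stdcbench sources. */

/* Sort n ints in place, ascending. */
void insertion_sort(int *a, size_t n)
{
    size_t i, j;
    int key;

    for (i = 1; i < n; i++) {
        key = a[i];
        /* Shift larger elements one slot right to open a gap for key. */
        for (j = i; j > 0 && a[j - 1] > key; j--)
            a[j] = a[j - 1];
        a[j] = key;
    }
}

/* c = a * b for n x n matrices stored row-major in flat arrays. */
void int_matmul(const int *a, const int *b, int *c, size_t n)
{
    size_t i, j, k;
    int sum;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0;
            for (k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

Both routines need only integer arithmetic and array indexing, so they fit a freestanding C90 implementation without any library support.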

c90lib:
Benchmarks the standard library.
It consists of two submodules:
1) Computation of lnlc-width (adapted from real-world code).
2) Peephole optimizer (simplified from real-world code).


C99 features (e.g. bool, restrict) are used where available, but are not
required.
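
One common way to use C99 features only where available is to test __STDC_VERSION__ in the preprocessor. The sketch below shows that pattern; the names BENCH_RESTRICT, bench_bool, bench_copy and bench_equal are hypothetical, and stdcbench itself may do this differently.

```c
/* Assumed macro and type names; stdcbench's actual spelling may differ. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#include <stdbool.h>
#define BENCH_RESTRICT restrict
typedef bool bench_bool;
#else
/* C90 fallback: no restrict qualifier, and a small integer as boolean. */
#define BENCH_RESTRICT
typedef unsigned char bench_bool;
#endif

/* Copy n bytes; under C99 the qualifier promises dst and src do not overlap. */
void bench_copy(unsigned char *BENCH_RESTRICT dst,
                const unsigned char *BENCH_RESTRICT src,
                unsigned int n)
{
    while (n--)
        *dst++ = *src++;
}

/* Return nonzero if the first n bytes of a and b are equal. */
bench_bool bench_equal(const unsigned char *a, const unsigned char *b,
                       unsigned int n)
{
    while (n--)
        if (*a++ != *b++)
            return 0;
    return 1;
}
```

The same source then builds with a C90-only compiler and still lets a C99 compiler exploit the extra information.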

So far, stdcbench seems to achieve its goals: benchmark a wide range of
important standard C functionality, without giving too much emphasis to
any particular aspect.

Scores are reported for each module and as a total.

Example output from an i7-7500U-based system (benchmark compiled with GCC
7.2.0 using -O2 -march=native):

stdcbench 0.2
stdcbench c90base score: 7827
stdcbench c90lib score: 6548
stdcbench final score: 14375

Example output from an STM8AF5288 at 16 MHz (benchmark compiled with SDCC
3.6.9 using -mstm8 --opt-code-speed --max-allocs-per-node 10000):

stdcbench 0.2
stdcbench c90base score: 6
stdcbench c90lib score: 6
stdcbench final score: 12

Future plans for the benchmark:

1) Come up with module(s) for floating-point performance. What matters
for embedded systems? How should correctness be verified for floating-point?
2) Find out why the c90lib module hangs on C8051F120 (possible compiler
bug).
3) State run/reporting rules.
4) Benchmark a few interesting systems


I am looking forward to comments from you.

http://stdcbench.org/

Philipp

Re: A new benchmark suitable for small systems: stdcbench
On Wednesday, 7 February 2018, 17:50:05 UTC+2, Philipp Klaus Krause wrote:

Nice.
One observation though: the STM8 score seems too low. I mean, it would
be difficult to compare systems that have scores like that (11, 12, 15, etc.).
I know the STM8 and I know it's quite powerful. I even use these (and some
AVRs) at a much lower frequency (5 MHz).

What I'm trying to say is that the score for such a system (STM8/AVR8/16MHz)
should be in a 100-1000 range.

So please scale the scores up! (But take care that the least significant digits are not noise!)

I don't care if the PC scores would be millions...

Re: A new benchmark suitable for small systems: stdcbench
On 09.02.2018 at 07:36, snipped-for-privacy@gmail.com wrote:

I agree. The previous resolution often was insufficient to even see the
effect of compiler optimizations. In version 0.3, I did a bit of
rebalancing and rescaling of scores.

Output for the 16 MHz STM8AF5288 (compiled via sdcc -mstm8
--opt-code-speed --max-allocs-per-node 10000):

stdcbench 0.3
stdcbench c90base score: 109
stdcbench c90lib score: 88
stdcbench final score: 197

Output for the 16 MHz STM8AF5288 (compiled via sdcc -mstm8
--opt-code-size --max-allocs-per-node 10000):

stdcbench 0.3
stdcbench c90base score: 107
stdcbench c90lib score: 87
stdcbench final score: 194
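
The resolution argument can be made concrete with a little arithmetic. An integer score cannot show a change smaller than one point, so the smallest visible relative change is roughly 1/score. The sketch below (function names are hypothetical) works in tenths of a percent to stay in integer arithmetic, and truncates rather than rounds.

```c
/* Smallest visible change for an integer score, in tenths of a percent.
   Assumes score > 0. */
unsigned long step_permille(unsigned long score)
{
    return 1000UL / score;
}

/* Relative drop from score a to score b, in tenths of a percent.
   Assumes a >= b and a > 0. */
unsigned long drop_permille(unsigned long a, unsigned long b)
{
    return (a - b) * 1000UL / a;
}
```

With the version 0.2 STM8 total of 12, step_permille gives 83, i.e. one point is about 8.3% of the score. With the version 0.3 total of 197 it gives 5, i.e. about 0.5%, which is fine enough to show the roughly 1.5% gap (drop_permille(197, 194) gives 15) between the --opt-code-speed and --opt-code-size runs.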

Output for a 98 MHz C8051F120 (compiled via sdcc -mmcs51 --model-large
--stack-auto --opt-code-size --max-allocs-per-node 10000):

stdcbench 0.3
stdcbench c90base score: 96
stdcbench final score: 96

Philipp

P.S.: The reason the c90lib module is not enabled for the C8051F120 is
that it runs out of stack space.

Re: A new benchmark suitable for small systems: stdcbench

Was that really supposed to say 98 MHz?

Can you give the code size for the different compiler outputs?

Could you do the AVR8 and the MSP430 with gcc, if you happen to have
those available?  Would the ARM Cortex M0 be getting outside the
intended range of this benchmark?

Thanks!

Re: A new benchmark suitable for small systems: stdcbench
On 2018-02-09 Paul Rubin wrote in comp.arch.embedded:

No, I think he meant to say 98 MHz:
https://www.silabs.com/products/mcu/8-bit/c8051f12x-f13x/device.c8051f120

Yes, those 8051's have progressed a bit since the 12MHz, 12-cycle instruction
devices of some 25 years ago. ;-)

--  
Stef    (remove caps, dashes and .invalid from e-mail address to reply by mail)

Many hands make light work.
Re: A new benchmark suitable for small systems: stdcbench
On 09.02.2018 at 22:28, Paul Rubin wrote:

Yes. 24 MHz from the internal oscillator, multiplied by 4 via the PLL.
The C8051 is rated at 100 MHz.


I'll report exact numbers when I have a bigger range of results. But for
now, it seems that code size on the MCS-51 is about twice that of the STM8
when using the same features (i.e. c90lib module enabled or disabled for
both targets).


The M0 definitely falls into the intended range. However, I don't have
any around at the moment.
I intend to do a few more benchmarks with what I have, probably next
weekend or during the week after:

* STM8AF5288 @ 16 MHz using SDCC 3.5.0, 3.6.0, 3.7.0, some IAR and
Cosmic compilers and various optimization settings
* C8051F120 @ 98 MHz using SDCC 3.5.0, 3.6.0, 3.7.0 and various
optimization settings
* STM8S208 @ 24 MHz
* Z80 @ 3.58 MHz (in the Sega Master System II or Sega Mark III)
* CY7C68013A @ 48 MHz (an 8051 derivative from Cypress)

I also intend to get a few more boards to compare (at least Cortex M0
and RISC-V).

Philipp

Re: A new benchmark suitable for small systems: stdcbench
On 10.2.18 19:47, Philipp Klaus Krause wrote:

4 * 24 MHz = 96 MHz.

--  

-TV


Re: A new benchmark suitable for small systems: stdcbench
On 10.02.2018 at 20:32, Tauno Voipio wrote:

Yes. Sorry for the mistake. The C8051 internal oscillator frequency is
24.5 MHz.
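
For completeness, the clock arithmetic behind the 98 MHz figure is just oscillator frequency times PLL multiplier. A tiny sketch, working in kHz so that 24.5 MHz stays an integer (the function name is hypothetical):

```c
/* PLL output in kHz for a given oscillator frequency and multiplier. */
unsigned long pll_khz(unsigned long osc_khz, unsigned int mult)
{
    return osc_khz * mult;
}
```

pll_khz(24500, 4) yields 98000 kHz, i.e. 98 MHz, whereas the earlier 24 MHz figure would have given 96 MHz.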

Philipp

Re: A new benchmark suitable for small systems: stdcbench
On 09.02.2018 at 22:28, Paul Rubin wrote:

Here's a first result from an ARM Cortex-M0 (the STM32F051R8 at 48 MHz)
compiled using GCC 6.3.1 with -O2 and using newlib-nano:

stdcbench 0.3
stdcbench c90base score: 1141
stdcbench c90lib score: 651
stdcbench final score: 1792

Per clock cycle, this Cortex-M0 with GCC gets about twice the score
compared to an STM8 with IAR.
The large difference between the c90base and the c90lib score is
interesting. I guess newlib-nano is optimized for code size at the expense of
speed (even though it is surprising to see that much of a difference vs.
the situation for the STM8).

Philipp

Re: A new benchmark suitable for small systems: stdcbench
On Tuesday, 20 February 2018, 15:00:41 UTC+2, Philipp Klaus Krause wrote:

You can try an -Os variant to see what difference you get.
It would in fact be quite interesting. I always thought that
the speed gain of -O2 over -Os is not large, and so always use -Os ...

Can you do some (8bit) AVR tests?  


Re: A new benchmark suitable for small systems: stdcbench
On 20.02.2018 at 14:13, snipped-for-privacy@gmail.com wrote:

Not much:

stdcbench 0.3
stdcbench c90base score: 1047
stdcbench c90lib score: 607
stdcbench final score: 1654


Don't have one yet, but intend to do a Cortex-M3 and maybe a Z80 test
later this week. At some point, I should also put a list of results on
http://stdcbench.org/

Philipp

Re: A new benchmark suitable for small systems: stdcbench
On 20.02.2018 at 14:38, Philipp Klaus Krause wrote:

Here is a Cortex-M4 (the STM32F302R8 at 64 MHz - it could do 72 MHz, but
not with the internal oscillator, and my board doesn't have a crystal),
using GCC -O2 -mcpu=cortex-m4 -mthumb with newlib-nano:

stdcbench 0.3
stdcbench c90base score: 1693
stdcbench c90lib score: 864
stdcbench final score: 2557

Looks only 15% faster per clock cycle than the Cortex-M0.
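
A per-clock comparison like this can be reproduced by dividing each score by the clock frequency. The sketch below does that in integer arithmetic; the function names are hypothetical and the scaling by 100 is just a convenience.

```c
/* Score per MHz, scaled by 100 to stay in integer arithmetic (truncated). */
unsigned long per_mhz_x100(unsigned long score, unsigned long mhz)
{
    return score * 100UL / mhz;
}

/* Per-clock gain of system A over system B in percent, rounded to nearest.
   Assumes all arguments are nonzero and the products fit in unsigned long. */
long gain_percent(unsigned long a_score, unsigned long a_mhz,
                  unsigned long b_score, unsigned long b_mhz)
{
    unsigned long num = a_score * b_mhz * 100UL;
    unsigned long den = b_score * a_mhz;

    return (long)((num + den / 2UL) / den) - 100L;
}
```

Applied to the final scores posted in this thread, gain_percent(2557, 64, 1792, 48) comes out at about 7% against the -O2 Cortex-M0 run, and gain_percent(2557, 64, 1654, 48) at about 16% against the -Os run, so the exact figure depends on which Cortex-M0 result is used as the baseline.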

Philipp

Re: A new benchmark suitable for small systems: stdcbench


According to the "Definitive Guide to ARM CortexM0-M0+", the performance
figures of various Cortex-M cores are:

Features                 Cortex-M0  Cortex-M0+  Cortex-M3  Cortex-M4  Cortex-M7
Dhrystone 2.1 (per MHz)  0.9        0.95        1.25       1.25       2.14
CoreMark 1.0 (per MHz)   2.33       2.46        3.34       3.40       5.01

So maybe there is something that the M4 doesn't like too much or gcc doesn't
optimize very well for the M4 (with the option used).

Bye Jack

Re: A new benchmark suitable for small systems: stdcbench
On 20.02.2018 at 16:14, Jack wrote:

I probably won't find time for testing until later this week, but for
now I suspect it is the flash: The flash in the STM32F051 (Cortex-M0) at
48 MHz needs 1 wait state, while the flash in the STM32F302 (Cortex-M4)
at 64 MHz needs 2 wait states. Both devices have a prefetch buffer that
is supposed to somewhat reduce the effect of the wait states on program
execution.

Philipp

Re: A new benchmark suitable for small systems: stdcbench
On 20.02.2018 at 15:27, Philipp Klaus Krause wrote:


And from the opposite end of the performance spectrum, a Cypress EZ-USB
FX2LP at 48 MHz, compiled using SDCC 3.7.0 RC2, sdcc -mmcs51
--model-large --stack-auto --code-loc 0x0000 --code-size 0x3500
--xram-loc 0x3500 --xram-size 0x0b00 --opt-code-speed
--max-allocs-per-node 10000:

stdcbench 0.3
stdcbench c90base score: 12
stdcbench final score: 12

Philipp

Re: A new benchmark suitable for small systems: stdcbench
On Tuesday, 20 February 2018, 21:58:12 UTC+2, Philipp Klaus Krause wrote:

Is it 12 clk/instruction or something? In that case 48 MHz is misleading.
Same for PICs, which are 4 clks/instr. Microchip usually gives another
number, the MIPS, for example 64 MHz/16 MIPS.

So the Cypress chip is 48 MHz/4 MIPS maybe?

When I worked with a 12 clks/instr 8051 I always talked about it as a 1 MIPS CPU.


Re: A new benchmark suitable for small systems: stdcbench
On 21.02.2018 at 08:37, snipped-for-privacy@gmail.com wrote:

The Cypress EZ-USB can execute most 1-byte instructions in 4 clock
cycles. Most 2-byte instructions take 8 clock cycles. Branch
instructions tend to take 16 clock cycles. A few instructions take even
longer.
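
These cycle counts translate into an instruction rate by dividing the clock frequency by the clocks per instruction. The sketch below shows the arithmetic; the function names are hypothetical, and any instruction mix passed to kips_mix is an assumption of the caller, not a measurement.

```c
/* Instruction rate in thousands of instructions per second (kIPS).
   Assumes clocks_per_instr > 0. */
unsigned long kips(unsigned long clock_khz, unsigned int clocks_per_instr)
{
    return clock_khz / clocks_per_instr;
}

/* Average rate for an instruction mix given in percent, using the
   4/8/16-clock costs quoted above for 1-byte, 2-byte and branch
   instructions. The three percentages are expected to sum to 100. */
unsigned long kips_mix(unsigned long clock_khz,
                       unsigned int pct_1byte,
                       unsigned int pct_2byte,
                       unsigned int pct_branch)
{
    /* Weighted clocks per instruction, scaled by 100. */
    unsigned long clocks_x100 = 4UL * pct_1byte
                              + 8UL * pct_2byte
                              + 16UL * pct_branch;

    return clock_khz * 100UL / clocks_x100;
}
```

At 48 MHz, kips(48000, 4) gives 12000, so at most about 12 MIPS if only 1-byte instructions were executed. A made-up mix of 50% 1-byte, 30% 2-byte and 20% branch instructions, kips_mix(48000, 50, 30, 20), gives 6315, i.e. roughly 6.3 MIPS.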

I gave the 48 MHz figure mostly for reproduction of results (though the
port can be found in the examples for stdcbench now anyway).

Philipp

Re: A new benchmark suitable for small systems: stdcbench
On 21/02/18 08:37, snipped-for-privacy@gmail.com wrote:

It is common for modern 8051 implementations to have 4 oscillator clocks
per instruction clock.  The original 8051 had 12 oscillator clocks per
instruction clock.

Most single byte register-to-register operations take 1 instruction
clock.  Memory access adds to that, as do multi-byte instructions,
jumps, calls, etc.




Re: A new benchmark suitable for small systems: stdcbench
On 20.2.18 15:00, Philipp Klaus Krause wrote:

GCC -Os does wonders compared to the standard compiled newlib.

--  

-TV


Re: A new benchmark suitable for small systems: stdcbench
On Tuesday, 20 February 2018, 21:37:33 UTC+2, Tauno Voipio wrote:

Speaking of that, we always take the newlib/newlib-nano/whatever-lib for granted.
An interesting possibility would be to compile the same lib bench without lib calls
(of course, another version of the program doing the exact same thing but with no lib calls),
just to see where we stand. It would be quite a good indicator of the
performance of the lib.
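
One way such a comparison could be set up is to route the library calls through macros and switch them to hand-written equivalents at build time. The sketch below is only an illustration of that idea: BENCH_NO_LIB, plain_strlen, plain_memcpy and workload are hypothetical names, and the workload is a toy, not the c90lib module.

```c
#include <stddef.h>
#include <string.h>

/* Hand-written stand-ins for two library functions (hypothetical names). */
size_t plain_strlen(const char *s)
{
    const char *p = s;

    while (*p)
        p++;
    return (size_t)(p - s);
}

void *plain_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Build with -DBENCH_NO_LIB to bypass the C library. */
#ifdef BENCH_NO_LIB
#define BENCH_STRLEN plain_strlen
#define BENCH_MEMCPY plain_memcpy
#else
#define BENCH_STRLEN strlen
#define BENCH_MEMCPY memcpy
#endif

/* Toy workload that is identical in both builds apart from the macros.
   buf must have room for text including its terminating NUL. */
unsigned long workload(const char *text, char *buf, unsigned int rounds)
{
    unsigned long sum = 0;
    size_t len;

    while (rounds--) {
        len = BENCH_STRLEN(text);
        BENCH_MEMCPY(buf, text, len + 1);
        sum += (unsigned long)len + (unsigned char)buf[0];
    }
    return sum;
}
```

Timing workload in both builds would then give a rough idea of how much the library contributes. One caveat: some compilers recognise such loops and substitute the library call anyway, so options like GCC's -fno-builtin and -fno-tree-loop-distribute-patterns may be needed to keep the two variants distinct.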
