A new benchmark suitable for small systems: stdcbench

Philipp Klaus Krause 8 years ago

For benchmarking C implementations, there are a few benchmarks, but they all have their problems. Many have memory requirements that are far too high or need functionality that is not necessarily available. Some are quite one-sided in what they measure (e.g. Whetstone, Dhrystone, Coremark).

So I decided to write a new benchmark, stdcbench. I wanted it to be suitable for small systems (4 KB of RAM, about 32 KB of Flash). There is a trade-off here, since all the data and code will fit easily into caches on bigger systems, but IMO it is worth it.

The current version consists of 2 modules, which on typical systems should contribute about equally to the score.

c90base: It benchmarks a commonly-implemented subset of what the standard requires for freestanding implementations of C90. It consists of three submodules:

1) Huffman/RLE decompression (adapted from real-world code)
2) Integer matrix multiplication (synthetic)
3) Insertion sort (adapted from real-world code)

c90lib: Benchmarks the standard library. It consists of two submodules:

1) Computation of lnlc-width (adapted from real-world code).
2) Peephole optimizer (simplified from real-world code).

C99 features (e.g. bool, restrict) are used where available, but not necessary.

So far, stdcbench seems to achieve its goals: benchmark a wide range of important standard C functionality without giving too much emphasis to any particular aspect.

Scores are reported for each module and as total.

Example output from an i7-7500U-based system (benchmark compiled with GCC 7.2.0 using -O2 -march=native):

stdcbench 0.2
stdcbench c90base score: 7827
stdcbench c90lib score: 6548
stdcbench final score: 14375

Example output from an STM8AF5288 at 16 MHz (benchmark compiled with SDCC 3.6.9 using -mstm8 --opt-code-speed --max-allocs-per-node 10000):

stdcbench 0.2
stdcbench c90base score: 6
stdcbench c90lib score: 6
stdcbench final score: 12
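Since scores are reported per module and as a total, results collected from several boards can be checked mechanically. The following Python sketch parses output in the format shown above and confirms that the final score equals the sum of the module scores. The helper name `parse_stdcbench` and the regular expression are illustrative assumptions, not part of stdcbench.

```python
import re

# Sample output copied from the i7-7500U run above (stdcbench 0.2).
SAMPLE = """\
stdcbench 0.2
stdcbench c90base score: 7827
stdcbench c90lib score: 6548
stdcbench final score: 14375
"""

def parse_stdcbench(text):
    """Return a dict mapping score names ('c90base', 'final', ...) to ints.

    Hypothetical helper: it assumes entries of the form
    'stdcbench <name> score: <integer>' as seen in this thread.
    """
    scores = {}
    for name, value in re.findall(r"stdcbench (\w+) score: (\d+)", text):
        scores[name] = int(value)
    return scores

scores = parse_stdcbench(SAMPLE)
modules = {k: v for k, v in scores.items() if k != "final"}
# In the outputs posted here, the final score is the sum of the module scores.
print(scores["final"] == sum(modules.values()))
```

The same check applies to runs where a module is disabled, since the total then equals the single remaining module score.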

Future plans for the benchmark:

1) Come up with module(s) for floating-point performance. What matters for embedded systems? How should correctness be verified for floating-point?
2) Find out why the c90lib module hangs on C8051F120 (possible compiler bug).
3) State run/reporting rules.
4) Benchmark a few interesting systems

I am looking forward to comments from you.

Philipp

raimond.dragomir 8 years ago

On Wednesday, 7 February 2018, 17:50:05 UTC+2, Philipp Klaus Krause wrote:

Nice. One observation though: the STM8 score seems too low. I mean, it would be difficult to compare systems that have scores like that (11, 12, 15 etc.). I know the STM8 and I know it's quite powerful. I even use these (and some AVRs) at a much lower frequency (5 MHz).

What I'm trying to say is that the score for such a system (STM8/AVR8/16MHz) should be in a 100-1000 range.

So please scale the scores up! (But take care that the least significant digits are not noise!)

I don't care if the PC scores would be millions...

Philipp Klaus Krause 8 years ago

On 09.02.2018 at 07:36, snipped-for-privacy@gmail.com wrote:

I agree. The previous resolution often was insufficient to even see the effect of compiler optimizations. In version 0.3, I did a bit of rebalancing and rescaling of scores.

Output for the 16 MHz STM8AF5288 (compiled via sdcc -mstm8 --opt-code-speed --max-allocs-per-node 10000):

stdcbench 0.3
stdcbench c90base score: 109
stdcbench c90lib score: 88
stdcbench final score: 197

Output for the 16 MHz STM8AF5288 (compiled via sdcc -mstm8 --opt-code-size --max-allocs-per-node 10000):

stdcbench 0.3
stdcbench c90base score: 107
stdcbench c90lib score: 87
stdcbench final score: 194

Output for a 98 MHz C8051F120 (compiled via sdcc -mmcs51 --model-large --stack-auto --opt-code-size --max-allocs-per-node 10000):

stdcbench 0.3
stdcbench c90base score: 96
stdcbench final score: 96

Philipp

P.S.: The reason the c90lib module is not enabled for the C8051F120 is that it runs out of stack space.

Paul Rubin 8 years ago

Was that really supposed to say 98 MHz?

Can you say the code size for the different compiler outputs?

Could you do the AVR8 and the MSP430 with gcc, if you happen to have those available? Would the ARM Cortex M0 be getting outside the intended range of this benchmark?

Thanks!

Stef 8 years ago

No, I think he meant to say 98 MHz: [link]

Yes, those 8051's have progressed a bit since the 12MHz, 12-cycle instruction devices of some 25 years ago. ;-)

Stef (remove caps, dashes and .invalid from e-mail address to reply by mail) Many hands make light work. -- John Heywood

Philipp Klaus Krause 8 years ago

Yes. 24 MHz from the internal oscillator, multiplied by 4 via the PLL. The C8051 is rated at 100 MHz.

I'll report exact numbers when I have a bigger range of results. But for now, it seems that code size on the MCS-51 is about twice that of the STM8 when using the same features (i.e. c90lib module enabled or disabled for both targets).

The M0 definitely falls into the intended range. However, I don't have any around at the moment. I intend to do a few more benchmarks with what I have, probably next weekend or during the week after:

STM8AF5288 @ 16 MHz using SDCC 3.5.0, 3.6.0, 3.7.0, some IAR and Cosmic compilers and various optimization settings
C8051F120 @ 98 MHz using SDCC 3.5.0, 3.6.0, 3.7.0 and various optimization settings
STM8S208 @ 24 MHz
Z80 @ 3.58 MHz (in the Sega Master System II or Sega Mark III)
CYC68013A @ 48 MHz (an 8051 derivative from Cypress)

I also intend to get a few more boards to compare (at least Cortex M0 and RISC-V).

Philipp

Tauno Voipio 8 years ago

4 * 24 MHz = 96 MHz.

-TV

Philipp Klaus Krause 8 years ago

Yes. Sorry for the mistake. The C8051 internal oscillator frequency is 24.5 MHz.

Philipp

raimond.dragomir 8 years ago

On Friday, 9 February 2018, 21:48:34 UTC+2, Philipp Klaus Krause wrote:

Now it's better :) This kind of benchmark is very interesting. Without it you can only have a "feeling" about the power of an architecture, and only if you have a lot of experience with it. And of course it depends a lot on the application.

For example, it seems that the STM8S at 16 MHz performs better than the C8051 at 100 MHz. This is not a surprise for me. I have worked a long time with the 8051 and I know very well what it is capable of. For example, an 8051 is almost unbeatable for small control applications of under 8K program size and max. 256 bytes of internal RAM. But if you cross this line and your program gets bigger, and especially if you need more RAM and start to use the XRAM, the efficiency goes down rapidly. The 8051 just doesn't scale well in the addressing range. In the 8K/256 range it is probably the best 8-bitter, in the 64K/64K range it is probably the worst :) Here the program size is not a direct factor; it usually depends on how much RAM you need. My experience is that you can "grow" your program up to 8K and still use only the internal 256 bytes of RAM.

But of course, this benchmark is not supposed to reveal this kind of thing...

Philipp Klaus Krause 8 years ago

Here is a small comparison of STM8 results with various current compilers (all done on the STM8AF5288).

SDCC 3.7.0 RC1 with optimization for code size (-mstm8 --opt-code-size --max-allocs-per-node 100000), binary size 20953 B:

stdcbench 0.3
stdcbench c90base score: 106
stdcbench c90lib score: 87
stdcbench final score: 193

SDCC 3.7.0 RC1 with optimization for code speed (-mstm8 --opt-code-speed --max-allocs-per-node 100000), binary size 21083 B:

stdcbench 0.3
stdcbench c90base score: 109
stdcbench c90lib score: 88
stdcbench final score: 197

IAR 3.10.1.201 with optimization for code size, binary size 24288 B:

stdcbench 0.3
stdcbench c90base score: 117
stdcbench c90lib score: 71
stdcbench final score: 188

IAR 3.10.1.201 with optimization for code speed, binary size 27268 B:

stdcbench 0.3
stdcbench c90base score: 197
stdcbench c90lib score: 100
stdcbench final score: 297

Cosmic 4.4.4 with optimization for code size:

stdcbench 0.3
stdcbench c90base score: 116
stdcbench final score: 116

Cosmic 4.4.4 with optimization for code speed:

stdcbench 0.3
stdcbench c90base score: 123
stdcbench final score: 123

For Cosmic 4.4.4, the c90lib module was disabled, since Cosmic 4.4.4 doesn't provide qsort() in the standard library. The Raisonance compiler was not included in the comparison due to difficulties getting an evaluation license.

These results are quite interesting when compared to Dhrystone and Coremark (see [link]).

In particular, while SDCC is ahead in Dhrystone and Coremark scores, it apparently falls behind in stdcbench scores. On the other hand, SDCC seems to do better in code size for stdcbench.
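The speed versus size trade-off in these STM8 results can be tabulated relative to one baseline. This is a minimal Python sketch using only the final scores and binary sizes posted above; the helper `relative_to` and the choice of SDCC with size optimization as the baseline are illustrative assumptions.

```python
# (compiler/setting, final score, binary size in bytes) from the post above.
# Cosmic is left out: no binary size was given and c90lib was disabled.
rows = [
    ("SDCC 3.7.0 RC1, size", 193, 20953),
    ("SDCC 3.7.0 RC1, speed", 197, 21083),
    ("IAR 3.10.1.201, size", 188, 24288),
    ("IAR 3.10.1.201, speed", 297, 27268),
]

def relative_to(rows, baseline_name):
    """Map each entry to (score ratio, size ratio) against the baseline."""
    base = next(r for r in rows if r[0] == baseline_name)
    return {name: (score / base[1], size / base[2])
            for name, score, size in rows}

rel = relative_to(rows, "SDCC 3.7.0 RC1, size")
for name, (rel_score, rel_size) in rel.items():
    print(f"{name:24s} score x{rel_score:.2f}  size x{rel_size:.2f}")
```

With these inputs, IAR with speed optimization scores roughly 1.5 times the SDCC baseline while producing a binary roughly 1.3 times as large.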

Philipp

Paul Rubin 8 years ago

I've always had the impression the 8051 was not well suited for commonly used C coding styles and datatypes. It was always intended to be programmed in assembler, has good support for single-bit operations but not much for 16 bit (usual C int type), etc.

The STM8 is interesting. I found out about it fairly recently and got some of the small STM8S103F3 boards for various purposes. In small cheap 8 bitters it's often quite attractive compared with AVR and the like. I'm not sure what else is out there that's comparable, except maybe PIC. I see from the colecovision site that Philipp Klaus Krause did most of the SDCC back end, so thanks Philipp!

Grant Edwards 8 years ago

Are bigger scores better or worse?

Grant

Philipp Klaus Krause 8 years ago

Bigger scores are better.

Philipp

Philipp Klaus Krause 8 years ago

Here's a first result from an ARM Cortex-M0 (the STM32F051R8 at 48 MHz) compiled using GCC 6.3.1 with -O2 and using newlib-nano:

stdcbench 0.3
stdcbench c90base score: 1141
stdcbench c90lib score: 651
stdcbench final score: 1792

Per clock cycle, this Cortex-M0 with GCC gets about twice the score of an STM8 with IAR. The large difference between the c90base and the c90lib score is interesting. I guess newlib-nano is optimized for code size at the expense of speed (even though it is surprising to see that much of a difference vs. the situation for the STM8).
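The per-clock comparison can be reproduced by dividing each final score by the clock frequency. This is a small Python sketch using numbers posted in this thread; "score per MHz" and the helper `score_per_mhz` are illustrative choices, not something stdcbench reports itself.

```python
# (final score, clock frequency in MHz) as posted in this thread.
results = {
    "Cortex-M0, GCC 6.3.1 -O2": (1792, 48),
    "STM8, IAR (speed)": (297, 16),
    "STM8, SDCC 3.7.0 RC1 (speed)": (197, 16),
}

def score_per_mhz(score, mhz):
    """Normalize a stdcbench score by clock frequency (illustrative metric)."""
    return score / mhz

per_mhz = {name: score_per_mhz(s, f) for name, (s, f) in results.items()}
for name, value in per_mhz.items():
    print(f"{name}: {value:.1f} per MHz")

ratio = per_mhz["Cortex-M0, GCC 6.3.1 -O2"] / per_mhz["STM8, IAR (speed)"]
print(f"Cortex-M0 vs. STM8/IAR: {ratio:.2f}x")
```

With these inputs the ratio comes out at roughly 2.0, which matches the "about twice" estimate.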

Philipp

raimond.dragomir 8 years ago

On Tuesday, 20 February 2018, 15:00:41 UTC+2, Philipp Klaus Krause wrote:

You can try an -Os variant to see what difference you get. It would in fact be quite interesting. I always thought that the speed gain of -O2 over -Os is not much, and so I always use -Os ...

Can you do some (8bit) AVR tests?

Philipp Klaus Krause 8 years ago

On 20.02.2018 at 14:13, snipped-for-privacy@gmail.com wrote:

Not much:

stdcbench 0.3
stdcbench c90base score: 1047
stdcbench c90lib score: 607
stdcbench final score: 1654

Don't have one yet, but I intend to do a Cortex-M3 and maybe a Z80 test later this week. At some point, I should also put a list of results on [link].

Philipp

Philipp Klaus Krause 8 years ago

On 20.02.2018 at 14:38, Philipp Klaus Krause wrote:

Here is a Cortex-M4 (the STM32F302R8 at 64 MHz; it could do 72 MHz, but not with the internal oscillator, and my board doesn't have a crystal), using GCC -O2 -mcpu=cortex-m4 -mthumb with newlib-nano:

stdcbench 0.3
stdcbench c90base score: 1693
stdcbench c90lib score: 864
stdcbench final score: 2557

Looks only 15% faster per clock cycle than the Cortex-M0.
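The per-clock advantage depends on which Cortex-M0 run serves as the baseline, and the post does not say which one was used. A short Python sketch using the final scores posted in this thread works out both cases; the helper names are illustrative.

```python
# (final score, clock in MHz) from the posts above.
m0_o2 = (1792, 48)   # STM32F051R8, GCC -O2
m0_os = (1654, 48)   # STM32F051R8, GCC -Os
m4_o2 = (2557, 64)   # STM32F302R8, GCC -O2

def per_mhz(result):
    """Final score divided by clock frequency."""
    score, mhz = result
    return score / mhz

def percent_faster(a, b):
    """How much higher a's per-MHz score is than b's, in percent."""
    return (per_mhz(a) / per_mhz(b) - 1.0) * 100.0

print(f"M4 -O2 vs. M0 -O2: {percent_faster(m4_o2, m0_o2):.1f}%")
print(f"M4 -O2 vs. M0 -Os: {percent_faster(m4_o2, m0_os):.1f}%")
```

With these inputs, the Cortex-M4 comes out about 7% ahead of the -O2 Cortex-M0 run and about 16% ahead of the -Os run, so the quoted figure is closer to the -Os baseline.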

Philipp

Jack 8 years ago

According to the "Definitive Guide to ARM Cortex-M0 and Cortex-M0+", the performance figures for the various Cortex-M cores are:

Features                  Cortex-M0  Cortex-M0+  Cortex-M3  Cortex-M4  Cortex-M7
Dhrystone 2.1 (per MHz)   0.9        0.95        1.25       1.25       2.14
CoreMark 1.0 (per MHz)    2.33       2.46        3.34       3.40       5.01

So maybe there is something that the M4 doesn't like too much, or gcc doesn't optimize very well for the M4 (with the options used).
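To put the table next to the stdcbench numbers, the expected Cortex-M4 over Cortex-M0 ratio per MHz can be computed for each benchmark. This is a minimal Python sketch; the table values are copied from above, the stdcbench figures are the -O2 final scores posted earlier in the thread, and the helper name is illustrative.

```python
# Per-MHz figures from the table above.
dhrystone = {"Cortex-M0": 0.9, "Cortex-M4": 1.25}
coremark = {"Cortex-M0": 2.33, "Cortex-M4": 3.40}
# stdcbench 0.3 final scores divided by clock (GCC -O2 runs in this thread).
stdcbench = {"Cortex-M0": 1792 / 48, "Cortex-M4": 2557 / 64}

def m4_over_m0(table):
    """Ratio of the Cortex-M4 figure to the Cortex-M0 figure."""
    return table["Cortex-M4"] / table["Cortex-M0"]

for label, table in [("Dhrystone", dhrystone),
                     ("CoreMark", coremark),
                     ("stdcbench", stdcbench)]:
    print(f"{label:10s} M4/M0 per MHz: {m4_over_m0(table):.2f}")
```

The published Dhrystone and CoreMark figures suggest a ratio of roughly 1.4 to 1.5, while the stdcbench runs give a ratio below 1.1, which is the gap being discussed here.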

Bye Jack

Philipp Klaus Krause 8 years ago

On 20.02.2018 at 16:14, Jack wrote:

I probably won't find time for testing until later this week, but for now I suspect it is the flash: the flash in the STM32F051 (Cortex-M0) at 48 MHz needs 1 wait state, while the flash in the STM32F302 (Cortex-M4) at 64 MHz needs 2 wait states. Both devices have a prefetch buffer that is supposed to somewhat reduce the effect of the wait states on program execution.

Philipp

Tauno Voipio 8 years ago

GCC -Os does wonders compared to the standard compiled newlib.

-TV
