Does 32-bitter code pack less efficiently than 16-bitter code?

Hi folks-

Yesterday I was in a meeting discussing what path we should take due to our decision (not mine, thankfully) to use an undersized FPGA that will host NIOS. NIOS is a 32-bit soft-core processor that runs in Altera FPGAs. It suddenly occurred to me that, since instructions are 32 bits wide, C code memory usage (not data) might not be as efficient as that of a 16-bit machine.

I realize the issue is complex to analyze. One must first and foremost consider (1) the compiler and linker and (2) the instruction set. In our case, NIOS code is built using GCC. I have my suspicions that my particular implementation of GCC for NIOS isn't so hot in terms of optimizing for code size. My frame of reference is the IAR toolset on the STM32, which I believe is tops in its class in terms of code footprint efficiency. It's plausible to me that even though 32-bit instruction storage is twice that of a 16-bit machine, that loss might be more than offset by gains from additional addressing modes and a greater number of other instructions, if indeed those advantages really exist.

What is the real-world comparison of 16-bit vs. 32-bit code footprint among popular professional-grade MCU cores and tools?

Thanks in advance for your discussion.

JJS


Reply to
John Speth

On Thu, 19 Aug 2010 17:04:16 +0200, John Speth wrote:

ARM and MIPS did not create 16-bit ISAs from their 32-bit ISAs just for the fun of it. Also, the bulk of your code is not performance-sensitive, so assuming your core can switch between modes, you can always keep using 32-bit where it is really needed.
--
Made with Opera's revolutionary e-mail program:  
http://www.opera.com/mail/
(remove the obvious prefix to reply by mail)
Reply to
Boudewijn Dijkstra

That's putting it quite a bit too simplistically. Just because a CPU architecture is 32-bit by no means implies instructions have to be 32 bits.

Not necessarily. Far from it. A 32-bitter's machine instructions can be consistently 32 bits long (including operands, though), or they can be the same size as those of a 16-bit architecture (think Intel 8086 vs. i386). A 16-bitter's instruction set may be a mixture of sizes from 1 to 8 bytes.

That depends entirely on the particular architectures being compared. The tools and the type of code you use have an influence, too, but that's negligible compared to the architectures'. The ratio can be anywhere from a factor of two or more down to practically no difference at all.

So stop making guesses. Measure the actual size of _your_ code on _your_ candidate architectures instead.

Reply to
Hans-Bernhard Bröker

I knew the question would be difficult to answer. I was looking for generalities that you seemingly claim don't exist.

I can measure my code objects easily enough. I just can't measure what that same code footprint might be on another target without investing the time and money in the tools to help me compare the two.

JJS


Reply to
John Speth

They don't exist (at least not in any useful way).

Life's tough that way: usually the only way to measure something is to measure it. ;)

You can probably find plenty of benchmark comparisons for architectures, but those often aren't very meaningful either.

If you want to know what the codesize is for various architectures for a particular function, then compiling the code for the architectures is the only reliable thing to do. Trial versions of compilers are usually pretty easy to come by.
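If you go that route, it helps to feed every trial compiler the same small, self-contained C module and compare the code sizes the toolchains report. The routine below is purely illustrative (my own sketch, not anyone's benchmark): portable C99, no libc, and a mix of bit-twiddling and control flow so no single architecture gets an unfair advantage.

    /* size_probe.c -- hypothetical probe module: compile with each candidate
     * toolchain at its size-optimizing setting (e.g. -Os) and compare the
     * reported code size.  No libc, no I/O, plain C99. */
    #include <stdint.h>

    uint16_t crc16(const uint8_t *buf, uint16_t len)
    {
        uint16_t crc = 0xFFFFu;

        while (len--) {
            crc ^= *buf++;
            for (uint8_t bit = 0; bit < 8; bit++) {
                if (crc & 1u)
                    crc = (uint16_t)((crc >> 1) ^ 0xA001u);  /* CRC-16/MODBUS */
                else
                    crc >>= 1;
            }
        }
        return crc;
    }

    /* Some "test this, jump there, call that" control code as well, since a
     * pure arithmetic kernel would flatter wide-word machines. */
    int16_t classify(int16_t x)
    {
        if (x < 0)
            return -1;
        if (x == 0)
            return 0;
        if (x < 100)
            return (int16_t)(crc16((const uint8_t *)&x, sizeof x) & 0x7F);
        return 1;
    }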

--
Grant Edwards               grant.b.edwards        Yow! I request a weekend in
                                  at               Havana with Phil Silvers!
                              gmail.com
Reply to
Grant Edwards

If you look closer, you may notice I backed up that claim with some details.

Thanks to trial versions and free tools, you can usually keep that money until after you've committed to an architecture.

As for the rest: yes, getting meaningful data does invariably take somebody's time. It's up to you to decide who that "somebody" is going to be.

Reply to
Hans-Bernhard Bröker

[...]

In general, control code (i.e. "test this, jump there, call that") on a 32-bit architecture tends to be a little larger than on a 16-bit machine, simply because the 32-bit architecture requires larger addresses.

On the other hand, 32-bitters need fewer instructions for 32-bit arithmetic or large data manipulation (no messing around with segment base registers etc).
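A line or two of C makes the point (my own illustration, not taken from the Lisp kernel measured below): the 32-bit add and compare each map to a single instruction on a 32-bit core, while a 16-bit compiler has to synthesize them from 16-bit halves plus carry handling.

    #include <stdint.h>

    /* 32-bit arithmetic: typically one ADD and one CMP on a 32-bitter,
     * but ADD+ADC and a two-word compare on a 16-bit machine. */
    uint32_t wrap_add(uint32_t acc, uint32_t delta, uint32_t limit)
    {
        acc += delta;
        if (acc >= limit)
            acc -= limit;
        return acc;
    }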

I happen to have a code size comparison for a particular module (a lisp interpreter kernel, specifically written to have a low ROM footprint) at hand. This is mostly control code, little arithmetic.

i386 (32 bit):
  gcc-4.1.0 -Os -fomit-frame-pointer   (i686-linux)                       4485
  gcc-4.2 -Os -fomit-frame-pointer     (cygwin)                           4656
  cl /O1 /Oy-                          (MS C 12.00.8168, i386)            4506
  wcc386 -3 -ox -os                    (OpenWatcom C16 1.2, i386)         4976

Blackfin (32 bit MCU with 16 bit DSP):
  ccbf-5.1.2 -Os                                                          4764

x86, 16 bit mode:
  tcc -O -Z -d -1                      (Turbo C 2.01, i286)               4231
  wcc -3 -ox -os                       (OpenWatcom C16 1.2, i386, 16 bit) 3930

Lessons learned: 16-bit code here is smaller by about 10%. However, the compiler also has an effect of about the same magnitude. What I find particularly interesting is that OpenWatcom here generated both the smallest 16-bit and the largest 32-bit object size.

Hence, I would assume that code which closely fits into a 128k-ROM 16-bit MCU may not fit into a 128k 32-bit MCU, but should conveniently fit into 256k with enough room for expansion.

Stefan

Reply to
Stefan Reuther

Reference: "Code density concerns for new architectures" by Vincent M. Weaver and Sally A. McKee:

formatting link

We hand-optimize an assembly language embedded benchmark for size on 21 different instruction set architectures, finding up to a factor of three difference in code sizes from ISA alone. We find that the architectural features that contribute most heavily to code density are instruction length, number of registers, availability of a zero register, bit-width, hardware divide units, number of instruction operands, and the availability of unaligned loads and stores.

I have not read this myself yet but found the reference while searching for a downloadable copy of an ARM White Paper which is interesting reading:

R. Phelan, Improving ARM Code Density and Performance: New Thumb Extensions to the ARM Architecture, ARM Limited, 2003.

I can't find it on the ARM website any more.

Regards, Chris Burrows

CFB Software Astrobe: ARM Oberon-07 Development System

formatting link

Reply to
Chris Burrows

Another consideration is the RAM footprint comparison. Most 16-bitters allow 2 x 8-bit chars to be fitted into a 16-bit word, if you understand what I mean. But 32-bitters will usually allocate 32 bits for an 8-bit variable.

PhilW

Reply to
PhilW

It's not hard. I took a representative sample product (~50 kloc of C), removed all the platform-specific stuff so it would compile on anything calling itself a C compiler, and ran it through the eval compilers for our target architectures to get a useful metric. This is very imperfect, but it gave some interesting data.

Reply to
larwe

Most 32-bit compilers do pack string characters regardless of whether the chip has byte addressing; if it does not, the compiler uses masking and shifting to access them.

Short stack local variables and structure fields are another matter. Stack locals are almost invariably placed according to the chip's memory alignment requirements, inserting padding if necessary (I've only ever seen one compiler pack locals on a chip that lacked byte addressing).

Similarly, structure fields are normally aligned with padding, but there is usually a switch to force packing of short fields. If the chip can't address them directly, the compiler will use masking and shifting to access them.

If you pack data shorter than the alignment restriction, you are trading code for data due to the extra instructions required to perform field insertions and extractions as opposed to simple loads and stores. The tradeoff is system dependent - sometimes it is worthwhile and sometimes not.
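A small sketch of that tradeoff (my own example, using the GCC-style packed attribute; the struct and field names are made up for illustration):

    #include <stdint.h>

    /* Naturally aligned: on a typical 32-bit ABI this ends up 8 bytes --
     * one padding byte before 'count'; sizeof is usually 8. */
    struct record_aligned {
        uint32_t timestamp;
        uint8_t  flags;
        uint16_t count;
    };

    /* Packed: 7 bytes of data, saving RAM/ROM per instance, but accesses
     * to 'count' may now cost extra shift/mask or byte-wise instructions
     * on cores without unaligned halfword loads -- data saved, code (and
     * cycles) spent. */
    struct __attribute__((packed)) record_packed {
        uint32_t timestamp;
        uint8_t  flags;
        uint16_t count;
    };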

George

Reply to
George Neuner

formatting link

Interesting results,

Hans

formatting link

Reply to
HT-Lab

That's very, very interesting! It speaks right to what I needed to know. Thanks for that.

JJS


Reply to
John Speth

Interesting indeed, thanks for posting the link!

I somewhat suspect the manually optimized assembly tests were done by an x86 person, hence the code density was better than, say, for 68k. But the results are consistent with what I have seen/done myself: manually optimized VPA code results in little if any more than the 68k equivalent, whereas unoptimized 68k assembly (VPA assembled for PPC) is approx. 3.5 times larger (mainly because the C bit is maintained unnecessarily all of the time, .b or .w register-only modifications, etc.).

Dimiter

------------------------------------------------------ Dimiter Popoff Transgalactic Instruments

formatting link

------------------------------------------------------

formatting link

Reply to
Didi

It is. I have worked on several commercial instruction set designs, and care should be taken in interpreting a single data point. The most interesting thing is the wide range of targets that he is using.

Two things that I have seen affect code density are instruction mix and data flow within the processor.

The instruction mix issue is often trading several registers for an instruction set with a few general-purpose registers and additional very efficient (bit-wise) instructions that support part of the address space. There are a ton of potential data flow problems, but two stand out.

When ISAs have few registers and even fewer working accumulators with full arithmetic capabilities, then in the middle of complex calculations the intermediate value of the accumulator needs to be saved in order to calculate, for example, a pointer value or array offset.

The second, less obvious case is the relationship between the accumulator and I/O ports. In general there isn't a large price to pay when an I/O port write needs to be done with writes from the accumulator. This changes when the part runs out of RAM and ROM space and the silicon company bolts on memory management. Almost always this is implemented with partial memory selection done with an I/O port. When this happens, the data flow through the accumulator changes: I/O port memory management now needs to change during a complex computation instead of at the last operation. This last case can be mostly fixed with a constant-to-I/O-port instruction.

Regards,

Walter..

-- Walter Banks Byte Craft Limited

formatting link


Reply to
Walter Banks

Hans, thanks for tracking that down. That link takes you to a slide show, but it did enable me to find the actual paper:

formatting link

Regards, Chris

formatting link

Reply to
Chris Burrows

Thanks, I was planning to email the authors to see if they would be willing to send me a copy.

I assume you have found your ARM paper by now but if not here is a link,

formatting link

Regards, Hans

formatting link

Reply to
HT-Lab

As you would expect, there is no "real" comparison to be made here. The problem lies in deciding "what's an apple" (let alone an "orange"!).

What you have to look at is how long *instructions* are, not the "size" of the processor. E.g., if every instruction for the "32 bit" processor is exactly one "(32b) word" but the "16b CPU" often needs two or three "(16b) words", then the average instruction in the 16b case is longer.

[Of course, "average" can be defined a lot of ways!]

You can **naively** try to get an idea for the relative efficiencies of various architectures by cross-compiling a "large-ish" project for each and looking at the sizes of the resulting (stripped) binaries.

But, this is truly a naive approach because it assumes:

- you have compilers for each architecture

- the compilers are "equally good" at code generation

- there are no variations in data types across platforms (e.g., 32b ints in one, 16b ints in the other), etc. (a sketch of how to control for this follows below)
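On that last point, one way to take integer-width differences out of the comparison (my own suggestion, not from the post) is to write the test code against <stdint.h> types, so "int" doesn't silently mean 16 bits on one target and 32 on the other:

    #include <stdint.h>

    /* With plain 'int', a 16-bit target does 16-bit sums here while a
     * 32-bit target does 32-bit sums: the two binaries aren't even
     * computing the same thing, never mind taking comparable space. */
    int sum_native(const int *v, int n);   /* declaration only, for contrast */

    /* Exact-width types force both targets to implement the same 32-bit
     * semantics, which keeps the size comparison apples-to-apples. */
    int32_t sum_fixed(const int32_t *v, int_fast16_t n)
    {
        int32_t s = 0;
        while (n-- > 0)
            s += *v++;
        return s;
    }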

However, a more insidious issue regards how well the "application" *fits* the particular implementation (CPU). E.g., if you are dealing with IP addresses, then the ability to handle 32b values economically can give preference to a 32b architecture. OTOH, if you are processing strings or other small data types, the overhead that comes with a 32b CPU may represent a lot of waste.
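A toy example of that "fit" (mine, purely for illustration): the same IPv4 subnet test expressed at the width that suits a 32-bit core, and byte-wise, which is closer to how narrow machines, and string-ish data, behave.

    #include <stdint.h>
    #include <stdbool.h>

    /* Natural fit for a 32-bit datapath: one AND and one compare. */
    bool in_subnet32(uint32_t addr, uint32_t net, uint32_t mask)
    {
        return (addr & mask) == net;
    }

    /* The same test a byte at a time: more instructions, but each one
     * needs only a narrow datapath -- the shape that small data types
     * impose on a wide machine as well. */
    bool in_subnet8(const uint8_t addr[4], const uint8_t net[4],
                    const uint8_t mask[4])
    {
        for (int i = 0; i < 4; i++)
            if ((addr[i] & mask[i]) != net[i])
                return false;
        return true;
    }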

In "the real world" (per your query), you fit the implementation to the application (C.A.E typically having less "spare resources" than the desktop world). Or, modify the application to fit the implementation available (!).

In my own *personal* experience, I have found 16b implementations tend to use 40-50% more "resources" than "equivalent" 8b implementations. The same sort of ratio (ballpark) applies to 16b vs 32b.

BUT, THIS IS JUST A ROUGH GUIDELINE. I use it to gauge the approximate cost of moving up/down the "wideness" hierarchy (or should that be "left/right"?) when a project looks like it may straddle a particular "decision point" (e.g., something that taxes the abilities of a 16b -- but would underutilize those of a 32b, etc.).

In actuality, once the decision is made to cross such a threshold, the scope of a project is often adjusted, accordingly. E.g., adding features/capabilities if moving upwards; removing/reducing if moving downwards. (i.e., so the resulting implementation better *fits* the resources available).

If you are *lucky*, you'll find yourself right smack in the "center" of a particular technology's capabilities and won't be faced with this choice: "This is what we *need*. Anything less won't work; anything more is waste"

Reply to
D Yuniskis
