Portable Assembly

I have done a few portable assemblers of the general type you're describing. There are two approaches. One is to write macros for the instruction set of the target processor, and effectively assemble processor A into processor B with macros. This might work for architecturally close processors, but even then it has significant problems. To give an example, 6805 to 6502: the carry following the subtract 0 - 0 is different.

There is one approach that I have used that does work reasonably well: assemble processor A into functionally rich intermediate code, and compile the intermediate code into processor B. The resulting code is quite portable between the processors, and it is capable of supporting diverse architectures quite well.

I have done mostly 8-bit processors this way: 6808 (three major families) to PIC (many varieties: the 12, 14, 14x and 16 families). In all cases I set up the translation so I could go either way. I have also targeted some 16-, 24-, and 32-bit processors. For pure code this has worked quite well, with a low penalty for the translation.

Application code usually has processor-specific I/O, which can actually be detected by the translator but generally needs some hand intervention.


Reply to
Walter Banks

I am pretty sure I have seen - or read about - compiler-generated code where the compiler detects what you want to do and inserts some prewritten piece of assembly code. It was something about CRC or about TCP checksum, I am not sure - and it was someone else who said that, I don't know it from direct experience.

But if the compiler does this it will be obvious enough.

Anyway, a function would do - if complex and long enough to be close to real life, i.e. a few hundred lines.

But I don't see why we can't compare already-written stuff. I just checked again on that VNC server for dps - not 8k, closer to 11k (the 8k I saw was a half-baked version, with no keyboard tables inside it etc.; the complete version also includes a screen mask to allow it to ignore mouse clicks in certain areas, that sort of thing). Add to it some menu (it is driven by command line options only) - a much more complex menu than the Windows and Android RealVNC have - and it adds up to 25k. Compare this to the 350k exe for Windows or to the 4M for Android (and the Android one does only raw...) and the picture is clear enough, I think.


====================================================== Dimiter Popoff, TGI



Reply to

On Mon, 29 May 2017 18:43:01 +0200, Stefan Reuther wrote:

Unless it uses a push/pop architecture like Java bytecode, which can get 'assembled' to any number of registers.

--
(Remove the obvious prefix to reply privately.)
Made with Opera's e-mail program: http://www.opera.com/mail/


Reply to
Boudewijn Dijkstra

I've sometimes wondered what kind of development systems were used for those early 1980s home computers. Unreliable, slow and small storage media would've made it pretty awful to do development on target systems. I've read Commodore used a VAX for ROM development so they probably had a cross assembler there but other than that, not much idea.

Reply to
Anssi Saari

A compiler sees the source code you write, and generates object code that does that job. It may be smart about it, but it will not insert "pre-written assembly code". Code generation in compilers is usually defined with some sort of templates (such as a pattern for reading data at a register plus offset, or a pattern for doing a shift by a fixed size, etc.). These are not "pre-written assembly", in that many of the details are determined at generation time, such as register choices, instruction interleaving, etc.

The nearest you get to pre-written code from the compiler is in the compiler support libraries. For example, if the target does not support division instructions, or floating point, then the compiler will supply routines as needed. These /might/ be written in assembly - but often they are written in C.

A compiler /will/ detect patterns in your C code and use that to generate object code rather than doing a "direct translation". The types of patterns it can detect vary - it is one of the things that differentiates between compilers. A classic example for the PPC would be:


uint32_t reverseLoad(uint32_t * p) {
    uint32_t x = *p;
    return ((x & 0xff000000) >> 24) | ((x & 0x00ff0000) >> 8)
         | ((x & 0x0000ff00) << 8)  | ((x & 0x000000ff) << 24);
}

which gcc compiles to a single byte-reversed load:

reverseLoad:
    lwbrx 3,0,3
    blr

If you had some examples or references, it would be easier to see what you mean.

A function that is a few hundred lines of source code is /not/ real life - it is broken code. Surely in VLA you divide your code into functions of manageable size, rather than single massive functions?

A VNC server is completely useless for such a test. It is far too complex, with far too much variation in implementation and features, too many external dependencies on an OS or other software (such as for networking), and far too big for anyone to bother with such a comparison.

You specifically need something /small/. The algorithm needs to be simple and clearly expressible. Total source code lines in C should be no more than about 100, with no more than perhaps 3 or 4 functions. Smaller than that would be better, as it would make it easier for us to understand the VLA and see its benefits.

Here is a possible example:

// Type for the data - this can easily be changed
typedef float data_t;

static int max(int a, int b) { return (a > b) ? a : b; }

static int min(int a, int b) { return (a < b) ? a : b; }

// Calculate the convolution of two input arrays pointed to by
// pA and pB, placing the results in the output array pC.
void convolute(const data_t * pA, int lenA,
               const data_t * pB, int lenB,
               data_t * pC, int lenC)
{
    // i is the index of the output sample, running from 0 to lenC - 1.
    // For each i, we calculate the sum as j goes from -inf to +inf
    // of A(j) * B(i - j).
    // Clearly we can limit j to the range 0 to (lenA - 1).
    // We use k to hold i - j, which will run down as j runs up.
    // k will be limited to (lenB - 1) down to 0.
    // From (i - j) <= (lenB - 1), we have j >= 1 + i - lenB.
    // From (i - j) >= 0, we have j <= i.
    // These give us tighter bounds on the run of j.

    for (int i = 0; i < lenC; i++) {
        int firstJ = max(0, 1 + i - lenB);
        int endJ = min(lenA, i + 1);
        data_t x = 0;
        for (int j = firstJ; j < endJ; j++) {
            int k = i - j;
            x += (pA[j] * pB[k]);
        }
        pC[i] = x;
    }
}

With gcc 4.8 for the PPC, that's about 55 lines of assembly. An interesting point is that the size and instructions are very similar with -O1 and -O2, but the ordering is significantly different - with -O2, the pipeline scheduling is considered. (I don't know which particular cpu model is used for scheduling by default in gcc.)

To be able to compare with VLA, you'd have to write this algorithm in VLA. Then you could compare various points. It should be easy enough to look at the size of the code. For speed comparison, we'd have to know your target processor and compile specifically for that (to get the best scheduling, and to handle small differences in the availability of particular instructions). Then you would need to run the code - I don't have any PPC boards conveniently on hand, and of course you are the only one with VLA tools.

Comparing code clarity and readability is, of course, difficult - but you could publish your VLA and we can maybe get an idea. Productivity is also hard to measure. For a function like this, the time is spent on the details of the algorithm and getting the right bounds on the loops - the actual C code is easy.

You can get a gcc 5.2 cross-compiler for PPC for Windows from here , or you can use the online compiler at . The PowerPC is not nearly as popular an architecture as ARM, and it is harder to find free ready-built tools (though there are plenty of guides to building them yourself, and you can get supported commercial versions of modern gcc from Mentor/CodeSourcery). You can also find tools directly from Freescale/NXP.

Reply to
David Brown

I used an Intel MDS and a Data General Eclipse to bootstrap a Z80-based CP/M computer (self made). After that, the CP/M system could be used to create the code, though the 8 inch floppies were quite small for the task.


-Tauno Voipio
Reply to
Tauno Voipio


We are referring to the same thing under different names - again. At the end of the day, everything the compiler generates ends up as plain machine instructions; it must be executable by the CPU. By "prewritten" I mean some sort of template which gets filled in with addresses etc. before committing. To what lengths compiler writers go to make common cases look good, only the writers themselves know; my memory is vague, but I do think the guy who said that a few years ago knew what he was talking about.

Above all, this is a good example of how limiting the high level language is. Just look at the source and then at the final result.

You will get *exactly* the same result (minus the return) with no optimization in vpa from the line:

mover.l (source),r3

Logic optimization is more or less a kindergarten exercise. If you need logic optimization you don't know what you are doing anyway so the compiler won't be able to help much, no matter how good.

Of course if you stick to a phrase book at source level - as is the case with *any* high level language - you will need plenty of optimization, as your example demonstrates. I bet it will be good only in demo cases like yours and much less useful in real life, so all writing this in C buys you is source length: 10+ times what is necessary (I counted it, and I included a return line in the count: 238 vs. 23 bytes). While 10 times more typing may seem no serious issue to many, a 10 times higher chance of inserting an error is no laughing matter, and 10 times more obscurity just because of that is a productivity killer.

I meant "function" not in the C subroutine kind of sense; I meant it more as "functionality", i.e. some code doing some job. How it is split into pieces etc. will depend on many factors - language, programmer style etc. - not relevant to this discussion.

Actually I think a comparison between two pieces of code doing the same thing is quite telling when the difference is orders of magnitude, as in this case. Writing small benchmarking toy sort of stuff is a waste of time; I am interested in end results.

No, something "small" is kind of kindergarten exercise again, it can only be good enough to fool someone into believing this or that. It is end results which count.


====================================================== Dimiter Popoff, TGI


Reply to

OK. I think your naming and description is odd, but I am glad to see we are getting a better understanding of what the other is saying.

I think of "prewritten" as referring to larger chunks of assembly code, with much more concrete choices of values, registers, scheduling, etc. You described the "prewritten" code as being easily recognisable - in reality, the majority of the code from modern compilers is generated from very small templates with great variability. And on a processor like the PPC, these will be intertwined with each other according to the best scheduling for the chip.

As an example, if we have the function:

int foo0(int * p) { int a = *p * *p; return a; }

The template for reading "*p" generates

lwz 3,0(3)

(Register r3 is used for the first parameter in the PPC eabi. It is also used for the return value from a function, which is why it may seem "over used" in the examples here. In bigger code, and when the compiler can inline functions, it will be more flexible about register choices. I don't know whether you follow the standard PPC eabi in your tools.)

Multiplication is another template:

mullw 3, 3, 3

As is function exit, in this case just:

blr
I find it very strange to consider these as "pre-written assembly".

And if the function is more complex, the intertwining causes more mixups, making it less "pre-written":

int foo1(int * p, int * q) {
    int a = *p * *p;
    int b = *q * *q;
    return a + b;
}

foo1:
    lwz 9,0(3)
    lwz 10,0(4)
    mullw 9,9,9
    mullw 3,10,10
    add 3,9,3
    blr

Well, it is known to the compiler writers and to users who look at the generated code! Certainly there is plenty of variation between tools, with more advanced compilers working harder at this sort of thing. Command line switches with choices of optimisation levels can also make a big difference.

How much experience do you have of using C compilers, and studying their output?

No, that is a good example of how smart the compiler is (or can be) about generating optimal code from the source.

You may in addition view this as a limitation of the C language, which has no direct way to specify a "bit reversed pointer". That is fair enough. However, it is not really any harder than defining a function like this, and then using it. For situations where the compiler can't generate ideal code, and it is particularly useful to get such optimal assembly, it is also possible to write a simple little inline assembly function - it is not really any harder than writing the same thing in "normal" assembly.

Another option (for newer gcc) is to define the endianness of a struct. Then you can access the fields directly, and the loads and stores will be reversed as needed.

typedef struct __attribute__((scalar_storage_order("little-endian"))) {
    uint32_t x;
} le32_t;

uint32_t reverseLoad2(le32_t * p) {
    return p->x;
}

reverseLoad2:
    lwbrx 3,0,3
    blr

So the high level language gives you a number of options, with specific tools giving more options, and the implementation gives you efficient object code in the end. You might need to define a function or macro yourself, but that is a one-time job.

When you say "no optimisation" here, does that mean that VPA supports some kinds of optimisations?

What do you mean by "logic optimisation" ? It is normal for a good compiler to do a variety of strength reduction and other re-arrangements of code to give you something with the same result, but more efficient execution. And it is a /good/ thing that the compiler does that - it means you can write your source code in the clearest and most maintainable fashion, and let the compiler generate better code.

For example, if you have a simple division by a constant:

uint32_t divX(uint32_t a) { return a / 5; }

The direct translation of this would be:

divX:
    li 4,5
    divwu 3,3,4
    blr

But a compiler can do better:

divX:    // divide by 5
    lis 9,0xcccc
    ori 9,9,52429
    mulhwu 3,3,9
    srwi 3,3,2
    blr

Such optimisation is certainly not a "kindergarten exercise", and doing it by hand is hardly a maintainable or flexible solution. Changing the denominator to 7 means significant changes:

divX:    // divide by 7
    lis 9,0x2492
    ori 9,9,18725
    mulhwu 9,3,9
    subf 3,9,3
    srwi 3,3,1
    add 3,9,3
    srwi 3,3,2
    blr

I still don't know what you mean with "phrase book" here.

Nonsense. The benefits of using a higher level language and a compiler get more noticeable with larger code, as the compiler has no problem tracking register usage, instruction scheduling, etc., across large pieces of code - unlike a human. And it has no problem re-creating code in different ways when small details change in the source (such as the divide by 5 and divide by 7 examples).

You have this completely backwards. If I write a simple example like this, in a manner that is compilable code, then it is going to take longer in high-level source code - but that is the one-time cost of the function definition. In use, writing "reverseLoad" does not take significantly more characters than "mover" - and with everything else around, the C code will be much shorter. And this was a case picked specifically to show how some long patterns in C code can be handled by a compiler to generate optimal short assembly sequences.

The division example shows the opposite - in C, I write "a / 7", while in assembly you have to write 7 lines (excluding labels and blr). And the C code there is nicer in every way.

In real code, the C source will be 10 times shorter than the assembly. And if the assembly has enough comments to make it clear, there is another order of magnitude difference.


But again, it has to be a specific clearly defined and limited functionality. "Write a VNC server" is not a specification - that would take at least many dozens of pages of specifications, not including the details of the interfacing to the network stack, the types of library functions available, the API available to client programs that will "draw" on the server, etc.

No, it is not. The code is not comparable in any way, and does not do the same thing except in a very superficial sense. It's like comparing a small car with a train - both can transport you around, but they are very different things, each with their advantages and disadvantages.

If you want to compare your VNC server for DPS written in VPA to a VNC server written in C, then you would need to give /exact/ specifications of all the features of your VNC server, and exact details of how it interfaces with everything else in the DPS system, and have someone write a VNC server in C for DPS that follows those same specifications. That would be no small feat - indeed, it would be totally impossible unless you wanted to do it yourself.

The nearest existing comparison I can think of would be the eCos VNC server, written in C. I can't say how it compares in features with your server, but it has approximately 2100 lines of code, written in a wide style. Since I have no idea about how interfacing with DPS compares with interfacing with eCos (I don't know either system), I have no idea if that is a useful comparison or not.

Then we will all remain in ignorance about whether VPA is useful or not, in comparison to developing in C.

Reply to
David Brown

On Sat, 27 May 2017 21:39:36 +0200, rickman wrote:

LLVM has a pretty generic intermediate assembler language, though I'm not sure if it's meant for actually writing code in.


Another portable assembly language is Java Bytecode, though it assumes a 32-bit machine.
(Remove the obvious prefix to reply privately.) 
Gemaakt met Opera's e-mailprogramma: http://www.opera.com/mail/
Reply to
Boudewijn Dijkstra

Interesting, but it's not obvious who the audience is. Why would anyone want to learn another language that is not in common use or aligned to any specific CPU?

I've been watching this thread for some time. My first impression was: why not just write in C? So far that impression hasn't changed. Aside from the odd line of CPU-specific assembler code for those occasions that require it, C is still perhaps the most portable code you can write?

Mike Perkins 
Video Solutions Ltd 
Reply to
Mike Perkins

The LLVM "assembly" is intended as an intermediary language. Front-end tools like clang (a C, C++ and Objective-C compiler) generate LLVM assembly. Middle-end tools like optimisers and linkers "play" with it, and back-end tools translate it into target-specific assembly. Each level can do a wide variety of optimisations. The aim is that the whole LLVM system can be more modular and more easily ported to new architectures and new languages than a traditional multi-language multi-target compiler (such as gcc). So LLVM assembly is not an assembly language you would learn or code in - it's the glue holding the whole system together.

Well, yes - of course C is the sensible option here. Depending on the exact type of code and the targets, Ada, C++, and Forth might also be viable options. But since there is no such thing as "portable assembly", it's a poor choice :-) However, the thread has led to some interesting discussions, IMHO.

Reply to
David Brown
