Keil C51/A51 ASM Code in C

Hi!

For some serious speed optimisations I have to implement a function in assembler within a C source file. I am using Keil uVision 2 with an Infineon C515-CL processor and the C51/A51 environment. After reading the documentation I have some ideas about how to implement a function in ASM and call it from the C source. The problem is: I can't get it to compile. :-)

What I did is the following:

I right-clicked on the .c file and enabled "Generate Assembler SRC File" and "Assemble SRC File" in the file options menu.

I implemented the function with the following code:

    void myop (unsigned char idx, unsigned char pos1, unsigned char pos2, unsigned char shval)
    {
    #pragma asm
        EXTRN DATA (idx)
        EXTRN DATA (pos1)
        EXTRN DATA (pos2)
        EXTRN DATA (shval)

        /* plus */
        mov a, #x
        add a, #pos1
        mov r1, a
        mov a, #x
        add a, #pos2
        mov r0, a
        mov a, @r0
        add a, @r1
        mov r7, a

        /* rotl */
        mov r0, #shval
        mov a, r7
        inc r0              // increase by one to have 0 as exit condition
        sjmp lab2           // unconditional jump
    lab1 rl a               // rotate left by 1 :-)
    lab2: djnz r0,lab1      // decrease and jump if not zero
        mov r7,a

        /* xor */
        mov a, #x
        add a, #idx
        mov r0,a
        mov a, @r0
        xrl a, r7

        /* save back */
        mov @r0, a
    #pragma endasm
    }

The variable x is global and defined by unsigned char idata x[16];

Build-Output:
-------------

    Build target 'Target 1'
    compiling ecrypt.c...
    ECRYPT.C(104): warning C280: 'idx': unreferenced local variable
    ECRYPT.C(104): warning C280: 'pos1': unreferenced local variable
    ECRYPT.C(104): warning C280: 'pos2': unreferenced local variable
    ECRYPT.C(104): warning C280: 'shval': unreferenced local variable
    assembling ecrypt.src...
    linking...
    *** WARNING L1: UNRESOLVED EXTERNAL SYMBOL
        SYMBOL:  ?C_STARTUP
        MODULE:  ecrypt.obj (ECRYPT)
    Program Size: data=28.0 xdata=0 code=625
    "asm_optimized" - 0 Error(s), 5 Warning(s).

The questions are: How do I use the variables from the function signature in the ASM code? Can I access them with #varname? Are they on the stack, or in r0-r7? The documentation mentions all of these possibilities.

And the second one: Which is the "correct" or recommended way to access the unsigned char idata x[16]-array from the assembler code?

Thanks in advance!

Cheers Markus

Reply to
Markus

ALL the questions you ask are fully explained in the manuals. These are in electronic form on the CD and as part of the installation.

In article , Markus writes

--
\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
\/\/\/\/\ Chris Hills  Staffs  England     /\/\/\/\/
/\/\/ chris@phaedsys.org      www.phaedsys.org \/\/\
\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
Reply to
Chris Hills

On Thu, 11 Jan 2007 19:41:57 +0000, Chris Hills wrote in comp.arch.embedded:

[snip]

...and in another post, just a few hours earlier, Chris posted this:

Perhaps you should follow your own advice?

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~ajo/docs/FAQ-acllc.html
Reply to
Jack Klein

In article , Jack Klein writes

Don't be silly! :-)

Reply to
Chris Hills

After starting with a new example project from scratch it's working now, thanks! But there is still one question left whose answer I didn't find in the manual. If I look at the debugger, a function is called with an LJMP. What I am looking for is some "real" inline assembler which can be defined with a macro. My function is called 320 times in a loop. It's quite "expensive" to jump to the function itself 320 times.

Can I define a macro which includes ~20 lines of ASM code and is completely inlined from the compiler?

Cheers Markus

Reply to
Markus

In article , Markus writes

No it's not.

You could, but it is extremely bad practice!

This is pointless. The compiler will compile the code to ASM and then it will incorporate the "inline" asm and then assemble it all together. This is not as efficient as using all C because it stops optimisation.

It is FAR better to either write the function in C, in which case the compiler will do far better optimisation than you can do, or write it as an assembler function, in which case the optimisation is not quite as good but still better than using a macro to inline it.

If you find the overhead of a function call is too expensive then you need to be running at a higher clock speed or use a different MCU.

However.... I suspect that you need to re-design your algorithm.... I think your best bet might be to write the C code inline in the loop.

The Keil C compiler is extremely efficient at optimising C.

You could post the code here?

BTW which part of the world are you in?

Reply to
Chris Hills

Maybe some of the intention got lost in translation ("within" could mean inline asm, or just asm called from C), but this sounds as if you've decided that inline assembler must be the solution, and this before you've so much as tried such a thing once. That's the wrong way to approach an optimization.

In fact, especially for C51, inline assembler is basically never the right approach to squeezing out those last couple of bits of speed, because of the way they handle it. Roughly speaking, C51 stops optimizing the C code as soon as you've used any inline assembler. Either do the truly critical part completely in C, or completely in assembler.

This is rather certainly not going to be the optimal method. "Generate Assembler SRC File" (a.k.a. #pragma SRC) has some value for generating an assembly prototype implementation from C code, but "Assemble SRC File" is more trouble than it's worth.

The recommended approach for all compilers that behave like C51 is:

1) write the function in C (or a silly dummy version, just enough to use all inputs and write to all outputs)
2) compile it to SRC
3) rename *.src to *.a51
4) delete (or at least hide) the C version
5) work only on the *.a51 file from now on
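
As a rough illustration of step 1, a "silly dummy version" might look like the sketch below. It is plain, portable C: the function name and parameters are borrowed from the original post, the body is made up purely so that every input is read and the array is written, and the Keil-specific idata qualifier is left out so it builds with any C compiler. Under C51 one would then compile it with the SRC directive to get the assembly skeleton.

```c
/* Hypothetical dummy for step 1. The body is NOT the real algorithm;
 * it only touches every parameter and the global array so that the
 * generated assembly shows how each of them is passed and addressed.
 * Assumption: in the real project x would be declared with Keil's
 * `idata` qualifier, which is omitted here for portability. */
unsigned char x[16];

void myop(unsigned char idx, unsigned char pos1,
          unsigned char pos2, unsigned char shval)
{
    /* Read all four inputs, write one output. */
    x[idx] = (unsigned char)(x[pos1] + x[pos2] + shval);
}
```

The generated listing then serves as the template for the hand-written *.a51 version, with the segment and symbol declarations already in place.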
Reply to
Hans-Bernhard Bröker

Hi!

[...]

I guess, in my case it is. :)

[...]

But the compiler still uses JMPs to execute the code. That is what the debugger is telling me.

    #include

    #define ROTATEL(v,c) ( _crol_((v),(c)))
    #define ROTATER(v,c) ( _cror_((v),(c)))

    #define XOR(v,w) ((v) ^ (w))
    #define PLUS(v,w) ((v) + (w))

    unsigned char idata x[16];

    int main(void)
    {
        int i;
        for (i = 0;i < 16;++i) x[i] = 'x';

        for (i = 10;i > 0;i--) {
            x[ 4] = XOR(x[ 4],ROTATER(PLUS(x[ 0],x[12]), 1));
            x[ 8] = XOR(x[ 8],ROTATEL(PLUS(x[ 4],x[ 0]), 1));
            x[12] = XOR(x[12],ROTATER(PLUS(x[ 8],x[ 4]), 3));
            x[ 0] = XOR(x[ 0],ROTATEL(PLUS(x[12],x[ 8]), 2));
            x[ 9] = XOR(x[ 9],ROTATER(PLUS(x[ 5],x[ 1]), 1));
            x[13] = XOR(x[13],ROTATEL(PLUS(x[ 9],x[ 5]), 1));
            x[ 1] = XOR(x[ 1],ROTATER(PLUS(x[13],x[ 9]), 3));
            x[ 5] = XOR(x[ 5],ROTATEL(PLUS(x[ 1],x[13]), 2));
            x[14] = XOR(x[14],ROTATER(PLUS(x[10],x[ 6]), 1));
            x[ 2] = XOR(x[ 2],ROTATEL(PLUS(x[14],x[10]), 1));
            x[ 6] = XOR(x[ 6],ROTATER(PLUS(x[ 2],x[14]), 3));
            x[10] = XOR(x[10],ROTATEL(PLUS(x[ 6],x[ 2]), 2));
            x[ 3] = XOR(x[ 3],ROTATER(PLUS(x[15],x[11]), 1));
            x[ 7] = XOR(x[ 7],ROTATEL(PLUS(x[ 3],x[15]), 1));
            x[11] = XOR(x[11],ROTATER(PLUS(x[ 7],x[ 3]), 3));
            x[15] = XOR(x[15],ROTATEL(PLUS(x[11],x[ 7]), 2));
            x[ 1] = XOR(x[ 1],ROTATER(PLUS(x[ 0],x[ 3]), 1));
            x[ 2] = XOR(x[ 2],ROTATEL(PLUS(x[ 1],x[ 0]), 1));
            x[ 3] = XOR(x[ 3],ROTATER(PLUS(x[ 2],x[ 1]), 3));
            x[ 0] = XOR(x[ 0],ROTATEL(PLUS(x[ 3],x[ 2]), 2));
            x[ 6] = XOR(x[ 6],ROTATER(PLUS(x[ 5],x[ 4]), 1));
            x[ 7] = XOR(x[ 7],ROTATEL(PLUS(x[ 6],x[ 5]), 1));
            x[ 4] = XOR(x[ 4],ROTATER(PLUS(x[ 7],x[ 6]), 3));
            x[ 5] = XOR(x[ 5],ROTATEL(PLUS(x[ 4],x[ 7]), 2));
            x[11] = XOR(x[11],ROTATER(PLUS(x[10],x[ 9]), 1));
            x[ 8] = XOR(x[ 8],ROTATEL(PLUS(x[11],x[10]), 1));
            x[ 9] = XOR(x[ 9],ROTATER(PLUS(x[ 8],x[11]), 3));
            x[10] = XOR(x[10],ROTATEL(PLUS(x[ 9],x[ 8]), 2));
            x[12] = XOR(x[12],ROTATER(PLUS(x[15],x[14]), 1));
            x[13] = XOR(x[13],ROTATEL(PLUS(x[12],x[15]), 1));
            x[14] = XOR(x[14],ROTATER(PLUS(x[13],x[12]), 3));
            x[15] = XOR(x[15],ROTATEL(PLUS(x[14],x[13]), 2));
        }
    }

_crol_ and _cror_ are implemented (in ASM) with a counter and a loop. This implementation takes 8.7 ms at 8 MHz and I have to get below 6 ms.

My idea was:

    void ASMROTATEL(...)
    {
    #pragma asm
        ...
    #pragma endasm
    }

If I replace the macro in each line of the "big loop" with the asm function, then every time the function gets executed (32 lines, 10 loop iterations -> 320 times) it makes an LJMP. If I can completely inline the ASM code with a macro, I can omit 320 LJMPs and 320 RETs.

Another point for optimization is to "manually replace" _cror_ and _crol_: if you look at the code, it only ever rotates by 1, 2 or 3 bits. So instead of implementing _cror_ (in asm) with a counter in a register and a conditional jump, I can just use 'RR R0' one, two or three times, which also saves me some extra cycles.
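
In plain C, the fixed-count idea might look like the sketch below. The macro names ROTL8 and ROTR8 are made up for illustration and are not Keil intrinsics; the count has to be a constant from 1 to 7. Whether a given compiler turns these into single rotate instructions is an assumption that has to be checked in the generated listing.

```c
/* Hypothetical fixed-count 8-bit rotates (not Keil intrinsics).
 * v is reduced to 8 bits first; c must be a constant in 1..7 so that
 * neither shift count reaches 8. */
#define ROTL8(v, c) \
    ((unsigned char)((((unsigned char)(v)) << (c)) | \
                     (((unsigned char)(v)) >> (8 - (c)))))
#define ROTR8(v, c) \
    ((unsigned char)((((unsigned char)(v)) >> (c)) | \
                     (((unsigned char)(v)) << (8 - (c)))))

/* One step of the big loop written with the sketch macros:
 * dst ^= rotate-right(a + b, 1), all arithmetic modulo 256. */
static unsigned char step_rotr1(unsigned char dst,
                                unsigned char a, unsigned char b)
{
    return (unsigned char)(dst ^ ROTR8(a + b, 1));
}
```

Because the count is a literal, there is no loop counter and no conditional jump left in the source; the remaining cost depends on what the compiler makes of the shifts.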

The crypto algorithm itself is from DJ Bernstein and is called Salsa20. This is only the part of the code which costs 95% of the cycles.

Germany :)

Thanks for your help! Markus

Reply to
Markus

So what? A JMP is fast on an 8051. If your function really is worth worrying about, the JMP or CALL that gets you there will be the least of your worries.

OK, you've found Keil's weak spot. _crol_ and _cror_ are quite certainly the worst of their intrinsic functions I've seen.

But the only way you can do that is by writing the entire surrounding code in ASM. Good luck.

Reply to
Hans-Bernhard Bröker

I suppose you should see some of my VPA or A32 text parsing and even recursive macros then...

vpasttbq.sa

But you are right, of course, it is bad practice for someone who asks whether it can be done (probably having in mind to put 20 machine opcodes in a macro, which is rarely needed if at all).

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
------------------------------------------------------

Chris Hills wrote:

Reply to
Didi

... snip ...

... snip etc ...

Looks ridiculous. Why don't you publish the actual algorithm, or at least a link to it, and/or look at the manipulations for simplifications? As an example, there are two fast ways of calculating a CCITT CRC16: one is via look-up tables (and takes quite a bit of space), and another takes advantage of the 8080/Z80/8086 decimal adjust instructions, consumes about 30 bytes of object code without any loops, and executes roughly 50 opcodes. Again, you may want to select a different algorithm.
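
For readers who have not seen the look-up table variant, a minimal sketch in portable C follows. It assumes the common CCITT polynomial 0x1021 processed MSB-first, with the initial value left to the caller; the function and table names are made up for illustration. The decimal-adjust trick is not shown.

```c
#include <stddef.h>

/* 256 entries * 2 bytes: this is the "quite a bit of space". */
static unsigned short crc_table[256];

/* Build the table for polynomial 0x1021, MSB-first.
 * Must be called once before crc16_ccitt(). */
static void crc16_ccitt_init(void)
{
    unsigned int i, bit;
    for (i = 0; i < 256; ++i) {
        unsigned short crc = (unsigned short)(i << 8);
        for (bit = 0; bit < 8; ++bit) {
            if (crc & 0x8000u)
                crc = (unsigned short)((crc << 1) ^ 0x1021u);
            else
                crc = (unsigned short)(crc << 1);
        }
        crc_table[i] = crc;
    }
}

/* One table look-up per input byte, no inner bit loop.
 * crc is the starting value (0xFFFF and 0x0000 are both in use). */
static unsigned short crc16_ccitt(const unsigned char *data, size_t len,
                                  unsigned short crc)
{
    while (len--) {
        unsigned int index = ((crc >> 8) ^ *data++) & 0xFFu;
        crc = (unsigned short)((crc << 8) ^ crc_table[index]);
    }
    return crc;
}
```

The trade-off is the one described above: the per-byte work shrinks to a shift, an XOR and a table read, at the cost of 512 bytes of table.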
--
Chuck F (cbfalconer at maineline dot net)
   Available for consulting/temporary embedded and systems.
Reply to
CBFalconer

And I can't imagine any justification for either of

#define XOR(v,w) ((v) ^ (w)) #define PLUS(v,w) ((v) + (w))

The first macro is bad enough. The second...

Robert

Reply to
Robert Adsett
