Assemblers are for hiding your work, not for faster code.

- John Larkin

Tue, Jan 9, 2007 9:33 PM

I do deep-embedded stuff, in assembly (because I like it) on the 68332, with most projects in the 4-8 kline sort of range (if I don't do something goofy like a 128 kbyte sine lookup.)

I use C32C, a "universal" table-driven cross-assembler, with home-brew (PowerBasic) pre and post processors to tweak the syntax and format the listings (including adding a table of contents.)

The "make" process, a batch file, preprocesses the source, assembles it, post processes the listing, and builds a rom image from the assembler S28 file and, typically, 2-5 Xilinx .rbt FPGA config files. The whole thing takes maybe 4 seconds.

This subroutine erases blocks in flash memory, the same flash we're executing code out of. So it must be relocated into CPU ram to execute, since flash disappears while any blocks are being erased. That sort of thing is easy in assembly.

The .SBTTL directive creates an entry in the table of contents:

.SBTTL  . FLASH BLOCK ERASE DRIVER

; THIS GETS RELOCATED TO CPURAM FOR EXECUTION

; D0 IS A BITMAP, WITH Bn NAMING BLOCK 'n' TO BE ERASED.
; ERASING BLOCK 0 IS *NOT* ALLOWED.

; USES D7

FEPRO:  TST.W   SILLY.W                 ; ARE WE IN RAM/DEBUG MODE?
        BNE.S   FLEE                    ; IF SO, SKIP FLASH COMMANDS!
        ANDI.W  # 7FEh, D0              ; TRIM MAP TO BITS 10..1
        BEQ.S   FLEE                    ; AND BAIL IF NONE!

; POKE THE 'BLOCK ERASE' PREAMBLE...

        MOVE.W  # 0AAh, 0555h*2.W       ; AA TO WORD 555   1
        MOVE.W  # 055h, 02AAh*2.W       ; 55 TO WORD 2AA   2
        MOVE.W  # 080h, 0555h*2.W       ; 80 TO WORD 555   3
        MOVE.W  # 0AAh, 0555h*2.W       ; AA TO WORD 555   4
        MOVE.W  # 055h, 02AAh*2.W       ; 55 TO WORD 2AA   5

; WE NOW WRITE 030h TO THE START OF ALL BLOCKS TO ZAP...

X = CPURAM + {ETAB-FEPRO}               ; 'ETAB' AS RELOCATED!

        MOVEA.W # X, A0                 ; AIM AT BLOCK POINTER LIST
        MOVE.W  # 10-1, D7              ; WHICH HAS 10 REAL ENTRIES

ELOOP:  BTST.L  # 1, D0                 ; TEST BIT1 FIRST!
        BEQ.S   ESKIP                   ; NO BIT, SKIP THIS ONE
        MOVEA.L (A0), A1                ; GOT BIT, GET VALID BLOCK POINTER
        MOVE.W  # 030h, (A1)            ; AND POKE '30' TO REQUEST ERASE

ESKIP:  ADDQ.L  # 4, A0                 ; HOP TO NEXT TABLE ADDRESS
        LSR.L   # 1, D0                 ; SLIDE BITMAP DOWN
        DBF     D7, ELOOP               ; AND TEST ALL BITS
        MOVE.W  # 6000, D0              ; NOW POLL FOR DONE, 6 SECONDS MAX

FLUTE:  MOVE.W  # 2000, D7              ; INNER LOOP IS 1 MSEC

FLOOG:  SUBQ.W  # 1, D7                 ;
        BNE.S   FLOOG                   ; 500 NS/LOOP IN CPU RAM, 10 CLKS

; WHEN DONE, THE 'FOFO' WORD WILL REAPPEAR...

        CMPI.W  # 0F0F0h, FOFO.W        ; CHECK AND
        BEQ.S   FLEER                   ; SKIP ON DONE

        SUBQ.W  # 1, D0                 ; NO GO, TOCK ANOTHER MILLISEC
        BNE.S   FLUTE                   ; AND LOOP.

; DO THE READ/RESET COMMAND IN CASE WE HAD AN ERROR...

FLEER:  MOVE.W  # 0F0h, DUMMY.W         ; F0 TO ANYWHERE WILL DO
        MOVE.W  # 20, D7                ; BUT TIME OUT > 10 USEC
FL10:   SUBQ.W  # 1, D7                 ; TO MAKE SURE FLASH
        BNE.S   FL10                    ; IS BACK ONLINE.

FLEE:   RTS

; FLASH BLOCK POINTERS:

        .LONG   0                       ; 0   16K
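
For readers who do not speak 68K, the block-selection part of the driver can be sketched in C. This is only an illustration of the bitmap handling (trim to bits 10..1, then walk the bits from bit 1 upward), assuming bit n maps to block n as the header comment states. The function name blocks_to_erase and the ERASE_MASK macro are made up for this sketch and do not appear in the listing; the flash command writes and the timeout polling are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ERASE_MASK 0x7FEu   /* bits 10..1; block 0 may never be erased */

/* Hypothetical helper: collect the block numbers the driver would erase.
   Returns how many block numbers were written to out[] (at most 10). */
size_t blocks_to_erase(uint16_t bitmap, unsigned out[10])
{
    size_t n = 0;

    bitmap &= ERASE_MASK;                 /* like ANDI.W # 7FEh, D0 */
    for (unsigned block = 1; block <= 10; block++) {
        if (bitmap & (1u << block))       /* the listing tests bit 1, then shifts */
            out[n++] = block;
    }
    assert(n <= 10);
    return n;
}
```

A bitmap of 0x0006 selects blocks 1 and 2; a bitmap of 0x0001 selects nothing, because bit 0 is masked off before the walk.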

- Didi

Wed, Jan 10, 2007 11:38 AM

John,

That is quite manageable and with your pre/post tooling, all within 4 seconds, sounds pretty good to me. I also like the cpu32 assembly, there are many reasons for that. I guess one of the important reasons why we do like it is the fact that the habit of writing comments telling the whole story while programming comes with it (a cultural thing, I suppose, but I have yet to see HLL code written like that). This is unbeatable when it comes to going back to your code a few years later. In fact I consider code commented less than that a waste of time. A few years ago, I included the PPC in my arsenal. Having about 15M source text for the CPU32 (back then), I had to figure out a way to move forward and use it.

I took the liberty (and 15 minutes :-) to manually precondition your code example and target it at the PPC. [I had to remove the spaces within the arguments and to convert the 0AAh-s to $0AA-s, I could have done 0xAa-s as well but no h-s :-) ]. The results are here: your code assembled for a cpu32 (should equal yours apart from some definitions I made up):

formatting link

assembled for the PPC, short list:

formatting link

assembled for the PPC, detailed native list:

formatting link

The source after my modification:

formatting link

Well, I have left the directory open,

formatting link

:-).

Nice to know I am not the only one left who likes that way of programming.

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
------------------------------------------------------


- CBFalconer

Wed, Jan 10, 2007 2:49 PM

Didi wrote: (And failed to maintain attributions)

Please do not strip attributions (the "joe wrote:" lines) for material you quote.

As long as you are talking about commenting style in assembly, here is a 30 year old example of mine, for an 8080 floating point package. The macros involved are all short and primarily exist for readability. For example, lfts is load floating from top of stack. lfbs is load b reg from arbitrary stack level. The final header comment specifies registers disturbed. push and pop are actually macros that keep track of the current stack depth in .lvl to enable reaching down for arguments etc. Save and reload push and pop complete floating point values. The code used only registers and stack, and thus was completely re-entrant.

;
; Evaluate polynomial in (DE.H) = x
; (DE.H) := A(N)*X^N + A(N-1)*X^(N-1) + ... + A(1)*X + A(0)
; Carry for arithmetic overflow
; (BC) specifies address of coefficients
; First coefficient is order of polynomial (128 max)
; A,F,D,E,H
poly:   save    bc.l
        ldax    b;              Get order
        inx     b;              Advance coeff pointer
        sfts    d;              Save argument
@arg    set     .lvl;           Argument stack address
        mvi     h,0;            Clear partial value
        push    psw;            Save order counter
poly1:  push    b;              Save coeff loc
        sfts    d;              Save partial value
        call    fload;          Get coefficient
        lfts    b;              Recover partial value to (BC.H)
        call    fadd;           Add in
        pop     b;              Coeff pointer
        jc      poly2;          Arith overflow
        pop     psw;            Order counter
        dcr     a
        jm      poly3;          Done
        push    psw;            Save order counter
        push    b;              Save coeff pointer
        mvi     a,.lvl-@arg
        call    lfbs;           Get argument
        call    fmul;           Multiply
        pop     b;              Restore coeff pointer
        inx     b
        inx     b
        inx     b;              Advance to next coeff
        jnc     poly1;          No arith error
poly2:  pop     b;              Error exit, purge stack
poly3:  pop     b
        pop     b;              Purge argument from stack
        reload  bc.l
        ret
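
Read as an algorithm, the loop above looks like Horner's rule: start from zero, add a coefficient, and multiply by x unless that was the last coefficient. A rough C sketch of the same control flow follows, assuming the coefficients are stored highest order first (which is how the add-then-multiply order reads). The name poly_eval is hypothetical, double stands in for the package's own floating-point format, and the carry-based overflow exit is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C rendering of the loop: coeff[0] is A(N), coeff[order] is A(0). */
double poly_eval(const double *coeff, size_t order, double x)
{
    assert(coeff != NULL);

    double acc = 0.0;                 /* mvi h,0 : clear partial value */
    for (size_t i = 0; ; i++) {
        acc += coeff[i];              /* fload + fadd : add next coefficient */
        if (i == order)               /* dcr a / jm poly3 : all coefficients used */
            break;
        acc *= x;                     /* lfbs + fmul : multiply by the argument */
    }
    return acc;
}
```

With coefficients {2, 3, 4}, order 2 and x = 2 this evaluates 2*x^2 + 3*x + 4 and returns 18.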

--
Chuck F (cbfalconer at maineline dot net)
   Available for consulting/temporary embedded and systems.

- John Larkin

Wed, Jan 10, 2007 4:25 PM

Cool. But how does...

MOVE.W #$0AA,$0555*2.W

get to be all of...

0000001C: 388000AA  addi    $0AA,r0,r4
00000020: 7C840735  extsh.  r4,r4
00000024: 4D4A5102  crandc  10,10,10
00000028: B0800AAA  sth     r4,$0555*2,r0

I like the 68K because it is fun to program in assembly. PPC doesn't look like fun, especially for people who can't type.

John

- Didi

Wed, Jan 10, 2007 5:03 PM

Hi John,

this is the complete emulation - it does set the CC bits and clear the carry bit in the CPU32, all of this emulated. No shorter way to do it for the PPC - but it is not an issue to me, I get about 3.5 times longer object code for the PPC if the source is pure CPU32, and a vast speed improvement (IIRC about 40-50 times for an 8 times faster clock). However, you can write it - if the context would allow it - like that:

move.w- #$aa,$555*2.w , which will leave CC unchanged because of the "-" character:

00000000: 38800055  addi    $55,r0,r4
00000004: B0800AAA  sth     r4,$555*2,r0

This makes the code no longer backward compatible with the CPU32, of course.

But wait to see the div 64/32 - the PPC has no such thing, so I had to do it like this:

00000008 38C0 0020 7CC9 03A6   divu.l- d1,d2:d3  - source line, resulting in:

00000008: 38C00020  addi    32,r0,r6
0000000C: 7CC903A6  mtspr   r6,288
00000010: 5526F802  rlwinm  r9,31,0,1,r6
00000014: 5527F87E  rlwinm  r9,31,1,31,r7
00000018: 7C865810  subfc   r6,r11,r4
0000001C: 7CA75110  subfe   r7,r10,r5
00000020: 7D000400  mcrxr   2
00000024: 418A0014  brc     12,10,5
00000028: 554A083C  rlwinm  r10,1,0,30,r10
0000002C: 516A0FFE  rlwimi  r11,1,31,31,r10
00000030: 556B083C  rlwinm  r11,1,0,30,r11
00000034: 48000014  b       5
00000038: 54AA083C  rlwinm  r5,1,0,30,r10
0000003C: 508A0FFE  rlwimi  r4,1,31,31,r10
00000040: 548B083C  rlwinm  r4,1,0,30,r11
00000044: 396B0001  addi    1,r11,r11
00000048: 421FFFD0  brc     16,31,-12
.....
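
The general technique for a 64/32 divide on a machine without one is shift-and-subtract, producing one quotient bit per pass, and the listing appears to set up a count of 32 for such a loop. The C sketch below shows that textbook method; it is not a transcription of the listing, and the name divu64_32 is invented here. It assumes the high word is smaller than the divisor, so the quotient fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: divide hi:lo by d using 32 shift-and-subtract steps.
   Returns the 32-bit quotient and stores the remainder in *rem. */
uint32_t divu64_32(uint32_t hi, uint32_t lo, uint32_t d, uint32_t *rem)
{
    assert(d != 0 && hi < d);           /* otherwise the quotient overflows */

    uint32_t r = hi;                    /* running remainder */
    uint32_t q = lo;                    /* low word turns into the quotient */

    for (int i = 0; i < 32; i++) {
        uint32_t carry = r >> 31;       /* bit about to be shifted out of r */
        r = (r << 1) | (q >> 31);       /* shift the r:q pair left by one */
        q <<= 1;
        if (carry || r >= d) {          /* divisor fits into the 33-bit value */
            r -= d;                     /* wraps to the right value if carry set */
            q |= 1;                     /* record a quotient bit of 1 */
        }
    }
    *rem = r;
    return q;
}
```

For example, dividing 1:0 (that is, 2^32) by 10 gives a quotient of 429496729 and a remainder of 6.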

PPC native assembly not only is not fun, I would say it is unusable for programming much more than a few critical lines. This is why I did the VPA thing, it not only incorporates the CPU32, it also allows me to use all PPC features - in a readable and maintainable manner. My PPC sources would be as readable to you as your CPU32 sources are to me, I guess. Tracing step by step while reading the native list is not a whole 3.5 times harder, actually I would say it is not harder at all - although one has to press "N" "CR" somewhat more :-).

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
------------------------------------------------------


- John Larkin

Thu, Jan 11, 2007 3:28 AM

I just did this today, a little thing to divide a 64-bit integer by 10.

I had been using a partial-product multiply by 0.1, which worked fine for most of my data (time delays) but rounded a teeny bit for DDS frequencies, which was both bad and customer-visible. Setting f=10 MHz gave f=10.00000001 MHz!

So I think this will work:

; DIVIDE D0:D1 BY 10, EXACTLY

D10:    MOVEM.L D3 D4, -(SP)            ; SAVE SCRATCHPADS

        MOVE.L  # 10, D4                ; IS COMRADE CONSTANT!
        CLR.L   D3                      ; TREAT HI AS 64-BIT INT
        DIVU.Q  D4, D3:D0               ; D0 = HI/10   D3 = REM
        DIVU.Q  D4, D3:D1               ; DIVIDE REM:LO BY 10, QUO IN D1

        MOVEM.L (SP)+, D3 D4            ; RETURN SCRATCH REGS
        RTS

I tested it in PowerBasic (using 64-bit integer variables) and it does seem to work. 4 lines of real code, not bad.
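
The same kind of check can be sketched in C. The helper divu_q below is a made-up model of the 64/32 divide as the listing uses it (quotient returned, remainder left in the high register), and div10 chains two of them exactly as the two DIVU.Q lines do: the remainder of the high word's divide becomes the high half of the second dividend. Both names are hypothetical and the model rests on my reading of the comments in the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the 64/32 divide: (*hi:lo) / d.
   Returns the quotient and writes the remainder back to *hi. */
static uint32_t divu_q(uint32_t d, uint32_t *hi, uint32_t lo)
{
    assert(d != 0 && *hi < d);          /* quotient must fit in 32 bits */
    uint64_t n = ((uint64_t)*hi << 32) | lo;
    *hi = (uint32_t)(n % d);
    return (uint32_t)(n / d);
}

/* Divide a 64-bit value by 10 with two chained 64/32 divides. */
uint64_t div10(uint64_t x)
{
    uint32_t d0 = (uint32_t)(x >> 32);  /* high word */
    uint32_t d1 = (uint32_t)x;          /* low word */
    uint32_t d3 = 0;                    /* CLR.L D3 */

    d0 = divu_q(10, &d3, d0);           /* D0 = HI/10, D3 = REM */
    d1 = divu_q(10, &d3, d1);           /* D1 = (REM:LO)/10 */
    return ((uint64_t)d0 << 32) | d1;
}
```

Because the first remainder is always below 10, the second divide can never overflow its 32-bit quotient, which is why the chain gives an exact result.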

John

- CBFalconer

Thu, Jan 11, 2007 7:28 AM

... snip ...

See

--
Chuck F (cbfalconer at maineline dot net)
   Available for consulting/temporary embedded and systems.

- Didi

Thu, Jan 11, 2007 3:58 PM

Hi John,

I also have found out that this combination won't go from itself in instrumentation making, eventually we have to act on it :-).

I am not quite sure what your assembler does, but the second "divu" line seems suspicious to me (looks like you just overwrite the remainder from the first div and divide d1 instead).

The CPU32 has indeed nice arithmetics, no doubt. The 32-bit PPC has also all of it except the mentioned 64/32 division... Well, the 32/32 takes an extra multiply and subtract to get to the remainder, but those are single cycle operations... and IIRC the divide was something like 17 cycles, that at hundreds of MHz. This is how it goes:

00000000 7D66 5B78 7D6B 4B96   divul.l- d1,d2:d3

00000000: 7D665B78  or      r11,r11,r6   dividend extra copy
00000004: 7D6B4B96  divwu   r9,r11,r11   actual divide - quotient in d3 (r11)
00000008: 7CE959D6  mullw   r9,r11,r7    quotient*divisor in r7
0000000C: 7D473050  subf    r7,r6,r10    remainder in d2 (or r10)
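
In C terms the four instructions amount to a divide followed by a multiply and a subtract to recover the remainder. A minimal sketch, with the hypothetical name divrem32:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: 32/32 unsigned divide with the remainder
   recovered as dividend - quotient * divisor. */
uint32_t divrem32(uint32_t n, uint32_t d, uint32_t *rem)
{
    assert(d != 0);
    uint32_t q = n / d;      /* the divwu step */
    *rem = n - q * d;        /* the mullw and subf steps */
    return q;
}
```

For 17 divided by 5 this returns 3 and leaves 2 in the remainder.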

Dimiter

------------------------------------------------------
Dimiter Popoff               Transgalactic Instruments
------------------------------------------------------


- werty

Thu, Jan 11, 2007 5:54 PM

______________________________________

But in Forthrite , there is NO delay of 4 seconds , nor assembly "procedure" to execute ..

thus one can , hold their thoughts , continue being creative no details to detour them ..

Snips from my Forthrite OpSys ..

{ If .............. Then.... ............ ............. Case ........................ Else................. EndIf Return Void() {

_____________________________________

The above { detour , done on purpose to show a bit of why the whole world hates C . ---------------------- And [ .......] [........] [.........] Xor[ 1 2 3 4 5 Fail[ 6 ] ] Fail[GoTo ....] ------------------

Forthrite does not write the conditions , because you get them from a previous context . You are HERE because you did not understand this thus you picture this levels "success" and then exit it if it fails , following this path Fail[GoTo..]

Its bad to put details in your mind that are not SUCCESS path .

Forthrite , a new OpSys for ARM ....

I am the fastest systems programmer on Earth , faster than Chuck Moore ...

- John Larkin

Thu, Jan 11, 2007 9:30 PM

How about this? The divide with remainder thing (at ACRID:) just pumps out digits.

.SBTTL  . ATIME : CONVERT TIME TO ASCII

; GIVEN A 64-BIT TIME IN PICOSECONDS, CONVERT THAT INTO AN ASCII
; STRING IN OUR "BCD" STRING BUFFER.

; INPUT IS IN D0:D1, MAX VALUE 99,999,999,999,999, 14 DIGITS!
; A0 WILL AIM AT RESULTING STRING

; WE'LL BREAK D0:D1 INTO TWO CHUNKS BY FIRST DIVIDING BY 1E9

ATIME:  MOVE.L  D7, -(SP)               ; SAVE OLE # 7
        MOVE.L  # 1000000000, D7        ; ONE WHOLE BILLION! WOW.
        DIVU.Q  D7, D0:D1               ; DO THAT DIVIDE

; D0 (REMAINDER) HOLDS THE LOW 9 DECIMAL DIGITS, 999,999,999 MAX
; D1 (QUOTIENT) NOW HOLDS THE NUMBER OF BILLIONS, 10,999 MAX

        MOVEA.W # BCX, A0               ; AIM AT END OF 'BCD' BUFFER, +1
        MOVE.W  # 9-1, D7               ; REQUEST 9 DIGITS
        BSR.S   ACORN                   ; CONVERT THEM, POKE INTO BUFFER

        MOVE.W  # 5-1, D7               ; MAKE READY FOR BIG 5 DIGITS
        MOVE.L  D1, D0                  ; COPY THE GIGAS
        BSR.S   ACORN                   ; AND DO INTO FINAL CONVERSION
        MOVE.L  (SP)+, D7
        RTS

; THIS LITTLE SUB CONVERTS THE CONTENTS OF D0 INTO
; AN ASCII STRING.

; D7 IS NUMBER OF DIGITS TO OUTPUT, DBF STYLE
; D0 HOLDS UNSIGNED LONG TO CONVERT
; A0 AIMS AT START, THE *LS* DIGIT OF THE STRING+1,
; MEANING WE'LL WORK BACKWARDS, -(A0) MODE

ACORN:  MOVEM.L D3 D5, -(SP)            ; MINIMIZE DAMAGE
        MOVE.L  # 10, D3                ; RADIX CONSTANT FOR HUMANOIDS

ACRID:  CLR.L   D5                      ; MAKE INPUT INTO 64-BIT INT
        DIVU.Q  D3, D5:D0               ; DIVIDE, REMAINDER IN D5
        ADDI.W  # "0", D5               ; CONVERT TO ASCII 0..9
        MOVE.B  D5, -(A0)               ; POKE A NEW DIGIT

        DBF     D7, ACRID               ; AND LOOP

        MOVEM.L (SP)+, D3 D5
        RTS                             ; ON EXIT, A0 AIMS AT THE MS CHARACTER
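
The structure of ATIME carries over to C almost line for line: split the value at 1E9, then emit digits backwards by repeated divide-by-10, low chunk first. The sketch below is an illustration under that reading, not the author's code. The names put_digits and atime are invented, and the terminating NUL is an addition of the sketch (the assembly works on a fixed "BCD" buffer).

```c
#include <assert.h>
#include <stdint.h>

/* Write `digits` decimal digits of v backwards, ending just before p.
   Returns a pointer to the most significant digit written (cf. ACORN). */
static char *put_digits(char *p, uint32_t v, int digits)
{
    while (digits-- > 0) {
        *--p = (char)('0' + v % 10);    /* remainder becomes the next digit */
        v /= 10;
    }
    return p;
}

/* buf must hold 15 bytes: 14 digits plus a terminator added by this sketch.
   Returns a pointer to the most significant character, as ATIME leaves A0. */
char *atime(uint64_t ps, char buf[15])
{
    assert(ps <= 99999999999999ULL);    /* 14 digits max, per the header comment */

    char *p = buf + 14;
    *p = '\0';
    p = put_digits(p, (uint32_t)(ps % 1000000000u), 9);  /* low 9 digits */
    p = put_digits(p, (uint32_t)(ps / 1000000000u), 5);  /* the billions */
    return p;
}
```

Calling atime(20000000000001ULL, buf) fills buf with "20000000000001", a 20-second delay plus one picosecond.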

The box I'm working on makes up to 20-second time delays with 1 ps resolution. I don't own a calculator that can help me compute scale factors to that sort of precision.

John

- John Larkin

Thu, Jan 11, 2007 9:37 PM

The point is not to be fast, but to be right. No doubt you code faster than I do, so a 4-second delay would matter to you. I'll often write a program, print a listing, and read it through over a period of days before I actually try to run it. And most of it generally works the first pass, and what bugs I have are usually hard crashes that are quickly found and fixed.

Assembly slows me down enough that I get the final, documented, bug-free product out a lot faster.

Whenever somebody around here wants to buy fancy tools to code faster, I get wary. Programming is not ditch digging: slow down and think!

OK, show us some code.

And yes, all good people hate C.

John

- John Larkin

Fri, Jan 12, 2007 2:10 AM

Show us some code.

John

- -jg

Fri, Jan 12, 2007 4:37 AM

Looks good - Just a quick clarify, as this is not an ASM I am used to

Is DIVU.Q the quad form, and doing this

DIVU.Q 32den , 32NumU,32NumL as 64/32 --> 32r:32q

32r:32q values placed into registers that were holding 32NumU,32NumL ?

-jg

- John Larkin

Fri, Jan 12, 2007 4:51 AM

Right.

This is the CPU32 variant of the 68K. I think Motorola called this op DIVUL.L. I called it DIVU.Q ("quadword") 'cause I like that better.

DIVU.Q D0, D1:D2

divides the 64-bit thing D1:D2 (D1 is MS 32 bits, D2 is LS) by D0 (32 bits.) D2 is the quotient, 32 bits max, and D1 is the remainder.

I use a slightly fractured assembler for the 68332, staying as close to PDP-11 assembly syntax as practical. MACRO-11 was a damned fine assembler. We wrote cross assemblers for totally different machines entirely as MACRO-11 macros.

A: equ 12 ; is barbaric

A = 12 ; is genteel

John

- werty

Fri, Jan 12, 2007 6:37 AM

Why flood ?

Forth names these opcodes to Primitives , then groups Primitives into MidLevels and names them , then groups MidLevels into high levels and names them .

At a Forth terminal , you can read ur src code , and expand/contract any part of it , instantly , to eliminate clutter , Its easier to follow the "flow"

Im creating a Forth GUI ( No Text) , so no need to READ anything ! Your src code is images / icons / concepts , not in textual form .

Cant buy hardware . Im looking for EVB , ARM-9 Emailed ATMEL 5 times , they wont answer . Also need a debugger ,

like i had 20 years ago . "SST+" by Murray Sargent III , hes with M$ now . But for ARM -9

44KB of assm/dis-assm , mem dumper , code pattern finder

Ill be making a pocket PC . I will put Forth in the 128KB Flash-ROM , but move it to ext-SRAM to run .

- Michael N. Moran

Fri, Jan 12, 2007 12:21 PM

(I can't believe that I'm actually going to feed this thing.)

WTF has a flood to do with anything?

The legend continues ... blah blah ... show us the code.

Clearly ... reading (and writing / communicating) is sooo last century.

--
Michael N. Moran           (h) 770 516 7918
5009 Old Field Ct.         (c) 678 521 5460
Kennesaw, GA, USA 30144    http://mnmoran.org

"So often times it happens, that we live our lives in chains
  and we never even know we have the key."
The Eagles, "Already Gone"

The Beatles were wrong: 1 & 1 & 1 is 1

- John Devereux

Fri, Jan 12, 2007 12:46 PM

unsigned long long a;

. . . .

a /= 10;

:)

--

John Devereux

- Ian Bell

Sat, Jan 13, 2007 7:02 AM

I don't know. Why are you?

ian

- werty

Sat, Jan 13, 2007 9:38 PM

English is a translation , it has no use in computers , it causes buggy code . Use the right side to make images of OpCodes . Then string them together , for perfect code . There is no faster way .

You are wr> The second, and by far more important, is to tell other programmers

- John Larkin

Sun, Jan 14, 2007 7:05 PM

I suppose that means that I will *not* be receiving your resume any time soon.

John