What micros do you actually hate to work with?

I know this differs from the mainstream beliefs, but it is a fact of life. Not with every assembler, and not by just anyone (I repeat this only to avoid being quoted out of context); see my other postings.

Most of them would probably not be allowed to try; another large part would probably just be wary of going against the mainstream; and the keyword may be "enough".

Well, I can assure you that in VPA it cannot possibly take 1/5 of that, no matter how much you overblow it with features and whatever. I have far more sophisticated things than a mobile phone behind me, which fit within some 10 M of source text... The factor here will be several times 10, not just 10. Just place an order and we'll do it for you.

Dimiter

------------------------------------------------------ Dimiter Popoff Transgalactic Instruments

formatting link

------------------------------------------------------

Wilco Dijkstra wrote:

Reply to
Didi

Don't tell me how trivial it is; DO IT, then we'll see.

Reply to
cbarn24050

Thanks, Walter. One of the things I depended upon, and was barely able to achieve in close work with the assembly, was a _small_ fixed size for each state. By this, I mean two words or four words (different machines). If the C compiler were unable to achieve this packing in even a single state's case, it would NOT be able to take advantage of the space packing as I did. It only takes one "spill" to screw up the entire thing.

Interesting.

Jon

Reply to
Jonathan Kirwan

Sadly, it is proprietary and I'm pretty sure I couldn't get permission to post it up. Plus, folks would need the exact compilers used. I wish I could make it public. Of course, it wouldn't prove anything except that such cases do sometimes happen.

What precisely do you attribute these changes to? At least, what are the few more important changes? I'm betting that they were "nothing new" in terms of ideas, so much as the inclusion of ideas that had been around for quite some time, as yet unimplemented in your compiler.

But the details would tell.

And some of the early choices made on internal structures. Once you get mired in one model, it can be very difficult to implement other algorithms as the structure changes needed can be pervasive and therefore risky to attempt. Someone with deep knowledge and an idea of where they may be headed can do a lot towards getting those structures in the right shape at the outset for later additions.

I don't consider the obvious benefits from faster processors and more memory, which let you raise the threshold at which you abort further attempts to move work around for better code production, to be remarkable in any way at all. It's just the obvious. Like saying that more memory and faster processors make sorting faster because you can put more of the data in memory while sorting it in sections. If that's what you meant. And it may not be, at all.

But if it turns out that part of the 50%/30% figure you mentioned above is about having access to faster CPUs and more memory, so that you opened up your data structures and changed your DAG and basic-block walking algorithms to spend more 'effort', then I'm not going to take that part as having been an improvement due to software and the diligent, intelligent application of human effort towards more modern compilation ideas.

Understood. I've seen, a while back, the ability for compilers to "lose" the prologue/epilogue (or restrict it) in cases where it wasn't needed. Appropriate.

But really... Can you single out a commercial compiler anywhere today and point out within it a feature that wasn't known _and_ well documented in book form about 25 years back? I have a variety of Ph.D. theses (including, for example, the Bulldog VLIW thesis, but in no way limited to that) that really _do_ propose very useful ideas with implementation details that I've _yet_ to see anywhere at all. And some of them are actually valuable, too, in real applications. Not pie-in-the-sky stuff.

Not to mention those other good ideas, as yet still unimplemented in available compilers today, that _were_ documented 25 years back and more -- and which are still gathering dust for lack of use.

No. I think compilers were pretty good back in the early '90s. Just perhaps that ROMP Pascal wasn't. They were pretty good in the '80s, too -- including implementations of instruction reordering and trace scheduling. They are pretty good today. The point is that they are only "pretty good," still today. And I've lived a long enough life to have looked forward to higher expectations over time. (Something like those old science fiction stories from the 'rocket' days in the late '50s and early '60s, none of which ever dared to imagine that humans _wouldn't_ already be living on the moon and maybe even Mars by now -- boy, were they way optimistic.)

I blame a marketplace that doesn't understand the technology well and couldn't recognize the production of copies in loops, copy propagation, strength reduction, or the elimination of induction variables if they hit it in the face, let alone care about them. They know as much about the advanced technology in the compilers they buy as a buyer of a PC monitor knows about the design of the horizontal oscillator in the unit they are buying and taking home -- marginal and likely doomed to fail in a few years.

Instead, as long as the product works okay at first, they buy compilers based on IDE features, docking toolbars, make features, price, what others are using, and the like. They read magazine comparisons and look for more check-offs than less.

So that's what vendors sell. Because they want to stay in business.

Implementing interesting (read: tricky) and promising, yet so far not market-proved optimization ideas is way down on the list of where time is spent. There are some really nifty compiler ideas out there. Some old, some new, things I'd really like to see tested out in broader practice. I won't hold my breath. The market isn't demanding them.

But I am allowed to dream.

Jon

Reply to
Jonathan Kirwan

That's an excellent goal, I think.

I'd like to be able to specify "hints" to the compiler, where I understand what is going on and can provide useful optimization knowledge to the compiler. The "fixed-execution-time" thing would be one example. Another would be, "This if-condition takes place 90% of the time." Stuff like that.
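
(For what it's worth, some compilers already take a hint of roughly that shape. A minimal sketch using the GCC-style __builtin_expect builtin -- the wrapper function and the names status, OK, handle_ok and handle_error are made up for illustration, assuming a compiler that offers the builtin:)

    extern void handle_ok(void);
    extern void handle_error(void);

    void poll_device(int status)
    {
        enum { OK = 0 };
        /* Hint that the condition is almost always true, so the compiler
           can lay the common path out as the straight-line fall-through. */
        if (__builtin_expect(status == OK, 1)) {
            handle_ok();          /* the ~90% case                        */
        } else {
            handle_error();       /* the rare case, branched out of line  */
        }
    }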

And I still want coroutines (thunking, iterators, etc.). They can be had without requiring anything different from what already exists on the same hardware a normal C compiler's output can reside on. And at no cost to anyone not using them. That is one place where a C compiler would be lots better than my mickey-mousing around in assembly to patch it in.
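
To illustrate the kind of thing I mean, here is a rough, untested sketch of an iterator-style coroutine faked in plain C with a static state and a switch -- no special compiler support assumed, and no cost to code that doesn't use it:

    /* Yields 0..9 on successive calls, then -1; resumes where it left off. */
    int next_value(void)
    {
        static int state = 0;
        static int i;
        switch (state) {
        case 0:
            for (i = 0; i < 10; i++) {
                state = 1;
                return i;          /* "yield" i                        */
        case 1: ;                  /* re-enter here on the next call   */
            }
        }
        state = 0;                 /* sequence exhausted; start over   */
        return -1;
    }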

;)

Of course, I still want those compiler optimizers I like to dream having. Certainly, these darned 4Gb RAM, 3GHz dual-CPU, read-around-write, fancy billion transistor beasts we now routinely put on our desktops could support a little something more than I got back in 1980. One might think, anyway. :)

Jon

Reply to
Jonathan Kirwan

Maybe you should try "Protothreads".

formatting link

--
Best Regards,
Ulf Samuelsson
This is intended to be my personal opinion which may,
or may not be shared by my employer Atmel Nordic AB
Reply to
Ulf Samuelsson

Please read the above. Ideally I want a proper spec describing the load, waveform, period, the accuracy of the ammeter, whether this is an interrupt function or not, the accuracy of the answer, etc., but I accept the assembler code as a (rather poor) substitute.

Wilco

Reply to
Wilco Dijkstra

In other words, you believe that with your particular assembler syntax you can outcode everybody else, right?

Yes, I'm sure you've done that a few times. Maybe you should have a go at rewriting Windows; it's only 100 million lines.

How about a compression algorithm that compresses every file by at least a factor of 2? Or 10 times if that is too easy...

Wilco

Reply to
Wilco Dijkstra

I meant a small example doing something similar. You were saying elsewhere it was mostly caused by switch statements. I'm having trouble understanding how you could ever get anywhere near 6 times expansion with just that.

Assuming an 8-bit target, a switch statement has a small constant overhead, a table entry for each case (1 byte, or 2 bytes for a large switch), and for each break a jump to the end of the switch statement (2 bytes if nearby, 3 bytes if far). If you, as you say, arrange the code of each case to be a power of 2, then the minimum useful size is 4 bytes per case for a small switch (2 instruction bytes, one branch), and 8 bytes per case for a large switch (5 instruction bytes, one branch).

So for a small switch the table has a 25% overhead (5 vs 4); for a large switch it is also 25% (10 vs 8). If you happen to return in your code, but the compiler is not able to do branch chaining of the breaks, then it would be 6 vs 4 or 12 vs 8 (50% overhead).
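
To make the arithmetic concrete with a toy case (the byte counts in the comments are just the assumed 8-bit-target figures above, not measured output from any particular compiler):

    unsigned char dispatch(unsigned char state, unsigned char x)
    {
        switch (state) {           /* small constant dispatch overhead       */
        case 0: x += 1;  break;    /* ~2 instruction bytes + 2-byte break    */
        case 1: x -= 1;  break;    /*   jump = 4 bytes of code per case      */
        case 2: x <<= 1; break;    /* + 1 table byte per case: 5 vs 4,       */
        }                          /*   i.e. roughly the 25% table overhead  */
        return x;
    }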

I fail to see a 6x factor...

From my experience the average overhead of jump tables in embedded applications on ARM is around 5%.

Wilco

Reply to
Wilco Dijkstra

I understand the question and applaud it. I don't have the time right now to spend going back through a system I no longer have direct access to, today. My recollection of the memory footprint is approximately correct, though, since it was something that I was surprised by at the time and it kind of 'sticks in mind.'

But we shall have to leave it there for now. Do you know of any specific similar experiences, though? I'd be interested in hearing other example cases, you know.

Jon

Reply to
Jonathan Kirwan

By the way, I'd like you to follow up on my question regarding the following exchange. Since you had professional experiences here, I'd like to hear what you say:

What precisely do you attribute these changes to? At least, what are the few more important changes?

I'd be interested to hear.

Thanks, Jon

Reply to
Jonathan Kirwan

Thanks for not believing me. It is, like I said before, one of the best compliments I get. I have not done it a few times, just about twice (about 20 M of source text over the last 10 years or so).

Actually Windows has little if anything to give you that my DPS cannot, and there is a lot DPS can give you which Windows cannot give you. Here the ratio between the resources taken is much, much larger than 10 (more like 1000+). This is all meant in technical terms, not marketing.

Thanks again for the compliment - I already said something along this line elsewhere in the thread. BTW, I have encountered people deeming what I have impossible, as you suggest in your example above, who have had a lot less trouble realizing what they are facing than you seem to have.

I am sure you will be a lot more amazed if you actually understand what resources I have here. Our services are on the market.

Dimiter

------------------------------------------------------ Dimiter Popoff Transgalactic Instruments

formatting link

------------------------------------------------------

Wilco Dijkstra wrote:

Reply to
Didi

I now realise you actually believe the lies you're telling. This is really sad. I suggest you seek medical help - urgently. Please mention the word "megalomania" to the doctor; they are professionals and know how to treat this condition. With the right drugs you will be on your feet in no time. I hope you get well soon.

Wilco

Reply to
Wilco Dijkstra

Some of the large improvements were loop unrolling, inlining, and library optimizations (floating point emulation, memcpy etc). A smart calling standard helped too - I quickly got rid of frame pointers and other unnecessary baggage like compiler temporaries choking up the registers.

But those together amount to less than a quarter of the total gain. Most of the improvements came from concentrating on very low level optimizations and fine tuning every bit of the compiler. This was possible by continuous benchmarking to ensure progress.

I did a lot of staring at compiler output of actual customer code to spot cases where the compiler could improve. By focusing on things that matter (i.e. real-world code) it was possible to identify the weak spots and fix them. In most of these cases simple optimizations were all that was needed, for example expression simplifications, peepholes or tuning of a particular optimization.

I have done a lot of comparisons with the competition. Almost invariably they concentrate on high-level optimizations and can do interesting stuff that I could only dream of. However we still won due to using the instruction set more efficiently and so emitting fewer instructions in the end. For a comparison, GCC still hasn't beaten our 10 year old compiler, despite a huge amount of effort and using the latest advances (fashion?) in compiler technology.

Using the instruction set efficiently appears to be the key. If you teach the compiler to use instructions with writeback, conditional execution, select the more esoteric instructions like wide or narrow multiplies, etc then you end up with code that is as good as assembler written by an expert. I've been surprised more than once by the way the compiler combines several simple optimizations and ends up with amazing code.

To give a small example - a byte-wise strcmp:

while (t0 = *p++, t1 = *q++, t0 >= 1 && t0 == t1);

translates to the following optimal ARM code:

loop    LDRB    t0, [p], #1
        LDRB    t1, [q], #1
        CMP     t0, #1
        CMPCS   t0, t1
        BEQ     loop

This is a feature called conditionalized comparisons. Compares can be conditionally executed just like all other instructions on ARM to avoid branches. The compiler can combine several comparisons like this with varying conditions and && or || combinations. It's a trick few assembly programmers know, let alone use...
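
For reference, a compilable form of that fragment (a sketch; the wrapper function and declarations are mine, not the original source):

    int my_strcmp(const char *p, const char *q)
    {
        unsigned char t0, t1;
        /* Loop while the byte from p is non-NUL and matches the byte from q. */
        while (t0 = *p++, t1 = *q++, t0 >= 1 && t0 == t1)
            ;
        return t0 - t1;            /* <0, 0 or >0, strcmp-style */
    }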

As another example, ARM has multiple load and store instructions which are often used to save & restore registers at function entry and exit. You also may have to adjust the stack pointer. Rather than using 4 instructions it is often possible to push a few extra registers thus needing only 2 instructions. This kind of trick makes the code that deals with this extremely complex as it needs to handle many special cases, but it helps a lot as functions are small on average.

So there you are. Probably not what you wanted to hear... Compiler design is not about sexy new optimizations, it's just plain good old engineering, implementing a huge amount of smart tricks and taking care of every single detail while at the same time understanding what matters most in the big picture. Very challenging but rewarding!

Wilco

Reply to
Wilco Dijkstra

I remember noticing this when scanning a manual.

But some of us would, of course, notice and use at times.

Understood. Special cases are exactly where a human can excel, at least pretty often.

Actually, this was a lovely discussion. Some of the above is what I'm quite used to seeing in compilers. Some makes me a little interested in looking more at ARM, again.

Have you taken a look at the ADSP-21xx processors? They allow special case packing (you mention "many special cases" and the ADSP-21xx has a bevy of such things) of up to three instructions in a single word. I will often promote calculations or moves many tens or hundreds of instructions behind where they are actually needed, in order to achieve high density. I'd love to show you a few examples for you to think on how you'd manage them in a C compiler. Frankly, I don't doubt it could largely be done -- but it would be fun to discuss exactly how. This would definitely include trace scheduling to move calculations above branch edges, for example.

Thanks for the time. Much appreciated. I will pose to you this example:

unsigned int gcd (unsigned int a, unsigned int b)
{
    if (a == 0 && b == 0)
        b = 1;
    else if (b == 0)
        b = a;
    else if (a != 0)
        while (a != b)
            if (a < b)
                b -= a;
            else
                a -= b;
    return b;
}

I think Walter will remember this one.

You can easily deduce the boundary conditions this routine is required to handle from a cursory examination of the code, so I won't write a spec for it. None needed.

I have taken the time on the x86 to run this through quite a number of compilers and _none_ of them can 'see' the topology change that a good assembly programmer can, in writing this for the x86.

I have never written a single line of ARM assembly code and I am completely unfamiliar with the instruction set, except from a very general overview point of view, but I might be able to try my hand at it just the same with something smaller like this. It will take me some work to get familiar and, probably, I will miss some details you will then remind me about. Perhaps you might consider popping this through your compiler (or another one, as you like) and then also telling me what you personally can "see" here as some alternative ways of looking at it.

I honestly only know where this takes me with the x86, so I'm at a disadvantage in the sense that I do NOT know how this plays elsewhere and it certainly may toss me some surprises, too. You know compiler technology better than I do, as well. I'm just a lowly application programmer. But the process is always interesting for me to see unfold.

I don't mean any of this as some kind of C vs assembly challenge, but only as a learning experience for all. No matter how this plays out, it does NOT reflect on either. It's just interesting, that's all.

Jon

Reply to
Jonathan Kirwan

By the way, such a routine is not merely some theoretical exercise. It shows up in Stern-Brocot trees, representations of rationals, and mediant fractions; in the concepts of congruence, the modulo function, residue functions, as well as relative primality; and in recurrences and power functions -- off the top of my head. I've certainly used the concept in instrumentation where I need to find a commensurate period of time which evenly divides two other rather arbitrary times, in order to arrange an exact timer period which I can use to drive both processes exactly, for example.
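
For instance, something along these lines, using the gcd() above (the periods are made-up numbers, and program_hardware_timer is a hypothetical helper, just to show the shape of it):

    /* Sketch: derive an exact common timer period for two processes that
       need servicing every 1500 us and every 2400 us. */
    void set_up_timer(void)
    {
        unsigned int t1 = 1500, t2 = 2400;       /* periods in microseconds    */
        unsigned int tick = gcd(t1, t2);         /* 300 us commensurate tick   */
        unsigned int n1 = t1 / tick;             /* process 1: every 5th tick  */
        unsigned int n2 = t2 / tick;             /* process 2: every 8th tick  */
        program_hardware_timer(tick, n1, n2);    /* hypothetical helper        */
    }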

Jon

Reply to
Jonathan Kirwan

Here's your example written for compiler input

--

   unsigned int gcd (unsigned int a, unsigned int b)
   	{
#if 0
	if (a == 0 && b == 0)
		b = 1;
	else

	if (b == 0)
	    b = a;
	else

	if (a != 0)
		{
		while (a != b)
			{
			if (a < b)
				b -= a;
			else
				a -= b;
			}
		}
	
	return b;
#else
	if (a == 0 && b == 0)
		return 1;
	else

	if (b == 0)
	    return a;
	else

	if (a != 0)
		{
		while (a != b)
			{
			if (a < b)
				b -= a;
			else
				a -= b;
			}
		}
	
	return b;
#endif
	}
Reply to
Tauno Voipio

Well, what do I say. I do understand the motivation driving your denial, but, well... plonk.

Dimiter

------------------------------------------------------ Dimiter Popoff Transgalactic Instruments

formatting link

------------------------------------------------------

Wilco Dijkstra wrote:

Reply to
Didi

Understood. I'd like to see both outputs, though. More educational. Also, can you tell me why you think a compiler should care? As a theoretical matter, I mean. Shouldn't it be able to make the same output either way? (We can discuss why I imagine it should be able to do so, as we get into this, I suppose.)

This is ARM output? If so, I take it ARM7 and not THUMB? Anyway, thanks.

Well, I'm not going to be quick on the uptake just yet. I need to spend some time reading and thinking a bit about the ARM instructions -- as it is pretty much unfamiliar territory. But I'm going to bite in and see.

I started using C in some earnest in 1978, with the V6 Unix kernel. Assembly goes to my very roots of coding, which is 1972.

Can you generate this for x86, too? It would be quicker for me to compare, since I come already prepared there and since GCC should be able to handle that.

Thanks, Jon

Reply to
Jonathan Kirwan

Hi guys,

Here is what I get from Visual Studio .NET 2003 (x86), with speed optimization on:

  00000  8b 4c 24 04        mov     ecx, DWORD PTR _a$[esp-4]
  00004  85 c9              test    ecx, ecx
  00006  8b 44 24 08        mov     eax, DWORD PTR _b$[esp-4]
  0000a  75 0a              jne     SHORT $L1490
  0000c  85 c0              test    eax, eax
  0000e  75 1d              jne     SHORT $L1489
; Line 607
  00010  b8 01 00 00 00     mov     eax, 1
; Line 627
  00015  c3                 ret     0
$L1490:
; Line 610
  00016  85 c0              test    eax, eax
  00018  75 03              jne     SHORT $L1492
; Line 611
  0001a  8b c1              mov     eax, ecx
; Line 627
  0001c  c3                 ret     0
$L1492:
; Line 616
  0001d  3b c8              cmp     ecx, eax
  0001f  74 0c              je      SHORT $L1489
$L1599:
; Line 618
  00021  73 04              jae     SHORT $L1498
; Line 619
  00023  2b c1              sub     eax, ecx
; Line 620
  00025  eb 02              jmp     SHORT $L1499
$L1498:
; Line 621
  00027  2b c8              sub     ecx, eax
$L1499:
; Line 616
  00029  3b c8              cmp     ecx, eax
  0002b  75 f4              jne     SHORT $L1599
$L1489:
; Line 627
  0002d  c3                 ret     0
Reply to
Isaac Bosompem
