a couple of LC filter progs

On Apr 7, 2019, John Larkin wrote (in article):

OK. No bytecode "compiler" marketing nonsense. I run into this all the time, so I always ask.

OK. That makes more sense.

The easy and hard directions in high-order programming languages (HOLs) are often surprising.

What may be a dead-simple but relevant example is looping over a two-dimensional array Data[i, j]. In all HOLs, it matters which index increments faster, i or j, and in a big array the difference can be dramatic because of how caches and virtual memory work. A common problem when comparing the Fortran family and the C family is that which order is best differs between the two families. It all depends on how arrays are stored in memory, column-first or row-first. Whichever is first needs the faster index. So a full translation involves reversing how those indexes are used.
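In C terms, a minimal sketch of the two loop orders (the array name and sizes here are made up, not from any particular program):

```c
#define ROWS 1024
#define COLS 1024

static double data[ROWS][COLS];

/* C stores arrays row-major: data[i][j] and data[i][j+1] sit next to
   each other in memory, so j must be the faster (inner) index. */
double sum_row_order(void) {
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)      /* slower index */
        for (int j = 0; j < COLS; j++)  /* faster index */
            s += data[i][j];
    return s;
}

/* A literal translation of Fortran (column-major) loop order: same
   result, but each access strides COLS * sizeof(double) bytes, so
   caches and virtual memory work much harder on big arrays. */
double sum_col_order(void) {
    double s = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += data[i][j];
    return s;
}
```

Both return the same sum; only the memory traffic differs. A Fortran-to-C translation has to swap the loops to get the fast version.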

Twenty years ago, when I was involved with developing what is now called Middleware for a radar fire-control system for ship self defense against cruise missiles (no-kidding realtime - the final exam arrived at Mach 2, and fumbling was bad for your health), our main vendor had an interesting approach. They coded in plain C, inspected the generated assembly code, and tweaked the C code until the assembly code was clean and fast. It turned out that the resulting code was largely portable in that all C compilers generated clean, fast code from the same tweaked C source code, after the source code was tweaked to the first two compilers.

Joe Gwinn

Joseph Gwinn


Schools need to teach this [more often?]. If you're working in a low level language* like C, you need to think about the end product: ultimately, you're telling the compiler how to write a program. C is not exactly a specification or description language, but that's not an invalid perspective to take.

(Personally, I've seen VHDL taught that way, as describing instances of logic gates through semantic structures, but not C. I did get ASM and then C from the same prof, as did many other students on the same track, but without the explicit call to inspect one's machine code, I doubt anyone made the connection.)

*Face it, C is low. Well, "medium" would be more charitable. It's assembler, cleaned up with richer macros and an optimizer. If it were a high level language, it would know better than to give mere /developers/ the pointers^Hkeys to the missile silos!

As for the matter of writing mission-critical software in C, I will withhold judgement on that... :^)

Tim

--
Seven Transistor Labs, LLC 
Electrical Engineering Consultation and Design 
Website: https://www.seventransistorlabs.com/
Tim Williams

On Apr 9, 2019, Tim Williams wrote (in article ):

I should mention that the mission code was written in Ada83, and the language caused many problems. This was the last Ada project in that area - the Ada mandate was rescinded during the project.


The original intent of C was to develop a portable equivalent to assembly language, so UNIX could be ported from computer type to computer type. This is documented in K&R.

But whatever the language level, C won the language wars hands down. Ada83 was an early casualty.


One of the problems with Ada was precisely that it attempted to prevent errors by constraining the programmers. Having one's right hand tied to one's left foot does prevent mistakes, but it turned out that the price was too high.

Ada83 enforced a 1970s academic theory of how programs ought to be structured, from people who had zero experience of embedded realtime programming. There was zero support for hardware interfaces (no shared memory, no volatile variables, etc). The whole issue of priority inversion was simply missed - priority inversion was in the embedded-realtime lore, but had neither a name nor a literature, and so was invisible to academics. And Ada83 was locked down, and could not be changed to fix any of the many problems. So, we fixed such problems in the C-coded runtime, where Lady Ada could neither see nor touch - she lived in an artificial space designed to never surprise her.
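To make the hardware-interface gap concrete: the standard C idiom for polling a device register depends on volatile, which Ada83 had no equivalent of. A minimal sketch; the register name, ready bit, and the stand-in variable are invented here so it runs anywhere, whereas on real hardware the macro would point at a fixed MMIO address:

```c
#include <stdint.h>

/* Stand-in for a memory-mapped status register. On real hardware this
   would be a fixed address, e.g. (volatile uint32_t *)0x40001000. */
static uint32_t fake_status_reg;
#define UART_STATUS (*(volatile uint32_t *)&fake_status_reg)
#define TX_READY    (1u << 5)   /* invented bit position */

/* Without 'volatile' the compiler may hoist the load out of the loop
   and spin forever on a stale value; 'volatile' forces a fresh read
   on every iteration -- exactly the shared-memory / device-register
   support the Ada83 language itself lacked. */
int wait_tx_ready(int max_polls) {
    while (max_polls-- > 0) {
        if (UART_STATUS & TX_READY)
            return 1;   /* device ready */
    }
    return 0;           /* timed out */
}
```

In the Ada83 systems described above, this sort of thing had to live in the C-coded runtime, out of the language's sight.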


Too late. Most military mission-critical code is written in C++ these days, with direct hardware control in ANSI C. The operating system is some form of Linux.

Ada retains a niche in safety-critical code, where DO-178 is implemented. But DO-178 is so heavy that it actually does not matter what language one uses, and one can argue that assembler is safer than any HOL, because with assembler nothing is hidden.

Joe Gwinn

Joseph Gwinn

tweaked the C code until the assembly code was clean and fast. It turned out that the resulting code was largely portable in that all C compilers generated clean, fast code from the same tweaked C source code, after the source code was tweaked to the first two compilers.

Expanding on this some more --

Just in my humble experience alone, your writing style can massively affect code generation.

The optimizer is terribly, terribly far from exhaustive. (It /could/ be exhaustive -- but then users would complain of hours or years of compilation time for almost no benefit, and that's no good!) If it doesn't figure out any simple tricks, it's just going to pick the best, mediocre solution and let that be that.

And mind this affects execution time about as much as it does code size. Often, compact code runs faster, especially on simpler embedded platforms. (Yeah, when pipelines and caches get involved, unrolled and inlined operations get more attractive, and the discrepancy between compact and fast code can grow.)

Things the optimizer is likely to check, can range from modestly unrolling or reordering loops, to factoring numerical expressions, to inlining functions and operating on the resulting mega-function, and more. All of these grow quickly in complexity, however, and the pursuit can become self-defeating.

A recent example was a bit-packing function, on an 8-bit platform with hardware multiply. I wrote this a few different ways. By far, the worst was a mega-expression: between macros and carefully indented and inspected sub-expressions, the whole operation can be expressed entirely numerically. That's technically fine, but the compiler really throws up its proverbial hands and basically ends up writing out the expression long-form without any reuse of sub-expressions, or registers even(!).

The next best was using a bunch of variables (which are allocated on the stack normally, but these are optimized out quickly when there are enough free registers to put them there instead, which was the case here) to hold intermediate steps, and repeating common steps in a short loop. But keep in mind that, if the variables are allocated in registers... you can't loop over them. Doing it with a loop, forces it to allocate stack, get the pointer, and index the variables. Plus the memory accesses themselves, which are slower. It is not without overhead! (This is a much better deal on, say, classic 16-bit machines (x86, 68k), and most everything since.) It might even try it with the loop and array, then try it unrolled with registers, and keep the unrolled version because it's simply better!

I forget what I ended up with; I think I sliced the bit pattern differently, still using a loop but getting better reuse. It's still ugly, like hundreds of bytes for something nearly trivial if the data were byte aligned.
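As a hypothetical stand-in for that kind of bit-packing (the field layout here is invented, not the original function): packing four 6-bit values into three bytes with named intermediates gives the optimizer sub-results it can keep in registers, instead of one opaque mega-expression it writes out long-form.

```c
#include <stdint.h>

/* Pack four 6-bit fields into three bytes. The temporaries a..d are
   ordinary locals; with enough free registers the compiler allocates
   them there and never touches the stack. */
void pack4x6(const uint8_t in[4], uint8_t out[3]) {
    uint8_t a = in[0] & 0x3F;
    uint8_t b = in[1] & 0x3F;
    uint8_t c = in[2] & 0x3F;
    uint8_t d = in[3] & 0x3F;

    out[0] = (uint8_t)((a << 2) | (b >> 4));  /* aaaaaabb */
    out[1] = (uint8_t)((b << 4) | (c >> 2));  /* bbbbcccc */
    out[2] = (uint8_t)((c << 6) | d);         /* ccdddddd */
}
```

Note there is no loop over the temporaries: as discussed above, looping would force them into an indexed array on the stack and add memory traffic on a small 8-bitter.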

All of this optimizing is subject to the constraints of the functions executed within each expression or statement. C functions can do literally anything. Side effects are the bane of optimization. If the optimizer can't reason about being able to move a function up or down the expression tree, it's simply forced to treat that as a sequence point. (Sure, it could reason about the function's contents as if inlining it, but that would be extremely costly.)

As far as I know, the optimizer is bad at guessing what functions do, in terms of side effects, so it can help greatly (and this is why they put the features in there) to add hints about the nature of the function (e.g., using const with parameters, writing pure functions when possible, etc.).
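On GCC specifically (this poster's compiler), those hints look like the following; a minimal sketch with invented function names. __attribute__((const)) promises the result depends only on the arguments (no globals read, no side effects), while __attribute__((pure)) is the weaker promise of no side effects but possible reads of globals, so the optimizer may merge or hoist repeated calls:

```c
#include <stdint.h>

/* 'const': result is a function of the arguments alone, so two calls
   with the same arguments can be folded into one. */
__attribute__((const))
static int32_t clamp(int32_t x, int32_t lo, int32_t hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

static int32_t gain;   /* global state read by the 'pure' function */

/* 'pure': may read globals (gain), but writes nothing, so calls can
   be reordered or eliminated as long as gain hasn't changed. */
__attribute__((pure))
static int32_t scaled(int32_t x) {
    return x * gain;
}
```

Writing the function to actually honor the attribute is on you; GCC mostly takes your word for it.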

(FYI, most of my experience centers around GCC. Most of this is motivated by my own observations, illuminated by some of the official documentation about the optimizing step.)

Tim

--
Seven Transistor Labs, LLC 
Electrical Engineering Consultation and Design 
Website: https://www.seventransistorlabs.com/
Tim Williams

On Apr 9, 2019, at 21:50, Tim Williams wrote:

National tried to implement indexing over registers in their 16032 / 32032 processors. They didn't get it to work. It was also a bad idea: when the array is small, there is no real advantage; when it is large, it will not fit into the registers.

regards, Gerhard

Gerhard Hoffmann

Cool. It was something to try that didn't work out, but now we know.

There have been a number of architectures with memory-mapped registers (not including memory-as-register architectures, which are worse :^) ), but they never really seem to stick around. IIRC, AVR for example was introduced with it, but most examples today have dropped it.

In some cases, you can still hack it (self modifying code), but that's almost always worse (due to instruction pipelines and caches). Other times, it's specifically forbidden (physical ROM, Harvard architecture, execute-only memory spaces, etc.).

Perhaps ironically, the feature still lives on, but in a more limited way, as register renaming is a well established feature in more advanced hardware.

Tim

--
Seven Transistor Labs, LLC 
Electrical Engineering Consultation and Design 
Website: https://www.seventransistorlabs.com/
Tim Williams

On Apr 10, 2019, at 02:30, Tim Williams wrote:

The whole family was quite buggy. Some friends of mine tried to port Andy Tanenbaum's p-code machine to it.

DEC system 10. That wasn't bad.


OMG! TMS9900. What a turd.

regards, Gerhard

Gerhard Hoffmann

Yes, the TMS 9900 was interesting, but with main memory cycle times on the order of one microsecond it was s_l_o_w. However, these days, with modern caches and cache management, it could make sense.

upsidedown

One moderately successful example was the Sun / Oracle SPARC, with 24 of its 32 registers windowed into the stack.

A different story is that it needs branch delay slots in the traditional RISC way, which makes the assembly code pretty difficult to read (and harder still to write).

--

-TV
Tauno Voipio

On Apr 10, 2019, at 06:40, snipped-for-privacy@downunder.com wrote:

No, never ever. That design is bad to the bone.

There is a direct and unbreakable link between the computer's fastest and slowest operations.

In the time one can do sub (Rdest,flags), Rsrc1, Rsrc2, one has to decide in addition:

- is it in L1 cache? if not: is it in L2 cache? if not: is it in L3 cache?
- if not: is it at least in the L1 page tables? ... the L2 page tables?
- if not: is it swapped out? -> trap, needs software to handle
- if not: does it exist at all? -> trap

All that 3 times for a simple instruction. In addition to the normal complications for pipelining, multi issue, speculative execution.

Even large register files, a huge number of renaming registers, or overly large caches at a given level are bad. Not having constant replacement pressure means that the resource is too fat and therefore too slow.

Some registers are badly needed to keep the complexity out of EVERY instruction.

cheers, Gerhard

Gerhard Hoffmann

While this is true of any register in the register set after switching the workspace pointer, a read into cache loads more than a single register: it loads a full cache line, containing several or even all of the registers. After this initial load, all active registers are already in the cache and no more loads from main memory are needed.

Better yet, loading a new value into the workspace pointer could automatically load the full register set in one or more cache lines. This would be better for cache consistency.

Another issue with cache consistency is how to restore the modified values to main memory: either use "write through" to immediately save each modified cached register value, or "write back" the whole register set into main memory just prior to a workspace pointer reload.

Register renaming is just reloading register blocks, often with two or more workspace pointers.

Using bulk load of the whole register set does just that.

upsidedown

On Apr 9, 2019, Tim Williams wrote (in article ):

Yes, this matches my experience as well. Assembly code rules!

Joe Gwinn

Joseph Gwinn
