Making Fatal Hidden Assumptions

CBFalconer · 2006-03-06T23:05:49+00:00

We often find hidden, and totally unnecessary, assumptions being made in code. The following leans heavily on one particular example, which happens to be in C. However, similar things can (and do) occur in any language.

These assumptions are generally made because of familiarity with the language. As a non-code example, consider the idea that the faulty code is written by blackguards bent on fouling the language. The term blackguards is not in favor these days, and for good reason. However, the older you are, the more likely you are to have used it since childhood, and to use it again, barring specific thought on the subject. The same type of thing applies to writing code.

I hope, with this little monograph, to encourage people to examine some hidden assumptions they are making in their code. As ever, in dealing with C, the reference standard is the ISO C standard. Versions can be found in text and pdf format, by searching for N869 and N1124. [1] The latter does not have a text version, but is more up-to-date.

We will always have innocent appearing code with these kinds of assumptions built-in. However, it would be wise to annotate such code to make the assumptions explicit, which can avoid a great deal of agony when the code is reused under other systems.

In the following example, the code is as downloaded from the referenced URL, and the comments are entirely mine, including the 'every 5' line number references.

/* Making fatal hidden assumptions */
/* Paul Hsieh's version of strlen.
   Some sneaky hidden assumptions here:
   1. p = s - 1 is valid. Not guaranteed. Careless coding.
   2. cast (int) p is meaningful. Not guaranteed.
   3. Use of 2's complement arithmetic.
   4. ints have no trap representations or hidden bits.
   5. 4 == sizeof(int) && 8 == CHAR_BIT.
   6. size_t is actually int.
   7. sizeof(int) is a power of 2.
   8. int alignment depends on a zeroed bit field.
   Since strlen is normally supplied by the system, the system
   designer can guarantee all but item...
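One way to act on the advice to annotate assumptions is to state them where the compiler can check them. The following is only a sketch, assuming a C11 compiler (for _Static_assert); the name portable_strlen is hypothetical and is not the code under discussion. It checks two of the listed assumptions at compile time and gives a plain fallback that relies on none of them.

```c
#include <limits.h>
#include <stddef.h>

/* Assumption 5 (in part): bytes are 8 bits wide. Checked at compile time. */
_Static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

/* Assumption 7: sizeof(int) is a power of 2. Checked at compile time. */
_Static_assert((sizeof(int) & (sizeof(int) - 1)) == 0,
               "this code assumes sizeof(int) is a power of 2");

/* Hypothetical portable fallback: relies on none of the listed assumptions. */
size_t portable_strlen(const char *s)
{
    const char *p = s;          /* starts at s, never forms s - 1 */

    while (*p != '\0')
        p++;
    return (size_t)(p - s);     /* non-negative by construction */
}
```

On a system where either assertion fails, the build stops with the message instead of producing code that misbehaves at run time.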

Ed Prochak 20 years ago

Funny, I always called it a glorified assembler.

It fills a niche that few true high level languages can.

Would a better universal assembler be more like assembler or more like high level languages? I really think C hit very close to the optimal balance.

Good programmers definitely have to be multilingual. ed

Keith Thompson 20 years ago

[...]

Whether C is a "universal assembler" is an entirely separate question from whether C is "good", or "better" than something else, or close to some optimal balance.

As I understand the term, an assembly language is a symbolic language in which the elements of the language map one-to-one (or nearly so) onto machine-level instructions. Most assembly languages are, of course, machine-specific, since they directly specify the actual instructions. One could imagine a more generic assembler that uses some kind of pseudo-instructions that can be translated more or less one-to-one to actual machine instructions. C, though it's closer to the machine than some languages, is not an assembler in this sense; in a C program, you specify what you want the machine to do, not what instructions it should use to do it.

Forth might be an interesting data point in this discussion, but if you're going to go into that, please drop comp.lang.c from the newsgroups.

Keith Thompson (The_Other_Keith) kst-u@mib.org San Diego Supercomputer Center We must do something. This is something. Therefore, we must do this.

Paul Keinanen 20 years ago

I don't see much room for a "universal" assembler between C and a traditional assembler, since the instruction sets can vary quite a lot.

However, there exists a group of intermediate languages that try to solve some problems in traditional assembler. One problem is the management of local labels for branch targets; the other is that, with usually one opcode per source line, the program can be quite long. Both these things make it hard to get a general view of what is going on, since only a small fraction of the operations will fit into the editor screen.

To handle the label management, some kind of conditional and repeat blocks can be easily created. Allowing simple assignment statements with assembly language operands makes it easier to put multiple instructions on a single line.

Such languages have been implemented either as a preprocessor to an ordinary assembler or directly as a macro set at least on PDP-11 and VAX. The source code for PDP-11 might look like this:

    IF R5 EQ #7
        R3 = Base(R2) + (R0)+ - R1 + #4
    ELSE
        R3 = #5
        SETF            ; "Special" machine instruction
    END_IF

would generate

            CMP R5,#7
            BNE 9$          ; Branch on opposite condition to else part
            MOV Base(R2),R3
            ADD (R0)+,R3
            SUB R1,R3
            ADD #4,R3
            BR 19$          ; Jump over else part
    9$:                     ; Else part
            MOV #5,R3
            SETF            ; Copied directly from source
    19$:

The assignment statement was always evaluated left to right, and no parentheses were available to change the evaluation order. The conditional expression consisted of one or two addressing mode expressions and a relational operator that translated to a conditional branch instruction. There is no point in trying to invent new constructions for each machine instruction, so there is no problem in inserting any "Special" machine instructions in the source file (in this case SETF), which are copied directly to the pure assembler file.

For a different processor, the operands would of course be different, but the control structures would be nearly identical.

Paul

Ed Prochak 20 years ago

The one-to-one mapping is broken for a macro assembler. Often there are macros defined for subroutine entry and exit, loops and other things. The only difference is that everyone defines their own macros, while in C everybody uses the same "macros".

x++; in most machines can map to a single instruction which is different from the instruction for ++x;

I would agree that if an assembler must be a one-to-one mapping from source line to opcode, then C doesn't fit. I just don't agree with that definition of assembler.

Forth is definitely a contender.

nice discussion. ed

Richard G. Riley 20 years ago

"Ed" posted the following on 2006-03-10:

But a mnemonic representing an instruction is still just that. The macro part is nothing more than rolling up of things for brevity. A subsequent disassembly will reveal all the hidden gore.

How do you see them being different at the assembler level? They are not, are they? It's just when you do the (pseudo ASM) INC_REGx or ADDL 02,REGx or whatever that matters, isn't it?

e.g if we have y=++x;

then the pseudo assembler is

    INC x
    move y,x

whereas

    y=x++

is

    move y,x
    INC x

Not taking into account expression return values that are CPU equivalent. Admittedly I haven't dabbled in later instruction sets for the new post-80386 CPUs, so please slap me down or better explain the above if not right: it's interesting.
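At the C level the two orderings above differ only in which value y receives. A small sketch makes that visible; the helper names are hypothetical, and nothing here says how many instructions a given compiler emits for either form.

```c
/* Hypothetical helpers, not from the thread. */
int assign_pre_increment(int *x)
{
    int y = ++*x;    /* x is incremented first, then y receives the new value */
    return y;
}

int assign_post_increment(int *x)
{
    int y = (*x)++;  /* y receives the old value, then x is incremented */
    return y;
}
```

Starting from x == 5, the first returns 6 and the second returns 5; in both cases x ends up as 6.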

Sorry for coming late, but how do you see an assembler? In common parlance it has always been (in my world) a program for converting instruction set mnemonics into equivalent opcodes which run natively on the target CPU.

"A desk is a dangerous place from which to view the world" - LeCarre.

Al Balmer 20 years ago

Not really. Many assemblers have predefined macros for various things, and C programmers write macros using preprocessor directives.

Huh? As a standalone statement, if x is an integer type, I'd expect both to be mapped to the machine's equivalent of INC x. If it's embedded in a larger statement, or it's a pointer, it's likely that several instructions will be generated, and a compiler (including a C compiler) will do things that a macro assembler won't do.

Proposed decades ago, and there has been some implementation.

Nor does anyone else, since the invention of macros. However, C doesn't fit any widely accepted definition of assembler. You can have your own definition of assembler, as long as you don't expect folks to know what you're talking about.

Al Balmer Sun City, AZ

Ed Prochak 20 years ago

[]

But it will NOT display the original macro. There is no 1 to 1 mapping from source to code.

Actually I still think in PDP assembler at times (my first assembler programming), so y=x++; really does map to a single instruction which both moves the value to y and increments x (which had to be held in a register IIRC)

        MOV R1,x
    A:  MOV y,R1++    ## I may have the syntax wrong, it's been a LONG time

        MOV R1,x
    B:  MOV y,++R1

The opcodes for those two instructions (lines A and B) are different in PDP assembler.

I haven't played much in the Intel realm since about the 286, and I haven't done much assembly at all for about 10 years. Even the last embedded project I worked on with a tiny 8-bit micro had a C compiler, so I did nearly nothing in assembler. C makes it so much easier. I've had the opinion of C as assembler since I first learned it (about 1983).

Some other languages do so much more for you that you might be scared to look at the disassembly. e.g. languages that do array bounds checking for you will generate much more code for a[y]=x; than does C. You can picture the assembly code for C in your head without much difficulty. The same doesn't hold true for some other languages.
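To picture the difference being described, here is a rough sketch in C of what a bounds-checked store has to do compared with the plain one. The function names and the explicit length parameter n are assumptions made for illustration; a language with built-in checking would carry the length itself and report the failure in its own way.

```c
#include <stdbool.h>
#include <stddef.h>

/* What C does for a[y] = x: one indexed store, no questions asked. */
void store_unchecked(int *a, size_t y, int x)
{
    a[y] = x;
}

/* Roughly what a bounds-checking language adds around the same store.
   n is the number of elements in a (an assumption of this sketch). */
bool store_checked(int *a, size_t n, size_t y, int x)
{
    if (y >= n)
        return false;   /* index out of range: refuse the store */
    a[y] = x;
    return true;
}
```

The checked version needs a compare, a branch and an error path in addition to the store, which is the extra code the paragraph above refers to.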

I told you, a macro assembler does not work that way. One macro might expand not just to multiple mnemonics, but to different mnemonics depending on parameters. It is not 1 to 1 from source to assembly mnemonics (let alone opcodes). A macro assembler can abstract just a little or quite a lot away from the target machine. Depends on how you use it. So while an assembler is [...], there's nothing about that conversion being one-to-one (mnemonic to opcode).

Even without macros, the one-to-one doesn't work if in the instruction set the opcode for moving registers differs from moving memory, so MOV R2,R1 differs from MOV B,A where R1 and R2 are register identifiers and A and B are memory locations. Yet we talk about the MOVe mnemonic as if both were the same operation.

C's assignment operator maps about as closely to those opcodes as that MOV mnemonic does. That's why I say it's a glorified assembler. You have about as good an idea of what code is generated as you do with a good assembler (as long as we can ignore the compiler's optimizer).

Nice quote. ed

cs_posting 20 years ago

Hmm, so if I'm decrementing a divisor, and branching off somewhere else before the actual divide instruction if the would-be divisor is zero, and your precomputation of both branches traps a division by zero that a literal execution of my program would never perform... whose fault is that?
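The pattern in question, written out in C as a sketch. The function name and the fallback parameter are hypothetical, and unsigned arithmetic is used only so that every step is well defined.

```c
/* Decrement the divisor, then branch away before the divide if it is zero. */
unsigned divide_by_decremented(unsigned numerator, unsigned divisor,
                               unsigned fallback)
{
    divisor--;                     /* unsigned, so 0 wraps to UINT_MAX: defined */
    if (divisor == 0)
        return fallback;           /* branch away before any divide is reached */
    return numerator / divisor;    /* only reached with a non-zero divisor */
}
```

As written, the abstract program never divides by zero; the question above is whether hardware that starts the division early could still raise a visible trap.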

I suspect that exception handling in speculative execution is a problem that has been looked into.

Michael Wojcik 20 years ago

Sure, if "most machines" excludes load/store architectures, and machines which cannot operate directly on an object of the size of whatever x happens to be, and all the cases where "x" is a pointer to an object of a size other than the machine's addressing granularity...

I suppose you could argue that "can" in your claim is intended to be weak - that, for "most machines" (with a conforming C implementation, presumably), there exists at least one C program containing the statement "x++;", and a conforming C implementation which will translate that statement to a single machine instruction.

But that's a very small claim. All machines "can" map that statement to multiple instructions as well; many "can" map it to zero instructions in that sense (taking advantage of auto-increment modes or the like). What can happen says very little about what will.

The presence in C of syntactic sugar for certain simple operations like "x++" doesn't support the claim that C is somehow akin to assembler in any case. One distinguishing feature of assembler is a *lack* of syntactic sugar. (Macros aren't a counterexample because they're purely lexical constructs; in principle they're completely separate from code generation.)

C isn't assembler because:

- It doesn't impose a strict mapping between (preprocessed) source and generated code. The "as if" clause allows the implementation to have the generated code differ significantly from a strict interpretation of the source acting on the virtual machine.

- It has generalized constructs (expressions) which can result in the implementation generating arbitrarily complex code.
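A small illustration of the first point, as a sketch only: the function name is hypothetical and what any particular compiler does with it is not guaranteed. The source describes a loop, but under the "as if" rule an implementation is free to emit something else entirely (for instance a closed-form computation) as long as the result is the same.

```c
/* Sums 1..n with a loop. The generated code need not contain a loop at all,
   provided the observable result is unchanged. */
unsigned sum_to_n(unsigned n)
{
    unsigned total = 0;

    for (unsigned i = n; i > 0; i--)   /* counts down, so it ends for every n */
        total += i;
    return total;
}
```

Comparing the generated code at different optimization levels is one way to see how loosely source and output can be coupled.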

Michael Wojcik michael.wojcik@microfocus.com Any average educated person can turn out competent verse. -- W. H. Auden

Al Balmer 20 years ago

Not something to worry about, though you'd have to ask an expert why :-) I suspect that this stuff is below the level of exception triggers.

Al Balmer Sun City, AZ

Paul Keinanen 20 years ago

Mnemonics and symbolic addresses in assemblers are just syntactic sugar built on the binary machine code :-).

Entering machine codes in hex or octal is also syntactic sugar.

Paul

Walter Roberson 20 years ago

[Getting off-topic for comp.lang.c...]

Yes. For example on the MIPS architecture, an exception state is inserted into the flow, but the exception itself is not taken unless the exception "graduates"; the exception is suppressed if the conditional results turn out to be such that it was not needed.

In the MIPS IV instruction set, divide can be done as "multiply by the reciprocal", and it is not uncommon to schedule the reciprocal operation ahead of time, before the code has had time to check whether the denominator is 0. The non-zeroness is speculated so as to get a "head start" on the time-consuming division operation.

If I recall correctly, a fair bit of the multi-instruction pipelining on MIPS is taken up with controls to handle speculation properly.

Prototypes are supertypes of their clones. -- maplesoft

Keith Thompson 20 years ago

[...]

There's a continuum from raw machine language to very high-level languages. Macro assembler is only a very small step up from non-macro assembler. C is a *much* bigger step up from that. Some C constructs may happen to map to single instructions for *some* compiler/CPU combinations; they might map to multiple instructions, or even none, for others. An assignment statement might copy a single scalar value (integer, floating-point, or pointer) -- or it might copy an entire structure; the C code looks the same, but the machine code is radically different.
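The point about assignment can be shown with two functions whose bodies are textually identical. This is a sketch with hypothetical names; the struct layout is chosen only to make the copy obviously larger than one machine word.

```c
/* A structure large enough that copying it cannot be a single move. */
struct packet {
    int header;
    double payload[64];
};

/* Copies one int: typically a load and a store. */
void copy_scalar(int *dst, const int *src)
{
    *dst = *src;
}

/* Same source text, but copies the whole structure: a loop, a block-move
   instruction, or a library call, depending on the implementation. */
void copy_packet(struct packet *dst, const struct packet *src)
{
    *dst = *src;
}
```

Both bodies read `*dst = *src;`, and only the operand types tell the compiler how much work that is.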

Using entirely arbitrary units of high-level-ness, I'd call machine language close to 0, assembly language 10, macro assembler 15, and C about 50. It might be useful to have something around 35 or so. (This is, of course, mostly meaningless.)

Assembly language is usually untyped; types are specified by which instruction you use, not by the types of the operands. C, by contrast, associates types with variables. It often figures out how to implement an operation based on the types of its operands, and many operations are disallowed (assigning a floating-point value to a pointer, for example).
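As a sketch of how operand types select the operation, the same `x++` moves a pointer by a different number of bytes depending on what it points to. The function names are hypothetical.

```c
#include <stddef.h>

/* Bytes moved by x++ when x is a char *. */
size_t bytes_per_increment_char(void)
{
    char cells[2];
    char *x = cells;

    x++;
    return (size_t)(x - cells);                    /* char steps are byte steps */
}

/* Bytes moved by x++ when x is a double *. */
size_t bytes_per_increment_double(void)
{
    double cells[2];
    double *x = cells;

    x++;
    return (size_t)((char *)x - (char *)cells);    /* measured in bytes */
}

/* By contrast, `double *p = 1.5;` violates a constraint and must be
   diagnosed by a conforming compiler, so it is left out of this sketch. */
```

The first returns 1 and the second returns sizeof(double); an assembler programmer would have had to pick the increment amount by hand.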

I know the old joke that C combines the power of assembly language with the flexibility of assembly language. I even think it's funny. But it's not realistic, at least for C programmers who care about writing good portable code.

Keith Thompson (The_Other_Keith) kst-u@mib.org San Diego Supercomputer Center We must do something. This is something. Therefore, we must do this.

cs_posting 20 years ago

Have to remember, though, that the C program in question is really a back translation approximating an assembly language original. If the compiler builds the undefined pointer operation in the logical way, it will be essentially the same as the hand written assembly language code.

To then claim that speculative execution may cause an exception on the result is to imply that the assembly language author, who has a pretty good idea what assumptions he is making, must now add "speculative loading of something I wasn't going to fetch" to the list of concerns.

Or were you thinking it was the compiler rather than processor logic which was going to do the speculating?

Some pipelining tricks like the MIPS branch delay slot, are explicitly part of the programming model, and you do have to manually handle them when working with low level assembly code. But for the x86, speculation is not...

Al Balmer 20 years ago

Why? We've long since stopped discussing that program.

Al Balmer Sun City, AZ

Mark L Pappin 20 years ago

It's a little OT in c.l.c, but would you mind telling us just what processors those are, that you can make such a guarantee? What characteristics do they have that means they'll never have a C compiler?

(A few I can recall having been proposed are: tiny amounts of storage, Harvard architecture, and lack of programmer-accessible stack. Funnily enough, these are characteristics possessed by chips for which I compile C code every day.)

mlp

Albert van der Horst 20 years ago

I find it not an example of implicit assumptions, just of bad coding.

There is no reason not to use the almost as readable, problemless

if ( buffer_end - buffer < space_required )

(It mentally reads as "if the number of elements that still fit in the buffer is less than the number of elements we require")
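Wrapped in a function, the suggested test might look like the sketch below. The function name, the parameter types and the precondition are assumptions made here; the form being criticised is not quoted in this post, though a commonly seen one is `buffer + space_required > buffer_end`, which can form a pointer beyond the array before the comparison is even made.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumes buffer <= buffer_end and that both point into, or one past the end
   of, the same array. Only then is the subtraction defined and non-negative. */
bool lacks_room(const char *buffer, const char *buffer_end,
                size_t space_required)
{
    /* The cast avoids a mixed signed/unsigned comparison; it is safe only
       under the precondition above. */
    return (size_t)(buffer_end - buffer) < space_required;
}
```

For a 16-byte buffer, a request for 16 bytes fits and a request for 17 does not, and no out-of-range pointer is ever computed.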

Groetjes Albert

--

Albert van der Horst, UTRECHT,THE NETHERLANDS Economic growth -- like all pyramid schemes -- ultimately falters. albert@spenarnc.xs4all.nl http://home.hccnet.nl/a.w.m.van.der.horst

Jordan Abel 20 years ago

"tiny amounts of storage" may preclude a conforming hosted implementation [which must support an object of 65535 bytes, and, of course, a decent-sized library]

CBFalconer 20 years ago

These machines may well have C compilers, just not conforming ones. The areas of non-conformance are likely to be:

- object size available
- floating point arithmetic
- recursion depth (1 meaning no recursion)
- availability of long and long-long
- availability of standard library

Once more, many programs can be written that are valid and portable C, without requiring these abilities. The thing that needs to be documented is use of possible non-standard substitutes for standard features.

"If you want to post a followup via groups.google.com, don't use the broken "Reply" link at the bottom of the article. Click on "show options" at the top of the article, then click on the "Reply" at the bottom of the article headers." - Keith Thompson

cs_posting 20 years ago

If the computation in one version can be reduced to a constant by the compiler, that would be a reason for using that version.

I can imagine a number of situations in which "bad coding" is the result of a programmer with a mental idea of how to accomplish something efficiently, trying to render that approach in C as if it were assembly language. This is doubly likely on small systems...

The problem of course is that the compiler has its own ideas about how to be efficient.

And the standards committee may have very different ideas from the would-be hand optimizing programmer about how you are supposed to instruct the compiler in what you want!
