Making Fatal Hidden Assumptions

I should have phrased this: C is LIKE an assembler.

a Warning.

To some degree you are right. It's actually pointer manipulation that makes it closer to assembler.

How about a bitsliced machine that uses only 6-bit integers?

Forgive my memory, but is it PL/1 or Ada that lets the programmer define what integer type he wants? Syntax was something like INTEGER*12 X, which defined X as a 12-bit integer. (Note that such syntax is portable in that on two different processors, you still know that the range of X is

-2048 to +2047.) The point is, a 16-bit integer in Ada is always a 16-bit integer, and writing x = 32767 + 10 will always overflow in Ada, but it is dependent on the compiler and processor in C. It can overflow, or it can succeed.

But my point on this was: you need to know your target processor in C more than in a language like Ada. This puts a burden on the C programmer closer to that of an assembler programmer on the same machine than to that of an Ada programmer.

A big characteristic of assembler is that it is a simple language. C is also a very simple language. Other HLLs are simple too, but the simplicity combined with other characteristics suggests to me an assembler feel to the language.

No, I was talking about the original motivation for the design of the language. It was designed to exploit the register increment on DEC processors. In the right context (e.g., y = x++;) the increment doesn't even become a separate instruction, as I mentioned in another post.

But other HLL's don't even have register storage.

I know that it is just a suggestion. The point is: why was it included in the language at all? Initially it gave the programmer more control.

Which makes sense to an assembler programmer, but not to a typical HLL programmer.

Let's put it this way: there is a gradient scale, from the pure digits of machine language (e.g., programming opcodes in binary is closer to the hardware than using octal or hex) at the lowest end, moving up past assembler to higher and higher levels of abstraction away from the hardware. On that scale, I put C much closer to assembler than any other HLL I know. Here are some samples:

PERL, BASH, SQL
C++, JAVA
PASCAL, FORTRAN, COBOL
C
assembler
HEX opcodes
binary opcodes
digital voltages in the real hardware

Boy, you'd think I was insulting C, based on the length of this thread.

8^)

But maybe this made my position clearer. Ed.

Reply to
Ed Prochak

Well, you cannot, but those processors did not even exist when C was created, so those features didn't make it. To some degree, C is more of a PDP assembler. But I wonder if there is a way to write it in C that the compiler can recognize. You would only care IF you are targeting such a specific RISC processor, in which case your thinking shades closer to the approach an assembler programmer takes than to the one an HLL programmer takes.

See the difference? It is not so much that C gives you absolute control of the hardware, but that approaching many programming tasks in C from the view of an assembly programmer makes your code better. Then when the need is more abstract, C still works for higher-level programming.

I never said C was not an HLL. Ed

great discussion BTW.

Reply to
Ed Prochak

Not if you can live with a WARNING message.

Guess I'm getting forgetful in my old age (haven't touched PASCAL in over 10 years). I thought PASCAL defined fixed ranges for the datatypes like integers. I guess I didn't port enough PASCAL applications to see the difference. (I could have sworn you'd get an error on X = 32767 + 2;)

Yes, PASCAL and P-code. You have a point there, but I'm not sure it is in your favor. Due to P-code, PASCAL is abstracted even above the native assembler for the target platform. So we have:

C -> native assembler -> program on native hardware
Pascal -> program in p-code -> runs in p-code interpreter

So you have even less reason to think of the native hardware when programming in PASCAL. This makes it more abstract, and a higher HLL than C.

The point is: why even include this feature? It is because in C programming you tend to think closer to the hardware than you do in PASCAL. Even when I was doing some embedded graphics features for a product in PASCAL, I don't think the CPU architecture ever entered my thoughts.

So the PASCAL compiler was more advanced than the C compilers of the time. Do you think PASCAL being a more abstract HLL than C might have had an effect here? (More likely, though, it was that PASCAL predated C, at least in widespread use.)

The difference is, IMHO, that PASCAL is a more abstract HLL, letting the programmer think more about the application, while C is an HLL with features that force/allow the programmer to consider the underlying processor. (In the context of this topic, "force" is the word.)

Ed

Reply to
Ed Prochak

Or like this: C has low level features similar to an assembler.

I thought those died out. Were any of those CPUs actually used in a computer sufficiently advanced to compile C? As I recall, they were only used as custom DSPs in the pre-DSP era, or as custom D/A converters, etc.

Probably Ada; I don't recall that in PL/1.

Again, C has low-level features. I always use structured code when I program in C, but C allows coding in many unstructured ways (rumor: to allow porting of Fortran programs to C). Still, it is a high-level language. I don't have to keep track of what data is in what register, stack slot, or memory location, like when I coded in 6502 or when I code in IA-32. I don't need to move data around between registers, stack, and memory; it's done for me in C. I just need the name of the data or a named pointer to the data. I don't need to set up prolog/epilog code. I don't need to calculate offsets for branching instructions. Etc.

Common C myth, but untrue:


"Thompson went a step further by inventing the ++ and -- operators, which increment or decrement; their prefix or postfix position determines whether the alteration occurs before or after noting the value of the operand. They were not in the earliest versions of B, but appeared along the way. People often guess that they were created to use the auto-increment and auto-decrement address modes provided by the DEC PDP-11 on which C and Unix first became popular. This is historically impossible, since there was no PDP-11 when B was developed. The PDP-7, however, did have a few `auto-increment' memory cells, with the property that an indirect memory reference through them incremented the cell. This feature probably suggested such operators to Thompson;"

True. Many HLLs don't have pointers either; pointers are, for me, a key attraction in any language.

BTW, I've heard one of the Pascal standards added pointers...

True. This wouldn't or shouldn't make any sense to someone who doesn't understand assembly.

Based on my experiences, I'd list like so:

C, PL/1, FORTH
BASIC
PASCAL, FORTRAN
C (lowlevel), FORTH (lowlevel)
IA-32, 6502 assembler
HEX opcodes

My ranking of FORTRAN is highly debatable. It is strong in math, but seriously primitive in a number of major programming areas, like string processing. Yes, PASCAL is less useful than BASIC. BASIC had stronger, by comparison, string processing abilities. Also, I don't see how you can place Java above C, since it is a stripped-down, pointer-safe version of C. PASCAL (until they added pointers) was basically a stripped-down, pointer-safe version of PL/1.

Rod Pemberton

Reply to
Rod Pemberton

True, but for( ; ; p3 ) looks like a MACRO to me. It's up to the programmer to optimize specific instances when you program in assembly. That's why there's also while().

The only difference is the loop block. IOW, in a macro assembler you still have to code that goto top_of_for_loop2 at the end of the loop, either literally or in some endfor macro.

Ed

Reply to
Ed Prochak

The difference between "error" and "warning" is usually unimportant, and for some compilers, seems arbitrary.

Violations should be fixed, no matter how nicely the compiler tells you about them.

--
Al Balmer
Sun City, AZ
Reply to
Al Balmer

And a raven is like a writing desk.

"C is an assembler" and "C is like an assembler" are two *very* different statements. The latter is obviously true, given a sufficiently loose interpretation of "like".

The C standard doesn't distinguish between different kinds of diagnostics, and it doesn't require any program to be rejected by the compiler (unless it has a "#error" directive). This allows for language extensions; an implementation is free to interpret an otherwise illegal construct as it likes, as long as it produces some kind of diagnostic in conforming mode. It also doesn't require the diagnostic to have any particular form, or to be clearly associated with the point at which the error occurred. (Those are quality-of-implementation issues.)

This looseness of requirements for diagnostics isn't a point of similarity between C and assemblers; on the contrary, in every assembler I've seen, misspelling the name of an opcode or using incorrect punctuation for an addressing mode results in an immediate error message and failure of the assembler.

C provides certain operations on certain types. Pointer arithmetic happens to be something that can be done in most or all assemblers and in C, but C places restrictions on pointer arithmetic that you won't find in any assembler. For example, you can subtract one pointer from another, but only if they're pointers to the same type; in a typical assembler, pointer values don't even have types. Pointer arithmetic is allowed only within the bounds of a single object (though violations of this needn't be diagnosed; they cause undefined behavior); pointer arithmetic in an assembler gives you whatever result makes sense given the underlying address representation. C says nothing about how pointers are represented, and arithmetic on pointers is not defined in terms of ordinary integer arithmetic; in an assembler, the representation of a pointer is exposed, and you'd probably use the ordinary integer operations to perform pointer arithmetic.

What about it? A conforming C implementation on such a machine must have CHAR_BIT>=8, INT_MAX>=32767, LONG_MAX>=2147483647, and so forth. The compiler may have to do some extra work to implement this. (You could certainly provide a non-conforming C implementation that provides a 6-bit type called "int"; the C standard obviously places no constraints on non-conforming implementations. I'd recommend calling the resulting language something other than C, to avoid confusion.)

I'm not familiar with PL/I.

Ada (not ADA) has a predefined type called Integer. It can have other predefined integer types such as Short_Integer, Long_Integer, Long_Long_Integer, and so forth. There are specific requirements on the ranges of these types, quite similar to C's requirements for int, short, long, etc. There's also a syntax for declaring a user-defined type with a specified range: type Integer_32 is range -2**31 .. 2**31-1; This type will be implemented as one of the predefined integer types, selected by the compiler to cover the requested range.

C99 has something similar, but not as elaborate: a set of typedefs in <stdint.h> such as int32_t, int_least32_t, and so forth. Each of these is implemented as one of the predefined integer types.

You can get just as "close to the metal" in Ada as you can in C. Or, in both languages, you can write portable code that will work properly regardless of the underlying hardware, as long as there's a conforming implementation. C is lower-level than Ada, so there's a greater bias in C to relatively low-level constructs and system dependencies, but it's only a matter of degree. In this sense, C and Ada are far more similar to each other than either is to any assembler I've ever seen.

[...]

If you're just saying there's an "assembler feel", I won't argue with you -- except to say that, with the right mindset, you can write portable code in C without thinking much about the underlying hardware.

[...]

The PDP-11 has predecrement and postincrement modes; it doesn't have preincrement or postdecrement. And yet C provides all 4 combinations, with no implied preference for the ones that happen to be implementable as PDP-11 addressing modes. In any case, C's ancestry goes back to the PDP-7, and to earlier languages (B and BCPL) that predate the PDP-11.

[...]

Sure, but giving the programmer more control is hardly synonymous with assembly language.

Sure, it's a low-level feature.

That seems like a reasonable scale (I might put Forth somewhere below C). But you don't indicate the relative distances between the levels. C is certainly closer to assembler than Pascal is, but I'd say that C and Pascal are much closer to each other than either is to assembler.

You can write system-specific non-portable code in any language. In assembler, you can *only* write system-specific non-portable code. In C and everything above it, it's possible to write portable code that will behave as specified on any system with a conforming implementation, and a conforming implementation is possible on a very wide variety of hardware. Based on that distinction, there's a sizable gap between assembler and C.

--
Keith Thompson (The_Other_Keith) kst-u@mib.org  
San Diego Supercomputer Center               
Reply to
Keith Thompson

"Rod Pemberton" writes: [...]

Pascal has always had pointers.

--
Keith Thompson (The_Other_Keith) kst-u@mib.org  
San Diego Supercomputer Center               
Reply to
Keith Thompson

Assigning an integer value to a pointer object, or vice versa, without an explicit conversion (cast operator) is a constraint violation. The standard requires a diagnostic; it doesn't distinguish between warnings and error messages. Once the diagnostic has been issued, the compiler is free to reject the program. If the compiler chooses to generate an executable anyway, the behavior is undefined (unless the implementation specifically documents it).

An assignment without a cast, assuming the compiler accepts it (after the required diagnostic) isn't even required to behave the same way as the corresponding assignment with a cast -- though it's likely to do so in real life.

C compilers commonly don't reject programs that violate this particular constraint, because it's a common construct in pre-standard C, but that's an attribute of the compiler not of the language as it's now defined.

[...]
--
Keith Thompson (The_Other_Keith) kst-u@mib.org  
San Diego Supercomputer Center               
Reply to
Keith Thompson

Texas Instruments had a range of machines based on a bitslice design that became the basis of the TMS9900 processor design. I don't recall if there was a C compiler for it but it certainly could have supported one easily.

--
C:>WIN                                      |   Directable Mirror Arrays
The computer obeys and wins.                | A better way to focus the sun
Reply to
Steve O'Hara-Smith

Anecdote:

About ten years ago I did a project involving an AT&T/WE DSP32C processor that had a very original-feeling AT&T K&R C compiler. This compiler did essentially no "optimization" that I could see. It didn't even do automatic register spill or fill (other than saves and restores at subroutine entry and exit, of course): normal "auto" local variables existed entirely in the stack frame, and had to be accessed from there on every use, and "register" local variables existed entirely in registers: specify too many in any context and the code wouldn't compile.

A very different (and somewhat more laborious) experience than programming with a modern compiler of, say, gcc vintage, but it was actually pretty easy to get quite efficient code this way. That compiler really was very much like a macro assembler with expression parsing.

[The C code that resulted was very much DSP32C-specific C code. That's why a "universal assembler" would want a more abstract notion of register variables that corresponds quite closely to that of modern C.]

Cheers,

--
Andrew
Reply to
Andrew Reilly

Ah, so it means the implementation of C, not the implementation of the application.

Ian

Reply to
Ian Bell

Funny, I get an "error" message:

% cat t.c
void *f(void)
{ return 42; }
% strictcc -O -c t.c
t.c: In function `f':
t.c:2: error: return makes pointer from integer without a cast
%

Is my compiler not a "C compiler"?

(I really do have a "strictcc" command, too. Some might regard it as cheating, but it works.)

(Real C compilers really do differ as to which diagnostics are "warnings" and which are "errors". In comp.lang.c in the last week or two, we have seen some that "error out" on:

int *p;
float x;
...
p = x;

and some that accept it with a "warning".)

--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
Reply to
Chris Torek

But not uncontrolled pointers, except in abortive quasi-Pascals such as Borland/Delphi.

--
"If you want to post a followup via groups.google.com, don't use
 the broken "Reply" link at the bottom of the article.  Click on 
Reply to
CBFalconer

... snip ...

Not me (regard it as cheating). I also have the equivalent, through an alias. I call it cc. The direct access is called gcc. The alias causes "cc " to be translated into gcc --help.

--
"If you want to post a followup via groups.google.com, don't use
 the broken "Reply" link at the bottom of the article.  Click on 
Reply to
CBFalconer

In article "Ed Prochak" writes:
 > Dik T. Winter wrote:
 > > In article "Ed Prochak" writes:
...
 > > > -- datatype sizes are dependent on the underlying hardware. While a lot
 > > > of modern hardware has formed around the common 8bit char, and
 > > > multiples of 16 for int types (and recent C standards have started to
 > > > impose these standards), C still supports machines that used 9bit char
 > > > and 18bit and 36bit integers. This was the most frustrating thing for
 > > > me when I first learned C. It forces precisely some of the hidden
 > > > assumptions of this topic.
 > >

In the original version of Pascal that was certainly *not* an error. The range of integers was from -576460752303423487 to +576460752303423487, although values in absolute value greater than 281474976710655 were unreliable in some calculations. Do you know where the original limit of set sizes to 60 elements comes from?

*Not* P-code. The Pascal version I refer to predates P-code quite a bit. P-code came into the picture when porting the language to other processors did. The original compiler generated direct machine instructions (and that could still be done on other architectures).

In the original version of Pascal it was: Pascal->program on native hardware without an intermediate assembler or an interpreter.

It was included at a point in time when optimisation by some compilers was not as good as you would wish. In a similar way the original version of Fortran had a statement with which you could tell the compiler what the probability was that a branch would be taken or not.

No, it was because the machine that compiler was running on was quite a bit larger and faster than the machines C compilers tended to run on.

--
dik t. winter, cwi, kruislaan 413, 1098 sj  amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn  amsterdam, nederland; http://www.cwi.nl/~dik/
Reply to
Dik T. Winter

In article "Ed Prochak" writes:
 > Dik T. Winter wrote:
...
 > > Indeed. But even when we look at the published instructions C falls
 > > short of providing a construct for every one. Where is the C construct
 > > to do a multiply step available in quite a few early RISC machines?
 > > Note also that in assembler you can access the special bits indicating
 > > overflow and whatever (if they are available on the machine). How to
 > > do that in C?
 >
 > Well you cannot, but those processors did not even exist when C was
 > created. So those features didn't make it. To some degree, C is more of
 > a PDP assembler.

How do you get access to the condition bits?

--
dik t. winter, cwi, kruislaan 413, 1098 sj  amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn  amsterdam, nederland; http://www.cwi.nl/~dik/
Reply to
Dik T. Winter

That's really interesting, because I have pondered the question and answered it in nearly the same way. However, the conclusion I came to is that C occupies a relatively large range in the continuum, not a single point.

C dialects that accept inline assembly push the lower bound much lower than it would otherwise be. Likewise, the wealth of freely-available C libraries push the upper bound much further up. You can run a real-world system written in C all the way from the top to the bottom-- from the GUI (GNOME, for example), to the kernel; from the compiler to the shells and interpreters.

As the continuum of C expands, the C standard (ideally) acts to reclaim and make (semi-)portable chunks of this continuum. Look at 'restrict', for example. It enables portable annotation of aliasing assumptions so that encroachment on the lower bounds is not necessary for better efficiency. Is it perfect? No. Does it need to be? No. Is it workable? Yes.

I've also pondered that joke on many occasions. Each time I see it, I think it's more and more of a compliment. But if it added that C is usually more efficient than non-deity-level assembly, and that well-written C is nearly pathologically portable, it wouldn't really be a joke, would it?

In some respects, C is like English: overly susceptible to insane degrees of corruption, but all the same, nearly universally understood regardless, for better or for worse.

Mark F. Haigh snipped-for-privacy@sbcglobal.net

Reply to
Mark F. Haigh

Dik T. Winter wrote:
 > the original version of Fortran had a statement with

Sorry I don't have access to our archives at the moment - we do have materials from 70x and 650 Fortran; to which machine are you referring? 704, 709, 650?

Michael Grigoni Cybertheque Museum

Reply to
msg

With the usual gay abandon about extensions, you might define a variable in system space, say _ccd, to hold those bits. You specify the conditions under which it is valid, such as immediately after an expression with precisely two operands, preserved by use of the comma operator. Then:

a = b + c, ccd = _ccd;

allows you to detect overflow and other evil things. A similar thing such as _high could allow capturing all bits from a multiplication. i.e.:

a = b * c, ccd = _ccd, ov = _high;

tells you all about the operation without data loss.

Just blue skying here.

--
"If you want to post a followup via groups.google.com, don't use
 the broken "Reply" link at the bottom of the article.  Click on 
Reply to
CBFalconer
