Making Fatal Hidden Assumptions

We often find hidden, and totally unnecessary, assumptions being
made in code.  The following leans heavily on one particular
example, which happens to be in C.  However similar things can (and
do) occur in any language.

These assumptions are generally made because of familiarity with
the language.  As a non-code example, consider the idea that the
faulty code is written by blackguards bent on fouling the
language.  The term blackguards is not in favor these days, and for
good reason.  However, the older you are, the more likely you are
to have used it since childhood, and to use it again, barring
specific thought on the subject.  The same type of thing applies to
writing code.

I hope, with this little monograph, to encourage people to examine
some hidden assumptions they are making in their code.  As ever, in
dealing with C, the reference standard is the ISO C standard.
Versions can be found in text and pdf format, by searching for N869
and N1124. [1]  The latter does not have a text version, but is
more up-to-date.

We will always have innocent-appearing code with these kinds of
assumptions built in.  However, it would be wise to annotate such
code to make the assumptions explicit, which can avoid a great deal
of agony when the code is reused on other systems.

In the following example, the code is as downloaded from the
referenced URL, and the comments are entirely mine, including the
'every 5' linenumber references.

/* Making fatal hidden assumptions */
/* Paul Hsieh's version of strlen.
   http://www.azillionmonkeys.com/qed/asmexample.html

   Some sneaky hidden assumptions here:
   1.  p = s - 1 is valid.  Not guaranteed.  Careless coding.
   2.  cast (int) p is meaningful.  Not guaranteed.
   3.  Use of 2's complement arithmetic.
   4.  ints have no trap representations or hidden bits.
   5.  4 == sizeof(int) && 8 == CHAR_BIT.
   6.  size_t is actually int.
   7.  sizeof(int) is a power of 2.
   8.  int alignment depends on a zeroed bit field.

   Since strlen is normally supplied by the system, the system
   designer can guarantee all but item 1.  Otherwise this is
   not portable.  Item 1 can probably be beaten by suitable
   code reorganization to avoid the initial p = s - 1.  This
   is a serious bug which, for example, can cause segfaults
   on many systems.  It is most likely to foul when (int)s
   has the value 0, and is meaningful.

   He fails to make the valid assumption: 1 == sizeof(char).
*/

#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080)
#define SW (sizeof (int) / sizeof (char))

int xstrlen (const char *s) {
const char *p;                                /*  5 */
int d;

    p = s - 1;
    do {
        p++;                                  /* 10 */
        if ((((int) p) & (SW - 1)) == 0) {
            do {
                d  = *((int *) p);
                p += SW;
            } while (!hasNulByte (d));        /* 15 */
            p -= SW;
        }
    } while (*p != 0);
    return p - s;
}                                             /* 20 */

Let us start with line 1!  The constants appear to require that
sizeof(int) be 4, and that CHAR_BIT be precisely 8.  I haven't
really looked too closely, and it is possible that the ~x term
allows for larger sizeof(int), but nothing allows for larger
CHAR_BIT.  A further hidden assumption is that there are no trap
values in the representation of an int.  Its functioning is
doubtful when sizeof(int) is less than 4.  At the least it will
force promotion to long, which will seriously affect the speed.

This is an ingenious and speedy way of detecting a zero byte within
an int, provided the preconditions are met.  There is nothing wrong
with it, PROVIDED we know when it is valid.

In line 2 we have the confusing use of sizeof(char), which is 1 by
definition.  This just serves to obscure the fact that SW is
actually sizeof(int) later.  No hidden assumptions have been made
here, but the usage helps to conceal later assumptions.

Line 4.  Since this is intended to replace the system's strlen()
function, it would seem advantageous to use the appropriate
signature for the function.  In particular strlen returns a size_t,
not an int.  size_t is always unsigned.

In line 8 we come to a biggie.  The standard specifically does not
guarantee the action of a pointer below an object.  The only real
purpose of this statement is to compensate for the initial
increment in line 10.  This can be avoided by rearrangement of the
code, which will then let the routine function where the
assumptions are valid.  This is the only real error in the code
that I see.

In line 11 we have several hidden assumptions.  The first is that
the cast of a pointer to an int is valid.  This is never
guaranteed.  A pointer can be much larger than an int, and may have
all sorts of non-integer like information embedded, such as segment
id.  If sizeof(int) is less than 4 the validity of this is even
less likely.

Then we come to the purpose of the statement, which is to discover
if the pointer is suitably aligned for an int.  It does this by
bit-anding with SW-1, which is the concealed sizeof(int)-1.  This
won't be very useful if sizeof(int) is, say, 3 or any other
non-power-of-two value.  In addition, it assumes that an aligned pointer
will have those bits zero.  While this last is very likely in
today's systems, it is still an assumption.  The system designer is
entitled to assume this, but user code is not.

Line 13 again uses the unwarranted cast of a pointer to an int.
This enables the use of the already suspicious macro hasNulByte in
line 15.

If all these assumptions are correct, line 19 finally calculates a
pointer difference (which is valid, and of type ptrdiff_t; since the
result here is non-negative, it will always fit into a size_t).  It
then does a concealed conversion of this into an int, which could
cause implementation-defined behaviour (or raise an
implementation-defined signal) if the value exceeds what will fit
into an int.
This one is also unnecessary, since it is trivial to define the
return type as size_t and guarantee success.

I haven't even mentioned the assumption of 2's complement
arithmetic, which I believe to be embedded in the hasNulByte
macro.  I haven't bothered to think this out.

Would you believe that so many hidden assumptions can be embedded
in such innocent looking code?  The sneaky thing is that the code
appears trivially correct at first glance.  This is the stuff that
Heisenbugs are made of.  Yet use of such code is fairly safe if we
are aware of those hidden assumptions.  

I have cross-posted this without setting follow-ups, because I
believe that discussion will be valid in all the newsgroups posted.

[1] The draft C standards can be found at:
   <http://www.open-std.org/jtc1/sc22/wg14/www/docs/>

--
"If you want to post a followup via groups.google.com, don't use
 the broken "Reply" link at the bottom of the article.  Click on
Re: Making Fatal Hidden Assumptions

CBFalconer wrote (in part):

None of these objections is warranted in the original context,
where the code is given as transliteration of some x86 assembly
language.  In effect, the author is offering a function that
might be used as part of a C implementation, where none of the
usual portability considerations need apply.  Your commentary
really should have included some statement along those lines.


Re: Making Fatal Hidden Assumptions

Mentioned by Paul.

Paul says it would be taken care of.


Much as I dislike some of what Paul posts, on that site Paul explicitly
states in the text that the code is non-portable and mentions at least
a couple of issues.

Personally, I would not expect a function documented as being
non-portable and optimised for a specific processor to specify every
single assumption.
--
Flash Gordon, living in interesting times.
Web site - http://home.flash-gordon.me.uk/
Re: Making Fatal Hidden Assumptions


I haven't read your entire treatise, and I surely wouldn't want to
defend Mr. Hsieh (whose code, he's led us to understand, was written
by his dog), but I do have some comments.

> In line 2 we have the confusing use of sizeof(char), which is 1 by
> definition.  This just serves to obscure the fact that SW is
> actually sizeof(int) later.

The fact that sizeof(char)==1 doesn't make this line bad *if* the
programmer feels that viewing SW as chars per int is important.
I don't defend this particular silliness, but using an expression for
a simple constant is often good if it makes the constant
self-documenting.

> In line 8 we come to a biggie.  The standard specifically does not
> guarantee the action of a pointer below an object.

Mr. Hsieh immediately does p++ and his code will be correct if then
p == s.  I don't question Chuck's argument, or whether the C standard
allows the C compiler to trash the hard disk when it sees p=s-1, but
I'm sincerely curious whether anyone knows of an *actual* environment
where p == s will ever be false after (p = s-1; p++).

Many of us fell in love with C because, in practice, it is so much
simpler and more deterministic than many languages.  Many of the
discussions in comp.lang.c seem like they'd be better in a new
newsgroup:
  comp.lang.i'd_rather_be_a_lawyer

:-)  :-)

James Dow Allen


Re: Making Fatal Hidden Assumptions

If you don't want to know
whether or not (s - 1) causes undefined behavior,
then don't let it bother you.
You can write code any way you want.
But, if you want to discuss whether or not
(s - 1) causes undefined behavior, it does,
and this is the place to find out about it.

--
pete

Re: Making Fatal Hidden Assumptions
James Dow Allen said:

> I don't defend this particular silliness, but using an expression for
> a simple constant is often good if it makes the constant
> self-documenting.

Specifically, there is a good argument for char *p = malloc(n * sizeof *p).

> Mr. Hsieh immediately does p++ and his code will be correct if then
> p == s.

Not so, since forming an invalid pointer invokes undefined behaviour.

> I don't question Chuck's argument, or whether the C standard
> allows the C compiler to trash the hard disk when it sees p=s-1,

It does.


> but I'm sincerely curious whether anyone knows of an *actual*
> environment where p == s will ever be false after (p = s-1; p++).

Since it won't compile, the question is academic. But assuming you meant to
have a test in there somewhere, my personal answer would be: no, I don't
know of such an environment off the top of my head, but there are some
weird environments out there (the comp.lang.c FAQ lists a few), and it
would certainly not surprise me to learn of such an environment. In any
case, such an environment could become mainstream next week, or next year,
or next decade. Carefully-written code that observes the rules will
continue to work in such environments.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
Re: Making Fatal Hidden Assumptions
 > [...] but I'm sincerely curious whether anyone knows of an *actual*
 > environment where p == s will ever be false after (p = s-1; p++).

The problem is that evaluating s-1 might cause an underflow and a
trap, and then you won't even reach the comparison. You don't
necessarily have to dereference an invalid pointer to get a trap.

You might hit this behavior on any segmented architecture (e.g.,
80286, or 80386+ with segments on) and you are virtually guaranteed to
hit it on any architecture with fine-grained segmentation. comp.std.c
periodically reminisces about the old Burroughs architecture, and
it's always possible something like it might come back sometime.

You will also see this behavior in any worthwhile bounds-checking
implementation.

 > Many of the discussions in comp.lang.c seem like they'd be better
 > in a new newsgroup:
 >    comp.lang.i'd_rather_be_a_lawyer
 >
 > :-)  :-)

Yes, well, that's what comp.lang.c is about...

--
   - David A. Holland
     (the above address works if unscrambled but isn't checked often)

Re: Making Fatal Hidden Assumptions

> You might hit this behavior on any segmented architecture (e.g.,
> 80286, or 80386+ with segments on) and you are virtually guaranteed to
> hit it on any architecture with fine-grained segmentation.

I'm certainly no x86 expert.  Can you show or point to the output
of any C compiler which causes an "underflow trap" in this case?

At the risk of repetition, I'm *not* asking whether a past or future
compiler might or may trap (or trash my hard disk); I'd just be
curious to see one (1) actual instance where the computation (without
dereference) p=s-1 causes a trap.

James


Re: Making Fatal Hidden Assumptions

> I'd just be curious to see one (1) actual instance where the
> computation (without dereference) p=s-1 causes a trap.

I don't know of any case where a pet grizzly bear that escaped has
eaten anyone in the Netherlands, but I'm still not short-sighted enough
to use that as an argument to allow grizzly bears as pets.

Richard

Re: Making Fatal Hidden Assumptions
 >>> [...] but I'm sincerely curious whether anyone knows of an *actual*
 >>> environment where p == s will ever be false after (p = s-1; p++).
 >>
 >> The problem is that evaluating s-1 might cause an underflow and a
 >> trap, and then you won't even reach the comparison. You don't
 >> necessarily have to dereference an invalid pointer to get a trap.
 >>
 >> You might hit this behavior on any segmented architecture (e.g.,
 >> 80286, or 80386+ with segments on) ...
 >
 >  I'm certainly no x86 expert.  Can you show or point to the output
 >  of any C compiler which causes an "underflow trap" in this case?

Have you tried bounds-checking gcc?

I don't think I've ever myself seen a compiler that targeted 286
protected mode. Maybe some of the early DOS-extender compilers did,
before everyone switched to 386+. If you can find one and set it to
generate code for some kind of "huge" memory model (supporting
individual objects more than 64K in size) I'd expect it to trap if you
picked a suitable location for `s' to point to.

That assumes you can find a 286 to run it on, too.

Otherwise, I don't know of any, but I'm hardly an expert on strange
platforms.

(Note: Coherent was a 286 protected mode platform, but it only
supported the "small" memory model... and it had a K&R-only compiler,
so it's not a viable example.)

--
   - David A. Holland
     (the above address works if unscrambled but isn't checked often)

Re: Making Fatal Hidden Assumptions
> I'd just be curious to see one (1) actual instance where the
> computation (without dereference) p=s-1 causes a trap.

There are actual environments where 's - 1' alone is enough to cause a
crash. In fact, any non-flat memory model environment (i.e. environment
with 'segment:offset' pointers) would be a good candidate. The modern
x86 will normally crash, unless the implementation takes specific steps
to avoid it.

--
Best regards,
Andrey Tarasevich

Re: Making Fatal Hidden Assumptions

This illustrates the fact that usenet threads are uncontrollable.
I wrote the original to draw attention to hidden assumptions, and
it has immediately degenerated into thrashing about the one real
error in the sample code.  I could have corrected and eliminated
that error by a slight code rework, but then I would have modified
Mr Hsieh's code.  There were at least seven further assumptions,
most of which were necessary for the purposes of the code, but
strictly limited its applicability.

My aim was to get people to recognize and document such hidden
assumptions, rather than leaving them lying there to create sneaky
bugs in apparently portable code.

Re: Making Fatal Hidden Assumptions
On Wed, 08 Mar 2006 11:52:52 -0800, Andrey Tarasevich

> There are actual environments where 's - 1' alone is enough to cause a
> crash. In fact, any non-flat memory model environment (i.e. environment
> with 'segment:offset' pointers) would be a good candidate. The modern
> x86 will normally crash, unless the implementation takes specific steps
> to avoid it.

Exactly which x86 mode are you referring to?

16 bit real mode, virtual86 mode, or some 32-bit mode (which are after
all segmented modes, with all segment registers holding the same
value)?

If s is stored in 16 bit mode in ES:DX with DX=0, then p=s-1 would
need to decrement ES by one and store 000F in DX.  Why would reloading
ES cause any traps, since no actual memory reference is attempted?
Doing p++ would most likely just increment DX by one to 0010, thus
ES:DX would point to s again, which is a legal address, but with a
different internal representation.

IIRC some 32 bit addressing mode would trap if one tried to load the
segment register, but again, how could the caller generate such
constructs as s = ES:0 at least from user mode. In practice s = ES:0
could only be set by a kernel mode routine calling a user mode
routine, so this is really an issue only with main() parameters.

Paul
 

Re: Making Fatal Hidden Assumptions


> If s is stored in 16 bit mode in ES:DX with DX=0, then p=s-1 would
> need to decrement ES by one and store 000F in DX.

I don't think this is what happens. Instead I think DX will wrap to FFFF,
and ES will remain unchanged.

-Michael.



Re: Making Fatal Hidden Assumptions
> I don't think this is what happens. Instead I think DX will wrap to
> FFFF, and ES will remain unchanged.

More easily answered if the original statement had been preserved
in context.  However in this case if the ES segment has range
limits, a segment fault will be triggered.  Due to the happy
looseness of the term "undefined behavior" no specific action is
needed, which is what enables the loosing of nasal demons or
launching of missiles.

Re: Making Fatal Hidden Assumptions

> However in this case if the ES segment has range
> limits, a segment fault will be triggered.

Nah.  Seg faults only happen on access, in that situation.  The value of
DX can't matter: it isn't part of the "pointer" until an access happens.

It looks as though the AS/400 is the only machine that that particular
restriction in the standard is there to support.  Yay.

Cheers,

--
Andrew


Re: Making Fatal Hidden Assumptions
On Thu, 16 Mar 2006 14:45:47 +0100, "Michael Jørgensen"

> I don't think this is what happens. Instead I think DX will wrap to
> FFFF, and ES will remain unchanged.



If p=s-1 would only cause decrementing of DX (without affecting ES),
thus wrapping to FFFF, then what would p++ do?  It would be logical to
assume that only DX would be affected and incremented by 1, causing DX
to wrap back to 0000, and hence ES:DX would have the same
representation as the original s and p == s would be true.

Paul


Re: Making Fatal Hidden Assumptions

Getting people to think about their code is no bad idea.  I've added a
couple of comments below...

> 1.  p = s - 1 is valid.  Not guaranteed.  Careless coding.

Not guaranteed in what way?  You are not guaranteed that p will be a
valid pointer, but you don't require it to be a valid pointer - all that
is required is that "p = s - 1" followed by "p++" leaves p equal to s.
I'm not good enough at the laws of C to tell you if this is valid, but
the laws of real life implementation say that it *is* valid on standard
2's complement cpus, *assuming* you do not have any sort of saturated
arithmetic.

In other words, the assumption is valid on most systems, but may be
risky on some DSPs or on dinosaurs.

> 2.  cast (int) p is meaningful.  Not guaranteed.

In what way could it not be meaningful here?  In particular, the code is
correct even in a 32-bit int, 64-bit pointer environment.

> 3.  Use of 2's complement arithmetic.

Again, this is valid on everything bar a few dinosaurs.

> 4.  ints have no trap representations or hidden bits.

Ditto.


> 5.  4 == sizeof(int) && 8 == CHAR_BIT.

This is clearly an assumption that is not valid in general (although
perfectly acceptable in the context of this webpage, which assumes an
x86 in 32-bit mode).

> 6.  size_t is actually int.

No, the assumption is that the ratio of two size_t items is compatible
with int.  This may or may not be valid according to the laws of C.

> 7.  sizeof(int) is a power of 2.

This is implied by (5), so it's not a new assumption.

> 8.  int alignment depends on a zeroed bit field.

I'm not quite sure what you mean here, but I think this is implied by
(3) and (4).

> This is a serious bug which, for example, can cause segfaults
> on many systems.

That's incorrect.  No machine (that I can imagine) would segfault on p =
s - 1.  It might segfault if p is then dereferenced, but it never is
(until it has been incremented again).

> Since strlen is normally supplied by the system, the system
> designer can guarantee all but item 1.  Otherwise this is
> not portable.

Since strlen is supplied by the system, it is reasonable to make
assumptions about the system when writing it.  It is *impossible* to
write good C code for a low-level routine like this without making
assumptions.  The nearest you can get to a portable solution is a
horrible mess involving pre-processor directives to generate different
code depending on the target architecture.  *All* the assumptions made
here are valid in that context - the code is good.


As a general point, however, it is important to comment such routines
with their assumptions, and possibly use pre-processor directives to
give an error if the assumptions are not valid.  In particular, this
code assumes a 32-bit int, 8-bit char model, and an otherwise standard CPU.

<snip similar analysis>

> I have cross-posted this without setting follow-ups, because I
> believe that discussion will be valid in all the newsgroups posted.

I'm following this in comp.arch.embedded, where we see a lot of
different target architectures, and where efficient code is perhaps more
important than in the world of "big" computers (at least, a higher
proportion of "big" computer programmers seem to forget about
efficiency...).  There are two things about such target-specific
assumptions, in embedded programming - you should make such assumptions,
for the sake of good code, and you must know what they are, for the sake
of correct and re-usable code.

Re: Making Fatal Hidden Assumptions
> I'm not good enough at the laws of C to tell you if this is valid, but
> the laws of real life implementation say that it *is* valid on standard
> 2's complement cpus, *assuming* you do not have any sort of saturated
> arithmetic.


Merely subtracting 1 from s, renders the entire code undefined.
You're "off the map" as far as the laws of C are concerned.
On comp.lang.c,
we're mostly interested in what the laws of C *do* say
is guaranteed to work.

--
pete

Re: Making Fatal Hidden Assumptions

> Merely subtracting 1 from s, renders the entire code undefined.
> You're "off the map" as far as the laws of C are concerned.

It seems to me ironic that, in a discussion about hidden assumptions, the
truth of this remark requires a hidden assumption about how the function
is called.  Unless I am missing something big, p = s - 1 is fine unless s
points to the first element of an array (or worse)[1].  One can imagine
situations such as "string pool" implementations where all strings are
part of the same array where p = s - 1 is well defined. (You'd have to
take care with the first string: char pool[BIG_NUMBER]; char *pool_start =
pool + 1;).

[1] Since this is a language law discussion, we need to take the
definition from the standard where T *s is deemed to point to an array of
size one if it points to an object of type T.  By "worse" I mean that s
does not point into (or just past) an array at all.

--
Ben.
