We often find hidden, and totally unnecessary, assumptions being made in code. The following leans heavily on one particular example, which happens to be in C. However similar things can (and do) occur in any language.
These assumptions are generally made because of familiarity with the language. As a non-code example, consider the idea that the faulty code is written by blackguards bent on foulling the language. The term blackguards is not in favor these days, and for good reason. However, the older you are, the more likely you are to have used it since childhood, and to use it again, barring specific thought on the subject. The same type of thing applies to writing code.
I hope, with this little monograph, to encourage people to examine some hidden assumptions they are making in their code. As ever, in dealing with C, the reference standard is the ISO C standard. Versions can be found in text and pdf format, by searching for N869 and N1124. [1] The latter does not have a text version, but is more up-to-date.
We will always have innocent appearing code with these kinds of assumptions built-in. However it would be wise to annotate such code to make the assumptions explicit, which can avoid a great deal of agony when the code is reused under other systems.
In the following example, the code is as downloaded from the referenced URL, and the comments are entirely mine, including the 'every 5' linenumber references.
/* Making fatal hidden assumptions */ /* Paul Hsiehs version of strlen.
Some sneaky hidden assumptions here: 1. p = s - 1 is valid. Not guaranteed. Careless coding. 2. cast (int) p is meaningful. Not guaranteed. 3. Use of 2's complement arithmetic. 4. ints have no trap representations or hidden bits. 5. 4 == sizeof(int) && 8 == CHAR_BIT. 6. size_t is actually int. 7. sizeof(int) is a power of 2. 8. int alignment depends on a zeroed bit field.
Since strlen is normally supplied by the system, the system designer can guarantee all but item 1. Otherwise this is not portable. Item 1 can probably be beaten by suitable code reorganization to avoid the initial p = s - 1. This is a serious bug which, for example, can cause segfaults on many systems. It is most likely to foul when (int)s has the value 0, and is meaningful.
He fails to make the valid assumption: 1 == sizeof(char).
*/#define hasNulByte(x) ((x - 0x01010101) & ~x & 0x80808080) #define SW (sizeof (int) / sizeof (char))
int xstrlen (const char *s) { const char *p; /* 5 */ int d;
p = s - 1; do { p++; /* 10 */ if ((((int) p) & (SW - 1)) == 0) { do { d = *((int *) p); p += SW; } while (!hasNulByte (d)); /* 15 */ p -= SW; } } while (*p != 0); return p - s; } /* 20 */
Let us start with line 1! The constants appear to require that sizeof(int) be 4, and that CHAR_BIT be precisely 8. I haven't really looked too closely, and it is possible that the ~x term allows for larger sizeof(int), but nothing allows for larger CHAR_BIT. A further hidden assumption is that there are no trap values in the representation of an int. Its functioning is doubtful when sizeof(int) is less that 4. At the least it will force promotion to long, which will seriously affect the speed.
This is an ingenious and speedy way of detecting a zero byte within an int, provided the preconditions are met. There is nothing wrong with it, PROVIDED we know when it is valid.
In line 2 we have the confusing use of sizeof(char), which is 1 by definition. This just serves to obscure the fact that SW is actually sizeof(int) later. No hidden assumptions have been made here, but the usage helps to conceal later assumptions.
Line 4. Since this is intended to replace the systems strlen() function, it would seem advantageous to use the appropriate signature for the function. In particular strlen returns a size_t, not an int. size_t is always unsigned.
In line 8 we come to a biggie. The standard specifically does not guarantee the action of a pointer below an object. The only real purpose of this statement is to compensate for the initial increment in line 10. This can be avoided by rearrangement of the code, which will then let the routine function where the assumptions are valid. This is the only real error in the code that I see.
In line 11 we have several hidden assumptions. The first is that the cast of a pointer to an int is valid. This is never guaranteed. A pointer can be much larger than an int, and may have all sorts of non-integer like information embedded, such as segment id. If sizeof(int) is less than 4 the validity of this is even less likely.
Then we come to the purpose of the statement, which is to discover if the pointer is suitably aligned for an int. It does this by bit-anding with SW-1, which is the concealed sizeof(int)-1. This won't be very useful if sizeof(int) is, say, 3 or any other non-poweroftwo. In addition, it assumes that an aligned pointer will have those bits zero. While this last is very likely in todays systems, it is still an assumption. The system designer is entitled to assume this, but user code is not.
Line 13 again uses the unwarranted cast of a pointer to an int. This enables the use of the already suspicious macro hasNulByte in line 15.
If all these assumptions are correct, line 19 finally calculates a pointer difference (which is valid, and of type size_t or ssize_t, but will always fit into a size_t). It then does a concealed cast of this into an int, which could cause undefined or implementation defined behaviour if the value exceeds what will fit into an int. This one is also unnecessary, since it is trivial to define the return type as size_t and guarantee success.
I haven't even mentioned the assumption of 2's complement arithmetic, which I believe to be embedded in the hasNulByte macro. I haven't bothered to think this out.
Would you believe that so many hidden assumptions can be embedded in such innocent looking code? The sneaky thing is that the code appears trivially correct at first glance. This is the stuff that Heisenbugs are made of. Yet use of such code is fairly safe if we are aware of those hidden assumptions.
I have cross-posted this without setting follow-ups, because I believe that discussion will be valid in all the newsgroups posted.
[1] The draft C standards can be found at: