hashlib package

- C
- CBFalconer
  
  Contact options for registered users
posted
19 years ago

Wed, Nov 3, 2004 10:55 PM

I released this under GPL some time ago, (2003-May) and have been advertising it occasionally here, where it seemed applicable. I have received no bug reports.

I have just gotten around to writing a usage manual for it, which follows. I would like some opinions on it. Please don't quote the whole thing back at me, a short excerpt followed by your pithy commentary will do nicely. I am off for an operation Monday, so I won't be available for a while after that, and you might as well hold things after Saturday.

HOW TO USE hashlib ==================

To use this easily you should also have a copy of hashlib.h printed out, or easily available in another editor window. It describes the complete interface to hashlib, and this is just an explanation of the functions, why they exist, etc. hashlib.h is extensively commented.

What is it for? ==============

You may be wondering "for what is hashlib useful". The answer is that it is a storage facility. You can hand it things, and it will tuck them away, and make it easy for you to find them later.

A major point is that the time it takes to store, find, delete, or retrieve an item is almost constant no matter how big the table gets. Also, you don't have to worry about the table size, because it will automatically adapt itself. It may hold 5 items or millions. The limit is your memory.

What does it do? ===============

For a list of the things it will do, you should have the file "hashlib.h" handy. This details all the things you can do, and how to customize the system to your data. The interface functions are:

hshinit, hshkill Make or destroy a hashtable hshfind, hshinsert, hshdelete Insert, find, take out items hshwalk For advanced usage, later hshstatus Such things as how many stored

Customizing to your data: ========================

In order to use a table, the first thing you have to do is to create it with hshinit. At that time you tell hashlib how to process your data. I will return to this later.

Your actual data takes some form, which is entirely up to you. It must be possible to refer to a complete data item by a single pointer. Your data also will have some sort of key, or even multiple keys. It can have whatever auxiliary data you like. This implies you must define a structure somewhere for your own benefit:

typedef struct hashitem { sometype yourkey; otherstuff yourdata; } item, *itemptr; The field names, structure name, typedef'd names, etc are entirely up to you. Somewhere in your program you will have at least one of these things. hashlib will make more of them in which to store copies of the data you insert.

Equality ========

Since hashlib works on all forms of data, it obviously can't read your data description. So you have to tell it how to find out that two data items have the identical key. This introduces the type (defined in hashlib.h):

typedef int (*hshcmpfn)(void *litem, void *ritem); which is a function you will design and program, and one of the items you pass in the hshinit call is a pointer to that function. Let us assume that in the item definition above sometype is int (such as the typedef below under copying). Then your comparison function could be:

mycmp(void *litem, void *ritem) { itemptr left = litem; itemptr right = ritem; int lvalue, rvalue; lvalue = left->yourkey; rvalue = right->yourkey; return lvalue == rvalue; } NOTE: I have made this function more complex than it need be, in order to emphasize how it goes about it. The left and right pointers come from hashlib, and hashlib doesn't know about your data type. Therefore it converts them into the C universal pointer, a "void *". When you get them back you have to convert them back into itemptr, so you can access the fields of your data.

All hashlib cares about is "are they equal", so the above returns only 0 or 1, for notequal and equal. The comparison routine will be useful for other things if you make it return

-1, 0, or +1 for less, equal, greater. To do this you could make the return statement say:

return (lvalue > rvalue) - (lvalue < rvalue); which will turn out to be 1-0, 0-0, 0-1 for the three cases. The point is not to return (lvalue - rvalue), because this can run into overflow and give erroneous results.

Copying =======

When you pass an item to hashlib you don't want to worry about who owns the space it takes. Therefore the principle is "hashlib owns all the items it stores". Thus hashlib makes a copy of any data item it inserts into the table. Once more, only you know how to do this, and you have to tell hashlib.

typedef void *(*hshdupfn)(void *item);

in hashlib.h specifies what this function must look like. For the simple structure above, all it would have to do is malloc space for a copy, and copy the fields. Remember it is dealing with pointer to data, and the first thing you have to do is make the item pointer into a pointer to your structure.

Lets make the simple data structure above more concrete:

typedef struct hashitem { int yourkey; int yourdata; } item, *itemptr;

Then the hshdupefn (notice how the function is defined by editing the typedef for hshdupfn) could be:

void *mydupe(void *item) { itemptr myitem = item; itemptr newitem; if (newitem = malloc(sizeof *newitem) { newitem.yourkey = myitem.yourkey; newitem.yourdata = myitem.yourdata; /* or "*newitem = *myitem" in this case */ } return newitem; } Notice again that only your code knows what is in the items to be stored, and thus how to copy them. Your item can be as complicated as you wish. So lets make it store strings:

typedef struct hashitem { char *yourkey; int yourdata; } item, *itemptr;

and see how it affects the hshdupefn. Yourkey is now just a pointer to a string somewhere, which may want to be modified or used in some manner. So we have do what is called a deep copy.

void *mydupe(void *item) { itemptr myitem = item; itemptr newitem; if (newitem = malloc(sizeof *newitem) { if (newitem->yourkey = malloc(1+strlen(myitem->yourkey) { strcpy(newitem->yourkey, myitem->yourkey; newitem.yourdata = myitem.yourdata; } else { /* we ran out of memory, release and fail */ free(newitem) newitem = NULL } } return newitem; }

Notice how it returns NULL if malloc fails to secure the necessary memory anywhere. This allows hashlib to do the right things under nasty cases, such as exhausting memory.

The need for a deep copy is generally signalled by having pointers in your data type description. All those pointers have to be resolved to data that can belong to the hash table.

Letting Go ==========

Once you have thrown a whole mess of data at hashlib, and it is keeping track, you may decide to release it all. While you could often just abandon it, and let the operating system clean up after you when your program ends, this is not a good practice. Besides, your program may not end. So you have to tell hashlib how to get rid of one item, which it will use to get rid of all of them when you use the hshkill function (described later).

typedef void (*hshfreefn)(void *item); in hashlib.h describes that function. Now we will assume the complex hshdupefn last described above, and the corresponding type definition for an item. Again, we build the function header by editing the typedef and converting the passed void* pointer:

void myundupe(void *item) { itemptr myitem = item; free(myitem->yourkey); /* First, because this won't */ free(myitem); /* exist after this one. */ }

thus returning all the allocated memory. Notice how it undoes everything that mydupe did. The mydupe/myundupe pair could even open and close files, but you will rarely want to handle thousands of open files at once.

Hashing =======

This is fundamental to the efficient operation of a hashtable, although hashlib can put up with pretty rotten hashing and still grind out answers (but it may take a long time). What we need to do is calculate a single unsigned long value from the key. What these functions are is basically black magic, therefore hashlib contains a couple of utility functions usable for hashing strings. There are also examples of hashing integers in the hashtest.c program along with some references to the subject of creating hash functions.

Because of the efficient way hashlib handles overflows (it basically just corrects them) it is necessary to have two hash functions. For the above item type with strings, they would be:

typedef unsigned long (*hshfn)(void *item); for reference, which we edit again and get:

unsigned long myhash(void *item) { itemptr myitem = item; /* getting used to this? */ return hshstrhash(myitem->yourkey); }

and we need two such functions, so:

unsigned long myrehash(void *item) { itemptr myitem = item; /* getting used to this? */ return hshstrehash(myitem->yourkey); } which basically differ only in their names and in the convenience hash function they call.

Now we have finally customized the system to our own data format. We will tell hashlib about these functions when we create a hashtable with hshinit.

Using hashlib =============

First, we need some way to refer to the table. So we must have a data item of type hshtbl* to hold it. We will initialize that by calling hshinit. This is much like opening a file. For convenience here is the prototype for hshinit again:

/* initialize and return a pointer to the data base */ hshtbl *hshinit(hshfn hash, hshfn rehash, hshcmpfn cmp, hshdupfn dupe, hshfreefn undupe, int hdebug);

Now this following is a fragment from your code:

hshtbl *mytable;

/* initialize and return a pointer to the data base */ mytable = hshinit(myhash, myrehash, mycmp, mydupe, myundupe, 0);

which tells hashlib all about the customizing functions you have created. Note that all those functions can be static, unless you have other uses for them outside your source file. You can use those functions yourself as you please.

Don't forget the final 0 in the call to hshinit. That parameter provides for future extensions and debugging abilities, and passing a zero here will maintain compatibility.

You can create more than one hash table if you desire. If they handle the same data format you can just do exactly the same call as above, except you will need a new variable of type hshtbl* to hold the table identification. If they don't hold the same data type you can supply different functions to hshinit. It is up to you.

hshtbl *mysecondtable;

mysecondtable = hshinit(....); /* as before */

These tables will live until you exterminate them. Meanwhile you can store, find, delete, etc. items from the table. You destroy the table by calling hshkill with the pointer that hshinit returned.

hshkill(mytable); /* all gone */ but until that is done, lets use the functions:

Inserting (storing) data: =========================

From here on I am assuming you have opened the hash table with

mytable = hshinit(...), and that you have defined your data with:

typedef struct hashitem { char *yourkey; int yourdata; } item, *itemptr;

Surprise, you store data by calling hshinsert. Here is the prototype, for reference:

void * hshinsert(hshtbl *master, void *item);

and you call it with a pointer to the table in which to insert the item, and a pointer to the item to insert.

You may have a variable of type item (after all, you know what it is, even if hashlib does not). So the critical items are:

hshtable *mytable; item myitem; item *something;

You will put the data you want into myitem, filling its fields as needed. Then you call:

something = hshinsert(mytable, &myitem); If, after this, 'something' is NULL, the insertion failed (probably because you ran out of memory). Otherwise 'something' points to the piece of memory owned by hshlib which stores a copy of myitem. You can use something to modify the stored copy, but you MUST NOT do anything that would change the value of the key, and thus change what a hshfn such as myhash or myrehash returns when passed that item. NEVER EVER do that.

One thing you might want to do is have a field in an item that holds a count. You could have the dupe function zero this field, so that you know how it is initialized. Then, when hshinsert returns an itemptr you can use that to increment that field. That way you can keep track of how many times a given key has been inserted.

NOTE: If hshinsert finds an item already stored, it simply returns a pointer to that storage. It does not use the dupe function to make another copy.

Finding a data item by the key: ==============================

Again we have the same variables as above for insertion. We simply call:

something = hshfind(mytable, &item); and if 'something' is NULL the item is not present, otherwise it is a pointer to the memory holding it. The same cautions as for hshinsert hold, i.e. you MUST NOT do anything that affects the key and thus the hash functions. Being present means only that 'something' and &item have identical keys, as defined by mycmp() function.

Deleting stored items: =====================

Again, we have the same variables. Surprise, the calling format is the same:

something = hshdelete(mytable, &item); but now there is a significant difference. The hash table no longer owns the memory that stored that item, you do. So you have to do something with it, assuming it isn't NULL (meaning that the value in item was never stored in the table). What you do is up to you, but sooner or later you should release it by: myundupe(something); which you designed specifically for this purpose.

Other abilities ===============

I plan to add information about walking the entire contents of the table, and performing operations on each stored item. There are illustrations of these operations in the demonstration applications (markov and wdfreq) in the hashlib package.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
     USE worldnet address!

- G
- Gary Kato
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Thu, Nov 4, 2004 1:19 AM

Good luck on the operation.

- H
- Holger Hasselbach
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Thu, Nov 4, 2004 3:07 PM

No comments for the manual, but for the hashlib - constructive, as I hope. The most important thing first: No bugs found so far. ;)

Your hashlib is "generously" commented. Unfortunately, there are a lot of comments like this one:

master->hstatus.probes++; /* count total probes */

while the more important comments are completely missing: Comments on the algorithm and design decisions. Here are some ideas for the missing comments.

The general element insertion looks like this:

+--------------------------+ | I x x D x N | +--------------------------+

Array, size is a prime. I is the initial search position denoted by the first hash value. The step size for the search if the I position is not NULL is denoted by the second hash value. Step is added with wrap-around. x are arbitrary elements. D is an entry marked as DELETED. N is an empty entry (NULL).

- Why is it important that the array size is a prime? (mathematical guaranteed to cover all array entries for *any* step size below the array size and except 0)

- Why is it important to mark deleted entries as DELETED instead of simply resetting them to NULL? (breaking the search chains)

- Why are DELETED entries skipped instead of being reused? (possible by memorizing the first DELETED/NULL entry, nonetheless completely searching the current chain to the end, then inserting on the memorized position)

- Why is a second hash function used for the step size, instead of calculating it from the first value, e.g. instead of h2 = master->rehash(item) % (master->currentsz >> 3) + 1; calculating it by h2 = ((h >> 13) | (h currentst >> 3) + 1;

If the first hash function is weak (but fast), then there will probably be a lot of collisions, making it necessary to call the second (better and slower) hash function, anyway. So why not using one single and strong hash function right from the beginning? Was this an arbitrary choice or backed up by literature or profiling?

- How were the thresholds chosen? Arbitrary (for easy binary calculation) or by literature or profiling? #define TTHRESH(sz) (sz - (sz >> 3)) if (master->hstatus.hdeleted > (master->hstatus.hentries / 4))

Holger

- D
- Dave Vandervies
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Thu, Nov 4, 2004 10:44 PM

It would also be helpful to have an abbreviated reference manual, with a list of what functions you define, what functions the user needs to define, what they do, and requirements on the input and output, but without the verbose commentary.

(I haven't looked at the package, only the usage manual you posted, so if you already have this I can safely be ignored, at least on this point.)

F'rexample, this is nice when you're not familiar with the library:

But once you've got a working knowledge of it, it's a lot easier to go to this to refresh your memory for, say, "What's the significance of the return value for that one?":

}Inserting (storing) data }------------------------ }void *hshinsert(hshtbl *, void *); }handleptr=hshinsert(mytable, &myitem); } }Inserts item myitem into the table referenced by mytable. }Returns: }Pointer to newly allocated (with user-specified dupe function) internal } item storage if the item was not already in the table }Pointer to already-allocated internal item storage if the item was } already in the table }NULL on failure (most likely out-of-memory) } }The handle pointer may be used to modify non-key data in the item. }It MUST NOT be used to modify key data.

dave

--
Dave Vandervies                             dj3vande@csclub.uwaterloo.ca
I'm ashamed to admit it, but I actually thought of this possibility when
writing the code shown.  That means I've been hanging around comp.lang.c
*much* too long.                            --Eric Sosman in comp.lang.c

- C
- CBFalconer
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Fri, Nov 5, 2004 2:44 AM

But I thought that was just what I was doing! These things are all detailed in the .h file and here I was trying to tie them together as a logical entity, so it could be used intelligently.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
     USE worldnet address!

- E
- E. Robert Tisdale
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Fri, Nov 5, 2004 3:05 AM

You might want to look into something like Doxygen

formatting link

to help you generate documentaion from the comments in your *.h file(s).

- D
- Dave Vandervies
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Fri, Nov 5, 2004 4:22 AM

If they're all detailed in the .h file, then it sounds like the next paragraph that you snipped applies: }(I haven't looked at the package, only the usage manual you posted, so }if you already have this I can safely be ignored, at least on this point.)

dave

--
Dave Vandervies                      dj3vande@csclub.uwaterloo.ca
Hey, Eric's hogging all the paranoia. He must be up to something.
Leave Me Alone too.
                              --Brian B. Rodenborn in comp.lang.c

- E
- Eric Sosman
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Fri, Nov 5, 2004 11:33 PM

Some of the function typedefs and illustrative definitions could be improved by const-qualifying their pointer arguments.

Case in point: the mycmp() function. The manual says it "will be useful for other things" if it goes beyond hashlib's basic true/false contract and instead returns -1,0,+1. If the "other things" are meant to imply qsort() and bsearch(), `const' is required.

The implicit `int' strikes me as a poor idea -- it was, after all, outlawed five years ago.

And, of course, the qsort()/bsearch() contract doesn't require -1 and +1; any -ve and +ve values will suffice.

The first mydupe() function has (at least) three syntax errors. (Two of them are, I think, a consequence of hiding a pointer behind a typedef, a practice I find Really Ugly.) Unoriginal suggestion: Put all these code snippets into a complete program, compile and test the program, and then cut'n'paste into the manual. In short, do as K&R did.

The second mydupe() function has an opportunity to show a technique that you might actually prefer not to let the newbies get a peek at. Instead of

newitem = malloc(sizeof *newitem); newitem->yourkey = malloc(strlen(myitem->yourkey) + 1);

one can reduce the memory-management overhead by using just one allocation:

newitem = malloc(sizeof *newitem + strlen(myitem->yourkey) + 1); newitem->yourkey = (char*)(newitem + 1);

... a sort of distant cousin of the struct hack, but perfectly legal. Of course, myundupe() would have to be changed accordingly -- which might sort of defeat the pedagogic purpose.

Not a criticism of the manual, but of the interface: You didn't agree with me when you first published the interface for review a couple years back, but I still feel that requiring two hash functions when one will do is just plain silly.

That's all from me. I hope the operation was successful and as, er, pleasant as possible.

--
Eric.Sosman@sun.com

- A
- Andy
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Sat, Nov 6, 2004 5:11 AM

Could you explain the advantages of using your library vs. for example the default tool function of libc (search.h) like hcreate_r(), hsearch_r(), etc.?

CBFalconer wrote in message news:...

- C
- CBFalconer
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Sat, Nov 6, 2004 6:26 AM

I agree. Somewhere in the source I wrote a note to myself to constize as much as possible. But it is easy to generate -1,0,1 and safe.

What implicit int?

It is in a complete program - did you see the link to hashlib.zip?

I don't want to be tricky in that sort of manual

... snip ...

It is absolutely necessary for the algorithms I use. It prevents clumping, and keeps the searches quick. This way things that have the same first hash value won't have the same second hash value.

It's this coming Monday.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
     USE worldnet address!

- C
- CBFalconer
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Sat, Nov 6, 2004 8:25 AM

I don't understand the question. None of the things you mention exist in standard C, so those examples don't help. As I said in the manual, it handles putting things away, and finding them again later. I even described all the function calls to it.

--
Chuck F (cbfalconer@yahoo.com) (cbfalconer@worldnet.att.net)
   Available for consulting/temporary embedded and systems.
     USE worldnet address!

- R
- Rob Thorpe
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Sat, Nov 6, 2004 3:04 PM

I don't have many comments right now, I'm sure if I used the library I would think of some.

It would be a good idea to make an info page on hashlib too with the same information in it. Maybe a man page too.

- B
- Ben Pfaff
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Sat, Nov 6, 2004 5:54 PM

You quoted 423 lines of text to add "I don't have any comments?" Please learn some netiquette.

--
Ben Pfaff 
email: blp@cs.stanford.edu
web: http://benpfaff.org

- J
- Jonathan Kirwan
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Sat, Nov 6, 2004 6:49 PM

I was about to make the same comment.

Jon

- R
- Richard Harter
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Sat, Nov 6, 2004 8:28 PM

And the pair of you have contributed even less of value to the thread than Mr. Thorpe did.

- E
- Elder Costa
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Sat, Nov 6, 2004 9:59 PM

And the both of us even less. But I fully agree with the two gentlemen: we could use some netiquete regarding quotes.

Regards.

- F
- Flash Gordon
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Sat, Nov 6, 2004 10:11 PM

However, they still managed to produce a vastly higher signal to noise ratio.

Not snipping is a real problem.

--
Flash Gordon
Living in interesting times.
Although my email address says spam, it is real and I read it.

- A
- Andy
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Sun, Nov 7, 2004 2:54 AM

hcreate_r() and hsearch_r() are standard function calls in libc now (I think since v. 2.0). Run for example "man hcreate" to get some basic info (the _r versions are the re-entrant versions, you can find their declarations in search.h). I was able to run these functions successfully even though I haven't found a way to browse the contents of the hash table (in a similar way your function hashwalk() does).

- P
- pete
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Sun, Nov 7, 2004 3:22 AM

That's not the point. I don't know what libc is, but standard C functions are the functions which are described in the C standard.

formatting link

--
pete

- M
- Mark McIntyre
  
  Contact options for registered users
Vote on answer
posted
19 years ago

Sun, Nov 7, 2004 7:07 PM

CBF was being ironic.

When you x-post to comp.lang.c its as well to be sure you know the topicality requirement here. That is standard C as definedby the ISO standard. What extras might be in yuor libc are largely irrelevant, and could technically render it an "illegal" c implementation.

f:\program files\4nt500>man hcreate

4NT: Unknown command "man"

Hmm.

--
Mark McIntyre
CLC FAQ 
CLC readme: 

----== Posted via Newsfeeds.Com - Unlimited-Uncensored-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+
Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----