Looking for pointers on implementing a symbol table using little RAM

P

Peter Dickerson 15 years ago

I'm looking for ways to convert textual symbols/identifiers into a small integer e.g. 16-bit short int. The key things are for the amount of RAM used to be small, lookup to be reasonably quick and for back convertion from the integer to string to be possible. The table needs to be created incrementally such that the integers coresponding to a given identifier doesn't change when new identifier is added. The target system (LPC) has a lot more Flash than RAM so I already have a table of common and system identifiers in Flash. Of course, for that I have the entire list in advance so I can sort and index it for speed, and space is less of an issue.

The aim is to produce a small scripting engine for driving my hardware. There is already a command system but it's not powerful enough for the future (or even for now). The simplest solution is a long list of strings with new ones appended. Searching is then linear and the return value could be the position of the found string from the start or somtheing like that.

Thanks for any hints...

Peter

Vote

J

Jon Kirwan 15 years ago

Chapter 3, "C Interfaces and Implementations," David R. Hanson, 1997:

formatting link

Jon

Vote

P

Peter Dickerson 15 years ago

Thanks for the quick reply. After a quick scan this looks like a normal hash table type solution. Quick and does the job except it isn't particularly compact. I'm looking for something where RAM data structures are not much bigger than all the strings concatinated, and perhaps smaller (e.g. sharing prefix). Still, a good link to remind myself of the clasic solution.

Peter

Vote

H

Hans-Bernhard Bröker 15 years ago

Just about the most efficient and compact, while generic, solution would be binary lookup in an indirect sorted table. I.e. keep your strings in an array whose index is the "command number", but keep a second table that holds indices into that table, in alphabetical order of strings. Then do a bsearch() on that index table.

For extra points, make a tool that builds this sorted table off-line, so it can go into constant memory, too.

OTOH, if speed is more of a worry than memory consumption, you might want to build a "perfect hash function". If so, GNU's "gperf" might be something for you.

Vote

P

Peter Dickerson 15 years ago

No. The point is I need to build it incrementally. I don't know the strings ahead of time and I need to use the values before the table is complete - hence they mustn't change. Off-line solutions aren't solutions in my case. I do have a set of pre-known strings and those are kept in Flash in a sorted table as you describe. However, that uses more memory than I was hoping for and the indexes would change with additions.

Peter

Vote

T

Tauno Voipio 15 years ago

If you can limit the symbol character set to 40 values, (39 + null), you can use the old trick of the PDP translators by using base-40 coding to pack 3 charactrs in a 16 bit integer.

A classical solution to your problem is a B-tree, but it is quite greedy on pointer space.

Tauno Voipio tauno voipio (at) iki fi

Vote

P

Peter Dickerson 15 years ago

Unfortunately three chars isn't enough. I also need case sensitive.

Yes, that or a hashing solution. Both are quite expensive on extra data on top of the text.

Lots of subtables one for each length would mean that the length doesn't have to be kept separately for each entry. Similarly, subtables with the same start character would save for each entry. Trading per-entry data for fixed overhead (pointer/offsets to subtables etc).

Peter

Vote

C

Chris Burrows 15 years ago

You need to state your requirements more precisely if you are looking for an optimum solution to satisfy all of them. As a starting point, answering at least some of the following questions might help:

What is the range of character values allowed in the symbols? (e.g. ASCII

020 Hex to 07F Hex)

What is the minimum / maximum length of the symbols?

What is the maximum number of symbols you need to allow for in one session?

What is the maximum total symbol length?

Are the symbols based on English words?

Is the frequency of occurrence of all characters equivalent or some more likely to occur than others?

What is the required minimum / desired / maximum response time?

What is the minimum / desired / maximum amount of RAM you want to use?

Regards, Chris Burrows

CFB Software Astrobe: LPC2xxx Development System

formatting link

Vote

D

D Yuniskis 15 years ago

I'm not sure of the *actual* criteria you are imposing, here.

The first idea that comes to mind is a hash function (though making that invertible is a chore). Of course, other "compression" techniques can also be used (5 or 6 bit chars; stateful encodings akin to Baudot, etc.)

When you "add a new symbol", are *all* of the symbols in existence

*known* at that point? Or, are you trying to construct a symbol table for "this application" on-the-fly?

Can you just store a symbol table "in some format" with each application (i.e., script), parse it at load time and then discard it thereafter? I.e., if a particular script never references "Foobar27", then why set aside an identifier for "Foobar27"? (the "built-in" identifiers can have *their* symbol table stored in FLASH and the application's symbol table parsed EFFECTIVELY *after* that one is)

What do the symbols represent? E.g., if they are "command words" (aka Forth), you could store the string literal at some fixed location in the "command code", etc. Ditto for "variables". Mapping an identifier back to a string is then a linear search of all "registered" commands/variables.

Hard to say without knowing how you actually want to be able to resolve symbols (algorithmically).

Vote

J

Jon Kirwan 15 years ago

The solution mentioned, Atoms, is faily compact. Your earlier comment was that it needed to be compact. The above is newer information, that you'd like it smaller by something like sharing parts that can be shared, readily.

I've no idea what the size of strings are. That has a huge impact on the design, given the additional information you mention above.

Jon

Vote

P

Peter Dickerson 15 years ago

In the first instance the need to be identifiers containing a-z,A-Z,0-9 and underscore. However, a solution that can handle printable UTF-8 would also be of interest.

Note that I'm not creating the symbols just the scripting language. So the length is whatever the user wants, but they will be aware that long identifiers will reduce memory for other things. In any case source code lines have a limit of 1000 bytes, which would limit identifiers to that. In practice, I'd bee happy with a much smaller limit such as 50. Min is 1 but I already have all of those in Flash, so you could say 2.

Limited by memory capacity, but a few hundred might surfice. Generally the issue is the other way round. I have n bytes for symbol space, how many can I cram in. n here is likely to be a small number of K bytes.

As big as will fit in the memory allocated to symbols.

Mine would likely to be but other people might not. There are some common biotech terms that will probably end up in Flash so that using those would not take up additional RAM. In other words if I knew what the user would commonly use I'd shift it to the read-only table.

Thats up to the user, but I'd expect letters to be more common than digits, and those to be more common than underscore.

As fast as I can have without sacrificing compactness. Indentifier lookup is the common case, so needs to be fastest. Identifier insertion is next and integer back to identifier text can be slow because it onl;y occurs in an idiom that I don't want to incourage.

I am targetting LPC2377 right now, which has 58K of RAM. However, I might only allows 20K for scripting, so maybe 4K for symbols.

Peter

Vote

P

Peter Dickerson 15 years ago

On the fly. I use symbols that have been inserted before others have even been identified (pun intended).

Can't determine what won't be used easily (or at all). The system needs to include an eval type function. It is also possible to construct source code at run time and compile it. So the sysmbol table is updated at run time.

Yes, linear search is the base against which to guage alternatives.

I just want to use it like an atom table. Textual identifier converted to small number. Small number used instead of text. Occasionally need to convert small number back to text.

Peter

Vote

P

Peter Dickerson 15 years ago

The naive solution is to concatinate all the strings and comapare them one by one. This puts an upper limit on how much memory is necessary to deliver the functionality. I am just hoping for something similar in sixe that delivers better performance. In particular the naive solution has fast insert, slow search and slow back covertion.

They are identifiers, in the programming sense. Its for a scripting language. Short is common, more than 20 chars is rare.

Peter

Vote

J

Jon Kirwan 15 years ago

Ignoring the idea of compression (by whatever skilled means devised) for now, anything that provides better features than the simple solution you mention (which is identical to how DOS used to handle the "environment" string list) would probably need to include some kind of additional information such as pointers. In other words, without compression improvements won't be free (there is some price to pay and accept.)

The way to buy improvements without a price is, as you've suggested, to include some means of compressing the strings themselves.

Which gets to the issue of compression. With long strings, there is a lot of opportunity for compression to make a significant difference. With short words, certainly simple schemes are less likely to improve things much. Frequency counting compression ideas would help... but unfortunately for your case, adding words would change frequencies and thus the optimal compression. So that gets you back to "not much good available here."

Of course, if the letters you allow are limited, they could be packed more tightly.

So that gets back to the idea that if you don't have a lot of compression opportunity available, you either _must_ pay at least a small additional price for features over and above what you get with your basic plan or else accept the basic plan you first outlined.

............

I'm going to add a question. You mentioned a scripting language. Is this something similar to, say, BASIC or perhaps FORTH? And are you using this during the execution of an interpreter which must look up "variables" in a symbol table in order to find them (or add them on the fly?)

If so, there are ideas which predate my first experience with computers, used when RAM was tight (the entire timeshared system ran in 8-16k word) and symbol lookup during execution time very important, which may help (and may not.)

Jon

Vote

M

Michael Karas 15 years ago

I have implemented string tables that supported efficient storage in packed format. String lengths limited to 253 bytes and using a single byte ID per string so as to support up to 255 strings in the table. For small numbers of strings the lookup is quite efficient. They are stored as:

MAXTABLEN LEN ID STRINGBYTES LEN ID STRINGBYTES .... LEN ID STRINGBYTES NULL.

Finding strings is bouncing through the table length by length and looking at the ID bytes. Deleting strings is as simple as finding string and then closing up hole in the table. Adding strings get placed at end of table after determining if previous ID existed and deleting it.

Used for many things over years including symbol storage.

Offlist to share.

Michael Karas Carousel Design Solutions http://www.carousel-design.com

Vote

D

D Yuniskis 15 years ago

But when do you "define" and "resolve" those symbols (i.e., add them to the table; then look them up *in* the table)? E.g., if you do this at *load* time, then you "see" *all* of the symbols (at least, all that are referenced in that "script") in a particular, predictable order. OTOH, if you try to define them as you encounter them during execution (interpretation?), the order can vary from one invocation to the next.

My point is, a symbol that isn't used need not have space set aside for it in the symbol table. I.e., when SCRIPT_A is loaded, the symbol "Foo" might be assigned 0x0003 yet when SCRIPT_B is loaded, the symbol "Baz" might be assigned that 0x0003. I.e., there is no need (?) for a particular mapping to be in place at all time (Foo -> 0x0003, e.g.)

I wrote a small BASIC interpreter years (decades?) ago as a scripting mechanism (back when 16K was a lot of EPROM). Reserved words had fixed "atoms" (to use your terms). These were hard-coded in the EPROM. Symbols encountered in the "TEXT" were packed into four per three bytes (i.e., 6b per "character"). This kept the symbol table manageable in terms of size. Whitespace was discarded. Etc.

A consequence of this is that when you attempt to "list" the "script", it doesn't appear the same way that it was entered. E.g., whitespace gets replaced with a single space, "GOTO" is explicitly inserted even though the parser syntax didn't require it, etc. For some folks, this can be annoying. I.e., you might opt to deliberately present (reconstructed) "source" in a "different enough" format that it is very obvious that you are NOT trying to reproduce the original "syntax" (e.g., you might deliberately reformat IF - THEN's on multiple lines, etc.)

Vote

P

Peter Dickerson 15 years ago

Hi again

Symbols are defined dynamicly. They can be assigned as the script is parsed but others won't be known until run time because of the dynamic nature. What never happens is that symbols are deleted, not until the whole table is deallocated. The issue is that objects created at run time include the ids from lookups in the table. The same id can't be used to mean different things at different times.

no there is no need for the id number to be the same each time the script is run. In particular on might run a pre-script that sets up the environment for the main script. Changes to the pre-script might have knock-on changes to the id numbers issue for the main script, but that wouldn't of itself change how the script ran.

Yes, that sort of thing. Generally symbol text compression is an orthogonal issue.

OK, but I don't expect to need to expand the tokenized/compiled code back to text. But I will need to option to include some debugging info.

Peter

Vote

Looking for pointers on implementing a symbol table using little RAM

Join the Discussion

Didn't find your answer?