Information theory/coding

Hi,

I need a scheme to represent "identifiers" (serial numbers) that is resistant to "typical human errors" in recognition and reporting. These are important for licensing and configuration management -- so, an error can be costly (in time or performance).

I.e., an 8 digit (decimal) number is too easily misremembered (or misreported); a user can easily transpose digits, etc. [e.g., I have a colleague who routinely leaves me notes: "Pass this on to Gray", "Hold these for Gray", etc. There is no "Gray" -- though there is a "Gary"! :> Thankfully, he doesn't call me "Dno"!]

My first thought was to just create a population of identifiers with a large Hamming distance and "hope for the best". E.g., like credit card account numbers having far more digits than souls on the planet decreases the chance that a misreported number happens to coincide with a legitimate account.

But, this is folly -- the errors people introduce when reading, remembering and reporting identifiers are not of the same type as data being corrupted on a noisy channel!

[Besides, that just makes the 8 digit number even longer! How many folks can remember someone *else's* SSN (9 digits) or phone number (10 digits), etc. -- unaided?]

So, I took a different approach: increase the set of "digits" used in representing these identifiers. (a greater number of choices for each "digit" can handle a greater number of "values")

The obvious such choice would be to use alphanumerics (or just alphabetics). But, this results in distressing identifiers (e.g., "Q0B8") that are just as difficult to remember. It also introduces chances of ambiguous misreads ("Was that an '8' or a 'B'? An 'O' or an '0'? etc."). And, it does nothing about misreported numbers -- like poor "Gray"!

But, it has appeal because it shrinks the identifier to a more manageable length (~32 symbols means 4 "digits" gives you 1,000,000 valid combinations; 6 "digits" increases that another 3 orders of magnitude!)

The problem lies in the fact that digits are "interchangeable" so transpositional errors can't be caught -- "Q0B8" and "0Q8B" are each potentially valid identifiers!

OK, so impose some structure on the identifier. E.g., something like digit-letter-digit-letter (not robust enough) or maybe even-digit-vowel-odd-digit-consonant (not enough "values"). This allows simple tests to identify obvious cases of transposition or recording errors (e.g., "1A3B", "A03B", "0A38", etc. are all obviously "bad" identifiers). It still leaves other potential ambiguities (0 vs O vs Q, 8 vs B, etc.)

Trying to push to a (user-) friendlier implementation, I finally seized on the idea of replacing the "single character symbols" with *words* -- from a restrictive vocabulary and grammar. If, for example, I have a list of ~30 words that are valid as the *first* word in the identifier, a DIFFERENT group of 30 that are valid as the second word in the identifier, etc. then I can replace an N-character (case insensitive) alphanumeric with an N-word identifier!

This would let me easily isolate "bad" identifiers: "first word must be one of: ...". E.g., if the first words are all chosen to "begin with 'A'" then "boat apple frog house" is a bad identifier (regardless of the actual set of words valid for "word #1" *or* the rules for the remaining words).
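A minimal sketch of that positional check (Python, with made-up word lists purely for illustration -- the real vocabularies would be chosen per position):

WORD_LISTS = [
    {"apple", "anchor", "autumn"},       # valid first words (all begin with 'A' here)
    {"berries", "bricks", "boulders"},   # valid second words
    {"ran", "ate", "climbed"},           # valid third words
    {"quickly", "slowly", "loudly"},     # valid fourth words
]

def is_plausible(identifier):
    """Reject an identifier as soon as any word is not in the list for its slot."""
    words = identifier.lower().split()
    if len(words) != len(WORD_LISTS):
        return False
    return all(w in allowed for w, allowed in zip(words, WORD_LISTS))

print(is_plausible("apple berries ran quickly"))  # True
print(is_plausible("boat apple frog house"))      # False: "boat" is not a valid first word

So a bad identifier is flagged before you ever consult the list of *issued* identifiers.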

As a final revision, if I impose a *grammar* on the word choices, then the resulting identifier can be more memorable (though *nonsensical*) to the user: "Five Berries Ran Quickly" ( -- if I've correctly remembered those terms! :> )

This has *got* to be the easiest way (for the user) to remember identifiers accurately *and* make the verification of those identifiers "robust"! I'll have to figure out just how many different "words" (symbols) I need in each of those "places" to get the 100,000,000 combinations required... I don't think adding more than a fifth word would be prudent (4 seems to be the sweet spot)
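The sizing arithmetic, sketched in Python (assuming k word positions, n choices per position, and the 100,000,000 target mentioned above):

def min_choices(k, target=100_000_000):
    """Smallest vocabulary size n per position such that n**k >= target."""
    n = 1
    while n ** k < target:
        n += 1
    return n

for k in (3, 4, 5):
    n = min_choices(k)
    print(f"{k} words: ~{n} choices per position ({n**k:,} combinations)")
# 3 words: ~465 per position; 4 words: ~100; 5 words: ~40

So the 4-word "sweet spot" needs a vocabulary of roughly 100 words per position; adding a fifth word drops that to about 40.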

[of course, this addresses English-speaking customers, only]

Comments?

--don

Reply to
D Yuniskis

They shouldn't need to remember it, just read and say it or copy/paste it into an email.

An easy (or easy-ish) approach is to use hex numbers but remap what you consider a potential ambiguity: 8 and B, so map B --> H; 0 and D, so map D --> K. Or, use lower case hex letters, which have more visual hints, although they introduce their own potential ambiguities with b and d and 6.
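A tiny sketch of that remapping (Python; the two substitutions are just the example pairs above, not a complete de-ambiguated alphabet):

REMAP   = str.maketrans({"B": "H", "D": "K"})
INVERSE = str.maketrans({"H": "B", "K": "D"})

def encode(serial):
    return format(serial, "08X").translate(REMAP)

def decode(text):
    return int(text.upper().translate(INVERSE), 16)

s = encode(0x0BADF00D)
print(s)                          # "0HAKF00K"
print(decode(s) == 0x0BADF00D)    # True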

Include a hash, such as a four character CRC-16, as part of the identifier. One could also be clever and scatter the hash symbols rather than packing them all on one word.
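For the check itself, something along these lines would do (Python sketch; CRC-16/CCITT is just one convenient 16-bit hash, and the "-" separator is arbitrary):

def crc16_ccitt(data, crc=0xFFFF):
    """Bitwise CRC-16/CCITT-FALSE (poly 0x1021, init 0xFFFF)."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if (crc & 0x8000) else (crc << 1)
            crc &= 0xFFFF
    return crc

def with_check(identifier):
    return f"{identifier}-{crc16_ccitt(identifier.encode()):04X}"

def verify(tagged):
    identifier, _, check = tagged.rpartition("-")
    return f"{crc16_ccitt(identifier.encode()):04X}" == check.upper()

tagged = with_check("Q0B8")
print(verify(tagged))                 # True
print(verify("0Q8B" + tagged[4:]))    # a transposed body almost always fails the check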

If it's for licensing, go the route so many others have followed. Supply the product with a unique serial number (that may itself include a hash). Have the end-user register the serial number, you supply a registration code that includes a hash built from that base number and the base serial number.

--
Rich Webb     Norfolk, VA
Reply to
Rich Webb

Look up 'Luhn checksum'. It caters for most transpositions. Not completely foolproof, but good enough for a lot of users.
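For reference, a minimal Luhn implementation (Python; the worked example is the classic 7992739871, whose check digit is 3):

def luhn_check_digit(payload):
    """Check digit that makes the Luhn sum of payload+digit a multiple of 10."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:        # these positions get doubled once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def luhn_valid(number):
    return luhn_check_digit(number[:-1]) == number[-1]

print(luhn_check_digit("7992739871"))  # '3'
print(luhn_valid("79927398713"))       # True
print(luhn_valid("79927398731"))       # False: last two digits transposed

It catches all single-digit errors and all adjacent transpositions except 09 <-> 90.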

Reply to
Rocky

I don't know about you but *I* don't have any form of communication device available to me 24/7 (I certainly don't carry a PC with me everywhere, nor a cell phone). If someone gives me a phone number, I jot it on a piece of paper until I can get to "wherever" I record my phone contacts (I've a slip of paper in my pocket right now with "Charles'" phone number on it :> )

When reviewing other people's writing, I frequently see conflicts: {4, 9}; {5, S}; {0, C, D, Q}; {8, B}; {1, I, L}; {F, P}; {H, K}; {U, V}; etc.

Any mapping you create should be recognizable to the user, as well. I.e., he should *know* that a symbol he has recorded can't possibly be a "B" because B has been outlawed, etc. This is not an intuitive process for most folks (e.g., I always misremember which "letters" are missing from a (US) telephone dial -- logic tells me it should be {0,Q,O} and {1,I}... but logic is always wrong, there! :< )

And, even with Hex, you need 7 digits to represent 100M identifiers (not counting any check digits, etc.)

The advantage that the "words" approach has is that it leverages existing knowledge/familiarity -- the redundant "letters" just reinforce the "concept" represented by the word (even for folks who spell "berries" as "berrys", etc.). And, it does so without putting an increased cognitive load on the user (I suspect most people can remember a sentence of four familiar words -- even if it is nonsensical: "Eight giraffes boiled slowly")

Yes, but now you're up to 11 "digits"? And, there is nothing "memorable" about any of them!

The goal isn't *just* to prevent errors but, rather, to facilitate recording/remembrance/recognition, etc.

(e.g., I could put a 35 character code -- XP license? -- on each device which I suspect will drive the number of false positives to *zero*. But, at the expense of being incredibly tedious for the user)

That was the point, here. The "identifier" (S/N) is tracked at the source for configuration management, features, etc.

We're not concerned with the typical case of people trying to "crack" a "registration code". Rather, trying to track "who has what". I.e., there is no incentive for a user to falsify an identifier. It doesn't *get* him anything (that he doesn't already *have* -- or NOT have -- under his *real* "identifier")

Reply to
D Yuniskis

Ah, that's interesting! But, it still leaves me with 8 (or 9) digit identifiers and does nothing to alleviate the load on the user to record/remember these.

Reply to
D Yuniskis

I'd still go with a hash of some sort that would "self identify" a transcription or mis-reading error. It does make the sequence of digits a little longer. Is there a requirement that the end-user memorize the number?

--
Rich Webb     Norfolk, VA
Reply to
Rich Webb

(Read the original post for details)

Some time ago I was thinking of a similar scheme. The trigger was having to re-enter 4 times the registration key for a software package I was installing. The (long) key was printed in a font where the digit '6' and the letter 'G', as well as other pairs, were indistinguishable from each other.

I took the initial alphabet 0..9A..Z and removed all potentially ambiguous characters. That is, the quartet '0' (digit), 'O' (letter), 'D' (letter) 'Q' (letter) was removed, ditto for '1' and 'I', '5' and 'S', letter 'U' and letter 'V', and so on.

This left "3479ACEFHJKLMNPRTWXYZ" (Until a see a font where the '7' and the 'T' look alike)

That's 21 characters. Interpreting strings as a number in base 21 allows 14-character keys to be stored in a 64-bit long int. A 20-char "Microsoft-like" key could be stored in 88 bits.
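A sketch of that encoding (Python; the 14-character width corresponds to the 64-bit case above):

ALPHABET = "3479ACEFHJKLMNPRTWXYZ"   # 21 symbols, visually ambiguous glyphs removed
BASE = len(ALPHABET)
INDEX = {c: i for i, c in enumerate(ALPHABET)}

def encode(n, width=14):
    chars = []
    for _ in range(width):
        n, r = divmod(n, BASE)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

def decode(s):
    n = 0
    for c in s.upper():
        n = n * BASE + INDEX[c]
    return n

key = encode(99_999_999)
print(key, decode(key) == 99_999_999)   # True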

Moving on, any encryption key, password, etc would have to be built from these characters + some structure, as in digit-letter-letter-digit-letter-letter.... Also, part of the strings (not necessarily at the end) would be a checksum or CRC signature that should be able to detect swapped characters.

Sorry, do not have any concrete details. This was not an actual problem, more a "how would I solve this" type of thing, and the napkin was used and then discarded.

-- Roberto Waltman

[ Please reply to the group, return address is invalid ]
Reply to
Roberto Waltman

I've seen this done before and indeed I was about to suggest it before I read on and saw you were already considering it. It is relatively easy to analyse. If each word class has 32 entries, that is five bits of data per word. Five such words give you 25 bits, or ~33 million combinations. To get it up to 100 million plus you can encode an additional couple of bits by using four distinct "sentence" structures, for a resulting 27 bit quantity.

That gives you your 100 million combinations but no space for checksumming: if you want to validate the input that will have to come out of the available combinations, or you need more words and/or larger dictionaries.
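Spelled out (Python, just restating the arithmetic above):

words_per_class, positions, templates = 32, 5, 4
combinations = (words_per_class ** positions) * templates
print(f"{combinations:,}")   # 134,217,728 = 2**27, comfortably over 100 million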

--
Andrew Smallshaw
andrews@sdf.lonestar.org
Reply to
Andrew Smallshaw

No "requirement". Nor is it particularly useful "in the long run" for the user to have memorized it. I.e., it is *not* like a SSN or credit card number that you *could* benefit from memorization.

OTOH, being able to cache it in short term memory to eliminate the need to "write it down" (or otherwise record it) is a desirable feature. (recording it could be tedious for some users).

The "words" approach has the benefit that you can almost *guess* what the word is from some recognizable group of characters without having to meticulously identify each individual character ("wh_l_" is probably "while", "whole" or "whale"; if you are expecting a noun, then "whale" is the likely choice).

E.g., when I am driving, I frequently use the "length" (width?) of the word(s) on a street sign as a gauge of the number of letters and, thus, whether it is likely to be the street I seek. Granted, this isn't foolproof but it can let me "see beyond my eyes" and take action earlier than I might otherwise (e.g., start working my way to the turn lane).

If, in this case, all street names were 7 letters, I'd be SoL! :>

Reply to
D Yuniskis

Exactly. Clearly a case where the vendor has done something *entirely* in their own interests without concern for the user (customer). I.e., they want to protect their IP (understandably) and have settled on a solution that further burdens the *customer* (!) instead of assuming that burden on themselves.

Yes. But, this only addresses "font faults".

[I chuckle remembering a friend blaming one of her typos on a "font defect" :> ]

If the user is likely to record this information (handwritten or otherwise) then you have to also consider ambiguities that can arise thereafter.

E.g., I see people write 9's with "open loops" at the top that resemble 4's; F's that can be regarded as poorly formed P's; a C that leaves you wondering if it really is an unclosed O, etc. All of these consequences of quickly jotting down a number (that they can't *remember*!) without taking much care in the process. Then, later, wondering "what they wrote".

If you introduce physical disabilities to the mix, you're just making the process even harder! ("And why, exactly, did I buy this &^%*^ product???")

Since there is only *one* such identifier, the costs of storing it don't seem significant. E.g., using the "words" approach, I would literally store the *words* themselves -- not some table driven mapping function that can derive a "word set" from a "decimal serial number".

The lesson to be learned, here, is not to eat chili-dogs or any other "messy" foods while designing!! ;-)

Reply to
D Yuniskis

I suspect it wouldn't be too hard for people (users) to accept. After all, populations that we never would have expected to "understand" a MAC, IP address, etc. can at least refer to them now.

But, I haven't figured out an appropriate metaphor or terminology to use in referencing it: "What is your serial number?" is clearly misleading -- "there are no numbers on the box! where do I look for it??". Likewise, something "cute" like "what is your pass phrase?" is suggestive of a "secret"/password: "I forgot! I didn't write it down anywhere..."

This requires some pondering.

One advantage I see is that the user doesn't implicitly know what the "wrong" symbols are. E.g., you can misremember "123" as "124". But, are you likely to misremember "the brown shoe" as "the yellow bird"? (i.e., do you even know *if* "yellow" and "bird" are in the lexicon of symbols??)

That was the motivation behind the template I mentioned earlier. In addition to the list of valid "symbols" (words) for each "word position" being constrained, the order is implied/enforced or derivable if you just remember the words but not their order.

E.g., "gorilla, slowly, climbed, nine" only makes "sense" in the order: "nine gorilla(s) slowly climbed".

By contrast, a grammar like noun-verb-noun, in which the predicate can also serve as a subject, allows things like: "gorillas eat trucks" and the equally nonsensical (though grammatically correct) "trucks eat gorillas" -- are both of these valid??

The hurdle I think I need to surmount is adding or avoiding a fifth word to the "sentence". Trading off increasing the number of valid "words" for each "word position" (PoS) vs. the total number of words in the phrase.

Reply to
D Yuniskis

May I ask /why/ the user has to /remember/ these codes? Will they be needed at a moment's notice for emergency override?

Anyway, research has shown that people can remember 7-10 "symbols" in the short term ... the semantic content of the symbols doesn't seem to matter.

Breaking a long number into small groups of characters, as well as exploiting patterns, seems to help most people. People don't remember a phone number atomically as 2125551212 or by character as 2 1 2 5 5 5 1 2 1 2 ... first it will be grouped into area code, exchange (prefix) and number: 212 555 1212 and then, in this case, the number 1212 typically will be remembered as a repeated pattern 1 2 ... so in memory the whole thing becomes 4 symbols: 212 555 1-2 1-2.

This kind of grouping works for character data as well. Roberto's idea of restricting the code to characters that are easily distinguishable in typeface helps if the person has to read it.

Also, phonetic word spelling such as used by the military helps some people. It's easier to remember "November One Tango Charlie" than N1TC.

But if you're specifically trying to address dyslexia, then nothing you try to do is likely to help.

George

Reply to
George Neuner

They don't need to remember them "in the long term" (i.e., not like you might remember your credit card numbers, social security number, etc.). Rather, they need to remember them in the short term to avoid having to resort to writing them down.

I was taught 5-7. And, as I get older, I feel like revising that number downward! :>

Note that STM is also severely impacted by "distractions". E.g., if you have to commit your "attention" to some other task (like dialing the phone number of the help desk), then those activities compete for STM real estate (requiring you to re-"notice" them to keep them refreshed).

Yes, grouping or chunking. Note that phone numbers tend to have psychological ties -- an emotional connection with the number (even if it is someone you are going to call to complain!). Arbitrary numbers/letters, however, tend to be more difficult to keep alive in the short term. Hence, people mumbling to themselves seemingly oblivious to things around them (because those "things" represent distractions that will interfere with their STM)

That was the basis for my "words" solution. Except words that at least *try* to relate to each other ("six elephants ate bricks" is probably more memorable than "november one zebra xray").

And, since only *I* need to be aware of the entire lexicon, I am not limited to 26 different words. Nor do I run the risk of two words being swapped (if I force a grammar on the phrases I construct)

If you avoid interchangeable symbols, then you can at least detect that a transposition has taken place: a digit-letter pair can be recognized when transposed to letter-digit, whereas a digit-digit pair can't be!

Reply to
D Yuniskis

Absolutely. I fly and if I'm landing at a towered airport or talking to ATC, I have to use a notepad for everything, including something as simple as a four digit transponder code. My short-term memory pretty much evaporates under any sort of multitasking.

Reply to
Jim Stewart

If you transpose the letters of all words in a sentence, except the first and last letters of each word, you get a sentence most readers will be able to understand. Note that 3-letter words will remain the same.

On the other hand, people seem to remember 3-digits numbers quite well.

So, my favourite scheme to temporarily (or even permanently) remember long numbers is to break them into 3-digit pieces.

For instance, it's easier to remember this:

682 411 175

than this:

682411175

especially if you read the number digit by digit, and make a pause between the groups.

Rhythm is important!

Reply to
Ignacio G. T.

Yes. Or, if you jot that sentence down and misspell many of the words, you are still likely to "recover" the original words (I am continuously amazed at some of the creative misspellings I encounter! tornato, spinage, etc. :> )

Agreed. Easier still if it fits a common template (phone number, etc.). E.g., credit card numbers are tedious.

The problem is trying to come up with a scheme that is resilient enough to be applied across age groups, presentation media, etc.

E.g., some research indicates that cognitive decline begins as early as the late 20's (!). Most would agree that 40-ish is the point at which a notable decrease is observable. And, with the average age of the population (US) increasing, it seems prudent to address this reality (instead of just opting for "the easy way out").

The "word based" approach also seems to capitalize on the fact that vocabulary tends to *improve* with age (great! you'll know more words -- but you won't be able to REMEMBER them! ;-) ).

Unfortunately, I've forgotten more about grammar than I ever *knew* so I need to go visit the children's section of the library to relearn the basics... how to create *simple* sentences/phrases that can then be reliably "parameterized".

"See _____ run! Run _____, run."

I wonder if I should include a box of *crayons* with every device??

Reply to
D Yuniskis

HP addressed this problem in the 1970's with their 5004A Signature analyzer. Their hex digits (displayed in 7-segment LEDs) were 0123456789ACFHPU.

See: formatting link

--
Thad
Reply to
Thad Smith

I hacked together an informal little "test" to see just how readily folks would remember "short phrases". (I only chose "victims" that I knew, personally -- i.e., they were inclined to want to cooperate with me. OTOH, one would tend to expect a user to be inclined to want to recall an identifier that he had to "present" to a service rep!).

I would ask folks to remember a short phrase and tell them that I was going to ask them to repeat it to me later. So, I have pre-loaded their minds with the idea that they should *try* to remember this. I would select the phrase from a list I had prepared ahead of time -- with no *intentional* prejudice towards "who" got "what" phrase.

Then, engage in our usual "banter" for some period of time (at least 5 minutes -- sometimes almost an hour!!).

[granted, this period should have been better controlled than it was. But, I was just doing an *informal* survey to decide if this idea was worth pursuing further]

Three word *simple* phrases -- regardless of how nonsensical they were -- seemed easily remembered, accurately. Four word phrases met with roughly similar success -- *unless* the words used were polysyllabic.

E.g., "hippopatumuses" [sic] was less easily remembered than "hippos"; "revolutionary" less memorable than "wild".

I don't think the subject matter was a factor (i.e., it wasn't like I was using long LATIN words vs. short *pejoratives* -- which one would assume might be more "memorable").

It didn't seem to matter where in the "phrase" the "long word(s)" was(were) located. And, the more long words there were, the worse the chance of recall. As if they got stuck trying to remember the first "long word" and this drove all memory of the *other* long word out of their mind.

So, I just started reading over the lists of phrases (including those that I'd not yet "tested" on folks). When doing this, I noticed that I was "vocalizing" in my mind each phrase as I tried to commit it to memory. I.e., it wasn't like I was "photographing" the words on the page but, rather, was repeating them to myself AS IF I had been speaking -- just not *aloud*.

And, a natural observation accompanying this is that long, polysyllabic words take a long time to *say* (d'uh!).

Likewise, 4 words take longer to say than 3. And 5 take even longer, still! And, 4 words of which one or two are *lengthy* can be a deal breaker!

I've always contended (when designing interfaces) that people don't have good auditory memories. I.e., if you see a motion video "get choppy", you can better cope with it than if you *hear* an audio segment get choppy. It's like images have more persistence in our brains so we can (?) connect-the-dots between chopped up images (e.g., strobe effect, large dropouts, etc.) rather easily. OTOH, doing the same with audio gives less than ideal results.

So, I wonder if the problem isn't that we (? I can only speak for myself, here. But, try it. Write down three or four words and try to commit them to memory -- even if only temporarily. Do you just *visually* process the printed words or do you "subvocalize"?) tend to put these sequences of sounds into our "audio memory" and that memory is "too short"?

[maybe that's the key: audio is used for 1-dimensional recording whereas video is used for 2D?]

Sorry, I'm just flailing around trying to make sense out of random observations...

Reply to
Don Y
