Hi,
I need a scheme to represent "identifiers" (serial numbers) that is resistant to "typical human errors" in recognition and reporting. These are important for licensing and configuration management -- so, an error can be costly (in time or performance).
I.e., an 8 digit (decimal) number is too easily misremembered (or misreported); a user can easily transpose digits, etc. [e.g., I have a colleague who routinely leaves me notes: "Pass this on to Gray", "Hold these for Gray", etc. There is no "Gray" -- though there is a "Gary"! :> Thankfully, he doesn't call me "Dno"!]
My first thought was to just create a population of identifiers with a large pairwise Hamming distance and "hope for the best". E.g., credit card account numbers have far more digits than souls on the planet, which decreases the chance that a misreported number happens to coincide with a legitimate account.
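(A quick sketch of that idea -- greedily keeping only identifiers that differ from every previously kept one in at least 3 digit positions. Python; the parameters are illustrative only:)

    from itertools import product

    def hamming(a, b):
        """Number of positions at which two equal-length strings differ."""
        return sum(x != y for x, y in zip(a, b))

    def pick_codes(length=4, alphabet="0123456789", min_dist=3, want=100):
        """Greedily collect identifiers whose pairwise Hamming distance is
        at least min_dist -- a one- or two-digit error can't turn one
        valid code into another."""
        kept = []
        for digits in product(alphabet, repeat=length):
            cand = "".join(digits)
            if all(hamming(cand, k) >= min_dist for k in kept):
                kept.append(cand)
                if len(kept) == want:
                    break
        return kept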
But, this is folly -- the errors people introduce when reading, remembering and reporting identifiers are not of the same type as data being corrupted on a noisy channel!
[Besides, that just makes the 8 digit number even longer! How many folks can remember someone *else's* SSN (9 digits) or phone number (10 digits), etc. -- unaided?]

So, I took a different approach: increase the set of "digits" used in representing these identifiers. (A greater number of choices for each "digit" can handle a greater number of "values".)
The obvious such choice would be to use alphanumerics (or just alphabetics). But, this results in distressing identifiers (e.g., "Q0B8") that are just as difficult to remember. It also introduces chances of ambiguous misreads ("Was that an '8' or a 'B'? An 'O' or an '0'? etc."). And, it does nothing about misreported numbers -- like poor "Gray"!
But, it has appeal because it shrinks the identifier to a more manageable length (~32 symbols means 4 "digits" gives you 1,000,000 valid combinations; 6 "digits" increases that another 3 orders of magnitude!)
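(Sanity check on the arithmetic: 32^4 = 1,048,576 and 32^6 = 1,073,741,824 -- so six base-32 "digits" comfortably covers the 100,000,000 identifiers that an 8 digit decimal number provides.)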
The problem lies in the fact that digits are "interchangeable" so transpositional errors can't be caught -- "Q0B8" and "0Q8B" are each potentially valid identifiers!
OK, so impose some structure on the identifier. E.g., something like digit-letter-digit-letter (not robust enough) or maybe even digit-vowel-odd digit-consonant (not enough "values"). This allows simple tests to identify obvious cases of transposition or errors in recording (e.g., "1A3B", "A03B", "0A38", etc. are all obviously "bad" identifiers). It still leaves other potential ambiguities (0 vs O vs Q, 8 vs B, etc.)
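(A minimal structural test for the even digit-vowel-odd digit-consonant form -- a Python sketch; note the character classes deliberately still admit the ambiguous O, Q, B, etc.:)

    import re

    # even digit, vowel, odd digit, consonant
    PATTERN = re.compile(r"[02468][AEIOU][13579][BCDFGHJKLMNPQRSTVWXYZ]")

    def looks_valid(ident: str) -> bool:
        """Catches obviously transposed or misrecorded identifiers."""
        return bool(PATTERN.fullmatch(ident.upper()))

    for ident in ("2A3B", "1A3B", "A03B", "0A38"):
        print(ident, looks_valid(ident))
    # 2A3B True; the rest False ("1" isn't even, "A03B" starts with a
    # letter, "8" isn't a consonant)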
Trying to push to a (user-) friendlier implementation, I finally seized on the idea of replacing the "single character symbols" with *words* -- from a restrictive vocabulary and grammar. If, for example, I have a list of ~30 words that are valid as the *first* word in the identifier, a DIFFERENT group of 30 that are valid as the second word in the identifier, etc. then I can replace an N-character (case insensitive) alphanumeric with an N-word identifier!
This would let me easily isolate "bad" identifiers: "first word must be one of: ...". E.g., if the first words are all chosen to "begin with 'A'" then "boat apple frog house" is a bad identifier (regardless of the actual set of words valid for "word #1" *or* the rules for the remaining words).
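(Sketch with stand-in word lists -- the real vocabularies would be larger and chosen per whatever per-position rule I settle on:)

    # Illustrative only: each position gets its own vocabulary.
    WORDS = [
        {"apple", "anchor", "autumn"},    # word #1: all begin with 'A'
        {"boat", "berry", "bridge"},      # word #2
        {"frog", "fence", "forest"},      # word #3
        {"house", "hammer", "harbor"},    # word #4
    ]

    def valid_identifier(phrase: str) -> bool:
        words = phrase.lower().split()
        return (len(words) == len(WORDS)
                and all(w in vocab for w, vocab in zip(words, WORDS)))

    print(valid_identifier("apple berry frog house"))  # True
    print(valid_identifier("boat apple frog house"))   # False -- "boat" can't be word #1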
As a final revision, if I impose a *grammar* on the word choices, then the resulting identifier can be more memorable (though *nonsensical*) to the user: "Five Berries Ran Quickly" ( -- if I've correctly remembered those terms! :> )

This has *got* to be the easiest way (for the user) to remember identifiers accurately *and* make the verification of those identifiers "robust"! I'll have to figure out just how many different "words" (symbols) I need in each of those "places" to get the 100,000,000 combinations required... I don't think going to a fifth word would be prudent (4 seems to be the sweet spot).
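(Sizing the vocabularies: with 4 word positions, each list needs the 4th root of 100,000,000 entries, i.e. 100 words apiece -- the ~30-word lists above only yield 30^4 = 810,000 combinations. A fifth word would shrink the lists to 40 words each. Quick check:)

    def per_list_needed(target: int, positions: int) -> int:
        """Smallest per-position vocabulary covering `target` combinations."""
        n = 1
        while n ** positions < target:
            n += 1
        return n

    for positions in (4, 5):
        print(positions, per_list_needed(100_000_000, positions))
    # 4 -> 100 words per list; 5 -> 40 words per list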
[of course, this addresses English-speaking customers, only]

Comments?
--don