I'm trying to settle on a pronunciation algorithm for numeric strings (e.g., "1234567.89") for my *backup* speech synthesizer (i.e., the synthesizer that MUST WORK regardless of what else might *not* be working -- think of this as the mechanism by which panic() messages are emitted).
The backup synthesizer is more severely resource constrained. OTOH, it also doesn't need to be as clever/effective! The range of messages that it has to emit is less "general". Yet, it doesn't want to be deliberately *stilted* just for the sake of efficiency!
E.g., the above value could be spoken as:
"one two three four five six seven point eight nine"
"one two three, four five six, seven, point eight nine"
"one, two three four, five six seven, point eight nine"
"one million, two hundred thirty four thousand, five hundred sixty seven, point eighty nine"
There are a couple of issues, here. The first is the cost of emitting the "translation". The second is the "intelligibility" of the spoken output.
I.e., if someone *read* these four different interpretations to you, which could you most easily "write down"? Which could you most easily *remember* (even if you don't remember all of the actual digits, can you recall the magnitude of the number -- obviously, my 1234... example is artificial and easy to remember).
Might there be a hybrid approach that improves intelligibility (at some increased implementation expense)? E.g., recite individual digits for values greater than 1,000 and more "verbose" representations for smaller values (like "two hundred and five").
Or, ways by which hints can be passed to the synthesizer *without* explicitly passing directives. For example: [say_as_digits] 1234567.89 vs. [say_verbose] 1234567.89 are less desirable than counterparts like: 1234567.89 vs. 1,234,567.89 (the latter encoding the "verbose" flag by the presence of the group separators).
Consider, also, how things like inflection might be affected by these different presentations (e.g., you can think of the verbose presentation as tending to carry more inflection than the "as digits" -- which might end up sounding monotonic).
Also, think about how different the audio channel is from the visual channel. While I can include provisions to let the user repeat/review the message, you really don't want to *need* to do this as it indicates a communication deficiency in the design!
I think 'one point two three four five six seven eight nine eee six' is best, providing the recipient understands that format. It's consistent and the 'point' and 'eee' provide anchors.
Hmmm... I hadn't thought about that sort of approach! But, I don't think it would work with the expected audience (think: "normal people") :<
I suspect breaking digits into groups and speaking them *as* digits is probably the best approach. Though that still leaves the choice between: 123 456 7 . 89 and 1 234 567 . 89
I also think small (whatever THAT means) should probably be pronounced "verbose" -- three hundred ninety seven (vs. 3 9 7).
But, all this is just a "hunch". I can think of lots of specific examples that can lead to conflicting results regardless of the rule I apply (this is a consequence of the backup synthesizer's not understanding context :< )
For example:
"The volume is currently set to 12 (twelve)."
"Call tech support at 5 5 5 - 1 2 1 2 between the hours of 11 (eleven) and 8 (eight)"
Well, only you know how minimal your resources really are.
If you don't do everything as a string of digits (which is, after all, the least resource way to go), then you'd pretty much have to solve the "error number vs. numerical quantity" problem by having different functions that treat things differently.
You gonna do this with recorded sounds? Can one even get silicon (or software) that does phoneme-mashing anymore?
--
Tim Wescott
Control system and signal processing consulting
www.wescottdesign.com
If this is your backup speech synthesizer, then what is the event causing this backup to jump in? If it's an infinite loop in an emergency "error 4711 at address 80043226, call 1-555-1234567 error 4711 at address 80043226, call 1-555-1234567", I would pronounce all digits individually. I would also speak 9 as "niner", as used in aviation.
This also has the advantage of being internationalizable. English is mostly big-endian ("three-hundred-ninety-seven" - but "three-hundred-seven-teen"). German is more middle-endian ("drei-hundert-sieben-und-neunzig", "three-hundred-seven-and-ninety"), which is pretty annoying when trying to capture that message with pen and paper. Especially in an emergency.
Maybe I have wrong understanding of a backup "panic()" synthesizer, but I cannot imagine where it would say THAT :-) But then, there's probably enough room to store synthesized numbers 1 to 15 (or how many volume levels there are), which certainly simplifies the synthesizer a lot.
(commas, obviously, representing pauses for phrase groups)
Agreed. But, then it's hard for users to understand what you are trying to convey:
"Battery level is at 8 7 percent"
requires more "attention" on the listener's part than:
"Battery level is at eighty seven percent"
I parse the string presented to the synthesizer. Hence my deliberate use of the "0x" prefix for the hex error code (even if there are no alphas in the "value"). Likewise, I could introduce commas in large numbers to force them to be pronounced more verbosely: 1,234 --> one thousand, two hundred and thirty four vs. one two three four.
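A minimal sketch of this kind of syntax-driven hinting, assuming the numeric token has already been split out of the message. The enum and the function name classify_number() are made up for illustration and are not the actual parser; the sketch also does not check that the commas sit every three digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical speaking styles; the names are illustrative only. */
enum say_mode {
    SAY_IDENTIFIER,  /* "0x" prefix: recite every character           */
    SAY_VERBOSE,     /* group separators present: "one thousand, ..." */
    SAY_DIGITS,      /* bare digit string: "one two three four"       */
    SAY_NOT_NUMBER   /* anything else is left to the word rules       */
};

/* Pick a speaking style from the spelling of the token alone. */
enum say_mode classify_number(const char *tok)
{
    int digits = 0, commas = 0, points = 0;

    assert(tok != NULL);

    /* A "0x" prefix marks an identifier even if no alphas follow. */
    if (tok[0] == '0' && (tok[1] == 'x' || tok[1] == 'X')) {
        const char *p = tok + 2;
        if (*p == '\0')
            return SAY_NOT_NUMBER;
        for (; *p; p++)
            if (!isxdigit((unsigned char)*p))
                return SAY_NOT_NUMBER;
        return SAY_IDENTIFIER;
    }

    for (const char *p = tok; *p; p++) {
        if (isdigit((unsigned char)*p))
            digits++;
        else if (*p == ',' && points == 0)
            commas++;               /* separators only before the point */
        else if (*p == '.' && points == 0)
            points++;               /* at most one decimal point        */
        else
            return SAY_NOT_NUMBER;
    }
    if (digits == 0)
        return SAY_NOT_NUMBER;
    return commas > 0 ? SAY_VERBOSE : SAY_DIGITS;
}
```

With this, "1234" and "1,234" take different paths without any explicit directive in the message text, and "0x80042321" is recited as an identifier.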
No. I create the speech waveforms algorithmically. So, there are a fair bit of resources required before you can even emit a "burp" :-/
Some aspect of the system on which the *normal* synthesizer relies is not available or non-functional.
Think of it like a BSOD: you can't count on having all the resources to display your full graphic user interface; yet, you have to convey some information to the user. So, you resort to a text only "terminal" -- force the video hardware into a known supported configuration (since you can't rely on your knowledge of the *particular* video adapter installed -- that might be corrupted info!), the monitor to a known resolution (since you can't know about its capabilities, either!), etc.
Or, think of it like kprintf() -- how the kernel gets panic()'s out to the user (it can't rely on the network to do so -- since the network hardware/software might be suspect, etc.). It probably doesn't care about printing long doubles -- so kprintf() can be smaller/simpler/MORE RELIABLE... you know when you use it (from within the kernel) that you can only do certain things with it.
My BSOD analogy falls down because a BSOD is static. "I'm broke. Here's some information. End-of-Line."
In my case, some portion of the device remains operable (i.e., speech happens over *time* -- it's not something you just put up on a screen and then halt!). E.g., call tech support and they will likely tell you to try various things and report the "results" (output) from each test.
Why? Is it more intelligible than "nine"?
"Three hundred twenty niner?"
Oh, perhaps you mean as a freestanding *digit*? "eight six niner?"
"Emergency" overstates the situation (though I understand your point). I'm more concerned with folks being able to figure out what is being conveyed to them.
I think most folks are visually oriented and have a harder time taking in information aurally (*I* do!). A string of digits gives you no clue as to when the string will end (people can only hold a certain amount of information in their memory at a time). So, listening to "1 6 3 4 5" can be more "stressful" than "Sixteen thousand, three hundred forty five". The former gives you no clue of "how much more" you will have to remember UNTIL the last digit is spoken (and, presumably, followed by a *word*, etc.). The latter gives you information almost immediately that you can use to construct a template in your head into which you fill the values heard. I.e., you hear "sixteen" and you know it's not going to be 163 or 1,6XX (this is a small lie :> ). When you hear "thousand", you know the value will be 16,XXX and not 16,XXX,XXX,... Etc.
Audio output. You need a way of changing the volume level even when the backup synthesizer is the only output method available. How else would you have the device tell you what the current setting is?
"The volume has been CHANGED to 14"
Unlimited vocabulary. Numbers are easy (in English -- I suspect similarly in other languages, as many have oddities comparable to English "twelve", "eleven", etc. (French onze, douze, etc.) before settling into a more "algorithmic" presentation like the "-teens").
In terms of resources, speaking digits is cheaper simply because it skips so many parts of the synthesis! I.e., it then becomes a problem of mapping a *character* (digit) to a word and the word to a set of sounds.
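static const char *words[] = { "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine" };
/* say() and interdigitpause() are the synthesizer's own routines */
while (isdigit((unsigned char)*ptr)) { say(words[*ptr - '0']); interdigitpause(); ptr++; }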
By contrast, other schemes (like the "verbose" approach) require parsing a *string* of digits (i.e., you have to "look ahead" to see where the string will end so you know how to partition it into phrases!), building words including "units" (thousands, etc.) into phrases applying appropriate inflection to those phrases, then mapping them to sounds.
E.g., "16345" gets converted to: "sixteen thousand, three hundred forty five". This is then parsed IN PLACE OF "16345" to apply the appropriate pauses and inflection -- you speak "one six three four five" in a monotone, generally; but, there is more prosody in the verbose form (something has to create that! :> )
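For comparison with the digit loop, here is a rough sketch of the verbose conversion for values below one million. The names spell_number(), spell_triplet() and append() are hypothetical, the comma in the output only marks where a phrase pause would go, and nothing here handles prosody; it is only meant to show how much more table and logic the verbose form drags in.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static const char *const ONES[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen"
};
static const char *const TENS[] = {
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety"
};

/* Append src to dst, never writing past cap bytes (truncates if full). */
static void append(char *dst, size_t cap, const char *src)
{
    size_t used = strlen(dst);
    if (used + 1 < cap)
        snprintf(dst + used, cap - used, "%s", src);
}

/* Spell 1..999, e.g. 345 -> "three hundred forty five". */
static void spell_triplet(unsigned n, char *out, size_t cap)
{
    if (n >= 100) {
        append(out, cap, ONES[n / 100]);
        append(out, cap, " hundred");
        n %= 100;
        if (n)
            append(out, cap, " ");
    }
    if (n >= 20) {
        append(out, cap, TENS[n / 10]);
        n %= 10;
        if (n)
            append(out, cap, " ");
    }
    if (n > 0)
        append(out, cap, ONES[n]);   /* covers 1..19, including the teens */
}

/* Spell 0..999999 into out. */
void spell_number(unsigned long n, char *out, size_t cap)
{
    assert(cap > 0 && n < 1000000UL);
    out[0] = '\0';
    if (n == 0) {
        append(out, cap, ONES[0]);
        return;
    }
    if (n >= 1000) {
        spell_triplet((unsigned)(n / 1000), out, cap);
        append(out, cap, " thousand");
        n %= 1000;
        if (n)
            append(out, cap, ", ");
    }
    spell_triplet((unsigned)n, out, cap);
}
```

Calling spell_number(16345, buf, sizeof buf) leaves "sixteen thousand, three hundred forty five" in buf, which would then still have to go through the phrase and inflection stages.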
Similarly, note the difference between: 1 234 567 . 89 and 123 456 7 . 89 in terms of the code required to parse each.
The latter is how thousands separators are used, i.e.:
1,234,567.89
and roughly corresponds to how the number would normally be pronounced.
But large numbers really need to have the scale embedded (i.e. 1 million, 234 thousand, 567 point 89) regardless of whether you pronounce 234 as "two three four" or "two hundred (and) thirty four". Due to the use of big-endian notation, you don't know the scale of each digit until you reach the decimal point, at which point you have to mentally backtrack to make any sense of the number.
Yes. The difference between my "1 234 567 . 89" representation and "1,234,567.89" being the more verbose inclusion of units in the latter (instead of just *pauses* between sets of 3 digits).
The "123 456 7 . 89" groups digits in "triplets" but does so without regard for the decimal's location. I.e., if you were to recite (to someone over the phone) an arbitrarily long number devoid of "separators", you might break it into groups of three (or four?) digits... then let the last group have whatever is "left over".
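The two groupings can be sketched side by side. This is an illustration only: group_digits() and its align_to_point flag are hypothetical names, the input is assumed to be a plain decimal string, and the spaces in the output stand in for pauses.

```c
#include <assert.h>
#include <string.h>

/*
 * Copy the integer part of num into out with a space between groups of
 * three digits. If align_to_point is nonzero the short group comes first
 * ("1 234 567"); otherwise it comes last ("123 456 7"). Any fraction is
 * appended unchanged after " . ".
 */
void group_digits(const char *num, int align_to_point, char *out, size_t cap)
{
    size_t int_len = strcspn(num, ".");   /* digits before the point */
    size_t o = 0;

    /* Worst case: a space after every digit plus " . " and the NUL. */
    assert(cap >= 2 * strlen(num) + 4);

    for (size_t i = 0; i < int_len; i++) {
        /* Grouping from the point needs the total length up front;
           grouping from the left only needs a running count. */
        size_t pos = align_to_point ? int_len - i : i;
        if (i > 0 && pos % 3 == 0)
            out[o++] = ' ';
        out[o++] = num[i];
    }
    if (num[int_len] == '.') {
        memcpy(out + o, " . ", 3);
        o += 3;
        for (const char *f = num + int_len + 1; *f; f++)
            out[o++] = *f;
    }
    out[o] = '\0';
}
```

The only difference is the pos calculation: the "leftovers last" form can be produced while streaming through the digits, whereas the form aligned to the decimal point has to look ahead to find where the integer part ends.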
Yes -- *if* you are not just transcribing them onto another medium. I.e., I can recite the digits of pi to you (starting at some arbitrary place to the right of the decimal) and you wouldn't be concerned as to where the decimal was at any given point -- just what the next grouping was going to be.
If, however, you are looking to put a magnitude on the value (implicitly or explicitly), you want to know ASAP just how many more digits you are likely to encounter. E.g., if you hear "two hundred thirty eight septillion", you will probably quickly decide that "3 digits" are all you are going to bother remembering! :>
Conversely, if you hear "two hundred thirty eight thousand", you will probably "prepare yourself" for another three digits without getting overly stressed.
Note that I can opt to use any particular sort of approach as long as it is consistent. The way any particular value is conveyed to the user can alert him to characteristics of the value ahead of time. I.e., if I start rattling off three digit groups (e.g., one five seven), the user can recognize that "this is going to be a *long* string of digits -- not just 3 or 4!"
And, I can adopt presentations that tend to better fit certain usages instead of being pedantic in applying a fixed set of rules. E.g., four digit values might be expressed as "W thousand, X hundred Y-ty Z" -- or "WX hundred, YZ". Or, some combination of the two approaches (i.e., 19XX, 20XX are likely "dates"/years while 3XXX is probably NOT! :> )
So that probably isn't an emergency service, but more kind of an engineering service.
Freestanding digit. They say, "nine" sounds too much like "five" under bad radio conditions. Also, it means "no" in some languages.
That's the difference whether you want to REMEMBER (an approximate number) or whether you want to WRITE DOWN (exactly). For writing down, single digits are much easier. Imagine the number "16005". "Six...": write "6". "teen": whoops, must put a "1" in front. "thousand": very long word, but I cannot write anything here, just must remember. "and" ... waiting ... "five": now I must write two zeroes and a five in a short time. If it says "one six zero zero five", I can just write left-to-right.
It's even worse in French, where they do not say "ninety", but "four-times-twenty-plus-ten". And, I believe, in Russian they say "ten-to-hundred". (I've done a speech synthesizer for a dozen European languages, most of which I don't speak, a few years ago; forgotten most of the grammar rules by now.)
In Germany, it's customary to split telephone numbers into digit pairs and pronounce them as such. E.g., 0 89-32 16 8 is pronounced as "zero, nine-and-eighty, two-and-thirty, six-teen, eight". This bites me all the time because I have to squish the "8" between the "0" and the "9" I heard and wrote before.
"sixteen thousand millions and five"? :-)
I don't know whether numbers will get so big in your application but "billion" vs. "milliard" would be another problem.
If you're doing real sentences, you'll have to talk of flection. English is easy with "Distance set to 1 mile" vs. "Distance set to 5 miles" (but: "hundred-and-one mile" or "hundred-one miles"?). Slavic languages have three cases, depending on the ones and the tens digit.
Yes, that's why I'd prefer that for an emergency BSOD kprintf panic synthesizer.
Do you really parse text before saying it (text-to-speech engine)? I had a bunch of phrases as .wav files, a function 'int sayNumber(int value)', and a simple grammar coded in C++.

    void English::sayTurnRightManeuvre(int distance)
    {
        say("turn_right_in.wav");
        if (distance == 1) {
            say("one_mile.wav");
        } else {
            sayNumber(distance);
            say("miles.wav");
        }
    }

This code was mostly generated off-line (i.e., if there hadn't been a "turn_right_in.wav" file, the generator would have used "turn_right.wav" and "in.wav"), partially hand-written.
Ah, OK. No, radio and foreign languages aren't problems. I suspect saying "The volume is currently set to level niner" would probably confuse most "Americans". I can make the output machine "readable" using a similar technique to the way error/status messages are emitted from many UN*X services: " " -- except, the code need not be a *numeric* but, rather, a coded set of tones. Something melodic enough to not irritate the user -- yet carrying enough information so that holding the device up to a telephone would allow a remote machine to understand the intended .
[I may require the user to perform some action to enable these tones -- depending on how annoying they become]
Yes. Most numbers tend to represent cardinals or ordinals and not "identifiers". I.e., "This is box number 37"; "This is the 37-th box". etc. Contrast with "Error 0x80042321".
In the last case, I can use syntax to force the error code to be pronounced more like an identifier (you wouldn't think of it as "This is the 80042321-th error" or "Error number 80042321" but, rather, as *error* 80042321, just like "error divide_by_zero").
quatre-vingt dix
In the US, we don't have the wacky "thousand millions" to contend with. 1000 million is a billion. 1000 billion is a trillion, etc.
[OTOH, we tend to be a bit wacky with street addresses]
I think anything beyond thousands (i.e., 10^6 and up) would best be handled by groups of single digits. I just think it is too hard for people to keep track of more than 6 digits in any cognitively meaningful way: "I dunno, it was 16 million and SOMETHING..."
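As a sketch of that cutoff rule (not anyone's actual implementation), the choice can be made from the digit count alone. The names pick_style() and max_verbose_digits are invented here; passing 6 keeps the verbose form for values below 10^6.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum style { STYLE_VERBOSE, STYLE_DIGIT_GROUPS };

/*
 * Count the significant integer digits of a plain decimal string and
 * compare against max_verbose_digits. Leading zeros and everything
 * from the decimal point onward are ignored.
 */
enum style pick_style(const char *num, int max_verbose_digits)
{
    int digits = 0;

    assert(num != NULL);
    while (*num == '0')
        num++;
    while (isdigit((unsigned char)*num)) {
        digits++;
        num++;
    }
    return digits <= max_verbose_digits ? STYLE_VERBOSE : STYLE_DIGIT_GROUPS;
}
```

So "16345" would be spoken verbosely while "16000000" would fall back to digit groups, at the cost of one extra pass over the string before anything is spoken.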
Yes. Prosody becomes important if you expect to convey more than a small bit of information. "Listening fatigue" sets in quickly. This is especially true when you consider folks aren't *supposed* to ever hear this synthesizer! As such, they won't have any listening experience with it. *And*, the fact that they are hearing it means they are already "stressed" because the device isn't working as expected -- don't want to irritate them any more than you have to!
OTOH, I don't want to have to adopt a limited domain synthesizer and tie my hands with the type of information that I can convey to the user. Better to come up with some consistent rules *knowing* where the synthesizer's weaknesses lie and crafting messages to steer clear of these areas.
E.g., on a visual output device, you might rely heavily on abbreviations, punctuation, etc. to pack lots of information into a small "portal". With speech output, that just results in gibberish being spoken (even if you voice the names of each punctuation mark, it becomes counterproductive instead of helpfully condensing information).
Again, the analogy isn't perfect. kprintf()'s output is typically intended for a techy. About all a regular user can do is convey the messages to a tech support person.
In my case, I want to interact with the user. E.g., the system might have degraded to this operating mode because a flash update was aborted. I wouldn't want to say: "Error 80342515" but, rather, "The program appears to be corrupt. The log file indicates that a software update was recently attempted -- but not recorded as successfully completed. If this is the case, please retry the update procedure. (If not, call 555-1212 and ask for Bob!)"
Or:
"The main battery does not have sufficient charge to power the entire system. Please recharge the battery and try again."
Or:
"The main memory card does not appear to be present. Check to see that the memory card is inserted correctly and try again. If the problem persists..."
Yes. But, the synthesizer's resource constraints mean that the quality of its output is "less than ideal". I.e., I'm sure it would mispronounce "Phoenix" (I've never tested it). And, homographs (read vs read, lead vs lead, etc.) aren't disambiguated by PoS. And unpronounceable words are... "unpronounceable"! ("tty"?) :> As well as a few gazillion other "short cuts" :> But, it's not something that the user is going to interact with *much*.
Uses too many resources. E.g., even a diphone synthesizer (based on prerecorded diphones) would eat up way too much memory. I need to (effectively) be able to fit in the boot sector so that a trashed system can still talk to the user to allow it to become "untrashed".
And, it's hard to add prosody to something "prerecorded".
So, I use cheap text-to-sound algorithms. Then, a reasonably small "parametric sound synthesizer" to generate the actual "voice". There are some tweaks that the user *could*, in theory, do to the voice... but, I don't think I will export that functionality (again, he's not *supposed* to be interacting with this!) Instead, I'll probably just let different folks listen to it and get some rough comprehension metrics and pick the voice parameters that sound "least confusing".
Gotta go scrape the ice cream off the dasher before it ripens there... yum, yum!! :>
But that depends on the particular message. And, the listener's *intent* at the time.
E.g., "The battery is at eighty-seven and a half percent capacity" is probably NOT going to be "written down" -- even if the tech support guy (later) claims it is an important detail: "Oh, I think it said it was about 80 percent..."
Likewise, rattling off a phone number might not prompt the user to *use* it (at that time).
You don't want to have to preface *every* utterance with "Write this down!" :-/
Repeat is a given. As is the ability to "step" through a message (you don't want to REPLAY, several times, a list of things that you have to do just because you didn't catch them all "the first time").
And, synthetic speech in the absence of a limited domain AGREED UPON A PRIORI (by the user as well as the implementer) can often lead to listening ambiguity. Certain sounds aren't as crisp as spoken. Or, absent *visual* cues (pseudo lip-reading), hard to resolve ("Was that a 'p' or a 'b' that I heard?" "Did you say 'ess' or 'eff'?")
It's amazingly interesting to resort to other output modalities and have to reevaluate *how* you convey *what* you convey!