Constrained vocabulary speech synthesis

Don Y 12 years ago

Hi,

I have a "fallback" speech synthesizer that is used when the primary speech synthesizer is unavailable. Depending on the user's abilities, this may be the *only* output modality available to him/her (i.e., he/she may not be able to perceive other available modalities) so *everything* communicated to the user must (potentially) pass through this channel.

As the synthesizer is only intended to be used when the system is operating in a degraded mode, it doesn't have to resolve a limitless vocabulary. _For_the_most_part_, I have complete control over what it will be required to speak. So, I can pass text to it that I know to be devoid of characteristics that it would be unable to handle "properly".

For example, "The Polish housekeeper who works for Dr. Stephens in his house on Stevens Dr in Phoenix bought some furniture polish for the credenza in his office."

[with a tiny bit of effort, you can imagine lots of similar constructs that require significant knowledge of grammar, PoS, and other context to "get right". Let alone oddities like Billerica, Worcester, Usa, etc.]

But, there are other (external) text sources that I can't as easily constrain. So, I have to make a best effort to cover those (unknown) inputs while not unduly burdening the implementation (Sorry, Billerica!).

Adding context to the pronunciation, prosody, etc. algorithms gets expensive, *fast*. This is a small, highly portable device with very limited resources (CPU, memory, etc.) and extremely low power requirements (has to operate for ~16 hours with a very small battery).

I am happy with the text rules that I have put in place. They cover most "typical" input that the synthesizer is likely to encounter. Obviously, input that isn't grammatically correct can be handled however the algorithm likes (e.g., "The elephant are read?").

Recall that this synthesizer sees limited use -- so, the user is *probably* unaccustomed to its quirks and other idiosyncrasies. Hopefully, the user *never* hears it speak! But, if he is in a situation where he is relying on its speech, he's probably already annoyed (because something else is "not working"). Encountering something like "411 Length Required" probably won't find him very willing to understand what was *intended* by that terse message.

However, "numbers" seem to really benefit from context. Often, a tiny bit of context is sufficient to enhance the pronunciation (and, thus, comprehension). But, other times, you really need to understand what is being said to know how best to speak the "number(s)". E.g., "The 2300 block of State Street".

And, (from surveying users) there appear to be cultural differences in how things (like numbers) are spoken. E.g., "oh" vs. "zero" (which even seems to vary *within* a speaker's ruleset!), how/when numbers are read off as strings of digits, use of "and" as a connective in numeric values ("three hundred and ten" vs. "three hundred ten"), and the value represented by 1,000,000,000.

The cop-out approach is just to recite strings of digits, *always*. But, try listening to "data" presented in this form for even a few moments and you'll see how silly that approach is!

"Your IP address is 10.0.1.223"
"Volume level 18 of 24"
"MAC address 23:C0:11:00:14:89"
"Signal strengths 23.1, 18.6, 8.5 and 33.0"
"Scheduled server maintenance at 03:00"
"Battery time remaining 3:12"
"Contact Dr Smith at (888) 555-1212 x3-1022"

I *don't* want to put any (other) signaling/control information "in band" relying, instead, on *limited* context to resolve these issues.

For example, I can require times of day to be indicated with AM/PM (so "03:12" is NOT a time of day) and time intervals *without* ("three hours and 12 minutes"). At the same time, I don't want to unnaturally burden the algorithms that *create* (emit) these text strings. E.g., requiring all numeric values to be in scientific notation or to embed separators every three digits (imagine how tedious it would be to have to process numbers as *strings* in order to properly place commas to separate thousands, millions, etc.).
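The AM/PM convention described above can be sketched in C as follows. This is only an illustration under assumptions: `expand_clock` and `two_digit` are hypothetical names, the spoken forms ("three oh five P M", "three hours and twelve minutes") are one plausible choice among several, and an unmarked "H:MM" is always treated as hours and minutes.

```c
#include <stdio.h>
#include <string.h>

static const char *const ONES[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen"
};
static const char *const TENS[] = { "", "", "twenty", "thirty", "forty", "fifty" };

/* Words for 0..59, which covers hours and minutes. */
static void two_digit(int v, char *out, size_t n)
{
    if (v < 20)
        snprintf(out, n, "%s", ONES[v]);
    else if (v % 10 == 0)
        snprintf(out, n, "%s", TENS[v / 10]);
    else
        snprintf(out, n, "%s %s", TENS[v / 10], ONES[v % 10]);
}

/* Hypothetical helper: expand "H:MM", "H:MMAM" or "H:MMPM" into words.
 * Returns 0 on success, -1 if the token fits neither form. */
int expand_clock(const char *tok, char *out, size_t n)
{
    int h, m, used = 0;
    char hw[16], mw[16];
    const char *suffix;

    if (sscanf(tok, "%2d:%2d%n", &h, &m, &used) != 2)
        return -1;
    if (h < 0 || h > 23 || m < 0 || m > 59)
        return -1;
    two_digit(h, hw, sizeof hw);
    two_digit(m, mw, sizeof mw);
    suffix = tok + used;

    if (strcmp(suffix, "AM") == 0 || strcmp(suffix, "PM") == 0) {
        /* Marked with AM/PM: treat as a time of day. */
        if (h < 1 || h > 12)
            return -1;
        if (m == 0)
            snprintf(out, n, "%s %c M", hw, suffix[0]);
        else if (m < 10)
            snprintf(out, n, "%s oh %s %c M", hw, mw, suffix[0]);
        else
            snprintf(out, n, "%s %s %c M", hw, mw, suffix[0]);
        return 0;
    }
    if (*suffix != '\0')
        return -1;

    /* No AM/PM: treat as an interval in hours and minutes. */
    if (h == 0)
        snprintf(out, n, "%s minute%s", mw, m == 1 ? "" : "s");
    else if (m == 0)
        snprintf(out, n, "%s hour%s", hw, h == 1 ? "" : "s");
    else
        snprintf(out, n, "%s hour%s and %s minute%s",
                 hw, h == 1 ? "" : "s", mw, m == 1 ? "" : "s");
    return 0;
}
```

With this sketch, `expand_clock("3:12", buf, sizeof buf)` yields "three hours and twelve minutes", while "3:12PM" yields "three twelve P M". The emitter only has to append AM/PM; no other in-band markup is needed.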

Finally, I don't want to force unnatural presentations that a user employing a *different* output modality (e.g., a video display) would find tedious. Imagine requiring the text source to pass input of the form: "Your Eye Pee address is ten dot zero dot one dot two two three".

My questions:

- what other "number presentations" are likely encountered in an electronic device (e.g., IP, MAC, time-of-day, durations, phone numbers ARE, but Bible references AREN'T)

- how do users *colloquially* pronounce numbers (e.g., "0.1203", "101.05", "4005", "8921600002")

- other suggestions to make this easier on the user?

- pitfalls that other developers are likely to stumble on?

I hope I haven't missed anything (*obvious*) :< I am amazed at how many different forms numbers take and how much is "encoded" in our contextual awareness of them!

Time for bed...

Thx!

--don

Tim Wescott 12 years ago

If you had total control over the text, then a possible right answer would be to give it text with embedded clues, or just phonemes. I'm not sure that isn't the right answer anyway, and just let it have problems with the "alien" text.

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Boudewijn Dijkstra 12 years ago

On Tue, 03 Jun 2014 18:09:19 +0200, Don Y wrote:

Why is there a degraded mode? Which parts of the system are intended to be operational in this mode?

Text, or as Tim said, phonemes.

What is the purpose of attempting to synthesize these?

(Remove the obvious prefix to reply privately.) Made with Opera's e-mail program: http://www.opera.com/mail/

Don Y 12 years ago

Yes, I had a similar "epiphany", originally. But, like most siren songs, it proved to be misleading.

Initially, I looked into *canning* all of the speech: "Hey, I know everything it will *ever* say (nope!), so why not just *record* it? LPC encode every utterance and omit *all* the run-time message processing!" I.e., just treat this as an "audio player". By contrast, a device designed for *visual* output need not be concerned with this (or, messages could equivalently be large *bitmaps* painted on the display thereby eliminating the need for a font generator, etc.!)

[Of course, this fragments your code base as you now need to have vastly different user I/O handlers for each type of device "at the abstraction level" -- not just "reification". You're now dealing with entire messages as "objects"]

But, that leaves you with little control over the actual "voice" (e.g., male/female/child/etc.) as well as the more specific characteristics of the voice. In addition to picking a voice that suits their personality, users find that different characteristics of a voice may help/hinder intelligibility (pitch, breathiness, etc.)

And, it fails miserably when tasked with "variable data"... it turns into a giant "unit selection" problem as you try to piece together words, numbers, etc. each potentially "recorded" with different pitch, timing, prosody, etc. (i.e., "I need to choose from among the several recorded utterances of "fifteen" that has the following stress characteristics..." -- contrast how you pronounce "There are 15 children", "It is 9:15"). [This is hard to simulate using your own voice; but, very obvious when piecing together *recorded* voice samples.]

The next (false) "epiphany" came when realizing that I could just *encode* the speech "off-line" and store (stress-marked) phonemes. As with the LPC approach, this eliminates lots of code to do the text-to-sound conversion, stress assignment, prosody, etc. Do everything "at compile time". Concentrate on the "voice" instead of the *content*.

This allows you to exert some control over the characteristics of the voice (pitch, breathiness, rate, etc.) that the LPC approach couldn't. E.g., I can choose how to pronounce a particular set of phonemes instead of relying on an LPC encoded *recording* of those phonemes (sounds). And, *how* to pronounce them in a given context/utterance.

Words tend to have fewer phonemes than letters so you could, conceivably, encode a second of speech into ~6-8 bytes (speech is about a 50bps channel) and still be able to tailor the "sound" to the user's needs.

Removing the text-to-sound, stress, and prosody processing also eliminates the buffers needed for "rewrite" rules (e.g., when you might otherwise have to change an "interpretation" based on context; or, are tasked with converting "3741" into "three thousand, seven hundred and forty one" on-the-fly so that it can then be converted as any other "words").
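For reference, the on-the-fly conversion mentioned above ("3741" into words) might look like the following C sketch. The function names are hypothetical, the range is arbitrarily capped at 999,999, and the `use_and` flag is an assumed way to cover the "three hundred and ten" vs. "three hundred ten" variation; it is not a description of the actual ruleset.

```c
#include <stdio.h>
#include <string.h>

static const char *const ONES[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen"
};
static const char *const TENS[] = {
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety"
};

/* Append s to the NUL-terminated string in out (capacity n), truncating. */
static void append(char *out, size_t n, const char *s)
{
    size_t len = strlen(out);
    snprintf(out + len, n - len, "%s", s);
}

/* Append the words for 1..999. */
static void below_thousand(unsigned v, int use_and, char *out, size_t n)
{
    if (v >= 100) {
        append(out, n, ONES[v / 100]);
        append(out, n, " hundred");
        v %= 100;
        if (v == 0)
            return;
        append(out, n, use_and ? " and " : " ");
    }
    if (v >= 20) {
        append(out, n, TENS[v / 10]);
        v %= 10;
        if (v == 0)
            return;
        append(out, n, " ");
    }
    append(out, n, ONES[v]);
}

/* Hypothetical helper: words for 0..999999. Returns 0, or -1 if out of range. */
int number_to_words(unsigned v, int use_and, char *out, size_t n)
{
    if (n == 0 || v > 999999u)
        return -1;
    out[0] = '\0';
    if (v == 0) {
        append(out, n, "zero");
        return 0;
    }
    if (v >= 1000) {
        below_thousand(v / 1000, use_and, out, n);
        append(out, n, " thousand");
        v %= 1000;
        if (v == 0)
            return 0;
        append(out, n, ", ");
    }
    below_thousand(v, use_and, out, n);
    return 0;
}
```

Even this small version needs a scratch buffer sized for the longest expansion, which is the kind of run-time cost the off-line approach was meant to avoid.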

It also ensures every utterance is "proofed" before deployment (they all have to go through the *offline* converter/synthesizer before making their way into the code base). No surprises after release!

But, this approach proved to be tedious for the developer. You can't just write:

    ASSERT(0 != charge);
    printf("Battery charge remaining: %d:%02d", charge/60, charge%60);

Instead, you have to piece together:

    ASSERT(0 != charge);
    speak(PHONEMIZATION("Battery charge remaining"));   // const
    speak(PHONEMIZATION(HALF_STOP));                    // const
    if (0 != charge/60) {
        speak(PHONEMIZATION(NUMBER[charge/60]));
        if (1 != charge/60) {
            speak(PHONEMIZATION("hours"));              // const
        } else {
            speak(PHONEMIZATION("hour"));               // const
        }
        speak(PHONEMIZATION("and"));                    // const
    }
    speak(PHONEMIZATION(NUMBER[charge%60]));
    if (1 != charge%60) {
        speak(PHONEMIZATION("minutes"));                // const
    } else {
        speak(PHONEMIZATION("minute"));                 // const
    }

[Also note that you can't unilaterally pluralize a noun by adding an "s" sound to the end. Consider:

    speak(PHONEMIZATION("hour"));       // const
    speak(PHONEMIZATION("s"));          // const -- voiced!
  vs.
    speak(PHONEMIZATION("hours"));      // const

contrasted with

    speak(PHONEMIZATION("minute"));     // const
    speak(PHONEMIZATION("s"));          // const -- not voiced!
  vs.
    speak(PHONEMIZATION("minutes"));    // const

I.e., *say* these to yourself]

[You also have to hope the prosody/stress assignment of each of those "const's" is appropriate for this presentation instance. E.g., how you pronounce "There is no battery time remaining" differs from how you pronounce "Battery time remaining" in each of the above examples!]
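The voiced/unvoiced distinction follows the usual rule for regular English plurals: the suffix depends on the final sound of the singular. A minimal C sketch, assuming ARPAbet-style phoneme symbols (the thread does not say which phoneme set is in use) and a hypothetical `plural_suffix` helper:

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if phoneme p appears in the given set. */
static int in_set(const char *p, const char *const *set, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        if (strcmp(p, set[i]) == 0)
            return 1;
    return 0;
}

/* Hypothetical helper: phonemes to append for a REGULAR plural, chosen
 * from the last phoneme of the singular (ARPAbet-style symbols assumed). */
const char *plural_suffix(const char *last)
{
    static const char *const sibilant[] = { "S", "Z", "SH", "ZH", "CH", "JH" };
    static const char *const voiceless[] = { "P", "T", "K", "F", "TH" };

    if (in_set(last, sibilant, sizeof sibilant / sizeof sibilant[0]))
        return "IH Z";   /* "matches": extra syllable */
    if (in_set(last, voiceless, sizeof voiceless / sizeof voiceless[0]))
        return "S";      /* "minutes": unvoiced */
    return "Z";          /* "hours": voiced */
}
```

Irregular nouns still need an exception list, and this does nothing about the stress and prosody differences noted above.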

So, it doesn't *really* save you anything -- you still have to map between abstract numeric representations and *concrete* vocalizations. I.e., you're back with the same set of questions that I posed.

And, it's done NOTHING to deal with "text" coming from external ("alien") sources.

"What seems to be the problem, Ms. Cornali?"
"My Gizmolator3000 isn't working!"
"What is it *not* doing?"
"Working!"
"I mean, what is it telling you?"
"Oh. 'Cannot connect to server'"
"Yes, but *why* can't it connect to the server?"
"That's what I'm calling YOU for!"
"I mean, is there any other information provided?"
"No. It just keeps repeating that message every time I try!"

In reality, the server may have been reporting any of:

"Scheduled maintenance. Try again at 04:00"
"Unauthorized MAC addr."
"Error 938"
"716"
"^D"

(i.e., anything other than "Connected") but the device can't relay that information to the user unless it already *knows* how to speak each of those messages and, encountering a reply, blindly compares the reply to its repertoire of stored pronunciations. Instead, it maps all "FAIL" responses from the server into "Cannot connect to server".

Of course, it can report on issues that it *knows* about "at design time". But, only for services existing at that time and in that (known) *form*. That, of course, means it has to be burdened with all that knowledge *and* the pronunciations of each (or, "user friendly" alternatives tagged with additional information that could help "Support" resolve the issue by making the *real* message available to them in this "coded" form).

[Imagine if your native tongue was Swahili and every (error) message presented to you was in Arabic; how would you converse with a Support technician to resolve your problem -- even if he was fluent in Spanish??]

Thankfully, many of the "text" issues can be resolved/avoided... or, understood despite mispronunciation caused by ignorance of context. E.g., if "Polish" was pronounced as "(furniture) polish" (or vice versa) you might initially be puzzled by the example I previously mentioned... *but*, you'd quickly sort it out and understand the intended meaning.

The "numbers" are the real pisser. :< There is a lot of speech encoded in numeric presentations. E.g., when you encounter "5:00" you *think* "five O'CLOCK". Or, if the context suggests it represents a time interval, you think "five MINUTES" (or, "five HOURS"). When you encounter 6/15, you think "June fifteenth", not "June fifteen" or "six slash fifteen" (or, perhaps, "six of fifteen"). You probably speak "(800) 555-1212" differently than "(708) 555-1212" -- and both differently than "800, 555, 1212".

I've been building a lexicon of "number formats" and their associated "speaking formulae" iteratively -- throwing more and more "sample input" at the algorithms to see which forms I handle suboptimally; then, crafting recognizers for each to try to reduce the stilted nature of that aspect of the speech. I keep getting surprised by forms that I haven't (yet) covered but that are surprisingly common! Hence the first of the questions I posed ("what other 'number presentations' are likely encountered in an electronic device")

The second question was an informal survey of speaking *patterns* for numeric quantities. For example, "0.1203" could be "zero point one two zero three", "zero point one two OH three", "point one two oh three", etc. Similarly, "101.05" could be conveyed as "one hundred one point zero/oh five", "one hundred AND one point zero/oh five", "one hundred one and five one-hundredths", "a hundred one point oh five", etc. *I* would pronounce "8921600002" as "eight nine two, one six zero, zero zero, zero two" (note the last four digits are treated as two *doublets* instead of a *triplet* followed by a singleton as might have been expected from the grouping of the preceding digits!).
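One way to experiment with these variations is a digit reader whose word for "0" and whose grouping are adjustable. The C sketch below is an assumption-laden illustration: `read_digits` is a hypothetical name, and the grouping rule (triplets, with a final run of four split into two doublets) merely encodes the personal habit described above rather than any standard.

```c
#include <stdio.h>
#include <string.h>

/* Append s to the NUL-terminated string in out (capacity n), truncating. */
static void append(char *out, size_t n, const char *s)
{
    size_t len = strlen(out);
    snprintf(out + len, n - len, "%s", s);
}

/* Hypothetical helper: read a numeric string digit by digit.
 * zero_word selects "zero" or "oh". Pure digit strings longer than four
 * digits are grouped in threes, except that a final run of four becomes
 * two pairs. Returns 0, or -1 on an unexpected character. */
int read_digits(const char *s, const char *zero_word, char *out, size_t n)
{
    static const char *const NAMES[] = {
        "", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine"
    };
    size_t left = strspn(s, "0123456789");      /* digits in leading run */
    int grouped = (s[left] == '\0' && left > 4);
    size_t group_left = 0;                      /* digits left in group */

    if (n == 0)
        return -1;
    out[0] = '\0';
    for (; *s != '\0'; s++) {
        const char *word;

        if (*s == '.')
            word = "point";
        else if (*s == '0')
            word = zero_word;
        else if (*s >= '1' && *s <= '9')
            word = NAMES[*s - '0'];
        else
            return -1;

        if (grouped && group_left == 0) {
            /* Start a new group; a comma marks the pause between groups. */
            group_left = (left == 4) ? 2 : (left < 3 ? left : 3);
            if (out[0] != '\0')
                append(out, n, ", ");
        } else if (out[0] != '\0') {
            append(out, n, " ");
        }
        append(out, n, word);
        if (grouped) {
            group_left--;
            left--;
        }
    }
    return 0;
}
```

For "8921600002" with `zero_word` set to "zero" this produces "eight nine two, one six zero, zero zero, zero two"; passing "oh" instead gives the other common reading. Mixed habits such as "zero point one two OH three" would need a per-position rule that this sketch does not attempt.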

Run samples by your friends and neighbors and see how each has a particular (usually inconsistent) set of rules for how they pronounce these!

Third question looked for other ideas to help the user. E.g., I had initially included a "spell mode" (numbers *and* letters). But, this gets really tedious! And, brought me back to wondering, "if they have to resort to spellings of the messages, then the quality of the synthesis -- or, nature of the messages -- must be total crap!" So, fix the *real* problem! OTOH, it made sense to allow the user to change the voice to something better suited to his/her hearing and comprehension. Likewise, speaking rate. I am currently toying with the idea of a "word-at-a-time" mode so the user can "step" through words individually (instead of having to hear the entire message replayed)

The last question anticipates what problems others writing code for this environment are likely to trip over. E.g., the example of "pluralizing" a noun is something an eager developer is likely to get wrong ("I can just store the singular forms of each noun and pluralize them by adding an 's'!") -- much the same way someone might naively pluralize "thief" as "thiefs".

Or, thinking himself exceedingly clever and pronouncing large values "the way you were taught in school" ("eight billion, nine hundred twenty one million, six hundred thousand, two" for the aforementioned example). Never thinking about how much information that likely conveys to the user IN THIS APPLICATION/SITUATION (i.e., it is unlikely that such a value is intended to be interpreted as an ordinal in that sense; more likely a "string of digits" is appropriate)

(sigh) English is such a bastard language! I'm sure there are other languages that are far more *regular*! :<

--don

Don Y 12 years ago

The device is a "terminal", of sorts. Normally, the (real) speech synthesizer is located remotely and passes "audio" to the device. However, if that synthesizer is "unavailable" (down, improper authorization, comms failure, etc.) you still need to be able to tell the user these things!

"Hmmm... I'm not getting any sound. Have I got the volume turned down too much? Is the battery dead? Why isn't the server responding?"

Think of an X terminal as a conceptual model. You may have thousands of fonts available on your font server. But, the X terminal has to have AT LEAST ONE built in to be able to talk with the user (e.g., during configuration/setup) *before* the terminal has access to the font server!

See my reply to Tim. Briefly, canned speech ("recorded" or phoneme based) falls down on any messages with variable content. "Your IP address is %d.%d.%d.%d", "The time is %d:%02d", "Volume level is %d%%". You still need something to "evaluate" numerics in a particular context.
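As an illustration of "evaluating" a numeric in context, here is a C sketch for dotted-quad addresses. It follows the reading given earlier in the thread ("ten dot zero dot one dot two two three"); the rule that octets below 100 are read as numbers and three-digit octets digit by digit is inferred from that single example, and `speak_ipv4` is a hypothetical name.

```c
#include <stdio.h>
#include <string.h>

static const char *const ONES[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen"
};
static const char *const TENS[] = {
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety"
};

/* Append s to the NUL-terminated string in out (capacity n), truncating. */
static void append(char *out, size_t n, const char *s)
{
    size_t len = strlen(out);
    snprintf(out + len, n - len, "%s", s);
}

/* Hypothetical helper: expand "A.B.C.D" into words.
 * Returns 0 on success, -1 if s is not a dotted quad of 0..255. */
int speak_ipv4(const char *s, char *out, size_t n)
{
    int oct[4], used = 0, i;

    if (n == 0 ||
        sscanf(s, "%3d.%3d.%3d.%3d%n",
               &oct[0], &oct[1], &oct[2], &oct[3], &used) != 4 ||
        s[used] != '\0')
        return -1;
    for (i = 0; i < 4; i++)
        if (oct[i] < 0 || oct[i] > 255)
            return -1;

    out[0] = '\0';
    for (i = 0; i < 4; i++) {
        int v = oct[i];

        if (i > 0)
            append(out, n, " dot ");
        if (v >= 100) {
            /* Three digits: read them individually. */
            append(out, n, ONES[v / 100]);
            append(out, n, " ");
            append(out, n, ONES[v / 10 % 10]);
            append(out, n, " ");
            append(out, n, ONES[v % 10]);
        } else if (v >= 20) {
            append(out, n, TENS[v / 10]);
            if (v % 10 != 0) {
                append(out, n, " ");
                append(out, n, ONES[v % 10]);
            }
        } else {
            append(out, n, ONES[v]);
        }
    }
    return 0;
}
```

The emitter keeps writing the plain "%d.%d.%d.%d" form; recognizing the pattern is entirely the synthesizer's job.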

And, "alien" text (Tim's term) leaves you helpless. How do I tell the user that the server is refusing to accept his credentials? Or, that the server will be down for maintenance until 4:00PM and that the server at A.B.C.D should be used in the interim? I.e., I would have to be able to constrain *everything* that will eventually talk to the device and fold all of the accommodations for these external devices into the design of *this* device.

It's undesirable to have those devices emit "phonemes" as they must now accommodate every potential output modality that some (future) remote device requires/supports. E.g., should they also output their messages in level two Braille for remote Braille displays? (or, does that Braille display have to pre-store all text that it could potentially encounter and associated Braille equivalents?)

See above (and other reply). The device can only speak things that it knows about. E.g., I can tell the user that the battery is low, signal strength is insufficient for contact with server (move closer), error rate is too high for the connection (local noise sources?), the server's response time is too high (too many clients? pick another server??), etc.

But, I can't tell the user about issues that the (remote) server wants to communicate -- unless I also constrain *it*! (and never let it evolve without requiring software updates of all potential clients).

So, if the guy maintaining the server brings it down for maintenance and specifies:

"Server down for maintnance [sic]. Contact Boudewijn Dijkstra for assistance at 813567 after 5:00PM"

What do I report to the user (other than "connection failed")? He's misspelled maintenance so any attempt (by me) to find a prestored pronunciation will fail.

[I can require a preceding numeric "reply code" to assist the "terminal" in determining the *intent* of the message. (e.g., "925 Server Maintenance") But, the balance of the message is unavailable to the user. So, when he/she goes looking for help to resolve his/her problem (or, attempts to resolve it directly), he has little more than this "message code" to go by.]

Time to make some ice cream! Butter Pecan. Mmmmm... fat city!

--don

Boudewijn Dijkstra 12 years ago

These are issues that are directly helpful to the user, i.e. things that the user may be able to do something about.

These are issues that are most likely not directly helpful to the user. At this point the device might output perfectly synthesized text, DTMF tones, a fax message, it doesn't really matter as the user cannot directly employ the information to make things work again. In other words, this kind of information is generally best relayed to a helpdesk of some sort. So, speech synthesis should not be an absolute requirement.

(Remove the obvious prefix to reply privately.) Made with Opera's e-mail program: http://www.opera.com/mail/

Don Y 12 years ago

How does the user *know* what the device is wanting to tell him in order to relay that information to the help desk? I.e., you have to get the information *to* the user before he can relay it to "Support". Perhaps have the device store the error message in FLASH and have the user snail mail the device to Support? :)

Boudewijn Dijkstra 12 years ago

Assuming that the device is not subdermally implanted, the user doesn't need to hear or understand the information at all! The device could say: "Please hook me up to a phone line, I wish to send a problem report" or something similar. Then the user could listen in and wait for the exchange to finish.

(Remove the obvious prefix to reply privately.) Made with Opera's e-mail program: http://www.opera.com/mail/

Don Y 12 years ago

[attrs elided]

So now the device needs to be able to connect to a phone line (acoustically or otherwise) *and* there needs to be a phone line *handy* (as well as accessible to the device's dialing capabilities -- e.g., not behind a PBX). And, to know how to report dialing/connection problems there, as well.

All this just to avoid being able to convey "alien" messages and pronounce numbers in various formats in an intelligent manner?

One scheme I could adopt is to just have the server return a "result code" for *every* condition. Those codes known to the device at time of manufacture can be explained (*by* the device) to the user. Those *unknown* can simply be conveyed to the user as a "number".

That puts the burden on the device to know how to explain each error code. ("error codes"... welcome to the 60's! :
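The result-code scheme could be sketched in C like this. The table is a placeholder: code 925 echoes the "925 Server Maintenance" example from earlier in the thread, code 401 and both explanation strings are invented for illustration, and `explain_code` is a hypothetical name.

```c
#include <stdio.h>
#include <string.h>

struct code_entry {
    int code;
    const char *text;
};

/* Placeholder table of codes known at time of manufacture.
 * 925 comes from the thread's own example; 401 is invented. */
static const struct code_entry KNOWN_CODES[] = {
    { 925, "The server is down for maintenance" },
    { 401, "The server refused your login" },
};

/* Hypothetical helper: text to speak for a result code.
 * Returns 1 if the code was known, 0 if it had to be read out as
 * digits, -1 on bad arguments. */
int explain_code(int code, char *out, size_t n)
{
    static const char *const DIGIT[] = {
        "zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine"
    };
    char digits[16];
    size_t i;

    if (n == 0 || code < 0)
        return -1;
    for (i = 0; i < sizeof KNOWN_CODES / sizeof KNOWN_CODES[0]; i++) {
        if (KNOWN_CODES[i].code == code) {
            snprintf(out, n, "%s", KNOWN_CODES[i].text);
            return 1;
        }
    }

    /* Unknown code: fall back to reciting its digits. */
    snprintf(out, n, "Error code");
    snprintf(digits, sizeof digits, "%d", code);
    for (i = 0; digits[i] != '\0'; i++) {
        size_t len = strlen(out);
        snprintf(out + len, n - len, " %s", DIGIT[digits[i] - '0']);
    }
    return 0;
}
```

An unknown code such as 938 comes out as "Error code nine three eight", which the user can at least repeat to Support.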

George Neuner 12 years ago

Let's just agree that Massachusetts is hopeless and write it off. Even making allowances for pronunciation, poor diction and odd colloquialisms, there are too many MA natives who badly misspeak [including many who theoretically have been well educated].

And don't pick on poor Worcester: it's a nice city ... in England ... that isn't responsible for what Massachusetts did to its name. Historically it was pronounced the same as the English city - as was Gloucester, Medford, Woburn, Salisbury, etc. The scratch-your-head "huh?" pronunciations are all post Revolution (some post 1812).

George [I can mock MA because I'm from MA: I was born there and I live there currently. Thankfully, during my formative years I lived elsewhere.]

Don Y 12 years ago

There are lots of "towns" in the US that have bastardized their "original" foreign pronunciations.

Berlin MA/CT/WI (BURR-lin)
Italy TX (IT-lee)
New Madrid MO (MAD-rid)
Milan NY (MY-lun)
Russia OH (ROO-shee)
Cairo IL (KAY-row)

etc.

Proper nouns are always good candidates for "exception dictionaries" (e.g., "Kurzweil" was one of the first words added to the KRM's exception dictionary -- for fairly obvious reasons :> )

But, English is so "screwed" that even commonplace words are exceptions (wrt the "rules" typically applied to other words in the language): two, of, this/these/them/that/etc., Wednesday, woman/women, etc.

And, forget local variations: TX twang, southern drawl, Boston 'R' (vs. New York 'R'), "bash/mash" (esp in places like OH), "oil" in NJ, Louisiana cajun, Mainers, Wisconsin's odd stress assignment, etc.

INsurance vs. inSURance, POlice vs. poLICE, etc. (I've heard Arkansas pronounced areKANSAS)

Trying for a "nominal" US speaking pattern is an exercise in futility! OTOH, most of us (US) are accustomed to encountering folks with different speech mannerisms -- let alone entirely different terms, colloquialisms, etc.

Boudewijn Dijkstra 12 years ago

Or ask the user to dial the helpdesk.

The users are not deemed capable of that themselves?

Yes. To me it sounded like an option worth considering.

Indeed.

(Remove the obvious prefix to reply privately.) Made with Opera's e-mail program: http://www.opera.com/mail/

Don Y 12 years ago

[attrs elided]

OK. The device can, presumably, speak the phone number to the user so the user doesn't have to keep that "handy". If he happens to be on a (commuter) train at the time, he can presumably have a cell phone handy.

Having dialed the help desk, how does the device *talk* to the help desk -- hold it up to the phone and have it "beep and bop" into the phone's mouthpiece? This would have to be a SIMPLEX data exchange else the help desk would have to alternately be telling the user "OK, now hold the mouthpiece of the device up to the earpiece of the phone", etc.

I assumed you would be thinking in terms of an *automated* help desk (since all that facility would be doing is interpreting beeps and bops from the device)

Now, imagine the user of a visual display device being forced into the same situation. Nice full-graphic display available for him but all he sees is "Error 2341. Please dial 555-1212 for assistance" followed by a string of hex digits that he must somehow communicate to the person on the help desk (e.g., DTMF if the user is mute).

[Because the spec for the device and the servers says, "these are the a priori known error codes and what they really mean. any other error code will be reported as a 4 decimal digit code followed by a binary object presented in a form suitable for the device's output modality"]

I don't see this as any easier -- and far less convenient (user will already be anxious because he is unable to use device) -- than just coming up with some reasonable constraints on the sorts of numerics that can be expressed in these "alien" messages. Falling back on "reading strings of digits" seems so kludgey and tedious...

So far, I have been impressed with the thoroughness of the "patterns" included! I'll have to inquire as to where he found them -- or, if he sat down and created them each, himself (though it appears he'd have had to do a fair bit of *research* to know what all of these data/number formats *are* before coding the regex's! -- license plate identifiers, phone numbers for various foreign dialing plans, etc.)

Robert Wessel 12 years ago

My new washing machine can do just that. At some point in the conversation with customer service you can hold your phone up to a particular spot on the device, press another button for a few seconds, and it'll transmit a problem report.

Obviously that sidesteps the problem of establishing the phone connection, as you have to be talking to support anyway.

Don Y 12 years ago

[attrs elided]

But, your washing machine probably doesn't have some *other* means of conveying generic messages to the user. E.g., a graphic/text display. If it had, presumably, it could display an N-digit "code" (embodying all of the pertinent information) along with instructions that you could dictate to the support tech.

Or, a REAL MESSAGE! (gadzooks!)

I've already got a means of communicating with the user. It's now a question of whether I let other devices (e.g., remote server) *use* that mechanism to convey their diagnostic/status messages to the user *or* force them to produce an "error code" that can *always* be conveyed to the user -- but, that the user then has to resolve via some other agency (e.g., help desk).

It would be annoying to have to contact support only to be told "your device is telling you that your (paid) subscription has expired and, for a limited time, you could renew for $19.95 (plus shipping and handling)" :-/

Or, that the server is shedding load, "*PLEASE* use another server unless absolutely necessary. Otherwise, your request will be handled, shortly."

Or, "A newer, faster service is available at XXXX. But, feel free to keep using this slower, outdated service if you like".

Or, "Notice: a security exploit was detected last Tuesday. Folks who used this service on XX/XX/XXXX should contact the System Administrator at 555-1212 x3-2211 ASAP."

Or, "Upgrades are available at no cost from the service department from 9:00A - 5:00P. Ask for Bob."

[BTW, Our washing machine makes all sensor information available to the user via the front panel in a "service mode" (published in the User Guide) using the numeric display inherent in the front panel]

George Neuner 12 years ago

No kidding. You wouldn't believe what people do to my name ... and it's relatively simple to guess the American if you don't know the proper German pronunciation (I answer to either, at least initially).

George

Don Y 12 years ago

Too many different rule systems involved -- you'd need to be able to recognize *which* language's rules should apply, etc. E.g., Polish 'w'. So, "don't bother". :>

I suspect my ruleset would even choke on many (common) *first* names -- though I've not run a formal test with that sort of input (e.g., Stephen, John, Valerie, Alan, etc.). I suspect stress assignment would also be incorrect.

I'll add that to my ToDo list -- just to see (and laugh?). I think I have a list of names here, somewhere...

Robert Wessel 12 years ago

I was merely providing an example of an implementation of something being discussed.

In any event, the data transmitted is likely much larger than what the display (and there is one) could reasonably accommodate (or the user could usefully interpret, record, copy, etc.). The manual says it can take as long as 17 seconds to transmit the burst, and even with the most pessimistic assumptions, several hundred bytes of diagnostic information (after considering framing, error correction, etc.), should be possible.

Don Y 12 years ago

Understood. OTOH, presumably these "diagnostic purposes" are more involved than providing information to the user that the *user* can make sense of.

E.g., instead of "login failed", a server might return:

- "login denied (which maps to 'login failed') due to nonpayment of fees"

- "login denied; this account only accessible in non-peak hours. The current server time is XX:XX"

- "login denied; this account suspended pending disciplinary action"

Rather than trying to anticipate every *local* policy that might be implemented at some future date, allow the server to return a result code that the device *always* knows how to interpret ("login denied") along with a message INTENDED FOR THE HUMAN USER.

This tends to be how most server replies are designed, nowadays. I.e., the "client" doesn't parse the text of the message but, rather, just the "error code" -- optionally passing the text on to the user for the user's perusal.
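In C, that split between a machine-readable code and a human-readable message might look like the sketch below. The reply layout (leading decimal code of at most four digits, a space, then free text) is an assumption modelled on the "925 Server Maintenance" and "411 Length Required" examples in this thread, and `split_reply` is a hypothetical name.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper: split "NNN free text" into its parts.
 * On success stores the numeric code in *code, points *text at the
 * human-readable remainder (possibly empty) and returns 0.
 * Returns -1 if the reply does not start with a code of 0..9999. */
int split_reply(const char *reply, int *code, const char **text)
{
    char *end;
    long v;

    if (!isdigit((unsigned char)reply[0]))
        return -1;
    v = strtol(reply, &end, 10);
    if (v > 9999)
        return -1;
    while (*end == ' ')
        end++;

    /* The device acts only on *code; *text is passed on unparsed. */
    *code = (int)v;
    *text = end;
    return 0;
}
```

The device branches on the code alone and hands the text to the synthesizer untouched, so the server's wording can change without a client update.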

Consumer kit tends to have skimpy displays. E.g., our washer has only a few seven segment displays (not counting "icons" or other annunciators/indicators). But, you can map a fair bit of information onto those beyond the "idiot light" that tells the user "something is wrong".

[Granted, in our case, this is an interactive process but one that I could easily see a "tech" guiding a user through over the phone -- assuming they don't just want to dispatch a technician directly]

In my case, I could conceivably recite *paragraphs* of speech (as it is an inherently serial output device and "capacity" is limited solely by the user's wetware). And, since I can already recite from a nontrivial vocabulary (i.e., not just digits), there's little to gain by placing a limit on that *now* -- especially when the issue is really just one of identifying likely numeric formats that have *implied* content that isn't explicitly conveyed by the individual "characters".

Forcing the user to contact "support" when providing access to this message content would remove that *need* for the contact seems silly. (i.e., should I similarly HIDE the accompanying text message for users with *visual* display devices? "Call support if you want to know the text that follows this 'error number'")

I think I've got a reasonably small "lexicon" of templates that cover most numeric presentations. It's enlightening to see how much we take for granted/assume/imply in these presentations!

I've been browsing various server sources to get an idea of the types of "accompanying messages" with which "result codes" are tagged -- as well as poking at various online services to see what *they* want to disclose ("raw"). For "US" servers, I think I can cover almost everything that I've encountered (with the exception of personal names) -- let someone else worry about other localities! :)

Thx,

--don
