OT: Speaking "unpronounceable" text

Hi,

I've got a product that uses audio/speech, EXCLUSIVELY, as its output modality. It's severely resource constrained (size, power, cost, etc.) so the TTS system doesn't have features that aren't "guaranteed" to pay for themselves (e.g., it does an abysmal job reciting poetry, and mispronounces many place names -- "Worcester", "Phoenix", etc. -- and proper names -- "Xerox", "Giovanni", etc.).

But, the text that it is primarily charged with speaking, it handles well -- obviously, because the text and the TTS were designed hand-in-hand! And, users become accustomed to the speech and its idiosyncrasies with use.

However, there is a whole class of "unconstrained" input that I have to address -- text sourced from outside the device -- that I could *never* handle even "adequately" given the resources available. This is true even when the application domain is reasonably well constrained. (E.g., expecting it to handle "The Polish woman polishing the pews in the church of St. Zygmunt Gorazdowski on Leigh St. read the hymnal that I am reading" would be an absurd goal!)

This unconstrained input tends to occur in "exceptional" circumstances: something unexpected has happened which exposes these dialogs to the user. So, they aren't familiar with them -- nor would the folks at a "Help Desk" be (as they are outside the scope of the product!). The problem is, most interfaces are designed expecting a user to *read* them. So, things like "Contact Dr. S. Martin at x237 for assistance 9A-5P" are relatively easy to comprehend "in print form". But, deciding how that should be *spoken* is an entirely different matter!

I've implemented a fair number of rules to try to intelligently resolve things like numbers (555-1212; 3:00; 1:20:45; 4.25; 1,997; 2015; FE:00:C2:80:99:04; 192.168.1.193; etc.). And, things like "all caps" (IBM, WWW, FTP, DNS, HTTP, HTML, RSA, etc.) and mixed alphanumerics (P2P, SHA256, etc.), opting to "spell" those. Common abbreviations I replace with their expected expansions (etc., e.g., i.e., Mon, Feb, and so on).
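As an illustration of the sort of rules being described -- a sketch only, with invented names and simplified criteria, not the poster's actual implementation -- a token classifier at the front of such a normalizer might look like:

```c
#include <ctype.h>

/* Hypothetical token classes for a TTS front-end normalizer.
   Names and thresholds are illustrative assumptions. */
typedef enum { TOK_WORD, TOK_ALLCAPS, TOK_MIXED_ALNUM,
               TOK_NUMBER, TOK_SPELL } tok_class;

tok_class classify_token(const char *t)
{
    int letters = 0, uppers = 0, digits = 0;

    for (const char *p = t; *p; p++) {
        if (isdigit((unsigned char)*p)) {
            digits++;
        } else if (isalpha((unsigned char)*p)) {
            letters++;
            if (isupper((unsigned char)*p)) uppers++;
        }
        /* ':', '.', '-', ',' inside a token don't change its class here */
    }

    if (digits > 0 && letters == 0) return TOK_NUMBER;        /* 555-1212, 3:00, 4.25 */
    if (digits > 0 && letters > 0)  return TOK_MIXED_ALNUM;   /* P2P, SHA256: spell   */
    if (letters > 1 && uppers == letters) return TOK_ALLCAPS; /* IBM, HTTP: spell     */
    if (letters > 0) return TOK_WORD;                         /* hand to TTS as-is    */
    return TOK_SPELL;                                         /* symbols only: spell  */
}
```

Each class would then be routed to its own expansion rule (digit grouping, letter-by-letter spelling, abbreviation lookup, or plain synthesis).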

This still leaves a whole bunch of unpronounceable "text" (e.g., www, wrt, ext, etc.) that, I assume, should be treated much like the "all caps" case, above.
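One crude way to flag that residue (an invented heuristic for illustration, not a rule taken from the post) is to require at least one vowel and no long consonant runs; "www" and "wrt" fail, though any real rule set would still need an exception list (e.g., "ext" passes this test yet still wants to be spelled):

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Crude pronounceability heuristic: a token is "speakable" if it is
   all letters, contains at least one vowel, and never runs more than
   three consonants in a row.  Illustrative only -- real rules need
   an exception list for tokens like "ext". */
bool is_pronounceable(const char *t)
{
    int run = 0;
    bool vowel = false;

    for (const char *p = t; *p; p++) {
        int c = tolower((unsigned char)*p);
        if (!isalpha(c)) return false;          /* digits/symbols: spell it */
        if (strchr("aeiouy", c)) {
            vowel = true;
            run = 0;
        } else if (++run > 3) {
            return false;                       /* four consonants in a row */
        }
    }
    return vowel;
}
```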

The question is, then, how do I signal to the listener that some part of the output text has been deemed unpronounceable (by whatever criteria I impose)? E.g., the typical "all caps" cases (above), "sound right" when spelled out "in-line" with the rest of the adjacent text. Because that's the way they tend to be spoken, normally!

But, what if an ARBITRARY, non-pronounceable string of characters is encountered and I have to resort to "spell it mode"? Should I just spell it like I would the all-caps case? Or, switch to an alternate voice to draw attention to the fact that "you will now hear a sequence of letters, digits and symbols spoken in place of a 'word'"? Or, inject some other annunciator before falling into this mode? ("bong", "ding", etc.)

The user always has the option to re-hear a message. And, explicitly call for any portion of it to be spelled out. But, this is tedious (you wouldn't want to hear an entire sentence spelled out when just one or two "words" were in question). And, remember, the fact that it is a message that the user doesn't normally encounter already means the user is under some sort of stress ("What the hell has gone wrong??") and is probably not as patient as you'd like! Calling the Help Desk and playing the message over the phone won't be any better if the text is "unpronounceable"...

Reply to
Don Y

To be clear, what I am asking is for folks to imagine *listening* to some of the text that you routinely encounter and imagine how you would like these sorts of exceptions to be handled. Would you rather have to explicitly direct the device to back up and repeat a message spelling a word at a time when you were unsure of "something" that it said? Or, automatically opting to spell things out for you? Or, alerting you that it was about to do this? Or, ...

Reply to
Don Y

I've used speech for both input and output for over a decade. It sounds like you want to notify the user when a word is unpronounceable or incorrectly pronounced. I don't think you can do that, or that it's necessary. Just make the thing pronounce as much as possible given the expected words. People are smart enough to tell the difference, if it matters.

I use TextAloud for speech output. It does a reasonable job of pronouncing just about any English word. It's a relief when reading certain texts that I don't have to pronounce, or understand the pronunciation of, all of the unusual words.

Reply to
John Doe

By choice? Or need (visual impairment)? Or, none of my business?

What I want to do is NOT have to be able to handle totally unconstrained text "reasonably". Invest in handling EXPECTED text, well. And, make a "best effort" towards the unconstrained text that I (occasionally) encounter.

The nominal text that the user encounters is spoken reasonably well. And, as the user encounters it often enough, he/she has learned any idiosyncrasies associated with those pronunciations. I've also taken care to make sure I feed that text to the TTS in a way that ensures it will be spoken as I intend. E.g., "3:00AM" instead of "3 am" (the latter possibly being pronounced as "three am" while the former is pronounced as "three o'clock ayem")
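The "3:00AM" trick above amounts to pre-conditioning the text so the TTS *can't* misread it. A minimal sketch of that idea (the helper name and format choice are mine, not from the post):

```c
#include <stdio.h>
#include <stddef.h>

/* Render a time-of-day in the unambiguous "3:00AM" style described
   above: always H:MM plus AM/PM, no space, so the TTS never sees a
   bare "3 am".  Hypothetical helper for illustration. */
void format_time_for_tts(char *buf, size_t n, int hour24, int minute)
{
    int h = hour24 % 12;
    if (h == 0) h = 12;                 /* 0:xx -> 12:xxAM, 12:xx -> 12:xxPM */
    snprintf(buf, n, "%d:%02d%s", h, minute, hour24 < 12 ? "AM" : "PM");
}
```

The same pattern applies to dates, phone numbers, etc.: normalize *before* synthesis rather than hoping the engine guesses right.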

How would you expect the "from" header of your post to be pronounced? "John Doe open angle bracket always period look commercial at message period header close angle bracket"? Or, "John Doe open angle bracket ay el double you ay why es period ..."?

Do you expect the TTS to be aware of different contexts? I.e., "this is an email address; use these DIFFERENT rules when parsing and speaking it"?

*You* might know how to *direct* a TTS to handle this input. But, would others know to expose the "hidden" punctuation therein when relaying a problem they are having to another individual? Would they even realize how much text was there if the TTS happened to "Qbert-ize" it?

The point is the words aren't "expected". This is an exceptional situation that the user may never have dealt with, previously. And, "Support" may never have seen -- because *this* user is interacting with something that Support hasn't encountered, previously. Forcing the output to be spoken entirely as "spelled text" would make the *implementation* trivial! But, it would make *usage* really tedious!

TextAloud is *huge*, by comparison. With a reasonably rich user-interface. Try to squeeze it -- along with all of its required resources and user interfaces -- in a box for $8-12 DM+DL. Kilobytes of RAM instead of Megabytes/Gigabytes; MIPs instead of GIPs; milliwatts instead of watts; etc.

TextAloud (et ilk) sets out to do far more than I need done, nominally. It's this *occasional* need to support unconstrained text that burdens my current implementation. I'm just looking for an approach that addresses it without increasing size, power, cost, etc. needlessly.

Reply to
Don Y

Have it say "uhhhh", then spell it out like you do for all caps, then say "okay".

Failing that, I like the idea of making some kind of chime or beep when it resorts to spelling things out. Maybe put this both before and after the section that is spelled out.

Matt Roberds

Reply to
mroberds

Speech recognition...

It's the ultimate activator for macros/scripts in Windows. Anybody who loves automating their PC needs speech recognition. There are too few keystroke combinations to be used for activating macros. Those of us who use speech have many hundreds of macros that can be activated with easy-to-remember two- or three-syllable voice commands. For example, "ad one" types my main email address. And macros do many more things than just typing.

It's also partly due to repetitive strain injury (RSI). But I'm not a professional programmer, thankfully.

After the agony of training your voice, it also helps with writing. I believe my writing skill has improved much over the years thanks to speech recognition. And I never make a spelling error.

But technically it can be a bear to use.

Text-to-speech (TTS)...

I use speech output because I want to read without having to look at the text. Human voices sound better, as with voice recordings, but sometimes the information comes from a document. Sometimes from a very long document that I would rather have read to me while doing something other than looking at the text. It's also good for proofreading.

A good place to ask might be blind users. They deal with that stuff.

I am very selective about the text my TTS reads. Reading court documents is the most annoying of late. Sometimes I edit the document before reading. But TTS is most useful when the document is very long, like with some court documents. I cope with all of the extraneous verbal output. Still beats reading.

Sounds like you're trying to solve a very difficult TTS problem. You might need to ask people who are experienced with it.

Why don't you say what you're talking about? Not that it will help, but that's the usual method.

Are you talking about relaying information about electronic parts or service over the phone?

You would probably need to specify the text that you're talking about, to someone with the pertinent skill.

I suppose you're also talking about people who are not looking at a screen. That increases the communications difficulty level, of course.

Reply to
John Doe

Probably too "casual". Big sigh??

My current implementation inserts a pause before the unpronounceable text, plays a soft tone, then spells the characters in the "word" using a different voice (from the normal narration). At the end, another pause is inserted. From the cadence in the spelling effort, it is relatively easy to tell when "normal speech" will be resuming (which can be verified by the change in the voice *back* to "normal").
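The escape sequence just described -- pause, tone, voice change, character-by-character spelling, pause, voice restore -- can be sketched with stand-in primitives that merely log events so the ordering is visible (all names here are invented; a real build would map `emit()` onto the synthesizer's actual API):

```c
#include <stdio.h>
#include <string.h>

/* Event log standing in for real synthesizer calls. */
static char log_buf[256];

static void emit(const char *ev)
{
    strcat(log_buf, ev);
    strcat(log_buf, " ");
}

/* Annunciate an unpronounceable token per the scheme above. */
void speak_unpronounceable(const char *word)
{
    emit("pause");              /* leading pause: alert the listener      */
    emit("tone");               /* soft annunciator tone                  */
    emit("voice=spell");        /* switch to the spelling voice           */
    for (const char *p = word; *p; p++) {
        char ev[16];
        snprintf(ev, sizeof ev, "say:%c", *p);
        emit(ev);               /* recite each character in turn          */
    }
    emit("pause");              /* trailing pause before normal narration */
    emit("voice=normal");       /* resume the narration voice             */
}
```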

For *me*, this has proven very effective. The "non-canned" messages are uttered in a different (configurable, usually slower) voice, so I already know that I will have to pay closer attention to what is being said. The pause heightens your attention; the tone that follows focuses it more keenly.

ISTM that "audio memory" is considerably different than memory associated with visual presentations -- even if text is involved in each. Having to focus on a series of letters/digits/punctuation quickly pushes any previous message content out of your mind -- until you can reassemble the letters/digits/punctuation into something more easily represented in your mind (and STM). Then, you *may* be able to pull the preceding text back into this context. *Or*, you may need to "rewind" the audio stream and refresh your memory of what came before -- while holding onto the reassembled "word(s)" you've just heard.

The voice changes, pauses, tones, etc. put me on alert: be prepared to *stop* the playback and rewind it, advance word-at-a-time, etc. It's much easier to do this *at* the points where it is difficult to resolve what is being said than to have to rewind the entire sentence and start over again "from the end".

Unfortunately, I've run so many experiments on friends and colleagues that they are now "too accustomed" to the problem and, as such, unable to effectively comment as a "new user" might. And, far more patient: they aren't going to return the device to the store out of frustration as a *real* user might! :-/
Reply to
Don Y

If that is the case, don't store text, store phonemes.

--
umop apisdn
Reply to
Jasen Betts

That only works if you are emitting *just* canned phrases. It makes any unconstrained text impossible. And, it greatly complicates simple things like:

printf("Volume level: %d %%", volume);
printf("The current time is %d:%02d", hour, minute);

etc.

Reply to
Don Y

Coincidentally, I encountered an "automated (phone) attendant", yesterday, and was able to see the folly of "the easy way out" approach to this sort of "unpronounceable text". In particular, proper names (e.g., employee directory).

Having "selected" the desired employee (by specifying the first three letters -- insofar as you *can* specify individual letters on a touch-tone phone keypad -- of the name), I was greeted by the *unannounced* SPELLING of the employee's full name. I.e., just a list of characters recited to me: "Press 1 if that is correct".

Having no prior warning that I would be met with such a "recitation", I had to repeat the process in order to catch the entire string and reassemble it in my mind.

Pretty stupid approach for a company with ~13 employees! (why not have each employee simply *record* their own spoken name??)

Reply to
Don Y

Since you can implement a 'repeat last message' (RLM) command, perhaps a shift-mode RLM command could spell out everything in a phonetic alphabet? Or, even (for a help desk), some kind of software modem could be implemented -- which would make sense after phoning in to a help desk that has a demodulator...
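The phonetic-alphabet idea can be sketched as follows, using the NATO alphabet (an assumption; any alphabet the designer prefers would do, and `phonetic_spell` is an invented name):

```c
#include <ctype.h>
#include <string.h>

/* NATO phonetic alphabet, A..Z. */
static const char *nato[26] = {
    "Alfa","Bravo","Charlie","Delta","Echo","Foxtrot","Golf","Hotel",
    "India","Juliett","Kilo","Lima","Mike","November","Oscar","Papa",
    "Quebec","Romeo","Sierra","Tango","Uniform","Victor","Whiskey",
    "Xray","Yankee","Zulu"
};

/* Expand a token into space-separated phonetic words; digits and
   symbols are passed through verbatim.  Sketch only -- a real device
   would feed this to the TTS rather than a string buffer. */
void phonetic_spell(const char *word, char *out, size_t n)
{
    out[0] = '\0';
    for (const char *p = word; *p; p++) {
        if (isalpha((unsigned char)*p)) {
            strncat(out, nato[tolower((unsigned char)*p) - 'a'],
                    n - strlen(out) - 1);
        } else {
            size_t len = strlen(out);
            if (len + 1 < n) { out[len] = *p; out[len + 1] = '\0'; }
        }
        strncat(out, " ", n - strlen(out) - 1);
    }
}
```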

The best solution (long-term, not for your project) is to have a human interface standard for this class of messages, that supports a versatile message-delivery system for (1) cryptic short text, all lowercase, (2) expanded text, all cases and punctuation, (3) voice-only, (4) video overlay alert message, (5) full video message, (6) HTML on a browser screen, (7) XML, and (8...infinity) room to expand. Next, a virtual-reality CGI owl flies to your avatar and drops off a scroll...

Reply to
whit3rd

Spelling *everything* gets tedious. Imagine reading any of these phrases, here, in spelled-out form. Imagine if punctuation were significant (as in an IP address, email account, domain name, line of source code, etc.)

I've taken my cues from the Kurzweil Reading Machine (KRM): let the user step through the message in its entirety, word-at-a-time, spell-a-word, etc. Along with allowing the voice characteristics to be changed while doing so (some sounds may be more/less intelligible with a different vocal tract model, speaking rate, etc.).

As it is not a "continuous output" device, but, rather, "command-response", the interface can safely stall at the user's wish -- until he has resolved the "last message" and is willing to move on to a *new* exchange.

But, that doesn't mean one shouldn't make a best effort to convey the content of the message correctly *without* resorting to that sort of "message navigation". E.g., speaking "No Signal" instead of spelling it goes a long way to improving usability.

This device is intended to be *the* means of accessing services. E.g., here (home), it will be the bridge *to* the (telephone) system. So, its very use implies it (and the user) is an "island". "System down for maintenance", "No signal", "Account suspended", "Ths [sic] is a test" -- the last intended as an example of how a misspelling can render text unpronounceable.

An awareness among folks providing "public interfaces" would go a long way. How many web sites are, in effect, only viewable with a particular browser (brand, version, JVM rev, etc.)? And, that's usually just because the creator failed to *test* with other versions!

E.g., designers still use color to convey information -- despite the fact that ~7% of men are colorblind. "Engineers" are notorious for employing technospeak despite the fact that their users may be non-technical. The same may be true of other professions designing interfaces (or even just "message classes") for users outside of their professions.

Virtually everyone expects interfaces to be *visual* -- even when the content doesn't necessitate that form of presentation. Do you even *think* about how your content would be "rendered" in a non-visual medium?

[E.g., when I correspond -- email -- with blind colleagues, I make dramatic changes to my writing style due to my awareness of the limitations of the "screen readers" and other devices (e.g., haptic) that they will use to digest my words. It's unlikely that they would even *inform* other correspondents of their visual impairment but, rather, just struggle through the content using the tools available to them -- perhaps even *missing* portions due to the limitations thereof!]

And, of course, there are always the "typos" that plaque :> even the best-intended... (as above).

Reply to
Don Y
