Hi,
I've got a product that uses audio/speech, EXCLUSIVELY, as its output modality. It's severely resource-constrained (size, power, cost, etc.), so the TTS system doesn't have features that aren't "guaranteed" to pay for themselves (e.g., it does an abysmal job reciting poetry, and mispronounces many place names -- "Worcester", "Phoenix", etc. -- and proper names -- "Xerox", "Giovanni", etc.).
But, the text that it is primarily charged with speaking, it handles well -- obviously, because the text and the TTS were designed hand-in-hand! And, users become accustomed to the speech and its idiosyncrasies with use.
However, there is a whole class of "unconstrained" input that I have to address -- text sourced from outside the device -- that I could *never* handle even "adequately" given the resources available. This is true even when the application domain is reasonably well constrained. (E.g., expecting it to handle "The Polish woman polishing the pews in the church of St. Zygmunt Gorazdowski on Leigh St. read the hymnal that I am reading" would be an absurd goal!)

This unconstrained input tends to occur in "exceptional" circumstances: something unexpected has happened, which exposes these dialogs to the user. So, they aren't familiar with them -- nor would the folks at a "Help Desk" be (as they are outside the scope of the product!).

The problem is, most interfaces are designed expecting a user to *read* them. So, things like "Contact Dr. S. Martin at x237 for assistance 9A-5P" are relatively easy to comprehend "in print form". But, deciding how that should be *spoken* is an entirely different matter!
I've implemented a fair number of rules to try to intelligently resolve things like numbers (555-1212; 3:00; 1:20:45; 4.25; 1,997; 2015; FE:00:C2:80:99:04; 192.168.1.193; etc.). For "all caps" tokens (IBM, WWW, FTP, DNS, HTTP, HTML, RSA, etc.) and mixed alphanumerics (P2P, SHA256, etc.), I opt to "spell" them. Common abbreviations I replace with their expected spoken counterparts (etc., e.g., i.e., Mon, Feb, and so on).
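To make the shape of that dispatch concrete, here's a (much simplified) sketch in C -- the abbreviation table, "shape" tests and thresholds are illustrative placeholders, not the product's actual rules:

  /* Sketch only -- the table and pattern tests stand in for the
   * real, domain-specific rules. */
  #include <ctype.h>
  #include <stdio.h>
  #include <string.h>

  typedef enum {
      TOK_NUMERIC,   /* 555-1212, 3:00, 4.25, 192.168.1.193, FE:00:C2:... */
      TOK_SPELL,     /* IBM, HTTP, P2P, SHA256 -> letter-by-letter */
      TOK_ABBREV,    /* etc., Mon, Feb -> expand from a table */
      TOK_WORD       /* hand to the TTS front end as-is */
  } tok_class;

  static const char *abbrev[] = { "etc.", "e.g.", "i.e.", "Mon", "Feb", 0 };

  static int is_abbrev(const char *t) {
      for (int i = 0; abbrev[i]; i++)
          if (!strcmp(t, abbrev[i])) return 1;
      return 0;
  }

  static int has_digit(const char *t) {
      for (; *t; t++) if (isdigit((unsigned char)*t)) return 1;
      return 0;
  }

  /* Hex digits plus the separators seen in times, phone numbers,
   * decimals, IP addresses and MAC addresses. */
  static int numeric_shape(const char *t) {
      for (; *t; t++)
          if (!isxdigit((unsigned char)*t) && !strchr(":-.,", *t)) return 0;
      return 1;
  }

  static int all_caps(const char *t) {
      for (; *t; t++) if (!isupper((unsigned char)*t)) return 0;
      return 1;
  }

  tok_class classify(const char *t) {
      if (is_abbrev(t))                     return TOK_ABBREV;
      if (has_digit(t) && numeric_shape(t)) return TOK_NUMERIC;
      if (all_caps(t) || has_digit(t))      return TOK_SPELL;
      return TOK_WORD;
  }

  int main(void) {
      const char *demo[] = { "3:00", "IBM", "SHA256", "Feb", "hymnal" };
      for (int i = 0; i < 5; i++)
          printf("%-8s -> %d\n", demo[i], classify(demo[i]));
      return 0;
  }

The TOK_NUMERIC bucket then fans out to its own sub-rules (phone number vs. time vs. IP address, etc.) based on the separator pattern.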
This still leaves a whole bunch of unpronounceable "text" (e.g., www, wrt, ext, etc.) that, I assume, should be treated much like the "all caps" case, above.
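The "pronounceable?" gate I have in mind for catching those is crude -- a vowel test plus a cap on consonant runs, something like the sketch below. (Note it still passes "ext", so an exception list, or real phonotactics, has to sit on top of it; the threshold is a guess.)

  /* Crude pronounceability gate -- sketch only.  Anything that fails
   * gets routed to "spell it" mode, just like the all-caps case. */
  #include <ctype.h>
  #include <stdio.h>
  #include <string.h>

  static int is_vowel(int c) { return strchr("aeiouyAEIOUY", c) != NULL; }

  int pronounceable(const char *t) {
      int vowels = 0, run = 0;
      for (; *t; t++) {
          if (!isalpha((unsigned char)*t)) return 0;   /* symbols -> spell */
          if (is_vowel(*t)) { vowels++; run = 0; }
          else if (++run > 3) return 0;                /* long consonant runs */
      }
      return vowels > 0;                               /* "www", "wrt" have none */
  }

  int main(void) {
      const char *demo[] = { "www", "wrt", "ext", "hymnal" };
      for (int i = 0; i < 4; i++)
          printf("%-7s -> %d\n", demo[i], pronounceable(demo[i]));
      return 0;
  }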
The question is, then, how do I signal to the listener that some part of the output text has been deemed unpronounceable (by whatever criteria I impose)? E.g., the typical "all caps" cases (above) "sound right" when spelled out in-line with the rest of the adjacent text -- because that's the way they tend to be spoken, normally!
But, what if an ARBITRARY, non-pronounceable string of characters is encountered and I have to resort to "spell it mode"? Should I just spell it like I would the all-caps case? Or, switch to an alternate voice to draw attention to the fact that "you will now hear a sequence of letters, digits and symbols spoken in place of a 'word'"? Or, inject some other annunciator ("bong", "ding", etc.) before falling into this mode?
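Mechanically, any of those options is cheap to bolt on -- something like the sketch below, where play_tone(), set_voice() and say_char() are placeholders for whatever hooks the synthesizer actually exposes (stubbed here with printf()):

  #include <stdio.h>

  /* Stubs standing in for the synthesizer's real hooks. */
  enum { TONE_SPELL_MODE, TONE_SPELL_DONE };
  enum { VOICE_NORMAL, VOICE_ALTERNATE };
  static void play_tone(int t) { printf("[tone %d]", t); }
  static void set_voice(int v) { printf("[voice %d]", v); }
  static void say_char(char c) { printf(" '%c'", c); }

  /* Bracket an unpronounceable token with an annunciator and speak it
   * letter-by-letter, optionally in an alternate voice. */
  void render_spelled(const char *tok) {
      play_tone(TONE_SPELL_MODE);  /* "ding": letters follow, not a word */
      set_voice(VOICE_ALTERNATE);  /* or keep the main voice -- the open question */
      for (; *tok; tok++)
          say_char(*tok);          /* "double-u", "two", "colon", ... */
      set_voice(VOICE_NORMAL);
      play_tone(TONE_SPELL_DONE);  /* optional closing "bracket" */
  }

  int main(void) { render_spelled("wrt"); printf("\n"); return 0; }

So the cost isn't in the mechanism -- it's in deciding which cue a listener will actually learn to parse.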
The user always has the option to re-hear a message -- and to explicitly call for any portion of it to be spelled out. But, this is tedious (you wouldn't want to hear an entire sentence spelled out when just one or two "words" were in question). And, remember: the fact that it is a message that the user doesn't normally encounter already means the user is under some sort of stress ("What the hell has gone wrong??") and is probably not as patient as you'd like! Calling the Help Desk and playing the message over the phone won't be any better if the text is "unpronounceable"...