Embeddable text-to-speech

pozz 11 years ago

I'm designing an electronic gadget that will interact with humans through IVR (Interactive Voice Response) and a keypad. The user hears the voice and presses some buttons to take some actions.

Most of the sentences are well known at design time, so I could generate and record them on the computer and save them in memory (PCM, ADPCM, ...). Unfortunately some sentences are customizable by the user, so they are known only at run-time.

So I'm thinking of TTS (Text To Speech) technology that generates whatever word/sentence at run-time, starting from the associated string.

How difficult is it to integrate TTS functionality in an electronic product? How much MCU power does TTS need? Do you know some TTS libraries that can be embedded in an electronic project? Do you know of some free libraries?

Please note, I don't need real "on-the-fly" TTS. I could spend some time generating the short message to play.

John Speth 11 years ago

As long as you keep your requirements bounded, it should be easy to achieve your goals. As a point of reference, I was able to output 8 bit PCM code (wave files) at 8 kHz SR to a PWM bit with no problem on an MSP430 running at 8 MHz. The output ISR ran at the sample rate with about 75% loading.
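As a rough check on those figures, the arithmetic can be sketched in Python. The clock, sample rate and load are the numbers quoted in the post; reading the 75% figure as a fraction of the cycles available per sample is an assumption, and the variable names are only illustrative.

```python
# Cycle budget for a sample-rate ISR, using the figures quoted in the post.
CPU_HZ = 8_000_000       # MSP430 clock
SAMPLE_RATE_HZ = 8_000   # 8 kHz, one ISR invocation per sample
ISR_LOAD = 0.75          # "about 75% loading" (assumed to mean share of cycles)

cycles_per_sample = CPU_HZ // SAMPLE_RATE_HZ          # cycles between samples
isr_cycles = int(cycles_per_sample * ISR_LOAD)        # cycles eaten by the ISR
spare_cycles = cycles_per_sample - isr_cycles         # left for everything else
bytes_per_second = SAMPLE_RATE_HZ * 1                 # 8-bit mono PCM

print(cycles_per_sample, isr_cycles, spare_cycles, bytes_per_second)
```

Under that reading, only a quarter of the CPU is left for anything else while audio is playing, which matters if synthesis has to run concurrently with playback.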

JJS

Don Y 11 years ago

What quality of speech? What level of naturalness? A single voice? Or, user-selectable/customizable? Presumably entirely in English? Or, do you need to support other languages? Concurrently??

This is called "limited domain synthesis". Think of TI's "Speak 'n' Spell" product ("The cow goes 'mooo'")

And there's the rub!

How does the user indicate the message to be spoken? I.e., is it "unconstrained text" that you read from a char[]? Could the user opt to command it to speak "I'd rather have this bottle in front of me than a frontal lobotomy"? Or, will the sentences/phrases still be largely constrained by the application domain: "The date of your last withdrawal was..."?

I.e., could you provide a set of ("prerecorded") words that the user can then "string together" to form messages? So, the actual message is created by the user but built from words that your device already knows how to speak?

What happens if the user specifies a word that is hard to pronounce ("Phoenix", "salmon", "Worcester", etc.) using "canned" rules?

What happens if the user specifies a "word" (sequence of letters) that is unpronounceable (Mr. Mxyzptlk)?

How do you handle special characters (pronounce "%^*&%$!")? Acronyms (LPC, IVR, TTS, MCU, etc.)? Numbers (34; 2015; 1,093; 192388535; etc.)? Mixed strings ("Please call 555-1212 x342 between the hours of 8:00AM and 4:00PM CST")?

This always *sounds* like the right approach -- until you look into the issues that it drags in with it! It's really hard to come up with a "good" set of rules that can handle unconstrained input "practically" (abandoning the goal of "properly"!)

Easy *or* hard -- depending on your constraints, goals, resources, expertise, etc.

That depends entirely on the constraints you are willing to impose and quality you seek. You can make noises that sound like speech with a 1MHz 8b CPU. If you only relied on it for occasional interactions, you could probably tolerate it. OTOH, you wouldn't want to listen to it for an appreciable period of time!

Start at CMU's Hephaestus page. You might also want to look into "dialog systems". Also, don't forget to research intelligibility testing (e.g., modified rhyme test, anomalous sentences, etc.) as having speech that isn't intelligible is like having an LED indicator that's "burned out"!

Invest some time in understanding the "listening prowess" of your target audience.

Note that if you try to synthesize and *then* play back, you need enough R/W store to hold the entire message as you are creating it. I.e., so it has been completely synthesized *prior* to beginning playback. If the user controls the content of the message, how do you ensure that you have *enough* space to store it? "This is a really long message that would, obviously, require considerably more memory to synthesize" Said another way, how do you handle the case when the user has asked you to speak something that is too long for your "buffer"?

OTOH, running the synthesizer and playback concurrently allows you to shrink your buffer (to a size that just handles jitter in the algorithms) and speak phrases of "unlimited" length.
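To put numbers on the two buffering strategies, here is a minimal sketch. It assumes 8-bit mono PCM at 8 kHz (the format mentioned earlier in the thread); the worst-case seconds per character and the jitter allowance are made-up placeholders that would have to be measured on a real synthesizer.

```python
SAMPLE_RATE_HZ = 8_000   # assumed output format: 8 kHz
BYTES_PER_SAMPLE = 1     # assumed 8-bit mono PCM

def buffer_bytes(seconds):
    """Bytes of PCM needed to hold the given duration of audio."""
    return int(round(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE))

def full_message_buffer(n_chars, worst_seconds_per_char):
    """Synthesize-then-play: the whole utterance must fit in RAM."""
    return buffer_bytes(n_chars * worst_seconds_per_char)

def streaming_buffer(jitter_seconds):
    """Concurrent synthesis: the buffer only has to cover synthesis jitter."""
    return buffer_bytes(jitter_seconds)

# Hypothetical figures: a 16-character field, 0.25 s per character worst case,
# and 50 ms of jitter when synthesizing while playing.
print(full_message_buffer(16, 0.25))  # 32000
print(streaming_buffer(0.05))         # 400
```

The point is the ratio rather than the absolute values: the first number grows with the longest message the user is allowed to enter, while the second does not.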

[Of course, encoding prosody on-the-fly gets trickier]

pozz 11 years ago

The voice should be as understandable as possible. Of course, greater quality is better, but I don't need high fidelity quality. At the moment, I need only Italian language and single voice. Not customizable.

I think this isn't important for TTS. The message will be stored in memory someway.

In order to simplify the TTS implementation, I could constrain the user-customizable text to simple words, not sentences.

So I will have some fixed/constant/"known at compile time" sentences that I can generate and save with high-quality TTS software on desktop computers. And some user-customizable words.

For example:

"Hello Rick, your air conditioner at the first floor has just switched off."
       ^^^^       ^^^^^^^^^^^^^^^        ^^^^^

The words marked with carets ^ are user customizable.

I already thought about this possibility, but it is a big limitation that I would prefer to avoid. Think of names (Rick in the example): it's impossible to have a full list of prerecorded names in the device.

The user can hear in advance how his word is pronounced by the device. If it's too difficult to understand, he has the possibility to change the words to something more understandable.

The user will change it.

Of course numbers must be pronounced well, for example for some settings (only small integers, in the range 0-100) and for times. But the sentences where the numbers are used are generated at compile time, so I can avoid 8:00AM or 4:00PM CST. The user will never create sentences like those.

Of course, the user has a limited space to write words/sentences.

Don Y 11 years ago

That really doesn't say much :-/

Have you listened to many synthetic voices? They range from *very* natural to "ick".

Given that you appear to be pasting "compile-time" speech with "run-time" speech, are you willing to tolerate the sudden "voice/quality" change that will be apparent where you have "filled in the blanks" with run-time utterances? I.e., you can have very natural compile-time speech that is laced with potentially *crude* run-time phrases.

The user will obviously know where the "filled in blanks" occur in the audio output (which may be acceptable to you). What might *not* be as acceptable is the change in *quality*/intelligibility that results.

OK. Note that I'm speaking from an English language perspective. No idea how "uniform" the ruleset might be for Italian... (English is full of exceptions)

Sorry, perhaps my question wasn't clear enough.

Does the user type in (somehow) a series of characters? Does he choose from among preselected words/phrases? etc.

I.e., I am trying to ascertain how constrained/unconstrained the input will be. With a keyboard, a user could potentially type: "supercalifragilisticexpialidocious"

OTOH, a user selecting phrases from preexisting choices (even if you actually synthesize the voice on-the-fly) has more limited choices:
  at the first floor
  at the second floor
  at the third floor
  etc.

That essentially eliminates the need for any prosody controls (as portions of the "sentence" will have been predefined and, thus, have their own prosody imposed irrespective of the "blanks filled in").

But, can the user specify *any* word? "smartphone"? "technology"? "disillusionment"? "apartheid"?

So:
  on (at) the first floor
  on (at) the second floor
  on (at) the third floor
  in the penthouse
  in the basement
  in the garage
Or, perhaps: in the basement of your clothing store for the dog kennel etc?

Think, again, about that. Ignore, for the moment, proper names/nouns and, instead, concentrate on just *words*. You can store a rather large dictionary of words and their (encoded!) pronunciations if you can eliminate the code and the "rule sets" that determine how to convert "Rick" into /R/ /IH/ /K/. Furthermore, you could compress this "dictionary" by noting that you need only represent upper (or lower) -case alphas (RICK, rick, RiCk, etc. all result in the same pronunciation) and the corresponding sounds into which the "text" will be mapped. E.g., ~5 bits for each character in the "name" and ~6 bits for each sound.

So, "Rick" requires 38 bits (about 5 bytes) to encode (along with its pronunciation). At run-time, you need only convert the "sound codes" into actual "audio waveforms" -- instead of having to convert the textual representation of the name into the sound codes *and* then into waveforms.
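The 38-bit figure can be reproduced with a small packing sketch. The 5-bit letter and 6-bit sound widths are the ones suggested above; the `SOUND_CODES` table is a hypothetical, tiny stand-in for a real phoneme inventory, and the layout (letters first, then sounds, zero-padded to a byte) is just one possible choice that ignores any length or terminator field.

```python
BITS_PER_LETTER = 5   # 26 case-folded letters fit in 5 bits
BITS_PER_SOUND = 6    # room for up to 64 sound codes

# Hypothetical sound-code table; a real one would list the full inventory.
SOUND_CODES = ["R", "IH", "K", "AY", "B", "T"]

def pack_entry(word, sounds):
    """Pack a word and its pronunciation into bytes; return (bytes, bit count)."""
    bits = 0
    nbits = 0
    for ch in word.lower():                      # RICK, rick, RiCk all fold together
        bits = (bits << BITS_PER_LETTER) | (ord(ch) - ord("a"))
        nbits += BITS_PER_LETTER
    for s in sounds:
        bits = (bits << BITS_PER_SOUND) | SOUND_CODES.index(s)
        nbits += BITS_PER_SOUND
    pad = (-nbits) % 8                           # zero-pad up to a whole byte
    return (bits << pad).to_bytes((nbits + pad) // 8, "big"), nbits

packed, nbits = pack_entry("Rick", ["R", "IH", "K"])
print(nbits, len(packed))  # 38 5
```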

[I have no idea how large your vocabulary will be so no idea how large the dictionary would be.]

Depending on the level of expertise of your users -- and the hoops through which they are willing to jump -- you could also direct them to specify the phrases *using* those sound codes. I.e., force them to do the "letter to sound" conversion in their heads -- possibly aided by allowing them to easily replay what they have just "typed": "Hmmm... that 'i' sound needs to be shorter. Let me try..."

Given the variation in how proper names are pronounced, this may well be the best approach. E.g., my (english) ruleset would butcher "Alfio", "Gabriella", etc. I'm not sure it would even handle "ciao" properly!

OK.

But, you still have to have rules that allow *you* to come up with a pronunciation. And, the user needs a way of coercing the device to pronounce the word the way he *wants* it to be pronounced.

Does "read" rhyme with "tweed" or "bed"? I.e., a user wanting it to be pronounced in a particular way would have to misspell it "reed" or "red" (assuming the device picked the "wrong" pronunciation).

Allowing the user to enter "sound codes" avoids that problem.

So, you need rules that allow [] to be pronounced one way while []: is pronounced another.

Presumably, the mechanism by which the user specifies the "words" that he wants spoken will disallow any digits in that "text specification"?

(Likewise, punctuation and other special symbols?)

I'm not talking about specifying the text. Rather, I am addressing your comment about "spend some time to generates the short message to play". I.e., starting with "text", you'd have to convert the graphemes to phonemes; then, synthesize the audio waveform (however your output device expects to be fed) from these sound codes and prosodic envelope.

The bulk of the "work" (CPU cycles) is in the creation of the waveforms. If you can't "keep up" with real time, then you need to be able to buffer the waveform while you create it -- and before you "utter" it. Yet, once you *start* to "speak" (i.e., push signal out the speaker), you probably can't arbitrarily stop/pause without affecting intelligibility (i.e., you'd have to make sure you only paused at word boundaries; never in the middle of a word)

So, you need a buffer for all that "analog data". The number of characters in the "input word" has little to do with the duration of the utterance that will ultimately result.

E.g., the /IH/ vowel sound (rIck) is probably half the duration of the /AY/ vowel sound (bIte). Note how long your "mouth is engaged" saying the two words. Or, "ewe"/"you" vs. "hit". (now you see why we call them "short" and "long" vowels! :> )

If you (your users) can tolerate the effort of "specifying sounds" instead of "specifying letters", it might be best to let them specify the text in that manner.

At the very least, you could run the text-to-sound portion of the algorithm as soon as they have typed in the desired text and store the *sound* codes at that time -- to eliminate the effort of doing the conversion at "run time" (i.e., when the actual spoken output is *required*).

Before you go too far down this road, you may want to explore some of the on-line synthesizers to get a feel for how robust they are, the quality of their voices, etc. (many are diphone based; you can actually make the synthesizer sound like a particular -- REAL -- *person*!)

Then, explore some of the "cheaper" approaches (i.e., those that you are more likely to employ in your implementation). Get a feel for how the costs change -- as well as the "quality".

At the very least, you'll get an appreciation for how much processing we automatically do when handling "combinations of characters" in particular, specific contexts.

pozz 11 years ago

You agree with me: high quality for "compile-time" sentences *and* for "run-time" sentences is better. But I don't need it. The device is for the low-cost market, so the user won't have high expectations.

It will be acceptable. The change in "quality" corresponds exactly to the customizable words. So the user understands what happens.

The user can type any sequence of chars, but he is encouraged to play and check the result. If it is too noisy, he can change the words.

Yes.

The user can write everything, but it is reasonable he writes simple words.

I'm sure the user will want to use a word that isn't present in the dictionary. I'd prefer to avoid this approach.

No, the user will not have this kind of expertise.

I understand, but it's difficult to explain to my users. It is simpler to explain to them how to misspell the word in such a way that the final result is similar to the sound they want to hear.

The user will never need to customize texts with numbers or times. Numbers are managed at compile time.

I understand, but you agree with me that a short text (a small number of chars) corresponds to a short waveform duration. I can calculate a worst case for a certain number of chars.

Yes, I'm trying to understand text-to-speech world, but it seems too difficult for me. I hoped it was possible to embed some ready-to-use TTS libraries (free of charge or after payment) as source codes or object files, without being a TTS expert. It seems, this isn't the case.

Anyway, Thank you very much for your explanations and time.

Don Y 11 years ago

Only *you* can comment on your market and what it will accept. I'm just pointing out that there *will* be a very noticeable "pieced together" feel (sound) to it.

Have you also considered just letting the user *record* his messages (i.e., using his own voice via a microphone *or* "downloading" it into the device from a "PC")?

OK. In my application, the user has no "preview" capability. So, he has to be able to recognize what the device (as a proxy) is trying to "tell" him regardless of the complexity of that (unconstrained) input. As such, I have controls that allow him to replay messages, "spell" individual words/numbers, change the characteristics of the speech (pitch, rate, etc.) to be more intelligible, etc.

Then your TTS rules will need to address every potential case. Note, however, that if your rules are *intuitive*, users will quickly learn how to misspell the text in order to get an acceptable pronunciation. E.g., in English, the only (phonetic) use for the letter 'C' in the input text is to represent the "CH" sound. All other C's can be replaced with 'S' or 'K'.

You can probably also eliminate a lot of the subtle differences in sounds that would promote more naturalness. E.g., (in English), the 'N' sound in "Next" is subtly different from that in "buttoN"; likewise, the 'L' in "Let" vs. "piLL"; the 'R' in "Ready" vs. "tiRe"; the 'W' in "Which" vs. "Wet"; etc.

Find a word-list of "common" words (in Italian) and prepare to feed them to your TTS to see how good/bad the resulting pronunciation is. And, for those that are less than ideal, see if you can misspell them in ways that make their pronunciations more acceptable. Finally, look at those misspellings and see if a user could readily come to the same sort of realization (*if* the pronunciation of the proper spelling was "bad enough" to warrant).

See above. Sorry, I can't comment on appropriate "bastardizations" for Italian. But, in English, a "motivated user" can usually come up with ways to coax a TTS into "uttering the sounds" that he'd like to hear.

Important tip: be sure to encode some basic punctuation. People quickly learn that they can influence "playback" if they insert a ',' to force a small pause at a certain point in the text; a '.' for a longer pause; etc. If you also tried to encode prosody (doubtful given your description), things like '!' and '?' could be artificially injected to influence that.
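A minimal sketch of that punctuation handling: split the user's text at ',' and '.' and attach a pause to each chunk. The pause lengths in `PAUSE_MS` are invented placeholders, and the function name is only illustrative; a real device would tune the values by ear.

```python
import re

# Hypothetical pause lengths in milliseconds.
PAUSE_MS = {",": 150, ".": 400}

def split_with_pauses(text):
    """Return a list of (chunk, pause_ms_after_chunk) pairs."""
    out = []
    for chunk, punct in re.findall(r"([^,.]+)([,.]?)", text):
        chunk = chunk.strip()
        if chunk:
            out.append((chunk, PAUSE_MS.get(punct, 0)))
    return out

for chunk, pause in split_with_pauses("Hello Rick, your air conditioner has switched off."):
    print(repr(chunk), pause)
```

Each chunk would be handed to the synthesizer in turn, with silence of the given length emitted afterwards.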

The result will be the same -- *if* your ruleset is simple/obvious. E.g., "'c' only makes sense in 'ch', else 'k'". E.g., I would encode "ciao" as "chow" to get the pronunciation I (English) sought.
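The "'c' only makes sense in 'ch', else 'k'" rule is simple enough to sketch directly. This is a toy rewrite pass, not a real letter-to-sound front end; the function name `c_rule` is made up for illustration.

```python
def c_rule(word):
    """Rewrite 'c' per the simple rule: keep 'ch', turn any other 'c' into 'k'."""
    w = word.lower()
    out = []
    i = 0
    while i < len(w):
        if w[i] == "c":
            if i + 1 < len(w) and w[i + 1] == "h":
                out.append("ch")   # the one phonetic use of 'c' that is kept
                i += 2
                continue
            out.append("k")        # every other 'c' is treated as 'k'
        else:
            out.append(w[i])
        i += 1
    return "".join(out)

print(c_rule("ciao"))  # kiao
print(c_rule("chow"))  # chow
```

With a rule this obvious, a user who hears "kiao" has a fair chance of working out that "chow" is the spelling to type instead.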

So, he'd never say "The air conditioner in room 307 has just switched off"? Or, if he wanted to do so, he would be expected to write it as "The air conditioner in room three oh seven has just switched off".

I think you will find this isn't as obvious/easy as you expect. You may find it easier to just "run a tighter loop" -- possibly dismissing other activities at the time -- and synthesize on the fly. This can dramatically shrink your memory (buffer) requirements. If you are willing to accept "crude" for the "filled in blanks", then a lot of processing can be skipped (e.g., prosody -- just rattle those things off in a monotone)

It's not easy. Speech (and language) have lots of subtleties that we take for granted in our daily life. Why "nickEL", yet "pickLE"? Hopefully, Italian (as a language) is "more regular" than English. (ISTR some of the Scandinavian languages are very "regular")

You can play with flite but I think you will find it too large for your needs. There are several other "open" TTS implementations (though not sure how well suited to Italian their rulesets would be) but, most suffer from the same "lack of concern for resources" that you might encounter in a deeply embedded product.

I'm starting my *third* version (different approach than either of the first two) at a "lean TTS" and suspect I will be disappointed with that, as well (primarily due to the unconstrained vocabulary consequences -- it's always easy to come up with "typical" things that are difficult to handle WITHOUT making the algorithms incredibly complex)

Good luck!

--don

Robert Wessel 11 years ago

I'm a bit unclear on your scenario.

Are you going to be generating the speech offline from the device, and then installing the resulting sound file (.wav, etc.) on the device? If so, there are a number of possible ways to do that without too much work.

Windows, for example, has a built in TTS system, and an API an application can use ("SAPI"). An obvious use case is with direct output to the user, but you can also write output to a .wav file.

Windows comes with a built in TTS engine, which does a pretty good job for general use (it's the basis for MS's default screen reader), and has likely had a ton more work put into parsing and analysis of text than you could justify. But if it's not good enough, there are third party plug-in TTS engines that you can add as well. These usually add other voices and additional customization options.

Even if you weren't primarily doing your management on a Windows machine, you ought to be able to toss a Windows box or two in a corner as a TTS .wav file server.

I believe MS uses the same SAPI on their mobile systems as well.

I'm sure similar exists for Linux.

There is a TTS package and API for Android. That might be usable, even if you have to run Android on a machine as a server. My understanding is that it uses the same text analysis engine as Google Translate does, and Google Translate has a TTS option as well (do English-to-English as the translation and select [...]. My guess is that it's the same TTS back end as what's in the Android package). It may well be that there's an API or service you can use in there somewhere.

And the Android version is presumably open source, although I'm sure it's going to be a handful.

Even if you weren't planning on doing this offline, there are some advantages to that, especially if the device (or management application) has internet access - there's a big lump of code you don't have to distribute and run on the device.

Don Y 11 years ago

Join the club! ISTM that the OP wants to have (reasonably) high quality *canned* phrases/sentences into which the user can "salt" user-specific data/phrases:
  "The _____________ device has reported a power failure."
  "Your _______ door has been opened!"
  "The _________ seems to be running too hot."
The canned portion can obviously be "processed" (whatever THAT means) at compile-time as they are invariant. But, the "blanks" need to be created "at CONFIGURATION time" (which, presumably, is somewhere between compile-time and run-time).

Further, the content for those "blanks" is relatively unconstrained and may include "words" that defy traditional TTS algorithms. E.g., names (how do *you* pronounce "Berlin"?).

I really don't understand the need for a compile time TTS! Why not just *record* the speech and then encode it ? Why let an (inferior) algorithm try to come up with "natural sounding" speech when you could find a genuine human being to do this??

I am trying to understand a situation where "storing" a message in "audio" form makes sense given that he plans on having some TTS capability in the product. AFAICT, the only advantage comes if you can't do the synthesis on-the-fly and have to resort to building output waveforms in volatile memory at run-time; this hybrid approach could let you shrink the amount of such memory in favor of "ROM" with the canned representations.

[OTOH, a cleverer approach could synthesize everything "in small word groups" and piece them together -- with pauses between]

ISTM, that storing the canned portions in the same "bastardized spellings" that were discussed up-thread and letting the TTS synthesize *everything* would be the better approach. E.g., I run *all* of my "canned text" through the TTS engine in my device just to eliminate the burden on the developer of having to "precompile" the "spoken output".

But, the OP understands his market better than I...

Let the *user* download a .WAV file from *his* PC. Then, just concentrate on being able to reproduce those files accurately (given that they may contain "wonkiness"). Reserve a portion of your flash to hold messages? Add something to verify that portion of the flash contains something that *looks* like a message?? (hey, user may opt to store sound-effects instead of actual "spoken speech")

[There are some low vision aids that just let the user record their own voice in place of accepting text for [...]. Then, the device simply plays back their recording when they want to "access" that "data": "This is a can of corn niblets"; "Appointment at dentist on Friday"; etc.]
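One way to check that a flash region holds something that "looks like a message" is to try parsing it as a WAV file and reject anything implausible. The sketch below uses Python's standard `wave` module to show the idea; the accepted format (mono, 8 or 16 bit) and the 10 second limit are assumptions, and on the device itself the same checks would be done by hand on the RIFF header.

```python
import io
import wave

def looks_like_wav(blob, max_seconds=10.0):
    """Return True if blob parses as a mono PCM WAV of plausible length."""
    try:
        with wave.open(io.BytesIO(blob), "rb") as w:
            if w.getnchannels() != 1 or w.getsampwidth() not in (1, 2):
                return False
            frames, rate = w.getnframes(), w.getframerate()
            return rate > 0 and 0 < frames <= max_seconds * rate
    except (wave.Error, EOFError):
        return False       # not RIFF/WAVE, or truncated

def make_wav(n_frames, rate=8000):
    """Build a small 8-bit mono WAV in memory (silence), for trying the check."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(1)
        w.setframerate(rate)
        w.writeframes(bytes([128]) * n_frames)
    return buf.getvalue()

print(looks_like_wav(make_wav(8000)))   # one second of audio
print(looks_like_wav(b"\xff" * 64))     # what erased flash typically reads as
```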

The bulk of the code involved in TTS lies in the "rules" by which text is evaluated in context, etc. A formant-based synthesizer (i.e., feed it with "sound codes") is surprisingly small/compact -- tens of KB. Biggest issue is dealing with all the run-time math (esp if you don't have floats).
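To give a feel for the "run-time math", here is a sketch of the second-order digital resonator that Klatt-style formant synthesizers are built from, with the filter loop done in integers only. The coefficient formulas are the standard ones for that resonator; the Q14 fixed-point scaling, the function names and the 500 Hz / 60 Hz example formant are choices made for this illustration, not anything taken from a particular implementation. In practice the coefficients would be precomputed or table-driven rather than calculated with floats on the target.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, sample_rate_hz):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (unity gain at DC)."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def to_q14(x):
    """Convert to Q14 fixed point; |b| can approach 2, which still fits 16 bits."""
    return int(round(x * (1 << 14)))

def run_resonator_q14(samples, a_q, b_q, c_q):
    """Integer-only filter loop, as it might run on an MCU without floats."""
    y1 = y2 = 0
    out = []
    for x in samples:
        y = (a_q * x + b_q * y1 + c_q * y2) >> 14
        out.append(y)
        y2, y1 = y1, y
    return out

a, b, c = resonator_coeffs(500.0, 60.0, 8000.0)   # example formant, assumed values
out = run_resonator_q14([1000] * 2000, to_q14(a), to_q14(b), to_q14(c))
print(round(a + b + c, 6), out[-1])
```

A formant synthesizer chains a handful of these per sample, so the cost is a few multiplies and shifts per resonator at the sample rate.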

OTOH, a diphone synthesizer may require several MB for the unit database. And, a fair bit of smarts piecing together adjacent diphones.

If you can afford crude text-to-SOUND rules, you can trim that portion of the codebase to a few KB -- largely to encode the rules. Even those can be simplified if you are willing to shift some of the burden to the user/developer (e.g., replace "qu" with "KW" or "K", as appropriate, in the "text" fed to the TTS and eliminate those "q" rules). Skip prosody and you can save there, as well.

[Low cost product, low expectations from user...]

pozz 11 years ago

This is exactly what my competitors already do, but I was thinking how to improve this.

The final result/sound isn't good: you have a mix between very good words (maybe from a female voice) and the words pronounced by the user at the microphone (maybe a male user).

The best option is to use the same TTS engine for "compile-time" words/sentences and "run-time" words. In this way, the result will not have quality gaps. But it isn't simple to embed a high-quality TTS engine.

Another possibility is to use the same TTS engine with "two levels of quality". The high quality is used on desktop/developing computer to generate "compile-time" words/sentences. The low quality version should be embeddable in the device. In this way, some gaps in the quality can be heard, but I think the overall result would be good (at least, the engine uses the same voice).

I have already seen the flite project and I'm studying it. It seems there's an Italian version too. Maybe this could be a good starting point.

Thank you.

pozz 11 years ago

On 09/04/2015 01:18, Don Y wrote:

Perfect description of my situation :-)

Because it isn't difficult to record a good voice with a microphone at a computer, if you aren't a "voice talent". Usually you'd like to hear a female voice, but I am male. I have to engage a good female voice and try to record something and fix the result. Maybe after some months I notice that is better to say "Dear user" and not "Hello user" and I have to call the woman again, but maybe she isn't available anymore.

If the result is good enough, this could be a good approach. Anyway it's a pity to use a low-quality embedded TTS engine to pronounce 99% of the sentences, when it is needed only for the 1%.

Don Y 11 years ago

My point was:

- use a human being instead of a synthetic voice

- hey, why not use the CUSTOMER?? To record the canned portions *and* the "filled in blanks"! I.e., just let the customer record the *entire* message -- canned and "filled in blanks"

For example, I have a device, here, that is basically a portable barcode scanner with "audio output". A user scans a barcode (e.g., on a can of corn niblets), the device looks up the barcode (UPC label) in an internal database and then speaks the identification associated with that label: "Corn niblets, Green Giant (brand), 12 oz" using a synthetic voice selected by the user.

But, there are occasions where the scanned label is not present in the database. For these, the user can *record* their own "annotation" which will then be tied to that particular barcode label: "My favorite black sweater" Thereafter, whenever that same label is encountered, the device replays the user's annotation (in *their* voice). This is far more convenient than having the user *type* a formal description of the item (which the speech synthesizer could then speak).

Exactly.

But, there are ways you can work-around this.

As I mentioned (elsewhere this thread), much of the complexity of a TTS lies in the text-to-sound algorithms. I.e., knowing when "read" is to be pronounced as "red" vs. "reed"; knowing that strings of digits of the form ###-#### are telephone numbers (in which case, each digit should be spoken individually with a pause inserted for the '-') while XXXX is likely a *year* (esp if a month name is noted "nearby" and/or the value encoded is "reasonably current"); adding prosodic features; etc.
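The digit-string cases can be sketched as a small classifier. The ###-#### phone pattern is the one described above; the 1900-2099 window is only a stand-in for "reasonably current", no month-name context is checked, and the function names are made up for illustration.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def classify_number(token):
    """Guess how a digit string should be read (toy heuristics only)."""
    if re.fullmatch(r"\d{3}-\d{4}", token):
        return "phone"
    if re.fullmatch(r"\d{4}", token) and 1900 <= int(token) <= 2099:
        return "year"          # assumed window for "reasonably current"
    if re.fullmatch(r"\d+", token):
        return "cardinal"
    return "other"

def speak_phone(token):
    """Speak each digit individually, with a pause (comma) at the '-'."""
    left, right = token.split("-")
    spell = lambda s: " ".join(DIGIT_WORDS[int(d)] for d in s)
    return spell(left) + ", " + spell(right)

print(classify_number("555-1212"), "->", speak_phone("555-1212"))
print(classify_number("2015"), classify_number("34"), classify_number("x342"))
```

Even this toy version shows where the code size goes: every new category needs its own pattern, its own context checks and its own way of being spoken.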

It might not be unreasonable to "require" the user to determine how things are pronounced (as discussed in past message). This eliminates the need for much of the code that bloats OTS TTS implementations. The most difficult part of listening to synthetic speech is dealing with incorrect pronunciations. Unlike *print*, it's hard for most people to "rewind" their memories of what they just heard -- especially while the device *continues* to speak! (our "aural" memories are much too short; we remember speech only *after* recognizing the individual words! So, if you are stumped by an unexpected mispronunciation, you have to rely on your memory of the raw *sounds*)

The other big issue listening to synthetic speech is prosody and cadence. My comments re: pauses and punctuation can allow the user to artificially create a better sounding sentence (by injecting pauses "for best effect"). Operating in a pure monotone is acceptable for infrequent exposure -- you wouldn't want to listen to such speech "all day long" (your ears literally get "tired" in much the same way that your eyes tire after a long day of reading print).

ISTR that you can use markup languages with flite/festival, etc. If so, you may want to try *deliberately* creating some input text that forces the synthesizer into a "monotone mode" (i.e. deliberately remove all inflection). Then, try replacing the voices with different technologies: diphone, mbrola, formant, etc. and see how you like the intelligibility of the result. WITH AN EYE TOWARDS SYNTHESIZING THE ENTIRE MESSAGE(S).

Again, the difference won't be in the "characteristics" of the voice. But, rather, in the quality of the *pronunciation*. I.e., the sounds that the synthesizer is directed to utter based on the analysis of the input text.

You've not indicated your resource budget. You might give some of the diphone voices a listen and see how "natural" you think they are -- esp when configured in a "monotone" mode. You can then select a "real" person as the model for the voice you choose (incl yourself!).

Part of the problem with formant synthesizers (much lower resource requirements) is sorting out how to tweak the multitude of *parameters* to get a voice that sounds the way you'd like it to sound. With a diphone synthesizer, you just find someone whose voice you *like*! :>

Flite is big -- despite its claim to being small! There are lots of other "open" synthesizers out there to poke at. Many years ago, there was a crude "say.com" for PC's that was cheap and dirty in its implementation. You can also find many implementations of the Klatt synthesizer (but this doesn't include the text-to-sound algorithms).

You might also be able to find commercial demos that you could evaluate to get a feel for how *good* it can be (which gives you a yardstick against which to evaluate your particular implementation). I think DECTalk was sold to Fonix many years ago. From there, it may have moved to Sensimetrics? (google would be your friend, here)

It costs nothing to play with existing implementations (even COTS) and get a feel for what that technology has to offer.

pozz 11 years ago

I can't use this approach. The gadget is an interactive voice response (IVR), so the sentences that it should say are:

- Press 1 to change settings

- Press 2 to read status

- Press 3 to read firmware version

- ...

Those kinds of sentences can't be recorded by the user.

Don Y 11 years ago

Why not? Why can't your "setup" directions lead the user through this? Or, why can't you store "factory default messages" for these and the *other* messages that you described (e.g., the air conditioner) and then let the user change all/none as he sees fit? Or, *only* allow him to change the "air conditioner"-type messages? "Hello User, your air conditioner in location #1 has just switched off." For some folks, this may be acceptable (they just have to remember the identity of "location 1") [But, then again, it's been argued that people don't have to "customize things"...]

You have no problem mixing two different *types* (quality, sources, etc.) of speech *synthesis*; why not let the user decide if he wants to retain some messages in the "factory voice" while changing those that require customization?

Don Y 11 years ago

Because it *is*? Or, *isn't*? I think users charged with this "responsibility" would be willing to accept whatever quality in which they are willing to invest. I.e., some folks will make one pass at this while others will refine their recordings. People have no problem recording outgoing messages on answering machines, etc.

Understandable -- *if* you insist on generating the speech at the factory.

Note that speech synthesis doesn't always give you a choice as to the actual characteristics of the voice you'll employ: "pick from among these 12 voices", etc. You have far more flexibility in selecting a specific voice if you interview "voice talent".

Creating speech from text generally boils down to:

- text normalization
  * "expanding" abbreviations ("Mr" -> "Mister"; "etc" -> "etcetera"; etc.)
  * spoken punctuation ('&' -> "and"; '%' -> "percent"; etc.)
  * encoding punctuation (',' -> phrase boundary; '.' -> sentence; etc.)
  * handling numerics
    = decimal numerals (1234; 41.09; .9; 0.75; 1,000,000; 000.3; 3.000)
    = ordinals ("1st"; "2nd"; "3rd"; "n-th"; etc.)
    = time/date (note cultural differences, here!)
    = Roman numerals ("Henry VII"; "Tom Smith II"; etc.)
    = non-decimal radices ("0xDeadBeef"; "027"; "16rFF"; etc.)

- word decomposition (stripping affixes to determine the root word as an aid to pronunciation: "flies" -> "fly"+'s'; contrast "pennies" -> "penny"+'s')

- letter-to-sound mapping

- stress assignment ("Berlin" -> "BUR lin" vs. "bur LIN")

- prosody (F0 contours at phrase and sentence level)

[I may have missed a step or two... :> Too early in the morning!]
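To make the first step concrete, here is a minimal sketch of table-driven text normalization in C. The table contents, the function names (expand_token, normalize) and the fixed 32-byte token limit are all assumptions made for illustration; a real front end would need a far larger table plus the numeric and context handling listed above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical expansion table; entries are taken from the examples above. */
struct expansion { const char *from; const char *to; };

static const struct expansion table[] = {
    { "Mr",  "Mister"   },
    { "etc", "etcetera" },
    { "&",   "and"      },
    { "%",   "percent"  },
};

/* Return the spoken form of one token, or the token itself if unknown. */
const char *expand_token(const char *tok)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(tok, table[i].from) == 0)
            return table[i].to;
    }
    return tok;
}

/*
 * Normalize a space-separated string into out (capacity outsz).
 * Returns 0 on success, -1 if a token or the output does not fit.
 */
int normalize(const char *in, char *out, size_t outsz)
{
    char tok[32];
    size_t used = 0;

    if (outsz == 0)
        return -1;
    out[0] = '\0';

    while (*in != '\0') {
        while (*in == ' ')              /* skip separators */
            in++;
        size_t n = 0;
        while (in[n] != '\0' && in[n] != ' ')
            n++;
        if (n == 0)                     /* only trailing spaces were left */
            break;
        if (n >= sizeof tok)
            return -1;
        memcpy(tok, in, n);
        tok[n] = '\0';
        in += n;

        const char *w = expand_token(tok);
        size_t wl = strlen(w);
        size_t need = wl + (used ? 1 : 0);   /* word plus joining space */
        if (used + need + 1 > outsz)
            return -1;
        if (used)
            out[used++] = ' ';
        memcpy(out + used, w, wl);
        used += wl;
        out[used] = '\0';
    }
    return 0;
}
```

Calling normalize("Mr Smith & sons", buf, sizeof buf) leaves "Mister Smith and sons" in buf. Note that this only works on tokens already separated by spaces; splitting "75%" into "75" and "%" is part of the work the sketch leaves out.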

And, of course, any application-specific additions ("4K7 resistor"). This latter is often where "the bear" wins (how do you speak seemingly unrelated orthography?? e.g., diagnostics emitted by programs)

The bulk of the code in a *good* TTS deals with the first two ('-') items. And, it is also where the bulk of the screwups eventually occur! Getting to a point where you can *reliably* figure out what sounds should be imposed on the "text" requires knowledge of what is actually being *said*! "Dr. Jones lives on Jones Dr." (Doctor Jones lives on Jones Drive) "Nurse, please start an IV on Henry" (Nurse, please start an I V on Henry)
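A toy illustration of why this needs context: the C sketch below picks an expansion for "Dr." by peeking at what follows it. The helper name expand_dr and the heuristic itself (a capitalized word after "Dr." means a title) are assumptions for illustration only; the heuristic is easy to break (e.g., "Jones Dr. Apartment 3"), which is rather the point.

```c
#include <ctype.h>

/*
 * Hypothetical helper: 'after' points at the text that follows a "Dr." token.
 * Assumed heuristic: a title precedes a capitalized name, so "Dr." followed
 * by a capitalized word is read as "Doctor"; anything else (end of text,
 * lowercase word, digit) is read as "Drive".
 */
const char *expand_dr(const char *after)
{
    while (*after == ' ')       /* skip spaces between "Dr." and the next word */
        after++;
    if (isupper((unsigned char)*after))
        return "Doctor";
    return "Drive";
}
```

For "Dr. Jones lives on Jones Dr." the first "Dr." is followed by " Jones..." and becomes "Doctor", while the last one is followed by nothing and becomes "Drive".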

Once the text has been "disambiguated", mapping letters to sound is considerably *less* problematic (though still challenging).

Stress assignment and other prosodic features largely affect the naturalness of the speech. But, in their absence, a "motivated listener" can still discern *intent*. Especially if the listener is familiar with the context.

For the messages over which *I* (and, by analogy, *you*) have control, I choose representations with which I know *my* synthesizer will perform well. As the text from which the speech is synthesized is largely algorithmically generated

    printf("The current volume level is %d %%.", volume);
    printf("Your MAC is %02x:%02x:%02x:%02x:%02x:%02x.", ...);

I can bias the algorithms in the "front end" of the TTS to exploit that (i.e., it is far *less* likely for me to encounter "numbers" of the form "1,234,567" than "1234567")
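When the numbers come from your own code like this, the "handling numerics" part of the front end can be very small. A minimal sketch in C, assuming plain unsigned values in the range 0..999 with no separators or decimal points; the function name number_to_words and the wording ("three hundred forty two", no "and") are choices made for this example, not anything a particular synthesizer requires.

```c
#include <stdio.h>

static const char *const ones[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen"
};

static const char *const tens[] = {
    "", "", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety"
};

/*
 * Spell n (0..999) in English words into out (capacity outsz).
 * Returns 0 on success, -1 if n is out of range or out is too small.
 */
int number_to_words(unsigned n, char *out, size_t outsz)
{
    int len;

    if (n > 999)
        return -1;

    if (n >= 100) {
        unsigned rest = n % 100;
        if (rest == 0) {
            len = snprintf(out, outsz, "%s hundred", ones[n / 100]);
        } else {
            char tail[32];      /* longest tail is "seventy seven" */
            if (number_to_words(rest, tail, sizeof tail) != 0)
                return -1;
            len = snprintf(out, outsz, "%s hundred %s", ones[n / 100], tail);
        }
    } else if (n >= 20) {
        if (n % 10 == 0)
            len = snprintf(out, outsz, "%s", tens[n / 10]);
        else
            len = snprintf(out, outsz, "%s %s", tens[n / 10], ones[n % 10]);
    } else {
        len = snprintf(out, outsz, "%s", ones[n]);
    }

    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (len < 0 || (size_t)len >= outsz) ? -1 : 0;
}
```

With this, the volume message can be built from words the synthesizer already handles: convert the value with number_to_words(volume, buf, sizeof buf) and then format "The current volume level is %s percent." with buf.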

OTOH, there is the potential for some text (from external sources) that may not have been chosen with TTS -- esp *my* TTS! -- in mind: "521 google.com does not accept mail" Yet, I still need to be able to convey their content to the user, unaltered.

In *your* case, you can similarly choose messages that "convert" effectively. And, can fold much of the text normalization (etc.) into your compile-time actions. E.g., "M A C" instead of "MAC" (so you don't have to be able to recognize that abbreviation -- yet still cause it to be "spoken" properly)

Likewise, you can "game" the pronunciation phase of the algorithm by deliberately misspelling the desired text with foreknowledge of how *your* algorithm operates. E.g., Americans pronounce many "ise" words as "ize" while Brits treat it as a softer 's'; "Susanne" becomes "Suzanne"; etc.

I contend that motivated users could easily make the same sorts of adjustments in how they define *their* message content. Thus, greatly simplifying your effort/algorithms.

Finally, if spoken messages are short and/or infrequent, you can probably omit the stress assignment and prosodic features and just speak in a monotone. Or, impose only the crudest processing in this regard.

You then end up with a single approach to speaking *everything* -- instead of trying to marry two different implementations/technologies.

As I said, before: play with some OTS (commercial or otherwise) synthesizers and see what they sound like "crippled". The Reading Machine had a *dreadful* synthesizer (Votrax 6.3) yet folks would learn to listen to (i.e. "tolerate") it for hours at a time as it was the only game in town! :-/

Vote
