semi OT: linear predictive coding

Does anyone know where I could find some example code that implements linear predictive speech synthesis? The particular language doesn't really matter, so long as the example is not using a ton of opaque "library functions" that one can't see the workings of. This isn't for any business application - just education and personal interest.

Reply to
bitrex

AFAICT LPC is for analysis, not synthesis.

--
umop apisdn
Reply to
Jasen Betts

I had forgotten about comp.dsp, I think I'll ask this question there instead.

Reply to
bitrex

There's an old IEEE subroutine collection book that has that, I think. (I have it somewhere.) So do more recent versions of Numerical Recipes, I believe.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs 
Principal Consultant 
ElectroOptical Innovations LLC 
Optics, Electro-optics, Photonics, Analog Electronics 

160 North State Road #203 
Briarcliff Manor NY 10510 

hobbs at electrooptical dot net 
http://electrooptical.net
Reply to
Phil Hobbs

I think the latest descendant of that is commonly available on BSD-derived Unixes (including OS X) as the "say" command. I'm sure you can find the source code. Understanding it might be an issue, though; as Don said, it's a deep rabbit hole.

Reply to
Clifford Heath

Marvelous book... also heavy enough to hold things together while gluing ;-) ...Jim Thompson

--
| James E.Thompson                                 |    mens     | 
| Analog Innovations                               |     et      | 
| Analog/Mixed-Signal ASIC's and Discrete Systems  |    manus    | 
| San Tan Valley, AZ 85142     Skype: skypeanalog  |             | 
| Voice:(480)460-2350  Fax: Available upon request |  Brass Rat  | 
| E-mail Icon at http://www.analog-innovations.com |    1962     | 
              
I love to cook with wine.     Sometimes I even put it in the food.
Reply to
Jim Thompson

Don, have you looked at:

formatting link
Just something I found while searching - I know little about this subject.

It seems I was wrong about OSX "say". It's just a front-end to Apple's speech synthesis APIs. The old Berkeley work I referred to was by Mozer, see for a start.

Also, this article says that MAME has a decoder for Mozer-compressed speech:

Clifford Heath.

Reply to
Clifford Heath

Your subject is slightly out of sync with your post.

Are you looking for a *coder* or a *decoder*? (or, both??)

I.e., are you trying to encode speech into an LPC data stream? Or, take an encoded data stream and decode the represented speech?

Reply to
Don Y

In essence, write some kind of crude text-to-speech thing.

Reply to
bitrex

I think you misunderstand the role that the LPC decoder plays in speech synthesis. :-/

The decoder can be thought of as exactly that: a decoder. It decodes a signal that has already been ENcoded. Sort of like playing a WAV file of a sound/melody that someone else previously RECORDED in that format.

Typically, you would use a Vocoder to encode speech and, later, play it back (with approximately the same fidelity).

You can, thus, use a LPC decoder to play back *sounds* that can be pieced together to form "speech". E.g., play back previously recorded sentences to create dialogs, phrases to create sentences, words to create phrases -- or even FORMANTS to create words!
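To make the decoder's job concrete, here is a toy sketch (not any particular chip's algorithm): LPC synthesis is just an all-pole IIR filter driven by an excitation signal, a pulse train for voiced sounds or noise for unvoiced ones. All names and the single "formant" resonance below are made up for illustration.

```python
import math
import random

def lpc_synthesize(coeffs, gain, excitation):
    """All-pole synthesis: y[n] = gain*e[n] - sum_k coeffs[k]*y[n-1-k]."""
    y = [0.0] * len(excitation)
    for n in range(len(excitation)):
        acc = gain * excitation[n]
        for k in range(len(coeffs)):
            if n - 1 - k >= 0:
                acc -= coeffs[k] * y[n - 1 - k]
        y[n] = acc
    return y

def pulse_train(n_samples, period):
    """Voiced excitation: one impulse every `period` samples (the pitch)."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def noise(n_samples, seed=0):
    """Unvoiced excitation: white noise (for fricatives like 's', 'f')."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

# A single resonance near 1 kHz at 8 kHz sampling, as a toy "formant":
fs, f0, r = 8000.0, 1000.0, 0.95
a1 = -2.0 * r * math.cos(2.0 * math.pi * f0 / fs)
a2 = r * r
voiced = lpc_synthesize([a1, a2], 1.0, pulse_train(800, period=80))
```

A real decoder would update the coefficients, gain, and pitch every frame (every 10-25 ms or so) from the encoded stream; this sketch holds them fixed.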

The relative naturalness of each approach varies. E.g., piecing together sentences to form dialogs tends to be more natural sounding (Hello. / How is the weather? / Today is Tuesday. / The bee is a busy bee. etc.) than piecing together words to form sentences ("How / is / the / weather / ?" contrast with "The / weather / is / good / ." -- try pronouncing the words in the second instance *identically* with the pronunciation that you employ in the first sentence -- the inflection and prosody are "off").

For example, it is relatively easy to generate speech to deliver messages like: "The account number you entered was" "four" "four" "three" "two" "." "If this is correct, press one." A lot harder to generate (natural sounding) speech that treats each of the above words as standalone units, pieced together in the hope of sounding like a sentence!

[Imagine trying to piece together component *sounds* (formants) to form words -- where the inflection varies for each sound based on its position in a particular word, etc.] *If* you have an inventory of formants/words/phrases/sentences/etc. and know the model order used (plus data format), then you can begin to reconstruct speech by piecing together these units (a crude form of concatenative synthesis).

But, to get to this point, there is a fair bit of effort to convert "text" (graphemes) into "sounds" (phonemes/words/phrases/etc). Nowadays, this is usually done with large dictionaries and part-of-speech tagging. English is especially annoying: it is hard to come up with pronunciation rules that apply with any degree of consistency.
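A toy sketch of the dictionary-plus-fallback idea (the lexicon, phoneme symbols, and letter-to-sound table below are all made up; real systems use large dictionaries like CMUdict plus part-of-speech tagging):

```python
# Grapheme-to-phoneme sketch: exact dictionary lookup first, with a
# naive letter-to-sound fallback for out-of-vocabulary words.
LEXICON = {
    "the": ["DH", "AH"],
    "weather": ["W", "EH", "DH", "ER"],
    "is": ["IH", "Z"],
    "good": ["G", "UH", "D"],
}
# Deliberately crude one-letter-one-sound fallback (real English is not
# this regular -- that is exactly the problem being described above):
FALLBACK = {"a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH"}

def to_phonemes(word):
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return [FALLBACK.get(ch, ch.upper()) for ch in word if ch.isalpha()]
```

The fallback is where English bites you: "ough" in "through", "rough", and "bough" defeats any per-letter table, which is why the dictionaries got so large.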

And, before you even get to sorting out what words *sound* like, you have to figure out what they *are*: Dr. Smith's polish housekeeper was polishing the furniture in his home on Smith Dr. while he was reading the book he had read the day prior.

As a starting point, something like: may give you a better feel for the issues you face.

Reply to
Don Y

Thank you for this information!

The idea I had was to recreate something like a software version of the IC that ran the Texas Instruments "Speak And Spell" toy - I guess it pieced together formants to form words. It debuted in 1978, so it doesn't seem like it should be terribly difficult to make some kind of replica using modern hardware, no? I'm just not sure what to read to get started.

Reply to
bitrex

As far as I recall, however, the TI chip couldn't speak arbitrary words - it had a limited corpus of words it worked with. I believe some models of the toy were expandable with ROM cartridges to increase the number of words/features.

So I guess for my toy program, I would need to write some software to analyze example speech to generate appropriate coefficients for the decoder to work with? That could be interesting.
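The standard way to get those coefficients is the autocorrelation method plus the Levinson-Durbin recursion, applied per frame. A self-contained sketch (function names are mine; the round-trip at the bottom generates a known two-pole signal and checks that analysis recovers the filter):

```python
import random

def autocorr(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for the predictor
    coefficients a[1..order] of A(z) = 1 + a1*z^-1 + a2*z^-2 + ..."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                     # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)               # residual prediction error
    return a[1:], err

# Round-trip check: synthesize from a known two-pole filter (one toy
# "formant"), then see if analysis recovers its coefficients.
true_a = [-1.3435, 0.9025]   # poles at radius 0.95, ~1 kHz @ 8 kHz
rng = random.Random(1)
x = [0.0, 0.0]
for _ in range(4000):
    e = rng.uniform(-1.0, 1.0)
    x.append(e - true_a[0] * x[-1] - true_a[1] * x[-2])
coeffs, err = levinson_durbin(autocorr(x, 2), 2)
```

Real analyzers do this on windowed 10-25 ms frames with model orders around 10, plus separate pitch and voiced/unvoiced detection, which is a whole additional can of worms.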

Reply to
bitrex

If they were to build it today, they'd probably just use brute force and program a small micro with MP3s of the sounds.

Reply to
krw

The S'n'S has "vocabulary ROMs" that encode whatever the device was required to say. Think of this as a digital record player: "The" "pig" "says" "oink" "The" "cow" "says" "moo" etc. These could just as easily have been embellished to be things like: "The extraordinarily large" "pig" "emphatically says" "oink" "The extraordinarily large" "cow" "emphatically says" "moo" Said another way, you can probably *record* each of these sentences and compare the waveforms for the corresponding words on a 'scope and see they are the exact same utterances!

The actual "voice" probably belongs to a genuine human being, somewhere (perhaps slightly stylized).

The device sounds relatively good because the things it is called upon to speak are pretty much the same thing, with substitutions. This is often the case with limited-domain dialog systems (like most automated voice systems for banks, airlines, etc.).

The real pisser is trying to address arbitrary groups of letters arranged into (what we *hope* are) words; and, those words grouped to form (we hope) sentences.

How would a TTS be expected to speak: "The yellow" "Blue running" "xksdfpou" "bitrex"

It is an incredibly interesting problem! I have been writing different versions of "low resource" TTS's for my current project. While they are intended to work in *somewhat* limited domains (i.e., a fixed set of things that they will be called upon to say), they have to also accommodate some unconstrained input. So, need to at least be able to handle unusual input in a somewhat rational manner: "Service unavailable. Contact Dr. B. Smith @ 555-1212 x234 8A-5P" "Your IP: 129.34.56.78 is blacklisted. 0xFFE08845"

Even pronouncing numbers becomes an interesting problem! Consider: "23 dogs", "dog 23", "5:00AM", "12/31/1999", "555-1212", "2,015", "2015", "1231432524534".
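A toy illustration of why this is context-dependent (the rules and word tables below are deliberately minimal, nothing like a production text normalizer): the same digits are read as a cardinal in "23 dogs" but digit-by-digit in "555-1212".

```python
# Sketch of number normalization for TTS: same digit string, different
# readings depending on the shape/context of the token.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen",
        "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
        "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def cardinal(n):
    """Read 0..99 as a cardinal number ('23' -> 'twenty-three')."""
    if n < 20:
        return ONES[n]
    t, o = divmod(n, 10)
    return TENS[t] + ("-" + ONES[o] if o else "")

def digits(s):
    """Read digit-by-digit, phone-number style ('555' -> 'five five five')."""
    return " ".join(ONES[int(c)] for c in s if c.isdigit())

def normalize(token):
    # Toy rules: phone-number-shaped tokens go digit-by-digit; small
    # bare integers are read as cardinals; anything else passes through.
    if "-" in token and token.replace("-", "").isdigit():
        return digits(token)
    if token.isdigit() and int(token) < 100:
        return cardinal(int(token))
    return token
```

Real normalizers also need dates, times, currency, ordinals ("dog 23" is arguably "dog twenty-three" *or* "the twenty-third dog"), and long digit strings, and they need context to choose among them.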

Notice the differences in the "wh" sound in "what", "who", "where", "which", "when", etc. Subtleties in pronunciation in context ("the end" vs "the beginning"). etc. Once you start down the rabbit hole, be prepared for all sorts of unusual revelations! ;-)

If you are genuinely interested in the problem, chase down a copy of _From Text to Speech: The MITalk System_ (Allen, Hunnicutt, Klatt, et al.). It's a dated text but one that gives a fairly comprehensive summary of the issues as they pertain to a genuine implementation.

[The synthesizer mentioned therein later became DECtalk. You can find DECtalk demo implementations online.]

Tidbit: the default voice in the synthesizer was largely modeled after Dennis Klatt's own speech (only natural, as he did the implementation!)

Reply to
Don Y

You may be able to find versions of the Klatt synthesizer (the mechanism that converts phoneme codes to waveforms) but I've yet to find a comprehensive DECtalk implementation available FOSS.

Squeak tries to implement some of the algorithms in the synthesizer (prosody). But, I've not found an implementation that includes the morph dictionary, PoS coding, etc. that was present in DECtalk/MITalk.

Newer TTS's tend to rely on *big* datasets for their work. The DECtalk (PC) demo executable was < 1MB. And, included lots of cruft that allowed a savvy user to implement a "speech markup language" in the text input stream. E.g., change voices, speech rates, forced inflection, etc. You could even make it "sing" (ick).

By far, the most interesting point is what you learn about how much you take for granted in language processing! How much stuff goes on without your conscious awareness in everyday speaking!

Why don't we spell "love" as "lof" (e.g. pronounce "of" to see!)?

I recall reading the TeX references (books) and a point was made (Knuth?) that once you start thinking about typefaces, you never look at a grapheme with the ignorance/naiveté that you *used* to! The same is true when you start looking into "how we speak".

Reply to
Don Y

These are called "limited domain" (or limited vocabulary) implementations. It's relatively easy to make something speak "a little", well. E.g., you can just *record* your favorite (professional) "voice talent" speaking those phrases/sentences/etc. and play them back, later.

Yes -- if you want to be able to change what is spoken.

The other approach is to "record" individual phonemes and then paste those together. E.g., record an 'f' sound, a 'p' sound, an 'n' sound, various vowel sounds (note vowels tend to have multiple sounds associated with a single "grapheme/letter"). Then, "concatenate" those sounds to make words.

In practice, this tends to sound like crap. The transitions between sounds (phonemes) don't tolerate abrupt changes well. So, you either have to massage the sounds as you synthesize them -- in anticipation of the sound that will follow -- or store different *versions* of each sound!

In the latter case, it's been observed that you can much more "harmoniously" patch together the *first* half of a particular 'a' (i.e., the 'a' that follows 'b' in "back") with the *last* half of a similar 'a' (i.e., the 'a' that precedes the 'd' in "bad"). So, instead of storing a few dozen "pure" phonemes, you store hundreds/thousands of "diphones" and piece them together in the *middle* of sounds.

E.g., instead of 'b' 'a' 'd' sounds, you would use "-b" "ba" "ad" "d-" (assume the '-' represents silence).

Note that now you need more diphones for the "-c" and "ca" sounds in order to say "cad"; ditto "-f" and "fa" for "fad", etc. Hence the number of "units" (diphones) in your "inventory" grows quickly!

One BIG advantage of this approach, however, is that you can more readily make a "voice" that sounds very much like *you* (or any other person of your choosing)! You "simply" (RoTFLMFAO!) have that "voice talent" record a lengthy passage (30-40 min?) that contains most of these "sound-pairings" and then algorithmically slice the sound samples into their component diphones (the algorithm knows what was *said* so it knows which sounds are "ba", "ad", etc.!)
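The mechanics of the joining step can be sketched quite simply: pull units out of an inventory and crossfade them over a few samples at each junction, rather than butting them together. The "diphones" below are fake constant-valued buffers just to show the bookkeeping; a real inventory holds actual audio sliced from recordings.

```python
# Diphone concatenation sketch: join units in the middle of a sound
# with a short linear crossfade at each junction.

def crossfade_concat(units, fade=32):
    """Concatenate sample lists, crossfading `fade` samples per join."""
    out = list(units[0])
    for unit in units[1:]:
        n = min(fade, len(out), len(unit))
        base = len(out) - n
        for i in range(n):
            w = (i + 1.0) / (n + 1.0)        # ramp toward the new unit
            out[base + i] = (1.0 - w) * out[base + i] + w * unit[i]
        out.extend(unit[n:])
    return out

# Fake inventory: each "diphone" is a placeholder buffer, not audio.
inventory = {"-b": [0.1] * 100, "ba": [0.2] * 100,
             "ad": [0.3] * 100, "d-": [0.0] * 100}
bad = crossfade_concat([inventory[d] for d in ["-b", "ba", "ad", "d-"]])
```

Production systems do far more at the join (pitch-synchronous alignment, PSOLA-style pitch/duration modification), but the unit-selection-and-splice structure is the same.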

(sigh) As I said, the rabbit hole is pretty deep. You can make a *career* out of playing with this stuff and probably NEVER come up with a "measurably correct" implementation! [How do you decide if your implementation is "correct"? "Good enough"? :> ]

Reply to
Don Y

Well, sure, but that's no fun!

Reply to
bitrex

"Solving" a forty year old problem the same old way isn't a lot of fun either, but whatever floats your boat.

Reply to
krw

The open-source Codec 2 might have what you need

formatting link

Reply to
David Eather

I think you might be looking for this:

formatting link

SpeakJet (or is it JetSpeak? I forget) will say anything, but with a slightly Chinese accent

Reply to
David Eather
