Yes, I've looked at it. And Festival, Flite, etc. (I've probably got damn near every bit of "available" TTS code, here, as I did my research before undertaking this effort.) I'm not sure if your comment here is wrt *my* needs or in contrast with DECtalk so I'll comment on both.
eSpeak is pretty big and tries to be too general for *my* needs. E.g., I have no desire to support other languages. I don't need support for a markup language. Etc. But, I need something *really* small, inexpensive to implement (it has to reside *in* a BT earpiece ALONGSIDE the BT stack, etc.) and "low power" -- can't burn lots of CPU cycles trying to "make noise". Think tens of KB, not hundreds of KB; KIPS, not MIPS (or FLOPS!!)!
As damn near all TTS's fall flat *somewhere* -- there's just way too much variation in the text that it can conceivably encounter (how do you pronounce "Heztdgserye"? Or, "NnnnbgTT"? Or "W3.CNN.com"? What do you do *if* you encounter one of these??) -- I am carefully picking where and how *I* will "fall flat"... with an eye towards performing in a manner that a user can expect with some consistency in the domain/application that I'm addressing. E.g., I don't really expect I will ever need to say "The aardvark chased the heffalump down the boojum tree in northern Phoenix!"
I'm more likely going to be saying:

  Volume level: XX%
  High frequency emphasis: medium
  Speaking rate: 100 wpm
  Remaining battery life: HH:MM (not MM:SS!)
  Firmware version: 123.5A2
  Date: Sunday, 12 July 2015
  Current time: 12:34:56
  Connected to: MyHost
  Service unavailable
  Password incorrect
  Signal strengths: XX, YY, ZZ
I can't afford to set aside large, fixed, conversion buffers on the off chance that I might encounter a really *long* "word". OTOH, there's nothing I can do to prevent "external sources" from passing arbitrary text to me that I must render into intelligible speech! So, if I encounter something of that sort, I cheat -- in a way that users can rely upon! :>
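A sketch of one such predictable fallback policy, in C. This is purely illustrative -- the length cap and the no-vowel test are my assumptions, not necessarily the actual "cheat" used: anything too long for the (small, fixed) conversion buffers, or with no vowel at all, gets spelled out letter by letter instead of being fed to letter-to-sound conversion.

```c
#include <string.h>

/* Illustrative fallback: decide whether a "word" should be spelled
 * letter-by-letter rather than pronounced.  The cap (15) and the
 * no-vowel heuristic are assumptions for this sketch only. */
#define MAX_WORD 15   /* bound on conversion buffers */

int must_spell(const char *word)
{
    size_t n = strlen(word);
    int vowels = 0;

    if (n > MAX_WORD)            /* too long for a fixed buffer */
        return 1;
    for (size_t i = 0; i < n; i++)
        if (strchr("aeiouyAEIOUY", word[i]))
            vowels++;
    return vowels == 0;          /* "NnnnbgTT" has no vowel: spell it */
}
```

The point isn't the particular heuristic -- it's that the user can *predict* what the device will do with garbage input.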
I also need to allow a user to adjust a voice without having to constrain him/her to "one of N choices". E.g., some users may have high frequency losses and favor more baritone voices; some may desire rapid speech; etc. A more "parametric" synthesizer is appropriate.
As regarding the comparison to DECtalk, DECtalk was written more to model how people approach pronunciation. I.e., we don't "store" a giant dictionary of every word/pronunciation (as is common with newer synthesizers). Nor do we have hundreds of little "rules" regarding how particular graphemes are pronounced in specific contexts (I suspect most folks just use a few rules like "qu" --> "kw", etc.).
The TTS problem is basically:
- text normalization

    Dr. Smith --> Doctor Smith
    Smith Dr. --> Smith Drive
    Henry VIII --> Henry the eighth
    2 boxes --> two boxes
    box 2 --> box (number) two
    1st --> first
    23 --> twenty three
    1,234 --> one thousand two hundred and thirty four
    1234 --> twelve hundred thirty four
    $3.17 --> three dollars and seventeen cents
    TV --> tee vee
    AM --> ayem
    KB --> kilobyte(s)
    192.168.1.1 --> one ninety two dot one sixty eight dot one dot one
    23.92 --> twenty three point ninety two
    11:45 --> eleven forty five

  You can put a lot of ad hoc processing here! Think about how many special cases you routinely encounter and effortlessly compensate for! Or, you can treat any non-text as "" (not allowed).
- grapheme-to-phoneme conversion: map letters into sounds, handle homographs based on PoS tagging, etc.
- stress assignment and syllabification
- prosody: to avoid the monotone characteristic of inhuman speech
- waveform generation: make the sounds!
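As a minimal sketch of the normalization step above, here's a number-to-words routine in C covering just the 0..99 cases (e.g., 23 --> twenty three). The function name and scope are my own illustration; a real normalizer layers on ordinals, currency, dates, and so on:

```c
#include <stdio.h>
#include <string.h>

/* Spell 0..99 into words, e.g. 23 --> "twenty three".
 * Illustrative sketch of one tiny corner of text normalization. */
static const char *ones[] = { "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve",
    "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
    "eighteen", "nineteen" };
static const char *tens[] = { "", "", "twenty", "thirty", "forty",
    "fifty", "sixty", "seventy", "eighty", "ninety" };

void number_to_words(int n, char *out, size_t len)
{
    if (n < 20)
        snprintf(out, len, "%s", ones[n]);
    else if (n % 10 == 0)
        snprintf(out, len, "%s", tens[n / 10]);
    else
        snprintf(out, len, "%s %s", tens[n / 10], ones[n % 10]);
}
```

Note even this trivial case embodies a policy decision (e.g., "twenty three" vs. "three and twenty") -- multiply that by every special case above.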
The real meat and potatoes of most TTS's is in the grapheme-to-phoneme conversion. You can decide not to handle a wide variety of input forms to simplify the text normalization phase. Or, tailor it to a specific application domain. You can botch the stress assignment and still yield an intelligible output (how many folks say PO-lice instead of poLICE? INsurance instead of inSURance? etc.) Prosody has a primary impact on long exposures to the speech. "Your ears get tired" when you listen to speech with bad or missing prosodic cues -- but you can still understand what is being said! And, waveform generation is just number crunching.
Some TTS's employ large dictionaries containing the words expected along with their pronunciations. Often, augmented with part-of-speech information to disambiguate among homographs (polish as a noun vs. polish as a verb vs. polish as an adjective).
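Such a dictionary entry might carry one pronunciation per part of speech. A minimal sketch in C -- the phone strings are ARPAbet-style and purely illustrative, as are the entries themselves:

```c
#include <string.h>

/* Dictionary entry with per-part-of-speech pronunciations, to
 * disambiguate homographs.  All entries are illustrative. */
enum pos { POS_NOUN, POS_VERB, POS_ADJ };

struct entry {
    const char *word;
    const char *phones[3];   /* indexed by enum pos */
};

static const struct entry dict[] = {
    /* polish (noun/verb) vs. Polish (adjective) */
    { "polish", { "P AA L IH SH", "P AA L IH SH", "P OW L IH SH" } },
    /* REcord (noun) vs. reCORD (verb) */
    { "record", { "R EH K ER D",  "R IH K AO R D", "R EH K ER D"  } },
};

/* Look up a word's pronunciation for a given part of speech. */
const char *pronounce(const char *word, enum pos p)
{
    for (size_t i = 0; i < sizeof dict / sizeof dict[0]; i++)
        if (strcmp(dict[i].word, word) == 0)
            return dict[i].phones[p];
    return NULL;   /* not in dictionary: fall back to rules */
}
```

The PoS tag itself has to come from somewhere upstream (a tagger, or application knowledge) -- the dictionary only stores the alternatives.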
But, you can't have an infinite dictionary (when faced with dealing with unconstrained input).
Most TTS's rely on a suite of letter-to-sound rules to map graphemes to phonemes. These are typically of the form:

  left_context character(s) right_context --> sounds

So, for example, "qu" at the start of a word, followed by almost anything, is pronounced as "K W". On the other hand, when encountered at other places (e.g., unique), it has a different sound mapping!
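A minimal matcher for rules of that form might look like this in C. Here '#' stands for a word boundary and "" for "don't care"; the rule table, names, and the crudeness of the context test are all illustrative (real rulesets use much richer context classes):

```c
#include <string.h>

/* A letter-to-sound rule: left_context graphemes right_context --> phones.
 * Contexts here are deliberately crude: "#" = word boundary, "" = any. */
struct lts_rule {
    const char *left;       /* "#" = must be at word start */
    const char *graphemes;  /* literal characters to match */
    const char *right;      /* "#" = must be at word end   */
    const char *phones;     /* output sounds               */
};

static const struct lts_rule rules[] = {
    { "#", "qu", "", "K W" },   /* quick, queen */
    { "",  "qu", "", "K"   },   /* unique       */
};

/* Try each rule in order at position *pos; first match wins, which
 * implicitly gives the rules a priority.  Returns the phones and
 * advances *pos past the matched graphemes, or NULL if none apply. */
const char *lts_match(const char *word, size_t *pos)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        const struct lts_rule *r = &rules[i];
        size_t n = strlen(r->graphemes);

        if (strncmp(word + *pos, r->graphemes, n) != 0)
            continue;
        if (r->left[0] == '#' && *pos != 0)
            continue;
        if (r->right[0] == '#' && word[*pos + n] != '\0')
            continue;
        *pos += n;
        return r->phones;
    }
    return NULL;
}
```

Note how the "qu at word start" rule shadows the general "qu" rule purely by coming first -- exactly the implicit-priority behavior described below for the NRL ruleset.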
There are variations on these. E.g., the (infamous) NRL ruleset allowed only a single "character" to be matched; characters were examined left to right and rules were tried in a fixed order (which implicitly imposes a priority on them!).
McIlroy's rules allowed the "input" to be rewritten as a convenience to the algorithm. So, "quick" could be rewritten as "kwick" and then reprocessed as if that was the original input.
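A sketch of that rewrite idea in C -- the rewrite table is illustrative, not McIlroy's actual rules: run a rewrite pass over the word, then hand the result back to the normal letter-to-sound machinery as if it were the original input.

```c
#include <string.h>

/* McIlroy-style rewriting, sketched: replace troublesome spellings
 * with "phonetic" ones before the main conversion.  Table is
 * illustrative only. */
struct rewrite { const char *from, *to; };

static const struct rewrite rewrites[] = {
    { "qu", "kw" },   /* quick --> kwick */
    { "ph", "f"  },   /* phone --> fone  */
};

/* Apply rewrites left to right, copying unmatched characters through. */
void rewrite_word(const char *in, char *out, size_t len)
{
    size_t o = 0;

    while (*in && o + 1 < len) {
        size_t i, n = 0;

        for (i = 0; i < sizeof rewrites / sizeof rewrites[0]; i++) {
            n = strlen(rewrites[i].from);
            if (strncmp(in, rewrites[i].from, n) == 0)
                break;
        }
        if (i < sizeof rewrites / sizeof rewrites[0]) {
            size_t t = strlen(rewrites[i].to);
            if (o + t >= len)
                break;              /* output buffer full */
            memcpy(out + o, rewrites[i].to, t);
            o += t;
            in += n;
        } else {
            out[o++] = *in++;       /* no rewrite: copy through */
        }
    }
    out[o] = '\0';
}
```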
DECtalk takes a hybrid approach that more closely mimics how we are taught to "sound things out". E.g., it first peels off affixes (prefix/suffix) that just clutter up the input string. Trailing y's that have been converted to "i"/"ie" in plurals are restored (as the trailing 's' is removed). Then, it tries to break the "root" word into its constituent morphs.
E.g., "denominations" is treated as:

  prefix  de
  root    nomin
  suffix  ate   (can be preceded by an adjectival, verbal or nominal suffix)
  suffix  ion   (forms nouns, verbs or adjectives; drives the suffix that follows!)
  suffix  s     (word-final position; forms nouns or verbs; can be preceded by an adjectival, verbal or nominal suffix)

So, the problem boils down to figuring out how to pronounce "nomin". A "morph dictionary" is consulted for the pronunciation (I think the dictionary has ~10,000 morphs). If *not* present, *then* context-specific rules are sequentially applied -- but just to the "root" ("nomin") as we already know how to pronounce the affixes!!
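A heavily simplified sketch of that decomposition in C: strip known affixes until the remainder is found in a (tiny!) morph dictionary. Everything here is illustrative, and it glosses over the e-elision that turns "ate" + "ion" into "ation" -- so "denominates" decomposes cleanly, while "denominations" would need that extra spelling adjustment.

```c
#include <string.h>

/* Illustrative affix lists and morph dictionary. */
static const char *prefixes[] = { "de", "re", "un" };
static const char *suffixes[] = { "s", "ion", "ate", "ing" };

struct morph { const char *spelling, *phones; };
static const struct morph morphs[] = {
    { "nomin", "N AA M IH N" },
    { "fly",   "F L AY"      },
};

static const char *morph_lookup(const char *s, size_t n)
{
    for (size_t i = 0; i < sizeof morphs / sizeof morphs[0]; i++)
        if (strlen(morphs[i].spelling) == n &&
            strncmp(morphs[i].spelling, s, n) == 0)
            return morphs[i].phones;
    return NULL;
}

/* Peel suffixes (then prefixes) until the remaining root is in the
 * dictionary.  Returns its phones, or NULL -- in which case a real
 * system falls back to letter-to-sound rules on the root. */
const char *find_root(const char *word)
{
    char buf[32];
    size_t start = 0, n = strlen(word);

    if (n >= sizeof buf)
        return NULL;
    memcpy(buf, word, n + 1);

    for (;;) {
        const char *p = morph_lookup(buf + start, n - start);
        size_t i;
        int stripped = 0;

        if (p)
            return p;
        /* peel a suffix off the right... */
        for (i = 0; i < sizeof suffixes / sizeof suffixes[0]; i++) {
            size_t sn = strlen(suffixes[i]);
            if (n - start > sn && strcmp(buf + n - sn, suffixes[i]) == 0) {
                n -= sn;
                buf[n] = '\0';
                stripped = 1;
                break;
            }
        }
        /* ...or a prefix off the left */
        if (!stripped)
            for (i = 0; i < sizeof prefixes / sizeof prefixes[0]; i++) {
                size_t pn = strlen(prefixes[i]);
                if (n - start > pn &&
                    strncmp(buf + start, prefixes[i], pn) == 0) {
                    start += pn;
                    stripped = 1;
                    break;
                }
            }
        if (!stripped)
            return NULL;
    }
}
```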
[One way to trip up many TTS systems is to provide words that have two roots. E.g., "houseflies" -- note the middle 'e' is silent (house-flies, not hous-eh-flies).]

Another difference in the DECtalk rule system is that it allows substrings of characters to be matched/replaced in each rule. E.g., "chem" --> "K EH M" instead of having to deal with this as "ch" "e" "m".
Also, the DECtalk rules don't blindly proceed through the input string left-to-right. Instead, the *consonants* are processed first. This provides a framework to better resolve the *vowels* that they bracket! (Recall, vowel graphemes can have multiple meanings!) Additionally, the rules allow the input *text* to be examined along with the already converted "sounds"! So, instead of a "qu" rule to handle "quick", a "KW" rule can be applied to it in its intermediate form (i.e., after the "qu" has been converted).
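A toy two-pass sketch of that consonants-first idea, with just enough rules for the "magic e" distinction between "bit" and "bite". The rules, phone names, and the one-letter-per-phone structure are all illustrative; the point is that the vowel pass gets to consult phones that are already assigned, not just letters:

```c
#include <string.h>
#include <ctype.h>

/* Two-pass conversion sketch: consonants first, then vowels resolved
 * against the phones that now bracket them.  Toy rules only. */
void to_phones(const char *word, char *out, size_t len)
{
    const char *phone[16] = { 0 };
    size_t n = strlen(word);

    if (n > 16)
        n = 16;

    /* pass 1: consonants (table is illustrative and incomplete) */
    for (size_t i = 0; i < n; i++)
        switch (tolower((unsigned char)word[i])) {
        case 'b': phone[i] = "B"; break;
        case 't': phone[i] = "T"; break;
        case 'k': phone[i] = "K"; break;
        }

    /* pass 2: vowels, consulting already-assigned phones */
    for (size_t i = 0; i < n; i++) {
        char c = tolower((unsigned char)word[i]);

        if (c == 'i') {
            /* "magic e": i + one consonant *phone* + word-final 'e' */
            if (i + 2 == n - 1 && phone[i + 1] != NULL &&
                tolower((unsigned char)word[n - 1]) == 'e')
                phone[i] = "AY";
            else
                phone[i] = "IH";
        } else if (c == 'e' && i == n - 1 && phone[i] == NULL) {
            phone[i] = "";   /* final silent e: no sound */
        }
        /* other vowels left unresolved in this toy */
    }

    /* join the assigned phones with spaces */
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (!phone[i] || !phone[i][0])
            continue;
        if (out[0])
            strncat(out, " ", len - strlen(out) - 1);
        strncat(out, phone[i], len - strlen(out) - 1);
    }
}
```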
By contrast, most rule-based (non-dictionary) TTS's would walk the string of characters from left to right converting characters into sounds based solely on the characters that *surround* the character under examination.
So, DECtalk bridged the gap between the naive, pure rule-based approaches of things like the NRL ruleset and the heavyweight dictionary approaches common nowadays. This was a necessary consequence of the fact that resources weren't as abundant "back then" (recall, this is 35-year-old technology!).
Given how resources have changed over the years, the "compute intensive, modest dictionary" approach is even more viable, nowadays!
There used to be a PC utility called SAY.exe.