Yes, I've looked at it. And Festival, Flite, etc. (I've probably got damn near every bit of "available" TTS code, here, as I did my research before undertaking this effort.) I'm not sure if your comment here is wrt *my* needs or in contrast with DECtalk so I'll comment on both.
eSpeak is pretty big and tries to be too general for *my* needs. E.g., I have no desire to support other languages. I don't need support for a markup language. Etc. But, I need something *really* small, inexpensive to implement (it has to reside *in* a BT earpiece ALONGSIDE the BT stack, etc.) and "low power" -- can't burn lots of CPU cycles trying to "make noise". Think tens of KB, not hundreds of KB; KIPS, not MIPS (or FLOPS!!)!
As damn near all TTS's fall flat *somewhere* -- there's just way too much variation in the text that it can conceivably encounter (how do you pronounce "Heztdgserye"? Or, "NnnnbgTT"? Or "W3.CNN.com"? What do you do *if* you encounter one of these??) -- I am carefully picking where and how *I* will "fall flat"... with an eye towards performing in a manner that a user can expect with some consistency in the domain/application that I'm addressing. E.g., I don't really expect I will ever need to say "The aardvark chased the heffalump down the boojum tree in northern Phoenix!"
I'm more likely going to be saying:

  Volume level: XX%
  High frequency emphasis: medium
  Speaking rate: 100 wpm
  Remaining battery life: HH:MM (not MM:SS!)
  Firmware version: 123.5A2
  Date: Sunday, 12 July 2015
  Current time: 12:34:56
  Connected to: MyHost
  Service unavailable
  Password incorrect
  Signal strengths: XX, YY, ZZ
I can't afford to set aside large, fixed, conversion buffers on the off chance that I might encounter a really *long* "word". OTOH, there's nothing I can do to prevent "external sources" from passing arbitrary text to me that I must render into intelligible speech! So, if I encounter something of that sort, I cheat -- in a way that users can rely upon! :>
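A sketch of one such predictable fallback policy, in C. This is purely illustrative -- the length cap and the no-vowel test are my assumptions, not necessarily the actual "cheat" used: anything too long for the (small, fixed) conversion buffers, or with no vowel at all, gets spelled out letter by letter instead of being fed to letter-to-sound conversion.

```c
#include <string.h>

/* Illustrative fallback: decide whether a "word" should be spelled
 * letter-by-letter rather than pronounced.  The cap (15) and the
 * no-vowel heuristic are assumptions for this sketch only. */
#define MAX_WORD 15   /* bound on conversion buffers */

int must_spell(const char *word)
{
    size_t n = strlen(word);
    int vowels = 0;

    if (n > MAX_WORD)            /* too long for a fixed buffer */
        return 1;
    for (size_t i = 0; i < n; i++)
        if (strchr("aeiouyAEIOUY", word[i]))
            vowels++;
    return vowels == 0;          /* "NnnnbgTT" has no vowel: spell it */
}
```

The point isn't the particular heuristic -- it's that the user can *predict* what the device will do with garbage input.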
I also need to allow a user to adjust a voice without having to constrain him/her to "one of N choices". E.g., some users may have high frequency losses and favor more baritone voices; some may desire rapid speech; etc. A more "parametric" synthesizer is appropriate.
As regarding the comparison to DECtalk, DECtalk was written more to model how people approach pronunciation. I.e., we don't "store" a giant dictionary of every word/pronunciation (as is common with newer synthesizers). Nor do we have hundreds of little "rules" regarding how particular graphemes are pronounced in specific contexts (I suspect most folks just use a few rules like "qu" --> "kw", etc.).
The TTS problem is basically:
- text normalization

    Dr. Smith --> Doctor Smith
    Smith Dr. --> Smith Drive
    Henry VIII --> Henry the eighth
    2 boxes --> two boxes
    box 2 --> box (number) two
    1st --> first
    23 --> twenty three
    1,234 --> one thousand two hundred and thirty four
    1234 --> twelve hundred thirty four
    $3.17 --> three dollars and seventeen cents
    TV --> tee vee
    AM --> ayem
    KB --> kilobyte(s)
    192.168.1.1 --> one ninety two dot one sixty eight dot one dot one
    23.92 --> twenty three point ninety two
    11:45 --> eleven forty five

  You can put a lot of ad hoc processing here! Think about how many special cases you routinely encounter and effortlessly compensate for! Or, you can treat any non-text as "" (not allowed).
- grapheme-to-phoneme conversion: map letters into sounds, handle homographs based on PoS tagging, etc.
- stress assignment and syllabification
- prosody: to avoid the monotone characteristic of inhuman speech
- waveform generation: make the sounds!
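As a minimal sketch of the normalization step above, here's a number-to-words routine in C covering just the 0..99 cases (e.g., 23 --> twenty three). The function name and scope are my own illustration; a real normalizer layers on ordinals, currency, dates, and so on:

```c
#include <stdio.h>
#include <string.h>

/* Spell 0..99 into words, e.g. 23 --> "twenty three".
 * Illustrative sketch of one tiny corner of text normalization. */
static const char *ones[] = { "zero", "one", "two", "three", "four",
    "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve",
    "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
    "eighteen", "nineteen" };
static const char *tens[] = { "", "", "twenty", "thirty", "forty",
    "fifty", "sixty", "seventy", "eighty", "ninety" };

void number_to_words(int n, char *out, size_t len)
{
    if (n < 20)
        snprintf(out, len, "%s", ones[n]);
    else if (n % 10 == 0)
        snprintf(out, len, "%s", tens[n / 10]);
    else
        snprintf(out, len, "%s %s", tens[n / 10], ones[n % 10]);
}
```

Note even this trivial case embodies a policy decision (e.g., "twenty three" vs. "three and twenty") -- multiply that by every special case above.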
The real meat and potatoes of most TTS's is in the grapheme-to-phoneme conversion. You can decide not to handle a wide variety of input forms to simplify the text normalization phase. Or, tailor it to a specific application domain. You can botch the stress assignment and still yield an intelligible output (how many folks say PO-lice instead of poLICE? INsurance instead of inSURance? etc.) Prosody has a primary impact on long exposures to the speech. "Your ears get tired" when you listen to speech with bad or missing prosodic cues -- but you can still understand what is being said! And, waveform generation is just number crunching.
Some TTS's employ large dictionaries containing the words expected along with their pronunciations. Often, augmented with part-of-speech information to disambiguate among homographs (polish as a noun vs. polish as a verb vs. polish as an adjective).
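Such a dictionary entry might carry one pronunciation per part of speech. A minimal sketch in C -- the phone strings are ARPAbet-style and purely illustrative, as are the entries themselves:

```c
#include <string.h>

/* Dictionary entry with per-part-of-speech pronunciations, to
 * disambiguate homographs.  All entries are illustrative. */
enum pos { POS_NOUN, POS_VERB, POS_ADJ };

struct entry {
    const char *word;
    const char *phones[3];   /* indexed by enum pos */
};

static const struct entry dict[] = {
    /* polish (noun/verb) vs. Polish (adjective) */
    { "polish", { "P AA L IH SH", "P AA L IH SH", "P OW L IH SH" } },
    /* REcord (noun) vs. reCORD (verb) */
    { "record", { "R EH K ER D",  "R IH K AO R D", "R EH K ER D"  } },
};

/* Look up a word's pronunciation for a given part of speech. */
const char *pronounce(const char *word, enum pos p)
{
    for (size_t i = 0; i < sizeof dict / sizeof dict[0]; i++)
        if (strcmp(dict[i].word, word) == 0)
            return dict[i].phones[p];
    return NULL;   /* not in dictionary: fall back to rules */
}
```

The PoS tag itself has to come from somewhere upstream (a tagger, or application knowledge) -- the dictionary only stores the alternatives.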
But, you can't have an infinite dictionary (when faced with dealing with unconstrained input).
Most TTS's rely on a suite of letter-to-sound rules to map graphemes to phonemes. These are typically of the form:

  left_context character(s) right_context --> sounds

So, for example, "qu" at the start of a word, followed by almost anything, is pronounced as "K W". On the other hand, when encountered at other places (e.g., unique), it has a different sound mapping!
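A minimal matcher for rules of that form might look like this in C. Here '#' stands for a word boundary and "" for "don't care"; the rule table, names, and the crudeness of the context test are all illustrative (real rulesets use much richer context classes):

```c
#include <string.h>

/* A letter-to-sound rule: left_context graphemes right_context --> phones.
 * Contexts here are deliberately crude: "#" = word boundary, "" = any. */
struct lts_rule {
    const char *left;       /* "#" = must be at word start */
    const char *graphemes;  /* literal characters to match */
    const char *right;      /* "#" = must be at word end   */
    const char *phones;     /* output sounds               */
};

static const struct lts_rule rules[] = {
    { "#", "qu", "", "K W" },   /* quick, queen */
    { "",  "qu", "", "K"   },   /* unique       */
};

/* Try each rule in order at position *pos; first match wins, which
 * implicitly gives the rules a priority.  Returns the phones and
 * advances *pos past the matched graphemes, or NULL if none apply. */
const char *lts_match(const char *word, size_t *pos)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        const struct lts_rule *r = &rules[i];
        size_t n = strlen(r->graphemes);

        if (strncmp(word + *pos, r->graphemes, n) != 0)
            continue;
        if (r->left[0] == '#' && *pos != 0)
            continue;
        if (r->right[0] == '#' && word[*pos + n] != '\0')
            continue;
        *pos += n;
        return r->phones;
    }
    return NULL;
}
```

Note how the "qu at word start" rule shadows the general "qu" rule purely by coming first -- exactly the implicit-priority behavior described below for the NRL ruleset.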
There are variations on these. E.g., the (infamous) NRL ruleset allowed only a single "character" to be matched; characters were examined left to right and rules were tried in a fixed order (which implicitly imposes a priority on them!).
McIlroy's rules allowed the "input" to be rewritten as a convenience to the algorithm. So, "quick" could be rewritten as "kwick" and then reprocessed as if that was the original input.
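A sketch of that rewrite idea in C -- the rewrite table is illustrative, not McIlroy's actual rules: run a rewrite pass over the word, then hand the result back to the normal letter-to-sound machinery as if it were the original input.

```c
#include <string.h>

/* McIlroy-style rewriting, sketched: replace troublesome spellings
 * with "phonetic" ones before the main conversion.  Table is
 * illustrative only. */
struct rewrite { const char *from, *to; };

static const struct rewrite rewrites[] = {
    { "qu", "kw" },   /* quick --> kwick */
    { "ph", "f"  },   /* phone --> fone  */
};

/* Apply rewrites left to right, copying unmatched characters through. */
void rewrite_word(const char *in, char *out, size_t len)
{
    size_t o = 0;

    while (*in && o + 1 < len) {
        size_t i, n = 0;

        for (i = 0; i < sizeof rewrites / sizeof rewrites[0]; i++) {
            n = strlen(rewrites[i].from);
            if (strncmp(in, rewrites[i].from, n) == 0)
                break;
        }
        if (i < sizeof rewrites / sizeof rewrites[0]) {
            size_t t = strlen(rewrites[i].to);
            if (o + t >= len)
                break;              /* output buffer full */
            memcpy(out + o, rewrites[i].to, t);
            o += t;
            in += n;
        } else {
            out[o++] = *in++;       /* no rewrite: copy through */
        }
    }
    out[o] = '\0';
}
```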
DECtalk takes a hybrid approach that more closely mimics how we are taught to "sound things out". E.g., it first peels off affixes (prefix/suffix) that just clutter up the input string. Trailing y's that have been converted to "i"/"ie" in plurals are restored (as the trailing 's' is removed). Then, it tries to break the "root" word into its constituent morphs.
E.g., "denominations" is treated as:

  prefix  de
  root    nomin
  suffix  ate   (can be preceded by an adjectival, verbal or nominal suffix)
  suffix  ion   (forms nouns, verbs or adjectives; drives the suffix that follows!)
  suffix  s     (word-final position; forms nouns or verbs; can be preceded by an adjectival, verbal or nominal suffix)

So, the problem boils down to figuring out how to pronounce "nomin". A "morph dictionary" is consulted for the pronunciation (I think the dictionary has ~10,000 morphs). If *not* present, *then* context-specific rules are sequentially applied -- but just to the "root" ("nomin") as we already know how to pronounce the affixes!!
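A heavily simplified sketch of that decomposition in C: strip known affixes until the remainder is found in a (tiny!) morph dictionary. Everything here is illustrative, and it glosses over the e-elision that turns "ate" + "ion" into "ation" -- so "denominates" decomposes cleanly, while "denominations" would need that extra spelling adjustment.

```c
#include <string.h>

/* Illustrative affix lists and morph dictionary. */
static const char *prefixes[] = { "de", "re", "un" };
static const char *suffixes[] = { "s", "ion", "ate", "ing" };

struct morph { const char *spelling, *phones; };
static const struct morph morphs[] = {
    { "nomin", "N AA M IH N" },
    { "fly",   "F L AY"      },
};

static const char *morph_lookup(const char *s, size_t n)
{
    for (size_t i = 0; i < sizeof morphs / sizeof morphs[0]; i++)
        if (strlen(morphs[i].spelling) == n &&
            strncmp(morphs[i].spelling, s, n) == 0)
            return morphs[i].phones;
    return NULL;
}

/* Peel suffixes (then prefixes) until the remaining root is in the
 * dictionary.  Returns its phones, or NULL -- in which case a real
 * system falls back to letter-to-sound rules on the root. */
const char *find_root(const char *word)
{
    char buf[32];
    size_t start = 0, n = strlen(word);

    if (n >= sizeof buf)
        return NULL;
    memcpy(buf, word, n + 1);

    for (;;) {
        const char *p = morph_lookup(buf + start, n - start);
        size_t i;
        int stripped = 0;

        if (p)
            return p;
        /* peel a suffix off the right... */
        for (i = 0; i < sizeof suffixes / sizeof suffixes[0]; i++) {
            size_t sn = strlen(suffixes[i]);
            if (n - start > sn && strcmp(buf + n - sn, suffixes[i]) == 0) {
                n -= sn;
                buf[n] = '\0';
                stripped = 1;
                break;
            }
        }
        /* ...or a prefix off the left */
        if (!stripped)
            for (i = 0; i < sizeof prefixes / sizeof prefixes[0]; i++) {
                size_t pn = strlen(prefixes[i]);
                if (n - start > pn &&
                    strncmp(buf + start, prefixes[i], pn) == 0) {
                    start += pn;
                    stripped = 1;
                    break;
                }
            }
        if (!stripped)
            return NULL;
    }
}
```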
[One way to trip up many TTS systems is to provide words that have two roots. E.g., "houseflies" -- note the middle 'e' is silent (house-flies, not hous-eh-flies).]

Another difference in the DECtalk rule system is that it allows substrings of characters to be matched/replaced in each rule. E.g., "chem" --> "K EH M" instead of having to deal with this as "ch" "e" "m".
Also, the DECtalk rules don't blindly proceed through the input string left-to-right. Instead, the *consonants* are processed first. This provides a framework to better resolve the *vowels* that they bracket! (Recall, vowel graphemes can have multiple meanings!) Additionally, the rules allow the input *text* to be examined along with the already converted "sounds"! So, instead of a "qu" rule to handle "quick", a "KW" rule can be applied to it in its intermediate form (i.e., after the "qu" has been converted).
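A toy two-pass sketch of that consonants-first idea, with just enough rules for the "magic e" distinction between "bit" and "bite". The rules, phone names, and the one-letter-per-phone structure are all illustrative; the point is that the vowel pass gets to consult phones that are already assigned, not just letters:

```c
#include <string.h>
#include <ctype.h>

/* Two-pass conversion sketch: consonants first, then vowels resolved
 * against the phones that now bracket them.  Toy rules only. */
void to_phones(const char *word, char *out, size_t len)
{
    const char *phone[16] = { 0 };
    size_t n = strlen(word);

    if (n > 16)
        n = 16;

    /* pass 1: consonants (table is illustrative and incomplete) */
    for (size_t i = 0; i < n; i++)
        switch (tolower((unsigned char)word[i])) {
        case 'b': phone[i] = "B"; break;
        case 't': phone[i] = "T"; break;
        case 'k': phone[i] = "K"; break;
        }

    /* pass 2: vowels, consulting already-assigned phones */
    for (size_t i = 0; i < n; i++) {
        char c = tolower((unsigned char)word[i]);

        if (c == 'i') {
            /* "magic e": i + one consonant *phone* + word-final 'e' */
            if (i + 2 == n - 1 && phone[i + 1] != NULL &&
                tolower((unsigned char)word[n - 1]) == 'e')
                phone[i] = "AY";
            else
                phone[i] = "IH";
        } else if (c == 'e' && i == n - 1 && phone[i] == NULL) {
            phone[i] = "";   /* final silent e: no sound */
        }
        /* other vowels left unresolved in this toy */
    }

    /* join the assigned phones with spaces */
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (!phone[i] || !phone[i][0])
            continue;
        if (out[0])
            strncat(out, " ", len - strlen(out) - 1);
        strncat(out, phone[i], len - strlen(out) - 1);
    }
}
```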
By contrast, most rule-based (non-dictionary) TTS's would walk the string of characters from left to right converting characters into sounds based solely on the characters that *surround* the character under examination.
So, DECtalk bridged the gap between the naive, pure rule-based approaches of things like the NRL ruleset and the heavyweight dictionary approaches common nowadays. This was a necessary consequence of the fact that resources weren't as abundant "back then" (recall, this is 35-year-old technology!).
Given how resources have changed over the years, the "compute intensive, modest dictionary" approach is even more viable, nowadays!
There used to be a PC utility called SAY.exe.