TTS "specification"


I have a few synthesizers that I've created (to explore different approaches, resource requirements, complexity, quality, etc.). The application is a *semi* limited domain; i.e., I know *many* of the things it will be asked to say -- but, not all!

I'm trying to sort out a means of grading their respective "quality". I.e., I can easily measure how much text they occupy, how long they take to process a given string, how much RAM they require, etc.

But, the tough part is trying to decide how "well they speak". (e.g., I can make NOISES with very small text, RAM, MIPS... but, you probably wouldn't consider those *noises* to be SPEECH! :> )

Initially, I'm just looking at the text-to-sound/phoneme portion of the synthesizer. E.g., skipping text normalization, prosody, etc. Those are separate issues and those algorithms can be layered onto any of the text-to-sound algorithms.

One OBVIOUS way of evaluating "quality" is just to feed them words and see if they map those *graphemes* into the proper *phonemes*! To that end, there are some pronouncing dictionaries that I can use: feed each word to a TTS and compare the resulting phonemes to those that the dictionary claims are "correct" (assume a pronunciation that matches that of any legitimate heteronym is "correct").
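A minimal sketch of that dictionary comparison, in Python. The `toy_g2p` function and the lexicon contents are placeholders standing in for a real synthesizer front end and a real pronouncing dictionary (e.g., CMUdict-style entries); the key point is that a word counts as correct if it matches *any* listed pronunciation, which handles the heteronym case:

```python
# Sketch: score a grapheme-to-phoneme (G2P) function against a
# pronouncing dictionary. `toy_g2p` and the lexicon are placeholders,
# not any real TTS or dictionary.

def score_g2p(g2p, lexicon):
    """Fraction of words whose G2P output matches *any* listed
    pronunciation (so a legitimate heteronym reading is 'correct')."""
    correct = 0
    for word, pronunciations in lexicon.items():
        if tuple(g2p(word)) in {tuple(p) for p in pronunciations}:
            correct += 1
    return correct / len(lexicon)

# Toy lexicon in ARPAbet-like symbols; "read" is a heteronym.
lexicon = {
    "boat": [["B", "OW", "T"]],
    "read": [["R", "IY", "D"], ["R", "EH", "D"]],
}

def toy_g2p(word):  # stand-in for a real synthesizer front end
    return {"boat": ["B", "OW", "T"], "read": ["R", "EH", "D"]}[word]

print(score_g2p(toy_g2p, lexicon))  # → 1.0 ("read" matches its 2nd entry)
```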

But, even there, how can I score dictionary content without considering the likelihood of encountering it? E.g., if TTS#1 gets "boat" correct but "syzygy" wrong, is that the same level of performance as TTS#2 getting "syzygy" *right* but "boat" wrong?!
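One way to capture that asymmetry is to weight each word's result by how often it is likely to be encountered. The counts below are invented for illustration; in practice they would come from counting tokens in a corpus representative of the application's actual (or anticipated) input:

```python
# Sketch: frequency-weighted accuracy. Getting "boat" right matters
# far more than "syzygy" if "boat" appears 5000x more often.
# All counts here are made-up illustration values.

def weighted_accuracy(results, freqs):
    """results: word -> True/False (pronounced correctly)
    freqs:   word -> occurrence count in a representative corpus."""
    total = sum(freqs.values())
    return sum(freqs[w] for w, ok in results.items() if ok) / total

freqs = {"boat": 5000, "syzygy": 1}        # invented counts
tts1 = {"boat": True,  "syzygy": False}    # gets the common word
tts2 = {"boat": False, "syzygy": True}     # gets the rare word

print(weighted_accuracy(tts1, freqs))  # → 0.9998...
print(weighted_accuracy(tts2, freqs))  # → 0.0001999...
```

Under this metric TTS#1 and TTS#2 score very differently even though each got exactly one word wrong, which matches the intuition in the question.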

Stepping back a bit further, how do I *specify* the desired performance a priori given the unconstrained nature of the potential inputs?
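One possible way to pin that down (a sketch, not a claim about what the spec *should* be): state the requirement as measurable thresholds over a sample drawn from the anticipated input distribution -- e.g., a frequency-weighted mean phoneme error rate, plus a worst-case per-word bound so rare words can't be arbitrarily mangled. The threshold numbers and function names below are illustrative assumptions:

```python
# Sketch: turning "desired performance" into a testable spec:
#  (a) a frequency-weighted mean phoneme error rate (PER) over a
#      sample from the anticipated input distribution, and
#  (b) a worst-case bound so no single word is unintelligibly wrong.
# The 0.05 / 0.5 limits are illustrative, not recommendations.

def phoneme_error_rate(hyp, ref):
    """Levenshtein edit distance between phoneme lists, / ref length."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]))
    return d[m][n] / n

def meets_spec(per_by_word, weights, mean_limit=0.05, worst_limit=0.5):
    total = sum(weights.values())
    mean = sum(per_by_word[w] * weights[w] for w in per_by_word) / total
    worst = max(per_by_word.values())
    return mean <= mean_limit and worst <= worst_limit

# Dropping the final "T" of "boat" is 1 edit out of 3 phonemes:
print(phoneme_error_rate(["B", "OW"], ["B", "OW", "T"]))  # → 0.333...
```

Unlike exact-match scoring, PER gives partial credit for "close" pronunciations, which may matter when comparing synthesizers that are all imperfect in different places.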



Don Y
