Hi,
I have a few synthesizers that I've created (to explore different approaches, resource requirements, complexity, quality, etc.). The application is a *semi* limited domain; i.e., I know *many* of the things it will be asked to say -- but not all!
I'm trying to sort out a means of grading their respective "quality". The easy things I can already measure: how much text (code space) they occupy, how long they take to process a given string, how much RAM they require, etc.
But the tough part is deciding how *well* they speak. (E.g., I can make NOISES with very little text, RAM, and MIPS... but you probably wouldn't consider those *noises* to be SPEECH! :> )
Initially, I'm just looking at the text-to-sound/phoneme portion of the synthesizer -- i.e., skipping text normalization, prosody, etc. Those are separate issues and those algorithms can be layered onto any of the text-to-sound algorithms.
One OBVIOUS way of evaluating "quality" is just to feed them words and see if they map those *graphemes* into the proper *phonemes*! To that end, there are some pronouncing dictionaries that I can use: feed each word to a TTS and compare the resulting phonemes to those that the dictionary claims are "correct" (assume a pronunciation that matches that of any legitimate heteronym is "correct").
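FWIW, the scoring loop I have in mind looks something like this (Python sketch; g2p() is a stand-in for whichever synthesizer's grapheme-to-phoneme stage is under test, and the dictionary is assumed to be CMUdict-style: one "WORD  P1 P2 ..." per line, heteronym variants listed as "WORD(2)", "WORD(3)", etc.):

from collections import defaultdict

def load_prons(path):
    # Load a CMUdict-style file; fold "WORD(2)" variants under one key
    prons = defaultdict(list)
    with open(path) as f:
        for line in f:
            if line.startswith(';;;') or not line.strip():
                continue                    # skip comments/blank lines
            word, phones = line.split(None, 1)
            word = word.split('(')[0]
            prons[word].append(tuple(phones.split()))
    return prons

def edit_distance(a, b):
    # Levenshtein distance over phoneme sequences (one-row DP)
    row = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, pb in enumerate(b, 1):
            prev, row[j] = row[j], min(row[j] + 1,          # deletion
                                       row[j - 1] + 1,      # insertion
                                       prev + (pa != pb))   # sub/match
    return row[-1]

def score(g2p, prons):
    # Return (word accuracy, phoneme error rate); a match against
    # *any* listed heteronym counts as correct
    exact = errs = ref_phones = 0
    for word, variants in prons.items():
        guess = tuple(g2p(word))
        best, ref_len = min((edit_distance(guess, v), len(v))
                            for v in variants)
        exact += (best == 0)
        errs += best
        ref_phones += ref_len
    return exact / len(prons), errs / ref_phones

Word accuracy alone is harsh -- a one-phoneme slip scores the same as complete garbage -- so the phoneme error rate gives partial credit for near-misses.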
But, even there, how can I score dictionary content without considering the likelihood of encountering it? E.g., if TTS#1 gets "boat" correct but "syzygy" wrong, is that the same level of performance as TTS#2 getting "syzygy" *right* but "boat" wrong?!
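My inclination is to weight each word by its probability of occurrence, so "boat" counts for more than "syzygy". Roughly (reusing load_prons() from above; freq maps word -> count from whatever corpus best models the domain -- picking that corpus is itself an assumption):

def weighted_accuracy(g2p, prons, freq):
    # Word accuracy weighted by corpus frequency; words absent from
    # the corpus get a count of 1 so the tail isn't ignored entirely
    total = correct = 0
    for word, variants in prons.items():
        w = freq.get(word, 1)
        total += w
        if tuple(g2p(word)) in variants:
            correct += w
    return correct / total

Under that weighting, TTS#1 handily outscores TTS#2 -- which matches my gut feeling about which one I'd rather ship.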
Stepping back a bit further, how do I *specify* the desired performance a priori given the unconstrained nature of the potential inputs?
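The only half-answer I've come up with: model the inputs as Zipf-distributed, and then the spec decomposes into a coverage term plus a fallback term -- "the top N words (X% of running text) must be verified against the dictionary; the letter-to-sound rules must hit Y% on the rest." A back-of-the-envelope sketch (all the numbers are placeholders, not claims):

def zipf_mass(top_n, vocab=100_000, s=1.0):
    # Fraction of *tokens* accounted for by the top_n most frequent
    # words under a Zipf(s) model over a vocab-word vocabulary
    harmonic = lambda n: sum(1.0 / k ** s for k in range(1, n + 1))
    return harmonic(top_n) / harmonic(vocab)

def expected_accuracy(top_n, known_acc=0.95, tail_acc=0.80):
    # Expected per-word accuracy over running text: verified words
    # plus a letter-to-sound fallback on the unseen tail
    m = zipf_mass(top_n)
    return m * known_acc + (1 - m) * tail_acc

With those placeholder figures, expected_accuracy(10_000) comes out around 0.92. That turns "unconstrained inputs" into two testable quantities: dictionary coverage (measurable against a domain corpus) and tail accuracy (measurable against the rare end of the pronouncing dictionary).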
Thx,
--don