Resolving typos in research pubs

Hi,

Sorry for the poor choice of subject line. :(

I have a few different speech synthesizers that I've been refining for a product. All are intended to be *very* lightweight (minimal run-time resources). Voice quality, pronunciation, etc. isn't essential (but don't want to deliberately hamper performance).

As I can operate them in semi-limited domains (since I *tend* to be the source of the text they utter), I've opted to adopt some of the classic approaches to the text-to-phoneme portion of the algorithms instead of trying to begin a study in linguistics, etc. (even bloated synthesizers have problems with speech so why waste effort trying to *approach* their performance levels with 1% of *their* resources?!)

Most of the work I'm adopting is decades old. Current trends rely on having *lots* of resources available (big dictionaries, MIPS, etc.) so they've all gone off in a different direction. So, contacting original authors is a dubious proposition ("Hey, do you remember that work you did 30 years ago? I've got some niggly little detail that I need help resolving. Off the top of your head...")

[I've had *some* success -- thx DM!]

Many of the documents are N-th generation photocopies, fiche, etc. So, lots of artifacts in them ("Is that a speck of lint or a backslash?") But, I've been able to resolve many of the "unintended additions" with a bit of careful examination of the details involved. The worst cases are the long lists of (hundreds of) rules -- any of which might be corrupted by a speck of paper lint, a crease in the original when it was photocopied, etc.

But, there are some things that simply can't be attributed to copying errors. I.e., cases where glyphs are obviously *missing*. And, others where something is present in a legend -- yet never occurs elsewhere! Still other ambiguities exist (Is this instance of "YL" to be interpreted as the legend symbol "YL"? Or, as the legend symbol "Y" followed by the legend symbol "L"? And, what is the *effective* difference??)

[This sort of crap happens when people aren't careful preparing docs. And, when they don't (or can't?) "cut and paste" from the ACTUAL SOURCE CODE into the final documentation but, instead, try to transcribe things manually: "Is that a lowercase L or a digit 1?"]

I *think* the only way I can *hope* (no guarantee) to resolve these sorts of things is to throw lots of data at it and hope to see a pattern in the failure(s) that result. Perhaps even instrumenting my code so that I can flag each datum that tickles a "suspicious rule". Then, hope I can fathom what they have in common and how to resolve the error.
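
To make that concrete, something like the following (a minimal C sketch -- the rule records, names and the two example rules are *invented* for illustration, not taken from any published ruleset): each rule carries a "suspect" flag and a hit counter, and any word that fires a suspect rule gets logged for later staring-at.

#include <stdio.h>

/* Hypothetical rule record -- field names and both example rules are
   invented for illustration, not taken from any published ruleset. */
struct rule {
    const char *pattern;   /* fragment the rule matches                   */
    const char *phonemes;  /* phonemes it emits                           */
    int         suspect;   /* set by hand where the source doc is dubious */
    long        hits;      /* how often the rule has fired                */
};

static struct rule rules[] = {
    { "TION", "SH AH N", 0, 0 },
    { "OUGH", "AH F",    1, 0 },   /* glyph unreadable in the photocopy */
};

/* Call this wherever the matcher commits to a rule. */
static void note_rule_use(struct rule *r, const char *word)
{
    r->hits++;
    if (r->suspect)
        fprintf(stderr, "SUSPECT rule \"%s\" fired on \"%s\"\n",
                r->pattern, word);
}

int main(void)
{
    /* Stand-in for the real matcher: pretend rule 1 fired on "tough". */
    note_rule_use(&rules[1], "tough");
    printf("rule \"%s\": %ld hit(s)\n", rules[1].pattern, rules[1].hits);
    return 0;
}

Then, hopefully, the words collected against each suspect rule have something in common that points at the correct reading.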

This is complicated by the fact that the algorithms aren't "perfect" to begin with. So, the idea of comparing computed pronunciations against a *dictionary* of pronunciations would be ineffective as it would flag all of the "semi-acceptable" pronunciations as "errors". I don't have ready access to the original data from which the rules were derived (nor the "private notes" by which they decided to trade off performance of one rule vs. another in certain instances).

[It's actually fascinating to look at word spellings in detail and the big differences in their pronunciations! E.g., water/pater/later; valentine/aborigine/clandestine; etc.]

Does this approach seem to make sense? I.e., tag each input that tickles a suspicious rule and try to resolve the problems by "staring at them"? Any other suggestions that might be more productive? Esp given that we each view words as having specific pronunciations and, without religiously consulting a "reference", can easily dismiss what *appears* to be a problem as a NON problem (e.g., most folks seem to mispronounce "salmon" so wouldn't notice if the algorithm ALSO mispronounced it!)

[N.B. when I refer to examining the "flagged output", I don't mean *audio* output but, rather, phonemic transcriptions of the input]

Thx!

PS: I didn't bother with the *.speech.* groups as they all appear to be moribund

Reply to
Don Y

(snip)

The first, and close to last, time I worked on this problem was summer 1977, what would now be called a summer intern, but we didn't call them that at the time. The person I was working for had an actual Altair 8800 with non-Altair 64K DRAM. He bought a voice synthesizer S-100 card for it, and we were trying it out.

It is long enough now that I don't remember if we typed in phonemes or words. Maybe there was a BASIC program that converted words to phonemes and an assembly program (which we had to fix up, as we used a different assembler than most) to run the hardware.

-- glen

Reply to
glen herrmannsfeldt

Yes, most of the "classic" work dates from the late 60's to the early 80's (i.e., when "computers" were common enough to be accessible for this sort of thing yet still "hog-tied" in terms of real resources). The card you reference was probably a "discrete" formant synthesizer similar to that which Gagnon produced (the Votrax "board set" which later became the Artic/SSI SC-01/2 "chips").

[I'd love to get my hands on a VS6.3 board set but it's not worth the time to chase one down -- "just to reminisce"...]

IIRC, pure software (formant) synthesizers weren't really practical until closer to 1980 (Klatt et al.).

The most common "public" text-to-phoneme algorithm of that era had to be the NRL ruleset. You could cram the entire ruleset into about 2.5KB and the algorithm to drive it was relatively simple/straightforward. And, crude as all hell! (no inflection control, prosody, etc.)

I posed my problem last night ("Boys night out") and got a couple of suggestions that I will follow up on as time permits. One was particularly interesting and, from the notes on the cocktail napkins I fished out of my pocket this morning, looks like it should give me more than I need with very little work! Always interesting to see how other minds approach problems! :>

But, I think I will keep the "throw lots of data at it" approach on hand and, instead of using it to help resolve the ambiguities/omissions in the published documents, will use it to help *evaluate* the different algorithms that I've implemented. The trick, then, will be to come up with a "scoring" criterion to allow for a "fair" comparison of performance. Maybe *literally* compare results to some authoritative PRONOUNCING DICTIONARY!
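
One candidate "score" (sketch only -- the phoneme symbols are placeholders, and a real scorer might weight substitutions by phonetic similarity instead of counting every edit as 1) is a per-word phoneme error rate: edit distance between the computed transcription and the dictionary's, normalized by the length of the dictionary transcription, averaged over the word list.

#include <stdio.h>
#include <string.h>

#define MAXPH 64

/* Levenshtein distance between two phoneme sequences (each phoneme is a
   string token).  All edits cost 1 here. */
static int edit_distance(char *a[], int na, char *b[], int nb)
{
    int d[MAXPH + 1][MAXPH + 1];
    int i, j;

    for (i = 0; i <= na; i++) d[i][0] = i;
    for (j = 0; j <= nb; j++) d[0][j] = j;

    for (i = 1; i <= na; i++) {
        for (j = 1; j <= nb; j++) {
            int sub = d[i-1][j-1] + (strcmp(a[i-1], b[j-1]) != 0);
            int del = d[i-1][j] + 1;
            int ins = d[i][j-1] + 1;
            int m = sub < del ? sub : del;
            d[i][j] = m < ins ? m : ins;
        }
    }
    return d[na][nb];
}

/* Split a space-separated transcription into tokens (destructive). */
static int split(char *s, char *tok[])
{
    int n = 0;
    char *p = strtok(s, " ");

    while (p != NULL && n < MAXPH) {
        tok[n++] = p;
        p = strtok(NULL, " ");
    }
    return n;
}

int main(void)
{
    /* computed vs. dictionary transcription for one word */
    char got[]  = "S AE L M AH N";   /* algorithm said "sal-mon" */
    char want[] = "S AE M AH N";     /* dictionary says "sam-un" */
    char *g[MAXPH], *w[MAXPH];
    int ng = split(got, g), nw = split(want, w);
    int d = edit_distance(g, ng, w, nw);

    printf("phoneme errors: %d of %d (%.0f%%)\n", d, nw, 100.0 * d / nw);
    return 0;
}

Anything scoring above some threshold gets pulled out for eyeballing.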

It will also be a good way to quantify run-time performance (resource utilization).

Reply to
Don Y

(snip, then I wrote)

Now that you say it, Votrax does sound right. It would have been a single S100 board.

I suppose so, but as well as I remember, it wasn't all that much hardware.

(snip on not remembering)

That sounds right. (snip)

-- glen

Reply to
glen herrmannsfeldt

The Votrax synthesizers were pretty large. E.g., the VS6 was four (potted) boards -- each about 3"x8" in a "chassis" with a power supply, etc.

Chances are, the board you were using was a simpler analog synthesizer: a couple of noise sources feeding a set of tuned filters (resonators) that tried to approximate the resonances of the vocal tract.

There were lots of "low resource usage" approaches to speech synthesis in that time period. Most had pretty dreadful "output" (I used to joke that the Votrax was the only thing capable of penetrating concrete walls!)

IIRC, Digitalker is also of that approximate vintage.

Yes, see above. Even the software-based synthesizers (e.g., Klatt) that followed weren't "all that much (modeled) hardware". The advantages they had (besides cost) were more effective control of the transitioning between sounds (i.e., dynamically retuning the resonators with knowledge of their *intended* target frequencies and bandwidths). A bit easier to approach more "natural speech" when you have more dynamic control.

Other (earlier and later) rulesets were of comparable complexity (in terms of numbers of rules) but tended to have a less rigid algorithm by which they were applied. E.g., the NRL ruleset only looked at source text so a one-pass design was possible (this is actually one of the challenges in trying to come up with *truly* low resource implementations... you don't want to have to buffer entire sentences, phrases, etc. -- because they can be of nondeterministic length!)
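
E.g., if no rule ever looks more than a few characters to either side, the whole pass can run over a *stream* with a small fixed window -- something like this sketch (the window size, names and echo-only "matcher" are arbitrary; it just shows the bounded-buffering shape):

#include <stdio.h>
#include <string.h>

#define CONTEXT 4                  /* arbitrary: widest context any rule uses */
#define WINDOW  (2 * CONTEXT + 1)

/* Stand-in for the real rule matcher: it may inspect all of win[],
   but here it just echoes the "current" character. */
static void emit(const int win[WINDOW])
{
    putchar(win[CONTEXT]);
}

int main(void)
{
    int win[WINDOW];
    int i, c, drained = 0, pending = 0;

    for (i = 0; i < WINDOW; i++)   /* history starts out as blanks */
        win[i] = ' ';

    for (;;) {
        if ((c = getchar()) == EOF) {
            if (drained++ == CONTEXT)   /* lookahead fully flushed */
                break;
            c = ' ';                    /* pad lookahead past end of input */
        }
        memmove(win, win + 1, (WINDOW - 1) * sizeof win[0]);
        win[WINDOW - 1] = c;
        if (pending < CONTEXT)          /* window not yet primed */
            pending++;
        else
            emit(win);
    }
    return 0;
}

Nothing larger than the window ever needs to be buffered, regardless of how long the sentence runs on.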

Reply to
Don Y

To quote from Hertz, "SRS Text-To-Phoneme Rules: A Three Level Rule Strategy", ICASSP 1981: "A number of English text-to-phoneme systems exist. MITalk-79, for example, uses a large morph dictionary and a small set of rules. The system is very accurate, but it is too large for many applications. The Naval Research Laboratory (NRL) system, on the other hand, uses only a small set of rules with no dictionary. These rules require little storage space, but do not perform with the kind of accuracy that is desirable for most applications. Other text-to-phoneme strategies generally lie somewhere between these two extremes."

Klatt was MITalk-79

formatting link
(text is in German) The front end was on a TMS320 DSP and much more elaborate than the Votrax I/II. Notice the large bank of EPROMs.

The source for NRL

formatting link
(text is in German) It was originally published in SNOBOL but ported to other languages:
formatting link
The General Instrument SP0256-AL2 "Votrax clone" had a controller, the CTS256-AL2 (often found on eBay), that implemented a version of it.

Remember the chess computers? There were the brute-force machines and a few "chess knowledge" machines that were not successful. Speech is rather unstructured. The speech recognition systems in the '70s that were rule-based "artificial intelligence" failed too. I would say MIT started with something like NRL in the '60s, patched in exception after exception, and reworked it into a dictionary/table system to clean up the mess. You are on that road too.

MfG JRD

Reply to
Rafael Deliano

There wasn't a text-to-speech system based on it as far as I know.

formatting link
(text is in German)
formatting link
(text is in German)

Inflection control and prosody are much harder in a time-domain front end. Commercially, Digitalker was more usable than the Votrax or even LPC. Speech quality was fine for female voices like the one used in the Audi Quattro car. That was usually not true for LPC.

A text-to-speech system based on LPC looks easier, and TI seems to have worked on it:

formatting link
on TI Speech Synthesis.pdf (page 13) Perhaps there was an implementation on their home computer.

--------

As for small (8-bit) embedded controllers, my view is that text-to-speech is less practical than prerecorded words with flat intonation, like Digitalker. If the sentence is short: "channel - four - is - on", then flat robotic intonation is OK.

Standard application vocabulary has been the talking clock:

formatting link
(text is in German) That goes back to Edison.

While one can use PCM or ADPCM, I would say that CVSD is much more appropriate, because the bit rate can be switched more easily.
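
A minimal CVSD encoder sketch (C; the step sizes and decay constants are illustrative, not from any particular codec chip): one bit out per input sample, with the step growing on a run of three identical bits ("slope overload") and otherwise decaying. Trading bit rate against quality is then just a matter of the sample clock.

#include <stdio.h>

#define STEP_MIN   16
#define STEP_MAX   2048
#define STEP_GROW  64              /* added to step on a run of equal bits */
#define STEP_DECAY 16              /* subtracted from step otherwise       */

int main(void)
{
    int est = 0, step = STEP_MIN;
    unsigned history = 0;          /* last 3 output bits */
    int sample;

    /* Read signed samples as decimal text, one per line; emit '0'/'1'. */
    while (scanf("%d", &sample) == 1) {
        int bit = sample > est;

        history = ((history << 1) | (unsigned)bit) & 7u;
        if (history == 0u || history == 7u)    /* three equal bits in a row */
            step += STEP_GROW;
        else
            step -= STEP_DECAY;
        if (step < STEP_MIN) step = STEP_MIN;
        if (step > STEP_MAX) step = STEP_MAX;

        est += bit ? step : -step;
        est -= est / 32;                       /* leaky integrator */
        putchar(bit ? '1' : '0');
    }
    putchar('\n');
    return 0;
}

The decoder runs the same step logic on the received bits to rebuild the estimate.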

MfG JRD

Reply to
Rafael Deliano

Yes, no news here...

_From Text to Speech: The MITalk System_ discusses Allen, Hunnicutt and Klatt's work -- in reasonable detail. Other papers are available with clues as to more of what's under the hood. E.g., it was in Hunnicutt's ruleset that I was fighting typos.

Note that the Votrax VS6 predates KlattTalk (MITalk, DECtalk).

[I have a DTC-1, DECtalk Express, Type 'N' Talk, IntexTalker, PSS, SC01A, SP0256/CTS256, Digitalker, etc. -- and that doesn't count the *software*-based synthesizers! I've been at this for a few decades...]

Many of the ports failed to understand the subtleties inherent in SNOBOL and, as such, have bugs in their pattern matching algorithms.

Likewise, many of the ports of Klatt's synthesizer failed to understand what was *really* happening under the hood and just blindly tried to reimplement his (FORTRAN) implementation.

*General* speech (and, with that, only talking about *English* -- other languages are far more "well behaved") is unstructured. But, that doesn't apply to *all* speech.

Lee and Allen both pursued speech (with similar goals -- reading machine for the blind). Note that Kurzweil opted to *use* the VS6 in The Reading Machine. Though there was work on trying to enhance a Digitalker to bring the design "in-house" (Votrax board sets were expensive, crude and tended to fizzle out for unknown reasons -- being potted meant you had no option but to return them to the factory for a replacement)

Interesting that he did *not* pursue MITalk/KlattTalk given that they were half a dozen city blocks away (perhaps DEC was already involved in that venture). OTOH, the "Personal Reader" has a custom DECtalk implementation within.

Note that folks who listened to the KRM (Votrax) soon developed a much improved sense of comprehension. As its speech patterns were methodical, you could learn them and exploit that to enhance your understanding (experienced users would complain that the machine wasn't *fast* enough -- even when set to speak at its maximum speed! IIRC, about 300 WPM?)

You appear to have missed the comment in my initial post: "As I can operate them in *semi-limited domains* (since I *tend* to be the source of the text they utter), I've opted to adopt some of the classic approaches to the text-to-phoneme portion of the algorithms" so, I am clearly *not* on THAT road! :> As I said later in that same post: "even bloated synthesizers have problems with speech so why waste effort trying to *approach* their performance levels with 1% of *their* resources?!"

Reply to
Don Y

Digitalker and Votrax addressed different markets. Unconstrained text with a Digitalker would be impractical. OTOH, the Votrax could at least make a *stab* at it (subject to how well you could map glyphs to phonemes and control the inflection with your algorithm). E.g., the VS6.G could also *sing*!

The 9900 had a synthesizer. LPC is also a limited-domain approach. I.e., great for "The cow goes 'Moooo'!" (Speak n Spell)

Recording speech is the most limited domain approach possible. You have to know *every* utterance that you are likely to encounter "at run time". As you say below, it really only makes sense for small, limited vocabularies (like clocks!)

CVSD was used in video games in the early 80's. Even clocking the devices at their maximum rates left much to be desired in the quality of the speech that resulted.

A formant-based synthesizer is very practical with current technology. And, simple text-to-phoneme algorithms can give a lot of coverage for "nominal" utterances. The two, combined, seem to be the most effective way to get speech in a small footprint.

Even imposing prosodic structure on complex phrases with simple algorithms is relatively easy and effective.

But, all this presupposes "nominal" text. E.g., reading source code listings would end up sounding like Qbert! This is where having "semi-limited" domains can leverage the basic performance of "cheap" synthesis without stumbling over the complexities in more general speech:

"Dr. Reed had already read the Polish language book that he was reading when I drove up to his house on Reading Dr. and caught him polishing off a small brandy."

Fit the solution to the problem.

Reply to
Don Y

The CVSD at 16 kbit/s that the military used was poor. We used CVSD at 24 kbit/s in answering machines sold by Philips and AEG for cars in 1985. These were the early analog cellular radios for the telephone network here in Germany. CVSD is noisy, but the analog radio link was noisy too, so the public didn't mind. Simple chips from Harris and CML, nothing complicated. This was a viable application, with many thousands of units sold. The company had earlier tried to sell products based on the SC01A, but there was no market for that sort of speech.

I doubt that messages of that length are typical for commercial applications.

The typical commercial speech output applications do not have an "at run time" requirement. The biggest vocabulary I can think of is speech output in GPS car navigation.

Digitalker demo boards had the complete vocabulary in ROM, but that predates FLASH. For a small, fixed-language application the user has to have access to a database that contains "all" words. He selects what he needs and downloads them to his hardware. Of course there is the old problem of getting the words done in a recording studio. But (copyright niceties aside) I have found language-trainer CDs, where someone utters a word in German and then in English, to be reasonably good testing material. True, "the most common 2000 words of the English language" do not contain everything the application vocabulary requires. But then some butchering ...

MfG JRD

Reply to
Rafael Deliano

Allen was publishing from 1968 onwards. 1972 was roughly the time Gagnon filed his first patent. The MIT system is the work of several people over many years. Klatt did the transfer to a commercial product. The success of Votrax did help him there. As far as I know, Votrax didn't do much text-to-speech research. NRL was using a VS6 because that was the only commercially available front end then. And so Votrax was lucky to get back a TTS system that way. Whatever the software was, I can't see the SC01A, or for that matter the VS6, matching up to the more expensive DECtalk. Since it had to be, or was intended to be, somewhat backward compatible, the SSI263 was limited too. That said, I have never heard both systems. But I am rather skeptical, for general applications, of anything claiming TTS and being small.

That I do not doubt. But typical users were blind people, and that's a closed user group. Some of the old military communication systems (CVSD at 16 kbit/s; early LPC10) did work too, within their limits, but would not be viable for general commercial application.

For a typical embedded speech output application, the selling point of the SC01A (maybe plus TTS) was: the user can create unlimited speech rapidly. For TI/LPC and Mozer, the IC was a bit cheaper than the SC01A and the quality much better. But getting it done was slow and expensive. I have no doubt that one can technically create a hybrid version, but I am unclear where the commercial advantage would be.

MfG JRD

Reply to
Rafael Deliano

Likewise its use in video (arcade) games -- the consumer is more titillated than concerned with the poor quality of the speech.

Likewise, visually impaired users could make *huge* allowances to gain "on demand" access to print material. The alternative being to find someone to sit down and *read* it to them!

Wrong capabilities match. You didn't need the unlimited vocabulary that it provided.

The point of the example was to illustrate how high a bar is set if you want to accurately speak the sorts of sentences that we encounter on a daily basis.

Dr. (doctor) vs. Dr. (drive)
Polish (nationality) vs. polish ("to make shiny")
Read (past tense) vs. reading (present) vs. Reading (pronounced "redding")

How would you read incoming email to a user? Or, even tell the user that the connection to the IMAP (eye-map vs I M A P) server has timed out -- or their credentials have been rejected (user name/password)? Or, that the server is down for repair and expected to be up again at 1:00PM DST (D S T? Daylight Savings Time?)? Or, that there are 23 minutes of estimated battery life remaining? Or, that the signal quality is poor due to obstructions in the RF signal path?
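
E.g., even a *toy* pre-pass has to make that sort of decision. A C sketch (the abbreviation table and the "contains a vowel" heuristic are placeholders, not a real normalizer -- a real one would also expand numbers, times, units, etc.): known abbreviations get canned expansions, pronounceable tokens go to the LTS rules, everything else gets spelled out.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

struct abbrev { const char *token, *expansion; };

static const struct abbrev table[] = {
    { "DST",  "daylight saving time" },
    { "IMAP", "eye map" },            /* or spell it -- a policy choice */
};
#define NABBR (sizeof table / sizeof table[0])

static void speak_token(const char *t)
{
    size_t i;
    int pronounceable = 1, has_vowel = 0;

    /* 1. Known abbreviation?  Use the canned expansion. */
    for (i = 0; i < NABBR; i++)
        if (strcmp(t, table[i].token) == 0) {
            printf("[say]   %s\n", table[i].expansion);
            return;
        }

    /* 2. Crude pronounceability test: all letters, at least one vowel. */
    for (i = 0; t[i]; i++) {
        if (!isalpha((unsigned char)t[i]))
            pronounceable = 0;
        if (strchr("AEIOUaeiou", t[i]))
            has_vowel = 1;
    }

    if (pronounceable && has_vowel) {
        printf("[LTS]   %s\n", t);    /* hand it to the rule set */
    } else {
        printf("[spell]");            /* spell it, character by character */
        for (i = 0; t[i]; i++)
            printf(" %c", t[i]);
        printf("\n");
    }
}

int main(void)
{
    const char *demo[] = { "IMAP", "DST", "server", "23" };
    size_t i;

    for (i = 0; i < sizeof demo / sizeof demo[0]; i++)
        speak_token(demo[i]);
    return 0;
}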

Notice how GPS units choke on many of the street names? Or, even names of towns (e.g., "Berlin" is pronounced in two different ways here -- US)? Other towns defy any logical pronunciation (Worcester -> "wooster"; Billerica -> "billrika"; Phoenix -> "feenigs")

I.e., you are thinking in terms of *fixed*, closed (limited domain) applications. I'm addressing a wider class of applications in a single "device". (in much the same way that the KRM addressed with a wider range of *books*)

Yes, and he is then *constrained* to speak only those words. If the application needs him to speak some *other* word, he can't. I.e., the application must be closed/limited domain.

Again, your butchering won't help you if you want to say "research" or "pubs" if those words don't already exist in your vocabulary. You end up (at best) trying to paste together phones from words that you have on hand (unit selection) -- without the benefit of (e.g.) a diphone representation (so, it sounds like you pieced together fragments of words).

Again, if you have a limited domain application, you can can all of your speech. E.g., our answering machine sounds very natural when it tells you that no one is home. OTOH, when it tries to tell me how many messages are waiting, it starts to sound artificial. And, I surely wouldn't expect it to be able to tell me *who* called (by looking up the CID and reciting the name of the caller to me).

As you move towards larger/unconstrained vocabularies, you *need* a mechanism to decide how to pronounce the text you are encountering.

*Spelling* everything is unacceptable: "You received a call from R A F A E L,,, D E L I A N O today at 3:27PM"

Trying to handle *all* possible text *correctly* leads to an implementation that is bloated -- and *STILL* will only address a particular *style* of speech. E.g., how would you tell the user that his stored password is "puppy"? Or, "JhfD@f5%"?

Reply to
Don Y

DECtalk is more expensive, physically larger and (probably) draws more power (I'd have to check the details, for sure). E.g., the VS6 fit in a box about 60% of the size of the DTC-01. The Votrax packaging was undoubtedly more robust (though a "custom" DECtalk could have probably been created to be more durable and more readily *integrated* into a product). The DECtalk Express is *much* smaller than the original DECtalk -- probably a factor of 5 volumetrically (60% of a "carton" of cigarettes) -- and greatly reduced power consumption (i.e., battery powered).

But, comparing solutions of one timeframe to those of another is never fair. Gagnon was building hardware filters so his solution would ALWAYS be constrained to that which could be reified AS a hardware filter. Whether it was a bunch of op-amps and discretes in a potted case -- or a more "integrated" approach, it was still a genuine filter with *only* the type of tuning and transitioning present that he could realize *in* hardware. There are no "smarts" in the design.

All formant based synthesizers sound largely the same. You'd be hard pressed to tell the difference between a Votrax and a DECtalk if operating in the same "voice" (i.e., basic formants). However, DECtalk embodies a set of LTS rules whereas Votrax relies on "something external" to push phoneme codes at it. So, the speaking *pattern* of a Votrax is largely defined by the LTS algorithm that drives it. E.g., McIlroy's "sounds different" from the NRL when you explore a greater variety of input samples. Push phoneme codes (from your favorite LTS front-end) at the DECtalk and you'd wonder if it wasn't the Votrax!

By contrast, diphone synthesizers sound like real people (because the diphones were recorded from real speech). So, it's more a question of how good the unit selection algorithm is at piecing together "compatible" diphones -- without sounding like it is "piecing together diphones" :> But, you then need to be able to store a large diphone inventory! And, if the user doesn't *like* that "voice", there are limited remedies available to you (as building a new voice isn't just "tweaking a few settings")

Again, you're ignoring my "semi-limited domain" qualification!

Samples of the output I have to address (and the text strings that would be presented to the TTS subsystem for each of them):

The current time is 12:34PM, MST.
Today is Sunday, 25 Jan 2015.
Volume level: 45%
Balance: 30% left of center
Voice selection: "casual2"
Voice parameters:
Battery life remaining: 4:23
Estimated recharge time: 1:47
Relative signal strengths: A=12.4; C=15.0; D=3.5
Using beacon C.
MAC address 12:34:56:78:9A:BC
Servers available: "...", "...", "..."
Attempting connection to server "..."
Service unavailable. The server replied "..."
Access denied. The server replied "..."
User ID is "rafael_deliano".
Current passphrase is "^gF4WxKK98".

Plus, of course, the dialogs (and "help") to adjust each of these settings...

It's *easy* to get "quality" speech for the "known" (a priori) text portions of these prompts. Even the numerical parameters are relatively easy to encode -- *if* you are conscious of how they will be spoken when you adopt them! (e.g., "45%" instead of "0.45")

But, once you accept unconstrained input (e.g., "The server replied..."), you have no way of accommodating that "unknown" -- short of resorting to spelling the contents of the message: "S E R V E R D O W N F O R R O U T I N E M A I N T E N A N C E. T R Y A G A I N L A T E R. F O R A S S I S T A N C E , C O N T A C T B I L L S L Y A T ( 7 0 8 ) 5 5 5 - 1 2 1 2 e x t 5 4 7"

Similarly, do you want to pass ASCII strings to the speech subsystem and have *it* know that it shouldn't even *try* to pronounce "^gF4WxKK98" because that will only result in the user requesting clarification? Or, do you want to have to invoke a different interface to the speech subsystem when the *content* you are passing should be treated differently? (e.g., "45%" instead of "forty-five percent")
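
One possible shape for that interface (the entry point and hint names are invented purely for illustration): a single call that takes a rendering hint alongside the text, instead of a separate interface per content type.

#include <stdio.h>

enum say_hint {
    SAY_TEXT,        /* run it through the LTS rules                      */
    SAY_NUMBER,      /* expand digits/units ("45%") before the LTS rules  */
    SAY_SPELL        /* spell it out, character by character              */
};

static void say(enum say_hint hint, const char *text)
{
    /* Stub: a real implementation would queue phonemes here. */
    static const char *name[] = { "text", "number", "spell" };
    printf("say(%-6s): \"%s\"\n", name[hint], text);
}

int main(void)
{
    say(SAY_TEXT,   "Volume level");
    say(SAY_NUMBER, "45%");
    say(SAY_TEXT,   "Current passphrase is");
    say(SAY_SPELL,  "^gF4WxKK98");
    return 0;
}

The hint moves the "what am I about to say" awareness to the caller without forcing a different entry point for every kind of content.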

This is why restricting yourself to a limited domain approach is tedious and ineffective. You can't just treat the speech subsystem as an output device but, rather, have to be aware of what you are likely to be saying in each invocation. Or, restrict yourself to simply not outputting things that it won't be able to say "well" ("Access denied. The server said something but I have no idea what it was.")

Reply to
Don Y
