Speaker-independent speech recognition?

Question

I've got a very specific speech recognition application in mind, and I'm looking for a reference that will indicate if it's feasible. I want to recognize just one magic word, which would be a well-solved problem with high accuracy if we were talking about a boom mike and a silent environment. The difficulty is that there may be lots of other noises in the background, other people saying things, etc.

The application is something like a telematics device where you get its attention by saying "Computer...", except that the word in question can be assumed to be a unique word nobody would ever use for any other purpose. However, the specifics of this application are something along the lines of:

- If the computer doesn't recognize that you want its attention, a ninja will beat you to death with a frozen muskrat, and

- If the computer hears your dog barking and thinks it was you trying to get its attention, you'll be charged $1,000 for the CPU time.

Is there an article someone can reference for me that will give some feel for the best I can expect from today's technology? Ideally, some information on the upper practical % limit to catching validly spoken words, and the lower practical limit to the number of false positives I'll see on other noises.

I see a lot of information about % recognition accuracy on the vendor websites, but they refer mostly to noise-free environments and of course to large dictionaries.

Richard Seriani · Accepted Answer

Maybe something here will help:

Good luck.

Richard

larwe · Answer

Lots to read here, thanks for the pointer. Not sure this will directly give me the statistic I need, but I may be able to gather enough samples of my "magic word" being spoken in different conditions to run it through this s/w and generate some of my own stats.
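A minimal sketch of how such home-grown stats could be tallied, assuming the recognizer emits a confidence score per audio clip and that each clip has been hand-labelled as containing the magic word or not. The function name `detection_rates`, the score scale, and the sample numbers are all made up for illustration; they do not come from any particular recognizer.

```python
from typing import Sequence, Tuple


def detection_rates(
    scores: Sequence[float],
    is_keyword: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Return (miss_rate, false_alarm_rate) for one decision threshold.

    A clip counts as "detected" when its score is >= threshold.
    """
    if len(scores) != len(is_keyword):
        raise ValueError("scores and labels must have the same length")

    positives = [s for s, k in zip(scores, is_keyword) if k]
    negatives = [s for s, k in zip(scores, is_keyword) if not k]
    if not positives or not negatives:
        raise ValueError("need at least one clip of each kind")

    # Magic word was spoken, but the score fell below the threshold.
    misses = sum(1 for s in positives if s < threshold)
    # Something else (dog, chatter) scored at or above the threshold.
    false_alarms = sum(1 for s in negatives if s >= threshold)

    return misses / len(positives), false_alarms / len(negatives)


# Hypothetical scores: first four clips contain the word, last four do not.
scores = [0.91, 0.85, 0.40, 0.77, 0.30, 0.62, 0.10, 0.05]
labels = [True, True, True, True, False, False, False, False]

miss_rate, false_alarm_rate = detection_rates(scores, labels, threshold=0.5)
print(f"miss rate {miss_rate:.2f}, false-alarm rate {false_alarm_rate:.2f}")
```

With only a handful of clips the rates are very coarse; a rate estimated from N clips cannot resolve anything finer than 1/N, so a large and varied sample set would be needed before the numbers mean much.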

Rafael Deliano · Answer

Sounds a lot like "word spotting" from the old (Cold War) days. Lots of unencrypted voice radio transmissions in Russian were recorded. People were trained to listen in and pick out key words in all the uninteresting routine conversations. A recording that possibly contained something of value was then listened to further by people who actually understood the language. There was funding in the '70s/'80s to do that initial step more cheaply by computer. Doubtful whether anything useful came out of it.

Regards, JRD

larwe · Answer

It's not the same kind of application at all - really it's more like a voice-operated "clapper" switch than anything else - but the requirements are similar. The costs of a false negative and a false positive are both pretty high, though a false negative is much more costly.

I think cheap DSP technology has come a long way in the past 20-30 years :)

Not Really Me · Answer

I'll take the muskrat, Alex. Actually, it sounds like an old Firesign Theatre line.

Scott

larwe · Answer

Never really got into Firesign Theatre. I prefer the Goon Show, Hancock's Half Hour, etc.

Anyway, I was trying to demonstrate (flippantly) the real fact that both a false hit and a false miss have real costs in this application: a false miss is dangerous, a false hit is financially expensive.
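One way to make that trade-off concrete is to pick the detection threshold that minimizes total cost over a labelled test set. This is only a sketch under stated assumptions: the detector is assumed to emit a score per clip, the scores below are invented, and the cost of a miss (1,000,000) is a hypothetical stand-in, since the thread only puts a figure on the false hit ($1,000).

```python
import math
from typing import Sequence, Tuple


def cheapest_threshold(
    scores: Sequence[float],
    is_keyword: Sequence[bool],
    cost_miss: float,
    cost_false_alarm: float,
) -> Tuple[float, float]:
    """Return (threshold, total_cost) with the lowest cost on this data.

    A clip triggers the device when its score is >= threshold.
    Candidates are every observed score, plus infinity (never trigger).
    """
    candidates = sorted(set(scores)) + [math.inf]
    best_threshold, best_cost = candidates[0], math.inf

    for t in candidates:
        pairs = list(zip(scores, is_keyword))
        misses = sum(1 for s, k in pairs if k and s < t)
        false_alarms = sum(1 for s, k in pairs if not k and s >= t)
        cost = misses * cost_miss + false_alarms * cost_false_alarm
        if cost < best_cost:  # strict: keep the lowest threshold on ties
            best_threshold, best_cost = t, cost

    return best_threshold, best_cost


# Hypothetical scores: first four clips contain the word, last four do not.
scores = [0.91, 0.85, 0.40, 0.77, 0.30, 0.62, 0.10, 0.05]
labels = [True, True, True, True, False, False, False, False]

# A miss is assumed to cost far more than a $1,000 false hit.
threshold, total_cost = cheapest_threshold(
    scores, labels, cost_miss=1_000_000, cost_false_alarm=1_000
)
print(threshold, total_cost)
```

Because the miss is priced so much higher, the chosen threshold drops low enough to catch every spoken word in the sample and simply pays for the false hits that remain. A threshold tuned on a small sample like this would not necessarily hold up on new recordings.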

Tim Wescott · Answer

I suggest anti-muskrat armor and deep pockets.

Some of the fellows over on comp.dsp may have some pointers -- go ask over there, see what you find out.

--
Tim Wescott
Control systems and communications consulting
Need to learn how to apply control theory in your embedded system?
"Applied Control Theory for Embedded Systems" by Tim Wescott
Elsevier/Newnes,

przemek klosowski · Answer

Looks like Sphinx:

Funny enough, I heard of it first in the context of a speech/VR interface for Infocom 'adventure'-type games.

--
Przemek Klosowski, Ph.D.

Boudewijn Dijkstra · Answer

On Thu, 08 Jan 2009 17:48:34 +0100, larwe wrote: I've got a very specific speech recognition application in mind, and I'm looking for a reference that will indicate if it's feasible. I want to recognize just one magic word, [...]

- If the computer doesn't recognize that you want its attention, a ninja will beat you to death with a frozen muskrat, and

- If the computer hears your dog barking and thinks it was you trying to get its attention, you'll be charged $1,000 for the CPU time.

Sounds like it could be a military application: firing too late is dangerous and firing for no reason is costly. Or remote assistance: screaming "help" too late is dangerous and rescuing you with a helicopter for no reason is costly.

[...]

I see a lot of information about % recognition accuracy on the vendor websites, but they refer mostly to noise-free environments and of course to large dictionaries.

I think accuracy will be a lot higher in the case of whistled languages like Silbo.

Can your users be trained to whistle?
