speech recognition

I am working on a system of PIC microcontrollers in which there will be communication between computer and user. The computer will choose from a menu of Chipcorder messages, but for the user or operator I want to utilize a small vocabulary of spoken words.

I am learning that there are some speech recognition chips available, and I realize that even setting these up is not a trivial task. But I would like to experiment with A/D and DAC conversion, digitizing individual words, and working on algorithms for comparing the operator's spoken word with a stored template. This might even lead me into pattern recognition and neural network technologies.

Has anyone experimented in these areas or set up any type of speech recognition system?

Reply to
lvkeegan

Joe G, thanks very much - it is exciting to see all these features incorporated into the microprocessor itself, and also their software packages. I saved the link and will be studying it carefully. Thanks. Larry Keegan

Reply to
lvkeegan

Hi Andrew, this is basically what I thought of doing, except I thought using the A/D device and digitizing the word would give me a better handle. But I see what you are doing, and capturing a template of sorts seems to be the name of the game. In my experiments I will certainly try using the caps and developing an analog envelope. It will also be interesting to see the images, if I can get one of my 3 non-working scopes up and going. I do have a graphics LCD working, and I may display my digitized samples of a word like "yes" on the LCD. Sounds as though you had an interesting experience with speech recognition. Larry Keegan

Reply to
lvkeegan

I made a sort of "spectrum analyzer" with eight active filters spaced across about 300 Hz - 3 kHz, kinda like those displays on a "graphic equalizer". I found out that when you take the fundamental out, each phoneme has a unique spectrum, regardless of who's talking! I got hung up on the pattern matching algorithm, however.
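In software, the same eight-band idea can be sketched by summing DFT bin magnitudes into bands (a rough Python sketch, not Rich's actual hardware; the band count, sample rate, and test tone are made up for illustration):

```python
import cmath
import math

def dft_mag(x):
    """Magnitude spectrum via a direct DFT (fine for short frames)."""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N)))
            for k in range(N // 2)]

def band_energies(x, fs, bands=8, lo=300.0, hi=3000.0):
    """Sum spectral magnitude into equal-width bands between lo and hi Hz."""
    mag = dft_mag(x)
    N = len(x)
    out = [0.0] * bands
    for k, m in enumerate(mag):
        f = k * fs / N                    # center frequency of bin k
        if lo <= f < hi:
            out[int((f - lo) / (hi - lo) * bands)] += m
    return out

fs = 8000
# a 1 kHz test tone: its energy should land in the band containing 1 kHz
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(64)]
e = band_energies(x, fs)
print(max(range(8), key=lambda b: e[b]))   # → 2 (the band covering 975-1312 Hz)
```

Each band plays the role of one of the eight active filters; a real filter bank would of course run continuously rather than frame by frame.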

These days they'd probably do the whole thing with DSP or something.

Good Luck! Rich

Reply to
Rich Grise

Checkout Sensory Inc

A single chip can do all the PIC work as well as the VR.

formatting link

JG

Reply to
Joe G (Home)

Congratulations (Andrew) on also making an AM demodulator, which is what that is, but you probably knew that. It works, but there is another way that you might want to try in software before you implement it in hardware, and it is also a great way to explore the basics of signal processing cheaply:

Get a standard PC running Windows or Linux. Use a sound application to digitize a vocabulary of your words.

Compute the discrete Fourier transform (DFT) of these time-domain samples using the Fast Fourier Transform (FFT). You can find how to do this by searching on Google. The math can be confusing at first, but this is one of the most beautiful processes in all of engineering, and it's definitely worth learning if you haven't already.
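A minimal illustration of that step, using a hand-rolled O(N^2) DFT in Python rather than a real FFT library (the sample rate, frame size, and two-tone test signal are invented for the example):

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform: O(N^2) but easy to follow.
    A real FFT routine computes the same result in O(N log N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

fs, N = 8000, 64
# toy "utterance": two tones, as if captured from the sound card
x = [math.sin(2 * math.pi * 500 * n / fs)
     + 0.5 * math.sin(2 * math.pi * 1500 * n / fs)
     for n in range(N)]
X = dft(x)
# find the two strongest bins in the lower half of the spectrum
peaks = sorted(range(N // 2), key=lambda k: -abs(X[k]))[:2]
print(sorted(k * fs / N for k in peaks))   # → [500.0, 1500.0]
```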

Once you have done that, you will have the frequency bins that Rich Grise mentioned in his parallel post, each bin essentially representing the energy at a particular frequency. Then, intuitively, a particular word will have a "signature", or a frequency pattern, depending on what word it is. If your vocabulary has 16 words, you should have 16 representative signatures (frequency-domain signals).

Then normalize the signatures by regarding the height (modulus of the corresponding component) of each frequency bin as the component of a vector. The FFT of the word would yield a vector in N-space, where N is the number of samples in the frequency-domain signal. You should normalize this vector to unity (length 1) by replacing it with the vector where each component has been divided by the length of the vector. Naturally, you compute the length of the vector by taking the square root of its scalar product with itself: sqrt(A*A) = sqrt(a0*a0 + a1*a1 + ... + a(n-1)*a(n-1)).
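That normalization step might look like this in Python (the two-component "signature" is just a toy value to show the arithmetic):

```python
import math

def normalize(v):
    """Scale v to unit length: divide each component by sqrt(v . v)."""
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v]

sig = [3.0, 4.0]      # toy 2-bin "signature"; its length is sqrt(9 + 16) = 5
u = normalize(sig)
print(u)              # → [0.6, 0.8]
```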

After you have normalized your vocabulary, you can normalize the utterances as they come in using the exact same procedure: take sample, compute DFT with N samples, regard as vector, normalize vector.

After you have this input vector, you want to guess which word was uttered. The simplest thing you can use is a minimum-distance algorithm. Since each of your vocabulary utterances is a vector in N-space, and your input word is also a vector in N-space, and all of these vectors have length one due to normalization, most likely the vector of the uttered word will have its tip closest to the tip of the vector of the corresponding vocabulary word. You compute the distance between each vocabulary vector and the input vector using the standard formula for the distance between two vectors. Whichever vocabulary word yields the smallest distance, that's the one you choose. Naturally, if someone utters a word not in the vocabulary, you will necessarily have a mismatch, so it might make sense to have a threshold or thresholds.
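A sketch of the whole minimum-distance idea in Python, with made-up 4-bin signatures standing in for real FFT outputs and an arbitrary rejection threshold:

```python
import math

def normalize(v):
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v]

def distance(u, v):
    """Standard Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# hypothetical 4-bin signatures for a 3-word vocabulary (invented numbers)
vocab = {
    "yes":  normalize([9.0, 1.0, 0.5, 0.2]),
    "no":   normalize([1.0, 8.0, 2.0, 0.1]),
    "stop": normalize([0.5, 1.0, 7.0, 3.0]),
}

REJECT = 0.8   # farther than this from every signature -> not in vocabulary

def classify(utterance):
    u = normalize(utterance)
    word, d = min(((w, distance(u, s)) for w, s in vocab.items()),
                  key=lambda p: p[1])
    return word if d < REJECT else None

print(classify([8.0, 1.5, 0.6, 0.1]))   # → yes
print(classify([0.1, 0.2, 0.1, 9.0]))   # → None (rejected: far from everything)
```

The threshold value is pure guesswork here; in practice you would tune it from the miss/false-alarm statistics mentioned below.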

Note that some words will be longer than others. Resist the temptation to contract longer words in the time domain so that they all "have the same length", for if you did that, your utterances would sound like they were uttered by chipmunks.

To test your energy normalization, try yelling the word, then murmuring it, and see if the matching still works. But keep in mind that, when a person yells a word, the increase in intensity is not distributed along the entire length of the utterance. Often, in polysyllabic words, the accented syllable takes the bulk of the emphasis. This illustrates another important point: you want very clear demarcations between the engineering aspect of what you're doing and the art aspect. Many speech recognition companies suffered during the 1980's and 1990's because they thought that the engineering aspect would carry the day, but there is only so much that the math can do. The rest is art.

After you get all of this working, you can do all kinds of things, like calculate the probability of a miss, a hit, or the conditional probabilities of a miss or hit given that a particular word was uttered. You can also take your vocabulary vectors and determine the degree of "orthogonality" in the utterances - the degree to which each word is likely to be distinguishable from the other words. You do this by taking the scalar product between every pair of vectors in your M utterances. If the scalar product is close to unity, that's not good; that means that the words are spectrally close, like "lighting" and "lightning". On the other hand, if it's close to zero, great!
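The orthogonality check could be sketched like so (the three signatures and their word labels are invented for illustration):

```python
import math

def normalize(v):
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v]

def dot(u, v):
    """Scalar product; for unit vectors this is the cosine of the angle."""
    return sum(a * b for a, b in zip(u, v))

# invented 3-bin signatures for illustration
a = normalize([9.0, 1.0, 0.5])   # "lighting"
b = normalize([8.5, 1.2, 0.6])   # "lightning" - spectrally close to a
c = normalize([0.3, 0.5, 7.0])   # "stop"      - very different

print(dot(a, b) > 0.99)   # → True (near unity: bad, easily confused)
print(dot(a, c) < 0.2)    # → True (near zero: good, well separated)
```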

It should be fun to find a vocabulary with maximum spectral orthogonality which has the semantics you want: instead of the word "eat", say "ingest", if your vocabulary absolutely positively must contain "meat".

Once you have written the software, you can put it in hardware. You'll have to create data types and functions for complex numbers, but hey, let's not kid ourselves - there was never such a thing as a non-complex number anyway. :)

-Chaud Lapin-

Reply to
Le Chaud Lapin

I actually developed a very simple version of this using a PIC16F84 running at 4 MHz when I was about 16 (year 11 at high school). My method basically used circuitry to convert the signal from the microphone to an 'envelope' (I think that's what it's called), which can be visualized by looking at the output of the microphone on an oscilloscope and tracing a line along the top of the waveform. This basically gives different words a quite definable shape which can be sampled at a relatively slow rate. This envelope was basically made by feeding the AC signal through a diode, and then having a resistor and cap in parallel from the diode to ground. The size of the cap and resistor has to be chosen to smooth out the waveform sufficiently, while still responding fast enough. I had extra buffering and amplification around this to get a solid signal. I then sampled this signal with an A/D converter and ran tests on it in the PIC to recognize different words. Initially it only recognized "stop" and "go", and did so quite well.
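For anyone wanting to prototype Andrew's diode + RC envelope in software first, here is a rough Python stand-in - a rectifier followed by a leaky peak hold (the decay constant and test signal are arbitrary, and abs() gives full-wave rectification where a single diode is only half-wave):

```python
import math

def envelope(x, decay=0.97):
    """Software stand-in for the diode + RC detector: rectify,
    charge the 'cap' instantly, then let it leak away per sample."""
    env, out = 0.0, []
    for s in x:
        r = abs(s)                           # rectifier (full-wave here)
        env = r if r > env else env * decay  # charge fast, discharge through R
        out.append(env)
    return out

fs = 8000
# a short 440 Hz burst followed by silence, like one word then a pause
x = [math.sin(2 * math.pi * 440 * n / fs) if n < 200 else 0.0
     for n in range(400)]
e = envelope(x)
print(e[150] > 0.7, e[399] < 0.01)   # → True True (high during the word, decayed after)
```

Tuning `decay` trades ripple against response time, exactly like choosing the R and C values in the analog version.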

Incidentally, I developed this as a prototype of a project to be given to university students. It was later used as a semester-long project for 3rd-year elec eng students, and I was later told that my method, which was developed in 3 weeks, performed better than all of the ones the students (who had the advantage of 5 years of age and 2.5 years of uni) developed in 12 weeks :p

Andrew

snipped-for-privacy@comcast.net wrote:

Reply to
Andrew Leech

To Rich Grise and the "warm rabbit": thank you. I had seen FFT mentioned and now know it is Fast Fourier Transform. I can't understand all of the explanations, including some of Andrew's info, but I am putting it into my notebook for further study.

I think a speech recognition system you will be hearing about is Sensory's RSC-4x Speech Recognition and Synthesis Microcontroller. I will not touch this myself, since I want to work with my own PIC microcontroller, which is the PIC16F628. I wrote my own assembler and created programming software and hardware for it. It takes me 25 seconds to go from assembly program to a burned-in chip.

Reply to
larry k

OK, here is where my 8-band filter project would have diverged - I was giving each phoneme a "signature", rather than whole words - too many axes! Then, if I had figured out how to do a pattern match on that signature, which you've so wonderfully described below with vectors (I was just comparing batches of numbers), my plan was to run the phoneme stream through something like a reverse Huffman table, or a soundex table, and pick words by their sound sequence. Then, of course, just send the stream of derived words to a speakwrite, robot, microwave oven, or whatever.

Alas, where were you in 1990? ;-)

Thanks! Rich

Reply to
Rich Grise

Bell called the various phonemes "formants" as I remember. They even had a kit using various LC resonators that one could combine at selected amplitudes to create these formants - to make the vowels.

Reply to
Robert Baer

On a sunny day (Wed, 18 Jan 2012 20:51:41 -0800 (PST)) it happened RichD wrote in :

I read about research like that a long time ago. It was a new type of neural net where the communication is not in spike quantity but used frequency-modulated spikes (IIRC). That was supposed to come close to how the real neurons work in some part of the cortex. It was taken up by the DoD and used for submarine detection and for detecting where sniper shots came from. Then I could no longer find any publications about it on the net. That system was employed in Iraq. Could be the basis of things being developed now.

Crossposting sucks, do not do it.

Reply to
Jan Panteltje

Probably not a filter, but the second mike could be helpful in establishing a baseline for what is 'noise' compared to what may be originating from a point source.

Should be some "voice pattern" smarts capable of deciding what data IS from a human source and segregating it from the rest.

Goddamned overtly cross posting twit.

Reply to
Capt. Cave Man

It is far better to use a better sensor (or sensor array, as the case may be) than to clean up the signal after the fact. Microphones are cheap. Complex signal processing is power hungry. I'd rather have a good noise-cancelling scheme than a pile of DSP post-processing.

BTW, the worst people to give a technical opinion on anything are stock analysts. ;-)

Phone voice quality got crappy when it was deemed that flip phones are not cool. There is simply nothing like having the microphone in the right place. Some of the cellular cases mimic the old flip phone so that there is a pressure-zone effect from the mouth to the microphone. You can google "sena flip case" to see them. Note some of the iPhone cases are designed for docking, so they flip at the wrong end of the phone. Most other Sena flip cases flip so that the flap provides a path for the voice.

Don't even get me started on bluetooth headsets where the microphone barely goes past the ear.

Reply to
miso

Google's telephone/voice service, after recording an incoming voice message, provides 'voice to text' conversion which is then emailed to you. A bit of a chuckle at all the errors, EXCEPT callback numbers are extremely accurate. Plus you can hit "Listen to the message" while reading the email to reinforce what you're reading.

Reply to
Robert Macy

formatting link
DTL

Totally cool with the music reseparation. I am totally interested in the algorithms used; I have some uses for such as that. I must find one of my unique signal samples as an offering to interest the Prof. I will let you know when I find it; then I will have to digitize it - that's the easy part.

?-)

Reply to
josephkk
