Mel-scale extraction is just taking an FFT of the speech and massaging the spectrum (warping it onto the mel frequency scale) so that speaker identification and speech recognition work.
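For anyone who wants to see the "FFT plus massaging" spelled out, here is a rough sketch, not a reference implementation. It assumes the common 2595 * log10(1 + f/700) mel formula, triangular filters, a Hamming window, and 26 filters; none of those choices come from anything above, and the function names are made up for illustration.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Filter edges are equally spaced on the mel scale from 0 Hz to Nyquist.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, fft_freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        # Triangle: rises from lo to mid, falls from mid to hi, zero elsewhere.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_frame(frame, sample_rate, n_filters=26):
    # Window the frame, take the power spectrum, then pool it into mel bands.
    windowed = frame * np.hamming(frame.size)
    power = np.abs(np.fft.rfft(windowed)) ** 2
    bank = mel_filterbank(n_filters, frame.size, sample_rate)
    # Small constant avoids log(0) for silent bands.
    return np.log(bank @ power + 1e-10)

# Usage: one 512-sample frame of a 1 kHz test tone at 8 kHz sampling.
sample_rate = 8000
t = np.arange(512) / sample_rate
frame = np.sin(2.0 * np.pi * 1000.0 * t)
features = log_mel_frame(frame, sample_rate)
```

In practice you would slide this over overlapping frames of the audio and feed the per-frame vectors to whatever does the recognition.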
What I want it for is automatically recognizing when a commercial comes on. The way this is done with ANNs is simply to train the net to output what the speaker says. So if you capture a bit of speech, the ANN outputs the same. When the speaker changes, the ANN is way off, due to the degrees of freedom in the net.
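A toy illustration of that idea, with loud caveats: instead of a real ANN this uses a linear autoencoder solved in closed form with an SVD (a stand-in I picked so the example is short and deterministic), and the "speakers" are synthetic random data, not speech. All names and sizes are hypothetical. The point is only that a model fit to reproduce one speaker's frames reconstructs that speaker well and a different one badly.

```python
import numpy as np

def fit_linear_autoencoder(frames, n_hidden):
    # Closed-form linear autoencoder: keep the top n_hidden principal directions.
    mean = frames.mean(axis=0)
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vt[:n_hidden]

def reconstruction_error(frames, mean, basis):
    # Encode (project onto basis), decode (project back), measure per-frame MSE.
    centered = frames - mean
    reconstructed = centered @ basis.T @ basis
    return np.mean((centered - reconstructed) ** 2, axis=1)

def make_speaker(rng, n_frames, n_features=26, n_latent=3):
    # Synthetic stand-in for one speaker's feature frames: a random
    # low-dimensional subspace plus a little noise. Not real speech.
    mixing = rng.normal(size=(n_latent, n_features))
    latent = rng.normal(size=(n_frames, n_latent))
    return latent @ mixing + 0.05 * rng.normal(size=(n_frames, n_features))

rng = np.random.default_rng(0)
speaker_a = make_speaker(rng, 400)
speaker_b = make_speaker(rng, 400)

# Fit on the first 300 frames of speaker A only.
mean, basis = fit_linear_autoencoder(speaker_a[:300], n_hidden=3)

# Held-out frames from A reconstruct well; frames from B do not.
err_same = reconstruction_error(speaker_a[300:], mean, basis).mean()
err_other = reconstruction_error(speaker_b, mean, basis).mean()
```

A jump in the reconstruction error over a run of frames would then be the cue that the speaker (or the program material) has changed; where to put the threshold is something you would have to tune on real recordings.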
It's unlikely anyone here is doing this, but if there is a sci.dsp.audio, someone there might have it.