From Speech Recognition to Speaker Recognition?

Please see this thread in the CMU Sphinx forum:

formatting link

TIA

-Ramon

Ramon F Herrera

It depends on what you are trying to achieve.

There are several different issues that can be lumped together in what you'd colloquially call "speaker recognition".

If you want to *authenticate* a particular set of utterances (i.e., claim with a high degree of confidence that this voice is that of speaker X and *only* speaker X), then the bar is considerably higher (esp. if you are trying to do so in a court of law -- a "beyond a reasonable doubt" sort of mentality).

OTOH, if you are just trying to indicate which of (many) "likely candidates" spoke those utterances (ignoring the possibility that it may be someone not included in your sample set), then it's much easier to do so.

OToOH, if you are just trying to determine if "this is still Fred", it's still another issue.
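To make the three cases concrete, here is a minimal sketch. It assumes each utterance has already been boiled down to a short feature vector; the vectors, the names in ENROLLED, the cosine-similarity score and the thresholds are all made up for illustration and are not taken from any real system.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical enrolled templates (one vector per known speaker).
ENROLLED = {"fred": [0.9, 0.1, 0.3], "bob": [0.2, 0.8, 0.5]}

def identify(sample, enrolled=ENROLLED):
    """'Which of the likely candidates?' Always names the closest one."""
    return max(enrolled, key=lambda name: cosine(sample, enrolled[name]))

def verify(sample, claimed, enrolled=ENROLLED, threshold=0.95):
    """'Is this X and only X?' Accept only above an (assumed) threshold."""
    return cosine(sample, enrolled[claimed]) >= threshold

def still_same_speaker(previous, current, threshold=0.9):
    """'Is this still Fred?' Compare against the last utterance heard."""
    return cosine(previous, current) >= threshold
```

For example, identify([0.85, 0.15, 0.3]) picks "fred", and verify() accepts that sample for "fred" but not for "bob". How high the threshold has to be is exactly the "how high is the bar" question.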

E.g., I use speech recognition and speaker identification ("which of many likely candidates"), here. Because I can train the system on the speech samples I intend for it to "recognize" (i.e., I have access to all of the "likely candidates" on a reasonably continuous basis), and, because that list is relatively *short* (and contains cooperative individuals), I can leverage "identification" as one biometric that factors into "authentication" by quizzing the speaker -- i.e., have *some* confidence that "this is Bob" and improve on that by forcing Bob to utter shared secrets... even if those secrets aren't closely held (i.e., the pool of speakers may all know a single shared secret but, as long as any potential impersonator doesn't learn it...)

[In reality, I've adopted a more complex solution to avoid the risk of playback attacks]

In the case cited, if you want to believe the phone recording is from the individual in question, you can use "identification" to further enhance your convictions (find a bunch of similar sounding voices). In the *absolute* case, the bar is much higher. Esp. as speech patterns can change as the anatomy "ages". I doubt anyone with enough resources is going to invest the effort just to prove a man a liar... (what's the payoff?) The folks who want to believe him will continue to do so; the folks who consider him a liar will likewise continue to do so.
Don Y

Thanks, Don!

I am slowly but finally getting expert answers to my 1,000 questions...

Please don't go too far. I will be back.

-Ramon

ps: Click into the folder "Learning Material" here:

formatting link
[In Google Drive, you only click ONCE]

This is an educational experience, for the layman.

Do you have any recommendations? Papers, presentations?

Ramon F Herrera

Please read my posts here:

formatting link

and here:

formatting link

What I am trying to do is a shot across the bow of the USS O'Reilly. I am furious at the way this "Pointy-Haired Boss" has insulted all of us who are in the technology business (or in it for fun).

This should be the way of We The People telling him

"WE ARE ONTO YOU, O'REILLY!"

-Ramon

Ramon F Herrera

Well, #562 is "blue". And, I won't swear to it, but I *think* #844 is pi/4. Number 879 is so far above my head that I won't even wager a guess! :-/

I'll try to take a look later this evening. Today is pro bono work.

Do you want to understand it from a technical perspective? Or, more intuitive: "grok"?

What's your "application" (even if just abstract learning)?

Don Y

As to your original question:

There is a lot of literature available on the subjects of:

- Speech Recognition (what is this person likely trying to say)

- Speaker Recognition/Identification (who is this)

- Speaker Authentication (who is this, *really*) etc.

And, ways of *spoofing* each of these algorithms.

E.g., Speaker Authentication suggests that you will use the knowledge (result) that you obtain to perform an action that the particular individual is AUTHORIZED to do. So, there is an incentive for folks to want to be able to convince you that "he" is speaking -- even if he is not!

Most biometric authentication mechanisms fall down because of relatively simple "attacks". E.g., a fingerprint reader can be fooled with an image of the "valid" party's fingerprint! Some may require a heat source to also be present (to verify that a warm-blooded animal is presenting the fingerprint!). Or, even evidence of a *pulse*.

The same sort of thing applies to speech: play a recording of the individual speaking the "password"; or synthesize a similar voice with which you can "dictate" (via keypad or speech-to-speech transcoding) the expected reply. The latter is what it takes to beat a challenge-response system: "Hello, Mr. Herrera. Could you please tell me the name of the object being displayed on the screen in front of you?" (You have 2 seconds to reply -- so any "directions" you have to give to a "device" that spoofs the proper voice must be efficient!)
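A rough sketch of the timed challenge-response idea, for what it's worth. The object list, the function names and the dictionary layout are hypothetical; the 2-second limit is the one from the example prompt. It assumes the reply has already been transcribed to text by a speech recognizer, and that the voice itself is checked separately.

```python
import secrets

TIME_LIMIT_S = 2.0  # reply window from the example prompt

# Hypothetical pool of objects that could be shown on the screen.
OBJECTS = ["teapot", "bicycle", "ladder", "umbrella"]

def issue_challenge(now):
    """Pick an unpredictable object to display and note when it was shown."""
    return {"expected": secrets.choice(OBJECTS), "issued_at": now}

def check_reply(challenge, reply_text, replied_at, limit=TIME_LIMIT_S):
    """Accept only a correct answer that arrives inside the time window."""
    elapsed = replied_at - challenge["issued_at"]
    on_time = 0.0 <= elapsed <= limit
    correct = reply_text.strip().lower() == challenge["expected"]
    return on_time and correct
```

A replayed recording fails because the expected word changes every time; a synthesized voice has to be "driven" fast enough to beat the clock. Timestamps are passed in as plain numbers (e.g., from time.monotonic()) so the logic is easy to exercise.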

Most "recognizers" extract some set(s) of features from the thing they are trying to recognize (speech, voice, printed text, "gestures", etc.) and, based on those features, make a "best guess" as to which (of possibly *many*) candidates is most likely. In a sense, this is how we handle these tasks as human beings: a "celebrity impersonator" sounds *like* (but not *identical* to) the person he/she is impersonating. And, many also rely on non-verbal cues to bias our beliefs (body movements, etc.) where the pure audio rendering would not be convincing enough. [E.g., anyone impersonating John Wayne instinctively takes his thumbs-in-belt cowboy/toughguy stance in the process. Elvis impersonators always have to have jet-black, *big* hair, etc.]
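As a toy illustration of "extract features, then make a best guess": the sketch below uses two deliberately crude features (RMS energy and zero-crossing rate) on synthetic tones. Real recognizers use much richer spectral features; the function names and the test tones here are assumptions for the sake of the example.

```python
import math

def features(samples):
    """Two crude features: RMS energy and zero-crossing rate."""
    energy = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return (energy, crossings / (len(samples) - 1))

def tone(freq_hz, amplitude, n=800, rate=8000):
    """Synthetic stand-in for a voice: a pure sine wave."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / rate)
            for i in range(n)]

def best_guess(samples, candidates):
    """Name the candidate whose stored features are nearest to this input."""
    f = features(samples)
    return min(candidates, key=lambda name: math.dist(f, candidates[name]))

# Hypothetical "enrolled" candidates, described only by their features.
CANDIDATES = {
    "low_voice": features(tone(100, 0.5)),
    "high_voice": features(tone(400, 0.5)),
}
```

A 120 Hz tone is not identical to either candidate but lands nearest "low_voice", which is the impersonator effect in miniature: close enough on the chosen features wins.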

In my case, I use "speaker identification" not as an authentication mechanism but, rather, as a convenience mechanism: *who* is issuing this command (e.g., "Turn on the radio" will result in a different channel being "tuned" depending on who is asking for it.)
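Something like this, in sketch form. The speaker names, station names and the command string are invented for the example; the speaker label is assumed to come from whatever identification step runs first.

```python
# Hypothetical per-speaker preferences.
PREFERRED_STATION = {"don": "classical", "jane": "news"}
DEFAULT_STATION = "last station used"

def handle_command(speaker, command):
    """Same spoken command, different result depending on who said it."""
    if command.strip().lower() == "turn on the radio":
        station = PREFERRED_STATION.get(speaker, DEFAULT_STATION)
        return f"radio tuned to {station}"
    return "unknown command"
```

A wrong guess about the speaker here only costs a wrong radio station, which is why identification alone is good enough for "convenience" uses.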

When I need "authentication", the speaker recognition aspect is mainly used as a first-level mechanism to weed out folks who shouldn't even be *trying* to access. And, for those who should (including folks who are trying to *spoof* those folks), it acts to select which secondary mechanisms should be used *for* that individual AND what capabilities they should be allowed to access.

E.g., if the burglar alarm is blaring (annoying the neighbors), then certain of those neighbors can command it *off*. "You", OTOH, can't. Or, if one of the irrigation lines ruptures while I'm out of town (resulting in a "geyser" in the yard), a cooperative neighbor can override the irrigation system (to save me some money, fines, etc.)
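In table form, the idea might look like the sketch below. The people, the names of the secondary checks and the capability names are all hypothetical; it only shows the shape of "identify first, then pick the secondary checks and the allowed actions for that individual".

```python
# Hypothetical policy: per-speaker secondary checks and capabilities.
POLICY = {
    "owner": {
        "secondary": {"shared_secret", "challenge_response"},
        "capabilities": {"alarm_off", "irrigation_override", "unlock_door"},
    },
    "jane": {
        "secondary": {"shared_secret"},
        "capabilities": {"alarm_off", "irrigation_override"},
    },
}

def authorize(speaker, action, passed_checks):
    """Allow an action only for a known speaker who passed the checks
    required for that speaker and who holds the capability."""
    entry = POLICY.get(speaker)
    if entry is None:
        return False  # voice not recognized: weeded out at the first level
    if not entry["secondary"] <= set(passed_checks):
        return False  # a required secondary mechanism was not satisfied
    return action in entry["capabilities"]
```

So "jane" can silence the alarm after answering the shared secret, but she cannot unlock the door, and an unrecognized voice never gets as far as the secondary checks.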

Don Y

And who said that we geeks don't know how to have fun! ... ;-)

They don't call it Amusenet for nothing!

What you are witnessing is a method that I have used many times throughout my career. Come to think of it, I should patent it... :-/

It is a solution, looking for a problem: the case in question is an excuse, a springboard to immerse myself in a technology that is darn hard and fascinating.

Here's some background info:

formatting link

I just joined the "Alize Biometry platform" group on LinkedIn:

formatting link

formatting link

and am making (finally!!!) some measurable progress.

Regards,

-Ramon

Ramon F Herrera

(sigh) Pity the folks who see this as *work*! :-/

I approach things that interest me from a different angle: what sorts of technology could I apply to this otherwise difficult/poorly solved problem? In my case, attacking the premise that designing for accessibility has little/no cost. Anyone who's thought about it realizes this is just a naive dismissal of the complex issues in man-machine interactions: as if one could just as easily switch *between* (i.e., on a per user basis) visual, aural, haptic, etc. input *and* output modalities with no measurable hardware/software/development costs.

If so, why can't I just TALK to my thermostat? Or, have it talk back to me?? And, if it is inappropriate for me (or it) to be speaking aloud (middle of a conference), shouldn't it be able to adopt some *other* I/O modality on-the-fly? As if I suddenly was unable to speak/hear, as well??

This naturally leads to things like speech recognition, gesture recognition, etc. Adapting to the needs of multiple concurrent users means things like speaker recognition/identification, etc.

So, "hard and fascinating" is a fitting assessment of the task (but, at the same time, infinitely challenging because it transcends the technological issues and is intimately involved in the human factors aspect). And, one where you can never really be "done", as there are myriad "things" with which a human could interface/interact.

Yeah, I'd read that, previously.

Cool! You will find all things speech related to be incredibly resource (computationally, etc.) intensive. And, very "disappointing" -- like a calculator that responds "somewhere between 10 and 20" when tasked with "3 x 6".

OTOH, if (as the O'Reilly example suggests) you are willing to work in an "off-line" (non-interactive/real-time) mode, you can throw as many resources at it as fit your budget and wait accordingly.

But, as to the original goal stated up-thread, I think you'll find it a lot of effort that will just have the "yes" camp equally committed to "yes" as before your effort; and the "no" camp just as committed to "no".

[For an interesting exercise, you might consider looking into the total absence of "scientific evidence" behind many of the things that we take as gospel -- esp in our legal system wrt "evidence". E.g., are fingerprints *really* unique? If so, what *ensures* this? How much of what we have grown to rely on is based on smoke-and-mirrors?]

I prefer more gratifying results: having a group of neighbors "talk to" a box and having that box display their name produces intense excitement among them (even though it's not really solving the problem that they naively *think* it is solving: i.e., it would just as easily MIS-identify a stranger as one of them!). So, it's relatively easy for them to then imagine how I can let "Jane" turn off the alarm/siren while not responding to pleas to do so from "Bob".
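The mis-identification is easy to show with a sketch. The 2-D "voice features", the neighbor names and the distance limit are invented; the point is only that a box which always picks the nearest enrolled voice will happily name a stranger, while one with a rejection limit can answer "nobody I know".

```python
import math

# Hypothetical enrolled neighbors, each reduced to a 2-D feature point.
NEIGHBORS = {"jane": (1.0, 2.0), "bob": (4.0, 1.0)}

def closed_set_name(sample):
    """Always display the nearest enrolled neighbor, however far away."""
    return min(NEIGHBORS, key=lambda n: math.dist(sample, NEIGHBORS[n]))

def open_set_name(sample, max_distance=1.0):
    """Display a name only if the nearest neighbor is close enough."""
    name = closed_set_name(sample)
    if math.dist(sample, NEIGHBORS[name]) <= max_distance:
        return name
    return None  # treat as a stranger
```

With these numbers, a sample at (9.0, 9.0) is nowhere near anyone, yet closed_set_name() still reports "bob"; open_set_name() returns None for it. Choosing max_distance trades false accepts against false rejects, and that trade-off is where the real work is.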

Good luck!

Don Y
