From Speech Recognition to Speaker Recognition?

Ramon F Herrera 11 years ago

Please see this thread in the CMU Sphinx forum:

formatting link

TIA

-Ramon

Don Y 11 years ago

It depends on what you are trying to achieve.

There are several different issues that can be lumped together in what you'd colloquially call "speaker recognition".

If you want to *authenticate* (i.e., claim with a high degree of confidence that this voice is that of speaker X and *only* speaker X) a particular set of utterances, then the bar is considerably higher (esp if you are trying to do so in a court of law -- "beyond a reasonable doubt" sort of mentality)

OTOH, if you are just trying to indicate which of (many) "likely candidates" spoke those utterances (ignoring the possibility that it may be someone not included in your sample set), then it's much easier to do so.

OToOH, if you are just trying to determine if "this is still Fred", it's still another issue.

E.g., I use speech recognition and speaker identification ("which of many likely candidates"), here. Because I can train the system on the speech samples I intend for it to "recognize" (i.e., I have access to all of the "likely candidates" on a reasonably continuous basis), and, because that list is relatively *short* (and contains cooperative individuals), I can leverage "identification" as one biometric that factors into "authentication" by quizzing the speaker -- i.e., have *some* confidence that "this is Bob" and improve on that by forcing Bob to utter shared secrets... even if those secrets aren't closely held (i.e., the pool of speakers may all know a single shared secret but, as long as any potential impersonator doesn't learn it...)

[In reality, I've adopted a more complex solution to avoid the risk of playback attacks]

In the case cited, if you want to believe the phone recording is from the individual in question, you can use "identification" to further enhance your convictions (find a bunch of similar sounding voices). In the *absolute* case, the bar is much higher. Esp as speech patterns can change as the anatomy "ages". I doubt anyone with enough resources is going to invest the effort just to prove a man a liar... (what's the payoff?) The folks who want to believe him will continue to do so; the folks who consider him a liar will likewise continue to do so.

Ramon F Herrera 11 years ago

Thanks, Don!

I am slowly but finally getting expert answers to my 1,000 questions...

Please don't go too far. I will be back.

-Ramon

ps: Click into the folder "Learning Material" here:

formatting link

[In Google Drive, you only click ONCE]

This is an educational experience, for the layman.

Do you have any recommendations? Papers, presentations?

Ramon F Herrera 11 years ago

Please read my posts here:

formatting link

and here:

formatting link

What I am trying to do is a shot across the bow of the USS O'Reilly. I am furious at the way this "Pointy-Haired Boss" has insulted all of us who are in the technology business (or for fun).

This should be the way of We The People telling him

"WE ARE ONTO YOU, O'REILLY!"

-Ramon

rev.11d.meow 11 years ago

We need Bill O'Reilly, for someone to point and laugh at.

Don Y 11 years ago

Well, #562 is "blue". And, I won't swear to it, but I *think* #844 is pi/4. Number 879 is so far above my head that I won't even wager a guess! :-/

I'll try to take a look later this evening. Today is pro bono work.

Do you want to understand it from a technical perspective? Or, more intuitive: "grok"?

What's your "application" (even if just abstract learning)?

krw 11 years ago

I see Alinsky is alive and well.

Bill Sloman 11 years ago

Krw sees what he wants to see. Reality doesn't deliver as often as he'd like so his imagination works overtime.

Bill Sloman, Sydney

Don Y 11 years ago

As to your original question:

There is a lot of literature available on the subjects of:

- Speech Recognition (what is this person likely trying to say)

- Speaker Recognition/Identification (who is this)

- Speaker Authentication (who is this, *really*) etc.

And, ways of *spoofing* each of these algorithms.

E.g., Speaker Authentication suggests that you will use the knowledge (result) that you obtain to perform an action that the particular individual is AUTHORIZED to do. So, there is an incentive for folks to want to be able to convince you that "he" is speaking -- even if he is not!

Most biometric authentication mechanisms fall down because of relatively simple "attacks". E.g., a fingerprint reader can be fooled with an image of the "valid" party's fingerprint! Some may require a heat source to also be present (to verify that a warm-blooded animal is presenting the fingerprint!). Or, even evidence of a *pulse*.

The same sort of thing applies to speech: play a recording of the individual speaking the "password"; or synthesize a similar voice with which you can "dictate" (keypad or speech-to-speech transcoding) the expected reply. For challenge-response systems: "Hello, Mr. Herrera. Could you please tell me the name of the object being displayed on the screen in front of you?" (You have 2 seconds to reply -- so any "directions" you have to give to a "device" that spoofs the proper voice must be efficient!)
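As a rough illustration of the timed challenge-response idea, here is a minimal Python sketch. The prompt table, the function names, and the way timestamps are passed in are all assumptions made for the example; the 2 second limit is the figure mentioned above. It checks only the content and timing of the reply, not the voice itself.

```python
import secrets
import time

# Hypothetical prompt table: what is shown on screen -> expected spoken answer.
PROMPTS = {
    "picture of a cat": "cat",
    "picture of a bicycle": "bicycle",
    "picture of a teapot": "teapot",
}


def issue_challenge(prompts, now=None):
    """Pick an unpredictable prompt and record when it was issued."""
    prompt = secrets.choice(list(prompts))
    issued_at = time.monotonic() if now is None else now
    return prompt, issued_at


def check_response(prompts, prompt, issued_at, answer, answered_at, limit_s=2.0):
    """Accept only a correct answer that arrives within the time limit."""
    on_time = (answered_at - issued_at) <= limit_s
    correct = answer.strip().lower() == prompts[prompt]
    return on_time and correct
```

Because the prompt is chosen at random, a pre-recorded "password" is unlikely to match, and the short deadline leaves little time to steer a voice-spoofing device.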

Most "recognizers" extract some set(s) of features from the thing they are trying to recognize (speech, voice, printed text, "gestures", etc.) and, based on those features, make a "best guess" as to which (of possibly *many*) candidates is most likely. In a sense, this is how we handle these tasks as human beings: a "celebrity impersonator" sounds *like* (but not *identical* to) the person he/she is impersonating. And, many also rely on non-verbal cues to bias our beliefs (body movements, etc.) where the pure audio rendering would not be convincing enough. [E.g., anyone impersonating John Wayne instinctively takes his thumbs-in-belt cowboy/toughguy stance in the process. Elvis impersonators always have to have jet-black, *big* hair, etc.]
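The "extract features, then pick the best guess" step can be sketched in a few lines of Python. This is only an illustration: the three-element feature vectors are made-up numbers, the names are placeholders, and cosine similarity is one common choice of score, not necessarily what any particular recognizer uses.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two feature vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(features, enrolled):
    """Return (name, score) for the enrolled candidate most similar to features."""
    return max(
        ((name, cosine_similarity(features, ref)) for name, ref in enrolled.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical enrolled feature vectors (illustrative values only).
enrolled = {
    "Jane": [0.9, 0.1, 0.3],
    "Bob": [0.2, 0.8, 0.5],
    "Fred": [0.4, 0.4, 0.9],
}

name, score = identify([0.85, 0.15, 0.35], enrolled)
print(name, round(score, 3))
```

Note that `identify` always returns *somebody* from the enrolled list; the score only says how good the best guess was.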

In my case, I use "speaker identification" not as an authentication mechanism but, rather, as a convenience mechanism: *who* is issuing this command (e.g., "Turn on the radio" will result in a different channel being "tuned" depending on who is asking for it.)

When I need "authentication", the speaker recognition aspect is mainly used as a first level mechanism to weed out folks who shouldn't even be *trying* to access . And, for those who should (including folks who are trying to *spoof* those folks), it acts to select which secondary mechanisms should be used *for* that individual AND what capabilities they should be allowed to access.

E.g., if the burglar alarm is blaring (annoying the neighbors), then certain of those neighbors can command it *off*. "You", OTOH, can't. Or, if one of the irrigation lines ruptures while I'm out of town (resulting in a "geyser" in the yard), a cooperative neighbor can override the irrigation system (to save me some money, fines, etc.)
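One way to picture identification acting as a first-level filter that then selects the secondary check and the allowed capabilities is a simple lookup table. The sketch below is an assumption-laden illustration: the speaker names, command names, and policy layout are all hypothetical, and the secondary check is reduced to a boolean.

```python
# Hypothetical policy: identified speaker -> secondary check and allowed commands.
POLICY = {
    "Owner": {
        "secondary": "shared secret",
        "allowed": {"alarm_off", "irrigation_off", "radio_on"},
    },
    "Jane": {
        "secondary": "shared secret",
        "allowed": {"alarm_off", "irrigation_off"},
    },
}


def secondary_check_for(speaker):
    """Which follow-up mechanism applies to this speaker (None if not enrolled)."""
    entry = POLICY.get(speaker)
    return entry["secondary"] if entry else None


def authorize(speaker, command, passed_secondary):
    """First level: is the speaker known? Then: capability, then secondary check."""
    entry = POLICY.get(speaker)
    if entry is None:
        return False  # not one of the enrolled voices: do not even try
    if command not in entry["allowed"]:
        return False  # known voice, but not permitted to issue this command
    return passed_secondary
```

With this table, `authorize("Jane", "alarm_off", True)` succeeds, while the same request from an unenrolled voice is rejected before any secondary check is attempted.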

Bill Sloman 11 years ago

Don, you need to look at what the OP said he was trying to do when he initiated this thread, which was to identify the speaker on a thirty year-old cassette recording as Mr. Bill O'Reilly, who is now a Fox News network star.

He's not interested in the ways you use speaker identification, and I doubt if anybody else is either - it's not as if we expect you to be doing anything interesting (and what I've snipped suggests that you aren't).

Bill Sloman, Sydney

Ramon F Herrera 11 years ago

And who said that we geeks don't know how to have fun! ... ;-)

They don't call it Amusenet, for nothing!

What you are witnessing is a method that I have used many times along my career. Come to think of it, I should patent it... :-/

It is a solution, looking for a problem: the case in question is an excuse, a springboard to immerse myself in a technology that is darn hard and fascinating.

Here's some background info:

formatting link

I just joined the "Alize Biometry platform" group in LinkedIn:

formatting link

and am making (finally!!!) some measurable progress.

Regards,

-Ramon

rev.11d.meow 11 years ago

Have you searched the dark web for bill o'reilly content?

There's TONS out there to snag and analyze.

Ramon F Herrera 11 years ago

Esteemed krw:

Here's the Herrera Theorem, fully proved and uncontested:

(a) When it comes to the Truth, the *only* possible reference is our Universities.

(b) The better the University, the more Liberal.

Q.E.D.

-Ramon The Truthful

Ramon F Herrera 11 years ago

Actually, I am using the O'Reilly case as an excuse to learn:

- Speaker Verification

- Analysis of the signal in Phone Conversations: delay, echo, etc. (*)

- Building websites (boy, is my face red on this one!)

-Ramon

(*) See my thread:

"Is it possible to determine whether a phone call is local or long distance by analyzing the audio?"

Megathanks to Jeff Liebermann !!

I only have one paper about this subject, though:

formatting link

[Click into "Learning Material" and then on:

"The Bergeron Method: A Graphic Method for Determining Line Reflections in Transient Phenomena"]

krw 11 years ago

You're lying, but lefties are good at it.

What utter bullshit.

No Ramon, the self-important prick and Slowman clone.

Don Y 11 years ago

(sigh) Pity the folks who see this as *work*! :-/

I approach things that interest me at a different angle: what sorts of technology could I apply to this otherwise difficult/poorly solved problem. In my case, attacking the premise that designing for accessibility has little/no cost. Anyone who's thought about it realizes this is just a naive dismissal of the complex issues in man-machine interactions: as if one could just as easily switch *between* (i.e., on a per user basis) visual, aural, haptic, etc. input *and* output modalities with no measurable hardware/software/development costs.

If so, why can't I just TALK to my thermostat? Or, have it talk back to me?? And, if it is inappropriate for me (or it) to be speaking aloud (middle of a conference), shouldn't it be able to adopt some *other* I/O modality on-the-fly? As if I suddenly was unable to speak/hear, as well??

This naturally leads to things like speech recognition, gesture recognition, etc. Adapting to the needs of multiple concurrent users means things like speaker recognition/identification, etc.

So, "hard and fascinating" is a fitting assessment of the task (but, at the same time, infinitely challenging because it transcends the technological issues and is intimately involved in the human factors aspect). And, one where you can never really be "done" as there are myriad "things" to which a human could interface/interact.

Yeah, I'd read that, previously.

Cool! You will find all things speech related to be incredibly resource (computationally, etc.) intensive. And, very "disappointing" -- like a calculator that responds "somewhere between 10 and 20" when tasked with "3 x 6".

OTOH, if (as the O'Reilly example suggests) you are willing to work in an "off-line" (non-interactive/real-time) mode, you can throw as many resources at it as fit your budget and wait accordingly.

But, as to the original goal stated up-thread, I think you'll find it a lot of effort that will just have the "yes" camp equally committed to "yes" as before your effort; and the "no" camp just as committed to "no".

[For an interesting exercise, you might consider looking into the total absence of "scientific evidence" behind many of the things that we take as gospel -- esp in our legal system wrt "evidence". E.g., are fingerprints *really* unique? If so, what *ensures* this? How much of what we have grown to rely on is based on smoke-and-mirrors?]

I prefer more gratifying results: having a group of neighbors "talk to" a box and having that box display their name produces intense excitement among them (even though it's not really solving the problem that they naively *think* it is solving: i.e., it would just as easily MIS-identify a stranger as one of them!). So, it's relatively easy for them to then imagine how I can let "Jane" turn off the alarm/siren while not responding to pleas to do so from "Bob".
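The mis-identification of a stranger follows directly from always picking the closest enrolled voice. A small Python sketch, with made-up two-element feature vectors and a hypothetical distance threshold (`max_distance` is an assumption added for contrast, not something described above), shows the difference:

```python
import math


def closest_speaker(features, enrolled):
    """Return (name, distance) of the nearest enrolled feature vector."""
    return min(
        ((name, math.dist(features, ref)) for name, ref in enrolled.items()),
        key=lambda pair: pair[1],
    )


def identify_closed_set(features, enrolled):
    """Always answers with one of the enrolled names, however poor the match."""
    return closest_speaker(features, enrolled)[0]


def identify_open_set(features, enrolled, max_distance):
    """Answers None when even the best match is too far away."""
    name, distance = closest_speaker(features, enrolled)
    return name if distance <= max_distance else None


# Hypothetical enrolled voices and a stranger (illustrative values only).
enrolled = {"Jane": [1.0, 2.0], "Bob": [4.0, 1.0]}
stranger = [9.0, 9.0]

print(identify_closed_set(stranger, enrolled))     # names an enrolled speaker anyway
print(identify_open_set(stranger, enrolled, 1.5))  # None: nobody is close enough
```

Choosing the threshold is the hard part: too loose and strangers get in, too tight and "Jane" with a cold gets rejected.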

Good luck!

rev.11d.meow 11 years ago

Just use a lie detector. If it shows positive, it's Bill O'Reilly.

Bill Sloman 11 years ago

What lefties are actually good at is saying stuff that krw refuses to believe. He sees it as "lies" but since his idea of evidence is "what krw thinks he knows" and his idea of "proof" is what everybody else calls reiteration, he's not to be taken seriously.

Krw is confused. Almost everybody disagrees with him - the exceptions are a few right-wing nitwits who have similar delusions - and any rational person will end up pointing out that krw is a deluded idiot, if they can be bothered to take him seriously enough to post a response. There are lots of different ways of being rational, and Ramon has been around for even longer than I have, which probably precludes him having cloned my routine for being rude about krw's obvious failings.

Bill Sloman, Sydney

Bill Sloman 11 years ago

Lie detectors are notoriously unreliable.

Bill Sloman, Sydney

Maynard A. Philbrook Jr. 11 years ago

so you failed a few? I can understand that.

Jamie
