Comparing similar audio files, FFT?

- kieran

Wed, Oct 8, 2008 2:56 PM

Hello, I am trying to compare two similar audio files (WAV). From what I have read, I need to sample both audio files at certain frequencies, run these through an FFT, and then compare the results. Can anyone advise me if this is the correct approach and also describe the steps I need to take to get to the stage where I can compare the files? TIA, Kieran

- Steve

Wed, Oct 8, 2008 3:37 PM

That is one way to compare the files. But what are you trying to do with the comparison?

You need to do classical DSP work.

Use a low-pass filter to prevent aliasing. Take a power-of-two number of samples (e.g. 128, 256, 512, 1024 ...). Run an FFT on the samples; this will give you frequency-domain data from your time-domain data. Each data point is referred to as a bin, and the frequencies that fall into a bin depend on the sample rate and the number of samples in the FFT.

Plot the spectrum of the 2 audio files.

Even Excel can do FFTs, but it is not obvious what it is doing unless you are familiar with FFTs. Maybe there is some free FFT software you can grab.
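
The steps above can be sketched in a few lines of Python. This is only an illustration, assuming NumPy is available: the function name `block_spectrum`, the Hann window, and the synthetic 1 kHz tone standing in for a WAV file are all choices made for this example, not something from the thread.

```python
import numpy as np

def block_spectrum(samples, sample_rate, n_fft=1024):
    """Return (bin_frequencies_hz, magnitudes) for the first n_fft samples."""
    block = np.asarray(samples[:n_fft], dtype=float)
    if block.size < n_fft:
        raise ValueError("need at least n_fft samples")
    window = np.hanning(n_fft)                 # taper the block to reduce spectral leakage
    spectrum = np.fft.rfft(block * window)     # FFT of a real-valued block
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)  # centre frequency of each bin
    return freqs, np.abs(spectrum)

# Synthetic stand-in for a recording: a 1 kHz tone sampled at 8 kHz (assumed values).
sample_rate = 8000
t = np.arange(4096) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)

freqs, mags = block_spectrum(tone, sample_rate)
bin_width = sample_rate / 1024                 # spacing between adjacent bins, in Hz
peak_hz = freqs[np.argmax(mags)]
print(f"bin width {bin_width} Hz, strongest bin at {peak_hz} Hz")
```

The bin spacing is the sample rate divided by the FFT length, which is why both values determine which frequencies land in a given bin. Plotting `mags` against `freqs` for each file gives the two spectra to compare.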

- christofire

Wed, Oct 8, 2008 6:01 PM

Take a look at

formatting link

... the one that tells the truth about green pens and all that!

Chris

- Jan Panteltje

Wed, Oct 8, 2008 6:17 PM

On a sunny day (Wed, 8 Oct 2008 19:01:32 +0100) it happened "christofire" wrote in :

Looks like a 2008 copycat of what I wrote around 2000 (Linux):

formatting link

I used this to cancel common background in translated tracks.

It also aligns, matches amplitude, and subtracts. I wrote quite a few more audio utilities actually; most are here:

formatting link

You would still need to understand digital audio and audio in general to use these of course.

- JosephKK

Thu, Oct 9, 2008 3:39 AM

Maybe the other suggestions are good enough. Personally, I suspect that what you want is tempo-adjusting software with at least one dial to keep the two files synchronized: put one signal in each ear and listen. The brain's software is much better than any available package.

- miso

Thu, Oct 9, 2008 6:59 AM

You failed to indicate the criteria of the comparison. Just what in these files do you want to compare?

- kieran

Thu, Oct 9, 2008 11:22 AM

Hi TIA, this seems to be a good approach. What I am trying to do is to automate the comparison of audio files. The two files I will be comparing will be audio recorded from an IVR system. The first file will be a high-quality recording, checked by ear; the second file will be recorded every hour to ensure the IVR is working correctly, i.e. if the two files sound similar, I can consider the IVR to be working. I will give this a go and let you know the results. Thanks for your help, Kieran

- Steve

Thu, Oct 9, 2008 1:19 PM

Could you put a test mode in your IVR? Perhaps have it respond with something easy to detect like DTMF?

...or perhaps figure out a way to subtract one recording from the other; apart from some gain adjustment and phase offset, the result should be close to silence. Calculate the amplitude of the result and check that it is low.
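
One possible reading of the subtraction idea, sketched in Python with NumPy. Several things here are assumptions made for illustration: the "phase offset" is treated as a plain time delay found from the cross-correlation peak, the gain is fitted by least squares, and the function name `subtract_aligned` and the random test signals are hypothetical.

```python
import numpy as np

def subtract_aligned(reference, test):
    """Align test to reference, match gain, subtract.

    Returns (lag_samples, gain, residual_ratio), where residual_ratio is
    RMS(residual) / RMS(aligned test); values near 0 mean "close to silence".
    """
    ref = np.asarray(reference, dtype=float)
    tst = np.asarray(test, dtype=float)

    # The cross-correlation peak gives the delay of tst relative to ref.
    xcorr = np.correlate(tst, ref, mode="full")
    lag = int(np.argmax(xcorr)) - (len(ref) - 1)

    # Drop the leading samples of whichever signal starts early.
    if lag >= 0:
        tst = tst[lag:]
    else:
        ref = ref[-lag:]
    n = min(len(ref), len(tst))
    ref, tst = ref[:n], tst[:n]

    # Least-squares gain g that minimises ||tst - g * ref||.
    ref_energy = np.dot(ref, ref)
    gain = np.dot(tst, ref) / ref_energy if ref_energy > 0 else 0.0

    residual = tst - gain * ref
    tst_rms = np.sqrt(np.mean(tst ** 2))
    ratio = np.sqrt(np.mean(residual ** 2)) / tst_rms if tst_rms > 0 else float("inf")
    return lag, float(gain), float(ratio)

# Synthetic example: the "hourly" recording is the master delayed by 37 samples,
# at half the level, with a little added noise (all values are made up).
rng = np.random.default_rng(0)
master = rng.standard_normal(2000)
delayed = np.concatenate([np.zeros(37), master])[:2000]
hourly = 0.5 * delayed + 0.001 * rng.standard_normal(2000)

lag, gain, ratio = subtract_aligned(master, hourly)
print(lag, round(gain, 3), ratio)
```

`np.correlate` computes the correlation directly, which gets slow for recordings of more than a few seconds; an FFT-based correlation would be the usual substitute for real files. What counts as a "low" residual ratio would have to be set by experiment.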

- whit3rd

Thu, Oct 9, 2008 10:54 PM

Well, yeah, but... what's the similarity criterion?

In some sense, an FFT will tell you the voice of the singer or the instrument(s) but might not distinguish multiple works of different composition performed on the same instrument. Similarly, a time/amplitude breakdown might pick up the 'Surprise' symphony easily from other works, but can't tell you whether it was performed by an orchestra or a kazoo band.

A two-minute selection from a CD has 10 million samples, and that means it selects a point in a 10-million-dimension vector space. What makes two such points similar?
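
The sample count can be checked with a line of arithmetic, assuming standard CD audio (44.1 kHz, two channels) and counting both channels:

```python
sample_rate_hz = 44_100      # CD audio sample rate
channels = 2                 # stereo; assumes both channels are counted
seconds = 2 * 60             # a two-minute selection

total_samples = sample_rate_hz * channels * seconds
print(total_samples)         # 10584000, i.e. roughly 10 million
```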

- Le Chaud Lapin

Fri, Oct 10, 2008 5:32 AM


There is a simple problem with this: there is no way to adjust the phase, because phase only makes sense in the context of periodic signals. A time-domain signal like the one above is not periodic, but one can pluck components from the frequency domain of each signal and look at their phases.

In other words, if a speaker is offered US$100 to create more or less the same sampled digital signal by speaking into the IVR, such that merely shifting signal2 a bit relative to signal1 aligns the signals properly for comparison, he or she will fail. The reason is that, even at the relatively low sample rate of 8 kHz, no human is able to begin speaking at just the right instant, let alone control the physiology of the speech path well enough to generate more or less the exact same signal. Any attempt to find out when a signal begins is hopeless in the time domain. Is it the first non-zero sample? The second? Third? Is that noise or voice? Is it when the "hump" is really high? Almost really high? One cannot know.

This is a classical problem in speech recognition and related areas. I responded to the OP in comp.dsp with an outline of what he needs to do:

formatting link

-Le Chaud Lapin-

- Steve

Fri, Oct 10, 2008 2:44 PM

I didn't think he was trying to use a human in this instance; I assumed the IVR is playing the exact same speech each time. So would you not be able to do a cross-correlation?
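
A rough sketch of what a cross-correlation check could look like in Python with NumPy. The function name `peak_normalized_xcorr`, the mean removal, and the normalisation by the two signal energies are illustrative choices, and the signals are synthetic stand-ins rather than IVR recordings.

```python
import numpy as np

def peak_normalized_xcorr(reference, test):
    """Return (score, lag): the peak of the energy-normalised cross-correlation
    and the lag (in samples) at which it occurs. The score lies in [-1, 1]."""
    ref = np.asarray(reference, dtype=float)
    tst = np.asarray(test, dtype=float)
    ref = ref - ref.mean()                    # remove any DC offset
    tst = tst - tst.mean()
    denom = np.sqrt(np.dot(ref, ref) * np.dot(tst, tst))
    if denom == 0.0:
        return 0.0, 0                         # at least one signal is silent
    xcorr = np.correlate(tst, ref, mode="full") / denom
    best = int(np.argmax(xcorr))
    return float(xcorr[best]), best - (len(ref) - 1)

rng = np.random.default_rng(42)
master = rng.standard_normal(2000)

# Same content, delayed by 20 samples and played back more quietly.
same_speech = 0.3 * np.concatenate([np.zeros(20), master[:-20]])
# Unrelated content of the same length.
other_audio = rng.standard_normal(2000)

good_score, good_lag = peak_normalized_xcorr(master, same_speech)
bad_score, bad_lag = peak_normalized_xcorr(master, other_audio)
print(good_score, good_lag)
print(bad_score, bad_lag)
```

Because the score is normalised, it does not depend on playback level, and the lag of the peak absorbs a constant delay between the two recordings. A pass/fail threshold on the score would still need to be chosen from real data.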


- Le Chaud Lapin

Tue, Oct 14, 2008 5:02 AM

Yes, I guess that would work too, as long as the signals are normalized first, as you pointed out in your 2nd post.

You got me thinking about the pros and cons of the cross-correlation method versus mean-squared-error method, and minimum distance estimator.
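
One relationship between two of these methods is easy to check numerically. For signals that are already aligned and scaled to unit energy, the summed squared error equals 2 minus twice the zero-lag correlation, so the two measures rank recordings identically in that case. The sketch below assumes NumPy; the helper `unit_energy` and the random signals are made up for the illustration.

```python
import numpy as np

def unit_energy(x):
    """Scale a signal so that the sum of its squared samples is 1."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)

rng = np.random.default_rng(1)
base = rng.standard_normal(1000)
a = unit_energy(base)
b = unit_energy(base + 0.2 * rng.standard_normal(1000))  # noisy copy of the same signal

rho = float(np.dot(a, b))                 # zero-lag correlation of the normalised signals
sq_err = float(np.sum((a - b) ** 2))      # summed squared error between them

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2<a, b> = 2 - 2*rho when both have unit energy.
print(rho, sq_err, 2.0 - 2.0 * rho)
```

The identity only holds after alignment and normalisation; without them the two measures can disagree, which is where the pros and cons start to differ.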

-Le Chaud Lapin-

- kieran

Wed, Oct 15, 2008 9:32 AM

Hi all, thanks for your posts. They have helped me a great deal and have definitely steered me in the right direction. Some more info: I should have explained that I am comparing the same recording of the voice, but the differences I am trying to identify are caused by interference from the mobile phone network, i.e. lost audio and noise. I will be listening to one of the samples (the master or reference) by ear to ensure the recording is clear and without interference. I will then record the same piece of audio at various times throughout the day and compare it to the master. The comparison should identify which recordings are of high quality (low interference) and which are of low quality (lots of interference and lost audio). Kieran

- JosephKK

Thu, Oct 16, 2008 3:37 AM

Ooh. In that case maybe you should look for audio forensics software. I hear diamond cut AC5 can be useful.

- JosephKK

Thu, Oct 16, 2008 3:46 AM

This appears to be the successor to the product I heard about:

formatting link