Suppose you have a recording of a human voice singing a song. The recording has been sped up or slowed down, so that both the tempo and the pitch have changed. The aim is to determine, as accurately as possible, how much you need to compress or expand the waveform (to speed it up or slow it down) in order to restore it to the pitch at which it was originally recorded.
The human voice has a limited range, so you can narrow the correction down to a plausible interval simply because most people cannot sing outside that range. You also know that the song is sung in tune, in equal temperament, so the restored pitches must align with a fixed set of defined notes.
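To make these two constraints concrete, here is a rough Python sketch of how they might be combined. It assumes you already have a list of detected note frequencies in Hz from some pitch tracker (not shown), that the tuning reference is A4 = 440 Hz, and that a plausible vocal range is roughly 80 to 1100 Hz. None of these are given in the problem, and the function names are made up for illustration.

```python
import math

A4_HZ = 440.0  # assumed tuning reference; not stated in the problem


def semitones_from_a4(freq_hz):
    """Distance from A4 in (possibly fractional) equal-tempered semitones."""
    return 12.0 * math.log2(freq_hz / A4_HZ)


def grid_offset_semitones(freqs_hz):
    """Shared deviation of the notes from the semitone grid, in (-0.5, 0.5].

    A circular mean is used because the deviation wraps around every semitone.
    """
    s = c = 0.0
    for f in freqs_hz:
        angle = 2.0 * math.pi * semitones_from_a4(f)
        s += math.sin(angle)
        c += math.cos(angle)
    return math.atan2(s, c) / (2.0 * math.pi)


def candidate_speed_factors(freqs_hz, low_hz=80.0, high_hz=1100.0, max_shift=12):
    """List (k, factor) pairs that put every note on the grid and in range.

    factor is the multiplier to apply to playback speed (and so to every
    frequency); k is the whole-semitone part of the correction, which the
    grid alone cannot determine. low_hz and high_hz are assumed bounds.
    """
    offset = grid_offset_semitones(freqs_hz)
    candidates = []
    for k in range(-max_shift, max_shift + 1):
        factor = 2.0 ** ((k - offset) / 12.0)
        if all(low_hz <= f * factor <= high_hz for f in freqs_hz):
            candidates.append((k, factor))
    return candidates


# Hypothetical input: A3, C4 and E4, all played 0.3 semitones sharp.
detected = [A4_HZ * 2.0 ** ((n + 0.3) / 12.0) for n in (-12, -9, -5)]
offset = grid_offset_semitones(detected)
candidates = candidate_speed_factors(detected)
```

Under these assumptions the grid constraint only fixes the fractional-semitone part of the correction, and the range constraint only bounds the whole-semitone part, so several candidate factors one semitone apart can remain.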
If you knew anything about the singer's ability, you might also be able to infer something from how strained each note sounds, but assume you have only the recording and no prior information about the singer.
How would you do it?