Synchronization reprise


OK, I've decided to code up an "experiment" to quantify the behaviour that I might expect from protocols *like* PTP, NTP, etc. in an "uncontrolled" environment (i.e., I place *no* restrictions on the network fabric or topology).

It appears that I don't need much in the way of OS services to support the protocol. Just a crude timing service to drive the generation of requests (and a precise timing service within which to measure packet delivery times).

It also appears that I need to put some hooks into the stack to *guarantee* some minimal QoS (i.e., so the time protocol itself has some minimum guarantees on network availability) but those should be easy.

I plan to implement at least two different approaches for each protocol:

- protocol sits in user land *above* the network stack

- (parts of) protocol sits in the stack itself

The latter will allow me to remove the latency and attendant jitter associated with the stack's performance.

For each of these, I plan on:

- using hardware timestamps

- ignoring hardware timestamps (i.e., just sample microtime)

The latter will tell me what support for hardware timestamps is really *worth* (hardware and software costs).

The "audio client" (discussed else-thread) seems like a nice quick way of prototyping this as it is a trivial project and, by design, doesn't have the bloat of typical network implementations (because it is resource starved :> )

I *think* this will also give me a nice "hook" to measure the degree of synchronization and how that changes over time (i.e., by turning off the filtering loop in the protocols). E.g., play a "synchronized" tone through each client and measure phase differences. Variations in phase (i.e., protocol jitter) would manifest as short term FM.
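That phase measurement can be sketched directly. Assuming each client plays a pure tone at a known frequency, demodulating each captured tone against a complex reference and comparing phases gives the inter-node phase difference; the function names and parameters here are hypothetical, not part of any protocol:

```python
import cmath
import math

def tone_phase(samples, freq_hz, rate_hz):
    """Phase (radians) of a pure tone at freq_hz, via complex
    demodulation.  Assumes `samples` spans an integer number of
    cycles of the tone."""
    acc = 0j
    for n, s in enumerate(samples):
        acc += s * cmath.exp(-2j * math.pi * freq_hz * n / rate_hz)
    return cmath.phase(acc)

def phase_difference(a, b, freq_hz, rate_hz):
    """Phase of tone `b` relative to tone `a`, wrapped to (-pi, pi]."""
    d = tone_phase(b, freq_hz, rate_hz) - tone_phase(a, freq_hz, rate_hz)
    return (d + math.pi) % (2 * math.pi) - math.pi
```

Sampling both clients' outputs and tracking `phase_difference()` over time would expose protocol jitter directly as phase wander (i.e., the short term FM mentioned above).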

OK so far?

Problem I foresee is coming up with a suitably aggressive environment to stress the protocols! The obvious weakness is the switches. Keeping traffic backed up on specific (switch) ports should allow me to increase the delay *through* that port to the "targeted" node that lies beyond. Likewise, *starving* traffic to other nodes will keep their latencies at a minimum.

But, I would expect the protocols to easily adapt to any steady state condition like this (though asymmetric Tx and Rx times -- possible on HDX *or* FDX -- might confound). Likewise, I would expect good behaviour in systems (NoW) with lots of entropy -- owing to the averaging characteristics of the protocols.
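The asymmetry worry can be made concrete. The standard two-way exchange (NTP and PTP's delay request-response both reduce to this) estimates offset on the assumption that forward and reverse one-way delays are equal; any asymmetry lands in the estimate as an error of half the difference, and no amount of averaging removes it if the asymmetry is steady state. A minimal sketch (the timestamps and delays are made-up numbers):

```python
def two_way_offset(t1, t2, t3, t4):
    """NTP/PTP-style estimate of (server clock - client clock):
    t1 = client send, t2 = server receive, t3 = server send,
    t4 = client receive.  Exact only for symmetric path delays."""
    return ((t2 - t1) + (t3 - t4)) / 2.0

def simulate_exchange(server_minus_client, d_fwd, d_rev, proc=0.001):
    """One exchange: t1/t4 read off the client clock, t2/t3 off the
    server clock, with one-way delays d_fwd and d_rev (seconds)."""
    t1 = 100.0
    t2 = (t1 + server_minus_client) + d_fwd   # arrives at server
    t3 = t2 + proc                            # server turnaround
    t4 = (t3 - server_minus_client) + d_rev   # arrives back at client
    return two_way_offset(t1, t2, t3, t4)
```

With symmetric 1 ms delays the estimate is exact; with 2 ms out and 0.5 ms back it is off by (2 - 0.5)/2 = 0.75 ms regardless of the true offset.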

I guess I need to study the time constants involved to determine what the pathological case would be and then control the environment to synthesize that condition?

Yet another case where the testing effort far exceeds the design/coding effort! :<

Reply to
D Yuniskis

On Wed, 20 Jan 2010 20:12:11 +0100, D Yuniskis wrote:

It might, depending on whether any of the underlying protocols are likely to coalesce samples.

The Internet? Even a relatively short path from e.g. home to work could be quite hostile.

Made with Opera's revolutionary e-mail program
Reply to
Boudewijn Dijkstra

Not sure I understand your comment. :<

I.e., I use the timing protocol to come up with a *time* reference. And, to tune my FLL (for rendering the audio).

*If* I synchronize my audio stream exactly with the time reference, then I can compare absolute phases of audio signals between "nodes".

But, even if I don't (synchronize), the phase difference should remain constant. Any frequency change would be attributable to a change in my FLL's frequency (which would mean the timing protocol has *told* it that it needs to make some compensation).

If I deliberately disable any averaging effects in the timing protocol, then these changes will be much more visible.

For example, I could drive one audio signal based on instantaneous (i.e., short term) timing protocol results and the other on results AS IF the protocol was doing some averaging or heuristic approach to managing short term "time jitter".
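For the "as if averaging" case, even a first-order exponential filter over the raw offset samples shows the effect. Real implementations use fancier servos and heuristics; this is just a stand-in to convey the idea (the alpha and the noise figures below are arbitrary):

```python
def ewma(samples, alpha=0.05):
    """First-order (exponentially weighted) smoothing of raw offset
    samples -- a stand-in for a protocol's filtering loop."""
    out, est = [], samples[0]
    for s in samples:
        est += alpha * (s - est)   # move a fraction of the way to s
        out.append(est)
    return out
```

Driving one channel from the raw samples and the other from `ewma(samples)` would make the jitter suppression audible as the difference in FM between the two tones.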

Hmmm... I will have to think that through. I was planning on tuning a VCO as my system clock so I could get precise (audio) sample rates... :(

But not repeatable. My goal in finding an aggressive environment is to be able to identify the characteristics of the environment to which the protocol/implementation is most vulnerable (i.e., so I could either modify the protocol to address those issues *or* verify that the deployment environment never exhibits those characteristics).
Reply to
D Yuniskis

Then surely it will be vulnerable to an unpredictable environment. :)

Reply to
Boudewijn Dijkstra
[attributions elided]

I think you are probably stuck thinking about this in terms of more conventional approaches. :>

First, imagine you *were* just "streaming audio" over the network. I suspect you think in terms of streaming to a "PC" (or *an* audio client) -- i.e., a box with 2 (or 4 or 5.1 or 7.1, etc.) "speakers" (or "audio out" signals).

Now, what happens when you have several of those boxes in a given listening space? E.g., a large auditorium? Or, an outdoor venue? You end up needing some way of synchronizing the sound *emitted* from each of those boxes (i.e., a streaming protocol will *deliver* the same data to each box "simultaneously" -- or, almost so -- but there is no guarantee that those boxes will emit the audio signals synchronously with respect to each other).

Granted, the difference may be small -- on the order of a millisecond or less (attributable to delay differentials at switches in the network) -- but, this can affect the psychoacoustical perception of the source material depending on where the listener is positioned.

Now, take this a step further: imagine driving a single "channel" from each "client". Imagine the "venue" is now something like a living room. I.e., the sound available from the "left client" is emitted by a device with no temporal relationship to the sound emitted by the "right client". (Compound this with "left rear", "right rear", "center", etc.).

A fixed, constant skew between the two (left / right) channels -- if it is small enough -- is perceived as a translation of the listener within the listening space (alternatively, a translation of one, or the other, of the drivers in that space). This can be compensated by adjusting "volume levels" to push the offending driver back to its "perceived" place, somewhat.

If the delay gets to be too large, then it "doesn't make sense", psychoacoustically -- the listening space doesn't "feel" right.

If there are temporal variations between the emitted sound (which is simply a manifestation of the digital data stream delivered to each client), then the sound appears to be frequency modulated (or, the driver appears to be "moving" wrt the listener).

Either a single audio *stream* can be delivered to all "drivers" and the audio stream's contents processed locally in a manner specific to that location (i.e., DSP at each "speaker stack"); or, you put all the smarts in the server and have it deliver *separate* data sequences to each "client" (in which case, the client needs only reproduce the data delivered to it -- but synchronous with some timebase!)

Now, imagine the audio delivered to each client is delivered in separate data streams. I.e., unicast. In this case, there is an inherent delay between the delivery of data to each client -- the left data sequence is transmitted at a slightly different time than the right data sequence. There isn't even the pseudosynchronicity inherent in a multicast protocol.

Treating data sequences as "client specific" allows the actual data delivered to each client (and, ultimately, *driver*) to be tailored to that client. I.e., the server can synthesize a "left rear" signal and deliver it to the "left rear" client while delivering a "right rear" signal to the "right rear" client, etc.

In a small venue (e.g., living room), this seems overkill. However, it gives you flexibility that a "traditional" streaming audio client wouldn't (imagine delivering audio programming to adjacent rooms served by different "clients" or having an audio program "follow you" from room to room).

And, this approach makes deployment over really *large* venues straightforward.

For example, imagine audio program delivery to stadium (and larger!) size crowds. The actual drivers are located over large distances (i.e., locating an "amplifier" on stage and the "drivers" scores of feet away). It is easier to deliver a "low level" signal to an amplifier located immediately adjacent to each "speaker stack" than to run heavy cables carrying high power outputs over long distances to the drivers from the amplifier(s).

Furthermore, in very large venues, the sound emitted from each set of drivers is often more heavily processed than just "left" or "right". E.g., when additional drivers are located "out in the crowd", the sound emitted by these drivers must be delayed with respect to drivers located "on-stage" (to ensure the acoustic wave emitted by the "crowd speakers" is in phase with the acoustic wave arriving from the "stage speakers"). The "server" can do whatever is required to satisfy these criteria without burdening each "client" with the signal processing capability to perform these functions.
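The server-side computation for those "crowd speakers" is trivial: delay each remote stack's feed by its distance from the stage divided by the speed of sound, so its output coincides with the wavefront arriving acoustically from the stage. A sketch (the 343 m/s constant and 48 kHz rate are assumptions, not anything from this design):

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # dry air, ~20 C

def tower_delay_s(distance_from_stage_m):
    """Delay so a crowd speaker's output lines up with the acoustic
    wavefront arriving from the stage speakers."""
    return distance_from_stage_m / SPEED_OF_SOUND_M_PER_S

def tower_delay_samples(distance_from_stage_m, rate_hz=48000):
    """The same delay expressed as whole samples at the stream rate."""
    return round(tower_delay_s(distance_from_stage_m) * rate_hz)
```

A tower 50 m out gets roughly 146 ms (about 7000 samples) of delay -- and the client never needs to know why.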

As I said in the original description of this device, it's "one to throw away" (and learn from). I make no claim as to its suitability as a commercial product, etc. (though I think the idea is sound). Rather, it gives me an easy way of measuring the performance of various synchronization techniques in a box that I can also make use of "in the home".

Yes, but you need to be able to *exactly* recreate the environment to which you find it vulnerable -- lest you can't "fix" any problems that are discovered.

Reply to
D Yuniskis

Do you have any idea what a 'reasonable maximum' is for the delay? I would imagine that it needs to be sub-millisecond.


Reply to

Not sure I understand your question so I'll offer two interpretations.

The delay packets can experience in transit across a switch (recall, there can be a couple of switches in any given path from "node A" to "node B" -- where A might be server and B might be client) can be "several packets" (recall MTU can also vary). So, hundreds of microseconds are possible. (this would also reflect the maximum "jitter" as the delay may be there for some packets and not for others -- depending on how traffic "backs up" at the switch at this instant for this *port*)
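The "several packets" figure is easy to bound: a store-and-forward switch must serialize every queued frame ahead of yours, so the added delay is queued bits over link rate. A quick sketch (frame size and link rates are illustrative):

```python
def queue_delay_us(frames_queued, frame_bytes=1500, link_bps=100e6):
    """Serialization delay (microseconds) behind `frames_queued`
    full frames on one store-and-forward output port."""
    return frames_queued * frame_bytes * 8 / link_bps * 1e6
```

Five full frames on a 100 Mb/s port add 600 us; the same backlog on gigabit is only 60 us. And since the backlog comes and goes, that entire figure is potential *jitter*, not just delay.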

As far as the maximum delay that can be *tolerated* (assuming the delay remains fixed and doesn't manifest as FM in the reproduced signal), that gets into gray areas. :>

There is some argument about whether or not phase "errors" can be perceived by the average listener. Consider: at 20 kHz, a 10us delay translates to a 72 degree lag.

However, I don't think the problem is that severe. Any delay is applied equally to all frequencies. I.e., it causes the resulting acoustical wave to simply arrive "a bit later". Off the top of my head, 10us would correspond to a translation of the driver (speaker) about 0.1 inches :> (I think a millisecond should be roughly a foot?)
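Both of those figures fall straight out of the arithmetic (the speed-of-sound constant below is an approximation for room-temperature air):

```python
SPEED_OF_SOUND_FT_PER_S = 1126.0  # dry air, ~20 C

def phase_lag_deg(delay_s, freq_hz):
    """Phase lag a fixed delay represents at a single frequency."""
    return 360.0 * freq_hz * delay_s

def apparent_shift_in(delay_s):
    """Equivalent physical translation of the driver, in inches,
    for a fixed delay applied equally to all frequencies."""
    return delay_s * SPEED_OF_SOUND_FT_PER_S * 12.0
```

I.e., 10us is 72 degrees at 20 kHz but only ~0.13" of apparent driver motion; and 1 ms works out to about 13" -- roughly the foot cited above.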

So, the more important issue is ensuring that the clock *frequencies* between nodes are locked (so the pitch of the audio streams being reproduced at each node doesn't exhibit an FM component). Of course, there needs also to be some degree of synchronization between the data emitted from each node as there is obviously a point where it becomes perceptible.

I would imagine the maximum relates to the distance between drivers (speakers). E.g., if the drivers are 100 feet apart, then, worst case, the acoustical wave will take ~0.1 seconds to travel from one to the other. So, if you are sited somewhere between them, you would expect something on the order of +0.1 to -0.1 seconds delay wrt any particular driver (depending on which driver you consider as your reference).

OTOH, if the two drivers are located 10 ft apart, the expected delays in the acoustical wave would be +0.01 to -0.01 seconds. I.e., I think your brain would get confused if it heard a source displaced by 0.1 seconds in that situation (?)

Dunno. :> That's what makes it fun to play with (besides learning something else that I need to know). I should be able to play with these numbers dynamically and see just what sort of perception results. I guess it would be about as annoying as seeing sound not synchronized to picture (on TV, etc.)

I'd appreciate it if anyone had any pointers to references that quantify these things.

Now, are you as confused as I? :>


Reply to
D Yuniskis


Hi Don,

Sorry it was unclear - when it is in one's head it always seems clear. What I meant was how much difference in delay could be tolerated so that the stereo image remained stable?

Thanks for the comprehensive (if somewhat inconclusive) reply. Maybe an area for experimenting would be to build a variable delay line and play back material with one channel fed through this delay?


Reply to

Well, *usually*! :> I am amazed at the number of times it fails to be, in mine! :<

I don't know. I suspect some of this is listener dependent.

Once I have the hardware in place, I can freely experiment with these things. That's the advantage of moving all of the processing "server side" -- more horsepower and resources there to play with (e.g., a delay line is just a malloc()).

I suspect the image falls apart quickly -- though maybe not sub-millisecond. E.g., consider having a "radio" playing in two rooms at once, separated by 50 feet -- i.e., opposite ends of a house. The acoustic waves arrive from the "near" speaker some 40ms (?) ahead of those from the "far" speaker yet you don't consciously perceive the delay (?). Of course, how it is affecting the overall character of the signal is more the issue. (I suspect the source material also has a big effect on perception... I should try it just with a pair of TV's)

We'll see. A chance to learn something new, firsthand! ;-)

Reply to
D Yuniskis

On Tue, 26 Jan 2010 18:19:14 +0100, D Yuniskis wrote:

And perhaps other dependencies as well. I expect that this has been researched to the point of a somewhat useful formula, even.

The sound reaching your ear will be a total of all the direct _and_ the reflected sounds. You will not perceive the delay if the reflected waves are loud enough to effectively create an FM-blur.

Yes, of course the source can already be spread over a spectrum (i.e. if recorded in a sound-reflecting room and not post-processed to remove that "blur").

Reply to
Boudewijn Dijkstra

Well, *short* delays are treated as cues to help localize sound sources. E.g., the temporal difference between wavefronts arriving at each pinna helps (at low frequencies) localize sound (frequency dependent effects caused by the "shading" of the head play a role at higher frequencies).

But, I'm not sure what happens when you get beyond these very short delays.

A friend comments that echo is perceptible in the tens of ms. So, the question is when does the delay begin to "color" the sound reproduction and how much does masking help the listener.

Yes, that applies in enclosed spaces. However, the "arena" example cited previously wouldn't "benefit" from the same phenomenon (e.g., Englishtown 1977)

It should be entertaining to play with these sorts of parameters and see what happens!

Reply to
D Yuniskis
