Audio CODEC choices

Yes. But that doesn't mean it has to work 100% of the time regardless of other environmental factors -- that makes the system too expensive (dedicated) or too brittle (where you, as a consumer, have to buy more capability than you really need).

Think of it as a parallel to an SRT (soft real-time) solution. Your approach is more in line with an HRT (hard real-time) solution.

No, I am willing to let performance degrade IF NECESSARY, as dictated by the conditions in which the system is being operated. That doesn't mean I want to set out with "less performance" as a *goal*.

You don't need TCP to get "error free". Rather, you need mechanisms that allow you to detect and recover from errors.

TCP is a bad choice as it would require a "connection" from the server to each client. By contrast, with a UDP-based protocol, you can leverage multicasting to reduce the total traffic on the network (clients seeing the same "program" simply participate in the same multicast group). Since packets are essentially numbered, a client can determine if it has missed a packet and request its retransmission explicitly (assuming it *needs* that packet!). If missed packets are The Exception instead of The Rule, then the retransmission requests will be infrequent (they also let the server get a feel for the integrity of the fabric *during* operation).
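
To make that concrete, here is a sketch of the client side, assuming a hypothetical header with a 32-bit sequence number (the names and layout are invented for illustration, not a real protocol):

    /* Gap detection on received packets; NACK anything skipped.
       C99.  All names are illustrative. */
    #include <stdint.h>
    #include <stdio.h>

    struct audio_hdr {
        uint32_t seq;      /* monotonically increasing packet number  */
        uint32_t play_at;  /* playback timestamp, samples into stream */
    };

    static uint32_t next_seq = 0;

    static void on_packet(const struct audio_hdr *h)
    {
        /* Anything between next_seq and h->seq was missed; ask for
           it again -- but only if it is still worth playing. */
        for (uint32_t s = next_seq; s != h->seq; s++)
            printf("NACK %u\n", s);  /* stand-in for a unicast request */
        next_seq = h->seq + 1;
        /* ... queue payload for playout at h->play_at ... */
    }

    int main(void)
    {
        struct audio_hdr a = { 0, 0 }, b = { 3, 768 };
        on_packet(&a);
        on_packet(&b);  /* packets 1 and 2 lost: prints NACK 1, NACK 2 */
        return 0;
    }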

See above.

Note, also, that you need not implement an entire stack with the "usual" complement of utilities, etc. Many aspects of a traditional stack are useless or can be short-circuited since the clients are designed to talk to *a* server.

E.g., there is no need for a resolver, arp cache, etc. A UDP-based protocol can omit checksums if the higher level protocol already has error detection mechanisms. Etc.
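
For instance, on Linux (and only there, and only for IPv4 UDP), the outgoing checksum can be disabled per socket. A sketch, with the usual caveat that this is only sane if a higher-level protocol covers the payload:

    #include <sys/socket.h>
    #include <stdio.h>

    #ifndef SO_NO_CHECK
    #define SO_NO_CHECK 11   /* from <asm/socket.h> on Linux */
    #endif

    int main(void)
    {
        int one = 1;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }
        /* Disable outgoing UDP checksums on this socket. */
        if (setsockopt(fd, SOL_SOCKET, SO_NO_CHECK, &one, sizeof one) < 0)
            perror("setsockopt(SO_NO_CHECK)");
        return 0;
    }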

Of course! My point is just that there are "events" that you simply can't predict and (economically) guard against. So, you either end up with a brittle system "breaking" when things aren't running EXACTLY as you had hoped (at *design* time) -- or, you make the design resilient to the sorts of things that are likely to occur.

Yes.

Yes. Hence the appeal of token-ring (token-passing) networks in applications where you needed predictable performance.

It's not quite as simple as that. You need to know how big (deep) the buffer is, whether it is *shared* across all ports in the switch or a "buffer per port", etc. On top of that, you need to know if the switch is blocking or non-blocking. And, how the switch forwards packets (cut-through, etc.) and how it handles "bad" packets. All of these things affect how long a packet might "stall" inside the switch.

More importantly, it affects the *range* of times that this sort of parking might occur.
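
For a feel of the magnitudes: a store-and-forward switch cannot begin transmitting a frame until the whole frame has arrived, so its floor is one full serialization delay per hop. Back-of-envelope:

    /* Minimum residence time of a max-size Ethernet frame in a
       store-and-forward switch at 100 Mb/s. */
    #include <stdio.h>

    int main(void)
    {
        double frame_bits = 1518 * 8;   /* max standard Ethernet frame */
        double link_bps   = 100e6;
        printf("store-and-forward floor: %.1f us/hop\n",
               1e6 * frame_bits / link_bps);   /* ~121 us */
        /* A cut-through switch can start forwarding after the header,
           so its floor is far lower -- but egress congestion can still
           park the frame in the (shared?) buffer for much longer. */
        return 0;
    }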

No, that's just one of the many types of applications it can address. It could (theoretically) be used in a drive-in theater, music/audio distribution within a hotel, audio distribution at a conference, "intercom" at a school, etc.

Why design for *an* application if you can, instead, address a *range* of applications?

Nothing that complicated! E.g., for the "sound stage" application, generate an audio "reference signal" at the "source" (e.g., at the proscenium). Feed that same signal through the "system" to your first "tower/speaker-repeater". Adjust the absolute delay of this second signal until the phase of the first and second signals are in sync.
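
In code, that measurement might reduce to a brute-force cross-correlation over captured buffers -- a sketch only (a real system would window, normalize, and search smarter):

    #include <stdio.h>
    #include <stddef.h>

    /* Return the lag (in samples) at which sig[] best matches ref[]. */
    static size_t best_lag(const float *ref, const float *sig,
                           size_t n, size_t max_lag)
    {
        size_t best = 0;
        double best_sum = -1e30;
        for (size_t lag = 0; lag < max_lag; lag++) {
            double sum = 0.0;
            for (size_t i = 0; i + lag < n; i++)
                sum += (double)ref[i] * (double)sig[i + lag];
            if (sum > best_sum) { best_sum = sum; best = lag; }
        }
        return best;
    }

    int main(void)
    {
        float ref[64] = {0}, sig[64] = {0};
        ref[10] = 1.0f;   /* a click in the reference...  */
        sig[17] = 1.0f;   /* ...heard 7 samples later     */
        printf("lag = %zu samples\n", best_lag(ref, sig, 64, 32));
        return 0;
    }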

No, the variation in network protocol stacks is too unpredictable. In "real time" stacks, you can control this to a large degree (though often the NIC is poorly documented in terms of *when* data actually gets onto the wire). But, even then, the delays encountered in the switches themselves are too unreliable (unless you want to hope there is enough entropy in the system as a whole to "guarantee" a typical observation remains constant forever). E.g., a blocking, store-and-forward switch with a deep *shared* buffer will exhibit a wider (theoretical) range of switch delays than a non-blocking, cut-through switch.

And, all that can vary over time, traffic, etc.

With NTP, you *might* get ~O(1ms+) synchronization. Plan on 10ms to be safe.
Reply to
Don Y

The simple answer here is don't play pathological samples. There was a time when such effects were a problem. For example, mp3 at a poor-but-bearable level of 64 kbps is okay for general music at a walkman (remember them?) quality level. But a drum solo might sound truly awful, because mp3 can't cope with rapid changes at that bit rate. But modern codecs - such as ogg - are tested on far wider ranges of samples, and in particular the use of VBR lets them work harder at the difficult parts.

There are types of sound and music that require higher rates to encode well, but the answer is simply to use a high enough quality that the music you listen to sounds transparent.

It is true that there are still pathological samples, but you are not going to come across these in real music (or speech, or film sound).

That's guesswork - at least in saying "hugely".

Of course it is best to avoid re-coding - you will /always/ get some loss. So if your source is in mp3 or ogg in the first place, then the best solution is to send it over the network untouched, and decompress it at the output end.

Each stage of encode-decode will add a little distortion, artefacts and noise. The more stages you have, the worse the result. But a few stages, with high quality encodings (and of course good quality implementations - not all codec implementations are created equal), will not be a problem.

It's the same as having a multi-stage amplifier. Each stage adds a little noise, and has limited bandwidth. But you still use them.

Reply to
David Brown

Then figure out your /real/ requirements before going any further.

Try some tests by doing the encoding and decoding on your PC - it costs nothing but a bit of time. Once you realise that you are actually quite happy with ogg encoding at 160 kbps, and find 256 kbps indistinguishable from the original, you won't have to spend so much time and effort trying to re-invent the wheel.

You can have your gold-plated iron /now/, at a small cost. Maybe you'll want real gold in the future, but don't buy it until you have proved to yourself that you really do need it.

Reply to
David Brown

Ogg[1] is not an audio codec, it's a container format. You probably had Vorbis[2] in mind.

[1]
formatting link
[2]
formatting link

Regards.

Reply to
Noob

AFAIU, q6 has a target bitrate of 192 kbit/s, and the codec supports VBR. If you have an audio sample that is not transparent at q6 with the latest encoder, you should definitely share it with the developers!

Reply to
Noob

You are, of course, correct. Thanks for the correction.

David

Reply to
David Brown

IMO, that's a risky approach: "Here, folks, I *think* this works. Try it. Put *yourself* at risk. Hopefully, you won't have any bad experiences. *But*, if you do, I will be happy to redesign the devices, protocol, etc. at that later date."

You're advocating investing the time and money to design, document, build and deploy these things with the idea that I can throw it all away and do it over if this turns out, in the future, to be wrong? Possibly have to come up with a new hardware design, build those new devices, rewrite the client and server code, etc.

OTOH, using the network as a *wire* means *I* don't introduce anything to the signal or process -- except latency.

Folks "above my pay grade" can decide just how much manipulation an (unspecified!) source can tolerate before it becomes unacceptable. I don't have the skillset to understand all the transforms -- or their consequences -- that the signal undergoes to be able to predict what some future signal might experience when subjected to those manipulations.

[And, since this effort is "uncompensated" on my part, I have no real desire to 1) screw it up or 2) do it over! "Oops" is not an option :-/ Unless, of course, someone wants to put *their* money/time up and *give* me the prefabricated devices... I'd be very happy to act as a beta site! :> ]
Reply to
Don Y

This is the approach I have taken -- except, I move the decoding to the *source* end of the pipe and turn the network into *just* a transport device. I.e.,

- decode using whatever CODEC is appropriate for the container, etc.

- package "raw audio samples" for transport

- pass to client

- unpackage raw audio

- deliver to loudspeaker

instead of:

- package "encoded audio samples" for transport

- pass to client

- unpackage encoded audio

- decode using whatever CODEC is appropriate for the container, etc.

- deliver to loudspeaker

This frees the client from having to understand a (*growing*!) number of CODEC/container formats and lets it focus on just one task -- "wire emulation".
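
A sketch of what that framing might look like -- header layout and names invented for illustration (a real protocol would also pin down byte order, packing, etc.):

    #include <stdint.h>
    #include <string.h>

    #define SAMPLES_PER_FRAME 256

    struct pcm_frame {
        uint32_t seq;       /* packet sequence number            */
        uint32_t play_at;   /* playback timestamp, in samples    */
        uint8_t  channels;  /* e.g., 2                           */
        uint8_t  bits;      /* e.g., 16                          */
        uint16_t nsamples;  /* samples per channel in this frame */
        int16_t  pcm[SAMPLES_PER_FRAME * 2];  /* already decoded */
    };

    /* Server side: whatever CODEC the container needed has already
       run; the client only ever sees raw samples. */
    static size_t pack_frame(struct pcm_frame *f, uint32_t seq,
                             uint32_t when, const int16_t *samples,
                             uint16_t n)
    {
        f->seq = seq; f->play_at = when;
        f->channels = 2; f->bits = 16; f->nsamples = n;
        memcpy(f->pcm, samples, (size_t)n * 2 * sizeof(int16_t));
        return sizeof *f;   /* fixed-size frames keep the client dumb */
    }

    int main(void)
    {
        static int16_t silence[SAMPLES_PER_FRAME * 2];
        struct pcm_frame f;
        pack_frame(&f, 0, 0, silence, SAMPLES_PER_FRAME);
        return f.nsamples == SAMPLES_PER_FRAME ? 0 : 1;
    }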

Reply to
Don Y

With lossless compression you also introduce the risk that, for some pathological cases, the achieved compression ratio is less than what you require for the system to work.

Reply to
Arlet Ottens

This has been addressed in another reply. Short answer: you turn off compression for those cases! (there are consequences to this -- see those other replies for details)

Reply to
Don Y

And since you have to transport the information that this is uncompressed data along with the data itself, this means your compressed stream is now _bigger_ than the original.

And actually, that's at most half of the short answer. The part you're leaving out is: if your system will have to cope with the full bandwidth of the source signal in _some_ cases anyway, then there's no point bothering with that kind of compression.

Since you need to reserve that bandwidth anyway, you may as well use it all the time.
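
To put a number on "bigger" (assuming a hypothetical 3-byte per-block header -- a format flag plus a 16-bit length):

    #include <stdio.h>

    int main(void)
    {
        /* Worst case: every block incompressible, so the stream is
           the raw input plus exactly one header per block. */
        double hdr_bytes   = 3.0;     /* flag + 16-bit length, on the wire */
        double block_bytes = 1024.0;  /* hypothetical block size           */
        printf("worst-case overhead: %.2f%%\n",
               100.0 * hdr_bytes / block_bytes);   /* ~0.29% */
        return 0;
    }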

Reply to
Hans-Bernhard Bröker

By *one* bit? (c'mon, you're just being silly...)

Wow, do you *really* not see the issue here?

"Some day, your automobile may run out of fuel or suffer a mechanical breakdown. So, you might have to *walk* home in that instance. Therefore, you should resolve yourself to ALWAYS WALK and sell your vehicle -- spend the money saved on good walking shoes!"

If you have four members in your family, does that PREVENT you from owning a motorcycle? After all, you have no idea when all four of those individuals might want to accompany you in your travels... "Since you need to reserve that PASSENGER SPACE anyway, you may as well use it all the time!"

[and, since that 4-passenger car has already been proven to NOT be 100.000000% reliable (see above), you may as well skip the purchase entirely and just buy *four* pair of walking shoes and convert the garage into a spare bedroom!]
Reply to
Don Y

If you intend to degrade gracefully, e.g., by reducing sampling rate or number of channels, or going from uncompressed to compressed, or from lossless compression to lossy compression, how actually are you going to implement this in practice?

How does the server know that the client is suffering from dropouts or excessively delayed frames due to network congestion ?

One might think that the client could send ACKs reporting if the stream is OK or congested. Things get interesting when broadcasts are used. The server must receive ACKs from all clients using this stream and if anyone reports problems, switch to a more compressed format for all clients using this stream (assuming the ACK reporting problems gets through).

Alternatively, the client just ACKs any received frame (or group of frames in the TCP style) and the server calculates the round trip delay and determines if there is a congestion. The problem is that in a full duplex network, one direction can be congested, while the other is not. If only the ACK direction is congested, this would cause too frequent fallbacks to the slower bit rate.

Anyway, the client memory buffer (and hence nominal extra delay) must cover the _two_way_ worst-case propagation time.

The server must also _constantly_ generate the most compressed fallback stream. Since such high-compression-rate lossy systems typically have quite a long latency, the default uncompressed or lossless stream must be delayed so that it matches the lossy compression latency. Thus, if a switchover needs to be performed, the samples from both streams are from the same point in time.

Running the high-compression-rate backup compressors in parallel for all streams all the time (just in case a fallback needs to be performed on one stream) will also increase the server load quite significantly. It would be unacceptable to just start the high-latency lossy compressor at the server when the ACKs report that there is congestion on the network.

Also, both compressors must carry the original time stamps through the conversion processes into the messages, so that the client knows how much overlap between the two streams needs to be discarded.
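
On the client, the discard logic can then be fairly simple -- something like this, ignoring timestamp wraparound (illustrative only):

    #include <stdint.h>

    /* Frames from either stream are stamped with the stream-time of
       their first sample; *next_needed is the first sample not yet
       queued for playout. */
    static int accept_frame(uint32_t frame_start, uint32_t nsamples,
                            uint32_t *next_needed)
    {
        if (frame_start + nsamples <= *next_needed)
            return 0;          /* wholly stale overlap: discard       */
        /* a partial overlap would be trimmed to start at *next_needed */
        *next_needed = frame_start + nsamples;
        return 1;              /* queue (the rest) for playout        */
    }

    int main(void)
    {
        uint32_t next = 1000;
        return accept_frame(500, 400, &next);   /* stale: returns 0 */
    }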

One additional note: if QoS is _not_ supported by the switches and there is other traffic (e.g. FTP over TCP) on the network, cutting down the audio bandwidth will allow more bandwidth for the FTP transfer, and the audio transfer is as unreliable as before :-(.

In principle, a TCP (FTP) connection with a 64 KiB window size could capture the full capacity of a 100BASE-T network, provided that the two-way propagation delay (network and protocol stacks) is less than about 5 ms. Not too hard these days.
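
The arithmetic behind that figure -- a single TCP connection can have at most one window in flight per round trip, so throughput is window/RTT:

    #include <stdio.h>

    int main(void)
    {
        double window_bits = 64.0 * 1024 * 8;   /* 524288 bits */
        double link_bps    = 100e6;
        printf("RTT to saturate the link: %.2f ms\n",
               1e3 * window_bits / link_bps);   /* ~5.24 ms */
        return 0;
    }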

Reply to
upsidedown

I don't see it as "more compressed" but, rather, "lossier" (which is also a misleading term... e.g., reducing the sample rate isn't technically lossier; it just lowers the bandwidth that can be represented -- and adds artifacts).

[of course, you *can* think of it as more compressed *if* you assume the entire original signal is still reproducible from the degraded representation]

No. The server doesn't care if you received your packet or not. Having each client acknowledge each packet and/or *request* each new packet adds extra traffic (and processing) that doesn't buy you anything. A client has to be able to cope with whatever it is given -- it can't "rely" on the server to "bail it out".

I.e., whatever packets a client has *now* are all it can *expect* to EVER have (there is no guarantee that the server will be *up* when the next packet is needed, etc.). So, a client has to be prepared for the stream to end abruptly -- and/or *resume* just as abruptly.

The recovery algorithms have to avoid pathological behaviors when the system (from a client's perspective) starts to overload or otherwise behave unexpectedly (i.e., a system might not be in overload yet could still be "misbehaving").

This requires the clients to actively track the state of the system instead of just responding "open loop" to "current conditions" (otherwise, you end up with clients "oscillating" as they try to recover and then, suddenly, REdiscover that the system is broken, etc.).

Clients *try* to return statistics to the server periodically (but, they make no attempt to *guarantee* their successful delivery/receipt -- just like the clients can't rely on the server to provide what they want, when they want it, the server can't *rely* on the clients to provide what *it* wants when *it* wants it!). The server uses these statistics (or their absence) to decide how it should provide its services to the clients (with direction from pre-existing configuration criteria).

If a server fails to hear anything from a particular client for an extended period of time, it can opt to contact the client directly (there are "control packets" mixed into the data stream to handle things like configuration parameters, data requests, etc.). If the client fails to respond, the server can take remedial action (the client may have crashed, *died*, been powered down, been unplugged, etc.) to reestablish a data stream *or* elect to shed the responsibilities associated with that client (which may or may not allow it to reduce the amount of network traffic -- or reSHAPE that traffic).
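
A sketch of how such control packets might be mixed into the data stream -- a type tag up front, and a trivial dispatch on the client (all names invented for illustration; the commented-out helpers are placeholders, not a real API):

    #include <stddef.h>
    #include <stdint.h>

    enum pkt_type { PKT_AUDIO, PKT_CONFIG, PKT_STATS_REQ, PKT_PING };

    struct pkt_hdr {
        uint8_t  type;   /* enum pkt_type                   */
        uint32_t seq;    /* numbering shared with data path */
    };

    static void dispatch(const struct pkt_hdr *h,
                         const void *body, size_t len)
    {
        (void)body; (void)len;
        switch ((enum pkt_type)h->type) {
        case PKT_AUDIO:     /* queue_for_playout(body, len); */ break;
        case PKT_CONFIG:    /* apply_config(body, len);      */ break;
        case PKT_STATS_REQ: /* send_stats_now();             */ break;
        case PKT_PING:      /* send_pong(h->seq);            */ break;
        }
    }

    int main(void)
    {
        struct pkt_hdr ping = { PKT_PING, 42 };
        dispatch(&ping, 0, 0);
        return 0;
    }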

A client can *request* (with no guarantee that the request will be honored -- or even *received*!) a different delivery mechanism or format -- e.g., a unicast stream so that its specific capabilities can better be exploited without regard to those of other clients that are currently *sharing* that stream with it.

Clients are peons -- they have to deal with what the system opts to do for (or *to*) them.

See above.

Propagation time, by itself, is a small factor. The buffers' sizes are more driven by synchronization issues, *intentional* latency (e.g., to synchronize with a video decoder running on a different client), etc.

"Frames" carry "playback timestamps" representing when they occur in the audio stream. Once a stream is available in its raw form (regardless of how it arrived at the client), the client ensures that the data is presented to the "audio output" at the right moment.

This load only exists for "live" feeds! Other "prerecorded" sources (i.e., your music library) can just be pulled off the storage medium and placed on the wire "at the right time".

See above.

The audio system can't do anything about the other uses to which the network is subjected. Someone could unplug a switch (gasp)! All you can hope for is a best case effort.

Remember, the switch isolates (to a large degree, esp. for non-blocking switches) node A's traffic from node B's. I.e., if the server and clients are *not* participating in those FTP transfers, then the traffic on their "wires" will be exactly what the audio subsystem requires -- nothing more.

Deploying this sort of system on a bus (hub) technology (10Mb, wireless, etc.) makes it too hard to ensure *any* guarantees -- other users could have a lot of unrelated traffic that you could neither predict nor control (e.g., the folks in Marketing are engaged in a video conference call on the same 10Mb network segment serviced by your hub...).

IMO, this is why wireless was a non-starter for this. Neighbors (imagine living in an apartment complex) could have lots of traffic on their wireless routers, cordless phones, etc. You would quickly get fed up with a system that was constantly suffering dropouts each time the neighbor's cordless phone (located on the other side of a 4.5" thick plaster wall from one of your clients/server) was in use!

As I said previously, it is a *delightful* project to wrap your head around! Lots of opportunities for optimizations/tradeoffs/etc. By comparison, a simple "audio player" is a mere "toy" ;-)

And, once in place, the *new* opportunities that it affords get to be *really* exciting! (e.g., imagine an audio program following you around the house...)

Reply to
Don Y
