Audio CODEC choices

Don Y · 2011-07-27T07:57:03+00:00

Hi,I'm looking for ideas for an audio CODEC to use instreaming audio to my "network speakers". I've debuggedthe system using "raw" 16b samples. But, this wastes afair bit of bandwidth "needlessly" (?)Ideally, I am looking for a lossless format that has*modest* encoding costs and *very* low decoding costs(time+space). I.e., consider a single transcoderfeeding a large number of endpoints.Any "frames" (if present) should be relatively smallas I may need to synchronize streams to a fraction ofa frame (consequences for space requirements).Ideally, the decoder should directly support sampleinterpolation (though this could be done in a post-processingstep) to synchronize to a fraction of a sample interval.Since it is intended for streaming use, there should be nolimits on "position" indicators inherent in the format.I need to be able to split a multitrack source into anequivalent number of single tracks or combine severalsingle tracks into a multitrack stream. The cost fordoing so should be low (i.e., AAAAABBBBB is cheaperto encode/decode than ABABABABAB).[what else have I forgotten?]Any pointers to existing (unencumbered) algorithms orsuggestions as to how to best roll my own?Thx!--don

D

Don Y 14 years ago

There's nothing wrong with the "assumption". E.g., :

"If you can get 3:2 compression *in* the CODEC, then that drops to 66K."

Or, more to the point, "if you can get X compression in the CODEC, then that drops to 100K/X"

The dirty little (unavoidable) secret is that you are stuck with THE PROBABILISTIC CONSEQUENCES of whatever "effective" buffer size you end up with. This simply changes the odds that you'll have a problem in a given traffic environment.

It says nothing about the *remedies* that you can bring to bear *if* there is a problem:

- signal the service to recode the audio for a lower sample rate

- drop from N channels (e.g., 5.1) to M channels ("stereo")

- go silent

- etc.

You *always* face this problem if you operate a system in overload. The question is whether it *breaks* or gracefully degrades... (and *that* is a function of how well it is engineered)

Vote

U

upsidedown 14 years ago

When using uncompressed samples, random loss of samples due to congestion of CRC errors is not a big issue. A first or second order interpolation will give quite good results with only 2-3 sample delay.

You should design the system in such a way that it can not overload, e.g. running well below the theoretical capacity.

Using non-compressed samples and interpolation should be OK, if only samples are lost randomly.

However, if there are multiple samples in the same frame, a lost frame causes a burst error and most one way wireless protocols use interleaving, by spreading adjacent samples into other frames. If one frame is lost, the de-interleaver will convert the burst error to random errors.

For uncompressed samples, interpolation can be used to generate the lost samples.

However, if compression is used, it is _essential_ that also some error correction (ECC) bits are transmitted, which are capable of correcting the missing random bits (after de-interleaving), before attempting the decompression. This will increase the latency and buffering requirements.

Vote

N

Nobody 14 years ago

Except: lossless compression always has a worst-case scenario where the compressed data is at least as large as the uncompressed data. So if you have fixed requirements, you have to either "budget" for uncompressed data or use lossy compression.

Lossless compression is still worthwhile if the typical-case performance matters, e.g. if you're also running best-effort services on the same bandwidth.

Vote

U

upsidedown 14 years ago

Install fibers and in the future, if there are new demands for very high throughput services, use WDM with different wavelength for each service and no congestion issues between services.

So the specification seems to be changing all the time. i did not see any reference to video in the original post.

Previous discussion was about simple clients. A client capable of video processing is not that simple.

If you intend to decompress HD MPEG4 video streams to HDMI etc. is

10-15 W enough ?

At least some HDTV set top boxes run quite hot.

Vote

D

Don Y 14 years ago

Sure -- I can also put a CD/MP3/DVD player on every flat surface of the house! :>

And that's the way I have designed things -- let the user trade off what's important to him.

If you start up 47 active applications on your PC and responsiveness suffers, *you* decide if (and what!) you want to kill one or more of those applications. You'd be really *pissed* if the PC simply "crashed" (and effectively made a unilateral decision *for* you!)

I believe people are accustomed to systems running out of capacity (for a given level of responsiveness). Cell phones, internet connections, PC applications, etc. Even the most naive user probably is aware that killing that Flash video playing in the upper right corner of the screen will allow their email to load faster...

And, I doubt many folks would welcome an interface designed with "arbitrary" (from *their* point of view) limits imposed: "I'm sorry, you already have 3 applications running so you can't start another..."

--^^^^^^^^^^^^^^^^^^

Having just rewired the entire house, trust me, this is NOT a viable option for most folks!

Yes. But, it isn't *big*. And, in all but live performances, you can compensate for latency. E.g., if streaming video to one device and the "accompanying sound" to another, you deliberately balance the "presentation paths" between the two devices so the user doesn't perceive any relative skew.

In live performances and interactive applications, latency can be an issue *if* it gets to be "perceptible". E.g., if you are

200 ft from the stage watching the performers through field glasses, you *will* notice a lag between the time you *see* them lick the guitar and the time you hear the resulting chord. Physics sucks. :>

And the situation changes daily. Note that buffer size just determines the probability of having to take "remedial action" to compensate for potential underruns. Put a 1MB buffer on the device and that still won't help you if the server has a momentary overload or encounters a disk error and has to retry repeatedly!

In overload (i.e., when the client *can't* get the data it needs before the deadline), you have to determine what the least offensive remedy is (for the application). E.g., that disk error doesn't care whether you are sending uncompressed or compressed data -- the pipe has gone dry!

How's that going to get the signal from the proscenium to the first stack of towers 100 ft out? :>

How does the server compensate for the distance between the monitors "on stage", the first set of towers at 100 ft and the second set at

180 ft?

And, you still have to account for real-time synchronization between the actual clients (so they all know when "now" is).

If, instead, you multicast and let each client *tweek* his relative timing wrt his peers, then you save network bandwidth at the expense of some (relatively inexpensive interpolation) at the clients.

It's not "7mm" it's (almost) half a wavelength (at the high end of the frequency response).

If I'm using the same system to drive/detect infrasonic signals and have moved the sample rate to ~100 Hz, that 1 sample time is now

11 ft.

(No, you can't drive infrasonics at 48KHz because the depth of your buffer proves to be insufficient)

Not being able to interpolate between samples also means the client/decoder can't resample the data stream. E.g., if the network is overloaded and the server transcodes to a lower sample rate (i.e., sacrificing frequency response), then the client has to change its "output driver" to track that sample rate in lock step. Much easier to run the output at a fixed rate (which reflects values of filter caps, etc.) and change the data sampling/interpolation rate as best fits the data stream.

Vote

D

Don Y 14 years ago

But, you can tolerate this and change the requirement to: "... where the compressed data is no larger than the uncompressed data."

I.e., the protocol can simply say, "this is uncompressed data".

You (the designer) can't predict what the environment will be. You can either design a brittle system that imposes constraints on the user and the deployment. Or, you can design a system that adapts -- within its inherent limitations -- as best it can to the situation at hand... and allow the user to change the environment if better performance is required.

Consider how you would specify a brittle/inflexible environment:

- source material must be capable of a 4.7:3 compression ratio

- there must be no greater than 8.5 clients active at any given time

- at most, 6.2 source channels can be in use

- all other transactions must use a MTU not exceeding 678.2 octets

- the media store must have a sustained bandwidth of 12.4MB/s with no idle period exceeding 1.9 seek times

- the media server must run FooOS version 3.2 (service pack 8-1/3) ...

*And*, WHEN it breaks in use, the user will probably have no clue as to what the *actual* cause of the problem was ("Did someone change to a *7th* audio course? Or, was someone surfing the web? Or, is the disk dying? Or...")

Vote

D

Don Y 14 years ago

It's not germane to the issue at hand. As far as you are concerned they are "other clients" (whether a user sitting at a PC FTP'ing some big files, a VoIP client conversing with someone overseas, etc.)

Which is why we're not talking about the video clients, here! :>

All that matters to the *audio* clients is bandwidth and variations in network latency.

No. But, a "speaker" and an "active display" have very different power requirements! I.e., you can deliver 10W to a small amplifier, attach a speaker and have a functional "endpoint". Something that you can bury in a ceiling/wall and forget about.

By contrast, if you have a STB -- even one that draws *zero* power -- you still have an accompanying display! Even a small display will draw 20, 30, 100W or more!

So, for *video* endpoints, you *expect* to have power available locally (i.e., at the client).

Also, video clients tend not to be operated in the same way as audio clients. It is likely that an audio client could want to be powered up "asynchronously" (e.g., someone pushes the doorbell, you power up some audio clients, play a chime and then power them down).

The same tends not to be true of video clients -- though there is nothing that prevents this *if* the client is designed to be able to supply *local* power when commanded. E.g., someone pushes the doorbell, you power up some audio clients (to play the chime)

*and* some video clients (to show a live view of whomever is *at* the front door...)

Almost *all* do!

Vote

D

Don Y 14 years ago

Philosophical difference. I design to degrade rather than build excess (unused/seldom used) capacity into a system. People quickly learn complex usage patterns that they can get away with ("Don't turn on the microwave while the dishwasher is in use." "Don't try to download a music video while Skype-ing." etc.) If a person feels there is a way to work within a framework instead of being (arbitrarily/artificially?) constrained, he tends to "think better" of the product.

Note that a system designed to be well-behaved (though degraded) during overload can be converted to one that can *not* be overloaded:

attach_new_client() { ...

if (number_of_clients > MAXIMUM_SAFE_CLIENTS) return FAIL;

... }

The opposite scenario just doesn't work (a system not designed to handle overload will usually break spectacularly in those situations)

Vote

D

David Brown 14 years ago

Don't forget that good quality lossy compression is normally totally inaudible - it's very unlikely that you'd be able to spot the difference between the raw CD quality sound (or equivalently FLAC) and high-quality (q6+) ogg encoding.

If all you want is something that looks like gold, then gold-plated iron has a lot of advantages over gold!

Vote

D

Don Y 14 years ago

This forces that *entire* frame to be routed to *every* client. So, 34/35ths of each network link is wasted carrying traffic that is not pertinent to the node at the end of the link.

If, instead, each packet carries one client's data -- or, one multicast channel -- then only the links that need that data bear the (bandwidth) *cost* of that data.

Packing the channels optimally can have a big impact on the overall system performance. E.g., if a client implements a "stereo" interface, sending two packets of AAABBB AAABBB is better than two packets of AAAAAA BBBBBB. OTOH, if there are other clients needing *only* A (or B), then the latter can be preferable as *those* clients aren't tied up receiving (and discarding) half-packets of B (or A).

[this is really a *fun* project as it affords lots of opportunities for small optimizations with big payouts!]

The only link that sees *all* the (audio) traffic is the link from the (audio) server.

Upgrading it to Gb (many 100Mb switches have one or two Gb ports) buys *that* link extra bandwidth. Alternatively, a second I/F on the server connected to a second port on the switch can achieve similar (though not as dramatic) results.

Traffic from "PC #1" to "PC #2", e.g., is unaffected (but the switch has to learn where multicast clients are located)

Bottom line, a 100MB network should easily handle this sort of application in a home/SOHO environment.

Vote

P

Paul Gotch 14 years ago

You can also get rid of a lot of complexity at the expense of slightly reduced compresion by use a subset of the standard. FLAC has compression settings between 1 and 8 and most of the really time consuming bits of the algorithm are only used between levels 5 and 8.

-p

Paul Gotch --------------------------------------------------------------------

Vote

P

Paul Gotch 14 years ago

There are two problems with this. The first is that there are always pathalogical samples which cause lossy compression to not be transparent you really don't want such a thing between your source and your loundspeakers.

The second is that it's *highly* probably that the source material is already in a lossy compressed format. Decoding it and then re-encoding it to a different lossy format ups the probability of artifacts hugely.

-p

Paul Gotch --------------------------------------------------------------------

Vote

D

Don Y 14 years ago

Of course! But, if -- after having purchased and plated all that iron -- you realize that you really *do* need gold, then you've not saved anything! :-/

Right now, I don't see anything that makes the FLAC approach "un-doable". My biggest concerns are how to harden the system (economically!) so it's use isn't a vulnerability/liability.

[(sigh) I've got to go replace some king studs before it gets *too* hot...]

Vote

U

upsidedown 14 years ago

I had assumed that you wanted real time lossless transport.

I tried to show what the constraints are in order to have a true real time system.

From other posts, it appears that you are willing to skip the real time requirements. perhaps you could also ease the lossless requirement, which will reduce the throughput requirements significantly.

Please note that using compression will typically require an error free stream, which would normally require TCP/IP support (and hence retransmissions). A proper TCP/IP stack on the clients will require much more resources compared to simple MAC level (or IP/UDP) framing.

If you expect such events frequently, you have to use sufficient buffering (several seconds ?) on the server side with a few megabyte queue to the kernel mode device driver.

Regarding Ethernet network latencies, the original half duplex coaxial

10base5/10base2 with CSMA/CD was awful, If a collision occurred, both frames were destroyed and each sending station performed a random backoff period, before trying again. In order to have predictable timing, you had to run the 10base2/5 network just like a glorified RS-485 multidrop network, i.e. having a single master and the slaves only transmit after the master prompted each slave to do so.

Current full duplex 10/100/1000baseT networks with switches are much more predictable. Each end node can transmit a frame whenever it likes and the switch will send directly to the desired link. If there are other older frames going to the same link, the new frame is queued. Only if this queue is full, a frame will be dropped. To avoid such situations, only that link needs to be upgraded. Thus, if there are

100baseT PoE switches at each floor, using a non PoE 1GbE link between these switches should do the trick.

So this is also intended as a stage sound system :-).

How does the clients know their physical locations ?

For outdoor stages, the clients could be equipped with GPS receivers. Using GPS would also solve the timing/synchronisation issue. Each speaker only needs to communicate their GPS coordinates to each other:-)

A more practical approach would use a single GPS receiver during the sound check. Visit each speaker, take the readings, feed those into the server and let the server calculate the relative positions and use those to generate suitable delays for each speaker, before sending out the streams. Alternatively, if broadcasts are used, send the required delays to each speaker.

Even NTP on a LAN will be quite accurate.

So what ? The alignment between elements in the same speaker box is typically not that good.

Band limit the signal at the server to 100 Hz using the 48 kHz sample rate. At the client, decimate by picking every 480 th sample from the

48 kHz band limited stream.

To generate a phase shift, simple move the start positions somewhere between 0..479 sample positions and then repeat every 480 sample. If 48 kHz sample rate consumes too much bandwidth for this band limited signal, just use differential PCM, which drops the data rate significantly in this case.

There are chips intended for "high end" SPDIF DACs. The DAC sample rate is constant, but the claimed "jittery" SPDIF stream is captured by a PLL. For instance, if the PLL is receiving data at say 47950 Hz, the chip converts it to 48 kHz DAC clock.

Vote

O

Oliver Betz 14 years ago

[...]

lossless "compression" doesn't guarantee any size reduction as "Nobody" already wrote. Try to compress noise lossless.

Oliver

Oliver Betz, Muenchen (oliverbetz.de)

Vote

D

Don Y 14 years ago

You're missing the point. The extent of compression (or LACK thereof -- which is a dynamic condition that can vary based on the program material chosen as well as the instantaneous characteristics of that program material -- just changes the odds that your buffer will be able to carry you over a momentary overload.

*Any* compression factor acts as a storage multiplier -- effectively increasing the depth of your buffer by that same factor. Worst case, you have *no* compression and whatever bytes you have set aside for your buffer represent the *actual* size of the "virtual" buffer.

You can also get the same sort of magnification by reducing the sample rate, changing the resolution of your samples, dropping channels, etc.

But, these have a direct ("hearable") impact on the quality of the audio delivered -- lossless compression doesn't (you get the benefits of an effectively larger buffer at the expense of computational complexity -- not signal quality).

So, when faced with the probability (not *possibility* -- since that is ALWAYS present in ANY system) of an overload, you can choose to sacrifice quality (sample rate, resolution), quantity (number of active channels) or risk a complete "dropout" ("Gee, everything sounded GREAT -- right up until the point where it went silent!")

Making that tradeoff is a dynamic one since you can't predict the future -- the packet you need may be "on the wire" this very instant! So, any choice to downgrade quality prematurely can be foolish.

OTOH, if you suspect that the source is getting "choppy", you might want to proactively downgrade quality in the hopes of limping through the rough spots without having to take a noticeable "hit" (dropout).

OTOOH, the packet may *never* come -- the server may have died, etc. In which case, any quality decisions will be short-lived, regardless.

Vote

D

Don Y 14 years ago

Hmmm... hadn't thought of that. I just figured "I want a virtual speaker wire" so a lossless CODEC seemed the only way to go...

Vote

D

Don Y 14 years ago

Note that I'm not too worried about the *encoder* end of the pipe as there tend to be more resources available, there. (though I will have to carefully study some of the optimizations that it relies on to see how I can give myself "play time flexibility" without having to do the actual encoding *at* that time)

I'd already planned on stripping a lot of the cruft from the protocol. E.g., a "speaker" doesn't need to see the "album artwork" associated with a piece of source material.

And, making some subtle/obvious changes like 86-ing the seektables (since they can only appear at the start of the program and the program will be "of indeterminate length", they don't serve much purpose!).

Some other changes I will have to think carefully about just to make it impossible for (certain) errors to require detection in the clients (e.g., if blocking strategy changes -- gasp -- there's nothing the client can do... should it just *crash*?! :> Better to just *fix* the strategy and eliminate the field entirely)

Some of the other options that try to make the CODEC more universally appealing might have no value to me (e.g., supporting different sample rates, sample sizes, etc.) so there may be some value in tweaking those things as well.

Unfortunately, I've not had a chance to dig into the specification very deeply (other, more pressing, responsibilities :< ).

Vote

C

Charles Bryant 14 years ago

In article , Don Y wrote: }The model I keep in mind is two video streams plus two music streams. }Keep in mind that video requires 5+ audio streams to accompany it }(e.g., 5.1 audio). } }So, there's 14+ audio streams and two video streams. } }[I base this on observations of family-of-4 households where two TV's }tend to be on simultaneously (often with one or both of them NOT }being actively watched) and figure the other two "users" are just }listening to music/etc.]

Assuming you're just doing the audio, then a family or four means you do not need more than four composite streams, or 4*(5+1) = 24 individual streams.

Rather than send all audio sources to all audio destinations, route all audio sources to your audio server, which mixes the appropriate sources for each destination. So, for example, if someone is listening to music, with a TV news channel on quietly in the background when a caller arrives at a the door and uses the imtercom, then the audio server mixes the music, TV, and intercom and feeds the result to the speakers that cover the area where this person is listening.

Vote

D

Don Y 14 years ago

I use the "family of 4" observation just as an example. I see no reason to impose a numerical limit on the number of streams (what happens if you're a family of *5*? Or, running 7.1 audio to accompany multiple video programs?) :>

Note that the number of streams ("channels") will tend to be less than the number of "active clients" as a channel may be delivered to multiple clients.

For example, it is common for a radio/TV to be in use in two places, here, simultaneously -- as we may be actively moving between the two rooms in question (and too lazy to shut down a source each time we *leave* a room and *start* that source on entering the other room)

A client only sees the program material (channel) intended for it.

Correct.

Correct. Note that in the example I mentioned above, this could cause the clients in *one* room to suddenly see *different* material than the clients in that *other* room (if, for example, we wanted to hear the intercom in the first of these two rooms, only).

The mental model here is that the network is *just* "virtual speaker wire" while the server performs all the activities normally associated with tuner/receiver/preamp/etc. (this is a slight misstatement as the "speaker" has gain and some rudimentary tone control, etc.)

Vote

Audio CODEC choices

Join the Discussion

Didn't find your answer?