Audio CODEC choices

Hi,

I'm looking for ideas for an audio CODEC to use in streaming audio to my "network speakers". I've debugged the system using "raw" 16b samples. But, this wastes a fair bit of bandwidth "needlessly" (?)

Ideally, I am looking for a lossless format that has *modest* encoding costs and *very* low decoding costs (time+space). I.e., consider a single transcoder feeding a large number of endpoints.

Any "frames" (if present) should be relatively small as I may need to synchronize streams to a fraction of a frame (consequences for space requirements).

Ideally, the decoder should directly support sample interpolation (though this could be done in a post-processing step) to synchronize to a fraction of a sample interval.
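
(For concreteness, the sort of thing I have in mind -- a minimal linear-interpolation sketch in C, names of my own invention:)

#include <stdint.h>
#include <stddef.h>

/* Fractional-delay step: linearly interpolate between adjacent
 * samples. 'frac' in [0,1) is the offset of the desired output
 * point past s[i]; caller guarantees i+1 is a valid index. */
static inline int16_t interp_sample(const int16_t *s, size_t i, float frac)
{
    return (int16_t)((float)s[i] + frac * (float)(s[i + 1] - s[i]));
}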

Since it is intended for streaming use, there should be no limits on "position" indicators inherent in the format.

I need to be able to split a multitrack source into an equivalent number of single tracks or combine several single tracks into a multitrack stream. The cost for doing so should be low (i.e., AAAAABBBBB is cheaper to encode/decode than ABABABABAB).
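
(To make the cost difference concrete, a toy de-interleave in C; with planar AAAAABBBBB storage this copy disappears entirely:)

#include <stdint.h>
#include <stddef.h>

/* Split interleaved stereo (ABAB...) into two single tracks. */
void deinterleave(const int16_t *in, int16_t *a, int16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = in[2 * i];        /* track A */
        b[i] = in[2 * i + 1];    /* track B */
    }
}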

[what else have I forgotten?]

Any pointers to existing (unencumbered) algorithms or suggestions as to how to best roll my own?

Thx!

--don

Reply to
Don Y

I'd like to hear what you think of Xiph's latest codec, CELT (Constrained Energy Lapped Transform). AFAIU, it's still in development, but it sounds (ha!) interesting to me (a casual observer).


Regards.

Reply to
Noob

TI "PurePath" Digital Audio.

The prominent American physicist Robert A. Millikan (1868 - 1953) was also known as a pathologically talkative person. In his name, the main unit of garrulity was called "kan". On this scale, Dr. Millikan scored only 1e-3 kan, or one millikan. What is your score, Don ?

[..bla-bla-bla... bla-bla-bla... 2+2=4... 0xAA55, >> /dev/null ]

VLV

Reply to
Vladimir Vassilevsky

Vladimir, leave Don alone, will you. :-) Were it not for the threads he initiates and the stuff he puts up for discussion, the group could well have been dead already. We should thank him for what he does, actually, come to think of it.

Dimiter

Reply to
dp

LOL

Reply to
Jim Stewart

AFAIK, they only had wireless solutions. [You can look at CobraNet for a wired commercial offering] Problems include:

- what happens when someone keys a significant RF source nearby (signal is "muted") I.e., you have a vulnerability not present in a wired solution

- how many "active channels" can they support (reliably) before error recovery, bandwidth, etc. suffer noticeably?

- only designed for 2/4 channels of audio (how do you do 5.1 or 7.1?)

- scales poorly (have to add transmitters/masters to get additional "audio program channels")

- range limitations (think: amphitheaters, large convention halls, etc. -- as well as multipath issues in residences)

- limited range of delays (~16-50ms) means you can't sync acoustic waves over distances of ~50 ft (without adding external hardware)

- delays are locked at integral multiples of the sample clock and, since the data isn't readily accessible, you can't easily compensate by ballistic interpolation between samples

- data is fixed at 16b (or worse)

- power must be supplied locally (i.e., you can't deliver power alongside the data)

- where's the *video* counterpart??

Consider different application domains:

- a typical home (parents + 2 kids) would routinely require two "video" programs (2 x 5.1 audio) plus two "audio" programs (2 x 2 or 4 channels) [since it seems like everyone is either watching their own TV/movie or listening to their own "music"]

- an amphitheater could require dozens of (mono) channels as each channel can be tailored to a particular driver stack located at various temporal displacements from the proscenium

Dunno. Though, if it bothers you, you might consider adding my name to your kill file! I very deliberately maintain a consistent "header" so anyone who is bothered by my posts can free themselves of even the slightest *distraction* or temptation to reply!

[Personally, if I don't want to read something, I just don't *click* on it! :> ]

I get it, Vladimir -- you have nothing constructive to add...

Reply to
Don Y


Lossy.

If I need to "compress" the audio to reduce bandwidth/storage/etc. I want that to be a separate design issue. Here, I am looking for a transport protocol that gets "signal" from A to B, reliably, and with manageable costs.

Reply to
Don Y

These days, why bother with compression in a real time network ?

For instance 100baseT Ethernet segments are capable of carrying quite a few uncompressed audio channels.

Using some differential PCM might drop the bit rate to one half with low latency.
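
(A minimal first-difference sketch in C -- not any particular standard; a real coder would entropy-code or clamp the deltas, since they need one extra bit worst case:)

#include <stdint.h>
#include <stddef.h>

/* First-difference PCM: transmit d[i] = s[i] - s[i-1].
 * Smooth signals yield small deltas that pack into fewer bits. */
void dpcm_encode(const int16_t *s, int32_t *d, size_t n)
{
    int16_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        d[i] = (int32_t)s[i] - (int32_t)prev;
        prev = s[i];
    }
}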

Why ?

At a 48 kHz sampling rate, sound travels about 7 mm during one sample period (roughly 343 m/s / 48000), thus moving the speaker by a few millimeters will have the same effect.

Reply to
upsidedown

Note that this was the approach I initially took! But, in doing so, I fully planned on testing with the network quiescent, etc. That's not something I can expect of a deployed system...

You don't always have control over what *else* is on the wire (do all clients have QoS guarantees?, etc.) so you have no idea as to what portion of the total bandwidth might be available to you at any given time.

Similarly, you have no control over the latency of particular packets wrt other consumers on the wire.

[unless, of course, you are designing a network *specifically* for -- and dedicated to -- this purpose]

But, the issue goes beyond just network bandwidth!

Compression acts as a storage magnifier, too. I.e., if you can efficiently decompress "on the fly", then you can store compressed packets in your client's elastic store and effectively increase the depth of that buffer -- with no associated hardware cost.

E.g., if you are storing 2 channels of 16b data at ~50Ks/sec, then you need 200KB/sec of stored audio. Probabilistically, if your CDF indicates that "all" (ahem) of your packets will arrive with a worst-case latency not exceeding 0.5sec, then you *must* set aside 100KB *just* for the data buffer (neglecting any other memory requirements in your device).
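
(The same back-of-envelope sizing, in code form:)

/* Elastic store sizing, per the figures above. */
enum {
    SAMPLE_RATE   = 50000,                       /* ~50 Ks/sec       */
    BYTES_PER_SEC = SAMPLE_RATE * 2 /*ch*/ * 2,  /* 200 KB/sec       */
    LATENCY_MS    = 500,                         /* worst-case (CDF) */
    BUFFER_BYTES  = (BYTES_PER_SEC / 1000) * LATENCY_MS  /* 100 KB   */
};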

If you can get 2:1 compression *in* the CODEC, then that drops to 50K. Alternatively, your probability of having to deal with a dropout decreases (though not proportionately as you're probably out on the tail of the CDF at that point) for the same size buffer.

If you are trying to put this into something (physically) small and *inexpensive*, this can make a big difference (e.g., when you consider the PHY, power conditioning, connectors, audio amp, etc., you really don't have much room to fit "external memory"... especially if you're NOT interested in megabytes of it!)

You're assuming the sample rate is fixed at 48K *and* that you can "move the speaker". :>

The CODEC shouldn't care what the sample rate is (barring some constants). So, I should be able to use the same software to drive infrasonics (trading off frequency response, sample rate and buffer depth)

This also increases the signal processing possibilities in the client (e.g., a client could then resample efficiently)

Reply to
Don Y

If you want a lossless unencumbered algorithm then look at FLAC


For general audio over network it depends on what you want on the network at the same time. Most professional level things use ethernet as a physical layer then put their own data layer on top. Examples include:

Aviom A-Net
Rocknet

The next layer uses Ethernet framing so standard hubs and switches can be used, but these systems need the network to be dedicated to themselves. Examples include

CobraNet
EtherSound

Finally, the highest level encapsulates data in IP packets and can share the network with other traffic. The most notable product here is:

Dante by Audinate

Of course Apple do AirTunes by encoding to something which looks very like ALAC on the fly then decoding it in the AirPort Express. If iTunes is feeding multiple speakers then it delays the local output to take account of the codec delay to get it across the air and out of the AirPort.

In terms of 'home' you get terrifically complicated things like DLNA happening.

-p

--
Paul Gotch
--------------------------------------------------------------------
Reply to
Paul Gotch

D'oh! I thought FLAC (and SHN) had no compression! I see I was led astray by some of the large file sizes in my audio archive and *ASSuMEd* so, erroneously.

I see no advantage to this approach. It's like capitalizing on the hardware and nothing else...

I thought CobraNet could share the network *but* required special switches and "bandwidth set-asides/reservations"?

How do they ensure audio emitted from two or more devices is synchronized? Or, is that all done "open loop"?

Much too ambitious an undertaking. I just want to develop a tool/component that others can build upon to solve their own particular problems. UPnP, etc. requires too much in terms of standardization, agreement, compatibility, interoperability, etc.

E.g., I expect to only deliver audio in *one* form to the "network speakers" and push all of the transcoding issues to the server side. I.e., keep the clients simple, reliable and inexpensive -- by limiting the variations they would otherwise have to support.

For *me*, this works well as I tend to play music from a stored archive. I.e., convert it *once*, store it in that format and then the transcoding costs disappear... :>

Thanks for an informative post!

Reply to
Don Y

You could specify that a dedicated network should be used.

If someone insists on using the system on a _heavily_ loaded network, then it is their problem.

Alternatively, forget Ethernet and use some dedicated wiring, such as multiple SPDIF connections. In either case, the end user will have to install new wiring.

Compression also adds latency.

You are not going to be able to handle this kind of application in the internal RAM of a single-chip microcontroller. Thus you have to use some external RAM chips with more or less constant cost, unless we are talking about multimegabyte chips. The cost situation was of course quite different a few decades ago.

In my experience, on a lightly loaded network, worst-case latencies are in the tens-of-milliseconds region. This reduces the buffer requirement by an order of magnitude.

In a simple non-compressed PCM system, some random missing samples are not a big deal (can be interpolated) as long as you know which samples are missing (frame counter).

On the other hand, a missing frame in a compressed system is a much more serious thing: it will corrupt a large number of samples. To avoid this, compressed bits have to be interleaved across several frames. If a frame is lost, the deinterleaver at the receiver converts the error burst into scattered bit errors, which can then be corrected by ECC bits. Interleaving/deinterleaving will increase both the latency and the memory requirements.
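
(A minimal block interleaver shows the idea -- fill by rows, transmit by columns, so one lost frame lands as isolated, correctable errors spread across many frames:)

#include <stdint.h>

/* Block interleaver: in[] holds 'rows' frames of 'cols' bytes.
 * Transmitting column by column scatters any one lost column
 * across all 'rows' frames after deinterleaving. */
void interleave(const uint8_t *in, uint8_t *out, int rows, int cols)
{
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            out[c * rows + r] = in[r * cols + c];
}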

Then consider SPDIF.

If you have a high-performance server and low-cost clients, why not let the server handle the synchronization and framing and send fully processed streams to each client?

Do you really expect that the user will be able to hear that 7 mm difference? At one time, phase-linear speakers were popular, with the bass element moved forward so that the voice coils of the high- and low-frequency elements were in the same plane. I have not seen much of this fad lately, so apparently phase accuracy is not so critical.

Reply to
upsidedown

Don,

I seriously wonder why you would want to compress audio for domestic use on a domestic network. A 1Gb network easily accommodates 500 hifi audio channels while leaving room to spare for some video and ordinary network traffic. I think it's just not worth the effort.

Meindert

Reply to
Meindert Sprang

Gb is not anywhere near as ubiquitous as 100Mb (or 10Mb)

Gb cabling and fabric is more expensive (e.g., I only run Gb between a set of 4 colocated hosts, here... the rest of the house is 100Mb)

PoE is harder to support over Gb

Compressed data takes less space to buffer (*in* its destination) than uncompressed data. So, the size/cost/complexity of that device can be simplified if the data comes *to* it in compressed form (rather than adding memory *or* compressing *after* receiving)

---

What I've been working on FOR THE HOUSE (*this* house, YMMV :> ) are small modules roughly the size of a "classic" ice cube (1 in x 1 in x 1.5 in). This size easily fits into a standard "1 gang" junction box -- like a single duplex electrical outlet would require -- without feeling "cramped".

[There are other *commercial* applications but I'm not interested in those]

These modules consist of a PoE PD circuit (so the devices can be powered *from* the network), 100Mb network interface, CPU and two channel ("stereo") amplifier -- along with appropriate connectors, etc. PoE lets me deploy lots of devices without having to burn power in all of them while they sit idle. I.e., if I want to route audio to a particular device, I can power up the device, deliver the audio stream and then power the device back down.

Without moving to PoE+, I can get 10-15W at each module over the existing network cable. Connect *one* speaker to a device and you have modest sound output for a typical user. I am deploying most of these "in the ceiling" for background music, etc. (I am on a crusade to rid the house of obvious signs of technology! :> )

For more intimate settings (e.g., the food preparation area), you can hook a pair of speakers to one device and effectively have a tabletop stereo/radio. I.e., when you just want to listen to a news broadcast, etc.

Folks with teenagers in the house will need the optional, external, 8KW, water-cooled linear amplifier -- batteries not included! A pad on the audio output makes it trivial to drive an external amplifier directly.

With them in place, "TV" (video) can exploit that same hardware to present its audio channel(s). You could, for example, listen to a TV news broadcast without having to watch the "talking heads" (always amazes me how poorly news broadcasts use their video capabilities... why do I need to watch some bozo *reading* a prepared script??)

Having audio available "everywhere" also makes other things possible. E.g., the "doorbell" (as a discrete device hanging on the wall) disappears. The same is true of other "annunciators" (telephone, clothes wash cycle is complete, "It's raining! Close the windows!", etc.)

There are similar possibilities with *video* (though I haven't touched on those, here).

Reply to
Don Y

If this system/network is dedicated, you have a lot of control over it. Let's say we have 100Mb, which can transport, say, 6Mbyte/s continuously. That would be 35 hifi stereo channels. What is the chance that every node plays a unique channel? My wild guess is that 10 different music channels playing simultaneously would suffice for an ordinary household. This traffic will be mostly unidirectional, so there is very little chance of collisions. So a server could serve these audio channels almost in real time using UDP broadcasts, which lowers your buffer requirement dramatically.

I'm not familiar enough with UDP/TCP to know if you could mark these packets high priority, but if you can, there would be little chance of dropped UDP frames due to the occasional control packet for the rest of the system. Just a few ideas.... I hope they make sense.
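
(The sending side needs very little machinery -- a POSIX sketch, names of my own choosing:)

#include <sys/socket.h>

/* Open a UDP socket that is permitted to broadcast. The caller
 * then sendto()s one audio frame per sample block. */
int open_audio_bcast(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int on = 1;
    if (fd >= 0)
        setsockopt(fd, SOL_SOCKET, SO_BROADCAST, &on, sizeof on);
    return fd;
}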

By the way: 10-15W per module is quite a lot. What about heat dissipation? Are you using class D amplifiers?

Meindert

Reply to
Meindert Sprang

The Quality of Service (QoS) and virtual LAN mechanisms are Ethernet MAC level (802.1Q) issues, so the switches do not have to know about IP or TCP/UDP.
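
(On Linux, for example, a sender can tag a socket so the kernel can map its frames onto an 802.1p priority class -- a sketch only; the actual priority-to-VLAN mapping is driver/switch configuration:)

#include <sys/socket.h>

/* Linux-specific: set the priority the kernel may map to an
 * 802.1Q/802.1p traffic class on egress. */
int set_audio_priority(int fd)
{
    int prio = 6;   /* conventionally the "voice" class */
    return setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof prio);
}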

There seems to be an interesting article in Wikipedia


One trick used in some industrial protocols in which a very low number
of bits (perhaps 1..100 bits) are to be sent to each slave, is to pack
all samples into the same frame, and broadcast the frame over the LAN.
Each slave will know, where the bits intended for it are located in
the frame and only picks up those bits. In this way, the Ethernet
header overhead is not a problem and new frames can be sent
frequently. 

If in the audio application most traffic is from a central server to
thin clients, the stereo samples for client 1 could be e.g. at bytes
50..53, for client 2, bytes 54..57 of the frame and so on. Of course,
the frame header should contain some serial numbers/time stamps for
detection of missing frames. 

With 48 kHz sampling rate and 35 stereo channels, 48000 frames/s with
140 net bytes would have to be sent in each frame. This would in
practice saturate the 100baseT links. 

Putting 8 samples for each of the 35 stereo channels into a single
frame, the frame size would be close to the maximum 1500 bytes, thus,
the header overhead would be less and  there would be some capacity
for other traffic.
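
(A struct makes that layout concrete -- hypothetical field names; each client indexes its own row and ignores the rest:)

#include <stdint.h>

#define CLIENTS  35
#define BLOCK     8   /* stereo samples per client per frame */

struct audio_frame {
    uint32_t seq;                      /* detects missing frames */
    int16_t  pcm[CLIENTS][BLOCK][2];   /* [client][sample][L,R]  */
};  /* 4 + 35*8*2*2 = 1124 bytes: fits one 1500-byte Ethernet payload */
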
Reply to
upsidedown

Flac decoding is not too hard, and as mentioned there are no patents or other nonsense to worry about. There are plenty of open source implementations for big and small systems (look at the rockbox project for libraries for small systems).
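
(To give a feel for why the decode side is so cheap: FLAC's "fixed" predictors are small integer recurrences over previous samples. Order 2, for instance, reconstructs each sample from the entropy-coded residual with two additions and a shift:)

#include <stdint.h>
#include <stddef.h>

/* FLAC-style order-2 fixed predictor: s[i] = r[i] + 2*s[i-1] - s[i-2].
 * s[0] and s[1] arrive verbatim as warm-up samples. */
void fixed2_decode(const int32_t *r, int32_t *s, size_t n)
{
    for (size_t i = 2; i < n; i++)
        s[i] = r[i] + 2 * s[i - 1] - s[i - 2];
}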

However, Flac generally doesn't get more than about 50% compression - lossless compression of audio data is not easy.

If you are happy with a bit of loss (typically inaudible loss), then Ogg Vorbis is often a good choice.

Reply to
David Brown
[...]

since you can't rely on 2:1 with _lossless_ compression, this assumption is simply wrong.

Oliver

--
Oliver Betz, Munich
despammed.com is broken, use Reply-To:
Reply to
Oliver Betz

It's not dedicated to A/V use. I don't want to have to run *another* 1,000 ft of CAT5...

The model I keep in mind is two video streams plus two music streams. Keep in mind that video requires 5+ audio streams to accompany it (e.g., 5.1 audio).

So, there's 14+ audio streams and two video streams.

[I base this on observations of family-of-4 households where two TV's tend to be on simultaneously (often with one or both of them NOT being actively watched) and figure the other two "users" are just listening to music/etc.]

If every network stack is designed with QoS capabilities (and nothing *else* decides it is more important), it's not a problem. But there isn't a lot of wiggle room, either.

E.g., chances are, you won't turn off a TV while you check the video feed from the front door (in response to the doorbell).

Or, while you're watching a YouTube video on-line, etc.

One solution is a custom switch that has bandwidth throttling built-in. This would also prevent someone accidentally trying to flood one of the A/V devices with unrelated traffic!

Yes. One of the reasons the modules are smaller than "necessary" is to accommodate heatsinks.

Note, also, that even a few watts tends to be loud enough for background music "as one gets older" ;-)

Reply to
Don Y

I started looking at it last night. It seems like I can pull a lot of the "fluff" out of the decoder (many of the tags aren't pertinent to my application). I'll have to look closer at the *encoder* to see how to tune it to the resources available in my clients.

I might need to have clients advertise capabilities so the encoder knows how to ensure that the frames it delivers can be decoded in the resources available (while, simultaneously, *not* overcompensating).

Yes. Especially if you want the decoder to be efficient.

I'd prefer to target gold before settling for silver...

Reply to
Don Y
