[haiku-development] Re: Considering the audio Mixer formats
- From: "Adrien Destugues" <pulkomandy@xxxxxxxxxxxxx>
- To: haiku-development@xxxxxxxxxxxxx
- Date: Sat, 06 Feb 2016 20:29:37 +0000
Floating point math is not a magical way to increase precision. Actually,
the "float" format has only 24 bits of mantissa, so if you are after
precision, you are better off using 32-bit integers.
Beyond anything else, floating point math is known to be generally more
flexible and to allow simpler math, which means simpler algorithms; it's
also just the way the pro audio world represents data. We should consider
this apart from any preference or technical implementation. Personally,
I've always used floating point math, probably because my first steps in
serious audio were on JACK.
It seems we don't have the same definition of "pro audio". At the hardware
level, floats are not used, and the best you can get is 24-bit integers.
Let's take for example one company which at least considered using Haiku for
such purposes:
http://www.izcorp.com/products/radar/hardware-options/analogue-io/
As you can see, it is a 24-bit sound card, which also has a 16-bit mode that
is probably already enough for most purposes.
Now, on the software side, it may be more convenient in some cases to work with
floats. The reason is that you don't have to normalize your volumes. You can
add as many streams as you want (with a simple addition per stream), and you
won't overflow the range of a float value. Then, after mixing everything, you
can divide the final samples by the number of signals mixed together, to
normalize things back to the usual -1..1 range.
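As a sketch of that scheme (pure illustration in Python, not the actual mixer
code; the function name is made up):

```python
def mix_float(streams):
    """Mix equal-length streams of normalized float samples."""
    n = len(streams)
    # A plain addition per stream: floats won't overflow even if the
    # running total goes well beyond the nominal -1..1 range.
    mixed = [sum(frame) for frame in zip(*streams)]
    # Normalize back to -1..1 by dividing by the number of streams.
    return [s / n for s in mixed]
```

For instance, mixing [1.0, -1.0] with [0.5, 1.0] gives [0.75, 0.0].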
This all breaks apart once you do slightly more complex things. Let's take
for example a simple ring modulator. This is a node that takes two inputs and
multiplies them together. If your inputs are normalized in the -1..1 range,
the output is as well, so it would seem the node is easy to implement with
floats. But
if your input is not normalized, you will run into problems. If your two input
signals are in the range -2..2, the output will be in -4..4. You can see things
get out of control quite fast, and depending on the nodes you use, you can't
assume anything about the value ranges.
If you use integers instead, the ring modulator must do A * B /
MAX_SAMPLE_VALUE (with the division optimizing to a bitshift, which is a rather
cheap operation).
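To make the two versions concrete (a sketch, with hypothetical names, using
16-bit signed samples as the example integer format):

```python
SHIFT = 15                      # 16-bit signed samples: full scale is 2**15
MAX_SAMPLE_VALUE = 1 << SHIFT

def ring_mod_int(a, b):
    # A * B / MAX_SAMPLE_VALUE; dividing by a power of two reduces
    # to a cheap arithmetic shift, and the result stays in range
    # by construction.
    return (a * b) >> SHIFT

def ring_mod_float(a, b):
    # Fine as long as both inputs really are normalized to -1..1;
    # feed it -2..2 inputs and the output range grows to -4..4.
    return a * b
```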
I still fail to see what the problem is. The mixer code is quite simple, and
the current algorithms
target a low latency.
What I've been hypothesizing is whether we can reach better performance, or
fine-tune it, if we only have to handle one format.
I can see the need for a high-quality algorithm, probably for non-realtime
uses (because high quality algorithms tend to add latency). And I still
don't see what format conversions have to do with this. They cost us a
simple normalization operation per sample (one multiplication, one
addition), which is not where the problems are coming from (even the linear
interpolation algorithm, which is the simplest thing you could come up with,
needs more math than that).
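That per-sample normalization really is just a scale and an offset; e.g.
converting unsigned 8-bit samples to floats (an illustrative sketch, name
made up):

```python
def u8_to_float(s):
    # One addition (remove the 128 bias) and one multiplication
    # (scale down to the -1..1 range) per sample.
    return (s - 128) * (1.0 / 128.0)
```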
On second thought, I agree completely that the idea of using only floating
point was not a good one.
But let's move the discussion a level higher and see where the problem is.
I want to start with a question: what are the major needs of realtime audio?
1 - Low latency
2 - Avoid data conversions
Low latency doesn't depend strictly and inexorably on the mixing algorithm's
quality/speed itself, but more on which elements are being mixed. If I put a
chain of nodes that plays audio into Haiku's mixer, I can reasonably suppose
that under normal conditions it will adapt itself to make the best choice.
What might begin to be a problem is when different framerates and formats
are put into the mixing thread. At this point the mixer has to make a
choice, and there will be different trade-offs depending on the situation.
The mixer doesn't have much choice to make: it must generate things in a
fixed output format, matching the sound card's, or that of the output file
you are writing to.
Ideally, other nodes would use the same framerate, which is important to avoid
resampling. However, there isn't as much of a problem for the sample format.
The conversions are lossless when converting to a wider format (in order:
8-bit, 16-bit, float (effectively 24-bit), 32-bit), and if your output is in
a smaller format than the inputs, well, there isn't much to do; the extra
bits will be lost (though I guess time-dithering would be possible).
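The "float holds 24 bits" point can be checked directly: every 16-bit sample
survives a round trip through float unchanged, because the 24-bit mantissa
represents each of those values exactly (illustrative sketch, not the actual
converters):

```python
def s16_to_float(s):
    # Widening 16-bit int -> float is lossless: the division by a
    # power of two is exact for every 16-bit value.
    return s / 32768.0

def float_to_s16(f):
    # Converting back recovers the original sample exactly.
    return int(round(f * 32768.0))
```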
Now, the problem is a bit different if we are looking at things further up in
the node chain. Let's say you want to mix the output from two nodes and feed
the result into a third. In that case, the conversions may be a problem if they
are lossy, because the loss could be amplified by the remaining parts of the
node chain. I think a simple solution is that the mixer in that case should
output in the larger of the input sample formats (if the output node can
handle it, of course). A more annoying problem in that use case is the
sample rate. If your two inputs come at different sample rates, some
resampling will have to be applied, and there, going lossless is more of a
problem (it would involve taking the least common multiple of the input
frequencies as the output one - but that gives ridiculously high
frequencies).
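To put a number on "ridiculously high" (a sketch; the function name is made
up):

```python
from math import gcd

def lossless_common_rate(r1, r2):
    # Least common multiple of the two input rates: the only output
    # rate at which both inputs can be resampled without loss.
    return r1 * r2 // gcd(r1, r2)
```

For the two most common rates, lossless_common_rate(44100, 48000) comes out
to 7056000, i.e. a 7.056 MHz mixing rate just to stay lossless.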
However, I think it is reasonable to expect that all nodes in a chain can agree
on a single frame rate. We don't have to enforce a single one on the whole
system, there are two cases:
- You want to output to the sound card: the system mixer will mix all node
chains and resample things down or up as needed. There is some quality loss
with our default mixer, but it's ok in this use case (we can always add better
algorithms)
- You want output to a file: in this case you can easily master the whole media
node graph and make sure all nodes run at the same frequency.
What the audio user wants, instead, is the assurance that we don't incur any
format conversion through the chain.
The way JACK does it is the easiest: limit the whole chain to just one
framerate and one format. As you might imagine, this is nothing special in
itself, and it has its own drawbacks: JACK is simply unusable for any
average/consumer task. We are exactly the opposite: we can support a lot of
things in a matter of two lines of code, but we can't easily enforce a
certain mode when we need realtime audio.
I have some ideas on how to fix this "by design", but I think we can discuss
them when the time comes.
While agreeing on a common frame rate is important, and usually possible,
unless you're dealing with external constraints (soundcard, playing audio from
a file), I don't think enforcing a single sample format is as useful. What's
important is having a good decision process for agreeing on sample formats
that are not smaller than the final output, to avoid "bitcrushing" and loss
of useful information.
--
Adrien.