What is a “virtual” loudspeaker? Part 1

#91.1 in a series of articles about the technology behind Bang & Olufsen

Without connecting external loudspeakers, Bang & Olufsen’s Beosound Theatre has a total of 11 independent outputs, each of which can be assigned any Speaker Role (or input channel). Four of these are called “virtual” loudspeakers – but what does this mean? There’s a brief explanation of this concept in the Technical Sound Guide for the Theatre (you’ll find the link at the bottom of this page), which I’ve duplicated in a previous posting. However, let’s dig into this concept a little more deeply.

To begin, let’s put a “perfect” loudspeaker in a free field. This means that it’s in a space that has no surfaces to reflect the sound – so it’s an acoustic field where the sound wave is free to travel outwards forever without hitting anything (or at least appear as this is the case). We’ll also put a “perfect” microphone in the same space.

Figure 1: A loudspeaker and a microphone (the circle) in a free field: an infinite space completely free of reflective surfaces.

We then send an impulse; a very short, very loud “click” to the loudspeaker. (Actually a perfect impulse is infinitely short and infinitely loud, but this is not only inadvisable but impossible, and probably illegal.)

Figure 2: The “click” signal that’s sent to the input of the loudspeaker.

That sound radiates outwards through the free field and reaches the microphone which converts the acoustic signal back to an electrical one so we can look at it.

Figure 3: The “click” signal that is received at the microphone’s location and sent out as an electrical signal.

There are three things to notice when you compare Figure 3 to Figure 2:

The signal’s level is lower. This is because the microphone is some distance from the loudspeaker.
The signal is later. This is because the microphone is some distance from the loudspeaker and sound waves travel pretty slowly.
The general shape of the signals are identical. This is because I said that the loudspeaker and the microphone were both “perfect” and we’re in a space that is completely free of reflections.

What happens if we take away the microphone and put you in the same place instead?

Figure 4: The microphone has been replaced by something more familiar.

If we now send the same click to the loudspeaker and look at the “outputs” of your two eardrums (the signals that are sent to your brain), these will look something like this:

Figure 5: The outputs of your two eardrums with the same “click” signal from the loudspeaker.

These two signals are obviously very different from the one that the microphone “hears” which should not be a surprise: ears aren’t microphones. However, there are some specific things of which we should take note:

The output of the left eardrum is lower than that of the right eardrum. This is largely because of an effect called “head shadowing” which is exactly what it sounds like. The sound is quieter in your left ear because your head is in the way.
The signal at the right eardrum is earlier than at the left eardrum. This is because the left eardrum is not only farther away, but the sound has to go around your head to get there.
The signal at the right eardrum is earlier than the output of the microphone output (in Figure 3) because it’s closer to the loudspeaker. (I put the microphone at the location of the centre of the simulated head.) Similarly the left ear output is later because it’s farther away.
The signal at the right eardrum is full of spikes. This is mostly caused by reflections off the pinna (the flappy thing on the side of your head that you call your “ear”) that arrive at slightly different times, and all add together to make a mess.
The signal at the left eardrum is “smoother”. This is because the head itself acts as a filter reducing the levels of the high frequency content, which tends to make things less “spiky”.
Both signals last longer in time. This is the effect of the ear canal (the “hole” in the side of your head that you should NOT stick a pencil in) resonating like a little organ pipe.

The difference between the signals in Figures 2 and 4 is a measurement of the effect that your head (including your shoulders, ears/pinnae) has on the transfer of the sound from the loudspeaker to your eardrums. Consequently, we geeks call it a “head-related transfer function” or HRTF. I’ve plotted this HRTF as a measurement of an impulse in time – but I could have converted it to a frequency response instead (which would include the changes in magnitude and phase for different frequencies).

Here’s the cool thing: If I put a pair of headphones on you and played those two signals in Figure 5 to your two ears, you might be able to convince yourself that you hear the click coming from the same place as where that loudspeaker is located.

Although this sounds magical, don’t get too excited right away. Unfortunately, as with most things in life, reality tends to get in the way for a number of reasons:

Your head and ears aren’t the same shape as anyone else’s. Your brain has lived with your head and your ears for a long time, and it’s learned to correlate your HRTFs with the locations of sound sources. If I suddenly feed you a signal that uses my HRTFs, then this trick may or may not work, depending on how similar we are. This is just like borrowing someone else’s glasses. If you have roughly the same prescription, then you can see. However, if the prescriptions are very different, you’ll get a headache very quickly.
In reality, you’re always moving. So, even if the sound source is not moving, the specific details of the HRTFs are always changing (because the relative positions and angles to your ears are changing) but my system doesn’t know about this – so I’m simulating a system where the loudspeaker moves around you as you rotate your head. Since this never happens in real life, it tends to break the simulation.
The stuff I showed above doesn’t include reflections, which is how you determine distance to sources. If I wanted to include reflections, each reflection would have to have its own HRTF processing, depending on its angle relative to your head.

However, hypothetically, this can work, and lots of people have tried. The easiest way to do this is to not bother measuring anything. You just take a “dummy head” -a thing that is the same size as an average human head (maybe with an average torso) and average pinnae* – but with microphones where the eardrums are – and you plunk it down in a seat in a concert hall and record the outputs of the two “ears”. You then listen to this over earphones (we don’t use headphones because we want to remove your pinnae from the equation) and you get a “you are there” experience (assuming that the dummy head’s dimensions and shape are about the same as yours). This is what’s known as a binaural recording because it’s a recording that’s done with two ears (instead of two or more “simple” microphones).

If you want to experience this for yourself, plug a pair of headphones into your computer and do a search for the “Virtual Barber Shop” video. However, if you find that it doesn’t work for you, don’t be upset. It just means that you’re different: just like everyone else.* Typically, recordings like this have a strange effect of things sounding very close in the front, and farther away as sources go to the sides. (Personally, I typically don’t hear anything in the front. All of the sources sound like they’re sitting on the back of my neck and shoulders. This might be because I have a fat head (yes, yes… I know…) and small pinnae (yes, yes…. I know…) – or it might indicate some inherent paranoia of which I am not conscious.)

* Of course, depressingly typically, it goes without saying that the sizes and shapes of commercially-available dummy heads are based on averages of measurements of men only. Neither women nor children are interested in binaural recordings or have any relevance to such things, apparently…

on to Part 2

Beosound Theatre: Outputs

#86 in a series of articles about the technology behind Bang & Olufsen

Beosound Theatre has a total of 11 possible outputs, seven of which are “real” or “internal” outputs and four of which are “virtual” loudspeakers. As with all current Beovision televisions, any input channel can be directed to any output by setting the Speaker Roles in the menus.

Internal outputs

On first glance of the line drawing above it is easy to jump to the conclusion that the seven real outputs are easy to find, however this would be incorrect. The Beosound Theatre has 12 loudspeaker drivers that are all used in some combination of level and phase at different frequencies to all contribute to the total result of each of the seven output channels.

So, for example, if you are playing a sound from the Left front-firing output, you will find that you do not only get sound from the left tweeter, midrange, and woofer drivers as you might in a normal soundbar. There will also be some contribution from other drivers at different frequencies to help control the spatial behaviour of the output signal. This Beam Width control is similar to the system that was first introduced by Bang & Olufsen in the Beolab 90. However, unlike the Beolab 90, the Width of the various beams cannot be changed in the Beosound Theatre.

The seven internal loudspeaker outputs are

Front-firing: Left, Centre, and Right
Side-firing: Left and Right
Up-firing: Left and Right

Looking online, you may find graphic explanations of side-firing and up-firing drivers in other loudspeakers. Often, these are shown as directing sound towards a reflecting wall or ceiling, with the implication that the listener therefore hears the sound in the location of the reflection instead. Although this is a convenient explanation, it does not necessarily match real-life experience due to the specific configuration of your system and the acoustical properties of the listening room.

The truth is both better and worse than this reductionist view. The bad news is that the illusion of a sound coming from a reflective wall instead of the loudspeaker can occur, but only in specific, optimised circumstances. The good news is that a reflecting surface is not strictly necessary; therefore (for example) side-firing drivers can enhance the perceived width of the loudspeaker, even without reflecting walls nearby.

However, it can be generally said that the overall benefit of side- and up-firing loudspeaker drivers is an enhanced impression of the overall width and height of the sound stage, even for listeners that are not seated in the so-called “sweet spot” (see Footnote 1) when there is appropriate content mixed for those output channels.

Virtual outputs

Devices such as the “stereoscope” for representing photographs (and films) in three-dimensions have been around since the 1850s. These work by presenting two different photographs with slightly different perspectives two the two eyes. If the differences in the photographs are the same as the differences your eyes would have seen had you “been there”, then your brain interprets into a 3D image.

A similar trick can be done with sound sources. If two different sounds that exactly match the signals that you would have heard had you “been there” are presented at your two ears (using a binaural recording) , then your brain will interpret the signals and give you the auditory impression of a sound source in some position in space. The easiest way to do this is to ensure that the signals arriving at your ears are completely independent using headphones.

The problem with attempting this with loudspeaker reproduction is that there is “crosstalk” or “bleeding of the signals to the opposite ears”. For example, the sound from a correctly-positioned Left Front loudspeaker can be heard by your left ear and your right ear (slightly later, and with a different response). This interference destroys the spatial illusion that is encoded in the two audio channels of a binaural recording.

However, it might be possible to overcome this issue with some careful processing and assumptions. For example, if the exact locations of the left and right loudspeakers and your left and right ears are known by the system, then it’s (hypothetically) possible to produce a signal from the right loudspeaker that cancels the sound of the left loudspeaker in the right ear, and therefore you only hear the left channel in the left ear. (see Footnote 2)

Using this “crosstalk cancellation” processing, it becomes (hypothetically) possible to make a pair of loudspeakers behave more like a pair of headphones, with only the left channel in the left ear and the right in the right. Therefore, if this system is combined with the binaural recording / reproduction system, then it becomes (hypothetically) possible to give a listener the impression of a sound source placed at any location in space, regardless of the actual location of the loudspeakers.

Theory vs. Reality

It’s been said that the difference between theory and practice is that, in theory, there is no difference between theory and practice, whereas in practice, there is. This is certainly true both of binaural recordings (or processing) and crosstalk cancellation.

In the case of binaural processing, in order to produce a convincing simulation of a sound source in a position around the listener, the simulation of the acoustical characteristics of a particular listener’s head, torso, and (most importantly) pinnæ (a.k.a. “ears”) must be both accurate and precise. (see Footnote 3)

Similarly, a crosstalk cancellation system must also have accurate and precise “knowledge” of the listener’s physical characteristics in order to cancel the signals correctly; but this information also crucially includes the exact locations of the loudspeakers and the listener (we’ll conveniently pretend that the room you’re sitting in does not exist).

In the end, this means that a system with adequate processing power can use two loudspeakers to simulate a “virtual” loudspeaker in another location. However, the details of that spatial effect will be slightly different from person to person (because we’re all shaped differently). Also, more importantly, the effect will only be experienced by a listener who is positioned correctly in front of the loudspeakers. Slight movements (especially from side-to-side, which destroys the symmetrical time-of-arrival matching of the two incoming signals) will cause the illusion to collapse.

Beosound Theatre gives you the option to choose Virtual Loudspeakers that appear to be located in four different positions: Left and Right Wide, and Left and Right Elevated. These signals are actually produced using the Left and Right front-firing outputs of the device using this combination of binaural processing and crosstalk cancellation in the Dolby Atmos processing system. If you are a single listener in the correct position (with the Speaker Distances and Speaker Levels adjusted correctly) then the Virtual outputs come very close to producing the illusion of correctly-located Surround and Front Height loudspeakers.

However, in cases where there is more than one listener, or where a single listener may be incorrectly located, it may be preferable to use the “side-firing” and “up-firing” outputs instead.

Wrapping up

As I mentioned at the start, Beosound Theatre on its own has 11 outputs:

Front-firing: Left, Centre, and Right
Side-firing: Left and Right
Up-firing: Left and Right
Virtual Wide: Left and Right
Virtual Elevated: Left and Right

In addition to these, there are 8 wired Power Link outputs and 8 Wireless Power Link outputs for connection to external loudspeakers, resulting in a total of 27 possible output paths. And, as is the case with all Beovision televisions since Beoplay V1, any input channel (or output channel from the True Image processor) can be directed to any output, giving you an enormous range of flexibility in configuring your system to your use cases and preferences.

1. In the case of many audio playback systems, the “sweet spot” is directly in front of the loudspeaker pair or at the centre of the surround configuration. In the case of a Bang & Olufsen system, the “sweet spot” is defined by the user with the help of the Speaker Distance and Speaker Level adjustments.

2. Of course, the cancelling signal of the right loudspeaker also bleeds to the left ear, so the left loudspeaker has to be used to cancel the cancellation signal of the right loudspeaker in the left ear, and so on…

3. For the same reason that someone else should not try to wear my glasses.

Fixed point vs. Floating Point

When an analogue audio signal is converted to a digital representation, the value of the level for each sample is rounded to the nearest quantisation step (because a digital audio system does not have an infinite resolution). I’ve talked about this in detail in a past posting.

When a sample value in a digital audio stream is stored or transmitted inside a piece of audio equipment or software, one of the choices the engineer can make is whether the value should be represented using a fixed point or a floating point system. These are related, but fundamentally different, and they have some effects on the audio signal that may be audible if you’re not careful…

Let’s lay down some basic points to start. We’ll say the following:

Audio is a kind of AC signal that has a level that can vary between two values.
For now, we’ll say that the limits on the range of values is -1 and +1, and it can be anything in between.
We’re going to divide up that range into some finite number of steps and round the actual signal value to the closest usable value. (I’ll assume for this posting that you already understand that dither is your friend.)
The value will be stored as a binary number somehow

The question that we’ll look at here is exactly how that binary value represents the number, and a little of what that means to the audio signal.

Fixed Point Representation

The simplest way to represent the value is to divide the total range from the minimum to the maximum number into an equal number of steps, and round the signal’s value to the closest step. This is a really generalised description of a “fixed point” system.

For example, if we have a 3-bit number to play with, we’ll take the first bit and use that one to represent the + or – portion of the value (where 0 means “+” and 1 means “-“). For values from 0 up to (just under) the positive maximum, the other 2 bits are used to just count the steps, from 000 up to 011. The negative values start at the bottom and work their way up to 1 step below 0, from 100 to 111. This can be seen in Figure 1.

Figure 1: A simplified representation of the use of quantisation steps in a 3-bit fixed point system.

If you look carefully at Figure 1, you’ll see that there is one extra negative step, since one of the positive steps is used to represent the value 0 in the middle. This means that, if the signal is symmetrical, then we will wind up using all of the possible quantisation values except for the bottom one (just like I’ve shown in the plot), however, for the rest of this discussion, we’ll be working with numbers that are so big that this one step doesn’t really matter, so I won’t mention it again.

If we are using a 3-bit number to represent the value, then we have a total number of 2³ quantisation steps: 8 of them. Each time we add one more bit, we double the number of steps. So, for a 16-bit sample, we have 2¹⁶, or 65,536 possible quantisation values. For a 24-bit sample, we have 2²⁴, or 16,777,216 steps.

By increasing the number of bits in the number, we don’t change the level (it still has a range of -1 to +1), we’re just increasing the resolution that we have to make the measurement. The higher the resolution, the lower the error, and so the lower the level of distortion (if we don’t dither) or noise (if we do) relative to the signal.

If you have a fixed-point system, and you want to calculate the difference in level between the maximum signal level and the noise floor, then you can use a somewhat simplified equation, shown below:

Dynamic Range In dB ≈ 6 * nBits – 3

As I said, this is simplified due to some rounding to keep the numbers nice, but the general idea is that you have a doubling of dynamic range for every extra bit (therefore 6 dB per bit) and you lose 3 dB for the (TPDF) dither (but that’s better than not having the dither and having distortion instead). If you wanted to do it properly, then you can use this math instead:

Dynamic Range In dB ≈ 20*log10(2^nBits) – 20*log10(sqrt(2))

So, if you have a 16-bit fixed point system, you have about 93 dB of range from the loudest signal to the noise floor. If you have a 24-bit system, it’s about 141 dB.

Remember that the noise floor is constant (I’m assuming it’s dithered), so as the signal level drops below maximum the current signal to noise ratio will drop by the same amount. Therefore, if your signal is 12 dB below maximum (or -12 dB FS, which means “12 decibels below Full Scale”), then the SNR in a 16-bit system is 93 – 12 = 81 dB.

If that last paragraph didn’t make complete sense, go back and read it again, because it’ll come back later…

Fixed point is a good system for conversion of an audio signal from and to analogue, but if you’re doing some really serious processing, it might not work out so well. This is due to two primary reasons:

If your signal is going to outside the range, it will clip at the maximum positive or the minimum negative value because fixed point is not designed to exceed its range.
If the signal is going to be reduced to a very low level somewhere in your proceeding (say, inside a biquad, for example) then you might need a LOT of bits to keep the noise floor low enough when the signal level is brought back up

Figure 2: The first half of a sine wave (in grey) quantised (without dither) in a simplified 4-bit fixed point system. (I’ve actually cheated a bit and just made 8 equally-spaced steps from 0 to 1 unlike the version shown in Figure 1.) The two plots show identical data, but the bottom plot has a logarithmically-scaled Y-axis.

As can be seen in Figure 2, the equally-spaced steps in a fixed point world mean that the quantisation error is always between -0.5 and 0.5 of a step (a “Least Significant Bit” or LSB), regardless of the level of the signal.

Floating Point Representation

There is another way to use the bits to represent the signal value. This is to divide the binary “word” into two parts and to do a little math involving some subtraction, multiplication, and an exponent to arrive at the value. Just like in the Fixed Point case, we’ll reserve one bit for the +/- indicator.

Let’s say that we have a 32-bit value to work with. We’ll divide this up into the following:

23 bits for the fraction or mantissa, which we’ll abbreviate f
8 bits for the exponent, abbreviated e
1 bit for the +/- sign (just like in Fixed Point)

We’ll then do the following math:

Sample Value = ± (1 – f) * 2^e

We need to know a little extra information:

because we’re using 23 bits for f, then it can range from 0 to 2²³-1. In other words, stated mathematically:
0 ≤ 2²³*f < 2²³
because we’re using 8 bits for e, then it has a total range of 2⁸ possible values. In other words it has a range from just over -2⁷ to just under 2⁷. In other words, stated mathematically:
-126 ≤ e ≤ 127
(Note that a couple of possible values are reserved for special purposes, but we won’t talk about those)

This is all a little complicated, but there is a “punch line” to which I’m headed:

Unlike Fixed Point representation, the divisions of the values – the number of steps, and therefore the step sizes – are not the same across the entire scale of possible values. It’s divided into sections, where each section has quantisation steps of equal size, but that step size is dependent on what the value is. In other words the step size changes with the value, but on a coarser scale.

That step size can be calculated as follows:

From 2^e to 2^e+1, the steps all have an equal size of 2^e-fBits where fBits is the number of bits used to express f (in the case of a 32-bit floating point word, fBits = 23 bits). In other words, we have 2^fBits equally-spaced steps in that range.

Therefore, each time the signal value moves from just below 0.5 to just above (for example) then the resolution changes, and the higher the value, the lower the resolution. This is is how Floating Point representation behaves.

Figure 3: The first half of a sine wave (in grey) quantised (without dither) in a simplified floating point system with 2 bits for the fraction. This means that there are 4 equally-spaced steps from (for example) 0.25 to 0.5 or 0.5 to 1. The two plots show identical data, but the top plot has a linearly-scaled Y-axis, whereas the bottom plot has a logarithmically-scaled Y-axis.

Do I care?

Let’s find out.

In a 32-bit floating point world (therefore, one with a 23-bit fraction), if I have a signal that has a level that has has a maximum positive value of 1 (or 2⁰), then the resolution of the value (which defines the error, which defines the “distance” in dB to the noise floor) is 2^-25 (or 1/33,554,432).* This means that the noise floor is about 150 dB below the signal (20 * log10(1 / 2^-25). As the signal level drops to 0.5, the noise floor remains the same, so the signal drops by 6 dB, and the SNR reduces to 150 – 6 = 144 dB.

Then, when we drop just below 0.5, the resolution of the value suddenly changes to 2^-26 (or 1/67,108,864) , which means that the noise floor is about 150 dB below the signal (20 * log10(0.5 / 2^-26). As the signal drops to 0.25 (-6 dB relative to 0.5), the noise floor remains the same, so the signal drops by 6 dB, and the SNR reduces to 150 – 6 = 144 dB.

Then, when we drop just below 0.25, the resolution of the value suddenly changes to 2^-27 (or 1/134,217,728), which means that the noise floor is about 150 dB below the signal (20 * log10(0.25 / 2^-27). As the signal drops to 0.25 (-6 dB relative to 0.5), the noise floor remains the same, so the signal drops by 6 dB, and the SNR reduces to 150 – 6 = 144 dB.

Hopefully, by now, you’re seeing a pattern here.

Figure 4: Notice that the error of the floating point version is reduced when the signal level (in grey) approaches 0.

Figure 6: The errors from the quantisation shown Figure 5. These are just the original signal subtracted from the quantised signals. Notice that, in Floating Point, the general level of the error is dependent on the level of the signal (it’s smaller on the left and right of the plot) whereas in Fixed Point, the overall level of the error is more constant.

The cool thing is that the pattern would have been the same if I had gone above 1 instead of below it. So, the two things to worry about in Fixed Point (inadequate resolution with (temporarily) low-level signals and clipping when the signal goes outside the range) are not problems in floating point.** And, if you have enough bits (32-bit floating point is the standard “single precision” resolution, but 64-bit “double precision” resolution is not uncommon).

Figure 7: The Signal to Distortion+Noise ratio of four different systems, as a function of the signal level in dB FS.**

This is why, in most modern audio systems, you have a fixed-point ADC and a DAC (an Analogue to Digital Converter and a Digital to Analogue converter) at the input and output of your system (because the signal range is reasonably well-defined, and the dynamic range is more than adequate if you do it right) but the processing on the inside is done in 32-bit or 64-bit floating point (or both, in some devices) so that the engineers have the resolution and the range to play with the signals before getting them ready for the output.***

There may be some argument made for a constant noise floor level in a fixed-point system (assuming it’s dithered) over a signal-modulated noise level in a floating-point world (assuming it’s not), however, there are two reasons why this is likely not a real-world issue. The first is that, even in a single-precision floating point system, the worst-case signal to noise ratio is about 144 dB, which is very good. The second is that smart people have already been thinking about dither for floating point systems. If this sounds interesting, you can start reading here…

One last thing

You may be wondering about that sawtooth plot: the red line in Figure 7. It can’t keep going forever, right?

Right.

Eventually, if the signal is quiet enough, then you run out of exponents and the system just behaves as a 23-bit fixed point system (assuming a 32-bit floating point). This will happen when e = -126. Below that, then the SNR just follows a downward slope just like the fixed-point plots. If the signal is loud enough (when e = 127) then you’ll clip, again, just like the fixed-point systems do when the input signal has a level of 0 dB FS.

So, then the question is: “how quiet / loud does the input signal have to be for that to happen?” The answer is very quiet and very loud, as you can see in the plot in Figure 8.

Fig 8. The limits of a 32-bit floating point signal. As you can see, you’ve got plenty of dynamic range to work with before you run out of room on either side. The black line is 16-bit fixed point, the blue line is 24-bit fixed point, and the red line is 32-bit floating-point.

You may be wondering how I calculated those limits:

The first peak in the sawtooth on the left side is at 20*log10(2^-126) = -758.6 dB FS
The last peak in the sawtooth on the right side is at 20*log10(2^127) = 764.6 dB FS
The slope that just below the 0 dB FS Signal level is where e = -1. The slope just above 0 dB FS is where e = 0.

* First small note for the attentive

You may have noticed what appears to be a mistake in my math in there. First I said:

From 2^e to 2^e+1, the steps all have an equal size of 2^e-fBits where fBits is the number of bits used to express f (in our case, fBits = 23 bits). In other words, we have 2^fBits equally-spaced steps in that range.

Then I did the math and said

In a 32-bit floating point world (therefore, one with a 23-bit fraction), if I have a signal that has level that has just come up to 1 (or 2⁰), then the resolution of the value (which defines the error, which defines the “distance” in dB to the noise floor) is 2^-25 (or 1/133,554,432).

Why did I say 2^-25 when maybe I should have said 2^-23 (because there are 23 bits in the fraction)? The reason is that the 2²³ quantisation levels are located between 1 down to 0.5. If I were to continue with the same spacing down to 0, then I would have twice as many quantisation levels, so there would be 2²⁴ instead. If I were to continue the spacing all the way down to -1, then there would be twice as many again, or 2²⁵.

In other words, a floating point signal ranging from a value of 2^-1 to 2⁰ (0.5 to 1) with some number of bits in the fraction that we’re calling fBits will have almost exactly the same signal to noise ratio as an non-dithered fixed point system that is scaled to range from -1 to 1 with fBits+2.

This would be the same from -2⁰ to -2^-1 (-1 to -0.5).

At any other signal value, the quantisation behaviours (and therefore the signal-to-noise ratios) of the two systems will be significantly different.

This is visible in Figure 6 where, when the signal is high (in the middle of the plots), the error level is approximately the same in the 4-bit fixed-point system and the floating point system with 2 bits for the fraction.

** Second small note for the attentive

You will notice that the black, blue, and green lines in Figure 7 have a sharp transition when the signal level hits 0 dB FS. This is because, in a fixed point system at signal levels below 0 dB FS, the signal to noise ratio is the difference in level between the dither’s noise floor and the signal. The dither level is constant, so as the signal level increases, it gets “further away” from the noise floor until you reach 0 dB FS (with a sine wave), as which point you reach the maximum possible SNR. However, once the signal goes beyond 0 dB FS (still assuming it’s a sine wave), then it starts to clip and distortion components are generated. It does not take much increase in level to drastically increase the level of the distortion relative to the level of the signal (since the signal level cannot increase – you’re just increasing distortion artefacts). Consequently, the signal to distortion+noise drops dramatically, because the distortion components increase in level dramatically.

This does not happen with the floating point system because, at 0 dB FS, you just change the exponent and keep going up with the signal level until you reach the maximum possible exponent value, which goes far beyond what I’ve plotted here.

Third small note for the attentive

You may be looking at Figure 7 and wondering why the fixed point plots and the floating point plots don’t overlap anywhere. For example, look where the green line (32-bit fixed point) crosses the red line (32-bit floating point). Why don’t they overlap each other there for that little 6 dB-wide range on the X-axis?

The reason is that I’m modelling the fixed point SNRs with TPDF dither, which “costs” 3 dB, but I’m assuming that the floating point signal is not dithered (which would normally be the case). If I were pretending that fixed point didn’t include the dither, then the plots would, indeed, overlap each other for that narrow little window.

***One last comment

You may be saying to yourself “But this is nonsense! Why do I need 150 dB SNR when the signal level is lower than -100 dB FS?” The long answer is in this posting, but the short answer is that the signal can go VERY low and VERY high inside a filter (a biquad), so you need to worry about this if you’re doing any changes to the magnitude response of the signal, for example…

Quantisation of Poles in the Z-plane

Back in this posting, I talked about biquads and their use in digital signal processing for making linear filters (what most of us call “equalisers”). Part of that explanation showed that a biquad is formed of a feed-forward section and a feed-back section, and you could swap the order of these two and get the same results. In a “Direct Form 1” implementation, the feed-forward comes first, as shown in Figure 1.

Figure 1: A Direct Form 1 implementation of a biquad

in the “Direct Form 2” implementation, the feed-back comes first, as shown in Figure 2.

Figure2: A Direct Form 2 implementation of a biquad.

There are advantages and disadvantages to each of these implementations, depending on things like how the rest of your system is implemented, and what, exactly you’re expecting the biquad to do.

However, for this posting, we’re going to “zoom in” a bit to the feed-back portion of the above diagrams. This portion of the biquad provides the “poles” in the Z-plane, as I described in the “Intuitive Z-plane” series of postings last week.

If I separate the feed-back portion of the above figures, it would look like Figure 3:

Figure 3: A Direct Form implementation of the feed-back portion of a biquad.

The locations of the poles (and therefore the magnitude response) of this portion of the filter are dependent on the gains at the outputs of the two 1-sample delays. What happens when these gains do not have an infinite resolution (which they can’t, because everything in a digitally-represented world is quantised to a finite number of steps)?

Before we go any further with this, I have to put in a reminder that quantisation (or “rounding”) in a DSP world is done in binary – not decimal. So, the quantisation that I’m about to do isn’t the same as stopping a couple of digits after the decimal…

In a world with infinite resolution, I can set the values of those two gains to be anything I want, and therefore the poles in my system can be anywhere within the circle. (We’ll assume that we don’t want a pole on the circle because that would result in a gain of ∞ dB, which is very loud.) However, let’s say that we quantise the gain values to a 4-bit binary value, where the first bit is reserved for the +/- indicator. This means that we only have +/- 2^3 possibilities, or 8 values, one of which is +0, and another is -0 (which is the same thing…). So, in other words, we have 7 possible negative values, 7 possible positive values, and 0.

The end result of this is that the poles have a limited number of locations where they can be placed. For the 4-bit quantisation described above, the resulting locations look like Figure 4.

Figure 4: The circles indicate the possible locations of the poles as a result of quantisation of the gain coefficients in a 4-bit Direct Form implementation.

Remember that we’re not talking about quantisation of the audio signal – it’s quantisation of the gain coefficients inside the biquad that will have an impact on the response of the filter.

Of course, it would be crazy to implement a biquad using only 4-bit quantisation for the gain coefficients. However, the point of this posting is not to show that biquads suck. It’s only to show one possibly important aspect of them if you’re a DSP engineer – but we’re getting there.

Just for fun, let’s increase the resolution of the system to 6 bits:

Figure 5: The dots indicate the possible locations of the poles as a result of quantisation of the gain coefficients in a 6-bit Direct Form implementation.

… or to 8 bits:

Figure 6: The dots indicate the possible locations of the poles as a result of quantisation of the gain coefficients in a 4-bit Direct Form implementation.

Any higher than this and the pattern will just get so dense that it’ll turn black, so I’ll stop.

So what?

At this point, you may be asking “so what?” The answer to this very important question lies on the far right side of all of those graphs. Notice that there aren’t any locations near the point on the graph where X=1 and Y=0. You may remember from Figure 7 in this posting that I did last week, that that’s the point on the Z-plane that corresponds to very low frequencies.

This means that, if you want to create a digital filter, and you need a pole near 0 Hz, you’re going to run into some trouble with a Direct Form implementation of the biquad. Yes, the higher the bit depth of the gain coefficients, the closer you can get, but this might not be the best way to do things.

There is another option for implementing your feed-back portion of the biquad. You can use a design called a Coupled Form instead, which is shown in Figure 7 (compare it to Figure 3).

Figure 7: A Coupled Form implementation of the feed-back portion of a biquad.

Notice that you still have two 1-sample delays, however there are now 4 gains instead of 2. How can this be better? Well, in a system with infinite resolution on the gain coefficients, it’s not. Given the appropriate choice of gain values, this implementation will do exactly the same thing as the Direct Form implementation if your resolution is infinite.

However, if you have a limited resolution, then the available locations of the poles on the Z-plane are very different. Let’s use the 4-bit, 6-bit, and 8-bit quantisations of the gain values again: these are shown in Figures 8 to 10.

Figure 8: The circles indicate the possible locations of the poles as a result of quantisation of the gain coefficients in a 4-bit Coupled Form implementation.

Figure 9: The dots indicate the possible locations of the poles as a result of quantisation of the gain coefficients in a 6-bit Coupled Form implementation.

Figure 10: The dots indicate the possible locations of the poles as a result of quantisation of the gain coefficients in a 8-bit Direct Form implementation.

As you can see in Figures 8 through 10, the Coupled Form implementation, given the same resolution on the gain coefficients, will give you much better placement opportunities for the poles in the low frequency region than the Direct Form implementation.

Of course, in all of these examples, I’m only showing up to an 8-bit word, and a typical DSP runs uses a lot more than 8 bits for the gain coefficients. So, it’s possible that, in the real world, the actual resolution is high enough that this is of no concern whatsoever.

However, if you’re building a very-low-frequency filter an if the magnitude response isn’t exactly what you’re looking for, this might help you get a little closer to your goal.

It’s important to point out here that the quantisation of the locations of the zeros is not the same as this. Someday, I’ll come back to this and plot the expected vs. actual magnitude responses of some biquads where I’ve quantised the zeros and poles, just to see how badly things go wrong when they do…

Intuitive Z-plane: Part 5 – Conclusion

If you’ve read through the first four parts of this series, then you’re already at a point where you can intuitively understand what’s going on. We just have a couple of details to take care of before finishing off.

Firstly, the plots showing the zeros and poles in the figures you’ve been looking at plots of the “Z-plane” or “Complex-plane“. As I said at the start, we’re only trying to get to an intuitive understanding of these plots – so I’m not going to get into complex numbers, or even much math (apart from what you’ll see below… which isn’t very complicated, and avoids complex numbers).

When I’m developing a new DSP algorithm, I use an application called Max from cycling74.com. Figure 1 shows a screenshot from Max, where I’m using an object to calculate the biquad coefficients to make a low pass filter, as you can see. I’ve then connected the output of that object (it looks like a magnitude response) to a Z-plan representation that shows me the same thing in a different way.

Figure 1: The top plot shows the magnitude response of the filter. The bottom plot shows the Z-plane representation of the same filter.

You may notice that this plot has two poles, one at (0, 0.408) and the other at (0, -0.408). In fact there are two zeros there as well, but they’re situated in the same place, on “on top” of the other, at (-1, 0). This is always true for a biquad – there are always two zeros and two poles. Sometimes, they’re located in the same place, sometimes not, sometimes they’re placed symmetrically, sometimes not, depending on the filter, as we’ll see below.

Let’s look at that Z-plane representation in 3-dimensions:

Figure 2: A 3D view of the Z-plane representation of the filter shown in Figure 1.

Figure 3: The same plot as shown in Figure 2, rotated to show the back of the plot.

So, as you would now expect, the poles pull up the edge of the circle, and the zeros (both in the same place) pull down, giving the red line the height that it has.

Now, think back to this Figure from earlier in the series:

Figure 4: Think of the edge of the circle as the frequency

If you therefore look at Figure 3, which is like looking at Figure 4 from the top, you’ll notice that the height of the red line (the edge of the circle is high on the left (in the low frequencies) and drops as you go to the right (the high frequencies). This is the magnitude response that’s shown on the top of Figure 1. The only difference is that it’s on a linear scale instead of a logarithmic scale, so the shape looks a little weird.

Let’s do another one:

Figure 6: A 3D view of the Z-plane representation of the filter shown in Figure 5.

Figure 7: The same plot as shown in Figure 6, rotated to show the back of the plot.

Hopefully, now you are able to look at a Z-plane representation of a filter and think about the effect of the poles and zeros on the edge of the circle, and therefore get a rough idea of the magnitude response of the filter…

If not, I apologize for wasting your time. On the other hand, if you’re in a life-threatening situation, this knowledge probably wouldn’t help you anyway… Very few people have gotten a critical injury in a biquad accident.

How I did it

If you want to make these plots for yourself, the math is pretty simple.

Start by choosing the frequency, which will be a point on the circle. You then find the four distances from the zeros and poles to that point (I’ve indicated those distances in Figure 8 with the variables z1, z2, p1, and p2.) This can be done using the Pythagorean theorem.

To find the gain of the filter at the frequency, you divide the sum of the zeros’ distances by the sum of the poles’ distances. In other words:

(z1 + z2) / (p1 + p2)

That will give you the result as a linear value. If you then want to convert it to decibels, like I’ve done, you do a little extra math like this:

20 * log10 ( (z1 + z2) / (p1 + p2) )

That’s it! You just need to do repeat that math for each frequency that you’re interested in, and you’re done!

Intuitive Z-plane: Part 4 – How, not what

I ended Part 3 by saying that DSP engineers think of the frequency scale as a circle rather than as a straight line. The questions are “Why do they think like this? What’s wrong with them?” Although I can’t answer the second question, the answer to the first question is fundamentally simple.

A DSP engineer is not really interested in what a filter does. She or he is interested in how it filters. A normal magnitude response plot shows us mortals the result of what’s happened to the audio after it’s gone through a filter (or a processor in general). Someone making that filter (or system) needs to know how it’s working instead.

So far, in this series, we’ve seen the following:

Digital audio filters are made with feed-forward and feed-back delays with different gains.
Feed-forward delays make narrow dips (and wide peaks) in the magnitude response
Feed-back delays make narrow peaks (and wide dips) in the magnitude response
DSP engineers think of frequency on a circle instead of a straight line
DSP engineers also want to see plots of how a filter works instead of its result on the audio signal.

Let’s put all of this together.

We’ll draw the circle showing the frequency scale, but then rotate the view to see it in three dimensions. For example, I can pretend to make the surface of of the circle out of a rubber sheet that can be pulled upwards (like a tent) or downwards (like a funnel), whilst always maintaining a circular edge.

If I want to pull the tent upwards, I’ll use a “pole” to do it. That pole has an infinite height (we’re going to need some very stretchy rubber). If I want to pull the funnel downwards, I’ll use something I’ll call a “zero“. (I am not going to go into why zeros and poles are called that, so as to avoid doing too much math.)

So, if I were to put a zero in the middle of the circle, its 2D representation would look like Figure 1 (notice the red circle in the middle showing where the zero is placed), and the 3D version would look like Figure 2:

Figure 1: A 2D representation of a “zero” (indicated by a red circle) in the middle of the circle that shows the frequency scale (the red line shows the scale going counter-clockwise from 0 Hz on the right to Fs/2 on the left)

Figure 2: A 3D representation of the same information shown in Figure 1.

If I were to put the pole (indicated by a red ‘x’) in the middle of the circle instead, then the result would look like Figures 3 and 4.

Figure 3: A 2D representation of a “pole” (indicated by a red circle) in the middle of the circle that shows the frequency scale (the red line shows the scale going counter-clockwise from 0 Hz on the right to Fs/2 on the left)

Figure 4: A 3D representation of the same information shown in Figure 3.

So far so good… If we were to rotate Figures 2 or 4 and look at the red line that I’ve drawn around the edge, we’d see that it’s flat with a height of 0 (on the vertical scale) all the way around. This is because I’ve carefully placed the zero or the pole at the exact middle of the circle, so it’s pulling equally on all points of the edge of the “tent” or the “funnel”.

However, what would happen if I moved the zero or the pole away from the middle? Some examples of this for a zero moved to the location (-0.75, 0) are shown in Figures 5 to 7, below.

Figure 5: A zero placed at the location (-0.75, 0), shown in a 2-dimensional representation

Figure 6: A 3-dimensional view of Figure 5.

Figure 7: The same plot as the one shown in Figure 6, rotated to show the “back” of the shape.

As you can see in Figure 7, when the zero is moved away from the centre of the circle, it pulls downwards on the closer edge (notice how the red line is lower than the black line which has a constant height of 0). However, it also doesn’t pull downwards as much on the opposite side of the circle (notice how the red line is higher than the black line on the left side).

Of course, if I were to do the same thing with a pole, everything would behave symmetrically, as shown in Figures 8 to 10.

Figure 8: A pole located at the position (-0.75, 0)

Figure 9: a 3-dimensional view of Figure 8

Figure 10: The same plot as Figure 9, rotated to show the “back”

We’re almost finished… One more posting to go to wrap up.

Intuitive Z-plane: Part 3 – More setup

I wrote an intuitive explanation of aliasing in this posting and dug in a little deeper, looking at the side-effects of aliasing with audio signals specifically in this posting.

One of the more important figures in that second posting is repeated below in Figure 1.

Figure 1: The black line shows the intended frequency of a sine wave created in the digital domain, expressed as a fraction of the sampling rate (Fs). The red line shows the actual frequency that comes out of the system as a result – its “alias”.

Let’s say that we wanted to make a sine wave generator in the digital domain. This is pretty easy to do using some rather simple math, as follows:

Output(n) = sin(2 * π * Fc / Fs * n)

where Fc is the frequency of the sine wave in Hz, Fs is the sampling rate in Hz, and n is the time, expressed as a sample number.

There are no restrictions on Fc – so if you wanted to plug in a value that is higher than Fs/2 (the Nyquist frequency) then you’ll get a value. However, if you used this math to try to make a sine wave where Fc > Fs/2, then the output will be different from what you expect. This is what’s shown in Figure 1. The red curve shows the actual frequency of the output (read off the Y-axis) for an intended frequency (on the X-axis).

This problem of the difference between input and output is identical to what would happen if you rotated a wheel by some angle, and then asked someone to measure the rotation. For example, look at Figure 2.

Figure 2. The Red arrow shows the actual rotation. The Black arrow shows the assumed rotation.

On the left, it shows a wheel that was rotated clockwise by 90º (indicated by the red arrow). Someone measuring the rotation would say that it was rotated by 90º – a perfect match! If you rotated by 180º (the second example), the person measuring would also get the right answer. However, if you rotated by 270º (the third example, in the middle), the person measuring would (correctly) say that you rotated by 90º counterclockwise. A rotation of 360º gets you back where you started, so it would be measured as 0º. A rotation of 450º (the example on the right) would be measured as a rotation of 90º.

If we were to do this a lot, and plot the results, they’d look like Figure 3.

Figure 3: The results of my little thought experiment shown in Figure 2.

Now compare Figure 3 to Figure 1. Notice how they’re identical? This is important because it’s a graphic example of exactly the way frequencies “wrap” in a digital audio world. This “wrapping” is the result of the fact that a sinusoidal wave (a signal containing only one frequency) is just a 2-dimensional view of a 3-dimensional rotation (I showed this with photos of a Slinky™ in this posting.

When we normal people look at a magnitude response of a device – let’s say, a low-pass filter, we put it on a nice cartesian plot with the frequency displayed on a straight line on the X-axis and the magnitude displayed on a straight line called the Y-axis. This looks something like Figure 4.

Figure 4: A typical, familiar way of viewing a magnitude response of something (in this case, a low-pass filter with Fc=1 kHz).

However, this is only a portion of the truth. The truth extends further than the limits of that plot. I conveniently stopped plotting at Fs/2 (since the filter that I made is running at 48 kHz, this plot goes up to 24 kHz). I also didn’t plot anything below 20 Hz – and I certainly didn’t extend the plot below 0 Hz into the negative frequencies… (“Negative frequencies?” I hear you ask… These are the same as positive frequencies, except that 3-dimensional wheel is rotating in the opposite direction; but since we’re only looking at it on-edge from one location, we can’t tell whether it’s rotating clockwise or counter-clockwise. See this posting if you want to go further.)

Let’s try extending the plot. First, I’ll show Figure 4, but using a linear scale for the frequency instead of a logarithmic scale. This is shown in Figure 5.

Figure 5: The same information shown in Figure 4, but on a linear frequency scale. Notice how 1 kHz is quite low compared to 24 kHz.

If I then were to plot beyond Fs/2, then the magnitude response would be a mirrored version of the one you see in Figure 4. The same would be true if I were to plot below 0 Hz. This is shown in Figure 6.

Figure 6: The same information shown in Figure 5, but showing an extended frequency range.

What does this mean? It means for example that, if I had an LPCM system running at 48 kHz, and I were to digitally generate a sine tone at 48 kHz, then the result would be the same as making a “sine tone” at 0 Hz (or “DC”) because all of the samples would have the same value – neither 0 Hz nor 48 kHz would be a sinusoidal wave in a 48 kHz system. If I then, inside the same system, sent that “48 kHz sine tone” through a low-pass filter with a cutoff frequency of 1 kHz, then it would go through un-impeded (just like a 0 Hz signal would get through a low-pass filter).

Assembling the pieces

Let’s take the illustration I just showed in Figure 6, and consider it, knowing what I showed in the comparison between Figures 3 and 1.

Although we normal people show each other magnitude responses that look like the one in Figure 4, this is not the way people who make digital signal processing (DSP) software think. They see the frequency axis on a circle that goes from 0 Hz up to Fs/2 (the Nyquist frequency), and then wraps back around to 0 Hz (= Fs). This weird way of viewing the world is shown in Figure 7.

Figure 7: DSP engineers think of the frequency axis as the top half of a circle that starts at 0 (Hz) on the right side, and goes (linearly) up to Fs/2 when you get to the opposite side. An example, with a system running at 48 kHz, is shown on the right.

There are some very good reasons why DSP engineers think like this – one of which you already know (the wrapping and aliasing issue). There are some reasons I’m not going to talk about here (but you can read this if you’re interested), and there are some other reasons that I’m headed towards…

However, before we move on to the next chapter in our little saga, it’s best to get really comfortable with the plots in Figure 7. I especially want you to get used to some specific things, in order of importance:

The frequency scale is circle – it’s not a straight line.
The scale starts on the right (at the 3 o’clock position) and goes counter-clockwise to the left (the 9 o’clock position).
The scale is linear, not logarithmic, like you’re used to seeing.
The maximum frequency is the Nyquist frequency, so it’s defined by the sampling rate.
Once the point on the circle goes beyond the Nyquist, we’ve started aliasing, and so we’ve entered a symmetrical world that mirrors the half below the Nyquist. (In other words, when we get a little farther, you’ll see that the top and the bottom of that circle are mirror images of each other – as I’ve already hinted in Figure 6 looking at the frequency range from 0 to 48 kHz.)

Intuitive Z-plane: Part 2 – Peaks and Dips

Most digital filters that are applied to audio signals use a “basic” building block called a “biquadratic filter” or “biquad” which consists of 2 feed-forward delays and 2 feed-back delays, each with its own output gain and a delay time of 1 sample. I’ve already talked a little about biquads in this posting, where I showed a couple of different ways to implement it. One of the standard ways is shown below in Figure 1.

Figure 1: A biquad implemented using the “Direct Form 1” method.

The signal flow that I drew for Figure 1 is a little more modular than the way it’s normally shown, but that’s to keep things separate for the purposes of this discussion.

The two feed-forward delays add to the input signal (via gains b0, b1, and b2) and the result shows up at the red arrow. Remember from Part 1 that this portion of the biquad can only make a magnitude response that has (in an extreme case) infinitely deep, sharp dips, and smooth rounded peaks.

The signal from the red arrow onwards goes into the feed-back portion of the filter with two feed-back delays adding through gains -a1 and -a2. Again, remember from Part 1 that this portion of the biquad can make a magnitude response that has infinitely deep, sharp peaks, and smooth rounded dips.

Let’s say that we wanted to make a simple filter – let’s make it a low pass filter – using this biquad. How do we do it?

The simplest way is to cheat and go straight to the answer.

Cheating Option 1: You go to this page at www.earlevel.com and put in the parameters you’re interested in (Filter Type, sampling rate, Fc, Q, etc…) and copy-and-paste the resulting five gains (we’ll call them “coefficients” from now on).

Cheating Option 2: We search on the Interweb for the words “RBJ Audio Cookbook” and then spend some time copying, pasting, and porting the equations that Robert Bristow-Johnson bestowed upon us many years ago* into your processor. You then say “I want a low pass filter at 1000 Hz with a Q of 0.5, please” and the equations spit out the five coefficients that you seek.

However, if you cheat, you’ll never really get a grasp of how those coefficients work and what they’re really doing – and that’s where we’re headed in this little series of articles. So, you might decide to go through this series, and then cheat afterwards (that’s what I would recommend…)

Now, before you go any further, I’ll warn you – the whole purpose of this series is to give you an intuitive understanding. This means that there are things I’m going to (intentionally) skip over, merely mention in passing, or omit completely. So, if you already know what I’m talking about, there’s no point in reading what I’m writing – and there’s certainly no need to email me to remind me that I didn’t mention some aspect of this that you think is important, but I’ve decided is not. If you feel strongly about this, write your own blog.

P.S.

* Thanks, Robert!

Intuitive Z-plane: Part 1 – Introduction

Back in this posting, I made the following statement:

Generally speaking, digital filters work by taking an audio signal, delaying it, changing the level, and adding the result back to the signal itself.

I then showed an example of a simple digital filter like this:

If we use the filter in Figure 1, set the delay to 0, and set the gain to 1, then the output is just the input signal added to itself, so it’s two times the amplitude or about 6 dB louder.

If we leave the gain at 1, and set the delay to something else – let’s say, 1 ms, for example – then, in the very low frequencies (say, 1 Hz) the phase difference caused by the 1 ms delay is almost nothing – therefore the output will be +6 dB. At 500 Hz, however, the 1 ms delay is equal to a 180º phase shift, so the output of the delay will always be opposite in polarity with the non-delayed signal. Therefore, at 500 Hz, this filter will have no output. At 1 kHz, the output will be +6 dB again, because 1 ms = 360º at 1 kHz. At 1.5 kHz, there’s no output (540º phase shift), at 2 kHz, we’re back to +6 dB, and so on all the way up. The result is a magnitude response that looks like this:

Figure 2: The magnitude response of the filter shown in Figure 1, if Delay = 1 ms and Gain = 1. Note that these are two identical plots, the only difference is the scaling of the X-axes.

If I used the same delay time on the same filter structure, but set the gain to something between 0 and 1 – let’s make it 0.75, for example, then the overall shape would be the same, but the effect would be less, as shown in Figure 3.

Figure 3: The magnitude response of the filter shown in Figure 1, if Delay = 1 ms and Gain = 0.75.

If we make the gain a negative value, then the overall shape remains the same, but the high points and the low points swap places because the delayed signal is now cancelling where it was adding, and vice versa.

Figure 4: The magnitude response of the filter shown in Figure 1, if Delay = 1 ms and Gain = -0.75. Notice how the boosts and dips have swapped places as a result of using a negative gain.

Let’s think about this intuitively. If my audio signal cannot exceed a value of 1 (which is normally the way we work… a full-scale signal ranges from -1 to 1) and the gain of the delay output in the filter in Figure 1 also ranges from -1 to 1, then the maximum possible output of the filter is 2.

If I had two delay lines and I were summing all three signals (the input and the two delayed signals) and the gains were still limited within the range of -1 to 1, then the maximum possible output would be 3…

However, the minimum possible output level (not the minimum possible output value) would be no signal (as in the case of a 500 Hz input in Figure 2. This is equivalent to an output of -∞ dB.

If I generalise this, then I can say that if your filter is built using ONLY the summed outputs of feed-forward delays, then the maximum possible output can be easily calculated, and the minimum possible output is no signal.

Still generalising: notice that the “bumps” in the above frequency responses are smooth and rounded, and the dips are pointy notches.

What happens if the filter uses feed-back instead of feed-forward?

Figure 5: A simple filter with a feed-back delay line.

Let’s set the delay time to 1 ms again, and set the gain to 0.99. I chose 0.99 instead of 1 because this means that each time the signal re-circulates back, it will get a little bit quieter. If I had set the gain to 1, then the delay would keep “echoing” forever. If I made it greater than 1, then the output would get louder on each re-circulation, and things would get very loud, sooner or later…

So, if Delay = 1 ms and Gain = 0.99 in the filter in Figure 5, the resulting magnitude response looks like Figure 6.

Figure 6: The magnitude response of the filter in Figure 5 when Delay = 1 ms and Gain = 0.99.

There are some things to notice in Figure 6.

Firstly, notice that the overall “shape” of the curve is upside-down relative to the one in Figure 2. The rounded bits are on the bottom and the pointy bits are on the top. This means that instead of having very narrow notches, you have very narrow resonances that are “singing” like a collection of sinusoidal waves, one at the frequency of each peak.

Secondly, notice that the peaks and the dips are in the same places as in Figure 2. In both cases, the Delay = 1 ms and the gain is positive, so the frequencies that are boosted are the same in both cases. For example, both have a peak at 1 kHz, and a dip at 500 Hz.

Thirdly, notice that the overall level is much, much higher. 40 dB is a LOT louder than 6 dB – this is because the sum of all those re-circulated signals echoing over and over in the filter add up to something loud over time.

If I reduce the gain, but keep it positive, then (just like in the case with the feed-forward filter) the shape of the magnitude response stays the same, it’s just reduced in effect.

Figure 7: Delay = 1 ms and Gain = 0.75. Notice that the peaks are smaller because the lower gain causes the signal to “die away” faster.

If I did the same thing, but set the gain to a negative number (say, -0.99) instead, then each time the signal re-circulates, it flips polarity. The resulting magnitude response looks like Figure 8.

Figure 8: The filter in Figure 5 where Delay = 1 ms and Gain = -0.99.

Notice that this is related to the magnitude response in Figure 4 – we have less output in the low end, and now the first peak is at 500 Hz instead of 1 kHz.

If I generalise this one, then I can say that if your filter is built using ONLY the summed outputs of feed-back delays, then the peaks are much higher than with a feed-forward design because they’re resonating.

Still generalising: notice that the “bumps” in the above frequency responses are pointy (because they’re resonances), and the dips are smooth and rounded.

The summary (for now)

Repeating myself, because this is the take-away information for this posting:

If your filter is built using ONLY the summed outputs of feed-forward delays, then:
- the maximum possible output can be easily calculated
- the minimum possible output is no signal
- the “bumps” in the above frequency responses are smooth and rounded
- the dips are pointy notches.

if your filter is built using ONLY the summed outputs of feed-back delays, then:
- the peaks are much higher in level than with a feed-forward design because they’re resonating
- the “bumps” in the above frequency responses are pointy because they’re resonating
- the dips are smooth and rounded

“High-Res” Audio: Part 13: Wrapping up

As I’ve stated a couple of times through this series, my reason for writing this stuff was not to prove that high res audio is better or worse than normal res audio (whatever that is…). My reason was to highlight some of the advantages and disadvantages associated with LPCM audio at different bit depths and sampling rates. Just as a bullet-point summary of things-to-remember/consider (with some loose grouping):

“High resolution audio” could mean
- “more than 16 bits per sample”
  or
- “a sampling rate higher than 44.1 kHz”
  or
- both.
These two dimensions of the specifications have different implications on the signal

Doubling the sampling rate only increases your audio bandwidth by 1 octave.
Yes, it’s twice as much information, but that’s only one octave. If you add an extra octave on top of a piano, you don’t get twice as many notes.
Just because you have more bits per sample doesn’t mean that you are actually getting more resolution.
There are examples out there where a “24-bit recording” is just a 16-bit recording with 8 zeros stuck on the end.
Just because you have a higher sampling rate doesn’t mean that you are actually getting a recording that was done at that sampling rate.
There are examples out there where, if you do a spectral analysis of a “high-res” recording, you’ll see the cutoff filter of the original 44.1 kHz recording.
Just because you have a recording done at a higher sampling rate doesn’t mean that the extra information you get is actually useful.

There is no such thing as “temporal resolution” or “better timing information” caused by higher sampling rates. It’s not film.
Staircase drawings of digital audio signals are just there to help you understand the concept – they don’t actually exist in the audio signal.

If your playback system has sampling rate converters (it probably does), try to make sure that they’re good.
If they’re bad (which happens often), then it could be that a “high res” signal sounds/performs worse than a “normal res” signal.
If you are filtering the audio signal at low frequencies, it’s better to have a lower sampling rate.
If your processing distorts the signal for some reason, it’s better to have a higher sampling rate to keep the aliased distortion artefacts as far away from the audio signal as possible.
If you are a lazy DSP engineer who thinks that filters give you the expected magnitude response, no matter what the centre frequency, you’d better have a higher sampling rate. (Or you could just stop being lazy and compensate.)
If you need a lower noise floor for the same audio bandwidth, it’s more efficient to add bits than to increase the sampling rate.

There are many cases where you want equipment that has higher specifications than your audio signal.
If you have a volume control after the conversion to analogue, then 93 dB of dynamic range (16 bits, TPDF dithered) might be enough – especially if you listen to music with a limited dynamic range. However, if your volume control is in the digital domain, and you have a speaker that can play loudly, then you’ll probably want more dynamic range, and therefore more bits per sample hitting the DAC.

Like I said, I’m not here to tell you that one thing is better or worse than another thing.

As I said, my intention in writing all of this is to help you to never fall into the trap of assuming that “high resolution audio” is better than “normal resolution audio” in all respects.

More is not necessarily better, sometimes, it’s not even more. Don’t fall victim to misleading advertising.

earfluff and eyecandy

mostly audio, but with some other stuff occasionally

Category: DSP