In the last post, I talked about why a THD+N measurement is useless if you don’t know what type of distortion you’re measuring. Let’s now talk about another reason why it’s useless in isolation.
Once again, let’s assume that we’re doing a THD+N measurement the old-fashioned way: we put a sine wave into a device, apply a notch filter to the output at the same frequency as the sine wave, and find the ratio of the level of the notch filter’s output (everything except the sine wave) to the level of the total output.
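If it helps to see that procedure as code, below is a minimal MATLAB sketch of the idea. It’s an idealised stand-in for the analogue instrument, not anything used in a real measurement rig: the fundamental is “notched out” by subtracting its least-squares sine/cosine fit, and the function name, signature and scaling are my own assumptions.

```matlab
% measure_thdn.m -- an idealised "notch filter" THD+N measurement
% (a sketch under assumptions, not a lab instrument). The fundamental
% is removed by subtracting its least-squares sine/cosine fit; THD+N
% is the RMS of what's left relative to the RMS of the whole signal.
function thdn = measure_thdn(x, f0, fs)
    x = x(:);                           % force a column vector
    t = (0:numel(x)-1)'/fs;
    s = sin(2*pi*f0*t);
    c = cos(2*pi*f0*t);
    residual = x - ((x'*s)/(s'*s))*s - ((x'*c)/(c'*c))*c;
    thdn = sqrt(mean(residual.^2)) / sqrt(mean(x.^2));  % 0.10 = 10%
end
```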
This time, instead of taking a signal and distorting it, I’ll do some additive synthesis. In other words, I’ll build a final signal that contains four components (although they’re not entirely independent…), as shown in the sketch after this list:
a “signal” consisting of a 100 Hz sine wave which we’ll call “the fundamental”
sine tones at frequencies that are multiples of the fundamental frequency (in other words, they are harmonically related to the fundamental).
sine tones at frequencies that are not multiples of the fundamental frequency (in other words, they are not harmonically related to the fundamental).
wide-band noise
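Before we start listening, here’s how I think of those pieces fitting together, as a hedged MATLAB sketch. The sample rate, the one-second duration, the variable names, and the choice to treat -10 dB FS as an RMS level (so that the 10% figures below work out) are all my own assumptions.

```matlab
% Building the test signals: a sketch under the assumptions above.
fs = 48000;  t = (0:fs-1)'/fs;  f0 = 100;  % 1 second at 48 kHz (assumed)
x0 = sqrt(2)*10^(-10/20)*sin(2*pi*f0*t);   % fundamental, -10 dB FS (RMS)

harmonic   = sin(2*pi*2*f0*t);             % a multiple of f0 (here, 200 Hz)
inharmonic = sin(2*pi*317*t);              % NOT a multiple of f0
noise      = randn(size(t));               % wide-band white noise

% Scale any artefact to an RMS level of -30 dB FS: 20 dB below the
% signal, and 10^(-20/20) = 0.1, so the measurement will say 10%.
artefact = noise * 10^(-30/20)/sqrt(mean(noise.^2));
v2 = x0 + artefact;                        % "Version 2" below
measure_thdn(v2, f0, fs)                   % approximately 0.10, i.e. 10%
```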
Version 1: No artefacts
Let’s start by listening to the original 100 Hz sine wave with a level of -10 dB FS, without any additional components. If you hear any distortion or noise, then this is a problem in your playback system (unless your system is so good that you can hear the quantisation error caused by the fact that I didn’t dither the signal).
[Audio example]
Version 2: Wide-band noise
Now let’s add noise. I’ve added noise with a white spectrum at a level such that a THD+N measurement will tell us that we have 10% THD+N (relative to the level of the 100 Hz sine tone signal). In other words, I have a sine wave with a level of -10 dB FS and I have added white noise with a long-term RMS level of -30 dB FS – 20 dB below the signal, and 10^(-20/20) = 0.1 = 10%.
[Audio example]
It should be pretty obvious, even with poor playback equipment, that I have added noise to the 100 Hz tone. This should not be surprising, since a 10% THD+N is pretty bad.
Version 3: 2nd harmonic
For this version, I’ll add a 200 Hz sine tone to the 100 Hz tone. The fundamental (100 Hz) has a level of -10 dB FS. The level of its second harmonic (200 Hz) is -30 dB FS. This means that, again, I get a THD+N value of 10%.
[Audio example]
Version 4: 3rd harmonic
For this version, I’ll add a 300 Hz sine tone to the 100 Hz tone. The fundamental (100 Hz) has a level of -10 dB FS. The level of its third harmonic (300 Hz) is -30 dB FS. This means that, again, I get a THD+N value of 10%.
[Audio example]
Version 5: 2nd to 5th harmonics
For this version, I’ll add four additional sine tones to the 100 Hz tone. The fundamental (100 Hz) has a level of -10 dB FS. I have added tones at 200 Hz, 300 Hz, 400 Hz and 500 Hz (the 2nd through to the 5th harmonics, inclusive) with a spectral pattern where each successive tone is half the amplitude of the previous one. In other words, the 500 Hz tone is half the amplitude of the 400 Hz tone which, in turn, is half the amplitude of the 300 Hz tone, which is half the amplitude of the 200 Hz tone.
I have adjusted the overall level of the harmonics so that we get a THD+N value of 10%. In other words, the RMS level of the signal comprised of the 200 Hz to 500 Hz sine tones (inclusive) is -30 dB FS.
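As a sketch, reusing t, f0, x0 and measure_thdn from the code above (again, these names and the scale-the-bundle-by-its-RMS detail are my assumptions):

```matlab
% Version 5 as a sketch: 2nd-5th harmonics, each half the amplitude of
% the previous one, with the whole bundle normalised to -30 dB FS RMS.
h   = zeros(size(t));
amp = 1;
for k = 2:5
    h   = h + amp*sin(2*pi*k*f0*t);
    amp = amp/2;                        % each harmonic 6 dB below the last
end
h  = h * 10^(-30/20)/sqrt(mean(h.^2)); % normalise the bundle to -30 dB FS
v5 = x0 + h;
measure_thdn(v5, f0, fs)               % approximately 0.10 again
```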
[Audio example]
Version 6: 5 kHz
For this version, I’ll add a 5 kHz sine tone to the 100 Hz tone. The fundamental (100 Hz) has a level of -10 dB FS. The level of the 5 kHz tone is -30 dB FS. This means that, again, I get a THD+N value of 10%.
[Audio example]
Version 7: Noise plus five sine tones with random frequencies
For this version, I’ll add a mess to the 100 Hz tone. The fundamental (100 Hz) has a level of -10 dB FS. To this I added a signal that is comprised of wide-band white noise and 5 sine tones at random frequencies between 0 Hz and 20 kHz (no, I don’t know what they are – but it doesn’t matter for the purposes of this discussion). The levels of the noise and 5 sine tones are random.
I have adjusted the overall level of the signal comprised of the noise and 5 random sine tones so that we get a THD+N value of 10%. In other words, the RMS level of the signal comprised of the noise and 5 random sine tones is -30 dB FS.
[Audio example]
The punch line!
Each of the six signals I’ve presented above in Versions 2 through 7 (inclusive) is a “distorted” version of the original 100 Hz sine tone in Version 1. Each of those six signals will have a measurable THD+N of 10%. However, it is quite obvious that they have very different spectral patterns, and therefore they sound quite different.
This isn’t really revolutionary – it’s just another reminder that a THD value, in the absence of any other information, isn’t terribly useful – or at least, it doesn’t tell you much about how the signal sounds.
Caveat: This is basically a geek version of a cover tune. The point that I make here was one that I originally heard someone else present at an AES convention years ago. However, since I haven’t heard anyone tell this story since, I’ve written it here.
Let’s build two black boxes, each of which creates a measurable distortion. We’ll call them Box “A” and Box “B”.
Box “A” has a measured THD+N of 20%. Box “B” has a measured THD+N of 2%. We’ll be using the old-fashioned way of measuring THD+N where we put a sine wave into the device, apply a notch filter to the output at the same frequency as the sine wave, and find the ratio of the level of the notch filter’s output to the level of the total output.
Let’s put a 500 Hz sine wave into the boxes and listen to the output. The original sine wave sounds like the following:
[Audio example]
The sine wave at the output of Box “A” (with a THD+N of 20%) sounds like the following:
[Audio example]
The sine wave at the output of Box “B” (with a THD+N of 2%) sounds like the following:
[Audio example]
So far so good. There should be no surprises yet.
Now let’s put a recording of something that I listen to all the time (my own voice) into the same black boxes to see what happens.
We’ll start with the original recording (this is just a file that I happened to have on my hard drive for testing imaging; ignore the fact that it talks about coming from the left channel only – your computer will probably play it as a mono file out of both channels, which is irrelevant to this discussion):
[Audio example]
Now let’s listen to how that recording sounds at the output of Box “A” (with a measured THD+N of 20%)
[Audio example]
As you’ll hear, there is no audible distortion on the sound file, despite the fact that it has gone through a box that generates a distortion that we measured to be 20%.
Now let’s listen to how the original recording sounds at the output of Box “B” (with a measured THD+N of 2%)
[Audio example]
As you will probably hear in that last sound file, Box “B” – the one with “only” 2% distortion – sounds MUCH worse than either the original sound file or the output of Box “A”, which should have much more audible distortion.
So, the question is “why?”
Let’s look at the waveforms to see what’s going on here.
The original sine wave looks like the following:
After that sine wave has gone through Box “A”, the output looks like the following:
As you can see, I’ve created Box “A” to generate its distortion by clipping the signal at limits of -0.5 and 0.5.
The output of Box “B” when fed with the same sine wave looks like the following:
If we zoom in on that plot, it looks like the following:
So, as you can see, I’ve made Box “B” generate a zero-crossing distortion – but a pretty small one.
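If you’d like to experiment with this yourself, here is how I would model the two boxes in MATLAB. The clip level of Box “A” comes straight from the description above; the dead-zone model for Box “B” and its width are my own guesses at “a pretty small” zero-crossing distortion, and would need tuning to land exactly on the measured 2%.

```matlab
% The two black boxes, sketched from the plots described above.
boxA = @(x) min(max(x, -0.5), 0.5);            % hard clip at +/- 0.5
boxB = @(x) sign(x) .* max(abs(x) - 0.02, 0);  % small dead zone at zero
                                               % (width 0.02 is a guess)
fs = 48000;  t = (0:fs-1)'/fs;
sine = sin(2*pi*500*t);                        % the 500 Hz test tone

measure_thdn(boxA(sine), 500, fs)   % roughly 20% for a full-scale sine
measure_thdn(boxB(sine), 500, fs)   % on the order of a percent or two
```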
The reason the THD+N of Box “A” is 20% and that of Box “B” is only 2% is not just because the “damage” done to the signal is bigger with Box “A”. It’s also caused by where the damage is done. This might not make sense, so let’s look at the signals a little differently.
Let’s do a histogram of the original sine wave. This tells us how often the signal sits at any given sample value, and is shown in the plot below.
This histogram shows that the sample values in the original sine wave are usually near -1 and +1, and rarely around 0.
Now let’s look at a histogram of the output of Box “A” – the distorted sine wave with 20% THD+N. It looks like the following:
As can be seen in the plot above, the sample values from the original sine wave that were below -0.5 are now all congregated at -0.5, and the values that were above 0.5 are now congregated at 0.5. This is the result of the clipping applied to the signal.
By comparison, the histogram of the output of Box “B” is shown below:
As you can see by comparing these last two plots, the zero crossing distortion of Box “B” results in a histogram that is more similar to the histogram of the original signal than that of the clipping distortion of Box “A”. This is because the zero crossing distortion distorts the signal where the signal rarely is.
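If you want to reproduce these histograms yourself, reusing sine, boxA and boxB from the sketch above (the bin count here is an arbitrary choice):

```matlab
% Histograms of the original and distorted sine waves.
histogram(sine, 101);                 % original: piles up near -1 and +1
figure; histogram(boxA(sine), 101);   % clipped: spikes at -0.5 and +0.5
figure; histogram(boxB(sine), 101);   % dead zone: only the region near
                                      % zero changes (a small spike at 0)
```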
Now let’s look at the histograms of the speech signal. Below is a histogram of the original speech recording.
As you can see in this plot, the speech signal is unlike the sine wave in that it is usually at 0, and not at the extreme values of -1 and 1. In addition, you can see that very little, if any, of the signal is below -0.5 or above 0.5 which are the clipping values of Box “A”. Consequently, as you can see below, the histogram of the output of Box “A”, when fed with the speech signal, looks almost the same as the histogram of the original signal, above.
However, the output of Box “B” is different. The histogram of that signal is shown below:
So, as you can see here: the zero-crossing distortion is affecting the signal where it spends most of its time, whereas the clipping of Box “A” has no effect on the signal at all.
The moral of the story
The point that I’ve (hopefully) illustrated here is that the value generated by a THD+N measurement is basically irrelevant when it comes to expressing how a device distorts a normal signal. However, the problem is not with the measurement technique, but with the signal that is used in the procedure. We use a sine wave to do a THD+N measurement because that used to be the easy way to do it back in the old days of signal generators, analogue notch filters, and voltmeters. The problem is that the probability distribution function (PDF) of a sine wave is completely unlike the PDF of a music or speech signal. So, if the distortion of the device affects the signal in the wrong place, then the result of the measurement will not reflect the sound of the device.
Now, before you start sending me hate mail because you think this posting is a Windows vs. Mac lecture, hold your horses. That’s NOT the kind of windows I’m talking about. This one’s about windowing functions and one (possibly unexpected) effect on the results of the analysis of the impulse response of an allpass filter. So, if you want to debate Windows vs. Mac – go somewhere else. If you think that you can get all riled up over a Blackman Harris window function, read on!
Last week I had to do some frequency-domain analysis of a system that had a small problem with noise in its impulse response measurements. The details of where the noise came from are unimportant. There is only one important thing from the back-story that you need to know – and that is that I was measuring the response of an allpass filter implementation.
So, I did my MLS measurement of the allpass filter and, because I had noise in the impulse response, I chose to use a windowing function to clean up the impulse response’s tail. Now, I know that, by using a windowing function (or a DFT, for that matter), there are consequences that one needs to be aware of. However, the consequence that I stumbled on was a new one for me – although in retrospect, it should not have been.
Here’s a sterilised version of what happened, just in case it’s of use.
Below is a plot showing a (very clean) impulse response of an allpass filter. To be more specific, it’s a 4th-order Linkwitz-Riley crossover with a crossover frequency of 100 Hz, where I summed the outputs of the high-pass and low-pass components together to make an output. (We will not discuss why I did it this way, since that information is outside the scope of this discussion.) In addition, I have plotted three windowing functions: a Hann, a Hamming and a Blackman-Harris.
Note that the length of the windowing functions is big – 65536 samples to be exact. As you can see in the plot, the ringing of the allpass filter is negligible by the time we get to the end of the window. This can also be seen below in the next two plots, where I’ve shown the impulse response after it has been windowed by the three functions (actually four, if we include a rectangular window), scaled linearly and in dB FS. (I know, I know, dB FS is an RMS measurement and I plotted this as instantaneous values – sue me.)
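For anyone who wants to follow along, here is a MATLAB sketch of the analysis. The only details taken from above are “a 4th-order Linkwitz-Riley crossover at 100 Hz” and the 65536-sample lengths; everything else (the sample rate, the cascaded-Butterworth implementation, the plotting left as a comment) is an assumption on my part.

```matlab
% Sketch: windowed impulse response of an LR4 allpass sum.
% An LR4 section can be built as two cascaded 2nd-order Butterworths.
fs = 48000;  N = 65536;  fc = 100;
[bl, al] = butter(2, fc/(fs/2));           % 2nd-order Butterworth LP
[bh, ah] = butter(2, fc/(fs/2), 'high');   % 2nd-order Butterworth HP

d  = [1; zeros(N-1, 1)];                   % unit impulse
lp = filter(bl, al, filter(bl, al, d));    % LR4 low-pass  (LP, twice)
hp = filter(bh, ah, filter(bh, ah, d));    % LR4 high-pass (HP, twice)
ir = lp + hp;                              % the summed output: an allpass

w = {rectwin(N), hann(N), hamming(N), blackmanharris(N)};
for k = 1:numel(w)
    H = fft(ir .* w{k}, N);                % 65536-point FFT
    % ...plot 20*log10(abs(H)) and unwrap(angle(H)) vs frequency here
end
```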
So, if you now take those windowed impulse responses and calculate their magnitude and phase responses, you get the plots shown below.
“So what?” I hear you cry. The magnitude responses of the four versions of the windowed impulse response are all identical enough that their plots lie on top of each other. This is also true for their phase responses. “I see what I would expect to see – what are you complaining about?” I hear you cry.
Well, let me tell you. The plots above show the results when you use a 65536-point FFT and a 65536-sample window (okay, okay, DFT – sue me).
Let’s do all that again, but with a 65536-point FFT and a 1024-point window instead (I did this in MATLAB, so it’s zero-padding the impulse responses with the remaining 65536-1024 = 64512 samples.)
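In the sketch above, that change is a couple of lines – MATLAB’s fft() zero-pads for us when the requested FFT size is longer than its input:

```matlab
M  = 1024;                       % the new, short window length
w2 = blackmanharris(M);          % or hann(M), hamming(M), rectwin(M)
H2 = fft(ir(1:M) .* w2, N);      % fft() zero-pads the product out to N
```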
Now we can see immediately that the ringing in the allpass filter’s impulse response hasn’t settled down by the time we get to the end of the window. This can also be seen in the following two plots.
I had an interesting email from an old recording-engineer friend of mine this week regarding a debate he had with a student concerning the issue of “depth” in recordings (in his specific case, 2-channel stereo recordings done with an ORTF mic configuration). This got me thinking back to a bunch of thoughts I had once-upon-a-time about distance perception, and a newer bunch of thoughts about loudspeaker directivity. Now, those two bunches of thoughts are congealing into a single idea regarding how to achieve (and experience) a reasonable perceived sensation of distance and depth in 2-channel stereo.
To start, some definitions:
When I say “stereo” I mean “2-channel sound recording”
“Distance” to a source in a stereo recording is the perceived distance between the listener and the (probably phantom) image.
“Depth” in a stereo recording is the difference in the perceived distances from the listener to the closest and farthest (probably phantom) images (i.e. the distance to the concert master vs. the distance to the xylophone in a symphony orchestra)
Step 1: Distance perception in real life
Go to an anechoic chamber with a loudspeaker and a friend. Sit there and close your eyes and get your friend to place the loudspeaker some distance from you. Keep your eyes closed, play some sounds out of the loudspeaker and try to estimate how far away it is. You will be wrong (unless you’re VERY lucky). Why? It’s because, in real life with real sources in real spaces, distance information (in other words, the information that tells you how far away a sound source is) comes mainly from the relationship between the direct sound and the early reflections. If you get the direct sound only, then you get no distance information. Add the early reflections and you can very easily tell how far away it is. This has been proven in lots of “official” listening tests. (For example, go check out this report as a basic starting point).
Anecdote #1: Back in the old days when I was working on my Ph.D. we had an 8-loudspeaker system in the lab – one speaker every 45° in a circle around the listening position. We were trying to build a multichannel room simulator where we were building a sound field, piece by piece – the direct sound and (up to 3rd-order) early reflections had the “correct” panning, delay and gain, and we added a diffuse field to tail in behind it. One of the interesting things that I found with that system was that the simulated distance to the source was easy to achieve with just the 1st-order reflections, but that the precision of that perceived distance increased as we added 2nd- and 3rd-order reflections. (We didn’t have enough computing power to simulate higher-order reflections at the time. It would be interesting to go back and try again to see what would happen with higher-order stuff now that my Mac has gotten a little faster…) Another interesting thing (although, in retrospect, it shouldn’t surprise anyone) was that the location of and the distance to the simulated sound source were also easy to determine without the direct sound being part of the sound field at all. Just the 1st- to 3rd-order reflections by themselves were enough to tell you where things were.
Step 2: Distance perception in a recording
It’s been well-known for many years that the apparent distance to a sound source in a stereo recording is controllable by the so-called “dry-wet” ratio – in other words, the relative levels of the direct sound and the reverb. I first learned this in the booklet that came with my first piece of recording gear – an Alesis Microverb. To be honest, this is a bit of an over-simplification, but done in good faith for people at the knowledge level of a typical Alesis Microverb customer. The people at another reverb unit manufacturer know that the truth requires a little more detail. For example, their flagship reverb unit uses correctly-positioned and correctly-delayed early reflections (calculated using ray tracing, apparently) to deliver a believable room size and sound source location in that room.
If you’re thinking in terms of a stereo microphone pair, then consider it this way: you want your microphone configuration to be reasonably good at acting like a decent panning algorithm. At the very least, you should ensure that you don’t have conflicting information between the interchannel time differences and the interchannel amplitude differences for your direct sound and the early reflections. For example, if you have a pair of near-coincident cardioids, but they’re “toed-in” instead of “toed-out” (i.e. the left mic is pointing to the right and the right mic is pointing to the left), you have a problem: the earlier channel will not be the louder channel for sound sources and reflections that are off-axis to the pair. This would make for conflicting, and therefore confusing, information for your brain.
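That conflict is easy to check with a little arithmetic. Below is a hedged MATLAB sketch; the ORTF-like spacing and angles, the source position, and the idealised cardioid pattern are all assumptions for illustration, not measurements of any real pair.

```matlab
% Toed-out vs toed-in near-coincident cardioids, for one off-axis source.
c = 343;  d = 0.17;                  % speed of sound (m/s), spacing (m)
src = 30;                            % source azimuth (degrees, + = left)
cardioid = @(a) 0.5*(1 + cosd(a));   % idealised cardioid gain vs angle

lead = d*sind(src)/c;                % left capsule is closer: ~0.25 ms lead

gL_out = cardioid( 55 - src);  gR_out = cardioid(-55 - src);  % toed-out
gL_in  = cardioid(-55 - src);  gR_in  = cardioid( 55 - src);  % toed-in

fprintf('toed-out: left earlier AND louder (%.2f vs %.2f)\n', gL_out, gR_out);
fprintf('toed-in : left earlier but QUIETER (%.2f vs %.2f)\n', gL_in, gR_in);
```

For the toed-out pair the time cue and the level cue agree (the left channel is both earlier and louder); for the toed-in pair they contradict each other, which is exactly the confusion described above.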
Anecdote #2: I did a recording for Atma once-upon-a-time in a large church in Montreal with a very long reverb time. During the sessions, I sat in the church (no control room), about 20 m from the mic pair. So, when the organist and I discussed what take to do next, we were talking live in the same room – no talkback speakers. During the editing for this disc, I happened to be shuttling around, looking for the beginning of a take – so I’d drop the cursor somewhere on the screen and hit “play” quickly to see where I was. One of the takes ended with the organist asking “did we get it?” and I responded “yup” quickly and loudly. It just so happened that, when I was shuttling around, looking for the right take, I hit “play” at the beginning of the “yup” and then quickly hit “stop”. The interesting thing is that it sounded, for that split second, like I was right next to the microphones – not 20 m away like I knew I was. So, I hit “play” again, and this time didn’t hit stop. This time, I sounded far away. What’s going on? Well, because the church was so big, it was possible to hit the stop button before any of the first reflections came in (save maybe the one off the floor), so it was possible (with a fast enough thumb on the transport buttons of the editing machine) to make the recording of my voice anechoic. The result was that I sounded 0 m away instead of 20 m.
The moral of the stories thus far? In order to deliver a perception of precise distance and depth (even if it’s not accurate…) you need early reflections in the recording, and they have to be panned and delayed appropriately.
Step 3: The delivery
Think back to Step 1. We agreed (or at least I said…) that early reflections tell your brain how far away the sound source is. Now think to a loudspeaker in a listening room.
Case #1: If you have an anechoic room, there are no early reflections, and, regardless of how far away the loudspeakers are, a sound source in the recording without early reflections (i.e. a close-mic’ed vocal) will sound much closer to you than the loudspeakers.
Case #2: If you have a listening room with early reflections, but the loudspeakers are directional such that there is no energy being delivered to the side walls (for example, a dipole with the angles carefully chosen to point the null of the loudspeaker at the point of specular reflection from the side wall), then the result is the same as in Case 1. This time there are no early reflections because of loudspeaker directivity instead of wall absorption, but the effect at the listening position is the same.
Case #3: If you have a listening room with early reflections, and the loudspeakers are omni-directional, then the early reflections from the side walls tell you how far away the loudspeakers are. Therefore, the close-mic’ed vocal track from Case #1 cannot sound any closer than the loudspeakers – your brain is too smart to be told otherwise.
The punchline
So, if you want to achieve precision in the distance and depth of your stereo recordings (whether you’re on the recording end or the playback end) you’re going to need to make sure that you have a reasonable mix of the following:
Early reflections in the recording itself have to be there, coming in at the right times, with the right gains and the right panning
Not much energy in the early reflections in your listening room – either by putting some absorption on the walls in the right places, or by having reasonably directional loudspeakers (or both).