“High-Res” Audio: Part 5 – Mirrors are bad

Let’s go back to something I said in the last post:

Mistake #1

I just jumped to at least three conclusions (probably more) that are going to haunt me.

The first was that my “digital audio system” was something like the following:

As you can see there, I took an analogue audio signal, converted it to digital, and then converted it back to analogue. Maybe I transmitted it or stored it in the part that says “digital audio”.

However, the important, and very probably incorrect assumption here is that I did nothing to the signal. No volume control, no bass and treble adjustments… nothing.

If you consider that signal flow from the position of an end-consumer playing a digital recording, this was pretty easy to accomplish in the “old days” when we were all playing CDs. That’s because (in a theoretical, oversimplified world…)

the output of the mixing/mastering console was analogue
that analogue signal was converted to digital in the mastering studio
the resulting bits were put on a disc
you put that disc in your player which contained a DAC that converted the signal directly to analogue
you then sent the signal to your “processing” (a.k.a. “volume control”, and maybe some bass and treble adjustment.).

So, that flowchart in Figure 1 was quite often true in 1985.

These days, things are probably VERY different… These days, the signal path probably looks something more like this (note that I’ve highlighted “alterations” or changes in the bits in the audio signal in red):

The signal was converted from analogue to digital in the studio
(yes, I know… studios often work with digital mixers these days, but at least some of the signals within the mix were analogue to start – unless you are listening to music made exclusively with digital synthesizers)
The resulting bits were saved on a file
Depending on the record label, the audio signal was modified to include a “watermark” that can identify it later – in court, when you’ve been accused of theft.
The file was transferred to a storage device (let’s say “hard drive”) in a large server farm renting out space to your streaming service
The streaming service encodes the file
- If the streaming service does not offer a lossless option, then the file is converted to a lossy format like MP3, Ogg Vorbis, AAC, or something else.
- If the streaming service offers a lossless option, then the file is compressed using a format like FLAC or ALAC (This is not an alteration, since, with a lossless compression system, you don’t lose anything)
You download the file to your computer
(it might look like an audio player – but that means it’s just a computer that you can’t use to check your social media profile)
You press play, and the signal is decoded (either from the lossy CODEC or the compression format) back to LPCM. (Still not an alteration. If it’s a lossy CODEC, then the alteration has already happened.)
That LPCM signal might be sample-rate converted
The streaming service’s player might do some processing like dynamic range compression or gain changes if you’ve asked it to make all the songs have the same level.
All of the user-controlled “processing” like volume controls, bass, and treble, are done to the digital signal.
The signal is sent to the loudspeaker or headphones
- If you’re sending the signal wirelessly to a loudspeaker or headphones, then the signal is probably re-encoded as a lossy CODEC like AAC, aptX, or SBC.
  (Yes, there are exceptions with wireless loudspeakers, but they are exceptions.)
- If you’re sending the signal as a digital signal over a wire (like S/PDIF or USB), then you get a bit-for-bit copy at the input of the loudspeaker or headphones.
The loudspeakers or headphones might sample-rate convert the signal
The sound is (finally) converted to analogue – either one stream per channel (e.g. “left”) or one stream per loudspeaker driver (e.g. “tweeter”) depending on the product.

So, as you can see in that rather long and complicated list (it looks complicated, but I’ve actually simplified it a little, believe it or not), there’s not much relation to the system you had in 1985.

Let’s take just one of those blocks and see what happens if things go horribly wrong. I’ll take the “volume control” block and add some distortion to see the result with two LPCM systems that have two different sampling rates, one running at 48 kHz and the other at 192 kHz – four times the rate. Both systems are running at 24 bits, with TPDF dither (I won’t explain what that means here). I’ll start by making a 10 kHz tone, and sending it through the system without any intentional distortion. If we look at those two signals in the time domain, they’ll look like this:

Figure 1: Two 10 kHz tones. The black one is in a 48 kHz, 24 bit LPCM system. The red one is in a 192 kHz, 24 bit LPCM system.

The sine tone in the 48 kHz system may look less like a sine tone than the one in the 192 kHz system, however, in this case, appearances are deceiving. The reconstruction filter in the DAC will filter out all the high frequencies that are necessary to reproduce those corners that you see here, so the resulting output will be a sine wave. Trust me.

If we look at the magnitude responses of these two signals, they look like Figure 2, below.

Figure 2: The magnitude responses of the two signals shown in Figure 1.

You may be wondering about the “skirts” on either side of the 10 kHz spikes. These are not really in the signal, they’re a side-effect (ha ha) of the windowing process used in the DFT (aka FFT). I will not explain this here – but I did a long series of articles on windowing effects with DFTs, so you can search for it if you’re interested in learning more about this.

If you’re attentive, you’ll notice that both plots extend up to 96 kHz. That’s because the 192 kHz system on the bottom has a Nyquist frequency of 96 kHz, and I want both plots to be on the same scale for reasons that will be obvious soon.

Now I want to make some distortion. In order to make things obvious, I’m going to make a LOT of distortion. I’ve made the sine wave try to have an amplitude that is 10 times higher than my two systems will allow. In other words, my amplitude should be +/- 10, but the signal clips at +/- 1, resulting in something looking very much like a square wave, as shown in Figure 3.

Figure 3: Distorted 10 kHz sine waves. The black one is in a 48 kHz, 24 bit LPCM system. The red one is in a 192 kHz, 24 bit LPCM system.

You may already know that if you want to make a square wave by building it up using its constituent harmonics, you need to have the fundamental (which we’ll call Fc; in our case, Fc = 10 kHz) with an amplitude that we’ll call “A”. You then add the

3rd harmonic (3 times Fc, so 30 kHz in our case) with an amplitude of A/3.
5th harmonic (5 Fc = 50 kHz) with an amplitude of A/5
7 Fc at A/7
and so on up to infinity
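
As a rough sketch of that recipe (Python with NumPy is my choice here, and the function name is just something made up for illustration), we can add up the odd harmonics of Fc with amplitudes of A/k and check that the result flattens out like a square wave. With A = 1, the flat parts should settle near π/4.

```python
import numpy as np

def square_from_harmonics(fc, n_harmonics, t):
    """Sum the first n_harmonics odd harmonics of fc, each with amplitude 1/k."""
    y = np.zeros_like(t)
    for i in range(n_harmonics):
        k = 2 * i + 1                      # 1, 3, 5, 7, ...
        y += np.sin(2 * np.pi * k * fc * t) / k
    return y

fc = 10_000.0                              # the 10 kHz fundamental from the text
t = np.linspace(0, 1 / fc, 1000, endpoint=False)   # one cycle of the fundamental
approx = square_from_harmonics(fc, 200, t)

# Values a quarter and three quarters of the way through the cycle
print(approx[250], approx[750])
```

Since we can’t really go “up to infinity”, the sketch stops at 200 harmonics, which is already enough to make the top and bottom of the wave look flat.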

Let’s look at the magnitude responses of the two signals above to see if that’s true.

Figure 4: The magnitude responses of the two signals shown in Figure 3.

If we look at the bottom plot first (running at 192 kHz and with a Nyquist limit of 96 kHz) the 10 kHz tone is still there. We can also see the harmonics at 30 kHz, 50 kHz, 70 kHz, and 90 kHz in amongst the mess of other spikes (we’ll get to those soon…)

Figure 5. Some labels applied to Figure 4 for clarity, showing the harmonics of the square waves that are captured by the two systems

Looking at the top plot (running at 48 kHz and with a Nyquist limit of 24 kHz), we see the 10 kHz tone, but the 30 kHz harmonic is not there – because it can’t be. Signals can’t exist in our system above the Nyquist limit. So, what happens? Think back to the images of the rotating wheel in Part 3. When the wheel was turning more than 1/2 a turn per frame of the movie, it appeared to be going backwards at a different speed that can be calculated by subtracting the actual rotation from 360º (a full turn).

The same is true when, inside a digital audio signal flow, we try to make a signal that’s higher than Nyquist. The energy exists in there – it just “folds” to another frequency – its “alias”.

We can look at this generally using Figure 6.

Figure 6: A general plot of aliasing, showing the intended frequency in black and the actual output frequency in red.

Looking at Figure 6: If we make a sine tone that sweeps upward from 0 Hz to the Nyquist frequency at Fs/2 (half the sampling rate or sampling frequency) then the output is the same as the input. However, when the intended frequency goes above Fs/2, the actual frequency that comes out is as far below Fs/2 as the intended frequency is above it (in other words, Fs minus the intended frequency). This creates a “mirror” effect.

If the intended frequency keeps going up above Fs, then the mirroring happens again, and again, and again… This is illustrated in Figure 7.

Figure 7: An extension of Figure 6 to a higher intended frequency.

This plot is shown with linear scales for both the X- and Y-axes to make it easy to understand. If the axes in Figure 7 were scaled to a logarithmic scaling instead (which is how “Frequency Response” plots are normally shown, since this corresponds to how we hear frequency differences), then it would look like Figure 8.

Figure 8: The same information shown in Figure 7, plotted on a logarithmic scale instead. Note that this example is for a system running at 48 kHz (therefore with a Nyquist frequency of 24 kHz), and an intended input frequency (in black) going up to 3 times 48 kHz = 144 kHz.

Coming back to our missing 30 kHz harmonic in the 48 kHz LPCM system: Since 30 kHz is above the Nyquist limit of 24 kHz in that system, it mirrors down to 24 kHz – (30 kHz – 24 kHz) = 18 kHz. The 50 kHz harmonic shows up as an alias at 2 kHz. (Follow the red line in Figure 7: a harmonic on the black line at 48 kHz would actually be at 0 Hz on the red line. Then, going 2 kHz up to 50 kHz would bring the red line up to 2 kHz.)

Similarly, the 110 kHz harmonic in the 192 kHz system will produce an alias at 96 kHz – (110 kHz – 96 kHz) = 82 kHz.
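
The mirroring can be written as a small folding rule. This is only a sketch of the arithmetic above (the function name is hypothetical, and it assumes an ideal system where everything above Nyquist folds back perfectly), but it reproduces the three aliases just mentioned.

```python
def alias_frequency(f_intended, fs):
    """Fold an intended frequency (in Hz) back into the range 0 .. fs/2."""
    f = f_intended % fs                    # the mirroring pattern repeats every fs
    return fs - f if f > fs / 2 else f     # above Nyquist, reflect back down

for fs, harmonic in [(48_000, 30_000), (48_000, 50_000), (192_000, 110_000)]:
    print(f"{harmonic} Hz in a {fs} Hz system comes out at "
          f"{alias_frequency(harmonic, fs)} Hz")
```

The modulo takes care of the “again, and again, and again” part: once the intended frequency passes Fs, the same zig-zag simply starts over.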

If I then label the first set of aliases in the two systems, we get Figure 9.

Figure 9: The first set of aliased frequency content in the two systems.

Now we have to stop for a while and think about what’s happened.

We had a digital signal that was originally “valid” – meaning that it did not contain any information above the Nyquist frequency, so nothing was aliasing. We then did something to the signal that distorted it inside the digital audio path. This produced harmonics in both cases, however, some of the harmonics that were produced are harmonically related to the original signal (just as they ought to be) and others are not (because they’re aliases of frequency content that cannot be reproduced by the system).

What we have to remember is that, once this happens, that frequency content is all there, in the signal, below the Nyquist frequency. This means that, when we finally send the signal out of the DAC, the low-pass filtering performed by the reconstruction filter will not take care of this. It’s all part of the signal.

So, the question is: which of these two systems will “sound better” (whatever that means)? (I know, I know, I’m asking “which of these two distortions would you prefer?” which is a bit of a weird question…)

This can be answered in two ways that are inter-related.

The first is to ask “how much of the artefact that we’ve generated is harmonically related to the signal (the original sine tone)?” As we can see in Figure 5, the higher the sampling rate, the more artefacts (harmonics) will be preserved at their original intended frequencies. There’s no question that harmonics that are harmonically related to the fundamental will sound “better” than tones that appear to have no frequency relationship to the fundamental. (If I were using a siren instead of a constant sine tone, then aliased harmonics are equally likely to be going down or up when the fundamental frequency goes up… This sounds weird.)

The second is to look at the levels of the inharmonic artefacts (the ones that are not harmonically related to the fundamental). For example, both the 48 kHz and the 192 kHz system have an aliased artefact at 2 kHz, however, its level in the 48 kHz system is 15 dB below the fundamental whereas, in the 192 kHz system, it’s more than 26 dB below. This is because the 2 kHz artefact in the 48 kHz system is an alias of the 50 kHz harmonic, whereas, in the 192 kHz system, it’s an alias of the 190 kHz harmonic, which is much lower in level.
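
If you want to try this comparison yourself, here is a minimal sketch in Python with NumPy. It is not the code that made the plots in this post: the function name is made up, and it leaves out the 24-bit quantisation and the dither, so the numbers will only roughly match the measurements quoted above. It clips a 10 kHz sine that tries to reach +/- 10 at +/- 1, then reads the level of the 2 kHz alias relative to the fundamental.

```python
import numpy as np

def clipped_tone_spectrum_db(fs, f0=10_000, overdrive=10.0):
    """Levels in dB (relative to the fundamental) of a hard-clipped sine.

    Uses exactly one second of signal, so bin k of the result is k Hz.
    """
    n = np.arange(fs)
    x = np.clip(overdrive * np.sin(2 * np.pi * f0 * n / fs), -1.0, 1.0)
    mag = np.abs(np.fft.rfft(x))
    # The small floor avoids log(0) in bins that contain nothing
    return 20 * np.log10(np.maximum(mag, 1e-12) / mag[f0])

for fs in (48_000, 192_000):
    spectrum = clipped_tone_spectrum_db(fs)
    print(f"{fs} Hz system: 2 kHz alias at {spectrum[2_000]:.1f} dB "
          f"relative to the 10 kHz fundamental")
```

Because the tone fits a whole number of cycles into one second, no window is needed and there are no “skirts” to worry about. In the 48 kHz case the alias should land roughly 15 dB below the fundamental, and noticeably lower in the 192 kHz case.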

As I said, these two points are inter-related (you might even consider them to be the same point) however, they can be generalised as follows:

The higher the sampling rate, the more the artefacts caused by distortion generated within the system are harmonically related to the signal.

In other words, it gives a manufacturer more “space” to screw things up before they sound bad. The title of this posting is “Mirrors are bad” but maybe it should be “Mirrors are better when they’re further away” instead.

Of course, the distortion that’s actually generated by processing inside a digital audio system (hopefully) won’t be anything like the clipping that I did to the signal. On the other hand, I’ve measured some systems that exhibit exactly this kind of behaviour. I talked about this in another series about Typical Problems in Digital Audio: Aliasing where I showed this measurement of a real device:

Figure 10: A measurement of a real device showing some kind of distortion and aliased artefacts of a swept sine tone. Half of the aliasing is immediately recognizable as going downwards when the tone is going upwards.

However, I’m not here to talk about what you can or can’t hear – that is dependent on too many variables to make it worth even starting to talk about. The point of this series is not to prove that something is better or worse than something else. It’s only to show the advantages and disadvantages of the options so that you can make an informed choice that best suits your requirements.

On to Part 6

“High-Res” Audio: Part 3 – Frequency Limits

Reminder: This is still just the lead-up to the real topic of this series. However, we have to get some basics out of the way first…

Just like the last posting, this is a copy-and-paste from an article that I wrote for another series. However, this one is important, and rather than just link you to a different page, I’ve reproduced it (with some minor editing to make it fit) here.

In the first posting in this series, I talked about how digital audio (more accurately, Linear Pulse Code Modulation or LPCM digital audio) is basically just a string of stored measurements of the electrical voltage that is analogous to the audio signal, which is a change in pressure over time… In the second posting in the series, we looked at a “trick” for dealing with the issue of quantisation (the fact that we have a limited resolution for measuring the amplitude of the audio signal). This trick is to add dither (a fancy word for “noise”) to the signal before we quantise it in order to randomise the error and turn it into noise instead of distortion.

In this posting, we’ll look at some of the problems incurred by the way we carve up time into discrete moments when we grab those samples.

Let’s make a wheel that has one spoke. We’ll rotate it at some speed, and make a film of it turning. We can define the rotational speed in RPM – rotations per minute, but this is not very useful. In this case, what’s more useful is to measure the wheel rotation speed in degrees per frame of the film.

Fig 1. The position of a clockwise-rotating wheel (with only one spoke) for 9 frames of a film. Each column shows a different rotational speed of the wheel. The far left column is the slowest rate of rotation. The far right column is the fastest rate of rotation. Red wheels show the frame in which the sequence starts repeating.

Take a look at the left-most column in Figure 1. This shows the wheel rotating 45º each frame. If we play back these frames, the wheel will look like it’s rotating 45º per frame. So, the playback of the wheel rotating looks the same as it does in real life.

This is more or less the same for the next two columns, showing rotational speeds of 90º and 135º per frame.

However, things change dramatically when we look at the next column – the wheel rotating at 180º per frame. Think about what this would look like if we played this movie (assuming that the frame rate is pretty fast – fast enough that we don’t see things blinking…) Instead of seeing a rotating wheel with only one spoke, we would see a wheel that’s not rotating – and with two spokes.

This is important, so let’s think about this some more. This means that, because we are cutting time into discrete moments (each frame is a “slice” of time) and at a regular rate (I’m assuming here that the frame rate of the film does not vary), then the movement of the wheel is recorded (since our 1 spoke turns into 2) but the direction of movement is not. (We don’t know whether the wheel is rotating clockwise or counter-clockwise. Both directions of rotation would result in the same film…)

Now, let’s move over one more column – where the wheel is rotating at 225º per frame. In this case, if we look at the film, it appears that the wheel is back to having only one spoke again – but it will appear to be rotating backwards at a rate of 135º per frame. So, although the wheel is rotating clockwise, the film shows it rotating counter-clockwise at a different (slower) speed. This is an effect that you’ve probably seen many times in films and on TV. What may come as a surprise is that this never happens in “real life” unless you’re in a place where the lights are flickering at a constant rate (as in the case of fluorescent or some LED lights, for example).

Again, we have to consider the fact that if the wheel actually were rotating counter-clockwise at 135º per frame, we would get exactly the same thing on the frames of the film as when the wheel is rotating clockwise at 225º per frame. These two events in real life will result in identical photos in the film. This is important – so if it didn’t make sense, read it again.

This means that, if all you know is what’s on the film, you cannot determine whether the wheel was going clockwise at 225º per frame, or counter-clockwise at 135º per frame. Both of these conclusions are valid interpretations of the “data” (the film). (Of course, there are more – the wheel could have rotated clockwise by 360º+225º = 585º or counter-clockwise by 360º+135º = 495º, for example…)
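
One way to convince yourself of this is to reduce every rotation to the smallest turn that produces the same frames. The sketch below is just an illustration (the function name and the sign convention, positive for clockwise and negative for counter-clockwise, are my own assumptions).

```python
def apparent_rotation(degrees_per_frame):
    """Smallest rotation per frame (in degrees) that gives the same film.

    Positive means clockwise, negative means counter-clockwise.
    The result lies in the range -180 (exclusive) to +180 (inclusive).
    """
    wrapped = degrees_per_frame % 360
    return wrapped - 360 if wrapped > 180 else wrapped

# All four interpretations from the text end up looking the same on film
for rotation in (225, -135, 585, -495):
    print(rotation, "->", apparent_rotation(rotation))
```

All four of those inputs come out as -135: a wheel that appears to be going backwards by 135º per frame.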

Since these two interpretations of reality are equally valid, we call the one we know is wrong an alias of the correct answer. If I say “The Big Apple”, most people will know that this is the same as saying “New York City” – it’s an alias that can be interpreted to mean the same thing.

Wheels and Slinkies

We people in audio commit many sins. One of them is that, every time we draw a plot of anything called “audio” we start out by drawing a sine wave. (A similar sin is committed by musicians who, at the first opportunity to play a grand piano, will play a middle-C, as if there were no other notes in the world.) The question is: what, exactly, is a sine wave?

Get a Slinky – or if you don’t want to spend money on a brand name, get a spring. Look at it from one end, and you’ll see that it’s a circle, as can be (sort of) seen in Figure 2.

Fig 2. A Slinky, seen from one end. If I had really lined things up, this would just look like a shiny circle.

Since this is a circle, we can put marks on the Slinky at various amounts of rotation, as in Figure 3.

Fig 3. The same Slinky, marked in increasing angles of 45º.

Of course, I could have put the 0º mark anywhere. I could have also rotated counter-clockwise instead of clockwise. But since both of these are arbitrary choices, I’m not going to debate either one.

Now, let’s rotate the Slinky so that we’re looking at it from the side. We’ll stretch it out a little too…

Fig 4. The same Slinky, stretched a little, and viewed from the side.

Let’s do that some more…

Fig 5. The same Slinky, stretched more, and viewed from the “side” (in a direction perpendicular to the axis of the rotation).

When you do this, and you look at the Slinky directly from one side, you are able to see the vertical change of the spring from the centre as a result of the change in rotation. For example, we can see in Figure 6 that, if you mark the 45º rotation point in this view, the distance from the centre of the spring is 71% of the maximum height of the spring (at 90º).

Fig 6. The same markings shown in Figure 3, when looking at the Slinky from the side. Note that, if we didn’t have the advantage of a little perspective (and a spring made of flat metal), we would not know whether the 0º point was closer or further away from us than the 180º point. In other words, we wouldn’t know if the Slinky was rotating clockwise or counter-clockwise.

So what? Well, basically, the “punch line” here is that a sine wave is actually a “side view” of a rotation. So, Figure 7 shows a measurement – a capture – of the amplitude of the signal every 45º.

Fig 7. Each measurement (a black “lollipop”) is a measurement of the vertical change of the signal as a result of rotating 45º.

Since we can now think of a sine wave as a rotation of a circle viewed from the side, it should be just a small leap to see that Figure 7 and the left-most column of Figure 1 are basically identical.

Let’s make audio equivalents of the different columns in Figure 1.

Fig 8. A sampled cosine wave where the frequency of the signal is equivalent to 90º per sample period. This is identical to the “90º per frame” column in Figure 1.

Fig 9. A sampled cosine wave where the frequency of the signal is equivalent to 135º per sample period. This is identical to the “135º per frame” column in Figure 1.

Fig 10. A sampled cosine wave where the frequency of the signal is equivalent to 180º per sample period. This is identical to the “180º per frame” column in Figure 1.

Figure 10 is an important one. Notice that we have a case here where there are exactly 2 samples per period of the cosine wave. This means that our sampling frequency (the number of samples we make per second) is exactly twice the frequency of the signal. If the signal gets any higher in frequency than this, then we will be making fewer than 2 samples per period. And, as we saw in Figure 1, this is where things start to go haywire.

Fig 11. A sampled cosine wave where the frequency of the signal is equivalent to 225º per sample period. This is identical to the “225º per frame” column in Figure 1.

Figure 11 shows the equivalent audio case to the “225º per frame” column in Figure 1. When we were talking about rotating wheels, we saw that this resulted in a film that looked like the wheel was rotating backwards at the wrong speed. The audio equivalent of this “wrong speed” is “a different frequency” – the alias of the actual frequency. However, we have to remember that both the correct frequency and the alias are valid answers – so, in fact, both frequencies (or, more accurately, all of the frequencies) exist in the signal.

So, we could take Fig 11, look at the samples (the black lollipops) and figure out what other frequency fits these. That’s shown in Figure 12.

Fig 12. The red signal and the black samples of it are the same as was shown in Figure 11. However, another frequency (the blue signal) also fits those samples. So, both the red signal and the blue signal exist in our system.
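
A quick numerical check of that claim, as a sketch in Python with NumPy (the variable names simply follow the colours in Figure 12): a cosine that advances 225º per sample and one that advances 135º per sample produce exactly the same sample values.

```python
import numpy as np

n = np.arange(16)                          # sample index
red = np.cos(np.deg2rad(225 * n))          # the signal we meant to capture
blue = np.cos(np.deg2rad(135 * n))         # its alias below Nyquist

print(np.allclose(red, blue))              # the two sets of samples match
```

Once the samples are taken, there is nothing left in them that could tell the two curves apart.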

Moving up in frequency one more step, we get to the right-hand column in Figure 1, whose equivalent, including the aliased signal, is shown in Figure 13.

Fig 13. A signal (the red curve) that has a frequency equivalent to 280º of rotation per sample, its samples (the black lollipops) and the aliased additional signal that results (the blue curve).

Do I need to worry yet?

Hopefully, now, you can see that an LPCM system has a limit with respect to the maximum frequency that it can deal with appropriately. Specifically, the signal that you are trying to capture CANNOT exceed one-half of the sampling rate. So, if you are recording a CD, which has a sampling rate of 44,100 samples per second (or 44.1 kHz) then you CANNOT have any audio signals in that system that are higher than 22,050 Hz.

That limit is commonly known as the “Nyquist frequency”, named after Harry Nyquist – one of the persons who figured out that this limit exists.

In theory, this is always true. So, when someone did the recording destined for the CD, they made sure that the signal went through a low-pass filter that eliminated all signals above the Nyquist frequency.

In practice, however, there are many cases where aliasing occurs in digital audio systems because someone wasn’t paying enough attention to what was happening “under the hood” in the signal processing of an audio device. This will come up later.

Two more details to remember…

There’s an easy way to predict the output of a system that’s suffering from aliasing if your input is sinusoidal (and therefore contains only one frequency). The frequency of the output signal will be the same distance from the Nyquist frequency as the frequency of the input signal. In other words, the Nyquist frequency is like a “mirror” that “reflects” the frequency of the input signal to another frequency below Nyquist.

This can be easily seen in the upper plot of Figure 14. The distance from the Input signal and the Nyquist is the same as the distance between the output signal and the Nyquist.

Also, since that Nyquist frequency acts as a mirror, the frequencies of the input and output signals will move in opposite directions (this point will help later).

Fig 14. Two plots showing the same information about an Input Signal above the Nyquist frequency and the output alias signal. Notice that, in the linear plot on top, it’s easier to see that the Nyquist frequency is the mirror point at the centre of the frequencies of the Input and Output signals.
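
Both details can be seen in a few lines of arithmetic. This is only a sketch, using the CD sampling rate from earlier as an example, and it only holds for inputs between the Nyquist frequency and the sampling rate (that restriction is my assumption to keep things simple).

```python
import numpy as np

fs = 44_100                                # CD sampling rate, as an example
nyquist = fs / 2
inputs = np.linspace(nyquist, fs, 5)       # input sweeping upward from Nyquist to fs
outputs = 2 * nyquist - inputs             # the reflection on the other side of Nyquist

for f_in, f_out in zip(inputs, outputs):
    print(f"input {f_in:8.1f} Hz -> output {f_out:8.1f} Hz")
```

Each output is as far below Nyquist as its input is above it, and as the input climbs the output falls.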

Usually, frequency-domain plots are done on a logarithmic scale, because this is more intuitive for us humans, who hear logarithmically. (For example, we hear two consecutive octaves on a piano as having the same “interval” or “width”. We don’t hear the upper octave as being twice as wide, like a measurement system does. That’s why music notation does not get wider on the top, with a really tall treble clef.) This means that it’s not as obvious that the Nyquist frequency is in the centre of the frequencies of the input signal and its alias below Nyquist.

On to Part 4

“High-Res” Audio: Part 2 – Resolution

Reminder: This is still just the lead-up to the real topic of this series. However, we have to get some basics out of the way first…

Back to Part 1

In the last posting, I talked about how digital audio (more accurately, Linear Pulse Code Modulation or LPCM digital audio) is basically just a string of stored measurements of the electrical voltage that is analogous to the audio signal, which is a change in pressure over time…

For now, we’ll say that each measurement is rounded off to the nearest possible “tick” on the ruler that we’re using to measure the voltage. That rounding results in an error. However, (assuming that everything is working correctly) that error can never be bigger than 1/2 of a “step”. Therefore, in order to reduce the amount of error, we need to increase the number of ticks on the ruler.

Now we have to introduce a new word. If we really had a ruler, we could talk about whether the ticks are 1 mm apart – or 1/16″ – or whatever. We talk about the resolution of the ruler in terms of distance between ticks. However, if we are going to be more general, we can talk about the distance between two ticks being one “quantum” – a fancy word for the smallest step size on the ruler.

So, when you’re “rounding off to the nearest value” you are “quantising” the measurement (or “quantizing” it, if you live in Noah Webster’s country and therefore you harbor the belief that wordz should be spelled like they sound – and therefore the world needz more zees). This also means that the amount of error that you get as a result of that “rounding off” is called “quantisation error”.

In some explanations of this problem, you may read that this error is called “quantisation noise”. However, this isn’t always correct. This is because if something is “noise” then it is random, and therefore impossible to predict. However, that’s not strictly the case for quantisation error. If you know the signal, and you know the quantisation values, then you’ll be able to predict exactly what the error will be. So, although that error might sound like noise, technically speaking, it’s not. This can easily be seen in Figures 1 through 3 which demonstrate that the quantisation error causes a periodic, predictable error (and therefore harmonic distortion), not a random error (and therefore noise).

Sidebar: The reason people call it quantisation noise is that, if the signal is complicated (unlike a sine wave) and high in level relative to the quantisation levels – say a recording of Britney Spears, for example – then the distortion that is generated sounds “random-ish”, which causes people to jump to the conclusion that it’s noise.

Fig 1: The first cycle of a periodic signal (in this case, a sinusoidal waveform) that we are going to quantise using a 4-bit system (notice the 4 bits in the scale on the left).

Fig 2: The same waveform shown in Figure 1 after quantisation (rounding off) in a 4-bit world.

Fig 3: The difference between Figure 2 and Figure 1. I made this by subtracting the original signal from the quantised version. This is the error in the quantised waveform – the quantisation error. Notice that it is not noise… it’s completely predictable and it will repeat with repetitions of the signal. Therefore the result of this is distortion, not noise…
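
Something like Figures 1 to 3 can be reproduced with a few lines of Python and NumPy. This is a sketch under my own assumptions (a sine at 90% of full scale, 100 samples per cycle, and simple rounding to 16 evenly spaced steps), not the code behind the figures.

```python
import numpy as np

bits = 4
step = 2.0 / 2**bits                       # 16 steps across the range -1 .. +1
n = np.arange(200)                         # two cycles, 100 samples per cycle
signal = 0.9 * np.sin(2 * np.pi * n / 100)

quantised = np.round(signal / step) * step # round to the nearest "tick"
error = quantised - signal                 # quantised version minus the original

print("largest error, in steps:", np.max(np.abs(error)) / step)
print("error repeats every cycle:", np.allclose(error[:100], error[100:]))
```

The error never exceeds half a step, and the second cycle of the error is a copy of the first, which is what makes it distortion rather than noise.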

Now, let’s talk about perception for a while… We humans are really good at detecting patterns – signals – in an otherwise noisy world. This is just as true with hearing as it is with vision. So, if you have a sound that exists in a truly random background noise, then you can focus on listening to the sound and ignore the noise. For example, if you (like me) are old enough to have used cassette tapes, then you can remember listening to songs with a high background noise (the “tape hiss”) – but it wasn’t too annoying because the hiss was independent of the music, and constant. However, if you, like me, have listened to Bob Marley’s live version of “No Woman No Cry” from the “Legend” album, then you, like me, would miss the feedback in the PA system at that point in the song when the FoH engineer wasn’t paying enough attention… That noise (the howl of the feedback) is not noise – it’s a signal… Which makes it just as important as the song itself. (I could get into a long boring talk about John Cage at this point, but I’ll try to not get too distracted…)

The problem with the signal in Figure 2 is that the error (shown in Figure 3) is periodic – it’s a signal that demands attention. If the signal that I was sending into the quantisation system (in Figure 1) was a little more complicated than a sine wave – say a sine wave with an amplitude modulation – then the error would be easily “trackable” by anyone who was listening.

So, what we want to do is to quantise the signal (because we’re assuming that we can’t make a better “ruler”) but to make the error random – so it is changed from distortion to noise. We do this by adding noise to the signal before we quantise it. The result of this is that the error will be randomised, and will become independent of the original signal… So, instead of a modulating signal with modulated distortion, we get a modulated signal with constant noise – which is easier for us to ignore. (It has the added benefit of spreading the frequency content of the error over a wide frequency band, rather than being stuck on the harmonics of the original signal… but let’s not talk about that…)
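
Here is a small sketch of that idea in Python with NumPy. The choice of TPDF (triangular) dither, made by adding two uniform random sources, is my assumption for the example, as are the function names and the fixed random seed. To keep things simple, the “signal” is a constant value that sits 0.3 of a step above a tick.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed so the run is repeatable

def quantise(x, step=1.0):
    """Round to the nearest multiple of step."""
    return np.round(x / step) * step

def tpdf_dither(size, step=1.0):
    """Triangular-PDF noise spanning +/- one step (sum of two uniform sources)."""
    return (rng.uniform(-0.5, 0.5, size) + rng.uniform(-0.5, 0.5, size)) * step

signal = np.full(100_000, 0.3)             # a level that falls between two ticks
plain = quantise(signal)
dithered = quantise(signal + tpdf_dither(signal.size))

print("without dither, average output:", plain.mean())
print("with dither, average output:   ", dithered.mean())
```

Without dither, the 0.3 is simply rounded away on every sample. With dither, each individual sample is noisier, but on average the original level is still in there.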

For example…

Let’s take a look at an example of this from an equivalent world – digital photography.

The photo in Figure 4 is a black and white photo – which actually means that it’s comprised of shades of gray ranging from black all the way to white. The photo has 272,640 individual pixels (because it’s 640 pixels wide and 426 pixels high). Each of those pixels is some shade of gray, but that shading does not have an infinite resolution. There are “only” 256 possible shades of gray available for each pixel.

So, each pixel has a number that can range from 0 (black) up to 255 (white).

Fig 4: A photo of a building in Paris. Each pixel in this photo has one of 256 possible levels of gray – from white (255) down to black (0).

If we were to zoom in to the top left corner of the photo and look at the values of the 64 pixels there (an 8×8 pixel square), you’d see that they are:

86 86 90 88 87 87 90 91
86 88 90 90 89 87 90 91
88 89 91 90 89 89 90 94
88 90 91 93 90 90 93 94
89 93 94 94 91 93 94 96
90 93 94 95 94 91 95 96
93 94 97 95 94 95 96 97
93 94 97 97 96 94 97 97

What if we were to reduce the available resolution so that there were fewer shades of gray between white and black? We can take the photo in Figure 4 and round the value in each pixel to the new value. For example, Figure 5 shows an example of the same photo reduced to only 6 levels of gray.

Fig 5: The same photo of the same building. Each pixel in this photo has one of 6 possible levels of gray. Notice that some details are lost – like the smooth transitions in the clouds, or the stripes in the marble in the pillars.

Now, if we look at those same pixels in the upper left corner, we’d see that their values are

102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102
102 102 102 102 102 102 102 102

They’ve all been quantised to the nearest available level, which is 102. (Our possible values are restricted to 0, 51, 102, 154, 205, and 255).
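If you'd like to try that rounding yourself, here is a minimal Python sketch. It assumes that the 6 levels are evenly spaced with a step of 255 / 5 = 51, which is my assumption for illustration (the exact rounding of the intermediate levels may differ slightly from the list above). The names `STEP`, `block` and `quantise_pixel` are hypothetical.

```python
# Assumption: 6 evenly spaced gray levels, so the step between levels is 255 / 5 = 51.
STEP = 255 / 5

# The 8x8 block of pixel values from the top left corner of the original photo.
block = [
    [86, 86, 90, 88, 87, 87, 90, 91],
    [86, 88, 90, 90, 89, 87, 90, 91],
    [88, 89, 91, 90, 89, 89, 90, 94],
    [88, 90, 91, 93, 90, 90, 93, 94],
    [89, 93, 94, 94, 91, 93, 94, 96],
    [90, 93, 94, 95, 94, 91, 95, 96],
    [93, 94, 97, 95, 94, 95, 96, 97],
    [93, 94, 97, 97, 96, 94, 97, 97],
]


def quantise_pixel(value):
    """Round a 0-255 gray value to the nearest of the 6 available levels."""
    return round(round(value / STEP) * STEP)


quantised = [[quantise_pixel(p) for p in row] for row in block]

for row in quantised:
    print(" ".join(str(p) for p in row))
```

Every value in the block lies between 86 and 97, which is closer to 102 than to 51, so the whole block collapses to a single level.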

So, we can see that, by quantising the gray levels from 256 possible values down to only 6, we lose details in the photo. This should not be a surprise… That loss of detail means that, for example, the gentle transition from lighter to darker gray in the sky in the original is “flattened” to a light spot in a darker background, with a jagged edge at the transition between the two. Also, the details of the wall pillars between the windows are lost.

If we take our original photo and add noise to it – so we’re adding a random value to the value of each pixel in the original photo (I won’t talk about the range of those random values…) it will look like Figure 6. This photo has all 256 possible values of gray – the same as in Figure 4.

Fig 6: A photo of noise with the same width and height as the original photo, with random values (ranging from 0 to 255) in each pixel.

If we then quantise Figure 6 using our 6 possible values of gray, we get Figure 7. Notice that, although we do not have more grays than in Figure 5, we can see things like the gradual shading in the sky and some details in the walls between the tall windows.

Fig 7: The same photo of the same building in Figure 4. Each pixel in this photo ALSO only has one of 6 possible levels of gray – just like in Figure 5. However, this version is the result of quantising the original photo with the noise added before quantisation. The result is admittedly noisy – but we are able to see patterns in the noise that preserve some of the details that we lost in Figure 5.

That noise that we add to the original signal is called dither – because it is forcing the quantiser to be indecisive about which level to choose.

I should be clear here and say that dither does not eliminate quantisation error. The purpose of dither is to randomise the error, turning the quantisation error into noise instead of distortion. This makes it (among other things) independent of the signal that you’re listening to, so it’s easier for your brain to separate it from the music, and ignore it.
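To see the "randomise the error" idea in numbers, here is a small Python sketch. It is only an illustration under my own assumptions: 6 evenly spaced gray levels (step of 51), a patch of sky where every pixel has the value 90, and dither noise that is uniformly distributed over plus or minus half a step. The range of the noise is something I did not specify above, so treat that choice (and the names `quantise` and `quantise_with_dither`) as hypothetical.

```python
import random

# Assumption: 6 evenly spaced gray levels (0, 51, 102, ...), so the step is 51.
STEP = 255 / 5


def quantise(value):
    """Round to the nearest available gray level, staying inside 0..255."""
    level = round(value / STEP) * STEP
    return min(255.0, max(0.0, level))


def quantise_with_dither(value, rng):
    """Add noise before quantising. Assumption: uniform noise of +/- half a step."""
    noise = rng.uniform(-STEP / 2, STEP / 2)
    return quantise(value + noise)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
gray = 90               # a flat patch where every pixel has the same value
count = 20000

plain = [quantise(gray) for _ in range(count)]
dithered = [quantise_with_dither(gray, rng) for _ in range(count)]

mean_plain = sum(plain) / count
mean_dithered = sum(dithered) / count

print("without dither: every pixel is", plain[0], "and the average is", mean_plain)
print("with dither: pixels are", sorted(set(dithered)), "and the average is", round(mean_dithered, 1))
```

Without dither, every pixel lands on 102, so the patch is consistently 12 gray levels too bright. With dither, each individual pixel is still wrong (it is either 51 or 102), but the average over the patch sits close to the original 90. The error has not gone away; it has been turned into noise that no longer depends on the picture.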

Addendum: Binary basics and SNR

We normally write down our numbers using a “base 10” notation. So, when I write down 9374 – I mean
9 x 1000 + 3 x 100 + 7 x 10 + 4 x 1
or
9 x 10³ + 3 x 10² + 7 x 10¹ + 4 x 10⁰

We use base 10 notation – a system based on 10 digits (0 through 9) because we have 10 fingers.

If we only had 2 fingers, we would do things differently… We would only have 2 digits (0 and 1) and we would write down numbers like this:
11101

which would be the same as saying
1 x 16 + 1 x 8 + 1 x 4 + 0 x 2 + 1 x 1
or
1 x 2⁴ + 1 x 2³ + 1 x 2² + 0 x 2¹ + 1 x 2⁰

The details of this are not important – but one small point is. If we’re using a base-10 system and we increase the number by one more digit – say, going from a 3-digit number to a 4-digit number, then we increase the possible number of values we can represent by a factor of 10. (in other words, there are 10 times as many possible values in the number XXXX than in XXX.)

If we’re using a base-2 system and we increase by one extra digit, we increase the number of possible values by a factor of 2. So XXXX has 2 times as many possible values as XXX.
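If you want to check that arithmetic, the following short Python sketch expands the binary number digit by digit and counts how many values fit in 3 and 4 digits for each base. The variable names are just for illustration.

```python
# Expand the binary number 11101 as a sum of powers of 2, digit by digit.
digits = "11101"
value = sum(int(d) * 2 ** power for power, d in enumerate(reversed(digits)))
print(digits, "in base 2 is", value, "in base 10")

# Adding one more digit multiplies the number of possible values by the base.
decimal_ratio = 10 ** 4 // 10 ** 3  # XXXX versus XXX in base 10
binary_ratio = 2 ** 4 // 2 ** 3     # XXXX versus XXX in base 2
print("base 10 ratio:", decimal_ratio, "base 2 ratio:", binary_ratio)
```

The built-in `int(digits, 2)` does the same conversion in one step.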

Now, remember that the error that we generate when we quantise is no bigger than 1/2 of a quantisation step, regardless of the number of steps. So, if we double the number of steps (by adding an extra binary digit or bit to the value that we’re storing), then the signal can be twice as “far away” from the quantisation error.

This means that, by adding an extra bit to the stored value, we increase the potential signal-to-error ratio of our LPCM system by a factor of 2 – or 6.02 dB.

So, if we have a 16-bit LPCM signal, then a sine wave at the maximum level that it can be without clipping is about 6 dB/bit * 16 bits – 3 dB = 93 dB louder than the error. The reason we subtract the 3 dB from the value is that the error is +/- 0.5 of a quantisation step (normally called an “LSB” or “Least Significant Bit”).
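Here is the same rule of thumb as a few lines of Python, in case you'd like to try other word lengths. The function name `rule_of_thumb_snr_db` is hypothetical, and it uses the rounded "6 dB per bit minus 3 dB" from the paragraph above rather than anything more exact.

```python
import math

# Doubling the number of steps is a factor of 2, which is about 6.02 dB.
DB_PER_BIT = 20 * math.log10(2)


def rule_of_thumb_snr_db(bits):
    """Approximate signal-to-error ratio using 6 dB per bit, minus 3 dB."""
    return 6 * bits - 3


print("dB per bit:", round(DB_PER_BIT, 2))
for bits in (16, 24):
    print(bits, "bits: about", rule_of_thumb_snr_db(bits), "dB")
```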

Note as well that this calculation is just a rule of thumb. It is neither precise nor accurate, since the details of exactly what kind of error we have will have a minor effect on the actual number. However, it will be close enough.

On to Part 3.

“High-Res” Audio: Part 1

I’ve been debating writing a series of postings about “high resolution” audio for a long time – years. Lately, (probably because of some hype generated by some recent press releases) I’ve been getting lots of question (no, that’s not a typo) about it, so it appears the time has come…

To start: the question that I get (a lot) is “If I can’t hear above 20 kHz, then what’s the use of high-res?” As I’ll explain as we go through, this is only one, rather small aspect to consider in this topic. In fact, it might be the least important issue to consider.

However, before I write too much, I’ll say that I’m not going to argue for or against higher resolutions in digital audio systems. I’m only going to go through a bunch of issues that can be used to argue either for or against them. So, there’s not going to be a big reveal at the end of this series telling you that high-res is either better, worse, or no different than whatever you’re using now. It’s merely going to be a discussion of a number of issues that need to be weighed. The problem is that this entire topic is complicated – and there’s no single “right” answer, as I’ll argue as we go along.

To start, let’s get down to basics and look (once again, from the perspectives of this website) at what sound is, and how it’s converted from an analogue electrical signal into a digital representation. The good thing is that I’ve written this introduction before in a different series of postings. So, I’m going to be extremely lazy and just copy-and-paste that information here. I’m not just referring you to another page because I’m intentionally leaving some things out because we’re headed into having a different discussion this time.

A quick introduction to sound

At the simplest level, sound can be described as a small change in air pressure (or barometric pressure) over short periods of time. If you’d like to have a better and more edu-tain-y version of this statement with animations and pretty colours, you could take 10 minutes to watch this video, for example.

That change in pressure can be “captured” by using a microphone, that is (at the simplest level) a device that has a change in air pressure at its input and a change in electrical voltage at its output. Ignoring a lot of details, we could say that if you were to plot a measurement of the air pressure (at the input of the microphone) over time, and you were to compare it to a plot of the measurement of the voltage (at the output of the microphone) over time, you would see the same curve on the two graphs. This means that the change in voltage is analogous to the change in air pressure.

Fig 1. Notice that (in theory, and ignoring a lot of things…) the change in air pressure over time at the input of the microphone is identical to the change in voltage over time at its output. Of course, this is not true in real life – microphones lie like a cheap rug…

At this point in the conversation, I’ll make a point to say that, in theory, we could “zoom in” on either of those two curves shown in Figure 1 and see more and more details. This is like looking at a map of Canada – it has lots of crinkly, jagged lines. If you zoom in and look at the map of Newfoundland and Labrador, you’ll see that it has finer, crinkly, jagged lines. If you zoom in further, and stand where the water meets the shore in Trepassey and take a photo of your feet, you could copy it to draw a map of the line of where the water comes in around the rocks – and your toes – and you would wind up with even finer, crinkly, jagged lines… You could take this even further and get down to a microscopic or molecular level – but you get the idea… The point is that, in theory, both of the plots in Figure 1 have infinite resolution, both in time and in air pressure or voltage.

Now, let’s say that you wanted to take that microphone’s output and transmit it through a bunch of devices and wires that, in theory, all do nothing to the signal. Let’s say, for example, that you take the mic’s output, send it through a wire to a box that makes the signal twice as loud. Then take the output of that box and send it through a wire to another box that makes it half as loud. You take the output of that box and send it through a wire to a measuring device. What will you see? Unfortunately, none of the wires or boxes in the chain can be perfect, so you’ll probably see the signal plus something else which we’ll call the “error” in the system’s output. We can call it the error because, if we measure the input voltage and the output voltage at any one instant, we’ll probably see that they’re not identical. Since they should be identical, then the system must be making a mistake in transmitting the signal – so it makes errors…

Fig 2. If you send an audio signal through some wires and devices that (in theory) do nothing to the signal, you’ll find out that they add some extra stuff that you don’t want.

Pedantic Sidebar: Some people will call that error that the system adds to the signal “noise” – but I’m not going to call it that. This is because “noise” is a specific thing – noise is random – so if it’s not random, it’s not noise. Also, although the signal has been distorted (in that the output of the system is not identical to the input) I won’t call it “distortion” either, since distortion is a name that’s given to something that happens to the signal because the signal is there. (We would probably get at least some of the error out of our system even if we didn’t send any audio into it.) So, we could be slightly geeky and adequately vague and call the extra stuff “Distortion plus noise” but not “THD+N” – which stands for “Total Harmonic Distortion Plus Noise” – because not all kinds of distortion will produce a harmonic of the signal… but I’m getting ahead of myself…

So, we want to transmit (or store) the audio signal – but we want to reduce the noise caused by the transmission (or storage) system. One way to do this is to spend more money on your system. Use wires with better shielding, amplifiers with lower noise floors, bigger power supplies so that you don’t come close to their limits, run your magnetic tape twice as fast, and so on and so on. Or, you could convert the analogue signal (remember that it’s analogous to the change in air pressure over time) to one that is represented (and therefore transmitted or stored) digitally instead.

What does this mean?

Conversion from analogue to digital and back
(but skipping important details)

IMPORTANT: If you read this section, then please read the following postings as well. This is because, in order to keep things simple to start, I’m about to leave out some important details that I’ll add afterwards. However, if you don’t add the details, you could (understandably) jump to some incorrect conclusions (that many others before you have concluded…) So, if you don’t have time to read both sections, please don’t read either of them.

In the example above, we made a varying voltage that was analogous to the varying air pressure. If we wanted to store this, we could do it by varying the amount of magnetism on a wire or a coating on a tape, for example. Or we could cut a wiggly groove in a bit of vinyl that has a similar shape to the curve in the plots in Figure 1. Or, we could do something else: we could get a metronome (or a clock) and make a measurement of the voltage every time the metronome clicks, and write down the measurements.

For example, let’s zoom in on the first little bit of the signal in the plots in Figure 1

Fig. 3 The same curve as was shown in Figure 1 – but zoomed in to the very beginning.

We’ll then put on a metronome and make a measurement of the voltage every time we hear the metronome click…

Fig 4. The same curve (in red) measured at regular intervals (in black)

We can then keep the measurements (remembering how often we made them…) and write them down like this:

0.3000
0.4950
0.5089
0.3351
0.1116
0.0043
0.0678
0.2081
0.2754
0.2042
0.0730
0.0345
0.1775

We can store this series of numbers on a computer’s hard disk, for example. We can then come back tomorrow, and convert the measurements to voltages. First we read the measurements, and create the appropriate voltage…

Fig. 5. The voltages that we stored as measurements

We then make a “staircase” waveform by “holding” those voltages until the next value comes in.

Fig 6. We make a “staircase” curve using the voltages.
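The "holding" step can be sketched in a couple of lines of Python. This is only an illustration of the idea: the function name `hold` and the choice of 4 output points per sample are my own assumptions, and a real converter does this with voltages, not lists.

```python
# The first few stored measurements from the list above.
samples = [0.3000, 0.4950, 0.5089, 0.3351]


def hold(values, points_per_sample):
    """Repeat each stored value until the next one arrives (a "staircase")."""
    return [v for v in values for _ in range(points_per_sample)]


staircase = hold(samples, 4)
print(staircase)
```

The smoothing of the staircase by the low-pass filter is not included in this sketch.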

All we need to do then is to use a low-pass filter to smooth out the hard edges of the staircase.

Fig 7. When we smooth out the staircase, we get back the original signal (in red).

So, in this example, we’ve gone from an analogue signal (the red curve in Figure 3) to a digital signal (the series of numbers), and back to an analogue signal (the red curve in Figure 7).

In some ways, this is a bit like the way a movie works. When you watch a movie, you see a series of still photographs, probably taken at a rate of 24 pictures (or frames) per second. If you play those photos back at the same rate (24 fps or frames per second), you think you see movement. However, this is because your eyes and brain aren’t fast enough to see 24 individual photos per second – so you are fooled into thinking that things on the screen are moving.

However, digital audio is slightly different from film in two ways:

The sound (equivalent to the movement in the film) is actually happening. It’s not a trick that relies on your ears and brain being too slow.
If, when you were filming the movie, something were to happen between frames (say, the flash of a gunshot, for example) then it would never be caught on film. This is because the photos are discrete moments in time – and what happens between them is lost. However, if something were to make a very, very short sound between two samples (two measurements) in the digital audio signal – it would not be lost. This is because of something that happens at the beginning of the chain that I haven’t described… yet…

However, there are some “artefacts” (a fancy term for “weird errors”) that are present both in film and in digital audio that we should talk about.

The first is an error that happens when you mess around with the rate at which you take the measurements (called the “sampling rate”) or the photos (called the “frame rate”) – and, more importantly, when you need to worry about this. Let’s say that you make a film at 24 fps. If you play this back at a higher frame rate, then things will move very quickly (like old-fashioned baseball movies…). If you play them back at a lower frame rate, then things move in slow motion. So, for things to look “normal” you have to play the movie at the same rate that it was filmed. However, as long as no one is looking, you can transfer the movie as fast as you like. For example, if you wanted to copy the film, you could set up a movie camera so it was pointing at a movie screen and film the film. As long as the movie on the screen is running in sync with the camera, you can do this at any frame rate you like. But you’ll have to watch the copy at the same frame rate as the original film… (Note that this issue is not something that will come up in this series of postings about high resolution audio)

The second is an easy artefact to recognise. If you see a car accelerating from 0 to something fast on film, you’ll see the wheels of the car start to get faster and faster, then, as the car gets faster, the wheels slow down, stop, and then start going backwards… This does not happen in real life (unless you’re in a place lit with flashing lights like fluorescent bulbs or LED’s). I’ll do a posting explaining why this happens – but the thing to remember here is that the speed of the wheel rotation that you see on the film (the one that’s actually captured by the filming…) is not the real rotational speed of the wheel. However, those two rotational speeds are related to each other (and to the frame rate of the film). If you change the real rotational rate or the frame rate, you’ll change the rotational rate in the film. So, we call this effect “aliasing” because it’s a false version (an alias) of the real thing – but it’s always the same alias (assuming you repeat the conditions…) Digital audio can also suffer from aliasing, but in this case, you put in one frequency (which is actually the same as a rotational speed) and you get out another one. This is not the same as harmonic distortion, since the frequency that you get out is due to a relationship between the original frequency and the sampling rate, so the result is almost never a multiple of the input frequency. (We’re going to dig into this a lot deeper through this series of postings about high resolution audio, so if it doesn’t immediately make sense, don’t worry…)
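As a rough preview of that relationship, here is a small Python sketch that predicts which frequency comes out when a sine tone is sampled without any protection against aliasing. The function name `alias_frequency` and the example numbers (a 48 kHz sampling rate, tones at 1 kHz, 30 kHz and 47 kHz) are my own choices for illustration.

```python
def alias_frequency(input_hz, sampling_rate_hz):
    """Frequency that appears after sampling, assuming no anti-aliasing filter.

    The result is the distance from the input frequency to the nearest
    multiple of the sampling rate.
    """
    nearest_multiple = sampling_rate_hz * round(input_hz / sampling_rate_hz)
    return abs(input_hz - nearest_multiple)


for tone in (1000, 30000, 47000):
    print(tone, "Hz in gives", alias_frequency(tone, 48000), "Hz out")
```

Notice that the 30 kHz tone comes out at 18 kHz, which is not a multiple of 30 kHz. This is why the result is not the same as harmonic distortion.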

Some important details that I left out…

One of the things I said above was something like “we measure the voltage and store the results” and the example I gave was a nice series of numbers that only had 4 digits after the decimal point. This statement has some implications that we need to discuss.

Let’s say that I have a thing that I need to measure. For example, Figure 8 shows a piece of metal, and I want to measure its width.

Fig 8. A piece of metal with a width of “approximately 57 mm”.

Using my ruler, I can see that this piece of metal is about 57 mm wide. However, if I were geeky (and I am) I would say that this is not precise enough – and therefore it’s not accurate. The problem is that my ruler is only graduated in millimetres. So, if I try to measure anything that is not exactly an integer number of mm long, I’ll either have to guess (and be wrong) or round the measurement to the nearest millimetre (and be wrong).

So, if I wanted you to make a piece of metal the same width as my piece of metal, and I used the ruler in Figure 8, we would probably wind up with metal pieces of two different widths. In order to make this better, we need a better ruler – like the one in Figure 9.

Fig 9. The same piece of metal being measured with a vernier caliper. This gives us additional precision (down to 0.05 mm) so we can make a more accurate measurement.

Figure 9 shows a vernier caliper (a fancy type of ruler) being used to measure the same piece of metal. The caliper has a resolution of 0.05 mm instead of the 1 mm available on the ruler in Figure 8. So, we can make a much more accurate measurement of the metal because we have a measuring device with a higher precision.

The conversion to a digital audio signal works the same way. As I said above, we measure the voltage of the electrical signal, and transmit (or store) the measurement. The question is: how accurate and precise is your measurement? As we saw above, this is (partly) determined by how many digits are in the number that you use when you “write down” the measurement.

Since the voltage measurements in digital audio are recorded in binary rather than decimal (we use 0 and 1 to write down the number instead of 0 up to 9) then we use Binary digITS – or “bits” instead of decimal digits (which are not called “dits”). The number of bits we have in the number that we write down (partly) determines the precision of the measurement of the voltage – and therefore (possibly), our accuracy…

Just like the example of the ruler in Figure 8, above, we have a limited resolution in our measurement. For example, if we had only 4 bits to work with then the waveform in Figure 4 – the one we have to measure – would be measured with the “ruler” shown on the left side of Figure 10, below.

Fig 10: The waveform from Figure 4 as a voltage (notice the Y-axis on the right). We have to measure these values using the ruler with the resolution shown on the Y-axis on the left.

When we do this, we have to round off the value to the nearest “tick” on our ruler, as shown in Figure 11.

Fig 11: The values from Figure 10 (shown as the circles) rounded off to the nearest value on our 4-bit ruler (the red staircase).

Using this “ruler” which gives a write-down-able “quantity” to the measurement, we get the following values for the red staircase:

0010
0100
0100
0011
0001
0000
0001
0010
0010
0010
0001
0000
0001
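The list above can be reproduced with a few lines of Python. This sketch assumes that the 4-bit "ruler" covers -1 V to 1 V, so that one tick is 2 / 16 = 0.125 V, and it only handles the positive values in this example (negative values would need a convention like two's complement, which I'm skipping here). The names `STEP`, `samples` and `codes` are hypothetical.

```python
# Assumption: 4 bits covering -1 V to 1 V, so one "tick" is 2 / 2**4 = 0.125 V.
STEP = 2 / 2 ** 4

# The measurements that we wrote down earlier.
samples = [0.3000, 0.4950, 0.5089, 0.3351, 0.1116, 0.0043, 0.0678,
           0.2081, 0.2754, 0.2042, 0.0730, 0.0345, 0.1775]

# Round each value to the nearest tick, and write the tick number in binary.
# This only works for the non-negative values in this example.
codes = [format(round(v / STEP), "04b") for v in samples]

for v, code in zip(samples, codes):
    print(f"{v:.4f} -> {code}")
```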

When we “play these back” we get the staircase again, shown in Figure 12.

Fig 12: The output of the measurements. Notice that all values sit exactly on one of the values for the “ruler” on the left Y-axis of the plot.

Of course, this means that, by rounding off the values, we have introduced an error in the system (just like the measurement in Figure 8 has a bigger error than the one in Figure 9). We can calculate this error if we just subtract the original signal from the output signal (in other words, Figure 12 minus Figure 10) to get Figure 13.

Fig 13: The error that we produced due to the rounding off of the signal when we did the measurements. Notice that the error is always less than 0.5 of a “tick” of the ruler on the left Y-axis.
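The same subtraction can be done on the numbers instead of the plots. As before, this Python sketch assumes a tick size of 0.125 V (4 bits covering -1 V to 1 V), and the variable names are only for illustration.

```python
# Assumption: 4 bits covering -1 V to 1 V, so one "tick" is 0.125 V.
STEP = 0.125

samples = [0.3000, 0.4950, 0.5089, 0.3351, 0.1116, 0.0043, 0.0678,
           0.2081, 0.2754, 0.2042, 0.0730, 0.0345, 0.1775]

# The error is the rounded-off (output) value minus the original value.
errors = [round(v / STEP) * STEP - v for v in samples]
worst = max(abs(e) for e in errors)

print("largest error:", round(worst, 4), "V")
print("half of a tick:", STEP / 2, "V")
```

The largest error in this example is smaller than half of a tick, which is what the plot in Figure 13 shows.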

In order to improve our accuracy of the measurement, we have to increase the precision of the values. We can do this by adding an extra digit (or bit) to the number that we use to record the value.

If we were using decimal numbers (0-9) then adding an extra digit to the number would give us 10 times as many possibilities. (For example, if we were using 4 digits after the decimal in the example at the start of this posting, we have a total of 10,000 possible values – 0.0000 to 0.9999. If we add one more digit, we increase the resolution to 100,000 possible values – 0.00000 to 0.99999 ).

In binary, adding one extra digit gives us twice as many “ticks” on the ruler. So, using 4 bits gives us 16 possible values. Increasing to 5 bits gives us 32 possible values.

If you’re listening to a CD, then the individual measurements of each voltage – the “sample values” – are stored with 16 bits, which means that we have 65,536 possible values to pick from.

Remember that this means that we have more “ticks” on our ruler – but we don’t necessarily increase its range. So, for example, we’re still measuring a voltage from -1 V to 1 V – we just have more and more resolution with which we can do that measurement.
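To put some numbers on this, here is a short Python sketch that counts the "ticks" and works out the distance between them for a ruler that always covers -1 V to 1 V. The function names are hypothetical.

```python
def number_of_levels(bits):
    """How many different values can be written down with this many bits."""
    return 2 ** bits


def step_size_volts(bits, lowest=-1.0, highest=1.0):
    """Distance between two neighbouring ticks when the range stays the same."""
    return (highest - lowest) / number_of_levels(bits)


for bits in (4, 5, 16):
    print(bits, "bits:", number_of_levels(bits), "levels,",
          step_size_volts(bits), "V per tick")
```

The range of the ruler is the same in every case. Only the spacing of the ticks changes.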

On to Part 2…

Turntables and Vinyl: Part 9

Back to Part 8

Magnitude response

The magnitude response* of any audio device is a measure of how much its output level deviates from the expected level at different frequencies. In a turntable, this can be measured in different ways.

Usually, the magnitude response is measured from a standard test disc with a sine wave sweep ranging from at least 20 Hz to at least 20 kHz. The output level of this signal is recorded at the output of the device, and the level is analysed to determine how much it differs from the expected output. Consequently, the measurement includes all components in the audio path from the stylus tip, through the RIAA preamplifier (if one is built into the turntable), to the line-level outputs.

Because all of these components are in the signal path, there is no way of knowing immediately whether deviations from the expected response are caused by the stylus, the preamplifier, or something else in the chain.

It’s also worth noting that a typical standard test disc (JVC TRS-1007 is a good example) will not have a constant output level, which you might expect if you’re used to measuring other audio devices. Usually, the swept sine signal has a constant amplitude in the low frequency bands (typically, below 1 kHz) and a constant modulation velocity in the high frequencies. This is to avoid over-modulation in the low end, and burning out the cutter head during mastering in the high end.

* This is the correct term for what is typically called the “frequency response”. The difference is that a magnitude response only shows output level vs. frequency, whereas the frequency response would include both level and phase information.

Rumble

In theory, an audio playback device only outputs the audio signal that is on the recording without any extra contributions. In practice, however, every audio device adds signals to the output for various reasons. As was discussed above, in the specific case of a turntable, the audio signal is initially generated by very small movements of the stylus in the record groove. Therefore, in order for it to work at all, the system must be sensitive to very small movements in general. This means that any additional movement can (and probably will) be converted to an audio signal that is added to the recording.

This unwanted extraneous movement, and therefore signal, is usually the result of very low-frequency vibrations that come from various sources. These can include things like mechanical vibrations of the entire turntable transmitted through the table from the floor, vibrations in the system caused by the motor or imbalances in the moving parts, warped discs which cause a vertical movement of the stylus, and so on. These low-frequency signals are grouped together under the heading of rumble.

A rumble measurement is performed by playing a disc that has no signal on it, and measuring the output signal’s level. However, that output signal is first filtered to ensure that the level detection is not influenced by higher-frequency problems that may exist.

The characteristics of the filters are defined in international standards such as DIN 45 539 (or IEC98-1964), shown below. Note that I’ve only plotted the target response. The specifications allow for some deviation of ±1 dB (except at 315 Hz). Notice that the low-pass filter is the same for both the Weighted and the Unweighted filters. Only the high-pass filter specifications are different for the two cases.

The magnitude responses for the “Unweighted” (black) and “Weighted” filters for rumble measurements, specified in DIN 45 539

If the standard being used for the rumble measurement is the DIN 45 539 specification, then the resulting value is stated as the level difference between the measured filtered noise and the standard output level, equivalent to the output when playing a 1 kHz tone with a lateral modulation velocity of 70.7 mm/sec. This detail is also worth noting, since it shows that the rumble value is a relative and not an absolute output level.

Rotational speed

Every recording / playback system, whether for audio or for video signals, is based on the fundamental principle that the recording and the playback happen at the same rate. For example, a film that was recorded at 24 frames (or photos) per second (FPS) must also be played at 24 FPS to avoid objects and persons moving too slowly or too quickly. It’s also necessary that neither the recording nor the playback speed changes over time.

A phonographic LP is mastered with the intention that it will be played back at a rotational speed of 33 1/3 RPM (Revolutions Per Minute) or 45 RPM, depending on the disc. (These correspond to 1 revolution either every 1.8 seconds or every 1 1/3 seconds respectively.) We assume that the rotational speed of the lathe that was used to cut the master was both very accurate and very stable. Although it is the job of the turntable to duplicate this accuracy and stability as closely as possible, measurable errors occur for a number of reasons, both mechanical and electrical. When these errors are measured using especially-created audio signals like pure sine tones, the results are filtered and analyzed to give an impression of how audible they are when listening to music. However, a problem arises in that a simple specification (such as a single number for “Wow and Flutter”, for example) can only be correctly interpreted with the knowledge of how the value is produced.

Accuracy

The first issue is the simple one of accuracy: is the turntable rotating the disc at the correct average speed? Most turntables have some kind of user control of this (both for the 33 and 45 RPM settings), since it will likely be necessary to adjust these occasionally over time, as the adjustment will drift with influences such as temperature and age.

Stability

Like any audio system, regardless of whether it’s analogue or digital, the playback speed of the turntable will vary over time. As it increases and decreases, the pitch of the music at the output will increase and decrease proportionally. This is unavoidable. Therefore, there are two questions that result:

How much does the speed change?
What is the rate and pattern of the change?

In a turntable, the amount of the change in the rotational speed is directly proportional to the frequency shift in the audio output. For example, if the rotational speed decreases by 1% (say, from 33 1/3 RPM to exactly 33 RPM), the audio output will drop in frequency by 1% (so a 440 Hz tone will be played as a 440 * 0.99 = 435.6 Hz tone). Whether this is audible is dependent on different factors including

the rate of change to the new speed
(a 1% change 4 times a second is much easier to hear than a 1% change lasting 1 hour)
the listener’s abilities
(for example, a person with “absolute pitch” may be able to recognise the change)
the audio signal
(It is easier to detect a frequency shift of a single, long tone such as a note on a piano or pipe organ than it is of a short sound like a strike of claves or a sound with many inharmonic frequencies such as a snare drum.)
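The proportional relationship between the rotational speed and the output frequency can be checked with a few lines of Python. This is a minimal sketch: the function name `played_frequency` is hypothetical, and it uses exact fractions so that 33 1/3 RPM does not get rounded.

```python
from fractions import Fraction


def played_frequency(recorded_hz, nominal_rpm, actual_rpm):
    """Output frequency when the disc turns at actual_rpm instead of nominal_rpm."""
    return recorded_hz * actual_rpm / nominal_rpm


nominal = Fraction(100, 3)  # 33 1/3 RPM as an exact fraction
shifted = played_frequency(440, nominal, 33)

print("speed ratio:", float(33 / nominal))
print("a 440 Hz tone is played as", float(shifted), "Hz")
```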

In an effort to simplify the specification of stability in analogue playback equipment such as turntables, four different classifications are used, each corresponding to different rates of change. These are drift, wow, flutter, and scrape. The two most commonly quoted are wow and flutter, which are typically grouped into one value.

Drift

Frequency drift is the tendency of a playback device’s speed to change very slowly over time. Any variation that happens slower than once every 2 seconds (in other words, with a modulation frequency of less than 0.5 Hz) is considered to be drift. This is typically caused by changes in temperature (as the playback device heats up) or variations in the power supply (due to changes in the mains supply, which can vary with changing loads throughout the day).

Wow

Wow is a modulation in the speed ranging from once every 2 seconds to 6 times a second (0.5 Hz to 6 Hz). Note that, for a turntable, the rotational speed of the disc is within this range. (At 33 1/3 RPM: 1 revolution every 1.8 seconds is equal to approximately 0.556 Hz.)

Flutter

Flutter describes a modulation in the speed ranging from 6 to 100 times a second (6 Hz to 100 Hz).

Scrape

Scrape or scrape flutter describes changes in the speed that are higher than 100 Hz. This is typically only a problem with analogue tape decks (caused by the magnetic tape sticking and slipping on components in its path) and is not often used when classifying turntable performance.
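The four classifications can be summarised as a simple lookup on the modulation frequency. The sketch below uses the band edges given in the text (0.5 Hz, 6 Hz and 100 Hz). How a value that sits exactly on an edge is labelled is my own assumption, since the text lists ranges that share their end points.

```python
def classify_speed_variation(modulation_hz):
    """Name the kind of speed instability for a given modulation frequency."""
    if modulation_hz < 0.5:
        return "drift"
    if modulation_hz <= 6.0:      # edge handling is an assumption
        return "wow"
    if modulation_hz <= 100.0:    # edge handling is an assumption
        return "flutter"
    return "scrape"


# Once-per-revolution variation of a 33 1/3 RPM disc (about 0.556 Hz)
print(classify_speed_variation(33.333 / 60.0))   # wow
print(classify_speed_variation(0.1))             # drift
print(classify_speed_variation(40.0))            # flutter
print(classify_speed_variation(250.0))           # scrape
```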

Measurement and Weighting

The easiest accurate way to measure the stability of the turntable’s speed within the range of Wow and Flutter is to follow one of the many standard methods, all of which are similar. Examples of these standards are AES6-2008, CCIR 409-3, DIN 45507, and IEC-386. A special measurement disc containing a sine tone, usually with a frequency of 3150 Hz, is played into a measurement device, which then does a frequency analysis of the signal. In a perfect system, the result would be a 3150 Hz sine tone. In practice, however, the frequency of the tone varies over time, and it is this variation that is measured and analysed.

There is general agreement that we are particularly sensitive to a modulation in frequency of about 4 Hz (4 cycles per second) applied to many audio signals. As the modulation gets slower or faster, we are less sensitive to it, as was illustrated in the example above (a 1% change 4 times a second is much easier to hear than a 1% change lasting 1 hour).

So, for example, if the analysis of the 3150 Hz tone shows that it varies by ±1% at a frequency of 4 Hz, then this will have a bigger impact on the result than if it varies by ±1% at a frequency of 0.1 Hz or 40 Hz. The amount of impact the measurement at any given modulation frequency has on the total result is shown as a “weighting curve” in the figure below.

Weighting applied to the Wow and Flutter measurement in most standard methods. See the text for an explanation.

As can be seen in this curve, a modulation at 4 Hz has a much bigger weight (or impact) on the final result than a modulation at 0.315 Hz or at 140 Hz, where a 20 dB attenuation is applied to their contribution to the total result. Since attenuating a value by 20 dB is the same as dividing it by 10, a ±1% modulation of the 3150 Hz tone at 4 Hz will produce the same result as a ±10% modulation of the 3150 Hz tone at 140 Hz, for example.

This is just one example of why a single Wow and Flutter measurement value should be interpreted very cautiously.
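The weighting arithmetic can be sketched as follows. Only the three points mentioned in the text are used (0 dB at 4 Hz, and -20 dB at 0.315 Hz and 140 Hz); the full weighting curve is not reproduced here, and the function names are mine.

```python
def db_to_factor(gain_db):
    """Convert a gain in dB to a linear multiplier for an amplitude-like value."""
    return 10.0 ** (gain_db / 20.0)


def weighted_deviation(deviation_percent, weighting_db):
    """Contribution of a speed deviation after the weighting is applied."""
    return deviation_percent * db_to_factor(weighting_db)


# +/-1% at 4 Hz (0 dB weighting) versus +/-10% at 140 Hz (-20 dB weighting)
print(round(weighted_deviation(1.0, 0.0), 3))     # 1.0
print(round(weighted_deviation(10.0, -20.0), 3))  # 1.0
```

Both cases contribute the same amount to the final figure, which is why the single number cannot tell you which of the two situations you have.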

Expressing the result

When looking at a Wow and Flutter specification, one will see something like <0.1%, <0.05% (DIN), or <0.1% (AES6). Like any audio specification, if the details of the measurement type are not included, then the value is useless. For example, “W&F: <0.1%” means nothing, since there is no way to know which method was used to arrive at this value. (Similarly, a specification like “Frequency Range: 20 Hz to 20 kHz” means nothing, since there is no information about the levels used to define the range.)

If the standard is included in the specification (DIN or AES6, for example), then it is still difficult to compare wow and flutter values. This is because, even when performing identical measurements and applying the same weighting curve shown in the figure above, there are different methods for arriving at the final value. The value that you see may be a peak value (the maximum deviation from the average speed), the peak-to-peak value (the difference between the minimum and the maximum speeds), the RMS (a version of the average deviation from the average speed), or something else.

The AES6-2008 standard, which is the currently accepted method of measuring and expressing the wow and flutter specification, uses a “2-sigma” method, which is a way of looking at the peak deviation to give a kind of “worst-case” scenario. In this method, the 3150 Hz tone is played from a disc and captured for as long a time as is possible or feasible. Firstly, the average value of the actual frequency of the output is found (in theory, it’s fixed at 3150 Hz, but this is never true). Next, the short-term variation of the actual frequency over time is compared to the average, and weighted using the filter shown above. The result shows the instantaneous frequency variations over the length of the captured signal, relative to the average frequency (however, the effects of very slow and very fast changes have been reduced by the filter). Finally, the standard deviation of the variation from the average is calculated, and multiplied by 2 (“2-Sigma”, or “two times the standard deviation”), resulting in the value that is shown as the specification. The reason two standard deviations is chosen is that (in the typical case where the deviation has a Gaussian distribution) the actual Wow & Flutter value should exceed this value no more than 5% of the time.
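The final steps of the 2-sigma calculation can be sketched in Python. This is a simplified illustration, not an implementation of the standard: it assumes the instantaneous frequencies have already been extracted from the captured tone, it leaves out the weighting filter entirely, and the choice of a population standard deviation is my assumption. The test signal is synthetic.

```python
import math
import statistics


def two_sigma_percent(freqs_hz):
    """Two times the standard deviation of the frequency deviation, in percent.

    freqs_hz: instantaneous frequencies of the captured test tone.
    The weighting filter described in the text is NOT applied here.
    """
    mean_hz = statistics.fmean(freqs_hz)
    deviations = [100.0 * (f - mean_hz) / mean_hz for f in freqs_hz]
    return 2.0 * statistics.pstdev(deviations)


# Synthetic capture: a 3150 Hz tone wobbling by +/-1% at 4 Hz,
# sampled 1000 times per second for 10 seconds.
sample_rate = 1000
freqs = [
    3150.0 * (1.0 + 0.01 * math.sin(2.0 * math.pi * 4.0 * n / sample_rate))
    for n in range(10 * sample_rate)
]
print(round(two_sigma_percent(freqs), 3))
```

For a purely sinusoidal wobble like this one, the standard deviation is the peak value divided by the square root of 2, so the 2-sigma figure comes out at roughly 1.41 times the peak deviation.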

The reason this method is preferred today is that it uses a single number to express not only the wow and flutter, but the probability of the device reaching that value. For example, if a device is stated to have a “Wow and Flutter of <1% (AES6)”, then the actual deviation from the average speed will be less than 1% for 95% of the time you are listening to music. The principal reason this method was not used in the “old days” is that it requires statistical calculations applied to a signal that was captured from the output of the turntable, an option that was not available decades ago. The older DIN method that was used showed a long-term average level that was being measured in real-time using analogue equipment such as the device shown below.

Bang & Olufsen WM1, analogue wow and flutter meter.

Unfortunately, however, it is still impossible to know whether a specification that reads “Wow and Flutter: 1% (AES6)” means 1% deviation with a modulation frequency of 4 Hz or 10% deviation with a modulation frequency of 140 Hz – or something else. It is also impossible to compare this value to a measurement done with one of the older standards such as the DIN method, for example.

Turntables and Vinyl: Part 8

Back to Part 7

As was discussed in Part 3, when a record master is cut on a lathe, the cutter head follows a straight-line path as it moves from the outer rim to the inside of the disk. This means that it is always modulating in a direction that is perpendicular to the groove’s relative direction of travel, regardless of its distance from the centre.

The direction of travel of the cutting head when the master disk is created on a lathe.

A turntable should be designed to ensure that the stylus tracks the groove made by the cutter head in all aspects. This means that this perpendicular angle should be maintained across the entire surface of the disk. However, in the case of a tonearm that pivots, this is not possible, since the stylus follows a circular path, resulting in an angular tracking error.

Any tonearm has some angular tracking error that varies with position on the disk.

The location of the pivot point, the tonearm’s shape, and the mounting of the cartridge can all contribute to reducing this error. Typically, tonearms are designed so that the cartridge is angled to not be in-line with the pivot point. This is done to ensure that there can be two locations on the record’s surface where the stylus is angled correctly relative to the groove.

A correctly-designed and aligned pivoting tonearm has a tracking error of 0º at only two locations on the disk.

However, the only real solution is to move the tonearm in a straight line across the disc, maintaining a position that is tangential to the groove, and therefore keeping the stylus located so that its movement is perpendicular to the groove’s relative direction of travel, just as it was with the cutter head on the lathe.

A tonearm that travels sideways, maintaining an angle that is tangent to the groove at the stylus.

In a perfect system, the movement of the tonearm would be completely synchronous with the sideways “movement” of the groove underneath it. However, this is almost impossible to achieve. In the Beogram 4000c, a detection system is built into the tonearm that responds to the angular deviation from the resting position. The result is that the tonearm “wiggles” across the disk: the groove pulls the stylus towards the centre of the disk for a small distance before the detector reacts and moves the back of the tonearm to correct the angle.

Typically, the distance moved by the stylus before the detector engages the tracking motor is approximately 0.1 mm, which corresponds to a tracking error of approximately 0.044º.

An exaggerated representation of the maximum tracking error of the tonearm before the detector engages and corrects.

One of the primary artefacts caused by an angular tracking error is distortion of the audio signal: mainly second-order harmonic distortion of sinusoidal tones, and intermodulation distortion on more complex signals. (see “Have Tone Arm Designers Forgotten Their High-School Geometry?” in The Audio Critic, 1:31, Jan./Feb., 1977.) It can be intuitively understood that the distortion is caused by the fact that the stylus is being moved at a different angle than that for which it was designed.

It is possible to calculate an approximate value for this distortion level using the following equation:

$Hd \approx 100 * \frac{ \omega A \alpha }{\omega_r R }$

Where $Hd$ is the harmonic distortion in percent, $\omega$ is the angular frequency of the modulation caused by the audio signal (calculated using $\omega = 2 \pi F$), $A$ is the peak amplitude in mm, $\alpha$ is the tracking error in degrees, $\omega_r$ is the angular frequency of rotation (the speed of the record in radians per second; for example, at 33 1/3 RPM, $\omega_r = 2 \pi \times 0.556 \approx 3.49$ rad/sec), and $R$ is the radius in mm (the distance of the groove from the centre of the disk). (see “Tracking Angle in Phonograph Pickups” by B.B. Bauer, Electronics (March 1945))

This equation can be re-written, separating the audio signal from the tonearm behaviour, as shown below.

$Hd \approx 100 * \frac{ \omega A }{\omega_r} * \frac{\alpha}{R}$

which shows that, for a given audio frequency and disk rotation speed, the audio signal distortion is proportional to the horizontal tracking error over the distance of the stylus to the centre of the disk. (This is the reason one philosophy in the alignment of a pivoting tonearm is to ensure that the tracking error is reduced when approaching the centre of the disk, since the smaller the radius, the greater the distortion.)

It may be confusing as to why the position of the groove on the disk (the radius) has an influence on this value. The reason is that the distortion is dependent on the wavelength of the signal encoded in the groove. The longer the wavelength, the lower the distortion. As was shown in Figure 1 in Part 6 of this series, the wavelength of a constant frequency is longer on the outer groove of the disk than on the inner groove.

Using the Beogram 4000c as an example at its worst-case tracking error of 0.044º: if we have a 1 kHz sine wave with a modulation velocity of 34.1 mm/sec on a 33 1/3 RPM LP on the inner-most groove then the resulting 2nd-harmonic distortion will be 0.7% or about -43 dB relative to the signal. At the outer-most groove (assuming all other variables remain constant), the value will be roughly half of that, at 0.3% or -50 dB.
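The numbers in this example can be reproduced with the approximation given earlier. The sketch below makes two assumptions that are worth stating: the 34.1 mm/sec modulation velocity is treated as the peak velocity (so it stands in for $\omega A$ directly), and the inner and outer groove radii are taken as 60 mm and 146 mm, the values listed for a 12″ LP in Part 5. The function names are my own.

```python
import math


def tracking_distortion_percent(peak_velocity_mm_s, error_deg, rpm, radius_mm):
    """Approximate 2nd-harmonic distortion (percent) from angular tracking error.

    Uses Hd ~ 100 * (omega * A * alpha) / (omega_r * R), where the peak
    velocity of a sinusoidal modulation equals omega * A.
    """
    omega_r = 2.0 * math.pi * rpm / 60.0   # platter speed in rad/sec
    return 100.0 * peak_velocity_mm_s * error_deg / (omega_r * radius_mm)


def percent_to_db(percent):
    """Express a percentage as a level in dB relative to the signal."""
    return 20.0 * math.log10(percent / 100.0)


for label, radius in (("inner", 60.0), ("outer", 146.0)):
    hd = tracking_distortion_percent(34.1, 0.044, 100.0 / 3.0, radius)
    print(label, round(hd, 2), "%", round(percent_to_db(hd), 1), "dB")
```

The unrounded results are close to 0.72% (about -42.9 dB) for the inner groove and 0.29% (about -50.6 dB) for the outer groove, in line with the rounded figures quoted above.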

I bet Clara would be impressed…

Turntables and Vinyl: Part 7

Back to Part 6

Tracking force

In order to keep the stylus tip in the groove of the record, some force must push down on it. If this force is too small, the stylus will not stay in the groove. However, if it is too large, then both the vinyl and the stylus will wear more quickly. Thus a balance must be found between “too much” and “not enough”.

As can be seen in Figure 1, the typical tracking force of phonograph players has changed considerably since the days of gramophones playing shellac discs, with values under 10 g being standard since the introduction of vinyl microgroove records in 1948. The original recommended tracking force of the Beogram 4002 was 1 g. However, this has been increased to 1.3 g for the Beogram 4000c in order to help track more recent recordings with higher modulation velocities and displacements.

Effective Tip Mass

The stylus’s job is to track all of the vibrations encoded in the groove. It stays in that groove as a result of the adjustable tracking force holding it down, so the moving parts should be as light as possible in order to ensure that they can move quickly. The total apparent mass of the parts that are being moved as a result of the groove modulation is called the effective tip mass. Intuitively, this can be thought of as giving an impression of the amount of inertia in the stylus.

It is important to not confuse the tracking force and the effective tip mass, since these are very different things. Imagine a heavy object like a 1500 kg car, for example, lifted off the ground using a crane, and then slowly lowered onto a scale until it reads 1 kg. The “weight” of the car resting on the scale is equivalent to 1 kg. However, if you try to push the car sideways, you will obviously find that it is more difficult to move than a 1 kg mass, since you are trying to overcome the inertia of all 1500 kg, not the 1 kg that the scale “sees”. In this analogy, the reading on the scale is equivalent to the Tracking Force, and the mass that you’re trying to move is the Effective Tip Mass. Of course, in the case of a phonograph stylus, the opposite relationship is desirable; you want a tracking force high enough to keep the stylus in the groove, and an effective tip mass as close to 0 as possible, so that it is easy for the groove to move it.

Compliance

Imagine an audio signal that is on the left channel only. In this case, the variation is only on one of the two groove walls, causing the stylus tip to ride up and down on those bumps. If the modulation velocity is high, and the effective tip mass is too large, then the stylus can lift off the wall of the groove just like a car leaving the surface of a road on the trailing side of a bump. In order to keep the car’s wheels on the road, springs are used to push them back down before the rest of the car starts to fall. The same is true for the stylus tip. It’s being pushed back down into the groove by the cantilever that provides the spring. The amount of “springiness” is called the compliance of the stylus suspension. (Compliance is the opposite of spring stiffness: the more compliant a spring is, the easier it is to compress, and the less it pushes back.)

Like many other stylus parameters, the compliance is balanced with other aspects of the system. In this case it is balanced with the effective mass of the tonearm (which includes the tracking force),(1) resulting in a resonant frequency. If that frequency is too high, then it can be audible as a tone that is “singing along” with the music. If it’s too low, then in a worst-case situation, the stylus can jump out of the record groove.
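To get a feeling for where this resonance lands, the usual mass-on-a-spring relationship, $f = 1 / (2 \pi \sqrt{M C})$, can be used as a rough estimate. This formula is not given in the text above; it is the standard simple-harmonic approximation, and the mass and compliance values below are purely hypothetical numbers chosen for illustration.

```python
import math


def resonant_frequency_hz(effective_mass_g, compliance_mm_per_n):
    """Rough resonance of an effective mass on a compliant suspension.

    Standard mass-spring approximation: f = 1 / (2 * pi * sqrt(M * C)),
    with M in kg and C in m/N.
    """
    mass_kg = effective_mass_g / 1000.0
    compliance_m_per_n = compliance_mm_per_n / 1000.0
    return 1.0 / (2.0 * math.pi * math.sqrt(mass_kg * compliance_m_per_n))


# Hypothetical example: 10 g effective mass with a 20 mm/N suspension
print(round(resonant_frequency_hz(10.0, 20.0), 1))
```

Raising either the mass or the compliance lowers the resonant frequency, which is the balance described in the paragraph above.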

If a turntable is very poorly adjusted, then a high tracking force and a high stylus compliance (therefore, a “soft” spring) result in the entire assembly sinking down onto the record surface. However, a high compliance is necessary for low-frequency reproduction, so the maximum tracking force is, in part, set by the compliance of the stylus.

If you are comparing the specifications of different cartridges, it may be of interest to note that compliance is often expressed in one of five different units, depending on the source of the information:

“Compliance Unit” or “cu”
mm/N
millimetres of deflection per Newton of force
µm/mN
micrometres of deflection per thousandth of a Newton of force
x 10^-6 cm/dyn
hundredths of a micrometre of deflection per dyne of force
x 10^-6 cm / 10^-5 N
hundredths of a micrometre of deflection per hundred-thousandth of a Newton of force

Since

mm/N = 1000 µm / 1000 mN

and

1 dyne = 0.00001 Newton

then all five of these expressions are identical, so they can be interchanged freely. In other words:

20 CU

= 20 mm / N

= 20 µm / mN

= 20 x 10^-6 cm / dyn

= 20 x 10^-6 cm / 10^-5 N
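The equivalence of the units can be verified by converting each one to metres per Newton. This is just a sketch of the arithmetic; the dictionary keys are plain-text labels of my own choosing, and “cu” is left out because it is only a name for the same quantity.

```python
import math

# Each entry is (length unit in metres) / (force unit in Newtons).
TO_METRES_PER_NEWTON = {
    "mm/N": 1e-3 / 1.0,
    "um/mN": 1e-6 / 1e-3,
    "1e-6 cm/dyn": (1e-6 * 1e-2) / 1e-5,      # 1 dyn = 1e-5 N
    "1e-6 cm/1e-5 N": (1e-6 * 1e-2) / 1e-5,
}


def to_metres_per_newton(value, unit):
    """Convert a compliance value in one of the listed units to m/N."""
    return value * TO_METRES_PER_NEWTON[unit]


for unit in TO_METRES_PER_NEWTON:
    print(unit, round(to_metres_per_newton(20.0, unit), 6))
```

Every line prints the same value, 0.02 m/N, which is 20 mm/N.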

Footnotes

(1) On the Mechanics of Tonearms, Dick Pierce

Turntables and Vinyl: Part 6

Back to Part 5

Tip shape

The earliest styli were the needles that were used on 78 RPM gramophone players. These were typically made from steel wire that was tapered to a conical shape, and then the tip was rounded to a radius of about 150 µm, by tumbling them in an abrasive powder.(1) This rounded curve at the tip of the needle had a hemispherical form, and so styli with this shape are known as either conical or spherical.

The first styli made for “microgroove” LPs had the same basic shape as their steel predecessors, but were tipped with sapphire or diamond. The conical/spherical shape was a good choice due to the relative ease of manufacture, and a typical size of that spherical tip was about 36 µm in diameter. However, as recording techniques and equipment improved, it was realised that there are possible disadvantages to this design.

Remember that the side-to-side shape of the groove is a physical representation of the audio signal: the higher the frequency, the smaller the wave on the disc. However, since the disc has a constant speed of rotation, the speed of the stylus relative to the groove is dependent on how far away it is from the centre of the disc. The closer the stylus gets to the centre, the smaller the circumference, so the slower the groove speed.

If we look at a 12″ LP, the smallest allowable diameter for the modulated groove is about 120 mm, which gives us a circumference of about 377 mm (or 120 * π). The disc is rotating 33 1/3 times every minute which means that it is making 0.56 of a rotation per second. This, in turn, means that the stylus has a groove speed of 209 mm per second. If the audio signal is a 20,000 Hz tone at the end of the recording, then there must be 20,000 waves carved into every 209 mm on the disc, which means that each wave in the groove is about 0.011 mm or 11 µm long.
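The same calculation can be repeated for any radius and frequency with a short script. This is a sketch of the arithmetic in the paragraph above; the function names are mine, and the 146 mm outer radius is taken from the LP dimensions listed in Part 5.

```python
import math


def groove_speed_mm_s(radius_mm, rpm):
    """Linear speed of the groove passing under the stylus."""
    return 2.0 * math.pi * radius_mm * rpm / 60.0


def wavelength_um(radius_mm, rpm, freq_hz):
    """Length of one cycle of the signal as cut in the groove, in micrometres."""
    return 1000.0 * groove_speed_mm_s(radius_mm, rpm) / freq_hz


rpm = 100.0 / 3.0   # 33 1/3 RPM
for radius in (60.0, 146.0):
    print(radius,
          round(groove_speed_mm_s(radius, rpm), 1), "mm/s",
          round(wavelength_um(radius, rpm, 20000.0), 1), "um")
```

Without intermediate rounding, the innermost groove gives a speed of about 209.4 mm per second and a 20 kHz wavelength of roughly 10.5 µm, which is the "about 11 µm" quoted above. At the outer groove the same tone is more than twice as long.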

Figure 1: The relative speed of the stylus to the surface of the vinyl as it tracks from the outside to the inside radius of the record.

Figure 2: The wavelengths measured in the groove, as a function of the stylus’s distance to the centre of a disc. The shorter lines are for 45 RPM 7″ discs, the longer lines are for 33 1/3 RPM 12″ LPs.

However, now we have a problem. If the “wiggles” in the groove have a total wavelength of 11 µm, but the tip of the stylus has a diameter of about 36 µm, then the stylus will not be able to track the groove because it’s simply too big (just like the tires of your car do not sink into every small crack in the road). Figure 3 shows to-scale representations of a conical stylus with a diameter of 36 µm in a 70 µm-wide groove on the inside radius of a 33 1/3 RPM LP (60 mm from the centre of the disc), viewed from above. The red lines show the bottom of the groove and the black lines show the edge where the groove meets the surface of the disc. The blue lines show the point where the stylus meets the groove walls. The top plot is a 1 kHz sine wave and the bottom plot is a 20 kHz sine wave, both with a lateral modulation velocity of 70 mm/sec. Notice that the stylus is simply too big to accurately track the 20 kHz tone.

Figure 3: Scale representations of a conical stylus with a diameter of 36 µm in a 70 µm-wide groove on the inside radius of a 33 1/3 RPM LP, looking directly downwards into the groove. See the text for more information.

One simple solution was to “sharpen” the stylus; to make the diameter of the spherical tip smaller. However, this can cause two possible side effects. The first is that the tip will sink deeper into the groove, making it more difficult for it to move independently on the two audio channels. The second is that the point of contact between the stylus and the vinyl becomes smaller, which can result in more wear on the groove itself because the “footprint” of the tip is smaller. However, since the problem is in tracking the small wavelength of high-frequency signals, it is only necessary to reduce the diameter of the stylus in one dimension, thus making the stylus tip elliptical instead of conical. In this design, the tip of the stylus is wide, to sit across the groove, but narrow along the groove’s length, making it small enough to accurately track high frequencies. An example showing a stylus with radii of 0.2 mil x 0.7 mil (diameters of approximately 10 x 36 µm) is shown in Figure 4. Notice that this shape can track the 20 kHz tone more easily, while sitting at the same height in the groove as the conical stylus in Figure 3.

Figure 4: Scale representations of an elliptical stylus with diameters of 10 x 36 µm in a 70 µm-wide groove on the inside radius of a 33 1/3 RPM LP, looking directly downwards into the groove. See the text for more information.

Both the conical and the elliptical stylus designs have a common drawback in that the point of contact between the tip and the groove wall is extremely small. This can be seen in Figure 5, which shows various stylus shapes from the front. Notice the length of the contact between the red and black lines (the stylus and the groove wall). As a result, both the groove of the record and the stylus tip will wear over time, generally resulting in an increasing loss of high frequency output. This was particularly a problem when the CD-4 Quadradisc format was introduced, since it relies on signals as high as 45 kHz being played from the disc.(2) In order to solve this problem, a new stylus shape was invented by Norio Shibata at JVC in 1973. The idea behind this new design is that the sides of the stylus are shaped to follow a much larger-radius circle than is possible to fit into the groove, while the tip has a small radius like a conical stylus. An example showing this general concept can be seen on the right side of Figure 5.

Figure 5: Dimensions of example styli, drawn to scale. The figure on the left is typical for a 78 RPM steel needle. The four examples on the right show different examples of tip shapes. These are explained in more detail in the text. (For comparison, a typical diameter of a human hair is about 0.06 mm.)

There have been a number of different designs following Shibata’s general concept, with names such as MicroRidge (which has an interesting, almost blade-like shape “across” the groove), Fritz-Geiger, Van-den-Hul, and Optimized Contour Contact Line. Generally, these designs have come to be known as line contact (or contact line) styli, because the area of contact between the stylus and the groove wall is a vertical line rather than a single point.

In 1973, Bang and Olufsen started working on its own turntable that could play the new CD-4 Quadradisc format. This not only meant developing a new decoder with a 4-channel output, but also a stylus with a bandwidth reliably extending to approximately 45 kHz. This task was given to Villy Hansen, who was project manager for pickup development, despite being still relatively new to the company. Hansen proposed an improvement upon the Shibata grind (which was already commercially available by then) by making 4 facets instead of 2, resulting in a better shape for tracking the very high-frequency modulation. Although developed by Hansen, the new stylus became known as the “Pramanik diamond”, named after Subir K. Pramanik, who had started working as an engineer in Struer in 1971, but who had temporarily returned to India. The end result was a new pickup family that was initially launched with the top model, the MMC 6000.

Figure 6: An example of an elliptical stylus on the left vs. a line contact Pramanik grind on the right. Notice the difference in the area of contact between the styli and the groove walls.

Bonded vs. Nude

There is one small, but important point regarding a stylus’s construction. Although the tip of the stylus is almost always made of diamond today, in lower-cost units, that diamond tip is mounted or bonded to a titanium or steel pin which is, in turn, connected to the cantilever (the long “arm” that connects back to the cartridge housing). This bonded design is cheaper to manufacture, but it results in a high mass at the stylus tip, which means that it will not move easily at high frequencies.

Figure 7: Scale models (on two different scales) of different styli. The example on the left is bonded, the other four are nude.

In order to reduce mass, the steel pin is eliminated, and the entire stylus is made of diamond instead. This makes things more costly, but reduces the mass dramatically, so it is preferred if the goal is higher sound performance. This design is known as a nude stylus.

Footnotes

(1) See “The High-fidelity Phonograph Transducer” B.B. Bauer, JAES 1977 Vol 25, Number 10/11, Oct/Nov 1977
(2) The CD-4 format used a 30 kHz carrier tone that was frequency-modulated ±15 kHz. This means that the highest frequency that should be tracked by the stylus is 30 kHz + 15 kHz = 45 kHz.

On to Part 7

Turntables and Vinyl: Part 5

Back to Part 4

Before we go any further, we need to just collect a bunch of information about vinyl records.

General Information

Min groove depth: 0.001″ = 0.0254 mm = 25.4 µm
Max groove depth: 0.005″ = 0.127 mm = 127 µm

12″ LP’s

Outside modulation groove radius: 146 mm
Inside modulation groove radius: 60 mm
Total maximum modulation radius: 86 mm (3.4″)
Typical modulation radius: 76 mm (3″)

7″ 45 RPM

Outside modulation groove radius: 84 mm
Inside modulation groove radius: 54 mm
Total maximum modulation radius: 30 mm (1.2″)

Basic math

Pitch = (Running time x RPM) / Modulation Radius

Groove Width = (1000/LPI + 1) / 2

Peak Amplitude of Displacement = Peak Lateral Velocity / (2 π freq)

Examples

Pitch = (Running time x RPM) / Modulation Radius
Pitch = (20 minutes x 33.333) / 76 mm
Pitch = 8.8 lines per mm (LPm) = 223 LPI

Groove Width = (1000/LPI + 1) / 2
Groove Width = (1000/223 + 1) / 2
Groove Width = 2.74 mil = 2.74 x 10^-3 inches = 0.0696 mm

Peak Amplitude of Displacement = Peak Lateral Velocity / (2 π freq)
Peak Amplitude of Displacement = 70 mm/sec / (2 π 1000)
Peak Amplitude of Displacement = 0.011 mm
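
The three formulas above can be wrapped into small helper functions to try other running times or levels. This is only a sketch of the same arithmetic; the function names and the 25.4 mm-per-inch conversions are my additions. As in the worked example, the pitch is rounded to 223 LPI before the groove width is calculated.

```python
import math

MM_PER_INCH = 25.4


def pitch_lines_per_mm(running_time_min, rpm, modulation_radius_mm):
    """Pitch = (Running time x RPM) / Modulation Radius, in lines per mm."""
    return running_time_min * rpm / modulation_radius_mm


def groove_width_mil(lines_per_inch):
    """Groove Width = (1000/LPI + 1) / 2, in mil (thousandths of an inch)."""
    return (1000.0 / lines_per_inch + 1.0) / 2.0


def peak_displacement_mm(peak_velocity_mm_s, freq_hz):
    """Peak Amplitude of Displacement = Peak Lateral Velocity / (2 pi freq)."""
    return peak_velocity_mm_s / (2.0 * math.pi * freq_hz)


lpm = pitch_lines_per_mm(20.0, 100.0 / 3.0, 76.0)
lpi = round(lpm * MM_PER_INCH)                  # 223, as in the example
width_mil = groove_width_mil(lpi)
print(round(lpm, 1), lpi)                       # 8.8 223
print(round(width_mil, 2), round(width_mil * MM_PER_INCH / 1000.0, 4))
print(round(peak_displacement_mm(70.0, 1000.0), 3))   # 0.011
```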

Reference:
“Basic Disc Mastering” by Larry Boden (1981)