The recording of sound can be divided into two critical areas: techniques and equipment. In the techniques section, recording environment, microphone placement, and signal processing are covered. In the equipment section, microphones, pre-amplifiers, and recording devices and media are discussed.

Recording Environment - Much of the success of a speech recording depends on the recording environment and microphone placement. Ideally, speech recordings should take place in soundproof studios or labs. If those are not available, one should try to find a relatively quiet room with as little low-frequency noise as possible. Typical sources of low-frequency noise include 60-Hz hum from electrical equipment, heating and air-conditioning ducts, elevators, doors, water pipes, computer fans, and other mechanical systems in the building. If possible, those devices should be switched off during recording. The figure below illustrates the spectrum of typical ambient room noise. The low-frequency prominence of this kind of noise may interfere with acoustic analysis of the fundamental frequency and the first formant. It is also quite difficult to filter out such extraneous noise without removing information from the speech signal itself. High-frequency noise should also be avoided, though it can be more easily filtered out of the signal before analysis.

Spectrum of typical room noise. Note the prominence at around 600 Hz.

Microphone Placement - The placement of the microphone directly affects the intensity of the recorded signal as well as the signal-to-noise ratio (SNR). The inverse square law predicts a loss of approximately 6 dB per doubling of distance from the sound source. A typical handheld microphone is usually placed at a distance of 30 cm or so from the talker’s lips. Relative to a close placement (say, 4 cm), this represents a loss of about 18 dB and an increased possibility of noise leaking into the recording. For this reason, it is recommended to use a head-mounted condenser microphone (such as the AKG C410 or AKG C420) to maintain a close and constant distance to the source. Speech signals acquired this way are characterized by a high SNR and a broad range of intensity. It may also be useful to use a linear-phase high-pass rumble filter (60 Hz cutoff and 24 dB/octave attenuation), unless low-frequency components are expected in the signal.
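
The distance figures above can be checked with a short calculation. This is a minimal sketch that assumes an idealized point source in a free field (no room reflections); the function name is illustrative.

```python
import math

def level_drop_db(near_cm: float, far_cm: float) -> float:
    """Drop in sound level (dB) when moving the microphone from near_cm to far_cm.

    Assumes the inverse square law for a point source in a free field.
    """
    return 20.0 * math.log10(far_cm / near_cm)

# One doubling of distance costs about 6 dB.
print(round(level_drop_db(4, 8), 1))    # 6.0
# Handheld placement (30 cm) versus close placement (4 cm).
print(round(level_drop_db(4, 30), 1))   # 17.5
```

The 17.5 dB result agrees with the "about 18 dB" quoted above. In a real room the loss is usually smaller, because reflected sound adds to the direct sound.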

Signal Processing and Special Effects - Long-term preservation is one of the primary goals of oral history recordings. It is, therefore, crucial to obtain the cleanest, highest-fidelity signal at the very first stage of recording. One should try to capture a wide dynamic range (e.g., 96 dB in 16-bit digital recordings) and a fairly wide frequency response (e.g., 0-20,000 Hz). It is also recommended to avoid using too many special effects at this stage. The only exception may be a soft limiter to reduce the possibility of accidental signal overloading. Such "clean" recordings should be stored for preservation purposes. However, various audio delivery situations, such as radio broadcasts, web streaming, and CD, may require extra signal processing to make the audio sound better. Digital signal processing (DSP) effects are, therefore, best applied in the post-production process.
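
The 96 dB figure for 16-bit recordings follows from the number of quantization levels. The sketch below uses the common approximation of about 6 dB per bit (the ratio of full scale to one quantization step). The 24-bit row is added only for comparison and is not taken from the text above.

```python
import math

def dynamic_range_db(bits: int) -> float:
    """Approximate dynamic range of linear PCM: 20*log10(2**bits)."""
    return 20.0 * math.log10(2 ** bits)

for bits in (16, 24):
    print(bits, "bits:", round(dynamic_range_db(bits), 1), "dB")
# 16 bits: 96.3 dB
# 24 bits: 144.5 dB
```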

There are 4 basic types of effects (filters) that can be applied to speech recordings:

Equalization (EQ) - Equalization is the selective amplification or reduction of a signal based on frequency. Audio signals consist of combinations of fundamental signals and their harmonics. Changes to the spectral balance of a signal involve altering the relationship of the fundamental to its harmonics. Each harmonic makes up one aspect of the audible character of a signal. Knowing these relationships allows you to quickly zero in on the correct frequency range of the signal and apply boost or cut to enhance or correct what you are hearing.

Compressor - The effect of a compressor is to make loud parts of a signal softer and to make very soft parts louder. Compression works particularly well in radio broadcasting and streaming audio where very soft passages may be lost due to the background noise in the listening environment and the data compression algorithm used in the streaming audio format.
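
A compressor's behavior is often summarized by a static input/output curve. The following is a minimal sketch of such a curve, not any particular product's algorithm. The threshold, ratio, and make-up gain values are assumptions chosen for illustration, and attack and release timing are ignored.

```python
def compress_db(level_db: float,
                threshold_db: float = -20.0,
                ratio: float = 4.0,
                makeup_db: float = 6.0) -> float:
    """Map an input level (dB) to an output level (dB) on a static compressor curve."""
    if level_db > threshold_db:
        # Above the threshold, `ratio` dB of input change gives 1 dB of output change.
        level_db = threshold_db + (level_db - threshold_db) / ratio
    # Make-up gain raises everything, which is what lifts the soft passages.
    return level_db + makeup_db

for level in (-40.0, -20.0, -4.0):
    print(level, "->", compress_db(level))
# -40.0 -> -34.0   (soft passage made louder)
# -20.0 -> -14.0
# -4.0 -> -10.0    (loud passage made softer)
```

With these settings a 36 dB input range (-40 to -4 dB) is squeezed into a 24 dB output range (-34 to -10 dB).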

De-esser - The de-esser is a dynamic range controller specially designed to regulate high-frequency content. The de-essing technique was developed for motion picture dialogue recording, where speech sounded more natural and pleasing once sibilants (in words such as church, test, and zest) were reduced. By sensing and limiting certain selected frequencies, the de-esser provides specific control over some of the higher-frequency vocal sounds, which may become overemphasized when the speaker or vocalist is close to the microphone.

Reverb - Reverb (reverberation) is an effect that simulates natural room reverberation. It can be defined as the sound that remains after the sound source has stopped. The reverberation time is defined as the time it takes for the sound level to decay by 60 dB, that is, for the sound intensity to fall to one-millionth of its initial value. For good speech intelligibility, too much reverberation is a hindrance and can be considered noise, although some reverberation may add a bit of character and presence to one's voice.
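
As a rough check on the one-millionth figure: a 60 dB decay, the conventional basis for reverberation time, is a factor of one million in sound intensity but only one thousand in sound pressure. A minimal sketch of the conversion follows; the function names are illustrative.

```python
import math

def intensity_ratio_to_db(ratio: float) -> float:
    """Convert a ratio of sound intensities (powers) to decibels."""
    return 10.0 * math.log10(ratio)

def pressure_ratio_to_db(ratio: float) -> float:
    """Convert a ratio of sound pressures (amplitudes) to decibels."""
    return 20.0 * math.log10(ratio)

# Both ratios describe the same decay of about 60 dB.
print(round(intensity_ratio_to_db(1e-6)))
print(round(pressure_ratio_to_db(1e-3)))
```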

Microphones - Two basic microphone types are most commonly used for recording speech. The dynamic microphone has a diaphragm made of Mylar plastic with a finely wrapped coil of wire (the so-called “voice coil”) attached to its inner face. This coil is suspended within a strong magnetic field. Whenever a sound wave hits the diaphragm, the coil is displaced in proportion to the amplitude of the wave, causing it to cut across the lines of magnetic flux supplied by the permanent magnet and inducing a voltage in the coil. Since the combined mass of the diaphragm and the coil is large relative to the small forces exerted by the pressure changes in a sound wave, the dynamic microphone may not respond well to sharp transient sounds and may fail to record minute changes in voice intensity. This does not mean that dynamic microphones should not be used for speech recording. On the contrary, there are several low-cost, rugged microphones, such as the Shure SM58 or Shure SM48 (once recommended by Kay Elemetrics), that can be used successfully for many speech recording applications.

The Shure SM58 Dynamic Microphone. http://www.shure.com

Marantz PMD222 analog cassette field recorder. http://www.marantz.com

Dynamic microphones do not require any external power supply, which makes them a good match for several field recorders (e.g., Marantz PMD 222). The condenser microphone works on a different principle. A thin plastic diaphragm coated on one side with gold or nickel is placed at a close distance from a stationary backplate. Once a polarizing voltage (from a 48 V phantom power supply) is applied to these plates, the two surfaces create a capacitance that varies as the diaphragm moves in response to a sound wave. Since the diaphragm is very light, the response of the condenser microphone is very accurate, often producing a recording that is extremely rich in both frequency and dynamic response.

Condenser microphones are usually more fragile than dynamic microphones and require a 48 V phantom power supply. Some of them can use battery packs, but some rely on an external power source. This makes the condenser microphone a bit more cumbersome to use in the field, though several DAT and HDD recorders have an on-board phantom power supply.

Frequency Response - Each microphone type has a unique frequency response. It is important to remember that many manufacturers tailor a microphone’s frequency response to accentuate particular parts of the spectrum so that it functions optimally within a specific recording application. For the purposes of acoustic analysis, one should always choose microphones with a wide and flat frequency response curve.

Frequency response of a Shure Beta 87a microphone. Note the effect of the roll-off filter at 10 mm.

Cardioid polar pattern of a Shure Beta 87a microphone

Polar Patterns - The microphone’s polar pattern should play a crucial role in choosing a microphone for a specific recording application. The polar pattern is a plot of the sensitivity of a microphone as a function of the angle around that device. There are several common polar pattern types used in microphones today. The omnidirectional microphone records sound equally from all directions. Such microphones are most commonly used as built-in or lavalier types. They seem to be very good for recording interviews, though their 360-degree pick-up range introduces too much noise to the signal for it to be used reliably in acoustic analysis.

The cardioid (heart-shaped) pattern is most sensitive to sounds coming from the front. It is 6 dB less sensitive to sounds arriving from 90 degrees to the sides and, in theory, is completely insensitive to sounds coming from the rear. The most important attribute of a cardioid (directional) microphone is its ability to discriminate between direct sounds (coming from the direction in which it is pointed) and reverberant, unwanted sounds from all other directions. This type of polar pattern usually produces signals that are substantially less noisy than those captured with an omnidirectional microphone.
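
The 6 dB figure at 90 degrees and the null at the rear both follow from the textbook formula for an ideal cardioid, sensitivity = (1 + cos θ) / 2. The sketch below evaluates that idealized formula. Real microphones deviate from it, especially at the rear and at high frequencies, and the function name is illustrative.

```python
import math

def cardioid_db(angle_deg: float) -> float:
    """Sensitivity of an ideal cardioid relative to on-axis (0 degrees), in dB."""
    gain = 0.5 * (1.0 + math.cos(math.radians(angle_deg)))
    if gain < 1e-12:
        # Theoretical null at the rear of the microphone.
        return float("-inf")
    return 20.0 * math.log10(gain)

for angle in (0, 90, 180):
    print(f"{angle:3d} degrees: {cardioid_db(angle):.1f} dB")
#   0 degrees: 0.0 dB
#  90 degrees: -6.0 dB
# 180 degrees: -inf dB
```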

Proximity Effect - Usually, high-quality speech recordings require the sound source to be fairly close to the microphone’s diaphragm. This may trigger the so-called proximity effect: an increase in the low-frequency sensitivity of a microphone when the sound source is close to it. This is particularly true of cardioid (directional) microphones. To counter it, most high-end directional microphones use a low-frequency roll-off filter to restore the response to its flat, natural balance. Some microphones have a user-selectable switch to control the filter. The proximity effect may be responsible for speech spectra showing emphasis in the low-frequency range, around the first and second harmonics.

Cabling and Phantom Power - Cabling is a common cause of recording problems. In order to avoid noise (60 Hz hum) and phase problems, it is recommended to use professional-quality balanced XLR cables (two conductors for the signal, with neither connected to the shield) to connect the microphone to the pre-amplifier. If the pre-amplifier does not have balanced XLR inputs, one should use a balanced-to-unbalanced transformer. The transformer’s primary side matches the impedance of the microphone and is balanced, while the secondary side is unbalanced and has a high impedance that matches most unbalanced pre-amplifier inputs. Condenser microphones require 48 V phantom power. While some microphones, such as the AKG C1000, can work from an internal battery, others require an external phantom power supply, which many good pre-amplifiers and mixing consoles provide.

Pre-Amplifiers - The main function of a pre-amplifier is to accept a very low-level signal (such as that from a microphone) and amplify it without adding noise. Good pre-amplifiers are not easy to build, as they have to be immune to all kinds of potential noise and signal distortion. It is, therefore, important to use the best possible pre-amplifier, particularly one that has a fairly high gain, broad dynamic range, high SNR, phantom power, and balanced XLR inputs. Good field pre-amplifiers are particularly hard to find. Sound Devices MP2, USB Pre, and Shure FP 23 are among the best, yet affordable, field pre-amplifiers. The gain on the pre-amplifier should be set relatively high, but the speaker’s voice amplitude range should be tested first to avoid signal overload. If intensity measurements are not intended, a soft compressor-limiter may be used to maximize amplitude and help prevent signal clipping.
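
To illustrate why a soft limiter protects against clipping, and why it should be avoided when intensity is to be measured, here is a minimal sketch of one possible soft-limiting curve. It is not the circuit of any specific pre-amplifier: the tanh shape, the 0.8 threshold, and the convention that full scale is 1.0 are all assumptions made for illustration.

```python
import math

def soft_limit(sample: float, threshold: float = 0.8) -> float:
    """Leave samples below the threshold untouched; smoothly squash the rest.

    Samples are assumed to be normalized so that full scale is +/-1.0.
    """
    magnitude = abs(sample)
    if magnitude <= threshold:
        return sample
    headroom = 1.0 - threshold
    # tanh bends the excess so the output approaches full scale without exceeding it.
    limited = threshold + headroom * math.tanh((magnitude - threshold) / headroom)
    return math.copysign(limited, sample)

for s in (0.5, 0.9, 1.5, -1.5):
    print(s, "->", round(soft_limit(s), 3))
```

Peaks above the threshold are compressed rather than chopped off, so the waveform stays within range, but the recorded amplitudes of those peaks no longer reflect the true intensity of the voice.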
