Physical Modeling Synthesis
Physical Modeling (PhM) Synthesis starts from mathematical models of the acoustics of instrumental sound production, and attempts to create realistic sounds through these models. This approach is also referred to as synthesis by rule, synthesis from first principles, and virtual acoustics. PhM can be used to create realistic models of real instruments, or it can create novel instruments with changing geometries, such as a cello that can expand and shrink over the course of a phrase, or a gong much larger than could be created with metal. PhM excels at simulating transitions between notes and timbres, and it also captures accidents that occur during performance, such as squeaks, mode locking, and multiphonics.
Many efficient algorithms have been developed for PhM based on delay lines, filters, and table lookups, but efficiency usually comes at the expense of drastic simplifications. The result is more often "instrument-like" tones than realistic instrument tones, although the Native Instruments B4 is a pretty darn good simulation of a Hammond B3.
The concepts, terminology, and formulas used in PhM synthesis can be traced to nineteenth-century scientific treatises on the nature of sound, such as Lord Rayleigh's The Theory of Sound (1894), which detailed the principles of vibrating systems. Analog circuit models were built following the invention of the vacuum tube, but progress was generally slow until the computer era.
John Kelly and Carol Lochbaum at (no surprise) Bell Labs were pioneers in adapting a physical model of the human vocal tract to a digital computer. Their rendition of Bicycle Built for Two appeared on Music from Mathematics produced by Max Mathews in 1960, and became a symbol for the increasing capabilities of digital computers. 2001: A Space Odyssey makes reference to this song: HAL sings it as he regresses while being shut down.
Interest in applying waveguides to synthesis was provoked by the Karplus-Strong plucked-string algorithm.
Excitation and Resonance
A fundamental principle of PhM synthesis is the interaction between an exciter and a resonator. Excitation is an action that causes vibration, and resonance is the response of the body of an instrument to the excitation vibration. The body acts as a time-varying filter applied to the excitation signal.
In general the exciter has a nonlinear behavior, and the resonator has a linear behavior. By linear, we mean that the system responds proportionally to the amount of energy applied to it: if we put two signals into the system together, we expect the output to be the sum of the outputs that each signal would produce on its own. By nonlinear, we mean a system that has built-in thresholds that, if exceeded, cause the system to respond in a new way, as if a switch had been thrown.
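The difference between the two behaviors can be checked numerically. The sketch below is only an illustration: `one_pole` stands in for a linear resonator and `hard_clip` for a threshold-type nonlinearity. Both functions, their names, and their coefficients are assumptions made for this example, not models taken from the text.

```python
def one_pole(signal, a=0.9):
    """Linear one-pole lowpass: y[n] = (1 - a) * x[n] + a * y[n-1]."""
    out, y = [], 0.0
    for x in signal:
        y = (1.0 - a) * x + a * y
        out.append(y)
    return out

def hard_clip(signal, threshold=0.5):
    """Nonlinear: proportional response until the threshold is exceeded."""
    return [max(-threshold, min(threshold, x)) for x in signal]

def obeys_superposition(system, x1, x2, tol=1e-12):
    """True if system(x1 + x2) equals system(x1) + system(x2)."""
    mixed = system([a + b for a, b in zip(x1, x2)])
    summed = [a + b for a, b in zip(system(x1), system(x2))]
    return all(abs(m - s) <= tol for m, s in zip(mixed, summed))

x1 = [0.4, -0.2, 0.3, 0.0]
x2 = [0.3, 0.1, -0.4, 0.2]
linear_ok = obeys_superposition(one_pole, x1, x2)   # filter passes the test
clip_ok = obeys_superposition(hard_clip, x1, x2)    # clipper fails it
```

The clipper fails because the first samples sum to 0.7, which crosses the threshold even though neither input does on its own.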
Exciter/resonator interactions can be either decoupled (feedforward) or coupled (feedback). For instance, in subtractive synthesis, a source signal is injected into a resonant filter, but there is no interaction between the source and the resonator other than the transfer of energy from the exciter to the resonator. In a saxophone, however, the vibration of the reed is strongly influenced by the resonance of the instrument. PhM can model the interaction between excitation and resonance, and thus creates a sense of gesture behind the emission of sound.
Classical Physical Modeling employs these steps:
Physical dimensions and constants of vibrating objects are defined, such as their mass and elasticity.
Boundary conditions are defined. These are limiting values of variables that cannot be exceeded.
Initial state is specified. For example, the starting position of a string at rest.
Excitation is described algorithmically as a force impinging on the vibrating object in some way. Coupling between exciter and resonator can be specified in this algorithm.
Impedance must be accounted for. Impedance is the resistance to a driving force.
Filtering due to friction and sound radiation patterns are specified.
A wave equation combines all of these factors, and is subjected to initial conditions and excitation. The equation is then solved repeatedly, which generates successive samples representing a sound pressure wave at a given instant in time.
Difference equations are useful for describing the laws of change of physical quantities. In modeling vibrational behavior of physical objects, the first step is finding the smallest number of variables that can accurately describe the state of a modeled phenomenon. FIR and IIR filter equations are familiar examples of difference equations; in physical modeling synthesis, a difference equation is evaluated over and over, once per sample, to generate the sound.
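As a rough illustration of the steps above, the following sketch evaluates a difference equation for a single damped mass on a spring. It assumes the continuous law m*x'' = -k*x - c*x' and replaces the derivatives with finite differences; the function name, the parameter names, and the default values are all illustrative choices, not part of the text.

```python
import math

def mass_spring(freq_hz=440.0, damping=5.0, n_samples=44100, sample_rate=44100):
    """Damped mass-spring oscillator solved sample by sample.

    x[n+1] = 2*x[n] - x[n-1] - (k/m)*dt^2*x[n] - (c/m)*dt*(x[n] - x[n-1])
    """
    dt = 1.0 / sample_rate
    stiffness_over_mass = (2.0 * math.pi * freq_hz) ** 2   # k / m (physical constants)
    friction_over_mass = 2.0 * damping                      # c / m (loss due to friction)
    prev = 1.0   # x[n-1]: initial state, mass displaced
    cur = 1.0    # x[n]:   and momentarily at rest
    out = []
    for _ in range(n_samples):
        out.append(cur)
        spring_term = stiffness_over_mass * dt * dt * cur
        friction_term = friction_over_mass * dt * (cur - prev)
        nxt = 2.0 * cur - prev - spring_term - friction_term
        prev, cur = cur, nxt
    return out

samples = mass_spring()
```

Each pass through the loop produces one output sample, so one second of sound at this sample rate means solving the equation 44100 times.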
Mass-spring Paradigm for Vibrating Strings
Strings are modeled as a series of discrete masses connected by a series of springs. This model is used to describe vibrating objects and the waves they emit. Two essential qualities of the vibrating media are density and elasticity. If we create a disturbance in one part of a string by plucking it, the displaced parts of the medium exert forces on adjacent parts, causing them to move away from their equilibrium position. This continues on in a process called wave propagation. Because of the mass of the medium, the parts do not move instantly away from their equilibrium positions, so the pluck impulse propagates through the medium at a specific speed.
This mass-spring representation can be extended to vibrating surfaces and volumes, which are modeled as a fabric or lattice of masses connected by more than one spring.
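A one-dimensional version of the mass-spring string can be sketched as a chain of point masses with both ends held fixed. This is a minimal sketch under stated assumptions: `coupling` stands for (k/m)*dt^2 and must not exceed 1 for the scheme to stay stable, the string is released from rest in a triangular pluck shape, and there are no losses. All names are hypothetical.

```python
def pluck_shape(n_masses, pluck_index, height=1.0):
    """Triangular initial displacement with both end masses at zero."""
    last = n_masses - 1
    shape = []
    for i in range(n_masses):
        if i <= pluck_index:
            shape.append(height * i / pluck_index)
        else:
            shape.append(height * (last - i) / (last - pluck_index))
    return shape

def simulate_string(n_masses=50, pluck_index=10, steps=98, coupling=1.0, pickup=25):
    """Return (pickup signal, final displacement of every mass)."""
    prev = pluck_shape(n_masses, pluck_index)
    # First step for a string released from rest (zero initial velocity).
    cur = [0.0] * n_masses
    for i in range(1, n_masses - 1):
        neighbors = prev[i + 1] - 2.0 * prev[i] + prev[i - 1]
        cur[i] = prev[i] + 0.5 * coupling * neighbors
    output = [prev[pickup], cur[pickup]]
    for _ in range(steps - 1):
        nxt = [0.0] * n_masses          # end masses stay clamped at zero
        for i in range(1, n_masses - 1):
            # Net spring force from the two adjacent masses.
            neighbors = cur[i + 1] - 2.0 * cur[i] + cur[i - 1]
            nxt[i] = 2.0 * cur[i] - prev[i] + coupling * neighbors
        prev, cur = cur, nxt
        output.append(cur[pickup])
    return output, cur

signal, final_state = simulate_string()
```

With `coupling` set to 1.0 the disturbance moves exactly one mass per time step, so after 2 * (n_masses - 1) steps the wave has made a full round trip and the string is back in its plucked shape.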
Modal Synthesis is an alternative to the mass-spring paradigm. It starts from the premise that a sound-producing object can be represented as a collection of vibrating substructures, such as violin bridges and bodies, acoustic tubes, bells, drum heads, etc. The total number of these substructures is usually much smaller than the number of elements needed in the mass-spring approach. Each substructure has a set of modes of vibration, which are specific to a particular structure. The instantaneous vibration of an instrument can be expressed as the sum of the contributions of its modes.
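The last point, vibration as a sum of modal contributions, can be sketched by treating each mode as an exponentially damped sinusoid. This is a simplification chosen for illustration: the mode list below is made up, and a real modal system would also describe how an excitation couples into each mode.

```python
import math

def modal_tone(modes, duration, sample_rate=44100):
    """Sum the contributions of a set of modes.

    Each mode is a tuple (frequency_hz, amplitude, decay_per_second).
    """
    n_samples = int(round(duration * sample_rate))
    out = []
    for n in range(n_samples):
        t = n / sample_rate
        sample = 0.0
        for freq, amp, decay in modes:
            sample += amp * math.exp(-decay * t) * math.sin(2.0 * math.pi * freq * t)
        out.append(sample)
    return out

# Hypothetical, loosely bell-like mode set (illustrative values only).
bell_modes = [(220.0, 0.5, 1.5), (447.0, 0.3, 2.5), (683.0, 0.2, 4.0)]
tone = modal_tone(bell_modes, duration=1.0)
```

Because the higher modes are given faster decay rates here, the tone loses its upper partials first.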
McIntyre, Schumacher, and Woodhouse Synthesis (MSW)
Yet another approach to physical modeling, which is a highly simplified model of the mechanics of instrumental sound production, stresses the importance of the time-domain behavior of tones. MSW tone production can be divided into two parts: nonlinear excitation and linear resonance. For instance, with a clarinet, the nonlinear excitation is caused by blowing into the mouthpiece, where the reed acts like a switch, alternately opening and closing to allow the flow of air into the resonating tube. The flow of air creates pressure in the mouthpiece that closes the reed. This gives the air a chance to escape from the mouthpiece into the instrument, and the reed opens the mouthpiece again. With this action, the reed converts a steady flow of air into a series of puffs. The frequency of these puffs is determined by the length of the instrument, which is varied by opening and closing tone holes so that the waves in the bore resonate at the pitches played by the instrument. This interaction between the excitation and resonance constitutes a type of feedback, which the MSW model accounts for.
With a violin, the friction of the bow captures the string for a brief interval, until the string slips and is released again, only to be captured once again, and so on.
The excitation is a nonlinear switching mechanism that sends a sharp impulse wave into the linear resonator part of the instrument.
Waveguides are an efficient implementation of Physical Modeling Synthesis that has been used commercially by Yamaha and Korg. A waveguide is a computational model of a medium along which waves travel. A basic waveguide building block is a pair of digital delay lines. Each delay line is injected with an excitation wave, which reflects back to the center when it reaches the end of the line.
Waveguides model plucked strings as two waves traveling in opposite directions from an impact point. When they reach the bridges, some of their energy is absorbed, and some of it is reflected back in the opposite direction, where the two waves will collide, causing resonance and interference. The same concept of nonlinear excitation waves injected into delay lines can be used to model wind instruments as well.
A waveguide model of a violin includes a bidirectional delay line, and filters that model the bridge termination, and reflection on either end of the string. The Karplus-Strong method is sort of like a string that's only attached at one end.
A waveguide "pluck" models the action of plucking a string. This starts off with a broad spectrum, which is filtered toward a sine wave at a specific pitch.
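A plucked waveguide string can be sketched with a pair of delay lines, one per travel direction. The details below are assumptions made for illustration: both lines start with half of a triangular pluck shape, the nut reflects with a simple sign inversion, and the bridge reflection is an inverting two-point average scaled by a `loss` factor. The function and parameter names are hypothetical.

```python
from collections import deque

def waveguide_pluck(delay_len=50, pluck_pos=12, pickup=12, n_samples=20000, loss=0.99):
    """Two-delay-line string; index 0 is the nut, index -1 is the bridge."""
    last = delay_len - 1
    shape = [i / pluck_pos if i <= pluck_pos else (last - i) / (last - pluck_pos)
             for i in range(delay_len)]
    right = deque(0.5 * s for s in shape)   # wave traveling toward the bridge
    left = deque(0.5 * s for s in shape)    # wave traveling toward the nut
    out = []
    prev_bridge = 0.0
    for _ in range(n_samples):
        # Physical displacement is the sum of the two traveling waves.
        out.append(right[pickup] + left[pickup])
        at_bridge = right.pop()             # right-going sample reaches the bridge
        at_nut = left.popleft()             # left-going sample reaches the nut
        # Bridge: invert, absorb a little energy, and lowpass (two-point average).
        left.append(-loss * 0.5 * (at_bridge + prev_bridge))
        prev_bridge = at_bridge
        # Nut: plain inverting reflection.
        right.appendleft(-at_nut)
    return out

string_tone = waveguide_pluck()
```

The averaging at the bridge removes high frequencies a little on every round trip, which is what moves the initially broad spectrum toward a sine wave.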
Waveguide Bowed String:
Bowing of a waveguide string instrument models the interaction a bow has with a string. The hair of the bow provides enough resistance to alternate between grabbing a string and slipping, which causes the string's vibration. This is modeled with a unit delay inserted into the waveguide string model.
A waveguide model of a clarinet includes an input for the mouth, and a model of the reed, the bore, and the bell of the instrument. This type of model produces several realistic features, including the generation of harmonics according to input amplitude and possible instrument squeaking.
Parameter Analysis for Physical Modeling
Developing a physical model of an instrument is only a small part of physical modeling synthesis. It is also important to develop a physical model of a player, along with parameters which can be used for each. Parameter estimation tries to characterize an incoming sound in terms of the parameter settings that would be necessary to approximate the sound in a given synthesis method. The usual method is to carry out trial-and-error experiments, but some sort of automatic compiler that could construct a virtual instrument out of any input sound would certainly be a useful thing to have.
Karplus-Strong Synthesis (KS)
The KS algorithm for plucked string and drum synthesis is an efficient technique based on the principle of a delay line or recirculating wavetable. The basic KS algorithm starts with a wavetable filled with random values. As values are read out of the table they are modified and written back in; the simplest modification is an averaging of the current sample with the previous sample. The audible result of this algorithm is a pitched sound that is bright at the outset but rapidly darkens toward a sine tone, much like a real plucked string. In practice the wavetable is reloaded with a new set of random values for each note. This gives each note a slightly different harmonic structure. When the wavetable is long, the instrument sounds like a snare drum; when it is short, it sounds like a brushed tom-tom. A resonant drum can be made by loading the wavetable with a constant value instead of random values. Since the algorithm is based around recirculating input, once the buffer is "filled" there is never any more external input into the system.
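A minimal sketch of the basic plucked-string case follows. It implements only the averaging modification described above; the function name, the `seed` parameter, and the choice of uniform noise in the range -1 to 1 are assumptions for this example.

```python
import random

def karplus_strong(frequency, duration, sample_rate=44100, seed=None):
    """Basic Karplus-Strong pluck: noise-filled table plus two-point averaging."""
    rng = random.Random(seed)
    table_len = max(2, int(round(sample_rate / frequency)))
    # A fresh noise burst for every note gives each one a different spectrum.
    table = [rng.uniform(-1.0, 1.0) for _ in range(table_len)]
    n_samples = int(round(duration * sample_rate))
    out = []
    prev = 0.0
    idx = 0
    for _ in range(n_samples):
        cur = table[idx]
        out.append(cur)
        # Write the average of the current and previous samples back into the table.
        table[idx] = 0.5 * (cur + prev)
        prev = cur
        idx = (idx + 1) % table_len
    return out

note = karplus_strong(220.0, 1.0, seed=1)
```

After the table has been filled with noise, the loop takes no further input: everything heard is the original burst recirculating through the averaging step.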
Some advantages of Karplus-Strong Synthesis are that it is very computationally efficient, relatively realistic, and creates a situation where every note played is slightly different. One major disadvantage, though, is that pitches depend on the sampling rate. Without some form of interpolation, the only frequencies that can be produced are those obtained by dividing the sample rate by an integer.
Karplus-Strong is so realistic because it is a basic physical model of a real string. When a string is plucked, the wave motion initially has a very broad spectrum, but only the vibrations which move in integer divisions of the string length persist. It takes a little while for the system to get to that state; the lowpass filter in KS synthesis models the gradual loss of the other vibrations at the objects which terminate the ends of the string.
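The size of the tuning problem can be estimated with a few lines of arithmetic. The sketch below follows the simplification in the text (pitch equals sample rate divided by an integer table length) and ignores the extra half-sample of delay that the averaging step adds in the standard algorithm. The function name is hypothetical.

```python
import math

def nearest_ks_pitch(target_hz, sample_rate=44100):
    """Return (table_len, actual_hz, error_cents) for an integer table length."""
    table_len = max(2, int(round(sample_rate / target_hz)))
    actual_hz = sample_rate / table_len
    error_cents = 1200.0 * math.log2(actual_hz / target_hz)
    return table_len, actual_hz, error_cents

low = nearest_ks_pitch(440.0)    # table of 100 samples -> 441 Hz
high = nearest_ks_pitch(3000.0)  # table of 15 samples  -> 2940 Hz
```

Under these assumptions a 440 Hz request at 44.1 kHz comes out about 4 cents sharp, while a 3000 Hz request comes out about 35 cents flat: the error grows as the table gets shorter.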
A formant is a peak of energy in a spectrum, which can include harmonic and inharmonic partials as well as noise. These formant peaks are especially common in vowels spoken by the human voice, and in tones radiated by many musical instruments. Formant regions tend to stay in place as the frequency of the fundamental changes, and so serve as a kind of "spectral signature" of the source of many sounds. Three synthesis techniques that generate formants are: formant wave-function (FOF) synthesis, VOSIM, and window function (WF) synthesis. FOF and VOSIM are techniques originally designed to emulate the human voice, whereas WF was developed to emulate the formants in traditional musical instruments.
FOF, which stands for fonction d'onde formantique, is designed to model a large class of natural mechanisms that resonate when excited, but that are eventually damped by physical forces such as friction. An FOF departs from a traditional subtractive approach to formant synthesis, which uses a complicated filter to carve out most frequencies in the spectrum to produce formants. The FOF approach uses several bandpass filters in parallel to model a complicated spectrum envelope with several formant peaks. An alternative implementation of FOF replaces the filters with a bank of damped sine wave generators.
For synthesis, an FOF generator produces a damped sine wave grain of sound at each pitch period. Since each FOF grain lasts just a few milliseconds, the envelope of the FOF grain contributes audible sidebands around the sine wave, creating a formant. The result of summing several FOF generators is a spectrum with several formant peaks.
FOF generators are controlled by a number of parameters:
-the center frequency of the formant
-the formant bandwidth (the width between the points -6 dB from the peak of the formant)
-the peak amplitude of the formant
-the width of the formant skirt, which is the lower part of the formant peak, around 40 dB below the peak.
The duration of the FOF attack controls the width of the formant skirt, and the duration of the FOF decay determines the formant bandwidth. So, as the duration of the attack lengthens, the skirt width narrows, and a long decay translates into a sharp resonance peak.
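The relationship between these parameters and the grain can be sketched as follows. The envelope used here is the commonly cited FOF form (a raised-cosine attack followed by an exponential decay whose rate is pi times the bandwidth); treat it as an assumption about the details, since the text does not give the formula. The function names and the example values are illustrative.

```python
import math

def fof_grain(center_hz, bandwidth_hz, attack_s, duration_s, sample_rate=44100, amp=1.0):
    """One damped sine grain: attack sets the skirt, decay sets the bandwidth."""
    alpha = math.pi * bandwidth_hz      # decay rate derived from the bandwidth
    beta = math.pi / attack_s           # raised-cosine attack rate
    n = int(round(duration_s * sample_rate))
    grain = []
    for i in range(n):
        t = i / sample_rate
        env = math.exp(-alpha * t)
        if t < attack_s:
            env *= 0.5 * (1.0 - math.cos(beta * t))
        grain.append(amp * env * math.sin(2.0 * math.pi * center_hz * t))
    return grain

def fof_stream(fundamental_hz, n_samples, grain, sample_rate=44100):
    """Start one grain per pitch period and overlap-add the results."""
    out = [0.0] * n_samples
    period = sample_rate / fundamental_hz
    k = 0
    while True:
        start = int(round(k * period))
        if start >= n_samples:
            break
        for i, g in enumerate(grain):
            if start + i >= n_samples:
                break
            out[start + i] += g
        k += 1
    return out

grain = fof_grain(center_hz=600.0, bandwidth_hz=80.0, attack_s=0.003, duration_s=0.02)
voice = fof_stream(fundamental_hz=100.0, n_samples=4410, grain=grain)
```

A signal with several formants would be built by summing the outputs of several such streams, each with its own center frequency, bandwidth, and amplitude.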
The core idea of VOSIM is the generation of a repeating tone-burst signal, producing a strong formant component. VOSIM was originally used to model vowel sounds, but has been extended to model fricatives and quasi-instrumental tones.
The VOSIM waveform was derived by approximating the signal generated by the human voice. This approximation takes the form of a series of pulse trains, where each pulse in the train is the square of a sine function. We can calculate the period of the waveform as (N * T) + M, where N is the number of sin² pulses in series, T is the width of each pulse, and M is the variable-length delay that follows the pulses and contributes to the overall period of one pulse train.
Two characteristics of the VOSIM signal are a fundamental corresponding to the repetition frequency of the entire signal, and a formant peak in the spectrum corresponding to the pulse width of the sin² pulses. In order to create a sound with several formants, it is necessary to mix the outputs of several VOSIM oscillators.
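One period of such a waveform can be sketched directly from the (N * T) + M description. This example keeps every pulse at the same height and uses a 48 kHz sample rate so that the example values land on whole samples; both choices, along with the function name, are assumptions for illustration.

```python
import math

def vosim_period(n_pulses, pulse_width, delay, sample_rate=48000):
    """One VOSIM period: n_pulses sin^2 pulses of width T, then a silent delay M."""
    samples_per_pulse = int(round(pulse_width * sample_rate))
    delay_samples = int(round(delay * sample_rate))
    period = []
    for _ in range(n_pulses):
        for i in range(samples_per_pulse):
            period.append(math.sin(math.pi * i / samples_per_pulse) ** 2)
    period.extend([0.0] * delay_samples)
    return period

# N = 3 pulses, T = 1 ms, M = 2 ms: period of 5 ms.
one_period = vosim_period(n_pulses=3, pulse_width=0.001, delay=0.002)
fundamental_hz = 48000 / len(one_period)
```

With these values the period is 5 ms, so the fundamental is 200 Hz, while the 1 ms pulse width places the formant in the region of 1000 Hz.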
Window Function Synthesis:
WF synthesis is a multistage technique for formant synthesis which uses purely harmonic partials. The technique begins with a broadband harmonic signal (window-function pulse), which is fed into a "weighting" stage which emphasizes or attenuates different harmonics to create time-varying formant regions that emulate the spectra of traditional instruments. A window function exhibits a characteristic spectrum with a center lobe and side lobes. The center lobe is typically much higher in amplitude than the side lobes, meaning that the signal is in effect band-limited. The initial broadband pulse used in WF is created by linking together a periodic series of WF pulses separated by a period of zero amplitude called deadtime. Much like the other formant-synthesis techniques, the outputs of several generators are added together to create a single complex, time-varying spectrum.
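The first stage, the pulse train with deadtime, can be sketched as below. The text does not say which window is used, so a Blackman window stands in here as an assumption. The weighting stage is not shown, and the function names are hypothetical.

```python
import math

def blackman(length):
    """Blackman window, used here only as a stand-in window-function pulse."""
    return [0.42
            - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            + 0.08 * math.cos(4.0 * math.pi * n / (length - 1))
            for n in range(length)]

def wf_pulse_train(fundamental_hz, pulse_len, n_samples, sample_rate=48000):
    """Fixed-length window pulses separated by deadtime (zero amplitude)."""
    period = int(round(sample_rate / fundamental_hz))
    if pulse_len > period:
        raise ValueError("pulse must fit inside one pitch period")
    deadtime = period - pulse_len
    one_period = blackman(pulse_len) + [0.0] * deadtime
    return [one_period[i % period] for i in range(n_samples)]

train = wf_pulse_train(fundamental_hz=200.0, pulse_len=96, n_samples=480)
```

In this sketch the pulse length stays fixed and only the deadtime changes with the fundamental, so the shape of each pulse is the same at every pitch.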