Sampling “scat”

In this post, I describe the process and tools that I used to capture samples for my jazz scat voice. I will eventually release the voice (for the Yamaha PSR-S950 workstation) and its samples under the Creative Commons attribution license. I’m not the best singer, so I’ve had to rely on technology as much as possible while still producing a musical result. I want to emphasize that I sang, edited and produced all of the samples and the voice patch; it is original work.

The jazz scat voice is inspired by the (in)famous “jazz voice” patch found in Roland keyboards. The Roland patch is based on samples from the Spectrasonics Vocal Planet library by Eric Persing and Robby Duke. Their work was clearly influenced by Take 6 and other contemporary a cappella artists.

My patch uses four multi-samples where each multi-sample is a particular syllable taken over 12 (or so) pitches. The multi-samples cover the natural range of the human voice from F3 to F6 where C5 is middle C. The four syllables are: DOO, DOT, BOP and DOW. The DOOs are long, looped samples that provide a musical bed or harmony. The remaining three samples are short one shots suitable for melody, punctuation and accents. The DOW syllable falls in pitch.

The basic patch design is summarized in the following table.

Syllable  Type      Vel low  Vel high  Gain
DOO       Loop      1        89        0 dB
DOT       One shot  90       105       -3 dB
BOP       One shot  106      119       -6 dB
DOW       One shot  120      127       -9 dB

The table shows the MIDI velocity range assigned to each syllable (multi-sample). It also shows the relative gain for each syllable. The gain decreases as velocity increases in order to maintain a more consistent volume level as the keys are struck harder to trigger the one shots.
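To make the velocity switching concrete, here is a minimal Python sketch of the table. This is only an illustration of the idea, not how the PSR-S950 does it internally. The names `LAYERS`, `select_layer` and `db_to_linear` are made up for this example.

```python
# Velocity layers copied from the table above:
# (syllable, velocity low, velocity high, gain in dB)
LAYERS = [
    ("DOO", 1, 89, 0.0),
    ("DOT", 90, 105, -3.0),
    ("BOP", 106, 119, -6.0),
    ("DOW", 120, 127, -9.0),
]


def select_layer(velocity):
    """Return (syllable, gain_db) for a MIDI note-on velocity of 1..127."""
    if not 1 <= velocity <= 127:
        raise ValueError("note-on velocity must be in the range 1..127")
    for syllable, low, high, gain_db in LAYERS:
        if low <= velocity <= high:
            return syllable, gain_db
    raise AssertionError("the layers should cover every velocity")


def db_to_linear(gain_db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10 ** (gain_db / 20)


if __name__ == "__main__":
    for velocity in (64, 100, 110, 127):
        syllable, gain_db = select_layer(velocity)
        print(velocity, syllable, gain_db, round(db_to_linear(gain_db), 3))
```

A soft touch lands in the DOO loops, and each harder layer is attenuated a little more to offset the harder strike.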

At a strategic level, the sampling production process consists of two major steps:

  1. Capture a natural voice sample for each syllable and pitch. These natural voice samples are the formants to be used in the next step.
  2. Capture a vocoded sample for each syllable and pitch while playing the appropriate formant sample through the PSR-S950 vocoder.

This process produces scat syllable sounds that are consistent, pitch accurate and in the case of the DOO syllable, loopable.

Here’s a run-down of the practical problems that motivated this approach. My voice is an untrained baritone. It cannot possibly cover the F3 to F6 range without hysterical noise and possible voice damage. As I discovered, it is nearly impossible to sing pitch accurate short syllables such as these without proper training! I needed to find a method that would give me a consistent and pitch accurate sound across the desired range of pitches. This was a greater challenge than I originally anticipated and a lot of experimentation led to the two-step method. It took about three weeks to find the method and then a further two weeks of production work.

Now, the details.

I used a Roland Micro-BR digital recorder to capture both natural voice and vocoded samples. This little wonder is great: easy to use, fast and above all, quiet. For natural voice, I sang into a Shure PG-81 condenser microphone feeding an ART TUBE MP preamp. The TUBE MP is a real Swiss army knife, providing phantom power for the PG-81, a little bit of tube warmth, and conversion from XLR to a line level audio signal. The output of the TUBE MP is connected to the Micro-BR. For vocoded voice, I connected the line level mono output of the PSR-S950 to the Micro-BR. In both cases, all Micro-BR input effects are disabled and gain staging is established before hitting the RECORD button.

Formants are captured and produced in the following way. I sang each syllable multiple times at each of the desired pitches while recording to the Micro-BR. The pitches cover the F3 to F6 range such that no resulting final sample would be transposed up more than one semi-tone or down more than two semi-tones. Transposing up or down more than these limits negatively affects sound quality (obvious sample speed-up/slow-down). The entire sampling session is converted to WAV format and then transferred to a PC where Sony Sound Forge Audio Studio is used to review the sung syllables and to select the best one at each pitch. Each selected syllable is saved in its own WAV file. The selected syllables are tuned with Celemony Melodyne. The tuned syllables are the formants for the vocoding phase.
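The transposition limits can be turned into a rough plan of sampled pitches. The Python sketch below is an illustration only: the `plan_zones` helper is hypothetical, it assumes note numbers where C5 (middle C in this post) is 60, and it finds the fewest root pitches that honor the limits. It comes out at 10 roots, so sampling 12 or so pitches leaves some slack. The pitches I actually sang are not derived from this code.

```python
F = 5  # pitch class of F, counting semitones up from C


def note_number(pitch_class, octave):
    """Note number under this post's convention that C5 is middle C (60)."""
    return 12 * octave + pitch_class


def plan_zones(low, high, max_up=1, max_down=2):
    """Greedily cover low..high with (bottom key, root, top key) zones.

    Each root is played back at most max_down semitones below and
    max_up semitones above the pitch at which it was sampled.
    """
    zones = []
    key = low
    while key <= high:
        root = min(key + max_down, high)
        top = min(root + max_up, high)
        zones.append((key, root, top))
        key = top + 1
    return zones


def speed_ratio(semitones):
    """Playback speed factor for a transposition in equal temperament."""
    return 2 ** (semitones / 12)


if __name__ == "__main__":
    zones = plan_zones(note_number(F, 3), note_number(F, 6))
    for bottom, root, top in zones:
        print(bottom, root, top)
    print(len(zones), "root pitches")
    print(round(speed_ratio(1), 4), round(speed_ratio(-2), 4))
```

The speed ratios show why the limits are tight: one semi-tone up already plays the sample about 6% faster, and two semi-tones down plays it about 11% slower.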

Sony Sound Forge is a solid audio editor. I can work fast in Sound Forge and its “Copy new” function is ideal for cherry picking a recording session. In a few cases, I had to amplify a sample to compensate for low level. When singing across such a wide range of pitches, one needs to rely on electronics/software for amplification in order to avoid voice strain! For tuning, I used the trial version of Celemony Melodyne Single Track which installed with Sonar X3. Although the procedure to enable the trial period was wonky, Melodyne is a great tool and I will very likely buy a copy.

In the second major production step, the formant syllables are sent to the PSR-S950 vocoder and vocoded syllables are recorded on the Micro-BR. The S950 vocoder is not a true synth vocoder. (The Motif/MOX and Tyros vocoders are true “synth” vocoders.) The S950 vocoder is part of its vocal harmony processor. Its “VocoderMONO” mode is designed to let (untrained) voices sing into a microphone and impose the formants onto a rather natural sounding, pitch accurate synthetic voice sound.

My early investigation found that the PSR-S950 vocoder needs clean formants that are near the desired final pitch. By clean, I mean formants that do not overdrive the vocoder input and are relatively free of the (un)natural gurgles and what not in the sounds made by the human vocal system. (Well, my vocal system anyway.) The first major step in the overall process let me select the cleanest formants. However, attempts to sing outside one’s natural vocal range introduce gurgles and rasps at the low end and off-pitch histrionics and screeches at the high end. The first major process step chooses the cleanest formants and tunes them to the desired pitches.

I loaded the formant samples into a Roland RD-300GX piano as an Audio Key set. Each formant sample is assigned to a particular key and is played by the RD-300GX when the key is struck. Basically, this arrangement gives me a simple one-shot playback engine. The output of the RD-300GX is connected to the microphone/line input of the PSR-S950 in order to drive the vocoder. The mono output of the PSR-S950 is connected to the Micro-BR.

Once everything is connected and levels are set, a little trial and error is needed to find the best formant at each desired vocoder pitch. Think of this as a dry rehearsal for the final recording. Frequently, the formant at the same desired vocoder pitch is the best choice for the vocoded sample. However, sometimes one of the nearby formants is better or produces a more consistent timbre or articulation across the multi-sample. This involves a lot of critical listening and A/B comparison, producing a list of formant and pitch pairs.

Then, it’s time to hit RECORD and capture the vocoded samples by playing the desired pitch on the S950 keyboard and playing the corresponding desired formant on the RD-300GX. Once again, the recording session is converted to WAV format, is transferred to the PC, and is separated into individual WAV files.

At this point, the DOT, BOP and DOW one shot samples are pretty much complete. The DOO samples need to be looped. For some zany reason, Sony Sound Forge Audio Studio saves loop points in Acid metadata within a WAV file. The Yamaha voice editor does not pick up this information. After searching the Web, I discovered that loop info within a WAV file is not really standardized. Given that the target tool is from Yamaha, I decided to use Yamaha’s Tiny Wave Editor (TWE) to loop the vocoded DOO samples. This worked out pretty well as TWE’s crossfade looping eliminated some bad thumps without introducing artifacts. A lot of trial and error was still involved in choosing the loop points, however. TWE can be found for free on the Web, by the way.
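For the curious, here is a minimal Python sketch of what crossfade looping does in general. It is a generic linear crossfade and only my assumption about the technique; I do not know TWE's actual algorithm. The function name `crossfade_loop` and its parameters are invented for this example. The idea is to blend the last few samples of the loop with the audio just before the loop start, so the jump back to the loop start does not thump.

```python
def crossfade_loop(samples, loop_start, loop_end, fade_len):
    """Return a copy of samples whose loop [loop_start, loop_end) wraps smoothly.

    The last fade_len samples of the loop are faded out while the
    fade_len samples just before loop_start are faded in, so the loop
    ends on the sample that naturally precedes loop_start.
    """
    if fade_len <= 0 or fade_len > loop_start:
        raise ValueError("the fade must fit in the audio before loop_start")
    if loop_start + fade_len >= loop_end or loop_end > len(samples):
        raise ValueError("the fade must fit inside the loop")

    out = list(samples)
    for i in range(fade_len):
        t = (i + 1) / fade_len            # reaches 1.0 on the last sample
        tail = loop_end - fade_len + i    # position near the end of the loop
        lead = loop_start - fade_len + i  # matching position before the loop
        out[tail] = (1.0 - t) * samples[tail] + t * samples[lead]
    return out


if __name__ == "__main__":
    ramp = [float(n) for n in range(100)]  # stand-in for real audio data
    looped = crossfade_loop(ramp, loop_start=20, loop_end=80, fade_len=10)
    print("jump at the loop point before:", ramp[20] - ramp[79])
    print("jump at the loop point after: ", looped[20] - looped[79])
```

A crossfade only hides the seam. The loop points still need to sit on a steady part of the sound, which is where the trial and error comes in.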

The final production step is to bring all of the vocoded samples into the Yamaha Expansion Voice Editor (EVE) and produce the final voice as part of an S750/S950 expansion pack. I made five voice patches:

  1. DooLoops: DOO syllables over MIDI velocities 1 to 127
  2. GetLayeredUp: All syllables, velocity-switched
  3. DatStuff: DOT syllables over MIDI velocities 1 to 127
  4. BopOnPop: BOP syllables over MIDI velocities 1 to 127
  5. Dow2008: DOW syllables over MIDI velocities 1 to 127

The multi-samples are most easily tested and normalized individually. Plus, the DOO loops and other syllables are musically useful by themselves without velocity switching. I built the GetLayeredUp patch after testing the individual multi-samples and normalizing the volumes of the individual samples within. Choosing the patch names was really fun! (Apologies to George Clinton.)

The Yamaha Expansion Voice Editor is a trial version for which the trial period was, ahem, adjusted. Yamaha needs to just face facts and release an official version of EVE. Zillions of S750/S950 people are already using EVE and if Yamaha is somehow trying to protect its expansion pack franchise, well, that train done left the station a looooooooooong time ago. At this point, an official EVE would enhance the PSR product ecosystem and sales.

EVE does not implement velocity levels/switching. I used V. Muller’s version of the OLE Toy binary editor to set the element velocity ranges in the GetLayeredUp patch. Thank you, V. A huge amount of effort went into the analysis of YEP files and Python coding and he deserves all of the credit.

Thanks to vocoding, the final samples have a consistent sound. They are a little bit plain Jane by themselves, however. I gave each patch a little bit of reverb (reverb send level 20). I also added the “Ensemble Detune 2” DSP effect (send level 64). This is a truly spiffy effect — a chorus without modulation that gives the impression of an ensemble of slightly detuned voices. It is exactly the kind of gloss that the scat voices need.

Although the velocity ranges in GetLayeredUp are reasonable, users should still expect to tweak the keyboard velocity sensitivity and touch response to their personal needs. For example, I need to play GetLayeredUp on the softest touch setting. Your mileage will definitely vary!

Please stay tuned for the initial release of the expansion pack and multi-samples.