Vocaloid is not just for anime!

As I mentioned in my last post, I’m developing a new sample-based voice for the Yamaha PSR-S750/S950 arranger workstations. Roland is famous for its “jazz scat” voice which uses velocity-switching to trigger syllables like DOO, DAT, BOP and DOW at pitch. This synth voice is good for a cappella-like arrangements (think “Take 6”) or free melody lines. It’s a real boon for those of us with weak natural voices and technique.

The Roland scat voice incorporates samples from the Spectrasonics Vocal Planet library produced by Eric Persing and Roby Duke. Although these are great sounds/samples, I want to distribute both the workstation voice (as an expansion pack) and the samples within. I intend to make my work available under a Creative Commons attribution license. Thus, I want and need to produce fully original samples in order to avoid copyright and licensing issues.

The quest

These goals and desires launched a month-long quest for suitable “scat” samples. I decided to base the scat voice on the four syllables DOO, DOT, BOP and DOW where the DOOs are looped and the other syllables are one-shots. The DOOs are triggered at relatively low velocity and provide a pad-like bed while the DOTs, BOPs and DOWs provide short staccato accents/melody. The voice implementation requires a set of multi-samples for each syllable where the multi-samples are spread across the natural range of the human voice (F3 to F6 where C5 is middle C).
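For readers who like to see the idea in code, here is a minimal Python sketch of the velocity-switching and multi-sample layout described above. Only the syllable names, the F3 to F6 range, and the "C5 is middle C" convention come from the text. The velocity thresholds, the ordering of the one-shot layers, and the minor-third spacing of the sample roots are assumptions made up for illustration, and the function names are hypothetical.

```python
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_number(name):
    """Convert a natural note name such as 'F3' to a MIDI note number (C5 = 60)."""
    letter, octave = name[0], int(name[1:])
    return 12 * octave + NOTE_OFFSETS[letter]


# Assumed layer boundaries (upper velocity limit, syllable). The post only says
# that DOO sits at relatively low velocity; the rest is illustrative.
VELOCITY_LAYERS = [(79, "DOO"), (99, "DOT"), (114, "BOP"), (127, "DOW")]


def pick_syllable(velocity):
    """Return the syllable whose velocity layer contains the given velocity."""
    if not 1 <= velocity <= 127:
        raise ValueError("MIDI note-on velocity must be in 1..127")
    for top, syllable in VELOCITY_LAYERS:
        if velocity <= top:
            return syllable


# Assumed multi-sample layout: one sample every minor third from F3 to F6.
SAMPLE_ROOTS = list(range(note_number("F3"), note_number("F6") + 1, 3))


def pick_zone(note):
    """Return (root note of the nearest sample, playback-rate ratio to reach note)."""
    root = min(SAMPLE_ROOTS, key=lambda r: abs(note - r))
    ratio = 2 ** ((note - root) / 12)
    return root, ratio


if __name__ == "__main__":
    note, velocity = note_number("C5"), 105
    print(pick_syllable(velocity), pick_zone(note))
```

Closer sample spacing means less pitch shifting per note, at the cost of more samples to record and loop.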

Freesound.org has a few individual sounds, but nothing in the way of multi-samples across a range of pitches. I next decided to try sampling my own voice. A few tentative attempts left me highly discouraged! I’m a baritone with a relatively small range — definitely not F3 to F6! Plus, I lack training and my technique is not particularly good.

I then began to experiment with vocoding. I was hoping to achieve loopable, pitch-accurate samples by using my voice as the modulator (the source of the formants) and imposing it on a pitch-accurate synth sound (the carrier). I experimented with the vocoders in the PSR-S950 and the Yamaha MOX6 workstation. The MOX6 vocoder is great at producing dance-floor sounds, but not so good at producing more natural vocal sounds suitable for jazz.

Not to be too cagey, I eventually found good use for the S950 vocoder and will describe this process in a separate post. Before I went in that direction, however, I discovered and tried Yamaha’s Vocaloid.


Here is how Yamaha describes Vocaloid.

Vocaloid is a technology for singing voice synthesis developed by Yamaha, and the name of this software application. The software allows users to input melody and lyrics in order to synthesize singing. In other words, with this technology, singing can be produced without a singer. Singing voice synthesis is produced by using fragments of voices recorded from actual singers, called the Singer Library.

To a user, Vocaloid consists of two parts: the Vocaloid editor and one or more libraries. Generally, Yamaha does not provide the libraries and prefers to license the Vocaloid technology to third parties (like Zero-G) who develop libraries using their own artists.

Vocaloid has an active and enthusiastic on-line community among anime enthusiasts. There are Japanese and English singer libraries for various anime characters or personas. These singers are not appropriate for jazz! Fortunately, there are a few singer libraries for pop and classical vocals.


Vocaloid is not inexpensive. The full Vocaloid version 3 editor is about $160USD and individual Vocaloid 3 singer libraries are $150USD. Thus, it’s hard to take a casual drive by the latest Vocaloid technology and give it a try. Vocaloid 4 has just been announced along with Cyber Diva. Pricing, unfortunately, has not budged.

Luckily, Zero-G has a fire sale on a few individual Vocaloid 2 libraries which include the version 2 editor. I bought the Zero-G Tonio library for $50USD. This is a much smaller amount to gamble in order to get a taste.

Tonio is an opera singer. The Tonio demo is very good (it’s opera!) and, after messing with Vocaloid and Tonio, I can tell that someone sank a lot of work into that demo! You can get very nice results from Vocaloid if you are willing to spend countless hours tweaking a performance. I recommend the on-line Vocaloid reviews at Sound on Sound Magazine. The reviews are right on the money and provide useful information to help get you started with Vocaloid. (SOS is great that way.)

To make a long story very short, you edit the vocal performance in the editor by entering lyrics into a piano roll editor. You then change the attack, vibrato and other aspects of the vocal performance. These tweaks are essential for getting a good result.

Ultimately, Tonio is an opera singer and his vocal characteristics are a distinct part of the vocal samples that underlie the singer library. There ain’t no way to turn this nice Italian boy out and make him sing pop! He isn’t Bruno Mars. Please keep this in mind if you decide to try Vocaloid in a project of your own. Make sure that the voice library is a simpatico match with the target genre/style. This is why I moved on from Tonio and Vocaloid for the scat voice project.

The technology

Yamaha has invested heavily in the Vocaloid technology and has filed many patents. The company is conducting joint research with The Music Technology Group (MTG) of the Universitat Pompeu Fabra in Barcelona. The MTG, by the way, are the people behind the Freesound.org web site.

Vocaloid does a lot of intense digital signal processing (DSP). It modifies and concatenates sound in the frequency domain. It performs a Fast Fourier Transform (FFT) to convert from the time domain to the frequency domain, modifies the spectral characteristics of the sound, and then performs an inverse FFT to return to the time domain. This is too much computation to perform in real time. Thus, there is always a delay while Vocaloid renders a performance before playback.
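To make the FFT, modify, inverse-FFT round trip concrete, here is a small Python sketch using only the standard library. It is not Vocaloid's algorithm, which is far more elaborate. It only shows the generic pattern of going to the frequency domain, editing the spectrum, and coming back. The test signal, the bin range, and the helper name `scale_bins` are assumptions chosen for illustration.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(spectrum):
    """Inverse FFT via the conjugate trick: conj(fft(conj(X))) / n."""
    n = len(spectrum)
    flipped = fft([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in flipped]


def scale_bins(spectrum, lo_bin, hi_bin, gain):
    """Scale bins lo_bin..hi_bin (both within 0..n/2) and their mirror images."""
    n = len(spectrum)
    out = list(spectrum)
    for k in range(lo_bin, hi_bin + 1):
        out[k] *= gain
        mirror = (n - k) % n  # negative-frequency twin of bin k
        if mirror != k:
            out[mirror] *= gain
    return out


# Illustrative test signal: a partial at bin 4 plus a quieter one at bin 12.
N = 64
low = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]
high = [0.5 * math.sin(2 * math.pi * 12 * t / N) for t in range(N)]
signal = [a + b for a, b in zip(low, high)]

spectrum = fft(signal)                      # time domain -> frequency domain
edited = scale_bins(spectrum, 10, 14, 0.0)  # remove the upper partial
result = [v.real for v in ifft(edited)]     # frequency domain -> time domain

error = max(abs(r - ref) for r, ref in zip(result, low))
print(f"max error after removing the upper partial: {error:.2e}")
```

Scaling each bin together with its mirror image keeps the spectrum conjugate-symmetric, so the inverse transform comes back as a real-valued signal.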

Yamaha protects its intellectual property (IP) through patents and rarely publishes results in the scientific literature. Vocaloid is an exception, probably due to the partnership with MTG. Here is a short list of a few papers on Vocaloid and its technology.

  • Singing synthesis as a new musical instrument, Hideki Kenmochi, IEEE International Conference on Acoustics, Speech and Signal Processing, 2012 (ICASSP 2012).
  • Sample-based singing voice synthesizer by spectral concatenation, Jordi Bonada and Alex Loscos, Proceedings of the Stockholm Music Acoustics Conference, August 6-9, 2003 (SMAC 03).
  • VOCALOID – Commercial singing synthesizer based on sample concatenation, Hideki Kenmochi and Hayato Ohshita, International Speech Communication Association (ISCA), Interspeech 2007.

You don’t need to know all of this to use Vocaloid, but it’s good to know that there is cutting edge science behind the product.

I strongly recommend the developer interview with Michael Wilson which is published at the Vocaloid US web site. The interview gives insight into the incredible amount of work and detail behind the development of the latest library, Cyber Diva. This interview is extremely informative. Thanks, Michael. Articles such as this one bridge the gap between vacuous press releases and scientific papers, giving everyone a greater appreciation for the technology behind a product.

It is also the best case to be made against software piracy. Innovation, research and development are fueled by money. Cheat developers out of their just payment only if you wish to kill off future innovation!

The Vocaloid technology reminds me a little bit of Super Articulation 2 (SArt2) on Tyros. SArt2 concatenates tones together to produce realistic articulations such as legato and glissando. SArt2 works in the time domain and computes in real time although latency remains a very practical concern. (There are patents.) Perhaps someday when sufficient parallel processing resources are inexpensive, there will be an SArt3 that computes in the frequency domain.
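As a rough illustration of what joining tones in the time domain can look like, here is a minimal Python sketch of a linear crossfade between two sample buffers. This is a generic textbook technique, not Yamaha's SArt2 method, whose details I don't know. The function name and the overlap length are hypothetical.

```python
def crossfade_concat(a, b, overlap):
    """Join sample lists a and b, blending the last `overlap` samples of a
    with the first `overlap` samples of b using a linear ramp."""
    if overlap < 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must fit inside both buffers")
    split = len(a) - overlap
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of b rises from near 0 to near 1
        mixed.append((1 - w) * a[split + i] + w * b[i])
    return a[:split] + mixed + b[overlap:]


if __name__ == "__main__":
    first = [1.0] * 8   # stand-in for the tail of one tone
    second = [0.0] * 8  # stand-in for the head of the next tone
    print(crossfade_concat(first, second, 4))
```

Each output sample needs only a couple of multiplies and adds, which hints at why a time-domain approach is easier to run in real time than a full spectral one.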