NAMM 2023: Roland go uptown

Roland’s big announcement this week is the GP Series Grand Inspiration digital pianos. The GP digital pianos cover a range of players and prices:

  • GP-3 Micro Grand: $4,000 USD (available now)
  • GP-6 Mini Grand: $6,300 USD (available March 2023)
  • GP-9 Grand Piano: $11,000 USD (available March 2023)
  • GP-9M Grand Piano: $19,000 USD (available May 2023)

All instruments feature beautifully styled wood cabinets and the “Piano Reality” sound engine inside. The product line is feature-graded, of course. 🙂 White models are also planned.

The GP-3 and GP-6 are appropriate for families who are serious about piano. I wouldn’t drop that much for a beginner piano! The GP-3 and GP-6 should also appeal to space- and budget-conscious schools and worship communities. As I’m painfully aware, not all churches can accommodate or afford a full grand, acoustic or digital.

Roland GP-9 digital grand piano

The GP-9 targets the sweet spot for sensible upscale customers, and I think Roland is hoping to sell a lot of these. The GP-9M adds a self-playing moving-key function, XLR outputs and a microphone input for sing-alongs. The GP-9M has an air of “expensive toy” about it. My Lord, the GP-9M is about what I paid for my Toyota (Scion) iM.

The 9s try to provide a complete piano experience minus the hassles of strings, humidity and temperature fluctuations. Public spaces are notoriously hostile to acoustic instruments. This model should appeal to churches and commercial venues — excellent piano experience and low(er) maintenance than an acoustic grand.

As to technology, the GP-9 Piano Reality engine claims “unlimited polyphony.” The keyboard has progressive hammer action, escapement, hybrid wood/molded keys with Ivory Feel, long key pivot length, and haptic vibration.

Roland clearly put a lot of effort into the multi-channel audio projection system in order to produce an immersive experience. Don’t like what you hear? Use the Piano Designer tools and app to tweak the sound (string tuning, temperament, key sensitivity, cabinet resonance, sound field, etc.).

For 11 or 18 large, I’m sure you’ll read the specifications and try one first. 🙂

Yamaha teasers

The Vocaloid™ family is welcoming a new singing avatar into the pack: Po-uta™. Po-uta is based on the voice of Porter (Po) Robinson. [“Uta” means song.] As with most things Vocaloid, you’ll need to point your browser toward Japan as Yamaha seems to target Vocaloid primarily to its domestic market. Vocaloid 6 implements Vocalo Changer™, which uses your own vocal data to personalize a performance.

Vocaloid Po-uta virtual Porter Robinson

Notice all of the trademark ™ symbols? Yamaha applied for these trademarks in roughly the same timeframe as AN-X™, CK61™ and CK88™.

Yamaha issued a teaser NAMM 2023 press release stating:

This year, Yamaha will introduce breakthrough products at the show across multiple musical categories, including piano, synthesizer, winds, acoustic guitar, drums and percussion, and professional audio.

So, will AN-X™, CK61™ and CK88™ see the light of day?

Copyright © 2023 Paul J. Drongowski

Casio singing synthesis in pictures

I’m slowly immersing myself in the singing synthesis technology behind the Casio CT-S1000V. Heck, ya need somethin’ to do during TV advertisements while watching sports. 🙂

There are two major approaches to speech (singing) synthesis: unit-selection and statistical parametric.

Most people are familiar with unit-selection systems like Texas Instruments’ old Speak & Spell or the much more advanced Yamaha Vocaloid™. Unit-selection relies upon a large database of short waveform units (AKA phonemes) which are concatenated during synthesis. The real trick behind natural sounding singing (and speech) is the connective “tissue” between units. Vocaloid creates waveform data that connects individual phonetic units.
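
To make the concatenation idea concrete, here is a tiny numpy sketch that naively joins waveform units with linear crossfades. This is only an illustration of unit concatenation in general; Vocaloid’s actual joining method is far more sophisticated, as described below.

import numpy as np

def crossfade_concat(units, fade_samples=128):
    # Naively join waveform units (1-D numpy arrays) with linear crossfades.
    # Illustration only -- Vocaloid's real joining is much more sophisticated.
    fade_out = np.linspace(1.0, 0.0, fade_samples)
    fade_in = np.linspace(0.0, 1.0, fade_samples)
    out = units[0].astype(np.float64)
    for unit in units[1:]:
        unit = unit.astype(np.float64)
        # Blend the tail of the running output into the head of the next unit.
        out[-fade_samples:] = out[-fade_samples:] * fade_out + unit[:fade_samples] * fade_in
        out = np.concatenate([out, unit[fade_samples:]])
    return out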

If you are familiar with Yamaha’s Articulation Element Modeling (AEM), a light should have gone on in your mind. The two technologies have similarities, namely joining note heads, bodies, and tails. The Yamaha NSX-1 chip implements a stripped-down Vocaloid engine and Real Acoustic Sound (AEM).

The content and size of the unit waveform database is a significant practical problem. The developers must record, organize and store a huge number of sampled phrases (waveform units). The Vocaloid 2 Tonio database (male, operatic English singer) occupies 750 MBytes on my hard drive — not small, and no doubt a real challenge to collect.

Statistical parametric systems effectively encode the source phonetic sounds into a model such as a hidden Markov model (HMM). During training, the source speech is subdivided into temporal frames and the individual frames are reduced to acoustic parameters. The model learns to associate specific text with the corresponding acoustic parameters. During synthesis, text is fed to the model, which recalls the corresponding acoustic parameters. The acoustic parameters drive some form of vocoding. (“Vocoding” is used broadly here.)
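
As a rough sketch (my own schematic, not any particular system’s API), the synthesis side of a statistical parametric system looks something like this, with the three stages passed in as callables:

def synthesize(text, text_analysis, acoustic_model, vocode):
    # Schematic statistical parametric synthesis loop. The three callables
    # stand in for the stages described above; they are placeholders only.
    audio = []
    # 1. Text analysis: text -> a sequence of per-frame linguistic features.
    for linguistic_features in text_analysis(text):
        # 2. The trained model maps linguistic features to acoustic
        #    parameters (spectral envelope, F0, power) for one frame.
        acoustic_params = acoustic_model(linguistic_features)
        # 3. A vocoder (in the broad sense used above) renders the
        #    parameters into waveform samples for that frame.
        audio.extend(vocode(acoustic_params))
    return audio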

Deep neural networks (DNN) improve on HMMs. Sinsy is a DNN-based singing voice synthesis (SVS) system from the Nagoya Institute of Technology. It is the culmination of many years of research by Professor Keiichi Tokuda, his students and colleagues. It was partially supported by the Casio Science Promotion Foundation. Thus, adoption by Casio is hardly accidental!

Sinsy Singing Voice Synthesis System

The Sinsy block diagram is taken from the paper “Sinsy: A Deep Neural Network-Based Singing Voice Synthesis System” by Yukiya Hono, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2803-2815, 2021. The method is quite complex and consists of several models. It’s not clear (to me, yet) if the Casio approach has all elements of the Sinsy approach. I recommend reading the paper, BTW; it’s well-written and highly technical.

Casio U.S. Patent 10,789,922 vocal synthesis

The next block diagram is taken from Casio’s U.S. Patent number 10,789,922 awarded September 29, 2020. Their approach is separated into a training phase and a synthesis (playing) phase. You’ll notice that Casio employ only an acoustic model. The patent discloses a “Voice synthesis LSI” unit, so their software may have a hardware assist. We’ll need to take a screwdriver to the CT-S1000V to find out for sure!

A picture is worth a thousand words. A technical diagram, however, requires a little interpretive context. 😉 Paraphrasing the Casio patent:

The text analysis unit produces phonemes, parts of speech, words and pitches. This information is sent to the acoustic model. The acoustic model unit estimates and outputs an acoustic feature sequence. The acoustic model represents a correspondence between the input linguistic feature sequence and the output acoustic feature sequence. The acoustic feature sequence includes:

  • Spectral information modeling the vocal tract (mel cepstrum coefficients, line spectral pairs, or similar).
  • Sound source information modeling the vocal cords (fundamental frequency (F0) and power value).

The vocalization model unit receives the acoustic feature sequence. It generates singing voice inference data for a given singer. The singing voice inference data is output through a digital-to-analog converter (DAC). The vocalization model unit consists of the following parts (a small code sketch follows the list):

  • A sound source generator:
    • Generates a pulse train for voiced phonemes.
    • Generates white noise for unvoiced phonemes.
  • A synthesis filter:
    • Uses the output signal from the sound source generator.
    • Is a digital filter that models the vocal tract based on spectral information.
    • Generates singing voice inference data (AKA “samples”).
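
A minimal numpy sketch of this source-filter idea is below. The pulse-train/noise excitation follows the patent’s description; the synthesis filter here is a plain all-pole (LPC-style) filter, which is my own simplification — the patent allows mel cepstrum or line spectral pair representations.

import numpy as np
from scipy.signal import lfilter

SAMPLE_RATE = 44100
FRAME_LEN = 255   # samples per frame, per the patent figure discussed below

def excitation_frame(voiced, f0_hz, rng=np.random.default_rng()):
    # Sound source generator for one frame: pulse train if voiced, noise if not.
    if voiced:
        frame = np.zeros(FRAME_LEN)
        period = int(SAMPLE_RATE / f0_hz)      # samples between glottal pulses
        frame[::period] = 1.0
        return frame
    return 0.1 * rng.standard_normal(FRAME_LEN)

def synthesis_filter(excitation, lpc_coeffs):
    # Placeholder vocal-tract filter: a simple all-pole (LPC-style) filter.
    return lfilter([1.0], np.concatenate(([1.0], lpc_coeffs)), excitation)
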
Casio U.S. Patent 10,789,922 vocal synthesis process detail

This rather complicated diagram from U.S. Patent 10,789,922 shows the synthesis phase in more detail. It shows the lyric string decomposed into phoneme and frame sequences. Each frame is sent to an acoustic model which generates an acoustic feature sequence, that is, the acoustic parameters that were learned during training. The acoustic parameters are synthesized (vocoded) into 255 samples. Each frame is about 5.1 msec long.

Casio CT-S1000V Vocal Synthesis (User Guide)

Well, if the second patent diagram was TMI, here is the block diagram from the Casio CT-S1000V user guide. The simplified diagram is quite concise and accurate! You should be able to relate these blocks directly back to the patent.

I hope this discussion is informative. In a later post, I’ll take a look at a few practical details related to Casio CT-S1000V Vocal Synthesis.

Copyright © 2022 Paul J. Drongowski

Pocket Miku (Thanks, David!)

I usually unwind with a book or Keyboard Magazine before turning out the light for a good night’s rest. Some of you know Keyboard Magazine as Electronic Musician. 🙂

Imagine my surprise when I read David Battino’s “Adventures in DIY” and it’s about Gakken’s Pocket Miku. And further, David gives a shout-out to yours truly and this blog (sandsoftwaresound.net).

Thank you, David! “Adventures in DIY” is one of the main reasons that I keep subscribing to Keyboard Magazine. David has a playfulness in his projects and approach that I really like. Plus, anyone who likes Japanese monsters and toys would fit right into our family.

David continues a long tradition of DIY writing that goes back to Polyphony Magazine, where I really got the bug to create. (There are still a few treasured issues of Polyphony in our basement.)

So, if you came looking for Gakken Pocket Miku, NSX-39 or Yamaha’s NSX-1 integrated circuit, here’s a quick list of pages related to those topics:

While you’re here, please browse around. This site is my mental storage unit and you never know what you might find. Lately, I’ve been diving into the new Yamaha Genos™. Maybe you need some content like scat vocal samples, converted DJXII patterns, or Motif performances converted to PSR/Tyros styles? Maybe you’re interested in taking a tour inside Montage, PSR/Tyros, or Kronos? Or maybe you want to use soft synths on Linux or use a Raspberry Pi to bridge 5-pin MIDI and USB.

And then there are reviews of products that I’ve tried or eventually purchased: Yamaha Montage, Genos, Reface CP, Reface YC, Korg Triton Taktile, Roland GO:KEYS, Nord Stage 2ex, etc.

There are several Arduino-based projects to browse (with downloadable code). Heck, there are even notes about data structures, computer architecture and VLSI design from back in the day.

Have fun!

Book-wise, I’m currently reading David Weigel’s “The Show That Never Ends: The Rise and Fall of Prog Rock.” Fun stuff.

Vocaloid keyboard announced

At long last, Yamaha have announced their Vocaloid™ keyboard, the VKB-100. The VKB-100 is a keytar design similar to the prototype shown at the “Two Yamahas, One Passion” exhibition at Roppongi Hills, Tokyo, July 3-5, 2015.

More details will be released in December 2017. However, this much is known:

  • Lyrics are entered using a dedicated application for smart phones and tablets via Bluetooth.
  • VY1 is the built-in default singing voice.
  • Up to 4 Vocaloid singers can be added using the application.
  • Four Vocaloid voices will be available: Hatsune Miku, Megpoid (GUMI), Aria on the Planets (IA), and Yuzuki Yukari.
  • Melody is played by the right hand while the left hand adds expression and navigates through the lyrics.
  • A speaker is built in, making the VKB-100 a self-contained instrument.

The VKB-100 was demonstrated at the Yamaha exhibition booth at the “Magical Mirai” conference held at the Makuhari Messe, September 1-3, 2017. Price is TBD.

VY1 is a female Japanese voice developed by Yamaha for its own products. VY1 does not have an avatar or character like other Vocaloid singers. This makes sense for Yamaha as they can freely incorporate VY1 in products without paying royalties or raising other intellectual property (IP) concerns.

The Vocaloid keyboard has had a long evolution, going through five iterations. The first three models did not use preloaded lyrics. Instead, the musician entered katakana with the left hand while playing the melody with the right hand. This proved to be too awkward and Yamaha moved to preloaded lyrics. The left hand controls on the neck add expression using pitch and mod wheels. The left hand also navigates through the lyrics as the musician “sings” via the instrument. The current lyrics are shown in a display just to the left of the keyboard where the musician can see them.

Yamaha will release more information on the Vocaloid keyboard site.

If you want to get started with Vocaloid and don’t want to spend a lot of Yen (or dollars), check out the Gakken NSX-39 Pocket Miku. Pocket Miku is a stylophone that plays preloaded Japanese lyrics. The NSX-39 also functions as a USB MIDI module with a General MIDI sound set within a Yamaha XG voice and effects architecture.

Be sure to read my Pocket Miku review and browse the resource links available at the bottom of the review page.

Copyright © 2017 Paul J. Drongowski

Pocket Miku: Module review

So far, I’ve posted several articles with resources for the Yamaha NSX-1 eVocaloid integrated circuit and the Gakken Pocket Miku (NSX-39), which is based on the NSX-1 chip. (See the bottom of this page for links.) This post pulls the pieces together.

Pocket Miku is both a vocal stylophone and a Yamaha XG architecture General MIDI (GM) module. There are plenty of Pocket Miku stylophone demos on the Web, so I will concentrate on Pocket Miku as a module.

Pocket Miku connects to your PC, mobile device or whatever over USB. The module implements sixteen MIDI channels where channel one is always assigned to the Miku eVocaloid voice and channels 2 to 16 are regular MIDI voices. As I said, the module follows the XG architecture and you can play with virtually all of the common XG features. The NSX-1 within Pocket Miku includes a fairly decent DSP effects processor in addition to chorus and reverb. The DSP effect algorithms include chorus, reverb, distortion, modulation effects, rotary speaker and a lot more. Thus, Pocket Miku is much more than a garden variety General MIDI module.
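
For example, here is a short Python sketch using the mido library that plays Miku on channel 1 and an ordinary GM piano on channel 2. The port name is an assumption — substitute whatever Pocket Miku enumerates as on your system.

import time
import mido

port = mido.open_output('NSX-39')   # port name varies by OS -- adjust to taste

# Channel 1 (mido channel 0) is always the Miku eVocaloid voice.
port.send(mido.Message('note_on', channel=0, note=60, velocity=100))
time.sleep(0.5)
port.send(mido.Message('note_off', channel=0, note=60))

# Channels 2 to 16 are regular XG/GM voices; select a program first.
port.send(mido.Message('program_change', channel=1, program=0))   # Acoustic Grand
port.send(mido.Message('note_on', channel=1, note=64, velocity=100))
time.sleep(0.5)
port.send(mido.Message('note_off', channel=1, note=64))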

My test setup is simple: Pocket Miku, a USB cable, a Windows 7 PC, Cakewalk SONAR and a MIDI controller. Pocket Miku’s audio out goes to a pair of Mackie MR5 Mk3 monitors. The MP3 files included with this post were recorded direct using a Roland MicroBR recorder with no added external effects.

The first demo track is a bit of a spontaneous experiment. “What happens if I take a standard XG MIDI file and sling it at Pocket Miku?” The test MIDI file is “Smooth Operator” from Yamaha Musicsoft. Channel 1 is the vocal melody, so we’re off to a fast start right out of the gate.

One needs to put Pocket Miku into NSX-1 compatibility mode. Simultaneously pressing the U + VOLUME UP + VOLUME DOWN buttons changes Pocket Miku to NSX-1 compatibility mode. (Pocket Miku responds with a high hat sound.) Compatibility mode turns off the NSX-39 SysEx implementation and passes everything to the NSX-1 without interpretation or interference. This gets the best results when using Pocket Miku as a MIDI module.

Here is the MP3 Smooth Operator demo. I made only one change to the MIDI file. Unmodified, Miku’s voice is high enough to shatter glass. Yikes! I transposed MIDI channel 1 down one octave. Much better. Pocket Miku is singing whatever the default (Japanese) lyrics are at start-up. It’s possible to send lyrics to Pocket Miku using SysEx messages embedded in the MIDI file. Too much effort for a spontaneous experiment, so what you hear is what you get.

Depending upon your expectations about General MIDI sound sets, you’ll either groan or think “not bad for $40 USD.” Miku does not challenge Sade.

One overall problem with Pocket Miku is its rather noisy audio signal. I don’t think you can fault the NSX-1 chip or the digital-to-analog converter (DAC). (The DAC, by the way, is embedded in the ARM architecture system on a chip (SOC) that controls the NSX-1.) The engineers who laid out the NSX-39 circuit board put the USB port right next to the audio jack. Bad idea! This is an example where board layout can absolutely murder audio quality. Bottom line: Pocket Miku puts out quite a hiss.

The second demo is a little more elaborate. As a starting point, I used a simple downtempo track assembled from Equinox Sounds Total Midi clips. The backing track consists of electric piano, acoustic bass, lead synth and drums — all General MIDI. Since GM doesn’t offer voice variations, there’s not a lot of flexibility here.

I created an (almost) tempo-sync’ed tremolo for the electric piano by drawing expression controller events (CC#11). My hope was to exploit the DSP unit for some kind of interesting vocal effect. However, everything I tried on the vocal was over-the-top or inappropriate. (Yes, you can apply pitch change via DSP to get vocal harmony.) Thus, Miku’s voice is heard unadulterated. I eventually wound up wasting the DSP on a few minor — and crummy — rhythm track effects.
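
For the curious, the tremolo amounts to drawing a sine-shaped CC#11 curve. A mido sketch that generates an equivalent stream of expression events is shown below; the rate, depth and resolution values are arbitrary, not the ones from my actual project.

import math
import mido

def tremolo_cc11(track, beats=8, ppq=480, cycles_per_beat=2, depth=32, center=95):
    # Append a sine-shaped CC#11 (expression) curve to a mido MidiTrack.
    step = ppq // 8                     # one CC event per 32nd note
    prev_tick = 0
    for tick in range(0, beats * ppq, step):
        phase = 2.0 * math.pi * cycles_per_beat * tick / ppq
        value = int(center + depth * math.sin(phase))
        track.append(mido.Message('control_change', control=11,
                                  value=max(0, min(127, value)),
                                  time=tick - prev_tick))
        prev_tick = tick

track = mido.MidiTrack()
tremolo_cc11(track)   # events can then be merged into the electric piano track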

I created four lyrical phrases:

A summer day           Natsu no hi
f0 43 79 09 00 50 10 6e 20 61 2c 74 73 20 4d 2c 6e 20 6f 2c 43 20 69 00 f7

Your face              Anata no kao
f0 43 79 09 00 50 10 61 2c 6e 20 61 2c 74 20 61 2c 6e 20 6f 2c 6b 20 61 2c 6f 00 f7

A beautiful smile      Utsukushi egao
f0 43 79 09 00 50 10 4d 2c 74 73 20 4d 2c 6b 20 4d 2c 53 20 69 2c 65 2c 67 20 61 2c 6f 00 f7

A song for you         Anata no tame no uta
f0 43 79 09 00 50 10 61 2c 6e 20 61 2c 74 20 61 2c 6e 20 6f 2c 74 20 61 2c 6d 20 65 2c 6e 20 6f 2c 4d 2c 74 20 61 00 f7

The Japanese lyrics were generated by Google Translate. I hope Miku isn’t singing anything profane or obscene. 🙂

I did not create the SysEx messages by hand! I used the Aides Technology translation app. Aides Technology is the developer of the Switch Science NSX-1 Arduino shield. The application converts a katakana phrase to an NSX-1 System Exclusive (SysEx) message. Once converted, I copied each hex SysEx message from the Aides Tech page and pasted it into SONAR.
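
For the record, the framing of these messages is easy to reproduce. The sketch below is based purely on the byte patterns visible in the dumps above (the F0 43 79 09 00 50 10 header, ASCII phonetic symbols, a NULL, then F7); anything beyond that is Yamaha’s and Aides Technology’s territory, not mine.

def nsx1_lyric_sysex(phonetic_symbols):
    # Wrap a phonetic symbol string in the SysEx framing seen in the dumps above.
    # Phonemes are space-delimited, syllables comma-delimited, NULL-terminated.
    header = bytes([0xF0, 0x43, 0x79, 0x09, 0x00, 0x50, 0x10])
    return header + phonetic_symbols.encode('ascii') + bytes([0x00, 0xF7])

msg = nsx1_lyric_sysex('n a,ts M,n o,C i')   # "Natsu no hi"
print(msg.hex(' '))   # f0 43 79 09 00 50 10 6e 20 61 2c ... 00 f7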

Finally, the fun part! I improvised the Miku vocal, playing the part on a Korg Triton Taktile controller. What you hear in the MP3 Pocket Miku demo is one complete take. The first vocal section is without vibrato and the second vocal section is with vibrato added to long, held notes. I added vibrato manually by drawing modulation (CC#1) events in SONAR, but I could have ridden the modulation wheel while improvising instead.

The overall process is more intuitive than the full Vocaloid editor where essentially everything is drawn. Yamaha could simplify the process still further by providing an app or plug-in to translate and load English (Japanese) lyrics directly to an embedded NSX-1 or DAW. This would eliminate a few manual steps.

Overall, pre-loaded lyrics coupled with realtime performance makes for a more engaging and immediate musical experience than working with the full Vocaloid editor. If Yamaha is thinking about an eVocaloid performance instrument, this is the way to go!

The pre-loaded lyric approach beats one early attempt at realtime Vocaloid performance as shown in this YouTube video. In the video, the musician plays the melody with the right hand and enters katakana with the left hand. I would much rather add modulation and navigate through the lyrics with the left hand. This is the approach taken for the Vocaloid keytar shown on the Yamaha web site.

Here is a list of my blog posts about Pocket Miku and the Yamaha NSX-1:

I hope that my experience will help you to explore Pocket Miku and the Yamaha NSX-1 on your own!

Before leaving this topic, I would like to pose a speculative question. Is the mystery keyboard design shown below a realtime eVocaloid instrument? (Yamaha U.S. Patent number D778,342)

The E-to-F keyboard just happens to coincide with the range of the human voice. Hmmmm?

Copyright © 2017 Paul J. Drongowski

Real Acoustic Sound

As mentioned in my earlier post, the Yamaha NSX-1 integrated circuit implements three sound sources: a General MIDI engine based on the XG voice architecture, eVocaloid and Real Acoustic Sound (RAS). RAS is based on Articulation Element Modeling (AEM) and I now believe that eVocaloid is also a form of AEM. eVocaloid uses AEM to join or “blend” phonemes. The more well-known “conventional” Vocaloid uses computationally intensive mathematics for blending, which is why conventional Vocaloid remains a computer-only application.

Vocaloid uses a method called Frequency-domain Singing Articulation Splicing and Shaping. It performs frequency domain smoothing. (That’s the short story.)

AEM underlies Tyros Super Articulation 2 (S.Art2) voices. Players really dig S.Art2 voices because they are so intuitively expressive and authentic. Synthesizer folk hoped that Montage would implement S.Art2 voices — a hope not yet realized.

Conceptually, S.Art2 has two major subsystems: a controller and a synthesis engine. The controller (which is really software running on an embedded microcomputer) senses the playing gestures made by the musician and translates those gestures into synthesis actions. Gestures include striking a key, releasing a key, pressing an articulation button, and moving the pitch bend or modulation wheel. Vibrato is the most commonly applied modulation type. The controller takes all of this input and figures out the musician’s intent. The controller then translates that intent into commands which it sends to the synthesis engine.

AEM breaks synthesis into five phases: head, body, joint, tail and shot. The head phase is what we usually call “attack.” The body phase forms the main part of a tone. The tail phase is what we usually call “release.” The joint phase connects two bodies, replacing the head phase leading into the second body. A shot is a short waveform, like a detached staccato note or a percussive hit. A flowing legato string passage sounds much different than pizzicato, so it makes sense to treat shots separately.

Heads, bodies and tails are stored in a database of waveform fragments (i.e., samples). Based on gestures — or MIDI data in the case of the NSX-1 — the controller selects fragments from the database. It then modifies and joins the fragments according to the intent to produce the final digital audio waveform. For example, the synthesis engine computes joint fragments to blend two legato notes. The synthesis engine may also apply vibrato across the entire waveform (including the computed joint) if requested.
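
Here is how I picture the controller’s fragment selection in code. This is purely my own conceptual sketch — the names and data structures are mine, not Yamaha’s — but it captures the head/body/joint/tail logic described above.

def render_phrase(notes, database, join, apply_vibrato=None):
    # Conceptual AEM-style rendering. `notes` is a list of dicts with 'pitch'
    # and 'legato_to_next'; `database` maps (kind, pitch) to a waveform
    # fragment; `join` computes a connecting fragment between two bodies.
    out = []
    for i, note in enumerate(notes):
        body = database[('body', note['pitch'])]
        if i == 0 or not notes[i - 1]['legato_to_next']:
            out.append(database[('head', note['pitch'])])    # attack
        else:
            # Legato: a computed joint replaces this note's head, blending
            # the previous body into this one.
            out.append(join(database[('body', notes[i - 1]['pitch'])], body))
        out.append(body)
        if not note['legato_to_next']:
            out.append(database[('tail', note['pitch'])])     # release
    waveform = [sample for fragment in out for sample in fragment]
    return apply_vibrato(waveform) if apply_vibrato else waveform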

Whew! Now let’s apply these concepts to the human voice. eVocaloid is driven by a stream of phonemes. The phonemes are represented as an ASCII string of phonetic symbols. The eVocaloid controller recognizes each phoneme and breaks it down into head, body and tail fragments. It figures out when to play these fragments and when bodies must be joined. The eVocaloid controller issues internal commands to the synthesis engine to make the vocal intent happen. As in the case of musical passages, vibrato and pitch bend may be requested and are applied. The NSX-1 MIDI implementation has three Non-Registered Parameter Number (NRPN) messages to control vibrato characteristics (a sketch of the NRPN send sequence follows the list):

  • Vibrato Type
  • Vibrato Rate
  • Vibrato Delay
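
Like any NRPN, these are sent as the usual CC 99/98 (parameter number) plus CC 6 (data entry) sequence. The sketch below shows that mechanism only — the NSX-1 parameter numbers for Vibrato Type, Rate and Delay are not reproduced here, so the values are placeholders; consult the NSX-1 MIDI specification for the real addresses.

import mido

def send_nrpn(port, msb, lsb, value, channel=0):
    # Standard NRPN sequence: CC#99 (NRPN MSB), CC#98 (NRPN LSB), CC#6 (data).
    port.send(mido.Message('control_change', channel=channel, control=99, value=msb))
    port.send(mido.Message('control_change', channel=channel, control=98, value=lsb))
    port.send(mido.Message('control_change', channel=channel, control=6, value=value))

# Placeholder parameter numbers -- look up the real Vibrato Type / Rate / Delay
# addresses in the NSX-1 MIDI specification before using this.
# send_nrpn(port, msb=0x00, lsb=0x00, value=64)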

I suspect that a phoneme like “ka” must be two fragments: an attack fragment “k” and a body fragment “a”. If “ka” is followed immediately by another phoneme, then the controller requests a joint. Otherwise, “ka” is regarded as the end of a detached word (or phrase) and the appropriate tail fragment is synthesized.

Whether it’s music or voice, timing is critical. MIDI note on and note off events cue the controller as to when to begin synthesis and when to end synthesis. The relationship between two notes is also critical as two overlapping notes indicate legato intent and articulation. The Yamaha AEM patents devote a lot of space to timing and to mitigation of latency effects. The NSX-1 MIDI implementation has two NRPN messages to control timing:

  • Portamento Timing
  • Phoneme Unit Connect Type

The Phoneme Unit Connect Type has three settings: fixed 50 msec mode, minimum mode and velocity mode in which the velocity value changes the phoneme’s duration.

As I mentioned earlier, eVocaloid operates on a stream of phonetic symbols. Software sends phonetic symbols to the NSX-1 using either of two methods:

  1. System Exclusive (SysEx) messages
  2. NRPN messages

A complete string of phonetic symbols can be sent in a single SysEx message. Up to 128 phonetic symbols may be sent in the message. The size of the internal buffer for symbols is not stated, but I suspect that it’s 128 symbols. The phoneme delimiter is ASCII space and the syllable delimiter is ASCII comma. A NULL character must appear at the end of the list.

The NRPN method uses three NRPN message types:

  • Start of Phonetic Symbols
  • Phonetic Symbol
  • End of Phonetic Symbols

In order to send a string of phonetic symbols, software sends a start NRPN message, one or more phonetic symbol NRPN messages and, finally, an end of phonetic symbols NRPN message.

Phonetic symbols are stored in a (128 byte?) buffer. The buffer lets software send a phrase before it is played (sung) by the NSX-1. Each MIDI note ON message advances a pointer through the buffer selecting the next phoneme to be sung. The SEEK NRPN message lets software jump around inside the buffer. If software wants to start at the beginning of the buffer, it sends a “SEEK 0” NRPN message. This capability is really handy, potentially letting a musician start at the beginning of a phrase again if they have lost their place in the lyrics.
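
A toy model of that buffer-and-pointer behavior, based only on the description above (not on the actual NSX-1 firmware), might look like this:

class PhonemeBuffer:
    # Toy model of the NSX-1 phonetic symbol buffer described above.
    def __init__(self, symbols):
        self.units = symbols.split(',')   # comma-delimited syllables
        self.pointer = 0

    def note_on(self):
        # Each MIDI note ON advances the pointer and returns the unit to sing.
        unit = self.units[self.pointer % len(self.units)]   # wrap-around is a toy choice
        self.pointer += 1
        return unit

    def seek(self, position=0):
        # Model of the SEEK NRPN: jump anywhere in the buffer (0 = the start).
        self.pointer = position

buf = PhonemeBuffer('n a,ts M,n o,C i')    # the "Natsu no hi" phrase from the module review above
print([buf.note_on() for _ in range(4)])   # ['n a', 'ts M', 'n o', 'C i']
buf.seek(0)                                # lost your place? start the phrase over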

When I translated the Yamaha NSX-1 brochure, I encountered the statement: “eVocaloid and Real Acoustic Sound cannot be used at the same time. You need to choose which one to pre-install at the ordering stage.” This restriction is not surprising. RAS and eVocaloid each need their own database; RAS has instrument samples and eVocaloid has human vocal samples. I don’t think, therefore, that Pocket Miku has any RAS (AEM) musical instrument samples. (Bummer.)

Speaking of databases, conventional Vocaloid databases are quite large: hundreds of megabytes. eVocaloid is intended for embedded applications and eVocaloid databases are much smaller. I’ll find out how big once I take apart Pocket Miku. Sorry, Miku. 🙂

I hope this article has given you more insight into Yamaha Real Acoustic Sound and eVocaloid.

Copyright © 2017 Paul J. Drongowski

And your keytar can sing

A day with excessive heat and humidity can strand you indoors as effectively as a New England snow storm. Time for a virtual quest into parts unknown.

I stumbled onto this beautiful web page on the Japanese Yamaha web site. Lo and behold, a Vocaloid™ keyboard in the shape of a keytar. I strongly suggest visiting this page as the commercial photography is quite stunning in itself.

The Vocaloid keyboard is a prototype that was shown at the “Two Yamahas, One Passion” exhibition at Roppongi Hills, Tokyo, July 3-5, 2015. Some form of Vocaloid keyboard has been in the works for several years and this prototype is the latest example.

The overarching idea is to liberate Vocaloid from the personal computer and to create an untethered performance instrument. The Vocaloid engine is built into the keyboard. The keyboard also has a built-in speaker along with the usual goes-outtas. The industrial design — by Kazuki Kashiwase — tries to create the impression of a wind instrument such as a saxophone.

The performer must preload the lyrics into the instrument before performing. This lets the performer concentrate on the melody when performing, not linguistics. The keyboard adjusts the pitch and timing of the vocalization. The left-hand neck buttons navigate through the lyrics: back one note, advance phrase, go to the end, etc. The ribbon controller raises and lowers the pitch. Control knobs select vibrato, portamento, brightness, breath and gender. Other knobs set the volume and select lyrics. Up to five lyrics can be saved.

The prototype synthesizes the “VY1” Japanese female voice developed by Yamaha for Vocaloid version 2. Somewhat confusingly, “VY1” stands for “Vocaloid Yamaha 1.” The voice has the codename “Mizki.”

The Vocaloid engine is based on the Yamaha Vocaloid Board, not eVocaloid, which is built into the NSX-1 integrated circuit (LSI). Yamaha sell the Vocaloid Board to OEMs and eventually intend to incorporate the board into entertainment, karaoke and musical instrument products of their own. The Vocaloid Board has MIDI IN/OUT, by the way, and reads the vocal database from an SD card.

Many of these details are taken from the article by Matsuo Koya (ITmedia). Please see the article for close-up photographs of the Vocaloid keyboard prototype.

The NSX-1 IC (YMW 820) mentioned above is a very interesting device itself. The NSX-1 is a single chip solution designed for embedded (“eVocaloid”) applications. It uses a smaller sized voice database, “eVY1”.

The NSX-1 has a General MIDI level 1 engine. Plus, the NSX-1 has a separate engine to reproduce high quality acoustic instrument sounds thanks to “Real Acoustic Sound” technology. This technology is based on Articulation Element Modeling (AEM) which forms the technical basis of Tyros 5 Super Articulation 2 (S.Art2) voices. Real Acoustic Sound and eVocaloid cannot be used simultaneously.

Holy smokes! I conjectured that AEM and Vocaloid are DSP cousins. This is further evidence in support of that conjecture.

NSX-1 can be controlled using a JavaScript library conforming to the Web MIDI API. Wanna make your browser sing? Check out the Yamaha WebMusic page on GitHub.

The company Switch Science sells an eVY1 SHIELD for Arduino. Kit-maker Gakken Educational has developed a stylus gadget based on eVocaloid and the NSX-1 — Pocket MIKU. And, of course, here is the Pocket Miku video.

Only 13 more days until Summer NAMM 2017.

Copyright © 2017 Paul J. Drongowski