Casio speech synthesis technology - Sand, software and sound

The voice synthesis in Casio’s new CT-S1000V keyboard raised quite a bit of interest on the Web, including my own curiosity.

I installed the Casio Lyric Creator app on my iPad just to see what I could see. Lo and behold, there is a long list of open-source licensing statements which identify some of the voice synthesis technology in the app and the keyboard itself. Let’s take a look, starting at the top of the list.

HMM-based speech synthesis engine, HTS_engine, developed by the HTS Working Group. That’s a lot of acronyms and shoulders to stand on:

HMM: Hidden Markov model
HTS: An HMM-based speech synthesis system
SPTK: Speech Signal Processing Toolkit
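
To make the "HMM" part less abstract, here is a toy sketch of the forward algorithm, which computes how likely a hidden Markov model is to produce a given observation sequence. Everything in it (two states, two output symbols, the probability values, the function name) is my own made-up illustration. It is not code from HTS_engine, which models continuous speech parameters rather than discrete symbols.

```python
from itertools import product

# Hypothetical toy model: 2 hidden states, 2 observable symbols (0 and 1).
START = [0.6, 0.4]                 # P(first state = i)
TRANS = [[0.7, 0.3], [0.4, 0.6]]   # TRANS[i][j] = P(next state = j | state = i)
EMIT = [[0.9, 0.1], [0.2, 0.8]]    # EMIT[i][k] = P(symbol = k | state = i)


def forward_likelihood(obs, start, trans, emit):
    """Return P(obs) under a discrete HMM using the forward algorithm."""
    n_states = len(start)
    # alpha[i] = P(observations so far, current state = i)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    for symbol in obs[1:]:
        # Sum over every previous state, then weight by the emission probability.
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


if __name__ == "__main__":
    print(forward_likelihood([0, 1, 1], START, TRANS, EMIT))
    # Sanity check: the likelihoods of all length-3 sequences should sum to about 1.
    total = sum(
        forward_likelihood(list(seq), START, TRANS, EMIT)
        for seq in product(range(2), repeat=3)
    )
    print(total)
```

The states are "hidden" because only the emitted symbols are observed. The algorithm sums over every possible state path without enumerating the paths one by one.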

The HTS Working Group is a voluntary group developing the HMM-based speech synthesis system HTS. The software bears a joint copyright from two institutions:

Nagoya Institute of Technology, Department of Computer Science, and
Tokyo Institute of Technology, Interdisciplinary Graduate School of Science and Engineering

The HTS_engine API is released under the Modified BSD license. I won’t quote chapter and verse for every component, but this gives you a sense of the distribution terms and conditions. Read about HTS version 2 in “The HMM-based Speech Synthesis System (HTS) Version 2.0”, by Heiga Zen, et al., Sixth ISCA Workshop on Speech Synthesis, 2007.

HMM-based singing voice synthesis system, Sinsy, developed by the Sinsy Working Group. This software bears the copyright of Nagoya Institute of Technology, Department of Computer Science.

Speech Signal Processing Toolkit, SPTK, developed by the SPTK Working Group. Again, the toolkit has a joint copyright:

Nagoya Institute of Technology, Department of Computer Science, and
Tokyo Institute of Technology, Interdisciplinary Graduate School of Science and Engineering

CRF++ by Taku Kudo. “CRF” is an acronym for “conditional random fields”. CRFs are a class of statistical modeling methods that are used in pattern recognition and machine learning.

The developers also acknowledge other work which was used during speech analysis:

WORLD: A high-quality speech analysis and synthesis system based on vocoding.
CMUdict: The CMU Pronouncing Dictionary from Carnegie Mellon University, Pittsburgh, PA (my old school)
Festival Speech Synthesis System, Centre for Speech Technology Research, University of Edinburgh, UK.
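
For a feel of what a pronouncing dictionary like CMUdict provides, here is a small sketch that parses a few lines in the plain-text layout CMUdict is distributed in: a word, then its phonemes, with a stress digit on each vowel and a "(1)"-style suffix marking an alternate pronunciation. The sample lines and the function names are mine, for illustration only. I don't know how Casio's tools actually consume the dictionary.

```python
import re

# Illustrative sample in CMUdict's text layout (not the full dictionary).
SAMPLE = """\
;;; comment lines start with three semicolons
HELLO  HH AH0 L OW1
HELLO(1)  HH EH0 L OW1
WORLD  W ER1 L D
"""


def parse_cmudict(text):
    """Map each lowercase word to a list of pronunciations (lists of phonemes)."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        head, *phones = line.split()
        # Drop the "(1)", "(2)", ... suffix that marks alternate pronunciations.
        word = re.sub(r"\(\d+\)$", "", head).lower()
        entries.setdefault(word, []).append(phones)
    return entries


def strip_stress(phones):
    """Remove the trailing stress digit (0, 1 or 2) from vowel phonemes."""
    return [p.rstrip("012") for p in phones]


if __name__ == "__main__":
    lexicon = parse_cmudict(SAMPLE)
    for word in ("hello", "world"):
        print(word, [strip_stress(p) for p in lexicon[word]])
```

A lyric-to-phoneme step could look words up this way and fall back to letter-to-sound rules for anything missing, though that last part is my guess rather than anything stated in the license list.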

For (more than) an introduction to HMM-based speech synthesis, try: “An Introduction to HMM-Based Speech Synthesis” by Junichi Yamagishi, October 2006. That should be enough math for you. 🙂 This presentation is super helpful, too.

Casio’s voice synthesis technology is not Yamaha Vocaloid™. Vocaloid™, by the way, is a registered trademark belonging to Yamaha. I have seen punters on the Web attribute the technology to Vocaloid or Yamaha. “Oh, they must have licensed it.” Wrong. Please do not refer to Casio’s tech as “Vocaloid” as this is technically incorrect and a misuse of Yamaha’s trademark.

Plus, we want to give credit where credit is due. Casio have staked out their IP territory in a series of patents filed on their behalf.

Want more information? See Casio singing synthesis in pictures.