Audio Style file format

Yamaha introduced audio styles in the PSR-S950 arranger workstation. Audio styles are both loved and hated. Loved when they sound good, but hated when people try to change or repurpose them in new styles.

The term “audio style” is a bit of an overstatement. Only the percussion track is audio. At least, that’s how audio styles have been developed and used to this day. Yamaha just released the Audio Phraser application for creating and editing the basic skeleton of an audio style, so this situation may change now that people can more freely create, edit and share their own audio styles.

Audio style file internal format

Ever since Yamaha distributed the audio styles for Genos, I’ve been meaning to take a look inside of an audio style file. Here’s a little preliminary information.

An audio style file is an IFF-like container just like a Standard MIDI File (SMF). In fact, an audio style file has the same internal organization as a regular style file which we know to be a Type 0 SMF with extra chunks.

An audio style file has the following chunks (in order):

    Type    Purpose
    ----    ------------------------------------
    MThd    SMF header chunk
    MTrk    SMF track chunk
    CASM    Yamaha CASM chunk
    AASM    Audio assembly (descriptor) chunk
    AFil    Audio file (waveform) chunk
    OTSc    Yamaha OTS chunk

The AASM and AFil chunks are new, additional chunks beyond the known MIDI, CASM and OTS chunks. All chunks have a four byte chunk identifier and a four byte chunk size. The chunk size does not include the identifier or chunk size bytes, as usual.

The AASM chunk is relatively small, about 2,500 bytes. It consists of 15 variable length ASEG subchunks. The ASEG subchunk has a four byte subchunk size. Each ASEG corresponds to a style section; that’s why there are fifteen of them.

An ASEG subchunk has three parts:

    Type    Purpose
    ----    ------------------------------------
    Adec    Identifies the style section
    Atab    Identifies the audio file; other functions unknown
    AMix    Function unknown

The Adec part is variable length, having an explicit four byte size. The Atab and AMix parts appears to be fixed length (101 and 28 bytes, respectively) and do not have an explicit size field.

The Adec part is ASCII text and is a style section name like “Main A” or “Fill In DD”. That is the only information in Adec.

I don’t know exactly what the Atab does. The Atab part contains an ASCII string which identifies the audio file associated with the style section. This string is clearly visible in a dump. (Example below.) All of the Atab and AMix parts in the test audio file have the same values except for the audio file names.

File Offset:       36965
Subchunk type:     'ASEG'
Subchunk size:     151
Section name:      Main D
Atab type:         'Atab'
   0    0    0   97    0   32   32   32 | 00 00 00 61 00 20 20 20 | ...a.
  32   32   32   32   32   41   56   48 | 20 20 20 20 20 29 38 30 |      )80
 115   67   97  110   97  100  105   97 | 73 43 61 6E 61 64 69 61 | sCanadia
 110   82  111   99  107   95   77   97 | 6E 52 6F 63 6B 5F 4D 61 | nRock_Ma
 105  110   32   68    0    0    0    0 | 69 6E 20 44 00 00 00 00 | in D....
   0    0    0    0    0    0    0    0 | 00 00 00 00 00 00 00 00 | ........
   0    0    0    0    0    0    0    0 | 00 00 00 00 00 00 00 00 | ........
   0    0    0    0    0    0    0    0 | 00 00 00 00 00 00 00 00 | ........
   1   15   -1    7   -1   -1   -1   -1 | 01 0F FF 07 FF FF FF FF | ........
   0    0    0  127    0    0    0    0 | 00 00 00 7F 00 00 00 00 | ........
 127    0    0    0    0    0  127    0 | 7F 00 00 00 00 00 7F 00 | ........
   0    0    0    0  127    0    0    0 | 00 00 00 00 7F 00 00 00 | ........
   0    0    0    0    0    0    0    0 | 00 00 00 00 00 00 00 00 | ........
AMix type:         'AMix'
   0    0    0   24    7 -128    0   -1 | 00 00 00 18 07 80 00 FF | ........
  88    4    4    2   24    8    0  -80 | 58 04 04 02 18 08 00 B0 | X.......
   7   71    0   10   64    0   91    0 | 07 47 00 0A 40 00 5B 00 | .G..@.[.
   0   -1   47    0    0    0    0    0 | 00 FF 2F 00 00 00 00 00 | ../.....

Etienne from the PSR Tutorial Forum points out that the AMix subchunk contains MIDI event codes:

AMix : header
00 00 00 18 : length of data
07 80 : 0780 hex = 1920 decimal (PPQN ?)
00 : delta time
FF 58 04 04 02 18 08 : meta event Time signature 4/4
00 : delta time
0B 07 70 : controller volume
00 : delta time
0A 40 : controller Panpot
00 : delta time
5B 00 : Controller Reverb send level
00 : delta time
FF 2F 00 : end of MTrk trunk

Nice catch, Etienne! The AMix content makes sense because something needs to set up the channel volume, pan and reverb level for the audio phrase. Yamaha love to use MIDI events for other purposes (like voice files, OTS, etc.) Why not?

The AFil chunk has substructure, too. The AFil chunk consists of ADSg chunks. As you might guess, the AFil chunk is pretty big because it contains waveform data.

The following table shows the offset and length information for the first ADSg in the example’s AFil:

    AFil     37287  15261858
    ADSg     37295   1219275      Container for an audio file
    ANdc     37303        50      File name
    AWav     37361   1219209      Container for audio waveform
    WAVE     37369       n/a      Marker (no subchunk size)
    Afmt     37373        16      Audio format information
    Sfmt     37397       217      Container for section information
    Sdec     37608         6      Section name, e.g., Main A
    Adat     37622   1218300      Waveform data
    AInf   1255930       640      Container for audio information
    BPnt   1255938       136
    OPnt   1256082       240
    APnt   1256330       232
    ATmp   1256570         0      Empty, subchunk size is 0
    ADSg   1256578                Container for the next audio file
    ....

The container relationships are important because the containers and subchunks are nested:

    AFil contains ADSg
    ADSg contains ANdc, AWav
    AWav contains WAVE, Afmt, Sfmt, Sdec, Adat, AInf
    AInf contains BPnt, OPnt, APnt, ATmp

The nesting is a bit of a pain in the patootie when writing code to parse a style file.

ADSg is the container chunk holding audio waveform (meta-)information. Like ASEG, there are fifteen ADSg chunks — one for each audio file. The ANdc subchunk inside contains the audio file name which matches up with the name in the ASEG. AWav is the container holding the audio waveform data itself.

The audio “file” format is WAV-like, but it is not exactly WAV (Microsoft RIFF). I was able to playback the audio by importing the audio style file as a raw (untyped) audio file. The audio format seems to be 44,100Hz, 16-bit stereo, big endian. No compression or encryption. It isn’t be too hard to dump the audio.

Yamaha Audio Phraser

Now that you know a little bit about what’s inside of an audio style file, here is brief overview of what the Audio Phraser program generates.

Audio Phraser generates an MThd MIDI file header chunk, a single MTrk chunk (Type 0), an ASEG chunk for each audio waveform, an AFil chunk (containing an ADSg subchunk for each audio file) and a CASM chunk.

The MIDI tempo and time signature are the same as the tempo set in Audio Phraser. The MIDI song title is set to “Audio Phraser”.

The MIDI track contains the usual markers at the beginning: SFF2 and SInt. A single SysEx message is generated after SInt: General MIDI System ON (F0 7E 7F 09 01 F7). The key signature is set to C/Am, followed by:

  • SMPTE Offset
  • Sequencer specific metadata: ff 7f 04 43 00 01 00 00

Oddly, MIDI channel 4 has four, whack-looking MIDI OFF events:

    NOTE OFF G#9
    NOTE OFF G5
    NOTE OFF C0
    NOTE OFF C0

A bug? The remaining markers indicate the start of the style sections. The section length corresponds to the length of the audio waveform for the section. Thus, if the audio waveform for “Main A” is 2 bars, then the MIDI section for “Main A” is 2 bars long.

The CASM chunk is minimal and sets NTR/NTT for MIDI channel 9 (Subrhythm). NTR is “Root Fixed” and NTT is “Bypass/Bass Off”. No NTR/NTT is given for channel 10 (rhythm/drums).

Audio Phraser does not generate an OTSc (One Touch Settings) chunk.

Audio Phraser creates an AWI file for each waveform that it imports into an audio style file. The AWI file most likely holds the results of Audio Phraser’s analysis (i.e., beat detection and so forth). It would be interesting and informative to compare the contents of an AWI file against the ASEG and AInf chunks in the resulting audio style file. I’m guessing that the AWI file is the “prototype” for the ASEG and AInf chunks.

Java source code

If you would like to explore audio style files, then download the source code for a simple audio style dump program. The code is relatively brittle and expects to encounter chunks in a certain order and/or quantity. Thus, be prepared to modify the code. This is an experimenter’s kit, after all. 😉

Copyright © 2018 Paul J. Drongowski