Milan Nankov

In this series we are going to build a universal audio component that can be used by Windows and Windows Phone apps regardless of the programming language. You might be wondering why one would need to write such a component. After all, MediaElement allows us to play all kinds of media files, control the playback and volume, and so on.

For most scenarios MediaElement will be sufficient, but that is not the case when your app has special requirements like playing multiple sounds at the same time, applying effects, or anything else that is not boring. This is especially true on Windows Phone, where you can have only one active MediaElement. For more advanced scenarios we have to rely on frameworks and APIs that are closer to the metal - WASAPI, XAudio2, Media Foundation.

In the first part of the series we will build an app that uses XAudio2 to play some audio files and we will make some music!

So, what is XAudio2 anyway?

XAudio2 is a low-level audio API. It provides a signal processing and mixing foundation for games that is similar to its predecessors, DirectSound and XAudio. XAudio2 is the long-awaited replacement for DirectSound. It addresses several outstanding issues and feature requests.
We are not going to build an audio engine for a game here, but we will use XAudio2 to gain more control over what we can play and how. Before we start coding we have to review some of the key concepts in XAudio2, one of which is the so-called voices. There are three types of voices: source, submix, and mastering. We will be using source and mastering voices in our app. Source voices are used to submit audio data to the audio pipeline, while mastering voices write data to the audio device.
XAudio2 pipeline
Using the diagram above as a reference, let's have a look at how the XAudio2 pipeline works.
  1. We submit the audio data of our file (say sound1.wav) to a Source Voice
  2. The voice is then responsible for channeling the bytes that make up the audio to a Mastering Voice
  3. The Mastering Voice sends the audio from all Source Voices to the speakers.
The only real difficulty here is submitting the audio to a Source Voice. Unfortunately, you cannot simply pass the whole file. We first have to read the associated metadata like the number of channels, bits per sample, and so on. After that, we need to locate where the actual audio data starts and submit it, along with its metadata, to the Source Voice. Luckily, it's not that hard, but in order to do that we have to know a thing or two about the format of the audio files that XAudio2 supports.
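The pipeline setup described above maps to a small amount of code. Here is a hedged sketch of creating the engine and the mastering voice (Windows-only; the function name and error handling are illustrative, not taken from the actual component):

```cpp
#include <xaudio2.h>          // Windows-only; link against xaudio2.lib
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

// A minimal sketch of standing up the XAudio2 pipeline: one engine and
// one mastering voice that feeds the audio device. Error handling is
// elided for brevity.
HRESULT InitAudioPipeline(ComPtr<IXAudio2>& engine,
                          IXAudio2MasteringVoice** masteringVoice)
{
    // The engine object owns all voices and the audio processing thread.
    HRESULT hr = XAudio2Create(&engine, 0);
    if (FAILED(hr)) return hr;

    // The mastering voice mixes the output of every source voice and
    // writes the result to the default audio device.
    return engine->CreateMasteringVoice(masteringVoice);
}
```

Source voices are created later, one per sound, and are routed to the mastering voice by default.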

Resource Interchange File Format (RIFF)

All audio files supported by XAudio2 use the Resource Interchange File Format or RIFF. Let's take a closer look at the structure of RIFF. A RIFF file is composed of pieces of data called chunks. Each chunk itself is composed of 3 fields - FOURCC ID, SIZE, and DATA. Take a look at (1) in the diagram below. Each RIFF file contains a number of these chunks which contain data and meta data.

The only chunk that is slightly different is the RIFF chunk. Represented in the diagram as (2), the RIFF chunk has an additional field which identifies the specific file type (FOURCC FILE TYPE) - for example, a WAVE file.

Another important piece of the RIFF puzzle is the FOURCC ID field of each chunk. These fields identify chunk types. A four-character code (or FOURCC for short) identifier is a 32-bit unsigned integer produced by packing four ASCII characters that give us the specific chunk type. For example, the chunk that contains the actual audio data has a FOURCC equal to "data".
RIFF File Format Chunks and Wav File Chunks   
Now that we know what FOURCC is and what chunks are, let's take a look at the structure of a typical .wav file. Represented by (3) on the diagram above, a typical .wav file contains a main RIFF chunk, which encompasses the other important chunks as its data. While there are a number of chunk types, the ones that we are interested in are the data format chunk (FOURCC of "fmt " - note the trailing space) and the raw audio data chunk (FOURCC of "data"). Playing an audio file typically involves the following steps:
  1. Locate RIFF chunk
  2. Locate "fmt" sub-chunk and extract the audio format from its data
  3. Locate the "data" sub-chunk and find the address of the first byte of its data field
  4. Submit the audio data and its format to a Source Voice
What we are going to build now is a universal Windows Runtime component that can read RIFF files and play them using XAudio2. Once we are done with the component, we will use it in a universal app that runs on both Windows Phone and Windows.

Universal XAudio2 Component

Since XAudio2 is provided as a Dynamic Link Library (DLL), the most natural way of using it is with C++. Luckily, we have the Windows Runtime C++ Template Library (WRL) and C++/CX at our disposal to make our lives significantly easier. What I tried to do with the audio component is make it as simple as possible and have it take care of audio playback only - the component is not tasked with loading or caching audio data; that is done by the application itself. Let's take a look at the two classes that get the job done.

The RIFFReader class

This class is responsible for searching for chunks and returning information about them. With only a single public method, the class has a pretty simple interface, and as you will see in a minute, its implementation is very simple as well. A couple of quick notes: IBuffer represents an array of bytes; the GetBufferByteAccess method is used to get a raw pointer to the underlying IBuffer byte array - this allows us to freely move around the data without making unnecessary copies. Now, let's take a look at the most important method of the class: FindChunk is used to locate the audio data chunk and the audio format chunk. FindChunk returns a struct that gives us the size in bytes of each chunk's data and a pointer to that data. If you recall, we need the raw audio data to be able to play an audio file. Well, dataChunk.data gives us exactly that. The other required piece of information is the audio format, which is contained in the data portion of the format chunk. As you can see, we simply cast the data to a predefined type and we are good to go.

The UniversalAudioPlayer class

This is the class that uses XAudio2 to play our audio files. It is also the class that our Windows Phone and Windows apps will interact with. Here the public API is as simple as it gets - we can Play and Stop AudioSamples. AudioSample is a class that provides the IBuffer of an audio file that has to be played, along with its identifier. Upon creation, we get an instance of the XAudio2 engine with the help of XAudio2Create. Once we have this, a mastering voice, which we discussed in the first part of the article, is also created. The meat of our audio player is contained in the next method.

Play uses RIFFReader to get the audio data and its format. Once we have those, it's pretty straightforward - create a source voice for the sample that we want to play, create an XAUDIO2_BUFFER using the data, submit the buffer to the source voice, and invoke voice->Start(0) to play the sample. We have an audio player. Let's play some music!
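The four steps just described can be sketched as follows (Windows-only; the function name is mine, `format`, `audioData`, and `audioBytes` are placeholders for what FindChunk returned, and error handling is kept minimal):

```cpp
#include <xaudio2.h>   // Windows-only; link against xaudio2.lib

// A sketch of the Play path: wrap the already-located raw audio data
// in an XAUDIO2_BUFFER and hand it to a freshly created source voice.
HRESULT PlaySketch(IXAudio2* engine,
                   const WAVEFORMATEX* format,
                   const BYTE* audioData,
                   UINT32 audioBytes)
{
    // 1. Create a source voice matching the sample's format.
    IXAudio2SourceVoice* voice = nullptr;
    HRESULT hr = engine->CreateSourceVoice(&voice, format);
    if (FAILED(hr)) return hr;

    // 2. Wrap the raw audio data in an XAUDIO2_BUFFER.
    XAUDIO2_BUFFER buffer = {};
    buffer.AudioBytes = audioBytes;
    buffer.pAudioData = audioData;
    buffer.Flags      = XAUDIO2_END_OF_STREAM;  // no more data after this

    // 3. Submit the buffer and start the voice.
    hr = voice->SubmitSourceBuffer(&buffer);
    if (FAILED(hr)) return hr;
    return voice->Start(0);
}
```

Note that XAUDIO2_BUFFER does not copy the data - the IBuffer backing pAudioData must stay alive until the voice finishes playing, which is why the app, not the component, owns the loaded files.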

The App

The app is a universal app for Windows and Windows Phone written in C#, and it will work on any Windows 8.1 device. Apart from showing the UI (Hello, Mr. Obvious), the app is responsible for loading and caching the audio files that we want to play. Once a file is loaded into memory, we use our UniversalAudioPlayer to play it. ToggleSample is the method that plays or stops audio samples. If we want to play an audio file we have to:

  1. Load the audio file into memory (IBuffer) using GetBuffer
  2. Create an AudioSample with the loaded IBuffer and the name of the audio sample
  3. Pass the AudioSample to UniversalAudioPlayer.Play
We can now test the app and make some music by playing the available audio samples. You can download the full source code of the audio player and the app from here.

Today we have built the foundation of an audio component that can be used on any Windows 8.1 device. One can easily extend it to support features like audio effects.

In the next blog post of this series we will learn how to play compressed audio files like MP3s.

