Nintendo 64 console with EverDrive cartridge

Nintendo 64 Part 14: Streaming Audio

Nintendo 64, Programming

Now that I have gotten audio playing from memory buffers, it’s time to play actual audio files from ROM. This is a code-heavy post!

Tools, Audio Library, and Microcode

The way that audio works with Nintendo’s development kit is:

  1. You create sound effects and audio samples for musical instruments, and process these with the Nintendo 64 audio tools. This produces files called sound banks and other files in a format understood by the Nintendo 64 SDK.
  2. Your game loads sound banks from the cartridge and calls audio functions in LibUltra to play sounds. These functions generate an audio command list.
  3. Your game passes the command list to the audio microcode, which uses it to fill in the audio buffer.
  4. Once the audio microcode completes, you pass the audio buffer to the audio interface.

I would rather not figure out how to run these old audio tools on my system, so I am going to just embed some audio data in my game and figure out how to make the audio library work with that. However, I can’t really avoid creating sound banks or at least parts of them, because the data structures in a sound bank must be passed to the audio library. I tried guessing how to fill in the structures by looking at the names of fields in the header files, but that turned out to be too frustrating.

To figure out the sound bank format, I wrote a script, ctl.py (Gist), which takes a sound bank in the *.ctl format and dumps the contents as text.
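To give a flavor of what such a dump involves, here is a minimal C sketch that reads only the very start of a *.ctl file. It is not the ctl.py script. The layout is an assumption based on my reading of the ALBankFile structure in the SDK's libaudio.h: a 16-bit revision (0x4231, "B1"), a 16-bit bank count, then one 32-bit file offset per bank, all big-endian. The names ctl_header, ctl_parse_header, and ctl_bank_offset are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Values read from the start of a *.ctl file (assumed layout, see above).
struct ctl_header {
  uint16_t revision;   // expected to be 0x4231 ("B1")
  uint16_t bank_count; // number of 32-bit bank offsets that follow
};

// Read big-endian integers; the Nintendo 64 is a big-endian machine.
static uint16_t read_be16(const uint8_t *p) {
  return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t read_be32(const uint8_t *p) {
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
         ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

// Parse the header. Returns 0 on success, or -1 if the data is too
// short or the revision does not match.
static int ctl_parse_header(const uint8_t *data, size_t size,
                            struct ctl_header *out) {
  assert(data != NULL && out != NULL);
  if (size < 4) {
    return -1;
  }
  out->revision = read_be16(data);
  out->bank_count = read_be16(data + 2);
  if (out->revision != 0x4231) {
    return -1;
  }
  if (size < 4 + 4 * (size_t)out->bank_count) {
    return -1;
  }
  return 0;
}

// File offset of bank `index`. Call only after ctl_parse_header
// succeeds and with index < bank_count.
static uint32_t ctl_bank_offset(const uint8_t *data, unsigned index) {
  return read_be32(data + 4 + 4 * (size_t)index);
}
```

Everything past the header (banks, instruments, sounds, envelopes, key maps, wavetables) is reached by following offsets like these, which is why a text dump is a convenient way to see how the structures fit together.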

Initializing Audio

The part of the SDK’s sound library which generates the command lists is called the “synthesizer”, and you can create a global synthesizer object with alInit().

To actually play sounds, you need to create a synthesizer “client”, and the SDK comes with two clients: a sound player, for playing one-shot sounds like sound effects, and a sequencer for playing MIDI files.

You can then create an ALSound structure, which needs an envelope, keymap, and wavetable. Once you have a sound player and a sound, you can play the sound.

enum {
  AUDIO_HEAP_SIZE = 256 * 1024,
  AUDIO_MAX_VOICES = 4,
  AUDIO_MAX_UPDATES = 64,
  AUDIO_EVT_COUNT = 32,
};

static u8 audio_heap[AUDIO_HEAP_SIZE]
    __attribute__((section("uninit"), aligned(16)));
static ALHeap audio_hp;
static ALGlobals audio_globals;
static ALSndPlayer audio_sndp;

void audio_init(void) {
  int audio_rate = osAiSetFrequency(22050);

  alHeapInit(&audio_hp, audio_heap, sizeof(audio_heap));
  ALSynConfig scfg = {
      .maxVVoices = AUDIO_MAX_VOICES,
      .maxPVoices = AUDIO_MAX_VOICES,
      .maxUpdates = AUDIO_MAX_UPDATES,
      .dmaproc = audio_dma_new, // explained later
      .heap = &audio_hp,
      .outputRate = audio_rate,
      .fxType = AL_FX_SMALLROOM,
  };
  alInit(&audio_globals, &scfg);

  ALSndpConfig pcfg = {
      .maxSounds = AUDIO_MAX_VOICES,
      .maxEvents = AUDIO_EVT_COUNT,
      .heap = &audio_hp,
  };
  alSndpNew(&audio_sndp, &pcfg);

  // Times measured in microseconds.
  static ALEnvelope sndenv = {
      .attackTime = 0,
      .decayTime = 1414784,
      .releaseTime = 0,
      .attackVolume = 127,
      .decayVolume = 127,
  };
  // Not sure this does anything.
  static ALKeyMap keymap = {
      .velocityMin = 0,
      .velocityMax = 127,
      .keyMin = 41,
      .keyMax = 41,
      .keyBase = 41,
  };
  // Pointer to the PCM data.
  static ALWaveTable wtable = {
      .type = AL_RAW16_WAVE,
      .flags = 1,
  };
  // SFX_FANFARE is a chunk of data in the ROM image containing raw
  // 16-bit PCM audio.
  wtable.base = (u8 *)pak_objects[SFX_FANFARE].offset;
  wtable.len = pak_objects[SFX_FANFARE].size;
  static ALSound snd = {
      .envelope = &sndenv,
      .keyMap = &keymap,
      .wavetable = &wtable,
      .samplePan = AL_PAN_CENTER,
      .sampleVolume = AL_VOL_FULL,
      .flags = 1,
  };
  // Allocate and play a sound.
  ALSndId sndid = alSndpAllocate(&audio_sndp, &snd);
  alSndpSetSound(&audio_sndp, sndid);
  alSndpSetPitch(&audio_sndp, 1.0f);
  alSndpSetPan(&audio_sndp, 64);
  alSndpSetVol(&audio_sndp, 30000);
  alSndpPlay(&audio_sndp);
}
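The envelope times above are in microseconds, so one plausible way to pick a decayTime is to compute the length of the clip from its size in bytes. The sketch below does that arithmetic. It assumes the clip is mono 16-bit PCM, and whether the value 1414784 was derived this way is a guess. pcm16_duration_us is a hypothetical helper, not an SDK function.

```c
#include <assert.h>
#include <stdint.h>

// Duration in microseconds of raw 16-bit mono PCM data, given its size
// in bytes and the playback rate in Hz.
static uint32_t pcm16_duration_us(uint32_t size_bytes, uint32_t rate_hz) {
  assert(rate_hz > 0);
  uint64_t samples = size_bytes / 2; // two bytes per sample
  return (uint32_t)(samples * 1000000u / rate_hz);
}
```

For example, 44100 bytes at 22050 Hz is 22050 samples, which is exactly one second, or 1000000 microseconds. Note that osAiSetFrequency returns the rate the hardware actually uses, which can differ slightly from the rate requested.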

Streaming Audio and DMA

There’s a function in there which is not explained, audio_dma_new. This function is called to issue DMAs to read audio samples from the cartridge. I think you can leave it out if you keep your samples in RAM! However, I didn’t figure out how to indicate that my samples are in RAM, so instead, I figured out how to issue the DMAs. It turns out to be really simple.

  1. The audio library will issue requests for data from the cartridge, and pass the data offset and size to your callback.
  2. Your callback returns a pointer to RAM, and makes sure that the sample data is loaded at that location in RAM by the time you run the audio microcode.

The audio_dma_new function creates a “new” DMA handler, but it’s unnecessary to maintain DMA handler state, so I just return my callback for issuing DMA requests.

static ALDMAproc audio_dma_new(void *arg) {
  (void)arg;
  return audio_dma_callback;
}

The way that my DMA callback works is simple.

  1. Keep a number of different buffers in RAM for holding data from the cartridge, and record the corresponding offset where the data in the buffers was loaded from. So the code records something like, “buffer #3 has data from cartridge offset 0xaf123.”

  2. If a request comes in for data that is already in RAM, then return a pointer to that buffer.

  3. If a request comes in for new data, pick an unused buffer and issue a DMA request to fill that buffer with the correct data.

  4. Keep track of a buffer’s age—how long ago the buffer was last used. A buffer with age 0 is being used for the current frame, and cannot be reused. A buffer with age 1 was used last frame, 2 was used two frames ago, etc. When picking an unused buffer, pick the oldest buffer.
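The four steps above can be modeled on a desktop machine without any SDK calls. The sketch below keeps only the metadata and the decision logic: it reports which buffer serves a request and whether a DMA would be needed. The names buf_info, next_frame, and request are made up for this example, and the actual DMA is left out.

```c
#include <assert.h>
#include <stdint.h>

enum { BUF_COUNT = 8, BUF_SIZE = 2048 };

struct buf_info {
  uint32_t age;    // 0 = used this frame, 1 = used last frame, ...
  uint32_t offset; // cartridge offset of the first byte in the buffer
};

// Step 4: at the start of each frame, every buffer gets one frame older.
static void next_frame(struct buf_info *bufs) {
  for (int i = 0; i < BUF_COUNT; i++) {
    bufs[i].age++;
  }
}

// Steps 2 and 3: return the index of the buffer that serves the request
// [start, start + len), or -1 if every buffer is in use this frame.
// *needs_dma is set to 1 when the caller must load the buffer from ROM.
static int request(struct buf_info *bufs, uint32_t start, uint32_t len,
                   int *needs_dma) {
  assert(len <= BUF_SIZE);
  int oldest = -1;
  uint32_t oldest_age = 0;
  for (int i = 0; i < BUF_COUNT; i++) {
    uint32_t bstart = bufs[i].offset;
    if (bstart <= start && start + len <= bstart + BUF_SIZE) {
      bufs[i].age = 0; // Hit: the data is already in RAM.
      *needs_dma = 0;
      return i;
    }
    if (bufs[i].age > oldest_age) {
      oldest = i;
      oldest_age = bufs[i].age;
    }
  }
  if (oldest < 0) {
    *needs_dma = 0; // Every buffer has age 0, nothing can be reused.
    return -1;
  }
  // Miss: reuse the oldest buffer (step 1 records the new offset).
  bufs[oldest].age = 0;
  bufs[oldest].offset = start;
  *needs_dma = 1;
  return oldest;
}
```

Calling next_frame once per audio frame and request once per callback invocation reproduces the policy. With eight buffers, a ninth distinct request in the same frame has nowhere to go, which is the "DMA buffer in use" case.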

I use a fixed size for the DMA buffers, so the metadata structure is very small: just two fields to track the age of the buffer and the offset in ROM of the data it contains.

enum {
  AUDIO_DMA_COUNT = 8,
  AUDIO_DMA_BUFSZ = 2 * 1024,
};

struct audio_dmainfo {
  uint32_t age;
  uint32_t offset;
};

// DMA buffer metadata.
static struct audio_dmainfo audio_dma[AUDIO_DMA_COUNT];

// DMA buffer contents.
static u8 audio_dmabuf[AUDIO_DMA_COUNT][AUDIO_DMA_BUFSZ]
    __attribute__((aligned(16), section("uninit")));

The DMA handler just loops over the DMA buffers looking for the requested data. From my observations, a request is likely to find a buffer that already contains the requested data, because each DMA loads 2 KiB at a time and the audio library requests small chunks.
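To put a rough number on that, the sketch below counts how many DMAs a single sound needs when it is read sequentially in fixed-size chunks. The chunk size is a made-up parameter, since the audio library's real request sizes are not something I measured precisely, and count_dmas is a hypothetical helper that models only one buffer.

```c
#include <assert.h>
#include <stdint.h>

enum { BUF_SIZE = 2048 };

// Count the DMAs needed to read `total` bytes starting at `start` in
// sequential requests of `chunk` bytes, when each miss loads BUF_SIZE
// bytes beginning at the (2-byte aligned) request address.
static unsigned count_dmas(uint32_t start, uint32_t total, uint32_t chunk) {
  assert(chunk > 0 && chunk <= BUF_SIZE - 1);
  unsigned dmas = 0;
  uint32_t buf_start = 0, buf_end = 0; // Empty buffer: nothing loaded yet.
  for (uint32_t pos = start; pos < start + total; pos += chunk) {
    uint32_t end = pos + chunk;
    if (!(buf_start <= pos && end <= buf_end)) {
      buf_start = pos & ~1u; // Miss: load a fresh 2 KiB window.
      buf_end = buf_start + BUF_SIZE;
      dmas++;
    }
  }
  return dmas;
}
```

With 64-byte requests, each 2 KiB load satisfies 32 requests, so streaming 64 KiB takes 32 DMAs for 1024 requests. Under that assumption, about 97% of requests are hits.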

I use a circular buffer of OSIoMesg for outstanding DMA requests.

static OSIoMesg audio_dmamsg[AUDIO_DMA_COUNT];
static OSMesgQueue audio_dmaqueue;
static OSMesg audio_dmaqueue_buffer[AUDIO_DMA_COUNT];
static unsigned audio_dmanext, audio_dmanactive;

static s32 audio_dma_callback(s32 addr, s32 len, void *state) {
  (void)state;
  struct audio_dmainfo *restrict dma = audio_dma;
  // Start and end address of requested data (cartridge address).
  uint32_t astart = addr, aend = astart + len;

  // If these samples are already buffered, return the buffer.
  int oldest = 0;
  uint32_t oldest_age = 0;
  for (int i = 0; i < AUDIO_DMA_COUNT; i++) {
    if (dma[i].age > oldest_age) {
      oldest = i;
      oldest_age = dma[i].age;
    }
    uint32_t dstart = dma[i].offset,
             dend = dstart + AUDIO_DMA_BUFSZ;
    if (dstart <= astart && aend <= dend) {
      dma[i].age = 0;
      uint32_t offset = astart - dstart;
      return K0_TO_PHYS(audio_dmabuf[i] + offset);
    }
  }

  // Otherwise, use the oldest buffer to start a new DMA.
  if (oldest_age == 0 || audio_dmanactive >= AUDIO_DMA_COUNT) {
    // If the buffer is in use, don't bother.
    fatal_error(
        "DMA buffer in use"); // FIXME: not in release builds
    return K0_TO_PHYS(audio_dmabuf[oldest]);
  }
  uint32_t dma_addr = astart & ~1u;
  OSIoMesg *restrict mesg = &audio_dmamsg[audio_dmanext];
  audio_dmanext = (audio_dmanext + 1) % AUDIO_DMA_COUNT;
  audio_dmanactive++;
  *mesg = (OSIoMesg){
      .hdr = {.pri = OS_MESG_PRI_NORMAL,
              .retQueue = &audio_dmaqueue},
      .dramAddr = audio_dmabuf[oldest],
      .devAddr = dma_addr,
      .size = AUDIO_DMA_BUFSZ,
  };
  osEPiStartDma(rom_handle, mesg, OS_READ);
  dma[oldest] = (struct audio_dmainfo){
      .age = 0,
      .offset = dma_addr,
  };
  return K0_TO_PHYS(audio_dmabuf[oldest] + (astart & 1u));
}

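Afterwards, clean up the DMA notifications by draining the completion messages from the queue, so that the count of active DMAs stays accurate.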

int nactive = audio_dmanactive;
for (;;) {
  OSMesg mesg;
  int r = osRecvMesg(&audio_dmaqueue, &mesg, OS_MESG_NOBLOCK);
  if (r == -1) {
    break;
  }
  nactive--;
}
audio_dmanactive = nactive;
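The bookkeeping for the message slots can be modeled in isolation. The sketch below mirrors audio_dmanext and audio_dmanactive with a pair of counters. It assumes that DMAs complete in the order they were issued, which is what makes it safe to hand out slots round-robin. The names dma_ring, ring_acquire, and ring_complete are made up for this example.

```c
#include <assert.h>

enum { SLOT_COUNT = 8 };

struct dma_ring {
  unsigned next;   // Slot to hand out for the next DMA request.
  unsigned active; // Number of requests issued but not yet completed.
};

// Reserve a message slot for a new DMA. Returns the slot index, or -1
// if all slots are still waiting on outstanding DMAs.
static int ring_acquire(struct dma_ring *r) {
  if (r->active >= SLOT_COUNT) {
    return -1;
  }
  int slot = (int)r->next;
  r->next = (r->next + 1) % SLOT_COUNT;
  r->active++;
  return slot;
}

// Record that `completed` DMAs have finished, for example after
// draining that many messages from the completion queue.
static void ring_complete(struct dma_ring *r, unsigned completed) {
  assert(completed <= r->active);
  r->active -= completed;
}
```

Because the oldest outstanding request always sits `active` slots behind `next`, a slot is only reused once the request that last occupied it has been counted as complete.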

Audio Tasks

I now have everything in place to generate audio tasks. I’m omitting some of the bookkeeping here… There are two audio task structures, and the audio code switches between them, so current_task = 0, 1, 0, 1, …. There are three audio buffers, and the audio code cycles through them, so current_buffer = 0, 1, 2, 0, 1, 2, …. The audio_frame function will be called as soon as current_task is available and current_buffer is available.
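One way to do the omitted bookkeeping is with a bitmask of busy resources. This is a sketch of the idea, not the actual code: taskmask and buffermask here are hypothetical stand-ins for audio_taskmask and audio_buffermask, and the bit layout is an assumption.

```c
#include <assert.h>

enum { TASK_COUNT = 2, BUFFER_COUNT = 3 };

// Bits 0-1 track the task structures, bits 2-4 track the audio buffers.
static unsigned taskmask(int task) { return 1u << task; }
static unsigned buffermask(int buffer) {
  return 1u << (TASK_COUNT + buffer);
}

struct audio_state {
  unsigned busy;      // Set bits are resources still owned by the scheduler.
  int current_task;   // Cycles 0, 1, 0, 1, ...
  int current_buffer; // Cycles 0, 1, 2, 0, 1, 2, ...
};

// A new frame can start only when both the current task structure and
// the current buffer are free.
static int can_start_frame(const struct audio_state *s) {
  unsigned need =
      taskmask(s->current_task) | buffermask(s->current_buffer);
  return (s->busy & need) == 0;
}

// Mark the current task and buffer busy, then advance to the next pair.
static void start_frame(struct audio_state *s) {
  assert(can_start_frame(s));
  s->busy |= taskmask(s->current_task) | buffermask(s->current_buffer);
  s->current_task = (s->current_task + 1) % TASK_COUNT;
  s->current_buffer = (s->current_buffer + 1) % BUFFER_COUNT;
}

// Called when an audio event arrives carrying the mask of the resource
// that has just been released.
static void on_audio_event(struct audio_state *s, unsigned mask) {
  s->busy &= ~mask;
}
```

In this model, the main loop would call start_frame and then generate a frame whenever can_start_frame returns true, and on_audio_event whenever a completion message comes back with a task or buffer mask.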

enum {
  // Number of FRAMES in an audio buffer (2 samples per frame).
  AUDIO_BUFSZ = 1024,
  AUDIO_CLIST_SIZE = 4 * 1024,
};

static struct scheduler_task audio_tasks[2];
static Acmd audio_cmdlist[2][AUDIO_CLIST_SIZE];
static int16_t audio_buffers[3][2 * AUDIO_BUFSZ]
    __attribute__((aligned(16), section("uninit")));

void audio_frame(struct scheduler *sc, OSMesgQueue *queue) {
  // Increase the age of all sample buffers.
  for (int i = 0; i < AUDIO_DMA_COUNT; i++) {
    audio_dma[i].age++;
  }

  // Create the command list.
  int16_t *buffer = audio_buffers[current_buffer];
  s32 cmdlen = 0;
  Acmd *al_start = audio_cmdlist[current_task];
  Acmd *al_end = alAudioFrame(
      al_start, &cmdlen, (s16 *)K0_TO_PHYS(buffer), AUDIO_BUFSZ);
  if (al_end - al_start > AUDIO_CLIST_SIZE) {
    fatal_error("Audio command list overrun\nsize=%td",
                al_end - al_start);
  }

  // Create and submit the task.
  struct scheduler_task *task = &audio_tasks[current_task];
  if (cmdlen == 0) {
    // If the task is empty, just zero the buffer.
    // This probably shouldn't happen.
    bzero(buffer, 4 * AUDIO_BUFSZ);
    osWritebackDCache(buffer, 4 * AUDIO_BUFSZ);
    task->flags = SCHEDULER_TASK_AUDIOBUFFER;
  } else {
    task->flags = SCHEDULER_TASK_AUDIO | SCHEDULER_TASK_AUDIOBUFFER;
    task->task = (OSTask){{
        .type = M_AUDTASK,
        .flags = OS_TASK_DP_WAIT,
        .ucode_boot = (u64 *)rspbootTextStart,
        .ucode_boot_size =
            (uintptr_t)rspbootTextEnd - (uintptr_t)rspbootTextStart,
        .ucode_data = (u64 *)aspMainDataStart,
        .ucode_data_size = SP_UCODE_DATA_SIZE,
        .ucode = (u64 *)aspMainTextStart,
        .ucode_size = SP_UCODE_SIZE,
        .dram_stack = NULL,
        .dram_stack_size = 0,
        .data_ptr = (u64 *)al_start,
        .data_size = sizeof(Acmd) * (al_end - al_start),
    }};
    osWritebackDCache(al_start, sizeof(Acmd) * (al_end - al_start));
  }
  task->done_queue = queue;
  task->done_mesg = event_pack((struct event_data){
      .type = EVENT_AUDIO,
      .value = audio_taskmask(current_task),
  });
  task->data.audiobuffer = (struct scheduler_audiobuffer){
      .ptr = buffer,
      .size = 4 * AUDIO_BUFSZ,
      .done_queue = queue,
      .done_mesg = event_pack((struct event_data){
          .type = EVENT_AUDIO,
          .value = audio_buffermask(current_buffer),
      }),
  };
  scheduler_submit(sc, task);
}

In my program, the audio command list has 493 commands in it. That seems like a lot! Be sure to make your command buffer large enough.
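For a sense of scale, here is the sizing arithmetic as a small C sketch. It assumes each Acmd is 8 bytes (one 64-bit word), which matches my understanding of the SDK's definition. The helper names are made up for this example.

```c
#include <assert.h>

enum {
  ACMD_SIZE = 8,          // Assumed size of one Acmd, in bytes.
  CLIST_SIZE = 4 * 1024,  // Commands per list, as in AUDIO_CLIST_SIZE.
};

// Bytes occupied by a command list containing `count` commands.
static unsigned clist_bytes(unsigned count) {
  assert(count <= CLIST_SIZE);
  return count * ACMD_SIZE;
}

// Percentage of the command buffer used, rounded down.
static unsigned clist_percent_used(unsigned count) {
  assert(count <= CLIST_SIZE);
  return count * 100 / CLIST_SIZE;
}
```

Under that assumption, 493 commands occupy 3944 bytes, about 12% of a 4096-command list, and each of the two lists reserves 32 KiB of RAM.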

It works!

The program plays a kazoo fanfare when it launches, and does nothing else.

Audio Demo rev 121 108 kB

Comparison to Sample Programs

The sample programs that come with the Nintendo 64 SDK use a linked list of DMA buffers and keep the list sorted by ROM address. To me, this seems like unnecessary complexity. A linear scan through the DMA buffers should be quite fast. After all, the VR4300 does have a data cache, and the metadata for all eight buffers is only 64 bytes.