Nintendo 64 console with EverDrive cartridge

Nintendo 64 Part 11: Scheduling RCP Tasks

Nov 3, 2020, Nintendo 64, Programming

Better Performance

Conceptually, the main loop looks like this:

for (;;) {
  update();
  render();
  swap_buffer();
  wait_for_retrace();
}

However, writing out a loop this way is wasteful, since different processors in the system will spend a significant amount of time waiting for other parts of the system to complete tasks. Here is how drawing frames look using my simple code, right now:

Diagram of operations and threads in my simple Nintendo 64 game

This is wasteful. The CPU spends time waiting for the RSP and RDP to finish the graphics task and then waits for the retrace notification. The RSP and RDP spend time waiting for the CPU to issue more instructions.

What I want is to have multiple frames in flight at any given point, using both the CPU and RCP simultaneously:

Diagram of operations and threads in a more efficient Nintendo 64 game

This is recommended by the Nintendo 64 development manual, and it’s also how modern graphics cards work, more or less.

LibUltra comes with a way of doing this called the scheduler, but I’m not entirely happy with the scheduler API and how it structures events. Instead, I’m going to write my own scheduler. It turns out that this only takes about 130 lines of code, not counting the header file, and it is fairly easy to understand.

Scheduler

My scheduler accepts tasks over a queue. The task consists of two parts: an RCP task (OSTask structure) and a framebuffer. These are processed in sequence. First, the scheduler runs the RCP task, then the scheduler swaps the framebuffer pointer. Both the RCP task and the framebuffer come with a message to send when they finish.

// A framebuffer to be displayed on screen.
struct scheduler_framebuffer {
  // Switch to this framebuffer.
  void *ptr;

  // Send this message to this queue after the framebuffer is no
  // longer on screen, after another framebuffer has replaced it.
  OSMesgQueue *done_queue;
  OSMesg done_mesg;
};

// A task to be scheduled on the RCP.
struct scheduler_task {
  // RCP task to run.
  OSTask task;

  // Send this message to this queue when the RCP task is done.
  OSMesgQueue *done_queue;
  OSMesg done_mesg;

  // The framebuffer to be displayed on screen after the RCP task is
  // done.
  struct scheduler_framebuffer framebuffer;
};

The scheduler thread recieves messages from two queues. One queue is the event queue, which delivers events from the OS, like notifacitions that the RDP task is done or that the framebuffer has been swapped. The other queue is the task queue which recieves task pointers from the main thread.

The main scheduler loop looks like this:

struct scheduler *sc = ...;
for (;;) {
  // Process the next event from the event queue.
  OSMesg mesg;
  osRecvMesg(&sc->evt_queue, &mesg, OS_MESG_BLOCK);
  switch ((uintptr_t)mesg) {
  case EVT_TASK:
    // No-op, just wakes up the scheduler.
    break;

  case EVT_RDP:
    ...

  case EVT_VSYNC:
    ...
  }

  // Read all tasks from the task queue.
  read_tasks(sc);

  // If we are idle and there is a task pending, run it.
  if (sc->task_running == NULL && sc->pending_count > 0) {
    struct scheduler_task *task = dequeue_task(sc);
    osSpTaskLoad(&task->task);
    osSpTaskStartGo(&task->task);
    sc->task_running = task;
  }
}

Note that rather than putting pointers to real objects in the event queue, each event is just a number which is cast to a pointer.

When the task finishes, the OS will send an EVT_RDP to the event queue:

case EVT_RDP: {
  struct scheduler_task *restrict task = sc->task_running;
  sc->task_running = NULL;
  if (task == NULL) {
    fatal_error("NULL task");
  }
  if (task->framebuffer.ptr != NULL) {
    unsigned pending = sc->framebuffers_pending;
    if (pending >= 2) {
      fatal_error("Framebuffer overflow");
    }
    if (pending == 0) {
      osViSwapBuffer(task->framebuffer.ptr);
      // No previous frame, so the screen is black, and we need to
      // disable that.
      if (sc->framebuffers[0].ptr == NULL) {
        osViBlack(false);
      }
    }
    pending++;
    sc->framebuffers[pending] = task->framebuffer;
    sc->framebuffers_pending = pending;
  }
  if (task->done_queue != NULL) {
    int r = osSendMesg(task->done_queue, task->done_mesg,
                       OS_MESG_NOBLOCK);
    if (r != 0) {
      fatal_error("Dropped RCP task done message");
    }
  }
} break;

When the RSP task finishes, it is possible that the previous frame’s framebuffer is not on screen yet. The code must either drop one of the frames (which means we rendered a frame that won’t get shown to the user) or use a queue (which means some additional latency). I chose to hold the framebuffer in a queue, so the code only calls osViSwapBuffer if there are no other pending framebuffers.

The scheduler should only block waiting for OS events, so it’s a fatal error if sending an event back to the main thread. The main thread must create a queue large enough to hold all possible events in flight.

When the retrace happens, all of the framebuffers move up in the queue. Note that the queue has three entries. Entry 0 is the framebuffer currently on-screen, and it’s stored by the scheduler so the scheduler can notify the main thread when the framebuffer is no longer in use. Entry 1 is the next framebuffer the VI will display, which has already been passed to osViSwapBuffer(). Entry 2 is an additional queued buffer. Here is the scheduler’s vertical retrace handler:

case EVT_VSYNC: {
  unsigned pending = sc->framebuffers_pending;
  if (pending > 0) {
    struct scheduler_framebuffer *restrict fb = sc->framebuffers;
    if (fb[0].done_queue != NULL) {
      int r = osSendMesg(fb[0].done_queue, fb[0].done_mesg,
                         OS_MESG_NOBLOCK);
      if (r != 0) {
        fatal_error("Dropped VSYNC message");
      }
    }
    fb[0] = fb[1];
    fb[1] = fb[2];
    fb[2] = (struct scheduler_framebuffer){0};
    if (pending > 1) {
      osViSwapBuffer(fb[1].ptr);
    }
    sc->framebuffers_pending = pending - 1;
  }
} break;

Main Thread

The main thread now has to be modified to use the scheduler and to put two RCP tasks and framebuffers in flight. Here is the main thread state:

// State of the main thread.
struct main_state {
  // Whether each task is running.
  bool task_running[2];
  // Whether each framebuffer is being used.
  bool framebuffer_in_use[2];
  // Whether a controller read is in progress.
  bool controler_read_active;
  // The tasks.
  struct scheduler_task tasks[2];
  // Whether we have a controller connected.
  bool has_controller;
  // Which controller is connected.
  int controller_index;

  // Queue receiving scheduler / controller events.
  OSMesgQueue evt_queue;
  OSMesg evt_buffer[16];
};

The event queue is used both for messages from the scheduler and messages from the SI when it reads controller state. Processing an event just involves updating the main thread’s state and is fairly simple, and it is designed to work in both blocking and non-blocking mode for reasons I’ll explain below:

// Read the next event sent to the main thread and process it.
static int process_event(struct main_state *restrict st,
                         int flags) {
  OSMesg evt;
  int ret = osRecvMesg(&st->evt_queue, &evt, flags);
  if (ret != 0)
    return ret;
  int type = event_type(evt);
  int value = event_value(evt);
  switch (type) {
  case EVT_CONTROL:;
    OSContPad controller_state[MAXCONTROLLERS];
    osContGetReadData(controller_state);
    st->controler_read_active = false;
    struct game_state *gs = &game_state;
    if (st->has_controller) {
      gs->controller = controller_state[st->controller_index];
    }
    break;
  case EVT_TASKDONE:
    st->task_running[value] = false;
    break;
  case EVT_FBDONE:
    st->framebuffer_in_use[value] = false;
    break;
  }
  return 0;
}

The main loop itself now just alternates between the two sets of buffers and tasks, and processes events until a task and framebuffer are free:

// Main loop.
for (int current_task = 0;; current_task ^= 1) {
  // Wait until the task and framebuffer are both free to use.
  while (st->task_running[current_task])
    process_event(st, true);
  while (st->framebuffer_in_use[current_task])
    process_event(st, true);
  while (process_event(st, false) == 0) {
  }

  // Read the controllers if we are not already doing so.
  if (!st->controler_read_active) {
    osContStartQuery(&st->evt_queue);
    st->controler_read_active = true;
  }

  // Update game state, create display lists.
  ...

  struct scheduler_task *task = &st->tasks[current_task];
  *task = (struct scheduler_task){
      .task = {{ ... /* RSP task */ }},
      .done_queue = &st->evt_queue,
      .done_mesg = make_event(EVT_TASKDONE, current_task),
      .framebuffer =
          {
              .ptr = framebuffers[current_task],
              .done_queue = &st->evt_queue,
              .done_mesg = make_event(EVT_FBDONE, current_task),
          },
  };
  scheduler_submit(&scheduler, task);
  st->task_running[current_task] = true;
  st->framebuffer_in_use[current_task] = true;
}

This code works!

Note that it is fairly trivial to modify this code to use triple-buffering (Wikipedia: Multiple buffering). All I have to do is add a third buffer, and keep track of an additional variable, current_buffer in addition to current_task. The extra framebuffer would consume extra memory (and the Nintendo 64 does not have much to begin with), but it would allow the main thread to issue RCP tasks while two buffers are in use, increasing the framerate.

Printing to the Screen

I’ve been suffering from an inability to get information about the program’s state to the screen for debugging and testing, so I’ll take the text drawing code I wrote for fatal_error() and adapt it so text can be written on top of the game screen with some variation of printf().

Screenshot of game with frame counter — Printf in action, imagine the possibilities!

Crashes and Assembler Shenanigans

At this point, my code was working with optimizations disabled but enabling optimizations resulted in a crash in the scheduler. It looked like global variables were not zeroed, and I thought this was caused by a careless buffer overrun somewhere else in the code. However, when I printed out the value of global variables at the beginning of main() and saw that they were already corrupted, I realized that something else was going on.

I started looking at the disassembly, looking for the values that had appeared unexpectedly in RAM. Maybe I could at least figure out where they were coming from. However, I noticed something odd about the _start routine:

$ mips32-elf-objdump -d bazel-bin/game/Thornmarked.elf
80000400 <_start>:
80000400:       3c1d8007        lui     sp,0x8007
80000404:       27bdebb0        addiu   sp,sp,-5200
80000408:       3c048001        lui     a0,0x8001
8000040c:       2484fd90        addiu   a0,a0,-624
80000410:       3c058007        lui     a1,0x8007
80000414:       24a5c7a5        addiu   a1,a1,-14427
80000418:       0c00144c        jal     80005130 <bzero>
8000041c:       00000000        nop
80000420:       00a42823        subu    a1,a1,a0
80000424:       080005b1        j       800016c4 <boot>
80000428:       00000000        nop
8000042c:       00000000        nop

There’s extra nop instructions added! Why? Shouldn’t disassembly correspond 1:1 with my original assembly? Here is the original code, for comparison, from part 5:

# Entry point for the ROM image.
        .section .text.entry
        .global _start
_start:

        # Set up the stack
        la      $sp, _boot_stack

        # Zero BSS.
        la      $a0, _bss_start
        la      $a1, _bss_end
        jal     bzero
        subu    $a1, $a1, $a0

        # Jump to C entry point.
        j       boot
        nop

It turns out that the MIPS assembler reorders instructions to use branch delay slots, avoid load hazards, and that sort of thing. I guess it’s a “smart” assembler. I could turn this feature off with the directive .set noreorder, which is what GCC uses, or I could just move my code out of the delay slots. I chose to just move my code out of the delay slots. On one hand, this might be confusing because any MIPS I write will look different from the MIPS I see in disassembly, but this is already true for pseudo-instructions like la. On the other hand, the assembler will probably do a better job than me. Here is the updated start routine:

# Entry point for the ROM image.
        .section .text.entry
        .global _start
_start:

        # Set up the stack
        la      $sp, _boot_stack

        # Zero BSS.
        la      $a0, _bss_start
        la      $a1, _bss_end
        subu    $a1, $a1, $a0
        jal     bzero

        # Jump to C entry point.
        j       boot

And after checking, I will note that this is not documented anywhere in the GNU assembler manual! You can find it in other manuals for other MIPS assemblers, ask someone, or figure it out from GCC’s output which uses .set noreorder.

The other lesson is that I may want to test both with and without optimizations enabled.