Nintendo 64 console with EverDrive cartridge

Nintendo 64 Part 6: Hardware Acceleration

Nintendo 64, Programming

Time to use hardware acceleration for our graphics!

Detour: Building with Bazel

I’ve switched the build system to Bazel. There are a lot of reasons why I like Bazel, but the main reasons are:

To get Bazel working for Nintendo 64, I went through the following steps:

  1. Create a new cc_toolchain_suite containing the system toolchain for CPU k8, which is Bazel’s way of saying “x86_64”, and a custom cc_toolchain toolchain for CPU n64.
  2. Copy the cc_toolchain_config rule out of the @local_config_cc//:cc_toolchain_config.bzl file, which can be found in the bazel-${projectname}/external directory after you build a C or C++ target.
  3. Edit the toolchain configuration to use our cross-compiler, and adjust the flags as necessary to compile for the Nintendo 64. Remove anything we can’t use, like FDO and dynamic linking.
  4. Add a repository rule for the Nintendo 64 SDK, and create cc_import and cc_library rules for LibUltra and the various microcode object files.
  5. Create a cc_binary rule for our ELF file, linking in all the code and using our linker script.
  6. Create a genrule that runs objcopy and makemask.

Done! This didn’t take much time, but this also isn’t the first time I’ve written a cc_toolchain_config rule.

I’m a bit hesitant to recommend Bazel because of its steep learning curve, and because it’s hard to find good, up-to-date documentation for custom toolchains. I like Bazel, and that’s good enough.

High-Level Graphics Overview

Here is a rough diagram of the high-level components of a Nintendo 64 system which are relevant to graphics acceleration. Note that there are omissions and simplifications.

Partial system diagram of Nintendo 64
Nintendo 64 high-level graphics architecture

The various components are:

Note that the framebuffer itself is just an area in DRAM. There is no particular way that you are forced to draw to the framebuffer. Any system capable of writing to DRAM can write graphics! Some possibilities:

Our earlier code just wrote to the framebuffer directly from the CPU because it was simpler, and now we will achieve the same effect by issuing commands to the RDP. How do we do that?

Normally, what you do is employ one of the programs that SGI or Nintendo wrote for the RSP, called microcode. These programs take a series of commands called a display list as input and convert them into commands for the RDP. You can send the output from the RSP to the RDP in three different ways:

The generally recommended microcode program for 3D graphics is called F3DEX2. Its name means something like “Fast 3D Series Improved Version Level 2”. F3DEX2 was preceded by F3DEX, which evolved from Fast3D and Line3D and other earlier microcode programs in the SDK. If you were a developer earlier in the Nintendo 64 lifecycle, it wouldn’t be available yet, and you’d use one of the earlier options. This program is a binary blob in the SDK that you include in your game.

The open-source alternatives are still only 2D at the time of this writing, but Hazematman is making excellent progress writing microcode for libhfx, which is an open-source 3D library for Nintendo 64.

We are also going to use the XBUS version of the F3DEX2 program, since the XBUS is the simplest way to communicate between the RSP and RDP, and does not require us to allocate any buffers for FIFOs.

Using the RDP

We are going to use F3DEX2 to clear the framebuffer. Not very exciting, but progress comes one step at a time! I mean… if your first program was some fancy 3D program, and it didn’t work, how would you go about debugging it?

The first step is some boilerplate display lists that initialize the RSP and RDP. A display list is just a sequence of commands that are read by the RSP microcode. These are taken from the documentation.

// Viewport scaling parameters.
static const Vp viewport = {{
    .vscale = {SCREEN_WIDTH * 2, SCREEN_HEIGHT * 2, G_MAXZ / 2, 0},
    .vtrans = {SCREEN_WIDTH * 2, SCREEN_HEIGHT * 2, G_MAXZ / 2, 0},
}};

// Initialize the RSP.
static const Gfx rspinit_dl[] = {
    gsSPClearGeometryMode(G_SHADE | G_SHADING_SMOOTH | G_CULL_BOTH |
                          G_FOG | G_LIGHTING | G_TEXTURE_GEN |
                          G_TEXTURE_GEN_LINEAR | G_LOD),
    gsSPTexture(0, 0, 0, 0, G_OFF),

// Initialize the RDP.
static const Gfx rdpinit_dl[] = {
    gsDPSetRenderMode(G_RM_NOOP, G_RM_NOOP2),

A third display list clears the screen. It will need to be modified at runtime to update the color and the pointer to the framebuffer.

// Clear the color framebuffer.
static Gfx clearframebuffer_dl[] = {
    gsDPSetColorImage(G_IM_FMT_RGBA, G_IM_SIZ_16b, SCREEN_WIDTH,
    gsDPFillRectangle(0, 0, SCREEN_WIDTH - 1, SCREEN_HEIGHT - 1),

Next, we create an OSTask structure to describe the task that the RSP will perform. This structure contains pointers to the code and data for the RSP. This is much easier to read with the designated initializer syntax introduced in C99.

In particular, note that the stack that we give to the RSP is aligned to 16 bytes, which is the size of a cache line. If it is not aligned to 16 bytes, you can get cache tearing, which is what happens when the same cache line is used both by the CPU and RSP, and a CPU cache writeback overwrites data from the RSP that happens to share the same cache line.
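The alignment requirement can be checked on any host compiler. The sketch below is not the article's code: `demo_stack`, `DEMO_STACK_SIZE`, and `owns_whole_cache_lines` are hypothetical names, and it assumes C11 `_Alignas` (with older GCC, `__attribute__((aligned(16)))` would play the same role). The idea is that a buffer is safe from sharing a cache line with a neighboring variable only if both its start address and its size are multiples of the 16-byte line size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
  CACHE_LINE_SIZE = 16,   /* data cache line size, as stated in the text */
  DEMO_STACK_SIZE = 1024, /* hypothetical size, mirrors SP_STACK_SIZE */
};

/* Starts on a cache line boundary and is a whole number of lines long,
   so no other variable can share its first or last cache line. */
static _Alignas(16) uint64_t demo_stack[DEMO_STACK_SIZE / sizeof(uint64_t)];

/* Returns 1 if [ptr, ptr + size) covers only complete cache lines. */
static int owns_whole_cache_lines(const void *ptr, size_t size) {
  assert(ptr != NULL);
  return (uintptr_t)ptr % CACHE_LINE_SIZE == 0 &&
         size % CACHE_LINE_SIZE == 0;
}

/* Convenience check for the demo buffer above. */
static int demo_stack_is_safe(void) {
  return owns_whole_cache_lines(demo_stack, sizeof(demo_stack));
}
```

A buffer that starts 8 bytes into a line, or whose length is 24 bytes, fails the check, because part of a line would be left over for some other CPU-owned variable.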

Also note that the OSTask structure contains a field named output_buff_size which is a misleading name! It is not necessarily a size. Instead, it is actually a parameter to the RSP microcode that you are using, and different microcode programs interpret the parameter differently. The documentation for osSpTaskStart covers this, although I am not sure that the documentation is accurate.

enum {
  SP_STACK_SIZE = 1024,

static u64 sp_dram_stack[SP_STACK_SIZE / 8]

static OSTask tlist = {{
    .type = M_GFXTASK,
    .flags = OS_TASK_DP_WAIT,
    .ucode = (u64 *)gspF3DEX2_xbusTextStart,
    .ucode_size = SP_UCODE_SIZE,
    .ucode_data = (u64 *)gspF3DEX2_xbusDataStart,
    .ucode_data_size = SP_UCODE_DATA_SIZE,
    .dram_stack = sp_dram_stack,
    .dram_stack_size = sizeof(sp_dram_stack),

static void main(void *arg) {
  tlist.t.ucode_boot = (u64 *)rspbootTextStart;
  tlist.t.ucode_boot_size =
      (uintptr_t)rspbootTextEnd - (uintptr_t)rspbootTextStart;

Finally, we are ready to run the RSP task. Some notes:

Data used by the RSP must be flushed to DRAM with osWritebackDCache.

Addresses passed to the RSP must be translated from virtual addresses, which the CPU uses, to segment addresses, which the RSP uses. In order to do this, you must define segments with gSPSegment() for the parts of DRAM that you use. Once the segments are defined, the translation happens automatically. The example programs define multiple segments, but the docs explain that this is unnecessary, and you can just use one segment!

This is because of the structure of segment addresses. A segment address contains a 4-bit segment ID and a 24-bit segment offset. The address is translated to a physical address by taking the segment base address for the segment identified by the segment ID and adding the segment offset. But the Nintendo 64 has at most 8 MiB of RAM (if the Expansion Pak is present), which fits within the 24-bit segment offset.
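The translation described above can be modeled in a few lines of C. This is only a sketch of the arithmetic, not the microcode's actual implementation: `segment_base`, `set_segment`, and `segment_to_physical` are hypothetical names, and the table stands in for whatever state gSPSegment() sets up on the RSP.

```c
#include <assert.h>
#include <stdint.h>

enum { SEGMENT_COUNT = 16 }; /* a 4-bit segment ID selects one of 16 entries */

/* Hypothetical model of the segment table: one base address per segment. */
static uint32_t segment_base[SEGMENT_COUNT];

/* Models defining a segment: record its base physical address. */
static void set_segment(unsigned id, uint32_t base) {
  assert(id < SEGMENT_COUNT);
  segment_base[id] = base;
}

/* Splits a segment address into ID and offset, then adds the offset to
   the base of the selected segment. */
static uint32_t segment_to_physical(uint32_t addr) {
  unsigned id = (addr >> 24) & 0xf;    /* 4-bit segment ID */
  uint32_t offset = addr & 0x00ffffff; /* 24-bit segment offset */
  return segment_base[id] + offset;
}
```

With segment 2 based at 0x00200000, the segment address 0x02000100 resolves to 0x00200100. With segment 0 based at 0, any address below 16 MiB resolves to itself, which is why one segment is enough for 8 MiB of RAM.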

This means you can just take the easy way out and define one segment with a base address of 0. This is even explained in the docs, and by examination, some games appear to use this technique too—but not the sample programs! From the Nintendo 64 Programming Manual §10.2 Mixing CPU and SP Data:

If the application creates a mapping using segment 0 to a beginning physical address of 0x0, the SP can correctly access objects in DRAM when given a physical address.

This simplifies the situation somewhat, but the SP microcode takes it a step further: Since the upper four bits of a segment address are not used, they are ignored. Thus, an implicit mapping is done from a KSEG0 address to a physical address, and no explicit conversion need be done by the application.

To summarize, as long as an SP segment table mapping is done from segment number 0 to offset 0, CPU KSEG0 addresses can be interpreted correctly by the SP.
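To see why the quoted trick works, it helps to look at the bits of a KSEG0 address. The sketch below uses hypothetical helper names and assumes the usual MIPS layout, where KSEG0 maps physical address P to virtual address 0x80000000 + P. Once the upper four bits are dropped, every address in the first 8 MiB of RAM has segment ID 0 and an offset equal to the physical address.

```c
#include <assert.h>
#include <stdint.h>

/* Builds the KSEG0 virtual address for a physical address
   (KSEG0 covers the first 512 MiB of physical memory). */
static uint32_t kseg0_from_physical(uint32_t phys) {
  assert(phys < 0x20000000u);
  return 0x80000000u | phys;
}

/* Segment ID as the SP would read it: bits 24..27. The upper four
   bits, where the KSEG0 marker 0x8 lives, are ignored. */
static unsigned sp_segment_id(uint32_t addr) {
  return (addr >> 24) & 0xf;
}

/* Segment offset as the SP would read it: the low 24 bits. */
static uint32_t sp_segment_offset(uint32_t addr) {
  return addr & 0x00ffffffu;
}
```

For example, the KSEG0 address 0x80123456 has segment ID 0 and offset 0x123456. The same holds for the last byte of an 8 MiB system, 0x807fffff, so a single mapping of segment 0 to base 0 covers all of RAM.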

You would need to do something more complicated here if you were using virtual memory. The VR4300 does have an MMU with a TLB and user/kernel modes and all you would expect from an MMU… but most games don’t use it.

Here is our code for submitting the task to the RSP:

Gfx glist[16], *glistp = glist;
gSPSegment(glistp++, 0, 0);
gSPDisplayList(glistp++, rdpinit_dl);
gSPDisplayList(glistp++, rspinit_dl);
clearframebuffer_dl[1] = (Gfx)gsDPSetColorImage(
clearframebuffer_dl[3] =
    (Gfx)gsDPSetFillColor(color | (color << 16));
gSPDisplayList(glistp++, clearframebuffer_dl);

osWritebackDCache(&clearframebuffer_dl[1], sizeof(Gfx) * 3);
osWritebackDCache(glist, sizeof(*glist) * (glistp - glist));
tlist.t.data_ptr = (u64 *)glist;
tlist.t.data_size = sizeof(*glist) * (glistp - glist);

osRecvMesg(&rdp_message_queue, NULL, OS_MESG_BLOCK);
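The expression `color | (color << 16)` in the listing above deserves a note. The fill color is a 32-bit value, so with a 16-bit framebuffer it holds two pixels side by side, and the same pixel is written into both halves. The sketch below assumes `color` is a 16-bit RGBA 5/5/5/1 pixel, which matches the G_IM_FMT_RGBA and G_IM_SIZ_16b format used for the color image. `pack_rgba5551` and `fill_color_16` are hypothetical helpers, not SDK functions.

```c
#include <assert.h>
#include <stdint.h>

/* Packs 8-bit color channels and a 1-bit alpha into a 16-bit pixel:
   5 bits red, 5 bits green, 5 bits blue, 1 bit alpha. */
static uint16_t pack_rgba5551(unsigned r, unsigned g, unsigned b,
                              unsigned a) {
  assert(r < 256 && g < 256 && b < 256 && a < 2);
  return (uint16_t)(((r >> 3) << 11) | ((g >> 3) << 6) |
                    ((b >> 3) << 1) | a);
}

/* Duplicates a 16-bit pixel into both halves of a 32-bit fill color.
   The cast to uint32_t keeps the shift out of signed int territory. */
static uint32_t fill_color_16(uint16_t pixel) {
  return (uint32_t)pixel | ((uint32_t)pixel << 16);
}
```

Opaque red packs to 0xF801, so the fill color becomes 0xF801F801.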

It works!