Multi-threading on the Lynx

Wookie · September 11, 2010

So I've posted on here several times my idea for using a streaming loader and data caches for handling game assets on the Lynx. The general idea is that asset references are stored as unique asset ID's that are resolved to pointers at runtime. Part of the resolution process is to kick off an asynchronous load of the asset whenever it is not present in RAM. In my previous descriptions I've always done a lot of hand waving when describing how and when the loader gets run. Well, this evening I sat down and coded up a very basic multi-threading implementation using the NESHLA HLA. (For those of you who don't know, I'm nearing completion of a reimplementation of the NESHLA compiler in Python that I call HLAKit. It is architected in a way that allowed me to easily add support for the Lynx and eventually other CPU's and target systems. The project is here though that version is a bit out of date and incomplete. The wiki for the compiler is fairly up to date with the features of the compiler and the code libraries that come with it. I'll be releasing the full working compiler with Lynx system library in the coming month or so. The biggest leap forward is that my compiler has built in support for properly encrypting the loader code so that the resulting ROM can run on a real Lynx.)

Anyway, on to the code. My goal was to create a simple multi-threading capability so that I could implement the loader as its own thread and the main game loop as its own thread. The 6502 is a tricky little CPU. To make multi-threading happen you have to have separate stacks, one for each thread, some zero page variables to store the stack specific data and which thread is current, and you need to have code that will save and restore thread contexts. To make all of this work, I created a small header file with a bare minimum implementation. This only supports two threads, but could be expanded to support up to 4. I wouldn't try to do more than 4 because there are only 256 bytes of stack space and having less than 64 bytes of stack is flirting with danger.

Here's the code:

/*
* ClassicGameDev.com Example Task Switching
*/

// this function initializes the stack for a given task so that
// we can start the task by restoring back to it as if it had been
// put to sleep previously
inline initialize_task(sp, entry) {
   tsx                 // get current stack pointer
   txa                 // x -> a
   tay                 // a -> y

   ldx     sp          // put task stack pointer in x
   txs                 // set stack pointer to task stack pointer
   lda     hi(entry)   // get high byte of task entry address
   pha                 // push high byte of task entry address on task stack
   lda     lo(entry)   // get low byte of task entry address
   pha                 // push low byte of task entry address on task stack
   lda     #$34        // put standard cpu status flags in a
   pha                 // push standard cpu status flags on task stack
   lda     #$0         // a = 0
   pha                 // push initial a value on stack
   pha                 // push initial x value on stack
   pha                 // push initial y value on stack

   tya                 // y -> a
   tax                 // a -> x
   txs                 // restore the stack pointer
}

// this macro handles storing the current task registers on its stack
inline save_task_context() {
   pha
   txa
   pha
   tya
   pha
}

// this macro handles restoring a task registers from its stack
inline load_task_context() {
   pla                 // pop the y value
   tay                 // initialize y
   pla                 // pop the x value
   tax                 // initialize x
   pla                 // pop the a value
}

// this handles storing the current stack pointer in memory, updating the
// current task index in memory and then loading the new task's stack pointer
inline switch_tasks(cur_task) {
   ldy     cur_task    // load current task index into y
   tsx                 // put the current stack pointer in x
   stx     $00,y       // store current stack pointer into current task sp var

   // this updates the current task index
   lda     #1          // a = 1
   eor     cur_task    // a = cur_task ^ a  (this flips from 0 to 1/1 to 0)
   sta     cur_task    // cur_task = a
  
   tay                 // make y equal the new task index
   ldx     $00,y       // load new task stack pointer from its task sp var
   txs                 // set stack pointer to task stack pointer
}

// this function will start the threading system by restoring
// the given task context, setting the cpu status, and calling
// rts to jump to it.  this can be called from a regular function
// to start the threading system.
inline start_all_tasks(first_task) {
   ldy     first_task  // get the index of the first task
   ldx     $00,y       // load its stack pointer into x
   txs                 // set stack pointer to first task sp
   load_task_context() // load task context from its stack
   plp                 // load task cpu status, this will clear interrupt disable flag
   rts                 // return to the task entry point
}

// this is the interrupt handler to hook up to the timer IRQ to
// implement task switching
interrupt context_switch() {
   // save the current task context on its stack
   save_task_context()

   // switch tasks
   switch_tasks(current_task)

   // load the task context from tnew task stack
   load_task_context()

   cli                 // clear the interrupt flag
}

I think the comments explain what the code does. The context_switch interrupt function at the bottom is called by a Lynx timer set to interrupt at roughly 30Hz. I picked that arbitrarily for now. It will probably need to be tweaked to get a good balance of execution times.

Here's the code that shows how to use the threading:

#include <task_switch.h>

#define IRQ_ADDR_HI  $FFFF
#define IRQ_ADDR_LO  $FFFE

byte task0_sp        : $0 = #$ff  // task 0 stack pointer storage
byte task1_sp        : $1 = #$7f  // task 1 stack pointer storage
byte current_task    : $2 = #0    // current task index

#ram.org 0x0200

// make this the function the one that the second stage loader
// loads and runs.  it handles further initializing the system,
// initializing the tasks and starting the task switching
function noreturn startup() {
   // disable interrupts
   sei

   // set the irq handler pointer to our context_switch function
   lda    hi(context_switch)
   sta    IRQ_ADDR_HI
   lda    lo(context_switch)
   sta    IRQ_ADDR_LO

   // initialize one of the handy interrupts to fire at 30 Hz
   initialize_timer()

   // initialize the stacks for the two tasks
   initialize_task(task0_sp, game_loop)
   initialize_task(task1_sp, asset_loader)

   // start the task system, this will set the CPU status flags to
   // a known value for the first task which will clear the disable
   // interrupts bit
   start_all_tasks(current_task)
}

function noreturn game_loop() {
   forever {
       // this is the main game loop
   }
}

function noreturn asset_loader() {
   forever {
       // this is the asset loader
   }
}

#ram.end

I haven't run this on an actual Lynx just yet. I have run it on a 6502 simulator. Right now it isn't as efficient as it could be. For instance, this uses 6502 opcodes exclusively when it could use 65c02 opcodes. The HLAKit compiler only has support for the bare 6502 right now. When I add 65c02 support I'll update this.

The next step for this system is to use the bit test/set functions to create a simple semaphore that can be used to synchronize a queue of load requests between the main thread and the loader thread. Once I get that far, I'll load it up on my Lynx and then write up a much longer tutorial for the Lynx section on my wiki.

--Wookie

Edited September 11, 2010 by Wookie

semicolo · September 14, 2010

This is groovy stuff

Shawn Jefferson · September 15, 2010

Interesting, and neat, but do you think it will actually work, in-game? To my mind, running an interrupt during your game to load game data will have to be managed very closely/precisely to not interfere with game play. Probably much easier to code the loading into each game, as each game design dictates. Maybe I'm missing something, and it's true that I haven't programmed any games for the Lynx, although I have done a bit of programming for it in the past (and for the Atari 8-bit, which is also a 6502 system and not quite entirely unlike the Lynx.)

Wookie · September 15, 2010

One thing to remember is that each task only gets to run for a single time slice. So if the "OS tick" timer is set to 30 Hz, then each task gets to run for 1/30th of a second each time it runs. Each task will be put to sleep and resumed automagically by the task switching system. At this point, both tasks have equal priority and therefore each gets 50% of the CPU time.

One of the tweaks that I can make is to add a simple priority system that allows for dividing the CPU time up unevenly. My guess is that the loader would do just fine running every 4th time slice. To do that, I just need to add an array of current task indexes. For instance, if I want the loader to run on every 4th time slice, I would set up a 4 element array in memory where each element contains the index of the next task to run like so:

+---+---+---+---+
| 1 | 0 | 0 | 0 |
+---+---+---+---+

Then the "current_task" index would index into this array. If the loader task is 1 and the game task is 0, then the every time the current_task index points to the first slot in this array, the loader will get run. An optimization can be made where the switch task IRQ handler compares the next task index with the current task index and if they are equal, it exits early without taking the time to save and restore the task context. That way, the relatively expensive task switch only happens when actually switching tasks.

Another tweak that can be made is the stack space division. The example above assumes that both tasks need the same amount of stack space. If the loader task, however, doesn't make any function calls and never needs the stack for anything other than storing the task context, then it's stack pointer can be moved to one end of the stack space. A task context only requires 6 bytes of stack space:

+--------+--------+-----------+-------+-------+-------+
| ret lo | ret hi | cpu flags | reg.a | reg.x | reg.y |
+--------+--------+-----------+-------+-------+-------+

hi------------------(stack memory)------------------->lo

When the interrupt handler starts, the return address (hi and lo bytes) and the cpu flags have already been stored on the stack. The task switching code then stores the current A, X, and Y register values on the stack before switching tasks.

So in theory, if the loader thread doesn't use any stack space at all, the initial value for its stack pointer could be $05. That gives it 6 bytes of stack space to store its context when it is asleep and the rest of the stack space ($FF-$06) would be for the game task.

--Wookie

Notes for Newbies:

BTW, if you don't know already, the 6502 is hard coded to use the memory range of $0100-$01FF for its stack. So when I refer to address $05 in the stack, that is really address $0105 in RAM. That is also why I use "#ram.org 0x0200" in the example code showing how to use the task system. That code will be loaded into RAM at $0200 and executed there. That's the lowest "free" RAM in the Lynx because the zero page--commonly used for variables--is in $0000-$00FF and the stack is in $0100-$01FF.

Edited September 15, 2010 by Wookie

Wookie · September 15, 2010

The caveat I've found so far is with writes to the sprite control blocks and writes to the cart (i.e. for save game and/or i2c I/O expanders). Those both have timing constraints that need to be heeded. Having a task switch occur during an i2c write would screw it up. Having a task resume updating sprite control blocks while Suzy is processing would be bad as well.

The solution to this problem of course it to treat those sections of code as "critical sections". Those can be protected by a semaphore that signals to the task switcher that it is OK to switch tasks. With a system like that, instead of a fully automatic task switcher, you have a co-operative task switcher where tasks can temporarily turn off task switching.

Other than that, this threading system should work great for a Lynx game. The only reason I cooked this up is because the Lynx doesn't directly memory map its ROM data. Because ROM data must be streamed into RAM when needed, the Lynx is far more similar to modern day CD/DVD-based consoles (e.g. PS3, XBox360, etc) than it is to standard ROM cart consoles.

During my time in the game industry, I maintained, amongst other things, the low level loading system for a game engine designed for CD/DVD-based consoles. It used a two thread system almost identical to the one I describe above and it worked fantastically well. Our engine was used in one of the very first "continuous massive world" (a.k.a. "giant sandbox") games for the PS2. All of the data used to render the game was streamed in dynamically in response to game play events. This broke us free from the "entire level must fit into RAM" mentality that was common back then and allowed us to make huge worlds with a fluid and dynamic--as apposed to linear--game play experience.

I should note, that our engine used a statically linked executable binary that was always resident in RAM. So our data caches took up whatever was left. Some other game engines however didn't limit themselves with that either. I know for a fact that the Jak and Daxter engine not only swapped data in/out, but it also swapped executable code in/out as well. This was aided by the fact that their engine was written in a Lisp dialect specifically made for games. (Look it up, it was called GOAL and is a pretty interesting idea, though I don't much care for Lisp).

So what does this have to do with the Lynx? Well, it is just a portable PS2 in many ways. Both have game storage systems that are magnitudes larger than the system RAM. The PS2 DVD holds 4.2 GB and it only has 32 MB of RAM, a ratio of 134:1. The Lynx carts can hold up to 128 MB of data with a special game cart (with an 8-bit i2c I/O expander to add an additional 8 bits of address lines) and it only has 64 KB of RAM, a ratio of 256:1. Even if you use a standard 1MB Lynx cart, you still have a ratio of 16:1.

Both systems have serial access to the game data. The PS2 has to read data from the DVD into RAM before it can use it. The Lynx has to read data from the cart, byte by byte, into RAM before it can use it.

Both systems have relatively modest CPU's and use hardware accelerator chips to do all of the heavy lifting. The PS2's CPU is a 200MHz MIPS 3000 compared to the XBox's 700MHz Pentium 4 CPU. The Lynx has a 6502. The PS2 used a pair of external vector units (VUs) and a dedicated 3D accelerator (GS) to do all of the rendering. The Lynx has Suzy for doing all of the sprite rendering and collision detection.

Because of these similarities, when I decided that I was going to create an RPG for the Lynx, it was natural to try the threaded loader and streaming data approach for the game engine. I could be completely wrong. This design could be a complete disaster and not work at all. I'll let you know as I test things out.

--Wookie

Edited September 15, 2010 by Wookie

matashen · September 15, 2010

The caveat I've found so far is with writes to the sprite control blocks and writes to the cart (i.e. for save game and/or i2c I/O expanders). Those both have timing constraints that need to be heeded. Having a task switch occur during an i2c write would screw it up. Having a task resume updating sprite control blocks while Suzy is processing would be bad as well.

The solution to this problem of course it to treat those sections of code as "critical sections". Those can be protected by a semaphore that signals to the task switcher that it is OK to switch tasks. With a system like that, instead of a fully automatic task switcher, you have a co-operative task switcher where tasks can temporarily turn off task switching.

Other than that, this threading system should work great for a Lynx game. The only reason I cooked this up is because the Lynx doesn't directly memory map its ROM data. Because ROM data must be streamed into RAM when needed, the Lynx is far more similar to modern day CD/DVD-based consoles (e.g. PS3, XBox360, etc) than it is to standard ROM cart consoles.

During my time in the game industry, I maintained, amongst other things, the low level loading system for a game engine designed for CD/DVD-based consoles. It used a two thread system almost identical to the one I describe above and it worked fantastically well. Our engine was used in one of the very first "continuous massive world" (a.k.a. "giant sandbox") games for the PS2. All of the data used to render the game was streamed in dynamically in response to game play events. This broke us free from the "entire level must fit into RAM" mentality that was common back then and allowed us to make huge worlds with a fluid and dynamic--as apposed to linear--game play experience.

I should note, that our engine used a statically linked executable binary that was always resident in RAM. So our data caches took up whatever was left. Some other game engines however didn't limit themselves with that either. I know for a fact that the Jak and Daxter engine not only swapped data in/out, but it also swapped executable code in/out as well. This was aided by the fact that their engine was written in a Lisp dialect specifically made for games. (Look it up, it was called GOAL and is a pretty interesting idea, though I don't much care for Lisp).

So what does this have to do with the Lynx? Well, it is just a portable PS2 in many ways. Both have game storage systems that are magnitudes larger than the system RAM. The PS2 DVD holds 4.2 GB and it only has 32 MB of RAM, a ratio of 134:1. The Lynx carts can hold up to 128 MB of data with a special game cart (with an 8-bit i2c I/O expander to add an additional 8 bits of address lines) and it only has 64 KB of RAM, a ratio of 256:1. Even if you use a standard 1MB Lynx cart, you still have a ratio of 16:1.

Both systems have serial access to the game data. The PS2 has to read data from the DVD into RAM before it can use it. The Lynx has to read data from the cart, byte by byte, into RAM before it can use it.

Both systems have relatively modest CPU's and use hardware accelerator chips to do all of the heavy lifting. The PS2's CPU is a 200MHz MIPS 3000 compared to the XBox's 700MHz Pentium 4 CPU. The Lynx has a 6502. The PS2 used a pair of external vector units (VUs) and a dedicated 3D accelerator (GS) to do all of the rendering. The Lynx has Suzy for doing all of the sprite rendering and collision detection.

Because of these similarities, when I decided that I was going to create an RPG for the Lynx, it was natural to try the threaded loader and streaming data approach for the game engine. I could be completely wrong. This design could be a complete disaster and not work at all. I'll let you know as I test things out.

--Wookie

I dont see any advantage from a task system, cause the lynx is not powerful enough for something.

I think its better to design games on th lynx without a complex multitasking system.

But i could be complete wrong ...

Regards

Matthias

Wookie · September 15, 2010

But i could be complete wrong ...

Or you could be completely right.

It is my goal to find out and report back.

--Wookie

+karri · December 16, 2010

But i could be complete wrong ...

Or you could be completely right.

It is my goal to find out and report back.

--Wookie

Actually all my codes suffer from not having a task switcher. The background ABC-music parser needs to be called at regular intervals. Currently I insert polling commands in all time-consuming places to keep the music going at a steady beat. The ideal thing would be to set up abc-parser as the task to be called. But then I also need a way to exit the task time-slot early if I have nothing to do.

I tried running the parser in an interrupt routine but that did not work because the code was too heavy.

Loading in data from cart is very fast thanks to the cc65 code segmentation feature. I don't expect to need a separate graphics streamer in a separate thread. This can easily be called from draw code.

One place where a task switcher might make sense is during the wait for Suzy. In Stardreamer I do a lot of 32 bit math like division, multiplication, accumulation. While the Suzy works there would be spare time for something else.

But of course the real question is how many clock cycles will the task switcher waste for doing its job?

--

Karri

Wookie · January 1, 2011

Thanks to the fact that the HLA is just assembly in function bodies we can get a fairly accurate count of the clock cycles lost to the task switching. The context_switch function is the ISR that gets called when the timer interrupt fires. I don't have the cycle counts for each instruction in front of me but I'm sure you can figure them out. The saving and restoring task context functions just use the register swapping and accumulator push/pop functions. The only other code is updating the current task index and swapping out the current stack pointer.

The overhead should be minimal and you can tweak it by changing how often the timer interrupt fires.

--Wookie

Edited January 1, 2011 by Wookie

GadgetUK · May 12, 2012

This is extremely interesting. Any update?

Wookie · July 11, 2012

No update so far. I'm just now getting back to this. I'm going to code something up for the CC65 assembler since my HLAKit compiler is still a work in progress (blame me being a father with a job). I'll do something to demo the advantages of the multithreading on the Lynx.

I really think that it makes sense for loading game assets on the Lynx since the Lynx doesn't have directly addressable ROMs. Since it has to stream assets through the 8-bit read register, doing that in a background thread is much nicer. Especially when you want to have complex, "endless" levels that won't fit in RAM all at once. With a background thread for loading data, you could program a game that streamed in bitmap data as needed and "flush" bitmap data no longer needed in an LRU fashion.

I've always thought it would be great for doing a final fantasy style RPG with an "endless" world. If the streaming of assets was automatic, then the map tiles would stream in and get flushed on demand. The entire set of map tiles wouldn't be limited by the RAM. It would only be limited by ROM.

-wookie

Multi-threading on the Lynx

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members