Jump to content

Photo

Multi-threading on the Lynx


10 replies to this topic

#1 Wookie OFFLINE  

Wookie

    Chopper Commander

  • 197 posts
  • Location:Seattle, WA

Posted Sat Sep 11, 2010 2:18 AM

So I've posted on here several times my idea for using a streaming loader and data caches for handling game assets on the Lynx. The general idea is that asset references are stored as unique asset ID's that are resolved to pointers at runtime. Part of the resolution process is to kick off an asynchronous load of the asset whenever it is not present in RAM. In my previous descriptions I've always done a lot of hand waving when describing how and when the loader gets run. Well, this evening I sat down and coded up a very basic multi-threading implementation using the NESHLA HLA. (For those of you who don't know, I'm nearing completion of a reimplementation of the NESHLA compiler in Python that I call HLAKit. It is architected in a way that allowed me to easily add support for the Lynx and eventually other CPU's and target systems. The project is here though that version is a bit out of date and incomplete. The wiki for the compiler is fairly up to date with the features of the compiler and the code libraries that come with it. I'll be releasing the full working compiler with Lynx system library in the coming month or so. The biggest leap forward is that my compiler has built in support for properly encrypting the loader code so that the resulting ROM can run on a real Lynx.)

Anyway, on to the code. My goal was to create a simple multi-threading capability so that I could implement the loader as its own thread and the main game loop as its own thread. The 6502 is a tricky little CPU. To make multi-threading happen you have to have separate stacks, one for each thread, some zero page variables to store the stack specific data and which thread is current, and you need to have code that will save and restore thread contexts. To make all of this work, I created a small header file with a bare minimum implementation. This only supports two threads, but could be expanded to support up to 4. I wouldn't try to do more than 4 because there are only 256 bytes of stack space and having less than 64 bytes of stack is flirting with danger.

Here's the code:

/*
 * ClassicGameDev.com Example Task Switching
 */

// this function initializes the stack for a given task so that
// we can start the task by restoring back to it as if it had been
// put to sleep previously
inline initialize_task(sp, entry) {
    tsx                 // get current stack pointer
    txa                 // x -> a
    tay                 // a -> y

    ldx     sp          // put task stack pointer in x
    txs                 // set stack pointer to task stack pointer
    lda     hi(entry)   // get high byte of task entry address
    pha                 // push high byte of task entry address on task stack
    lda     lo(entry)   // get low byte of task entry address
    pha                 // push low byte of task entry address on task stack
    lda     #$34        // put standard cpu status flags in a
    pha                 // push standard cpu status flags on task stack
    lda     #$0         // a = 0
    pha                 // push initial a value on stack
    pha                 // push initial x value on stack
    pha                 // push initial y value on stack

    tya                 // y -> a
    tax                 // a -> x
    txs                 // restore the stack pointer
}

// this macro handles storing the current task registers on its stack
inline save_task_context() {
    pha
    txa
    pha
    tya
    pha
}

// this macro handles restoring a task registers from its stack
inline load_task_context() {
    pla                 // pop the y value
    tay                 // initialize y
    pla                 // pop the x value
    tax                 // initialize x
    pla                 // pop the a value
}

// this handles storing the current stack pointer in memory, updating the
// current task index in memory and then loading the new task's stack pointer
inline switch_tasks(cur_task) {
    ldy     cur_task    // load current task index into y
    tsx                 // put the current stack pointer in x
    stx     $00,y       // store current stack pointer into current task sp var

    // this updates the current task index
    lda     #1          // a = 1
    eor     cur_task    // a = cur_task ^ a  (this flips from 0 to 1/1 to 0)
    sta     cur_task    // cur_task = a
   
    tay                 // make y equal the new task index
    ldx     $00,y       // load new task stack pointer from its task sp var
    txs                 // set stack pointer to task stack pointer
}

// this function will start the threading system by restoring
// the given task context, setting the cpu status, and calling
// rts to jump to it.  this can be called from a regular function
// to start the threading system.
inline start_all_tasks(first_task) {
    ldy     first_task  // get the index of the first task
    ldx     $00,y       // load its stack pointer into x
    txs                 // set stack pointer to first task sp
    load_task_context() // load task context from its stack
    plp                 // load task cpu status, this will clear interrupt disable flag
    rts                 // return to the task entry point
}

// this is the interrupt handler to hook up to the timer IRQ to
// implement task switching
interrupt context_switch() {
    // save the current task context on its stack
    save_task_context()

    // switch tasks
    switch_tasks(current_task)

    // load the task context from tnew task stack
    load_task_context()

    cli                 // clear the interrupt flag
}

I think the comments explain what the code does. The context_switch interrupt function at the bottom is called by a Lynx timer set to interrupt at roughly 30Hz. I picked that arbitrarily for now. It will probably need to be tweaked to get a good balance of execution times.

Here's the code that shows how to use the threading:

#include <task_switch.h>

#define IRQ_ADDR_HI  $FFFF
#define IRQ_ADDR_LO  $FFFE

byte task0_sp        : $0 = #$ff  // task 0 stack pointer storage
byte task1_sp        : $1 = #$7f  // task 1 stack pointer storage
byte current_task    : $2 = #0    // current task index

#ram.org 0x0200

// make this the function the one that the second stage loader
// loads and runs.  it handles further initializing the system,
// initializing the tasks and starting the task switching
function noreturn startup() {
    // disable interrupts
    sei

    // set the irq handler pointer to our context_switch function
    lda    hi(context_switch)
    sta    IRQ_ADDR_HI
    lda    lo(context_switch)
    sta    IRQ_ADDR_LO

    // initialize one of the handy interrupts to fire at 30 Hz
    initialize_timer()

    // initialize the stacks for the two tasks
    initialize_task(task0_sp, game_loop)
    initialize_task(task1_sp, asset_loader)

    // start the task system, this will set the CPU status flags to
    // a known value for the first task which will clear the disable
    // interrupts bit
    start_all_tasks(current_task)
}

function noreturn game_loop() {
    forever {
        // this is the main game loop
    }
}

function noreturn asset_loader() {
    forever {
        // this is the asset loader
    }
}

#ram.end

I haven't run this on an actual Lynx just yet. I have run it on a 6502 simulator. Right now it isn't as efficient as it could be. For instance, this uses 6502 opcodes exclusively when it could use 65c02 opcodes. The HLAKit compiler only has support for the bare 6502 right now. When I add 65c02 support I'll update this.

The next step for this system is to use the bit test/set functions to create a simple semaphore that can be used to synchronize a queue of load requests between the main thread and the loader thread. Once I get that far, I'll load it up on my Lynx and then write up a much longer tutorial for the Lynx section on my wiki.

--Wookie

Edited by Wookie, Sat Sep 11, 2010 2:21 AM.


#2 semicolo OFFLINE  

semicolo

    Chopper Commander

  • 128 posts

Posted Tue Sep 14, 2010 11:08 AM

This is groovy stuff

#3 Shawn Jefferson OFFLINE  

Shawn Jefferson

    Stargunner

  • 1,765 posts
  • Location:Victoria, Canada

Posted Tue Sep 14, 2010 6:55 PM

Interesting, and neat, but do you think it will actually work, in-game? To my mind, running an interrupt during your game to load game data will have to be managed very closely/precisely to not interfere with game play. Probably much easier to code the loading into each game, as each game design dictates. Maybe I'm missing something, and it's true that I haven't programmed any games for the Lynx, although I have done a bit of programming for it in the past (and for the Atari 8-bit, which is also a 6502 system and not quite entirely unlike the Lynx.)

#4 Wookie OFFLINE  

Wookie

    Chopper Commander

  • Topic Starter
  • 197 posts
  • Location:Seattle, WA

Posted Wed Sep 15, 2010 9:58 AM

One thing to remember is that each task only gets to run for a single time slice. So if the "OS tick" timer is set to 30 Hz, then each task gets to run for 1/30th of a second each time it runs. Each task will be put to sleep and resumed automagically by the task switching system. At this point, both tasks have equal priority and therefore each gets 50% of the CPU time.

One of the tweaks that I can make is to add a simple priority system that allows for dividing the CPU time up unevenly. My guess is that the loader would do just fine running every 4th time slice. To do that, I just need to add an array of current task indexes. For instance, if I want the loader to run on every 4th time slice, I would set up a 4 element array in memory where each element contains the index of the next task to run like so:

+---+---+---+---+
| 1 | 0 | 0 | 0 |
+---+---+---+---+

Then the "current_task" index would index into this array. If the loader task is 1 and the game task is 0, then the every time the current_task index points to the first slot in this array, the loader will get run. An optimization can be made where the switch task IRQ handler compares the next task index with the current task index and if they are equal, it exits early without taking the time to save and restore the task context. That way, the relatively expensive task switch only happens when actually switching tasks.

Another tweak that can be made is the stack space division. The example above assumes that both tasks need the same amount of stack space. If the loader task, however, doesn't make any function calls and never needs the stack for anything other than storing the task context, then it's stack pointer can be moved to one end of the stack space. A task context only requires 6 bytes of stack space:

+--------+--------+-----------+-------+-------+-------+
| ret lo | ret hi | cpu flags | reg.a | reg.x | reg.y |
+--------+--------+-----------+-------+-------+-------+

hi------------------(stack memory)------------------->lo


When the interrupt handler starts, the return address (hi and lo bytes) and the cpu flags have already been stored on the stack. The task switching code then stores the current A, X, and Y register values on the stack before switching tasks.

So in theory, if the loader thread doesn't use any stack space at all, the initial value for its stack pointer could be $05. That gives it 6 bytes of stack space to store its context when it is asleep and the rest of the stack space ($FF-$06) would be for the game task.

--Wookie

Notes for Newbies:
BTW, if you don't know already, the 6502 is hard coded to use the memory range of $0100-$01FF for its stack. So when I refer to address $05 in the stack, that is really address $0105 in RAM. That is also why I use "#ram.org 0x0200" in the example code showing how to use the task system. That code will be loaded into RAM at $0200 and executed there. That's the lowest "free" RAM in the Lynx because the zero page--commonly used for variables--is in $0000-$00FF and the stack is in $0100-$01FF.

Edited by Wookie, Wed Sep 15, 2010 10:40 AM.


#5 Wookie OFFLINE  

Wookie

    Chopper Commander

  • Topic Starter
  • 197 posts
  • Location:Seattle, WA

Posted Wed Sep 15, 2010 10:23 AM

The caveat I've found so far is with writes to the sprite control blocks and writes to the cart (i.e. for save game and/or i2c I/O expanders). Those both have timing constraints that need to be heeded. Having a task switch occur during an i2c write would screw it up. Having a task resume updating sprite control blocks while Suzy is processing would be bad as well.

The solution to this problem of course it to treat those sections of code as "critical sections". Those can be protected by a semaphore that signals to the task switcher that it is OK to switch tasks. With a system like that, instead of a fully automatic task switcher, you have a co-operative task switcher where tasks can temporarily turn off task switching.

Other than that, this threading system should work great for a Lynx game. The only reason I cooked this up is because the Lynx doesn't directly memory map its ROM data. Because ROM data must be streamed into RAM when needed, the Lynx is far more similar to modern day CD/DVD-based consoles (e.g. PS3, XBox360, etc) than it is to standard ROM cart consoles.

During my time in the game industry, I maintained, amongst other things, the low level loading system for a game engine designed for CD/DVD-based consoles. It used a two thread system almost identical to the one I describe above and it worked fantastically well. Our engine was used in one of the very first "continuous massive world" (a.k.a. "giant sandbox") games for the PS2. All of the data used to render the game was streamed in dynamically in response to game play events. This broke us free from the "entire level must fit into RAM" mentality that was common back then and allowed us to make huge worlds with a fluid and dynamic--as apposed to linear--game play experience.

I should note, that our engine used a statically linked executable binary that was always resident in RAM. So our data caches took up whatever was left. Some other game engines however didn't limit themselves with that either. I know for a fact that the Jak and Daxter engine not only swapped data in/out, but it also swapped executable code in/out as well. This was aided by the fact that their engine was written in a Lisp dialect specifically made for games. (Look it up, it was called GOAL and is a pretty interesting idea, though I don't much care for Lisp).

So what does this have to do with the Lynx? Well, it is just a portable PS2 in many ways. Both have game storage systems that are magnitudes larger than the system RAM. The PS2 DVD holds 4.2 GB and it only has 32 MB of RAM, a ratio of 134:1. The Lynx carts can hold up to 128 MB of data with a special game cart (with an 8-bit i2c I/O expander to add an additional 8 bits of address lines) and it only has 64 KB of RAM, a ratio of 256:1. Even if you use a standard 1MB Lynx cart, you still have a ratio of 16:1.

Both systems have serial access to the game data. The PS2 has to read data from the DVD into RAM before it can use it. The Lynx has to read data from the cart, byte by byte, into RAM before it can use it.

Both systems have relatively modest CPU's and use hardware accelerator chips to do all of the heavy lifting. The PS2's CPU is a 200MHz MIPS 3000 compared to the XBox's 700MHz Pentium 4 CPU. The Lynx has a 6502. The PS2 used a pair of external vector units (VUs) and a dedicated 3D accelerator (GS) to do all of the rendering. The Lynx has Suzy for doing all of the sprite rendering and collision detection.

Because of these similarities, when I decided that I was going to create an RPG for the Lynx, it was natural to try the threaded loader and streaming data approach for the game engine. I could be completely wrong. This design could be a complete disaster and not work at all. I'll let you know as I test things out.

--Wookie

Edited by Wookie, Wed Sep 15, 2010 10:31 AM.


#6 matashen OFFLINE  

matashen

    Moonsweeper

  • 462 posts

Posted Wed Sep 15, 2010 12:11 PM

The caveat I've found so far is with writes to the sprite control blocks and writes to the cart (i.e. for save game and/or i2c I/O expanders). Those both have timing constraints that need to be heeded. Having a task switch occur during an i2c write would screw it up. Having a task resume updating sprite control blocks while Suzy is processing would be bad as well.

The solution to this problem of course it to treat those sections of code as "critical sections". Those can be protected by a semaphore that signals to the task switcher that it is OK to switch tasks. With a system like that, instead of a fully automatic task switcher, you have a co-operative task switcher where tasks can temporarily turn off task switching.

Other than that, this threading system should work great for a Lynx game. The only reason I cooked this up is because the Lynx doesn't directly memory map its ROM data. Because ROM data must be streamed into RAM when needed, the Lynx is far more similar to modern day CD/DVD-based consoles (e.g. PS3, XBox360, etc) than it is to standard ROM cart consoles.

During my time in the game industry, I maintained, amongst other things, the low level loading system for a game engine designed for CD/DVD-based consoles. It used a two thread system almost identical to the one I describe above and it worked fantastically well. Our engine was used in one of the very first "continuous massive world" (a.k.a. "giant sandbox") games for the PS2. All of the data used to render the game was streamed in dynamically in response to game play events. This broke us free from the "entire level must fit into RAM" mentality that was common back then and allowed us to make huge worlds with a fluid and dynamic--as apposed to linear--game play experience.

I should note, that our engine used a statically linked executable binary that was always resident in RAM. So our data caches took up whatever was left. Some other game engines however didn't limit themselves with that either. I know for a fact that the Jak and Daxter engine not only swapped data in/out, but it also swapped executable code in/out as well. This was aided by the fact that their engine was written in a Lisp dialect specifically made for games. (Look it up, it was called GOAL and is a pretty interesting idea, though I don't much care for Lisp).

So what does this have to do with the Lynx? Well, it is just a portable PS2 in many ways. Both have game storage systems that are magnitudes larger than the system RAM. The PS2 DVD holds 4.2 GB and it only has 32 MB of RAM, a ratio of 134:1. The Lynx carts can hold up to 128 MB of data with a special game cart (with an 8-bit i2c I/O expander to add an additional 8 bits of address lines) and it only has 64 KB of RAM, a ratio of 256:1. Even if you use a standard 1MB Lynx cart, you still have a ratio of 16:1.

Both systems have serial access to the game data. The PS2 has to read data from the DVD into RAM before it can use it. The Lynx has to read data from the cart, byte by byte, into RAM before it can use it.

Both systems have relatively modest CPU's and use hardware accelerator chips to do all of the heavy lifting. The PS2's CPU is a 200MHz MIPS 3000 compared to the XBox's 700MHz Pentium 4 CPU. The Lynx has a 6502. The PS2 used a pair of external vector units (VUs) and a dedicated 3D accelerator (GS) to do all of the rendering. The Lynx has Suzy for doing all of the sprite rendering and collision detection.

Because of these similarities, when I decided that I was going to create an RPG for the Lynx, it was natural to try the threaded loader and streaming data approach for the game engine. I could be completely wrong. This design could be a complete disaster and not work at all. I'll let you know as I test things out.

--Wookie


I dont see any advantage from a task system, cause the lynx is not powerful enough for something.
I think its better to design games on th lynx without a complex multitasking system.

But i could be complete wrong ...


Regards
Matthias

#7 Wookie OFFLINE  

Wookie

    Chopper Commander

  • Topic Starter
  • 197 posts
  • Location:Seattle, WA

Posted Wed Sep 15, 2010 3:00 PM

But i could be complete wrong ...


Or you could be completely right. ;)

It is my goal to find out and report back.

--Wookie

#8 karri OFFLINE  

karri

    Stargunner

  • 1,481 posts
  • Location:Espoo, Finland

Posted Wed Dec 15, 2010 11:30 PM


But i could be complete wrong ...


Or you could be completely right. ;)

It is my goal to find out and report back.

--Wookie


Actually all my codes suffer from not having a task switcher. The background ABC-music parser needs to be called at regular intervals. Currently I insert polling commands in all time-consuming places to keep the music going at a steady beat. The ideal thing would be to set up abc-parser as the task to be called. But then I also need a way to exit the task time-slot early if I have nothing to do.

I tried running the parser in an interrupt routine but that did not work because the code was too heavy.

Loading in data from cart is very fast thanks to the cc65 code segmentation feature. I don't expect to need a separate graphics streamer in a separate thread. This can easily be called from draw code.

One place where a task switcher might make sense is during the wait for Suzy. In Stardreamer I do a lot of 32 bit math like division, multiplication, accumulation. While the Suzy works there would be spare time for something else.

But of course the real question is how many clock cycles will the task switcher waste for doing its job?
--
Karri

#9 Wookie OFFLINE  

Wookie

    Chopper Commander

  • Topic Starter
  • 197 posts
  • Location:Seattle, WA

Posted Sat Jan 1, 2011 10:27 AM

Thanks to the fact that the HLA is just assembly in function bodies we can get a fairly accurate count of the clock cycles lost to the task switching. The context_switch function is the ISR that gets called when the timer interrupt fires. I don't have the cycle counts for each instruction in front of me but I'm sure you can figure them out. The saving and restoring task context functions just use the register swapping and accumulator push/pop functions. The only other code is updating the current task index and swapping out the current stack pointer.

The overhead should be minimal and you can tweak it by changing how often the timer interrupt fires.

--Wookie

Edited by Wookie, Sat Jan 1, 2011 10:29 AM.


#10 GadgetUK OFFLINE  

GadgetUK

    Stargunner

  • 1,466 posts
  • Location:UK

Posted Sat May 12, 2012 1:05 PM

This is extremely interesting. Any update?

#11 Wookie OFFLINE  

Wookie

    Chopper Commander

  • Topic Starter
  • 197 posts
  • Location:Seattle, WA

Posted Wed Jul 11, 2012 3:15 PM

No update so far. I'm just now getting back to this. I'm going to code something up for the CC65 assembler since my HLAKit compiler is still a work in progress (blame me being a father with a job). I'll do something to demo the advantages of the multithreading on the Lynx.

I really think that it makes sense for loading game assets on the Lynx since the Lynx doesn't have directly addressable ROMs. Since it has to stream assets through the 8-bit read register, doing that in a background thread is much nicer. Especially when you want to have complex, "endless" levels that won't fit in RAM all at once. With a background thread for loading data, you could program a game that streamed in bitmap data as needed and "flush" bitmap data no longer needed in an LRU fashion.

I've always thought it would be great for doing a final fantasy style RPG with an "endless" world. If the streaming of assets was automatic, then the map tiles would stream in and get flushed on demand. The entire set of map tiles wouldn't be limited by the RAM. It would only be limited by ROM.

-wookie




0 user(s) are browsing this forum

0 members, 0 guests, 0 anonymous users