
Speed boost - run code in RAM


SpiceWare


NOTE: I'm putting this in General because it's not specific to CDFJ.

 

NOTE2: Be sure to read through the replies in this topic for additional speed boosts.

 

NOTE3: It may be helpful to reallocate Display Data RAM for C Usage when putting functions in RAM.

 

 

In a discussion about reducing the time it takes to decompress level graphics in @johnnywc's Turbo @batari suggested:

 

Quote

Regardless, my recommendation is if you have a small enough function that needs raw speed over memory, I would consider compiling it to run from RAM and running it there, if at all possible. In flash, even with the MAM enabled, there are wait states needed but code running from RAM needs no wait states. I am not 100% sure how to make things play nice with the compiler for RAM functions (aside from hacks), but I am sure there are some here who could help more with that.

 

I did a little research and found this discussion on Stack Overflow: How to run code from RAM on ARM architecture

 

Quote

On GCC: Just put the function in the .data section:


__attribute__( ( section(".data") ) )


It will be copied over with the rest of your initialized variables by the startup code (no need to mess with the linker script). You may also need the "long_call" option if the function ends up "far away" from the rest of the code after being placed into RAM.


__attribute__( ( long_call, section(".data") ) )


Example:


__attribute__( ( long_call, section(".data") ) ) void ram_foobar (void) { ... }
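A minimal sketch of one way to wrap those attributes, assuming you also want the same source to compile on a non-ARM machine for testing. The macro names RAM_FUNC and RAM_FUNC_FAR and the two small functions are hypothetical; they are not part of the quoted answer or of any CDFJ header.

```c
/* RAM_FUNC and RAM_FUNC_FAR are hypothetical macro names.
   On an ARM GCC build they expand to the attributes quoted above;
   on any other compiler they expand to nothing so the code still builds. */
#if defined(__GNUC__) && defined(__arm__)
#define RAM_FUNC      __attribute__((section(".data")))
#define RAM_FUNC_FAR  __attribute__((long_call, section(".data")))
#else
#define RAM_FUNC
#define RAM_FUNC_FAR
#endif

/* Only called from other RAM code, so the section attribute alone is used. */
RAM_FUNC unsigned int ram_double(unsigned int value)
{
    return value << 1;
}

/* Meant to be called from code in flash, so it also asks for long_call. */
RAM_FUNC_FAR unsigned int ram_quadruple(unsigned int value)
{
    return ram_double(ram_double(value));
}
```

On a PC build both macros vanish, so the functions behave like ordinary C and can be unit tested there.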

 

As a test I cloned 3 functions in Collect 3 that are used to prepare the arena. The only changes were to prefix the functions with __attribute__ ( ... ) and the function names with RAM_.

 

// the distance from NewGame() in ROM to RAM_PrepArenaBuffers() in RAM is far
// so need to flag long_call
__attribute__( ( long_call, section(".data") ) ) void RAM_PrepArenaBuffers()
{
    // This function loads the selected Arena layout into the 6 playfield buffers.
    //
    // The 40 bits for each row of arena data are stored in 5 bytes arranged like this:
    // byte 0    byte 1    byte 2    byte 3    byte 4
    // 33333333  33222222  22221111  11111100  00000000
    // 98765432  10987654  32109876  54321098  76543210
    //
    // They need to be converted to this arrangement for the playfield datastreams:
    // LEFT                          RIGHT
    // PF0       PF1       PF2       PF0       PF1       PF2
    // 3333----  33333322  22222222  1111----  11111100  00000000
    // 6789----  54321098  01234567  6789----  54321098  01234567
    
    int row;
    unsigned char byte0, byte1, byte2, byte3, byte4;
    unsigned char *arena = ROM + arena_graphics[mm_arena];
    unsigned char *arena_pf0_left   = RAM + _BUF_PF0_LEFT;
    unsigned char *arena_pf1_left   = RAM + _BUF_PF1_LEFT;
    unsigned char *arena_pf2_left   = RAM + _BUF_PF2_LEFT;
    unsigned char *arena_pf0_right  = RAM + _BUF_PF0_RIGHT;
    unsigned char *arena_pf1_right  = RAM + _BUF_PF1_RIGHT;
    unsigned char *arena_pf2_right  = RAM + _BUF_PF2_RIGHT;
    
    for(row=0; row<arena_heights[mm_arena]; row++)
    {
        // fetch the 5 bytes for the current row
        byte0 = arena[row*5 + 0];
        byte1 = arena[row*5 + 1];
        byte2 = arena[row*5 + 2];
        byte3 = arena[row*5 + 3];
        byte4 = arena[row*5 + 4];
        
        // convert the 5 bytes into the 6 needed for TIA's PFx registers
        arena_pf0_left[row] = RAM_BitReversal(byte0) << 4;
        arena_pf1_left[row] = (byte0 << 4) + (byte1 >> 4);
        arena_pf2_left[row] = RAM_BitReversal((byte1 << 4) + (byte2 >> 4));
        arena_pf0_right[row] = RAM_BitReversal(byte2);
        arena_pf1_right[row] = byte3;
        arena_pf2_right[row] = RAM_BitReversal(byte4);
    }
    
    // set the color of the arena
    ARENA_COLOR = RAM_ColorConvert(arena_color[mm_arena]);
}

// RAM_BitReversal() is only called from RAM_PrepArenaBuffers()
// so the flag long_call is not needed
__attribute__( ( section(".data") ) ) unsigned int RAM_BitReversal(unsigned int value)
{
    // value    a byte with bits in the order 76543210
    // return   a byte with bits in the order 01234567
    
    value = ((0xaa & value) >> 1) | ((0x55 & value) << 1);
    value = ((0xcc & value) >> 2) | ((0x33 & value) << 2);
    value = ((0xf0 & value) >> 4) | ((0x0f & value) << 4);
    return value;
}


// RAM_ColorConvert() is only called from RAM_PrepArenaBuffers()
// so the flag long_call is not needed
__attribute__( ( section(".data") ) ) int RAM_ColorConvert(int color)
{
    if (mm_tv_type == PAL)
    {
        return NTSCtoPAL[color>>4] +    // convert chroma value
               (color & 0x0f);          // retain luma value
    }
    else if (mm_tv_type == SECAM)
    {
        if (color < 2)
            return 0;   // return black for 0 or 1
        else
            return NTSCtoSECAM[color>>4];
    }
    else
        return color;
}
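To make the bit shuffling concrete, here is a small stand-alone sketch that can be compiled on a PC. bit_reversal() uses the same masks and shifts as RAM_BitReversal() above; convert_row() is a hypothetical helper that applies the six assignments from the loop to a single row, and the input bytes used in the note below are arbitrary.

```c
#include <stdint.h>

/* Same masks and shifts as RAM_BitReversal(); any bits above bit 7
   are discarded by the first pair of masks. */
unsigned int bit_reversal(unsigned int value)
{
    value = ((0xaa & value) >> 1) | ((0x55 & value) << 1);
    value = ((0xcc & value) >> 2) | ((0x33 & value) << 2);
    value = ((0xf0 & value) >> 4) | ((0x0f & value) << 4);
    return value;
}

/* Hypothetical helper: converts one 5-byte arena row into the 6 playfield
   bytes, stored in the order PF0 left, PF1 left, PF2 left,
   PF0 right, PF1 right, PF2 right. The casts mimic the truncation that
   happens when the original code stores into unsigned char buffers. */
void convert_row(const uint8_t in[5], uint8_t out[6])
{
    out[0] = (uint8_t)(bit_reversal(in[0]) << 4);
    out[1] = (uint8_t)((in[0] << 4) + (in[1] >> 4));
    out[2] = (uint8_t)bit_reversal((in[1] << 4) + (in[2] >> 4));
    out[3] = (uint8_t)bit_reversal(in[2]);
    out[4] = in[3];
    out[5] = (uint8_t)bit_reversal(in[4]);
}
```

For the row 0x12 0x34 0x56 0x78 0x9a this gives 0x80 0x23 0xa2 0x6a 0x78 0x59.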

 

I then modified NewGame() to run the original PrepArenaBuffers() if TV_TYPE was Color, or the new RAM_PrepArenaBuffers() if TV_TYPE was B&W.

 

void NewGame()
{
    // tells 6507 code to use the game kernels
    MODE = GAME_ACTIVE;
    
    // set starting positions for the players
    player_x[0] = 36;   player_y[0] = 40;   player_shape[0] = 0;
    player_x[1] = 116;  player_y[1] = 136;  player_shape[1] = 1;
    
    if (TV_TYPE_COLOR)
    {
        T1TC = 0;           // make sure timer starts at 0
        T1TCR = 1;          // turn on timer
        // run from ROM
        PrepArenaBuffers();
        T1TCR = 0;          // turn off timer
        execution_time = T1TC;
    }
    else
    {
        T1TC = 0;           // make sure timer starts at 0
        T1TCR = 1;          // turn on timer
        // run from RAM
        RAM_PrepArenaBuffers();
        T1TCR = 0;          // turn off timer
        execution_time = T1TC | 0x80000000;
    }
    
}

 

execution_time will contain how many ARM cycles it took to prepare the arena buffers.  If the RAM function is called the high-bit is set so we have visual confirmation of which routine ran.

 

I then modified the score display to show execution_time if the difficulty switches are both set to B.  The upper 2 bytes are displayed in the left player's score, while the lower 2 bytes are shown in the right player's score.
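As a rough illustration of how the value splits up, the sketch below pulls the flag, the cycle count and the two 16-bit halves out of execution_time. The function and macro names are hypothetical and this is not the code from Collect 3's score routine.

```c
#include <stdint.h>

/* The high bit that NewGame() sets when the RAM routine ran. */
#define RAN_FROM_RAM_FLAG 0x80000000u

/* Nonzero if the RAM version of the routine was the one timed. */
int ran_from_ram(uint32_t execution_time)
{
    return (execution_time & RAN_FROM_RAM_FLAG) != 0;
}

/* Cycle count with the flag bit stripped off. */
uint32_t cycle_count(uint32_t execution_time)
{
    return execution_time & ~RAN_FROM_RAM_FLAG;
}

/* Upper 2 bytes, the half shown in the left player's score. */
uint16_t left_score_bytes(uint32_t execution_time)
{
    return (uint16_t)(execution_time >> 16);
}

/* Lower 2 bytes, the half shown in the right player's score. */
uint16_t right_score_bytes(uint32_t execution_time)
{
    return (uint16_t)(execution_time & 0xffffu);
}
```

With a RAM run of 0x114d cycles the stored value is 0x8000114d, so the upper half is 0x8000 and the lower half is 0x114d.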

 

My 2600 is currently out of commission, so I set up my 7800 to test.

 

Running PrepArenaBuffers():

 

IMG_2011.thumb.JPG.5fba1d17742be4174f98a48882d6d663.JPG

 

 

Running RAM_PrepArenaBuffers()

 

IMG_2012.thumb.JPG.6059fe33736302694352acc7b2dfbfd8.JPG

 

Execution times:

 

  • ROM 0x1660 cycles, 5728 decimal
  • RAM 0x114d cycles, 4429 decimal

 

The routines take about 23% less time to run in RAM.
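The arithmetic behind that figure, as a small helper (the function name is made up for illustration):

```c
/* Percentage of execution time saved when the cycle count drops
   from rom_cycles to ram_cycles. */
double percent_time_saved(unsigned int rom_cycles, unsigned int ram_cycles)
{
    return 100.0 * ((double)rom_cycles - (double)ram_cycles)
                 / (double)rom_cycles;
}
```

Feeding in 0x1660 and 0x114d gives (5728 - 4429) / 5728, or roughly 22.7%.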

 

Source and ROM.

 

Collect3RAMfunction.zip

 


This is incredibly useful. PrepArenaBuffers() appears to be at 0x000011b0 while RAM_PrepArenaBuffers() appears to be at 0x4000185c. There's no copying of the custom program to somewhere in RAM, it's just the variable block. Interesting.
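A quick way to sanity-check addresses like these from a map file is sketched below. It assumes the LPC2103 layout of 32 KB of flash starting at 0x00000000 and 8 KB of SRAM starting at 0x40000000; other chips in the family have different sizes, and all names here are hypothetical.

```c
#include <stdint.h>

typedef enum { REGION_FLASH, REGION_SRAM, REGION_OTHER } region_t;

/* Assumed LPC2103 memory layout; adjust the sizes for other parts. */
#define FLASH_BASE 0x00000000u
#define FLASH_SIZE 0x00008000u   /* 32 KB */
#define SRAM_BASE  0x40000000u
#define SRAM_SIZE  0x00002000u   /* 8 KB */

/* Reports which on-chip memory an address falls in. The unsigned
   subtraction wraps for addresses below the base, so a single
   comparison covers both ends of each range. */
region_t classify_address(uint32_t addr)
{
    if (addr - FLASH_BASE < FLASH_SIZE)
        return REGION_FLASH;
    if (addr - SRAM_BASE < SRAM_SIZE)
        return REGION_SRAM;
    return REGION_OTHER;
}
```

Under those assumptions 0x000011b0 lands in flash and 0x4000185c lands in SRAM.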

 

 


3 hours ago, JetSetIlly said:

There's no copying of the custom program to somewhere in RAM, it's just the variable block. Interesting.

 

If you defined an array like this:

 

unsigned char arena_color[4] =
{
    _RED + 4,
    _GREEN + 4,
    _BLUE + 4,
    _WHITE
};

without the const then the array is defined as being RAM, not ROM. The compiler sets up the initial values of _RED + 4, ..., _WHITE in a data section in ROM.  The very first time you call custom ARM code a one-time process will run that copies those initial values from the ROM data section to the appropriate location in RAM.

 

The section(".data") bit of __attribute__ taps into those routines to copy the function from ROM to RAM for you.
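The idea can be simulated on a PC with a sketch like the one below. It is not the actual CDFJ startup code: the array names, the flag and the byte values are all hypothetical stand-ins for the ROM image of the data section and the RAM it gets copied to.

```c
#include <string.h>

/* Hypothetical stand-ins: the initial values as stored in ROM, and the
   RAM block that variables (and RAM functions) actually live in. */
static const unsigned char rom_data_image[4] = { 0x44, 0xC4, 0x84, 0x0E };
static unsigned char ram_data[4];
static int data_copied = 0;

/* One-time copy from the ROM image to RAM; later calls do nothing,
   so changes made in RAM afterwards are preserved. */
void init_data_once(void)
{
    if (!data_copied) {
        memcpy(ram_data, rom_data_image, sizeof ram_data);
        data_copied = 1;
    }
}

unsigned char read_ram_data(unsigned int index)
{
    init_data_once();
    return ram_data[index];
}

void write_ram_data(unsigned int index, unsigned char value)
{
    init_data_once();
    ram_data[index] = value;
}
```

A function tagged with section(".data") simply becomes more bytes in that same image, which is why no extra copy step is needed for it.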



Yes. I was thinking from an emulation point of view about why this works without any change to the driver. And it's because these functions are copied from ROM to RAM as part of the .data section, which is already emulated (in Stella, etc.). Very nice solution for time-critical code.

 


Very good.

 

That's with MAM mode 2, right? Have you tested mode 1 too? The difference should be even bigger there.

 

BTW: While you are at it, how about moving the data into RAM too (or instead)?

Edited by Thomas Jentzsch

2 hours ago, Thomas Jentzsch said:

That's with MAM mode 2, right?

 

No, CDFJ sets it to MAM = 1 when calling the custom ARM code as it could be used on an original Harmony.  CDFJ+, which is being used in Turbo, leaves it as MAM = 2.

 

I made this change:

 


int main()
{
    MAMCR=2;
...
}

 

Running PrepArenaBuffers():

 

IMG_2017.thumb.JPG.8c060db5e43003d5ec181bd671b98c51.JPG

 

Running RAM_PrepArenaBuffers():

 

IMG_2018.thumb.JPG.da19e2b76d7ae58460b581f98b0e3885.JPG

 

ROM and RAM execution times with MAM = 2 are both faster than before:
 
ROM 0x131C cycles, 4892 decimal
RAM 0x1075 cycles, 4213 decimal
 
The RAM boost is only 14%. 


I did my own tests on my PAL console. 

Chip     MAM  ROM       RAM
LPC2103  0    0x3662    0x1178
LPC2103  1    0x1686    0x1168
LPC2103  2    0x1339    0x1090
LPC2104  0    0x3662    0x1178
LPC2104  1    0x25bd ?  0x1169
LPC2104  2    0x1303    0x108c

 

The LPC2104 should be able to handle ROM a bit better, and the MAM = 2 value confirms that. The 0x25bd must be an error (I suppose there is a bug in the chip with MAM = 1; it seems that only data is cached, though that is not mentioned in the errata sheet). Something like 0x15bd would fit much better.

 

BTW: I wonder why our LPC2103 values slightly differ.


3 minutes ago, Thomas Jentzsch said:

You are welcome. That means that anything except Flash memory (SRAM, peripherals) has no wait states, right?

It affects peripherals. So, anything with an address in the range 0xE0000000 to 0xEFFFFFFF is on the peripheral bus. SRAM is not on that bus so is not affected by the APB divider.

 


8 minutes ago, JetSetIlly said:

It affects peripherals. So, anything with an address in the range 0xE0000000 to 0xEFFFFFFF is on the peripheral bus. SRAM is not on that bus so is not affected by the APB divider.

Yes, but SRAM is assumed to be accessible without wait states anyway. :) 


2 hours ago, Thomas Jentzsch said:

BTW: I wonder why our LPC2103 values slightly differ.


Did you build new ROMs for your MAM tests? If so, we likely have different versions of the C compiler installed that optimized the ARM code differently.


With the prior compiler I had 4 versions installed in different virtual machines. For each project I'd use whichever compiler version resulted in smaller code.

Quote

The size of custom2.bin will depend upon which C compiler you installed.

  • 2011.03-42 - 8136 bytes
  • 2012.03.56 - 8308 bytes
  • 2012.09.63 - 8224 bytes
  • 2013.05.23 - 8224 bytes

 


Which file should I look for? main.o (11988 bytes)?

 

BTW: I experimented with compiler settings. With global -O3 the code size almost doubled (23538 bytes), but the times improved a lot:

  • LPC2103, MAM = 2 0x0d59 (ROM) ~36% faster, 0x0d07 (RAM) ~27% faster
  • LPC2104, MAM = 2 0x0e28 (ROM) ~46% faster, 0x0d04 (RAM) ~24% faster

Now I will try to get only the function optimized for speed.

 

Edit: Found it! :) 

__attribute__( ( section(".data") ) ) __attribute__ ((optimize(3))) unsigned int RAM_BitReversal(unsigned int value)

With this setting for the two functions the file size increased only by 272 bytes.

 

For LPC2103 the ROM function became even slightly slower, at 0x133f (I suppose the increased code size causes a lot of MAM buffer misses), but the RAM function dropped to 0x0d5b (~24% faster). For LPC2104 the results are similar.

 

RAM and local optimization together improved by ~44%. I think it is worth going into that direction.
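Those percentages look like speed-up ratios (old cycles divided by new cycles) rather than time saved. That reading is an assumption, as is the choice of baselines: the LPC2103 MAM = 2 values from the table earlier in the thread (0x1090 for RAM, 0x1339 for ROM). The helper name is made up.

```c
/* How much faster the new run is, as a percentage:
   100 * (before / after - 1). */
double percent_faster(unsigned int before_cycles, unsigned int after_cycles)
{
    return 100.0 * ((double)before_cycles / (double)after_cycles - 1.0);
}
```

With after = 0x0d5b, a baseline of 0x1090 gives about 24% and a baseline of 0x1339 gives about 44%, which lines up with the figures quoted above.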

Edited by Thomas Jentzsch

3 hours ago, Thomas Jentzsch said:

Which file should I look for? main.o (11988 bytes)?

 

armcode.bin (3708 bytes) - it's the ARM code that ends up in the ROM.

 

3 hours ago, Thomas Jentzsch said:

Edit: Found it! :) 


__attribute__( ( section(".data") ) ) __attribute__ ((optimize(3))) unsigned int RAM_BitReversal(unsigned int value)

 

Nice!

 

Those can probably be combined, just as is done with long_call for RAM_PrepArenaBuffers(). It would be worth testing that with all 3 functions. I'm not able to at the moment, my work day has started.

 

__attribute__( ( long_call, optimize(3), section(".data") ) ) void RAM_PrepArenaBuffers()
{
    ...
}

__attribute__( ( optimize(3), section(".data") ) ) unsigned int RAM_BitReversal(unsigned int value)
{
    ...
}

__attribute__( ( optimize(3), section(".data") ) ) int RAM_ColorConvert(int color)
{
    ...
}

 

 


Looks like arm-eabi-gcc --version is the command to report compiler version:

 

atari@atari-VirtualBox:/media/sf_Atari/Collect3RAMfunction$ arm-eabi-gcc --version
arm-eabi-gcc (Linaro GCC 7.4-2019.02) 7.4.1 20181213 [linaro-7.4-2019.02 revision 56ec6f6b99cc167ff0c2f8e1a2eed33b1edc85d4]
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

 

 

That matches the version it was when I documented the installation:

 

On 10/26/2019 at 1:53 PM, SpiceWare said:

Locate the gcc-linaro-???-x86_64_arm-eabi.tar.xz file where ??? is the most current release.  At time of this post that's 7.4.1-2019.02, so:

 

Looks like the version that's currently on the Linaro site is 7.5.0-2019.12 - kind of surprised about that.

 

 


57 minutes ago, SpiceWare said:

armcode.bin (3708 bytes) - it's the ARM code that ends up in the ROM.

3972 bytes with the local optimizations.

Quote

arm-eabi-gcc (Linaro GCC 7.1-2017.08) 7.1.1 20170707
Copyright (C) 2017 Free Software Foundation, Inc.

Quite old.
