+SpiceWare Posted June 26, 2021

NOTE: I'm putting this in General because it's not specific to CDFJ.
NOTE2: Be sure to read through the replies in this topic for additional speed boosts.
NOTE3: It may be helpful to reallocate Display Data RAM for C Usage when putting functions in RAM.

In a discussion about reducing the time it takes to decompress level graphics in @johnnywc's Turbo, @batari suggested:

Quote
Regardless, my recommendation is if you have a small enough function that needs raw speed over memory, I would consider compiling it to run from RAM and running it there, if at all possible. In flash, even with the MAM enabled, there are wait states needed but code running from RAM needs no wait states. I am not 100% sure how to make things play nice with the compiler for RAM functions (aside from hacks), but I am sure there are some here who could help more with that.

I did a little research and found this discussion on Stack Overflow: How to run code from RAM on ARM architecture

Quote
On GCC: Just put the function in the .data section:

    __attribute__( ( section(".data") ) )

It will be copied over with the rest of your initialized variables by the startup code (no need to mess with the linker script). You may also need a "long_call" option as well if the function ends up "far away" from the rest of the code after being placed into RAM.

    __attribute__( ( long_call, section(".data") ) )

Example:

    __attribute__( ( long_call, section(".data") ) ) void ram_foobar (void) { ... }

As a test I cloned 3 functions in Collect 3 that are used to prepare the arena. The only change was to prefix the functions with __attribute__ ( ... ) and the function names with RAM_

    // the distance from NewGame() in ROM to RAM_PrepArenaBuffers() in RAM is far
    // so need to flag long_call
    __attribute__( ( long_call, section(".data") ) )
    void RAM_PrepArenaBuffers()
    {
        // This function loads the selected Arena layout into the 6 playfield buffers.
        //
        // The 40 bits for each row of arena data are stored in 5 bytes arranged like this:
        //      byte 0   byte 1   byte 2   byte 3   byte 4
        //     33333333 33222222 22221111 11111100 00000000
        //     98765432 10987654 32109876 54321098 76543210
        //
        // They need to be converted to this arrangement for the playfield datastreams:
        //     LEFT                         RIGHT
        //     PF0      PF1      PF2        PF0      PF1      PF2
        //     3333---- 33333322 22222222   1111---- 11111100 00000000
        //     6789---- 54321098 01234567   6789---- 54321098 01234567

        int row;
        unsigned char byte0, byte1, byte2, byte3, byte4;
        unsigned char *arena = ROM + arena_graphics[mm_arena];
        unsigned char *arena_pf0_left = RAM + _BUF_PF0_LEFT;
        unsigned char *arena_pf1_left = RAM + _BUF_PF1_LEFT;
        unsigned char *arena_pf2_left = RAM + _BUF_PF2_LEFT;
        unsigned char *arena_pf0_right = RAM + _BUF_PF0_RIGHT;
        unsigned char *arena_pf1_right = RAM + _BUF_PF1_RIGHT;
        unsigned char *arena_pf2_right = RAM + _BUF_PF2_RIGHT;

        for(row=0; row<arena_heights[mm_arena]; row++)
        {
            // fetch the 5 bytes for the current row
            byte0 = arena[row*5 + 0];
            byte1 = arena[row*5 + 1];
            byte2 = arena[row*5 + 2];
            byte3 = arena[row*5 + 3];
            byte4 = arena[row*5 + 4];

            // convert the 5 bytes into the 6 needed for TIA's PFx registers
            arena_pf0_left[row] = RAM_BitReversal(byte0) << 4;
            arena_pf1_left[row] = (byte0 << 4) + (byte1 >> 4);
            arena_pf2_left[row] = RAM_BitReversal((byte1 << 4) + (byte2 >> 4));
            arena_pf0_right[row] = RAM_BitReversal(byte2);
            arena_pf1_right[row] = byte3;
            arena_pf2_right[row] = RAM_BitReversal(byte4);
        }

        // set the color of the arena
        ARENA_COLOR = RAM_ColorConvert(arena_color[mm_arena]);
    }

    // RAM_BitReversal() is only called from RAM_PrepArenaBuffers()
    // so the flag long_call is not needed
    __attribute__( ( section(".data") ) )
    unsigned int RAM_BitReversal(unsigned int value)
    {
        // value:  a byte with bits in the order 76543210
        // return: a byte with bits in the order 01234567
        value = ((0xaa & value) >> 1) | ((0x55 & value) << 1);
        value = ((0xcc & value) >> 2) | ((0x33 & value) << 2);
        value = ((0xf0 & value) >> 4) | ((0x0f & value) << 4);
        return value;
    }

    // RAM_ColorConvert() is only called from RAM_PrepArenaBuffers()
    // so the flag long_call is not needed
    __attribute__( ( section(".data") ) )
    int RAM_ColorConvert(int color)
    {
        if (mm_tv_type == PAL)
        {
            return NTSCtoPAL[color>>4] +   // convert chroma value
                   (color & 0x0f);         // retain luma value
        }
        else if (mm_tv_type == SECAM)
        {
            if (color < 2)
                return 0;                  // return black for 0 or 1
            else
                return NTSCtoSECAM[color>>4];
        }
        else
            return color;
    }

I then modified NewGame() to run the original PrepArenaBuffers() if TV_TYPE was Color, or the new RAM_PrepArenaBuffers() if TV_TYPE was B&W.

    void NewGame()
    {
        // tells 6507 code to use the game kernels
        MODE = GAME_ACTIVE;

        // set starting positions for the players
        player_x[0] = 36;
        player_y[0] = 40;
        player_shape[0] = 0;
        player_x[1] = 116;
        player_y[1] = 136;
        player_shape[1] = 1;

        if (TV_TYPE_COLOR)
        {
            T1TC = 0;   // make sure timer starts at 0
            T1TCR = 1;  // turn on timer
            // run from ROM
            PrepArenaBuffers();
            T1TCR = 0;  // turn off timer
            execution_time = T1TC;
        }
        else
        {
            T1TC = 0;   // make sure timer starts at 0
            T1TCR = 1;  // turn on timer
            // run from RAM
            RAM_PrepArenaBuffers();
            T1TCR = 0;  // turn off timer
            execution_time = T1TC | 0x80000000;
        }
    }

execution_time will contain how many ARM cycles it took to prepare the arena buffers. If the RAM function is called, the high bit is set so we have visual confirmation of which routine ran.

I then modified the score display to show execution_time if the difficulty switches are both set to B. The upper 2 bytes are displayed in the left player's score, while the lower 2 bytes are shown in the right player's score.

My 2600 is currently out of commission, so I set up my 7800 to test.

[Screenshot: running PrepArenaBuffers()]
[Screenshot: running RAM_PrepArenaBuffers()]

Execution times:

ROM 0x1660 cycles, 5728 decimal
RAM 0x114d cycles, 4429 decimal

The routines take about 23% less time to run in RAM.

Source and ROM: Collect3RAMfunction.zip

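Since the conversion is plain C, it can be checked on a desktop compiler before it is moved into RAM. The following is only a sketch: reverse_bits() and convert_row() are hypothetical host-side names, not part of Collect 3, and the arithmetic simply mirrors RAM_BitReversal() and the loop body of RAM_PrepArenaBuffers().

```c
#include <stdint.h>

/* Hypothetical host-side copy of the bit reversal from the post:
   bits 76543210 in, bits 01234567 out. Bits above bit 7 are discarded
   by the first pair of masks. */
static unsigned int reverse_bits(unsigned int value)
{
    value = ((0xaa & value) >> 1) | ((0x55 & value) << 1);
    value = ((0xcc & value) >> 2) | ((0x33 & value) << 2);
    value = ((0xf0 & value) >> 4) | ((0x0f & value) << 4);
    return value;
}

/* Hypothetical helper: convert one 5-byte arena row into six playfield
   bytes, ordered PF0, PF1, PF2 (left) then PF0, PF1, PF2 (right). */
static void convert_row(const uint8_t in[5], uint8_t out[6])
{
    out[0] = (uint8_t)(reverse_bits(in[0]) << 4);
    out[1] = (uint8_t)((in[0] << 4) + (in[1] >> 4));
    out[2] = (uint8_t)reverse_bits((unsigned int)(in[1] << 4) + (in[2] >> 4));
    out[3] = (uint8_t)reverse_bits(in[2]);
    out[4] = in[3];
    out[5] = (uint8_t)reverse_bits(in[4]);
}
```

For a row of 0x80 0x00 0x00 0x00 0x01 only the leftmost and rightmost arena bits are set; they end up in bit 4 of the left PF0 byte and bit 7 of the right PF2 byte.
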
JetSetIlly Posted June 26, 2021

This is incredibly useful. PrepArenaBuffers() appears to be at 0x000011b0 while RAM_PrepArenaBuffers() appears to be at 0x4000185c. There's no copying of the custom program to somewhere in RAM, it's just the variable block. Interesting.

+SpiceWare (Author) Posted June 27, 2021

3 hours ago, JetSetIlly said: There's no copying of the custom program to somewhere in RAM, it's just the variable block. Interesting.

If you defined an array like this:

    unsigned char arena_color[4] = { _RED + 4, _GREEN + 4, _BLUE + 4, _WHITE };

without the const then the array is defined as being in RAM, not ROM. The compiler sets up the initial values of _RED + 4, ..., _WHITE in a data section in ROM. The very first time you call custom ARM code a one-time process will run that copies those initial values from the ROM data section to the appropriate location in RAM. The section(".data") bit of __attribute__ taps into those routines to copy the function from ROM to RAM for you.

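The one-time copy described above can be pictured roughly like this. This is a sketch under assumptions, not the actual CDFJ startup code (which is not shown in this thread): copy_data_section() is a hypothetical name, and in a real build its three pointers would come from symbols defined by the linker script.

```c
#include <stdint.h>

/* Hypothetical sketch of the startup copy: move the initial contents of
   the .data section from where they are stored in ROM (load address) to
   where the program expects them in RAM (run address), one 32-bit word
   at a time. A function placed in .data is copied along with the
   initialized variables. */
static void copy_data_section(const uint32_t *load_addr,
                              uint32_t *run_start,
                              const uint32_t *run_end)
{
    while (run_start < run_end) {
        *run_start++ = *load_addr++;
    }
}
```

The copy stops at run_end, so anything that follows the section in RAM is left alone.
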
JetSetIlly Posted June 27, 2021

3 hours ago, SpiceWare said: If you defined an array like this: unsigned char arena_color[4] = { _RED + 4, _GREEN + 4, _BLUE + 4, _WHITE }; without the const then the array is defined as being RAM, not ROM. The compiler sets up the initial values of _RED + 4, ..., _WHITE in a data section in ROM. The very first time you call custom ARM code a one-time process will run that copies those initial values from the ROM data section to the appropriate location in RAM. The section(".data") bit of __attribute__ taps into those routines to copy the function from ROM to RAM for you.

Yes. I was thinking from an emulation point of view about why this works without any change to the driver. And it's because these functions are copied from ROM to RAM as part of the .data section, which is already emulated (in Stella, etc.). Very nice solution for time-critical code.

Thomas Jentzsch Posted June 27, 2021

Very good. That's with MAM mode 2, right? Have you tested mode 1 too? The difference should be even bigger there.

BTW: While you are at it, how about moving the data into RAM too (or instead)?

Edited June 27, 2021 by Thomas Jentzsch

+SpiceWare (Author) Posted June 27, 2021

2 hours ago, Thomas Jentzsch said: That's with MAM mode 2, right?

No, CDFJ sets it to MAM = 1 when calling the custom ARM code as it could be used on an original Harmony. CDFJ+, which is being used in Turbo, leaves it as MAM = 2. I made this change:

    int main()
    {
        MAMCR=2;
        ...
    }

[Screenshot: running PrepArenaBuffers()]
[Screenshot: running RAM_PrepArenaBuffers()]

ROM and RAM execution times with MAM = 2 are both faster than before:

ROM 0x131C cycles, 4892 decimal
RAM 0x1075 cycles, 4213 decimal

The RAM boost is only 14%.

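The percentages quoted for the two runs follow from the cycle counts as (ROM - RAM) / ROM. A small sketch of that arithmetic, where percent_saved() is a hypothetical helper name and the inputs are assumed to satisfy rom_cycles > 0 and ram_cycles <= rom_cycles:

```c
/* Hypothetical helper: percentage of time saved by the RAM routine,
   rounded to the nearest whole percent.
   Assumes rom_cycles > 0 and ram_cycles <= rom_cycles. */
static unsigned int percent_saved(unsigned int rom_cycles,
                                  unsigned int ram_cycles)
{
    unsigned int saved = rom_cycles - ram_cycles;
    return (saved * 100u + rom_cycles / 2u) / rom_cycles;
}
```

With the MAM = 1 counts (0x1660 and 0x114d) this gives 23, and with the MAM = 2 counts (0x131C and 0x1075) it gives 14, matching the figures in the posts.
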
Thomas Jentzsch Posted June 27, 2021

OK, that makes perfect sense. MAM = 2 is caching slow memory access much better.

That's with a Harmony, right? The dev cart should have different results (due to its Dual Flash).

Edited June 27, 2021 by Thomas Jentzsch

+SpiceWare (Author) Posted June 27, 2021

14 minutes ago, Thomas Jentzsch said: That's with a Harmony

I used my Harmony Encore.

Thomas Jentzsch Posted June 28, 2021

7 hours ago, SpiceWare said: I used my Harmony Encore.

Same chip (LPC2103). The dev cart uses an LPC2104 or 05, which has a Dual Flash. So timing should be different.

JetSetIlly Posted June 28, 2021

1 hour ago, Thomas Jentzsch said: Same chip (LPC2103). The dev cart uses a LPC2104 or 05 which has a Dual Flash. So timing should be different.

Does the Encore have a 70MHz processor like the original Harmony?

Thomas Jentzsch Posted June 28, 2021

I did my own tests on my PAL console.

Chip     MAM  ROM       RAM
LPC2103  0    0x3662    0x1178
LPC2103  1    0x1686    0x1168
LPC2103  2    0x1339    0x1090
LPC2104  0    0x3662    0x1178
LPC2104  1    0x25bd ?  0x1169
LPC2104  2    0x1303    0x108c

The LPC2104 should be able to handle ROM a bit better; the MAM = 2 value proves that. The 0x25bd must be an error (I suppose there is a bug in the chip with MAM = 1; it seems that only data is cached, though that is not mentioned in the errata sheet). Something like 0x15bd would fit much better.

BTW: I wonder why our LPC2103 values slightly differ.

Thomas Jentzsch Posted June 28, 2021

12 minutes ago, JetSetIlly said: Does the Encore have a 70MHz processor like the original Harmony?

Yes, it uses the same chip (LPC2103). Only the dev carts are using the LPC2104 or 05.

JetSetIlly Posted June 28, 2021

Do we know if the Harmony driver sets the APB divider to 1 or if it has been left at the default rate of 1/4? I'm guessing it's set to 1 but I don't know.

Thomas Jentzsch Posted June 28, 2021

52 minutes ago, JetSetIlly said: Do we know if the Harmony driver sets the APB divider to 1 or if it has been left at the default rate of 1/4. I'm guessing it's set to 1 but I don't know.

It is set to 1.

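With the APB divider at 1, the timer used for the measurements runs at the CPU clock, so the cycle counts in this thread can be turned into wall-clock time. A rough sketch, assuming the 70 MHz clock mentioned above and a timer prescaler of zero (the prescaler setting is not shown in the thread); CPU_HZ and cycles_to_tenths_us() are hypothetical names:

```c
#define CPU_HZ 70000000u  /* 70 MHz, as mentioned in the thread */

/* Hypothetical helper: convert a timer cycle count into tenths of a
   microsecond (truncated), assuming one timer tick per CPU cycle. */
static unsigned int cycles_to_tenths_us(unsigned int cycles)
{
    return (unsigned int)(((unsigned long long)cycles * 10000000ull) / CPU_HZ);
}
```

Under those assumptions the 0x1339 cycles measured for the ROM routine at MAM = 2 come out at 703 tenths of a microsecond, or roughly 70 microseconds.
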
JetSetIlly Posted June 28, 2021

1 minute ago, Thomas Jentzsch said: It is set to 1.

Thanks.

Thomas Jentzsch Posted June 28, 2021

3 minutes ago, JetSetIlly said: Thanks.

You are welcome. That means that anything except Flash memory (SRAM, peripherals) has no wait states, right?

JetSetIlly Posted June 28, 2021

3 minutes ago, Thomas Jentzsch said: You are welcome. That means that anything except Flash memory (SRAM, peripherals) has no wait states, right?

It affects peripherals. So, anything with an address in the range 0xE0000000 to 0xEFFFFFFF is on the peripheral bus. SRAM is not on that bus so is not affected by the APB divider.

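The address range given above can be written as a simple check. This is only an illustration of the range quoted in the post; on_peripheral_bus() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the address lies in the peripheral bus
   range quoted in the thread (0xE0000000 to 0xEFFFFFFF inclusive). */
static bool on_peripheral_bus(uint32_t addr)
{
    return addr >= 0xE0000000u && addr <= 0xEFFFFFFFu;
}
```

Both function addresses reported earlier in the thread, 0x000011b0 for the ROM routine and 0x4000185c for the RAM routine, fall outside this range.
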
Thomas Jentzsch Posted June 28, 2021

8 minutes ago, JetSetIlly said: It affects peripherals. So, anything with an address in the range 0xE0000000 to 0xEFFFFFFF is on the peripheral bus. SRAM is not on that bus so is not affected by the APB divider.

Yes, but SRAM is assumed to be accessible without wait states anyway.

JetSetIlly Posted June 28, 2021

3 minutes ago, Thomas Jentzsch said: Yes, but SRAM is assumed to be accessible without wait states anyway.

Oh I see what you mean. Yes. I've been working on the assumption that SRAM has no waiting.

+SpiceWare (Author) Posted June 28, 2021

2 hours ago, Thomas Jentzsch said: BTW: I wonder why our LPC2103 values slightly differ.

Did you build new ROMs for your MAM tests? If so, we likely have different versions of the C compiler installed that optimized the ARM code differently.

With the prior compiler I had 4 versions installed in different virtual machines. For each project I'd use whichever compiler version resulted in smaller code:

Quote
The size of custom2.bin will depend upon which C compiler you installed.
2011.03-42 - 8136 bytes
2012.03.56 - 8308 bytes
2012.09.63 - 8224 bytes
2013.05.23 - 8224 bytes

Thomas Jentzsch Posted June 28, 2021

Which file should I look for? main.o (11988 bytes)?

BTW: I experimented with compiler settings. With global -O3 the code size almost doubled (23538 bytes), but the times improved a lot:

LPC2103, MAM = 2: 0x0d59 (ROM) ~36% faster, 0x0d07 (RAM) ~27% faster
LPC2104, MAM = 2: 0x0e28 (ROM) ~46% faster, 0x0d04 (RAM) ~24% faster

Now I will try to get only the function optimized for speed.

Edit: Found it!

    __attribute__( ( section(".data") ) ) __attribute__ ((optimize(3)))
    unsigned int RAM_BitReversal(unsigned int value)

With this setting for the two functions the file size increased by only 272 bytes. For the LPC2103 the ROM function even became slightly slower, at 0x133f (I suppose the increased code size causes a lot of MAM buffer misses), but the RAM function decreased to 0x0d5b (~24%). For the LPC2104 the results are similar.

RAM and local optimization together improved by ~44%. I think it is worth going in that direction.

Edited June 28, 2021 by Thomas Jentzsch

+SpiceWare (Author) Posted June 28, 2021

3 hours ago, Thomas Jentzsch said: Which file should I look for? main.o (11988 bytes)?

armcode.bin (3708 bytes) - it's the ARM code that ends up in the ROM.

3 hours ago, Thomas Jentzsch said: Edit: Found it! __attribute__( ( section(".data") ) ) __attribute__ ((optimize(3))) unsigned int RAM_BitReversal(unsigned int value)

Nice! Can probably combine those just like is done with long_call for RAM_PrepArenaBuffers(). Would be worth testing that with all 3 functions. I'm not able to at the moment, my work day has started.

    __attribute__( ( long_call, optimize(3), section(".data") ) )
    void RAM_PrepArenaBuffers() { ... }

    __attribute__( ( optimize(3), section(".data") ) )
    unsigned int RAM_BitReversal(unsigned int value) { ... }

    __attribute__( ( optimize(3), section(".data") ) )
    int RAM_ColorConvert(int color) { ... }

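If the combined attribute list ends up on many functions, it could be wrapped in a macro so that the same source also builds unchanged on a desktop compiler for testing. This is only a sketch: RAM_FUNCTIONS_ENABLED, RAM_FUNC, RAM_FUNC_FAR and ram_sum_bytes() are hypothetical names that do not come from Collect 3; only the attributes themselves are the ones discussed in this thread.

```c
/* Hypothetical opt-in switch: define RAM_FUNCTIONS_ENABLED only in the
   ARM cartridge build. Without it the macros expand to nothing and the
   functions compile as ordinary code. */
#if defined(RAM_FUNCTIONS_ENABLED)
#define RAM_FUNC      __attribute__( ( optimize(3), section(".data") ) )
#define RAM_FUNC_FAR  __attribute__( ( long_call, optimize(3), section(".data") ) )
#else
#define RAM_FUNC
#define RAM_FUNC_FAR
#endif

/* Example function, not from Collect 3: add up a small byte buffer.
   RAM_FUNC_FAR is for functions called from code in ROM; RAM_FUNC is for
   helpers that are only called from other RAM functions. */
RAM_FUNC_FAR unsigned int ram_sum_bytes(const unsigned char *data,
                                        unsigned int count)
{
    unsigned int total = 0;
    unsigned int i;

    for (i = 0; i < count; i++) {
        total += data[i];
    }
    return total;
}
```

The opt-in switch is a deliberate choice here: a function forced into .data on a desktop system would most likely not be executable, so the attributes are left out unless the build asks for them.
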
+SpiceWare (Author) Posted June 28, 2021

Looks like arm-eabi-gcc --version is the command to report compiler version:

    atari@atari-VirtualBox:/media/sf_Atari/Collect3RAMfunction$ arm-eabi-gcc --version
    arm-eabi-gcc (Linaro GCC 7.4-2019.02) 7.4.1 20181213 [linaro-7.4-2019.02 revision 56ec6f6b99cc167ff0c2f8e1a2eed33b1edc85d4]
    Copyright (C) 2017 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

That matches the version it was when I documented the installation:

On 10/26/2019 at 1:53 PM, SpiceWare said: Locate the gcc-linaro-???-x86_64_arm-eabi.tar.xz file where ??? is the most current release. At time of this post that's 7.4.1-2019.02, so:

Looks like the version that's currently on the Linaro site is 7.5.0-2019.12 - kind of surprised about that.

Thomas Jentzsch Posted June 28, 2021

57 minutes ago, SpiceWare said: armcode.bin (3708 bytes) - its the ARM code that ends up in the ROM.

3972 bytes with the local optimizations.

Quote
arm-eabi-gcc (Linaro GCC 7.1-2017.08) 7.1.1 20170707
Copyright (C) 2017 Free Software Foundation, Inc.

Quite old.

Thomas Jentzsch Posted June 28, 2021

1 hour ago, SpiceWare said: Can probably combine those just like is done with long_call for RAM_PrepArenaBuffers(). Would be worth testing that with all 3 functions.

All three (4116 bytes) result in 0x0e31 (ROM) and 0x0d5a (RAM).
