
Speed boost - run code in RAM


SpiceWare


19 minutes ago, Thomas Jentzsch said:

All three (4116 bytes) result in 0x0e31 (ROM) and 0x0d5a (RAM).

 

Not what I was expecting, but it tells us to try optimize(3) with various combinations of the RAM functions to see which gives the best performance.

 

Looking in the armcode.txt file, created by issuing this command in the main/bin folder:

arm-eabi-objdump -D armcode.elf > armcode.txt

I see the 3 functions are discrete - it might be worth checking to see if the __attribute__ has an inline feature and use it on RAM_BitReversal to eliminate the overhead of all the calls to it. Might be slightly useful for RAM_ColorConvert, but I wouldn't expect much since it's only called once.

 

Disassembly of section .data:

40001800 <RAM_BitReversal>:
40001800:	2255      	movs	r2, #85	; 0x55
40001802:	0843      	lsrs	r3, r0, #1
...

40001828 <RAM_ColorConvert>:
40001828:	4b0a      	ldr	r3, [pc, #40]	; (40001854 <RAM_ColorConvert+0x2c>)
4000182a:	781b      	ldrb	r3, [r3, #0]
...

4000185c <RAM_PrepArenaBuffers>:
4000185c:	b5f0      	push	{r4, r5, r6, r7, lr}
4000185e:	4a28      	ldr	r2, [pc, #160]	; (40001900 <RAM_PrepArenaBuffers+0xa4>)
...
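
A minimal sketch of the inlining idea, assuming RAM_BitReversal reverses the bits of a byte with the usual mask-and-shift steps (the 0x55 constant in the disassembly hints at that, but the actual source is not shown in this thread). The name bit_reversal_inline is hypothetical. GCC does provide an always_inline function attribute; an inlined copy ends up wherever its caller is placed, so it needs no separate RAM placement.

```c
#include <stdint.h>

/* Hypothetical stand-in for RAM_BitReversal: reverses the 8 bits of v.
 * always_inline asks GCC to expand the body at every call site,
 * which removes the call/return overhead. */
static inline __attribute__((always_inline)) uint8_t bit_reversal_inline(uint8_t v)
{
    v = (uint8_t)(((v >> 1) & 0x55) | ((v & 0x55) << 1)); /* swap adjacent bits */
    v = (uint8_t)(((v >> 2) & 0x33) | ((v & 0x33) << 2)); /* swap bit pairs     */
    v = (uint8_t)((v >> 4) | (v << 4));                   /* swap nibbles       */
    return v;
}
```

Whether this is a net win would still have to be measured, since inlining many call sites also grows the calling function.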

 


No problem. I'm not able to do any testing tonight either as some family friends from Mexico are back in town for their second dose of the covid vaccine and we're going out for dinner when I get off work.

 

When they were here for the first dose I took a few days off and we did various things like Space Center Houston, which now has a SpaceX Falcon 9 on display.

 

 


I did some more testing, all with my Harmony cart (LPC2103):

  • no local optimizing: 0x1339 (ROM), 0x1090 (RAM)
  • optimizing PrepArenaBuffers only: 0x0e31 (ROM), 0x0d57 (RAM)
  • optimizing BitReversal only: 0x177c (ROM), 0x148a (RAM)
  • optimizing ColorConvert only: 0x1407 (ROM), 0x107c (RAM)
  • optimizing all three functions: 0x0e31 (ROM), 0x0d5a (RAM)

 

So PrepArenaBuffers is the key function here. Optimizing only the other functions even makes the result worse! I suppose that if the code becomes too large, the small 128-bit buffers of the MAM become much less efficient.

 

The alignment of loops might play a major role here too. E.g. if you have a (small) loop with an extra branch, the branch trail buffer will always miss unless the loop and the branch target are within the same 128 bits. If the loop is aligned to 128 bits, the chance of a hit increases. However, if the loop is larger and has multiple small branches, then putting the small branches into the same 128-bit page might be more efficient. In both cases the prefetch buffers may come to the rescue, though. And the LPC2104/5 of the dev cart has two buffers of each kind, which changes the situation again.
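
The "same 128 bits" condition is plain address arithmetic, sketched below under the assumption that a MAM line is 16 bytes and starts on a 16-byte boundary. The helper names and the addresses used with them are hypothetical; real values would come from the objdump listing.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAM_LINE_BYTES 16u /* 128 bits per line (assumed line size) */

/* True if both addresses fall into the same 128-bit line. */
static bool same_mam_line(uint32_t addr_a, uint32_t addr_b)
{
    return (addr_a / MAM_LINE_BYTES) == (addr_b / MAM_LINE_BYTES);
}

/* Number of 128-bit lines touched by code occupying [start, start + len). */
static uint32_t mam_lines_spanned(uint32_t start, uint32_t len)
{
    if (len == 0u)
        return 0u;
    return (start + len - 1u) / MAM_LINE_BYTES - start / MAM_LINE_BYTES + 1u;
}
```

For example, a 16-byte loop starting at 0x1000 touches one line, while the same loop starting at 0x1008 touches two.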

 

Another optimization is to put the data into the same 128-bit page, or at least to make sure that consecutively used data is in the same page.

 

Since the compilers are not aware of the MAM, they cannot optimize for it. Maybe we can find some parameters which can help here. E.g. it might be helpful to optimize for speed only moderately and reduce the code size. Then the buffer hit rate and the overall speed might increase. But I suppose there is a lot of trial and error. And while changing code, the results may vary quite a lot unexpectedly. I suggest that the critical functions are optimized with level 2 or 3 by default. And when the coding is done, one can experiment with different settings.

 

Some other possibly useful GCC function attributes:

  • aligned (alignment): This attribute specifies a minimum alignment for the function, measured in bytes.
  • always_inline: Generally, functions are not inlined unless optimization is specified. For functions declared inline, this attribute inlines the function even if no optimization level was specified.
  • flatten: Generally, inlining into a function is limited. For a function marked with this attribute, every call inside this function will be inlined, if possible. Whether the function itself is considered for inlining depends on its size and the current inlining parameters.
  • optimize: The optimize attribute is used to specify that a function is to be compiled with different optimization options than specified on the command line. Arguments can either be numbers or strings. Numbers are assumed to be an optimization level. Strings that begin with O are assumed to be an optimization option, while other options are assumed to be used with a -f prefix. You can also use the `#pragma GCC optimize' pragma to set the optimization options that affect more than one function. 
    This can be used for instance to have frequently executed functions compiled with more aggressive optimization options that produce faster and larger code, while other functions can be compiled with less aggressive options.
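
A minimal sketch of how the function attributes listed above can be combined on one routine. The function names are hypothetical and not taken from the Collect3 source; which combination actually helps on the Harmony cart would have to be measured.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical small helper, a candidate for inlining into its caller. */
static uint8_t swap_nibbles(uint8_t v)
{
    return (uint8_t)((v << 4) | (v >> 4));
}

/* Hypothetical hot routine:
 *  - optimize("O3"): compile this function at -O3 regardless of the
 *    level given on the command line
 *  - aligned(16):    start the function on a 16-byte (128-bit) boundary
 *  - flatten:        inline every call made inside this function, if possible */
__attribute__((optimize("O3"), aligned(16), flatten))
void swap_buffer(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = swap_nibbles(buf[i]);
}
```

Note that aligned only fixes where the function starts; loops inside it can still straddle a 128-bit boundary.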

 

The same for variables:

  • aligned (alignment): This attribute specifies a minimum alignment for the variable or structure field, measured in bytes. Example: int x __attribute__ ((aligned (16))) = 0;
  • packed: The packed attribute specifies that a variable or structure field should have the smallest possible alignment—one byte for a variable, and one bit for a field, unless you specify a larger value with the aligned attribute.
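
A short sketch of the two variable attributes, assuming a 16-entry lookup table that should fit into one 128-bit line. The table, its contents and the struct names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical 16-byte table aligned to 16 bytes, so that all entries
 * share a single 128-bit line. The values are placeholders. */
static const uint8_t color_table[16] __attribute__((aligned(16))) = {
    0x00, 0x02, 0x04, 0x06, 0x08, 0x0A, 0x0C, 0x0E,
    0x10, 0x12, 0x14, 0x16, 0x18, 0x1A, 0x1C, 0x1E
};

/* Without packed, the compiler pads after 'tag' to align 'length'. */
struct plain_header {
    uint8_t  tag;
    uint32_t length;
};

/* With packed, the fields are laid out back to back (5 bytes total). */
struct packed_header {
    uint8_t  tag;
    uint32_t length;
} __attribute__((packed));
```

packed saves space, but on cores without unaligned access support the compiler typically has to read such fields byte by byte, so it is probably a size optimization rather than a speed optimization here.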

 

These are only the most relevant ones I found; there are more.

 

We should also experiment with '#pragma GCC optimize' inside larger functions. So that only the most relevant parts (e.g. the main loop) of a function are optimized differently.
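
One caveat and a possible workaround, as a sketch: as far as I know, GCC accepts '#pragma GCC optimize' only at file scope and reports an error when it appears inside a function body. To optimize just the hot part of a larger function, that part can be moved into its own function and bracketed with push_options/pop_options. Both function names below are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#pragma GCC push_options
#pragma GCC optimize ("O3")
/* Hypothetical hot loop, split out of a larger function so it can be
 * compiled with its own optimization level. */
static uint32_t sum_bytes_hot(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}
#pragma GCC pop_options

/* The rest of the (hypothetical) larger function keeps the options
 * given on the command line, e.g. -Os. */
uint32_t process_frame(const uint8_t *buf, size_t len)
{
    if (buf == NULL || len == 0)
        return 0;
    return sum_bytes_hot(buf, len);
}
```

The optimize attribute on sum_bytes_hot would have the same effect; the pragma form is mainly convenient when several adjacent functions should share the setting.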

Edited by Thomas Jentzsch

4 hours ago, Thomas Jentzsch said:

There is a file armcode.txt, but it is not updated during make.

 

 

On 6/28/2021 at 11:00 AM, SpiceWare said:

Looking in the armcode.txt file, created by issuing this command in the main/bin folder:


arm-eabi-objdump -D armcode.elf > armcode.txt

 

 

Generating that was never added to the makefile, probably because we didn't utilize it before. I think this investigation of putting functions into RAM is the first time I've used it. I had to go looking for one of @cd-w's posts as I didn't remember how it was done.


I have completed my testing using Collect3 and also further improved Stella's cycle counting using the results:

                       Console (PAL)                       Stella                 Delta
Banks MAM O Opt Bytes   ROM   Dec      %   RAM  Dec     %   ROM   Dec   RAM  Dec     ROM    RAM
-     0   1 all 12108  3570 13680 278.6%  10FB 4347 88.5%  35B1 13745  115A 4442  100.5% 102.2%
-     0   s all 11972  350E 13582 276.6%  1048 4168 84.9%  35ED 13805  11BA 4538  101.6% 108.9%
-     0   2 all 12328  28CD 10445 212.7%   E4E 3662 74.6%  28FB 10491   E9A 3738  100.4% 102.1%
-     0   3 all 12412  28A8 10408 212.0%   E44 3652 74.4%  28D4 10452   E91 3729  100.4% 102.1%
1     1   1 all 12108  16A1  5793 118.0%  10E6 4326 88.1%  1652  5714  114B 4427   98.6% 102.3%
1     1   s all 11972  162B  5675 115.6%  10C3 4291 87.4%  1646  5702  11A5 4517  100.5% 105.3%
1     1   2 all 12328  11D1  4561  92.9%   E39 3641 74.2%  1236  4662   E85 3717  102.2% 102.1%
1     1   3 all 12412  11B5  4533  92.3%   E2F 3631 74.0%  11B8  4536   E82 3714  100.1% 102.3%
1     2   1 all 12108  131C  4892  99.6%   FFC 4092 83.3%  1097  4247  104C 4172   86.8% 102.0%
1     2   s all 11972  132E  4910 100.0%   FF1 4081 83.1%  10C1  4289  10AC 4268   87.4% 104.6%
1     2   2 all 12328   E4A  3658  74.5%   D5E 3422 69.7%   E0A  3594   D86 3462   98.3% 101.2%
1     2   3 all 12412   E31  3633  74.0%   D54 3412 69.5%   D95  3477   D89 3465   95.7% 101.6%
2     1   1 all 12108  2500  9472 192.9%  10EC 4332 88.2%  164C  5708  1145 4421   60.3% 102.1%
2     1   s all 11972  246C  9324 189.9%  10C9 4297 87.5%  1643  5699  11A5 4517   61.1% 105.1%
2     1   2 all 12328  1CCD  7373 150.2%   E3F 3647 74.3%  11BE  4542   E85 3717   61.6% 101.9%
2     1   3 all 12412  1C75  7285 148.4%   E35 3637 74.1%  11AC  4524   E7C 3708   62.1% 102.0%
2     2   1 all 12108  12D4  4820  98.2%   FFC 4092 83.3%  1091  4241  1046 4166   88.0% 101.8%
2     2   s all 11972  1294  4756  96.9%   FF1 4081 83.1%  10BE  4286  10AC 4268   90.1% 104.6%
2     2   2 all 12328   D76  3446  70.2%   D5E 3422 69.7%   D92  3474   D86 3462  100.8% 101.2%
2     2   3 all 12412   D65  3429  69.8%   D5B 3419 69.6%   D89  3465   D83 3459  101.0% 101.2%

 

Legend:

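  • Banks: number of Flash banks (LPC2103 = 1, LPC2104/5 = 2)
  • MAM: MAM mode
  • O: local optimization level (1 = -O1, s = -Os, 2 = -O2, 3 = -O3)
  • Opt: all 3 routines optimized
  • Bytes: size of main.c
  • Dec: decimal value of the hex value to its left
  • %: cycle count relative to the LPC2103 with MAM = 2, -Os, ROM (= 100%); lower is faster
  • ROM: code executed in ROM
  • RAM: code executed in RAM
  • Delta: Stella's cycle count relative to the console's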
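
A small sketch of how the percentage columns appear to be derived, assuming the baseline is the LPC2103 row with MAM = 2, -Os, ROM (0x132E = 4910 cycles, which the table shows as 100%) and that the Delta columns are Stella's count divided by the console's. The function name is hypothetical.

```c
#include <stdint.h>

/* Cycle count expressed in tenths of a percent of a baseline,
 * rounded to the nearest tenth (e.g. 698 means 69.8%). */
static uint32_t relative_tenths_of_percent(uint32_t cycles, uint32_t baseline)
{
    return (cycles * 1000u + baseline / 2u) / baseline;
}
```

With these assumptions, 0x0D65 (3429 cycles) against the 4910-cycle baseline gives 698, matching the 69.8% in the last table row, and Stella's 13745 against the console's 13680 gives 1005, matching the 100.5% in the first row.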

Findings:

  • Compared to running the code in ROM with default optimization -Os and MAM = 2, running it with local optimization and in RAM takes 25-30% fewer cycles.
  • Local optimizations are more efficient than RAM code. E.g. for the LPC2104 with -O3 and MAM = 2, there is only very little (0.2%) gained when moving the code to RAM. RAM is more effective with less optimization. This shows that the MAM very effectively buffers Flash memory and that the optimization helps here (probably by using proper alignment, more testing required here).
  • The difference between -O2 and -O3 is minimal.

Notes:

  • The LPC2104 I have clearly has a bug in MAM mode 1 (the Banks = 2, MAM = 1 rows, marked in red in the original post). The numbers are (much) worse than for the LPC2103, especially for ROM code. It would be nice if someone with an LPC2105 could verify this.
  • Stella is already quite close for MAM modes 0 or 1. But in mode 2, especially with -O1 and -Os in ROM there is room for improvement.

 

Collect.zip


1 hour ago, Thomas Jentzsch said:

I have completed my testing using Collect3 and also further improved Stella's cycle counting using the results:

 

This is really useful data. For what it's worth, Gopher2600 is close in some areas and not in others. If we look at MAM-0, for example, O0 and Os are pretty much around 100%, but O2 and O3 are less than 100%, so something is happening in the optimised code which is upsetting the emulation. I'll take a closer look to see what the differences are.


I've used the data and ROMs Thomas posted to improve the cycle counting in Gopher2600. There are a couple of problem areas, but on the whole it seems fairly consistent in all combinations. Certainly, all real-world ROMs that I have been using for testing (Turbo, Draconian, Zaxxon, etc.) continue to perform as expected (causing screen roll or not depending on the MAM settings).

 

If nothing else it shows that 100% accuracy is within reach.

 

image.png.988643774e9739548f4c59b0df955091.png
