Speed boost - run code in RAM

+SpiceWare · June 28, 2021

19 minutes ago, Thomas Jentzsch said:

All three (4116 bytes) result in 0x0e31 (ROM) and 0x0d5a (RAM).

Not what I was expecting, but tells us to try optimize(3) with various combinations of the RAM functions to see what results in the best performance.

Looking in the armcode.txt file, created by issuing this command in the main/bin folder:

arm-eabi-objdump -D armcode.elf > armcode.txt

I see the 3 functions are discrete - it might be worth checking to see if the __attribute__ has an inline feature and use it on RAM_BitReversal to eliminate the overhead of all the calls to it. Might be slightly useful for RAM_ColorConvert, but I wouldn't expect much since it's only called once.

Disassembly of section .data:

40001800 <RAM_BitReversal>:
40001800:	2255      	movs	r2, #85	; 0x55
40001802:	0843      	lsrs	r3, r0, #1
...

40001828 <RAM_ColorConvert>:
40001828:	4b0a      	ldr	r3, [pc, #40]	; (40001854 <RAM_ColorConvert+0x2c>)
4000182a:	781b      	ldrb	r3, [r3, #0]
...

4000185c <RAM_PrepArenaBuffers>:
4000185c:	b5f0      	push	{r4, r5, r6, r7, lr}
4000185e:	4a28      	ldr	r2, [pc, #160]	; (40001900 <RAM_PrepArenaBuffers+0xa4>)
...
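A minimal sketch of the inlining idea, assuming a GCC toolchain. The real body of RAM_BitReversal is not shown in the thread, so bit_reverse8 and reverse_buffer below are hypothetical stand-ins: a byte-wise bit reversal (the 0x55 mask matches the first instruction in the listing, but that is only a guess at the algorithm) marked always_inline so each use expands in place instead of costing a call.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for RAM_BitReversal: reverses the bit order of one
   byte in three swap steps. always_inline asks GCC to expand it at every
   call site, even when no optimization level is given. */
static inline __attribute__((always_inline)) uint8_t bit_reverse8(uint8_t v)
{
    v = (uint8_t)(((v >> 1) & 0x55u) | ((v & 0x55u) << 1)); /* swap adjacent bits */
    v = (uint8_t)(((v >> 2) & 0x33u) | ((v & 0x33u) << 2)); /* swap bit pairs */
    v = (uint8_t)(((v >> 4) & 0x0Fu) | ((v & 0x0Fu) << 4)); /* swap nibbles */
    return v;
}

/* Hypothetical caller: the loop body contains the reversal itself,
   so there is no call and return per byte. */
void reverse_buffer(uint8_t *buf, uint32_t len)
{
    assert(buf != 0);
    for (uint32_t i = 0; i < len; i++)
        buf[i] = bit_reverse8(buf[i]);
}
```

One thing to keep in mind: an inlined copy runs from wherever its caller is placed, so inlining a RAM function into a caller that stays in ROM would move that work back to ROM.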

Thomas Jentzsch · June 28, 2021

Maybe tomorrow, football now.

+SpiceWare · June 28, 2021

No problem. I'm not able to do any testing tonight either as some family friends from Mexico are back in town for their second dose of the covid vaccine and we're going out for dinner when I get off work.

When they were here for the first dose I took a few days off and we did various things like Space Center Houston, which now has a SpaceX Falcon 9 on display.

Thomas Jentzsch · June 28, 2021

This is more something for @johnnywc to test for Turbo Arcade anyway.

+SpiceWare · June 28, 2021

True, though this investigation is helping to revive my interest in 2600 projects. Hopefully I'll be interested enough to get back to it when our friends leave in a few days.

Thomas Jentzsch · June 29, 2021

I did some more testing, all with my Harmony cart (LPC2103):

no local optimizing: 0x1339 (ROM), 0x1090 (RAM)
optimizing PrepArenaBuffers only: 0x0e31 (ROM), 0x0d57 (RAM)
optimizing BitReversal only: 0x177c (ROM), 0x148a (RAM)
optimizing ColorConvert only: 0x1407 (ROM), 0x107c (RAM)
optimizing all three functions: 0x0e31 (ROM), 0x0d5a (RAM)

So PrepArenaBuffers is the key function here. Optimizing only the other functions even makes the result worse! I suppose that if the code becomes too large, the small 128-bit buffers of the MAM become much less efficient.

The alignment of loops might play a major role here too. E.g. if you have a (small) loop with an extra branch, the branch trail buffer will always miss unless the loop and the branch target are within the same 128 bits. If the loop is aligned to 128 bits, the chance of a hit increases. However, if the loop is larger and has multiple small branches, then putting the small branches into the same 128-bit page might be more efficient. Though in both cases the prefetch buffers may come to the rescue. And for the LPC2104/5 of the dev cart, we have two buffers each, which changes the situation again.
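A rough sketch of how a hot routine could be pinned to a 128-bit (16-byte) boundary with GCC's aligned function attribute. sum_bytes and entry_offset_in_line are made-up names for illustration, not part of the project. Note that this only aligns the function's entry point; where the loop inside it lands still depends on the generated code.

```c
#include <assert.h>
#include <stdint.h>

#define MAM_LINE_BYTES 16u /* 128 bits, the buffer width discussed above */

/* Hypothetical hot routine, aligned so that its first instruction starts
   a new 128-bit line. noinline keeps it a separate, addressable function. */
__attribute__((aligned(16), noinline))
uint32_t sum_bytes(const uint8_t *p, uint32_t n)
{
    uint32_t sum = 0;
    while (n--)
        sum += *p++;
    return sum;
}

/* Offset of a function's entry point within its 128-bit line.
   Bit 0 is masked off because Thumb function pointers have it set. */
uint32_t entry_offset_in_line(uint32_t (*fn)(const uint8_t *, uint32_t))
{
    assert(fn != 0);
    uintptr_t addr = (uintptr_t)fn & ~(uintptr_t)1;
    return (uint32_t)(addr % MAM_LINE_BYTES);
}
```

Calling entry_offset_in_line(sum_bytes) should return 0 when the attribute is honored; checking the addresses in the objdump listing tells you the same thing.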

Another optimization is to put the data into the same 128-bit page, or to make sure that consecutively used data is in the same page.
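As an illustration of the data side, here is a sketch that keeps a 16-byte lookup table inside a single 128-bit line by aligning it to 16 bytes. The table, reverse8_lut and same_line are hypothetical examples, assuming GCC's aligned variable attribute.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-entry table (bit-reversed nibbles). It is exactly 16 bytes
   and aligned to 16, so every entry sits in the same 128-bit line. */
static const uint8_t nibble_reverse[16] __attribute__((aligned(16))) = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Reverse a byte with two lookups that both hit the same line. */
uint8_t reverse8_lut(uint8_t v)
{
    return (uint8_t)((nibble_reverse[v & 0x0Fu] << 4) | nibble_reverse[v >> 4]);
}

/* Returns 1 if both addresses fall into the same 16-byte line. */
int same_line(const void *a, const void *b)
{
    assert(a != 0 && b != 0);
    return ((uintptr_t)a / 16u) == ((uintptr_t)b / 16u);
}
```

Without the attribute, a 16-byte table can straddle two lines, and which entries share a line then changes whenever the surrounding data moves.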

Since the compilers are not aware of the MAM, they cannot optimize for it. Maybe we can find some parameters which can help here. E.g. it might be helpful to optimize for speed only moderately and reduce the code size. Then the buffer hit rate and the overall speed might increase. But I suppose there is a lot of trial and error. And while changing code, the results may vary quite a lot unexpectedly. I suggest that the critical functions are optimized with level 2 or 3 by default. And when the coding is done, one can experiment with different settings.

Some other possibly useful GCC function attributes:

aligned (alignment): This attribute specifies a minimum alignment for the function, measured in bytes.
always_inline: Generally, functions are not inlined unless optimization is specified. For functions declared inline, this attribute inlines the function even if no optimization level was specified.
flatten: Generally, inlining into a function is limited. For a function marked with this attribute, every call inside this function will be inlined, if possible. Whether the function itself is considered for inlining depends on its size and the current inlining parameters.
optimize: The optimize attribute is used to specify that a function is to be compiled with different optimization options than specified on the command line. Arguments can either be numbers or strings. Numbers are assumed to be an optimization level. Strings that begin with O are assumed to be an optimization option, while other options are assumed to be used with a -f prefix. You can also use the `#pragma GCC optimize' pragma to set the optimization options that affect more than one function.
This can be used for instance to have frequently executed functions compiled with more aggressive optimization options that produce faster and larger code, while other functions can be compiled with less aggressive options.
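A small sketch of how the optimize and flatten attributes from the list above might be combined. The function names are invented for the example and the code assumes GCC; other compilers may ignore the optimize attribute with a warning. GCC's own documentation also cautions that optimize is intended mainly for debugging, so results should be checked in the disassembly rather than taken for granted.

```c
#include <assert.h>
#include <stdint.h>

/* Small helper; a candidate for inlining into its caller. */
static uint32_t square(uint32_t x)
{
    return x * x;
}

/* Hypothetical hot routine: compiled at -O3 regardless of the command line.
   flatten asks GCC to inline every call made inside it, if possible. */
__attribute__((optimize("O3"), flatten))
uint32_t sum_of_squares(const uint16_t *v, uint32_t n)
{
    assert(v != 0 || n == 0);
    uint32_t sum = 0;
    for (uint32_t i = 0; i < n; i++)
        sum += square(v[i]);
    return sum;
}

/* Hypothetical rarely used routine: optimized for size instead. */
__attribute__((optimize("Os")))
uint32_t xor_checksum(const uint8_t *p, uint32_t n)
{
    uint32_t x = 0;
    while (n--)
        x ^= *p++;
    return x;
}
```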

The same for variables:

aligned (alignment): This attribute specifies a minimum alignment for the variable or structure field, measured in bytes. Example: int x __attribute__ ((aligned (16))) = 0;
packed: The packed attribute specifies that a variable or structure field should have the smallest possible alignment—one byte for a variable, and one bit for a field, unless you specify a larger value with the aligned attribute.
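To make the effect of the two variable attributes concrete, here is a sketch comparing the same two-field layout with packed and with aligned(16). The struct names are hypothetical and the sizes in the comments assume GCC with 8-bit bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same two fields, two layouts (struct names are made up for illustration). */
struct tight { uint8_t tag; uint32_t value; } __attribute__((packed));
struct lined { uint8_t tag; uint32_t value; } __attribute__((aligned(16)));

/* Checks the layout the attributes are expected to produce. */
void check_layout(void)
{
    /* packed: no padding, value starts right after tag */
    assert(sizeof(struct tight) == 5);
    assert(offsetof(struct tight, value) == 1);

    /* aligned(16): starts on a 16-byte boundary, size rounded up to 16 */
    assert(_Alignof(struct lined) == 16);
    assert(sizeof(struct lined) == 16);
}
```

packed saves space, but the misaligned value field usually has to be read byte by byte on an ARM7TDMI, so it tends to cost speed. For the MAM the aligned variant is probably the more interesting one.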

These are only the most relevant ones I found. There are more.

We should also experiment with '#pragma GCC optimize' inside larger functions. So that only the most relevant parts (e.g. the main loop) of a function are optimized differently.
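A sketch of how the pragma could be used, with one caveat: as far as I know, GCC only accepts '#pragma GCC optimize' at file scope and applies it to whole functions defined after it, so the hot part of a larger function has to be split out into its own function first. fill_row and prep_rows are hypothetical names; push_options and pop_options restore the command-line settings afterwards.

```c
#include <assert.h>
#include <stdint.h>

/* Everything between push_options and pop_options is compiled at -O3. */
#pragma GCC push_options
#pragma GCC optimize ("O3")
static void fill_row(uint8_t *row, uint8_t value, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        row[i] = value;
}
#pragma GCC pop_options

/* Hypothetical larger routine: stays at the command-line optimization
   level and only calls the separately optimized inner loop. */
uint32_t prep_rows(uint8_t *buf, uint32_t rows, uint32_t cols, uint8_t value)
{
    assert(buf != 0);
    for (uint32_t r = 0; r < rows; r++)
        fill_row(buf + r * cols, value, cols);
    return rows * cols; /* number of bytes written */
}
```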

Edited June 29, 2021 by Thomas Jentzsch

Thomas Jentzsch · June 30, 2021

@SpiceWare How can I enable assembler output (incl. source code) for main.c in the makefile? There is a file armcode.txt, but it is not updated during make. I really would like to have a look at the assembly created by the several __attribute__ settings.

+SpiceWare · June 30, 2021

4 hours ago, Thomas Jentzsch said:

There is a file armcode.txt, but it is not updated during make.

On 6/28/2021 at 11:00 AM, SpiceWare said:
Looking in the armcode.txt file, created by issuing this command in the main/bin folder:
arm-eabi-objdump -D armcode.elf > armcode.txt

Generating that was never added to the makefile, probably because we didn't utilize it before. I think this investigation of putting functions into RAM is the first time I've used it. I had to go looking for one of @cd-w's posts as I didn't remember how it was done.

Thomas Jentzsch · July 2, 2021

I have completed my testing using Collect3 and also further improved Stella's cycle counting using the results:

					Console	(PAL)					Stella				Delta
Banks	MAM	O	Opt	Bytes	ROM	Dec	%	RAM	Dec	%	ROM	Dec	RAM	Dec	ROM	RAM
-	0	1	all	12108	3570	13680	278,6%	10FB	4347	88,5%	35B1	13745	115A	4442	100,5%	102,2%
-	0	s	all	11972	350E	13582	276,6%	1048	4168	84,9%	35ED	13805	11BA	4538	101,6%	108,9%
-	0	2	all	12328	28CD	10445	212,7%	E4E	3662	74,6%	28FB	10491	E9A	3738	100,4%	102,1%
-	0	3	all	12412	28A8	10408	212,0%	E44	3652	74,4%	28D4	10452	E91	3729	100,4%	102,1%
1	1	1	all	12108	16A1	5793	118,0%	10E6	4326	88,1%	1652	5714	114B	4427	98,6%	102,3%
1	1	s	all	11972	162B	5675	115,6%	10C3	4291	87,4%	1646	5702	11A5	4517	100,5%	105,3%
1	1	2	all	12328	11D1	4561	92,9%	E39	3641	74,2%	1236	4662	E85	3717	102,2%	102,1%
1	1	3	all	12412	11B5	4533	92,3%	E2F	3631	74,0%	11B8	4536	E82	3714	100,1%	102,3%
1	2	1	all	12108	131C	4892	99,6%	FFC	4092	83,3%	1097	4247	104C	4172	86,8%	102,0%
1	2	s	all	11972	132E	4910	100,0%	FF1	4081	83,1%	10C1	4289	10AC	4268	87,4%	104,6%
1	2	2	all	12328	E4A	3658	74,5%	D5E	3422	69,7%	E0A	3594	D86	3462	98,3%	101,2%
1	2	3	all	12412	E31	3633	74,0%	D54	3412	69,5%	D95	3477	D89	3465	95,7%	101,6%
2	1	1	all	12108	2500	9472	192,9%	10EC	4332	88,2%	164C	5708	1145	4421	60,3%	102,1%
2	1	s	all	11972	246C	9324	189,9%	10C9	4297	87,5%	1643	5699	11A5	4517	61,1%	105,1%
2	1	2	all	12328	1CCD	7373	150,2%	E3F	3647	74,3%	11BE	4542	E85	3717	61,6%	101,9%
2	1	3	all	12412	1C75	7285	148,4%	E35	3637	74,1%	11AC	4524	E7C	3708	62,1%	102,0%
2	2	1	all	12108	12D4	4820	98,2%	FFC	4092	83,3%	1091	4241	1046	4166	88,0%	101,8%
2	2	s	all	11972	1294	4756	96,9%	FF1	4081	83,1%	10BE	4286	10AC	4268	90,1%	104,6%
2	2	2	all	12328	D76	3446	70,2%	D5E	3422	69,7%	D92	3474	D86	3462	100,8%	101,2%
2	2	3	all	12412	D65	3429	69,8%	D5B	3419	69,6%	D89	3465	D83	3459	101,0%	101,2%

Legend:

Banks = Number of Flash banks (LPC2103 = 1, LPC2104/5 = 2)
MAM = MAM mode
O = local optimization level
Opt: all 3 routines optimized
Bytes: size of main.c
Dec: decimal value of left hex
%: speed relative to LPC2103, -Os, ROM
ROM: code executed in ROM
RAM: code executed in RAM
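For anyone who wants to recompute the derived columns, here is a short sketch of the arithmetic. The baseline is the row that shows 100,0% in the table (Banks 1, MAM 2, -Os, ROM: 0x132E = 4910), and each Delta value is the Stella count divided by the console count. percent_tenths is a hypothetical helper that returns the percentage in tenths, rounded to the nearest tenth.

```c
#include <assert.h>
#include <stdint.h>

/* Baseline from the table: LPC2103, MAM 2, -Os, ROM = 0x132E (4910). */
#define BASELINE_COUNT 0x132Eu

/* value / reference as a percentage in tenths (2786 means 278.6%),
   rounded to nearest. Counts of this size do not overflow 32 bits. */
uint32_t percent_tenths(uint32_t value, uint32_t reference)
{
    assert(reference != 0);
    return (value * 1000u + reference / 2u) / reference;
}
```

For example, percent_tenths(0x3570, BASELINE_COUNT) gives 2786, which is the 278,6% in the first row, and percent_tenths(0x35B1, 0x3570) gives 1005, which is that row's ROM Delta of 100,5%.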

Findings:

Compared to running the code in ROM with default optimization -Os and MAM = 2, running with local optimization and in RAM results in 25-30% fewer cycles.
Local optimizations are more efficient than RAM code. E.g. for the LPC2104 with -O3 and MAM = 2, there is only very little (0.2%) gained when moving the code to RAM. RAM is more effective with less optimization. This shows that the MAM very effectively buffers Flash memory and that the optimization helps here (probably by using proper alignment, more testing required here).
The difference between -O2 and -O3 is minimal.

Notes:

The LPC2104 I have clearly has a bug in MAM mode 1 (the Banks = 2, MAM = 1 rows, marked in red in the original table). The numbers are (much) worse than for the LPC2103, especially for ROM code. It would be nice if someone with an LPC2105 could verify this.
Stella is already quite close for MAM modes 0 or 1. But in mode 2, especially with -O1 and -Os in ROM there is room for improvement.

Collect.zip

JetSetIlly · July 2, 2021

1 hour ago, Thomas Jentzsch said:

I have completed my testing using Collect3 and also further improved Stella's cycle counting using the results:

This is really useful data. For what it's worth, Gopher2600 is close in some areas and not in others. If we look at MAM-0, for example, O0 and Os are pretty much around 100% but O2 and O3 are less than 100%, so something is happening in the optimised code which is upsetting the emulation. I'll take a closer look to see what the differences are.

JetSetIlly · July 3, 2021

I've used the data and ROMs Thomas posted to improve the cycle counting in Gopher2600. There are a couple of problem areas but on the whole it seems fairly consistent in all combinations. Certainly, all real-world ROMs that I have been using for testing (Turbo, Draconian, Zaxxon, etc.) continue to perform as expected (causing screen roll or not depending on the MAM settings).

If nothing else it shows that 100% accuracy is within reach.

[Attached image: image.png]
