
bithopper
  1. Interestingly, Dionoid's size-optimized code is not only smaller, but also performs slightly better than the original. The original has some register-saving overhead; for some reason it needs one register more than the other simple loop memsets. When I played around with Thumb16 code, I found that incrementing the pointer instead of indexing often leads to slightly better loop results, at least with the compiler flags and options used. Here, the optimizer found and used a special ARM instruction (STMIA). This is why the presented alternate size-optimized single-loop code is another two bytes smaller (12 bytes) and also slightly faster.
  2. The gist: myMemsetInt_br8 always runs faster, right from count 1, and its average speed quickly increases with the size to fill, until it reaches almost 7 times (6.85) that of the original loop and still nearly 6 times (5.7) that of the fastest single loop in this comparison. To be exact: setting one integer (count 1) takes 19 cycles (including the call overhead for non-inlined code), roughly the same time as most variants. The original is slower at the beginning, as it has some initialization instructions; it needs 32 cycles. At count 8 a little setback takes place, reducing the average speed; however, the power of the block loop kicks in now and makes up for it very quickly (see data series and chart). Important here: we are never slower than any of the simple loops, even in this 'fast block preparation phase'! Some numbers: filling 7 integers takes less than half the time of the fastest single loop, 17 integers a third, 40 a quarter, 120 a fifth, and filling/clearing 1000 integers only takes a bit over a sixth (5.6) of the time, with a lean 1.8 cycles per filled integer. So much for the theory - now let's see it in practice.
  3. Got some questions about how much is actually gained by using the presented methods. As I had no detailed answer to that, only some promising observations in my game effort, here are some calculations based on a data sheet with cycle timings (from ARM, also attached). This might not completely replace actual measurements on real hardware, but it is probably not too far off and should shed some light on it. The attached PDF analyzes the cycle count for four relevant variants discussed in the topic and compares them, also graphically. It was done in Excel and I hope the cycle calculations are correct - you might do a countercheck and tell me where I miscounted... (In case you are wondering: the cycle count shown is calculated for a non-inlined function, so it includes the cycles for the method call and its return.) If the PDF doesn't speak for itself and needs some explanation, please let me know as well. myMemsetInt_CycleComparison_4Variants.pdf
  4. Here is the corresponding optimized custom memsetInt in native assembler code. The same principle applies as with the assembler memcpy in my last post: the memory is set blockwise, 8 integers (i.e. 32-bit words) per loop pass, using well-suited special ARM instructions (thanks, Thomas), up to the point where the remaining number of integers/words is below the block size. These are then handled without another loop. Once again the aim was a reasonably low overhead for small(ish) sizes of memory to set. Nevertheless, when filling/clearing larger buffers, the method still performs well. Its size is ~48 bytes of ARM Thumb code. Remember: the destination pointer needs to be 4-aligned.

        void myMemsetInt_br8(unsigned int* destination, unsigned int fill, int count)
        {
            // (C) 2023 oliver gross aka bithopper - use at your own RISC
            // V0.9.230219_0
            asm volatile(
                "cmp r2, #8\n\t"
                "blt .LL02\n\t"
                "push {r4-r6}\n\t"
                "mov r4, r1\n\t"
                "mov r5, r1\n\t"
                "mov r6, r1\n\t"
                "sub r2, #8\n\t"
                ".LL01:\n\t"
                "stmia r0!, {r1, r4-r6}\n\t"
                "stmia r0!, {r1, r4-r6}\n\t"
                "sub r2, #8\n\t"
                "bpl .LL01\n\t"
                "add r2, #8\n\t"
                "pop {r4-r6}\n\t"
                ".LL02:\n\t"
                "lsl r2, r2, #1\n\t"
                "mov r3, #13\n\t"
                "sub r3, r2\n\t"
                "add pc, r3\n\t"
                "stmia r0!, {r1}\n\t"
                "stmia r0!, {r1}\n\t"
                "stmia r0!, {r1}\n\t"
                "stmia r0!, {r1}\n\t"
                "stmia r0!, {r1}\n\t"
                "stmia r0!, {r1}\n\t"
                "stmia r0!, {r1}\n\t"
                : : : "r3", "cc", "memory"
            );
        }

     The code seems to run fine in my WIP, compiled with GCC, so I decided to post it here. However: this is a beta version! You are welcome to use it, but be warned: it is not thoroughly tested and there might be bug fixes and other updates in the near future. Please let me know of any issues or suggestions - feedback is most welcome 🙂
  5. Spent another couple of hours on the bytewise memcpy, as the result of the C compiler optimization still left something to be desired. After several frustrating attempts I took a deep breath and decided to write a custom native Thumb16 memcpy. Maybe it is also of use for you, fellow hero (or heroine?!) of the VCS. I reduced the unrolling to 8, now featuring two phases: a faster 'block' copy (although still with single-value copy instructions - the multiple-register load/store instructions appear to work only with whole words), followed by 0 to 7 final single-byte copies. As the overall size is ~108 bytes, this implementation will not be suited for every VCS project (most certainly not for Dionoid's Loderunner). It is still optimized for small copy sizes, though. I aimed for a low overhead when copying just a few bytes. When copying more than a few dozen bytes, we could use another memcpy implementation that aligns source and destination pointers and then ends up in a super fast integer copy. These algorithms are way faster for big copy sizes. However, they tend to be even bulkier, with a sobering overhead ratio for the first few bytes (how big are your sprites, again?)

        void myMemcpy_br8(unsigned char* destination, unsigned char* source, int count)
        {
            // (C) 2023 oliver gross aka bithopper - use at your own RISC
            // V0.9.230215_1
            asm volatile(
                "cmp r2, #7\n\t"
                "ble .LL2\n\t"
                "sub r2, #8\n\t"
                ".LL3:\n\t"
                "ldrb r3, [r1]\n\t"
                "strb r3, [r0]\n\t"
                "ldrb r3, [r1, #1]\n\t"
                "strb r3, [r0, #1]\n\t"
                "ldrb r3, [r1, #2]\n\t"
                "strb r3, [r0, #2]\n\t"
                "ldrb r3, [r1, #3]\n\t"
                "strb r3, [r0, #3]\n\t"
                "ldrb r3, [r1, #4]\n\t"
                "strb r3, [r0, #4]\n\t"
                "ldrb r3, [r1, #5]\n\t"
                "strb r3, [r0, #5]\n\t"
                "ldrb r3, [r1, #6]\n\t"
                "strb r3, [r0, #6]\n\t"
                "ldrb r3, [r1, #7]\n\t"
                "strb r3, [r0, #7]\n\t"
                "add r0, #8\n\t"
                "add r1, #8\n\t"
                "sub r2, #8\n\t"
                "bpl .LL3\n\t"
                "add r2, #8\n\t"
                ".LL2:\n\t"
                "lsl r2, r2, #3\n\t"
                "mov r3, #55\n\t"
                "sub r3, r2\n\t"
                "add pc, r3\n\t"
                "ldrb r3, [r1]\n\t"
                "add r1, #1\n\t"
                "strb r3, [r0]\n\t"
                "add r0, #1\n\t"
                "ldrb r3, [r1]\n\t"
                "add r1, #1\n\t"
                "strb r3, [r0]\n\t"
                "add r0, #1\n\t"
                "ldrb r3, [r1]\n\t"
                "add r1, #1\n\t"
                "strb r3, [r0]\n\t"
                "add r0, #1\n\t"
                "ldrb r3, [r1]\n\t"
                "add r1, #1\n\t"
                "strb r3, [r0]\n\t"
                "add r0, #1\n\t"
                "ldrb r3, [r1]\n\t"
                "add r1, #1\n\t"
                "strb r3, [r0]\n\t"
                "add r0, #1\n\t"
                "ldrb r3, [r1]\n\t"
                "add r1, #1\n\t"
                "strb r3, [r0]\n\t"
                "add r0, #1\n\t"
                "ldrb r3, [r1]\n\t"
                "add r1, #1\n\t"
                "strb r3, [r0]\n\t"
                "add r0, #1\n\t"
                : : : "r3", "cc", "memory"
            );
        }

     A word of caution: although it performed nicely in my WIP, using GCC, it is not thoroughly tested, so please regard this as highly experimental code that might still see some updates in the near future. (Did I mention it is now highly platform dependent?) If someone uses this method and finds a bug, or has some other suggestions about the ARM code, please let me know!
  6. Thanks, Thomas! Loading/storing multiple registers sure is a good way to further speed things up. If I remember correctly, this was even possible on the 68000. Do you have any idea how to 'persuade' the compiler's optimizer to make use of it? Up to now, the code is platform-independent...
  7. A word about the algorithm and its C implementation: as mentioned, while the technique is well known to some, its implementation in C proved a bit of a challenge. I tried different approaches and rejected all but the one presented, for various reasons. Generally speaking, the ultimate goal was C code that compiles to perfect ARM code, meaning you could not find a better solution when writing directly in ARM assembler. At least, we would want to come reasonably close. This means we must write some C code that nudges the C compiler's optimizer in the right direction. The best results I got were with a non-standard C feature called 'labels as values' that is common to GCC and other compilers. It allows for calculated jumps and is used in the presented methods. Other approaches in pure standard C, utilizing the 'switch' statement as in Duff's Device (which I found afterwards on Wikipedia), resulted in considerably worse compilation results, at least in my tests. It would be interesting to see if some ARM code natively written in assembler could still be any faster somehow. Please let me know!
  8. While optimizing a few C methods in my current CDFJ game project, I realized just how often I use the standard myMemcpy and myMemset functions provided in the file 'defines_cdfj.h'. These are basic loops that work fine and have a small memory footprint. I was using them all over the place, for initialization routines, sprite graphics, etc. For example:

        void myMemsetInt(unsigned int* destination, int fill, int count)
        {
            int i;
            for (i=0; i<count; ++i) {
                destination[i] = fill;
            }
        }

     The resulting ARM code the compiler produces looks something like:

        myMemsetInt:
            mov r3, #0
        .L2:
            cmp r3, r2
            bxge lr
            str r1, [r0, r3, lsl #2]
            add r3, r3, #1
            b .L2

     Without knowing too much about ARM coding: as a rule of thumb we can assume each line is executed by the ARM processor in one cycle here. (Some commands will take longer - multiplications, for example.) So we see 5 lines running in a loop, where only one of those lines is actually setting a value in memory; the other 4 are loop related, which is a pretty hefty overhead, isn't it? Maybe we can do better? (Note: actually the compiler will produce Thumb16 code, which looks even worse.) All of a sudden I remembered how you sped up screen clearing in 68000 assembler on the Atari ST back in those days, and I realized: by unrolling the loop we get rid of much of this overhead! OK, let's do it. Let's assume we want to set 8 values somewhere in memory. Then we can get rid of the loop completely by writing something like this:

        void myMemsetInt_8(unsigned int* destination, int fill)
        {
            *destination++ = fill;
            *destination++ = fill;
            *destination++ = fill;
            *destination++ = fill;
            *destination++ = fill;
            *destination++ = fill;
            *destination++ = fill;
            *destination++ = fill;
        }

     Which compiles to:

        myMemsetInt_8:
            str r1, [r0]
            str r1, [r0, #4]
            str r1, [r0, #8]
            str r1, [r0, #12]
            str r1, [r0, #16]
            str r1, [r0, #20]
            str r1, [r0, #24]
            str r1, [r0, #28]
            bx lr

     That's better, right?
     Just 9 commands required for setting 8 integers in memory - as opposed to roughly 5 * 8 = 40 with the simple loop. OK, we did it. End of story. Err, wait... What we really need is a more general solution with a reasonable memory footprint, capable of setting arbitrary sizes of memory, just like the standard loop, but still getting good benefit from this principle of unrolling. After some thinking about how to do certain things in C, I ended up writing this piece of code:

        void myMemsetInt_unrolled16(unsigned int* destination, int fill, int count)
        {
            static const void* jmparray[] = {&&L_16,&&L_15,&&L_14,&&L_13,&&L_12,&&L_11,&&L_10,&&L_09,&&L_08,&&L_07,&&L_06,&&L_05,&&L_04,&&L_03,&&L_02,&&L_01};
            // NOTE: GCC -Os will hopefully optimize this code so that it computes the address rather than actually building the table!
        loop:
            if(count < 16)
                goto *jmparray[count]; // Use unrolled fill for <count> integers.
                  *destination++ = fill;
            L_01: *destination++ = fill;
            L_02: *destination++ = fill;
            L_03: *destination++ = fill;
            L_04: *destination++ = fill;
            L_05: *destination++ = fill;
            L_06: *destination++ = fill;
            L_07: *destination++ = fill;
            L_08: *destination++ = fill;
            L_09: *destination++ = fill;
            L_10: *destination++ = fill;
            L_11: *destination++ = fill;
            L_12: *destination++ = fill;
            L_13: *destination++ = fill;
            L_14: *destination++ = fill;
            L_15: *destination++ = fill;
            L_16:
            if(count <= 16)
                return; // DONE
            count -= 16;
            goto loop;
        }

     The compiler produced this:

        myMemsetInt_unrolled16:
            ldr r3, .L22
        .L2:
            cmp r2, #15
            ldrle pc, [r3, r2, lsl #2] @ indirect memory jump
            str r1, [r0], #4
            str r1, [r0], #4
            str r1, [r0], #4
            str r1, [r0], #4
            str r1, [r0], #4
            str r1, [r0], #4
            str r1, [r0], #4
            str r1, [r0], #4
            str r1, [r0], #4
            str r1, [r0], #4
            str r1, [r0], #4
            str r1, [r0], #4
            str r1, [r0], #4
            str r1, [r0], #4
            str r1, [r0], #4
            str r1, [r0], #4
            cmp r2, #16
            bxle lr
            sub r2, r2, #16
            b .L2
        .L22:
            .word .LANCHOR0

     Doesn't look too bad in regard to loop overhead, right?
     You will notice the bigger size, though. Sure, the longer you unroll it, the better the overhead ratio but also the bigger the code size. I am very aware of the memory restrictions on the VCS, but when using bank switching models like CDFJ with 32K, I guess the speed boost might well be worth the memory trade-off. If the unrolling seems too excessive, you can adapt it to any other number. Maybe 8 is a good compromise between loop overhead and memory size. The 'unrolling factor' of 16 I used in this code just gives me the good feeling that I don't waste any loop overhead on my sprites, as most of them are not as tall :)) This idea is also applicable to other loops, for example the bytewise myMemcpy:

        void myMemcpy_unrolled16(unsigned char* destination, unsigned char* source, int count)
        {
            static const void* jmparray[] = {&&L_16,&&L_15,&&L_14,&&L_13,&&L_12,&&L_11,&&L_10,&&L_09,&&L_08,&&L_07,&&L_06,&&L_05,&&L_04,&&L_03,&&L_02,&&L_01};
            // NOTE: GCC -Os will hopefully optimize this code so that it computes the address rather than actually building the table!
        loop:
            if(count < 16)
                goto *jmparray[count]; // Use unrolled copy for <count> bytes.
            L_00: *destination++ = *source++;
            L_01: *destination++ = *source++;
            L_02: *destination++ = *source++;
            L_03: *destination++ = *source++;
            L_04: *destination++ = *source++;
            L_05: *destination++ = *source++;
            L_06: *destination++ = *source++;
            L_07: *destination++ = *source++;
            L_08: *destination++ = *source++;
            L_09: *destination++ = *source++;
            L_10: *destination++ = *source++;
            L_11: *destination++ = *source++;
            L_12: *destination++ = *source++;
            L_13: *destination++ = *source++;
            L_14: *destination++ = *source++;
            L_15: *destination++ = *source++;
            L_16:
            if(count <= 16)
                return; // DONE
            count -= 16;
            goto loop;
        }

     As mentioned in the code comments: the jump tables are only provided to let the compiler's optimizer know about our intentions. If the optimizer does its job, these tables are not in the result.
     The algorithm is optimized for small sizes, i.e. it is designed in a way that, compared to the simple loop, it has almost no overhead when setting just a single value. From the second integer on you get the sheer power of unrolling. When I showed the optimized copy and set methods to Darrell, he encouraged me to post them here. May someone find them useful as well - and now it's time to copy some more sprites. >> UPDATE: For some custom native ARM Thumb16 implementations of myMemcpy and myMemsetInt, please look further down <<
  9. Couldn't resist playing around with your danube song ... an experiment with dynamic waveforms for the lead voice. cdf1_music.bin
  10. Thanks a million, Darrell, for this perfect starter pack. I feel obliged now to integrate it into my WIP. (...ok, let's begin with finding some extra cycles at a steady scan line position for each kernel line 😱)
  11. Would be interested in the demos, as well. And a music driver would be just fantastic!
  12. The Atari VCS project I am currently working on will be offering big(ish) worlds to roam in.

     

    Currently, it is just a proof-of-concept featuring a player sprite flying through randomly generated consistent 'worlds'.

     

    There is collision detection and a few other sprites can be found in the area. Apart from that, there is no gameplay, yet.

     

    The proof-of-concept was written using a classic bank switching scheme without co-processor support.

     

    In favour of richer gameplay and development speed (!) I am currently exploring the possibilities of the CDFJ scheme.

     

    IMG_2489.jpg

  13. Wonderful tool with great usability - all you could ever wish for from a splash screen editor! Apart from somebook's suggestion, which I second, I can't think of anything else.
  14. What an absurd game idea - and real fun to play! Love it.