+TheBF #851 Posted February 15

Well, didn't this open an interesting can of worms. My great improvement was short-lived. On my Windows 10 computer using Classic99 and Camel99 Forth 2.66 I get a different timing than @Speccery did. I should have timed it on my machine before I even started working on improvements (doh!), but here is the apples-to-apples comparison:

So after seeing some improvement from re-working the DO LOOP with an inline NEXT interpreter, I wondered if I could use some of that extra space to speed up other primitives. Next I found this reference table, from "Stack Computers: The New Wave" by Phil Koopman, which gives some insight into which Forth words run most often.

6.3.1 Dynamic instruction frequencies

NAMES      FRAC     LIFE     MATH     COMPILE  AVE
CALL       11.16%   12.73%   12.59%   12.36%   12.21%
EXIT       11.07%   12.72%   12.55%   10.60%   11.74%
VARIABLE    7.63%   10.30%    2.26%    1.65%    5.46%
@           7.49%    2.05%    0.96%   11.09%    5.40%
0BRANCH     3.39%    6.38%    3.23%    6.11%    4.78%
LIT         3.94%    5.22%    4.92%    4.09%    4.54%
+           3.41%   10.45%    0.60%    2.26%    4.18%
SWAP        4.43%    2.99%    7.00%    1.17%    3.90%
R>          2.05%    0.00%   11.28%    2.23%    3.89%
>R          2.05%    0.00%   11.28%    2.16%    3.87%
CONSTANT    3.92%    3.50%    2.78%    4.50%    3.68%
DUP         4.08%    0.45%    1.88%    5.78%    3.05%
ROT         4.05%    0.00%    4.61%    0.48%    2.29%
USER        0.07%    0.00%    0.06%    8.59%    2.18%
C@          0.00%    7.52%    0.01%    0.36%    1.97%
I           0.58%    6.66%    0.01%    0.23%    1.87%
=           0.33%    4.48%    0.01%    1.87%    1.67%
AND         0.17%    3.12%    3.14%    0.04%    1.61%
BRANCH      1.61%    1.57%    0.72%    2.26%    1.54%
EXECUTE     0.14%    0.00%    0.02%    2.45%    0.65%

I already have 0BRANCH, BRANCH, EXIT, CALL (DOCOL), LIT, @ and DROP in 16 bit RAM. Using this table I changed the NEXT macro in each of the following words to the ILNEXT macro (inline NEXT, 3 instructions): DOVAR, +, SWAP, R>, >R, DOCON, ROT, DOUSER, C@, I and =. After re-compiling the kernel with these changes the FIB2-BENCH ran a bit faster again.
Fibonacci FIB2-BENCH Timings
----------------------------------------
V2.66    1:46.80
V2.67    1:46.03   0.7% better ( DO LOOP change)
V2.67b   1:44.50   2.2% better than original ( inline next on hi usage words)

Since a threaded Forth program spends about 50% of its time running the interpreter NEXT, you can get improvements by removing the branch through a register to get to it and placing it inline, as we can see. But... this consumed 46 bytes in my tiny kernel, so is it worth it? I will play with it more and run some more benchmarks before I make up my mind.
+TheBF #852 Posted Sunday at 06:44 PM

In the course of optimizing EMIT for faster screen printing I was amazed that the running time of the SEVENs problem benchmark did not drop by very much. I remembered that FbForth was doing Lee's version of the benchmark in less than 1 minute. I suspected my scroll was the issue since I bucked the trend and did not write it entirely in Assembler. I use an intermediate word called MOVEUP that scrolls a chunk of lines at a time, and you call it as many times as needed to cover all 24 lines.

: MOVEUP ( vaddr -- 'vaddr)
    C/L@ 8* >R               \ compute chunk size. 8* means 8 lines
    HERE 100 + OVER C/L@ +   ( -- 1stline heap 2ndline)
    OVER R@ VREAD
    OVER R@ VWRITE
    R> + ;                   \ 36 bytes

I re-wrote the scroll as one Forth word that uses a buffer the size of the screen minus one line. That brought the time down to 51.5 seconds.

: SCROLL ( -- )
    PAUSE
    TOPLN                    \ top of VDP screen memory
    C/SCR @ C/L@ - >R        \ C/SCR-1line to rstack
    HERE 100 + OVER C/L@ +   ( -- 1stline heap 2ndline)
    OVER R@ VREAD
    SWAP R> VWRITE
    0 17 AT-XY VPOS C/L@ BL VFILL   \ SEVENS = 51.5 SECS
;

I don't really want to use such a huge buffer, even though it's in un-allocated memory, because at some point it will crash into the stack in a big program project. This is especially true in 80 column mode. I also don't want to put the scroll buffer in low RAM since that is so useful for SAMS buffers. Since I can rebuild the kernel in 5 seconds and re-run it on Classic99, I did the experiment to find out how the buffer size affects the speed of the benchmark. Here is the data. I think I will stay with my original decision to use an 8 line buffer, but at least I know now that in the SEVENs benchmark almost 5 extra seconds are being used just to scroll the screen. Amazing.
Buffer Lines   Sevens Speed   Reduction   Notes
     1          01:17.26
     2          01:07.83      -12.21%     Uses do loop
     4          01:02.83      -18.68%     Uses do loop
     8          01:00.90      -21.18%     Uses do loop
     8          01:00.06      -22.26%     MOVEUP MOVEUP MOVEUP
    12          00:59.36      -23.17%     MOVEUP MOVEUP
    24          00:51.50      -33.34%     Scroll is 1 word
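The Reduction column above is each time measured against the 1-line result (01:17.26). As a rough sketch of that arithmetic in plain ANS-style Forth, assuming times are stored as hundredths of a second so that single-cell integers suffice; the words >CENTI and REDUCTION are made-up helper names for illustration, not part of Camel99 Forth:

```forth
DECIMAL

\ Convert min sec hundredths into a single count of hundredths of a second.
: >CENTI ( min sec cc -- n )  >R  SWAP 60 * +  100 *  R> + ;

\ Percent reduction relative to base, scaled by 100 (1220 means 12.20%).
\ */ keeps a double-cell intermediate; the result truncates, it does not round.
: REDUCTION ( base new -- pct*100 )  OVER SWAP -  10000 ROT */ ;

1 17 26 >CENTI   1  7 83 >CENTI   REDUCTION .   \ 2 line buffer:  1220
1 17 26 >CENTI   0 51 50 >CENTI   REDUCTION .   \ 24 line buffer: 3334
```

Because the division truncates, the first figure comes out as 12.20% where the table shows the rounded 12.21%.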
+Lee Stewart #853 Posted Monday at 12:40 AM

5 hours ago, TheBF said: "I suspected my scroll was the issue since I bucked the trend and did not write it entirely in Assembler."

This is probably not terribly responsive, but I use only a one-line buffer with ALC:

*
*** SCROLLING ROUTINE
*
SCROLL MOV  @$SSTRT(U),R0    VRAM addr
       LI   R1,LINBUF        Line buffer
       MOV  @$SWDTH(U),R2    Count
       A    R2,R0            Start at line 2
SCROL1 BLWP @VMBR
       S    R2,R0            One line back to write
       BLWP @VMBW
       A    R2,R0            Two lines ahead for next read
       A    R2,R0
       C    R0,@$SEND(U)     End of screen?
       JL   SCROL1
       MOV  R2,R1            Blank bottom row of screen
       LI   R0,>2000         Blank
       S    @$SEND(U),R2
       NEG  R2               Now contains address of start of last line
       MOV  LINK,R6
       BL   @FILL1           Write the blanks
       B    *R6

If you need details about missing definitions, I can supply them, but the comments will likely suffice.

...lee
+TheBF #854 Posted Monday at 01:15 AM

Thanks. This is very concise. I should go down that road. Early on, to save space, I decided to limit functions to the Forth interface only, so I cannot BLWP or BL to VMBW or VMBR or FILL. It is not tricky to change, but at the beginning of this journey I had no room left in the 8K. Hell, I was still figuring out how to use the cross-compiler that I made. I have about 80 bytes free in the existing system and I can play games with headless definitions and labels to save space. So I will take a run at this method too, since my social life is somewhat limited these days. Did get outside for a walk in the park with my brother-in-law this afternoon, so that was good.
GDMike #855 Posted Monday at 01:44 AM

Different environment, but same ole outta memory. I did manage (and this is probably easier for you guys) to make a type of loader so some of my routines can use free upper SAMS today. I was at 1014 bytes free; now, after pushing some things up, I'm at 2050 free with 2 routines in upper SAMS. I'm learning.
+TheBF #856 Posted Monday at 01:55 AM

Ya, a little different. I start with this tiny 8K piece and then I can add to it after it loads. But squashing everything into the first 8K has been a fun challenge. I have actually built a version where I don't have any loops or branching in the kernel, but then it compiles those when it starts. (I know what you are thinking: how does the kernel work without loops or branching? The cross compiler knows how to compile them to make the kernel, but the finished program does not have the BEGIN AGAIN, IF THEN etc. words. Crazy stuff.)
+TheBF #857 Posted Tuesday at 03:27 AM

When you turn over a stone, don't be surprised that you find a worm. In the course of testing my VMBR/VMBW code as sub-routines I found it was slower than normal because I used a WHILE loop structure to protect against cnt=0 conditions. I changed that and will take responsibility for the risk. It made things much faster.

CODE: SCRL ( buffer Vaddr len -- )
\ TOS hold address of C/L user variable
\ R0 VDP address
\ R1 CPU BUFFER for routines, copy kept on stack
\ R5 line length
      TOS R5 MOV,          \ line length -> R5
      *SP+ R0 MOV,         \ VDPRDaddr -> R0
      R5 R0 ADD,           \ start at line2
      BEGIN,
        R5 TOS MOV,        \ COPY length to R4 for VMBR
        *SP R1 MOV,        \ buffer -> R1
        RMODE @@ BL,
        VMBR @@ BL,        \ read line2 to buffer
        R5 R0 SUB,         \ One line back to write
        R5 TOS MOV,        \ set counter for the write
        *SP R1 MOV,        \ restore buffer address
        WMODE @@ BL,
        VMBW @@ BL,
        R0 3FFF ANDI,      \ strip off write bit
        R5 R0 ADD,         \ Two lines ahead for next read
        R5 R0 ADD,
        R0 C/SCR @@ CMP,   \ End of screen?
      HI UNTIL,
      TOS POP,             \ drop buffer
      TOS POP,             \ refill TOS register
      NEXT,
END-CODE

\ Buffer Lines   Sevens Speed
\     1           00:58.16     26 bytes bigger
: SCROLL ( -- )
    HERE 100 + TOPLN C/L@ SCRL
    0 17 AT-XY VPOS C/L@ BL VFILL ;

It turns out that since I am not using BLWP, again to save space, I need a few extra instructions in my loop to reset the control registers. I call a sub-routine to set up the VDP address each time since I didn't want to push/pop R11. I also decided to erase the last screen line in Forth because it's pretty fast, being mostly code words and only one line of code. I pass the buffer, screen and length as parameters since my TOPLN can be in different places in VDP RAM if you use the SCREEN: word to create different VDP text screens. And I don't keep variables for screen-end and screen-start, so parameter passing was simplest. The ALC version is 26 bytes bigger than this Forth code, which I created to do an "apples to apples" comparison.
\ Notes: Using SEVENs program as a benchmark
\ Buffer Lines   Sevens Speed
\     1           01:08.71
: SCROLL ( buffer vaddr -- )
    DUP C/L@ L/SCR * + SWAP   ( -- buffer SCRend SCRstart)
    DO
       I C/L@ + OVER C/L@ VREAD
       DUP I C/L@ VWRITE
    C/L@ +LOOP
    DROP
    0 17 AT-XY VPOS C/L@ BL VFILL ;

So we can see that the ALC scroll makes the benchmark program ~15% faster at the cost of 26 bytes, at least the way I did the ALC code. To show how much my VDP code improved, the older method that used the 8 line buffer improved from 1:00.6 to 0:55.75, an 8.7% improvement, which I was very happy to see. It comes in 18 bytes bigger than the single line DO/LOOP method. I will explore what happens now with this improved VDP code and a 2 line buffer, which seems like a reasonable trade-off.
+TheBF #858 Posted Tuesday at 07:46 PM

Continuing on the scroll research... Adding this code to SCRL to clear the last line:

\ erase last line adds 8 bytes, buys .4 seconds on benchmark
      C/SCR @@ R0 MOV,   \ end of screen Vaddr -> R0
      R5 R0 SUB,         \ go back 1 line
      R5 R2 MOV,         \ byte count for VFILL -> R2
      TOS 2000 LI,       \ space char -> R4
      WMODE @@ BL,
      _VFILL @@ BL,

Versus this code in Forth:

    0 17 AT-XY VPOS C/L@ BL VFILL

ALC for clearing the last line was not worth the trouble since it only improved the benchmark by .4 seconds on the very long SEVENS benchmark and consumed an extra 8 bytes. And I needed to reset the cursor after SCRL completed anyway.

I re-wrote my previous Forth SCROLL using the idea of keeping the buffer and VDP address arguments on the stack, but I un-rolled the DO/LOOP. It was a bit faster. The code below ran the benchmark in 1:08.06 and was 36 bytes smaller than using the ALC scroll.

Example 1:

: MOVEUP ( buffer vaddr -- buffer 'vaddr)
    2DUP C/L@ + SWAP C/L@ VREAD
    2DUP C/L@ VWRITE
    C/L@ + ;

: MOVE8 ( buffer Vaddr -- buffer 'Vaddr)
    MOVEUP MOVEUP MOVEUP MOVEUP
    MOVEUP MOVEUP MOVEUP MOVEUP ;

[PUBLIC]
: SCROLL ( -- )
    PAUSE
    HERE 100 + TOPLN
    MOVE8 MOVE8 MOVE8 2DROP
    0 17 AT-XY VPOS C/L@ BL VFILL ;

Re-writing the above to use a four line buffer and creating a code word to return C/L@ 4* was the only way I could get a faster scroll than the 1 line ALC code: 57 seconds vs 58 in ALC. Size or speed. It's hard to get both.
+TheBF #859 Posted Wednesday at 02:47 AM

And just to beat this horse until it is well and truly dead, let's try ALC scrolling with a 2 line buffer. Since I am using un-allocated memory for the buffer, it was only 3 extra instructions. This dropped the benchmark time from 58.16 to 56.00 seconds. Just 1 second off of the Forth time using a 24 line buffer! Gotta love that assembler. Using all this fancy stuff created an 8,148 byte kernel, so I still have 48 bytes left over! And I now have VMBW and VMBR as BL callable sub-routines. Nice progress overall. And if I really need a slightly smaller kernel I just change the TRUE below to FALSE and the kernel is 8,116 bytes with slower scroll.

\ Scroll in Assembler is a re-write of concept from FbForth Lee Stewart
FALSE [IF]
[PRIVATE]
CODE: SCRL ( buffer Vaddr len -- )
\ TOS hold address of C/L user variable
\ R0 VDP address
\ R1 CPU BUFFER for routines, copy kept on stack
\ R5 line length
      TOS R5 MOV,          \ line length -> R5
      *SP+ R0 MOV,         \ VDPRDaddr -> R0
      R5 R0 ADD,           \ start at line2
      BEGIN,
        R5 TOS MOV,        \ COPY length to R4 for VMBR
        TOS TOS ADD,       \ *read 2 lines
        *SP R1 MOV,        \ buffer -> R1
        RMODE @@ BL,
        VMBR @@ BL,        \ read line2 to buffer
        R5 R0 SUB,         \ One line back to write
        R5 TOS MOV,        \ set counter for the write
        TOS TOS ADD,       \ *write 2 lines
        *SP R1 MOV,        \ restore buffer address
        WMODE @@ BL,
        VMBW @@ BL,
        R0 3FFF ANDI,      \ strip off write bit
        R5 R0 ADD,
        R5 R0 ADD,
        R5 R0 ADD,         \ *advance one extra line
        R0 C/SCR @@ CMP,   \ End of screen?
      HI UNTIL,
      TOS POP,             \ DROP buffer address
      TOS POP,             \ refill TOS register
      NEXT,
END-CODE

[PUBLIC]
\ Buffer Lines   Sevens Speed
\     1           00:58.16     26 bytes bigger
: SCROLL ( -- )
    HERE 100 + TOPLN C/L@ SCRL
    0 17 AT-XY VPOS C/L@ BL VFILL ;

[ELSE]
\ [PRIVATE]
\ Notes: Using SEVENs program as a benchmark
\ Buffer Lines   Sevens Speed
\     1           01:08.06
\     2           01:02.00
\                  1:01.00 using 2LINES code word.
\     4           00:57.28
\     8           00:55.43

\ CODE: 2LINES ( -- n)
\       TOS PUSH,
\       R1 STWP,
\       2E (R1) TOS MOV,   \ read user var C/L
\       TOS 1 SLA,         \ 2*
\       NEXT,
\ END-CODE

[PUBLIC]
: SCROLL ( buffer vaddr -- )
    HERE 100 + TOPLN C/SCR @   ( -- buffer Vstart len)
    BOUNDS                     ( -- buffer SCRend SCRstart)
    DO
       I C/L@ + OVER C/L@ VREAD
       DUP I C/L@ VWRITE
    C/L@ +LOOP
    DROP
    0 17 AT-XY VPOS C/L@ BL VFILL ;
[THEN]
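The Forth SCROLL above uses BOUNDS to turn a start address and length into DO loop limits. BOUNDS is not part of the ANS core word set, so here is a minimal sketch of the definition it commonly has; this is an assumption, and Camel99's own BOUNDS may well be written as a code word. COUNT-LINES is a made-up word that only counts how many 40-byte steps such a loop takes over a 960-byte text screen:

```forth
DECIMAL

\ Assumed high-level definition: convert start and length to loop limits.
: BOUNDS ( addr len -- addr+len addr )  OVER + SWAP ;

\ Count the iterations of a line-by-line loop with a fixed 40-byte step.
: COUNT-LINES ( start len -- n )
   0 ROT ROT          ( -- 0 start len )
   BOUNDS             ( -- 0 end start )
   DO  1+  40 +LOOP ;

0 960 COUNT-LINES .   \ 24 lines of 40 columns
```

On a system that already provides BOUNDS the redefinition is harmless but unnecessary.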
+Lee Stewart #860 Posted Wednesday at 05:40 AM

2 hours ago, TheBF said: "\ Scroll in Assembler is a re-write of concept from fbForth Lee Stewart"

Before credits get lost, I must hasten to say that the ALC scroll routine I posted is totally lifted from TI Forth. There...I feel better already!

...lee
+TheBF #861 Posted Thursday at 04:20 PM

While updating the list in Benchmarking Languages I started re-reading some of the posts by @matthew180 and @jedimatt42 where I was being soundly scolded for writing inefficient VDP routines. At the time I understood what they said, but I did not know how to implement it in my system because of my use of different workspaces in the multi-tasker. I could not reference a register's odd numbered byte using indirect addressing because the numerical address could be anything. A long time back I decided to use the workspace pointer register to define not just my register space but also local variable space above the registers for each task. I recently re-wrote my character output routine to use more inline code versus Forth, and to do that I have to access those local variables for COL and ROW using indexed addressing. Well... it suddenly occurred to me that I can also access my registers the same way. This gave rise to a re-write of CPUT: (TOS is R4)

Edit: Took out 1 more instruction.

CODE: CPUT ( char -- ?)  \ put a char at cursor position, return eol flag
      R1 STWP,               \ workspace is USER area base address
      32 (R1) R2 MOV,        \ vrow -> R2
      2E (R1) R2 MPY,        \ vrow*c/l -> R3
      34 (R1) R3 ADD,        \ add vcol
      VPG @@ R3 ADD,         \ add video page address
      0 LIMI,
      7 (R1) 8C02 @@ MOVB,   \ write odd byte from R3
      R3 4000 ORI,
      R3 8C02 @@ MOVB,
      9 (R1) VDPWD @@ MOVB,  \ Odd byte R4, write to screen
      2 LIMI,
      TOS CLR,
      34 (R1) INC,           \ bump VCOL
      34 (R1) 2E (R1) CMP,   \ compare VCOL = C/L
      EQ IF,
         TOS SETO,           \ set true flag
      ENDIF,
      NEXT,
END-CODE
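The offsets 7 (R1) and 9 (R1) in CPUT follow from how the 9900 keeps its registers in RAM: register n occupies the 16-bit word at workspace pointer + 2n, high byte first, so its low (odd) byte sits one address higher. A small sketch of that calculation in plain Forth; REG-HI and REG-LO are hypothetical helper names used only for illustration:

```forth
DECIMAL

\ Byte offset from the workspace pointer to register n's high (even) byte.
: REG-HI ( n -- offset )  2* ;

\ Byte offset from the workspace pointer to register n's low (odd) byte.
: REG-LO ( n -- offset )  2* 1+ ;

3 REG-LO .   \ 7 : low byte of R3, the VDP address in CPUT
4 REG-LO .   \ 9 : low byte of R4, the TOS register in this system
```

Once STWP has loaded the workspace address into R1, indexed addressing with these offsets reaches the odd byte no matter where the task's workspace is located.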
+TheBF #862 Posted yesterday at 04:47 AM

Armed with this new tool for faster VDP access I had to re-write my VDP driver. You know I had to. It makes everything just a little more "perky". I like it. All the benchmarks that do anything with the screen, like the TURSI sprite benchmark, go faster, and even compiling is a touch quicker because we can get things in and out of the PAB faster. Only took me 5 years to get here. 🤣 It also made me look at some very early code and discover that I had used 4 instructions in my VWTR word where I only needed 2. I guess I am getting a little better at this Assembly Language thing. With these faster Forth words I don't feel a real need for sub-routine access to VMBR and VMBW anymore. It pays to listen to the experts. I have not tried this stuff on real iron yet. I hope it doesn't over run the VDP.

Example: (edit: Found an extra instruction from the old version that was not needed)

\ VSBR Forth style, on the stack
CODE: VC@ ( VDP-adr -- char )  \ Video CHAR fetch
      0 LIMI,
      R1 STWP,
      9 (R1) 8C02 @@ MOVB,   \ write odd byte from TOS ie R4
      TOS 8C02 @@ MOVB,      \ write even bytes from TOS
      VDPRD @@ TOS MOVB,     \ READ char from VDP RAM into TOS
      TOS 8 SRL,             \ move the damned byte to correct half of the word
      2 LIMI,
      NEXT,
END-CODE
matthew180 #863 Posted yesterday at 05:13 AM

12 hours ago, TheBF said: "I could not reference a register's odd numbered byte using indirect addressing because the numerical address could be anything."

That is just a slight optimization to the VDP routines, and you only incur a small increase (8us or something like that) by using other methods. I see you worked it out, even if it is backwards (i.e. written in Forth).

12 minutes ago, TheBF said: "I have not tried this stuff on real iron yet. I hope it doesn't over run the VDP."

This has been debated and tested quite a bit on the 99/4A, and IIRC there is only one very specific situation where you might be able to overrun the VDP (again, on the 99/4A). Then again, I think that one case was on a modified system, so it is probably not possible on a stock console. The threads about it are here on A.A. if you want to dig around. The main clincher (again, IIRC) for the 99/4A is that VDP access triggers the wait-state generator, so you pretty much cannot overrun the VDP. Also, if the system has an F18A, it cannot be overrun on the retro computers that used the 9918A family of VDP. You need a CPU clock around 25MHz or faster to overrun the F18A.
+TheBF #864 Posted yesterday at 05:20 AM

3 minutes ago, matthew180 said: "... even if it is backwards..."

The good news is the code gets laid down in the correct order. I still find it amazing that you can write a functional assembler with structured branching and looping in 200 lines. Thanks for the news on the over-run not being an issue. I only have stock hardware.
+TheBF #865 Posted yesterday at 05:30 AM

Once you start looking... This is faster again by not using SRL, but instead using CLR and moving the data directly into the correct side of the register.

\ VSBR Forth style, on the stack
CODE: VC@ ( VDP-adr -- char )  \ Video CHAR fetch
      0 LIMI,
      R1 STWP,
      9 (R1) 8C02 @@ MOVB,     \ write odd byte from TOS ie R4
      TOS 8C02 @@ MOVB,        \ write even bytes from TOS
      TOS CLR,
      VDPRD @@ 9 (R1) MOVB,    \ READ char from VDP RAM into TOS
      2 LIMI,
      NEXT,
END-CODE

I can do this in a number of places...
matthew180 #866 Posted 20 hours ago

You might want to check out the 9900 datasheet and get a little familiar with the instruction timings (pg. 28). The slowest instructions are: DIV, MPY, Shift instructions, XOP, LDCR/STCR, BLWP. The barrel shifter takes a cycle for each bit-shift, so the more bits, the longer the instruction takes. So, even though shifting is faster than DIV and MPY for powers of 2, it is still slower than other instructions if you can do the same task with other instructions.
+TheBF #867 Posted 18 hours ago

When I first started writing the low level code for this system I had that datasheet in front of me all the time. I use shifts for the routines called 2* 4* 8* for fast multiplication, and for 2/ which divides by 2. The challenge with text is when you grab a byte and Forth wants the byte in the odd byte of the register. SRL 8 is pretty slow, but it's even worse to mask with AI and then SWPB. I find it pretty hard to make the old 9900 give you anything for free.
matthew180 #868 Posted 9 hours ago

8 hours ago, TheBF said: "SRL 8 is pretty slow, but its even worse to mask with AI and then SWPB."

AI is 14 cycles, SWPB is 10 cycles, so 24 cycles total. This assumes register addressing to make it comparable to shift, since the shift can only operate on registers. The shift instructions have two forms (three actually, but only two that apply here): the count is taken from R0, or the count is a fixed value. If the count is in R0, then the timing is 20+2N (N is the value read from R0), so in this case it would be 20+2*8 = 36 cycles. If the count is fixed (which is encoded as part of the instruction), then the timing is 12+2C, so in this case it would be 12+2*8 = 28 cycles. So AI + SWPB is still faster than shifting by 8, by at least 4 cycles.

+TheBF #869 Posted 9 hours ago

You are correct. I recall now that I chose SRL in that case to save space since the speed difference was so small (24 vs 28). And bytes are important since I try to keep the kernel in an 8K package.
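For anyone who wants to play with the numbers, the two shift timing formulas quoted above (12+2C for a count encoded in the instruction, 20+2N for a count taken from R0) and the 14 and 10 cycle figures for AI and SWPB can be worked through in a few lines of plain Forth. This is only a sketch of the arithmetic; the word names are invented for illustration:

```forth
DECIMAL

\ Shift with the count encoded in the instruction: 12 + 2C cycles.
: SHIFT-FIXED ( count -- cycles )  2* 12 + ;

\ Shift with the count taken from R0: 20 + 2N cycles.
: SHIFT-R0 ( n -- cycles )  2* 20 + ;

14 CONSTANT AI-CYCLES     \ register addressing, as quoted in the thread
10 CONSTANT SWPB-CYCLES

: MASK+SWPB ( -- cycles )  AI-CYCLES SWPB-CYCLES + ;

8 SHIFT-FIXED .   \ 28
8 SHIFT-R0 .      \ 36
MASK+SWPB .       \ 24
```

With these formulas a fixed-count shift only matches or beats the 24-cycle mask-and-swap pair for counts of 6 bits or fewer, so the choice of SRL for an 8-bit move is about code size rather than speed.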