TheBF

Camel99 Forth Information goes here


Well didn't this open an interesting can of worms. :) 

 

My great improvement was short lived.

On my Windows 10 computer using Classic99 and Camel99 Forth 2.66 I get a different timing than @Speccery did.

I should have timed it first on my machine before I even started working on improvements (doh!) but here is the Apples to Apples comparison:

 

So after seeing some improvement re-working the DO LOOP with an inline NEXT interpreter I wondered if I could use some of that extra space to speed up other primitives?

 

Next I found this reference table, from "Stack Computers: The New Wave" by Philip Koopman, which gives some insight into which Forth words run most often.

6.3.1 Dynamic instruction frequencies

NAMES           FRAC     LIFE     MATH  COMPILE      AVE
CALL           11.16%   12.73%   12.59%   12.36%   12.21%
EXIT           11.07%   12.72%   12.55%   10.60%   11.74%
VARIABLE        7.63%   10.30%    2.26%    1.65%    5.46%
@               7.49%    2.05%    0.96%   11.09%    5.40%
0BRANCH         3.39%    6.38%    3.23%    6.11%    4.78%
LIT             3.94%    5.22%    4.92%    4.09%    4.54%
+               3.41%   10.45%    0.60%    2.26%    4.18%
SWAP            4.43%    2.99%    7.00%    1.17%    3.90%
R>              2.05%    0.00%   11.28%    2.23%    3.89%
>R              2.05%    0.00%   11.28%    2.16%    3.87%
CONSTANT        3.92%    3.50%    2.78%    4.50%    3.68%
DUP             4.08%    0.45%    1.88%    5.78%    3.05%
ROT             4.05%    0.00%    4.61%    0.48%    2.29%
USER            0.07%    0.00%    0.06%    8.59%    2.18%
C@              0.00%    7.52%    0.01%    0.36%    1.97%
I               0.58%    6.66%    0.01%    0.23%    1.87%
=               0.33%    4.48%    0.01%    1.87%    1.67%
AND             0.17%    3.12%    3.14%    0.04%    1.61%
BRANCH          1.61%    1.57%    0.72%    2.26%    1.54%
EXECUTE         0.14%    0.00%    0.02%    2.45%    0.65%

I already have 0BRANCH, BRANCH, EXIT, CALL (DOCOL), LIT, @ and DROP  in 16 bit RAM.

 

Using this table I changed the NEXT macro in each of the following words to use the ILNEXT macro (inline NEXT, 3 instructions):

DOVAR,  +, SWAP,  R> , >R,  DOCON , ROT,  DOUSER,  C@, I and =.

 

After re-compiling the kernel with these changes, FIB2-BENCH ran a bit faster again.

Fibonacci  FIB2-BENCH Timings

----------------------------------------

V2.66        1:46.80

V2.67        1:46.03         0.7% better                      ( DO LOOP change)

V2.67b       1:44.50        2.2% better than original   ( inline next on hi usage words) 
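
As a quick cross-check of the percentages above, here is a minimal sketch in Python (not Forth) that converts the three timings to seconds and computes the reduction relative to V2.66. The helper names `to_seconds` and `percent_faster` are made up for this illustration.

```python
def to_seconds(stamp):
    """Convert a 'M:SS.ss' stopwatch reading to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def percent_faster(old, new):
    """Reduction in run time as a percentage of the old time."""
    t_old, t_new = to_seconds(old), to_seconds(new)
    return (t_old - t_new) / t_old * 100


# Timings copied from the table above.
timings = {"V2.66": "1:46.80", "V2.67": "1:46.03", "V2.67b": "1:44.50"}

for version, stamp in timings.items():
    gain = percent_faster(timings["V2.66"], stamp)
    print(f"{version:7s} {stamp}  {gain:.1f}% better")
```

Measured this way the two changes come out at roughly 0.7% and 2.2%, which matches the figures in the table.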

 

Since a threaded Forth program spends about 50% of its time running the interpreter NEXT, you can get improvements by removing the branch through a register to get to it and placing it inline, as we can see.

 

But... this consumed 46 bytes in my tiny kernel so is it worth it?

I will play with it more and run some more benchmarks before I make up my mind.

 


In the course of optimizing EMIT for faster screen printing I was amazed that the run time of the SEVENs problem benchmark did not drop by very much.

I remembered that FbForth was doing Lee's version of the benchmark in less than 1 minute.

 

I suspected my scroll was the issue since I bucked the trend and did not write it entirely in Assembler.

I use an intermediate word called MOVEUP that scrolls a chunk of lines (8 here) in one go; you call it as many times as it takes to cover all 24 lines.

: MOVEUP ( vaddr -- 'vaddr)
         C/L@ 8* >R       \ compute chunk size. 8* means 8 lines
         HERE 100 +  OVER C/L@ +  ( -- 1stline heap 2ndline)
         OVER R@ VREAD
         OVER R@ VWRITE
         R> +   ; \ 36 bytes

 

I re-wrote the scroll as one Forth word and use a buffer that was the size of the screen minus one line.

That brought the time down to 51.5 seconds.

 : SCROLL ( -- )
         PAUSE
         TOPLN                   \ top of VDP screen memory
         C/SCR @ C/L@ - >R       \ C/SCR-1line to rstack
         HERE 100 + OVER C/L@ +  ( -- 1stline heap 2ndline)
         OVER R@ VREAD
         SWAP R> VWRITE
         0 17 AT-XY  VPOS C/L@ BL VFILL  \ SEVENS = 51.5 SECS
;

I don't really want to use such a huge buffer even though it's in un-allocated memory because at some point it will crash into the stack in a big program project. 

This is especially true in 80 column mode.

I also don't want to put the scroll buffer in low RAM since that is so useful for SAMS buffers.

 

Since I can rebuild the kernel in 5 seconds and re-run it on Classic99 I did the experiment to find out how the buffer size affected the speed of the benchmark.

Here is the data.  I think I will stay with my original decision to use an 8 line buffer but at least I know now that in the SEVENs benchmark almost 5 extra seconds are being used just to scroll the screen. Amazing.

Buffer Lines   Sevens Speed   Reduction   Notes
 1             01:17.26
 2             01:07.83       -12.21%     Uses DO LOOP
 4             01:02.83       -18.68%     Uses DO LOOP
 8             01:00.90       -21.18%     Uses DO LOOP
 8             01:00.06       -22.26%     MOVEUP MOVEUP MOVEUP
12             00:59.36       -23.17%     MOVEUP MOVEUP
24             00:51.50       -33.34%     Scroll is 1 word
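
To make the buffer-size trade-off concrete, here is a rough Python model (not the Forth kernel code) of scrolling a text screen up one line through a CPU buffer of a given number of lines. It counts how many read/write transfer pairs are needed, which is where the per-call overhead comes from. The function name `scroll` and the flat `bytearray` standing in for VDP RAM are assumptions made for this sketch, and unlike MOVEUP it clamps the last chunk so it never reads past the end of the screen.

```python
def scroll(vdp, cols, rows, buf_lines):
    """Scroll the screen image in `vdp` up one line using a buffer that
    holds `buf_lines` lines. Returns the number of read/write pairs."""
    transfers = 0
    line = 0
    while line < rows - 1:
        n = min(buf_lines, rows - 1 - line)        # lines moved in this chunk
        src = (line + 1) * cols                    # chunk starts one line down
        buffer = bytes(vdp[src:src + n * cols])    # models VREAD: VDP -> buffer
        dst = line * cols
        vdp[dst:dst + n * cols] = buffer           # models VWRITE: buffer -> VDP
        transfers += 1
        line += n
    vdp[(rows - 1) * cols:rows * cols] = b" " * cols   # blank the bottom line
    return transfers


if __name__ == "__main__":
    cols, rows = 40, 24
    for buf_lines in (1, 2, 4, 8, 12, 24):
        screen = bytearray(b"x" * (cols * rows))
        pairs = scroll(screen, cols, rows, buf_lines)
        print(f"{buf_lines:2d}-line buffer: {buf_lines * cols:4d} bytes, "
              f"{pairs:2d} read/write pairs")
```

In this model the buffer memory grows linearly with the number of lines while the number of transfers shrinks, which is the same size-versus-speed tension the table shows. In 80 column mode the byte counts double.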
5 hours ago, TheBF said:

I suspected my scroll was the issue since I bucked the trend and did not write it entirely in Assembler.

 

This is probably not terribly responsive, but I use only a one-line buffer with ALC:

*
*** SCROLLING ROUTINE
*
SCROLL MOV  @$SSTRT(U),R0   VRAM addr
       LI   R1,LINBUF       Line buffer
       MOV  @$SWDTH(U),R2   Count
       A    R2,R0           Start at line 2
SCROL1 BLWP @VMBR
       S    R2,R0           One line back to write
       BLWP @VMBW
       A    R2,R0           Two lines ahead for next read
       A    R2,R0
       C    R0,@$SEND(U)    End of screen?
       JL   SCROL1
       MOV  R2,R1           Blank bottom row of screen
       LI   R0,>2000        Blank
       S    @$SEND(U),R2
       NEG  R2              Now contains address of start of last line
       MOV  LINK,R6
       BL   @FILL1          Write the blanks
       B    *R6

If you need details about missing definitions, I can supply them, but the comments will likely suffice.

 

...lee


Thanks. This is very concise.

I should go down that road. Early on to save space I decided to limit functions to the Forth interface only so I can not  BLWP or BL to VMBW or VMBR or FILL.

It's not tricky to change but in the beginning of this journey I had no room left in the 8K. :) 

Hell I was still figuring out how to use the cross-compiler that I made. :) 

 

I have about 80 bytes free in the existing system and I can play games with headless definitions and labels to save space.

So I will take a run at this method too since my social life is somewhat limited these days. 

Did get outside for a walk in the park with my brother in law this afternoon so that was good.


Different environment, but same ole outta memory, I was able to, (and this is probably easier for you guys), but I did manage to make a type of loader for some of my routines to use free upper SAMs today. I was at 1014 bytes free, now after pushing some things up, I'm at 2050 free and 2 routines in upper SAMs.

I'm learning.

1 minute ago, GDMike said:

Different environment, but same ole outta memory, I was able to, (and this is probably easier for you guys), but I did manage to make a type of loader for some of my routines to use free upper SAMs today. I was at 1014 bytes free, now after pushing some things up, I'm at 2050 free and 2 routines in upper SAMs.

I'm learning.

Ya a little different. I start with this tiny 8K piece and then I can add to it after it loads. But squashing everything into the first 8K has been a fun challenge.

I have actually built a version where I don't have any loops or branching in the kernel, but then it compiles those when it starts. :) 

 

( I know what you are thinking. How does the kernel work without loops or branching?  The cross compiler knows how to compile them to make the kernel, but the finished program does not have the BEGIN AGAIN , IF THEN etc. words. Crazy stuff.)

 


When you turn over a stone don't be surprised that you find a worm. :)

In the course of testing my VMBR /VMBW code as sub-routines I found it was slower than normal because I used a WHILE loop structure to protect from cnt=0 conditions.

I changed that and will take responsibility for the risk.   :)

It made things much faster.

CODE: SCRL ( buffer Vaddr len -- )  \ TOS hold address of C/L user variable
\ R0 VDP address
\ R1 CPU BUFFER for routines, copy kept on stack
\ R5 line length
       TOS R5 MOV,        \ line length -> R5
      *SP+ R0 MOV,        \ VDPRDaddr -> R0
       R5  R0 ADD,        \ start at line2
       BEGIN,
          R5  TOS MOV,      \ COPY length to R4 for VMBR
         *SP  R1  MOV,      \ buffer -> R1
          RMODE @@ BL,
          VMBR @@ BL,       \ read line2 to buffer
          R5 R0  SUB,       \ One line back to write

          R5 TOS MOV,       \ set counter for the write
         *SP R1  MOV,       \ restore buffer address
          WMODE @@ BL,
          VMBW @@ BL,
          R0 3FFF ANDI,     \ strip off write bit
          R5   R0 ADD,      \ Two lines ahead for next read
          R5   R0 ADD,
          R0 C/SCR @@ CMP,  \ End of screen?
       HI UNTIL,
       TOS POP,             \ drop buffer
       TOS POP,             \ refill TOS register 
       NEXT,
       END-CODE

\ Buffer Lines   Sevens Speed
\   1            00:58.16    26 bytes bigger
: SCROLL  ( -- )
       HERE 100 +  TOPLN C/L@ SCRL
       0 17 AT-XY  VPOS C/L@ BL VFILL ;

 

It turns out that since I am not using  BLWP,  again to save space, I need a few extra instructions in my loop to reset the control registers.

I call a sub-routine to setup the VDP address each time since I didn't want to push/pop R11.

 

I also decided to erase the last screen line in Forth because it's pretty fast being mostly code words and only one line of code.

 

I pass the buffer, screen and length as parameters since my TOPLN can be in different places in VDP RAM if you use the SCREEN: word to create different VDP text screens.

And I don't keep variables for screen-end and screen-start so parameter passing was simplest.

 

The ALC version is 26 bytes bigger than this Forth code, which I created to do an "apples to apples" comparison.

\ Notes: Using SEVENs program as a benchmark
\ Buffer Lines   Sevens Speed
\   1            01:08.71
: SCROLL ( buffer vaddr -- )
       DUP C/L@ L/SCR * +
       SWAP  ( -- buffer SCRend SCRstart)
       DO
         I  C/L@ +  OVER  C/L@ VREAD
         DUP  I           C/L@ VWRITE
       C/L@ +LOOP
       DROP
       0 17 AT-XY  VPOS C/L@ BL VFILL
;

So we can see that the ALC scroll makes the benchmark program ~15% faster at the cost of 26 bytes, at least the way I did the ALC code.

To show how much my VDP code improved: the older method that used the 8 line buffer went from 1:00.6 to 0:55.75, an 8.0% reduction in run time (8.7% if measured against the new time), which I was very happy to see.

It comes in 18 bytes bigger than the single line DO/LOOP method.

 

I will explore what happens now with this improved VDP code and a 2 line buffer which seems like a reasonable trade-off.

 

 

 

 

 

 

 

 


Continuing on the scroll research...

 

Adding this code to SCRL to clear the last line:

\ erase last line  adds 8 bytes, buys .4 seconds on benchmark
       C/SCR @@ R0 MOV,     \ end of screen Vaddr -> R0
       R5 R0 SUB,           \ go back 1 line
       R5 R2 MOV,           \ byte count for VFILL -> R2
       TOS 2000 LI,         \ space char -> R4
       WMODE @@ BL,
      _VFILL @@ BL,

Versus this code in Forth:

 0 17 AT-XY  VPOS C/L@ BL VFILL

ALC for clearing the last line was not worth the trouble since it only improved the benchmark .4 seconds on the very long SEVENS benchmark and consumed an extra 8 bytes.

And I needed to reset the cursor after SCRL completed anyway.

 

I re-wrote my previous Forth SCROLL using the idea of keeping the buffer and VDP address arguments on the stack but I un-rolled the DO/LOOP.

It was a bit faster.  The code below ran the benchmark in 1:08.06 and was 36 bytes smaller than using the ALC scroll. 

 

Example 1:

: MOVEUP ( buffer vaddr -- buffer 'vaddr)
         2DUP C/L@ +  SWAP C/L@ VREAD
         2DUP              C/L@ VWRITE
         C/L@  +  ;

: MOVE8  ( buffer Vaddr -- buffer 'Vaddr)
      MOVEUP MOVEUP MOVEUP MOVEUP MOVEUP MOVEUP MOVEUP MOVEUP ;

     [PUBLIC]

: SCROLL ( -- )
         PAUSE
         HERE 100 + TOPLN  MOVE8 MOVE8 MOVE8   2DROP
         0 17 AT-XY  VPOS C/L@ BL VFILL
;

Re-writing the above to use a four line buffer, and creating a code word to return C/L@ 4*, was the only way I could get a faster scroll than the 1 line ALC code.

57 seconds vs 58 in ALC.

Size or speed. It's hard to get both.

 

 

 

 

 


And just to beat this horse until it is well and truly dead :) 

 

Let's try ALC scrolling with a 2 line buffer. Since I am using un-allocated memory for the buffer, it was only 3 extra instructions.

This dropped the benchmark time from 58.16 to 56.00  seconds.  Just 1 second off of the Forth time using a 24 line buffer!

Gotta love that assembler.

 

Using all this fancy stuff created an 8,148 byte kernel, so I still have 48 bytes left over!

And I now have VMBW and VMBR as BL callable sub-routines. Nice progress overall

 

And if I really need a slightly smaller kernel I just change the TRUE below to FALSE and the kernel is 8,116 bytes with slower scroll.

Spoiler
\ Scroll in Assembler is a re-write of concept from FbForth Lee Stewart
FALSE [IF]
               [PRIVATE]
CODE: SCRL ( buffer Vaddr len -- )  \ TOS hold address of C/L user variable
\ R0 VDP address
\ R1 CPU BUFFER for routines, copy kept on stack
\ R5 line length
       TOS R5 MOV,        \ line length -> R5
      *SP+ R0 MOV,        \ VDPRDaddr -> R0
       R5  R0 ADD,        \ start at line2
       BEGIN,
          R5  TOS MOV,      \ COPY length to R4 for VMBR
          TOS TOS ADD,      \ *read 2 lines
         *SP  R1  MOV,      \ buffer -> R1
          RMODE @@ BL,
          VMBR  @@ BL,      \ read line2 to buffer
          R5 R0  SUB,       \ One line back to write

          R5 TOS MOV,       \ set counter for the write
          TOS TOS ADD,      \ *write 2 lines
         *SP R1  MOV,       \ restore buffer address
          WMODE @@ BL,
          VMBW  @@ BL,
          R0 3FFF ANDI,     \ strip off write bit
          R5   R0 ADD,
          R5   R0 ADD,
          R5   R0 ADD,      \ *advance one extra line
          R0 C/SCR @@ CMP,  \ End of screen?
       HI UNTIL,
       TOS POP,             \ DROP buffer address
       TOS POP,             \ refill TOS register
       NEXT,
       END-CODE
              [PUBLIC]

\ Buffer Lines   Sevens Speed
\   1            00:58.16    26 bytes bigger
: SCROLL  ( -- )
       HERE 100 +  TOPLN C/L@ SCRL
       0 17 AT-XY  VPOS C/L@ BL VFILL ;


[ELSE]

\                 [PRIVATE]
\ Notes: Using SEVENs program as a benchmark
\ Buffer Lines   Sevens Speed
\   1            01:08.06
\   2            01:02.00   \ 1:01.00  using 2LINES code word.
\   4            00:57.28
\   8            00:55.43

\ CODE: 2LINES ( -- n)
\    TOS PUSH,
\    R1  STWP,
\    2E (R1) TOS MOV,  \ read user var C/L
\    TOS 1 SLA,        \ 2*
\    NEXT,
\   END-CODE
               [PUBLIC]

: SCROLL ( -- )
       HERE 100 +
       TOPLN C/SCR @  ( -- buffer Vstart len)
       BOUNDS  ( -- buffer SCRend SCRstart)
       DO
         I  C/L@ +  OVER  C/L@  VREAD
         DUP  I           C/L@  VWRITE
       C/L@ +LOOP
       DROP
       0 17 AT-XY  VPOS C/L@ BL VFILL
;


[THEN]

 

 

2 hours ago, TheBF said:

\ Scroll in Assembler is a re-write of concept from fbForth Lee Stewart

 

Before credits get lost, I must hasten to say that the ALC scroll routine I posted is totally lifted from TI Forth. There...I feel better already!  |:)

 

...lee


While updating the list in Benchmarking Languages I started re-reading some of the posts by @matthew180 and @jedimatt42 where I was being soundly scolded for writing inefficient VDP routines.  :)

 

 

At the time I understood what they said but I did not know how to implement it in my system because of my use of different workspaces in the multi-tasker. 

I could not reference a register's odd numbered byte using indirect addressing because the numerical address could be anything.

 

A long time back I decided to use the workspace pointer register to define not just my register space but also local variable space above the registers for each task.

I recently re-wrote my character output routine to use more inline code versus Forth and to do that I have to access those local variables for COL and ROW using indexed addressing.

Well... it suddenly occurred to me that I can also access my registers the same way.
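
The idea can be sketched outside Forth. On the TMS9900 the sixteen registers are just the first 32 bytes of memory at the workspace pointer, so register Rn sits at byte offset 2n, with its most significant byte at the even address and its least significant byte at the following odd address. The small Python helper below is only an illustration of that arithmetic; the name `register_byte_offsets` is made up for this sketch.

```python
def register_byte_offsets(n):
    """Return (even_offset, odd_offset) of register Rn relative to the
    workspace pointer. The even byte is the high byte of the word."""
    if not 0 <= n <= 15:
        raise ValueError("the TMS9900 has registers R0..R15")
    return 2 * n, 2 * n + 1


if __name__ == "__main__":
    for reg in (3, 4):
        even, odd = register_byte_offsets(reg)
        print(f"R{reg}: high byte at WP+{even}, low byte at WP+{odd}")
```

With the workspace address in a register (via STWP), an indexed operand such as offset 9 therefore reaches the low byte of R4 regardless of which task's workspace is active.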

 

This gave rise to a re-write of CPUT:  (TOS is R4)

Edit:  Took out 1 more instruction.

CODE: CPUT ( char -- ?)  \ put a char at cursor position, return eol flag
            R1         STWP,    \ workspace is USER area base address
            32 (R1) R2  MOV,    \ vrow -> R2
            2E (R1) R2  MPY,    \ vrow * c/l -> R2:R3 (low word in R3)
            34 (R1) R3  ADD,    \ add vcol
            VPG @@  R3  ADD,    \ add video page address
            0 LIMI,
            7 (R1) 8C02 @@ MOVB,   \ write odd (low) byte of R3 to VDP address port
            R3 4000 ORI,           \ set the write bit
            R3 8C02 @@ MOVB,       \ write even (high) byte of R3
            9 (R1) VDPWD @@ MOVB,  \ odd byte of R4 (the char) -> screen
            2 LIMI,
            TOS CLR,               \ assume false flag
            34 (R1)  INC,          \ bump VCOL
            34 (R1)  2E (R1) CMP,  \ compare VCOL = C/L
            EQ IF,
                TOS SETO,          \ set true flag
            ENDIF,
            NEXT,
            END-CODE
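
For anyone following along, here is a small Python sketch of the arithmetic CPUT performs before it touches the VDP: compute the cursor address from row, column and line length, then split it into the two bytes sent to the VDP address port (low byte first, then the high byte with the >40 write bit set). The function names and the `page_base` parameter are assumptions for this illustration, not names from the kernel.

```python
VDP_WRITE_BIT = 0x4000   # bit that marks the address as a write address


def cursor_address(row, col, chars_per_line, page_base=0):
    """VDP RAM address of the cursor: page + row * C/L + col."""
    return page_base + row * chars_per_line + col


def address_port_bytes(addr, write):
    """Return (first_byte, second_byte) to send to the VDP address port.
    The low byte goes first; the high byte carries the write bit."""
    addr &= 0x3FFF                     # the VDP sees a 14 bit address
    if write:
        addr |= VDP_WRITE_BIT
    return addr & 0xFF, addr >> 8


if __name__ == "__main__":
    addr = cursor_address(row=23, col=0, chars_per_line=40)
    low, high = address_port_bytes(addr, write=True)
    print(f"address >{addr:04X}: send >{low:02X} then >{high:02X}")
```

This is why the code can send the odd byte of R3 before the ORI: the write bit only lives in the high byte, so the low byte is already final.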

 


Armed with this new tool for faster VDP access I had to re-write my VDP driver.

You know I had to. :) 

 

It makes everything just a little more "perky".  I like it.

All the benchmarks that do anything with the screen like the TURSI sprite benchmark, go faster and even compiling is a touch quicker because we can get things in and out of the PAB faster.

Only took me 5 years to get here.  🤣

 

It also made me look at some very early code and discover that I had used 4 instructions in my VWTR word where I only needed 2.

I guess I am getting a little better at this Assembly Language thing.

 

With these faster Forth words I don't feel a real need for sub-routine access to VMBR and VMBW anymore.

It pays to listen to the experts.

 

I have not tried this stuff on real iron yet. I hope it doesn't over run the VDP.

 

Example:  (edit: Found an extra instruction from the old version that was not needed)

\ VSBR Forth style, on the stack
CODE: VC@   ( VDP-adr -- char )  \ Video CHAR fetch
            0 LIMI,
            R1 STWP,
            9 (R1) 8C02 @@ MOVB, \ write odd byte from TOS ie R4
            TOS 8C02 @@ MOVB,    \ write even bytes from TOS
            VDPRD @@ TOS MOVB,   \ READ char from VDP RAM into TOS
            TOS 8 SRL,           \ move the damned byte to correct half of the word
            2 LIMI,
            NEXT,
            END-CODE

 

 

 

12 hours ago, TheBF said:

I could not reference a register's odd numbered byte using indirect addressing because the numerical address could be anything.

That is just a slight optimization to the VDP routines, and you only incur a small increase (8us or something like that) by using other methods.  I see you worked it out, even if it is backwards (i.e. written in Forth). ;)

 

12 minutes ago, TheBF said:

I have not tried this stuff on real iron yet. I hope it doesn't over run the VDP.

This has been debated and tested quite a bit on the 99/4A, and IIRC there is only one very specific situation where you might be able to overrun the VDP (again, on the 99/4A).  Then again, I think that one case was on a modified system, so it is probably not possible on a stock console.  The threads about it are here on A.A. if you want to dig around.  The main clincher (again, IIRC) for the 99/4A is that VDP access triggers the wait-state generator, so you pretty much cannot overrun the VDP.

 

Also, if the system has an F18A, it cannot be overrun on the retro computers that used the 9918A family of VDP.  You need a CPU clock around 25MHz or faster to overrun the F18A.

3 minutes ago, matthew180 said:

That is just a slight optimization to the VDP routines, and you only incur a small increase (8us or something like that) by using other methods.  I see you worked it out, even if it is backwards (i.e. written in Forth). ;)

 

This has been debated and tested quite a bit on the 99/4A, and IIRC there is only one very specific situation where you might be able to overrun the VDP (again, on the 99/4A).  Then again, I think that one case was on a modified system, so it is probably not possible on a stock console.  The threads about it are here on A.A. if you want to dig around.  The main clincher (again, IIRC) for the 99/4A is that VDP access triggers the wait-state generator, so you pretty much cannot overrun the VDP.

 

Also, if the system has an F18A, it cannot be overrun on the retro computers that used the 9918A family of VDP.  You need a CPU clock around 25MHz or faster to overrun the F18A.

... even if it is backwards... 

The good news is the code gets laid down in the correct order. :)

I still find it amazing that you can write functional assembler with structured branching and looping in 200 lines. :) 

 

Thanks for the news on the over-run not being an issue. I only have stock hardware.

 

 


Once you start looking...

 

This is faster again: instead of using SRL it uses CLR and moves the data straight into the correct half of the register.

\ VSBR Forth style, on the stack
CODE: VC@   ( VDP-adr -- char )  \ Video CHAR fetch
            0 LIMI,
            R1 STWP,
            9 (R1) 8C02 @@ MOVB, \ write odd byte from TOS ie R4
            TOS 8C02 @@ MOVB,    \ write even bytes from TOS
            TOS CLR,               
            VDPRD @@ 9 (R1) MOVB, \  READ char from VDP RAM into TOS
            2 LIMI,
            NEXT,
            END-CODE

I can do this in a number of places...


You might want to check out the 9900 datasheet and get a little familiar with the instruction timings (pg. 28).  The slowest instructions are: DIV, MPY, Shift instructions, XOP, LDCR/STCR, BLWP.  The barrel shifter takes a cycle for each bit-shift, so the more bits, the longer the instruction takes.  So, even though shifting is faster than DIV and MPY for powers of 2, it is still slower than other instructions if you can do the same task with other instructions.

1 hour ago, matthew180 said:

You might want to check out the 9900 datasheet and get a little familiar with the instruction timings (pg. 28).  The slowest instructions are: DIV, MPY, Shift instructions, XOP, LDCR/STCR, BLWP.  The barrel shifter takes a cycle for each bit-shift, so the more bits, the longer the instruction takes.  So, even though shifting is faster than DIV and MPY for powers of 2, it is still slower than other instructions if you can do the same task with other instructions.

When I first started writing the low level code for this system I had that thing in front me all the time.  :)

 

 I use shift for the routines called 2* 4* 8* for fast multiplication and  2/ which divides by 2. 

 

The challenge with text is when you grab a byte and Forth wants the byte in the odd byte of the register. 

SRL  8  is pretty slow, but it's even worse to mask with AI and then SWPB.

 

I find it pretty hard to make the old 9900 give you anything for free. :)

 

 

8 hours ago, TheBF said:

SRL  8  is pretty slow, but it's even worse to mask with AI and then SWPB.

AI is 14 cycles, SWPB is 10 cycles, so 24 cycles total.  This assumes register addressing to make it comparable to shift, since the shift can only operate on registers.

 

The shift instructions have two forms (three actually, but only two that apply here), using R0 for the count, or the count is a fixed value.

 

If the count is in R0, then the timing is 20+2N (N is the value read from R0).  So in this case it would be 20+2*8 = 36 cycles.

 

If the count is fixed (which is encoded as part of the instruction), then the timing is 12+2C.  So in this case it would be 12+2*8 = 28 cycles.

So, AI + SWPB is still faster, by 4 cycles, than shifting by 8.
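
Plugging the quoted figures into a few lines of Python makes the comparison easy to repeat for other shift counts. This is only a sketch built on the cycle counts stated in this post (14 for AI, 10 for SWPB, 12+2C or 20+2N for shifts); it ignores memory wait states, and the names are made up for the example.

```python
AI_CYCLES = 14      # immediate instruction on a register, as quoted above
SWPB_CYCLES = 10    # swap bytes on a register, as quoted above


def shift_cycles(count, count_in_r0=False):
    """Clock cycles for a shift of `count` bits (1..15).
    Base cost is 12 with the count in the instruction, 20 with it in R0."""
    base = 20 if count_in_r0 else 12
    return base + 2 * count


if __name__ == "__main__":
    mask_and_swap = AI_CYCLES + SWPB_CYCLES
    print("AI + SWPB          :", mask_and_swap)
    print("SRL Rx,8           :", shift_cycles(8))
    print("SRL Rx,0 with R0=8 :", shift_cycles(8, count_in_r0=True))
```

By these numbers the two-instruction sequence only loses to a shift when the shift count is 5 or less.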


You are correct. I recall now that I chose SRL in that case to save space since the speed difference was so small. (24 vs 28)

And bytes are important since I try to keep the kernel in an 8K package.

