TheBF

Posts posted by TheBF

  1. I was testing this yesterday for copying 4K SAMS pages as fast as I could.

     

    This ASM code is reverse notation Forth Assembler so you might need to twist your head a bit. 

    The results are shown on the screen capture.

    CMOVE is the same as MOVE16 below but uses the MOVB instruction, i.e. a byte at a time, and of course it does not round the byte count up to an even number.

    MOVE32 has no benefit for moving a 4K block, as you can see, but it was 20% faster moving an 8K block, so it is better for block moves of 8K or more.

     

     

    Meanings so you can translate

    -----------------------------------

    BEGIN,        is a universal label to jump back to in this assembler

    OC WHILE,  compiles to:   JNC  REPEAT+2

    REPEAT,      compiles  to:  JMP  BEGIN

    LTE UNTIL,  compiles to:   JGT  BEGIN

     

    TOS           is R4 renamed

     

    NEXT,        returns to the Forth interpreter

    
    CODE MOVE16  ( src dst n -- )  \ n = no. of bytes to move, rounded up to even
                *SP+ R0 MOV,       \ pop DEST into R0
                *SP+ R1 MOV,       \ pop source into R1
                 TOS INC,          \ make sure n is even
                 TOS -2 ANDI,
                 BEGIN,
                    TOS DECT,         \ dect by two, moving 2 bytes at once
                 OC WHILE,            \ if n<0 get out
                    R1 *+ R0 *+ MOV,  \ mem to mem move, auto increment
                 REPEAT,
                 TOS POP,
                 NEXT,
                 ENDCODE
    
    \ no improvement for 4K byte moves.  20% faster for 8K bytes
    CODE MOVE32  ( src dst n -- )  \ n = no. of bytes to move, 4 bytes per pass
                *SP+ R0 MOV,       \ pop DEST into R0
                *SP+ R1 MOV,       \ pop source into R1
                BEGIN,
                    R1 *+ R0 *+ MOV,  \ memory to memory move, auto increment
                    R1 *+ R0 *+ MOV,  \ memory to memory move, auto increment
                    TOS -4 AI,        \ we are moving 4 bytes at once!
                LTE UNTIL,
                TOS POP,
                NEXT,
                ENDCODE
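
    For anyone who does not read 9900 assembler, the MOVE16 loop can be sketched in high-level Forth. This is only an illustration of the algorithm and will be far slower than the CODE word. CELLMOVE is a made-up name, and the sketch assumes the standard words ?DO, CELLS and CELL+ are available. It rounds the byte count up to a whole number of cells, which on the 9900 means rounding up to an even number, just as MOVE16 does.

    ```forth
    \ CELLMOVE is a hypothetical name, for illustration only
    : CELLMOVE ( src dst n -- )        \ n = byte count
       1 CELLS 1- +  1 CELLS /         \ bytes -> whole cells, rounded up
       0 ?DO
          OVER @ OVER !                \ copy one cell from src to dst
          CELL+ SWAP CELL+ SWAP        \ advance both pointers by one cell
       LOOP
       2DROP ;
    ```

    The stack juggling inside the loop is exactly what the auto-increment MOV instruction does for free in the CODE version.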

     

    fastmoves.png

  2. Simplified ALLOCATE, FREE, RESIZE   in ANS/ISO Forth

     

    I was reading a thread in comp.lang.forth about these words and discovered that a lot of people don't bother implementing the most formal interpretation of these words for small systems.

    By formal I mean something that would allow allocation, freeing and resizing memory blocks in such a way that there would never be fragmentation. This requires a way to read all the allocations either in a table or as a linked list so you can examine the state of each allocation.

     

    However, if you don't need all that, it becomes quite simple to make a simple system that does the same job, with the caveat that you have a more static allocation process, which is more in line with Forth thinking.

    So instead of a full implementation that takes 768 bytes, here is one that takes 118 bytes. :)  In fact if you remove the luxury of remembering the size of an allocation it would be even smaller.

    This version includes the word SIZE which seems to be commonly written by others.

    The Forth variable H is initialized to >2000 when Camel99 Forth starts and is used as the HEAP pointer for the lower 8K RAM.

    To reset the heap you would use  HEX 2000 H !  in Forth or make a word to do it.

    \ Minimal ALLOCATE FREE RESIZE for Camel99 Forth B Fox Sept 3 2020
    \ Mostly Static allocation
    HEX
    : HEAP,    ( n --) H @ !  [ 1 CELLS ] LITERAL  H +! ;
    : ALLOCATE ( n -- addr ?) DUP HEAP, H @   SWAP H +!  FALSE ;
    : SIZE     ( addr -- n) 2- @ ; \ not ANS/ISO, but commonly found
    \ *warning* FREE removes everything above it as well
    : FREE     ( addr -- ?) 2- DUP OFF  H ! FALSE ;
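    \ *warning* RESIZE will fragment the HEAP and does not copy the old contents
    \ note: argument order differs from ANS/ISO RESIZE ( addr n -- addr' ior )
    : RESIZE   ( n addr -- addr ?) DROP ALLOCATE ;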

     

    Usage would typically be something like this:

    
    \ protection and syntax sugar
     : ?ALLOC ( ? --) ABORT" Allocate error" ;
     : ->     ( addr ? --) POSTPONE ?ALLOC  POSTPONE TO ; IMMEDIATE
    
    \ define the variables during compiling
     0 VALUE X
     0 VALUE Y
    
    : START-PROGRAM
        50 ALLOCATE -> X
        50 ALLOCATE -> Y
        .... PROGRAM continues

     

  3. The only thing you might consider is writing these in CODE.  I kept DIST for computing actual distance but I felt it was too much overhead for coincidence since it's all just sitting there in VDP RAM to read and compare.

    I think these could be really fast using registers versus the stack juggling in the Forth version.

     

    Notice that I purposely have code duplication in COINC rather than calling COINCXY. This is just for a bit of extra speed.

    CODE overhead with BL would be low enough to allow calling COINCXY IMHO.

     

    
    : COINCXY   ( dx dy sp# tol -- ? )
            >R
            SP.Y V@ SPLIT
          ( -- col row col row )
            ROT - ABS R@ <
           -ROT - ABS R> <
            AND ;
    
    : COINC ( spr#1 spr#2 tol -- ?)  \ 1.4 mS, 1.1 mS optimized
          >R  SP.Y V@ SPLIT
           ROT SP.Y V@ SPLIT
            ( -- col row  col row)
           ROT - ABS R@  <
          -ROT - ABS R>  <
           AND ;
    

    Just my 2 cents on the matter.

  4. The other thing that workspaces are very good for is context switching. 

    If you initialize a group of workspaces as if they were called by BLWP, in a circle (A calls B, B calls C, C calls A), you can change tasks with just RTWP.  That's pretty cool!

     

     

  5. 10 hours ago, GDMike said:

    1. two different SAMS "windows" in your RAM space, switch in source and destination SAMS pages and copy from one to the other.

    yeah, that's what I'm doing, I'm using unpaged >E000->EFFF to write temporary data to and read from.

     

     

     

    I think this is my option 2 because you are copying SAMS data from a window in CPU RAM (?) , to "unpaged" >E000..>EFFF.

    Am I understanding what you are doing correctly?

     

    Option 1 means you have two 4K windows, say at >3000 and >E000. You map the source SAMS page in at >3000 and the destination page at >E000, then copy 4K bytes from >3000 to >E000.

    That is a SAMS-to-SAMS transfer.

  6. I find it's really hard to make big performance differences with the 9900 in the nestable sub-routine area.

    If you build a little stack it takes ~28 clocks to push R11 onto a stack and 12 to BL (no wait-state comparisons here) 

    So that's 30 and another 12 to return, so total overhead is 42.

     

    BLWP/RTWP is  26+14= 40 :( 

    If you have to pass any data back and forth to and from different workspaces you lose more time, whereas pushing R11 lets you share registers.

    Of course if you need to push a few registers with a stack model the 9900 will kill you. 

    You really have to work it through for every situation or just bite the bullet and take the penalty in exchange for a consistent calling convention.

     

    It reminds me of a song my grandfather sang after a suitable number of drinks. "Gone are the days when free lunches came with beer..." :) 

  7. 2 hours ago, GDMike said:

    TheBF,

    Talking about VMBR and VMBW.

    Could rewriting and using a different VMBR and BW code work better than those built in referenced code? I thought maybe I should use something that wasnt a built in reference too.

    For sure but they work fine until you need to go faster.

  8. 2 hours ago, GDMike said:

    Ahh, lucky for me, my user data is all in ram to start with, so my mov's are pretty fast. But I had a feeling, well I read somewhere that it rolls over like KSARUL said previously, but the article I found, and sorry but I can't find it now, but it talked regarding the older AMS? Not sure it was the same for my 1MB card, but KSARUL put it in perspective for me.

    It's my first time working with larger data all at once with this card and I'm enjoying the heck out of It. I'm actually at a point where I'm importing just short of 8K of user data from what they create in the 8K of the supercart, but in order to import, I have to push everything in SAMs,(>3000->3FFF) lower banks to the right or up, to higher banks by 9 banks so that pages 1-9 of my SNP pages are actually SNE pages imported/merged into existing SAMs at end up at the lower part of SAMs bank because I want them to show up as SNP pages 1-9 and what was SNP pages 1-9 are now pages 10-19 if that makes sense...

    And I've got that done, but I wanted to run a test while I was here and I bumped my loop up and outside the limit of banks available and saw no issues like a crash, so it led me to look into this.

    Thx everyone for chiming in. 

    I appreciate that.

     

     

    It doesn't matter if all the data is in RAM. To copy from SAMS page to SAMS page you need to do one of these:

     

    1. two different SAMS "windows" in your RAM space, switch in source and destination SAMS pages and copy from one to the other.

    -or-

    2. use one SAMS window , copy SAMS to a RAM buffer , switch the page in the window and copy the buffer back to SAMS

    -or-

    3. as I tried, use one SAMS window, copy it to VDP RAM, switch the page in the window and copy the VDP data back to SAMS.

     

     

    That's all the options I can think of.  (well I suppose you could write to file and then copy back but that's not practical)

     

  9. Thank you. That makes perfect sense. It would be an interesting, albeit significant, project to build the NCG, I am sure.

     

    I found this in Wikipedia:

     

    Niklaus Wirth specified a simple p-code machine in the 1976 book Algorithms + Data Structures = Programs.

    The machine had 3 registers - a program counter p, a base register b, and a top-of-stack register t. There were 8 instructions:

    1. lit 0, a : load constant a
    2. opr 0, a : execute operation a (13 operations: RETURN, 5 math functions, and 7 comparison functions)
    3. lod l, a : load variable l,a
    4. sto l, a : store variable l,a
    5. cal l, a : call procedure a at level l
    6. int 0, a : increment t-register by a
    7. jmp 0, a : jump to a
    8. jpc 0, a : jump conditional to a

     

    These look very familiar to Forth people. :) 

     

    Wirth seemed so ahead of the pack back then. 

  10. 8 hours ago, GDMike said:

    I did a test of copying data from a 1 mb Sam's card  bank to the next higher bank. My test kept copying past bank 245, well because I wanted to see how the console would handle my loop. And my tests kept performing as if the higher banks existed.

    Do these banks roll over, as in start writing to bank 1 if it can't find bank 247,248,250 or something similar?

     

    That's strange. I was doing the same thing.

    I was testing how fast I could copy pages  using only one 4K buffer in CPU RAM.  I was blitting into VDP RAM, switching pages and blitting back to SAMS.

    It took about 3 seconds to copy 64K that way using Assembler VMBW,VMBR inside a Forth loop.

     

    If I used a 4K buffer in CPU RAM I could get it to 2 seconds for 64K by using a custom copy routine that moved 16 bit cells at a time instead of bytes. By extension then I should be able to speed that up by moving 2 or maybe even 4 cells inside the assembler loop.

     

    The VDP method is not bad for performance and means I don't need to set aside CPU RAM for the copy buffer. Still deciding which way I want to go.

  11. My confusion was more around how you modify the existing compiler to understand 9900 Assembler opcodes or convincing the compiler to convert Pascal to native code.

    I assumed that changing the compiler is not possible at least for TI-99, so then I wondered about writing the code manually and putting the binary in system friendly form.

     

    If my assumptions about the flexibility of the compiler are wrong then problem solved. :) 

     

     

     

  12. 11 hours ago, apersson850 said:

    I've never used any NCG. I don't even know if any exists for the TMS 9900. But as far as I've read about them, it seems you start by compiling a Pascal program with a compiler directive, prepare for native code conversion. This directive points out the routines you've deemed interesting to convert. Then you execute the NCG, with the compiled code file as input, and get a new code file, where the appropriate procedure(s) have been converted to native code.

     

    For the TI 99/4A, the compiler ignores the code conversion directives, so I don't know exactly what they should accomplish.

    But the p-code interpreter does understand the codes indicating that native code is coming in the instruction stream. I'm actually, slooowly, working on a program which would do a limited conversion (just a few instructions handled), to investigate the feasibility of doing this on the TI 99/4A. If that works, then an expansion to convert more instructions isn't very difficult. The issue is to get it to work at all.

    Thanks for that explanation.

    That would be really something for TI-99.  Very interesting work creating good native code generators.  Not for the faint of heart. :) 

     

    So if the system can understand that "native code is coming" do we know what form it is expecting? (binary data?) 

    If we knew that, then in theory you could write the code in Assembler or even a Forth cross-compiler could be adapted to make a native code block.

    I have interfaced Forth to Pascal calling conventions under DOS and it was pretty simple since everything is passed on the stack "behind the curtain" in Pascal.

    It might be a whole lot trickier connecting correctly to the UCSD VM however. I have no idea on that.

     

     

  13. 2 hours ago, apersson850 said:

    If the program is small, then it's usually also pretty fast, as there isn't very much code to run through.

    Unless it's a small activity that's executed in a loop, taking many turns, in which case the task to translate the inner core of the loop to assembly usually isn't too daunting.

     

    No, what our system needs is a native code generator, which can translate the major part of a p-code routine to assembly automatically. Then you can write, debug and also use the program as it is. But if you do want to sacrifice some memory for speed, you can let the NCG convert some routines, where the program spends most of its time, to machine code.

    To your point, I made a little optimizer that expands Forth words that you choose to expand.  By expand I mean copy the machine code from the kernel routines inline into memory.

    When I tested it in inner loops I got up to a 2X speed improvement on those parts of the program.

     

    So the ideal would be an NCG that makes a linkable UNIT for UCSD Pascal?

     

     

  14. 16 hours ago, GDMike said:

    How cool is AORG >6000!!!!

     

    ... program BL routines

    ...

    ...

    RT

    When you have an extra 8K sitting there in a supercart and you have 8K of code to push to it.

    Im having a blast rt now with trying to make decisions on how the SNP and SNE 32K, (4X banks of 8K), can be effective and still retain user data but using 1 bank for either the SNP or SNE programs. 

     

    This doesn't change the way that SNE currently works, BUT it would mean that SNE would need it's program reloaded prior to use AND will not alter the user data in the other 3 banks. This way I can use that 1st bank, currently SNE program for program space for SNP.

    uh yeah! Like 8K contiguous.

    I mean yes I have SAMs but I don't have 8K without bank switching.

    I could get a lot of BL routines in that 8K space.

     

    Does it mean that the supercart would have to be part of the whole deal, well yeah, but SNP is related to SNE and SNE uses the 4 banks in the cartridge.

    So why not extend this out to SNP as a MUST have, SAMs is already a must have item.

    SNP is already using A000-D840 for program space. 

    SNP doesn't allow writing to any banks in the supercart, but SNE writes to 3 of them so I wouldn't have to worry about user corruption occuring by accident, unless someone pulled the cartridge out.

    It's just an idea for now.

     

     

     

     

    You can also put the primary program at >6000, load different sub-routine overlays at >A000, and map >A000 to different SAMS pages.

    Maybe a list of sub-routine addresses could go into a table at the top of >A000 on each page.

    You could then pull in a SAMS page and call the sub-routines with indexed addressing. The main program would have to remember the bank and subroutine numbers for each bank.

     

    (forgive my poor knowledge of Assembly language syntax. Not sure what it would take to keep something like this straight.)

     

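    SUBTAB   DATA   SUB0,SUB1,SUB2,SUB3,SUB4,SUB5,SUB6

                  LI         R1,3                 * select sub-routine 3

                  SLA       R1,1                 * index*2, each table entry is 2 bytes

                  MOV     @SUBTAB(R1),R2   * fetch the sub-routine address from the table

                  BL        *R2                   * call the sub-routine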
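
    For comparison, the same table-dispatch idea is easy to sketch in Forth, where the table holds execution tokens instead of raw addresses. All the names here (SUB0..SUB2, SUBTAB, CALL-SUB) are made up for the illustration, and the stand-in sub-routines just return a number so you can see which one ran.

    ```forth
    DECIMAL
    \ stand-in sub-routines, hypothetical names for illustration only
    : SUB0 ( -- n ) 10 ;
    : SUB1 ( -- n ) 20 ;
    : SUB2 ( -- n ) 30 ;

    \ table of execution tokens, one cell per sub-routine
    CREATE SUBTAB   ' SUB0 ,  ' SUB1 ,  ' SUB2 ,

    \ convert the sub-routine number to a cell offset, fetch the token, run it
    : CALL-SUB ( n -- ? )  CELLS SUBTAB + @ EXECUTE ;

    2 CALL-SUB .    \ runs SUB2 and prints 30
    ```

    There is no range check on the sub-routine number, so the caller has to keep it inside the table, the same as in the Assembler version.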

     

     

    You are the architect. :) 

     It is pretty cool to have that extra memory in the cartridge for sure.

  15. It's probably worth doing to give you a straight run of memory in upper RAM for all your code.

    To change your SAMS code you just change the address you use to put your bank number in. 

    To use >3000 as your swap page write the SAMS page number data to >4006. 

    Right now for >E000 you are using >401C as the SAMS register.
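
    Both of those addresses fit a simple pattern: one 16-bit mapper register per 4K window, starting at >4000. Here is a small sketch that computes the register from the window address, assuming that pattern holds for every window. SAMS-REG is a made-up name, not a Camel99 word, and the sketch assumes RSHIFT is available.

    ```forth
    HEX
    : SAMS-REG ( addr -- reg )   \ hypothetical helper, for illustration only
       0C RSHIFT                 \ 4K window number, 0..F
       2*  4000 + ;              \ two bytes per register, base >4000

    3000 SAMS-REG U.   \ prints 4006
    E000 SAMS-REG U.   \ prints 401C
    DECIMAL
    ```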

  16. That dissertation is worth another degree. :)

     

    So we had this discussion around finding fields in the Camel Forth header.  

    I don't fully understand why the terminator bit at the end is needed but it sounds like the one in the length byte has a purpose.

     

    Anton Ertl who shepherds gforth has moved to a name field based header. Searching is done for the name field and everything is accessible from the nfa.

    It's pretty consistent and it means that all the words that use stack strings (addr,len) can be brought to bear. FIND is deprecated and replaced with SEARCH-WORDLIST as I understand things.

     

    It's the eternal argument in Forth building I suppose.

    Don't break your code on my account. It just was an innocent question. :) 

  17. 39 minutes ago, Vorticon said:

    Unfortunately turning range checking off, changing the arrays base index to 0 and using a packed array of bytes for the board representation only resulted in a minimal saving in overall processing time. Oh well...

    I believe apersson850 showed that the p-code interpreter runs about 6..8 instructions for each p-code primitive. That's pretty big overhead at the glacial pace of the 9900.

    I have never looked at the GPL interpreter but it is probably about the same or maybe a bit more based on the execution speed of GPL.

     

    I guess you need a native code Pascal compiler. 

  18. This topic is rather long winded but what the heck.  

     

    I have taken to using Lee's version of the Sevens Problem, with a few "Camel Forth friendly" changes, as a general test of my system builds.

     

    Lately I built a "poor-man's just-in-time compiler" addition that requires some changes to the source code.

    You use the INLINE[   ]   directive on Forth primitives, constants, variables and user-variables and they are compiled as inline machine code. Kind of manual but it works.

     

    So I wondered what it would do.

    With the inlining, doing the version that only prints the final results,  I finally got down near the times that Lucien gets with his MLC compiler. 9.3 seconds versus 16.1 secs without inlining any code.

    Nice to know the tool can make a difference.

     

     

    Spoiler
    
    \ Lee Stewart's mod of Lucien2's code for the sevens problem
    
    \ Speedup mods for CAMEL99 Forth
    \ 1. Used VALUES
    \ 2. VTYPE for all short strings
    \ 3. Used UM/MOD , native division
    \ 4. >DIGIT for digit conversion
    \ 5. Redefined PAD as static memory
    \ 6. Defined 7*
    \ 7. Used inline for as many primitives as possible
    
    
    NEEDS ELAPSE FROM DSK1.ELAPSE
    NEEDS VALUE  FROM DSK1.VALUES
    NEEDS MALLOC FROM DSK1.MALLOC
    NEEDS VTYPE  FROM DSK1.VTYPE
    NEEDS INLINE[ FROM DSK1.INLINE
    
    MARKER /SEVENS
    
    DECIMAL
    \ ------------------------------------
    180 CONSTANT SIZE
    SIZE MALLOC CONSTANT A1  \ A1 digit array
    SIZE MALLOC CONSTANT PAD
    
    0 VALUE LENGTH    \ current number of digits in result
    
    HEX
    CODE 7*   ( n -- 7n)  \ 7n = 8n - n
         C044 ,           \ MOV R4,R1   copy TOS to R1
         0A34 ,           \ SLA R4,3    TOS = 8n
         6101 ,           \ S   R1,R4   TOS = 8n - n
         NEXT,
    ENDCODE
    
    DECIMAL
    : A1*7->A1 ( -- ) \ perform 7 * last result
       0              \ initialize carried digit on stack
       1 +TO LENGTH   \ assume we will increase length by 1 digit
       A1 LENGTH BOUNDS
       DO
          INLINE[ I C@  7* +  0 10 UM/MOD ] \ make result ud..unsigned divide by 10
          INLINE[ SWAP I C!  ]   \ store rem as cur digit..carry on stack
       LOOP
       DROP            \ clean up stack
     \ eliminate leading 0
       A1 LENGTH INLINE[ 1- + C@ 0=  ]   \ highest digit = 0?
       IF
          -1 +TO LENGTH  \ correct digit count
       THEN  ;
    
    : A1$ ( -- addr len)
       PAD DUP              \ PAD & COPY for string storage
       A1 1- DUP LENGTH +
       DO
          INLINE[ I C@ >DIGIT   OVER C!  1+ ]
       -1 +LOOP
        DROP             \ clean up stack
       ( PAD) LENGTH ;
    
    : A1$.TYPE ( --)
       [ A1 1- ] LITERAL LENGTH OVER +
       DO
         INLINE[ I C@ >DIGIT  VPUT  VCOL 1+@ C/L@ ]
         >= IF  CR  THEN
       -1 +LOOP ;
    
    : 7COUNTER ( -- ? )  \ Brian Fox's technique
       0                 \ initialize counter
       A1 LENGTH BOUNDS  \ DO A1 to A1 + length
       DO
        INLINE[ 1+ I C@ 7 = AND  DUP 6 = ]
          IF             \ more than '77777'?
             LEAVE       \ yup..we're done
          THEN
       LOOP
    ;
    
    DECIMAL
    : .POWER ( n -- ) S" SEVEN TO THE POWER OF " VTYPE DECIMAL .  S" IS" VTYPE ;
    
    : RUN      \ V2.58 1:26 ,
               \ v2.59 with 8 line scrolling, 1:02
               \ v2.62 with inline v4  54 sec.
       PAGE
       A1 SIZE 0 FILL
       7 A1 C!
       1 TO LENGTH
       2                 \ starting power
       BEGIN
          7COUNTER 5 <
       WHILE
          A1*7->A1
          DUP            \ dup power for display
          CR .POWER
          1+             \ increment power
          CR A1$.TYPE
          CR
       REPEAT
       DROP ;
    
    DECIMAL
    : NOSCROLL
       PAGE
       A1 SIZE 0 FILL
       7 A1 C!
       1 TO LENGTH
       2                 \ starting power
       BEGIN
          7COUNTER 5 <
       WHILE
          A1*7->A1
          DUP            \ dup power for display
          0 0 AT-XY .POWER
          1+             \ increment power
    \     CR CR A1$ TYPE    \ 39:08
         CR CR A1$ VTYPE    \ 24:50, W/INLINE4 00:15.35
       REPEAT
       0 7 AT-XY ;
    
    DECIMAL
    : FASTRUN         ( 16.1 seconds) ( W/INLINE 9.3 seconds)
       PAGE ." Working..."
       A1  SIZE 0 FILL
       PAD SIZE 0 FILL
       7 A1 C!
       1 TO LENGTH
       2                 \ starting power
       BEGIN
         7COUNTER 5 <
       WHILE
          A1*7->A1
          1+             \ increment power
       REPEAT
       1- CR .POWER
       CR A1$ VTYPE
       0 7 AT-XY
    ;
    
    DECIMAL
    
    CR .( TYPE: ELAPSE RUN, ELAPSE NOSCROLL or ELAPSE FASTRUN ) CR
    

     

     

     

     

     

     

     
