
Do these VDP routines look reasonable?


Kchula-Rrit


I'm thinking of using these routines I wrote to read/write VDP RAM.  Are there any obvious errors?  I wrote them to avoid the BLWP that the TI routines use, and to add the zero-byte-count check that the TI routines lack.  They can also work with the 16-bit workspace that C99 uses.

 

Forgot to mention that these routines are called with BL @VSPEEK, BL @VSPOKE, BL @VMPEEK, BL @VMPOKE.  I used these names to avoid conflicts with the Ed-Assem loader.
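For reference, a minimal hypothetical calling sequence (BUFFER and the values are made up; R14 is assumed to already point at a valid stack, as the C99 start-up sets one up):

       LI   R0,>0000        * VDP RAM address to read from
       LI   R1,BUFFER       * CPU RAM destination (hypothetical label)
       LI   R2,32           * byte count
       BL   @VMPEEK         * copy 32 bytes from VDP RAM into BUFFER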

 

K-R.

* vdp.i  (TI-name "VDP;I")
* C'99 Rev#5.1.CC  (C)1994 Clint Pulley & Winfried Winkler
 
***** VDP RAM read/write routines. *****
* File-name pointer is in R8.

* External REFerences for TI linker.
    REF  VDPWA,VDPWD,VDPRD

* Address definitions to allow these
*  routines to function without the
*  TI linker.
*
* Read VDP data from this address.
*VDPRD  EQU  >8800
*
* Read VDP status from this address.
*VDPSTA EQU  >8802
*
* Write VDP data to this address.
*VDPWD  EQU  >8C00
*
* Write VDP command or RAM-address
*  to this address.
*VDPWA  EQU  >8C02

VMPEEK
* Get multiple bytes from VDP RAM and
*  place them in CPU RAM.
*  R0 = VDP RAM start address.
*  R1 = CPU RAM start address.
*  R2 = Byte count.

* Save caller's return address on stack.
    DECT R14
    MOV  R11,*R14

* Check count for zero.  The TI routine
*  does not check for this.
    MOV  R2,R2
    JEQ  VREXIT * If zero, exit.

* Save caller's registers.
    DECT R14
    MOV  R2,*R14
    DECT R14
    MOV  R1,*R14
    DECT R14
    MOV  R0,*R14

* Set VDP read-address.
    ANDI R0,>3FFF
    BL   @VMADDR

VMLOOP
* Get VDP data and place in
*  caller's buffer.
    MOVB @VDPRD,*R1+

* Done?  If not, get another byte.
    DEC  R2
    JNE  VMLOOP

* Restore caller's registers.
    MOV  *R14+,R0
    MOV  *R14+,R1
    MOV  *R14+,R2

VREXIT
* Restore caller's return address,
*  then exit.
    MOV  *R14+,R11
    B    *R11


VMPOKE
* Send multiple bytes from CPU RAM
*  and place them in VDP RAM.
*  R0 = VDP RAM start address.
*  R1 = CPU RAM start address.
*  R2 = Byte count.

* Save caller's return address on stack.
    DECT R14
    MOV  R11,*R14

* Check count for zero.  The TI routine
*  does not check for this.
    MOV  R2,R2
    JEQ  VWEXIT * If zero, exit.

* Save caller's registers.
    DECT R14
    MOV  R2,*R14
    DECT R14
    MOV  R1,*R14
    DECT R14
    MOV  R0,*R14

* Set VDP write-address.
    ANDI R0,>3FFF
    ORI  R0,>4000
    BL   @VMADDR

VWLOOP
* Send data from caller's buffer
*  to VDP RAM.
    MOVB *R1+,@VDPWD

* Done?  If not, send another byte.
    DEC  R2
    JNE  VWLOOP

* Restore caller's registers.
    MOV  *R14+,R0
    MOV  *R14+,R1
    MOV  *R14+,R2

VWEXIT
* Restore caller's return address,
*  then exit.
    MOV  *R14+,R11
    B    *R11


VSPEEK
* Get a byte from VDP RAM and place
*  in R1 MS byte.
*
*  R0 = VDP RAM address.
*  R1 (MS Byte) = Data byte.

* Save caller's return address on stack.
    DECT R14
    MOV  R11,*R14

* Save caller's registers.
    DECT R14
    MOV  R1,*R14
    DECT R14
    MOV  R0,*R14

* Set VDP read-address.
    ANDI R0,>3FFF
    BL   @VMADDR

* Get VDP data and place in
*  caller's R1 MS byte.
    MOVB @VDPRD,R1

* Restore caller's registers.
    MOV  *R14+,R0
    MOV  *R14+,R1

* Restore caller's return address,
*  then exit.
    MOV  *R14+,R11
    B    *R11


VSPOKE
* Send byte in R1 MS byte to VDP RAM.
*  R0 = VDP RAM address.
*  R1 (MS Byte) = Data byte.

* Save caller's return address on stack.
    DECT R14
    MOV  R11,*R14

* Save caller's registers.
    DECT R14
    MOV  R1,*R14
    DECT R14
    MOV  R0,*R14

* Set VDP write-address.
    ANDI R0,>3FFF
    ORI  R0,>4000
    BL   @VMADDR

* Send data byte in R1 MS byte
*  to VDP RAM.
    MOVB R1,@VDPWD

* Restore caller's registers.
    MOV  *R14+,R0
    MOV  *R14+,R1

* Restore caller's return address,
*  then exit.
    MOV  *R14+,R11
    B    *R11


VMADDR
* Send address to VDP.
* This routine expects address in R0.
* Send LS byte of address.
    SWPB R0
    MOVB R0,@VDPWA

* Send MS byte of address.
    SWPB R0
    MOVB R0,@VDPWA

* Return to caller.
    B    *R11

    EVEN

K-R.

 

 


It is not even necessary to push/pop the caller's registers if you can make a rule that registers Rx,Ry,Rz are reserved for VDP I/O in your program.

If that is not possible and your program needs a full register set, Apersson850's advice is the way to go.
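A minimal sketch of what VSPOKE could shrink to under that rule (hypothetical rewrite, assuming R0/R1 are reserved for VDP I/O and the caller expects them to be clobbered):

* R0 = VDP RAM address, R1 MS byte = data; nothing saved or restored.
VSPOKE ANDI R0,>3FFF        * mask to 14-bit VDP address
       ORI  R0,>4000        * set the write bit
       SWPB R0
       MOVB R0,@VDPWA       * LS byte of address
       SWPB R0
       MOVB R0,@VDPWA       * MS byte of address, write bit set
       MOVB R1,@VDPWD       * write the data byte
       B    *R11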


11 hours ago, apersson850 said:

R14 must be set up as a stack pointer elsewhere.

You usually want this kind of routine to be as fast as possible. These will be slow to call, compared to using BLWP.

I forgot to add that I use assembly with C99 as my start-up, so it's there if I need it.  The C99 start-up routine sets up the stack pointer, among other things.  I figured any BLWP I can set up will be in 8-bit RAM anyway, so I might as well use the TI routines if I went the BLWP route.  But using the TI routines would make this dependent on the Ed-Assem module being present.

 

C99 uses a 16-bit workspace and the documentation I have says 0x8330-8348 should be okay to use, but anything above that might play havoc with system calls.  I'm already using those locations for something else so I figured that this is the best speed-up I could get, short of placing the routines in 16-bit RAM.

1 hour ago, TheBF said:

It is not even necessary to push/pop the caller's registers if you can make a rule that registers Rx,Ry,Rz are reserved for VDP I/O in your program.

If that is not possible and your program needs a full register set, Apersson850's advice is the way to go.

I think I've been operating under some rules like you suggest, primarily using R7 and R8 for general operations.  R0, R1, and R2 I pretty much just use for VDP operations, and I don't count on the contents being preserved for VDP calls.  The save/restore was force-of-habit.

55 minutes ago, Asmusr said:

VSPEEK should not restore R1 because that's where the return value is stored.

Good catch!

 

Thanks for all the advice.  Looks like I traded a slow call to a fast routine in an 8-bit workspace for a fast (or less-slow) call to a slow routine in a 16-bit workspace.  I'll have to get out my copy of the TMS9900 manual and check instruction timings to see if I'm gaining anything, even after I streamline my routines.

 

K-R.


Hi,

 

Like apersson850 said, you want these routines to be as fast as possible. Instead of using a stack and a VMADDR subroutine, inline VMADDR in every place it is used. It's just 4 instructions. Then you don't need to save the return address because there is no inner BL. Like you said, if R0-R2 are just for VDP, you can have the caller assume they are used up.
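For reference, the four instructions being inlined are just the VMADDR body from the original listing (read-address case; OR >4000 into R0 first for a write address):

       SWPB R0
       MOVB R0,@VDPWA       * LS byte of VDP address
       SWPB R0
       MOVB R0,@VDPWA       * MS byte of VDP address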

 

If you can have your WS in PAD, then put VDPWD in a register for another speedup. From slow RAM, it takes the CPU 12 cycles to fetch the VDPWD address after the MOVB, but fetching it from a register in PAD takes 2.

For VMBW, that's a big savings in each loop.

MOVB *R1+,@VDPWD

* or

LI R15,VDPWD  * once
MOVB *R1+,*R15

 

You can go further by unrolling the loop. If you know that the length in R2 is a multiple of 8, you can inline this:

  SRL   R2,3   * divide by 8
LOOP
  MOVB *R1+,*R15
  MOVB *R1+,*R15
  MOVB *R1+,*R15
  MOVB *R1+,*R15
  MOVB *R1+,*R15
  MOVB *R1+,*R15
  MOVB *R1+,*R15
  MOVB *R1+,*R15
  DEC  R2
  JNE  LOOP

There is a trick to deal with leftover 1-7 bytes, but I find that most of the time I am writing chunks of 8,32,128,768 and so on.
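One simple way to handle the leftover bytes is a short tail loop rather than the jump-into-the-unrolled-loop trick; a sketch, assuming R1 = source, R15 = VDPWD, R2 = byte count, R3 free as scratch, and hypothetical labels:

       MOV  R2,R3
       ANDI R3,>0007        * R3 = leftover bytes (0-7)
       SRL  R2,3            * R2 = number of full 8-byte chunks
       JEQ  WRTAIL          * no full chunks, go straight to the tail
WRLOOP MOVB *R1+,*R15       * unrolled by 8
       MOVB *R1+,*R15
       MOVB *R1+,*R15
       MOVB *R1+,*R15
       MOVB *R1+,*R15
       MOVB *R1+,*R15
       MOVB *R1+,*R15
       MOVB *R1+,*R15
       DEC  R2
       JNE  WRLOOP
WRTAIL MOV  R3,R3           * any leftover bytes?
       JEQ  WRDONE
WRTLP  MOVB *R1+,*R15
       DEC  R3
       JNE  WRTLP
WRDONE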

 

 


Instead of doing SWPB to access each byte for transfer to the VDP, remember you can start with WSADDR EQU >8300. After that, it's possible to do a MOVB @WSADDR+1,@VDPWA to get the least significant byte in there. For the next move, it's just MOVB R0,@VDPWA. Note that if you let the two come right after each other, there's a risk of overrunning the VDP. Instead of just wasting time between the two MOVB instructions, it's sometimes possible to do something you need to do anyway in between, instead of waiting to do it after the address transfer is completed.
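A minimal sketch of that idea, assuming the workspace sits at WSADDR = >8300 (so R0's LS byte is at >8301) and using the load of the data-port address into R15 (my choice of register here) as the useful work between the two writes:

WSADDR EQU  >8300
* Set a VDP read address from R0 without SWPB.
       MOVB @WSADDR+1,@VDPWA   * LS byte of R0
       LI   R15,VDPRD          * useful filler instead of a pure delay
       MOVB R0,@VDPWA          * MS byte of R0 (top two bits 00 = read)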

 

You may think this overrunning of the VDP is a non-issue. And it usually is. But if you run both workspace and code in fast RAM, it becomes a real issue. I remember seeing the players in the Tennis game start running in two directions at the same time (upper body in one, legs in another) when I equipped my TI with fast RAM all over the address space.

 

In another context, I summarized that if you go away from the BLWP - RTWP route, but have to use two more instructions to compensate for not having a new register set, then you are equal in call time.

Since the TMS 9900 is comparatively slow in fetching and decoding instructions, using a more complex instruction is frequently faster than a few simple instructions.


On 9/20/2020 at 1:37 AM, Kchula-Rrit said:

 


* Restore caller's return address,
*  then exit.
       MOV  *R14+,R11
       B    *R11

 

 

Oops! Never mind (see @apersson850’s following post). Didn’t think this one through. Trips me up every once in a while. Will again, I am sure.

You can save one instruction here with

* Return to caller...NOT!
       B    *R14+    ; <---oops! This would attempt to execute the top of the return stack...not what we want.

 

...lee


4 hours ago, Lee Stewart said:

 


* Return to caller...NOT!
       B    *R14+    ; <---oops! This would attempt to execute the top of the return stack...not what we want.

 

This is not a 9900 instruction, but the idea you had is provided by the 99000 instruction Branch Indirect (BIND). Suppose R14 is your stack pointer:

* Pop stack, return to address fetched from *R14
        BIND *R14+

 

The opposite is:

* Push stack, Call routine    
      BLSK R14,@VMADDR * does a DECT R14, then push NEXT onto *R14
NEXT
...

VMADDR  ...
      BIND *R14+    * return

I first learned the concept from PDP-11 assembly, which allows "deferred" (double indirect) addressing, with + or -, on every operand (it has 8 registers and 8 addressing modes).  Jealous.

 

 


Thanks to everyone for the suggestions.

 

After looking at the code it occurred to me that these routines do not call any others, so I rewrote the routines to remove the stack pushes/pops and to get rid of the address-write subroutine.

 

VDPWD for VMPOKE is stored in R3, so I can change MOVB *R1+,@VDPWD to MOVB *R1+,*R3, and I did the equivalent for VMPEEK.
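A hypothetical sketch of what a streamlined VMPOKE along those lines could look like (not the actual rewrite): no stack use, the address write inlined, VDPWD kept in R3, and R0-R3 assumed clobberable; VMPLP and VMPDON are made-up labels.

VMPOKE MOV  R2,R2           * zero byte count?
       JEQ  VMPDON
       ANDI R0,>3FFF
       ORI  R0,>4000        * set the write bit
       SWPB R0
       MOVB R0,@VDPWA       * LS byte of VDP address
       SWPB R0
       MOVB R0,@VDPWA       * MS byte, write bit set
       LI   R3,VDPWD        * data port in a register
VMPLP  MOVB *R1+,*R3        * CPU RAM -> VDP RAM
       DEC  R2
       JNE  VMPLP
VMPDON B    *R11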

 

The changes cut the code size from 178 bytes to 114 bytes.  After typing up an Excel spreadsheet to make a table of instruction execution times, it occurred to me that I'm running code in 8-bit RAM, but the registers are in 16-bit RAM, the data is in 8-bit RAM, and (I think) the VDP is in 16-bit RAM.  I sort of gave up on trying to calculate just how much faster the new routines would be.

 

I can post it if anyone is interested.

 

Also, I would love an auto-decrement instruction (like MOV R8,*R14-).  It would make stack pushes and loops a lot easier.

 

K-R.

 

 


5 hours ago, Kchula-Rrit said:

 

 

VDPWD for VMPOKE is stored in R3, so I can change MOVB *R1+,@VDPWD to MOVB *R1+,*R3, and I did the equivalent for VMPEEK.

 

 

 

Also, I would love an auto-decrement instruction (like MOV R8,*R14-).  It would make stack pushes and loops a lot easier.

 

K-R.

 

 

You are having PDP-11 dreams again.  I know a good shrink... :) 

Sounds like you did all the good stuff.

 

FYI: for block VDP reads/writes, I have measured that using a register rather than the port address is 12.9% faster on Classic99.


The VDP has its own timing. Access is as memory, but the data port is only 8 bits wide on the VDP, so you get one byte at a time.

 

There can't be any auto-decrement, due to the opcode formats. There are two bits to define the general addressing mode, then four bits to give the register number.

Register

Register indirect

Register indirect with auto-increment

Address indexed

Direct address and address indexed are the same thing. If the index register number is zero, it's a direct address, which implies that you can't index via R0.
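In source form, the four general addressing modes (with their two-bit T-field values) look like this; TABLE is just a placeholder label:

       MOV  R1,R2           * register                      (T = 00)
       MOV  *R1,R2          * register indirect             (T = 01)
       MOV  *R1+,R2         * indirect with auto-increment  (T = 11)
       MOV  @TABLE(R1),R2   * indexed                       (T = 10, index reg <> 0)
       MOV  @TABLE,R2       * symbolic/direct               (T = 10, register field = 0)
TABLE  DATA 0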


I could do without either address-indexed or direct-address (I think TI called it "Symbolic" in the TMS9900 manual) if I could do MOV R11,*R14- for a stack push.

 

On second thought, I would trade address-indexed for register indirect with auto-decrement, since I hardly ever use indexed [MOV @THIS(R8),R8] and I do a fair number of stack operations.

 

I realize it's a moot point, but I can still dream...

 

Back to the original subject, I tried out my new routines and they work, at least in my tests.

 

K-R.


Indexed is very valuable for frame pointer offset access, not only for indexing into arrays.

Direct memory access is kind of the whole point of a memory-to-memory architecture.

I'm of the opinion that even though I would also like a MOV R11,-*SP for stack push (note that it must be pre-decrement and post-increment to work), TI still did the right prioritization here. But it's a moot point now, for sure.


4 hours ago, apersson850 said:

I'm of the opinion that even though I would also like a MOV R11,-*SP for stack push (note that it must be pre-decrement and post-increment to work), TI still did the right prioritization here. But it's a moot point now, for sure.

I agree. But a dedicated stack pointer with push and pop, like the F18A GPU has, would have been very useful.


On 9/23/2020 at 2:52 PM, apersson850 said:

The VDP has its own timing. Access is as memory, but the data port is only 8 bits wide on the VDP, so you get one byte at a time.

 

There can't be any auto-decrement, due to the opcode formats. There are two bits to define the general addressing mode, then four bits to give the register number.

Register

Register indirect

Register indirect with auto-increment

Address indexed

Direct address and address indexed are the same thing. If the index register number is zero, it's a direct address, which implies that you can't index via R0.

apersson850 probably knows all this, but I thought I'd point out:

 

PDP-11 and 9900 are both 16-bit memory-to-memory architectures, and the instruction sets nearly line up.

 

Each has instructions that fit into 16 bits, with the first 4 bits decoding to the general  2-operand memory-to-memory instructions.

 

But

 

PDP-11 has 8 registers, so 3 bits for addressing mode and 3 bits for register#.   12 bits for 2 general operands.

9900 has 16 registers, so 2 bits for addressing mode and 4 bits for register#.    12 bits for 2 general operands.

 

To get 4 more addressing modes, you'd have to give up 8 of the 16 registers. Since R11-R15 sometimes have special purposes, that is not a nice tradeoff. The PDP-11 always has special-purpose registers: R6 is the stack pointer and R7 is the program counter.

 

----

 

The PDP-11's 16-bit instruction word for MOV is a lot like the 9900's. And it is quite elegant in octal (3 bits per digit; 6 digits hold 16 to 18 bits).  Yes, the PDP-11 favors octal.

2 general operands:

(byte-flag 0 or 1) opcode Ts S Td D   (each field 3 bits except bflag)  MOVB has the byte-flag set.

Octal,    Hex
010000   >1000  MOV  R0,R0 
11010F   >53C2  MOVB R15,R1

see how the fields line up neatly because an octal digit is 3 bits. The leading digit is just 1 bit of a 16 bit word.

 

In 9900

opcode (4 bits, byte-flag last) Td D Ts S (Ts,d are 2 bits, S/D are 4 bits. destination comes first.)

>C000 MOV R0,R0
>D04F MOVB R15,R1
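
Decoding the second 9900 example back into its fields as a check (the byte-flag is the low bit of the 4-bit opcode nibble):

* >D04F = 1101 00 0001 00 1111
*          |    |   |   |   +-- S  = 15 (R15)
*          |    |   |   +------ Ts = 00 (register)
*          |    |   +---------- D  = 1  (R1)
*          |    +-------------- Td = 00 (register)
*          +------------------- opcode 1101 = MOVB
*  so >D04F is MOVB R15,R1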


For a full comparison of all the 9900, general 2-operand instructions:

op instruction.  op is 4 bits,  +1 if Byte
4  SZC
6  S    Six=Subtract
8  C
A  A    A=Add
C  MOV
E  SOC
0 other opcodes are built on this, like COC
2 other opcodes are built on this

 

On PDP-11 the 4-bit opcode equivalents are

PDP-11  9900 version
1 MOV    MOV
2 CMP    C
3 BIT    COC (but COC is not a general 2 operand instruction)
4 BIC    SZC
5 BIS    SOC
6 ADD    A
E SUB    S    (SUB sits in what would be ADDB's slot)
+8 if byte

 

There is no equivalent of AB or SB; the remaining opcode slots (7 and F) are decoded into further instructions.

Another big difference is that PDP-11 byte instructions operate on the lower byte! And they are sign-extended.

 

So, memory-to-memory architecture ends up looking quite similar across the two CPUs.

 

