Kchula-Rrit Posted September 20, 2020 (edited)

I'm thinking of using these routines I wrote to read/write VDP RAM. Are there any obvious errors? I wrote them to avoid the BLWP that the TI routines use, and because the TI routines don't check for a zero byte count. These can also sit in the 16-bit workspace that C99 uses.

Forgot to mention that these routines are called with BL @VSPEEK, BL @VSPOKE, BL @VMPEEK, BL @VMPOKE. I used these names to avoid conflicts with the Ed-Assem loader.

K-R.

* vdp.i (TI-name "VDP;I")
* C'99 Rev#5.1.CC (C)1994 Clint Pulley & Winfried Winkler
***** VDP RAM read/write routines. *****
* File-name pointer is in R8.

* External REFerences for TI linker.
       REF  VDPWA,VDPWD,VDPRD

* Address definitions to allow these
* routines to function without the
* TI linker.
*
* Read VDP data from this address.
*VDPRD  EQU  >8800
*
* Read VDP status from this address.
*VDPSTA EQU  >8802
*
* Write VDP data to this address.
*VDPWD  EQU  >8C00
*
* Write VDP command or RAM-address
* to this address.
*VDPWA  EQU  >8C02

VMPEEK
* Get multiple bytes from VDP RAM and
* place them in CPU RAM.
* R0 = VDP RAM start address.
* R1 = CPU RAM start address.
* R2 = Byte count.
* Save caller's return address on stack.
       DECT R14
       MOV  R11,*R14
* Check count for zero. The TI routine
* does not check for this.
       MOV  R2,R2
       JEQ  VREXIT         * If zero, exit.
* Save caller's registers.
       DECT R14
       MOV  R2,*R14
       DECT R14
       MOV  R1,*R14
       DECT R14
       MOV  R0,*R14
* Set VDP read-address.
       ANDI R0,>3FFF
       BL   @VMADDR
VRLOOP
* Get VDP data and place in
* caller's buffer.
       MOVB @VDPRD,*R1+
* Done? If not, get another byte.
       DEC  R2
       JNE  VRLOOP
* Restore caller's registers.
       MOV  *R14+,R0
       MOV  *R14+,R1
       MOV  *R14+,R2
VREXIT
* Restore caller's return address,
* then exit.
       MOV  *R14+,R11
       B    *R11

VMPOKE
* Send multiple bytes from CPU RAM
* to VDP RAM.
* R0 = VDP RAM start address.
* R1 = CPU RAM start address.
* R2 = Byte count.
* Save caller's return address on stack.
       DECT R14
       MOV  R11,*R14
* Check count for zero. The TI routine
* does not check for this.
       MOV  R2,R2
       JEQ  VWEXIT         * If zero, exit.
* Save caller's registers.
       DECT R14
       MOV  R2,*R14
       DECT R14
       MOV  R1,*R14
       DECT R14
       MOV  R0,*R14
* Set VDP write-address.
       ANDI R0,>3FFF
       ORI  R0,>4000
       BL   @VMADDR
VWLOOP
* Send a byte from the caller's
* buffer to the VDP.
       MOVB *R1+,@VDPWD
* Done? If not, send another byte.
       DEC  R2
       JNE  VWLOOP
* Restore caller's registers.
       MOV  *R14+,R0
       MOV  *R14+,R1
       MOV  *R14+,R2
VWEXIT
* Restore caller's return address,
* then exit.
       MOV  *R14+,R11
       B    *R11

VSPEEK
* Get a byte from VDP RAM and place
* in R1 MS byte.
* R0 = VDP RAM address.
* R1 (MS byte) = Data byte.
* Save caller's return address on stack.
       DECT R14
       MOV  R11,*R14
* Save caller's registers.
       DECT R14
       MOV  R1,*R14
       DECT R14
       MOV  R0,*R14
* Set VDP read-address.
       ANDI R0,>3FFF
       BL   @VMADDR
* Get VDP data and place in
* caller's R1 MS byte.
       MOVB @VDPRD,R1
* Restore caller's registers.
       MOV  *R14+,R0
       MOV  *R14+,R1
* Restore caller's return address,
* then exit.
       MOV  *R14+,R11
       B    *R11

VSPOKE
* Send byte in R1 MS byte to VDP RAM.
* R0 = VDP RAM address.
* R1 (MS byte) = Data byte.
* Save caller's return address on stack.
       DECT R14
       MOV  R11,*R14
* Save caller's registers.
       DECT R14
       MOV  R1,*R14
       DECT R14
       MOV  R0,*R14
* Set VDP write-address.
       ANDI R0,>3FFF
       ORI  R0,>4000
       BL   @VMADDR
* Send R1 MS byte to the VDP.
       MOVB R1,@VDPWD
* Restore caller's registers.
       MOV  *R14+,R0
       MOV  *R14+,R1
* Restore caller's return address,
* then exit.
       MOV  *R14+,R11
       B    *R11

VMADDR
* Send address to VDP.
* This routine expects address in R0.
* Send LS byte of address.
       SWPB R0
       MOVB R0,@VDPWA
* Send MS byte of address.
       SWPB R0
       MOVB R0,@VDPWA
* Return to caller.
       B    *R11

       EVEN

K-R.

VDP.i
apersson850 Posted September 20, 2020

R14 must be set up as a stack pointer elsewhere.

You usually want this kind of routine to be as fast as possible. These will be slow to call, compared to using BLWP.
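For reference, the BLWP calling convention apersson850 mentions could be set up roughly like this. This is only a sketch with made-up names (MYWS, VMPVEC, VMPENT), not the TI utilities themselves; it just shows why the call itself is cheap: BLWP switches to a private workspace, so nothing has to be pushed.

```
* Hypothetical BLWP vector: workspace pointer + entry point.
MYWS   BSS  32             * private 16-register workspace
VMPVEC DATA MYWS,VMPENT    * caller does BLWP @VMPVEC

VMPENT
* BLWP put the caller's WP in our R13, so the
* caller's registers are reachable via indexing.
       MOV  *R13,R0        * caller's R0 = VDP address
       MOV  @2(R13),R1     * caller's R1 = CPU address
       MOV  @4(R13),R2     * caller's R2 = byte count
* ... transfer loop goes here ...
       RTWP                * restores caller's WP, PC, ST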
+TheBF Posted September 20, 2020

It is not even necessary to push/pop the caller's registers if you can make a rule that registers Rx, Ry, Rz are reserved for VDP I/O in your program. If that is not possible and your program needs a full register set, apersson850's advice is the way to go.
Asmusr Posted September 20, 2020

VSPEEK should not restore R1, because that's where the return value is stored.
Kchula-Rrit (Author) Posted September 20, 2020

11 hours ago, apersson850 said:
R14 must be set up as a stack pointer elsewhere. You usually want this kind of routine to be as fast as possible. These will be slow to call, compared to using BLWP.

I forgot to add that I use assembly with C99 as my start-up, so it's there if I need it. The C99 start-up routine sets up the stack pointer, among other things. I figured any BLWP vectors I set up would be in 8-bit RAM anyway, so I might as well use the TI routines if I went the BLWP route. But using the TI routines would make this dependent on the Ed-Assem module being present. C99 uses a 16-bit workspace, and the documentation I have says 0x8330-0x8348 should be okay to use, but anything above that might play havoc with system calls. I'm already using those locations for something else, so I figured that this is the best speed-up I could get, short of placing the routines in 16-bit RAM.

1 hour ago, TheBF said:
It is not even necessary to push/pop the caller's registers if you can make a rule that registers Rx, Ry, Rz are reserved for VDP I/O in your program.

I think I've been operating under some rules like you suggest, primarily using R7 and R8 for general operations. R0, R1, and R2 I pretty much just use for VDP operations, and I don't count on their contents being preserved across VDP calls. The save/restore was force of habit.

55 minutes ago, Asmusr said:
VSPEEK should not restore R1, because that's where the return value is stored.

Good catch!

Thanks for all the advice. Looks like I substituted a fast (or less-slow) call to a slow routine in a 16-bit workspace for a slow call to a fast routine in an 8-bit workspace. I'll have to get out my copy of the TMS9900 manual and check instruction timings to see whether I'm gaining anything, even after I streamline my routines.

K-R.
+FarmerPotato Posted September 20, 2020

Hi,

Like apersson850 said, you want these routines to be as fast as possible. Instead of using a stack and a VMADDR subroutine, inline VMADDR in every place it is used; it's just four instructions. Then you don't need to save the return address, because there is no inner BL.

Like you said, if R0-R2 are just for VDP, you can have the caller assume they are used up.

If you can have your workspace in PAD, then put VDPWD in a register for another speedup. From slow RAM, it takes the CPU 12 cycles to fetch the VDPWD address after the MOVB, but fetching it from a register in PAD takes 2. For VMBW, that's a big savings in each loop.

       MOVB *R1+,@VDPWD
* or
       LI   R15,VDPWD      * once
       MOVB *R1+,*R15

You can go further with inlining. If you know that the length in R2 is a multiple of 8, you can inline this:

       SRL  R2,3           * divide by 8
LOOP   MOVB *R1+,*R15
       MOVB *R1+,*R15
       MOVB *R1+,*R15
       MOVB *R1+,*R15
       MOVB *R1+,*R15
       MOVB *R1+,*R15
       MOVB *R1+,*R15
       MOVB *R1+,*R15
       DEC  R2
       JNE  LOOP

There is a trick to deal with leftover 1-7 bytes, but I find that most of the time I am writing chunks of 8, 32, 128, 768 and so on.
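The leftover-bytes trick FarmerPotato alludes to can be done with a computed jump into a run of trailing moves. A sketch, assuming R15 already holds VDPWD, the VDP write address has been set, and R3 holds the leftover count (0-7); the label TDONE is made up:

```
* Branch backwards into seven MOVBs so that exactly
* R3 trailing bytes (0-7) are sent to the VDP.
       SLA  R3,1           * each MOVB below is one word (2 bytes)
       NEG  R3             * negative offset back from TDONE
       B    @TDONE(R3)     * R3=0 lands on TDONE, skipping all
       MOVB *R1+,*R15      * executed when 7 bytes remain
       MOVB *R1+,*R15      * 6
       MOVB *R1+,*R15      * 5
       MOVB *R1+,*R15      * 4
       MOVB *R1+,*R15      * 3
       MOVB *R1+,*R15      * 2
       MOVB *R1+,*R15      * 1
TDONE
```

This is the same idea as Duff's device in C: the jump target selects how many of the unrolled moves actually execute.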
apersson850 Posted September 21, 2020 (edited)

Instead of doing SWPB to access each byte for transfer to the VDP, remember that you can start with WSADDR EQU >8300. After that, it's possible to do a MOVB @WSADDR+1,@VDPWA to get the least significant byte in there. For the next move, it's just MOVB R0,@VDPWA.

Note that if you let the two come right after each other, there's a risk of overrunning the VDP. Instead of just wasting time between the two MOVBs, it's sometimes possible to do something you need to do anyway in between, instead of waiting until the address transfer is completed.

You may think this overrunning of the VDP is a non-issue. And it usually is. But if you run both workspace and code in fast RAM, then it is a real issue. I remember seeing the players in the Tennis game start running in two directions at the same time (upper body in one, legs in another) when I equipped my TI with fast RAM all over the address space.

In another context, I summarized that if you go away from the BLWP - RTWP route, but have to use two more instructions to compensate for not having a new register set, then you are equal in call time. Since the TMS 9900 is comparatively slow at fetching and decoding instructions, using one more complex instruction is frequently faster than a few simple ones.
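Putting both of apersson850's suggestions together might look like the sketch below. It assumes the workspace sits at >8300 (so R0's low byte is at >8301) and that R0 already holds the prepared VDP address; the LI is just a stand-in for whatever useful work you do between the two address bytes.

```
WSADDR EQU  >8300           * assumed workspace address
* Send the address LS byte without a SWPB by reading
* R0's low byte straight out of the workspace.
       MOVB @WSADDR+1,@VDPWA
       LI   R2,>0100        * some useful work here, instead of
*                             idling, also avoids VDP overrun
       MOVB R0,@VDPWA       * MS byte (MOVB uses R0's MS byte)
```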
+Lee Stewart Posted September 21, 2020 (edited)

On 9/20/2020 at 1:37 AM, Kchula-Rrit said:

* Restore caller's return address,
* then exit.
       MOV  *R14+,R11
       B    *R11

Oops! Never mind (see @apersson850's following post). Didn't think this one through. Trips me up every once in a while. Will again, I am sure.

You can save one instruction here with

* Return to caller...NOT!
       B    *R14+          ; <---oops! This would attempt to execute the top of the return stack...not what we want.

...lee
apersson850 Posted September 21, 2020 (edited)

He doesn't want to return to the stack; he wants to return to the address stored on the stack. It's an easy mistake to make.

What works is saving in a different register (provided you have one that's not in use), like MOV R11,R7, then just B *R7.
+FarmerPotato Posted September 21, 2020

4 hours ago, Lee Stewart said:
B *R14+ ; <---oops! This would attempt to execute the top of the return stack...not what we want.

This is not a 9900 instruction, but the idea you had is provided in the 99000's Branch Indirect instruction. Suppose R14 is your stack pointer:

* Pop stack, return to the address fetched from *R14
       BIND *R14+

The opposite is:

* Push stack, call routine
       BLSK R14,@VMADDR    * does a DECT R14, then pushes NEXT onto *R14
NEXT   ...

VMADDR ...
       BIND *R14+          * return

I first learned the concept from PDP-11 assembly, which allows "deferred" (double-indirect) addressing, with + or -, on every operand (it has 8 registers and 8 addressing modes). Jealous.
+TheBF Posted September 21, 2020

Machines with this instruction run indirect-threaded Forth rather well.
apersson850 Posted September 21, 2020

Yes, some processors for larger machines from that time had good instruction sets. I've done some assembly programming on the successor to the PDP-11, the VAX-11. A nice and orthogonal 32-bit CPU. Its subroutine call instruction CALLS makes BLWP seem rather simple in comparison...
Kchula-Rrit (Author) Posted September 23, 2020

Thanks to everyone for the suggestions. After looking at the code, it occurred to me that these routines do not call any others, so I rewrote them to remove the stack pushes/pops and to get rid of the address-write subroutine. VDPWD for VMPOKE is stored in R3, so I can change MOVB *R1+,@VDPWD to MOVB *R1+,*R3; I did the equivalent for VMPEEK. The changes cut the code size from 178 bytes to 114 bytes.

After typing up an Excel spreadsheet to make a table of instruction execution times, it occurred to me that I'm running code in 8-bit RAM, but the registers are in 16-bit RAM, the data is in 8-bit RAM, and (I think) the VDP is in 16-bit RAM. I sort of gave up on trying to calculate just how much faster the new routines would be. I can post the spreadsheet if anyone is interested.

Also, I would love an auto-decrement instruction (like MOV R8,*R14-). It would make stack pushes and loops a lot easier.

K-R.
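Going by the description above (no stack, VMADDR inlined, VDPWD kept in R3), the streamlined write routine might reduce to something like this. This is a reconstruction under those assumptions, not the actual rewritten code; it assumes R0-R3 are reserved for VDP I/O and that the count in R2 is known to be non-zero.

```
VMPOKE
* R0 = VDP RAM start address, R1 = CPU RAM start
* address, R2 = byte count (assumed non-zero).
       LI   R3,VDPWD       * data port address in a register
       ANDI R0,>3FFF
       ORI  R0,>4000       * set the write bit
       SWPB R0
       MOVB R0,@VDPWA      * LS byte of address
       SWPB R0
       MOVB R0,@VDPWA      * MS byte, with write bit
VWLP   MOVB *R1+,*R3       * CPU byte -> VDP data port
       DEC  R2
       JNE  VWLP
       B    *R11           * no inner BL, so R11 is intact
```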
+TheBF Posted September 23, 2020

5 hours ago, Kchula-Rrit said:
VDPWD for VMPOKE is stored in R3, so I can change MOVB *R1+,@VDPWD to MOVB *R1+,*R3... Also, I would love an auto-decrement instruction (like MOV R8,*R14-). It would make stack pushes and loops a lot easier.

You are having PDP-11 dreams again. I know a good shrink...

Sounds like you did all the good stuff. FYI: on block VDP R/W, I have measured using a register rather than the port address to be 12.9% faster on Classic99.
apersson850 Posted September 23, 2020

The VDP has its own timing. Access is as to memory, but the data port is only 8 bits wide on the VDP, so you get one byte at a time.

There can't be any auto-decrement, due to the opcode formats. There are two bits to define the general addressing mode, then four bits to give the register number:

Register
Register indirect
Register indirect with auto-increment
Address indexed

Direct address and address indexed are the same thing. If the index register number is zero, it's a direct address, which implies that you can't index via R0.
Kchula-Rrit (Author) Posted September 25, 2020

I could do without either address-indexed or direct-address (I think TI called it "symbolic" in the TMS9900 manual) if I could do MOV R11,*R14- for a stack push. On second thought, I would trade address-indexed for register indirect with auto-decrement, since I hardly ever use indexed [MOV @THIS(R8),R8] and I do a fair number of stack operations. I realize it's a moot point, but I can still dream...

Back to the original subject: I tried out my new routines and they work, at least in my tests.

K-R.
apersson850 Posted September 25, 2020 (edited)

Indexed is very valuable for frame-pointer-offset access, not only for indexing into arrays. Direct memory access is kind of the whole point of a memory-to-memory architecture. I'm of the opinion that even though I too would like a MOV R11,-*SP for stack push (note that it must be pre-decrement on push and post-increment on pop to work), TI still did the right prioritization here. But it's a moot point now, for sure.
Asmusr Posted September 25, 2020

4 hours ago, apersson850 said:
I'm of the opinion that even though I too would like a MOV R11,-*SP for stack push, TI still did the right prioritization here. But it's a moot point now, for sure.

I agree. But a dedicated stack pointer with push and pop, like the F18A GPU has, would have been very useful.
+FarmerPotato Posted September 25, 2020

On 9/23/2020 at 2:52 PM, apersson850 said:
There can't be any auto-decrement, due to the opcode formats. There are two bits to define the general addressing mode, then four bits to give the register number... Direct address and address indexed are the same thing. If the index register number is zero, it's a direct address, which implies that you can't index via R0.

apersson850 probably knows all this, but I thought I'd point it out: the PDP-11 and the 9900 are both 16-bit memory-to-memory architectures, and their instruction sets nearly line up. Each has instructions that fit into 16 bits, with the first 4 bits decoding to the general two-operand memory-to-memory instructions.

But the PDP-11 has 8 registers, so 3 bits for addressing mode and 3 bits for register number: 12 bits for two general operands. The 9900 has 16 registers, so 2 bits for addressing mode and 4 bits for register number: also 12 bits for two general operands. To get 4 more addressing modes, you'd have to give up 8 of the 16 registers. Since R11-R15 sometimes have special purposes, that is not a nice tradeoff. The PDP-11 ALWAYS has special-purpose registers: R6 is the stack pointer, R7 the program counter.

The PDP-11's 16-bit instruction word for MOV is a lot like the 9900's, and it is quite elegant in octal (3 bits per digit; 6 digits hold 16 to 18 bits). Yes, the PDP-11 favors octal. Two general operands:

(byte-flag) opcode Ts S Td D    (each field 3 bits, except the 1-bit byte flag)

MOVB has the byte flag set.

Octal    Hex
010000   >1000   MOV R0,R0
110501   >9141   MOVB R5,R1

See how the fields line up neatly, because an octal digit is 3 bits. The leading digit is just 1 bit of the 16-bit word.

In the 9900 the word is: opcode (4 bits, byte flag last), Td, D, Ts, S (Ts/Td are 2 bits, S/D are 4 bits; the destination comes first):

>C000   MOV R0,R0
>D04F   MOVB R15,R1

For a full comparison of the 9900's general two-operand instructions (op is the high 4 bits; +1 if Byte):

op   instruction
4    SZC
6    S     (S = Subtract)
8    C
A    A     (A = Add)
C    MOV
E    SOC
0    other opcodes are built on this
2    other opcodes are built on this, like COC

On the PDP-11, the 4-bit opcode equivalents are:

PDP-11      9900 version
1  MOV      MOV
2  CMP      C
3  BIT      COC   (but COC is not a general two-operand instruction)
4  BIC      SZC
5  BIS      SOC
6  ADD      A
7  SUB      S
+8 if byte

There is no equivalent of AB and SB, which would be at hex E and F; the opcodes there are decoded into further instructions.

Another big difference is that PDP-11 byte instructions operate on the lower byte! And, when the destination is a register, they are sign-extended.

So memory-to-memory architecture ends up looking quite similar across the two CPUs.