
Assembly on the 99/4A


matthew180

Recommended Posts

I don't believe in protecting a programmer from themselves.

I do. Especially when different programmers work on the same machine. We have machines with a lifetime of well over ten years, so the original programmer may no longer be with the company, and even if he is, he most probably doesn't remember what he did by the time you get the task of fixing or extending the code. Robust library routines are very handy then.


 

Here's my general VDP copy routine that works on any number of bytes. It unrolls the loop 8 times for all groups of 8 bytes and then deals with the remaining bytes in a one-byte loop. If you transfer fewer than 16 bytes it's probably more efficient (my guess) to use a standard VMBW routine. Note that I'm using the SWPB approach instead of the @R0LB approach because it's nice that a general routine works in any workspace, and the time to execute the setup code is insignificant compared to the time of the inner loop. But you can also see that I still have Matthew's comments in the code, because this thread is where I started to learn TMS9900 assembly. :)

*********************************************************************
*
* Fast CPU to VDP copy, replaces VMBW
*
* R0: Destination address
* R1: Source address
* R2: Number of bytes to copy
*
VDPCP  SWPB R0
       MOVB R0,@VDPWA                  ; Send low byte of VDP RAM write address
       SWPB R0
       ORI  R0,>4000                   ; Set the two MSbits to 01 for write
       MOVB R0,@VDPWA                  ; Send high byte of VDP RAM write address
       LI   R0,VDPWD
VDPCP0 MOV  R2,R3
       SRL  R3,3                       ; Number of groups of 8
       JEQ  VDPCP2
VDPCP1 MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       DEC  R3
       JNE  VDPCP1
       ANDI R2,>0007                   ; Isolate number of remaining bytes
       JEQ  VDPCP3
VDPCP2 MOVB *R1+,*R0
       DEC  R2
       JNE  VDPCP2
VDPCP3 B    *R11
*// VDPCP
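For readers following along in a higher-level language, the control flow of VDPCP (copy whole groups of eight, then handle the remainder one byte at a time) can be sketched in Python. This is purely illustrative; the lists standing in for VDP RAM and source memory, and the function name, are invented for the sketch.

```python
def vdp_copy(dest, src, count):
    """Model of VDPCP's grouping logic: groups of 8 first, then the remainder.

    'dest' is a list standing in for VDP RAM and 'src' a list of source
    bytes; both are stand-ins for the real hardware.
    """
    i = 0
    groups = count >> 3              # SRL R3,3: number of groups of 8
    for _ in range(groups):          # VDPCP1: the unrolled loop
        for _ in range(8):           # in assembly these are 8 separate MOVBs
            dest.append(src[i])
            i += 1
    for _ in range(count & 7):       # ANDI R2,>0007: remaining bytes (VDPCP2)
        dest.append(src[i])
        i += 1
    return dest
```

As in the assembly version, a count below 8 skips the unrolled loop entirely and falls through to the one-byte loop.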

Boy does that make me feel good! For XB256 I wrote my own VMBW routine with code that is virtually identical to the above. The only difference (other than using some different registers) is that it is accessed with BLWP instead of BL. So the setup is a little slower, but it preserves R0 to R2 in the calling program.

 

Here is a VSBW routine from XB256:

VSBW96 AI   R1,>6000                ; Add XB screen offset
VSBW   SWPB R0
       MOVB R0,@>8C02
       SWPB R0
       ORI  R0,>4000
       MOVB R0,@>8C02
       MOVB R1,@>8C00
       ANDI R0,>BFFF
       B    *R11
Which could become:
VSBW   MOVB @WKSP+1,@>8C02
       ORI  R0,>4000
       MOVB R0,@>8C02
       MOVB R1,@>8C00
       ANDI R0,>BFFF                ; Needed if you want R0 restored to its original value
       B    *R11
These are BL subroutines, not BLWP. This bit me more than once when I did a BLWP @VSBW and wondered why the program crashed! So remember: BL @VSBW, not BLWP @VSBW
Edited by senior_falcon

I have posted the XB ROM code many times now.

 

I take it you never looked at it?

The whole point of Rasmus's post was to show how to "unroll" a loop for speed and still be as versatile as the normal VMBW. I went through all 3 files of the XB ROMs (searching for MOVB *) and was unable to find any instance where a loop is unrolled. Would you please post the code you say is the same as Rasmus's code?


Here's my general VDP copy routine that works on any number of bytes. It unrolls the loop 8 times for all groups of 8 bytes and then deals with the remaining bytes in a one-byte loop. [...]

How much faster is this than putting the one-byte write in a loop? Have you measured it?

Edited by TheBF

How much faster is this than putting the one-byte write in a loop? Have you measured it?

 

From 8 bit RAM MOVB is 40 cycles, DEC is 14 and JNE is 14. So in a normal loop we can push a byte in 68 clock cycles or 735 bytes per frame. In an 8 times unrolled loop we can push a byte in (8*40+14+14)/8=44 clock cycles or 1,136 bytes per frame.
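Rasmus's arithmetic can be checked directly if you assume roughly 50,000 CPU cycles per 60 Hz frame (3 MHz / 60 — a figure not stated in the post) and round the per-byte cost of the unrolled loop up to a whole cycle, as the post does:

```python
import math

CYCLES_PER_FRAME = 3_000_000 // 60   # TMS9900 at 3 MHz, 60 Hz frame (assumed)
MOVB, DEC, JNE = 40, 14, 14          # cycles with code and workspace in 8-bit RAM

plain = MOVB + DEC + JNE                          # 68 cycles per byte
unrolled = math.ceil((8 * MOVB + DEC + JNE) / 8)  # 43.5, rounded up to 44

print(plain, CYCLES_PER_FRAME // plain)           # 68 735
print(unrolled, CYCLES_PER_FRAME // unrolled)     # 44 1136
```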

As a possibly interesting side point, the Geneve implements hardware-based video wait state generation inside its gate array. The TMS9995 is much faster, so programs that caused no overruns on the TI-99/4A could overrun the VDP on the Geneve, which would have broken compatibility. I don't know whether the same timing constraints apply to the V9938, though.

 

Some time ago I did some investigations, see http://www.ninerpedia.org/index.php?title=Geneve_video_wait_states


Ok. That's what I suspected. Not quite double speed, but a big improvement.

 

My concern has been squeezing as much functionality as I can into a tight space, trading off speed and size as best I can.

 

Great code, however. I am keeping it for another day.

 

 

 



The whole point of Rasmus's post was to show how to "unroll" a loop for speed and still be as versatile as the normal VMBW. I went through all 3 files of the XB ROMs (searching for MOVB *) and was unable to find any instance where a loop is unrolled. Would you please post the code you say is the same as Rasmus's code?

Took a second to find. Though not in the same order as your example, it does the same thing.

Uses Workspace address for MSB and LSB.

Does reference and use >8C02 and >8C00

Does use ORI R0,>4000

* Write R4 to VDP and write R1 to VDP
* Get value from caller
LN641E MOV  *R11+,R4       *# Save return address into R4
LN6420 MOV  *R4,R4         * Put address R4 into R4
LN6422 MOVB @LR4,*R15      *# LSB R4 to VDP address (workspace address)
LN6426 ORI  R4,>4000
LN642A MOVB R4,*R15        * MSB R4 to VDP address
LN642C JMP  LN642E         * NOP
*   (Write R1 to VDP)
LN642E MOVB R1,@VDPWD      *# MSB R1 to VDP write data
LN6432 B    *R11           * Return

The XB ROMs had several original programmers who did things in varied ways.


 

Try again, Rich. That code has absolutely nothing to do with unrolling a loop.

 

...lee

REALLY?

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as the space-time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler.

 

So are you talking compiler only? The definition says it can be done manually or by a compiler.

Edited by RXB

So are you talking Compiler only as the definition states manually or compiler.

 

Well—you found a definition. Good. Now, perhaps, you might reread Rasmus' posts #395 and #397 above to see how he manually unrolls the loop to get faster execution of multiple VRAM writes, something that your example XB-ROM code does not do.

 

...lee


 

Here's my general VDP copy routine that works on any number of bytes. It unrolls the loop 8 times for all groups of 8 bytes and then deals with the remaining bytes in a one-byte loop. [...]

 

Very nice! I saw the timing results in your later post. Would placing the first loop (with the 8 VDP writes) into scratchpad improve performance significantly? I need a really fast move routine but have limited scratchpad space, so I was contemplating either a 4- or 8-byte copy loop based on your routine, with the setup and the final byte-copy loop in the 32K space.


 

Very nice! I saw the timing results in your later post. Would placing the first loop (with the 8 VDP writes) into scratchpad improve performance significantly? I need a really fast move routine but have limited scratchpad space, so I was contemplating either a 4- or 8- byte copy loop based on your routine, with the setup and final byte copy loop in the 32K space.

 

From scratch pad I think each instruction is 4 clock cycles faster, so MOVB is 32 instead of 36. This is provided your workspace is in scratch pad and the data you copy are in 8 bit RAM or ROM.


 

 

From 8 bit RAM MOVB is 40 cycles, DEC is 14 and JNE is 14. So in a normal loop we can push a byte in 68 clock cycles or 735 bytes per frame. In an 8 times unrolled loop we can push a byte in (8*40+14+14)/8=44 clock cycles or 1,136 bytes per frame.

 

I guess I need to learn more about instruction execution times, and in particular how much the wait states slow down a program. My 990 COMPUTER SYSTEMS HANDBOOK has a table giving instruction execution times for the 990/4, which uses the TMS 9900.

MOVB takes 14 clock cycles + 4 cycles for WR indirect + 6 cycles for WR indirect auto-increment = 24 clock cycles. The additional 16 clock cycles must be from wait states for memory access?

DEC takes 10 clock cycles (0 cycles for the address modification because it is a workspace register). The 4 additional clock cycles must be for the wait state?

JNE takes 10 clock cycles. The 4 additional clock cycles must be for the wait state?

How do you find out how many wait states are required by instructions?


You've got the basics there. Then every memory access not in scratchpad or ROM costs 4 cycles. The instruction table should tell you how many memory accesses occur for each instruction - reads take one, and most writes take two (due to a read-before-write). Reading the opcode itself is also a memory access, of course.

Edited by Tursi

Correct. And you have to figure out yourself which of the memory accesses are slower (8-bit) and which are faster (16-bit). Thus you need to know the hardware architecture of the 99/4A to get it right.

Look at MOV R0,R1. You have four memory accesses here.

  1. Fetch instruction.
  2. Fetch source data
  3. Fetch destination data
  4. Write destination data

If the instruction and the workspace both are in the internal RAM in the console, address >8300 - >83FF, then that's it. You have the basic timing of the instruction there, 14 cycles. Since all memory accesses require two cycles, which is the minimum, there are no additional delays.

But it's only the small RAM in the console and the ROM chips in the console that are on a 16-bit wide bus with no additional wait states. When the CPU accesses the memory in the expansion box, that memory is on a bus that's only 8 bits wide. To make this possible, there is extra circuitry in the console which splits each 16-bit memory access into two 8-bit accesses. The circuit then puts the two parts together and presents them to the CPU as one 16-bit word. Hence the two-cycle memory access becomes a four-cycle one. But it doesn't end there: each of these 8-bit access cycles also has one extra wait state. So from a memory point of view, out in the PEB, we're talking about 8-bit access with one wait state. But from the CPU's point of view, it looks like each 16-bit access is slowed down by four wait states.

 

So if we look at the MOV instruction again, and pretend that both the workspace and the code are in the 32K RAM in the PEB, then suddenly all four memory accesses occur in slow RAM. Thus you must add 16 wasted cycles on top of the 14 that the instruction itself uses.

 

But even if the code and workspace are in fast RAM, the instruction MOV *R0,R1 adds four cycles, since the CPU must first fetch R0, then the address R0 is pointing at. Now if that address is in slow RAM, you need to add another four cycles for that access. If the instruction instead is MOV R0,*R1, and R1 is pointing at slow RAM, then the indirect fetch of the destination and the store there each add four cycles of wait states. If you autoincrement, the CPU must also write the register back after reading it, and if the register is in slow RAM that's even worse.
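The rules above reduce to a simple cost model: take the instruction's base time and add four cycles for every memory access that lands in 8-bit expansion RAM. A sketch (the function name and the example figures' breakdown are mine; the 4-cycles-per-slow-access rule and the four accesses of MOV come from the explanation above):

```python
SLOW_PENALTY = 4   # extra cycles per 16-bit access routed through the 8-bit bus

def cycles(base, slow_accesses):
    """Base instruction time plus 4 cycles per memory access in slow RAM."""
    return base + SLOW_PENALTY * slow_accesses

# MOV R0,R1: four accesses (opcode fetch, read R0, read R1, write R1)
print(cycles(14, 0))   # code and workspace in scratchpad: 14
print(cycles(14, 4))   # code and workspace in 32K PEB RAM: 14 + 16 = 30
```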


Okay, the last thing we need to cover on the VDP is the eight write-only registers. Here they are:

<<deleted ASCII images >>
REG 0: M3

REG 1: M1

REG 1: M2

 

These three bits control the video mode, which can be Graphics I, Graphics II, Multicolor, and Text mode. The mode is set like this:

M1 M2 M3
0  0  0   Graphics I mode
0  0  1   Graphics II mode
0  1  0   Multicolor mode
1  0  0   Text mode

Why they used 3 bits to represent 4 modes I have no idea; they could have used just 2 bits. Maybe there were originally going to be more than 4 modes...
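The mode table can be written as a lookup. The bit positions used below (M3 as bit >02 of VR0, M1 and M2 as bits >10 and >08 of VR1) follow my reading of the TMS9918 data sheet; treat them as an assumption and verify against your own reference.

```python
# (M1, M2, M3) -> video mode, per the table above
VDP_MODES = {
    (0, 0, 0): "Graphics I",
    (0, 0, 1): "Graphics II",
    (0, 1, 0): "Multicolor",
    (1, 0, 0): "Text",
}

def mode_from_registers(reg0, reg1):
    """Decode the mode bits from VDP registers 0 and 1.

    Assumed layout: VR0 bit >02 is M3; VR1 bits >10 and >08 are M1 and M2.
    """
    m3 = (reg0 >> 1) & 1
    m1 = (reg1 >> 4) & 1
    m2 = (reg1 >> 3) & 1
    return VDP_MODES.get((m1, m2, m3), "undefined")
```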

 

I wanted to know specifics behind these modes (before moving along) so I referenced the E/A Manual: 21.2 Graphics Mode, 21.3 Multicolor Mode, 21.4 Text Mode, 21.5 Bit Map Mode

 

Q1: What on earth did they have in mind for Multicolor Mode?

 

Q2: Is it safe to assume Multicolor Mode is mostly avoided for game programming?

 

Q3: Is Graphics Mode simply the same as what's available when coding in Extended BASIC?

 

Q4: Is it ever advantageous to swap modes in the same program: Graphics/Text, Bit Map/Text, Graphics/Multicolor? Any examples in the wild?

 

Q5: Is Bit Map mode simply Graphics mode with background/foreground color information available for the 8 individual character rows vs per entire character, and no Auto-Sprites?

 

Q6: Are Auto-Sprites only a good deal for XB programs?

 

 


 

I wanted to know specifics behind these modes (before moving along) so I referenced the E/A Manual: 21.2 Graphics Mode, 21.3 Multicolor Mode, 21.4 Text Mode, 21.5 Bit Map Mode [...]

 

Q1: I think the main goal of this mode was support for the "low-res" modes that many other computers had at the time. The Apple II for example had a low-res graphic mode that was VERY similar.

 

Q2: Yes. Partly because the screen is ENTIRELY in low-res, which makes displays difficult. The other reason is the mode is notoriously weird in its set-up to use. The only game I'm aware of that uses this mode is "Dragon", a side-scroller.

 

Q3: Yes. Extended BASIC has only the one mode available, unless you use assembly routines or some other tricks.

 

Q4: It tends to be more useful for utility programs than games; TI-Artist uses multicolor mode for magnification, Paint-N-Print uses regular Graphics mode for it, etc. I know of no games that do this, but it's possible you could write a text/graphic adventure where your main display is text but if you want to "look" it may show a graphic representation of what you see.

 

Q5: It's a bit more complicated than that... bitmap mode has also been called "Graphics Mode II" for a reason, it's basically the same mode but with more. It expands the character table from 256 to 768 characters, and alters the color table from set designations of 8 characters to each individual character having 8 bytes of color data. Since the screen table can only address a single byte to a position, it has to divide the screen into thirds to determine which of the 3 character sets to draw from.
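That screen-thirds addressing can be sketched in Python. The function name and the pattern-table-at-zero default are assumptions for illustration; the underlying idea is from the answer above: the name byte alone can't address 768 patterns, so the third of the screen the position falls in selects a bank of 256.

```python
def bitmap_pattern_address(screen_pos, pattern_table_base=0):
    """VRAM address of the 8-byte pattern for a screen position in bitmap
    (Graphics II) mode. The 768 positions split into three banks of 256;
    the name byte selects a pattern within that third's bank.
    """
    bank = screen_pos // 256        # 0, 1 or 2: top, middle, bottom third
    name = screen_pos & 0xFF        # the byte stored in the screen table
    return pattern_table_base + (bank * 256 + name) * 8

print(bitmap_pattern_address(0))    # 0: first pattern of the first bank
print(bitmap_pattern_address(256))  # 2048: first pattern of the second bank
```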

 

The auto-sprite motion doesn't function correctly because the hard-wired space in VDP RAM that the ISR routine uses is in use by the pattern table. This can be easily circumvented by rolling your own routine; you could even keep motion vectors in CPU RAM instead of storing them in VDP, which is far more efficient anyway.

 

Q6: Extended BASIC definitely benefits from it, most other BASIC platforms would have a hard time moving so many sprites at once! That said, in assembly land, auto-motion is less valuable, since you can move things yourself much more conveniently, and can achieve more advanced operations like having sprites move in circles, parabolic curves, etc.


TI Forth and fbForth (by inheritance) have two split modes that mix bit-mapped graphics and text modes. Split mode has the bottom third of the screen in Text mode and Split2 has the top sixth of the screen in Text mode.

 

I believe Rasmus, Tursi, Thierry et al. have explored other split modes.

 

...lee

