
Assembly on the 99/4A


matthew180

Recommended Posts

I don't believe in protecting a programmer from themselves.

I do. Especially when different programmers work on the same machine. We have machines with a lifetime of well over ten years, so the original programmer may no longer be with the company, and even if he is, he most probably doesn't remember what he did by the time you get the task of fixing or extending the code. Robust library routines are very handy then.


 

Here's my general VDP copy routine that works on any number of bytes. It unrolls the loop 8 times for all groups of 8 bytes and then deals with the remaining bytes in a one-byte loop. If you transfer fewer than 16 bytes it's probably more efficient (my guess) to use a standard VMBW routine. Note that I'm using the SWPB approach instead of the @R0LB approach because it's nice that a general routine works in any workspace, and the time to execute the setup code is insignificant compared to the time of the inner loop. But you can also see that I still have Matthew's comments in the code, because this thread is where I started to learn TMS9900 assembly. :)

*********************************************************************
*
* Fast CPU to VDP copy, replaces VMBW
*
* R0: Destination address
* R1: Source address
* R2: Number of bytes to copy
*
VDPCP  SWPB R0
       MOVB R0,@VDPWA                  ; Send low byte of VDP RAM write address
       SWPB R0
       ORI  R0,>4000                   ; Set the two MSbits to 01 for write
       MOVB R0,@VDPWA                  ; Send high byte of VDP RAM write address
       LI   R0,VDPWD
VDPCP0 MOV  R2,R3
       SRL  R3,3                       ; Number of groups of 8
       JEQ  VDPCP2
VDPCP1 MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       DEC  R3
       JNE  VDPCP1
       ANDI R2,>0007                   ; Isolate number of remaining bytes
       JEQ  VDPCP3
VDPCP2 MOVB *R1+,*R0
       DEC  R2
       JNE  VDPCP2
VDPCP3 B    *R11
*// VDPCP
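For readers following along in a higher-level language, the control flow of VDPCP (copy whole groups of eight, then handle the remainder one byte at a time) can be sketched in Python. This is purely illustrative; the lists standing in for VDP RAM and source memory, and the function name, are invented for the sketch.

```python
def vdp_copy(dest, src, count):
    """Model of VDPCP's grouping logic: groups of 8 first, then the remainder.

    'dest' is a list standing in for VDP RAM and 'src' a list of source
    bytes; both are stand-ins for the real hardware.
    """
    i = 0
    groups = count >> 3              # SRL R3,3: number of groups of 8
    for _ in range(groups):          # VDPCP1: the unrolled loop
        for _ in range(8):           # in assembly these are 8 separate MOVBs
            dest.append(src[i])
            i += 1
    for _ in range(count & 7):       # ANDI R2,>0007: remaining bytes (VDPCP2)
        dest.append(src[i])
        i += 1
    return dest
```

As in the assembly version, a count below 8 skips the unrolled loop entirely and falls through to the one-byte loop.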

Boy does that make me feel good! For XB256 I wrote my own VMBW routine with code that is virtually identical to the above. The only difference (other than using some different registers) is that it is accessed with BLWP instead of BL. So the setup is a little slower, but it preserves R0 to R2 in the calling program.

 

Here is a VSBW routine from XB256:

VSBW96 AI   R1,>6000                ; Add XB screen offset
VSBW   SWPB R0
       MOVB R0,@>8C02
       SWPB R0
       ORI  R0,>4000
       MOVB R0,@>8C02
       MOVB R1,@>8C00
       ANDI R0,>BFFF
       B    *R11
Which could become:
VSBW   MOVB @WKSP+1,@>8C02
       ORI  R0,>4000
       MOVB R0,@>8C02
       MOVB R1,@>8C00
       ANDI R0,>BFFF                ; Needed if you want R0 restored to its original value
       B    *R11
These are BL subroutines, not BLWP. This bit me more than once when I did a BLWP @VSBW and wondered why the program crashed! So remember: BL @VSBW, not BLWP @VSBW
Edited by senior_falcon

I have posted the XB ROM code many times now.

 

I take it you never looked at it?

The whole point of Rasmus's post was to show how to "unroll" a loop for speed and still be as versatile as the normal VMBW. I went through all 3 files of the XB ROMs (searching for MOVB *) and was unable to find any instance where a loop is unrolled. Would you please post the code you say is the same as Rasmus's code?


Here's my general VDP copy routine that works on any number of bytes. It unrolls the loop 8 times for all groups of 8 bytes and then deals with the remaining bytes in a one-byte loop. [...]

How much faster is this than putting the one-byte write in a loop? Have you measured it?

Edited by TheBF

How much faster is this than putting the one-byte write in a loop? Have you measured it?

 

From 8 bit RAM MOVB is 40 cycles, DEC is 14 and JNE is 14. So in a normal loop we can push a byte in 68 clock cycles or 735 bytes per frame. In an 8 times unrolled loop we can push a byte in (8*40+14+14)/8=44 clock cycles or 1,136 bytes per frame.
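Rasmus's arithmetic can be checked directly if you assume roughly 50,000 CPU cycles per 60 Hz frame (3 MHz / 60 — a figure not stated in the post) and round the per-byte cost of the unrolled loop up to a whole cycle, as the post does:

```python
import math

CYCLES_PER_FRAME = 3_000_000 // 60   # TMS9900 at 3 MHz, 60 Hz frame (assumed)
MOVB, DEC, JNE = 40, 14, 14          # cycles with code and workspace in 8-bit RAM

plain = MOVB + DEC + JNE                          # 68 cycles per byte
unrolled = math.ceil((8 * MOVB + DEC + JNE) / 8)  # 43.5, rounded up to 44

print(plain, CYCLES_PER_FRAME // plain)           # 68 735
print(unrolled, CYCLES_PER_FRAME // unrolled)     # 44 1136
```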

As a possibly interesting side point, the Geneve implements hardware-based video wait state generation inside its gate array. The TMS9995 is much faster, so programs that caused no overruns on the TI-99/4A could overrun the VDP on the Geneve, which would have broken compatibility. I don't know whether the same timing constraints apply to the V9938, though.

 

Some time ago I did some investigations, see http://www.ninerpedia.org/index.php?title=Geneve_video_wait_states


Ok. That's what I suspected. Not quite double speed, but a big improvement.

 

My concern has been squeezing as much functionality as I can into a tight space, trading off speed and size as best I can.

 

Great code, however. I am keeping it for another day.

 

 

 



The whole point of Rasmus's post was to show how to "unroll" a loop for speed and still be as versatile as the normal VMBW. I went through all 3 files of the XB ROMs (searching for MOVB *) and was unable to find any instance where a loop is unrolled. Would you please post the code you say is the same as Rasmus's code?

Took a second to find. Though not in the same order as your example, it does the same thing.

Uses Workspace address for MSB and LSB.

Does reference and use >8C02 and >8C00

Does use ORI R0,>4000

* Write R4 to VDP and write R1 to VDP
* Get value from caller
LN641E MOV  *R11+,R4       *# Save return address into R4
LN6420 MOV  *R4,R4         * Put address R4 into R4
LN6422 MOVB @LR4,*R15      *# LSB R4 to VDP address (workspace address)
LN6426 ORI  R4,>4000
LN642A MOVB R4,*R15        * MSB R4 to VDP address
LN642C JMP  LN642E         * NOP
*   (Write R1 to VDP)
LN642E MOVB R1,@VDPWD      *# MSB R1 to VDP write data
LN6432 B    *R11           * Return

The XB ROMs had several original programmers who did things in varied ways.


 

Try again, Rich. That code has absolutely nothing to do with unrolling a loop.

 

...lee

REALLY?

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as the space-time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler.

 

So are you talking compiler only? The definition says it can be done manually or by a compiler.

Edited by RXB

So are you talking Compiler only as the definition states manually or compiler.

 

Well—you found a definition. Good. Now, perhaps, you might reread Rasmus' posts #395 and #397 above to see how he manually unrolls the loop to get faster execution of multiple VRAM writes, something that your example XB-ROM code does not do.

 

...lee


 

Here's my general VDP copy routine that works on any number of bytes. It unrolls the loop 8 times for all groups of 8 bytes and then deals with the remaining bytes in a one-byte loop. [...]

 

Very nice! I saw the timing results in your later post. Would placing the first loop (with the 8 VDP writes) into scratchpad improve performance significantly? I need a really fast move routine but have limited scratchpad space, so I was contemplating either a 4- or 8-byte copy loop based on your routine, with the setup and the final byte-copy loop in the 32K space.


 

Very nice! I saw the timing results in your later post. Would placing the first loop (with the 8 VDP writes) into scratchpad improve performance significantly? I need a really fast move routine but have limited scratchpad space, so I was contemplating either a 4- or 8- byte copy loop based on your routine, with the setup and final byte copy loop in the 32K space.

 

From scratch pad I think each instruction is 4 clock cycles faster, so MOVB is 32 instead of 36. This is provided your workspace is in scratch pad and the data you copy are in 8 bit RAM or ROM.


 

 

From 8 bit RAM MOVB is 40 cycles, DEC is 14 and JNE is 14. So in a normal loop we can push a byte in 68 clock cycles or 735 bytes per frame. In an 8 times unrolled loop we can push a byte in (8*40+14+14)/8=44 clock cycles or 1,136 bytes per frame.

 

I guess I need to learn more about instruction execution times, and in particular how much the wait states slow down a program. My 990 COMPUTER SYSTEMS HANDBOOK has a table giving instruction execution times for the 990/4, which uses the TMS 9900.

MOVB takes 14 clock cycles + 4 cycles for WR indirect + 6 cycles for WR indirect auto-increment = 24 clock cycles. The additional 16 clock cycles must be from wait states for memory access?

DEC takes 10 clock cycles (0 cycles for the address modification because it is a workspace register). The 4 additional clock cycles must be for the wait state?

JNE takes 10 clock cycles. The 4 additional clock cycles must be for the wait state?

How do you find out how many wait states are required by instructions?


You've got the basics there. Then every memory access not in scratchpad or ROM costs 4 cycles. The instruction table should tell you how many memory accesses occur for each instruction - reads take one, and most writes take two (due to a read-before-write). Reading the opcode itself is also a memory access, of course.

Edited by Tursi

Correct. And you have to figure out yourself which of the memory accesses are slower (8-bit) and which are faster (16-bit). Thus you need to know the hardware architecture of the 99/4A to get it right.

Look at MOV R0,R1. You have four memory accesses here.

  1. Fetch instruction.
  2. Fetch source data
  3. Fetch destination data
  4. Write destination data

If the instruction and the workspace both are in the internal RAM in the console, address >8300 - >83FF, then that's it. You have the basic timing of the instruction there, 14 cycles. Since all memory accesses require two cycles, which is the minimum, there are no additional delays.

But it's only the small RAM in the console and the ROM chips in the console that are on a 16-bit wide bus with no additional wait states. When the CPU accesses the memory in the expansion box, that memory is on a bus that's only 8 bits wide. To make this possible, there is extra circuitry in the console which splits each 16-bit memory access into two 8-bit accesses. The circuit then puts the two parts together and presents them to the CPU as one 16-bit word. Hence the two-cycle memory access becomes a four-cycle one. But it doesn't end there: each of these 8-bit access cycles also has one extra wait state. So from a memory point of view, out in the PEB, we're talking about 8-bit access with one wait state. But from the CPU's point of view, it looks like each 16-bit access is slowed down by four wait states.

 

So if we look at the MOV instruction again, and pretend that both the workspace and the code are in the 32K RAM in the PEB, then suddenly all four memory accesses occur in slow RAM. Thus you must add 16 wasted cycles on top of the 14 that the instruction itself uses.

 

But even if the code and workspace are in fast RAM, the instruction MOV *R0,R1 adds four cycles, since the CPU must first fetch R0, then the address R0 is pointing at. Now if that address is in slow RAM, you need to add another four cycles for that access. If the instruction instead is MOV R0,*R1, and R1 is pointing at slow RAM, then the indirect fetch of the destination and the store there each add four cycles of wait states. If you autoincrement, the CPU must also write the register back after reading it, and if the register is in slow RAM that's even worse.
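The rules above reduce to a simple cost model: take the instruction's base time and add four cycles for every memory access that lands in 8-bit expansion RAM. A sketch (the function name and the example figures' breakdown are mine; the 4-cycles-per-slow-access rule and the four accesses of MOV come from the explanation above):

```python
SLOW_PENALTY = 4   # extra cycles per 16-bit access routed through the 8-bit bus

def cycles(base, slow_accesses):
    """Base instruction time plus 4 cycles per memory access in slow RAM."""
    return base + SLOW_PENALTY * slow_accesses

# MOV R0,R1: four accesses (opcode fetch, read R0, read R1, write R1)
print(cycles(14, 0))   # code and workspace in scratchpad: 14
print(cycles(14, 4))   # code and workspace in 32K PEB RAM: 14 + 16 = 30
```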


Okay, the last thing we need to cover on the VDP is the eight write-only registers. Here they are:

<<deleted ASCII images >>
REG 0: M3

REG 1: M1

REG 1: M2

 

These three bits control the video mode, which can be Graphics I, Graphics II, Multicolor, and Text mode. The mode is set like this:

M1 M2 M3
0  0  0   Graphics I mode
0  0  1   Graphics II mode
0  1  0   Multicolor mode
1  0  0   Text mode

Why they used 3 bits to represent 4 modes I have no idea; they could have used just 2 bits. Maybe there were originally going to be more than 4 modes...
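The mode table can be written as a lookup. The bit positions used below (M3 as bit >02 of VR0, M1 and M2 as bits >10 and >08 of VR1) follow my reading of the TMS9918 data sheet; treat them as an assumption and verify against your own reference.

```python
# (M1, M2, M3) -> video mode, per the table above
VDP_MODES = {
    (0, 0, 0): "Graphics I",
    (0, 0, 1): "Graphics II",
    (0, 1, 0): "Multicolor",
    (1, 0, 0): "Text",
}

def mode_from_registers(reg0, reg1):
    """Decode the mode bits from VDP registers 0 and 1.

    Assumed layout: VR0 bit >02 is M3; VR1 bits >10 and >08 are M1 and M2.
    """
    m3 = (reg0 >> 1) & 1
    m1 = (reg1 >> 4) & 1
    m2 = (reg1 >> 3) & 1
    return VDP_MODES.get((m1, m2, m3), "undefined")
```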

 

I wanted to know specifics behind these modes (before moving along) so I referenced the E/A Manual: 21.2 Graphics Mode, 21.3 Multicolor Mode, 21.4 Text Mode, 21.5 Bit Map Mode

 

Q1: What on earth did they have in mind for Multicolor Mode?

 

Q2: Is it safe to assume Multicolor Mode is mostly avoided for game programming?

 

Q3: Is Graphics Mode simply the same as what's available when coding in Extended BASIC?

 

Q4: Is it ever advantageous to swap modes in the same program: Graphics/Text, Bit Map/Text, Graphics/Multicolor? Any examples in the wild?

 

Q5: Is Bit Map mode simply Graphics mode with background/foreground color information available for the 8 individual character rows vs per entire character, and no Auto-Sprites?

 

Q6: Are Auto-Sprites only a good deal for XB programs?

 

 


 

I wanted to know specifics behind these modes (before moving along) so I referenced the E/A Manual: 21.2 Graphics Mode, 21.3 Multicolor Mode, 21.4 Text Mode, 21.5 Bit Map Mode [...]

 

Q1: I think the main goal of this mode was support for the "low-res" modes that many other computers had at the time. The Apple II for example had a low-res graphic mode that was VERY similar.

 

Q2: Yes. Partly because the screen is ENTIRELY in low-res, which makes displays difficult. The other reason is the mode is notoriously weird in its set-up to use. The only game I'm aware of that uses this mode is "Dragon", a side-scroller.

 

Q3: Yes. Extended BASIC has only the one mode available, unless you use assembly routines or some other tricks.

 

Q4: It tends to be more useful for utility programs than games; TI-Artist uses multicolor mode for magnification, Paint-N-Print uses regular Graphics mode for it, etc. I know of no games that do this, but it's possible you could write a text/graphic adventure where your main display is text but if you want to "look" it may show a graphic representation of what you see.

 

Q5: It's a bit more complicated than that... bitmap mode has also been called "Graphics Mode II" for a reason, it's basically the same mode but with more. It expands the character table from 256 to 768 characters, and alters the color table from set designations of 8 characters to each individual character having 8 bytes of color data. Since the screen table can only address a single byte to a position, it has to divide the screen into thirds to determine which of the 3 character sets to draw from.
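That screen-thirds addressing can be sketched in Python. The function name and the pattern-table-at-zero default are assumptions for illustration; the underlying idea is from the answer above: the name byte alone can't address 768 patterns, so the third of the screen the position falls in selects a bank of 256.

```python
def bitmap_pattern_address(screen_pos, pattern_table_base=0):
    """VRAM address of the 8-byte pattern for a screen position in bitmap
    (Graphics II) mode. The 768 positions split into three banks of 256;
    the name byte selects a pattern within that third's bank.
    """
    bank = screen_pos // 256        # 0, 1 or 2: top, middle, bottom third
    name = screen_pos & 0xFF        # the byte stored in the screen table
    return pattern_table_base + (bank * 256 + name) * 8

print(bitmap_pattern_address(0))    # 0: first pattern of the first bank
print(bitmap_pattern_address(256))  # 2048: first pattern of the second bank
```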

 

The auto-sprite motion doesn't function correctly because the hard-wired space in VDP RAM that the ISR routine uses is in use by the pattern table. This can be easily circumvented by rolling your own routine; you could even keep motion vectors in CPU RAM instead of storing them in VDP, which is far more efficient anyway.

 

Q6: Extended BASIC definitely benefits from it, most other BASIC platforms would have a hard time moving so many sprites at once! That said, in assembly land, auto-motion is less valuable, since you can move things yourself much more conveniently, and can achieve more advanced operations like having sprites move in circles, parabolic curves, etc.


TI Forth and fbForth (by inheritance) have two split modes that mix bit-mapped graphics and text modes. Split mode has the bottom third of the screen in Text mode and Split2 has the top sixth of the screen in Text mode.

 

I believe Rasmus, Tursi, Thierry et al. have explored other split modes.

 

...lee

