Asmusr Posted June 16, 2017

"Hmm this looks exactly like code from the XB ROMs?"

Really? Where? They might have done something similar, it's not rocket science.
apersson850 Posted June 16, 2017

"I don't believe in protecting a programmer from themselves."

I do. Especially when they are different programmers working on the same machine. We have machines with lifetimes of well over ten years, so the original programmer may no longer be in the company, and if he is, he most probably doesn't remember what he did by the time you get the task of fixing or extending the code. Robust library routines are very handy then.
+TheBF Posted June 16, 2017

Where do the robust libraries come from for your business? In house?
senior_falcon Posted June 17, 2017 (edited)

Here's my general VDP copy routine that works on any number of bytes. It unrolls the loop 8 times for all groups of 8 bytes and then deals with the remaining bytes in a one-byte loop. If you transfer fewer than 16 bytes it's probably more efficient (my guess) to use a standard VMBW routine. Note that I'm using the SWPB approach instead of the @R0LB approach because it's nice that a general routine works in any workspace, and the time to execute the setup code is insignificant compared to the time of the inner loop. But you can also see that I still have Matthew's comments in the code, because this thread is where I started to learn TMS9900 assembly.

*********************************************************************
*
* Fast CPU to VDP copy, replaces VMBW
*
* R0: Destination address
* R1: Source address
* R2: Number of bytes to copy
*
VDPCP  SWPB R0
       MOVB R0,@VDPWA       ; Send low byte of VDP RAM write address
       SWPB R0
       ORI  R0,>4000        ; Set the two MSbits to 01 for write
       MOVB R0,@VDPWA       ; Send high byte of VDP RAM write address
       LI   R0,VDPWD
VDPCP0 MOV  R2,R3
       SRL  R3,3            ; Number of groups of 8
       JEQ  VDPCP2
VDPCP1 MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       MOVB *R1+,*R0
       DEC  R3
       JNE  VDPCP1
       ANDI R2,>0007        ; Isolate number of remaining bytes
       JEQ  VDPCP3
VDPCP2 MOVB *R1+,*R0
       DEC  R2
       JNE  VDPCP2
VDPCP3 B    *R11
*// VDPCP

Boy does that make me feel good! For XB256 I wrote my own VMBW routine with code that is virtually identical to the above code. The only difference (other than using some different registers) is that it is accessed with BLWP instead of BL. So the setup is a little slower, but it preserves R0 to R2 in the calling program.

Here is a VSBW routine from XB256:

VSBW96 AI   R1,>6000        ; add XB screen offset
VSBW   SWPB R0
       MOVB R0,@>8C02
       SWPB R0
       ORI  R0,>4000
       MOVB R0,@>8C02
       MOVB R1,@>8C00
       ANDI R0,>BFFF
       B    *R11

Which could become:

VSBW   MOVB @WKSP+1,@>8C02
       ORI  R0,>4000
       MOVB R0,@>8C02
       MOVB R1,@>8C00
       ANDI R0,>BFFF        ; needed if you want R0 restored to its original value
       B    *R11

These are BL subroutines, not BLWP. This bit me more than once when I did a BLWP @VSBW and wondered why the program crashed! So remember: BL @VSBW, not BLWP @VSBW.

Edited June 17, 2017 by senior_falcon
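The group-of-8 arithmetic in VDPCP (SRL R3,3 for the group count, ANDI R2,>0007 for the remainder) can be sketched in a few lines of Python. This is only a model of the arithmetic for readers following along, not the assembly itself:

```python
def split_copy(count):
    """Model VDPCP's loop structure: how many unrolled 8-byte passes
    and how many leftover single-byte copies a transfer needs."""
    groups = count >> 3      # SRL R3,3      -> number of 8-byte groups
    remainder = count & 7    # ANDI R2,>0007 -> bytes for the one-byte loop
    return groups, remainder

print(split_copy(768))  # a full 768-byte screen copy -> (96, 0)
print(split_copy(100))  # -> (12, 4)
```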
RXB Posted June 17, 2017

"Really? Where? They might have done something similar, it's not rocket science."

I have posted the XB ROM code many times now. I take it you never looked at it?

XBROM SOURCE.zip
senior_falcon Posted June 17, 2017

"I have posted the XB ROM code many times now. I take it you never looked at it?"

The whole point of Rasmus's post was to show how to "unroll" a loop for speed and still be as versatile as the normal VMBW. I went through all 3 files of the XB ROMs (searching for MOVB *) and was unable to find any instance where a loop is unrolled. Would you please post the code you say is the same as Rasmus's code?
apersson850 Posted June 17, 2017

"Where do the robust libraries come from for your business? In house?"

Usually. Sometimes from the supplier of some control equipment we may choose to use.
+TheBF Posted June 17, 2017 (edited)

"Here's my general VDP copy routine that works on any number of bytes. It unrolls the loop 8 times for all groups of 8 bytes and then deals with the remaining bytes in a one-byte loop. ..."

How much faster is this than putting the one-byte write line in a loop? Have you measured it?

Edited June 17, 2017 by TheBF
Asmusr Posted June 17, 2017

"How much faster is this than putting the one-byte write line in a loop? Have you measured it?"

From 8-bit RAM, MOVB is 40 cycles, DEC is 14 and JNE is 14. So in a normal loop we can push a byte in 68 clock cycles, or 735 bytes per frame. In an 8-times unrolled loop we can push a byte in (8*40+14+14)/8 = 44 clock cycles, or 1,136 bytes per frame.
+mizapf Posted June 17, 2017

As a possibly interesting side point, the Geneve implements hardware-based video wait state generation inside its gate array. The TMS9995 is much faster, so programs could overrun the VDP in ways that never happened on the TI-99/4A, which would have broken compatibility. I don't know whether the same timing constraints apply to the V9938, though. Some time ago I did some investigations; see http://www.ninerpedia.org/index.php?title=Geneve_video_wait_states
+TheBF Posted June 17, 2017

"From 8-bit RAM, MOVB is 40 cycles, DEC is 14 and JNE is 14. So in a normal loop we can push a byte in 68 clock cycles, or 735 bytes per frame. In an 8-times unrolled loop we can push a byte in (8*40+14+14)/8 = 44 clock cycles, or 1,136 bytes per frame."

OK, that's what I suspected. Not quite double the speed, but a much bigger throughput. My concern has been squashing as much functionality as I can into a tight space, trading off speed and size as best I can. Great code, however; I am keeping it for another day.
RXB Posted June 18, 2017

"The whole point of Rasmus's post was to show how to "unroll" a loop for speed and still be as versatile as the normal VMBW. ... Would you please post the code you say is the same as Rasmus's code?"

Took a second to find the same thing. Though not in the same order as your example, it does the same thing: it uses the workspace address for the MSB and LSB, it does reference and use >8C02 and >8C00, and it does use ORI R0,>4000.

* Write R4 to VDP and Write R1 to VDP; Get Value from caller
LN641E MOV  *R11+,R4   *# Save Return Address into R4
LN6420 MOV  *R4,R4     *  Put Address R4 into R4
LN6422 MOVB @LR4,*R15  *# LSB R4 to VDP Address (Workspace ADDRESS)
LN6426 ORI  R4,>4000
LN642A MOVB R4,*R15    *  MSB R4 to VDP Address
LN642C JMP  LN642E     *  NOP
* (Write R1 to VDP)
LN642E MOVB R1,@VDPWD  *# MSB R1 to VDP Write Data
LN6432 B    *R11       *  RETURN

The XB ROMs had several original programmers who do things in varied ways.
+Lee Stewart Posted June 18, 2017

"Took a second to find the same thing. Though not in the same order as your example, it does the same thing. ..."

Try again, Rich. That code has absolutely nothing to do with unrolling a loop.

...lee
RXB Posted June 18, 2017 (edited)

"Try again, Rich. That code has absolutely nothing to do with unrolling a loop."

REALLY? Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, an approach known as the space-time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler.

So are you talking compiler only, when the definition says manual or compiler?

Edited June 18, 2017 by RXB
+Lee Stewart Posted June 18, 2017

"REALLY? Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size... So are you talking compiler only, when the definition says manual or compiler?"

Well, you found a definition. Good. Now, perhaps, you might reread Rasmus' posts #395 and #397 above to see how he manually unrolls the loop to get faster execution of multiple VRAM writes, something that your example XB-ROM code does not do.

...lee
+InsaneMultitasker Posted June 18, 2017

"Here's my general VDP copy routine that works on any number of bytes. It unrolls the loop 8 times for all groups of 8 bytes and then deals with the remaining bytes in a one-byte loop. ..."

Very nice! I saw the timing results in your later post. Would placing the first loop (with the 8 VDP writes) into scratchpad improve performance significantly? I need a really fast move routine but have limited scratchpad space, so I was contemplating either a 4- or 8-byte copy loop based on your routine, with the setup and final byte copy loop in the 32K space.
Asmusr Posted June 18, 2017

"Would placing the first loop (with the 8 VDP writes) into scratchpad improve performance significantly?"

From scratchpad I think each instruction is 4 clock cycles faster, so MOVB is 32 instead of 36. This is provided your workspace is in scratchpad and the data you copy are in 8-bit RAM or ROM.
apersson850 Posted June 18, 2017

That makes sense, since then it's only the instruction fetch you win. And each memory access is two cycles in 16-bit memory and six in 8-bit memory.
senior_falcon Posted June 19, 2017

"From 8-bit RAM, MOVB is 40 cycles, DEC is 14 and JNE is 14. So in a normal loop we can push a byte in 68 clock cycles, or 735 bytes per frame. In an 8-times unrolled loop we can push a byte in (8*40+14+14)/8 = 44 clock cycles, or 1,136 bytes per frame."

I guess I need to learn more about the speed required for instructions to execute, and in particular how much the wait states slow down a program. In my 990 Computer Systems Handbook there is a table giving instruction execution times for the 990/4, which uses the TMS 9900:

MOVB takes 14 clock cycles + 4 cycles for WR indirect + 6 cycles for WR indirect auto-increment = 24 clock cycles. The additional 16 clock cycles must be from wait states for memory access?

DEC takes 10 clock cycles (0 cycles for the address modification because it is a workspace register). The 4 additional clock cycles must be for the wait states?

JNE takes 10 clock cycles. The 4 additional clock cycles must be for the wait states?

How do you find out how many wait states are required by instructions?
Tursi Posted June 19, 2017 (edited)

You've got the basics there. Every memory access not in scratchpad or ROM costs 4 extra cycles. The instruction table should tell you how many memory accesses occur for each instruction - reads take one, and most writes take two (due to a read-before-write). Reading the opcode itself is also a memory access, of course.

Edited June 19, 2017 by Tursi
apersson850 Posted June 19, 2017

Correct. And you have to figure out yourself which of the memory accesses are slower (8-bit) and which are faster (16-bit). Thus you need to know about the hardware architecture of the 99/4A to get it right.

Look at MOV R0,R1. You have four memory accesses here:

1. Fetch instruction.
2. Fetch source data.
3. Fetch destination data.
4. Write destination data.

If the instruction and the workspace are both in the internal RAM in the console, address >8300->83FF, then that's it. You have the basic timing of the instruction there, 14 cycles. Since all memory accesses require two cycles, which is the minimum, there are no additional delays.

But it's only the small RAM in the console and the ROM chips in the console that are on a 16-bit wide bus with no additional wait states. When the CPU accesses the memory in the expansion box, that memory is on a bus that's 8 bits wide. To make that possible, there is extra circuitry in the console which splits up the 16-bit memory access into two 8-bit accesses. The circuit then puts the two parts together and presents them to the CPU as one 16-bit word. Hence the two-cycle memory access becomes a four-cycle one.

But it doesn't end there. Each of these 8-bit memory access cycles also has one extra wait state. So from a memory point of view, out in the PEB, we're talking about 8-bit access with one wait state. But from the CPU's point of view, it looks like each 16-bit access is slowed down by four wait states.

So if we look at the MOV instruction again, and pretend that both workspace and code are in the 32K RAM in the PEB, then suddenly all four memory accesses occur in slow RAM. Thus you must add 16 wasted cycles, in addition to the 14 that the instruction itself uses.

But even if the code and WS are in fast RAM, the instruction MOV *R0,R1 adds four cycles, since the CPU must first fetch R0, then the address R0 is pointing at. Now if that address is in slow RAM, you need to add another four cycles for that access. If the instruction instead is MOV R0,*R1, and R1 is pointing at slow RAM, then both the indirect fetch of the destination and the store there add four cycles of wait states each. If you autoincrement, then you need to write to the register after reading it, and if the register is in slow RAM that's even worse.
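This accounting is simple enough to capture in a toy model. The numbers come from the post (14 base cycles for MOV R0,R1, four memory accesses, 4 penalty cycles per access that lands on the 8-bit expansion bus); the function name is just for illustration:

```python
def mov_r_r_cycles(slow_accesses):
    """Cycles for MOV R0,R1, given how many of its four memory accesses
    (opcode fetch, source read, destination read, destination write)
    hit the 8-bit expansion bus instead of 16-bit console memory."""
    BASE = 14     # base timing with all accesses in 16-bit, zero-wait memory
    PENALTY = 4   # extra cycles per access over the 8-bit expansion bus
    return BASE + PENALTY * slow_accesses

print(mov_r_r_cycles(0))  # code and workspace in console 16-bit RAM -> 14
print(mov_r_r_cycles(4))  # everything in the 32K expansion RAM      -> 30
```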
RXB Posted June 19, 2017

Anyone look at the XB ROM source and have suggestions for changes needed?
Airshack Posted June 20, 2017

"Okay, the last thing we need to cover on the VDP is the eight write-only registers. Here they are: <<deleted ASCII images>>

REG 0: M3
REG 1: M1
REG 1: M2

These three bits control the video mode, which can be Graphics I, Graphics II, Multicolor, and Text mode. The mode is set like this:

M1 M2 M3
0  0  0   Graphics I mode
0  0  1   Graphics II mode
0  1  0   Multicolor mode
1  0  0   Text mode

Why they used 3 bits to represent 4 modes I have no idea (they could have used just 2 bits; maybe there were originally going to be more than 4 modes...)"

I wanted to know the specifics behind these modes (before moving along), so I referenced the E/A Manual: 21.2 Graphics Mode, 21.3 Multicolor Mode, 21.4 Text Mode, 21.5 Bit Map Mode.

Q1: What on earth did they have in mind for Multicolor Mode?
Q2: Is it safe to assume Multicolor Mode is mostly avoided for game programming?
Q3: Is Graphics Mode simply the same as what's available when coding in Extended BASIC?
Q4: Is it ever advantageous to swap modes in the same program: Graphics/Text, Bit Map/Text, Graphics/Multicolor? Any examples in the wild?
Q5: Is Bit Map mode simply Graphics mode with background/foreground color information available for the 8 individual character rows vs. per entire character, and no auto-sprites?
Q6: Are auto-sprites only a good deal for XB programs?
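The mode table above is easy to restate as a lookup. This sketch only encodes the table for reference; it does not touch real VDP registers:

```python
# (M1, M2, M3) bit combinations from the mode table above.
MODES = {
    (0, 0, 0): "Graphics I",
    (0, 0, 1): "Graphics II",
    (0, 1, 0): "Multicolor",
    (1, 0, 0): "Text",
}

def mode_name(m1, m2, m3):
    """Return the video mode a given M1/M2/M3 combination selects."""
    return MODES.get((m1, m2, m3), "undefined combination")

print(mode_name(0, 0, 1))  # -> Graphics II
print(mode_name(1, 0, 0))  # -> Text
```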
+adamantyr Posted June 20, 2017

"I wanted to know the specifics behind these modes (before moving along), so I referenced the E/A Manual: 21.2 Graphics Mode, 21.3 Multicolor Mode, 21.4 Text Mode, 21.5 Bit Map Mode. ..."

Q1: I think the main goal of this mode was support for the "low-res" modes that many other computers had at the time. The Apple II, for example, had a low-res graphics mode that was VERY similar.

Q2: Yes. Partly because the screen is ENTIRELY in low-res, which makes displays difficult. The other reason is that the mode is notoriously weird in its set-up. The only game I'm aware of that uses this mode is "Dragon", a side-scroller.

Q3: Yes. Extended BASIC has only the one mode available, unless you use assembly routines or some other tricks.

Q4: It tends to be more useful for utility programs than games; TI-Artist uses multicolor mode for magnification, Paint-N-Print uses regular Graphics mode for it, etc. I know of no games that do this, but it's possible you could write a text/graphic adventure where your main display is text, but if you want to "look" it may show a graphic representation of what you see.

Q5: It's a bit more complicated than that... bitmap mode has also been called "Graphics Mode II" for a reason; it's basically the same mode but with more. It expands the character table from 256 to 768 characters, and alters the color table from set designations of 8 characters to each individual character having 8 bytes of color data. Since the screen table can only address a single byte per position, it has to divide the screen into thirds to determine which of the 3 character sets to draw from. The auto-sprite motion doesn't function correctly because the hard-wired space in VDP RAM that the ISR routine uses is in use by the pattern table. This can be easily circumvented by rolling your own routine; you could even keep motion vectors in CPU RAM instead of storing them in VDP, which is far more efficient anyway.

Q6: Extended BASIC definitely benefits from it; most other BASIC platforms would have a hard time moving so many sprites at once! That said, in assembly land, auto-motion is less valuable, since you can move things yourself much more conveniently and can achieve more advanced operations like having sprites move in circles, parabolic curves, etc.
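The "thirds" addressing described in Q5 can be sketched like this. It assumes the standard 32x24 Graphics II screen layout, and the helper name is hypothetical:

```python
def bitmap_pattern_index(row, screen_byte):
    """Effective pattern index in Graphics II (bitmap) mode: the screen
    byte selects within one of three 256-pattern sets, chosen by which
    third of the 24-row screen the position falls in."""
    third = row // 8                 # rows 0-7, 8-15, 16-23
    return third * 256 + screen_byte

print(bitmap_pattern_index(0, 65))   # top third:    pattern 65
print(bitmap_pattern_index(12, 65))  # middle third: pattern 321
```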
+Lee Stewart Posted June 20, 2017

TI Forth and fbForth (by inheritance) have two split modes that mix bit-mapped graphics and text modes. Split mode has the bottom third of the screen in Text mode, and Split2 has the top sixth of the screen in Text mode. I believe Rasmus, Tursi, Thierry et al. have explored other split modes.

...lee