First program run on the GPU/F18A with MLC

moulinaie · July 30, 2012

Hello,

I've modified MLC to compile programs for the GPU that can be found inside the marvellous F18A.

Here is my very first test of it.

The GPU fills the screen 26 times, starting with A and going on to Z. 26 full screens of 768 bytes each is 19,968 bytes, so close to 20,000 bytes written.

On the video you can see that if I run this once, you only see the last "Z" because it's too fast. And if I run it 10,000 times... just watch how long it takes to write those 200 MB:

http://www.youtube.com/watch?v=4E3sccArZRw
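As a quick check of the figures quoted in this post (26 screens of 24 x 32 characters per run, 10,000 runs), the arithmetic can be sketched in Python. The variable names are illustrative only and are not part of MLC.

```python
# Rough byte counts for the screen-fill test described in this post.
SCREEN_BYTES = 24 * 32          # one full text screen = 768 bytes
LETTERS = 26                    # A to Z, one full screen per letter

bytes_per_run = LETTERS * SCREEN_BYTES
bytes_for_10000_runs = 10_000 * bytes_per_run

print(bytes_per_run)            # 19968
print(bytes_for_10000_runs)     # 199680000
```

So one run writes just under 20,000 bytes, and 10,000 runs write just under 200 million bytes.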

Here is the source code for the program:

$MLC F 100 10 3000
300 CALL CLEAR
310 INPUT "Number of loops:":N
320 CALL LINK("TEST",N)::END
;
; here the MLC program that will run "n" times the GPU one
;
$TEST
GETPARAM 1 N		 ; number of loops
GPURUN &h4000		 ; first run of GPU routine with address
DO
 REPEAT			 ; little loop to wait for the GPU to finish
	 GPUSTATE
 UNTIL=			 ; state will be "=" when GPU has finished
 DEC N			 ; decrement my counter
WHILE<>
 GPUWAKE 		 ; if not finished, wakes up the GPU for another display !! (SEE NOTE BELOW)
LOOP
KEYWAIT 0			 ; wait for a key
$$
;
; here the GPU program in assembly code
; the compiler sends it to VDP location >4000, the new F18A area
; this example fills the screen with the capital letters from A to Z
;
$GPU
li r0,26			 ; 26 letters from A to Z
li r1,>A1A1			 ; start with double >A1 = "A" in XB

 ; letter loop
 clr r2			 ; screen address
 li r3,384		 ; 384 words = 768 bytes = 24*32 positions

	 ; fill loop
	 mov r1,*r2+	 ; write two letters on screen
	 dec r3		 ; decrement counter
	 jne -3		 ; if not finished, jump to fill loop

 ai r1,>0101		 ; next letter (>A2A2, >A3A3 etc...)
 dec r0			 ; decrement counter
 jne -10			 ; if not finished, jump to letter loop

idle				 ; back to idle state
$$
$END

NOTE:

Matthew, I have a problem with the wake function.

If I use it, the GPU program only runs 4 or 5 times and then stops working correctly.

If I replace GPUWAKE (writing 1 to register 56) with GPURUN &h4000, everything is fine.

I may have done something wrong, but it appears that the GPU is more stable if the routine address is reset each time.

Guillaume.

Edited July 30, 2012 by moulinaie

matthew180 · July 30, 2012

When you write a '1' to VR56 ("trigger only") the GPU runs from the *current* PC (program counter) location, *not* the location specified in VR54 and VR55. Those are only used to "set and trigger" the GPU.

Your routine appears to end with IDLE, which means the PC is pointing to the code in memory immediately following the IDLE instruction, and that is where execution will begin if you just write a '1' to VR56.

I suggest you make IDLE the first instruction in your GPU routine and set up the PC with a write to VR54 and VR55. That will trigger the GPU, but the first instruction is IDLE so it goes to sleep. Then you can trigger your routine with a single write to VR56. Your routine runs, then at the end jumps back to the IDLE at the top of the routine. That way when you trigger again the PC is at the correct location, i.e. the top of your routine.

      DEF MAIN
      AORG >3000
MAIN   IDLE
.
. Do something in the routine when triggered with a write
. of 1 to VR56.
.

*      Routine is done, go back to IDLE and wait to be triggered again.
      B    @MAIN
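
The difference between "set and trigger" and "trigger only" can be illustrated with a small toy model in Python. This is only a sketch of the behaviour described above: the `ToyGpu` class, its three pseudo-instructions and the list-based program memory are all invented for illustration and have nothing to do with real F18A code.

```python
# Toy model of the trigger behaviour described above; not F18A code.
class ToyGpu:
    def __init__(self, program):
        self.program = program      # list of (mnemonic, operand) tuples
        self.pc = 0
        self.runs = 0               # how many times "WORK" was executed

    def set_and_trigger(self, address):
        """Like writing VR54/VR55: load the PC, then run."""
        self.pc = address
        self._run()

    def trigger(self):
        """Like writing 1 to VR56: run from the *current* PC."""
        self._run()

    def _run(self):
        # The toy simply stops at the end of the list; real hardware would
        # go on executing whatever happens to be in memory there.
        while self.pc < len(self.program):
            op, arg = self.program[self.pc]
            self.pc += 1            # PC now points past this instruction
            if op == "IDLE":
                return              # sleep; PC stays just after IDLE
            if op == "WORK":
                self.runs += 1
            elif op == "B":
                self.pc = arg       # branch to an absolute address


# IDLE at the end: a bare trigger resumes *after* IDLE, not at the routine.
idle_last = ToyGpu([("WORK", None), ("IDLE", None)])
idle_last.set_and_trigger(0)
idle_last.trigger()
print(idle_last.runs)               # 1

# IDLE first, branch back at the end: every trigger runs the routine once.
idle_first = ToyGpu([("IDLE", None), ("WORK", None), ("B", 0)])
idle_first.set_and_trigger(0)       # executes IDLE only, then sleeps
for _ in range(3):
    idle_first.trigger()
print(idle_first.runs)              # 3
```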

Also, if your GPU routine is going to take a while, an efficient way to wait for it on the host-side (i.e. 99/4A assembly) is:

VDPRD  EQU  >8800             * VDP read data
VDPSTA EQU  >8802             * VDP status
VDPWD  EQU  >8C00             * VDP write data
VDPWA  EQU  >8C02             * VDP set read/write address
.
.
.
      LI   R0,>0F02          * Set the status port to read SR2
      BL   @VWTR
GWAIT  MOVB @VDPSTA,R1
      JLT  GWAIT             * MSbit is '1' while GPU is running, makes the byte a negative value
      ANDI R1,>FF00          * Mask the GPU's return status data (7-bits in the MSB)
      LI   R0,>0F00          * Set status port to read SR0
      BL   @VWTR
.
.
.

*********************************************************************
*
* VDP Write To Register
*
* R0 MSB    VDP register to write to
* R0 LSB    Value to write
*
VWTR   MOVB @R0LB,@VDPWA      * Send low byte (value) to write to VDP register
      ORI  R0,>8000          * Set up a VDP register write operation (10)
      MOVB R0,@VDPWA         * Send high byte (address) of VDP register
      ANDI R0,>3FFF          * Restore R0 top two MSbits
      B    *R11
*// VWTR

Of course you must have interrupts disabled in the loop since you are changing the status register to read from the default SR0 to SR2.
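
For readers who do not read 9900 assembly, here is a small Python sketch of what VWTR sends to the VDP address port: the value byte first, then the register number with the top bit set. The function name is made up for illustration, and the 0-63 register range is an assumption based on the 64 VDP registers mentioned elsewhere in this thread.

```python
def vdp_register_write_bytes(register, value):
    """Return the two bytes sent to the VDP address port, in order."""
    if not 0 <= register <= 63 or not 0 <= value <= 255:
        raise ValueError("register must be 0-63 and value 0-255")
    # Value byte first, then the register number with bit >80 set.
    return (value, 0x80 | register)

# LI R0,>0F02 followed by BL @VWTR: select status register 2 via VR15.
print([hex(b) for b in vdp_register_write_bytes(0x0F, 0x02)])  # ['0x2', '0x8f']
```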

moulinaie · July 31, 2012

I suggest you make IDLE the first instruction in your GPU routine and set up the PC with a write to VR54 and VR55. That will trigger the GPU, but the first instruction is IDLE so it goes to sleep. Then you can trigger your routine with a single write to VR56. Your routine runs, then at the end jumps back to the IDLE at the top of the routine. That way when you trigger again the PC is at the correct location, i.e. the top of your routine.

Okay, that's "elegant" ! I like it.

Something else, I'm working on the BML support.

I'm trying to figure out what's happening when the width of the bitmap is not on a byte boundary.

I created a 32x32 BML, so 32 pixels * 2 bits = 8 bytes per line.

Then I created a 30x32 BML, so 30 pixels * 2 bits = 7.5 bytes per line...

If I'm right, this uses 8 bytes per line, but only 7 bytes are used for display, so I only get a BML of 28x32 pixels.

Then, and this is funny, if I create a 28x32 BML, which is 7 bytes per line, it uses exactly 7 bytes per line but it looks like only 27 pixels are displayed; the last one is missing on each line...

So here is my question: are there limitations in the width and height of a BML (and also in the X and Y position) ?

Guillaume.

matthew180 · July 31, 2012

The boundaries are always on bytes. Here is the formula:

byte = (y * ((w + 3) / 4)) + (x / 4);

pix = x & 0x03;

or, more broken out:

wmul = (w + 3) >> 2 (the number "3" is because w + 3 == w - 1 + 4)

byte = (y * wmul) + (x >> 2)

pixel-pair index in byte = x & 0x03

So, a 10x4 BML will require 3-bytes per line (the last 4-bits in the 3rd byte will not be used):

x coord  0  1  2  3  4  5  6  7  8  9
       -------------------------------------
bits    01 23 45 67|01 23 45 67|01 23 45 67
       -------------------------------------
y = 0  |P0 P1 P2 P3|P0 P1 P2 P3|P0 P1|xx xx|
y = 1  |P0 P1 P2 P3|P0 P1 P2 P3|P0 P1|xx xx|
y = 2  |P0 P1 P2 P3|P0 P1 P2 P3|P0 P1|xx xx|
y = 3  |P0 P1 P2 P3|P0 P1 P2 P3|P0 P1|xx xx|

So, let's try the formula for a pixel at xy=6,2, which is at byte offset 7 from the base (counting bytes from 0) and has a pixel index of 2 (bits 4 and 5) in that byte. The pixel-pair indexes are:

P0 (bits 0 and 1)

P1 (bits 2 and 3)

P2 (bits 4 and 5)

P3 (bits 6 and 7)

w = 10

h = 4 (not used)

x = 6

y = 2

wmul = (w + 3) >> 2

wmul = (10 + 3) / 4

wmul = 13 / 4

wmul = 3 (this is all integer math, so you lose the fractional part)

byte = (y * wmul) + (x >> 2)

byte = (2 * 3) + (6 / 4)

byte = 6 + 1

byte = 7

pixel-pair index in byte = x & 0x03

pixel-pair = 6 AND 3

pixel-pair = "0110" AND "0011" (binary)

pixel-pair = "0010" (binary)

pixel-pair = 2

So, starting from the base address, add 7 to get the pixel's byte, then use pixel-pair 2 (bits 4 and 5).
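
The worked example above can be checked with a few lines of Python. This is just the formula from this post transcribed directly; the function name `bml_locate` is made up for illustration.

```python
def bml_locate(x, y, w):
    """Return (byte offset from the BML base, pixel index 0-3 in that byte)."""
    bytes_per_line = (w + 3) >> 2      # width in bytes, rounded up
    return y * bytes_per_line + (x >> 2), x & 0x03

print(bml_locate(6, 2, 10))            # (7, 2)
```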

The core execution part of the PIX instruction does all these calculations in a single clock, or 10ns. But for calculating the amount of memory your BML will take, always divide your width by 4 and round up, or add 3 to your width and then divide by 4 using integer math (which is what the formula above does). That is the number of bytes per line; then multiply by the height.

There is no limitation on the x,y location of the BML, and there is no limit on the width or height, other than 0 to 255, since each register is a byte.
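
The memory requirement follows directly from the rule above. A minimal Python sketch (the function name is illustrative only):

```python
def bml_size_bytes(w, h):
    """Bytes needed for a w x h BML at 2 bits per pixel, lines padded to bytes."""
    return ((w + 3) // 4) * h

for w, h in [(28, 30), (30, 32), (128, 128)]:
    print(w, h, bml_size_bytes(w, h))
# 28 30 210
# 30 32 256
# 128 128 4096
```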

So, a 28x30 BML would require:

w = 28

(w + 3) / 4

31 / 4 = 7 bytes per line (4 pixels per byte * 7 bytes = 28 pixels)

x  0 1 2 3 4 5 6 7|8 9 0 1 2 3 4 5|6 7 8 9 0 1 2 3|4 5 6 7
 |byte 0 |byte 1 |byte 2 |byte 3 |byte 4 |byte 5 |byte 6 |

And to continue past a boundary:

w = 29

(w + 3) / 4

32 / 4 = 8 bytes

x  0 1 2 3 4 5 6 7|8 9 0 1 2 3 4 5|6 7 8 9 0 1 2 3|4 5 6 7|8 x x x
 |byte 0 |byte 1 |byte 2 |byte 3 |byte 4 |byte 5 |byte 6 |byte 7 |

w = 30

(w + 3) / 4

33 / 4 = 8 bytes

x  0 1 2 3 4 5 6 7|8 9 0 1 2 3 4 5|6 7 8 9 0 1 2 3|4 5 6 7|8 9 x x
 |byte 0 |byte 1 |byte 2 |byte 3 |byte 4 |byte 5 |byte 6 |byte 7 |

w = 31

(w + 3) / 4

34 / 4 = 8 bytes

x  0 1 2 3 4 5 6 7|8 9 0 1 2 3 4 5|6 7 8 9 0 1 2 3|4 5 6 7|8 9 0 x
 |byte 0 |byte 1 |byte 2 |byte 3 |byte 4 |byte 5 |byte 6 |byte 7 |

w = 32

(w + 3) / 4

35 / 4 = 8 bytes

x  0 1 2 3 4 5 6 7|8 9 0 1 2 3 4 5|6 7 8 9 0 1 2 3|4 5 6 7|8 9 0 1
 |byte 0 |byte 1 |byte 2 |byte 3 |byte 4 |byte 5 |byte 6 |byte 7 |

Width:

0 + 3 = 3 / 4 = 0 bytes (width 0 will not display)

1 + 3 = 4 / 4 = 1 byte

2 + 3 = 5 / 4 = 1 byte

3 + 3 = 6 / 4 = 1 byte

4 + 3 = 7 / 4 = 1 byte

5 + 3 = 8 / 4 = 2 bytes

6 + 3 = 9 / 4 = 2 bytes

7 + 3 = 10 / 4 = 2 bytes

8 + 3 = 11 / 4 = 2 bytes

.

249 + 3 = 252 / 4 = 63 bytes

250 + 3 = 253 / 4 = 63 bytes

251 + 3 = 254 / 4 = 63 bytes

252 + 3 = 255 / 4 = 63 bytes

253 + 3 = 256 / 4 = 64 bytes

254 + 3 = 257 / 4 = 64 bytes

255 + 3 = 258 / 4 = 64 bytes

Remember that the xy coords are 0 to w-1 and 0 to h-1. Width 255 is kind of strange since you can't set the width to 256, but a width of 255 and an x coord of 255 will still display the pixel (at least it better, or else I have a bug).

Edited July 31, 2012 by matthew180

moulinaie · August 1, 2012

Hi Matthew,

I selected another palette than the first one and reduced one by one the width and it works.

I think that palette 0 confused me because the 4 colors are not so different. With palette 3, everything is clear.

I go back to MLC...

Guillaume;

moulinaie · August 2, 2012

Hi again,

I included in MLC a start of the BML support, for now you can set the BML size, address and flags and plot a pixel! That's the base.

I want to add something like "filled rectangle" and "line".

The little video shows the "plot" function in action.

A BML of 128x128 is created and "N" pixels are plotted, then the palette changes and "N" more pixels are plotted, etc., until the user presses a key.

http://www.youtube.com/watch?v=nKwVxZxPR5Q

Here is the MLC source code:

$MLC F 100 10 3000
300 INPUT "Pixels per run : ":N
310 CALL LINK("TEST",N)
320 GOTO 300
$TEST
GETPARAM 1 N			 ; how many pixels said the user
XREGENABLE				 ; access enabled to extended registers
STARTDATA
 byte &hE0			 ; BML def bloc, first byte is flags
 byte &h40			 ; then address (here 64*64 = 4096)
 bytes 64,32,128,128	 ; then x,y,w,h
ENDDATA E
LET M 4096				 ; BML address
FOR I 0 4095
 PUTTABLE M(I) 0		 ; clear all pixels
NEXT
BMLSET 1 E				 ; and display the BML
REPEAT
 NDO I N				 ; N times
	 RND				 ; RND always return in Z
	 AND Z 127		 ; mask to get 0-127
	 LET X Z			 ; X = 0-127
	 RND
	 AND Z 127
	 LET Y Z			 ; same Y = 0-127
	 RND
	 AND Z 3			 ; Z = 0-3, the color!
	 BMLPLOT X Y Z	 ; plot the pixel
 NLOOP
 RND
 AND Z 15			 ; Z=0-15 a palette number
 ADD Z &hE0			 ; + flags
 PUTTABLE E(0) Z		 ; new flags
 BMLSET 1 E			 ; set new parameters for BML
 KEY 0				 ; a key stroke?
UNTIL>=					 ; if >= then no, repeat!
PUTTABLE E(0) &h60		 ; else, change flags to "BML disabled"
BMLSET 1 E				 ; and set new parameters
$$
$END

I did some timing tests and it appears that 10,000 pixels are plotted in 13.8 seconds, not so bad knowing that three calls to RND are performed to get x, y and color for each pixel.

Something important: this plot routine is integrated in MLC and executed by the TMS9900; it does not use the PIX instruction that can be found in the GPU.

Why?

Two main reasons:

- not everyone knows enough assembler to use the GPU/PIX

- using such a function (PIX) assumes that the GPU is not already in use... which would limit some programming ideas..!!

So the user has the choice.
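
To give an idea of the work a software plot has to do, here is a Python sketch that sets one 2-bit pixel in a packed BML buffer using the byte/pixel formula given earlier in the thread. It is not the MLC routine (which is TMS9900 code and is not shown in this thread). The function name is hypothetical, and the bit placement assumes the TI convention where bit 0 is the most significant bit, so pixel 0 of a byte occupies the two top bits.

```python
def bml_plot(buffer, w, x, y, color):
    """Set pixel (x, y) to a 2-bit color in a packed BML buffer of width w."""
    bytes_per_line = (w + 3) >> 2
    offset = y * bytes_per_line + (x >> 2)
    shift = 6 - 2 * (x & 0x03)          # pixel 0 sits in the two top bits
    cleared = buffer[offset] & ~(0b11 << shift) & 0xFF
    buffer[offset] = cleared | ((color & 0b11) << shift)

bml = bytearray(((10 + 3) >> 2) * 4)    # the 10x4 example: 12 bytes
bml_plot(bml, 10, 6, 2, 3)
print(bml[7])                            # 12
```

Each plot is a read-modify-write of one byte, which is part of why a software plot costs far more than the single-clock PIX calculation.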

Guillaume.

Edited August 2, 2012 by moulinaie

rocky007 · August 2, 2012

Really great guillaume !

have you planned some move / scroll instructions ?

moulinaie · August 2, 2012

Really great guillaume !

have you planned some move / scroll instructions ?

Not yet !!!

I'll be happy with a LINE instruction..!

But who knows...

What about your game?

Guillaume.

rocky007 · August 2, 2012

i worked on it in july, i hope to finish it this month

Edited August 2, 2012 by rocky007

moulinaie · August 2, 2012

i worked on it in july, i hope to finish it this month

Don't wait..!! As MLC is growing, you may be short in memory in a few weeks... ;-)

Guillaume.

rocky007 · August 2, 2012

i'm already too short in memory i'm very impatient to use the F18A new functions

matthew180 · August 2, 2012

I'll be happy with a LINE instruction..!

I have most of a GPU line function written based on Michael Abrash's code from "Zen of Graphics Programming". It is a modified Bresenham's algorithm (plots segments instead of just pixels) with special cases for horz and vert lines. I was hoping to include it in the F18A's firmware but I just didn't have time to get all the extras I wanted to include (lines, circles, fills, etc.)
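
For reference, the plain pixel-by-pixel form of Bresenham's algorithm looks like the Python sketch below. This is the textbook integer version, not the segment-plotting variant mentioned above and not the unfinished GPU routine; the function name is illustrative.

```python
def bresenham(x0, y0, x1, y1):
    """Return the list of (x, y) points on a line, endpoints included."""
    points = []
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy                       # combined error term
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            return points
        e2 = 2 * err
        if e2 >= dy:                    # step in x
            err += dy
            x0 += sx
        if e2 <= dx:                    # step in y
            err += dx
            y0 += sy

print(bresenham(0, 0, 3, 0))            # [(0, 0), (1, 0), (2, 0), (3, 0)]
```

Only additions, comparisons and a doubling are needed per pixel, which is why the method suits CPUs without fast multiply or divide.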

No one has asked for any GPU code yet, but if you are interested then I'll finish it up and post it.

Also, remember that the F18A has two 32-bit random number generators. One is dedicated to the host-system interface, the other is private to the GPU. The GPU can also do your range modification, i.e. divide the random number to create the desired range.

lucien2 · August 2, 2012

i'm already too short in memory

Welcome to the club! (As we say in french)

moulinaie · August 3, 2012

Hi,

Here is the LINES example.

Two instructions for a line:

BMLPLOT x y c	 ; plots the first pixel
BMLDRAWTO x' y' c    ; draws to x' y'

With the source code:

$MLC F 100 10 3000
300 INPUT "How many runs : ":N
310 CALL LINK("TEST",N)
320 END
$TEST
   GETPARAM 1 N			    ; number of runs requested by the user
   XREGENABLE				  ; access enabled to extended registers
   STARTDATA
    byte &hE3			   ; BML def bloc, first byte is flags with palette 3
    byte &h40			   ; then address (here 64*64 = 4096)
    bytes 96,64,64,64	   ; then x,y,w,h
   ENDDATA E
   BMLSET 0 E				  ; set my BML without display
   BMLPLOT 0 0 0			   ; upper corner
   BMLFILLRECT 64 64		   ; clear all
   BMLSET 1 E				  ; and display the BML
   LET C 3					 ; color
   NDO I N
    LET Z 63
    FOR X 0 63
	    BMLPLOT X 0 C
	    BMLDRAWTO Z 63 C
	    DEC Z
    NEXT
    LET Z 62
    FOR Y 1 62
	    BMLPLOT 63 Y C
	    BMLDRAWTO 0 Z C
	    DEC Z
    NEXT
    DEC C
    AND C 3
   NLOOP
   PUTTABLE E(0) &h63		  ; else, change flags to "BML disabled"
   BMLSET 1 E				  ; and set new parameters
$$
$END

Guillaume.

moulinaie · August 3, 2012

Hello,

The FILL RECTANGLE example.

Two instructions to fill a rectangle:

BMLPLOT x y c	 ; plots the upper left corner
BMLFILLRECT w h  ; fills the rectangle

And the source code:

$MLC F 100 10 3000
300 INPUT "How many rectangles : ":N
310 CALL LINK("TEST",N)
320 END
$TEST
   GETPARAM 1 N			    ; how many rectangles said the user
   XREGENABLE				  ; access enabled to extended registers
   STARTDATA
    byte &hE3			   ; BML def bloc, first byte is flags with palette 3
    byte &h40			   ; then address (here 64*64 = 4096)
    bytes 64,32,128,128	 ; then x,y,w,h
   ENDDATA E
   BMLSET 0 E				  ; set my BML without display
   BMLPLOT 0 0 0
   BMLFILLRECT 128 128		 ; clear everything
   BMLSET 1 E				  ; and display the BML
   CLEAR C
   NDO I N
    RND
    AND Z 63
    LET X Z				 ; X upper corner from 0 to 63
    RND
    AND Z 63			    ; same for Y
    BMLPLOT X Z C		   ; plot upper corner
    RND
    AND Z 31
    ADD Z 32
    LET X Z				 ; width = 32 to 63
    RND
    AND Z 31
    ADD Z 32			    ; height from 32 to 63
    BMLFILLRECT X Z		 ; fill W,H from last plot
    INC C				   ; next color
    AND C 3				 ; always from 0 to 3
   NLOOP
   KEYWAIT 0
   PUTTABLE E(0) &h60		  ; else, change flags to "BML disabled"
   BMLSET 1 E				  ; and set new parameters
$$
$END

Guillaume.

moulinaie · August 3, 2012

I'll be happy with a LINE instruction..!

I have most of a GPU line function written based on Michael Abrash's code from "Zen of Graphics Programming". It is a modified Bresenham's algorithm (plots segments instead of just pixels) with special cases for horz and vert lines. I was hoping to include it in the F18A's firmware but I just didn't have time to get all the extras I wanted to include (lines, circles, fills, etc.)

No one has asked for any GPU code yet, but if you are interested then I'll finish it up and post it.

Also, remember that the F18A has two 32-bit random number generators. One is dedicated to the host-system interface, the other is private to the GPU. The GPU can also do your range modification, i.e. divide the random number to create the desired range.

Hello Matthew,

As you can see, I have added the PLOT, LINES and FILL instructions to MLC (not using the GPU).

This way, a user can plot without any GPU assembly knowledge. Or he can plot and reserve the GPU for his own use.

But of course, it would be great to have the GPU routines as well. They will be much faster!

I imagine something like a library installed at >4000 with a function dispatcher, which the user can easily call by filling a block of parameters. That would be easy to add to MLC too.

Guillaume.

Willsy · August 3, 2012

This is awesome!

So, the line drawing code is running on the 4A's 9900, not the GPU?

Would you mind sharing the assembly code to do the line drawing and rectangles? I'd like to add bitmap support to TurboForth in the future, and this would make my life much easier ;-)

moulinaie · August 3, 2012

This is awesome!

So, the line drawing code is running on the 4A's 9900, not the GPU?

Exactly!! The TMS9900 is already impressive...! What will it be like with the GPU..!

Would you mind sharing the assembly code to do the line drawing and rectangles? I'd like to add bitmap support to TurboForth in the future, and this would make my life much easier

This is not a problem, I'll be happy to share! But is it for working on the BML or for standard BITMAP mode of the TI? Because the encoding is really different.

Guillaume.

matthew180 · August 3, 2012

I imagine something like a library installed at >4000 with a function dispatcher, which the user can easily call by filling a block of parameters. That would be easy to add to MLC too.

Funny you should say that, because there are actually two functions installed at >4000 by default when the F18A is powered on. I didn't have time to write all the routines I wanted to include (lines, circles, sin, cos, etc.), but I did manage to get two in there:

1. Block copy

2. Load font

I did use a "dispatch" (or vector table) approach, but I forgot to leave room for user-defined functions, so there is only room for two vectors with the built-in code. I'm pretty mad at myself right now for not thinking about it and just leaving room for 16 vectors, or something like that.

This is the call interface code at >4000:

      LI   R15,>47FE         * Set the stack pointer to the bottom of the GRAM
MAIN
      IDLE                   * Start out idle since the GPU is triggered at power-on

*      Vector jump table.  Reads >3F00 for routine number.
      CLR  R1
      MOVB @>3F00,R1         * Load routine vector into R1
      SRL  R1,7              * Multiply by 2
      MOV  @VECTOR(R1),R0    * Get the address of the routine
      B    *R0               * Branch to routine
VECTOR
      DATA BLKCPY            * Block Copy
      DATA FONTLD            * Font Load

You can see there are only two, and BLKCPY starts right after the vector table. Heh, I suppose block copy could be used to move the code down and expand the vector table, and the two existing entries could be patched. That would be pretty simple. Or I should have put the vector table at the bottom.
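
The vector lookup in the listing above can be mimicked in Python to show how the routine number becomes a table offset. The function names and the list standing in for the vector table are illustrative only; the routine numbers 0 and 1 follow from the order of the DATA entries.

```python
def vector_offset(r1):
    """Mimic SRL R1,7: a routine number in the high byte becomes number * 2."""
    return (r1 & 0xFFFF) >> 7

VECTOR = ["BLKCPY", "FONTLD"]           # the two built-in entries, in table order

def dispatch(routine_number):
    r1 = (routine_number & 0xFF) << 8   # CLR R1, then MOVB @>3F00,R1
    offset = vector_offset(r1)          # byte offset into the vector table
    return VECTOR[offset // 2]          # each DATA entry is one 16-bit word

print(dispatch(0), dispatch(1))         # BLKCPY FONTLD
```

A routine number of 2 or more would index past the table, which is the "no room for user vectors" problem described above.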

moulinaie · August 3, 2012

Funny you should say that, because there are actually two functions installed at >4000 by default when the F18A is powered on. ...

I think you had to ship the ordered F18As without waiting. Lots of us were waiting for them.

Now that we have them, ideas can come, and someday an update could bring in the most interesting/useful ones to provide a solid library usable from MLC in the XB environment, Forth and assembler.

That common work would benefit everyone.

Something else....

You explained how to manage the 32-bit counter from a GPU program. I'd like to do it from MLC by reading registers 37 to 41.

Can you explain how the four bits of the counter control (R37) work?

Thanks,

Guillaume.

matthew180 · August 3, 2012

There are two 32-bit counters. One is dedicated to the GPU, the other is accessible by the host-system. The GPU can actually access both counters, but the way the GPU interfaces each counter is different.

VR37 has 4-bits to control the counter:

|0 1 2 3 |  4  |  5 | 6 | 7 |
|X X X X |RESET|LOAD|RUN|INC|

RESET and LOAD are only effective when you write to the register.

RESET: if this bit is set when you write to VR37, it will reset the counter to 0. This does *not* affect VR38 to VR41.

LOAD: if this bit is set when you write to VR37, the counter will be loaded with the values from VR38 to VR41.

RESET will override LOAD in the case where you have both set to '1' when you write to VR37. Both RESET and LOAD will clear themselves after you write to VR37, they are once-per-write indicators.

RUN is a switch that will cause the counter to free-run and increment every 10ns. You could use this to accurately time events or instruction loops. The counter will run from 0 to its max value in about 42.9497 seconds.

INC will cause the counter to increment by 1. INC, like RUN is also a "mode" of operation for the counter that comes in to play when reading the counter via status registers SR4 to SR7. Even though INC will remain a '1' after being written, the counter is only incremented when you write to VR37.
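
The write semantics of VR37 described above can be summarised in a small Python model. The bit masks follow from the layout shown (bit 0 is the MSB, so bit 7 is the value 1). Everything else is a toy: the function name is invented, and how INC interacts with RESET or LOAD in the same write is not stated in this thread, so the model only applies INC when neither of them is set (an assumption).

```python
RESET, LOAD, RUN, INC = 0x08, 0x04, 0x02, 0x01   # VR37 bits 4, 5, 6, 7

def write_vr37(counter, load_value, control):
    """Return (new_counter, stored_control) after one write to VR37."""
    if control & RESET:
        counter = 0                          # RESET overrides LOAD
    elif control & LOAD:
        counter = load_value & 0xFFFFFFFF    # value taken from VR38 to VR41
    elif control & INC:
        counter = (counter + 1) & 0xFFFFFFFF # assumption: INC alone, see above
    # RESET and LOAD clear themselves; RUN and INC stay set as modes.
    return counter, control & (RUN | INC)

print(write_vr37(123, 999, RESET | LOAD | RUN))  # (0, 2)
print(hex(RESET | RUN))                          # 0xa
```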

Some useful count values might be:

.  count  .  |  elapsed time
--------------+--------------
100 . . . . . |  1 microsecond
100,000 . . . |  1 millisecond
100,000,000 . |  1 second
1,666,666 . . |  16.6 milliseconds (60Hz)
3,333,300 . . |  33.3 milliseconds (30Hz)
255 . . . . . |  2.55 microseconds
65,535  . . . |  0.65535 milliseconds
16,777,216  . |  0.16777216 seconds
4,294,967,296 |  42.94967296 seconds
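
The table above is simply count x 10 ns. A short Python sketch for converting in both directions (the names are illustrative):

```python
TICK_SECONDS = 10e-9                    # one count every 10 ns

def ticks_to_seconds(ticks):
    return ticks * TICK_SECONDS

def seconds_to_ticks(seconds):
    return round(seconds / TICK_SECONDS)

print(seconds_to_ticks(1 / 60))         # 1666667
print(round(ticks_to_seconds(2**32), 4))  # 42.9497
```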

Reading the counter's four bytes is done via SR4 to SR7, which have a special feature to make getting the data easier. This method also works for the 32-bit RNG which uses SR8 to SR11.

By setting VR15 to one of the counter's four status registers (SR4 to SR7), reading the VDP status port will cause an automatic pause of the counter if it is running continuously (the RUN bit is set). The value for the specified byte (based on VR15) will be returned, and the status register to read in VR15 will automatically increment by one. This allows a consecutive read of all four bytes of the counter with only four status register reads (assuming you set VR15 to start with SR4).

After reading the LSB of the counter's value (SR7), the counter will automatically resume if the RUN bit is set, or auto increment if the INC bit is set, and the value in VR15 will reset back to SR4.

This auto-stop and resume (or increment) feature does not work for the GPU, only the host-interface via the status port. The GPU can read the counter's or RNG's values, but if the counter or RNG are free-running, the GPU would have to stop them first to get a single non-changing 32-bit value.
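
The four-read sequence can be sketched in Python as follows. `ToyStatusPort` is an invented stand-in for the status port that only models the auto-incrementing register selection and the SR4 = MSB to SR7 = LSB byte order; it does not model the automatic pause and resume of the running counter.

```python
class ToyStatusPort:
    """Toy model of reading the host counter through SR4 to SR7."""
    def __init__(self, counter):
        self.counter = counter & 0xFFFFFFFF
        self.selected = 4                 # as if VR15 had been set to SR4

    def read(self):
        index = self.selected - 4         # 0 = MSB (SR4) ... 3 = LSB (SR7)
        value = (self.counter >> (8 * (3 - index))) & 0xFF
        # After SR7 the selection wraps back to SR4.
        self.selected = 4 if self.selected == 7 else self.selected + 1
        return value

def read_counter(port):
    result = 0
    for _ in range(4):                    # four status reads, MSB first
        result = (result << 8) | port.read()
    return result

print(hex(read_counter(ToyStatusPort(0x12345678))))   # 0x12345678
```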

Edited August 3, 2012 by matthew180

moulinaie · August 3, 2012

There are two 32-bit counters. One is dedicated to the GPU, the other is accessible by the host-system. The GPU can actually access both counters, but the way the GPU interfaces each counter is different.

VR37 has 4-bits to control the counter:
|0 1 2 3 |  4  |  5 | 6 | 7 |
|X X X X |RESET|LOAD|RUN|INC|
RESET and LOAD are only effective when you write to the register.

RESET: if this bit is set when you write to VR37, it will reset the counter to 0. This does *not* affect VR38 to VR41.

LOAD: if this bit is set when you write to VR37, the counter will be loaded with the values from VR38 to VR41.

RESET will override LOAD in the case where you have both set to '1' when you write to VR37. Both RESET and LOAD will clear themselves after you write to VR37, they are once-per-write indicators.

RUN is a switch that will cause the counter to free-run and increment every 10ns. You could use this to accurately time events or instruction loops. The counter will run from 0 to its max value in about 42.9497 seconds.

Is it possible to write at once binary 00001010 to reset and run the counter? ie to start it from zero in one write?

Guillaume.

matthew180 · August 3, 2012

For the GPU, writing to VR38 to VR41 will set the values to be loaded to the counter, just like the host-system. However, when the GPU *reads* VR38 to VR41, the counter's (or RNG's) current value will be returned, and *not* the values in VR38 to VR41. The GPU can control the counter by writing to VR37 just like the host-system.

For the GPU, it is easier to use its dedicated memory-mapped counter.

ZERO   BYTE 0
ONE    BYTE 1
.
.
.
      MOVB @ZERO,@>8004      * Stop the counter
      CLR  @>8000            * Clear MSword (works because of the even memory address)
      CLR  @>8002            * Clear LSword
      MOVB @ONE,@>8004       * Free run

The GPU's counter and RNG work the same, only the base address is different:

32-bit counter
8xx0 - MSB
8xx1
8xx2
8xx3 - LSB
8xx4 - write >x1 = free run, >x0 = stop
8xx6 - write >x1 to single step

32-bit Linear Feedback Shift-Register (LFSR) Random Number Generator (RNG)
9xx0 - MSB
9xx1
9xx2
9xx3 - LSB
9xx4 - write >x1 = free run, >x0 = stop
9xx6 - write >x1 to single step
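
For readers unfamiliar with LFSR generators, here is a generic 32-bit example in Python. The tap mask 0x80200003 is a commonly tabulated choice for 32 bits and is purely an assumption here: the thread does not say which polynomial or structure the F18A uses. The range reduction uses a remainder, in the spirit of the "divide to create the desired range" remark earlier in the thread.

```python
def lfsr32_step(state, taps=0x80200003):
    """One step of a 32-bit Galois LFSR; state must be non-zero."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= taps                   # apply the feedback taps
    return state & 0xFFFFFFFF

def random_in_range(state, n):
    """Advance the LFSR once and reduce the new state to 0..n-1."""
    state = lfsr32_step(state)
    return state, state % n

state, value = random_in_range(1, 128)
print(hex(state), value)                # 0x80200003 3
```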

The GPU's memory map is:

VRAM 14-bit, 16K @ >0000 to >3FFF (0011 1111 1111 1111)
GRAM 11-bit, 2K  @ >4000 to >47FF (0100 x111 1111 1111)
PRAM  7-bit, 128 @ >5000 to >5x7F (0101 xxxx x111 1111)
VREG  6-bit, 64  @ >6000 to >6x3F (0110 xxxx xx11 1111)
current scanline @ >7000 to >7xx0 (0111 xxxx xxxx xxx0)
blanking . . . . @ >7001 to >7xx1 (0111 xxxx xxxx xxx1)
32-bit counter   @ >8000 to >8xx6 (1000 xxxx xxxx x110)
32-bit rng . . . @ >9000 to >9xx6 (1001 xxxx xxxx x110)
F18A version . . @ >A000 to >Axxx (1010 xxxx xxxx xxxx)
GPU status data  @ >B000 to >Bxxx (1011 xxxx xxxx xxxx)

VRAM = VDP RAM

GRAM = GPU only RAM

PRAM = Palette RAM - 16-bit access ONLY. Byte instructions WILL NOT update the palette registers

VREG = VDP Registers - 1 byte per register, unused registers will return a value of 0

GPU Status is write only and intended for the host-system to read via SR2.

The current scan line, blanking, and F18A version are read only.
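
The memory map above can be turned into a small address decoder. The Python sketch below reads the "x" bits in the table as "don't care" bits, so that, for example, >4800 would mirror >4000; that reading, the region names and the "unmapped" result for >C000 and above are assumptions, not statements from this thread.

```python
def decode_gpu_address(address):
    """Return (region name, offset within region) for a 16-bit GPU address."""
    address &= 0xFFFF
    top = address >> 12                  # top 4 bits select the region
    if top <= 0x3:
        return "VRAM", address & 0x3FFF
    table = {
        0x4: ("GRAM", 0x07FF),
        0x5: ("PRAM", 0x007F),
        0x6: ("VREG", 0x003F),
        0x7: ("scanline/blanking", 0x0001),
        0x8: ("counter", 0x0007),
        0x9: ("rng", 0x0007),
        0xA: ("version", 0x0000),
        0xB: ("status", 0x0000),
    }
    if top in table:
        name, mask = table[top]
        return name, address & mask      # keep only the decoded low bits
    return "unmapped", address & 0x0FFF  # assumption: not listed in the map

print(decode_gpu_address(0x6025))        # ('VREG', 37)
```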

matthew180 · August 3, 2012

Is it possible to write at once binary 00001010 to reset and run the counter? ie to start it from zero in one write?

Yes

*      Free run test
      LI   R0,>250A          * VR37, reset and run
      BL   @VWTR

moulinaie · August 5, 2012

Hi again,

I added the TIMER support in MLC.

So now we have a counter with 10 ns precision, which is great for optimizing the speed of short portions of assembly code.

The internal counter of the F18A is limited to 32 bits (as Matthew said, that's 43 sec).

I extended this with a "TIMER SUM" that accumulates the current value of the TIMER into a 48-bit zone, so now the limit is 2^47 * 10 ns = 1,407,374 sec = 391 hours. (Don't use the 48th bit, as this would lead to a sign problem that I didn't handle.)

So if you use this several times in your program:

"TIMER SUM RESET RUN"

this adds the current value to the sum, then resets the timer and goes on counting. That way you can get past the 43-second limit.
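
The accumulation idea can be sketched in Python as below. This is not the MLC implementation, only an illustration of summing 32-bit readings into a wider total with the 47-bit limit mentioned above; the class and method names are invented.

```python
class TimerSum:
    LIMIT = 1 << 47                      # stay below the 48th (sign) bit

    def __init__(self):
        self.total = 0

    def add(self, reading):
        """Add one 32-bit timer reading (in 10 ns ticks) to the running total."""
        self.total += reading & 0xFFFFFFFF
        if self.total >= self.LIMIT:
            raise OverflowError("sum no longer fits in 47 bits")
        return self.total

    def seconds(self):
        return self.total * 1e-8         # 10 ns per tick

timer = TimerSum()
timer.add(2**32 - 1)                     # one nearly full 32-bit reading
timer.add(2**32 - 1)                     # and a second one
print(round(timer.seconds(), 2))         # 85.9
print(round(TimerSum.LIMIT * 1e-8 / 3600, 1))   # 390.9
```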

Then, this value can easily be converted to a floating-point number and sent back to the calling XB program.

This is a short example that runs the timer and returns the time:

$MLC F 100 10 3000
300 INPUT "Ready : ":K$
305 CALL LINK("TEST",N)
310 PRINT N*1E-8;" seconds" ; 1E-8 converts 10 ns timer ticks into seconds.
320 GOTO 300
$TEST
XREGENABLE		 ; enable use of extra F18A registers
TIMER RESET RUN	 ; reset timer and runs it
KEYWAITNEW 0	 ; wait for a new key (to prevent from detecting the last "ENTER")
TIMER READ TOFLOAT ; read timer value and turns it to FLOAT in float register 0
PUTFLOAT 1(0)	 ; returns the float value to the first argument N
$$
$END

I think that this will be final version 1.30 for the Precompiler and MLC, I will update the manual and the ZIP on my page.

Guillaume.

Edited August 5, 2012 by moulinaie
