Bitmap mode.

JamesD · March 15, 2015

The multiply by .8 is an extra step that is required in the TI code but not in the Atari code. In addition, the TI has the additional overhead of writing its video via a port, which is on the upper 8 bits of the data bus only. That is additional overhead the Atari version doesn't have.

Again with the .8 extra step. If the TI assembly uses an extra step to multiply by .8 it's his choice.

This is NOT required with the proper algorithm.

Look at the BASIC code in post #186. Not one multiply by .8.

The extra multiply does not account for the difference in the image anyway.

+OLD CS1 · March 15, 2015

WTF just happened?

JamesD · March 15, 2015

WTF just happened?

The usual. Computer wars and bragging rights.

Tursi · March 16, 2015

I suspect it's throwing crumbs over the bridge, but here's a 320x192 version. Because the display does not have 320 horizontal pixels, the hardware wraps larger coordinates around. (The code is making no effort to adjust for that. Please read the code before you critique.)

As you can see, the scaling /does/ account for the differences in the image. Unless you can't, in which case I can't help you anyway.
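
For anyone wondering where the wrap-around comes from: it falls out of the byte-address arithmetic in the PLOTX routine of the listing that follows. Here is a rough Python sketch of that arithmetic, transcribed from the listing. The helper name plot_address is made up for this example and is not part of the program.

```python
# Sketch of the PLOTX address arithmetic, transcribed from the listing.
# plot_address is a hypothetical helper name, not part of the original program.

# One mask per pixel position within a byte (same values as the BITS table).
BITS = [0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01]


def plot_address(x, y):
    """Return (byte offset in the pattern table, bit mask) for pixel (x, y)."""
    tmp1 = ((y << 5) | y) & 0xFF07      # (y // 8) * 256 + (y % 8)
    bit = x & 7                         # pixel position within the byte
    tmp1 = (tmp1 + x - bit) & 0xFFFF    # add the column offset, (x // 8) * 8
    return tmp1, BITS[bit]


if __name__ == "__main__":
    print(plot_address(0, 0))
    print(plot_address(256, 0))
    print(plot_address(0, 8))
```

With this arithmetic, x = 256 on row 0 lands on the same byte and bit as x = 0 on row 8, so columns past 255 spill into the next character row instead of being clipped.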

* runs for 320x192, ignoring the physical size of the screen.
* because if we make it fit, we are accused of being unfair
* to the Atari 8-bit. And this is a big deal to some people.
* Sure a good thing we all do this for FUN.
*
* this version copies only sqrt and plot into scratchpad
*
* Fixed point of 3.13 in most cases
* Port from DMSC's Atari800XL code
* Port to TI-99/4A 9900 and additional opts by Tursi
* Assemble with something that likes labels > 6 chars
    DEF START
   
SQRT EQU >8330
PLOT EQU >8368
* Constants (INTEGER) - TI SX = 320, SY=192
cx      equ     160     sx / 2
cy      equ     90      sy * 15 / 32
fy      equ     56      sy * 7 / 24
* because we aren't allowed to draw to less than 320 X pixels,
* we don't need to scale anything, we can use inc/dec on X and Y both
* these constants replace the squaring with simple counting loops
initZt2 equ     >2000   TOFP(1.0)
deltaZt equ     >0002   TOFP(1.0/(64.0*64.0))
initAZt equ     >ff02   deltaZt - TOFP(2.0 / 64.0)
stepZt2 equ     >0004   2 * deltaZt
* The values of delta² are too small for 3.13 bits of fixed-point, so we
* use 8 more bits, for 3.21 bits of precision. This is not slow because
* we only use additions.
deltaXf equ     >0065   TOFP( 20.0*20.0 / (9.0*9.0*sx*sx) * 256 ) - the *256 adds 8 bits of zeros
initAXf equ     >0065   deltaXf
stepXf2 equ     >00CA   2 * deltaXf
* variables in registers
vdpadr  equ 15
zt2     EQU 14
azt     EQU 13
xs      EQU 12
RET     EQU 11      * for BL
ys      EQU 10
zi      EQU 9
x1      equ 8
x2      equ 7
axf     equ 6
xf2adr  equ 5
tmp     equ 4
z1      equ 3
y1      equ 2
r1      equ 1
r0      equ 0
* for plot
tmp1    equ 4
tmp2    equ 3
* this one is easier to store in memory
* but we'll stick in scratchpad after the regs
* note we reserve 4 to preserve alignment, but
* it is big endian, and left aligned (4th byte is unused)
* accessed as *xf2adr for first word, and @xf2+2 for second
xf2     EQU >8320   * 3 bytes long
START
        LWPI >8300
* frequently used VDP address  
        li vdpadr,>8c02
* backup scratchpad
  li r0,>8320
  li r1,scratch
  li r2,112
scrblp
  mov *r0+,*r1+
  dec r2
  jne scrblp
* copy square root to scratchpad
  li r0,SQRTX
  li r1,SQRT
sqcplp
  MOV *R0+,*R1+
  CI R0,ENDX
  JNE sqcplp
* set up graphics and sine table
        BL @BITMAP
        BL @initsine
       
* clear out the oldY table (entries of 192)
        li r0,oldY
        li r1,>C0C0
        li r2,160
rlp
        mov r1,*r0+
        dec r2
        jne rlp
* erase the pattern table
        LI R0,>4000     write address >0000
        CLR R1
        LI R2,>1800
        BL @VDPFILL
       
* set the color table to white on black
        LI R0,>6000     write address >2000
        LI R1,>F100
        LI R2,>1800
        BL @VDPFILL
* 24-bit variable - accessed by address
  li xf2adr,xf2
  
* init defaults
        li zt2,initZt2
        li azt,initAZt
        li xs,>003f  start centered
        li ys,>003f
        li zi,128
*        ; Outer for loop:
loopZi
*   x1 = xs + cx;      // SIGNED
        mov xs,x1
        ai x1,cx
*   x2 = x1;
        mov x1,x2
*   ** only xf2 needs 24 bits **
*   xf2 = 0;
        clr *xf2adr
        clr @xf2+2   * sorry for this awful syntax. No, not really.
   
*   axf = initAXf;
        li axf,initAXf
       
* inner for loop
loopXi
*   di = sqrti( (xf2>>8) + zt2 );
        mov *xf2adr,r0      * two MS bytes here, LSB in next byte, so no shift needed
        a zt2,r0
       
*   if( di >= 0x2000 ) break;   (inner loop - moved from the end, no visible difference but frees di)
* sqrt returns >2000 if the input is >= 0x2000, so we can just test here and save a step
        ci r0,>2000
        jhe lXiEnd
        bl @sqrt            * return in r0
       
* we need higher resolution numbers to combine the two multiplies - ratio is *768.1216,
* but the whole fraction is important, rounding is VERY visible.
*   tmp = (0x3244 * di) >> 13;  // * (PI/2) - 3.13 * 3.13 = 6.26, shift and truncate
        li tmp,>3244
        mpy tmp,r0          * output in r0,r1
        srl r1,13
        sla r0,3
        soc r1,r0           * merge the two words back to 3.13 (not saved to tmp yet)
*   tmp = (tmp*489)>>13;    * multiplier to go from fixed point max range (1.57) to 3/4 circle (768)
        li tmp,489
        mpy tmp,r0          * output in r0,r1
        srl r1,12           * not shifting all the way to get a multiply by 2 for the index
        sla r0,4
        soc r0,r1           * merge the two words into r1 as a table byte offset (index * 2)
*   z1 = sinetab[tmp];
        mov @sinetab(r1),z1
       
*   tmp = tmp + tmp + tmp;
        mov r1,tmp
        a r1,tmp
        a r1,tmp
       
*   z1 += sinefour[tmp&0x3ff];
        andi tmp,>07fe
*       a @sinefour(tmp),z1 * delayed till below
*   tmp =  ((z1 * fy) >> 13);   // SIGNED
        li r0,fy
        a @sinefour(tmp),z1 * moved from above so we can add and test in one step
        jlt iisneg
       
        mpy z1,r0
        srl r1,13
        sla r0,3
        soc r0,r1           * merge the two words back to 3.13 (not saved to tmp yet)
        jmp idone
iisneg     
        neg z1
        mpy z1,r0           * 3.13 x 16.0 = 19.13, LSW is already correct
        srl r1,13
        sla r0,3
        soc r0,r1           * merge the two words back to 3.13 (not saved to tmp yet)
        neg r1
idone
*   y1 = ys + cy - tmp;
        mov ys,y1
        ai y1,cy
        s r1,y1
*   if( oldY[x1] > y1 )
        c @oldY(x1),y1
        jl noplot1
       
*   oldY[x1] = y1;
        mov y1,@oldY(x1)
*   dc->SetPixel(x1, y1, RGB(0,0,0));
        mov x1,r1
        mov y1,r0
        bl @plot
       
noplot1
         
*   if( oldY[x2] > y1 )
        c @oldY(x2),y1
        jl noplot2
       
*   oldY[x2] = y1;
        mov y1,@oldY(x2)
   
*   dc->SetPixel(x2, y1, RGB(0,0,0));
        mov x2,r1
        mov y1,r0
        bl @plot
noplot2
* end of inner loop processing (normally after di, but it doesn't change)
* x1++, x2--, xf2 += axf, axf += stepXf2
        inc x1
        dec x2
   
* xf2 needs to be done bytewise, cause we have to split axf for it
        mov axf,r0
        swpb r0    * LSB in MSB position for LSB of xf2 (ab, no need to mask)
        ab r0,@xf2+2        * add the LSB
        jnc nocarry
        inc *xf2adr         * add in the carry to the MSW
nocarry
        andi r0,>00FF       * MSB in LSB position for MSW of xf2
        a r0,*xf2adr        * and add in the MSB
   
        ai axf,stepXf2
           
        jmp loopXi
lXiEnd
* outer loop end
* zt2 += azt, azt += stepZt2, --xs, --ys, zi-- (condition)
        a azt,zt2
        ai azt,stepZt2
        dec xs
        dec ys
       
        dec zi
        jne loopZi
*        ; End of program
end
* restore scratchpad
  li r0,scratch
  li r1,>8320
  li r2,112
scrrlp
  mov *r0+,*r1+
  dec r2
  jne scrrlp
waitlp
        LWPI >83E0          * GPLWS
        BL @>000E           * SCAN (so you can cancel screen blank)
        LIMI 2
        LIMI 0
        JMP waitlp
       
*************************************************************************************
* utility code
*************************************************************************************
* VDP access
* Write R2 bytes from R1 to VDP R0
* Destroys R0,R1,R2
VDPFILL
    SWPB R0
    MOVB R0,*vdpadr
    SWPB R0
    MOVB R0,*vdpadr
VMBWLP
    MOVB R1,@>8C00
    DEC R2
    JNE VMBWLP
    B *R11
  
* load regs list to VDP address, end on >0000 and write >D0 (for sprites)
* address of table in R1 (destroyed)
LOADRG
LOADLP
    MOV *R1+,R0
    JEQ LDRDN
    SWPB R0
    MOVB R0,*vdpadr
    SWPB R0
    MOVB R0,*vdpadr
    JMP LOADLP
LDRDN
    LI R1,>D000
    MOVB R1,@>8C00
    B *R11
* Setup for normal bitmap mode
BITMAP
    MOV R11,@SAVE
* set display and disable sprites
    LI R1,BMREGS
    BL @LOADRG
   
* set up SIT - We load the standard 0-255, 3 times
    LI R0,>5800
    SWPB R0
    MOVB R0,*vdpadr
    SWPB R0
    MOVB R0,*vdpadr
    LI R2,3
    CLR R1
LP#
    MOVB R1,@>8C00
    AI R1,>0100
    JNE LP#
    DEC R2
    JNE LP#
   
    MOV @SAVE,R11
    B *R11
* IN AND OUT IN R0
* fractions only > 0.999999 undefined
* adapted from dmsc's code
* R0 in = 3.13 signed fixed point
* Uses separate workspace - looks similar to following
* http://samples.sainsburysebooks.co.uk/9781483296692_sample_809121.pdf
* uses regs r0-r5 in new workspace
SQWP EQU >8324          we need some workspace, this preserves calling regs
SQRTX
    MOV R0,@SQWP
    LWPI SQWP           still have r0! (x)
   
    CLR r1              root (r)
    CLR r2              remHi (h) (r0 is remLo)
*   clr r4              (q) (doesn't need init, this line just for reference)
    SLA R0,3            lose the integer part
    LI r3,13            count = (7+FPSCALE/2) -> 7+6
SQRT0
    sla r1,1            r = r<<1;
    mov r1,r4           q = h + (0xFFFF ^ r);
    inv r4
    a r2,r4            
    jlt sqrt2           if( q >= 0 ) { r += 2; h = q; }
    inct r1
    mov r4,r2
sqrt2
   
    sla r2,2            h = (h << 2) | (x>>14);
    mov r0,r5
    srl r5,14
    soc r5,r2
    sla r0,2            x <<= 2;
   
    DEC r3              while (--count != 0);
    JNE SQRT0
   
    MOV r1,@>8300       return r;
    LWPI >8300
  
    B *R11
   
* INPUT R1,R0 - kills TMP1,TMP2 as well
PLOTX
* use the E/A routine for address
    MOV  R0,tmp1        R0 is the Y value.
    SLA  tmp1,5
    SOC  R0,tmp1
    ANDI tmp1,>FF07
    MOV  R1,tmp2        R1 is the X value.
    ANDI tmp2,7
    A    R1,tmp1        tmp1 is the byte offset.
    S    tmp2,tmp1      tmp2 is the bit offset.
   
* inline VDP!
    SWPB tmp1             set up read address
    MOVB tmp1,*vdpadr
    SWPB tmp1
    MOVB tmp1,*vdpadr
    ORI tmp1,>4000        we need this later, and provides a VDP delay
    MOVB @>8800,R1        read the byte from VDP
    SWPB tmp1             set up write address
    MOVB tmp1,*vdpadr
    SWPB tmp1
    MOVB tmp1,*vdpadr
    SOCB @BITS(tmp2),R1   or the bit and provide VDP delay
    MOVB R1,@>8C00        write the byte back
    B *R11
ENDX
* init the sine tables
* r1 - temp for reflected offset (0-510)
* r2 - add value
* r3 - current output value
* r4 - current change table entry
* r5 - table output offset (0-510)
* r6 - temp for negative output value
* r7,r8 - temp for x0.4 output
* r9 - loop counter
initSine
        mov r11,@SAVE       * need this to get home!
       
        li r2,54            * starting value
        clr r9
        clr r3
nextbyte
        clr r4
        movb @genTable(r9),r4
* we don't have a stack, easier to do it inline
        bl @genOne
        bl @genOne
        bl @genOne
        bl @genOne
        bl @genOne
        bl @genOne
        bl @genOne
        bl @genOne
* what we DO have is lots of registers 
        inc r9
        ci r9,32
        jne nextbyte
       
        li r3,>2000
        bl @genone
       
        mov @SAVE,r11
        B *R11
* set all four points on the curve, and load both tables
genOne
        li r1,512
        s r5,r1                 * reflection offset
        mov r3,r6
        neg r6                  * negative version
        mov r3,@sinetab(r5)
        mov r3,@sinetab+512(r1)
        mov r6,@sinetab+1024(r5)
        mov r6,@sinetab+1536(r1)
       
* make the *0.4 version (r3 is always positive here)
        li r6,>0ccd             * 0.4 in 3.13
        mov r3,r7
        mpy r6,r7
        srl r8,13               * shift fraction
        sla r7,3                * shift int
        soc r7,r8               * make 3.13
gdone
        mov r8,r6
        neg r6
        mov r8,@sinefour(r5)
        mov r8,@sinefour+512(r1)
        mov r6,@sinefour+1024(r5)
        mov r6,@sinefour+1536(r1)
       
        inct r5
        a r2,r3
*        ; Read bit, test if sum must be decreased
        sla r4,1
        jnc nodec
        dec r2
nodec
        B *R11
* bits for pixel
BITS
        DATA >8040,>2010,>0804,>0201
* registers for bitmap (and 5B00 is the address of the sprite table)
* background is transparent (the only color never redefined)
* PDT - >0000
* SIT - >1800
* SDT - >1800
* CT  - >2000
* SAL - >1B00
BMREGS  
        DATA >81E0,>8002,>8206,>83ff,>8403,>8536,>8603,>8700,>5B00,>0000
* data for sine generation
genTable
        data >f000,>0200,>0100,>1004,>0404,>0820,>8210,>4221
        data >0888,>4444,>4488,>8912,>2448,>9224,>8922,>4912
* BSS section
* spot to save return addresses
SAVE   
        bss 2
* spot to store the sine table (full 1024 entries)
sinetab
        bss 2048
* sine multiplied by 0.4, to remove a multiply inline
sinefour
        bss 2048
* row table for hidden surface (one word per column)
oldY   
        bss 640
scratch
  bss 256-32
        END
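
A note for readers following the listing above: the mpy / srl 13 / sla 3 / soc sequence that shows up several times is a 3.13 fixed-point multiply. Below is a minimal Python sketch of the same arithmetic, assuming unsigned 16-bit inputs. The names to_fp and mul_3_13 are invented for this example.

```python
FRAC_BITS = 13  # 3.13 fixed point: 1.0 is 0x2000


def to_fp(value):
    """Convert a non-negative float to 3.13 fixed point (like TOFP in the comments)."""
    return int(round(value * (1 << FRAC_BITS))) & 0xFFFF


def mul_3_13(a, b):
    """Multiply two unsigned 3.13 values, truncating the result back to 3.13."""
    product = (a & 0xFFFF) * (b & 0xFFFF)   # MPY leaves a 32-bit result in two words
    high = (product >> 16) & 0xFFFF
    low = product & 0xFFFF
    # srl low,13 / sla high,3 / soc: keep bits 13..28 of the 32-bit product
    return ((high << (16 - FRAC_BITS)) & 0xFFFF) | (low >> FRAC_BITS)


if __name__ == "__main__":
    # 0x3244 is the PI/2 constant used in the listing
    print(hex(mul_3_13(0x3244, to_fp(1.0))))
```

Negative values are not covered by this sketch. The listing handles them by negating the operand, doing the unsigned multiply, and negating the result.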

The only changes to the base code were the x-size defines. This pasted code uses a slightly different scratchpad helper (only plot and sqrt), but only sqrt made much difference in scratchpad. I also optimized the end of the loop somewhat; again, I noted no visible difference in execution time.
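
Since the square root is the piece that benefits from scratchpad, here is a Python transcription of the SQRTX loop from the listing, for anyone who wants to experiment off the TI. It is only a sketch: 16-bit registers are modelled by masking, and sqrt_3_13 is a name invented for this example.

```python
def sqrt_3_13(x):
    """Square root of the fractional part of a 3.13 value, result in 3.13."""
    x = (x << 3) & 0xFFFF                   # SLA R0,3: lose the integer part
    r = 0                                   # root (r)
    h = 0                                   # remHi (h)
    for _ in range(13):                     # count = 13
        r = (r << 1) & 0xFFFF               # r = r << 1
        q = (h + (0xFFFF ^ r)) & 0xFFFF     # q = h - r - 1, as a 16-bit word
        if q < 0x8000:                      # q >= 0 when read as a signed word
            r += 2
            h = q
        h = ((h << 2) | (x >> 14)) & 0xFFFF # h = (h << 2) | (x >> 14)
        x = (x << 2) & 0xFFFF               # x <<= 2
    return r


if __name__ == "__main__":
    # 0x0800 is 0.25 in 3.13; the square root should be 0.5
    print(hex(sqrt_3_13(0x0800)))
```

The result is always even, because the loop builds a 12-bit integer root and returns twice that value as the 3.13 answer.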

Runtimes:

Full 8-bit RAM - 24 s

Scratchpad assist - 20 s

16-bit RAM - 17 s (yeah, I still had to get one in there)

All versions - offline buffer saves 1 second, just like before.

So it appears that the extra pixels do make a difference, largely because I had discounted deltaXf, which is based on the screen width and is related to the inner loop limit. So now that all is right with the world and the 8-bitter is faster, everyone is happy again, and we can go back to playing, right?
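
To put a number on the deltaXf effect: xf2 grows by axf each pass and axf grows by stepXf2, so after n passes xf2 equals n*n*deltaXf, and the inner loop exits once the top 16 bits of xf2 plus zt2 reach >2000. The Python sketch below models just that exit test. The 320-wide deltaXf of >0065 comes from the listing; the 256-wide value is an assumption, obtained by plugging sx = 256 into the same formula. The names delta_xf and inner_steps are made up for this example.

```python
def delta_xf(sx):
    """TOFP(20*20 / (9*9*sx*sx) * 256) as a 3.13 integer, rounded to nearest."""
    return round(20.0 * 20.0 / (9.0 * 9.0 * sx * sx) * 256 * 8192)


def inner_steps(delta, zt2):
    """Count inner-loop passes before (xf2 >> 8) + zt2 reaches 0x2000."""
    xf2 = 0            # 24-bit accumulator, 3.21 fixed point
    axf = delta        # axf = initAXf
    steps = 0
    while (xf2 >> 8) + zt2 < 0x2000:
        steps += 1
        xf2 += axf             # xf2 += axf
        axf += 2 * delta       # axf += stepXf2
    return steps


if __name__ == "__main__":
    for width in (256, 320):
        d = delta_xf(width)
        print(width, hex(d), inner_steps(d, 0))
```

On the widest row (zt2 = 0) this gives 145 passes for deltaXf = >0065 against 116 for the assumed 256-wide value, so about 25% more inner-loop work on that row.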

JamesD · March 17, 2015

There are certainly 16 bit upgrades for the TI so I see no problem with the 16 bit version.

The curves on the front of the image still don't match that of other versions. If you can't see that you are in denial.

Is there an error in a table or just one additional optimization?

http://atariage.com/forums/topic/215138-bitmap-mode/?p=3183327

Edited March 17, 2015 by JamesD

Tursi · March 17, 2015

I'm not in denial, I just don't understand why you're so intent on turning something fun into a battle.

Certainly the last one I will participate in.

+OLD CS1 · March 17, 2015

I'm not in denial, I just don't understand why you're so intent on turning something fun into a battle.

Certainly the last one I will participate in.

Others appreciate your efforts. Grain of salt, and all that. I carry a small amount of guilt on the matter: I did mention giving the TI a competitive advantage over the Atari on the topic. As far as this one goes, I have read similar threads with back-and-forths about fairness of comparing platforms in demo compos, so I would expect it to go with the territory any time multiple platforms are involved.

JamesD · March 17, 2015

I'm not in denial, I just don't understand why you're so intent on turning something fun into a battle.

Certainly the last one I will participate in.

I'm pointing out a fact and you are calling it a battle? It's not a battle.

I'm not saying your work sucks. In fact, I think you did an awesome job.

I'm just saying it doesn't generate the same image so you can't compare it to another program that does generate the same image for bragging rights on speed. At least not without acknowledging the difference in the results.

If you compare two versions with floating point that generate the same image, I'm fine with that.

If you compare two versions with table lookups that generate the same image, I'm fine with that.

If you compare a version with lookup tables versus a floating point version that generate the same image, I'm fine with that as well.

But if you want to compare speed between the two groups when the results are clearly different, that's where I have a problem. Even on the same platform.

Hence my statement that you are comparing apples and oranges.

FWIW, someone over in the thread in the Atari area pointed out the assembly Atari version also generates a different image.

I'm guessing the Atari image someone posted that looked the same was probably from running an old version by accident.

Check out the comparison here:

http://manillismo.blogspot.com/2015/03/fedora-hat-diferencias.html

A pixel here and there is one thing, but that's pretty significant, and you'll see close to the same difference on the TI between float and non-float versions.

When it appeared as though the Atari generated the same image as the floating point version, I thought the difference on the TI version should be pointed out.

Since the Atari doesn't generate the same image as the original either, I'd say your assembly version is a fair comparison.

The TI and Atari assembly versions are close to the same size and speed even though the TI requires some overhead for the VDP.

Retrospect · March 18, 2015

The hat thing... there's a program for the Powertran Cortex computer that renders the hat. I don't know how long it takes, though, as I've not tried it (I don't have a Cortex and won't use the emulator).

The Powertran Cortex was a British kit computer. The idea was brought to us by three engineers at Texas Instruments, but they couldn't market it as a full computer.

Point is, it uses a 12 MHz 9900-family CPU, but as far as I know the VDP is somewhat cut down?

It would be interesting to see how long it takes to render compared to the 99 and Atari.

Stuart · March 18, 2015

Take a look at post #163.

Stuart · March 18, 2015

When you say "The idea was brought to us by ..." does that imply you worked for Powertran at the time?

Retrospect · March 18, 2015

When you say "The idea was brought to us by ..." does that imply you worked for Powertran at the time?

No, I meant *us* as in the public, the consumer... I have never worked for Powertran, unfortunately.

It was advertised, and maybe even sold in kit form, in an electronics magazine in 1982.

Edited March 18, 2015 by Retrospect

JamesD · March 18, 2015

Based on the comparison in post #163, the Powertran is about twice as fast as the TI and the breadboard system is over five times as fast as the TI.

If my math is correct (new TI time / x = old TI-99 time / other machine time), and using the optimization I added after that test run was made:

The Powertran should be able to run it in about 0:27:30.

The breadboard machine should be able to do it in around 0:09:25.

Those are just estimates, but they should be close unless you have to insert delays for the VDP.
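
For anyone who wants to redo the proportion with their own timings, it is a one-liner. Here is a small Python sketch; the function name scaled_estimate is invented for this example, and the numbers in the demo are placeholders for illustration only, not the figures from post #163.

```python
def scaled_estimate(new_ti_seconds, old_ti_seconds, other_seconds):
    """Solve new_ti / x = old_ti / other for x, the estimated time on the other machine."""
    return new_ti_seconds * other_seconds / old_ti_seconds


if __name__ == "__main__":
    # Placeholder timings, chosen only to show the arithmetic.
    print(scaled_estimate(new_ti_seconds=600, old_ti_seconds=1200, other_seconds=500))
```

This assumes the optimization speeds up every machine by the same ratio, which is why VDP wait states could throw the estimate off.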

*edit*

The assembly version is tough to estimate, but I would expect somewhere in the neighborhood of 4 seconds for the breadboard, give or take a second.

Edited March 18, 2015 by JamesD

+Vorticon · March 20, 2015

King of the hill. Enough said...

http://www.ebay.com/itm/1981-Commodore-CBM-8032-computer-photo-vintage-print-ad-/361246688768?&_trksid=p2056016.l4276

+OLD CS1 · March 20, 2015

I want to take that program and run it on a 128 in 80 column mode. Means I have to hook it up this weekend, I suppose.

Retrospect · March 20, 2015

King of the hill. Enough said...

http://www.ebay.com/itm/1981-Commodore-CBM-8032-computer-photo-vintage-print-ad-/361246688768?&_trksid=p2056016.l4276

Mmmmm... I see a hat on a PET... that was false advertising, surely?

I had no idea a PET could render anything other than a screen of text.

+Vorticon · March 20, 2015

I think this was an advertisement for a graphics board for the PET. If you look at the code listing, there are graphics commands not native to the PET. So technically it's cheating.

sometimes99er · March 20, 2015

http://oldcomputers.net/pet4032.html
Commodore also released the CBM 8032 at about the same time as the PET 4032. It is similar, but displays 80 characters per line of text.
