VBXE speed


64 replies to this topic

#1 Heaven/TQA ONLINE  

Heaven/TQA

    Quadrunner

  • 10,312 posts
  • Location:Baden-Württemberg, Germany

Posted Wed May 10, 2017 12:57 AM

There is hardly any information available (or it is well hidden), but how fast is the VBXE blitter? What is its clock in MHz, and how many cycles does it need per byte/pixel?

I cannot find any cycle timing information.

 



#2 phaeron OFFLINE  

phaeron

    River Patroller

  • 2,238 posts
  • Location:USA

Posted Wed May 10, 2017 1:50 AM

As I understand it: the blitter can use the full 14MHz bandwidth of VBXE local memory (8x Atari bus speed), but it has lowest priority relative to anything else including MEMAC. In general, it will fill at 1 cycle/byte, copy at 2 cycles/byte, and do read/modify/write or collision check operations at 3 cycles/byte. The blitter can skip cycles and run faster if the source is constant (AND mask = $00), if it's doing a RMW operation and the source byte is $00, or when repeating bytes with X zoom. Y zoom is not optimized and will re-read the source bytes.
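To put the per-byte figures above into numbers, here is a minimal Python sketch. The function names and mode labels are made up for illustration; it ignores BCB fetch time, DMA contention and the skip-cycle optimizations, so treat the results as a rough estimate only.

```python
# Rough cost model for the VBXE blitter, using only the figures quoted above.
# Names and structure are illustrative assumptions, not part of any VBXE API.

CYCLES_PER_BYTE = {
    "fill": 1,   # constant source (AND mask = $00)
    "copy": 2,   # read source, write destination
    "rmw": 3,    # read/modify/write or collision check
}

BLITTER_TO_CPU_RATIO = 8  # blitter clock is 8x the Atari bus clock


def blitter_cycles(width, height, mode):
    """Blitter cycles for a width x height blit, ignoring contention."""
    return width * height * CYCLES_PER_BYTE[mode]


def cpu_cycles(width, height, mode):
    """The same cost expressed in 6502 (Atari bus) cycles."""
    return blitter_cycles(width, height, mode) / BLITTER_TO_CPU_RATIO


if __name__ == "__main__":
    for mode in ("fill", "copy", "rmw"):
        print(mode, blitter_cycles(16, 16, mode), cpu_cycles(16, 16, mode))
```

Under these assumptions a 16x16 copy costs 512 blitter cycles, i.e. about 64 CPU cycles.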

 

The blitter can slow down to as low as quarter speed depending on the amount of DMA contention involved, particularly from the overlay. A 320x192 standard overlay, for instance, will consume 20-25% of total bandwidth. Running code out of MEMAC costs up to one-eighth of total bandwidth. Still, the blitter is fast enough to redraw the entire screen every frame if you keep overdraw low. Blitter lists can help you do this; it is feasible to have the blitter draw sprites with automatic background save/restore -- put your sprite position tables in one of the MEMAC windows and use the blitter to blit the positions into the save/restore blit lists. Similarly, you can emulate a tilemap by constructing a huge blit list that first blits from the tilemap into the source addresses of the rest of the blit list, which then copies one tile at a time.
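As a rough sanity check of the "redraw the entire screen every frame" claim, the following Python sketch estimates the blitter cycles available per frame. The 114 CPU cycles per scanline and the 312/262 line counts are the standard Atari 8-bit figures; the contention shares are the approximate values quoted above, and the function names are hypothetical.

```python
# Blitter cycles available per frame, and what is left after DMA contention.
# The contention fractions are rough estimates, not measured values.

CPU_CYCLES_PER_LINE = 114
LINES_PER_FRAME = {"PAL": 312, "NTSC": 262}
BLITTER_TO_CPU_RATIO = 8


def frame_budget(system, overlay_share=0.25, memac_code_share=0.0):
    """Blitter cycles per frame left after overlay and MEMAC contention."""
    total = CPU_CYCLES_PER_LINE * LINES_PER_FRAME[system] * BLITTER_TO_CPU_RATIO
    return total * (1.0 - overlay_share - memac_code_share)


def full_screen_passes(system, cycles_per_byte, screen_bytes=320 * 192, **shares):
    """How many times a full screen could be rewritten in one frame."""
    return frame_budget(system, **shares) / (screen_bytes * cycles_per_byte)


if __name__ == "__main__":
    for system in ("PAL", "NTSC"):
        print(system, frame_budget(system), full_screen_passes(system, 2))
```

With a 25% overlay share, a plain copy (2 cycles/byte) of a 320x192 screen fits between one and two times per frame on both systems, which matches the "keep overdraw low" advice.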



#3 Rybags OFFLINE  

Rybags

    Quadrunner

  • 15,113 posts
  • Location:Australia

Posted Wed May 10, 2017 2:33 AM

An additional consideration is that, as with the legacy hardware, NTSC has proportionally fewer cycles per frame available than PAL.

 

For squeezing every last cycle out though, the key things would be - keep CPU and Antic access to VRam to a minimum.  Consider screen architecture such as where attribute maps are needed or not, where narrow mode might be sufficient, where text mode might be sufficient.

 

Not sure if Antic refresh cycles can generate a wait state for VBXE. In theory it could just ignore them, since VRAM is static RAM and never needs refreshing.



#4 phaeron OFFLINE  

Posted Wed May 10, 2017 2:52 AM

I might be wrong, but I don't think refresh cycles count because they have no address to decode to a MEMAC window.

 

CPU accesses to VRAM are relatively cheap if you don't execute code from the window, as then it's likely to be <5% total local bandwidth. That's a fairly low cost to be able to do things like place MEMAC A at 0, which lets you context switch quickly and also store to VRAM at 3 cycles/byte. ANTIC, on the other hand, should just be switched off to let the CPU run faster.



#5 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 3:22 AM

I am running into the following issue: compared to my Lynx demos ;) I am not getting the same kind of speed when blitting polygon spans.

My render loop does 1 blit per span, meaning only 1 BCB, including a wait for the blitter to stop.

But I am still not satisfied with the speed...



#6 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 3:27 AM

So... I am not sure whether my waitblits kill the speed after starting the span blit.

$d400 = 0, so there are no DMA steals.

Would it help to have a BCB span list?



#7 Rybags OFFLINE  

Posted Wed May 10, 2017 3:41 AM

What's a spanlist?

 

BCBs just execute one after another sequentially until one with the "NEXT" bit cleared in its BCB finishes, which signifies the end of processing.

It is a bit annoying... what would have been nice is a skip command so you could leave objects defined but selectively not display them instead of having to modify the BCB so it doesn't render.

 

One solution I found is to use an initial blit or two which populates parameters within the string of BCBs. It's just way faster to do minimal CPU processing and have blits do much of the pre-processing, since the blitter moves data around so quickly.
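The idea of a blit that populates parameters in a string of BCBs can be modelled in a few lines of Python. This is a toy simulation, not hardware-accurate: VRAM is a bytearray, `blit_copy()` implements only a plain copy with X/Y steps and takes actual counts for width and height, the addresses are hypothetical, and the field offset 6 (low byte of the destination address) assumes the 21-byte BCB layout used elsewhere in this thread.

```python
# Toy model of one copy blit that patches a field in every BCB of a list.
# The table and list addresses below are hypothetical.

BCB_SIZE = 21        # bytes per blitter control block
DEST_ADDR_LO = 6     # offset of the destination address low byte in a BCB


def blit_copy(vram, src, src_step_y, src_step_x,
              dst, dst_step_y, dst_step_x, width, height):
    """Copy a width x height block; each row starts one Y step further on."""
    for row in range(height):
        s = src + row * src_step_y
        d = dst + row * dst_step_y
        for col in range(width):
            vram[d + col * dst_step_x] = vram[s + col * src_step_x]


def patch_x_positions(vram, table_addr, list_addr, count):
    """One blit: table[i] -> destination-address low byte of BCB number i."""
    blit_copy(vram, table_addr, 1, 1,
              list_addr + DEST_ADDR_LO, BCB_SIZE, 1,
              width=1, height=count)


if __name__ == "__main__":
    vram = bytearray(0x1000)
    vram[0x100:0x104] = bytes([10, 20, 30, 40])   # hypothetical X positions
    patch_x_positions(vram, 0x100, 0x300, 4)
    print([vram[0x300 + i * BCB_SIZE + DEST_ADDR_LO] for i in range(4)])
```

The point of the design is the destination Y step of 21: a 1-byte-wide, N-rows-high blit walks down the BCB list and drops one table entry into the same field of each block, so the CPU only has to maintain the compact table.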

 

Another timesaving thing - if you have a large object with a shape with lots of blank or common space, consider breaking it into smaller objects to save unnecessary blits. Using constant data mode has its cycle savings as well.



#8 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 3:50 AM

Think of an n-sided polygon (not a triangle, but the same applies there)...

On the CPU I calculate 2 buffers (left edge and right edge), with miny and maxy variables to track which scanlines are covered on screen.

Then I fill those spans, so kind of:

for y = miny to maxy-1
  set span_xpos in BCB to rightedge(y)
  set span_size_x in BCB to rightedge(y) - leftedge(y)
  blit span
  wait blit
next y

 

The BCB sets the blitter to copy mode 0, AND mask #0, XOR mask = span color, step x = -1.

 

 

So... one idea was to have, say, 200 BCBs (one per scanline, like unrolled code). The CPU sets the blitter start BCB based on miny (basically 21*y), sets all positions and sizes between miny and maxy, and clears the NEXT bit in the maxy BCB.
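A sketch of that unrolled BCB list in Python, to make the layout concrete. It follows the 21-byte field order of the single-span BCB used in this thread (destination base $010000, 256-byte scanlines, step x = -1). The position of the NEXT flag in the control byte (`NEXT_BIT = 0x08`) and the raw `right - left` width value are assumptions that should be checked against the VBXE documentation; the function names are made up.

```python
import struct

BCB_SIZE = 21
NEXT_BIT = 0x08          # assumed position of the NEXT flag in the control byte


def pack_span_bcb(y, left, right, color, is_last):
    """Pack one fill-span BCB, drawn from the right edge leftwards."""
    dest = 0x010000 + (y << 8) + right           # 256-byte scanlines: $01YYXX
    control = 0 if is_last else NEXT_BIT         # mode 0, chain unless last
    return (
        (0).to_bytes(3, "little")                # source address (unused for a fill)
        + struct.pack("<hb", 0, 0)               # source step y, source step x
        + dest.to_bytes(3, "little")             # destination address
        + struct.pack("<hb", 256, -1)            # dest step y, dest step x
        + struct.pack("<HB", right - left, 0)    # size x (needs right >= left), size y
        + bytes([0x00, color, 0, 0, 0, control]) # AND, XOR, coll. AND, zoom, pattern, control
    )


def build_span_list(ledge, redge, first, last, color, rows=200):
    """Return (list_bytes, start_offset) for spans first..last inclusive."""
    blob = bytearray(rows * BCB_SIZE)
    for y in range(first, last + 1):
        bcb = pack_span_bcb(y, ledge[y], redge[y], color, is_last=(y == last))
        blob[y * BCB_SIZE:(y + 1) * BCB_SIZE] = bcb
    return bytes(blob), first * BCB_SIZE


if __name__ == "__main__":
    blob, start = build_span_list([10] * 200, [50] * 200, 20, 29, color=4)
    print(len(blob), start)
```

The returned start offset is the 21*miny value mentioned above; only the last BCB in the range has its NEXT flag cleared, so a single blitter start runs the whole polygon.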

 

This is just an unproven idea of having one big polygon BCB list. That kind of thing helped in the Elements Lynx demo, as the CPU does not need to wait for each span blit to finish.


Edited by Heaven/TQA, Wed May 10, 2017 3:50 AM.


#9 Rybags OFFLINE  

Posted Wed May 10, 2017 3:58 AM

If possible set all the blits up first and run them in one go.

 

Running single blits, or groups of a few, and then having the CPU intervene, waiting for the IRQ or the flag before it starts the next lot, would be somewhat wasteful.

Also don't forget - for some stuff you can make use of the blit for normal Antic graphics.

That's what I did with Quadrillion - I initially converted the game with the graphics remapped from the cell to linear type but nasty bugs crept in and I had to start again.

So I went with the idea to just leave the rendering mostly alone, using the blit to convert the entire 8K bitmap from Plus4 mapping to Atari every frame.

 

If you can live with Antic graphics for certain stuff, then the blitter can potentially shift 4-8 times as many pixels.



#10 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 6:18 AM

so... here are some altirra screenshots

 

yellow color appears when I start blitter operation.

 

black when finished

 

 

 

vbxe_blitter_face_nowait_span.png

 

This draws 1 span without waiting for the blitter to finish.

 

vbxe_blitter_face_nowait.png

 

Same as the span version, but drawing 1 face.

 

vbxe_blitter_face.png

 

same as above but with wait

 

 

vbxe_blitter_face_nowait_200span.png

 

This one blits 200 spans in one blitter block list...

 

What makes me wonder: is that not supposed to be "fast"?

And why does it always start at nearly the same screen position? Is there any "align" or "sync" happening?

 

$d400 is 0, blitter is set to fill mode...

 

 

 

 

        lda #15
        sta $d01a       ; background color on: marks the start of the blit
        lda #1
        sta $D653       ; start blitter (draw span)
@       lda $D653       ; wait until not-busy
        bne @-
        sta $d01a       ; A = 0 here: background back to black



#11 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 6:19 AM

What makes me wonder: filling one span (one scanline, so to speak) from x1 to x2 takes more than half a scanline?



#12 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 6:25 AM

Filling 100 bytes would need 100 cycles in blitter fill mode.

100/8 = 12.5 (the blitter is 8x faster than the CPU)
So in my world the CPU would get control back after 12.5 cycles, and the Atari has 114 cycles per raster line, so the yellow bars should be much thinner???
 
Where is my misunderstanding?
 
I have not checked on real hardware yet, though.
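The arithmetic above, spelled out as a small Python sketch. It assumes no DMA contention and additionally assumes that the 21-byte BCB is fetched at 1 blitter cycle per byte before the fill starts; the function names are hypothetical.

```python
BLITTER_TO_CPU_RATIO = 8     # blitter clock vs. 6502 clock
CPU_CYCLES_PER_LINE = 114    # 6502 cycles per scanline
BCB_BYTES = 21               # assumption: fetched at 1 blitter cycle per byte


def fill_cpu_cycles(span_bytes):
    """CPU cycles the blitter stays busy for one fill span, without contention."""
    return (BCB_BYTES + span_bytes) / BLITTER_TO_CPU_RATIO


def scanline_fraction(span_bytes):
    """The same time as a fraction of one scanline."""
    return fill_cpu_cycles(span_bytes) / CPU_CYCLES_PER_LINE


if __name__ == "__main__":
    print(fill_cpu_cycles(100), scanline_fraction(100))
```

Under these assumptions a 100-byte fill keeps the blitter busy for roughly 15 CPU cycles, well under a fifth of a scanline, so a wait that covers half a scanline would have to come from something other than the raw fill cost.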


#13 Rybags OFFLINE  

Posted Wed May 10, 2017 6:33 AM

Are you doing lots of single blits?  I would think that's a big problem, especially considering some line draws are like 10 pixels wide.

 

Consider VBXE reads the BCB and starts executing it in less than 3 CPU cycles.  A 10 pixel line is another 2 cycles with some spare.  The overhead in setting up for individual blits, monitoring and starting the next one could potentially see the blitter spending more time idle than actually working.

 

 

Pretty good looking sequence BTW... another optimization you might try - in standard mode the scanlines are 320 bytes apart.  Depending on how you do your calculations, if you can spare some VRam, put the scanlines 512 bytes apart which for some graphical stuff can speed things up... fairly sure I did that in Moon Cresta so all the 6502 had to do was some bit-shifting to calculate the sprite start addresses.
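The scanline-spacing point can be illustrated with a short Python sketch of the address calculation for different strides. The function names are made up; the shifts stand in for what the 6502 would do with ASL/ROL.

```python
def addr_stride_320(base, x, y):
    """y * 320 computed as y*256 + y*64: two shifted terms and an extra add."""
    return base + (y << 8) + (y << 6) + x


def addr_stride_512(base, x, y):
    """y * 512 is a single shift by 9 (high byte = y * 2)."""
    return base + (y << 9) + x


def addr_stride_256(base, x, y):
    """With 256-byte scanlines, y is simply the middle address byte."""
    return base + (y << 8) + x


if __name__ == "__main__":
    print(addr_stride_320(0, 5, 3), addr_stride_512(0, 5, 3), addr_stride_256(0, 5, 3))
    # VRAM used by 200 lines at each stride
    print(200 * 320, 200 * 512, 200 * 256)
```

The trade-off is VRAM: 200 lines at a 512-byte stride occupy 102,400 bytes instead of 64,000, in exchange for a cheaper address calculation on the CPU.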


Edited by Rybags, Wed May 10, 2017 6:45 AM.


#14 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 6:54 AM

I have 256-byte scanlines (easier: $baseYYXX).

That is why I thought using a blitter list with 200 BCBs would gain something, but the yellow "areas" are a similar size?

wtf... this does not look good in terms of copy speed. Looking at the Lamer's demo, I thought there was more potential in VBXE. But it could be my code, or Altirra, or whatever :D



#15 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 6:57 AM

Rybags: the yellow shows where I am waiting for the blitter to finish, so the blitter cannot be idle there. To me it looks more like the blitter holds up the CPU for longer than I had expected. With small spans, shouldn't it be a mess of small yellow stripes?



#16 Rybags OFFLINE  

Posted Wed May 10, 2017 7:09 AM

Can you move more processing off the CPU?

 

Like store things as an array in VRam, then get the blitter to do some of the calcs.  I assume you're probably already using blits to populate the BCBs for the draws?



#17 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 7:17 AM

As you have more experience with VBXE: can you code a little benchmark, i.e. VBXE blitting lines of random length (0-199), and show how much time the blitter needs?

I still cannot believe the ("yellow") blocks are that huge.



#18 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 7:27 AM

Maybe I set stuff up wrong.

#19 Rybags OFFLINE  

Posted Wed May 10, 2017 7:29 AM

I was thinking about doing a test case for the refresh thing...

 

For your problem, maybe do a dump of a bad case situation of all the data going into the BCBs.

Then work out how many BCBs, how many pixels per BCB etc.

Then calculate how many cycles are required.  Then throw in the ones for where the 6502 is dragging the chain with the blitter idle and waiting.   Then compare that to what you're witnessing onscreen.

 

What you're drawing there, is it all done with horizontal line segments?

Are you using the mode 0 blitter command, without collision detection or any other time sapping stuff?



#20 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 7:33 AM

Scanline based... so only drawing from the right edge to the left edge with step x = -1.

Antic off, Blitter in filling mode.

#21 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 7:34 AM

And I might not have explained what makes me wonder: the yellow raster time is not the blitter's idle time but the CPU waiting for the blitter to finish. So it is actually the idle time of the CPU...


Edited by Heaven/TQA, Wed May 10, 2017 7:47 AM.


#22 Rybags OFFLINE  

Posted Wed May 10, 2017 8:46 AM

What's with that first screen though?  It's like the blits are only lasting half a scanline then another one started by the CPU.

But the other ones look like nice big chunks of work are being done - although a fair bit of idle time.



#23 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 9:04 AM

Rybags...

 

BUT... as you can see, most of the time we are talking about 1 SCANLINE being processed. It could be that I am measuring wrong (but I posted the wait chunk of code).

So I really do wonder: even if I blit one horizontal line (the faces in the Oxygene logo here are maybe 32 pixels at most), look at the big yellow area.

(Check the filenames to see what they do.) "nowait" means starting the blitter without waiting for it to finish its work.

I had assumed that the yellow chunks would be

a) smaller in terms of height and length

b) randomly spread over the screen

So I still wonder most of all: why the hell does the blitter suck up so much time? (As I said, it could be my code... I will show you later.)



#24 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 9:49 AM

That's my hline blit object (not talking about the list):

hline_bcb:
        .long $000000   ; source address
        .word 0         ; source step y
        .byte 0         ; source step x
        .long $010000   ; destination address
        .word 256       ; dest. step y
        .byte -1        ; dest. step x
        .word 0         ; size x
        .byte 0         ; size y
        .byte $00       ; AND mask
        .byte $00       ; XOR mask
        .byte 0         ; collision AND mask
        .byte $00       ; zoom
        .byte 0         ; pattern
        .byte 0         ; control
 
 
and that's the render loop:
 

render_scene
        ldy miny
        lda #$80        ; enable window, bank 0
        sta $d65d       ; cpu-vram access window at $4000

polycol lda #4
        sta $4310       ; color (XOR mask of the BCB)
        lda #$03        ; $000300 = $4300 bank #0
        sta $D651       ; blitter addr

_drwply
        lda redge,y
        sta $4306       ; xpos (dest. address low byte)
        sty $4307       ; ypos (dest. address middle byte)
        sec
        sbc ledge,y
        bcc @+1         ; negative width: leave the loop
        ;sta $430c      ; sizex (store is commented out as posted)
        lda #15
        sta $d01a
@       lda $D653       ; wait until not-busy
        bne @-

        lda #1
        sta $D653       ; start blitter (draw span)
        sta $d01a
_drw2   iny
        cpy maxy
        bne _drwply
@       lda #$00        ; clear the window control register
        sta $d65d       ; cpu-vram access window at $4000 off
        rts
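To see how the per-span CPU work compares with the blitter's own work, here is a rough cycle count of one pass through the `_drwply` loop in Python. The cycle values are the standard 6502 timings; the sketch assumes no page crossings, `maxy` in zero page, no wait-loop iterations, accesses through the $4000 window at normal CPU speed, and a 21-cycle BCB fetch on the blitter side. All of these are assumptions, not measurements.

```python
# (instruction, 6502 cycles) for one pass of the _drwply loop when the
# blitter is already idle on arrival at the wait loop.
LOOP = [
    ("lda redge,y", 4), ("sta $4306", 4), ("sty $4307", 4),
    ("sec", 2), ("sbc ledge,y", 4), ("bcc (not taken)", 2),
    ("lda #15", 2), ("sta $d01a", 4),
    ("lda $D653", 4), ("bne (not taken)", 2),
    ("lda #1", 2), ("sta $D653", 4), ("sta $d01a", 4),
    ("iny", 2), ("cpy maxy", 3), ("bne (taken)", 3),
]
BLITTER_TO_CPU_RATIO = 8
BCB_BYTES = 21   # assumption: BCB fetched at 1 blitter cycle per byte


def cpu_cycles_per_span():
    """CPU cycles spent per span even if the blitter never makes it wait."""
    return sum(cycles for _, cycles in LOOP)


def blitter_busy_share(span_bytes):
    """Fraction of the loop time during which the blitter has work to do."""
    busy = (BCB_BYTES + span_bytes) / BLITTER_TO_CPU_RATIO
    return min(1.0, busy / cpu_cycles_per_span())


if __name__ == "__main__":
    print(cpu_cycles_per_span(), blitter_busy_share(32))
```

With these numbers the loop costs about 50 CPU cycles per span, while a 32-byte fill occupies the blitter for under 7 of them, which is consistent with the advice to set up all the BCBs first and run them in one go.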
 
 


Edited by Heaven/TQA, Wed May 10, 2017 9:51 AM.


#25 Heaven/TQA ONLINE  

Posted Wed May 10, 2017 3:54 PM

OK... significant speed improvements can be made by using blitter lists and updating those huge lists with the blitter as well, so the CPU only prepares the base data.



