VBXE speed

Heaven/TQA · May 10, 2017

There is hardly any information available (or it is hidden), but how fast is the VBXE blitter? What is the clock in MHz, and how many cycles per byte/pixel?

I cannot find any cycle-count information.

phaeron · May 10, 2017

As I understand it: the blitter can use the full 14MHz bandwidth of VBXE local memory (8x Atari bus speed), but it has lowest priority relative to anything else including MEMAC. In general, it will fill at 1 cycle/byte, copy at 2 cycles/byte, and do read/modify/write or collision check operations at 3 cycles/byte. The blitter can skip cycles and run faster if the source is constant (AND mask = $00), if it's doing a RMW operation and the source byte is $00, or when repeating bytes with X zoom. Y zoom is not optimized and will re-read the source bytes.

The blitter can slow down to as low as quarter speed depending on the amount of DMA contention involved, particularly from the overlay. A 320x192 standard overlay, for instance, will consume 20-25% of total bandwidth. Running code out of MEMAC costs up to one-eighth of total bandwidth. Still, the blitter is fast enough to redraw the entire screen every frame if you keep overdraw low. Blitter lists can help you do this; it is feasible to have the blitter draw sprites with automatic background save/restore -- put your sprite position tables in one of the MEMAC windows and use the blitter to blit the positions into the save/restore blit lists. Similarly, you can emulate a tilemap by constructing a huge blit list that first blits from the tilemap into the source addresses of the rest of the blit list that copies one tile at a time.
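To put rough numbers on the figures above, here is a minimal Python sketch of a per-frame blitter budget. It assumes the VBXE memory clock is exactly 8x the CPU clock, uses the standard 114 machine cycles per scanline with 312 (PAL) or 262 (NTSC) lines per frame, and treats DMA contention as a flat fraction of bandwidth. The function names are made up for illustration, and the result is a back-of-envelope estimate, not a measurement.

```python
# Rough per-frame blitter budget, using the figures quoted in this thread.
CPU_CYCLES_PER_LINE = 114                    # Atari machine cycles per scanline
LINES_PER_FRAME = {"PAL": 312, "NTSC": 262}
VBXE_PER_CPU_CYCLE = 8                       # "8x Atari bus speed" (assumed exact)

# Blitter cost in VBXE cycles per byte, as described above.
CYCLES_PER_BYTE = {"fill": 1, "copy": 2, "rmw": 3}


def frame_budget(system, dma_fraction=0.0):
    """VBXE memory cycles left for the blitter in one frame."""
    total = LINES_PER_FRAME[system] * CPU_CYCLES_PER_LINE * VBXE_PER_CPU_CYCLE
    return int(total * (1.0 - dma_fraction))


def blit_cost(num_bytes, mode):
    """VBXE cycles for a blit of num_bytes in the given mode."""
    return num_bytes * CYCLES_PER_BYTE[mode]


if __name__ == "__main__":
    screen = 320 * 192                       # one full standard overlay
    cost = blit_cost(screen, "copy")
    for system in ("PAL", "NTSC"):
        budget = frame_budget(system, dma_fraction=0.25)
        print(system, budget, cost, f"{cost / budget:.0%} of budget")
```

Under these assumptions a full-screen copy with a 25% overlay load takes a bit over half the PAL budget and roughly two-thirds of the NTSC budget, which fits the "redraw every frame if overdraw is low" remark.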

Rybags · May 10, 2017

An additional consideration is that, as with the legacy hardware, there are proportionally fewer cycles per frame available on NTSC than on PAL.

For squeezing every last cycle out though, the key things would be - keep CPU and Antic access to VRam to a minimum. Consider screen architecture such as where attribute maps are needed or not, where narrow mode might be sufficient, where text mode might be sufficient.

Not sure if Antic refresh cycles can generate a wait state for VBXE. In theory it could just ignore them, since VRAM is static RAM and never needs refreshing.

phaeron · May 10, 2017

I might be wrong, but I don't think refresh cycles count because they have no address to decode to a MEMAC window.

CPU accesses to VRAM are relatively cheap if you don't execute code from the window, as then it's likely to be <5% total local bandwidth. That's a fairly low cost to be able to do things like place MEMAC A at 0, which lets you context switch quickly and also store to VRAM at 3 cycles/byte. ANTIC, on the other hand, should just be switched off to let the CPU run faster.

Heaven/TQA · May 10, 2017

I am running into the following issue: compared to my Lynx demos, I am not getting the same kind of speed when blitting polygon spans.

but my render loop does 1 blit per span, meaning only 1 BCB... including the wait for the blitter to stop.

but still not satisfied with speed.....

Heaven/TQA · May 10, 2017

so... not sure if my waitblits break the speed after starting the span blitting.

$d400 = 0 so no DMA steals.

would it help to have a BCB spanlist?

Rybags · May 10, 2017

What's a spanlist?

BCBs just execute one after another sequentially until one with the "NEXT" bit cleared in its BCB finishes, which signifies the end of processing.

It is a bit annoying... what would have been nice is a skip command so you could leave objects defined but selectively not display them instead of having to modify the BCB so it doesn't render.

One solution I found is to use an initial blit or two which populates parameters within the string of BCBs, it's just way faster to do minimal CPU processing and just have blits to do much of the pre-processing since it moves data around so quickly.

Another timesaving thing - if you have a large object with a shape with lots of blank or common space, consider breaking it into smaller objects to save unnecessary blits, plus using constant data mode has its cycle savings as well.

Heaven/TQA · May 10, 2017

think of n-poly (not triangle but same there)...

I calc via CPU 2 buffers (left edge, right edge with miny, maxy vars to see which areas are covered on screen).

then fill those spans

so kind of

for y = miny to maxy-1
    set span_xpos in BCB to rightedge(y)
    set span_size_x in BCB to rightedge(y) - leftedge(y)
    blit span
    wait blit
next y

BCB sets blitter to copy mode 0, AND #0, EXOR span color, step x = -1

so... one idea was to have, say, 200 BCBs (one per scanline, like unrolled code). The CPU sets the blitter start BCB based on miny (basically 21*y), sets all positions and sizes between miny and maxy, and clears the NEXT bit in the maxy BCB.

just an unproven idea of having one big poly BCB list... this kind of thing helped in the Elements Lynx demo, as the CPU doesn't need to wait for each span blit to finish.
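The span-list idea can be sketched on the host side like this. It is a minimal Python illustration, not tested against hardware. The 21-byte field order follows the hline_bcb listing further down the thread; the NEXT flag being bit 3 of the control byte is an assumption, as is storing the raw right-minus-left difference in the size field. The names pack_bcb and build_span_list are hypothetical.

```python
import struct

BCB_SIZE = 21      # bytes per blitter control block, matching "21*y" above
NEXT_BIT = 0x08    # ASSUMED position of the NEXT flag in the control byte


def pack_bcb(dest, width, color, last):
    """Pack one horizontal fill span as a 21-byte BCB (assumed field order)."""
    control = 0 if last else NEXT_BIT           # mode 0, chain unless last
    return (
        (0).to_bytes(3, "little")               # source address (unused, AND = 0)
        + struct.pack("<hb", 0, 0)              # source step y, source step x
        + dest.to_bytes(3, "little")            # destination address
        + struct.pack("<hb", 256, -1)           # dest step y, dest step x
        + struct.pack("<HB", width, 0)          # size x, size y
        + bytes([0x00, color, 0, 0, 0, control])  # AND, XOR, coll. AND, zoom, pattern, control
    )


def build_span_list(ledge, redge, miny, maxy, base=0x010000, color=4):
    """Build one chained BCB per covered scanline from the two edge buffers."""
    spans = [
        (base + y * 256 + redge[y], redge[y] - ledge[y])
        for y in range(miny, maxy)
        if redge[y] >= ledge[y]                 # skip inverted spans
    ]
    return b"".join(
        pack_bcb(dest, width, color, last=(i == len(spans) - 1))
        for i, (dest, width) in enumerate(spans)
    )


if __name__ == "__main__":
    data = build_span_list([10] * 4, [20] * 4, miny=1, maxy=4)
    print(len(data) // BCB_SIZE, "BCBs,", len(data), "bytes")
```

The fixed 200-entry table described above would differ only in that every scanline keeps its slot, so the start address is simply 21*miny and only the NEXT bit of the last used slot changes.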

Edited May 10, 2017 by Heaven/TQA

Rybags · May 10, 2017

If possible set all the blits up first and run them in one go.

Running single blits or groups of a few, then having CPU intervention where it waits for an IRQ or the flag that starts the next lot, would be somewhat wasteful.

Also don't forget - for some stuff you can make use of the blit for normal Antic graphics.

That's what I did with Quadrillion - I initially converted the game with the graphics remapped from the cell to linear type but nasty bugs crept in and I had to start again.

So I went with the idea to just leave the rendering mostly alone, using the blit to convert the entire 8K bitmap from Plus4 mapping to Atari every frame.

If you can live with Antic graphics for certain stuff, then potentially the blitter can shift 4-8 times as many pixels.
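For readers unfamiliar with the cell-to-linear remapping mentioned above, here is a minimal Python sketch of the address mapping only. It assumes a 320x200 one-bit bitmap (8000 bytes) stored in 8x8 character-cell order on the source side and 40 bytes per scanline on the Atari side. How the actual blits were arranged in Quadrillion is not described in this thread, so nothing here should be read as that implementation; the function names are hypothetical.

```python
COLS = 40      # bytes per scanline in the linear layout
ROWS = 200     # scanlines


def cell_offset(col, row):
    """Byte offset in a cell-ordered bitmap (8 consecutive bytes per 8x8 cell)."""
    return (row // 8) * 320 + col * 8 + (row % 8)


def linear_offset(col, row):
    """Byte offset in a linear bitmap with 40 bytes per scanline."""
    return row * COLS + col


def cell_to_linear(cell_bitmap):
    """Reorder an 8000-byte cell-ordered bitmap into linear scanline order."""
    out = bytearray(COLS * ROWS)
    for row in range(ROWS):
        for col in range(COLS):
            out[linear_offset(col, row)] = cell_bitmap[cell_offset(col, row)]
    return bytes(out)
```

Because the source offset advances by 8 per byte column and by 1 per pixel row inside a cell, this kind of regular stride is the sort of pattern a blitter with separate source and destination steps can handle without CPU help.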

Heaven/TQA · May 10, 2017

so... here are some Altirra screenshots

yellow color appears when I start blitter operation.

black when finished

vbxe_blitter_face_nowait_span.png

this draws 1 span without waiting the blitter to finish

vbxe_blitter_face_nowait.png

same as span but 1 face

vbxe_blitter_face.png

same as above but with wait

vbxe_blitter_face_nowait_200span.png

this one blitting 200 spans in a blitter block list...

what makes me wonder...

that's not "fast"?

why does it always start at nearly the same screen position... is there any "align" or "sync" happening?

$d400 is 0, blitter is set to fill mode...

        lda #15
        sta $d01a       ; change background color while the blit runs
        lda #1
        sta $D653       ; start blitter (draw span)
@       lda $D653       ; wait until not-busy
        bne @-
        sta $d01a       ; A = 0 here, background back to black

Heaven/TQA · May 10, 2017

what makes me wonder... filling a single span (one scanline, so to say) from x1 to x2 takes more than half a scanline?

Heaven/TQA · May 10, 2017

100 bytes to fill would need 100 cycles in blitter fill mode

100/8 =12.5 (blitter 8x faster than cpu)

so in my world… the CPU would get control back after 12.5 cycles… and the Atari has 114 cycles per scanline, so the yellow bars should be much thinner???

where is my misunderstanding?

though have not checked real hw yet.
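The estimate above can be written out as a small Python sketch. It assumes the fill rate of 1 VBXE cycle per byte and the 8x clock ratio quoted earlier in the thread, and additionally assumes (this is not confirmed here) that fetching the 21-byte BCB costs 1 VBXE cycle per byte. The 7-cycle polling loop is the `lda $D653` / `bne` pair (4 + 3 cycles when the branch is taken). No DMA contention is modelled.

```python
import math

VBXE_PER_CPU_CYCLE = 8      # blitter clock relative to the CPU, per the thread
CPU_CYCLES_PER_LINE = 114   # machine cycles per scanline
BCB_BYTES = 21              # ASSUMED to be fetched at 1 VBXE cycle per byte
POLL_LOOP_CYCLES = 7        # lda abs (4) + bne taken (3)


def fill_span_cpu_cycles(width):
    """Expected CPU cycles for one fill span, including the BCB fetch."""
    vbxe_cycles = BCB_BYTES + width          # fetch + 1 cycle/byte fill
    return vbxe_cycles / VBXE_PER_CPU_CYCLE


def poll_iterations(width):
    """How many times the busy-wait loop should spin for one span."""
    return math.ceil(fill_span_cpu_cycles(width) / POLL_LOOP_CYCLES)


if __name__ == "__main__":
    cycles = fill_span_cpu_cycles(100)
    print(cycles, "CPU cycles,", f"{cycles / CPU_CYCLES_PER_LINE:.0%} of a scanline,",
          poll_iterations(100), "poll iterations")
```

By this model a 100-byte span should hold the CPU for roughly 15 cycles, a small fraction of a scanline, so bars spanning half a scanline or more point at something other than raw fill speed.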

Rybags · May 10, 2017

Are you doing lots of single blits? I would think that's a big problem, especially considering some line draws are like 10 pixels wide.

Consider VBXE reads the BCB and starts executing it in less than 3 CPU cycles. A 10 pixel line is another 2 cycles with some spare. The overhead in setting up for individual blits, monitoring and starting the next one could potentially see the blitter spending more time idle than actually working.

Pretty good looking sequence BTW... another optimization you might try - in standard mode the scanlines are 320 bytes apart. Depending on how you do your calculations, if you can spare some VRam, put the scanlines 512 bytes apart which for some graphical stuff can speed things up... fairly sure I did that in Moon Cresta so all the 6502 had to do was some bit-shifting to calculate the sprite start addresses.

Edited May 10, 2017 by Rybags

Heaven/TQA · May 10, 2017

I have 256-byte scanlines... (easier... $baseYYXX)
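The difference between the scanline pitches discussed here comes down to how the start address is formed. A minimal Python sketch, assuming a 64K-aligned base of $010000 (the destination used in the BCB later in the thread) and hypothetical function names:

```python
BASE = 0x010000    # assumed VRAM base, 64K aligned


def addr_pitch_256(x, y, base=BASE):
    """$baseYYXX: y and x drop straight into the middle and low address bytes."""
    return base | (y << 8) | x               # needs x < 256 and y < 256


def addr_pitch_512(x, y, base=BASE):
    """512-byte pitch: a single shift, wide enough for 320-pixel lines."""
    return base + (y << 9) + x


def addr_pitch_320(x, y, base=BASE):
    """Standard 320-byte pitch: needs a real multiply (or a lookup table)."""
    return base + y * 320 + x
```

With the 256-byte pitch the 6502 does no arithmetic at all: it stores x and y directly into two bytes of the BCB, which is what the render loop below does with `sta $4306` and `sty $4307`.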

but that's why I thought using the blitter list with 200 BCBs would be a gain... yet the yellow "areas" are a similar size?

wtf... this does not look good in terms of copy speed. Looking at the Lamers demo, I thought there was more potential in VBXE. But it could be my code, or Altirra, or whatever.

Heaven/TQA · May 10, 2017

the yellow is where I am waiting for the blitter to finish... so the blitter cannot be idle... to me it looks more like it is holding up the CPU for longer than I had expected. If the spans are small, shouldn't it be a mess of small yellow stripes?

Rybags · May 10, 2017

Can you move more processing off the CPU?

Like store things as an array in VRam, then get the blitter to do some of the calcs. I assume you're probably already using blits to populate the BCBs for the draws?

Heaven/TQA · May 10, 2017

as you have more experience with VBXE... can you code a little benchmark, meaning VBXE blitting lines of random length (0-199), and show how much time the blitter needs?

I still cannot believe the blocks ("yellow") are that huge.

Heaven/TQA · May 10, 2017

Maybe I set stuff up wrong

Rybags · May 10, 2017

I was thinking about doing a test case for the refresh thing...

For your problem, maybe do a dump of a bad case situation of all the data going into the BCBs.

Then work out how many BCBs, how many pixels per BCB etc.

Then calculate how many cycles are required. Then throw in the ones for where the 6502 is dragging the chain with the blitter idle and waiting. Then compare that to what you're witnessing onscreen.

What you're drawing there, is it all done with horizontal line segments?

Are you using the mode 0 blitter command, without collision detection or any other time sapping stuff?
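The accounting suggested above can be sketched in Python as follows. It reuses the fill rate (1 VBXE cycle per byte) and 8x clock ratio from earlier in the thread, assumes a 21-byte BCB fetched at 1 cycle per byte, and takes the per-span CPU overhead as a parameter. The value of 50 cycles used in the example is only a rough hand count of a set-up/start/poll loop, so treat it as a placeholder.

```python
BCB_BYTES = 21            # ASSUMED fetch cost: 1 VBXE cycle per byte
VBXE_PER_CPU_CYCLE = 8    # blitter clock relative to the CPU, per the thread


def blitter_cycles(widths):
    """Total VBXE cycles for a set of fill spans run as one BCB list."""
    return sum(BCB_BYTES + w for w in widths)


def cpu_driven_cycles(widths, cpu_overhead_per_span):
    """CPU cycles when the CPU sets up, starts and waits for every span."""
    total = 0.0
    for w in widths:
        total += cpu_overhead_per_span + (BCB_BYTES + w) / VBXE_PER_CPU_CYCLE
    return total


if __name__ == "__main__":
    widths = [32] * 100                       # 100 spans of 32 pixels
    as_list = blitter_cycles(widths) / VBXE_PER_CPU_CYCLE
    one_by_one = cpu_driven_cycles(widths, cpu_overhead_per_span=50)
    print(f"single list: {as_list:.1f} CPU cycles")
    print(f"one blit per span: {one_by_one:.1f} CPU cycles")
```

With short spans the CPU overhead dominates by almost an order of magnitude in this model, which is the case for running everything as one list. Comparing such a figure with what is seen on screen is the point of the exercise.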

Heaven/TQA · May 10, 2017

Scanline based... so only drawing from right edge to left edge with step x-1

Antic off, Blitter in filling mode.

Heaven/TQA · May 10, 2017

And I might not have explained what makes me wonder... the yellow raster time is not blitter idle time but the CPU waiting for the blitter to finish. So it is actually the idle time of the CPU...

Edited May 10, 2017 by Heaven/TQA

Rybags · May 10, 2017

What's with that first screen though? It's like the blits are only lasting half a scanline then another one started by the CPU.

But the other ones look like nice big chunks of work are being done - although a fair bit of idle time.

Heaven/TQA · May 10, 2017

Rybags...

BUT... as you see... most of the time we are talking about 1 SCANLINE being processed... it could be that I am measuring wrong... (but I posted the wait chunk of code).

so I really do wonder... even if I blit one horizontal line (here in the oxygene logo the faces are maybe 32 pixels at most), look at the big yellow area?

(check the filenames to see what they do)... "nowait" means starting the blitter without waiting for it to finish its work...

and I had assumed that the yellow chunks would be

a) smaller in terms of height and length

b) randomly spread over the screen

so I still mostly wonder... why the hell does the blitter eat so much time? (as I said... could be my code... will show you later).

Heaven/TQA · May 10, 2017

that's my hline blit object (not talking about the list)



hline_bcb:
        .long $000000   ; source address
        .word 0         ; source step y
        .byte 0         ; source step x
        .long $010000   ; destination address
        .word 256       ; dest. step y
        .byte -1        ; dest. step x
        .word 0         ; size x
        .byte 0         ; size y
        .byte $00       ; AND
        .byte $00       ; XOR
        .byte 0         ; collision AND
        .byte $00       ; zoom
        .byte 0         ; pattern
        .byte 0         ; control

and that's the render loop:

 



render_scene
        ldy miny
        lda #$80        ; bank 0, CPU access enabled
        sta $d65d       ; CPU-VRAM access window at $4000

polycol lda #4
        sta $4310       ; color (XOR byte of the BCB at $4300)
        lda #$03        ; $000300 = $4300 in bank 0
        sta $D651       ; blitter address (middle byte)

_drwply lda redge,y
        sta $4306       ; xpos (dest. address low byte)
        sty $4307       ; ypos (dest. address middle byte)
        sec
        sbc ledge,y
        bcc @+1
        ;sta $430c      ; size x (store is commented out in this listing)
        lda #15
        sta $d01a       ; mark raster time while waiting
@       lda $D653       ; wait until not-busy
        bne @-

        lda #1
        sta $D653       ; start blitter (draw span)
        sta $d01a
_drw2   iny
        cpy maxy
        bne _drwply
@       lda #$00
        sta $d65d       ; close the CPU-VRAM access window
        rts

Edited May 10, 2017 by Heaven/TQA

Heaven/TQA · May 10, 2017

OK... significant speed improvements can be made by using blitter lists and updating those huge lists with the blitter as well, so the CPU only prepares the base data.
