Heaven/TQA, posted May 10, 2017:
There is hardly any information available (or it is well hidden), but how fast is the VBXE blitter? What clock rate (MHz), and how many cycles per byte/pixel? I cannot find any cycle-level timing information.

phaeron, posted May 10, 2017:
As I understand it, the blitter can use the full 14MHz bandwidth of VBXE local memory (8x Atari bus speed), but it has the lowest priority relative to everything else, including MEMAC. In general, it will fill at 1 cycle/byte, copy at 2 cycles/byte, and do read/modify/write or collision check operations at 3 cycles/byte. The blitter can skip cycles and run faster if the source is constant (AND mask = $00), if it is doing an RMW operation and the source byte is $00, or when repeating bytes with X zoom. Y zoom is not optimized and will re-read the source bytes.

The blitter can slow down to as low as quarter speed depending on the amount of DMA contention involved, particularly from the overlay. A 320x192 standard overlay, for instance, will consume 20-25% of total bandwidth. Running code out of MEMAC costs up to one-eighth of total bandwidth.

Still, the blitter is fast enough to redraw the entire screen every frame if you keep overdraw low. Blitter lists can help you do this: it is feasible to have the blitter draw sprites with automatic background save/restore. Put your sprite position tables in one of the MEMAC windows and use the blitter to blit the positions into the save/restore blit lists. Similarly, you can emulate a tilemap by constructing a huge blit list that first blits from the tilemap into the source addresses of the rest of the blit list, which then copies one tile at a time.

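As a rough sanity check, the per-byte figures above can be turned into a small estimator. This is only a sketch: the function name and the mode labels are made up for illustration, and it ignores BCB fetch time, DMA contention and the skipped-cycle optimizations (constant source, $00 source bytes, X zoom).

```python
# Back-of-envelope blitter timing from the per-byte costs quoted in the post.
# Assumptions (not measured): no BCB fetch overhead, no contention,
# no skipped cycles.

VBXE_CYCLES_PER_CPU_CYCLE = 8  # local memory runs at 8x the Atari bus speed

CYCLES_PER_BYTE = {
    "fill": 1,  # constant source, write only
    "copy": 2,  # read + write
    "rmw": 3,   # read/modify/write or collision check
}


def blit_cpu_cycles(n_bytes, mode):
    """Estimated duration of a blit, expressed in 6502 bus cycles."""
    vbxe_cycles = n_bytes * CYCLES_PER_BYTE[mode]
    return vbxe_cycles / VBXE_CYCLES_PER_CPU_CYCLE


if __name__ == "__main__":
    # One 320-byte line in each mode.
    for mode in CYCLES_PER_BYTE:
        print(mode, blit_cpu_cycles(320, mode))
```

Under these assumptions a 320-byte copy costs about 80 CPU cycles, which is less than one scanline.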
Rybags, posted May 10, 2017:
An additional consideration is that, as with the legacy hardware, there are proportionally fewer cycles per frame available on NTSC than on PAL.

For squeezing out every last cycle, though, the key thing is to keep CPU and ANTIC access to VRAM to a minimum. Consider the screen architecture: whether attribute maps are needed or not, whether narrow mode might be sufficient, whether text mode might be sufficient.

I am not sure whether ANTIC refresh cycles can generate a wait state for VBXE. In theory it could just ignore them, since VRAM is static RAM and never needs refreshing.

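The NTSC versus PAL difference can be put into numbers with the standard Atari timing of 114 machine cycles per scanline, 312 scanlines per PAL frame and 262 per NTSC frame. The sketch below is an illustration only: the function names are invented, and the result is raw memory bandwidth before any overlay, MEMAC or attribute-map DMA is subtracted.

```python
# Raw VBXE memory cycles per frame, PAL vs NTSC, and the share of that
# budget a given blit would use. Contention is deliberately ignored.

CPU_CYCLES_PER_LINE = 114
LINES_PER_FRAME = {"PAL": 312, "NTSC": 262}
VBXE_CYCLES_PER_CPU_CYCLE = 8  # "8x Atari bus speed" from the post above


def vbxe_cycles_per_frame(system):
    """VBXE local-memory cycles available in one video frame."""
    return (LINES_PER_FRAME[system] * CPU_CYCLES_PER_LINE
            * VBXE_CYCLES_PER_CPU_CYCLE)


def frame_share(n_bytes, cycles_per_byte, system):
    """Fraction of one frame's raw bandwidth consumed by a blit."""
    return n_bytes * cycles_per_byte / vbxe_cycles_per_frame(system)


if __name__ == "__main__":
    screen_bytes = 320 * 192
    for system in LINES_PER_FRAME:
        share = frame_share(screen_bytes, 2, system)  # 2 cycles/byte copy
        print(system, vbxe_cycles_per_frame(system), round(share, 3))
```

By this estimate a full 320x192 copy takes a bit over 40% of a PAL frame and a bit over 50% of an NTSC frame, before contention.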
phaeron, posted May 10, 2017:
I might be wrong, but I don't think refresh cycles count, because they have no address to decode to a MEMAC window.

CPU accesses to VRAM are relatively cheap if you don't execute code from the window, as then they are likely to take <5% of total local bandwidth. That is a fairly low cost for being able to do things like place MEMAC A at 0, which lets you context switch quickly and also store to VRAM at 3 cycles/byte.

ANTIC, on the other hand, should just be switched off to let the CPU run faster.

Heaven/TQA, posted May 10, 2017:
I am running into the following issue (compared to my Lynx demos): I am not getting the same kind of speed when blitting the spans of polygons. My render loop does 1 blit per span, meaning only 1 BCB, including a wait for the blitter to stop, but I am still not satisfied with the speed.

Heaven/TQA, posted May 10, 2017:
So I am not sure whether my blit waits are killing the speed after each span blit is started. $d400 = 0, so there are no DMA steals. Would it help to have a BCB span list?

Rybags, posted May 10, 2017:
What's a span list? BCBs just execute one after another sequentially until one with the "NEXT" bit cleared in its BCB finishes, which signifies the end of processing.

It is a bit annoying. What would have been nice is a skip command, so you could leave objects defined but selectively not display them, instead of having to modify the BCB so that it doesn't render.

One solution I found is to use an initial blit or two that populates parameters within the string of BCBs. It is just way faster to do minimal CPU processing and have blits do much of the pre-processing, since the blitter moves data around so quickly.

Another time-saving thing: if you have a large object whose shape has lots of blank or common space, consider breaking it into smaller objects to save unnecessary blits. Using constant data mode has its cycle savings as well.

Heaven/TQA, posted May 10, 2017 (edited May 10, 2017):
Think of an n-sided polygon (not a triangle, but the same applies there). On the CPU I calculate 2 buffers (left edge and right edge, with miny and maxy variables to see which lines are covered on screen), then fill those spans, so something like:

    for y = miny to maxy-1
        set span_xpos in BCB to rightedge(y)
        set span_size_x in BCB to rightedge(y) - leftedge(y)
        blit span
        wait blit
    next y

The BCB sets the blitter to copy mode 0, AND #0, EXOR = span color, step x = -1.

So one idea was to have, say, 200 BCBs (for 200 scanlines, like unrolled code). The CPU sets the blitter start BCB based on miny (basically 21*y), sets all positions and sizes between miny and maxy, and clears the NEXT bit in the maxy BCB.

It is just an unproven idea of having one big polygon BCB list. That kind of thing helped in the Elements Lynx demo, as the CPU does not need to wait for each span blit to finish.

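The span list idea can be modelled off-target to check the byte layout before writing the 6502 side. The sketch below packs one 21-byte BCB per covered scanline, following the field order of the hline_bcb posted later in the thread. Several things are assumptions rather than facts from the thread: that NEXT is bit 3 of the control byte, that the size fields hold the size minus 1, and all function names. It also builds a compact list instead of the fixed 21*y slots described above.

```python
import struct

BCB_SIZE = 21      # bytes per blitter control block
CTRL_NEXT = 0x08   # ASSUMPTION: bit 3 of the control byte chains to the next BCB
MODE_COPY = 0x00   # mode 0, as used in the thread


def addr24(value):
    """Little-endian 24-bit address field."""
    return struct.pack("<I", value)[:3]


def make_span_bcb(dest_addr, width, color, last):
    """Pack one span fill drawn right to left (dest step x = -1)."""
    control = MODE_COPY | (0 if last else CTRL_NEXT)
    bcb = (
        addr24(0)                             # source address (AND mask 0, so unused)
        + struct.pack("<hb", 0, 0)            # source step y, source step x
        + addr24(dest_addr)                   # destination address
        + struct.pack("<hb", 256, -1)         # dest step y, dest step x
        + struct.pack("<HB", width - 1, 0)    # ASSUMPTION: size x/y stored minus 1
        + bytes([0x00, color, 0, 0, 0, control])  # AND, XOR, coll., zoom, pattern, control
    )
    assert len(bcb) == BCB_SIZE
    return bcb


def build_span_list(base, color, ledge, redge, miny, maxy):
    """One BCB per non-empty scanline in miny..maxy-1; the last has NEXT cleared."""
    spans = [(y, redge[y], redge[y] - ledge[y])
             for y in range(miny, maxy) if redge[y] > ledge[y]]
    out = bytearray()
    for i, (y, x, width) in enumerate(spans):
        dest = base + (y << 8) + x            # 256-byte scanlines
        out += make_span_bcb(dest, width, color, last=(i == len(spans) - 1))
    return bytes(out)


if __name__ == "__main__":
    blist = build_span_list(0x010000, 4, [10, 10, 10], [20, 30, 40], 0, 3)
    print(len(blist), blist.hex())
```

The point of the model is that the CPU (or another blit) only has to touch three or four bytes per BCB, and the blitter is started once for the whole polygon.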
Rybags, posted May 10, 2017:
If possible, set all the blits up first and run them in one go. Running single blits, or groups of a few, and then having CPU intervention where it waits for the IRQ or the flag that starts the next lot would be somewhat wasteful.

Also don't forget that for some stuff you can make use of the blitter for normal ANTIC graphics. That's what I did with Quadrillion. I initially converted the game with the graphics remapped from the cell type to the linear type, but nasty bugs crept in and I had to start again. So I went with the idea of leaving the rendering mostly alone and using the blitter to convert the entire 8K bitmap from Plus4 mapping to Atari every frame.

If you can live with ANTIC graphics for certain stuff, then the blitter can potentially shift 4-8 times as many pixels.

Heaven/TQA, posted May 10, 2017:
So, here are some Altirra screenshots. The yellow color appears when I start a blitter operation, and black when it has finished.

vbxe_blitter_face_nowait_span.png: this draws 1 span without waiting for the blitter to finish
vbxe_blitter_face_nowait.png: same as span, but 1 face
vbxe_blitter_face.png: same as above, but with wait
vbxe_blitter_face_nowait_200span.png: this one blits 200 spans in one blitter block list

What makes me wonder: that's not "fast"? And why does it always start at nearly the same screen position? Is there any "align" or "sync" happening? $d400 is 0, and the blitter is set to fill mode.

        lda #15
        sta $d01a       ; background color on: blit started
        lda #1
        sta $D653       ; start blitter (draw span)
    @   lda $D653       ; wait until not-busy
        bne @-
        sta $d01a       ; A = 0 here, so background goes back to black

Heaven/TQA, posted May 10, 2017:
What makes me wonder: filling a span (a scanline, so to say) from x1 to x2 takes more than half a scanline?

Heaven/TQA, posted May 10, 2017:
100 bytes to fill would need 100 cycles in blitter fill mode. 100/8 = 12.5 (the blitter is 8x faster than the CPU), so in my world the CPU would get control back after 12.5 cycles. The Atari has 114 cycles per raster line, so the yellow bars should be much thinner??? Where is my misunderstanding? I have not checked on real hardware yet, though.

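The expectation in this post can be written out as arithmetic, including the quarter-speed worst case phaeron mentioned. This is a sketch with an invented function name; it uses the standard figure of 114 machine cycles per Atari scanline and ignores BCB fetch time and the granularity of the CPU's polling loop.

```python
# Expected width of the "waiting for blitter" marker as a share of a scanline.

CPU_CYCLES_PER_LINE = 114  # machine cycles per Atari scanline
VBXE_SPEEDUP = 8           # VBXE memory cycles per CPU cycle


def fill_scanline_fraction(n_bytes, slowdown=1.0):
    """Share of one scanline spent waiting on a fill of n_bytes.

    slowdown=1.0 means no contention; 4.0 is the quarter-speed worst case.
    """
    cpu_cycles = n_bytes * slowdown / VBXE_SPEEDUP
    return cpu_cycles / CPU_CYCLES_PER_LINE


if __name__ == "__main__":
    print(round(fill_scanline_fraction(100), 3))       # full speed
    print(round(fill_scanline_fraction(100, 4.0), 3))  # quarter speed
```

In this model a 100-byte fill stays under half a scanline even at quarter speed, so the blit alone would not account for bars longer than that.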
Rybags, posted May 10, 2017 (edited May 10, 2017):
Are you doing lots of single blits? I would think that's a big problem, especially considering that some line draws are only about 10 pixels wide.

Consider that VBXE reads the BCB and starts executing it in less than 3 CPU cycles. A 10-pixel line is another 2 cycles with some to spare. With the overhead of setting up individual blits, monitoring them and starting the next one, the blitter could potentially spend more time idle than actually working.

Pretty good looking sequence, BTW.

Another optimization you might try: in standard mode the scanlines are 320 bytes apart. Depending on how you do your calculations, if you can spare some VRAM, put the scanlines 512 bytes apart, which can speed things up for some graphical stuff. I am fairly sure I did that in Moon Cresta, so all the 6502 had to do was some bit-shifting to calculate the sprite start addresses.

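These figures are easy to check: a 21-byte BCB at one VBXE cycle per byte is 21/8 = 2.6 CPU cycles, and a 10-byte fill is 10/8 = 1.25. The sketch below models how little of the time the blitter is busy when each short span is started by hand. The one-cycle-per-BCB-byte fetch cost and the `cpu_overhead_cycles` parameter (CPU time per span during which the blitter sits idle) are assumptions for illustration, not measured values.

```python
# Blitter utilization when the CPU launches every span individually.

BCB_SIZE = 21     # bytes per blitter control block
VBXE_SPEEDUP = 8  # VBXE memory cycles per CPU cycle


def span_cpu_cycles(width_bytes, cycles_per_byte=1):
    """(BCB fetch, pixel work) for one span, in 6502 cycles, no contention."""
    fetch = BCB_SIZE / VBXE_SPEEDUP  # ASSUMPTION: 1 VBXE cycle per BCB byte
    work = width_bytes * cycles_per_byte / VBXE_SPEEDUP
    return fetch, work


def blitter_busy_share(width_bytes, cpu_overhead_cycles):
    """Fraction of wall time the blitter is actually working."""
    fetch, work = span_cpu_cycles(width_bytes)
    busy = fetch + work
    return busy / (busy + cpu_overhead_cycles)


if __name__ == "__main__":
    print(span_cpu_cycles(10))
    # Hypothetical 30 CPU cycles of setup and polling per span.
    print(round(blitter_busy_share(10, 30), 3))
```

With a hypothetical 30 cycles of CPU work per span, the blitter in this model is busy only about a tenth of the time, which is the argument for chaining BCBs.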
Heaven/TQA, posted May 10, 2017:
I have 256-byte scanlines (easier: $baseYYXX).

But that's why I thought using the blitter list with 200 BCBs would gain something, yet the yellow "areas" are a similar size? WTF, this does not look good in terms of copy speed. Looking at the Lamer's demo, I thought there was more potential in VBXE. But it could be my code, or Altirra, or whatever.

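The reason power-of-two scanline spacing is convenient can be shown in a few lines. This is a sketch with invented function names: with 256-byte lines the three address bytes are simply bank, Y and X, with 512-byte lines the Y term is a shift, and only the 320-byte layout needs a real multiply (or a lookup table) on the 6502.

```python
# Span start address for different scanline strides.

def span_address_256(bank, x, y):
    """24-bit address for 256-byte scanlines: $bbYYXX."""
    if not (0 <= x <= 0xFF and 0 <= y <= 0xFF):
        raise ValueError("x and y must each fit in one byte")
    return (bank << 16) | (y << 8) | x


def span_address(base, x, y, stride):
    """General case: base + y * stride + x."""
    return base + y * stride + x


if __name__ == "__main__":
    print(hex(span_address_256(0x01, 0x14, 0x02)))
    print(hex(span_address(0x010000, 20, 2, 512)))
    print(hex(span_address(0x010000, 20, 2, 320)))
```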
Heaven/TQA, posted May 10, 2017:
Rybags wrote: "Are you doing lots of single blits? [...] The overhead in setting up for individual blits, monitoring and starting the next one could potentially see the blitter spending more time idle than actually working."

The yellow is where I am waiting for the blitter to finish, so the blitter cannot be idle there. To me it looks more like the CPU is being held up for longer than I had expected. With small spans, shouldn't it be a mess of small yellow stripes?

Rybags, posted May 10, 2017:
Can you move more processing off the CPU? For example, store things as an array in VRAM, then get the blitter to do some of the calculations. I assume you're probably already using blits to populate the BCBs for the draws?

Heaven/TQA, posted May 10, 2017:
As you have more experience with VBXE, can you code a little benchmark, meaning VBXE blitting lines of random length (0-199), and show how much time the blitter needs? I still cannot believe that the blocks ("yellow") are so huge.

Heaven/TQA, posted May 10, 2017:
Maybe I set stuff up wrong.

Rybags, posted May 10, 2017:
I was thinking about doing a test case for the refresh thing.

For your problem, maybe do a dump of a worst-case situation of all the data going into the BCBs. Then work out how many BCBs there are, how many pixels per BCB, and so on. Then calculate how many cycles are required. Then throw in the cycles where the 6502 is dragging the chain with the blitter idle and waiting. Then compare that to what you're witnessing onscreen.

What you're drawing there, is it all done with horizontal line segments? Are you using the mode 0 blitter command, without collision detection or any other time-sapping stuff?

Heaven/TQA, posted May 10, 2017:
Scanline based, so only drawing from right edge to left edge with step x = -1. ANTIC off, blitter in fill mode.

Heaven/TQA, posted May 10, 2017 (edited May 10, 2017):
I might not have explained what makes me wonder: the yellow raster time is not the blitter's idle time but the CPU waiting for the blitter to finish, so it is actually the idle time of the CPU.

Rybags, posted May 10, 2017:
What's with that first screen, though? It's as if the blits only last half a scanline and then another one is started by the CPU. The other ones look like nice big chunks of work are being done, although with a fair bit of idle time.

Heaven/TQA, posted May 10, 2017:
Rybags, BUT as you can see, most of the time we are talking about 1 SCANLINE being processed. It could be that I am measuring wrong (but I posted the wait chunk of code). So I really, really wonder: even if I blit one horizontal line (the faces in the Oxygene logo here are maybe 32 pixels at most), look at the big yellow area. (Check the filenames to see what each one does; "nowait" means starting the blitter without waiting for it to finish its work.)

I had assumed that the yellow chunks would be
a) smaller in terms of height and length
b) randomly spread over the screen

So what I still wonder most: why the hell does the blitter eat so much time? (As I said, it could be my code; I will show you later.)

Heaven/TQA, posted May 10, 2017 (edited May 10, 2017):
That's my hline blit object (not talking about the list):

    hline_bcb:
        .long $000000   ; source address (24-bit)
        .word 0         ; source step y
        .byte 0         ; source step x
        .long $010000   ; destination address (24-bit)
        .word 256       ; dest. step y
        .byte -1        ; dest. step x
        .word 0         ; size x
        .byte 0         ; size y
        .byte $00       ; AND
        .byte $00       ; XOR
        .byte 0         ; collision AND
        .byte $00       ; zoom
        .byte 0         ; pattern
        .byte 0         ; control

And that's the render loop:

    render_scene
            ldy miny
            lda #$80        ; bank 0
            sta $d65d       ; cpu-vram access window at $4000
    polycol lda #4
            sta $4310       ; color (BCB+16, XOR mask)
            lda #$03        ; $000300 = $4300 bank #0
            sta $D651       ; blitter addr
    _drwply lda redge,y
            sta $4306       ; xpos (BCB+6, dest address low byte)
            sty $4307       ; ypos (BCB+7, dest address middle byte)
            sec
            sbc ledge,y
            bcc @+1
            ;sta $430c      ; sizex (BCB+12)
            lda #15
            sta $d01a
    @       lda $D653       ; wait until not-busy
            bne @-
            lda #1
            sta $D653       ; start blitter (draw span)
            sta $d01a
    _drw2   iny
            cpy maxy
            bne _drwply
    @       lda #$00        ; bank 0
            sta $d65d       ; cpu-vram access window at $4000
            rts

Heaven/TQA, posted May 10, 2017:
OK, significant speed improvements can be made by using blitter lists and by updating those huge lists with the blitter itself, so the CPU only prepares the base data.

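The "update the list with the blitter" step amounts to a strided copy: a packed table of bytes (for example the per-line x positions) is scattered into the same field of consecutive 21-byte BCBs. The Python model below only illustrates the data movement. The function names are invented, and mapping it onto a real blit (presumably a 1-byte-wide blit with source step y = 1 and dest step y = 21) is an assumption, not something spelled out in the thread.

```python
# Model of a blit that patches one field in every BCB of a list.

BCB_SIZE = 21
DEST_ADDR_LO = 6  # offset of the destination address low byte (xpos)
SIZE_X_LO = 12    # offset of the size x low byte


def strided_copy(vram, src, dst, rows, src_step, dst_step):
    """Copy one byte per row, stepping source and destination independently."""
    for row in range(rows):
        vram[dst + row * dst_step] = vram[src + row * src_step]


def patch_bcb_field(vram, table_addr, list_addr, field_offset, n_bcbs):
    """Scatter a packed byte table into one field of n_bcbs consecutive BCBs."""
    strided_copy(vram, table_addr, list_addr + field_offset,
                 n_bcbs, 1, BCB_SIZE)


if __name__ == "__main__":
    vram = bytearray(0x400)
    vram[0x300:0x303] = bytes([5, 6, 7])  # right-edge x for three lines
    patch_bcb_field(vram, 0x300, 0x000, DEST_ADDR_LO, 3)
    print([vram[i * BCB_SIZE + DEST_ADDR_LO] for i in range(3)])
```

One such patch blit per field (x position, width) would replace the per-span CPU stores in the render loop, leaving the CPU to fill the edge tables and start the list once.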