artrag Posted April 5, 2021 Share Posted April 5, 2021 (edited) Do you know if CPU is capable of moving 190 words in a frame ? I'm trying to optimize this code in order to stay within a frame, but it seems that the problem is the number of moved words itself... plotscreen: procedure xA = ((#wX/4) and 1) yA = ((#wY/4) and 1) tX = ((#wX/8) % 9) tY = ((#wY/8) % 9) mX = (#wX/8/9) % wMAP my = (#wY/8/9) % hMAP #j = (xA + yA*2)*(9*9) ' first line imap = 0+mX + (0+mY)*wMAP #i = map(imap) + #j SCREEN #xyrooms,#i + (tX + tY*9) , 0 ,9-tX,9-tY,9 imap = (1+mX)%wMAP + (0+mY)*wMAP #i = map(imap) + #j SCREEN #xyrooms,#i + tY*9 , 0 + 9-tX ,9 ,9-tY,9 imap = (2+mX)%wMAP + (0+mY)*wMAP #i = map(imap) + #j SCREEN #xyrooms,#i + tY*9 , 0 + 18-tX ,tX+1,9-tY,9 ' second line imap = 0+mX + ((1+mY)%hMAP)*wMAP #i = map(imap) + #j SCREEN #xyrooms,#i + tX , 0 + (9-tY)*20 ,9-tX,1+tY,9 imap = (1+mX)%wMAP + ((1+mY)%hMAP)*wMAP #i = map(imap) + #j SCREEN #xyrooms,#i , 9-tX+(9-tY)*20,9 ,1+tY,9 imap = (2+mX)%wMAP + ((1+mY)%hMAP)*wMAP #i = map(imap) + #j SCREEN #xyrooms,#i ,18-tX+(9-tY)*20,tX+1,1+tY,9 end For your fun, the snippets is used in the demo in attach... demo.cfg demo.bin Edited April 5, 2021 by artrag 1 Quote Link to comment Share on other sites More sharing options...
+DZ-Jay Posted April 6, 2021 Share Posted April 6, 2021 (edited) Hi, @artrag, There are about 14.4K CPU cycles between interrupt events, from which the IntyBASIC runtime takes about 4.8K. You still have about 9.7K cycles to play with in your program's main engine. The SCREEN statement uses the assembly language routine CPYBLK2, excerpted below from an assembly listing file. I've marked it up with the CPU cycle cost per instruction, and noted the loops and their relative costs. ; Initial set-up ; ------------------- 6765 02B8 E302 MVII #<LABEL>,R0 ; 8 6767 0270 PSHR R0 ; 9 6768 02B8 0200 MVII #512,R0 ; 8 676A 0270 PSHR R0 ; 9 676B 02B8 0001 MVII #1,R0 ; 8 676D 0270 PSHR R0 ; 9 676E 02B8 000C MVII #12,R0 ; 8 6770 0270 PSHR R0 ; 9 6771 02B8 0001 MVII #1,R0 ; 8 6773 0004 0170 01CF CALL CPYBLK2 ; 12 ; ---- ; 88 ;;;;;;;;;;;;;;;; 0x71CF CPYBLK2: PROC ; 71CF 0083 MOVR R0,R3 ; 6 71D0 00AA MOVR R5,R2 ; 6 71D1 02B0 PULR R0 ; 11 71D2 02B1 PULR R1 ; 11 71D3 02B5 PULR R5 ; 11 71D4 02B4 PULR R4 ; 11 71D5 0272 PSHR R2 ; 9 71D6 010B SUBR R1,R3 ; 6 ; --- ; 71 71D7 0273 @@1: PSHR R3 ; 9 <----------, 71D8 008B MOVR R1,R3 ; 6 | 71D9 02A2 @@2: MVI@ R4,R2 ; 8 <----, | 71DA 026A MVO@ R2,R5 ; 9 |32 | 71DB 0013 DECR R3 ; 6 | | 71DC 022C 0004 BNE @@2 ; 7/9 <--' | 71DE 02B3 PULR R3 ; 11 | 71DF 00DC ADDR R3,R4 ; 6 | 71E0 010D SUBR R1,R5 ; 6 | 71E1 02FD 0014 ADDI #20,R5 ; 8 | 71E3 0010 DECR R0 ; 6 | 71E4 022C 000E BNE @@1 ; 7/9 <--------' ; 71E6 02B7 RETURN ; 11 ENDP ; As depicted above, the call to CPYBLK2 takes 88 cycles to set it up, then it has a preamble of 71 cycles to prepare the arguments for the copy loop. The copy loop itself takes about 32 cycles for each column, and each row adds about 50 more. The biggest overhead is in managing the inner and outer loops. Each decrement/test/branch sequence takes 15 cycles, so it depends on how your block is organized. If you have a single contiguous set of words, the copying loop itself will take about 6K cycles (32 x 190) just to move 190 words from ROM to BACKTAB. Adding the overhead, that comes to about a bit less than 6.3K cycles, so clearly there is capacity to do so. From a cursory glance at your code above, it seems that the expression computations on each statement may add an inordinate amount of processing. For example, take the following expression: imap = 0+mX + ((1+mY)%hMAP)*wMAP It assembles into the following code. Again, I've marked it up as before. 0x50BE label_FOO: ;[636] imap = 0+mX + ((1+mY)%hMAP)*wMAP SRCFILE "foo.bas",636 50BE 0280 0199 MVI var_MX,R0 ; 10 50C0 0281 019B MVI var_MY,R1 ; 10 50C2 0009 INCR R1 ; 6 50C3 0284 0170 MVI var_HMAP,R4 ; 10 50C5 00A4 TSTR R4 ; 6 50C6 0204 0004 BEQ T1 ; 7/9 <--------, 0x50C8 T2: ; | 50C8 0121 SUBR R4,R1 ; 6 <-----,15 | 50C9 0221 0002 BC T2 ; 7/9 <---' | 50CB 00E1 ADDR R4,R1 ; 6 | 0x50CC T1: ; | 50CC 0285 0173 MVI var_WMAP,R5 ; 10 <----------' 50CE 01E4 CLRR R4 ; 6 50CF 0006 CLRC ; 6 50D0 0071 RRC R1,1 ; 6 50D1 0204 0007 BEQ T4 ; 7/9 <--------, 0x50D3 T3: ; | 50D3 0209 0001 BNC T5 ; 7/9 <,--, | 50D5 00EC ADDR R5,R4 ; 6 | | | 0x50D6 T5: ; | |34 | 50D6 00ED ADDR R5,R5 ; 6 <--' | | 50D7 0079 SARC R1,1 ; 6 | | 50D8 022C 0006 BNE T3 ; 7/9 <---' | 0x50DA T4: ; | 50DA 0209 0001 BNC T6 ; 7/9 <--------' 50DC 00EC ADDR R5,R4 ; 6 0x50DD T6: ; 50DD 00A1 MOVR R4,R1 ; 6 50DE 00C8 ADDR R1,R0 ; 6 50DF 0240 014F MVO R0,var_IMAP ; 11 ; --- ; 171 cycles That's a floor of 171 cycles if it just runs through without looping. The modulo is performed by repeated subtraction, requiring 15 cycles per iteration; and the multiplication by variable is done by repeated addition, and costs 34 cycles per iteration. At over 200 cycles each, those are some rather expensive expressions, and you have six of them, potentially consuming a few thousand cycles by themselves. Still I wonder if the problem is not the window of time, but the actual timing itself. If you are experiencing "screen tearing" -- i.e., parts of the screen being updated at different moments, making it look jerky (which I noticed in your demo), then perhaps the problem is that the updates to the BACKTAB are missing the STIC pre-fetch operation that loads the internal screen buffer. In other words, you may be copying all the data before the next frame, but parts of it happened after the STIC already read them for the current frame. To avoid this, it is necessary to make sure your copy to BACKTAB starts as early as possible, preferably still during the VBLANK2 period, but at least right after, and that it goes from left to right, top to bottom, at a quick pace. I believe the BUSRQ requests from the STIC come at 1ms intervals for each BACKTAB row. If your copying calculations are expensive, then you will need to start much earlier to keep up. Alternatively, you could try to pre-compute all your pointers in a table and just do a quick direct copy. If you send me a copy of the assembler output files (LST, SYM, MAP), I could try using the debugger to see where the bottleneck lies. -dZ. P.S. Nice demo ... I know where that is going and I can't wait to see how it develops! ? Edited April 6, 2021 by DZ-Jay 3 Quote Link to comment Share on other sites More sharing options...
+DZ-Jay Posted April 6, 2021 Share Posted April 6, 2021 (edited) A few more things. If your BACKTAB updates are done in "zones" -- that is, not as a single rectangle, but as a series of smaller rectangles -- then that could be part of the problem. Because the STIC pre-fetch follows the order of the BACKTAB array, copying data below it will succeed in making it to the current frame, but data copied from above it will not. You stand a higher chance of including all cards by making sure you copy all the cards in a single row at a time before moving downwards. You have about 3.5K cycles before the next BUSRQ from the STIC to fetch the next row. Also, consider that as efficient as the SCREEN statement is, it still incurs an overhead cost of about 160 CPU cycles just to get it started. The statement is optimized for block-copying, but repeated calls to it may just drag you down. -dZ. Edited April 6, 2021 by DZ-Jay Quote Link to comment Share on other sites More sharing options...
+nanochess Posted April 6, 2021 Share Posted April 6, 2021 Modulo is an expensive operation in terms of time. The second half of the code recalculates several things unnecessarily: imap = 0+mX + ((1+mY)%hMAP)*wMAP #i = map(imap) + #j SCREEN #xyrooms,#i + tX , 0 + (9-tY)*20 ,9-tX,1+tY,9 imap = (1+mX)%wMAP + ((1+mY)%hMAP)*wMAP #i = map(imap) + #j SCREEN #xyrooms,#i , 9-tX+(9-tY)*20,9 ,1+tY,9 imap = (2+mX)%wMAP + ((1+mY)%hMAP)*wMAP #i = map(imap) + #j SCREEN #xyrooms,#i ,18-tX+(9-tY)*20,tX+1,1+tY,9 I would do it as: IF mY = hMAP - 1 THEN mY = 0 ELSE mY = mY + 1 imap = mX + mY * wMAP #i = map(imap) + #j SCREEN #xyrooms,#i + tX , 0 + (9-tY)*20 ,9-tX,1+tY,9 IF mX = hMAP - 1 THEN imap = imap - (hMAP - 1) ELSE imap = imap + 1 #i = map(imap) + #j SCREEN #xyrooms,#i , 9-tX+(9-tY)*20,9 ,1+tY,9 IF mX = hMAP - 2 THEN imap = imap - (hMAP - 1) ELSE imap = imap + 1 #i = map(imap) + #j SCREEN #xyrooms,#i ,18-tX+(9-tY)*20,tX+1,1+tY,9 2 Quote Link to comment Share on other sites More sharing options...
artrag Posted April 6, 2021 Author Share Posted April 6, 2021 (edited) Thanks for the suggestions. I've pre-computed part of the pointers in a separate frame and now frame tearing seems more acceptable. I've attached BAS, BIN, ASM and LST and sources for your curiosity cloudy montain reboot.rar Edited April 7, 2021 by artrag 3 Quote Link to comment Share on other sites More sharing options...
+DZ-Jay Posted April 7, 2021 Share Posted April 7, 2021 (edited) OK, I see there is less tearing, but it is still there. I'll investigate in the debugger. P.S. You forgot the MAP file that maps the source files and symbol tables to the binary for source-level debugging, and since you didn't include all the files (which I understand), I can't build it myself. No big deal, I can follow it with the LST by eye. :) -dZ. Edited April 7, 2021 by DZ-Jay Quote Link to comment Share on other sites More sharing options...
carlsson Posted April 7, 2021 Share Posted April 7, 2021 How expensive is the (9-tY)*20? From the source Artrag posted, it seems to be calculated thrice in a row without tY changing. Perhaps it doesn't change much in the wider scheme but if you have a byte to spare perhaps see if it makes any difference. There is also tY*9 thrice, perhaps use the same temp byte for both calculations, though I would assume that Artrag already is seasoned enough to recognize that. Quote Link to comment Share on other sites More sharing options...
+DZ-Jay Posted April 7, 2021 Share Posted April 7, 2021 (edited) I took a quick look at the history file between calls to "plotscreen" and I see that what is killing you is precisely what I thought: you are copying the screen in regions, and by the time you complete one region and start the next one at the top, the raster has already passed and the STIC missed those changes. Adding to that that you should have plenty of time to copy the entire screen: the call to plotscreen comes in at about 4,400 cycles after the VBLANK interrupt arrived, and the first BUSRQ from the STIC to pre-fetch the first row doesn't come in until you are almost half-way done with the first region. However, by the time you finish that first region, the STIC has already passed the first rows, so you cannot catch up. This is why your screen tearing rips in columns. -dZ. Edited April 7, 2021 by DZ-Jay Quote Link to comment Share on other sites More sharing options...
+DZ-Jay Posted April 7, 2021 Share Posted April 7, 2021 (edited) I had a similar problem in the Voyage @Party demo, where I had to copy four blocks of BACKTAB after moving large chunks of GRAM, and the way I solved the problem is quite simple in theory, but involved in practice: you need to copy your blocks from top to bottom. For instance, in my case, the regions were: 11111111 22222222 11111111 22222222 11111111 22222222 11111111 22222222 11111111 22222222 11111111 22222222 11111111 22222222 11111111 22222222 33333333 44444444 33333333 44444444 33333333 44444444 33333333 44444444 33333333 44444444 33333333 44444444 33333333 44444444 33333333 44444444 Instead of copying region #1 completely as a rectangle, I copy only the first row, then the first row of region #2; followed by the second row of #1 and the second row of #2; and so on. Once blocks #1 and #2 were done, I had to do the same for the bottom blocks. The idea is to stay ahead of the raster, which is preceded by the STIC pre-fetching cards to build its row buffer; and always fill the screen from left-to-right, top-to-bottom. It requires reassessing your algorithm and organizing your data differently, but it is effective. -dZ. Edited April 7, 2021 by DZ-Jay 1 Quote Link to comment Share on other sites More sharing options...
+DZ-Jay Posted April 7, 2021 Share Posted April 7, 2021 (edited) Alternatively, you'll need to start much earlier in order to complete the first region and still be on time to start the second region before the STIC gets to it. That will require you to shave about 1.6K cycles. It's not impossible: the set up to the CPYBLK2 routine takes about 320 cycles by itself (~640 if you count the two regions), and the IntyBASIC ISR consumes about 625 in copying the STIC shadow and 380 cycles in the randomizer; so a custom ISR that just copies the BACKTAB using a specialized block-copy routine may gain you back those cycles. You would have to dedicate one frame ISR just for BACKTAB copying, which would reduce the resolution of MOB updates, GRAM animations, etc. Another alternative is to make your regions shallower by reducing the number of rows. The CPU history shows that the first BUSRQ interrupt arrives just when you are finishing copying the 6th row of the first region, right in the middle of setting up the loop for the next row. Conceivably, making the blocks 5 rows would gain you about 350 cycles, which may be enough to cover the overhead of a call to SCREEN for the next region. However, that would work for two adjacent regions; a third one would be out of range. Personally, my recommendation would be, in order of preference: Pre-compute a full screen region rectangle outside the critical path (perhaps filling up a buffer after computing the frame). Modify your loading algorithm to copy one row from each adjacent region at a time, to ensure each BACKTAB row is configured before the next. Reduce the depth (number of rows) of the regions so that adjacent ones can be reached before the STIC gets to them. Use a custom ISR and block-copy routines on special frames to optimize the loading of BACKTAB words. #1 may not be feasible because, by the time you add all trappings of a game (e.g., MOB handling, animations, music, enemy A.I., etc.), there may not be enough time within a frame to pre-calculate a large chunk of BACKTAB. It would also require a large chunk of RAM for a screen buffer, which may or may not be available. That said, I would always place full pre-computation as a first option to consider because, whenever they are possible, they can be very fast. #2 may be more practical, depending on your program requirements. The key insight here, as someone told me once, is that you do not need to beat the VBLANK, nor do you have to copy everything in one go -- all you need to do is stay slightly ahead of the raster. Remember, the raster takes the entirety of a game frame to build a field, so there are many opportunities to achieve this. Your algorithms and data structures should be designed with this point as the critical factor. #3 would only be practical if you had at most two adjacent regions, but you have three. The more you copy, the more cycles you use, and the more chances of missing the raster when you go back up to copy more. #4 would also only get you so far, plus it has a direct impact on the responsiveness and resolution of all your other game functions, since you are essentially "stealing" a frame just to copy BACKTAB. There is still much to be gained by using a specialized block-copy routine optimized for your specific use case, so that may still be a good idea. Not included in the above is #5, "ignore the screen tearing and live with it," because I personally would never be satisfied with that, as I'm sure neither would you. However, it is an option, and it wouldn't be the first game to have it. *shrug* In the end, I think #2 is the best option for your purposes. I hope this helps. -dZ. Edited April 7, 2021 by DZ-Jay Quote Link to comment Share on other sites More sharing options...
+DZ-Jay Posted April 7, 2021 Share Posted April 7, 2021 (edited) 3 hours ago, carlsson said: How expensive is the (9-tY)*20? From the source Artrag posted, it seems to be calculated thrice in a row without tY changing. Perhaps it doesn't change much in the wider scheme but if you have a byte to spare perhaps see if it makes any difference. There is also tY*9 thrice, perhaps use the same temp byte for both calculations, though I would assume that Artrag already is seasoned enough to recognize that. The multiplications by constants are not very expensive, using the special macro @intvnut created for them. The expressions themselves do not add too much overhead, and there seems to be plenty of cycles to copy large parts of the BACKTAB. As I mentioned in my previous comments, the issue is related to the copying of multiple adjacent blocks to the BACKTAB, instead of copying full rows of it at a time. -dZ. Edited April 7, 2021 by DZ-Jay Quote Link to comment Share on other sites More sharing options...
+DZ-Jay Posted April 7, 2021 Share Posted April 7, 2021 (edited) For what it's worth, I will post below the code I used in the Voyage demo to load the BACKTAB by "zones." For reference, the BACKTAB data is stored in a table as contiguous data in the following format: @@arcia_backtab ; Format: ; DECLE <BACKTAB-ADDRESS>, <DATA-01>, <DATA-02>, <DATA-03>, ..., <DATA-24> DECLE BTAB_ADDRS(0, 0), $0807, $2c0f, $0c16, $081a, $0c26, $2c2f, $0c36, $083a, $0842, $0c4e, $0c56, $0c5e, $0862, $086a, $0872, $087a, $0882, $088a, $0892, $089f, $08a0, $08af, $08b7, $2cbf DECLE BTAB_ADDRS(4, 0), $0802, $0c0e, $0817, $0818, $0822, $082a, $0c37, $083e, $0846, $084a, $0c56, $085a, $0866, $0c6e, $0872, $087a, $0883, $088a, $0892, $0898, $0ca7, $08aa, $08b2, $08b8 DECLE BTAB_ADDRS(0, 6), $0802, $080e, $2c17, $081f, $1a20, $082a, $0c36, $0c3f, $1a40, $0848, $0853, $085b, $1a61, $1a68, $1a70, $1a78, $0881, $0889, $0891, $0899, $08a1, $08a9, $08b1, $08b9 DECLE BTAB_ADDRS(4, 6), $0c06, $080a, $0812, $0818, $0822, $082a, $0833, $0838, $0842, $084f, $0857, $0a5f, $1a60, $1a68, $0871, $0879, $0881, $0889, $0891, $0899, $08a1, $08a9, $08b1, $08b9 Each record represents a pre-computed 4 x 6 block (24 words) of BACKTAB, a "zone," and is preceded by the address where it will be placed (computed by a macro from the column and row). The problem that I was trying to solve was, how to display a 16 x 12 (96 cards) image on the BACKTAB using custom GRAM cards, while reserving a chunk of GRAM for other graphics and custom fonts on the screen. We settled on committing 24 GRAM cards to the image, which meant that it could not be rendered all at once. Because we could only load 1/4th the graphics data into GRAM for the image, we had to multiplex it across four blocks of 24 cards each. This requires each block to be erased before drawing the next one, to prevent it from showing the GRAM cards loaded for the new block. The loading of GRAM cards occurred separately, and is not germane to this particular problem, so I won't bother including it. The four BACKTAB blocks, or "zones" are organized as follows: 1111 2222 1111 2222 1111 2222 1111 2222 1111 2222 1111 2222 3333 4444 3333 4444 3333 4444 3333 4444 3333 4444 3333 4444 The original algorithm was pretty simple, starting at the first zone (top-left): Erase previous zone Load GRAM for current zone Block-copy data to BACKTAB for current zone Countdown until next draw event Move on to the next zone If this was the last zone, point to the first one again Go back to step #1 Even though the zones were small and we were copying directly from the ISR, using a simple block-copy routine like SCREEN does for each zone resulted in screen tearing. The solution I came up with was to interlace steps #1 and #2, one card a time, so as to process the BACKTAB rows themselves from left to right, top to bottom. For example, suppose we already loaded zone #1 and we are about to load zone #2. Instead of erasing zone #1 first (which is a block-copy of $0000 data), then block-copy zone #2, what we do is this: Set Previous pointer to zone #1 Set Current pointer to BACKTAB data table for next zone Read Target pointer from Current record. For each data word in record: Write $0000 to Previous. Read Target pointer from Current record Read BACKTAB data from Current record Write BACKTAB data to Target. Advance Previous pointer. Advance Current pointer. Advance Target pointer. Save Current pointer as Previous. If this was the last zone, reset pointers to start from zone #1 again. In other words, as we erase the first card of zone #1, we read the next data word for zone #2 and write it to its destination. We continue doing this for the total number of cards in a zone (4 x 6 = 24), so that when we are done, we have completed both. An excerpt of the relevant code is below, included as is. The symbols "ARCIA.PrevBlockBtab" and "ARCIA.CurrBlockBtab," are variables holding the previous and current zone pointers, respectively. "ARCIA.BlockCount" is a variable that holds how many blocks we've processed, i.e., the ordinal value of the current block. All variables are initialized to their starting values when this demo section starts, and updated after loading each zone. The copy look is unrolled for maximum speed. ; ------------------------------------- ; Draw the current block into BACKTAB ; ------------------------------------- @@__draw: CLRR R1 ; Initialize erase data (zero, of course!) MVI ARCIA.PrevBlockBtab, R3 ; Get destination address of previous block MVI ARCIA.CurrBlockBtab, R4 ; \_ Get destination address of current block MVI@ R4, R5 ; / MVO R5, ARCIA.PrevBlockBtab ; ... which will be our next "previous" block. MVII #(20-4), R2 ; Use register to avoid "immediate" mode costs. ; NOTE: ; For 3 out of 4 blocks, the previous ; block comes before the current one ; in the BACKTAB. To ensure we stay ; ahead of the STIC pre-fetch, we ; erase first, then draw. ; ; We also increment the pointer to ; the previous block after drawing, ; just in case... ; ------------------------------------- REPEAT(6) REPEAT(4) MVO@ R1, R3 ; Erase card on previous block MVI@ R4, R0 ; \_ Draw card on current block MVO@ R0, R5 ; / INCR R3 ; Advance erase block pointer (need more auto-incr regs!) ENDR ADDR R2, R5 ; \_ Skip to next row on both blocks ADDR R2, R3 ; / (using a register saves 2 cycles per iteration) ENDR MVO R4, ARCIA.CurrBlockBtab ; Update the backtab pointer with the next record MVI ARCIA.BlockCount, R0 ; \ DECR R0 ; > Is this the last block? BPL @@__done ; / No: We are done; Yes: Reset ; ------------------------------------- ; Reset pointers to first block ; ------------------------------------- @@__reset: MVII #(ARCIA.Blocks - 1), R0 ; Reset the block counter MVII #GRAM_DATA.arcia_pointers, R1 ; \_ Reset the GRAM pointer MVO R1, ARCIA.CurrCard ; / MVII #BACKTAB_DATA.arcia_backtab, R1 ; \_ Reset the BACKTAB pointer MVO R1, ARCIA.CurrBlockBtab ; / @@__done: MVO R0, ARCIA.BlockCount ; Save the updated block counter Perhaps a similar routine specialized for your use case would work for you, @artrag. Edited April 7, 2021 by DZ-Jay 1 Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.