VladR Posted October 23, 2015 Share Posted October 23, 2015 I just downloaded the source code. It's super short ! Barely 3,900 lines (+ data), which is nothing for ASM. When browsing, it spans only 5 screens! Amazing, how little ASM code was enough to produce great games in those days ! I've been working backwards from a divide routine to duplicate the functionality of the Star Raiders "DIVIDE:" routine.I just created a loop that keeps subtracting the divisor from the dividend and increments the result each time it's successful. Once it fails the dividend becomes the remainder.But the DIVIDE: routine has some odd behavior. It limits the number of loops, plus it shifts the dividend and result each loop. A normal divide doesn't produce the desired results so it appears this cannot be duplicated by using log without additional changes to the code. Having looked at the code just briefly, the DIVIDE is called just twice - from inside CALCVH (which calculates xpos,ypos for the stars) - once to divide zpos/xpos and then to divide ypos/xpos. So, it should not be very difficult to adjust the code at those two places for the I/O differences to your method. Of course, it's called in a loop (CALCV1) for all stars (LDX NSTARS), hence it takes 51% of frame time. I'm pretty sure that approach could be replaced by a different one, while retaining the perspective effect. There are so many ways how to render the starfield from that perspective. What were you planning to do with the performance you hope to gain ? Just improve the framerate or add some effect ? Quote Link to comment Share on other sites More sharing options...
JamesD Posted October 23, 2015 Share Posted October 23, 2015 To begin with, explosions slow down the game and changing the divide to log lookups should eliminate that.It also leaves extra cycles for other mods. Oh, and the extra speed makes it easier to port to other platforms. More time can be spend on software sprites. 1 Quote Link to comment Share on other sites More sharing options...
sup8pdct Posted October 24, 2015 Share Posted October 24, 2015 I found a version of starraiders on this disk that requires the basic cart to load and run. Don't know if someone has gone to a lot of trouble to get rom code and move it down lower in memory changing all pointers etc or it was a disk or cassette version (was one made?) There are other what it thought were Cart only based games on this disk as well. Note the Dos version on this disk is.. Dos 1. DISK_39B old games.ATR Quote Link to comment Share on other sites More sharing options...
JamesD Posted October 24, 2015 Share Posted October 24, 2015 There is one other little issue with log that hasn't been mention yet. Now, I haven't had to use LOG for 30 years, but it seems to me it's normally a decimal number and the game uses signed magnitude numbers. Quote Link to comment Share on other sites More sharing options...
JamesD Posted October 24, 2015 Share Posted October 24, 2015 (edited) Using X for the loop counter instead of TEMP4 in DIVIDE: saves about 18 clock cycles per call if my calculations are right.(DEC 5 cycles - DEX 2 cycles = 3 cycles * 7 loops = 21 - 3 for LDX TEMP4)So when there is an explosion it saves 12 normal stars + 32 explosion stars or 44 stars * 18 clock cycles each = 792 clock cycles? There might be a few other places where little stuff like this could be done but all of them offer small returns.I suppose once added up they could make all the difference that is needed on the explosions. Then we wouldn't have to worry about issues with LOG and that's still an option. DIVIDE: ; A = (TOP/BOTTOM)X80 ; LDA #$00 ;CLEAR THE RESULT STA TEMP3 stx TEMP4 ldx #$07 ; LDA #$07 ; NUMBER OF SHIFTS ; STA TEMP4 ; SHIFT 0 INTO THE MSBIT LSR TEMP1 ; TOP NUMBER ROR TEMP LDA DISFLG ; FRONT OR BACK ? BNE DIVID1 ; BACK LDA XPOSH,X ; BOTTOM NUMBER LSR A STA PNTR+1 LDA XPOSL,X ROR A STA PNTR JMP DIVID2 DIVID1: SEC LDA #$00 SBC XPOSL,X STA PNTR LDA #$00 SBC XPOSH,X LSR A STA PNTR+1 ROR PNTR ; DIVID2: ASL TEMP3 ; SHIFT RESULT SEC ; SUBTRACT BOTTOM FROM TOP LDA TEMP SBC PNTR TAY LDA TEMP1 SBC PNTR+1 BCC DIVID3 ; BOTTOM GREATER THAN TOP ; TOP LARGER STA TEMP1 ; STORE REMAINDER STY TEMP INC TEMP3 ; ADD 1 TO RESULT DIVID3: ASL TEMP ; SHIFT TOP ROL TEMP1 BCC DIVID4 ; IF TOP IS GREATER THN BOTTOM THEN OVERFLOW LDA #$FF ; MAX VALUE TO RESULT ldx TEMP4 RTS DIVID4: dex ; next bit ; DEC TEMP4 ; NEXT BIT BPL DIVID2 LDY TEMP3 ; RESULT IN Y LDA PTAB,Y ; MULTIPLY BY 80 (PTAB) ldx TEMP4 DIVID5: ; ENTRY POINT FROM THINK *: RTS Edited October 24, 2015 by JamesD Quote Link to comment Share on other sites More sharing options...
Rybags Posted October 24, 2015 Share Posted October 24, 2015 I've barely looked at the code... how's other stuff like efficiency of routines that actually draw/erase stars? Quote Link to comment Share on other sites More sharing options...
VladR Posted October 24, 2015 Share Posted October 24, 2015 I've barely looked at the code... how's other stuff like efficiency of routines that actually draw/erase stars? Well, from the profiling (see page 3 of thread), the top 3 things are: 51% - DIVIDE 2.8 % - Store HPos 2.8 % - Store VPos So, I wouldn't really worry about efficiency of drawing at this point, as there is 1 method that takes over half of each frame (and way more than that during explosions). Once the DIVIDE has been refactored to take under, say, 25% of frame time, then it should make sense to check rendering. 1 Quote Link to comment Share on other sites More sharing options...
VladR Posted October 24, 2015 Share Posted October 24, 2015 Using X for the loop counter instead of TEMP4 in DIVIDE: saves about 18 clock cycles per call if my calculations are right. (DEC 5 cycles - DEX 2 cycles = 3 cycles * 7 loops = 21 - 3 for LDX TEMP4) So when there is an explosion it saves 12 normal stars + 32 explosion stars or 44 stars * 18 clock cycles each = 792 clock cycles? There might be a few other places where little stuff like this could be done but all of them offer small returns. I suppose once added up they could make all the difference that is needed on the explosions. Then we wouldn't have to worry about issues with LOG and that's still an option. So, this would save 792 cycles, potentially - which is how much ? ~3% of frame time ?. These microoptimization only make sense if you are on the threshold of frame time, hence you could fit under current vbl. Try and see if this was the case. But I believe you would gain far, far more, if you analyzed the full range of input/output values for your divide method and fit it at those 2 places. You got the full dev env, you can build and test the source code at any time. Printing the values for those 12 stars sounds like a pretty fast and easy task. Then, adjusting your method should be straightforward. Right now - what's the rough clock difference between the reference and your DIVIDE ? I understand it does not - yet - account for the differences, but just curious. Quote Link to comment Share on other sites More sharing options...
JamesD Posted October 24, 2015 Share Posted October 24, 2015 (edited) I've barely looked at the code... how's other stuff like efficiency of routines that actually draw/erase stars? Most critical code seems pretty well optimized. The problem is, without knowing what the code is doing it's impossible to rewrite it in a better way which is where real savings usually comes from. If the explosion could be updated half in one frame and half in another you might be able to eliminate the problem. Always update the first 12 stars and only 16 explosion stars per frame? So, this would save 792 cycles, potentially - which is how much ? ~3% of frame time ?. These microoptimization only make sense if you are on the threshold of frame time, hence you could fit under current vbl. Try and see if this was the case. But I believe you would gain far, far more, if you analyzed the full range of input/output values for your divide method and fit it at those 2 places. You got the full dev env, you can build and test the source code at any time. Printing the values for those 12 stars sounds like a pretty fast and easy task. Then, adjusting your method should be straightforward. Right now - what's the rough clock difference between the reference and your DIVIDE ? I understand it does not - yet - account for the differences, but just curious. I'd be shocked if it were even 3% savings to be honest. I just ran across it while working with the code. There is one potential savings related to values that I've already seen. This is a 16/16 bit divide and the divide returns a max value of $FF. So any divisor less than 257 ($FFFF/$FF) automatically results in $FF being returned and the loops can be skipped. But if that doesn't occur very often, we cost more cycles... which that little optimization should offset so we are no worse off. However, thanks to the weirdness in the divide I haven't figured out, 257 might not actually be the magic number. *edit* That might also allow some simplification of the loop but I hadn't really thought about that yet. Also, the divisor being larger than the dividend should also result in a quick exit. Edited October 24, 2015 by JamesD Quote Link to comment Share on other sites More sharing options...
VladR Posted October 24, 2015 Share Posted October 24, 2015 Well, tough slough ahead of you. We all know how those final routines come to be. The Great Iteration List: 1. Yay, I got the DIVIDE thing working ! 2. Hmm, this seems unnecessary. If I'll just alter the values in a layer above, I could shave off 2 ops ! 3. And I could do the same in this layer right here and reuse the code for something completely unrelated ! 4. This is cool, who would have thought you could abuse division to do also this little thing ? 6. Holy crap,if I modify this right here, I could also use this for this new feature ! I'll keep it disabled for now, but I'll surely return to it later to enable it ! 7. 2 weeks later - Uhm, why did I call it divide again ? One beautiful summer morning, I am going to update the comments. No, really - I will ! 1 Quote Link to comment Share on other sites More sharing options...
JamesD Posted October 24, 2015 Share Posted October 24, 2015 Well, tough slough ahead of you. We all know how those final routines come to be. The Great Iteration List: 1. Yay, I got the DIVIDE thing working ! 2. Hmm, this seems unnecessary. If I'll just alter the values in a layer above, I could shave off 2 ops ! 3. And I could do the same in this layer right here and reuse the code for something completely unrelated ! 4. This is cool, who would have thought you could abuse division to do also this little thing ? 6. Holy crap,if I modify this right here, I could also use this for this new feature ! I'll keep it disabled for now, but I'll surely return to it later to enable it ! 7. 2 weeks later - Uhm, why did I call it divide again ? One beautiful summer morning, I am going to update the comments. No, really - I will ! There's a reason I add comments as I write my code. It never gets done otherwise. I just have to write a piece of code that calls the divide with every possible value combination and keep track of the max or min value that returns $FF. Then it's just a matter of adding range checking to the code. I have a hunch where the shifts and multiply come from but I'll need to do some testing to be sure. Quote Link to comment Share on other sites More sharing options...
playermissile Posted October 24, 2015 Share Posted October 24, 2015 There is one other little issue with log that hasn't been mention yet. Now, I haven't had to use LOG for 30 years, but it seems to me it's normally a decimal number and the game uses signed magnitude numbers. The property of logs that turns division into subtraction is: log(a/b) = log(a) - log(b) but one of the interesting things about logs is that it doesn't matter what base you use. Typically, to get the division result it's written: a/b = exp(log(a) - log(b) but the exp can be replaced with whatever power function. E.g., in python using base 5: >>> a=123455. >>> b=4545. >>> a/b 27.162816281628164 >>> math.log(a/b, 5) 2.0515541258596843 >>> math.log(a,5) - math.log(b,5) 2.0515541258596848 >>> math.pow(5, math.log(a,5) - math.log(b,5)) 27.162816281628185 So the trick would be to use some base that would maximize the log tables into 16 bit values. And there's no reason that it has to be floating point; it could be fixed point. The tables will probably be pretty sparse because there won't be enough memory to include every possible log value, so eventually there would have to also be an interpolation routine, but as a first cut just using the closest log value would give a proof of concept. You're right in that log only works with positive values, but since it's division you can just pull the signs out and deal with them at the end. Take the logs of the absolute value and If the signs were the same then the result is positive; if they were different then the result is negative. It looks like the sign is stored in a separate byte which makes things easier. Let's see if I have this right: XPOS appears to be the depth into the screen. Do you have a sense of how wide the range of XPOS values is? Is it using the whole range 0000 - FFFF or is it smaller? I'm guessing it's two's compliment since the initial code comment says that -1 is FFFF (with a zero sign byte) and 0 is 0000 with a 1 sign byte. Do you have an instrumented division routine or could you post a few sample inputs/outputs? I'm sure I can do the algorithm work in python and eventually generate a table of 16bit values and the 6502 assembly code for the table lookups that could be imported into CA65. Quote Link to comment Share on other sites More sharing options...
JamesD Posted October 24, 2015 Share Posted October 24, 2015 (edited) I was expecting to have to use fixed point math. I haven't run any tests on the range of values yet and I'm not real familiar with most of the code.I can tell you the dividend/divisor * 80 <= 255 since the results are a byte. I haven't even verified it's actually multiplying by 80 in that table though.If it is, then we can divide both sides of the equation by 80 and 255/80 = 3.1875 so dividend/divisor <= 3.1875. That sounds fishy to me. If that's true, the loop could be limited to 3 or 4 instead of 7 and we might even be able to simplify the loop further and some sort of table lookup that doesn't even require LOG starts looking possible. Edited October 24, 2015 by JamesD Quote Link to comment Share on other sites More sharing options...
phaeron Posted October 24, 2015 Share Posted October 24, 2015 Best I can tell, Star Raiders is doing a textbook perspective projection: view_width = tan(field_of_view_angle / 2) screen_X = screen_center_X + view_Y / view_X * view_width screen_Y = screen_center_Y + view_Z / view_X * view_width The object positions are stored in view space (relative to the camera), instead of world space. CALCVH drives this by first culling out objects behind the camera (X<0 or X>=0 depending on fore/aft view) and then computes Y/X and Z/X. CALCVH computes abs(Y) and abs(Z) to pass in to DIVIDE, while DIVIDE itself computes abs(X) based on the fore/aft view setting. The division routine is a restoring 16/16 divide that calculates floor(a/b*128); it cranks out one bit at a time by binary long division. The result of the division is then looked up in PTAB, which computes y = (x*81)/256. This scales the division result by the view width, which is set by SCPTAB; this determines the field of view. As shipped, the DIVIDE routine takes ~450 cycles per divide. By unrolling the loop and converting ASL + maybe INC to ROL I got it down to ~360 cycles. Note that the two divides per object use the same divisor, which an improved algorithm might be able to take advantage of. Unfortunately, when I profiled the game the first time, I failed to notice that the rotation routine doesn't run if you aren't moving the joystick. Updated profile attached. ROHELP takes about a fifth of the frame when both axes are rotating. For the Y rotation, it is called twice to compute a rotation around Z as follows: Y' = Y + X/64 X' = X - Y'/64 This is a rotation of about 0.9 degrees per frame (atan(1/64)). Interestingly, the /64 divide is dithered using RANDOM, and so it is non-deterministic. Note that the second calculation uses the result of the first, which is a hack to approximate a circle (look up "Minsky rotation"). 3 Quote Link to comment Share on other sites More sharing options...
JamesD Posted October 25, 2015 Share Posted October 25, 2015 (edited) Best I can tell, Star Raiders is doing a textbook perspective projection: view_width = tan(field_of_view_angle / 2) screen_X = screen_center_X + view_Y / view_X * view_width screen_Y = screen_center_Y + view_Z / view_X * view_width The object positions are stored in view space (relative to the camera), instead of world space. CALCVH drives this by first culling out objects behind the camera (X<0 or X>=0 depending on fore/aft view) and then computes Y/X and Z/X. CALCVH computes abs(Y) and abs(Z) to pass in to DIVIDE, while DIVIDE itself computes abs(X) based on the fore/aft view setting. The division routine is a restoring 16/16 divide that calculates floor(a/b*128); it cranks out one bit at a time by binary long division. The result of the division is then looked up in PTAB, which computes y = (x*81)/256. This scales the division result by the view width, which is set by SCPTAB; this determines the field of view. As shipped, the DIVIDE routine takes ~450 cycles per divide. By unrolling the loop and converting ASL + maybe INC to ROL I got it down to ~360 cycles. Note that the two divides per object use the same divisor, which an improved algorithm might be able to take advantage of. Unfortunately, when I profiled the game the first time, I failed to notice that the rotation routine doesn't run if you aren't moving the joystick. Updated profile attached. ROHELP takes about a fifth of the frame when both axes are rotating. For the Y rotation, it is called twice to compute a rotation around Z as follows: Y' = Y + X/64 X' = X - Y'/64 This is a rotation of about 0.9 degrees per frame (atan(1/64)). Interestingly, the /64 divide is dithered using RANDOM, and so it is non-deterministic. Note that the second calculation uses the result of the first, which is a hack to approximate a circle (look up "Minsky rotation"). Shouldn't the division loop 16 times for a 16/16 restoring divide? And how does a multiply by 128 end up in one byte? Edited October 25, 2015 by JamesD Quote Link to comment Share on other sites More sharing options...
phaeron Posted October 25, 2015 Share Posted October 25, 2015 Shouldn't the division loop 16 times for a 16/16 restoring divide? And how does a multiply by 128 end up in one byte? Nope, only 8 iterations are needed since only an 8-bit result is produced. It's a 16/16=8 routine. The multiply by 128 happens before the result is truncated to integer and is just a result of the way the divisor is aligned against the dividend when the algorithm runs. If you're simulating it in integer arithmetic, it would be (a*128)/b instead of floor(a/b*128). Quote Link to comment Share on other sites More sharing options...
JamesD Posted October 25, 2015 Share Posted October 25, 2015 Nope, only 8 iterations are needed since only an 8-bit result is produced. It's a 16/16=8 routine. The multiply by 128 happens before the result is truncated to integer and is just a result of the way the divisor is aligned against the dividend when the algorithm runs. If you're simulating it in integer arithmetic, it would be (a*128)/b instead of floor(a/b*128). Well the comments say: ; A = (TOP/BOTTOM)X80 80 in hex is 128, which I wondered about. But the multiply is supposedly at the end of the code if you listen to the comments. LDY TEMP3 ; RESULT IN Y LDA PTAB,Y ; MULTIPLY BY 80 (PTAB) Last I checked, A is a byte so the result of the multiply by 80 has to be a byte. If you multiply any byte sized integer over 1 you'll overflow that byte. (2 * 128 = 256) That kind of defeats the purpose. Hence my confusion As for your analysis of the upper code (perspective projection), I'd say that's accurate. The divide is what threw me. Quote Link to comment Share on other sites More sharing options...
Heaven/TQA Posted October 25, 2015 Share Posted October 25, 2015 As it is all 'text book' 3D stars....all optimisations can be applied what we demo coders do Quote Link to comment Share on other sites More sharing options...
JamesD Posted October 25, 2015 Share Posted October 25, 2015 As it is all 'text book' 3D stars....all optimisations can be applied what we demo coders do There are plenty of loops that can be unrolled. Quote Link to comment Share on other sites More sharing options...
Heaven/TQA Posted October 25, 2015 Share Posted October 25, 2015 (edited) just quick sunday morning hack of the original ca65 source now for MADS... should work (but not playtested). replace .org with ORG all damned LSR A, ASL a, ROR A, ROL A with LSR, ASL, etc... all .RES into .DS exchanged the wordtab to assemble correct internal screen byte text: .byte $xx,d"string" put that on top opt h-f+ ; @com.wudsn.ide.asm.outputfileextension=.bin so it assembles a cart file. and put the DS segments which are at $2xx- $3xx from end of the source before the game code at $axxx so all segments are in ascending order... that's it... StarRaiders_mads.zip Edited October 25, 2015 by Heaven/TQA 1 Quote Link to comment Share on other sites More sharing options...
Heaven/TQA Posted October 25, 2015 Share Posted October 25, 2015 (edited) btw. I borrowed from Star raiders for an unreleased project of mine the "shoot" sprite animation (before I got the source code)... the shoots are basicly the asteroid sprite EORed with RANDOM.... clever. ah no... it was AND... https://youtu.be/U9adnZlhYt8 ;animate test shoot shootapproaching lda #$7f sta $d200 vol1 lda #$aa sta $d201 lda #$cc sta $d202 vol2 lda #$aa sta $d203 lda #$ff sta $d204 vol3 lda #$ab sta $d205 lda #$87 sta $d206 vol4 lda #$cb sta $d207 lda #$00 sta $d208 ;erase old jsr calc_shoot lda ex+1 sta 53249 lda 53770 and #7 ora #$70 sta $d013 ldy shootcount lax shoot_offset,y ldy #$0f @ lda sprite_data,x and 53770 sprite_y sta p0+256,y inx dey bpl @- Edited October 25, 2015 by Heaven/TQA 1 Quote Link to comment Share on other sites More sharing options...
Heaven/TQA Posted October 25, 2015 Share Posted October 25, 2015 Evil I am using LAX 1 Quote Link to comment Share on other sites More sharing options...
Kyle22 Posted October 25, 2015 Share Posted October 25, 2015 Evil I am using LAX Here we go again Quote Link to comment Share on other sites More sharing options...
JamesD Posted October 25, 2015 Share Posted October 25, 2015 Evil I am using LAX So you want a cookie or something? Quote Link to comment Share on other sites More sharing options...
Heaven/TQA Posted October 25, 2015 Share Posted October 25, 2015 Nope... When copied the Sprite routine I was wondering why I used the LAX there in a simple routine... And remembered our discussions... Just a joke and off topic. 1 Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.