Jump to content
IGNORED

divide by 3 and by 36


SvOlli

Recommended Posts

Hello!

 

For hacking some nice demo effect, I need to convert a sprite position (range 0-95) to a cpu cycle offset. This means division by 3. I also need to divide the range into three areas. For technical reasons, I chose 12 cycles = 36 "positions". So I need to divide by 36 as well.

 

How should I do this?

 

An idea would be to have a table of 96 bytes storing all possible results: the lower 0-4 bits contain the result of div3, bits 5-6 div36, so that I can AND out or LSR the desired result rather easy and fast. But that costs precious ROM. Creating the table in RAM also is no option, because I already need >80 bytes for other data.

 

Right now I'm stuck with a subtraction, counting how many times I can subtract the desired value like this:

   lda   value
   ldx   #$ff
   sec
loop36:
   inx
   sbc   #36
   bcs   loop36
   stx   div36
   
   lda   value
   ldx   #$ff
   sec
loop3:
   inx
   sbc   #3
   bcs   loop3
   stx   div3

This costs 24 bytes in total, but also costs quite some computing time, worst case should be 263 cpu cycles which is ~3.5 raster lines. Right now I can live with that, but my mind keeps banging if there's still an alternative worth checking out. One thing I found saved one byte:

   lda   value
   ldx   #$ff
   sec
loop3:
   inx
   sbc   #3
   bcs   loop3
   stx   div3

   txa
   ldx   #$ff
   sec
loop12:
   inx
   sbc   #12
   bcs   loop12
   stx   div36

Just divide by 3 and after that by 12. Gets to the same result, saves one byte by using "txa", but still needs the same amount of cycles.

 

For now I focused on the div3 since that's where most cpu time is spent on a bad case.

 

I looked around the internet already and found the idea of multiplying by 256/3 and throw away the low byte. That approach does not work entirely as the result sometimes gets "rounded up", which would be the death of my cycle exact code.

 

Multiplying by 3 is easy:

   lda   value
   asl
   clc           ; only needed if value could be > 127
   adc   value
   sta   mul3

So I was hoping either to come up wirh or find something similar like this for dividing by 3, but no luck yet.

 

Any ideas?

 

p.s.: i should be able modify my code on the 36 positions part to use there 33 positions instead, if someone got a better idea for dividing by 33 (or 11) instead of 36 (or 12). Other numbers are technically not possible.

  • Like 1
Link to comment
Share on other sites

Just thinking out loud here without testing it: The division into three areas can be done by manual comparisons (2 cmp instructions). If you do that first, then you could re-use the result for the division by 3, i.e. you only need a look-up table of size 12 (0, 3, ..., 33) and then add a base value depending in your previous cmp results (+0, +36, +72). This should be very fast and might be light on ROM.

 

EDIT: Scratch that, this is nonsense - I should refrain from "thinking out loud" and think things through before posting instead. :)

 

EDIT 2: Or maybe not. (I'm tired today...) After the manual comparisons for the div 36, you could subtract 36 or 72 from the sprite position if it's in area 2 or 3, and then use the smaller, 12 byte look-up table for the div 3 and add 12 or 24 again to the result.

Edited by Kylearan
Link to comment
Share on other sites

For hacking some nice demo effect, I need to convert a sprite position (range 0-95) to a cpu cycle offset.

 

You could do something like X * 1/4 + X * 1/16 + X * 1/64... or (X * 16 + X * 4 + X * 1) / 64 (rounding might become a problem).

 

But why would you have to do that division anyway?

Link to comment
Share on other sites

 

You could do something like X * 1/4 + X * 1/16 + X * 1/64... or (X * 16 + X * 4 + X * 1) / 64 (rounding might become a problem).

 

 

Yes, that was one of the approaches I tried and ran into rounding problems.

 

 

Divide by 3 (Omegamatrix)

;18 bytes, 30 cycles
  sta  temp
  lsr
  adc  #$15
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr

 

 

Nukey, thanks. I think that was one of the solutions I was looking for evaluation.

 

I found another shortcut

   lda   value
   sec
   ldx   #$ff
loop36:
   inx
   sbc   #36
   bcs   loop36
   stx   div36

   adc   #36
   ldx   #$ff
loop3:
   inx
   sbc   #3
   bcs   loop3
   txa
   ldx   div36
   adc   tab12,x ; tab12: .byte 0,12,24
   sta   div3

I'm using only the remainder of div36 for div3, to lower the number of loops needed to 12 max. This lower the number of cycles in the worst case from 263 to 129 by increasing ROM neede from 24 to 32. Fair tradeoff. This is something I can work with, if I don't figure out anything better.

 

But if I combine the subtract loop for div36 and omegamatrix' code for div3, I'll end up with 34 bytes and worst case of 65 cycles. Seems like the better deal.

Link to comment
Share on other sites

29 bytes, and 42 cycles (constant). Good for dividing values of 0 to 107.

;Divide by 3
;18 bytes, 30 cycles
  sta  temp
  lsr
  adc  #21
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr

;Continue routine for divide by 36
;11 bytes 12 cycles
  tax      ; X = div3
  lda  #0
  cpx  #12
  adc  #0
  cpx  #24
  adc  #0
           ; A = div36

More of my unsigned integer division routines can be found here.

 

  • Like 1
Link to comment
Share on other sites

I took a look, and for the the divide by 3 can be shortened for a limited range. So now it's 27 bytes and 40 cycles (constant), and valid over all for dividing values of 0 to 107.

;Divide by 3 (can be used for 0-128 range)
;16 bytes, 28 cycles
  sta  temp
  lsr
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr

;Continue routine for dividing by 36 (good for 0-107 only)
;11 bytes 12 cycles
  tax      ; X = div3
  lda  #0
  cpx  #12
  adc  #0
  cpx  #24
  adc  #0
           ; A = div36
Link to comment
Share on other sites

Another optimization, a trick I discovered long ago and forgotten for this particular situation. 26 bytes and 40 cycles.

;Divide by 3 (can be used for 0-128 range)
;16 bytes, 28 cycles
  sta  temp
  lsr
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr

;Continue routine for dividing by 36 (good for 0-107 only)
;10 bytes 12 cycles
  tax      ; X = div3
  lda  #0
  cpx  #12
  rol
  cpx  #24
  adc  #0
           ; A = div36
Link to comment
Share on other sites

And by switching A and X for the results, I can even optimize that to 24 bytes and 38 cycles:

; prepare (division code suggested by omagematrix)
   lda   xpos
;Divide by 3 (can be used for 0-107 range)
;16 bytes, 28 cycles
   sta   temp
   lsr
   lsr
   adc   temp
   ror
   lsr
   adc   temp
   ror
   lsr
   adc   temp
   ror
   lsr
               ; A = div3

;Continue routine for dividing by 36 (good for 0-107 only)
;8 bytes 10 cycles
   ldx   #0
   cmp   #12
   inx
   cmp   #24
   inx
               ; X = div36

Which turns out even better this way, because I have to run a subtraction with the div3 value. :)

 

Edit: never write posts just after getting out of bed... The "inx" don't work of course... and fixing that with "bcc"s wil result in code that's slower and takes more ROM.

Edited by SvOlli
Link to comment
Share on other sites

On a hunch I tried another optimization, and it saves 3 cycles. It just makes the 0-95 criteria.

 

26 bytes, and 37 cycles overall.

;Divide by 3 (good for 0-95 range only)
;16 bytes, 25 cycles
  cmp  #33
  adc  #0
  sta  temp
  lsr
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr

;Continue routine for dividing by 36 (good for 0-107 range only)
;10 bytes 12 cycles
  tax      ; X = div3
  lda  #0
  cpx  #12
  rol
  cpx  #24
  adc  #0
           ; A = div36
  • Like 1
Link to comment
Share on other sites

Thanks again. But I'm sticking with with "26 bytes and 40 cycles" versions, because what comes in via A has been calculated and needs to be stored to RAM anyway for later use. And since the "adc temp" doesn't change anything at "temp", I don't need a temporary storage, but just can "reuse" the original one.

Link to comment
Share on other sites

I'm going to present one more version. It saves 3 bytes but costs 16 more cycles. Use whatever routine that makes the most sense for the situation.

 

This one is 23 bytes and 56 cycles (constant cycles) overall.

;Divide by 3 (can be used for 0-128 range)
;13 bytes, 44 cycles
  sta  temp
  lsr
  lsr
  
  ldx  #3
.loopFinishDiv3:
  adc  temp
  ror
  lsr
  dex
  bne  .loopFinishDiv3

;Continue routine for dividing by 36 (good for 0-107 only)
;10 bytes 12 cycles
  tax      ; X = div3
  lda  #0
  cpx  #12
  rol
  cpx  #24
  adc  #0
           ; A = div36
Link to comment
Share on other sites

Also.... I always post and realize something else. If Y is available save 1 more byte.

 

So 22 bytes and 56 cycles (constant).

;Divide by 3 (can be used for 0-128 range)
;13 bytes, 44 cycles
  sta  temp
  lsr
  lsr
  
  ldy  #3
.loopFinishDiv3:
  adc  temp
  ror
  lsr
  dey
  bne  .loopFinishDiv3

;Continue routine for dividing by 36 (good for 0-107 only)
;9 bytes 12 cycles
  tax      ; X = div3
  tya      ; A=0
  cpx  #12
  rol
  cpx  #24
  adc  #0
           ; A = div36
Link to comment
Share on other sites

Here's a 21 byte version. It adds another 9 cycles but hey if you got the cycles to spare then it's another byte saved. Use whatever routine works.

 

21 bytes and 65 cycles (constant) overall.

;Divide by 3 (can be used for 0-128 range)
;12 bytes, 53 cycles
  ldy  #4
  sta  temp
  .byte $0C ; NOP skip 2 bytes
.loopFinishDiv3:
  adc  temp
  lsr
  lsr
  dey
  bne  .loopFinishDiv3

;Continue routine for dividing by 36 (good for 0-107 only)
;9 bytes 12 cycles
  tax      ; X = div3
  tya      ; A=0
  cpx  #12
  rol
  cpx  #24
  adc  #0
           ; A = div36
Link to comment
Share on other sites

  • 2 weeks later...
  • 1 year later...

 

Divide by 3 (Omegamatrix)

;18 bytes, 30 cycles
  sta  temp
  lsr
  adc  #$15
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr

 

 

Hello..

 

I'm trying to divide the 160 color clocks by 6. If I wanted to divide by 6, should I just put an lsr right at the top of this (before storing to temp)? Should I do anything with the carry?

Link to comment
Share on other sites

 

Hello..

 

I'm trying to divide the 160 color clocks by 6. If I wanted to divide by 6, should I just put an lsr right at the top of this (before storing to temp)? Should I do anything with the carry?

 

Can't say I've studied on it any, but I'd just add an lsr at the end, that way you keep the most bits for the divide by 3

  • Like 1
Link to comment
Share on other sites

Could this work ?

 

After line 192, you VBLANK, then :

 

-init Y at zero (2) and A to proper collision mask (2)

-position a missile at pixel X (11 cycles)

-position a ball at start of line (3 cycles)

-set HMBL at 3 (5 cycles)

-at each new scanline, HMOVE (3 cycles) and increment Y (2) until collision detected (6/5 cycles)

 

total : 28 + result * 11, max about 380 cycles, but definitively more fun.

 

Note you can divide many number at once using this techic, or divide a number by severals number, saving many cycles.

Edited by Windless
Link to comment
Share on other sites

 

Divide by 3 (Omegamatrix)

;18 bytes, 30 cycles
  sta  temp
  lsr
  adc  #$15
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr
  adc  temp
  ror
  lsr

 

 

Well I've looked at this multiple times this week, manually worked through it with various values and it always works. I made a spreadsheet to try to observe what happens, why and when.. and I still have no solid idea why this works. My guess is that the adc #$15 (%00010101) is seeding the value to affect the carry for some reason and that repeatedly dividing by four, then adding the original amount back in, somehow works out to dividing by 3. Also thinking that binary 3 behaves differently than decimal 3, being that its a fraction of the base in decimal, but 1.5 times the base in binary.

 

Would someone care to comment it so I might get a better grip on it? Also, did I get any of it right?

 

Thanks

Link to comment
Share on other sites

I didn't try to work it out but..

 

1/3 is 1/4 + 1/16 + 1/64 + 1/256 ...

You're multiplying by reciprocal 3

 

$15 is the first three of those

 

That will (roughly) have the effect of rounding up if the fractional part is > .66

 

ie you're multiplying by .3333... and adding .333 (approximately, I've not accounted for truncation and rounding)

Edited by bogax
Link to comment
Share on other sites

19 hours ago, bogax said:

1/3 is 1/4 + 1/16 + 1/64 + 1/256 ...

Ahh thanks, that was the missing piece.  I saw Thomas mention something like that but it didn’t click because I had no idea that was a thing.  Everyone here seems to know it without having to say.  All those years of math in school and I don’t recall ever seeing it.  Here’s my interpretation:

Div6:	SUBROUTINE 	; divide player position by 6
					; rors are necessary for larger numbers
	sta  temp		; Store the value for recycling. 1/3 is the  convergence of 1/4 + 1/16 + 1/64 + 1/256...
	lsr				; Shift significant bits (of 2 bit pairs) over 1
	adc  #$15		; = %00010101. Round up if the 2 bit pair was 3?
	lsr				; Now we are at 1/4 the original number.  The results here will be divided 3 more times for the 1/256 portion
	adc  temp		; Add that to the original number for round 2 (Original # plus 1/4)
	ror				; 
	lsr				; Divide that by 4.  The results here will be divided 2 more times for the 1/64 portion
	adc  temp		; Add that to the total number for round 3 (Original # plus 1/4 plus 1/16)
	ror				; 
	lsr				; Divide that by 4.  The results here will be divided 1 more time for the 1/16 portion
	adc  temp		; Add that to the total number for round 4 (Original # plus 1/4 plus 1/16 plus 1/64)
	ror				; 
	lsr				; Divide that by 4 for final result.  Original can be discarded now. Result is 1/4 plus 1/16 plus 1/64 plus 1/256
;Make it divide by 6
	lsr				; 

 

Edited by BNE Jeff
Fractions were wrong
Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...