Jump to content
IGNORED

Looking for 32 bit division routines


Pab

Recommended Posts

I'm looking for a good assembly routine to divide a 32 bit integer by a 16 bit integer, leaving a 32 bit result.

 

The only routines I have on hand are for dividing 64 bit by 32 bit to leave a 32 bit result, or to divide 32 bit by 16 bit leaving a 16 bit result. Neither quite fits my case. For example, I'd like to be able to divide $400000 by $20 and end up with $20000.

 

Anyone have any code handy for this?

 

Link to comment
Share on other sites

Here's what I've been working with, based on different routines I've seen. Using the rotate/subtract method. Complete gibberish as the result.

10 DIVISOR = $CF    
20 DIVIDEND = $CB   
30 REMAINDER = $D8  
40 RESULT =  $D4    
50 SCRAP =   $E0
60 DIVIDE LDA #0     	        ;preset remainder to 0
70       STA REMAINDER
80       STA REMAINDER+1
90       STA REMAINDER+2
100      STA REMAINDER+3
110      STA RESULT
120      STA RESULT+1
130      STA RESULT+2
140      STA RESULT+3
150      LDX #32     	        ;repeat for each bit: ...
160 DIVLOOP ASL DIVIDEND 
170      ROL DIVIDEND+1 	
180      ROL DIVIDEND+2
190      ROL DIVIDEND+3
200      ROL REMAINDER 	
210      ROL REMAINDER+1
220      ROL REMAINDER+2
230      ROL REMAINDER+3
240      TXA 
250      PHA 
260      SEC 
270      LDX #0
280      LDY #4
290 LP1  LDA REMAINDER,X
300      SBC DIVISOR,X 	;substract divisor to see if it fits in
310      STA SCRAP,X
320      INX 
330      DEY 
340      BNE LP1
350      BCC SKIP    	;if carry=0 then divisor didn't fit in yet
360      LDY #4
370      LDX #0
380 LP2  LDA SCRAP,X 	;else save substraction result as new remainder,
390      STA REMAINDER,X
400      INX 
410      DEY 
420      BNE LP2     	
430      INC RESULT  	;and INCrement result cause divisor fit in 1 times
440 SKIP PLA 
450      TAX 
460      DEX 
470      BNE DIVLOOP 	
480      RTS
Link to comment
Share on other sites

Your routine looks like the one on codebase64. One thing to note is that that routine relies on dividend and result being aliased to the same memory locations. That way the rols on dividend also rol the result.

 

Here's a working version:

 

    org $80
divisor org *+4
dividend org *+4
remainder org *+4
result equ dividend
scrap org *+2
 
    org $2000
 
divide
    lda #0                  ;preset remainder to 0
    sta remainder
    sta remainder+1
    sta remainder+2
    sta remainder+3
    ldx #32                 ;repeat for each bit: ...
 
divloop
    asl dividend
    rol dividend+1
    rol dividend+2
    rol dividend+3
    rol remainder
    rol remainder+1
    rol remainder+2
    rol remainder+3
 
    sec
    lda remainder+0         ;substract divisor to see if it fits in
    sbc divisor+0
    sta scrap+0
    lda remainder+1
    sbc divisor+1
    sta scrap+1
    lda remainder+2
    sbc divisor+2
    tay
    lda remainder+3
    sbc divisor+3
    bcc skip                ;if carry=0 then divisor did not fit in yet
 
    sta remainder+3         ;else save substraction result as new remainder,
    sty remainder+2
    mwa scrap remainder     
    inc result              ;and increment result cause divisor fit in 1 times
 
skip
    dex
    bne divloop
    rts
 
main
    mwa #123 dividend       ;test
    mwa #0 dividend+2
    mwa #3 divisor
    mwa #0 divisor+2
    jsr divide
    jmp *
 
    run main

Another problem with your code was that the loop counters went from 4 to 1 instead of 3 to 0. I circumvented this by unrolling loops. This also eliminates the need for pha/pla to save the X register.

 

@flashjazzcat, the INC will never cause a carry the way the code is written, so the extra INCs are not necessary.

 

EDIT: Nevermind about the loop counters. Yours are fine. My subtraction was in the wrong byte order. I've corrected it above.

 

EDIT #2: Fixed another error in the copying of the scratch result to the remainder. Amazing how my small test case yields the correct result despite all these errors. Probably should expand the test suite. :)

Edited by Xuel
Link to comment
Share on other sites

Hi!

 

@flashjazzcat, the INC will never cause a carry the way the code is written, so the extra INCs are not necessary.

In fact, the "INC" can be avoided by replacing the first ASL with a ROL and adding a CLC at the initialization, this is using the fact that the carry is set by the last substraction.

 

A good place to look for arithmetic routines is the CC65 runtime library, those are not optimized for maximum speed but for reusing code. The code for the 32bit division is in libsrc/runtime/ludiv.s, "udiv32" (ludiv == long-unsigned-division), see at https://github.com/cc65/cc65/blob/master/libsrc/runtime/ludiv.s

Edited by dmsc
  • Like 1
Link to comment
Share on other sites

  • 4 weeks later...

Following on from this: I was using 32-bit divide and multiply functions based on routines found on Codebase, but for non-critical operations where space is at a premium (especially when PZ RAM can't be used for whatever reason), I wondered if they could be shrunk down some. The result seems to work well:

;	Small unsigned 32-bit divide and multiply
;	Num1, Num2 and Num3 must be absolute addresses

Num1	.ds 4
Num3	.ds 8	; must follow Num1 for Div to work
Num2	.ds 4


;
;	Divide Num1 by Num2 and place result in Num1
;	Remainder is left in Num3, which MUST be immediately after Num1 in RAM
;	Optimised for size rather than speed
;

	.local DivDWord
	lda #0	        ;preset remainder to 0
	ldx #3
@
	sta Num3,x
	dex
	bpl @-

	ldy #32		; repeat for each bit
DivLoop
	clc		; clear carry so we can use ROL throughout
	ldx #248	; dividend lb & hb*2, msb -> Carry
@
	rol.w Num1-248,x	; .w ensures ZP addressing is never used
	inx
	bne @-
	
	sec
	ldx #252
@
	lda.w Num3-252,x
	sbc.w Num2-252,x
	pha
	inx
	bne @-
	
	bcs @+
	pla
	pla
	pla
	pla
	bcc Skip
@
	ldx #3
@
	pla
	sta Num3,x
	dex
	bpl @-

	inc Num1	; and increment result because divisor fit
Skip
	dey
	bne DivLoop
	rts
	.endl
	
	
	
;
;	Multiply Num1 by Num2, storing result in Num3
;	Optimised for size rather than speed
;

	.local MulDWord
	ldx #3
	lda #0
@
	sta Num3+4,x	; clear upper bits of product
	dex
	bpl @-

	ldy #32		; loop for each bit
MultLoop
	clc		; clear carry so we can use ROR on all four bytes
	ldx #3
@
	ror Num1,x	; divide multiplier by 2
	dex
	bpl @-
	bcc Rotate

	clc
	ldx #253
@
	lda.w Num3-249,x
	adc.w Num2-253,x
	sta.w Num3-249,x
	inx
	bne @-
	
	lda Num3+7
	adc Num2+3
Rotate
	ror	; rotate partial product 
	sta Num3+7

	ldx #6
@
	ror Num3,x
	dex
	bpl @-

	dey
	bne MultLoop
	rts
	.endl

Obviously these aren't the fastest because of all the loops and stack usage, but for one-off calculations it hardly matters. Putting Num1, Num2 and Num3 in zero page and getting rid of the loops provides a better balance of performance and code size, however.

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...