Jump to content

Photo

Fast divide by seven?


127 replies to this topic

#26 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,673 posts
  • begin 644 contest

Posted Wed Jan 21, 2009 5:06 AM

Hello Batari,


I tried to improve the code you posted, and I think I've saved 5 cycles, and 1 register (comparing the two used to hold the processer status, vs a second temp that could be in the same location) at a cost of 2 bytes. Though I haven't fully tested the code (and everytime I do that it bites me in the ass!), but I've been trying this for hours now and I'm tired. I think it works.

It works. I wrote a quick test program before I posted mine above, and yours passed :)

I thought there was a way to do it without the php/plp instructions.

#27 Omegamatrix OFFLINE  

Omegamatrix

    Quadrunner

  • 6,215 posts
  • Location:Canada

Posted Wed Jan 21, 2009 6:44 PM

Hello again, Batari.

I made a second attempt today. Can you please test it with your program also? :)


This one uses no branching and it takes 58 cycles all the time, which is 2-4 cycles faster then my last code. It also saves one byte over my last attempt. Its disadvantages are that it uses an index register. If the user doesn't have a free index register, they might be able to substitute PHA / PLA for TAY / TYA at a cost of adding 3 cycles, depending where the stack pointer is.


;divide by 10
;takes 58 cycles
	lda byte	 ; 3
	lsr		  ; 2
	clc		  ; 2
	adc byte	 ; 3
	sta temp1	; 3
	ror		  ; 2
	tay		  ; 2
	and #$80	 ; 2
	sta temp2	; 3
	tya		  ; 2
	lsr		  ; 2
	lsr		  ; 2
	lsr		  ; 2
	adc temp1	; 3
	ror		  ; 2
	ora temp2	; 3
	lsr		  ; 2
	lsr		  ; 2
	lsr		  ; 2
	adc temp1	; 3
	ror		  ; 2
	ora temp2	; 3
	lsr		  ; 2
	lsr		  ; 2
	lsr		  ; 2

Edited by Omegamatrix, Wed Jan 21, 2009 6:45 PM.


#28 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,673 posts
  • begin 644 contest

Posted Wed Jan 21, 2009 8:39 PM

Hello again, Batari.

I made a second attempt today. Can you please test it with your program also? :)


This one uses no branching and it takes 58 cycles all the time, which is 2-4 cycles faster then my last code. It also saves one byte over my last attempt. Its disadvantages are that it uses an index register. If the user doesn't have a free index register, they might be able to substitute PHA / PLA for TAY / TYA at a cost of adding 3 cycles, depending where the stack pointer is.

It works.

I found that the clc isn't needed, so there's 2 more cycles shaved off. I tried to use some illegal opcodes but couldn't find any that worked.

#29 Thomas Jentzsch OFFLINE  

Thomas Jentzsch

    Thrust, Jammed, SWOOPS!, Boulder Dash, THREE·S, Star Castle

  • 23,625 posts
  • Always left from right here!
  • Location:Düsseldorf, Germany, Europe, Earth

Posted Thu Jan 22, 2009 12:10 PM

What's the task? Dividing an 8 bit number by 10?

#30 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,673 posts
  • begin 644 contest

Posted Thu Jan 22, 2009 2:37 PM

What's the task? Dividing an 8 bit number by 10?

Yep.

#31 GroovyBee OFFLINE  

GroovyBee

    Games Developer

  • 9,821 posts
  • Busy bee!
  • Location:England

Posted Thu Jan 22, 2009 9:40 PM

Another attempt :-

ldx #0		 ; 2
		lda byte	   ; 3
		cmp #200	   ; 2
		bcc below200   ; 2/3
		sbc #200	   ; 2
		ldx #20		; 2
		jmp below100   ; 3
below200
		cmp #100	   ; 2
		bcc below100   ; 2/3
		sbc #100	   ; 2
		ldx #10		; 2
below100
		cmp #50		; 2
		bcc below50	; 2/3
		sbc #50		; 2
		sta temp	   ; 3
		txa			; 2
		ldx temp	   ; 3
		adc #5-1	   ; 2 (Carry set)
		jmp above50	; 3
below50
		sta temp	   ; 3
		txa			; 2
		ldx temp	   ; 3
above50
		cpx #10		; 2
		bcc done	   ; 2/3
		adc #0		 ; 2
		cpx #20		; 2
		bcc done	   ; 2/3
		adc #0		 ; 2
		cpx #30		; 2
		bcc done	   ; 2/3
		adc #0		 ; 2
		cpx #40		; 2
		adc #0		 ; 2
done

Lots of branches and variable execution time. Tested against the Batari original.

#32 Omegamatrix OFFLINE  

Omegamatrix

    Quadrunner

  • 6,215 posts
  • Location:Canada

Posted Thu Jan 22, 2009 9:54 PM

Hello, Batari and Thomas.


With Batari's findings of not needing CLC we are two cycles faster. I think I found a way to go another two cycles faster with an illegal opcode SBX.


;divide by 10
;54 cycles

	ldx	#$80		; 2
	lda	byte		; 3
	lsr				; 2
	adc	byte		; 3
	sta	temp1	   ; 3
	ror				; 2
	sbx	#0		  ; 2   illegal opcode
	stx	temp2	   ; 3
	lsr				; 2
	lsr				; 2
	lsr				; 2
	adc	temp1	   ; 3
	ror				; 2
	ora	temp2	   ; 2
	lsr				; 2
	lsr				; 2
	lsr				; 2
	adc	temp1	   ; 3
	ror				; 2
	ora	temp2	   ; 2
	lsr				; 2
	lsr				; 2
	lsr				; 2


I have no idea if this illegal opcode is stable, but if it is then we are at 54 cycles and only 33 bytes. :) I hope this tests okay.


~Omega

#33 Robert M OFFLINE  

Robert M

    Stargunner

  • 1,488 posts
  • Rootbeer!
  • Location:Western NY state

Posted Thu Jan 22, 2009 11:36 PM

Here is a different approach yielding about the same efficiency.
; Divide unsigned value in A by 10.
; - How? Multiply by 26 (%11010), then divide by 256.
; Outputs :
; 	A = the answer.
;	X = 0 
;	Y = unchanged.
;	origvalue = the original value to be divided.
DivideBy10
	LDX #0		  ;  2
	STX temp		;  3
	STA origvalue;  3

	ASL			 ;  2
	ROL temp		;  5

	ADC	origvalue;  3
	BCC .skip1	  ;  2/3
	INC temp		;  5
.skip1	
	ASL			 ;  2
	ROL temp		;  5	 
	
	ASL			 ;  2
	ROL temp		;  5
	
	ADC origvalue;  3
	BCC .skip2		 ;  2/3
	INC temp		;  5
.skip2	
	ASL			 ;  2
	LDA temp		;  3
	ROL			 ;  2
					;-----
					; 56 cycles worst case
					; 48 cycles best case
					; 31 bytes

Edited by Robert M, Fri Jan 23, 2009 12:16 AM.


#34 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,673 posts
  • begin 644 contest

Posted Fri Jan 23, 2009 12:24 AM

The latest by Omegamatrix works. Robert M's made it to 68, then it failed. But this one is interesting - how does it work?

#35 supercat OFFLINE  

supercat

    Quadrunner

  • 6,401 posts

Posted Fri Jan 23, 2009 12:55 AM

How about a hybrid table/code approach:
; Divide value in acc. by 10 and store the result in Y.  X trashed.
  tay
  lsr
  lsr
  lsr
  lsr
  tax
  tya
  sec; Could this be eliminated?
  sbc table1,y
  ldy table2,x
  cmp #10
  bcc nobump
  iny
  cmp #20
  bcc nobump
  iny
nobump:
...
table1:
  byte 0,1,3,4,6,8,9,11,12,14,16,17,19,20,22,24
table2:
  byte 0,10,30,40,60,80,90,110,120,140,160,170,190,200,220,240
Sixteen instructions totaling 36 cycles worst-case and 24+32=56 bytes

#36 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,673 posts
  • begin 644 contest

Posted Fri Jan 23, 2009 1:48 AM

Supercat's works, with the following corrections:
sbc table2,x
  ldy table1,x
The sec is required.

Edited by batari, Fri Jan 23, 2009 1:48 AM.


#37 Thomas Jentzsch OFFLINE  

Thomas Jentzsch

    Thrust, Jammed, SWOOPS!, Boulder Dash, THREE·S, Star Castle

  • 23,625 posts
  • Always left from right here!
  • Location:Düsseldorf, Germany, Europe, Earth

Posted Fri Jan 23, 2009 5:13 AM

I wonder if the BCD mode could be used for this. :ponder:

#38 Omegamatrix OFFLINE  

Omegamatrix

    Quadrunner

  • 6,215 posts
  • Location:Canada

Posted Fri Jan 23, 2009 6:22 PM

I think I managed to save another 2 cycles, and 2 bytes by using the illegal opcode SAX.


;divide by 10
;52 cycles, 31 bytes

	lda  byte	  ; 3
	ldx  #$80	  ; 2
	lsr			; 2
	adc  byte	  ; 3
	sta  temp1	 ; 3
	ror			; 2
	sax  temp2	 ; 3   illegal opcode
	lsr			; 2
	lsr			; 2
	lsr			; 2
	adc  temp1	 ; 3
	ror			; 2
	ora  temp2	 ; 2
	lsr			; 2
	lsr			; 2
	lsr			; 2
	adc  temp1	 ; 3
	ror			; 2
	ora  temp2	 ; 2
	lsr			; 2
	lsr			; 2
	lsr			; 2

I can't believe how useful some of these illegal ops are in certain situations. Hopefully it is stable. ;)

Edited by Omegamatrix, Fri Jan 23, 2009 7:34 PM.


#39 Robert M OFFLINE  

Robert M

    Stargunner

  • 1,488 posts
  • Rootbeer!
  • Location:Western NY state

Posted Fri Jan 23, 2009 7:38 PM

Robert M's made it to 68, then it failed. But this one is interesting - how does it work?



I am multiplying the value by the fraction 26/256 which is just slightly larger than 0.1.

#40 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,673 posts
  • begin 644 contest

Posted Fri Jan 23, 2009 8:32 PM

Robert M's made it to 68, then it failed. But this one is interesting - how does it work?



I am multiplying the value by the fraction 26/256 which is just slightly larger than 0.1.

Makes sense. It fails at 69 because 69*26/256=7.0078125

If you adjust the original value with value=value-value/64, it may fix it.

#41 Omegamatrix OFFLINE  

Omegamatrix

    Quadrunner

  • 6,215 posts
  • Location:Canada

Posted Fri Jan 23, 2009 11:28 PM

Batari, can you please test this one if you have time? I was going through distella's debugger with it, and I think I adjusted it right, but I need confirmation.


Thanks if you can,
Omega


;divide by 10
;47 cycles, 28 bytes
	
	tax				; 2
	sta	temp		; 3
	lsr				; 2
	adc	temp		; 3
	sta	temp		; 3
	ror				; 2
	lsr				; 2
	lsr				; 2
	lsr				; 2
	adc	temp		; 3
	cpx	#161		; 2
	ror				; 2
	lsr				; 2
	lsr				; 2
	lsr				; 2
	adc	temp		; 3
	cpx	#160		; 2
	ror				; 2
	lsr				; 2
	lsr				; 2
	lsr				; 2


#42 Robert M OFFLINE  

Robert M

    Stargunner

  • 1,488 posts
  • Rootbeer!
  • Location:Western NY state

Posted Sat Jan 24, 2009 12:17 PM

Not as efficient anymore, but I changed it to multiply by the fraction (25.5/256) which is closer to 0.1,
; Divide unsigned value in A by 10.  "floor(A/10)"
; - How?
; - Multiply by 25.5 (%11001.1), then divide by 256  
; Inputs :
;	A = 0-255 value to be divided by ten rounded down.
; Outputs :
;    A = the answer.
;    X = 0 
;    Y = unchanged.
;    origvalue = the original value in A divided by 2.
DivideBy10
    LDX #0          ;  2; This is a setup work that can
    STX temp        ;  3; be optimized out if the value
    STA origvalue    ;  3 ; to divide is already in origvalue
					; and there is a temp location 
					; already set to zero.

    ASL              ;  2; temp:A = origvalue * 2 
    ROL temp        ;  5

    ADC origvalue    ;  3; temp:A = origvalue * 3
    BCC .skip1      ;  2/3
    INC temp        ;  5
.skip1    

    ASL              ;  2; temp:A = origvalue * 6
    ROL temp        ;  5     
    
    ASL              ;  2; temp:A = origvalue * 12
    ROL temp        ;  5
    
	ASL			;  2 ; temp:A = origvalue * 24
	ROL temp	;  5

    ADC origvalue    ;  3; temp:A = origvalue * 25
    BCC .skip2       ;  2/3
    INC temp        ;  5
.skip2    

	LSR origvalue  ;  5 ; temp:A = origvalue * 25.5	
	ADC origvalue  ;  3		  
    LDA temp	;  3
	ADC #0		 ;  2 ; A = (origvalue * 25.5) / 256
                    ;-----
                    ; 69 cycles worst case
                    ; 61 cycles best case
                    ; 38 bytes

I am curious now to try and make a general solution that can multiply any byte by a fraction.

Edited by Robert M, Sat Jan 24, 2009 12:18 PM.


#43 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,673 posts
  • begin 644 contest

Posted Sat Jan 24, 2009 2:43 PM

Omegamatrix's works, but Robert M's fails at 10 (because 25.5*10/256=0.996.)

Robert M's original code will work if it's adjusted with origvalue=origvalue-origvalue/64.

It's doing (A-A/64)*26/256=A*25.59375/256, which is pretty close.

Unfortunately, I can't think of a better way to do it than this (add at the beginning of the code:
sta origvalue
   rol
   rol
   rol
	and #3
   sta temp
  lda origvalue
   sec
  sbc temp
  sta origvalue
EDIT: Corrected analysis (it's not exact, but works.)

Edited by batari, Sat Jan 24, 2009 3:18 PM.


#44 Omegamatrix OFFLINE  

Omegamatrix

    Quadrunner

  • 6,215 posts
  • Location:Canada

Posted Fri Feb 6, 2009 9:43 PM

I took another stab at this today, and managed to save 11 cycles and 7 bytes over my last effort.


;divide by 10
;36 cycles, 21 bytes

	tax				; 2
	sta	temp		; 3
	lsr				; 2
	sec				; 2
	adc	temp		; 3
	sta	temp		; 3
	ror				; 2
	lsr				; 2
	lsr				; 2
	lsr				; 2
	adc	temp		; 3
	cpx	#160		; 2
	ror				; 2
	lsr				; 2
	lsr				; 2
	lsr				; 2


Edit: found a way to save another byte by setting the carry!

Edited by Omegamatrix, Fri Feb 6, 2009 11:08 PM.


#45 Omegamatrix OFFLINE  

Omegamatrix

    Quadrunner

  • 6,215 posts
  • Location:Canada

Posted Sat Feb 7, 2009 9:58 PM

Found a way to save another cycle, and preserve the X and Y registers at the same time. :) Same amount of bytes as before.


;divide by 10
;35 cycles, 21 bytes

	sta	temp		; 3
	ora	#$03		; 2
	lsr				; 2
	adc	temp		; 3
	ror				; 2
	sta	temp		; 3
	lsr				; 2
	lsr				; 2
	lsr				; 2
	adc	temp		; 3
	adc	temp		; 3
	ror				; 2
	lsr				; 2
	lsr				; 2
	lsr				; 2


#46 batari OFFLINE  

batari

    )66]U('=I;B$*

  • 6,673 posts
  • begin 644 contest

Posted Sun Feb 8, 2009 1:13 AM

Found a way to save another cycle, and preserve the X and Y registers at the same time. :) Same amount of bytes as before.


;divide by 10
;35 cycles, 21 bytes

	sta	temp	; 3
	ora	#$03	; 2
	lsr			; 2
	adc	temp	; 3
	ror			; 2
	sta	temp	; 3
	lsr			; 2
	lsr			; 2
	lsr			; 2
	adc	temp	; 3
	adc	temp	; 3
	ror			; 2
	lsr			; 2
	lsr			; 2
	lsr			; 2

Now that's impressive :) Works, too.

#47 grafixbmp OFFLINE  

grafixbmp

    Dragonstomper

  • 683 posts
  • Location:South Central US

Posted Sun Feb 8, 2009 10:43 AM

Found a way to save another cycle, and preserve the X and Y registers at the same time. :) Same amount of bytes as before.


;divide by 10
;35 cycles, 21 bytes

	sta	temp; 3
	ora	#$03; 2
	lsr		; 2
	adc	temp; 3
	ror		; 2
	sta	temp; 3
	lsr		; 2
	lsr		; 2
	lsr		; 2
	adc	temp; 3
	adc	temp; 3
	ror		; 2
	lsr		; 2
	lsr		; 2
	lsr		; 2

Now that's impressive :) Works, too.

Wonder if you could impliment a similar process for other numbers too. (still preserving the x and y)

#48 Omegamatrix OFFLINE  

Omegamatrix

    Quadrunner

  • 6,215 posts
  • Location:Canada

Posted Tue Feb 10, 2009 6:45 AM

Managed to shave another byte off. Same cycles as before.

;divide by 10
;35 cycles, 20 bytes

	lsr				; 2
	sta	temp		; 3
	lsr				; 2
	sec				; 2
	adc	temp		; 3
	sta	temp		; 3
	lsr				; 2
	lsr				; 2
	lsr				; 2
	adc	temp		; 3
	adc	temp		; 3
	ror				; 2
	lsr				; 2
	lsr				; 2
	lsr				; 2


#49 grafixbmp OFFLINE  

grafixbmp

    Dragonstomper

  • 683 posts
  • Location:South Central US

Posted Tue Feb 10, 2009 6:10 PM

There is also another process a friend of mine introduced me to once, but can't quite remember what it was called. It was either pure hex or true hex or something to that effect. It might be more adept at being used on a 6502 since it deals with hex values. But while I collect some more info on the subject, here is a part of the processes used in vedic for multiplication. It may not work well for a processor but it is still nice to know. It deals with the fact that the math is done with base 10 and uses it as so...

Suppose you need 8 x 7

8 is 2 below 10 and 7 is 3 below 10.
Think of it like this:

The answer is 56.
The diagram below shows how you get it.

You subtract crosswise 8-3 or 7 - 2 to get 5,
the first figure of the answer.
And you multiply vertically: 2 x 3 to get 6,
the last figure of the answer.


I'll get back after I see what I can find on the hex stuff.

I found what I was looking for. it is Prime HEX or cuban primes. here is a link for it and what it is used for. You can search for more info on the subject and may find some helpful things to use in math on the atari.
http://mathworld.wol.../HexNumber.html

Here is another site with some visual arrays that look pretty damn cool.
http://www.thefoggie...e_hex_array.htm

Edited by grafixbmp, Tue Feb 10, 2009 6:14 PM.


#50 grafixbmp OFFLINE  

grafixbmp

    Dragonstomper

  • 683 posts
  • Location:South Central US

Posted Wed Feb 11, 2009 12:55 AM

I thought about the topic that started it all, "Fast devide by 7" And decided to take an all encompasing aproach to an idea.

If a table array had the math layed out for every possible answer in decimal, bianary and hex, then a search for easy pattern recognition can be done to devise an efficient way to come to the right answer for groups of given senarios. That way you can have more than one way to get the answer depending on what the current value being worked with is. And given that the highest number you could even do with an 8 bit value is 255, then somewhere around up to say 4 diffrent ways could be constructed to obtain an answer that take extremely close to the same amount of cycles to accomplish the task since ROM space really isn't such a big issue. But then comes the decission making process of which method is used to get the answer based upon what the current value is that is being worked with. This part may be too much overhead. Sometimes visualizing the array of answers helps to see patterns where they could not be seen before. Basing this upon patterns in bianary or hex could be the key to optimizing the complete process. Sound good or is this pure smoke/speculation?

Edited by grafixbmp, Wed Feb 11, 2009 12:56 AM.





0 user(s) are browsing this forum

0 members, 0 guests, 0 anonymous users