Jump to content
IGNORED

Fast divide by seven?


brpocock

Recommended Posts

I revisited divide by 5, as I no longer consider my old, old solution using ASR valid. ASR fails sometimes on my Jr.

 

 

The solution using ASR was super fast because LSR, followed by CLC could simply be replaced with ASR #$FE and saved 2 cycles for each usage. To replace it I've been working or perturbing the data so that I don't have to use CLC or SEC at all. The advantage this time is that I have my little program which can quickly cover the entire range of values, and use a variable.

 

 

I've found this solution, which comes close to the the ASR solution. It is 2 cycles slower, but saves 1 byte (and no illegal opcodes)

 

;Divide by 5
;22 bytes, 34 cycles
 sta temp
 lsr
 lsr
 sbc temp
 lsr
 lsr
 adc #198
 adc temp
 lsr
 lsr
 sbc temp
 eor #$FE
 adc #2
 lsr
 lsr

Edited by Omegamatrix
  • Like 1
Link to comment
Share on other sites

Warning: Long post below (apologies) :)

 

 

Today I was thinking more about the divide by 11 and 13 routines I found here, and exactly why they work so well. The part of that web page that caught my eye was how they were representing there division by sums of other fractions. It occurred to me that given n bits, it would be relatively simple to make an excel sheet that expressed all linear combinations of 2^k divisions, where the sum of all k never exceeded n bits. That might sound convoluted (does to me anyhow) so let me explain the process.

 

I wrote an excel sheet today to find the the division sequence to get the best accuracy. The user selects a divisor, and a desired accuracy. The excel sheet has conditional formatting to highlight the results within the accuracy class. The total number solutions is close to the top so that the user could quickly see if a solution existed without scrolling all the way down the sheet. The sheet is very long as it covers 2048 combinations (i.e. ln(2048)/ln(2) = 11 bits for n).

 

First we take the sheet and enter a value for the divisor, and a given accuracy class (these are the yellow cells).

 

post-7074-0-29655700-1359328551_thumb.png

 

We can see that solutions within the accuracy range exist, because the total number of solutions listed is not zero. The solutions might not all be unique, but at least some exist. Next we find one of them.

 

post-7074-0-79982000-1359328894_thumb.png

 

The "/4/2/2/8/16" is the sequence for divisions that need to be done. Next we the build the routine as such:

- store the dividend into a temporary register

- perform first division

- add the dividend to the result

- perform second division, using a ror instead of a lsr for the first shift to handle overflow

- add the dividend to the result again

- add so on, until all of the division is done

 

For example, the code for division by 13 from "/4/2/2/8/16" above would look like this:

 

;divide by 13, using /4/2/2/8/16 solution
 sta  temp  ; store dividend
 lsr  ; /4, which is base 2 is equal to shifting right twice
 lsr
 adc  temp  ; add dividend
 ror  ; /2, using ror instead of lsr in case of overflow
 adc  temp  ; dividend
 ror  ; /2
 adc  temp
 ror  ; /8, 8 = 2^3, or three right shifts
 lsr
 lsr
 adc  temp
 ror  ; /16, 16 = 2^4, or 4 right shifts
 lsr
 lsr
 lsr

 

So it looks great, and we try it in test program. It immediately fails:

 

post-7074-0-86621600-1359330064_thumb.png

 

Perhaps too many bits were lost in the truncation, or perhaps the algorithm wasn't accurate enough to begin with. The test program essentially works by trying all inputs from 255 to 0, and it will break at the first bad result. In this case it partially worked so there is some hope. We can see if placing a disturbance in the routine will bring in back in line...

 

In the test program set the "MODE" to 0, and add a "testVariable" into the routine. The test program now runs in a outer/inner loop sequence. It will start with "testVariable" = 255, and test the routine for every possible input (255 to 0). If that fails "testVariable" is decremented by 1, and the process begins again. A break in the routine occurs only if a solution is found, or all possible tests are exhausted.

 

I usually use ADC testVariable, or SDC testVariable, but you can do anything you wan't with it, like EOR testVariable, CMP testVariable, etc. It's also easy to add another variable. Just add another loop. The program will run a lot longer of course, but 2 variables is managable. If you add a tertiary variable you probably would be best writing something in C or Matlab and letting your computer solve it. It would take way too long in Stella.

 

So I randomly plugged in an "ADC testVariable" to the routine, and ran the program with the MODE = 0.

 

 sta  temp
 lsr
 lsr
 adc  temp
 ror
 adc  temp
 ror
 adc  temp
 ror
 adc  testVariable ; <-- Disturbance
 lsr
 lsr
 adc  temp
 ror
 lsr
 lsr
 lsr

 

A solution was then found by the test program:

 

post-7074-0-55187000-1359331106_thumb.png

 

If we replace ADC testVariable with ADC #2, then this program will work. There might be other values that will work, too. The program simply breaks at the first one it finds.

 

Now after inspecting the code, we can see that two shifts follow the ADC #2. If we took 2 on it's own and shifted it right twice it would be in the carry. Why not try SEC instead (a couple of lines below), and just see if it works? Turns out that it does, and already an additional byte has been saved! :)

 

The final new divide by 13 routine now looks like this:

 

;Divide by 13
;22 bytes, 39 cycles
 sta  temp
 lsr
 lsr
 adc  temp
 ror
 adc  temp
 ror
 adc  temp
 ror
 lsr
 lsr
 sec
 adc  temp
 ror
 lsr
 lsr
 lsr

 

 

Part1.zip

Part2.zip

 

 

Okay, so the forum won't let me upload a file larger then 2MB (I don't know why that limit is so low!). I've thus downloaded WinZip to make a split file (just click the download now button, skip the get winzip free button). Kind of shitty, I know, but I don't have a website to upload to. Al, if your read this, please increase limit.

 

So unzip both files and then put the unzipped contents into a new folder. You should have the two files below. Right click on the zipped file shown and extract it.

 

 

post-7074-0-38245600-1359334434_thumb.png

  • Like 1
Link to comment
Share on other sites

I worked on the /2/2/2/8/16 approximation. Same as before, it fails on its own. I've become very systematic in my approach though. I first lay out the routine and pad it with some ADC, SBC, and CMP testVariables. The CMP testVariable is only needed before an addition, of course. I try one at a time working from the top of the routine (highest influence) to the bottom. If those all fail, I go back and try some CLC, SEC padding before each ADC, SBC temp. Again I try one at the time.

 

 

The idea is to first search for a one or two byte solution that will correct the approximation... finding one if it is available. This requires minimal testing, and if the solution can't be found within one or two bytes then it might become too expensive anyway. Since it would be very more time consuming to try different combinations of SEC and CLC ( like 2 SEC's 2 CLC's or a SEC and CLC) I'm just sticking to one or either for now.

 

 

The /2/2/2/8/16 approximation is better because it uses one less shift (you can clearly see that because it's further left on the excel sheet). I padded it and worked my way down. I finally triggered a solution with a CMP testVariable, where testVariable was #255. I tried a CLC in it's place and trimmed a byte off of that.

 

 sta  temp
 lsr
;;  adc  testVariable
;;  sbc  testVariable
;;  cmp  testVariable
 adc  temp
 ror
;;  adc  testVariable
;;  sbc  testVariable
;;  cmp  testVariable
 adc  temp
 ror
;;  adc  testVariable
;;  sbc  testVariable
;;  cmp  testVariable
 adc  temp
 ror
;;  adc  testVariable
;;  sbc  testVariable
 lsr
;;  adc  testVariable
;;  sbc  testVariable
 lsr
;;  adc  testVariable
;;  sbc  testVariable
   cmp  testVariable  ;CMP #$FF triggered, then tried CLC instead and it passed too
 adc  temp
 ror
;  adc  testVariable
;  sbc  testVariable
 lsr
;  adc  testVariable
;  sbc  testVariable
 lsr
;  adc  testVariable
;  sbc  testVariable
 lsr

 

And the final code for the /2/2/2/8/16 approximation:

; Divide by 13
; 21 bytes, 37 cycles
 sta  temp
 lsr
 adc  temp
 ror
 adc  temp
 ror
 adc  temp
 ror
 lsr
 lsr
 clc
 adc  temp
 ror
 lsr
 lsr
 lsr

 

 

I played with the /2/2/8/16 approximation, but couldn't trigger a solution. It would be more desirable if it worked because it uses one less shift and one less addition. That leaves more cycles open, but first you need to find a solution to get it working at all before you can say what you really have.

  • Like 1
Link to comment
Share on other sites

I tried divide by 5 with my new excel sheet. Some promising approximations came up. I tried the /4/2/8/2/8 approximation, and it worked as is. The accuracy was pretty good on the approximation, but I was still surprised that it worked right away.

 

1/5 = 0.2

/4/2/8/2/8 approximation = 0.200195313 (but that's before bit truncation, etc...)

 

 

This routine is 1 cycle longer than my previous attempt, but saves 2 bytes.

 

;Divide by 5
;20 bytes, 35 cycles
 sta  temp
 lsr
 lsr
 adc  temp
 ror
 adc  temp
 ror
 lsr
 lsr
 adc  temp
 ror
 adc  temp
 ror
 lsr
 lsr

 

 

The /2/8/2/8 approximation I hope to get going. It has an accuracy of 0.19921875, and would save a lot of cycles (potentially). A quick divide by five is only a shift away from a quick divide by 10! :)

Link to comment
Share on other sites

Someone should break down and write a utility that tries random sequences of opcodes until a desired result happens.

 

The routines are so short, I wonder how long a brute force attack would take?

 

Current code is 20 bytes - just combining random bytes, it is one of 2^(160) combinations:

 

1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976

 

That's kind of a lot, even on hundreds or thousands of PC's.

 

A genetic algorithm may be able to find a solution in a human lifetime. It would be cool to see the code optimized for size and optimized for space.

Link to comment
Share on other sites

This may seem absurd but with ROMs as large as they are now and cycles being a scarce as ever, this old technique may be merited in some circumstances.

 

; 259 bytes, 4 cycles (5 if table crosses page boundary)
lda DivXBy5,x

.align 256
DivxBy5:
.byte 0,0,0,0,0,1,1,1,1,1,2,2,2,2,2
.byte 3,3,3,3,3,4,4,4,4,4,5,5,5,5,5,6
.byte 6,6,6,6,7,7,7,7,7,8,8,8,8,8,9,9
.byte 9,9,9,10,10,10,10,10,11,11,11,11,11,12,12,12
.byte 12,12,13,13,13,13,13,14,14,14,14,14,15,15,15,15
.byte 15,16,16,16,16,16,17,17,17,17,17,18,18,18,18,18
.byte 19,19,19,19,19,20,20,20,20,20,21,21,21,21,21,22
.byte 22,22,22,22,23,23,23,23,23,24,24,24,24,24,25,25
.byte 25,25,25,26,26,26,26,26,27,27,27,27,27,28,28,28
.byte 28,28,29,29,29,29,29,30,30,30,30,30,31,31,31,31
.byte 31,32,32,32,32,32,33,33,33,33,33,34,34,34,34,34
.byte 35,35,35,35,35,36,36,36,36,36,37,37,37,37,37,38
.byte 38,38,38,38,39,39,39,39,39,40,40,40,40,40,41,41
.byte 41,41,41,42,42,42,42,42,43,43,43,43,43,44,44,44
.byte 44,44,45,45,45,45,45,46,46,46,46,46,47,47,47,47
.byte 47,48,48,48,48,48,49,49,49,49,49,50,50,50,50,50

 

Also, in many circumstances you don't need 100% accuracy. For example if dividing velocity or measuring distance, some error is allowable. In those cases I would multiply by 1/5 in fixed point. which is the same as multiplying by (256/5) / 256 - which is the same as multiplying by 256/5 and just looking at the high byte which is what most of the routines above do. The errors are coming from integer math, i.e. 256/5 being 51.2 and not just 51, so just using the upper 8 bits for the shift and add multiplication is not always exactly correct but is rarely wrong and not wrong by more than one and is always near the correct answer.

 

Another example: For division by 7 I would use 256/7 which is 36.5714.... when just multiplying by 36 (by shifting and adding for 00100100b) the errors encountered in the result were due to the missing .5714. The very clever addition of 4 plus carry at the right time above would appear to resolve the error for integers ranging from 0 to 255. If I needed more precision I would add more bits like this: 1024/7 = 146 (10010010b), Since the first 1 bit is in the high bit, this has 8 bits of precision and would probably yield the correct answer - unfortunately I don't have time to write a test routine to demonstrate.

 

I love this thread.

 

P.S. Here is the crummy little Ruby script I whipped up to generate the table:

#!/usr/bin/env ruby
for i in 0..256
val = (i/5).to_i
if i==0
 line = "DivBy:\t.byte\t"
elsif i%16 == 15
 puts line
 line = "\t.byte\t"
else
 line << ","
end
line <<"#{val}"
end

Link to comment
Share on other sites

Wouldn't that be "only" 160^20 (= 1,208925819614629174706176e+44)?

 

20 bytes is 8*20 or 160 bits which can be combined in 2^160 combinations. (i.e. 4 bits = 2^4=16 values, or 8 bits is 2^8=256 values)

 

Either way the number is intractably large on my machines ;)

 

Of course that would include unwanted opcodes like NOP and KIL, and there are illegal opcodes that have duplicate functions. If you limited the opcodes to some number n, then there would be be n^20 combinations.

 

Crosschecking: If n=256, all possible byte values, then there would be 256^20 combinations which is the same as (2 ^ 8 ) ^ 20 or 2^(8*20) or 2^160.

 

So, how many unique opcodes are there for the Atari 2600's 6507? (Thomas, you would be the first one I would ask this question)

Edited by solidcorp
Link to comment
Share on other sites

The routines are so short, I wonder how long a brute force attack would take?

 

I first saw a paper about this back in the 80s and they dubbed this approach a superoptimizer

 

I've thought about it over the years, writing a superoptimizer for the 6502 but never tried. It would be pretty interesting.

 

Batari took a crack at one back in 2005/2006 - I downloaded it but it ran under DOS so I never tried it, I wonder if anyone else has made one?

 

The problem is embarrassingly parallel but there's a lot of territory to search, I'm thinking if you could get a 6502 emulator into a shader program you could use the GPU to parallel process the problem.

  • Like 1
Link to comment
Share on other sites

The problem is embarrassingly parallel but there's a lot of territory to search, I'm thinking if you could get a 6502 emulator into a shader program you could use the GPU to parallel process the problem.

 

You don't really need a full emulator just the ability to do a small set of integer operations on a small number of registers - you could do that with shader subroutines if the shader truly supports branching. As a matter of fact I bet you could represent all the operations using texture operation on 8 bit colors (though the carry bit would be trickier to handle). That way you could use CUDA or Direct Compute to attack the problem. There is still a lot of ground to cover and each test needs to test 256 values and compare to a set of correct answers. Interesting problem, I wish there was someone who wished to contract this out for work. ;-)

 

I think the genetic algorithms may have promise, though I have not really worked in this area. The results that come out of those optimization methods are ten times more complex than optimized assembly code, doing often unpredictable and apparently meaningless things to achieve a 100% correct result. I'd expect that for a 20 byte solution it would probably spit out a few pieces of code that look familiar but there may be some bizarre outliers that completely work.

Link to comment
Share on other sites

Interestingly, many of the above posted divison routines use only sta temp, adc temp, lsr, and ror. Also they are about 20 instructions long. So, add nop and that would be 5^20 combinations, which is about 95 * 10^12.

 

Given that modern GPUs compute TFLOPs, testing all combinations for a given problem may even be doable at home. If I wasn't so lazy I would now write a CUDA program to try it :).

Link to comment
Share on other sites

I'd love to see these illegals in a computer generated solution:

60+yy nzc--v RRA op ROR+ADC op=op RCR 1 // A=A+op+cy

6B nn nzc--v 2 ARR #nn AND+ROR A=(A AND nn), V=Overflow(A+A), A=A/2+C*80h, C=A.Bit6

 

They sort of do the work that we know needs to be done but then they do other "organic" stuff which could also be unpredictably useful.

Link to comment
Share on other sites

I got the divide by five /2/8/2/8 approximation working now. It is significantly faster, and now the divide by 10 routine has also been made faster, and shorter. :) Just add a lsr on the end of the routine to make it a divide by 10, of course.

 

;Divide by 5
;18 bytes, 30 cycles
 sta  temp
 lsr
 adc  #13
 adc  temp
 ror
 lsr
 lsr
 adc  temp
 ror
 adc  temp
 ror
 lsr
 lsr

  • Like 1
Link to comment
Share on other sites

Interestingly, many of the above posted divison routines use only sta temp, adc temp, lsr, and ror.

 

Correct! They are essentially all using the same algorithm. With the excel sheet I made I limit myself to the number of shifts that can be used (11 or less), and then look for which linear combinations of shifting and adding give the best approximation. This is a brute force attack with a sensible constraint (if you can't get there in 11 shifts it will take to long!). After that I try to coax the routine if it needs it.

 

But this is all for one algorithm. There were be a huge, huge amount of them found if you tried the entire instruction set, and went for a limit of 24 bytes in length or less.

Edited by Omegamatrix
Link to comment
Share on other sites

I haven't had much time, but I managed to make a second excel sheet to give me a preview of outcome for each algorithm. You enter shifts given by the first excel sheet, and then set your divisor.

 

DividerAlgorithm.zip

 

 

I find if you set the accuracy high on the first excel sheet you can find some good algorithms. I tried division by nine, and found the /8/2/2/16 had an 0.111328125, which is pretty good since 1/9 =1.11111111111 and we are only trying to cove 0-255.

 

 

I entered the divisor of 9 in the new excel sheet, and also entered in the 8,2,2, and 16 values. It showed that all the values could be accurately divided with the routine as is... no modification was needed. :) This sheet is good in that you can quickly see what values are deviating, and where they start.

 

 

The new divide by 9 routine, 1 cycle quicker and 3 bytes shorter then before:

 

;Divide by 9
;17 bytes, 30 cycles
 sta  temp
 lsr
 lsr
 lsr
 adc  temp
 ror
 adc  temp
 ror
 adc  temp
 ror
 lsr
 lsr
 lsr

Edited by Omegamatrix
  • Like 1
Link to comment
Share on other sites

I did a quick divide by 17, and divide by 19 routine. The divide by 19 was very accurate, and needed no coaxing. The divide by 17 needed a little bump, and I noticed something. Every routine that I've been able to correct with an addition had something in common. They all worked well in a low range, and the errors that did occur were in the upper range. The errors were spaced apart evenly by the divisor number. With divide by 17 you starting getting an error at 136 and (every 17th digit after, and it's only 1 off the correct value). You can correct it by inserting an addition in the right place. I used my program to determine what number it needs to be, and where.

 

post-7074-0-33435600-1359565667_thumb.png

 

The routines that I haven't been able to correct with addition work great at low ranges, but then start tanking quite badly. I made a bogus one up (pictured below). The idea in the picture is that errors start coming in long strings, and the strings get longer as the input numbers increase. I haven't been able to find a way to correct any of these bad approximations, but I would really like too because they often involve less shifts. If there was some general method to correct these bad approximations (within a few cycles) then faster routines could be found quite easily.

 

post-7074-0-49178200-1359565997_thumb.png

 

 

Here are the divide by 17 (and divide by 19 routines:

 

;1/17 = 0.058823529
;/2/2/2/32 0.05859375

;Divide by 17
;18 bytes, 30 cycles
 sta temp
 lsr
 adc temp
 ror
 adc temp
 ror
 adc temp
 ror
 adc #0
 lsr
 lsr
 lsr
 lsr

 

;1/19 = 0.052631579
;/2/4/2/32 0.052734375

;Divide by 19
;17 bytes, 30 cycles
 sta temp
 lsr
 adc temp
 ror
 lsr
 adc temp
 ror
 adc temp
 ror
 lsr
 lsr
 lsr
 lsr

Edited by Omegamatrix
Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...