Jump to content

fox

Members
  • Content Count

    266
  • Joined

  • Last visited

Everything posted by fox

  1. Fast multiplication (original article in Polish disk magazine "Syzygy" no. 6) The problem of multiplication on the 6502 is as old as this processor. The algorithms that can be found in various asm courses base on bit shifting and addition, and need about 100 cycles for calculating product of two 8-bit numbers. Such algorithms are not fast at all. There is an effective solution to this problem: create 2-dimensional array (multiplication table) that contains all the possible results. Unfortunately not always one can afford 16 KB memory for this. What I suggest is an intermediate solution: implement multiplication using an addition and a function "calculated" using a lookup table. Consider the following formulas: (1) a*b = exp( log(a)+log(b) ) (2) a*b = ( cos(x+y)+cos(x-y) )/2 where: x = arc cos (a), y = arc cos (b) (3) a*b = ( (a+b)^2-a^2-b^2 )/2 (4) a*b = ( (a+b)^2-(a-b)^2 )/4 Which formula to choose? The best would be one: - simple to calculate, - true for any a and b, - that not requires high precision, - uses a lookup table easy to generate. These conditions are fulfilled best by the formula (4). The calculation is very simple: a*b = f(a+b)-f(a-b) where: f(x) = (x*x)/4 At first glance, one thinks that you have to store the fractional part of (x*x)/4 in the lookup table in order to get an exact result. This is not the case, because: 1) for even x, x = 2*k : f(2*k) = k*k - the result is an integer 2) for odd x, x = 2*k+1 f(2*k+1) = k*k+k+1/4 - the result is an integer + 0.25. However, if a+b is odd, then a-b is odd too. The result of f(a+b)-f(a-b) would be therefore same, no matter if f() includes this 0.25 or not. Calculating successive squares is easy once you remember that: 1*1 = 1 = 0+1 2*2 = 4 = 0+1+3 3*3 = 9 = 0+1+3+5 ... That is, you only need to sum successive odd numbers. The algorithm is general enough to calculate products of signed numbers. There is no universal "fastest multiplication" routine, because universal routines are not fast and vice-versa. When implementing your own fast multiplication routine you should consider the following things: - how big are numbers which you want to multiplicate - are the numbers signed or not - where the factors are stored and where the result should be stored - what precision you want - for example, you may need just the high byte of the product and +-1 is acceptable But to not make this article too theoretical, I present two routines: - MULU8 - multiplicates unsigned numbers in A and X registers - MULS8 - multiplicates signed numbers in A and X A few words about the routines: - both routines calculate sums using 6502's indexed addressing mode. Unfortunately -A needs to calculated separately, which can be done with EOR #$ff and adding one. Instead of adding 1, we use a separate lookup table for (x+1)^2. - for MULU8 one lookup table contains squares of 0..510 and the other one -255..255 - for MULS8 both tables contain squares of -255..255 - for MULS8 $80 needs to be added to both input numbers, using EOR #$80. This eliminates the ambiguity that arises when adding signed numbers, for example: $60+$60=$c0 -$40+0=$c0 Instead, we have: $e0+$e0=$1c0 (positive $c0) $40+$80=$c0 (negative -$40) If the numbers are small (precisely: abs(A)+abs(X)<128), the EOR #$80 are redundant, but you need to use different lookup tables. mulu8.asm: * Unsigned multiplication * A*X -> A:Y (A-high byte, Y-low) * b. Fox/Tqa opt 21 org $480 sq1l equ $a000 sq1h equ $a200 sq2l equ $a400 sq2h equ $a600 jsr init lda #123 ldx #234 sta l1+1 sta h1+1 eor #$ff sta l2+1 sta h2+1 sec l1 lda sq1l,x l2 sbc sq2l,x tay h1 lda sq1h,x h2 sbc sq2h,x brk brk brk init ldx #0 stx sq1l stx sq1h stx sq2l+$ff stx sq2h+$ff ldy #$ff msq1 txa lsr @ adc sq1l,x sta sq1l+1,x sta sq2l-1,y sta sq2l+$100,x lda #0 adc sq1h,x sta sq1h+1,x sta sq2h-1,y sta sq2h+$100,x inx dey bne msq1 msq2 tya sbc #0 - ror @ adc sq1l+$ff,y sta sq1l+$100,y lda #0 adc sq1h+$ff,y sta sq1h+$100,y iny bne msq2 rts end muls8.asm: * Signed multiplication * A*X -> A:Y (A-high byte, Y-low) * b. Fox/Tqa opt 21 org $480 sq1l equ $a000 sq1h equ $a200 sq2l equ $a400 sq2h equ $a600 jsr init lda <123 ldx <0-45 eor #$80 sta l1+1 sta h1+1 eor #$ff sta l2+1 sta h2+1 txa eor #$80 tax sec l1 lda sq1l,x l2 sbc sq2l,x tay h1 lda sq1h,x h2 sbc sq2h,x brk brk brk init ldx #0 stx sq1l+$100 stx sq1h+$100 stx sq2l+$ff stx sq2h+$ff ldy #$ff msqr txa lsr @ adc sq1l+$100,x sta sq1l,y sta sq1l+$101,x sta sq2l-1,y sta sq2l+$100,x lda #0 adc sq1h+$100,x sta sq1h,y sta sq1h+$101,x sta sq2h-1,y sta sq2h+$100,x inx dey bne msqr rts end
  2. A and X are where you return the result of your function. If your asm function is declared void, the values you leave in A and X don't matter. But if you mix asm and C (as I wrote, better don't do that), then you should be extremely care about all registers. You shouldn't use any zeropage location you can imagine ($cb etc). Instead, you should "importzp" memory locations. I think you can safely modify 16-bit ptr1..ptr4 and 8-bit tmp1-tmp4 in your asm-only routines. Moreover, you should not just "RTS" from a C routine. The C code may need some stack cleanup. Just remove asm("RTS").
  3. If you debug your own code, you can compile in CIM instructions which basically act as breakpoints. For example, use: dta 2 Exactly. This can be done with a few lines of Perl.
  4. What if another interrupt interrupts your interrupt? VCOUNT and SP conditions are available in 4.0. You can also set a breakpoint for instructions that write to portb (or any address space area), but you can't set a breakpoint for the value which is written. Go upgrade to 4.0.
  5. If you mean that for the TRON command, I don't think it's useful. Anyway, I recommend you learning basics of the Perl programming language - you can easily filter any text files with Perl. I don't get it. What script do you mean? I think A800Win PLus debugger is already too sophisticated. The old one was good enough for my purposes.
  6. Even better, write asm routines in *.s and not *.c files. See source code of cc65 libraries for lots of examples. You will also see there how to pass arguments from C to asm - it is completely different from BASIC.
  7. It's not impressive at all. This is how normally the multiplication is done on a 6502 (easy and slow way). The cool way is: a * b = f(a + b) - f(a - b) where: f(x) = x * x / 4 and is of course stored in a precomputed lookup table. This can be very effectively computed on a 6502, even with signed numbers! (unlike the Russian crap) I described it in detail and with examples years ago in Polish disk magazine "Syzygy" (issue #5 I think). If someone is interested, I may translate it here. Real-world example is in my portal engine in Numen, where I use both signed and unsigned variations and 16-bit * 16-bit with 32-bit result too. Using the Russian way it would be several times slower. As for the division I don't know a really fast algorithm. To my knowledge, even in modern CPUs division is much slower than multiplication. Avoid it if you can. If you can't, calculate 1/b once and then multiply a * (1/b) - I use such a technique in this portal engine, too (see Project_calcInverseD for how to quickly compute $2800 / x).
  8. It's the same in zlib. I repeat, can someone do a comparison of compression results: exo vs zlib? Maybe I can find the uncompressed "Drunk Chessboard" somewhere. It was 24KB, and 16 KB when compressed with FlashPack. The depack routine was something like 80 bytes long and supported LZ77+RLE+changing destination memory addresses. I think you can't beat its speed. * Depack routine for FlashPack 2.1 * By Fox of Taquart, 5th May 1997 * Average speed: 30 kB/sec * 4 bytes on page 0 ff equ $fc bt equ $fd ad equ $fe * Do not start from here!!! dep1 tax beq ret lda #$7f dep2 bcc *+3 inx inx sta ad dep3 lda (ad),y put sta $8080,y iny bne dep4 inc ad+1 inc put+2 dep4 dex bne dep3 asl bt bne dep7 asl ff bne dep5 * Routine starts here! start equ * sec jsr get rol @ sta ff dep5 lda #1 bcc dep6 jsr get rol @ dep6 sta bt dep7 jsr get ldx #1 bcc put lsr @ bne dep2 jsr get bcs dep1 tay jsr get sta ad+1 sta put+2 bcc dep7 ! * Set address of packed data here! * (or make your own "get" routine) get lda $ffff inc get+1 bne ret inc get+2 ret rts end of code
  9. Actually they are, maybe just some crappy assemblers don't support negative constants directly. The same problems with signed numbers that you can encounter with assembly language can be seen in C/Java: e.g. you add two positive integers and get a negative result. So understanding the representation of signed integers is good no matter which language you are programming in (except when it's only Atari BASIC). Of course it will. Moreover, it's much easier than your "sign & absolute value" method. You add multi-byte signed integers as normal: lda val1 clc adc val2 sta result lda val1+1 adc val2+1 sta result+1
  10. Acually if you use my code with "cpx #$80" and not "cmp #-48", wrapping problems arise only if the absolute difference exceeds 96, which is close to the range of signed 8-bit number: -128..127. So there's not much you can do without widening the input to 16 bits (problematic).
  11. It's even better when you replace "cmp #-48" with "cpx #$80".
  12. cursor_add_x .local txa clc adc cursor_x cmp MOD_max_x bcc _set cmp #-48 ; if it's -48..-1 then set to 0 ; I chose -48 because 48+160+48==256 lda #0 bcs _set ; else set to MOD_max_x lda MOD_max_x _set sta cursor_x rts
  13. Do you mean you support one-byte LZ77 sequences? My zlib decompress routine isn't actually that slow, thanks to clever lookup tables. Actually it's several times faster than most of the other decompressors on Atari. I've used it in Numen for background decompression while the effects are running.
  14. Was the original 33050 or 45533 bytes long? Can you "gzip -9" the original file and add 800 bytes for the depack routine, for a quick comparison? Or even better, create GZip archive ("Ultra" compression level) using 7-Zip on Windows?
  15. fox

    5200 memory map

    Okay, so POKEY is in $E800-$EFFF. What about the remaining unexplored areas?
  16. So, the code that Heaven used is a standard zlib-compatible inflate routine: dynamic Huffman-coded LZ77 with 32KB window (same as in ZIP). As I understand, exo does some kind of Huffman only for offsets and length, not uncompressed bytes, right? My old FlashPack did only LZ77 2- and 3-byte sequences in 127-byte window + RLE and compressed marks of uncompressed bytes (a block of 8 uncompressed bytes is marked with a single bit).
  17. 800/XL/XE/5200 are much more complicated than 2600 and you can't reliably use static recompiler since code is usually in RAM (unlike on 2600). Coding full 800/XL/XE/5200 in Thumb asm is tedious and I doubt we can reach 100% real speed even at 10fps.
  18. fox

    5200 memory map

    I know about these and I'm pretty sure that whole pages are used for mirrored registers. But what's in $C100 - $D3FF, $D500-$E7FF and $E900-$F7FF ?
  19. I don't know know what kind of compression exo uses, nor what routine did you use in Boinxx.
  20. Could you shortly describe the compression algorithm used?
  21. I think GBA is not powerful enough for the Atari800 emulator.
  22. XFormer sucks. As for DOSes, MyDOS and DOS II+ aren't bad.
  23. fox

    5200 memory map

    I know that $0000-$3fff is RAM, $4000-$bfff is cartridge, $f800-$ffff is built-in ROM. What about the remaining area? How is it mapped to chips?
  24. Can you tell me how a real 5200 behaves when powered on without a cartridge? Atari800 emulator (and therefore Atari800Win PLus) shows a rainbow fuji logo and crashes after a while. I don't believe it's what happens on real hardware, but I don't see any code in 5200 BIOS to make it behave differently.
×
×
  • Create New...