Jump to content

fox

Members
  • Content Count

    266
  • Joined

  • Last visited

Posts posted by fox


  1. Fast multiplication

    (original article in Polish disk magazine "Syzygy" no. 6)

     

    The problem of multiplication on the 6502 is as old as this processor.

    The algorithms that can be found in various asm courses base on bit shifting

    and addition, and need about 100 cycles for calculating product of two

    8-bit numbers. Such algorithms are not fast at all.

     

    There is an effective solution to this problem: create 2-dimensional array

    (multiplication table) that contains all the possible results.

    Unfortunately not always one can afford 16 KB memory for this.

     

    What I suggest is an intermediate solution: implement multiplication

    using an addition and a function "calculated" using a lookup table.

     

    Consider the following formulas:

     

    (1) a*b = exp( log(a)+log(b) )

    (2) a*b = ( cos(x+y)+cos(x-y) )/2

    where: x = arc cos (a), y = arc cos (b)

    (3) a*b = ( (a+b)^2-a^2-b^2 )/2

    (4) a*b = ( (a+b)^2-(a-b)^2 )/4

     

    Which formula to choose? The best would be one:

    - simple to calculate,

    - true for any a and b,

    - that not requires high precision,

    - uses a lookup table easy to generate.

     

    These conditions are fulfilled best by the formula (4).

     

    The calculation is very simple:

    a*b = f(a+b)-f(a-b)

    where: f(x) = (x*x)/4

     

    At first glance, one thinks that you have to store

    the fractional part of (x*x)/4 in the lookup table

    in order to get an exact result. This is not the case,

    because:

    1) for even x, x = 2*k :

    f(2*k) = k*k

    - the result is an integer

    2) for odd x, x = 2*k+1

    f(2*k+1) = k*k+k+1/4

    - the result is an integer + 0.25.

    However, if a+b is odd, then a-b is odd too.

    The result of f(a+b)-f(a-b) would be therefore same,

    no matter if f() includes this 0.25 or not.

     

    Calculating successive squares is easy once you remember

    that:

     

    1*1 = 1 = 0+1

    2*2 = 4 = 0+1+3

    3*3 = 9 = 0+1+3+5

    ...

     

    That is, you only need to sum successive odd numbers.

     

    The algorithm is general enough to calculate products

    of signed numbers.

     

    There is no universal "fastest multiplication" routine,

    because universal routines are not fast and vice-versa.

     

    When implementing your own fast multiplication routine

    you should consider the following things:

    - how big are numbers which you want to multiplicate

    - are the numbers signed or not

    - where the factors are stored and where the result

    should be stored

    - what precision you want - for example, you may need

    just the high byte of the product and +-1 is acceptable

     

    But to not make this article too theoretical, I present

    two routines:

    - MULU8 - multiplicates unsigned numbers in A and X registers

    - MULS8 - multiplicates signed numbers in A and X

     

    A few words about the routines:

    - both routines calculate sums using 6502's indexed addressing mode.

    Unfortunately -A needs to calculated separately, which can be done

    with EOR #$ff and adding one. Instead of adding 1, we use a separate

    lookup table for (x+1)^2.

    - for MULU8 one lookup table contains squares of 0..510 and the other

    one -255..255

    - for MULS8 both tables contain squares of -255..255

    - for MULS8 $80 needs to be added to both input numbers, using EOR #$80.

    This eliminates the ambiguity that arises when adding signed numbers,

    for example:

    $60+$60=$c0

    -$40+0=$c0

    Instead, we have:

    $e0+$e0=$1c0 (positive $c0)

    $40+$80=$c0 (negative -$40)

    If the numbers are small (precisely: abs(A)+abs(X)<128), the EOR #$80

    are redundant, but you need to use different lookup tables.

     

    mulu8.asm:

     

    * Unsigned multiplication
    * A*X -> A:Y (A-high byte, Y-low)
    * b. Fox/Tqa
    
        opt 21
        org $480
    
    sq1l equ $a000
    sq1h equ $a200
    sq2l equ $a400
    sq2h equ $a600
    
        jsr init
    
        lda #123
        ldx #234
    
        sta l1+1
        sta h1+1
        eor #$ff
        sta l2+1
        sta h2+1
        sec
    l1   lda sq1l,x
    l2   sbc sq2l,x
        tay
    h1   lda sq1h,x
    h2   sbc sq2h,x
    
        brk
        brk
        brk
    
    init ldx #0
        stx sq1l
        stx sq1h
        stx sq2l+$ff
        stx sq2h+$ff
        ldy #$ff
    msq1 txa
        lsr @
        adc sq1l,x
        sta sq1l+1,x
        sta sq2l-1,y
        sta sq2l+$100,x
        lda #0
        adc sq1h,x
        sta sq1h+1,x
        sta sq2h-1,y
        sta sq2h+$100,x
        inx
        dey
        bne msq1
    msq2 tya
        sbc #0 -
        ror @
        adc sq1l+$ff,y
        sta sq1l+$100,y
        lda #0
        adc sq1h+$ff,y
        sta sq1h+$100,y
        iny
        bne msq2
        rts
    
        end
    

     

    muls8.asm:

    * Signed multiplication
    * A*X -> A:Y (A-high byte, Y-low)
    * b. Fox/Tqa
    
        opt 21
        org $480
    
    sq1l equ $a000
    sq1h equ $a200
    sq2l equ $a400
    sq2h equ $a600
    
        jsr init
    
        lda <123
        ldx <0-45
    
        eor #$80
        sta l1+1
        sta h1+1
        eor #$ff
        sta l2+1
        sta h2+1
        txa
        eor #$80
        tax
        sec
    l1   lda sq1l,x
    l2   sbc sq2l,x
        tay
    h1   lda sq1h,x
    h2   sbc sq2h,x
    
        brk
        brk
        brk
    
    
    init ldx #0
        stx sq1l+$100
        stx sq1h+$100
        stx sq2l+$ff
        stx sq2h+$ff
        ldy #$ff
    msqr txa
        lsr @
        adc sq1l+$100,x
        sta sq1l,y
        sta sq1l+$101,x
        sta sq2l-1,y
        sta sq2l+$100,x
        lda #0
        adc sq1h+$100,x
        sta sq1h,y
        sta sq1h+$101,x
        sta sq2h-1,y
        sta sq2h+$100,x
        inx
        dey
        bne msqr
        rts
    
        end
    


  2. The C environment will make use of registers (mostly A and X) and

    so if you are alrerting these in your function you should push them

    beforehand and restore them afterwards.

    975813[/snapback]

     

    A and X are where you return the result of your function.

    If your asm function is declared void, the values you leave in A and X

    don't matter. But if you mix asm and C (as I wrote, better don't do that),

    then you should be extremely care about all registers.

     

    You shouldn't use any zeropage location you can imagine ($cb etc).

    Instead, you should "importzp" memory locations. I think you can safely

    modify 16-bit ptr1..ptr4 and 8-bit tmp1-tmp4 in your asm-only routines.

     

    Moreover, you should not just "RTS" from a C routine. The C code may

    need some stack cleanup. Just remove asm("RTS").


  3. B PC=FunctionToDebug OR PC=FunctionDoneLabel

     

    If you debug your own code, you can compile in CIM instructions which basically act as breakpoints. For example, use:

     dta 2
    

     

    I guess with Perl (or another scripting language) you could cross

    reference the exclude list with the labels file and then filter the trace

    output accordingly, so yes, that would suffice too.

     

    Exactly. This can be done with a few lines of Perl.


  4. I would like to be able to break at the end of an NMI or DLI and get the cycle count for the interrupt, or even a running log of cycle counts.

    What if another interrupt interrupts your interrupt?

    Also I would like to be able to break on the status of any register, like break on vcount = 10, break on sp < 0x1e0 or even break on portb != 0xff.

    973489[/snapback]

    VCOUNT and SP conditions are available in 4.0. You can also set a breakpoint for instructions that write to portb (or any address space area), but you can't set a breakpoint for the value which is written.

     

    Oh and I'm working with 3.1 here, but if its not already in 4.0, I would like to be able to get more than one break point condition. :D

    973490[/snapback]

     

    Go upgrade to 4.0. :)


  5. One thing I think I'd find useful in development would be

    a trace-exclude feature, so that code execution going through

    know ranges of memory wouldn't be logged, e.g. DLIs.

    Linked in with labels exported by your assembler/compiler

    this should be straight-forward to put in, though (as with

    most debugging features) you'd get a performance hit in

    the emulation, but systems tend to be fast enough to cover

    this these days.

    If you mean that for the TRON command, I don't think it's useful.

    Anyway, I recommend you learning basics of the Perl programming language

    - you can easily filter any text files with Perl.

     

    Following in from this it would also then be nice to build your

    debugger commands into a script and be able import them.

    This would throw away any existing debugger breakpoints etc

    and then follow your script.

     

    I don't get it. What script do you mean?

     

    Does anyone else have some ideas to assist their developments?

     

    I think A800Win PLus debugger is already too sophisticated.

    The old one was good enough for my purposes.


  6. It's not impressive at all. This is how normally the multiplication is done on a 6502 (easy and slow way).

     

    The cool way is:

    a * b = f(a + b) - f(a - b)

    where:

    f(x) = x * x / 4 and is of course stored in a precomputed lookup table.

     

    This can be very effectively computed on a 6502, even with signed numbers! (unlike the Russian crap)

     

    I described it in detail and with examples years ago in Polish disk magazine "Syzygy" (issue #5 I think). If someone is interested, I may translate it here. Real-world example is in my portal engine in Numen, where I use both signed and unsigned variations and 16-bit * 16-bit with 32-bit result too. Using the Russian way it would be several times slower.

     

    As for the division I don't know a really fast algorithm. To my knowledge, even in modern CPUs division is much slower than multiplication. Avoid it if you can. If you can't, calculate 1/b once and then multiply a * (1/b) - I use such a technique in this portal engine, too (see Project_calcInverseD for how to quickly compute $2800 / x).


  7. Another difference from zlib is how the offsets work.

    The offsets of Exomizer points to the 'wrong' end of the sequences, the end furthest away. But this has the advantage of supporting sequences that can encode repeating patterns of one or more bytes like this:

    (offset 1, length 2345) == rle sequence.

    (offset 4, length 16) == four byte pattern repeated 4 times.

    It's the same in zlib.

     

    I repeat, can someone do a comparison of compression results: exo vs zlib?

     

    Maybe I can find the uncompressed "Drunk Chessboard" somewhere.

    It was 24KB, and 16 KB when compressed with FlashPack. The depack routine was something like 80 bytes long and supported LZ77+RLE+changing destination memory addresses. I think you can't beat its speed.

     

    * Depack routine for FlashPack 2.1
    * By Fox of Taquart, 5th May 1997
    * Average speed: 30 kB/sec
    
    * 4 bytes on page 0
    ff equ $fc
    bt equ $fd
    ad equ $fe
    
    * Do not start from here!!!
    dep1 tax
    beq ret
    lda #$7f
    dep2 bcc *+3
    inx
    inx
    sta ad
    dep3 lda (ad),y
    put  sta $8080,y
    iny
    bne dep4
    inc ad+1
    inc put+2
    dep4 dex
    bne dep3
    asl bt
    bne dep7
    asl ff
    bne dep5
    * Routine starts here!
    start equ *
    sec
    jsr get
    rol @
    sta ff
    dep5 lda #1
    bcc dep6
    jsr get
    rol @
    dep6 sta bt
    dep7 jsr get
    ldx #1
    bcc put
    lsr @
    bne dep2
    jsr get
    bcs dep1
    tay
    jsr get
    sta ad+1
    sta put+2
    bcc dep7 !
    * Set address of packed data here!
    * (or make your own "get" routine)
    get lda $ffff
    inc get+1
    bne ret
    inc get+2
    ret rts
    
    end of code
    


  8. Negative numbers aren't natively supported in Assembler.

    Actually they are, maybe just some crappy assemblers don't support negative constants directly. The same problems with signed numbers that you can encounter with assembly language can be seen in C/Java: e.g. you add two positive integers and get a negative result. So understanding the representation of signed integers is good no matter which language you are programming in (except when it's only Atari BASIC).

     

    But, that won't work for multi-byte binary values.

    Of course it will. Moreover, it's much easier than your "sign & absolute value" method. You add multi-byte signed integers as normal:

    lda val1
    clc
    adc val2
    sta result
    lda val1+1
    adc val2+1
    sta result+1
    


  9. Acually if you use my code with "cpx #$80" and not "cmp #-48",

    wrapping problems arise only if the absolute difference exceeds 96,

    which is close to the range of signed 8-bit number: -128..127.

    So there's not much you can do without widening the input to 16 bits

    (problematic).


  10. Unlike zlib inflate/deflate, exomizer doesn't do any encoding of uncompressed bytes but it does compress sequences down to the length of one byte to compensate.

    Do you mean you support one-byte LZ77 sequences? :o

    Another difference compared to zlib inflate/deflate is that the encoding is static and not dynamic. This makes the decruncher simpler and probably faster than zlib deflate.

    My zlib decompress routine isn't actually that slow, thanks to clever lookup tables.

    Actually it's several times faster than most of the other decompressors on Atari.

    I've used it in Numen for background decompression while the effects are running.


  11.  Length of indata: 45533 bytes.
    

    So original = 33,050 bytes, new xex = 22,707 bytes

     

    Was the original 33050 or 45533 bytes long?

    Can you "gzip -9" the original file and add 800 bytes for the depack routine,

    for a quick comparison?

    Or even better, create GZip archive ("Ultra" compression level) using 7-Zip on Windows?


  12. The mirroring actually occurs through more then just one page. For example POKEY is enabled when A15=1, A14=1,A13=1,A12=0 and A11=1, the rest of the address bits don't matter. So $E800 will access the same register as $E900, $EA00, etc.

     

    Okay, so POKEY is in $E800-$EFFF.

    What about the remaining unexplored areas?


  13. So, the code that Heaven used is a standard zlib-compatible inflate routine: dynamic Huffman-coded LZ77 with 32KB window (same as in ZIP). As I understand, exo does some kind of Huffman only for offsets and length, not uncompressed bytes, right?

     

    My old FlashPack did only LZ77 2- and 3-byte sequences in 127-byte window

    + RLE and compressed marks of uncompressed bytes (a block of 8 uncompressed bytes is marked with a single bit).


  14. The emulator is written in pure assembly language, the main emulator core is completely coded in Thumb and runs from ROM. The CPU emulation is a combination static recompiler & interpretive core -- for those pesky VCS games that run parts of their code out of RAM -- and is coded in ARM assembly running from the internal WRAM of the GBA.

     

    Frame rate is a respectable 39fps, and a little more optimisation will get that to a nice 60fps. There is no sound emulation yet

     

    800/XL/XE/5200 are much more complicated than 2600 and you can't reliably use

    static recompiler since code is usually in RAM (unlike on 2600).

    Coding full 800/XL/XE/5200 in Thumb asm is tedious and I doubt we can reach

    100% real speed even at 10fps.


  15. According to the equates...

    GTIA......$C000 - $C01F

    ANTIC....$D400 - $D40F

    POKEY....$E800 - $E80F

     

    What of the other registers? Are they used for anything? If not are they available to the programmer?

    960011[/snapback]

     

    I know about these and I'm pretty sure that whole pages are used for mirrored registers. But what's in $C100 - $D3FF, $D500-$E7FF and $E900-$F7FF ?


  16. Can you tell me how a real 5200 behaves when powered on without a cartridge?

     

    Atari800 emulator (and therefore Atari800Win PLus) shows a rainbow fuji logo and crashes after a while. I don't believe it's what happens on real hardware, but I don't see any code in 5200 BIOS to make it behave differently.


  17. Looks like the emulator palettes are based on a PAL palette

     

    The single reason for this is that the current Atari800 default, jakub.act

    was grabbed from PAL machine, just like the previous one, real.act.

    If you agree here for a "correct" NTSC palette, I can upload it to the Atari800

    emulator repository. Not necessarily one - there may be several NTSC palettes

    to choose from.

     

    *.act are plain 768-byte RGB palettes, supported by e.g. Photoshop,

    so you should have no problem with creating *.act from a picture

    where Atari colors are shown in order (dark on the left, light on the right).

    You can then load this *.act file into Atari800 and see how it looks

    in different programs.

×
×
  • Create New...