
KickC Benchmark Tests (cf Mad Pascal) ... WIP



@fenrock

 

As far as I understand, you use a lookup table for multiplication. If you want a fair comparison with Mad Pascal, use the

{$f $60}

directive in the MP suite. I don't remember exactly whether $6000 is a free area, but you can check it ;)

FASTMUL {$F page}

Seriously fast multiplication (8-bit and 16-bit)

{$f $70}  // fastmul at $7000

Alternative fast multiplication procedures for the BYTE, SHORTINT, WORD, SMALLINT and SHORTREAL types. The procedures occupy 2 KB and are placed starting at address PAGE*256.
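
For readers who haven't met table-driven multiplication before: the description above only says that FASTMUL takes 2 KB of memory starting at PAGE*256, not how it works internally. One technique commonly used for this on the 6502 is the quarter-square identity, a*b = floor((a+b)^2/4) - floor((a-b)^2/4). The C sketch below illustrates that idea under the assumption that something similar is going on; the names `quarter_sq`, `fastmul_init`, `fastmul_u8` and `fastmul_base` are made up for the example and are not Mad Pascal's.

```c
#include <stdint.h>

/* Quarter-square table: quarter_sq[x] = floor(x*x / 4) for x = 0..510.
   Illustration only; this is not Mad Pascal's actual table layout. */
static uint16_t quarter_sq[511];

/* Fill the table once at start-up. */
void fastmul_init(void) {
    for (uint32_t x = 0; x < 511; x++) {
        quarter_sq[x] = (uint16_t)((x * x) / 4);
    }
}

/* 8x8 -> 16 bit multiply using two table lookups and one subtraction. */
uint16_t fastmul_u8(uint8_t a, uint8_t b) {
    uint16_t sum  = (uint16_t)(a + b);                   /* 0..510 */
    uint16_t diff = (uint16_t)(a > b ? a - b : b - a);   /* 0..255 */
    return (uint16_t)(quarter_sq[sum] - quarter_sq[diff]);
}

/* Base address selected by {$f page}: the tables start at page * 256. */
uint16_t fastmul_base(uint8_t page) {
    return (uint16_t)(page << 8);
}
```

With `{$f $70}`, `fastmul_base(0x70)` gives $7000, matching the comment in the directive example. The floor terms cancel because a+b and a-b always have the same parity.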

 

Edited by zbyti
more info

Just now, zbyti said:

@fenrock

 

As far as I understand, you use a lookup table for multiplication. If you want a fair comparison with Mad Pascal, use the

{$f $70}

directive in the MP suite.

Thanks, I've been wondering about a solution for this, as storing the first 127 squared numbers in an array kind of avoids the point of the test.

I'll look into this.
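
To make the concern concrete, here is a small C sketch of the two approaches. It assumes the benchmark squares values in the range 0..127; the names `sqr_table`, `square_lookup` and `square_multiply` are hypothetical and not taken from the benchmark source.

```c
#include <stdint.h>

/* Hypothetical pre-generated table of squares for n = 0..127.
   127 * 127 = 16129, so each entry fits in 16 bits. */
static uint16_t sqr_table[128];

void sqr_table_init(void) {
    for (uint16_t n = 0; n < 128; n++) {
        sqr_table[n] = (uint16_t)(n * n);
    }
}

/* Table version: no multiplication happens inside the timed loop. */
uint16_t square_lookup(uint8_t n) {
    return sqr_table[n & 127];
}

/* Multiply version: this is the work the benchmark is meant to measure. */
uint16_t square_multiply(uint8_t n) {
    n &= 127;
    return (uint16_t)(n * n);
}
```

Both functions return the same values, but only the second one exercises the compiler's multiplication routine, which is why the table version makes the comparison with other compilers unfair.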


I am glad to see a new actor in the Atari 8-bit world. KickC looks like a good competitor to the existing languages for our beloved machine. I see it as a good alternative to CC65, with the C language as its basis. Together with Mad Pascal it also gives us new speed comparisons and new examples, which zbyti is preparing for us, even for other languages. If interest in this language grows, it will be a good candidate for support in Mad Studio.

 


@fenrock Monte Carlo Case ;) 

 

CC65 uses:

; 8x16 routine with external entry points used by the 16x16 routine in mul.s
tosmula0:
tosumula0:
        sta     ptr4
mul8x16:jsr     popptr1         ; Get left operand (Y=0 by popptr1)

        tya                     ; Clear byte 1
        ldy     #8              ; Number of bits
        ldx     ptr1+1          ; check if lhs is 8 bit only
        beq     mul8x8          ; Do 8x8 multiplication if high byte zero
mul8x16a:
        sta     ptr4+1          ; Clear byte 2

        lsr     ptr4            ; Get first bit into carry
@L0:    bcc     @L1

        clc
        adc     ptr1
        tax
        lda     ptr1+1          ; hi byte of left op
        adc     ptr4+1
        sta     ptr4+1
        txa

@L1:    ror     ptr4+1
        ror     a
        ror     ptr4
        dey
        bne     @L0
        tax
        lda     ptr4            ; Load the result
        rts

Mad Pascal uses:

;
; Ullrich von Bassewitz, 2009-08-17
;
; CC65 runtime: 8x8 => 16 unsigned multiplication
;

.proc	imulCL
ptr1 = ecx
ptr4 = eax
	
	ldy #8
	lda #0

        lsr     ptr4            ; Get first bit into carry
@L0:    bcc     @L1
        clc
        adc     ptr1
@L1:    ror	@
        ror     ptr4
        dey
        bne     @L0
        sta	ptr4+1

	rts
.endp
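
For anyone who prefers to read the loop in C, below is a step-by-step model of the shift-and-add routine above. It is a sketch written for this thread, not code from either runtime: `a` stands in for eax (ptr4), `b` for ecx (ptr1), and the function name is made up.

```c
#include <stdint.h>

/* C model of the 8x8 => 16 shift-and-add loop in imulCL.
   'a' plays the role of eax (ptr4), 'b' the role of ecx (ptr1). */
uint16_t imul_shift_add(uint8_t a, uint8_t b) {
    uint8_t hi = 0;                 /* accumulator A: high byte of the result  */
    uint8_t lo = a;                 /* ptr4: multiplier, becomes the low byte  */
    uint8_t carry = lo & 1;         /* lsr ptr4: first multiplier bit to carry */
    lo >>= 1;

    for (int i = 0; i < 8; i++) {   /* ldy #8 ... dey / bne */
        if (carry) {                /* bcc skips the add when the bit is 0 */
            uint16_t sum = (uint16_t)(hi + b);   /* clc / adc ptr1 */
            hi = (uint8_t)sum;
            carry = (uint8_t)(sum >> 8);
        }
        /* ror A: carry enters bit 7, bit 0 leaves into carry */
        uint8_t out = hi & 1;
        hi = (uint8_t)((hi >> 1) | (carry << 7));
        carry = out;
        /* ror ptr4: same rotation on the low byte */
        out = lo & 1;
        lo = (uint8_t)((lo >> 1) | (carry << 7));
        carry = out;
    }
    return (uint16_t)((hi << 8) | lo);   /* sta ptr4+1 */
}
```

Each pass consumes one multiplier bit from the bottom of `lo` and shifts one finished result bit in at the top, so the multiplier and the low byte of the product share the same location.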

@tebe explained:

MP widens the type for multiplication and other operations:
for u8 x u8 it assumes a 16-bit result, and only later, during optimization,
does it start discarding redundant operations.

The result is written to eax, eax+1 (16 bits).

In the tests, this procedure turned out to be several scan lines faster than the one used previously:

.proc	imulCL

	lda #$00

	LDY #$09
	CLC
LOOP	ROR @
	ROR eax
	BCC MUL2
	CLC		;DEC AUX above to remove CLC
	ADC ecx
MUL2	DEY
	BNE LOOP

	STA eax+1

	RTS
.endp

Millfork approach:

byte*byte produces a byte; this is by design. An arithmetic operator never promotes the result to a type larger than the type of its arguments.

In order to get a word, you need to explicitly cast one of the arguments to word: x = n*word(n)
This causes a call to __mul_u16u8u16, which is defined in m6502/zp_reg.mfk.
The same file also contains __mul_u16u16u16 and __mul_u8u8u8, plus all the division and modulo implementations.
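
The practical difference between a byte-sized and a word-sized product can be shown in C (C, not Millfork; the function names are hypothetical). C promotes both operands to int, so the byte-only behaviour has to be forced with an explicit cast.

```c
#include <stdint.h>

/* Byte result: the product is truncated to its low 8 bits,
   which is what a byte*byte => byte rule gives you. */
uint8_t square_as_byte(uint8_t n) {
    return (uint8_t)(n * n);
}

/* Word result: widen one operand first, like x = n*word(n). */
uint16_t square_as_word(uint8_t n) {
    return (uint16_t)((uint16_t)n * n);
}
```

For n = 20 the byte version gives 144 (400 mod 256), while the word version gives the full 400. That is why a Monte Carlo test squaring values up to 127 needs the explicit cast.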

 

Edited by zbyti
Millfork

3 hours ago, zbyti said:

@fenrock it's funny how big a hole you have in the middle of the screen in landscape ;)

 

char landscapeBase[] = kickasm {{
    .byte $AA, $96, $90, $90, $7A, $7A, $6E, $6E, $5E, $5E, $56, $56, $52, $50 
  }};

 

Looks like a signed/unsigned char problem to me, but that's only a deduction :]

Edited by zbyti
snippet

37 minutes ago, zbyti said:

Looks like a signed/unsigned char problem to me, but that's only a deduction :]

This is entirely possible, I'll check. I've had to write some fragments for some of the conditional code, and I may have got them wrong.

It does look like the hole sits right between the 7A and 90 values in the array, so good shout on it being a signed issue.
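
A short C sketch of why the boundary between $90 and $7A is suspicious: every value with bit 7 set becomes negative if the same byte is treated as signed. The table is the one quoted above; the helper names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* The landscapeBase values quoted in the post. */
static const uint8_t landscape_base[14] = {
    0xAA, 0x96, 0x90, 0x90, 0x7A, 0x7A, 0x6E,
    0x6E, 0x5E, 0x5E, 0x56, 0x56, 0x52, 0x50
};

/* Count entries that turn negative when the same byte is read as signed. */
size_t count_negative_when_signed(const uint8_t *v, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= 0x80) {   /* bit 7 set: negative in two's complement */
            count++;
        }
    }
    return count;
}

/* Unsigned comparison: 0x90 (144) is greater than 0x7A (122). */
int greater_unsigned(uint8_t a, uint8_t b) {
    return a > b;
}

/* Signed comparison of the same bit patterns: 0x90 is -112, below 0x7A. */
int greater_signed(uint8_t a, uint8_t b) {
    /* Map bytes to -128..127 without relying on implementation-defined casts. */
    int sa = a >= 0x80 ? (int)a - 256 : (int)a;
    int sb = b >= 0x80 ? (int)b - 256 : (int)b;
    return sa > sb;
}
```

Only the first four entries ($AA, $96, $90, $90) have bit 7 set, so a signed comparison would flip its answer exactly where the table drops from $90 to $7A.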

 


Fixed it.

The array heights were either not getting set correctly, or getting trashed on startup.

 

 


 

EDIT: Well, that was educational.

It turned out to be an "out by 1" error: the name of the benchmark didn't have a terminating 0 at the end, so when I was copying it, the copy ran on into the next data section, which turned out to be the heights array, and trashed the heights by turning them into screen codes.

So it was my fault all along, not the program's :D 
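
For anyone who hits the same thing, here is a C sketch of how a missing terminating 0 lets a name routine run into the data that follows it. The memory layout, the in-place conversion and the function names are assumptions made for the illustration; the actual benchmark code isn't shown in this thread. The `limit` parameter exists only to keep the sketch within its buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert one ATASCII code to an Atari screen (internal) code,
   keeping the inverse-video bit (bit 7). */
static uint8_t atascii_to_screen(uint8_t c) {
    uint8_t inverse = c & 0x80;
    uint8_t low = c & 0x7F;
    if (low < 0x20) {
        low = (uint8_t)(low + 0x40);
    } else if (low < 0x60) {
        low = (uint8_t)(low - 0x20);
    }
    return (uint8_t)(inverse | low);
}

/* Hypothetical in-place conversion that stops at a terminating 0.
   Returns the number of bytes it rewrote. */
size_t convert_name(uint8_t *mem, size_t limit) {
    size_t i = 0;
    while (i < limit && mem[i] != 0) {
        mem[i] = atascii_to_screen(mem[i]);
        i++;
    }
    return i;
}
```

If the bytes in memory are `'S','I','E','V','E'` followed directly by `$AA, $96, $90, ...`, the loop keeps going past the name and rewrites the heights ($AA becomes $8A, for example). With a 0 after the name, it stops after five bytes and the heights survive.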

Edited by fenrock

1 hour ago, zbyti said:

I checked the latest results... If you've done everything right and the compiler isn't cheating ;) you are extremely fast on arrays, and the guessing bench also has quite a good score. Nice job!

All credits to @Wrathchild :]

The montecarlo is still using the pre-generated sqr() array; I haven't changed it to use fastmul yet.

 

I changed the guessing game to 10x as it made the difference between using signed byte and normal byte more exaggerated. In the loop of 1000, if I use the signed byte, it reduces the time a lot.

Probably because I copied the asm code from a good website for doing signed comparison, but my hand crafted version for unsigned vs signed comparison was rubbish :)

 

I usually try and remember to inspect the code to check it isn't optimizing too much away - sometimes I have to return a value to ensure it's kept in the looping. It's tricky stopping kickc from doing what it's supposed to do :D 
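
A sketch of the "return a value so it's kept" trick in plain C, for readers who haven't run into this. It's written as standard C rather than anything KickC-specific, and the function and variable names are invented for the example.

```c
#include <stdint.h>

/* Hypothetical benchmark body. Returning the accumulated value gives the
   loop an observable result, so it is not simply dead code. */
uint16_t bench_sum_of_squares(uint8_t rounds) {
    uint16_t acc = 0;
    for (uint8_t i = 0; i < rounds; i++) {
        acc = (uint16_t)(acc + (uint16_t)(i * i));
    }
    return acc;
}

/* A volatile sink is another common way to force the result to be stored. */
volatile uint16_t bench_sink;

void run_bench(void) {
    bench_sink = bench_sum_of_squares(10);
}
```

An optimizer may still fold a loop with constant bounds into a single constant, so checking the generated assembly remains the only reliable way to confirm the work is really being done.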

 

I'm currently creating an md5.c benchmark, but I'm having to write extra code fragments so the assembler understands cardinal (4-byte) operations, so it's going slowly.
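
To give an idea of what a 4-byte operation costs on an 8-bit CPU, here is a C sketch of a 32-bit addition done one byte at a time with a carry, roughly the shape a clc followed by four adc instructions would take. This is an illustration only, not one of the actual KickC fragments, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Add two 32-bit values stored as 4 little-endian bytes each, one byte at a
   time with a carry, the way an 8-bit CPU has to do it. */
void add32_bytes(uint8_t dst[4], const uint8_t a[4], const uint8_t b[4]) {
    uint16_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t sum = (uint16_t)(a[i] + b[i] + carry);
        dst[i] = (uint8_t)sum;            /* keep the low 8 bits */
        carry = (uint16_t)(sum >> 8);     /* carry into the next byte */
    }
}

/* Helpers for moving between uint32_t and the byte form. */
void u32_to_bytes(uint32_t v, uint8_t out[4]) {
    for (int i = 0; i < 4; i++) {
        out[i] = (uint8_t)(v >> (8 * i));
    }
}

uint32_t bytes_to_u32(const uint8_t in[4]) {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) {
        v |= (uint32_t)in[i] << (8 * i);
    }
    return v;
}
```

MD5 needs additions, bitwise operations and rotations on 32-bit words, and each of them expands into a byte-by-byte sequence like this one.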

 

If the results are fair (need double checking), then at the moment kickc is:

Landscape: faster (quite a bit)

Chessboard: slower (10%)

QR 1D: faster (much)

Countdown For/While: much slower

Sieve 1028/1899: slower (quite a bit)

Bubble Sort: faster (much)

Montecarlo PI: faster (using fixed table)

YoshPlus: slower (just)

Guessing Game: faster (quite a bit)

 

I'd really like to have a look at the FOR/WHILE differences, that's quite a lot. And in kickc, they are almost identical whichever way you go.

 

EDIT: I'm going on the last picture in the benchmarks repository btw, if there's an updated one in the forums, apologies if I have the times wrong

 

 

Edited by fenrock

An example of why I didn't write a suite for Action!

 

Bubble Sort KickC:

2A41: A2 00     LDX #$00
2A43: BC 00 2F  LDY $2F00,X
2A46: BD 01 2F  LDA $2F01,X
2A49: 84 FF     STY $FF     ;FPTR2+1
2A4B: C5 FF     CMP $FF     ;FPTR2+1
2A4D: B0 07     BCS $2A56
2A4F: 9D 00 2F  STA $2F00,X
2A52: 98        TYA
2A53: 9D 01 2F  STA $2F01,X
2A56: E8        INX
2A57: E0 FE     CPX #$FE
2A59: D0 E8     BNE $2A43
2A5B: C6 88     DEC $88     ;STMTAB
2A5D: A9 FF     LDA #$FF
2A5F: C5 88     CMP $88     ;STMTAB
2A61: D0 DE     BNE $2A41
2A63: 60        RTS
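
Read back into C, the KickC listing above corresponds roughly to the sketch below. This is a reconstruction from the disassembly, not the benchmark source: the array length of 255 follows from the inner index running 0..253 (CPX #$FE), while the starting value of the outer counter is not visible in the listing, so 254 is an assumption.

```c
#include <stdint.h>

/* C reading of the KickC listing: 255 bytes, inner index 0..253 (CPX #$FE),
   swap when a[x+1] < a[x]. The outer counter's start value is not visible in
   the listing; 254 is an assumption that gives enough passes to sort fully. */
void bubble_sort_255(uint8_t a[255]) {
    for (uint8_t pass = 254; pass != 0xFF; pass--) {   /* DEC / CMP #$FF */
        for (uint8_t x = 0; x != 0xFE; x++) {
            uint8_t left = a[x];
            uint8_t right = a[x + 1];
            if (right < left) {   /* BCS skips the swap when right >= left */
                a[x] = right;
                a[x + 1] = left;
            }
        }
    }
}
```

The whole inner loop stays in registers plus one zero-page temporary, which is where most of the difference from the Action! output comes from.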

Action!:

31B5: A5 CD     LDA $CD
31B7: D0 03     BNE $31BC
31B9: 4C 0B 32  JMP $320B
31BC: A0 00     LDY #$00
31BE: 84 CC     STY $CC
31C0: A9 FD     LDA #$FD
31C2: C5 CC     CMP $CC
31C4: B0 03     BCS $31C9
31C6: 4C 01 32  JMP $3201
31C9: A6 CC     LDX $CC
31CB: BD 13 20  LDA $2013,X
31CE: 85 CA     STA $CA     ;LOADFLG
31D0: 18        CLC
31D1: A5 CC     LDA $CC
31D3: 69 01     ADC #$01
31D5: 85 AE     STA $AE     ;LELNUM+1
31D7: A6 AE     LDX $AE     ;LELNUM+1
31D9: BD 13 20  LDA $2013,X
31DC: 85 CB     STA $CB
31DE: A5 CB     LDA $CB
31E0: C5 CA     CMP $CA     ;LOADFLG
31E2: 90 03     BCC $31E7
31E4: 4C FC 31  JMP $31FC
31E7: A5 CB     LDA $CB
31E9: A6 CC     LDX $CC
31EB: 9D 13 20  STA $2013,X
31EE: 18        CLC
31EF: A5 CC     LDA $CC
31F1: 69 01     ADC #$01
31F3: 85 AE     STA $AE     ;LELNUM+1
31F5: A5 CA     LDA $CA     ;LOADFLG
31F7: A6 AE     LDX $AE     ;LELNUM+1
31F9: 9D 13 20  STA $2013,X
31FC: E6 CC     INC $CC
31FE: 4C C0 31  JMP $31C0
3201: 38        SEC
3202: A5 CD     LDA $CD
3204: E9 01     SBC #$01
3206: 85 CD     STA $CD
3208: 4C B5 31  JMP $31B5
320B: A2 21     LDX #$21

 

For many years Action! was the fastest native compiler on Atari, maybe even on any 8-bit system? Maybe the optimized Advan Basic can sometimes match the Action! speed...

 

BS.ACT

Edited by zbyti
fastest compiler

I've been working on an md5 implementation for the benchmarks, just added to the repo.

 

If you want to compile this yourself, you'll need to download and build the latest kickc as there's a bunch of fragments for 32 bit integers that needed to be written.

Thanks to @JesperGravgaard for doing that.

 

I haven't done a full implementation like @tebe has done at https://github.com/tebe6502/Mad-Pascal/blob/master/lib/md5.pas; instead, mine just creates md5 values.

Which is a testament to how good the Mad Pascal version is.

 

I've also added the latest results to the repo's front page:

 

suite.png

 

The md5 implementation has added about 2 KB to the binary (there are a lot of static tables for initialising vectors); it's now 7,099 bytes.

 

Will have some time over the weekend to add more benchmarks. I fancy having a go at the fire one for more eye candy :D 

 

EDIT: Cool, the linked image updates when I change it in the repo!

Added matrix trans, using 1D arrays, indexed with offsets to emulate 2D.
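
For reference, emulating a 2D matrix with a 1D array usually looks like the C sketch below: element (r, c) of a row-major matrix lives at offset r * cols + c. The function name and signature are hypothetical and not taken from the benchmark in the repo.

```c
#include <stdint.h>

/* Transpose a rows x cols matrix stored row by row in a 1D array.
   Element (r, c) lives at offset r * cols + c. */
void transpose_1d(const uint8_t *src, uint8_t *dst, uint8_t rows, uint8_t cols) {
    for (uint8_t r = 0; r < rows; r++) {
        for (uint8_t c = 0; c < cols; c++) {
            dst[(uint16_t)c * rows + r] = src[(uint16_t)r * cols + c];
        }
    }
}
```

For a 2x3 input `{1,2,3,4,5,6}` this produces the 3x2 result `{1,4,2,5,3,6}`. On a 6502 the multiplications in the offsets are typically replaced by running offsets that are incremented by `rows` or `cols` each step.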

Edited by fenrock
add size info