Benchmarking the Atari 800XL

vol · January 30, 2021

I have a little benchmark project for various platforms, and I have just added results for the Atari 800XL. The results are a bit surprising. The Atari 800 NTSC runs its 6502 at 1.79 MHz, as does the Commodore Plus4 NTSC, yet the Plus4 results are about 10% faster. Maybe there is a way to speed up the Atari 800XL? Of course, I have already disabled ANTIC with POKE 559,0. The Plus4 code uses a custom interrupt handler which just increments the timer value and does nothing else. Is it possible to do the same thing on the Atari? Which vector should I redirect to my custom handler? I have only superficial knowledge of the Atari system details, so any help will be greatly appreciated. I know that the effective CPU frequency of the +4 NTSC is actually about 1.71 MHz (not 1.79) because DRAM refresh steals cycles, but I don't know the corresponding number for the Atari.
BTW I know that the Atari vertical refresh rate is 59.92 Hz on the NTSC system, but I am not sure whether the rate on the PAL system is exactly 50 Hz. Is it?

It is a bit surprising that the size of a BAS-file is greater than the size of the corresponding LST-file (Basic in text format).

I have found out that the OUTCHAR routine has different addresses on the Atari 800 and the XL. Is there a way to write portable code which works on both Ataris?

Edited January 30, 2021 by vol
a typo

TGB1718 · January 30, 2021

All Atari 8-bit systems use a pre-defined set of vectors into the system; these are the ones which provide compatibility between the different OS versions.

Also, the lower memory locations ($0000 to $0400) contain locations pointing to fixed areas that are (almost) the same in all versions.

Have a look at "Mapping the Atari"; it's an invaluable source of information.

I suspect the code you looked at provides the same functionality, but you're jumping into the ROM, not using a vector, so it will appear in different locations.

Rybags · January 30, 2021

The Atari should be faster.

The Plus 4 has fewer refresh cycles, usually 5 per scanline vs our 9, but it has 2 badlines in succession since there are attribute fetches as well as character cell and bitmap data.

So the net effect is that Atari has DList and more refresh lost cycles but overall that is less than the extra 25x40 badline cycles of the Plus 4.

Though all this doesn't take the VBlank overhead into consideration. Likely that Atari's VBlank code path is somewhat longer than the Plus 4 timer IRQ which would swing things the other way a bit.

Edited January 30, 2021 by Rybags

Rybags · January 30, 2021

I'm fairly sure the actual frequency fed to the CPU is identical on the two machines, for both PAL and NTSC.

Back to the differences, I'm not sure what the Plus 4 does for refresh on badlines, whether it is still 5 cycles or not.

The Atari has just 1 cycle on the badlines and 9 on the remaining ones. And 32 cycles lost to DList fetches.

Doing the calculations (assuming Plus 4 does 5 refreshes always per line)

PAL Atari = 2616 refresh + 32 DList + 960 character cells + 7680 character set fetches for a TOTAL 11,288 cycles lost per frame

PAL Plus4 = 1560 refresh + 1000 character cells + 1000 attributes + 8000 character set fetches for a TOTAL 11,560 cycles lost per frame.

So, Atari slightly faster. If the Plus 4 does only one refresh on badlines that would return 200 cycles which would make it slightly faster.
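The sums above can be re-checked with a few lines of Python. This is only a sketch using the figures quoted in this post; the count of 24 Atari badlines is an assumption inferred from the 960 = 24 x 40 character cells, and the Plus 4 is assumed to always do 5 refresh cycles per line.

```python
# Cycles lost per PAL frame, using only the figures quoted in the post.
PAL_LINES = 312  # scanlines per PAL frame on both machines

# Atari: 1 refresh cycle on each badline, 9 on every other line.
# Assumption: 24 badlines, inferred from 960 = 24 x 40 character cells.
atari_refresh = 24 * 1 + (PAL_LINES - 24) * 9

# Plus 4: assumed to always do 5 refresh cycles per line.
plus4_refresh = PAL_LINES * 5

# refresh + DList + character cells + character set fetches
atari_lost = atari_refresh + 32 + 960 + 7680
# refresh + character cells + attributes + character set fetches
plus4_lost = plus4_refresh + 1000 + 1000 + 8000

print(atari_refresh, atari_lost)   # 2616 11288
print(plus4_refresh, plus4_lost)   # 1560 11560
print(plus4_lost - atari_lost)     # 272 cycles per frame in the Atari's favour
```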

I think if you're benchmarking and seeing the Atari as slower then it might come down to VBlank overhead.

Possibly a better way could be to devise some bit of work in 6502 Asm which takes about a minute.

Run on both machines with interrupts disabled and time them by hand.

Edited January 30, 2021 by Rybags

vol · January 30, 2021

1 hour ago, Rybags said:

The Atari should be faster.

The Plus 4 has fewer refresh cycles, usually 5 per scanline vs our 9, but it has 2 badlines in succession since there are attribute fetches as well as character cell and bitmap data.

So the net effect is that Atari has DList and more refresh lost cycles but overall that is less than the extra 25x40 badline cycles of the Plus 4.

Though all this doesn't take the VBlank overhead into consideration. Likely that Atari's VBlank code path is somewhat longer than the Plus 4 timer IRQ which would swing things the other way a bit.

Thank you but the benchmark runs in screen blank mode which doesn't have bad lines at all. So we have to take into account only DRAM refreshing and it is longer for the Atari which can explain why the +4 is a bit faster.

drac030 · January 30, 2021

Just do SEI - this disables the second phase VBL (fast path). You can also hook your custom VBL handler onto vector $0222. It is enough to terminate it later by jmp ($0224).

EDIT: the PAL frequency is 49.86 Hz.

Edited January 30, 2021 by drac030

vol · January 30, 2021

2 hours ago, TGB1718 said:

All Atari 8-bit systems use a pre-defined set of vectors into the system; these are the ones which provide compatibility between the different OS versions.

Also, the lower memory locations ($0000 to $0400) contain locations pointing to fixed areas that are (almost) the same in all versions.

Have a look at "Mapping the Atari"; it's an invaluable source of information.

I suspect the code you looked at provides the same functionality, but you're jumping into the ROM, not using a vector, so it will appear in different locations.

Thank you. I have carefully checked the book MAPPING THE ATARI but I couldn't find a common vector to OUTCHAR. I found a lot of information about interrupts on the Atari, but I need just a vector address which I can redirect, only to update the timer value at 18-20. However it seems that this can work only when the screen is off, doesn't it? So the code can get rather complex if we need to show digits on the screen and keep the overhead of the v-blank interrupts minimal.

drac030 · January 30, 2021

4 minutes ago, vol said:

I couldn't find a common vector to OUTCHAR

It is $346-$347 (in IOCB #0 opened for the screen editor). The vector is decremented by 1, so to use it you should push its value on the stack and do RTS.

4 minutes ago, vol said:

However it seems that this can work only when the screen is off, doesn't it?

Huh?

Edited January 30, 2021 by drac030

Rybags · January 30, 2021

The refresh rate I would think should be the same on both computers.

Some computers use 263/313 scanlines NTSC/PAL but both Atari and Plus 4 should use 262/312, so assuming the CPU clocks are identical the refresh rate should also be.

Re the inexactness: in both cases it's not reliable to assume 50 or 60 FPS as a timing source, but in a comparative situation between the 2 computers any deviation due to the slightly lower actual rate should be the same.
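As a rough illustration of why the real rate is slightly below 50/60 Hz, the refresh rate follows from the CPU clock divided by the cycles in one frame. The sketch below assumes 114 CPU cycles per scanline and Atari clocks of 1.7897725 MHz (NTSC) and 1.773447 MHz (PAL); those three values are not stated in this thread, only the 262/312 scanline counts are.

```python
# Refresh rate = CPU clock / (scanlines per frame * CPU cycles per scanline).
CYCLES_PER_LINE = 114  # assumption: CPU cycles per scanline


def refresh_rate_hz(cpu_hz, scanlines):
    """Frames per second for a given CPU clock and scanline count."""
    return cpu_hz / (scanlines * CYCLES_PER_LINE)


# Assumed clock values; the scanline counts come from the post above.
ntsc = refresh_rate_hz(1_789_772.5, 262)
pal = refresh_rate_hz(1_773_447.0, 312)

print(f"NTSC {ntsc:.2f} Hz, PAL {pal:.2f} Hz")  # NTSC 59.92 Hz, PAL 49.86 Hz
```

Under these assumptions the result agrees with the 59.92 Hz and 49.86 Hz figures quoted earlier in the thread.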

vol · January 30, 2021

1 hour ago, drac030 said:

Just do SEI - this disables the second phase VBL (fast path). You can also hook your custom VBL handler onto vector $0222. It is enough to terminate it later by jmp ($0224).

EDIT: the PAL frequency is 49.86 Hz.

Thank you very much. SEI helps, it makes the results about 2% better. However, is there a way to consistently read the timer value at 18-20? Is there a way to disable NMI during this read?

Sorry, I failed with the VBL redirection. I use the following code:

         sei
         lda $222
         pha
         lda $223
         pha
         lda #<tiroutine
         sta $222
         lda #>tiroutine
         sta $223
         ...
         pla
         sta $223
         pla
         sta $222
         cli
         rts

tiroutine
         inc 20
         bne tiexit

         inc 19
         bne tiexit

         inc 18
tiexit   rti

The code fails on tiroutine. It seems I am missing some Atari-specific magic which should end the NMI. Would anybody like to help with this matter?

Edited January 30, 2021 by vol

zbyti · January 30, 2021

Edited January 30, 2021 by zbyti
Atari is faster than C+4

drac030 · January 30, 2021

3 minutes ago, vol said:

The code fails on tiroutine.

You were told to end the handler with jmp ($0224) and not with RTI. Besides, the vector at $0222 is supposed to point to the actual address of the handler, not decremented.

Reading 19-20:

jiffy:
         lda 20
         ldx 19
         cmp 20
         bne jiffy
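Why the re-read works can be shown with a toy model in Python. This is only a simulation sketch, not Atari code: `JiffyClock`, `read_jiffies` and the `tick_after_reads` hook are hypothetical names invented here to inject a VBL tick between two memory reads.

```python
class JiffyClock:
    """Toy model of the 3-byte frame counter at locations 18-20 (20 = low byte)."""

    def __init__(self, value=0):
        self.mem = {18: (value >> 16) & 0xFF, 19: (value >> 8) & 0xFF, 20: value & 0xFF}

    def tick(self):
        # Carry chain: increment 20, then 19, then 18, stopping at the first non-wrap.
        for addr in (20, 19, 18):
            self.mem[addr] = (self.mem[addr] + 1) & 0xFF
            if self.mem[addr] != 0:
                break


def read_jiffies(clock, tick_after_reads=()):
    """Read bytes 19-20 with the retry loop.

    tick_after_reads is a hypothetical test hook: a tick fires right after
    the Nth memory read, standing in for a VBL interrupt.
    """
    reads = 0

    def peek(addr):
        nonlocal reads
        value = clock.mem[addr]
        reads += 1
        if reads in tick_after_reads:
            clock.tick()
        return value

    while True:
        low = peek(20)       # lda 20
        high = peek(19)      # ldx 19
        if peek(20) == low:  # cmp 20 / bne jiffy
            return (high << 8) | low


clock = JiffyClock(0x00FF)
# A tick lands between the two reads; a naive read would return 0x1ff.
print(hex(read_jiffies(clock, tick_after_reads={1})))  # 0x100
```

If the low byte changed while the high byte was being read, the pair is discarded and read again, so the value returned is always one the counter actually held.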

Edited January 30, 2021 by drac030

vol · January 30, 2021

1 hour ago, drac030 said:

You were told to end the handler with jmp ($0224) and not with RTI. Besides, the vector at $0222 is supposed to point to the actual address of the handler, not decremented.

Reading 19-20:

jiffy:
         lda 20
         ldx 19
         cmp 20
         bne jiffy

Thank you very much again. It works! However, the custom interrupt routine makes the code only 0.2% faster. So the total gain for the Atari 800 is about 2.2%.

So I suspect that the Atari 800 spends more cycles on DRAM refresh and other activities which cannot be disabled.

drac030 · January 30, 2021

Could you post the binary?

Faicuai · January 30, 2021

5 hours ago, vol said:

I have a little benchmark project for various platforms, and I have just added results for the Atari 800XL. The results are a bit surprising. The Atari 800 NTSC runs its 6502 at 1.79 MHz, as does the Commodore Plus4 NTSC, yet the Plus4 results are about 10% faster. Maybe there is a way to speed up the Atari 800XL? Of course, I have already disabled ANTIC with POKE 559,0. The Plus4 code uses a custom interrupt handler which just increments the timer value and does nothing else. Is it possible to do the same thing on the Atari? Which vector should I redirect to my custom handler? I have only superficial knowledge of the Atari system details, so any help will be greatly appreciated. I know that the effective CPU frequency of the +4 NTSC is actually about 1.71 MHz (not 1.79) because DRAM refresh steals cycles, but I don't know the corresponding number for the Atari.
BTW I know that the Atari vertical refresh rate is 59.92 Hz on the NTSC system, but I am not sure whether the rate on the PAL system is exactly 50 Hz. Is it?

It is a bit surprising that the size of a BAS-file is greater than the size of the corresponding LST-file (Basic in text format).

I have found out that the OUTCHAR routine has different addresses on the Atari 800 and the XL. Is there a way to write portable code which works on both Ataris?

There seems to be something odd in your results.

Typically, when running this type of test, the very first thing to check is the "global" advantage ratio between any two given systems (vs. their global clock ratio). Take, for instance, the C64 and 800XL versions, and look at the global performance difference:

(ALL results below with DMA=OFF)

1. Ratio of time C64 : 800XL for test 1: x1.5326

2. Ratio of time C64 : 800XL for test 2: x1.5805

3. Ratio of time C64 : 800XL for test 3: x1.5849

So by the above results, it seems the actual performance advantage of the 800/XL over the C64 is on the order of 1.58x (vs. a nominal / gross 1.79x, which would be the 800XL:C64 CPU clock-speed ratio, leaving aside any embedded CPU hw-based optimizations, as well as effective RAM speed, OS overhead, etc.)

NOW, what about starting from the exact same C-based benchmark, compiled with CC65 with the exact same optimizations for all versions, the only difference being the target system? Here's the canonical BYTE Sieve benchmark, built from the identical lean source, with the same declaration of register variables and the same array size, and compiled identically with CC65 (+optimizations) in Windows for Apple, C64 and Atari systems:

1. Ratio of time C64 : 800XL for Byte SIEVE: x1.70 (17.2s / 10.116s)

Typical OS-interrupt overhead on the 800/XL is about 3%, when active. And the SIEVE benchmark uses page 0, RAM and the CPU in pretty much the same (and intensive) way on all these systems.

So going from a realized 1.70x performance ratio (not far from the theoretical 1.79x) down to 1.58x does make you wonder where exactly the time went. What part of your code is running (relatively) more efficiently on the C64 than on the 800/XL?
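The comparison above boils down to a couple of divisions, sketched here in Python. All numbers are the ones quoted in this post; the variable names are just for illustration.

```python
# Times for the CC65 Byte Sieve build, as quoted above (seconds).
c64_seconds = 17.2
atari_seconds = 10.116

sieve_ratio = c64_seconds / atari_seconds  # realized C64 : 800XL ratio
nominal_ratio = 1.79                       # nominal clock-speed ratio quoted above
benchmark_ratio = 1.58                     # approximate ratio seen in vol's tests

print(round(sieve_ratio, 2))  # 1.7

# Share of the nominal clock advantage that each test actually realizes.
print(round(sieve_ratio / nominal_ratio, 3))
print(round(benchmark_ratio / nominal_ratio, 3))
```

The gap between the two shares is the part of the clock advantage that the Sieve build realizes and the other benchmark does not.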

My 0.02c

vol · January 30, 2021

57 minutes ago, drac030 said:

Could you post the binary?

Sure, just follow the initial link. You have two options: download the full pack for all platforms or just go to github. All links are at the bottom of the page.

BTW I have just updated the table; I used all the optimizations offered. So now the Atari 800 port is more optimized than almost any other port, because the custom interrupt handler is implemented only for the Plus4 and Atari 800.

Sorry, I was reluctant to make a portable call to OUTCHAR; maybe next time.

Anyway thank you very much!

Edited January 30, 2021 by vol

vol · January 30, 2021

52 minutes ago, Faicuai said:

There seems to be something odd in your results.

Typically, when running this type of test, the very first thing to check is the "global" advantage ratio between any two given systems (vs. their global clock ratio). Take, for instance, the C64 and 800XL versions, and look at the global performance difference:

(ALL results below with DMA=OFF)

1. Ratio of time C64 : 800XL for test 1: x1.5326

2. Ratio of time C64 : 800XL for test 2: x1.5805

3. Ratio of time C64 : 800XL for test 3: x1.5849

Thank you very much. I started this thread because I expected somewhat higher results for the Atari 800. Maybe it is because of the emulator I used? The C64 port is now less optimized than the Atari 800 one: the C64 port doesn't use a custom interrupt handler. I think we must use only the third test if we want to compare raw CPU speeds; for the current results the C64/Atari800 ratio is 1:1.63, which is still definitely less than the expected 1.75.

Does anybody know how much time the Atari 800 spends on its RAM refresh? I don't know this number for the C64, but for the Plus4 it is 5 cycles each raster line. There is also some system service on vector $224 in the Atari.

I also have to emphasize that the code is the same for all the 6502-based systems.

Edited January 30, 2021 by vol

zbyti · January 30, 2021

drac030 · January 30, 2021

17 minutes ago, vol said:

Does anybody know how much time the Atari 800 spends on its RAM refresh?

It has already been answered above: 9 clock cycles per rasterline.

vol · January 30, 2021

18 minutes ago, drac030 said:

It has already been answered above: 9 clock cycles per rasterline.

Thank you for the clarification. IMHO Rybags's information has not been completely definitive. So these 9 cycles explain the results...

It is quite possible that the DMA refresh doesn't steal cycles on the C64 at all.
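A back-of-the-envelope check of the 9 vs 5 refresh cycles per line, as a Python sketch. It assumes 114 CPU cycles per scanline on both machines and a blanked screen, so that refresh is the only loss; the 114 figure is not stated in this thread, but with it the Plus4 number reproduces the 1.71 MHz mentioned in the first post.

```python
NTSC_CLOCK_MHZ = 1.79   # nominal clock from the first post
CYCLES_PER_LINE = 114   # assumption: CPU cycles per scanline on both machines


def effective_mhz(refresh_cycles_per_line):
    """Clock left for the CPU when only DRAM refresh steals cycles."""
    usable = CYCLES_PER_LINE - refresh_cycles_per_line
    return NTSC_CLOCK_MHZ * usable / CYCLES_PER_LINE


atari = effective_mhz(9)  # 9 refresh cycles per line
plus4 = effective_mhz(5)  # 5 refresh cycles per line

print(f"Atari {atari:.2f} MHz, Plus4 {plus4:.2f} MHz, Plus4 ahead by {plus4 / atari - 1:.1%}")
# Atari 1.65 MHz, Plus4 1.71 MHz, Plus4 ahead by 3.8%
```

Under these assumptions refresh alone accounts for a difference of roughly 4% with the screen off.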

Edited January 30, 2021 by vol

dmsc · January 30, 2021

Hi!

54 minutes ago, vol said:

Sure, just follow the initial link. You have two options: download the full pack for all platforms or just go to github. All links are at the bottom of the page.

BTW I have just updated the table; I used all the optimizations offered. So now the Atari 800 port is more optimized than almost any other port, because the custom interrupt handler is implemented only for the Plus4 and Atari 800.

Sorry, I was reluctant to make a portable call to OUTCHAR; maybe next time.

See attached patch to your github code for the correct way to write one character to IOCB #0. This has the advantage of working with 80 column hardware, display accelerators, etc.

Have Fun!

atari-putchar.patch

Faicuai · January 30, 2021

23 minutes ago, vol said:

Thank you for the clarification. IMHO Rybags's information has not been completely definitive. So these 9 cycles explain the results...

It is quite possible that the DMA refresh doesn't steal cycles on the C64 at all.

In the event you have some extra time, and want to take a vis-a-vis look with a smaller (yet relevant) piece of code, to try leveling things off and see where time is being spent, I have attached an .ATR with two versions of SIEVE benchmark:

Scratchpad-DOS-130K-IV.ATR

SIEVE3E.MAC: the Atari runs 10 iterations in 444 frames (7.4099 secs). This one was coded by me, directly on real hardware. It follows Byte's "canonical" algorithm (no optimizations there) and carries 6502-centric optimizations that any similar system will benefit from: handling of For Start/End/Step parameters, self-modification, optimized 16-bit arithmetic, and unrolled RAM-clear sweeps. There are NO algorithmic changes, no special handling of OS interrupts, and no obscure programming tricks. The code is pretty clean and heavily documented to ease porting to other systems, and the most critical / intensive parts are marked with ">>" in the comments. It should be interesting to see how THIS one runs on the +4 and C64, for sure. The results will be quite telling.
SIEVE5A.MAC: this is the hand-ported (to Atari) version of the famous William Savoie (1981) edition, hand-coded for the 6502 (unclear whether on an Apple II or an OSI 6502 board), which nets 13.9 s, whereas the Atari runs it in 8.86 s. It is supposed to follow Byte's canonical form. The relative results are strange (the gap is not as wide as it should be), which makes me wonder what the real HW config of the OSI/Apple test bed was. Up to you whether you want to check this one or not.

Global timing results left on $D0 (LSB), $D1 (MSB) P0-regs, and better run in "console mode", with XEP80 or DMA=OFF (to get these results). I have included Atari Macro Assembler, in case you want to make quick / direct changes and assemble in-situ.
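For anyone reading the result bytes by hand, the conversion from the two-byte frame count to seconds can be sketched in Python as below. The helper names are hypothetical, and the 59.92 Hz NTSC frame rate is the figure quoted earlier in the thread.

```python
NTSC_FRAME_HZ = 59.92  # NTSC vertical refresh rate quoted earlier in the thread


def frames_from_bytes(lsb, msb):
    """Combine the $D0 (LSB) and $D1 (MSB) result bytes into a frame count."""
    return lsb | (msb << 8)


def frames_to_seconds(frames, hz=NTSC_FRAME_HZ):
    """Convert a frame (jiffy) count to seconds."""
    return frames / hz


# 444 frames is $01BC, i.e. $D0 = $BC and $D1 = $01.
frames = frames_from_bytes(0xBC, 0x01)
print(frames, round(frames_to_seconds(frames), 4))  # 444 7.4099
```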

Cheers!

Edited January 30, 2021 by Faicuai

drac030 · January 30, 2021

It will be real "fun" to get this program running on a real Atari: from what I can see, it requires MEMTOP (yup, MEMTOP) to be set to $0D07, and then loads a binary file over that. On my system the minimum possible MEMLO is 135 bytes higher.

MEMLO probably has to be around $0800 to accomplish that.

Edited January 30, 2021 by drac030

Faicuai · January 30, 2021

3 minutes ago, drac030 said:

It will be real "fun" to get this program running on a real Atari: from what I can see, it requires MEMTOP (yup, MEMTOP) to be set to $0D07, and then loads a binary file over that. MEMLO probably has to be around $0800 to accomplish that.

It does not run.

Well, not in SDX. It kills it immediately.

Edited January 30, 2021 by Faicuai

vol · January 30, 2021

1 hour ago, dmsc said:

Hi!

See attached patch to your github code for the correct way to write one character to IOCB #0. This has the advantage of working with 80 column hardware, display accelerators, etc.

Have Fun!

atari-putchar.patch

Thank you, but drac030 mentioned different locations, $346-$347. Your code uses $347-$348.
