Benchmarking the Atari 800XL

tebe · February 1, 2021

46 minutes ago, Mazzspeed said:

The Plus/4 is a bit of an unknown, I don't really know what DRAM was fitted to the device on release, but knowing Commodore it could have been anything and most likely could have been slower than ideal considering its ability to run at 2.2 MHz in certain situations - essentially that's my point. However, I guess at the end of the day the machine ran, so the memory used must have been good enough.

System Clock Doubling

For increased processor throughput, the system clock output from TED doubles
frequency from 894 kHz (NTSC) to 1.788 MHz (NTSC), during non−display times.
The horizontal position register counts 456 dots, 0 to 455. During counts of
400−344, while in raster lines 0 to 204, the TED device outputs single clock.
During this time TED is doing processor handshaking (counts 400−432),
character fetches (counts 432−304), and dynamic RAM refresh (counts 304−344).
Outside of this horizontal window TED outputs double clock (1.788 MHz). During
raster lines 205−261 for NTSC (205−311 for PAL), TED outputs double clock at
all times except horizontal counts 304−344 which are single clock to allow for
dynamic RAM refresh. If the blanking bit (Register #6) is cleared, the active
display is cleared, the screen is filled with border color, and double clock
is enabled at all times except refresh.
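A rough worked example of what this timing description implies, as a Python sketch. It assumes one single-clock cycle spans 8 dots (so 456 dots make 57 single cycles per line); that figure is not in the excerpt, though it fits the 894 kHz clock. All names are made up for illustration.

```python
DOTS_PER_LINE = 456          # horizontal counter runs 0..455
DOTS_PER_SINGLE_CYCLE = 8    # assumption: one single-clock cycle spans 8 dots
DOTS_PER_DOUBLE_CYCLE = DOTS_PER_SINGLE_CYCLE // 2


def clock_cycles_per_line(single_clock_dots):
    """Clock cycles TED outputs on one line, given how many dots run at single clock."""
    double_clock_dots = DOTS_PER_LINE - single_clock_dots
    return (single_clock_dots // DOTS_PER_SINGLE_CYCLE
            + double_clock_dots // DOTS_PER_DOUBLE_CYCLE)


# Display lines: single clock from count 400, wrapping through 455, up to 344.
display_single_dots = (DOTS_PER_LINE - 400) + 344
# Blanked screen, or lines 205 and up: single clock only for refresh, 304..344.
refresh_single_dots = 344 - 304

display_line = clock_cycles_per_line(display_single_dots)
blanked_line = clock_cycles_per_line(refresh_single_dots)
single_only = DOTS_PER_LINE // DOTS_PER_SINGLE_CYCLE

print(display_line, blanked_line, single_only)
print(f"blanked-screen speedup over single clock: {blanked_line / single_only:.2f}x")
```

These are clock cycles output by TED, not cycles the CPU necessarily gets: on display lines TED also takes some of them for its own fetches.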

vol · February 1, 2021

22 hours ago, drac030 said:

Okay, so some technical information now:

1) the BASIC interpreter I used is U-BASIC. It is not very much more than the ordinary Atari BASIC recompiled so that it runs in the memory under the OS instead of hogging the cartridge area. You can run it on vanilla 800XL or 65XE (not on 800 - it requires a 64k machine).

The maximum memory (out of the first 64k as the benchmark rules stipulate) I can get here in this BASIC without falling back to the cassette recorder (which I do not have) as the only storage is:

It would however be better if the benchmark could avoid involving a BASIC interpreter and be run directly from the CP.

2) the hardware: PAL Atari 65XE with 320k RAM and Rapidus Accelerator. The results number 1 is the regular board, 65C816/20 MHz. The results number 2 is a prototype where the CPU is clocked at 40 MHz. So yes, this is "a kind of SuperCPU for Atari" as you correctly guessed.

This however raises questions about the SuperCPU results listed in your table. Namely, what is the difference between the entries 31/34 on one side, and 44/45 on the other side? Both list 65C816/20 MHz, but as you see, 31/34 are actually faster (by about 10%) than a 65C816/40 MHz. So how was this result achieved on the SuperCPU board? Or maybe it is a mistake and 31/34 are really faster than the 20 MHz listed?

If you can compile the main module for the 65C816 target, we could see.

Basic is not a necessity: you can use USR($1907), which gives you 200 digits. Basic provides a friendly interface to set the number of digits and, optionally, screen-off mode. It also helps with floating-point division: 49.86 or 59.92 are not easy divisors to handle in ML code. For the Atari, Basic is used to load the ML code too.

Many other ports (Commodore 64/Plus4/128/SuperCPU, BBC Micro, Amstrad CPC, ZX Spectrum, Dragon 32/64, ...) also use Basic. Unix ports use C instead of Basic.

Could you tell me when production of the Rapidus Accelerator started?

What is this prototype? WDC stopped developing the 6502 long ago. Bill Mensch just says something strange from time to time but does nothing. Three years ago he said that he could make the 6502 run at 10 GHz; this winter he even declared 20 GHz!

The higher results are caused not by the increased frequency but by the driver. You can easily see that the higher results are based on the scpu64-5 driver, which means that this code uses all the advantages of the 65816. The lower results are based on the c64-9 driver, which is just the C64 program that was used to test the C64 itself. It is easy to adapt the current Atari program to use the 65816 advantages: just replace the code between start and finish with the code from the c64scpu folder.

12 hours ago, Faicuai said:

It works!

Now getting 1.85s for first 100 digits, and 184.9465s for 1000 digits... This is in "console mode", output via XEP80 Ultra-drivers (80-cols) and NO blank-screen (XEP80 drivers service (and keep) E: driver at full 6502 tilt, all the time).

All the above with Atari 800/Incognito (RAM/ROM/HD board) and SDX dos.

Thank you. Your results confirm that Atari800 is a very accurate emulator. Your numbers are a bit lower than those in the table, which is a result of the faster character output of the XEP80. For 3000 digits you should get a number almost equal to that in the table.

4 hours ago, Rybags said:

I'm fairly sure that doesn't happen - the screen blanking just prevents the cycles lost for screen generation in the same fashion as setting DMACTL to 00 on the Atari - the PAL/NTSC setting would just toggle between 312/262 scanlines at ~ 50 & ~ 60 FPS. There's no valid reason for the CPU to overclock to that speed (though it does seem somewhat close to half the PAL colour clocking rate)

The Plus4 can toggle the base dot clock divider, which is 8 for NTSC and 10 for PAL. This gives a 25% speed boost for PAL systems and a 20% slowdown for NTSC systems.

2 hours ago, Mazzspeed said:

I can tell you right now, it definitely runs at 2.2 MHz. This has been known for quite some time now.

All it needs is the 2 MHz RAM of the BBC and it would be a rip snorter as long as the screen is off. All our 8-bit machines run 1 MHz RAM, as Atari and Commodore wanted to make things as affordable as they could without the backing of the British Broadcasting Corporation.

In fact the BBC is a really good 8-bit machine. It lacks somewhat in graphics capabilities, but as a development machine it's a great device - especially with that Tube port and an ARM second processor.

The Plus4 uses 2 MHz RAM and a 2 MHz CPU; the BBC Micro uses a 2 MHz CPU and 4 MHz RAM. The Beeb never has memory contention, and the price for this is its costly DRAM. The Plus4 only sometimes gives its CPU 2 MHz access to DRAM. The CPU always uses DRAM at 2 MHz (exactly 1.76) when it is in screen-off mode - the only exception is 5 cycles on each raster line, which the CPU runs at 1 MHz to leave 5 cycles for DRAM refresh. Sometimes when the screen is on, the Plus4 CPU is simply stopped and all cycles go to video - this is the famous bad line effect. The Plus4 has twice as many such lines as the C64 in standard screen-on modes, but it can be programmed to use more bad lines to show more colors and resolution.

Edited February 1, 2021 by vol

Rybags · February 1, 2021

I did a search but there's not much info around - supposedly maybe 5 applications of this faster clock speed.

In theory, like the C128 you might be able to use Raster IRQs to implement the speedup during the 112 non display scanlines. That is assuming it doesn't screw with the overall display timing.

drac030 · February 1, 2021

1 hour ago, vol said:

Basic is not a necessity: you can use USR($1907), which gives you 200 digits. Basic provides a friendly interface to set the number of digits and, optionally, screen-off mode. It also helps with floating-point division: 49.86 or 59.92 are not easy divisors to handle in ML code. For the Atari, Basic is used to load the ML code too.

Many other ports (Commodore 64/Plus4/128/SuperCPU, BBC Micro, Amstrad CPC, ZX Spectrum, Dragon 32/64, ...) also use Basic.

Yup, these are valid points, but this section of the benchmark may need improvements anyways: the BASIC program is odd at places and, as said, using BASIC for the interface is a hassle sometimes. I will see what can be done.

1 hour ago, vol said:

Could you tell me when production of the Rapidus Accelerator started?

2016.

1 hour ago, vol said:

What is this prototype? WDC stopped developing the 6502 long ago. Bill Mensch just says something strange from time to time but does nothing. Three years ago he said that he could make the 6502 run at 10 GHz; this winter he even declared 20 GHz!

I would actually prefer a newer version of the 65C816, not necessarily 20 GHz (50 MHz would be enough, although they are claiming IIRC that their softcores can run up to 200 MHz) with a few of its "features" rectified.

The 40 MHz prototype board runs a softcore 65C816.

1 hour ago, vol said:

The higher results are caused not by the increased frequency but by the driver. You can easily see that the higher results are based on the scpu64-5 driver, which means that this code uses all the advantages of the 65816. The lower results are based on the c64-9 driver, which is just the C64 program that was used to test the C64 itself. It is easy to adapt the current Atari program to use the 65816 advantages: just replace the code between start and finish with the code from the c64scpu folder.

Initially I thought that the benchmark was compiled from C on all platforms. Later I saw that this is a hand-optimized assembler.

I will take a look at the scpu sources, but I am under the impression that you already have all the tools necessary to prepare such a version in binary form. Could you do it? If not, no problem.

drac030 · February 1, 2021

Just for the record, I was wondering if the program could be run outside the first 64k and if that makes a difference. It has the advantage of code running entirely in FastRAM without interference from Antic. Also, the entire 64k is available for the program, minus its size, which effectively makes a bit more than 58k. The disadvantage is that the system calls (like OUTCHAR) impose a bigger overhead.

I still used the same 8-bit module from pack-45, and a BASIC interpreter capable of accessing more than 64k of address space (MultiBASIC). The interpreter had to be fixed first, because one of the necessary keywords did not work correctly:

MBI.EXE

Now the program itself had to be patched. The interrupt handling I left to the Rapidus OS, so saving/changing/restoring the $0222 vector had to be removed (by the way, before setting this vector one should first wait for a VBL tick, i.e. wait for the counter at $14 to tick, and then change the vector). Instructions had to be added which save/reset/restore additional 65C816 registers. And I replaced a jmp (abs) with a jmp (abs,x) in one place.

The BASIC program to handle all that:

5 POKE 82,0:DIM R$(1),V$(3)
10 M=6407:H=$FFFF:HIM=M+$020000:DMA=PEEK(559)
15 N=INT((INT((H-6144)/7))/4)*4:DPOKE 207,N
20 OPEN #2,4,0,"D:PI":TRAP 40:FLEN #2,PSZ:PA=32768:BGET #2,PA,PSZ
40 CLOSE :TRAP 40000:GOSUB 900:? CHR$(125);"NUMBER PI CALCULATOR V2+"
55 ? "NUMBER OF DIGITS (UP TO ";N;")";:INPUT F:D=INT((F+3)/4)*4:IF D<=0 OR D>N THEN 55
60 IF F<>D THEN PRINT D;" DIGITS WILL BE PRINTED"
65 ? "BLANK SCREEN (36% FASTER)";:INPUT R$:IF R$="Y" THEN POKE 559,0:WAIT :GOTO 80
70 IF R$<>"N" THEN 65
80 F=D/2*7:FL=INT(F/256):POKE PA+37,INT(F/128):POKE PA+82,F-FL*256:POKE PA+86,FL:POKE PA+58,F*2-INT(F/128)*256
85 H=59.92:IF PEEK($62) THEN H=49.86
90 MOVE PA,HIM,PSZ:SYSCALL HIM+1:MOVE HIM,PA,PSZ
95 ? " ";(DPEEK(PA+6082)*256+PEEK(PA+6081))/H:IF R$="Y" THEN POKE 559,DMA
100 END
900 FOR F=0 TO 17:READ V$:H=VAL(V$):POKE PA+4+F,H:NEXT F
905 FOR F=0 TO 3:POKE PA+832+F,$EA:POKE PA+836+F,$EA:NEXT F
910 FOR F=0 TO 7:READ V$:H=VAL(V$):POKE PA+855+F,H:NEXT F:DPOKE PA+856,DPEEK($0346)+1:FOR F=0 TO 3:READ V$:H=VAL(V$):POKE PA+838+F,H:NEXT F
915 FOR F=0 TO 2:READ V$:H=VAL(V$):POKE PA+275+F,H:NEXT F
990 RETURN
1000 DATA $8B,$0B,$F4,$00,$00,$2B,$4B,$AB
1005 DATA $A5,$14,$C5,$14,$F0,$FC,$EA,$EA
1010 DATA $EA,$EA,$F4,$00,$00,$02,$00,$68
1015 DATA $68,$60,$2B,$AB,$58,$6B,$7C,$00
1020 DATA $25

To my surprise, it worked.

The results are not much better, though: 136.62 for 3000 digits (vs 137.14 before) at 20 MHz, and 92.75 (vs 94.12 before) at 40 MHz.

The maximum number of digits to be printed is 8484.
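The 8484 figure can be reproduced from lines 15 and 55 of the BASIC loader above. A small Python sketch of that arithmetic follows; the parameter names (`reserved`, `bytes_per_digit`) are my reading of the constants 6144 and 7 in the listing, not names from the original program.

```python
def max_digits(top_of_memory=0xFFFF, reserved=6144, bytes_per_digit=7):
    """Mirror of line 15: N=INT((INT((H-6144)/7))/4)*4."""
    return (top_of_memory - reserved) // bytes_per_digit // 4 * 4


def digits_printed(requested):
    """Mirror of line 55: D=INT((F+3)/4)*4, i.e. round up to a multiple of 4."""
    return (requested + 3) // 4 * 4


print(max_digits())          # 8484
print(digits_printed(3000))  # 3000
print(digits_printed(1001))  # 1004
```

With H=$FFFF the usable area is 65535-6144 = 59391 bytes, which also matches the "a bit more than 58k" mentioned earlier in the post.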

Mazzspeed · February 1, 2021

7 hours ago, Rybags said:

"Make use of" isn't really relevant. Whether a given cycle is granting memory access to the CPU or for DMA, the timing requirements are still the same. We don't have wait states like PCs where we're waiting for long latency memory operations, any given memory access is always done in a single cycle.

But you essentially do, especially in the case of the C64/A8 vs the BBC. With ANTIC using DMA and with the VIC-II able to halt the processor using the AEC signal, both machines essentially have forced wait states; this is not the case on the BBC. On the 16-bit Amigas you use wait states to keep faster 32-bit processors (68030 and higher) in sync with the custom chipset via the strangled 16-bit bus; in this scenario memory is not the problem. Such wait states aren't needed on the newer Amigas with the 32-bit bus.

Furthermore, as another member stated, the RAM in the BBC runs at an effective 4 MHz. I knew there was a reason why Acorn didn't cheap out on the memory used in their machine at the time.

Edited February 1, 2021 by Mazzspeed

tebe · February 1, 2021

MadPascal (FreePascal), Pi Bench

pibench.obx pibench.pas

pi_spigot.obx pi_spigot.pas

Edited February 2, 2021 by tebe

vol · February 2, 2021

22 hours ago, drac030 said:

I would actually prefer a newer version of the 65C816, not necessarily 20 GHz (50 MHz would be enough, although they are claiming IIRC that their softcores can run up to 200 MHz) with a few of its "features" rectified.

The 40 MHz prototype board runs a softcore 65C816.

Initially I thought that the benchmark was compiled from C on all platforms. Later I saw that this is a hand-optimized assembler.

I will take a look at the scpu sources, but I am under the impression that you already have all the tools necessary to prepare such a version in binary form. Could you do it? If not, no problem.

I am curious about your accelerators. How is compatibility? We know that Atari refused to replace the NMOS 6502 with the CMOS one because some programs didn't work on the CMOS 6502, for example the game Asteroids.

WDC has been doing nothing for more than 25 years, so it is very unlikely that it can produce anything new. IMHO it would be good if somebody made a high-frequency 65CE02 - it is much better than the 65C02. I know little about softcores, but a man ran the pi-spigot on his Acorn Atom with such a core at 100 MHz a year or two ago.

You can also read that this implementation of the π-spigot is claimed to be the fastest, but everybody is invited to make it faster. Some people have tried to make it better for the 6502, x86, PDP-11, 68k, ...

No problem. IMHO you can find all the information about my tools in the sources. I use the tmpx assembler, awk, sed, and maybe several other standard Unix utilities. The most interesting part is maybe the branch optimizer - you can find it in the bbc folder. It helps to keep all branches within the same page: as you know, when a branch crosses a page boundary there is a timing penalty.
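To illustrate the page-crossing penalty, here is a small Python model of branch timing on the NMOS 6502 (2 cycles when not taken, 3 when taken, 4 when taken across a page boundary). It is only a sketch of the rule, not the branch optimizer from the sources; the function name and the addresses are invented for the example.

```python
def branch_cycles(branch_addr, target_addr, taken=True):
    """Cycle cost of a 6502 relative branch located at branch_addr."""
    if not taken:
        return 2
    # The page comparison is made against the address of the next instruction.
    next_instr = (branch_addr + 2) & 0xFFFF   # branches are 2 bytes long
    same_page = (next_instr >> 8) == (target_addr >> 8)
    return 3 if same_page else 4


# A loop-closing branch at $19F0 back to $19E0 stays inside page $19.
print(branch_cycles(0x19F0, 0x19E0))   # 3
# The same loop moved up by $10: the branch at $1A00 now jumps back into page $19.
print(branch_cycles(0x1A00, 0x19F0))   # 4
```

In an inner loop executed millions of times, that one extra cycle per pass is what makes the placement of the code worth controlling.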

16 hours ago, drac030 said:

Just for the record, I was wondering if the program could be run outside the first 64k and if that makes a difference. It has the advantage of code running entirely in FastRAM without interference from Antic. Also, the entire 64k is available for the program, minus its size, which effectively makes a bit more than 58k. The disadvantage is that the system calls (like OUTCHAR) impose a bigger overhead.

I still used the same 8-bit module from pack-45, and a BASIC interpreter capable of accessing more than 64k of address space (MultiBASIC). The interpreter had to be fixed first, because one of the necessary keywords did not work correctly:

The results are not much better, though: 136.62 for 3000 digits (vs 137.14 before) at 20 MHz, and 92.75 (vs 94.12 before) at 40 MHz.

The maximum number of digits to be printed is 8484.

Thank you for these interesting results. It seems your systems have a sophisticated MMU - they allow such tricks! I have just uploaded pipack-46 which uses a proper way to set $222-vector - thank you. This work with NMI gives the Atari a very interesting flavor. Do you know that the Commodore +4 doesn't have NMI at all? They just cannibalized this processor pin!

Sorry, I missed it: what purpose does the file MBI.EXE have? Is it a program for Microsoft Windows?

Indeed the version for the SuperCPU also uses JMP (abs,X) - it is quite a useful instruction and especially for ROM-coding.

However you use a kind of expanded Basic. In my project I have to rely only on stock ROM variants.

16 hours ago, Mazzspeed said:

Furthermore, as another member stated, the RAM in the BBC runs at an effective 4 MHz. I knew there was a reason why Acorn didn't cheap out on the memory used in their machine at the time.

What is the reason? IMHO the BBC Micro was too expensive; they could have utilized its fast RAM much better. If they had used it the way the Plus4 does, they could have got more than 3 effective MHz!

13 hours ago, tebe said:

MadPascal (FreePascal), Pi Bench

Do you know that authors of the pi-spigot algo published it in Pascal? https://www.maa.org/sites/default/files/pdf/pubs/amm_supplements/Monthly_Reference_12.pdf
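For readers who have not seen the algorithm, here is a Python sketch of the compact four-digits-per-pass form of the spigot. It is a transcription of the widely circulated short C version, not the code from the paper and not the benchmark's assembler. It uses 14 array terms per 4 digits, the same 7-per-2-digits ratio that shows up in the BASIC loader (F=D/2*7), but that the benchmark uses exactly this scheme is my assumption.

```python
def pi_digits(n):
    """Return n decimal digits of pi as a string; n should be a multiple of 4."""
    base = 10000                      # four decimal digits are produced per pass
    terms = n // 4 * 14               # 14 mixed-radix terms per group of 4 digits
    f = [base // 5] * (terms + 1)     # every term starts at 2000; f[0] is unused
    carry = 0
    out = []
    while terms > 0:
        d = 0
        g = terms * 2
        b = terms
        while True:
            d += f[b] * base
            g -= 1                    # g is now the odd denominator 2*b - 1
            f[b] = d % g
            d //= g
            g -= 1
            b -= 1
            if b == 0:
                break
            d *= b
        out.append("%04d" % (carry + d // base))
        carry = d % base
        terms -= 14
    return "".join(out)


print(pi_digits(800)[:32])
```

Python integers do not overflow, so the 16-bit and 32-bit range concerns that shape the assembler versions do not appear here.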

Edited February 2, 2021 by vol

tebe · February 2, 2021

1 hour ago, vol said:

Do you know that authors of the pi-spigot algo published it in Pascal? https://www.maa.org/sites/default/files/pdf/pubs/amm_supplements/Monthly_Reference_12.pdf

I didn't know, thx

Mazzspeed · February 2, 2021

1 hour ago, vol said:

What is the reason? IMHO the BBC Micro was too expensive; they could have utilized its fast RAM much better. If they had used it the way the Plus4 does, they could have got more than 3 effective MHz!

Well I can tell you that it didn't matter, because as a result of the British Broadcasting Corporation it ended up in nearly all UK schools and even many schools here in Australia and Tasmania. Its Econet and DOS implementation was amazing; nothing 8-bit at the time could beat it in relation to networking and DOS.

Edited February 2, 2021 by Mazzspeed

Mazzspeed · February 2, 2021

1 hour ago, vol said:

I know little about softcores, but a man ran the pi-spigot on his Acorn Atom with such a core at 100 MHz a year or two ago.

Bear in mind that the Tube port on the BBC was designed with ARM development in mind, it's a very unique interface. I'd be hesitant to call that RPi purely a soft core.

Rybags · February 2, 2021

Having only 32K Ram was something of a handicap - we had them in school also - by the time you take the memory of the highest graphics mode away it doesn't leave much, although having the DOS in Rom was helpful in keeping that part of the overhead down.

Rybags · February 2, 2021

Getting late here - I've made a quick and dirty bench that is almost spot on 30 seconds runtime on (emulated) BBC Model B.

It makes a good baseline for comparison since it's supposedly (exactly?) 2 MHz without impediments of DMA and refresh stealing CPU cycles.

I'll convert to run on Atari, C64 and Plus/4 tomorrow if I get time (unless someone wants to do so first).

The BBC TIME variable is supposedly accurate enough, using proper timer hardware and not a 50 Hz video interrupt that's not really 50 Hz.

   10 MODE 7
   20 FOR C=0 TO 2 STEP 2
   25 P%=&4000
   30 [OPT C
   40 LDX #&6C
   50 STX &71
   60 LDA #0
   70 STA &70
   80 TAX
   90 .LOOP DEC &70
100 BNE LOOP
105 NOP
106 NOP
107 NOP
108 NOP
109 NOP
110 DEX
120 BNE LOOP
130 DEC &71
140 BNE LOOP
150 .RET RTS
180 ]
190 NEXT C
200 CLS
210 TIME=0
220 CALL &4000
230 T=TIME
240 PRINT T/100
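As a cross-check of the runtime, the cycles in the listing can be counted by hand. The Python sketch below does that with the standard NMOS 6502 instruction timings; it assumes no branch crosses a page (the code is assembled at &4000) and ignores interrupts and the CALL overhead, so it gives a lower bound rather than the measured time.

```python
# Standard NMOS 6502 cycle counts for the instructions in the listing.
DEC_ZP, NOP, DEX, RTS = 5, 2, 2, 6
BNE_TAKEN, BNE_NOT_TAKEN = 3, 2          # no page crossing assumed
SETUP = 2 + 3 + 2 + 3 + 2                # LDX # / STX zp / LDA # / STA zp / TAX


def counted_loop(body_taken, body_last, passes):
    """Cycles for a loop that branches back passes-1 times, then falls through."""
    return (passes - 1) * body_taken + body_last


# Innermost: DEC &70 / BNE LOOP. &70 starts at 0, so it wraps through 256 passes.
inner = counted_loop(DEC_ZP + BNE_TAKEN, DEC_ZP + BNE_NOT_TAKEN, 256)

# Middle: inner loop, five NOPs, DEX / BNE LOOP. X also starts at 0: 256 passes.
middle_fixed = inner + 5 * NOP + DEX
middle = counted_loop(middle_fixed + BNE_TAKEN, middle_fixed + BNE_NOT_TAKEN, 256)

# Outer: middle loop, DEC &71 / BNE LOOP. &71 starts at &6C, i.e. 108 passes.
outer_fixed = middle + DEC_ZP
outer = counted_loop(outer_fixed + BNE_TAKEN, outer_fixed + BNE_NOT_TAKEN, 0x6C)

total = SETUP + outer + RTS
print(total)                         # 57010949
print(round(total / 2_000_000, 2))   # seconds at an ideal, uninterrupted 2 MHz
```

The count comes to about 57 million cycles, which puts an uninterrupted 2 MHz run somewhat under the roughly 30 seconds reported; interrupt servicing on the real or emulated machine is not modelled here.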

drac030 · February 2, 2021

6 hours ago, vol said:

How is compatibility? We know that Atari refused to replace the NMOS 6502 with the CMOS one because some programs didn't work on the CMOS 6502, for example the game Asteroids.

I do not know that we know that (or only suppose), but in any case, Asteroids works here in the turbo mode.

Even if it did not, within few seconds (without powering the machine down, just reboot) I can always switch the board into legacy 6502 mode. I rarely need to do that, however, mainly for testing purposes.

Besides, who cares that a game does not work? It will always be so; there are even games which work on a stock 65XE and do not on a stock 130XE. If something is particularly cool, I or someone else can patch it. But if it is a question of just another case of a square throwing dots into circles, and it refuses to work just because its author's personal religion forbids him to accept that some people install a second Pokey in their Ataris (let alone a 20 MHz accelerator), the only thing I can do is use the special compatibility fixer every DOS has: the command DEL. The additional feature is that it also saves some space on the harddisk.

6 hours ago, vol said:

IMHO it would be good if somebody made a high-frequency 65CE02 - it is much better than the 65C02.

From the brief description I found here: http://www.zimmers.net/anonftp/pub/cbm/documents/chipdata/65ce02.txt I can see that it is just a form of a cross-breed between 65C02 and 65C816, which provides some features the latter has, but to a limited extent (save the bit manipulation instructions which however are also present in some 65C02 varieties). So as long as at least the Atari platform is concerned, I see no reason why 65CE02 should be preferred over 65C816.

6 hours ago, vol said:

tmpx-assembler

Ah, the platform-specific tools. Is it this: https://style64.org/release/tmpx-v1.0-style ?

6 hours ago, vol said:

Do you know that the Commodore +4 doesn't have NMI at all?

I did not know that. I only knew that on C-64 most interrupts of daily use are IRQs. I guess that the reason is the famous 6502 bug regarding the NMI handling you certainly heard of.

6 hours ago, vol said:

Sorry, I missed it: what purpose does the file MBI.EXE have? Is it a program for Microsoft Windows?

It is a program for 65C816-expanded Ataris. It is the "kind of expanded BASIC", as you have formulated it. I used it to load your binary into the address $021907 instead of the originally intended $001907. As this was only a one-time test to satisfy my curiosity, I did not bother writing a proper loader.

2 hours ago, Rybags said:

30 seconds runtime on (emulated) BBC Model B

It seems to give 35.74 seconds (1782 VBL ticks) on PAL Atari.

bbc.mae

bbc.com

ELSA's source. I omitted the include file with the common equates.
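The conversion from VBL ticks to seconds behind that figure is a single division. A Python sketch, using the PAL and NTSC divisors quoted earlier in the thread (49.86 and 59.92); the function name is made up.

```python
PAL_VBL_HZ = 49.86     # frame rates quoted earlier in the thread
NTSC_VBL_HZ = 59.92


def ticks_to_seconds(ticks, pal=True):
    """Convert a count of vertical-blank ticks to seconds."""
    return ticks / (PAL_VBL_HZ if pal else NTSC_VBL_HZ)


print(round(ticks_to_seconds(1782), 2))   # 35.74
```

Assuming 50 Hz instead of 49.86 would report 35.64 s, about 0.3% low, which is why the benchmark uses the odd-looking divisors.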

Edited February 2, 2021 by drac030

Faicuai · February 2, 2021

7 hours ago, drac030 said:

It seems to give 35.74 seconds (1782 VBL ticks) on PAL Atari.

Global-performance BBC:ATARI ratio is 1:1.1173 which suggests a target exec. time of 33.5195s on Atari... Yet it is clocking slower at 35.74s which implies a global ratio of 1:1.19 instead... That is a SIGNIFICANT difference. More than what the clock-speed suggests.

As far as I know the 6502 seems to *require* RAM clocked at twice its speed to perform at full potential. The BBC looks like it is juicing that 6502 all the way. But it also raises the question of what Atari's real RAM speed is. Is it running at Antic's freq.? Less?

All in all, Atari's 6502 is not delivering everything it can (there's a bit lost somewhere). Not even with DMA=OFF and suppressing interrupts. That is for sure.

Edited February 2, 2021 by Faicuai

Rybags · February 3, 2021

The need for faster RAM is only to be able to hide the refresh and video memory accesses.

That ratio sounds almost right - DMA off with 9 cycles refresh on all lines should be about 1.21 for NTSC Atari vs BBC.
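The 1.21 estimate can be reproduced from the clock ratio and the refresh overhead. A Python sketch follows; the nominal NTSC Atari clock of 1,789,773 Hz and the 114 machine cycles per scanline are standard figures I am supplying, not numbers from the posts.

```python
BBC_HZ = 2_000_000
ATARI_NTSC_HZ = 1_789_773      # nominal NTSC Atari CPU clock (supplied here)
CYCLES_PER_LINE = 114          # machine cycles per Atari scanline (supplied here)
REFRESH_PER_LINE = 9           # refresh cycles per line, as stated in the post

clock_ratio = BBC_HZ / ATARI_NTSC_HZ
refresh_factor = CYCLES_PER_LINE / (CYCLES_PER_LINE - REFRESH_PER_LINE)
expected = clock_ratio * refresh_factor

measured = 35.74 / 30.0        # PAL Atari time over the roughly 30 s BBC run

print(f"{clock_ratio:.4f} {expected:.4f} {measured:.4f}")
```

The measured figure comes from a PAL machine and a rounded 30-second BBC time, so it is not strictly like for like with the NTSC estimate.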

vol · February 3, 2021

22 hours ago, drac030 said:

Even if it did not, within few seconds (without powering the machine down, just reboot) I can always switch the board into legacy 6502 mode. I rarely need to do that, however, mainly for testing purposes.

From the brief description I found here: http://www.zimmers.net/anonftp/pub/cbm/documents/chipdata/65ce02.txt I can see that it is just a form of a cross-breed between 65C02 and 65C816, which provides some features the latter has, but to a limited extent (save the bit manipulation instructions which however are also present in some 65C02 varieties). So as long as at least the Atari platform is concerned, I see no reason why 65CE02 should be preferred over 65C816.

Ah, the platform-specific tools. Is it this: https://style64.org/release/tmpx-v1.0-style ?

I did not know that. I only knew that on C-64 most interrupts of daily use are IRQs. I guess that the reason is the famous 6502 bug regarding the NMI handling you certainly heard of.

It is a program for 65C816-expanded Ataris. It is the "kind of expanded BASIC", as you have formulated it. I used it to load your binary into the address $021907 instead of the originally intended $001907. As this was only a one-time test to satisfy my curiosity, I did not bother writing a proper loader.

We know that WDC made the 65C02 not completely compatible with the NMOS 6502. Some instructions have different timings on the two. Asteroids crashed because of different timings for BCD instructions. It seems rather crazy to me that the CMOS 6502 (or even the 65816 executing 6502 code) is a bit slower than the NMOS 6502.

The 65CE02 was greatly accelerated: it is generally 25% faster than the 6502! It has an additional index register that gives it a better addressing mode than the 65C02/65816 (zp)-addressing, the base register, 16-bit SP, ... It also seems crazy to me that WDC didn't try to make their processors as fast as this old one. Indeed the 65816 is generally more powerful because it has a 24-bit address space and several other good features, but it would have been better if it had been based on the 65CE02 rather than on the 65C02. BTW, if you are interested in 6502 history, I dare to recommend this blog.

TMPX is not platform specific; you can run it on any Linux or Microsoft Windows machine, as I do. IMHO it is not very good but I used to use it, I am trying to use vasm now. I have a project where I use vasm - BTW, is there some library of good sorting algos for the Atari 800? My implementations are only about 2 times faster than the Z80...

Thank you. I didn't know about this 6502 bug around NMI. However, it is rather a bug in the 6502 documentation, so we can rather blame the Atari engineers, who didn't test the Antic thoroughly and missed that an NMI requires one more cycle in some cases.

So I can't use MBI.EXE unless I have the Rapidus board?

15 hours ago, Faicuai said:

Global-performance BBC:ATARI ratio is 1:1.1173 which suggests a target exec. time of 33.5195s on Atari... Yet it is clocking slower at 35.74s which implies a global ratio of 1:1.19 instead... That is a SIGNIFICANT difference. More than what the clock-speed suggests.

As far as I know the 6502 seems to *require* RAM clocked at twice its speed to perform at full potential. The BBC looks like it is juicing that 6502 all the way. But it also raises the question of what Atari's real RAM speed is. Is it running at Antic's freq.? Less?

All in all, Atari's 6502 is not delivering everything it can (there's a bit lost somewhere). Not even with DMA=OFF and suppressing interrupts. That is for sure.

Do you mean ATARI:BBC = 1:1.1173 ? The BBC Micro must be faster. However it should be much faster because its fast RAM isn't utilized completely. BTW were there the Acorn Archimedes in Australian schools?

Edited February 3, 2021 by vol

zbyti · February 3, 2021

I thought I knew something about benchmarking 6502... I was wrong

Faicuai · February 3, 2021

Well, all of this got me thinking... in terms of all these accelerators, etc...

....and it looks like what we really need is a bare-bones, low-power 6502 running at 200 MHz, coupled with RAM running at 400 MHz (with on-system-bus executive authority over 6502 / ANTIC when it needs to run)... That baby should haul-ass any time of the day, and no need for code conversion, special OSes, etc... just plain-old, 8-bits-at-a-time chugging, accessing that congested and structured 64KB address range, but running crazy-fast!

That would be nice...

Edited February 3, 2021 by Faicuai

777ismyname · February 3, 2021

On 2/1/2021 at 8:38 AM, vol said:

.... Bill Mensch just says something strange from time to time but does nothing. Three years ago he said that he could make the 6502 run at 10 GHz; this winter he even declared 20 GHz!....

Link >>> http://ataripodcast.libsyn.com/antic-interview-96-bill-mensch-6502-chip

zbyti · February 3, 2021

Quote

Three years ago he said that he could make the 6502 run at 10 GHz; this winter he even declared 20 GHz!....

https://www.schach-computer.info/wiki/index.php?title=SciSys_Leonardo_TurboKit

https://www.schach-computer.info/wiki/index.php?title=Mephisto_MM_IV_Turbo

https://www.schach-computer.info/wiki/index.php?title=TurboKit

Edited February 3, 2021 by zbyti

stepho · February 4, 2021

15 hours ago, vol said:

BTW were there the Acorn Archimedes in Australian schools?

Not in significant numbers and not backed by the government like they were in the UK.
Mostly it was the Apple II or the Microbee (Z80 CPM machine).

Mazzspeed · February 4, 2021

1 hour ago, stepho said:

Not in significant numbers and not backed by the government like they were in the UK.
Mostly it was the Apple II or the Microbee (Z80 CPM machine).

https://en.wikipedia.org/wiki/Econet

Compared to the BBC, even with 32k (which really wasn't limiting in the slightest), the IIe wasn't real flash.

The Archimedes, however, was never a thing in Australia. I've never even seen one.

Edited February 4, 2021 by Mazzspeed

drac030 · February 4, 2021

On 2/3/2021 at 3:34 PM, vol said:

We know that WDC made the 65C02 not completely compatible with the NMOS 6502. Some instructions have different timings on the two. Asteroids crashed because of different timings for BCD instructions. It seems rather crazy to me that the CMOS 6502 (or even the 65816 executing 6502 code) is a bit slower than the NMOS 6502.

I did not know that Asteroids crashes on the 65C02. Due to the BCD timings? Man, I must try it to see the crash with my own eyes (I do not have a 65C02 Atari, but I can try it under emulation). Regardless, the 65C816 executing 6502 code is not slower than the 6502: the instructions which were slower in the 65C02 have the same timings as in the 6502 (or at least the WDC manual says so).

On 2/3/2021 at 3:34 PM, vol said:

The 65CE02 was greatly accelerated: it is generally 25% faster than the 6502!

By improving the prefetching pipeline, I see. This however only applies to instructions which do superfluous prefetches (like CLC for example); JMP, say, is not improved (nor can it be on this architecture). On the 65C816 the pipeline was not improved, but you can still simulate this (or a similar) effect by clocking the processor faster on internal operation cycles - because, among others, it has legs to signal "valid" instruction fetches and data accesses to external circuitry. Nice feature of the 65CE02, though.

On 2/3/2021 at 3:34 PM, vol said:

It has an additional index register that gives it a better addressing mode than the 65C02/65816

You mean the Z register. I have mixed feelings about that, because it seems a nice idea at first (one can use STZ to store not only $00 but an arbitrary value), but on the other hand I am not sure if it is not an overkill. But okay, this is a nice idea.

The "better addressing mode" is the (zp),Z - yup, nice to have another postindexed mode. Both usages of Z seem to be mutually exclusive, though.

On 2/3/2021 at 3:34 PM, vol said:

the base register, 16-bit SP,

Both also present in 65C816, save that the "base register" (known as D) is 16-bit. By copying the SP to the D register you can use all zp addressing modes on a stackframe. The only flaw here in 65C816 is that D and SP are not 24-bit.

On 2/3/2021 at 3:34 PM, vol said:

the 65816 ... would have been better if it was based on the 65CE02 than based on the 65C02

65C816 release date is 1985. 65CE02 release date is 1988. So "based" is impossible chronologically, and also it seems unlikely that Mensch would base anything on a competitor's design, even if it was earlier.

Other things from the 65CE02 manual:

- word branches: nice, but I am not sure if it is really necessary. It is true that a combo of branch and jmp (like BCC *+5 / JMP xx) is a hassle to use and it is slower (5 cycles vs 3), but a) assemblers commonly offer this as a built-in pseudoinstruction (even if not, you can make a macro), and b) the timing does not matter much once you realize that e.g. in a loop so long that it requires a long branch, the timing of the branch is probably such a tiny fraction of its total execution time that it is not a problem. Also, with JCC and the like the code is larger - 65C816 fixes that by offering more addressable memory.

Also, you can now say that JCC is not position independent, because it contains an absolute jump. 65C816 has one long branch, BRL (which is BRA with a 16-bit offset); by combining this with a short branch you can effectively have conditional branches with a word offset.

Same with BSR - 65C816 has an instruction (PER) which takes a relative 16-bit offset, calculates the absolute position out of it, then pushes it to the stack. Combining this with BRA or BRL gives you BSR and BSL - relative calls to subroutines. Generally 65C816 allows writing fully position independent code (although it sometimes requires long instruction sequences, such as when you want a position-independent version of JMP (abs,X), but it is generally possible, and it seems not to be on 65CE02).

- RTS with stack correction (RTN) - this seems nice at first, but I know from 68030 programming that such an instruction has hardly any practical use, as for certain reasons it is much better that it is the caller who corrects the stack.

- JSR (abs,X) - check, also present in 65C816. JSR (abs) is absent, but you can simulate it by combining two instructions (pushing a word constant onto the stack and doing JMP (abs)).

- NEG - true, not present in 65C816. EOR #$FF / INC, though.

- ASR - also not present as an instruction, one uses CMP #$80 / ROR, when desperately needed (the need is as rare as signed arithmetic, though).

- INW/DEW - in 65C816 you can have not only word RMW instructions, but also word data fetches and stores. On this CPU you can write $0222 atomically using one instruction, without disabling NMIs before, waiting for VBL or other tricks to avoid the store being interrupted in the process.

- stack relative indirect mode - check, present on 65C816, although its usefulness is reduced by the possibility to use zp indirect modes on the stack via the D register (see above). Also, 65C816 has a plain stack relative mode too, so you can not only use stacked pointers, but also access stacked data easily (like something that is several bytes deep under the SP-pointed location).

- STX abs,Y and STY abs,X - I never understood why these were missing from the 6502 in the first place, but this must have been intended, as they are also missing in 65C02 and 65C816. Mensch's personal idiosyncrasy?

- PHW - on 65C816 it is called PEA. Together with its siblings PER and PEI, it can push immediate data, or a relative offset (see above), or copy a word from zp directly to the stack without using registers (it is the weirdest instruction family of the whole CPU, if you ask me).
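The two substitutes mentioned in the list above (EOR #$FF / INC for NEG, and CMP #$80 / ROR for ASR) can be checked with a small model of the 8-bit accumulator. This is only a sketch in Python, not 65xx code: the function names `neg8`, `asr8` and `to_signed` are made up for illustration, and the carry flag is modelled as a plain local variable.

```python
def to_signed(a: int) -> int:
    """Interpret an 8-bit value as a two's-complement signed number."""
    return a - 256 if a >= 0x80 else a


def neg8(a: int) -> int:
    """Model of EOR #$FF / INC A: two's-complement negate of the accumulator."""
    a = (a ^ 0xFF) & 0xFF       # EOR #$FF flips every bit
    return (a + 1) & 0xFF       # INC A adds one and wraps at 8 bits


def asr8(a: int) -> int:
    """Model of CMP #$80 / ROR A: arithmetic shift right of the accumulator."""
    carry = 1 if a >= 0x80 else 0             # CMP #$80 sets carry when A >= $80
    return ((carry << 7) | (a >> 1)) & 0xFF   # ROR A rotates carry into bit 7


if __name__ == "__main__":
    for value in (0x01, 0x7F, 0x80, 0xFE):
        print(f"${value:02X}: NEG -> ${neg8(value):02X}, ASR -> ${asr8(value):02X}")
```

The CMP #$80 trick works because it copies the sign bit into the carry, so the following ROR shifts the sign back into bit 7 instead of a zero.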

Besides, block move (an odd one, but it exists), 16-bit index registers, 24-bit address space accessible with 16-bit addressing modes, but also with a 24-bit absolute mode. ABORT interrupt which allows implementing memory protection and perhaps even virtualization via external MMU circuitry. Probably more that I am forgetting. So no, IMHO 65C816 is clearly the superior CPU, even though it is the older one.

Besides, as I can see, 65CE02 was introduced in 1988 and discontinued the same year (whereas 65C816's brand new units are available to this day). What was the reason?

On 2/3/2021 at 3:34 PM, vol said:

TMPX is not platform specific, you can run it from any Linux or Microsoft Windows as I do.

I do realize that it is a cross-asm; by "platform" I meant mainly the C-64, as each 8-bit platform has its own specific programming tools, including the cross ones. I will try to use it to prepare the Rapidus specific version of the benchmark, although it may be difficult in places (as TMPX, do I understand correctly, is missing direct 65C816 support? That would be weird considering the existence of SuperCPU).

On 2/3/2021 at 3:34 PM, vol said:

BTW is there some library of good sorting algos for the Atari 800? My implementations show speed only about 2 times faster than the Z80...

I do not know, I programmed sorting only once, and the algorithm I used was combsort - it is fast enough when you want to sort a directory (even a long one, like 1000-something entries, takes a few seconds).
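For reference, combsort itself is short. Below is a minimal sketch in Python, not the 6502 routine mentioned above (which is not shown in this thread). The shrink factor of 1.3 is the commonly quoted value, written here as an integer multiply and divide by 10/13 on the assumption that integer-only arithmetic is closer to what an 8-bit implementation would do.

```python
def comb_sort(items):
    """Return a sorted copy of items using comb sort.

    Comb sort is bubble sort with a shrinking gap: elements far apart are
    compared first, so small values near the end move forward quickly.
    """
    data = list(items)
    gap = len(data)
    swapped = True
    # Keep going until the gap is 1 and a full pass made no swaps.
    while gap > 1 or swapped:
        gap = max(1, gap * 10 // 13)   # shrink the gap by about 1.3 (assumed factor)
        swapped = False
        for i in range(len(data) - gap):
            if data[i] > data[i + gap]:
                data[i], data[i + gap] = data[i + gap], data[i]
                swapped = True
    return data


if __name__ == "__main__":
    print(comb_sort(["DOS.SYS", "AUTORUN.SYS", "MBI.EXE", "BENCH.COM"]))
```

The attraction on a small machine is that it needs no recursion and no extra buffer, only the gap counter and a swap.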

On 2/3/2021 at 3:34 PM, vol said:

However it is rather a bug in the 6502 documentations so we can rather blame the Atari engineers who didn't test the Antic thoroughly and missed that NMI require one more cycle in some cases

I heard people saying that it is a bug in the documentation, but I am not sure that anyone has proved that (by building a circuit that extends the NMI signal and demonstrating that the NMI bug does not occur anymore after that). If the docs did not provide that detail, it is really hard to notice that some NMIs are missed. Besides, what is "in some cases"?

65C02/65C816 fix that anyways.

On 2/3/2021 at 3:34 PM, vol said:

So I can't use MBI.EXE unless I have the Rapidus board?

Any 65C816 adapter board will do, even the simplest one (no acceleration, no additional memory, just the CPU). You will also need the Rapidus OS (contrary to what the name suggests, it does not necessarily need the Rapidus board to run).

20 hours ago, Faicuai said:

just plain-old, 8bits-at-time chugging, accessing that congested and structured 64KB address range, but running crazy-fast!

It is already known from the attempts some people have made at building accelerators (successfully or not, or not yet) that the main problem is not the CPU speed itself but the Atari architecture: to make everything flawless you basically have to build the entire computer from scratch and also modify the bus operation, because even today regular DRAMs, from what I hear, are too slow even for a 14 MHz 65C816, thus you are forced to either allow wait states, use static RAMs (which are relatively expensive) or use caches.

Also, I think I wrote in my life enough code for banked memory to say that "congested and structured 64KB address range" is a burden that would be best to avoid, whenever possible. 320k, 1 MB or even 4 MB (Axlon) fitted into 64k address space... okay. Is it challenging? Certainly. Swimming with an anvil tied to your neck is also challenging; it is also anything but an effective mode of swimming. The constant banking involved is just this anvil: once you get good at swimming in general it may be fascinating at first to see how it is to swim with an anvil, but after some time it gets a bit boring and later you discover that you have this anvil tied on permanently. The effect? Just see how many text editors (let alone wordprocessors) we have which allow editing buffers larger than 64k... what am I saying, how many are there which allow buffers larger than 40k! This platform is more than 40 years old, and it has been 35 years since we got the extended RAM.

Besides, some things are just hardly possible without flat memory. A notorious problem on 65xx is the lack of multiply and divide instructions. Bill Mensch of course once promised a CPU with 32-bit fixed-point multiply and divide, but also of course did not deliver it (if I were the Rapidus designer I would implement that in the FPGA, but this also has not been done, unfortunately). Still, having a ton of flat memory you can precompute a 128k lookup table (256 x 256 entries of 2 bytes each) and use it to perform an 8x8=16 bit multiply in less than 20 clock cycles. On a 20 MHz processor this takes about as long as 2 cycles on a stock 6502, i.e. no time.
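To make the table size concrete, here is a rough sketch in Python of such a lookup table. The layout is an assumption made for illustration, not something specified above: the two operands form a 16-bit index (first operand in the high byte), and each entry is a little-endian 16-bit product. The names `build_table` and `mul8x8` are likewise hypothetical.

```python
ENTRY_BYTES = 2                          # each product is a 16-bit word
TABLE_SIZE = 256 * 256 * ENTRY_BYTES     # 131072 bytes = 128k


def build_table() -> bytearray:
    """Precompute every 8-bit x 8-bit product into one flat 128k table."""
    table = bytearray(TABLE_SIZE)
    for a in range(256):
        for b in range(256):
            offset = ((a << 8) | b) * ENTRY_BYTES   # assumed index layout
            product = a * b
            table[offset] = product & 0xFF          # low byte first
            table[offset + 1] = product >> 8        # then high byte
    return table


def mul8x8(table: bytearray, a: int, b: int) -> int:
    """Multiply two bytes with one indexed 16-bit fetch, no arithmetic loop."""
    offset = ((a << 8) | b) * ENTRY_BYTES
    return table[offset] | (table[offset + 1] << 8)


if __name__ == "__main__":
    table = build_table()
    print(len(table), mul8x8(table, 200, 123))
```

The timing comparison follows from the clock rates: 20 cycles at 20 MHz take 1 microsecond, which is roughly the duration of two cycles on the stock 6502 at about 1.79 MHz.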

Anyway, all this was way off the topic, I do not want it to become just another pointless discussion on superiority of this over that.

Edited February 4, 2021 by drac030
typos & more typos

Faicuai · February 4, 2021

29 minutes ago, drac030 said:

It is already known from the attempts some people have made at building accelerators (successfully or not, or not yet) that the main problem is not the CPU speed itself but the Atari architecture: to make everything flawless you basically have to build the entire computer from scratch and also modify the bus operation, because even today regular DRAMs, from what I hear, are too slow even for a 14 MHz 65C816, thus you are forced to either allow wait states, use static RAMs (which are relatively expensive) or use caches.

That's what I am not really sure about.

You see, one thing is to design a co-processor / acceleration unit that would behave its best within the original design of the system, and another, very different thing is to build an accelerator that does whatever we want it to do, and then attempt fitting such an arbitrary design back into the system. I am not really sure which of these two sides of the coin has been attempted repeatedly, at this point.

In my humble view of things, the main issue itself with Atari's architecture is that it was designed with the "ray-trace/chase-the-beam" concept at its very core. It is not the 6502 who actually calls the shots on the system. It is ANTIC who's running the show here. Even the 6502 was specifically modified to include HALT-logic on-board. The idea was always to stop the 6502, and have ANTIC do its magic (in concert with GTIA). In other words, an entire computer system designed around the notion of an electron-beam flying over the screen, at close to the speed of light. Everything dances to this tune.

On one side, this is NO way to design a real / pure computer system. It does not seem to me the Apple-II or the IBM/PC architectures were built this way (which ended up reigning supreme, even my HP Z840 fully resembles this architecture, with all its might). On the other hand, it is precisely this rasterized chase-the-beam approach which enables the Atari to genuinely display real-time video with as low as 32 Kbytes of RAM and with 1980 stock chipset, directly from a system-bus data register, where a fast-enough storage-media will deliver the encoded video, 8-bits at a time, and with ANTIC (once again) leading the way, at 500 Kbytes/sec. This will make many other architectures puke, when attempted in such simple, straight-forward manner. So the concept of "MIPS" on the Atari is a bit more complex than just how fast you can handle a bunch of LDA, STA, JSR, etc.

On the Atari, I see a system in which processing time (not just defined by CPU mips) is actually a compounded share of CPU-bound work, GPU-bound work and general I/O. And any design of co-processor that wishes to do more than just run SIEVE blazing-fast (like assisting DLI processing, for instance) will need to operate in such a tandem scenario, one way or another. If the 6502 will always be stopped and ANTIC will keep running, then such a co-processor will need to access RAM on the system-bus and, in return, deliver x100 or x1000 the cpu-cycles lost by halting the 6502, for whatever task it is temporarily handling. And if it can't, then it will always be a completely foreign entity in the system, with a limited range of use and applications, that will hardly appeal to the masses. That's the crux, the way I see it (I may be wrong, though).

Another approach would be (in very simple terms) to forget about the idea of a Co-Processor with grips on the system-bus, and simply build a compact, high-power board (fitting Slot-3 on the 800, for instance) that would run a really fast CPU and RAM with Altirra, with its own Video-output, on-board WiFi and BlueTooth, and with the ability to interface (at bus level) with the host-system (for peripherals access, etc.) This may be a more extensible and survivable approach, which will allow any user to enjoy ANY SW Title, whether NTSC or PAL, and any virtual peripheral he may have never owned or known before, with pure digital video and sound, while never really leaving behind the host system, which will be available at your finger tips.

Maybe at that point, benchmarking 14/20/40Mhz CPUs will again look like a thing of the past... ;-)
