
Ahl's Benchmark?


Larry


 

  1. There is nothing to show during compute time. Zero. There is no point in wasting 25%-30% of the 6502's throughput... because it is literally being halted by ANTIC. Moreover, stuff CAN still BE SHOWN even if ANTIC is "turned off". System Information 2.24 achieves exactly this.
  2. Neither Atari BASIC nor the Atari OS applies the trivial arithmetic and basic optimizations... that they otherwise could with more memory to spare (instead of a "miserable" 8 KB span).
  3. Unrolling MAY or MAY NOT help (see the sketch below this list). Atari BASIC, for instance, does not seem to run FOR-NEXT loops with pure integer arithmetic. Atari BASIC is VERY, VERY constrained.
  4. The system ROM I am using (800XL/XE-Rev3-FP) runs add/subtract operations about 2.3x faster and multiply/divide operations 5.0-5.8x faster than the original Atari FP routines. That's the key.
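For clarity, here is a minimal sketch of what unrolling the benchmark's inner loop looks like in Atari BASIC. The line numbers and the RND(0) form are illustrative only, not the exact listing used for the timings quoted in this thread:

100 REM ORIGINAL FORM: TEN PASSES THROUGH FOR/NEXT BOOKKEEPING
110 FOR I=1 TO 10
120 A=SQR(A):R=R+RND(0)
130 NEXT I
200 REM UNROLLED: SAME WORK, NO LOOP OVERHEAD (TEN COPIES IN ALL)
210 A=SQR(A):R=R+RND(0):A=SQR(A):R=R+RND(0)
220 A=SQR(A):R=R+RND(0):A=SQR(A):R=R+RND(0)
230 A=SQR(A):R=R+RND(0):A=SQR(A):R=R+RND(0)
240 A=SQR(A):R=R+RND(0):A=SQR(A):R=R+RND(0)
250 A=SQR(A):R=R+RND(0):A=SQR(A):R=R+RND(0)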

 

Anyone here is welcome to post resulting times (and screenshots) from similar optimizations. Going from 400+ secs down to 42 secs (still in ATARI BASIC !!!) shows how wasteful and potentially pointless this benchmark is on the Atari.

 

Cheers!

I stand corrected on the ANTIC.

But you are still missing the point of the benchmark.

You aren't running the same code, which is the point of a benchmark.

 


It is not my intention to over-rotate on a trivial matter, but, objectively, I can't agree. To put things in perspective:

 

  • Even when benchmarking my Broadwell-based HP Z840 (Xeon v4), there is a MULTITUDE of items that need to be addressed BEFORE benchmarking (!):
    • runtime power management,
    • idle power states,
    • enabling / disabling Virtualization, Hyper-Threading, etc.,
    • defining process-to-processor affinity, etc.
  • All of the above play key roles in determining what the HW platform is really capable of doing.
  • Likewise, on the Atari, ensuring that the CPU can devote as many cycles as possible to the job (thus capitalizing on its 1.7+ MHz of raw speed) is MANDATORY (not optional !!!). WHAT is the point of running a test where we know UP FRONT that close to 30% of CPU time is wasted (!?) A minimal sketch of turning the display off follows this list.
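For context, "ANTIC off" in the timings on this thread simply means disabling display DMA around the timed section via the SDMCTL shadow register, then restoring it afterwards. A minimal sketch (the restore value of 34 assumes the standard default display setup):

10 REM DISABLE ANTIC DISPLAY DMA: SCREEN BLANKS, CPU GAINS ROUGHLY 25-30%
20 POKE 559,0
30 REM ... TIMED BENCHMARK CODE RUNS HERE ...
40 REM RESTORE NORMAL PLAYFIELD DMA (34 IS THE USUAL SDMCTL DEFAULT)
50 POKE 559,34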

 

And that is, in my opinion, the crux of this particular story / benchmark (and where I believe you are missing the point):

  • Going from 400+ secs to 42+ secs (almost a TEN-FOLD reduction in time, STILL within the Atari BASIC interpreter !) simply tells us how WOEFULLY inadequate the environment (FP routines / BASIC implementation) was, rather than what the HW platform's (CPU, memory, supporting chipsets, etc.) TRUE capabilities are.
  • When seeking real, definitive benchmarking results, the LATTER is what I am interested in (because it is what I can't immediately control or change). The former (SW / operating environment), however, is something we have much better control over, on almost any platform.

 

If it were up to me, I would invite ANYONE reading this thread to run this benchmark on ANY COMPARABLE platform of their choice, to that platform's BEST ability, and show the results here... that would be really, really interesting (putting aside the markedly skewed nature of the benchmark, which mostly hammers floating-point and integer handling).

Cheers!


 


That's not what a standard benchmark does!

 

If I run a graphics test 100% coded and optimized for an AMD graphics card and then I run an entirely different test 100% coded and optimized for my NVidia graphics card, the results don't mean jack shit. SAME code running on two different platforms, valid result.

 

If you're testing 2 cars for 0-60 MPH time, you don't put a professional racecar driver in one and a brand-new driver in the other. Again - the results wouldn't mean shit. Same driver, 2 cars, valid test result.



 

Wrong, my friend! And here's the proof:

 


It is not my intention to over-rotate on a trivial matter, but, objectively, I can't agree. To put things in perspective: (...)

You list all that stuff that needs to be addressed before you can benchmark and forget one crucial thing. None of that is the actual benchmark code.

And that is why your opinion is wrong.

 

Look, if someone wants to post stuff like that, by all means, post away. We've included compiler numbers, results with and without A*A instead of A^2, etc., all along.

But don't leave out the tiny minuscule little detail that you are using a tuned benchmark until someone calls you on it.

 

Edited by JamesD

(...)

Look, if someone wants to post stuff like that, by all means, post away. We've included compiler numbers, results with and without A*A instead of A^2, etc., all along (...)

 

Exactly!

 

And by doing so (by taking into account the profound and marked deficiencies of our beloved Atari's OS/BASIC framework), those prior numbers have already been shattered. The benchmark itself is virtually unchanged. You can even discard unrolling, if you wish (it brings little improvement with Atari BASIC).

 

And that is the final point I am attempting to illustrate: (relatively speaking) tons of juice on this platform... totally wasted. That's what this benchmark shows. NOTHING else.

 

Cheers!

P.S. As a side note, I would LOVE to see this benchmark coded (in a modular way) in ASSEMBLER and run across ALL the 6502 platforms we can put our hands on (by just changing the graphics buffer address, the Floating Point / Math vectors, etc.). THAT would be lovely!


Latest update on this famous little thread (UPDATE #3 - 8/10/2017):

 

=> Implementation NOTES (all runs):

  • ANTIC=OFF for maximizing 6502 CPU bandwidth (unless otherwise noted).
  • A=A*A in place of A=A^2, so squaring goes through the direct FP multiply routine (unless otherwise noted). See the sketch right after these notes.
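For anyone wondering about the A=A*A note: Atari BASIC evaluates exponentiation through the logarithm/antilog FP routines, so rewriting the benchmark's squaring loop as a plain multiply replaces several FP calls with a single multiply (and may shift rounding slightly). A minimal sketch of the substitution, with illustrative line numbers:

100 REM AS PUBLISHED: ^2 GOES THROUGH LOG/ANTILOG FP CALLS
110 FOR I=1 TO 10
120 A=A^2:R=R+RND(0)
130 NEXT I
200 REM A=A*A VARIANT: ONE DIRECT FP MULTIPLY PER PASS
210 FOR I=1 TO 10
220 A=A*A:R=R+RND(0)
230 NEXT I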

 

=> (Atari 800/Incognito, Colleen Mode / AXLON, SDX, OS-B + Newell high-performance FP ROMs), and ATARI BASIC (Rev.C) Interpreted:

  • Accuracy: 0.013649 (pretty steady)
  • Random: 7.785987 (varies all over the place)
  • Time (s): 59.8000
  • Time (s): 55.4666 (Inner For / Next Loops unrolled. Slow handling of integer For/Next loops in Atari Basic)

 

=> (Altirra 2.90 w/ FP=OFF, SDX, XE ROM patched w/ optimized FP pack), and ATARI BASIC (Rev.C) Interpreted:

  • Accuracy: 0.013649 (pretty steady)
  • Random: 11.306536 (varies all over the place)
  • Time (s): 46.3000
  • Time (s): 42.9166 (Inner For / Next Loops unrolled. Slow handling of integer For/Next loops in Atari Basic)

 

=> (Altirra 2.90 w/ FP=OFF, SDX, XL ROM Rev.2 OEM), and MICROSOFT BASIC (v1.0) Interpreted:

  • Accuracy: 0.111523 (poor precision, seems to have its OWN FP library, independent of O/S)
  • Random: 11.306536 (varies all over the place)
  • Time (s): 43.0667

 

=> (Altirra 2.90 w/ FP=OFF, SDX, XE ROM patched w/ optimized FP pack), and BASIC XE (v4.1p) Interpreted:

  • Accuracy: 0.013649 (pretty steady)
  • Random: 14.79776 (varies all over the place)
  • Time (s): 37.9666
  • Time (s): 35.5333 (Inner For / Next Loops unrolled)

 

=> (Altirra 2.90 w/ FP=OFF, MyDos, ALTIRRA ROM), and ALTIRRA BASIC (v1.54) Interpreted:

  • Accuracy: 0.000452 (WoW! BIG jump in precision !!!)
  • Random: 2.605347 (varies all over the place)
  • Time (s): 33.9833
  • Time (s): 32.7000 (Inner For / Next Loops unrolled)

 

=> (Altirra 2.90 w/ FP=OFF, MyDOS, XL ROM Rev.2 OEM), and TURBO BASIC (1.5):

  • Accuracy: 0.013649 (pretty steady)
  • Random: 2.10417 (varies all over the place)
  • Time (s): 26.68 (non-compiled)
  • Time (s): 25.50 (non-compiled, Inner For / Next Loops unrolled)
  • Time (s): 21.75 (compiled, Inner For / Next Loops unrolled)

 

=> (Altirra 2.90 w/ FP=OFF, SDX, XE ROM patched w/ optimized FP pack), and BASIC++ (v1.08) Interpreted:

  • Accuracy: 0.014842 (slightly lower precision)
  • Random: 8.052295 (varies all over the place)
  • Time (s): 40.9500 (ANTIC=ON, A=A^2, NO inner For / Next unrolling)
  • Time (s): 37.8666 (ANTIC=ON, A=A*A, NO inner For / Next unrolling)
  • Time (s): 28.0000 (ANTIC=OFF, A=A^2, NO inner For / Next unrolling)
  • Time (s): 25.9000 (ANTIC=OFF, A=A*A, NO inner For / Next unrolling)
  • Time (s): 24.1000 (ANTIC=OFF, A=A*A, inner For / Next unrolling)

 

=> (Altirra 2.90 w/ FP=OFF, SDX, XE ROM patched w/ optimized FP pack), and ALTIRRA BASIC (v1.54) Interpreted:

  • Accuracy: 0.014842 (slightly lower precision)
  • Random: 5.4557 (varies all over the place)
  • Time (s): 77.5000 (ANTIC=ON, A=A^2, NO inner For / Next unrolling)
  • Time (s): 52.9166 (ANTIC=OFF, A=A^2, NO inner For / Next unrolling)
  • Time (s): 28.8833 (ANTIC=ON, A=A*A, NO inner For / Next unrolling)
  • Time (s): 19.7333 (ANTIC=OFF, A=A*A, NO inner For / Next unrolling)
  • Time (s): 18.5800 (ANTIC=OFF, A=A*A, inner For / Next unrolling)

 

In summary:

 

  1. Up to TWENTY (20) TIMES faster results could be attained with the exact same base HW (setting aside the nature of the BASIC interpreter optimizations).
  2. A long, long way from the dumb-ass 405+ secs of the original timing listed back in the day... It will hardly get any better than this while still preserving the benchmark's core logic / structure intact.
  3. Basic++ 1.08 manages to extract impressive improvements from what seems to be a pretty small code base (8K). Maybe larger once loaded? It also seems to handle integer powers (^2) optimally, by calling the fast FP multiply routine (!)
  4. Altirra BASIC manages to outperform almost every other package in BOTH the speed AND precision departments (the latter with the Altirra OS loaded), also from what seems to be a pretty small code base (8K).

 

Cheers!


LATEST UPDATE:

 

Now with Basic++, with ANTIC both ON and OFF, with both A=A^2 and A=A*A, and with and without loop unrolling.

Coincidentally, and after the fact, I found this article that discusses precisely THIS benchmark and most of its implementation challenges, impact, and deeper limitations on the Atari (it points in EXACTLY the same direction as several of us have here, every step of the way). A VERY interesting read:

 

 

http://www.atarimagazines.com/compute/issue57/insight_atari.html

 

 

Cheers!


 

. . .

 

And that is the final point that I am attempting to illustrate: (relative) tons of juice on this platform... totally wasted.

. . .

 

It is a benchmark running in BASIC. By definition it totally wastes the performance of whatever computer it runs on. Did I mention this is BASIC? The purpose of a benchmark in BASIC (wow, that's an oxymoron, isn't it?) is to evaluate how poorly BASIC performs. (and by extension the other things we know that affect this, like the difference a slow floating point library makes.)

 

AND, some kind of ballpark figure measuring BASIC was actually a useful goal back in the day. In the 70s/early 80s a major purpose of 8-bit computers was to run BASIC. In fact, for some computers BASIC was the operating system and user interface for users. Many, MANY people buying 8-bit computers would not make it as far as assembly coding, (unlike the rest of us basement dwellers without meaningful social lives,) so a general comparison of BASIC languages was a reasonable, if not precisely accurate yardstick to evaluate their user experience. Today, its primary function is as gasoline for a burning debate on efficiency.


My favorite "benchmark" was the Dead On Arrival/Failure Out Of The Box industry stats that one of the multi-platform magazines published. Wish I could remember which one that was. I recall not many computers came close to the reliability of the Ataris.


It is a benchmark running in BASIC. By definition it totally wastes the performance of whatever computer it runs on. (...)

It is a bit of a "who sucks the least" competition. :grin:

For the companies selling the machines, it was important to have A BASIC, not A FAST BASIC.

And companies like Microsoft were all too happy to deliver the bare minimum.

 

After seeing how little effort went into optimizing Microsoft's code... I have to wonder why nobody came out with a more competitive product.

Everything I've optimized up through the last release still fits in 8K! What the hell was their problem?

Skipping a few passes between BREAK checks literally took 4 instructions and less than a minute to write! Instant free clock cycles!

The multiply took a while to get just so... but I would think Tandy or Motorola would have wanted their machines to be faster.

The 6800 version of BASIC followed the 8080 version rather quickly. It was within months. The 6803 came out in 1978 and the 6809 shortly after from what I can tell.

The MC-10 came out in 1983. They had about 8 years to come up with the skip, and 5 years to take advantage of the multiply instruction, but they never did either one!

 


(...) Many, MANY people buying 8-bit computers would not make it as far as assembly coding, (unlike the rest of us basement dwellers without meaningful social lives,) (...) Today, its primary function is as gasoline for a burning debate on efficiency.


 

Well, I do have a social life... but I could not help chuckling at that comment... LoOoOoL !!! ;-)


  • 3 months later...

After the recent prime number benchmark thread, I ran some additional tests on the Plus/4 (times in seconds):

  • Ahl's benchmark with no changes: 110.083333
  • Defining variables at the top, in order of most to least used: 108.8
  • With the BASIC patch I posted in the prime number thread: 108.5333
  • A different version of that patch, with more optimizations someone made: 107.8
  • Fastest patch plus disabled screen refresh during calculations: 71.8666667
  • That plus A*A instead of A^2: 44.6


  • 4 months later...

I'm working on a patch for the CoCo 3 so it uses the hardware multiply.
The CoCo 3 runs the ROM out of RAM so putting in a jump to the new code is easy.
The code still has a small bug, but in high speed mode it turned in a time of 45 seconds on the first run with no other changes.
Standard speed turns in a time of about 1:30.

The latest version of the MC-10 ROM turned in a time of 65 seconds.


Just for giggles, I typed it into my iMac (3.4 GHz Core i7). Running Chipmunk Basic gave:

>list
10 t = timer()
20 for l = 1 to 10000
30 for n = 1 to 100 : a = n
40 for i = 1 to 10
50 a = sqr(a) : r = r+rnd(1)
60 next i
70 for i = 1 to 10
80 a = a^2 : r = r+rnd(1)
90 next i
100 s = s+a : next n
110 next l
120 print "Timer : ";(timer()-t)/(l-1)
130 s = s/(l-1) : r = r/(l-1)
140 print "Accuracy ";abs(1010-s/5)
150 print "Random ";abs(1000-r)

>run
Timer : 2.227133E-04 
Accuracy 0 
Random 0.160374 

It ran in essentially zero time, so I put a x10000 loop in to slow it down, and munged the values at the end. From a brief perusal of the previous pages, the XL seems to max out at ~40 secs for an unmodified interpreted benchmark, and its 1.79MHz CPU is ~1899x slower by MHz.

XL : 40 / 1899 
   = 0.021 secs
Mac: 0.000223 secs

I guess CPU design has improved over the last 40 years or so after all :)


  • 2 weeks later...

I'm working on a patch for the CoCo 3 so it uses the hardware multiply. (...)

 

The CoCo 3 with the patch and in high speed mode turns in a time of 44.4 seconds.

After a couple more small patches it should be down around 42 seconds.

 

*edit*

That's without using a 6309.

With a 6309 it should drop below 35 seconds.

Edited by JamesD

  • 2 weeks later...

Re-run with latest (?) Altirra Basic:

 

=> (Atari 800 + Incognito, XE ROM patched w/ optimized FP pack, SDX 4.49), and ALTIRRA BASIC (v1.55) Interpreted:

  • Accuracy: 0.014842 (slightly lower precision)
  • Random: 3.4256 (varies all over the place)
  • Time (s): 19.95 (ANTIC=OFF, A=A*A, NO inner For / Next unrolling)

From here, it is time to code this in Assembler, and make it as cross-platform compatible as possible, so we can see pure, bare-metal differences between these little machines... That will be really interesting...


Using A*A and no other optimizations to the BASIC program, the CoCo 3 drops to 29.2166667 seconds; a 6309 should drop to around 23 seconds.
Accuracy is .000193357468

The only patch I've made was to the multiply.
FWIW, if your code performs a lot of division, you can use the reciprocal and multiply in some places to take advantage of this patch.
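To illustrate the reciprocal trick in plain BASIC (a generic sketch with made-up variable names, not code from this thread): the divide is done once up front, and the loop then multiplies, which is what benefits from the faster multiply. Results can differ in the last digits because 1/C is itself rounded:

10 REM DIVIDE INSIDE THE LOOP: ONE SLOW FP DIVIDE PER PASS
20 FOR I=1 TO 1000:X=X/C:NEXT I
30 REM RECIPROCAL TRICK: DIVIDE ONCE, MULTIPLY INSIDE THE LOOP
40 RC=1/C
50 FOR I=1 TO 1000:X=X*RC:NEXT I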


  • 4 weeks later...

Quick update: thought some of you may find this interesting... :-)




=> (Atari 800 + Incognito, CP/M v2.2 (rev 1.1) on Indus-GT (ROM 1.20)), and Microsoft BASIC-85 (v5.29) Interpreted:


  • Accuracy: 0.0670776 (lower precision)
  • Random: 7.4823 (fixed due to fixed random-generator seed)
  • Time (s): 38.40 (A=A*A, NO inner For / Next unrolling)


=> (Atari 800 + Incognito, XE ROM patched w/ optimized FP pack, SDX 4.49), and Microsoft BASIC (v2.00) Interpreted:


  • Accuracy: 0.111523 (even lower precision)
  • Random: 2.06506 (varies all over the place)
  • Time (s): 42.30 (ANTIC=OFF, A=A*A, NO inner For / Next unrolling)

So there you have it: CP/M and native-Atari Microsoft BASIC... Still far from Altirra 1.55, but the MS cross-platform implementations seem close enough to me (Z80 vs. 6502).



:-)



  • 1 month later...

So there you have it: CP/M and native-Atari Microsoft BASIC... Still far from Altirra 1.55, but the MS cross-platform implementations seem close enough to me (Z80 vs. 6502).

This came up in another forum as well: the MS code is barely optimized for any processor other than the 8080. The Z80 (early?) versions, for instance, do not take advantage of the many features of the Z80 that could improve performance. In the case of the 6502 versions, the strings should have been re-written to work in a different way, but they just used the original code. There's likely a LOT of low-hanging fruit in all of the MS versions.

