Alfred

Fastest floating point

Search didn’t come up with anything, so: There are a few FP upgrades like Newell’s, BASIC XE says it has faster FP routines, etc. Is there any consensus on what the fastest FP package is today?


You could probably say there are 3 classes of packaging for FP routines.

 

1 - compact, lives within the original 2K of the OS FP area, Newell is one of these.

2 - extended, lives mostly within the 2K OS FP but also uses some external OS Rom space at the cost of other features.

3 - mostly/entirely external, packaged largely within the Ram or Rom space used by the language.

 

I'm fairly sure there are recent threads around that have relative benchmarks.  As can be expected, the larger the footprint of the FP routines, the faster they tend to perform.


I wonder if the question is more of...

can we make a roundup of all the FP packages in order of...

1 Speed

2 Accuracy

3 Size

in that way we could choose the sweet spot... and have a quick chart to pick what we want to use with our current project.

this is also something that could pertain to a number of pre-made code scenarios as well. although I was going to say pre-rolled I thought it a bad word choice as in this case sometimes unrolled is better.

Edited by _The Doctor__


FAFMUL is a replacement for the OS BCD FP multiplier, and it's three times as fast. I'll wager it's the fastest BCD FP multiplier for the Atari. But it speeds up only multiplication (and, by extension, trig and log), not addition, subtraction, or division. It's an OS patch, as in Rybags' 2nd category above.


Here's a thought (random thinking in my head).

 

We have multiple upgrades for these machines - video, audio, RAM, etc.  What about an FP accelerator?  Could we hang something off the PBI bus that would take, say, two 16-bit numbers and a command, and return a 32-bit value in what would essentially be one or two 6502 cycles?  Coding for it would be as simple as calling the native OS routines, but the 6502 would get the values back after maybe a NOP or two?


while that's feasible (remember the mpp charger cart for some of the 3d games) at that point you might just allow it to be a math co processor since it's on the pbi anyway... and let the 6502 chug away on something until it gets a tap on the shoulder when the co processor finishes... although it'd be a good bet it would be done as you mentioned within a cycle or two, if that, with modern solutions...

Edited by _The Doctor__

58 minutes ago, _The Doctor__ said:

while that's feasible (remember the mpp charger cart for some of the 3d games) at that point you might just allow it to be a math co processor since it's on the pbi anyway... and let the 6502 chug away on something until it gets a tap on the shoulder when the co processor finishes... although it'd be a good bet it would be done as you mentioned within a cycle or two, if that, with modern solutions...

Yeah - that's what I am talking about.  Any math operation that the 6502 can't do in 4 cycles (add or subtract) - hand off to a co-processor that can do it in the same time.  Multiplication, division, matrix operations, trig, etc.  Could make for some awesome 3D vector games at a minimum.

3 hours ago, _The Doctor__ said:

while that's feasible (remember the mpp charger cart for some of the 3d games) at that point you might just allow it to be a math co processor since it's on the pbi anyway... and let the 6502 chug away on something until it gets a tap on the shoulder when the co processor finishes...

Heh, so... all these SIDE/Ultimate/MyIDE?/Uno/AVG carts have little processors on them that run circles around the "pokey" 6502, so it must be theoretically possible for some kind of API to send small workloads to the little CPU in the cart for crunching and have it deposit the results back into RAM via DMA or the cartridge area, then trigger an interrupt to notify of completion, or the application can check for results once in a while...

 

Or even new SIO2PC commands to send a workload off to a remote PC via serial and return the results.  "Computer - fetch me pi to 1024 decimal places" - PC computes it in 1 millisecond, returns result :) Although this option would have to wait for a response, or regularly query "are you done yet"

 

An accelerated ARC program would be cool for instance. At that point it would be easy to introduce other compression/decompression mechanisms like ZIP, 7Z, LZH, etc. that previously couldn't easily be performed within the limited working RAM space.

 

So many ideas...


I didn’t realize this was such an issue. I gather then that Newell’s Fastchip is the only replacement for the stock OS routines that fits. So for today the path would be to drop in the FP part of a Newell rom onto a stock rom image, and maybe overlay Claus’ fastmath on that. I guess I should look at the BXE code and see what they did, maybe they just incorporated Marslett’s code as well.


Looking at the T816 OS, I see Chuck implemented Marslett’s code and he made no changes to use any of the new instructions that I see in my quick check. It has the bug that 1050 noticed, SBC $09 rather than #9 in the hex digit conversion routines. At least I don’t have to type it in from a scan, I wasn’t looking forward to that.


FP accelerator would be a good idea.

My proposal would be something where you upload the data to it - successive stores to a single register or maybe have accessible workspace @ $D500 - then select the operation and trigger it with another register store.

Have a status register that has busy/complete bitflags.  So an enhanced Basic could sit in a loop waiting for the operation to finish, while an assembler program could go off and do its own thing and periodically poll the register.
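
To make the upload / trigger / poll idea above concrete, here is a rough Python model of such a device. Everything in it is an assumption for illustration only: the class, the command codes, the status bit values and the simulated latency are made up, and no real cartridge or PBI device is being described.

```python
# Hypothetical register model: names and bit values are illustrative only.
STATUS_BUSY = 0x80
STATUS_DONE = 0x40
CMD_ADD, CMD_MUL = 1, 2
OPS = {CMD_ADD: lambda a, b: a + b, CMD_MUL: lambda a, b: a * b}


class FPAccelerator:
    """Toy model of a math device with a data port, a command
    register and a status register."""

    def __init__(self, latency=3):
        self.fifo = []            # operands uploaded through the data port
        self.status = 0
        self.result = None
        self._latency = latency   # status reads needed before completion
        self._countdown = 0
        self._pending = None

    def write_data(self, value):
        # Successive stores to a single register queue up operands.
        self.fifo.append(value)

    def write_command(self, cmd):
        # Storing a command consumes two operands and raises BUSY.
        a, b = self.fifo
        self.fifo = []
        self._pending = OPS[cmd](a, b)
        self._countdown = self._latency
        self.status = STATUS_BUSY

    def read_status(self):
        # Each poll advances the simulated device by one step.
        if self.status & STATUS_BUSY:
            self._countdown -= 1
            if self._countdown <= 0:
                self.result = self._pending
                self.status = STATUS_DONE
        return self.status


def multiply_blocking(dev, a, b):
    """What an enhanced BASIC might do: spin until DONE is set."""
    dev.write_data(a)
    dev.write_data(b)
    dev.write_command(CMD_MUL)
    polls = 0
    while not dev.read_status() & STATUS_DONE:
        polls += 1
    return dev.result, polls
```

An assembler program would use the same three calls but do other work between status reads instead of spinning.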

 

Given that something like a PIC or AVR would barely raise a sweat doing Atari's BCD operations, may as well give it other features.

Like how about line drawing?  Give it start and end X/Y and it then goes off and produces an array of memory locations and mask values, then the 6502 just has to paste them to z-page and run instructions to do each plot operation.
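
A sketch of what the helper chip could hand back for the line drawing idea, in Python. It assumes a 1-bit-per-pixel screen with 40 bytes per line and the leftmost pixel in bit 7, and it uses plain Bresenham; the function names and the list-of-pairs output format are my own invention, not anything an existing cart provides.

```python
BYTES_PER_LINE = 40  # assumed: 320-pixel wide, 1 bit per pixel mode


def line_plot_list(x0, y0, x1, y1):
    """Return [(byte_offset, bit_mask), ...] for a line (Bresenham)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    plots = []
    while True:
        # Offset is relative to the start of screen memory.
        plots.append((y0 * BYTES_PER_LINE + (x0 >> 3), 0x80 >> (x0 & 7)))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
    return plots


def apply_plots(screen, plots):
    """The 6502 side of the job: OR each mask into screen memory."""
    for offset, mask in plots:
        screen[offset] |= mask
```

The point of the split is that all the error-term arithmetic happens off-CPU, and the 6502 is left with a tight load / OR / store loop over the list.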


Veronica should be able to perform math quite a bit quicker than stock at the cost of maybe a bit unwieldy programming.

1 hour ago, slx said:

Veronica should be able to perform math quite a bit quicker than stock at the cost of maybe a bit unwieldy programming.

Are there any coding examples for Veronica?  I have two of them.


Hi!

19 hours ago, Alfred said:

I didn’t realize this was such an issue. I gather then that Newell’s Fastchip is the only replacement for the stock OS routines that fits. So for today the path would be to drop in the FP part of a Newell rom onto a stock rom image, and maybe overlay Claus’ fastmath on that. I guess I should look at the BXE code and see what they did, maybe they just incorporated Marslett’s code as well.

There are at least two better ones, that fit in the same 2KB of the original:

- The Altirra math-pack. This is really fast and also more accurate than the original. In many tests it is the fastest FP replacement.

- The OS++ math-pack. It is about the same speed as the Altirra one, and is also more accurate than the original.

 

Both of the above are open source, and the Altirra one has a very permissive license.

 

Have Fun!


The UNO Cart also seems an easy and viable way to add a co-processor. Something like:

 

$d500 - A

$d501 - B

$d502 - A*B LSB

$d503 - A*B MSB

$d504 - A/B

$d505 - A%B

etc...
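
As a rough illustration of how the cart firmware might answer reads for a register layout like the one above, here is a small Python model. It is only a sketch under assumptions: 8-bit operands, a 16-bit product, and an arbitrary $FF returned for division by zero. The class name and that divide-by-zero choice are invented for the example.

```python
class MathRegs:
    """Toy model of the suggested layout (offsets are from $D500)."""

    def __init__(self):
        self.a = 0
        self.b = 0

    def write(self, offset, value):
        # Only the two operand registers are writable in this sketch.
        if offset == 0:
            self.a = value & 0xFF
        elif offset == 1:
            self.b = value & 0xFF

    def read(self, offset):
        a, b = self.a, self.b
        if offset == 0:
            return a
        if offset == 1:
            return b
        if offset == 2:
            return (a * b) & 0xFF          # product, low byte
        if offset == 3:
            return (a * b) >> 8            # product, high byte
        if offset == 4:
            return a // b if b else 0xFF   # assumed: $FF on divide by zero
        if offset == 5:
            return a % b if b else 0xFF    # assumed: $FF on divide by zero
        return 0xFF                        # unmapped register
```

From the 6502 side this would look like two stores followed by plain loads, with the results computed on the fly at read time.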

 

3 hours ago, ivop said:

The UNO Cart also seems an easy and viable way to add a co-processor. Something like:

 

$d500 - A

$d501 - B

$d502 - A*B LSB

$d503 - A*B MSB

$d504 - A/B

$d505 - A%B

etc...

 

If you want to replace the Os floating point functions, then you need a little bit more than a fast binary multiplication and division. You would need two floating point registers (6 bytes) and a floating point result register. Of course, the OsRom FP representation is not very appropriate...

 

*In principle*, you could connect a Motorola 68882 with suitable glue logic - you can run Motorola FPU as memory-mapped I/O chip. Actually, there have been math extension boards for the Amiga that worked like this, i.e. the CPU "manually" programmed the co-processor interface of the FPU, and collected results from there. It is not "quite as elegant" as an FPU connected to the dedicated coprocessor interface of the 68020 and 68030, but it should be faster than the mathpack, and certainly a lot more precise.

 

However, the 68882 uses the (much saner) binary IEEE floating point format, not the decimal mathpack format. It is not a suitable plug-in for the mathpack.
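
For anyone unfamiliar with the decimal mathpack format: each number is 6 bytes, an exponent byte (sign in bit 7, power of 100 in excess-64 in the low 7 bits) followed by five bytes holding ten BCD digits, with the radix point after the first mantissa byte. Below is a minimal Python sketch of decoding it. The function name is made up for illustration, and it does not check that the nibbles are valid BCD digits.

```python
def atari_fp_to_float(raw):
    """Decode a 6-byte Atari BCD floating point number to a Python float."""
    if len(raw) != 6:
        raise ValueError("expected exactly 6 bytes")
    exp_byte = raw[0]
    # Collect the ten BCD digits into one integer.
    mantissa = 0
    for byte in raw[1:]:
        mantissa = mantissa * 100 + (byte >> 4) * 10 + (byte & 0x0F)
    if mantissa == 0:
        return 0.0
    sign = -1 if exp_byte & 0x80 else 1
    exponent = (exp_byte & 0x7F) - 64      # power of 100
    # The radix point sits after the first of five base-100 digits,
    # so the integer mantissa is scaled by 100**(exponent - 4).
    scale = exponent - 4
    if scale >= 0:
        return float(sign * mantissa * 100 ** scale)
    return sign * mantissa / 100 ** -scale
```

For example, the bytes 40 01 00 00 00 00 decode to 1.0. Converting to a binary double like this is lossy for some values, which is exactly the mismatch with an IEEE-based FPU mentioned above.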

 

16 minutes ago, thorfdbg said:

If you want to replace the Os floating point functions, then you need a little bit more than a fast binary multiplication and division. You would need two floating point registers (6 bytes) and a floating point result register. Of course, the OsRom FP representation is not very appropriate...

True. What I suggested was more a reply to Stephen and the Doctor about fast 3D calculations. But the UNO Cart would be just as suitable to do fast BCD calculations. $d500...$d505 * $d506...$d50b = $d50c...$d511, etc....

 


but back to the op's topic, he wants FFP that the Atari will execute on board to replace what's currently flopping around... We want to have the best experience on expanded and not expanded machines.

 

drawing the whole thread together, speeding up both ways would be optimal.

Edited by _The Doctor__


I'll attest to the UNO Cart's ARM helping with floating point math :)

 

 

 

Edited by Wrathchild
Added Mandelbrot
23 hours ago, Stephen said:

Are there any coding examples for Veronica?  I have two of them.

Put both of them in an 800 for double the speedup 😊

 

I think there are some in the original Veronica thread.

 

The "host" Atari can swap two blocks of memory with the 65C816 and the 65C816 can set a semaphore flag to signal that it has e.g. finished a calculation. So the host computer would put some data to operate in the RAM, swap it to the 65C816 and poll the semaphore flag to check if it's finished (or be timed in a way to only swap the RAM back when the 65C816 must have finished), swap the RAM back and use the data, etc. 

 

Phaeron described it better than I can:

Quote
Are RAM banks superimposed on top of Atari RAM or is Atari RAM swapped with Veronica? (I suppose the former but am not sure.)

 

Superimposed.

 

Atari CPU has its own 48K+ RAM, Veronica CPU has its own 64K RAM, and the two share 64K of shareable memory. That 64K of shareable memory is split into two 32K halves, such that one is always visible to each. They can swap, but the Atari and Veronica never see the same 32K halves at the same time. Each 32K half is in turn split into 16K banks, and each side independently chooses which 16K it views from its 32K half.

 

From there, Veronica always has its 16K view mapped at either $4000-7FFF or $C000-FFFF. Atari has its 16K view split across the two 8K cartridge windows ($8000-9FFF and $A000-BFFF). Both of which can be toggled independently, but not positioned independently.

 

The main limitations to keep in mind: the 6502 and the 65C816 never see the same memory at the same time, the 65C816 always has its window enabled, only the 6502 can switch banks, and if you want to be compatible with V1 hardware you have to make sure the 65C816 is clear of its window when bank swap occurs.
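
The hand-off described above can be sketched as a small Python model. This is a simplification under stated assumptions: it models only the two 32K halves, the host-only swap and a single semaphore flag, it ignores the 16K bank selection and the V1 timing issue, and the "job" run by the co-processor is a made-up stand-in. None of the names correspond to real Veronica registers.

```python
HALF_SIZE = 32 * 1024


class SharedMemory:
    """Two 32K halves; each CPU always sees a different one."""

    def __init__(self):
        self.halves = [bytearray(HALF_SIZE), bytearray(HALF_SIZE)]
        self.atari_half = 0      # the co-processor sees the other half
        self.semaphore = False   # set by the co-processor when done

    def atari_view(self):
        return self.halves[self.atari_half]

    def coprocessor_view(self):
        return self.halves[1 - self.atari_half]

    def swap(self):
        # Only the host (6502) side is allowed to switch halves.
        self.atari_half = 1 - self.atari_half


def coprocessor_job(mem):
    """Stand-in for 65C816 code: sum bytes 0..3 into byte 4."""
    view = mem.coprocessor_view()
    view[4] = sum(view[0:4]) & 0xFF
    mem.semaphore = True


def host_offload(mem, data):
    mem.atari_view()[0:4] = bytes(data)   # 1. place the operands
    mem.semaphore = False                 # assumed: host clears the flag
    mem.swap()                            # 2. hand the half over
    coprocessor_job(mem)                  # runs in parallel on real hardware
    while not mem.semaphore:              # 3. poll for completion
        pass
    mem.swap()                            # 4. take the half back
    return mem.atari_view()[4]
```

In this model the host simply calls the job inline, so the polling loop falls through immediately; on real hardware the 6502 would be free to do other work between polls.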

 

It's obviously quite powerful but not that easy to write for. I always get a headache trying to wrap my head around it 😰


it swaps a 32k block at one go (upper or lower), and the Atari can work within that selected 32k block, 16k at a time... the Veronica can work at the same time on the 32k block it has selected, 16k at a time during the same period.

 

seems very logical.

Edited by _The Doctor__


Hi,

 

   With the Uno Cart, you could update the OS, e.g. the Soft OS (?) approach (XL/XE only) in the same way that ATR/SIO is handled. That would potentially speed up anything that used the standard FP package routines.

 

  I'm not sure how you can get the ARM chip to do FP calculations for you though, as at the moment it is mostly tied up in a polling loop looking for memory address handling. Probably @Wrathchild has a better handle on this than me though.


It can't easily watch the bus while doing other stuff (although interrupts or a multitasking kernel could overcome this), so the best thing to do is exactly what the existing UNO firmware does: use a messaging system and poll the cart for a 'magic' ready signal which won't appear on a pulled up or floating bus.
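
A rough Python sketch of the polling side of that idea. The signature bytes, the base address and the function names here are all hypothetical and are not what the actual UNO firmware uses; the only point is that a multi-byte pattern with mixed bits cannot be produced by a bus that is pulled up to $FF, and is unlikely to show up by accident on a floating one.

```python
READY_MAGIC = bytes([0xA5, 0x5A, 0xC3])  # hypothetical signature


def is_ready(read_byte, base=0xD500):
    """True if the cart currently presents the full signature."""
    return all(read_byte(base + i) == m for i, m in enumerate(READY_MAGIC))


def wait_ready(read_byte, max_polls=1000):
    """Poll until the signature appears; return how many polls failed first."""
    for polls in range(max_polls):
        if is_ready(read_byte):
            return polls
    raise TimeoutError("cart never signalled ready")
```

Here `read_byte` stands for whatever reads a byte from the cartridge control area; in a test it can be a plain function, such as one that always returns $FF to mimic a pulled-up bus.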

8 hours ago, slx said:

Put both of them in an 800 for double the speedup 😊

 

I think there are some in the original Veronica thread.

 

The "host" Atari can swap two blocks of memory with the 65C816 and the 65C816 can set a semaphore flag to signal that it has e.g. finished a calculation. So the host computer would put some data to operate in the RAM, swap it to the 65C816 and poll the semaphore flag to check if it's finished (or be timed in a way to only swap the RAM back when the 65C816 must have finished), swap the RAM back and use the data, etc. 

 

Phaeron described it better than I can:

 

It's obviously quite powerful but not that easy to write for. I always get a headache trying to wrap my head around it 😰

Easiest way to think of it is like a dumbwaiter or the little sliding tray at the teller in U.S. banks. First person puts an item in the tray and sends it across to the second person, who does something with it and sends it back. Only one side has access to it at a time, and in the meantime the other side is waiting. The more difficult scenarios are when you want to send data in both directions at the same time (double buffering) and make actual use of both CPUs in parallel. That's where the single semaphore bit, lack of '816 interrupts, and V1 timing race bug become troublesome.

 
