Altirra and multiple cores


morelenmir


I have a purely academic question for Phaeron himself or, in a wider application, one of the other hardware gurus.

 

Phaeron states in the help file to Altirra that multiple cores/hyper-threading would not help the emulator's performance. I think I understand his explanation: on the original hardware, on each clock tick MANY simultaneous actions were occurring across the various discrete chips on the motherboard - the 6502, POKEY and so on. Therefore, in order to emulate this activity, each tick of the emulated hardware is actually made up of many 'underlying' ticks of the host PC. Because the PC runs so many orders of magnitude faster, it can get each apparently parallel action - parallel from the emulator's point of view - done serially before the time comes to update Altirra with the results. To the emulator, and to us users - at least when running in real-time mode - it SEEMS that multiple things have happened in parallel at the proper speed. However, they haven't really.
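To make sure I have the picture right, here is how I imagine it in code - purely my own illustrative C++ sketch with invented names, certainly not Altirra's actual source:

    #include <cstdint>

    // Hypothetical chip objects - invented names, not Altirra's real classes.
    struct Cpu6502 { void Tick() { /* execute (or stall for) one machine cycle */ } };
    struct Antic   { void Tick() { /* display list / playfield DMA for this cycle */ } };
    struct Gtia    { void Tick() { /* pixel output for this cycle */ } };
    struct Pokey   { void Tick() { /* audio, serial and timer work */ } };

    int main() {
        Cpu6502 cpu; Antic antic; Gtia gtia; Pokey pokey;

        // One second of emulated time: on every machine cycle the chips are
        // "ticked" strictly one after another on a single host core. The host
        // is so much faster that all of this finishes long before the next
        // real-time cycle is due, so to the emulated machine it looks as if
        // everything happened at once.
        for (uint64_t cycle = 0; cycle < 1789773; ++cycle) {
            antic.Tick();
            cpu.Tick();
            gtia.Tick();
            pokey.Tick();
        }
    }

If that is roughly right, then one host core is doing all four chips' work in sequence and simply relying on being thousands of times faster than the original machine.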

 

Now, okay. I mostly get that.

 

What I do not understand is why each of those separate pieces of code which 'portray' the behaviour of the discrete chips - POKEY and so on - could not be assigned to its own core. After all, on modern processors you effectively have 16 CPUs or more all working in parallel. Could you not assign one to each bespoke Atari chip and then use another one to serve their input and receive their output - acting as a supervisor to keep everything in synchronization? Would this not massively improve performance?

 

Obviously it does not matter especially - even on fairly archaic Core 2 Quad hardware like mine, Altirra runs perfectly in real time. But I don't understand the underlying reason why parallel processing would not improve turbo performance for those speed demons out there.


It comes down to a case of dependency - at each cycle the program can change a single hardware register that can affect the behaviour of most of the machine.

And a case of accuracy - there's no point having the emulated ANTIC+GTIA render an entire scanline's worth of pixels if the CPU goes and changes stuff midway - it effectively makes most of that work pointless because of the changed state.
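To put a rough sketch to that (invented names, nothing to do with Altirra's actual code), the usual emulator trick is "catch-up" rendering - the video chip only renders up to the current cycle when something that affects it gets written:

    #include <cstdint>

    struct VideoChip {
        uint32_t lastRenderedCycle = 0;
        uint8_t  regs[32] = {};

        // Render pixels from lastRenderedCycle up to 'now' using the
        // *current* register state, then remember where we stopped.
        void CatchUp(uint32_t now) {
            // ... emit pixels for [lastRenderedCycle, now) ...
            lastRenderedCycle = now;
        }

        // Any CPU write first catches the renderer up to the exact cycle of
        // the write, so pixels before the write use the old state and pixels
        // after it use the new state.
        void WriteRegister(uint32_t now, uint8_t reg, uint8_t value) {
            CatchUp(now);
            regs[reg & 31] = value;
        }
    };

    int main() {
        VideoChip gtia;
        gtia.WriteRegister(/*cycle*/ 57, /*reg*/ 0x16, /*value*/ 0x34);
        gtia.CatchUp(114);   // end of scanline: flush the remainder
    }

Nothing gets rendered ahead of time only to be thrown away, but it also means the renderer is fundamentally chained to whatever the CPU does next.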

 

There's stuff that could get farmed off to other cores - I think Avery explained some of it elsewhere - but the overall benefit of farming that work out isn't particularly great.

 

In theory you could have 4 threads, e.g. doing CPU, POKEY, ANTIC and GTIA - but the thing is that each thread would have to wait on the others, never getting a cycle ahead of what's going on with the rest. That means inter-thread communication is required, not to mention the fact that each thread needs to be active, which in a multitasking Windows environment is in no way guaranteed. Alternatively, of course, a thread that's completed its task can go into a wait and allow the other thread/s to complete their little unit of work on the same core (assuming the Windows scheduler executes it).
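To make that concrete, a lockstep scheme would look something like this sketch (again nothing from Altirra's code) - every chip thread hits a barrier on every emulated cycle:

    #include <barrier>
    #include <cstdint>
    #include <thread>
    #include <vector>

    int main() {
        constexpr int kChips = 4;               // CPU, POKEY, ANTIC, GTIA
        constexpr uint64_t kCycles = 29868;     // ~one NTSC frame (262 x 114 cycles)
        std::barrier<> sync(kChips);

        auto chipThread = [&](int /*chipId*/) {
            for (uint64_t c = 0; c < kCycles; ++c) {
                // ...a few nanoseconds of actual per-cycle chip work here...
                sync.arrive_and_wait();         // everyone waits, every cycle
            }
        };

        std::vector<std::jthread> threads;
        for (int i = 0; i < kChips; ++i)
            threads.emplace_back(chipThread, i);
    }

Each of those barrier waits costs far more host cycles than the tiny bit of chip work it is guarding, and that's before the Windows scheduler decides to park one of the threads for a while.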

At the end of the day, it's a bunch of extra complication in the overall system, possibly for not too much gain.


Ahh... It sounds like a good part of the problem is down to Windows. Effectively, you are multitasking the component programmes - each made to act exactly like a piece of hardware inside the emulator - in an environment which is itself already multitasking many other things. Is this why purely emulated environments like the MS 'Hyper-V' thing have a single programme to oversee the others and strictly determine their core access, so that the operating system itself is not the final arbiter of which process runs on which threads?

 

I guess in that sort of environment Altirra would itself need to be a full OS!

 

This bears on something I have been thinking about - using an FPGA. On the same topic, would it not be possible to take an FPGA unit and set it up so that it perfectly emulates the behaviour of the Atari motherboard and its associated ICs? That way you could genuinely get an A8 on a chip - in a far superior fashion to the Raspberry Pi or Arduino or whatever.

 

And as an aside - Rybags, you wouldn't happen to be 'That Crazy Aussie Bloke' Dave from the EEVBlog would you?


No - most of my online IDs are the same and I don't bother with blogs & stuff.

 

I'd not really point the finger at Windows - the problems of multiprocessing such emulators are similar regardless of host CPU or OS.

 

It's just down to the dependency - you can't predict what the Atari is going to do without actually doing it.

If you were to sacrifice accuracy then multi-threading might be easier - e.g. have a separate POKEY thread that only bothers with accuracy down to 2 scanlines.

But we've already got Atari800Win+, which in most cases can run turbo mode much quicker than Altirra, but with somewhat less emulation accuracy.

 

A valid multithreading candidate would be drive emulation, if real drive emulation were to be implemented - e.g. running like a real Happy board system. On the real machine the drive is only a loosely tethered and somewhat autonomous device.

But in most emulators the drive is emulated at a very high level, and there's next to no benefit in multithreading that because it's pretty quick anyway.
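If low-level drive emulation ever did happen, the shape of it might be something like this sketch (names invented, purely to illustrate the loose coupling): the drive gets its own thread and the main emulation just posts SIO commands to it.

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>
    #include <queue>
    #include <thread>

    struct SioCommand { uint8_t device, command, aux1, aux2; };

    class EmulatedDrive {
        std::mutex m;
        std::condition_variable_any cv;
        std::queue<SioCommand> pending;
        std::jthread worker;

        void Run(std::stop_token st) {
            std::unique_lock lock(m);
            while (cv.wait(lock, st, [&] { return !pending.empty(); })) {
                SioCommand cmd = pending.front();
                pending.pop();
                lock.unlock();
                // ... run the drive's own CPU/firmware until the command
                //     completes, honouring realistic timing ...
                (void)cmd;
                lock.lock();
            }
        }

    public:
        EmulatedDrive() : worker([this](std::stop_token st) { Run(st); }) {}

        // Called from the main emulation thread when the OS does SIO I/O.
        void PostCommand(const SioCommand& cmd) {
            { std::lock_guard lock(m); pending.push(cmd); }
            cv.notify_one();
        }
    };

    int main() {
        EmulatedDrive d1;
        d1.PostCommand({0x31, 0x52, 0x01, 0x00});   // D1: READ sector 1
    }

Because SIO is so slow relative to either side, the two threads would rarely need to wait on each other - which is exactly the "loosely tethered" property that makes it a candidate.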


You should check him out on YouTube - he does a fair bit on programming as well as pure hardware, although he does tend to focus on the bare metal as it were. I'm a big fan of Dave's videos and have learned a lot of practical electrical engineering from him. That is where I got the idea for an Atari 8-bit FPGA. Sadly I just have too many interests and cannot find the time - or money! - to seriously get into electronics and 'soldering', as my dad called it, right now!

 

You raise a very interesting idea though, Rybags - farming out not the devices on the A8 motherboard, then, but the peripherals themselves! I am a VERY big proponent of Phaeron at some point doing a proper emulation of the various floppy drives. I absolutely yearn for a full Happy drive emulation. But that is obviously more 'if' than 'when', and likely 'never'. Still. It makes absolute sense to run those emulated, remote peripherals in their own threads. I would think the comms interface board would be absolutely plum for multitasking? Also maybe the much-discussed plotter device, given how time consuming that could be? I think the amazing (but sadly pay-ware...) CCS64 Commodore emulator - which actually does offer total, programmable floppy drive emulation - does it that way?


foft has indeed already created a full A8 implementation using an FPGA. See here.


There is actually only one big bottleneck within Altirra: the synchronization between ANTIC and the CPU. Everything else is down in <5% peanuts land, including GTIA, POKEY, host sound/audio/input, etc. That's because GTIA and POKEY mostly just receive data, the only major feedback loops being P/M collisions and interrupts. Everything else is heavily batched and event-based and executes rather quickly in comparison.
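As a generic illustration of what "event based" means here (just a sketch, not Altirra's actual scheduler): rather than ticking POKEY on every cycle, you work out when its next interesting moment is and jump straight to it.

    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <vector>

    struct Event {
        uint64_t cycle;                    // when it fires, in machine cycles
        std::function<void()> fire;
        bool operator>(const Event& o) const { return cycle > o.cycle; }
    };

    int main() {
        std::priority_queue<Event, std::vector<Event>, std::greater<>> events;
        uint64_t now = 0;

        // POKEY timer 1 underflows 228 cycles from now - schedule that once
        // instead of checking for it on every single cycle.
        events.push({now + 228, [] { /* raise IRQ, reload timer */ }});

        while (!events.empty()) {
            now = events.top().cycle;      // batch: skip straight to the event
            events.top().fire();
            events.pop();
        }
    }

ANTIC and the CPU are the exception precisely because they can interact on every single cycle, which is why that pair can't be batched the same way.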

 

The issue with ANTIC<->CPU is that the two swap the memory bus between each other very fast (1.79MHz), and fully emulating this is a lot slower than typical per-instruction CPU emulation. However, it also means that Altirra can correctly emulate some unusual behaviors that many Atari 8-bit computer emulators can't. It's the interleaving that slows everything down - if you run Altirra with a 4x CPU (7.16MHz), for instance, it doesn't run at quarter speed compared to 1x.

 

Another thing to keep in mind is that modern CPUs are fast partly because they are optimized to work on large batches where pipelining and caches can be effective. Force them to lock-step to each other and they run much slower. The 1000-2000x clock speed advantage of a Core i7 over a 6502 erodes very rapidly when you interrupt it all the time or force it to constantly swap data over the bus. Try disabling the L1 cache on a modern CPU sometime. If you're lucky, it might finish booting before it's time to eat dinner.



I have actually done that once by accident! It was quite recently and I could not for the life of me work out why an - admittedly ancient! - Athlon 1800XP was absolutely CRAWLING to boot Windows! I had just put something together to act as an SQL server for my network - scalability not being an issue when you only have 2 users!

 

If, basically, modern CPUs get their performance from 'tricks' like predicting and reusing already-processed code, then how much ACTUAL power does, say, a Core 2 like my ageing processor have over the 6502? 5x, 10x? Obviously sheer clock speed does not determine performance. In fact I think I have read they STILL make the 6502 for embedded uses.


In terms of speed, the Core 2 is obviously faster by orders of magnitude. While there are pathological ways to slow it down drastically, 99.999% of the time caches, pipelining, and superscalar execution are very effective. Even if you take out the clock speed advantage, it's trivial for a Core 2 to utterly spank a 6502 in getting out results per clock cycle, because it can do several 128-bit ops per clock while the 6502 struggles to achieve even one byte every two clocks.

 

That's assuming that speed is always the goal, which it isn't. Cost, customizability, power consumption, integration, reliability, code size, and ease of coding are also concerns in embedded applications. We've gotten to the point where it isn't unusual to see a little 8-bit microcontroller embedded in a chipset whose job is to set everything up to boot the big CPU. The 6502 isn't necessarily great for embedded, though -- even Atari used 8048 derivatives for the XF551 and XEP80.


Memory access is still the big bottleneck on modern systems.

 

Even though we have dual and triple channel RAM which might spit out 128 or 192 bits in a single read, the "ideal world" situation rarely occurs. A benchmark might see a dual-channel 1333 MHz DDR3 system accessing at or near its advertised throughput rate, but the reality is that there's usually more "random" access, which involves setting up new row addresses and is somewhat slower than burst mode, where the row stays the same and the column simply increments.
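You can see the effect with a quick test along these lines (rough sketch, numbers vary a lot from machine to machine) - a sequential walk stays in burst mode and keeps the prefetcher happy, while a big stride keeps forcing new row activations and cache misses:

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t n = 1u << 25;                 // 32M words = 128 MB
        std::vector<uint32_t> buf(n, 1);

        auto run = [&](const char* label, auto&& step) {
            auto t0 = std::chrono::steady_clock::now();
            uint64_t sum = 0;
            size_t i = 0;
            for (size_t k = 0; k < n; ++k) { sum += buf[i]; i = step(i); }
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - t0).count();
            std::printf("%s: sum=%llu, %lld ms\n", label,
                        (unsigned long long)sum, (long long)ms);
        };

        run("sequential", [&](size_t i) { return (i + 1) % n; });
        run("strided   ", [&](size_t i) { return (i + 1048573) % n; });   // ~4 MB jumps
    }

The strided pass typically comes out several times slower even though it reads exactly the same amount of data.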

 

And the cache doesn't even necessarily run at the core clock speed - L1 usually does, but it's not very big. L2 often runs at half the clock speed, and L3, when present, is slower again.

