
Any way to accurately simulate the A8 with the 6502 running at 2x speed of the rest of the system?


Xebec


Every cycle - some are speculative, some are dummy and just plain not needed.

Likely a result of the CPU design - they could probably have eliminated it but then the component count goes up.

 

As stated - there's not much benefit to be had if it's accessing the stock RAM.  The system architecture means it's pretty much stuck at 1.79 MHz and in any case the fastest DRAMs used late in the XE life were probably rated to 80 ns which is only 3 times the stock requirement.


7 hours ago, Xebec said:

Please educate me here -- is this the case even when the 6502 is executing instructions that take several cycles to complete?  

 

Like if an instruction takes 6 clock cycles to execute, it needs the memory bus for all 6 of those cycles?  

Yes, as stated previously.

 

For example, ASL absolute needs 6 cycles: a) 1 to fetch opcode, b) 2 to fetch argument, c) 1 to fetch operand, d) 1 to do the shifting, e) 1 to write operand back. Out of that, stage "d" is the internal operation cycle. Assuming you are able to identify which cycle is internal, cut the CPU off the bus for that time and switch the clock to 3.5 MHz, you at best make the instruction occupy 5.5 instead of 6 cycles, i.e. you gain 0.5/6 = 1/12, that is, about 8%.

 

Another is RTS: a) 1 to fetch opcode, b) 2 dummy cycles, c) 2 cycles to pull address from the stack, d) 1 to increment it. Out of that stages "b" and "d" can be shortened so that instead of 6 cycles the instruction takes 4.5. The gain here is at most 3/12, i.e. 25%.

 

CLC: a) 1 opcode fetch, b) 1 dummy. By clocking the dummy cycle at 3.5 MHz the instruction takes 1.5 instead of 2 cycles, i.e. the gain is 25%.

 

JMP: a) 1 opcode fetch, b) 2 argument fetch. No dummy cycles. Gain: 0.

 

LDA abs: a) 1 opcode fetch, b) 2 argument fetch, c) 1 operand fetch. No dummy cycles. Gain: 0.

 

And so on.
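
The arithmetic in the examples above can be reproduced with a few lines of Python. This is only a sketch: the cycle counts are the ones listed in this post, and it assumes (as the post does) that only the internal or dummy cycles can be run at double speed. The names `INSTRUCTIONS` and `shortened` are made up for illustration.

```python
# (total cycles, internal/dummy cycles), as listed in the post above
INSTRUCTIONS = {
    "ASL abs": (6, 1),
    "RTS": (6, 3),
    "CLC": (2, 1),
    "JMP abs": (3, 0),
    "LDA abs": (4, 0),
}

def shortened(total, internal, speedup=2.0):
    """Return (new_cycles, fractional_gain) when only the internal
    cycles run `speedup` times faster than the bus cycles."""
    new_cycles = (total - internal) + internal / speedup
    return new_cycles, (total - new_cycles) / total

for name, (total, internal) in INSTRUCTIONS.items():
    new_cycles, gain = shortened(total, internal)
    print(f"{name:8s} {total} -> {new_cycles:.1f} cycles, gain {gain:.1%}")
```

Instructions without internal cycles (JMP, LDA abs) come out with a gain of zero, which is the point being made: the overall speedup depends entirely on the instruction mix.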

 

7 hours ago, Xebec said:

And just to make sure I'm understanding, a 65C816 at double speed might be a better bet for a simple no-cache speedup?  

Yes, because (again, as stated) it has signals which allow the external circuitry to identify the spare access cycles and act accordingly. But with just doubling the clock the overall speedup will be hardly worth the effort.

 

Edited by drac030

Yeah, as drac030 says, 3.5MHz is not much of an improvement.

 

But I think that with one or two small boards that can detach both ANTIC and the CPU from the bus, not depending on /HALT, you could at least recover most of the /REF bus accesses after installing SRAM (as long as the CPU is not reading or writing with /D400 asserted). That's 9 extra cycles per scan line, which is a (114+9)/114 = 1.078947 improvement, almost 8%.

 

As for not knowing what the CPU is doing internally, you could sample the data bus when /SYNC is asserted. Use those 8 bits (the opcode) as the address into a 256-byte EEPROM, and latch its data output into a shift register. Now you have 8 bits per instruction to flag whether the bus is needed or not in each cycle. Shift the register on each clock. No documented 6502 instruction takes 8 or more cycles.

 

This way, for example, CLC could run its dummy cycle disconnected from the bus while ANTIC does its DMA read on the bus.
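
A rough software model of that lookup-and-shift idea, in Python. It is a sketch under assumptions, not a tested design: the table name `BUS_FREE_MASK`, the class `BusArbiter`, and the bit convention (bit 0 = first cycle after the opcode fetch, 1 = bus not needed) are all invented here, and only the CLC entry ($18) is filled in.

```python
# Hypothetical EEPROM contents: one mask byte per opcode.
# Bit n set means "cycle n after the opcode fetch does not need the bus".
BUS_FREE_MASK = [0x00] * 256
BUS_FREE_MASK[0x18] = 0b00000001   # CLC: the one cycle after the fetch is a dummy

class BusArbiter:
    """Models the EEPROM plus shift register watching SYNC and the data bus."""

    def __init__(self):
        self.shift = 0

    def clock(self, sync, data_bus):
        """Advance one CPU cycle. Returns True if the CPU could be
        disconnected from the bus during this cycle."""
        if sync:
            # Opcode fetch: the CPU needs the bus, and the mask for the
            # following cycles is latched from the lookup table.
            self.shift = BUS_FREE_MASK[data_bus]
            return False
        free = bool(self.shift & 1)
        self.shift >>= 1
        return free

arb = BusArbiter()
print(arb.clock(True, 0x18))    # CLC opcode fetch: bus needed
print(arb.clock(False, 0x00))   # CLC dummy cycle: bus free for ANTIC
```

Opcodes left at mask 0x00 simply never release the bus, so an incomplete table is safe, just less effective.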

 

Edit: note that the 8% calculation is with ANTIC DMA off. On a normal screen, where the 114 term in the equation is lower due to screen DMA, the relative improvement goes up.
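
To see how the ratio moves with screen DMA, here is a small Python sketch of the same estimate. It follows the 1.078947 figure above, i.e. (cycles + 9) / cycles, and the 40 DMA cycles per line in the second call are purely a made-up example value, not a measurement of any particular graphics mode.

```python
CYCLES_PER_LINE = 114   # machine cycles per scan line
REFRESH_CYCLES = 9      # /REF cycles per scan line that could be recovered

def relative_improvement(dma_cycles=0):
    """Estimate from the post: recovered refresh cycles relative to the
    cycles left on a line after `dma_cycles` are taken by screen DMA."""
    base = CYCLES_PER_LINE - dma_cycles
    return (base + REFRESH_CYCLES) / base

print(round(relative_improvement(), 6))      # DMA off
print(round(relative_improvement(40), 6))    # hypothetical 40 DMA cycles per line
```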

Edited by ivop

2 hours ago, ivop said:

Use those 8 bits (the opcode) as the address into a 256-byte EEPROM, and latch its data output into a shift register. Now you have 8 bits per instruction to flag whether the bus is needed or not in each cycle

... and be aware that one and the same opcode may be executed in a different number of cycles depending on the internal CPU state. E.g. LDA abs,X is "normally" 4 cycles, but 5 when a page boundary gets crossed. In that latter case the spare cycle could not be detected, and so it could not be shortened. This would exclude quite a number of frequently used instructions in frequently used (i.e. indexed) addressing modes, and that in turn makes the whole thing even less worthwhile.


1 hour ago, drac030 said:

... and be aware that one and the same opcode may be executed in a different number of cycles depending on the internal CPU state. E.g. LDA abs,X is "normally" 4 cycles, but 5 when a page boundary gets crossed. In that latter case the spare cycle could not be detected, and so it could not be shortened. This would exclude quite a number of frequently used instructions in frequently used (i.e. indexed) addressing modes, and that in turn makes the whole thing even less worthwhile.

Agreed. All of the opcodes that have a •, ∗, or † have to be encoded in the EEPROM so that they work for both cases (read: the worst case, with fewer "new" cycles):

 

http://ivop.free.fr/atari/opcodes.html

 

So indeed, that would be less worthwhile, as you say, for quite some instructions.
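
A minimal Python sketch of what "encoded for both cases" could look like. The helper `worst_case_mask` and the per-cycle lists are illustrative assumptions: each list has one entry per cycle after the opcode fetch, with True meaning the bus is not needed. A cycle is only flagged as free if it is free in every timing variant of the opcode.

```python
def worst_case_mask(*variants):
    """Combine the timing variants of one opcode into a single mask byte.
    Bit n is set only if cycle n is bus-free in every variant. Cycles past
    the end of a shorter variant count as 'needed', because by then the
    CPU is already fetching the next opcode."""
    length = max(len(v) for v in variants)
    mask = 0
    for i in range(length):
        if all(i < len(v) and v[i] for v in variants):
            mask |= 1 << i
    return mask

# LDA abs,X, cycles after the opcode fetch:
no_cross = [False, False, False]        # address low, address high, operand read
cross = [False, False, True, False]     # low, high, dummy read, operand read

print(worst_case_mask(no_cross, cross))
```

For LDA abs,X the combined mask is 0: the extra cycle of the page-crossing case lines up with the real operand read of the normal case, so it cannot be released.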

 

Perhaps @foft could simulate this in his FPGA core to see what benefit it would have IRL?

 

1. Run CPU cycles when it doesn't need the bus, and let ANTIC use the bus at that cycle if it needs it. No SRAM needed for that.

2. Assume SRAM and ignore refresh cycles. Disconnect ANTIC from the bus, let the CPU do its thing, UNLESS the CPU reads from, or writes to ANTIC.

 

Edited by ivop

Wow, thanks to all for the education here.  I've learned that the 6502, to keep costs down (reduced pin count, fewer transistors), will access memory every cycle even if there is no need.  However, even if that were not a problem, most instructions are accessing memory constantly anyway, which reduces the opportunity to speed things up.

 

65C816 improves on the situation a little bit, but due to the still limited 8-bit external bus doesn't provide much boost in performance unless you have a cache with higher clocks.

 

...

 

I assume I'll get laughed at for this question but just curious.  Is there any (reasonably easy) way to decouple the frequency of GTIA from the rest of the system to increase the frequency of those other parts to something like 2 MHz while still having GTIA output at the right frequency for NTSC/PAL?

 

I'm personally curious what kind of clock margin ANTIC, POKEY, and other chips have in the Atari.

1 hour ago, Xebec said:

I'm personally curious what kind of clock margin ANTIC, POKEY, and other chips have in the Atari.. 

As for POKEY, I have had this idea for years to disconnect pin 7 (Phi2), and connect it to a different clock, in the range of the Atari arcades, i.e. 1.2-1.5 MHz. I think Phi2 does not play a role in reads from, and writes to, Pokey. These will still be at 1.77/1.79MHz, and writes are done at the edge of R/W.

 

Edit: with this clock change, you can obviously not boot over SIO (the clock is wrong), but you can with a cartridge (SIDE, MyIDE, etc...).

 

Edited by ivop

1 hour ago, ivop said:

I think Phi2 does not play a role in reads from, and writes to, Pokey. These will still be at 1.77/1.79MHz, and writes are done at the edge of R/W.

The 6502 timing diagrams I've seen show otherwise. Data setup and hold are referred to Phi2, not R/W.

Edited by ClausB

Your basic fast clock solution:

 

Replace the DRAM with faster SRAM.

 

'Watch' the Phi2 clock. When it goes low, the Atari decodes device selects and the 816 puts A16-A23 on the data bus. If the device select is now SRAM (system RAM), then raise Phi2 after 70 ns and lower it again after another 70 ns (a 7.16 MHz clock), starting all over again. If the target is not system RAM, then raise Phi2 with the system clock at 280 ns (1.79 MHz) and drop Phi2 at 560 ns.

 

It does not matter what kind of instruction you are executing. You don't have to latch anything unless you are implementing linear memory (>64K).

 

I can't follow what you are doing... makes my head hurt.

 

If you are accessing SRAM with the CPU, bump the clock to 7.16 MHz. If not SRAM, leave the clock at 1.79 MHz. It's just that simple.
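
As a rough illustration of that rule, here is a Python sketch that picks the cycle length per access. The address decode in `is_system_ram` is a hypothetical stand-in (everything below $C000 treated as SRAM), not the real XL/XE memory map, and the model ignores any resynchronisation to the system clock when switching speeds.

```python
FAST_HALF_NS = 70    # Phi2 low 70 ns + high 70 ns, roughly 7.16 MHz
SLOW_HALF_NS = 280   # Phi2 low 280 ns + high 280 ns, roughly 1.79 MHz

def is_system_ram(addr):
    # Hypothetical decode for illustration only: treat everything
    # below $C000 as fast SRAM, the rest as ROM or hardware registers.
    return addr < 0xC000

def cycle_time_ns(addr):
    """Length of one CPU cycle for an access to `addr`."""
    half = FAST_HALF_NS if is_system_ram(addr) else SLOW_HALF_NS
    return 2 * half

def run_time_ns(addresses):
    """Total time for a sequence of bus accesses."""
    return sum(cycle_time_ns(a) for a in addresses)

# Two RAM accesses followed by one access to a hardware register at $D40A
print(run_time_ns([0x0600, 0x0601, 0xD40A]))
```

The decision is made per bus cycle from the address alone, which is why the kind of instruction being executed does not matter.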

 

Bob


12 hours ago, ivop said:

I think Phi2 does not play a role in reads from, and writes to, Pokey.

It does, of course, both in reads and in writes. See Pokey internal schematics page 1; follow signals PreS01 and Phi2B.

 

But even without looking at Pokey's internal design, the 6502 bus interface is synchronous. Address bus must be latched at one phase of the clock and the data bus at the other phase.


11 hours ago, bob1200xl said:

I can't follow what you are doing... makes my head hurt.

Not sure if you meant me, but my musings were about staying at stock clock speed while reclaiming /REF cycles and/or running dummy CPU cycles disconnected from the bus, to let ANTIC use it. Sorry, things are a bit mixed up in this thread. Software simulation was the OP's question?

 

2 hours ago, ijor said:

It does, of course, both in reads and in writes. See Pokey internal schematics page 1; follow signals PreS01 and Phi2B.

I was hoping it did not, but alas. :/ Another idea in the bin :)

