Rybags Posted August 27, 2021

Every cycle - some are speculative, some are dummy and just plain not needed. Likely a result of the CPU design - they could probably have eliminated it, but then the component count goes up.

As stated - there's not much benefit to be had if it's accessing the stock RAM. The system architecture means it's pretty much stuck at 1.79 MHz, and in any case the fastest DRAMs used late in the XE life were probably rated at 80 ns, which is only 3 times as fast as the stock requirement.
drac030 Posted August 27, 2021 (edited)

7 hours ago, Xebec said:
Please educate me here -- is this the case even when the 6502 is executing instructions that take several cycles to complete? Like if an instruction takes 6 clock cycles to execute, it needs the memory bus for all 6 of those cycles?

Yes, as stated previously. For example, ASL absolute needs 6 cycles: a) 1 to fetch opcode, b) 2 to fetch argument, c) 1 to fetch operand, d) 1 to do the shifting, e) 1 to write operand back. Out of these, stage "d" is the internal operation cycle. Assuming you are able to identify which cycle is internal, cut the CPU off the bus for that time and switch the clock to 3.5 MHz, you at best make the instruction occupy 5.5 instead of 6 cycles, i.e. you are gaining 1/12, that is, about 8%.

Another is RTS: a) 1 to fetch opcode, b) 2 dummy cycles, c) 2 cycles to pull address from the stack, d) 1 to increment it. Out of these, stages "b" and "d" can be shortened so that instead of 6 cycles the instruction takes 4.5. The gain here is at most 3/12, i.e. 25%.

CLC: a) 1 opcode fetch, b) 1 dummy. By clocking the dummy cycle at 3.5 MHz the instruction takes 1.5 instead of 2 cycles, i.e. a gain of 25%.

JMP: a) 1 opcode fetch, b) 2 argument fetch. No dummy cycles. Gain: 0.

LDA abs: a) 1 opcode fetch, b) 2 argument fetch, c) 1 operand fetch. No dummy cycles. Gain: 0.

And so on.

7 hours ago, Xebec said:
And just to make sure I'm understanding, a 65C816 at double speed might be a better bet for a simple no-cache speedup?

Yes, because (again, as stated) it has signals which allow the external circuitry to identify the spare access cycles and act accordingly. But with just doubling the clock the overall speedup will be hardly worth the effort.
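The per-instruction arithmetic above can be checked with a short script. This is only a sketch: the cycle breakdowns are the ones quoted in the post, while the names `CYCLES`, `shortened` and `gain` are made up for illustration, and it assumes that only the internal/dummy cycles run at double clock.

```python
from fractions import Fraction

# (total cycles, internal/dummy cycles) as listed in the post above.
CYCLES = {
    "ASL abs": (6, 1),
    "RTS": (6, 3),
    "CLC": (2, 1),
    "JMP abs": (3, 0),
    "LDA abs": (4, 0),
}

def shortened(total, internal, speedup=2):
    """Instruction length in stock-clock cycles when only the
    internal cycles run `speedup` times faster."""
    return (total - internal) + Fraction(internal, speedup)

def gain(total, internal, speedup=2):
    """Fraction of the original execution time that is saved."""
    return 1 - shortened(total, internal, speedup) / total

for name, (total, internal) in CYCLES.items():
    new_len = float(shortened(total, internal))
    saved = float(gain(total, internal))
    print(f"{name:8} {total} -> {new_len:.1f} cycles, gain {saved:.1%}")
```

The figures match the post: 1/12 for ASL abs, 1/4 for RTS and CLC, and nothing for JMP and LDA abs.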
ivop Posted August 27, 2021 (edited)

Yeah, as drac030 says, 3.5 MHz is not much of an improvement. But I think that with one or two small boards that can detach both ANTIC and the CPU from the bus, not depending on /HALT, you could at least recover most of the /REF bus accesses after installing SRAM (if the CPU is not reading or writing with /D400 asserted). That's 9 extra cycles per scan line, which is a (114+9)/114 = 1.078947 improvement, almost 8%.

About not knowing what the CPU is doing internally, you could use the data bus when /SYNC is asserted. Use those 8 bits (the instruction) as the address bus to a 256-byte EEPROM, and latch its data bus with a shifter. Now you have 8 bits per instruction to determine whether the bus is needed or not. Shift the register each clock. There are no 6502 instructions that take 8 or more cycles. This way, for example, CLC could run its dummy cycle disconnected from the bus while ANTIC does its DMA read on the bus.

Edit: note that the 8% calculation is with ANTIC DMA off. On a normal screen, where the 114 term in the equation is lower due to screen DMA, the relative improvement goes up.
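A software model of the EEPROM-plus-shifter idea might look like the sketch below. The mask encoding is an assumption made here for illustration (bit 0 is the opcode fetch cycle, a 1 bit means "bus needed"), the table holds only a few opcodes, and `BUS_MASK` and `bus_trace` are hypothetical names. It ignores the real timing of when the opcode becomes available on the data bus.

```python
# Hypothetical EEPROM contents: opcode -> 8-bit mask, LSB first.
# Bit i is 1 if cycle i of the instruction needs the external bus,
# 0 if that cycle is internal/dummy. Cycle 0 is the opcode fetch.
BUS_MASK = {
    0x18: 0b01,      # CLC: opcode fetch, dummy
    0x0E: 0b101111,  # ASL abs: opcode, 2 argument, operand, internal, write
    0xAD: 0b1111,    # LDA abs: opcode, 2 argument, operand
    0x4C: 0b111,     # JMP abs: opcode, 2 argument
}

def bus_trace(opcode, cycles):
    """Latch the mask for `opcode`, then shift one bit out per clock.
    Returns one boolean per cycle: True means the CPU needs the bus."""
    # Unknown opcode: play safe and assume every cycle needs the bus.
    shift_reg = BUS_MASK.get(opcode, 0xFF)
    trace = []
    for _ in range(cycles):
        trace.append(bool(shift_reg & 1))
        shift_reg >>= 1
    return trace

print(bus_trace(0x18, 2))  # CLC
print(bus_trace(0x0E, 6))  # ASL abs
```

Any cycle that comes out as False is one where ANTIC could be given the bus while the CPU runs disconnected.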
drac030 Posted August 27, 2021

2 hours ago, ivop said:
Use those 8-bits (the instruction) as the address bus to a 256B EEPROM, and latch its databus with a shifter. Now you have 8 bits per instruction to determine whether the bus is needed or not

... and be aware that one and the same opcode may execute in a different number of cycles depending on the internal CPU state. E.g. LDA abs,X is "normally" 4 cycles, but 5 when a page boundary gets crossed. In the latter case the spare cycle could not be detected, and so it could not be shortened. This would exclude quite a number of frequently used instructions in frequently used (i.e. indexed) addressing modes, and that in turn makes the whole thing even less worthwhile.
ivop Posted August 27, 2021 (edited)

1 hour ago, drac030 said:
... and be aware that one and the same opcode may be executed in different number of cycles depending on the internal CPU state. E.g. LDA abs,X is "normally" 4 cycles, but 5 when a page boundary gets crossed. In that latter case the spare cycle could not be detected, and so it could not be shortened. This would exclude quite a number of frequently used instructions in frequently used (i.e. indexed) addressing modes, and that in turn makes the whole thing yet less worthwhile.

Agree. All of the opcodes that have a •, ∗, or † have to be encoded in the EEPROM to work for both cases (read: worst case, fewer "new" cycles): http://ivop.free.fr/atari/opcodes.html

So indeed, that would be less worthwhile, as you say, for quite a few instructions.

Perhaps @foft could simulate this in his FPGA core to see what benefit it would have IRL?

1. Run CPU cycles when it doesn't need the bus, and let ANTIC use the bus at that cycle if it needs it. No SRAM needed for that.
2. Assume SRAM and ignore refresh cycles. Disconnect ANTIC from the bus, let the CPU do its thing, UNLESS the CPU reads from, or writes to ANTIC.
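The "encode for the worst case" point can be illustrated by OR-ing together the bus masks of every timing variant of an opcode. This is a sketch under an assumed encoding (bit i set means cycle i needs the bus, bit 0 being the opcode fetch); `worst_case_mask` and `free_cycles` are hypothetical helper names, and the dummy read of STA abs,X is treated as a cycle that does not need the bus.

```python
from functools import reduce

def worst_case_mask(*variant_masks):
    """OR the bus-needed masks of all timing variants of one opcode.
    A cycle may only be freed if no variant needs the bus in it."""
    return reduce(lambda a, b: a | b, variant_masks)

def free_cycles(mask, cycles):
    """Indices of the cycles in which the bus is not needed."""
    return [i for i in range(cycles) if not (mask >> i) & 1]

# LDA abs,X ($BD): 4 cycles normally, 5 with a page crossing.
LDA_ABS_X_NO_CROSS = 0b1111   # cycle 3 is the real operand read
LDA_ABS_X_CROSS = 0b10111     # cycle 3 is a dummy read, cycle 4 the real one

# STA abs,X ($9D): always 5 cycles, cycle 3 is always a dummy read.
STA_ABS_X = 0b10111

lda_mask = worst_case_mask(LDA_ABS_X_NO_CROSS, LDA_ABS_X_CROSS)
print(free_cycles(lda_mask, 5))   # nothing can be freed
print(free_cycles(STA_ABS_X, 5))  # the fixed dummy cycle can be freed
```

For LDA abs,X the two variants disagree about cycle 3, so the merged mask keeps the bus for every cycle. An opcode with a fixed cycle count and a fixed dummy cycle keeps its free slot.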
_The Doctor__ Posted August 27, 2021

heh separate antic access set, give antic its own ram....only selected to be filled but then switched independent/ sync for running
Keatah Posted August 27, 2021

Don't understand all the tech talk about cycles and access times and stuff.. But BallBlazer is beautifully smooth when using a faster CPU.
_The Doctor__ Posted August 27, 2021

it's smooth using a bog standard cpu as well
Xebec Posted August 27, 2021 (Author)

Wow. Thanks to all for the education here. I've learned that the 6502, to keep costs down (reduced pin count, reduced transistors), will access memory every cycle even if there is no need. However, even if that were not a problem, most instructions are accessing memory constantly anyway, reducing the opportunity to speed up. The 65C816 improves on the situation a little bit, but due to the still limited 8-bit external bus it doesn't provide much of a boost in performance unless you have a cache with higher clocks.

...

I assume I'll get laughed at for this question, but I'm just curious. Is there any (reasonably easy) way to decouple the frequency of GTIA from the rest of the system, to increase the frequency of those other parts to something like 2 MHz while still having GTIA output at the right frequency for NTSC/PAL? I'm personally curious what kind of clock margin ANTIC, POKEY, and other chips have in the Atari..
Keatah Posted August 27, 2021

59 minutes ago, _The Doctor__ said:
it's smooth using a bog standard cpu as well

With the faster cpu, the game is rendered at frame rate speed.
ivop Posted August 27, 2021 (edited)

1 hour ago, Xebec said:
I'm personally curious what kind of clock margin ANTIC, POKEY, and other chips have in the Atari..

As for POKEY, I have had this idea for years to disconnect pin 7 (Phi2), and connect it to a different clock, in the range of the Atari arcades, i.e. 1.2-1.5 MHz. I think Phi2 does not play a role in reads from, and writes to, POKEY. These will still be at 1.77/1.79 MHz, and writes are done at the edge of R/W.

Edit: with this clock change, you obviously cannot boot over SIO (the clock is wrong), but you can with a cartridge (SIDE, MyIDE, etc...).
ClausB Posted August 28, 2021 (edited)

1 hour ago, ivop said:
I think Phi2 does not play a role in reads from, and writes to, Pokey. These will still be at 1.77/1.79MHz, and writes are done at the edge of R/W.

The 6502 timing diagrams I've seen show otherwise. Data setup and hold are referred to Phi2, not R/W.
+bob1200xl Posted August 28, 2021

Your basic fast clock solution:

Replace the DRAM with faster SRAM. 'Watch' the Phi2 clock. When it goes low, the Atari decodes device selects and the 816 puts A16-A23 on the data bus. If the device select is now SRAM (system RAM), then raise Phi2 after 70 ns - a 7.16 MHz clock. Lower Phi2 after 70 ns, starting all over again.

If the target logic is not system RAM, then raise Phi2 with the system clock at 280 ns (1.79 MHz) and drop Phi2 at 560 ns.

It does not matter what kind of instruction you are executing. You don't have to latch anything unless you are implementing linear memory (>64K).

I can't follow what you are doing... makes my head hurt. If you are accessing SRAM with the CPU, bump the clock to 7.16 MHz. If not SRAM, leave the clock at 1.79 MHz. It's just that simple.

Bob
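The clock-stretching rule described above can be modelled in a few lines. This is only a sketch: the 70 ns and 280 ns half-periods come from the post, but `is_system_ram` is a deliberately simplified, hypothetical address decode (it only treats $D000-$D7FF as slow hardware and ignores ROM, cartridges and so on), and all function names are made up for illustration.

```python
FAST_HALF_NS = 70    # Phi2 low/high time when the target is SRAM
SLOW_HALF_NS = 280   # Phi2 low/high time for everything else (stock clock)

def is_system_ram(address):
    """Hypothetical decode: everything except the $D000-$D7FF
    hardware register area counts as fast SRAM."""
    return not (0xD000 <= address <= 0xD7FF)

def cycle_ns(address):
    """Length of one full Phi2 cycle for an access to `address`."""
    half = FAST_HALF_NS if is_system_ram(address) else SLOW_HALF_NS
    return 2 * half

def run_time_ns(addresses):
    """Total time for a sequence of bus accesses, one per CPU cycle."""
    return sum(cycle_ns(a) for a in addresses)

# Two RAM accesses followed by one hardware register access.
print(run_time_ns([0x2000, 0x2001, 0xD40A]))
```

The decision is made per bus cycle from the address alone, which is why the kind of instruction being executed does not matter.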
ijor Posted August 28, 2021

12 hours ago, ivop said:
I think Phi2 does not play a role in reads from, and writes to, Pokey.

It does, of course, both in reads and in writes. See Pokey internal schematics page 1; follow signals PreS01 and Phi2B.

But even without looking at Pokey's internal design, the 6502 bus interface is synchronous. The address bus must be latched at one phase of the clock and the data bus at the other phase.
ivop Posted August 28, 2021

11 hours ago, bob1200xl said:
I can't follow what you are doing... makes my head hurt.

Not sure if you meant me, but my musings were about stock clock speed, but reclaiming /REF cycles and/or running dummy CPU cycles disconnected from the bus, to let ANTIC use it. Sorry, things are a bit mixed up in this thread. Software simulation was the OP's question ?

2 hours ago, ijor said:
It does, of course, both in reads and in writes. See Pokey internal schematics page 1; follow signals PreS01 and Phi2B.

I was hoping it did not, but alas. :/ Another idea in the bin :)