
Useful features to programmers


Delicon


This seems possible; I will have to run a couple of tests. You may have to wait a couple of VCS clocks to get the answer (you would start the sampling, then come back a couple of clocks later to retrieve it); is that a huge problem?

 

My expectation would be that the first value you get will be garbage, but values read every 76 clocks after that will be valid.

 

How exactly were you planning to implement your queues? Tell me that and I'll tell you how my idea would be different.

It's not so easy to explain. The microcontroller identifies that the VCS has accessed a queue hotspot and then massages the address that the SRAM sees with its queue counter (the current location in the 256-byte queue). The micro's pretty removed from it; the work is done by the VCS and the CPLD (which handles the timing of the control lines, same as a normal SRAM access).
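
Roughly, the mechanism I have in mind looks like this (just a sketch; the names and the per-queue counter table are illustrative, not the real firmware):

#include <stdint.h>

#define NUM_QUEUES 8          /* illustrative: one counter per queue hotspot */

static uint8_t queue_pos[NUM_QUEUES];   /* current location in each 256-byte queue */

/* Called when the CPLD flags that the VCS hit queue hotspot 'q'.
   The SRAM address is built from the queue's page plus its counter,
   then the counter advances for the next access. */
static uint16_t massage_address(uint8_t q)
{
    uint16_t sram_addr = ((uint16_t)q << 8) | queue_pos[q]; /* page q, offset = counter */
    queue_pos[q]++;                                         /* wraps naturally at 256 */
    return sram_addr;
}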

 

Is the idea that when the 6507 accesses one of 256 hotspots, there is an SRAM access with some address bits supplied by your micro and some supplied by the 6507?

 

If so, it might be easiest to use a pipelined-read architecture where each read request returns the results of the PREVIOUS actual read (and its result in turn will be read on the NEXT read request). Such a design would make the 6507 code seem a little goofy, but would give the micro a lot more time to do its processing.

 

I don't really have a feel for what your micro can do, but the design approach I would see as being most practical would be to have fifteen address lines going from the micro to an SRAM, nine address lines from the 6507 to the micro [plus a 'block-select' decode], and the data wires just going from the 6507 to the SRAM. One range of 6507 addresses would be for reading data; it would be somewhere in the $1000-$1FFF range; the other range would be for commands, and could be in the $0400-$0FFF range.

 

Suppose that address $1000 is defined as 'queue 0 read' and $1001 is defined as 'queue 1 read'. Then to generate 100 scan lines with queue 0 going to GRP0 and queue 1 to GRP1, the code would be:

 cmp $1000  ; Prepare to read queue 0
 ldx #100  ; 100 loops
lp:
 lda $1001 ; Read queue 0; prepare to read queue 1
 sta GRP0
 lda $1000 ; Read queue 1; prepare to read queue 0
 sta GRP1
 dex
 bne lp

Note that the micro will have a fair amount of time to set up for each read operation, since it merely has to ensure that the data is ready for the next one.

 

Writes to the SRAM would be accomplished by writing something special to the micro to let it know the next address should be a write; it would then assert /WE at the proper time to make that happen.
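
As a rough sketch of that write mechanism (the function names and the command code are hypothetical placeholders, not a spec):

#include <stdint.h>

/* Hypothetical hardware hooks -- stand-ins for the CPLD/SRAM control lines. */
void assert_WE(uint16_t sram_addr);   /* pulse /WE at the proper time for a write */
void drive_read(uint16_t sram_addr);  /* set up a normal read of the SRAM */

static uint8_t write_pending;         /* set by a "next access is a write" command */

void command_access(uint8_t cmd)
{
    if (cmd == 0x01)                  /* illustrative command code: arm a write */
        write_pending = 1;
}

void data_access(uint16_t sram_addr)
{
    if (write_pending) {
        assert_WE(sram_addr);         /* this access becomes the write */
        write_pending = 0;            /* one-shot: the next access reads again */
    } else {
        drive_read(sram_addr);
    }
}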

 

For times when the micro needs to supply data to the 6507 (e.g. when playing music) the approach I would suggest would be to have the 6507 put a table of byte values 0-255 at some fixed address in RAM, and then have the micro use those addresses when it wants to give a byte value to the 6507.


Is the idea that when the 6507 accesses one of 256 hotspots, there is an SRAM access with some address bits supplied by your micro and some supplied by the 6507?

Yes, but I am also using the CPLD to identify the upper address bits (the bits not connected to the microcontroller), then send the microcontroller an interrupt. I wasn't sure it was going to work, but it ended up working great.

 

The bad news is I don't think I am going to use this design. The good news is I think I have a much, much better one. It would give the microcontroller control over the full SRAM and access to all the VCS address lines. I have a few things to work out, but I am pretty confident it's going to work. It should allow me to implement any of the suggestions so far. I am throwing together a circuit and a board. I will let you know.

 

Vern


Please consider for the queues the ability to auto-reload. So for each queue there are two addresses you can read from in the kernel. The first location reads the next byte and removes it from the queue. The second address reads the byte and requeues it at the same time. That way, if you have something like a static background, you don't need to restuff the queues every frame.

 

Note: This only matters if you are implementing the queues as a separate memory space only accessible with pushes and pops. If they are just ranges of regular RAM that can be alternately accessed via self-incrementing pointers, then this is not a problem.
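
A minimal C sketch of the two-address idea (purely illustrative; the real thing would live in the cart firmware):

#include <stdint.h>

typedef struct {
    uint8_t data[256];
    uint8_t head;       /* next byte to be read */
    uint8_t count;      /* bytes currently queued */
} byte_queue;

/* Address 1: read the next byte and remove it from the queue (assumes non-empty). */
uint8_t queue_pop(byte_queue *q)
{
    uint8_t v = q->data[q->head];
    q->head++;          /* 8-bit index wraps at 256 */
    q->count--;
    return v;
}

/* Address 2: read the next byte and requeue it at the same time,
   so a static background never has to be restuffed. */
uint8_t queue_pop_requeue(byte_queue *q)
{
    uint8_t v = q->data[q->head];
    q->data[(uint8_t)(q->head + q->count)] = v;  /* put it back at the tail */
    q->head++;
    return v;
}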

 

Cheers!


Note: This only matters if you are implementing the queues as a separate memory space only accessible with pushes and pops. If they are just ranges of regular RAM that can be alternately accessed via self-incrementing pointers, then this is not a problem.


 

Given the application, it would only make sense to have the queue pointers be manipulable. I do think having the queues be in a shared address space with user-specified pointers and masks would be a good idea. It might require pipelining the reads, which would make the code a little strange, but not infeasible.

 

The simplest hardware approach I can see would be for the ATmega to have ten inputs (sysA0-sysA8 and /strobe) and sixteen outputs (ramA0-ramA14 and ramRW). On any access in the range $0800-$09FF or $1000-$10FF, a CPLD should hit the /strobe signal and, for the upper range only, select the RAM (for reading or writing as indicated by ramRW).

 

Note that the ATmega would not have to do anything 'instantly' in response to an access except capture the address. The CPLD and RAM would supply the necessary data onto the bus. The address used by the 6507 would not affect the current memory access, but would serve to tell the ATmega what it should do next.

 

I would envision the ATmega code as being something like this:

typedef unsigned char ub; typedef unsigned short ui;

#define PTR_BASE 0
#define PTRH_BASE 1
#define MASK_BASE 2
#define COMP_BASE 3

ub dat[256];
ub cmd_ptr,lastmode;

void output_address_hl(ub addrh, ub addrl);
/* Output the specified address on the address bus and /WE pin (A15=WE) */

void handle_access(ub addr_lo, ub a8)
{
  ub *p;
  if (!a8)
  {
    if (addr_lo < 0x40)  /* Only accesses 1000-103F are presently defined */
    {
      p = dat+(addr_lo << 2);
      if ((p[PTR_BASE] & p[MASK_BASE]) == p[COMP_BASE])
        output_address_hl(p[PTRH_BASE],p[PTR_BASE]);
      else
        output_address_hl(128,0);
      p[PTR_BASE]++;
      lastmode = 0;
    }
  }
  else
  {
    if (lastmode)
      dat[cmd_ptr++] = addr_lo;
    else
    {
      cmd_ptr = 0;
      lastmode = 1;
    }
    output_address_hl(128,dat[cmd_ptr]);
  }
}

 

I'll write up more later about how all that would actually work.


Please consider for the queues the ability to auto-reload. So for each queue there are two addresses you can read from in the kernel. The first location reads the next byte and removes it from the queue. The second address reads the byte and requeues it at the same time. That way, if you have something like a static background, you don't need to restuff the queues every frame.

The idea was going to be circular queues. You would fill them, then as you read them the queue position would get incremented modulo the number of items in the queue. Incrementing past the end returns you to the beginning. You could empty a queue through a different means. I think this would work the way you want it to. Glenn also was interested in a way to seek to a position in a queue; I was going to work that in as well.
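
In C terms, the behaviour I have in mind is roughly this (a sketch, not the actual firmware; len is however many items you loaded, and must be nonzero):

#include <stdint.h>

typedef struct {
    uint8_t data[256];
    uint8_t pos;    /* current read position */
    uint8_t len;    /* number of items loaded into the queue (nonzero) */
} circ_queue;

/* Each hotspot read returns the current byte and advances the position;
   incrementing past the end returns you to the beginning. */
uint8_t circ_read(circ_queue *q)
{
    uint8_t v = q->data[q->pos];
    q->pos = (uint8_t)((q->pos + 1) % q->len);
    return v;
}

/* The seek Glenn asked about: jump straight to a position in the queue. */
void circ_seek(circ_queue *q, uint8_t where)
{
    q->pos = (uint8_t)(where % q->len);
}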

 

Vern


typedef unsigned char ub; typedef unsigned short ui;

#define PTR_BASE 0
#define PTRH_BASE 1
#define MASK_BASE 2
#define COMP_BASE 3

ub dat[256];
ub cmd_ptr,lastmode;

void output_address_hl(ub addrh, ub addrl);
/* Output the specified address on the address bus and /WE pin (A15=WE) */

void handle_access(ub addr_lo, ub a8)
{
  ub *p;
  if (!a8)
  {
    if (addr_lo < 0x40)  /* Only accesses 1000-103F are presently defined */
    {
      p = dat+(addr_lo << 2);
      if ((p[PTR_BASE] & p[MASK_BASE]) == p[COMP_BASE])
        output_address_hl(p[PTRH_BASE],p[PTR_BASE]);
      else
        output_address_hl(128,0);
      p[PTR_BASE]++;
      lastmode = 0;
    }
  }
  else
  {
    if (lastmode)
      dat[cmd_ptr++] = addr_lo;
    else
    {
      cmd_ptr = 0;
      lastmode = 1;
    }
    output_address_hl(128,dat[cmd_ptr]);
  }
}

I don't think this would work so well. The problem is that by the time the AVR knows it needs to react to the VCS, it only has about 7-10 AVR clock cycles (at a max 20 MHz clock) to get data out to the bus. A couple of those instructions get used up just manipulating the port: reading, writing, and direction setting. So you can't really have any conditional statements, function jumps, or pointer math. The idea is: get the data quick, put the data out quick. After that, you have bunches of time to do whatever calculations you want.

 

My new idea is to use a cheap Philips ARM7 microcontroller. It's a 32-bit core running at 60 MHz. If I can get this to work, I think all the suggestions and more will be possible (Glenn suggested Ethernet; that might be a bit of overkill). I have a couple of hurdles I need to figure out. The first is that the processor boots up at 20 MHz; it takes some time and software commands to set up the 60 MHz clock. The second is that the processor boots from instructions in flash. At 60 MHz the flash takes 3 clock cycles to access, which effectively reduces the speed of the ARM to 20 MHz. This access penalty can be removed by running the ARM code from SRAM, but it also takes time for the ARM to move its code to SRAM. For normal applications all this time wouldn't be a problem, but here the VCS wants responses right away.

 

I have an idea to get by this. At VCS power-on, can I just feed it a NOP every time it asks for code from the cartridge? Then when the ARM finishes its setup it will feed the VCS a jump and start the real VCS code. The screen won't be updated for a fraction of a second. I don't think anybody would mind if the screen flickered for half a second. Would this work? What happens at VCS power-up?

 

Thanks,

 

Vern


I don't think this would work so well. The problem is that by the time the AVR knows it needs to react to the VCS, it only has about 7-10 AVR clock cycles (at a max 20 MHz clock) to get data out to the bus. A couple of those instructions get used up just manipulating the port: reading, writing, and direction setting. So you can't really have any conditional statements, function jumps, or pointer math. The idea is: get the data quick, put the data out quick. After that, you have bunches of time to do whatever calculations you want.

 

I don't think you understand my plan, perhaps because I'm not being clear. I don't suppose you've ever programmed the TI 320C5x DSP? If you have, you'd grok the idea here perfectly.

 

Assume that addresses $1001-$1005 are for fetchers 1-5, which are set up to return data for P0, P1, M0, M1, and the Ball, in that order. I want to generate 192 scan lines of display using those fetchers. Here's how I'd do it. Read the comments carefully.

 bit $1001 ; Prepare to read data from fetcher 1
 ldx #192 ; Number of scan lines
lp:
 sta WSYNC
 lda $1002 ; Read previously-prepared data from fetcher 1; prepare to read #2
 sta GRP0
 lda $1003 ; Read previously-prepared data from fetcher 2; prepare to read #3
 sta GRP1
 lda $1004 ; Read previously-prepared data from fetcher 3; prepare to read #4
 sta ENAM0
 lda $1005 ; Read previously-prepared data from fetcher 4; prepare to read #5
 sta ENAM1
 lda $1001 ; Read previously-prepared data from fetcher 5; prepare to read #1
 sta ENABL
 dex
 bne lp

 

Note that the 2600 coding is a little goofy, because the LDA instructions don't use the address I really want to read, but instead use the address I'll want to read with the next LDA (or LDY, or whatever) instruction. An ATmega chip running at 14.31818 MHz would have a minimum of 48 cycles to deal with any read request (contingent only upon its having grabbed the address while it was still present on the bus), since it gets twelve cycles for every 1.19 MHz 6507 cycle and the fastest back-to-back reads are 4-cycle LDAs. In the above 2600 code, each LDA/STA pair takes seven 6507 cycles, so the ATmega chip would have 84 cycles after each read to prepare for the next one. Plenty of time.

 

BTW, note that in the design I'm thinking of, the ATmega (or whatever) micro doesn't have any connection to the data bus. To allow for cases where it might be desirable to read data from it, user code would be responsible for storing the numbers 0-255 into SRAM addresses 0-255. Any command which should return a value would simply point the SRAM at the address equal to that value (0-255).
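
To make that concrete, a rough sketch (TABLE_PAGE and return_value are hypothetical names, reusing the output_address_hl idea from the code above):

typedef unsigned char ub;

void output_address_hl(ub addrh, ub addrl);  /* as in the earlier sketch */

#define TABLE_PAGE 0   /* illustrative: SRAM page where user code stored 0..255 */

/* Returning a byte to the 6507 just means pointing the SRAM at the table
   entry equal to that byte, since user code stored value N at address N. */
void return_value(ub v)
{
    output_address_hl(TABLE_PAGE, v);
}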

 

My new idea is to use a cheap Philips ARM7 microcontroller. It's a 32-bit core running at 60 MHz. If I can get this to work, I think all the suggestions and more will be possible (Glenn suggested Ethernet; that might be a bit of overkill). I have a couple of hurdles I need to figure out. The first is that the processor boots up at 20 MHz; it takes some time and software commands to set up the 60 MHz clock. The second is that the processor boots from instructions in flash. At 60 MHz the flash takes 3 clock cycles to access, which effectively reduces the speed of the ARM to 20 MHz. This access penalty can be removed by running the ARM code from SRAM, but it also takes time for the ARM to move its code to SRAM. For normal applications all this time wouldn't be a problem, but here the VCS wants responses right away.

 

That's starting to seem like massive overkill. Using a small micro to implement a fetcher is not totally out of line with the spirit of pre-crash gaming (in 1984, people would have used dedicated counters instead of a micro to produce a fetcher, but the spirit is somewhat the same). But when the micro gets to have a hundred times the processing horsepower of the main game, what's really driving what?

 

I have an idea to get by this. At VCS power-on, can I just feed it a NOP every time it asks for code from the cartridge? Then when the ARM finishes its setup it will feed the VCS a jump and start the real VCS code. The screen won't be updated for a fraction of a second. I don't think anybody would mind if the screen flickered for half a second. Would this work? What happens at VCS power-up?

 

You might be able to use some programmable logic to generate the initial-power-on bus timings, but before you can clock out NOPs you'll have to clock out an address. It would be nice if $4C4C was within the cartridge's address space, but it isn't. Nor is $2020 nor $6C6C.

 

If you stick with a small micro working as a fetcher in conjunction with an external data memory and a 'normal' code memory, emulation should be reasonably feasible if you define a fixed feature set for your fetcher. But if your fetcher becomes a changeable program, accurate emulation is going to become absurdly difficult.


That's starting to seem like massive overkill.  Using a small micro to implement a fetcher is not totally out of line with the spirit of pre-crash gaming (in 1984, people would have used dedicated counters instead of a micro to produce a fetcher, but the spirit is somewhat the same).  But when the micro gets to have a hundred times the processing horsepower of the main game, what's really driving what?

It may be overkill, but it's cheaper than doing it my previous way. This way should bring total part costs in under $15 and give more options for functionality. All you end up needing is the processor; it's got flash and SRAM built in. I understand it's taking it to extremes, but if it works and it's cheaper, why not?

 

You might be able to use some programmable logic to generate the initial-power-on bus timings, but before you can clock out NOPs you'll have to clock out an address.  It would be nice if $4C4C was within the cartridge's address space, but it isn't.  Nor is $2020 nor $6C6C.

Why would the ARM have to generate addresses? The VCS should be asserting the address bus. It will be looking for code to execute and I will just feed it NOP. I would only give it NOP when A12 is high.

 

If you stick with a small micro working as a fetcher in conjunction with an external data memory and a 'normal' code memory, emulation should be reasonably feasible if you define a fixed feature set for your fetcher.  But if your fetcher becomes a changeable program, accurate emulation is going to become absurdly difficult.

There will really only be one program. It will handle everything like it did before, so no real difference there. The real difference is in cost and design simplicity.

 

Vern


It may be overkill, but it's cheaper than doing it my previous way. This way should bring total part costs in under $15 and give more options for functionality. All you end up needing is the processor; it's got flash and SRAM built in. I understand it's taking it to extremes, but if it works and it's cheaper, why not?

 

What are the specs on the exact processor you want to use? Do you have a datasheet link?

 

Why would the ARM have to generate addresses? The VCS should be asserting the address bus. It will be looking for code to execute and I will just feed it NOP. I would only give it NOP when A12 is high.

 

The 6507 reads addresses $1FFC and $1FFD to determine where it should begin code execution. Those addresses need to return something within cartridge space. If your ARM processor fires up faster than the 6507 (I don't know how long /Reset takes after powerup on the Atari), you might want to have some code which starts running at 20 MHz that guides the 6507 through the process of copying into RIOT RAM something like the following:

 INIT ; Clear out display regs and set up stack
zz:
 iny
 sty COLUBK
 lda $1000
 cmp #$A5
 bne zz
 lda $1000
 cmp #$5A
 bne zz
 jmp $1002

Then the ARM would be free to take its time with the remainder of initialization. Shouldn't take very long (probably under 100ms) but not having to worry about keeping the 6507 'fed' during that time would be a plus.
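
On the ARM side, serving that handshake could be about this simple (a sketch; read_1000 is a made-up hook for "the 6507 just read $1000"):

#include <stdint.h>

static int init_done;      /* set once the ARM has finished its slow setup */
static uint8_t phase;      /* alternates the $A5 / $5A signature */

/* Hypothetical: called for every 6507 read of address $1000 while the
   little RIOT-RAM loop above is polling for the go-ahead. */
uint8_t read_1000(void)
{
    if (!init_done)
        return 0x00;                /* anything that fails the CMP #$A5 */
    phase ^= 1;
    return phase ? 0xA5 : 0x5A;     /* A5 then 5A -> the loop falls through to the JMP */
}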

 

If you stick with a small micro working as a fetcher in conjunction with an external data memory and a 'normal' code memory, emulation should be reasonably feasible if you define a fixed feature set for your fetcher.  But if your fetcher becomes a changeable program, accurate emulation is going to become absurdly difficult.

There will really only be one program. It will handle everything like it did before, so no real difference there. The real difference is in cost and design simplicity.

 

I think you may be overlooking something major, though: while you're trying to get the fetcher working, you'll have a 6507 system at your disposal which can run code in nicely predictable fashion. You'll be able to write test programs which manipulate fetcher data and show it on the screen. Whether the fetcher is working or not, your program itself will run predictably.

 

By contrast, I don't see how your all-in-one approach could end up being anything less than a nightmare to debug. If anything goes less than perfectly, the 6507 is going to end up off in the weeds somewhere, and you'll have no idea as to why.

 

I have done a number of multi-processor projects, and one of the things I have learned is that it is absolutely critical to making them manageable to set things up so that at least one of the processors can be debugged in isolation. For the Fetcher, that would be doable [you can write a very simple, straightforward, and bug-free program to feed specific data to the fetcher and output the exact results]. For your single-chip notion, I see no way whatsoever to accomplish this. Further, I would expect that even at 60 MHz you'll have a harder job keeping up with everything that needs to happen than you would using the CPLD+RAM+fetcher approach at 14 MHz (since you'll need to process all cartridge accesses, rather than merely handling those that actually use the Fetcher's features).


What are the specs on the exact processor you want to use?  Do you have a datasheet link?

It's a Philips processor, the LPC2106. It's the simplest one they make with the most SRAM.

 

The 6507 reads addresses $1FFC and $1FFD to determine where it should begin code execution.  Those addresses need to return something within cartridge space.

I think I have the solution. Instead of feeding the 6507 a NOP, I will feed it an absolute jump indexed with X (opcode 0x7C). If I always send 0x7C, the 6507 will get a reset vector in cartridge space, and everything after that will be a jump to 0x7C7C indexed with X; who cares what X is, the jump will still be in cartridge space. I will need to keep track of which byte the 6507 is on (the opcode, high or low byte), but that's easy. This will always keep me in cartridge space no matter how long it takes me to initialize.
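
In rough C terms the startup feeder is trivial (a sketch; cartridge_read and real_cartridge_read are stand-ins for however the bus interface ends up looking):

#include <stdint.h>

#define OPCODE_JMP_ABS_X  0x7C      /* the opcode I keep feeding during init */

extern int init_done;                       /* set when the 60 MHz setup is finished */
uint8_t real_cartridge_read(uint16_t addr); /* hypothetical: serves the real ROM image */

/* Called for every 6507 read with A12 high (cartridge space). */
uint8_t cartridge_read(uint16_t addr)
{
    if (!init_done)
        return OPCODE_JMP_ABS_X;    /* reset vector = $7C7C, then endless jumps in cart space */
    return real_cartridge_read(addr);
}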

 

By contrast, I don't see how your all-in-one approach could end up being anything less than a nightmare to debug.  If anything goes less than perfectly, the 6507 is going to end up off in the weeds somewhere, and you'll have no idea as to why.

 

I have done a number of multi-processor projects, and one of the things I have learned is that it is absolutely critical to making them manageable to set things up so that at least one of the processors can be debugged in isolation.

It may be a nightmare, but I am still going to try. As it is now, I have managed a multiprocessor design that can handle most bankswitching and serial input and output. All that with only a multimeter. I wish I had access to some better equipment (maybe someday I will buy a logic analyzer); somehow I always manage to get by with what I've got.

 

For the Fetcher, that would be doable [you can write a very simple, straightforward, and bug-free program to feed specific data to the fetcher and output the exact results]. For your single-chip notion, I see no way whatsoever to accomplish this. Further, I would expect that even at 60 MHz you'll have a harder job keeping up with everything that needs to happen than you would using the CPLD+RAM+fetcher approach at 14 MHz (since you'll need to process all cartridge accesses, rather than merely handling those that actually use the Fetcher's features).

You're right, it's really not a one-chip solution, it's two. I will also need an XC9536 to generate a VCS clock to trigger ARM interrupts. Still really, really cheap though.

 

To be honest the scariest part for me is figuring out a way to get rid of the serial port ground noise the PC generates. I will have to start playing with some optoisolators.

 

Vern


I think I have the solution. Instead of feeding the 6507 a NOP, I will feed it an absolute jump indexed with X (opcode 0x7C). If I always send 0x7C, the 6507 will get a reset vector in cartridge space, and everything after that will be a jump to 0x7C7C indexed with X; who cares what X is, the jump will still be in cartridge space. I will need to keep track of which byte the 6507 is on (the opcode, high or low byte), but that's easy. This will always keep me in cartridge space no matter how long it takes me to initialize.

 

Hmm... I looked at the opcode list for anything that was well-placed and didn't see that one. I agree that would seem to be a solution.

 

It may be a nightmare, but I am still going to try. As it is now, I have managed a multiprocessor design that can handle most bankswitching and serial input and output. All that with only a multimeter. I wish I had access to some better equipment (maybe someday I will buy a logic analyzer); somehow I always manage to get by with what I've got.

 

Well, best of luck to you.

 

You're right, it's really not a one-chip solution, it's two. I will also need an XC9536 to generate a VCS clock to trigger ARM interrupts. Still really, really cheap though.

 

What do you need that for? Since you're going to be having to handle almost every cycle on the 6507, why not use a timer interrupt? Assuming a 60 MHz clock, once things have stabilized, wait for the lower-numbered address to appear [how exactly does opcode 7C work with regard to page crossings, BTW?], start outputting $EA until you're at one byte before where the 6507 code is supposed to begin, then set the timer to interrupt about 40 cycles in the future. Each time the interrupt fires, wait up to twenty cycles for the address to change. If the address changes, set the timer 40 cycles in the future from that. If it doesn't change, set the timer 50 cycles from when it expired.
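
In pseudo-C, the pacing I have in mind would be something like this (the timer and bus functions are placeholders for whatever the ARM's peripherals actually provide, not a real API):

#include <stdint.h>

/* Placeholder hardware hooks -- stand-ins for the ARM's timer and the bus interface. */
uint32_t timer_now(void);                 /* free-running count of 60 MHz cycles */
void     timer_fire_at(uint32_t when);    /* schedule the next timer interrupt */
uint16_t bus_address(void);               /* current 6507 address lines */

/* Timer interrupt handler: resynchronize to the 6507 once per access. */
void timer_isr(void)
{
    uint32_t expired = timer_now();
    uint16_t a = bus_address();

    /* Wait up to ~20 cycles for the address to change. */
    while (timer_now() - expired < 20) {
        if (bus_address() != a) {
            timer_fire_at(timer_now() + 40);  /* changed: 40 cycles from the change */
            return;
        }
    }
    timer_fire_at(expired + 50);              /* no change: 50 cycles from expiry */
}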

 

To be honest the scariest part for me is figuring out a way to get rid of the serial port ground noise the PC generates.  I will have to start playing with some optoisolators.


 

My recommendation would be to use a USB-to-serial chip along with a couple of optos. Communicating with a standard serial port via optos is a pain because of the need to have an RS232 level converter which is powered by something on the PC side of things. I've used a passive opto design which worked well up to about 4800 baud, but for higher speeds of RS232 it's necessary to have something active on the PC side of the opto.


What do you need that for?  Since you're going to be having to handle almost every cycle on the 6507, why not use a timer interrupt?

I will give that a try. My proto has the CPLD, which should make testing a little easier, giving me a guaranteed signal. If I can get it to work out I can try the timer.

 

My recommendation would be to use a USB-to-serial chip along with a couple of optos. Communicating with a standard serial port via optos is a pain because of the need to have an RS232 level converter which is powered by something on the PC side of things. I've used a passive opto design which worked well up to about 4800 baud, but for higher speeds of RS232 it's necessary to have something active on the PC side of the opto.

My protos right now are using the SiLabs CP2102 USB chip; I love them. Glenn wants to use raw serial to allow others to create add-on devices: mass storage, Ethernet, whatever. I like the idea also. Unfortunately that makes getting power difficult, like you mentioned, so I am a little stuck. Using both ARM serial ports would be going too far, not to mention the added cost for connectors. I will have to figure something out regardless of which design I go with; the amount of noise is horrible.

 

Vern


  • 4 months later...

Glenn is also interested in adding math coprocessing queues. These would function the same as the normal queues, except they would be loaded with the values you wish to be processed and then the answer would be read back. My question is what functions would be needed? Specifics would be nice, for example 16-bit multiplication with a 32-bit answer. Also, how fast would you expect results? I am using a relatively fast processor, so a basic calculation would be available to the VCS by the next clock cycle. More complex calculations, for example square roots, would not be. What is a reasonable number of VCS clock cycles in which to expect an answer?

 

Also feel free to suggest other features, ...

 

My wishlist (things that waste too much ROM):

 

Stacks or queues. Can you do selectable LIFO/FIFO? I like stacks. In 6502-land we often count "down" rather than up because detecting zero is cheap, and stacks would be a nifty way to deal with that. A queue would work too; a stack would just be "nicer".

 

Multiplication and division -- at least 8b × 8b => 16b?

 

nybble flip. 0b11110000 => 0b00001111 for example; $f8 => $8f

 

(Think of getting text or similar data into sprites or the playfield. Yes, this can be done in 5 lines using bit rotation, but one bit at a time... vs lda (spritedata),y:sta rotate:lda rotated)
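
For what it's worth, the operation itself is tiny on the cart side; something like this (illustrative only):

#include <stdint.h>

/* Swap the high and low nybbles: 0b11110000 -> 0b00001111, $F8 -> $8F. */
uint8_t nybble_flip(uint8_t v)
{
    return (uint8_t)((v << 4) | (v >> 4));
}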

 

I second the motion for binary <=> decimal and so forth. Sure, it can be done with lookup tables, but it can easily be non-addressable memory doing the work. That is: with a 4K address space, having to make the LUTs addressable is a "cost" in available address space.
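
As a sketch of the sort of conversion the cart could do for us (illustrative; a score-style split into three decimal digits):

#include <stdint.h>

/* Convert a binary byte (0-255) to three decimal digits the 6507 could
   read back, e.g. for a score display -- no in-ROM lookup table needed. */
void bin_to_dec(uint8_t v, uint8_t digits[3])
{
    digits[0] = v / 100;         /* hundreds */
    digits[1] = (v / 10) % 10;   /* tens */
    digits[2] = v % 10;          /* ones */
}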

 

Agreed, most of the things requested can be done, but doing it faster is the goal, right?

