2 hours ago, Lee Stewart said:

 

The code I slightly modified is from MG’s The Smart Programmer, V2(2), July, 1986. Here is my transcription of it:

Thanks Lee, that is interesting. One silly question (I have only written the receiving end of a DSR, i.e. the actual DSR): the calling convention you mention is:

	BLWP @DSRLNK    
	DATA 8

I am curious about the use of the parameter 8. How is it used? If I have understood properly scratchpad location >8356 must be initialised with the PAB structure pointer (offset 9) in VDP memory. Sorry I haven't taken any time to dig into this.

2 minutes ago, Asmusr said:

If you want to speed things up you should look into how to avoid any VDP reads and how to avoid setting up the VDP write address more than once.

Thanks. Yes, I have the solution for this already: do all processing on the Cortex M4 on the cartridge, have it place the data destined for VDP memory in an area of cartridge memory, and then have a very simple routine in scratchpad memory for the TMS9900 that just loops through the data and writes it to the VDP. Then the whole thing becomes limited by VDP memory write bandwidth. Simple structures like: count, VDP destination address, count bytes of data.
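A minimal sketch of what such a record list could look like, written as host-side C rather than TMS9900 code. The post only names the fields (count, VDP destination address, data); the 1-byte count, the big-endian address, the zero-count terminator and all function names below are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMED record layout (not specified in the post): 1-byte count,
   2-byte big-endian VDP address, then `count` data bytes.
   A count of 0 ends the list. */
static size_t put_record(uint8_t *buf, size_t pos, uint16_t vdp_addr,
                         const uint8_t *data, uint8_t count)
{
    buf[pos++] = count;
    buf[pos++] = (uint8_t)(vdp_addr >> 8);
    buf[pos++] = (uint8_t)(vdp_addr & 0xFF);
    memcpy(&buf[pos], data, count);
    return pos + count;
}

/* Host-side stand-in for the TMS9900 copy loop: replays the records
   into a simulated 16K VRAM image. */
static void replay_records(const uint8_t *buf, uint8_t *vram)
{
    size_t pos = 0;
    while (buf[pos] != 0) {
        uint8_t count = buf[pos++];
        uint16_t addr = (uint16_t)((buf[pos] << 8) | buf[pos + 1]);
        pos += 2;
        for (uint8_t i = 0; i < count; i++)
            vram[(addr + i) & 0x3FFF] = buf[pos++];
    }
}

/* Usage example: two records followed by the terminator. */
static void records_example(uint8_t *vram)
{
    static const uint8_t a[] = { 0x11, 0x22, 0x33 };
    static const uint8_t b[] = { 0xAA };
    uint8_t list[16];
    size_t pos = 0;
    pos = put_record(list, pos, 0x0000, a, sizeof a);
    pos = put_record(list, pos, 0x1800, b, sizeof b);
    list[pos++] = 0;                 /* terminator record */
    assert(pos <= sizeof list);
    replay_records(list, vram);
}
```

On the real machine the replay loop would set the VDP write address once per record and then stream the bytes, so the per-record overhead is paid once no matter how long the run is.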

 

My current scrolling routine already sets VDP address registers very seldom (that is the reason for the vertical strips).

 

But before I go there, I want to first play around with TMS9900 code without help by the Cortex M4 core since that is kind of cheating, although I will be more than happy to cheat once I get to that part of the project 😁

 

But I did do some benchmarking of my prototype scrolling code, comparing the same code on some of the different systems I have:

  • TI-99/4A (8.3 seconds)
  • My "legacy" EP994A FPGA core (around 0.7s @ 100MHz) with 31 wait states but cache enabled.
  • My "new" Icy99 core on the ICE40HX (around 1.5s @ 25 MHz).

I could not run my EP994A FPGA core at full speed due to some stupid versioning reason, and I did not want to run the synthesis again. I did find a bug in my Icy99 core: on the ICE40HX platform the VDP memory is located in external SRAM. It is actually a UMA system (unified memory architecture - just one memory bus shared between CPU, VDP, and DMA from host). With the scrolling routine putting "a lot" of memory transfer load between the CPU and VDP, the VDP screen refresh engine sometimes runs too slowly, which annoyingly causes a temporary loss of vertical sync, i.e. screen flashes. It is funny how the different projects reveal new issues. By keeping the code in TMS9900 assembly it is actually a lot of fun to test drive it on all of these other systems.

11 hours ago, speccery said:

Thanks Lee, that is interesting. One silly question (I have only written the receiving end of a DSR, i.e. the actual DSR): the calling convention you mention is:

	BLWP @DSRLNK    
	DATA 8

I am curious about the use of the parameter 8. How is it used? If I have understood properly scratchpad location >8356 must be initialised with the PAB structure pointer (offset 9) in VDP memory. Sorry I haven't taken any time to dig into this.

 

The DSRLNK routine reads the data word that follows its call to learn whether the caller wants a high-level device service routine (8), for the device named at the beginning of the file descriptor field in the PAB, or a lower-level subprogram (10), whose name (often only a hex number) is passed in the PAB. When DSRLNK returns, it returns to the address after “DATA 8” or “DATA 10”. For the diskette DSR, 8 is for level 3 DSRs, all of which use the PAB we all know and love, with >8356 pointing to the name-length byte (byte 9) of the PAB in VRAM. Diskette DSR subprogram call 10 is for level 1 and 2 routines. These usually use a 2-byte PAB in VRAM holding a counted string with the name of the subprogram (its name-length byte is pointed to by >8356), a transfer block at FAC or FAC+2, and an additional information block pointed to by the transfer block; both blocks are in CRAM.

 

...lee

 

[EDITs: Corrections per @jedimatt42’s post below and, hopefully, some clarifications are in the color of this note.]

Edited by Lee Stewart
ADDITIONAL INFO and CORRECTIONS

That's funny I've written my TIPI ROM with DSRs long before I ever wrote a DSRLNK or a call to one in assembly... LOL... 

 

To be strict, it is the DSRLNK routine in question that reads that word at the return address to determine which NameLink list to search in the ROM headers.

 

8 bytes into the ROM header is where the device service routine list for each ROM begins, and 10 bytes into the ROM header is where the subroutine ( CALL FILES, level2 management routines, etc ) list begins. 

 

Conveniently TI spec'ed both lists with the same structure, so one name searching routine with an offset is all that is needed. Many ( most? ) DSRLNK routines floating around expect this, but not all. 
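The "one name searching routine with an offset" idea can be sketched in C as follows. This is only an illustration under assumptions: the node layout used here (2-byte link to the next node, 2-byte entry address, 1-byte name length, then the name) is the commonly documented DSR header format as I understand it, and `find_entry`, `rd16` and `build_demo_rom` are hypothetical names, not code from any actual DSRLNK. A real DSRLNK also steps through the CRU addresses and extracts the device name up to the period, which is left out here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROM_BASE 0x4000u   /* DSR ROMs are mapped at >4000 */

/* Read a big-endian word at a TMS9900 address inside the ROM image. */
static uint16_t rd16(const uint8_t *rom, uint16_t addr)
{
    uint16_t off = (uint16_t)(addr - ROM_BASE);
    return (uint16_t)((rom[off] << 8) | rom[off + 1]);
}

/* list_offset is the DATA word of the call: 8 (devices) or 10 (subprograms).
   Returns the entry address of the matching routine, or 0 if not found. */
static uint16_t find_entry(const uint8_t *rom, unsigned list_offset,
                           const char *name, uint8_t name_len)
{
    if (rom[0] != 0xAA)                       /* valid header marker */
        return 0;
    uint16_t node = rd16(rom, (uint16_t)(ROM_BASE + list_offset));
    while (node != 0) {
        uint16_t off = (uint16_t)(node - ROM_BASE);
        uint8_t len = rom[off + 4];
        if (len == name_len && memcmp(&rom[off + 5], name, len) == 0)
            return rd16(rom, (uint16_t)(node + 2));
        node = rd16(rom, node);               /* follow link to next node */
    }
    return 0;
}

/* Made-up ROM image with one device ("TIPI") and one subprogram (>10). */
static void build_demo_rom(uint8_t *rom)
{
    static const uint8_t image[] = {
        0xAA, 0x01, 0x00, 0x00,               /* >4000 marker, version, unused */
        0x00, 0x00,                           /* >4004 power-up list (empty)   */
        0x00, 0x00,                           /* >4006 program list (empty)    */
        0x40, 0x10,                           /* >4008 device list -> >4010    */
        0x40, 0x1A,                           /* >400A subprogram list -> >401A */
        0x00, 0x00, 0x00, 0x00,               /* >400C..>400F                  */
        0x00, 0x00, 0x41, 0x00, 4, 'T', 'I', 'P', 'I', 0,   /* node at >4010 */
        0x00, 0x00, 0x42, 0x00, 1, 0x10                     /* node at >401A */
    };
    memcpy(rom, image, sizeof image);
}
```

Because both lists share the node format, the only difference between a "DATA 8" and a "DATA 10" call in this sketch is the header offset passed to `find_entry`.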

 

 

2 hours ago, jedimatt42 said:

That's funny I've written my TIPI ROM with DSRs long before I ever wrote a DSRLNK or a call to one in assembly... LOL... 

 

To be strict, it is the DSRLNK routine in question that reads that word at the return address to determine which NameLink list to search in the ROM headers.

 

8 bytes into the ROM header is where the device service routine list for each ROM begins, and 10 bytes into the ROM header is where the subroutine ( CALL FILES, level2 management routines, etc ) list begins. 

 

Conveniently TI spec'ed both lists with the same structure, so one name searching routine with an offset is all that is needed. Many ( most? ) DSRLNK routines floating around expect this, but not all. 

Thanks, now this makes sense! I did the same, wrote a DSR before starting to think how they get called. The 99/4A would not be the 99/4A, unless the PABs and file buffers were in VDP RAM too :) That was my first thought when working on the DSR. Of course for an unexpanded computer that's nearly all RAM one's got.


Hi,

  Speccery, when we last met here, the thread had started another branch regarding the DSRLNK.

  I was wondering if you have had time to work on things like Pascal Emulator, maybe console grom over-ride or just getting what you have ready for people to try out?

Thanks,

Dan

 

On 9/16/2020 at 12:43 AM, dhe said:

Hey Speccery,

  Come back, don't be a stranger!

..and I am back.  No intention of being a stranger. Sorry for the long absence, not intended. It's just that I have been very busy with work etc, more stress due to numerous reasons not under my control. In situations like these I tend to use my spare time for exercise to cope with everything that's going on.

 

In addition to StrangeCart, I'll also discuss some other random off-topic stuff I've worked on a little, in case someone finds it interesting.

 

Anyway, I've only had a little time to work on the StrangeCart, and I've used that time to get the rest of the software plumbing on the cart working properly. This has actually been done standalone, without the TI connected.

 

One of the things I've accomplished is to get the USB serial device running on the microcontroller. This removes some of the wiring mess, as I no longer need to use an external FTDI USB to serial converter to talk to the MCU (view debug messages, issue debug commands). It's also now working the way I wanted: the M0+ core is spending its time monitoring and serving the external bus (i.e. the TI cartridge port), while the more powerful M4 core is handling USB code with interrupts enabled.

 

One thing I want to implement (and this is going to require a little GPL programming, which I have never done) is to modify the standard Mini Memory cartridge running on the StrangeCart so that I could save the 4K SRAM contents into the MCU's flash memory, to achieve the same retention capability the original Mini Memory had with its battery-backed RAM.

 

The other retro computer related stuff I have done is some kind of random rambling on different things: on the TI-99/4A front I noticed that someone has ported my EP994A TI-99/4A FPGA core for the MIST. I tried that out, seemed to work well. I really want to port my newer and better organised icy99 core for the MIST as well. As a small exercise and to see that my toolchain with Altera FPGAs is working, I did synthesise the MIST TI version myself as well. Kind of strange to get back to my own code, that someone else has ported for the Altera platform.

 

Another project I also built (this was almost too fast to enjoy) was the RC2014 kit I won back in 2016 for the retrochallenge. I can't believe it took me this long - after moving to our current home this summer I found the kit. It wasn't really lost, it was where I left it, but I kinda forgot that I had it. Anyway I enjoyed building it - this actually is probably the first Z80 system I have built, even if Z80 was the CPU I first learned assembly programming on. The first time I applied power, it worked. In its current form it is one of the base versions of the kit, having just backplane, Z80 CPU, 32K RAM, a 64K ROM chip and serial interface. Now that all of this is working I did order a couple of add-on boards from Tindie, as I want to make my system CP/M capable.

 

The RC2014 is such a simple and lovely system, I feel I am almost obligated to next build a TMS9995 CPU board for it...


One of the things I learned from watching "HIGH SCORE" on Netflix was that some of the games for the SNES (Star Fox) had a co-processor on the cart. I have several ideas where something like that could be useful. The thing I don't understand is how a co-processor on a cart would interact with the normal CPU.

6 hours ago, Asmusr said:

One of the things I learned from watching "HIGH SCORE" on Netflix was that some of the games for the SNES (Star Fox) had a co-processor on the cart. I have several ideas where something like that could be useful. The thing I don't understand is how a co-processor on a cart would interact with the normal CPU.

On the SNES and the Genesis, the system was basically able to get a frame buffer image from the co-processor (over-simplified, but essentially). On the TI we don't have the bandwidth to do exactly that.

 

Interaction between the CPUs isn't too hard though... a common approach is to have a small amount of shared memory that both processors can access (so on the TI, it'd have to be on the cartridge) that can be used to pass information back and forth. It's probably enough to have space for data, and a single byte for control of that data (so that as long as you write that control byte last, you don't need to be too worried about the integrity of the data). For data that must be carefully controlled, I've used Peterson's algorithm successfully in the past and found it pretty simple: https://en.wikipedia.org/wiki/Peterson's_algorithm
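The "write the control byte last" idea can be sketched as a one-slot mailbox. This is a host-side C11 illustration only: the `mailbox` struct, its size and the function names are assumptions, and the C11 atomics stand in for whatever ordering guarantees the real shared cartridge memory gives. On the TMS9900 side the equivalent would simply be a plain byte write issued after the data writes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define MBOX_DATA_SIZE 32

/* One-slot mailbox in shared memory (hypothetical layout). */
struct mailbox {
    uint8_t data[MBOX_DATA_SIZE];
    atomic_uchar ready;      /* control byte: 0 = empty, else payload length */
};

/* Producer: fill the data area first, publish the control byte last.
   Returns 1 on success, 0 if the slot is still full or len is invalid. */
static int mbox_post(struct mailbox *m, const void *src, uint8_t len)
{
    if (len == 0 || len > MBOX_DATA_SIZE)
        return 0;
    if (atomic_load_explicit(&m->ready, memory_order_acquire) != 0)
        return 0;                               /* previous message not consumed */
    memcpy(m->data, src, len);
    atomic_store_explicit(&m->ready, len, memory_order_release);
    return 1;
}

/* Consumer: look at the control byte first, copy the data out,
   then clear the control byte to hand the slot back.
   Returns the payload length, or 0 if nothing is waiting. */
static uint8_t mbox_take(struct mailbox *m, void *dst)
{
    uint8_t len = atomic_load_explicit(&m->ready, memory_order_acquire);
    if (len == 0)
        return 0;
    memcpy(dst, m->data, len);
    atomic_store_explicit(&m->ready, 0, memory_order_release);
    return len;
}
```

Since each side only ever writes the control byte when the other side is not allowed to touch the data area, a single producer and a single consumer need no further locking; something like Peterson's algorithm only becomes necessary when both sides may update the same data.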

 


Well I got the first version of the madness with multiple processors working. It is not anything exciting visually (yet - I hope that changes soon).

I did a test program, which just fills the screen with 'E' characters in graphics mode 2. By sending a command to the USB serial port of the cartridge, the software does the same again. Not very exciting, right? Right. It's not what, but how. What follows is a bit technical, a nice interplay of three processors:

 

What happens in this test first is that the TI-99/4A boots normally. During boot-up the strangecart does a bunch of stuff:

- the M4 processor core starts, and initializes the hardware of the MCU

- then the M4 copies from its internal Flash memory a small TMS9900 cartridge program to one of the internal SRAM blocks (SRAM 3).

- next the M4 copies from its internal Flash memory a small Cortex M0+ program to another internal SRAM block, and starts this program. It's what I call the "busserver" program, where the M0+ sits in a tight loop waiting for cartridge port accesses, either to ROM or GROM. If it sees a ROM or GROM read request by the TMS9900, it fetches the corresponding data from SRAM 3 and presents it to the TMS9900 on the normal data bus. And again it goes.

 

At this point the TI-99/4A is still sitting at its boot screen. When the user pushes a key, the normal menu is presented. One of the options is my test program strange1.

 

If the user chooses this, the strange1 cartridge starts. There is some TMS9900 code to initialize the VDP to GM2. After this the TMS9900 writes to a magical memory location, namely >7FFE. The TMS9900 then sits in a loop, checking whether the word at memory location >6166 becomes non-zero. When it does, the TMS9900 branches to the address stored there.

; Signal the StrangeCart that we're ready to go!
        INC  @>7FFE          ; Write cycle to >7FFE
        LWPI WRKSP
; Wait here until the StrangeCart stores a non-zero jump vector
STOP    MOV  @VECTOR,R0
        JEQ  STOP
; We did get a jump vector from StrangeCart. Go there.
        B    *R0
VECTOR  DATA 0               ; Located at >6166, filled in by the M4
        END  MAIN

The M0+ core maintains a counter of all writes to >7FFE. This counter is read by the M4 core. When the M4 notices that the counter has changed, it knows the TMS9900 is now waiting for something to do.

 

Before even checking for the counter to change, the M4 dynamically generates some code, from TMS9900 address >6200 onwards (the following routine is executed by the M4 core):

void vector_tms9900(volatile struct bus_config *bc)
{
	static int count = 0;
	int start = 0x200;	// Our offset for writing the program to execute.

	unsigned vdp_addr = 0;
	int cells = 0;
	while(cells < 768) { 
		unsigned char *p = &__base_Ram3_32[start];
		p = put_magic(p);
		p = put_lwpi(p, 0x8C00);
		p = put_li(p, 1, (vdp_addr & 0xFF) << 8);		// VDP write pointer to beginning of VRAM 1/2
		p = put_li(p, 1, 0x4000 | (vdp_addr & 0x3F00));	// VDP write pointer to beginning of VRAM 2/2
		// Let's write as many characters as we can.
		while(p < &__base_Ram3_32[8*1024-32] && cells < 768) {
			const unsigned char e_char[] = { 0, 0xFE, 0x80, 0x80, 0xFE, 0x80, 0x80, 0xFE };
			for(int j=0; j<8; j++)
				p = put_li(p, 0, e_char[(j+count) & 7] << 8);
			vdp_addr += 8;
			cells++;
		}
		// Ok cannot write more E chars.
		// Write an instruction to return back to where we started.
		p = put_b(p, 0x6156);
		// Finally send the TMS9900 on its merry way.
		launch_tms9900(bc);
		long_wait_magic(bc);	// wait until done
	}
	count++;
}

In the above, the function put_magic writes an "INC @>7FFE" instruction into memory, put_lwpi writes an LWPI instruction, and put_li writes a load immediate instruction (the arguments are the register number and the data word to be loaded). Finally put_b(addr) simply writes a "B @ADDR" instruction. These four opcodes are all that the test software uses for now.
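The post does not show the bodies of these helpers, so the following is only a sketch of how they might look, not the actual StrangeCart source. It assumes the standard TMS9900 encodings (LI Rn = >0200+n, LWPI = >02E0, B @addr = >0460, INC @addr = >05A0) and big-endian word order; `put_word` is a helper name introduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Store one 16-bit word big-endian, the byte order the TMS9900 uses. */
static unsigned char *put_word(unsigned char *p, uint16_t w)
{
    *p++ = (unsigned char)(w >> 8);
    *p++ = (unsigned char)(w & 0xFF);
    return p;
}

/* LI Rn,value : opcode >0200 + register number, then the immediate. */
static unsigned char *put_li(unsigned char *p, int reg, uint16_t value)
{
    p = put_word(p, (uint16_t)(0x0200 | (reg & 0xF)));
    return put_word(p, value);
}

/* LWPI wp : opcode >02E0, then the new workspace pointer. */
static unsigned char *put_lwpi(unsigned char *p, uint16_t wp)
{
    p = put_word(p, 0x02E0);
    return put_word(p, wp);
}

/* B @addr : opcode >0460 (symbolic addressing), then the target. */
static unsigned char *put_b(unsigned char *p, uint16_t addr)
{
    p = put_word(p, 0x0460);
    return put_word(p, addr);
}

/* INC @>7FFE : opcode >05A0 (symbolic addressing), then the address. */
static unsigned char *put_magic(unsigned char *p)
{
    p = put_word(p, 0x05A0);
    return put_word(p, 0x7FFE);
}
```

Each helper emits exactly 4 bytes and returns the advanced pointer, which is what allows the chained `p = put_...(p, ...)` style used in vector_tms9900().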

 

Thus from the viewpoint of the TMS9900, what the above vector_tms9900() function does is simply this: it first inserts the magic instruction, sets the workspace pointer (W) to point at the VDP write port, and then loads a load of stuff to the VDP. In this case that consists of E characters, as defined in the e_char array. The last instruction written is the B @>6156 instruction, which branches back to the cartridge stub.

 

As the cartridge space is 8K and the first >200 bytes are reserved for the cartridge stub, there is only 7.5K of space for generated code. Each byte written to the VDP takes 4 bytes of cartridge space in my case, since each LI R0,VALUE instruction takes 4 bytes. This is the fastest method to write data to the VDP. The payload is the high byte of VALUE. Writing all 768 characters thus requires 768*8*4=24576 bytes of ROM code. As this does not fit into one bank of cartridge space, the M4 writes as many instructions as possible into the cartridge space. Once the space gets tight, it launches the TMS9900 with the launch_tms9900() function. The launch_tms9900() function waits for the magic variable to increment before returning. As can be seen from the above, the first instruction in the dynamically created TMS9900 code is the one written by put_magic(). Interestingly the M4 core needs to wait a long time for this to happen, between 200 and 400 iterations of a C code NOP loop...
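The space arithmetic above can be checked with a small C sketch that mirrors the loop bounds in vector_tms9900(). The constants are read off that listing (start offset >200, a 16-byte preamble of four 4-byte instructions, a 32-byte guard at the end of the 8K window); the function names are introduced here for illustration.

```c
#include <assert.h>

#define CART_SIZE       (8 * 1024)  /* cartridge window, bytes              */
#define STUB_SIZE       0x200       /* reserved for the cartridge stub      */
#define PREAMBLE        16          /* put_magic + put_lwpi + 2 x put_li    */
#define GUARD           32          /* inner loop stops 32 bytes before end */
#define BYTES_PER_CELL  (8 * 4)     /* 8 pattern bytes, one 4-byte LI each  */
#define TOTAL_CELLS     768

/* Number of character cells the inner loop emits before space runs out. */
static int cells_per_pass(void)
{
    int p = STUB_SIZE + PREAMBLE;   /* offset where cell data starts */
    int cells = 0;
    while (p < CART_SIZE - GUARD) {
        p += BYTES_PER_CELL;
        cells++;
    }
    return cells;
}

/* Number of generate-and-run passes needed to cover the whole screen. */
static int passes_needed(void)
{
    int per = cells_per_pass();
    return (TOTAL_CELLS + per - 1) / per;   /* round up */
}
```

Under these assumptions each pass holds 239 cells, so filling all 768 cells takes 4 passes, with the last pass carrying only the remaining 51 cells.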

 

long_wait_magic() then waits for the TMS9900 to execute the whole bunch of code. This normally takes 150 000 iterations of a NOP loop on the M4 core. The magic variable increments when the TMS9900 is done writing the stuff to the VDP and jumps to address >6156. This corresponds to the first instruction of the TMS9900 code snippet above.

 

That took a bit of explaining, but it is very simple really. It's just a sequence of events, managed by the M4 processor, which in this example effectively reduces the TMS9900 to a write-to-VDP engine. In a more practical example the M4 would render the next frame, i.e. the next stream of TMS9900 instructions, while the TMS9900 is running the previous stream. In the function above that rendering would happen just before the long_wait_magic() call.

 

 

On 9/19/2020 at 4:47 AM, Tursi said:

On the SNES and the Genesis, the system was basically able to get a frame buffer image from the co-processor (over-simplified, but essentially). On the TI we don't have the bandwidth to do exactly that.

 

Interaction between the CPUs isn't too hard though... a common approach is to have a small amount of shared memory that both processors can access (so on the TI, it'd have to be on the cartridge) that can be used to pass information back and forth. It's probably enough to have space for data, and a single byte for control of that data (so that as long as you write that control byte last, you don't need to be too worried about the integrity of the data). For data that must be carefully controlled, I've used Peterson's algorithm successfully in the past and found it pretty simple: https://en.wikipedia.org/wiki/Peterson's_algorithm

 

Moving off topic, but would a co-processor in a sidecar cartridge be able to push data to the VDP at the same time as the main CPU executed a program on the 16-bit bus in scratchpad? 

6 hours ago, Asmusr said:

Moving off topic, but would a co-processor in a sidecar cartridge be able to push data to the VDP at the same time as the main CPU executed a program on the 16-bit bus in scratchpad? 

My instinct is no... I am pretty sure the 8-bit bus still sees activity even with 16-bit accesses (just that the multiplexer doesn't kick in). I would have to double check (but am feeling lazy), but I thought the VDP was attached to the 16-bit side of the bus anyway...

 

The 9900 has the ability to suspend for external DMA masters, but TI locked that pin down instead of exposing it to the expansion port. Not that that's exactly what you had in mind either. :)

On 10/8/2020 at 7:09 PM, Asmusr said:

Moving off topic, but would a co-processor in a sidecar cartridge be able to push data to the VDP at the same time as the main CPU executed a program on the 16-bit bus in scratchpad? 

The answer is no, like @Tursi wrote. The VDP actually sits on the high byte of the 16-bit bus. The 16-to-8 multiplexer sits between the 16-bit and 8-bit buses. There is no DMA capability anywhere, let alone from an external 8-bit source to the 16-bit bus.

 

We could build another VDP which could reside in the sidecar, with its own memory, co-processor and potential mass storage. However, this VDP could not respond to normal VDP accesses without modifying the console. Perhaps the F18A Mk2 could be used to do something along these lines too.

Edited by speccery
