Assembly on the 99/4A

+Lee Stewart · April 6, 2013

...

No. Byte operations always require the CPU to load the full word first. For instance, if you want to write 01 to address A000, you expect that address A001 remains the same. However, because the CPU is 16-bit, it cannot move half a word out of the ALU. What it does instead is load the complete word at A000, keep the value of A001 in internal storage, modify the byte at A000, and write the complete word back.
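As a rough illustration of that read-before-write sequence, here is a minimal Python sketch. The `write_byte` helper and the dictionary used as memory are hypothetical modelling choices, not taken from any emulator; the only hardware fact assumed is that the even address holds the high byte of the word.

```python
def write_byte(memory, address, value):
    """Model a byte write on a CPU that can only transfer whole 16-bit words.

    `memory` maps even word addresses to 16-bit values (a hypothetical
    model, not how any particular emulator stores RAM).
    """
    word_address = address & 0xFFFE          # both bytes live in the same word
    word = memory.get(word_address, 0)       # 1. read the complete word first
    if address & 1:                          # odd address: low byte of the word
        word = (word & 0xFF00) | (value & 0xFF)
    else:                                    # even address: high byte of the word
        word = (word & 0x00FF) | ((value & 0xFF) << 8)
    memory[word_address] = word              # 2. write the complete word back
    return word


ram = {0xA000: 0x12AB}
write_byte(ram, 0xA000, 0x01)                # A001 (the low byte, AB) is untouched
```

The extra read in step 1 is the cost being discussed: it happens even though only one byte changes.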

...

Michael...

With your experience, you would certainly know better than I and the docs for the TMS9900 support you---except for the situation I mentioned above, viz., workspace register indirect auto-increment, in which case the docs say that MOVB bests MOV by 2 clock cycles (4, if source and destination are both auto-incremented registers). Have you found otherwise?

...lee

Tursi · April 6, 2013

I'll try not to repeat too much information, although it's easy to do so with a list.

For 1, the VDP delay question, we had a huge investigation on the Yahoo group a few years ago. I'm the one who put a logic analyzer on the bus and measured instruction and turnaround time to the VDP. On a standard console, whether you are in scratchpad RAM or not, writes cannot overrun the VDP because of the additional delay induced by the CPU's read-before-write cycle. Reads appear to be unable to overrun the VDP, unless you use the fastest possible instruction time (IIRC, it was MOVB R0,R1 -- note no indirection and no increment). However, there is a case where after setting the VDP address, and then doing a fast read instruction (such as one involving only registers), you MIGHT overrun the VDP. (For instance, write the address then immediately do a MOVB *R1,<anything>, where R1 points to the VDP read data address.) This is because the time between the write to the VDP for setting the address and the read from the VDP for getting the byte may be too short. Using an absolute address (MOVB @VDPRD,<anything>) seems to give you enough time with the extra memory read to be safe. As Mizapf notes, faster machines like the Geneve may need a delay (note the Geneve also has a different video chip), and accelerated TI consoles (faster crystal) may as well. These are relatively rare, however.

Emulators do not appear to care about the delay today. Classic99 does not at this time.

There ARE periods at which the delay is not needed. After VSYNC is the only predictable one (because there is no external indication of the other windows). The TMS9918 data manual notes that there are 4300uS of CPU access after a vertical blank starts -- some of this time is eaten by the TI interrupt routine if you leave it on. If the blank bit is enabled, so that the display is not being drawn, the CPU also gets full speed access to memory.

The reason for this is that CPU accesses to VRAM are given 'access windows' depending on the VDP's exact mode. There is always a 2uS delay, and then additional time is added depending on the current mode. You'll find the table on page 2-4 of the datasheet, but basically, in text mode you need 3.1uS, in bitmap or graphics mode you need 8uS, in multicolor you need 3.5uS, and when the display is off or during vertical blank, you need 2uS. Most 9900 instructions take far longer than these times, and so TI's recommendation was pretty much just a guarantee, as well as future-proofing for faster CPUs.

2. RMW - as Mizapf notes, there's no way to do it. If possible, it may be helpful to 'double buffer', keep a copy of a VDP memory area in CPU RAM, and then you only need to push, never read. I'm not sure that helps you for a line draw function, though.

3. Mentioned a few times, but yeah, SOC, SZC, ANDI and ORI are your best bet for bit manipulation. You can use a table of bits with SOC or SZC to remove the need for shifting.

4. LI vs MOV is well covered. As an extra benefit, LI does not do a read-before-write when loading the register, one of the rare instructions that doesn't. When MOV can be considered for performance is when you replace @ONE with a register. Because memory performance is the biggest bottleneck on the TI, often the /length/ of your instruction is more important than the work it does.

5. Byte instructions on the 9900 are the same speed as their word equivalents, except when doing Workspace Register indirect autoincrement (*R1+), where a byte operation is 2 cycles faster.

6. I've been thinking about page-flipped monochrome bitmap lately, actually. My current thinking is that the best way to do it is to mess with the masks to make the color table as small as possible, then copy the offending data from another place in memory. So, a /perfect/ page-flip, no, but one that both fits in RAM and updates in VBLANK, yes, I think so. I still have to find time to try it.

7. The built-in KSCAN is mostly slow only because of all the built-in debounce, and it relies on hard-coded addresses (and GROM tables), but I wouldn't call it lousy. There are lots of alternatives out there.. search for the Zombie MOTIF game here in this forum somewhere - I wrote a very bare-bones KSCAN for that that gives just basic key capability.

8. As noted, yes, you have to copy your data into scratchpad. You can't count on an AORG because you can't count on the loader not using the scratchpad itself. It's not very big, so this is not a huge concern (Parsec copies several routines on the fly).
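To make point 3 concrete, here is a small Python model of the bit-table idea. The names `BIT_TABLE`, `set_pixel` and `clear_pixel` are made up for illustration; `soc` and `szc` only mimic the logical effect of the instructions (OR, and AND-NOT) and say nothing about timing or status bits.

```python
# One mask per pixel column within a pattern byte, leftmost pixel first.
# Looking the mask up replaces a variable-count shift.
BIT_TABLE = [0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01]


def soc(source, dest):
    """Set Ones Corresponding: every 1 bit in source is set in dest."""
    return (dest | source) & 0xFFFF


def szc(source, dest):
    """Set Zeros Corresponding: every 1 bit in source is cleared in dest."""
    return dest & ~source & 0xFFFF


def set_pixel(pattern_byte, x):
    """Turn on the pixel at column x (only x mod 8 matters within the byte)."""
    return soc(BIT_TABLE[x & 7], pattern_byte)


def clear_pixel(pattern_byte, x):
    """Turn off the pixel at column x."""
    return szc(BIT_TABLE[x & 7], pattern_byte)
```

In assembly the table would presumably be indexed with the low three bits of X, so the per-pixel cost is one indexed SOC or SZC rather than a shift plus a logical operation.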

Tursi · April 6, 2013

To extend #8: If your scratchpad code doesn't need any absolute memory addresses inside itself (ie: it doesn't use B @ and doesn't store any data inside itself and doesn't self-modify....), you can simply inline it. For instance:

 AORG >A000

START
 LI R1,MYCODE * START OF CODE FOR SCRATCHPAD
 LI R2,>8320     * ADDRESS TO STORE IT AT (SCRATCHPAD IS >8300->83FF)
LP
 MOV *R1+,*R2+ * COPY A WORD
 CI R1,ENDCODE * CHECK IF DONE
 JNE LP         * KEEP GOING

* ALL DONE, JUMP INTO SCRATCHPAD CODE
 B @>8320

* THIS CODE IS ASSEMBLED HERE, BUT AS THE ADDRESSING
* IS ALL RELATIVE, IT WILL RUN AT ANY ADDRESS
MYCODE
 LI R0,>0040     * BYTE-SWAPPED ADDRESS
 MOVB R0,@>8C02
 SWPB R0
 MOVB R0,@>8C02 * SET VDP WRITE ADDRESS TO >0000
 LI R0,>2000     * SPACE CHARACTER
 LI R1,768     * COUNT
LP2
 MOVB R0,@>8C00 * WRITE BYTE TO VDP DATA
 DEC R1         * COUNT DOWN
 JNE LP2

* ALL DONE, JUST WAIT FOR QUIT
 LIMI 2         * ENABLE INTERRUPTS
 JMP $         * JUMP TO OURSELF
ENDCODE

 END

That should compile and run - assuming many things it should clear the screen from scratchpad. If you use an emulation with a debugger, you can step through the code and see it work. When it's done, FCTN-= (QUIT) to reset.

Caveat, I didn't test it here, but it should be right.

Edited April 6, 2013 by Tursi

Opry99er · April 6, 2013

I remember the "over-run the VDP" tests.

If I remember right, there was a test with a big arrow on the screen---

Didn't Marc do the dirty work on that one, or am I mistaken?

Tursi · April 6, 2013

@Owen: Depends on what you define as "the dirty work", a few people were involved. Thread is here: http://tech.groups.yahoo.com/group/TI994A/message/615

@RamusM: I looked a bit more at the bitmap page flipping.. the best that the masking will give with a full pattern screen is 256 color entries, so the minimum full screen page flip would still involve writing those 256 characters (at 8 bytes each, which is 2k!) That's still better than 6k, but it won't fit inside the vblank (MIGHT be close enough to beat the beam, though, if you go top to bottom...?). The thing that keeps throwing me when I think of cool bitmap mode hacks is that the color table and the pattern table are both tied to the same character index.

Opry99er · April 6, 2013

Thanks for the link. I haven't spent much if any time on the Y list since the AA forum was created.

And I stand corrected. There were quite a few fellas on that project. =)

Asmusr · April 6, 2013

Thanks for all the replies. Regarding the bitmap page flipping, I see two alternatives to the option proposed by Tursi. One is to use the bitmap + text mode with only two character sets - otherwise there is not enough room for the screen image table within 8K. The problem with that is that the F18A doesn't support this mode according to this link: http://www.msx.org/forum/msx-talk/hardware/v9938v9958-tms991829-pcb-adaptor?page=1

Another option is to use standard graphics mode with only a 16x16 character (128x128 pixel) area for bitmap drawing, but that would be rather lame.

+mizapf · April 6, 2013

You are possibly right about the *Rx+ cycles; I did not pull out the docs to check. I may have done it right in MESS, as I closely followed the specs with all the flow charts and so on, but I just don't remember right now. So the claim is that MOVB R0,*R1+ is faster than MOV R0,*R1+ ... I'll see if I did that correctly inside the emulation.

There is another funny thing with these delays. If you have a look at the specs of the TMS9995 (used in the Geneve and TI-99/8) you will notice that the SWPB instruction is extremely slow, compared to the others. For instance, on the TMS9900, A takes 14 cycles and SWPB 10 (plus additional memory cycles). On the 9995, while A takes 4 cycles, SWPB requires 13 with the same setting (all operands on-chip), and is thus twice as slow as DIV. I had to consider that when I re-implemented the CPUs in MESS, and had to explicitly add dummy cycles to get that delay.

I cannot really imagine why SWPB (a particularly simple command) should take that many cycles. My guess is that this is an intentional delay, just related to the issue with delays when accessing slow devices like the VDP. With this slowed down execution you could stay with the common MOVB/SWPB/MOVB sequence when setting the address without changing the source code. (Just a wild guess, yes.)

jens-eike · April 6, 2013

I cannot really imagine why SWPB (a particularly simple command) should take that many cycles. My guess is that this is an intentional delay, just related to the issue with delays when accessing slow devices like the VDP. With this slowed down execution you could stay with the common MOVB/SWPB/MOVB sequence when setting the address without changing the source code. (Just a wild guess, yes.)

Another wild guess: is SWPB implemented as a (circular) shift? That could explain the sloooooow execution, even on the 9995.

+mizapf · April 6, 2013

I also thought about some internal SRC when I tried to figure out how the respective cycle count can be explained for various commands. It seems to be indeed plausible for TMS9900, but as you see, it takes even more cycles on the TMS9995.

In the document "9900-FamilySystemDesign-04-HardwareDesign.pdf", page 4-89 ff. you can find the microprograms for the TMS9900 (at least as far as needed to explain the cycle count per command); I used it extensively for the re-implementation in MESS. The SWPB command is handled together with CLR, SETO, INV, NEG, INC(T), and DEC(T); the shifts look very different. Still, it could be that some portion of the ALU is re-used in both cases.

Unfortunately there is no such document for the 9995, so I had to guess what the microprograms might look like (with all that prefetching and so on). And this was right there when I ran into that issue with SWPB. The 9995 is a huge improvement over the 9900, which becomes clear when you see how few (external) cycles it actually uses, but surprisingly not in that case.

In fact, the 9995 is driven by a 12 MHz clock but outputs a 3 MHz external clock line. First I thought the 12 MHz is simply divided by 4, but when I saw how far the number of cycles has been reduced from the 9900, I guess the 12 MHz are well needed inside the CPU for driving the microprograms. While in the 9900 you could well imagine that this happens with this clock tick, now that is the next step and so on, in the 9995 you simply cannot believe how most operations can be handled in the seemingly few clock ticks visible on CLKOUT. There is much more happening inside.

+InsaneMultitasker · April 6, 2013

There is another funny thing with these delays. If you have a look at the specs of the TMS9995 (used in the Geneve and TI-99/8) you will notice that the SWPB instruction is extremely slow, compared to the others. For instance, on the TMS9900, A takes 14 cycles and SWPB 10 (plus additional memory cycles). On the 9995, while A takes 4 cycles, SWPB requires 13 with the same setting (all operands on-chip), and is thus twice as slow as DIV. I had to consider that when I re-implemented the CPUs in MESS, and had to explicitly add dummy cycles to get that delay.

I noticed this on the 9995 data sheet last night while looking for any last optimizations I could make to a new code segment. I was quite surprised at how many cycles SWPB requires. Out of curiosity, when you implemented this in MESS, did you validate the timing of this instruction on real hardware? I only ask because my first thought was "is this a printed mistake?"

matthew180 · April 6, 2013

Having implemented the 9900 CPU in HDL, I don't think the 9995 was a complete redesign of the 9900. That would have cost too much, taken too long, and been a ton of unnecessary rework. The differences are too minimal.

As for the clock, the 9900 uses a 3MHz 4-phase clock, with *each phase* being 3MHz. The internal FSM of the CPU is clocked with a change in *each* phase, i.e. at 12MHz, however an external machine *cycle* is at the 3MHz frequency since the FSM steps (microcode operations) are indivisible at the software level. The 9995 simply removes the 4-phase input and uses a single 12MHz clock, probably to free up 3-pins on the package.

I have never looked at the cycle counts for the 9995, but I can't believe that SWPB is slower than DIV, and twice as slow? SWPB on the 9995 probably takes longer than on the 9900 because both bytes have to be fetched in two separate memory operations, where on the 9900 it is a single memory op. Implementing SWPB as a shift would be hard to set up, and when you are talking about hardware, engineers avoid adding complexity and extra gates at all costs. The shift instructions are very specific and have their own instruction group which only consists of the shift instructions. I don't think the ALU is used in the shift instructions at all.

The *instruction* (and only the instruction) prefetch in the 9995 is actually pretty basic since the CPU does not do any hazard detection, which would complicate the design. I almost did the same thing in my 9900 implementation because I saw exactly where an instruction pre-fetch could be performed, but it would have made my CPU incompatible with the 9900.

+mizapf · April 6, 2013

In my opinion the differences between 9995 and 9900 are not negligible. For one, it has an 8-bit data bus output (but a 16 bit architecture), so the whole operation of the 16/8 data bus converter is inside the chip. The 9995 makes good use of this by dropping the time-consuming read-before-write known from the 9900. Then you have the prefetch feature which is not present for the 9900. Also, the interrupt handling is pretty different (watch the flow charts); you do not have 16 interrupt levels anymore.

In fact, when you compare execution speeds of the 9980A to the 9900 you can see that there is no real difference, apart from the fact that it also has to split memory accesses as two 8-bit accesses, but otherwise it reacts just like the 9900. Both, as far as I could find out, really share the same microprograms.

In MESS I could exploit this in terms of subclassing 9980A from 9900. For the 9995 I had to start almost from scratch.

As for SWPB I think I remember that I actually measured it. I once wrote a program to do benchmarking (just by doing a loop with the operation and taking the time difference using the RTC in the Geneve), but I cannot tell for sure now that I actually tested SWPB. From the specs you can see that the memory access is not the real problem - you can try the operations in the on-chip RAM so you don't have wait states, and you have the full 16-bit access. And still the processing time is surprisingly high. (If I find some time I'll do a check, but right now I'm a bit busy preparing slides for my lectures next week... )

matthew180 · April 6, 2013

In my opinion the differences between 9995 and 9900 are not negligible. For one, it has an 8-bit data bus output (but a 16 bit architecture), so the whole operation of the 16/8 data bus converter is inside the chip. The 9995 makes good use of this by dropping the time-consuming read-before-write known from the 9900. Then you have the prefetch feature which is not present for the 9900. Also, the interrupt handling is pretty different (watch the flow charts); you do not have 16 interrupt levels anymore.

<snip>

For the 9995 I had to start almost from scratch.

Mmm, not so much. Software emulators tend to be written by programmers from a functional perspective and do not work at all like the real hardware. It is funny, but programmers tend to know very little about how hardware works. Before I started working with HDL and FPGAs I was in the same boat, and I am constantly surprised and amazed at what is really going on inside a chip. Emulators also do not consider the highly common components of the CPU, the data path vs control path, or take into account the propagation delays or FSM working inside the CPU.

The F18A's 9900 is actually more like the 9995 in that it has an 8-bit data bus because I had already implemented the VRAM with an 8-bit width. So the GPU has to do two reads / write to perform word ops. Since the real 9900 and 9995 are microcoded, the extra access would not be such a big deal, and was not an issue when I was working on the F18A GPU.

The interrupt differences would not be such a big change either. Hardware design is *extremely* modular and the interrupt handling would most likely be an isolated block.

I did look at the 9995 datasheet today and I see TI finally implemented a decent DIV and MPY instruction. On the 9900 DIV was a nightmare. A 32-by-16 divide should not take any more than sixteen internal FSM changes. Even when I rolled my own for the F18A GPU I managed a 16-cycle (worst case) performance from the DIV instruction, and I don't consider myself anywhere near a real IC engineer. The 9900 must have had some long combinatorial logic going on in the DIV block that slowed it down so much. The 9995 seems to have fixed that and DIV is very respectable now. Note that the timing for DIV with ST4 set means no division took place and you have to look at the second timing where ST4 is reset. 24 - 46 cycles is pretty good for a worst case scenario where all 16-bits of the divisor had to be considered.

SWPB shows to be between 13 and 22 clocks in the manual I'm looking at. Even the worst case for SWPB is better than the best case for DIV. And the time of 13 clocks best case is pretty close to the 9900's 10 clocks for SWPB.

+mizapf · April 6, 2013

OK, admitted, the DIV is not a good compare against the SWPB (as you said, it stops quickly in some cases), but don't you think the SWPB is still surprisingly slow compared to other operations (in particular when looking at the 9900 cycles)?

+mizapf · April 6, 2013

Also, I don't have experience in real hardware, that is right. This sharing of data paths is something that may indeed be easier in hardware than in software, where you try a lot to keep apart things for the sake of readability and code structure. What I did is to try to simulate each piece of microprogram, but as you said, from a functional perspective. As far as the manuals disclosed to me.

matthew180 · April 6, 2013

<disclaimer> please note that I'm simply having a discussion and not trying to belittle or insult anyone. If you feel offended by anything I said, please try to interpret it in another way, because being offensive is not my intent. </disclaimer>

Hmm, now that I look at the timings on the *9995*, SWPB does appear to use a lot of cycles comparatively. However, it does have to do four memory ops. Also, from the block diagram (pg 3, fig 3) I'm looking at, there is a "swap mux" that looks like it is designed to do exactly this. But the output goes to the temporary regs, and from there it is unclear if the temp regs can be directly written back to RAM. It is possible that the data has to take a ride through the ALU.

As for SWPB on the 9900, at 10 cycles it is actually one of the fastest instructions, with the average being 14 to 16-ish cycles. DIV is the worst at 92-124 cycles, followed by MPY at 52 clocks, then the shift instructions.

Emulating the CPU via the microcode does not really give you much information about the workings of the CPU unless you are treating the microcode for what it really is, i.e. the control signals for the data path. IMO, you would really enjoy learning an HDL and trying your hand at implementing a CPU at the hardware level. Many of the mysteries about a chip's internals come to light when you are trying to do the same thing via hardware.

Of course accuracy in an emulation will have a trade-off in performance if you are working on a multi-tasking OS like Windows, OSX, Unix, etc.

Edited April 6, 2013 by matthew180

Asmusr · April 6, 2013

Drawing lines in CPU RAM and sending the entire 6K image to the VDP for every frame:

http://www.youtube.com/watch?v=xPudgJISSfc&feature=youtu.be

Result: 2-3 frames/second. Hmm, need to improve the speed before this could turn into a game...

lines.txt

Tursi · April 6, 2013

Very nice! Even slow, that's really cool to see running on a TI. Has the illusion of layers.

It looks like you're advanced enough to know the basics for speeding things up, but definitely unroll your clear loop (I find 4 times a good tradeoff if you can afford the space; benefits taper after 8). Likewise for your VDP copy loop, it should be helpful to unroll that one 2-4 times. If you have enough space in scratchpad, those alone will give you back a noticeable amount of time.

Also, replace @VDPWD in your VDP copy loop with a register, this also saves a few cycles both for reading the instruction and processing it.

Edited April 6, 2013 by Tursi

+mizapf · April 7, 2013

<disclaimer>None of my comments should be understood as conveying a feeling of being insulted or similar unless I say so.</disclaimer>

Don't worry, and likewise, I hope you are not feeling offended by my contributions. I just say what I found out, and I'm happy to learn more. After all, this is not a religion but a scientific hobby.

+Vorticon · April 7, 2013

Would you mind explaining the sine algorithm? You don't seem to use lookup tables at all...

EDIT: OK never mind, you are using a table for the first quarter of the wave. Still great algorithm

Edited April 7, 2013 by Vorticon

+retroclouds · April 7, 2013

very nice! :-)

+mizapf · April 7, 2013

Hi Matthew,

one thing I would be really interested in, as you said you did a HDL implementation - can you tell me how DIV and DIVS are implemented? In particular, I'm interested in the overflow detection of DIVS, because right now I have an ugly piece of code in MESS to predict an overflow, and this *must* be easier to achieve.

The problem is as follows: The overflow detection is pretty simple for unsigned division. For DIV R2,R0 we get

R0 = ((R0<<16)+R1) / R2

R1 = ((R0<<16)+R1) % R2

and we will have an overflow iff R0 >= R2. So this can be quickly checked before the division algorithm starts. The fact that DIV requires far fewer cycles for this case seems to prove that. In MESS I let the code pretend to do a division procedure by calculating the result directly and then consuming the appropriate number of clock cycles. I'd favor having the real division procedure, though.
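The unsigned case described above can be written down as a short Python sketch. The function name `div_unsigned` is invented for this example, and leaving the registers unchanged on overflow is an assumption of the model, not something stated in this thread.

```python
def div_unsigned(r0, r1, r2):
    """Model DIV R2,R0: divide the 32-bit value R0:R1 by the 16-bit R2.

    Returns (new_r0, new_r1, overflow). On overflow the registers are
    returned unchanged (an assumption of this sketch).
    """
    if r0 >= r2:
        # The quotient would need more than 16 bits. Because r0 >= 0,
        # this single test also catches division by zero (r2 == 0).
        return r0, r1, True
    dividend = (r0 << 16) + r1
    return dividend // r2, dividend % r2, False
```

The early test is safe because r0 < r2 bounds the dividend below r2 * 65536, so the quotient always fits in 16 bits once the check passes. The signed case is the hard one and is not attempted here.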

However, as for overflow I'm getting real headaches with the signed version, which has to do with the sign bit. I had to go through all combinations of even/odd/positive/negative divisors and positive/negative dividends. So this works as expected, but as I said, this must be easier somehow. Maybe the real hardware can find out earlier during the execution of the DIVS microprogram.

sometimes99er · April 7, 2013

Maybe a bit off topic, but I think it may interest a few of you.

http://www.youtube.com/watch?v=K5miMbqYB4E

+mizapf · April 7, 2013

Hi all,

I just benchmarked the SWPB. If someone is interested in the tool, just tell me. Or maybe I should upload it here.

I'm using the RTC in the Geneve and do a loop with 0x0400000 iterations. The empty loop yields 8.4 seconds; the loop with a SWPB inside returns 26.6 seconds, i.e. a delta of 18.2 seconds. After division by the iteration count this is a time difference of 4.34 µs (which is the execution time of a single SWPB). With a cycle period of 0.333 µs we have exactly 13 cycles. No error in the specs.
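For anyone who wants to retrace the arithmetic, here it is as a few lines of Python. The constant names are made up; the numbers are the ones quoted above, and the cycle period assumes the 3 MHz external clock.

```python
ITERATIONS = 0x400000        # loop count from the benchmark
EMPTY_LOOP_S = 8.4           # seconds for the empty loop
SWPB_LOOP_S = 26.6           # seconds with one SWPB inside the loop
CYCLE_US = 1 / 3             # 3 MHz clock -> about 0.333 microseconds per cycle

delta_s = SWPB_LOOP_S - EMPTY_LOOP_S                 # time spent in SWPB alone
per_instruction_us = delta_s / ITERATIONS * 1e6      # microseconds per SWPB
cycles = per_instruction_us / CYCLE_US               # clock cycles per SWPB

print(f"{per_instruction_us:.2f} us per SWPB, about {cycles:.1f} cycles")
```

The result lands at roughly 13.0 cycles, which matches the 13 in the data sheet to within what timings given to a tenth of a second can resolve.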

This is all measured with code and registers in on-chip RAM. You can change parameters in the source code to put code and/or registers in SRAM or DRAM.
