
MIPS



This guy has tried to estimate MIPS on some CPUs from the seventies. :)

http://drolez.com/retro/

mips.png

That doesn't look too good for the 9900. And when comparing whole computers, performance was probably crippled further by the TI-99/4A's shoehorned design, with its multiplexer, wait states and read-before-write behaviour.

:|

Edited by sometimes99er

Many years ago, I visited a web site where someone had implemented a Mandelbrot in 9900 assembly language and claimed it ran faster than his identical program in C, compiled on a 386DX. Perhaps he used a very bad compiler, but I really would have expected the 9900 - at least when used in a different system with a full bus and more RAM accessible to the CPU - to do better than that.

 

Also I would think the 6809 would generate a bigger number than the 6502 at the same clock speed, but then again MIPS is a rather crude measurement that doesn't tell the entire story about how powerful/useful a system is. A more varied benchmark like the SPEC or another suite would be better.

 

Actually it looks like he tried to find out how many cycles each machine instruction takes, but the row for the TMS9900 is empty. Perhaps that is why the average number is low?

Edited by carlsson

MIPS is easy to calculate, right? It's just (number of cycles per second / average cycles per instruction) / 1 000 000. I'd say the average number of cycles per instruction on the 9900 (not including wait states) is 12 or so? (3 000 000 / 12) / 1 000 000 = 0.25. As used in the TI, I'd guess the numbers in that guy's list are not too far off. But I think it's fair to say that the TMS9900 @ 3 MHz is about half as fast as a Z80 @ 4 MHz.
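To make the formula concrete, here is a minimal Python sketch of it. The function name `mips` is just an illustrative choice, and the 12-cycle average is the rough guess from this post, not a measured figure.

```python
def mips(clock_hz, avg_cycles_per_instruction):
    """Millions of instructions per second for a given clock and average cycle count."""
    instructions_per_second = clock_hz / avg_cycles_per_instruction
    return instructions_per_second / 1_000_000


# TMS9900 at 3 MHz with a guessed average of 12 cycles per instruction
print(mips(3_000_000, 12))  # 0.25
```

Changing the average cycle count is the whole game here: the result scales inversely with it, which is why the estimates in this thread differ so much.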


Half a MIP is what I thought too.

 

Should be a simple division, yes? 3MHz / average cycles per instruction = Average number of instructions per second? (assuming even distribution, but that's probably fair.) Assuming scratchpad memory and register access, I get an average of 22.4 cycles across the instruction set in the datasheet, so that would be 133984 instructions per second. 0.13 MIPs.

 

Huh, there we go then.

 

Worst case instruction on the 4A is a full division in 8-bit RAM, that's 124 cycles + 6*4 wait states and indirect auto-increment (can you do that?) for another 8 cycles and 8 wait states -- 164 cycles. We get only 18,292 of those per second for a worst case of 0.018 MIPs.

 

Best case (excluding illegal instructions) is a conditional jump in 16-bit memory that's not taken, at 8 cycles (a couple of others are 8 too). We get 375,000 of those per second for 0.375 MIPs.

 

An average program is probably mostly loaded with the common 14 cycle instructions with at least one side in 8-bit RAM, so the 22 cycle average is probably fair for the TI architecture.

 

I think I accept this new number, then... 0.15 MIPs it is.
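For anyone who wants to replay the arithmetic, here is a small Python sketch of the three cases above. The cycle counts are the ones quoted in this post; the names `CLOCK_HZ` and `instructions_per_second` are only illustrative.

```python
CLOCK_HZ = 3_000_000  # TMS9900 clock in the TI-99/4A


def instructions_per_second(cycles_per_instruction, clock_hz=CLOCK_HZ):
    """How many instructions of a given length fit into one second."""
    return clock_hz / cycles_per_instruction


cases = {
    "average over the datasheet": 22.4,
    # full divide in 8-bit RAM: 124 cycles + 6*4 wait states,
    # plus 8 cycles and 8 wait states for indirect auto-increment
    "worst case": 124 + 6 * 4 + 8 + 8,
    "best case (jump not taken)": 8,
}

for label, cycles in cases.items():
    ips = instructions_per_second(cycles)
    print(f"{label}: {cycles} cycles, {ips:,.0f} instr/s, {ips / 1e6:.3f} MIPS")
```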

 

But yeah, MIPs is a measure of CPU performance, not a measure of power. One instruction can do a lot of different things on a different CPU. The 9900 has single instruction multiply and divide, the 8-bit CPUs on that list don't.


MIPS estimates are fine, but keep in mind that ultimately you want your CPU to actually do work. The 8-bit CPUs tend to be very RISC-like and don't actually do very much with a single instruction. The TMS9900 and its successors are CISC CPUs and seem like an experimental CPU line to me. TI was trying a lot of ideas like the memory-to-memory architecture (i.e. no hardware registers), support for arbitrary length numbers (see the 99010), etc.

 

The TMS9900 was born in a multitasking minicomputer where high-level operations like context-switching are directly supported by the CPU, i.e. on the TMS9900 you can task switch in a single command (BLWP). To context-switch on other CPUs you have to push every register to the stack, which can take a lot of time. Another example would be operations like multiply and divide, which the TMS9900 supports in hardware but other CPUs without those instructions have to implement in software. At the end of a sub-routine call, which CPU would be ahead?

 

To really make a meaningful comparison you would need to run a program that is doing something practical, IMO. Benchmarks tend to focus on certain aspects of a CPU or a system, and only represent certain workloads. I would also argue that the 99/4A is not a very good platform to test a TMS9900 on, for reasons we are all very well aware of.


Very well put. In my very simple experiments, I've found that the TI can often hold its own against contemporary machines such as the Atari 8-bitters and the Commodores. Your point about the 9900 instruction set is a good one. CPUs such as the 6502 often need many more instructions than the 9900 to do the same job, and the same goes for the Z80, though the Z80 has better support for 16 bit operations than the 6502.


... though the Z80 has better support for 16 bit operations than the 6502.

 

Absolutely.

 

Looks like the Z80 can be quite efficient at context switching.

 

There is no direct access to the alternate registers; instead, two special instructions, EX AF,AF' (4 cycles) and EXX (4 cycles), each toggles one of two multiplexer flip-flops; this enables fast context switches for interrupt service routines: EX AF, AF' may be used alone (for really simple and fast interrupt routines) or together with EXX to swap the whole BC, DE, HL set; still much faster than pushing the same registers on the stack (4 instructions 11 cycles each).

 

https://en.wikipedia.org/wiki/Zilog_Z80

http://clrhome.org/table
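A quick Python tally of the cycle counts quoted above, as a rough sketch. It only counts the cost of saving the registers (not restoring them), and the constant names are illustrative.

```python
EX_AF_CYCLES = 4     # EX AF,AF'
EXX_CYCLES = 4       # EXX swaps BC, DE, HL
PUSH_CYCLES = 11     # one PUSH of a register pair
REGISTER_PAIRS = 4   # AF, BC, DE, HL

exchange_cost = EX_AF_CYCLES + EXX_CYCLES
push_cost = REGISTER_PAIRS * PUSH_CYCLES

print(f"exchange: {exchange_cost} cycles, push: {push_cost} cycles")
```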

 

There's no PUSH and POP on the 9900. You have to implement your own stack. For example for subroutines that call other routines:

 

	mov	r11,*r10+	; push return

        ! your subroutine

	b	@retu

 

To save bytes in every routine, we have one "global" return:

 

retu
	dect	r10		; down the stack
	mov	*r10,r11	; pop return
	rt

 

For routines that do not call other routines, we can simply do:

 

	! your subroutine

	rt
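For readers less used to the 9900, here is a toy Python model of that software stack. It is only a sketch: the class name `Stack9900` and the addresses are made up for illustration, and it models nothing except R10 (stack pointer), R11 (return address) and word-sized memory.

```python
class Stack9900:
    """Toy model of a software return stack: R10 = stack pointer, R11 = return address."""

    def __init__(self, stack_base):
        self.memory = {}
        self.r10 = stack_base
        self.r11 = 0

    def push_return(self):
        # mov r11,*r10+  : store R11 at the stack pointer, then advance by one word
        self.memory[self.r10] = self.r11
        self.r10 += 2

    def pop_return(self):
        # dect r10 / mov *r10,r11 : step back one word, then reload R11
        self.r10 -= 2
        self.r11 = self.memory[self.r10]


cpu = Stack9900(stack_base=0x8320)  # hypothetical stack location
cpu.r11 = 0x6010                    # return address of the outer caller
cpu.push_return()
cpu.r11 = 0x6100                    # a nested BL overwrites R11
cpu.pop_return()
print(hex(cpu.r11))                 # back to the outer return address
```

The point of the push is visible in the middle: the nested call clobbers R11, so a routine that calls other routines has to save it first.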

 

With the Z80, CALL is basically a PUSH PC+3 followed by a JP Label. The +3 is needed to jump over the call - always 3 bytes long. RET is even simpler, it is equivalent to a POP PC.

 

http://sgate.emt.bme.hu/patai/publications/z80guide/part2.html

 

;)

Edited by sometimes99er

So tell me, exactly how did he measure MIPS? His website doesn't say.
From the barebones definition, it's just Millions of Instructions Per Second.
That varies from one program to the next and it doesn't account for work done, just instructions executed per second.

 

The 64 column font code I wrote for several different Machines/CPUs (Tandy MC-10 .89MHz 6803, Acorn Atom 1MHz 6502, and VTech VZ200 3.58MHz Z80) has the 6803 outperforming the 6502 by about 30%, and it's almost as fast as the Z80 running at about 4x the MHz.

I've also spent some time looking at changes to support the 68hc11, 6809, and 6309.

All three are faster than the 6803 to do the same thing, yet a 6809 has a lower MIPS rating than a 6502?
I've also spent some time on a port to the PLUS/4 for 80 column text. The different memory layout of the screen actually lends itself to faster execution at the same MHz than on the Atom. Even though the CPUs run at the same MIPS, one machine is faster than the other for this code.
The same thing could be seen if I were to create a version for the NEC TREK and SAM Coupe.
Most of the code would be identical but the speed would be different just due to different wait states on the different platforms.
Even setting different graphics modes might alter the speed of a program on the TREK.
So how the heck does MIPS tell you anything?

Benchmarks are usually a better comparison of speed across different CPUs, but benchmarks usually measure performance of the compiler as much as the CPU. If you look at some of the benchmark suites, they involve dozens of different C programs.
One compiler may produce faster code for one program, but a different compiler may produce faster code from another program.
An update to a compiler can significantly alter the results from one day to the next.

Plus, the 9900 and 6809 support a compiler fairly well where the 6502 does not, yet the 6502 can be much more competitive on assembly code.

So, take these sorts of comparisons with a grain of salt.

 

Edited by JamesD

What strikes me is how efficient the implementation of TI BASIC must be. While it is generally considered very slow, if we take this MIPS measurement into consideration, it doesn't look half bad. In a different thread a couple of years ago, I ran both the BYTE and Ahl benchmarks on my VTech Creativision console, for which there is a very slow BASIC. This is interesting because both the TI-99/4A and Creativision have the same VDP chip and use VDP memory to store programs. The Creativision has a slice more CPU RAM, but also a 2 MHz 6502 which according to the MIPS estimates above should mean it runs 5-6 times as many instructions per second as the 3 MHz TMS9900. Yet the console's BASIC yields even slower execution of said benchmarks. There is a speed hacked version of said BASIC which breaks the RND function but otherwise brings it up to speed, but then I know there are greatly improved BASICs for the TI as well.


As was mentioned earlier, raw MIPS ain't necessarily a good indicator. We can do sixteen bit arithmetic in a single instruction on the 9900, which cannot be done on a 6502 IIRC (I well remember all those add with carry instructions!). In addition, a serious flaw with the 6502 and its derivatives is that they are seriously register starved. So, while a sixteen bitter can be 'just getting on with it', a register starved 8-bitter like the 6502 can be thrashing around moving stuff on and off the stack, doing everything as two separate 8 bit operations etc. Also, the 9900 doesn't have an 8-bit accumulator that you are forced to use for math. You can use any registers you want.

 

So while the 6502 is undoubtedly faster it's also inefficient, wasting time because a single instruction gets less work done than a cpu such as the 99xx and 68k.
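To illustrate the "two separate 8 bit operations" point, here is a small Python sketch of a 16-bit add built from two 8-bit add-with-carry steps, the way an 8-bit CPU has to do it. The function names are illustrative and this models only the arithmetic, not any real instruction timing.

```python
def add8_with_carry(a, b, carry_in):
    """One 8-bit add-with-carry step: returns (8-bit result, carry out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def add16_via_8bit(x, y):
    """Add two 16-bit values as low byte first, then high byte plus carry."""
    lo, carry = add8_with_carry(x & 0xFF, y & 0xFF, 0)
    hi, carry = add8_with_carry(x >> 8, y >> 8, carry)
    return (hi << 8) | lo, carry


result, carry = add16_via_8bit(0x12FF, 0x0001)
print(hex(result), carry)  # 0x1300 0
```

On the 9900 the same work is a single 16-bit add, which is the sense in which one instruction "gets more done".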



But the 6502 could also be more efficient with some of the simple stuff, whereas the 9900 works on 16 bits and does the infamous "read before write" to manipulate just one byte. Byte manipulation is part of the 9900 vocabulary even in its 16-bit world. Are we lucky or not?

 

When we push from ROM/RAM to the VDP, should we consider reading words (two bytes at a time and swap / sla) instead of reading bytes ? The latter is so at hand when the destination only takes bytes.

 

;-)

Edited by sometimes99er

This looks like a Top Dog or King of Hill - Grudge Match / Smack Down opportunity to see who has the most advanced "SKILLZ" at massaging the TMS9900 to their will. Who can do the most operations in the least amount of time? Is a 20 second "real world" duration long enough? The winning participant would have to submit their code to prove nothing "funny" is going on.


When we push from ROM/RAM to the VDP, should we consider reading words (two bytes at a time and swap / sla) instead of reading bytes ? The latter is so at hand when the destination only takes bytes.

 

I don't think I have found anything faster than MOVB *R1+,*R0 (where R0 is set to VDPWD) in an unrolled loop.

 

Something like this is not faster AFAIK:

 

MOV *R1+,R2

MOVB R2,*R0

SWPB R2

MOVB R2,*R0

 

It's longer, and then you have to consider the problem with odd/even bytes.

 

If only we had one or two internal registers...

 

Edit: there's also this variant, but I don't think that's faster either:

 

MOV *R1+,R2

MOVB R2,*R0

MOVB @R2LB,*R0


Yep, seems right. I'm just pulling out the figures from Classic99. ;)

 

MOVB *R1+,*R0 ; 40
MOVB *R1+,*R0 ; 40
40 + 40 = 80 cycles

MOV *R1+,R2 ; 30
MOVB R2,*R0 ; 30
SWPB R2 ; 14
MOVB R2,*R0 ; 30
30 + 30 + 14 + 30 = 104 cycles

MOV *R1+,R2 ; 30
MOVB R2,*R0 ; 30
SLA R2,>8 ; 32
MOVB R2,*R0 ; 30
30 + 30 + 32 + 30 = 122 cycles

MOV *R1+,R2 ; 30
MOVB R2,*R0 ; 30
MOVB @R2LB,*R0 ; 42
30 + 30 + 42 = 102 cycles
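The same comparison as a short Python sketch, in case anyone wants to plug in other figures. The per-instruction cycle counts are the Classic99 numbers from this post; the labels and variable names are just illustrative. Each variant moves two bytes to the VDP.

```python
# Cycles per instruction for each way of sending two bytes to the VDP
variants = {
    "MOVB *R1+,*R0 twice": [40, 40],
    "MOV, MOVB, SWPB, MOVB": [30, 30, 14, 30],
    "MOV, MOVB, SLA, MOVB": [30, 30, 32, 30],
    "MOV, MOVB, MOVB @R2LB": [30, 30, 42],
}

totals = {name: sum(cycles) for name, cycles in variants.items()}
fastest = min(totals, key=totals.get)

for name, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{name}: {total} cycles for 2 bytes, {total / 2} per byte")
print("fastest:", fastest)
```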

 

 


Another exercise is clearing RAM. Can you do it faster than this? (Hope my code is OK, haven't tried it).

       LWPI WRKSP
       LI   R1,START
       LI   R2,LENGTH/32
LOOP   MOV  R1,@SETWP+2
SETWP  LWPI 0
       CLR  R0
       CLR  R1
       CLR  R2
       CLR  R3
       CLR  R4
       CLR  R5
       CLR  R6
       CLR  R7
       CLR  R8
       CLR  R9
       CLR  R10
       CLR  R11
       CLR  R12
       CLR  R13
       CLR  R14
       CLR  R15
       LWPI WRKSP
       AI   R1,32
       DEC  R2
       JNE  LOOP

Another exercise is clearing RAM. Can you do it faster than this? (Hope my code is OK, haven't tried it).

 

       LWPI WRKSP
       LI   R1,START
       LI   R2,LENGTH/32
LOOP   MOV  R1,@SETWP+2
SETWP  LWPI 0

Ahhh, self modifying code. Nice! :-) Can't do that any more on a modern CPU. Probably won't work on the 9995 either since it has a single instruction pre-fetch. IMO it is code like this that really takes advantage, and makes use, of a CPU. Too bad everyone is in such a hurry to add layers and get as far away from the CPU as possible these days.


Ahhh, self modifying code. Nice! :-) Can't do that any more on a modern CPU. Probably won't work on the 9995 either since it has a single instruction pre-fetch.

I did not try, but I'd expect the 9995 to run this correctly. The pipeline prefetch gets the next instruction which is LWPI, the argument is not prefetched, so it may change.


Another exercise is clearing RAM. Can you do it faster than this? (Hope my code is OK, haven't tried it).

Code looks fine! It was tough, but I think I beat it. It requires a few more cycles for setup and loop around, but you can use STST to write the zeros even faster than CLR (8 cycles versus 10 cycles to 16-bit RAM, and 16 cycles versus 22 cycles to 8-bit RAM due to one fewer memory access - STST doesn't read before write). I tried two ways of getting a zero into the status register.

 

For one, just doing math. I loaded a zero into R3, then

 

       AB R3,R3           * 0+0=0, clears L>, A>, C, OV, OP, sets EQ
       COC R1,R3          * assuming R1<>0, clears EQ
This clears it out (you also need to be running with interrupts cleared (LIMI 0) and not running an XOP). This takes a spare register and 28 cycles (0-wait state), and has to be done every loop before the LWPI.

 

If you can spare the high registers in the workspace, then you can use RTWP to do it instead. This also lets you do it without self-modifying code.

 

       LWPI WRKSP
       LI   R2,LENGTH/32
       LI   R13,START     * will become the WP
       LI   R14,LOOP2     * is the address branched to
       CLR  R15           * will become the ST register
             
LOOP   RTWP

LOOP2  STST  R0
       STST  R1
       STST  R2
       STST  R3
       STST  R4
       STST  R5
       STST  R6
       STST  R7
       STST  R8
       STST  R9
       STST  R10
       STST  R11
       STST  R12
       STST  R13
       STST  R14
       STST  R15
       
       LWPI WRKSP
       AI   R13,32
       DEC  R2
       JNE  LOOP
       
       END
If I'm doing my math right, assuming 16-bit code and data both, the CLR version should take 34 cycles and 2 registers to set up, and take 236 cycles per loop (an average of 7.375 cycles per byte). If the target is in 8-bit RAM, it would take another 128 cycles per loop (due to the read-before-write, 8 wait states are inserted), which makes it 11.375 cycles per byte.

 

The STST version needs 56 cycles to set up and 4 registers (and fixed registers for three of those), but manages 186 cycles per loop (averaging 5.8125 cycles per byte). STST does not have the read-before-write, so it jumps ahead in 8-bit target memory, adding only 64 cycles (4 wait states per write). This makes an average of 7.8125 cycles per byte.
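Here is the per-byte arithmetic as a Python sketch, using the per-loop totals and wait-state figures given above (each loop clears one 32-byte workspace with 16 word writes). The names are illustrative only.

```python
BYTES_PER_LOOP = 32   # one 16-register workspace
WRITES_PER_LOOP = 16  # one CLR or STST per register


def cycles_per_byte(loop_cycles, extra_wait_cycles=0):
    """Average cycles spent per cleared byte for one pass of the loop."""
    return (loop_cycles + extra_wait_cycles) / BYTES_PER_LOOP


clr_loop = 236                    # CLR version, code and data in 16-bit RAM
stst_loop = 186                   # STST version, code and data in 16-bit RAM
clr_wait = WRITES_PER_LOOP * 8    # read-before-write: 8 wait states per CLR
stst_wait = WRITES_PER_LOOP * 4   # write only: 4 wait states per STST

print("CLR,  16-bit target:", cycles_per_byte(clr_loop))
print("CLR,  8-bit target: ", cycles_per_byte(clr_loop, clr_wait))
print("STST, 16-bit target:", cycles_per_byte(stst_loop))
print("STST, 8-bit target: ", cycles_per_byte(stst_loop, stst_wait))
```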

 

Fun puzzle, and I almost gave up! ;)
