
Conversion of Z80 code to TMS9900


Asmusr


So what have I learned? It's definitely possible to convert Z80 code to the TI as can be seen here: http://atariage.com/forums/topic/267989-knight-lore/

 

My mapping of Z80 to TI registers worked well:

tmp0   equ  0
tmp1   equ  1
one    equ  2
mone   equ  3
af     equ  4
a      equ  4
bc     equ  5
b      equ  5
c      equ  R5LB
de     equ  6
d      equ  6
e      equ  R6LB
hl     equ  7
h      equ  7
l      equ  R7LB
ix     equ  8
iy     equ  9
sp     equ  10
af_    equ  12
bc_    equ  13
de_    equ  14
hl_    equ  15

one and mone hold the constants 1 and -1, which I found I used all the time.

 

It's inefficient to use the LSB registers c, e, and l because they have to be accessed as memory bytes rather than registers.

 

The Knight Lore code checks the carry flag a lot. I found that after a byte compare (cp on Z80, cb on the TI) the carry condition on the Z80 corresponds to JL on the TI, or JHE for the inverse condition. If the carry flag is checked after a subtraction instead of a compare, this has to be turned into a compare followed by a (possible) subtraction on the TI.
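A quick way to convince yourself of the JL/JHE rule is to model both flag definitions and compare them over every byte pair. The sketch below is Python rather than assembly, and the function names are made up for illustration. It assumes `cb src,dst` sets L> when the source byte is logically greater than the destination byte, and that the Z80 accumulator ends up as the source operand of the `cb`.

```python
def z80_cp_carry(a, n):
    """Carry flag after Z80 `cp n`: set when A < n (unsigned 8-bit)."""
    return (a & 0xFF) < (n & 0xFF)


def tms9900_cb_status(src, dst):
    """(L>, EQ) after `cb src,dst` (assumed: L> means src > dst, unsigned)."""
    src &= 0xFF
    dst &= 0xFF
    return src > dst, src == dst


def jl_taken(lgt, eq):
    """JL jumps when L> = 0 and EQ = 0."""
    return (not lgt) and (not eq)


def jhe_taken(lgt, eq):
    """JHE jumps when L> = 1 or EQ = 1."""
    return lgt or eq


# Compare the two models for every accumulator/operand pair.
mismatches = []
for a in range(256):
    for n in range(256):
        lgt, eq = tms9900_cb_status(a, n)
        carry = z80_cp_carry(a, n)
        if jl_taken(lgt, eq) != carry or jhe_taken(lgt, eq) != (not carry):
            mismatches.append((a, n))

print("mismatches:", len(mismatches))
```

Under those assumptions the list stays empty, so `jr c` after `cp` maps to JL and `jr nc` maps to JHE.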

 

The biggest conversion issue is probably that loading data into a register (ld on Z80, mov or li on the TI) does not set any flags on the Z80, whereas mov and li do change the status bits on the TI. So on the Z80 you can do a compare, then load something, and then check the condition. This type of code has to be reworked on the TI.

 

A stack and calls to subroutines can easily, but somewhat slowly, be emulated on the TI, but on the TI it's the called routine that pushes the return address (r11) onto the stack. If Z80 code is jumping directly into subroutines (rather than calling) it is necessary to bypass the initial push of the return address on the TI.

 

Does anyone have a commented Z80 disassembly of Elite?


Forgot to mention: I believe converted code takes up 50% - 100% more memory than the Z80 code. Basically, many Z80 instructions take one byte, while on the TMS9900 every instruction takes at least two. But it depends on the code, and in this case the TMS9900 is a lot more efficient:

*
* =============== S U B R O U T I N E =======================================
*
* b: pixel Y
* c: pixel X
*
* Result in bc
*
calc_vidbuf_addr:
;RAM:D811 E5                          push    hl
;RAM:D812 CB 38                       srl     b                                       ; y >> 1, bit 0 to Carry
;RAM:D814 CB 19                       rr      c                                       ; Carry to bit 7, x >> 1
;RAM:D816 CB 38                       srl     b                                       ;
;RAM:D818 CB 19                       rr      c                                       ;
;RAM:D81A CB 38                       srl     b                                       ;
;RAM:D81C CB 19                       rr      c                                       ;
       srl  bc,3
;RAM:D81E 21 F3 D8                    ld      hl, #vidbuf                             ; bitmap buffer
;RAM:D821 09                          add     hl, bc                                  ; calculate bitmap memory address
;RAM:D822 4D                          ld      c, l
;RAM:D823 44                          ld      b, h                                    ; BC = bitmap memory address
       ai   bc,vidbuf
;RAM:D824 E1                          pop     hl
;RAM:D825 C9                          ret
       rt
*
* End of function calc_vidbuf_addr
*
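To put a number on "a lot more efficient", the byte counts can be tallied straight from the listing above. This is only a small Python sketch: the Z80 sizes are read off the opcode bytes shown in the disassembly, and the TMS9900 sizes assume the usual encodings (one word for `srl` and `rt`, two words for `ai` with its immediate).

```python
# (mnemonic, size in bytes), taken from the opcode bytes in the Z80 listing.
Z80_ROUTINE = [
    ("push hl", 1),
    ("srl b", 2), ("rr c", 2),
    ("srl b", 2), ("rr c", 2),
    ("srl b", 2), ("rr c", 2),
    ("ld hl,#vidbuf", 3),
    ("add hl,bc", 1),
    ("ld c,l", 1),
    ("ld b,h", 1),
    ("pop hl", 1),
    ("ret", 1),
]

# Sizes assumed from the standard TMS9900 instruction formats.
TMS9900_ROUTINE = [
    ("srl bc,3", 2),
    ("ai bc,vidbuf", 4),
    ("rt", 2),
]

z80_size = sum(size for _, size in Z80_ROUTINE)
tms_size = sum(size for _, size in TMS9900_ROUTINE)

print("Z80:", z80_size, "bytes in", len(Z80_ROUTINE), "instructions")
print("TMS9900:", tms_size, "bytes in", len(TMS9900_ROUTINE), "instructions")
```

The Z80 total agrees with the address range in the listing (D811 through D825 inclusive).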
Edited by Asmusr

When you simply convert code that's written to work well on one CPU to another that's quite different, you have to expect that the translated code will not be that terrific. But if you rewrite it to actually exploit the advantages of the second architecture, it may be even better (depending on the source and target systems, of course).


I can point you to the commented sources of Uridium (there is an improved version for cv) and of Tales of Popolon (a 3D FPS on MSX).

 

Do you think anything in Uridium would benefit from a 16-bit architecture or would it end up running at half speed?


All the x coordinates of the objects are 16 bits, if not 24 bits (due to decimal parts). Some of the y coordinates are 16 bits to take the decimal part into account. How fast is the TMS9900 at doing a large case/switch and at doing 8- and 16-bit comparisons?

Anyway, in case of problems, one can reduce the maximum number of enemies or set the speed to 30 fps.

Both are already implemented: one through a label, the other in the menu where you choose the difficulty level; the easier modes work by slowing down the main loop with extra waits for vblank.

Edited by artrag

How fast is the TMS9900 at doing a large case/switch and at doing 8- and 16-bit comparisons?

 

A large case/switch is just a jump table I guess? 8 and 16 bit comparisons (CB and C) are equally fast, but I don't know how fast they are compared to the Z80.


 

A large case/switch is just a jump table I guess? 8 and 16 bit comparisons (CB and C) are equally fast, but I don't know how fast they are compared to the Z80.

Yes, the switches are used in the enemies and in the collisions.

As for the relative speed, maybe one can compensate by using fewer enemies or by setting the frame rate to 30 Hz.

Edited by artrag

  • 3 years later...
On 7/1/2017 at 3:26 AM, mizapf said:

When you calculate 00c0 - 00d0, the ALU effectively adds the two's complement: 00c0 + ff30 = fff0. Since the result does not exceed ffff, carry is cleared. However, when you calculate 00d0 - 00c0, you have 00d0 + ff40 = 10010 = 0010, so this means carry is set.
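The two cases can be reproduced with a tiny model of the subtraction. This is a Python sketch under the assumption that the ALU computes `dst - src` as `dst + ~src + 1` and that carry is simply bit 16 of that sum; `sub16` is a made-up helper name.

```python
MASK16 = 0xFFFF


def sub16(dst, src):
    """Return (result, carry) for 16-bit dst - src, done as dst + ~src + 1."""
    total = (dst & MASK16) + (~src & MASK16) + 1
    return total & MASK16, total > MASK16


for dst, src in ((0x00C0, 0x00D0), (0x00D0, 0x00C0)):
    result, carry = sub16(dst, src)
    # In this model carry means "no borrow", the opposite of the Z80 sense.
    print(f"{dst:04x} - {src:04x} = {result:04x}  carry={int(carry)}  borrow={int(not carry)}")
```

In this model carry is set exactly when the destination is logically greater than or equal to the source, so it is the inverse of the Z80 carry after `sub`.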

I know this is way out of date, but it helped me sort out some frustrating issues I had when carrying out some 40 bit subtraction and addition.   The lesson I learnt is that you can't rely on the Carry Flag when performing arithmetic larger than 16 bits.   The most reliable method I found was to use the overflow for arithmetic functions and carry for carry over to the next register.   For example:

 

;
;
;Subtract the 40 bit Source (R1,R2,R3) from the 40 bit Destination (R5,R6,R7).
;
;
SUBAC:  CLR	R0
        SB	R3,R7
        JNO	SAC1
        INC	R2		;Add carry bit
        JNC	SAC1
        INC	R1		;Add carry bit
        JNC	SAC1
        SETO	R0   		;Carry to AC has occurred from R5
;
; Now subtract the register pairs 6 and 5
;
SAC1:   S	R2,R6
        JNO	SAC2
        INC	R1
        JNC	SAC2
        SETO	R0   		;Carry to AC has occurred from R5

SAC2:   S	R1,R5			;
        JNO	SAC3
        SETO	R0
SAC3:   INC	R0		;Set the carry flag if carry occurred
        RET

 


48 minutes ago, adel314 said:

I know this is way out of date, but it helped me sort out some frustrating issues I had when carrying out some 40 bit subtraction and addition.   The lesson I learnt is that you can't rely on the Carry Flag when performing arithmetic larger than 16 bits.   The most reliable method I found was to use the overflow for arithmetic functions and carry for carry over to the next register.   For example:

 



 

Did that take more instructions in Z80 code?

 

And for the person who knows nothing about games (me), why is 40-bit math required?

32-bit operations give magnitudes of roughly +/- 2 billion. I can't imagine what a game would need more for.

 


1 hour ago, adel314 said:

I know this is way out of date, but it helped me sort out some frustrating issues I had when carrying out some 40 bit subtraction and addition.   The lesson I learnt is that you can't rely on the Carry Flag when performing arithmetic larger than 16 bits.   The most reliable method I found was to use the overflow for arithmetic functions and carry for carry over to the next register.   For example:

 

I'd probably calculate the two's complement of the second operand and use an addition.

 

by the way, which register is the most significant? I guess R3 and R7 (otherwise, SB and JNO would not make sense).


9 hours ago, mizapf said:

 

I'd probably calculate the two's complement of the second operand and use an addition.

 

by the way, which register is the most significant? I guess R3 and R7 (otherwise, SB and JNO would not make sense).

Yes, I should have specified that.  R5 to R7 (Upper Byte) represent a Floating Point Mantissa and the Lower byte of R7 the Exponent, that is why I only needed the SB for the R7 operation.   So R5 is MS and R7 LS


10 hours ago, TheBF said:

Did that take more instructions in Z80 code?

 

And for the person who knows nothing about games (me), why is 40-bit math required?

32-bit operations give magnitudes of roughly +/- 2 billion. I can't imagine what a game would need more for.

 

In certain places the Z80 code with the use of the carry flag makes some of the arithmetic much simpler than the TMS9900 but in most other cases the TMS9900 produces much more efficient and smaller code.  The Z80 Floating Point Math Package that I have ported to the TMS9900 is probably about the same size but I will do a check.     Looks like it is all working now, but your post helped me quite a bit.

Edited by adel314

On 2/17/2021 at 2:23 AM, mizapf said:

 

I'd probably calculate the two's complement of the second operand and use an addition.

 

by the way, which register is the most significant? I guess R3 and R7 (otherwise, SB and JNO would not make sense).

Regarding the use of two's complement, I decided to use another method, as the method I posted earlier is not 100% accurate. From a coding perspective I think this updated method is accurate and reasonably compact, and I would appreciate your thoughts.   The logic behind it is that a "Carry" or "Borrow" as we know it will always occur if the operand being subtracted (source) is logically larger than the destination operand, so using a Compare to test this allows us to use JH or JLE to detect the borrow.

 

;
;
;Subtract the 40 bit Source (R1,R2,R3) from the 40 bit Destination (R5,R6,R7).
;
;  40 bit source:      R1,R2,R3    with R1 being MS and R3 LS
;  40 bit destination: R5,R6,R7    with R5 being MS and R7 LS
;
;
;
SUBAC:  PUSH R1
        PUSH R2
        CLR  R0
        CB   R3,R7     ;Check for borrow
        JLE  SAC1
        INC  R2        ;Propagate the borrow
        JNC  SAC1
        INC  R1
        JNC  SAC1
        SETO R0
;
; Now perform the subtract of R3 from R7 (LS)
;
SAC1:   SB  R3,R7    ;no need to check for carry/overflow
        C   R2,R6
        JLE  SAC2
        INC  R1
        JNC  SAC2
        SETO R0
;
; Now perform the subtract of R2 from R6
;
SAC2:   S    R2,R6
        C    R1,R5 
        JLE  SAC3
        SETO R0
;
;  Now subtract the MS register R1 from R5
;
SAC3:   S    R1,R5
        POP  R2
        POP  R1
        INC	 R0		;Set the carry flag if carry in the MSB occurred
        RET
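The compare-before-subtract idea can be sanity-checked outside the assembler. What follows is a Python sketch, not a transcription of the routine above: it adds the incoming borrow to the source limb, compares, then subtracts, and checks the result against plain integer arithmetic. The limb widths (16, 16, 8) mirror R5, R6 and the upper byte of R7; the helper names are hypothetical.

```python
import random

WIDTHS = (16, 16, 8)  # bits per limb, most significant first


def to_limbs(value, widths=WIDTHS):
    """Split an integer into limbs, most significant first."""
    limbs = []
    for width in reversed(widths):
        limbs.append(value & ((1 << width) - 1))
        value >>= width
    return limbs[::-1]


def from_limbs(limbs, widths=WIDTHS):
    """Rebuild the integer from its limbs."""
    value = 0
    for limb, width in zip(limbs, widths):
        value = (value << width) | limb
    return value


def sub_multiword(dst, src, widths=WIDTHS):
    """Return (dst - src as limbs, final borrow), detecting borrow by compare."""
    result = [0] * len(widths)
    borrow = 0
    for i in range(len(widths) - 1, -1, -1):  # least significant limb first
        mask = (1 << widths[i]) - 1
        subtrahend = src[i] + borrow            # propagate the earlier borrow
        borrow = 1 if subtrahend > dst[i] else 0  # compare before subtracting
        result[i] = (dst[i] - subtrahend) & mask
    return result, borrow


def check(trials=10000, seed=1):
    """Compare the limb-wise routine with ordinary integer subtraction."""
    rng = random.Random(seed)
    total_bits = sum(WIDTHS)
    for _ in range(trials):
        a = rng.getrandbits(total_bits)
        b = rng.getrandbits(total_bits)
        limbs, borrow = sub_multiword(to_limbs(a), to_limbs(b))
        if from_limbs(limbs) != (a - b) % (1 << total_bits) or borrow != (a < b):
            return False
    return True


print("limb-wise subtraction agrees:", check())
```

The case worth watching is a source limb of all ones with an incoming borrow: the subtrahend then exceeds any destination limb, which is the situation the INC/JNC pairs in the listing are there to catch.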

 

 

Edited by adel314

18 hours ago, apersson850 said:

I presume PUSH and POP are defined like

DECT SP

MOV %1,*SP

and

MOV *SP+,%1

 

But I don't get the logic of R2, R3 being the mantissa and the lower part of R7 being the exponent, combined with the code above? Or is this two different cases?

The integer maths are required to calculate the 40-bit mantissa component of the floating point routines.   Here is a link to the original Z80 code: http://www.z80.info/zip/math.zip .   For your info, here is the code for PUSH and POP; they are just standard XOP routines.

;
;*************************************************
;	PUSH DATA/REGISTER ONTO THE STACK
;	USES CALLER'S WP AND STACK POINTERS
;*************************************************
;
XOP8	MOV	@FREEMEM,@2*R9(R13)	;UPDATE FREE MEMORY POINTER IE STACK LIMIT
	MOV	@2*R10(R13),R10
	DECT	R10
	C	R10,@2*R9(R13)		;CHECK FOR OVERFLOW
	JLE	STACKERR
	MOV	*R11,*R10
	MOV	R10,@2*R10(R13)
	RTWP
;
;	POP DATA/REGISTER OFF STACK
;
XOP9	MOV	@2*R10(R13),R10
	MOV	*R10+,*R11
	MOV	R10,@2*R10(R13)
	RTWP

 


18 hours ago, mizapf said:

Yes, I was a bit surprised, too, when hearing about mantissa and exponent, because I thought we were talking about integers.

Yes, it is about integer arithmetic.  The mantissa is a 40 bit integer.   The problem that I see is that the TMS9900 performs two's complement arithmetic and the flags are based on that implementation.


OK, I'll have to look at the link. But I read the exponent is in least significant byte of R7, and you do CB with R7, which looks at most significant byte.

Also I don't see the value of the exponent influencing the magnitude of the mantissa. Like if you calculate 1E20-9E0, then it doesn't matter that 9 is much bigger than one. In this case, the exponent will make the result of 1E20-9E0=1E20. The subtraction isn't even noticed.

Or do you use the exponent in some other way?

 

After looking at the data in the link, I see it discusses floating point BCD math. Is it something similar to TI's own floating point math you try to use? Just that they use radix 100, since there are no specific radix 10 instructions in the TMS 9900 anyway.

Edited by apersson850

19 hours ago, apersson850 said:

OK, I'll have to look at the link. But I read the exponent is in least significant byte of R7, and you do CB with R7, which looks at most significant byte.

Also I don't see the value of the exponent influencing the magnitude of the mantissa. Like if you calculate 1E20-9E0, then it doesn't matter that 9 is much bigger than one. In this case, the exponent will make the result of 1E20-9E0=1E20. The subtraction isn't even noticed.

Or do you use the exponent in some other way?

 

After looking at the data in the link, I see it discusses floating point BCD math. Is it something similar to TI's own floating point math you try to use? Just that they use radix 100, since there are no specific radix 10 instructions in the TMS 9900 anyway.

Yes, the exponent is handled separately, so in effect I am working on a 5-byte mantissa in R5, R6 and the MSB of R7.  This causes extra handling, but I thought I would be consistent with the original, and keeping the representation within 3 registers just makes it neater.


On 2/20/2021 at 12:12 AM, adel314 said:

The integer maths are required to calculate the 40-bit mantissa component of the floating point routines.   Here is a link to the original Z80 code: http://www.z80.info/zip/math.zip .   For your info, here is the code for PUSH and POP; they are just standard XOP routines.



 

I have never used the XOP instructions but that seems like a lot of code to do PUSH and POP plus the BLWP/RTWP overhead.

I suppose if the Assembler does not support macros this provides some abstraction. (?)

 

Push can be two instructions and pop can be one instruction on 9900. Could that work on your system?

 

 


On 2/22/2021 at 1:32 AM, TheBF said:

I have never used the XOP instructions but that seems like a lot of code to do PUSH and POP plus the BLWP/RTWP overhead.

I suppose if the Assembler does not support macros this provides some abstraction. (?)

 

Push can be two instructions and pop can be one instruction on 9900. Could that work on your system?

 

 

The problem with BLWP/RTWP is that after the first level of subroutine you have to begin saving the workspace pointers, status registers, return addresses, etc., so the overhead management soon becomes very complicated for nested systems.  Using the XOPs you can avoid that and mimic a normal micro that microcodes the CALL, PUSH and POP functions.  I have copied the CALL function just for info.  Of course, in most circumstances you don't need to check for stack overflow, and this would reduce the code somewhat.

 

As you suggest, if I don't want the overhead of a PUSH and POP, and provided the routine is local you can use MOV R3,@-2(SP) to push onto the stack and MOV @-2(SP),R3 to recover it.

 

You can perform some fairly powerful pseudo instructions using XOPs, which would indeed be very similar to macros.

;
;************************************************
;	CALL	SUBROUTINE
;	CALLING METHOD:   CALL SUBROUTINE_ADDRESS
;*************************************************
;
XOP6   MOV	@2*R10(R13),R10  ;GET STACK POINTER
       DECT	R10
       C	R10,@2*R9(R13)	;CHECK FOR STACK OVERFLOW
       JLE	STACKERR        ;O/P STACK OVERFLOW MESSAGE
       MOV	R14,*R10        ;PUSH SAVED PC ONTO STACK
       MOV	R11,R14         ;MOVE EA INTO R14 FOR CALL
       MOV	R10,@2*R10(R13) ;UPDATE STACK POINTER
       RTWP                 ;NOTE WE ARE NOW USING THE ORIGINAL WP

 

Edited by adel314

I was thinking about the way the Forth community here does stacks on 9900.  There is way less overhead in cycles and memory.

A macro assembler makes it prettier but the inlined instructions are not very big.

 

It would seem to be an order of magnitude faster than the XOP version just "eyeballing" the code.

 

If you are changing context (workspace) you would of course have to decide that register X is the stack pointer for all workspaces and give each context a small section of the stack memory ie: an offset from the base stack address used by the main workspace.  

 

With this kind of small stack overhead you can make different decisions about what truly requires a BLWP and what can be managed with a register push like you would do on an MSP430 for example where there is only a stack and no workspaces.

 

Just a thought.

 

Note: SP is an EQUate for the register you choose for the stack pointer.

* stack is placed in high memory and descends on PUSH
* there is no overflow protection, there is no underflow protection

*
* PUSH    	
*
			DECT SP
			MOV Rn,*SP

*
* POP
*
			MOV *SP+,Rn
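For anyone who wants to see the pointer movement spelled out, here is a small Python model of a descending word stack: push pre-decrements the pointer by two and stores, pop loads and post-increments by two. The class name and the starting address are invented for the example, and like the sequences above it has no overflow or underflow protection.

```python
class DescendingStack:
    """Word stack that grows downward in memory."""

    def __init__(self, top):
        self.sp = top      # stack pointer, starts at the (empty) top
        self.memory = {}   # address -> 16-bit word

    def push(self, value):
        self.sp -= 2                        # pre-decrement by one word
        self.memory[self.sp] = value & 0xFFFF

    def pop(self):
        value = self.memory[self.sp]        # read the top word
        self.sp += 2                        # post-increment by one word
        return value


stack = DescendingStack(top=0xFF00)  # hypothetical stack base address
stack.push(0x1234)
stack.push(0xBEEF)
first = stack.pop()
second = stack.pop()
print(hex(first), hex(second), hex(stack.sp))
```

Without the increment on pop, the pointer would never come back to its starting value and every pop would return the same word.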
			

 
