# Assembly on the 99/4A

Generally, subtraction means addition of the complement. The carry flag is set as if the CPU performed an unsigned operation. This means that carry is always set when the addition (of the complement) exceeds FFFF.

Example:

10 - 5 = 5

In two's complement:

0x000a + 0xfffb = 0x10005 = 0x0005 + carry

2 - 5 = -3

0x0002 + 0xfffb = 0xfffd + no carry

When you use DEC (= adding FFFF), the only situation where carry is reset is when you go from 0 to -1 (because 0x0000 + 0xffff = 0xffff + no carry). Thus, a loop like

LOOP do_something

DEC R1

JOC  LOOP

will leave the loop with R1=-1 (unlike JNE, which leaves the loop with R1=0).
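To see the difference between the two loop exits, here is a minimal Python sketch (a model, not 9900 code) that treats DEC as "add 0xFFFF" exactly as described above. The names `dec16` and `run_loop` are made up for this illustration.

```python
MASK = 0xFFFF

def dec16(value):
    """Model DEC as adding 0xFFFF; return (result, carry)."""
    total = (value & MASK) + 0xFFFF
    return total & MASK, total > MASK

def run_loop(start, exit_test):
    """Run 'body; DEC R1; conditional jump' and return (passes, final R1)."""
    r1, passes = start & MASK, 0
    while True:
        passes += 1                  # do_something
        r1, carry = dec16(r1)        # DEC R1
        if exit_test == "JOC" and not carry:
            break                    # JOC falls through when carry is clear
        if exit_test == "JNE" and r1 == 0:
            break                    # JNE falls through when the result is zero
    return passes, r1

print(run_loop(3, "JOC"))  # (4, 65535): four passes, R1 ends at 0xFFFF (-1)
print(run_loop(3, "JNE"))  # (3, 0): three passes, R1 ends at 0
```

So with the same start value, the JOC loop runs the body one more time than the JNE loop.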

One interesting result that I found was that subtracting a zero also sets the carry flag.

CLR R1

CLR R2

S  R1,R2 will set carry

This can be explained when we assume that the S operation actually uses the ones' complement and then adjusts to the two's complement:

0x0000 - 0x0000 = 0x0000 + 0xFFFF + 0x0001 = 0x10000 = 0x0000 + carry

(ones' complement of 0000 is FFFF)
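The three subtractions above can be checked with a small Python model. It assumes, as suggested in this post, that S behaves like "destination plus ones' complement of source plus 1"; `sub16` is just a name chosen for the sketch.

```python
MASK = 0xFFFF

def sub16(dst, src):
    """Model S src,dst as dst + ~src + 1; return (result, carry)."""
    total = (dst & MASK) + (~src & MASK) + 1
    return total & MASK, total > MASK

for dst, src in [(10, 5), (2, 5), (0, 0)]:
    result, carry = sub16(dst, src)
    print(f"{dst} - {src} = 0x{result:04X}, carry={carry}")
```

Under this model, 10 - 5 and 0 - 0 both produce a carry, while 2 - 5 does not.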

##### Share on other sites
2 hours ago, intvnut said:

You might be slightly faster if you always CLR TOS, and then conditionally SETO, to avoid the extra unconditional JMP.  CLR doesn't modify the flags.  Something like this (guessing at the syntax):

```
\               cx ax  bx
\               R1 R0  tos
CODE WITHIN   ( n  lo  hi -- flag )
      R0  POP,
      R1  POP,
      R0  TOS SUB,
      R0  R1  SUB,
      TOS R1  SUB,
      TOS CLR,
      NC  IF, TOS SETO, ENDIF,
      NEXT,
ENDCODE
```

This should work perfectly. Thanks!
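For anyone following along, here is a rough Python model of what the WITHIN code above computes. It assumes the Forth assembler takes operands in source-then-destination order and that a subtraction leaves carry clear when a borrow occurs; `within` is a name used only for this sketch.

```python
MASK = 0xFFFF

def within(n, lo, hi):
    """Return True when lo <= n < hi, using 16-bit wraparound arithmetic."""
    span = (hi - lo) & MASK                # R0 TOS SUB,  -> hi - lo
    offset = (n - lo) & MASK               # R0 R1  SUB,  -> n - lo
    total = offset + (~span & MASK) + 1    # TOS R1 SUB,  as add-the-complement
    carry = total > MASK
    return not carry                       # NC IF, TOS SETO, ENDIF,

print(within(5, 0, 10), within(10, 0, 10), within(-3, -5, 5))
```

The single unsigned comparison handles negative bounds too, because both differences wrap around in 16 bits.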

##### Share on other sites
2 hours ago, intvnut said:

Also, I'm not an expert in TMS9900 instruction timings.  Would there be any advantage to using relative addressing rather than POPs?  e.g @2(SP), @4(SP), etc. and then doing a single stack update at the end?

I could get at the stack elements as you suggest.  The advantage of using *SP+ is that it works like a real POP instruction, which makes it easy to keep track of things.

However, if there were a lot of parameters your method could be the way to go.  I should think that indexed addressing, despite its overhead, would be an advantage at more than 4 or 5 arguments, with  SP  XX ADDI  at the end, but I would need to do the math.

FYI I do use   @2(SP)  TOS MOV, to implement OVER in Forth.  Stack diagram: ( n1 n2 --> n1 n2 n1 )

And I also have a library file with little routines called 3RD and 4TH that lets me bring deeper stack elements up to the TOS register just as fast as OVER on a 9900.

The dirty little secret of Forth is that you should never be working on deep stack items. The language works best with short routines that work on the top 3, 4 max.

But you have to lay out the code that way from the start. When you do, it's more like using an 8-bitter with X, Y and Accumulator registers.
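A toy Python model of the indexed stack access described above, assuming the top item is cached in a register and the rest sit in memory below a downward-growing stack pointer. The class and method names are invented for the sketch; it is not the actual Forth system's code.

```python
class DataStack:
    """Toy 16-bit data stack: top item in a register, the rest in memory."""

    def __init__(self, base=0x1000):
        self.base = base
        self.sp = base        # stack pointer, grows downward by 2
        self.tos = 0          # cached top-of-stack register
        self.mem = {}         # word memory keyed by byte address

    def push(self, value):
        self.sp -= 2                   # make room on the memory stack
        self.mem[self.sp] = self.tos   # spill the old top to memory
        self.tos = value

    def copy_from(self, offset):
        """Push TOS, then load the cell at offset(SP) into TOS."""
        self.push(self.tos)
        self.tos = self.mem[self.sp + offset]

    def items(self):
        """Return the stack contents, bottom first."""
        if self.sp == self.base:
            return []
        below = [self.mem[a] for a in range(self.base - 4, self.sp - 1, -2)]
        return below + [self.tos]

s = DataStack()
for n in (10, 20, 30):
    s.push(n)
s.copy_from(2)       # like OVER: copies the second item (20)
print(s.items())     # [10, 20, 30, 20]
s.copy_from(6)       # copies the fourth item from the top (10)
print(s.items())     # [10, 20, 30, 20, 10]
```

Reaching deeper only changes the offset, which is why picking a deeper item costs about the same as OVER in this model.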

##### Share on other sites

Thank you for this excellent tutorial.  This explanation and some explanation of the SBB instruction on StackOverflow made it clearer.

It looks like in your S R1,R1 example the SBB instruction brings the value of the carry flag into the calculation.

This makes me wonder if I could emulate it with a store status and a shift instruction... hmmm?

Will try that.

##### Share on other sites

It's hard to win with the 9900.

If I want to remove the jump I can do the code in the screen capture and it will work MOST of the time. False is 0 but TRUE is >1000.

However ANS Forth defines TRUE as -1 (all bits set) so I need to do 3 more instructions to get it truly correct.

This makes it both bigger and slower than just using the jump.

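```
CODE WITHIN   ( n  lo  hi -- flag )
      R0  POP,          \ R0 = lo
      R1  POP,          \ R1 = n
      R0  TOS SUB,      \ TOS = hi - lo
      R0  R1  SUB,      \ R1  = n - lo
      TOS R1  SUB,      \ carry is clear when (n-lo) U< (hi-lo)
      TOS STST,         \ copy the status register to TOS
      TOS 1000 ANDI,    \ isolate the carry bit (>1000)
      TOS 0C SRL,       \ shift it down to bit 0: 1 or 0
      TOS DEC,          \ carry clear -> -1 (TRUE), carry set -> 0 (FALSE)
      NEXT,
ENDCODE
```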
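The mask-and-shift step can be modelled in a few lines of Python. This is only a sketch: the function names are invented, and it shows both possible normalizations, since which carry state should mean TRUE depends on the comparison (the `NC IF,` version quoted earlier treats a clear carry as TRUE).

```python
MASK = 0xFFFF
CARRY_BIT = 0x1000   # position of the carry bit in the status word (>1000)

def carry_as_0_or_1(status):
    """ANDI with >1000, then SRL by 12."""
    return (status & CARRY_BIT) >> 12

def flag_if_carry(status):
    """Negate: carry set -> 0xFFFF, carry clear -> 0."""
    return -carry_as_0_or_1(status) & MASK

def flag_if_no_carry(status):
    """Decrement: carry set -> 0, carry clear -> 0xFFFF."""
    return (carry_as_0_or_1(status) - 1) & MASK

for st in (0x1000, 0x0000):
    print(hex(st), hex(flag_if_carry(st)), hex(flag_if_no_carry(st)))
```

Either way it is three instructions after the STST, which is the cost being weighed against the jump.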

##### Share on other sites

Yeah, I had thought briefly about suggesting STST, mask, shift, etc. but it quickly got expensive, so I didn't.

##### Share on other sites
3 hours ago, TheBF said:

Thank you for this excellent tutorial.  This explanation and some explanation of the SBB instruction on StackOverflow made it clearer.

It looks like in your S R1,R1 example the SBB instruction brings the value of the carry flag into the calculation.

Actually, I'm in good practise right now because it's yet another turn for me to teach the introductory lecture at our university in Nuremberg (Grundlagen der Informatik = Basics of computer science). 🙂

And we just had the complements this week.

##### Share on other sites
5 hours ago, TheBF said:

By the way... did anybody notice that I pasted Assembly language into the screen and then ran and tested it interactively.

I'm just saying... not bad for 1970's technology😎

##### Share on other sites
On 7/23/2019 at 2:02 AM, mizapf said:

What I wanted to point out is that if we want to set up something like a stack with a register as stack pointer, this does not mix well with the BLWP concept. TI should have kept such a stack pointer outside of the R0-R15 set, but it is easy in hindsight to say what one should have done.

Not necessarily. The concept of context switch usually also goes with the concept of private stacks. When you have a stack, you don't need BLWP to be able to nest a lot of subroutine calls. Just push the links. So when you actually do a BLWP, it's either because you just want to do something complex enough to motivate using a private workspace (you can easily copy the caller's stack pointer, if you want to), or you want to enter a complex environment, where you want to use your own stack. This could be an interrupt thing, or a process called by a scheduler in a multiprogrammed system.

I've used both these concepts on the 99, so they are perfectly doable.

##### Share on other sites
On 7/27/2019 at 10:03 PM, FarmerPotato said:

I want this. I want a 4A that has nothing but FORTH on it. Then it would be a 99/4thA.

Seriously, I would use this.

That's one of the big advantages with my internal memory expansion. I can replace the internal ROM chips (8 KBytes) with whatever I like.

##### Share on other sites

I use this private workspace concept precisely in the Forth Multi-tasker for the 9900.  I copy the root task workspace and alter the two stack pointers and the Forth interpreter pointer and voila, a new task is ready to go.  One of my "innovations" (I think) is that I do the  context switch with the RTWP instruction.

The FORK command in Forth manages this such that I can insert new tasks into the round-robin chain and allocate the new stacks.  The ASSIGN command sets the interpreter pointer for the program you want to run.  I have not gone to the trouble of making a de-linking command because I would never use it, but it could be done of course.
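For readers who have not used RTWP this way, here is a small Python model of the idea. RTWP reloads WP, PC and ST from R13, R14 and R15 of the current workspace; everything else here (the `CPU` class, `switch_to`, the addresses, and the assumption that each task keeps its own resume context in its own R13-R15) is hypothetical and not taken from the actual multi-tasker.

```python
from dataclasses import dataclass, field

@dataclass
class CPU:
    wp: int = 0
    pc: int = 0
    st: int = 0
    mem: dict = field(default_factory=dict)   # word memory keyed by byte address

    def reg_addr(self, n):
        return self.wp + 2 * n       # register n lives at WP + 2n

    def rtwp(self):
        """RTWP: reload WP, PC and ST from R13, R14 and R15."""
        new_wp = self.mem[self.reg_addr(13)]
        new_pc = self.mem[self.reg_addr(14)]
        new_st = self.mem[self.reg_addr(15)]
        self.wp, self.pc, self.st = new_wp, new_pc, new_st

def switch_to(cpu, task_wp):
    """Hypothetical switch: point WP at the task's workspace, then RTWP."""
    cpu.wp = task_wp
    cpu.rtwp()

cpu = CPU()
task_wp = 0x8320                   # made-up workspace address
cpu.mem[task_wp + 26] = task_wp    # R13: workspace to resume
cpu.mem[task_wp + 28] = 0xA000     # R14: program counter to resume
cpu.mem[task_wp + 30] = 0x0000     # R15: status register to resume
switch_to(cpu, task_wp)
print(hex(cpu.wp), hex(cpu.pc))    # 0x8320 0xa000
```

The appeal is that one instruction restores all three pieces of context at once.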

##### Share on other sites
On 7/31/2019 at 4:23 AM, senior_falcon said:

Looking at the GPL interpreter in INTERN, it's hard for me to see how this could work. The GPL interpreter is hard coded in the ROM. At >0078 is MOVB *R13,R9 which fetches the GPL byte into R9. Assuming you set up R9 to point to the byte in RAM, to fetch it would have to be MOVB *R13+,R9.

The PME (the P-Machine Emulator, the program that interprets p-code for the UCSD p-code card) is also executing a stack-centered byte-code. The inner interpreter for the PME is written so that it can handle p-code stored in VDP RAM, normal RAM or GROM. The p-code card contains 48 K of GROM, which holds quite a lot of p-code and some data.

The inner interpreter is moved to RAM PAD at >8300 when the system executes. There are different versions of it, to handle exactly the issue described above. Note that jumps must be handled differently, since a jump when the code is in RAM simply consists of reloading the instruction pointer, but a jump when the code is in GROM, or VDP RAM, consists of reloading the read address to the memory device port. And that reloading is different for VDP RAM and GROM.
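The difference between the fetch variants can be sketched in Python. This is only an illustration of the principle described above: the classes, the three opcodes and the tiny interpreter loop are invented and have nothing to do with real p-code or with the PME's actual code.

```python
class RamCode:
    """Code in CPU RAM: the interpreter keeps its own instruction pointer."""
    def __init__(self, data):
        self.data, self.ip = bytes(data), 0
    def fetch(self):
        byte = self.data[self.ip]
        self.ip += 1                  # pointer advances with each fetch
        return byte
    def jump(self, addr):
        self.ip = addr                # a jump just reloads the pointer

class PortCode:
    """Code behind an auto-incrementing port (GROM- or VDP-style)."""
    def __init__(self, data):
        self.data, self._addr = bytes(data), 0
        self.address_writes = 0
    def fetch(self):
        byte = self.data[self._addr]  # reading the data port...
        self._addr += 1               # ...advances the device's own counter
        return byte
    def jump(self, addr):
        self._addr = addr             # a jump must rewrite the device address
        self.address_writes += 1

HALT, EMIT, JUMP = 0x00, 0x01, 0x02   # invented opcodes for this sketch

def run(code):
    out = []
    while True:
        op = code.fetch()
        if op == HALT:
            return out
        if op == EMIT:
            out.append(code.fetch())
        elif op == JUMP:
            code.jump(code.fetch())

program = [JUMP, 4, EMIT, 99, EMIT, 7, HALT]
print(run(RamCode(program)), run(PortCode(program)))
```

The interpreter loop is identical; only `fetch` and `jump` differ per memory type, which matches the need for several versions of the inner interpreter.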

##### Share on other sites
20 minutes ago, TheBF said:

One of my "innovations" (I think) is that I do the  context switch with the RTWP instruction.

That works fine. I've done the same thing, when doing multi-tasking things in assembly. I've actually posted that code here once, in some thread.

When I forced a task switch in the UCSD-system, I simply called the operating system's task switch routine.

Do you use pre-emptive task switching in Multi-task Forth, or is it voluntary?

##### Share on other sites

Traditional Forth multi-tasking uses a cooperative method. The routine is typically embedded in primitive I/O routines so that any time there is a loop waiting for I/O or something is sent to the output, the context switches. This works well because of the fast switching.  ISRs are traditionally reserved in these old systems for non-determinate routines like serial input or data acquisition code.  This way the cooperative tasks never get in the way of the mission critical/real-time routines and things work smoothly.

So I am not as smart as I think I am with the RTWP method.  That makes sense. I am not that smart.

I have thought many times about just hooking my context switch to the ISR routine to see what happens.  It does make things a lot more complicated, requiring locks on shared resources, which takes more code, so I have never had the need.  I could put a watchdog timer on the ISR to switch out of misbehaving tasks, I suppose.
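As a rough picture of the cooperative scheme, here is a Python sketch using generators, where each `yield` stands in for the voluntary switch inside an I/O routine. The names are invented and this is not how the Forth multi-tasker is implemented.

```python
from collections import deque

def task(name, steps, log):
    for i in range(steps):
        log.append(f"{name}{i}")    # do a bit of work
        yield                       # voluntary switch, e.g. inside an I/O word

def run_round_robin(tasks):
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)           # run until the task gives up the CPU
            ready.append(current)   # back to the end of the chain
        except StopIteration:
            pass                    # task finished, drop it from the chain

log = []
run_round_robin([task("A", 2, log), task("B", 3, log)])
print(log)  # ['A0', 'B0', 'A1', 'B1', 'B2']
```

No locks are needed here because a task can only lose the CPU at a point it chose itself.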

##### Share on other sites

I tried to convert the UCSD p-system to include pre-emptive multitasking. It kind of works, but there's something in HEAPOPS that goes wrong. Without dynamic variables, multitasking isn't worth much, so I gave up there.

It has occurred to me at later dates that I could probably make another attempt, where I instead allow the use of the ATTACH intrinsic. When it's properly implemented, it allows linkage of hardware interrupts to semaphores, which in turn implies that you can let a Pascal process go due to an interrupt. It will still be a system with a voluntary task switch, but you can have a semaphore being signalled by an external event. Such an approach would probably allow the HEAPOPS to work properly, as it wouldn't be interrupted unless it wanted to itself.

Of course you are smart. It's just that you aren't alone...

Anyway, using RTWP to call a process is a pretty given thing in the TMS 9900 multitasking world.

##### Share on other sites

I know nothing about P-system but it sounds like HEAPOPS is one of those "shared resources" I mentioned.  Pre-emption can make that kind of thing break.

In fact I made a very simple MALLOC for the lower 8K and it is a challenge to manage across tasks. The only way I can think of is to pre-allocate a chunk for a task and limit the task to that chunk.

Starting a process on a semaphore is still very useful. You perhaps can create a separate chain of tasks that are "safe" that run when the interrupt triggers. Probably lots of potential uses.

Forth typically solved this problem by not solving it.   Since it is normally used in dedicated embedded products, memory was statically allocated when the program started. It tends to be more reliable to do that in mission critical systems. It of course means you will probably over-allocate a bit to be safe and waste some space.

##### Share on other sites

HEAPOPS are Heap Operations, i.e. the operating system procedures which manage the heap. Like in many other systems, the p-system has two variable storage areas, the stack and the heap. The stack is used in the conventional manner, i.e. things are always pushed on the top of the stack and also popped off the top of the stack. The heap is a memory area where sections are allocated and released in any order.

The stack is used to allocate return links, data to be processed, environment records (local variables) and such stuff. Each process (independently executing program in a multi-programmed environment) has its own stack. You allocate one when you start the process.

Dynamic variables are allocated on the heap. They can be used for anything, but typical uses are leaves of trees of unknown (at programming time) size, buffers and other structures, where you don't know in advance what you need, or it makes sense to dispose of the structures after use, so you don't occupy large areas of memory with data space that's idle.

If several processes use the heap to allocate and dispose of variables, the heap has to protect itself against being interrupted at points where that can't be allowed. It seems it doesn't. Many other things did work well, though.

This is different compared to many versions of Forth, where typically two stacks are used instead. One data stack and one return stack. The way Pascal works, when you create a "word" (a function or procedure in Pascal), you also create the local variables the "word" needs. When the "word" ends, you pop all these local variables, pop the return address, push the results from the "word" (if it's a function) on the stack and return. So no separate return stack is needed.

##### Share on other sites

Yes, as you describe, Forth's two stacks are used for the operation of the VM.

ANS Forth also permits named local variables that can be allocated on the return stack (or another stack) like Pascal or C with the optional "locals" wordset.

ANS Forth has a dynamic memory wordset for interfacing to O/S heap management or you can build it your own way.

I got a PD version to run on the lower 8K in this post:

Traditional Forth has a third memory space. The "heap" in Forth is the dictionary memory where all the code and labels go.

It is managed with one pointer, the "dictionary pointer". The entire memory management command set is only four small commands.

Below is the Forth memory management "system"   in Forth and also Forth Assembler

(Assembler is more appropriate in this Forum, think of NEXT, like return in Assembler)

These simple routines could be called as sub-routines in Assembler if a simple memory management system was needed.

The extreme simplicity means that you can use ALLOT to allocate memory or de-allocate memory by using a negative parameter.

```
\ Forth memory management commands in Forth

: HERE  ( -- addr) DP @  ;             \ fetch the dictionary pointer
: ALLOT    ( n --) DP +! ;             \ add n to DP (allocate some space)
: ,       ( n -- ) HERE !   2 ALLOT ;  \ put 'n' into memory, advance DP by 2
: C,      ( c --)  HERE C!  1 ALLOT ;  \ put 'char' in memory, advance DP by 1

\ ==========================================
\ Equivalent code in 9900 Forth Assembler
CODE: ALLOT ( n -- )
      TOS DP @@ ADD,     \ add 'n' to the dictionary pointer
      TOS POP,           \ refills top of stack register R4
      NEXT,
      END-CODE

CODE: HERE  ( -- addr)
      TOS PUSH,          \ make space in top of stack register R4
      DP @@ TOS MOV,     \ fetch the dictionary pointer
      NEXT,
      END-CODE

CODE: ,     ( n --)
      DP @@ R1 MOV,      \ get next available memory -> R1
      TOS *R1 MOV,       \ put 'n' from stack into memory
      DP @@ INCT,        \ "allocate" the memory
      TOS POP,           \ refill TOS register
      NEXT,
      END-CODE

CODE: C,    ( c --)
      DP @@ R1 MOV,      \ get next available memory -> R1
      TOS SWPB,          \ TOS byte needs fixing for 9900
      TOS *R1 MOVB,      \ store byte in memory
      DP @@ INC,         \ "allocate" the memory
      TOS POP,           \ refill TOS register
      NEXT,
      END-CODE
```
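If Forth is unfamiliar, the same four commands can be modelled in Python over a byte array. This is a sketch under assumptions: the `Dictionary` class and its method names are invented, and cells are stored big-endian as on the 9900.

```python
class Dictionary:
    """Toy model of a Forth dictionary managed by a single pointer."""

    def __init__(self, size=64):
        self.mem = bytearray(size)
        self.dp = 0                       # dictionary pointer

    def here(self):
        return self.dp                    # HERE

    def allot(self, n):
        self.dp += n                      # ALLOT; negative n gives space back

    def comma(self, n):
        # ,  stores a 16-bit cell at HERE, then advances DP by 2
        self.mem[self.dp:self.dp + 2] = (n & 0xFFFF).to_bytes(2, "big")
        self.allot(2)

    def c_comma(self, c):
        # C, stores one byte at HERE, then advances DP by 1
        self.mem[self.dp] = c & 0xFF
        self.allot(1)

d = Dictionary()
d.comma(0x1234)
d.c_comma(0x41)
print(d.here(), d.mem[:3].hex())   # 3 123441
d.allot(-3)                        # de-allocate with a negative parameter
print(d.here())                    # 0
```

Note that giving space back only moves the pointer; nothing is erased, and nothing stops a later allocation from overwriting the old contents.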

Of course the old TI-99 has many more memory spaces. I replicated the four commands with different names to manage VDP memory. This worked really well with little overhead. ( VHERE VALLOT  V,  VC, )  These let me build strings and arrays in VDP RAM with code "borrowed" from Forth itself.

I may do the same for the SAMS card. I have a library that uses 32 bit addresses for SAMS so it could look like contiguous 1Mbyte with that scheme.

I did a slightly different version called MALLOC and MFREE for the low 8K RAM block and I use this for dynamically allocating temporary buffers.

As we mentioned all these things get more complicated when multi-tasking. I have been solving that ad hoc up to now.

##### Share on other sites

Earlier versions, like UCSD p-system II (similar to Apple Pascal) didn't, but version IV does support a "true heap", i.e. a heap where you can allocate and reclaim space, regardless of which order you do it in.

Earlier p-system versions worked only with mark and release, where you could put a mark on the heap, then issue any number of new and finally get rid of them by release, which wound you back to the corresponding mark. But version IV will allow you to do new(a), new(b), new(c) in a row, then dispose(a) and actually free the space used by a, but still keep b and c. This does of course cost more complexity and time for these operations, but is in line with the p-system's philosophy of prioritizing being able to do as much as possible with a small memory footprint.
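The two styles can be contrasted with a small Python sketch. Both classes are invented for illustration; in particular, the first-fit free list is just one simple way to reuse disposed space and is not claimed to be what the p-system does internally.

```python
class MarkReleaseHeap:
    """Heap that can only be unwound back to an earlier mark."""
    def __init__(self):
        self.top = 0
    def new(self, size):
        addr = self.top
        self.top += size
        return addr
    def mark(self):
        return self.top
    def release(self, mark):
        self.top = mark          # frees everything allocated since the mark

class FreeListHeap:
    """Heap where individual blocks can be disposed in any order."""
    def __init__(self):
        self.top = 0
        self.free = []           # list of (addr, size) holes
    def new(self, size):
        for i, (addr, hole) in enumerate(self.free):
            if hole >= size:     # first fit: reuse a hole if one is big enough
                if hole > size:
                    self.free[i] = (addr + size, hole - size)
                else:
                    del self.free[i]
                return addr
        addr = self.top          # otherwise grow the heap
        self.top += size
        return addr
    def dispose(self, addr, size):
        self.free.append((addr, size))

h = FreeListHeap()
a, b, c = h.new(8), h.new(4), h.new(4)
h.dispose(a, 8)                  # free a, keep b and c
print(h.new(6), h.free, h.top)   # 0 [(6, 2)] 16
```

The extra bookkeeping in `FreeListHeap` is the added complexity and time mentioned above.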

##### Share on other sites

Very interesting. This is exactly how Forth evolved in this area.  I understand that the old Apple OS9 and earlier was more static allocation as well.

Evolution takes time. There are few shortcuts it seems.

##### Share on other sites

In this old post (http://www.vcfed.org/forum/showthread.php?49301-TI990-minis&p=384192#post384192) pnr says that

"By the late 70's a major use for the TI990 was running Cobol programs. At that time Cobol (and Fortran) did not have support for recursive procedures: the local storage was not stack allocated, but had fixed memory allocations. In this pattern, each procedure (or 'performed' paragraph) would have its own workspace and be called with a BLWP. According to the article, Cobol code did a context switch every 30 instructions or so. Compare this to e.g. an IBM360 that had to perform a full registers Load/Store Multiple to a procedure save area on each context switch. For 1970's Cobol the "registers in memory workspace" concept was quite defensible."

Is the architecture of the TI-99/4A a derivative of the 990?

##### Share on other sites
4 hours ago, Elia Spallanzani fdt said:

Is the architecture of the TI-99/4A a derivative of the 990?

There is another thread recently where this was discussed:

Basically the TMS9900 CPU is the microchip version of the CPU-on-a-board used in the early 990-minicomputers.  The 99/4A is not the same architecture as the 990-mini, but it does share the CPU design.

##### Share on other sites

Last night I failed to work out a bug in my game's collision detection routine due to fatigue. I was comparing the screen column of a jeep sprite (in R8) to the screen column of a tank sprite (in R9). The bug was resolved after eight hours of sleep. The error I made was:

CI    R8,R9        * incorrectly using Compare Immediate

This morning I "fixed" the error by changing the line to read:

C     R8,R9        * why I couldn't resolve this at 1am is anyone's guess

The collision routine ran flawlessly after this simple change.

My question is why didn't the assembler detect this error?

How was I able to incorrectly Compare-Immediate the contents of R8 to "R9"?

##### Share on other sites

When you use the R option, it’s as if the assembler inserts:

R9 EQU 9

CI R8,9

The assembler is just following simple orders, so it doesn’t see anything wrong.

This kind of thing should be a warning in modern practice; maybe it is in xas99?
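A tiny Python sketch of why nothing gets flagged: if register names are plain symbols with numeric values, an immediate operand resolves like any other expression. The helper names are invented, and only the register-direct forms of CI and C are encoded here.

```python
SYMBOLS = {f"R{n}": n for n in range(16)}   # what the R option effectively defines

def value_of(operand):
    """Resolve a symbol or a decimal literal to a number."""
    return SYMBOLS[operand] if operand in SYMBOLS else int(operand)

def assemble_ci(reg, imm):
    """CI reg,imm -> opcode word >0280+reg, followed by the immediate word."""
    return [0x0280 | value_of(reg), value_of(imm) & 0xFFFF]

def assemble_c(src, dst):
    """C src,dst with both operands as plain workspace registers."""
    return [0x8000 | (value_of(dst) << 6) | value_of(src)]

print([hex(w) for w in assemble_ci("R8", "R9")])  # ['0x288', '0x9']
print([hex(w) for w in assemble_ci("R8", "9")])   # ['0x288', '0x9']
print([hex(w) for w in assemble_c("R8", "R9")])   # ['0x8248']
```

So `CI R8,R9` quietly compares R8 with the constant 9, which is a perfectly legal instruction.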

##### Share on other sites

I was using Asm994a, which didn't seem to flag it or provide a warning.
