Asmusr Posted August 23, 2018
It would have been more logical to interpret "B R15" as "branch to the address in R15". But instead, it means "branch to the address of R15". It certainly would. And if B *R15 had meant "jump to the address contained in the word pointed to by R15", that would have been very useful for jump tables.
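A minimal sketch of the distinction being discussed (labels here are hypothetical, and the two-instruction sequence is just one way to get the wished-for double indirection):

```asm
* Workspace at >8300, so R15 lives in memory at >831E.

       B    R15         * branches TO the address of R15 itself (>831E)
       B    *R15        * branches to the address CONTAINED in R15

* The wished-for "jump via the table entry R15 points at"
* takes two steps on the 9900:
       MOV  *R15,R1     * fetch the jump-table entry R15 points at
       B    *R1         * then branch through it
```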
Tursi Posted August 23, 2018
It is valid, but not the same. If your workspace pointer is set to >8300, "BL R15" is the same as "BL @>831E" (branching to the assembly code in memory at the same address as R15). I wonder what the C99 compiler was using it for? Maybe a breakpoint or tracepoint where R15 could easily be changed between "B *R11" or "something else" to cause compiled code to do something different dynamically at runtime? Just a guess... It was used for a stack push subroutine call. Just trying to keep it as quick as possible, I guess (register is faster than register indirect).
RXB Posted August 23, 2018 (edited)
One of the reasons FORTH is so fast is that it is STACK BASED. Having several STACKS would seem to be much more efficient than swapping values in Registers to make room for new values.

       BL   @GETSD
       ...
GETSD  MOV  @SAVEDDATA,R7
       AI   R7,41
       MOV  R7,@SAVEDDATA
       RT

VS:

       BLWP @NEWREGISTERSET
       ...
       AI   R7,41          * R7 is SAVEDDATA
       RTWP

I am not a great Assembly Language programmer, but the second one seems to make much more efficient sense long term. Now I could be wrong, but if you have a huge complicated program, which one saves more space and is faster?
Edited August 23, 2018 by RXB
+Lee Stewart Posted August 23, 2018
"B R15" ... means "branch to the address of R15". Is the timing for the address-modification piece still 0 clock cycles and 0 memory accesses, as for register direct addressing, or does it change, in this instance, to 8 clock cycles and 1 memory access, as for symbolic addressing?
...lee
Tursi Posted August 23, 2018 (edited)
I am not a great Assembly Language programmer, but the second one seems to make much more efficient sense long term. Now I could be wrong, but if you have a huge complicated program, which one saves more space and is faster?
Well, just for fun, we can look at the numbers. Assuming a normal arrangement of 8-bit code and 16-bit registers:

       BL   @GETSD            * 4 bytes, 28 cycles
GETSD  MOV  @SAVEDDATA,R7     * 4 bytes, 30 cycles
       AI   R7,41             * 4 bytes, 22 cycles
       MOV  R7,@SAVEDDATA     * 4 bytes, 34 cycles
       RT                     * 2 bytes, 16 cycles

== 18 bytes, 130 cycles, 2 bytes of storage (R11 for BL/RT)

VS:

NEWREGISTERSET DATA NEWWP,GETSD
       ...
       BLWP @NEWREGISTERSET   * 4 bytes, 34 cycles
GETSD  AI   R7,41             * 4 bytes, 22 cycles
       RTWP                   * 2 bytes, 18 cycles

== 10 bytes, 74 cycles, 8 bytes of storage (R7,R13,R14,R15) and 4 bytes for the vector.

But the first one could be exactly the same as the second, only slightly faster:

       BL   @GETSD            * 4 bytes, 28 cycles
GETSD  AI   R7,41             * 4 bytes, 22 cycles
       RT                     * 2 bytes, 16 cycles

== 10 bytes, 66 cycles, 4 bytes of storage (R7,R11)

You can relieve the pressure on R7 slightly more easily, too:

VAL41  DATA 41
       ...
       BL   @GETSD            * 4 bytes, 28 cycles
GETSD  A    @VAL41,@SAVEDDATA * 6 bytes, 38 cycles
       RT                     * 2 bytes, 16 cycles

== 12 bytes, 82 cycles, 2 bytes of storage (R11 for BL/RT)

Assembly on the TI is always a set of tradeoffs. Are you coding for size or for speed? Are your registers constrained, or do you have lots free? BLWP is great if you need to swap in a new register set AND need all the information about the caller (if you don't care and can get back yourself, LWPI is only 18 cycles in 8-bit RAM and doesn't tie up any registers with caller information). The one thing I've found (kind of) nice is that the TI is so memory-bound by the multiplexer and the base instruction cost that in many cases, coding for size ALSO generates the fastest code. Not all, but certainly many.
I'd argue myself, in the example given, that a subroutine of any kind for a single add is silly - just put the add inline. It takes the same number of bytes as any literal jump and saves all the overhead (or, if you need it to be flexible but it fits in one word, consider the X instruction, which adds just 12 cycles plus the cost of reading the argument, which is 0 if it's a register!). But in practical terms you'd usually have more content. Stacks are tricky to do well on the 9900 - what ARE the Forth developers using for their stack functions?
Edited August 24, 2018 by Tursi
+Lee Stewart Posted August 24, 2018
Stacks are tricky to do well on the 9900 - what ARE the Forth developers using for their stack functions?
In TI Forth and fbForth, there is a parameter (data) stack, with R9 of the Forth workspace (on the 16-bit bus) as the stack pointer (SP), and a return stack, with R14 as the stack pointer (RP). Both stacks are 1 cell (2 bytes) wide and grow downward. The parameter stack grows down from the high end of “high” RAM toward the dictionary, and the return stack grows down from the high end of “low” RAM toward the Forth support routines and block buffers. Pushing a value onto either stack involves reserving space by decrementing the stack pointer by 2 and then copying the new value into that 16-bit space. For the parameter stack:

       DECT SP
       MOV  @ADDR,*SP

Popping values from either stack is easier because the stack pointer can be dereferenced and autoincremented in one instruction. For the return stack:

       MOV  *RP+,@ADDR

One can do other stackrobatics, but the basic pushing and popping of 16-bit values is as explained above.
...lee
+TheBF Posted August 24, 2018 (edited)
In TI Forth and fbForth, there is a parameter (data) stack, with R9 of the Forth workspace (on 16-bit bus) as the stack pointer (SP), and a return stack, with R14 as the stack pointer. [...]
CAMEL99 Forth uses one simple variation of the above mechanisms. It caches the top element of the data stack (TOS) in R4. This makes some operations much faster and others slower. The literature indicates about a 10% speed improvement on a threaded Forth for most CPUs. I can attest to about a net 8% improvement on the 9900 vs. not caching TOS. It makes operations that are stack-neutral very efficient, and operations that consume 2 items from the stack and return a result to the stack are also very efficient.

CODE: 1+ ( n -- n')
      TOS INC,
      NEXT,
END-CODE

CODE: + ( u1 u2 -- u )
      *SP+ TOS ADD,   \ add 2nd item to TOS and increment the stack pointer
      NEXT,
END-CODE

Managing the TOS register for more complex operations, which consume all the input parameters that are on the stack, normally means just doing a refill at the end of the operation.

CODE: ! ( n addr -- )  \ store n in address
      *SP+ *TOS MOV,
      TOS POP,
      NEXT,
END-CODE

But I think you can see that managing stacks is not so bad on the 9900.
In fact, using a Forth assembler, CAMEL99 creates macros PUSH and POP for the data stack and RPUSH and RPOP for the return stack. This makes the code look like it's running on a stack machine. :-)
Edited August 24, 2018 by TheBF
RXB Posted August 24, 2018
I have found stacks are much better for management than just spaghetti code using BL all the time. Also, if R6 had 41 in it and you used BLWP, you could just do A R6,R7 instead of AI R7,41. For some reason you omitted my code MOV @SAVEDDATA,R7, which my BLWP version did not need, as R7 was SAVEDDATA itself. Additionally, having 32 registers is faster than using only 16 registers, as the more complex and larger programs get, the more swapping of values is needed.
+mizapf Posted August 24, 2018
Is the timing for the address modification piece still 0 clock cycles and 0 memory accesses, as for register direct addressing, or does it change, in this instance, to 8 clock cycles and 1 memory access, as for symbolic addressing?
The 9900 still loads the contents of R15 into the ALU, but discards them; so you have one memory access. The 9995 does not.
+TheBF Posted August 24, 2018
I have found stacks are a much better management than just spaghetti code using BL all the time.
I did some Asm experiments with a macro I called 'CALL'. It pushed R11 onto a return stack, did the BL to the subroutine, and when the subroutine returned it popped R11 from the return stack. It was aesthetically pleasing to be able to have subroutines call a subroutine which called a subroutine, etc. But the size cost is quite high on the 9900: 4 instructions, 8 bytes per call. So for most Assembly language coders the need for it would be small, I think, especially if you have a free register to hold R11 temporarily. But it is neat to watch in the debugger. :-)
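One plausible expansion of such a CALL macro, assuming R10 is used as the return-stack pointer (the register choice and labels are hypothetical; the stack grows downward, one cell per call):

```asm
* Hypothetical CALL macro expansion: save R11 around a BL so
* subroutines can nest without clobbering the return address.
       DECT R10         * reserve a cell on the return stack
       MOV  R11,*R10    * push the caller's R11
       BL   @SUB        * call; BL overwrites R11
       MOV  *R10+,R11   * pop R11 back, shrinking the stack
* R11 once again holds the original return address.
```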
+Lee Stewart Posted August 24, 2018
The 9900 still loads the contents of R15 into the ALU, but discards them; so you have one memory access. The 9995 does not.
So... this means that Table 3 in the TMS9900 Microprocessor Data Manual will give the wrong answer for execution times for direct register access for B and BL instructions, right?
...lee
+mizapf Posted August 24, 2018
No, they already included the memory access in the operation time. I recommend using 9900 Family Systems Design, chapter 4, which shows the micro-operations.

B
-------------------------------------------------
Cycle   Type          Description
1       Memory read   AB=PC (address bus), DB=Instruction (data bus)
2       ALU           AB=NC (no change), DB=SD (source data register, internal)
Ns      Data derivation
3+Ns    ALU           AB=NC, DB=SD

Note that each machine cycle shown here requires 2 clock cycles on the 9900. The data derivation sequence takes 1 cycle for Rx, 3 for *Rx, 4 (byte) or 5 (word) for *Rx+, 5 for @>Addr, and 5 for @>Addr(Rx). The reason for the 1 cycle for Rx is the memory access. This is why they call it a "data derivation sequence". On the 9995, only the address is derived.
Tursi Posted August 24, 2018 (edited)
For some reason you omitted my code MOV @SAVEDDATA,R7 that my BLWP did not need to do as R7 was the SAVEDDATA
Ah, because I couldn't figure out why you omitted that step. Comment your code! Edit: it's a good call, though. If your alternate workspace has registers preset with values that you need in the subroutine, that's another potential use case for BLWP (or at least an alternate workspace). I do that in my music player.
Edited August 24, 2018 by Tursi
RXB Posted August 25, 2018 (edited)
It has always been my view that a preloaded alternate set of 16 more registers has to be faster than using @ANYADDRESS or MOV R#,R# for lack of registers. Would it not be fastest with entire programs being nothing but sets of registers? That would also allow modifying instructions on the fly. Damn few CPUs could pull this off, and the 9900 is one of the few, as its registers are memory-mapped. Of course Scratch Pad is faster, but if it were nothing but registers that would have to really pay off.
Edited August 25, 2018 by RXB
RXB Posted August 25, 2018
I did some Asm experiments with a macro I called 'CALL'. It pushed R11 onto a return stack, did the BL to the subroutine, and when the subroutine returned it popped R11 from the return stack. [...]
This is exactly how XB stacks work in the XB ROMs.
senior_falcon Posted August 25, 2018 (edited)
It has always been my view that preloaded alternate set of 16 more registers has to be faster than using @ANYADDRESS or MOV R#,R# for lack of Registers. [...]
I'm sure there could be times when this approach would lead to greater speed, but I think usually it would be slower. Here's why:
1 - To start with, BLWP is quite a bit slower than BL (the gurus here can tell you exactly how much slower).
2 - Usually a subroutine is acting on values in the calling workspace. For example, when you BLWP @VSBW, you have to put the screen address in R0 and the byte to write in R1. When VSBW starts, it has to fetch those values with MOV *R13,R0 and MOV @2(R13),R1 before it can do anything. If you used BL @VSBW instead, those values would already be in the registers and ready to use.
There is a time and place for everything, and BLWP is great if the subroutine is lengthy. In that case the overhead is a much smaller percentage of the program and is not so detrimental. And of course, not having to worry about messing up registers in the calling program is a big plus. And as noted above, preloaded registers would also help, but of course that means one less register available for use in the BLWP workspace.
Edited August 25, 2018 by senior_falcon
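A sketch of the BLWP linkage being described, with hypothetical labels (the vector holds the new workspace pointer and entry address; after BLWP, R13 of the new workspace holds the caller's WP, so the caller's registers are reachable through it):

```asm
* Hypothetical BLWP-style subroutine skeleton.
SUBWS  BSS  32            * private workspace for the subroutine
SUBV   DATA SUBWS,SUB     * vector: new WP, then entry point

* Caller:
*      LI   R0,>0000      * e.g. screen address in caller's R0
*      LI   R1,>4100      * byte to write in caller's R1 (high byte)
*      BLWP @SUBV

SUB    MOV  *R13,R0       * R13 = caller's WP; fetch caller's R0
       MOV  @2(R13),R1    * fetch caller's R1 (offset 2 bytes)
*      ... do the work ...
       RTWP               * restore caller's WP, PC and status
```

This fetch step is exactly the overhead that a plain BL avoids, since BL keeps the caller's registers in place.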
RXB Posted August 25, 2018
I'm sure there could be times when this approach would lead to greater speed. But I think usually it would be slower. [...]
Yea, but then again you have an UNLIMITED NUMBER OF REGISTERS WITH BLWP, AS EACH ONE GIVES YOU 16 MORE! Also, anything to do with VDP, like VSBW, or with GROM would be slower anyway; yes, BL is faster than BLWP, but again you are stuck with only 16 registers. What I was saying was that any variables or data should be BLWP REGISTERS, so the loss in speed between BL vs BLWP would be negligible the larger the program complexity gets. Think like Forth, but instead of 1 endless stack you have hundreds of small stacks - or treat it like all 1 huge stack, but without ROT or SWAP needed to get a value used.
+mizapf Posted August 25, 2018
VSBW / VSBR are probably the worst examples for using BLWP, in terms of overhead. However, every VMBW with a length of at least, say, 5 or 10 characters makes the BLWP overhead small against the overall execution time. DSRLNK is arguably the best use of BLWP. We are really a bit spoiled with direct access to devices. On a system with a 99105 (with a proper multiuser/multitasking operating system), any such access would need to be done through an XOP system call, and the LIMI command would be forbidden (for user processes).
+TheBF Posted August 26, 2018
Yea, but then again you have UNLIMITED NUMBER OF REGISTERS WITH BLWP AS EACH ONE GIVES YOU 16 MORE! [...]
I have been playing with BLWPing directly to the Forth stack. It still takes some effort, because you have to put 3 zeros on the bottom for R13, R14, and R15, but then you have space for 13 parameters. It's not a panacea for everything, but it is cool that the 9900 can do it! I have not found an application for it, but I can envision something like a floating point package, where you need to pass a lot of parameters to a function, where it might be pretty cool.
B
senior_falcon Posted August 26, 2018
Yea, but then again you have UNLIMITED NUMBER OF REGISTERS WITH BLWP AS EACH ONE GIVES YOU 16 MORE! [...]
Rich, I think your idea is interesting. If I understand correctly, you are advocating always using BLWP instead of BL, with the added nuance that every BLWP subroutine has its own unique workspace. The advantage of this is that no stack is necessary - each subroutine can find its way back to the code that called it. If you were to implement a stack for BL, it takes an instruction to push an address on the stack and go up a level, and another instruction (or two) to pop an address from the stack and go down a level. That eliminates the speed advantage of BL. You still have the need for passing values from the calling program to the BLWP sub, and I'm not yet convinced that putting values/data into BLWP registers is as useful as you think it is. There is also the extra memory usage (36 bytes per BLWP sub), which could be a problem with a large program. But as I noted earlier, there is a time and a place for everything, and this is one more useful tool in the assembly programmer's repertoire. Speaking of neat ideas, one of the slicker ones I have seen here is to overlap the BLWP workspace with the main workspace:

USRWS  BSS 16    (but you are really using 32 bytes for USRWS)
BLWPWS BSS 32

You can use all 16 registers in USRWS. When you BLWP to the subroutine, USRWS R8-R15 are BLWPWS R0-R7. And voila! Provided you are using R8-R15 to pass data, it is already in BLWPWS and ready for use.
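A sketch of how this overlap trick lines up in memory (labels and the example value are hypothetical). Because USRWS reserves only 16 bytes, its R8-R15 occupy the first 16 bytes of BLWPWS, i.e. BLWPWS R0-R7:

```asm
* Overlapping-workspace sketch: parameters written to the caller's
* R8-R15 are already sitting in the subroutine's R0-R7.
USRWS  BSS  16            * R0-R7 private; R8-R15 fall inside BLWPWS
BLWPWS BSS  32            * starts where USRWS R8 would be
SUBV   DATA BLWPWS,SUB    * hypothetical BLWP vector

* Caller (running with WP = USRWS):
*      LI   R8,1234       * this word is also BLWPWS R0
*      BLWP @SUBV

SUB    MOV  R0,R1         * R0 already holds the caller's R8 - no
*                           MOV *R13 fetch needed
       RTWP
```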
RXB Posted August 26, 2018 (edited)
Yea, that is what I was talking about. So say R8 to R15 are screen pointers: say R8 is Row and R9 is Column, R10 is the character buffer and color of that set; then you have R11 return address, R12 CRU, R13 GROM address, R14 stack address and R15 VDP address for multiple writes/reads. Overlap 8 bytes and now you have a whole new set, sort of like jumping up and down a stack all the time. Everyone keeps thinking inside the 32- or 16-register box; instead, think outside that box and use less distance in the overlap - this means you get more stack-like features.
Edited August 26, 2018 by RXB
GDMike Posted January 27, 2019
I never saw how to implement the standard character set (SCS) that Matt started discussing. I also couldn't get his VSBR to work, and I patched between someone else's code for VMBW and Matt's to make my own VMBW work. Uncool. Note I'm using an E/A module on real hardware. Why did I have so much trouble? I just used the examples for VSBW, VMBW, VSBR, VMBR and VWTR. No REFs, just EQUates. And none worked for me using the workspace and info from the first post.
+Lee Stewart Posted January 27, 2019
I never saw how to implement the standard character set (SCS) that Matt started discussing. [...]
Where is your code? We need code to assist you in discovering the difficulties.
...lee
RXB Posted January 27, 2019 (edited)
Here is my thinking, like I do sometimes, using BLWP with an overlapping area of memory for both workspaces:

SET 1 : USE               : SET 2 : USE
--------------------------------------------------------------
R0      TEMP
R1      TEMP
R2      TEMP
R3      TEMP
R4      TEMP
R5      TEMP
R6      TEMP
R7      TEMP              : R0      TEMP
R8      TEMP              : R1      TEMP
R9      TEMP BL RETURN    : R2      TEMP BL RETURN
R10     TEMP or ERROR 1   : R3      TEMP or ERROR 1
R11     RETURN BL 1       : R4      RETURN BL 1
R12     CRU USE 1         : R5      CRU USE 1
R13     GROM ADDRESS 1    : R6      GROM ADDRESS 1
R14     Interpreter flags : R7      Interpreter flags
R15     VDP ADDRESS 1     : R8      VDP ADDRESS 1
                          : R9      TEMP BL RTN 2
                          : R10     TEMP
                          : R11     RETURN BL 2
                          : R12     CRU USE 2
                          : R13     GROM ADDRESS 2
                          : R14     TEMP
                          : R15     VDP ADDRESS 2

You can see more registers and multiple uses. It saves code space, as SET 1's LI R7,>7F00 also loads SET 2's R0, and instead of A @VALUE,R7 you can use another BLWP to SET 2 and instead use A R10,R0. This saves space, and if both are in SCRATCH PAD RAM it makes more sense than a temp area. Never understood why TI never did this?
Edited January 27, 2019 by RXB
GDMike Posted January 27, 2019
Submitting photos... I'm sure it's me somehow that's got it wrong here, and I'm using the nanoPEB with the E/A cart.