BillG Posted September 14, 2020 Share Posted September 14, 2020 As mentioned in another thread, the following code to add a small constant to a variable: 00037 * W0 := W0 + 2; 00038 00039 * * 0 := v W0 -> 1 00040 * * 1 L r 2 00041 00042 * * 2 L v W0 -> 3 00043 * * 3 + c 2 00044 00045 00046 * 1 L r 2 00047 * 2 L v W0 -> 3 00048 * 3 + c 2 0052 C020 002E 00049 mov @W0,R0 0056 05C0 00050 inct R0 00051 * 0 := v W0 -> 1 0058 C800 002E 00052 mov R0,@W0 can be optimized to this: 00037 * W0 := W0 + 2; 00038 00039 * * 0 := v W0 -> 1 00040 * * 1 L r 2 00041 00042 * * 2 L v W0 -> 3 00043 * * 3 + c 2 00044 00045 00046 * 1 L r 2 00047 * 2 L v W0 -> 3 00048 * 3 + c 2 0052 05E0 002E 00049 inct @W0 00050 * 0 := v W0 -> 1 This is the code to add two signed bytes resulting in a 16-bit number. Is there a better way to do sign extension? 00037 * W0 := S1 + S2; 00038 00039 * * 0 := v W0 -> 1 00040 * * 1 L r 2 00041 00042 * * 2 L v S1 -> 3 00043 * * 3 + v S2 00044 00045 00046 * 1 L r 2 00047 * 2 L v S1 -> 3 00048 * 3 + v S2 0052 04C0 00049 clr R0 0054 D020 0037 00050 movb @S1,R0 0058 1502 (005E) 00051 jgt 2f 005A 0260 00FF 00052 ori R0,>FF 005E 00053 2 005E 06C0 00054 swpb R0 0060 04C1 00055 clr R1 0062 D060 0038 00056 movb @S2,R1 0066 1502 (006C) 00057 jgt 2f 0068 0261 00FF 00058 ori R1,>FF 006C 00059 2 006C 06C1 00060 swpb R1 006E A001 00061 a R1,R0 00062 * 0 := v W0 -> 1 0070 C800 002E 00063 mov R0,@W0 Quote Link to comment Share on other sites More sharing options...
Asmusr Posted September 14, 2020 Share Posted September 14, 2020 I don't know if it's faster, but you can extend the sign using sra r0,8. Then you don't need clr, jgt, ori or swpb. Quote Link to comment Share on other sites More sharing options...
BillG Posted September 14, 2020 Author Share Posted September 14, 2020 Thanks. I was going to do that when optimizing for size. The 9900 does not have a barrel shifter, but shifts a bit at a time. The documentation seems to say one cycle per bit position; if that is true, shifting will be the fastest. Quote Link to comment Share on other sites More sharing options...
apersson850 Posted September 14, 2020 Share Posted September 14, 2020 A shift instruction is roughly equivalent to 1-4 "normal" instructions, depending on how many bits to shift. Which implies that you can't do much other logic before you've consumed as many clock cycles as the shift will need. Quote Link to comment Share on other sites More sharing options...
+Lee Stewart Posted September 14, 2020 Share Posted September 14, 2020 5 hours ago, BillG said: The 9900 does not have a barrel shifter, but shifts a bit at a time. The documentation seems to say one cycle per bit position; if that is true, shifting will be the fastest. I read two cycles per click. ...lee 1 Quote Link to comment Share on other sites More sharing options...
apersson850 Posted September 14, 2020 Share Posted September 14, 2020 BillG was reading information regarding the TMS 9940, so that could be the reason for the error. Quote Link to comment Share on other sites More sharing options...
BillG Posted September 14, 2020 Author Share Posted September 14, 2020 If my calculations are correct, shifting is faster (assuming the workspace is in fast memory and everything else is in slow memory.) 0052 04C0 00049 clr R0 ; 14 : 10 + 4 (fetch) 0054 D020 0037 00050 movb @S1,R0 ; 30 : 14 + 4 (fetch) + 8 + 4 0058 1502 (005E) 00051 jgt 2f ; 12/14 : 8/10 + 4 (fetch) 005A 0260 00FF 00052 ori R0,>FF ; 22 : 14 + 2 * 4 (fetch) 005E 00053 2 005E 06C0 00054 swpb R0 ; 14 : 10 + 4 (fetch) vs 0052 D020 0037 00049 movb @S1,R0 ; 30 : 14 + 4 (fetch) + 8 + 4 0056 0880 00050 sra R0,8 ; 32 : 12 + 4 (fetch) + 16 Quote Link to comment Share on other sites More sharing options...
apersson850 Posted September 14, 2020 Share Posted September 14, 2020 Which is well within the rule of thumb I gave, where I stated that a shift is like 1-4 simple instructions, depending on the number of bits. Since you shift 8 bits it's about two instuctions. 'And as soon as you start adding more advanced addressing to other instructions, they take more time. Since shift can only be done with registers, they keep their timing. Still, using fancy addressing modes is almost always faster than adding more instructions to accomplish the same thing. 1 Quote Link to comment Share on other sites More sharing options...
BillG Posted September 15, 2020 Author Share Posted September 15, 2020 This shows just how un-RISC the 9900 is. Reminds me of programming the 8088 - the specialized "string" instructions were by far the fastest way to do some things. That advantage eroded away as all of the instructions were made more efficient with each new generation of the architecture. By the time of the 486, a sequence of simple instructions can beat most of the string instructions. 1 Quote Link to comment Share on other sites More sharing options...
apersson850 Posted September 15, 2020 Share Posted September 15, 2020 (edited) It's definitely CISC! But the general addressing modes make it pretty orthogonal, which is a good thing for a programmer. Edited September 15, 2020 by apersson850 Quote Link to comment Share on other sites More sharing options...
+mizapf Posted September 15, 2020 Share Posted September 15, 2020 12 hours ago, BillG said: This shows just how un-RISC the 9900 is. Some years ago I learned to program the MIPS2000, a RISC platform. This is just the second assembly language that I learned (to the extent of writing programs) after TMS9900. It is quite interesting to see how specific things are solved in one or another platform. The MIPS architecture is a load/store architecture, means that you cannot work on memory contents but only on registers, and that you have special command load/store to transfer values from/to memory. 32 registers in MIPS are quite nice, but as always, registers are quickly used up, and this is the point where you start to appreciate the slow but very comfortable workspace concept of the TMS and the memory/memory architecture. Likewise - as we recently discussed the pros and cons of BLWP - a context switch for the MIPS is very expensive: It essentially means to copy the register set to memory and to fetch a new set from there. On the other hand, we can be really glad to have the BLWP instruction in the TMS, which makes context switches extremely fast. I'd say the point is to make best use of the specific features the platform offers you. 2 Quote Link to comment Share on other sites More sharing options...
BillG Posted September 15, 2020 Author Share Posted September 15, 2020 I have been playing with code generation on different processors. This line of code W0 := S1; sign extends a byte into two. For the 6502: 00037 ; 1 L v S1 0029 A0 00 [2] 00038 ldy #0 002B A6 16 [3] 00039 ldx S1 002D 10 01 (0030) [2/3] 00040 bpl 2f 002F 88 [2] 00041 dey 0030 00042 2: 0030 98 [2] 00043 tya 00044 ; 0 := v W0 -> 1 0031 86 0D [3] 00045 stx W0 0033 85 0E [3] 00046 sta W0+1 For the 6800: 00037 * 1 L v S1 0029 4F [2] 00038 clra 002A D6 16 [3] 00039 ldab S1 002C 2A 01 (002F) [4] 00040 bpl 2f 002E 4A [2] 00041 deca 002F 00042 2: 00043 * 0 := v W0 -> 1 002F 97 0D [4] 00044 staa W0 0031 D7 0E [4] 00045 stab W0+1 For the 8080: 00037 ; 1 L v S0 0100 3A 0015 [13] 00038 lda S0 0103 6F [5] 00039 mov L,A 0104 17 [4] 00040 ral 0105 9F [4] 00041 sbb A 0106 67 [5] 00042 mov H,A 00043 ; 0 := v W0 -> 1 0107 22 000D [16] 00044 shld W0 For the 9900: 00043 * 1 L v S1 0052 D020 0037 00044 movb @S1,R0 0056 0880 00045 sra R0,8 00046 * 0 := v W0 -> 1 0058 C800 002E 00047 mov R0,@W0 For the 68000: 00009 ; 1 L v S1 00000400 1038 0421 00010 move.b S1,D0 00000404 4880 00011 ext.w D0 00012 ; 0 := v W0 -> 1 00000406 31C0 0418 00013 move.w D0,W0 That almost feels like cheating. It is interesting that the last four examples have each been ten bytes long. Real cheating would be the 80386: movsx AX,[S1] mov [W0],AX For the AVR: 00011 ; 1 L v S1 000060 9160 0116 [2] 00012 lds R22,S1 000062 2F76 [1] 00013 mov R23,R22 000063 0F77 [1] 00014 lsl R23 000064 0B77 [1] 00015 sbc R23,R23 00016 ; 0 := v W0 -> 1 000065 9360 010D [2] 00017 sts W0,R22 000067 9370 010E [2] 00018 sts W0+1,R23 And finally, for the 6809: 00037 * 1 L v S1 0029 D6 16 [4] 00038 ldb S1 002B 1D [2] 00039 sex 00040 * 0 := v W0 -> 1 002C DD 0D [5] 00041 std W0 which brings up one of my favorite programming jokes: The Motorola 6809, where SEX is sometimes followed by STD. Thank you very much. Drive safely... 1 6 Quote Link to comment Share on other sites More sharing options...
+OLD CS1 Posted September 15, 2020 Share Posted September 15, 2020 38 minutes ago, BillG said: The Motorola 6809, where SEX is sometimes followed by STD. Quote Link to comment Share on other sites More sharing options...
+FarmerPotato Posted September 16, 2020 Share Posted September 16, 2020 I was motivated to try some hand-assembly in J1 assembly language, and share the love for this little cpu. So, here is sex (sign-extend) for the J1. The J1 is a stack-based machine, similar to the Novix NC4016, with an instruction set designed to run FORTH. Read about it here: https://www.excamera.com/files/j1.pdf or https://excamera.com/files/svfig-2015-aug.pdf Its opcodes are: 8000 push 15-bit literal 3-bit opcodes with 13 bit arguments: 0000 jump 2000 jump if zero 4000 call 6x00 ALU. The bit fields are like microcode. x is an operation from 0-f , for example T+N, ~T, fetch [T] or NrshiftT So, let's do it. Sign extend the high byte of the 16-bit word at s1. ( w0 s1 -- [s1]>>8 -> [w0] ) : sex @ h# 8 NrshiftT h# 80 N<T jz done h# 00ff inv + done: swap ! ; assembled: sex: 0000 6C00 @ 0002 8008 h# 8 0004 6903 >> 0006 8080 h# 80 0008 6803 N<T 000a 2009 jz done 000c 80ff h# 00ff 000e 6603 inv 0010 6403 or done: 0012 6180 swap 0014 6123 N->[T] drop 0016 710f drop ; Of course, in FORTH, you wouldn't write sex as a memory-to-memory operation. It should just be a stack operation. Once sex is defined in the dictionary, calling it is one instruction. See the full listing at the end. Some curiosities: 1. The opcode 8000 pushes a 15-bit literal to the stack. To load the 16-bit value ff00, it is necessary to push 00ff and invert it. 2. ! must consume 2 values from the stack, but one ALU cycle can drop only 1. So I insert an extra drop. If you want to leave the value on the stack, you can get that for free. The idiom "dup addr !" can be optimized to "addr N->[T] drop", fitting in two instructions. 3. ; is usually free - You can do an ALU operation to the stack, and also pop the return stack and return. This: 000b 6103 drop 000c 700c ; To optimize, you may OR the instructions for the combined effect, because the CPU units are independent. 710f operates on the data stack, return stack, and PC. 000b 710f drop ; 5. c@ is messy. Addresses are byte-oriented, but memory access is aligned to 16-bit words. Sound familiar? The J1 has no "MOVB" only "MOV". Where @ is one instruction, c@ is written in software, and must check if the address is even or odd, then shift or mask. Full Listing : c@ ( addr -- c ) ( little endian ) h# 1 TandN jz ceven h# 8 >> ret ceven: h# ff and ; : sex ( c -- d ) h# 80 N<T jz done h# 00ff inv or done: ; 0 variable w0 0 variable s0 : main w0 c@ sex s0 ! ; c@: 0000 8001 h# 1 0002 6300 TandN 0004 2005 jz ceven 0006 8008 h# 8 0008 790f >> ret ceven: 000a 80ff h# ff 000c 730f and ret sex: 000e 8080 h# 80 0010 6803 N<T 0012 200d jz done 0014 80ff h# 00ff 0016 6603 inv 0018 740f or ; done: 001a 700c ; this word could be optimized away w0: 001c 8020 h# 0020 001e 700c ; 0020 0000 s0: 0022 8026 h# 0026 0024 700c ; 0026 0000 main: 0028 400c w0 002a 4000 c@ 002c 4007 sex 002e 400f s0 0030 6123 N->[T] drop 0032 710f drop ; Quote Link to comment Share on other sites More sharing options...
apersson850 Posted September 16, 2020 Share Posted September 16, 2020 2 hours ago, FarmerPotato said: 5. c@ is messy. Addresses are byte-oriented, but memory access is aligned to 16-bit words. Sound familiar? The J1 has no "MOVB" only "MOV". Where @ is one instruction, c@ is written in software, and must check if the address is even or odd, then shift or mask. Which makes you understand why TI went the read-before-write route. And why they always do it, regardless of whether they are processing a byte or a word, so that they can do the same thing consistently. It doesn't improve efficiency, but does simplify architectural aspects. 1 Quote Link to comment Share on other sites More sharing options...
BillG Posted September 16, 2020 Author Share Posted September 16, 2020 6 hours ago, FarmerPotato said: The J1 is a stack-based machine, similar to the Novix NC4016, with an instruction set designed to run FORTH. Read about it here: https://www.excamera.com/files/j1.pdf or https://excamera.com/files/svfig-2015-aug.pdf Do you think this will be effective for writing emulators for other processors? My understanding is that the fastest non-jit approach is some form of threaded code instead of a central decoder/dispatcher. Quote Link to comment Share on other sites More sharing options...
+TheBF Posted September 16, 2020 Share Posted September 16, 2020 3 hours ago, BillG said: Do you think this will be effective for writing emulators for other processors? My understanding is that the fastest non-jit approach is some form of threaded code instead of a central decoder/dispatcher. The J1 as is would need to be expanded to be useful emulator. The original only addressed 16K bytes of memory It was for a specific application. There might be 32bit version now. (?) Important to note that J1 is not a threaded code machine but rather encodes the forth primitives (fetch,store,dup,drop,branch, branchifzero , math etc.) in hardware. The instructions are small because there is no register addressing required so they are organized to be compressed into one machine word by the compiler and fetched together in one cycle. No pipeline. Sub-routine call can be embedded in one of those instructions and return is free as any instruction that has the highest bit set (?), automatically returns. Pretty clever. Quote Link to comment Share on other sites More sharing options...
+FarmerPotato Posted September 16, 2020 Share Posted September 16, 2020 1 hour ago, TheBF said: The J1 as is would need to be expanded to be useful emulator. The original only addressed 16K bytes of memory It was for a specific application. There might be 32bit version now. (?) Important to note that J1 is not a threaded code machine but rather encodes the forth primitives (fetch,store,dup,drop,branch, branchifzero , math etc.) in hardware. The instructions are small because there is no register addressing required so they are organized to be compressed into one machine word by the compiler and fetched together in one cycle. No pipeline. Sub-routine call can be embedded in one of those instructions and return is free as any instruction that has the highest bit set (?), automatically returns. Pretty clever. And see the citations of earlier, bigger stack machines, in https://www.excamera.com/files/j1.pdf b16-small has 5-bit instructions packed into 16 bits, for 3 execution units. How cool is that? There is a 32-bit j1b and a multi-core j4a in the swapforth repo. The j1a fit in 5K of ram on the ice-1k. I wrote my sd card driver on that one before the other 3k filled up Now I'm running the mecrisp fork of j1a, where Matthias optimized, put the barrel shifter back in and added mpy, on ice-4k. It's my favorite hardware tester, for my V9958 board. Bowman's j1a left opportunities for optimizing because it coded so much in FORTH, not assembly. OK, now about emulation.. Quote Link to comment Share on other sites More sharing options...
+TheBF Posted September 16, 2020 Share Posted September 16, 2020 42 minutes ago, FarmerPotato said: And see the citations of earlier, bigger stack machines, in https://www.excamera.com/files/j1.pdf b16-small has 5-bit instructions packed into 16 bits, for 3 execution units. How cool is that? There is a 32-bit j1b and a multi-core j4a in the swapforth repo. The j1a fit in 5K of ram on the ice-1k. I wrote my sd card driver on that one before the other 3k filled up Now I'm running the mecrisp fork of j1a, where Matthias optimized, put the barrel shifter back in and added mpy, on ice-4k. It's my favorite hardware tester, for my V9958 board. Bowman's j1a left opportunities for optimizing because it coded so much in FORTH, not assembly. OK, now about emulation.. Now I'm running the mecrisp fork of j1a, where Matthias optimized, put the barrel shifter back in and added mpy, on ice-4k. I want one of those! Quote Link to comment Share on other sites More sharing options...
+FarmerPotato Posted September 16, 2020 Share Posted September 16, 2020 5 hours ago, BillG said: Do you think this will be effective for writing emulators for other processors? My understanding is that the fastest non-jit approach is some form of threaded code instead of a central decoder/dispatcher. I don't think it's great at emulation. I thought jit was for making threaded code, while non-jit meant a decoder/dispatcher? I think it would be better to add some registers to the stack machine, because emulating a CISC or RISC you know you need a PC, ST, WP, register file, parking for src/dst address/values etc. J1 also doesnt keep an overflow bit in the alu, and you'd need that. I would implement a generic cpu with those features, then write decoders to emulate all the target cpus. So I started some j1 code to emulate the 9900 decoder/dispatcher. I made the PC always at the bottom of the stack. I use 'tbl' to jump to 6 of the Format 1 instructions, Nxxx with 6 bit src/dst addr modes. ( -- d PC+2->PC ) : pc@ pc @ @ swap 2+ pc ! ; 8000 h# 0000 pc 6x01 @ T'=[T] ( a -- a a ) 6x00 @ T'=[T] ( a a -- a d ) 6x80 swap T'=N T->N 8002 h# 2 6x03 + 7000 h# 0000 pc 6123 ! N->[T] 6103 drop ( in no particular top-bottom order ) pc@+: 8002 h# 2 6200 T'=T+N ( next PC ) 6180 swap 7c00 @ ; : init h# 0 ( pc ) run ; init: 8000 h# 1234 ( entry point ) 0xxx jmp run : run do pc@+ decode loop ; run: 8002 h# 2 ( inline pc@+ ) 6200 T`=T+N 6180 swap 6c00 @ 4xxx call decode 0xxx jmp run ( d -- d ) : decode ( switch: jump into tbl1[n], where n is top 3 bits ) 9000 h# e000 6600 inv 6300 and 800c h# 12 6900 >> 800x tbl 6200 T`=T+N 6143 >r 700c exit ; tbl1: 'decode0x ( many more opcodes ) 'decode2x ( many more opcodes ) 'szc 'sub 'cmp 'add 'mov 'soc ( general src and dst addressing mode decoders ) : gad ( inst -- addr dst inst ) ... ; : gas ( inst -- src inst ) ... ; : mov gad gas ( inst -- addr dst addr src inst ) h# 1000 and ( drops the inst as a side effect ) jz word ( handle a byte operation by testing src and dst address lsbit ) ( do shifty stuff ) ( do masky stuff ) and .. and .. or swap ! ( don't drop ) status exit word: nip nip swap ! ( dont drop ) status ; : status ( d -- bbbb->ST ) ( compare to 0 and set all the status bits ) ; mov: 4xxx call gad 4xxx call gas 9000 h# 1000 6303 and 2xxx jz word ( stuff happens if byte ) 6123 ! 0xxx jmp status word: 6003 nip ( ignore src ) 6003 nip 6123 store 0xxx jmp status : add same as mov but replace "nip nip" with "nip +" : sub same as mov but replace "nip nip" with "nip inv 1+ +" 6003 nip 6600 inv 8001 h# 1 6203 T`=T+N 6203 T`=T+N : cmp cmp is a holy terror cmpb is even worse ; : szc same as mov but replace "nip nip" with: "nip swap inv and" 6003 nip 6180 swap 6600 inv 6303 and : soc same as mov but replace "nip nip" with: "nip or " 6003 nip 6503 or Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.