Code generation techniques

BillG · September 14, 2020

As mentioned in another thread, the following code to add a small constant to a variable:

                          00037 * W0 := W0 + 2;
                          00038
                          00039 *   *  0 := v W0 -> 1
                          00040 *   *  1 L r 2
                          00041
                          00042 *      *  2 L v W0 -> 3
                          00043 *      *  3 + c 2
                          00044
                          00045
                          00046 *  1 L r 2
                          00047 *  2 L v W0 -> 3
                          00048 *  3 + c 2
 0052 C020 002E           00049         mov     @W0,R0
 0056 05C0                00050         inct    R0
                          00051 *  0 := v W0 -> 1
 0058 C800 002E           00052         mov     R0,@W0

can be optimized to this:

                          00037 * W0 := W0 + 2;
                          00038
                          00039 *   *  0 := v W0 -> 1
                          00040 *   *  1 L r 2
                          00041
                          00042 *      *  2 L v W0 -> 3
                          00043 *      *  3 + c 2
                          00044
                          00045
                          00046 *  1 L r 2
                          00047 *  2 L v W0 -> 3
                          00048 *  3 + c 2
 0052 05E0 002E           00049         inct    @W0
                          00050 *  0 := v W0 -> 1

This is the code to add two signed bytes resulting in a 16-bit number. Is there a better way to do sign extension?

                          00037 * W0 := S1 + S2;
                          00038
                          00039 *   *  0 := v W0 -> 1
                          00040 *   *  1 L r 2
                          00041
                          00042 *      *  2 L v S1 -> 3
                          00043 *      *  3 + v S2
                          00044
                          00045
                          00046 *  1 L r 2
                          00047 *  2 L v S1 -> 3
                          00048 *  3 + v S2
 0052 04C0                00049         clr     R0
 0054 D020 0037           00050         movb    @S1,R0
 0058 1502 (005E)         00051         jgt     2f
 005A 0260 00FF           00052         ori     R0,>FF
 005E                     00053 2
 005E 06C0                00054         swpb    R0
 0060 04C1                00055         clr     R1
 0062 D060 0038           00056         movb    @S2,R1
 0066 1502 (006C)         00057         jgt     2f
 0068 0261 00FF           00058         ori     R1,>FF
 006C                     00059 2
 006C 06C1                00060         swpb    R1
 006E A001                00061         a       R1,R0
                          00062 *  0 := v W0 -> 1
 0070 C800 002E           00063         mov     R0,@W0

Asmusr · September 14, 2020

I don't know if it's faster, but you can extend the sign using sra r0,8. Then you don't need clr, jgt, ori or swpb.

BillG · September 14, 2020

Thanks.

I was going to do that when optimizing for size.

The 9900 does not have a barrel shifter, but shifts a bit at a time. The documentation seems to say one cycle per bit position; if that is true, shifting will be the fastest.

apersson850 · September 14, 2020

A shift instruction is roughly equivalent to 1-4 "normal" instructions, depending on how many bits to shift. Which implies that you can't do much other logic before you've consumed as many clock cycles as the shift will need.

+Lee Stewart · September 14, 2020

5 hours ago, BillG said:

The 9900 does not have a barrel shifter, but shifts a bit at a time. The documentation seems to say one cycle per bit position; if that is true, shifting will be the fastest.

I read two cycles per click.

...lee

apersson850 · September 14, 2020

BillG was reading information regarding the TMS 9940, so that could be the reason for the error.

BillG · September 14, 2020

If my calculations are correct, shifting is faster (assuming the workspace is in fast memory and everything else is in slow memory.)

 0052 04C0                00049         clr     R0      ; 14 : 10 + 4 (fetch)
 0054 D020 0037           00050         movb    @S1,R0  ; 30 : 14 + 4 (fetch) + 8 + 4
 0058 1502 (005E)         00051         jgt     2f      ; 12/14 : 8/10 + 4 (fetch)
 005A 0260 00FF           00052         ori     R0,>FF  ; 22 : 14 + 2 * 4 (fetch)
 005E                     00053 2
 005E 06C0                00054         swpb    R0      ; 14 : 10 + 4 (fetch)

vs

 0052 D020 0037           00049         movb    @S1,R0  ; 30 : 14 + 4 (fetch) + 8 + 4
 0056 0880                00050         sra     R0,8    ; 32 : 12 + 4 (fetch) + 16

apersson850 · September 14, 2020

Which is well within the rule of thumb I gave, where I stated that a shift is like 1-4 simple instructions, depending on the number of bits. Since you shift 8 bits it's about two instuctions. 'And as soon as you start adding more advanced addressing to other instructions, they take more time. Since shift can only be done with registers, they keep their timing.

Still, using fancy addressing modes is almost always faster than adding more instructions to accomplish the same thing.

BillG · September 15, 2020

This shows just how un-RISC the 9900 is.

Reminds me of programming the 8088 - the specialized "string" instructions were by far the fastest way to do some things. That advantage eroded away as all of the instructions were made more efficient with each new generation of the architecture. By the time of the 486, a sequence of simple instructions can beat most of the string instructions.

apersson850 · September 15, 2020

It's definitely CISC!

But the general addressing modes make it pretty orthogonal, which is a good thing for a programmer.

Edited September 15, 2020 by apersson850

+mizapf · September 15, 2020

12 hours ago, BillG said:

This shows just how un-RISC the 9900 is.

Some years ago I learned to program the MIPS2000, a RISC platform. This is just the second assembly language that I learned (to the extent of writing programs) after TMS9900.

It is quite interesting to see how specific things are solved in one or another platform. The MIPS architecture is a load/store architecture, means that you cannot work on memory contents but only on registers, and that you have special command load/store to transfer values from/to memory.

32 registers in MIPS are quite nice, but as always, registers are quickly used up, and this is the point where you start to appreciate the slow but very comfortable workspace concept of the TMS and the memory/memory architecture.

Likewise - as we recently discussed the pros and cons of BLWP - a context switch for the MIPS is very expensive: It essentially means to copy the register set to memory and to fetch a new set from there. On the other hand, we can be really glad to have the BLWP instruction in the TMS, which makes context switches extremely fast.

I'd say the point is to make best use of the specific features the platform offers you.

BillG · September 15, 2020

I have been playing with code generation on different processors.

This line of code

    W0 := S1;

sign extends a byte into two.

For the 6502:

                          00037 ;  1 L v S1
 0029 A0 00           [2] 00038          ldy    #0
 002B A6 16           [3] 00039          ldx    S1
 002D 10 01 (0030)  [2/3] 00040          bpl    2f
 002F 88              [2] 00041          dey
 0030                     00042 2:
 0030 98              [2] 00043          tya
                          00044 ;  0 := v W0 -> 1
 0031 86 0D           [3] 00045          stx    W0
 0033 85 0E           [3] 00046          sta    W0+1

For the 6800:

                          00037 *  1 L v S1
 0029 4F              [2] 00038          clra
 002A D6 16           [3] 00039          ldab   S1
 002C 2A 01 (002F)    [4] 00040          bpl    2f
 002E 4A              [2] 00041          deca
 002F                     00042 2:
                          00043 *  0 := v W0 -> 1
 002F 97 0D           [4] 00044          staa   W0
 0031 D7 0E           [4] 00045          stab   W0+1

For the 8080:

                          00037 ;  1 L v S0
 0100 3A 0015        [13] 00038         lda     S0
 0103 6F              [5] 00039         mov     L,A
 0104 17              [4] 00040         ral
 0105 9F              [4] 00041         sbb     A
 0106 67              [5] 00042         mov     H,A
                          00043 ;  0 := v W0 -> 1
 0107 22 000D        [16] 00044         shld    W0

For the 9900:

                          00043 *  1 L v S1
 0052 D020 0037           00044         movb    @S1,R0
 0056 0880                00045         sra     R0,8
                          00046 *  0 := v W0 -> 1
 0058 C800 002E           00047         mov     R0,@W0

For the 68000:

                                  00009 ;  1 L v S1
 00000400 1038 0421               00010         move.b  S1,D0
 00000404 4880                    00011         ext.w   D0
                                  00012 ;  0 := v W0 -> 1
 00000406 31C0 0418               00013         move.w  D0,W0

That almost feels like cheating.

It is interesting that the last four examples have each been ten bytes long.

Real cheating would be the 80386:

    movsx   AX,[S1]
    mov     [W0],AX

For the AVR:

                          00011 ;  1 L v S1
 000060 9160 0116     [2] 00012         lds     R22,S1
 000062 2F76          [1] 00013         mov     R23,R22
 000063 0F77          [1] 00014         lsl     R23
 000064 0B77          [1] 00015         sbc     R23,R23
                          00016 ;  0 := v W0 -> 1
 000065 9360 010D     [2] 00017         sts     W0,R22
 000067 9370 010E     [2] 00018         sts     W0+1,R23

And finally, for the 6809:

                                  00037 *  1 L v S1
 0029 D6 16                   [4] 00038          ldb    S1
 002B 1D                      [2] 00039          sex
                                  00040 *  0 := v W0 -> 1
 002C DD 0D                   [5] 00041          std    W0

which brings up one of my favorite programming jokes:

The Motorola 6809, where SEX is sometimes followed by STD.

Thank you very much. Drive safely...

+OLD CS1 · September 15, 2020

38 minutes ago, BillG said:

The Motorola 6809, where SEX is sometimes followed by STD.

giggity.jpg.53babd965e7046f560ecd8c96f5b25ad.jpg

+FarmerPotato · September 16, 2020

I was motivated to try some hand-assembly in J1 assembly language, and share the love for this little cpu.

So, here is sex (sign-extend) for the J1.

The J1 is a stack-based machine, similar to the Novix NC4016, with an instruction set designed to run FORTH. Read about it here: https://www.excamera.com/files/j1.pdf or https://excamera.com/files/svfig-2015-aug.pdf

Its opcodes are:

8000 push 15-bit literal

3-bit opcodes with 13 bit arguments:
0000 jump
2000 jump if zero
4000 call
6x00 ALU. The bit fields are like microcode. x is an operation from 0-f , for example T+N, ~T, fetch [T] or NrshiftT

So, let's do it.

Sign extend the high byte of the 16-bit word at s1.

( w0 s1 -- [s1]>>8 -> [w0] )
: sex
 @
 h# 8
 NrshiftT
 h# 80
 N<T
 jz done
 h# 00ff
 inv
 +
done:
 swap
 !
;

assembled:

sex:
0000 6C00   @
0002 8008   h# 8
0004 6903   >>
0006 8080   h# 80
0008 6803   N<T
000a 2009   jz done
000c 80ff   h# 00ff
000e 6603   inv
0010 6403   or
done:
0012 6180   swap
0014 6123   N->[T] drop
0016 710f   drop ;

Of course, in FORTH, you wouldn't write sex as a memory-to-memory operation. It should just be a stack operation.

Once sex is defined in the dictionary, calling it is one instruction. See the full listing at the end.

Some curiosities:

1. The opcode 8000 pushes a 15-bit literal to the stack. To load the 16-bit value ff00, it is necessary to push 00ff and invert it.

2. ! must consume 2 values from the stack, but one ALU cycle can drop only 1. So I insert an extra drop.
If you want to leave the value on the stack, you can get that for free. The idiom "dup addr !" can be optimized to "addr N->[T] drop", fitting in two instructions.

3. ; is usually free - You can do an ALU operation to the stack, and also pop the return stack and return.

This:

000b 6103   drop
000c 700c   ;

To optimize, you may OR the instructions for the combined effect, because the CPU units are independent. 710f operates on the data stack, return stack, and PC.

000b 710f   drop ;

5. c@ is messy. Addresses are byte-oriented, but memory access is aligned to 16-bit words. Sound familiar? The J1 has no "MOVB" only "MOV". Where @ is one instruction, c@ is written in software, and must check if the address is even or odd, then shift or mask.

Full Listing

: c@ ( addr -- c )  ( little endian )
  h# 1
  TandN
  jz ceven
  h# 8
  >>
  ret
ceven:  
  h# ff
  and
;


: sex ( c -- d )
 h# 80
 N<T
 jz done
 h# 00ff
 inv
 or
done:
;

0 variable w0
0 variable s0

: main w0 c@ sex s0 ! ;

c@:
0000 8001 h# 1
0002 6300 TandN
0004 2005 jz ceven
0006 8008 h# 8
0008 790f >> ret
ceven:
000a 80ff h# ff
000c 730f and ret


sex:
000e 8080   h# 80
0010 6803   N<T
0012 200d   jz done
0014 80ff   h# 00ff
0016 6603   inv
0018 740f   or ;
done:
001a 700c   ;  this word could be optimized away

w0:
001c 8020 h# 0020
001e 700c ;
0020 0000

s0:
0022 8026 h# 0026
0024 700c ;
0026 0000

main:
0028 400c w0
002a 4000 c@
002c 4007 sex
002e 400f s0
0030 6123 N->[T] drop
0032 710f drop ;

apersson850 · September 16, 2020

2 hours ago, FarmerPotato said:

5. c@ is messy. Addresses are byte-oriented, but memory access is aligned to 16-bit words. Sound familiar? The J1 has no "MOVB" only "MOV". Where @ is one instruction, c@ is written in software, and must check if the address is even or odd, then shift or mask.

Which makes you understand why TI went the read-before-write route. And why they always do it, regardless of whether they are processing a byte or a word, so that they can do the same thing consistently. It doesn't improve efficiency, but does simplify architectural aspects.

BillG · September 16, 2020

6 hours ago, FarmerPotato said:

The J1 is a stack-based machine, similar to the Novix NC4016, with an instruction set designed to run FORTH. Read about it here: https://www.excamera.com/files/j1.pdf or https://excamera.com/files/svfig-2015-aug.pdf

Do you think this will be effective for writing emulators for other processors?

My understanding is that the fastest non-jit approach is some form of threaded code instead of a central decoder/dispatcher.

+TheBF · September 16, 2020

3 hours ago, BillG said:

Do you think this will be effective for writing emulators for other processors?

My understanding is that the fastest non-jit approach is some form of threaded code instead of a central decoder/dispatcher.

The J1 as is would need to be expanded to be useful emulator. The original only addressed 16K bytes of memory It was for a specific application. There might be 32bit version now. (?)

Important to note that J1 is not a threaded code machine but rather encodes the forth primitives (fetch,store,dup,drop,branch, branchifzero , math etc.) in hardware.

The instructions are small because there is no register addressing required so they are organized to be compressed into one machine word by the compiler and fetched together in one cycle. No pipeline.

Sub-routine call can be embedded in one of those instructions and return is free as any instruction that has the highest bit set (?), automatically returns.

Pretty clever.

+FarmerPotato · September 16, 2020

1 hour ago, TheBF said:

The J1 as is would need to be expanded to be useful emulator. The original only addressed 16K bytes of memory It was for a specific application. There might be 32bit version now. (?)

Important to note that J1 is not a threaded code machine but rather encodes the forth primitives (fetch,store,dup,drop,branch, branchifzero , math etc.) in hardware.

The instructions are small because there is no register addressing required so they are organized to be compressed into one machine word by the compiler and fetched together in one cycle. No pipeline.

Sub-routine call can be embedded in one of those instructions and return is free as any instruction that has the highest bit set (?), automatically returns.

Pretty clever.

And see the citations of earlier, bigger stack machines, in https://www.excamera.com/files/j1.pdf

b16-small has 5-bit instructions packed into 16 bits, for 3 execution units. How cool is that?

There is a 32-bit j1b and a multi-core j4a in the swapforth repo.

The j1a fit in 5K of ram on the ice-1k. I wrote my sd card driver on that one before the other 3k filled up Now I'm running the mecrisp fork of j1a, where Matthias optimized, put the barrel shifter back in and added mpy, on ice-4k. It's my favorite hardware tester, for my V9958 board.

Bowman's j1a left opportunities for optimizing because it coded so much in FORTH, not assembly.

OK, now about emulation..

+TheBF · September 16, 2020

42 minutes ago, FarmerPotato said:

And see the citations of earlier, bigger stack machines, in https://www.excamera.com/files/j1.pdf

b16-small has 5-bit instructions packed into 16 bits, for 3 execution units. How cool is that?

There is a 32-bit j1b and a multi-core j4a in the swapforth repo.

The j1a fit in 5K of ram on the ice-1k. I wrote my sd card driver on that one before the other 3k filled up Now I'm running the mecrisp fork of j1a, where Matthias optimized, put the barrel shifter back in and added mpy, on ice-4k. It's my favorite hardware tester, for my V9958 board.

Bowman's j1a left opportunities for optimizing because it coded so much in FORTH, not assembly.

OK, now about emulation..

Now I'm running the mecrisp fork of j1a, where Matthias optimized, put the barrel shifter back in and added mpy, on ice-4k.

I want one of those!

+FarmerPotato · September 16, 2020

5 hours ago, BillG said:

Do you think this will be effective for writing emulators for other processors?

My understanding is that the fastest non-jit approach is some form of threaded code instead of a central decoder/dispatcher.

I don't think it's great at emulation.

I thought jit was for making threaded code, while non-jit meant a decoder/dispatcher?

I think it would be better to add some registers to the stack machine, because emulating a CISC or RISC you know you need a PC, ST, WP, register file, parking for src/dst address/values etc. J1 also doesnt keep an overflow bit in the alu, and you'd need that.

I would implement a generic cpu with those features, then write decoders to emulate all the target cpus.

So I started some j1 code to emulate the 9900 decoder/dispatcher. I made the PC always at the bottom of the stack. I use 'tbl' to jump to 6 of the Format 1 instructions, Nxxx with 6 bit src/dst addr modes.

( -- d PC+2->PC )
: pc@
  pc @
  @ swap
  2+ pc !
;

8000 h# 0000 pc
6x01 @     T'=[T]     ( a -- a a )
6x00 @     T'=[T]     ( a a -- a d )
6x80 swap  T'=N T->N
8002 h# 2
6x03 +
7000 h# 0000 pc
6123 !     N->[T]
6103 drop


( in no particular top-bottom order )

pc@+:
8002 h# 2
6200 T'=T+N   ( next PC )
6180 swap
7c00 @ ;

: init
  h# 0 ( pc )
  run
;

init:
8000 h# 1234 ( entry point )
0xxx jmp run

: run
  do
  pc@+
  decode
  loop
;

run:
8002 h# 2  ( inline pc@+ )
6200 T`=T+N
6180 swap
6c00 @ 
4xxx call decode
0xxx jmp run

( d -- d )
: decode   ( switch: jump into tbl1[n], where n is top 3 bits )
9000 h# e000
6600 inv
6300 and
800c h# 12
6900 >>
800x  tbl 
6200 T`=T+N 
6143 >r
700c exit
;

tbl1:
'decode0x ( many more opcodes )
'decode2x ( many more opcodes )
'szc
'sub
'cmp
'add
'mov
'soc

( general src and dst addressing mode decoders )
: gad ( inst -- addr dst inst ) ... ;
: gas ( inst -- src inst ) ... ;

: mov
  gad gas ( inst -- addr dst addr src inst )
  h# 1000
  and  ( drops the inst as a side effect )
  jz word
  ( handle a byte operation by testing src and dst address lsbit )
  ( do shifty stuff )
  ( do masky stuff )
  and .. 
  and ..
  or
  swap ! ( don't drop ) status exit
word:  nip nip 
  swap ! ( dont drop ) status
;

: status ( d -- bbbb->ST )
 ( compare to 0 and set all the status bits )
;

mov:
4xxx call gad
4xxx call gas
9000 h# 1000
6303 and
2xxx jz word
  ( stuff happens if byte )
6123 !
0xxx jmp status
word:
6003 nip ( ignore src )
6003 nip
6123 store
0xxx jmp status

: add
same as mov but replace "nip nip" with "nip +"


: sub
same as mov but replace "nip nip" with "nip inv 1+ +"
6003 nip
6600 inv
8001 h# 1
6203 T`=T+N 
6203 T`=T+N 


: cmp
cmp is a holy terror
cmpb is even worse
;

: szc
same as mov but replace "nip nip" with: "nip swap inv and"
6003 nip
6180 swap
6600 inv
6303 and


: soc
same as mov but replace "nip nip" with: "nip or "
6003 nip
6503 or

Code generation techniques

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation

Recently Browsing 0 members