Extending the CP1600 Instruction Set

JohnPCAE · November 14, 2019

Insomnia and allergies are conspiring to keep me awake so I figured I'd write up an idea that's been in the back of my mind for the past few months. One of the fun parts about programming for the Inty (and many other retro systems) is having to work within the limits of the system, such as size or speed constraints. With modern bankswitching we can make an Inty game almost as large as we want, and we even have some simple math acceleration available like fast multiplies. This is an idea about taking it to the next level.

We know that the CP1600 only uses 10 bits of each 16-bit word for its instruction set, but a really intelligent cartridge could use those extra 6 bits to extend the instruction set within its address range. By "really intelligent" I mean a cartridge that listens to all bus traffic to and from the CPU. For every instruction the CPU executes, the cartridge maintains a mirror of the CPU's internal state. Basically a microcontroller on a cartridge. The microcontroller would also be responsible for providing data from the cartridge's address space when the CPU requests it, i.e. serving up the game code and data. This would enable some really powerful things.

1. Simple example: inclusive or instruction

XORI data, RD is encoded as 1111 111 ddd. We could extend it to mean 000000 1111 111 ddd, so that whenever the upper six bits are all zeroes the instruction remains the same. We could then decide that encoding 000001 1111 111 ddd would mean ORI data, RD. Since the microcontroller would always know the state of RD beforehand, it could always supply the correct "data" value to achieve an inclusive OR result even though the CPU would *actually* execute the original XOR operation. And, since an XOR is being used, the condition flags would also be set.

Taking it a step further, we could do something like this:

000000 1111 111 ddd XORI data, RD (original instruction)

000001 1111 111 ddd ORI data, RD

000010 1111 111 ddd OR R0, RD

000011 1111 111 ddd OR R1, RD

000100 1111 111 ddd OR R2, RD

000101 1111 111 ddd OR R3, RD

000110 1111 111 ddd OR R4, RD

000111 1111 111 ddd OR R5, RD

001000 1111 111 ddd OR R6, RD

001001 1111 111 ddd OR R7, RD

The CPU would see all of these as XORI data, RD, but the microcontroller would supply the appropriate data value to achieve the desired result. It's even possible to extend it to indirect memory addressing using XOR@, provided that the memory being accessed is also within the cartridge's address space.
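
To make the "supply the correct data value" step concrete, here is a minimal Python sketch of the arithmetic the cartridge would have to do. It is an illustration only: the function names are hypothetical, and it assumes the microcontroller's mirror of RD is accurate. Because the CPU will compute RD XOR operand, the cartridge substitutes an operand containing only those bits of the requested data that are not already set in RD.

```python
MASK16 = 0xFFFF  # CP1600 registers are 16 bits wide


def xor_operand_for_or(rd_value: int, data: int) -> int:
    """Operand to feed the CPU's XORI so the net effect is RD | data.

    Only bits that are set in `data` but clear in RD need to be flipped.
    """
    return (data & ~rd_value) & MASK16


def cpu_xori(rd_value: int, operand: int) -> int:
    """What the real CPU does with whatever operand it is handed."""
    return (rd_value ^ operand) & MASK16


rd = 0x1234
data = 0x0FF0
substituted = xor_operand_for_or(rd, data)
print(hex(substituted), hex(cpu_xori(rd, substituted)))
```

For the register forms (OR R0, RD and so on), `data` would simply be the mirrored value of the source register.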

2. Other instructions that place a result in a register

For something like a MUL instruction, instead of basing it on an XOR operation, extended instructions could be based on something like MVII data, RD. This wouldn't set any condition flags, but it also wouldn't have any unwanted side effects. On the other hand, if one *wants* condition flags set, sticking with XOR would work just dandy.

3. Advanced idea: redefinable instructions

Instead of hard-coding the extra bits to mean specific instructions, it could be left up to software. How? By using an extended no-op instruction.

The CP1600 has four (!) methods of achieving a no-op:

0000110100 NOP

0000110101 NOP (2)

1000001000 NOPP

1000101000 NOPP (2)

One of these could be extended to supply special meta-instructions to the microcontroller for defining how certain instructions are extended. Here's a very simple example of a meta-instruction format:

1111111000 0000011111111xxx 0000000000 0000

The first 10 bits specify a mask telling which parts of the basic opcode are relevant. The next 16 bits specify the extended opcode (where the x's could be either 0 or 1, since the mask causes them to be ignored). The next 10 bits specify the operation to take place (XOR). The final 4 bits specify the source operand type (0 = immediate value).

The 40-bit meta-instruction could be encoded in a string of 10 NOPP instructions:

01 1111 1000001000

10 1110 1000001000

10 0000 1000001000

10 0001 1000001000

10 1111 1000001000

10 1110 1000001000

10 0000 1000001000

10 0000 1000001000

10 0000 1000001000

11 0000 1000001000

The first two bits are markers that (1) ensure that the upper six bits are never all zero and (2) signal the beginning, middle, and end of a meta-instruction. The next four bits are the payload, and the last 10 bits are the normal NOPP instruction. This scheme would let a microcontroller support a vast number of potential instructions while letting a programmer choose those instructions he needs. A programmer could even redefine instructions as the needs in a program change. The concept in fact starts to blur the line between writing instructions and writing *microcode* for instructions.
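
The packing scheme can be sketched in Python as follows. This is a hypothetical model, not an existing tool: `pack_meta` and `unpack_meta` are made-up names, the field order (mask, extended opcode, operation, operand type) follows the description above, and the x bits are taken as 0.

```python
NOPP = 0b1000001000                      # base 10-bit NOPP opcode
START, MIDDLE, END = 0b01, 0b10, 0b11    # 2-bit position markers


def pack_meta(mask: int, ext_opcode: int, operation: int, operand_type: int) -> list:
    """Split a 40-bit meta-instruction into ten 16-bit extended NOPP words."""
    meta = (mask << 30) | (ext_opcode << 14) | (operation << 4) | operand_type
    nibbles = [(meta >> shift) & 0xF for shift in range(36, -1, -4)]
    words = []
    for i, nibble in enumerate(nibbles):
        if i == 0:
            marker = START
        elif i == len(nibbles) - 1:
            marker = END
        else:
            marker = MIDDLE
        # Layout per word: 2 marker bits, 4 payload bits, 10 opcode bits.
        words.append((marker << 14) | (nibble << 10) | NOPP)
    return words


def unpack_meta(words: list) -> tuple:
    """Recover (mask, ext_opcode, operation, operand_type) from the words."""
    meta = 0
    for word in words:
        meta = (meta << 4) | ((word >> 10) & 0xF)
    return ((meta >> 30) & 0x3FF, (meta >> 14) & 0xFFFF,
            (meta >> 4) & 0x3FF, meta & 0xF)


words = pack_meta(0b1111111000, 0b0000011111111000, 0b0000000000, 0b0000)
for word in words:
    print(f"{word >> 14:02b} {(word >> 10) & 0xF:04b} {word & 0x3FF:010b}")
```

Since the CPU ignores the upper six bits, it would just execute ten NOPPs while the microcontroller collects the payload nibbles.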

Edited November 14, 2019 by JohnPCAE

mr_me · November 14, 2019

Those six extra bits were "reserved for future expansion"; for coprocessors, according to Wikipedia.

intvnut · November 18, 2019

Yes, you can indeed take advantage of the upper 6 bits of the CP-1600 instruction opcode to provide enhanced functionality from the cartridge. The CPU does indeed ignore those additional bits. Also, the MVOI opcode provides some great opportunities for extending the instruction set as well.

It turns out, I actually implemented many of these ideas before shipping LTO Flash!, but didn't really tell anyone about them. I wanted to implement the same extensions in my JLP cartridge board before I unleashed these to the world, so that any game that used these extensions wouldn't be tied to just LTO Flash!.

(Yes, before anyone busts my chops for actually charging for my hardware, I do make a modest markup on my JLP boards, but anyone can look at my BOM and assembly costs and realize it's a very modest markup.)

Anyway, I approached this exercise in two primary ways:

1. Extend direct-mode instructions to provide additional addressing modes.
2. Provide extended operations by leveraging MVOI.

I don't try to shadow the CP-1600's internal state. Rather, I provide some additional "registers" in the address map, and define my operations to work on those additional registers and to take advantage of the opportunities the CP-1600 provides for me to learn about its register values.

If I shadowed the CP-1600's internal state, I could perhaps do more. But, at that point, there's little point having the CP-1600 execute anything, other than to cause it to write values to memory where we need it to. My design ethos was to keep the CP-1600 in control, and treat my extensions as a proper coprocessor similar to the CP-1620 that appeared (and later disappeared) in early GI literature.

So, attached is a not-quite-complete document of all of the CP-1600X extensions that are currently available in every LTO Flash! cartridge as well as jzIntv. I will publish a macro file you can include in AS1600 to add these features to your assembly programs. I'm going to try to get this document completed in another few days. But, I figured I should publish what I have now, before too much time has passed.

Locutus_CP-1600X_Instruction_Set_Extensions_20191117a.pdf

Edited November 18, 2019 by intvnut

intvnut · November 18, 2019

OK, the most effective way to find errors in your documents is to post them publicly, and then read them. I fixed a couple minor errors in the doc I just posted. Here's the updated version.

Locutus_CP-1600X_Instruction_Set_Extensions_20191117b.pdf

intvnut · November 19, 2019

And, now I have most of the rest of the ISA documented, including handling of extended precision values, BCD values, etc. I still need to document DECBNZ, TSTBNZ, RXSER, and TXSER.

The ISA description is mostly complete. I probably do need to go into more detail about the PV (previous value) register, as it's an interesting architectural feature that enables some idioms that might not be obvious.

Locutus_CP-1600X_Instruction_Set_Extensions_20191119a.pdf

artrag · November 20, 2019

Now we need IntyBASIC to include those extensions?

intvnut · November 20, 2019

1 hour ago, artrag said:

Now we need IntyBASIC to include those extensions?

It would be interesting to see an IntyBASIC version that made use of them, and indeed I had IntyBASIC in mind when I wrote some of the instructions (the fixed-point set, as well as the definitions for the comparisons).

That said, programs written with this instruction set would only run in jzIntv or on LTO Flash, currently. I haven't yet upgraded JLP to implement these instructions, and I'm not sure when I'll get a chance to.

I first need to get a few more LTO Flashes out into the world, as there seems to be a demand.

artrag · November 20, 2019

LTO Flash is the future for any (serious) game development on cartridge....

Edited November 20, 2019 by artrag

artrag · November 20, 2019

There are also atan and other math functions

Awesome!!!

artrag · November 25, 2019

On 11/20/2019 at 8:15 AM, intvnut said:

It would be interesting to see an IntyBASIC version that made use of them, and indeed I had IntyBASIC in mind when I wrote some of the instructions (the fixed-point set, as well as the definitions for the comparisons).

That said, programs written with this instruction set would only run in jzIntv or on LTO Flash, currently. I haven't yet upgraded JLP to implement these instructions, and I'm not sure when I'll get a chance to.

I first need to get a few more LTO Flashes out into the world, as there seems to be a demand.

Btw, how to use those extensions in ASM?

Do you have defined macros in your assembler?

I would like to use them in ASM embedded in an intybasic program

intvnut · November 26, 2019

22 hours ago, artrag said:

Btw, how to use those extensions in ASM?

Do you have defined macros in your assembler?

I would like to use them in ASM embedded in an intybasic program

I will post a macro file. I need to add a couple opcodes to it before it's ready. All of the opcodes are tested. I just didn't add macros for all of them.

(How did I test if my macro set is incomplete? I wrote behavioral models of every instruction in both C and CP-1600 assembly, and wrote a random test generator to generate random inputs to every instruction, and vetted it on jzIntv and LTO Flash by letting it run for several days. I actually found a bug in the PIC24H microcontroller's divide implementation and had to implement a workaround.)

Edited November 26, 2019 by intvnut
clarification

intvnut · December 3, 2019

OK, so I think I've got the CP-1600X macro file in decent shape. I tested it a bit more thoroughly, and I modified Tag-Along Todd 2V to use the CP-1600X ISA a bit more heavily, including converting it to use the fixed-point types.

One thing to watch out for: The extended register-to-register instructions are non-interruptible. This is both a blessing and a curse. You'll need to drop in a NOP every so often, or carefully partition your code to run on both the CP-1600 and the CP-1600X instruction sets. There are a few NOPs in Todd that I could eliminate if I scheduled the instructions a little differently.

Some of the mnemonics changed names slightly compared to earlier PDFs. ADD, SUB, SHR, SHRU, etc. became ADD3, SUB3, SHR3, SHRU3. The "3" suffix means "three operands," and is meant to placate the assembler.

The Tag-Along Todd download includes the cp1600x.mac file that defines the ISA extensions. You can copy that into the "examples/macro" directory in jzIntv, or put it wherever you need it. The attached ZIP file is designed to unpack directly in jzintv/examples/.

The ISA document itself still doesn't have all the details, but it has plenty. Good luck!

Locutus_CP-1600X_Instruction_Set_Extensions_20191203a.pdf tagalong2v_cp1600x.zip

+Lathe26 · December 3, 2019

Ok, I'm finally making some time to read the doc...

I've read up to page 20 so far so more feedback is on the way:

Recommend adding page numbers and section numbers. This makes it easier to talk to other folks about specific parts of a high-tech doc like this one. I'm currently relying on Adobe's page numbers in the comments below.
Have all the new instructions been checked for whether they save execution time? I suspect they are all useful and save time, but there are some x86 instructions that RISC folks love to point out are so slow that using other, simpler x86 instructions is faster. I have done nothing to check this.
Page 9 - "Must not be over the top: It’s possible to implement a program entirely in an external machine, using the Intellivision solely for access to the controllers and display. That is explicitly an anti-goal. These instructions should feel like coprocessor extensions that would be reasonable circa 1984."
1. I love this anti-goal.
Page 16 - The TXSER / RXSER instructions seem to mean that a non-LTO cart with extended instructions would be _required_ to support serial communication. This should be called out earlier in the doc's introduction sections, such as "It’s OK if the additional instructions..." on page 9.
Page 18 - I really like the examples. For the previous spec, I had typed up a comment that more examples were needed.

intvnut · December 3, 2019

2 hours ago, Lathe26 said:

Recommend adding page numbers and section numbers. This makes it easier to talk to other folks about specific parts of a high-tech doc like this one.

Good idea. I've gotten into the habit of not turning those on, since at $DAYJOB we just share the Google Docs directly and don't generate PDFs. They're a distraction there. Since I'm publishing PDFs, they make sense here.

2 hours ago, Lathe26 said:

Have all the new instructions been checked for whether they save execution time? I suspect they are all useful and save time, but there are some x86 instructions that RISC folks love to point out are so slow that using other, simpler x86 instructions is faster. I have done nothing to check this.

Excellent question. They're not all a slam dunk. I need to add performance data.

Executive overview:

The extended addressing modes build on Direct Mode, and so the cycle counts are the same as Direct Mode. That means "ADD @X1++(1), R0" takes 2 cycles longer than "ADD@ R4, R0".
Extended branches are also built on Direct Mode, and have the same execution time as MVI addr, R7.
Extended Reg-Reg instructions build on MVOI. I'm not sure if that's a 9 or 10 cycle instruction, but I think it's 9 cycles.

The flat cycle count profile helps. There are no long-run-time microcoded beasts like x86. The fact that none is faster than 9 cycles, though, gives a drag relative to the 6 to 8 cycle reg-reg native instructions.

Many of the instructions save time only if you have all the values you need in the X registers already. There's definitely a communication bottleneck between the X registers and the CP-1600. Popping over there for just an instruction or two is a loss far more often than not. When I went through Tag-Along Todd 2 to convert portions to CP-1600X, I found it tricky to get obvious wins everywhere. When I did, I found the computation largely migrated into the X registers, popping back at the end.

The wins also tend to come when using the more complex instructions (BOUND, multiply/divide, I2BCD) that absorb multiple CP-1600 instructions. The simpler CP-1600X instructions only really help as glue when all the values are stuck in X registers and you need to glue the fancier instructions together.

Some things turned out rather cute, such as the decimal print loop. I haven't really optimized this, but it's nice and compact. Perhaps a more focused version (with more comments and commentary) of this should appear in the Programmer's Guide section.

DEC16:  PROC
        ADDR    R3,     R3      ; \_ Set LSB to 1 to indicate leading spaces.
        INCR    R3              ; /
        INCR    R7              ; Skip the ADDR.
DEC16A: ADDR    R3,     R3      ; Set LSB to 0 to indicate leading zeros.

        SUB3    5,  R2, X2      ; X2 is number of digits to display.
        I2BCD   0,  R0, X0      ; BCD decode the number
        SLL     R2,     2       ; \_ Pop off suppressed digits into X1.
        SHLU3   X0, R2, X0      ; /
        AND3    X1, $F, X1      ; Keep only the last one.

@@digit_loop:
        MOVR    R3,     R1
        SARC    R1,     1
        TSTBNZ  X1,     @@non_zero
        BC      @@no_digit
@@non_zero:
        ANDI    #$FFFE, R3
        MPY16   X1, 8,  X1
        ADD     X1,     R1
        ADDI    #$80,   R1

@@no_digit:
        MVO@    R1,     R4

        SHLU3   X0, 4,  X0      ; Pop next digit into X1
        DECBNZ  X2, @@digit_loop

        SLR     R3,     1       ; Restore R3
        JR      R5
        ENDP
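
The listing above is CP-1600X assembly, but the two ideas it combines, packing decimal digits four bits apiece and suppressing leading zeros with a one-bit flag, can be modeled in a few lines of Python. This is only an illustrative sketch: `int_to_bcd` and `format_decimal` are hypothetical names, the exact operand semantics of I2BCD are defined in the PDF rather than here, and always showing the final digit is a choice made for this sketch, not a claim about DEC16.

```python
def int_to_bcd(value: int) -> int:
    """Pack the decimal digits of a 16-bit value, 4 bits per digit."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("value must fit in 16 bits")
    bcd = 0
    shift = 0
    while True:
        value, digit = divmod(value, 10)
        bcd |= digit << shift
        shift += 4
        if value == 0:
            return bcd


def format_decimal(value: int, width: int = 5, leading_zeros: bool = False) -> str:
    """Emit `width` digits, most significant first, blanking leading zeros."""
    bcd = int_to_bcd(value)
    show = leading_zeros          # plays the role of the flag kept in R3's LSB
    out = []
    for pos in range(width - 1, -1, -1):
        digit = (bcd >> (4 * pos)) & 0xF
        if digit != 0 or pos == 0:
            show = True           # first non-zero digit ends suppression
        out.append(str(digit) if show else " ")
    return "".join(out)


print(repr(format_decimal(42)))
print(repr(format_decimal(42, leading_zeros=True)))
```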

As another example, the velocity/position update also ended up kinda cute, but also was almost entirely in X registers. It also leaned heavily on the rotated fixed point representation. The original used non-rotated fixed point, though, and was competitive in the base instruction set. I need to work up an actual performance analysis of the two to see how far ahead (if I did end up ahead) CP-1600X actually ended up. I suspect it wasn't very big.

; Note: X4 = $00FF at the start.
            MVI@        R4,         R0  ; Target Velocity
            MVI@        R4,         R1  ; \_ Velocity
            MVO         R1,         X0  ; /

            SUBFX       R0,   X0,   X1  ; Velocity difference

            CMPLTFX     X1,    0,   X2  ; \_ +/- 01.00 based on sign of
            BOUNDU      X4,    1,   X2  ; /  difference

            MVII        #$0300,     R1  ; Round away from zero (00.03)
            MPYFXSS     R1,   X2,   X2  ; +/- 00.03 based on sign of diff
            ADDFX       X1,   X2,   X1  ; Add rounding term
            DIVFXS      X1,    4,   X1  ; Divide by 04.00

            ADDFX       X1,   X0,   X0  ; Add rounded diff to velocity
            MVI         X0,         R0  ; Updated velocity

            MVI@        R4,         R1
            ADDFX       R1,   X0,   X0  ; Updated position
            BOUNDFXU    X5,   X6,   X0  ; Clamp to screen
            MVI         X0,         R1  ; Updated position

            SUBI        #2,         R4
            MVO@        R0,         R4  ; Save updated velocity
            MVO@        R1,         R4  ; Save updated position
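
Judging only from the comments in the listing above ($0300 annotated as 00.03, the constant 1 as 01.00, and $00FF used as the lower bound of a +/- 01.00 clamp), the rotated fixed-point format appears to be ordinary 8.8 fixed point with its two bytes swapped, so that small integers in the low byte read as whole units. The Python sketch below models that inference; it is an assumption, not the definition from the ISA document, and `to_rotated` is a hypothetical name.

```python
def to_rotated(q8_8: int) -> int:
    """Swap the bytes of a 16-bit word.

    Input:  conventional 8.8 (integer in the high byte, fraction in the low).
    Output: rotated form (fraction in the high byte, integer in the low).
    The swap is its own inverse, so the same function converts back.
    """
    q8_8 &= 0xFFFF
    return ((q8_8 << 8) | (q8_8 >> 8)) & 0xFFFF


for conventional in (0x0003, 0x0100, 0x0400, 0xFF00):
    print(f"${conventional:04X} -> ${to_rotated(conventional):04X}")
```

Under this reading, small immediates such as 1 or 4 in a register-to-register instruction act directly as 01.00 or 04.00, which would explain the `DIVFXS X1, 4, X1` line.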

2 hours ago, Lathe26 said:

The TXSER / RXSER instructions seem to mean that a non-LTO cart with extended instructions would be _required_ to support serial communication. This should be called out earlier in the doc's introduction sections, such as "It’s OK if the additional instructions..." on page 9.

Yeah, the TXSER / RXSER venture into "over the top" territory, and are really intended only for LTO Flash use, to improve serial transfer performance. I need to document how to detect whether they're supported, as I don't intend all implementations to support them. jzIntv doesn't support them at the moment, for example. JLP could support them, as it does have a serial port.

Most of the instructions are focused on game-oriented communication. TXSER / RXSER are for space cadet applications, such as turning your Intellivision into a terminal server. Anyone want to write a BBS that runs on an Intellivision? LOL

Edited December 3, 2019 by intvnut
grammar fix

+Lathe26 · December 4, 2019

Here's more feedback on the doc:

Page 19 - Does PV store different values based on ISR code vs non-ISR code? Most likely, the PV register is unnecessary in ISRs, so maybe there should be a statement that PV instructions should not be used in ISRs.
Some RISC processors have a register that is hardcoded as 0 (i.e. X0 is read as 0x0000, writes are ignored).
1. This might remove the need for some instruction encodings, making room for other new instructions. For example, if X0 was always 0, then some of the CMP* extended instructions are just aliases for SUB* that store into X0. RISC processors also benefit from having a source register that is 0, but since the register-to-register instructions already support signed 5-bit constants, this portion is moot.
What are your thoughts on combined "mask and shift" instructions? The intention is to mask off a bitfield and have the bitfield shifted to the bottom bits of the destination register (based on whichever is the lowest bit in the 'mask' register that is a 1). For example, given an instruction of "ANDSHFT X1, X2, X3" where X1 is the data register containing $1234 and X2 is the mask register of $0F00, the result would be X3 being set to $0002 (which makes later operations using X3 easier). Of course, this can be implemented as 2 separate instructions, so maybe this instruction is too specialized.
Page 24 - Minor typo where superscript 3 for BITREV is not tagged below (looks like superscript 3 and 4 were later combined)
In addition to the PV register, what about adding status register(s)? On the simple end of the spectrum, there could be an XSWD register with status bits of SZOC in bits 7-4 (same format as what the RSWD CP1610 instruction takes). This would allow for all the CMP* instructions to be replaced with some other instructions.
Would there be benefits for more graphics-centric commands that use a bank of 8 registers? The only concern is that this might start dancing closer to the anti-goal of over-the-top functionality.
1. One simple model would be to only use 2 fixed banks of X0-X7 or X8-XF that use the lower 8 bits of each register (i.e. the low 8 bits of X0-X7 would contain all 8x8 pixels of a single card). Instructions would allow for 90 degree rotations of a single card, AND/OR of two cards (output stored in one of the banks), collision detection (use low/high bytes of PV as X/Y offset of the 2 cards), etc. Of course, the bottleneck would be copying the card data into and out of the X registers, so clock cycles would need to be looked at to determine whether this is actually useful.
2. As an alternative to the above, what if the extended instructions device was required to monitor writes to GRAM and maintain a shadow copy (would require 512 or 2048 bytes of RAM)? This would save needing to use instructions to copy data into the X registers, but instructions would still need to be used to copy out of the X registers to get the data into GRAM. For example, a graphics rotate instruction of "GFXROTL X9, 1" would take the card indexed by X9 (if X9 was 11, then the 11th card in shadow RAM is to be rotated), rotate it by 90 degrees (1 = 90 degrees, 2 = 180 degrees, etc.), and finally store the result in X0-X7 (the "L" at the end of the instruction means store the result in the low bank of X0-X7 rather than the high bank of X8-XF).
3. Alternatively, a GFXROT (graphics rotate) might store the output only back to the shadow copy, and then a separate GFXLD (graphics load) is used to copy a card from the shadow copy to registers X0-X7. The downside would be that shifting a card by an X/Y offset would need to handle the various ways that pixels would combine/overwrite with the neighboring cards.
4. Another variation would be to have a set of 8-bit registers called G0-GF that are mapped to memory locations $9FA0 through $9FAF rather than reuse the X registers.
How about instructions that help unpack Intellivoice data? Currently, Intellivoice data is stored as 10-bit quantities and it takes up modest ROM space. If the ROM could store it as 16-bit quantities but then unpack it quickly as a series of 10-bit quantities for the SPB640 at runtime, then games could store more voice data. Not sure if this would be faster in clock cycles vs having the CP1610 handle this natively (likely faster but not sure).
Page 23 - Needs copy-paste examples of several of the instruction formats, just to drive the points home.
Page 25 - For the instructions that end in &, is that meant to indicate that the C &= operator is applied to the dst register (i.e. CMPEQ& means that the dst register will retain its current value if src1 == src2 or will be set to 0 if src1 != src2)? I'm asking because other parts of the document use & to mean immediate values.
Page 44 - Should there be an inverse of ATAN2? This would take a distance in a register, a heading of 0-15 as a register or constant, and output the X/Y in dst_lo/dst_hi.
Page 61 - Add a GETDEVID instruction that loads a set of constants into X0-XF. Some registers might contain the device name, others the device version, others contain device feature flags (ex: are TXSER/RXSER supported?).
Page 62 - Should mention that determining _what_ serial port error occurred will be device dependent (ex: LTO serial status register is at address $xxxx).
Page 63 - Not understanding the section about "you cannot really perform in-place accumulation" since it looks like the accumulation is handled above. It looks like a 64-bit add was done above and then the final result was simply moved from X8-XB to X4-X7.

intvnut · December 4, 2019

BTW, in case I wasn't clear: I'm documenting what's already shipped and is in all 600-ish LTO Flash units today, and has been in jzIntv for over 3 years now. If I change anything, it will be to add to the ISA in a backward compatible way.

Yes, there aren't any programs out there that use the existing ISA. But, the ISA has already shipped, and is actually verified.

38 minutes ago, Lathe26 said:

Page 19 - Does PV store different values based on ISR code vs non-ISR code? Most likely, the PV register is unnecessary in ISRs, so maybe there should be a statement that PV instructions should not be used in ISRs.

There's no real indication you're in an interrupt context or not. PV is not banked nor is it preserved. PV is primarily intended to be ephemeral and should be consumed immediately after it's generated. The instructions that set PV are also non-interruptible. Thus, the intended, safe and idiomatic use of PV is:

       EXT3OP  X0, X1, X2  
       MVI     PV, R0

I should make this clearer in the documentation. It's not meant for any other use. From this, you can implement a number of interesting idioms, but you really have to bolt a MVI or ADD, or something next to it to immediately consume the value.

In general, using the extended register set (X0 - XF) from an interrupt handler needs to be done with care, as the standard ISR save/restore won't save and restore it. I'd wanted PSHM/PULM to make that more efficient. For now, it's maybe easier to say "don't use these from ISRs," but I need to include context switch concerns in the Programmer's Guide as well. Either that, or partition the Xregs so that some are for interrupt context and the rest are for foreground context. There are 16 of them.

If I do an "upgrade" to the ISA, it might be worth considering register banks as well, similar to an 8051. I don't want to try to detect "return from interrupt," though. I'd make bank selection manual. But, it'd be one instruction as opposed to many.

42 minutes ago, Lathe26 said:

Some RISC processors have a register that is hardcoded as 0 (i.e. X0 is read as 0x0000, writes are ignored).

This might remove the need for some instruction encodings, making room for other new instructions. For example, if X0 was always 0, then some of the CMP* extended instructions are just aliases for SUB* that store into X0. RISC processors also benefit from having a source register that is 0, but since the register-to-register instructions already support signed 5-bit constants, this portion is moot.

[...]

In addition to the PV register, what about adding status register(s)? On the simple end of the spectrum, there could be an XSWD register with status bits of SZOC in bits 7-4 (same format as what the RSWD CP1610 instruction takes). This would allow for all the CMP* instructions to be replaced with some other instructions.

At the start, I really wanted to avoid having an external status register, in particular because I hadn't worked out how to do branches efficiently. As it stands, I have a limited ability to encode new conditional branches efficiently, and I may have wasted my one opportunity by using the MVI xx, R7 encoding for TSTBNZ.

I encoded the CMPxx the way I did, to better reflect how IntyBASIC performs comparisons. The CMPxx& construct allows you to build compound comparisons quickly, or to conditionally zero out a value (think "break statement").

An XSWD becomes part of the interrupt context in a strong way. PV seemed safe to me as long as you stick to the idiomatic use. XSWD is a little more of a problem because the instruction that generates it is not necessarily immediately before the instruction that consumes it. Reading it and PSHR'ing it onto the stack is no fun either.

And then there's the simple concern that it's costly for me to compute. Some of the ISA implementations pushed the limit of what I could do on a 40MHz PIC. The I2BCD, BCD2I, ABCD/SBCD, ATAN2, and ISQRTFX were tight, as I recall. If you gave me 20% more cycles, I could compute the flags. I don't have those cycles.

The lack of flags forces you into a "branch-free algorithm" mindset. Given that branches are costly anyway, it's not necessarily a bad place to be. What I'm missing is a good conditional-move instruction to round it out. "Based on src1, conditionally move src2 into dst."

1 hour ago, Lathe26 said:

What are your thoughts on combined "mask and shift" instructions? The intention is to mask off a bitfield and have the bitfield shifted to the bottom bits of the destination register (based on whichever is the lowest bit in the 'mask' register that is a 1). For example, given an instruction of "ANDSHFT X1, X2, X3" where X1 is the data register containing $1234 and X2 is the mask register of $0F00, the result would be X3 being set to $0002 (which makes later operations using X3 easier). Of course, this can be implemented as 2 separate instructions, so maybe this instruction is too specialized.

I like them, but I couldn't figure out how to arrange the operands to make it efficient. It really wants to be a 4 operand instruction, with the extracted-from entity separate from the extracted-to entity. In the end, I settled for a two-instruction sequence of shift (or multiply) followed by a rotate.
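
For reference, the behavior proposed in the quoted ANDSHFT example can be modeled in Python like this. ANDSHFT is Lathe26's hypothetical instruction, not part of the shipped ISA, and returning 0 for an all-zero mask is an arbitrary choice made for this sketch.

```python
def andshft(data: int, mask: int) -> int:
    """Mask a bitfield out of `data` and shift it down to bit 0.

    The shift amount is the position of the lowest set bit in `mask`.
    """
    data &= 0xFFFF
    mask &= 0xFFFF
    if mask == 0:
        return 0
    lowest_set_bit = mask & -mask            # isolates the lowest 1 bit
    shift = lowest_set_bit.bit_length() - 1  # its bit position
    return (data & mask) >> shift


print(hex(andshft(0x1234, 0x0F00)))
```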

1 hour ago, Lathe26 said:

Would there be benefits for more graphics-centric commands that use a bank of 8 registers? The only concern is that this might start dancing closer to the anti-goal of over-the-top functionality.

Perhaps, in a V2 of the extensions. As you note, this starts looking more like an anti-goal. I'd need to consider whether to prioritize GRAM loading efficiency or obviousness. (e.g. use the more obvious encoding of 8 LSBs in each of 8 registers, or the more efficient encoding of packed 8-bit pairs in 4 registers.) And, without PSHM/PULM to quickly block transfer data into and out of the Xregs, the memory bottleneck quickly overwhelms whatever you might save.

So, I decided to table those ideas until I had an efficient way to get data into and out of Xregs.

1 hour ago, Lathe26 said:

How about instructions that help unpack Intellivoice data? Currently, Intellivoice data is stored as 10-bit quantities and it takes up modest ROM space.

I didn't bother trying to solve that one, as JLP stores 10-bit ROM in 12-bit pages rather than 16-bit pages, and already gets most of the benefit for me. (The 16-bit vs. 12-bit decision works on 4K page boundaries. Any game with significant voice can pack it in dedicated 4K pages and get the benefit.)

That said, if you set aside an 8 word buffer of RAM, it wouldn't be hard to write an efficient 5-to-8 decoder that took a block of 5 16-bit words and output 8 10-bit words using the existing shifts and bit operations. It just didn't seem like the more pressing concern. A better use of my time would be to work on a tighter encoding for Intellivoice data, as the current data is not at all compressed.
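
A Python model of such a 5-to-8 decoder might look like the following. The bit order (first 10-bit word in the most significant bits of the first 16-bit word) is an assumption chosen for the sketch, and `unpack_5_to_8` / `pack_8_to_5` are hypothetical names; a real CP-1600X version would use shifts and masks on the X registers instead of one 80-bit integer.

```python
def pack_8_to_5(decles: list) -> list:
    """Pack eight 10-bit values into five 16-bit words (80 bits total)."""
    if len(decles) != 8:
        raise ValueError("expected exactly 8 values")
    bits = 0
    for value in decles:
        bits = (bits << 10) | (value & 0x3FF)
    return [(bits >> shift) & 0xFFFF for shift in range(64, -1, -16)]


def unpack_5_to_8(words: list) -> list:
    """Recover eight 10-bit values from five 16-bit words."""
    if len(words) != 5:
        raise ValueError("expected exactly 5 words")
    bits = 0
    for word in words:
        bits = (bits << 16) | (word & 0xFFFF)
    return [(bits >> shift) & 0x3FF for shift in range(70, -1, -10)]


sample = [0x3FF, 0x000, 0x155, 0x2AA, 0x001, 0x200, 0x123, 0x0F0]
packed = pack_8_to_5(sample)
print([f"{w:04X}" for w in packed])
print(unpack_5_to_8(packed) == sample)
```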

1 hour ago, Lathe26 said:

For the instructions that end in &, is that meant to indicate that the C &= operator is applied to the dst register (i.e. CMPEQ& means that the dst register will retain its current value if src1 == src2 or will be set to 0 if src1 != src2)? I'm asking because other parts of the document use & to mean immediate values.

Yes, I used both meanings from C here: Address-of for the effective-address addressing modes, and bitwise-AND for the comparison instructions.
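
The confirmed behavior (dst keeps its value when the comparison holds, and becomes 0 otherwise) can be modeled in Python as below. Treating a true comparison as an all-ones mask is an assumption made for this sketch; the actual result encoding is specified in the ISA document. The function names are hypothetical.

```python
TRUE_MASK = 0xFFFF  # assumed "true" value: all ones, so AND leaves dst intact


def cmpeq(src1: int, src2: int) -> int:
    """Plain comparison: all ones if equal, zero otherwise."""
    return TRUE_MASK if (src1 & 0xFFFF) == (src2 & 0xFFFF) else 0


def cmpeq_and(dst: int, src1: int, src2: int) -> int:
    """Model of the & form: dst &= (src1 == src2)."""
    return dst & cmpeq(src1, src2)


# Compound comparison: (a == b) and (c == d), without branches.
a, b, c, d = 7, 7, 3, 4
flag = cmpeq(a, b)
flag = cmpeq_and(flag, c, d)
print(hex(flag))

# Conditionally zeroing a value.
print(hex(cmpeq_and(0x1234, 5, 5)), hex(cmpeq_and(0x1234, 5, 6)))
```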

1 hour ago, Lathe26 said:

Should there be an inverse of ATAN2? This would take a distance in a register, a heading of 0-15 as a register or constant, and output the X/Y in dst_lo/dst_hi.

I figured there's enough variation in heading update logic that having the inverse didn't make sense. You can easily convert ATAN2's result into sine/cosine values of the desired precision with something like:

    ATAN2 R0, X0, X1
    MVI   @X1(sintbl), R1    ; sine value
    MVI   @X1(sintbl+4), R2  ; cosine value
    MPY16 R1, X2, X3         ; assume X2 is velocity
    MPY16 R2, X2, X4

If you instead wanted to do that in fixed point, just replace MPY16 with MPYFXS. Or, if you want to model rotational inertia and use a finer-grain sine table, you could do that. etc. etc.
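The sintbl+4 offset works because cosine is sine shifted by a quarter turn, and a quarter turn is 4 of the 16 headings. Here is a Python model of the same lookup; the 8.8 scale factor and the 20-entry table length are assumptions, since the thread does not show the table itself.

```python
import math

SCALE = 256  # assumption: 8.8 fixed point; the thread does not give the table's scale

# 16 headings plus 4 extra entries, so heading + 4 never runs off the end.
SINTBL = [round(math.sin(2 * math.pi * i / 16) * SCALE) for i in range(20)]


def heading_to_velocity(heading, speed):
    """Return (x, y) velocity components, still scaled by SCALE."""
    assert 0 <= heading <= 15
    sine = SINTBL[heading]
    cosine = SINTBL[heading + 4]   # cos(a) = sin(a + quarter turn)
    return cosine * speed, sine * speed
```

A finer-grained table would work the same way, with the offset set to a quarter of the table's period.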

1 hour ago, Lathe26 said:

Add a GETDEVID instruction that loads a set of constants into X0-XF. Some registers might contain the device name, others the device version, others contain device feature flags (ex: are TXSER/RXSER supported?).

Ah, remember the bad old days of how to detect 8088 vs. 8086 vs. 80286, before they added CPUID? I could even return a short manufacturer string like x86 does ("GenuineIntel" / "AuthenticAMD"). "LegitLTO"?

1 hour ago, Lathe26 said:

Should mention that determining _what_ serial port error occurred will be device dependent (ex: LTO serial status register is at address $xxxx)

I need to check whether RXSER / TXSER modify the register when branching to the error path. In any case, JLP and Locutus both put their serial port at the same address, so if / when these instructions come to JLP, the address for serial status will be the same.

If someone figures out how to backport this to a CC3, you could always add a CP-1600X compatible serial port window at the alternate address.

1 hour ago, Lathe26 said:

I'm not understanding the section about "you cannot really perform in-place accumulation," since it looks like the accumulation is handled above. It looks like a 64-bit add was done above and then the final result was simply moved from X8-XB to X4-X7.

That's an out-of-place accumulation, as it required storing the intermediate result in X8 - XB. An in-place accumulation would have stored the accumulated value directly in X4 - X7 without disturbing other registers (except perhaps one for the carry).

If I had instead defined the extended precision instructions as adding the carry/borrow to dst_hi, then I could have gotten away with fewer new instructions (just 2 rather than 4). This is a case where the rapid speed with which I spec'd the ISA caused me to miss an opportunity.
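To illustrate the distinction only (this is a Python model, not the CP-1600X instruction sequence), here is a 64-bit add over four 16-bit words done both ways. The least-significant-word-first ordering and the function names are assumptions.

```python
MASK16 = 0xFFFF


def add64_out_of_place(a, b):
    """Return the sum in a fresh 4-word list; a and b are left untouched."""
    result, carry = [], 0
    for wa, wb in zip(a, b):
        total = wa + wb + carry
        result.append(total & MASK16)
        carry = total >> 16
    return result


def add64_in_place(acc, addend):
    """Add addend into acc word by word, overwriting acc; return the final carry."""
    carry = 0
    for i in range(4):
        total = acc[i] + addend[i] + carry
        acc[i] = total & MASK16
        carry = total >> 16
    return carry


acc = [0xFFFF, 0xFFFF, 0x0000, 0x0000]
final_carry = add64_in_place(acc, [1, 0, 0, 0])
```

The out-of-place version needs four extra words for the result (the X8-XB role above), while the in-place version only needs somewhere to hold the carry.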

artrag · June 2, 2020

@intvnut

I'm trying to figure out how to use the new opcodes in IntyBASIC.

I was testing the TAN2 function.

My idea was to use Angle=USR mytan2(dx,dy)

I've defined an asm PROC expecting R0 = dx, R1 = dy, and returning Angle in R0.

I was expecting to use this code:

ASM mytan2: PROC
ASM MOVR R1,X0
ASM TAN2 R0,X0,X0
ASM MOVR X0,R0
ASM JP R5
ASM ENDP

Naturally, my code miserably fails to compile...

Where am I going wrong?

Should I use some custom file in my toolchain?

PS

I use --jlp on the command line for IntyBASIC.

artrag · June 6, 2020

On 12/3/2019 at 11:15 AM, intvnut said:

OK, so I think I've got the CP-1600X macro file in decent shape. I tested it a bit more thoroughly, and I modified Tag-Along Todd 2V to use the CP-1600X ISA a bit more heavily, including converting it to use the fixed-point types.

One thing to watch out for: The extended register-to-register instructions are non-interruptible. This is both a blessing and a curse. You'll need to drop in a NOP every so often, or carefully partition your code to run on both the CP-1600 and the CP-1600X instruction sets. There's a few NOPs in Todd that I could eliminate if I scheduled the instructions a little differently.

Some of the mnemonics changed names slightly compared to earlier PDFs. ADD, SUB, SHR, SHRU, etc. became ADD3, SUB3, SHR3, SHRU3. The "3" suffix means "three operands," and is meant to placate the assembler.

The Tag-Along Todd download includes the cp1600x.mac file that defines the ISA extensions. You can copy that into the "examples/macro" directory in jzIntv, or put it wherever you need it. The attached ZIP file is designed to unpack directly in jzintv/examples/.

The ISA document itself still doesn't have all the details, but it has plenty. Good luck!

Attachments: Locutus_CP-1600X_Instruction_Set_Extensions_20191203a.pdf, tagalong2v_cp1600x.zip

I'm struggling to compile the ASM example tagalong2v_cp1600x.asm

I cannot find jlp_accel.asm ...

Is it missing?

intvnut · June 7, 2020

I'm catching up on my inbox. Sorry for the delay.

On 6/6/2020 at 9:16 AM, artrag said:

I'm struggling to compile the ASM example tagalong2v_cp1600x.asm

I cannot find jlp_accel.asm ...

Is it missing?

Oops, yes it is. It was supposed to be checked in as jzintv/examples/library/jlp_accel.asm as part of jzIntv. It will be a part of the next jzIntv release. Until then, I've attached the file.

On 6/2/2020 at 2:13 AM, artrag said:

@intvnut

I'm trying to figure out how to use the new opcodes in IntyBASIC.

I was testing the TAN2 function.

My idea was to use Angle=USR mytan2(dx,dy)

I've defined an asm PROC expecting R0 = dx, R1 = dy, and returning Angle in R0.

I was expecting to use this code:

ASM mytan2: PROC
ASM MOVR R1,X0
ASM TAN2 R0,X0,X0
ASM MOVR X0,R0
ASM JP R5
ASM ENDP

Naturally, my code miserably fails to compile...

Where am I going wrong?

Should I use some custom file in my toolchain?

PS

I use --jlp on the command line for IntyBASIC.

You will need to add a line like this to your BASIC code somewhere:

ASM INCLUDE "cp1600x.mac"

Adjust the path if necessary. The cp1600x.mac file is the same file that's part of the Tag-Along Todd release above. It adds the new instructions and addressing modes through some creative macros.

I've written up a quick demo of using ATAN2 from IntyBASIC and attached it. It just prints the value returned by ATAN2 for points around the perimeter of a 100-unit square in hexadecimal.

[Image: atan2_intybasic.gif]

Attachments: jlp_accel.asm, atan2_intybasic.zip

intvnut · June 8, 2020

Also, here's the corrected form of the BASIC-callable assembly function, in case folks want to see what it looks like w/out downloading the code. The documentation comment block comes from the cp1600x.mac file, and mentions another instruction, ATAN2FX, that works with IntyBASIC-style fixed point numbers.

BTW, one thing to watch out for: ATAN2 and ATAN2FX are defined based on classical Cartesian coordinates, with +y going up, rather than down. This is the opposite of the screen coordinates. Be sure you use the correct sign on the delta-Y value you pass into ATAN2.

ASM           INCLUDE "cp1600x.mac"

'' ------------------------------------------------------------------------ ''
''  ATAN2                   dst = direction_of(src1, src2)        (signed)  ''
''  ATAN2FX                 dst = direction_of(src1, src2)  (signed fx-pt)  ''
''                                                                          ''
''  Returns the direction pointed by the vector <src1,src2>.  This is       ''
''  approximately equivalent to the C library function atan2(); however,    ''
''  instead of returning a value in the range [-PI, PI], this returns a     ''
''  value in the range 0..15, counting counterclockwise from the +x axis    ''
''  as follows:                                                             ''
''                                                                          ''
''                                     4                                    ''
''                               5     ^     3                              ''
''                                     |+y                                  ''
''                           6         |         2                          ''
''                              \      |      /                             ''
''                       7        \    |    /       1                       ''
''                                  \  |  /                                 ''
''                                    \|/                                   ''
''                      8 <------------+-----------> 0                      ''
''                         -x         /|\        +x                         ''
''                                  /  |  \                                 ''
''                        9       /    |    \      15                       ''
''                              /      |      \                             ''
''                          10         |-y      14                          ''
''                               11    V   13                               ''
''                                    12                                    ''
''                                                                          ''
''  This can be useful for computing the direction to something from        ''
''  one's current position.                                                 ''
'' ------------------------------------------------------------------------ ''
ASM X_ATAN2:  PROC
ASM           MVO   R1, X0
ASM           ATAN2 R0, X0, X0
ASM           MVI   X0, R0
ASM           JR    R5
ASM           ENDP
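For checking expected values off-target, here is a rough Python reference model of the 16-direction result and of the sign flip needed for screen coordinates. Rounding to the nearest heading and returning 0 for a zero vector are assumptions; the thread does not document how the real instruction behaves at sector boundaries or for <0,0>. The function names are hypothetical.

```python
import math


def atan2_16(dx, dy):
    """Return a heading 0..15 for the vector <dx, dy>, with +y pointing up."""
    if dx == 0 and dy == 0:
        return 0  # assumption: zero-vector behavior is not stated in the thread
    angle = math.atan2(dy, dx)     # radians in [-pi, pi]
    step = 2 * math.pi / 16        # one heading spans 22.5 degrees
    # Assumption: round to the nearest heading, then wrap negatives into 0..15.
    return int(math.floor(angle / step + 0.5)) % 16


def heading_on_screen(from_x, from_y, to_x, to_y):
    """Heading between two screen positions, where screen +y points down."""
    return atan2_16(to_x - from_x, -(to_y - from_y))
```

For example, a target directly below the player on screen has a positive screen delta-Y, which is negated before the lookup and so comes out as heading 12 in the diagram above.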

artrag · June 13, 2020

Thanks! Now I have the fastest ATAN2() in IntyBASIC I could ever have expected.

Going to test the other new instructions.

artrag · June 15, 2020

BTW, what constraint limited ATAN2() to returning only 4 bits of precision?

I have a very smart and FAST routine that returns 8 bits of precision with minimal logic and 2 tables. Could it be useful?
