matthew180

Assembly on the 99/4A


It looks like you're advanced enough to know the basics for speeding things up, but definitely unroll your clear loop (I find 4 times a good tradeoff if you can afford the space; the benefits taper off after 8). Likewise for your VDP copy loop, it should be helpful to unroll that one 2-4 times. If you have enough space in scratchpad, those alone will give you back a noticeable amount of time.

 

Also, replace @VDPWD in your VDP copy loop with a register, this also saves a few cycles both for reading the instruction and processing it.

 

Ah yes, unrolling the loops, I forgot about that old trick. But actually a quick test shows that it doesn't help much, only about 20% faster. What is the best way to profile your code in Classic99? Is there some way to count how many cycles an entire subroutine call takes? Thanks for your advice.


I would have expected better, unless you mean 20% overall and not just in the loop itself. I benchmarked unrolled copy loops on hardware a few years ago to see what could be done. :)

 

Classic99 has a timing mode in the debugger: set a start and end point, and the debug log will emit the number of cycles executed between those two points. In the help, it's described as T(0000-0001), where 0000 is the start address and 0001 is the end. More accurately, the first address resets the counter to zero, and the second dumps it to the debug log. :)

 


... can you tell me how DIV and DIVS are implemented? In particular, I'm interested in the overflow detection of DIVS, because right now I have an ugly piece of code in MESS to predict an overflow ...

 

DIV was one part of the CPU that worried me when I started because I had no idea how it was done. After a lot of research I found a few methods:

 

* subtract and replace

 

* subtract and shift

 

* a complicated (but fast) prediction method with lots of math I don't understand

 

* turn the divisor into its reciprocal and multiply by the dividend. This method requires a lookup table, but is also fast and can be done in a consistent time. This is the method used on the CRAY-1 computer. Check these links out:

 

http://bitsavers.informatik.uni-stuttgart.de/pdf/cray/2240004C_CRAY-1_Hardware_Reference_Nov77.pdf

http://bitsavers.informatik.uni-stuttgart.de/pdf/cray/

 

I chose a version of the subtract and shift method and modified it to be 32x16 instead of the dividend and divisor being the same size. In looking at my HDL to reply to this post, I found a bug though! Grrr. When I modified the NxN divide example into an NxM divide, I failed to realize a subtle difference caused by the difference in dividend vs divisor size. So, my DIV in the GPU is limited to a 15-bit divisor and 31-bit dividend. I'm glad you asked the question though, since it forced me to review the HDL and find the error. I have a fix, but of course that is another update on top of the one I *just* put out. However, I don't suppose the need for *yet another update* will ever end.

 

The 9900 does not have signed division (DIVS), so I did not have to deal with it, but I did go learn about it so I could reply with some sort of useful information. Here is a link that I found that explains how to perform the signed division and handle the sign of the results:

 

http://homepages.cae.wisc.edu/~ece352/fall00/project/pj2.pdf

 

Basically, signed division is done the same as unsigned by converting the inputs to unsigned values first. There are a few tests to do on the inputs to determine the resulting sign of the quotient and remainder:

 

1. The remainder will have the same sign as the dividend.

2. The quotient will be positive if the divisor and dividend have the same sign, otherwise negative.

 

The minimal extra overhead of DIVS over DIV is probably checking and converting the inputs to unsigned as necessary, and setting two flags based on the original inputs so the output values can be made negative if necessary.

 

Overflow would work the same way as unsigned division, once the inputs are converted to unsigned values, and can be checked prior to doing the actual DIVS just like DIV.
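The sign handling described above can be sketched in C. This is only an illustration of the two rules, not how any particular CPU implements DIVS. The names `divs_result_t` and `divs_via_unsigned` are made up for this example, and the sketch assumes the caller has already ruled out a zero divisor and overflow.

```c
#include <stdint.h>

/* Hypothetical result pair for this sketch. */
typedef struct {
    int16_t q;   /* quotient  */
    int16_t r;   /* remainder */
} divs_result_t;

/* Signed 32/16 divide built on top of an unsigned divide.
 * Assumes divisor != 0 and that the quotient fits in 16 bits. */
divs_result_t divs_via_unsigned(int32_t dividend, int16_t divisor)
{
    /* Remember the signs of the original inputs (the "two flags"). */
    int neg_dividend = (dividend < 0);
    int neg_divisor  = (divisor < 0);

    /* Convert both inputs to unsigned magnitudes. */
    uint32_t ud = neg_dividend ? (uint32_t)0 - (uint32_t)dividend
                               : (uint32_t)dividend;
    uint32_t uv = neg_divisor ? (uint32_t)(-(int32_t)divisor)
                              : (uint32_t)divisor;

    /* Plain unsigned division. */
    uint32_t uq = ud / uv;
    uint32_t ur = ud % uv;

    divs_result_t res;
    /* Rule 2: the quotient is negative when the input signs differ. */
    res.q = (int16_t)((neg_dividend != neg_divisor) ? -(int32_t)uq
                                                    : (int32_t)uq);
    /* Rule 1: the remainder takes the sign of the dividend. */
    res.r = (int16_t)(neg_dividend ? -(int32_t)ur : (int32_t)ur);
    return res;
}
```

For example, -7 / 2 comes out as quotient -3 and remainder -1, while 7 / -2 comes out as quotient -3 and remainder 1.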

 

For a lopsided divider, i.e. dividers with configurations of 32-bit x 16-bit, or 16-bit x 8-bit, or 8-bit x 4-bit, etc., the value of the divisor must be greater than the value of the most-significant-half of dividend. If not, overflow occurs and can be checked even before the division begins. I did an exercise to see this work using a 4x2 divider:

 

4-bit x 2-bit divider:

The dividend is two 2-bit registers, the divisor is one 2-bit register.
The answer quotient and remainder must be 2-bit, i.e. 0..3.

binary       decimal
----------   -------
11:11 / 11  =  15 / 3  =  5r0 overflow, MS-half of dividend >= divisor
11:10 / 11  =  14 / 3  =  4r2 overflow, MS-half of dividend >= divisor
11:01 / 11  =  13 / 3  =  4r1 overflow, MS-half of dividend >= divisor
11:00 / 11  =  12 / 3  =  4r0 overflow, MS-half of dividend >= divisor
----
10:11 / 11  =  11 / 3  =  3r2
10:10 / 11  =  10 / 3  =  3r1
10:01 / 11  =   9 / 3  =  3r0
10:00 / 11  =   8 / 3  =  2r2
01:11 / 11  =   7 / 3  =  2r1
01:10 / 11  =   6 / 3  =  2r0
01:01 / 11  =   5 / 3  =  1r2
01:00 / 11  =   4 / 3  =  1r1
00:11 / 11  =   3 / 3  =  1r0
00:10 / 11  =   2 / 3  =  0r2
00:01 / 11  =   1 / 3  =  0r1
00:00 / 11  =   0 / 3  =  0r0
You can see that if the MS bits of the dividend are >= the divisor, the quotient will not fit in 2 bits, and thus overflow is set. What also becomes clear by looking at this is the limitation within the range of numbers. For example, you can't divide 8 by 2, since the answer is 4R0, but 4 cannot be represented with 2 bits and would set the overflow flag.
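If you want to convince yourself that the pre-check is exact, a brute-force comparison is easy to write. This is a rough sketch with made-up names (`div_overflows`, `count_mismatches`); it tries every dividend and every non-zero divisor for a 2n-bit by n-bit divider and counts the cases where the pre-check disagrees with actually doing the division.

```c
#include <stdint.h>

/* Pre-check for a 2n-bit by n-bit unsigned divide: overflow when the
 * MS half of the dividend is >= the divisor (this also catches divisor 0). */
int div_overflows(uint32_t dividend, uint32_t divisor, unsigned n)
{
    return (dividend >> n) >= divisor;
}

/* Count the cases where the pre-check disagrees with doing the division
 * and seeing whether the quotient fits in n bits.
 * Keep n small (say n <= 8) so the brute force stays quick. */
unsigned count_mismatches(unsigned n)
{
    uint32_t max_n  = (1u << n) - 1;         /* largest n-bit value  */
    uint32_t max_2n = (1u << (2 * n)) - 1;   /* largest 2n-bit value */
    unsigned mismatches = 0;

    for (uint32_t divisor = 1; divisor <= max_n; divisor++) {
        for (uint32_t dividend = 0; dividend <= max_2n; dividend++) {
            int really_overflows = (dividend / divisor) > max_n;
            if (really_overflows != div_overflows(dividend, divisor, n))
                mismatches++;
        }
    }
    return mismatches;
}
```

Calling `count_mismatches(2)` walks the same 4x2 divider as the table, and `div_overflows(8, 2, 2)` flags the 8 / 2 case mentioned above.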

 

So, in the 32x16 division, there are actually many ranges of numbers that cannot be divided if the dividend is too large compared to the divisor. Luckily this is easily determined by comparing the divisor to the MS-word of the dividend as noted above.

 

Some of the ranges for various NxM division are:

 

Max Dividend / Max Divisor

4-bit / 2-bit
-------------
10:11 / 11  =  11 / 3  =  3r2

Max 4-bit dividend value - max allowable dividend: 15 - 11 = 4


8-bit / 4-bit
-------------
1110:1111 / 1111  =  239 / 15  =  15r14

Max 8-bit dividend value - max allowable dividend: 255 - 239 = 16


16-bit / 8-bit
--------------
11111110:11111111 / 11111111  =  65279 / 255  =  255r254

Max 16-bit dividend value - max allowable dividend: 65535 - 65279 = 256


32-bit / 16-bit
---------------
1111111111111110:1111111111111111 / 1111111111111111  =  4294901759 / 65535  =  65535r65534

Max 32-bit dividend value - max allowable dividend: 4294967295 - 4294901759 = 65536
It is interesting to note that the difference between the max value for a dividend (all bits '1'), and the max allowable dividend, is one greater than the max value of the divisor.
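That pattern falls out of the overflow rule directly. As a short worked derivation (the symbols n and D_max are my own notation, with n the divisor width in bits): the largest divisor is 2^n - 1, so the largest allowed MS half is 2^n - 2, and the LS half can be anything up to 2^n - 1.

```latex
\begin{aligned}
D_{\max} &= (2^{n} - 2)\cdot 2^{n} + (2^{n} - 1) = 2^{2n} - 2^{n} - 1 \\
(2^{2n} - 1) - D_{\max} &= 2^{n} = (2^{n} - 1) + 1
\end{aligned}
```

With n = 2 this gives 16 - 4 - 1 = 11, and with n = 16 it gives 4294967296 - 65536 - 1 = 4294901759, matching the table.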

 

Here is a C program I wrote as an interpretation of the HDL divider (the corrected version) in the F18A:

 

 

 

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>

enum
{
    IDLE,
    OP,
    DONE,
    QUIT
};

int
main(int argc, char *argv[])
{
    uint32_t d;         // 16-bit   divisor
    uint32_t rl;        // 16-bit   dividend ls half, becomes the quotient
    uint32_t rh;        // 16-bit   dividend ms half, becomes the remainder
    uint32_t msb;       // 1-bit    ms bit shifting out of rh
    uint32_t diff = 0;  // 16-bit   partial remainder
    uint32_t sub17;     // 17-bit   17-bit subtraction
    uint32_t q_bit = 0; // 1-bit    quotient bit
    uint8_t count;      // 4-bit    0 to 15 counter
    uint32_t dividend;  // 32-bit   input dividend
    uint32_t divisor;   // 16-bit   input divisor
    uint32_t q;         // 16-bit   output quotient
    uint32_t r;         // 16-bit   output remainder
    uint32_t state;


    state = IDLE;

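    if ( argc != 3 )
    {
        printf("\nUsage: %s dividend divisor\n\n", argv[0]);
        return 1;
    }

    // strtoul keeps the full unsigned 32-bit range, which atoi cannot.
    dividend = (uint32_t)strtoul(argv[1], NULL, 10);
    divisor = (uint32_t)(strtoul(argv[2], NULL, 10) & 0xFFFF);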

    while ( state != QUIT )
    {
        // FSM is synchronous with the main clock.
        switch ( state )
        {
        case IDLE :

            d = divisor;
            rl = (dividend & 0xFFFF);
            rh = ((dividend >> 16) & 0xFFFF);
            msb = 0;
            count = 15;
            printf("%u / %u\n", (rh<<16)+rl, d);

            // Divisor must be greater than the dividend ms half.
            if ( d > rh )
                state = OP;
            else
            {
                // Quotient would not fit in 16 bits, so skip the divide.
                printf("overflow\n");
                return 1;
            }

            break;

        case OP :

            // msb stores the shifted-out bit of rh for the 17-bit subtract.
            msb = ((diff << 1) & 0x10000);  // msb is positioned for the 17-bit subtraction

            // rh shifts left and stores the difference and next dividend bit.
            rh = ((diff << 1) & 0xFFFE) + ((rl & 0x8000) >> 15);

            // rl shifts left and stores the quotient bits.
            rl = ((rl << 1) & 0xFFFE) + q_bit;

            if ( count == 0 )
                state = DONE;
            count--;

            break;

        case DONE :

            // Final iteration stores the quotient and remainder.
            rh = diff;
            rl = ((rl << 1) & 0xFFFE) + q_bit;
            state = QUIT;
            break;

        default :

            state = QUIT;
            break;
        }

        printf("diff:%u q_bit:%u msb:%u rh:%u rl:%u\n", diff, q_bit, msb, rh, rl);

        // Combinatorial logic, always evaluated.

        // 17-bit subtraction
        sub17 = (msb + rh) - d;

        // If the partial remainder is greater than or equal to the
        // divisor, the result will not be negative.
        if ( (sub17 & 0x10000) == 0 )
        {
            // Partial remainder is the difference.
            diff = (sub17 & 0xFFFF);
            q_bit = 1;
        }

        else
        {
            // Partial remainder is still smaller than the divisor.
            diff = rh;
            q_bit = 0;
        }
    }

    q = rl;
    r = rh;
    printf("%u R %u\n", q, r);

    return 0;
}
// main()

 

 

For anyone interested, below is the corrected VHDL 32x16 division implementation I wrote for the 9900 GPU. Note that overflow is tested outside of this module before the DIV even starts, like this:

 

-- The divisor (src_oper) must be > the MS word of the dividend
if ws_dout >= src_oper then
   cpu_state <= st_cpu_status;
   div_overflow <= '1';
end if;

 

-- F18A V1.5
-- Matthew Hagerty, copyright 2013
--
-- Create Date:    20:09:36 04/12/2012
-- Module Name:    f18a_div32x16 - rtl
--
-- Unsigned 32-bit dividend by 16-bit divisor division for the
-- TMS9900 compatible GPU.  16-clocks for the div op plus two
-- clocks state change overhead.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.std_logic_unsigned.all;


entity f18a_div32x16 is
   port (
      clk            : in  std_logic;
      reset          : in  std_logic;                    -- active high, forces divider idle
      start          : in  std_logic;                    -- '1' to load and trigger the divide
      ready          : out std_logic;                    -- '1' when ready, '0' while dividing
      done           : out std_logic;                    -- single done tick
      dividend_msb   : in  std_logic_vector(0 to 15);    -- MS Word of dividend 0 to FFFE
      dividend_lsb   : in  std_logic_vector(0 to 15);    -- LS Word of dividend 0 to FFFF
      divisor        : in  std_logic_vector(0 to 15);    -- divisor 0 to FFFF
      q              : out std_logic_vector(0 to 15);
      r              : out std_logic_vector(0 to 15)
   );
end f18a_div32x16;

architecture rtl of f18a_div32x16 is

   type div_state_t is (st_idle, st_op, st_done);
   signal div_state : div_state_t;

   signal rl      : std_logic_vector(0 to 15);           -- dividend lo 16-bits
   signal rh      : unsigned(0 to 15);                   -- dividend hi 16-bits
   signal msb     : std_logic;                           -- shifted msb of dividend for 17-bit subtraction
   signal diff    : unsigned(0 to 15);                   -- partial remainder - divisor difference
   signal sub17   : unsigned(0 to 16);                   -- 17-bit subtraction
   signal q_bit   : std_logic;                           -- quotient bit
   signal d       : unsigned(0 to 15);                   -- divisor register
   signal count   : integer range 0 to 15;               -- 0 to 15 counter
   signal rdy     : std_logic;
   signal dne     : std_logic;

begin

   -- Quotient and remainder will never be more than 16-bit.
   q <= rl;
   r <= std_logic_vector(rh);
   ready <= rdy;
   done <= dne;

   -- Compare and subtract to derive each quotient bit.
   sub17 <= (msb & rh) - ('0' & d);

   process (sub17, rh)
   begin
      -- If the partial result is greater than or equal to
      -- the divisor, subtract the divisor and set a '1'
      -- quotient bit for this round.
      if sub17(0) = '0' then
         diff <= sub17(1 to 16);
         q_bit <= '1';

      -- The partial result is smaller than the divisor
      -- so set a '0' quotient bit for this round.
      else
         diff <= rh;
         q_bit <= '0';
      end if;
   end process;

   -- Divide
   process (clk) begin if rising_edge(clk) then
      if reset = '1' then
         div_state <= st_idle;
      else

         rdy <= '1';
         dne <= '0';

         case div_state is

         when st_idle =>

            d <= unsigned(divisor);
            count <= 15;
            msb <= '0';

            -- Only change rl and rh when triggered so the registers
            -- retain their values after the division operation.
            if start = '1' then
               div_state <= st_op;
               rl <= dividend_lsb;
               rh <= unsigned(dividend_msb);
               rdy <= '0';
            end if;

         when st_op =>

            -- rl shifts left and stores the quotient bits.
            rl <= rl(1 to 15) & q_bit;
            -- rh shifts left and stores the difference and next dividend bit.
            rh <= diff(1 to 15) & rl(0);
            -- msb stores the shifted-out bit of rh for the 17-bit subtract.
            msb <= diff(0);

            count <= count - 1;
            rdy <= '0';

            if count = 0 then
               div_state <= st_done;
            end if;

         when st_done =>

            -- Final iteration stores the quotient and remainder.
            rl <= rl(1 to 15) & q_bit;
            rh <= diff;
            dne <= '1';
            div_state <= st_idle;

         end case;
      end if;
   end if; end process;

end rtl;

 

 

If you are not familiar with HDL, note that the assignments ( <= ) are not done in a serial manner like in programming, but in parallel (this is hardware, not software). So, something like:

 

count <= count - 1;
if count = 0 then
   ...
The value of "count" (a register) in the condition has not changed when the "if" is evaluated, and will not change to the new value until the next clock cycle. That is one of the things you have to wrap your head around when moving from programming to describing hardware. The inherent parallelism is very cool IMO.
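For programmers, one way to picture this is a register with two slots: the value everyone reads during the current clock, and the value that has been scheduled for the next one. Here is a small C model of the fragment above. It is only a mental model with made-up names (`reg_t`, `clock_edge`), not how a simulator or synthesis tool works internally.

```c
/* Hypothetical model of one clocked register: `cur` is what every
 * reader sees during this clock, `next` is what was assigned with <=. */
typedef struct {
    int cur;
    int next;
} reg_t;

/* One rising clock edge for the fragment
 *     count <= count - 1;
 *     if count = 0 then ...
 * Returns 1 if the "if" body would run on this edge. */
int clock_edge(reg_t *count)
{
    /* Schedule the new value; count->cur is not touched yet. */
    count->next = count->cur - 1;

    /* The condition still sees the value from before the edge. */
    int fired = (count->cur == 0);

    /* End of the clock edge: all scheduled values take effect together. */
    count->cur = count->next;
    return fired;
}
```

If this were written as ordinary sequential C (decrement, then test), the "if" would fire one clock earlier than it does in the hardware.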


Matthew, thanks a lot for the elaborate answer! I'll have to save that message for some spare time to come during the next days ... hopefully ...


On page 2 of this thread, about post #38 or so, I talked about what the console ISR does and also how much I dislike the way the 99/4A only has a single external interrupt available. Don't get me wrong, I like interrupts. They are efficient. But the way the 99/4A was designed, they add a lot of unnecessary overhead.

 

For me, polling the VDP was the way to go, however in light of the new information in the "Smooth Scrolling" thread, I have to change my opinion on this. It appears that polling the original 9918A VDP can be problematic. Apparently the original VDP does not manage asynchronous hazards and if you are polling the status byte at the same time the VDP is updating the status byte, the flags will be cleared and come back in the status byte as clear, i.e. you miss the VSYNC and/or collision flags. If you can tolerate a few VSYNC or collision misses every few seconds, then polling might still be an option. But for smooth scrolling, timing, or sound processing, the glitches can be noticeable.

 

Because of this new information, using the VDP interrupt, and subsequently the console ISR seems to be a more stable option. The only problem with the console ISR is all the baggage it brings along, but luckily most of it can be disabled. It does still have overhead since it makes various checks before getting to your code, so you will have to decide if that is acceptable.

 

Assuming you read pg2 from post #38 on, I'm only going to give a small example of disabling the ISR features and hooking your own routine.

 

The control byte for the ISR is in the 16-bit scratchpad RAM at address >83C2. The first four bits control the various ISR services: auto sprite motion, the sound player, and the QUIT key. Yes, that is only three services, the fourth is "disable all" (and is actually the first bit, the MSb, in the control byte). To be safe, I set the control byte to >FF which ensures everything in the ISR is disabled.

 

One *nice* thing the ISR does actually do is to allow us to "hook" into the ISR. When the ISR is all done with its processing, it will check the 16-bit value at scratchpad address >83C4, and if it is not zero, it will use that value as an address to branch to. So, we can put the address of a subroutine we wrote in >83C4 and when the VDP generates an interrupt, the console ISR (after a little overhead) will call our code. It is actually very nice, I just wish the check for the ISR hook had been *first* in the console ISR instead of last.

 

The hook code we write should be small since an ISR is supposed to "service" something and return quickly. In my example, I simply use a flag in my code that can only be set by our ISR hook. That keeps things very small and neat.

 

One more thing about the ISR hook. Before you load your own ISR routine into >83C4, you are supposed to see if the value has already been set. If it has, some other code has also hooked into the console ISR and you are supposed to save that address before loading your own address. Then, when you are done with your own ISR, instead of returning with B *R11, you are supposed to branch to the address you found at >83C4. This allows multiple programs to hook the ISR and all of them will be called in a chain. You will probably only see this if you are working on assembly code to be LINKed into XB or something. If your program is loaded via the E/A or you expect to be the only program running, you can just ignore any previous value. :-)

 

So, on with the example. I like to use names in my code instead of remembering addresses and numbers, so I set up some equates (#define for those more familiar with other languages):

 

*Note, as of this post, code tags are still broken on the forum, so the spacing will not work out. Thus I'm not even going to try (much). You will have to reformat this yourself.

 

ISRCTL EQU >83C2  * Four flags: disable all, skip sprite, skip sound, skip QUIT, XXXX
USRISR EQU >83C4  * Interrupt service routine hook address

 

My ISR simply sets a flag and returns. This is the *only* place in the code where this flag can be set to something other than zero. Very important. The flag here is in 16-bit RAM, and is just an address I chose to use. It could also be allocated with a DATA statement:

 

FLGINT EQU >8354  * Interrupt flag
- or -
FLGINT DATA 0

.
. Somewhere in the code is the very simple ISR.
.

MYISR
  INC @FLGINT
  B   *R11

 

To hook into the console ISR, somewhere during your initialization, and before you begin any game loop or such thing, do something like this:

 

  LI   R0,>FF00
  MOVB R0,@ISRCTL * Disable everything in the ISR
  LI   R0,MYISR   * Address of my ISR
  MOV  R0,@USRISR * Load my ISR into the console ISR hook address
  CLR  @FLGINT    * Initially clear my ISR flag

 

Keep in mind your code still runs mostly with LIMI 0, i.e. the CPU's interrupts disabled. Or not, it is up to you really. Now in your game loop or wherever you need to wait for the VSYNC, you use something like this:

 

VWAIT
  CLR  @FLGINT  * Optional depending on if LIMI 0 is normal, or if you want to know you missed an interrupt
  LIMI 2  * Enable interrupts
VLOOP
  MOV  @FLGINT,@FLGINT  * Cheap way to compare to zero.
  JEQ  VLOOP  * If still zero, wait.
  LIMI 0  * Optional depending on how you want things to work.
  CLR  @FLGINT  * Interrupt was set, so clear the flag and continue.
.
. Rest of the code.
.

 

How you manage things is up to you. You might not want to sit around in an idle loop waiting for the VSYNC, or maybe you do. You could just leave interrupts enabled (LIMI 2) and check the flag in your main loop, and if set, clear it and do something. If you want to know that you missed a VSYNC, then you can leave interrupts enabled during your main processing and check the flag when you are done, etc.

 

The only thing to remember is, you *MUST* disable interrupts when you communicate with the VDP, i.e. update the screen, set a VDP register, etc. This is because the console ISR will read the VDP status byte which will mess up any VDP communication you may have had going on when the interrupt occurred. And if auto sprite motion is enabled, the console ISR is going to update the whole Sprite Attribute Table.


Oh well, already some months ago...

 

To relax, now a little bit of math for you.

 

Got more time now, so I had a new look at the DIVS overflow detection, and now I think I got it. I have produced a heap of paper with wrong solutions, most of the attempts ending in sufficient but not necessary criteria (if the check indicates overflow, there will be an overflow, but if it says no, there could still be an overflow). But now here is a solution which worked with all test cases.

 

Signed division, V=32 bit signed value, D=16 bit signed divisor, not 0.

  1. Case: V>=0 and D>0. Overflow iff V > D << 15 - 1.
  2. Case: V>=0 and D<0. Overflow iff V > (-D) << 15 + (-D) - 1.
  3. Case: V<0 and D>0. Overflow iff (-V) > D << 15 + D - 1.
  4. Case: V<0 and D<0. Overflow iff (-V) > (-D) << 15 - 1.

(Non-math people: "iff"="if and only if". "D<<15" = shift left D by 15; alternatively: D*2^15; << binds stronger than +/-)
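The four cases above translate into C fairly directly. This is a rough sketch, assuming truncating signed division and D != 0; the function names `divs_overflows` and `divs_overflows_ref` are made up for this example, and the second one is just a brute-force reference to compare against.

```c
#include <stdint.h>

/* Returns 1 if V / D would not fit in a signed 16-bit quotient,
 * following the four cases above. Assumes D != 0. */
int divs_overflows(int32_t v, int16_t d)
{
    /* Work in 64 bits so -V and (-D) << 15 cannot overflow in C. */
    int64_t av = (v < 0) ? -(int64_t)v : (int64_t)v;   /* |V| */
    int64_t ad = (d < 0) ? -(int64_t)d : (int64_t)d;   /* |D| */
    int64_t limit;

    if ((v >= 0 && d > 0) || (v < 0 && d < 0)) {
        /* Cases 1 and 4: quotient is positive, at most 32767. */
        limit = ad * 32768 - 1;
    } else {
        /* Cases 2 and 3: quotient is negative, at least -32768. */
        limit = ad * 32768 + ad - 1;
    }
    return av > limit;
}

/* Reference check: do the division in 64 bits and test the range. */
int divs_overflows_ref(int32_t v, int16_t d)
{
    int64_t q = (int64_t)v / (int64_t)d;
    return q < -32768 || q > 32767;
}
```

The cases around >8000 are the interesting ones: 32768 / -1 gives -32768 and fits, while -32768 / -1 would give +32768 and does not.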

 

The symmetry in the solutions gives me some confidence that this should be correct, even catching the cases with that nasty 8000. My fault in earlier attempts was that I divided the value by 2^15 and threw away the important bits in the lower positions.


Quick question:

The EA manual states that whenever the value in VR1 is being changed, the new value should first be saved in location >83D4 before effecting the change to the register. Is this truly necessary? I can't seem to see a difference regardless of whether I save the new value or not...

Thanks.


If interrupts (i.e. the console ISR) are enabled it might be necessary. It might be more of an "XB thing", but I'm not sure. Can't remember. Sorry for being so vague. If you are working with XB or some other language, plan to use the ISR, or call DSRs, then you probably should follow the rules. If you are taking over the console, i.e. a game that does not return to XB or anywhere else, do what you want with the scratchpad.


 

Quick question:

The EA manual states that whenever the value in VR1 is being changed, the new value should first be saved in location >83D4 before effecting the change to the register. Is this truly necessary? I can't seem to see a difference regardless of whether I save the new value or not...

Thanks.

 

 

I "think"....

 

The KSCAN routine (IIRC) writes the value found @ >83d4 to R1 when it is accessed. I guess as part of the screen saver feature ? I don't know if it is connected to the ISR or does it @ every call.

 

If You change screen modes and don't update >83d4 then a call to KSCAN will revert your screen mode back to what it was before you changed it.


This gives me an opportunity to post an old question that I was struggling with for Titanium and now again for Scramble. The code below is for a pause key toggle. It works fine to toggle the pause on, but when you try to toggle the pause off again it is often turned back on immediately after. I assume the problem is lack of debouncing, but how can you prevent it except by adding long delays? Even if I could, I wouldn't want to call KSCAN.

*      Test pause key P
       LI   R1,>0500     * Test column 5
       LI   R12,>0024    * Address for column selection
       LDCR R1,3         * Select column
       LI R12,>000A      * P key
PAUSE1 TB 0
       JEQ PAUSE1        * Wait for press
PAUSE2 TB 0
       JNE PAUSE2        * Wait for release
*      Do pause stuff
PAUSE3 TB 0
       JEQ PAUSE3        * Wait for press
PAUSE4 TB 0
       JNE PAUSE4        * Wait for release
PAUSE5 ...


Why do you not save some memory and use the OS built in Keyscan?

 

It does have a built in debounce and would do the trick with no real cost for speed.


Quick question:

The EA manual states that whenever the value in VR1 is being changed, the new value should first be saved in location >83D4 before effecting the change to the register. Is this truly necessary? I can't seem to see a difference regardless of whether I save the new value or not...

Thanks.

 

It's necessary if you call KSCAN -- when a key is pressed, KSCAN copies the value in >83D4 back into VR1 (to turn off any potential screen blanking :) ).


 

This gives me an opportunity to post an old question that I was struggling with for Titanium and now again for Scramble. The code below is for a pause key toggle. It works fine to toggle the pause on, but when you try to toggle the pause off again it is often turned back on immediately after. I assume the problem is lack of debouncing, but how can you prevent it except by adding long delays? Even if I could, I wouldn't want to call KSCAN.

*      Test pause key P
       LI   R1,>0500     * Test column 5
       LI   R12,>0024    * Address for column selection
       LDCR R1,3         * Select column
       LI R12,>000A      * P key
PAUSE1 TB 0
       JEQ PAUSE1        * Wait for press
PAUSE2 TB 0
       JNE PAUSE2        * Wait for release
*      Do pause stuff
PAUSE3 TB 0
       JEQ PAUSE3        * Wait for press
PAUSE4 TB 0
       JNE PAUSE4        * Wait for release
PAUSE5 ...

 

It likely is key bounce.. the two ways around keybounce are long delays and multiple samples over shorter delays - when you get the same result 3-4 times in a row, then you can accept it.

 

Since it's a pause you are dealing with, I would just put the delay in the game loop, and ignore the pause key for 100ms or so after pausing (and vice versa) - so the game runs immediately, but you can't pause again immediately. ;)


I "think"....

 

The KSCAN routine (IIRC) writes the value found @ >83d4 to R1 when it is accessed. I guess as part of the screen saver feature ? I don't know if it is connected to the ISR or does it @ every call.

 

If You change screen modes and don't update >83d4 then a call to KSCAN will revert your screen mode back to what it was before you changed it.

Ah yes, exactly. The "screen saver" on the 99/4A blanks the screen by setting the blank-bit in VR1, so it would have to know what to write for the other VR1 bits when blanking/unblanking.


I assume the problem is lack of debouncing, but how can you prevent it except by adding long delays

Along the same lines as what Tursi already mentioned, you need to sample the key over a period of time. The console does this by imposing a long delay in the KSCAN routine, which is really bad for a game. Human reaction time is something on the order of 200ms (IIRC), so check the input for 5 consecutive frames and increment a counter each time the key is pressed. If the count ends up between 3 and 5, accept the key as pressed. You might have to play with the values a little, but that is the basic idea.
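Here is a rough sketch of that counting idea, written in C only so it is easy to read and test; on the 99/4A it would be a few instructions in the per-frame code. The names (`vote_debounce_t`, `vote_debounce_step`) and the exact thresholds are assumptions for illustration and would need tuning.

```c
#include <stdint.h>

#define WINDOW_FRAMES 5   /* frames per sampling window */
#define PRESS_MIN     3   /* pressed in at least this many frames */

typedef struct {
    uint8_t frames;   /* frames sampled so far in this window */
    uint8_t hits;     /* frames in which the key read as pressed */
    uint8_t state;    /* last accepted key state: 1 = pressed */
} vote_debounce_t;

/* Call once per frame with the raw key reading (nonzero = pressed).
 * Returns the accepted state, which only changes at the end of a window. */
uint8_t vote_debounce_step(vote_debounce_t *db, int raw_pressed)
{
    if (raw_pressed)
        db->hits++;

    if (++db->frames == WINDOW_FRAMES) {
        /* Window complete: accept "pressed" only with enough hits. */
        db->state = (db->hits >= PRESS_MIN);
        db->frames = 0;
        db->hits = 0;
    }
    return db->state;
}
```

Start with all three fields at zero and call the step function once per VSYNC. The key state then updates every 5 frames, and a single bounced sample inside a window cannot flip it.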


Why do you not save some memory and use the OS built in Keyscan?

 

It does have a built in debounce and would do the trick with no real cost for speed.

Because it does cost speed, to the tune of ~16ms per call and that is *just* the delay loops used in KSCAN and does not count the overhead of the rest of the routine. I don't think Rasmus can spare an extra frame just to read the key input and still maintain the scrolling.

 

Look in the TI-Intern at ROM addresses >0390 and again at >03B6. They branch to >0498 which is this code:

 

Time delay
0498 020C LI  12,>04E2 Loop counter
049A 04E2
049C 060C DEC 12
049E 16FE JNE >049C
04A0 045B B   *11
>04E2 = 1250 decimal. DEC takes 10 clock cycles, and JNE takes 10 clock cycles (when the PC is changed, which it is inside the loop). The formula for instruction time is:

 

T = Tc * (C + W * M)

 

T = instruction time in uS

Tc = 0.333

C = clock cycles, which is 20 in this case for both instructions

W = wait states, which luckily is 0 because this is in 16-bit ROM

M = memory accesses, which is 6 for both instructions

 

T = 0.333 * (20 + 0 * 6)

T = 6.66uS * 1250 loop iterations = 8.325 milliseconds.

 

The delay is called twice in KSCAN. There are easily better ways, especially for games.
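The arithmetic above is easy to redo for other loop counts or memory speeds. A tiny C sketch of the formula follows; the function names (`instr_time_us`, `delay_loop_ms`) are made up, and the values for Tc, C, W and M are simply the ones quoted in this post.

```c
/* T = Tc * (C + W * M), in microseconds.
 * tc_us        : clock cycle time in microseconds
 * cycles       : clock cycles for the instruction(s)
 * wait_states  : wait states per memory access
 * mem_accesses : number of memory accesses */
double instr_time_us(double tc_us, unsigned cycles,
                     unsigned wait_states, unsigned mem_accesses)
{
    return tc_us * (cycles + wait_states * mem_accesses);
}

/* Time for the DEC/JNE delay loop, in milliseconds, using the
 * numbers from the post: Tc = 0.333, C = 20, W = 0, M = 6. */
double delay_loop_ms(unsigned iterations)
{
    double per_iteration_us = instr_time_us(0.333, 20, 0, 6);
    return per_iteration_us * iterations / 1000.0;
}
```

`delay_loop_ms(0x04E2)` gives the 8.325 ms figure, and two calls come to about 16.65 ms, which is essentially one whole frame at 60Hz. Since W is 0 here, the value of M does not affect the result.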


Ah yes, exactly. The "screen saver" on the 99/4A blanks the screen by setting the blank-bit in VR1, so it would have to know what to write for the other VR1 bits when blanking/unblanking.

OK this makes sense to me now. This issue came up as I continue work on Ultimate Planet and noted that VR1 was not saved prior to displaying the bitmap splash screen, with no ill effects. However, I did follow the rules in the main program which also accesses KSCAN, so I should be good. Thanks for the insight guys :)


 

This gives me an opportunity to post an old question that I was struggling with for Titanium and now again for Scramble. The code below is for a pause key toggle. It works fine to toggle the pause on, but when you try to toggle the pause off again it is often turned back on immediately after. I assume the problem is lack of debouncing, but how can you prevent it except by adding long delays? Even if I could, I wouldn't want to call KSCAN.

*      Test pause key P
       LI   R1,>0500     * Test column 5
       LI   R12,>0024    * Address for column selection
       LDCR R1,3         * Select column
       LI   R12,>000A    * P key
PAUSE1 TB   0
       JEQ  PAUSE1       * Wait for press
PAUSE2 TB   0
       JNE  PAUSE2       * Wait for release
*      Do pause stuff
PAUSE3 TB   0
       JEQ  PAUSE3       * Wait for press
PAUSE4 TB   0
       JNE  PAUSE4       * Wait for release
PAUSE5 ...

 

Perhaps your scheme could be to.......

 

Sample the key once per VDP interrupt (I assume that is your time slice). When you get 2 consecutive positive hits (on 2 consecutive frames), activate the pause/un-pause feature. Once a press has been accepted, do not allow another "positive input" scan until 4 consecutive no-hits have been registered (again on consecutive frames). This gives you the debounce you need without any delay loop, and the required key-down time should be no more than 1/15 of a second, which should work for a pause switch. That number (4 frames total) assumes 2 frames for settling on NTSC, which is most likely more than needed. Of course it is an arbitrary number and may take more, but I would hazard that 2 would be more than enough for a solid press ;-)
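
The scheme above could be sketched roughly as follows. This is only an illustration, not tested on hardware: the column select and P key CRU values are copied from the sample code in the question, while the label names and the use of R2-R5 as state are assumptions made for the example. It assumes the routine is called once per frame with BL, always with the same workspace, and that R2-R5 were cleared once at startup.

```
*      Debounced pause toggle, called once per frame (sketch)
*      Hypothetical state, kept in workspace registers:
*      R2 = consecutive frames with P down
*      R3 = consecutive frames with P up while locked out
*      R4 = 0 when armed, >FFFF after a toggle until release is seen
*      R5 = pause flag (0 = running, >FFFF = paused)
PSCAN  LI   R1,>0500     * Test column 5
       LI   R12,>0024    * Address for column selection
       LDCR R1,3         * Select column
       LI   R12,>000A    * P key
       TB   0            * EQ set = bit is 1 = key is up
       JEQ  PUP
*      Key is down this frame
       CLR  R3           * Break the run of no-hits
       MOV  R4,R4        * Still locked out?
       JNE  PDONE        * Yes, ignore the key for now
       INC  R2           * One more consecutive hit
       CI   R2,2         * Two frames in a row yet?
       JLT  PDONE        * No, keep counting
       INV  R5           * Yes, toggle the pause flag
       SETO R4           * Lock out until 4 no-hits are seen
       CLR  R2
       JMP  PDONE
*      Key is up this frame
PUP    CLR  R2           * Break the run of hits
       MOV  R4,R4        * Locked out?
       JEQ  PDONE        * No, nothing to do
       INC  R3           * One more consecutive no-hit
       CI   R3,4         * Four frames in a row yet?
       JLT  PDONE        * No, stay locked out
       CLR  R4           * Yes, re-arm
       CLR  R3
PDONE  B    *R11
```

The main loop would then test R5 after each call instead of spinning on the key, so nothing blocks while the key bounces. The thresholds 2 and 4 are the numbers suggested above and would probably need tuning.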


Thank you for the replies. Perhaps a simple solution would be only to execute the code every 4 or 8 frames? That should prevent another pause from being triggered immediately after one is released. I will try that before moving on to the more advanced solutions.

 

If KSCAN is waiting an entire frame of 16ms, that makes it useless for high-speed games. 16ms is enough for scrolling, sprites, collision detection, sound, input reading and game logic combined (actually Scramble only uses about 12ms).

 

Edit: Just realized the sample code I posted was misleading. The first loop should not be there. It should jump to PAUSE5 if the key is not pressed instead of waiting.
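
For reference, one reading of that correction is shown below. It is the sample code posted earlier with the first wait loop replaced by a single test; everything else, including the elided code at PAUSE5, is unchanged. Whether this matches the actual game code is an assumption.

```
*      Test pause key P (first wait loop removed, per the edit note)
       LI   R1,>0500     * Test column 5
       LI   R12,>0024    * Address for column selection
       LDCR R1,3         * Select column
       LI   R12,>000A    * P key
       TB   0
       JEQ  PAUSE5       * Not pressed: skip the pause code
PAUSE2 TB   0
       JNE  PAUSE2       * Wait for release
*      Do pause stuff
PAUSE3 TB   0
       JEQ  PAUSE3       * Wait for press
PAUSE4 TB   0
       JNE  PAUSE4       * Wait for release
PAUSE5 ...
```

In this form the un-pause release at PAUSE4 can still bounce, and the next pass through the test then sees the key as pressed again, which would explain the pause being turned back on immediately.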


How about using different keys for (P)ause and (C)ontinue?

This adds a compare, but avoids timers and counters (and comparing their results in turn).

Edited by jens-eike


Along the same lines as what Tursi already mentioned, you need to sample the key over a period of time. The console does this by imposing a long delay in the KSCAN routine, which is really bad for a game. Human reaction time is something on the order of 200ms (IIRC), so check the input for 5 consecutive frames and increment a counter each frame the key is pressed. Then, if the value is between 3 and 5, accept the key's state. You might have to play with the values a little, but that is the basic idea.
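
A minimal sketch of that 5-frame vote, under the same assumptions as the sample code in the question (column 5, P key at CRU address >000A, key reads 0 when pressed). The label names and the use of R2-R4 as state are hypothetical. It assumes a call once per frame with BL, always with the same workspace, after a one-time setup of LI R2,5, CLR R3 and CLR R4.

```
*      5-frame majority vote on the P key (sketch)
*      R2 = frames left in the current 5-frame window
*      R3 = frames in this window with the key down
*      R4 = accepted state of the last window (0 = up, >FFFF = down)
PVOTE  LI   R1,>0500     * Test column 5
       LI   R12,>0024    * Address for column selection
       LDCR R1,3         * Select column
       LI   R12,>000A    * P key
       TB   0            * EQ set = bit is 1 = key is up
       JEQ  PV1
       INC  R3           * Key is down this frame
PV1    DEC  R2           * One frame of the window used
       JNE  PVEND        * Window not complete yet
       CLR  R4           * Assume "up"
       CI   R3,3         * 3 or more hits out of 5?
       JLT  PV2          * No, leave it as "up"
       SETO R4           * Yes, accept the key as "down"
PV2    LI   R2,5         * Start a new window
       CLR  R3
PVEND  B    *R11
```

R4 only changes once every 5 frames, so a pause toggle would act on a change of R4 from 0 to >FFFF rather than on the raw key.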

So I have to ask. Is this KSCAN slowing down running Basic / Extended Basic programs ? Besides the GPL interpretation I mean.


So I have to ask. Is this KSCAN slowing down running Basic / Extended Basic programs ? Besides the GPL interpretation I mean.

KSCAN does have a delay after reading the keyboard: it counts down from 1250 (DEC, JNE, no interrupts and fast memory). I guess TI thought this was a safe value. I wonder if the 3 to 3.58 MHz CPU speedup hack makes it unsafe.

 

Waiting a frame before reading the keyboard directly should be safe too: you can count to more than 1250 in one frame, unless something like speech stops you. On the other hand, waiting a frame might only be safe using the ISR (LIMI 2). I guess some kind of counting would be safe too (while reading the VDP status register), but it would have to be proven (with screen blanking, speech, attached hardware, etc.).
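
As a rough check on both points, the numbers can be worked out from the 20 clock cycles per DEC/JNE iteration quoted earlier in the thread. This assumes zero wait states, that the speedup simply scales the clock, and a frame of about 1/60 second; none of it is measured on hardware.

```latex
\begin{align*}
N \cdot C &= 1250 \times 20 = 25000 \ \text{clock cycles} \\
T_{3.00\,\mathrm{MHz}} &= \frac{25000}{3.00\,\mathrm{MHz}} \approx 8333\,\mu\mathrm{s} \approx 8.33\,\mathrm{ms} \\
T_{3.58\,\mathrm{MHz}} &= \frac{25000}{3.58\,\mathrm{MHz}} \approx 6983\,\mu\mathrm{s} \approx 6.98\,\mathrm{ms} \\
\frac{T_{\text{frame}}}{T_{\text{iteration}}} &\approx \frac{16667\,\mu\mathrm{s}}{6.67\,\mu\mathrm{s}} \approx 2500 \ \text{iterations per frame at 3 MHz}
\end{align*}
```

So under these assumptions the speedup would shorten the delay by roughly 1.3 ms, and a full frame leaves room for about twice the 1250 count. The small difference from the 8.325 ms figure given earlier comes from rounding Tc to 0.333.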


So I have to ask. Is this KSCAN slowing down running Basic / Extended Basic programs ? Besides the GPL interpretation I mean.

 

Yes. I can see it in TurboForth. TF uses the console's KSCAN routine, and if you put it in a short loop the delay is noticeable.

 

Try this in TF:

 

 

: TEST1 ( -- ) PAGE 500 0 DO I . KEY? DROP LOOP ;
: TEST2 ( -- ) PAGE 500 0 DO I . LOOP ;

 

:mad:

