Assembly routines in XB

+TheBF · August 24, 2017

Quick update:

The version based on apersson850's array access method is fully functional now. Actually, it was never broken to begin with; rather, the PIO throughput was much lower than I had expected and was not being adequately picked up by my connected equipment. The culprit was the compare instruction used when cycling through the array elements:
       CLR  R3
       MOVB @VERTEX,R3        GET NUMBER OF VERTICES IN ARRAY
       SWPB R3
REDO   LI   R1,1
LOOP   MOVB @VERTEX(R1),@PIO  SEND ARRAY BYTE TO PIO
       INC  R1
       C    R1,R3
       JLE  LOOP
       JMP  REDO
At Lee's suggestion, I changed it to a decrementing loop instead, thus eliminating the compare instruction altogether:
REDO   CLR  R3
       MOVB @VERTEX,R3        GET NUMBER OF VERTICES IN ARRAY
       SWPB R3
       LI   R2,VERTEX+1
LOOP   MOVB *R2+,@PIO  SEND ARRAY BYTE TO PIO
       DEC  R3
       JNE  LOOP
       JMP  REDO
and now the array is cycled through much faster. I had no idea Compare executed that slowly...

At this point, I'm not even going to bother with the NUMREF utility function, given that I would have to issue a BLWP call for each array element, which is going to be even slower.

Fascinating...

3 MHz clock, 14 cycles or more to do almost anything and you get a VERY slow machine. I would estimate 120,000 instructions per second or less.

State of the art circa 1970s

By comparison, the modern MSP430 from TI, which has a similar instruction set and register count to the TMS9900, does some instructions in 1 clock cycle but on average takes about 2.5.

So with the same 3 MHz clock it would give you 1,200,000 instructions per second.

This is part of the fun of programming the old devices. You must be smarter <oldfartrant> than these kids today, programming multi-core GHz processors in JavaScript.</oldfartrant>

:-D

+Vorticon · August 24, 2017

I'm going to measure the PIO throughput tonight with and without the compare instruction, to get an objective sense of the difference and for the advancement of human knowledge.

apersson850 · August 24, 2017

Considering that a standard TI 99/4A runs most of your own assembly code in memory which has four wait-states per access, you can add roughly 10 more cycles per instruction, and then you land at almost exactly what you estimate.

On the other hand, having 14 cycles per instruction gives you about 214,000 instructions/s, so a modification to have all 16-bit RAM is quite valuable when it comes to performance.
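The estimates in this exchange can be checked with a few lines of Python. This is only a sketch of the arithmetic: the 3 MHz clock and the cycle counts are the figures quoted in the thread, and the function name is made up for illustration.

```python
# Rough instructions-per-second estimates from the clock rate and the
# average number of cycles per instruction. All figures are the ones
# quoted in this thread, not measurements.
CLOCK_HZ = 3_000_000  # 3 MHz

def instructions_per_second(cycles_per_instruction, clock_hz=CLOCK_HZ):
    """Hypothetical helper: clock rate divided by cycles per instruction."""
    return clock_hz / cycles_per_instruction

fast_ram = instructions_per_second(14)       # 16-bit, zero wait-state RAM
slow_ram = instructions_per_second(14 + 10)  # about 10 extra wait-state cycles
msp430 = instructions_per_second(2.5)        # MSP430 average quoted above

print(round(fast_ram))  # 214286
print(round(slow_ram))  # 125000
print(round(msp430))    # 1200000
```

The 125,000 figure for code in wait-state memory is what lines up with the "120,000 instructions per second or less" estimate.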

Due to a more efficient internal design, the TMS 9995 is capable of executing the same instructions using about 1/3 of the clock cycles. But it must always read memory byte by byte, except when using the internal scratch pad RAM inside the chip.

Vorticon, you know you can start counting the clock cycles if you want to, right? You don't have to measure.

Edited August 24, 2017 by apersson850

+Vorticon · August 24, 2017

Yes I know, but I want to measure the actual pulse output from the PIO port with my oscilloscope, as there may be some variation related to hardware. But the main reason is that I get to play with the oscilloscope.

apersson850 · August 24, 2017

Being crystal controlled, the variation from the hardware should be negligible. But the playing with the scope thing I endorse fully, so I'll not object any more!

+InsaneMultitasker · August 25, 2017

So I'm back at it with another project, and I've run into a snag: I need to send an entire array from XB to a small assembly program to be put out to the PIO port. The problem is that while I can give the assembly program access to the array via the NUMREF utility, it seems that I can only access one element for each XB call.

Looks like you figured out NUMREF and the arrays. Neat stuff. It isn't clear to me how many elements you intend to send or how they are 'calculated'. An alternative might be to use a string. If your values are in the range 0 to 255 and the total number of elements is less than 256, you could build the data into a string, pass the string with STRREF, and use the resulting length byte to send the 'elements' to PIO in that fashion. For example, FOR I=1 TO 255::A$=A$&CHR$(I)::NEXT I::CALL LINK("SEND",A$) would send 255 bytes. I suppose it also depends on whether you are building the elements on the fly, retrieving from disk, etc. The PIO port being 8 bits wide lends itself to byte values.
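To make the data layout concrete, here is a small Python model of the string approach. It is only a sketch under assumptions: it assumes STRREF leaves a length byte followed by the characters in the buffer, and the function names (build_string, strref_buffer, send_all) are invented for this illustration, not real utilities.

```python
def build_string(values):
    """Model of the XB side: A$=A$&CHR$(I) for each value."""
    if len(values) > 255:
        raise ValueError("an XB string holds at most 255 characters")
    return bytes(values)  # each value must be in the range 0 to 255

def strref_buffer(xb_string):
    """Assumed buffer layout after STRREF: length byte, then the characters."""
    return bytes([len(xb_string)]) + xb_string

def send_all(buffer, send):
    """Model of the assembly side: count down from the length byte."""
    count = buffer[0]
    index = 1
    while count:  # unlike a bare DEC/JNE loop, this skips an empty string
        send(buffer[index])
        index += 1
        count -= 1

sent = []
buffer = strref_buffer(build_string(list(range(1, 256))))
send_all(buffer, sent.append)
print(len(buffer), buffer[0], len(sent))  # 256 255 255
```

One thing the model makes visible: a decrementing loop that does DEC before testing would wrap around on a zero-length string, so the assembly routine should probably check the length byte first.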

+Vorticon · August 25, 2017

The size of the array is limited to 256 bytes, and the number of actual elements is calculated prior to sending the data to the PIO port. Using string passing is definitely an option that I did not think about. Cool!

+Vorticon · August 25, 2017

So I tested the PIO throughput speed for my assembly routine:

Here's the result with the compare instruction

and here's the one without the compare instruction

Interestingly, the throughput from the PIO port was about the same in both cases, roughly 9.1 kHz, although the pulse was longer by approximately 9 microseconds with the compare instruction because the data on the PIO port stays there longer while the compare is being processed. But then, when resetting the array to the beginning, there are fewer instructions to process in the version with the compare instruction. So in the end it turns out to be a wash... Interesting finding! That said, for a 3 MHz computer, that throughput seems awfully low... Does it sound right?

apersson850 · August 25, 2017

What kind of data are you measuring? If you have data that alternates from on to off, then on again, on the same pin for each cycle, then the duty cycle of your signal should be roughly 50%. It's obviously not.

I get 120 cycles for the first method and 92 for the second, in the inner loop. This assumes everything is in slow memory. With everything in fast memory, it's 64 vs. 48.
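For anyone who wants to redo the count, the totals can be reproduced with a short Python sketch. The per-instruction figures in the tables (base cycles, number of memory accesses, and how many of those touch workspace registers) are my reading of the TMS9900 timing tables and should be treated as assumptions; the model simply adds four wait states for every access that lands in slow memory.

```python
# Each entry: (instruction, base cycles, total memory accesses,
# accesses that hit workspace registers). Assumed values, taken from a
# reading of the TMS9900 timing tables; verify against the data manual.
WAIT_STATES = 4  # extra cycles per access to slow (8-bit bus) memory

COMPARE_LOOP = [
    ("MOVB @VERTEX(R1),@PIO", 30, 7, 1),
    ("INC  R1",               10, 3, 2),
    ("C    R1,R3",            14, 3, 2),
    ("JLE  LOOP",             10, 1, 0),  # jump taken
]
DECREMENT_LOOP = [
    ("MOVB *R2+,@PIO",        28, 7, 2),
    ("DEC  R3",               10, 3, 2),
    ("JNE  LOOP",             10, 1, 0),  # jump taken
]

def loop_cycles(loop, workspace_slow, rest_slow):
    """Total cycles for one pass through the inner loop."""
    total = 0
    for _name, base, accesses, ws_accesses in loop:
        slow_accesses = 0
        if workspace_slow:
            slow_accesses += ws_accesses
        if rest_slow:
            slow_accesses += accesses - ws_accesses
        total += base + WAIT_STATES * slow_accesses
    return total

for label, ws_slow, rest_slow in [
    ("everything slow", True, True),
    ("everything fast", False, False),
    ("workspace fast, rest slow", False, True),
]:
    print(label,
          loop_cycles(COMPARE_LOOP, ws_slow, rest_slow),
          loop_cycles(DECREMENT_LOOP, ws_slow, rest_slow))
```

Under these assumptions the three rows come out as 120/92, 64/48 and 100/76, the last being the mixed case with only the workspace in fast memory.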

48 cycles is equivalent to 31.25 kHz, since you have to run through the loop twice to output first a "one", then a "zero".

By pre-loading a register with PIO, you can change the MOVB to output to *R5 instead of @PIO, saving eight more cycles, and thus reaching 37.5 kHz.

If we assume workspace in fast memory and the rest in slow memory, I get 100 vs. 76 cycles. The latter implies 19.7 kHz.
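The conversion from loop cycles to output frequency used in these figures can be sketched in Python as follows. It assumes a 3 MHz clock, a long array (so the reload overhead is ignored), and data that toggles the pin on every byte; the function name is made up for illustration.

```python
CLOCK_HZ = 3_000_000  # 3 MHz clock, as assumed throughout the thread

def square_wave_khz(cycles_per_byte, clock_hz=CLOCK_HZ):
    """Frequency on a pin that toggles with every byte sent.

    One full period needs two passes through the loop: one byte that
    sets the pin high and one that sets it low.
    """
    return clock_hz / (2 * cycles_per_byte) / 1000

print(square_wave_khz(48))            # 31.25 (everything in fast memory)
print(square_wave_khz(48 - 8))        # 37.5  (MOVB to *R5 instead of @PIO)
print(round(square_wave_khz(76), 1))  # 19.7  (workspace fast, rest slow)
```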

An interesting observation is that by "simply" adding 16-bit wide zero wait-state RAM to the machine, you roughly double the speed of the computer, in the cases where you make things easy for yourself and let both code and workspace reside in expansion RAM. Especially when writing software that works together with a higher level language, like Extended BASIC or Pascal, it's valuable to be able to leave the RAM pad as it is, to avoid messing up things you shouldn't, and still have full speed from the CPU.

For software that frequently accesses VDP RAM and such stuff, the impact is of course less.

Edited August 25, 2017 by apersson850

+Vorticon · August 25, 2017

Actually the frequency I got on the oscilloscope should be multiplied by 3 because there are 3 elements in my test array. The cycle is pin D7, D6 then D7 and my probe is on D6. So the PIO output frequency is more like 27.3 kHz...

+Vorticon · August 25, 2017

Great info. Not sure why I'm not seeing a faster frequency for the decrementing loop on the scope.

apersson850 · August 25, 2017

Well, frequency is how often a period is repeated. So you need two data elements to create a frequency. 1 Hz is one signal per second, but you need two data output operations, one to turn on and the other to turn off, to create that frequency. Thus it takes two updates per second to get 1 Hz.

Thus it would be correct to say that you can achieve 14 kHz by removing one of the three data elements. That also implies that you are reloading the array frequently. My calculations assumed that the array is at least 100 elements long, so that the time for the reload doesn't really matter. That's what gave me the 19.7 kHz figure.
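The relation between array contents and the frequency seen on one pin can be modelled in a few lines of Python. This is a sketch under assumptions: D7 and D6 are taken to be bits 7 and 6 of the byte (the actual pin numbering may differ), every byte is assumed to take the same average time to send, and the helper names are invented for the illustration.

```python
def pin_waveform(data, bit):
    """Level of one data pin for each byte in one pass through the array."""
    return [(byte >> bit) & 1 for byte in data]

def periods_per_pass(levels):
    """Rising edges in one pass, treating the pass as endlessly repeated."""
    previous = levels[-1:] + levels[:-1]
    return sum(1 for prev, cur in zip(previous, levels)
               if prev == 0 and cur == 1)

def pin_khz(data, bit, kbytes_per_second):
    """Frequency on one pin for a given byte output rate."""
    levels = pin_waveform(data, bit)
    return kbytes_per_second * periods_per_pass(levels) / len(levels)

MEASURED_KHZ = 9.1          # scope reading on D6 with three elements
three = [0x80, 0x40, 0x80]  # D7, D6, D7 (bit positions are an assumption)
two = [0x80, 0x40]          # same array with one element removed

levels = pin_waveform(three, 6)
byte_rate = MEASURED_KHZ * len(levels) / periods_per_pass(levels)
print(round(byte_rate, 1))                   # 27.3 thousand bytes per second
print(round(sum(levels) / len(levels), 2))   # 0.33 duty cycle on D6
print(round(pin_khz(two, 6, byte_rate), 2))  # 13.65 kHz with two elements
```

With three elements the pin is high for one byte out of three, which is why the duty cycle is nowhere near 50%, and dropping to two elements gives roughly the 14 kHz mentioned above.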

Edited August 25, 2017 by apersson850

+Vorticon · August 26, 2017

So more tribulations...

I set up the assembly program to scan for any key using KSCAN at the end of each array processing cycle, and if a key is detected it will return to XB. That works just fine and it dutifully returns to XB on demand. The problem is that the key that was pressed is somehow retained and when a CALL KEY is used subsequently in the XB program, it registers that key, which is undesirable. I tried placing >FF in location >8375 (the address of the key pressed with KSCAN) prior to returning to XB with no effect. Clearing that location also does nothing. Clearing the GPL status byte before exiting also has no effect.

Is there some magical location specific to XB where the key from KSCAN is stored by any chance?

I tell you my hair is thinning by the minute :spidey:

senior_falcon · August 26, 2017

Have you tried waiting until the finger is off the key before returning to XB?

+Lee Stewart · August 26, 2017

Check KSCAN for "no keystroke" before returning to XB to avoid key bounce.

...lee

+InsaneMultitasker · August 26, 2017

IIRC, you can stuff one of the bytes in the >83Cx range with the scanned key to trick the auto-repeat functionality into a debounce. Check Thierry's page, as I believe he goes into a bit of this in the keyboard or ROM section(s).

If the solutions posted by SeniorFalcon and Lee do not work for some reason, you could also opt to do a simple scan for the CTRL or SHIFT key with a simple CRU test. I recycled this from TIMXT's interpreter, with a few minor changes, as an example.

 
* TEST FOR a pause
NOPS1  LIMI 0            don't allow interrupts during CRU operation.
       LI   R12,>0024    set row
       CLR  R0           to 0
       LDCR R0,3
       LI   R12,6        set column
       TB   6            CTRL ?
       JNE  NOPSE        yes. exit
       TB   5            SHIFT?
       JNE  NOPSE        yes
* (add a LIMI 2 here to trigger an interrupt before re-scanning)
       JMP  NOPS1        stay in loop
NOPSE  LIMI 2            continue...

+Vorticon · August 26, 2017

Ah debounce! I don't know why I did not think of it...

A simple delay loop before returning to XB solved the problem, BUT with one notable exception: if I press the space bar instead of any other key to exit the assembly routine, then once I am back in BASIC I get an uninterruptible stream of space characters coming in... It's the oddest thing! Any ideas here?

+Vorticon · August 27, 2017

OK figured it out. Looks like if the RS232 card is still active upon return to XB, strange things happen with KSCAN. Properly turning off the card before exiting the assembly program solved the issue. Meshing hardware and software is tricky...
