Same delay in shorter space?

Willsy · October 8, 2010

; wait 12uS - see editor assembler page 349, paragraph 5.
	nop   ; 2 bytes
	nop   ; 2 bytes
	nop   ; 2 bytes
	rt    ; 2 bytes

I need the same delay time (or a little bit longer...) but in fewer instructions (I need to save 2 bytes). Can it be done?

BTW: This is running in 16-bit scratch-pad RAM.

Would something like this work?

; wait 12uS - see editor assembler page 349, paragraph 5.
	mov *r0,*r0 ; 2 bytes
	rt          ; 2 bytes

The second bit of code is non-destructive, and 50% smaller. But how long does it take to execute? Classic99 indicates 22 cycles, but what is that in actual time? (And how do you work it out - I'll write it down this time, I promise ;-)

	mov r0,r0   ; 2 bytes

The above code (spot the subtle difference) requires 14 cycles according to Classic99 - how long is that?

Thanks

matthew180 · October 9, 2010

1 cycle in the 99/4A is about 333ns (nanoseconds), which is the period for a 3MHz clock. There are 1000ns in 1us (microsecond). So, 14 cycles is 333ns * 14 = 4.662us.

*NOTE*

This all assumes scratch pad no wait-state RAM, as you indicated. As soon as you step into any other RAM, you have to add wait states to these timings.

The MOV instruction without any modifiers (symbolic, indirect, indexed, and autoincrement) is 14 clock cycles. So: MOV R0,R0 is 4.662us. Adding indirect addressing to either parameter will add 4 clock cycles and 1 memory access.

MOV *R0,R0  - 18 clock cycles = 333ns * 18 = 5.994us

MOV *R0,*R0 - 22 clock cycles = 333ns * 22 = 7.326us

Autoincrement adds an additional 4 clocks on top of the indirect addressing:

MOV *R0+,R0 - 22 clock cycles = 333ns * 22 = 7.326us

MOV *R0+,*R0 - 26 clock cycles = 333ns * 26 = 8.658us

MOV *R0+,*R0+ - 30 clock cycles = 333ns * 30 = 9.990us
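The figures above follow a simple pattern, which can be sketched in a few lines of Python. This is only an illustration under the thread's assumptions: zero wait states (scratch pad), the rounded 333ns cycle, and word operations. The names `MODE_EXTRA`, `mov_cycles` and `cycles_to_us` are made up for this sketch, and the symbolic and indexed modes are left out.

```python
# Cycle period used throughout the thread: ~333 ns for the 3 MHz clock.
CYCLE_NS = 333

# Extra clocks per operand by addressing mode (word operations),
# taken from the figures quoted in the post above.
MODE_EXTRA = {
    "Rn": 0,    # workspace register
    "*Rn": 4,   # register indirect
    "*Rn+": 8,  # register indirect with autoincrement
}

MOV_BASE = 14  # MOV with both operands in workspace registers


def mov_cycles(src_mode, dst_mode):
    """Clock cycles for MOV, assuming zero wait states."""
    return MOV_BASE + MODE_EXTRA[src_mode] + MODE_EXTRA[dst_mode]


def cycles_to_us(cycles):
    """Convert clock cycles to microseconds at 333 ns per cycle."""
    return cycles * CYCLE_NS / 1000


for src, dst in [("Rn", "Rn"), ("*Rn", "Rn"), ("*Rn", "*Rn"), ("*Rn+", "*Rn+")]:
    c = mov_cycles(src, dst)
    print(f"MOV {src},{dst}: {c} clocks = {cycles_to_us(c):.3f} us")
```

Running it reproduces the 14, 18, 22 and 30 clock cases listed above.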

What you really need though is 36 clock cycles to get 12us (12us / 333ns = 36.036). Use a shift instruction, since they take 12 clocks + 2C (C == the count) clocks. That gives you some fine grain control, and the count is stored in the instruction so it is still only 2 bytes. For example:

SRC R1,1  - 12 clocks + 2 * 1 = 14 clocks = 4.662us
SRC R1,2  - 12 clocks + 2 * 2 = 16 clocks = 5.328us
SRC R1,3  - 12 clocks + 2 * 3 = 18 clocks = 5.994us
SRC R1,4  - 12 clocks + 2 * 4 = 20 clocks = 6.660us
.
SRC R1,12 - 12 clocks + 2 * 12 = 36 clocks = 11.988us
.
SRC R1,15 - 12 clocks + 2 * 15 = 42 clocks = 13.986us

Just remember not to use a count of zero! A zero count tells the CPU to take the count from bits 12 through 15 of R0 instead. Counts encoded in the instruction can only be 1 to 15; a count of 16 wraps to 0. If you use a zero count, and bits 12 through 15 of R0 are also zero, the shift executes 16 times and takes 52 clocks. Anyway, just stick with 1 to 15 and you have a delay between 4.662us and 13.986us.
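To pick a count for a given delay, the 12 + 2C rule can be turned into a small lookup. The sketch below is an illustration only: it assumes scratch pad RAM (no wait states) and the rounded 333ns cycle used in this thread, and `smallest_count_for` is a hypothetical helper name.

```python
CYCLE_NS = 333  # approximate cycle time used in this thread


def shift_cycles(count):
    """Clocks for a shift with an immediate count of 1..15."""
    if not 1 <= count <= 15:
        raise ValueError("immediate shift count must be 1..15")
    return 12 + 2 * count


def smallest_count_for(delay_ns):
    """Smallest count whose delay is at least delay_ns, or None if out of range."""
    for count in range(1, 16):
        if shift_cycles(count) * CYCLE_NS >= delay_ns:
            return count
    return None


for target_ns in (5000, 12000, 14000):
    print(target_ns, "ns ->", smallest_count_for(target_ns))
```

With the period rounded to 333ns, a count of 12 comes out at 11988ns, a hair under 12us, so this strict check selects 13 for a 12000ns target. The RT that follows the shift adds further clocks on top, so in practice there is some margin either way.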

Matthew

Edited October 13, 2010 by matthew180

Willsy · October 9, 2010

Great stuff - thanks for taking the time, Matthew, that's a great help!

Mark

sometimes99er · October 9, 2010

	mov *r0,*r0 ; 2 bytes

Isn't it potentially dangerous ? - Like reading and writing memory-mapped devices.

Willsy · October 9, 2010

It could be, but I'm in control of the 'code path' anyway, and I know at that point that R0 isn't pointing to anything dangerous!

Willsy · October 9, 2010

Matthew

Could I impose on you to time these instructions for me (the ones indicated)? I would do it, but I don't have the data book, and I don't know of an online source where this information is available (in fact, that would make a GREAT addition to the programming resources thread here on Atariage!)

; convert the word to nybbles and send to the speech synth...
	li r2,4			; 4 nybbles to load
loadlp		src r0,4		; start with least significant nybble
	mov r0,r1		; copy it
	src r1,4		; get target nybble into correct position
	andi r1,>0f00		; mask out the nybble of interest
	ori r1,>4000		; put in 4x00 format for speech synth
	movb r1,@spchwt		; send it to the speech synth
-->	dec r2			; finished?
-->	jne loadlp		; do next nybble if not
-->	li r1,>4000		; load 'speak from rom' opcode to speech synth
	movb r1,@spchwt		; send it to the speech synth... synth is now talking
romspx		rt			; return from interrupt

I've been reading the speech synth section in the Editor Assembler manual. That section is rather poorly written IMO and you have to read it carefully.

According to section 22.1.1, page 349:

The delay time from loading an address until the next command is 42 microseconds.

Fine, I can live with that. No problem. However, on the last iteration of the loop above, just after the last address byte is written out to SPCHWT, the instructions indicated can actually be considered as part of the 42 uS delay, since they have to be decoded and executed.

So if they add up to 42uS or more, I'm all set and I don't need to worry about it any more. At worst I may need to add a NOP or a MOV R0,R0 or something to spin the wheels a bit longer.

Seems pedantic? Well yes, but I don't like redundant code, so a delay loop, if not required is just silly, and I'm at the point where bytes matter!

BTW: These instructions are executing from 8-bit CPU ROM (cartridge space).

Thanks again,

Mark

+retroclouds · October 9, 2010

I would do it, but I don't have the data book, and I don't know of an online source where this information is available (in fact, that would make a GREAT addition to the programming resources thread here on Atariage!)

Mark, did you look at section 3.6 (TMS9900 INSTRUCTION EXECUTION TIMES), page 28 of the TMS9900 Microprocessor data manual ?

I know it's a bit brief, but it might be what you are looking for.

I have the "9900 Family Systems Design and Data Book" at home.

That's a good book that also goes into detail on instruction timing. Unfortunately I don't have this available as PDF.

If someone has a PDF copy, please send it to me and I'll upload it to the Development Resources thread.

Willsy · October 9, 2010

Thanks, I've downloaded the PDF. The PDF will help to give the instruction timings for the 9900, but it doesn't help in the 99/4a environment, especially when you have wait-state/non-wait-state RAM & ROM!

Willsy · October 9, 2010

According to the data sheet, that 3 instruction phrase gives:


(where WS=wait-states=0)
DEC R2       = 10 + (3*WS) = 13
JNE LOADLP   =  8 + (1*WS) =  9
LI R1,>4000  = 12 + (3*WS) = 15
                            --
                            37 cycles

In 0 wait-state memory I think that would be:

37 cycles @ 333ns per cycle = 12321ns / 1000 = 12.321us

Is that correct? Or is it 30 cycles in 0 wait-state memory? (because WS=0 and x*0 is always 0?)

How does it work for 8-bit memory? Is it 666ns/cycle?
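The data manual formula T = tc(o) * (C + W*M) can be checked quickly in Python. This is a rough sketch that uses the cycle and memory-access counts from the table above and applies the same number of wait states to every access, which is a simplification. `total_cycles` is a name made up for this example.

```python
CYCLE_NS = 333  # approximate cycle time on the 99/4A

# (clocks, memory accesses) for each instruction, as listed in the table above
INSTRUCTIONS = {
    "DEC R2": (10, 3),
    "JNE LOADLP": (8, 1),   # jump not taken on the last iteration
    "LI R1,>4000": (12, 3),
}


def total_cycles(wait_states):
    """Sum C + W*M over the three instructions."""
    return sum(c + wait_states * m for c, m in INSTRUCTIONS.values())


for ws in (0, 1, 4):
    cycles = total_cycles(ws)
    print(f"WS={ws}: {cycles} cycles = {cycles * CYCLE_NS / 1000:.3f} us")
```

With WS=0 every W*M term vanishes and the total is 30 cycles; the 37 in the table is what the formula gives for WS=1.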

Edited October 9, 2010 by Willsy

matthew180 · October 9, 2010

The wait state generator adds 4 wait states to the ~2 clock cycles that make up a memory operation. So you use 4 as the wait state variable in the equation from the data manual. A wait state basically suspends the operation for 1 cycle, i.e. 333ns. So the 4 wait states add about 1.3us to *each* memory operation. That includes the instruction fetch, reading any immediate or symbolic operands, and, if the workspace is in 8-bit RAM, each register access too.

Thierry breaks it down in detail here:

http://nouspikel.group.shef.ac.uk/ti99/wait.htm

Matthew

Edited October 9, 2010 by matthew180

Willsy · October 9, 2010

Thanks Matthew,

So I make that:

(where WS=wait-states=4)
DEC R2       = 10 + (3*WS) = 22
JNE LOADLP   =  8 + (1*WS) = 12
LI R1,>4000  = 12 + (3*WS) = 24
                            --
                            58 cycles

58 cycles @ 333ns per cycle = 19314ns / 1000 = 19.314us

~36% slower?

Does the fact that the workspace (the registers) is in 0 wait state memory complicate matters?

Sorry to hassle about this - there are not many people that can answer this stuff!

Mark

matthew180 · October 9, 2010

All memory access in the 99/4A causes wait states *except* the system ROM (>0000 to >1FFF) and the 256 bytes of scratch pad RAM.

So, if the workspace pointer points to scratch pad (as it always should), then any operands that read the value of the registers will not have wait states. However, something like *R0 will not have a wait state when reading the register value, but the indirection address will cause a wait state (assuming it does not point to scratch pad memory).

Yes, the wait states do cause a huge performance hit, as you have discovered. 36% is probably conservative.
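Applying that rule per access instead of per instruction changes the numbers for the three-instruction sequence discussed earlier. The sketch below is an estimate, not a measurement. It assumes the code runs from cartridge ROM (4 wait states per access), the workspace is in scratch pad (no wait states), and each instruction's accesses break down as noted in the comments. That breakdown is an assumption of this example.

```python
CYCLE_NS = 333   # approximate cycle time
WAIT_STATES = 4  # added to each access that goes over the 8-bit bus

# (name, base clocks, one flag per memory access: True = wait-stated)
# Assumption: code in cartridge ROM, registers in scratch pad.
INSTRUCTIONS = [
    ("DEC R2", 10, [True, False, False]),      # fetch, read R2, write R2
    ("JNE LOADLP", 8, [True]),                 # fetch only (jump not taken)
    ("LI R1,>4000", 12, [True, True, False]),  # fetch, immediate word, write R1
]


def cycles(base, accesses):
    """Base clocks plus wait states for the wait-stated accesses only."""
    return base + WAIT_STATES * sum(accesses)


total = sum(cycles(base, acc) for _, base, acc in INSTRUCTIONS)
needed = 42000 / CYCLE_NS  # cycles required to cover a 42 us delay

print(f"{total} cycles = {total * CYCLE_NS / 1000:.3f} us")
print(f"cycles needed for 42 us: {needed:.1f}")
```

Under these assumptions the sequence takes 46 cycles, about 15.3us, so on its own it would not cover a 42us delay.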

Matthew

+retroclouds · October 10, 2010

Matthew, instruction timing & memory access would make a *very* interesting topic for your book on assembly language

matthew180 · October 12, 2010

Sorry Willsy, I have not had any time to calculate timings lately. Did you work out your numbers?

Retroclouds: I was thinking about a chapter on instruction timing, but I don't know if that goes too low level. You have to dig into the operation of the CPU at a level lower than assembly to really understand and calculate the timing. I like that kind of stuff though, and it can't hurt to have it in the book. :-)

Matthew

+retroclouds · October 12, 2010

Hi Mark,

I presume you are currently looking for driving the speech synthesizer from the cartridge ROM space >6000->7FFF ?

Also is it safe to assume you can do that, without having to copy part of the speech player code into scratch-pad memory ?

That would only be required if your program code is residing in memory located "behind" the speech synthesizer (32K memory in PEB) ?

I know that timing for code running from the cartridge space is different from when it is located in scratchpad. But other than that, it should be possible I guess?

Sure hope so, because I need to save on scratchpad memory.

I'm starting work on the speech player now, using some code I got from you in 2009

This is an interesting area I don't know much about yet, but starting to learn now.

How big are the differences between the TMS5220 and TMS5200 ? Were all TI-99/4A speech synthesizers driven by the TMS5200 or are there also any using the TMS5220.

I think that classic99 and MESS are running on TMS5220 but this seems to cause "glitches".

Anyway, I just found the TMS5220 preliminary data manual at bitsavers. Check here.

Might be worth adding to the Development Resources thread

EDIT: The thing about player code having to reside in scratch-pad memory only applies for the "speak external" command. When using the normal "speak" command with built-in vocabulary you are safe. The question is: Is any of the built-in vocabulary worth listening to in games ?

Edited October 12, 2010 by retroclouds

+retroclouds · October 12, 2010

Here is a cool demo of the speak 'n spell. I suppose its TMS5100 is relatively close to the TMS5200 ?

EDIT: Would be cool doing a TI-99/4A speech game with the speak & spell voice. Sorry getting carried away now

Edited October 12, 2010 by retroclouds

Willsy · October 12, 2010

Hi RetroClouds

The *only* Speech Synth code that needs to go in scratch-pad ram is the code to READ data from the synth. Basically, to cut a long story (that I only just about understand) short, the SS is rather slow. It seems to take a long time to either latch data in, or gate data out (can't remember now).

But when reading, you cannot use the 8 bit bus, and that includes the 8 bit memory in the cartridge space: according to Tursi, accesses in the cartridge space will still trigger the multiplexer and cause the wrong data to be latched.

There are other timing restrictions, but they do not require execution from pad ram.

To summarise:

When reading data or status: 12uS - 8 bit bus cannot be used, so the read and the delay must be executed from pad.
When writing external data into the speech synth (i.e. streaming LPC speech data into it) 10uS is required. In practice, no action is necessary, because you will be streaming data in a loop, and the loop will take > 10uS to execute.
The delay after loading a command until giving it some more work/data is 42uS. So, to speak a word from ROM, load the 4 address nybbles, then load >40 then wait 42uS, then issue a "speak from rom" (>50) command.

It's a bit of a pain to work with!

If your scratch-pad space is low, consider overwriting other code in scratch pad with synth code when you need it, then restore the old code.

In practice, you don't need much code. With Matthew's help, I arrived at the following code which does the job with minimal byte usage:

	movb @spchrd,@spdata	; move data from speech synth to location spdata
	src R0,12		; wait 12uS - see editor assembler page 349, paragraph 5.
	rt
spdata		data 0                  ; place to store the data read from the synth

That's 10 bytes, plus 2 bytes (only one byte *actually* needed) to store the data read from the synth. Note: You need to store the data read from the synth in scratch-pad ram, or else you will trigger the 8-bit bus! Of course, once your program has returned via the RT you can move it from pad and do anything.

You could make the above code 4 bytes shorter if you use registers:

	movb *r0,*r1		; r0 -> spchrd, r1 -> spdata: move data from speech synth to spdata
	src R0,12		; wait 12uS - see editor assembler page 349, paragraph 5. (this rotates r0, so reload it before the next call)
	rt
spdata		data 0                  ; place to store the data read from the synth

But of course, you would have to initialise r0 and r1 appropriately before calling the subroutine in pad.

Hope this helps.

Mark

+retroclouds · October 13, 2010

Hi Mark,

thank you for your answer, it'll save me quite some time

It's a bit of a bummer that I'll have to use scratchpad memory for reading/status polling.

But it's ok, I probably can use the same scratchpad area I also use for tight loops.

Yeah, it's a wonder how they managed to drive the synth with the TI-99/4A in the first place.

Makes me appreciate what Parsec does a lot more

+retroclouds · October 13, 2010

I'm trying to calculate how the "SRC R0,12" is used to get to 12 uS.

Assuming the SRC is located in scratchpad:

From the data book:

T = Total instruction execution time

Tc(o) = clock cycle time

C = number of clock cycles for instruction execution plus address modification

W = number of required wait states per memory access for instruction execution plus address modification

M = number of memory accesses

T = tc(o) * (C + W*M)

In our case that would read up to:

Tc(o) = 0.333 uS on the TI-99/4A

C = 52 clock cycles for executing "SRC" + increasing PC

W = 0 (we are in scratchpad)

M = 4

T = 0.333 * (52 + 0 * 4) = 17.316 uS

So I must be doing something wrong here, because 17.316 uS is more than 12 uS ?

I'm confused

sometimes99er · October 13, 2010

SRC is a "Shift" operation. Look it up in table 3, page 28, of the TMS9900 Microprocessor data manual:

http://www.retroclouds.de/atariage/tms9900_microprocessor_data_manual.pdf

There's a difference when the count is zero or not.

Also see post #2.

Edited October 13, 2010 by sometimes99er

matthew180 · October 13, 2010

The binary encoding (machine code) of a shift instruction looks like this:

| 0 1 2 3 4 5 6 7 | 08 09 10 11 | 12 13 14 15 |
+-----------------+-------------+-------------+
|     OPCODE      |      C      |      W      |
+-----------------+-------------+-------------+

C is the count

W is the register to shift

Note that C and W only have 4 bits, thus the only possible values are between 0 and 15.

There are three variations of the shift instructions:

1. The C (count) is NOT zero: 12 + 2C clocks

2. The C is zero and bits 12 through 15 of R0 *ARE* zero: 52 clocks

3. The C is zero and bits 12 through 15 of R0 *are NOT* zero: 20 + 2N clocks

The N parameter in #3 is the count value, 1 to 15, from bits 12 through 15 of R0.

R0 when used for the count

| 0 1 2 3 4 5 6 7 8 9 10 11 | 12 13 14 15 |
+---------------------------+-------------+
|  XXXX  DON'T CARE  XXXX   |      N      |
+---------------------------+-------------+

Remember, a zero parameter to a shift instruction means "get the shift count from R0". But, since the shift count can only be between 0 and 15, only bits 12 through 15 of R0 are used. If that value is also zero, then the shift count will be 16, hence the longest instruction time for #2 above. This is the ONLY way to get a shift count of 16. Note the difference between #1 and #3 is 8 cycles, which is due to having to read R0 for the count in the case of #3 (in case #1 the count is encoded as part of the instruction.)
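The three cases can be folded into one small function. This is a sketch of the rules as stated in this post, assuming no wait states; `shift_clocks` is a name invented for the example. Note that TI numbers bits from the most significant end, so bits 12 through 15 are the low four bits of the word.

```python
def shift_clocks(count_field, r0=0):
    """Clocks for a TMS9900 shift, following the three cases above.

    count_field: the 4-bit C field of the instruction (0..15)
    r0: value of workspace register R0 (only its low 4 bits are used)
    """
    if not 0 <= count_field <= 15:
        raise ValueError("count field must be 0..15")
    if count_field != 0:
        return 12 + 2 * count_field  # case 1: count encoded in the instruction
    n = r0 & 0x000F                  # bits 12..15 in TI numbering
    if n == 0:
        return 52                    # case 2: shift count of 16
    return 20 + 2 * n                # case 3: count taken from R0


print(shift_clocks(12))         # SRC Rn,12
print(shift_clocks(0, 0x0000))  # count from R0, low nibble zero
print(shift_clocks(0, 0x1234))  # count from R0, low nibble 4
```

Case 2 is consistent with case 3: 20 + 2 * 16 is also 52.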

I think using R0 in a shift instruction with a count of zero is illegal (SRC R0,0), but maybe only for the assembler; I think the CPU will execute it. I cannot find anything in the datasheet that says you can't do that.

If you try to use a count greater than 15 in a shift instruction, the assembler will do a "mod 16" on the value, so you will always have a shift between 0 and 15. However, you have to be careful with that, since, if you try to code something like this:

SRC R1,16

That 16 will really become 0 (16 mod 16 = 0), and the instruction will use bits 12 through 15 of R0 for the count, which is probably *not* what you intended. IMO, this kind of situation should cause the assembler to generate an error, but I'm pretty sure it does not.
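The wrap-around is easy to see by packing the instruction word by hand. The sketch below follows the bit layout given earlier in this post (8-bit opcode, 4-bit count, 4-bit register) and mimics the "mod 16" behaviour described above. The SRC opcode byte >0B is taken from the data manual's opcode table, and `encode_shift` is a hypothetical helper, not part of any assembler.

```python
SRC_OPCODE = 0x0B  # upper 8 bits of SRC (assumed from the data manual opcode table)


def encode_shift(opcode, reg, count):
    """Pack a shift instruction: 8-bit opcode, 4-bit count, 4-bit register.

    The count is reduced mod 16, as the post says the assembler does.
    """
    if not 0 <= reg <= 15:
        raise ValueError("register must be 0..15")
    return (opcode << 8) | ((count % 16) << 4) | reg


print(f"SRC R1,12 -> >{encode_shift(SRC_OPCODE, 1, 12):04X}")
print(f"SRC R1,16 -> >{encode_shift(SRC_OPCODE, 1, 16):04X}")
```

The second word has a zero count field, so the CPU would read the count from R0, exactly the trap described above.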

Matthew

Edited October 13, 2010 by matthew180

+retroclouds · October 13, 2010

Thanks for the answers people! This community is just plain awesome :cool:

Tursi · October 19, 2010

I just wanted to note that the information attributed to me above isn't cut-and-dried. I don't know what the speech synth does on the bus or why the E/A manual says you can't touch the 8-bit bus during certain operations. In the attributed conversation I was just confirming that the cartridge port IS the 8-bit bus, since it sits on the back-end of the multiplexer.

It's nice to have someone else working on the timing counts, hehe.
