Jump to content

Photo

Fast String Cutting


1 reply to this topic

#1 TheBF OFFLINE  

TheBF

    Moonsweeper

  • 371 posts
  • Location:The Great White North

Posted Thu Nov 9, 2017 10:01 PM

CUTTING STRINGS FASTER

 

One the things that BASIC offers the programmer is a handy string data type. Along with the string data type itself there are a set of functions to cut the strings into various pieces and find strings within other strings.

 

Many implementations of strings rely on copying the sub-string to a string stack or to another string. Copying bytes back and forth in memory eats up a lot of time. This is even more serious in TI-99 BASIC where the strings are held in VDP memory.  What if, when you cut a string, you did not copy it?  How would that work you ask?  One way that has entered the Forth world is to simply record the memory location of the new sub-string and its length in a temporary location. For example take the string “All my ex’s live in Texas”.  

 

Let us say that this string is currently at address >3000 in low memory and it is 25 bytes long plus 1 byte for the count. In ASM it would look like this:

 

ADOLL   BYTE  25

        TEXT ‘All my ex’s live in Texas’

 

In Forth it would be this: 

 

: A$   ( -- addr len)  S” All my ex’s live in Texas”  ;

 

The word ADOLL and A$ in both cases are the labels (name) of the strings.  So the way we could represent ADOLL is to just use the address and the length.  In ASM this would be in registers typically and in Forth they would be on the DATA stack. For this discussion the numbers would be >3000,25.  With this representation, If we wanted to do a LEFT$ operation STRING all we have to do is change the number 25 to the new length.  That’s it. No copying and we have created a new string!

In Forth we can make LEFT$ like this:

 

: LEFT$    ( $addr len newlen – $addr len)  NIP  ;

STRING 12 LEFT$

 

The Forth word NIP removes the second item on the DATA stack.  One command is all we needed to create the function LEFT$. The output of LEFT$ falls onto the DATA stack where you can print it or save it or whatever you want to do.

 

And RIGHT$ is not much harder:

 

: RIGHT$   ( $addr len newlen – $addr len)  OVER SWAP - 0 MAX /STRING ;

 

The word /STRING (pronounced “cut string) above is a surprisingly powerful yet very simple ANS Forth word that takes an (address, len) pair and a number. 

It subtracts the number from the length and adds the number to the address. This gives us a new string address starting point and a new length. 

We add the 0 MAX to prevent a negative number as a size.

 

/STRING DEMONSTATION CODE

A$ TYPE All my ex’s live in Texas

A$ 4 /STRING TYPE my ex’s live in Texas

 

In conventional notation /STRING does this:

DEF CUTSTRING (ADDR,LEN,N)

    ADDR=ADDR+N

    LEN=LEN-N

 

What we did above was equivalent in conventional notation to:

DEF RIGHT$(ADDR,LEN, NEWLEN)

    NEWLEN=LEN-NEWLEN

    CUTSTRING (ADDR,LEN,NEWLEN)

 

(These Forth functions return 2 outputs so they don’t fit well with functional notation.

 The examples are here to help translate the Forth reverse polish notation and stack movement which can make you feel dizzy when you are not used to it)

 

You can easily see how this is much, much faster than copying the bytes to a new string or buffer. There are no loops and no copying. In ASM you could transfer these new values to new registers or a better way would be to implement a stack using a register as a pointer and pushing the new values onto the stack for later use. Forth uses its own DATA stack for this purpose and as such the output of each string manipulation can be the input for the next string function. 

 

Consider this expression where the output of LEFT$ is on the stack waiting for RIGHT$. Then TYPE takes the address and length from RIGHT$ and prints it on the screen:

 

A$  20 LEFT$  5 RIGHT$  TYPE

 

Using this technique we can re-create TI BASIC’s SEG$ with only four words!

:  SEG$   ( addr len start size -- addr len)   

          >R                \ push the size onto the rstack temporarily

          /STRING           \ cut the string using the start parameter

          DROP              \ NOT going to use the length as is, so drop it

          R>                \ move the SIZE parameter back onto the data stack  

;

 

So if you do strings this way you do all the processing with the address and len and only copy back to memory when you are finished the string manipulation. 

This makes  a very fast system.

 

In the next installment we will use this method to do something you might see in LISP or Python and it runs pretty fast even on our TI-99.

Attached Files


Edited by TheBF, Thu Nov 9, 2017 10:55 PM.


#2 TheBF OFFLINE  

TheBF

    Moonsweeper

  • Topic Starter
  • 371 posts
  • Location:The Great White North

Posted Thu Nov 9, 2017 10:42 PM

Ok, so I see some cool stuff in languages like Python and LISP where you can chop up a string in many sub-strings all at once.

I wondered can I used this fast string cutting method to do the job.

 

The method I used is borrowed (stolen) from an old Forth word called -TRAILING.   This word takes an [address,length]  pair as input

and is scans backwards on the string looking for blank characters (ie: spaces ASCII 32) . If it finds a space, it shortens the string length and keeps going until it finds

a non-blank character.   Simple.

 

So I created a routine that does that and I created the opposite. Meaning scan backwards on a string, ignoring non-spaces until you find a space. I call that one -ALPHA

 

With those 2 routines we combine them to create /WORD ("cut word")

This routine cuts a word off the end of a string delimited by blanks but it also re-sizes the remainder string.

It uses some ugly stack gymnastics to calculate the new string lengths for both strings so unless you are really bored just assume it works.

 

Armed with /WORD we can put this in a loop call SPLITSTR.  This word cuts off a string starting from the end and leaves each sub-string on the stack as an [address,len] pair

It also counts how many times it does this until the last word is found and puts that count on the top of the stack.

 

So we are left with a variable number of strings on the stack with a count on top.

 

The code is in the spoiler and the GIF lets you see how quickly we can get the results and even print them reversed.

Spoiler

Attached Files


Edited by TheBF, Fri Nov 10, 2017 9:44 AM.





0 user(s) are browsing this forum

0 members, 0 guests, 0 anonymous users