Fast String Cutting

+TheBF · November 10, 2017

CUTTING STRINGS FASTER

One the things that BASIC offers the programmer is a handy string data type. Along with the string data type itself there are a set of functions to cut the strings into various pieces and find strings within other strings.

Many implementations of strings rely on copying the sub-string to a string stack or to another string. Copying bytes back and forth in memory eats up a lot of time. This is even more serious in TI-99 BASIC where the strings are held in VDP memory. What if, when you cut a string, you did not copy it? How would that work you ask? One way that has entered the Forth world is to simply record the memory location of the new sub-string and its length in a temporary location. For example take the string “All my ex’s live in Texas”.

Let us say that this string is currently at address >3000 in low memory and it is 25 bytes long plus 1 byte for the count. In ASM it would look like this:

ADOLL BYTE 25

TEXT ‘All my ex’s live in Texas’

In Forth it would be this:

: A$ ( -- addr len) S” All my ex’s live in Texas” ;

The word ADOLL and A$ in both cases are the labels (name) of the strings. So the way we could represent ADOLL is to just use the address and the length. In ASM this would be in registers typically and in Forth they would be on the DATA stack. For this discussion the numbers would be >3000,25. With this representation, If we wanted to do a LEFT$ operation STRING all we have to do is change the number 25 to the new length. That’s it. No copying and we have created a new string!

In Forth we can make LEFT$ like this:

: LEFT$ ( $addr len newlen – $addr len) NIP ;

STRING 12 LEFT$

The Forth word NIP removes the second item on the DATA stack. One command is all we needed to create the function LEFT$. The output of LEFT$ falls onto the DATA stack where you can print it or save it or whatever you want to do.

And RIGHT$ is not much harder:

: RIGHT$ ( $addr len newlen – $addr len) OVER SWAP - 0 MAX /STRING ;

The word /STRING (pronounced “cut string) above is a surprisingly powerful yet very simple ANS Forth word that takes an (address, len) pair and a number.

It subtracts the number from the length and adds the number to the address. This gives us a new string address starting point and a new length.

We add the 0 MAX to prevent a negative number as a size.

/STRING DEMONSTATION CODE

A$ TYPE All my ex’s live in Texas

A$ 4 /STRING TYPE my ex’s live in Texas

In conventional notation /STRING does this:

DEF CUTSTRING (ADDR,LEN,N)

ADDR=ADDR+N

LEN=LEN-N

What we did above was equivalent in conventional notation to:

DEF RIGHT$(ADDR,LEN, NEWLEN)

NEWLEN=LEN-NEWLEN

CUTSTRING (ADDR,LEN,NEWLEN)

(These Forth functions return 2 outputs so they don’t fit well with functional notation.

The examples are here to help translate the Forth reverse polish notation and stack movement which can make you feel dizzy when you are not used to it)

You can easily see how this is much, much faster than copying the bytes to a new string or buffer. There are no loops and no copying. In ASM you could transfer these new values to new registers or a better way would be to implement a stack using a register as a pointer and pushing the new values onto the stack for later use. Forth uses its own DATA stack for this purpose and as such the output of each string manipulation can be the input for the next string function.

Consider this expression where the output of LEFT$ is on the stack waiting for RIGHT$. Then TYPE takes the address and length from RIGHT$ and prints it on the screen:

A$ 20 LEFT$ 5 RIGHT$ TYPE

Using this technique we can re-create TI BASIC’s SEG$ with only four words!

: SEG$ ( addr len start size -- addr len)

>R \ push the size onto the rstack temporarily

/STRING \ cut the string using the start parameter

DROP \ NOT going to use the length as is, so drop it

R> \ move the SIZE parameter back onto the data stack

;

So if you do strings this way you do all the processing with the address and len and only copy back to memory when you are finished the string manipulation.

This makes a very fast system.

In the next installment we will use this method to do something you might see in LISP or Python and it runs pretty fast even on our TI-99.

Edited November 10, 2017 by TheBF

+TheBF · November 10, 2017

Ok, so I see some cool stuff in languages like Python and LISP where you can chop up a string in many sub-strings all at once.

I wondered can I used this fast string cutting method to do the job.

The method I used is borrowed (stolen) from an old Forth word called -TRAILING. This word takes an [address,length] pair as input

and is scans backwards on the string looking for blank characters (ie: spaces ASCII 32) . If it finds a space, it shortens the string length and keeps going until it finds

a non-blank character. Simple.

So I created a routine that does that and I created the opposite. Meaning scan backwards on a string, ignoring non-spaces until you find a space. I call that one -ALPHA

With those 2 routines we combine them to create /WORD ("cut word")

This routine cuts a word off the end of a string delimited by blanks but it also re-sizes the remainder string.

It uses some ugly stack gymnastics to calculate the new string lengths for both strings so unless you are really bored just assume it works.

Armed with /WORD we can put this in a loop call SPLITSTR. This word cuts off a string starting from the end and leaves each sub-string on the stack as an [address,len] pair

It also counts how many times it does this until the last word is found and puts that count on the top of the stack.

So we are left with a variable number of strings on the stack with a count on top.

The code is in the spoiler and the GIF lets you see how quickly we can get the results and even print them reversed.

 \ Split a string into words
DECIMAL

: LASTCHAR   ( addr len -- c) 2DUP + 1- C@ ;

: -TRAILING  ( adr len -- adr len')     \ remove trailing blanks (spaces)
             BEGIN  LASTCHAR BL =       \ test char at end of string (adr+len)
             WHILE  1-                  \ while it is a blank, decrement length
             REPEAT ;

: -ALPHA  ( adr len -- n)   \ scan string backwards to find next blank
            BEGIN  LASTCHAR BL <>
            WHILE  1-
            REPEAT
;

\ "cut word"  cuts off last word from a string.
: /WORD  ( adr len -- adr len-len' adr' len')
             2DUP  -ALPHA    \ scan back past the alphanumeric chars
             DUP 0>          \ if the string has length
             IF
                DUP >R +        \ calc start of cut word
                SWAP R@ -       \ calc length of cut word
                R> 1-  -ROT     \ put new length on remaining string
             ELSE
                  DROP 0        \ probably negative count. replace with 0
            THEN
;

\ split a string into words.
\ outputs multiple stack strings with count on top
: SPLITSTR  ( adr len -- adr len ... adr len cnt)
            -TRAILING          \ clean off trailing spaces
            0 >R               \ put a counter on rstack
            BEGIN
               /WORD  DUP 0>
            WHILE
               2SWAP -TRAILING   \ cut word put remaining string on top
               R> 1+ >R          \ incr the counter
            REPEAT
            2DROP
            R> 1+
;
\ edit: Had to add -TRAILING after 2SWAP to handle multiple spaces between words

CHAR " CONSTANT '"'

: .ITEM  (  adr len -- )  '"' EMIT  TYPE '"' EMIT  ;

: .ITEMS  ( [listofstrings] cnt --) \ destructive printlist
         CR 0 ?DO   .ITEM  SPACE LOOP ;

\  non-destructive print list backwards
: .RLIST  ( [listofstrings] cnt  -- [listofstrings] cnt )
        CR
        DUP 2* 1 SWAP
        ?DO
            I PICK  I PICK  .ITEM SPACE
        -2 +LOOP ;

: A$   S" All my ex's live in Texas" ;
: B$   S" ONE TWO THREE FOUR" ;
: C$   S" NOW IS THE TIME FOR ALL GOOD MEN TO COME TO THE AID OF THEIR COUNTRY      "  ;

Edited November 10, 2017 by TheBF