
An interesting Speed Comparison


Larry


Every month or so, I run a BASIC program that I wrote to process a large text file of movie titles, dates, and times. I'm only interested in the titles so I can see if there are movies I wish to watch. The text file being processed is typically between 700 and 800 SD sectors in length. Three steps are necessary in this program: 1) Find the actual titles, 2) sort the titles, and 3) remove any duplicate entries. This program initially took a very long time to execute (so slow I frequently used the emulator!), but then I started working on improvements. In the current version, I use Basic XE since it supports true string arrays and has built-in ML sorting routines (SORTUP and SORTDOWN). But then I thought I might get better performance out of Turbo Basic XL, especially since I could compile the program. The execution time results have turned out to be quite interesting:

 

BXE, BASIC insertion sort              35.9 minutes
BXE, ML sort (SORTUP)                  16.1 minutes
TBXL, BASIC insertion sort             63.4 minutes
TBXL compiled, BASIC insertion sort    31.8 minutes
TBXL, ML sort                          27.3 minutes
TBXL compiled, ML sort                 10.8 minutes

 

The 63+ minutes for TBXL is really a "grabber," since the apples-to-apples comparison with BXE using the same BASIC sort was 35.9 minutes. My interpretation is that the huge amount of string manipulation is being done in (Turbo) BASIC with subscripting (pseudo-string arrays), while BXE supports true string arrays, so no subscripting is required. But clearly, compiling the code helps a lot. The 10.8 minutes is the fastest I've been able to get thus far, although I'm sure Action! would provide another dramatic improvement.
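 

For anyone unfamiliar with the technique: a pseudo-string array packs every entry into one big string and carves it up with subscripts on every access. A minimal sketch of the idea in Atari/Turbo BASIC (names and sizes are illustrative only, not my actual program):

 

10 REM N FIXED-LENGTH ENTRIES PACKED INTO ONE BIG STRING
20 N=423:L=35
30 DIM T$(N*L),E$(L)
40 REM STORE ENTRY I (E$ PADDED TO LENGTH L)
50 T$((I-1)*L+1,I*L)=E$
60 REM FETCH ENTRY I BACK OUT
70 E$=T$((I-1)*L+1,I*L)

 

Every store and fetch pays for that subscript arithmetic; a true string array, like BXE's, makes the same access as roughly T$(I).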

 

Over the years, it has been written fairly often that "you don't really need string arrays" (mainly in comparison to Microsoft-type BASICs), but the caveat is that if you use them a lot, pseudo-string arrays will really bog things down. A more general conclusion may be that "the more features in the language, the better." -Larry



Sure, Action! would provide faster execution; in some ways it's nearly as fast as assembly language. And if I had to choose among BASICs for the Atari, it would be Turbo BASIC XL.


It would be interesting to see speed comparisons with other languages.

Could you post the source of your routine so that we may translate it?

 

Oh yes, the source and sample data to process. I would do a Forth version and compare. Also, as the files are quite big, it might be interesting to see speed differences in I/O, for example DOS 2.x file systems vs. SpartaDOS file systems.

 

Carsten



OK, for those interested, please send me a PM with an email address, and I'll send the code and data file info. The I/O speed might be interesting; currently I read/write the files from a MyDOS ramdisk. -Larry


(2 weeks later...)

I've got the BASIC sources from Larry. I already took a look at them last weekend and identified two bottlenecks:

 

* reading the file bytewise (using GET) instead of block-loading it (BGET in TurboBasic); using BGET/BPUT might considerably speed up loading and saving (see the sketch after this list)

* converting strings from MS-DOS CR/LF format to the Atari string format and back again takes too much time
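 

To illustrate the first point, a rough Turbo BASIC XL sketch (the file name and buffer size are placeholders, and the final, partial block of a file would need a TRAP or a length check):

 

10 REM BYTEWISE: ONE CIO CALL PER BYTE
20 OPEN #1,4,0,"D:PSEUDO.TXT"
30 TRAP 60:REM JUMP OUT AT END OF FILE
40 GET #1,B
50 GOTO 40
60 CLOSE #1
70 REM BLOCKWISE: ONE CIO CALL PER 2 KB
80 DIM BUF$(2048)
90 OPEN #1,4,0,"D:PSEUDO.TXT"
100 BGET #1,ADR(BUF$),2048
110 CLOSE #1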

 

I'll work on an optimized TurboBasic version and also on a VolksForth version.

 

Carsten



Hi Carsten-

 

Even though the records in the test file are uniform in length, you can't rely on that with real-world data (variable-length records). The part of the data that is of interest may be as few as 6-8 characters or as many as approximately 40. The individual records could run from approximately 20 characters to 300 or more. So I don't see how you could use BGET. (?)

 

The original data is CR/LF. The resulting output also needs to be CR/LF. I don't remember converting it anywhere in the program. (?)

 

Thanks for your interest in the program. I'll be most interested to see what a FORTH version looks like and how it performs.

 

-Larry


Even though the records in the test file are uniform in length, you can't rely on that with real-world data (variable-length records). The part of the data that is of interest may be as few as 6-8 characters or as many as approximately 40. The individual records could run from approximately 20 characters to 300 or more. So I don't see how you could use BGET. (?)

 

I would use BGET to load a chunk of data (for example 2048 bytes) into memory, then scan the memory for CR/LF and copy each record into its final destination (including the CR/LF as part of the string). Then I would sort the data (removing duplicates in the same step, since that is a special case of sorting) and dump the whole memory area containing the sorted data, CR/LF included, back to disk.
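 

Roughly, the scan step might look like this in Turbo BASIC XL (an unoptimized sketch; a record spanning a chunk boundary would have to be carried over into the next chunk):

 

200 REM SPLIT BUF$ INTO RECORDS AT EACH LF
210 S=1
220 FOR P=1 TO 2048
230 IF BUF$(P,P)=CHR$(10) THEN R$=BUF$(S,P):S=P+1:REM R$ = ONE RECORD, CR/LF INCLUDED
240 NEXT P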

 

The original data is CR/LF. The resulting output also needs to be CR/LF. I don't remember converting it anywhere in the program. (?)

 

Yes, you are converting: you read until a CR/LF appears, then you create a string without the CR/LF in memory, you sort, and then you write the data back, re-adding the CR/LF.

 

Carsten



No, I do not include the CR/LF in the string, but the CR/LF is at the end of the whole record, usually many characters past what we are interested in. It seems to me that you either have to write the CR/LF at the end of the part we want to keep and store it (which takes more memory), or write the CR/LF when we write the record out. I guess I will understand better when I see your code. -Larry


Why use an insertion sort? Other sorting methods are much better. For an Apple //e project to manage overdue library books, I coded a radix sort. It still wasn't blindingly fast (in retrospect I can think of things that would have made it faster) but it was a huge improvement over the sort that preceded it. A radix sort may work very well in combination with an insertion sort, since it often takes a lot more work to get things totally sorted with a radix sort than to get them "close"; an insertion sort can take advantage of a list being nearly sorted (and having nothing too far out of position).



I'm not familiar with a radix sort. The later, more advanced versions of this program use a ML sort that takes about 1-3 seconds, so it's really not an issue in the current scope of the program. But thanks, I will check the web for references to a radix sort. If you know of a radix example, you might refer us to it. -Larry



Radix sort is probably excessively complicated for the necessary level of speed (I probably should have used Shell sort) but it's an interesting approach which predates the widespread use of electronic computers (and might predate any use of electronic computers).

 

Suppose one has a deck of punched cards and one wants to sort them by date; the date is stored in YYMMDD format in columns 21-26. To begin, one would set a dial on the machine to "26" (the number of the LAST column to sort by), insert the deck of cards in the hopper, and push the start button. This would cause the cards to be deposited (at about 600 cards/minute) into eleven hoppers. The first hopper would contain all cards where column 26 held a zero. The next hopper would hold all the cards where it held a one. The next hopper the twos, etc. The tenth hopper would hold the nines, and the last hopper would hold all the cards that didn't have a digit in column 26.

 

Once all the cards had run through the machine, the contents of the first ten hoppers would be placed back into the input hopper, the dial would be set to "25", and the machine would be started again. The procedure would then be repeated setting the dial to "24", "23", "22", and "21". After that, all the cards would be sorted.

 

The key to making the radix sort work is that all of the cards that index to a particular digit will remain in the same order at the end of a sorting step as they were at the beginning. So just before the cards are run the last time, the cards for different decades will be intermixed, but all the cards of any given decade will be in order.

 

Implementing radix sort in BASIC is rather complicated. Nonetheless, it has the advantage that the number of passes required is independent of the number of items to be sorted.
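 

To make the card-sorter analogy concrete, here is one way a least-significant-digit radix sort could look in QBASIC for six-digit YYMMDD keys (a sketch, not the Apple //e code mentioned above; dat&() is assumed to hold the keys in elements 1 to num%):

 

SUB radixsort (dat&(), num%)
  DIM out&(num%), cnt%(9)
  div& = 1
  FOR pass% = 1 TO 6                 ' one run through the "machine" per digit
    FOR d% = 0 TO 9: cnt%(d%) = 0: NEXT
    FOR i% = 1 TO num%               ' count the cards destined for each hopper
      d% = (dat&(i%) \ div&) MOD 10
      cnt%(d%) = cnt%(d%) + 1
    NEXT
    FOR d% = 1 TO 9                  ' turn counts into final positions
      cnt%(d%) = cnt%(d%) + cnt%(d% - 1)
    NEXT
    FOR i% = num% TO 1 STEP -1       ' walk backwards so equal keys keep their order
      d% = (dat&(i%) \ div&) MOD 10
      out&(cnt%(d%)) = dat&(i%)
      cnt%(d%) = cnt%(d%) - 1
    NEXT
    FOR i% = 1 TO num%: dat&(i%) = out&(i%): NEXT
    div& = div& * 10                 ' move to the next digit column
  NEXT
END SUB

 

Each pass% iteration is one run of the cards through the sorter; the cnt%() array plays the role of the hoppers.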

 

Shell sort is much quicker and easier to implement, though it has one disadvantage compared with radix sort or insertion sort: unlike those "stable" sorts, Shell sort and other non-stable algorithms give no guarantee that items with identical keys will come out in the same order they went in. A workaround is to number each item and append that index to its sort key; it requires some extra memory and code, but is often worthwhile.

 

Shell sort is very much like insertion sort, except that it makes several passes through the data. If there are 700 items to be sorted, Shell sort will start by doing an insertion sort in which each item is compared to the item 364 positions earlier in the file, then one comparing items 121 positions apart, then 40, then 13, then 4, then 1. (The number 364 comes from the gap sequence h = 3h + 1, i.e. 1, 4, 13, 40, 121, 364, ...; 364 is the largest term below 700.)

 

The code for Shell sort ends up being not much bigger than insertion sort, but it runs much faster. For really big data sets, it's not the best approach, but for medium-sized data sets it can be very useful.

 

Incidentally, another sorting technique, very useful when sorting huge files stored on magnetic tape or another sequential-access medium with four tape drives available (at any given time, one pair for reading and one pair for writing), is merge sort. Using four tape drives, one could sort a list of ten million items, reading and writing each record 24 times (each pass doubles the length of the sorted runs, so about log2(10,000,000) ≈ 24 passes are needed), with only enough RAM to hold a couple of records. Of course, nowadays one might just read all ten million items into RAM and sort them, but that wasn't always practical. Merge sort is generally more useful for files on a sequential storage medium than for files in RAM, but understanding the principles is useful regardless.
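 

The heart of each pass is a two-way merge of sorted runs. In QBASIC it might look like this (a sketch; a$() and b$() are assumed already sorted, with na% and nb% holding their lengths):

 

' merge sorted a$(1..na%) and b$(1..nb%) into c$()
i% = 1: j% = 1: k% = 1
WHILE i% <= na% AND j% <= nb%
  IF a$(i%) <= b$(j%) THEN
    c$(k%) = a$(i%): i% = i% + 1     ' take from the first "tape"
  ELSE
    c$(k%) = b$(j%): j% = j% + 1     ' take from the second "tape"
  END IF
  k% = k% + 1
WEND
' copy whatever is left of the other run
WHILE i% <= na%: c$(k%) = a$(i%): i% = i% + 1: k% = k% + 1: WEND
WHILE j% <= nb%: c$(k%) = b$(j%): j% = j% + 1: k% = k% + 1: WEND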


Remember that if you switch interpreters from Atari BASIC to Turbo BASIC, you also have to change your style of programming; only then will your program run much faster. Try to eliminate as many GOTO instructions as possible and replace them with TB structures (WHILE, DO/LOOP, etc.), and in the places where you still use GOTO, use labels instead of line numbers.
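 

For example (Turbo BASIC XL syntax; an untested sketch):

 

10 REM OLD STYLE: LOOP BUILT FROM GOTO
20 I=0
30 I=I+1:? I
40 IF I<10 THEN GOTO 30
50 REM TURBO STYLE: STRUCTURED LOOP
60 I=0
70 REPEAT
80 I=I+1:? I
90 UNTIL I=10
100 REM A NAMED LABEL INSTEAD OF A LINE NUMBER
110 GO# FINISH
120 ? "NEVER REACHED"
130 # FINISH
140 ? "DONE"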


Here's a QBASIC implementation of Shell sort [note, btw, Shell is the inventor's name--hence the capitalization]

 

SUB dosort (dat%(), num%)
  ' find the starting gap: the largest 3k+1 value (1, 4, 13, 40, 121, ...)
  part% = 1
  WHILE part% < num% AND part% < 10000
    part% = part% * 3 + 1
  WEND
  ' one gapped insertion-sort pass per gap, shrinking the gap each time
  WHILE part% > 0
    FOR i% = part% TO num%
      temp% = dat%(i%)
      j% = i% - part%
      DO
        IF j% < 0 THEN EXIT DO
        IF dat%(j%) <= temp% THEN EXIT DO
        dat%(j% + part%) = dat%(j%)
        j% = j% - part%
      LOOP
      dat%(j% + part%) = temp%
    NEXT
    part% = INT(part% / 3)
  WEND
END SUB

Hopefully it shouldn't be too hard to figure out. Some of the math involving +/- part% is probably not the greatest and could be done better, but I think this code should illustrate the general idea.


Hello Larry,

 

the test data "PSEUDO.TXT" you sent me is 87 KB in size. Have you ever been able to sort a data file of that size with Basic XE or TurboBasic?

 

In my Forth version of the program, I load the data file into extended RAM (a 256 KB Compy-Shop-compatible ramdisk); however, loading from a ramdisk will then not work, only loading from disk. When sorting data in the extended RAM, every time two strings are compared there is a good chance that they will sit in different RAM banks, so at least one string has to be moved out to main RAM for the comparison. That results in a lot of memory movement. When sorting, I use an index to the strings: I don't sort the actual strings in memory, only the index (using Shell sort).
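 

The index idea in BASIC terms (a sketch; t$() stands in for the records, idx%() for the index, and a bubble sort is used only for brevity where the Forth version uses Shell sort):

 

' initialize the index 1..n%
FOR i% = 1 TO n%: idx%(i%) = i%: NEXT
' sort the index by comparing the strings it points at;
' only the record numbers move, never the strings
FOR i% = 1 TO n% - 1
  FOR j% = 1 TO n% - i%
    IF t$(idx%(j%)) > t$(idx%(j% + 1)) THEN SWAP idx%(j%), idx%(j% + 1)
  NEXT
NEXT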

 

However, it will only be a fair speed comparison if the Basic XE and Turbo Basic programs can also sort that file completely. If the Basic programs work only on smaller data sets, they will have an advantage, since they can sort the data in main RAM.

 

So far, loading the 87 KB from disk and building the index takes 1 minute (non-optimized).

 

Addition:

 

OK, I've got it. Although the records are over 200 bytes long, you only read at most 35 characters per title. So with 423 title lines at 35 characters each, we have ~15 KB of data to sort.

 

Carsten


So far I've implemented:

 

Loading

Building the Index (while loading)

Sorting (Shell sort or Comp sort)

 

First results:

 

with ANTIC DMA on

Load: 103 Sec

Shell Sort: 46 Sec

Comp Sort: 60 Sec

 

with ANTIC DMA off

Load: 83 Sec

Shell Sort: 29 Sec

Comp Sort: 41 Sec

 

Eliminating duplicates will add approx. 10 seconds to the sort. Saving will take approx. 50 seconds.

 

I expect the non-optimized Forth program, with DMA on, to do the full task in approx. 210 seconds (3.5 minutes) total. Coding the index-building and string compare in assembler might bring it down to maybe 2 minutes.

 

@Larry: can you change the program that creates PSEUDO.TXT so it generates records of random length, with parentheses and with duplicates (the current test data has no duplicates)? You might load an original data set and just exchange the characters in the titles randomly.

 

Carsten


Here is the Forth code so far

 

\ Tuning Example

\needs DUMP   INCLUDE" D:DUMP.FS"

: TASK ;

: UNUSED SP@ HERE - ;

VARIABLE IO#
&500 CONSTANT MAXLINES  
&35 CONSTANT MAXLENGTH  
$400 CONSTANT TMPLEN

CREATE TMP TMPLEN ALLOT
CREATE IDX MAXLINES 4 * ALLOT
CREATE FILENAME ," D:PSEUDO.TXT"
VARIABLE IDP  \ Indexpointer
VARIABLE LINE

CR UNUSED U. .( Bytes free! )

: DOT. ASCII . EMIT ;

: BUILDINDEX
 TMP TMPLEN
 BEGIN
ASCII : SCAN 
2DUP IF 1+
  HERE IDP @ !
  MAXLENGTH HERE 
  MAXLENGTH ALLOT PLACE
  2 IDP +!
  1 LINE +! DOT.
  $0A SCAN 
ELSE DROP
THEN 
 DUP 0= UNTIL
 2DROP ;

: LOADFILE
 LINE OFF
 FILENAME COUNT R/O OPEN-FILE
 ABORT" File Error!"
 IO# !
 IDX IDP ! \ init index pointer
 BEGIN
TMP $400 IO# @ READ-FILE NIP
BUILDINDEX
IF IO# @ CLOSE-FILE DROP EXIT THEN
 REPEAT ;

DEFER SORTSWAP
DEFER SORTCOMPARE

: SORTHELPER ( inc i -- xx xx )
 BEGIN
2DUP + OVER SORTSWAP OVER -
 DUP 0< 0= WHILE
  2DUP + OVER SORTCOMPARE 
  0= IF EXIT THEN
 REPEAT ;

: SHELLSORT ( u -- )
 DUP
 BEGIN 
2/
 DUP WHILE
1- 1 OR 2DUP -
?DUP IF 
0 DO
	DUP I + I SORTCOMPARE 
	IF DUP I SORTHELPER 2DROP THEN
  LOOP
THEN
 REPEAT 2DROP ;

VARIABLE elements
VARIABLE GAP

: /1.3 ( n -- n ) 10 13 */ 1 MAX ;

: COMPSORT ( u -- )
 DUP elements ! GAP !
 BEGIN 1
GAP @ /1.3 GAP !
elements @ GAP @ - 0 DO
  I GAP @ + I SORTCOMPARE
  IF I GAP @ + I SORTSWAP 
   DROP 0 THEN
LOOP
 GAP @ 1 = AND UNTIL ;

: I2IDX ( u -- idx )
 2* IDX + ;

: MYSWAP ( u1 u2 -- )
 I2IDX DUP >R @ SWAP
 I2IDX DUP >R @ SWAP
 R> ! R> ! ;

' MYSWAP IS SORTSWAP

: COMPARE ( a1 a2 n -- f )
 BOUNDS 
 DO 
  COUNT I C@ - ?DUP
  IF SWAP 0= LEAVE THEN
 LOOP IF 0 THEN ;

: MYCOMPARE ( u1 u2 -- f )
 I2IDX @ DUP C@ ROT 
 I2IDX @ DUP C@ ROT MIN 
 COMPARE 0> ;

' MYCOMPARE IS SORTCOMPARE

: PRINT ( u -- )
 I2IDX @ COUNT TYPE CR ;

: LIST CR
 IDP @ IDX 
 DO
I @ COUNT TYPE CR
 2 +LOOP ;

: RESETTIMER
 0 &18 ! 0 &20 C! ;

: ELAPSED
 &20 C@ &19 C@ $100 * + &18 C@ 
 &50 UM/MOD NIP 
 U. ." Seconds" CR ;

: DMAON
 &34 &559 C! ;

: DMAOFF
 &559 OFF ;

VARIABLE PGMTOP

: RESET
 PGMTOP @ DP ! ;

: LOAD RESET
 RESETTIMER LOADFILE ELAPSED ;

: SHELL
 RESETTIMER 
 LINE @ SHELLSORT
 ELAPSED ;

: COMP
 RESETTIMER
 LINE @ COMPSORT
 ELAPSED ;


HERE PGMTOP !

CR UNUSED U. .( Bytes free! )



Hi Carsten,

 

I once wrote a program that used extended RAM and sorted that data. You can eliminate all of those cross-bank moves by sorting each bank separately; then, when you write the data back to disk, just grab the next piece from each RAM bank in order. That's what I did, anyway, and it worked fine.
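 

In QBASIC terms the write-out becomes a k-way merge (a sketch; bnk$(), pos%(), and cnt%() are made-up names, with bnk$(b%, i%) holding item i% of sorted bank b%, and the output file assumed open as #3):

 

DO
  best% = 0
  FOR b% = 1 TO 4                    ' find the smallest head item among the banks
    IF pos%(b%) <= cnt%(b%) THEN
      IF best% = 0 THEN best% = b%
      IF bnk$(b%, pos%(b%)) < bnk$(best%, pos%(best%)) THEN best% = b%
    END IF
  NEXT
  IF best% = 0 THEN EXIT DO          ' every bank exhausted
  PRINT #3, bnk$(best%, pos%(best%)) ' write the winner, advance that bank
  pos%(best%) = pos%(best%) + 1
LOOP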



Hi Carsten-

There should be 5 duplicate records in the data set you have. I copied 5 records from the original batch and added them to the end, IIRC.

I'll add a bit of code to print out the duplicates to confirm that.

-Larry

