retroclouds

DV80 file loading/saving speed comparisons


I've been doing a couple of tests lately, loading and saving a 97KB DV80 file on a real TI-99/4A with PEB, using a couple of different storage devices.

 

To be able to handle such a big file I used my Stevie editor with a 1 MB SAMS.

Having that much RAM allows me to keep the file completely in memory.

 

The DV80 file I used is the TI Invaders source code as found here:
http://aa-ti994a.oratronik.de/TI_Invaders_TI-99_4A_Disk_Version.txt

 

I have tested with the following devices:

  • >1000. IDE Card
    • IDE DSR v14
    • Seagate ST32140A 2GB hard drive
    • First partition IDE1
       
  • >1100. Standard TI Disk Controller
    • Freshly formatted BASF floppy disk in drive 1
       
  • >1400. TIPI PEB
    • Raspberry Pi 3B with a 16 GB SD card
       
  • >1700. HRD4000B
    • ROS 8.42c
    • 4MB RAM disk

 

I clocked the total file operation duration using a timer on my mobile phone.

For each device I repeated each test 3 times, meaning 3x loading and 3x saving the file.
That way I can iron out some of the inaccuracy of my manual timing measurements.
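The averaging of the three runs can be sketched like this (the stopwatch readings below are hypothetical, not the measured values from the tests):

```python
def average_duration(timings):
    """Average repeated manual stopwatch readings (in seconds)
    to smooth out reaction-time error."""
    return sum(timings) / len(timings)

# Hypothetical example: three manual readings of one load test
load_times = [12.4, 12.1, 12.6]
print(round(average_duration(load_times), 2))  # 12.37
```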

 

To keep compatibility as high as possible, I'm solely using level 3 file I/O.

However, for devices that support file buffers in RAM, I additionally repeated the test (3x loading, 3x saving) with the buffer in CPU RAM instead of VDP RAM.

 

This is what the test procedure looks like for the load test:

  1. Reset console
  2. Start Stevie editor
  3. Load DV80 from device and time duration until file is shown in editor.
  4. Reset console
  5. Back to 1 (next device)

 

This is what the test procedure looks like for the save test:

  1. Reset console
  2. Start Stevie editor
  3. Load DV80 from HRD4000B partition 1
  4. Save DV80 to device and time duration until file is saved.
  5. Reset console
  6. Back to 1 (next device)

 

For the sake of science I have spent quite some time on this, repeating the tests over and over 😄

 

Even though absolute times are not that important (they depend on my file-loading handling in the Stevie editor and memory handling specific to Stevie), I do think that comparing times between devices certainly is relevant. With that you get a rough idea of what speed the devices offer compared to each other.

 

Durations are in seconds. The "Save slowness" factor indicates how much time saving the DV80 takes compared to loading.
For example, for the TIPI device it means that with the file buffer in VDP RAM, saving the DV80 takes 1.13 times as long as loading it.
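The factor is simply the save duration divided by the load duration. A minimal sketch, with hypothetical durations chosen to reproduce the 1.13 example (not the measured values):

```python
def save_slowness(load_seconds, save_seconds):
    """Ratio of save time to load time for the same file and device."""
    return save_seconds / load_seconds

# Hypothetical durations: a device that loads in 10 s and saves in 11.3 s
print(round(save_slowness(10.0, 11.3), 2))  # 1.13
```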

 

[Image: results table of load/save durations per device, including the "Save slowness" factors.]

 

 


My conclusions so far:

  • Saving is always slower than loading. That one is obvious, as there is more work involved.
     
  • Either ROS is much more optimized for reading records, or the code path for saving records should be revisited.
    Its save-slowness factor of 2.46 is much bigger than that of any other device.
     
  • File buffer in RAM instead of VRAM (for IDE and HRD)
    • Loading files: the speed increase is a lot smaller than I originally anticipated.
    • Saving files: I have to revisit my code; it should be faster, not slower, than with the file buffer in VDP memory.
      EDIT: I did revisit my code and it does not go through the CPU RAM path. Once fixed, I'll repeat the tests for CPU RAM and post the new results.

Very interesting results, particularly with the ramdisk save times.  I suspect that has to do with ROS not caching the current sector in use, meaning that as each record is appended to the sector, ROS must load the FDR to find the right sector, read the sector, update it with the record (or create a new sector if previous was full), and write back the sector and update the FDR.   Other devices store the current sector in memory; the sector is flushed when no more records can be written to it or when the file is closed. Much of the read-before-write overhead doesn't exist.
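The difference between the two strategies can be illustrated with a deliberately simplified Python model (the record and sector counts are made up, and real ROS behavior may differ; this only counts sector-level operations):

```python
import math

def sector_ops_uncached(num_records):
    """Toy model of a DSR that caches nothing: each appended record costs
    an FDR read, a sector read, a sector write-back, and an FDR update."""
    return num_records * 4

def sector_ops_cached(num_records, records_per_sector):
    """Toy model of a DSR with a cached current sector: records accumulate
    in memory and a sector is written only when full or when the file
    is closed."""
    return math.ceil(num_records / records_per_sector)

# Hypothetical file: 1200 records, 3 records fitting per sector
print(sector_ops_uncached(1200))   # 4800 sector-level operations
print(sector_ops_cached(1200, 3))  # 400 sector-level operations
```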

 

If you are calling the standard DSRLNK for each record, the device CRU and number of level 3 names will impact performance to a degree, as each DSRLNK call will search the DSRs in CRU order.  This is where some optimization can make a difference, such as caching the entry address and CRU address and calling directly after the first IO operation finds the intended device.  
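A toy model of that caching optimization (the device names and call counts are made up, and this is Python standing in for assembly, not actual DSRLNK code):

```python
class DsrLink:
    """Toy model of DSRLNK dispatch: a cold call scans devices in CRU
    order; once the device is found, its CRU/entry address is cached and
    later calls go straight to it without scanning."""
    def __init__(self, devices_in_cru_order):
        self.devices = devices_in_cru_order
        self.cache = {}        # device name -> index (stands in for CRU/entry)
        self.scan_steps = 0    # DSR headers examined across all calls

    def call(self, name):
        if name in self.cache:
            return self.cache[name]      # direct call, no scan
        for i, dev in enumerate(self.devices):
            self.scan_steps += 1
            if dev == name:
                self.cache[name] = i
                return i
        raise IOError("device not found")

link = DsrLink(["DSK1", "IDE1", "HRD1"])
for _ in range(1000):          # 1000 record-level I/O calls to the ramdisk
    link.call("HRD1")
print(link.scan_steps)  # 3 -- only the first call pays the scan cost
```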

6 hours ago, InsaneMultitasker said:


Your idea about ROS not caching the current sector in use makes a lot of sense. 
I forgot to mention that I'm using a custom DSRLNK that does reuse the CRU and entry address. Adding that functionality to my custom DSRLNK brought a speed improvement of more than 10 seconds when loading a DV80 file on the HRD.
You clearly notice that difference more than the one from having the file buffer in RAM instead of VRAM.

 

 

5 hours ago, apersson850 said:

ROS doesn't work with caching CRU and call addresses. The p-system uses that approach, but the later ROS don't support that.

I cannot confirm that. I'm using a custom dsrlnk that does cache the CRU and entry address and it works fine with ROS 8.42c

1 hour ago, retroclouds said:

 

10 seconds is quite an improvement; it makes sense as the DSRLNK scan cycles really add up when you are reading so many records. Would you be open to sharing the modified DSRLNK at some point? I'm curious to see how and where you are handling the caching. 

 

When you switch from VDP to CPU buffering, is your PAB directing the DSR to copy the data into your final Stevie buffer location, or are you copying the data from the CPU DSR buffer into the Stevie buffer? If you are able to do the former, that removes the intermediate copy (which you would have with the VDP buffering), and since those copies are also likely byte-oriented (i.e., copying one byte per instruction) it should shave more time from the file load. Caveat: the variable record length can throw a wrench into things, and you may need to have two contiguous SAMS buffers mapped for times when the data copy extends beyond the first page's boundary. The 2-byte buffer address in the PAB (in VDP RAM) would also need to be updated per record.

 

I've added a task to look at the ROS write routine to confirm my conjecture and to see if there is anything that can be done to speed up the record IO.  Although it isn't applicable to your use case, a simple test would be to read and write from/to a fixed record file, such as DF255, that had only one record per sector. 

22 minutes ago, InsaneMultitasker said:

 

Yes, I'll be releasing the Stevie editor source code as well as the updated spectra2 library (that one contains the custom dsrlnk) in the near future.

 

When using CPU buffering the PAB directs the DSR to copy the data into the final Stevie editor buffer location, skipping the intermediate copy.
There's code in place that ensures the correct SAMS page is mapped up front and that a single record can never span 2 SAMS pages.
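The kind of upfront check that keeps a record inside one SAMS page can be sketched as follows (the 4 KB page size and the helper names are assumptions for illustration, not Stevie's actual code):

```python
SAMS_PAGE_SIZE = 4096  # 4 KB SAMS pages (assumed for this sketch)

def fits_in_page(offset, record_len, page_size=SAMS_PAGE_SIZE):
    """True if a record starting at `offset` stays inside one page."""
    return (offset % page_size) + record_len <= page_size

def next_record_offset(offset, record_len, page_size=SAMS_PAGE_SIZE):
    """Place the next record; if it would cross a page boundary, start
    it at the beginning of the next page instead, so a single record
    never spans two SAMS pages."""
    if fits_in_page(offset, record_len, page_size):
        return offset
    return (offset // page_size + 1) * page_size

print(next_record_offset(4040, 80))  # 4096 -- would cross, so skip ahead
print(next_record_offset(100, 80))   # 100  -- fits, stays put
```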

3 hours ago, retroclouds said:


Interesting. Thinking about it a little, I realize that my statement is true for subprogram calls, like sector read/write. File handling is something else. But the p-system uses only the sector access procedure.


WOOT!!

 

TIPI clearly hits the design target of, "fast enough". 

 

I didn't imagine it was that close to the others.

  • Like 2

Share this post


Link to post
Share on other sites
5 hours ago, jedimatt42 said:

 

I'm quite sure the DV80 file load/save speed can be further improved on the TIPI side by changing the Python code.

But life/work has been a killer lately, so I can't confirm yet. I will do some tests and get back with the results.

 

I also think speed could be further improved by using PyPy instead of the CPython interpreter.

55 minutes ago, retroclouds said:

 


PyPy is not likely to help. When a file is opened on TIPI, the Python side reads the entire file and breaks the records up into an array; subsequent reads just fetch the cached value. The serial transmission protocol is more likely the limit. That code is in C, with specific signal delays implemented because the wiring can't reliably handle 20 MHz un-delayed GPIO transitions from C on a Pi 3.
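That open-time caching can be sketched like this (the newline-delimited file image and the function name are assumptions for illustration; the real TIPI service code may differ). The point is that each record-level READ is just a list fetch, so an interpreter-level speedup buys little:

```python
def split_records(file_bytes):
    """Toy model of the TIPI Python side: the whole file is read once at
    open time and split into records; each subsequent READ from the 4A
    side only fetches one entry from this cached list."""
    return file_bytes.split(b"\n")

# Assumed newline-delimited text image of a DV80 file (illustration only)
records = split_records(b"* TI INVADERS\n       DEF  START\n       END")
print(len(records))  # 3
```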

 

Also, TIPI wouldn't typically benefit from PyPy, which depends on JIT compilation for its performance boost. JIT isn't going to help something that is I/O-bound. And the TIPI design depends on the process being closed and restarted when the 4A power-up routine executes... So basically, between every program you run, any JIT work is thrown out and you are back to being slower than CPython. I guess JIT would pay off in extreme cases like loading a 1-megabyte DV80.

 

Somewhere, I outlined a plan to switch to a nibble transfer over the same wires, so 28 signal delays optimize down to 3 signal delays per byte. Oh, and for every application byte, the 9900 DSR and Python exchange 3 bytes. I've worked that out in more detail for fun. My ego kind of wants to take a swing at beating the IDE controller, which was noticeably twice as fast as TIPI in the brief time I spent with one.
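One reading of those numbers, as back-of-the-envelope arithmetic (my interpretation of the post, not a protocol spec):

```python
# Assumed figures from the post: ~28 signal delays per wire byte today,
# ~3 per wire byte with the nibble scheme, and 3 wire bytes exchanged
# per application byte.
delays_now = 28 * 3      # signal delays per application byte today
delays_nibble = 3 * 3    # signal delays per application byte, nibble scheme
print(delays_now, delays_nibble)  # 84 9
```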

