JamesD Posted March 9, 2015

I was working on another CoCo 3 version. Microsoft BASIC only recognizes the first 2 letters of a variable name.

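To illustrate the two-letter limit JamesD mentions, here is a minimal Python sketch that checks a list of variable names for collisions. The helper names (`ms_basic_key`, `find_collisions`) are made up for this example, and the handling of the `$` and `%` type suffixes is an assumption about how Microsoft BASIC tells variables apart, not something stated in the thread.

```python
def ms_basic_key(name: str) -> str:
    """Return the part of a variable name assumed to be significant.

    Assumption: only the first two characters count, plus a trailing
    type suffix ($ for strings, % for integers) if one is present.
    """
    suffix = name[-1] if name[-1] in "$%" else ""
    stem = name[: len(name) - len(suffix)]
    return stem[:2].upper() + suffix


def find_collisions(names):
    """Return pairs of distinct names that would share one variable."""
    seen = {}
    collisions = []
    for name in names:
        key = ms_basic_key(name)
        if key in seen and seen[key] != name:
            collisions.append((seen[key], name))
        else:
            seen.setdefault(key, name)
    return collisions


# The listing in this thread only uses two-letter names, so nothing clashes.
print(find_collisions(["XP", "XR", "XF", "ZI", "ZT", "ZS", "XL", "XI", "XT"]))
# Longer, more readable names are where the limit bites.
print(find_collisions(["XLIMIT", "XLOOP"]))
```

A check like this is only useful when porting from a BASIC with long variable names; the program discussed here already sticks to two letters.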
Tursi Posted March 11, 2015

What "fair"? Isn't this just "let's see it go"? Besides, I wanted to challenge the Atari guys... I'm not challenging your porting prowess, JamesD. (And the TI code doesn't fit in scratchpad either, but the Atari doesn't have 4 wait states per memory access, so it'd still be fair even if it did.)

+Stephen Posted March 11, 2015

I'd like to see an assembly version for the A8, to see the max speed. Before anyone says it: no, I will not be writing it.

Rybags Posted March 11, 2015

I doubt an optimized Asm version would be more than 3 times faster than the best optimized Basic version. The way I see it, the best optimization would be to work out what granularity and range are needed for each, then produce square root and trig lookup tables instead of using the functions every iteration.

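The lookup-table idea Rybags describes can be sketched in Python as follows. This is only an illustration: the table size (256 entries per full circle) and the amplitude (127) are assumptions picked for the example, not values taken from any program in this thread.

```python
import math

TABLE_SIZE = 256   # assumed granularity: entries per full circle
AMPLITUDE = 127    # assumed scale so every entry fits in a signed byte

# Built once, before the drawing loop starts.
SINE_TABLE = [
    round(AMPLITUDE * math.sin(2 * math.pi * i / TABLE_SIZE))
    for i in range(TABLE_SIZE)
]


def table_sin(angle_radians: float) -> float:
    """Approximate sin() with one table lookup instead of a series evaluation."""
    index = round(angle_radians * TABLE_SIZE / (2 * math.pi)) % TABLE_SIZE
    return SINE_TABLE[index] / AMPLITUDE


# Worst-case error over a sweep of angles, to judge whether 256 entries is enough.
worst = max(abs(table_sin(a / 100) - math.sin(a / 100)) for a in range(0, 2000))
print(f"worst error: {worst:.4f}")
```

The trade-off is memory against accuracy: a finer table costs more bytes, and on an 8-bit machine the final division would be folded into the scaling of the result rather than done per call.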
JamesD Posted March 11, 2015

> What "fair"? Isn't this just "let's see it go"? Besides, I wanted to challenge the Atari guys... I'm not challenging your porting prowess, JamesD. (And the TI code doesn't fit in scratchpad either, but the Atari doesn't have 4 wait states per memory access, so it'd still be fair even if it did.)

Porting prowess? It's BASIC. All I have to do is change 3 or 4 lines... except on the TI. What idiot decided this :: is better than this : ? Oh right... ANSI. Language by committee.

If more emulators allowed me to paste code you'd see several more versions. All 8-bit BASICs' built-in editors suck. There's nothing quite like accidentally hitting ESC on an SDL-based emulator right when you get the code typed in and working to piss a person off.

BTW, the Laser 2001's BASIC interpreter appears to be based largely on Applesoft BASIC. I now think the Laser 500's BASIC is based on MSX BASIC or Spectravideo BASIC, which is roughly the same thing.

JamesD Posted March 11, 2015

The Ohio Scientific machines should do well with this computationally, but I don't think they have bitmapped graphics.

ricortes Posted March 11, 2015

> I doubt an optimized Asm version would be more than 3 times faster than the best optimized Basic version. The way I see it, the best optimization would be to work out what granularity and range is needed for each, then produce square root and trig lookup tables instead of using the functions every iteration.

I did some quick and dirty profiling. For some reason the SQR function is painfully slow. I knocked out something in Action and it was ~60 times as fast.

Since everything sent to the screen is either a BYTE or an INTEGER (i.e. 192x320), proper scaling and not using floating point would help a lot. Of course, if you just use the built-in FP routines it won't make a difference, since you would just be quickly calling slow routines.

I haven't really looked hard enough at the program to tell what is going on. The hints given here certainly help. If everything was scaled beforehand to nothing bigger than an INT, it would help. Even if I had a complete understanding of the program with explanations of what is going on, it would still be hard to figure out the why and what for. Good example:

180 XL=INT(SQR(20736-ZS)+0.5)

Where did the 20736 come from?

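As a sketch of what ricortes means by staying in integers, the square root in line 180 can be computed without floating point using the classic bit-by-bit method. This is a generic textbook routine written in Python for illustration, not the Action code mentioned above. The function names are made up for the example, and it assumes the argument has already been scaled to a 16-bit non-negative integer (in the listing, ZS is not always a whole number).

```python
def isqrt_floor(n: int) -> int:
    """Floor of the square root of a 16-bit value, using only shifts, adds and compares."""
    if not 0 <= n < 65536:
        raise ValueError("expected a 16-bit non-negative integer")
    result = 0
    bit = 1 << 14                 # highest power of four that fits in 16 bits
    while bit > n:
        bit >>= 2
    while bit:
        if n >= result + bit:
            n -= result + bit
            result = (result >> 1) + bit
        else:
            result >>= 1
        bit >>= 2
    return result


def isqrt_round(n: int) -> int:
    """Integer equivalent of INT(SQR(n)+0.5) for integer n."""
    r = isqrt_floor(n)
    # sqrt(n) exceeds r + 0.5 exactly when n > r*r + r.
    return r + 1 if n - r * r > r else r


print(isqrt_round(20736))   # 144, the X limit at ZI=0
```

The loop runs at most eight times for a 16-bit input, which is why this kind of routine tends to beat a general floating-point SQR by a wide margin.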
Rybags Posted March 11, 2015

I think that crept into it with the TI translation. I suspect the whole thing originated elsewhere and was adapted with scaling to the Atari, so it suffers a speed penalty from the start.

What would be good is to get the algorithm/formula in its purest form, so that system adaptations could be done without unnecessary calculations thrown in just to scale the graphics.

dmsc Posted March 11, 2015 (edited)

> Besides, I wanted to challenge the Atari guys... I'm not challenging your porting prowess, JamesD

I accept the challenge :-)

Attached is a version that is not very optimized. It is stand-alone except for the call to CIOV to set the graphics mode. The specifics are:

- Arithmetic using 3.13-bit signed fixed-point,
- Sine function with a resolution of 10 bits on the angle units,
- Square root with 2.12 bits of accuracy,
- Doesn't plot the hidden lines,
- Only 899 bytes.

The runtimes are 43 sec on PAL, 46 sec on NTSC.

> I doubt an optimized Asm version would be more than 3 times faster than the best optimized Basic version.

Well, it is already 20 times faster, and this is without using tables for the multiplication and square root. I will try cleaning up the source code to post it later.

Attachment: fedora.xex

Edited March 11, 2015 by dmsc

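For readers unfamiliar with the "3.13" notation in dmsc's list, here is a minimal Python sketch of 16-bit fixed-point arithmetic with 13 fractional bits. It assumes "3.13" means three integer bits (sign included) and thirteen fractional bits, giving a range of -4 to just under 4. The helper names are hypothetical and this is not a transcription of the attached assembly.

```python
FRAC_BITS = 13
ONE = 1 << FRAC_BITS          # 8192 represents 1.0


def to_fixed(x: float) -> int:
    """Convert a float to the assumed 3.13 format, refusing values that overflow 16 bits."""
    value = round(x * ONE)
    if not -32768 <= value <= 32767:
        raise OverflowError(f"{x} does not fit in 3.13 fixed point")
    return value


def from_fixed(value: int) -> float:
    """Convert a 3.13 fixed-point value back to a float."""
    return value / ONE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two 3.13 values: a 16x16 -> 32-bit product, then drop 13 fraction bits."""
    return (a * b) >> FRAC_BITS


half_again = to_fixed(1.5)
minus_two = to_fixed(-2.0)
print(from_fixed(fixed_mul(half_again, minus_two)))   # -3.0
```

The resolution is 1/8192, which is far finer than one pixel after the *56 scaling, so the drawing should look the same as the floating-point version under these assumptions.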
Mclaneinc Posted March 11, 2015

On Altirra set to an XE it draws a straight line with gaps and seems to hang?

dmsc Posted March 11, 2015

Hi!

> On Altirra set to an XE it draws a straight line with gaps and seems to hang?

Strange, here it works.

Mclaneinc Posted March 11, 2015 (edited)

Very strange. I went back to 2.40 and it does indeed work, so I cleared my settings on 2.60 beta 41, and the best I could get was a half picture of the hat with garbage in the middle of the screen. Odd...

Aha... Found out why: it's the power-up RAM settings. Set to DMA 3 it works perfectly, but I've found stuff that won't work with DMA 3.

Edited March 11, 2015 by Mclaneinc

Bryan Posted March 11, 2015

I gotta say, this is a really fun thread. The kind of stuff computers used to be about.

dmsc Posted March 11, 2015

Hi!

> Very strange, went back to 2.40 and it does indeed work, so I cleared my settings on 2.60 beta 41 and the best I could get was a half picture of the hat with garbage in the middle of the screen. Odd.. Aha...Found out why, its the power up ram settings, set to DMA 3 it works perfectly but I've found stuff that won't work with DMA 3

I found the bug!! A "#" was missing from an "lda #0", so it only worked if memory location 0 had a 0 inside :-)

Attached is a new version, also 3 bytes shorter.

Attachment: fedora.xex

Mclaneinc Posted March 11, 2015

That makes sense then. Glad to have sort of helped.

fujidude Posted March 11, 2015

> Hi! I found the bug!! It was missing a "#" in a "lda #0", so it worked if the memory location 0 had a 0 inside :-) Attached is a new version, also 3 bytes shorter.

Hello. Could you post the source code please? Your program is fast as hell, but the real enjoyment here is the code.

dmsc Posted March 11, 2015 (edited)

Hi!

> Hello. Could you post the source code please? Your program is fast as hell, but the real enjoyment here is the code.

OK, I simplified the code somewhat and removed an extra multiplication. Now the code is 829 bytes and the runtimes are 30 sec on PAL, 38 sec on NTSC.

I know how to make the code faster (and perhaps smaller), but it will need a rewrite of the inner loop.

Attached is the source, compile with CA65:

- fedora.s : The main loop.
- math.s : The math functions: sine, sqrt, square (x^2), imul (signed fixed-point multiply).
- plot.s : The plotting routines: gr8, plot.
- macros.inc : Some useful macros, to simplify the code.
- atari-header.s : The Atari XEX header.
- atari-asm.cfg : The linker configuration file.
- Makefile : Makefile to compile all the above.
- genTable.awk : An AWK program that searches for the best sine approximations.

Attachments: fedora-asm.zip, fedora.xex

Edited March 11, 2015 by dmsc

bfollett Posted March 11, 2015

Isn't your asm code slower than the Turbo BASIC example posted earlier?

Bob

JamesD Posted March 11, 2015

> Isn't your asm code slower than the turbo basic example posted earlier? Bob

Seconds vs. minutes.

Tursi Posted March 12, 2015 (edited)

> Good example 180 XL=INT(SQR(20736-ZS)+0.5) Where did the 20736 come from?

20736 is 144^2 (i.e. the maximum size of ZS). That's in the original code from the Analog article, which is what we TI'ers got to start with. Going from memory, some of the parts I worked out:

100 REM ARCHIMEDES SPIRAL
110 REM
120 REM ANALOG MAGAZINE
130 REM
140 GRAPHICS 8+16:SETCOLOR 2,0,0
150 XP=144:XR=4.71238905:XF=XR/XP
160 FOR ZI=-64 TO 64
170 ZT=ZI*2.25:ZS=ZT*ZT
180 XL=INT(SQR(20736-ZS)+0.5)
190 FOR XI=0-XL TO XL
200 XT=SQR(XI*XI+ZS)*XF
210 YY=(SIN(XT)+SIN(XT*3)*0.4)*56
220 X1=XI+ZI+160:Y1=90-YY+ZI
230 TRAP 250:COLOR 1:PLOT X1,Y1
240 COLOR 0:PLOT X1,Y1+1:DRAWTO X1,191
250 NEXT XI:NEXT ZI
260 GOTO 260

150 - calculates XF (X-Factor?) - this is just a ratio that converts a distance in pixels to Radians: XR (4.71238905, which is 1.5*PI) spread over XP (144, the largest distance in pixels). I don't know why it's calculated that way, but that's the point, to get Radians for the SIN function. XP and XR are never used again.
160 - presumably a Z coordinate loop - didn't look too deep
170 - scales ZI, then gets ZS (which is Z-squared, obviously)
180 - X Limit is calculated from ZS - which is inverted by subtracting from 20736 - its maximum value. ((64*2.25)^2)=20736. The +0.5 allows for integer rounding up.
190 - X loop, pretty well understood already
200 - XT is a temp value converted down to radians by the XF. I don't know the original function so I didn't dig deep into what part of it this is.
210 - this calculates the actual height of the curve for the current position. More scaling with the *56.
220 - fake projection (adding of ZI), origin centering (the 160 and 90), and Y-axis inversion to get pixel addresses
230 - plots the pixel
240 - erases from one below the pixel to the bottom of the screen. This provides the hidden surface removal fakery, but was one of the first things we removed in optimizations (by reversing the loop), since it's slow on the TI to draw.
250 - end the loops
260 - sit forever.

Also, if you check the TI thread, Sometimes99er worked out the ranges of all the variables, and also did some comparison animations showing the effects of integer versus floating point on some of the variables for output. We start about halfway down this page: http://atariage.com/forums/topic/215138-bitmap-mode/page-4

Edited March 12, 2015 by Tursi

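To make the line-by-line walk-through above easier to follow, here is a rough Python port of lines 150-250 that renders into a plain list-of-lists bitmap instead of the Atari screen. It is a sketch, not a reference implementation: the function names are invented, rounding Y1 to the nearest row is an assumption about how PLOT treats fractional coordinates, and the bounds check stands in for TRAP 250 (skip the point and carry on).

```python
import math

WIDTH, HEIGHT = 320, 192   # GRAPHICS 8 resolution


def archimedes_points():
    """Yield (X1, Y1) following lines 150-220 of the listing."""
    xf = 4.71238905 / 144                      # 150: XF=XR/XP
    for zi in range(-64, 65):                  # 160
        zt = zi * 2.25                         # 170
        zs = zt * zt
        xl = int(math.sqrt(20736 - zs) + 0.5)  # 180
        for xi in range(-xl, xl + 1):          # 190
            xt = math.sqrt(xi * xi + zs) * xf  # 200
            yy = (math.sin(xt) + math.sin(xt * 3) * 0.4) * 56   # 210
            yield xi + zi + 160, 90 - yy + zi  # 220


def render():
    """Plot every point and erase the column below it, as lines 230-240 do."""
    screen = [[0] * WIDTH for _ in range(HEIGHT)]
    for x1, y1 in archimedes_points():
        row = math.floor(y1 + 0.5)             # assumed rounding of Y1
        if not (0 <= x1 < WIDTH and 0 <= row < HEIGHT):
            continue                           # stands in for TRAP 250
        screen[row][x1] = 1                    # 230: PLOT in colour 1
        for below in range(row + 1, HEIGHT):   # 240: erase down to row 191
            screen[below][x1] = 0
    return screen


if __name__ == "__main__":
    lit = sum(map(sum, render()))
    print(f"{lit} pixels lit")
```

Because each later ZI row is drawn lower on the screen and in front of the earlier ones, erasing everything below a new pixel is what hides the far side of the surface.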
Tursi Posted March 12, 2015

> Ok, I simplified somewhat the code, removed an extra multiplication, now the code is 829 bytes and the runtimes are 30sec in PAL, 38sec in NTSC.

Dude! You absolutely kicked my butt! Fantastic! I'll have to go through this later and see if there's anything I can steal.

The only optimization I had left in my pocket was to delay the screen draw (every pixel takes 7 memory accesses on the TI - two to set VDP address, one to read the byte, one to change the pixel, two to set the address AGAIN, and one more to write it back). If I render to CPU memory (as was suggested), I can skip all the extra VDP access during the runtime at the expense of not getting to watch it draw.

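The cost difference Tursi describes can be illustrated with a toy model in Python. `CountingVdp` is a hypothetical stand-in for video memory behind an address port, invented for this example; it only counts port accesses (six per pixel here, with Tursi's seventh step being the bit change done on the CPU side) and does not model real TI hardware timing or memory layout.

```python
class CountingVdp:
    """Hypothetical video RAM reached through an auto-incrementing address port."""

    def __init__(self, size: int):
        self.ram = bytearray(size)
        self.accesses = 0
        self._addr = 0

    def set_address(self, addr: int) -> None:
        self._addr = addr
        self.accesses += 2          # the address is sent as two bytes

    def read(self) -> int:
        self.accesses += 1
        value = self.ram[self._addr]
        self._addr += 1
        return value

    def write(self, value: int) -> None:
        self.accesses += 1
        self.ram[self._addr] = value
        self._addr += 1


def plot_direct(vdp: CountingVdp, offset: int, mask: int) -> None:
    """Read-modify-write one byte of video RAM for every pixel."""
    vdp.set_address(offset)
    value = vdp.read()
    vdp.set_address(offset)         # address must be set AGAIN before writing
    vdp.write(value | mask)


def flush_buffer(vdp: CountingVdp, buffer: bytes) -> None:
    """Copy a finished CPU-side bitmap in one pass: one address setup, then writes."""
    vdp.set_address(0)
    for value in buffer:
        vdp.write(value)


direct = CountingVdp(16)
for mask in (0x80, 0x01):
    plot_direct(direct, 3, mask)

buffered = CountingVdp(16)
cpu_copy = bytearray(16)
for mask in (0x80, 0x01):
    cpu_copy[3] |= mask             # pixels are set in CPU memory, no port traffic
flush_buffer(buffered, cpu_copy)

print(direct.accesses, buffered.accesses)
```

With only two pixels the buffered path looks worse, but its cost is fixed by the bitmap size, while the direct path grows with every pixel plotted and erased.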
dmsc Posted March 12, 2015

Hi!

> Dude! You absolutely kicked my butt! Fantastic!

Thanks!!

Well, I redid the loops to avoid squaring the variables by using a recurrence, and replaced the multiplication by 0.4 with a multiplication by a better factor (0.40625), so I don't need the multiplication routines anymore.

The result is a reduction to 787 bytes, and runtimes of 21.2 seconds on PAL, 22.6 seconds on NTSC.

Attached are the new XEX and code. The profiler now shows that the square root routine consumes most of the CPU.

By the way, on the Atari you can turn the screen DMA off and turn it on only at the end, reducing the overhead; the runtime would then be only 16 seconds.

Attachments: fedora-asm.zip, fedora.xex

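The two tricks dmsc mentions can be sketched in Python. This is a guess at the general technique, not a transcription of the attached source: the recurrence uses the identity (x+1)^2 = x^2 + 2x + 1, and 0.40625 is convenient because it equals 13/32 = 1/4 + 1/8 + 1/32, so it can be applied with shifts and adds. Function names are made up for the example.

```python
def squares_by_recurrence(start: int, stop: int):
    """Yield (x, x*x) for start <= x <= stop, using additions inside the loop."""
    square = start * start      # one multiply to seed the recurrence
    step = 2 * start + 1        # difference between consecutive squares
    for x in range(start, stop + 1):
        yield x, square
        square += step          # (x+1)^2 = x^2 + (2x + 1)
        step += 2


def times_0_40625(value: int) -> int:
    """Multiply an integer by 13/32 with shifts only (each shift truncates)."""
    return (value >> 2) + (value >> 3) + (value >> 5)


# Every value the X loop can take, without a multiply per iteration.
assert all(sq == x * x for x, sq in squares_by_recurrence(-144, 144))

# 8192 stands for 1.0 if the value is in a 13-fraction-bit fixed-point format.
print(times_0_40625(8192) / 8192)   # 0.40625
```

Swapping 0.4 for 0.40625 changes the curve's second harmonic by about 1.5 percent, which at this screen resolution should be hard to spot.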
fujidude Posted March 12, 2015 (edited)

> Ok, I simplified somewhat the code, removed an extra multiplication, now the code is 829 bytes and the runtimes are 30sec in PAL, 38sec in NTSC. I know how to make the code faster (and perhaps smaller), but it will need rewriting of the inner loop. Attached is the source, compile with CA65:

Thanks for posting the source. I don't have/use CA65 (I'm assuming (C)ross (A)ssembler 6500 series), but that's okay as I really just wanted to peek at the code a bit.

Edited March 12, 2015 by fujidude

JamesD Posted March 12, 2015

The Apple II version runs in about 32 minutes once compiled with Einstein. That's really no improvement, and I think I used the right options for the best speed. Clearly, most of the time is spent in the floating point library, but compiling should have eliminated the constant parsing that goes on at runtime. I'll double-check the compiler options and try again if I messed up.

I may try The Beagle Compiler to see if it's any better.

dmsc Posted March 12, 2015 (edited)

Hi!

> Thanks for posting the source. I don't have/use CA65 (I'm assuming (C)ross (A)ssembler 6500 series), but that's okay as I really just wanted to peek at the code a bit.

It is the assembler from the CC65 suite, http://cc65.github.io/cc65/

IMHO the best assembler for the 6502 :)

Seriously, the thing that makes ca65 stand apart from the rest is its support for object files and linking, which makes it possible to structure big programs as multiple independent files.

Edited March 12, 2015 by dmsc