Klax, Atari ST to Jaguar conversion

sh3-rg · February 17, 2014

I'm sorry, does the idea of Klax 3D for Jag offend you ? At least I'm actually doing something productive with the system, instead of just bashing. Then again, your definition of productive may as well be different...

How about you give us links to at least 2-3 ports of Klax into 3rd dimension ?

108 isn't bashing, he's making a funny and pretty informed quip, that's all. No need for handbags.

VladR · February 17, 2014

Don't forget, if you are using a frame buffer you are using bandwidth to populate and redraw portions of the framebuffer each time you update it, and those updates then have to be read back to be displayed on the screen. The OP at least will pull in the framebuffer in 64-bit-wide chunks (I assume you have them phrase aligned?) so will do so very efficiently. The 68k will cough and splutter the data into them in very painful 16-bit chunks every few ticks, certainly nowhere near as efficient.

Yes, with the framebuffer approach, I am paying the bandwidth cost twice, but I was under the impression the 68k can move data in 32-bit chunks. At least my VJ benchmarks show almost double the performance when using 32-bit chunks compared to 16-bit ones. Or is it just VJ playing with me ?

swapd0 · February 17, 2014

The 68k has a 16bit bus, when you write 32bit chunks it's faster because you only use one instruction instead of two (you're saving one instruction decode&execute cycle), but it writes the chunks 16bits by 16bits.
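To put rough numbers on this, here is a small C sketch that estimates the nominal 68000 cycle cost of copying a buffer with `move.w (a0)+,(a1)+` versus `move.l (a0)+,(a1)+`. The 12 and 20 cycle figures come from the instruction timing tables in the MC68000 User's Manual and assume no wait states and no bus contention, which will not hold on a Jaguar where the 68k shares the bus. The function names are made up for illustration.

```c
#include <stdint.h>

/* Nominal MC68000 timings: zero wait states, no bus contention (assumption). */
enum {
    CYCLES_MOVE_W = 12, /* move.w (a0)+,(a1)+ : 1 fetch, 1 data read, 1 write */
    CYCLES_MOVE_L = 20  /* move.l (a0)+,(a1)+ : 1 fetch, 2 data reads, 2 writes */
};

/* Cycles to copy `bytes` bytes (a multiple of 4) using 16-bit moves. */
uint32_t copy_cycles_words(uint32_t bytes)
{
    return (bytes / 2u) * CYCLES_MOVE_W;
}

/* Cycles to copy `bytes` bytes (a multiple of 4) using 32-bit moves. */
uint32_t copy_cycles_longs(uint32_t bytes)
{
    return (bytes / 4u) * CYCLES_MOVE_L;
}
```

For 80 bytes this gives 480 cycles with word moves and 400 with long moves. The saving is the one instruction fetch per pair of words, roughly 17%, so a near 2x gain is more likely an emulator artifact than something to expect on hardware.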

Anyway IMHO it's better to use the OP with scaling so you don't need to draw & clear the playfield, maybe with the bricks precalculated from real 3D models.

+CyranoJ · February 18, 2014

http://www.freescale.com/files/32bit/doc/ref_manual/MC68000UM.pdf

At least my VJ benchmarks show almost a double increase in performance when using 32-bit chunks compared to 16-bit ones.

VladR · February 18, 2014

Of course, it's 16 bits - I simply must have forgotten - it's been a while since I read that pdf...

At this moment I am in the middle of preparing some additional look-up tables so as not to waste any CPU cycles during rendering.

Quick question - in VJ, about half a year ago, I noticed a pretty massive speed-up when I unrolled the loops that move the data around memory.

Is there a significant speed-up in unrolled loops on an actual 68k ?

This 16 vs 64 bit thing made me realize that even if I did code in ASM it would not help anyway. I'd have to come up with a way how to use either OP or Blitter to draw the polygons for me.

From what I remember from the Blitter docs - there should be a way to specify an XPOS offset for each scanline (and a separate width), right ? Because then the Blitter could be used to rasterize polygons...

JagChris · February 19, 2014

This looks really beautiful actually.

So, I played a bit with my 3D rasterizer in C yesterday, tweaked it a bit to get the proper perspective and came up with this

ScreenShot076.gif

The Conveyor belt is procedurally generated, same goes for cubes [obviously].

I also implemented a run-time shading based on depth of the pixel.
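The post does not show how the depth-based shading is done, so the following is only a guess at the general idea: map a pixel's depth to one of N shade levels and add that to a base palette index. The function name, the near/far parameters and the linear ramp are all assumptions for illustration.

```c
#include <stdint.h>

/* Map depth z to a shade level in [0, num_shades - 1].
   0 is the brightest (nearest) level, num_shades - 1 the darkest.
   Assumes z_far > z_near and num_shades >= 1 (hypothetical helper). */
uint8_t shade_for_depth(int32_t z, int32_t z_near, int32_t z_far,
                        uint8_t num_shades)
{
    if (z <= z_near) return 0;
    if (z >= z_far)  return (uint8_t)(num_shades - 1);
    /* Linear ramp; the result is always below num_shades here. */
    return (uint8_t)(((z - z_near) * (int32_t)num_shades) / (z_far - z_near));
}
```

In a rasterizer the result would typically be precomputed per scanline or per depth step into a table, in line with the look-up tables mentioned earlier in the thread, rather than recomputed with a divide per pixel.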

I have to tweak the cube rendering, since they are slightly higher than they should be, because the underlying codebase was meant for screen-height walls, not just few pixels.

JagChris : Now you can't say that no one ever tried to make Klax for Jag

Disclaimer: No GPU was harmed during production of this demo

Rybags · February 19, 2014

Quick question - in VJ, about half a year ago, I noticed a pretty massive speed-up when I unrolled the loops that move the data around memory.

Is there a significant speed-up in unrolled loops on an actual 68k ?

Depends on what you're doing and how you're doing it.

68K can be very efficient compared to 80s contemporaries when moving data. First helper being the pre-decrement/post-increment mode, second being the ability to perform some instructions using large groups of registers at a time.

Generally you want a Stack Pointer free and a pair of data pointers but with 13 remaining registers in a single instruction you can copy 52 bytes - actually you could probably get away on some systems with disabling Interrupts and temporarily use all the registers.

It's likely that compiled code wouldn't generate 68K instructions that do copy operations in the most efficient (timely) way. In a situation where you're moving large chunks at a time already in native 68K, unrolling a loop probably wouldn't make a huge difference - but it's up to the programmer whether you want to scrape for that single-digit percentage boost in performance.
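As a rough check on the 13-register idea, this C sketch compares the nominal cycle cost of moving 52 bytes with a `movem.l` load/store pair against 13 separate `move.l (a0)+,(a1)+` instructions. The timings are nominal MC68000 User's Manual figures (no wait states, no contention), and the extra `lea` to advance the destination pointer is an assumption about how the copy would be written, since `movem` to memory has no post-increment form.

```c
#include <stdint.h>

/* Nominal MC68000 timings (assumption: no wait states, no bus contention). */
enum {
    CYCLES_MOVE_L      = 20, /* move.l (a0)+,(a1)+ */
    CYCLES_MOVEM_LOAD  = 12, /* movem.l (a0)+,<regs> : fixed part */
    CYCLES_MOVEM_STORE = 8,  /* movem.l <regs>,(a1)  : fixed part */
    CYCLES_MOVEM_REG   = 8,  /* per register, for the load and for the store */
    CYCLES_LEA         = 8   /* lea d16(a1),a1 to advance the destination */
};

/* Cycles to move `regs` longwords with one movem load and one movem store. */
uint32_t movem_block_cycles(uint32_t regs)
{
    return CYCLES_MOVEM_LOAD + CYCLES_MOVEM_STORE + CYCLES_LEA
         + 2u * regs * CYCLES_MOVEM_REG;
}

/* Cycles to move `regs` longwords with individual move.l instructions. */
uint32_t move_l_block_cycles(uint32_t regs)
{
    return regs * CYCLES_MOVE_L;
}
```

With 13 registers that is 236 cycles against 260, a gain of about 9% on paper before any loop overhead is counted.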

+CyranoJ · February 19, 2014

Unrolled loops do gain a fair bit of time at the cost of RAM for the instructions. However, you can compensate by making the loop counter shorter and moving more data inside the loop.

eg:

move.l #19,d7

loop:

move.l (a0)+,(a1)+

dbra d7,loop

compared to

move.l #3,d7

loop:

move.l (a0)+,(a1)+

move.l (a0)+,(a1)+

move.l (a0)+,(a1)+

move.l (a0)+,(a1)+

move.l (a0)+,(a1)+

dbra d7,loop
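The trade-off between the two loops (20 passes of one move versus 4 passes of five moves) can be estimated with a few lines of C. This is a sketch using nominal MC68000 User's Manual timings, ignoring the one-off counter setup and any bus contention; the function names are invented here.

```c
#include <stdint.h>

/* Nominal MC68000 timings (assumption: no wait states, no bus contention). */
enum {
    CYCLES_MOVE_L     = 20, /* move.l (a0)+,(a1)+ */
    CYCLES_DBRA_TAKEN = 10, /* dbra, branch taken */
    CYCLES_DBRA_EXIT  = 14  /* dbra, counter expired, falls through */
};

/* Cycles to copy `longs` longwords with `unroll` moves per dbra pass.
   `longs` must be a non-zero multiple of `unroll`. */
uint32_t loop_copy_cycles(uint32_t longs, uint32_t unroll)
{
    uint32_t passes = longs / unroll;
    return longs * CYCLES_MOVE_L
         + (passes - 1u) * CYCLES_DBRA_TAKEN
         + CYCLES_DBRA_EXIT;
}

/* Cycles for a fully unrolled copy: no counter, no dbra. */
uint32_t unrolled_copy_cycles(uint32_t longs)
{
    return longs * CYCLES_MOVE_L;
}
```

For 20 longwords this gives 604 cycles for the one-move loop, 444 for the five-move loop and 400 fully unrolled, so on paper the partial unroll already recovers most of the loop overhead.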

LinkoVitch · February 19, 2014

From what I remember from the Blitter docs - there should be a way to specify an XPOS offset for each scanline (and a separate width), right ? Because then the Blitter could be used to rasterize polygons...

Yup there is, the blitter is quite flexible and despite having bugs in places is quite easy to use. Just spend a bit of time with the manual and play with it. Getting the blitter to draw the screen for you will likely save you a buttload of CPU time. It's what it was meant to do.
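To make the "XPOS offset and width per scanline" idea concrete, here is a plain C sketch of a flat-shaded triangle fill that reduces the polygon to one span per scanline. It does not touch any Blitter registers: `fill_span` is a hypothetical stand-in for the per-line job the Blitter would be given, and the 320x200 8-bit buffer is an assumption.

```c
#include <stdint.h>
#include <string.h>

#define FB_W 320
#define FB_H 200

typedef struct { int x, y; } Point;

/* Stand-in for one Blitter span: fill `width` pixels from `xpos` on row y. */
static void fill_span(uint8_t *fb, int y, int xpos, int width, uint8_t color)
{
    if (y < 0 || y >= FB_H) return;
    if (xpos < 0) { width += xpos; xpos = 0; }
    if (xpos + width > FB_W) width = FB_W - xpos;
    if (width > 0) memset(fb + y * FB_W + xpos, color, (size_t)width);
}

/* X where edge a-b crosses scanline y (caller guarantees a.y != b.y). */
static int edge_x(Point a, Point b, int y)
{
    return a.x + (b.x - a.x) * (y - a.y) / (b.y - a.y);
}

void fill_triangle(uint8_t *fb, Point p0, Point p1, Point p2, uint8_t color)
{
    Point t;
    /* Sort vertices so that p0.y <= p1.y <= p2.y. */
    if (p1.y < p0.y) { t = p0; p0 = p1; p1 = t; }
    if (p2.y < p0.y) { t = p0; p0 = p2; p2 = t; }
    if (p2.y < p1.y) { t = p1; p1 = p2; p2 = t; }
    if (p0.y == p2.y) return; /* zero-height triangle: nothing to draw */

    for (int y = p0.y; y <= p2.y; y++) {
        int xa = edge_x(p0, p2, y); /* the long edge spans every scanline */
        int xb;
        if (y < p1.y)          xb = edge_x(p0, p1, y);
        else if (p1.y != p2.y) xb = edge_x(p1, p2, y);
        else                   xb = p1.x; /* flat bottom edge */

        int xpos  = xa < xb ? xa : xb;
        int width = (xa < xb ? xb - xa : xa - xb) + 1;
        fill_span(fb, y, xpos, width, color);
    }
}
```

The CPU work left per scanline is two edge evaluations, which could be replaced by incremental stepping or a look-up table; the pixel writes themselves are what would be handed off.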

VladR · February 19, 2014

This looks really beautiful actually.

Uhm, I'm not quite sure what can be considered beautiful on that screenshot ? At best, it looks very 8-bittish - just a couple of cubes on a checkerboard.

But, since all the gfx is completely procedural, you can change the look at run-time, so it could be used to bring in additional variety without having to involve an artist in the process.

I think I read somewhere there were plenty of people who actually wanted more games on Jag to have this flat-shaded / non-textured look.

I think I could use the same technique to pre-render some static background(s) at start-up. I'd need to implement the equivalent of a default (on a PC, that is) pixel-shader with distance attenuation to beef it up a notch.

Then, each level could have new background. Just changing the parameters of the lights would completely change the look - anything between dark rooms with a low-key light up to multiple bright and saturated lights - well as much as 32-64 shades of each base color allow anyway.

Since it's all rendering to a memory screen buffer right now anyway, I could just use an off-screen buffer of 640x400, and use the downsampling routines I already have to downsample it to 320x200 to eliminate the aliasing and smooth out whole scene.
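The downsampling routines themselves are not shown in the thread, so this is just one common way to do it: a 2x2 box filter that averages each block of four source pixels into one destination pixel. It assumes 8-bit intensity values; with a palette or a packed colour format the channels would need to be averaged separately.

```c
#include <stdint.h>

/* Average each 2x2 block of src (src_w x src_h, both even) into one
   dst pixel. dst must hold (src_w / 2) * (src_h / 2) bytes.
   Pixels are treated as plain 8-bit intensities (assumption). */
void downsample_2x2(const uint8_t *src, int src_w, int src_h, uint8_t *dst)
{
    int dst_w = src_w / 2;
    for (int y = 0; y < src_h / 2; y++) {
        const uint8_t *row0 = src + (2 * y) * src_w;
        const uint8_t *row1 = row0 + src_w;
        for (int x = 0; x < dst_w; x++) {
            unsigned sum = (unsigned)row0[2 * x] + row0[2 * x + 1]
                         + row1[2 * x] + row1[2 * x + 1];
            dst[y * dst_w + x] = (uint8_t)((sum + 2u) / 4u); /* rounded mean */
        }
    }
}
```

Calling `downsample_2x2(big, 640, 400, small)` would turn a 640x400 off-screen buffer into a 320x200 one. That reads 256,000 source pixels per pass, which is why it suits a one-off background render at start-up more than a per-frame step.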

Thinking about it, that could actually be a new and original look on Jag...

VladR · February 19, 2014

Depends on what you're doing and how you're doing it.

68K can be very efficient compared to 80s contemporaries when moving data. First helper being the pre-decrement/post-increment mode, second being the ability to perform some instructions using large groups of registers at a time.

Generally you want a Stack Pointer free and a pair of data pointers but with 13 remaining registers in a single instruction you can copy 52 bytes - actually you could probably get away on some systems with disabling Interrupts and temporarily use all the registers.

It's likely that compiled code wouldn't generate 68K instructions that do copy operations in the most efficient (timely) way. In a situation where you're moving large chunks at a time already in native 68K, unrolling a loop probably wouldn't make a huge difference - but it's up to the programmer whether you want to scrape for that single-digit percentage boost in performance.

Well, the way I do it in C, is that I have two pointers ptr1, ptr2 and just do *ptr1++ = *ptr2++

When I unrolled that, I got quite a boost (even on a real HW). Don't remember the exact numbers now, but it was quite a speed-up - I can only guess compiler used the postincrement too.
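For reference, a partially unrolled version of that `*ptr1++ = *ptr2++` copy might look like the sketch below. The function name and the unroll factor of four are assumptions; the thread does not show the actual routine.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `count` 32-bit words from ptr2 to ptr1, four per loop pass.
   ptr1 is the destination and ptr2 the source, as in the post above. */
void copy32_unrolled(uint32_t *ptr1, const uint32_t *ptr2, size_t count)
{
    size_t blocks = count / 4u; /* full passes of four transfers */
    size_t rest   = count % 4u; /* leftover transfers */

    while (blocks--) {
        *ptr1++ = *ptr2++;
        *ptr1++ = *ptr2++;
        *ptr1++ = *ptr2++;
        *ptr1++ = *ptr2++;
    }
    while (rest--) {
        *ptr1++ = *ptr2++;
    }
}
```

Whether the compiler turns each statement into a single `move.l (a0)+,(a1)+` depends on the compiler and its optimization settings, so the generated assembly is worth checking before drawing conclusions from a benchmark.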

I haven't looked at ASM output of that, but I saw the previous version's ASM generated by compiler and it looked pretty awful (but that's the price you pay for using high-level language).

On the other hand, I want to resist to use ASM for as long as possible, since I truly want to see how far can I push C on 68k in real-time 3D gfx

LinkoVitch · February 19, 2014

Mostly morbid curiosity, but.. what is the size of ptr? uint8_t, uint16_t or uint32_t ?

I'd hope that the compiler wrote something along the lines of

move.X (a0)+,(a1)+

and not

move.X (a0),(a1)

adda #sizeofX,a0

adda #sizeofX,a1

??

VladR · February 19, 2014

Unrolled loops do gain a fair bit of time at the cost of RAM for the instructions. However, you can compensate by making the loop counter shorter and moving more data inside the loop.

eg:

move.l #19,d7

loop:

move.l (a0)+,(a1)+

dbra d7,loop

compared to

move.l #3,d7

loop:

move.l (a0)+,(a1)+

move.l (a0)+,(a1)+

move.l (a0)+,(a1)+

move.l (a0)+,(a1)+

move.l (a0)+,(a1)+

dbra d7,loop

Yes, that's exactly what I did initially - when I started unrolling loops - I did 1,2,4,8,16 transfers in each pass. But as long as the counter was there, the performance difference (on VJ, that is) was minimal.

Once the counter was gone and loop was unrolled completely, that's when I got the performance boost.

Then again, I'd have to see what the compiler generated in each of the cases.

VladR · February 19, 2014

Mostly morbid curiosity, but.. what is the size of ptr? uint8_t, uint16_t or uint32_t ?

I'd hope that the compiler wrote something along the lines of

move.X (a0)+,(a1)+

and not

move.X (a0),(a1)

adda #sizeofX,a0

adda #sizeofX,a1

??

The way I benchmarked it last year is that I wrote 3 versions of the transfer routine - separate for 8 bits, separate for 16 bits and separate for 32 bits (although, as has been said, the 32-bit - albeit fastest under VJ, is not a real 32-bits on real HW).

Of course, the 32-bit one is right now useable only when (XPOS % 4 == 0) && (WIDTH % 4 == 0), which is OK on this platform anyway.

I do however remember seeing compiler do adda #asadsa,a0...

Rybags · February 19, 2014

If you can embed Asm with your C then it's well worth your while doing time-critical stuff that way.

There's all manner of ways to do a memory copy on 68K, and I imagine the time difference between fastest and something innocently coded that looks efficient could be enormous.

LinkoVitch · February 19, 2014

32bit will be faster for the reasons explained earlier in this thread.

I'd also imagine the 32bit version will work for (X % 2 == 0) as well, given a long read or write needs only be word aligned. Well, unless you are poking stuff in DSP RAM

Having the adda in there is a way to waste ticks; these are the reasons I am not a fan of high level languages where there are limited resources and the ASM is so simple.
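The two alignment rules discussed here can be written down side by side. This sketch only picks a transfer size; it assumes one byte per pixel, an even row pitch and a long-aligned buffer start, and both function names are invented for illustration.

```c
/* Rule from VladR's post: use longs only when both XPOS and WIDTH
   are multiples of 4, otherwise fall back to words or bytes. */
unsigned transfer_size_strict(unsigned xpos, unsigned width)
{
    if (xpos % 4u == 0u && width % 4u == 0u) return 4u;
    if (xpos % 2u == 0u && width % 2u == 0u) return 2u;
    return 1u;
}

/* Relaxed rule: the 68000 only needs long accesses to start on an even
   address, so an even XPOS is enough if the span is a whole number of longs. */
unsigned transfer_size_word_aligned(unsigned xpos, unsigned width)
{
    if (xpos % 2u == 0u && width % 4u == 0u) return 4u;
    if (xpos % 2u == 0u && width % 2u == 0u) return 2u;
    return 1u;
}
```

For a span at XPOS 2 with WIDTH 8 the strict rule falls back to word transfers while the relaxed rule still allows longs. This applies to the 68000 in main RAM only, as the caveat about DSP RAM above points out.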

VladR · February 19, 2014

If you can embed Asm with your C then it's well worth your while doing time-critical stuff that way.

There's all manner of ways to do a memory copy on 68K, and I imagine the time difference between fastest and something innocently coded that looks efficient could be enormous.

I almost did that last year, but I still had some unresolved naming conflicts when interfacing with ASM, so I put it aside.

Ideally, it would work like in TurboPascal - you just wrote a block starting with ASM and boom, you got instant access to all your pascal-defined pointers and variables. That's how I wrote the raycaster - I used pascal only for the input, the rest was written in ASM...

Defining a separate *.s file is a bit less straightforward...

VladR · February 21, 2014

I found a bit of time this evening and finished the precomputed tables I mentioned earlier which also finally fixed the incorrect height at the distant cubes (as can be seen on the first screenshot).

This is how it looks now:

Since I was curious about the performance, I benchmarked (under VJ, of course) just the cube-rendering method - which could be still optimized much more (especially the vertical sides could be done faster).

I rendered those 5 cubes in a loop 1000 times (merely overwriting the framebuffer - I turned off the double buffering to see how fast it really is).

It took just 10 seconds. Thus, the cube-rendering part (including depth-based shading) runs in 100 fps

Of course, the background and everything else in the scene will eventually slow it down, but this proves my original assumption that the 3D bricks should be pretty fast, despite using high-level C rasterizer.

So, those target 25 fps start to seem much more realistic

I'm pretty sure the antialiasing on the cube's edges shouldn't slow it down drastically either. That could be a funny feature - run-time AntiAliasing on Jaguar

Next, I'll go optimize the checkerboard - the conveyor belt - I've got few ideas how to make it relatively fast (though it does occupy a major part of the screen space)...

+CyranoJ · February 21, 2014

That's very unrealistic for several reasons:

You are not erasing. So you will either have to buffer the pixels under the object, or redraw them (or use another object)

You are not waiting for vsync. If you are even 1 cycle over the frame limit you will waste an entire frame. If you don't use vsync, you will get unstable frame rates and it'll be unplayable.

5 cubes isn't 40. By your calculations if you can do 5 cubes at 100fps (which is really 10 cubes at 50fps in PAL land, not really anything to hoopla about when you put it like that....) that's 12.5fps for 40..... add in that buffering so you can remove them... and yeah I think you'll be down to around 6fps.

VJ is about 2x the speed of the Jaguar for tests like these... so make that 3fps.

The rest of the game is still required.

Now, that's not to say I'm not glad you are trying this, and I'm very pleased that VRbasic is coming along nicely and will encourage more people to play with the Jag, but please, don't go making wild 100fps claims and giving everyone false hope

I rendered those 5 cubes in a loop 1000 times (merely overwriting the framebuffer - I turned off the double buffering to see how fast it really is).

It took just 10 seconds. Thus, the cube-rendering part (including depth-based shading) runs in 100 fps

Edited February 21, 2014 by CyranoJ
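The arithmetic in this post can be reproduced with two small helpers: one scales a measured frame time linearly with the object count, the other rounds a frame time up to whole display refreshes, which is what waiting for vsync does. Linear scaling is the assumption made in the post, not a measured fact, and the function names are invented.

```c
#include <stdint.h>

/* Scale a measured frame time linearly from `measured` to `target` objects. */
uint32_t scale_frame_us(uint32_t frame_us, uint32_t measured, uint32_t target)
{
    return frame_us * target / measured;
}

/* Number of display refreshes one frame occupies when waiting for vsync:
   any overrun, however small, costs a whole extra refresh. */
uint32_t refreshes_per_frame(uint32_t frame_us, uint32_t refresh_us)
{
    return (frame_us + refresh_us - 1u) / refresh_us;
}
```

1000 frames in 10 seconds is 10,000 us per frame for 5 cubes. Scaling to 40 cubes gives 80,000 us (12.5 fps), doubling for erase gives 160,000 us (about 6 fps), and doubling again for the VJ speed factor gives 320,000 us (about 3 fps). On a 50 Hz display (20,000 us per refresh), a 10,000 us frame still occupies one full refresh, so 100 fps becomes 50 fps, and a frame of 20,001 us drops to 25 fps.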

sh3-rg · February 21, 2014

Next, I'll go optimize the checkerboard - the conveyor belt - I've got few ideas how to make it relatively fast (though it does occupy a major part of the screen space)...

Post a binary if you want some hardware benchmarking to compare with VJ.

VladR · February 21, 2014

That's very unrealistic for several reasons:

You are not erasing. So you will either have to buffer the pixels under the object, or redraw them (or use another object)

You are not waiting for vsync. If you are even 1 cycle over the frame limit you will waste an entire frame. If you don't use vsync, you will get unstable frame rates and it'll be unplayable.

5 cubes isn't 40. By your calculations if you can do 5 cubes at 100fps (which is really 10 cubes at 50fps in PAL land, not really anything to hoopla about when you put it like that....) that's 12.5fps for 40..... add in that buffering so you can remove them... and yeah I think you'll be down to around 6fps.

VJ is about 2x the speed of the Jaguar for tests like these... so make that 3fps.

The rest of the game is still required.

Now, that's not to say I'm not glad you are trying this, and I'm very pleased that VRbasic is coming along nicely and will encourage more people to play with the Jag, but please, don't go making wild 100fps claims and giving everyone false hope

What false hope ? I was pretty explicit in saying that only the cube-rendering part runs in 100 fps and that remaining features will bring the framerate down. You basically reiterated everything I said

5 cubes isn't 40, for sure, but 40 cubes for sure isn't 8 times higher rendering costs than 5 cubes

Well, theoretically it could be, but I'd have to be extremely stupid and noob to just call DrawCube 40 times when I know these cubes do not move. And there's no visibility check, and there's a lot of occlusion going on from nearby cubes. And those cubes share the same starting address in VRAM. And few other things where I can shave off some additional performance. Which all adds up

So, no : 40 cubes != 5 * 8 cubes (by far)

But yes, you do bring another interesting point, and a one that should be easy to benchmark, so I think I'll go and create those bottom rows of cubes before going for the checkerboard optimizations.
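One way to combine the two positions (erasing costs something, but static cubes need not be redrawn) is to render everything that does not move into a background buffer once, then restore only the rectangles that moving objects touched. The sketch below shows the restore step only; the buffer size, the `Rect` type and the function name are assumptions, not code from the project.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FB_W 320
#define FB_H 200

typedef struct { int x, y, w, h; } Rect;

/* Copy rectangle r from the pre-rendered background into the frame,
   erasing whatever was drawn there. r must lie inside the buffer. */
void restore_rect(uint8_t *frame, const uint8_t *background, Rect r)
{
    for (int row = 0; row < r.h; row++) {
        size_t off = (size_t)(r.y + row) * FB_W + (size_t)r.x;
        memcpy(frame + off, background + off, (size_t)r.w);
    }
}
```

Per frame the cost is then proportional to the area covered by moving cubes rather than to the total number of cubes on screen, at the price of a second full-size buffer in RAM.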

LinkoVitch · February 21, 2014

Unless you are testing on real hardware you are NOT benchmarking anything useful. There is also no VRAM in the Jag, VRAM and DRAM are physically different, the jag has NO VRAM.

So you are expecting this game to have a lot of nothing happening? These cubes aren't going to move? One pixel a frame is only 50/60 pixels a second, which isn't terribly fast, so it's conceivable that the tiles will move pretty much every frame, so they are going to need to be re-drawn faster than 10 times a second. Plus as mentioned by sh3, it's only a low number on the 1st few levels of Klax; once it gets going you have a lot going on, plus you can fling the 5 tiles you are holding back onto the playfield, as well as every one you catch.. so you can get a silly number of the things all on screen at once.

You need to stop using VJ for calculating frame rates! Unless you only want to write code that only works on VJ, which just seems like a masochistic way of writing low quality stuff for a PC, as that is the only thing that is going to be able to run it. Part of the challenge of working with older systems is working around their limitations and playing to their strengths.

VladR · February 21, 2014

When I say VRAM, I obviously mean my FrameBuffer array in regular RAM- it's just much shorter and faster to write, that's all.

I know very well I should just go and bite the bullet and get that LPT card working already - it's been overdue for quite some time. But since my current env is working just fine, there's no real need at this moment.

I will, of course, release the builds out to the wild, soon. I just want to optimize as much as possible under VJ, that's all.

Besides, the current build does not show very much anyway - so let's do that when I have at least 20 cubes on screen, since that's going to represent a regular gameplay load much closer than current 5 cubes.

Lynxpro · February 22, 2014

From the research I did earlier, didn't find much about the graphics HW though, likely 256 colour bitmap mode, unknown if sprites, blit or object scaling available.

Klax was originally developed on the Amiga in Basic and was ported line by line to C. Atari then prototyped on an "Escape from the planet of the Robots" arcade setup.

Given the game has fairly wide representation on home systems and from obvious observation, the requirements aren't exactly high.

Anything an ST or Amiga 500 can do should be a walk in the park for the Jaguar in any case - disregarding stuff that uses ludicrous amounts of Ram or HDD storage.

I find it interesting how Atari Games used both the Atari ST and the Amiga to develop various arcade titles.

Wow, developed using Amiga [Microsoft] Basic originally. That's about as bad as developing using stock ST Basic.

VladR · February 22, 2014

I find it interesting how Atari Games used both the Atari ST and the Amiga to develop various arcade titles.

Wow, developed using Amiga [Microsoft] Basic originally. That's about as bad as developing using stock ST Basic.

Of course they were using AtariST / Amiga. Those machines had already proven and fully working development environment.

Those machines were the ones that coders were already very familiar with and could produce something simple (yet still nicer than the older consoles could) very quickly.

And you keep forgetting the dire financial situation of Atari. It made most economic sense to get as many titles as possible developed in shortest time possible. Ideally using high-level languages - be it Basic or C.

Just look at the development environment of Jaguar. It's a brutal clusterf*ck in 2014. Think it was better in 1994 ?
