Atari Owl Project?

+CyranoJ · June 10, 2018

Don't question his incorrect assumptions, he runs off and sulks

Gummy Bear · June 10, 2018

So no octuple buffering in Rebooteroids 2 then?

LinkoVitch · June 10, 2018

So no octuple buffering in Rebooteroids 2 then?

It's 64 buffers or go home! This is the Jaguar after all

Zerosquare · June 10, 2018

Well, bexa64n is only 64x64, so 64 buffers could fit in RAM...

sh3-rg · June 10, 2018

Well, bexa64n is only 64x64, so 64 buffers could fit in RAM...

And in a way, they kinda do (60 of them at least ;-) )

philipj · June 10, 2018

Few things wrong with your assumptions there..

More frame buffers != higher frame rate. Each buffer needs to be populated; if the Jag can produce 15 frames in a second, having an extra buffer will not make that any faster, it can still only produce 15 frames a second. Double buffering allows you to complete a frame whilst displaying the previous frame. This helps remove rendering artefacts from view (shearing etc.), only giving the player a complete frame when it is ready. It doesn't slow down the production or speed it up; it just improves the experience of viewing the scene, and allows more than 1/60th of a second to produce a frame.

From my understanding having spoken with Atari Owl over the years, the use of GPU in Main (oh god here we go again), is primarily used for low frequency type functions, NOT as anything to be used for speeding up rendering. It (for example), could mean you leave the 3D rendering code in the GPU RAM which needs speed, and then have routines that perform more administrative type roles in Main RAM. The GPU will run roughly 10x slower when running code from main RAM, but if it's only a relatively small amount of code that's ran once every frame or so, it makes it worthwhile overall.

GPU in main is NOT a magic bullet, it is another tool/possibility that can be used to help in SOME circumstances.

Well... I'm aware of that. GPU to main requires full 64bit access due to an 8bit multiplexer not getting enough current when trying to make jumps, so small function like AI and such is suitable for access to main ram. Anything more than that and you're basically "Hammering the GPU" or taxing the GPU from performance. It seems feasible for a scan-line based render-er to benefit if a few lines are pre-rendered in main ram before it hits the OP. It wouldn't have to be a full screen, but maybe a very small portion of a screen size where one or two (or more) lines of graphics can be fetched from main before it hits the screen. Or 64x64 sprite based graphics could be fetched from main memory.

Edited June 10, 2018 by philipj

philipj · June 10, 2018

OK... Here's a video I ran into recently concerning sprites on the Neo Geo. Here's the YouTube description:

What's really going on with sprites when you're playing Metal Slug 3. Basically, all the Metal Slug games on the NeoGeo use two background planes, one zone for "real sprites" which are dynamically reorganized each frame (that's where all the blinking comes from), and one last foreground plane.

sh3-rg · June 10, 2018

philipj · June 10, 2018

????

VladR · June 10, 2018

so small function like AI and such is suitable for access to main ram. Anything more than that and you're basically "Hammering the GPU" or taxing the GPU from performance.

Uhm, what ?

but maybe a very small portion of a screen size where one or two (or more) lines of graphics can be fetched from main before it hits the screen.

That's not how it works. You keep rendering (via GPU/Blitter/DSP) into second (off-screen) framebuffer, while OP is drawing the first framebuffer (that you rendered previously).
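To make the double-buffering point concrete, here is a minimal Python sketch (not Jaguar code; `DoubleBuffer` and `simulate` are hypothetical names made up for illustration). It assumes a 60 Hz display and a renderer that needs a fixed number of vblanks per frame. The display only ever sees completed frames, and the number of new frames per second is set by the render time, not by how many buffers exist.

```python
class DoubleBuffer:
    """Two framebuffers: the display reads 'front' while the renderer fills 'back'."""

    def __init__(self, size):
        self.buffers = [[0] * size, [0] * size]
        self.front = 0  # index of the buffer the display is pointed at

    @property
    def back(self):
        return 1 - self.front

    def render(self, frame_id):
        # Stand-in for the GPU/Blitter work: fill the off-screen buffer.
        buf = self.buffers[self.back]
        for i in range(len(buf)):
            buf[i] = frame_id

    def swap(self):
        # Only called once the back buffer is complete.
        self.front = self.back


def simulate(render_vblanks, total_vblanks=60):
    """Return the frame id visible at each vblank (0 = nothing rendered yet)."""
    db = DoubleBuffer(size=16)
    shown = []
    progress = 0
    frame_id = 1
    for _ in range(total_vblanks):
        progress += 1
        if progress == render_vblanks:
            db.render(frame_id)
            db.swap()
            frame_id += 1
            progress = 0
        shown.append(db.buffers[db.front][0])
    return shown


shown = simulate(render_vblanks=4)
print(len(set(shown) - {0}))  # distinct completed frames shown in one second
```

With 4 vblanks of work per frame, the model shows 15 distinct frames per second however the buffers are arranged; the second buffer only guarantees that a half-drawn frame is never visible.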

There's basically zero practical use in trying to chase the beam, as on 8-bit Atari. Now, Jaguar does provide some functionality in that regards, but that would really only be useful for a simplistic 2D game, where the performance of each scanline is constant.

In 3D, there's no such thing as constant performance. I have 8 different versions for scanline traversal in GPU, and each of them has very different performance characteristics. Any polygon can use any combination of those 8, so it's impossible to preprocess that (or if you did, you'd waste incredible amount of performance for something useless).

The performance differences are more than an order of magnitude between the shortest scanline span (and the longest one) - thus this characteristics alone makes it impossible to actually gain any performance out of it overall.

Also, you keep forgetting that beam chasing is an incredible waste of performance due to all the syncing and waiting that has to happen each and every scanline (on top of your 3D engine performance cost). While it could be a fun technological exercise, in the end you'd end up with much smaller polycount/scanlinecount.

It seems feasible for a scan-line based render-er to benefit if a few lines are pre-rendered in main ram before it hits the OP. It wouldn't have to be a full screen, but maybe a very small portion of a screen size where one or two (or more) lines of graphics can be fetched from main before it hits the screen.

The last thing you want to be doing for a scanline renderer is throw the inner rasterizing loop to execute from an order of magnitude slower main RAM.

The single greatest advantage of gpu-from-main is actually totally indirect and unexpected. When I did that for my road-rash road renderer, I have gradually cleared (by moving the non-performance-critical outer loops and initialization) enough bytes from the 4 KB cache, that I could then implement a faster algorithm that was longer in code size (thus it didn't fit before).

The second greatest advantage is if you can avoid the code swap - meaning there's going to be a threshold where you keep pushing more and more code out to the main, such that you will be able to avoid doing a code swap (which kills substantial amount of performance). Understand here, that even single code swap really means two swaps (the old code and the new code (that didn't fit alongside the old)). Problem is that you halt both GPU and Blitter for the duration of those 2 swaps, so it's quite obvious visually, you don't even have to benchmark it, to see the impact. It's like losing 2 out of 4 engines in flight...
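A rough way to see the code-swap threshold is a back-of-the-envelope cost model. The sketch below is an illustration only: the function names and all cycle counts are hypothetical, and the 10x slowdown is the approximate figure quoted in this thread, not a measured constant. It compares swapping rarely-used code in and out of local RAM (two swaps, as described above) against leaving that code in main RAM and paying the slowdown.

```python
def cost_with_swap(hot_cycles, cold_cycles, swap_cycles):
    """Everything runs from fast local RAM, but the cold code has to be
    swapped in and the hot code swapped back: two swaps per frame."""
    return hot_cycles + cold_cycles + 2 * swap_cycles


def cost_from_main(hot_cycles, cold_cycles, slowdown=10):
    """Hot code stays resident in local RAM; cold code runs from main RAM
    at roughly `slowdown` times the cost (assumed figure)."""
    return hot_cycles + cold_cycles * slowdown


def main_is_cheaper(hot_cycles, cold_cycles, swap_cycles, slowdown=10):
    return (cost_from_main(hot_cycles, cold_cycles, slowdown)
            < cost_with_swap(hot_cycles, cold_cycles, swap_cycles))


# Hypothetical numbers: a small once-per-frame routine vs. a larger one.
print(main_is_cheaper(hot_cycles=100_000, cold_cycles=500, swap_cycles=4_000))
print(main_is_cheaper(hot_cycles=100_000, cold_cycles=2_000, swap_cycles=4_000))
```

Under this model, running from main wins whenever `cold_cycles * (slowdown - 1)` is less than `2 * swap_cycles`, which is why only small, infrequent routines are good candidates. The model ignores the GPU and Blitter being stalled during a swap, so it understates the real cost of swapping.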

philipj · June 11, 2018

Also, you keep forgetting that beam chasing is an incredible waste of performance due to all the syncing and waiting that has to happen each and every scanline (on top of your 3D engine performance cost). While it could be a fun technological exercise, in the end you'd end up with much smaller polycount/scanlinecount.

I wouldn't call it chasing the beam... It's more like chasing the OP line than chasing the beam; to some degree it really isn't that either. It's more or less preserving the lines in main for later usage. If a line is pre-rendered in main ram, the hope is to leverage enough lines in main for the sake of producing a consistent "Persistence Of Motion"; or at least enough lines to have some leverage over frame rates.

In 3D, there's no such thing as constant performance. I have 8 different versions for scanline traversal in GPU, and each of them has very different performance characteristics. Any polygon can use any combination of those 8, so it's impossible to preprocess that (or if you did, you'd waste incredible amount of performance for something useless).

The performance differences are more than an order of magnitude between the shortest scanline span (and the longest one) - thus this characteristics alone makes it impossible to actually gain any performance out of it overall.

Now that much I wouldn't doubt in the least... This is the Jaguar we're talking about, not a magical kitty riding a unicorn (pun intended). However I think it could be possible to preserve enough OP lines in main to get some kind of controlled frame rate at a respectable pace while leveraging any lagging in the process. The kind real time 2600 chasing the beam theory is something that I once thought was possible on the Jag, but the Jag is just made different thus the theory can't really apply using conventional means. However I do feel like it's possible to have a more reserved approach by saving enough OP lines in main to get a decent frame rate then have the Blitter to copy a few of the lines back for fast OP draw.

Few things wrong with your assumptions there..

More frame buffers != higher frame rate. Each buffer needs to be populated; if the Jag can produce 15 frames in a second, having an extra buffer will not make that any faster, it can still only produce 15 frames a second. Double buffering allows you to complete a frame whilst displaying the previous frame. This helps remove rendering artefacts from view (shearing etc.), only giving the player a complete frame when it is ready. It doesn't slow down the production or speed it up; it just improves the experience of viewing the scene, and allows more than 1/60th of a second to produce a frame.

From my understanding having spoken with Atari Owl over the years, the use of GPU in Main (oh god here we go again), is primarily used for low frequency type functions, NOT as anything to be used for speeding up rendering. It (for example), could mean you leave the 3D rendering code in the GPU RAM which needs speed, and then have routines that perform more administrative type roles in Main RAM. The GPU will run roughly 10x slower when running code from main RAM, but if it's only a relatively small amount of code that's ran once every frame or so, it makes it worthwhile overall.

GPU in main is NOT a magic bullet, it is another tool/possibility that can be used to help in SOME circumstances.

@LinkoVitch

I got my definition of a framebuffer from "Wikipedia"... It claims that color values are stored in memory to be drawn to screen versus the old chasing the beam theory where a line is drawn using a single beam of light on the 2600. I know you didn't bring up the chasing the beam thing, just making a distinction, that's all. A single OP line is enough to fit in the GPU internal memory, so it wouldn't take too much memory if small sections of a screen are drawn in a series of OP lines in main. The Blitter can copy those lines for fast OP draws with more lines still reserved in main; it would make for a great frame rate leveraging tool.

@sh3-rg

All magical kitties riding a unicorn aside... :?:

Austin · June 11, 2018

I'm so glad you're back man. :lol:

VladR · June 11, 2018

I wouldn't call it chasing the beam... It's more like chasing the OP line than chasing the beam; to some degree it really isn't that either. It's more or less preserving the lines in main for later usage. If a line is pre-rendered in main ram, the hope is to leverage enough lines in main for the sake of producing a consistent "Persistence Of Motion"; or at least enough lines to have some leverage over frame rates.

Well, so if you don't mean chasing the beam, how is this then different from double/X - buffering ? Those buffers also keep the "rasterized lines in main for later usage" - as you put it, no ?

However I do feel like it's possible to have a more reserved approach by saving enough OP lines in main to get a decent frame rate then have the Blitter to copy a few of the lines back for fast OP draw.

So, how exactly would this bring performance compared to double-buffering ?

Also, you don't mention the single biggest issue with triple-buffering : the brutal input lag.

It's quite OK, if the game runs at 15 fps and checks input every 4 frames. It's quite brutal, however, if the game manages to keep average 20 fps (due to triple buffering), but keeps the input lag at same 4 frames. It looks smooth enough yet the input lags disproportionally.

Now, on jag, due to the way the final composite is done on OP, you can prioritize rendering main player, and keep the input at 60 fps, with environment at whatever framerate it can manage. But, it's not doable for all game genres and it complicates the engine design considerably...
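The input-lag concern can be put into numbers with a deliberately simplified model. This is a sketch under stated assumptions, not a description of any real engine: it assumes a 60 Hz display, input sampled when a frame starts rendering, and up to `buffers - 1` finished frames waiting ahead of the display. The function names are hypothetical.

```python
DISPLAY_HZ = 60  # assumed NTSC-style refresh rate


def vblanks_per_frame(fps):
    """How many display refreshes each rendered frame stays on screen."""
    return DISPLAY_HZ // fps


def worst_case_lag_vblanks(fps, buffers):
    """Simplified assumption: a frame built from fresh input may sit behind
    up to (buffers - 1) frames before it reaches the screen."""
    return (buffers - 1) * vblanks_per_frame(fps)


def to_ms(vblanks):
    return vblanks * 1000 / DISPLAY_HZ


for fps, buffers in [(15, 2), (20, 3)]:
    lag = worst_case_lag_vblanks(fps, buffers)
    print(fps, "fps,", buffers, "buffers:", lag, "vblanks,", round(to_ms(lag), 1), "ms")
```

In this model, double buffering at 15 fps gives 4 vblanks of lag, while triple buffering at 20 fps gives 6, so the smoother-looking case responds later. Real latency depends on where in the frame the pad is read.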

LinkoVitch · June 11, 2018

@LinkoVitch

I got my definition of a framebuffer from "Wikipedia"... It claims that color values are stored in memory to be drawn to screen versus the old chasing the beam theory where a line is drawn using a single beam of light on the 2600. I know you didn't bring up the chasing the beam thing, just making a distinction, that's all. A single OP line is enough to fit in the GPU internal memory, so it wouldn't take too much memory if small sections of a screen are drawn in a series of OP lines in main. The Blitter can copy those lines for fast OP draws with more lines still reserved in main; it would make for a great frame rate leveraging tool.

@sh3-rg

All magical kitties riding a unicorn aside...

There is no point in doing that though. Storing scanlines in GPU RAM is incredibly wasteful (the OP lives on the same die and already has enough physical RAM attached to hold 2 (IIRC) scanlines). All the processors in the Jag are faster than "Chasing the beam" (which is the 2600 BTW not the 5200 or the 7800 systems). If you have your system doing any form of "waiting" you are wasting potential processing time. The only time it should wait is if there is nothing for it to do, or it needs player input.

Not quite sure what the relevance of the NeoGeo video is. Composite layered output with sprites is quite common, you do precisely that with the OP in the Jag. Although there are no prescribed layers and it renders each bitmap on top of what has already been rendered, it does this a scanline at a time, (Hence if a list is poorly structured and too long you will get video distortions as the OP runs out of time/bandwidth to render a scanline.) Think of the bitmaps as being stickers, the OP list is an ordered list of how those stickers are to be affixed to the screen, and at what position. So whatever is 1st on the list is at the back, then the next image is overlayed, and then the next and so on. If one of those bitmaps is a frame buffer from a 3D renderer it will get placed like any other, dependant on where it is in the list.
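The sticker analogy maps onto a simple back-to-front compositing loop. The following Python sketch is only a model of the ordering rule, not of the real OP list format: the dictionary layout, the use of `None` for transparent pixels, and the name `composite_scanline` are all assumptions made for illustration.

```python
def composite_scanline(object_list, y, width, background=0):
    """Build one output scanline by applying each object in list order.
    Earlier entries end up behind later ones."""
    line = [background] * width
    for obj in object_list:
        row = y - obj["y"]
        if not 0 <= row < len(obj["pixels"]):
            continue  # this object does not cover scanline y
        for col, px in enumerate(obj["pixels"][row]):
            x = obj["x"] + col
            if 0 <= x < width and px is not None:  # None = transparent (assumed)
                line[x] = px
    return line


# A 4-pixel "framebuffer" object at the back and a 2-pixel sprite on top.
framebuffer = {"x": 0, "y": 0, "pixels": [[1, 1, 1, 1]]}
sprite = {"x": 1, "y": 0, "pixels": [[2, None]]}

print(composite_scanline([framebuffer, sprite], y=0, width=4))
print(composite_scanline([sprite, framebuffer], y=0, width=4))
```

Swapping the list order hides the sprite behind the framebuffer object, which is the whole "layering" mechanism: position in the list decides depth. The loop also shows why a long list costs time on every scanline it touches.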

RE GPU running in main, these reads are only 32/16bit. All GPU instructions are 16 bits in size (with the one exception of MOVEI, which has a 32-bit data portion following the instruction). The GPU is a 32-bit device; it cannot access memory as a single 64-bit read. It probably does prefetch 32 bits' worth of data (but I am not 100% on this), and due to the physical limitations of DRAM vs SRAM it suffers performance whilst accessing main RAM just like any other device would. Hence only fairly infrequent (in terms of game code) calls are best suited for it.

I did do a test of the speed difference and posted the code I used and the results on our website: https://www.u-235.co.uk/gpu-in-main-science/ You can try that yourself; the code is there and explained, so you can see the speed differences for yourself. (I haven't had time to work on any of the other ideas mentioned in that article, alas.)

philipj · June 11, 2018

Well, so if you don't mean chasing the beam, how is this then different from double/X - buffering ? Those buffers also keep the "rasterized lines in main for later usage" - as you put it, no ?

So, how exactly would this bring performance compared to double-buffering ?

Also, you don't mention the single biggest issue with triple-buffering : the brutal input lag.

It's just an idea I'm putting out there; that's all... Besides any kind of problems that one might have with buffering, there's always a way to find leverage over performance vs speed. It's not like the 8bit that uses a display list on a system that has limited memory in bytes and kilobytes. If there's a way to get more than 15 to 20fps using main ram access then I'll use it with no bones about it, I guess that opinion would go for anyone here. It's possible for OP lines to be stacked in rows in main memory to be streamed back via the blitter really fast and it probably wouldn't take a whole lot of memory to do so if it's 8 to 10 lines. That's like 16 to 18kb?
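The memory figures in this thread are easy to sanity-check with a little arithmetic. The sketch below assumes a 320-pixel-wide line at 16 bits per pixel; both are assumptions, since the thread never states a resolution or colour depth. The 2 MB figure is the Jaguar's main RAM size, and `buffer_bytes` is a hypothetical helper.

```python
MAIN_RAM_BYTES = 2 * 1024 * 1024  # the Jaguar has 2 MB of main RAM


def buffer_bytes(width, height, bits_per_pixel=16):
    """Size in bytes of a width x height bitmap at the given colour depth."""
    return width * height * bits_per_pixel // 8


ten_lines = buffer_bytes(320, 10)          # ten scanlines (assumed 320 wide, 16 bpp)
full_frame = buffer_bytes(320, 240)        # one full 320x240 framebuffer (assumed)
small_64 = 64 * buffer_bytes(64, 64)       # sixty-four 64x64 buffers

print(ten_lines, "bytes for 10 lines")
print(full_frame, "bytes for a full frame")
print(small_64, "bytes for 64 buffers of 64x64")
print(small_64 <= MAIN_RAM_BYTES)
```

Under these assumptions ten lines come to 6,400 bytes, well under the 16 to 18 KB guessed at, and sixty-four 64x64 buffers come to 512 KB, which does fit in main RAM. A different width or colour depth scales the numbers proportionally.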

It's quite OK, if the game runs at 15 fps and checks input every 4 frames. It's quite brutal, however, if the game manages to keep average 20 fps (due to triple buffering), but keeps the input lag at same 4 frames. It looks smooth enough yet the input lags disproportionally.

Now, on jag, due to the way the final composite is done on OP, you can prioritize rendering main player, and keep the input at 60 fps, with environment at whatever framerate it can manage. But, it's not doable for all game genres and it complicates the engine design considerably...

That's also something I've given some thought to and that's the Jaguar bottleneck bus... The processors doesn't always have to run parallel to each other; a pipeline that helps to free the bus is something that I'm looking into as a way of making the most out of the bottlenecks.

There is no point in doing that though. Storing scanlines in GPU RAM is incredibly wasteful (the OP lives on the same die and already has enough physical RAM attached to hold 2 (IIRC) scanlines). All the processors in the Jag are faster than "Chasing the beam" (which is the 2600 BTW not the 5200 or the 7800 systems). If you have your system doing any form of "waiting" you are wasting potential processing time. The only time it should wait is if there is nothing for it to do, or it needs player input.

I don't recall referring to the 5200 or the 7800 as machines that chases the beam... Those systems uses a list to display graphics; they both do so differently from each other, but the concept of a list streaming to a graphics chip is similar. Besides the 7800 graphic chip is much faster than the processor that runs it every time it hits the bus just like the 68000 when it hogs the bus on the Atari Jaguar whenever it's in use; but I certainly understand wanting to take advantage of the GPU speeds in internal cache.

Not quite sure what the relevance of the NeoGeo video is. Composite layered output with sprites is quite common, you do precisely that with the OP in the Jag. Although there are no prescribed layers and it renders each bitmap on top of what has already been rendered, it does this a scanline at a time, (Hence if a list is poorly structured and too long you will get video distortions as the OP runs out of time/bandwidth to render a scanline.) Think of the bitmaps as being stickers, the OP list is an ordered list of how those stickers are to be affixed to the screen, and at what position. So whatever is 1st on the list is at the back, then the next image is overlayed, and then the next and so on. If one of those bitmaps is a frame buffer from a 3D renderer it will get placed like any other, dependant on where it is in the list.

Well the Neo Geo was an interesting find and seem worth posting... I probably should've put in the OP topic, but I posted regardless and thought it a very cool video to post. It really put sprite based graphics in to a great visual perspective; I'll have to post this video in the OP topic. It's one more reason I believe Neo Geo games can probably be ported to the Jaguar even if the GPU to main memory it would be slow, but it would be fast enough emulate the Neo Geo well enough to pull off 30 or more frames per second.

RE GPU running in main, these reads are only 32/16bit. All GPU instructions are 16 bits in size (with the one exception of MOVEI, which has a 32-bit data portion following the instruction). The GPU is a 32-bit device; it cannot access memory as a single 64-bit read. It probably does prefetch 32 bits' worth of data (but I am not 100% on this), and due to the physical limitations of DRAM vs SRAM it suffers performance whilst accessing main RAM just like any other device would. Hence only fairly infrequent (in terms of game code) calls are best suited for it.

Well a few stack of OP lines about 8 or 10 lines probably wouldn't take up a whole lot of memory, sounds like a good place to start... I know I mentioned 500kb earlier because that seems like a reasonable number to work with, but really less then half that number seems even more reasonable. Point is I'm willing to compromise on putting main ram to good use if it means getting a steady stream of lines streaming back to the OP via the blitter. By saving lines in main memory, that's 4 to 5 times the line that GPU cache currently handles so yea there's a bit of the "chasing the beam" affect going on, but it's done in a way the Jaguar quite possibly handle steadily using OP lines if it's done right. The GPU can pre-render a 3D image and have the blitter stream it back to the OP; it's like streaming a 2D image if the lines are in place for streaming. I saw the numbers you posted at almost 10 times the wait? Well it's close to 8 times the wait time, which possibly makes the GPU run as fast as the Motorola 68K using RISC instructions, which is double the speed of CISC. The access to main would only be for a very short time using only almost less then 30 kilobytes if i was to use between 8 to 10 lines; the only issue would be texture mapping, sound, anything that would take a lot of memory, which I'm looking into a procedural method of graphics... Maybe some kind of sprite pattern thing using dithering or something; make good use of the Blitter ability to copy really fast.

I did do a test of the speed difference and posted the code I used and the results on our website: https://www.u-235.co.uk/gpu-in-main-science/ You can try that yourself; the code is there and explained, so you can see the speed differences for yourself. (I haven't had time to work on any of the other ideas mentioned in that article, alas.)

To be honest I haven't had any time to do a lot of the stuff I'd like to do... Don't know if I need to put this out there, but I spent the last 8 or 9 weeks recouping from a knee surgery I had after I fell and ripped a muscle. Things are getting better now, but man what a slow down weeks prior; still recouping, but I'm definitely a whole lot better off today... Surgery is no joke. lol That's life though, once I'm fully up and going, I have a family waiting to take up even more of my time. What can you do? Gotta make time for my hobbies. :lol:

Edited June 11, 2018 by philipj

Zerosquare · June 12, 2018

Look, if you want to convince us that your theories are good, don't tell us how great they are. Especially when you have no prior experience on the console.

Instead, implement them and show us they work.

LinkoVitch · June 12, 2018

Well a few stack of OP lines about 8 or 10 lines probably wouldn't take up a whole lot of memory, sounds like a good place to start... I know I mentioned 500kb earlier because that seems like a reasonable number to work with, but really less then half that number seems even more reasonable. Point is I'm willing to compromise on putting main ram to good use if it means getting a steady stream of lines streaming back to the OP via the blitter. By saving lines in main memory, that's 4 to 5 times the line that GPU cache currently handles so yea there's a bit of the "chasing the beam" affect going on, but it's done in a way the Jaguar quite possibly handle steadily using OP lines if it's done right. The GPU can pre-render a 3D image and have the blitter stream it back to the OP; it's like streaming a 2D image if the lines are in place for streaming. I saw the numbers you posted at almost 10 times the wait? Well it's close to 8 times the wait time, which possibly makes the GPU run as fast as the Motorola 68K using RISC instructions, which is double the speed of CISC. The access to main would only be for a very short time using only almost less then 30 kilobytes if i was to use between 8 to 10 lines; the only issue would be texture mapping, sound, anything that would take a lot of memory, which I'm looking into a procedural method of graphics... Maybe some kind of sprite pattern thing using dithering or something; make good use of the Blitter ability to copy really fast.

The 5200/7800 comment was purely just a bit of extra info, not saying you did claim that. And yeah the NeoGeo vid is interesting, but not really relevant.

I don't think you have grasped the hardware tbh, no offence intended, but I seriously think you are over complicating things. First up, if the GPU is pre-calculating a few lines, where is it storing this pre-calculated data? and how is it being stored? The most efficient storage method would be as pixel data, which may as well be written into a frame buffer by the blitter to paint it, and therefore the whole scene.

Calculating just a few lines from a 3D scene, with all the various trimming etc., just adds more computations or memory requirements to store the various data. What happens if the scene is too complex for a small section? The GPU runs out of time and a portion is incomplete. Each of these small groups of scanlines would ideally need to be rendered before the 1st scanline is needed by the OP. The OP will not just sit and wait for data; it is a hungry beast that must have food to eat constantly (or be hibernated).

I don't understand why you think the blitter putting data into the OP will improve anything. Configuring the blitter is significantly more complicated than giving the OP a list and letting it do its thing! Plus you would also need to time the reads and writes to the OP's scanline buffer correctly, and rather than your system spending its resources generating an image, it would be spending them configuring the blitter each scanline to copy the data from RAM to the OP, so you'd have even less time to render these scanlines. Both the OP and the blitter have full 64-bit bus access; the OP is the highest priority device on the bus, it is king of the bus. Just point the OP at some RAM and set it loose.

RE the timing issues, I think a 10x slower GPU is still significantly faster than the 68K; it can perform 32-bit fetches for one, and processes faster. That is the point of GPU running in main: it is quicker (in theory) than a 68K doing a similar task, but it runs ~10x slower than the GPU running in its local SRAM. The 68K is running at half the clock speed, with half the bus width, and as you say is CISC. (IMHO the RISC on the GPU isn't the most efficient RISC implementation; there are others which are much nicer and more elegant, MIPS for example.)

OK, it's almost 9 times slower in that example (8.7 ≈ 9 times, not 10). 10x is a rounder figure in my mind.

pacman000 · June 20, 2018

I remember seeing this on YouTube years ago, & I always thought it looked beautiful. Glad to know it wasn't killed by technical limits; sad to know you've stopped working on it for other reasons. If you're still here, please start again; what you had was stunning.

CrazyChris · June 17, 2020

Atari_Owl,

Can you please release this demo so we can try it out on our Jaguar GameDrives?

I would hate to see this fade into obscurity.

Chris

KidGameR186496 · June 17, 2020

It would be cool to see the E-JagFest demo being released online but I don't see that happening anytime soon sadly...

CrazyChris · June 25, 2020

Were there other people that worked on this demo?

walter_J64bit · June 25, 2020

16 minutes ago, CrazyChris said:

Were there other people that worked on this demo?

I think it was just Atari_Owl.
