Road Rash pre-alpha on Jaguar at 30 fps

Road Rash GPU 30 fps

456 replies to this topic

#451 JagChris ONLINE  

JagChris

    River Patroller

  • 3,385 posts
  • Location:Oregon

Posted Wed May 16, 2018 11:32 PM

Y'all need to lighten up. Geez.

#452 VladR OFFLINE  

VladR

    Stargunner

  • Topic Starter
  • 1,272 posts
  • Location:Montana

Posted Thu May 17, 2018 12:45 AM

Y'all need to lighten up. Geez.

Good luck with that :lol:

 

Where's that endless optimism coming from :) ?

 

You started to believe in humanity (in jaguar forum, nonetheless!) or something :lolblue: ?



#453 philipj OFFLINE  

philipj

    Moonsweeper

  • 426 posts
  • Location:Birmingham, Alabama

Posted Thu May 17, 2018 4:16 AM

There's one more thing I forgot to mention about "Project M" on the Atari 130XE... The textures used in the "Wolfenstein" demo: they didn't use any bitmaps, but were done in assembly/binary code, which I thought was pretty interesting. It's a very noteworthy mention; I wonder if the Jag could do something procedural with texture mapping for the sake of speed and memory? That would certainly help tilt the unbalanced nature of the buggy Jaguar if the textures were very small, compact, yet machine-code fast for that on-the-fly real-time rendering.



#454 VladR OFFLINE  

VladR

    Stargunner

  • Topic Starter
  • 1,272 posts
  • Location:Montana

Posted Thu May 17, 2018 12:42 PM

... The textures used in the "Wolfenstein" demo: they didn't use any bitmaps, but were done in assembly/binary code, which I thought was pretty interesting.

Of course, on 6502, you can't beat the

LDA #$13    ; 2 cycles
STA $1300   ; 4 cycles

combo, which is just 6 cycles. That's the fastest possible way on 6502 - you just unroll the data into code.
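To make the "unroll the data into code" idea concrete, here is a minimal sketch of a generator that turns texel bytes into LDA/STA pairs. The function names, the sample texel values and the base address are made up for illustration; only the two instructions and their cycle counts come from the snippet above.

```python
# Sketch: turn texture bytes into unrolled 6502 LDA/STA pairs.
# unroll(), total_cycles() and the sample data are hypothetical helpers.

LDA_IMM_CYCLES = 2  # LDA #imm
STA_ABS_CYCLES = 4  # STA abs


def unroll(texels, base_addr):
    """Return assembly lines that store each texel at consecutive addresses."""
    lines = []
    for offset, value in enumerate(texels):
        lines.append(f"LDA #${value:02X}    ; {LDA_IMM_CYCLES} cycles")
        lines.append(f"STA ${base_addr + offset:04X}   ; {STA_ABS_CYCLES} cycles")
    return lines


def total_cycles(texel_count):
    """Cycles spent writing texel_count texels with one LDA/STA pair each."""
    return texel_count * (LDA_IMM_CYCLES + STA_ABS_CYCLES)


if __name__ == "__main__":
    for line in unroll([0x13, 0x27, 0x13], 0x1300):
        print(line)
    print("cycles:", total_cycles(3))
```

The cost is purely linear: every texel is its own pair of instructions, which is why the amount of RAM available for code matters so much for this trick.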

 

I wonder if the Jag could do something procedural with texture mapping for the sake of speed and memory?

Oh, no. Faaaar from it. The 8-bit Atari is vastly superior to Jaguar in this regard:

 

1. You can have 4 MB of full-speed unrolled code on 8-bit 6502 (via PORTB bank-switching)

2. You can have 4 KB of full-speed unrolled code on 64-bit Jaguar

 

The little 8-bit Atari has literally over 3 orders of magnitude more accessible RAM for fast unrolled code. It's literally an order of magnitude easier to handle bank switching on 8-bit than it is to **properly** swap code into GPU on Jaguar.
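The "over 3 orders of magnitude" figure can be checked with a couple of lines of arithmetic. This sketch only uses the two sizes quoted above (4 MB banked, 4 KB GPU local RAM); the variable names are mine.

```python
import math

BANKED_6502_BYTES = 4 * 1024 * 1024  # 4 MB via PORTB bank switching (figure from the post)
GPU_LOCAL_BYTES = 4 * 1024           # 4 KB of GPU local RAM (figure from the post)

ratio = BANKED_6502_BYTES // GPU_LOCAL_BYTES
orders = math.log10(ratio)

print(ratio, round(orders, 2))
```

The ratio is 1024, so the log10 lands just above 3.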

 

Now let's consider the speed of access to the new code:

1. On 8-bit Atari, the new bank is available within 1 cycle. In words: ONE cycle

2. On 64-bit Atari, this is what you need to do:

- compute (or look up) the size of new code chunk

- Turn GPU Off

- Set up ~10 registers for Blitter

- Initiate Blit

- Initiate endless loop waiting for Blitter's mighty 64-bit snail blitting to finally finish, thus killing another processor for substantial period of time

- Set the PC for GPU

- Turn GPU On

- Of course, this presumes you are aware of the SMAC assembler bugs and issues, and what happens if you foolishly attempt 32-bit aligning of the GPU code - that was one fun discovery : )

 

Also, on GPU, the great majority of instructions take 3 cycles (so, no real win compared to 8-bit Atari), and with pipelining, you usually get to about 2.2 cycles average throughput. I can't stress enough, however, how many times I wasn't able to get under 2.0 due to those stupid HW bugs, where rearranging instructions reveals the ugly bugs, hence you either have to insert NOPs or use a much slower combination of instructions (but one which actually executes).

 

 

Just because you compiled legal GPU code, it doesn't mean it will execute. More than likely, not :)

 

If jag docs are to be believed, MOVEQ only takes 2 cycles, instead of 3, thus the closest you can get to the above LDA/STA combo in terms of speed is:

MOVEQ #$1F, r0  ; 2 cycles
STORE r0, (r1)  ; 3 cycles
ADDQ #4, r1     ; 3 cycles

Unfortunately, the allowed range of values for MOVEQ is only 0-31, so 32 colors are max. Of course, writing to GPU cache is always 32-bit, so you waste 24 bits (but, let's just say you are ok with this speed/size compromise for this particular case). I'm sure you noticed that you can't store a value to memory directly. All storage is indirect via registers (hence the third instruction).

And I benchmarked the other case - where you just pack all bytes together. It's much slower than just writing from GPU to main via STOREB.
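A rough side-by-side of the two unrolled sequences, as a sketch. The cycle counts are the ones quoted in this post (no pipelining overlap assumed); the byte sizes are the usual encodings (LDA #imm is 2 bytes, STA abs is 3 bytes, and each of the three GPU instructions is one 16-bit word). The "texels that fit" numbers are upper bounds that pretend the whole 4 KB or 4 MB holds nothing but unrolled texel code, which is of course not the case in a real engine.

```python
# Per-texel cost of the two unrolled sequences.
# Tuples are (name, cycles, bytes); helper names are hypothetical.

SEQ_6502 = [("LDA #imm", 2, 2), ("STA abs", 4, 3)]
SEQ_GPU = [("MOVEQ", 2, 2), ("STORE", 3, 2), ("ADDQ", 3, 2)]


def cost(seq):
    """Return (total cycles, total bytes) for one texel."""
    cycles = sum(c for _, c, _ in seq)
    size = sum(b for _, _, b in seq)
    return cycles, size


def texels_that_fit(code_bytes, seq):
    """Upper bound on texels if code_bytes held only this sequence."""
    return code_bytes // cost(seq)[1]


def moveq_ok(value):
    """MOVEQ immediates are limited to 0-31, as noted above."""
    return 0 <= value <= 31


print("6502:", cost(SEQ_6502), texels_that_fit(4 * 1024 * 1024, SEQ_6502))
print("GPU: ", cost(SEQ_GPU), texels_that_fit(4 * 1024, SEQ_GPU))
```

With these figures the 6502 pair costs 6 cycles and 5 bytes per texel, against 8 cycles and 6 bytes for the GPU triple.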

 

So, no - not even the 64-bit Jaguar can beat the little 8-bit micro from the '70s in terms of this efficiency, as crazy as it sounds. The only advantage the Jag has in this regard is higher clock speed. But it's still brutally inefficient [per clock cycle] compared to 6502.

 

 

 

That would certainly help tilt the unbalanced nature of the buggy Jaguar

Nope, I'm sorry, you're unfortunately wrong on this one, as that "feature" of jag just plain sucks. What was supposed to be an alpha version of the chip, Atari deemed as production candidate...

 

Nothing can "tilt" or "explain" or ease the bugs. You literally can't just think of the code as you are writing it; you must think of all the HW bugs as you write and adjust the algorithm around them.

I used to just keep another text file open, right next to the GPU source code window, for instant verification if the code I wrote is remotely runnable or not.

 

That GPU-bugs file got so big (recently grew, because of the DSP-specific bugs), I actually have to scroll it. I think I'll have to buy a bigger TV because of the GPU bugs, mine's only 50"...



#455 philipj OFFLINE  

philipj

    Moonsweeper

  • 426 posts
  • Location:Birmingham, Alabama

Posted Thu May 17, 2018 9:13 PM

 

Oh, no. Faaaar from it. The 8-bit Atari is vastly superior to Jaguar in this regard:

 

1. You can have 4 MB of full-speed unrolled code on 8-bit 6502 (via PORTB bank-switching)

2. You can have 4 KB of full-speed unrolled code on 64-bit Jaguar

Right... And the "GPU-to-main work around" requires full 64-bit due to a bug with the multiplexer not getting enough current when making jumps, which means more cycles lost, but is still somewhat useful provided the access isn't in a tight loop, lest you wind up "Hammering the GPU". Makes for great limited use like AI and such, but still very limited when it comes to graphics for texture mapping in a speedy manner. The GPU has a great ALU on it that's fast; it seems like procedural textures would be ideal considering the two 16-bit multipliers running in parallel, but I guess that RAM issue is always going to slow things down at some point.

 

Here's something from the Jag manual concerning its math capabilities...

The GPU is also intended to perform rapid floating-point arithmetic. It has no floating-point instructions as such,
but has some specific simple instructions that allow a limited precision floating-point library to be capable of in
excess of 1 MegaFlop.

One of the reasons I chose 2.5D for the GPU is because of the fast math; it would be overkill for that sort of thing... I still stay hopeful, but it's not all that surprising considering that good Jag programmers all express their frustration with the system and its bugs.

 

Unfortunately, the allowed range of values for MOVEQ is only 0-31, so 32 colors are max. Of course, writing to GPU cache is always 32-bit, so you waste 24 bits (but, let's just say you are ok with this speed/size compromise for this particular case). I'm sure you noticed that you can't store a value to memory directly. All storage is indirect via registers (hence the third instruction).

 

 

At 32 bits that would be 8 bits x 4, but would mean 4 times the data in cache... Another reason I would refer to the DSP, 68K or both to do some 3D work prior to the GPU handling whatever the outputs are. By the time it reaches the GPU, a great majority of the work is already done. Simply let the GPU do some fast 2D rendering based on the pre-calculated stuff since the DSP has full access to main RAM at 16 bits. But the info you're dishing is very helpful, only confirming a lot of stuff that I hear from Jag programmers.

 

 
That GPU-bugs file got so big (recently grew, because of the DSP-specific bugs), I actually have to scroll it. I think I'll have to buy a bigger TV because of the GPU bugs, mine's only 50"...

 

 

I read somewhere that the Jaguar can do 720x480... Ever tried another resolution size? Your vertical res is low, but you use a lot of lines within that low res. If the Jag can handle that many lines, seems like something a little more modest in resolution might be in order, which could help to free up some cycles.


Edited by philipj, Thu May 17, 2018 9:16 PM.


#456 phoenixdownita OFFLINE  

phoenixdownita

    River Patroller

  • 3,002 posts

Posted Thu May 17, 2018 10:30 PM

Of course, on 6502, you can't beat the

LDA #$13    ; 2 cycles
STA $1300   ; 4 cycles

combo, which is just 6 cycles. That's the fastest possible way on 6502 - you just unroll the data into code.

 

.....

 

Now let's consider the speed of access to the new code:

1. On 8-bit Atari, the new bank is available within 1 cycle. In words: ONE cycle

.....

Forgive my ignorance but in order to switch banks are you not supposed to execute something like:

 

LDA #BANKNUM     ; bank-select value
STA BANKSWADDR   ; $D301 most likely, not sure about the 4MB board you are referring to

 

at every bank switching juncture? 

Sure beats setting up the blitter to copy 4K, but it is not exactly one cycle either.

Does the 4MB expansion perform autoswitching?
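For a sense of scale, here is a back-of-the-envelope sketch of how much that LDA/STA switch costs relative to the unrolled code inside one bank. The 6 cycles are the standard LDA #imm (2) plus STA abs (4); the 16 KB bank window is my assumption (it is the window size on a stock 130XE, and I have not checked the 4 MB board), and the sketch ignores the jump into the banked code.

```python
SWITCH_CYCLES = 2 + 4      # LDA #imm + STA abs
BANK_BYTES = 16 * 1024     # assumption: 16 KB bank window
PAIR_BYTES = 2 + 3         # LDA #imm (2 bytes) + STA abs (3 bytes)
PAIR_CYCLES = 2 + 4        # one unrolled texel write

pairs_per_bank = BANK_BYTES // PAIR_BYTES
work_cycles = pairs_per_bank * PAIR_CYCLES
overhead = SWITCH_CYCLES / (work_cycles + SWITCH_CYCLES)

print(pairs_per_bank, work_cycles, f"{overhead:.4%}")
```

So under those assumptions the switch is 6 cycles rather than 1, but it is a tiny fraction of the work done per bank.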



#457 VladR OFFLINE  

VladR

    Stargunner

  • Topic Starter
  • 1,272 posts
  • Location:Montana

Posted Fri May 18, 2018 1:28 PM

Right... And the "GPU-to-main work around" requires full 64-bit due to a bug with the multiplexer not getting enough current when making jumps, which means more cycles lost, but is still somewhat useful provided the access isn't in a tight loop, lest you wind up "Hammering the GPU". Makes for great limited use like AI and such, but still very limited when it comes to graphics for texture mapping in a speedy manner.

Those rules are actually not 100%. For some time, it may seem so. I spent about 2 weeks with GPU-in-main code. Wrote a LOT of such code. Tried having the whole road rendering from Main. It is very unreliable. Just do a simple loop of 1,000 and most of the time it won't get to the end. Same executable, just run it a few times.

In the end, I couldn't trust it, regardless of how much effort I spent in aligning all jumps and confirming by doing HexDump that those values are, indeed, at requested aligned addresses - because you can't really trust assembler (hard lesson, btw - about 2 days time worth). Now I have run-time hexDump on a keypad, so at any time I can confirm that the code I wrote, is indeed, at the address I want it.

 

In the end, the benefits of gpu in main are very marginal - its greatest benefit is actually in having the GPU-debugging code - e.g. you don't have to fill the tiny 4 KB with number-writing functionality. But, now that I do run-time debugging on 68000, I don't need it anymore for debugging.

 

The second greatest benefit of gpu-in-main is that you can have all the outer loops, that take less than 1% of frame time (but several pages of code inside precious 4 KB), slowly execute from main - as it's still much faster than doing code swap. And you gain the space for some new feature or optimization, that you didn't have the space for, previously.

 

But, that was when I was just single-threaded. After gpu-in-main proved unreliable, I started using the computing power of 68000 and did some drastic refactoring, but gained enough space in 4 KB to implement some major optimizations, that were impossible before (because there was no available space for them in 4 KB, with all other code there).

 

 

 

I'm sure you could write a synthetic benchmark that would "show" how 68000 slows the GPU down when it's on. But even ONE 4 KB swap removes more MIPS from your engine, than having 68000 banging on the bus all the time.

Somehow, you never hear those people mention this "tiny technical detail" :lol:

 

 Here's something from the Jag manual concerning its math capabilities...

The GPU is also intended to perform rapid floating-point arithmetic. It has no floating-point instructions as such,
but has some specific simple instructions that allow a limited precision floating-point library to be capable of in
excess of 1 MegaFlop.

Forget floating point. It's an order of magnitude slower than fixed-point or integer. What's 1 MFlop compared to 15 MIPS? I don't even use fixed point everywhere. I just use plain integer in tight loops, e.g. just one instruction, no need for shifting. That is the fastest way. Yes, it needs some experimenting, and refactoring, and adjusting on the input side, but you gain tremendous performance back for a little work.
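To illustrate the difference between the two inner loops, here is a minimal sketch (not the actual engine code). The 16.16 format and the function names are assumptions for the example: the fixed-point version needs a shift every iteration to get the integer part, while the plain integer version assumes the inputs were pre-scaled so the step is a whole number and only an add remains.

```python
FRAC_BITS = 16  # assumption: 16.16 fixed point


def fixed_point_steps(start, step_fp, count):
    """16.16 fixed point: add, plus a shift each time to get the integer part."""
    acc = start << FRAC_BITS
    out = []
    for _ in range(count):
        out.append(acc >> FRAC_BITS)  # the extra shift per iteration
        acc += step_fp
    return out


def integer_steps(start, step, count):
    """Plain integer: inputs pre-scaled so the step is whole; one add per iteration."""
    value = start
    out = []
    for _ in range(count):
        out.append(value)
        value += step
    return out


print(fixed_point_steps(0, 8 << FRAC_BITS, 4))
print(integer_steps(0, 8, 4))
```

Both loops produce the same values when the step is integral; the fixed-point one only pays off when you genuinely need fractional steps.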

 

 

 ... Another reason I would refer to the DSP, 68K or both to do some 3D work prior to the GPU handling whatever the outputs are. By the time it reaches the GPU, a great majority of the work is already done. Simply let the GPU do some fast 2D rendering based on the pre-calculated stuff since the DSP has full access to main RAM at 16 bits.

Yes, that's exactly what I've been trying to explain for the last half decade. DSP is basically another GPU in terms of raw computing power. Nobody is going to drive Blitter from DSP (it's not on the same 64-bit bus as GPU is, after all). But guess what. My transformed coordinates are 16-bit, so it doesn't matter that DSP only has 16-bit access to RAM. It's exactly identical in speed to the GPU (save for DMA priority) for what I need it to do.

Best thing, if I remove all the transform code from GPU, I will gain enough space for code to handle vertical Blitter stripes, which will bring additional performance boost (though, granted, it's mostly for Quake-style 3D scenes, with lots of thin vertical walls (e.g. pillars)).

The interrupts will take care of playing audio in parallel, which doesn't really consume all that much MIPS anyway.

 

 

 

I read somewhere that the Jaguar can do 720x480... Ever tried another resolution size? Your vertical res is low, but you use a lot of lines within that low res. If the Jag can handle that many lines, seems like something a little more modest in resolution might be in order, which could help to free up some cycles.

All of my recent vids are in 768x200. The video chip may display only 720 of those 768, but Blitter, OP, GPU, 68000, all need to work with 768 pixels (even though you don't physically see 48 of them), as OP does not offer direct 720 width.

And no, I tried. It's not faster to ignore those 48 px by adjusting DWIDTH in the OP phrase.
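For reference, the share of the framebuffer that is worked on but never shown can be computed from the numbers above (768 wide, 720 visible, 200 lines). The variable names are mine.

```python
FRAMEBUFFER_WIDTH = 768  # width everything has to work with (from the post)
VISIBLE_WIDTH = 720      # width the video chip may actually display (from the post)
HEIGHT = 200             # lines (from the post)

hidden_per_line = FRAMEBUFFER_WIDTH - VISIBLE_WIDTH
hidden_per_frame = hidden_per_line * HEIGHT
hidden_fraction = hidden_per_line / FRAMEBUFFER_WIDTH

print(hidden_per_line, hidden_per_frame, f"{hidden_fraction:.2%}")
```

That is 48 px per line, or 6.25% of each line.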





