
blitter access to line buffer



So I read (on this website) that blitter access to the GPU's internal SRAM is fastest if we program the interrupt, set up the blitter, and then halt the GPU. I think that if the return address is fixed and the register contents are unimportant, we can save some instructions in the interrupt handler: instead of returning from the interrupt, the interrupt flag is just cleared and the GPU is halted again. I see that this gives the blitter exclusive access to the SRAM.

 

But maybe I have stuff for the GPU to do (T&L, occlusion and other culling). What if I use the line buffers? When I display 320x200 I have 60 lines where the line buffers are unused by the OP (NTSC; PAL would be 320x240). They are probably not switched around then, and there is an address bit to select one of them. Has anyone checked how fast read access to them is? For texels needed by many pixels (GL_REPEAT, zoom) this would be great. There is an address gap between the two buffers, so they cannot store a single large texture, but one of them could act as a proxy for the framebuffer (including the z columns). That would be great for roughly 1:1 scaled textures. Or it could hold mipmaps, to avoid sorting by texture on a mipmap level, or to avoid a perfect sort by texture, especially with triangles: we could store 4 triangular textures. Likewise, with 320 px of CRY there are 200 px of space left in the line buffer, so each line a small textured triangle could be drawn. This would be great with mipmaps and triangles on the smaller maps.

I thought about rendering directly into the line buffer. But those fill-ins need to navigate around existing graphics, so I need z-buffer access, and I fail to find any pattern or output function to achieve it. Maybe the GPU can do the z-interpolation, read z phrases from DRAM, and assemble spans (in internal SRAM) for the blitter. Aligned pixels, words, and phrases can be written directly by the GPU (software rendering). Spans which zoom into a single texel are rendered as solid color. So every scanline this has to be done first to avoid artifacts, then the line buffer is used as texture source, and then, independent of the scanlines (if enough time is left), something else is blitted DRAM to DRAM. Since I have so many constraints on the rendering order, I cannot render back to front; I would need the z-buffer (phrase-cleared, bus hog on the first blank line). With the blitter this at least does not introduce page misses, though in single-pixel mode it still hurts that a whole phrase of z values is read and three of them are discarded.
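The span-assembly idea can be sketched as a behavioral model: clip one span against a z-buffer row and emit only the visible sub-spans for a later fill pass. This is a minimal TypeScript sketch, not Jaguar code; all names are invented.

```typescript
// Hypothetical sketch: clip one span against a z-buffer row and emit the
// visible sub-spans that a blitter pass would then fill.
type Span = { x0: number; x1: number; z0: number; z1: number };

function visibleSubSpans(span: Span, zRow: number[]): Span[] {
  const out: Span[] = [];
  const dz = (span.z1 - span.z0) / (span.x1 - span.x0); // z-interpolation step
  let start = -1;
  for (let x = span.x0; x < span.x1; x++) {
    const z = span.z0 + dz * (x - span.x0);
    const visible = z < zRow[x]; // closer than what is already stored
    if (visible && start < 0) start = x;
    if (!visible && start >= 0) {
      out.push({ x0: start, x1: x, z0: span.z0 + dz * (start - span.x0), z1: z });
      start = -1;
    }
  }
  if (start >= 0) {
    out.push({ x0: start, x1: span.x1, z0: span.z0 + dz * (start - span.x0), z1: span.z1 });
  }
  return out;
}
```

A span that crosses a nearer object in the z-buffer splits into two sub-spans, one on each side of it.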

Very large zoom in is rendered flat shaded into external DRAM because this is the fastest way to check the z-buffer.


Don't take this personally, but:

1) you're unknown in this community

2) you're promising groundbreaking stuff
3) you don't have anything to show for it

 

This has happened many times before, and in every case it ended up being a waste of time. 

 

So, a piece of advice: get simpler stuff working first. It'll help you find out things you may not have thought about, and make other developers much more willing to consider and discuss your ideas.


On 1/19/2022 at 4:37 AM, Zerosquare said:

Don't take this personally, but:

1) you're unknown in this community

2) you're promising groundbreaking stuff
3) you don't have anything to show for it

 

This has happened many times before, and in every case it ended up being a waste of time. 

 

So, a piece of advice: get simpler stuff working first. It'll help you find out things you may not have thought about, and make other developers much more willing to consider and discuss your ideas.

I know. I lurked here for years and also somewhere (3DO?) ignored an email a year ago saying that all accounts needed to be renewed. I had some drastic changes in view. I am mostly interested in accessible demos. Like, you can read stuff about the GBA on the web, you don't need to install anything, and it all makes sense. Or you read about the PSX and its wobbly rasterizer which cannot do truecolor; the PSX has its strange quirks, too.

So it doesn't help when I just pile up more boilerplate code on GitHub. It all needs to be connected, to something like 3 discussions on this forum. I don't promise anything. I would want to write benchmarks where later anyone could easily look up some numbers and estimate the efficiency of an idea which might be needed to render a specific scene (organic vs. architecture, dynamic vs. static). Still, I don't want to test just anything. Also I don't own the hardware and would only trust the timing on an FPGA (there is one on GitHub), which for some reason is not compatible with MiSTer. So I downloaded the SDK and wanted to see how this Quake span-buffer thing would go into the DSP (an idea from this forum), to blit directly to the line buffer, but everything DSP is closed source.

Still, someone must have tried to access the line buffers in the borders, right? Vertical and horizontal. I can see that you should not really access the front buffer because it is not clearly specified which clock is applied when. But it is specified that we are allowed to access at least one of the buffers at any time (through this address). Even if the hardware for some reason continues to toggle them, there would be a use for small triangles. No other system (PC Engine, Genesis, GBA) has this direct access to the line buffer. This is what is interesting.

I've read that memCopy() is quite a complicated function because it needs to consider alignment. Likewise I can understand that a function for short spans is a lot of code and just has to have all these branches into the different methods. I don't even know if it fits into internal SRAM. I would never be proud of the source code; it would need to be generated in an automated way.
 

I read that Tim Sweeney wrote about beam trees, yet his engine does not seem to use them or be known for innovative occlusion culling. So to this day there is basically only one working occlusion-culling algorithm: portals, as used in Descent, Tomb Raider, and Duke Nukem 3D. Others need an aligned camera, or only achieve speed-up with hardware support (hierarchical z). So for any fill-rate-limited system (Jag, 32X, GBA, N64) I first need to write a demo (in TypeScript) about the beam tree. There I would create some statistics about memory accesses, divisions, multiplications, branches... So don't expect any running Jag code from me before that. But the Jag is so hard to code for and so buggy that you need to let the hardware specs simmer in the back of your mind to digest them.
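The statistics idea could start as small as this: an occlusion query that counts its own primitive operations, so different culling schemes can be compared by numbers. A 1D span buffer stands in for the beam tree here; all names are invented.

```typescript
// Instrumented visibility query against a 1D span buffer of already-drawn
// [x0, x1) intervals on one scanline. The counters are the point: they are
// the "statistics about memory access, branches" the demo would collect.
const stats = { reads: 0, compares: 0 };

function isPartlyVisible(
  occupied: Array<[number, number]>,
  x0: number,
  x1: number
): boolean {
  let gaps: Array<[number, number]> = [[x0, x1]]; // candidate visible pieces
  for (const [a, b] of occupied) {
    stats.reads++; // one span fetched from the buffer
    const next: Array<[number, number]> = [];
    for (const [g0, g1] of gaps) {
      stats.compares += 2;
      if (g1 <= a || g0 >= b) { next.push([g0, g1]); continue; } // no overlap
      if (g0 < a) next.push([g0, a]); // left remainder stays visible
      if (g1 > b) next.push([b, g1]); // right remainder stays visible
    }
    gaps = next;
  }
  return gaps.length > 0;
}
```

A real demo would replace the linear scan with a tree and count its branches too; the harness stays the same.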


You should benchmark it. It's a fun way to dip your toes into Jaguar development, and you'll learn more in one afternoon than you can pondering ideas for a year.

 

Spoiler: Reading the line buffer is a lot slower than you're thinking, since it only has a 16-bit read port.


You can also blit into the CLUT RAM, which is more convenient than the line buffer. But it has the same slow 16-bit read port.

 

Blitting into the line buffer is still useful, if you don't try to read it back. You can 'race the beam', writing pixels right before they're displayed.


1 hour ago, kskunk said:

You should benchmark it.

You should read the code in his github before replying ;)

 

{

ADD 8,3
ADD 9,5  ; Bresenham
; pipeline. Maybe do both edges at once like those two z values
JR N,less_jumps  
ADDQ 1,3  ; one px more to the side
ADD 19,5  ; Bresenham

SUBQ 1,2
}
JR C, return

}
}else{
    JR C ; this is for V   ;  Really:  J ~C  outOfhere
    {

....

 

.......

 

.............

 

<tumbleweed>


7 hours ago, kskunk said:

You should benchmark it. It's a fun way to dip your toes into Jaguar development, and you'll learn more in one afternoon than you can pondering ideas for a year.

 

Spoiler: Reading the line buffer is a lot slower than you're thinking, since it only has a 16-bit read port.


You can also blit into the CLUT RAM, which is more convenient than the line buffer. But it has the same slow 16-bit read port.

 

Blitting into the line buffer is still useful, if you don't try to read it back. You can 'race the beam', writing pixels right before they're displayed.

That for example would be great to answer. The blitter has a pixel mode and I would go with conventional CRY, so the pixels are 16 bit, and I get one pixel per access and thus one cycle. Pixel mode is engaged for read and write. Luckily the line buffer has 16-bit pixel-precise addressing for read and write, so I can also write individual pixels without a read-modify-write cycle.

Thanks for the CLUT RAM tip. If I don't use 8-bit scaled bitmaps, because I cannot utilize the z-buffer in a complex 3D scene, the CLUT RAM is unused. Especially outside the visible area, or if I have a cockpit and opt for CRY for it. The CLUT seems to be slightly smaller than the line buffer, but it fits a 16x16 px CRY texture. I think a lot of games on the Jag (Iron Soldier) love explosion and smoke effects done by the OP. I don't get how they get the layers correct. It feels like they only check the central pixel.
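The size claim is easy to check, under the post's implied assumption of a CLUT holding 256 entries of 16 bits each (512 bytes); the addressing sketch and its names are mine.

```typescript
// Quick arithmetic check for the CLUT-as-texture idea (assumed CLUT size):
const CLUT_ENTRIES = 256;            // assumed: 256 entries, one 16-bit word each
const TEX_SIDE = 16;                 // 16x16 CRY texture
const texels = TEX_SIDE * TEX_SIDE;  // 256 texels, one CLUT entry per texel

// GL_REPEAT-style addressing of a texel stored linearly in the CLUT
function clutIndex(u: number, v: number): number {
  return ((v & (TEX_SIDE - 1)) * TEX_SIDE) | (u & (TEX_SIDE - 1));
}
```

The AND with `TEX_SIDE - 1` gives the repeat-wrap for free, which is the whole appeal of a power-of-two texture in a fixed-size RAM.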

 

The Object Processor can read a texture over the 64-bit bus and write to the line buffer without interference between the two. The blitter has to switch between the two, and it also waits one cycle on one of those switches. The blitter has 32-bit write access to the line buffer, but I think there is no latch; the blitter cannot blurt out 64 bits, have the line buffer's input latch take them, and release the bus in the next cycle for the memory controller.

 

At first I did not understand bus hog: why would I allow a blitter which needs 5 cycles for one memory access to hog the bus? But since the GPU is missing its package commands, we need to access memory via the blitter, so we need to keep it busy. Blitter registers are not double-buffered and interrupts are slow, thus there is always a time when the blitter releases the bus. That is when the OP can use it. That is some 20 cycles = 10 memory cycles, and thus the fast-page mode of the memory is utilized.

In the cycles where the blitter cannot utilize the bus (for whatever reasons; I don't get it), we can still utilize the GPU, for example to load vertex data:

3. DSP at DMA priority
4. GPU at DMA priority
5. Blitter at high priority
6. Object Processor

Of course it would be great if we could do some work in the same page the blitter operates in: we could check the z values and copy min/max values into a binary tree for occlusion.

Edited by ArneCRosenfeldt
DMA priority

2 hours ago, CyranoJ said:

You should read the code in his github before replying ;)

 


{

ADD 8,3
ADD 9,5  ; Bresenham
; pipeline. Maybe do both edges at once like those two z values
JR N,less_jumps  
ADDQ 1,3  ; one px more to the side
ADD 19,5  ; Bresenham

SUBQ 1,2
}
JR C, return

}
}else{
    JR C ; this is for V   ;  Really:  J ~C  outOfhere
    {

....

 

.......

 

.............

 

<tumbleweed>

That was just to get a ballpark estimate of how many cycles JRISC needs. I just wrote down, from the top of my head, the algorithms which I might need and wanted to count the number of registers.

I found out that for a lot of stuff the pipeline is not an issue. But especially for memory access the instruction set is slow. I would need something like "replace hi-word with hi-word from", and lo from lo, and hi from lo, and all the combinations, like the MOV instruction on the 8086 with AL and AH. Like RORQ 16, but with two registers. LOADP and STOREP would need a version where the hi word goes through an implicit register (or just use pairs as on the 8086). The ISA already has implicit registers for an offset address and to store the instruction pointer on interrupt. The ISA already has instructions which need multiple cycles. The only new thing would be two register writes. Ah wait, MMULT writes to multiple registers already, and 8086 MUL and DIV could write to multiple registers, so it is not a new concept.
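The wished-for word-shuffle instructions are easy to pin down as 32-bit helpers; this is only a semantic sketch (the names are invented, JRISC has no such instructions).

```typescript
// 16-bit halves of a 32-bit "register"
const lo = (r: number) => r & 0xffff;
const hi = (r: number) => (r >>> 16) & 0xffff;

// replace dst's hi-word with src's hi-word, keep dst's lo-word
const hiFromHi = (dst: number, src: number) => (((hi(src) << 16) >>> 0) | lo(dst)) >>> 0;
// replace dst's hi-word with src's lo-word ("hi from lo")
const hiFromLo = (dst: number, src: number) => (((lo(src) << 16) >>> 0) | lo(dst)) >>> 0;
// replace dst's lo-word with src's lo-word
const loFromLo = (dst: number, src: number) => (((hi(dst) << 16) >>> 0) | lo(src)) >>> 0;
```

The remaining combinations (lo from hi, swaps) follow the same pattern; that is why a single instruction family with a two-bit selector would cover them all.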

I don't get why there is no AddCHi. I would need a CMPtoMask to let me check two z-buffer values and use the mask to pull the correct pixel data into the destination register. Basically the ISA is constructed in such a way that you cannot implement the blitter functionality in an efficient way and add a cache, because you are supposed to use their flawed hardware implementation.
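What CMPtoMask would have to do can be stated precisely; here is a behavioral sketch for two 16-bit z values packed into a 32-bit word (CMPtoMask is the post's invented name, not a real instruction).

```typescript
// Compare two packed 16-bit z values lane by lane; expand each "new is
// closer" result into a full 16-bit lane mask.
function cmpToMask(zNew: number, zOld: number): number {
  const loMask = (zNew & 0xffff) < (zOld & 0xffff) ? 0x0000ffff : 0;
  const hiMask = (zNew >>> 16) < (zOld >>> 16) ? 0xffff0000 : 0;
  return (hiMask | loMask) >>> 0;
}

// Use the mask to pull the winning pixel of each lane into the destination.
function selectPixels(pNew: number, pOld: number, mask: number): number {
  return ((pNew & mask) | (pOld & ~mask)) >>> 0;
}
```

This mask-and-merge pair is exactly the per-lane z-test the blitter does in hardware, which is why its absence from the ISA blocks a software reimplementation.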

I did complain about the ISA in the past. I mean, if you just browse the instructions without a specific intent: lost opportunities with many of them, where so many bits in the encoding are just ignored and the bus is left idle.


7 minutes ago, CyranoJ said:

Uh huh... and the family relationships between mother and father control which part of the 3d render pipeline? 

 

????????????????

Occlusion culling breaks the pipeline. You transform bounding volumes (their surfaces). You break down the object which covers the largest screen area until you reach the real geometry (maybe add some LOD here). If something is small and surely visible: just draw it, to not touch the vertices again. On the Jag the scanlines are software, so you render the edges and at the same time store the occlusion in a buffer. This span buffer needs an exception for when it becomes inefficient. The Jag has a quite fast z-buffer, so we can always just deactivate occlusion culling and not get artifacts. An efficient case is when spans of adjacent triangles have no gaps between them and become one as I leave the bounding volume.
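The "spans become one" bookkeeping is just interval fusion; a minimal TypeScript sketch (names mine):

```typescript
// Merge touching or overlapping [x0, x1) spans so the occlusion buffer stays
// small: adjacent triangles with no gap between their spans fuse into one.
function mergeSpans(spans: Array<[number, number]>): Array<[number, number]> {
  const sorted = [...spans].sort((p, q) => p[0] - q[0]);
  const out: Array<[number, number]> = [];
  for (const [a, b] of sorted) {
    const last = out[out.length - 1];
    if (last && a <= last[1]) last[1] = Math.max(last[1], b); // no gap: fuse
    else out.push([a, b]);
  }
  return out;
}
```

Fused spans are what make the buffer cheap again after an object has been fully rasterized, which is the efficient case described above.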

 

The only pipeline is that some parts are only rendered into the line buffer. This is not a natural pipeline (information flow); it is just a performance enhancement to reduce memory access, the same way sprites and the OP work. So if I really can use the CLUT, then on every scanline I have one texture free to render with. I guess that this is the limit. So in the pass into the back buffer I would see if I am going to fill a large area with a 16x16 texture. Then I don't do it: I use the area as an occluder (no exception anymore) and defer the rendering to racing the beam. On those lines I block the CLUT for any other textures. The blitter can flip the texture on load, so it may be possible to have two textures staggered in the CLUT. The line buffer has 32-bit write access and thus I could load textures at full speed; into the CLUT I can only load at half the speed.


7 hours ago, ArneCRosenfeldt said:

So I get one pixel in one access and thus one cycle.

The main bus is never that fast. All reads are minimum 2 cycles, and often more. You really need to benchmark on real hardware to get a handle on this. Ballpark estimates will lead you down the wrong road.

Edited by kskunk

1 hour ago, kskunk said:

The main bus is never that fast. All reads are minimum 2 cycles, and often more. You really need to benchmark on real hardware to get a handle on this. Ballpark estimates will lead you down the wrong road.

The problem with real hardware is that it is precious and I don't live alone (wife with move-furniture-around attacks, dog, kids). Also, people wrote that the documentation is indeed accurate. And it mentions the 2 cycles: for the memory controller to get a phrase out of the same page of external DRAM. So there seems to be a bug in Jerry, but generally the documentation claims that the 64-bit bus can even change the bus master without dropping a cycle. The CPU is made using the same process in the same fab. Timing for LOAD is explicitly mentioned there:

  1. instruction decoding
  2. read source (address) reg
  3. probably send the address to internal SRAM <-- only here internal RAM is blocked and cannot be used for instruction fetch
  4. have the result in register

So the blitter does not need to decode; now let us concentrate on the bus occupation. I mean, I always find it strange that the address (it is like an HTTP request) is sent out and in the same cycle you get back the value (the HTTP response). Here I would expect a pipeline stage, not in the register file of the CPU. Can anybody read the netlist of the CPU and explain to me why I cannot funnel the result of the ALU directly back as input? A small exercise on paper showed me that a single cycle with a two-phase clock and write-back (or input) registers does just that (on ARM and MIPS).

 

The documentation states that the blitter needs a single cycle to process the data and then writes it back in the next cycle. I've seen the lengthy calculations for DRAM, but internally it is simple. Maybe this up-and-down network really produces 4 16-bit reads for a single pixel. I cannot read the netlist well enough: all those two-letter names, and the conversion to VHDL is longer and stripped of all comments. It would fit the many flaws of the system. Is that "what would be a better Jaguar" thread still open? I need the Jaguar as a way to explain hardware performance. As a child I wondered what the speed limit on a 6502 is; now I know better, it is all clear, and the 8080 is the better CPU, crippled by a bad package. The PSX strong-arms a dev into its way and is not even really good at it (low rendering quality), so you cannot even dream. I like the idea: use the smallest memory chips available (to keep the price down), but 8 of them, to read out 8 bytes at once and have time left after DAC readout. This is unlike the SNES or PSX with their many different memory chips. The Jag is low-latency, unlike the N64, because CPU, PPU, and blitter just get raw memory data. Yeah, you need a double-sided PCB; who cares, the C64 had that, every DIMM is that.

How do I even test the blitter read speed? I cannot disable destination writes. There is not even a 32-bit read target. All on-chip parts appear as 16 bit (why even, when they are 32 bit internally?). Two of them can be written at 32 bit. And RAM is 64-bit read and write.

 

Seems like now I have an interesting list of benchmarks to run. I still think about reading back the z-buffer, but already the unpacking of the doublets and then software z-interpolation for the back face of the bounding volume eats a factor of two too many GPU cycles (because the GPU is not optimized for packed longs; only bits and words are well supported). Oh, scrap that: I can be clever and limit the z range to less than 16 bits, then after my comparison I have sign bits, mask them and add them to counters. So I know how much my occluder is covered in something like 5 cycles per pixel. The collision method of the blitter only works in pixel mode, and as always the blitter wants to write something. So I cannot gather the silhouette of a high-detail vector enemy in front of a large scenery, and thus cannot render anything into the line buffer.
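The sign-bit trick can be written out as a behavioral model: with z limited to 15 bits, the subtraction cannot overflow, so the sign bit of the difference is a clean 0/1 coverage flag (function names are mine).

```typescript
// Count how many pixels of an occluder's back face are covered by the scene:
// (zScene - zOccluderBack) is negative exactly when the scene is in front,
// so the sign bit can be masked out and added to a counter.
function coveredPixels(zScene: number[], zOccluderBack: number[]): number {
  let count = 0;
  for (let i = 0; i < zScene.length; i++) {
    const diff = (zScene[i] - zOccluderBack[i]) | 0; // both < 2^15: no overflow
    count += (diff >>> 31) & 1; // sign bit as 0/1, accumulated
  }
  return count;
}
```

On JRISC the same idea would do the compare, shift, and add in registers, which is where the rough 5-cycles-per-pixel estimate comes from.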


I'll be more direct.

 

There are two kinds of people discussing Jaguar programming in this forum:

1. those who spend years posting their theories (usually as stream-of-consciousness walls-of-text), yet have nothing to show for it
2. those who actually develop and release stuff

 

Needless to say, people in group #2 have very little interest in the theories of people in group #1.

 

The choice is up to you.

Edited by Zerosquare

4 hours ago, Zerosquare said:

I'll be more direct.

 

There are two kinds of people discussing Jaguar programming in this forum:

1. those who spend years posting their theories (usually as stream-of-consciousness walls-of-text), yet have nothing to show for it
2. those who actually develop and release stuff

 

Needless to say, people in group #2 have very little interest in the theories of people in group #1.

 

The choice is up to you.

Numberwang, numberwang, numberwang!!!

 

I like my 3-D rendering code done in cycle count spreadsheets.


6 hours ago, ArneCRosenfeldt said:

The problem with real hardware is that it is precious and I don't live alone (Wife with move-furniture-around-attacks, dog, kids). 

You need to make a man cave and show them your strong pimp hand, keep them out of there and put the Ixnay on all that running amuck. 


Cutting gaps into the framebuffer using the Object Processor has a clear cost: the OP reads two phrases and then writes back the first one. I hope this is parallel and only costs 6 cycles. Since I use the GPU for the dynamic data structure, I can place the object and the data in one page; I can place two objects with their data for one scanline into a page. A cost is that I need to write the objects for every scanline. If the GPU does anything else, I'd better use a small buffer, so the OP may have to read data for 4 scanlines? Or some branch objects (worst case 8 branches for the 8-bit scanline counter, or rather 7 branches and two objects in each leaf). So as a guess, a gap costs 40 cycles. Alignment on phrases and clipping will probably cost bookkeeping time on the GPU. The OP can read 20 phrases in this time, so 80 px. 64 bits, hey!
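The break-even arithmetic in that guess, written out (every constant is the post's own estimate, not a measured value):

```typescript
// Ballpark cost of one OP gap vs. the phrases the OP could have read instead.
const cyclesPerPhrase = 2;   // assumed fast-page phrase access
const gapCostCycles = 40;    // guessed total cost of one gap, incl. bookkeeping
const pixelsPerPhrase = 4;   // 64-bit phrase / 16-bit CRY pixel

const phrasesLost = gapCostCycles / cyclesPerPhrase;  // phrases readable in that time
const pixelsLost = phrasesLost * pixelsPerPhrase;     // break-even width of a gap
```

So under these assumptions a gap only pays off when it skips more than about 80 px of framebuffer.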

 

I mean, it would be cool to port Afterburner to really max out the OP with objects. In Afterburner the same 3 objects are instantiated all over the screen for 20 s. So you could have a cache of three rotated objects and some rotated and scaled-down versions depending on free memory, maybe even sub-pixel for the smallest scale, to dynamically use all GPU time per frame at a fixed 60 fps (but variable quality). But this is not 6DoF; Comanche isn't either.


14 hours ago, Zerosquare said:

I'll be more direct.

 

There are two kinds of people discussing Jaguar programming in this forum:

1. those who spend years posting their theories (usually as stream-of-consciousness walls-of-text), yet have nothing to show for it
2. those who actually develop and release stuff

 

Needless to say, people in group #2 have very little interest in the theories of people in group #1.

 

The choice is up to you.

The releases I see are 2D and not even bullet hell. Yeah, colorful; Rayman did it. And the overhead racers don't even have parallax, and the Jag has enough RAM to store the models pre-rendered. And there was a voxel demo. So yeah, you can fire 4 STOREW. I mean, I am happy that someone did it.

I don't know why there is such a large gap. My TypeScript code runs (GitHub Pages), I mean, as far as I could test it. I've earned money with boring code which runs. I am impressed by what the devs achieved back in the day. Still, the spreadsheet and the good manual clearly show the speed. It is not like the N64, where you cannot know anything.

 

You could formulate it like this: the devs had to use that ASIC library. They set a high clock rate and played with the pipeline. Then they found out that overlooked stuff like address generators, gateways, and routing on the chip also costs time. I guess that the GPU uses a minimal number of register ports and that they intended to run vector code on it, so you don't need the result of the previous calculation because you are on the next component. But why did they combine this with only two ports on the register set? They claim that access to external memory costs only one cycle more than internal memory. Can't we then say they were crazy? You have to test the address bits, then check if the 64-bit bus is available, then the memory controller has to test the page, or even the whole address, then send out the row address, then read back the value. I say it is a big wonder that all this is supposed to be only twice as slow as OR reg1,reg2. Or did they buy a highly optimized library? Or did they not invest much time in the GPU because it was only meant to be a better Copper, like in the Amiga? Then it would all make sense; spreadsheet and code would agree. How do you comment on code like that?

I mostly see those porting projects: massive projects with feature creep, ported from the ST. How will those bring spreadsheet noobs to code?

 

PS: about the two ports. I see that a flip-flop has one input and one output, so quite naturally one can attach an address generator to both. But JRISC sports 64 registers, so yeah, a 3-port file would be expensive. On the other hand it could be organized as 4 x 16 registers (ARM has 16). With a minimal number of transistors you could read two registers from different banks. Instead of MOVETA and MOVEFA there would be an instruction to map the 2x16 registers of the (following) instructions onto the 4 sets. Still, I don't see why there is a one-cycle delay for the flags. Carry look-ahead on ADD gives you the flags before you get the result. The effort to route the flags through the scoreboard is probably bigger than if the devs had just strived to make the flags faster, maybe even using alternate clocks and parallel circuits without pipeline stages to combine the condition flags as fast as possible. We software devs already have to deal with the branch delay slot!!

Edited by ArneCRosenfeldt
idea

14 hours ago, Zerosquare said:

I'll be more direct.

 

There are two kinds of people discussing Jaguar programming in this forum:

1. those who spend years posting their theories (usually as stream-of-consciousness walls-of-text), yet have nothing to show for it
2. those who actually develop and release stuff

 

Needless to say, people in group #2 have very little interest in the theories of people in group #1.

 

The choice is up to you.

Then may I call for a little bit of attention towards this subject instead?

I am so sad to see people reinventing the wheel over and over again, e.g. starting from line blits...

