
Luigi301

Members
  • Content Count

    372

Posts posted by Luigi301


  1. I have a Rainbow TOS ROM set in my 1040STF and my boot disk softkicks 2.06. If you have a 4MB upgrade in your ST you might want to load 2.06 at boot time - it preserves compatibility with almost all ST booter software since you have the correct version in ROM, while your HD boots to 2.06 for desktop use.


  2. Yeah, I set it up so writes to $800000 (unmapped) get logged to a debug log. I also rebuilt my workflow under WSL instead of Cygwin, making the compile/debug loop a lot faster. Together, this let me figure out that a typo in my view matrix function was screwing up the shading. :| Now I have 16 light levels projected from the camera - so everything's bright, but at least it's shaded properly.

     

    2kQzZGS.gif
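As a rough illustration of the "16 light levels projected from the camera" idea, here is a minimal C sketch. It is not the engine's code: it assumes 16.16 fixed-point unit normals already rotated into view space, with +Z pointing back toward the viewer, and `fixed_t` and `light_level` are hypothetical names.

```c
#include <stdint.h>

typedef int32_t fixed_t;                 /* 16.16 fixed point (assumed format) */
#define FIXED_ONE ((fixed_t)65536)

/* Map the view-space Z component of a unit normal to a light level 0..15.
 * With the light sitting at the camera, the diffuse term reduces to the
 * part of the normal that points back at the viewer. */
static int light_level(fixed_t normal_z)
{
    if (normal_z <= 0)
        return 0;                        /* facing away from the camera */
    if (normal_z >= FIXED_ONE)
        return 15;                       /* facing the camera head-on */
    return (int)((normal_z * 15) / FIXED_ONE);
}
```

A surface tilted so that its normal's Z component is 0.5 would land on level 7 with this mapping.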

    • Like 5

  3. Working on a model converter that will turn my OBJ files into raw data for the 3D engine at the moment.
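A converter like this could start from something as small as the sketch below. It assumes the engine wants 16.16 fixed-point vertices and 0-based 16-bit triangle indices, which the post does not state; `vertex_t`, `triangle_t`, `parse_vertex` and `parse_face` are hypothetical names. Only `v` and triangular `f` lines are handled, and negative (relative) OBJ indices are rejected.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef int32_t fixed_t;                         /* 16.16 (assumed target format) */
typedef struct { fixed_t x, y, z; } vertex_t;
typedef struct { uint16_t a, b, c; } triangle_t;

/* Round a double to the nearest 16.16 value. */
static fixed_t to_fixed(double v)
{
    return (fixed_t)(v * 65536.0 + (v >= 0 ? 0.5 : -0.5));
}

/* Parse one "v x y z" line. Returns 1 on success, 0 otherwise
 * (including "vn"/"vt" lines, which fail the numeric match). */
static int parse_vertex(const char *line, vertex_t *out)
{
    double x, y, z;
    if (sscanf(line, "v %lf %lf %lf", &x, &y, &z) != 3)
        return 0;
    out->x = to_fixed(x);
    out->y = to_fixed(y);
    out->z = to_fixed(z);
    return 1;
}

/* Parse one "f a b c" line, also accepting "a/vt/vn" style entries.
 * OBJ indices are 1-based, so they are converted to 0-based here. */
static int parse_face(const char *line, triangle_t *out)
{
    uint16_t idx[3];
    const char *p = line;

    if (p[0] != 'f' || p[1] != ' ')
        return 0;
    p += 2;
    for (int i = 0; i < 3; i++) {
        char *end;
        long v = strtol(p, &end, 10);
        if (end == p || v < 1 || v > 65536)
            return 0;
        idx[i] = (uint16_t)(v - 1);
        p = end;
        while (*p != '\0' && *p != ' ' && *p != '\n')
            p++;                                 /* skip any "/vt/vn" suffix */
    }
    out->a = idx[0];
    out->b = idx[1];
    out->c = idx[2];
    return 1;
}
```

The parsed arrays can then be written out as raw big-endian words or as assembler `dc.l` tables, whichever the engine loads.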

     

    I’ve also got debug spew going to the screen now, using a simple 40x24 1bpp text layer. Trying to figure out why the lighting seems to be coming from a weird “angle” when it should be projected from the camera. :(
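For reference, a 40x24 1bpp text layer can be written to with very little code. The sketch below is an assumption-heavy illustration rather than the actual debug layer: it assumes 8x8 glyphs (so a 320x192 bitmap with 40 bytes per pixel row), and `text_layer` and `put_glyph` are hypothetical names. The glyph data is passed in by the caller, since the post does not say which font is used.

```c
#include <stdint.h>

#define TEXT_COLS 40
#define TEXT_ROWS 24
#define GLYPH_H   8                      /* assumed 8x8 glyphs */
#define PITCH     TEXT_COLS              /* 40 bytes per pixel row at 1bpp */

static uint8_t text_layer[TEXT_ROWS * GLYPH_H * PITCH];

/* Copy one 8x8 glyph into character cell (col, row).
 * Each glyph row is exactly one byte, so cells are byte-aligned
 * and no shifting or masking is needed. Returns 0 if out of range. */
static int put_glyph(int col, int row, const uint8_t glyph[GLYPH_H])
{
    if (col < 0 || col >= TEXT_COLS || row < 0 || row >= TEXT_ROWS)
        return 0;

    uint8_t *dst = &text_layer[(row * GLYPH_H) * PITCH + col];
    for (int y = 0; y < GLYPH_H; y++)
        dst[y * PITCH] = glyph[y];
    return 1;
}
```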

     

    I wish Virtual Jaguar could hook the jaglib skunkboard console output. Maybe I can hack that into VJ somehow.

     

     


    • Like 2

  4. Right now it just works from the bottom up, incrementing the endpoints by the slopes of the edges they follow, until it hits the scanline containing the middle vertex and stops. I need to clear out some space in my program to be able to work from the top down and fill in the other half of the triangle.

     

    I'm doing each scanline separately, but I think the blitter should be able to process both slopes changing at the same time if I bastardize the increment registers properly. That would let me do one blitter operation per triangle instead of one per scanline.
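The bottom-up half fill described above can be sketched in plain C as follows. This is a software stand-in, not the blitter setup from the post: spans are written into a small byte-per-pixel buffer, there is no clipping, screen Y is assumed to grow downward, and `fb`, `draw_span` and `fill_bottom_half` are hypothetical names.

```c
#include <stdint.h>

typedef int32_t fixed_t;                 /* 16.16 fixed point */
#define FB_W 32
#define FB_H 32

static uint8_t fb[FB_W * FB_H];          /* stand-in for the real framebuffer */

/* Fill one horizontal span, inclusive of both endpoints. */
static void draw_span(int y, int x0, int x1, uint8_t color)
{
    if (x0 > x1) { int t = x0; x0 = x1; x1 = t; }
    for (int x = x0; x <= x1; x++)
        fb[y * FB_W + x] = color;
}

/* Fill from the bottom vertex (bx, by) up to the scanline of the middle
 * vertex (mx, my). One endpoint follows the short edge bottom->middle,
 * the other follows the long edge bottom->top. Coordinates are assumed
 * to be on screen already (no clipping). */
static void fill_bottom_half(int bx, int by, int mx, int my,
                             int tx, int ty, uint8_t color)
{
    if (by == my || by == ty)
        return;                          /* flat bottom: this half is empty */

    /* Change in x per scanline while moving up, in 16.16. */
    fixed_t slope_short = ((mx - bx) * 65536) / (by - my);
    fixed_t slope_long  = ((tx - bx) * 65536) / (by - ty);
    fixed_t xa = bx * 65536;
    fixed_t xb = bx * 65536;

    for (int y = by; y >= my; y--) {
        draw_span(y, xa / 65536, xb / 65536, color);
        xa += slope_short;
        xb += slope_long;
    }
}
```

The other half would repeat the same loop from the top vertex down, swapping the short edge for top->middle while the long edge keeps its slope.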


  5. For reference, I was thinking something along the lines of TIE Fighter - a bunch of Gouraud-shaded models flying around in a starfield with a 2D overlay UI. I think the Jaguar can easily handle a couple hundred lit and shaded polygons flying around an empty space shooting billboarded sprites at each other.

     

    I've got all the Jaguar code archives but I'm not sure where to start looking for implementing flat shading on the GPU/blitter.

    • Like 2

  6. Yes, you can work out the normals during initialisation. Then rotate them along with the rest of the vertices at runtime.

    As long as the triangles are all wound in the same order you can tell front or back from the direction.

    Oh right, since the models don't deform I can precalculate the normals and rotate them as part of the transformation program. The triangles are all clockwise but I will need the normals for shading when I get there.
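For comparison, the winding-order test mentioned in the quote can be done without normals at all, using the signed area of the projected triangle. The sketch below shows that alternative, not the rotated-normal approach from the post. It assumes screen coordinates with Y growing downward and front faces wound clockwise on screen; `signed_area2` and `is_front_facing` are hypothetical names.

```c
#include <stdint.h>

/* Twice the signed area of the projected triangle (the Z component of
 * the 2D cross product of two edges). With Y pointing down, a triangle
 * that appears clockwise on screen gives a positive value. */
static int64_t signed_area2(int32_t x0, int32_t y0,
                            int32_t x1, int32_t y1,
                            int32_t x2, int32_t y2)
{
    return (int64_t)(x1 - x0) * (y2 - y0) - (int64_t)(x2 - x0) * (y1 - y0);
}

/* Returns 1 for clockwise (front-facing under the stated assumption),
 * 0 for counter-clockwise or degenerate (zero-area) triangles. */
static int is_front_facing(int32_t x0, int32_t y0,
                           int32_t x1, int32_t y1,
                           int32_t x2, int32_t y2)
{
    return signed_area2(x0, y0, x1, y1, x2, y2) > 0;
}
```

The precalculated normals are still needed for shading; this only covers the visibility decision.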

     

     

     

    You can gain a bit of speed with the following changes, exploiting the pipeline:

     

     

    Cool, thanks. Optimizing for pipelining isn't something I've had to do before.

     

    Another night of programming and I got a sweet-looking wireframe cube with no back-facing polygons! Now, uh, I just need to make some models other than cubes...

     

    DUpJPxk.gif

    • Like 6

  7. Well, this was fun. Working on back-face culling, so I need to calculate the normals of polygons. That means I need to do square roots... of fixed-point numbers... on the GPU. Translating the C to GPU code took about an hour and a half.

     

    C:

    #define FRACBITS 16
    #define ITERS (15 + (FRACBITS >> 1))
    FIXED_32 FIXED_SQRT(FIXED_32 val)
    {
        //http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.178.3957&rep=rep1&type=pdf
        uint32_t root, remHi, remLo, testDiv, count;
       
        root = 0;    //Clear root
        remHi = 0;   //Clear high part of partial remainder
        remLo = val; //Get argument into low part of partial remainder
        count = ITERS;  //16.16 number
       
        do {
            remHi = (remHi << 2) | (remLo >> 30); remLo <<= 2;
            root <<= 1;
            testDiv = (root << 1) + 1; //test radical
            if(remHi >= testDiv) {
                remHi -= testDiv;
                root++;
            }
        } while(count-- != 0);
               
        return root;
    }

    GPU:

    FIXED_SQRT:
        ;; Calculate the square root of the fixed-point number in r0.
        ;; Returns the result in r0.
        FRACBITS        .equ    16
        ITERS           .equ    (15 + (FRACBITS >> 1))
     
        SQRT_ROOT       .equr   r20
        SQRT_REM_HI     .equr   r21
        SQRT_REM_LO     .equr   r22
        SQRT_TEST_DIV   .equr   r23
        SQRT_COUNT      .equr   r24
     
        SQRT_THIRTY     .equr   r25
        SQRT_LOOP_CHECK .equr   r29
        SQRT_LOOP_ADDR  .equr   r30
     
        moveq   #0,SQRT_ROOT
        moveq   #0,SQRT_REM_HI
        move    r0,SQRT_REM_LO
        moveq   #ITERS,SQRT_COUNT
     
        moveq   #30,SQRT_THIRTY
        movei   #.sqrt_loop,SQRT_LOOP_ADDR
        movei   #.sqrt_do_loop,SQRT_LOOP_CHECK
     
    .sqrt_loop:
        shlq    #2,SQRT_REM_HI
        move    SQRT_REM_LO,TEMP1
        sh      SQRT_THIRTY,TEMP1
        or      TEMP1,SQRT_REM_HI
        shlq    #2,SQRT_REM_LO
     
        shlq    #1,SQRT_ROOT
        move    SQRT_ROOT,SQRT_TEST_DIV
        shlq    #1,SQRT_TEST_DIV
        addq    #1,SQRT_TEST_DIV
     
        cmp     SQRT_TEST_DIV,SQRT_REM_HI
        jump    ge,(SQRT_LOOP_CHECK) ;if remHi >= testDiv
        nop
     
        sub     SQRT_TEST_DIV,SQRT_REM_HI
        addq    #1,SQRT_ROOT
     
    .sqrt_do_loop:
        subq    #1,SQRT_COUNT
     
        cmpq    #-1,SQRT_COUNT
        jump    ne,(SQRT_LOOP_ADDR) ; if not -1, keep looping
        nop
     
        move    SQRT_ROOT,r0
     
        GPU_RTS
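
A quick way to sanity-check a translation like this is to run the C version on a host machine against known values and compare the GPU results to it. The sketch below repeats the C routine above under the hypothetical name `fixed_sqrt_ref`, and assumes `FIXED_32` is an unsigned 32-bit 16.16 type (the post does not show its typedef).

```c
#include <stdint.h>

typedef uint32_t FIXED_32;               /* assumed: unsigned 16.16 */
#define FRACBITS 16
#define ITERS (15 + (FRACBITS >> 1))

/* Same bit-by-bit algorithm as FIXED_SQRT above. The loop body runs
 * ITERS + 1 = 24 times, consuming two bits of the argument per pass,
 * and yields floor(sqrt(val * 2^16)), i.e. the root in 16.16. */
static FIXED_32 fixed_sqrt_ref(FIXED_32 val)
{
    uint32_t root = 0, remHi = 0, remLo = val, testDiv;
    uint32_t count = ITERS;

    do {
        remHi = (remHi << 2) | (remLo >> 30);
        remLo <<= 2;
        root <<= 1;
        testDiv = (root << 1) + 1;
        if (remHi >= testDiv) {
            remHi -= testDiv;
            root++;
        }
    } while (count-- != 0);

    return root;
}

/* Convert a small non-negative integer to 16.16. */
static FIXED_32 fixed_from_int(uint32_t i)
{
    return i << FRACBITS;
}
```

For example, `fixed_sqrt_ref(fixed_from_int(4))` gives `0x00020000` (2.0), and the square root of 2.0 comes out as 92681, which is 1.41421 in 16.16.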
    • Like 2

  8. Thanks, but since you are from the very start splitting the load between GPU and DSP, you won't really have to spend time optimizing. Even a brute-force inefficient algorithm will suffice due to the parallel nature of execution.

     

    Don't forget, I use only 4 KB for the code (3.2 KB at the moment). You have 3x more available (8 KB (DSP) + 4 KB (GPU) = 12 KB). If I had just 1 KB more, I could get another 10% boost easily. Gimme 2 KB and I'll get another 15%.

    Gimme 8 more KB and ... :)

     

    I like trying to extract maximum possible performance from just one core, though. It's more of a challenge that way :)

     

     

    I ended up moving everything over to the GPU because I ran into hardware bugs in the DSP (something about external writes failing under certain conditions) that made the program not work on real hardware.

     

    I have one GPU program that handles matrix calculations for building transformations and then another GPU program that handles projection and blitting. Are you saying you're doing all that in one GPU program? My matrix program (which includes translation, rotation, multiplication, and fixed-point add/subtract/multiply/divide functions) alone is 3.8KB.
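For readers unfamiliar with the fixed-point helpers mentioned here, this is roughly what multiply and divide look like in C for a 16.16 format. It is a host-side sketch with hypothetical names (`fixed_mul`, `fixed_div`), and it leans on a 64-bit intermediate, so the GPU versions would have to be structured differently.

```c
#include <stdint.h>

typedef int32_t fixed_t;                 /* 16.16 fixed point (assumed format) */
#define FIXED_ONE 65536

/* Add and subtract work directly on the raw values. */
static fixed_t fixed_add(fixed_t a, fixed_t b) { return a + b; }
static fixed_t fixed_sub(fixed_t a, fixed_t b) { return a - b; }

/* The raw product carries 32 fractional bits, so scale back down by 2^16. */
static fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) / FIXED_ONE);
}

/* Pre-scale the dividend by 2^16 so the quotient keeps 16 fractional bits.
 * The caller must ensure b is non-zero. */
static fixed_t fixed_div(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * FIXED_ONE) / b);
}
```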


  9. I've got most of the render loop running on the GPU now. The perspective matrix and view matrix are precalculated at the start of a frame. The CPU gets an object and its position, yaw/pitch/roll, and scale parameters and passes them to the GPU. The GPU uses these to calculate the model's transform matrix, then calculates the full transformation matrix. Right now the CPU passes each of the object's triangles in turn to the GPU, where they're combined with the transformation matrix, projected, and blitted by the blitter. The next step is to just pass the full list of triangles to the GPU instead.
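
The matrix chain described here can be sketched in C as below. This is an illustration under assumptions, not the GPU program: matrices are 4x4 in 16.16 fixed point, row-major, applied to column vectors, so the full transform is perspective * view * model. Rotation is left out to keep the sketch short, and all names (`mat4_t`, `mat4_mul`, `build_full_transform`, and so on) are hypothetical.

```c
#include <stdint.h>

typedef int32_t fixed_t;                 /* 16.16 fixed point (assumed format) */
#define FIXED_ONE 65536

typedef struct { fixed_t m[4][4]; } mat4_t;   /* row-major (assumed layout) */

static mat4_t mat4_identity(void)
{
    mat4_t r = {{{0}}};
    for (int i = 0; i < 4; i++)
        r.m[i][i] = FIXED_ONE;
    return r;
}

/* r = a * b, accumulating in 64 bits before scaling back to 16.16. */
static mat4_t mat4_mul(const mat4_t *a, const mat4_t *b)
{
    mat4_t r;
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            int64_t acc = 0;
            for (int k = 0; k < 4; k++)
                acc += (int64_t)a->m[i][k] * b->m[k][j];
            r.m[i][j] = (fixed_t)(acc / FIXED_ONE);
        }
    }
    return r;
}

static mat4_t mat4_translate(fixed_t x, fixed_t y, fixed_t z)
{
    mat4_t r = mat4_identity();
    r.m[0][3] = x;
    r.m[1][3] = y;
    r.m[2][3] = z;
    return r;
}

static mat4_t mat4_scale(fixed_t s)
{
    mat4_t r = mat4_identity();
    r.m[0][0] = s;
    r.m[1][1] = s;
    r.m[2][2] = s;
    return r;
}

/* Once per object: combine the per-frame perspective and view matrices
 * with that object's model matrix. */
static mat4_t build_full_transform(const mat4_t *persp, const mat4_t *view,
                                   const mat4_t *model)
{
    mat4_t pv = mat4_mul(persp, view);
    return mat4_mul(&pv, model);
}
```

Since the perspective and view matrices only change once per frame, their product could also be computed once and reused for every object.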

     

    But look how fast it is now!

     

    VW7kvUr.gif

    • Like 8