SCPCD

Posts posted by SCPCD


  1. If you don't change the bus priority (the DMAEN bit for the DSP/GPU and the BUSHI bit for the Blitter), and you shouldn't 😉, nothing can take bus priority over the OP.

    If the OP takes some bandwidth, DRAM refresh will be executed as soon as possible after object list processing.

    If the OP takes full bandwidth, a DRAM burst refresh will be forced.

     

    The only way for another processor to take the bus during OP processing is to enable the RELEASE bit on a bitmap object, but this bit should, if used at all, be used for low colour objects only.

     

    The DSP is just below the OP for bus priority.

     

     


  2. I haven't read about or encountered this sort of bug (as far as I remember, but I haven't played much with the DSP).

     

    Is _nTriAdd in external memory?

    Is there a load somewhere before it?

    Does rNAdd come from a long instruction (such as div)?

     

    I would suspect bug #1 or #7 in the "Hardware Bugs" part of the reference manual.

    [Attachments: two images]


  3. I don't think adding a load just after the nop will be enough:

    Each 68k memory access takes 6 cycles on the Jaguar (from what I remember), so around 12 GPU cycles: you will have approximately 12 cycles between each word write.

     

    If I'm not mistaken, the "wait_list" will take around 6 cycles to complete.

    Adding extra nops (I would say about 6) before the second load will probably do the trick for test purposes, but it's not the solution.

     

    The best way is indeed a semaphore.
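
    To illustrate why a semaphore is more robust than padding with nops, here is a small Python simulation (not Jaguar code). It assumes the shared value is a 32-bit long that the 68k writes as two 16-bit words, high word first, and that the GPU may read between the two writes. The names `writer_steps`, `simulate` and the `ready` flag are made up for this sketch.

```python
def writer_steps(value, use_flag):
    """Yield the 68k's writes one at a time as (location, 16-bit word)."""
    # Assumption: high word is written first, then the low word.
    yield ("data_hi", (value >> 16) & 0xFFFF)
    yield ("data_lo", value & 0xFFFF)
    if use_flag:
        # The flag is published only after both halves are in memory.
        yield ("ready", 1)


def simulate(old, new, use_flag, writes_done):
    """Return what the reader sees after `writes_done` writer steps.

    With use_flag=True the reader only accepts the value once 'ready'
    is set, and gets None otherwise.
    """
    mem = {"data_hi": (old >> 16) & 0xFFFF, "data_lo": old & 0xFFFF, "ready": 0}
    for count, (location, word) in enumerate(writer_steps(new, use_flag), start=1):
        if count > writes_done:
            break
        mem[location] = word
    if use_flag and not mem["ready"]:
        return None
    return (mem["data_hi"] << 16) | mem["data_lo"]


# Without a flag, a read after the first word mixes new and old halves.
torn = simulate(0x00001000, 0x00FF2000, use_flag=False, writes_done=1)
# With a flag, the reader sees nothing until all three writes are done.
guarded = simulate(0x00001000, 0x00FF2000, use_flag=True, writes_done=3)
```

    Here `torn` is neither the old nor the new value, while the flagged version never returns a half-written long, whatever the timing between the two word writes.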

     

     

    Edit:

    After posting, I have a doubt whether it's 5 or 6 cycles on the Jaguar for a 68k memory access; I need to check my notes, but it's higher than on a standard 68k.


  4. One of the bugs is indeed what CyranoJ says.

     

    I see another one: from what I understand, bank0 is used for your interrupt service and bank1 for your user code.

    In your post #21, I don't see any r31 configuration (I don't know if it's in a part of the code we don't have), but I will explain what I suspect now that we have more code, combining it with post #4.

     

     

    Initialisation : (#4)

    	movei #init_raster,r0		; 3
    	movei #G_ENDRAM,stack		; 3
    	jump t,(r0)					; 1
    	moveta stack,stack			; 1

    => G_ENDRAM in b0r31 & b1r31 (stack = r31)

     

     

    Dispatcher code : (probably b1r31) (#21)

    ; jsr to routine
    	movei #.l0,return
    	subqt #4,r31
    	jump (event)
    	store return,(r31)

    => b1r31 changed and ".l0" written in stack

     

     

    Interrupt code : (b0r31) (#4 & #21)

    	load (r31),r30
    	addq #2,r30
    	addqt #4,r31
    	jump (r30)
    	store flags,(gpuFlagsPtr)			; restore interrupt

    => b0r31 changed on interrupt entry and PC written to the stack

    -> RTE : jump back to user code, correcting the stack pointer

     

    Event RTS : (#21)

    ;rts
    	moveq #0,r0			; skip 0 parameter
    	load (r31),r1
    	jump (r1)
    	addqt #4,r31

     

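    If that's the case, when the interrupt routine is entered during your user "event" code, your stack content is overwritten, as b0r31 & b1r31 are independent: b0r31 still points to G_ENDRAM, so the interrupt writes the PC into the same slot (G_ENDRAM-4) where the dispatcher stored ".l0".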

    Then the interrupt service completes and goes back to the user code; at the end of your user "event" code, when your "rts" happens, the PC goes back to an unexpected address.

     

     

     

    Edit:

    Hmm, in #4 I see a subq #4 for the "user" stack, maybe to reserve one stack slot for the interrupt service routine?

     


  5. There is not enough information about how the interrupt routine is written to give a solution, but this is what I did for my FACTS demo:

    • Main code uses BANK1
    • Interrupt service code uses BANK0

     

     Initialisation routine :

    GPU_STACK				.equ	$F04000
    
    ; BANK0 :
    ;--------
    OBJ_FLAGS				.equr	r22
    vbl_counter				.equr	r26
    vbl_interrupts			.equr	r27
    pGflags					.equr	r28
    cGflags					.equr	r29
    cGstack					.equr	r30
    pGstack					.equr	r31
    
    
    ;GPU initialisation
    gpu_init:
    	movei		#G_FLAGS,pGflags			;GPU flags
    	load		(pGflags),cGflags
    	bclr		#3,cGflags
    	bset		#7,cGflags					;enable op interrupt
    	bset		#12,cGflags					;clear pending interrupt
    	bclr		#14,cGflags					;select bank0
    	store		cGflags,(pGflags)			;update the flags
    	
    	nop
    	nop
    	
    ;Stack Pointer
    	movei		#GPU_STACK,pGstack			;SP address
    	movei		#VblInterrupt,vbl_interrupts
    	movei		#OBF,OBJ_FLAGS
    
    	moveq		#0,vbl_counter

     

    Service routine at slot 3 (GPU Object Interrupts):

    gpu_int_3:
    	jump		(vbl_interrupts)
    	nop
    	nop
    	nop
    	nop
    	nop
    	nop
    	nop
    

     

    GPU Object interrupt routine :

    	.long
    VblInterrupt:
    	storew		r0,(OBJ_FLAGS)				;	Rr0		Rr22	|	-			|	-
    	load		(pGflags),cGflags			;	Rr28			|	Cr22		|	-
    	addq		#1,vbl_counter				;	#1		Rr26	|	Cr29		|	W(r22)
    	load		(pGstack),cGstack			;	Rr31			|	Cr26		|	Wr29
    	bclr		#3,cGflags					;	#3		Rr29	|	Cr30		|	Wr26
    	addq		#2,cGstack					;	#2		Rr30	|	Cr29		|	Wr30
    	addq		#4,pGstack					;	#4		Rr31	|	Cr30		|	Wr29
    	bset		#12,cGflags					;	#12		Rr29	|	Cr31		|	Wr30
    	jump		T,(cGstack)					;	T		Rr30	|	Cr29		|	Wr31
    	store		cGflags,(pGflags)			;	Rr29	Rr28	|	-			|	Wr29
    											;	-				|	Cr28		|	-
    											;	-				|	-			|	W(r28)

    In this example, I do these steps:

    • write to the OBF register as soon as possible to allow the OP to continue its processing
    • read the GPU Flags register
    • increase the vbl counter
    • read the stack
    • clear the IMASK bit
    • correct the address of the instruction that will be executed after jumping
    • increase the stack pointer
    • clear the interrupt flag
    • jump to the new address
    • write back flags to register

     

     


     


  6. Indexed and offset load/store take 2 more cycles (as wait_states) than standard load/store.

     

    Replacing them with standard load/store and addq can be more efficient, as you can rearrange opcodes to avoid as many wait_states as possible (at least 1 for each load/store):

    The idea is to replace the 2 wait_states with 1 addq and 1 other useful instruction.

     

     


  7. There are some tips on my website that I used for the ST2Jag optimization here: http://scpcd.free.fr/jag/jag.htm#ST2Jag

    It should be nearly 99% accurate, from what I remember.

     

    For the ST2Jag example:

    - The first column describes what is done in the Read cycle as R[register number]

    - The second column describes what is done in the Compute cycle as C[register number], and the parallel memory controller's current task as "M" (external memory read), "I" (internal memory read) and "R" (GPU register range)

    - The third column describes what is done in the Write cycle as W[register number]

     

    For an external memory LOAD(B/W/P), it will depend on bus usage, but a good approximation is around 10 cycles (for ST2Jag, I use 12 cycles to be pessimistic).

     

     

    For the "High Long Word register", there is also an example in the ST2Jag code :)

    If you would like to use loadp/storep, you can't issue another load instruction to external memory, as it will trash the "high long word register". It will effectively be nearly impossible to use if there is an external load in the GPU interrupt routine.

    But, as you can see in my code, you can insert load instruction if it only reads in internal memory.

     

     

    For storep, you effectively can't do something like:

    storep r0, (r1)
    store  r2, (high_word_register)
    nop
    storep r0, (r3)

    In this case, the high_word_register can be updated by the second instruction before the memory controller has latched the data and written it to the r1 memory address: this depends on the memory controller's current state and on bus activity.


    To avoid this, you can do one of the following:

    - insert enough instructions between the first and the second instruction, but it will be difficult to get something reliable, as it depends on bus activity

    storep r0, (r1)
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    store  r2, (high_word_register)
    nop
    storep r0, (r3)

    - or: add an external load or store instruction between the first and the second instruction: by the time execution reaches the second storep, you are assured that the first storep has completed, because the added load/store triggers a GPU wait_state while the memory controller is in its "work in progress" state.

     

    storep r0, (r1)
    store  r4, (somewhere_in_external_memory)
    store  r2, (high_word_register)
    nop
    storep r0, (r3)

     

     

    For load/store R14(5)+, those are useful, but at the cost of an extra (wait_state) cycle.

    In the ST2Jag example, you will see that I replaced them with standard load instructions to give me more reordering possibilities and improve instruction pipelining.

    But it will probably depend on register availability and on the algorithm.

     

     


  8. STOREP/LOADP are worth it when you can use them: if you can handle the restrictions on those instructions.

     

    I would not recommend using DMAEN; I think the headache of making it work as you want with all the Tom bugs is not worth the time/boost ratio.

    A better way is to optimize the GPU code. :)

     

     

     


  9. In the most up-to-date JTRM, the YPOS for the GPU Object was removed.

    In the netlist, there is no use of YPOS for the GPU object (the state machine always executes gpuint).

     

    I think I have read somewhere that this feature was removed from the specification and replaced by the use of a branch object.

     

     


  10. The video distortion & sound glitch are typical of not enough bandwidth for the OP.

    Let's verify it:

    - the OP has ~64µs to parse the whole list each line: 26.59MHz * 64µs = 1701 cycles

    - the picture is 352 wide in 16bpp, and the skunkboard is 16-bit at 5 cycles: (352*2)/2 * 5 cycles = 1760 cycles (only for data; extra cycles are needed to read the object description)

     

    The scaling up is a side effect of this, as the line buffer is swapped while the OP hasn't yet finished the previous line.

     

    What you can do to grab some cycles is to reduce the picture horizontally. 352 is way larger than the visible screen, which is 332 in overscan.

    With a 320 wide picture, this will give you about 100 available cycles for other objects in internal RAM, and if there are not too many objects, and they are not too wide or too scaled, maybe it will be sufficient.

     

    If you don't want to edit the picture, you can play with the Object IWIDTH and Object DATA to clip it.

     

    PS :

    on a standard cartridge, which is 32-bit 10 cycles, you will have: (352*2)/4 * 10 cycles = 1760 cycles.

    With a faster ROM you can set it to 5 cycles and have 2x the bandwidth.
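
    The arithmetic in this post can be re-run for other widths and ROM settings with a few lines of Python. This is only a sketch of the same estimate: the function and constant names are invented here, and like the figures above it counts bitmap data only, not the cycles needed to read the object headers.

```python
GPU_CLOCK_MHZ = 26.59   # system clock used in the post
LINE_TIME_US = 64       # approximate duration of one video line


def line_budget_cycles():
    """Cycles the OP has to process the whole object list on one line."""
    return int(GPU_CLOCK_MHZ * LINE_TIME_US)


def bitmap_fetch_cycles(width_px, bits_per_pixel, bus_bits, cycles_per_access):
    """Cycles needed to fetch one line of bitmap data over the ROM bus."""
    line_bits = width_px * bits_per_pixel
    accesses = -(-line_bits // bus_bits)  # ceiling division
    return accesses * cycles_per_access


budget = line_budget_cycles()
skunk_352 = bitmap_fetch_cycles(352, 16, bus_bits=16, cycles_per_access=5)
skunk_320 = bitmap_fetch_cycles(320, 16, bus_bits=16, cycles_per_access=5)
cart_352 = bitmap_fetch_cycles(352, 16, bus_bits=32, cycles_per_access=10)

print(budget, skunk_352, skunk_320, cart_352)
print("spare cycles at 320 wide:", budget - skunk_320)
```

    Anything that returns more than `line_budget_cycles()` cannot fit in one line, which is the case for the 352 wide picture on both the skunkboard and a standard cartridge setting.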

     

     

    About the Universal header, I would say it defaults to standard cartridge settings, so 32-bit 10 cycles, unless it was already modified.

     

     


  11. Oh, a new challenger! :)

     

    To beat Zero5 (with high scores 😛), you need to:

    - For levels like BamBam:

      - Kill all monsters from the proper position to get power-ups as fast as possible, using them first to increase the laser power (it's much more powerful than the smart laser), then the score, and finally, only if you really need it, the energy

    - For levels like HitPack:

      - Kill all monsters as fast as possible: enemies generally have weak points in the centre of their bodies.

      - For big ones: quickly destroy their weapons, then you can shoot them as you want

      - If you can't kill them quickly enough, kill the most powerful enemies first (generally the yellow ones) or shoot at their lasers to limit damage and give yourself a chance to pass to the next pattern

    - For tunnel levels:

      - I always keep shooting and use the sound to know if there is an unbreakable wall, and destroy all the others as soon as possible

      - Rotate as little as possible to get to the right position quickly

      - Take all power-ups to increase laser power and reach the max by the time you leave the tunnel, so you can easily kill the enemies in that step, then use power-ups to increase the high score :).

     

    In Expert mode, levels are longer and enemies are more difficult to kill :)

     


  12. I haven't played much with RB+, so I don't understand what the above code will generate, but from what I understand, you want the bitmap data to be read directly from ROM.

    If this is the case, you will need to be careful about the huge performance loss you will get:

     

    The skunkboard having a 16-bit bus, it's 4 times slower than main memory for a 64-bit read.

    On top of that, the number of cycles needed to read from ROM space is higher: at least 5 cycles (I don't remember what speed is set on the skunkboard) and up to 10 cycles (standard cartridge, and I think also the JagGD).

     

    If OP bitmap data is read directly from ROM space, it's possible that there is not enough bandwidth to parse the whole OP list (i.e. reach the STOP object) in less than ~64µs (one line drawing).

    If this happens, you will have corrupted graphics on screen.
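
    As a rough illustration of the cost per 64-bit phrase, here is a tiny Python sketch. The function name is made up, and the only inputs are the bus widths and cycle counts quoted above; it says nothing about the real cost of a main-memory read.

```python
def phrase_read_cycles(bus_bits, cycles_per_access, phrase_bits=64):
    """Cycles to fetch one 64-bit phrase over a narrower ROM bus."""
    accesses = phrase_bits // bus_bits  # a 16-bit bus needs 4 accesses
    return accesses * cycles_per_access


print(phrase_read_cycles(16, 5))    # 16-bit bus at 5 cycles per access
print(phrase_read_cycles(16, 10))   # 16-bit bus at 10 cycles per access
print(phrase_read_cycles(32, 10))   # 32-bit bus at 10 cycles per access
```

    The factor of 4 for the 16-bit bus comes purely from the number of accesses per phrase; the cycles per access then multiply on top of that.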

     

     
