Finding enough 2LK kernel time...

tschak909 · May 31, 2016

Yeah, ok, I'm sure I'll get lambasted here, by those who can literally knock out a 1LK kernel that can bend time and space, but...

I'm finding it rather difficult to time kernel writes, so that I can get:

playfield

2 players

2 missiles

1 ball

no other changes

Even at this point, I've basically gotten rid of all the WSYNCs so that I can at least try to make it fit on screen, and the Combat stack trick does a great job of enabling the ball and missiles on a given scanline in constant time...
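For readers who have not met the stack trick, here is a minimal sketch of the idea rather than Thom's actual code. The RAM names (BallY, M1Y, M0Y) and the use of Y as the scanline counter are assumptions made for illustration. It relies on two facts: the three enable registers sit at consecutive TIA addresses, and the Z flag occupies bit 1 of the status byte, which is the same bit those registers read.

```asm
        processor 6502

ENAM0   = $1D           ; TIA: missile 0 enable (bit 1)
ENAM1   = $1E           ; TIA: missile 1 enable (bit 1)
ENABL   = $1F           ; TIA: ball enable (bit 1)

BallY   = $80           ; hypothetical RAM: scanline on which each object appears
M1Y     = $81
M0Y     = $82

        org $F000
EnableObjects           ; kernel fragment, entered with Y = current scanline number
        ldx #ENABL      ; 2
        txs             ; 2  stack pointer now aims at ENABL
        cpy BallY       ; 3  Z = 1 only when Y equals BallY
        php             ; 3  status byte -> ENABL, SP moves down to ENAM1
        cpy M1Y         ; 3
        php             ; 3  status byte -> ENAM1, SP moves down to ENAM0
        cpy M0Y         ; 3
        php             ; 3  status byte -> ENAM0
                        ; 22 cycles on every line, no branches
```

Two caveats with this sketch: each object is only as tall as one kernel row, and the stack pointer has to be put back (ldx #$FF, txs) before anything uses JSR or RTS again.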

I could _really_ use a master class in understanding kernel layout, taught by Mr. Spice, along with, say, Tod Frye and a few other kernel luminaries, so I can get over the hump of being a beginner... but I digress...

As of right now, I'm just here with Stella, emacs, and a calculator, trying to eke things out in time.

-Thom

ZackAttack · May 31, 2016

I think a good way to approach a new kernel is to list out everything you want it to update. Then calculate how many cycles that will consume, assuming the worst case for each one, worst case being the most cycles used in order to save RAM and ROM. At this point you'll probably be way over 76 or 152 cycles. Now is the time to make sure you really need everything on your list. If something can be dropped, drop it. Next you simply optimize for speed by sacrificing RAM or ROM efficiency until you make it fit in 1 or 2 lines. I just assume that it's going to be exactly 76 or 152 cycles from the start, because STA WSYNC wastes too many valuable cycles. Deciding where to trade RAM and ROM for speed is a bit of an art. Typically you'd want to avoid increasing ROM space where you will have a lot of variation, such as the PF graphic data for a game with lots of levels. Small arrays can be put in RAM instead of being loaded from ROM, which removes a level of indirection and saves some CPU cycles during the kernel. Avoiding branching is also a good idea whenever possible.
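To make the RAM-for-speed trade concrete, here is a small sketch with the cycle count of each instruction in the comment column. P0Ptr and P0Gfx are hypothetical names, and this is not taken from the attached demo. It shows only the per-write cost of the two approaches and what a closing WSYNC adds.

```asm
        processor 6502

WSYNC   = $02           ; TIA: halt the CPU until the next scanline starts
GRP0    = $1B           ; TIA: player 0 graphics

P0Ptr   = $80           ; hypothetical 2-byte pointer to player graphics in ROM
P0Gfx   = $90           ; hypothetical copy of the same graphics in zero-page RAM

        org $F000
; Version A: read through a pointer (cheap in RAM, dearer in cycles)
        lda (P0Ptr),y   ; 5  (+1 if the indexed read crosses a page)
        sta GRP0        ; 3  total 8

; Version B: read a RAM copy made before the kernel (dearer in RAM, cheaper in cycles)
        lda P0Gfx,x     ; 4
        sta GRP0        ; 3  total 7

; Ending the line with WSYNC costs 3 cycles for the write itself,
; plus whatever is left of the 76 while the CPU sits idle.
        sta WSYNC       ; 3
```

One cycle per write looks small, but a 2LK that updates a dozen registers can recover most of a WSYNC's cost this way.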

Attached is a 2LK that I created which supports an asymmetric playfield (PF1 & PF2 only), GRP0, COLUP0, GRP1, COLUP1, NUSIZ1, and HMP1. It's a demo I put together to see if scrolling a 2 line resolution playfield would be possible. It uses a cart with extra RAM to hold the playfield. There is so much PF data to update during scrolling that I update it inside the 2LK. So in addition to updating all those TIA registers I found some time to shift bits around in the five PF bytes that represent a single line. (Four are displayed while the fifth contains data to be scrolled in/out.) This kernel is a more extreme example of trading RAM and ROM for CPU time. One huge benefit to this approach is that it decouples drawing the screen from updating the screen. So it would be easy to swap my very lame level loading code with something tile based. Really anything would be better than storing an uncompressed bitmap of the entire background...

Not sure if this helps you, but at least it's something to think about.

3E_scroll_PF_demo.asm

+SpiceWare · May 31, 2016

That's exactly what I cover in Collect. By the time you get to step 12 you'll have a 2LK that displays all 5 objects. Each lesson builds on the prior lessons though, so if you skimmed or skipped ahead it would be difficult to follow.

tschak909 · May 31, 2016

Yes, I've been reading through it. I am having to go back and forth between the blog post and the source code...

-Thom

gauauu · May 31, 2016

There's also a good bit of variation when you say you want to support a player. Is it a single-color or multiple-color player? That makes a difference. Is the playfield symmetrical or asymmetrical? Are the missiles/ball variable heights, or just 1 pixel?

There's a lot of wiggle room depending on what features you really need. I'm not very good at this yet (still making my first game); to make everything that I needed fit in a 2LK I ended up making the P2 sprite single-colored, and the P2 missiles locked at 1 pixel tall, and only using symmetric playfields.

tschak909 · May 31, 2016

single color players, symmetrical reflected playfield.

Am thinking of making a special case for the top and bottom borders, as they use writes to PF0, and want to take them out of the equation to gain a few cycles.

-Thom

+SpiceWare · May 31, 2016

Yes, I've been reading through it.

Ok, I guess your project's just getting ahead of where you are in the Collect series.

I am having to go back and forth between the blog post and the source code...

That's the intent. It's why I put so many comments in the source, there's way more than I would normally use.

Am thinking of making a special case for the top and bottom borders, as they use writes to PF0, and want to take them out of the equation to gain a few cycles.

I did that in the original Stay Frosty so I could have a multi-color snowman and an asymmetrical playfield. PF0 is only updated at the top (for the horizon) and the bottom.

Just Jeff · July 8, 2016

Hi.. I came up with 166 cycles. What do you have? Also, I prettied up your .asm a little. I think it helps to line up the comments and add nice big headers. Maybe put in where you think line 2 starts. And I added the cycle counts. I assure you there are errors (sometimes I can't tell if I'm doing something to the accumulator or the operand), so don't replace anything you have with it. It's just an example. Maybe someone can take a look at it and correct it. Also, you have a NOP in there. You could probably get rid of it by shuffling stuff around, which is a nice reason to have those cycle counts right there in the .asm.

dodgeballedit.asm

Edited July 8, 2016 by BNE Jeff

Mr SQL · July 8, 2016

Nice scroller Zack!

Agree about the advantage of decoupling the update from the display. One way to decouple the update further is by putting it outside the kernel into the vertical blanks - that gives more time to decompress a bitmap of the background too.

Just Jeff · July 9, 2016

Interesting... Is it really 128K? It's strange that a 6K .asm would produce that.

ZackAttack · July 11, 2016

Nice scroller Zack!

Agree about the advantage of decoupling the update from the display. One way to decouple the update further is by putting it outside the kernel into the vertical blanks - that gives more time to decompress a bitmap of the background too.

Thanks. I agree about decoupling the location too. If you look at the source code carefully you'll see that it is in fact happening during overscan/vblank. Only the rendering and scrolling occur during the kernel. Putting the scrolling inside the kernel is an optimization because both rendering and scrolling must read all 400 bytes of the PF buffer. With each read (LDA absolute,y) taking 4 cycles, that's a savings of 1600 cycles per frame.

Due to the way the 3E scheme works you can't copy the PF data directly from the banked ROM to banked RAM. So it has to be copied from banked ROM to ZP RAM and then copied again from ZP RAM to the banked RAM where the PF buffers are. Copying all those bytes takes a massive amount of cycles. So the loops had to be partially unrolled in order to optimize for time over code size. For example, the copy from ZP RAM to banked RAM does two LDA/STA pairs per iteration of the loop instead of one. This cuts the iteration count in half from 80 to 40. Since DEY and BPL consume 5 cycles combined per iteration, it saves 5*40=200 cycles with a cost of only 5 more bytes worth of code. Ideally the PF data would be compressed and then the size of data to be copied to ZP RAM could be reduced to improve things even more. Then you simply decompress it from ZP RAM and store it directly to banked RAM.

	; copy buffer to scroll
	ldy #39
fcs
	lda ColumnBuffer,y	; first half: bytes 0-39
	sta wColumnScroll,y
	lda ColumnBuffer+40,y	; second half: bytes 40-79
	sta wColumnScroll+40,y
	dey
	bpl fcs
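The 200-cycle figure can be checked by counting both forms side by side. In this sketch the two addresses are placeholders chosen only so the fragment assembles; the real ones are in the attached source. The counts assume neither loop branch crosses a page.

```asm
        processor 6502

ColumnBuffer  = $80     ; assumption: 80-byte staging buffer in zero-page RAM
wColumnScroll = $1400   ; assumption: write address of the banked cart RAM

        org $F000
; Rolled: one byte per pass, 80 passes
        ldy #79
Rolled
        lda ColumnBuffer,y      ; 4
        sta wColumnScroll,y     ; 5
        dey                     ; 2
        bpl Rolled              ; 3 when taken
                                ; 14 cycles x 80 passes = 1120

; Unrolled by two: two bytes per pass, 40 passes
        ldy #39
Unrolled
        lda ColumnBuffer,y      ; 4
        sta wColumnScroll,y     ; 5
        lda ColumnBuffer+40,y   ; 4
        sta wColumnScroll+40,y  ; 5
        dey                     ; 2
        bpl Unrolled            ; 3 when taken
                                ; 23 cycles x 40 passes = 920
```

The difference, 1120 - 920 = 200, is exactly the 40 DEY/BPL pairs that no longer run.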

Interesting... Is it really 128K? It's strange that a 6K .asm would produce that.

The short answer is that I intentionally made it 128KB by using org to put the reset vector (last bank) at an offset of $1FFFC. So it's generating a bin with about 120KB of empty banks.

The 3E banking scheme allows for up to 512KB of ROM and 256KB of RAM in the cartridge, at least according to the specification. I was going to expand this demo into a much larger one by upping the cart size. When I found out that the Harmony Encore cartridge only seems to handle 32KB, I shifted focus back to other projects and forgot to undo my size experiments before posting the source. All of the extra space that was allocated in the bin file was intended to hold a massive amount of uncompressed PF data.

If you changed the offsets to the following it would reduce it to an 8K bin and still work just fine. org is what indicates the offset in the actual binary file that dasm generates. The bigger your offsets, the bigger the file it generates.

	org $1800
	rorg $F800

and here

;reset vector
	org $1FFC
	rorg $FFFC

Mr SQL · July 11, 2016

If I understand correctly you are making changes to the background bitmap in the vertical blanks, but scrolling it inside the kernel. I agree that is an optimization if you have to read all 400 bytes to scroll and to render; out of the 400, how many are displayed at one time?

I use 480 bytes of data in the background bitmap but read only 30 at a time for the screen display into low RAM which also gives time to decompress it (into 60 bytes) for the kernel.

The 3E format looks pretty phat! 768K with 1/3 of it available as RAM, maybe it's exceeding the spec even for the Encore.

ZackAttack · July 15, 2016

The background bitmap is stored in ROM. The portion of the background that is currently being displayed is stored in RAM. Essentially there are 5 columns, each of which is a byte wide, or 8 PF pixels. 4 columns are mapped directly to PF1, PF2, PF2, PF1 respectively. Each column is 80 bytes or 160 PF pixels tall. So there is a 32x80 bitmap rendered each frame. The fifth column acts as a scroll buffer. When a bit/pixel is pushed out the left side it is put back in on the right side. So you have 5 bytes side by side that get rotated as you scroll. Every time the alignment of the bits matches a byte boundary, the fifth column is loaded with either the next column to the right or left of the four that are on the screen. Which side is loaded depends on which direction the next scroll will be in. Perhaps this could be a good base for a platformer with a two color background that is 32x40.
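A rough sketch of rotating one row of those five bytes one pixel to the left follows. It is not lifted from the demo and rests on several assumptions: the right half is drawn in reflected mode (which is what a PF1, PF2, PF2, PF1 order suggests), the off-screen byte is stored with bit 7 leftmost, and the row lives in zero page so that read-modify-write instructions work. The Col0 to Col4 names are made up for the example. PF1 shows bit 7 first and PF2 shows bit 0 first, so the shift direction alternates from byte to byte.

```asm
        processor 6502

Col0    = $80           ; left PF1: bit 7 is the leftmost pixel
Col1    = $81           ; left PF2: bit 0 is the leftmost pixel
Col2    = $82           ; right PF2, reflected: bit 7 is the leftmost pixel
Col3    = $83           ; right PF1, reflected: bit 0 is the leftmost pixel
Col4    = $84           ; off-screen buffer, assumed stored bit 7 leftmost

        org $F000
ScrollRowLeft
        lda Col0        ; 3
        asl             ; 2  C = leftmost pixel of the whole row
        rol Col4        ; 5  it re-enters at the far right; C = leftmost pixel of Col4
        ror Col3        ; 5  C in at the right edge (bit 7), leftmost (bit 0) out
        rol Col2        ; 5
        ror Col1        ; 5
        rol Col0        ; 5  row is now rotated one pixel to the left
                        ; 30 cycles per row
```

With cart RAM that has separate read and write addresses, ROL and ROR directly on memory would presumably not work, so each step would become a load, a shift in the accumulator, and a store to the write address.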

Sorry tschak909, we may have taken your post a bit off topic.

tschak909 · July 15, 2016

a bit?

+Andrew Davie · July 15, 2016

Memory tells me that the *formal* implementation of 3E supports 480K ROM and 32K RAM

This in order to be compatible both with Krokodile Kart and Cuttle Cart.

See http://atariage.com/forums/topic/62704-flicker-free-large-sprites-sample-mpg/page-3 for some discussion.

boomlinde · August 29, 2016

One trick I used (in a 2 player + background kernel) was to pad the player graphics data with 0 up to 256 bytes. Then, no logic had to be used in determining where to start/stop drawing the players; the kernel just loaded the vertical positions of the players, added a per-line decremented value and used that as an offset in the player table. If you have some ROM to spare, that is a good way to save some cycles in the kernel.
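Something along these lines, as a sketch of the padded-table idea rather than boomlinde's actual kernel. P0Y, Line and P0Table are hypothetical names, and the sprite shape is arbitrary. The CLC is there because the carry state is unknown at this point; a kernel that can guarantee it clear could drop it.

```asm
        processor 6502

GRP0    = $1B           ; TIA: player 0 graphics

P0Y     = $80           ; hypothetical: vertical position of player 0
Line    = $81           ; hypothetical: counter the kernel decrements once per line

        org $F000
DrawP0                  ; kernel fragment
        lda P0Y         ; 3
        clc             ; 2
        adc Line        ; 3  sum wraps modulo 256, which is what we want
        tax             ; 2
        lda P0Table,x   ; 4  table is page aligned, so never a page-cross penalty
        sta GRP0        ; 3  17 cycles, the same on every line

        align 256
P0Table                 ; rows are stored bottom-up because the index counts down
        .byte %01100110
        .byte %00100100
        .byte %00111100
        .byte %01011010
        .byte %01111110
        .byte %00111100
        .byte %00011000
        .byte %00011000
        ds 248, 0       ; zero padding: every other index draws nothing
```

The cost is a full 256-byte page of ROM per sprite frame, which is the trade described above.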

DEBRO · August 30, 2016

Don't forget about VDELx. It can be very useful...delaying the write until the next scan line.
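A minimal sketch of how that helps in a 2LK, with hypothetical pointer names P0Ptr and P1Ptr. With VDELP0 set, a write to GRP0 is held in a latch and only reaches the screen when GRP1 is next written, so the GRP0 write can go wherever there are spare cycles on the first line.

```asm
        processor 6502

GRP0    = $1B           ; TIA: player 0 graphics
GRP1    = $1C           ; TIA: player 1 graphics
VDELP0  = $25           ; TIA: vertical delay for player 0 (bit 0)

P0Ptr   = $80           ; hypothetical 2-byte pointers to the player graphics
P1Ptr   = $82

        org $F000
        lda #1
        sta VDELP0      ; once, before the kernel: delay GRP0 until GRP1 is written

; inside the two-line kernel, Y = row index
        lda (P0Ptr),y   ; 5  line 1: any free moment will do
        sta GRP0        ; 3  held in the latch, nothing changes on screen yet
        ; ... other line 1 work ...
        lda (P1Ptr),y   ; 5  line 2
        sta GRP1        ; 3  GRP1 updates and the held GRP0 value appears with it
```

VDELP1 and VDELBL work the same way in the other direction: delayed GRP1 data appears on a write to GRP0, and delayed ENABL data on a write to GRP1.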
