Chained vs. un-chained SCBs

42bs · July 7, 2019

I made a small test that draws the Phobyx logo wobbling along a sine wave (like demo0006), once as a chain of 102 SCBs (without size and palette reloading) and once as 102 separate sprites (flipped, but that only affects the drawing order).

(See https://github.com/42Bastian/lynx_hacking/tree/master/chained_scbs).
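For readers unfamiliar with the term, here is a minimal sketch of what a chain of 102 SCBs looks like in memory. It is not taken from the linked repository. It assumes an 11-byte SCB without size and palette fields (three control bytes, next pointer, data pointer, x, y) and assumes a zero high byte in the next pointer ends the chain. The addresses `BASE` and `SPRITE_DATA` and the zeroed control bytes are hypothetical placeholders.

```python
import struct

SCB_SIZE = 11          # assumed: 3 control bytes + next + data + x + y
BASE = 0x4000          # hypothetical RAM address of the first SCB
SPRITE_DATA = 0x6000   # hypothetical address of the shared sprite data


def build_chain(positions, base=BASE):
    """Pack one SCB per (x, y) position; each SCB points to the next one."""
    ram = bytearray()
    for i, (x, y) in enumerate(positions):
        is_last = i == len(positions) - 1
        next_ptr = 0 if is_last else base + (i + 1) * SCB_SIZE
        # Control bytes are placeholders (0); real values depend on the sprite.
        ram += struct.pack("<BBBHHhh", 0, 0, 0, next_ptr, SPRITE_DATA, x, y)
    return bytes(ram)


def count_scbs(ram, base=BASE):
    """Follow the next pointers and count the SCBs in the chain."""
    count, addr = 0, base
    while addr & 0xFF00:   # assumed: a zero high byte ends the chain
        count += 1
        (addr,) = struct.unpack_from("<H", ram, addr - base + 3)
    return count


positions = [(0, line) for line in range(102)]   # one sprite per screen line
chain = build_chain(positions)
print(len(chain), count_scbs(chain))
```

With a chain, the sprite engine is started once and walks the list itself; with separate sprites, the CPU has to start it 102 times.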

Chained is always quicker and on a Lynx II it can be drawn in a single frame at 60Hz.

But not on Handy.

Just wanted to share this.

+bhall408 · July 7, 2019

2 hours ago, 42bs said:

Chained is always quicker and on a Lynx II it can be drawn in a single frame at 60Hz.

But not on Handy.

Just wanted to share this.

Is this issue exposed in any released games?

Is it worth trying to fix/address?

42bs · July 7, 2019

1 hour ago, bhall408 said:

Is this issue exposed in any released games?

Is it worth trying to fix/address?

You mean fix Handy? I think it is important to know that Handy is slower with respect to sprite painting. But it mostly matters if you count cycles (or are just at the edge of running at 30 fps or only 20 fps).

+bhall408 · July 7, 2019

2 hours ago, 42bs said:

You mean fix Handy? I think it is important to know that Handy is slower with respect to sprite painting. But it mostly matters if you count cycles (or are just at the edge of running at 30 fps or only 20 fps).

Yes, I mean fix Handy...

drludos · July 7, 2019

2 hours ago, 42bs said:

You mean fix Handy? I think it is important to know that Handy is slower with respect to sprite painting. But it mostly matters if you count cycles (or are just at the edge of running at 30 fps or only 20 fps).

As someone who is trying to make a 60fps game on Handy, I'm glad to read that!

I'm quite impressed that the Lynx can draw a 102 sprite chain at 60fps, someone will maybe make a manic shooter game on it someday!

Did you make your test with Mednafen too, as it's apparently *faster* than real hardware?

42bs · July 8, 2019

Mednafen does not work on my PC. But anyway, AFAIK all emulators use the same code base: Keith Wilkins' Handy.

Since it makes no difference what kind of video output or scaling I use, I guess it is a "limitation" in the Suzy emulation.

42bs · July 9, 2019

Added a 2nd version, now 100 tiles each 16x10 pixels. Takes slightly more time to draw.

https://github.com/42Bastian/lynx_hacking/tree/master/chained_scbs2

Edited July 9, 2019 by 42bs

42bs · July 9, 2019

On 7/7/2019 at 11:40 PM, bhall408 said:

Yes, I mean fix Handy...

It seems the cycle calculation is either wrong or too rough.

VladR · July 9, 2019

On 7/7/2019 at 3:56 PM, drludos said:

As someone who is trying to make a 60fps game on Handy, I'm glad to read that!

I'm quite impressed that the Lynx can draw a 102 sprite chain at 60fps, someone will maybe make a manic shooter game on it someday!

Did you make your test with Mednafen too, as it's apparently *faster* than real hardware?

Yeah, just not with 102 sprites and at 60 fps

You are, quite literally, multiplying the processing cycles per sprite by 100 here, so it's going to add up real fast. Now, with ~56,467 cycles of CPU time available (after Mikey reads the frame buffer), that means you have about 564 cycles per sprite to handle its behavior and stay within the confines of a frame time.

That should be doable, and you could still have 30 fps (plenty for such a small screen anyway). Now, if you had, say, 15 enemies, each with ~7 bullets, that's around 105 bullet sprites on screen. Now, if the bullets only moved horizontally or vertically, they wouldn't have to be transparent, so that should help.
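The arithmetic above can be checked with a few lines. This is only a sketch: the 56,467-cycle figure is the one quoted in this post, and the enemy and bullet counts are the example values from the paragraph above.

```python
CPU_CYCLES_PER_FRAME = 56_467   # figure quoted in the post above


def cycles_per_sprite(sprite_count, budget=CPU_CYCLES_PER_FRAME):
    """Whole CPU cycles available per sprite if the budget is split evenly."""
    return budget // sprite_count


enemies, bullets_per_enemy = 15, 7
bullets = enemies * bullets_per_enemy          # bullet sprites only
total_sprites = enemies + bullets              # bullets plus the enemies

print(cycles_per_sprite(100))                  # 564
print(bullets, total_sprites)                  # 105 120
print(cycles_per_sprite(total_sprites))        # 470
```

Counting the enemies themselves as sprites, the per-sprite budget drops to roughly 470 cycles.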

Still, it would be interesting to see, how transparency directly affects the performance.

13 hours ago, 42bs said:

Added a 2nd version, now 100 tiles each 16x10 pixels. Takes slightly more time to draw.

https://github.com/42Bastian/lynx_hacking/tree/master/chained_scbs2

What's the exact device timing here? I presume those tiles are not transparent, correct?

What were the sprite dimensions / total pixels drawn in the first benchmark?

This interests me a great deal because, obviously, this was the first thing that popped into my mind when reading the docs: when doing flat shading, does it make sense to burn cycles on creating a list of chained scanlines? The problem is that the list is different each and every frame, so it's questionable whether it can even be faster, given that so many scanlines are so short (less than 10 px). For those, you don't want to lose any more time than you already did.

I'm not doing the benchmark just yet, must resist and focus on the working game first :lol:

Clearly, a 16 MHz blitter [paired with an 8-bit CPU] is really, really nice.

drludos · July 10, 2019

Oh by-the-way, I have another "performance question" related to Suzy:

Is it faster to render a small sprite stretched to a larger size by Suzy using the SCB or to render a big sprite directly?

For example, if I want to draw a blue fullscreen background, would it be faster for Suzy to draw a 1px*1px blue sprite stretched to 160*102 or to draw directly a 160*102 blue rectangle sprite?

42bs · July 10, 2019

4 hours ago, drludos said:

Oh by-the-way, I have another "performance question" related to Suzy:

Is it faster to render a small sprite stretched to a larger size by Suzy using the SCB or to render a big sprite directly?

For example, if I want to draw a blue fullscreen background, would it be faster for Suzy to draw a 1px*1px blue sprite stretched to 160*102 or to draw directly a 160*102 blue rectangle sprite?

Not tested yet, but I assume, reading 8160+ bytes and then writing 8160 bytes should take longer than just reading 5 and writing 8160.
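As a quick check of where the 8160 figure comes from, here is a small sketch. The screen size and 16-color depth are from this thread; the one extra offset byte per sprite line is an assumption about the literal sprite data format, which would explain the "+" in "8160+".

```python
WIDTH, HEIGHT = 160, 102      # screen size in pixels
BITS_PER_PIXEL = 4            # 16 colors


def packed_bytes(width, height, bits_per_pixel):
    """Bytes needed for width x height pixels, rounded up to whole bytes."""
    return (width * height * bits_per_pixel + 7) // 8


def literal_sprite_bytes(width, height, bits_per_pixel, offset_bytes_per_line=1):
    """Assumption: every sprite line carries one extra offset byte."""
    line_bytes = (width * bits_per_pixel + 7) // 8 + offset_bytes_per_line
    return line_bytes * height


screen_bytes = packed_bytes(WIDTH, HEIGHT, BITS_PER_PIXEL)
sprite_bytes = literal_sprite_bytes(WIDTH, HEIGHT, BITS_PER_PIXEL)
print(screen_bytes, sprite_bytes)   # 8160 8262
```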

42bs · July 10, 2019

VladR, the first version writes the whole screen, the second "only" 160x100 (hence each tile is 16x10).

42bs · July 10, 2019

11 minutes ago, 42bs said:

Not tested yet, but I assume, reading 8160+ bytes and then writing 8160 bytes should take longer than just reading 5 and writing 8160.

Proof: no difference.

I tried:

Packed 2 color sprite (10x1) sized up 160x102 => 3ms

Literal 16 color sprite (1x1) sized up 160x102 => 3ms

Literal 16 color sprite (160x102) => 3ms

Suzy, I stand corrected ;-)

42bs · July 10, 2019

8 hours ago, VladR said:

What's the exact device timing here ? I presume those tiles are not transparent, correct ?

No difference whether I draw normal sprites (with transparency) or background sprites. Always 12 ms for cls + 100 tiles.
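To put the 12 ms into perspective, here is a rough check against the frame time. The 60 Hz rate is the one mentioned at the start of the thread; 50 Hz and 75 Hz are included as assumed alternative refresh settings. It only compares drawing time and ignores any CPU work done per frame.

```python
def frame_time_ms(refresh_hz):
    """Duration of one frame in milliseconds."""
    return 1000.0 / refresh_hz


def fits_in_frame(draw_ms, refresh_hz):
    """True if the drawing time alone fits into one frame."""
    return draw_ms <= frame_time_ms(refresh_hz)


DRAW_MS = 12   # measured above: cls + 100 tiles

for hz in (50, 60, 75):
    print(hz, round(frame_time_ms(hz), 2), fits_in_frame(DRAW_MS, hz))
```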

VladR · July 10, 2019

Man, I would kill to see the internal HW implementation. This is the same bulls*it as with Jaguar's Blitter where it doesn't matter whether I draw two nontransparent bitmaps 768x240 or if they are transparent. I strongly encourage everybody to try to implement it in SW and compare the results.

Which means there's only one - albeit bloody insane - explanation : The HW always treats the payload same in the inner loop and performs the RenderTarget Read + per-pixel condition. Even for non-transparent ones, where it's not needed.

Even if they had a separate silicon (with parallel execution) for this particular purpose (which I doubt), it still should be faster to just dump the payload than apply per-pixel conditioning.

VladR · July 10, 2019

There's one more scenario related to scanline scaling (that I'm currently working with). You have a scanline, that depending on distance, ranges in width from 4 pixels to 160 pixels.

Now, on Jaguar, I already did the benchmarking, and it doesn't matter whether you render 2x2 = 4 pixels or 128x128 = 16,384. It still takes the exact same amount of time, meaning the HW is literally going through each and every pixel, in brute-force.

I suspect the same will be true for Suzy, based on your sizing example ?

Cyprian · July 10, 2019

There is another explanation:

'DMA' channels can have different memory access slots, e.g. the source uses 'even' and the destination 'odd' slots, as is done in the Amiga blitter.

Therefore, when you use only the destination channel, the 'even' cycles are free, and there is no time difference between 'copy' and 'clear'.
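A toy model of this hypothesis (not a description of the real Suzy hardware): if reads may only happen in even slots and writes only in odd slots, then the number of slots is set by the writes alone, and skipping the reads saves nothing. The function name and the slot scheme are assumptions for illustration.

```python
def blit_slots(units, read_source):
    """Count memory slots when reads use even slots and writes use odd slots."""
    slots = reads = writes = 0
    for _ in range(units):
        if read_source:
            reads += 1      # even slot is used for a source fetch
        slots += 1          # the even slot passes either way
        writes += 1         # odd slot: destination write
        slots += 1
    return slots, reads, writes


copy_slots, copy_reads, _ = blit_slots(8160, read_source=True)
clear_slots, clear_reads, _ = blit_slots(8160, read_source=False)
print(copy_slots == clear_slots, copy_reads, clear_reads)
```

In this model 'copy' and 'clear' take the same number of slots even though 'clear' performs no reads, which would match the identical timings measured above.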

VladR · July 10, 2019

Thanks. It would be very interesting for me to read more on the HW implementation of Blitters. But not the diagrams, those pictures are useless. A detailed description of the processing of the inner and outer loops during blitting, describing all the stages of the HW pipeline, including timing.

I'm just annoyed there's no performance advantage to doing least amount of processing. I wouldn't expect anything significantly parallelized in 1989 for an 8-bit HW.

Fadest · July 10, 2019

13 minutes ago, Cyprian_K said:

There is another explanation:

'DMA' channels can have different memory access slots, e.g. the source uses 'even' and the destination 'odd' slots, as is done in the Amiga blitter.

As the Atari Lynx was created by the same guys as the original Amiga, they probably reused some of their best ideas.

Cyprian · July 10, 2019

3 hours ago, Fadest said:

As the Atari Lynx was created by the same guys as the original Amiga, they probably reused some of their best ideas.

Yep, but I would not use the word 'best' in that case.

42bs · July 10, 2019

1 minute ago, Cyprian_K said:

Yep, but I would not use the word 'best' in that case.

Don't judge a 30 year old machine by today's possibilities.

+karri · July 10, 2019

Guys. What a pity you tell me this now. I could have written On Duty with linked sprites but now I am too far in the project for changing the design. I already used the RAM for other things and there is no way to change this any more.

drludos · July 10, 2019

11 hours ago, 42bs said:

Proof: no difference.

I tried:

Packed 2 color sprite (10x1) sized up 160x102 => 3ms

Literal 16 color sprite (1x1) sized up 160x102 => 3ms

Literal 16 color sprite (160x102) => 3ms

Suzy, I stand corrected

Woaw, thanks a lot for the numbers and your test, it's very good to know!

+bhall408 · July 10, 2019

Is there a way to check if an existing title is making use of chained SCBs?

I had been wondering why I saw the performance of Ms. Pac-Man increase (in emulation) as you eat the dots.

It just didn't make sense to me... Until...

If the dots were part of a linked list of sprites, then that would make total sense -- eating the dots would remove them from the chain, making it shorter, and thus less work to do.

Any way to confirm this?

And if that *is* the reason, then all the more excuse to spend some time improving how Handy core handles large lists of sprites.
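The idea can be sketched as a plain linked list. This is only an illustration of the guess above, not how Ms. Pac-Man is known to work; the `Scb` class and the dot positions are hypothetical.

```python
class Scb:
    """Hypothetical stand-in for one sprite control block in a chain."""

    def __init__(self, x, y):
        self.x, self.y = x, y
        self.next = None


def build_chain(positions):
    """Link one SCB per position and return the head of the chain."""
    head = None
    for x, y in reversed(positions):
        node = Scb(x, y)
        node.next = head
        head = node
    return head


def unlink(head, x, y):
    """Return the chain head after removing the first SCB at (x, y)."""
    if head is None:
        return None
    if (head.x, head.y) == (x, y):
        return head.next
    node = head
    while node.next is not None:
        if (node.next.x, node.next.y) == (x, y):
            node.next = node.next.next
            break
        node = node.next
    return head


def chain_length(head):
    """Number of SCBs the sprite engine would have to walk."""
    count = 0
    while head is not None:
        count += 1
        head = head.next
    return count


dots = build_chain([(x * 8, 40) for x in range(10)])
before = chain_length(dots)
dots = unlink(dots, 24, 40)     # the dot at (24, 40) gets eaten
after = chain_length(dots)
print(before, after)            # 10 9
```

Every eaten dot shortens the chain by one entry, so there is less for the sprite engine (or its emulation) to process per frame.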

42bs · July 10, 2019

30 minutes ago, bhall408 said:

Is there a way to check if an existing title is making use of chained SCBs?

I had been wondering why I saw the performance of Ms. Pac-Man increase (in emulation) as you eat the dots.

It just didn't make sense to me... Until...

If the dots were part of a linked list of sprites, then that would make total sense -- eating the dots would remove them from the chain, making it shorter, and thus less work to do.

Any way to confirm this?

And if that *is* the reason, then all the more excuse to spend some time improving how Handy core handles large lists of sprites.

Take the source and add some debug output in susie.cpp ;-)
