
Chained vs. un-chained SCBs


42bs


I made a small test that draws the Phobyx logo wobbling along a sine wave (like demo0006), once as a chain of 102 SCBs (w/o size and palette reloading) and once as 102 separate sprites (flipped, but that only changes the drawing order).

(See https://github.com/42Bastian/lynx_hacking/tree/master/chained_scbs).
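
For readers who have not used chained SCBs, here is a rough sketch of the two setups being compared, modelled in Python rather than in 6502 assembly. It assumes the SCB layout as the Lynx documentation describes it (three control bytes, then a 16-bit next pointer at offset 3, with a zero high byte ending the chain). The addresses, the 16-byte spacing and the helper names are made up for illustration and are not taken from the linked repository.

```python
# Toy model of Lynx RAM holding sprite control blocks (SCBs).
# Assumed layout: bytes 0-2 are SPRCTL0, SPRCTL1, SPRCOLL; bytes 3-4 hold
# the little-endian address of the next SCB. A next pointer whose high
# byte is 0 ends the chain.
NEXT_OFFSET = 3
SCB_STRIDE = 16          # hypothetical spacing between SCBs in RAM

def write_next(ram, scb_addr, next_addr):
    ram[scb_addr + NEXT_OFFSET] = next_addr & 0xFF
    ram[scb_addr + NEXT_OFFSET + 1] = next_addr >> 8

def build_scbs(ram, base, count, chained):
    """Lay out `count` SCBs; link them only if `chained` is true."""
    addrs = [base + i * SCB_STRIDE for i in range(count)]
    for i, addr in enumerate(addrs):
        last = i == count - 1
        write_next(ram, addr, 0 if (last or not chained) else addrs[i + 1])
    return addrs

def sprites_drawn_per_start(ram, first_scb):
    """Count the SCBs processed from one start of the sprite engine."""
    count, addr = 0, first_scb
    while addr >> 8:                      # zero high byte terminates
        count += 1
        addr = ram[addr + NEXT_OFFSET] | (ram[addr + NEXT_OFFSET + 1] << 8)
    return count

ram = bytearray(0x10000)
chained = build_scbs(ram, 0x4000, 102, chained=True)
print(sprites_drawn_per_start(ram, chained[0]))    # whole list from one start

separate = build_scbs(ram, 0x6000, 102, chained=False)
starts_needed = sum(1 for a in separate if sprites_drawn_per_start(ram, a) == 1)
print(starts_needed)                               # one start per sprite
```

In the chained case the CPU kicks off Suzy once and she follows the pointers herself; in the un-chained case the CPU has to set up and start Suzy for every sprite, which is where the extra time goes.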

 

Chained is always quicker and on a Lynx II it can be drawn in a single frame at 60Hz.

But not on Handy.

 

Just wanted to share this.

Link to comment
Share on other sites

2 hours ago, 42bs said:

 

Chained is always quicker and on a Lynx II it can be drawn in a single frame at 60Hz.

But not on Handy.

 

Just wanted to share this.

 

Is this issue exposed in any released games?

 

Is it worth trying to fix/address?

Link to comment
Share on other sites

1 hour ago, bhall408 said:

 

Is this issue exposed in any released games?

 

Is it worth trying to fix/address?

You mean fix Handy? I think it is important to know that Handy is slower w/ respect to sprite painting. But it mostly matters if you count cycles (or are right on the edge between running at 30fps and only 20fps).

Link to comment
Share on other sites

2 hours ago, 42bs said:

You mean fix Handy? I think it is important to know that Handy is slower w/ respect to sprite painting. But it mostly matters if you count cycles (or are right on the edge between running at 30fps and only 20fps).

Yes, I mean fix Handy...

Link to comment
Share on other sites

2 hours ago, 42bs said:

You mean fix Handy? I think it is important to know that Handy is slower w/ respect to sprite painting. But it mostly matters if you count cycles (or are right on the edge between running at 30fps and only 20fps).

As someone who is trying to make a 60fps game on Handy, I'm glad to read that! :)

I'm quite impressed that the Lynx can draw a 102 sprite chain at 60fps, someone will maybe make a manic shooter game on it someday!

 

Did you make your test with Mednafen too, as it's apparently *faster* than real hardware?

Link to comment
Share on other sites

Mednafen does not work on my PC. But anyway, AFAIK all emulators use the same code base: Keith Wilkins' Handy.

Since it makes no difference what kind of video output or scaling I use, I guess it is a "limitation" in the Suzy emulation.

Link to comment
Share on other sites

On ‎7‎/‎7‎/‎2019 at 3:56 PM, drludos said:

As someone who is trying to make a 60fps game on Handy, I'm glad to read that! :)

I'm quite impressed that the Lynx can draw a 102 sprite chain at 60fps, someone will maybe make a manic shooter game on it someday!

 

Did you make your test with Mednafen too, as it's apparently *faster* than real hardware?

Yeah, just not with 102 sprites and at 60 fps :)

You are, quite literally, multiplying the processing cycles per sprite by 100 here, so it's going to add up real fast. With ~56,467 cycles of CPU time available per frame (after Mikey reads the framebuffer), you have about 564 cycles per sprite to handle its behavior if you want to stay within one frame time.

 

That should be doable, and you could still fall back to 30 fps (plenty for such a small screen anyway). Now, if you had, say, 15 enemies, each with ~7 bullets, that's around 105 bullets on screen. If the bullets were just horizontal or vertical, they wouldn't have to be transparent, so that should help.
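
To make the budget concrete, here is the same arithmetic as a small Python sketch. It only uses the 56,467-cycle figure quoted above; the function name is made up, and it assumes that running the logic at 30 fps simply doubles the available cycles.

```python
CYCLES_PER_FRAME = 56_467   # figure quoted above for one 60 Hz frame

def cycles_per_sprite(sprites, frames_per_update=1):
    """CPU cycles available per sprite if the logic runs once every N frames."""
    return CYCLES_PER_FRAME * frames_per_update // sprites

print(cycles_per_sprite(100))        # 564 cycles at 60 fps
print(cycles_per_sprite(100, 2))     # 1129 cycles at 30 fps

bullets = 15 * 7                     # 15 enemies with 7 bullets each
print(bullets)                       # 105
print(cycles_per_sprite(bullets + 15))   # 470 once the enemies are counted too
```

A few hundred cycles is not much on a 65C02 once position updates and collision checks are included, which is why dropping to 30 fps is the comfortable option.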

Still, it would be interesting to see how transparency directly affects performance.

 

13 hours ago, 42bs said:

Added a 2nd version, now 100 tiles each 16x10 pixels. Takes slightly more time to draw.

https://github.com/42Bastian/lynx_hacking/tree/master/chained_scbs2

What's the exact device timing here? I presume those tiles are not transparent, correct?

What were the sprite dimensions / total pixels drawn in the first benchmark?

 

This interests me a great deal because, obviously, this was the first thing that popped into my mind when reading the docs: when doing flat shading, does it make sense to burn cycles on creating a list of chained scanlines? The problem is that the list is different each and every frame, so it's questionable whether it can even be faster, given that so many scanlines are so short (less than 10 px); for those, you don't want to lose any more time than you already did.

I'm not doing the benchmark just yet, must resist and focus on the working game first :lol:

 

Clearly, a 16 MHz blitter (paired with an 8-bit CPU) is really, really nice :)

 

Link to comment
Share on other sites

Oh by-the-way, I have another "performance question" related to Suzy:

Is it faster to render a small sprite stretched to a larger size by Suzy using the SCB or to render a big sprite directly?

 

For example, if I want to draw a blue fullscreen background, would it be faster for Suzy to draw a 1px*1px blue sprite stretched to 160*102 or to draw directly a 160*102 blue rectangle sprite?

Link to comment
Share on other sites

4 hours ago, drludos said:

Oh by-the-way, I have another "performance question" related to Suzy:

Is it faster to render a small sprite stretched to a larger size by Suzy using the SCB or to render a big sprite directly?

 

For example, if I want to draw a blue fullscreen background, would it be faster for Suzy to draw a 1px*1px blue sprite stretched to 160*102 or to draw directly a 160*102 blue rectangle sprite?

Not tested yet, but I assume, reading 8160+ bytes and then writing 8160 bytes should take longer than just reading 5 and writing 8160.

Link to comment
Share on other sites

11 minutes ago, 42bs said:

Not tested yet, but I assume, reading 8160+ bytes and then writing 8160 bytes should take longer than just reading 5 and writing 8160.

Proof: no difference.

I tried:

Packed 2 color sprite (10x1) sized up 160x102 => 3ms

Literal 16 color sprite (1x1) sized up 160x102 => 3ms

Literal 16 color sprite (160x102) => 3ms

 

Suzy, I stand corrected ;-)
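
For reference, the arithmetic behind the 8160-byte figure and the fill rate the 3 ms result implies, as a small Python sketch. The timing was reported rounded to whole milliseconds, so the rate is only a rough estimate, and the variable names are made up.

```python
WIDTH, HEIGHT, BITS_PER_PIXEL = 160, 102, 4

# Bytes Suzy has to write to cover the whole screen at 4 bits per pixel.
framebuffer_bytes = WIDTH * HEIGHT * BITS_PER_PIXEL // 8
print(framebuffer_bytes)              # 8160

measured_ms = 3                       # same result in all three tests above
bytes_per_us = framebuffer_bytes / (measured_ms * 1000)
print(round(bytes_per_us, 2))         # 2.72 bytes written per microsecond
```

Since the result is the same whether the source is a handful of bytes or a full-size literal sprite, the destination writes appear to dominate the time in this test.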

 

Link to comment
Share on other sites

8 hours ago, VladR said:

What's the exact device timing here ? I presume those tiles are not transparent, correct ?

 

No difference whether I draw normal (with transparency) or background sprites. It is always 12 ms for cls + 100 tiles.
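
A back-of-the-envelope split of those 12 ms, sketched in Python. It rests on two assumptions that were not measured here: that the cls costs the same 3 ms as the fullscreen sprite test earlier in the thread, and that the per-pixel fill rate is the same for small tiles. All timings are rounded to whole milliseconds, so treat the result as a rough indication only.

```python
SCREEN_PIXELS = 160 * 102
FULLSCREEN_US = 3000             # fullscreen sprite test, rounded to 1 ms
TOTAL_US = 12000                 # cls + 100 tiles
TILES, TILE_PIXELS = 100, 16 * 10

us_per_pixel = FULLSCREEN_US / SCREEN_PIXELS
us_per_tile = (TOTAL_US - FULLSCREEN_US) / TILES   # assumes cls costs 3 ms
fill_us = TILE_PIXELS * us_per_pixel               # pure pixel-fill share
overhead_us = us_per_tile - fill_us                # SCB fetch, line setup, ...

print(round(us_per_tile), round(fill_us), round(overhead_us))
```

Under these assumptions each 16x10 tile costs about 90 µs, of which only about 29 µs is pixel fill. The rest would be per-sprite and per-line overhead, which would explain why many small sprites are much more expensive per pixel than one big one.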

Link to comment
Share on other sites

Man, I would kill to see the internal HW implementation. This is the same bulls*it as with the Jaguar's Blitter, where it doesn't matter whether the two 768x240 bitmaps I draw are transparent or not. I strongly encourage everybody to try to implement it in SW and compare the results.

 

Which means there's only one (albeit bloody insane) explanation: the HW always treats the payload the same in the inner loop and performs the render-target read plus the per-pixel condition, even for non-transparent sprites, where it's not needed.

 

Even if they had separate silicon (with parallel execution) for this particular purpose (which I doubt), it should still be faster to just dump the payload than to apply per-pixel conditioning.

Link to comment
Share on other sites

There's one more scenario related to scanline scaling (that I'm currently working with). You have a scanline that, depending on distance, ranges in width from 4 pixels to 160 pixels.

 

Now, on Jaguar, I already did the benchmarking, and it doesn't matter whether you render 2x2 = 4 pixels or 128x128 = 16,384. It still takes the exact same amount of time, meaning the HW is literally going through each and every pixel, in brute-force.

 

I suspect the same will be true for Suzy, based on your sizing example?

Link to comment
Share on other sites

There is another explanation:

'DMA' channels can use different memory access slots, e.g. the source uses the 'even' slots and the destination the 'odd' ones, as is done in the Amiga blitter.

Therefore, when you use only the destination channel, the 'even' cycles simply go unused, and there is no time difference between 'copy' and 'clear'.
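
A toy model of this idea in Python, to show why copy and clear would take the same time under such a scheme. This only illustrates the hypothesis; it is not a description of how Suzy or the Amiga blitter actually schedule memory accesses, and the function name is made up.

```python
def blit_slots(units, uses_source):
    """Count bus slots for a blit when source reads may only happen in even
    slots and destination writes only in odd slots (hypothetical scheme)."""
    slots = 0
    written = 0
    fetched = not uses_source          # a clear never waits for source data
    while written < units:
        if slots % 2 == 0:             # even slot: source channel
            if uses_source:
                fetched = True         # fetch the next unit of source data
            # without a source this slot is simply left idle
        else:                          # odd slot: destination channel
            if fetched:
                written += 1
                fetched = not uses_source
        slots += 1
    return slots

copy_slots = blit_slots(8160, uses_source=True)
clear_slots = blit_slots(8160, uses_source=False)
print(copy_slots, clear_slots)         # identical in this model
```

In this model the clear cannot go faster because it is never allowed to write in an even slot, so skipping the source read buys nothing.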

Link to comment
Share on other sites

Thanks. It would be very interesting for me to read more on the HW implementation of Blitters. But not the diagrams, those pictures are useless. I mean a detailed description of the processing of the inner and outer loops during blitting, describing all the stages of the HW pipeline, including timing.

 

I'm just annoyed there's no performance advantage to doing the least amount of processing. I wouldn't expect anything significantly parallelized in 1989 for 8-bit HW.

Link to comment
Share on other sites

13 minutes ago, Cyprian_K said:

There is another explanation:

'DMA' channels can use different memory access slots, e.g. the source uses the 'even' slots and the destination the 'odd' ones, as is done in the Amiga blitter.

As the Atari Lynx was created by the same guys as the original Amiga, they probably reused some of their best ideas.

Link to comment
Share on other sites

Guys. What a pity you tell me this now. I could have written On Duty with linked sprites, but now I am too far into the project to change the design. I already used the RAM for other things and there is no way to change this anymore.

Link to comment
Share on other sites

11 hours ago, 42bs said:

Proof: no difference.

I tried:

Packed 2 color sprite (10x1) sized up 160x102 => 3ms

Literal 16 color sprite (1x1) sized up 160x102 => 3ms

Literal 16 color sprite (160x102) => 3ms

 

Suzy, I stand corrected ;-)

 

Wow, thanks a lot for the numbers and your test, it's very good to know!

Link to comment
Share on other sites

Is there a way to check if an existing title is making use of chained SCBs?

 

I had been wondering why I saw the performance of Ms. Pac-Man increase (in emulation) as you eat the dots.

 

It just didn't make sense to me... Until...

 

If the dots were part of a linked list of sprites, then that would make total sense -- eating the dots would remove them from the chain, making it shorter, and thus less work to do.

 

Any way to confirm this?

 

And if that *is* the reason, then all the more excuse to spend some time improving how Handy core handles large lists of sprites.
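
To illustrate the idea, a minimal Python sketch of unlinking one sprite from an SCB chain. Whether Ms. Pac-Man actually does this is unconfirmed. The layout assumed here (16-bit next pointer at offset 3, zero high byte ends the chain) follows the Lynx documentation as I understand it, and the addresses and helper names are made up.

```python
NEXT_OFFSET = 3   # assumed: 16-bit next-SCB pointer after three control bytes

def get_next(ram, scb):
    return ram[scb + NEXT_OFFSET] | (ram[scb + NEXT_OFFSET + 1] << 8)

def set_next(ram, scb, target):
    ram[scb + NEXT_OFFSET] = target & 0xFF
    ram[scb + NEXT_OFFSET + 1] = target >> 8

def chain_length(ram, first):
    n, scb = 0, first
    while scb >> 8:                    # zero high byte ends the chain
        n, scb = n + 1, get_next(ram, scb)
    return n

def unlink(ram, first, victim):
    """Remove `victim` from the chain; returns the (possibly new) first SCB."""
    if first == victim:
        return get_next(ram, victim)
    scb = first
    while scb >> 8:
        if get_next(ram, scb) == victim:
            set_next(ram, scb, get_next(ram, victim))   # bypass the victim
            break
        scb = get_next(ram, scb)
    return first

ram = bytearray(0x10000)
dots = [0x5000 + 16 * i for i in range(200)]            # hypothetical dot SCBs
for here, there in zip(dots, dots[1:] + [0]):
    set_next(ram, here, there)

head = dots[0]
print(chain_length(ram, head))        # 200
head = unlink(ram, head, dots[57])    # a dot gets eaten
print(chain_length(ram, head))        # 199
```

Every eaten dot shortens the list Suzy has to walk, so the frame would get cheaper as the maze empties, which matches the behaviour described above.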

Link to comment
Share on other sites

30 minutes ago, bhall408 said:

Is there a way to check if an existing title is making use of chained SCBs?

 

I had been wondering why I saw the performance of Ms. Pac-Man increase (in emulation) as you eat the dots.

 

It just didn't make sense to me... Until...

 

If the dots were part of a linked list of sprites, then that would make total sense -- eating the dots would remove them from the chain, making it shorter, and thus less work to do.

 

Any way to confirm this?

 

And if that *is* the reason, then all the more excuse to spend some time improving how Handy core handles large lists of sprites.

Take the source and add some debug output in susie.cpp ;-)

Link to comment
Share on other sites
