
Why are frame rates of Jaguar 3D games so low?


agradeneu


Because it's NOT as simple as just copying data. There's a LOT more to think about: AI, poly setup, lighting & colour effects, Z-buffering, etc.

 

If one adds in all the high-end goodies, those will take up a lot more time than just copying data. But judging from the speedup in PC Doom when I cut the resolution in half, I'd say that the primary limiting factor to PC speed is the data munging. If the Jaguar could do that efficiently, I would think Doom should be able to run at 20-30fps (by my eyeball it was probably closer to about 10).
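A quick back-of-the-envelope sketch of the reasoning above (my own numbers, not from the post): if a software renderer is purely fill-rate bound, frame time scales with the number of pixels written, so halving the window in each dimension should roughly quadruple the frame rate.

```python
def fps_if_fill_bound(base_fps, base_res, new_res):
    """Estimate fps at a new resolution, assuming time-per-frame
    is proportional to the number of pixels written."""
    base_pixels = base_res[0] * base_res[1]
    new_pixels = new_res[0] * new_res[1]
    return base_fps * base_pixels / new_pixels

# Doom's 320x200 vs. a half-size 160x100 view (4x fewer pixels):
print(fps_if_fill_bound(10, (320, 200), (320, 200)))  # 10.0
print(fps_if_fill_bound(10, (320, 200), (160, 100)))  # 40.0
```

Real engines are never purely fill-bound, so the observed speedup falls short of this ideal; the gap is a rough measure of how much time goes to everything other than the data munging.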


I may be misunderstanding VHS: I thought tapes always ran at the same fps, but that the amount of tape used to store the data differed correspondingly. That's why the heads no longer line up to give still frames (pause) at speeds other than SP, since the data is mechanically in different positions.

I could be entirely wrong of course. This really is a guess more than anything.

 

VHS always records at 30fps (NTSC) or 25fps (PAL). Each frame is recorded as a diagonal stripe on the tape. When a tape is recorded in SP mode, the stripes are far enough apart that there is no bleed between them. In the LP mode, there is a little bit of bleed, but not much; in EP mode, there is considerable bleeding.

 

When a two-head VCR is doing anything other than running at normal speed, the heads will not remain in contact with the stripes but will instead slip from stripe to stripe. On a tape recorded at the EP speed, the stripes are so close together that there's no "dead time" as the head passes from one to the next. On a tape recorded at SP speed, the stripes are much further apart and there is considerable dead time between them.

 

Four-head VCRs improve special-effects performance at SP speed by using larger heads; these larger heads may also be angled to match the stripe angle of a paused SP recording. Once upon a time, there used to be six-head VCRs which had another pair of heads for the LP speed, but I haven't seen one of those in well over a decade.


If one adds in all the high-end goodies, those will take up a lot more time than just copying data. But judging from the speedup in PC Doom when I cut the resolution in half, I'd say that the primary limiting factor to PC speed is the data munging. If the Jaguar could do that efficiently, I would think Doom should be able to run at 20-30fps (by my eyeball it was probably closer to about 10).

 

 

Ahh, but Doom is a raycasting engine, not a 3D poly engine.

I believe Doom was hardwired to 15fps, but I've NOT checked this.



Thank you for the confirmation that it was due to the physical positioning of data relative to the heads; the tighter packing of data causing the heads to contact adjacent stripes makes sense of that.


Ahh, but Doom is a raycasting engine, not a 3D poly engine.

I believe Doom was hardwired to 15fps, but I've NOT checked this.

 

Yeah, I'll admit Doom cheats, but even in my Quake days I used to play at low resolutions to avoid the slowdowns caused by higher ones. Quake is "real" 3D with considerably more sophistication than needed for a good 3D experience (e.g. allowing each poly to be lit by up to four switchable light sources), but it still seemed to spend most of its time munging graphics.


 

I may be misunderstanding VHS: I thought tapes always ran at the same fps, but that the amount of tape used to store the data differed correspondingly due to the different tape speeds. That's why the heads no longer line up to give still frames (pause) at speeds other than SP, since the data is physically in different positions relative to the heads.

I could be entirely wrong of course.

 

That's probably it. I just figured since it used more or less tape, maybe it could run at different speeds. I always wondered about security VCRs; they're good enough now that even though they're putting upwards of 24 hours on one tape (and that's in real time, not using the frame-skip feature, which can push it up to over a month if I remember right), they're not losing much if any picture quality to run like that. Now if you could get a home player to do that, it would be awesome!

 

Kinda cool to play those tapes in a home VCR: everything's super fast. And same with playing a home tape in the security player: everything's in slow motion.

 

 

With DVD recorders, yes, you can come back and record more. There is actually a disc "closing" procedure one goes through at the end of recording to make it compatible with standard players; prior to that the disc remains open and can accept more data (unless it's full). At least this is the case in the machines of my experience.

I personally would recommend DVD recorders with hard drives, though, as you can edit the data on the HD and then dump it to DVD-R or -RW much more easily.

 

Cool, I always wondered. Wasn't there a CD format that was once upon a time supposed to do that? Anyways, I may look into a DVR type for my computer then after all. That seems more efficient than putting a few megs on a 4+ gig disc and not being able to use the rest of it.

 

Four-head VCRs improve special-effects performance at SP speed by using larger heads; these larger heads may also be angled to match the stripe angle of a paused SP recording. Once upon a time, there used to be six-head VCRs which had another pair of heads for the LP speed, but I haven't seen one of those in well over a decade.

 

I actually had one of those 6-head VCRs (and supposedly there was an 8-head, but it was WAY out of my price range at the time). I'm not sure if it actually uses the heads differently or not; I'm pretty sure it just uses all of them at once to record or read a really clear, smooth picture. (One of my friends tried to tell me that was how it read stereo sound, but eh, most 4-heads will do that.)

Edited by Video

Blame the 68k and programmers' unwillingness to turn it off after booting the game. You cannot expect a 64-bit bus to get max performance when you choke it constantly with a half-speed, quarter-buswidth processor. Until developers are willing to forget that the 68k exists after boot-up, you will never see what the Jaguar is REALLY capable of. I believe IS I & II and BattleSphere are examples in the right direction of where the Jag could go. I remember Dan H. of Eclipse said that they still did not exhaust the poly-pushing abilities of the Jaguar even in IS II.

just my $0.02



Is that how Gorf was programmed? Just used the 68k to bootstrap the system, then shut it off? I may be wrong, but I think BS used the 68k for processing controller input.



No. If I remember correctly, Gorf was programmed using MAINLY the 68k and C. Like 90% or something. The remaining 10% were the object processor and blitter, etc.



:P No, no, not at all!.... :D ...Gorf Classic is 90% C-coded 68k. It did not need to use the other processors, as it is only a simple 2D tiny sprite engine that the OPL and 68k are more than capable of handling.

'Surrounded!' on the other hand only uses the 68k to perform a vblank and help the GPU jump between main and local. Well, that is how it used to do it. Now the 68k only does the vblank. That too will be removed from the 68k and done in the GPU.

The last three projects I've started for the Jag have the GPU doing ALL the jumping both to and from main and local, now successfully, and I have good ol' Atari Owl to thank for that. He figured out that part, and most cleverly I might add! Another pair of good eyes such as his is always welcome. ;)

If all these recent discoveries work like we hope, we should start seeing some seriously improved apps on the little black kitty. I love my STe and its 68k, but it should never have been used in the Jag, or should have been restricted to boot-up duty only.

I understand the flawed thinking of the warm and fuzzy feeling for adding it, but warm and fuzzy can and did in most cases become all too comfortable, and the games suffered as a result. I'm quite sure, given the source code of any game out there relying on the 68k even a little bit, I could improve its frame rate significantly.

I have moved one or two lines of code from the 68k to one of the RISCs and noticed a difference in some cases. That chip is just that much of a detriment. Now if they had used the 68k on a separate bus with, say, 64k of RAM to do all the AI, that would have been a vast improvement: the RISCs would rule the main bus, and the 68k would do all the game logic off-bus and out of the way, only feeding the DSP and GPU the info they needed through a small unified RAM buffer, dual-ported so no waits or stalls. THAT would have been smokin'! 64k is plenty of space for game logic.

man can i blab!!! :D


So we haven't seen, and will never see, a game that really uses all of the Jaguar's performance?

 

 

 

 

TOOLS!!! :) The current ones are just plain awful and certainly don't deal with the new discoveries of the GPU's main/local jumping abilities. In fact they do not really consider anything BUT the 68k for the most part. The Jag tools are 68k-based and of very little use for the RISCs, other than an assembler not written to handle the now-known abilities to jump around main and to and from local. Trust me when I tell you that the ability to code stuff the new way, with the right tools, will make all the difference in the way the Jaguar performs. Of course sloppy code only defeats the purpose, so you must code neatly. Most rules are the same, except no 68k choking the life out of the system.


Blame the 68k and programmers' unwillingness to turn it off after booting the game. You cannot expect a 64-bit bus to get max performance when you choke it constantly with a half-speed, quarter-buswidth processor.

 

How is bus arbitration among the multiple processors handled on the Jaguar? I would think that a well-designed system should allow a 68000 to co-exist pretty nicely on a 64-bit bus with other devices. If the bus interface unit contained a 64+18 bit latch for data and address, along with an address comparator, it would only have to fetch one 64-bit chunk of data for every four words of instructions fetched. Adding another 64+18-bit latch for data fetches would provide some further improvement, though not as much as the first set of latches.

 

Does the Jaguar do anything like that, or is it more sophisticated or less sophisticated?

Edited by supercat

Hello Gorf, good to see you :)

 

Fascinating post, thank you for the insight.

Edit: My current project also just uses the 68k for vblank, then halts it with a STOP.

Vblank in the GPU is proving hard to fit in (64 regs just aren't enough ;) )

 

Thank you also very much for the kind words.

 

I'd also like to add that Steve here helped me with the GPU-jumping-inside-main problem too, as I had been relying on a less thorough method which contained a misunderstanding of page sizes, such that some of my jumps within main needed to be modified by trial and error. With one I got so frustrated that I ended up jumping back to local and then back to main to get around the problem. This is no longer the case. Many thanks, Steve. :)

Edited by Atari_Owl


Bus arbitration is completely up to the programmer for the most part. The bus management is pretty nonexistent, and you need to pretty much watch where you step, so to speak. Essentially, the bus is managed in such a way that as long as the 68k is active, it is the bus master. Bad move, space cadet!


AFAIK the 68K bus accesses are not optimized, they use a whole 64-bit cycle to fetch 16 bits, so 75% of the theoretical bandwidth is wasted.

 

The docs have this to say about bus arbitration:

The CPU normally has the lowest bus priority but under interrupt its priority is increased.

 

The following list gives the priorities of all bus masters.

Highest priority

1. Higher priority daisy-chained bus master

2. Refresh

3. DSP at DMA priority

4. GPU at DMA priority

5. Blitter at high priority

6. Object Processor

7. DSP at normal priority

8. CPU under interrupt

9. GPU at normal priority

10. Blitter at normal priority

11. CPU

Lowest priority

What is a bit strange is that although the 68K has the lowest priority (so it shouldn't slow down other processors), tests show that disabling it improves performance.

 

EDIT: SCPCD's theory is that since the 68k's bus accesses last longer than other accesses (it is clocked at 13 MHz instead of the 26 MHz of the other chips on the bus), at least one cycle is potentially wasted each time one occurs.

 

EDIT 2: More than one cycle is wasted, since 68K accesses are even longer than I thought.
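The fixed-priority scheme in the list above can be sketched in a few lines (a toy model of my own, just to illustrate the arbitration rule, not anything from the Jaguar docs): each cycle, the highest-priority requester wins the bus.

```python
# Order follows the documented list; earlier entries win (higher priority).
BUS_PRIORITY = [
    "daisy-chained master", "refresh", "DSP (DMA)", "GPU (DMA)",
    "blitter (high)", "object processor", "DSP (normal)",
    "CPU (interrupt)", "GPU (normal)", "blitter (normal)", "CPU",
]

def grant(requests):
    """Return the requester that wins the bus this cycle, or None."""
    for master in BUS_PRIORITY:           # scan from highest priority down
        if master in requests:
            return master
    return None

print(grant({"CPU", "object processor", "GPU (normal)"}))  # object processor
```

Note that priority only decides *who goes next*; it says nothing about how long the winner holds the bus, which is exactly why a slow, low-priority 68k access can still stall everyone else once it has started.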

Edited by Zerosquare

AFAIK the 68K bus accesses are not optimized, they use a whole 64-bit cycle to fetch 16 bits, so 75% of the theoretical bandwidth is wasted.

 

In other words, they spent many thousands of dollars making the other chips fast, and then they wrecked everything with a horrible bus interface when a good one could still have been simple.

 

The 68000 bus interface, after all, needs to have one 16-bit data path connected to the 68000 and a 64-bit data path connected to the bus. How hard would it have been to do something like:

On data read, perform the read using a 26MHz 64-bit bus cycle, latch 16 bits of the result, and then make it available to the CPU.
On code read, if address bits 3-23 don't match the last code read, perform a 64-bit bus cycle and latch all 64 bits; on any code read, feed the 68000 the contents of the latch.
On data write, latch the write address and data, and post the write at the next opportunity.

 

Nothing complicated there--certainly nothing so fancy as the Jaguar's other chips--but the 68000's bus width utilization for a given execution speed would probably have been cut by a factor of at least three for most typical code, and by a factor of 8 or more for some other code (if a loop fits in a single 8-byte unit and does not access any data memory, it wouldn't use any bus bandwidth).
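A toy model of the code-latch rule above (my own sketch of the idea, not the actual Jaguar hardware): cache one 8-byte line in a latch keyed on address bits 3 and up, and pay a 64-bit bus cycle only when a fetch misses the latched line.

```python
def bus_cycles_for_fetches(addresses):
    """Count 64-bit bus cycles for a stream of 16-bit instruction-fetch
    addresses, given a single 8-byte code latch."""
    latched_line = None
    cycles = 0
    for addr in addresses:
        line = addr >> 3          # address bits 3+ select the latched line
        if line != latched_line:  # miss: one full 64-bit bus cycle
            cycles += 1
            latched_line = line
    return cycles

# Straight-line code: 16 sequential word fetches = 32 bytes = 4 lines.
seq = [0x1000 + 2 * i for i in range(16)]
print(bus_cycles_for_fetches(seq))        # 4 cycles instead of 16
# A tight loop that fits in one 8-byte line costs a single bus cycle:
loop = [0x2000, 0x2002, 0x2004] * 100
print(bus_cycles_for_fetches(loop))       # 1
```

That is where the factor-of-four for straight-line code comes from, and why a loop that fits in one 8-byte unit would use essentially no bus bandwidth at all.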

 

What is a bit strange is that although the 68K has the lowest priority (so it shouldn't slow down other processors), tests show that disabling it improves performance.

 

EDIT: SCPCD's theory is that since the 68k's bus accesses last longer than other accesses (it is clocked at 13 MHz instead of the 26 MHz of the other chips on the bus), at least one cycle is potentially wasted each time one occurs.

 

EDIT 2: More than one cycle is wasted, since 68K accesses are even longer than I thought.

 

That could very well be what's going on. I've seen similar things on other machines. Once the system starts a slow-device cycle, it will run to completion. If there would be many occasions when the bus would be idle for one cycle, the slow-device cycles will squeak into those and expand them.



However the actual priority is, the 68k may as well have top priority, as it totally kills the bus. You would think the priorities mattered, and to some extent they do, but really for nothing more than to keep the chips from stepping on each other. Remember, the MMU is broken (so they claim, but I think it's a combination of this and the pipeline; I don't think it is broken, but that it acts differently according to which and how many processors are active at the time), and that certainly does not help this move along correctly. I know folks who have treated the Jaguar system as if it were a 2600, essentially using only the GPU and blitter to write directly to the frame buffer. Most impressive, as the GPU will run out of main and jump to and from local with no special coding techniques. So the GPU and its abilities to access main are different depending on who else is live, and not really "broken". It works fine, but you have to work a little harder as, again, none of the tools consider these issues. The one smart thing they did in the Jag was allow ALL processors to be disabled. Like I have said before, the more code I move over to run on the RISCs, the faster the app runs, and sometimes as little as one or two lines makes a noticeable difference.



In other words, they spent many thousands of dollars making the other chips fast, and then they wrecked everything with a horrible bus interface when a good one could still have been simple.

 

 

Oh man. You'll get the Frog started up again with talk like that! :rolling:


 


Zerosquare is right though. I know for a fact that this is essentially the case with the Jag from my own tests. It does not belong in that system. Look at the CoJag: what a difference in performance when you use a more suitable processor as the bus master (as in code master). CoJag is the same clock speed as the Jaguar; the only difference is more RAM and the 68020 or R3000 (not sure why they switched these around). My guess is the Jaguar chipset would operate most efficiently the way it was intended to by its very designers: without ANY other master CPU, unless it was on its own separate bus. Shoot... the 68k on a separate bus would have been fine. And again, 64k would have been loads for any gaming AI.


Oh man. You'll get the Frog started up again with talk like that! :rolling:

 

Not familiar with the Frog, but I often find myself bothered when I see designs which were ruined by decisions which would have seemed feeble even when they were made. Things like the 2600 HMOVE circuitry don't really bother me since it works perfectly fine for anything anybody wanted to do with it when it was designed; it wasn't until many years later that other possibilities emerged.

 

On the other hand, I find myself annoyed at things like the design of Microchip's high-end 8-bit micros:

  • On the 18xxx parts, the first 128 bytes of RAM may be accessed freely regardless of what 256-byte bank is selected (good).
  • The newer parts allow the user to turn on an extended instruction set which includes instructions to add or subtract a constant from the pointer registers (good), but
  • Whenever the extended instruction set is enabled, some of the 128 bytes of "always available" address space get converted over to indexed addressing using the FSR2 register (could be good, except...)
  • 96 bytes are converted over to FSR2+n (where n is 0 to 95), with only 32 bytes remaining as always-available RAM, and to make matters worse
  • The extended instruction set cannot be switched on and off at run-time

I have nothing whatsoever against adding instructions to a processor, especially when the opcodes that are replaced did nothing useful. And being able to directly access the top 8 or 16 bytes of a stack frame would also be a good and useful feature. But most stack frames on something like a PIC are apt to be eight bytes or less, and the existence of the add/subtract pointer instructions means that accessing variables beyond the direct indexing range would not be a problem. So why force the programmer to give up 3/4 of the unbanked variables!?

 

One of my current projects involves programming the TMS320LF2407A DSP. It's a member of the C2000 series, which is an offshoot of the TMS320C50 series; it's binary-compatible for those instructions it supports, but it omits nearly all of the good features of the TMS320C50:

  • On the 320C50, branch instructions featured a "delayed" form which would execute the two instructions following the branch whether or not it was taken (instructions must be fetched two cycles before they're executed, so the instruction at the branch target can't be executed until the third cycle after the branch; if one can put two useful instructions there, one can save two cycles on the cost of a branch). The C2000 only has non-delayed branches, which take four cycles.
  • On the 320C50, there's an XC instruction which will conditionally skip the following one or two instruction words. Doesn't save a whole lot over a well-arranged delayed branch, but hugely better than a four-cycle non-delayed branch. Well, except that the C2000 doesn't support XC either.
  • The 320C50 has a really nice looping feature. Load a count register with the number of times a loop should be executed, and then run a "loop" instruction specifying the address of the last instruction in the loop. The next 'n' times the last instruction of the loop is executed, the processor will perform a jump to the start of the loop, unless the user clears the 'looping' flag within the loop. The jump occurs while other instructions are executing, and thus takes zero cycles. Three guesses if the C2000 supports looping.
  • The 320C50 supports 'circular addressing mode'. Whenever a pointer is incremented or decremented, if its value comes to equal one of the circular-end registers, it will be reloaded with the corresponding circular-start register. The '50 has two start/end pairs. The C2000? None.
  • One annoyance on all the TI parts I've worked with: it's possible to read and write code space as well as data space, and some RAM can be set up as either or both, but not at the same addresses. Given that some instructions require one operand to be in code space and one in data space, but that the C compiler has no concept of accessing code space, this is a major pain.
  • On the 320C50, one can do a data-memory to data-memory move from two variable addresses. On the C2000, one of the addresses for a data-to-data move must be a constant unless one uses self-modifying code. Code-to-data or data-to-code transfers allow for both addresses to be variable, but having RAM at different addresses in the two spaces makes that a bit icky.
  • On the 320C50, there are two 32-bit accumulators. Not many instructions access the second, but there are instructions to swap them, add them, store the minimum or maximum into the second, etc.

On the 320C50, if one has a list of points and wishes to find the point furthest from the origin (x^2+y^2), the inner loop would take 3 cycles per rep. On the C2000...

; Assume accumulator holds the largest value seen thus far, and that AR2 points to the points and is currently selected
; AR3 is the loop counter.
 lt *+
 mpy *+
 lac #0
lp:
 spac   ; Subtract last multiplied point from accumulator
 bcnd not_max,ge ; Branch if accumulator >= 0 (i.e. this point wasn't the biggest)
 lac #0
not_max:
 lt *+
 mpya *+,ar3
 banz lp,*-,ar2

That loop takes 10-11 cycles (10 if max) as compared with three. What I find particularly irksome is that many of the '50's features could have saved a couple cycles, but none of them were included. Loop mode would have cut four cycles. Without loop mode, delayed branching would cut at least two. Conditional execution could cut two. With none of the other features present, the second accumulator would have cut four.

 

Oh well... At least I figured out how to abuse the address registers so as to code a reasonably-efficient square-root routine. If I hadn't managed that, an integer square root would be a real dog.

Link to comment
Share on other sites

Shoot... the 68k on a separate bus would have been fine.

 

Even having it share the bus with a somewhat reasonable controller like I described should have been fine. From my understanding of the Jag hardware, fetching four one-word instructions will tie up the bus for 8 cycles. With a suitably-designed controller, it would tie up the bus for one cycle. Data reads and writes would still have been a little doggy, but since more than half of the memory operations will usually be code fetches, that shouldn't have been a problem. Code that performs a lot of "move.l (a0)+,(a1)+" instructions in sequence wouldn't benefit so much (each instruction is one code fetch, two data reads, and two data writes) but such operations are usually best left to a blitter anyway.

 

Is there something about the Jaguar's other processors that makes them unsuitable for running general-purpose code?

Link to comment
Share on other sites

supercat, this is very true, but remember that:

 

1) The Jaguar chipset was supposed to be used with a more powerful CPU. They finally settled on a 68000 because of cost issues.

 

2) As Gorf said, the designers knew the 68K was the "dog" in the system, and expected developers not to use it. It's been said that it was put in mainly to help programmers get started (as this processor was well-known), and one of the designers even said that "the 68K is there only to read the joysticks" (in fact, you don't even have to use it for that; the GPU and the DSP can do it too).

 

3) There were time and money issues at Atari when the Jaguar was being developed. Several other "rough edges" in the chipset could probably have been fixed if they had the chance.

 

On the other hand, I find myself annoyed at things like the design of Microchip's high-end 8-bit micros.
Well, you're not the only one bothered by some design choices Microchip made ;)

 

Is there something about the Jaguar's other processors that makes them unsuitable for running general-purpose code?
Not at all. The programming "philosophy" is a bit different compared to a 68K (especially if you want to get the most out of them), but it's perfectly possible to use them as general-purpose processors.

 

It would be interesting to know what games on the Jag would have looked like if Atari had chosen not to include a CPU at all. I think there's a chance they'd have been better.

Link to comment
Share on other sites

It would be interesting to know what games on the Jag would have looked like if Atari had chosen not to include a CPU at all. I think there's a chance they'd have been better.

 

 

A simple 'stop' instruction and disabling the 68K's interrupts is all that would take. Then it would be Tom and Jerry all by themselves. The trouble with this, of course, is that since the system was designed with a 16-bit host processor, for some reason (bad design again) the DSP needed to be the same width as the bus. The DSP is lightning fast internally. It is actually slower in some cases in main RAM, and extremely unstable with jumps, more so than the GPU is.

 

Since I got the main RAM jumps and local-to-main jumps working, the only time the 68K is awake is to init the vblank once per frame. That will be eliminated as well, and once the game boots the 68K may only do an interlude in between levels or something, if I ever bother to use it at all again, that is.

 

The framerates in our games that were once 15-25, and maybe 30 when most objects were off screen, are 30-60 all the time and mostly 60. We are not done yet. This is using the very inefficient Atari sample renderer in the dev kit. DON'T get me started on that thing... beautiful features it has, but so inefficient. JagMod certainly cleaned it up and sped it up a lot though. Before the main/local jump workarounds and JagMod's mods (now you know why he is the JagMod) to Atari's renderer, we were lucky to get 5-15 frames a second. In all fairness, the renderer docs clearly state that it was not intended for use in games and should only be a template for much more efficient code.

 

Actually, I have seen sources to the cart version of Hoverstrike. Except for the rendering, everything is pretty much done on the 68K... BZZZZZTT! I would love to get my hands on that source and move it all over to the RISCs (lots of work though). However, that engine has much potential. The 68K, not the textures, is what slows that game down so much.

 

Gorf

Link to comment
Share on other sites

Well, you're not the only one bothered by some design choices Microchip made ;)

 

What design choices do you find objectionable? I think the 14-bit PICs are an elegant architecture, but have been unimpressed by the later ones. Yes, they improve some things but too many aspects seem clunky.

 

On a related note: Cypress's PSoC chips are mostly pretty great, but the usefulness of their many I/O pin modes is severely degraded by the inability to do a read-modify-write on a port pin latch. Why didn't they fix that on the second-generation chips like the CY8C27xx, since the problem was readily apparent on the first-generation ones?

 

BTW, design features I'd like to see in a microcontroller and can't figure out why I haven't:

 

-1- A UART which supports fractional baud rates nicely. Really very easy, actually--even a 22V10 could handle it.

 

-2- Inputs (for counters, etc.) that are synchronized to the CPU clock when it's running but operate asynchronously when it's not or, at minimum, can be switched between synchronous and asynchronous modes without risk of losing counts. Extremely easy to implement (if nothing else, use two transparent latches for synchronization, operated by different clock phases; when the clock will be stopped, make both latches transparent). So why do companies use a straight mux to switch between async and synchronous modes?

 

Note that even if one wants to use an edge-triggered latch in the synchronizer, one could still switch cleanly between synchronous and asynchronous modes by feeding the output of the sync mux into a "3-input majority gate" along with the input to the synchronizer and the majority gate's own output. So why doesn't anyone do that?

 

-3- For chips that support wait-states on external memory, a "post-cycle" wait-state option. Some slower devices do not float their outputs quickly when de-selected. If an access to a fast device immediately follows an access to a slow one, the faster device's bus cycle might get clobbered by the slower device. Adding wait-states to the slower device won't help this problem. Adding a wait state to the faster device would allow correct operation, but the momentary bus contention would still be a bad thing. Adding a wait state following the end of the slower device cycle, during which nothing was selected on the bus, would avoid any bus contention.

 

-4- I'd like to see someone produce a really minimal RTC chip. It seems like all the RTC chips I've seen have a significant number of fancy features which contribute to price and power consumption as well as, in some cases, programming complexity (if an application needs to deal with time in Unix-style format, it would be nicer to just have the device count seconds rather than divvy it up into minutes, hours, days, weeks, months, years, etc.). If the micro that will be talking to the RTC chip has an EEPROM (pretty common these days), the RTC chip wouldn't even have to provide a "set" function. Really all that's needed is an oscillator, a counter, and a means of outputting the count.

 

None of these things seem at all complicated.

Link to comment
Share on other sites
