Jump to content
IGNORED

Doom networking / UART bug


Recommended Posts

I know in the cd encryption utility there they did use xmodem for transfer, it works at 19.200 when using it, at 19200 you will see in you terminal package on pc error's but these error's will be resolved in xmodem with re-send the package.

Maybe this xmodem protocol or something like that can be implemented in DOOM seems to be reliable with the result, I always got good downloads so the error checking and resending works ok.

 

When you need the sourcecode of the cd encryption utitlity let me know I can send it to you.

Edited by TXG/MNX
  • Like 1
Link to comment
Share on other sites

Test #3: doom_n03.abs

Changes:

  1. Changed serial packet size from 6 bytes to 7 bytes
  2. Inserted 0x80 dummy byte as the first byte in the packet
  3. Changed consistency check from XOR of both players' x and y to XOR of buttons + vblsinframe
  4. Inserted ACK byte as last byte in the packet
  5. Added retry loop for packets
  6. Reduced serial clock from 115200 bps to 38400 bps

Details:

 

I'm going to explain this one a bit, because I made some substantial changes to the network handshake. First, I changed the consistency check. The previous check was not catching a bad packet, but rather inconsistent x,y coordinates for both players. In other words, both consoles should always have exactly the same x,y pairs for both players. By changing the consistency check to the buttons + vblsframe being sent and received in each packet, I can catch a bad packet immediately upon receipt.

 

Next, I moved the consistency check to happen before the ACK byte is sent. The ACK byte is new, and allows each console to send $A1 (success) or $FF (failed). The receiving console, however, looks for $FF or $FE on this byte, because this byte is not checksummed and thus could encounter the UART bug undetected resulting in a left-shift of the byte. Any result besides $FF or $FE is considered success. Note that this XOR consistency check is certainly not foolproof; a multi-bit error could result in the generation of the same check byte. Also, if somehow the ACK byte was meant to show failed but did not report $FF or $FE, then the consoles would get out of sync.

 

Finally, I added a retry count and a resend loop for the packets. If a console detects a failure, it will send the $FF ACK byte back to the other console, then both consoles will resend the entire packet again. This can happen up to 3 times before it gives up and declares our beloved "NETWORK ERROR" and resets the level.

 

Result:

 

Networking was exceptionally stable. I never got a network crash, however after playing co-op a very long time (60+ minutes, across 4 levels) , I did notice that the two consoles got out of sync: the players could move around and do stuff, but they could no longer be seen on the other console at the expected location. While playing, I would periodically check each console to ensure the other player still could move and be seen by the other console too, so I didn't notice when they first got out of sync. It seemed to only be later in the game.

 

So, I think this version is very close to working. A previous version with this same change set did not include vblsinframe in the consistency check, and while the networking was very stable, I still saw NETWORK ERROR when loading some new levels, but not during the gameplay itself.

 

In addition to doom_n03.abs, I am attaching jagonly.C. This contains all the changes I have made thus far. I've spent a lot of hours on this over the past few days, and now I am hoping someone else can look through the source file and see if there is anything else I could do to make the networking more stable, short of rewriting it from scratch. I like the simplicity of it, so I'm just trying to build on what's there for now.

 

And I really, really could use a full debugger so I could inspect memory on demand or look at a history of serial packets. I've thought about dumping the packets on-screen or to a PC file via Skunkboard, but that would take more work so I haven't done it yet. ;)

 

TL;DR:

 

Co-op networking is greatly improved by retrying the serial transmission if an error is detected. More testing and code inspection is needed to make this 100% reliable. Try out the attached doom_n03.abs if you want to see for yourself. :)

doom_n03.zip

Edited by Songbird
  • Like 6
Link to comment
Share on other sites

Oddly enough that's pretty much what I had just finished writing for a test - typical :) - I'll attach my version if you want to see where I was going. Probably only difference is I try and resync after a transmission if the first byte is incorrect.

 

One thing I did notice is that there is a potential 'gotcha' in the use of ticcount between the jagonly.c file and doomdefs.h - it shows up as a volatile in the doomdefs but not in the jagonly file. There are also a few oddities in the original code as an unsigned is set to negative in a few functions ('c' in __mulsi3, FixedMul, FixedDiv).

 

Will try and have a look at your code tomorrow morning.

 

Link (as I can't upload attachments) https://mega.nz/#!UJM1hSIK!cAPMgFcovllvb2VeaeRruohBvCNek8LbT0QTETV7XFU

Edited by Welshworrier
  • Like 2
Link to comment
Share on other sites

Welsh, is there a key we need to download your version?

 

The issues I see with my ACK feature and/or the mysterious UART bug:

  1. Can't 100% protect against multi-bit errors.
  2. Can't 100% guarantee that consoles will read the ACK byte correctly since I don't know what happens in a cascading "left shift" UART failure for multiple bytes.
  3. Can't 100% guarantee that both consoles are on the same retry. If they're not, then one will have advanced in the game while the other is stuck in a loop.
  4. Can't 100% guarantee that the 0x80 dummy byte really aligns the UART -- if it did, then all this extra stuff wouldn't be necessary. :)
Link to comment
Share on other sites

Bugger, yes I'm afraid, I copied the wrong link and it's on my laptop - can't access mega off this phone and so will have to do it in the morning (about 6 hours time).

 

Then again.... due to my neighbours having a blazing row I've now got out of bed :)

 

Link with key:

https://mega.nz/#!UJM1hSIK!cAPMgFcovllvb2VeaeRruohBvCNek8LbT0QTETV7XFU

 

I've had a really stupid idea btw - if you only send the data 6 bits per byte (0..5) and set bit 6 to 1, any uart error would move it to the top bit and be easily identifiable. It would negate the need for the ack cycle at an increase of about 2 bytes per packet.

Edited by Welshworrier
Link to comment
Share on other sites

And here is a 'bodge version' of the jagonly source code with an attempt at trying to detect and correct the uart error by sending the main traffic as 2 nibbles with bit 6 set. Volatile now used for ticcount and consistancy check modded to actually use the integers of x and y position as interim computational values before converting to byte. Baud rate set as 115k...well because it's worth a try. Two extra functions defined that split a byte up and send as two nibbles and tag as either high or low nibble, and one that received the data in the same way.

 

https://mega.nz/#!cUMBBBbS!oKSSXt11vep5GqzDuJb7Kis-qDiVu-n9GOCgY1JgIXA

 

Minimal changes in effect but will send a larger packet across the serial link, internally it will be the same.

 

If someone could compile and try that would be nice :)

Now off to work I go.

Edited by Welshworrier
  • Like 2
Link to comment
Share on other sites

FYI, I fixed an essentially harmless bug in my jagonly.C in the doom_n03.zip download, and replaced the one in the above post. I did some more testing on my doom_n03, and I'm seeing the same behavior as yesterday: rock-solid co-op for the first three levels, then I take the secret exit to level 24, and somewhere things get out of sync. It's weird because I'm not seeing a network error, so that means there is an undetected failure somewhere.

 

I also tried the "only send nibbles" jagonly.C and it locked up immediately. :( I did a quick review of the source code and it looks OK, but otherwise I did not spend much time trying to debug this since my version is so close to perfect. I had found a similar problem to this with an earlier version of my code where I only sent a 0x80 dummy byte before every data byte, and discarded the dummy before receiving the data byte. Even though the change looked simple, it ended up locking up immediately. Never understood why.

 

Now I'm trying to use the built-in print functions to put some debug data on the screen, but not much success yet. Also, cheats don't seem to work in this version of the code... the code is there to check the Jag pad for cheats, but none of them seem to work. I need to figure this out so I could warp to level 24 immediately instead of playing for 30 min first each time. ;)

  • Like 1
Link to comment
Share on other sites

Thanks for trying that out for me - annoying it doesn't work :(

 

Your code looks fine from when I looked at it earlier today and I believe it seems like you've cracked the serial problem, however......

 

Examining the Doom code shows the way in which the network play is carried out is a bit 'pants' - in effect it plays a two player game on both consoles and simulates having a virtual joypad providing the input for the second player. This of course means that any discrepancies in timing between the two units will affect the positional information for the player i.e. you'll get a drift. The 'consistency' data byte isn't there for packet validation - it's done for positional/graphical consistency to make sure each console thinks the players are in the same positions. The fact there is a specific check for that byte being different and an attempt at network re-negotiation/restart when it doesn't match what's expected reinforces that idea (It would explain why people thought they had a fix over the years but still had it locking up at random times :) )

 

Can I suggest the following:

1. send the consistency byte across as before (based on x,y) but calculated correctly

2. add 8 bytes of data to send only the relevant player0/player1 position information across (2 ints)

3. add a check to see if the consistency is wrong and if so do a soft recover in setting the second players position to the passed value instead of trying to do a net reset as it currently does

4. add another byte for packet xor checksum/validation

 

I realise this will increase the packet size dramatically but seems necessary to remove the final 'feature'

Edited by Welshworrier
  • Like 2
Link to comment
Share on other sites

Welsh, I understand what you're saying and agree that Doom is absolutely set up to run a virtual second joypad using the linked console. This should be OK, however, given a) you have a reliable link, and b) each console waits for both local and linked input before proceeding with the next frame. In fact, that is about the simplest approach I think one could take. If you have any drift, then nothing makes sense in the game, because pressing a button will kill a monster on one console but not the other, running into a barrier will only happen on one console, etc. I'm not sure repositioning the players (a "soft" reset) will help, because the consoles will already be out of sync for some period of time.

 

I assume the Doom code does (b) already, which means all we need to figure out is how to make the link reliable. I'm pretty close, and now I have full cheats enabled plus the debug screen overlay, so I can inspect my ACK bytes in real time while playing. :) The curious thing is, in my latest build, every ACK byte always shows success both ways, and I haven't done anything to fundamentally make the serial link 100% reliable at the low level. So, I am going to insert a while(1) loop in the retry path and thus lock up if it ever hits that.

  • Like 2
Link to comment
Share on other sites

If you look at the player movement I believe you might find the issue - vblsinframe is used as a scaling factor to ensure that the second machines movement is the same as the first. This is used for x and y movement BUT NOT Z - falling down or running up could then potentially put you in different position as they would clip at different frames :)

 

You are not checking position in the consistancy byte and you are getting reliable serial comms with no ACK issues. This, to me at least, is setting off alarm bells.

 

As the network would normally fail as soon as the positions differed - about 70ms after the occurance (the original consistency check only used the fractional bits of the x/y due to the byte/int issue discussed earlier) then the drifting would not normally have been seen - especially due to the size of the weapons/objects/walls etc and the collision checking - hitting something a gnats tadger to the left/right of where the other console thought it hit wouldn't make much of a difference.

Edited by Welshworrier
  • Like 1
Link to comment
Share on other sites

You may be onto something. Now that I've run it a few more times, even at the original 115200 speed, I am actually not seeing a single ACK failure, which means the packets are going both ways fine. This could mean that it really is a higher-level console sync issue and not a low-level UART issue at all.

 

For the original consistency check, since it's 4 bytes, shouldn't it be like this:

consistancy = (tempcons>> ^ tempcons ^ (tempcons>>16) ^ (tempcons>>24)

An even simpler experiment would be to go back to the original jagonly.C and just put in the player x and y from "this" console, and forcing the "other" console to always use that x and y for the remote player. For extra credit, I could add the checksum byte and if the checksum fails, then use my ACK / retry loop to recover.

 

It would be interesting to compare against the original DOS / Linux Doom source, to see how it handled networking.

Link to comment
Share on other sites

Been looking again (I do sleep occasionally ;) )

If you change the code for P_PlayerZMovement to:


/*
===============
=
= P_PlayerZMovement
=
===============
*/

void P_PlayerZMovement (mobj_t *mo)
{	
/* */
/* check for smooth step up */
/* */
	if (mo->z < mo->floorz)
	{
		mo->player->viewheight -= mo->floorz-mo->z;
		mo->player->deltaviewheight = (VIEWHEIGHT - mo->player->viewheight)>>2;
	}

/* */
/* adjust height */
/* */
	mo->z += vblsinframe*(mo->momz>>2);
	
/* */
/* clip movement */
/* */
	if (mo->z <= mo->floorz)
	{	/* hit the floor */
		if (mo->momz < 0)
		{
			if (mo->momz < -GRAVITY*vblsinframe)	/* squat down */
			{
				mo->player->deltaviewheight = mo->momz>>3;
				S_StartSound (mo, sfx_oof);
			}
			mo->momz = 0;
		}
		mo->z = mo->floorz;
	}
	else
	{
		if (mo->momz == 0)
			mo->momz = -(GRAVITY>>1)*vblsinframe;
		else
			mo->momz -= (GRAVITY>>2)*vblsinframe;
	}
	
	if (mo->z + mo->height > mo->ceilingz)
	{	/* hit the ceiling */
		if (mo->momz > 0)
			mo->momz = 0;
		mo->z = mo->ceilingz - mo->height;		
	}
	
} 
 

Then the vblsinframe will be used in the Z calculations too.

 

Edit: slight mod to consistancy check.

Due to the fact we are only after an 8 bit result it would be better to do as 2 operations.

tempcons = (tempcons >> 16)^tempcons;
consistancy = (tempcons>> ^ tempcons;

As we can do a 16 bit xor for the first part

Edited by Welshworrier
  • Like 1
Link to comment
Share on other sites

Been thinking about the instant kickout too :)

 

If the consistancy check fails on the first packet (due to the positions of the two players not being setup correctly yet) then you'd get an instant fail - wonder if it's actually needed now if the Z fix works. Using your checksum/retry routine is probably a better option.

  • Like 1
Link to comment
Share on other sites

OK, I got ambitious and coded something different from the ground up in jagonly.C. I updated the consistency check, added "this console" x and y (indexed by consoleplayer), and added the checksum byte. I also restored the UART speed to 115200 and got rid of the 0x80 dummy header byte.

 

Here is the packet set up:

	outbytes[0] = buttons>>24;
	outbytes[1] = buttons>>16;
	outbytes[2] = buttons>>8;
	outbytes[3] = buttons;

	val = players[0].mo->x ^ players[0].mo->y ^ players[1].mo->x ^ players[1].mo->y;
	val = (val >> 16) ^ val;
	consistancy = (val>> ^ val;

	outbytes[4] = consistancy;
	outbytes[5] = vblsinframe;

	outbytes[6] = (players[consoleplayer].mo->x >> 24) && 0xFF;
	outbytes[7] = (players[consoleplayer].mo->x >> 16) && 0xFF;
	outbytes[8] = (players[consoleplayer].mo->x >>   && 0xFF;
	outbytes[9] = (players[consoleplayer].mo->x)       && 0xFF;

	outbytes[10] = (players[consoleplayer].mo->y >> 24) && 0xFF;
	outbytes[11] = (players[consoleplayer].mo->y >> 16) && 0xFF;
	outbytes[12] = (players[consoleplayer].mo->y >>   && 0xFF;
	outbytes[13] = (players[consoleplayer].mo->y)       && 0xFF;

	outbytes[14] = 0;
	
	for (i=0; i < PACKET_SIZE - 1; i++)
	{
		outbytes[14] ^= outbytes[i];
	}

Here is the error checking I have:

/* */
/* check for consistancy error */
/* */
	if (inbytes[4] != outbytes[4])
	{
		netErrors++;
		PrintNumber (15,13,netErrors);
		
		otherconsoleplayer = consoleplayer ^ 1;
		
		players[otherconsoleplayer].mo->x = 
					(inbytes[6] << 24) |
					(inbytes[7] << 16) |
					(inbytes[8] <<   |
					(inbytes[9]);

		players[otherconsoleplayer].mo->y = 
					(inbytes[6] << 24) |
					(inbytes[7] << 16) |
					(inbytes[8] <<   |
					(inbytes[9]);
	}

	checksum = 0;

	for (i=0; i < PACKET_SIZE - 1; i++)
	{
		checksum ^= inbytes[i];
	}

	if (checksum != inbytes[14])
	{
		jagobj_t	*pl;
	
		S_Clear ();
		pl = W_CacheLumpName ("neterror", PU_STATIC);	
		DrawPlaque (pl);
		Z_Free (pl);

		wait (200);
		goto reconnect;
	}

Two interesting things with this build which I labeled doom_n05.abs:

  1. When the consistency check fails (which is rare, but it happens), numErrors keeps incrementing every frame from then on which means the check fails every time after the first failure triggers.
  2. The checksum never fails, which has been true in all of my checksum experiments so far to the best of my knowledge.

This leads me to believe that the UART is not ever hitting the known "left shift" hardware bug, probably because Doom isn't driving it hard enough. Instead, there is something in the way the game calculates player position that can get messed up occasionally. Could be a memory leak or something else buried in the code.

 

The part that disappointed me was that my attempt at a "soft reset" by syncing the players' x and y did not work. The second player disappeared completely from the first console, and never came back even if I blew up a barrel and forced myself to respawn. So x and y do not appear to be sufficient, which is probably why Doom uses the big hammer and forces both players to restart the level.

Edited by Songbird
Link to comment
Share on other sites

I take it the error in the code is left for the reader to discover...

 

I'm guessing the cut and paste error in the recovery where you've copied the x position twice instead of the y position for the second one... (Using 6,7,8,9 instead of 10,11,12,13). ;)

 

Did you also put in the z fix from my earlier post? Would be interesting to see if the consistency failed then.

Edited by Welshworrier
  • Like 1
Link to comment
Share on other sites

Ack, that's what I get for late night coding. ;) Guess I need to run another test today...

 

I did not include the z fix. I looked at your suggested changes, but it wasn't obvious those changes were necessary or completely accurate. For example, I don't know why you are right-shifting GRAVITY before multiplying it by vblsinframe in a couple spots. That's different than what the original code was doing.

 

Also, I don't see how consistency could fail for z disparities, unless I incorporated z into the consistency check. I'm guessing it was intentional in the original code not to synchronize z between the systems, since GRAVITY will take care of itself. :D

 

There still must be another bug in the original code which causes the inconsistency in the first place.

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...