Opened 7 years ago

Last modified 3 years ago

#3643 reopened defect

[PATCH] Server NetTurnManager crash when a client cancels a rejoin and rejoins again

Reported by: elexis Owned by:
Priority: Should Have Milestone: Backlog
Component: Network Keywords:
Cc: Patch:

Description (last modified by elexis)

Today a game on r17298 crashed after I late-joined a game as spectator, cancelled the rejoin by killing the process and then rejoining again or something similar. I could reproduce the issue at least once.

NetTurnManager.cpp(586): Assertion failed: "turn == m_ClientsReady[client] + 1"
Assertion failed: "turn == m_ClientsReady[client] + 1"
Location: NetTurnManager.cpp:586 (NotifyFinishedClientCommands)

Call stack:

(0x96d68b) ./pyrogenesis() [0x96d68b]
(0x915861) ./pyrogenesis() [0x915861]
(0x9170ca) ./pyrogenesis() [0x9170ca]
(0x4552e9) ./pyrogenesis() [0x4552e9]
(0x46f927) ./pyrogenesis() [0x46f927]
(0x461362) ./pyrogenesis() [0x461362]
(0x46ca3f) ./pyrogenesis() [0x46ca3f]
(0x47334e) ./pyrogenesis() [0x47334e]
(0x47399d) ./pyrogenesis() [0x47399d]
(0x473a7d) ./pyrogenesis() [0x473a7d]
(0x7f978d0f96aa) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76aa) [0x7f978d0f96aa]
(0x7f978ce2eeed) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f978ce2eeed]

errno = 0 (Try again later)
OS error = ?

Also this finite state machine error pops up on the server when cancelling the rejoin sometimes (but doesn't produce a crash):

ERROR: Net server: Error running FSM update (type=19 state=4)

MessageType 19 means NMT_SYNC_CHECK and server state 4 means means SERVER_STATE_INGAME.


Thanks to fatherbushido for sending the crashlog, allowing to understand the error!

Attachments (5)

crashlog_reproduce.txt.7z (98.1 KB ) - added by elexis 7 years ago.
This is the crashlog of my accidental reproduce.
destroy_server.patch (751 bytes ) - added by elexis 7 years ago.
defuse_v1.patch (817 bytes ) - added by elexis 7 years ago.
defuse_v2.patch (4.5 KB ) - added by elexis 7 years ago.
Not tested yet with the destroy-patch
defuse_v2.2.patch (4.5 KB ) - added by elexis 7 years ago.

Download all attachments as: .zip

Change History (24)

comment:1 by elexis, 7 years ago

Here the last messages of fatherbushidos crashlog:

<p>Net: Sending message CEndCommandBatchMessage { m_Turn: 4147, m_TurnLength: 500 } of size 9 to [7F03C2CA...]</p>
<p>Net: Sending message CEndCommandBatchMessage { m_Turn: 4148, m_TurnLength: 500 } of size 9 to [7F03C2CA...]</p>
<p>Net: Sending message CEndCommandBatchMessage { m_Turn: 4149, m_TurnLength: 500 } of size 9 to [7F03C2CA...]</p>
<p>Net: Sending message CEndCommandBatchMessage { m_Turn: 4150, m_TurnLength: 500 } of size 9 to [7F03C2CA...]</p>
<p>Net: Sending message CEndCommandBatchMessage { m_Turn: 4151, m_TurnLength: 500 } of size 9 to [7F03C2CA...]</p>
<p>Net: Sending message CEndCommandBatchMessage { m_Turn: 4152, m_TurnLength: 500 } of size 9 to [7F03C2CA...]</p>
<p>Net: Sending message CEndCommandBatchMessage { m_Turn: 4153, m_TurnLength: 500 } of size 9 to [7F03C2CA...]</p>
<p>Net: Sending message CEndCommandBatchMessage { m_Turn: 4154, m_TurnLength: 500 } of size 9 to [7F03C2CA...]</p>
<p>Net: Sending message CSimulationMessage { m_Client: 1, m_Player: 1, m_Turn: 4155, m_Data: ({type:"attack", entities:[12115, 12506, 12702, 12876, 12877, 13090, 13234], target:11715, allowCapture:true, queued:false}) } of size 224 to [7F03C2CA...]</p>
<p>Net: Sending message CEndCommandBatchMessage { m_Turn: 4155, m_TurnLength: 500 } of size 9 to [7F03C2CA...]</p>
<p>Net: Sending message CEndCommandBatchMessage { m_Turn: 4156, m_TurnLength: 500 } of size 9 to [7F03C2CA...]</p>
<p>Net: Sending message CEndCommandBatchMessage { m_Turn: 4157, m_TurnLength: 500 } of size 9 to [7F03C2CA...]</p>
<p>Net: Sending message CSimulationMessage { m_Client: 7, m_Player: 3, m_Turn: 4158, m_Data: ({type:"construct", template:"structures/iber_civil_centre", x:270.0105285644531, z:945.39990234375, angle:2.356194490192345, actorSeed:15341, entities:[213, 214, 6907, 6908, 6909, 6910, 6911, 7161, 7164, 8006, 8008, 8009, 8178, 8179, 8180, 8181, 8182, 8695, 8696, 8697, 8698, 8699, 8871, 8872, 8873, 9047, 9048, 9049, 9050, 9051, 9160, 9161, 9162, 9163, 9164, 9328, 9329, 9330, 9331, 9577, 9578, 9579, 9580, 9581, 10133, 10231, 11031, 11032, 11033, 11034, 11035, 11077, 11078, 11079, 11080, 11215, 11216, 11217, 11218, 11219, 13617, 14730], autorepair:true, autocontinue:true, queued:true}) } of size 1105 to [7F03C2CA...]</p>
<p>Net: Sending message CEndCommandBatchMessage { m_Turn: 4158, m_TurnLength: 500 } of size 9 to [7F03C2CA...]</p>
<p>Net: Sending message CLoadedGameMessage { m_CurrentTurn: 4158 } of size 7 to [7F03C2CA...]</p>
<p>Net server: Received message CRejoinedMessage { m_GUID:  } of size 7 from [7F03C2CA...]</p>
<p>Net: Sending message CRejoinedMessage { m_GUID: 7F03C2CA5EAD7D9D } of size 23 to [7DCDFED8...]</p>
<p>Net: Sending message CRejoinedMessage { m_GUID: 7F03C2CA5EAD7D9D } of size 23 to [A1692034...]</p>
<p>Net: Sending message CRejoinedMessage { m_GUID: 7F03C2CA5EAD7D9D } of size 23 to [6094DF57...]</p>
<p>Net: Sending message CRejoinedMessage { m_GUID: 7F03C2CA5EAD7D9D } of size 23 to [7F03C2CA...]</p>
<p>Net server: Received message CEndCommandBatchMessage { m_Turn: 4136, m_TurnLength: 500 } of size 9 from [7F03C2CA...]</p>
<p>Net: Sending message CSimulationMessage { m_Client: 1, m_Player: 1, m_Turn: 4160, m_Data: ({type:"attack", entities:[14240, 14238, 13090, 12508, 12506, 12118, 12117, 12116, 12115, 12114, 12704, 12703, 12702, 12701, 12700, 12509, 12507, 12877, 12876, 13238, 13237, 13236, 12878, 13235, 13234, 12874, 13087, 14239, 14242, 13088, 14241, 14716, 13086, 14715], target:14554, allowCapture:true, queued:false}) } of size 569 to server</p>
<p>Net client: Disconnected</p>
<p>Net message: ({type:"netstatus", status:"disconnected", reason:0})</p>

Notice 7F03C2CA5EAD7D9D is my GUID and that the last CEndCommandBatchMessage sticks out as it has turn 4136 instead of 4158 or above:

<p>Net server: Received message CEndCommandBatchMessage { m_Turn: 4136, m_TurnLength: 500 } of size 9 from [7F03C2CA...]</p>

Not sure if relevant, but I noticed the server continues to send the serialized state even when the client has disconnected already.

by elexis, 7 years ago

Attachment: crashlog_reproduce.txt.7z added

This is the crashlog of my accidental reproduce.

comment:2 by elexis, 7 years ago

Milestone: BacklogAlpha 20

Putting this on the milestone as this crash occured thrice the last couple of days (and it kills the server thread, so the game is irrevocably gone (which it wouldnt if it only crashed the client thread like #3638).

Should be solvable by removing an ENSURE and reseting the expected turn number on rejoin.

by elexis, 7 years ago

Attachment: destroy_server.patch added

by elexis, 7 years ago

Attachment: defuse_v1.patch added

by elexis, 7 years ago

Attachment: defuse_v2.patch added

Not tested yet with the destroy-patch

by elexis, 7 years ago

Attachment: defuse_v2.2.patch added

comment:3 by elexis, 7 years ago

Keywords: patch review added
Summary: Server NetTurnManager crash when a client cancels a rejoin and rejoins again[PATCH] Server NetTurnManager crash when a client cancels a rejoin and rejoins again

Still not sure what rid the client to send that message with a turn number of the past.

One hypothesis is that the client adds a simulation command to m_QueuedCommands after finishing the loading screen but before finishing the rejoin (Synchronising gameplay with other players...) and then sends the CEndCommandBatchMessage with the turnnumber of the past (after raching the official current turn). Failed to reproduce it by opening the developer overlay in that timeframe.

Whatever is causing it, it shouldn't kill the whole game (even a bugged game might be better than expierencing a total shutdown). The patch removes the ENSURE breakpoint, shows an error message and disconnects the client.

comment:4 by elexis, 7 years ago

Milestone: Alpha 20Alpha 21

comment:5 by elexis, 6 years ago

Keywords: review removed

Works with regards to not crashing, but the NetTurnManager still expects a wrong turnnumber from the clients. The game will appear to be paused.

comment:6 by elexis, 6 years ago

Backlogging due to lack of progress.

comment:7 by elexis, 6 years ago

Milestone: Alpha 21Backlog

comment:8 by elexis, 6 years ago

Component: Core engineNetwork

(set component to network)

comment:9 by elexis, 6 years ago

Milestone: BacklogAlpha 22
Priority: Must HaveRelease Blocker

Doesn't have to stay a release blocker inevitably, but killed too many MP games to go unconsidered.

Last edited 6 years ago by elexis (previous) (diff)

comment:10 by elexis, 5 years ago

Description: modified (diff)
Milestone: Alpha 22Alpha 23
Priority: Release BlockerMust Have

comment:11 by elexis, 5 years ago

Patch: Phab:D1256

comment:12 by elexis, 5 years ago

In 21023:

Don't trigger a NetServerTurnManager debug breakpoint if a modified or unmodified client sends a non-sequential turnnumber.
Disconnect that client.

Refs #3643
Differential Revision: https://code.wildfiregames.com/D1256
Reviewed By: echotangoecho

comment:13 by elexis, 5 years ago

Keywords: patch removed
Milestone: Alpha 23Backlog
Patch: Phab:D1256
Priority: Must HaveShould Have

in reply to:  description comment:14 by Imarok, 4 years ago

Replying to elexis:

Also this finite state machine error pops up on the server when cancelling the rejoin sometimes (but doesn't produce a crash):

ERROR: Net server: Error running FSM update (type=19 state=4)

MessageType 19 means NMT_SYNC_CHECK and server state 4 means means SERVER_STATE_INGAME.

No, state 4 means NSS_JOIN_SYNCING

comment:15 by elexis, 4 years ago

Do you speak about alpha 23 or the version when this was reported?

in reply to:  15 comment:16 by Imarok, 4 years ago

Replying to elexis:

Do you speak about alpha 23 or the version when this was reported?

Of course about the version when this was reported.

Anyway you mixed up NetServerState and NetServerSessionState so SERVER_STATE_INGAME is totally wrong.

comment:17 by elexis, 4 years ago

Milestone: BacklogAlpha 23
Resolution: fixed
Status: newclosed

This ticket was left open following 0ad-staff.2016-03-27-18.15.log.html:

18:19:11 <Itms> I'm reluctant to commit your change as long as we don't entirely understand what causes the problem 18:20:07 <mimo_> I'm for CF now also, #3643 is not a blocker and better to fully understand it before commiting

Easier said than done.

I have the suspicion that Phab:rP14732 which sends the CEndCommandBatchMessage also at the loading screen and rejoin stage when it may only be send during the game can trigger this.

But that message is sent unreliably, it's only sent out 1 out of 10 times before destroying the client. And it may be that this error only occurs after loading the map but before simulating the first consecutive turn? Then you'd have something like a 1 in 10 * 1 in 10 chance to reproduce by Alt+F4ing as mentioned in the original ticket response.

I suspect rejoining more than once is not necessary, as the NetServerTurnManager already keeps client IDs unique, so data from a previous rejoin should not be able to affect the next rejoin.

The NetServer breakpoint was replaced regardless because malicious clients shouldn't be able to trigger breakpoints arbitrarily and repeatedly.

So the NetClient may or may not be fixed by Phab:D1557, this part is "Needs info". One may try to revert Phab:D1557 if one wants to test the hypothesis. If we receive this error report with the affected NetClient playing alpha 23.2 (i.e. including Phab:D1557), then we know the hypothesis is wrong.

A full mainlog.html would be benefitial too.

comment:18 by elexis, 4 years ago

Milestone: Alpha 23Backlog
Resolution: fixed
Status: closedreopened

Although the one known way to reproduce the stack is not breaking the game anymore, there is still one way this bug can occur after a rejoining client disconnected with the first alpha 23 release:

ERROR: NotifyFinishedClientCommands: Client 16 (Sheep54 (1465)) is ready for turn 753, but expected 805

The loading screen completed to 99% reportedly.

I strongly suspect r21837 has fixed this for the re-release clientside. It would explain why this occurs and why it occurs rarely.

The turnmanagers should still not become stuck once the offending client is disconnected.

comment:19 by elexis, 3 years ago

Sonds unlikely that its the issue described above, but it's possible to trigger the case ERROR: NotifyFinishedClientCommands: Client 1 (2) is ready for turn 2, but expected 8 by using quicksave & quickload in a networked game (including a23). (To solve that, the quickload method sounds like it should not be implemented for networked turnmanagers)

Note: See TracTickets for help on using tickets.