Opened 8 years ago
Last modified 4 years ago
#3643 reopened defect
[PATCH] Server NetTurnManager crash when a client cancels a rejoin and rejoins again
Reported by: | elexis | Owned by: | |
---|---|---|---|
Priority: | Should Have | Milestone: | Backlog |
Component: | Network | Keywords: | |
Cc: | Patch: |
Description (last modified by )
Today a game on r17298 crashed after I late-joined a game as spectator, cancelled the rejoin by killing the process and then rejoining again or something similar. I could reproduce the issue at least once.
NetTurnManager.cpp(586): Assertion failed: "turn == m_ClientsReady[client] + 1" Assertion failed: "turn == m_ClientsReady[client] + 1" Location: NetTurnManager.cpp:586 (NotifyFinishedClientCommands) Call stack: (0x96d68b) ./pyrogenesis() [0x96d68b] (0x915861) ./pyrogenesis() [0x915861] (0x9170ca) ./pyrogenesis() [0x9170ca] (0x4552e9) ./pyrogenesis() [0x4552e9] (0x46f927) ./pyrogenesis() [0x46f927] (0x461362) ./pyrogenesis() [0x461362] (0x46ca3f) ./pyrogenesis() [0x46ca3f] (0x47334e) ./pyrogenesis() [0x47334e] (0x47399d) ./pyrogenesis() [0x47399d] (0x473a7d) ./pyrogenesis() [0x473a7d] (0x7f978d0f96aa) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76aa) [0x7f978d0f96aa] (0x7f978ce2eeed) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f978ce2eeed] errno = 0 (Try again later) OS error = ?
Also this finite state machine error pops up on the server when cancelling the rejoin sometimes (but doesn't produce a crash):
ERROR: Net server: Error running FSM update (type=19 state=4)
MessageType 19 means NMT_SYNC_CHECK
and server state 4 means means SERVER_STATE_INGAME
.
Thanks to fatherbushido for sending the crashlog, allowing to understand the error!
Attachments (5)
Change History (24)
comment:1 by , 8 years ago
by , 8 years ago
Attachment: | crashlog_reproduce.txt.7z added |
---|
This is the crashlog of my accidental reproduce.
comment:2 by , 8 years ago
Milestone: | Backlog → Alpha 20 |
---|
Putting this on the milestone as this crash occured thrice the last couple of days (and it kills the server thread, so the game is irrevocably gone (which it wouldnt if it only crashed the client thread like #3638).
Should be solvable by removing an ENSURE
and reseting the expected turn number on rejoin.
by , 8 years ago
Attachment: | destroy_server.patch added |
---|
by , 8 years ago
Attachment: | defuse_v1.patch added |
---|
by , 8 years ago
Attachment: | defuse_v2.2.patch added |
---|
comment:3 by , 8 years ago
Keywords: | patch review added |
---|---|
Summary: | Server NetTurnManager crash when a client cancels a rejoin and rejoins again → [PATCH] Server NetTurnManager crash when a client cancels a rejoin and rejoins again |
Still not sure what rid the client to send that message with a turn number of the past.
One hypothesis is that the client adds a simulation command to m_QueuedCommands
after finishing the loading screen but before finishing the rejoin (Synchronising gameplay with other players...
) and then sends the CEndCommandBatchMessage
with the turnnumber of the past (after raching the official current turn).
Failed to reproduce it by opening the developer overlay in that timeframe.
Whatever is causing it, it shouldn't kill the whole game (even a bugged game might be better than expierencing a total shutdown). The patch removes the ENSURE
breakpoint, shows an error message and disconnects the client.
comment:4 by , 8 years ago
Milestone: | Alpha 20 → Alpha 21 |
---|
comment:5 by , 8 years ago
Keywords: | review removed |
---|
Works with regards to not crashing, but the NetTurnManager
still expects a wrong turnnumber from the clients. The game will appear to be paused.
comment:7 by , 8 years ago
Milestone: | Alpha 21 → Backlog |
---|
comment:9 by , 7 years ago
Milestone: | Backlog → Alpha 22 |
---|---|
Priority: | Must Have → Release Blocker |
Doesn't have to stay a release blocker inevitably, but killed too many MP games to go unconsidered.
comment:10 by , 7 years ago
Description: | modified (diff) |
---|---|
Milestone: | Alpha 22 → Alpha 23 |
Priority: | Release Blocker → Must Have |
comment:11 by , 6 years ago
Patch: | → Phab:D1256 |
---|
comment:13 by , 6 years ago
Keywords: | patch removed |
---|---|
Milestone: | Alpha 23 → Backlog |
Patch: | Phab:D1256 |
Priority: | Must Have → Should Have |
comment:14 by , 6 years ago
Replying to elexis:
Also this finite state machine error pops up on the server when cancelling the rejoin sometimes (but doesn't produce a crash):
ERROR: Net server: Error running FSM update (type=19 state=4)MessageType 19 means
NMT_SYNC_CHECK
and server state 4 means meansSERVER_STATE_INGAME
.
No, state 4 means NSS_JOIN_SYNCING
follow-up: 16 comment:15 by , 6 years ago
Do you speak about alpha 23 or the version when this was reported?
comment:16 by , 6 years ago
Replying to elexis:
Do you speak about alpha 23 or the version when this was reported?
Of course about the version when this was reported.
Anyway you mixed up NetServerState
and NetServerSessionState
so SERVER_STATE_INGAME
is totally wrong.
comment:17 by , 6 years ago
Milestone: | Backlog → Alpha 23 |
---|---|
Resolution: | → fixed |
Status: | new → closed |
This ticket was left open following 0ad-staff.2016-03-27-18.15.log.html
:
18:19:11 <Itms> I'm reluctant to commit your change as long as we don't entirely understand what causes the problem 18:20:07 <mimo_> I'm for CF now also, #3643 is not a blocker and better to fully understand it before commiting
Easier said than done.
I have the suspicion that Phab:rP14732 which sends the CEndCommandBatchMessage
also at the loading screen and rejoin stage when it may only be send during the game can trigger this.
But that message is sent unreliably, it's only sent out 1 out of 10 times before destroying the client. And it may be that this error only occurs after loading the map but before simulating the first consecutive turn? Then you'd have something like a 1 in 10 * 1 in 10 chance to reproduce by Alt+F4ing as mentioned in the original ticket response.
I suspect rejoining more than once is not necessary, as the NetServerTurnManager already keeps client IDs unique, so data from a previous rejoin should not be able to affect the next rejoin.
The NetServer breakpoint was replaced regardless because malicious clients shouldn't be able to trigger breakpoints arbitrarily and repeatedly.
So the NetClient may or may not be fixed by Phab:D1557, this part is "Needs info". One may try to revert Phab:D1557 if one wants to test the hypothesis. If we receive this error report with the affected NetClient playing alpha 23.2 (i.e. including Phab:D1557), then we know the hypothesis is wrong.
A full mainlog.html
would be benefitial too.
comment:18 by , 5 years ago
Milestone: | Alpha 23 → Backlog |
---|---|
Resolution: | fixed |
Status: | closed → reopened |
Although the one known way to reproduce the stack is not breaking the game anymore, there is still one way this bug can occur after a rejoining client disconnected with the first alpha 23 release:
ERROR: NotifyFinishedClientCommands: Client 16 (Sheep54 (1465)) is ready for turn 753, but expected 805
The loading screen completed to 99% reportedly.
I strongly suspect r21837 has fixed this for the re-release clientside. It would explain why this occurs and why it occurs rarely.
The turnmanagers should still not become stuck once the offending client is disconnected.
comment:19 by , 4 years ago
Sonds unlikely that its the issue described above, but it's possible to trigger the case ERROR: NotifyFinishedClientCommands: Client 1 (2) is ready for turn 2, but expected 8
by using quicksave & quickload in a networked game (including a23). (To solve that, the quickload method sounds like it should not be implemented for networked turnmanagers)
Here the last messages of fatherbushidos crashlog:
Notice
7F03C2CA5EAD7D9D
is my GUID and that the lastCEndCommandBatchMessage
sticks out as it has turn4136
instead of4158
or above:Not sure if relevant, but I noticed the server continues to send the serialized state even when the client has disconnected already.