Opened 6 years ago

Closed 6 years ago

Last modified 5 years ago

#5162 closed defect (needsinfo)

OOS with alpha 23

Reported by: elexis Owned by:
Priority: Release Blocker Milestone:
Component: UI & Simulation Keywords:
Cc: Patch:

Description (last modified by elexis)

This OOS occurred for all players except the host, 6700 turns into the game. I think everyone did rejoin.

EDIT:

OOS issues reproduced in this ticket:

  • #5205 non-visual hash mismatch in CCmpVisualActor since r21359
  • #5185 OOS on rejoin on turn 0, different RNG
  • #5186 OOS on rejoin on all square maps following r21675
  • #5198 OOS on rejoin on Danubius due to absent vector serialization (#4698) triggered by r20957
  • #5199 Doubleclicking StartGame causes rejoins with different RNG and terrain since ever

OOS debugging tools worked out in this ticket:

  • r21830 write oos_dump.dat when writing oos_dump.txt
  • r21829 test every full hash in non-visual replaymode

OOS debugging issues run into yet again:

  • #3348 oos_dump.txt are taken on different turns sometimes
  • #3615 missing serializationtest support for random maps

Attachments (13)

2018-05-19 OOS.zip (1.7 MB ) - added by elexis 6 years ago.
map-and-commands-for-5-19.zip (540.4 KB ) - added by temple 6 years ago.
2018-05-24 OOS Mainland turn 4 KunstRaucher.zip (1.3 MB ) - added by elexis 6 years ago.
Mainland match where fpre hosted with the fgod mod, KunstRaucher rejoined before turn 4 and got OOS on turn 4 (and got OOS much more often). Could not reproduce the issue with a rejointest yet.
screenshot4482.png (902.0 KB ) - added by temple 6 years ago.
the 26 trees
2018-05-24_OOS_nani_Cesar_wonder_JebelBarkal.7z (1.4 MB ) - added by elexis 6 years ago.
Here a hash mismatch but no difference in the textual simstate. nani used the fgod mod, I didn't.
5_screenshots.zip (1.3 MB ) - added by elexis 6 years ago.
5 screenshots of OOS message boxes
simstate_elexis_in_nanireplay_turn07260.7z (997.4 KB ) - added by elexis 6 years ago.
OOS_Talenters_incomplete.7z (582.7 KB ) - added by elexis 6 years ago.
Incomplete files, some turns missing. Someone else joined as Talenters that also wasn't fpre.
OOS_Talenters_Danubius.zip (1.4 MB ) - added by elexis 6 years ago.
New OOS report on Danubius. Talenters was the only one OOS and not using diplomacy colors. OOS dumps from different turns, but we could create the one corresponding to Talenters from nanis replay.
jebel_barkal_nani_host_2018-06-04.7z (1.9 MB ) - added by elexis 6 years ago.
Same OOS as in the other Jebel Barkal game hosted by nani that went also OOS after water was broken and had no textual diffs. On on turn 3340, nani said "lol", on turn 3347 "water turned black", OOS on turn 3420. All consecutive full-hash were wrong and quick hash correct.
9ffVWP1.png (597.0 KB ) - added by elexis 6 years ago.
stans atlas bug
oos_files_nani_jebel_barkal_binary.7z (1.7 MB ) - added by elexis 6 years ago.
screenshot1257.jpg (385.0 KB ) - added by elexis 6 years ago.

Change History (50)

by elexis, 6 years ago

Attachment: 2018-05-19 OOS.zip added

comment:1 by Stan, 6 years ago

Just what we needed. Sweet Christmas.

comment:2 by elexis, 6 years ago

I observed 3 OOS yesterday, one on turn 4 when someone rejoined before then, that I couldn't reproduce.

Another OOS was observed just now, when nani hosted a JB game. The Out-Of-Sync on turn 7260 was right after playerwin, just like in the replay above! So I think we're dealing with 2 different OOS on rejoin and this one is a playerwin one.

comment:3 by temple, 6 years ago

I did the rejoin test on turn 2113 for the 5/19 replay (after recreating the map in atlas; had to hard-code the difficulty, remove the added units from the jb script, and add a couple more decimal places to the water height), took some hours but it was all okay.

by temple, 6 years ago

comment:4 by elexis, 6 years ago

We suspected fpre's mod, but since the pyrogeneis binary is still the same (CReplayLogger::StartGame), the commands.txt would contain the modname, but it doesn't. So both host and rejoined client did not use the mod.

It also seems unlikely to be a platform issue, because I can replay Decgers (host) replay and compute the same hash, but my machine also computed the hash when I rejoined originally.

by elexis, 6 years ago

Mainland match where fpre hosted with the fgod mod, KunstRaucher rejoined before turn 4 and got OOS on turn 4 (and got OOS much more often). Could not reproduce the issue with a rejointest yet.

comment:5 by temple, 6 years ago

I couldn't reproduce either. The difference in the oos logs was the territory owner for the position of 26 trees. elexis had them correctly belonging to player 7 while KunstRaucher had them belonging to gaia.

742993c742993
<     territory: 7
---
>     territory: 0
etc.

It looks like that's only changed on MT_TurnStart, where it gets the territory owner from the TerritoryManager component. So I'm guessing that the territories are calculated slightly wrong somehow, but since it's not serialized I can't say for sure.

by temple, 6 years ago

Attachment: screenshot4482.png added

the 26 trees

comment:6 by elexis, 6 years ago

You asked about KunstRauchers OS:

file oos_dump_KunstRaucher.txt: ASCII text, with CRLF line terminators

-> Windows

comment:7 by elexis, 6 years ago

Reproduce some OOS:

  1. Multiplayer -> Host -> (preferably Skirmish/Scenario map ->) Start game
  2. During the loading screen, hit F9 to open the JS console, type Engine.SetSimRate(0);
  3. Open another 0ad instance, multiplayer -> join -> localhost as late observer
  4. After the rejoin, change the SimRate to something positive
  5. Apocalypse

in reply to:  6 ; comment:8 by temple, 6 years ago

Replying to elexis:

You asked about KunstRauchers OS:

file oos_dump_KunstRaucher.txt: ASCII text, with CRLF line terminators

-> Windows

Oh, right. Decger was windows too, but he was the host and had the correct hash so I guess it's just a coincidence.

comment:9 by elexis, 6 years ago

The rejoin-on-turn-0 OOS can be reproduced on alpha 19 and 22 too, so this cannot explain the mass of OOS reports we received in the last days.

by elexis, 6 years ago

Here a hash mismatch but no difference in the textual simstate. nani used the fgod mod, I didn't.

in reply to:  8 comment:10 by temple, 6 years ago

Replying to temple:

Oh, right. Decger was windows too, but he was the host and had the correct hash so I guess it's just a coincidence.

Sorry, his was the wrong hash! So windows could be a possibility.

by elexis, 6 years ago

Attachment: 5_screenshots.zip added

5 screenshots of OOS message boxes

comment:11 by elexis, 6 years ago

Boudicca got windows, replayed nani's commands of 2018-05-24_OOS_nani_Cesar_wonder_JebelBarkal.7z​ and didn't get any OOS errors.

I replayed nani's commands and got an OOS.

So it looks like a platform bug (I have to check if Boudicca used fgod, but someone with windows without fgod mod may try to reproduce too).

Next step is getting binary and textual simstate dumps for both platforms.

comment:12 by elexis, 6 years ago

Replaying nanis replay on my ubuntu machine with the svn checkout, I got the ERROR: Replay out of sync on turn 7260 and the binary state in the attachment.

comment:13 by elexis, 6 years ago

Saved simstate on replay OOS using:

Index: source/simulation2/Simulation2.cpp
===================================================================
--- source/simulation2/Simulation2.cpp	(revision 21824)
+++ source/simulation2/Simulation2.cpp	(working copy)
@@ -658,6 +658,11 @@
 	m->m_RejoinTestTurn = rejoinTestTurn;
 }
 
+void CSimulation2::DumpState()
+{
+	m->DumpState();
+}
+
 void CSimulation2::EnableOOSLog()
 {
 	if (m->m_EnableOOSLog)
Index: source/simulation2/Simulation2.h
===================================================================
--- source/simulation2/Simulation2.h	(revision 21824)
+++ source/simulation2/Simulation2.h	(working copy)
@@ -225,6 +225,7 @@
 	ScriptInterface& GetScriptInterface() const;
 
 	bool ComputeStateHash(std::string& outHash, bool quick);
+	void DumpState();
 	bool DumpDebugState(std::ostream& stream);
 	bool SerializeState(std::ostream& stream);
 	bool DeserializeState(std::istream& stream);
Index: source/simulation2/system/ReplayTurnManager.cpp
===================================================================
--- source/simulation2/system/ReplayTurnManager.cpp	(revision 21824)
+++ source/simulation2/system/ReplayTurnManager.cpp	(working copy)
@@ -81,6 +81,7 @@
 		m_HasSyncError = true;
 		LOGERROR("Replay out of sync on turn %d", turn);
 		g_GUI->SendEventToAll("ReplayOutOfSync");
+		m_Simulation2.DumpState();
 	}
 }
 

comment:14 by elexis, 6 years ago

Status Report of the five OOS issues found:

  • #5185 continues the OOS on rejoin on turn 0 problem.It occurs since alpha 19 it's likely irrelevant to fix for alpha 23 re.
  • #5186 continues the territory differences in the KunstRaucher mainland replay and another report from fpre / Palai yesterday. It is partially reproducible, responsible commit known and likely different from the Jebel Barkal hash mismatch.
  • I recall two oos_dumps with fpre that had pathfinder / unitAI diffs and that maybe him having had unintentially Phab:D1268 applied locally (but not in the mod).
  • -> Only remaining mystery: According to the oos_dumps in nanis JB replay, the textual representation of the simstate is equal, so it is only a mismatch of the hash but the players might or might not be able to continue to play.

nanis JB replay contains a hardly explainable bug that is likely the only mystery left for this ticket:

  • commands_nani.txt was replayed twice without a hash mismatch by Boudica on windows with the first release. Diplomacy colors were not enabled (default) and range overlays were enabled (default).

Noone else (especially nani) computes the same hash, no matter which platform and config settings. So we're trying to get Boudica to find the cause by compiling with the patch above and hopefully computing the hash again with it while replaying and dumping the simstate, possibly find the differing stacktrace.

by elexis, 6 years ago

Attachment: OOS_Talenters_incomplete.7z added

Incomplete files, some turns missing. Someone else joined as Talenters that also wasn't fpre.

comment:15 by elexis, 6 years ago

(18:10:02) nani: cpu can make mistakes too
(18:10:51) nani: got overheated time ago
(18:11:43) nani: just before exiting
(18:11:47) nani: water disapeared
(18:12:01) nani: then screen started blinking black quicly
(18:12:23) nani: then i close forced window
(18:12:39) nani: no error whatsoever
(18:14:34) nani: i dodnt get error game started blinking black I sold you already

Hardware failure causing a hash mismatch but no textual simstate diff sounds unlikely. Could water disappearing not be a shader bug but also some loss of the water grid? An out-of-memory corrupting the hash but not the textual oosdump sounds unlikely as well. Given that nani still can't recompute the hash from then, it also can't be a simple platform issue. Given that Boudica but noone else (sample increased to 5+ players) computes the same hash, none of this makes any sense.

comment:16 by elexis, 6 years ago

In 21830:

Save oos_dump.dat too when saving oos_dump.txt.
Saves time to reproduce the binary simstate or provides data that may be otherwise irreproducible, refs #5162.

Differential Revision: https://code.wildfiregames.com/D1544
Reviewed By: temple

comment:17 by elexis, 6 years ago

Milestone: Alpha 24
Resolution: needsinfo
Status: newclosed

#5186 and #5185 continue the reproducible remaining ones.

Remaining OOS indicated by the logfiles but irreproducible with current information:

  • The two Jebel Barkal replays with a hash mismatch after 1h
  • KunstRaucher territory OOS on mainland that sounds very related to #5186

We got no OOS reports in the last two days and there was one 90min JB game with lots of rejoins but no OOS. So it's hopefully not that bad as it looked when we had 10 reports few days ago.

To be reopened once more information surfaces.

comment:18 by elexis, 6 years ago

There was another OOS report just now. causative was the host, Briamm was the only player out of sync on turn 1 without a rejoin. Briamm speaks spanish only, was asked by nani, but could not upload the OOS log.

comment:19 by elexis, 6 years ago

Decger was online today and we could retrieve the following missing information about the JB OOS game from the very first report here:

  • his oos_dump.txt was overwritten by a later game
  • he replayed his own replay file later and got an OOS too
  • he is 100% sure that he enabled diplomacy colors during the match

(ping temple)

Version 0, edited 6 years ago by elexis (next)

comment:20 by temple, 6 years ago

Also Boudica has tried nani's replay again and couldn't reproduce his hash. So in that and the Decger game it happened to the host of the game on Jebel Barkal and wasn't reproducable afterwards. I don't have any ideas, so I think we have to wait for more data.

comment:21 by elexis, 6 years ago

The news is that Decger had diplomacy colors enabled and in all of our testing we had ran this with diplomacy colors disabled.

Simply enabling them during a match doesn't trigger an OOS, but hypothetically speaking, there may be a specific situation in which it could. Maybe if it's the host, serializes some colors but this is self-corrected from some GUI event or such.

by elexis, 6 years ago

Attachment: OOS_Talenters_Danubius.zip added

New OOS report on Danubius. Talenters was the only one OOS and not using diplomacy colors. OOS dumps from different turns, but we could create the one corresponding to Talenters from nanis replay.

comment:22 by temple, 6 years ago

Why do you say different turns? The rng is different, but the rng doesn't match any of nani's turns 701-721 at least. And counting different lines with comparison of those logs to Talenters, there's 1451 lines different in turn 716 (the match to nani's log, apart from local entities) but 12k+ lines different in 715 or 717, and more the farther you go away. So I think the logs are from the same turn.

Note that the quick hashes were different and stayed different the next few turns, whereas in other examples the difference was only in the full hashes, so this could be a different oos.

comment:23 by elexis, 6 years ago

Why do you say different turns?

It's what I learned (below) and observed (#3348). It could be an rng bug, but I'll dump my own simstates for similar turns and compare before I suspect a deviation from the rule.

From 2015-02-25-QuakeNet-#0ad-dev.log:

10:14 < Philip`> Yes - the RNG's state changes every time you ask it for another random number
10:17 < elexis> the oos dump starts with a 'rng' value, i thought this is the seed
10:22 < Philip`> elexis: That is the RNG's entire state, not its seed
10:22 < Philip`> The seed just determines the RNG's initial state at the beginning of the game, and then is no longer needed

(We really need to fix #3348 and #3615 and need a commandline argument to dump on specific turns.)

comment:24 by temple, 6 years ago

Talenters joined on turn 544 (didn't start producing hashes until turn 614) and left after turn 719. Here's the debug output from the Danubius trigger file:

Turn 546 (500)...
DEBUG [5] Filling ship 5297 with ({'units/gaul_mechanical_siege_ram':1, 'units/gaul_support_healer_b':1, 'units/gaul_cavalry_javelinist_b':1, 'units/gaul_infantry_slinger_b':2, 'units/gaul_infantry_spearman_b':4})
DEBUG [5] Filling ship 5298 with ({'units/gaul_mechanical_siege_ram':1, 'units/gaul_support_healer_e':1, 'units/gaul_cavalry_javelinist_b':3, 'units/gaul_cavalry_swordsman_b':1, 'units/gaul_infantry_javelinist_b':1, 'units/gaul_infantry_slinger_b':2})

Turn 716 (500)...
DEBUG [6] Ship 5297 will ungarrison at C (274,318)
DEBUG [6] Ship 5298 will ungarrison at D (106,614)

If you early return in UngarrisonShipsOrder then you'll calculate Talenters' hashes. So for some reason that function either wasn't called (or no trigger functions were called?) or didn't do what it was supposed to.

comment:25 by temple, 6 years ago

Reproduced. Problem is:

ERROR: Error in timer on entity 1, IID93, function DoAction: TypeError: this.riverDirection.cross is not a function

Apparently while this.riverDirection for the rejoined player contains the same data (i.e. x and y are the same values), it's not a Vector2D so the cross function isn't recognized.

comment:26 by elexis, 6 years ago

Yay... #4698.

comment:27 by elexis, 6 years ago

Found #5199, may or may not explain the KunstRaucher Mainland OOS.

comment:28 by temple, 6 years ago

The Talenters/Danubius OOS was fixed by r21834 (#5198), and an OOS on square maps (not reported in this ticket) was fixed by r21835 (#5186).

comment:29 by elexis, 6 years ago

#5199 can perfectly explain KunstRauchers Mainland OOS in fact, see reasoning over there.

So the only unresolved mystery in this thread seem to be the two lategame jebel barkal ones that might be GUI induced OOS. Recap:

  • Irreproducible hash and missing OOS_dump file. Decger recalled using diplo colors for sure.
  • Irreproducible hash, the oos_dump.txt contains no difference. I don't know if nani used diplomacy colors. nani spoke of the hardware defect, but only the last full hash of the replay was wrong, while later quick hashes and oos_dump.txt were correct.

Perhaps some event occured near that specific turn, such as playerdefeat?

comment:30 by elexis, 6 years ago

Description: modified (diff)

comment:31 by elexis, 6 years ago

In r21829:

Test all full hashes in non-visual replaymode by default and keep skipping quick-hash by default.

The previous code only tested quick hashes every 100 turns and could not be used to confirm replay hashes matching.
The option can become used for visual replays too.

Differential Revision: ​https://code.wildfiregames.com/D1538
Refs #5162
Reviewed By: temple

by elexis, 6 years ago

Same OOS as in the other Jebel Barkal game hosted by nani that went also OOS after water was broken and had no textual diffs. On on turn 3340, nani said "lol", on turn 3347 "water turned black", OOS on turn 3420. All consecutive full-hash were wrong and quick hash correct.

by elexis, 6 years ago

Attachment: 9ffVWP1.png added

stans atlas bug

comment:32 by elexis, 6 years ago

temple was reminded of https://wildfiregames.com/forum/index.php?/topic/24154-all-map-black/ and Stan reported that he gets a similar bug in atlas, see attachment:9ffVWP1.png

comment:33 by elexis, 6 years ago

In 21836:

Fix an OOS on rejoin after doubleclicking on the StartGame button, fixes #5199, refs #5162.

Prevents changes to the gamesettings after the game was started, so as to use the correct map seed when generating the random map terrain for rejoiners.
D1558 will prevent the UI bug #5206 and FSM error from doubleclicking on StartGame.

Differential Revision: https://code.wildfiregames.com/D1562
Tested and accepted by: temple

by elexis, 6 years ago

Attachment: screenshot1257.jpg added

comment:34 by elexis, 6 years ago

Here the weird unresolved Jebel Barkal OOS without rejoin where only nani is OOS occured again. Again the first thing that occured was that water went black for nani, then nani said in the chat that there will be an OOS, then there was the OOS. Again the textual simstate is identical, but now we have the binary simstate.

comment:35 by temple, 6 years ago

Looks like differences with objects versus references, first differences with lastCarriedType (ResourceGatherer), last difference with order (UnitAI), and who knows what else in between. Seems always nani with an object and elexis with a reference.

comment:36 by elexis, 6 years ago

I observe 3 diffs:

  • ResourceGatherer.lastCarriedType and ResourceSupply.cachedType
  • AuraManager or TechnologyManager modifications
  • UnitAI order queue

So perhaps AuraManager / TechnologyManager objects were created depending on the user interaction / GUIInterface and the other diffs are consequences.

comment:37 by elexis, 5 years ago

In 22998:

Support testing rejoins on turn 0, refs #3460, #4242, #5185, #5162, rP18940

Differential Revision: https://code.wildfiregames.com/D1522

Note: See TracTickets for help on using tickets.