#5162 closed defect (needsinfo)
OOS with alpha 23
Reported by: | elexis | Owned by: | |
---|---|---|---|
Priority: | Release Blocker | Milestone: | |
Component: | UI & Simulation | Keywords: | |
Cc: | Patch: |
Description (last modified by )
This OOS occurred for all players except the host, 6700 turns into the game. I think everyone did rejoin.
EDIT:
OOS issues reproduced in this ticket:
- #5205 non-visual hash mismatch in CCmpVisualActor since r21359
- #5185 OOS on rejoin on turn 0, different RNG
- #5186 OOS on rejoin on all square maps following r21675
- #5198 OOS on rejoin on Danubius due to absent vector serialization (#4698) triggered by r20957
- #5199 Doubleclicking StartGame causes rejoins with different RNG and terrain since ever
OOS debugging tools worked out in this ticket:
- r21830 write
oos_dump.dat
when writingoos_dump.txt
- r21829 test every full hash in non-visual replaymode
OOS debugging issues run into yet again:
Attachments (13)
Change History (50)
by , 6 years ago
Attachment: | 2018-05-19 OOS.zip added |
---|
comment:1 by , 6 years ago
comment:2 by , 6 years ago
I observed 3 OOS yesterday, one on turn 4 when someone rejoined before then, that I couldn't reproduce.
Another OOS was observed just now, when nani hosted a JB game.
The Out-Of-Sync on turn 7260
was right after playerwin, just like in the replay above!
So I think we're dealing with 2 different OOS on rejoin and this one is a playerwin one.
comment:3 by , 6 years ago
I did the rejoin test on turn 2113 for the 5/19 replay (after recreating the map in atlas; had to hard-code the difficulty, remove the added units from the jb script, and add a couple more decimal places to the water height), took some hours but it was all okay.
by , 6 years ago
Attachment: | map-and-commands-for-5-19.zip added |
---|
comment:4 by , 6 years ago
We suspected fpre's mod, but since the pyrogeneis binary is still the same (CReplayLogger::StartGame
), the commands.txt would contain the modname, but it doesn't. So both host and rejoined client did not use the mod.
It also seems unlikely to be a platform issue, because I can replay Decgers (host) replay and compute the same hash, but my machine also computed the hash when I rejoined originally.
by , 6 years ago
Attachment: | 2018-05-24 OOS Mainland turn 4 KunstRaucher.zip added |
---|
Mainland match where fpre hosted with the fgod mod, KunstRaucher rejoined before turn 4 and got OOS on turn 4 (and got OOS much more often). Could not reproduce the issue with a rejointest yet.
comment:5 by , 6 years ago
I couldn't reproduce either. The difference in the oos logs was the territory owner for the position of 26 trees. elexis had them correctly belonging to player 7 while KunstRaucher had them belonging to gaia.
742993c742993 < territory: 7 --- > territory: 0 etc.
It looks like that's only changed on MT_TurnStart, where it gets the territory owner from the TerritoryManager component. So I'm guessing that the territories are calculated slightly wrong somehow, but since it's not serialized I can't say for sure.
follow-up: 8 comment:6 by , 6 years ago
You asked about KunstRauchers OS:
file oos_dump_KunstRaucher.txt
: ASCII text, with CRLF line terminators
-> Windows
comment:7 by , 6 years ago
Reproduce some OOS:
- Multiplayer -> Host -> (preferably Skirmish/Scenario map ->) Start game
- During the loading screen, hit F9 to open the JS console, type
Engine.SetSimRate(0);
- Open another 0ad instance, multiplayer -> join -> localhost as late observer
- After the rejoin, change the SimRate to something positive
- Apocalypse
follow-up: 10 comment:8 by , 6 years ago
Replying to elexis:
You asked about KunstRauchers OS:
file oos_dump_KunstRaucher.txt
: ASCII text, with CRLF line terminators-> Windows
Oh, right. Decger was windows too, but he was the host and had the correct hash so I guess it's just a coincidence.
comment:9 by , 6 years ago
The rejoin-on-turn-0 OOS can be reproduced on alpha 19 and 22 too, so this cannot explain the mass of OOS reports we received in the last days.
by , 6 years ago
Attachment: | 2018-05-24_OOS_nani_Cesar_wonder_JebelBarkal.7z added |
---|
Here a hash mismatch but no difference in the textual simstate. nani used the fgod mod, I didn't.
comment:10 by , 6 years ago
Replying to temple:
Oh, right. Decger was windows too, but he was the host and had the correct hash so I guess it's just a coincidence.
Sorry, his was the wrong hash! So windows could be a possibility.
comment:11 by , 6 years ago
Boudicca got windows, replayed nani's commands of 2018-05-24_OOS_nani_Cesar_wonder_JebelBarkal.7z and didn't get any OOS errors.
I replayed nani's commands and got an OOS.
So it looks like a platform bug (I have to check if Boudicca used fgod, but someone with windows without fgod mod may try to reproduce too).
Next step is getting binary and textual simstate dumps for both platforms.
comment:12 by , 6 years ago
Replaying nanis replay on my ubuntu machine with the svn checkout, I got the ERROR: Replay out of sync on turn 7260
and the binary state in the attachment.
by , 6 years ago
Attachment: | simstate_elexis_in_nanireplay_turn07260.7z added |
---|
comment:13 by , 6 years ago
Saved simstate on replay OOS using:
Index: source/simulation2/Simulation2.cpp =================================================================== --- source/simulation2/Simulation2.cpp (revision 21824) +++ source/simulation2/Simulation2.cpp (working copy) @@ -658,6 +658,11 @@ m->m_RejoinTestTurn = rejoinTestTurn; } +void CSimulation2::DumpState() +{ + m->DumpState(); +} + void CSimulation2::EnableOOSLog() { if (m->m_EnableOOSLog) Index: source/simulation2/Simulation2.h =================================================================== --- source/simulation2/Simulation2.h (revision 21824) +++ source/simulation2/Simulation2.h (working copy) @@ -225,6 +225,7 @@ ScriptInterface& GetScriptInterface() const; bool ComputeStateHash(std::string& outHash, bool quick); + void DumpState(); bool DumpDebugState(std::ostream& stream); bool SerializeState(std::ostream& stream); bool DeserializeState(std::istream& stream); Index: source/simulation2/system/ReplayTurnManager.cpp =================================================================== --- source/simulation2/system/ReplayTurnManager.cpp (revision 21824) +++ source/simulation2/system/ReplayTurnManager.cpp (working copy) @@ -81,6 +81,7 @@ m_HasSyncError = true; LOGERROR("Replay out of sync on turn %d", turn); g_GUI->SendEventToAll("ReplayOutOfSync"); + m_Simulation2.DumpState(); } }
comment:14 by , 6 years ago
Status Report of the five OOS issues found:
- Phab:D1519 solves a non-visual replay hash mismatch due in CmpSoundGroup.
- #5185 continues the OOS on rejoin on turn 0 problem.It occurs since alpha 19 it's likely irrelevant to fix for alpha 23 re.
- #5186 continues the territory differences in the KunstRaucher mainland replay and another report from fpre / Palai yesterday. It is partially reproducible, responsible commit known and likely different from the Jebel Barkal hash mismatch.
- I recall two oos_dumps with fpre that had pathfinder / unitAI diffs and that maybe him having had unintentially Phab:D1268 applied locally (but not in the mod).
- -> Only remaining mystery: According to the oos_dumps in nanis JB replay, the textual representation of the simstate is equal, so it is only a mismatch of the hash but the players might or might not be able to continue to play.
nanis JB replay contains a hardly explainable bug that is likely the only mystery left for this ticket:
commands_nani.txt
was replayed twice without a hash mismatch by Boudica on windows with the first release. Diplomacy colors were not enabled (default) and range overlays were enabled (default).
Noone else (especially nani) computes the same hash, no matter which platform and config settings. So we're trying to get Boudica to find the cause by compiling with the patch above and hopefully computing the hash again with it while replaying and dumping the simstate, possibly find the differing stacktrace.
by , 6 years ago
Attachment: | OOS_Talenters_incomplete.7z added |
---|
Incomplete files, some turns missing. Someone else joined as Talenters that also wasn't fpre.
comment:15 by , 6 years ago
(18:10:02) nani: cpu can make mistakes too (18:10:51) nani: got overheated time ago (18:11:43) nani: just before exiting (18:11:47) nani: water disapeared (18:12:01) nani: then screen started blinking black quicly (18:12:23) nani: then i close forced window (18:12:39) nani: no error whatsoever (18:14:34) nani: i dodnt get error game started blinking black I sold you already
Hardware failure causing a hash mismatch but no textual simstate diff sounds unlikely. Could water disappearing not be a shader bug but also some loss of the water grid? An out-of-memory corrupting the hash but not the textual oosdump sounds unlikely as well. Given that nani still can't recompute the hash from then, it also can't be a simple platform issue. Given that Boudica but noone else (sample increased to 5+ players) computes the same hash, none of this makes any sense.
comment:17 by , 6 years ago
Milestone: | Alpha 24 |
---|---|
Resolution: | → needsinfo |
Status: | new → closed |
#5186 and #5185 continue the reproducible remaining ones.
Remaining OOS indicated by the logfiles but irreproducible with current information:
- The two Jebel Barkal replays with a hash mismatch after 1h
- KunstRaucher territory OOS on mainland that sounds very related to #5186
We got no OOS reports in the last two days and there was one 90min JB game with lots of rejoins but no OOS. So it's hopefully not that bad as it looked when we had 10 reports few days ago.
To be reopened once more information surfaces.
comment:18 by , 6 years ago
There was another OOS report just now. causative was the host, Briamm was the only player out of sync on turn 1 without a rejoin. Briamm speaks spanish only, was asked by nani, but could not upload the OOS log.
comment:19 by , 6 years ago
Decger was online today and we could retrieve the following missing information about the JB OOS game from the very first report here:
- his
oos_dump.txt
was overwritten by a later game - he replayed his own replay file later and got an OOS too
- he is 100% sure that he enabled diplomacy colors during the match
(ping temple)
comment:20 by , 6 years ago
Also Boudica has tried nani's replay again and couldn't reproduce his hash. So in that and the Decger game it happened to the host of the game on Jebel Barkal and wasn't reproducable afterwards. I don't have any ideas, so I think we have to wait for more data.
comment:21 by , 6 years ago
The news is that Decger had diplomacy colors enabled and in all of our testing we had ran this with diplomacy colors disabled.
Simply enabling them during a match doesn't trigger an OOS, but hypothetically speaking, there may be a specific situation in which it could. Maybe if it's the host, serializes some colors but this is self-corrected from some GUI event or such.
by , 6 years ago
Attachment: | OOS_Talenters_Danubius.zip added |
---|
New OOS report on Danubius. Talenters was the only one OOS and not using diplomacy colors. OOS dumps from different turns, but we could create the one corresponding to Talenters from nanis replay.
comment:22 by , 6 years ago
Why do you say different turns? The rng is different, but the rng doesn't match any of nani's turns 701-721 at least. And counting different lines with comparison of those logs to Talenters, there's 1451 lines different in turn 716 (the match to nani's log, apart from local entities) but 12k+ lines different in 715 or 717, and more the farther you go away. So I think the logs are from the same turn.
Note that the quick hashes were different and stayed different the next few turns, whereas in other examples the difference was only in the full hashes, so this could be a different oos.
comment:23 by , 6 years ago
Why do you say different turns?
It's what I learned (below) and observed (#3348). It could be an rng bug, but I'll dump my own simstates for similar turns and compare before I suspect a deviation from the rule.
From 2015-02-25-QuakeNet-#0ad-dev.log
:
10:14 < Philip`> Yes - the RNG's state changes every time you ask it for another random number
10:17 < elexis> the oos dump starts with a 'rng' value, i thought this is the seed
10:22 < Philip`> elexis: That is the RNG's entire state, not its seed
10:22 < Philip`> The seed just determines the RNG's initial state at the beginning of the game, and then is no longer needed
(We really need to fix #3348 and #3615 and need a commandline argument to dump on specific turns.)
comment:24 by , 6 years ago
Talenters joined on turn 544 (didn't start producing hashes until turn 614) and left after turn 719. Here's the debug output from the Danubius trigger file:
Turn 546 (500)... DEBUG [5] Filling ship 5297 with ({'units/gaul_mechanical_siege_ram':1, 'units/gaul_support_healer_b':1, 'units/gaul_cavalry_javelinist_b':1, 'units/gaul_infantry_slinger_b':2, 'units/gaul_infantry_spearman_b':4}) DEBUG [5] Filling ship 5298 with ({'units/gaul_mechanical_siege_ram':1, 'units/gaul_support_healer_e':1, 'units/gaul_cavalry_javelinist_b':3, 'units/gaul_cavalry_swordsman_b':1, 'units/gaul_infantry_javelinist_b':1, 'units/gaul_infantry_slinger_b':2}) Turn 716 (500)... DEBUG [6] Ship 5297 will ungarrison at C (274,318) DEBUG [6] Ship 5298 will ungarrison at D (106,614)
If you early return in UngarrisonShipsOrder
then you'll calculate Talenters' hashes. So for some reason that function either wasn't called (or no trigger functions were called?) or didn't do what it was supposed to.
comment:25 by , 6 years ago
Reproduced. Problem is:
ERROR: Error in timer on entity 1, IID93, function DoAction: TypeError: this.riverDirection.cross is not a function
Apparently while this.riverDirection
for the rejoined player contains the same data (i.e. x
and y
are the same values), it's not a Vector2D
so the cross
function isn't recognized.
comment:28 by , 6 years ago
The Talenters/Danubius OOS was fixed by r21834 (#5198), and an OOS on square maps (not reported in this ticket) was fixed by r21835 (#5186).
comment:29 by , 6 years ago
#5199 can perfectly explain KunstRauchers Mainland OOS in fact, see reasoning over there.
So the only unresolved mystery in this thread seem to be the two lategame jebel barkal ones that might be GUI induced OOS. Recap:
- Irreproducible hash and missing OOS_dump file. Decger recalled using diplo colors for sure.
- Irreproducible hash, the
oos_dump.txt
contains no difference. I don't know if nani used diplomacy colors. nani spoke of the hardware defect, but only the last full hash of the replay was wrong, while later quick hashes andoos_dump.txt
were correct.
Perhaps some event occured near that specific turn, such as playerdefeat?
comment:30 by , 6 years ago
Description: | modified (diff) |
---|
comment:31 by , 6 years ago
In r21829:
Test all full hashes in non-visual replaymode by default and keep skipping quick-hash by default.
The previous code only tested quick hashes every 100 turns and could not be used to confirm replay hashes matching.
The option can become used for visual replays too.
Differential Revision: https://code.wildfiregames.com/D1538
Refs #5162
Reviewed By: temple
by , 6 years ago
Attachment: | jebel_barkal_nani_host_2018-06-04.7z added |
---|
Same OOS as in the other Jebel Barkal game hosted by nani that went also OOS after water was broken and had no textual diffs. On on turn 3340, nani said "lol", on turn 3347 "water turned black", OOS on turn 3420. All consecutive full-hash were wrong and quick hash correct.
comment:32 by , 6 years ago
temple was reminded of https://wildfiregames.com/forum/index.php?/topic/24154-all-map-black/ and Stan reported that he gets a similar bug in atlas, see attachment:9ffVWP1.png
by , 6 years ago
Attachment: | oos_files_nani_jebel_barkal_binary.7z added |
---|
by , 6 years ago
Attachment: | screenshot1257.jpg added |
---|
comment:34 by , 6 years ago
Here the weird unresolved Jebel Barkal OOS without rejoin where only nani is OOS occured again. Again the first thing that occured was that water went black for nani, then nani said in the chat that there will be an OOS, then there was the OOS. Again the textual simstate is identical, but now we have the binary simstate.
comment:35 by , 6 years ago
Looks like differences with objects versus references, first differences with lastCarriedType
(ResourceGatherer), last difference with order
(UnitAI), and who knows what else in between. Seems always nani with an object and elexis with a reference.
comment:36 by , 6 years ago
I observe 3 diffs:
ResourceGatherer.lastCarriedType
andResourceSupply.cachedType
AuraManager
orTechnologyManager
modificationsUnitAI
order queue
So perhaps AuraManager / TechnologyManager objects were created depending on the user interaction / GUIInterface and the other diffs are consequences.
Just what we needed. Sweet Christmas.