Opened 8 years ago

Closed 7 years ago

#3637 closed defect (needsinfo)

Obstruction Manager OOS without rejoin

Reported by: wraitii
Owned by:
Priority: Release Blocker
Milestone:
Component: UI & Simulation
Keywords:
Cc:
Patch:

Description (last modified by elexis)

We experienced an OOS in a 3v4 today on r17298 (alpha 19 release version). There were no rejoins. I was the only player who went OOS, on turn 2840 according to elexis.

Attached are my commands.txt and oosdump.

Attachments (21)

commands.txt (1.3 MB ) - added by wraitii 8 years ago.
oos_dump.txt.zip (734.5 KB ) - added by wraitii 8 years ago.
oos_dump_elexis_orig.zip (657.9 KB ) - added by elexis 8 years ago.
My version of the oosdump, generated ingame.
oos_dump_elexis_simulated.zip (1.1 MB ) - added by elexis 8 years ago.
OOS dump generated using -ooslog option and any of the two patches of #3460 for turn 2840.
scenario_map.zip (573.3 KB ) - added by elexis 8 years ago.
The scenario map that is equivalent to the random map. Can be used to do serializationtests.
wraitii_state_1840.txt.zip (706.4 KB ) - added by wraitii 8 years ago.
serializationtest_dump_turn_2847_cleaned.diff (11.5 KB ) - added by elexis 8 years ago.
A diff of the oosdumps of the host (-) and rejoined client (+) on turn 2847 (which is the turn where the serializationtest fails on both linux and mac os). Local entities and all hunks where only the ticket number differs were removed (all of those hunks have an exact difference of three in the IDs).
patch+state.zip (529.0 KB ) - added by wraitii 8 years ago.
instant_reproduce_turn2847.7z (664.4 KB ) - added by elexis 8 years ago.
Compile with this patch and you will be able to (a) start & stop visual replay exactly on that turn, (b) run a serializationtest that fails on the first turn. Change the filename to state.after.a or state.after.b and you will see the different states for the host / rejoined client one turn later.
fort_builders.jpg (231.2 KB ) - added by elexis 8 years ago.
With regards to the serializationtesterror on turn 2847: These two guys (entity 572 and 7221) are repairing the fort on turn 2847 in the simstate of the rejoined client. But in the simstate of the host, they are still approaching the fortress on that turn.
lumberjacks.jpg (135.0 KB ) - added by elexis 8 years ago.
With regards to the serializationtesterror on turn 2847: These two lumberjacks (entity 4134 and 4298) have (a) pathstate 5 (PATHSTATE_FOLLOWING_REQUESTING_SHORT) for the host and 3 (PATHSTATE_FOLLOWING) for the rejoined client, (b) tries 1 for the host and 0 for the rejoined client, (c) an actual pathfinder ticket number for the host, but ticket number 0 for the rejoined client.
fort_builders_palisade.jpg (202.5 KB ) - added by elexis 8 years ago.
With regards to the serializationtesterror on turn 2847: This palisade is placed exactly on turn 2847 (see commands.txt) with those two OOS spartan fortress builders (572 and 7221)!
commands_reproduce_palisade_oos.txt (3.7 KB ) - added by elexis 8 years ago.
See this commands.txt for a reproduction of the serializationtest error. Here a unit was building a house, then got an order to build a palisade on turn 52, which is exactly the turn where it becomes OOS. This still doesn't explain the OOS without rejoin of the original commands.txt though.
2840.7z (773.4 KB ) - added by Itms 8 years ago.
2016-05-04-oos-r18133.7z (1.2 MB ) - added by elexis 8 years ago.
Full logs for r18133, only local entities different
2016-05-15 fatherbushido oos_r18179.zip (439.3 KB ) - added by elexis 8 years ago.
Only local entities differ. I'm using Ubuntu 15.04 with gcc 4.9.2 while fatherbushido uses Debian sid (kernel 4.5.3) with gcc 5.3.1.
noInfinity.diff (633 bytes ) - added by sanderd17 8 years ago.
hash_mismatch_on_rejoin_reproduce_r18742.7z (64.4 KB ) - added by elexis 8 years ago.
Identical simstate, but hash mismatch on rejoin. Replay file of host and rejoined client. Client (fatherbushido) rejoined turn 12(-1), "OOS" on turn 40. NOTICE: We had one rejoin before where no field was built and didn't get an OOS. In this replay there was only a field built. So perhaps sanderd17 is right.
hash_dump.patch (4.3 KB ) - added by elexis 8 years ago.
Creates one directory per turn, with one file per component containing all bytes that are hashed for that component. This should make it possible to identify the diff in the data that is hashed (rather than a diff of simstates).
hash_mismatch_on_rejoin_reproduce_bb_r18742.7z (238.3 KB ) - added by elexis 8 years ago.
Another reproduction with bb.
oos_a20_LongKnob.7z (921.6 KB ) - added by elexis 8 years ago.
First OOS report for Alpha 20. oos_dumps come from different turns, random map.

Change History (37)

by wraitii, 8 years ago

Attachment: commands.txt added

by wraitii, 8 years ago

Attachment: oos_dump.txt.zip added

by elexis, 8 years ago

Attachment: oos_dump_elexis_orig.zip added

My version of the oosdump, generated ingame.

by elexis, 8 years ago

Attachment: oos_dump_elexis_simulated.zip added

OOS dump generated using -ooslog option and any of the two patches of #3460 for turn 2840.

by elexis, 8 years ago

Attachment: scenario_map.zip added

The scenario map that is equivalent to the random map. Can be used to do serializationtests.

comment:1 by elexis, 8 years ago

As wraitii made sure he had no patch applied (especially no template diffs), this should be a platform bug. He was the only player to use Mac OS.

Both of our ingame-oos-dumps have many differences to the one I generated using non-visual replay, so we can assume that both ingame dumps were generated on turns after 2840.

We can be certain that the OOS occurred exactly on turn 2840, as the NetTurnManager sends the turn number in the message.

The serializationtest on that turn didn't throw an error on my machine.

To further debug, we need wraitii's oosdump of exactly that turn.
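The reasoning about the turn number can be illustrated with a small sketch. It assumes that every client reports one state hash per turn and that the odd one out is found by comparing against the most common hash. The function name, the player names other than elexis and wraitii, and the hash values are hypothetical; this is not the NetTurnManager code.

```python
from collections import Counter


def find_oos_clients(reports):
    """Return the clients whose state hash differs from the most common one.

    `reports` maps a client name to the hash it reported for a single turn.
    Comparing against the majority is an assumption of this sketch.
    """
    majority_hash, _ = Counter(reports.values()).most_common(1)[0]
    return sorted(name for name, h in reports.items() if h != majority_hash)


# Hypothetical reports for turn 2840 (hash values are made up).
turn = 2840
reports = {"elexis": "a1b2", "player3": "a1b2", "player4": "a1b2", "wraitii": "ffff"}
print(turn, find_oos_clients(reports))
```

Because the hash is tied to a turn number, the dumps to compare have to come from exactly that turn; dumps written a few turns later will differ for unrelated reasons.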

comment:2 by elexis, 8 years ago

Description: modified (diff)
Priority: Should Have → Must Have
Summary: OOS in an MP game → OOS without rejoin

Actually I get a serializationtest-error on turn 2847!

A pathfinder issue again.


by wraitii, 8 years ago

Attachment: wraitii_state_1840.txt.zip added

by elexis, 8 years ago

Attachment: serializationtest_dump_turn_2847_cleaned.diff added

A diff of the oosdumps of the host (-) and rejoined client (+) on turn 2847 (which is the turn where the serializationtest fails on both Linux and Mac OS). Local entities and all hunks where only the ticket number differs were removed (all of those hunks have an exact difference of three in the IDs).

comment:3 by wraitii, 8 years ago

The attached patch makes reproducing easier by allowing you to skip the first 28XX turns.

You will probably have to change some hardcoded paths.

Also attached is the binary state on turn 2701.

The way this works is that you define "hashStart" to be the first turn you want to serializationtest on (then it will re-test all turns; it doesn't keep the state). FirstSimStart is used to skip the first turns: the sim will load the binary file "statelol.txt" at the hardcoded path given and use that for the first sim. You also need "started" to be false for that; if "started" is true it will only serialization-test.

Deserializeall is iirc unused. serializeEvery means you will export the first sim state every X turns to help with reloading faster later on.
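The idea behind the patch can be sketched as a replay loop with checkpoints. This is a toy model only: `step`, `replay` and the state layout are hypothetical stand-ins, and `hash_start` / `serialize_every` merely mirror the roles of "hashStart" and "serializeEvery" described above.

```python
import pickle


def step(state, turn):
    """Toy deterministic simulation step (stand-in for the real simulation)."""
    return {"turn": turn, "value": (state["value"] * 31 + turn) % 1000003}


def replay(last_turn, hash_start, serialize_every, checkpoints,
           start_state=None, start_turn=0):
    """Replay up to last_turn, optionally resuming from a saved state."""
    state = start_state if start_state is not None else {"turn": 0, "value": 1}
    tested = []
    for turn in range(start_turn + 1, last_turn + 1):
        state = step(state, turn)
        # Export the state every N turns so later runs can skip ahead.
        if serialize_every and turn % serialize_every == 0:
            checkpoints[turn] = pickle.dumps(state)
        # From hash_start on, check that the state survives a round trip.
        if turn >= hash_start:
            assert pickle.loads(pickle.dumps(state)) == state
            tested.append(turn)
    return state, tested


# Full run that writes checkpoints, then a fast run resumed from turn 2800.
checkpoints = {}
full_state, _ = replay(2850, 2840, 100, checkpoints)
resumed_state, tested = replay(2850, 2840, 0, {},
                               pickle.loads(checkpoints[2800]), 2800)
print(resumed_state == full_state, tested[0], tested[-1])
```

Resuming from a checkpoint only helps if the simulation is deterministic from that state onwards, which is exactly the property the serializationtest checks.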

by wraitii, 8 years ago

Attachment: patch+state.zip added

by elexis, 8 years ago

Attachment: instant_reproduce_turn2847.7z added

Compile with this patch and you will be able to (a) start & stop visual replay exactly on that turn, (b) run a serializationtest that fails on the first turn. Change the filename to state.after.a or state.after.b and you will see the different states for the host / rejoined client one turn later.

by elexis, 8 years ago

Attachment: fort_builders.jpg added

With regards to the serializationtesterror on turn 2847: These two guys (entity 572 and 7221) are repairing the fort on turn 2847 in the simstate of the rejoined client. But in the simstate of the host, they are still approaching the fortress on that turn.

by elexis, 8 years ago

Attachment: lumberjacks.jpg added

With regards to the serializationtesterror on turn 2847: These two lumberjacks (entity 4134 and 4298) have (a) pathstate 5 (PATHSTATE_FOLLOWING_REQUESTING_SHORT) for the host and 3 (PATHSTATE_FOLLOWING) for the rejoined client, (b) tries 1 for the host and 0 for the rejoined client, (c) an actual pathfinder ticket number for the host, but ticket number 0 for the rejoined client.

comment:4 by elexis, 8 years ago

With regards to the serializationtesterror on turn 2847:

After thorough investigation of the diff of the oosdump of the serializationtesterror on turn 2847 (see attachment:serializationtest_dump_turn_2847_cleaned.diff) I noticed that only those two builders of the spartan fortress and those two lumberjacks have different pathfinder states. Every other hunk is a consequence of that.

It looks like those spartan builders start building the fort one turn earlier, and those lumberjacks have their short path computed one turn later for the host.

The next step would be to add LOGERROR calls to the related functions and see where the diff originates.

Notice I'm using the word "host" for the primary simulation and "rejoined client" for the secondary simulation, as those would experience the issue this way if they had rejoined on this turn. In the original game there were no rejoins.
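The kind of cleaned diff described above can be approximated with a short script. This is a simplified sketch: it assumes the two dumps were already parsed into dictionaries keyed by entity ID, it marks local entities with a hypothetical `local` flag, and it ignores the ticket field entirely (the attached diff only dropped hunks where nothing but the ticket number differed). The ticket value 17 is made up.

```python
def diff_states(host, client, ignore_fields=("ticket",)):
    """Return {entity_id: {field: (host_value, client_value)}} for differences.

    Local entities and the fields in ignore_fields are skipped.
    """
    result = {}
    for eid in sorted(set(host) | set(client)):
        a, b = host.get(eid, {}), client.get(eid, {})
        if a.get("local") or b.get("local"):
            continue
        changed = {
            key: (a.get(key), b.get(key))
            for key in sorted(set(a) | set(b))
            if key not in ignore_fields and a.get(key) != b.get(key)
        }
        if changed:
            result[eid] = changed
    return result


# Lumberjack 4134 as described for turn 2847, plus one made-up local entity.
host = {
    4134: {"pathstate": 5, "tries": 1, "ticket": 17},
    900001: {"local": True, "x": 1},
}
client = {
    4134: {"pathstate": 3, "tries": 0, "ticket": 0},
    900001: {"local": True, "x": 2},
}
print(diff_states(host, client))
```

Anything that remains after such filtering is a candidate for the origin of the divergence; everything else is likely a consequence of it.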

by elexis, 8 years ago

Attachment: fort_builders_palisade.jpg added

With regards to the serializationtesterror on turn 2847: This palisade is placed exactly on turn 2847 (see commands.txt) with those two OOS spartan fortress builders (572 and 7221)!

by elexis, 8 years ago

Attachment: commands_reproduce_palisade_oos.txt added

See this commands.txt for a reproduction of the serializationtest error. Here a unit was building a house, then got an order to build a palisade on turn 52, which is exactly the turn where it becomes OOS. This still doesn't explain the OOS without rejoin of the original commands.txt though.

comment:5 by elexis, 8 years ago

Summary: OOS without rejoin → OOS without rejoin and OOS on rejoin when placing palisade while building

http://trac.wildfiregames.com/raw-attachment/ticket/3637/fort_builders_palisade.jpg

comment:6 by elexis, 8 years ago

wraitii found out yesterday what happens on turn 2847 (palisade serializationtesterror) (Moved to #3647):

  • Whenever a palisade/wall is built, the original command is changed inside the simulation, which triggers a serializationtest error.
  • Good: It doesn't cause an OOS in games, as the multiplayer server first sends the commands to all players, who then all execute the command identically.
  • Bad: It still gives a serializationtest error, which will make debugging A19 OOS hard, as the test will always fail when a palisade / wall is placed. It must be fixed either before the release or every time we debug a commands.txt.
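Why a handler that modifies its command trips the serializationtest but not a real game can be shown with a toy model. Everything here is hypothetical (handler names, the `pieces` list, the state layout); it only assumes that the primary and secondary simulation are fed the same command object, whereas networked players each get their own copy.

```python
import copy


def buggy_place_wall(state, cmd):
    # Consumes the piece list of the command it was handed (the bug).
    while cmd["pieces"]:
        state["foundations"].append(cmd["pieces"].pop(0))


def fixed_place_wall(state, cmd):
    # Iterates over a private copy, leaving the issued command untouched.
    for piece in list(cmd["pieces"]):
        state["foundations"].append(piece)


def serialization_test(handler, cmd):
    """Run a primary and a secondary simulation on the same command object."""
    primary = {"foundations": []}
    secondary = copy.deepcopy(primary)
    handler(primary, cmd)    # primary simulation runs first
    handler(secondary, cmd)  # secondary sees whatever the primary left behind
    return primary == secondary


def make_cmd():
    return {"type": "construct-wall", "pieces": ["tower", "wall", "tower"]}


print(serialization_test(buggy_place_wall, make_cmd()))
print(serialization_test(fixed_place_wall, make_cmd()))
```

With the buggy handler the secondary simulation receives an already emptied command and ends up in a different state, even though two separate players with separate copies would agree.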

For turn 2840 (platform-dependent hash mismatch):

  • We have a hash difference for linux/mac but the simstates are identical for all non-local entities
  • We already had something like this in #3108.
    • By using the same approach as in comments 4 and 6 we can find out which component fails hashing, leading us directly to the bug.
  • This is good because it means the game is not OOS
  • This is bad because we will probably have many false positives in A19 games.
    • This means uncertainty about whether players are actually OOS.
    • This means we will have to work around this bug every time we want to check for A19 OOSes.
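Finding out which component fails hashing could look roughly like this. It is a sketch under assumptions: the bytes that each component feeds into the hash are available per component on both machines, MD5 is used only as an example digest, and the component names and byte values are made up for illustration.

```python
import hashlib


def component_hashes(components):
    """Map each component name to a digest of the bytes it contributes."""
    return {name: hashlib.md5(data).hexdigest()
            for name, data in components.items()}


def mismatching_components(a, b):
    """Return the sorted names of components whose digests differ."""
    hashes_a, hashes_b = component_hashes(a), component_hashes(b)
    names = set(hashes_a) | set(hashes_b)
    return sorted(n for n in names if hashes_a.get(n) != hashes_b.get(n))


# Made-up hashed bytes from two platforms for a single turn.
linux = {"Position": b"\x01\x02", "ObstructionManager": b"\x10\x20\x30"}
mac = {"Position": b"\x01\x02", "ObstructionManager": b"\x10\x20\x31"}
print(mismatching_components(linux, mac))
```

Comparing per component narrows a whole-state hash mismatch down to one place, which is useful when the readable simstate dumps look identical.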

comment:7 by wraitii, 8 years ago

In 17324:

Fix an oversight when constructing walls that changed issued commands and would result in false positives when running with -serializationtest. Refs #3637

comment:8 by Itms, 8 years ago

The states are identical for elexis and for me (Win7) on turn 2839 but differ on 2840. Attached are the dumps on both of our machines.


comment:9 by elexis, 8 years ago

Description: modified (diff)
Summary: OOS without rejoin and OOS on rejoin when placing palisade while building → Obstruction Manager OOS without rejoin

by Itms, 8 years ago

Attachment: 2840.7z added

comment:10 by sanderd17, 8 years ago

All reports seem to be from A19. I don't know whether this is solved in the most recent version, but the last MP games on SVN didn't give an OOS. Even big games with 5 and 6 players.

The only OOS reported with these games were caused by an unclean simulation directory, so solved by removing unused files and restarting.

So I wonder if this ticket should still be open.

comment:11 by Itms, 8 years ago

Milestone: Alpha 20 → Alpha 21

We got some reports but not complete enough to understand what's happening. What is sure is that those are false positives and the simulation state is not actually different.

by elexis, 8 years ago

Attachment: 2016-05-04-oos-r18133.7z added

Full logs for r18133, only local entities different

comment:12 by elexis, 8 years ago

Priority: Must Have → Release Blocker

Nominated for most annoying bugs of the last release. Doesn't need to stay, but should really be fixed.

by elexis, 8 years ago

Attachment: 2016-05-15 fatherbushido oos_r18179.zip added

Only local entities differ. I'm using Ubuntu 15.04 with gcc 4.9.2 while fatherbushido uses Debian sid (kernel 4.5.3) with gcc 5.3.1.

comment:13 by sanderd17, 8 years ago

Perhaps it's again an issue with data types that can't be serialised to JSON, e.g. the Infinity of farm fields. Though that wouldn't explain the randomness.
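The concern about Infinity can be demonstrated in isolation. The sketch below uses Python's json module and emulates encoders that write null for non-finite numbers (as JavaScript's JSON.stringify does). The field data and the `amount` key are made up, and whether this is what happens in the game's serializer is not confirmed here.

```python
import json
import math


def sanitize(value):
    """Replace non-finite floats with None, as a null-writing encoder would."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {k: sanitize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    return value


def round_trip(state):
    """Serialise to strict JSON and read the result back."""
    return json.loads(json.dumps(sanitize(state), allow_nan=False))


field = {"template": "field", "amount": math.inf}  # hypothetical state

# Strict JSON has no representation for Infinity at all.
try:
    json.dumps(field, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False

restored = round_trip(field)
print(strict_ok, restored["amount"], restored == field)
```

A client that keeps the live value and a client that restored it from such a round trip would then hold different data for the same entity.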

by sanderd17, 8 years ago

Attachment: noInfinity.diff added

comment:14 by Itms, 8 years ago

Would it be possible to try to reproduce it

  • with sanderd17's tentative fix
  • with the SM38 branch

or is it impossible to reproduce with the commands.txt only?

by elexis, 8 years ago

Attachment: hash_mismatch_on_rejoin_reproduce_r18742.7z added

Identical simstate, but hash mismatch on rejoin. Replay file of host and rejoined client. Client (fatherbushido) rejoined turn 12(-1), "OOS" on turn 40. NOTICE: We had one rejoin before where no field was built and didn't get an OOS. In this replay there was only a field built. So perhaps sanderd17 is right.

by elexis, 8 years ago

Attachment: hash_dump.patch added

Creates one directory per turn, with one file per component containing all bytes that are hashed for that component. This should make it possible to identify the diff in the data that is hashed (rather than a diff of simstates).

by elexis, 8 years ago

Attachment: hash_mismatch_on_rejoin_reproduce_bb_r18742.7z added

Another reproduction with bb.

comment:15 by elexis, 8 years ago

The OOS from the last 3 files are likely fixed by #4239. The turn 2840 hash mismatch thing from comment:6 is probably not addressed yet unless some other patch had fixed it accidentally.

by elexis, 8 years ago

Attachment: oos_a20_LongKnob.7z added

First OOS report for Alpha 20. oos_dumps come from different turns, random map.

comment:16 by elexis, 7 years ago

Milestone: Alpha 21
Resolution: needsinfo
Status: new → closed
  • No data for the original issue (OOS without rejoin). Perhaps it was a stray file in wraitii's repo.
  • The serializationtest issue discovered in this thread was fixed
  • SVN looks very stable now, all known OOS fixed
  • This ticket can be resurrected once we have a reproduction