Opened 9 years ago

Closed 8 years ago

#3292 closed defect (fixed)

OOS on rejoin with new pathfinder

Reported by: elexis Owned by: wraitii
Priority: Release Blocker Milestone: Alpha 19
Component: Core engine Keywords: pathfinding
Cc: Patch:

Description (last modified by elexis)

We played a 2v3 on r16753, i.e. with the new pathfinder committed in r16751.

ffm rejoined early and was completely oos (i.e. he saw a different amount of units in my town, and I saw a different amount of units in his town).

Units seem to get stuck and sometimes stop walking for no apparent reason (while being in the walking animation and state). Mabye not related to this bug, since:

The diff only contains one entity that has different values. It is a woman with different waypoints and position. It is the second woman I produced at my CC (second 25). When doing the visual replay, the oos error appears turn 980 (8min 10sec).

--- oos_dump_elexis.txt	2015-06-13 01:23:48.092697000 +0200
+++ oos_dump_ffm.txt	2015-06-13 01:23:47.000000000 +0200

Somewhere:
     entity: 5790
-    x: 443.95232
+    x: 443.95234
     z: 1028.04763
     r: 0.8
     clearance: 1
     flags: 21
     group: 5790



Somewhere below that:
 - id: 5790
   Decay:
   Footprint:
   Minimap:
     r: 230
     g: 230
     b: 75
     active: true
-    x: 443.95232
+    x: 443.95234
     z: 1028.04763
   Obstruction:
     active: true
     moving: true
     control group: 5790
     control group 2: 0
     tag: 8652
     flags: 5
   OverlayRenderer:
   Ownership:
     owner: 4
   Position:
     in world: true
-    x: 443.95232
+    x: 443.95234
     y: 0
     z: 1028.04763
     last x: 445.1823
     last y diff: 0
     last z: 1023.95841
     territory: 4
     rot x: 0
     rot y: -0.36324
     rot z: 0
     altitude: 0
     relative: true
     floating: false
     constructionprogress: 0
     anchor: "upright"
     turret parent: 0
   Selectable:
   UnitMotion:
     radius: 0.8
     state: 3
     path state: 3
     pass class: "default"
     ticket: 0
     target entity: 0
     target pos x: 0
     target pos y: 0
     target offset x: 0
     target offset y: 0
     target min range: 0
     target max range: 0
     speed: 9.5
     moving: true
     facePointAfterMove: true
-    length: 17
+    length: 18
     waypoint x: 344.5
     waypoint z: 1110.5
     waypoint x: 353
     waypoint z: 1110.5
     waypoint x: 360.75
     waypoint z: 1107
     waypoint x: 366.125
     waypoint z: 1103.75
     waypoint x: 371.3125
     waypoint z: 1099.625
     waypoint x: 377.40625
     waypoint z: 1094.0625
-    waypoint x: 384.95313
-    waypoint z: 1086.78125
-    waypoint x: 390.22657
-    waypoint z: 1081.64063
-    waypoint x: 394.86329
-    waypoint z: 1077.07032
-    waypoint x: 399.18165
-    waypoint z: 1072.78516
-    waypoint x: 403.84083
-    waypoint z: 1068.14258
-    waypoint x: 409.67042
-    waypoint z: 1062.3213
-    waypoint x: 416.08521
-    waypoint z: 1055.91065
-    waypoint x: 421.79261
-    waypoint z: 1050.20533
-    waypoint x: 427.64631
-    waypoint z: 1044.35267
-    waypoint x: 434.07316
-    waypoint z: 1037.92634
-    waypoint x: 439.78658
-    waypoint z: 1032.21317
+    waypoint x: 382.95313
+    waypoint z: 1088.78125
+    waypoint x: 387.72657
+    waypoint z: 1084.14063
+    waypoint x: 393.11329
+    waypoint z: 1078.82<032
+    waypoint x: 398.30665
+    waypoint z: 1073.66016
+    waypoint x: 403.40333
+    waypoint z: 1068.58008
+    waypoint x: 409.45167
+    waypoint z: 1062.54004
+    waypoint x: 415.97584
+    waypoint z: 1056.02002
+    waypoint x: 420.23792
+    waypoint z: 1051.76001
+    waypoint x: 426.86896
+    waypoint z: 1045.13001
+    waypoint x: 431.18448
+    waypoint z: 1040.81501
+    waypoint x: 435.84224
+    waypoint z: 1036.15751
+    waypoint x: 440.67112
+    waypoint z: 1031.32876
     length: 0
     length: 0
     expected path ticket: 0
     next step clean: true
     type: 0
     goal x: 342
     goal z: 1110
     goal u x: 0
     goal u z: 0
     goal v x: 0
     goal v z: 0
     goal hw: 0
     goal hh: 0
     maxdist: 0
   Vision:
   VisualActor:
     base actor: "units/celts/female_citizen.xml"
     r: 1
     g: 1
     b: 1

Attachments (12)

t3292_commands.txt_oos_dump.7z (641.8 KB ) - added by elexis 9 years ago.
For now I only received one commands.txt of the rejoined player (ffm). Not sure if its the first or second one. Notice that the hash is wrong in turn 0 already according to the commands.txt file of ffm. We made sure that we are both on the same revision. We also did another test before that that was not oos.
commands_fun.txt (921.4 KB ) - added by elexis 9 years ago.
A replayfile that contains the playernames as well, if you want to replay it visually (ignore the oos error message on turn 0). If you replay the commands_elexis.txt file visually, then the oos error will apear at turn 980 (i.e. 8 minutes 10 seconds)!!
r16853_commands.txt_ooslog.7z (483.2 KB ) - added by elexis 9 years ago.
Another oos on rejoin (hapoofesgeli rejoined). Many values differ. The visibilities are probably #3107.
oos_dump_16863.7z (719.8 KB ) - added by elexis 9 years ago.
Another one including commands.txt
oos_rejoin_pathfinder_r16877_different_waypoints_one_woman_random_map.7z (172.9 KB ) - added by elexis 9 years ago.
This oos occured after rejoin after 19 seconds. About 30 units on the map. The only diff is a single woman having different waypoints and a different position. Random map. Screenshot of the accused woman: https://i.imgur.com/EXI94bu.jpg
oos_rejoin_pathfinder_r16876_different_rng_entitycount_waypoints_subdiv-items.7z (269.6 KB ) - added by elexis 9 years ago.
Here another oos on rejoin. Notice differences in the subdiv-items section, similar to one of the originally uploaded ooslogs. However this dump seems to be bogus as the rng value in the first line already differs. It changes every turn. So probably the simulation of one client has already progressed one turn when the CSyncErrorMessage arrived.
stuck_cav_r16884.7z (1.8 KB ) - added by elexis 9 years ago.
Here a minimal commands.txt on a skirmish map where the initial cavalry skirmisher got stuck in the woods. Only walk commands were issued. I did a serializationtest afterwards (results included), however no problem was discovered. In more tests I rejoined while & after a unit was stuck, but that didn't cause an out of sync either. Therefore stuck units shouldn't cause this oos.
oosdump_r16988.7z (322.9 KB ) - added by elexis 9 years ago.
No news here. OOS on rejoin, but serializationtest succeeds. Recipe is still the same, send many units to work and rejoin. In this dump, a woman (entity id 7086) has a slightly different position and different waypoints.
oosdumps_r17171.7z (330.5 KB ) - added by elexis 8 years ago.
Again, OOS on rejoin (turn 139), but serializationtest shows no errors
oos_tree_chopper.mp4 (1.6 MB ) - added by elexis 8 years ago.
Video of turns 85-145. The turn number is displayed at the left. The affected unit is selected.
serializationChange.patch (3.9 KB ) - added by wraitii 8 years ago.
Hack simulation2.cpp to test
serializationChange_v2.patch (5.1 KB ) - added by elexis 8 years ago.
Rebased wraitiis patch starting the serializationtest only after turn N.

Change History (32)

by elexis, 9 years ago

For now I only received one commands.txt of the rejoined player (ffm). Not sure if its the first or second one. Notice that the hash is wrong in turn 0 already according to the commands.txt file of ffm. We made sure that we are both on the same revision. We also did another test before that that was not oos.

by elexis, 9 years ago

Attachment: commands_fun.txt added

A replayfile that contains the playernames as well, if you want to replay it visually (ignore the oos error message on turn 0). If you replay the commands_elexis.txt file visually, then the oos error will apear at turn 980 (i.e. 8 minutes 10 seconds)!!

comment:1 by elexis, 9 years ago

Description: modified (diff)

comment:2 by Itms, 9 years ago

Owner: set to Itms

Thanks for all this data. I got this problem when running serialization tests, I'm glad it only creates rejoin OOSes.

I'm prioritizing my grid update change, which will make serialization tests less clunky.

by elexis, 9 years ago

Another oos on rejoin (hapoofesgeli rejoined). Many values differ. The visibilities are probably #3107.

comment:3 by Itms, 9 years ago

Ah thanks, now I will be able to take a look on the latest revision (I think I'm done changing a lot of things so it's more than time to fix those OOSes)

comment:4 by Stan, 9 years ago

does r16857 fix it ?

comment:5 by Itms, 9 years ago

No, why would it?

by elexis, 9 years ago

Attachment: oos_dump_16863.7z added

Another one including commands.txt

comment:6 by Itms, 9 years ago

Resolution: fixed
Status: newclosed

In 16867:

Use the terrain-only grid for terrain edges in the short pathfinder algorithm. This grid is updated on each terrain change, whereas the passability grid is updated once a turn. This caused OOS on rejoin, fixes #3292.

However, using the terrain-only grid reveals one discrepancy between the short pathfinder (which uses unit radii) and the long one (which uses unit clearances). So I implemented the change proposed by sanderd17 in #3294, which is removing unit radius and using only the pathfinder clearance. Refs #3294
Now some tweaking has to be done in the templates, so that units get a passability class suited to their apparent size. In the meantime the unit motion is quite bugged.

comment:7 by elexis, 9 years ago

Resolution: fixed
Status: closedreopened

Reopening (with Itms approval) because not all causes of the bug have been fixed.

Reproduce:

  1. Host a match with
  • deathmatch resources
  • tiny map (faster loading)
  • much wood
  • cheats enabled
  • and join with a second instance.
  1. Enter "the hive master" and "i am too busy" cheat
  2. Select the CC, set the gathering point on wood and produce units
  3. Let them gather for a while
  4. Rejoin with the other instance

#3339 can be used to write the oosdumps to different file locations. Serializationtestmode is too slow to produce results without falling asleep unfortunately.

by elexis, 9 years ago

This oos occured after rejoin after 19 seconds. About 30 units on the map. The only diff is a single woman having different waypoints and a different position. Random map. Screenshot of the accused woman: https://i.imgur.com/EXI94bu.jpg

comment:8 by elexis, 9 years ago

Here another oos dump including ooslogs for every turn of both host and rejoined client (zipped 5MB, all files unzipped 5GB).

http://www.megafileupload.com/4Qqv/oos_rejoin_pathfinder_r16877_with_ooslogs____4GB___.7z

mirror: http://www.file-upload.net/download-10790895/oos_rejoin_pathfinder_r16877_with_ooslogs____4GB___.7z.html

Differences are: positions and goals, tickets in the pathfinder section.

by elexis, 9 years ago

Here another oos on rejoin. Notice differences in the subdiv-items section, similar to one of the originally uploaded ooslogs. However this dump seems to be bogus as the rng value in the first line already differs. It changes every turn. So probably the simulation of one client has already progressed one turn when the CSyncErrorMessage arrived.

by elexis, 9 years ago

Attachment: stuck_cav_r16884.7z added

Here a minimal commands.txt on a skirmish map where the initial cavalry skirmisher got stuck in the woods. Only walk commands were issued. I did a serializationtest afterwards (results included), however no problem was discovered. In more tests I rejoined while & after a unit was stuck, but that didn't cause an out of sync either. Therefore stuck units shouldn't cause this oos.

comment:9 by Itms, 9 years ago

I get no serialization errors with all these replays. The only problem I had was caused by #3164 and with my patch for it, everything works fine.

Let's leave this ticket as it is in case someone gets this OOS again, but there is little I can do now.

by elexis, 9 years ago

Attachment: oosdump_r16988.7z added

No news here. OOS on rejoin, but serializationtest succeeds. Recipe is still the same, send many units to work and rejoin. In this dump, a woman (entity id 7086) has a slightly different position and different waypoints.

comment:10 by Itms, 9 years ago

Owner: Itms removed
Status: reopenednew

I won't have time for programming anymore during the upcoming months.

comment:11 by Josh, 9 years ago

I checked all the changes since the end of May, and it looks like no floating point numbers have been added to the serialized simulation (so this is likely not an issue with floating point accuracy). Checking that we have not started using any of the standard library which might behave differently across platforms would be a good next step.

comment:12 by elexis, 9 years ago

I'm suspecting bugs in the logic. This ticket might be actually a duplicate of #3538, since that one also produces different waypoints. (There may be more ways to get different waypoints though.)

comment:13 by wraitii, 8 years ago

In 17158:

Fix a logic error in the hierarchical pathfinder that resulted in it not updating all passability classes correctly. Fixes #3538, refs #3292 (it fixes one of the cases, but not the examples).

Also features style fixes, thanks leper for noticing.

comment:14 by elexis, 8 years ago

This issue is still prevalent with r17168. Contrary to the a18 OOSes (#3107 #3108 #3151) which rarely notably changed the simstate, this one is a game-over on rejoin and should stay a release blocker.

The good news is that most other OOS are fixed now, which prohibited debugging this one (especially #3450). (Still surprised by wraitii plottwisting the pathfinder :))

Last time I checked (2 months ago), there was an OOS on rejoin, but the serializationtest never failed. This is to be verified. Ideally we create a minimum commands.txt where the serializationtest fails, from which we can derive the bug.

by elexis, 8 years ago

Attachment: oosdumps_r17171.7z added

Again, OOS on rejoin (turn 139), but serializationtest shows no errors

by elexis, 8 years ago

Attachment: oos_tree_chopper.mp4 added

Video of turns 85-145. The turn number is displayed at the left. The affected unit is selected.

comment:15 by elexis, 8 years ago

  • The replay contains very few commands: create some units and let them chop wood.
  • One archer (entity id 7097) has a different position for the rejoined client.
  • Type g_Selection.addList([7097]); in the JS console to select him.
  • There is a diff in the LOS state and waypoints, but those should just be consequences of the different position.
  • Notice the archer receives a female inspiration aura bonus. I think if that would be buggy, we would have seen way more frequent OOS on a18.
  • The rejoined client starts hashing on turn 124, but all hash values match until turn 139.
  • Turn 138: In the replay (which matches the host), the unit has 10 wood (finished gathering). Stands at the tree without formation for one turn and then starts to walk to the dropsite.
  • As the hash values match before turn 139, the position of the gathering archer for those turns must be identical too.
  • I assume the rejoined client decides to dropoff one turn earlier.

Why 'the serializationtest doesn't catch it:

  • The serializationtest simulates only one turn of a rejoined client. But the OOS occurs multiple turns after the rejoin. Thus it is logical that serializationtest doesn't catch it.
  • We could mofiy the serializationtest to initialize the secondary simulation on turn 124 and then continue the simulation after that, instead of reseting it. Should trigger the OOS error.
Last edited 8 years ago by elexis (previous) (diff)

comment:16 by elexis, 8 years ago

bump

comment:17 by wraitii, 8 years ago

I hacked simulation2.cpp to initialize the serialization on turn 124 and run from there (at least I think, someone ought to check my code). Can't reproduce.

To clarify elexis, this actually happened over MP? It's not two instances on your computer?

by wraitii, 8 years ago

Attachment: serializationChange.patch added

Hack simulation2.cpp to test

comment:18 by wraitii, 8 years ago

My findings so far:

The actual desynch happens on turn 112, because of a poplar tree that gets destroyed. This happens because the dirtiness isn't destroyed the same. Somehow, m_MaxClearance in the obstructionmanager is 0 and not the expected 12.

M_maxClearance is defined on turn 97, when the second simulation is created (to simulate the player join). It is at first correctly assigned a value of 12. Then by some miracle I absolutely do not understand, it gets reverted back to 0.

I need to give this a good look. For now, it seems like magic.

comment:19 by elexis, 8 years ago

Notice the patch needs to run with hashStart = 97 instead of 124 to trigger the serializationtest-error.

comment:20 by wraitii, 8 years ago

Owner: set to wraitii
Resolution: fixed
Status: newclosed

In 17176:

Fix an oversight in the obstruction manager that lead to possibly the most insidious OOS error we've seen so far.

Fixes #3292 .

by elexis, 8 years ago

Rebased wraitiis patch starting the serializationtest only after turn N.

Note: See TracTickets for help on using tickets.