Opened 9 years ago
Closed 8 years ago
#3292 closed defect (fixed)
OOS on rejoin with new pathfinder
Reported by: | elexis | Owned by: | wraitii |
---|---|---|---|
Priority: | Release Blocker | Milestone: | Alpha 19 |
Component: | Core engine | Keywords: | pathfinding |
Cc: | Patch: |
Description (last modified by )
We played a 2v3 on r16753, i.e. with the new pathfinder committed in r16751.
ffm rejoined early and was completely oos (i.e. he saw a different amount of units in my town, and I saw a different amount of units in his town).
Units seem to get stuck and sometimes stop walking for no apparent reason (while being in the walking animation and state). Mabye not related to this bug, since:
The diff only contains one entity that has different values. It is a woman with different waypoints and position. It is the second woman I produced at my CC (second 25). When doing the visual replay, the oos error appears turn 980 (8min 10sec).
--- oos_dump_elexis.txt 2015-06-13 01:23:48.092697000 +0200 +++ oos_dump_ffm.txt 2015-06-13 01:23:47.000000000 +0200 Somewhere: entity: 5790 - x: 443.95232 + x: 443.95234 z: 1028.04763 r: 0.8 clearance: 1 flags: 21 group: 5790 Somewhere below that: - id: 5790 Decay: Footprint: Minimap: r: 230 g: 230 b: 75 active: true - x: 443.95232 + x: 443.95234 z: 1028.04763 Obstruction: active: true moving: true control group: 5790 control group 2: 0 tag: 8652 flags: 5 OverlayRenderer: Ownership: owner: 4 Position: in world: true - x: 443.95232 + x: 443.95234 y: 0 z: 1028.04763 last x: 445.1823 last y diff: 0 last z: 1023.95841 territory: 4 rot x: 0 rot y: -0.36324 rot z: 0 altitude: 0 relative: true floating: false constructionprogress: 0 anchor: "upright" turret parent: 0 Selectable: UnitMotion: radius: 0.8 state: 3 path state: 3 pass class: "default" ticket: 0 target entity: 0 target pos x: 0 target pos y: 0 target offset x: 0 target offset y: 0 target min range: 0 target max range: 0 speed: 9.5 moving: true facePointAfterMove: true - length: 17 + length: 18 waypoint x: 344.5 waypoint z: 1110.5 waypoint x: 353 waypoint z: 1110.5 waypoint x: 360.75 waypoint z: 1107 waypoint x: 366.125 waypoint z: 1103.75 waypoint x: 371.3125 waypoint z: 1099.625 waypoint x: 377.40625 waypoint z: 1094.0625 - waypoint x: 384.95313 - waypoint z: 1086.78125 - waypoint x: 390.22657 - waypoint z: 1081.64063 - waypoint x: 394.86329 - waypoint z: 1077.07032 - waypoint x: 399.18165 - waypoint z: 1072.78516 - waypoint x: 403.84083 - waypoint z: 1068.14258 - waypoint x: 409.67042 - waypoint z: 1062.3213 - waypoint x: 416.08521 - waypoint z: 1055.91065 - waypoint x: 421.79261 - waypoint z: 1050.20533 - waypoint x: 427.64631 - waypoint z: 1044.35267 - waypoint x: 434.07316 - waypoint z: 1037.92634 - waypoint x: 439.78658 - waypoint z: 1032.21317 + waypoint x: 382.95313 + waypoint z: 1088.78125 + waypoint x: 387.72657 + waypoint z: 1084.14063 + waypoint x: 393.11329 + waypoint z: 1078.82<032 + waypoint x: 398.30665 + waypoint z: 1073.66016 + waypoint x: 403.40333 + waypoint z: 1068.58008 + waypoint x: 409.45167 + waypoint z: 1062.54004 + waypoint x: 415.97584 + waypoint z: 1056.02002 + waypoint x: 420.23792 + waypoint z: 1051.76001 + waypoint x: 426.86896 + waypoint z: 1045.13001 + waypoint x: 431.18448 + waypoint z: 1040.81501 + waypoint x: 435.84224 + waypoint z: 1036.15751 + waypoint x: 440.67112 + waypoint z: 1031.32876 length: 0 length: 0 expected path ticket: 0 next step clean: true type: 0 goal x: 342 goal z: 1110 goal u x: 0 goal u z: 0 goal v x: 0 goal v z: 0 goal hw: 0 goal hh: 0 maxdist: 0 Vision: VisualActor: base actor: "units/celts/female_citizen.xml" r: 1 g: 1 b: 1
Attachments (12)
Change History (32)
by , 9 years ago
Attachment: | t3292_commands.txt_oos_dump.7z added |
---|
by , 9 years ago
Attachment: | commands_fun.txt added |
---|
A replayfile that contains the playernames as well, if you want to replay it visually (ignore the oos error message on turn 0). If you replay the commands_elexis.txt
file visually, then the oos error will apear at turn 980 (i.e. 8 minutes 10 seconds)!!
comment:1 by , 9 years ago
Description: | modified (diff) |
---|
comment:2 by , 9 years ago
Owner: | set to |
---|
Thanks for all this data. I got this problem when running serialization tests, I'm glad it only creates rejoin OOSes.
I'm prioritizing my grid update change, which will make serialization tests less clunky.
by , 9 years ago
Attachment: | r16853_commands.txt_ooslog.7z added |
---|
Another oos on rejoin (hapoofesgeli rejoined). Many values differ. The visibilities are probably #3107.
comment:3 by , 9 years ago
Ah thanks, now I will be able to take a look on the latest revision (I think I'm done changing a lot of things so it's more than time to fix those OOSes)
comment:7 by , 9 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Reopening (with Itms approval) because not all causes of the bug have been fixed.
Reproduce:
- Host a match with
- deathmatch resources
- tiny map (faster loading)
- much wood
- cheats enabled
- and join with a second instance.
- Enter "the hive master" and "i am too busy" cheat
- Select the CC, set the gathering point on wood and produce units
- Let them gather for a while
- Rejoin with the other instance
#3339 can be used to write the oosdumps to different file locations. Serializationtestmode is too slow to produce results without falling asleep unfortunately.
by , 9 years ago
This oos occured after rejoin after 19 seconds. About 30 units on the map. The only diff is a single woman having different waypoints and a different position. Random map. Screenshot of the accused woman: https://i.imgur.com/EXI94bu.jpg
comment:8 by , 9 years ago
Here another oos dump including ooslogs for every turn of both host and rejoined client (zipped 5MB, all files unzipped 5GB).
http://www.megafileupload.com/4Qqv/oos_rejoin_pathfinder_r16877_with_ooslogs____4GB___.7z
Differences are: positions and goals, tickets in the pathfinder section.
by , 9 years ago
Here another oos on rejoin. Notice differences in the subdiv-items
section, similar to one of the originally uploaded ooslogs. However this dump seems to be bogus as the rng
value in the first line already differs. It changes every turn. So probably the simulation of one client has already progressed one turn when the CSyncErrorMessage
arrived.
by , 9 years ago
Attachment: | stuck_cav_r16884.7z added |
---|
Here a minimal commands.txt on a skirmish map where the initial cavalry skirmisher got stuck in the woods. Only walk commands were issued. I did a serializationtest afterwards (results included), however no problem was discovered. In more tests I rejoined while & after a unit was stuck, but that didn't cause an out of sync either. Therefore stuck units shouldn't cause this oos.
comment:9 by , 9 years ago
I get no serialization errors with all these replays. The only problem I had was caused by #3164 and with my patch for it, everything works fine.
Let's leave this ticket as it is in case someone gets this OOS again, but there is little I can do now.
by , 9 years ago
Attachment: | oosdump_r16988.7z added |
---|
No news here. OOS on rejoin, but serializationtest succeeds. Recipe is still the same, send many units to work and rejoin. In this dump, a woman (entity id 7086) has a slightly different position and different waypoints.
comment:10 by , 9 years ago
Owner: | removed |
---|---|
Status: | reopened → new |
I won't have time for programming anymore during the upcoming months.
comment:11 by , 9 years ago
I checked all the changes since the end of May, and it looks like no floating point numbers have been added to the serialized simulation (so this is likely not an issue with floating point accuracy). Checking that we have not started using any of the standard library which might behave differently across platforms would be a good next step.
comment:12 by , 9 years ago
I'm suspecting bugs in the logic. This ticket might be actually a duplicate of #3538, since that one also produces different waypoints. (There may be more ways to get different waypoints though.)
comment:14 by , 8 years ago
This issue is still prevalent with r17168. Contrary to the a18 OOSes (#3107 #3108 #3151) which rarely notably changed the simstate, this one is a game-over on rejoin and should stay a release blocker.
The good news is that most other OOS are fixed now, which prohibited debugging this one (especially #3450). (Still surprised by wraitii plottwisting the pathfinder :))
Last time I checked (2 months ago), there was an OOS on rejoin, but the serializationtest never failed. This is to be verified. Ideally we create a minimum commands.txt where the serializationtest fails, from which we can derive the bug.
by , 8 years ago
Attachment: | oosdumps_r17171.7z added |
---|
Again, OOS on rejoin (turn 139), but serializationtest shows no errors
by , 8 years ago
Attachment: | oos_tree_chopper.mp4 added |
---|
Video of turns 85-145. The turn number is displayed at the left. The affected unit is selected.
comment:15 by , 8 years ago
- The replay contains very few commands: create some units and let them chop wood.
- One archer (entity id 7097) has a different position for the rejoined client.
- Type
g_Selection.addList([7097]);
in the JS console to select him. - There is a diff in the LOS state and waypoints, but those should just be consequences of the different position.
- Notice the archer receives a female inspiration aura bonus. I think if that would be buggy, we would have seen way more frequent OOS on a18.
- The rejoined client starts hashing on turn 124, but all hash values match until turn 139.
- Turn 138: In the replay (which matches the host), the unit has 10 wood (finished gathering). Stands at the tree without formation for one turn and then starts to walk to the dropsite.
- As the hash values match before turn 139, the position of the gathering archer for those turns must be identical too.
- I assume the rejoined client decides to dropoff one turn earlier.
Why 'the serializationtest doesn't catch it:
- The serializationtest simulates only one turn of a rejoined client. But the OOS occurs multiple turns after the rejoin. Thus it is logical that serializationtest doesn't catch it.
- We could mofiy the serializationtest to initialize the secondary simulation on turn 124 and then continue the simulation after that, instead of reseting it. Should trigger the OOS error.
comment:17 by , 8 years ago
I hacked simulation2.cpp to initialize the serialization on turn 124 and run from there (at least I think, someone ought to check my code). Can't reproduce.
To clarify elexis, this actually happened over MP? It's not two instances on your computer?
comment:18 by , 8 years ago
My findings so far:
The actual desynch happens on turn 112, because of a poplar tree that gets destroyed. This happens because the dirtiness isn't destroyed the same. Somehow, m_MaxClearance in the obstructionmanager is 0 and not the expected 12.
M_maxClearance is defined on turn 97, when the second simulation is created (to simulate the player join). It is at first correctly assigned a value of 12. Then by some miracle I absolutely do not understand, it gets reverted back to 0.
I need to give this a good look. For now, it seems like magic.
comment:19 by , 8 years ago
Notice the patch needs to run with hashStart = 97
instead of 124 to trigger the serializationtest-error.
by , 8 years ago
Attachment: | serializationChange_v2.patch added |
---|
Rebased wraitiis patch starting the serializationtest only after turn N.
For now I only received one commands.txt of the rejoined player (ffm). Not sure if its the first or second one. Notice that the hash is wrong in turn 0 already according to the commands.txt file of ffm. We made sure that we are both on the same revision. We also did another test before that that was not oos.