Opened 8 years ago

Closed 7 years ago

Last modified 7 years ago

#4277 closed defect (fixed)

[PATCH] OOS on rejoin - ptolemian lighthouse causing different dirty visibilities in the rangemanager

Reported by: elexis Owned by: elexis
Priority: Release Blocker Milestone: Alpha 21
Component: UI & Simulation Keywords: patch
Cc: Patch:

Description

Trying some rejoining with r18829 on a skirmish map yielded this OOS where two entities have different dirty visibilities. The error is reproducible using the rejointest tool #4242.

--- debug.after.a.txt	2016-10-11 15:30:53.738286090 +0200
+++ debug.after.b.txt	2016-10-11 15:30:53.746285963 +0200
@@ -32611,11 +32611,15 @@
     length: 1024
     #: 363
     dirty visibility: 0
     #: 1
     dirty visibility: 1
-    #: 151
+    #: 119
+    dirty visibility: 0
+    #: 1
+    dirty visibility: 1
+    #: 31
     dirty visibility: 0
     #: 2
     dirty visibility: 1
     #: 31
     dirty visibility: 0

Attachments (8)

OOS_dirty_visibilities.7z (182.0 KB ) - added by elexis 8 years ago.
lighthouse_rejointest_r18832.7z (152.8 KB ) - added by elexis 8 years ago.
Minimal skirmish map reproducing wraitiis OOS with the rejointest.
lighthouse_fix.patch (1.8 KB ) - added by wraitii 8 years ago.
fix for the lighthouse OOS
reveal-shore-fix.patch (5.2 KB ) - added by Itms 8 years ago.
fix_through_serialization.patch (1.7 KB ) - added by wraitii 8 years ago.
reveal-shore-notfix_v2.patch (8.0 KB ) - added by elexis 8 years ago.
Prior patch by Itms with his new suggestion to also serialize m_MapSize in the pathfinder. Doesn't work as there is still the OOS after deleting the lighthouse when someone rejoined after it was built.
reveal-shore-fixv0.patch (8.0 KB ) - added by wraitii 7 years ago.
reveal-shore.patch (8.6 KB ) - added by Itms 7 years ago.

Download all attachments as: .zip

Change History (22)

by elexis, 8 years ago

Attachment: OOS_dirty_visibilities.7z added

comment:1 by elexis, 8 years ago

  • The diff is in m_DirtyVisibility of CCmpRangeManager.cpp
  • In r18121 repetitive vectors were introduced in case someone is surprised about that format
  • The diff tells us that exactly one tile is dirty to the host, but not to the rejoined client

comment:2 by elexis, 8 years ago

Summary: OOS on rejoin - different dirty visibilitiesOOS on rejoin - ptolemian lighthouse causing different dirty visibilities in the rangemanager

comment:3 by wraitii, 8 years ago

Not sure if it is related or not but you can very easily trigger an OOS by building the lighthouse and destroying it. It will OOS on the turn it gets destroyed. It seems capturing triggers the same OOS so I'm assuming any ownership change actually fails.

Might be related to this OOS, which is caused by the LOSPlayerCount for player 1 being completely fubar.

comment:4 by elexis, 8 years ago

Correct, build lighthouse, rejoin, delete it or user "wololo" cheat to change ownership -> OOS

by elexis, 8 years ago

Minimal skirmish map reproducing wraitiis OOS with the rejointest.

comment:5 by elexis, 8 years ago

Notice in this diff there are no dirty visibilities differing, but "visbilities" and "los state" of the range manager.

@@ -17656,7 +17656,7 @@
     x: 611.55506
     z: 249.45731
     vision: 0
-    visibilities: 9
+    visibilities: 10
     size: 2
     retain in fog: 1
     owner: 0
@@ -27552,7 +27552,7 @@
     #: 39
     los state: 0
     #: 89
-    los state: 13
+    los state: 15
     #: 97
     los state: 0
     #: 34

by wraitii, 8 years ago

Attachment: lighthouse_fix.patch added

fix for the lighthouse OOS

comment:6 by wraitii, 8 years ago

Attached patch fixes the lighthouse-specific OOS, but this is non-trivial.

The core issue is that the range manager does not serialize player LOS count (m_PlayerLosCount). This is because in theory it can be recomputed from the range manager state. However the lighthouse breaks this: the range manager does not know that it needs to reveal the shore (in practice, it does because m_losstate is serialized, but this will OOS once the lighthouse goes down).

The thing to note here is that any entity with special LOS behavior will trigger an OOS ultimately.

We have various ways of fixing this: -serialize m_PlayerLosCount. Would work for any new entities we'd add down the line, but may seriously bump the size of the serialization. -cache in the entity state of the range manager that this one needs to call reveal shore. Would work, but any new "special" entity would need another boolean and that just seems like ti would use too much memory to be worth the while. -Call a function in CmpVision when deserializing to call the reveal shore. This could work for different entities.

There is another problem: the reveal shore call relies on the pathfinder grid being initialized. But that gets initialized at the beginning of UpdateComponents, i.e. after deserialization and even after the deserialized message. To fix this, I make CmpVision force a grid update on MT_Deserialize, but this is rather inelegant. Furthermore, it breaks the serialization test because the "before" state becomes different. This is why the above patch comments that out. Note that the "After" state is similar.

comment:7 by Itms, 8 years ago

Keywords: patch rfc added
Summary: OOS on rejoin - ptolemian lighthouse causing different dirty visibilities in the rangemanager[PATCH] OOS on rejoin - ptolemian lighthouse causing different dirty visibilities in the rangemanager

This is not guaranteed to fix all sources of OOS but it is the proper version of wraitii's hack. I should never have put the revealing shore instructions inside the vision component, that belongs to the range manager.

With respect to wraitii's comment about "any new "special" entity would need another boolean": indeed that is not very streamlined. However, now all the relevant code is where it should be, so it would be really easy to generalize this system into a "special vision" system that associates a vision type with a LOS update function (here RevealShore) that is called upon ownership change and during the resetting of derived data.

by Itms, 8 years ago

Attachment: reveal-shore-fix.patch added

comment:8 by elexis, 8 years ago

wraitii's patch works, Itms one crashes on rejoin if and only if a lighthouse is built:

CCmpRangeManager.cpp(1910): Assertion failed: "shoreGrid.m_W == m_TerrainVerticesPerSide-1 && shoreGrid.m_H == m_TerrainVerticesPerSide-1"
Assertion failed: "shoreGrid.m_W == m_TerrainVerticesPerSide-1 && shoreGrid.m_H == m_TerrainVerticesPerSide-1"
Location: CCmpRangeManager.cpp:1910 (RevealShore)

by wraitii, 8 years ago

comment:9 by wraitii, 8 years ago

Itms's patches is my second solution above. Indeed it should work, but the pathfinder grid being unitinialized on deserialization makes it crash. To fix that we could serialize the pathfinder grid (no easy way to be sure we can recompute it before then).

Another solution to this OOS is to serialize player counts, which my patch above does. With the repetitive vectors, it shouldn't be a huge increase in memory footprint.

comment:10 by Itms, 8 years ago

I do not think serializing the entire player LOS count is sane.

In order to fix my patch, we should make sure that the pathfinder data that allows one to compute the shore grid is properly serialized. At first sight one should serialize m_MapSize in the pathfinder, and that should be enough.

by elexis, 8 years ago

Prior patch by Itms with his new suggestion to also serialize m_MapSize in the pathfinder. Doesn't work as there is still the OOS after deleting the lighthouse when someone rejoined after it was built.

by wraitii, 7 years ago

Attachment: reveal-shore-fixv0.patch added

comment:11 by wraitii, 7 years ago

Should work. The problem with the above patches is that the water manager gets deserialized last, so when the pathfinder calls it it gets garbage. This fixes it by doing the "reset derived data" on the deserialized message which is guaranteed to come after those.

BTW, why is the shore grid computed in the pathfinder, and why is it not cached?

comment:12 by Itms, 7 years ago

Keywords: review added; rfc removed

After some research, it looks like it would be the correct solution indeed. I modified the comment so it is clear that the ResetDerivedData needs to be done at that point, I removed your debug code and I changed the pathfinder component so it uses the SerializeCommon pattern.

The shore grid is strongly coupled to the pathfinding code (except for that lighthouse) so it made sense for it to be part of the pathfinder. I agree it could in principle be part of the water manager, but this component should then ask for some data from the pathfinder. Wrt caching, it's not really useful as the shore grid is not computed often, and note that caching would not have prevented the OOS here.

The attached patch should be ready for commit. I checked that serialization tests passed, rejoining could be tested too.

by Itms, 7 years ago

Attachment: reveal-shore.patch added

comment:13 by elexis, 7 years ago

Owner: set to elexis
Resolution: fixed
Status: newclosed

In 18879:

Fix an OOS on rejoin when a ptolemian lighthouse revealing the shoreline was built prior. Patch by Itms and wraitii, fixes #4277.

Serialize the mapsize in the pathfinder and the reveal shoreline flag in the range manager.
Reload the rangemanager data after other components have been deserialized.
Use the SerializeCommon pattern in the pathfinder to avoid code duplication.
Move the shoreline logic from the Vision component to the range manager.
Remove unused interface mocks from the rangemanager test following r16022.

comment:14 by elexis, 7 years ago

Keywords: review removed

Thanks for fixing this OOS!

The patch was rereviewed by wraitii and passed all my rejoin tests.

The original commands.txt from the report still triggers a failure with r18878 but not r18879. Notice on that report there was no ownership change or destruction of that lighthouse. Something else must have caused a recomputation of that one particular tile.

fatherbushido has also mentioned concern with #4290, that players might own multiple lighthouses. The code looks like it would remove the shoreline visibility after capturing (even if the player still has another lighthouse), but according to some testing this isn't an issue.

Note: See TracTickets for help on using tickets.