Opened 4 years ago

Closed 3 years ago

#5653 closed defect (needsinfo)

OOS ResourceGather / Attack rates object swap

Reported by: Silier Owned by:
Priority: Release Blocker Milestone:
Component: UI & Simulation Keywords:
Cc: Patch:

Description (last modified by wraitii)

Current state of the investigation:

What happened:

  • Somehow, cached rates in the ResourceGatherer components of several entities got partly overwritten with data from Attack in GetTemplateDataHelper in elexis' game (possilbly D_D_T).
  • This led to a farmer failing to farm food...
  • ...resulting in OOS between Angen+Stan and elexis.

The JS error appears to be the symptom of the underlying corruption, ultimately resulting in OOS.

Open questions:

  • It seems odd that D_D_T and elexis crashed. This might mean that both experienced the JS corruption, which would hint towards this actually being the result of some code doing something wrong instead of simply a freak error.
  • Angen reports not crashing, which given the 'normal' OOS dump might further imply that his JS was not corrupted and so the crash might have been linked.
  • That being said, if it is memory or stack corruption, a crash makes sense. It would further be OK if D_D_T and elexis didn't quite crash at the same time (memory is fickle).
  • neither elexis nor wraitii could find any JS code that would reasonably lead to this OOS, so it sounds like something broke in the engine/something broke in a completely abnormal way.

Attachments (4)

oos_dump.zip (558.3 KB ) - added by Silier 4 years ago.
2019-12-24_0003.zip (59.8 KB ) - added by Silier 4 years ago.
oos_dump.2.zip (1.2 MB ) - added by Silier 4 years ago.
oos_dump_elexis.7z (691.2 KB ) - added by elexis 4 years ago.

Change History (17)

by Silier, 4 years ago

Attachment: oos_dump.zip added

by Silier, 4 years ago

Attachment: 2019-12-24_0003.zip added

by Silier, 4 years ago

Attachment: oos_dump.2.zip added

comment:1 by elexis, 4 years ago

  • I was the host, we played a 2v2 on a Skirmish map
  • I used git mirror with rP23278, the others svn with rP23281, but they are compatible as far as I can see
  • There was a JS error, I think only I got it
  • Within one or two seconds D_D_T_ crashed and then one or two seconds later there was an OOS for Angen + Stan (every connected client that wasnt the host)
  • Then I took a screenshot thereof and then there was a segfault, again within 1-2 seconds

comment:2 by elexis, 4 years ago

Cc: elexis removed
Milestone: BacklogAlpha 24
Priority: If Time PermitsRelease Blocker
Summary: oos dump from match with rP23281OOS ResourceGather / Attack rates object swap
GAME STARTED, ALL INIT COMPLETE
WARNING: JavaScript warning: simulation/components/ResourceGatherer.js line 136
reference to undefined property this.rates[resourceType]
ERROR: Out-Of-Sync on turn 1517
Players: Angen, StanleySweet
Dumping state to /home/elexis/.config/0ad/logs/oos_dump.txt
TIMER| common/modern/setup.xml: 220.274 us
TIMER| common/modern/styles.xml: 186.297 us
TIMER| common/modern/sprites.xml: 1.23361 ms
TIMER| msgbox/msgbox.xml: 100.635 ms
Screenshot written to '/home/elexis/.local/share/0ad/screenshots/screenshot1047.png'
Segmentation fault (core dumped)

I cannot reproduce the hash I got using replay mode, nor the JS error.

Following OOS differences between my and Angens OOS dump:

     entity: 13809
-    x: 1198.91717
-    z: 498.42405
+    x: 1201.43053
+    z: 494.69137
     clearance: 0.8
-    flags: 21
+    flags: 5
     key: 13809
-    x: 1198.91717
-    z: 498.42405
+    x: 1201.43053
+    z: 494.69137
+    modified entities: 13809
...
-    modified entities: 13809

(different order)

This should be the StatisticsTracker

   "resourcesGathered": {
-    "vegetarianFood": 3099,
-    "food": 3716,
+    "vegetarianFood": 3100,
+    "food": 3717,
     "wood": 6731,
     "stone": 0,
     "metal": 0
   },

But the most interesting diff is this:

   "rates": {
-    "minRange": 0.75,
-    "maxRange": 0.25,
-    "elevationBonus": 1,
-    "elevationAdaptedRange": 0.5,
-    "repeatTime": 0.5,
+    "food.fruit": 0.75,
+    "food.grain": 0.25,
+    "food.meat": 1,
+    "metal.ore": 0.5,
+    "stone.rock": 0.5,
+    "stone.ruins": 2,
+    "treasure": 1,
+    "wood.ruins": 5,
     "wood.tree": 0.9918749999999998
   },
...
   "rates": {
-    "minRange": 0.75,
-    "maxRange": 0.25,
-    "elevationBonus": 1,
-    "elevationAdaptedRange": 0.5,
-    "repeatTime": 0.5,
+    "food.fruit": 0.75,
+    "food.grain": 0.25,
+    "food.meat": 1,
+    "metal.ore": 0.5,
+    "stone.rock": 0.5,
+    "stone.ruins": 2,
+    "treasure": 1,
+    "wood.ruins": 5,
     "wood.tree": 0.9918749999999998
   },

This repeats for apparently many entities.

That is ResourceGatherer with Rates from an Attack component.

So it seems an Object reference from Attack was assigned to ResourceGatherer.

by elexis, 4 years ago

Attachment: oos_dump_elexis.7z added

comment:3 by elexis, 4 years ago

13809 is a woman on Angens field. It has the following diff:

--- oos_dump_elexis.txt	2019-12-24 23:56:58.612796705 +0100
+++ oos_dump_Angen.txt	2019-12-24 23:55:40.842797290 +0100
 - id: 13809
-    x: 1198.91717
-    z: 498.42405
+    x: 1201.43053
+    z: 494.69137
-    moving: true
+    moving: false
-    x: 1198.91717
+    x: 1201.43053
-    z: 498.42405
+    z: 494.69137
-    target type: 1
+    target type: 0
-    target pos x: 1194.40894
-    target pos y: 505.11936
+    target pos x: 0
+    target pos y: 0
-    current speed: 8.99997
+    current speed: 0
-    length: 1
-    waypoint x: 1194.40894
-    waypoint z: 505.11936
+    length: 0
-    anim name: "walk"
+    anim name: "gather_grain"
-    anim speed: 9
-    sound group: ""
+    anim speed: 1
+    sound group: "resource/farming/farm.xml"
-    value: "walk"
+    value: "gather_grain"
-    value: "carry_food"
+    value: ""
-    "food": 8
+    "food": 9
-    "minRange": 1.5,
-    "maxRange": 0.5,
-    "elevationBonus": 1,
-    "elevationAdaptedRange": 0.35,
-    "repeatTime": 0.35,
+    "food.fruit": 1.5,
+    "food.grain": 0.5,
+    "food.meat": 1,
+    "metal.ore": 0.35,
+    "stone.rock": 0.35,
+    "stone.ruins": 2,
+    "treasure": 1,
+    "wood.ruins": 5,
-      "type": "GatherNearPosition",
+      "type": "Gather",
+        "target": 13885,
-        "x": 1194.408935546875,
-        "z": 505.1193542480469,
+        "lastPos": {
+          "x": 1194.408935546875,
+          "y": 22,
+          "z": 505.1193542480469
+        },
+          "y": 22,
-        "relaxed": true
+        "autoharvest": true
-    "type": "GatherNearPosition",
+    "type": "Gather",
+      "target": 13885,
-      "x": 1194.408935546875,
-      "z": 505.1193542480469,
+      "lastPos": {
+        "x": 1194.408935546875,
+        "y": 22,
+        "z": 505.1193542480469
+      },
+        "y": 22,
-      "relaxed": true
+      "autoharvest": true
-  "fsmStateName": "INDIVIDUAL.GATHER.WALKING",
-  "losRangeQuery": 2782
+  "fsmStateName": "INDIVIDUAL.GATHER.GATHERING",
+  "timer": 22359,
+  "losRangeQuery": 2782,
+  "gatheringTarget": 13885

The curious rates difference occurs for 284 entities:

12983, 12984, 12985, 12986, 12987, 12988, 12989, 12990, 12992, 12993, 12994, 12995, 12998, 12999, 13000, 13001, 13003, 13004, 13005, 13006, 13009, 13010, 13011, 13012, 13013, 13014, 13015, 13021, 13022, 13023, 13024, 13025, 13026, 13029, 13030, 13031, 13032, 13033, 13034, 13035, 13036, 13039, 13040, 13041, 13042, 13043, 13044, 13045, 13046, 13049, 13050, 13051, 13052, 13055, 13056, 13059, 13060, 13061, 13062, 13082, 13083, 13085, 13086, 13088, 13089, 13105, 13106, 13109, 13110, 13112, 13113, 13114, 13115, 13116, 13117, 13149, 13150, 13151, 13178, 13179, 13181, 13182, 13183, 13184, 13185, 13188, 13189, 13198, 13199, 13200, 13201, 13202, 13214, 13215, 13229, 13230, 13231, 13232, 13233, 13236, 13237, 13238, 13239, 13240, 13270, 13271, 13272, 13281, 13282, 13283, 13284, 13285, 13299, 13300, 13303, 13304, 13305, 13306, 13310, 13311, 13312, 13313, 13314, 13318, 13319, 13320, 13321, 13322, 13323, 13345, 13346, 13349, 13350, 13351, 13352, 13353, 13354, 13358, 13359, 13362, 13363, 13364, 13365, 13373, 13374, 13375, 13376, 13377, 13378, 13461, 13462, 13463, 13464, 13465, 13473, 13474, 13484, 13485, 13486, 13487, 13496, 13497, 13499, 13500, 13501, 13502, 13503, 13504, 13505, 13506, 13507, 13515, 13516, 13517, 13518, 13535, 13537, 13538, 13539, 13553, 13554, 13555, 13556, 13566, 13567, 13568, 13569, 13590, 13595, 13606, 13610, 13614, 13615, 13626, 13627, 13628, 13629, 13630, 13631, 13632, 13633, 13634, 13635, 13636, 13637, 13644, 13645, 13655, 13656, 13657, 13679, 13680, 13681, 13682, 13683, 13684, 13712, 13713, 13714, 13715, 13716, 13724, 13732, 13748, 13749, 13753, 13757, 13778, 13779, 13780, 13781, 13782, 13783, 13784, 13785, 13786, 13810, 13811, 13812, 13813, 13817, 13818, 13836, 13837, 13838, 13839, 13840, 13841, 13846, 13847, 13848, 13913, 13915, 13943, 13944, 13945, 13946, 13947, 13948, 13949, 13950, 13951, 13952, 13953, 13954, 13969, 13970, 13971, 13972, 14032, 14033, 14034, 14035, 14036, 14037, 14038, 14039, 14040, 14056, 14057, 14058, 14059, 14060, 14061

Those entities don't seem to share a commonality other than being human units:

     26      actor: "units/britons/infantry_javelinist_b.xml"
      2      actor: "units/britons/infantry_slinger_b.xml"
     17      actor: "units/britons/infantry_spearman_b.xml"
      8      actor: "units/carthaginians/female_citizen.xml"
      4      actor: "units/carthaginians/infantry_archer_b.xml"
      4      actor: "units/carthaginians/infantry_spearman_b.xml"
     64      actor: "units/celts/female_citizen.xml"
      4      actor: "units/mauryas/female_citizen.xml"
      2      actor: "units/mauryas/infantry_archer_b.xml"
      2      actor: "units/mauryas/infantry_spearman_b.xml"
      4      actor: "units/persians/female_citizen.xml"
      2      actor: "units/persians/infantry_archer_b.xml"
      2      actor: "units/persians/infantry_spearman_b.xml"
     60      actor: "units/ptolemies/female_citizen.xml"
     23      actor: "units/ptolemies/infantry_javelinist_b.xml"
     17      actor: "units/ptolemies/infantry_pikeman_b.xml"
     20      actor: "units/romans/female_citizen.xml"
     21      actor: "units/romans/infantry_javelinist_b.xml"
      2      actor: "units/romans/infantry_swordsman_b.xml"

For the rates diff, we observe that the only the object keys differ while the object values are the same! The values differ per entity but the value per entity is the same for Angen and me despite having different object names.

Also we notice that in my state with the Attack property names, there still is one gather rate name (so its not swapped object references but the properties being written incorrectly individually).

I could not identify any code that could be able to do anything of that.

It sounds like the GUI changing the simulation state, which could only be done via GUIInterface.

As replaymode gives me the hash that the other players had, it's not a platform specific pathfinder bug.

comment:4 by wraitii, 4 years ago

Owner: set to wraitii

comment:5 by wraitii, 4 years ago

This is probably the weirdest OOS report I've seen...

I'll add that the order of items comes from globalscripts/templates.js, there seems to be the only place where we define

minRange
maxRange
elevationBonus
elevationAdaptedRange
repeatTime

in that order and with those names. The fact that wood.tree survives is only more puzzling.

We can probably exclude corruption during the dump itself given the JS error.

Based on id: 13809 diff, it seems the entity on elexis's PC failed to PerformGather, which switched it into GatherNearPosition (most likely GetTargetGatherRate failed, which means CmpResourceGatherer.PerformGather immediately returned exhausted). A contrario, the unit kept gathering on Angen's PC. Such a scenario satisfactorily explains the JS error, the difference in food gathered and the entity state and position delta, so I reckon it's what happened.

Then if we take away the deltas from entity 13809, only the super weird gathering rates remain.

So the most likely scenario:

  • Somehow elexis (possibly DDT) experienced an issue. This resulted in cached gather rates being incorrect for elexis.
  • Thus elexis's game diverged from Angen/Stan a few turns later.

Now the standing question is _what_ corrupted elexis' game. Gui Interface calls are done behind structured clones, I can't find any obvious bug on bugzilla that would relate to this.

Given that affected units are owned by multiple owners, it's unlikely that they were all selected or something easy like that.

A possible direction is deepfreezing. This warns about stack overflow: https://bugzilla.mozilla.org/show_bug.cgi?id=644349 . Stack overflow sounds vaguely like what happened: random-looking corruption.

i'm saying that because GetTemplateData (from whole the attack object names come) deep freezes things. There has been one recent change at the time of the report https://code.wildfiregames.com/D2054 .

I have a weird hunch this is related to #5655, and perhaps both actually stem from the same issue when deep-freezing. More investigation required, but I'm submitting this comment as-is.

comment:6 by wraitii, 4 years ago

Description: modified (diff)

comment:7 by wraitii, 4 years ago

I utterly failed to crash this morning when dee-freezing large objects. Trying an MP game with Stan did nothing either.

Uneval-ing the deepfreeze calls do show that we're using it a fair bit, at moments reported to crash sometimes in #5655. That's suspicious, but not nearly enough to confirm that that is the issue. We're dee-freezing a bunch of objects, including lambdas (that might be dangerous, or might not).

https://bugzilla.mozilla.org/show_bug.cgi?id=1440748 suggest structured clones in general might be weird if my hunch that this crash comes from stack overflow is correct.

Again, JS corruption of memory/stack seems like it could be related to #5655 and https://bugzilla.mozilla.org/show_bug.cgi?id=1324002 and https://bugzilla.mozilla.org/show_bug.cgi?id=1337578

I don't really see why this would have started happening in December though.

comment:8 by wraitii, 4 years ago

Looking at his JS objects are stored in Spidermonkey, it seems the object layout (called a shape) is stored separately from the actual values. They appear to be stored in either N-ary trees (that can be partly shared) and/or in doubly-liked lists. That's what current code comments say anyways.

This makes it plausible that the OOS we see is indeed some shape corruption, e.g. pointers pointing somewhere they shouldn't. That doesn't really move the conversation forward much without a somewhat stable reproduction, however.

I'll try and reproduce some spidermonkey crashes further, but at this point I don't think I can do much more.

comment:9 by wraitii, 4 years ago

Owner: wraitii removed

I've tried other things in tests, such as cloning lambdas -> doesn't work.

I must say I can't figure this one out, and I can't seem to reproduce easily it or #5655 (which seems related. I would assume the crash callstacks would have given us the proof).

I'll get back to this when SM is upgraded I suppose.

comment:10 by Stan, 3 years ago

Might be related to #3375

comment:11 by wraitii, 3 years ago

Nah, this is much weirder. I think I'm right above when I write that this is because something changed the shape pointer of the JS object. Whether that be a stack overflow, a bad GC, or something else is impossible to determine.

If we can't reproduce it when testing A24, I think we should be able to close this.

comment:12 by wraitii, 3 years ago

Milestone: Alpha 24Alpha 25

Pushing to A25. I doubt we'll run into this again, so we may close it then.

comment:13 by wraitii, 3 years ago

Milestone: Alpha 25
Resolution: needsinfo
Status: newclosed

Taking this out of the queue. We've not heard of a similar issue since, and A24 so very few (none at all?) OOS reports.

Note: See TracTickets for help on using tickets.