Opened 4 years ago
Closed 3 years ago
#5653 closed defect (needsinfo)
OOS ResourceGather / Attack rates object swap
Reported by: | Silier | Owned by: | |
---|---|---|---|
Priority: | Release Blocker | Milestone: | |
Component: | UI & Simulation | Keywords: | |
Cc: | Patch: |
Description (last modified by )
Current state of the investigation:
What happened:
- Somehow, cached rates in the ResourceGatherer components of several entities got partly overwritten with data from Attack in GetTemplateDataHelper in elexis' game (possilbly D_D_T).
- This led to a farmer failing to farm food...
- ...resulting in OOS between Angen+Stan and elexis.
The JS error appears to be the symptom of the underlying corruption, ultimately resulting in OOS.
Open questions:
- It seems odd that D_D_T and elexis crashed. This might mean that both experienced the JS corruption, which would hint towards this actually being the result of some code doing something wrong instead of simply a freak error.
- Angen reports not crashing, which given the 'normal' OOS dump might further imply that his JS was not corrupted and so the crash might have been linked.
- That being said, if it is memory or stack corruption, a crash makes sense. It would further be OK if D_D_T and elexis didn't quite crash at the same time (memory is fickle).
- neither elexis nor wraitii could find any JS code that would reasonably lead to this OOS, so it sounds like something broke in the engine/something broke in a completely abnormal way.
Attachments (4)
Change History (17)
by , 4 years ago
Attachment: | oos_dump.zip added |
---|
by , 4 years ago
Attachment: | 2019-12-24_0003.zip added |
---|
by , 4 years ago
Attachment: | oos_dump.2.zip added |
---|
comment:1 by , 4 years ago
- I was the host, we played a 2v2 on a Skirmish map
- I used git mirror with rP23278, the others svn with rP23281, but they are compatible as far as I can see
- There was a JS error, I think only I got it
- Within one or two seconds D_D_T_ crashed and then one or two seconds later there was an OOS for Angen + Stan (every connected client that wasnt the host)
- Then I took a screenshot thereof and then there was a segfault, again within 1-2 seconds
comment:2 by , 4 years ago
Cc: | removed |
---|---|
Milestone: | Backlog → Alpha 24 |
Priority: | If Time Permits → Release Blocker |
Summary: | oos dump from match with rP23281 → OOS ResourceGather / Attack rates object swap |
GAME STARTED, ALL INIT COMPLETE WARNING: JavaScript warning: simulation/components/ResourceGatherer.js line 136 reference to undefined property this.rates[resourceType] ERROR: Out-Of-Sync on turn 1517 Players: Angen, StanleySweet Dumping state to /home/elexis/.config/0ad/logs/oos_dump.txt TIMER| common/modern/setup.xml: 220.274 us TIMER| common/modern/styles.xml: 186.297 us TIMER| common/modern/sprites.xml: 1.23361 ms TIMER| msgbox/msgbox.xml: 100.635 ms Screenshot written to '/home/elexis/.local/share/0ad/screenshots/screenshot1047.png' Segmentation fault (core dumped)
I cannot reproduce the hash I got using replay mode, nor the JS error.
Following OOS differences between my and Angens OOS dump:
entity: 13809 - x: 1198.91717 - z: 498.42405 + x: 1201.43053 + z: 494.69137 clearance: 0.8 - flags: 21 + flags: 5
key: 13809 - x: 1198.91717 - z: 498.42405 + x: 1201.43053 + z: 494.69137
+ modified entities: 13809 ... - modified entities: 13809
(different order)
This should be the StatisticsTracker
"resourcesGathered": { - "vegetarianFood": 3099, - "food": 3716, + "vegetarianFood": 3100, + "food": 3717, "wood": 6731, "stone": 0, "metal": 0 },
But the most interesting diff is this:
"rates": { - "minRange": 0.75, - "maxRange": 0.25, - "elevationBonus": 1, - "elevationAdaptedRange": 0.5, - "repeatTime": 0.5, + "food.fruit": 0.75, + "food.grain": 0.25, + "food.meat": 1, + "metal.ore": 0.5, + "stone.rock": 0.5, + "stone.ruins": 2, + "treasure": 1, + "wood.ruins": 5, "wood.tree": 0.9918749999999998 }, ... "rates": { - "minRange": 0.75, - "maxRange": 0.25, - "elevationBonus": 1, - "elevationAdaptedRange": 0.5, - "repeatTime": 0.5, + "food.fruit": 0.75, + "food.grain": 0.25, + "food.meat": 1, + "metal.ore": 0.5, + "stone.rock": 0.5, + "stone.ruins": 2, + "treasure": 1, + "wood.ruins": 5, "wood.tree": 0.9918749999999998 },
This repeats for apparently many entities.
That is ResourceGatherer with Rates from an Attack component.
So it seems an Object reference from Attack was assigned to ResourceGatherer.
by , 4 years ago
Attachment: | oos_dump_elexis.7z added |
---|
comment:3 by , 4 years ago
13809 is a woman on Angens field. It has the following diff:
--- oos_dump_elexis.txt 2019-12-24 23:56:58.612796705 +0100 +++ oos_dump_Angen.txt 2019-12-24 23:55:40.842797290 +0100 - id: 13809 - x: 1198.91717 - z: 498.42405 + x: 1201.43053 + z: 494.69137 - moving: true + moving: false - x: 1198.91717 + x: 1201.43053 - z: 498.42405 + z: 494.69137 - target type: 1 + target type: 0 - target pos x: 1194.40894 - target pos y: 505.11936 + target pos x: 0 + target pos y: 0 - current speed: 8.99997 + current speed: 0 - length: 1 - waypoint x: 1194.40894 - waypoint z: 505.11936 + length: 0 - anim name: "walk" + anim name: "gather_grain" - anim speed: 9 - sound group: "" + anim speed: 1 + sound group: "resource/farming/farm.xml" - value: "walk" + value: "gather_grain" - value: "carry_food" + value: "" - "food": 8 + "food": 9 - "minRange": 1.5, - "maxRange": 0.5, - "elevationBonus": 1, - "elevationAdaptedRange": 0.35, - "repeatTime": 0.35, + "food.fruit": 1.5, + "food.grain": 0.5, + "food.meat": 1, + "metal.ore": 0.35, + "stone.rock": 0.35, + "stone.ruins": 2, + "treasure": 1, + "wood.ruins": 5, - "type": "GatherNearPosition", + "type": "Gather", + "target": 13885, - "x": 1194.408935546875, - "z": 505.1193542480469, + "lastPos": { + "x": 1194.408935546875, + "y": 22, + "z": 505.1193542480469 + }, + "y": 22, - "relaxed": true + "autoharvest": true - "type": "GatherNearPosition", + "type": "Gather", + "target": 13885, - "x": 1194.408935546875, - "z": 505.1193542480469, + "lastPos": { + "x": 1194.408935546875, + "y": 22, + "z": 505.1193542480469 + }, + "y": 22, - "relaxed": true + "autoharvest": true - "fsmStateName": "INDIVIDUAL.GATHER.WALKING", - "losRangeQuery": 2782 + "fsmStateName": "INDIVIDUAL.GATHER.GATHERING", + "timer": 22359, + "losRangeQuery": 2782, + "gatheringTarget": 13885
The curious rates
difference occurs for 284 entities:
12983, 12984, 12985, 12986, 12987, 12988, 12989, 12990, 12992, 12993, 12994, 12995, 12998, 12999, 13000, 13001, 13003, 13004, 13005, 13006, 13009, 13010, 13011, 13012, 13013, 13014, 13015, 13021, 13022, 13023, 13024, 13025, 13026, 13029, 13030, 13031, 13032, 13033, 13034, 13035, 13036, 13039, 13040, 13041, 13042, 13043, 13044, 13045, 13046, 13049, 13050, 13051, 13052, 13055, 13056, 13059, 13060, 13061, 13062, 13082, 13083, 13085, 13086, 13088, 13089, 13105, 13106, 13109, 13110, 13112, 13113, 13114, 13115, 13116, 13117, 13149, 13150, 13151, 13178, 13179, 13181, 13182, 13183, 13184, 13185, 13188, 13189, 13198, 13199, 13200, 13201, 13202, 13214, 13215, 13229, 13230, 13231, 13232, 13233, 13236, 13237, 13238, 13239, 13240, 13270, 13271, 13272, 13281, 13282, 13283, 13284, 13285, 13299, 13300, 13303, 13304, 13305, 13306, 13310, 13311, 13312, 13313, 13314, 13318, 13319, 13320, 13321, 13322, 13323, 13345, 13346, 13349, 13350, 13351, 13352, 13353, 13354, 13358, 13359, 13362, 13363, 13364, 13365, 13373, 13374, 13375, 13376, 13377, 13378, 13461, 13462, 13463, 13464, 13465, 13473, 13474, 13484, 13485, 13486, 13487, 13496, 13497, 13499, 13500, 13501, 13502, 13503, 13504, 13505, 13506, 13507, 13515, 13516, 13517, 13518, 13535, 13537, 13538, 13539, 13553, 13554, 13555, 13556, 13566, 13567, 13568, 13569, 13590, 13595, 13606, 13610, 13614, 13615, 13626, 13627, 13628, 13629, 13630, 13631, 13632, 13633, 13634, 13635, 13636, 13637, 13644, 13645, 13655, 13656, 13657, 13679, 13680, 13681, 13682, 13683, 13684, 13712, 13713, 13714, 13715, 13716, 13724, 13732, 13748, 13749, 13753, 13757, 13778, 13779, 13780, 13781, 13782, 13783, 13784, 13785, 13786, 13810, 13811, 13812, 13813, 13817, 13818, 13836, 13837, 13838, 13839, 13840, 13841, 13846, 13847, 13848, 13913, 13915, 13943, 13944, 13945, 13946, 13947, 13948, 13949, 13950, 13951, 13952, 13953, 13954, 13969, 13970, 13971, 13972, 14032, 14033, 14034, 14035, 14036, 14037, 14038, 14039, 14040, 14056, 14057, 14058, 14059, 14060, 14061
Those entities don't seem to share a commonality other than being human units:
26 actor: "units/britons/infantry_javelinist_b.xml" 2 actor: "units/britons/infantry_slinger_b.xml" 17 actor: "units/britons/infantry_spearman_b.xml" 8 actor: "units/carthaginians/female_citizen.xml" 4 actor: "units/carthaginians/infantry_archer_b.xml" 4 actor: "units/carthaginians/infantry_spearman_b.xml" 64 actor: "units/celts/female_citizen.xml" 4 actor: "units/mauryas/female_citizen.xml" 2 actor: "units/mauryas/infantry_archer_b.xml" 2 actor: "units/mauryas/infantry_spearman_b.xml" 4 actor: "units/persians/female_citizen.xml" 2 actor: "units/persians/infantry_archer_b.xml" 2 actor: "units/persians/infantry_spearman_b.xml" 60 actor: "units/ptolemies/female_citizen.xml" 23 actor: "units/ptolemies/infantry_javelinist_b.xml" 17 actor: "units/ptolemies/infantry_pikeman_b.xml" 20 actor: "units/romans/female_citizen.xml" 21 actor: "units/romans/infantry_javelinist_b.xml" 2 actor: "units/romans/infantry_swordsman_b.xml"
For the rates
diff, we observe that the only the object keys differ while the object values are the same!
The values differ per entity but the value per entity is the same for Angen and me despite having different object names.
Also we notice that in my state with the Attack property names, there still is one gather rate name (so its not swapped object references but the properties being written incorrectly individually).
I could not identify any code that could be able to do anything of that.
It sounds like the GUI changing the simulation state, which could only be done via GUIInterface.
As replaymode gives me the hash that the other players had, it's not a platform specific pathfinder bug.
comment:4 by , 4 years ago
Owner: | set to |
---|
comment:5 by , 4 years ago
This is probably the weirdest OOS report I've seen...
I'll add that the order of items comes from globalscripts/templates.js, there seems to be the only place where we define
minRange maxRange elevationBonus elevationAdaptedRange repeatTime
in that order and with those names. The fact that wood.tree survives is only more puzzling.
We can probably exclude corruption during the dump itself given the JS error.
Based on id: 13809
diff, it seems the entity on elexis's PC failed to PerformGather, which switched it into GatherNearPosition (most likely GetTargetGatherRate
failed, which means CmpResourceGatherer.PerformGather
immediately returned exhausted). A contrario, the unit kept gathering on Angen's PC. Such a scenario satisfactorily explains the JS error, the difference in food gathered and the entity state and position delta, so I reckon it's what happened.
Then if we take away the deltas from entity 13809, only the super weird gathering rates remain.
So the most likely scenario:
- Somehow elexis (possibly DDT) experienced an issue. This resulted in cached gather rates being incorrect for elexis.
- Thus elexis's game diverged from Angen/Stan a few turns later.
Now the standing question is _what_ corrupted elexis' game. Gui Interface calls are done behind structured clones, I can't find any obvious bug on bugzilla that would relate to this.
Given that affected units are owned by multiple owners, it's unlikely that they were all selected or something easy like that.
A possible direction is deepfreezing. This warns about stack overflow: https://bugzilla.mozilla.org/show_bug.cgi?id=644349 . Stack overflow sounds vaguely like what happened: random-looking corruption.
i'm saying that because GetTemplateData (from whole the attack object names come) deep freezes things. There has been one recent change at the time of the report https://code.wildfiregames.com/D2054 .
I have a weird hunch this is related to #5655, and perhaps both actually stem from the same issue when deep-freezing. More investigation required, but I'm submitting this comment as-is.
comment:6 by , 4 years ago
Description: | modified (diff) |
---|
comment:7 by , 4 years ago
I utterly failed to crash this morning when dee-freezing large objects. Trying an MP game with Stan did nothing either.
Uneval-ing the deepfreeze calls do show that we're using it a fair bit, at moments reported to crash sometimes in #5655. That's suspicious, but not nearly enough to confirm that that is the issue. We're dee-freezing a bunch of objects, including lambdas (that might be dangerous, or might not).
https://bugzilla.mozilla.org/show_bug.cgi?id=1440748 suggest structured clones in general might be weird if my hunch that this crash comes from stack overflow is correct.
Again, JS corruption of memory/stack seems like it could be related to #5655 and https://bugzilla.mozilla.org/show_bug.cgi?id=1324002 and https://bugzilla.mozilla.org/show_bug.cgi?id=1337578
I don't really see why this would have started happening in December though.
comment:8 by , 4 years ago
Looking at his JS objects are stored in Spidermonkey, it seems the object layout (called a shape) is stored separately from the actual values. They appear to be stored in either N-ary trees (that can be partly shared) and/or in doubly-liked lists. That's what current code comments say anyways.
This makes it plausible that the OOS we see is indeed some shape corruption, e.g. pointers pointing somewhere they shouldn't. That doesn't really move the conversation forward much without a somewhat stable reproduction, however.
I'll try and reproduce some spidermonkey crashes further, but at this point I don't think I can do much more.
comment:9 by , 4 years ago
Owner: | removed |
---|
I've tried other things in tests, such as cloning lambdas -> doesn't work.
I must say I can't figure this one out, and I can't seem to reproduce easily it or #5655 (which seems related. I would assume the crash callstacks would have given us the proof).
I'll get back to this when SM is upgraded I suppose.
comment:11 by , 3 years ago
Nah, this is much weirder. I think I'm right above when I write that this is because something changed the shape pointer of the JS object. Whether that be a stack overflow, a bad GC, or something else is impossible to determine.
If we can't reproduce it when testing A24, I think we should be able to close this.
comment:12 by , 3 years ago
Milestone: | Alpha 24 → Alpha 25 |
---|
Pushing to A25. I doubt we'll run into this again, so we may close it then.
comment:13 by , 3 years ago
Milestone: | Alpha 25 |
---|---|
Resolution: | → needsinfo |
Status: | new → closed |
Taking this out of the queue. We've not heard of a similar issue since, and A24 so very few (none at all?) OOS reports.