Opened 8 years ago

Closed 8 years ago

Last modified 6 years ago

#4239 closed defect (fixed)

[PATCH] Serialize the data in DataTemplateManager to avoid OOS on rejoin

Reported by: Itms Owned by: Itms
Priority: Release Blocker Milestone: Alpha 21
Component: UI & Simulation Keywords: patch
Cc: Patch:

Description (last modified by elexis)

The data in DataTemplateManager (r18100) is only read from the disk so we don't serialize it. However, on rejoin, if a template had been loaded by the host, the next time this template is loaded, the guest will load a new JS object (which has the same content as the host's but it's not the same object).

Upon serialization, the objects will have different backrefs on the two instances, leading to a difference in the binary simstates.

Attachments (8)

DataTemplateManager.js.patch (679 bytes ) - added by Itms 8 years ago.
OOS_on_rejoin_18788_fatherbushido_turn_24_160.7z (1.0 MB ) - added by elexis 8 years ago.
siole and me started a game, no OOS. On turn 24 fatherbushido started to rejoin and got OOS on turn 160. Notice the rejoined client has more auras loaded than the non-rejoined one.
oos_dumps_09-24.7z (966.3 KB ) - added by elexis 8 years ago.
OOS dumps from 3 games, from september 24th, OOS on rejoin, diffs always have missing auras or a different order of aura items, no tech diffs
2016-09-30 instant_OOS_visualreplay.7z (166.5 KB ) - added by elexis 8 years ago.
Here a multiplayergame with only the host (so as to enable hashing). The visual replay gives an OOS on turn 20. ooslog dumps for every turn of the host and replayer are attached. The diff is the same as usual (missing auras for the replayer). The diff remains until the end of the log (turn 49), so it doesn't appear to be a "self-correcting" issue.
oos_backref_patched.7z (628.5 KB ) - added by elexis 8 years ago.
With r18797 and Itms patch applied, I got this OOS on rejoin. I had both the host and the client running with -ooslog, thus capturing that their textual simstates were different while their binary data had an 0x08 instead of an 0x03 in some object of the tech part. Thus being the backref issue originally discovered in this ticket again.
serialization-fix.patch (5.6 KB ) - added by Itms 8 years ago.
This new one should do the job, and uses an array instead of changing to a boolean
PRIOR_broken_serialization_patch.diff (3.2 KB ) - added by elexis 8 years ago.
This patch was uploaded by Itms along with comment:9. attachment:oos_backref_patched.7z has shown that patch is incomplete.
fix_aura_loading_order_v1.patch (1.2 KB ) - added by elexis 8 years ago.
Loads all aura templates on init in alphabetical order, thus should not cause different backref tags on different machines. Thus making it irrelevant whether the data is serialized or not (wishful thinking since I tested only 5 rejoins with the patch).

Change History (26)

by Itms, 8 years ago

comment:1 by Itms, 8 years ago

Resolution: fixed
Status: newclosed

In 18752:

Fix a frequent rejoining OOS. We actually need to serialize the data loaded from the disk, because JS objects in the memory or newly loaded from the disk will not behave the same way.

Fixes #4239, refs #3834.

comment:2 by Itms, 8 years ago

Keywords: review removed

comment:3 by elexis, 8 years ago

How he found out:

Itms joined my host with -ooslog enabled, rejoined until it became OOS. Then he produced the binary simstate of the host using -ooslog. Then compared the binary simstates (since the textual ones are identical).

In my reproduction, the first differing byte was near a "female_inspiration" string, thus occuring in the Aura component. This byte is 0x03 for one simstate and 0x08 in the other. The first byte of unserialized data is the data type defined by SerializedScriptTypes.h, i.e. it stands for an object type and a backref respectively.

Since the textual simstates were identical, the objects would have the same content. Since the binary simstates differ, the objects differ in their serialized representation though.

Thus looking at how the Auras component serializes. Prototype variables in Init are serialized and this.auras is an object which is given by cmpDataTemplateManager.GetAuraTemplate. The latter returns a new object from Engine.ReadJSONFile file, thus having the same content but a different serial representation!

The fix was to use the this.allTechs object from the host, which can be accomplished by letting the host serialize it. The code removed by that commit means that the DataTemplateManager component uses the default serializing mechanism and serializes everything in Init.


Using #4242 it is possible to simulate the rejoin determined by the prior experiment and prove that the unpatched code fails the serializationtest, while the patched code passes.

Thanks for fixing this hidden release blocker OOS Itms!

comment:4 by elexis, 8 years ago

Resolution: fixed
Status: closedreopened

The commit (r18752) causes (or otherwise triggers) multiplayer replays to become OOS on the first non-quick hash comparison. The revision before that doesn't show this behavior. Singleplayer replays are not hashed, so they don't trigger this bug.

comment:5 by elexis, 8 years ago

I have started a multiplayergame with -ooslog, then replayed it visually with -ooslog. The simstates differ on turn 1 (which is the first turn to be executed).

1c1
< State hash: 8e0949cf8cb2fcdf22cb68301cd057ef
---
> State hash: 6f879f720f797e88ea3b08fb6e7daf3f
91432,91519d91431
<     },
<     "wall_garrisoned": {
<       "type": "garrisonedUnits",
<       "affects": [
<         "Soldier"
<       ],
<       "modifications": [
<         {
<           "value": "Armour/Pierce",
<           "add": 3
<         },
<         {
<           "value": "Armour/Hack",
<           "add": 3
<         },
<         {
<           "value": "Armour/Crush",
<           "add": 3
<         },
<         {
<           "value": "Vision/Range",
<           "add": 20
<         }
<       ],
<       "auraName": "Wall Protection",
<       "auraDescription": "Units on walls have 3 extra Armor levels and higher vision."
<     },
<     "temple_heal": {
<       "type": "range",
<       "radius": 40,
<       "affects": [
<         "Human"
<       ],
<       "modifications": [
<         {
<           "value": "Health/RegenRate",
<           "add": 1
<         }
<       ],
<       "auraName": "Healing Aura",
<       "auraDescription": "Heals nearby units at 1 HP per second.",
<       "overlayIcon": "art/textures/ui/session/auras/heal.png"
<     },
<     "wonder_pop_1": {
<       "type": "global",
<       "affects": [
<         "Player"
<       ],
<       "modifications": [
<         {
<           "value": "Player/MaxPopulation",
<           "add": 10
<         }
<       ],
<       "auraName": "Wonder Aura",
<       "auraDescription": "Increase the population limit by 10 per Wonder owned.",
<       "stackable": true
<     },
<     "wonder_pop_2": {
<       "type": "global",
<       "affects": [
<         "Player"
<       ],
<       "modifications": [
<         {
<           "value": "Player/MaxPopulation",
<           "add": 40
<         }
<       ],
<       "auraName": "Glorious Expansion Aura",
<       "auraDescription": "Further increase the population limit by 40 per Wonder owned (requires \"Glorious Expansion\" tech).",
<       "requiredTechnology": "pop_wonder",
<       "stackable": true
<     },
<     "maur_pillar": {
<       "type": "range",
<       "radius": 75,
<       "affects": [
<         "Trader"
<       ],
<       "modifications": [
<         {
<           "value": "UnitMotion/WalkSpeed",
<           "multiply": 1.2
<         }
<       ],
<       "auraDescription": "All traders in range +20% walk speed.",
<       "overlayIcon": "art/textures/ui/session/auras/build_bonus.png"

Furthermore in nearly every multiplayer game that was started there was an instant OOS between the host and all clients, no matter which platform. I had compared 3 textual oos dumps, and there have always been techs missing and sometimes reordered, like in the diff above.

comment:6 by elexis, 8 years ago

Description: modified (diff)

comment:7 by sanderd17, 8 years ago

The correct solution would IMO be too not serialize the templates in the aura and tech managers either.

As a quick solution, the templates in the data manager can indeed be serialized again (though we need to watch out they're loaded deterministically then). But that again inflates the size of the serialisation state, which we tried to reduce.

by elexis, 8 years ago

siole and me started a game, no OOS. On turn 24 fatherbushido started to rejoin and got OOS on turn 160. Notice the rejoined client has more auras loaded than the non-rejoined one.

by elexis, 8 years ago

Attachment: oos_dumps_09-24.7z added

OOS dumps from 3 games, from september 24th, OOS on rejoin, diffs always have missing auras or a different order of aura items, no tech diffs

comment:8 by elexis, 8 years ago

Correction: The diff was always in the Aura part, never in the tech part.

by elexis, 8 years ago

Here a multiplayergame with only the host (so as to enable hashing). The visual replay gives an OOS on turn 20. ooslog dumps for every turn of the host and replayer are attached. The diff is the same as usual (missing auras for the replayer). The diff remains until the end of the log (turn 49), so it doesn't appear to be a "self-correcting" issue.

comment:9 by Itms, 8 years ago

Keywords: review added

Here is a patch that implements sanderd17's suggestion. It reverts r18752 and also removes the serialization of templates in Auras. The only templates serialized in the TechnologyManager are not used, only their keys in the object they belong to are used - so I replaced that value by a boolean, which is not going to cause serialization problems.

by elexis, 8 years ago

Attachment: oos_backref_patched.7z added

With r18797 and Itms patch applied, I got this OOS on rejoin. I had both the host and the client running with -ooslog, thus capturing that their textual simstates were different while their binary data had an 0x08 instead of an 0x03 in some object of the tech part. Thus being the backref issue originally discovered in this ticket again.

by Itms, 8 years ago

Attachment: serialization-fix.patch added

This new one should do the job, and uses an array instead of changing to a boolean

by elexis, 8 years ago

This patch was uploaded by Itms along with comment:9. attachment:oos_backref_patched.7z has shown that patch is incomplete.

comment:10 by elexis, 8 years ago

On the history of the DataTemplateManager and it's serialization issues:

a) Preliminaries:

  • r11584 quantumstate implemented technologies and introduced the TechnologyTemplateManager (a.k.a. DataTemplateManager)
    • All technology JSON files were loaded on init and serialized
    • I don't know whether there were OOS issues yet.
  • r13998 sanderd17 implemented auras using JSON files and the TechnologyTemplateManager
    • It didn't introduce the instant OOS described below, as the GUIInterface had no access to that simulation component yet.
    • It likely didn't introduce any other OOS issues as only the Aura component loaded aura templates, and the construction order of the components is and has most likely been in sync at that time.
  • r15318 moves the auras from JSON files to the XML entity template
    • Removes the reference to GetAuraTemplate but not the function.
    • Thus this.auras of the TechnologyTemplateManager will always be an empty object (not adding serialization issues).

b) Now it breaks:

  • r17974 (9 days after a20 release) moves auras back to JSON files and introduces an OOS without rejoin (instant OOS)
    • Reproduce: A networked game becomes OOS as soon as one player selects some units.
    • Cause: It allowed the GUI to change simulation data directly through the GUIInterface: To show tooltips, the GuiInterface loads aura data by the TechnologyTemplateManager. Since only that one client loaded that data and since that data is serialized, hashed and compared, it will trigger an OOS without rejoin. There was probably someone warning about
    • This instant OOS was apparently never noticed since noone tested the affected 126 revisions in multiplayer or with replays immediately after the a20 release. It was also missed in a review in irc on 2016-06-15.

c) Error progression:

  • r18100 (23 days later) converted the "instant OOS" to "hash mismatch on rejoin"
    • To improve performance, the unneeded serialization of the DataTemplateManager templates was removed (as the data can just be loaded from the files).
    • Thus the template cache wasn't compared anymore, not triggering instant OOS anymore
    • Instead we had ahash mismatch on rejoin (without actual simstate difference) for the next 5 months.
    • Reproduce: Repeat rejoining a host running with -ooslog and with a client running with -ooslog as well and compare the results in case of OOS. We can observer that the objects in this.auras of Auras or this.researchedTechs part of the TechnologyManager component have the same content (and identical .txt files) for different clients on that turn, but are serialized differently (differing .dat files).

d) Committed broken fix:

  • r18752 (5 months later) was the first attempt to fix the hash mismatch (and thus this ticket) by readding serialization (thus reverting the performance improvement of r18100), but just making that instant OOS from r17974 reappear.

e) Proposed broken patch: attachment:PRIOR_broken_serialization_patch.diff

  • Idea: The first patch proposed by Itms after r18752 removes the serialization in the DataTemplateManager again and also removes the serialization of aura objects in the Aura component.
  • Broken: Testing has shown that in order to trigger an OOS on rejoin, one has to rejoin 10 times instead of twice, so it worked better but is incomplete. The binary simstates have a different backref order in the serialized technology objects (JSON file data) of the TechnologyManager, which received the objects from the DataTemplateManager, whose object loading order is still influenced by the GUI. (Method: (The first byte that differs in the two files (ooslog_host/00120.dat vs ooslog_client/00066.dat) is a 0x03 / 0x08. (Searching for a unique string that is very close to the first differing byte (regent) in the according .txt file revealed that the researchedTechs variable of the TechnologyManager prototype differs)

f) Proposed updated patch: attachment:serialization-fix.patch

  • Idea: Cuts out the serialization of the technology templates in the TechnologyManager too
  • Review: See following post.

comment:11 by elexis, 8 years ago

On the logic:

  • I'm not convinced that this patch fixes the backref issue (hash mismatch on rejoin in some components using DataTemplateManager objects).
  • If we allow the GUI Interface to influece the backref order, then I'm expecting wrong backref tags in other places as well. It was already demonstrated that the technology template objects provided by the DataTemplateManager recycled and serialized in components can have different backref orders between clients. So how would we know that for example this.queue of ProductionQueue that also serializes objects provided by the DataTemplateManager also don't bug. The patch only removes the until now discovered affected serialized objects. Making a bug less frequent isn't fixing it. Furthermore the backref mismatch can be reintroduced really easily if any component uses data by the DataTemplateManager or any function that returns these objects, like ProductionQueue.prototype.GetTechnologiesList for example. There would have to be exclamation marks warning all users of the DataTemplateManager and then comments also for the component functions that provide these objects (progression of error)

On uploading: (Overwriting the patch everytime prevents the reader from checking whether the remarks were correct with regards to the previos patch, noticing what changed between the two last iterations and testing the previous patch with the provided replays)

On comments of the patch:

  • The comment Only serialize the template name of researched technologies, to avoid serialization issues should state what these serialization issues are

On code style:

  • Simplification: ret.researchedTechs = Object.keys(this.researchedTechs)
  • Operators should be at the end of the previous line, not at the beginning of the newline (although your code doesn't introduce this issue)

by elexis, 8 years ago

Loads all aura templates on init in alphabetical order, thus should not cause different backref tags on different machines. Thus making it irrelevant whether the data is serialized or not (wishful thinking since I tested only 5 rejoins with the patch).

comment:12 by elexis, 8 years ago

Resolution: fixed
Status: reopenedclosed

In 18804:

Fix an OOS without rejoin after selecting the first unit.
Introduced by r17974, caused by serializing the aura cache which is initialized by the tooltip calls from the GUI.
Reviewed by Itms, in combination with r18752 fixes #4239.

comment:13 by elexis, 8 years ago

Keywords: review removed
  • We decided that we will not remove the serialization of the templates for alpha 21, as not serializing needs more elaborative approach.
  • The serialization-safe implementation of not serializing the templates should be continued next alpha in #3834. About that:
    • Itms patch was incomplete since the TechnologyManager it didn't address all variables that serialize these objects, for example GetTechnologiesList of the ProductionQueue component
    • The DataTemplateManager should be renamed to JSONTemplateManager or anything else since Data doesn't have any meaning at all
    • The template loader should be moved to globalscripts

comment:14 by leper, 8 years ago

That commit is likely to cause spurious OOS, but in this case this is probably a good thing since it might teach some people to not keep random files around. (But have fun debugging those things)

There is nothing in GetTechnologiesList that should cause any issue, or influence serialized data, especially since the only data cmpTechnologyManager should serialize is what is researched, or has been researched. Caches can and should be recreated as needed (if at all).

Why just encode even more implementation details in the name, while you're at it?

Which could maybe be done once all issues are fixed, as that move alone is likely to cause fancy OOS issues, that will be hard to reproduce with the given tools, but persistent state might just be awesome.

comment:15 by elexis, 6 years ago

In 20606:

Don't serialize the templates of autoresearched technologies.

Refs #3834, #4239
Differential Revision: https://code.wildfiregames.com/D1109
Reviewed By: temple
Comments By: Itms
Proposed By: sanderd17

comment:16 by elexis, 6 years ago

In 20610:

Remove unused serialized cache of technology templates in TechnologyManager.researchedTechs.

Refs #3834, #3909, #4239, #4868, D1108
Differential Revision: https://code.wildfiregames.com/D1130
Reviewed By: mimo

comment:17 by elexis, 6 years ago

In 20737:

Replace DataTemplateManager simulation component with a globalscript, refs #4868.

Removes the serialization of JSON files, shrinking savegame files and rejoin states sent across the network, refs #3834, #4239, #3909, rP18100.
Removes the AI C++ code to read JSON files from rP13225 since the AI can now use the globalscript.
Allows the AI to read Aura templates and removal of GUIInterface code to improve performance.
Serialization of the JSON objects in other simulation components was removed in r20606 / D1109, r20610 / D1130.

Serialization removal planned by sanderd17
AI part proofread by mimo
Simulation part proofread by bb
Discussed with Itms on irc

Differential Revision: https://code.wildfiregames.com/D1108

comment:18 by elexis, 6 years ago

In 21014:

Fix AuraManager OOS on rejoin following rP20737 / D1108.

The AuraManager serialized the affects array of the aura JSON files, while the Auras components create new affects arrays upon deserialization.
So the AuraManager of rejoined clients failed to identify and remove the according aura effects of units who received the aura effect before the rejoin.

Fix it by not serializing the array but only the strings inside the array.
It were preferable to rebuild the modification cache and remove the serialization altogether, refs #3834.

rP18100 introduced the same OOS on rejoin, refs #3909, #4239, but it wasn't noticed due to the revert in rP18804.

Differential Revision: https://code.wildfiregames.com/D1201
Reviewed By: bb
Comments By: gameboy
Fixes #4924

Note: See TracTickets for help on using tickets.