#4239 closed defect (fixed)
[PATCH] Serialize the data in DataTemplateManager to avoid OOS on rejoin
Reported by: | Itms | Owned by: | Itms |
---|---|---|---|
Priority: | Release Blocker | Milestone: | Alpha 21 |
Component: | UI & Simulation | Keywords: | patch |
Cc: | Patch: |
Description (last modified by )
The data in DataTemplateManager
(r18100) is only read from the disk so we don't serialize it. However, on rejoin, if a template had been loaded by the host, the next time this template is loaded, the guest will load a new JS object (which has the same content as the host's but it's not the same object).
Upon serialization, the objects will have different backrefs on the two instances, leading to a difference in the binary simstates.
Attachments (8)
Change History (26)
by , 8 years ago
Attachment: | DataTemplateManager.js.patch added |
---|
comment:1 by , 8 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:2 by , 8 years ago
Keywords: | review removed |
---|
comment:3 by , 8 years ago
How he found out:
Itms joined my host with -ooslog
enabled, rejoined until it became OOS. Then he produced the binary simstate of the host using -ooslog
. Then compared the binary simstates (since the textual ones are identical).
In my reproduction, the first differing byte was near a "female_inspiration" string, thus occuring in the Aura component. This byte is 0x03 for one simstate and 0x08 in the other. The first byte of unserialized data is the data type defined by SerializedScriptTypes.h
, i.e. it stands for an object
type and a backref
respectively.
Since the textual simstates were identical, the objects would have the same content. Since the binary simstates differ, the objects differ in their serialized representation though.
Thus looking at how the Auras
component serializes. Prototype variables in Init
are serialized and this.auras
is an object which is given by cmpDataTemplateManager.GetAuraTemplate
. The latter returns a new object from Engine.ReadJSONFile
file, thus having the same content but a different serial representation!
The fix was to use the this.allTechs
object from the host, which can be accomplished by letting the host serialize it. The code removed by that commit means that the DataTemplateManager
component uses the default serializing mechanism and serializes everything in Init
.
Using #4242 it is possible to simulate the rejoin determined by the prior experiment and prove that the unpatched code fails the serializationtest, while the patched code passes.
Thanks for fixing this hidden release blocker OOS Itms!
comment:4 by , 8 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
The commit (r18752) causes (or otherwise triggers) multiplayer replays to become OOS on the first non-quick hash comparison. The revision before that doesn't show this behavior. Singleplayer replays are not hashed, so they don't trigger this bug.
comment:5 by , 8 years ago
I have started a multiplayergame with -ooslog
, then replayed it visually with -ooslog
. The simstates differ on turn 1 (which is the first turn to be executed).
1c1 < State hash: 8e0949cf8cb2fcdf22cb68301cd057ef --- > State hash: 6f879f720f797e88ea3b08fb6e7daf3f 91432,91519d91431 < }, < "wall_garrisoned": { < "type": "garrisonedUnits", < "affects": [ < "Soldier" < ], < "modifications": [ < { < "value": "Armour/Pierce", < "add": 3 < }, < { < "value": "Armour/Hack", < "add": 3 < }, < { < "value": "Armour/Crush", < "add": 3 < }, < { < "value": "Vision/Range", < "add": 20 < } < ], < "auraName": "Wall Protection", < "auraDescription": "Units on walls have 3 extra Armor levels and higher vision." < }, < "temple_heal": { < "type": "range", < "radius": 40, < "affects": [ < "Human" < ], < "modifications": [ < { < "value": "Health/RegenRate", < "add": 1 < } < ], < "auraName": "Healing Aura", < "auraDescription": "Heals nearby units at 1 HP per second.", < "overlayIcon": "art/textures/ui/session/auras/heal.png" < }, < "wonder_pop_1": { < "type": "global", < "affects": [ < "Player" < ], < "modifications": [ < { < "value": "Player/MaxPopulation", < "add": 10 < } < ], < "auraName": "Wonder Aura", < "auraDescription": "Increase the population limit by 10 per Wonder owned.", < "stackable": true < }, < "wonder_pop_2": { < "type": "global", < "affects": [ < "Player" < ], < "modifications": [ < { < "value": "Player/MaxPopulation", < "add": 40 < } < ], < "auraName": "Glorious Expansion Aura", < "auraDescription": "Further increase the population limit by 40 per Wonder owned (requires \"Glorious Expansion\" tech).", < "requiredTechnology": "pop_wonder", < "stackable": true < }, < "maur_pillar": { < "type": "range", < "radius": 75, < "affects": [ < "Trader" < ], < "modifications": [ < { < "value": "UnitMotion/WalkSpeed", < "multiply": 1.2 < } < ], < "auraDescription": "All traders in range +20% walk speed.", < "overlayIcon": "art/textures/ui/session/auras/build_bonus.png"
Furthermore in nearly every multiplayer game that was started there was an instant OOS between the host and all clients, no matter which platform. I had compared 3 textual oos dumps, and there have always been techs missing and sometimes reordered, like in the diff above.
comment:6 by , 8 years ago
Description: | modified (diff) |
---|
comment:7 by , 8 years ago
The correct solution would IMO be too not serialize the templates in the aura and tech managers either.
As a quick solution, the templates in the data manager can indeed be serialized again (though we need to watch out they're loaded deterministically then). But that again inflates the size of the serialisation state, which we tried to reduce.
by , 8 years ago
Attachment: | OOS_on_rejoin_18788_fatherbushido_turn_24_160.7z added |
---|
siole and me started a game, no OOS. On turn 24 fatherbushido started to rejoin and got OOS on turn 160. Notice the rejoined client has more auras loaded than the non-rejoined one.
by , 8 years ago
Attachment: | oos_dumps_09-24.7z added |
---|
OOS dumps from 3 games, from september 24th, OOS on rejoin, diffs always have missing auras or a different order of aura items, no tech diffs
comment:8 by , 8 years ago
Correction: The diff was always in the Aura part, never in the tech part.
by , 8 years ago
Attachment: | 2016-09-30 instant_OOS_visualreplay.7z added |
---|
Here a multiplayergame with only the host (so as to enable hashing). The visual replay gives an OOS on turn 20. ooslog
dumps for every turn of the host and replayer are attached. The diff is the same as usual (missing auras for the replayer). The diff remains until the end of the log (turn 49), so it doesn't appear to be a "self-correcting" issue.
comment:9 by , 8 years ago
Keywords: | review added |
---|
Here is a patch that implements sanderd17's suggestion. It reverts r18752 and also removes the serialization of templates in Auras
. The only templates serialized in the TechnologyManager
are not used, only their keys in the object they belong to are used - so I replaced that value by a boolean, which is not going to cause serialization problems.
by , 8 years ago
Attachment: | oos_backref_patched.7z added |
---|
With r18797 and Itms patch applied, I got this OOS on rejoin. I had both the host and the client running with -ooslog
, thus capturing that their textual simstates were different while their binary data had an 0x08 instead of an 0x03 in some object of the tech part. Thus being the backref issue originally discovered in this ticket again.
by , 8 years ago
Attachment: | serialization-fix.patch added |
---|
This new one should do the job, and uses an array instead of changing to a boolean
by , 8 years ago
Attachment: | PRIOR_broken_serialization_patch.diff added |
---|
This patch was uploaded by Itms along with comment:9. attachment:oos_backref_patched.7z has shown that patch is incomplete.
comment:10 by , 8 years ago
On the history of the DataTemplateManager
and it's serialization issues:
a) Preliminaries:
- r11584 quantumstate implemented technologies and introduced the
TechnologyTemplateManager
(a.k.a.DataTemplateManager
)- All technology JSON files were loaded on
init
and serialized - I don't know whether there were OOS issues yet.
- All technology JSON files were loaded on
- r13998 sanderd17 implemented auras using JSON files and the
TechnologyTemplateManager
- It didn't introduce the instant OOS described below, as the
GUIInterface
had no access to that simulation component yet. - It likely didn't introduce any other OOS issues as only the Aura component loaded aura templates, and the construction order of the components is and has most likely been in sync at that time.
- It didn't introduce the instant OOS described below, as the
- r15318 moves the auras from JSON files to the XML entity template
- Removes the reference to
GetAuraTemplate
but not the function. - Thus
this.auras
of theTechnologyTemplateManager
will always be an empty object (not adding serialization issues).
- Removes the reference to
b) Now it breaks:
- r17974 (9 days after a20 release) moves auras back to JSON files and introduces an OOS without rejoin (instant OOS)
- Reproduce: A networked game becomes OOS as soon as one player selects some units.
- Cause: It allowed the GUI to change simulation data directly through the
GUIInterface
: To show tooltips, theGuiInterface
loads aura data by theTechnologyTemplateManager
. Since only that one client loaded that data and since that data is serialized, hashed and compared, it will trigger an OOS without rejoin. There was probably someone warning about - This instant OOS was apparently never noticed since noone tested the affected 126 revisions in multiplayer or with replays immediately after the a20 release. It was also missed in a review in irc on 2016-06-15.
c) Error progression:
- r18100 (23 days later) converted the "instant OOS" to "hash mismatch on rejoin"
- To improve performance, the unneeded serialization of the
DataTemplateManager
templates was removed (as the data can just be loaded from the files). - Thus the template cache wasn't compared anymore, not triggering instant OOS anymore
- Instead we had ahash mismatch on rejoin (without actual simstate difference) for the next 5 months.
- Reproduce: Repeat rejoining a host running with
-ooslog
and with a client running with-ooslog
as well and compare the results in case of OOS. We can observer that the objects inthis.auras
ofAuras
orthis.researchedTechs
part of theTechnologyManager
component have the same content (and identical .txt files) for different clients on that turn, but are serialized differently (differing .dat files).
- To improve performance, the unneeded serialization of the
d) Committed broken fix:
- r18752 (5 months later) was the first attempt to fix the hash mismatch (and thus this ticket) by readding serialization (thus reverting the performance improvement of r18100), but just making that instant OOS from r17974 reappear.
e) Proposed broken patch: attachment:PRIOR_broken_serialization_patch.diff
- Idea: The first patch proposed by Itms after r18752 removes the serialization in the
DataTemplateManager
again and also removes the serialization of aura objects in the Aura component. - Broken: Testing has shown that in order to trigger an OOS on rejoin, one has to rejoin 10 times instead of twice, so it worked better but is incomplete.
The binary simstates have a different backref order in the serialized technology objects (JSON file data) of the
TechnologyManager
, which received the objects from theDataTemplateManager
, whose object loading order is still influenced by the GUI. (Method: (The first byte that differs in the two files (ooslog_host/00120.dat
vsooslog_client/00066.dat
) is a 0x03 / 0x08. (Searching for a unique string that is very close to the first differing byte (regent
) in the according.txt
file revealed that theresearchedTechs
variable of theTechnologyManager
prototype differs)
f) Proposed updated patch: attachment:serialization-fix.patch
- Idea: Cuts out the serialization of the technology templates in the
TechnologyManager
too - Review: See following post.
comment:11 by , 8 years ago
On the logic:
- I'm not convinced that this patch fixes the backref issue (hash mismatch on rejoin in some components using
DataTemplateManager
objects). - If we allow the GUI Interface to influece the backref order, then I'm expecting wrong backref tags in other places as well.
It was already demonstrated that the technology template objects provided by the
DataTemplateManager
recycled and serialized in components can have different backref orders between clients. So how would we know that for examplethis.queue
ofProductionQueue
that also serializes objects provided by theDataTemplateManager
also don't bug. The patch only removes the until now discovered affected serialized objects. Making a bug less frequent isn't fixing it. Furthermore the backref mismatch can be reintroduced really easily if any component uses data by theDataTemplateManager
or any function that returns these objects, likeProductionQueue.prototype.GetTechnologiesList
for example. There would have to be exclamation marks warning all users of theDataTemplateManager
and then comments also for the component functions that provide these objects (progression of error)
On uploading: (Overwriting the patch everytime prevents the reader from checking whether the remarks were correct with regards to the previos patch, noticing what changed between the two last iterations and testing the previous patch with the provided replays)
On comments of the patch:
- The comment
Only serialize the template name of researched technologies, to avoid serialization issues
should state what these serialization issues are
On code style:
- Simplification:
ret.researchedTechs = Object.keys(this.researchedTechs)
(tech.autoResearch && this.CanResearch(template)
unneeded parenthesis (since&&
has a higher operator precedence as||
(just like*
and+
) (refs r18016, https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Operator_Precedence#Table))
- Operators should be at the end of the previous line, not at the beginning of the newline (although your code doesn't introduce this issue)
by , 8 years ago
Attachment: | fix_aura_loading_order_v1.patch added |
---|
Loads all aura templates on init in alphabetical order, thus should not cause different backref tags on different machines. Thus making it irrelevant whether the data is serialized or not (wishful thinking since I tested only 5 rejoins with the patch).
comment:13 by , 8 years ago
Keywords: | review removed |
---|
- We decided that we will not remove the serialization of the templates for alpha 21, as not serializing needs more elaborative approach.
- The serialization-safe implementation of not serializing the templates should be continued next alpha in #3834. About that:
- Itms patch was incomplete since the
TechnologyManager
it didn't address all variables that serialize these objects, for exampleGetTechnologiesList
of theProductionQueue
component - The
DataTemplateManager
should be renamed toJSONTemplateManager
or anything else sinceData
doesn't have any meaning at all - The template loader should be moved to
globalscripts
- Itms patch was incomplete since the
comment:14 by , 8 years ago
That commit is likely to cause spurious OOS, but in this case this is probably a good thing since it might teach some people to not keep random files around. (But have fun debugging those things)
There is nothing in GetTechnologiesList
that should cause any issue, or influence serialized data, especially since the only data cmpTechnologyManager
should serialize is what is researched, or has been researched. Caches can and should be recreated as needed (if at all).
Why just encode even more implementation details in the name, while you're at it?
Which could maybe be done once all issues are fixed, as that move alone is likely to cause fancy OOS issues, that will be hard to reproduce with the given tools, but persistent state might just be awesome.
In 18752: