Opened 8 years ago

Closed 8 years ago

#4270 closed defect (fixed)

[PATCH] OOS on rejoin - Visual Actor

Reported by: elexis Owned by: Itms
Priority: Release Blocker Milestone: Alpha 21
Component: UI & Simulation Keywords: patch
Cc: sanderd17 Patch:

Description (last modified by elexis)

As #4239 was solved yesterday, we finally got to test for other OOS issues the first time since many months. Here, bb rejoined a game and triggered an OOS.

The textual simstate representations only differ in the visual actor component of 3 women and 1 spearman. They were all in the carry_wood animation for the host but in the walk animation for the rejoined client.

   VisualActor:
     base actor: "units/persians/infantry_spearman_b.xml"
     r: 1
     g: 1
     b: 1
     anim run threshold: 11.75
     anim name: "build"
     anim once: false
     anim speed: 1
     sound group: "resource/construction/con_wood.xml"
     anim desync: 0.05
     anim sync repeat time: 0
     length: 1
     key: "animation"
-    value: "carry_wood"
+    value: "walk"
     seed: 423
     actor: "units/persians/infantry_spearman_b.xml"

Attachments (13)

oos_bb.7z (434.0 KB ) - added by elexis 8 years ago.
r18808
fix-visualactor-save-reload.patch (6.2 KB ) - added by Itms 8 years ago.
fix-visualactor-save-reload.2.patch (6.9 KB ) - added by Itms 8 years ago.
fix-visualactor-save-reload.3.patch (8.9 KB ) - added by Itms 8 years ago.
squashed version for testing
visual-actor-squash.patch (11.4 KB ) - added by Itms 8 years ago.
0ad_rejoin_error.txt (27.2 KB ) - added by elexis 8 years ago.
OOS_on_rejoin_siole_r18825.7z (587.3 KB ) - added by elexis 8 years ago.
crash_on_rejoin_siole_vs_Dizaka_r18828.7z (480.0 KB ) - added by elexis 8 years ago.
Can't reproduce the crash with the (non-visual) rejointest despite scenario equivalent map
visualactor_crash_pyrennean_r18830.7z (132.4 KB ) - added by elexis 8 years ago.
reproduce_crash.7z (261.4 KB ) - added by elexis 8 years ago.
Reproduce the crash by applying this patch (which deserializes the given simstate in loadme) and doing a -visual-replay on the replay file. The crash occurs here after zooming out.
savegame-0100.0adsave (547.6 KB ) - added by elexis 8 years ago.
Alternative reproduce of the crash instantaneously after loading this savegame from the menu (without patch).
screenshot.jpg (474.3 KB ) - added by elexis 8 years ago.
fix_issue.patch (512 bytes ) - added by wraitii 8 years ago.
Fixes the bug as far as I can tell.

Change History (32)

by elexis, 8 years ago

Attachment: oos_bb.7z added

comment:1 by elexis, 8 years ago

Reproduce: Gather a bit of wood with a unit, then rejoin while the unit is walking.

comment:2 by elexis, 8 years ago

Description: modified (diff)

comment:3 by Itms, 8 years ago

Keywords: rfc patch added
Summary: OOS on rejoin - Visual Actor[PATCH] OOS on rejoin - Visual Actor

The attached patch should fix this issue, and it also fixes a number of problems. It is a bigger patch than I expected, and I'm going to try to break down the different changes.

  • I removed a confusing comment (IMHO): the SelectAnimation in Init is just a proper and fairly simple animation initialization, it is not some clever code to avoid OOS or anything.
  • The animation sync offset was not saved in the component's memory, thus it was lost when the actor was reloaded. I am not sure it is a big deal, but for the sake of it I added some memory, that I serialized.
  • The function that sets the running threshold also loaded a new "walk" animation, which is stupid. The function should be renamed but I don't want to do that right now, it could cause a lot of breakage. I will propose a patch for the renaming.
  • I fixed the deserialization by only reloading the animations in the visual Unit. Previously we used the SelectAnimation methods that overwrote a number of things, whereas now everything is reloaded properly without touching the serialized data, so it cannot OOS. I created a new function from existing code, and avoided code duplication.
  • I fixed the Update function and improved the style a lot. Previously it updated only a part of the serialized data, so it could have caused problems.

The OOS of the ticket was caused by the bullet points 3 and 4: the deserialization code wrongly called (4) a function that was itself broken (3).

I am unsure whether it could cause regressions. I believe that all the code I deleted was not logical at all, but maybe some other parts of 0 A.D. assume that it behaves this way.

comment:4 by Itms, 8 years ago

I misunderstood the actual purpose of SelectMovementAnimation, so here is an updated patch.

The only differences with respect to the previous one are the SelectMovementAnimation function and the Update function that I had not rewritten properly. I added some detailed comments about how this second function, and a TODO explaining how it should ideally be done.

comment:5 by elexis, 8 years ago

The patch doesn't seem to fix the issue, I can still reproduce it the same way (gather a bit and then rejoin while the carrying unit walks). The oos dump diff is the same.

There are two "walk" animations hardcoded, I would guess those would have to become "carry_wood" if the unit is carrying wood (same for other res).

comment:6 by Itms, 8 years ago

Found the issue, it is also necessary to serialize the animation overrides (else the walk animation won't be overrode by the carry_wood one on update).

However I'm stumbling upon another serialization issue I'm trying to fix now.

comment:7 by Itms, 8 years ago

The other issue was caused by the fact that the secondary simulation in the test mode is non-visual. This would not have caused an OOS in an actual game. So I should have fixed the issue now.

I propose to remove the early return in HandleMessage so we won't have issues when running serialization or rejoin tests. I'm creating a git branch for eased reviewing, and I will post a squashed patch.

by Itms, 8 years ago

squashed version for testing

comment:8 by Itms, 8 years ago

Here is a RC that also contains style fixes, splitted for easier reviewing: https://github.com/na-Itms/0ad/tree/visual-actor

And have a squashed RC version for SVN testing.

by Itms, 8 years ago

Attachment: visual-actor-squash.patch added

comment:9 by Itms, 8 years ago

In 18821:

Improve style and whitespace in CCmpVisualActor.
Refs #4270

comment:10 by Itms, 8 years ago

In 18822:

Save the animation sync offset, for consistency's sake.
Refs #4270

comment:11 by Itms, 8 years ago

Owner: set to Itms
Resolution: fixed
Status: newclosed

In 18823:

Fix a number of things in cmpVisualActor:

  • Rewrite the Update function, add more details and information in comments, and properly serialize everything that function needs.
  • Fix the broken deserialization code by using a sane helper function.
  • Fix the SelectMovementAnimation function.

Fixes #4270.

comment:12 by Itms, 8 years ago

In 18824:

Some tweaks in order to fix the non-graphical test modes (serialization test, etc.), and possible future non-visual players like headless game servers.
The internal data of the component, which is serialized, should not depend on the presence of a visual Unit.

Also remove a misleading comment about a simple initialization code.

Refs #4270

comment:13 by Itms, 8 years ago

Keywords: rfc removed

comment:14 by elexis, 8 years ago

Resolution: fixed
Status: closedreopened
  • This error is triggered shortly after rejoins sometimes:
    Assertion failed: "! m_Anim->m_ObjectBounds.IsEmpty()"
    Location: Model.cpp:117 (CModel::CalcBounds)
    
  • We got an OOS on rejoin, identical txt dumps, on a random map script unfortunately, so will try to reproduce it

by elexis, 8 years ago

Attachment: 0ad_rejoin_error.txt added

by elexis, 8 years ago

comment:15 by Itms, 8 years ago

Thanks for the report, I can take a look this afternoon.

Did this assertion fail in the same game as this OOS? Else I would need the commands.txt for the assertion failing.

by elexis, 8 years ago

Can't reproduce the crash with the (non-visual) rejointest despite scenario equivalent map

by elexis, 8 years ago

Attachment: reproduce_crash.7z added

Reproduce the crash by applying this patch (which deserializes the given simstate in loadme) and doing a -visual-replay on the replay file. The crash occurs here after zooming out.

by elexis, 8 years ago

Attachment: savegame-0100.0adsave added

Alternative reproduce of the crash instantaneously after loading this savegame from the menu (without patch).

by elexis, 8 years ago

Attachment: screenshot.jpg added

comment:16 by elexis, 8 years ago

  • turn 5842: rejointest starts
  • turn 5969: serializationtest fails
  • turn 5980: OOS detected in the actual match (as it is checked only every 20th turn in MP)

The savegame in attachment:reproduce_crash.7z is from turn 5980. The game doesn't crash unless one of the entities in the center of this screenshot would become visible. (The screenshot was made from the replay starting on turn 0.) http://trac.wildfiregames.com/raw-attachment/ticket/4270/screenshot.jpg

by wraitii, 8 years ago

Attachment: fix_issue.patch added

Fixes the bug as far as I can tell.

comment:17 by wraitii, 8 years ago

Okay so I've gone digging and based on my findings the above patch fixes the issue. Note that it won't save a broken save game, but adding the following to the deserialize call does:

if (m_AnimSyncRepeatTime == fixed::Zero())
{
	m_AnimSyncOffsetTime = fixed::Zero();
}

That being said, the issue is that when we select a new animation we reset m_AnimSyncRepeatTime but not m_AnimSyncOffsetTime, which on deserialization may trigger a recalculation of anim states with bogus information resulting in nan down the line (see irc logs of october 15).

To reproduce the original issue, do the following: -Set Engine.SetSimRate(0.1) -Get a citizen soldier (such as an archer). -Order it to attack something twice in a short interval of time (so we trigger SetAnimationSync in UnitAI) -Order it to build something. -Save when the unit is in building animation -Load that save.

It should crash. If you follow the same steps with the above patch, it doesn't.

comment:18 by gameboy, 8 years ago

My friend, have you fixed the problem with the latest patch?

comment:19 by wraitii, 8 years ago

Resolution: fixed
Status: reopenedclosed

In 18835:

Fix an oversight in r18822/18823, fixes #4270 .

Note: See TracTickets for help on using tickets.