Opened 4 years ago

Closed 3 years ago

#4480 closed defect (fixed)

Segfault when packing/unpacking

Reported by: fatherbushido Owned by:
Priority: Release Blocker Milestone: Alpha 22
Component: Core engine Keywords: patch
Cc: Patch:

Description (last modified by fatherbushido)

When ordering a move to a packed siege unit in the unit demo scenario map, I get that

ERROR: JavaScript error: simulation/components/Pack.js line 102
TypeError: cmpPlayer is null
  Pack.prototype.GetPackTime@simulation/components/Pack.js:102:1
  Pack.prototype.GetProgress@simulation/components/Pack.js:112:28
  StatusBars.prototype.AddPackBar@simulation/components/StatusBars.js:157:58
  StatusBars.prototype.RegenerateSprites@simulation/components/StatusBars.js:111:14
  StatusBars.prototype.RemoveAuraSource@simulation/components/StatusBars.js:77:2
  Auras.prototype.RemoveBonus@simulation/components/Auras.js:397:4
  Auras.prototype.RegisterGlobalOwnershipChanged@simulation/components/Auras.js:277:1
  AuraManager.prototype.OnGlobalOwnershipChanged@simulation/components/AuraManager.js:254:4
ERROR: Script message handler OnGlobalOwnershipChanged failed
Erreur de segmentation

(Erreur de segmentation == segfault) The error is expected I guess, but I don't know if the segfault is. I just report that, close it if it doesn't reveal any underlying bug.

Attachments (6)

commands_min.txt (1.7 KB ) - added by elexis 4 years ago.
valgrind.txt (8.6 KB ) - added by fatherbushido 4 years ago.
wrongstatusbar.diff (1.0 KB ) - added by fatherbushido 4 years ago.
isolate a bit more
gdb_full.txt (257.5 KB ) - added by fatherbushido 4 years ago.
just non interpreted raw data
weirdcid.txt (9.8 KB ) - added by fatherbushido 4 years ago.
32767
fix.patch (768 bytes ) - added by wraitii 4 years ago.
Tentative correct fix

Download all attachments as: .zip

Change History (40)

comment:1 by leper, 4 years ago

Milestone: BacklogAlpha 22
Priority: Should HaveRelease Blocker

Not sure I'd consider the error to be expected, but being able to crash the game from the simulation (be that by stupid (or malicious) code, or bad template data) is a release blocker.

(Release build backtrace)

GAME STARTED, ALL INIT COMPLETE
ERROR: JavaScript error: simulation/components/Pack.js line 102
TypeError: cmpPlayer is null
  Pack.prototype.GetPackTime@simulation/components/Pack.js:102:1
  Pack.prototype.GetProgress@simulation/components/Pack.js:112:28
  StatusBars.prototype.AddPackBar@simulation/components/StatusBars.js:157:58
  StatusBars.prototype.RegenerateSprites@simulation/components/StatusBars.js:111:14
  StatusBars.prototype.RemoveAuraSource@simulation/components/StatusBars.js:77:2
  Auras.prototype.RemoveBonus@simulation/components/Auras.js:397:4
  Auras.prototype.RegisterGlobalOwnershipChanged@simulation/components/Auras.js:277:1
  AuraManager.prototype.OnGlobalOwnershipChanged@simulation/components/AuraManager.js:254:4
ERROR: Script message handler OnGlobalOwnershipChanged failed

Thread 1 "pyrogenesis" received signal SIGSEGV, Segmentation fault.
0x0000002f00000000 in ?? ()
(gdb) bt
#0  0x0000002f00000000 in ?? ()
#1  0x000000000058f9f2 in CDynamicSubscription::CompareIComponent::operator() (
    this=0x95cc4a0, cmpB=0x1f05f2e0, cmpA=<optimized out>)
    at ../../../source/simulation2/system/DynamicSubscription.h:48
#2  std::_Rb_tree<IComponent*, IComponent*, std::_Identity<IComponent*>, CDynamicSubscription::CompareIComponent, std::allocator<IComponent*> >::_M_get_insert_unique_pos (__k=@0x7fffffffcaf8: 0x1f063610, this=0x95cc4a0)
    at /usr/include/c++/6.3.1/bits/stl_tree.h:1836
#3  std::_Rb_tree<IComponent*, IComponent*, std::_Identity<IComponent*>, CDynamicSubscription::CompareIComponent, std::allocator<IComponent*> >::_M_insert_unique<IComponent* const&> (this=this@entry=0x95cc4a0, 
    __v=@0x7fffffffcaf8: 0x1f063610)
    at /usr/include/c++/6.3.1/bits/stl_tree.h:1889
#4  0x000000000058f260 in std::set<IComponent*, CDynamicSubscription::CompareIComponent, std::allocator<IComponent*> >::insert (
    __x=@0x7fffffffcaf8: 0x1f063610, this=0x95cc4a0)
    at /usr/include/c++/6.3.1/bits/stl_set.h:483
#5  CDynamicSubscription::Remove (this=0x95cc458, cmp=<optimized out>, 
    cmp@entry=0x1f063610)
    at ../../../source/simulation2/system/DynamicSubscription.cpp:31
#6  0x00000000004a3563 in CComponentManager::RemoveComponentDynamicSubscriptions (this=this@entry=0x49af0b8, component=0x1f063610)
    at ../../../source/simulation2/system/ComponentManager.cpp:627
#7  0x00000000004a3d1e in CComponentManager::FlushDestroyedComponents (
    this=this@entry=0x49af0b8)
    at ../../../source/simulation2/system/ComponentManager.cpp:932
#8  0x000000000046e206 in CSimulation2Impl::UpdateComponents (simContext=..., 
    turnLengthFixed=..., turnLengthFixed@entry=..., 
    commands=std::vector of length 0, capacity 0)
    at ../../../source/simulation2/Simulation2.cpp:566
#9  0x0000000000476497 in CSimulation2Impl::Update (this=0x49af090, 
    turnLength=<optimized out>, commands=std::vector of length 0, capacity 0)
    at ../../../source/simulation2/Simulation2.cpp:377
#10 0x0000000000478a0f in CSimulation2::Update (this=<optimized out>, 
    turnLength=<optimized out>, commands=std::vector of length 0, capacity 0)
    at ../../../source/simulation2/Simulation2.cpp:731
#11 0x00000000004b0cdc in CTurnManager::Update (this=<optimized out>, 
    simFrameLength=<optimized out>, maxTurns=1)
    at ../../../source/simulation2/system/TurnManager.cpp:163
#12 0x000000000061a35b in CGame::Update (this=0x499aad0, 
    deltaRealTime=0.39637419581413269, doInterpolate=doInterpolate@entry=true)
    at ../../../source/ps/Game.cpp:401
#13 0x000000000042c058 in Frame () at ../../../source/main.cpp:336
#14 RunGameOrAtlas (argc=argc@entry=1, argv=argv@entry=0x7fffffffeb48)
    at ../../../source/main.cpp:531
#15 0x0000000000421907 in main (argc=1, argv=0x7fffffffeb48)
    at ../../../source/main.cpp:571

(some debugging, just to explain the change in frame numbers)

#5  0x000000000058f9f2 in CDynamicSubscription::CompareIComponent::operator() (
    this=0x95cc4a0, cmpB=0x1f05f2e0, cmpA=<optimized out>)
    at ../../../source/simulation2/system/DynamicSubscription.h:48
48				int cidB = cmpB->GetComponentTypeId();
(gdb) p *(IComponent*)0x1f063610
$9 = {_vptr.IComponent = 0xca4a20 <vtable for CCmpPosition+16>, 
  m_EntityHandle = {m_Id = 637, m_ComponentCache = 0x1f058650}, 
  m_SimContext = 0x49af090}
(gdb) p *cmpB
$11 = {_vptr.IComponent = 0x1f993920, m_EntityHandle = {m_Id = 919310448, 
    m_ComponentCache = 0x1f058650}, m_SimContext = 0x49af090}

So cmpB is some local entity (m_Id == 0x36cb9070).

Now why cmpB has some messed up location for GetComponentTypeId is another question.

(gdb) x/8 *0x1f063610
0xca4a20 <_ZTV12CCmpPosition+16>:	0x00540600	0x00000000	0x00540560	0x00000000
0xca4a30 <_ZTV12CCmpPosition+32>:	0x00540a80	0x00000000	0x005335c0	0x00000000
(gdb) info vtbl *(IComponent*)0x1f063610
vtable for 'IComponent' @ 0xca4a20 (subobject @ 0x1f063610):
[0]: 0x540600 <CCmpPosition::~CCmpPosition()>
[1]: 0x540560 <CCmpPosition::~CCmpPosition()>
[2]: 0x540a80 <CCmpPosition::Init(CParamNode const&)>
[3]: 0x5335c0 <CCmpPosition::Deinit()>
[4]: 0x540070 <CCmpPosition::HandleMessage(CMessage const&, bool)>
[5]: 0x5339b0 <CCmpPosition::Serialize(ISerializer&)>
[6]: 0x540c80 <CCmpPosition::Deserialize(CParamNode const&, IDeserializer&)>
[7]: 0x50eef0 <ICmpPosition::NewJSObject(ScriptInterface&, JS::MutableHandle<JSObject*>) const>
[8]: 0x492170 <IComponent::GetJSInstance() const>
[9]: 0x533580 <CCmpPosition::GetComponentTypeId() const>
(gdb) info vtbl *cmpB
vtable for 'IComponent' @ 0x1f993920 (subobject @ 0x1f05f2e0):
[0]: 0x65
[1]: 0x61
[2]: 0x1f05be20
[3]: 0x1f05f2d0
[4]: 0x0
[5]: 0x2efef10
[6]: 0x1f05f2e0
[7]: 0x4bf00000043
[8]: 0x4400000000
[9]: 0x0

No real clue as to why that vtable is messed up, but it clearly is. Possibly cmpB is refering to some stale/invalid entity, but that is just pointing in random directions.

Last edited 4 years ago by leper (previous) (diff)

by elexis, 4 years ago

Attachment: commands_min.txt added

comment:2 by elexis, 4 years ago

Replay with only one command reproducing the issue if and only if doing a visual replay and having the unit selected at the time the unpack finishes, f.e. by changing the perspective to the player and enabling the follow-player option or instead executing g_Selection.addList([941]) in the js console.

I wonder how that simulation error doesn't occur when not executing the steps above.

comment:3 by elexis, 4 years ago

Expecting that the GUI having selected that entity calls RegenerateSprites to show the current pack progress (even if the stacktrace says otherwise).

The first line in it (cmpPlayer.GetCheatTimeMultiplier()) was added by r18472 (so that the "i am too busy" cheat also affects pack time). (Wondering why this only appears to break on that map.)

comment:4 by elexis, 4 years ago

Summary: Segfault when ordering a packed unit to move in the unit demo mapSegfault when ordering a packed unit to move

Just had this segfault in a survival testgame on svn :(

comment:5 by elexis, 4 years ago

Everything before r19092 doesn't segfault, everything from this on does. Perhaps this only applies to packable siege engines spawned from trigger scripts, as the segfault occured on the demo map and on survival, but afaics not with units that were trained from a building manualls. That still doesn't mean it was that specific commit, as it could just have highlighted an engine bug that was in there before.

comment:6 by fatherbushido, 4 years ago

It doesn't really bring anything to the discussion but removing the multiplication by the cheat time multiplier (we get a cmpPlayer null when the packing is finished) doesn't result in a segfault. (But changing that is only hiding deeper things).

(in that case we only got that WARNING: JavaScript warning: simulation/components/StatusBars.js line 211 reference to undefined property cp[playerID])

Edit: we get the segfault when packing or unpacking is complete (nothing related to move). Edit: we should really look at deep things described by leper Edit: there is something really strange. When the packing is finished (after calling the changeEntityTemplate function and calling destroy) we have again a call to GetPackTime (that's the one wich trigger the bugs).

Last edited 4 years ago by fatherbushido (previous) (diff)

comment:7 by fatherbushido, 4 years ago

And reverting to r19091 we don't get that cmpPlayer null in GetPackTime just after the packing/unpacking is finished. I wonder how r19092 could have introduced that.

comment:8 by fatherbushido, 4 years ago

It sounds like there is something wrong with message and timers :/

Last edited 4 years ago by fatherbushido (previous) (diff)

comment:9 by fatherbushido, 4 years ago

So when the pack/unpack is finished ChangeEntityTemplate Engine.BroadcastMessage(MT_EntityRenamed, { "entity": oldEnt, "newentity": newEnt }); Engine.DestroyEntity(oldEnt);

With the r19092 commit we have again after that a call to RegenrateSprites (I need to check why, perhaps something is faster or longer with this commit) with the old entity (the component is not yet destroyed) but it looks like the ownership is yet removed (owner set to -1).

(DestroyEntity is slow, I remember other bugs like that)

So we update the different bar, for example the pack bar wich leads to a call to GetProgress and GetPackTime without cmpPlayer so we get (needs to check) undefined stuff and the call of cmpOverlayRenderer.AddSprite is broken.

So we have now keys to understand what happens and fix that. (a -without thinking- solution is for example to check that owner is different of -1 at top of RegenerateSprites). But it will remains questions at how that can lead to a segfault.

Last edited 4 years ago by fatherbushido (previous) (diff)

comment:10 by fatherbushido, 4 years ago

Perhaps this only applies to packable siege engines spawned from trigger scripts, as the segfault >occured on the demo map and on survival, but afaics not with units that were trained from a building >manualls.

It occurs when you have aura icons and you pack/unpack a unit (for example demetrius and a ballista).

(edit: also when looking the replay, I guess the aura statusbar are not activated?)

Last edited 4 years ago by fatherbushido (previous) (diff)

comment:11 by fatherbushido, 4 years ago

Summary: Segfault when ordering a packed unit to moveSegfault when packing/unpacking

comment:12 by fatherbushido, 4 years ago

(just notes) http://trac.wildfiregames.com/browser/ps/trunk/source/simulation2/system/ComponentManager.cpp#L902

(Ownership component sent owner to -1 listening at L922 message)

comment:13 by fatherbushido, 4 years ago

In both rev Auras calls RemoveBonus for the destroyed entities. It really seems it's only a timing issue :/ (One can notice that destroying such a unit also give a warning for the capture status bar wich doesn't lead to a segfault, edit: it only occurs for global auras).

Last edited 4 years ago by fatherbushido (previous) (diff)

comment:14 by fatherbushido, 4 years ago

(Could it be a not released CEntityHandle?)

comment:15 by fatherbushido, 4 years ago

(with r19092 in Auras.RemoveBonus the destroyed unit is returned this.GiveMembersWithValidClass(name, ents while it's not the case in r19091. In both case the owner is yet set to -1 but in the first case the identity component is not yet destroyed :/)

edit: ( to be more explicit, in r19091 RemoveBonus is triggered by the range query update as in r19092 it's triggered by AuraManager listening to GlobalOwnerShipChange).

edit: also a funny fact (segfault with killed unit but not with command-destroyed one).

Last edited 4 years ago by fatherbushido (previous) (diff)

by fatherbushido, 4 years ago

Attachment: valgrind.txt added

comment:16 by fatherbushido, 4 years ago

Uhm nice, doesn't get a segfault on killing with vanish type death. (no local entity added)

Last edited 4 years ago by fatherbushido (previous) (diff)

comment:17 by fatherbushido, 4 years ago

well spent hours trying to find dangling pointers. @need leper

in reply to:  15 comment:18 by leper, 4 years ago

Excellent debugging work so far!

Replying to fatherbushido:

So when the pack/unpack is finished ChangeEntityTemplate Engine.BroadcastMessage(MT_EntityRenamed, { "entity": oldEnt, "newentity": newEnt }); Engine.DestroyEntity(oldEnt);

With the r19092 commit we have again after that a call to RegenrateSprites (I need to check why, perhaps something is faster or longer with this commit) with the old entity (the component is not yet destroyed) but it looks like the ownership is yet removed (owner set to -1).

That sounds broken, so some component is likely keeping references to something it shouldn't keep.

(DestroyEntity is slow, I remember other bugs like that)

It mainly marks it for destruction (on the next turn IIRC), that is mostly done to prevent issues if something still accesses it before it is fully destroyed.

Replying to fatherbushido:

(Could it be a not released CEntityHandle?)

It does look a lot like that, we could try actually poisoning the memory when destroying those, maybe that gives us some more hints.

Replying to fatherbushido:

(with r19092 in Auras.RemoveBonus the destroyed unit is returned this.GiveMembersWithValidClass(name, ents while it's not the case in r19091. In both case the owner is yet set to -1 but in the first case the identity component is not yet destroyed :/)

That function or component is broken if it returns something that was "destroyed" (changed owner to -1 which is what JS components (and most C++ ones) should handle if they keep references to entities).

That said the fix in D143 seems wrong (or rather a slight workaround or possibly an optimization, that shouldn't be the only defense), and the individual Aura components should not do stupid things in that case.

Also Auras.Clean() looks a bit suspicious if it is called indiscriminately in the OwnershipChanged handler. It shouldn't really add auras again if we are destroying the entity in question.

Given that the debugging work outlines quite nicely what steps are needed to make this crash, we should think about writing a test which we can then use to figure out why (and how and where) we get some sort of corruption into the cache data structures which then leads to the segfault.

comment:19 by fatherbushido, 4 years ago

Thanks for the enlightening comments.

by fatherbushido, 4 years ago

Attachment: wrongstatusbar.diff added

isolate a bit more

comment:20 by fatherbushido, 4 years ago

Well not still able to provide a test. Just notes: With that diff, just build a pack unit, when packing * and selected * is over, listening to OnDestroy (added in the diff) calls RegenerateSprites and AddBar in StatusBars component (only the pack bar in the diff). As the previous unit has not yet owner, cmpPlayer is null and so we can't get the GetPackProgress() in Pack component (the JS error). And things get wrong when calling cmpOverlayRenderer with args depending of GetPackProgress().

by fatherbushido, 4 years ago

Attachment: gdb_full.txt added

just non interpreted raw data

comment:21 by fatherbushido, 4 years ago

(without creating the new entity we have also the segfault)

comment:22 by fatherbushido, 4 years ago

I didn't take care at first when seeing that big int for component id, but 32767 for cidB is something noticeable.

comment:23 by elexis, 4 years ago

As 32767 is -1 converted to 15bit unsigned integer, it sounds like a -1 is passed as placeholder for something undefined or unowned. Smells like a check for that case is missing somewhere, which could be traced down with some effort

by fatherbushido, 4 years ago

Attachment: weirdcid.txt added

32767

comment:24 by fatherbushido, 4 years ago

(yes that (edit: that = what I posted) doesn't bring anything in fact)

Last edited 4 years ago by fatherbushido (previous) (diff)

comment:25 by elexis, 4 years ago

you're welcome

comment:26 by fatherbushido, 4 years ago

(still working on that, but I am learning what I need to catch the whole thing before) (I succeed to trigger the segfault without any JS warning/error)

Last edited 4 years ago by fatherbushido (previous) (diff)

comment:27 by wraitii, 4 years ago

Found a fix which I believe is the correct one, see the attached patch.

The problem is that somehow we make our way to status bars.js RegenerateSprite() when OnDestroy is called (I have no idea how), leading to OverlayRenderer and another component (haven't IDed which) reset their dynamic subscriptions, BUT we do not flatten them before deleting components, leading to dangling component pointers, and then it crashes on the next flatten call semi-randomly.

So we need to flatten dynamic subscriptions after the OnDestroy messages are sent, and then it all works fine. I'm really unsure why this only happens with packing units having an aura tbh, but it does, and this fixes it for me.

Edit: I am pretty sure this is the correct fix, but I don't really understand how it ended up crashing semi-randomly, so if someone has an idea. I do think we could create a unit test by having a component dynamically unsubscribe on OnDestroy, but this should be tested.

Last edited 4 years ago by wraitii (previous) (diff)

by wraitii, 4 years ago

Attachment: fix.patch added

Tentative correct fix

comment:28 by elexis, 4 years ago

Component: UI & SimulationCore engine
Keywords: patch rfc added

comment:29 by fatherbushido, 4 years ago

(Refs r15400) The patch works. I guess it's the opportune time to check/review that Dynamic subscription code.

comment:30 by fatherbushido, 3 years ago

So to sum up:

The unit pack. The pack is complete. The entity is put in the destroy queue. FlushDestroyedComponents() will start his job Dynamic subscription are flatten The destroy message is send My mistake in Auras will ask StatusBar to regenerate its sprites. (That status bar component doesn't handle such cases, should it?, Pack component also still return isPacking = true, ...) Wich lead to Reset sprites and Add a bar in Overlay Renderer. Only the first one is necessary to trigger the problem. When reseting sprites in overlay renderer, OverlayRender update its subscription (UpdateMessageSubscriptions) and actually unsubsribed to RenderSubmit and Interpolate calling DynamicSubscriptionNonsync with argument = false, and now m_Removed is not empty (I need to actually printf to show that's it's this IComponent* which remain in m_Removed).

Moving the Flattening in the loop after the destroy message fix the issue. (I wonder if we should not still not throw an error message here that some component did bad things before? As point out in the comment).

(edit: I forgot we need to add the test too).

After it will remain to add check and/or prevent Auras, Pack, StatusBar to do bad things, that's the easy part.

Last edited 3 years ago by fatherbushido (previous) (diff)

comment:31 by wraitii, 3 years ago

(I need to actually printf to show that's it's this IComponent* which remain in m_Removed).

I can confirm that.

(I wonder if we should not still not throw an error message here that some component did bad things before? As point out in the comment).

if you mean "Throw an error when a component dynamically unsubscribes on delete" I don't think that's really going to be possible.

comment:32 by fatherbushido, 3 years ago

Perhaps we could but it would add useless complication I think:

flattening before the loop, and inside the loop after the destroy message ensuring that it is again flattened (perhaps we can return a boolean in the Flatten function when we do nothing, currently it is if (m_Added.empty() && m_Removed.empty()) return;)

(Let's forget that :p)

comment:33 by fatherbushido, 3 years ago

Description: modified (diff)

Test in r19423 Fix in r19424

It remains some js fix.

comment:34 by fatherbushido, 3 years ago

Keywords: rfc removed
Resolution: fixed
Status: newclosed

Thanks leper for the tutoring and good pratice guide. Pack unit tests: r19350 Engine bug test: r19423 Wraitii's engine fix: r19424 Pack fix: r19429 Capture Bar: r19451

It remains to clean some auras behavior, mainly in the way we handle dead (target) entities in GiveMembersWithValidClass or RemoveBonus. (Better and more individual unit tests will perhaps help.) The Auras Clean function is in fact ok (for dead of source entities) as CalculateAffectedPlayers will return an empty array if/when trying to apply the bonus.

So I permit myself to close the ticket.

Note: See TracTickets for help on using tickets.