Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#4523 closed defect (fixed)

PushOrderFront infinite loop / Segfault when sending back-to-order command for garrisoned unit

Reported by: elexis Owned by:
Priority: Must Have Milestone: Alpha 22
Component: UI & Simulation Keywords:
Cc: Patch: Phab:D64 Phab:D685 Phab:D686

Description (last modified by elexis)

In this game hosted by pesem, many if not all clients have experienced a SEGFAULT. It was reported yesterday by causative, Boudica had it too and the replay was uploaded by Grugnas. Replaying the thing in non-visual replaymode reproduces the segfault!

Attachments (4)

2017-04-09_0002.zip (387.3 KB ) - added by elexis 7 years ago.
Thanks Grugnas for sharing the replay!
2017-04-09_0002.tar.gz (1.4 KB ) - added by causative 7 years ago.
concise replay causing the segfault
commands.txt (6.3 KB ) - added by elexis 7 years ago.
Another replay provided by Franksy where everyone segfaulted on turn 38 because a cavalry was ordered to garrison on turn 37 and ordered to gather on turn 38
fix4523.patch (673 bytes ) - added by mimo 7 years ago.

Download all attachments as: .zip

Change History (21)

by elexis, 7 years ago

Attachment: 2017-04-09_0002.zip added

Thanks Grugnas for sharing the replay!

comment:1 by elexis, 7 years ago

Looks like an error that seems unlikely to be fixed soon:

Thread 1 "pyrogenesis" received signal SIGSEGV, Segmentation fault.
0x00007ffff719f926 in js::jit::GetOptimizationLevel (pc=<optimized out>, script=...) at /.../a21/trunk/libraries/source/spidermonkey/mozjs-38.0.0/js/src/jit/Ion.cpp:2049
2049	    return js_IonOptimizations.levelForScript(script, pc);
(gdb) info stack
#0  0x00007ffff719f926 in js::jit::GetOptimizationLevel (pc=<optimized out>, script=...) at /media/.../trunk/libraries/source/spidermonkey/mozjs-38.0.0/js/src/jit/Ion.cpp:2049
#1  js::jit::Compile (cx=cx@entry=0xd04fc0, script=..., script@entry=..., osrFrame=osrFrame@entry=0x0, osrPc=osrPc@entry=0x0, constructing=false, forceRecompile=forceRecompile@entry=false)
    at /.../a21/trunk/libraries/source/spidermonkey/mozjs-38.0.0/js/src/jit/Ion.cpp:2080
#2  0x00007ffff71a0a11 in js::jit::CanEnter (cx=cx@entry=0xd04fc0, state=...) at /.../a21/trunk/libraries/source/spidermonkey/mozjs-38.0.0/js/src/jit/Ion.cpp:2244
#3  0x00007ffff6f92039 in js::RunScript (cx=cx@entry=0xd04fc0, state=...) at /.../a21/trunk/libraries/source/spidermonkey/mozjs-38.0.0/js/src/vm/Interpreter.cpp:424
#4  0x00007ffff6f9227c in js::Invoke (cx=cx@entry=0xd04fc0, args=..., construct=construct@entry=js::NO_CONSTRUCT)
    at /.../a21/trunk/libraries/source/spidermonkey/mozjs-38.0.0/js/src/vm/Interpreter.cpp:517
#5  0x00007ffff72dd147 in js_fun_apply (cx=0xd04fc0, argc=<optimized out>, vp=0x7fffff7ffb78) at /.../a21/trunk/libraries/source/spidermonkey/mozjs-38.0.0/js/src/jsfun.cpp:1323

by causative, 7 years ago

Attachment: 2017-04-09_0002.tar.gz added

concise replay causing the segfault

comment:2 by causative, 7 years ago

Description: modified (diff)

I was able to reproduce this error in SVN. You can reliably cause the segfault by:

  • Select a citizen cavalry unit
  • Click to hunt an animal (camel, in this case), but don't actually kill it.
  • Garrison the cavalry
  • Use the script console to post a network message telling the garrisoned cavalry to resume work: Engine.PostNetworkCommand({"type":"back-to-work","entities":[21062]}).
  • It segfaults

The error can also be triggered without using the script console.

  • Select a citizen cavalry unit
  • Click to hunt an animal (camel, in this case), but don't actually kill it.
  • Click to stand right next to the building you will garrison into
  • Set the game speed to Turtle (0.1x) to make the next step easier
  • Ctrl-click to garrison into the building, and very quickly afterwards press y to resume work, so that both commands post to the same turn.
  • the game segfaults. It may take a few tries before this happens.

I have provided a replay of this in SVN.

However, note that this method is not quite the same thing nigel87 did. He garrisoned his cavalry (entity 21062) many turns before he ordered it to resume work. How did he issue the command for it to resume work, without the script console and without issuing very fast consecutive commands?

comment:3 by elexis, 7 years ago

I could reproduce it by adding a keyboard shortcut to that unit, garrisoning it, then repeatedly clicking on the icon button while pressing Y, many turns later without any console action.

Thanks a lot for the debugging!

comment:4 by elexis, 7 years ago

Summary: Alpha 21 Segfault replaySegfault when sending back-to-order command for garrisoned unit

comment:5 by elexis, 7 years ago

Description: modified (diff)
Milestone: BacklogAlpha 22

A known issue with an existing patch at https://code.wildfiregames.com/D64 (we just didn't know that it could be triggered in actual games).

comment:6 by elexis, 7 years ago

Summary: Segfault when sending back-to-order command for garrisoned unitPushOrderFront infinite loop / Segfault when sending back-to-order command for garrisoned unit

by elexis, 7 years ago

Attachment: commands.txt added

Another replay provided by Franksy where everyone segfaulted on turn 38 because a cavalry was ordered to garrison on turn 37 and ordered to gather on turn 38

comment:7 by mimo, 7 years ago

This last replay does not look at all as phab:D64. It happens because the gather order arrived just at the same turn as the unit is garrisoned, and once garrisoned, units are not supposed to receive any order except Ungarrison if not turret. The simplest is to just forget about such orders (which are very unlikely to happen, and will be even less likely when decreasing the turn length) as is done in the attached patch.

Version 0, edited 7 years ago by mimo (next)

by mimo, 7 years ago

Attachment: fix4523.patch added

comment:8 by elexis, 7 years ago

This last replay does not look at all as ​phab:D64

Isn't it exactly that infinite loop that causes a segfault after having had this one way to trigger it? Applying D64 to a21 also fixes the segfault with that replay.

D64 attempts to remove the possibility entirely, while the attachment intends to remove the way to trigger it, not a contradiction IMO.

The attachment also sounds like it would also fix the first issue in #4291.

The attachment fixes the reproduce recipe mentioned above (assign group to a cavalry, garrison it, move the mouse cursor over an animal, press the hotkey quickly while rightclicking repeatedly on the animal), hence the comment isn't exactly accurate. Since it fixes the reproduce, it must also fix the original replay reported here.

Patch seems good to me, but we should still fix the broader fix D64 sometime, right?

Also about the turn length, there can also be lag sometimes which can stretch the realtime turn duration to arbitrary lengths (like up to 60 seconds). (Indeed very rare to run into, still a segfault is bad if it kills a game)

comment:9 by mimo, 7 years ago

to my understanding, D64 is supposed to fix a "valid loop" but which by chance seems to never happen except in wraiiti changes, while that one is an invalid loop: we should never go out of a garrison state without having issued an Ungarrison order.

Concerning the last point, the alternatives are either to drop the order as in the patch, or to insert an Ungarrison order just before the new order. That's possible though it complicates a bit the code for something which is very rare. I've no strong opinions about it.

And concerning the first report about back-to-orders, there are two points:

  • for a garrisoned not turret unit, the only possibility for it to apply is for the same reason as before (order issued on the exact turn the unit is garrisoned), and thus the same patch would work.
  • but there is also the problem that for turrets, the bact-to-work button is available in the gui, and clicking on it will only clear the back-to-work orders as the unit can't do it. So here also we have two alternatives: either disable the back-to-work button for garrisoned units (and have an early return in UnitAI.BackToWork if the unit is garrisoned), or insert an Ungarrison order in this UnitAI.BackToWork function. Here also, i've no real preferences.

comment:10 by mimo, 7 years ago

see D685 for the back-to-work case and D686 for the case covered by fix4523.patch

Last edited 7 years ago by mimo (previous) (diff)

comment:11 by elexis, 7 years ago

In r19869 by mimo:

fix back-to-work button for garrisoned units

Reviewed By: fatherbushido

Differential Revision: ​https://code.wildfiregames.com/D685

comment:12 by elexis, 7 years ago

Patch: Phab:D64 Phab:D685 Phab:D686

comment:13 by mimo, 7 years ago

phab:D686 committed in r19900

Last edited 7 years ago by mimo (previous) (diff)

comment:14 by elexis, 7 years ago

Milestone: Alpha 22Backlog
Priority: Release BlockerMust Have

Potential proposal to fix the general loop instead of only every known entrance remaining at Phab:D64. It was also proposed by bb in Phab:D686 to only fix the entrance and not the loop itself.

comment:15 by mimo, 7 years ago

Preventing something (which should not happen) to happen, as in D685 or D686, is imo better than trying to block a possible infinite loop without understanding its cause (as in D64) with possible uncontrolled side-effects (as the unwanted change in the chasing behaviour). As long as we don't have a way to reproduce it in a licit case, i don't feel the need for D64.

comment:16 by elexis, 7 years ago

Resolution: fixed
Status: newclosed

True (good to hear).

Maybe we could add an error message to D64 though so that in case of that error occuring in a game, it doesn't kill the game while still giving developers the feedback (admittedly a segfault will be more reported than a random error). Hm.

Maybe even only an error message with a stack (which would also inform readers of the code to take care) and no early return so that we have error + segfault.

comment:17 by leper, 7 years ago

Milestone: BacklogAlpha 22

Maybe we could add an error message to D64 though so that in case of that error occuring in a game [...]

D64 fixes a bug caused by other changes of which none are merged, and quite a few of which are having other side effects, so if we are thinking about making code more robust in case of people doing stupid things then we should do that everywhere and not just because some proposed set of changes is broken.

Note: See TracTickets for help on using tickets.