#4053 closed defect (fixed)
[PATCH] Building spidermonkey on GCC 6 results in segfaults in the Garbage Collector
Reported by: | sanderd17 | Owned by: | echotangoecho |
---|---|---|---|
Priority: | Release Blocker | Milestone: | Alpha 21 |
Component: | Core engine | Keywords: | patch |
Cc: | leper, Yves | Patch: |
Description
When building Spidermonkey on GCC 6, there are quite often segfaults appearing in the garbage collector.
This happens especially often when a lot of data is flushed, like when finalising the load stage of a random map. And for some reason, it happens more often when the game window isn't focused (when it's in the background).
I tested this on our current SpiderMonkey (v31) and on leper's upgrade to v38, and both fail with the same symptoms.
A typical backtrace can be found here: http://pastebin.com/YWRzvpkK
After asking on #jsapi, it appears that these issues are only fixed from v49 onwards. And for now, we need to disable certain optimisations.
In particular the optimisations listed on their ticket 1245783, comment 33.
Attachments (14)
Change History (55)
comment:1 by , 8 years ago
comment:2 by , 8 years ago
Summary: | Building spidermonkey on GCC 6 results in sefaults in the Garbage Collector → Building spidermonkey on GCC 6 results in segfaults in the Garbage Collector |
---|
comment:3 by , 8 years ago
Keywords: | patch review added |
---|---|
Owner: | set to |
Summary: | Building spidermonkey on GCC 6 results in segfaults in the Garbage Collector → [PATCH] Building spidermonkey on GCC 6 results in segfaults in the Garbage Collector |
When in debug mode, an assertion was triggered hinting at creating a Rooted<T> somewhere outside of a request.
This patch fixes that issue, and during my testing the segfaults did not occur anymore. It could still be that a similar issue also happens elsewhere, in a code section not executed during my testing.
by , 8 years ago
Attachment: | autorequestfix.patch added |
---|
comment:4 by , 8 years ago
During further testing, it seems that the issue is unfortunately not fixed with the above patch and only occurs when compiling in release mode.
by , 8 years ago
Attachment: | autorequestfix_v2.patch added |
---|
Found some more by searching for all occurances of JS::RootedValue
, JS::RootedObject
and GetContext
(grep -R 'GetContext();' -A 1
).
by , 8 years ago
Attachment: | tmpfixgcsegfault.patch added |
---|
comment:6 by , 8 years ago
In the changelog for GCC 6:
Value range propagation now assumes that the this pointer of C++ member functions is non-null. This eliminates common null pointer checks but also breaks some non-conforming code-bases (such as Qt-5, Chromium, KDevelop). As a temporary work-around -fno-delete-null-pointer-checks can be used. Wrong code can be identified by using -fsanitize=undefined.
When adding -fno-delete-null-pointer-checks
the segfaults don't seem to occur anymore,
but I'm not sure if this is the right way to add it.
by , 8 years ago
Attachment: | fixtrace.patch added |
---|
comment:7 by , 8 years ago
It seems there is the possibility for another segfault in our own code:
When ObjectIdCache
is constructed, ObjectIdCache::Trace
is added
as a GC roots tracer, while the ObjectIdCache
is not properly initialized
until ObjectIdCache::init
is called, causing a segfault when a GC run occurs
between running the ObjectIdCache
constructor and ObjectIdCache::init
.
comment:8 by , 8 years ago
Keywords: | review removed |
---|
Thanks for working on it. Since the SpiderMonkey 38 upgrade is almost ready, I would like to push those changes to after the update, in order to avoid fixing conflicts this late in the review process of #3708.
You will need to rebase the GC segfault patch when the update happens, the two other patches look good to me.
by , 8 years ago
Attachment: | tmpfixgcsegfault2.patch added |
---|
by , 8 years ago
Attachment: | tmpfixgcsegfault3.patch added |
---|
comment:9 by , 8 years ago
Keywords: | rfc added |
---|
comment:11 by , 8 years ago
Patch not by echotangoecho, and I'm very sure I haven't found all calls, since I have aborted my search at some point. Had found them by simple string searches for GetContext()
.
comment:14 by , 8 years ago
Keywords: | rfc removed |
---|
Thanks a lot for those patches. I had to investigate upstream a bit for the last two ones.
As mentioned in the commit message, the patches that would fix the issue are not compatible with vs2013. I'm not too fond of patching the SM code itself for one platform and not for another, so I rather committed your fix.
It adds a slew of warnings with clang under OSX, but we already have a huge number of those warnings and we'd just need an active OSX dev in order to improve the situation. In the meantime we can now have GCC6 devs ;)
comment:16 by , 8 years ago
Keywords: | patch removed |
---|---|
Milestone: | Alpha 21 → Alpha 22 |
Resolution: | fixed |
Status: | closed → reopened |
Summary: | [PATCH] Building spidermonkey on GCC 6 results in segfaults in the Garbage Collector → Building spidermonkey on GCC 6 results in segfaults in the Garbage Collector |
So the patch above did not fix segfaults in SM with GCC6: here is a backtrace provided by Elinvention: http://hastebin.com/cereyolawa.go
I decided to revert that workaround since it does not cover every case, it's better to tell people GCC6 is not supported at all, so they don't use an unstable version of the game.
follow-up: 18 comment:17 by , 8 years ago
If this is the plan, then the build system should error, or at least print a warning, when building with GCC 6. This way everyone trying building it is noticed of the possible issue. (On quitting with an error one can always remove the check if he really want as it is already done in Debian when building with root user.)
comment:18 by , 8 years ago
Replying to fabio:
If this is the plan, then the build system should error, or at least print a warning, when building with GCC 6.
I agree. Actually 0ad can be built with GCC6, SM38 can't. Repository managers of distros where GCC6 has become the default patched SM38 with workarounds, so people on those distros should just use that version.
What do you think of the attached patch?
by , 8 years ago
Attachment: | GCC6.patch added |
---|
comment:19 by , 8 years ago
Why is the latest patch using cc
if explicitly setting a different compiler should be supported? Why is it not checking whether that compiler actually is g++
(which also means that cc
above is wrong in two different ways)?
Why do we assume that distros have a packaged recent SpiderMonkey (we are pretty much the only thing requiring that and in lots of cases we depend on some patches, so using the bundled one is a lot easier; or they at least have a source for those patches), have some old version of GCC or another compiler, and that distros should do the work of patching SpiderMonkey if we don't even bother to do that.
Why did the (reverted) commit patch something instead of just adding something to the exact same environment variable? Also the bug report in the description does list two flags that are needed, though from looking at a few other tickets referenced from there that could not be all that is needed.
And dropping support for GCC6 means dropping Arch, Fedora, Debian Unstable (though GCC6 is going to end up in other versions too; though here an older version is more likely to be available), openSUSE Tumbleweed (though that likely also has other versions in the repos), and Ubuntu 16.10 (at least with the default compiler). (And will likely surprise a few Gentoo/Funtoo users using the newest GCC.) Which mostly means dropping support for nearly all those distributions/variants that will actually ship the new release.
comment:20 by , 8 years ago
Milestone: | Alpha 22 → Alpha 21 |
---|
Hi, thanks for the input.
I will write a new patch when I have time, probably tomorrow. The windings of building on Linux start to make sense to me and I think I finally see how to patch this.
However I did not understand your comment about cc
. I wanted to have a command that outputs something on everything Unix. Moreover, clang -dumpversion
dumps 4.2.1
for historical reasons so it also confirms it is not gcc, let alone gcc6.
comment:21 by , 8 years ago
(Actually Arch does seem to have a gcc5 package that is only required by one other package and I'll look sternly at anyone causing us to be added to that priviledged club.)
cc
is most likely (but needn't be) a symlink to some compiler, then again the generated Makefiles are not using cc
, unless you explicitly tell them to use that, which would mean setting CC
to that, but we're actually looking for c++
. That has the exact same issue as using cc
though and breaks in case someone wants to use a different compiler while cc
or c++
is actually GCC, and doesn't help the user who is doing the right thing and setting both CC
and CXX
(CPP
is the C preprocessor).
(Also default to the same thing as the Makefiles in case that isn't set, and ensure that is kept in sync, but then again don't do that because this approach is broken.)
comment:22 by , 8 years ago
Here is a patch that fixes all the segfaults that we spotted so far, based on the SM issue mentioned in the ticket.
However, I couldn't manage to test that g++
was the default compiler. I tried things like if [ -z "$CXX" -o ( echo $CXX | grep -q "g++" ) ]
but I couldn't get the syntax right after hours of attempts and google searches. Maybe there is a clever way...
Note: Fedora maintainers used different flags to take care of the issue, but they also disabled optimizations, which is a big deal for us, performance-wise (http://pkgs.fedoraproject.org/cgit/rpms/mozjs38.git/commit/?id=761399aba092bcb1299bb4fccfd60f370ab4216e).
by , 8 years ago
Attachment: | GCC6.2.patch added |
---|
comment:23 by , 8 years ago
Another little thing: some people patched the SM38 official tarball for the version number. The version is 38.2.1 but the tarball contains the version number 38.3.0. We didn't touch that but some maintainers fixed the inaccuracy. We should allow so-called 38.2 system versions to be used. The attached patch does just that.
by , 8 years ago
Attachment: | versioncheck.patch added |
---|
comment:24 by , 8 years ago
The optimisations should probably be disabled with GCC >= 6 (not just == 6).
comment:25 by , 8 years ago
Use CC=${CC:-gcc}
, which sets it to gcc in case it is not set or set to something empty. Since we assume that all compilers on *nix at least sort of imitate gcc (IIRC even icc does and support for that is likely more than just broken). (Checking that file we already use ${foo:=bar}
, so maybe just use that.) (Actually use CXX
, but that should have been clear by now.)
You can also chain test
(1) calls ([
is just that) by doing &&
, or better using exp1 -a exp2
; and as fabio said compare the numbers as just that.
The comment about why we are doing that could be a bit more explicit.
Also I do notice that we did clobber CXXFLAGS
previously...
by , 7 years ago
Attachment: | GCC6.3.patch added |
---|
comment:28 by , 7 years ago
I understand you must be tired of dealing with a newbie in sh scripting.
I realized that :=
is useless here since I don't pass the newly set value of CXX
to configure, but I can't do that on Windows I think. So I should just use :-
.
I don't test properly the value because CXX
could be /usr/bin/g++
or something, instead of g++
.
I fixed those things the way I could. I (really) hope that the problem was this, and that the fix is a proper one.
by , 7 years ago
Attachment: | GCC6.4.patch added |
---|
comment:29 by , 7 years ago
Nothing to with your knowledge of sh
.
Also no need to pass it, since we are just falling back to defaults, or the thing is already set at that point, so we don't need to do anything.
Now what happens if someone does not have g++
in $PATH
with that code snippet?
Or if someone has that be a symlink to clang++
(and yes, there are systems that do that that we do support). And why do we have that variable if we don't use it?
comment:30 by , 7 years ago
Yes, I should use ${CXX}
on line 38 (and use :=
on line 36), in case the user has their custom compiler not in $PATH
but properly set in the environment variable. Is that what you mean? Else, if they don't have g++
in $PATH
and don't set CXX
to anything I don't know how we can handle that.
As already stated above, clang++ -dumpversion
will output 4.2.1
so it works too.
And all that definitely has to do with my knowledge of the world of Unix developers. But I'm learning, as you can see.
comment:31 by , 7 years ago
Correct, and given that we set it to something if it isn't set, we do not need to check whether it is set, do we? So we should be down to two if
s, which should then just be merged as every compiler we support on those OSs does react sanely.
The comment should be updated to make it easier to find when doing SpiderMonkey upgrades, and properly state why we disable those, not that we do that which is obvious from the code.
comment:32 by , 7 years ago
I'm almost there but I cannot on my life figure out how to combine conditionals properly. It's been an hour now and I got nothing to work. I'm about to throw my laptop through the window. See what I meant about sh?
The closest thing that does work is: if echo "${CXX:=g++}" | grep -q "g++" && [ "`"${CXX:=g++}" -dumpversion | cut -f1 -d.`" -ge "6" ]
but that obviously misses the point. If I replace that by if echo "${CXX:=g++}" | grep -q "g++" && [ "`"${CXX}" -dumpversion | cut -f1 -d.`" -ge "6" ]
then I get an ominous "command not found" error. I suppose this is because the &&
separates both commands and unsets the value of CXX
. How am I supposed to get past that?
comment:33 by , 7 years ago
The grep
part is not needed, see above about the GCC compatibility on those OSs. What are you actually trying to test there? It could be some wrapper script that does check for sane input or list help output and not pass that to the actual compiler (unlikely, but not impossible).
So that isn't one of the checks I was talking about.
test
(aka [
) has -a
and -o
options, then again && works fine. Also the quotation marks in the part that checks the version are strange. Variable substitution also happens within strings, unless you use single quotes, and the quotes are there so you don't get command injection or unexpected things.
For the resetting that should not happen, as this is the same shell executing and not some subshell being run (as in (cmd params)
).
comment:34 by , 7 years ago
Ah sorry I got carried away with the grep thing. I was trying to handle the case where CXX
is set to something like /usr/bin/g++
but now that I correctly use the environment variable in the version test, I don't need to do that indeed.
I removed one pair of useless double quotes.
So here's the resulting patch. Thanks for your help.
If you don't mind, I would like to know why the other patch fails (in the assumption that I still want to grep something). Since you're saying that it should not happen I let you test it. I'm pretty sure I missed something but what?
FYI, the output I get is -dumpversion: command not found
which is probably because ${CXX}
there evaluates to an empty string.
by , 7 years ago
Attachment: | GCC6.5.patch added |
---|
by , 7 years ago
Attachment: | notworking.patch added |
---|
comment:35 by , 7 years ago
If the crashes are reproducible (#4250 ?) it would be nice to also add a test for this issue.
comment:36 by , 7 years ago
Looks sane.
For the notworking patch: Works with zsh, doesn't work with bash, dash, tcsh (because that uses different syntax), fish (because that also uses different syntax and doesn't really attempt to do POSIX) or mksh.
It should(TM) (someone point me to the docs in case I'm wrong) be working in the POSIX compatible (at least to some degree) ones of those (all but tcsh and fish) given that [1] states
Assign Default Values. If parameter is unset or null, the expansion of word (or an empty string if word is omitted) shall be assigned to parameter. In all cases, the final value of parameter shall be substituted. Only variables, not positional parameters or special parameters, can be assigned in this way.
.
And both If parameter is unset or null, the expansion of word is assigned to parameter. The value of parameter is then substituted. Positional parameters and special parameters may not be assigned to in this way.
[2] the bash docs as well as the mksh docs (not the ksh ones, because those links don't seem to work) If name is set and not NULL, it is substituted; otherwise, it is assigned word and the resulting value of name is substituted.
[3]. (Finding a sentence in the dash documentation very close to the wording from POSIX is left as an exercise to the reader.)
The zsh documentation states In the first form, if name is unset then set it to word; in the second form, if name is unset or null then set it to word; and in the third form, unconditionally set name to word. In all forms, the value of the parameter is then substituted.
[4] where the second form is :=
So I suspect that this is a bug in those shells as most uses of :=
actually assign the result to the variable itself, though that should not be needed (and their docs state as much).
[1] http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_06_02 [2] https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html [3] https://www.mirbsd.org/htman/i386/man1/mksh.htm [4] http://zsh.sourceforge.net/Doc/Release/Expansion.html#Parameter-Expansion
comment:37 by , 7 years ago
Keywords: | review patch added |
---|---|
Summary: | Building spidermonkey on GCC 6 results in segfaults in the Garbage Collector → [PATCH] Building spidermonkey on GCC 6 results in segfaults in the Garbage Collector |
Ok, thanks for the research - I feel relieved to know it's not a specificity of the language I couldn't grasp, but rather an issue with some shells.
As I suspected, I can't merge the conditionals, because even when the OS is Windows, the second part of the test is executed (leading to a noisy g++: command not found
on Windows). This wouldn't break the compilation but it's still bad.
fabio: I'm not quite sure how to test that. It so happens that the sprintf error triggers the segfault, so I could add a test where a wrong sprintf is called, but it would segfault if the test fails, preventing other tests to run, and it would error out when the test succeeds, which is not easy to change. Plus, this test won't check for other sources of segfaulting. I think the safest test we can run in this situation is playing the game.
by , 7 years ago
Attachment: | GCC6.6.patch added |
---|
comment:39 by , 7 years ago
Keywords: | review removed |
---|
comment:40 by , 7 years ago
Why do a workaround (disabling optimizations) instead of upgrading to SpiderMonkey >= v49?
comment:41 by , 7 years ago
Because the latest release of SpiderMonkey is 45, for which an upgrade was in the works last I checked. We also want to keep the game playable and disabling those optimizations is the only way to do that, since those break it so aren't useful anyway.
So this is not a workaround, but the only possible fix available at the moment. (Unless someone volunteers to backport likely huge amounts of code from a Firefox version that builds on GCC6.)
New comments on that issue, it seems to be more severe: