Opened 18 months ago

Closed 14 months ago

Last modified 13 months ago

#4053 closed defect (fixed)

[PATCH] Building spidermonkey on GCC 6 results in segfaults in the Garbage Collector

Reported by: sanderd17 Owned by: echotangoecho
Priority: Release Blocker Milestone: Alpha 21
Component: Core engine Keywords: patch
Cc: leper, Yves Patch:

Description

When building Spidermonkey on GCC 6, there are quite often segfaults appearing in the garbage collector.

This happens especially often when a lot of data is flushed, like when finalising the load stage of a random map. And for some reason, it happens more often when the game window isn't focused (when it's in the background).

I tested this on our current SpiderMonkey (v31) and on leper's upgrade to v38, and both fail with the same symptoms.

A typical backtrace can be found here: http://pastebin.com/YWRzvpkK

After asking on #jsapi, it appears that these issues are only fixed from v49 onwards. And for now, we need to disable certain optimisations.

In particular the optimisations listed on their ticket 1245783, comment 33.

Attachments (14)

autorequestfix.patch (447 bytes) - added by echotangoecho 17 months ago.
autorequestfix_v2.patch (8.3 KB) - added by elexis 17 months ago.
Found some more by searching for all occurances of JS::RootedValue, JS::RootedObject and GetContext (grep -R 'GetContext();' -A 1).
tmpfixgcsegfault.patch (1.5 KB) - added by echotangoecho 17 months ago.
fixtrace.patch (1.0 KB) - added by echotangoecho 17 months ago.
tmpfixgcsegfault2.patch (2.5 KB) - added by echotangoecho 15 months ago.
tmpfixgcsegfault3.patch (2.5 KB) - added by echotangoecho 15 months ago.
GCC6.patch (909 bytes) - added by Itms 14 months ago.
GCC6.2.patch (1.9 KB) - added by Itms 14 months ago.
versioncheck.patch (587 bytes) - added by Itms 14 months ago.
GCC6.3.patch (2.0 KB) - added by Itms 14 months ago.
GCC6.4.patch (2.0 KB) - added by Itms 14 months ago.
GCC6.5.patch (2.0 KB) - added by Itms 14 months ago.
notworking.patch (2.0 KB) - added by Itms 14 months ago.
GCC6.6.patch (2.0 KB) - added by Itms 14 months ago.

Download all attachments as: .zip

Change History (55)

comment:1 Changed 18 months ago by sanderd17

New comments on that issue, it seems to be more severe:

<terrence-afk> the frontend appears to be marking a nullptr, which is not allowed
<terrence-afk> see frame 3 and 4
<terrence-afk> appears to be here: https://dxr.mozilla.org/mozilla-central/source/js/src/frontend/ParseNode.cpp#1188
<terrence-afk> I have no idea why that should happen... you should probably consult someone who's touched the frontend recently; maybe jonco or shu
<terrence-afk> sanderd17: hurm, actually if &object is nullptr, implies that so is |this| in that context...
<terrence-afk> and since we reach there via an internal list that is guarded on box == nullptr.... that's totally screwed up
<terrence-afk> if I were you, I would reproduce this under |rr| and then just walk the cause backwards from there to figure out what gcc is miscompiling
<terrence-afk> note that the jit isn't really implicated in this crash

comment:2 Changed 18 months ago by sanderd17

Summary: Building spidermonkey on GCC 6 results in sefaults in the Garbage CollectorBuilding spidermonkey on GCC 6 results in segfaults in the Garbage Collector

comment:3 Changed 17 months ago by echotangoecho

Keywords: patch review added
Owner: set to echotangoecho
Summary: Building spidermonkey on GCC 6 results in segfaults in the Garbage Collector[PATCH] Building spidermonkey on GCC 6 results in segfaults in the Garbage Collector

When in debug mode, an assertion was triggered hinting at creating a Rooted<T> somewhere outside of a request.

This patch fixes that issue, and during my testing the segfaults did not occur anymore. It could still be that a similar issue also happens elsewhere, in a code section not executed during my testing.

Changed 17 months ago by echotangoecho

Attachment: autorequestfix.patch added

comment:4 Changed 17 months ago by echotangoecho

During further testing, it seems that the issue is unfortunately not fixed with the above patch and only occurs when compiling in release mode.

comment:5 Changed 17 months ago by elexis

In 18451:

Add missing JSAutoRequest calls. Based on patch by echotangoecho, refs #4053.

Changed 17 months ago by elexis

Attachment: autorequestfix_v2.patch added

Found some more by searching for all occurances of JS::RootedValue, JS::RootedObject and GetContext (grep -R 'GetContext();' -A 1).

Changed 17 months ago by echotangoecho

Attachment: tmpfixgcsegfault.patch added

comment:6 Changed 17 months ago by echotangoecho

In the changelog for GCC 6:

Value range propagation now assumes that the this pointer of C++ member functions is non-null. This eliminates common null pointer checks but also breaks some non-conforming code-bases (such as Qt-5, Chromium, KDevelop). As a temporary work-around -fno-delete-null-pointer-checks can be used. Wrong code can be identified by using -fsanitize=undefined.

When adding -fno-delete-null-pointer-checks the segfaults don't seem to occur anymore, but I'm not sure if this is the right way to add it.

Changed 17 months ago by echotangoecho

Attachment: fixtrace.patch added

comment:7 Changed 17 months ago by echotangoecho

It seems there is the possibility for another segfault in our own code: When ObjectIdCache is constructed, ObjectIdCache::Trace is added as a GC roots tracer, while the ObjectIdCache is not properly initialized until ObjectIdCache::init is called, causing a segfault when a GC run occurs between running the ObjectIdCache constructor and ObjectIdCache::init.

comment:8 Changed 15 months ago by Itms

Keywords: review removed

Thanks for working on it. Since the SpiderMonkey 38 upgrade is almost ready, I would like to push those changes to after the update, in order to avoid fixing conflicts this late in the review process of #3708.

You will need to rebase the GC segfault patch when the update happens, the two other patches look good to me.

Changed 15 months ago by echotangoecho

Attachment: tmpfixgcsegfault2.patch added

Changed 15 months ago by echotangoecho

Attachment: tmpfixgcsegfault3.patch added

comment:9 Changed 15 months ago by echotangoecho

Keywords: rfc added

comment:10 Changed 14 months ago by Itms

In 18730:

Add missing JSAutoRequest calls. (hopefully all of them have been spotted now)

Patch by echotangoecho, refs #4053

comment:11 Changed 14 months ago by elexis

Patch not by echotangoecho, and I'm very sure I haven't found all calls, since I have aborted my search at some point. Had found them by simple string searches for GetContext().

comment:12 Changed 14 months ago by Itms

In 18757:

Fix a possible source of segmentation fault in our custom version of ObjectIdCache?.

Patch by echotangoecho, refs #4053

comment:13 Changed 14 months ago by Itms

Resolution: fixed
Status: newclosed

In 18762:

Workaround for the segfault issue we have with SpiderMonkey with GCC6, patch by echotangoecho.

The issue could be fixed by applying patches from upstream, but they are not compatible with VS2013, which doesn't support unrestricted unions.

Fixes #4053

comment:14 Changed 14 months ago by Itms

Keywords: rfc removed

Thanks a lot for those patches. I had to investigate upstream a bit for the last two ones.

As mentioned in the commit message, the patches that would fix the issue are not compatible with vs2013. I'm not too fond of patching the SM code itself for one platform and not for another, so I rather committed your fix.

It adds a slew of warnings with clang under OSX, but we already have a huge number of those warnings and we'd just need an active OSX dev in order to improve the situation. In the meantime we can now have GCC6 devs ;)

comment:15 Changed 14 months ago by Itms

In 18767:

Revert r18762 after a report that segfaults still occured.

Support for GCC6 will be pushed to A22. Refs #4053

comment:16 Changed 14 months ago by Itms

Keywords: patch removed
Milestone: Alpha 21Alpha 22
Resolution: fixed
Status: closedreopened
Summary: [PATCH] Building spidermonkey on GCC 6 results in segfaults in the Garbage CollectorBuilding spidermonkey on GCC 6 results in segfaults in the Garbage Collector

So the patch above did not fix segfaults in SM with GCC6: here is a backtrace provided by Elinvention: http://hastebin.com/cereyolawa.go

I decided to revert that workaround since it does not cover every case, it's better to tell people GCC6 is not supported at all, so they don't use an unstable version of the game.

comment:17 Changed 14 months ago by fabio

If this is the plan, then the build system should error, or at least print a warning, when building with GCC 6. This way everyone trying building it is noticed of the possible issue. (On quitting with an error one can always remove the check if he really want as it is already done in Debian when building with root user.)

comment:18 in reply to:  17 Changed 14 months ago by Itms

Replying to fabio:

If this is the plan, then the build system should error, or at least print a warning, when building with GCC 6.

I agree. Actually 0ad can be built with GCC6, SM38 can't. Repository managers of distros where GCC6 has become the default patched SM38 with workarounds, so people on those distros should just use that version.

What do you think of the attached patch?

Changed 14 months ago by Itms

Attachment: GCC6.patch added

comment:19 Changed 14 months ago by leper

Why is the latest patch using cc if explicitly setting a different compiler should be supported? Why is it not checking whether that compiler actually is g++ (which also means that cc above is wrong in two different ways)?

Why do we assume that distros have a packaged recent SpiderMonkey (we are pretty much the only thing requiring that and in lots of cases we depend on some patches, so using the bundled one is a lot easier; or they at least have a source for those patches), have some old version of GCC or another compiler, and that distros should do the work of patching SpiderMonkey if we don't even bother to do that.

Why did the (reverted) commit patch something instead of just adding something to the exact same environment variable? Also the bug report in the description does list two flags that are needed, though from looking at a few other tickets referenced from there that could not be all that is needed.

And dropping support for GCC6 means dropping Arch, Fedora, Debian Unstable (though GCC6 is going to end up in other versions too; though here an older version is more likely to be available), openSUSE Tumbleweed (though that likely also has other versions in the repos), and Ubuntu 16.10 (at least with the default compiler). (And will likely surprise a few Gentoo/Funtoo? users using the newest GCC.) Which mostly means dropping support for nearly all those distributions/variants that will actually ship the new release.

comment:20 Changed 14 months ago by Itms

Milestone: Alpha 22Alpha 21

Hi, thanks for the input.

I will write a new patch when I have time, probably tomorrow. The windings of building on Linux start to make sense to me and I think I finally see how to patch this.

However I did not understand your comment about cc. I wanted to have a command that outputs something on everything Unix. Moreover, clang -dumpversion dumps 4.2.1 for historical reasons so it also confirms it is not gcc, let alone gcc6.

comment:21 Changed 14 months ago by leper

(Actually Arch does seem to have a gcc5 package that is only required by one other package and I'll look sternly at anyone causing us to be added to that priviledged club.)

cc is most likely (but needn't be) a symlink to some compiler, then again the generated Makefiles are not using cc, unless you explicitly tell them to use that, which would mean setting CC to that, but we're actually looking for c++. That has the exact same issue as using cc though and breaks in case someone wants to use a different compiler while cc or c++ is actually GCC, and doesn't help the user who is doing the right thing and setting both CC and CXX (CPP is the C preprocessor).

(Also default to the same thing as the Makefiles in case that isn't set, and ensure that is kept in sync, but then again don't do that because this approach is broken.)

comment:22 Changed 14 months ago by Itms

Here is a patch that fixes all the segfaults that we spotted so far, based on the SM issue mentioned in the ticket.

However, I couldn't manage to test that g++ was the default compiler. I tried things like if [ -z "$CXX" -o ( echo $CXX | grep -q "g++" ) ] but I couldn't get the syntax right after hours of attempts and google searches. Maybe there is a clever way...

Note: Fedora maintainers used different flags to take care of the issue, but they also disabled optimizations, which is a big deal for us, performance-wise (http://pkgs.fedoraproject.org/cgit/rpms/mozjs38.git/commit/?id=761399aba092bcb1299bb4fccfd60f370ab4216e).

Changed 14 months ago by Itms

Attachment: GCC6.2.patch added

comment:23 Changed 14 months ago by Itms

Another little thing: some people patched the SM38 official tarball for the version number. The version is 38.2.1 but the tarball contains the version number 38.3.0. We didn't touch that but some maintainers fixed the inaccuracy. We should allow so-called 38.2 system versions to be used. The attached patch does just that.

Changed 14 months ago by Itms

Attachment: versioncheck.patch added

comment:24 Changed 14 months ago by fabio

The optimisations should probably be disabled with GCC >= 6 (not just == 6).

comment:25 Changed 14 months ago by leper

Use CC=${CC:-gcc}, which sets it to gcc in case it is not set or set to something empty. Since we assume that all compilers on *nix at least sort of imitate gcc (IIRC even icc does and support for that is likely more than just broken). (Checking that file we already use ${foo:=bar}, so maybe just use that.) (Actually use CXX, but that should have been clear by now.)

You can also chain test(1) calls ([ is just that) by doing &&, or better using exp1 -a exp2; and as fabio said compare the numbers as just that.

The comment about why we are doing that could be a bit more explicit.

Also I do notice that we did clobber CXXFLAGS previously...

Last edited 14 months ago by leper (previous) (diff)

comment:26 Changed 14 months ago by Itms

Something like that? :)

Changed 14 months ago by Itms

Attachment: GCC6.3.patch added

comment:27 Changed 14 months ago by leper

No.

comment:28 Changed 14 months ago by Itms

I understand you must be tired of dealing with a newbie in sh scripting.

I realized that := is useless here since I don't pass the newly set value of CXX to configure, but I can't do that on Windows I think. So I should just use :-.

I don't test properly the value because CXX could be /usr/bin/g++ or something, instead of g++.

I fixed those things the way I could. I (really) hope that the problem was this, and that the fix is a proper one.

Changed 14 months ago by Itms

Attachment: GCC6.4.patch added

comment:29 Changed 14 months ago by leper

Nothing to with your knowledge of sh.

Also no need to pass it, since we are just falling back to defaults, or the thing is already set at that point, so we don't need to do anything.

Now what happens if someone does not have g++ in $PATH with that code snippet? Or if someone has that be a symlink to clang++ (and yes, there are systems that do that that we do support). And why do we have that variable if we don't use it?

comment:30 Changed 14 months ago by Itms

Yes, I should use ${CXX} on line 38 (and use := on line 36), in case the user has their custom compiler not in $PATH but properly set in the environment variable. Is that what you mean? Else, if they don't have g++ in $PATH and don't set CXX to anything I don't know how we can handle that.

As already stated above, clang++ -dumpversion will output 4.2.1 so it works too.

And all that definitely has to do with my knowledge of the world of Unix developers. But I'm learning, as you can see.

comment:31 Changed 14 months ago by leper

Correct, and given that we set it to something if it isn't set, we do not need to check whether it is set, do we? So we should be down to two ifs, which should then just be merged as every compiler we support on those OSs does react sanely.

The comment should be updated to make it easier to find when doing SpiderMonkey upgrades, and properly state why we disable those, not that we do that which is obvious from the code.

comment:32 Changed 14 months ago by Itms

I'm almost there but I cannot on my life figure out how to combine conditionals properly. It's been an hour now and I got nothing to work. I'm about to throw my laptop through the window. See what I meant about sh?

The closest thing that does work is: if echo "${CXX:=g++}" | grep -q "g++" && [ "`"${CXX:=g++}" -dumpversion | cut -f1 -d.`" -ge "6" ]

but that obviously misses the point. If I replace that by if echo "${CXX:=g++}" | grep -q "g++" && [ "`"${CXX}" -dumpversion | cut -f1 -d.`" -ge "6" ]

then I get an ominous "command not found" error. I suppose this is because the && separates both commands and unsets the value of CXX. How am I supposed to get past that?

comment:33 Changed 14 months ago by leper

The grep part is not needed, see above about the GCC compatibility on those OSs. What are you actually trying to test there? It could be some wrapper script that does check for sane input or list help output and not pass that to the actual compiler (unlikely, but not impossible).

So that isn't one of the checks I was talking about.

test (aka [) has -a and -o options, then again && works fine. Also the quotation marks in the part that checks the version are strange. Variable substitution also happens within strings, unless you use single quotes, and the quotes are there so you don't get command injection or unexpected things.

For the resetting that should not happen, as this is the same shell executing and not some subshell being run (as in (cmd params)).

comment:34 Changed 14 months ago by Itms

Ah sorry I got carried away with the grep thing. I was trying to handle the case where CXX is set to something like /usr/bin/g++ but now that I correctly use the environment variable in the version test, I don't need to do that indeed.

I removed one pair of useless double quotes.

So here's the resulting patch. Thanks for your help.

If you don't mind, I would like to know why the other patch fails (in the assumption that I still want to grep something). Since you're saying that it should not happen I let you test it. I'm pretty sure I missed something but what? FYI, the output I get is -dumpversion: command not found which is probably because ${CXX} there evaluates to an empty string.

Changed 14 months ago by Itms

Attachment: GCC6.5.patch added

Changed 14 months ago by Itms

Attachment: notworking.patch added

comment:35 Changed 14 months ago by fabio

If the crashes are reproducible (#4250 ?) it would be nice to also add a test for this issue.

comment:36 Changed 14 months ago by leper

Looks sane.

For the notworking patch: Works with zsh, doesn't work with bash, dash, tcsh (because that uses different syntax), fish (because that also uses different syntax and doesn't really attempt to do POSIX) or mksh. It should(TM) (someone point me to the docs in case I'm wrong) be working in the POSIX compatible (at least to some degree) ones of those (all but tcsh and fish) given that [1] states Assign Default Values. If parameter is unset or null, the expansion of word (or an empty string if word is omitted) shall be assigned to parameter. In all cases, the final value of parameter shall be substituted. Only variables, not positional parameters or special parameters, can be assigned in this way..

And both If parameter is unset or null, the expansion of word is assigned to parameter. The value of parameter is then substituted. Positional parameters and special parameters may not be assigned to in this way.[2] the bash docs as well as the mksh docs (not the ksh ones, because those links don't seem to work) If name is set and not NULL, it is substituted; otherwise, it is assigned word and the resulting value of name is substituted.[3]. (Finding a sentence in the dash documentation very close to the wording from POSIX is left as an exercise to the reader.)

The zsh documentation states In the first form, if name is unset then set it to word; in the second form, if name is unset or null then set it to word; and in the third form, unconditionally set name to word. In all forms, the value of the parameter is then substituted.[4] where the second form is :=

So I suspect that this is a bug in those shells as most uses of := actually assign the result to the variable itself, though that should not be needed (and their docs state as much).

[1] http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_06_02 [2] https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html [3] https://www.mirbsd.org/htman/i386/man1/mksh.htm [4] http://zsh.sourceforge.net/Doc/Release/Expansion.html#Parameter-Expansion

comment:37 Changed 14 months ago by Itms

Keywords: review patch added
Summary: Building spidermonkey on GCC 6 results in segfaults in the Garbage Collector[PATCH] Building spidermonkey on GCC 6 results in segfaults in the Garbage Collector

Ok, thanks for the research - I feel relieved to know it's not a specificity of the language I couldn't grasp, but rather an issue with some shells.

As I suspected, I can't merge the conditionals, because even when the OS is Windows, the second part of the test is executed (leading to a noisy g++: command not found on Windows). This wouldn't break the compilation but it's still bad.

fabio: I'm not quite sure how to test that. It so happens that the sprintf error triggers the segfault, so I could add a test where a wrong sprintf is called, but it would segfault if the test fails, preventing other tests to run, and it would error out when the test succeeds, which is not easy to change. Plus, this test won't check for other sources of segfaulting. I think the safest test we can run in this situation is playing the game.

Changed 14 months ago by Itms

Attachment: GCC6.6.patch added

comment:38 Changed 14 months ago by Itms

Resolution: fixed
Status: reopenedclosed

In 18801:

Disable some optimizations when building SpiderMonkey with GCC6.
This fixes the segfault issues we encounter when building with this compiler.

According to https://bugzilla.mozilla.org/show_bug.cgi?id=1245783#c33, this is enough to fix this issue. The fixes for SpiderMonkey to address the underlying problem are included in its version 49.

Fixes #4053, refs #4250

comment:39 Changed 14 months ago by Itms

Keywords: review removed

comment:40 Changed 13 months ago by Melroy van den Berg

Why do a workaround (disabling optimizations) instead of upgrading to SpiderMonkey >= v49?

comment:41 Changed 13 months ago by leper

Because the latest release of SpiderMonkey is 45, for which an upgrade was in the works last I checked. We also want to keep the game playable and disabling those optimizations is the only way to do that, since those break it so aren't useful anyway.

So this is not a workaround, but the only possible fix available at the moment. (Unless someone volunteers to backport likely huge amounts of code from a Firefox version that builds on GCC6.)

Note: See TracTickets for help on using tickets.