Vanfanel
Posts: 432
Joined: Sat Aug 18, 2012 5:58 pm

Segfaults that only happen on Raspberry Pi and not in X64. What's going on here?

Fri Mar 30, 2018 12:18 pm

Hi there,

I usually build some game engines for the Raspberry Pi, and I have noticed that some code segfaults on the Pi only. In all cases, the code deletes C++ objects or frees memory, and it runs in the program's exit path.

In ECWolf, for example, there are A LOT of delete[] calls causing segfaults on Pi only.

https://bitbucket.org/ecwolf/ecwolf/src ... t.cpp-1929

Also here:

https://bitbucket.org/ecwolf/ecwolf/src ... eo.cpp-359

On GZDoom it happens here:

https://github.com/coelckers/gzdoom/blo ... o.cpp#L731

As you can see, the code is similar and it always happens on destructors.
I can't see what's wrong with this from a logical point of view. Since it happens only on the Pi, the authors can't help much: most of them don't build software for the Pi at all, even though I always report these problems so that things keep working on the Pi platform.

Any ideas on what could be going on here?

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 22044
Joined: Sat Jul 30, 2011 7:41 pm

Re: Segfaults that only happen on Raspberry Pi and not in X64. What's going on here?

Fri Mar 30, 2018 2:22 pm

Difficult to tell. If the code is multithreaded, I'd suspect either a race condition or a use-after-free.

Try running under valgrind to see if that can find the exact error location, or under gdb?
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed. Here's an example...
"My grief counseller just died, luckily, he was so good, I didn't care."

Vanfanel
Posts: 432
Joined: Sat Aug 18, 2012 5:58 pm

Re: Segfaults that only happen on Raspberry Pi and not in X64. What's going on here?

Fri Mar 30, 2018 4:46 pm

@jamesh: I already used GDB to determine the locations where the Pi-only segfaults happen. How could Valgrind help here? I have never used it; I have only used GDB, for C programming.

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 22044
Joined: Sat Jul 30, 2011 7:41 pm

Re: Segfaults that only happen on Raspberry Pi and not in X64. What's going on here?

Fri Mar 30, 2018 4:53 pm

Tools like valgrind can detect things like use after free, and some threading issues.

Vanfanel
Posts: 432
Joined: Sat Aug 18, 2012 5:58 pm

Re: Segfaults that only happen on Raspberry Pi and not in X64. What's going on here?

Fri Mar 30, 2018 5:24 pm

@jamesh: OK, before I go and try Valgrind, do you have any idea what could go wrong on ARM vs X64 in a delete call? There's no threading involved to my knowledge, at least in ECWolf.

jahboater
Posts: 4173
Joined: Wed Feb 04, 2015 6:38 pm

Re: Segfaults that only happen on Raspberry Pi and not in X64. What's going on here?

Fri Mar 30, 2018 6:55 pm

Vanfanel wrote:
Fri Mar 30, 2018 4:46 pm
@jamesh: I already used GDB to determine the locations where the Pi-only segfaults happen. How could Valgrind help here? I have never used it; I have only used GDB, for C programming.

valgrind runs the executable on a simulated CPU, instrumenting each instruction and checking memory accesses as it goes.
It checks that every memory reference is valid, that common library calls are correct, that uninitialized memory is not being used, and so on. It is normally very slow, for obvious reasons. An alternative is the sanitizers available with gcc and clang, which are faster but not as thorough.

I suggest running your programs with valgrind on the x86 machine, which is faster and where I find valgrind works better.
Although your program crashes on the Pi, the cause of the problem may be visible on the x86 machine too; it may just be that, by chance, a memory corruption is not hitting anything important on x86 but is on the Pi. If it finds nothing on x86, then try valgrind on the Pi.

If you compile your program with debug info (-g), any messages from valgrind will include source line numbers, but you can just run valgrind on your plain release executable for a quick check if you want.

Just do

valgrind xxx

The advantage of valgrind is that it finds problems as they happen - at the root cause. GDB helps afterwards.
Your problem with free is common, but it could have been caused long ago - and valgrind might find when.

To use all the sanitizers:

Code:

SAN = -fno-sanitize-recover=all          \
      -fsanitize=undefined               \
      -fsanitize=address                 \
      -fsanitize-address-use-after-scope \
      -fsanitize=leak                    \
      -fsanitize=bounds                  \
      -fsanitize=bounds-strict           \
      -fsanitize=integer-divide-by-zero  \
      -fsanitize=float-divide-by-zero    \
      -fsanitize=float-cast-overflow     \
      -fsanitize=unreachable             \
      -fsanitize=vla-bound               \
      -fsanitize=null                    \
      -fsanitize=signed-integer-overflow \
      -fsanitize=object-size             \
      -fsanitize=bool                    \
      -fsanitize=enum                    \
      -fsanitize=return                  \
      -fsanitize=shift                   \
      -fsanitize=alignment

Another thing you can do is turn on as many compiler warnings as possible:

Code:

WARN = -Wfatal-errors -Wall -Wextra -Wconversion -Wunused -Wundef -Wcast-qual \
       -Wredundant-decls -Wunreachable-code -Wwrite-strings -Warray-bounds \
       -Wstrict-aliasing=3 -Wstrict-overflow=1 -Wstrict-prototypes -Winline \
       -Wshadow -Wswitch -Wmissing-include-dirs -Woverlength-strings -Wpacked \
       -Wdisabled-optimization -Wmissing-prototypes -Wformat=2 -Winit-self \
       -Wmissing-declarations -Wunused-parameter -Wlogical-op -Wuninitialized \
       -Wnested-externs -Wpointer-arith -Wdouble-promotion -Wunused-macros \
       -Wunused-function -Wunsafe-loop-optimizations -Wnull-dereference \
       -Wduplicated-cond -Wshift-overflow=2 -Wnonnull -Wcast-align -Warray-bounds=2

Vanfanel
Posts: 432
Joined: Sat Aug 18, 2012 5:58 pm

Re: Segfaults that only happen on Raspberry Pi and not in X64. What's going on here?

Fri Mar 30, 2018 9:40 pm

@jahboater: This Valgrind stuff is really crazy. I am running it on X86_64 because the Pi won't start the game in 10 minutes... Too slow for it.
ECWolf causes SO MANY warnings that it won't show them all!

Code:

==5361== 
==5361== More than 10000000 total errors detected.  I'm not reporting any more.
==5361== Final error counts will be inaccurate.  Go fix your program!
==5361== Rerun with --error-limit=no to disable this cutoff.  Note
==5361== that errors may occur in your program without prior warning from
==5361== Valgrind, because errors are no longer being displayed.
==5361== 

There are a lot of warnings like this:

Code:

==5361== Invalid write of size 4
==5361==    at 0x10F7484D: ??? (in /usr/lib/x86_64-linux-gnu/dri/i965_dri.so)
==5361==    by 0xF2AA017: ??? (in /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1.2.0)
==5361==    by 0x4EE8932: ??? (in /usr/lib/x86_64-linux-gnu/libSDL2-2.0.so.0.6.0)
==5361==    by 0x26BD1D: SDLFB::Update() (sdlvideo.cpp:856)
==5361==    by 0x253883: VH_UpdateScreen() (id_vh.cpp:151)
==5361==    by 0x2A3AD0: InitGame() (wl_main.cpp:484)
==5361==    by 0x2A56C2: WL_Main(int, char**) (wl_main.cpp:1247)
==5361==    by 0x2A5810: main (wl_main.cpp:1304)
==5361==  Address 0x7f31b7af8a1c is not stack'd, malloc'd or (recently) free'd
==5361== 

I have tried suppressing some, using valgrind --suppressions=FILE, where FILE has:

Code:

{
   ignore_unversioned_libs
   Memcheck:Leak
   ... 
   obj:/usr/lib/x86*/lib*.so
}
{
   ignore_versioned_libs
   Memcheck:Leak
   ... 
   obj:/usr/lib/x86*/lib*.so
}


But it seems to be ignored and I am still getting the same 100000+ warnings... Any idea how to get ONLY the warnings related to the "delete on exit" segfault?

Paeryn
Posts: 2512
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: Segfaults that only happen on Raspberry Pi and not in X64. What's going on here?

Sat Mar 31, 2018 3:37 am

I've just had a quick play around compiling ecwolf. I get a segfault if I try to use SDL2 (when initialising the sounds), but with SDL1.2 it runs fine with no segfaults for me. I'm just re-building it with optimisations on to see if that causes any... Nope, though there are tons of warnings about type conversions, cast alignments and shadowing.

One thing I did notice is that scanner.cpp compares a variable of type char against Tk_NoToken, which is defined as -1. On ARM that test will always be false, because plain char is unsigned by default on ARM, whereas it is signed on x86.

Wow... Here is the output from running with all the sanitisers on (I had to disable reporting of globals for the address sanitiser, as that reports a global buffer overflow early on and terminates the program even when told not to): https://drive.google.com/open?id=1F0UM6 ... UfARbnAKdU. And that was just up to starting the first level and quitting straight away. It still didn't segfault though.
She who travels light — forgot something.

jahboater
Posts: 4173
Joined: Wed Feb 04, 2015 6:38 pm

Re: Segfaults that only happen on Raspberry Pi and not in X64. What's going on here?

Sat Mar 31, 2018 8:16 am

Paeryn wrote:
Sat Mar 31, 2018 3:37 am
One thing I did notice is that scanner.cpp compares a variable of type char against Tk_NoToken, which is defined as -1. On ARM that test will always be false, because plain char is unsigned by default on ARM, whereas it is signed on x86.

Good catch!
You could perhaps use -fsigned-char on ARM (it won't be efficient, but never mind for now).

Another difference is that the x86 machine is 64-bit and the Pi is 32-bit...

jahboater
Posts: 4173
Joined: Wed Feb 04, 2015 6:38 pm

Re: Segfaults that only happen on Raspberry Pi and not in X64. What's going on here?

Sat Mar 31, 2018 8:31 am

Vanfanel wrote:
Fri Mar 30, 2018 9:40 pm
There are a lot of warnings like this:

Code:

==5361== Invalid write of size 4
==5361==    at 0x10F7484D: ??? (in /usr/lib/x86_64-linux-gnu/dri/i965_dri.so)
==5361==    by 0xF2AA017: ??? (in /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1.2.0)
==5361==    by 0x4EE8932: ??? (in /usr/lib/x86_64-linux-gnu/libSDL2-2.0.so.0.6.0)
==5361==    by 0x26BD1D: SDLFB::Update() (sdlvideo.cpp:856)
==5361==    by 0x253883: VH_UpdateScreen() (id_vh.cpp:151)
==5361==    by 0x2A3AD0: InitGame() (wl_main.cpp:484)
==5361==    by 0x2A56C2: WL_Main(int, char**) (wl_main.cpp:1247)
==5361==    by 0x2A5810: main (wl_main.cpp:1304)
==5361==  Address 0x7f31b7af8a1c is not stack'd, malloc'd or (recently) free'd
==5361== 

Well, how about starting by fixing this problem at sdlvideo.cpp line 856? Looks bad!

Nowadays I run valgrind (on x86) regularly, compile with all warnings on, and run with the sanitizers on, making sure the code is clean with all three before continuing. That way you only ever get a small number of warnings, and they are bound to be closely related to a change you have just made, which is fresh in your mind and easy to fix.

That doesn't help you though :(

Sometimes one single error can cause thousands of warnings, so I would start from the beginning and fix obvious problems (like the one above), and you never know, most of the rest might disappear!

On ARM:
Perhaps just compile with "-Wall -fsigned-char" initially, and fix anything obvious. Then add -Wconversion. -mno-unaligned-access might also reduce the alignment errors from valgrind.

Vanfanel
Posts: 432
Joined: Sat Aug 18, 2012 5:58 pm

Re: Segfaults that only happen on Raspberry Pi and not in X64. What's going on here?

Sat Apr 07, 2018 10:03 pm

I have some updates on this.
Your ideas are good, jahboater, but these segfaults on quit (on delete[] and free functions) seem to be happening on X86_64 too if I use the same SDL2 driver, KMSDRM, which I sent for merging...

https://github.com/spurious/SDL-mirror/ ... deo/kmsdrm

The problem is NOT happening inside the driver, but inside the game's code. For example, on GZDoom, it segfaults on quit here:

zstrings.cpp : 1275
free (this)

If I comment that out, it segfaults in symbols.cpp : 236
Symbols.Clear();

But that does NOT happen using the BRCM SDL2 driver on SDL2 or the X11 SDL2 driver on X86_64. It only happens when using SDL2 with the KMSDRM driver, both in ARM and X86_64.
The quitting functions from the KMSDRM driver all exit with no problem at all!
This is beyond strange and I don't know how to debug it. Valgrind throws so many errors at me that it stops counting, and building GZDoom with ASAN support makes the segfault go away. In other words, when building with ASAN, there are NO segfaults on exit. This is crazy...

The latest stable SDL 2.0.8 and GZDoom sources are here, in case someone can build SDL 2.0.8 with KMSDRM support (--enable-video-kmsdrm) and run GZDoom on it:
https://www.libsdl.org/release/SDL2-2.0.8.tar.gz
https://zdoom.org/files/gzdoom/src/gzdoom-g3.3.1.zip

Since the problem is reproducible on X86_64, too, it's easy and fast to try.
It's important to have this fixed because the VC4 driver stack is the future of the Raspberry Pi platform and SDL2 runs on it using the KMSDRM driver.

Paeryn
Posts: 2512
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: Segfaults that only happen on Raspberry Pi and not in X64. What's going on here?

Sun Apr 08, 2018 3:42 am

Vanfanel wrote:
Sat Apr 07, 2018 10:03 pm
I have some updates on this.
Your ideas are good, jahboater, but these segfaults on quit (on delete[] and free functions) seem to be happening on X86_64 too if I use the same SDL2 driver, KMSDRM, which I sent for merging...

https://github.com/spurious/SDL-mirror/ ... deo/kmsdrm

The problem is NOT happening inside the driver, but inside the game's code. For example, on GZDoom, it segfaults on quit here:

zstrings.cpp : 1275
free (this)

If I comment that out, it segfaults in symbols.cpp : 236
Symbols.Clear();

But that does NOT happen using the BRCM SDL2 driver on SDL2 or the X11 SDL2 driver on X86_64. It only happens when using SDL2 with the KMSDRM driver, both in ARM and X86_64.

If it only happens with SDL's kmsdrm driver, then I'd look into the driver to see if it is writing to memory that it doesn't own. It could be corrupting the game's data (or malloc's internal list of allocated memory). Did you compile SDL with ASAN as well, to make sure it is behaving?
