Compiling for Raspberry Pi 2


by hjimbens » Mon Feb 02, 2015 8:38 am
When I compile for Raspberry Pi 1 with gcc I use:
Code:
CFLAGS+=-DSTANDALONE -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -DTARGET_POSIX -D_LINUX -fPIC -DPIC -D_REENTRANT -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 -U_FORTIFY_SOURCE -Wall -g -DHAVE_LIBOPENMAX=2 -DOMX -DOMX_SKIP64BIT -ftree-vectorize -pipe -DUSE_EXTERNAL_OMX -DHAVE_LIBBCM_HOST -DUSE_EXTERNAL_LIBBCM_HOST -DUSE_VCHIQ_ARM -Wno-psabi

When I use clang I add:
Code:
CFLAGS+= -ccc-host-triple armv6-unknown-eabi -march=armv6 -mfpu=vfp -mcpu=arm1176jzf-s -mtune=arm1176jzf-s -mfloat-abi=hard

I suspect that these clang flags are hardwired in the gcc that comes with Raspbian.
What do I have to do when I want to compile for Pi 2 and get the best performance?
by jamesh » Mon Feb 02, 2015 10:14 am
I think that currently the gcc on Raspbian does not support cortex-a7-NEON. BUT, the linaro cross compiler on the Raspi github does support cortex-A7 NEON, so you can use that.

I get a huge speed up on the x264 library - up to 32x faster.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Please direct all questions to the forum, I do not do support via PM.
Raspberry Pi Engineer & Forum Moderator
by PeterO » Tue Feb 03, 2015 10:16 am
I can't quite remember the details, but when I wanted to build an I2S kernel module the other week, the instructions I followed included an upgrade to gcc. I **think** it moved from version 4.8 to 4.9. The 4.9 documents do suggest that A7 and NEON are supported:

https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html

I'll be trying some things when I get to try my Pi 2 (probably not before Wednesday evening).

PeterO
Discoverer of the PI2 XENON DEATH FLASH!
Interests: C,Python,PIC,Electronics,Ham Radio (G0DZB),Aeromodelling,1960s British Computers.
"The primary requirement (as we've always seen in your examples) is that the code is readable. " Dougie Lawson
by jahboater » Wed Feb 04, 2015 6:47 pm
For gcc 4.8 I use
gcc-4.8 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard

This gave me the best code and was the only option that enabled sdiv/udiv.
To get gcc 4.8 just do
sudo apt-get install gcc-4.8

For gcc 4.6 (the default compiler on Raspbian)
gcc -march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard

For a Pi 1 (B+)
gcc -mcpu=arm1176jzf-s -mfpu=vfp -mfloat-abi=hard

Hope this helps.
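
A quick way to check what a particular set of flags really switches on is to ask the compiler itself through its predefined macros. The sketch below is only an illustration: the helper names (has_neon, has_hw_divide, has_fma, has_hard_float_abi, print_target_features) are made up for this example, and older compilers may not define every macro even when the feature is enabled, so a 0 is best read as "not reported" rather than "definitely absent".

```c
#include <stdio.h>

/* 1 if the compiler is generating code with NEON (Advanced SIMD) enabled. */
int has_neon(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

/* 1 if the compiler may emit hardware integer divide (sdiv/udiv). */
int has_hw_divide(void)
{
#if defined(__ARM_FEATURE_IDIV) || defined(__ARM_ARCH_EXT_IDIV__)
    return 1;
#else
    return 0;
#endif
}

/* 1 if fused multiply-add is available to the compiler (e.g. VFPv4). */
int has_fma(void)
{
#if defined(__ARM_FEATURE_FMA)
    return 1;
#else
    return 0;
#endif
}

/* 1 if the 32-bit ARM hard-float calling convention is in use. */
int has_hard_float_abi(void)
{
#if defined(__ARM_PCS_VFP)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: print one line per feature for easy comparison. */
void print_target_features(void)
{
    printf("NEON:            %d\n", has_neon());
    printf("hardware divide: %d\n", has_hw_divide());
    printf("fused mul-add:   %d\n", has_fma());
    printf("hard-float ABI:  %d\n", has_hard_float_abi());
}
```

Calling print_target_features() from main and building the same file once per flag set makes the differences visible side by side. The raw macro list can also be dumped without any source file, for example with "gcc -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard -dM -E - < /dev/null | grep -i arm".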
by gregeric » Fri Feb 06, 2015 5:33 pm
And when compiling large projects on the Pi2, use "make -j 4" to take advantage of the extra cores.
by RoyLongbottom » Sun Feb 08, 2015 3:01 pm
jahboater wrote:For gcc 4.8 I use
gcc-4.8 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
.

Thanks for those. I am running my benchmarks on the RPi 2. One of the first to try was the Linpack benchmark. With those gcc 4.6 parameters, speeds were little different to my original (gcc linpack.c cpuidc.c -lm -lrt -O3 -march=armv6 -mfloat-abi=hard -mfpu=vfp -o linpackPiA6). I installed gcc 4.8 and that was 28% faster (with your parameters), but results of numeric calculations were different - need to think about it.

I have posted the first results (original MP-MFLOPs) to show MP use. The gcc 4.8 version was up to 9% faster with one numeric result slightly different, suggesting different rounding. Performance results are here:

viewtopic.php?p=688425#p688425
by PeterO » Sun Feb 08, 2015 5:43 pm
I'm pretty sure the NEON extensions are only single precision (float) rather than double precision (double) instructions.

The FFTW I'm using came out with all the functions called fftwf_xxxxxx (which doesn't seem to be documented anywhere I could find).

PeterO
by PiGraham » Sun Feb 08, 2015 9:01 pm
Optimal build for Quake3?
Please suggest tweaks.

Code:
#!/bin/bash
# this script builds q3 with SDL
# invoke with ./build.sh
# or ./build.sh clean to clean before build

# directory containing the ARM shared libraries (rootfs, lib/ of SD card)
# specifically libEGL.so and libGLESv2.so
ARM_LIBS=/opt/vc/lib
SDL_LIB=lib

# directory containing baseq3/ containing .pk3 files - baseq3 on CD
BASEQ3_DIR="/home/${USER}/"

# directory to find khronos linux make files (with include/ containing
# headers! Make needs them.)
INCLUDES="-I/opt/vc/include -I/opt/vc/include/interface/vcos/pthreads"

# prefix of arm cross compiler installed
#CROSS_COMPILE=bcm2708-

# clean
if [ $# -ge 1 ] && [ "$1" = clean ]; then
   echo "clean build"
   rm -rf build/*
fi

# sdl not disabled
make -j4 -f Makefile COPYDIR="$BASEQ3_DIR" ARCH=arm \
        CC="${CROSS_COMPILE}gcc-4.8" USE_SVN=0 USE_CURL=0 USE_OPENAL=0 \
        CFLAGS="-DVCMODS_MISC -DVCMODS_OPENGLES -DVCMODS_DEPTH -DVCMODS_REPLACETRIG $INCLUDES -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard" \
        LDFLAGS="-L$ARM_LIBS -L$SDL_LIB -lSDL -lvchostif -lvmcs_rpc_client -lvcfiled_check -lbcm_host -lkhrn_static -lvchiq_arm -lopenmaxil -lEGL -lGLESv2 -lvcos -lrt"

# copy the required pak3 files over
# cp "$BASEQ3_DIR"/baseq3/*.pk3 "build/release-linux-arm/baseq3/"
# cp -a lib build/release-linux-arm/baseq3/
exit 0



Performance is VERY impressive.

Overclock arm freq 1100, overvolt 6

Video mode 1280x1024
Lighting Vertex
Geometric detail high
Texture detail all the way up the slider
Texture quality 32bit
Texture filter Trilinear

90fps in the demo, a good 60+ in multiplayer with 5 bots

Never below 30fps with 12 bots and lots of action.
by RoyLongbottom » Mon Feb 09, 2015 10:23 am
PeterO wrote:I'm pretty sure the Neon extensions are only single precision
PeterO

I found that my Linpack performance gains came from using VFPv4, which provides fused multiply-add. I have a C program for NEON Linpack for Android to convert for the RPi. When it was produced, NEON could only handle single precision and there was no automatic vectorisation, so intrinsics had to be used instead. The following is the code used for the performance-dependent daxpy function (instead of #ifdef ROLL or UNROLL):

Code:
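#ifdef NEON

    /* NEON path: dx and dy are accessed as single precision (float) data */
    float  cf[4];
    float32x4_t x41, y41, c41, r41;
    float32_t   *ptrx1 = (float32_t *)dx;
    float32_t   *ptry1 = (float32_t *)dy;
    float32_t   *ptrc1 = (float32_t *)cf;

    /* replicate the scalar multiplier da into four lanes */
    for (i=0; i<4; i++)
    {
     cf[i] = da;
    }

    /* handle the first n % 4 elements with scalar code */
    m = n % 4;
    if ( m != 0)
    {
            for (i = 0; i < m; i++)
                    dy[i] = dy[i] + da*dx[i];

            if (n < 4) return;
    }

    /* remaining elements, four at a time: dy = dy + da*dx */
    ptrx1 = ptrx1 + m;
    ptry1 = ptry1 + m;
    c41 = vld1q_f32(ptrc1);
    for (i = m; i < n; i=i+4)
    {
        x41 = vld1q_f32(ptrx1);
        y41 = vld1q_f32(ptry1);

        r41 = vmlaq_f32(y41, x41, c41);   /* r41 = y41 + x41*c41 */
        vst1q_f32(ptry1, r41);

        ptrx1 = ptrx1 + 4;
        ptry1 = ptry1 + 4;
    }
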
#endif
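
For anyone wanting to try the same idea outside the Linpack source, here is a minimal self-contained sketch of a single-precision "y = y + a*x" routine using the same intrinsics. It is not the benchmark's code: the function name saxpy and its signature are made up for this example, the leftover elements are handled after the vector loop rather than before it, and a plain C fallback is used when NEON is not enabled so the file builds anywhere.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* y[i] += a * x[i] for i in [0, n).
   x and y must really be float arrays: casting a double pointer to
   float32_t* reinterprets the bits, it does not convert the values. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t va = vdupq_n_f32(a);          /* broadcast a to 4 lanes */
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);    /* load 4 floats from x */
        float32x4_t vy = vld1q_f32(y + i);    /* load 4 floats from y */
        vy = vmlaq_f32(vy, vx, va);           /* vy + vx * va         */
        vst1q_f32(y + i, vy);                 /* store back to y      */
    }
#endif

    /* scalar tail (or the whole loop when NEON is not enabled) */
    for (; i < n; i++)
        y[i] += a * x[i];
}
```

Built with something like "gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard" the vector loop is used; without NEON enabled the same source falls back to the scalar loop, which makes it easy to compare results between the two builds.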

by mrvn » Sat Feb 14, 2015 12:23 pm
Under Debian (jessie/sid) one can use the gcc-arm-none-eabi cross compiler package. That gcc is compiled with a number of libgcc flavours for different CPU cores. One has to pick the right flags to get the optimized libgcc for the RPi2 and to optimize the source itself. /usr/share/doc/gcc-arm-none-eabi/readme.txt.gz shows all the possible options. Relevant for the RPi2 are the last three:

  • Cortex-A* (No FP): [-mthumb] -march=armv7-a
  • Cortex-A* (Soft FP): [-mthumb] -march=armv7-a -mfloat-abi=softfp -mfpu=vfpv3-d16
  • Cortex-A* (Hard FP): [-mthumb] -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16
Posts: 58
Joined: Wed Jan 09, 2013 6:50 pm
by jamesh » Sat Feb 14, 2015 12:43 pm
Neon floating point doesn't adhere entirely to the IEEE standard.

From https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html

If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=‘neon’), note that floating-point operations are not generated by GCC's auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.
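
To make the "denormal values are treated as zero" part concrete, here is a small sketch that emulates flush-to-zero in plain C and compares it with ordinary IEEE 754 behaviour. It does not touch any hardware control register; the function names (flush_subnormal, half_ieee, half_ftz) are invented for this illustration, and the emulation is an assumption about what a flush-to-zero unit does rather than a model of any particular NEON implementation.

```c
#include <float.h>

/* Emulate flush-to-zero: a subnormal value becomes a zero of the same sign. */
float flush_subnormal(float x)
{
    float mag = x < 0.0f ? -x : x;
    if (mag != 0.0f && mag < FLT_MIN)   /* FLT_MIN is the smallest normal float */
        return x < 0.0f ? -0.0f : 0.0f;
    return x;
}

/* Halve x with normal IEEE 754 gradual underflow. */
float half_ieee(float x)
{
    return x / 2.0f;
}

/* Halve x as a flush-to-zero unit would: input and result are both flushed. */
float half_ftz(float x)
{
    return flush_subnormal(flush_subnormal(x) / 2.0f);
}
```

With IEEE 754 arithmetic, half_ieee(FLT_MIN) is a small nonzero subnormal, while half_ftz(FLT_MIN) is exactly zero; for ordinary values such as 1.0f the two agree. Code that depends on very small intermediate values can therefore give different answers once such operations are vectorised.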
by jbeale » Sun Feb 15, 2015 4:04 pm
I tried recompiling some C code on RP2 that does a single-precision 2D inverse FFT with 4096 x 4096 elements, using the current Raspbian gcc without special flags, and also gcc-4.8 with them. I did 4 runs with each version, and the time in the FFT routine averaged 62 seconds in both cases. Is this expected, or should there be an improvement? There was one tiny difference; the older GCC code took 62 or 63 seconds on each run. The newer code varied from 61 to 64 seconds. The system was otherwise unloaded (except for 'top' running, showing the FFT code at 99.8% ... 100.3% cpu) in both cases.

Code:
Compile flags:
(Case 1) gcc -ansi -pedantic -Wall -O4   (Raspbian gcc version 4.6.3)
(Case 2) gcc-4.8 -ansi -pedantic -Wall -O4 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard -funsafe-math-optimizations


If anyone's interested, the code is http://www.bealecorner.org/best/gforge/Hflab095.zip and inside the interpreter my test was:
Code:
cfill 4096; tic; ifft; toc


EDIT: using the built-in repeat function the variability went away, but still nearly the same speed.
Code:
HL>repeat 10 cfill 4096; tic; ifft; toc; pop
62 seconds.
62 seconds.
61 seconds.
62 seconds.
62 seconds.
62 seconds.
62 seconds.
61 seconds.
62 seconds.
62 seconds.

HL-ARMv7>repeat 10 cfill 4096; tic; ifft; toc; pop
62 seconds.
62 seconds.
61 seconds.
61 seconds.
61 seconds.
61 seconds.
61 seconds.
61 seconds.
61 seconds.
61 seconds.


Note: with 4 threads each at 100% CPU, my power meter says 0.47 A @ 5.02V = 2.36 W, and after a few minutes cpu temperature = 53 C

Interestingly, if I run four instances of the "hl" program each doing a large FFT job, each one takes about 156 seconds to complete instead of 62 seconds, so with all four CPUs busy, each thread is running at 40% of the full speed it had when running alone. I guess the problem is the memory bus access, since each 4096x4096 complex array takes up 134 MB it won't fit in any kind of local cache, and the FFT process is highly non-local, accessing every element frequently.
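
The figures in this post can be checked with a few lines of arithmetic. The sketch below assumes each complex element is stored as two 4-byte single-precision floats (which matches the 134 MB quoted); the function names are made up for the example.

```c
#include <stddef.h>

/* Bytes needed for an n x n array of complex values,
   each stored as two single-precision floats (assumed 4 bytes each). */
size_t complex_array_bytes(size_t n)
{
    return n * n * 2 * sizeof(float);
}

/* Speed of one instance under load relative to running alone. */
double relative_speed(double alone_seconds, double loaded_seconds)
{
    return alone_seconds / loaded_seconds;
}

/* Combined throughput of several simultaneous instances,
   in units of "one instance running alone". */
double aggregate_throughput(int instances,
                            double alone_seconds, double loaded_seconds)
{
    return instances * relative_speed(alone_seconds, loaded_seconds);
}
```

complex_array_bytes(4096) gives 134,217,728 bytes, relative_speed(62, 156) is just under 0.40, and aggregate_throughput(4, 62, 156) comes out at roughly 1.6. In other words, on this workload four cores together delivered about 1.6 times the throughput of one core, which is consistent with the memory-bound explanation.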
by PeterO » Mon Feb 16, 2015 5:31 pm
I can't check now, but I think I've read somewhere that you have to use "compiler intrinsics" to get gcc to generate code that uses NEON. The compiler switch is not enough by itself.

NOTE: I don't have any experience with using said intrinsics !

PeterO
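
Going by the GCC documentation quoted earlier in the thread, intrinsics are not the only route: the auto-vectoriser can also emit NEON floating-point code, but only when -funsafe-math-optimizations is given. A loop has to be simple enough for the vectoriser to handle, though. The following is a hedged sketch of such a loop; the function name scale_add is hypothetical, and whether a given gcc version actually vectorises it should be checked rather than assumed.

```c
#include <stddef.h>

/* y[i] = y[i] + a * x[i], written to be friendly to the auto-vectoriser:
   a simple counted loop, unit stride, and restrict-qualified pointers so
   the compiler may assume x and y do not overlap. Needs C99 for restrict. */
void scale_add(size_t n, float a,
               const float *restrict x, float *restrict y)
{
    size_t i;
    for (i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}
```

A possible build line, using only flags mentioned in this thread plus -std=c99, is "gcc-4.8 -std=c99 -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard -funsafe-math-optimizations -S scale_add.c". Looking in the generated scale_add.s for instructions working on q registers (for example vld1 and vmla on q0, q1, ...) shows whether NEON was used; on gcc versions that support it, -fopt-info-vec reports which loops were vectorised.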
by RoyLongbottom » Tue Feb 17, 2015 3:23 pm
My Linpack NEON code for Linux is above. This runs on ARM CPUs via Android:

viewtopic.php?p=689390#p689390

Besides this, #include <arm_neon.h> is needed, and the program can be compiled on the RPi with the following (but not with some of the other things suggested):

gcc linpackneon.c cpuidc.c -lm -lrt -O3 -mcpu=cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4 -o linpackNEONPi

The benchmark runs and appears to obtain the same speed as ARM via Android, with a similar CPU MHz, but numerical calculations end up as nan (not a number).

Any suggestions?
by hydra3333 » Thu Feb 19, 2015 6:16 am
I am starting to learn about compiling for the Pi. APIs, ABIs ... so confusing.

There are many different search results on what arguments to use with "make", and which GCC compiler version to use, (4.9.2 apparently requires a patch http://www.intestinate.com/pilfs/patche ... ault.patch ) https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html

A surprising find indicates "neon" on the Cortex-A7 (neon-vfpv4 ?) can produce either unexpected calculation results or "not a number" results.

I don't know what the bottom line from it is, however could it be "use neon at your own risk" ?

https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
if the selected floating-point hardware includes the NEON extension (e.g. -mfpu=‘neon’), note that floating-point operations are not generated by GCC's auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.
viewtopic.php?p=688601&sid=175982e1174bc153f1eb0b539b6fa653#p688601
The gcc 4.8 version was up to 9% faster with one numeric result slightly different, suggesting different rounding.
viewtopic.php?p=697401&sid=175982e1174bc153f1eb0b539b6fa653#p697401
The benchmark runs and appears to obtain the same speed as ARM via Android, with a similar CPU MHz, but numerical calculations end up as nan (not a number).

http://en.wikipedia.org/wiki/ARM_archit ... _.28VFP.29
seems to indicate either VFPv4-D16 or VFPv4

Anyway, for compiling ffmpeg and a few of its dependencies from the latest sources, I'm unsure what to use and how, so could someone confirm whether it's like this?
Code:
./configure -j 4 --enable-static ?--arch=armhf? --target-os=linux --enable-static --disable-asm --enable-gpl --prefix=/usr --enable-nonfree -mcpu=cortex-a7 -mfpu=vfpv4 -march=armv7-a -mfpu=vfpv4 -mfloat-abi=hard -funsafe-math-optimizations -mtune=cortex-a7 -mcpu=cortex-a7 
sudo make
sudo make install
Also, which compiler ? Should I install gcc 4.9.2 just because it's the latest ? If so, I'll google it. And would it work with the arguments above ?

jamesh wrote:I get a huge speed up on the x264 library - up to 32x faster.
Would you be able to post all your commandlines ? (including wget's etc) ?
by jahboater » Thu Feb 19, 2015 8:41 am
I would remove the -march flag.
gcc-4.8 fails
error: switch -mcpu=cortex-a7 conflicts with -march=armv7-a switch
and it's not needed anyway.
by jamesh » Thu Feb 19, 2015 9:10 am
hydra3333 wrote:
jamesh wrote:I get a huge speed up on the x264 library - up to 32x faster.
Would you be able to post all your commandlines ? (including wget's etc) ?


You will need to change the CCJPREFIX line to point to where you have installed the linaro cross compiler (from raspi github tools repo)

Code:
git clone git://git.videolan.org/x264.git
cd x264

export CCJPREFIX="/home/james/projects/raspberrypi/tools/arm-bcm2708/gcc-linaro-arm-linux-gnueabihf-raspbian-x64/bin/arm-linux-gnueabihf-"

./configure --host=arm-linux --cross-prefix=${CCJPREFIX} --enable-static --extra-cflags="-mcpu=cortex-a7 -mfpu=neon-vfpv4"

make

by hydra3333 » Sat Feb 21, 2015 5:17 am
I thought I'd try a native Pi2 compile using standard gcc 4.6.3 on the Pi2 for the Pi2 ... why not.
My first attempt is from-scratch download and building from source of many of the ffmpeg external libraries, prior to building ffmpeg.
I fluked getting most external libraries working with some trial and error. A few I could not, and as a newbie I am still unsure what to do with them (as noted in the .sh below).
In line with Pi2's gcc 4.6.3 as at 2015.02.21 and prior to more googling, most external library builds use
Code:
snip, out of date

edit: slightly updated
edit2: outdated code, removed

also: I've stopped posting this here; further updates are over at forum viewtopic.php?f=67&t=100108
by eriktheitalian » Tue Mar 17, 2015 9:28 am
I respect the Raspberry Pi developers, and I'm happy with the RPi 2. But...

I'm tired of the kernel compilation process. I'm not using a cross-compiler; I work on the Raspberry Pi full time. There is no documentation on building an optimized kernel. When I used Debian on x86 I found lots of documentation on optimized kernels. For example, I built a kernel optimized for a Core i7 Ivy Bridge. I can't do that kind of fine tuning on the Raspbian kernel.

I'm building the Raspbian kernel, but I don't know whether it is optimized for ARMv6 or ARMv7.

I checked what "-march=native" means for gcc 4.9.2. It resolves to "armv7ve".

My "gcc -v" output includes "-march=armv6".
I cant using enough English language. My writings can be wrong grammer.$
"in micro$oft we not trust"
by eriktheitalian » Tue Mar 17, 2015 9:39 am
What do I mean by optimized kernel building documentation?

Real time process optimized kernel.
Xorg, GUI response/latency optimized kernel.
Server optimized kernel.
Memory usage optimized kernel.
IO transfer optimized kernel.
Gpu optimized.
Power management focused kernel.
And more...
by eriktheitalian » Sat Mar 28, 2015 8:09 am
I'm working on finding the performance-related gcc flags for the RPi 2. I'm confused: there are lots of documents and lots of ideas, but I can't find a clean, simple source. (This is not a Raspberry Pi problem; it is universal.) I need to test the options one by one.
by eriktheitalian » Sat Mar 28, 2015 8:20 am
I found out how to see what -O3 or -O4 means:
"gcc -c -Q -O3 --help=optimizers | grep enabled"
by eriktheitalian » Sat Mar 28, 2015 10:01 am
RoyLongbottom wrote:My Linpack NEON code for Linux is above. This runs on ARM CPUs via Android:

viewtopic.php?p=689390#p689390

Besides this, #include <arm_neon.h> is needed, and the program can be compiled on the RPi with the following (but not with some of the other things suggested):

gcc linpackneon.c cpuidc.c -lm -lrt -O3 -mcpu=cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4 -o linpackNEONPi

The benchmark runs and appears to obtain the same speed as ARM via Android, with a similar CPU MHz, but numerical calculations end up as nan (not a number).

Any suggestions?


Thanks for the important benchmark. It's very useful for testing options one by one.
by AlessandroFerri » Fri Apr 24, 2015 5:33 pm
Thank you all for dealing with this topic.
I have compiled PulseAudio and other programs on the Raspberry Pi 2 in the same way as I compiled them for the Raspberry Pi B+. I followed the instructions in this post:

https://www.raspberrypi.org/forums/view ... 29&t=87138

But now I have realized that, built this way, PulseAudio uses only one core on the Raspberry Pi 2 and not all four; I can see this by running top. I think other parameters must be passed to ./configure and make.

Do you have any suggestions?
by ejolson » Tue Apr 28, 2015 2:51 am
AlessandroFerri wrote:Thank you all for dealing with this topic.
I have compiled PulseAudio and other programs on the Raspberry Pi 2 in the same way as I compiled them for the Raspberry Pi B+. I followed the instructions in this post:

https://www.raspberrypi.org/forums/view ... 29&t=87138

But now I have realized that, built this way, PulseAudio uses only one core on the Raspberry Pi 2 and not all four; I can see this by running top. I think other parameters must be passed to ./configure and make.

Do you have any suggestions?


Your question is about parallel and threaded processing. I don't think PulseAudio is written in a way that scales to multi-core architectures. For this you probably want JACK2

https://en.wikipedia.org/wiki/JACK_Audio_Connection_Kit

which reportedly has SMP multi-processor scalability. It would be nice to have a separate forum topic about parallel processing where this and related topics could be discussed.