augusto.beiro
Posts: 12
Joined: Sun Jan 22, 2012 12:52 pm

Re: FPU flags test

Sat Jan 28, 2012 12:23 pm

Hi everybody,

I'm new to embedded software and hardware and, like everyone here, I can't wait for the RPi launch.

In the meantime I've been running some tests to learn how the ARM architecture works. As the RPi will support my favourite distro, Debian, I've set up a development environment using qemu and Debian squeeze for ARM.

However, Debian's stock software is compiled for generic ARM, which may not be optimal for the RPi hardware. The most important factor is the presence of an FPU. I've read this post

http://www.raspberrypi.org/for.....erformance

but I wanted to run tests of my own.

So I wrote these tests and ran them on a Samsung Galaxy (the only ARM machine I have). The Samsung has a Cortex-A8, but I think the results should apply to the RPi in some way.

The test program has two modules: custom code (Carmack's square root implementation and a decimal pi calculation) and standard libc calls to sqrt and acos. The first module has no libc dependencies; the second relies on the libc implementation. Both are floating-point intensive.
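For readers who don't know the trick, "Carmack's square root" most likely refers to the well-known fast inverse square root bit hack. The sketch below is an assumption about what the custom module does, not the actual test code (which was C and isn't shown in this post); the function names are made up for illustration.

```python
import struct


def fast_inv_sqrt(x):
    """Approximate 1/sqrt(x) for x > 0 using the classic 0x5F3759DF bit trick."""
    # Reinterpret the 32-bit float as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Magic constant minus half the bit pattern gives a first guess.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step refines the guess.
    return y * (1.5 - 0.5 * x * y * y)


def fast_sqrt(x):
    """Approximate sqrt(x) as x * (1/sqrt(x))."""
    return x * fast_inv_sqrt(x)


if __name__ == "__main__":
    for value in (0.25, 2.0, 4.0, 100.0):
        print(value, fast_sqrt(value))
```

Code like this is pure floating-point arithmetic with no library calls, which is why it only depends on the compiler flags and not on how libc was built.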

The tests:

IMPORTANT NOTE: I couldn't set a fixed-width font in the forum. You can see a spreadsheet here:

https://docs.google.com/spreadsheet/pub ... utput=html

=== LEGEND

static tag    = statically linked
ulibc tag     = uclibc instead of libc
cortex-a8 tag = -mcpu=cortex-a8
fpu tag       = FPU optimizations
noop tag      = no optimizations at all

===

Target: arm-linux-gnueabi

Configured with: ../src/configure -v --with-pkgversion='Debian 4.4.5-8' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs

--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.4 --enable-shared --enable-multiarch

--enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix

--with-gxx-include-dir=/usr/include/c++/4.4 --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug

--enable-objc-gc --disable-sjlj-exceptions --enable-checking=release --build=arm-linux-gnueabi --host=arm-linux-gnueabi

--target=arm-linux-gnueabi

Thread model: posix

gcc version 4.4.5 (Debian 4.4.5-8)

OPS="-lm -O3 -fomit-frame-pointer -pipe -ftree-vectorize"

CPU=" -mcpu=cortex-a8 "

FPU=" -mfpu=neon -mfloat-abi=softfp -ffast-math "

STATIC_ULIBC="-static /usr/arm-linux-uclibc/usr/lib/libm.a "

STATIC_LIBC="-lm -static "

DYNAMIC=" -lm "
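As a rough sketch of how the tags in each binary name could map onto these flag groups: the flag strings are copied from above, but the mapping itself (in particular that "noop" drops OPS entirely, and the helper name flags_for) is my assumption for illustration, not the actual build script.

```python
# Flag groups copied from the post.
OPS = "-lm -O3 -fomit-frame-pointer -pipe -ftree-vectorize"
CPU = "-mcpu=cortex-a8"
FPU = "-mfpu=neon -mfloat-abi=softfp -ffast-math"
STATIC_ULIBC = "-static /usr/arm-linux-uclibc/usr/lib/libm.a"
STATIC_LIBC = "-lm -static"
DYNAMIC = "-lm"


def flags_for(binary_name):
    """Guess the compiler flags behind a binary name from its tags (assumed mapping)."""
    tags = binary_name.split("-")[1:]        # e.g. ['cortex', 'a8', 'fpu', 'static']
    parts = [] if "noop" in tags else [OPS]  # assumption: 'noop' means OPS is left out
    if "cortex" in tags:
        parts.append(CPU)
    if "fpu" in tags:
        parts.append(FPU)
    # Exactly one linking mode is chosen per binary.
    if "ulibc" in tags:
        parts.append(STATIC_ULIBC)
    elif "static" in tags:
        parts.append(STATIC_LIBC)
    else:
        parts.append(DYNAMIC)
    return " ".join(parts)


if __name__ == "__main__":
    for name in ("c_sqrt-arm", "c_sqrt-arm-noop", "c_sqrt-cortex-a8-fpu-static-ulibc"):
        print(name, "->", flags_for(name))
```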

Command                               Custom (s)  libmath (s)  iterations/sec  run time (s)  lastval
./c_sqrt-arm                          3.529000    4.112000     65443.0039      7.6410        1.569382
./c_sqrt-arm-static                   3.465000    3.007000     77263.5938      6.4720        1.569382
./c_sqrt-arm-static-ulibc             3.455000    3.017000     77263.5938      6.4720        1.569382
./c_sqrt-cortex-a8                    3.429000    2.952000     78353.1797      6.3820        1.569382
./c_sqrt-cortex-a8-fpu                1.427000    1.941000     148470.9062     3.3680        1.569382
./c_sqrt-cortex-a8-fpu-static         1.418000    1.942000     148824.4062     3.3600        1.569382
./c_sqrt-cortex-a8-fpu-static-ulibc   1.410000    1.931000     149670.7500     3.3410        1.569382

./c_sqrt-arm                          3.418000    2.967000     78316.3672      6.3850        1.569382
./c_sqrt-arm-static                   3.376000    2.953000     79009.3203      6.3290        1.569382
./c_sqrt-arm-static-ulibc             3.362000    2.961000     79084.2969      6.3230        1.569382
./c_sqrt-cortex-a8                    3.364000    2.958000     79096.8047      6.3220        1.569382
./c_sqrt-cortex-a8-fpu                1.354000    1.955000     151072.5000     3.3100        1.569382
./c_sqrt-cortex-a8-fpu-static         1.354000    1.926000     152454.2656     3.2800        1.569382
./c_sqrt-cortex-a8-fpu-static-ulibc   1.365000    1.927000     151898.5469     3.2920        1.569382

./c_sqrt-arm                          3.399000    2.964000     78587.1406      6.3630        1.569382
./c_sqrt-arm-noop                     4.088000    3.005000     70489.1484      7.0940        1.569382
./c_sqrt-arm-static                   3.359000    2.949000     79272.3516      6.3080        1.569382
./c_sqrt-arm-static-ulibc             3.351000    2.946000     79410.8281      6.2970        1.569382
./c_sqrt-cortex-a8                    3.383000    2.960000     78834.9375      6.3430        1.569382
./c_sqrt-cortex-a8-fpu                1.354000    1.941000     151760.2500     3.2950        1.569382
./c_sqrt-cortex-a8-fpu-static         1.355000    1.942000     151668.1875     3.2970        1.569382
./c_sqrt-cortex-a8-fpu-static-ulibc   1.354000    1.926000     152454.2656     3.2800        1.569382

So we can see an improvement when using the FPU. The first module (custom code) benefits a lot because all of it is affected by the optimization flags. The second module (libmath) relies on libc, so in theory the flags still have an effect, but not as much.

* The cortex-a8 tag makes no noticeable difference.
* The FPU flags improve the custom code about 2.5x and the libmath calls about 2.1x.
* Static uclibc and dynamic libc behave the same.
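A quick way to check ratios like these against the table is to divide the baseline time by the optimized time. The sketch below uses the third GCC 4.4.5 run from the results above; the dictionary layout and the helper name speedup are just for illustration.

```python
# (custom_s, libmath_s) copied from the third GCC 4.4.5 run in the post.
timings = {
    "c_sqrt-arm":           (3.399, 2.964),
    "c_sqrt-cortex-a8":     (3.383, 2.960),
    "c_sqrt-cortex-a8-fpu": (1.354, 1.941),
}


def speedup(baseline, candidate):
    """Per-column speedup: baseline time divided by candidate time."""
    return tuple(b / c for b, c in zip(timings[baseline], timings[candidate]))


custom_x, libmath_x = speedup("c_sqrt-cortex-a8", "c_sqrt-cortex-a8-fpu")
cpu_custom_x, cpu_libmath_x = speedup("c_sqrt-arm", "c_sqrt-cortex-a8")

print(f"FPU flags:     custom {custom_x:.2f}x, libmath {libmath_x:.2f}x")
print(f"cortex-a8 tag: custom {cpu_custom_x:.2f}x, libmath {cpu_libmath_x:.2f}x")
```

On this run the custom code comes out at about 2.5x and libmath at about 1.5x. The 2.1x libmath figure appears to need the very first ./c_sqrt-arm row (4.112 s) as the baseline, so it depends on which run you compare against.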

Now, let's look at Sourcery G++:

Target: arm-none-linux-gnueabi

Configured with: /scratch/mitchell/builds/4.3-arm-none-linux-gnueabi-respin/src/gcc-4.3/configure --build=i686-pc-linux-gnu

--host=i686-pc-linux-gnu --target=arm-none-linux-gnueabi --enable-threads --disable-libmudflap --disable-libssp --disable-libstdcxx-pch

--with-gnu-as --with-gnu-ld --with-specs='%{funwind-tables|fno-unwind-tables|mabi=*|ffreestanding|nostdlib:;:-funwind-tables}'

--enable-languages=c,c++ --enable-shared --enable-symvers=gnu --enable-__cxa_atexit --with-pkgversion='Sourcery G++ Lite 2009q1-203'

--with-bugurl=https://support.codesourcery.com/GNUToolchain/ --disable-nls --prefix=/opt/codesourcery

--with-sysroot=/opt/codesourcery/arm-none-linux-gnueabi/libc --with-build-sysroot=/scratch/mitchell/builds/4.3-arm-none-linux-gnueabi-respin/lite/install/arm-none-linux-gnueabi/libc

--with-gmp=/scratch/mitchell/builds/4.3-arm-none-linux-gnueabi-respin/lite/obj/host-libs-2009q1-203-arm-none-linux-gnueabi-i686-pc-linux-gnu/usr

--with-mpfr=/scratch/mitchell/builds/4.3-arm-none-linux-gnueabi-respin/lite/obj/host-libs-2009q1-203-arm-none-linux-gnueabi-i686-pc-linux-gnu/usr

--disable-libgomp --enable-poison-system-directories --with-build-time-tools=/scratch/mitchell/builds/4.3-arm-none-linux-gnueabi-respin/lite/install/arm-none-linux-gnueabi/bin

--with-build-time-tools=/scratch/mitchell/builds/4.3-arm-none-linux-gnueabi-respin/lite/install/arm-none-linux-gnueabi/bin

Thread model: posix

gcc version 4.3.3 (Sourcery G++ Lite 2009q1-203)

OPS="-lm -O3 -fomit-frame-pointer -pipe -ftree-vectorize"

CPU=" -mcpu=cortex-a8 "

FPU=" -mfpu=neon -mfloat-abi=softfp -ffast-math "

STATIC_ULIBC="-static /usr/arm-linux-uclibc/usr/lib/libm.a "

STATIC_LIBC="-lm -static "

DYNAMIC=" -lm "

Command                               Custom (s)  libmath (s)  iterations/sec  run time (s)  lastval
./sc_sqrt-arm                         3.514000    3.175000     74712.3828      6.6930        1.569382
./sc_sqrt-arm-noop                    4.292000    2.976000     68801.5938      7.2680        1.569382
./sc_sqrt-arm-static                  3.331000    3.023000     77455.0781      6.4560        1.569382
./sc_sqrt-arm-static-ulibc            3.247000    2.934000     80888.0625      6.1820        1.569382
./sc_sqrt-cortex-a8                   3.265000    3.004000     79765.5156      6.2690        1.569382
./sc_sqrt-cortex-a8-fpu               1.021000    1.963000     167577.0781     2.9840        1.569382
./sc_sqrt-cortex-a8-fpu-static        1.026000    1.914000     170085.0312     2.9400        1.569382
./sc_sqrt-cortex-a8-fpu-static-ulibc  1.020000    2.035000     163682.4844     3.0550        1.569382

./sc_sqrt-arm                         3.476000    3.246000     74390.0625      6.7230        1.569382
./sc_sqrt-arm-noop                    4.403000    3.271000     65161.5859      7.6740        1.569382
./sc_sqrt-arm-static                  3.622000    3.162000     73710.2031      6.7840        1.569382
./sc_sqrt-arm-static-ulibc            3.349000    3.174000     76659.5156      6.5230        1.569382
./sc_sqrt-cortex-a8                   3.564000    3.283000     73031.9844      6.8470        1.569382
./sc_sqrt-cortex-a8-fpu               1.118000    2.135000     153719.6406     3.2530        1.569382
./sc_sqrt-cortex-a8-fpu-static        1.130000    2.554000     135735.6094     3.6840        1.569382
./sc_sqrt-cortex-a8-fpu-static-ulibc  1.244000    2.313000     140581.9531     3.5570        1.569382

* The cortex-a8 tag makes no noticeable difference here either.
* The FPU flags improve the custom code about 3x. As the system's libc is the same, libmath behaves the same.
* Static uclibc and dynamic libc behave the same.

After two days of compiling on qemu ARM, I've managed to package an optimized libc6.

So:

Target: arm-linux-gnueabi

Configured with: ../src/configure -v --with-pkgversion='Debian 4.4.5-8' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs

--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.4 --enable-shared --enable-multiarch

--enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix

--with-gxx-include-dir=/usr/include/c++/4.4 --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug

--enable-objc-gc --disable-sjlj-exceptions --enable-checking=release --build=arm-linux-gnueabi --host=arm-linux-gnueabi

--target=arm-linux-gnueabi

Thread model: posix

gcc version 4.4.5 (Debian 4.4.5-8)

OPS="-lm -O3 -fomit-frame-pointer -pipe -ftree-vectorize"

CPU=" -mcpu=cortex-a8 "

FPU=" -mfpu=neon -mfloat-abi=softfp -ffast-math "

STATIC_ULIBC="-static /usr/arm-linux-uclibc/usr/lib/libm.a "

STATIC_LIBC="-lm -static "

DYNAMIC=" -lm "

LIBC_CUSTOM VERSION: 2.11.2-10-cortex-a8.2

LIBC FLAGS: -O3 -g -mcpu=cortex-a8 -fomit-frame-pointer -mfpu=vfp -mfloat-abi=softfp

Command                               Custom (s)  libmath (s)  iterations/sec  run time (s)  lastval
./c_sqrt-arm                          3.775000    0.842000     107537.6328     4.6510        1.569382
./c_sqrt-arm-noop                     4.290000    0.987000     94760.2812      5.2780        1.569382
./c_sqrt-arm-static                   3.451000    3.009000     77395.1406      6.4610        1.569382
./c_sqrt-arm-static-ulibc             3.708000    3.088000     73580.0469      6.7960        1.569382
./c_sqrt-cortex-a8                    3.654000    1.056000     106167.7266     4.7100        1.569382
./c_sqrt-cortex-a8-fpu                1.369000    0.596000     254478.3750     1.9650        1.569382
./c_sqrt-cortex-a8-fpu-static         1.393000    2.223000     138249.9375     3.6170        1.569382
./c_sqrt-cortex-a8-fpu-static-ulibc   1.388000    1.956000     149536.4844     3.3440        1.569382

./c_sqrt-arm                          3.811000    0.880000     106597.7422     4.6910        1.569382
./c_sqrt-arm-noop                     4.563000    0.927000     91083.7891      5.4900        1.569382
./c_sqrt-arm-static                   3.829000    3.094000     72230.2500      6.9230        1.569382
./c_sqrt-arm-static-ulibc             3.964000    3.314000     68707.0625      7.2780        1.569382
./c_sqrt-cortex-a8                    3.703000    0.860000     109587.9922     4.5630        1.569382
./c_sqrt-cortex-a8-fpu                1.446000    0.634000     240408.6562     2.0800        1.569382
./c_sqrt-cortex-a8-fpu-static         1.464000    2.060000     141898.4062     3.5240        1.569382
./c_sqrt-cortex-a8-fpu-static-ulibc   1.540000    2.175000     134602.9688     3.7150        1.569382

./sc_sqrt-arm                         3.406000    0.838000     117825.1641     4.2440        1.569382
./sc_sqrt-arm-noop                    4.092000    0.845000     101265.6953     4.9380        1.569382
./sc_sqrt-arm-static                  3.407000    3.111000     76706.5469      6.5190        1.569382
./sc_sqrt-arm-static-ulibc            3.318000    2.937000     79944.0469      6.2550        1.569382
./sc_sqrt-cortex-a8                   3.348000    0.855000     118974.5391     4.2030        1.569382
./sc_sqrt-cortex-a8-fpu               1.033000    0.606000     304908.5312     1.6400        1.569382
./sc_sqrt-cortex-a8-fpu-static        1.050000    1.944000     167017.3750     2.9940        1.569382
./sc_sqrt-cortex-a8-fpu-static-ulibc  1.057000    1.917000     168140.5469     2.9740        1.569382

./sc_sqrt-arm                         3.318000    0.829000     120581.1406     4.1470        1.569382
./sc_sqrt-arm-noop                    4.285000    0.847000     97437.6484      5.1320        1.569382
./sc_sqrt-arm-static                  3.337000    2.937000     79701.9453      6.2740        1.569382
./sc_sqrt-arm-static-ulibc            3.307000    2.935000     80110.5391      6.2420        1.569382
./sc_sqrt-cortex-a8                   3.330000    0.848000     119686.4531     4.1780        1.569382
./sc_sqrt-cortex-a8-fpu               1.036000    0.609000     303981.7500     1.6450        1.569382
./sc_sqrt-cortex-a8-fpu-static        1.034000    1.923000     169107.2031     2.9570        1.569382
./sc_sqrt-cortex-a8-fpu-static-ulibc  1.033000    1.922000     169221.6562     2.9550        1.569382

Now we can see that dynamically linked code runs much faster:

* Custom code behaves the same.
* libmath code gains 4x-5x when linked dynamically.
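To put numbers on the libmath gain, the sketch below compares the libmath column before and after installing the custom libc. The values are copied from the GCC 4.4.5 tables above (third stock-libc run, first custom-libc run); which runs to pair up is my choice, so treat the exact ratios as approximate.

```python
# libmath column (seconds) for GCC 4.4.5 binaries, copied from the tables in the post.
stock_libc = {
    "c_sqrt-arm":           2.964,
    "c_sqrt-cortex-a8-fpu": 1.941,
    "c_sqrt-arm-static":    2.949,
}
custom_libc = {
    "c_sqrt-arm":           0.842,
    "c_sqrt-cortex-a8-fpu": 0.596,
    "c_sqrt-arm-static":    3.009,
}

# Gain of each binary against itself once the optimized libc is installed.
gains = {name: stock_libc[name] / custom_libc[name] for name in stock_libc}
for name, gain in gains.items():
    print(f"{name:24s} {gain:.2f}x")

# Best case: fpu binary on the custom libc versus the plain binary on the stock libc.
best_case = stock_libc["c_sqrt-arm"] / custom_libc["c_sqrt-cortex-a8-fpu"]
print(f"best case                {best_case:.2f}x")
```

With this pairing the dynamic binaries gain roughly 3.3x to 3.5x against themselves, and close to 5x when the fpu binary on the new libc is compared with the plain binary on the stock libc. The static binaries stay where they were, which is expected since they carry their own copy of libm.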

Conclusions:

* Sourcery G++ seems to produce slightly faster code than gcc.
* FPU flags are a must.
* For generic distros such as Debian, libc and the main libraries should be recompiled with FPU options.
* With these options applied, floating-point operations run 3.5x-4x faster.

dom
Raspberry Pi Engineer & Forum Moderator
Posts: 5502
Joined: Wed Aug 17, 2011 7:41 pm
Location: Cambridge

Re: FPU flags test

Sat Jan 28, 2012 6:34 pm

You are completely correct.

All Debian packages (including the C runtime) are built with software floating point. This is very slow for software that makes heavy use of floating point.

We have produced replacement Debian C runtime libraries that are built with -mfpu=vfp. They do make some apps run faster (nbench reports a floating point index 2.7 times faster when libm is built this way).

Building for armv6 helps a little (you get a memcpy with a prefetch instruction in it, which is a little faster).

However, the Debian packages have only been tested against the default libraries and there could be incompatibilities, so there may be packages that won't work with these libraries. We'll make these available for people who want to experiment.

There are also the packages and additional libraries you install with apt-get. These will just be built with software fp and for armv5.

Building with the hardware fp ABI additionally saves 20 cycles per fp parameter on each function call, so that would be nice to have.
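As a back-of-the-envelope sketch of what 20 cycles per parameter adds up to: only the 20-cycle figure comes from this post. The call count, the parameter count and the 700 MHz clock below are assumptions picked for illustration.

```python
CYCLES_PER_FP_PARAM = 20  # figure quoted in the post


def cycles_saved(calls, fp_params_per_call):
    """Cycles saved by passing fp parameters in FPU registers (hard-float ABI)."""
    return CYCLES_PER_FP_PARAM * calls * fp_params_per_call


def seconds_saved(calls, fp_params_per_call, clock_hz):
    """Convert the saved cycles into seconds at a given CPU clock."""
    return cycles_saved(calls, fp_params_per_call) / clock_hz


if __name__ == "__main__":
    # Assumed workload: one million calls to a one-argument function such as sqrt(x),
    # on a CPU assumed to run at 700 MHz.
    calls, params, clock_hz = 1_000_000, 1, 700e6
    print(cycles_saved(calls, params), "cycles")
    print(f"{seconds_saved(calls, params, clock_hz) * 1000:.1f} ms")
```

The saving scales linearly with both the number of calls and the number of fp parameters, so it matters most for code that makes many small math calls in tight loops.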

What we really want is a whole distribution built with the desired flags, but that requires (significant) effort from the distribution builders. We'll see who's the first to offer it - I think it may make their distribution more popular.

asb
Forum Moderator
Posts: 853
Joined: Fri Sep 16, 2011 7:16 pm

Re: FPU flags test

Sun Jan 29, 2012 8:31 am

dom said:


What we really want is a whole distribution built with the desired flags, but that requires (significant) effort from the distribution builders. We'll see who's the first to offer it - I think it may make their distribution more popular.


Gentoo and the embedded-targeted distributions such as Angstrom/OpenEmbedded should be able to do this fairly easily.
