Hi everybody,
I'm new to embedded software and hardware and, like everyone here, I can't wait
for the RPi launch.
In the meantime I've been doing some tests to get to know how the ARM
architecture works. As the RPi will support my favourite distro, Debian, I've
set up a development environment using qemu and Debian squeeze for ARM.
But Debian's stock software is compiled for generic ARM, and maybe this is not
optimal for the RPi hardware specs. The most important thing is the presence of
an FPU. I've read this post
http://www.raspberrypi.org/for.....erformance
but I wanted to do tests on my own.
So I have written these tests and run them on a Samsung Galaxy (the only ARM
machine I have). The Samsung is a Cortex-A8, but I think the results would be
applicable to the RPi in some way.
The test program has two modules: custom code (Carmack's square root
implementation and decimal pi finding) and standard libc calls to sqrt and acos.
The first module has no libc dependencies, and the second one relies on the libc
implementation. Both are floating point intensive.
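The source of the custom module isn't posted, so the following is only a sketch of what a libc-free square root can look like. It assumes "Carmack's square root" means the well-known fast inverse square root bit trick (magic constant 0x5f3759df plus one Newton step). It is written in Python purely for illustration (the real test program is presumably C), and the function names are made up here.

```python
import math
import struct


def fast_inv_sqrt(x: float) -> float:
    """Approximate 1/sqrt(x) for x > 0 with the 0x5f3759df bit trick."""
    # Reinterpret the float32 bit pattern of x as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Magic constant minus half the bit pattern gives a rough first guess.
    i = (0x5F3759DF - (i >> 1)) & 0xFFFFFFFF
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step refines the guess (done in double precision here).
    return y * (1.5 - 0.5 * x * y * y)


def fast_sqrt(x: float) -> float:
    """sqrt(x) = x * (1/sqrt(x)); no math library call involved."""
    return x * fast_inv_sqrt(x)


if __name__ == "__main__":
    # Compare against the library sqrt, as the second module would use.
    for v in (2.0, 4.0, 100.0):
        print(v, fast_sqrt(v), math.sqrt(v))
```

With a single Newton step the result is only accurate to a fraction of a percent, which is fine for a benchmark but not a drop-in replacement for sqrt.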
The tests:
IMPORTANT NOTE: I couldn't set a fixed font in the forum. You can see a spreadsheet here:
https://docs.google.com/spreadsheet/pub ... utput=html
=== LEGEND
static tag = statically linked
ulibc tag = uclibc instead of libc
cortex-a8 tag = -mcpu=cortex-a8
fpu tag = FPU optimizations
noop tag = no optimizations at all
===
Target: arm-linux-gnueabi
Configured with: ../src/configure -v --with-pkgversion='Debian 4.4.5-8' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.4 --enable-shared --enable-multiarch
--enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--with-gxx-include-dir=/usr/include/c++/4.4 --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-objc-gc --disable-sjlj-exceptions --enable-checking=release --build=arm-linux-gnueabi --host=arm-linux-gnueabi
--target=arm-linux-gnueabi
Thread model: posix
gcc version 4.4.5 (Debian 4.4.5-8)
OPS="-lm -O3 -fomit-frame-pointer -pipe -ftree-vectorize"
CPU=" -mcpu=cortex-a8 "
FPU=" -mfpu=neon -mfloat-abi=softfp -ffast-math "
STATIC_ULIBC="-static /usr/arm-linux-uclibc/usr/lib/libm.a "
STATIC_LIBC="-lm -static "
DYNAMIC=" -lm "
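For readers who want to reproduce the builds, here is a rough sketch of how the tags in the binary names could map onto the flag groups above. The mapping, the compiler name `gcc` and the source file name `c_sqrt.c` are all assumptions; the post doesn't show the actual build script. It only prints command lines and doesn't run a compiler.

```python
# Flag groups copied from the post (surrounding spaces trimmed).
OPS = "-lm -O3 -fomit-frame-pointer -pipe -ftree-vectorize"
CPU = "-mcpu=cortex-a8"
FPU = "-mfpu=neon -mfloat-abi=softfp -ffast-math"
STATIC_ULIBC = "-static /usr/arm-linux-uclibc/usr/lib/libm.a"
STATIC_LIBC = "-lm -static"
DYNAMIC = "-lm"


def build_command(binary, cc="gcc", source="c_sqrt.c"):
    """Guess the compile line for a binary name such as 'c_sqrt-cortex-a8-fpu'.

    cc and source are hypothetical; the tag-to-flag mapping is an assumption.
    """
    parts = [cc]
    if "noop" not in binary:      # noop tag: no optimizations at all
        parts.append(OPS)
    if "cortex-a8" in binary:     # cortex-a8 tag
        parts.append(CPU)
    if "fpu" in binary:           # fpu tag
        parts.append(FPU)
    parts += [source, "-o", binary]
    if "ulibc" in binary:         # static link against uclibc's libm
        parts.append(STATIC_ULIBC)
    elif "static" in binary:      # static link against the system libc
        parts.append(STATIC_LIBC)
    else:                         # default: dynamic link
        parts.append(DYNAMIC)
    return " ".join(parts)


for name in ("c_sqrt-arm", "c_sqrt-arm-noop", "c_sqrt-cortex-a8-fpu-static-ulibc"):
    print(build_command(name))
```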
Command | Custom code (s) | libmath code (s) | iterations/sec | run time (s) | lastval
./c_sqrt-arm 3.529000 4.112000 65443.0039 7.6410 1.569382
./c_sqrt-arm-static 3.465000 3.007000 77263.5938 6.4720 1.569382
./c_sqrt-arm-static-ulibc 3.455000 3.017000 77263.5938 6.4720 1.569382
./c_sqrt-cortex-a8 3.429000 2.952000 78353.1797 6.3820 1.569382
./c_sqrt-cortex-a8-fpu 1.427000 1.941000 148470.9062 3.3680 1.569382
./c_sqrt-cortex-a8-fpu-static 1.418000 1.942000 148824.4062 3.3600 1.569382
./c_sqrt-cortex-a8-fpu-static-ulibc 1.410000 1.931000 149670.7500 3.3410 1.569382
./c_sqrt-arm 3.418000 2.967000 78316.3672 6.3850 1.569382
./c_sqrt-arm-static 3.376000 2.953000 79009.3203 6.3290 1.569382
./c_sqrt-arm-static-ulibc 3.362000 2.961000 79084.2969 6.3230 1.569382
./c_sqrt-cortex-a8 3.364000 2.958000 79096.8047 6.3220 1.569382
./c_sqrt-cortex-a8-fpu 1.354000 1.955000 151072.5000 3.3100 1.569382
./c_sqrt-cortex-a8-fpu-static 1.354000 1.926000 152454.2656 3.2800 1.569382
./c_sqrt-cortex-a8-fpu-static-ulibc 1.365000 1.927000 151898.5469 3.2920 1.569382
./c_sqrt-arm 3.399000 2.964000 78587.1406 6.3630 1.569382
./c_sqrt-arm-noop 4.088000 3.005000 70489.1484 7.0940 1.569382
./c_sqrt-arm-static 3.359000 2.949000 79272.3516 6.3080 1.569382
./c_sqrt-arm-static-ulibc 3.351000 2.946000 79410.8281 6.2970 1.569382
./c_sqrt-cortex-a8 3.383000 2.960000 78834.9375 6.3430 1.569382
./c_sqrt-cortex-a8-fpu 1.354000 1.941000 151760.2500 3.2950 1.569382
./c_sqrt-cortex-a8-fpu-static 1.355000 1.942000 151668.1875 3.2970 1.569382
./c_sqrt-cortex-a8-fpu-static-ulibc 1.354000 1.926000 152454.2656 3.2800 1.569382
So, we can see an improvement when using the FPU. The first module (custom code) benefits a lot because all of it is
affected by the optimization flags. The second module (libmath) relies on libc, so in theory the flags affect it, but not as much.
* the cortex-a8 tag has no measurable effect
* FPU flags improve custom code about 2.5x and libmath calls about 1.5x (2.1x in the first run, whose dynamic baseline of 4.112 looks like an outlier)
* static uclibc and dynamic libc behave the same.
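As a quick check of those ratios, here is a small Python sketch that recomputes the speedups from the rows above. The numbers are copied from the first and third Debian GCC runs; the helper name and the choice of rows are mine.

```python
# (custom code s, libmath code s) copied from the Debian GCC table above.
runs = {
    "run 1": {"arm": (3.529, 4.112), "cortex-a8-fpu": (1.427, 1.941)},
    "run 3": {"arm": (3.399, 2.964), "cortex-a8-fpu": (1.354, 1.941)},
}


def speedup(before: float, after: float) -> float:
    """How many times faster 'after' is compared to 'before'."""
    return before / after


for label, rows in runs.items():
    base, fpu = rows["arm"], rows["cortex-a8-fpu"]
    custom = speedup(base[0], fpu[0])
    libmath = speedup(base[1], fpu[1])
    total = speedup(sum(base), sum(fpu))
    print(f"{label}: custom {custom:.2f}x, libmath {libmath:.2f}x, total {total:.2f}x")
```

The libmath ratio moves a lot between runs because the first dynamic baseline (4.112) is much slower than the later ones (about 2.96), so the custom code figure is the more stable of the two.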
Now, let's see Sourcery G++
Target: arm-none-linux-gnueabi
Configured with: /scratch/mitchell/builds/4.3-arm-none-linux-gnueabi-respin/src/gcc-4.3/configure --build=i686-pc-linux-gnu
--host=i686-pc-linux-gnu --target=arm-none-linux-gnueabi --enable-threads --disable-libmudflap --disable-libssp --disable-libstdcxx-pch
--with-gnu-as --with-gnu-ld --with-specs='%{funwind-tables|fno-unwind-tables|mabi=*|ffreestanding|nostdlib:;:-funwind-tables}'
--enable-languages=c,c++ --enable-shared --enable-symvers=gnu --enable-__cxa_atexit --with-pkgversion='Sourcery G++ Lite 2009q1-203'
--with-bugurl=https://support.codesourcery.com/GNUToolchain/ --disable-nls --prefix=/opt/codesourcery
--with-sysroot=/opt/codesourcery/arm-none-linux-gnueabi/libc --with-build-sysroot=/scratch/mitchell/builds/4.3-arm-none-linux-gnueabi-respin/lite/install/arm-none-linux-gnueabi/libc
--with-gmp=/scratch/mitchell/builds/4.3-arm-none-linux-gnueabi-respin/lite/obj/host-libs-2009q1-203-arm-none-linux-gnueabi-i686-pc-linux-gnu/usr
--with-mpfr=/scratch/mitchell/builds/4.3-arm-none-linux-gnueabi-respin/lite/obj/host-libs-2009q1-203-arm-none-linux-gnueabi-i686-pc-linux-gnu/usr
--disable-libgomp --enable-poison-system-directories --with-build-time-tools=/scratch/mitchell/builds/4.3-arm-none-linux-gnueabi-respin/lite/install/arm-none-linux-gnueabi/bin
--with-build-time-tools=/scratch/mitchell/builds/4.3-arm-none-linux-gnueabi-respin/lite/install/arm-none-linux-gnueabi/bin
Thread model: posix
gcc version 4.3.3 (Sourcery G++ Lite 2009q1-203)
OPS="-lm -O3 -fomit-frame-pointer -pipe -ftree-vectorize"
CPU=" -mcpu=cortex-a8 "
FPU=" -mfpu=neon -mfloat-abi=softfp -ffast-math "
STATIC_ULIBC="-static /usr/arm-linux-uclibc/usr/lib/libm.a "
STATIC_LIBC="-lm -static "
DYNAMIC=" -lm "
Command | Custom code (s) | libmath code (s) | iterations/sec | run time (s) | lastval
./sc_sqrt-arm 3.514000 3.175000 74712.3828 6.6930 1.569382
./sc_sqrt-arm-noop 4.292000 2.976000 68801.5938 7.2680 1.569382
./sc_sqrt-arm-static 3.331000 3.023000 77455.0781 6.4560 1.569382
./sc_sqrt-arm-static-ulibc 3.247000 2.934000 80888.0625 6.1820 1.569382
./sc_sqrt-cortex-a8 3.265000 3.004000 79765.5156 6.2690 1.569382
./sc_sqrt-cortex-a8-fpu 1.021000 1.963000 167577.0781 2.9840 1.569382
./sc_sqrt-cortex-a8-fpu-static 1.026000 1.914000 170085.0312 2.9400 1.569382
./sc_sqrt-cortex-a8-fpu-static-ulibc 1.020000 2.035000 163682.4844 3.0550 1.569382
./sc_sqrt-arm 3.476000 3.246000 74390.0625 6.7230 1.569382
./sc_sqrt-arm-noop 4.403000 3.271000 65161.5859 7.6740 1.569382
./sc_sqrt-arm-static 3.622000 3.162000 73710.2031 6.7840 1.569382
./sc_sqrt-arm-static-ulibc 3.349000 3.174000 76659.5156 6.5230 1.569382
./sc_sqrt-cortex-a8 3.564000 3.283000 73031.9844 6.8470 1.569382
./sc_sqrt-cortex-a8-fpu 1.118000 2.135000 153719.6406 3.2530 1.569382
./sc_sqrt-cortex-a8-fpu-static 1.130000 2.554000 135735.6094 3.6840 1.569382
./sc_sqrt-cortex-a8-fpu-static-ulibc 1.244000 2.313000 140581.9531 3.5570 1.569382
* the cortex-a8 tag has no measurable effect here either
* FPU flags improve custom code about 3x. As the system's libc is the same, libmath behaves the same.
* static uclibc and dynamic libc behave the same.
After 2 days compiling on qemu ARM, I've managed to package an optimized libc6.
So:
Target: arm-linux-gnueabi
Configured with: ../src/configure -v --with-pkgversion='Debian 4.4.5-8' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.4 --enable-shared --enable-multiarch
--enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--with-gxx-include-dir=/usr/include/c++/4.4 --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-objc-gc --disable-sjlj-exceptions --enable-checking=release --build=arm-linux-gnueabi --host=arm-linux-gnueabi
--target=arm-linux-gnueabi
Thread model: posix
gcc version 4.4.5 (Debian 4.4.5-8)
OPS="-lm -O3 -fomit-frame-pointer -pipe -ftree-vectorize"
CPU=" -mcpu=cortex-a8 "
FPU=" -mfpu=neon -mfloat-abi=softfp -ffast-math "
STATIC_ULIBC="-static /usr/arm-linux-uclibc/usr/lib/libm.a "
STATIC_LIBC="-lm -static "
DYNAMIC=" -lm "
LIBC_CUSTOM VERSION: 2.11.2-10-cortex-a8.2
LIBC FLAGS: -O3 -g -mcpu=cortex-a8 -fomit-frame-pointer -mfpu=vfp -mfloat-abi=softfp
Command | Custom code (s) | libmath code (s) | iterations/sec | run time (s) | lastval
./c_sqrt-arm 3.775000 0.842000 107537.6328 4.6510 1.569382
./c_sqrt-arm-noop 4.290000 0.987000 94760.2812 5.2780 1.569382
./c_sqrt-arm-static 3.451000 3.009000 77395.1406 6.4610 1.569382
./c_sqrt-arm-static-ulibc 3.708000 3.088000 73580.0469 6.7960 1.569382
./c_sqrt-cortex-a8 3.654000 1.056000 106167.7266 4.7100 1.569382
./c_sqrt-cortex-a8-fpu 1.369000 0.596000 254478.3750 1.9650 1.569382
./c_sqrt-cortex-a8-fpu-static 1.393000 2.223000 138249.9375 3.6170 1.569382
./c_sqrt-cortex-a8-fpu-static-ulibc 1.388000 1.956000 149536.4844 3.3440 1.569382
./c_sqrt-arm 3.811000 0.880000 106597.7422 4.6910 1.569382
./c_sqrt-arm-noop 4.563000 0.927000 91083.7891 5.4900 1.569382
./c_sqrt-arm-static 3.829000 3.094000 72230.2500 6.9230 1.569382
./c_sqrt-arm-static-ulibc 3.964000 3.314000 68707.0625 7.2780 1.569382
./c_sqrt-cortex-a8 3.703000 0.860000 109587.9922 4.5630 1.569382
./c_sqrt-cortex-a8-fpu 1.446000 0.634000 240408.6562 2.0800 1.569382
./c_sqrt-cortex-a8-fpu-static 1.464000 2.060000 141898.4062 3.5240 1.569382
./c_sqrt-cortex-a8-fpu-static-ulibc 1.540000 2.175000 134602.9688 3.7150 1.569382
./sc_sqrt-arm 3.406000 0.838000 117825.1641 4.2440 1.569382
./sc_sqrt-arm-noop 4.092000 0.845000 101265.6953 4.9380 1.569382
./sc_sqrt-arm-static 3.407000 3.111000 76706.5469 6.5190 1.569382
./sc_sqrt-arm-static-ulibc 3.318000 2.937000 79944.0469 6.2550 1.569382
./sc_sqrt-cortex-a8 3.348000 0.855000 118974.5391 4.2030 1.569382
./sc_sqrt-cortex-a8-fpu 1.033000 0.606000 304908.5312 1.6400 1.569382
./sc_sqrt-cortex-a8-fpu-static 1.050000 1.944000 167017.3750 2.9940 1.569382
./sc_sqrt-cortex-a8-fpu-static-ulibc 1.057000 1.917000 168140.5469 2.9740 1.569382
./sc_sqrt-arm 3.318000 0.829000 120581.1406 4.1470 1.569382
./sc_sqrt-arm-noop 4.285000 0.847000 97437.6484 5.1320 1.569382
./sc_sqrt-arm-static 3.337000 2.937000 79701.9453 6.2740 1.569382
./sc_sqrt-arm-static-ulibc 3.307000 2.935000 80110.5391 6.2420 1.569382
./sc_sqrt-cortex-a8 3.330000 0.848000 119686.4531 4.1780 1.569382
./sc_sqrt-cortex-a8-fpu 1.036000 0.609000 303981.7500 1.6450 1.569382
./sc_sqrt-cortex-a8-fpu-static 1.034000 1.923000 169107.2031 2.9570 1.569382
./sc_sqrt-cortex-a8-fpu-static-ulibc 1.033000 1.922000 169221.6562 2.9550 1.569382
Now we can see dynamically linked code runs very, very fast:
* Custom code behaves the same
* libmath code gains about 3.5x when linked dynamically, and up to 5x if the FPU build is compared with the stock non-FPU baseline
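The size of the libmath gain depends on which rows are compared, so here is a small Python sketch with a few pairings. The timings are copied from the tables (stock libc: third Debian GCC run; custom libc: first run of the last table); the labels and pairings are my own choice.

```python
# libmath code times in seconds for dynamically linked binaries.
stock = {"arm": 2.964, "cortex-a8-fpu": 1.941}    # Debian's stock libc6
custom = {"arm": 0.842, "cortex-a8-fpu": 0.596}   # libc6 2.11.2-10-cortex-a8.2

comparisons = {
    "same binary, no FPU flags": stock["arm"] / custom["arm"],
    "same binary, FPU flags": stock["cortex-a8-fpu"] / custom["cortex-a8-fpu"],
    "stock non-FPU vs custom libc + FPU": stock["arm"] / custom["cortex-a8-fpu"],
}

for label, ratio in comparisons.items():
    print(f"{label}: {ratio:.2f}x")
```

The statically linked binaries are left out on purpose: their libmath times stay around 2 to 3 seconds in the last table, while the dynamic ones drop below 1 second.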
Conclusions:
* Sourcery G++ seems to produce slightly faster code than Debian's gcc.
* FPU flags are a "must" option.
* For generic distros such as Debian, libc and the main libraries should be recompiled with FPU options.
* Applying these options we can go 3.5x-4x faster in floating point operations

