okenido
Posts: 78
Joined: Thu Aug 02, 2018 11:47 am

Optimizing data structures for NEON

Mon Oct 15, 2018 10:33 am

Hello,

I want to help GCC generating vectorized neon code for my calculations that works on a lot of structs. (All the same, just a lot of them).

Currently the code look like that :

Code: Select all

struct thing{ int a, b;}obj[4];

obj[0].a += obj[0].b;
obj[1].a += obj[1].b;
obj[2].a += obj[2].b;
obj[3].a += obj[3].b;
As you see, every read/write on the memory is sizeof(obj) away from each other.

Code: Select all

int objA[4];
int objB[4];


obA[0] += obB[0];
obA[1] += obB[1];
obA[2] += obB[2];
obA[3] += obB[3];

...
Isn't this way of organizing things more efficient ? So i guess NEON could load a pointer to the first objA and objB then do a vector operation on them since they are contiguous in memory?

In my program sizeof(obj) is about 70 bytes so each a, b... values aren't close in memory.

KnarfB
Posts: 198
Joined: Wed Dec 14, 2016 10:47 am
Location: Germany

Re: Optimizing data structures for NEON

Mon Oct 22, 2018 7:42 pm

Isn't this way of organizing things more efficient ? So i guess NEON could load a pointer to the first objA and objB then do a vector operation on them since they are contiguous in memory?
Yes, you're right. When I put your code into a file

Code: Select all

int obA[4];
int obB[4];

int test() {
  obA[0] += obB[0];
  obA[1] += obB[1];
  obA[2] += obB[2];
  obA[3] += obB[3];
}
and compile it with some options, gcc will auto-vectorize:

Code: Select all

$ gcc -O3 -mcpu=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard -funsafe-math-optimizations -S -c test.c
$ more test.s
...
        movw    r3, #:lower16:obA
        movt    r3, #:upper16:obA
        movw    r2, #:lower16:obB
        movt    r2, #:upper16:obB
        vld1.32 {q8}, [r3]
        vld1.32 {q9}, [r2]
        vadd.i32        q8, q8, q9
        vst1.32 {q8}, [r3]
        bx      lr
Generally, you should also consider using const qualifiers and restrict pointers where appropriate, e.g. for function parameters, which designate "unique" objects within a context.

okenido
Posts: 78
Joined: Thu Aug 02, 2018 11:47 am

Re: Optimizing data structures for NEON

Mon Nov 05, 2018 3:27 pm

Thanks !

I re-wrote critical parts of my code with manual NEON intrinsics and got +50% performance.
It seems GCC is simply unable to vectorize things when it gets more complicated than the obvious parallel addition/multiplication.

It's not very pretty since my object get split into multiple variables ( var1[nbObjects], var2[nbObjects] ), but the speed improvement is awesome.

LdB
Posts: 1657
Joined: Wed Dec 07, 2016 2:29 pm

Re: Optimizing data structures for NEON

Mon Nov 05, 2018 4:01 pm

I didn't see this before but having played around with this a couple of weeks ago, you didn't mention a crucial thing.

To get GCC to do it properly you need #include <arm_neon.h>
That thing is a massive file and has all the shortcuts in it. It only does basic optimizations without it.
I also found you could use the special types to teach it new tricks.

okenido
Posts: 78
Joined: Thu Aug 02, 2018 11:47 am

Re: Optimizing data structures for NEON

Tue Nov 06, 2018 3:22 pm

I already include this header but it's for the intrinsics which I use manually. Does it make sense to include it by default ?

I use those extra types, like uint16x8_t which is a block of eight 16 bit unsigned ints that can be mapped to a single NEON 128 bit register and processed by a single instruction.
NEON programming is pretty fun, although it requires some care or it can be slower than pure ARM code. Latency between ARM<=>NEON register is high, some interleaving is required to hide that. My code is less readable but way faster for the critical parts :D

Return to “Bare metal, Assembly language”