bitbank
Posts: 252
Joined: Sat Nov 07, 2015 8:01 am
Location: Sarasota, Florida
Contact: Website

Maximizing SIMD performance

Sat Feb 18, 2017 5:24 pm

Here's a link to an article I wrote on LinkedIn about how to use SIMD to maximize the performance of your C code. The sample code in the article is written in SSE, but the same ideas apply to ARM NEON on the RPi2/3:

https://www.linkedin.com/pulse/maximizi ... larry-bank

Let me know if you have any comments/questions.
The fastest code is none at all :)

PeterO
Posts: 4727
Joined: Sun Jul 22, 2012 4:14 pm

Re: Maximizing SIMD performance

Sat Feb 18, 2017 6:17 pm

So without any actual ARM NEON examples it's hard to see where the value lies for users on this site ?
Other than getting your page hit rate up ?
PeterO
Discoverer of the PI2 XENON DEATH FLASH!
Interests: C,Python,PIC,Electronics,Ham Radio (G0DZB),1960s British Computers.
"The primary requirement (as we've always seen in your examples) is that the code is readable. " Dougie Lawson

bitbank
Posts: 252
Joined: Sat Nov 07, 2015 8:01 am
Location: Sarasota, Florida
Contact: Website

Re: Maximizing SIMD performance

Sat Feb 18, 2017 6:24 pm

PeterO wrote:So without any actual ARM NEON examples it's hard to see where the value lies for users on this site ?
Other than getting your page hit rate up ?
PeterO
I don't care about page hits. My point in posting the article is to share the idea that SIMD is not 'scary', and that those who lean on the crutch of the C compiler's "-O3" option and assume their code is being vectorized might be surprised when the compiler isn't able to do it. Obviously I'm not trying to teach SIMD programming in a one-page article. I'm sorry that you completely missed the point of the article and dismissed it out of hand when you saw some SSE code.
The fastest code is none at all :)

PeterO
Posts: 4727
Joined: Sun Jul 22, 2012 4:14 pm

Re: Maximizing SIMD performance

Sat Feb 18, 2017 6:26 pm

bitbank wrote:
PeterO wrote:So without any actual ARM NEON examples it's hard to see where the value lies for users on this site ?
Other than getting your page hit rate up ?
PeterO
I don't care about page hits. My point in posting the article is to share the idea that SIMD is not 'scary', and that those who lean on the crutch of the C compiler's "-O3" option and assume their code is being vectorized might be surprised when the compiler isn't able to do it. Obviously I'm not trying to teach SIMD programming in a one-page article. I'm sorry that you completely missed the point of the article and dismissed it out of hand when you saw some SSE code.
SSE code is useless as far as this site is concerned... So, as I said, it is hard to see how your page is of any help to people wanting to start to learn about SIMD on the Pi.

PeterO
Discoverer of the PI2 XENON DEATH FLASH!
Interests: C,Python,PIC,Electronics,Ham Radio (G0DZB),1960s British Computers.
"The primary requirement (as we've always seen in your examples) is that the code is readable. " Dougie Lawson

bitbank
Posts: 252
Joined: Sat Nov 07, 2015 8:01 am
Location: Sarasota, Florida
Contact: Website

Re: Maximizing SIMD performance

Sat Feb 18, 2017 6:36 pm

PeterO wrote: SSE code is useless as far as this site is concerned... So, as I said, it is hard to see how your page is of any help to people wanting to start to learn about SIMD on the Pi.

PeterO
Again, you're missing the point. Even if I provided a NEON example, it wouldn't teach anyone NEON programming. I mention clearly that the code will look nearly identical with NEON/AltiVec/DSP because the operations are universal SIMD concepts. To clarify, the points of posting the article are:

1) Don't assume your C compiler is using SIMD instructions when you set "maximum" optimization (there's a quick way to check for yourself, sketched at the end of this post)
2) Coding with SIMD is not voodoo - explore the idea further by learning something on your own

If you don't see value in that for the RPi community, so be it, but you seem to be stuck on the idea that using sample SSE code to prove a point on an ARM device support site is somehow forbidden.
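For point 1, here's a quick way to see for yourself what gcc actually did. This is only a sketch - the file name, arrays, function names and build lines are my own examples for a 32-bit Raspbian on the Pi 2/3, not from the article - but the -fopt-info-vec reporting flags are standard gcc:

Code: Select all

/* check_vec.c - ask gcc whether it vectorized a loop.
   Example build lines (one possibility for 32-bit Raspbian on a Pi 2/3):
     gcc -O3 -march=armv7-a -mfpu=neon-vfpv4 -fopt-info-vec -c check_vec.c
     gcc -O3 -march=armv7-a -mfpu=neon-vfpv4 -fopt-info-vec-missed -c check_vec.c
   The first prints a "loop vectorized" note for each loop gcc turned into NEON code;
   the second explains why the remaining loops were left scalar. */
#include <stdint.h>

#define N 1024
uint32_t a[N], b[N], c[N];

void add_arrays(void)          /* simple independent iterations - usually vectorized at -O3 */
{
    int i;
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

void add_until_zero(void)      /* data-dependent early exit - usually left scalar */
{
    int i;
    for (i = 0; i < N; i++) {
        if (a[i] == 0)
            break;
        c[i] = a[i] + b[i];
    }
}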
The fastest code is none at all :)

jahboater
Posts: 4429
Joined: Wed Feb 04, 2015 6:38 pm

Re: Maximizing SIMD performance

Sun Feb 19, 2017 8:44 am

bitbank wrote:My point in posting the article is to share the idea that SIMD is not 'scary', and that those who lean on the crutch of the C compiler's "-O3" option and assume their code is being vectorized might be surprised when the compiler isn't able to do it.
Well you need more than -O3. -ftree-vectorize at the very least, probably others - I am not an expert.
Even with auto-vectorization turned on, I am sure you will still be able to beat the compiler, even gcc 6.3.

It would be nice if you gave an example with NEON intrinsics. It looks to me like NEON on armv8 is better designed, cleaner, and simpler than all the sse/avx intel stuff.

NEON on the Pi3 armv8 is quad issue, so apart from the SIMD aspect it should be very fast.
In 64-bit mode it is fully IEEE compliant.

bitbank
Posts: 252
Joined: Sat Nov 07, 2015 8:01 am
Location: Sarasota, Florida
Contact: Website

Re: Maximizing SIMD performance

Sun Feb 19, 2017 11:34 am

jahboater wrote:
bitbank wrote:My point in posting the article is to share the idea that SIMD is not 'scary', and that those who lean on the crutch of the C compiler's "-O3" option and assume their code is being vectorized might be surprised when the compiler isn't able to do it.
Well you need more than -O3. -ftree-vectorize at the very least, probably others - I am not an expert.
Even with auto-vectorization turned on, I am sure you will still be able to beat the compiler, even gcc 6.3.

It would be nice if you gave an example with NEON intrinsics. It looks to me like NEON on armv8 is better designed, cleaner, and simpler than all the sse/avx intel stuff.

NEON on the Pi3 armv8 is quad issue, so apart from the SIMD aspect it should be very fast.
In 64-bit mode it is fully IEEE compliant.
The NEON instruction set is better designed than SSE for two reasons - it was created several years after SSE came on the scene and it was designed as a coprocessor with plenty of "instruction space". NEON has every permutation of instruction for every data type. SSE didn't have the room in the instruction encoding (without making super long instructions), so more compromises were made.

The RPi3 has an ARM Cortex-A53, which is a 64-bit version of the Cortex-A7 with a few small improvements. It has a dual-issue in-order pipeline, not quad-issue. The other thing holding back the performance of the RPi3 is the memory speed and (I believe) a 32-bit memory bus. Most 64-bit machines have 64-bit wide memory interfaces. Even though the NEON instruction set is more advanced than SSE/SSE2/SSE4, a modern Intel CPU at the same clock speed will perform better because of its quad-issue out-of-order pipeline and faster memory bus.

For my example code, the SSE and NEON code look virtually identical. Here's the C code and NEON code for the curious:

Code: Select all

uint32_t array_a[32];
uint32_t array_b[32];
uint32_t array_c[32];
int i;

for (i=0; i<32; i++)
{
   if (array_a[i] != 0)
      array_c[i] = array_a[i] * array_b[i];
   else
      array_c[i] = array_b[i];
}
Now the NEON version of that:

Code: Select all

// the intrinsics below come from #include <arm_neon.h>
for (i=0; i<32; i += 4)
{
   uint32x4_t u32A, u32B, u32C, u32Product, u32Mask;

   u32A = vld1q_u32(&array_a[i]); // read 4 uint32_t's from array_a
   u32B = vld1q_u32(&array_b[i]); // read 4 uint32_t's from array_b
   u32Mask = vceqq_u32(u32A, vdupq_n_u32(0)); // lanes where a == 0 become all 1s
   u32Product = vmulq_u32(u32A, u32B); // multiply A's by B's
   u32C = vbicq_u32(u32Product, u32Mask); // keep the products where A != 0
   u32B = vandq_u32(u32B, u32Mask); // keep the B's where A == 0
   u32C = vorrq_u32(u32B, u32C); // combine the two sets
   vst1q_u32(&array_c[i], u32C); // write 4 uint32_t results to array_c
}
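If you want to try the comparison yourself, here's a minimal harness around the two loops (my own wrapper, not from the article). It only needs <arm_neon.h>; on 32-bit Raspbian something like "gcc -O2 -march=armv7-a -mfpu=neon-vfpv4" should build it, and an aarch64 compiler needs no extra flags because NEON is always present there:

Code: Select all

/* neon_vs_c.c - hypothetical harness checking the NEON loop against the scalar C loop */
#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>

uint32_t array_a[32], array_b[32], array_c[32], array_ref[32];

int main(void)
{
    int i;

    for (i = 0; i < 32; i++) {           /* test data: every 4th element of a is 0 */
        array_a[i] = (i % 4) ? i : 0;
        array_b[i] = i + 100;
    }

    for (i = 0; i < 32; i++)             /* scalar reference version */
        array_ref[i] = array_a[i] ? array_a[i] * array_b[i] : array_b[i];

    for (i = 0; i < 32; i += 4) {        /* the NEON loop from the post, condensed */
        uint32x4_t u32A = vld1q_u32(&array_a[i]);
        uint32x4_t u32B = vld1q_u32(&array_b[i]);
        uint32x4_t u32Mask = vceqq_u32(u32A, vdupq_n_u32(0));      /* all 1s where a == 0 */
        uint32x4_t u32C = vbicq_u32(vmulq_u32(u32A, u32B), u32Mask); /* products where a != 0 */
        u32C = vorrq_u32(vandq_u32(u32B, u32Mask), u32C);          /* plus b where a == 0 */
        vst1q_u32(&array_c[i], u32C);
    }

    for (i = 0; i < 32; i++)
        if (array_c[i] != array_ref[i]) {
            printf("mismatch at %d\n", i);
            return 1;
        }

    printf("NEON result matches the scalar loop\n");
    return 0;
}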
The fastest code is none at all :)

jahboater
Posts: 4429
Joined: Wed Feb 04, 2015 6:38 pm

Re: Maximizing SIMD performance

Sun Feb 19, 2017 12:28 pm

bitbank wrote: The NEON instruction set is better designed than SSE for two reasons - it was created several years after SSE came on the scene and it was designed as a coprocessor with plenty of "instruction space". NEON has every permutation of instruction for every data type. SSE didn't have the room in the instruction encoding (without making super long instructions), so more compromises were made.
Yes, but I believe it is no longer a coprocessor in the A53 and it's even more complete; for example, all five rounding modes are available as individual instructions for all types (float to float, float to signed int, float to unsigned int) without messing with the FPCR, which is pretty cool.
bitbank wrote:The RPi3 has an ARM Cortex-A53 which is a 64-bit version of the Cortex-A7 with a few small improvements. It has a dual-issue in-order pipeline, not quad-issue.
The ARM integer stuff is definitely dual-issue, but I thought the NEON unit was separately quad-issue - I may be wrong.
bitbank wrote:The other thing holding back the performance of the RPi3 is the memory speed and (I believe) a 32-bit memory bus. Most 64-bit machines have 64-bit wide memory interfaces. Even though the NEON instruction set is more advanced than SSE/SSE2/SSE4, a modern Intel CPU at the same clock speed will perform better because of its quad-issue out-of-order pipeline, and faster memory bus.
I agree it's nowhere near as fast as a modern Intel CPU. The memory is 450MHz DDR2!!!
I don't think NEON is more advanced than AVX, especially AVX-512. I just thought it was easier to use, simpler, and without the baggage of the old SSE stuff.

The later "performance" ARM CPUs are out-of-order (starting with the A57 and A72 ...); the A53 is a low-power design which would be the "LITTLE" in a big.LITTLE SoC.
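To illustrate the rounding-mode point above, here's a small sketch using the ACLE intrinsics from <arm_neon.h>. The wrapper function names are just my own, and it needs an ARMv8 target (an aarch64 compiler, or -march=armv8-a on 32-bit):

Code: Select all

#include <arm_neon.h>

void rounding_demo(float32x4_t v, float32x4_t out[5])
{
    /* float -> float: each rounding mode is a single instruction, no FPCR fiddling */
    out[0] = vrndnq_f32(v);   /* round to nearest, ties to even (FRINTN) */
    out[1] = vrndmq_f32(v);   /* round toward minus infinity    (FRINTM) */
    out[2] = vrndpq_f32(v);   /* round toward plus infinity     (FRINTP) */
    out[3] = vrndq_f32(v);    /* round toward zero              (FRINTZ) */
    out[4] = vrndaq_f32(v);   /* round to nearest, ties away    (FRINTA) */
}

void convert_demo(float32x4_t v, int32x4_t s[2], uint32x4_t u[2])
{
    /* float -> int: the conversions also carry the rounding mode in the instruction */
    s[0] = vcvtnq_s32_f32(v);   /* to nearest, signed             (FCVTNS) */
    s[1] = vcvtmq_s32_f32(v);   /* toward -inf, signed            (FCVTMS) */
    u[0] = vcvtpq_u32_f32(v);   /* toward +inf, unsigned          (FCVTPU) */
    u[1] = vcvtaq_u32_f32(v);   /* to nearest ties away, unsigned (FCVTAU) */
}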
