gcc __builtin_prefetch cache usage help request
Posted: Wed Oct 29, 2014 2:08 pm
Does anyone have experience using gcc's PLD preload hints to optimize code?
I really am a newbie at programming this type of stuff, but I will describe the program and what I hope to do.
I have a largish array of structs that is passed to functions as a pointer, and I want to make sure it is preloaded into cache to speed up processing.
Assuming __builtin_prefetch actually works on ARMv6 and gcc implements it properly and that gcc does not use it without being instructed, I have some questions.
The questions I have....
1) I assume the cache is loaded in lines of 32 bytes? gcc does no calculation on the struct size, so I have to call __builtin_prefetch with an offset of n*32 up to the size of the struct. Is this correct?
2) is there actually any point in loading the struct data variables into local register vars if the struct is already loaded in cache?
3) is there any point using __pure, __restrict, or __promise?
4) does anyone have any other suggestions for optimizing code?
5) does what I have done look correct and worthwhile?
6) would it be worthwhile trying to use ARMv6 SIMD on the short ints? I think I heard there are some instructions that allow filling 32-bit regs with shorts or bytes and running multiple calculations in parallel. Does gcc implement instructions for SIMD operations, or how can I call ARM asm from gcc easily?
outline example code below:
Of course all this may be totally wrong, as I have not tried the code and have never done any low-level optimization on ARM before. Corrections and advice welcome.
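typedef struct {
    /* 'register' is not allowed on struct members, so it is dropped here */
    int   var1;
    short var2;
    float var3;
    short var4;
    int   loopcnt;
} largeStruct;

void func(void *pn)
{
    largeStruct *s = pn;
    int *dataout_pointer;   /* has to be pointed at the output buffer before use */
    int i, j;

    __builtin_prefetch(dataout_pointer, 1, 3);   /* rw = 1 (write), locality = 3 */
    for (j = 0; j < 8; j++)
    {
        /* copy the struct fields into local register vars */
        register int   var1 = s->var1;
        register short var2 = s->var2;
        register float var3 = s->var3;
        register int   n    = s->loopcnt;
        for (i = 0; i < n; i++)
        {
            var3=var.... etc
            /* other calculations.... */
            if (var3 < p)
                *(dataout_pointer + i) = avalue;
            /* ... */
        }
        /* write the results back to the struct */
        s->var1 = var1;
        s->var2 = var2;
        s->var3 = var3;
        /* move to next struct and repeat loop */
        s++;
    }
}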
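To make question 1 concrete, this is a minimal sketch of what I mean by calling __builtin_prefetch at n*32 offsets. The 32-byte line size is only my assumption, and CACHE_LINE and prefetch_struct are names I made up for illustration. I have not measured whether this helps at all.

```c
#include <stddef.h>

#define CACHE_LINE 32   /* assumed cache line size in bytes, not verified */

/* Issue one prefetch hint per assumed cache line over [p, p + size).
   Returns the number of hints issued so the stride can be checked.
   If p is not line-aligned, the data may span one more line than this covers. */
static size_t prefetch_struct(const void *p, size_t size)
{
    const char *c = p;
    size_t off;
    size_t count = 0;

    for (off = 0; off < size; off += CACHE_LINE) {
        /* rw = 0 (prepare for read), locality = 3 (high temporal locality) */
        __builtin_prefetch(c + off, 0, 3);
        count++;
    }
    return count;
}
```

In the outline above I would probably call something like prefetch_struct(s + 1, sizeof *s) at the top of the outer loop, so the next struct is being fetched while the current one is processed, taking care on the last iteration not to rely on anything past the end of the array.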