cmisip
Posts: 94
Joined: Tue Aug 25, 2015 12:38 am

Is there a performance penalty to using two byte structs instead of 4?

Mon Jul 09, 2018 1:49 am

I can save memory by using 2 byte structs but I don't know if I am giving up performance and speed by doing so because the arm cpu is 32 bit.

Thanks,
Chris

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 20227
Joined: Sat Jul 30, 2011 7:41 pm

Re: Is there a performance penalty to using two byte structs instead of 4?

Mon Jul 09, 2018 9:19 am

Yes, you are. Keeping things as multiples of 32 bits results in faster running code.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Please direct all questions to the forum, I do not do support via PM.

allfox
Posts: 431
Joined: Sat Jun 22, 2013 1:36 pm
Location: Guang Dong, China

Re: Is there a performance penalty to using two byte structs instead of 4?

Mon Jul 09, 2018 9:51 am

I'm not certain about this.

I think the compiler makes the decision. If it is optimizing for speed, it could fit the 2-byte struct into a 4-byte-aligned space.

And I have a vague impression that Windows would fit anything into an 8-byte-aligned space.

cmisip
Posts: 94
Joined: Tue Aug 25, 2015 12:38 am

Re: Is there a performance penalty to using two byte structs instead of 4?

Mon Jul 09, 2018 11:16 am

I need to save memory for a program that I am working on. It deals with a series of int8_t value pairs. If I put two of these pairs in a struct (the pairs are unrelated and really should be dealt with individually in a loop iteration), so that the struct is 4 bytes long, would that still incur a performance penalty? My loop would read values from the buffer half as many times, but each read would take 4 bytes, and I would run the same routine on each pair (so that routine happens twice, on two sets of data, in the same loop iteration). Or how can this be done without incurring a performance penalty?

Or I could put the entire buffer in an array, but would probably still feel the performance penalty if I operate on two byte values at a time.

Is there a solution that affords saving memory and not incurring a penalty?

How about reading 4 bytes and then bit masking and bit shifting to get the first and second pair individually? Or will the bit operations negate the performance gain of a 4-byte read?

Thanks,
Chris

joan
Posts: 13537
Joined: Thu Jul 05, 2012 5:09 pm
Location: UK

Re: Is there a performance penalty to using two byte structs instead of 4?

Mon Jul 09, 2018 11:34 am

Rather than ask, why don't you just find out?

Time the alternatives and use the fastest if that is important to you.
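One way to time the alternatives is sketched below in C. It is only a rough illustration: the two struct layouts, the summing workload and every function name are assumptions made up for this example, not code from the thread. It uses the POSIX clock_gettime() call with CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical layouts to compare: one pair per element vs. two pairs. */
struct pair2 { int8_t x, y; };            /* 2 bytes */
struct pair4 { int8_t x1, y1, x2, y2; };  /* 4 bytes */

/* Monotonic time in seconds. */
static double seconds_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Stand-in workload: sum every coordinate, one pair per element. */
static long sum_pair2(const struct pair2 *p, size_t n)
{
    assert(p != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i].x + p[i].y;
    return total;
}

/* Same workload, two pairs per element. */
static long sum_pair4(const struct pair4 *p, size_t n)
{
    assert(p != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i].x1 + p[i].y1 + p[i].x2 + p[i].y2;
    return total;
}

/* Run the 2-byte version `reps` times; returns elapsed seconds.
   The accumulated sum is written to *out so the work has a visible result. */
static double time_pair2(const struct pair2 *p, size_t n, int reps, long *out)
{
    double start = seconds_now();
    long total = 0;
    for (int r = 0; r < reps; r++)
        total += sum_pair2(p, n);
    *out = total;
    return seconds_now() - start;
}

/* Run the 4-byte version `reps` times; returns elapsed seconds. */
static double time_pair4(const struct pair4 *p, size_t n, int reps, long *out)
{
    double start = seconds_now();
    long total = 0;
    for (int r = 0; r < reps; r++)
        total += sum_pair4(p, n);
    *out = total;
    return seconds_now() - start;
}
```

Fill both arrays with the same data, call time_pair2() and time_pair4() with a large repeat count, and compare the two durations. An optimizing compiler may simplify a loop this trivial, so it is worth checking the generated assembly and timing the real routine rather than a stand-in.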

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 20227
Joined: Sat Jul 30, 2011 7:41 pm

Re: Is there a performance penalty to using two byte structs instead of 4?

Mon Jul 09, 2018 1:29 pm

You could store 4 bytes of information in a 32-bit value, then extract the data using bit ops. You would always be using 32-bit-aligned addresses, with a slight performance impact on getting the data out. Although TBH that is probably what the compiler would be doing if you just let it choose.
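A minimal sketch of that idea in C, assuming four int8_t values packed with the first one in the low byte. The names pack4 and unpack_byte are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four signed bytes into one 32-bit word; x1 lands in the low byte. */
static uint32_t pack4(int8_t x1, int8_t y1, int8_t x2, int8_t y2)
{
    return  (uint32_t)(uint8_t)x1
         | ((uint32_t)(uint8_t)y1 << 8)
         | ((uint32_t)(uint8_t)x2 << 16)
         | ((uint32_t)(uint8_t)y2 << 24);
}

/* Extract byte `index` (0..3) and reinterpret it as a signed value.
   Converting 128..255 to int8_t is implementation-defined before C23;
   GCC wraps modulo 256, which is the behaviour assumed here. */
static int8_t unpack_byte(uint32_t word, unsigned index)
{
    assert(index < 4);
    return (int8_t)(uint8_t)(word >> (8u * index));
}
```

For example, pack4(1, -2, 3, -4) gives 0xFC03FE01, and unpack_byte() on that word with index 1 gives -2 back. Whether this beats a plain struct of four bytes is something only a benchmark on the target can say.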

jahboater
Posts: 2851
Joined: Wed Feb 04, 2015 6:38 pm

Re: Is there a performance penalty to using two byte structs instead of 4?

Mon Jul 09, 2018 4:53 pm

The compiler will keep such small structs in registers as far as possible.

The ARM CPU on the Pi has some very handy instructions for dealing with these structs while they are in registers: "bfi", "ubfx", the amusingly named "sbfiz" and "ubfiz", and others. With these, there is no extra cost - the compiler will definitely use them.

But as stated above, saving a 2 byte struct to memory (ldrh/strh) is likely to be costly compared to a 4 byte struct, although only two byte alignment is needed.

See the "-freg-struct-return" compiler option.

cmisip
Posts: 94
Joined: Tue Aug 25, 2015 12:38 am

Re: Is there a performance penalty to using two byte structs instead of 4?

Tue Jul 10, 2018 12:17 am

I will be using memcpy to load values into a buffer and read from it. I therefore need to create a buffer of uint32_t* to be assured that I am given an address that is 4 byte aligned. Instead of saving a 4 byte word and using bit operations, use a struct with 4 members that are one byte size. In the loop iteration, the handling of the struct members should not be a performance penalty. It is when I copy to and from memory that I must keep the transfer 4 bytes wide. Is this correct?

Thanks,
Chris

Code:

uint32_t *mvect_buffer = (uint32_t*)malloc(mv_size);

struct vector_package { // package will be 32 bits wide to optimize memory transfers
    uint8_t xcoord1 = 0;
    uint8_t ycoord1 = 0;
    uint8_t xcoord2 = 0;
    uint8_t ycoord2 = 0;
} ups;

....

// Write to buffer in 4-byte chunks; otherwise, performance penalty
memcpy(mvect_buffer + offset, &ups, sizeof(vector_package));
// Load from buffer in 4-byte chunks
memcpy(&ups, mvect_buffer + offset, sizeof(vector_package));

while (true) {  // No performance penalty here due to the fact that special registers are used?
    ups.xcoord1 += other_value;
    ups.xcoord2 += other_value;
    ups.ycoord1 += other_value;
    ups.ycoord2 += other_value;
}


ejolson
Posts: 1830
Joined: Tue Mar 18, 2014 11:47 am

Re: Is there a performance penalty to using two byte structs instead of 4?

Tue Jul 10, 2018 12:38 am

cmisip wrote:
Tue Jul 10, 2018 12:17 am
I will be using memcpy to load values into a buffer and read from it. I therefore need to create a buffer of uint32_t* to be assured that I am given an address that is 4 byte aligned. Instead of saving a 4 byte word and using bit operations, use a struct with 4 members that are one byte size. In the loop iteration, the handling of the struct members should not be a performance penalty. It is when I copy to and from memory that I must keep the transfer 4 bytes wide. Is this correct?

One of the founding fathers of computer science said
Donald Knuth wrote: We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
It looks to me like you have created a clumsy data structure for the sake of optimisation that will likely make your program much more complicated while offering little if any performance improvement.

Generally it is a good idea to write the program in the simplest way first and make sure that it gives correct answers before complicating it with optimisation. It is also important to have the simplest correct program available to compare whether the added complexity resulting from each optimisation is worthwhile and to make sure the optimised program still gives correct answers.

cmisip
Posts: 94
Joined: Tue Aug 25, 2015 12:38 am

Re: Is there a performance penalty to using two byte structs instead of 4?

Tue Jul 10, 2018 12:56 am

I am actually at that point. I made sure to make the optimizations in a different branch. The program I am working on has a performance bottleneck and the Raspberry Pi has limited memory. I am trying to solve both issues, but I did think more than twice about making the code more complex in order to gain a little more performance. In the end it may not be worth much performance gain and I'll probably revert the changes, but I will have explored the avenue and learned something from it.

Thanks,
Chris

jahboater
Posts: 2851
Joined: Wed Feb 04, 2015 6:38 pm

Re: Is there a performance penalty to using two byte structs instead of 4?

Tue Jul 10, 2018 6:19 am

"Make it work, then make it fast" ....
Personally I would just get the latest version of GCC (8.1). This will (a) produce the best code, and (b) annotate the assembler listing with the source code lines, making it very easy to find out what's going on. Using a recent compiler and playing with the options will likely produce more benefit than your own micro-optimizations - and you can maintain simple, readable, portable code. There is a script to install GCC 8.1 here:
viewtopic.php?f=33&t=212636

Anyway ...
cmisip wrote:
Tue Jul 10, 2018 12:17 am
I will be using memcpy to load values into a buffer and read from it. I therefore need to create a buffer of uint32_t* to be assured that I am given an address that is 4 byte aligned.
That's a very good idea. Memcpy will always do the right thing, aligned or not, and is the preferred way of "type punning".
In this case each of your 4-byte memcpy calls compiles to a single instruction:

Code:

@ try.c:22: uint32_t *mvect_buffer = (uint32_t*)malloc(4000);
    bl  malloc      
@ try.c:39: memcpy(&ups,mvect_buffer+offset,sizeof(struct vector_package));
    ldr r3, [r0, #12]  
Similarly for the write, it uses str (store register). I set offset to 3, so the #12 is 3 * 4. Malloc returns mvect_buffer in r0.
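For reference, a self-contained C version of that load and store might look like the sketch below. The struct matches the one posted above, while the helper names store_package and load_package are assumptions added for illustration. With optimization enabled, a compiler such as GCC would be expected to turn each fixed-size 4-byte memcpy into a single word access, as in the listing above, but that is worth confirming in the assembler output.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct vector_package {
    uint8_t xcoord1, ycoord1, xcoord2, ycoord2;
};

/* The scheme only works if the struct really is one 32-bit word. */
_Static_assert(sizeof(struct vector_package) == sizeof(uint32_t),
               "vector_package must be exactly 4 bytes");

/* Copy one package into slot `slot` of a word-aligned buffer. */
static void store_package(uint32_t *buffer, size_t slot,
                          const struct vector_package *pkg)
{
    assert(buffer != NULL && pkg != NULL);
    memcpy(buffer + slot, pkg, sizeof *pkg);
}

/* Read the package stored in slot `slot` back out of the buffer. */
static struct vector_package load_package(const uint32_t *buffer, size_t slot)
{
    struct vector_package pkg;
    assert(buffer != NULL);
    memcpy(&pkg, buffer + slot, sizeof pkg);
    return pkg;
}
```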

This code however was not so good. C promotes all this stuff to "int" for the calculations (because some CPUs like ARM cannot do arithmetic on single bytes). So it all ended up in 4 different registers, and it finally used strb (store byte) to save each field (from the least significant byte of each register). It never stored it as a packed structure at all. Quite fast I suppose.

Code:

while (true) {  //No performance penalty here due to the fact that special registers are used?
   ups.xcoord1+=other_value;
   ups.xcoord2+=other_value;
   ups.ycoord1+=other_value;
   ups.ycoord2+=other_value;
}
It's hard to predict what the compiler will do.
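The promotion to "int" mentioned above can be seen in a few lines of C. This is only an illustrative sketch with made-up function names: the addition itself is carried out in int, and the result is cut back to 8 bits only when it is stored into a uint8_t again.

```c
#include <assert.h>
#include <stdint.h>

/* The sum as C computes it: both operands are promoted to int first. */
static int promoted_sum(uint8_t value, uint8_t delta)
{
    return value + delta;
}

/* The same addition stored back into a byte, as in `ups.xcoord1 += other_value`:
   the int result is truncated modulo 256 on assignment. */
static uint8_t add_in_place(uint8_t value, uint8_t delta)
{
    value += delta;
    return value;
}

/* Small self-check showing the difference between the two. */
static void promotion_demo(void)
{
    assert(promoted_sum(250, 10) == 260);  /* no wrap: computed as int */
    assert(add_in_place(250, 10) == 4);    /* 260 mod 256 */
}
```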

As Joan says, benchmark any and all changes you make to see what the real performance is.

If you want to get into micro-optimization (at home for fun), you should always look carefully at the code produced with "-S -fverbose-asm" and fully understand what the compiler is doing.

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 20227
Joined: Sat Jul 30, 2011 7:41 pm

Re: Is there a performance penalty to using two byte structs instead of 4?

Tue Jul 10, 2018 6:25 am

Seems wasteful to use memcpy for 4 bytes. Just cast the pointers to uint32 and write. Avoids the overhead of a function call.

jahboater
Posts: 2851
Joined: Wed Feb 04, 2015 6:38 pm

Re: Is there a performance penalty to using two byte structs instead of 4?

Tue Jul 10, 2018 6:34 am

jamesh wrote:
Tue Jul 10, 2018 6:25 am
Seems wasteful to use memcpy for 4bytes. Just cast the pointers to uint32 and write. Avoids the overhead of a function call.
Sorry, no it's not.

Code:

@ try.c:39: memcpy(&ups,mvect_buffer+offset,sizeof(struct vector_package));
    ldr r3, [r0, #12]  
There is no function call.

Writing things like:

*(uint32_t*)ptr = num

will likely produce identical code to:

memcpy( ptr, &num, 4 )

but if the compiler can't do that for some reason (an unaligned pointer, say), memcpy will still do the right thing.
memcpy is the preferred way of "type punning" like this.
I'm not sure what you are supposed to do if "num" is a literal though.
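On the literal case, two options in standard C are to pass the value through a by-value parameter or to take the address of a C99 compound literal. The sketch below shows both. The helper names and the 0xDEADBEEF marker are arbitrary choices for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value at dst, which may be unaligned.
   A literal argument simply becomes the by-value parameter. */
static void store_u32(void *dst, uint32_t value)
{
    assert(dst != NULL);
    memcpy(dst, &value, sizeof value);
}

/* Load a 32-bit value from src, which may be unaligned. */
static uint32_t load_u32(const void *src)
{
    uint32_t value;
    assert(src != NULL);
    memcpy(&value, src, sizeof value);
    return value;
}

/* Alternative for a literal: a C99 compound literal is an object
   with an address, so it can be the source of a memcpy directly. */
static void store_marker(void *dst)
{
    assert(dst != NULL);
    memcpy(dst, &(uint32_t){0xDEADBEEFu}, sizeof(uint32_t));
}
```

So store_u32(ptr, 42) works with a literal as written, and neither helper needs ptr to be 4-byte aligned.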

cmisip
Posts: 94
Joined: Tue Aug 25, 2015 12:38 am

Re: Is there a performance penalty to using two byte structs instead of 4?

Fri Jul 13, 2018 11:13 am

I got the clumsy data structure working and I haven't seen any alignment traps after running for a few days. There are two processes sending information via mmap. One is the capture daemon that is sending vector data and the other is the analysis daemon, which is checking to see if the vectors are inside a polygon. I am finding that the polygon test is CPU intensive and is keeping the analysis daemon behind the capture daemon. It's using a ring buffer, so eventually the capture daemon overruns the buffer.

I am thinking that I would quantize the frame into discrete 16x16 blocks and assign a bit to each one. For 1920x1080, that's about 1 kilobyte. The analysis daemon will have to precompute at startup which of its cells are inside a polygon and mark those with an ON bit in a 1 kilobyte zone_masq variable. Then at runtime, the capture daemon sends its own vector_masq in which significant vectors have the ON bit set. Then in the analysis daemon at runtime, I would simply AND the two and arrive at a bit count of all vectors that are significant and inside the zone polygon. I think maybe that is the fastest way for the analysis daemon to do its job.

What is the best way to handle this memory write and read? Would 64-bit or 128-bit reads/writes, or any multiple of 32 bits, be as performant as a single 32-bit read and write? The buffer is a uint8_t * that is malloc'd.

Thanks,
Chris
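A rough C sketch of the mask scheme described in this post follows. It assumes a 1920x1080 frame, 16x16 cells and one bit per cell; the constant and function names are made up for illustration, and __builtin_popcount is a GCC/Clang builtin rather than standard C. The masks are read 32 bits at a time through memcpy, so a plain uint8_t buffer is fine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    FRAME_W = 1920, FRAME_H = 1080, CELL = 16,
    GRID_W = (FRAME_W + CELL - 1) / CELL,     /* 120 cells per row */
    GRID_H = (FRAME_H + CELL - 1) / CELL,     /* 68 rows (last one partial) */
    MASK_BYTES = (GRID_W * GRID_H + 7) / 8    /* 1020 bytes */
};

/* Set the bit for the cell that contains pixel (x, y). */
static void mask_set(uint8_t *mask, int x, int y)
{
    assert(x >= 0 && x < FRAME_W && y >= 0 && y < FRAME_H);
    size_t cell = (size_t)(y / CELL) * GRID_W + (size_t)(x / CELL);
    mask[cell / 8] |= (uint8_t)(1u << (cell % 8));
}

/* Count the cells set in both masks: AND, then population count. */
static unsigned mask_overlap(const uint8_t *zone, const uint8_t *vectors)
{
    unsigned count = 0;
    size_t i = 0;
    /* Main loop: one 32-bit word per iteration. */
    for (; i + 4 <= (size_t)MASK_BYTES; i += 4) {
        uint32_t a, b;
        memcpy(&a, zone + i, sizeof a);
        memcpy(&b, vectors + i, sizeof b);
        count += (unsigned)__builtin_popcount(a & b);
    }
    /* Tail loop for any bytes left over if the size is not a multiple of 4. */
    for (; i < (size_t)MASK_BYTES; i++)
        count += (unsigned)__builtin_popcount((unsigned)(zone[i] & vectors[i]));
    return count;
}
```

Whether wider accesses help is something to measure: with optimization on, the compiler may already widen or vectorize a loop like this, and the assembler listing will show what it actually chose.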

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 20227
Joined: Sat Jul 30, 2011 7:41 pm

Re: Is there a performance penalty to using two byte structs instead of 4?

Fri Jul 13, 2018 11:30 am

jahboater wrote:
Tue Jul 10, 2018 6:34 am
There is no function call.
[...]
memcpy is the preferred way of "type punning" like this.
So presumably the compiler realises the copy is small as it's a literal and does something different. Does it specifically look for memcpy/similar calls to optimise out then?

jahboater
Posts: 2851
Joined: Wed Feb 04, 2015 6:38 pm

Re: Is there a performance penalty to using two byte structs instead of 4?

Fri Jul 13, 2018 12:39 pm

jamesh wrote:
Fri Jul 13, 2018 11:30 am
So presumably the compiler realises the copy is small as it's a literal and does something different.
Yes indeed, exactly that.
Any small call to memset or memcpy will be replaced with a simple instruction sequence - which may include SIMD registers if the length is 16 bytes or more (gcc may use NEON on the Pi). On Intel hardware that has specific string instructions, ALL calls might be replaced, as might things like strlen().
jamesh wrote:
Fri Jul 13, 2018 11:30 am
Does it specifically look for memcpy/similar calls to optimise out then?
I think it will replace quite a few different library routines where there is a hardware instruction that can do the job.
Math routines such as lrint() or round() or sqrt() get replaced with a single instruction. And so on.
Also, at a higher level, it can do things like this common example, the beginner's hello world program:
printf( "Hello, world\n" ); gets replaced with puts( "Hello, world" ); which is smaller and faster (the compiler understands the format string).

String functions have been replaced for many years and where there is a choice there are GCC parameters to specify how the user wants it done. For example on Intel -mstringop-strategy=xxxx and various others for individual functions.

The language standard does include much of the library, so the compiler knows exactly what it can and cannot do.

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 20227
Joined: Sat Jul 30, 2011 7:41 pm

Re: Is there a performance penalty to using two byte structs instead of 4?

Fri Jul 13, 2018 1:14 pm

Cool! I guess it's similar to C++ templating out literals to specific actions.

I guess this is knowledge that the older generation can sometimes not realise; when I started, compilers didn't do this. But I suppose when you think about it, it's fairly low-hanging fruit for optimisation.
