Usually, for small loops (such as the one over an array that you mention), you want to unroll between 2 and 4 times and no more than that. If you try to unroll it all the way, other optimizations will be lost.
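As a rough illustration of what "unroll it 4 times" means, here is a minimal sketch. The function name `sum_unrolled4` and the summing workload are made up for the example; a compiler may well do this for you at higher optimization levels.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical example: sum a byte array with the loop body unrolled 4 times.
std::uint32_t sum_unrolled4(const std::uint8_t* data, std::size_t n) {
    std::uint32_t total = 0;
    std::size_t i = 0;

    // Main loop: one loop-count test per four elements instead of one per element.
    for (; i + 4 <= n; i += 4) {
        total += data[i];
        total += data[i + 1];
        total += data[i + 2];
        total += data[i + 3];
    }

    // Remainder loop: handles the last 0 to 3 elements when n is not a multiple of 4.
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

The remainder loop is the price of partial unrolling: it keeps the function correct for any length while the main loop stays small.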
Here is a story...
Years ago I was implementing an encryption algorithm for a military project. The guts of that algorithm were basically a loop over a 256-byte block, applying the same handful of logical operations to each byte, except that at three particular positions in the buffer it did some different or extra logical operation.
This project was written in PL/M, a high-level C/Pascal/Algol-like language. So I wrote the thing, compiled it, and then looked at the assembler it generated. Sure enough, the logical operations were a few simple instructions, but the bulk of the loop body was the tests and jumps for those three special cases. Yuk.
So I removed the loop. I copied and pasted the logical operations 256 times and added the little differences at the correct locations in that sequence. Compiling it again, I got my nice clean string of instructions doing real work, with no time wasted on useless tests. It ran about ten times faster than the loop version!
Now, being hell-bent on speed, I looked at the generated assembler and noticed the compiler had not been as optimal as I would have liked. So I copied its generated assembler and removed the inefficiencies in it.
The speed gain from optimizing assembler code? About 20%.
We put the PL/M version into production. The speed gain of moving to assembler was not worth it. Having everything in the same language, and something people could easily read, was not worth giving up for a marginal performance gain.
Since that time, I have been very loath to dick around squeezing marginal gains out of code by writing in assembler or fiddling about optimizing it.
Note: Unrolling the loop in my historical example made things much faster because I could remove the redundant loop count tests. Because the processor had no cache, it did not matter that the code ended up tens of times bigger. On a modern CPU it may be better to keep a tight loop in the cache and keep the tests in there.
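To make the shape of the problem concrete, here is a sketch in C++ rather than PL/M. Everything specific in it is an assumption for illustration: the positions in `kSpecial`, and the stand-in operations `common_op` and `special_op`, are invented, since the real ones are not given. The second function is not the full 256-way copy and paste from the story. It splits the loop at the special positions instead, which removes the per-byte special-case tests while keeping the code small.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlockSize = 256;

// Hypothetical special positions (must be sorted ascending and < kBlockSize).
constexpr std::size_t kSpecial[3] = {17, 100, 200};

// Hypothetical stand-in for the "handful of logical operations".
inline std::uint8_t common_op(std::uint8_t b) {
    return static_cast<std::uint8_t>((b ^ 0x5Au) + 1u);
}

// Hypothetical stand-in for the different or extra operation.
inline std::uint8_t special_op(std::uint8_t b) {
    return static_cast<std::uint8_t>(~common_op(b));
}

// Version 1: every iteration pays for the three special-case tests.
void transform_branchy(std::array<std::uint8_t, kBlockSize>& block) {
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        if (i == kSpecial[0] || i == kSpecial[1] || i == kSpecial[2]) {
            block[i] = special_op(block[i]);
        } else {
            block[i] = common_op(block[i]);
        }
    }
}

// Version 2: the loop is split at the special positions, so the inner
// loops contain only the common work plus the loop-count test.
void transform_split(std::array<std::uint8_t, kBlockSize>& block) {
    std::size_t i = 0;
    for (std::size_t s : kSpecial) {
        for (; i < s; ++i) {
            block[i] = common_op(block[i]);
        }
        block[i] = special_op(block[i]);  // i == s here
        ++i;
    }
    for (; i < kBlockSize; ++i) {
        block[i] = common_op(block[i]);
    }
}
```

Both functions produce the same result. Which one is faster on a given compiler and CPU is something to measure, not assume.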
Memory in C++ is a leaky abstraction.