What's the problem then? I've spotted a very rookie mistake (shame on me): I've removed the ! writeback from the load/store instructions, added "add x1, x1, #256" as the first instruction of the big block copy, and "add x0, x0, #256" before the "sub x3, x3, #1". Without this it didn't copy the last few bytes except on the first iteration, because the ! only incremented the registers by 224; and since an add is necessary anyway, I thought it would be clearer to add 256 explicitly.
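To illustrate, the loop ends up with roughly this shape (just a sketch, not the actual code; assume x0 = destination, x1 = source, x3 = number of 256-byte blocks left):

    big_copy:
        ldp q0, q1, [x1]            // load 256 bytes from the source,
        ldp q2, q3, [x1, #32]       // 32 bytes per ldp of two q registers
        ldp q4, q5, [x1, #64]
        ldp q6, q7, [x1, #96]
        stp q0, q1, [x0]            // store them to the destination
        stp q2, q3, [x0, #32]
        stp q4, q5, [x0, #64]
        stp q6, q7, [x0, #96]
        ldp q0, q1, [x1, #128]      // second half of the block
        ldp q2, q3, [x1, #160]
        ldp q4, q5, [x1, #192]
        ldp q6, q7, [x1, #224]
        stp q0, q1, [x0, #128]
        stp q2, q3, [x0, #160]
        stp q4, q5, [x0, #192]
        stp q6, q7, [x0, #224]
        add x1, x1, #256            // advance both pointers by the full block
        add x0, x0, #256
        sub x3, x3, #1              // one 256-byte block done
        cbnz x3, big_copy

The explicit adds advance both pointers by the full 256 bytes on every iteration, regardless of how the individual ldp/stp offsets are laid out.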
jahboater wrote:
Trouble then is that it cannot participate in the surrounding C optimization.

Actually, that was my intention: I don't want the compiler to optimize my memcpy. But maybe it's just me.
jahboater wrote:
Interesting. I believe the ERMSB enhancements did not extend to rep movsq - only movsb and stosb. Intel are supposed to keep updating rep movsb as modern hardware uses wider internal data paths and so on. They intend it as "the" memcpy instruction. In certain cases (originally, operands aligned to 16 bytes and length a multiple of 64 bytes) the performance of rep movsb far exceeds movsd and probably movsq. When I get a PC with AVX512 I'll try 64 byte register moves.

Yes, but with my Ivy Bridge processor I got better results with the SSE copy, just like these guys. Not sure why ERMSB isn't kicking in. Also, glibc only uses it in special cases (see the threshold and vecsize), and it doesn't use movntdqa either, only movaps for some reason. I need to investigate that too. The Linux kernel memcpy now has a dirty "insert a jump instruction in memcpy" hack, which is just ouch, it hurts.
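To make the comparison concrete, the two contenders look roughly like this in AT&T syntax (a sketch only; the register assignments are mine for the example, not glibc's actual code):

        # ERMSB path: one microcoded instruction copies the whole buffer
        # rdi = destination, rsi = source, rcx = length in bytes
        rep movsb

        # SSE path: 64 bytes per iteration with movaps,
        # assuming 16-byte aligned pointers and rdx = number of 64-byte blocks
    1:  movaps   (%rsi), %xmm0
        movaps 16(%rsi), %xmm1
        movaps 32(%rsi), %xmm2
        movaps 48(%rsi), %xmm3
        movaps %xmm0,   (%rdi)
        movaps %xmm1, 16(%rdi)
        movaps %xmm2, 32(%rdi)
        movaps %xmm3, 48(%rdi)
        add    $64, %rsi
        add    $64, %rdi
        dec    %rdx
        jnz    1b

A non-temporal variant would swap the movaps loads for movntdqa (and/or the stores for movntdq) to reduce cache pollution on large copies.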
jahboater wrote:
Or am I being naive and you were relying on something clever with dual issue or something?

No, you are perfectly correct, those small block copy loops do not load values for the next iteration. As I said, it's a beta, there's still room for improvement, thanks!
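For the record, loading ahead for the next iteration would look something like this (again just a sketch with example register use: x0 = destination, x1 = source, x3 = number of 32-byte blocks):

        ldp q0, q1, [x1], #32       // prologue: load block 0
        subs x3, x3, #1
        beq 2f
    1:  ldp q2, q3, [x1], #32       // load the next block...
        stp q0, q1, [x0], #32       // ...while storing the current one
        mov v0.16b, v2.16b          // rotate the freshly loaded data into place
        mov v1.16b, v3.16b
        subs x3, x3, #1
        bne 1b
    2:  stp q0, q1, [x0], #32       // epilogue: store the last block

In a real loop you would unroll and alternate register sets instead of the two movs, but the idea is the same: the loads for iteration i+1 can issue while the stores for iteration i are still draining.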