bzt
Posts: 393
Joined: Sat Oct 14, 2017 9:57 pm

Re: Puzzled, Mem speed?

Thu Mar 21, 2019 4:49 pm

jahboater wrote:
Thu Mar 21, 2019 8:32 am
bzt wrote:
Thu Mar 21, 2019 1:12 am
Any compilation errors?
No!
What's the problem then? I've spotted a very rookie mistake (shame on me): I've removed the ! from []! and added "add x1, x1, #256" as the first instruction of the big block copy, and "add x0, x0, #256" before the "sub x3, x3, #1". Without this it didn't copy the last few bytes except for the first iteration, as []! only incremented the registers by 224, and if an add is necessary anyway then I thought it would be clearer to add 256.
jahboater wrote:Trouble then is that it cannot participate in the surrounding C optimization.
Actually, that was my intention :-) I don't want the compiler to optimize my memcpy, but maybe it's just me.
jahboater wrote:Interesting. I believe the ERMSB enhancements did not extend to rep movsq - only movsb and stosb. Intel are supposed to keep updating rep movsb as modern hardware uses wider internal data paths and so on. They intend it as "the" memcpy instruction. In certain cases (originally, operands aligned to 16 bytes and length a multiple of 64 bytes) the performance of rep movsb far exceeds movsd and probably movsq. When I get a PC with AVX512 I'll try 64 byte register moves.
Yes, but with my Ivy Bridge processor I got better results with the SSE copy, just like these guys. Not sure why ERMSB doesn't kick in. Also, glibc uses it only in special cases (see threshold and vecsize), and it doesn't use movntdqa either, only movaps for some reason. Need to investigate that too. The Linux kernel memcpy now has a dirty "insert a jump instruction in memcpy" hack, which is just ouch, it hurts :-)
jahboater wrote:Or am I being naive and you were relying on something clever with dual issue or something?
No, you are perfectly correct, those small block copy loops do not load values for the next iteration. As I said, it's a beta; there's still room for improvement, thanks!

Cheers,
bzt

jahboater
Posts: 4675
Joined: Wed Feb 04, 2015 6:38 pm

Re: Puzzled, Mem speed?

Thu Mar 21, 2019 5:15 pm

bzt wrote:
Thu Mar 21, 2019 4:49 pm
What's the problem then? I've spotted a very rookie mistake (shame on me): I've removed the ! from []! and added "add x1, x1, #256" as the first instruction of the big block copy, and "add x0, x0, #256" before the "sub x3, x3, #1". Without this it didn't copy the last few bytes except for the first iteration, as []! only incremented the registers by 224, and if an add is necessary anyway then I thought it would be clearer to add 256.
Those changes fixed one set of regression tests! But it still fails somewhere else.
Have I got the above changes in correctly?

Code:

asm(
      /* check arguments, x0=dst, x1=src, w2=len */
    "cbz   x0, 2f;"
    "cbz   x1, 2f;"
    "cbz   x2, 2f;"
      /* small or unaligned block? */
    "and   x3, x2, 0xFFFFFFFFFFFFFE00;"
    "cbz   x3, 1f;"
    "and   x3, x0, 0xF;"
    "cbnz  x3, 1f;"
    "and   x3, x1, 0xF;"
    "cbnz  x3, 1f;"
      /* copy large blocks, 256 bytes per iteration */
    "add    x1, x1, 256;"
    "ldp    q0,  q1, [x1, 32 *0];"
    "ldp    q2,  q3, [x1, 32 *1];"
    "ldp    q4,  q5, [x1, 32 *2];"
    "ldp    q6,  q7, [x1, 32 *3];"
    "ldp    q8,  q9, [x1, 32 *4];"
    "ldp   q10, q11, [x1, 32 *5];"
    "ldp   q12, q13, [x1, 32 *6];"
    "ldp   q14, q15, [x1, 32 *7];"
    "lsr   x3, x2, 8;"
    "and   x2, x2, 0xFF;"
"0:  stp    q0,  q1, [x0, 32 *0];"
    "ldp    q0,  q1, [x1, 32 *0];"
    "stp    q2,  q3, [x0, 32 *1];"
    "ldp    q2,  q3, [x1, 32 *1];"
    "stp    q4,  q5, [x0, 32 *2];"
    "ldp    q4,  q5, [x1, 32 *2];"
    "stp    q6,  q7, [x0, 32 *3];"
    "ldp    q6,  q7, [x1, 32 *3];"
    "stp    q8,  q9, [x0, 32 *4];"
    "ldp    q8,  q9, [x1, 32 *4];"
    "stp   q10, q11, [x0, 32 *5];"
    "ldp   q10, q11, [x1, 32 *5];"
    "stp   q12, q13, [x0, 32 *6];"
    "ldp   q12, q13, [x1, 32 *6];"
    "stp   q14, q15, [x0, 32 *7];"
    "ldp   q14, q15, [x1, 32 *7];"
    "add   x0, x0, 256;"
    "sub   x3, x3, 1;"
    "cbnz  x3, 0b;"
    "sub   x1, x1, 256;"
      /* copy small and remaining block */
    "cbz   x2, 2f;"
"1:  and   x3, x0, 0x7;"
    "cbnz  x3, 1f;"
    "and   x3, x1, 0x7;"
    "cbnz  x3, 1f;"
    "lsr   x4, x2, 3;"
    "cbz   x4, 1f;"
    "and   x2, x2, 0x7;"
"0:  ldr   x3, [x1], 8;"
    "sub   x4, x4, 1;"
    "str   x3, [x0], 8;"
    "cbnz  x4, 0b;"
    "cbz   x2, 2f;"
"1:  ldrb  w3, [x1], 1;"
    "sub   w2, w2, 1;"
    "strb  w3, [x0], 1;"
    "cbnz  w2, 1b;"
"2:"
      // updated
    : "+r" (x0), "+r" (x1), "+r" (x2) :
      // clobbered
    : "x3", "x4", "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8",
      "q9", "q10", "q11", "q12", "q13", "q14", "q15" );
Last edited by jahboater on Thu Mar 21, 2019 6:45 pm, edited 1 time in total.

jahboater
Posts: 4675
Joined: Wed Feb 04, 2015 6:38 pm

Re: Puzzled, Mem speed?

Thu Mar 21, 2019 5:21 pm

bzt wrote:
Thu Mar 21, 2019 4:49 pm
The Linux kernel memcpy now has a dirty "insert a jump instruction in memcpy" hack, which is just ouch, it hurts :-)
Yes, that's dreadful! Shocking!

This comment says the kernel just uses rep movsb (like I do - its simplicity is appealing).

Code:

/*
 * memcpy_erms() - enhanced fast string memcpy. This is faster and
 * simpler than memcpy. Use memcpy_erms when possible.
 */
ENTRY(memcpy_erms)
	movq %rdi, %rax
	movq %rdx, %rcx
	rep movsb
	ret
ENDPROC(memcpy_erms)
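
For illustration only (this is not from the kernel source): a minimal C sketch of the same idea using GCC/Clang extended inline assembly. The name `copy_rep_movsb` is hypothetical, the sketch assumes the direction flag is clear (as the x86-64 System V ABI requires on function entry), and it falls back to plain `memcpy` on anything that is not x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: copy n bytes with "rep movsb" on x86-64,
 * plain memcpy elsewhere. Returns dst, like memcpy. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    void *ret = dst;
    /* rdi = destination, rsi = source, rcx = byte count; all three
     * are modified by the instruction, hence the "+" constraints. */
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
    return ret;
#else
    return memcpy(dst, src, n);
#endif
}
```

Whether this beats an SSE/AVX loop depends on the CPU and the copy size, as discussed above, so it is worth benchmarking before relying on it.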

bzt
Posts: 393
Joined: Sat Oct 14, 2017 9:57 pm

Re: Puzzled, Mem speed?

Sat Mar 23, 2019 3:18 pm

jahboater wrote:
Thu Mar 21, 2019 5:15 pm
Those changes fixed one set of regression tests! But it still fails somewhere else.
Have I got the above changes in correctly?
Almost! The "add x1" line is in the wrong place: it goes inside the loop as the first instruction (on the line with the 0: label):

Code:

      /* copy large blocks, 256 bytes per iteration */
-    "add    x1, x1, 256;"
    "ldp    q0,  q1, [x1, 32 *0];"

    "and   x2, x2, 0xFF;"
+ "0:  add x1, x1, #256;"
+   "stp    q0,  q1, [x0, 32 *0];"
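
A portable C model of the corrected loop may make the ordering easier to see. This is only a sketch: `copy_blocks_pipelined` and the `regs` buffer (standing in for q0..q15) are hypothetical names, and the `n > 1` guard that skips the final look-ahead load is an addition of this sketch, not something the assembly in this thread does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BLOCK = 256 };

/* Hypothetical helper: copies len / 256 whole blocks and returns the
 * number of bytes copied; the caller handles the remaining tail. */
static size_t copy_blocks_pipelined(unsigned char *dst,
                                    const unsigned char *src, size_t len)
{
    unsigned char regs[BLOCK];         /* stands in for q0..q15 */
    size_t blocks = len / BLOCK;       /* "lsr x3, x2, 8" */
    size_t n = blocks;

    assert(dst != NULL && src != NULL);
    if (blocks == 0)
        return 0;

    memcpy(regs, src, BLOCK);          /* preload block 0 before the loop */
    while (n != 0) {
        src += BLOCK;                  /* "add x1, x1, #256", first in loop */
        memcpy(dst, regs, BLOCK);      /* store the block loaded earlier */
        if (n > 1)                     /* guard added for this sketch */
            memcpy(regs, src, BLOCK);  /* load the next block */
        dst += BLOCK;                  /* "add x0, x0, #256" */
        n--;                           /* "sub x3, x3, #1" */
    }
    return blocks * BLOCK;
}
```

If the source pointer were advanced before the preload instead, the first store would write block 1 to the start of the destination, which matches the kind of failure reported above.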
jahboater wrote:This comment says the kernel just uses rep movsb (like I do - its simplicity is appealing).
It also says "when possible." Otherwise I couldn't agree more! Simplicity is the ultimate sophistication! Too bad it's not working on my hardware :-( I wish I could use that too! I hope within a few years ERMSB will be widespread enough to safely throw out all the other implementations. Good news is, that's just a matter of when, not if.

Cheers,
bzt

Gavinmc42
Posts: 3727
Joined: Wed Aug 28, 2013 3:31 am

Re: Puzzled, Mem speed?

Tue Mar 26, 2019 8:44 am

Internally the bus might be 128-bit, but the external 1GB LPDDR2 DRAM is only 32-bit.
Looking at the data sheet, the package contains two 4Gb dies, each 16 bits wide.

So when running a 64-bit OS, the data must be fetched from the RAM in two cycles.
Until Pis get a 64-bit memory bus, is a 64-bit OS handicapped?
This DDR has a burst length of 4, 8 or 16, at least in the Elpida data sheet I found for the old 3B.

Just confirmed the 3B+ is using the same LPDDR2, an Elpida B8132B4B-8D-F.
It has 8 internal banks for concurrent operation.
With a 400MHz clock, LPDDR2 means 800Mb/s/pin; faster on the 3B+?
4 bytes wide = 3.2GB/s? 3.6GB/s at 450MHz?
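
The arithmetic above as a small C sketch. `ddr_peak_mb_per_s` is a hypothetical helper; it gives the theoretical peak only (two transfers per clock times the bus width in bytes) and ignores refresh, bank conflicts and other overheads.

```c
#include <assert.h>

/* Theoretical peak bandwidth in MB/s (1 MB = 10^6 bytes) for a
 * double-data-rate bus: two transfers per clock cycle. */
static unsigned long ddr_peak_mb_per_s(unsigned long clock_mhz,
                                       unsigned bus_bits)
{
    assert(bus_bits % 8 == 0);
    unsigned long mega_transfers = clock_mhz * 2;  /* MT/s per pin */
    return mega_transfers * (bus_bits / 8);        /* bytes per transfer */
}
```

Calling it with 400 and 32 gives 3200 (3.2GB/s), and with 450 and 32 it gives 3600.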

I have no idea what your code does; I'm a hardware guy.
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

LdB
Posts: 1225
Joined: Wed Dec 07, 2016 2:29 pm

Re: Puzzled, Mem speed?

Tue Mar 26, 2019 6:50 pm

Gavinmc42 wrote:
Tue Mar 26, 2019 8:44 am
Until Pis get a 64-bit memory bus, is a 64-bit OS handicapped?
That is a silly statement. You go to 64-bit mode for the virtualization; that is the whole point of the Cortex-A53. It is not designed to be the world's fastest 64-bit processor. The Cortex-A57 has twice the throughput, but it would be a much harder core to join to the older Pi peripherals.

Current Pi Linux versions don't really use any virtualization stuff, so I am guessing there is little benefit to 64-bit Linux on the Pi, which is probably why all the official OS releases are actually running 32-bit.

Secondly, talking about things being handicapped because a single core can't hit the memory bus limit is bizarre. In many situations, especially multicore, you simply don't want that to happen; it is a good thing.

DavidS
Posts: 4334
Joined: Thu Dec 15, 2011 6:39 am
Location: USA

Re: Puzzled, Mem speed?

Tue May 21, 2019 4:58 am

It is interesting how people have pointed out issues with using a single instruction pair for memcopy. I think we all know these.

The test case example (and the kind of thing that would be done this way) is for a memset operation using a single STMIA instruction in the inner loop (no load operations).

OK, so my internet connection has been down for a while; sorry about the delay in saying anything.
RPi = The best ARM based RISC OS computer around
More than 95% of posts made from RISC OS on RPi 1B/1B+ computers. Most of the rest from RISC OS on RPi 2B/3B/3B+ computers
