User avatar
DavidS
Posts: 3800
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Puzzled, Mem speed?

Fri Mar 15, 2019 8:18 am

If the SDRAM is clocked at 450MHz (RPi3B+) and the ARM external data bus is 128 bits wide, I would expect the maximum throughput to be close to 6.86GB/s. From this we take off a little for refresh, so say 6GB/s. Even if we take off up to 1GB/s for the GPU to access RAM (more than twice the rate needed for 60FPS at 1920x1080x32bpp [474.61MB/s]) we should still have over 5GB/s of bandwidth between the ARM and RAM.

Now for the trouble: with cores 1-3 parked in WFE, I am barely reaching 1.01GB/s of throughput from a single core, with a loop tight enough that I should be maxing out the bus.

The loop I use for testing memory speed is one I have shown before; it just writes to the entire framebuffer (as it is the same SDRAM as the rest of the system memory) for a known number of frames, and I output the time taken. The loop normally runs through 65536 full frames at 1360x768x32bpp, and the total time elapsed is recorded. I consistently get within the margin of error of 240FPS (239.997FPS) on my RPi 3B+ at 1150MHz core, all other settings default.
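For reference, here is the arithmetic behind those two figures as a quick sanity check (just a sketch in C; it ignores the extra pitch beyond the visible width, and the peak figure depends on exactly how you count the SDRAM clock):

Code: Select all

/* Back-of-the-envelope check of the figures quoted above. */
#include <stdio.h>

int main(void)
{
    /* Theoretical peak: 128-bit (16-byte) bus at 450MHz, before refresh etc. */
    double peak = 450e6 * 16.0;

    /* Measured: 1360x768 at 32bpp, written ~240 times per second. */
    double frame_bytes = 1360.0 * 768.0 * 4.0;
    double measured    = frame_bytes * 240.0;

    printf("theoretical peak ~ %.2f GB/s\n", peak / 1e9);     /* ~7.2 GB/s */
    printf("measured         ~ %.2f GB/s\n", measured / 1e9); /* ~1.0 GB/s */
    return 0;
}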

These results would suggest a 32 bit wide bus instead of the 128 bit wide bus that we have on the RPi 3B+.

I am using the ARM STM instruction in a loop that I have verified to be the fastest possible, storing 4 registers per store (any more or less slows us down). Only one store per iteration of the loop (any more slows us down).

Caches enabled, branch prediction enabled, for now MMU disabled. These are the same speeds I get in RISC OS for the same test, proving that the limit is not the OS (bare metal matches).

If needed I can post the code of the loop itself.

Sorry about the way the resolutions are listed. Unfortunately if I write them correctly, with the bpp number after an AT symbol, the forum hides it (the forum cannot tell that what follows is not a valid domain name, so it treats it as an email address).
RPi = Way for me to have fun and save power.
100% Off Grid.
Household TTL Electricity Usage = 1.4KW/h per day.
500W Solar System, produces 2.8KW/h per day average.

jahboater
Posts: 4182
Joined: Wed Feb 04, 2015 6:38 pm

Re: Puzzled, Mem speed?

Fri Mar 15, 2019 8:33 am

DavidS wrote:
Fri Mar 15, 2019 8:18 am
I am using the ARM STM instruction in a loop that I have verified to be the fastest possible, storing 4 registers per store (any more or less slows us down). Only one store per iteration of the loop (any more slows us down).
I found that NEON had faster access to memory, I don't know why. NEON is quad issue which may help a little.

Also the registers are larger.
You can use the 128-bit Q registers the same way, so if 4 registers per store gives the best performance, then with VLDM/VSTM {Q0-Q3} you can move 64 bytes each time, faster than {R0-R3}.

16 byte or better alignment is helpful.

In 64-bit mode LDP/STP Q0,Q1 are even faster still, but you can only move 32 bytes at a time.

LdB
Posts: 1102
Joined: Wed Dec 07, 2016 2:29 pm

Re: Puzzled, Mem speed?

Fri Mar 15, 2019 9:44 am

You may want to look at the STM timings, even on a Cortex-A7, and it's similar on a Cortex-A53:
https://hardwarebug.org/2014/05/15/cort ... e-timings/

User avatar
DavidS
Posts: 3800
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Re: Puzzled, Mem speed?

Fri Mar 15, 2019 9:56 am

jahboater wrote:
Fri Mar 15, 2019 8:33 am
DavidS wrote:
Fri Mar 15, 2019 8:18 am
I am using the ARM STM instruction in a loop that I have verified to be the fastest possible, storing 4 registers per store (any more or less slows us down). Only one store per iteration of the loop (any more slows us down).
I found that NEON had faster access to memory, I don't know why. NEON is quad issue which may help a little.

Also the registers are larger.
You can use STM with the 128-bit Q registers, so if 4 registers per store gives the best performance, then with LDM/STM {Q0-Q3} you can move 64 bytes each time, faster than {R0-R3}.

16 byte or better alignment is helpful.

In 64-bit mode LDP/STP Q0,Q1 are even faster still, but you can only move 32 bytes at a time.
Thank you for the reply. Yes, the inner loop is aligned to a 16 byte boundary, and is 12 bytes in size.

[h2]I should have been clearer about my issue with this:[/h2]
On the RPi 1B+ running at 1GHz the same code achieves a bit over 2GB/s throughput, and that is with 400MHz RAM.

Of course some of the setup code is slightly different to accomplish the same effective setup (L2 cache on VC, no worries about getting out of HYP, though still enabling L1 caches and Branch prediction).

This code should be a good deal faster on the RPi 3B+ with its improvements; the memory speed alone should have made a huge difference. I did not expect the full 5GB/s potential of the 3B+ (just as I do not expect the full 4GB/s logical potential on the 1B+).

How is it that, for this particular case of linear access to large blocks of memory, the ARMv6 is noticeably outperforming the ARMv8 with the same code? There is nothing in it that is suboptimal for the ARMv8 (conditional branch targets are mostly below the current R15, and register dependencies between nearby instructions are avoided as much as possible).
RPi = Way for me to have fun and save power.
100% Off Grid.
Household TTL Electricity Usage = 1.4KW/h per day.
500W Solar System, produces 2.8KW/h per day average.

User avatar
DavidS
Posts: 3800
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Re: Puzzled, Mem speed?

Fri Mar 15, 2019 10:01 am

LdB wrote:
Fri Mar 15, 2019 9:44 am
You may want to look at the STM timings, even on a Cortex-A7, and it's similar on a Cortex-A53:
https://hardwarebug.org/2014/05/15/cort ... e-timings/
Thank you for that. That would imply that the STM/LDM ops move data in 64-bit chunks on the ARMv7/8. Yet my testing suggests they are reading/writing in 32-bit chunks on the ARMv8 and in 64-bit chunks on the ARMv6, which does not add up.

Why is the ARMv8 slower than the ARMv6 at this? How do I better optimize this situation for the ARMv8 using only integer, non-VFP classic ARM instructions?
RPi = Way for me to have fun and save power.
100% Off Grid.
Household TTL Electricity Usage = 1.4KW/h per day.
500W Solar System, produces 2.8KW/h per day average.

jahboater
Posts: 4182
Joined: Wed Feb 04, 2015 6:38 pm

Re: Puzzled, Mem speed?

Fri Mar 15, 2019 10:04 am

DavidS wrote:
Fri Mar 15, 2019 9:56 am
Yes, the inner loop is aligned to a 16 byte boundary, and is 12 bytes in size.
I meant the data should be aligned to 16 bytes if possible.
If the external bus is 128-bits wide, that would fit well.
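If the destination ever comes from your own allocator rather than the firmware, a trivial way to force that alignment (just a sketch; the mailbox-supplied framebuffer should normally already be well aligned):

Code: Select all

#include <stdint.h>

/* Round a pointer up to the next 16-byte boundary. */
static inline void *align16(void *p)
{
    return (void *)(((uintptr_t)p + 15u) & ~(uintptr_t)15u);
}

/* Usage: over-allocate by 15 bytes and start writing at align16(buf). */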
DavidS wrote:
Fri Mar 15, 2019 9:56 am
Why is the ARMv8 slower than the ARMv6 at this? How do I better optimize this situation for the ARMv8 using only integer, non-VFP classic ARM instructions?
I don't know. Seems very strange.
LDM/STM have been dropped from 64-bit mode because they are slow and don't handle interrupts well.
Replaced with LDP/STP which are more RISC like!

When you use the 128-bit Q registers with NEON, it is nothing to do with floating-point.
You use them to move integer data (or any data, it doesn't care).

If you can find a faster loop than this (in 32-bit ARM mode) for moving data I would be very very interested.
The SUBS is in the middle for the obvious scheduling reasons.
Note this reduces register pressure because none of the ARM integer registers are used for the data.

Code: Select all

loop: vldm r1!,{q0-q3}    @ load 64 bytes from [r1], post-increment
      subs r2,r2,#64      @ decrement the remaining byte count
      vstm r0!,{q0-q3}    @ store 64 bytes to [r0], post-increment
      bne loop
NEON is guaranteed present, forget VFP.

User avatar
DavidS
Posts: 3800
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Re: Puzzled, Mem speed?

Fri Mar 15, 2019 11:03 am

I am doing this to figure out the maximum throughput from the ARM while the VFP/NEON instruction queue is kept full with other work, which rules out using VFP/NEON here.

I am aware that if it were just simple memory copy/set operations I could speed things up a lot by using DMA channels. Some algorithms, though, require performing linear memory operations at the fastest possible rate, interleaved with the data-processing calculations.

I would like to get this one figured out so that I can get back to rounding out my current code, do some testing, get the internet connected RPi booted into Linux and upload to github.
RPi = Way for me to have fun and save power.
100% Off Grid.
Household TTL Electricity Usage = 1.4KW/h per day.
500W Solar System, produces 2.8KW/h per day average.

PhilE
Raspberry Pi Engineer & Forum Moderator
Posts: 2149
Joined: Mon Sep 29, 2014 1:07 pm
Location: Cambridge

Re: Puzzled, Mem speed?

Fri Mar 15, 2019 11:58 am

Post your code - somebody may spot something.

User avatar
DavidS
Posts: 3800
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Re: Puzzled, Mem speed?

Fri Mar 15, 2019 3:18 pm

PhilE wrote:
Fri Mar 15, 2019 11:58 am
Post your code - somebody may spot something.
Alright, seems a bit dull to post three instructions so was hoping not to.

Here is the inner loop, and the two levels out on the loop (so as to provide something to look at):

Code: Select all

.ClrSLpC                       ; outer loop: one full-frame fill per pass, fill value in R0
    MOV R1,R0                  ; replicate the fill value into R1-R3 for the 4-register STM
    LDR R5,FbBase              ; R5 = framebuffer base address
    MOV R2,R0
    LDR R9,ScrnY               ; R9 = row counter
    MOV R3,R0
.ClrSLpY
      LDR R6,ScrnX             ; R6 = column counter (pixels)
      MOV R7,R5                ; R7 = write pointer for this row
.ClrSLpX
        STMEA R7!,{R0-R3}      ; store 16 bytes (4 pixels at 32bpp), post-increment
        SUBS  R6,R6,#4
      BPL ClrSLpX
      ADD R5,R5,R4             ; step to the next row by the pitch in R4
      SUBS R9,R9,#1
    BPL ClrSLpY
    SUBS R0,R0,#&00010000      ; change the fill value; repeat until it goes negative
  BPL ClrSLpC
Entered with R0 holding the value &00FF0000 and R4 holding the row length/pitch (the framebuffer is wider than the screen). I hope I copied that correctly; it is a bit more difficult to read without a fixed-width font.

This is just to test the maximum bandwidth for memory writes, so the information can be used for other projects (including one that is waiting for completion so I can get it onto GitHub).
RPi = Way for me to have fun and save power.
100% Off Grid.
Household TTL Electricity Usage = 1.4KW/h per day.
500W Solar System, produces 2.8KW/h per day average.

jahboater
Posts: 4182
Joined: Wed Feb 04, 2015 6:38 pm

Re: Puzzled, Mem speed?

Fri Mar 15, 2019 4:42 pm

Perhaps put the SUBS before the STMEA, then there is more time for the flags to be set before the BPL?

Code: Select all

.ClrSLpX
      SUBS  R6,R6,#4
      STMEA R7!,{R0-R3}
      BPL ClrSLpX

LdB
Posts: 1102
Joined: Wed Dec 07, 2016 2:29 pm

Re: Puzzled, Mem speed?

Fri Mar 15, 2019 4:46 pm

Here is my entry doing 64-bit writes. If the Cortex-A7 timings are close it is a lot quicker:

Code: Select all

r0 = low 32 bits of the 64-bit value to write
r1 = high 32 bits of the 64-bit value to write
r2 = frame buffer end address to end write at
r3 = frame buffer start address to start write at  
    movt    r3, 0
    movt    r2, 0
.loop:
    strd    r0, [r3], #8
    cmp     r3, r2
    bne     .loop

User avatar
DavidS
Posts: 3800
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Re: Puzzled, Mem speed?

Fri Mar 15, 2019 10:04 pm

jahboater wrote:
Fri Mar 15, 2019 4:42 pm
Perhaps put the SUBS before the STMEA, then there is more time for the flags to be set before the BPL?

Code: Select all

.ClrSLpX
      SUBS  R6,R6,#4
      STMEA R7!,{R0-R3}
      BPL ClrSLpX
Thank you. Facepalm moment: I did exactly what I recommend against for performance. The speed-up is not significant though.
RPi = Way for me to have fun and save power.
100% Off Grid.
Household TTL Electricity Usage = 1.4KW/h per day.
500W Solar System, produces 2.8KW/h per day average.

User avatar
DavidS
Posts: 3800
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Re: Puzzled, Mem speed?

Fri Mar 15, 2019 10:11 pm

LdB wrote:
Fri Mar 15, 2019 4:46 pm
Here is my entry doing 64-bit writes. If the Cortex-A7 timings are close it is a lot quicker:

Code: Select all

r0 = low 32 bits of the 64-bit value to write
r1 = high 32 bits of the 64-bit value to write
r2 = frame buffer end address to end write at
r3 = frame buffer start address to start write at  
    movt    r3, 0
    movt    r2, 0
.loop:
    strd    r0, [r3], #8
    cmp     r3, r2
    bne     .loop
I had not thought about using the double word instructions, I had forgotten about them to be honest.

On a quick test (under RISC OS) this seems to be about 3 times as fast. I will have to pop it into the bare metal test program and see how it goes.

I guess this is another facepalm. As STRD is available on ARMv5 and newer cores, it should work on the RPi 1 as well.
RPi = Way for me to have fun and save power.
100% Off Grid.
Household TTL Electricity Usage = 1.4KW/h per day.
500W Solar System, produces 2.8KW/h per day average.

LdB
Posts: 1102
Joined: Wed Dec 07, 2016 2:29 pm

Re: Puzzled, Mem speed?

Sat Mar 16, 2019 12:09 am

A 3- to 4-fold gain is pretty much what the data sheet says, which proves it isn't a data bus issue but a CPU cycle latency issue: you don't have enough speed on the ARM to max the bus out.

WTB faster ARM :-)

If you want to test the data bus speed, you could probably set up the same sort of thing on the VC4, which can write 128 bits at a time. I am guessing it will be a lot faster even than that.

User avatar
DavidS
Posts: 3800
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Re: Puzzled, Mem speed?

Sat Mar 16, 2019 2:37 am

LdB wrote:
Sat Mar 16, 2019 12:09 am
A 3- to 4-fold gain is pretty much what the data sheet says, which proves it isn't a data bus issue but a CPU cycle latency issue: you don't have enough speed on the ARM to max the bus out.

WTB faster ARM :-)

If you want to test the data bus speed, you could probably set up the same sort of thing on the VC4, which can write 128 bits at a time. I am guessing it will be a lot faster even than that.
You are likely correct about the VC4 being able to max out the bus. My only interest in this case is to figure out the maximum speed at which the ARM can write without resorting to VFP/NEON (so VFP can be used for other operations at the same time).

I do think that if you loaded down 12 of the DMA channels you could likely max out the usable bus speed, though could be wrong.
RPi = Way for me to have fun and save power.
100% Off Grid.
Household TTL Electricity Usage = 1.4KW/h per day.
500W Solar System, produces 2.8KW/h per day average.

User avatar
DavidS
Posts: 3800
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Re: Puzzled, Mem speed?

Sat Mar 16, 2019 3:10 am

LdB wrote:
Sat Mar 16, 2019 12:09 am
A 3- to 4-fold gain is pretty much what the data sheet says, which proves it isn't a data bus issue but a CPU cycle latency issue: you don't have enough speed on the ARM to max the bus out.
BTW:
STRD is slower on the RPi 1, and faster on the RPi 2B, 3B, and 3B+ (my 2B is a BCM2836 ARMv7 version). Though it is still fast enough on the RPi 1 not to worry about (it is only a 15% difference in speed).

I spent so many years writing code that had to run on ARMv3 systems, as well as some that had to run on ARMv2 systems, that I often forget about the newer instructions, of which there are only a handful for 32-bit ARM (not counting Thumb), so they should be easy enough to remember. Off the top of my head:
ARMv4 SETEND, BLX/BX, and some cache control.
ARMv5 DSP Extensions, long multiplies, CLZ, BKPT, and double word load/store instructions.
ARMv6 Not really sure of any changes from ARMv5 to the instruction set; the enhancements were in performance to my knowledge.
ARMv7 HYP mode, ERET and similar, SDIV/UDIV.
ARMv8 Removed SWP.

That is just off the top of my head, I am sure I missed a few (though not many).

Later implementations can include coprocessors like VFP/NEON, though these do not add ARM instructions (they are just coprocessor instructions, which have been there since ARMv2). The coprocessors are usually able to operate asynchronously from the ARM core (they share the silicon, though so do the peripherals in modern SoCs).
RPi = Way for me to have fun and save power.
100% Off Grid.
Household TTL Electricity Usage = 1.4KW/h per day.
500W Solar System, produces 2.8KW/h per day average.

bzt
Posts: 343
Joined: Sat Oct 14, 2017 9:57 pm

Re: Puzzled, Mem speed?

Mon Mar 18, 2019 7:15 pm

Hi,
DavidS wrote:
Fri Mar 15, 2019 8:18 am
I am using the ARM STM instruction in a loop that I have verified to be the fastest possible, storing 4 registers per store (any more or less slows us down). Only one store per iteration of the loop (any more slows us down).
This seems odd. Here I've linked more memcpy implementations, but all of them use more than one store per iteration.

For a 32-bit version, take a look at this memcpy. It uses two LDMs/STMs per iteration, and can use nasty PC hacks to jump into a list of instructions to load and store the proper number of bytes (line 98). Nice! That's what I call a clever optimisation, even if it's unused! :-D

Also @jahboater's advice is good. I've been working on a fast memcpy myself for some time now; here's what I've got so far (beta).

Features and limitations:
  • it does not correct alignment; it only uses the optimized copy if both the source and the destination are 16-byte aligned and the length is at least 512 bytes
  • the big chunk copy uses NEON registers, and copies 256 bytes per iteration
  • for smaller blocks there are two options: if the source and destination are 8-byte aligned and the length is at least 8 bytes, it copies 8 bytes per iteration, otherwise 1 byte at a time.
  • needs more testing for prefetching and more optimisations
  • licensed under CC-by-nc-sa (you can use it in your hobby project freely as long as you include an attribution, but not in commercial software without prior written permission from me)

Code: Select all

    /* fast memcpy Copyright (c) 2019 by bzt CC-by-nc-sa */
memcpy:
    /* check arguments, x0=dst, x1=src, x2=len */
    cbz     x0, 2f
    cbz     x1, 2f
    cbz     x2, 2f
    /* small or unaligned block? */
    and     x3, x2, #0xFFFFFFFFFFFFFE00
    cbz     x3, 1f
    and     x3, x0, #0xF
    cbnz    x3, 1f
    and     x3, x1, #0xF
    cbnz    x3, 1f
    /* copy large blocks, 256 bytes per iteration */
    ldp      q0,  q1, [x1, #32 *0]
    ldp      q2,  q3, [x1, #32 *1]
    ldp      q4,  q5, [x1, #32 *2]
    ldp      q6,  q7, [x1, #32 *3]
    ldp      q8,  q9, [x1, #32 *4]
    ldp     q10, q11, [x1, #32 *5]
    ldp     q12, q13, [x1, #32 *6]
    ldp     q14, q15, [x1, #32 *7]!
    lsr     x3, x2, #8
    and     x2, x2, #0xFF
0:  stp      q0,  q1, [x0, #32 *0]
    ldp      q0,  q1, [x1, #32 *0]
    stp      q2,  q3, [x0, #32 *1]
    ldp      q2,  q3, [x1, #32 *1]
    stp      q4,  q5, [x0, #32 *2]
    ldp      q4,  q5, [x1, #32 *2]
    stp      q6,  q7, [x0, #32 *3]
    ldp      q6,  q7, [x1, #32 *3]
    stp      q8,  q9, [x0, #32 *4]
    ldp      q8,  q9, [x1, #32 *4]
    stp     q10, q11, [x0, #32 *5]
    ldp     q10, q11, [x1, #32 *5]
    stp     q12, q13, [x0, #32 *6]
    ldp     q12, q13, [x1, #32 *6]
    stp     q14, q15, [x0, #32 *7]!
    ldp     q14, q15, [x1, #32 *7]!
    sub     x3, x3, #1
    cbnz    x3, 0b
    sub     x1, x1, #256
    /* copy small and remaining block */
    cbz     x2, 2f
1:  and     x3, x0, #0x7
    cbnz    x3, 1f
    and     x3, x1, #0x7
    cbnz    x3, 1f
    lsr     x4, x2, #3
    cbz     x4, 1f
    and     x2, x2, #0x7
0:  ldr     x3, [x1], #8
    str     x3, [x0], #8
    sub     x4, x4, #1
    cbnz    x4, 0b
    cbz     x2, 2f
1:  ldrb    w3, [x1], #1
    strb    w3, [x0], #1
    sub     w2, w2, #1
    cbnz    x2, 1b
2:  ret
I also have a x86/SSE version of this memcpy, because I'm working on a multiplatform library, but I don't think it's relevant on this forum.

Feel free to test my code and measure. Should provide much more than 1GB/s (with proper alignment).

Cheers,
bzt

jahboater
Posts: 4182
Joined: Wed Feb 04, 2015 6:38 pm

Re: Puzzled, Mem speed?

Tue Mar 19, 2019 9:05 am

Thanks for posting this.
I tried it by inlining it in a large C program, and sadly failed, 99.999999999% likely because my C inline stuff is wrong :(
It's called "movs" because the x86 version just does "rep movsb" (since ERMSB).

Code: Select all

static char *
movs( void *dst, const void *src, dword len )
{
#if DEBUG
  assert( !overlap( dst, src, len ) );
#endif
#if A64
    register void * x0 reg( "x0" ) = dst;
    register const void * x1 reg( "x1" ) = src;
    register dword x2 reg( "x2" ) = len;
      /* fast memcpy Copyright (c) 2019 by bzt CC-by-nc-sa */
asm(
      /* check arguments, x0=dst, x1=src, w2=len */
    "cbz   x0, 2f;"
    "cbz   x1, 2f;"
    "cbz   x2, 2f;"
      /* small or unaligned block? */
    "and   x3, x2, 0xFFFFFFFFFFFFFE00;"
    "cbz   x3, 1f;"
    "and   x3, x0, 0xF;"
    "cbnz  x3, 1f;"
    "and   x3, x1, 0xF;"
    "cbnz  x3, 1f;"
      /* copy large blocks, 256 bytes per iteration */
    "ldp    q0,  q1, [x1, 32 *0];"
    "ldp    q2,  q3, [x1, 32 *1];"
    "ldp    q4,  q5, [x1, 32 *2];"
    "ldp    q6,  q7, [x1, 32 *3];"
    "ldp    q8,  q9, [x1, 32 *4];"
    "ldp   q10, q11, [x1, 32 *5];"
    "ldp   q12, q13, [x1, 32 *6];"
    "ldp   q14, q15, [x1, 32 *7]!;"
    "lsr   x3, x2, 8;"
    "and   x2, x2, 0xFF;"
"0:  stp    q0,  q1, [x0, 32 *0];"
    "ldp    q0,  q1, [x1, 32 *0];"
    "stp    q2,  q3, [x0, 32 *1];"
    "ldp    q2,  q3, [x1, 32 *1];"
    "stp    q4,  q5, [x0, 32 *2];"
    "ldp    q4,  q5, [x1, 32 *2];"
    "stp    q6,  q7, [x0, 32 *3];"
    "ldp    q6,  q7, [x1, 32 *3];"
    "stp    q8,  q9, [x0, 32 *4];"
    "ldp    q8,  q9, [x1, 32 *4];"
    "stp   q10, q11, [x0, 32 *5];"
    "ldp   q10, q11, [x1, 32 *5];"
    "stp   q12, q13, [x0, 32 *6];"
    "ldp   q12, q13, [x1, 32 *6];"
    "stp   q14, q15, [x0, 32 *7]!;"
    "ldp   q14, q15, [x1, 32 *7]!;"
    "sub   x3, x3, 1;"
    "cbnz  x3, 0b;"
    "sub   x1, x1, 256;"
      /* copy small and remaining block */
    "cbz   x2, 2f;"
"1:  and   x3, x0, 0x7;"
    "cbnz  x3, 1f;"
    "and   x3, x1, 0x7;"
    "cbnz  x3, 1f;"
    "lsr   x4, x2, 3;"
    "cbz   x4, 1f;"
    "and   x2, x2, 0x7;"
"0:  ldr   x3, [x1], 8;"
    "sub   x4, x4, 1;"
    "str   x3, [x0], 8;"
    "cbnz  x4, 0b;"
    "cbz   x2, 2f;"
"1:  ldrb  w3, [x1], 1;"
    "sub   w2, w2, 1;"
    "strb  w3, [x0], 1;"
    "cbnz  w2, 1b;"
"2:"
      // updated
    : "+r" (x0), "+r" (x1), "+r" (x2) :
      // clobbered
    : "x3", "x4", "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8",
      "q9", "q10", "q11", "q12", "q13", "q14", "q15" );
  return (char*)dst + len;
#else
  return (char*)memcpy( dst, src, len ) + len;
#endif
}
The only logic I changed was the final two small loops where I moved the "sub" up and in between the ldr/str, and the final cbnz to w2

User avatar
DavidS
Posts: 3800
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Re: Puzzled, Mem speed?

Tue Mar 19, 2019 10:05 am

bzt wrote:
Mon Mar 18, 2019 7:15 pm
Hi,
DavidS wrote:
Fri Mar 15, 2019 8:18 am
I am using the ARM STM instruction in a loop that I have verified to be the fastest possible, storing 4 registers per store (any more or less slows us down). Only one store per iteration of the loop (any more slows us down).
This seems odd. Here I've linked more memcpy implementations, but all of them use more than one store per iteration.

For a 32-bit version, take a look at this memcpy. It uses two LDMs/STMs per iteration, and can use nasty PC hacks to jump into a list of instructions to load and store the proper number of bytes (line 98). Nice! That's what I call a clever optimisation, even if it's unused! :-D
I would agree that it seems odd at first. Then I got to looking at the timings, and realized fairly quickly why it is the quickest when using LDM/STM on the newer ARM CPUs.

With ARMv5 and earlier (and some ARMv6 implementations) it was, and is, a lot quicker to use unrolling tricks. Though with the ARM1176, Cortex-A7, and Cortex-A53 used in the Raspberry Pi series of computers it is a lot faster to bring it down to a single instruction when doing write-only work. This is going to be used to implement routines that write calculated values out into a vector in memory, which is very useful for many fast algorithms.

Now for a large block memcopy I would not recommend using the CPU, VFP, or GPU. The reason is that these units could be put to other uses while the memcopy is being completed by DMA, and I do not know of many personal computers that are without DMA. Think about a real-time ray tracer with a maximum recursion depth of 4 as a potential application (I still think it can be done on an RPi at 320x200 resolution), and you start to see that making use of every resource can make a huge difference to the result.

Now if only I can get the Ray Tracer to reach better than 14FPS at 640x480 16bpp (using 4 ARM cores, 4 NEON cores, and as much QPU time as I feel comfortable using) for only up to 850 surfaces visible. I am sure I will improve my algorithm, as well as learn some tricks from others as I progress down this path.
RPi = Way for me to have fun and save power.
100% Off Grid.
Household TTL Electricity Usage = 1.4KW/h per day.
500W Solar System, produces 2.8KW/h per day average.

jahboater
Posts: 4182
Joined: Wed Feb 04, 2015 6:38 pm

Re: Puzzled, Mem speed?

Tue Mar 19, 2019 10:18 am

DavidS wrote:
Tue Mar 19, 2019 10:05 am
The reason being that these units could be put to other uses while the memcopy is being completed by DMA, and I do not know of many personal computers that are without DMA.
I don't think moving stuff between memory and registers with LDM or LDP or LDR can use DMA ?
Or is this an ARM thing I don't know about?

User avatar
DavidS
Posts: 3800
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Re: Puzzled, Mem speed?

Tue Mar 19, 2019 10:43 am

jahboater wrote:
Tue Mar 19, 2019 10:18 am
DavidS wrote:
Tue Mar 19, 2019 10:05 am
The reason being that these units could be put to other uses while the memcopy is being completed by DMA, and I do not know of many personal computers that are without DMA.
I don't think moving stuff between memory and registers with LDM or LDP or LDR can use DMA ?
Or is this an ARM thing I don't know about?
You would be correct, those are ARM instructions not DMA. I said that for a large block memcopy I would not use LDM/LDP/LDR though would instead use DMA.
RPi = Way for me to have fun and save power.
100% Off Grid.
Household TTL Electricity Usage = 1.4KW/h per day.
500W Solar System, produces 2.8KW/h per day average.

jahboater
Posts: 4182
Joined: Wed Feb 04, 2015 6:38 pm

Re: Puzzled, Mem speed?

Tue Mar 19, 2019 10:47 am

DavidS wrote:
Tue Mar 19, 2019 10:43 am
jahboater wrote:
Tue Mar 19, 2019 10:18 am
DavidS wrote:
Tue Mar 19, 2019 10:05 am
The reason being that these units could be put to other uses while the memcopy is being completed by DMA, and I do not know of many personal computers that are without DMA.
I don't think moving stuff between memory and registers with LDM or LDP or LDR can use DMA ?
Or is this an ARM thing I don't know about?
You would be correct, those are ARM instructions not DMA. I said that for a large block memcopy I would not use LDM/LDP/LDR though would instead use DMA.
Ah sorry, misread it.
I have never seen DMA used for memcpy but it sounds like a very good idea.
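For reference, a rough sketch of what a memory-to-memory copy on one of the BCM283x DMA channels might look like (assumptions on my part: an RPi 2/3 peripheral base of 0x3F000000, DMA channel 0 free, and the caller converting pointers to bus addresses and handling cache maintenance; check the BCM2835 ARM Peripherals document before trusting any of these offsets or bit values):

Code: Select all

#include <stdint.h>

#define PERIPH_BASE   0x3F000000u              /* assumed: RPi 2/3 peripheral base */
#define DMA0_BASE     (PERIPH_BASE + 0x7000u)  /* DMA channel 0 register block     */

#define DMA_CS        (*(volatile uint32_t *)(DMA0_BASE + 0x00)) /* control/status */
#define DMA_CONBLK_AD (*(volatile uint32_t *)(DMA0_BASE + 0x04)) /* control block  */

#define CS_ACTIVE     (1u << 0)
#define CS_END        (1u << 1)    /* write 1 to clear */
#define CS_RESET      (1u << 31)
#define TI_DEST_INC   (1u << 4)
#define TI_SRC_INC    (1u << 8)

/* Control block read by the DMA engine itself; must be 32-byte aligned. */
struct dma_cb {
    uint32_t ti;          /* transfer information          */
    uint32_t source_ad;   /* source bus address            */
    uint32_t dest_ad;     /* destination bus address       */
    uint32_t txfr_len;    /* length in bytes               */
    uint32_t stride;      /* 2D stride (unused here)       */
    uint32_t nextconbk;   /* next control block (0 = stop) */
    uint32_t pad[2];
} __attribute__((aligned(32)));

static struct dma_cb cb;

/* dst_bus/src_bus are bus addresses; len is in bytes. */
void dma_memcpy(uint32_t dst_bus, uint32_t src_bus, uint32_t len)
{
    cb.ti        = TI_SRC_INC | TI_DEST_INC;  /* plain incrementing copy     */
    cb.source_ad = src_bus;
    cb.dest_ad   = dst_bus;
    cb.txfr_len  = len;
    cb.stride    = 0;
    cb.nextconbk = 0;                         /* single control block        */

    DMA_CS        = CS_RESET;                 /* reset the channel           */
    DMA_CONBLK_AD = (uint32_t)(uintptr_t)&cb; /* needs the bus address of cb */
    DMA_CS        = CS_ACTIVE;                /* start the transfer          */

    while (!(DMA_CS & CS_END)) { }            /* busy-wait for simplicity    */
    DMA_CS = CS_END;                          /* acknowledge completion      */
}
The point being that once CS_ACTIVE is set the ARM is free until the END flag comes back (the busy-wait here is only for the sake of a short example), which is exactly the overlap being discussed above.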

bzt
Posts: 343
Joined: Sat Oct 14, 2017 9:57 pm

Re: Puzzled, Mem speed?

Thu Mar 21, 2019 1:12 am

jahboater wrote:
Tue Mar 19, 2019 9:05 am
Thanks for posting this.
I tried it by inlining it in a large C program, and sadly failed, 99.999999999% likely because my C inline stuff is wrong :(
Any compilation errors? I'd recommend putting it in a separate .S file; it expects its input per the C calling convention, so you can easily call it from C, and since it's a leaf function there is no need for a function prologue/epilogue.

Other than that your inline asm looks strange to me, the input variables should be in the second constraint group, something like

Code: Select all

asm volatile ("mov x0, %0; mov x1, %1; mov x2, %2;..." : : "r"(dst), "r"(src), "r"(len) : "x3", "x4"...);
Maybe allocating as registers should work too (after all dst, src, len are passed in x0, x1, x2 to movs() already). Use the volatile keyword to tell the compiler to use the asm template as-is.
jahboater wrote:Its called "movs" because the x86 version just does "rep movsb" (since ERMSB)
:-) Yeah, my x86 version does a little bit more than that. It uses prefetchnta, movdqa for the 256 bytes block, copies the remaining with rep movsq and only less than 8 bytes with rep movsb.
jahboater wrote:The only logic I changed was the final two small loops where I moved the "sub" up and in between the ldr/str, and the final cbnz to w2
Hey, thank you very much. That was a typo, thankfully harmless, but it should have been w2 (it's unlikely that copying an unaligned buffer of more than 4G is ever intended). Moving the sub one instruction up shouldn't make a difference. Any particular reason for that?

Cheers,
bzt

bzt
Posts: 343
Joined: Sat Oct 14, 2017 9:57 pm

Re: Puzzled, Mem speed?

Thu Mar 21, 2019 1:42 am

DavidS wrote:
Tue Mar 19, 2019 10:05 am
With ARMv5 and earlier (and some ARMv6 implementations) it was, and is, a lot quicker to use unrolling tricks. Though with the ARM1176, Cortex-A7, and Cortex-A53 used in the Raspberry Pi series of computers it is a lot faster to bring it down to a single instruction when doing write-only work. This is going to be used to implement routines that write calculated values out into a vector in memory, which is very useful for many fast algorithms.
I see, but it's not only that. Maybe it's not obvious, but those memcpy implementations (as well as mine) are not simply unrolling the loop. In each iteration the memory is loaded for the next iteration's write. So for example with 64 bytes per iteration, those STP/LDP pairs are writing 64 bytes of contiguous memory from the registers while loading the same registers with the next 64 bytes of contiguous memory, not just doing "read 16 bytes, write 16 bytes" four times in a loop block. It becomes clearer if you separate the STP/LDP pairs:

Code: Select all

stp    <-- writes at x+0
ldp    <-- reads from y+0
stp    <-- writes at x+16
ldp    <-- reads from y+16
...
   =>
stp    <-- writes at x+0      note: all x, using registers loaded in the previous iteration
stp    <-- writes at x+16
...
ldp    <-- reads from y+0     note: all y, reading for the next iteration
ldp    <-- reads from y+16
...
It's worth mixing those because of the pipeline, which means it should perform better than a loop with a single instruction pair. Now, I haven't checked the proper cache line size and pipeline read-ahead etc. on AArch64 yet; I've just used 256 bytes because that's how much the x86 version does. It is possible that copying 128 bytes is no slower. It needs more testing, but I think you can squeeze out more performance with more instructions if you do it right.

Cheers,
bzt

jahboater
Posts: 4182
Joined: Wed Feb 04, 2015 6:38 pm

Re: Puzzled, Mem speed?

Thu Mar 21, 2019 8:32 am

bzt wrote:
Thu Mar 21, 2019 1:12 am
Any compilation errors?
No!
bzt wrote:
Thu Mar 21, 2019 1:12 am
I'd recommend to put it in a separate .S file, it expects the input in a C calling convention, so you can easily call it from C, and since it's a leaf function no need for the function prologue/epilogue.
Yes, agreed, for a function as large as this. Trouble then is that it cannot participate in the surrounding C optimization.
bzt wrote:
Thu Mar 21, 2019 1:12 am
Other than that your inline asm looks strange to me, the input variables should be in the second constraint group, something like

Code: Select all

asm volatile ("mov x0, %0; mov x1, %1; mov x2, %2;..." : : "r"(dst), "r"(src), "r"(len) : "x3", "x4"...);
Maybe allocating as registers should work too (after all dst, src, len are passed in x0, x1, x2 to movs() already).
That bit does work. Allocating as registers avoids the three mov's (and three more regs in the clobbered list), and in fact does nothing at all since, as you say, they are passed in the right registers anyway (the compiler deals with all that).
I did it this way to preserve your code exactly as written. Normally I use %0 %1 etc.
Usually with inline asm I force inline the small C function it is within. There is then no ABI to worry about and the register allocator and optimizer have free rein to do what they like.
--------------------------------------------------------------------------
bzt wrote:
Thu Mar 21, 2019 1:12 am
:-) Yeah, my x86 version does a little bit more than that. It uses prefetchnta, movdqa for the 256 bytes block, copies the remaining with rep movsq and only less than 8 bytes with rep movsb.
Interesting. I believe the ERMSB enhancements did not extend to rep movsq - only movsb and stosb. Intel are supposed to keep updating rep movsb as modern hardware uses wider internal data paths and so on. They intend it as "the" memcpy instruction. In certain cases (originally, operands aligned to 16 bytes and length a multiple of 64 bytes) the performance of rep movsb far exceeds movsd and probably movsq. When I get a PC with AVX512 I'll try 64 byte register moves.
--------------------------------------------------------------------------
bzt wrote:
Thu Mar 21, 2019 1:12 am
Moving the sub one instruction up shouldn't make a change. Any particular reason for that?
Just scheduling. To give more time for the dependencies. The str depends on the prior ldr,
so placing another instruction in between gives the slow ldr more time to complete (only one cycle, but better than nothing). Ditto the subs and the cbnz. The cbnz depends on the sub, so putting the str in between means the cbnz does not have to wait. Handy that moving the one instruction does both!
I have seen slight improvements with this in the past on A64 and the change does no harm? Or am I being naive and you were relying on something clever with dual issue or something?
