juj
Posts: 26
Joined: Sat Nov 18, 2017 10:51 pm

Any tricks to further optimize this memory diffing code?

Thu Jul 05, 2018 6:43 pm

I need to do really fast diffs of two memory blocks of fixed size 307200, well aligned to my will, to find the first index of where those two blocks differ, on a Raspberry Pi Zero.

Here is my current best attempt:

Code: Select all

int diff(uint8_t *a, uint8_t *b)
{
  uint8_t *endPtr;
  asm volatile(
    "mov r1, %[a]\n" // pointer to buffer 1 to diff
    "add r0, r1, #307200\n" // end pointer of buffer 1, loop finish condition
    "mov r2, %[b]\n" // pointer to buffer 2 to diff

  "start2:\n"
    "pld [r1, #128]\n" // preload data caches for both buffers 128 bytes ahead of time
    "pld [r2, #128]\n"

    "ldmia r1!, {r3,r4,r5,r6}\n" // load 4x32-bit elements of buffer 1
    "ldmia r2!, {r7,r8,r9,r10}\n" // corresponding elements from buffer 2
    "cmp r3, r7\n" // compare all elements for diff
    "bne end2\n"
    "cmp r4, r8\n"
    "bne end2\n"
    "cmp r5, r9\n"
    "bne end2\n"
    "cmp r6, r10\n"
    "bne end2\n"

    // unroll once to process 8x32bits = 32 bytes = L1 data cache line size of Raspberry Pi Zero in one iteration
    "ldmia r1!, {r3,r4,r5,r6}\n"
    "ldmia r2!, {r7,r8,r9,r10}\n"
    "cmp r3, r7\n"
    "bne end2\n"
    "cmp r4, r8\n"
    "bne end2\n"
    "cmp r5, r9\n"
    "bne end2\n"
    "cmp r6, r10\n"
    "bne end2\n"

    "cmp r0, r1\n" // test loop end condition
    "bne start2\n"

  "end2:\n"
    "mov %[result], r1\n\t"
    : [result]"=r"(endPtr)
    : [a]"r"(a), [b]"r"(b)
    : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10"
  );
  return endPtr-a;
}
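For reference, here is a plain C word-at-a-time version of the same diff — a sketch of my own, not the code used in the project; `diff_c` and its size parameter are made-up names. Note one behavioral difference: the assembly returns the pointer after the 16-byte group in which the mismatch was found, while this returns the exact byte index.

```c
#include <stdint.h>
#include <stddef.h>

// Compare a word at a time; return the byte index of the first difference,
// or n if the blocks are identical. Assumes both pointers are 4-byte
// aligned (as the buffers in the original post are) and n is a multiple of 4.
size_t diff_c(const uint8_t *a, const uint8_t *b, size_t n)
{
    const uint32_t *wa = (const uint32_t *)a;
    const uint32_t *wb = (const uint32_t *)b;
    size_t nwords = n / 4;
    for (size_t i = 0; i < nwords; i++) {
        if (wa[i] != wb[i]) {
            size_t base = i * 4;            // refine to the exact byte
            while (a[base] == b[base])
                base++;
            return base;
        }
    }
    return n;
}
```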
I wonder if this code is already at the limits of what the hardware can do (and I should look elsewhere for ways to optimize), or if there are some tricks that could be employed to tune this code further?

Check out https://gist.github.com/juj/e6423dcd3c3 ... 4d39ba983c for a minimal compilable test case on Pi Zero.

The performance at the moment is that it takes on average ~5 clocks to process one byte of memory, or, put the other way around, processing speed is about 200MB/second.
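For anyone wanting to reproduce this kind of measurement without the assembly, a throughput harness along these lines works — a sketch; `diff_bytes` and `measure_mbps` are my own stand-ins, not the project's code:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

// Plain C stand-in for the assembly diff routine: returns the index of the
// first differing byte, or n if the blocks are identical.
int diff_bytes(const uint8_t *a, const uint8_t *b, int n)
{
    for (int i = 0; i < n; i++)
        if (a[i] != b[i])
            return i;
    return n;
}

// Time iters calls of diff_bytes over an n-byte buffer pair; return MB/s.
double measure_mbps(const uint8_t *a, const uint8_t *b, int n, int iters)
{
    struct timespec t0, t1;
    volatile int sink = 0; // keep the calls from being optimized away
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++)
        sink += diff_bytes(a, b, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)n * iters / secs / 1e6;
}
```

Calling `measure_mbps(a, b, 307200, 100)` on two equal buffers measures the all-match path, which is the common case for a delta diff; absolute numbers will of course differ per board and clock.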

The use case for this is in https://github.com/juj/fbcp-ili9341, where I am doing delta diffs of images to minimize transfer on the SPI bus.

I have tried unrolling further, and varying the data cache preload distance, though for now one cache line per iteration with a 128-byte preload distance seems to be the sweet spot. At the moment the code takes about 2 msecs per 307200-byte block, which is still a bit too much for my liking, hence I am looking to squeeze more juice out of this.

Any comments welcome! ARM assembly is not my usual forte, and this is just hobby code, so there is a lot of learning as I go. Thanks!

jahboater
Posts: 3051
Joined: Wed Feb 04, 2015 6:38 pm

Re: Any tricks to further optimize this memory diffing code?

Thu Jul 05, 2018 7:40 pm

If it were a Pi3 I would suggest using the 16-byte NEON SIMD registers instead of the 4-byte ARM registers, something like:

vldmia r1!, {q3,q4,q5,q6}

which is (say) 64 bytes at a time.

I think you can do the same thing with VFP on the Pi Zero, but I suppose it will not be so fast.

You can give ranges by the way:

vldmia r1!, {q3-q6}

If you want to stay within a data cache line, then perhaps two "q" registers (2x16 bytes, one 32-byte line) at a time.
That would also let you avoid vldmia, which is slow, and just use vldr.

juj
Posts: 26
Joined: Sat Nov 18, 2017 10:51 pm

Re: Any tricks to further optimize this memory diffing code?

Sun Jul 08, 2018 9:40 pm

Thanks, good points. Unfortunately this is not for the Pi 3 but for the Zero: I can already reach my performance target easily on the Pi 3, but the Pi Zero is proving somewhat harder.

I familiarized myself with the ARMv6 manual at http://infocenter.arm.com/help/topic/co ... p7_trm.pdf , and it was interesting to find that it contains good examples of the pipeline and of how to count instruction cycle timings.

One observation for reducing the number of instructions: in the original code, the sequence

Code: Select all

    "cmp r3, r7\n" // compare all elements for diff
    "bne end2\n"
    "cmp r4, r8\n"
    "bne end2\n"
    "cmp r5, r9\n"
    "bne end2\n"
    "cmp r6, r10\n"
    "bne end2\n"
can be compacted a bit with conditional compares:

Code: Select all

    "cmp r3, r7\n" // compare all elements for diff
    "cmpeq r4, r8\n"
    "cmpeq r5, r9\n"
    "cmpeq r6, r10\n"
    "bne end2\n"
This shows a small but measurable improvement in the timings: processing speed goes from the previous 204MB/sec up to about 215MB/sec.

Tried replacing ldmias with ldrds, but that did not seem to have an effect in this scenario.

The current version I am at looks like this: https://gist.github.com/juj/e33338d2630 ... 5b05c7be3b

Trying to understand a bit more about the memory latency and preloading on the Pi Zero, I measure the following synthetic tests:

Code: Select all

    "mov r0, %[a]\n" // pointer to buffer 1 to diff
    "add r1, r0, #307200\n" // end pointer of buffer 1, loop finish condition
  "start_%=:\n"
    "add r0, r0, #32\n"
    "cmp r0, r1\n" // test loop end condition
    "blo start_%=\n"
  "end_%=:\n"
The above code counts 3 clocks per iteration of the inner loop, which makes sense: 1 cycle for add, 1 for cmp and 1 for the branch. Then, adding in a preload instruction:

Code: Select all

    "mov r0, %[a]\n" // pointer to buffer 1 to diff
    "add r1, r0, #307200\n" // end pointer of buffer 1, loop finish condition
  "start_%=:\n"
    "pld [r0, #256]\n"
    "add r0, r0, #32\n"
    "cmp r0, r1\n" // test loop end condition
    "blo start_%=\n"
  "end_%=:\n"
the inner loop jumps to 64 cycles per iteration, suggesting that preloading one cache line takes about 61 clock cycles. The preload throughput is quite exactly 500MB/sec.
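The 500MB/sec figure is consistent with simple cycle counting, assuming the stock 1GHz ARM clock of the Pi Zero: one 32-byte L1 line preloaded per 64-cycle iteration. A minimal cross-check (`preload_mbps` is my own helper, not anything from the project):

```c
// Convert a cycles-per-iteration measurement into MB/s, given the core
// clock and the number of bytes touched per iteration. Assumes the stock
// 1GHz ARM clock of the Pi Zero when called with clock_hz = 1e9.
double preload_mbps(double clock_hz, double cycles_per_iter, double bytes_per_iter)
{
    return bytes_per_iter * (clock_hz / cycles_per_iter) / 1e6;
}
```

`preload_mbps(1e9, 64, 32)` gives exactly 500; the two-stream loop measured later in this post, at 150 cycles for 32 bytes of diffed data, gives about 213, close to the measured 211 MB/sec.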

Loading the memory as well as preloading retains the same 500MB/sec:

Code: Select all

    "mov r0, %[a]\n" // pointer to buffer 1 to diff
    "add r1, r0, #307200\n" // end pointer of buffer 1, loop finish condition
  "start_%=:\n"
    "ldr r2, [r0]\n"
    "pld [r0, #256]\n"
    "add r0, r0, #32\n"
    "cmp r0, r1\n" // test loop end condition
    "blo start_%=\n"
  "end_%=:\n"
Preloading through two streams of memory:

Code: Select all

    "mov r0, %[a]\n" // pointer to buffer 1 to diff
    "add r2, r0, #307200\n" // end pointer of buffer 1, loop finish condition
    "mov r1, %[b]\n" // pointer to buffer 2 to diff
  "start_%=:\n"
    "pld [r0, #256]\n" // preload data caches for both buffers 128 bytes ahead of time
    "pld [r1, #256]\n"
    "add r0, r0, #32\n"
    "add r1, r1, #32\n"
    "cmp r0, r2\n" // test loop end condition
    "blo start_%=\n"
  "end_%=:\n"
puts the inner loop at 150 clock cycles per iteration, or 211 MB/sec, which is more or less exactly the bandwidth that I am getting with the code that does the actual diffs. This suggests that my diff code is bound by the actual memory bandwidth of the Pi Zero? If so, that does make me happy, in the sense that it gives the impression of having "optimal" code that is blocked by memory bandwidth limits.

Though https://github.com/simonjhall/copies-and-fills reports "up to 680MB/sec" memcpy speeds. That is a bit more than my 500MB/sec preload scenario - I wonder if those numbers were reached by overclocking an ARMv6 Pi. The code there looks nearly identical; no other tricks compared to straightforward ldmia and stmia in the aligned inner-loop case.

jahboater
Posts: 3051
Joined: Wed Feb 04, 2015 6:38 pm

Re: Any tricks to further optimize this memory diffing code?

Sun Jul 08, 2018 11:50 pm

I'm not sure it gains anything as there doesn't seem to be an instruction that compares two vectors and sets the flags directly.
Looks like you need about 4 insns to do it, but it is comparing four 32-bit integers each time. I saw this on SO:

Code: Select all

vceq.i32  q15, q0, q3
vsli.32   d31, d30, #16
vcmp.f64  d31, #0
vmrs      APSR_nzcv, fpscr

AlfredJingle
Posts: 69
Joined: Thu Mar 03, 2016 10:43 pm

Re: Any tricks to further optimize this memory diffing code?

Tue Jul 17, 2018 12:32 pm

I presently have no Pi Zero running, so these are partly theoretical musings. But I did a lot of optimizing on a Pi 1, and based on that, the following:

I would test whether the actual comparison is faster with this routine:

Code: Select all

sub r3, r3, r7
sub r4, r4, r8
sub r5, r5, r9
sub r6, r6, r10
orr r3, r3, r4
orr r5, r5, r6
orrs r3, r3, r5
After the orrs, the equal flag is set if the 4 regs match the other 4 regs. This is a fast routine as it avoids most interlocking, and it sets the flags only once. On a Pi3 it takes only 5c; on a Zero I would expect something like 10-12c. Once you find a mismatch between the 4 words, a short routine is needed to find which of the 4 words is the actual offender.
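In C, the same combine-then-test idea looks like this — a sketch of my own, not the poster's code; XOR instead of SUB gives the same zero/non-zero property, and `first_diff_word` is a made-up name:

```c
#include <stdint.h>
#include <stddef.h>

// Combine the four word differences with OR and test once per group;
// only on a mismatch do we pay for finding the offending word.
// Assumes nwords is a multiple of 4.
size_t first_diff_word(const uint32_t *a, const uint32_t *b, size_t nwords)
{
    for (size_t i = 0; i + 4 <= nwords; i += 4) {
        uint32_t d0 = a[i]     ^ b[i];
        uint32_t d1 = a[i + 1] ^ b[i + 1];
        uint32_t d2 = a[i + 2] ^ b[i + 2];
        uint32_t d3 = a[i + 3] ^ b[i + 3];
        if (d0 | d1 | d2 | d3) {     // one flag-setting test per 4 words
            if (d0) return i;
            if (d1) return i + 1;
            if (d2) return i + 2;
            return i + 3;
        }
    }
    return nwords; // no difference found
}
```

With optimizations on, a compiler should turn the inner block into much the same eor/orr sequence as the hand-written sub/orr routine above.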

In addition, I would play a little longer with the cache-preload instructions. I found occasions where they help, but I also have examples where they did nothing or even slowed the memory transfer down. For instance, try with only one pld, and with shorter offsets like 64.

And finally, you could try putting the two data sets in memory with the 'do-not-execute' (XN) flag set. Setting this flag not only prevents execution within that memory area, but also avoids instruction-cache prefetching from it, thus in theory freeing up some memory bandwidth. I have not tested this.


Have fun!
going from a 6502 on an Oric-1 to an ARMv8 is quite a big step...
