Replacement memcpy


by teh_orph » Sun Jun 24, 2012 4:46 pm
I'm unsure of which section to put this in (distro, programming, general etc) so this seems a pretty visible place.

As part of this X work, something I've been working on while headless (I've done a lot of travelling recently) is a replacement memcpy function. I'd say this is a very frequently-used function by programmers - whether they know it or not - so hiding a faster memcpy away inside Xorg would be silly. I would love to get this into some kind of RPi-specific libc so that all programs benefit.

Using it in a custom RPi distro such as Raspbian sounds easy to me, but who do I talk to about getting it into other distributions, e.g. Arch? At small sizes (< 100 bytes) it is currently ~30% slower than libc, but it quickly becomes more than 2x faster.

I'd also suggest this as a competition to write a better version of the function...please! :D
Posts: 315
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
by jecxjo » Sun Jun 24, 2012 5:33 pm
Are you looking at creating your own C library or just replacing the memcpy function?
IRC (w/ SSL Support): jecxjo.mdns.org
Blog: http://jecxjo.motd.org/code
ProjectEuler Friend: 79556084281370_44d12dd95e92b1d9453aba2bdc94101b
Posts: 157
Joined: Sat May 19, 2012 5:22 pm
Location: Ames, IA (USA)
by teh_orph » Sun Jun 24, 2012 5:41 pm
Just a couple of functions. I've also got to do memset as part of some of the image routines, so it'll be just the two. There's nothing fancy used other than a page or two of ARMv6 assembly.
by dom » Sun Jun 24, 2012 5:57 pm
I'd have thought for Raspbian, submit patches to mpthompson/plugwash and we can have our own optimised libc.
For Debian, produce a shared object with just memcpy (and any other optimised functions) in it and use LD_PRELOAD to replace the libc functions.
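A minimal sketch of that LD_PRELOAD approach (the file name and the byte-at-a-time body here are just placeholders - the actual library does this in ARMv6 assembly):

```c
/* mymemcpy.c -- sketch of an LD_PRELOAD memcpy override.
 * Build: gcc -O2 -shared -fPIC -o mymemcpy.so mymemcpy.c
 * Use:   LD_PRELOAD=./mymemcpy.so some_program
 */
#include <stddef.h>

void *memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Placeholder byte loop; an optimised version would align the
     * pointers and use ldm/stm (or at least word copies) here. */
    while (n--)
        *d++ = *s++;
    return dst;
}
```

The dynamic linker resolves memcpy to this library before libc at load time, so any dynamically-linked call goes through it.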
Moderator
Posts: 3862
Joined: Wed Aug 17, 2011 7:41 pm
Location: Cambridge
by teh_orph » Sun Jun 24, 2012 10:05 pm
Cool, I didn't know you could do that! That makes my work a whole lot easier...
Digging out the original GNU memcpy source and flipping the pld switch gives a 30% boost over the regular version. Looking at the code I'm unsure why it is disabled.
by teh_orph » Sun Jul 01, 2012 5:08 pm
Hello guys.
Replacement memcpy and memset here.
https://github.com/simonjhall/copies-and-fills

If anyone could test this, it would really help! I've of course tested it myself (random numbers, iterating through sizes, source and destination alignments) but I would be unsurprised if there was a bug or two.

There's a readme in the repo for the interested. Github seems to enjoy reformatting it, though :)

For the lazy, memcpy is up to 2-3x faster, memset is up to 7x faster. In conjunction with LD_PRELOAD I think this could be usable with every program?
by jamesh » Sun Jul 01, 2012 5:23 pm
Just out of interest, what have you done to make it faster?

I did wonder about having some functions like this call the GPU - only worth it at larger sizes because of the overheads, but it could be a bit faster (250MHz, 16-way vector unit). The functions are already there on the GPU as you might expect, but moving from the ARM MMU to the flat address space is the problem.
Moderator
Posts: 10532
Joined: Sat Jul 30, 2011 7:41 pm
by dom » Sun Jul 01, 2012 5:39 pm
@jamesh
DMA is as fast as vector GPU code for memcpy/memset, and the ARM can set up DMA much more cheaply than communicating with GPU.
He knows how to drive the DMA anyway. It's the usual tradeoff: DMA is faster but costs more to set up, requires locking, etc. Might be worth considering for block sizes bigger than, say, a few K.

@teh_orph
Just tried it and I see the speedups you report with membench and LD_PRELOAD. Good work. A quick test shows lxde/midori still work.

@mpthompson @plugwash
Assuming it doesn't break anything, can we get this included in raspbian? I think it will show a measurable performance boost.
by dom » Sun Jul 01, 2012 5:58 pm
If anyone wants to test this:

cd /home/pi
wget https://dl.dropbox.com/u/3669512/temp/libcofi_rpi.so
echo /home/pi/libcofi_rpi.so | sudo tee /etc/ld.so.preload

and then reboot. You should find memset/memcpy goes faster (which should improve X speed when dragging/scrolling/redrawing).

Please report if anything breaks, or if anything feels faster.
by teh_orph » Sun Jul 01, 2012 6:45 pm
Thanks for the testing Dom. Yes if you notice anything different from not using LD_PRELOAD let me know. It's easy enough to get the debugger to stop in memcpy through user error - but if it does it for one version and not the other then there's a bug!

(There's the potential for overfetch from memcpy, but only when source/dest alignments differ in an ugly way. In which case the overfetch can be up to three bytes, but I don't think this could trip memory protection (ie read from an invalid region) as AFAIK memory protection is only done on 4k page boundaries...)
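That page-boundary reasoning boils down to a little address arithmetic (a sketch assuming 4 KB pages; the function name is made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* An overfetch of up to `extra` bytes past the last valid byte can
 * only trip memory protection if those bytes land on a different
 * 4 KB page, because the MMU protects memory at page granularity. */
bool overfetch_may_fault(uintptr_t last_valid_byte, unsigned extra)
{
    const uintptr_t page_mask = ~(uintptr_t)0xfff;
    return (last_valid_byte & page_mask) !=
           ((last_valid_byte + extra) & page_mask);
}
```

So a 3-byte overfetch is only ever risky when the source buffer ends within 3 bytes of a page boundary.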

James: nothing interesting to be honest! I went through the libc version very carefully in IDA and it does a lot of cool tricks, but they don't seem to work well on the ARM core we have. These things just need a bit of tuning on the target they're to run on. Some specific things:
- making proper use of the single issue
- aligning the destination to 32 bytes
- data prefetch
- figuring out the best combination of load/store instructions for our CPU
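The destination-alignment item can be sketched in C (the real routine does the bulk loop with ldm/stm in ARMv6 assembly; this only shows the head/bulk/tail shape):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy with the destination aligned to a 32-byte boundary first:
 * byte-copy a short head until dst is aligned, move 32-byte blocks,
 * then mop up the tail.  Non-overlapping buffers assumed, as with
 * memcpy. */
void *aligned_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n && ((uintptr_t)d & 31)) {   /* head */
        *d++ = *s++;
        n--;
    }
    while (n >= 32) {                    /* bulk: 32 bytes a go */
        for (int i = 0; i < 32; i++)
            d[i] = s[i];
        d += 32;
        s += 32;
        n -= 32;
    }
    while (n--)                          /* tail */
        *d++ = *s++;
    return dst;
}
```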

NB: this is for the software, non-GPU fallback of the Xorg code I've been working on ie for when DMA can't be used haha!

Finally it'd be great if other people with the skills could improve on what I've written here. Bring on the competitive spirit etc! Better performance at small transfer sizes for instance.
by jamesh » Sun Jul 01, 2012 7:26 pm
dom wrote:@jamesh
DMA is as fast as vector GPU code for memcpy/memset, and the ARM can set up DMA much more cheaply than communicating with GPU.
He knows how to drive the DMA anyway. It's the usual tradeoff. DMA is faster but costs more to set it up, requires locking etc. Might be worth considering for block sizes, say, bigger than a few K.


Cheers Dom. I did wonder whether there would be any benefit. No would appear to be the answer!

Do Arm memcpy/memset use DMA?
by teh_orph » Sun Jul 01, 2012 7:47 pm
jamesh wrote:Do Arm memcpy/memset use DMA?
Not my ones. The overhead would be too high. The biggest advantage to using DMA is that it's asynchronous - something that's not really doable with memcpy/memset as far as I can see.

(Maybe a scheme where the page is marked such that when it is read/written next there's an OS trap and then a wait-for-DMA?)

I'm still a bit disappointed with the performance of memcpy. Slightly less than one byte transferred per clock cycle. That can't be right...
by kadamski » Sun Jul 01, 2012 10:11 pm
dom wrote:If anyone wants to test this:
echo /home/pi/libcofi_rpi.so | sudo tee /etc/ld.so.preload

Just did that on Raspbian, and verified that it works by:
Code: Select all
$ ps aux | grep /usr/bin/X
root      1627 14.4 10.2  42308 22916 tty7     Rs+  Jul01   1:59 /usr/bin/X -nolisten tcp :0 -auth /tmp/serverauth.3F5LrCukfa
$ sudo cat /proc/1627/maps | grep libcofi
402d4000-402d5000 r-xp 00000000 b3:03 55318      /home/k/copies-and-fills/libcofi_rpi.so
402d5000-402dc000 ---p 00001000 b3:03 55318      /home/k/copies-and-fills/libcofi_rpi.so
402dc000-402dd000 rw-p 00000000 b3:03 55318      /home/k/copies-and-fills/libcofi_rpi.so


I'm not sure if anything works faster (I can't think of a real-world benchmark I could do) but I'm writing this post from midori running on this system, so at least it seems not to break anything.
Posts: 187
Joined: Fri Jun 08, 2012 10:56 pm
by asb » Sun Jul 01, 2012 10:13 pm
If anyone wants to play with x11perf/cairo-perf-trace or some other relevant benchmarking tool for a before/after comparison that would be neat.
Moderator
Posts: 757
Joined: Fri Sep 16, 2011 7:16 pm
by mpthompson » Mon Jul 02, 2012 3:32 am
This is something I would love to get into Raspbian. We'll have to see what's involved. I could be wrong, but I believe that memcpy (along with many of the other string.h functions) is actually built into gcc and the other GNU compiler tools, and they may generally ignore the libc functions. Getting the change across all packages could be a little trickier than just changing it in libc. Hopefully I'm wrong, though.
Moderator
Posts: 620
Joined: Fri Feb 03, 2012 7:18 pm
Location: San Carlos, CA
by asb » Mon Jul 02, 2012 6:53 am
mpthompson wrote:This is something I would love to get into Raspbian. We'll have to see what's involved. I could be wrong, but I believe that memcpy (along with many of the other string.h functions) is actually built into gcc and the other GNU compiler tools, and they may generally ignore the libc functions. Getting the change across all packages could be a little trickier than just changing it in libc. Hopefully I'm wrong, though.


It's a mixture. Where memcpy/memset is inlined by the compiler, LD_PRELOADing this lib obviously won't help. However, many apps will use the dynamically linked memcpy/memset.
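That split can be illustrated with a small example (exact behaviour depends on the compiler version and optimisation flags, so treat it as a sketch):

```c
#include <string.h>

struct pixel_row { unsigned char px[16]; };

/* With optimisation on, gcc will typically expand this fixed-size
 * 16-byte copy inline as a few load/store instructions, so no call
 * reaches libc and an LD_PRELOADed memcpy never sees it. */
void copy_row(struct pixel_row *dst, const struct pixel_row *src)
{
    *dst = *src;              /* struct assignment == built-in copy */
}

/* A variable-length copy normally stays as a real call through the
 * PLT to libc's memcpy, which LD_PRELOAD can intercept.  Building
 * with -fno-builtin-memcpy pushes even fixed-size copies that way. */
void copy_buf(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}
```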
by kadamski » Mon Jul 02, 2012 7:49 am
I did some tests with x11perf by running it like this:
Code: Select all
x11perf -resize -uresize -popup -move -umove


The results are strange. It seems that using the replacement library I got WORSE results than without it.
For example:
Code: Select all
WITH:
 200000 trep @   0.1688 msec (  5920.0/sec): Resize unmapped window (200 kids)
WITHOUT:
 200000 trep @   0.1487 msec (  6730.0/sec): Resize unmapped window (200 kids)

Here are full results with libcofi_rpi enabled:
http://pastebin.com/nMUsZhdM
and here are results with libcofi_rpi disabled:
http://pastebin.com/vXJF0cDk

I will try to repeat these tests and repost the results.
by teh_orph » Mon Jul 02, 2012 8:25 am
NB I would expect no change for cairo. Cairo uses its own copy and filling code (in C) and this is mainly the reason I wrote the code...I was writing it anyway! I plan on adding the cairo versions soon, hence calling the library "copies and fills" :)

Regarding the compiler automatically inlining its own version of the code, I'd expect this only when someone does my_struct = another_struct. Hopefully programmers aren't doing this on 10 KB structures or classes! If they are then there would be definite value in replacing memcpy; otherwise not, as my version is slightly slower than the original at small sizes.

Btw stock Xorg IIRC uses C loops for copies and blits too, not memcpy. (in libfb)
by kadamski » Mon Jul 02, 2012 9:09 am
I did some more tests and it seems the previous results are not reliable. I'm now getting the same results with and without the library (the variation between two runs without the library is as large as the difference I posted previously). I guess what teh_orph said explains why.

So, any ideas for some other benchmarking options?
by tufty » Mon Jul 02, 2012 1:17 pm
Simon.

memset :

Have you benchmarked the difference between doing strd (i.e. 2 registers at a time) and various numbers of registers at a time using stm? I assume you have, as you're doing 4-register writes rather than 8.

I suspect, looking at the TRM, that you might be able to eke out a bit more performance by heading for 64 bit aligned transfers, too (i.e. potentially taking the hit on an extra single word transfer before doing the block), as that seems to shift the number of memory cycles for a 4 register transfer from 3 to 2. It improves the result latency, too, although the register lock latency should stay the same.

The code I'm using (which is adapted from FreeBSD code) does 128 and 32 byte blocks separately, but it only does 2 registers at a time. I might well rework it, thinking about this.

memcpy :

Here, I think you might be able to get faster by using less registers in your 32 byte loop (and making sure you're 64 bit aligned, as above). My reading of the TRM is:

Code: Select all
fast_loop:
   ldmia r1!, {r4-r11}   ; 1 cycle, 4 memory cycles, 4 or 5 cycle latency on r11
   subs r3, #32             ; 1 cycle
   stmia r0!, {r4-r11}   ; locked for another 3 or 4 cycles waiting for r11, I think
   pld [r1, #128]
   bne fast_loop

My (beer-addled) feeling is that you may be able to get faster by doing 2 interleaved 4-register load and stores. You take a one-cycle hit on each but you should be clear to do the stores straight away. Like this:
Code: Select all
ldmia r1!, {r4-r7}
ldmia r1!, {r8-r11}
pld here?
stmia r0!, {r4-r7}
stmia r0!, {r8-r11}

Of course, I'm probably missing something fundamentally obvious here. And the beer's probably not helping.

Simon
Posts: 1330
Joined: Sun Sep 11, 2011 2:32 pm
by teh_orph » Mon Jul 02, 2012 2:00 pm
Yo fellow Simon!

tufty wrote:memset :

Have you benchmarked the difference between doing strd (i.e. 2 registers at a time), and various number of registers at a time using stm? I assume you have, as you're doing 4-register writes rather than 8.
Yes. Eight reg stm gave about 1.1GB/s, 2*four reg stm gives just shy of 1.4 GB/s. Store dual (strd) was 800 MB/s IIRC? I will check. I'm unsure of why eight-reg stm is slower than four-reg stm on memset, but not memcpy.

I suspect, looking at the TRM, that you might be able to eke out a bit more performance by heading for 64 bit aligned transfers, too
64 bit or 64 byte? My stuff 32 *byte* aligns the data for a definite significant win.

The code I'm using (which is adapted from FreeBSD code) does 128 and 32 byte blocks separately, but it only does 2 registers at a time. I might well rework it, thinking about this.
Yeah could be worth a further unroll...I'll have a test.

memcpy :

Interesting, re: the timings. I'll re-time with these different layouts and see what I get.
What's interesting is that with
ldm r1!, {stuff}
pld [r1, #12]

the pld takes a big hit if it comes directly after the ldm...because r1 is updated quite late? Moving it further away improves the pld performance quite a bit. (I'm looking at oprofile results, btw.) Yet r1! can't be the culprit...as I do back-to-back ldm r1!s in memset and it's faster! Wanna guess??
by veryevil » Mon Jul 02, 2012 3:04 pm
dom wrote:If anyone wants to test this:

cd /home/pi
wget https://dl.dropbox.com/u/3669512/temp/libcofi_rpi.so
echo /home/pi/libcofi_rpi.so | sudo tee /etc/ld.so.preload

and then reboot. You should find memset/memcpy goes faster (which should improve X speed when dragging/scrolling/redrawing).

Please report if anything breaks, or if anything feels faster.


Hey, it won't work for me. I get massive amounts of

ERROR: ld.so: object '/boot/libconfi_rpi.so' from /etc/ld.so.preload cannot be preloaded: ignored

I downloaded your file and put it in the boot directory, but I also first used wget into /root/

I'm guessing the issue is that I'm using Raspbian
Posts: 24
Joined: Fri Mar 09, 2012 3:38 pm
by veryevil » Mon Jul 02, 2012 3:10 pm
It's OK, I fixed it. I pulled teh_orph's git repo and built it natively on Raspbian, and then it worked.
by teh_orph » Mon Jul 02, 2012 5:20 pm
Tufty, here's some numbers, 64-bit aligned.
memset:
4*strd = 276 MB/s (r4, r5 * 4)
4*2 reg stmia = 199 MB/s (r4, r5 * 4)
2*4 reg stmia = 1389 MB/s (r1, r4, r5, r6 * 2)
2*4 reg stmia = 1385 MB/s (1, 4, 5, 6, 7, 8, 9, 10)

memcpy:
my original,
ldmia r1!, {r4-r11}
subs r3, #32
stmia r0!, {r4-r11}
pld [r1, #128]
bne fast_loop
336 MB/s

ldmia r1!, {r4-r7}
ldmia r1!, {r8-r11}
subs r3, #32
stmia r0!, {r4-r7}
stmia r0!, {r8-r11}
pld [r1, #128]
bne fast_loop
339 MB/s

Moving the pld above the stmia: 338 MB/s
Moving the pld (and changing the value) to above the ldmia: 339 MB/s
(unable to try 64-byte unrolling at this time)

Not a lot of movement! The memset changes are quite shocking though.

EDIT: quick chuckle. pld'ing r0+#128 just for kicks, above the ldmia...158 MB/s!
by tufty » Mon Jul 02, 2012 6:46 pm
Interleaving your posts for reply. May be disjoint...
teh_orph wrote:
tufty wrote:Have you benchmarked the difference between doing strd (i.e. 2 registers at a time), and various number of registers at a time using stm? I assume you have, as you're doing 4-register writes rather than 8.
Yes. Eight reg stm gave about 1.1GB/s, 2*four reg stm gives just shy of 1.4 GB/s. Store dual (strd) was 800 MB/s IIRC? I will check. I'm unsure of why eight-reg stm is slower than four-reg stm on memset, but not memcpy.


4*strd = 276 MB/s (r4, r5 * 4)
4*2 reg stmia = 199 MB/s (r4, r5 * 4)
2*4 reg stmia = 1389 MB/s (r1, r4, r5, r6 * 2)
2*4 reg stmia = 1385 MB/s (1, 4, 5, 6, 7, 8, 9, 10)

Interesting that the strd and 2-register stms are so different. According to the TRM they should be the same as a best case, i.e. 1 cycle, and have the same latencies, but with strd falling behind stm in the worst case.

Looks like 2x4 is the sweet spot for memset, but I don't see why. 8-register should beat it in pure performance terms, especially for memset where you're not locking your register values. Or maybe that's it? Two interleaved 4-register writes mean that none of the registers are locked at any point you want to use them?
64 bit or 64 byte? My stuff 32 *byte* aligns the data for a definite significant win.

Yeah, 64 bit (doubleword aligned). Brainfart on my part, it's obvious your code is 32 byte aligned.

memcpy :

Interesting, re: the timings. I'll re-time with these different layouts and see what I get.
What's interesting is that
ldm r1! {stuff}
pld [r1, #12]

the pld takes a big hit if it comes directly after the ldm...because r1 is updated quite late? Moving it further away improves the pld performance quite a bit. (I'm looking at oprofile results, btw.) Yet r1! can't be the culprit...as I do back-to-back ldm r1!s in memset and it's faster! Wanna guess??

My guess would be that back-to-back ldmia r1!, {...} don't need the value of r1 to be ready, but pld does.

Give this a shot, just for kicks (if you have time, of course). It brings the preload up to a point where the r1 value should be available, and gives maximum distance between it and the next time we touch r1. The subs only needs to be 2 cycles away from the branch to get best-case performance anyway, so it might as well be used to give the preload time to take effect.
Code: Select all
ldmia r1!, {r4-r7}
ldmia r1!, {r8-r11}
stmia r0!, {r4-r7}
pld [r1, #128]
subs r3, #32
stmia r0!, {r8-r11}
bne fast_loop


Simon