Optimizing the kernel copy_page and memcpy functions


by hglm » Sat Jun 22, 2013 10:07 am
While userspace memcpy has received plenty of optimization attention on the Raspberry Pi, the same cannot be said for the memcpy-related functions in the kernel, whose performance can be important for certain workloads. The RPi is very sensitive to the right prefetch strategy when copying memory blocks, and the current kernel implementations of copy_page and memcpy are significantly slower than RPi-optimized versions would be. In particular, the copy_page function as implemented in arch/arm/lib/copy_page.S still retains optimizations for the 15-year-old StrongARM platform while not being tuned for modern ARM platforms.

When the copy_page function is lifted out of the kernel and benchmarked in a sandbox environment, an RPi-optimized implementation (which changes the prefetch strategy) shows a 70% performance improvement.
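
To illustrate the idea, here is a minimal sketch in C of a page copy loop with software prefetch. The actual implementation is ARM assembly; PREFETCH_DISTANCE here is a made-up placeholder for the tuned offset, which on the RPi's ARM1176 differs from the StrongARM-era default:

Code: Select all
/* Sketch only: copy one page, prefetching a fixed distance ahead. */
#include <stdint.h>

#define CACHE_LINE        32    /* ARM1176 cache line size in bytes */
#define PAGE_SIZE         4096
#define PREFETCH_DISTANCE (3 * CACHE_LINE)  /* hypothetical tuned offset */

void copy_page_sketch(void *dst, const void *src)
{
    uint32_t *d = dst;
    const uint32_t *s = src;
    for (int i = 0; i < PAGE_SIZE; i += CACHE_LINE) {
        /* Prefetch a few lines ahead of the current read position. */
        __builtin_prefetch((const char *)s + PREFETCH_DISTANCE);
        /* Copy one 32-byte cache line as eight 32-bit words. */
        d[0] = s[0]; d[1] = s[1]; d[2] = s[2]; d[3] = s[3];
        d[4] = s[4]; d[5] = s[5]; d[6] = s[6]; d[7] = s[7];
        d += 8; s += 8;
    }
}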

A similar speed-up is observed for an optimized kernel memcpy function (which changes the prefetch strategy and forces write alignment to the cache line size) for sizes of about 1K and larger. For smaller sizes, the benefit decreases with size, as expected.
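
The write-alignment part can be sketched like this (again in C for clarity; the real code is assembly, and the plain byte loop at the end stands in for the aligned, prefetching cache-line loop shown above):

Code: Select all
/* Sketch only: copy a short head so the destination reaches a
 * 32-byte boundary, then run the main loop on aligned writes. */
#include <stddef.h>
#include <stdint.h>

void memcpy_sketch(char *dst, const char *src, size_t n)
{
    /* Bytes needed to bring dst up to the next 32-byte boundary. */
    size_t head = (size_t)(-(uintptr_t)dst & 31);
    if (head > n)
        head = n;
    n -= head;
    while (head--)
        *dst++ = *src++;
    /* Main loop now writes whole cache lines at aligned addresses;
     * shown here as a plain byte loop for brevity. */
    while (n--)
        *dst++ = *src++;
}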
by hglm » Sat Jun 22, 2013 10:18 am
I have benchmarked a kernel with optimized copy_page and memcpy functions. To do so, I had to compile OProfile from source and make sure it uses the timer-based sampling method instead of the hardware performance counters, which are not fully supported on the ARMv6 platform. Using oprofile, and supplying it with a vmlinux file whose kernel symbols match the running kernel version, it is possible to do system-wide profiling that includes the kernel.

Testing a run of ./configure, which has a relatively large copy_page footprint, on a moderately sized source tree on a ramdisk, the following profile results were obtained:

Without optimized copy_page and memcpy:

Code: Select all
1126     18.3507  cc1                      /usr/lib/gcc/arm-linux-gnueabihf/4.6/cc1
1037     16.9003  libbfd-2.22-system.so    /usr/lib/libbfd-2.22-system.so
726      11.8318  libc-2.13.so             /lib/arm-linux-gnueabihf/libc-2.13.so
395       6.4374  bash                     /bin/bash
287       4.6773  ld-2.13.so               /lib/arm-linux-gnueabihf/ld-2.13.so
209       3.4061  vmlinux                  copy_page
189       3.0802  vmlinux                  default_idle
158       2.5750  vmlinux                  do_page_fault
112       1.8253  libcofi_rpi.so           memcpy
93        1.5156  vmlinux                  __do_fault
85        1.3853  vmlinux                  cfb_imageblit
72        1.1734  as                       /usr/bin/as
72        1.1734  libcofi_rpi.so           memset
69        1.1245  vmlinux                  __memzero
60        0.9778  ld.bfd                   /usr/bin/ld.bfd
59        0.9615  vmlinux                  filemap_fault
53        0.8638  vmlinux                  handle_pte_fault
47        0.7660  vmlinux                  memcpy
44        0.7171  vmlinux                  get_page_from_freelist
43        0.7008  vmlinux                  find_get_page
40        0.6519  vmlinux                  find_vma
35        0.5704  sed                      /bin/sed
35        0.5704  vmlinux                  __down_read_trylock


With optimized copy_page and memcpy:

Code: Select all
1148     18.8320  cc1                      /usr/lib/gcc/arm-linux-gnueabihf/4.6/cc1
1052     17.2572  libbfd-2.22-system.so    /usr/lib/libbfd-2.22-system.so
710      11.6470  libc-2.13.so             /lib/arm-linux-gnueabihf/libc-2.13.so
428       7.0210  bash                     /bin/bash
254       4.1667  ld-2.13.so               /lib/arm-linux-gnueabihf/ld-2.13.so
190       3.1168  vmlinux                  default_idle
171       2.8051  vmlinux                  do_page_fault
119       1.9521  libcofi_rpi.so           memcpy
113       1.8537  vmlinux                  copy_page
94        1.5420  vmlinux                  __do_fault
85        1.3944  as                       /usr/bin/as
79        1.2959  vmlinux                  cfb_imageblit
77        1.2631  vmlinux                  __memzero
63        1.0335  ld.bfd                   /usr/bin/ld.bfd
63        1.0335  libcofi_rpi.so           memset
55        0.9022  vmlinux                  filemap_fault
44        0.7218  vmlinux                  handle_pte_fault
41        0.6726  sed                      /bin/sed
40        0.6562  vmlinux                  find_get_page
38        0.6234  vmlinux                  find_vma
38        0.6234  vmlinux                  get_page_from_freelist
37        0.6070  vmlinux                  handle_mm_fault
37        0.6070  vmlinux                  memcpy


Note how the time taken by copy_page drops from 3.41% to 1.86%. Although the profiling results are not extremely accurate due to the sampling method used and other factors, the copy_page improvement appears to be significant and should improve the running time of this particular workload by about 1.5%.

A smaller speed-up is observed for memcpy (0.77% to 0.61%), although this result is less significant due to the smaller number of samples.

I am on the look-out for workloads that have a larger kernel copy_page and memcpy footprint. Any suggestions?

BTW: Preliminary patches implementing optimized copy_page and memcpy are available at https://github.com/hglm/patches/tree/master/rpi.
by dom » Sat Jun 22, 2013 12:03 pm
I did attempt this a while back, but found running the quake3 timedemo would sometimes segfault with the patch applied.
I had a suspicion it was misaligned access related, but haven't had a chance to investigate.

It may have been something I got wrong in patching it, so I'd be interested if quake3 works reliably for you (or post/PR your patch and I'll try it here).
Raspberry Pi Engineer & Forum Moderator
by hglm » Sat Jun 22, 2013 12:45 pm
dom wrote:I did attempt this a while back, but found running the quake3 timedemo would sometimes segfault with the patch applied.
I had a suspicion it was misaligned access related, but haven't had a chance to investigate.

The copy_page patch is quite straightforward (since copy_page is always page aligned), but I found the current kernel memcpy implementation for ARM a bit tricky. In fact, when testing the original kernel memcpy function in userspace, I detected copy errors related to unaligned memcpy. I have not yet attempted to fix this, or to ascertain whether the bug actually exists in kernel space. The kernel memcpy function does some tricky fiddling with the program counter when doing unaligned accesses.
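
A sketch of the kind of userspace validation harness that catches such unaligned copy errors; test_memcpy is a stand-in to be pointed at the implementation lifted out of the kernel, and comparing the whole buffer (including the 0xAA padding) also catches out-of-bounds writes:

Code: Select all
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in: point this at the memcpy implementation under test. */
#define test_memcpy memcpy

int main(void)
{
    enum { BUF = 512 };
    unsigned char *src = malloc(BUF), *dst = malloc(BUF), *ref = malloc(BUF);
    for (int i = 0; i < BUF; i++)
        src[i] = (unsigned char)rand();
    /* Exercise every source/destination misalignment within a cache
       line, over a range of sizes, and compare against libc memcpy. */
    for (int sa = 0; sa < 32; sa++)
        for (int da = 0; da < 32; da++)
            for (size_t n = 1; n <= BUF - 32; n++) {
                memset(dst, 0xAA, BUF);
                memset(ref, 0xAA, BUF);
                test_memcpy(dst + da, src + sa, n);
                memcpy(ref + da, src + sa, n);
                if (memcmp(dst, ref, BUF) != 0) {
                    printf("mismatch: src+%d dst+%d len=%zu\n", sa, da, n);
                    return 1;
                }
            }
    puts("all alignment/size combinations passed");
    return 0;
}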
dom wrote:It may have been something I got wrong in patching it, so I'd be interested if quake3 works reliably for you (or post/PR your patch and I'll try it here).


I'll check out quake3; in any case, the two patches are available at the address linked at the bottom of my previous message.
by dom » Sat Jun 22, 2013 1:36 pm
hglm wrote:I'll check out quake3; in any case, the two patches are available at the address linked at the bottom of my previous message.


I've not spotted a problem with quake using your patch over a few runs. No measurable change in framerate.

How does your memcpy compare with the ones we use in userland (https://github.com/bavison/arm-mem/)?
How about memset?
Raspberry Pi Engineer & Forum Moderator
by hglm » Sat Jun 22, 2013 2:54 pm
dom wrote:How does your memcpy compare with the ones we use in userland (https://github.com/bavison/arm-mem/)?
How about memset?


I believe Raspbian uses the libcofi optimized memcpy and memset via the ld.so.preload mechanism (/etc/ld.so.preload). libcofi comes from https://github.com/simonjhall/copies-and-fills/ and is different from arm-mem, which is used in Pidora. In my testing libcofi performs pretty well (perhaps better than arm-mem). libcofi's memset is also pretty fast (faster than the glibc default).
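
For reference, on Raspbian the preload is typically a single line in /etc/ld.so.preload (the exact path may differ between releases):

Code: Select all
/usr/lib/arm-linux-gnueabihf/libcofi_rpi.so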

The changes I made to the kernel memcpy mainly adjust preload offsets and do not significantly alter the overall structure of the function; when tested against libcofi in userspace it is a little slower in some cases. But the kernel memcpy has somewhat different requirements (smaller code size, the ability to handle exceptions).

But I have also been experimenting with a wholly different set of memcpy functions in my fastarm repository (https://github.com/hglm/fastarm/). These are work in progress; while they are significantly faster than glibc on other ARM platforms, libcofi does pretty well on Raspbian. I have yet to do extensive real-world benchmarks using oprofile, which can give different results from synthetic benchmarks that repeatedly call memcpy. There are some complex trade-offs involved, and historically there have been cases of "optimized" memcpy implementations on various Linux platforms that showed good performance in synthetic benchmarks but actually slowed the system down in real-world usage.
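
To make the synthetic-benchmark caveat concrete, this is the pattern such benchmarks typically use: after the first iteration both buffers sit in the cache, so the measured throughput says little about real call sites operating on cold data:

Code: Select all
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    enum { SIZE = 1024, ITERS = 1000000 };
    char *src = malloc(SIZE), *dst = malloc(SIZE);
    memset(src, 1, SIZE);
    clock_t t0 = clock();
    for (int i = 0; i < ITERS; i++) {
        memcpy(dst, src, SIZE);   /* cache-hot after the first iteration */
        __asm__ volatile("" ::: "memory");  /* keep the copy from being
                                               optimized away */
    }
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("%.1f MB/s (cache-hot figure, not representative)\n",
           (double)SIZE * ITERS / secs / 1e6);
    return 0;
}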
by hglm » Sat Jun 22, 2013 7:52 pm
In the latest version of the patch I fixed a performance regression in memset/memzero caused by forcing write alignment to a 32-byte boundary, a change that was beneficial for the memcpy code. The RPi doesn't like memset/memzero writing whole cache lines at once at an aligned address. When profiling "perf bench sched pipe", which utilizes the kernel memset function, the time spent in memset goes from 1.9% to 1.1%.

In general, I think copy_page is called when new processes are forked or created and start writing their own data into previously shared memory pages (copy-on-write). So complex shell scripts, or running "configure", which trigger lots of process creation, will see the greatest benefit from improvements in copy_page. Programs using threads generally don't use copy_page much, because the threads share the data segment of the process.
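
A minimal example of the pattern that exercises copy_page: every page the child writes to takes a copy-on-write fault, and the kernel copies the page with copy_page:

Code: Select all
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t size = 16 * 1024 * 1024;
    char *buf = malloc(size);
    memset(buf, 1, size);      /* fault the pages in before forking */
    if (fork() == 0) {
        memset(buf, 2, size);  /* each page written takes a COW fault,
                                  triggering a kernel copy_page */
        _exit(0);
    }
    wait(NULL);
    return 0;
}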

Kernel memcpy generally isn't used very heavily, usually accounting for much less of the overall profile than userspace memcpy. I suspect disk I/O generates kernel memcpy calls, but when accessing an SD card the number of calls will be limited by the card's slow read and write speeds; with a faster disk drive connected, memcpy utilization may go up. There may also be other scenarios (certain device drivers?) that more heavily utilize kernel memcpy.

Kernel memset is used by pipes, as mentioned above. Finally, memzero has a usage profile similar to copy_page's, because it too is tied to the creation of new processes.
by hglm » Wed Jun 26, 2013 10:18 pm
I have been doing some more work on this and I think I am making progress. The validation errors I was seeing with the original kernel memcpy have cleared up (I had accidentally set the wrong endianness in my userspace test environment). The memcpy implementation now performs more highly tuned preloads and has a fast path for the small-to-moderate-size word-aligned case. This seems to help, reducing the memcpy profile time by half in some cases.
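
In C pseudocode the dispatch looks roughly like this; the actual code is ARM assembly, and the 64-byte threshold here is just an illustrative guess:

Code: Select all
#include <stddef.h>
#include <stdint.h>

/* Placeholder for the full path (alignment fixup, tuned preloads). */
static void *memcpy_general(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

void *memcpy_dispatch(void *dst, const void *src, size_t n)
{
    /* Fast path: small copy, both pointers word aligned. */
    if (n <= 64 && (((uintptr_t)dst | (uintptr_t)src) & 3) == 0) {
        uint32_t *d = dst;
        const uint32_t *s = src;
        while (n >= 4) {
            *d++ = *s++;
            n -= 4;
        }
        char *db = (char *)d;          /* copy the remaining tail bytes */
        const char *sb = (const char *)s;
        while (n--)
            *db++ = *sb++;
        return dst;
    }
    return memcpy_general(dst, src, n);
}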

One useful benchmark is running "perf bench sched messaging" from the linux-tools package (perf is itself a profiler, but I am still using oprofile). This benchmark has particularly high memcpy usage. It shows a reduction in the time spent in kernel memcpy from 5.2% to 2.7% with the optimized version, with the memcpy calls mostly originating in copy_to_user_memcpy. Real-world timing seems to confirm a speed-up of a few percent in overall running time. Conversely, the perf pipe benchmark shows some improvement in the profiling stats but seems to regress a bit in real-world timing (which is prone to variability).

I have found I have to be careful to prevent performance regressions when optimizing a function, because different tasks use the same function very differently (for example, lots of very small memcpys vs. fewer but larger ones), and the copy_to_user and copy_from_user functions are also important (they share some of the memcpy code).

This is still work in progress and the changes to the memcpy code are not very readable yet, but an updated patch is available at https://github.com/hglm/patches/ as kernel-armv6v7-mem-funcs.patch. I am trying to support other modern ARM platforms as well, but tuning may be difficult due to the wide variability of ARM-based devices.

These optimizations are not going to make a huge difference in practical usage, except maybe in some special cases. But the speed-up of the core functions is not insignificant and will bring some benefit to things like process creation (copy_page) and any hypothetical kernel memcpy-intensive tasks, if they exist.
by dom » Wed Jun 26, 2013 11:01 pm
Sounds good. Even 1 or 2% speedups are worth having.
If a dozen people each found a 2% improvement we'd have a big boost.
Raspberry Pi Engineer & Forum Moderator
by szrpj » Sun Aug 11, 2013 9:29 pm
hglm wrote:But I have also been experimenting with a wholly different set of memcpy functions in my fastarm repository (https://github.com/hglm/fastarm/). These are work in progress; while they are significantly faster than glibc on other ARM platforms, libcofi does pretty well on Raspbian. I have yet to do extensive real-world benchmarks using oprofile, which can give different results from synthetic benchmarks that repeatedly call memcpy. There are some complex trade-offs involved, and historically there have been cases of "optimized" memcpy implementations on various Linux platforms that showed good performance in synthetic benchmarks but actually slowed the system down in real-world usage.

The benchmark results certainly look interesting.