hldswrth
Posts: 108
Joined: Mon Sep 10, 2012 4:14 pm

memset performance

Mon Feb 24, 2014 5:36 pm

Following some discussion in an earlier thread at http://www.raspberrypi.org/phpBB3/viewt ... 72&t=65573, I thought I would try to optimise the memset and memcpy routines in my bare metal OS. These are used in display handling and file/program loading, so it seemed worthwhile.

For testing purposes I link the memset routine into a test program which then runs various lengths and addresses. This allows me to update the memset routine and serial-download the updated program without restarting the Pi. The test program gets loaded into memory each time it's run, potentially at a different address. The address of the buffer memset operates on is always the same.

If the memset routine is loaded at an address which is an odd multiple of 16 bytes, the routine takes around 27 milliseconds to memset 1024 bytes 10000 times (around 379MB/s). However the same test takes 45 milliseconds if the code happens to be loaded at an even multiple of 16 bytes (so around 227MB/s).

As far as I can tell it's only the instruction cache and branch prediction which could possibly account for this difference, but it's *huge* - a 66% degradation based purely on where the code gets loaded. I have the MMU enabled but am only mapping entire 1MB sections, and the code is always loaded entirely into the same section.

I have tried disabling caches and branch prediction (control register holds 0x00850079), but I still get two results, a faster one (36ms) and a slower one (47ms), with the same correspondence to the memset routine's entry address. Is there anything else that could cause this difference based purely on where in memory the code is loaded?

dwelch67
Posts: 968
Joined: Sat May 26, 2012 5:32 pm

Re: memset performance

Mon Feb 24, 2014 6:36 pm

Did you look at the recent FIQ Size thread? Where exactly you load your code can make a big difference, both in how the ARM fetches and in the side effects of that on the caches. My typical trick is to add or remove nops in my bootstrap code so that the whole binary moves this way or that 4 bytes at a time. For tests like yours I would vary only two fixed values, the from address and the to address for the copy, and not let the code move; that would help explain some of your memcpy performance. Then, independent of that and without changing those numbers, push the memcpy routine through memory one word per experiment until the performance numbers start to repeat and/or you find the sweet spot with the highest performance, then vary your to and from addresses again and see what happens.

This is the nature of benchmarking. Not you specifically, but all too often I find assumptions that all compilers generate the same code from the same high-level language; that you have to change your high-level code in order to change the performance of a program; that you have to change platforms to change the performance; etc. It is quite easy to demonstrate with most benchmarks, even small loops, that execution performance can vary widely with the same compiler, same toolchain, same hardware, same everything except how you build things. Even changing the order of the .c files on a gcc command line can and will vary your performance. The next problem is accurately timing the experiment; there are often mistakes made there as well.

The Michael Abrash books talked about in that FIQ Size thread are good; I am most familiar with the Zen of Assembly Language, but have all of his stuff. The Black Book also has a Zen timer chapter. There is work to port the Zen of Assembly to an open epub as well; we'll see what happens with that. Don't focus on his asm or his graphics but on the idea that your assumptions about what is going on are probably wrong: you need to experiment, time things, try crazy things, see the results vary, and then try to figure out why.

If I assumed right and your memcpy routine and/or the loop that calls it is not in the same location in memory for every test, then that difference can affect the results. You are not comparing apples to apples between one size or offset and another if the code behind the memcpy is moving as well. For a memcpy performance test the code running the test needs to be the same every time, and only the starting offsets should vary from one binary to the other. And so on. When I benchmark, which is rare these days, I will build the same source with -O0, -O1, -O2, -O3, and for each of those I will have one, two, three, four, five... nops in the bootstrap, and basically try to find the maximum performance of all of those combinations. Then repeat that with the other compiler or whatever else I am benchmarking against.

dwelch67
Posts: 968
Joined: Sat May 26, 2012 5:32 pm

Re: memset performance

Mon Feb 24, 2014 6:42 pm

In short, it is the FIQ Size discussion again. I believe the ARM fetches multiple instructions per fetch, aligned. With that assumption, where a branch lives within that chunk of instructions affects how much time the branch prediction has, if enabled. Then there are all the cache side effects as well: the alignment of your loops may cause an extra cache line fetch the first time through if they are not centered well. Worse, if you have a loop calling a loop and they happen to be placed such that their cache lines collide, they may be evicting each other (there are more than two ways, so a number of things would need to come together to cause this, but your data move may be causing your code to be evicted in L2).

And as I mentioned, I would vary optimization and such; I forgot that I also vary I-cache on/off and D-cache on/off...

I went so far as to write my thumbulator simulator specifically to see how llvm/clang and gcc compared as far as instructions executed. That was my original reason to write that simulator.

hldswrth
Posts: 108
Joined: Mon Sep 10, 2012 4:14 pm

Re: memset performance

Mon Feb 24, 2014 7:39 pm

Thanks for the response. I'll try a hardcoded memory address for loading the program to begin with and then move it around to see how that affects the performance. Right now program code alignment is only at a 16 byte boundary in my O/S (I allocate the code memory off a heap). When I've got proper memory management it will be on a 4K boundary, so at least then the memset code can be predictably located.

From what I've seen I need to check a range of addresses from 0 to 28 bytes offset; I think the pattern repeats after that but will verify.

To fully optimise I'm expecting that it would be necessary to align specific sections of the code, not just the entire routine - I might find the sweet spot for the routine as a whole but that might not be the absolute optimum. However right now that's not my top priority but I would at least like to understand the reason for the variation.

What still concerns me is that I see two different measurements with all caches and branch prediction turned off.

colinh
Posts: 95
Joined: Tue Dec 03, 2013 11:59 pm
Location: Munich

Re: memset performance

Thu Feb 27, 2014 8:32 pm

If you're getting 380 Mbytes/s instead of around a gigabyte/s for a memset then something is fundamentally wrong. It's hard to tell what, though, without a little more info - like, are you even doing this in assembler? :)

You need the MMU enabled, not just the I and D cache and branch prediction bits set. I had a certain amount of trouble setting the right bits for the entries in my TTBR table. The data should be 32-byte aligned so that an ldm/stm of 8 words uses a whole cache line in one operation.

Did I give enough info/code in the thread you mention for you to try my code? It might be an idea to start from there and then optimise. Or from Simon Hall's memcpy/memset code linked to in my first post.

hldswrth
Posts: 108
Joined: Mon Sep 10, 2012 4:14 pm

Re: memset performance

Sun Mar 02, 2014 1:30 am

My memset is in assembler. It uses at most one each of store byte, halfword, word, 2 words and 4 words to get to a 32-byte boundary, then writes 8 words at a time (and then one of each of the above as necessary for any bytes left over).
MMU, data and instruction cache and branch prediction are all turned on. I'm hoping to get some time this week to play with the code location and get more consistent results.

My translation table entries are for full sections and have the following flags:

Type: 0x0000100C (Outer and Inner Write-Back, Write Allocate, Normal shareable)
Shared: 0x00010000
Access: 0x00000400 (Kernel read/write, user no access)
Section: 0x00000002
Total: 0x0001140E

colinh
Posts: 95
Joined: Tue Dec 03, 2013 11:59 pm
Location: Munich

Re: memset performance

Sun Mar 02, 2014 10:16 pm

Total: 0x0001140E
OK. You probably want a different entry for the peripheral registers and IO (i.e. Device memory, not Normal memory), and you can set the XN bit for regions that don't contain executable code.

That gives 0x00010416 for such memory.

How's your CP15 control register c1? OK, you say you have I, Z, C and M set. I have XP set too.

hldswrth
Posts: 108
Joined: Mon Sep 10, 2012 4:14 pm

Re: memset performance

Mon Mar 03, 2014 2:13 pm

Colin,

With regard to the TTBR attributes, "Device" only makes sense for memory-mapped registers, correct? In that case I don't see it being relevant to a memset function, which is likely never to be used on memory-mapped registers. GPU memory might want different caching attributes to regular memory, I guess, as typically I would not read from that memory; but for a general-purpose memset function I want to measure against "Normal" memory.

Here's my initial inner loop, 3 instructions, where r1 and r4-r10 are set up with copies of the byte to be set:

Code: Select all

ms_loop_32:
    stmia r0!,{r1,r4-r10}
    subs r3, r3, #32
    bne  ms_loop_32
Measurements confirm that if ms_loop_32 happens to lie in a different 32-byte area from the branch instruction, performance drops by around 40% (4.3 seconds for 1GB rather than 2.4 seconds). Given there are only two instructions between them, 6 out of 8 alignments of the code give the higher performance.

I tried modifying the loop to split the stmia into two as follows:

Code: Select all

ms_loop_32:
    stmia r0!,{r1,r4-r6}
    stmia r0!,{r1,r4-r6}
    subs r3, r3, #32
    bne  ms_loop_32
As expected there are 3 out of 8 alignments where this is slower, again agreeing with the analysis that if the loop start is in a different 32-byte block from the branch itself, performance suffers.
With this change the best I got was 2.3 seconds for 1GB, very little different from the single stmia instruction.
My code has no other loops in it; there are some conditionals and a handful of branches.

The control register value is 0x0085187d for all tests. This means the XP, I, Z, C and M bits are all set.
TTBR attributes for both the code and data addresses are 0x0001140e. Trying 0x0001040e (no allocate on write) appeared to make no difference to performance. In either case the best I could get was 2.3 seconds for 1 million memsets of 1024 bytes, or 445 megabytes per second. I expect performance would be a little better for a smaller number of larger memsets.
(Edit: tested 1000 memsets of 1024000 bytes in 2.0 seconds, so 512MB/s, but still a way off 1GB/s.)

colinh
Posts: 95
Joined: Tue Dec 03, 2013 11:59 pm
Location: Munich

Re: memset performance

Mon Mar 03, 2014 3:20 pm

hldswrth wrote:With regard to the TTBR attributes, "Device" only makes sense for memory-mapped registers, correct? In that case I don't see it being relevant to a memset function, which is likely never to be used on memory-mapped registers. GPU memory might want different caching attributes to regular memory, I guess, as typically I would not read from that memory; but for a general-purpose memset function I want to measure against "Normal" memory.
Well, you won't be using memset on the memory-mapped registers. But you do *use* them, don't you?

I have write-allocate off - so as not to clobber the cache. I also have sharing off. Other than that, I don't know. Do you get a different result if you switch the data caching off (just to make sure that D-cache is working at all)?

dwelch67
Posts: 968
Joined: Sat May 26, 2012 5:32 pm

Re: memset performance

Mon Mar 03, 2014 4:00 pm

Could/has someone done the stmia-only loop test with an ldmia-only one? I am curious. It is my belief, from a different ARM11, that writes don't use a larger burst length on the AXI bus. There is a length field, and while we would hope that an 8-word read or write goes out as one burst of four 64-bit beats (a length of 4, encoded as 4-1 = 3 on the AXI bus), the read does use a length of 4 but the write is perhaps a length of 1. I wonder if your tests can see that kind of resolution. Reads in general should be slower than writes, but if you see them faster that would tell us something...

David

hldswrth
Posts: 108
Joined: Mon Sep 10, 2012 4:14 pm

Re: memset performance

Tue Mar 04, 2014 2:22 pm

Thanks for the suggestions so far.
One other thing to check - I'm setting the TTBR0 address to table address | 0x00000003 (shareable/cacheable). Any difference from what you are doing?

colinh
Posts: 95
Joined: Tue Dec 03, 2013 11:59 pm
Location: Munich

Re: memset performance

Tue Mar 04, 2014 4:24 pm

Code: Select all

init_mmu:
    push    {lr}
    ldr     r1, =ttbr0              // addr(TTBR0)

    ldr     r2, =0x0000040E
    mov     r3, #0
    mov     r4, #0x200
    bl      set_pgtbl_entry

    ldr     r2, =0x00002416
    mov     r3, #0x200
    mov     r4, #0x1000             // end at memloc 0x01.0000.0000 (ie. 0xFFFF.FFFF)
    bl      set_pgtbl_entry

    ldr     r2, =0x0000040E
    mov     r3, #0x480              // framebuffer = 0x4800.6000
    mov     r4, #0x490              // make 10 Mbyte cacheable
    bl      set_pgtbl_entry

    mov     r3, #3
    mcr     15, 0, r3, cr3, cr0, 0  // set domain 0 to master

    mcr     15, 0, r1, cr2, cr0, 0  // set TTBR0 (addr of ttbr0)  (ptblwlk inner non cacheable,
                                    // outer non-cacheable, not shareable memory)
    mov     r3, #0
    mcr     15, 0, r3, cr7, cr7, 0  // invalidate data cache and flush prefetch buffer

    mcr     15, 0, r3, cr8, cr7, 0

    ldr     r3, =0x00801805
    mrc     15, 0, r2, cr1, cr0, 0  // enable MMU, L1 cache and instruction cache, L2 cache, write
    orr     r2, r3                  // buffer, branch prediction and extended page table on
    mcr     15, 0, r2, cr1, cr0, 0

    pop     {pc}

set_pgtbl_entry:
    lsl     r0, r3, #20             // = r3 * 0x10.0000 (1M)
    orr     r0, r2
    str     r0, [r1, r3, lsl #2]
    add     r3, #1
    cmp     r3, r4
    bne     set_pgtbl_entry
    mov     pc, lr

 .section .data
 
 .align 14
 
 .globl ttbr0
 ttbr0:
 .space 4<<12                        // 4 bytes * 4096 entries

Edit: changed code

hldswrth
Posts: 108
Joined: Mon Sep 10, 2012 4:14 pm

Re: memset performance

Tue Mar 11, 2014 12:07 pm

Thanks for that. Mostly my settings are already the same as yours. Specific differences I see in my code vs. yours:

I have two additional control register updates in my VMM setup, to restrict the cache size to 16K and to indicate always use TTBR0;
I set domain 0 to client rather than master;
I set the caching and sharing flags (0x3) in the TTBR0 address;
My TTBR0 entries are "shared" whereas yours are "non-shared".

I'll try these changes to see what happens to the memset performance.

rst
Posts: 488
Joined: Sat Apr 20, 2013 6:42 pm
Location: Germany

Re: memset performance

Wed Mar 12, 2014 12:11 pm

hldswrth wrote:My TTBR0 entries are "shared" whereas yours are "non-shared".
The "shareable" memory region attribute has a big influence on performance. Turning it off will normally result in a huge performance gain.

But: It may work in a simple scenario while complex systems are much more difficult to get reliable without it.

Rene

hldswrth
Posts: 108
Joined: Mon Sep 10, 2012 4:14 pm

Re: memset performance

Fri Mar 14, 2014 3:45 pm

<edit>
Managed to get this working with the values you use, with non-shared memory.
I had to make some changes to my code to get it to work:
1) In the mailbox interface, because memory is shared between the ARM and VC processors, I have to clean and invalidate the data cache before writing to the mailbox so that the VC sees the values and the ARM sees the updates that the VC makes. That was stopping me from getting a framebuffer or the memory size.
2) In my module loader I need to clean the data cache as well as invalidating the instruction cache when I load code into memory.
3) The keyboard (CSUD) currently does not work in this mode, again I suspect because of buffers shared between ARM and devices, so I will have to look at inserting data cache clean and/or invalidation at the right places - I've not yet got that working.

hldswrth
Posts: 108
Joined: Mon Sep 10, 2012 4:14 pm

Re: memset performance

Tue Mar 18, 2014 9:22 pm

Final update here. I got non-shared memory working with the CSUD USB/keyboard driver after some additional changes in the CSUD code. The simplest option was to use non-shared memory everywhere and then force a cache clean/invalidate when transferring data from the ARM to USB, which I did by updating the HcdTransmitChannel function:

Code: Select all

	// Clean and invalidate the entire data cache so that DMA sees the
	// data and we see the updated data.
	int cr = 0;
	__asm volatile ("mcr p15, 0, %0, c7, c14, 0" :: "r" (cr));
	// Rest of code unchanged...
	Host->Channel[channel].DmaAddress = buffer;
With the non-shared memory attribute my memset routine completes 100MB in 66ms, or 1.4GB per second, which now sounds like the kind of figure I should be seeing, so thanks for the guidance!

To avoid the cache invalidation I then tried mapping 1MB of shared memory and updated CSUD to use that for the buffer. I then wasted some time until I realised that the DmaAddress shown above has to be the physical address of the buffer, not the virtual address. My original non-shared buffer happened to be at physical+0x8000000 so it just worked; my shared buffer didn't have such a simple mapping. Now I have that sorted it's working fine.

Return to “Bare metal, Assembly language”