So here is the whole test function (well, I chopped off the pop and return).
Code:
d600804c <branch_test>:
d600804c: e92d00f0 push {r4, r5, r6, r7}
d6008050: e59f006c ldr r0, [pc, #108] ; d60080c4 <one+0x40>
d6008054: e3a04801 mov r4, #65536 ; 0x10000
d6008058: e5901000 ldr r1, [r0]
d600805c: e1a05001 mov r5, r1
d6008060 <top>:
d6008060: e2855001 add r5, r5, #1
d6008064: e2855001 add r5, r5, #1
d6008068: e2855001 add r5, r5, #1
d600806c: e3140001 tst r4, #1
d6008070: e2855001 add r5, r5, #1
d6008074: e2855001 add r5, r5, #1
d6008078: e2855001 add r5, r5, #1
d600807c: 12855001 addne r5, r5, #1
d6008080: e2855001 add r5, r5, #1
d6008084: e2855001 add r5, r5, #1
d6008088: e2855001 add r5, r5, #1
d600808c: e2855001 add r5, r5, #1
d6008090: e2855001 add r5, r5, #1
d6008094: e2855001 add r5, r5, #1
d6008098: e2855001 add r5, r5, #1
d600809c: e2544001 subs r4, r4, #1
d60080a0: 1affffee bne d6008060 <top>
d60080a4: e1a00000 nop ; (mov r0, r0)
d60080a8: e1a00000 nop ; (mov r0, r0)
d60080ac: e1a00000 nop ; (mov r0, r0)
d60080b0: e1a00000 nop ; (mov r0, r0)
d60080b4: e5902000 ldr r2, [r0]
d60080b8: e0410002 sub r0, r1, r2
top: is aligned on an 8-word boundary, and that did make a huge difference in performance. The end of the loop is where it is because I was playing around there and left it like that. The number of processor clocks using the built-in timer:
002AFFFC
Now change from two instructions to two other instructions. I am not interested in how many times I add to r5; I am interested in the number of instructions in the loop.
Code:
d6008070: e2855001 add r5, r5, #1
d6008074: e2855001 add r5, r5, #1
d6008078: e2855001 add r5, r5, #1
d600807c: 0a000000 beq d6008084 <one>
d6008080: e2855001 add r5, r5, #1
d6008084 <one>:
d6008084: e2855001 add r5, r5, #1
d6008088: e2855001 add r5, r5, #1
d600808c: e2855001 add r5, r5, #1
d6008090: e2855001 add r5, r5, #1
0030FFFC
So that added 6 clocks per loop, and I believe this is the sweet spot, the fastest placement. Now perhaps, because we fetch 0x70 through 0x7C in one shot, that goes into the pipe, and with those two instructions preceding the beq the pipe can perhaps know that it is going to branch and where. Maybe not; maybe it is a case of the first prefetch after the beq already being fetched, and we would need one more prefetch after pulling in 0x80 through 0x8C, plus the fetch for the destination (fetching 0x80 through 0x8C again), but that is well over 6 clocks, more than a dozen. So I don't think it fetches starting at 0x80 twice; it is perhaps a combination of things. I don't have more visibility into this test at this time.
So move one of the add r5 instructions from before the branch to after it:
Code:
d6008070: e2855001 add r5, r5, #1
d6008074: e2855001 add r5, r5, #1
d6008078: 0a000000 beq d6008080 <one>
d600807c: e2855001 add r5, r5, #1
d6008080 <one>:
d6008080: e2855001 add r5, r5, #1
d6008084: e2855001 add r5, r5, #1
d6008088: e2855001 add r5, r5, #1
d600808c: e2855001 add r5, r5, #1
00313FFC
It adds another quarter of a clock per loop... interesting.
Shift the beq back another instruction:
Code:
d6008070: e2855001 add r5, r5, #1
d6008074: 0a000000 beq d600807c <one>
d6008078: e2855001 add r5, r5, #1
d600807c <one>:
d600807c: e2855001 add r5, r5, #1
d6008080: e2855001 add r5, r5, #1
d6008084: e2855001 add r5, r5, #1
d6008088: e2855001 add r5, r5, #1
00373FFC
Wow, another 6 clocks. I wonder if the words starting at 0x80 were fetched twice...
Okay, so that asked as many questions as it answered. With that little experiment we at best added 6 clocks per loop by using a branch-if-equal rather than conditional execution, and you can probably guess that a conditional branch is going to add a bunch more clocks than a single conditionally executed instruction.
So going back and removing the conditional execution: do the add every time rather than every other time.
Code:
d6008070: e2855001 add r5, r5, #1
d6008074: e2855001 add r5, r5, #1
d6008078: e2855001 add r5, r5, #1
d600807c: e2855001 add r5, r5, #1
d6008080: e2855001 add r5, r5, #1
d6008084: e2855001 add r5, r5, #1
d6008088: e2855001 add r5, r5, #1
d600808c: e2855001 add r5, r5, #1
002AFFFC
Same number of clocks, no real surprise there.
But look at this, an unconditional branch to the next instruction:
Code:
d6008070: e2855001 add r5, r5, #1
d6008074: e2855001 add r5, r5, #1
d6008078: e2855001 add r5, r5, #1
d600807c: e2855001 add r5, r5, #1
d6008080: eaffffff b d6008084 <one>
d6008084 <one>:
d6008084: e2855001 add r5, r5, #1
d6008088: e2855001 add r5, r5, #1
d600808c: e2855001 add r5, r5, #1
0036FFFC
Ouch, 12 more clocks per loop. It had just prefetched those instructions!
Back that branch up one instruction in the address space; maybe branch prediction comes in and it knows what we are doing:
Code:
d6008070: e2855001 add r5, r5, #1
d6008074: e2855001 add r5, r5, #1
d6008078: e2855001 add r5, r5, #1
d600807c: eaffffff b d6008080 <one>
d6008080 <one>:
d6008080: e2855001 add r5, r5, #1
d6008084: e2855001 add r5, r5, #1
d6008088: e2855001 add r5, r5, #1
00377FFC
Nope, another half clock.
Back up one more:
Code:
d6008070: e2855001 add r5, r5, #1
d6008074: e2855001 add r5, r5, #1
d6008078: eaffffff b d600807c <one>
d600807c <one>:
d600807c: e2855001 add r5, r5, #1
d6008080: e2855001 add r5, r5, #1
d6008084: e2855001 add r5, r5, #1
d6008088: e2855001 add r5, r5, #1
d600808c: e2855001 add r5, r5, #1
00437FFC
Wow, that just hurts. I wonder if the words starting at 0x80 got fetched twice.
Code:
d6008070: e2855001 add r5, r5, #1
d6008074: eaffffff b d6008078 <one>
d6008078 <one>:
d6008078: e2855001 add r5, r5, #1
d600807c: e2855001 add r5, r5, #1
d6008080: e2855001 add r5, r5, #1
d6008084: e2855001 add r5, r5, #1
d6008088: e2855001 add r5, r5, #1
00377FFC
And that hurts a little less.
So I would like to know more, but it is showing the branch penalty. Back to what started all of this, though: with the number of instructions you have in your handler, perhaps it is all conditional, and perhaps you could have traded lots of conditional executions (which always consume clocks) for a few conditional branches over chunks of code (causing some chunks not to consume clocks at all). You might have micro-optimized this more. I don't know, I have not seen your code, and I don't need to. I wanted to point out, as our mentor Michael Abrash would tell us: don't assume you know how it works; test it, try crazy things, and time it. Look at the last two posts, and look at how wildly the same exact sequences of code varied simply by their location in RAM.

Assume and understand that with the cache on, the first passes through that code would have varied much worse depending on how the code landed across cache boundaries. For an interrupt handler you want to execute once every so often, that first pass against the cache could easily consume more clocks than a branch might. Add DRAM to this, and add an operating system with other interrupts and task switching and the like messing with the caches, and it all just gets less predictable. A branch over a cache line which is often evicted due to heavy use by data accesses may turn out to be faster overall than straight-through execution, for example.

At my day job we had a situation where we placed the data we handled on power-of-two boundaries in memory, which is fine if you are handling that size of data every time, but we weren't; most of the time it was smaller, much smaller than the worst case. So we were striping our data and punishing the cache. By changing the reserved size of a data chunk to be 5 times some power of two rather than 8 times that power of two, we used more cache ways, got less cache pounding, and got higher performance.
The additional cost of times five really isn't that bad compared to the <<3 used to get the times eight (actually the whole thing was block number << N). 5 = 4 + 1, so (x*4) + (x*1) = x(4+1) = x*5. You can do (x<<2)+x to get a times five, so you add one operation and burn a temporary register for the duration of that operation, to save a ton of time in cache evicting and fetching.
Fun experiment; sorry to hijack your thread. I hope your handler is working and is fast enough for what you need.
David
Thanks for pointing out that Black Book project. I talked to that person and the wheels are in motion for the Zen of Assembly Language; we'll see if it all works out... I highly recommend that book, especially for a platform like the Raspberry Pi where there are so many things going on in the background that can be cycle stealers...