gchamp
Posts: 6
Joined: Mon Apr 01, 2019 8:10 pm

CPU Execution Speed

Mon Apr 01, 2019 8:49 pm

Hi everyone,

I'm trying to get an accurate picture of how fast the Raspberry Pi 3B+ executes instructions. I have tested 3 approaches to measuring the execution speed, which all give consistent results, but the execution speed is slower than I anticipated.

I have tested timing with the PMCCNTR, with the 1MHz system timer and with an oscilloscope. On my specific benchmark, I measure that each instruction takes on average about 40 cycles to execute. Of course each instruction will have different timing, but 40 cycles on average is way more than I expected.

My full code is available on GitHub. It is a fork of LdB's bare-metal code for the Raspberry Pi, so the benchmark is run with the L1 instruction and data caches properly set up and with branch prediction enabled. The code is executed in 32-bit mode. My main function calls ARM_setmaxspeed, so the CPU speed should be set to 1.4 GHz.

The benchmarked code is the following:

Code: Select all

// t1 = read PMCCNTR OR 1MHz system timer OR turn pin ON 
for (i = 1; i < 10000; i++) {
	c[i] = a[i] + b[i] + c[i - 1];
}
// t2 = read PMCCNTR OR 1MHz system timer OR turn pin OFF
// transmit via serial (t2 - t1)
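
For anyone curious, the two software time sources are read roughly like this. It is only a sketch (not the exact repo code), with the system-timer address assuming the Pi 3 peripheral base of 0x3F000000:

Code: Select all

#include <stdint.h>

/* BCM283x free-running 1MHz system timer, low 32 bits (CLO).
   0x3F000000 is the Pi 3 peripheral base; a Pi 1 would use 0x20000000. */
#define SYSTIMER_CLO ((volatile uint32_t *)0x3F003004)

static inline uint32_t read_systimer(void)
{
    return *SYSTIMER_CLO;                      /* counts microseconds */
}

static inline uint32_t read_pmccntr(void)
{
    uint32_t cycles;
    /* PMCCNTR (cycle counter) via CP15: c9, c13, 0 */
    asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(cycles));
    return cycles;
}
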
With gcc version 7.1.0 with -O3, the heart of the loop compiles to the following assembly code:

Code: Select all

    a104:	e5f73001 	ldrb	r3, [r7, #1]!
    a108:	e5fa1001 	ldrb	r1, [sl, #1]!
    a10c:	e1570004 	cmp	r7, r4
    a110:	e0833001 	add	r3, r3, r1
    a114:	e0822003 	add	r2, r2, r3
    a118:	e6ef2072 	uxtb	r2, r2
    a11c:	e5eb2001 	strb	r2, [fp, #1]!
    a120:	1afffff7 	bne	a104 <kernel_main+0x194>
This is 8 instructions executed 10 000 times, so the benchmark takes about 80 000 instructions to execute (omitting the overhead of reading the time and setting up the loop).

Here are the results for the 3 methods to measure the time.

PMCCNTR: 3502238 cycles
System Timer (1MHz): 2503 ticks
Oscilloscope: 2.520 ms

Assuming that the Raspberry Pi is correctly running at 1.4 GHz, I get the following average cycles per instruction:

PMCCNTR: 3502238 / 80000 = 43.78 cycles
System Timer (1MHz): ((2503 * 10^-6) / (1 / (1400 * 10^6))) / 80000 = 43.80 cycles
Oscilloscope: ((2.520 * 10^-3) / (1 / (1400 * 10^6))) / 80000 = 44.10 cycles
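
Written out, each conversion is just (elapsed time * 1.4 GHz) / 80000. A small sanity-check snippet, using the assumed clock and instruction count from above:

Code: Select all

#include <stdio.h>

int main(void)
{
    const double freq_hz = 1400e6;   /* assumed core clock       */
    const double n_instr = 80000.0;  /* instructions in the loop */

    printf("PMCCNTR:      %.2f cycles/instr\n", 3502238.0 / n_instr);
    printf("System timer: %.2f cycles/instr\n", 2503e-6 * freq_hz / n_instr);
    printf("Oscilloscope: %.2f cycles/instr\n", 2.520e-3 * freq_hz / n_instr);
    return 0;
}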

The code on GitHub enables the PMCCNTR, but I have tested without activating it and the results for the system timer are the same.
I am fairly certain the Raspberry Pi is running at 1.4 GHz since ARM_setmaxspeed correctly reports this frequency on the UART.
I have also tested reading the results in a debugger over JTAG (except for the oscilloscope ;), but the results are no different.
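
For reference, enabling the cycle counter only takes a couple of CP15 writes. A minimal sketch (again, not the exact repo code):

Code: Select all

#include <stdint.h>

static inline void enable_pmccntr(void)
{
    uint32_t pmcr;

    /* PMCR: set E (enable all counters) and C (reset the cycle counter) */
    asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
    pmcr |= (1u << 0) | (1u << 2);
    asm volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(pmcr));

    /* PMCNTENSET: bit 31 enables the cycle counter itself */
    asm volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31));
}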

So I assume there are 3 possible causes for these results:
- The methodology is flawed (since 3 different methods give the same results, I assume it is not)
- I am misinterpreting the results
- Something that causes a huge slowdown is not activated (the L1 Icache, L1 Dcache and branch prediction are on)

Does anyone have an idea why the instructions seem so slow to execute?

Thanks in advance,
Guillaume

LdB
Posts: 1280
Joined: Wed Dec 07, 2016 2:29 pm

Re: CPU Execution Speed

Tue Apr 02, 2019 12:43 am

You need the L2 cache on for proper CPU performance. Most simple examples won't set it up, because once the cache is on you need cache control on some things, and that is a whole learning cycle of its own. Having got familiar with the basics, the next piece of work is setting up the L2 cache. The setup is different between 32-bit and 64-bit and can vary between models depending on which TLB format you use, which adds to the complexity.
Your code above indicates you are in 32-bit mode on a Pi 3, so you probably want the ARMv7 table format, but the ARMv8 short-descriptor form is another option.

Once you understand the translation-table setup, this will give you a basic overview of the registers you need to bash:
https://wiki.osdev.org/ARM_Paging
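
To give a rough idea of the shape of it, here is a sketch of a 32-bit identity map using ARMv7 short-descriptor 1 MB sections. This is illustrative only, not the code from my repo, and it assumes the boot code has already invalidated the caches and set the SMP enable bit:

Code: Select all

#include <stdint.h>

/* 4096 x 1MB sections cover the 32-bit address space; the table must be 16KB aligned */
static uint32_t __attribute__((aligned(16384))) level1_table[4096];

void mmu_enable_identity_map(void)
{
    for (uint32_t mb = 0; mb < 4096; mb++) {
        uint32_t entry = (mb << 20) | 0x2;        /* section descriptor, domain 0        */
        entry |= (3u << 10);                      /* AP[1:0] = 11: full read/write       */
        if (mb < 0x3F0)                           /* RAM below the Pi3 peripheral window */
            entry |= (1u << 16) | (1u << 12) | (1u << 3) | (1u << 2); /* S, TEX=001, C, B: normal write-back */
        else                                      /* peripherals and QA7 at 0x40000000   */
            entry |= (1u << 2);                   /* B only: shareable device memory     */
        level1_table[mb] = entry;
    }

    asm volatile("mcr p15, 0, %0, c2, c0, 2" :: "r"(0u));           /* TTBCR = 0             */
    asm volatile("mcr p15, 0, %0, c2, c0, 0" :: "r"(level1_table)); /* TTBR0 = table address */
    asm volatile("mcr p15, 0, %0, c3, c0, 0" :: "r"(1u));           /* DACR: domain 0 client */
    asm volatile("mcr p15, 0, %0, c8, c7, 0" :: "r"(0u));           /* invalidate TLBs       */
    asm volatile("dsb");
    asm volatile("isb");

    uint32_t sctlr;
    asm volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(sctlr));
    sctlr |= (1u << 0) | (1u << 2) | (1u << 12);                    /* M, C and I bits on    */
    asm volatile("mcr p15, 0, %0, c1, c0, 0" :: "r"(sctlr));
    asm volatile("isb");
}

The 64-bit setup uses the ARMv8 translation tables and MAIR instead, but the idea is the same.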

Finally, even when you do all that, the Pi GPIO bus is slow; it varies by Pi model, from roughly 44 MHz on a Pi 1 to 70 MHz on a Pi 3. It doesn't matter how fast the core is going: when you try to hit a pin, it acts like a massive wait state is inserted. It's actually a bus exchange, but the effect is the same.

If you want to see the max GPIO bus speeds on each model I recommend
https://github.com/hzeller/rpi-gpio-dma-demo

gchamp
Posts: 6
Joined: Mon Apr 01, 2019 8:10 pm

Re: CPU Execution Speed

Tue Apr 02, 2019 2:32 am

Hi,

thank you for the answer.

Will L2 cache access really have an impact here? The instructions in the core of the loop fit entirely in the L1 cache, so there shouldn't be any instruction cache misses; I could confirm this with the PMU. Given the nature of the computation inside the loop, I see there could be some Dcache misses, but I've done similar tests without large memory accesses and the results weren't any better. Is there something else that would be faster with the L2 cache enabled?

In any case, I am willing to try it. Am I correct that your example 10_virtualmemory enables the L2 cache? I can use it as a starting point to redo the benchmarking.

Thank you for the heads-up on the GPIO. I am not trying to optimize the toggling of a pin, so that should not be a concern; my results are the same even without toggling the pin.

gchamp
Posts: 6
Joined: Mon Apr 01, 2019 8:10 pm

Re: CPU Execution Speed

Wed Apr 03, 2019 1:32 am

It seems you are right. I redid the test in 64-bit mode with the MMU enabled, and the average cycles per instruction for my loop is now about 0.70. I added my code on GitHub.

Do you know why enabling the MMU makes such a huge difference? From what I understood, if the MMU is disabled it simply acts as an identity function for the address translation (VA = PA), so my assumption was that the caches would still work correctly.

LdB
Posts: 1280
Joined: Wed Dec 07, 2016 2:29 pm

Re: CPU Execution Speed

Wed Apr 03, 2019 12:11 pm

The opcodes are fetched, and the data read and written, from the caches feeding the pipelines.
My MMU code simply lays an identity map 1:1 over the 1 GB + QA7 memory of the Pi, so practically it looks the same.

Given that detail, and that your code only does those two things, what is creating the speedup should be obvious :-)

http://infocenter.arm.com/help/index.js ... BABIC.html

The Cortex-A55 gets even more of a boost because its L2 cache is private to each core; there is a trade-off in the A53 in favour of a smaller size.

Long answer made short: it isn't feasible to measure CPU performance with the L2 cache off. Even how well you match the L2 cache setup to your code can make a big difference, which is why there is no one-size-fits-all way to set the L2 cache up.

gchamp
Posts: 6
Joined: Mon Apr 01, 2019 8:10 pm

Re: CPU Execution Speed

Wed Apr 03, 2019 7:02 pm

I think I understand more clearly what the issue was now. Thank you for your answers and your time.

I read a bit more on the behavior of the Cortex-A53 when the MMU is disabled, and I believe that if the MMU is off, the L1 and L2 data caches do not allocate anything. The instruction cache, however, works as expected.

What I found is that:
  • The L1 and L2 caches cannot be disabled independently, so setting the Dcache enable bit turns everything on. [1]
  • If the MMU is off, every data access is treated as an access to non-cacheable Device-nGnRnE memory, so any access to memory goes directly to main memory. [2]
  • If the MMU is off, the instruction cache still works as expected if enabled. [3]
So, if the MMU is off, any instruction that touches memory takes roughly 100 times longer than it should (~1 cycle vs ~100 cycles). To verify this, I used the cycle counter to time 100 immediate mov instructions (mov x0, #100) and 100 load instructions (ldr x0, [x1]). The mov instructions take about 250 cycles and the ldr instructions take about 14500 cycles (!). After enabling the MMU and setting the correct memory attributes, thus making the data cache usable, the 100 mov instructions take about 60 cycles and the 100 load instructions about 110 cycles. I am not 100% certain why enabling the data cache also affects the speed of the immediate move instructions, but as you said, it clearly comes down to making sure that all the caches are actually usable. I have updated my code on GitHub to include this benchmark.
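
For reference, the measurement looked roughly like this. It is a minimal sketch of that kind of timing, not the exact code I put on GitHub, and it assumes the cycle counter has already been enabled through PMCR_EL0 and PMCNTENSET_EL0:

Code: Select all

#include <stdint.h>

static inline uint64_t read_cycles(void)
{
    uint64_t c;
    asm volatile("isb\n\tmrs %0, pmccntr_el0" : "=r"(c));
    return c;
}

void time_mov_vs_ldr(uint64_t *mov_cycles, uint64_t *ldr_cycles)
{
    static uint64_t dummy = 42;
    uint64_t t0, t1, t2;

    t0 = read_cycles();
    /* 100 back-to-back immediate moves */
    asm volatile(".rept 100\n\tmov x0, #100\n\t.endr" ::: "x0");
    t1 = read_cycles();
    /* 100 back-to-back loads from the same address */
    asm volatile(".rept 100\n\tldr x0, [%0]\n\t.endr" :: "r"(&dummy) : "x0");
    t2 = read_cycles();

    *mov_cycles = t1 - t0;
    *ldr_cycles = t2 - t1;
}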

Once again, thank you very much for your help and your code on github! :D

[1] ARM Cortex-A53 MPCore Processor Technical Reference Manual, rev r0p2, section 6.2 "Cache Behavior"
[2] ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile, section D5.2.9 "The effects of disabling a stage of address translation"
[3] ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile, section D5.2.9 "The effects of disabling a stage of address translation"

gpk
Posts: 3
Joined: Mon Mar 25, 2019 9:08 pm

Re: CPU Execution Speed

Tue Apr 09, 2019 8:45 pm

I have also noticed this. I thought (probably naively) that the following routine would cause a busy wait of roughly n cycles (with n passed in x0):

Code: Select all

wait_cycles:
        cmp     x0, 0
        ble     end
        sub     x0, x0, #4      // we'll spend 4 cycles in the loop
        b       wait_cycles
end:
        ret
But it actually waits for about 30x more cycles than I thought it would. So even without memory access it's about 30 cycles per instruction!

gchamp
Posts: 6
Joined: Mon Apr 01, 2019 8:10 pm

Re: CPU Execution Speed

Wed Apr 10, 2019 8:43 pm

Right, I didn't find a satisfactory explanation as to why operating only on registers is still a bit slower with the MMU off. However, with the MMU configured for a 1:1 mapping, everything that touches memory is about 10 to 20 times faster.

dwelch67
Posts: 955
Joined: Sat May 26, 2012 5:32 pm

Re: CPU Execution Speed

Tue Apr 23, 2019 9:17 pm

You can find my experiments from years ago too. Your loop has the problem that at least three instructions in a row are waiting on the instruction just before them to complete, so you lose a lot of what a pipeline like this gives you. For example, a loop with a bunch of the same instruction, add r0,r0,#1, will go really slow because of the dependencies, but if you craft the benchmark differently you can/will get vastly different results. It is also a very shallow loop, so it branches often; make it deeper, perhaps much deeper. Or better yet, just try to find my question. I asked on Stack Overflow as well, but may have removed that one once I figured out it was me, not the processor.

Also remember we are a slave to the memory system, so don't expect the ARM to be fed as fast as it would like, even with L2. So try to keep the test in L1.

Run bare metal, not in an operating system, etc.

Not sure what your goal is, but remember benchmarks are BS because they are so easy to manipulate. The processor will average about a clock or less per instruction, whatever ARM's documents state, if you can feed it and if you can remove the obvious hazards/stalls...

dwelch67
Posts: 955
Joined: Sat May 26, 2012 5:32 pm

Re: CPU Execution Speed

Tue Apr 23, 2019 9:21 pm

Code: Select all

.globl ARMTEST3
ARMTEST3:
subs r0,r0,#1
nop
nop
nop
nop
nop
nop
nop
nop
subs r0,r0,#1
nop
nop
nop
nop
nop
nop
nop
nop
subs r0,r0,#1
nop
nop
nop
nop
nop
nop
nop
nop
subs r0,r0,#1
nop
nop
nop
nop
nop
nop
nop
nop
subs r0,r0,#1
nop
nop
nop
nop
nop
nop
nop
nop
subs r0,r0,#1
nop
nop
nop
nop
nop
nop
nop
nop
subs r0,r0,#1
nop
nop
nop
nop
nop
nop
nop
nop
subs r0,r0,#1
nop
nop
nop
nop
nop
nop
nop
nop
bne ARMTEST3
bx lr


ARMTEST3
0x01000000 sub instructions
0x08000000 nop instructions
0x00100000 bne instructions
0x09100000 instructions total
0x037000D7 system clocks
2.64 instructions per clock, 659 MIPS


That was on an original Raspberry Pi, when I was just getting started; there has been a lot of water and code under the bridge since then. I might even have left this code in my repo...

dwelch67
Posts: 955
Joined: Sat May 26, 2012 5:32 pm

Re: CPU Execution Speed

Tue Apr 23, 2019 9:23 pm

Look in my bench02 directory in the (old) raspberrypi repo (github.com/dwelch67/raspberrypi). Port that forward as needed.

David (dwelch67)
