turboscrew
Posts: 174
Joined: Sat Jan 18, 2014 1:50 pm
Location: Nokia (town), Finland

RPi 2b (Cortex A7) instruction timing?

Sat Aug 22, 2015 11:42 am

Has anyone come across information on Cortex-A7 instruction timings?
I read that most instructions take one 'cycle', but what is a 'cycle' in ARM terminology?
It means different things on different processors.

How is the external clock frequency related to cycle times?
In some processors the external clock period and the cycle period are the same. In others, an internal cycle takes several external clock periods, and in others still there is an internal frequency multiplier, so the cycle frequency can be, say, 4 times the external frequency. I recall that somewhere a pipeline slot was called a 'cycle'.

My goal is to be able to calculate (roughly) how long (in ms, us, or maybe ns) it takes to execute a given piece of assembly code, such as an interrupt routine.
De-bugging is for sissies - real men do de-monstrations.

mahjongg
Forum Moderator
Posts: 12322
Joined: Sun Mar 11, 2012 12:19 am
Location: South Holland, The Netherlands

Re: RPi 2b (Cortex A7) instruction timing?

Sat Aug 22, 2015 12:20 pm


turboscrew
Posts: 174
Joined: Sat Jan 18, 2014 1:50 pm
Location: Nokia (town), Finland

Re: RPi 2b (Cortex A7) instruction timing?

Sat Aug 22, 2015 4:22 pm

Could you be maybe a bit more specific?
I couldn't find anything relevant (except input clock to AMBA master clock relationship).

jdb
Raspberry Pi Engineer & Forum Moderator
Posts: 2115
Joined: Thu Jul 11, 2013 2:37 pm

Re: RPi 2b (Cortex A7) instruction timing?

Sat Aug 22, 2015 10:37 pm

In a processor as complex as the A7, the answer is invariably "it's complicated".

It has a limited superscalar (dual issue, in-order execution) pipeline which I believe is 8 stages deep. If you correctly order your instructions, memory accesses and data alignment you can have an ultimate execution rate of 2 instructions per clock.

Let's say you have a tightly written assembly loop, which performs some operation 1024 times and then exits. It reads some data from memory, performs some logic/arithmetic on it and writes the data out to a different area of memory. Both the L1 and L2 caches are switched on.

The first time round the loop, it will take substantially longer to get from the start of the loop to the end, because of all the cache activity that has to happen before the tightly written loop can run at full speed. The instruction cache needs to fetch the lines containing your loop, and the branch predictor needs to get itself up to speed with which instructions it should speculatively fetch and retain in the L1 Icache. Data needs to be brought in, perhaps from as far away as SDRAM (hundreds of cycles).

Assuming that after the first time round the loop the various caches have done their jobs and you now have single-cycle access to instructions and data, the next N iterations will go a lot faster. If, through sensible use of preload hints and instruction ordering, your assembly suffers no pipeline stalls waiting on data or register results and makes full use of both execution pipelines, you can achieve 2 IPC. Writes are lazy: they head off to the load/store unit, and the store hardware keeps track of any dependency hazards. If you subsequently don't touch data that's been written, you can keep that store unit maximally busy.

A useful thing present in ARM processors is the processor cycle counter. It counts processor clock cycles, so reading it before and after a piece of code tells you how many cycles that code took.

http://infocenter.arm.com/help/index.js ... FDEEJ.html
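
As a rough illustration of the cycle-counter approach, here is a minimal C sketch. It assumes an ARMv7-A core running in a privileged mode (as bare-metal code normally does), GCC-style inline assembly, and the architectural PMU register encodings (PMCR, PMCNTENSET, PMCCNTR). The function names are made up for this example. On any other architecture the hardware accessors compile to stubs so the arithmetic can still be tried on a PC.

```c
#include <stdint.h>

/* Enable the PMU cycle counter. Hypothetical helper; does nothing
 * when built for a non-ARM host. */
static inline void enable_cycle_counter(void)
{
#if defined(__arm__)
    uint32_t pmcr;
    /* PMCR: performance monitor control register */
    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
    pmcr |= (1u << 0) | (1u << 2);  /* E: enable counters, C: reset cycle counter */
    pmcr &= ~(1u << 3);             /* D = 0: count every cycle, not every 64th */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(pmcr));
    /* PMCNTENSET bit 31 switches the cycle counter itself on */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(1u << 31));
#endif
}

/* Read the 32-bit cycle counter (PMCCNTR). Returns 0 on a non-ARM
 * host, purely so that this sketch compiles there. */
static inline uint32_t read_cycle_counter(void)
{
#if defined(__arm__)
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
#else
    return 0;
#endif
}

/* Cycles between two readings. Unsigned subtraction gives the right
 * answer even if the counter wrapped once between the readings. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

Typical use would be to call enable_cycle_counter() once, read the counter before and after the code under test, and pass both readings to cycles_elapsed(). The first run will include cache-fill time, so it is worth measuring several runs.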
Rockets are loud.
https://astro-pi.org

turboscrew
Posts: 174
Joined: Sat Jan 18, 2014 1:50 pm
Location: Nokia (town), Finland

Re: RPi 2b (Cortex A7) instruction timing?

Sun Aug 23, 2015 12:55 am

I expected it to be complicated - had a taste of it with TI 6000 series DSP.
And the TI 6000-device documents (including the pipelining) were available. :lol:

First of all: is the chip synchronized to the input frequency?
That is, is the 'cycle time' equal to the clock period?
So "you can achieve 2 IPC" at 900 MHz would mean (about) 1800 MIPS if the core could "run free"?
(Of course, in reality there are limiting factors, as you mentioned: caches, instruction dependencies, ...)
I understand that documents about the pipelines are under NDA?
So I have to settle for some "rule of thumb" anyway?

I guess my best approach is to read about the performance monitor (in general) and the cycle counter (in particular).
Thanks!

jdb
Raspberry Pi Engineer & Forum Moderator
Posts: 2115
Joined: Thu Jul 11, 2013 2:37 pm

Re: RPi 2b (Cortex A7) instruction timing?

Sun Aug 23, 2015 5:27 pm

There is a single input pin for the A7 clock. This clock is used throughout the cluster.

As usual, the best documentation for the pipelines is source code: in this case, GCC.

https://github.com/gcc-mirror/gcc/blob/ ... rtex-a7.md

turboscrew
Posts: 174
Joined: Sat Jan 18, 2014 1:50 pm
Location: Nokia (town), Finland

Re: RPi 2b (Cortex A7) instruction timing?

Sun Aug 23, 2015 6:38 pm

Thanks. That needs to be included in the list of readings too.

At the moment, my problem is really much simpler - a delay loop.
This is closish, but not quite:


debug_wait1: @ ms count in r8
	cmp	r8, #0
	bxeq	lr
	mov	r10, #0
loop4$:
	ldr	r9, =0x0000061a
loop5$:
	nop	{0}
	subs	r9, r9, #1
	bne loop5$
	add	r10, r10, #1
	cmp	r10, r8
	bne loop4$
	bx	lr

If I take a rough guess of a 700 MHz clock (less than 10% error), a clock cycle is ~1.5 ns.
If the three instructions nop, subs and bne each take 1 cycle, a single loop iteration would take ~4.5 ns.
Then 1 million ns (= 1 ms) / 4.5 ns = ~222222 iterations = 0x3640e.
That is quite far from 0x61a. What's wrong with my calculation?
(Maybe that also clears my question about cycle - clock frequency relation...)
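
The same arithmetic can be run backwards to see what the loop is really costing. The sketch below assumes that 0x61a inner iterations take about 1 ms, as described above, and keeps the rough 700 MHz guess. The function names are invented for the illustration.

```c
#include <stdint.h>

/* Nanoseconds spent per loop iteration, given that `iterations`
 * passes take `delay_ns` in total. */
double ns_per_iteration(double delay_ns, uint32_t iterations)
{
    return delay_ns / (double)iterations;
}

/* Convert a duration in nanoseconds to clock cycles at `clock_hz`. */
double ns_to_cycles(double ns, double clock_hz)
{
    return ns * clock_hz / 1e9;
}
```

With 1e6 ns and 0x61a (1562) iterations this gives roughly 640 ns per iteration, which is about 450 cycles at 700 MHz, or some 150 cycles for each of the three instructions. A figure that large suggests the time is going into memory accesses rather than into executing the instructions themselves.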

rst
Posts: 410
Joined: Sat Apr 20, 2013 6:42 pm
Location: Germany

Re: RPi 2b (Cortex A7) instruction timing?

Sun Aug 23, 2015 7:34 pm

Did you enable the instruction cache? It is not enabled by default. So any instruction fetch goes to the SDRAM.

jbrooks
Posts: 1
Joined: Mon Aug 24, 2015 7:39 pm

Re: RPi 2b (Cortex A7) instruction timing?

Mon Aug 24, 2015 8:05 pm

I've been doing quite a bit of assembly coding, profiling, and optimization on the Raspberry Pi 2 lately. Here's my take on how the code is likely to be scheduled by the A7 CPU:

debug_wait1: @ ms count in r8
@ cmp+branch usually take 1 cycle combined, though branch mispredicts will stall less if cmp is earlier
cmp r8, #0
bxeq lr

@ mov immed is a lower insn and can be paired with an upper ldr/str, cmp, bitfield, etc. See gcc src arm.c, cortexa7_younger().
mov r10, #0

loop4$:
@ single loads/stores are an upper insn and can be paired with a lower branch, immed insn, extend. See gcc src arm.c, cortexa7_older_only().
ldr r9, =0x0000061a

loop5$:
@ nops take one cycle each and don't pair with any other insn
nop {0}

@ subs+branch will take 1 cycle, though again branch mispredicts will be shorter if subs is moved earlier (ie, before nop)
subs r9, r9, #1
bne loop5$

@ lower insn
add r10, r10, #1

@ cmp & branch take 1 cycle
cmp r10, r8
bne loop4$

@ branch is lower
bx lr

So by my count, issuing all these insns would take 8 cycles, with the inner loop taking 2 cycles per iteration. This assumes no stalls.
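
To turn a cycles-per-iteration estimate like this into a loop constant, something along these lines could be used. It is only a sketch: the function name is made up, and it assumes a fixed core clock and no stalls, which real code will not achieve.

```c
#include <stdint.h>

/* Number of inner-loop iterations needed to burn `delay_us`
 * microseconds, assuming every iteration costs exactly
 * `cycles_per_iteration` cycles at a fixed `clock_hz`.
 * Returns 0 if cycles_per_iteration is 0. */
uint32_t iterations_for_delay(uint32_t clock_hz,
                              uint32_t cycles_per_iteration,
                              uint32_t delay_us)
{
    if (cycles_per_iteration == 0) {
        return 0;
    }
    /* 64-bit intermediate so clock_hz * delay_us cannot overflow */
    uint64_t cycles = (uint64_t)clock_hz * delay_us / 1000000u;
    return (uint32_t)(cycles / cycles_per_iteration);
}
```

For example, at 2 cycles per iteration a 1 ms delay needs 450000 iterations at 900 MHz and 350000 at 700 MHz. Any stall (cache miss, mispredict) makes the real delay longer than the estimate.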

In my experience, the major causes of stalls on the Cortex A7 are cache miss stalls, pipeline reg dependency stalls, and branch mispredict stalls.

Oh, also insn issue seems much better (faster) with Thumb code than with ARM insns.

Going somewhat off-topic, you may want to consider a HW timer instead of a spin loop if you want to wait for a specific amount of time. Modern CPUs can vary their clock speed due to temperature, power saving, etc. Also, HW timers use a lot less energy/battery than spin loops.

Hope that helps.
-JB

jdb
Raspberry Pi Engineer & Forum Moderator
Posts: 2115
Joined: Thu Jul 11, 2013 2:37 pm

Re: RPi 2b (Cortex A7) instruction timing?

Mon Aug 24, 2015 9:31 pm

Yes, if your timing needs to be precise and of short duration (sub-millisecond) then you should use the dedicated timing hardware on-chip. There are two options. The system timer counter (within the GPU) is a monotonically incrementing 64-bit counter (2 x 32-bit registers) that ticks once per microsecond. Within the A7 cluster is the ARM architected timer, driven from either the APB clock (which varies when overclocking) or the oscillator clock. Use this if you need sub-microsecond precision, as you can drive it at 19.2 MHz and it's much closer in terms of bus cycles.

https://www.raspberrypi.org/documentati ... rev3.4.pdf
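
Because the system timer counter is split across two 32-bit registers, the low word can roll over between the two reads. A common way around that is to read high, low, then high again and retry if the high word changed. The C sketch below shows the idea. The base address and offsets are assumptions taken from the BCM2835 peripherals document combined with the Pi 2 peripheral base of 0x3F000000, and the function name is invented.

```c
#include <stdint.h>

/* Assumed system timer layout on the Pi 2 (check against the
 * peripheral documentation before relying on it). */
#define SYSTIMER_BASE        0x3F003000u
#define SYSTIMER_CLO_OFFSET  0x04u   /* counter, low 32 bits  */
#define SYSTIMER_CHI_OFFSET  0x08u   /* counter, high 32 bits */

/* Read a 64-bit free-running counter exposed as two 32-bit registers.
 * If the high word changes while the low word is being read, the low
 * word has wrapped in between, so the read is repeated. */
uint64_t read_counter64(const volatile uint32_t *lo_reg,
                        const volatile uint32_t *hi_reg)
{
    uint32_t hi;
    uint32_t lo;

    do {
        hi = *hi_reg;
        lo = *lo_reg;
    } while (hi != *hi_reg);

    return ((uint64_t)hi << 32) | lo;
}
```

On the Pi itself the two pointers would be SYSTIMER_BASE plus the CLO and CHI offsets, cast to volatile uint32_t pointers. For delays shorter than about 71 minutes the low word alone is enough.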

turboscrew
Posts: 174
Joined: Sat Jan 18, 2014 1:50 pm
Location: Nokia (town), Finland

Re: RPi 2b (Cortex A7) instruction timing?

Tue Aug 25, 2015 7:45 am

Thanks, jbrooks and jdb.
I guess I'd better go with a timer - just in case.
As such, at the moment, I need timings on the order of a second, say from half a second to 10 seconds.
The point is to use LED blinks for debugging: different sequences mean different things.

But I think I'll need sub-ms timings in the near future.
I guess it's also time to enable caches. :lol:

turboscrew
Posts: 174
Joined: Sat Jan 18, 2014 1:50 pm
Location: Nokia (town), Finland

Re: RPi 2b (Cortex A7) instruction timing?

Tue Aug 25, 2015 10:29 am

rst wrote: Did you enable the instruction cache? It is not enabled by default. So any instruction fetch goes to the SDRAM.
Good point. Any idea how long an SDRAM access takes? Closer to 10 cycles?

rst
Posts: 410
Joined: Sat Apr 20, 2013 6:42 pm
Location: Germany

Re: RPi 2b (Cortex A7) instruction timing?

Tue Aug 25, 2015 12:04 pm

turboscrew wrote: Any idea how long an SDRAM access takes? Closer to 10 cycles?
I guess more, but I do not know.

BTW if you wonder about Pi 2 bare metal performance you should also read this article.

turboscrew
Posts: 174
Joined: Sat Jan 18, 2014 1:50 pm
Location: Nokia (town), Finland

Re: RPi 2b (Cortex A7) instruction timing?

Tue Aug 25, 2015 12:59 pm

Thanks, rst. I thought that the other cores were sleeping.

dwelch67
Posts: 955
Joined: Sat May 26, 2012 5:32 pm

Re: RPi 2b (Cortex A7) instruction timing?

Wed Aug 26, 2015 7:47 pm

I thought I had a thread both here and on Stack Overflow about trying to get to 700 MHz on the RPi 1; there should be no difference on the RPi 2. The trick, as always, is to keep the pipe as full as possible, which means both feeding it fast enough and keeping it from flushing. To prove you did it, you really should run a lot of instructions relative to the timer resolution.

Multiple cores don't mean much here: each is its own separate CPU, not one CPU with four execution units or anything like that.

The SDRAM is ... DRAM ... which is incredibly slow. DRAM technology is speed limited at something like 133 MHz, even though they put faster and faster front ends on the controller (2133 MHz, for example). That is why L1, L2, etc. caching are terms we see mentioned so often: they solve the cheap (slow) memory problem. Another problem is that we don't have visibility into how the RAM and peripherals are shared inside the chip, so we can't know for sure, but we can assume that the GPU wins from a performance perspective. So what is the true DRAM speed for the ARM? We won't know exactly, so you just have to do some timing experiments to get a rough estimate (make sure you get the cache out of the measurements). You will end up finding that it is so incredibly slow that ARM performance hardly matters: you can't keep the pipe fed if you are relying on SDRAM speeds. The only way to keep the pipe fed is to get the code under test into L1 cache and stay within code in the L1 cache, which can only just feed the pipe, and then you still need to write the code so that the pipe doesn't need to flush.

David

turboscrew
Posts: 174
Joined: Sat Jan 18, 2014 1:50 pm
Location: Nokia (town), Finland

Re: RPi 2b (Cortex A7) instruction timing?

Wed Aug 26, 2015 9:29 pm

The problem is not the RPi 2B's maximum performance, but figuring out a rough estimate of average performance in different configurations. One can't even get a full understanding, because so much of the necessary info requires an NDA.

My priority is rough delay calculations after the "Raspbian boot files" have run. Like just getting a delay loop for, say, 1 ms or 10 ms that can be used as a basic delay for longer delay loops. For shorter and more accurate timings there is HW, but many simple things, like blinking an LED, don't need <1% accuracy or sub-millisecond timing. Also, the timers might be needed for something else that requires much more accurate or microsecond-class timing.

Also, it looks like a project of its own to figure out the state of the RPi 2B after the "Raspbian boot files". There is no documentation about what the boot does before 'kernel7.img' starts running. It's kind of rough if one needs to go through all kinds of discussions and boot loader sources just to write a simple "blinky", especially when RPis are not the most straightforward platforms.

JacobL
Posts: 76
Joined: Sun Apr 15, 2012 2:23 pm

Re: RPi 2b (Cortex A7) instruction timing?

Sat Nov 07, 2015 3:15 pm

turboscrew wrote: Also, it looks like a project of its own to figure out the state of the RPi 2B after the "Raspbian boot files". There is no documentation about what the boot does before 'kernel7.img' starts running.
There are a few things documented about the state at boot, but they come from the requirements to the bootloader from the Linux kernel, since the bootloader is designed to load the Linux kernel in this case: https://www.kernel.org/doc/Documentation/arm/Booting

turboscrew
Posts: 174
Joined: Sat Jan 18, 2014 1:50 pm
Location: Nokia (town), Finland

Re: RPi 2b (Cortex A7) instruction timing?

Sun Nov 08, 2015 7:46 pm

Thanks. The timing problem has passed (I read the system timer), but it's an interesting document otherwise.
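
For anyone landing here later, a busy-wait on a free-running microsecond counter can be sketched in C roughly as follows. This is not the exact code used. It assumes a 32-bit counter that increments once per microsecond (such as the low word of the system timer), and the function names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* True once at least `interval_us` ticks have passed since `start`.
 * Comparing the unsigned difference, not the raw values, keeps the
 * test correct when the 32-bit counter wraps around. */
bool interval_elapsed(uint32_t start, uint32_t now, uint32_t interval_us)
{
    return (uint32_t)(now - start) >= interval_us;
}

/* Spin until `interval_us` microseconds have passed on the counter
 * that `counter` points to. */
void delay_us(const volatile uint32_t *counter, uint32_t interval_us)
{
    uint32_t start = *counter;

    while (!interval_elapsed(start, *counter, interval_us)) {
        /* busy-wait */
    }
}
```

A half-second LED blink would then be delay_us(counter, 500000u) between toggles. Unlike a counted loop, the result does not depend on cache state or on the core clock.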
