In a processor as complex as the A7, the answer is invariably "it's complicated".
It has a limited superscalar pipeline (dual issue, in-order execution) which I believe is 8 stages deep. If you order your instructions and memory accesses correctly and align your data, you can reach a peak execution rate of 2 instructions per clock.
Let's say you have a tightly written assembly loop, which performs some operation 1024 times and then exits. It reads some data from memory, performs some logic/arithmetic on it and writes the data out to a different area of memory. Both the L1 and L2 caches are switched on.
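To make that concrete, here's a minimal C sketch of the kind of loop being described: read a word, do some arithmetic on it, write the result to a separate buffer, 1024 times. The function name and the particular transform (`* 3 + 1`) are my own invention, just a stand-in for "some logic/arithmetic".

```c
#include <stdint.h>

#define N 1024

/* Hypothetical stand-in for the loop described above: read each
 * input word, perform some logic/arithmetic on it, and write the
 * result out to a different area of memory. */
void transform(const uint32_t *src, uint32_t *dst)
{
    for (int i = 0; i < N; i++)
        dst[i] = src[i] * 3 + 1;   /* arbitrary example arithmetic */
}
```

A compiler will typically turn this into a tight load/ALU/store loop much like the hand-written assembly the answer has in mind.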
The first time round the loop, it will take substantially longer to get from the start of the loop to the end because of all the cache activity that has to be performed prior to executing the tightly-written loop. The instruction cache needs to go fetch lines containing your loop, the branch predictor needs to get itself up-to-speed with what instructions it should speculatively fetch and retain in the L1 Icache. Data needs to be brought in, perhaps from as far away as SDRAM (hundreds of cycles).
Assuming that first time round the loop the various caches have done their jobs and you now have single-cycle access to instructions and data, the next N iterations will go a lot faster. If, through sensible use of preload hints and instruction ordering, your assembly suffers no pipeline stalls waiting on data or register results and makes full use of both execution pipelines, you can achieve 2 IPC. Writes are lazy - they head off to the load/store unit and the store hardware keeps track of any dependency hazards. If you don't subsequently touch the data you've written, you can keep that store unit maximally busy.
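In C, preload hints are usually expressed with the GCC/Clang `__builtin_prefetch` builtin, which on ARM compiles down to a PLD instruction. A sketch, with the prefetch distance (`PF_DIST`) being an assumed tuning parameter you'd pick per core and per access pattern:

```c
#include <stdint.h>

#define N 1024
#define PF_DIST 8  /* prefetch distance in elements; a guess, tune per core */

uint64_t sum_with_prefetch(const uint32_t *src)
{
    uint64_t acc = 0;
    for (int i = 0; i < N; i++) {
        if (i + PF_DIST < N)
            __builtin_prefetch(&src[i + PF_DIST]);  /* hint: pull the line in early */
        acc += src[i];
    }
    return acc;
}
```

The prefetch is only a hint - it changes timing, never results - so you can drop it in and out while measuring without affecting correctness.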
A useful thing present in ARM processors is the cycle counter (part of the Performance Monitoring Unit). It counts core clock cycles between any two points you instrument.
http://infocenter.arm.com/help/index.js ... FDEEJ.html