findx
Posts: 29
Joined: Mon Jul 29, 2013 7:52 pm

FIQ Size

Fri Feb 07, 2014 3:05 am

Hi,

This isn't exactly bare metal because Linux is floating around in the background, but I figured the bare metal forum is the best place to ask and actually get an answer.

Question: When a regular RPi Linux kernel is running, how large can the FIQ code at the end of the vector table be?

I've read at [1] that by default the RPi kernel is loaded at 0x8000, so does that mean it would be safe to have an FIQ code size of 0x8000 - 0x1C, where 0x1C is the FIQ entry at the end of the vector table that the processor branches to on an FIQ interrupt? It would seem that the historically "safe" size for other ARM chips is 0x1E4 based on [2] and a few other discussions online, but it is unclear to me how that relates to a situation where the kernel is at 0x8000.

Thanks...

-Adam

[1] https://github.com/dwelch67/raspberrypi ... /baremetal
[2] http://www.spinics.net/lists/arm-kernel/msg02474.html

rpdom
Posts: 18157
Joined: Sun May 06, 2012 5:17 am
Location: Chelmsford, Essex, UK

Re: FIQ Size

Fri Feb 07, 2014 7:35 am

I hope you are aware that the USB driver implementation in the Raspi kernel uses the FIQ.

findx
Posts: 29
Joined: Mon Jul 29, 2013 7:52 pm

Re: FIQ Size

Fri Feb 07, 2014 8:51 am

Yes, I'm running with the USB FIQ fix disabled. My FIQ code works fine and USB also works. I just don't know how much space is safely available after 0x1C for FIQ code without branching elsewhere.

dwelch67
Posts: 1002
Joined: Sat May 26, 2012 5:32 pm

Re: FIQ Size

Fri Feb 07, 2014 4:21 pm

There are no limits on the size of any handler as far as the ARM architecture goes. The vector table is just a short list of single instructions: you use a "b something" or a "ldr pc, something", the latter allowing you to branch anywhere in the address space of the processor (yes, that takes the table from roughly 15 words to 31 worst case; assume 32 to be clean and pretty). Your handler, although a huge one may not be wise, can be as large as you want (as large as you have memory/ROM for), and the FIQ stack likewise can be as large as you want (have space for). This is true for any of the exception handlers.
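To make that concrete, here is a minimal sketch of such a table in GNU assembler (handler names are hypothetical and assumed defined elsewhere; this is not the Linux table):

Code: Select all

_vectors:
	ldr	pc, reset_addr      @ 0x00 reset
	ldr	pc, undef_addr      @ 0x04 undefined instruction
	ldr	pc, svc_addr        @ 0x08 software interrupt
	ldr	pc, pabt_addr       @ 0x0C prefetch abort
	ldr	pc, dabt_addr       @ 0x10 data abort
	nop                         @ 0x14 unused
	ldr	pc, irq_addr        @ 0x18 IRQ
	b	fiq_handler         @ 0x1C FIQ: a branch, or just run straight on into code here

reset_addr:	.word	reset_handler
undef_addr:	.word	undef_handler
svc_addr:	.word	svc_handler
pabt_addr:	.word	pabt_handler
dabt_addr:	.word	dabt_handler
irq_addr:	.word	irq_handler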

If Linux imposes some limitation (I can't see how), then that is another story.

David

findx
Posts: 29
Joined: Mon Jul 29, 2013 7:52 pm

Re: FIQ Size

Fri Feb 07, 2014 6:35 pm

For an IRQ or other exception you normally branch wherever you want, subject to the relative branch range constraint; the FIQ, however, is specifically placed at the end of the table so you can either branch or install a small piece of code that runs in place without branching elsewhere. The latter is what I have been doing, and it works exactly as expected with Linux in the background. Linux has built-in facilities for installing the FIQ code at the end of the table, or for installing a branch instruction based on the contents of R8_FIQ to anywhere in the address space (not a relative branch).
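(For reference, the R8_FIQ variant boils down to a single instruction at the vector; a sketch, with the actual install left to the kernel's FIQ helpers:)

Code: Select all

@ r8_fiq is preloaded with the handler address before FIQs are enabled,
@ so the vector entry at 0x1C is just:
	mov	pc, r8      @ in FIQ mode this reads the banked r8_fiq:
	                    @ an absolute branch anywhere in the address space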

I had to add some code, and now my FIQ handler is 128 instructions, which is bigger than the max size of 121 instructions quoted in other spots around the web (e.g. link 2 in the original post). I would like to think Linux doesn't have anything in the space between 0x1C and 0x8000 on the RPi, but it would be nice to know whether there is some other usage I'd overwrite if I step past the historically referenced size limit of 0x1E4 bytes. I'm not using any space for an FIQ stack, btw; the banked FIQ registers provide enough flexibility without one.

-Adam

P.S. Dwelch67, thank you so much for putting together your bare metal guide. It was really helpful getting up to speed with ARM assembly and the RPi architecture after having mostly used the dsPIC in the past.

chrisryall
Posts: 155
Joined: Wed Nov 27, 2013 11:45 am
Location: Wirral UK

Re: FIQ Size

Fri Feb 07, 2014 6:46 pm

AFAIR the last (FIQ) vector was placed there for very small, fast handlers that were time critical. At 128 instructions, the extra few clock cycles for a standard jump vector to a handler elsewhere in memory … might not make much difference?

colinh
Posts: 95
Joined: Tue Dec 03, 2013 11:59 pm
Location: Munich

Re: FIQ Size

Fri Feb 07, 2014 7:56 pm

Remember that at some stage (after kernel.img gets loaded) ATAGS get written at 0x100 (unless you disable this in config.txt (edit: not cmdline.txt, as I first wrote), but you don't do that for Linux, I guess). I suppose that's irrelevant though.

But kernel.img isn't the kernel. It's a self-extracting zImage which contains the gzipped kernel proper. I think that's what gets self-extracted, overwriting itself starting at 0x0000 (which, among other things, sets up the vector table).

Sooo, I think you need to get at the compressed kernel inside, gunzip it, and then look at it with objdump.

findx
Posts: 29
Joined: Mon Jul 29, 2013 7:52 pm

Re: FIQ Size

Fri Feb 07, 2014 8:45 pm

@chrisryall My code is time critical, but it can likely tolerate an additional branch. In terms of documenting the system, however, I think it would be good to know whether a branch is even needed in this case. I'm also only just over the historically quoted size for FIQ code; I wish I had a straightforward* way to chop down my code size.

@colinh I'll look into that, but I suppose there's a greater than zero probability that the kernel might load additional stuff in the area at run time. Run time stuff wouldn't be captured in the raw image.

Maybe it's just best to ask on a Linux specific forum for exactly what the kernel does and assumes about FIQ code sizes? Still, I feel like it's going to be tough to find the right one that'd know enough about the low level details to give an answer. Do you guys have any suggestions on where to post?

* I'm dealing with an 18 bit parallel interface covering 8 ADC channels, and because the RPi GPIO pins aren't in order I have to spend a good portion of my code shuffling bits around. I've got each shuffle down to two instructions (an AND to mask the bit, ORR with LSL to shift it into a common result register), and I think that's the best that can be done. The rest of the code has to handle an encoder signal (with different encoder timing offsets that need to be applied to each of the ADC channels) and storing the ADC result in a data structure. I think 128 instructions is not bad for all this.
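(To make the shuffle concrete, a sketch with made-up bit positions, assuming the sampled pin levels sit in r10 and the result accumulates in r11; say GPIO bit 2 has to land in result bit 10:)

Code: Select all

	and	r9, r10, #(1 << 2)    @ mask out the source bit (bit 2)
	orr	r11, r11, r9, lsl #8  @ 10 - 2 = 8: shift it into bit 10 and merge
	@ bits that have to move down use lsr in place of lsl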

colinh
Posts: 95
Joined: Tue Dec 03, 2013 11:59 pm
Location: Munich

Re: FIQ Size

Fri Feb 07, 2014 10:43 pm

findx wrote:@colinh I'll look into that, but I suppose there's a greater than zero probability that the kernel might load additional stuff in the area at run time. Run time stuff wouldn't be captured in the raw image.
objdump of the actual image would clear that up. Getting the uncompressed image isn't as trivial as one might have assumed, AFAIK. I may install Linux on some card and see if I can compile the kernel on the RPi...
findx wrote:Maybe it's just best to ask on a Linux specific forum for exactly what the kernel does and assumes about FIQ code sizes? Still, I feel like it's going to be tough to find the right one that'd know enough about the low level details to give an answer. Do you guys have any suggestions on where to post?
I suppose elinux (embedded linux (e.g. on ARM)) and/or u-boot people would know this stuff.
findx wrote:* I'm dealing with an 18 bit parallel interface covering 8 ADC channels, and because the RPi GPIO pins aren't in order I have to spend a good portion of my code shuffling bits around. I've got each shuffle down to two instructions (an AND to mask the bit, ORR with LSL to shift it into a common result register), and I think that's the best that can be done. The rest of the code has to handle an encoder signal (with different encoder timing offsets that need to be applied to each of the ADC channels) and storing the ADC result in a data structure. I think 128 instructions is not bad for all this.
128 instructions!? :shock: When I were a lad, we didn't have 128 bytes! Mainframes had 128 bytes!

You could post that and see if any of us can optimise it a bit. :mrgreen:

findx
Posts: 29
Joined: Mon Jul 29, 2013 7:52 pm

Re: FIQ Size

Sat Feb 08, 2014 3:23 am

Well I've answered my own question with some experimenting and poking around to find the right section of kernel code to read.

Experiment:
Having an FIQ code size larger than 0x200-0x1C = 0x1E4 kills my system.

Kernel code:
The kernel apparently installs the vector table at the high location 0xFFFF0000, not 0x0. In a number of locations in arch/arm/kernel/entry-armv.S and arch/arm/kernel/traps.c, a maximum vector table size of 0x200 is hard-coded. The space after 0x200 is filled with something they call vector stubs:

Code: Select all

/*
* Vector stubs.
*
* This code is copied to 0xffff0200 so we can use branches in the
* vectors, rather than ldr's. Note that this code must not
* exceed 0x300 bytes.
*
* Common stub entry macro:
* Enter in IRQ mode, spsr = SVC/USR CPSR, lr = SVC/USR PC
*
* SP points to a minimal amount of processor-private memory, the address
* of which is copied into r0 for the mode specific abort handler.
*/
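Putting the numbers together, this is my reading of the resulting layout (treat it as a sketch):

Code: Select all

@ layout of the high-vectors page per entry-armv.S:
@   0xFFFF0000 - 0xFFFF001B   the eight vector table entries
@   0xFFFF001C - 0xFFFF01FF   room for FIQ code running on from the vector
@                             (0x200 - 0x1C = 0x1E4 bytes, the historic limit)
@   0xFFFF0200 - 0xFFFF04FF   the vector stubs (must not exceed 0x300 bytes)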

As for the good ol' days of small RAM and slow processors, I have a lot of respect for what people were able to pull off. If you guys haven't come across it before, Mike Abrash's Graphics Programming Black Book [1] has a pretty awesome collection of all sorts of old tricks from the 8086 on up to the Pentium. Regarding my code, I don't think it'd make much sense to post it. It's definitely readable, but I'd spend more time explaining what it has to do than it would take to just implement a branch.

[1] http://www.amazon.com/Michael-Abrashs-G ... 1576101746

dwelch67
Posts: 1002
Joined: Sat May 26, 2012 5:32 pm

Re: FIQ Size

Sat Feb 08, 2014 3:53 am

findx wrote:For an IRQ or other exception you normally branch wherever you want, subject to the relative branch range constraint; the FIQ, however, is specifically placed at the end of the table so you can either branch or install a small piece of code that runs in place without branching elsewhere. The latter is what I have been doing, and it works exactly as expected with Linux in the background. Linux has built-in facilities for installing the FIQ code at the end of the table, or for installing a branch instruction based on the contents of R8_FIQ to anywhere in the address space (not a relative branch).

I had to add some code, and now my FIQ handler is 128 instructions, which is bigger than the max size of 121 instructions quoted in other spots around the web (e.g. link 2 in the original post). I would like to think Linux doesn't have anything in the space between 0x1C and 0x8000 on the RPi, but it would be nice to know whether there is some other usage I'd overwrite if I step past the historically referenced size limit of 0x1E4 bytes. I'm not using any space for an FIQ stack, btw; the banked FIQ registers provide enough flexibility without one.

-Adam

P.S. Dwelch67, thank you so much for putting together your bare metal guide. It was really helpful getting up to speed with ARM assembly and the RPi architecture after having mostly used the dsPIC in the past.
Thanks for the comment.

The architecture does not prevent you from having a big handler (starting with a branch or executing linearly from the exception table); as mentioned, that may not be wise. So if there is a limit being hit (especially the small number of bytes mentioned), it is likely some Linux-specific thing and not a bare metal thing. The exception table is simply a start address for each exception: load the pc here, set this mode, save the psr as defined, and the processor is set in motion from that address. Reset, IRQ, etc. are all the same; you can run as far and wide as you have code space for (a limitation of your RAM/ROM, not the architecture).

It sounds like it is either an issue of how you are inserting or merging or building Linux, or how the MMU is set up, etc. Are you crashing into a data or prefetch abort, or just crashing? Do you have a JTAG debugger, and is your I/O usage such that you can spare those pins? If so you can possibly use JTAG to debug where it is crashing, or at least what it crashed into.

David

findx
Posts: 29
Joined: Mon Jul 29, 2013 7:52 pm

Re: FIQ Size

Sat Feb 08, 2014 4:47 am

Based on the kernel code described in my previous post, it's crashing because I'm overwriting the "vector stubs." I don't think I need to do any more work to diagnose the issue since it's pretty clear to me where the 0x200 magic number comes from now. In my original post I had mistakenly thought there was a vast sea (pond) of bytes between 0x1C and 0x8000 that wasn't being used on the RPi.

My plan right now is to just branch outside the vector table and avoid this problem altogether. I really don't want to start fiddling with the hard-coded 0x200 constants in the kernel code, and optimizing my code size further would be more time consuming than just branching outside the vector table. Based on the FIQ / GPIO / ADC latencies I've measured, I can easily tolerate a branch.

I appreciate all your help and suggestions. At some point I'll get my old FT2232D board going for JTAG, but at the moment I don't have an unknown bug to track down.

-Adam

tufty
Posts: 1456
Joined: Sun Sep 11, 2011 2:32 pm

Re: FIQ Size

Sat Feb 08, 2014 5:15 pm

findx wrote: Mike Abrash's Graphics Programming Black Book [1] has a pretty awesome collection of all sorts of old tricks from the 8086 on up to the Pentium.

[1] http://www.amazon.com/Michael-Abrashs-G ... 1576101746
Oddly enough, this cropped up recently on another forum I frequent: it's been converted to Markdown format:

MD Source -> https://github.com/jagregory/abrash-black-book
HTML from MD -> http://www.jagregory.com/abrash-black-book/

chrisryall
Posts: 155
Joined: Wed Nov 27, 2013 11:45 am
Location: Wirral UK

Re: FIQ Size

Sat Feb 08, 2014 5:37 pm

If you don't have a space constraint you can sometimes make code bigger but faster, using ARM conditional (non-)execution to avoid any further looping. It'll not be pretty, no way Pythonic, but thanks to the pipeline it goes like the proverbial stuff off a shovel. The devil is in the detail here and it all depends on what you are up to, but branch instructions waste several cycles each, and if you can keep that to only one … ?

My apologies if you know all this already, but the best thing I ever did way back in the 70s was an awkward-to-calculate 256-byte lookup table (4 MB for the whole University!) that did EBCDIC/ASCII conversion, a bizarre but mandatory bit-order reversal, and parity stuff all in one go. Be greedy, even 256 bytes ain't "a lot" nowadays. Can you pre-calculate a series of your bit shuffle results (there aren't that many pins?) and replace a run of those with a single indexed load, maybe repeatedly? The maths might end up really weird (no longer algorithmic in any real way), but good luck.
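(A sketch of the idea, with hypothetical names: one byte lane of the sampled GPIO word in, its pre-shuffled contribution out.)

Code: Select all

@ trade a run of and/orr pairs for one precomputed lookup per byte lane;
@ shuffle_table is a 256-entry byte table built offline (hypothetical name)
	ldr	r9, =shuffle_table
	and	r7, r10, #0xFF        @ isolate one byte of the sampled pins
	ldrb	r7, [r9, r7]          @ fetch its pre-shuffled result
	orr	r11, r11, r7          @ merge into the accumulating result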

findx
Posts: 29
Joined: Mon Jul 29, 2013 7:52 pm

Re: FIQ Size

Sat Feb 08, 2014 11:44 pm

Well, I was able to get my code to fit by cheating a little. The low-hanging fruit was to optimize this 3-instruction wait macro, which I used in 5 different places to space out the correct parallel bus bit timing:

Code: Select all

// Previous wait code... 3 instructions and no need to mess with R14
.macro WAIT loops, scratchReg
	ldr		\scratchReg, =\loops
1:
	subs		\scratchReg, \scratchReg, #1
	bne		1b
.endm
3 * 5 = 15 instructions, which is 12% of the 121 instructions available. I had already looked into bl branching to a dedicated wait routine, but that required spending instructions saving / restoring R14, and in the end I was still over the FIQ size limit.
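(The rejected bl variant, for comparison; a sketch, with wait_loops a hypothetical label. bl overwrites the banked lr_fiq that the handler's return needs, hence the save and restore:)

Code: Select all

	mov	r12, lr        @ preserve lr_fiq; bl is about to clobber it
	bl	wait_loops     @ shared delay routine (hypothetical)
	mov	lr, r12        @ restore before the eventual return from FIQ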

From coding the whole thing up, I've noticed that there's a built-in delay reading the GPIO levels... so I tried it out as an alternate wait routine:

Code: Select all

// New wait code, only one instruction
.macro WAIT scratchReg
	GET_GPIO_PINS	gpioPtr=GPIO_PTR_R10, pinLevelsRet=\scratchReg
.endm
Using this as a wait routine, when I go to strobe my /RD /CS line (twice for each of the 8 ADC channels; yellow signal in the attached), I get a delay of ~90 ns between bit toggles, which comfortably covers the 30 ns minimum I was shooting for. Without the delay, the interval between pin toggles is ~12 ns, which is too fast. Occasionally the delay is longer than the nominal ~90 ns, which I was able to capture in the attached, but I have not seen it shorter than ~90 ns.

While it fits (I'm using 118 of the 121 instructions available without substantially changing the code I already have), I'm not 100% comfortable with this as a solution given the variability and that it's basically a hack. I plan to revisit it later on after I finish the rest of my driver implementation.

Anyways, I figured I'd post this in case someone else is curious about GPIO timing.

-Adam

The other macros, for reference, are:

Code: Select all

.macro GET_VALUE basePtr, offset, destReg, predication
	ldr\predication	\destReg, [\basePtr, \offset]
.endm

.macro SET_VALUE basePtr, offset, srcReg, predication
	str\predication	\srcReg, [\basePtr, \offset]
.endm

.macro GET_GPIO_PINS gpioPtr, pinLevelsRet  // Get all GPIO bank 0 pins
	GET_VALUE	basePtr=\gpioPtr, offset=#GPLEV0, destReg=\pinLevelsRet
.endm

.macro SET_GPIO_PIN gpioPtr, pinNum, pinState, scratchReg, oneReg  // pinState = 1 or 0, high or low, respectively on GPIO bank 0
	lsl	\scratchReg, \oneReg, #\pinNum
	.if \pinState == 1
		SET_VALUE basePtr=\gpioPtr, offset=#GPSET0, srcReg=\scratchReg
	.elseif \pinState == 0
		SET_VALUE basePtr=\gpioPtr, offset=#GPCLR0, srcReg=\scratchReg
	.else
		.error "Error in SET_GPIO_PIN macro input..."
	.endif
.endm
[Attachment: notRD notCS line.png, a scope capture of the /RD /CS strobe]

dwelch67
Posts: 1002
Joined: Sat May 26, 2012 5:32 pm

Re: FIQ Size

Sun Feb 09, 2014 2:03 am

tufty wrote:
findx wrote: Mike Abrash's Graphics Programming Black Book [1] has a pretty awesome collection of all sorts of old tricks from the 8086 on up to the Pentium.

[1] http://www.amazon.com/Michael-Abrashs-G ... 1576101746
Oddly enough, this cropped up recently on another forum I frequent: it's been converted to Markdown format:

MD Source -> https://github.com/jagregory/abrash-black-book
HTML from MD -> http://www.jagregory.com/abrash-black-book/
Nice! Does that include the Zen of Assembly Language as well, which I believe was supposedly on the CD-ROM? (I lost the CD-ROM that came with the Black Book.) I got copies of the CD-ROM version and have the actual book from when it came out, and there were gaps in the Black Book version of the Zen of Asm. You can get the Zen of Asm used for a good price and it is well worth it; as then and now, don't think about the 8086, but think about the approach to performance and cycle eaters. That book (Zen of Asm) shaped my life/career, and I use what I formed from reading it every day, finding bugs in processor designs by beating the hell out of the design and/or knowing where the design's cycle eaters are, as well as code issues.

I wonder: if Michael Abrash authorized a basically open Black Book, then maybe he has or will green-light an open-source Zen of Asm ebook.

David

dwelch67
Posts: 1002
Joined: Sat May 26, 2012 5:32 pm

Re: FIQ Size

Sun Feb 09, 2014 2:17 am

chrisryall wrote:If you don't have a space constraint you can sometimes make code bigger but faster, using ARM conditional (non-)execution to avoid any further looping. It'll not be pretty, no way Pythonic, but thanks to the pipeline it goes like the proverbial stuff off a shovel. The devil is in the detail here and it all depends on what you are up to, but branch instructions waste several cycles each, and if you can keep that to only one … ?
So, based on a Stack Overflow question, I finally started looking at the 64-bit ARM architecture today. They have gotten rid of conditional execution because they say the branch prediction is good enough not to need it. Now, the raspberry pi is an ARM11: the pipe is deep and I assume some prediction is there, maybe not as good as its followers ARMv7 or ARMv8, but possibly not as bad as we think. Perhaps your timing results show otherwise; it would be fun to do an experiment. Given the shared DRAM and whatever magic is going on behind the curtain with the memory controller, L3 caching and the like, this is probably the wrong platform to be trying such a thing. You need something with a dedicated SRAM, like the Hercules LaunchPad perhaps; I have others I could try it on as well. Branching from the vector table to a larger code space to avoid this 0x200 size problem, being unconditional, should be easy to predict and perhaps not as much of a clock penalty. Hmmm... sounds like a fun research project.

David

tufty
Posts: 1456
Joined: Sun Sep 11, 2011 2:32 pm

Re: FIQ Size

Sun Feb 09, 2014 5:38 pm

dwelch67 wrote:Branching from the vector table to a larger code space to avoid this 0x200 size problem, being unconditional, should be easy to predict and perhaps not as much of a clock penalty. Hmmm....sounds like a fun research project.
I'd suggest manual cache hinting with PLI (and maybe PLD) as well. Kick off a cache hint, do some processing in your 0x200 bytes to let the cache catch up, then branch unconditionally.
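(Roughly this shape; a sketch only, and note PLI is ARMv7 and later, so on the ARM1176 you would be limited to PLD, which only hints the data side:)

Code: Select all

	pld	[r8]           @ kick off the hint; r8_fiq preloaded with the handler address
	@ ... a few instructions of useful work inside the 0x200 region ...
	mov	pc, r8         @ then take the unconditional branch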

dwelch67
Posts: 1002
Joined: Sat May 26, 2012 5:32 pm

Re: FIQ Size

Mon Feb 10, 2014 10:52 pm

Debating how to approach this; I've not really even gotten into it and I'm already having fun. Starting with the code below, where it happened to land in the binary. This is not the raspberry pi but another ARM11, an MPCore in fact, with on-chip fast SRAM. Caches OFF.

0038FFED is the timer ticks from the ldr r1,[r0] to the ldr r2,[r0], as shown below. The times after that are when I add one nop, then another, then another, moving this code through memory. Yes, the cache is OFF! Understand a couple of things. One, we don't know how the processor works as far as that conditional branch at the end: how far ahead can it determine whether it is going to branch? Two, I happen to know (or believe) that this processor is fetching 8 instructions at a time (aligned). So if the top of the loop is placed such that you have to fetch 8 instructions to get only a few of them, that is wasted clock cycles. Likewise at the end, does the fetch boundary affect the branch prediction? More experiments...

These timer values are exactly repeatable with the same binary; I know this chip well and know there is nothing else going on to affect the bus. Likewise, I simply ran it over and over again for a number of the binaries and got the same result. With DRAM instead of SRAM you should expect some variation; with the raspberry pi, who knows. Probably not a hard experiment to try there...

Point being: just because you get a number for a benchmark doesn't mean that is the end of that test. Can you trust that number? As shown here you can't; you have to understand the one test case before moving on to a test case with a branch rather than conditional execution.

As far as the penalty for an additional FIQ branch sitting on top of an operating system, caches on, who knows where the code landed: you could easily be burning many extra clock cycles doing the branch, or not doing the branch, or depending on where you do the branch, etc., just by where your code lies. If you walk or even prefetch into a full cache line but then don't use the whole thing, then branch, landing poorly in the middle or end of another cache line and throwing part of that away... you get the picture. Micro-optimizing may be a wasted effort if you don't take the other things into consideration.

Code: Select all

d6008030 <branch_test>:
d6008030:	e92d00f0 	push	{r4, r5, r6, r7}
d6008034:	e59f0064 	ldr	r0, [pc, #100]	; d60080a0 <top+0x5c>
d6008038:	e3a04801 	mov	r4, #65536	; 0x10000
d600803c:	e5901000 	ldr	r1, [r0]
d6008040:	e1a05001 	mov	r5, r1

d6008044 <top>:
d6008044:	e2855001 	add	r5, r5, #1
d6008048:	e2855001 	add	r5, r5, #1
d600804c:	e2855001 	add	r5, r5, #1
d6008050:	e3140001 	tst	r4, #1
d6008054:	12855001 	addne	r5, r5, #1
d6008058:	e2855001 	add	r5, r5, #1
d600805c:	e2855001 	add	r5, r5, #1
d6008060:	e2855001 	add	r5, r5, #1
d6008064:	e2855001 	add	r5, r5, #1
d6008068:	e2855001 	add	r5, r5, #1
d600806c:	e2855001 	add	r5, r5, #1
d6008070:	e2855001 	add	r5, r5, #1
d6008074:	e2855001 	add	r5, r5, #1
d6008078:	e2855001 	add	r5, r5, #1
d600807c:	e2855001 	add	r5, r5, #1
d6008080:	e2855001 	add	r5, r5, #1
d6008084:	e2855001 	add	r5, r5, #1
d6008088:	e2544001 	subs	r4, r4, #1
d600808c:	1affffec 	bne	d6008044 <top>
d6008090:	e5902000 	ldr	r2, [r0]
d6008094:	e0410002 	sub	r0, r1, r2
d6008098:	e8bd00f0 	pop	{r4, r5, r6, r7}
d600809c:	e12fff1e 	bx	lr
d60080a0:	d6800604 	strle	r0, [r0], r4, lsl #12

0038FFED
00387FED
0037FFEE
00377FEA
0036FFEE
0036FFEF
00387FED
002D7FF8
0038FFED
More to come I hope.

David

dwelch67
Posts: 1002
Joined: Sat May 26, 2012 5:32 pm

Re: FIQ Size

Tue Feb 11, 2014 3:55 am

So, the whole test function; well, I chopped off the pop and return.

Code: Select all

d600804c <branch_test>:
d600804c:   e92d00f0    push    {r4, r5, r6, r7}
d6008050:   e59f006c    ldr r0, [pc, #108]  ; d60080c4 <one+0x40>
d6008054:   e3a04801    mov r4, #65536  ; 0x10000
d6008058:   e5901000    ldr r1, [r0]
d600805c:   e1a05001    mov r5, r1
d6008060 <top>:
d6008060:   e2855001    add r5, r5, #1
d6008064:   e2855001    add r5, r5, #1
d6008068:   e2855001    add r5, r5, #1
d600806c:   e3140001    tst r4, #1
d6008070:   e2855001    add r5, r5, #1
d6008074:   e2855001    add r5, r5, #1
d6008078:   e2855001    add r5, r5, #1
d600807c:   12855001    addne   r5, r5, #1
d6008080:   e2855001    add r5, r5, #1
d6008084:   e2855001    add r5, r5, #1
d6008088:   e2855001    add r5, r5, #1
d600808c:   e2855001    add r5, r5, #1
d6008090:   e2855001    add r5, r5, #1
d6008094:   e2855001    add r5, r5, #1
d6008098:   e2855001    add r5, r5, #1
d600809c:   e2544001    subs    r4, r4, #1
d60080a0:   1affffee    bne d6008060 <top>
d60080a4:   e1a00000    nop         ; (mov r0, r0)
d60080a8:   e1a00000    nop         ; (mov r0, r0)
d60080ac:   e1a00000    nop         ; (mov r0, r0)
d60080b0:   e1a00000    nop         ; (mov r0, r0)
d60080b4:   e5902000    ldr r2, [r0]
d60080b8:   e0410002    sub r0, r1, r2
top: is aligned on an 8-word boundary, and that did make a huge difference in performance. The end of the loop is where it is because I was playing around there and left it like that. The number of processor clocks, using the built-in timer:

002AFFFC
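(For anyone reproducing this, forcing that alignment is a single directive in GNU as; a sketch:)

Code: Select all

	.balign	32          @ align to 8 words (32 bytes), matching the assumed fetch width
top: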

Now change two of the instructions into two other instructions. I'm not interested in how many times I add to r5; I am interested in the number of instructions in the loop.

Code: Select all

d6008070:   e2855001    add r5, r5, #1
d6008074:   e2855001    add r5, r5, #1
d6008078:   e2855001    add r5, r5, #1
d600807c:   0a000000    beq d6008084 <one>
d6008080:   e2855001    add r5, r5, #1
d6008084 <one>:
d6008084:   e2855001    add r5, r5, #1
d6008088:   e2855001    add r5, r5, #1
d600808c:   e2855001    add r5, r5, #1
d6008090:   e2855001    add r5, r5, #1
0030FFFC

So that added 6 clocks per loop, and I believe this placement is the sweet spot, the fastest one. Perhaps because we fetch 0x60 through 0x7C in one shot, that whole line goes into the pipe, and the instructions preceding the beq give the pipe time to know that it is going to branch and where. Maybe not; maybe it is a case of the first prefetch after the beq already being fetched. We would need one more prefetch after pulling in 0x80 to 0x8C, plus the fetch for the destination (0x80 to 0x8C again), but that is well over 6 clocks, more than a dozen. So I don't think it fetches starting at 0x80 twice; it is perhaps a combination of things. I don't have more visibility into this test at this time.

So move one of the add r5 instructions from before the branch stuff to after it:

Code: Select all

d6008070:   e2855001    add r5, r5, #1
d6008074:   e2855001    add r5, r5, #1
d6008078:   0a000000    beq d6008080 <one>
d600807c:   e2855001    add r5, r5, #1
d6008080 <one>:
d6008080:   e2855001    add r5, r5, #1
d6008084:   e2855001    add r5, r5, #1
d6008088:   e2855001    add r5, r5, #1
d600808c:   e2855001    add r5, r5, #1
00313FFC

It adds another quarter of a clock per loop... interesting.

Shift the beq stuff another instruction back:

Code: Select all

d6008070:   e2855001    add r5, r5, #1
d6008074:   0a000000    beq d600807c <one>
d6008078:   e2855001    add r5, r5, #1
d600807c <one>:
d600807c:   e2855001    add r5, r5, #1
d6008080:   e2855001    add r5, r5, #1
d6008084:   e2855001    add r5, r5, #1
d6008088:   e2855001    add r5, r5, #1
00373FFC

Wow, another 6 clocks. I wonder if the words starting at 0x80 were fetched twice...

Okay, so that asked as many questions as it answered. With that little experiment we at best added 6 clocks by using a branch-if-equal rather than conditional execution. And you can probably guess that

Code: Select all

	beq	one
	add
	b	two
one:
	add
two:
is going to add a bunch more clocks than

Code: Select all

addeq
addne
So, going back and removing the conditional execution, doing the add every time rather than every other time:

Code: Select all

d6008070:   e2855001    add r5, r5, #1
d6008074:   e2855001    add r5, r5, #1
d6008078:   e2855001    add r5, r5, #1
d600807c:   e2855001    add r5, r5, #1
d6008080:   e2855001    add r5, r5, #1
d6008084:   e2855001    add r5, r5, #1
d6008088:   e2855001    add r5, r5, #1
d600808c:   e2855001    add r5, r5, #1
002AFFFC

Same number of clocks; no real surprise there.

But look at this: an unconditional branch to the next instruction:

Code: Select all

d6008070:   e2855001    add r5, r5, #1
d6008074:   e2855001    add r5, r5, #1
d6008078:   e2855001    add r5, r5, #1
d600807c:   e2855001    add r5, r5, #1
d6008080:   eaffffff    b   d6008084 <one>
d6008084 <one>:
d6008084:   e2855001    add r5, r5, #1
d6008088:   e2855001    add r5, r5, #1
d600808c:   e2855001    add r5, r5, #1
0036FFFC

Ouch, 11 more clocks per loop. It had just prefetched those instructions!

Back that branch up one instruction in the address space; maybe the branch prediction comes in and it knows what we are doing:

Code: Select all

d6008070:   e2855001    add r5, r5, #1
d6008074:   e2855001    add r5, r5, #1
d6008078:   e2855001    add r5, r5, #1
d600807c:   eaffffff    b   d6008080 <one>
d6008080 <one>:
d6008080:   e2855001    add r5, r5, #1
d6008084:   e2855001    add r5, r5, #1
d6008088:   e2855001    add r5, r5, #1
Nope, another half clock.

00377FFC

Back up one more:

Code: Select all

d6008070:   e2855001    add r5, r5, #1
d6008074:   e2855001    add r5, r5, #1
d6008078:   eaffffff    b   d600807c <one>
d600807c <one>:
d600807c:   e2855001    add r5, r5, #1
d6008080:   e2855001    add r5, r5, #1
d6008084:   e2855001    add r5, r5, #1
d6008088:   e2855001    add r5, r5, #1
d600808c:   e2855001    add r5, r5, #1
00437FFC

Wow, that just hurts. I wonder if the words starting at 0x80 got fetched twice.

Code: Select all

d6008070:   e2855001    add r5, r5, #1
d6008074:   eaffffff    b   d6008078 <one>
d6008078 <one>:
d6008078:   e2855001    add r5, r5, #1
d600807c:   e2855001    add r5, r5, #1
d6008080:   e2855001    add r5, r5, #1
d6008084:   e2855001    add r5, r5, #1
d6008088:   e2855001    add r5, r5, #1
00377FFC

And that hurts a little less.

So I would like to know more, but it is showing the branch penalty. Coming back to what started all of this, though: with the number of instructions you have in your handler, perhaps it is all conditional, and perhaps you could have traded lots of conditional executions (which always consume clocks) for a few conditional branches over chunks of code (so that some chunks consume no clocks at all). You might have micro-optimized this more. I don't know; I have not seen your code and don't need to. I wanted to point out, as our mentor Michael Abrash would tell us: don't assume you know how it works; test it, try crazy things, and time it. Look at the last two posts, at how wildly the exact same sequences of code varied simply by their location in RAM.

Assume and understand that with the cache on, the first passes through that code would have varied much worse depending on how it landed across cache-line boundaries. For an interrupt handler that you want to execute once every so often, that first pass against the cache could easily consume more clocks than a branch might. Add DRAM to this, and add an operating system with other interrupts and task switching and the like messing with the caches, and it all just gets less predictable. A branch over a cache line which is often evicted due to heavy use by data accesses may turn out to be overall faster than straight-through execution, for example.

At my day job we had a situation where we placed the data we handled on power-of-2 boundaries in memory, which is fine if you are handling that size of data every time, but we weren't; most of the time it was smaller, much smaller than the worst case. So we were striping our data and punishing the cache. By changing the reserved size of a data chunk to be 5 times some power of two rather than 8 times it, we used more cache ways, got less cache pounding, and higher performance. The additional cost of a times five really isn't that bad compared to the <<3 that gets the times 8 (actually the whole thing was block number <<N): 5 = 4+1, so (x*4)+(x*1) = x*(4+1) = x*5. You can do (x<<2)+x to get a times 5, so you add one operation and burn a temporary register for the duration of that operation to save a ton of time in cache evicting and fetching.
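(The times-five itself is a single ARM instruction, for example:)

Code: Select all

	add	r1, r0, r0, lsl #2    @ r1 = r0 + (r0 << 2) = r0 * 5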

Fun experiment; sorry to hijack your thread. I hope your handler is working and is fast enough for what you need.
David

Thanks for pointing out that Black Book project; I talked to that person and the wheels are in motion for the Zen of Asm, we'll see if it all works out... I highly recommend that book, especially for a platform like the raspberry pi where there are so many things going on in the background that can be cycle stealers...

findx
Posts: 29
Joined: Mon Jul 29, 2013 7:52 pm

Re: FIQ Size

Tue Feb 11, 2014 5:20 am

I absolutely agree with both you and Abrash... you have to measure these things, and assume nothing. Here's a quote from Abrash's Black Book, Chapter 3:

A case in point: A few years back, I came across an article about 8088 assembly language called “Optimizing for Speed.” Now, “optimize” is not a word to be used lightly; Webster’s Ninth New Collegiate Dictionary defines optimize as “to make as perfect, effective, or functional as possible,” which certainly leaves little room for error. The author had, however, chosen a small, well-defined 8088 assembly language routine to refine, consisting of about 30 instructions that did nothing more than expand 8 bits to 16 bits by duplicating each bit.

The author of “Optimizing” had clearly fine-tuned the code with care, examining alternative instruction sequences and adding up cycles until he arrived at an implementation he calculated to be nearly 50 percent faster than the original routine. In short, he had used all the information at his disposal to improve his code, and had, as a result, saved cycles by the bushel. There was, in fact, only one slight problem with the optimized version of the routine….

It ran slower than the original version!


The rest of the story is available at http://www.jagregory.com/abrash-black-b ... me-nothing (thanks to tufty for pointing out the online version).

Anyways, I added the branch out of the vector table, reverted to my regular three-instruction wait routine, and did some benchmarking under load. Specifically, I had the RPi crunching at a load average of 8.05, 8.12, 7.95 using eight dd if=/dev/zero of=/dev/null processes with the FIQ running in the background. I fed the FIQ a ~60 KHz pulse train to respond to, set the scope to infinite persistence, and let it cook for an hour to get a feeling for how bad the jitter is without doing a full 24-hour latency measurement.

Attached is the result, with blue being the ~60 KHz, 50 ns positive-width pulse train and yellow the /RD /CS line strobed by the FIQ routine. 60 KHz is actually 2-4x faster than I will ever need the 8-channel x 18-bit ADC data to come in at, so apparently I still have some time to spare without optimizing any of my code. If needed, I can cut the data rate in half by not reading in 4 of the channels, and I can also spend time trying to optimize the code: looking at the cache, maybe deferring the bit shifts to a tight loop outside the FIQ, etc. ... but it's not looking like I will need to do this at the moment.

Since I have to do some double precision matrix math with the ADC signal (run a neural network), I did some basic benchmarking with the FIQ going in the background, and I thought I'd share the results if anyone's curious.

This is with the FIQ responding to a ~60 KHz pulse train (asynchronous falling edge detection) on the first benchmark execution, with those previous dd processes killed:
RPi ~ # gcc -O2 -DUNIX flops.c -o flops
RPi ~ # ./flops

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module      Error       RunTime      MFLOPS
                            (usec)
      1      2.8422e-14     0.1993     70.2470
      2      2.5047e-13     0.0810     86.4031
      3     -7.6605e-15     0.2520     67.4520
      4      2.2771e-13     0.2016     74.4186
      5      3.8858e-14     0.4509     64.3105
      6      7.5495e-15     0.3720     77.9504
      7     -1.1369e-13     0.2974     40.3467
      8      1.2612e-13     0.4530     66.2297

   Iterations      = 128000000
   NullTime (usec) = 0.0000
   MFLOPS(1)       = 79.1345
   MFLOPS(2)       = 57.0239
   MFLOPS(3)       = 65.5811
   MFLOPS(4)       = 71.1719

For reference, with no FIQ running, third execution on a freshly rebooted system:
RPi ~ # ./flops

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module      Error       RunTime      MFLOPS
                            (usec)
      1      2.8422e-14     0.1264    110.7540
      2      2.5047e-13     0.0516    135.5522
      3     -7.6605e-15     0.1595    106.6144
      4      2.2771e-13     0.1280    117.2161
      5      3.8858e-14     0.2848    101.8381
      6      7.5495e-15     0.2349    123.4453
      7     -1.1369e-13     0.1884     63.6816
      8      1.2612e-13     0.2867    104.6322

   Iterations      = 128000000
   NullTime (usec) = 0.0000
   MFLOPS(1)       = 124.5043
   MFLOPS(2)       = 90.0866
   MFLOPS(3)       = 103.6437
   MFLOPS(4)       = 112.4759

I've benchmarked my adaptive neural network code running within Xenomai using the Eigen C++ matrix library and randomly generated data. It clocks in with an average execution time of 3 ms with the 60 KHz FIQ going. That's 2x slower than without the FIQ, but I can tolerate up to ~20 ms of delay, so things are looking good at the moment. Once all the code is together, though, I will need to validate the timing with a clock external to the system.
[Attachment: oneHourJitter_8_dd_zero_to_null_LoadAverage_8.05_8.12_7.95.png, a one-hour infinite-persistence scope capture of the jitter under load]

dwelch67
Posts: 1002
Joined: Sat May 26, 2012 5:32 pm

Re: FIQ Size

Tue Feb 18, 2014 6:56 pm

Duh, I didn't turn the branch prediction bit(s) on.

Code: Select all

    mrc p15, 0, r0, c1, c0, 0
    @orr r0,r0,#0x0800 ;@ branch prediction
    bic r0,r0,#0x0800 ;@ branch prediction
    mcr p15, 0, r0, c1, c0, 0
    mrc p15, 0, r0, c1, c0, 1
    @orr r0,r0,#0x000F ;@ branch prediction
    bic r0,r0,#0x000F ;@ branch prediction
    mcr p15, 0, r0, c1, c0, 1
The orr lines turn it on; the bic lines turn it off.

Code: Select all

d6008070:	e2855001 	add	r5, r5, #1
d6008074:	e2855001 	add	r5, r5, #1
d6008078:	e2855001 	add	r5, r5, #1
d600807c:	eaffffff 	b	d6008080 <one>
d6008080 <one>:
00377FFC ticks for 0x10000 loops branch prediction off.  
001D803C with it on.

Code: Select all

d6008070:	e2855001 	add	r5, r5, #1
d6008074:	e2855001 	add	r5, r5, #1
d6008078:	eaffffff 	b	d600807c <one>
d600807c <one>:
d600807c:	e2855001 	add	r5, r5, #1
d6008080:	e2855001 	add	r5, r5, #1
00437FFC off
0029003D on

Code: Select all

d6008070:	e2855001 	add	r5, r5, #1
d6008074:	eaffffff 	b	d6008078 <one>
d6008078 <one>:
d6008078:	e2855001 	add	r5, r5, #1
d600807c:	e2855001 	add	r5, r5, #1
d6008080:	e2855001 	add	r5, r5, #1
00377FFC bp off
00288031 bp on

Code: Select all

d6008070:	eaffffff 	b	d6008074 <one>
d6008074 <one>:
d6008074:	e2855001 	add	r5, r5, #1
d6008078:	e2855001 	add	r5, r5, #1
d600807c:	e2855001 	add	r5, r5, #1
d6008080:	e2855001 	add	r5, r5, #1
00377FFC bp off
00288031 bp on
So having that branch at the end of a fetch line perhaps gave the branch prediction time to prepare. With the branch in there it was on par with having a non-branch in there.

fun stuff,
David

findx
Posts: 29
Joined: Mon Jul 29, 2013 7:52 pm

Re: FIQ Size

Wed Feb 19, 2014 9:58 pm

Cool! That's good to know.

dwelch67
Posts: 1002
Joined: Sat May 26, 2012 5:32 pm

Re: FIQ Size

Thu Feb 20, 2014 1:47 am

I didn't realize until I saw some code in the dexbasic thread that branch prediction was something you turned on or off. So of course I had to go back and try it on that system I did the other stuff on...

colinh
Posts: 95
Joined: Tue Dec 03, 2013 11:59 pm
Location: Munich

Re: FIQ Size

Sat Feb 22, 2014 6:12 pm

... and it's turned off by default, as are instruction and data caching (see section 3.2.7, "c1, Control Register", of the ARM1176JZF-S TRM).
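(For the bare-metal crowd, a sketch of turning those on; bit numbers are from the ARM1176 control register description, and the D-cache additionally needs the MMU configured first:)

Code: Select all

	mrc	p15, 0, r0, c1, c0, 0
	orr	r0, r0, #(1 << 12)    @ bit 12: I-cache
	orr	r0, r0, #(1 << 11)    @ bit 11: flow (branch) prediction
	orr	r0, r0, #(1 << 2)     @ bit 2:  D-cache (requires the MMU to be set up)
	mcr	p15, 0, r0, c1, c0, 0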
