Grumps
Posts: 6
Joined: Tue Mar 06, 2012 8:07 pm

Execution speed

Tue Nov 20, 2012 9:35 am

Hi
I have followed a couple of tutorials and tried DexOS (http://www.cl.cam.ac.uk/freshers/raspbe ... orials/os/), but have now switched to gcc, as assembler was doing my head in - again!
So, I have a short piece of C code (written from first principles) that flashes the LED (surely that's the first thing everybody does) and then requests an 800x600x32bpp frame buffer.
At the end of all of the setup I have a little loop drawing a scrolling test pattern:

Code:

  for(;;) {
    for(x=0;x!=1000;x++) {
      FramePtr=FrameBufferInfo[8];
      for(i=0;i!=100000;i++) 
        *FramePtr++=(0xff000000+i+x);        
    }
  }
This compiles down to a str, cmp, add, bne for the inner loop. I can time the execution of the outer loop (1000 passes of 100000 writes each) and it takes 13s. That's less than 8 million loops per sec., or under 32 million instructions per sec. Does that sound right? It seems slow to me. I have just this:

Code:

hdmi_mode=8
disable_overscan=1 
In my config.txt file.
Any clues? Ta.
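
For reference, the arithmetic behind those figures can be sketched as below. The function names are made up for illustration, and the figure of four instructions per iteration is simply the str, cmp, add, bne loop described above.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of the original program. */

/* Total inner-loop iterations for one timed run of the x loop. */
static uint64_t total_iterations(uint64_t outer, uint64_t inner)
{
    return outer * inner;
}

/* Iterations per second given a measured wall-clock time in seconds. */
static double loops_per_second(uint64_t iterations, double seconds)
{
    return (double)iterations / seconds;
}

/* Instructions per second, assuming a fixed instruction count per iteration. */
static double instructions_per_second(double loops_per_sec, unsigned insns_per_loop)
{
    return loops_per_sec * insns_per_loop;
}
```

With 1000 x 100000 iterations in 13 s this gives roughly 7.7 million loops per second, or about 31 million instructions per second at four instructions per loop.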

Grumps
Posts: 6
Joined: Tue Mar 06, 2012 8:07 pm

Re: Execution speed

Tue Nov 20, 2012 9:53 am

Update:
I found another post about cache which I've now turned on (I think) using:

Code:

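  /* read the CP15 c1 control register */
  asm volatile ("MRC p15, 0, %0, c1, c0, 0" : "=r" (controlRegister));
  /* 0x1800 = bit 12 (instruction cache) | bit 11 (branch prediction) */
  controlRegister|=0x1800; 
  /* write it back */
  asm volatile ("MCR p15, 0, %0, c1, c0, 0" :: "r" (controlRegister));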
The loop now takes 2.5s, which is over 5x faster. That is now about 160 million instructions per sec.
Is that to be expected? Data cache is not on, but that should not make a difference.

rupertr
Posts: 11
Joined: Fri Sep 07, 2012 2:21 am

Re: Execution speed

Tue Nov 20, 2012 12:21 pm

Data cache should make a difference.
Also try turning on branch prediction.
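
A sketch of what the control register bits discussed here look like, assuming the ARM1176 layout of the CP15 c1 Control Register: bit 12 is the instruction cache enable (I), bit 11 is branch prediction (Z) and bit 2 is the data cache enable (C). The macro and function names are made up for illustration. Under that layout the 0x1800 mask used earlier already sets the Z bit. As far as I know, the data cache on this core only becomes effective once the MMU is enabled, which this sketch does not attempt.

```c
#include <stdint.h>

/* Bit positions in the CP15 c1 Control Register (ARM1176).
 * Names are illustrative, not taken from any header. */
#define CTRL_C_DCACHE  (1u << 2)   /* data cache enable */
#define CTRL_Z_BRANCH  (1u << 11)  /* branch prediction enable */
#define CTRL_I_ICACHE  (1u << 12)  /* instruction cache enable */

/* Pure helper: return the register value with the given bits set. */
static uint32_t ctrl_with_bits(uint32_t value, uint32_t bits)
{
    return value | bits;
}

#if defined(__arm__)
/* Read-modify-write of the control register; only builds for ARM targets. */
static void ctrl_enable(uint32_t bits)
{
    uint32_t reg;
    __asm__ volatile ("mrc p15, 0, %0, c1, c0, 0" : "=r" (reg));
    reg = ctrl_with_bits(reg, bits);
    __asm__ volatile ("mcr p15, 0, %0, c1, c0, 0" :: "r" (reg) : "memory");
}
#endif
```

Calling ctrl_enable(CTRL_I_ICACHE | CTRL_Z_BRANCH) on the target would be equivalent to the 0x1800 snippet above.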

dwelch67
Posts: 954
Joined: Sat May 26, 2012 5:32 pm

Re: Execution speed

Tue Nov 20, 2012 2:59 pm

And turn on optimization when you compile (-O2 or -O3 if using gcc). Being shared-memory intensive and having math in the core of the loop is going to slow it down quite a bit, perhaps to one or a few dozen million cycles per second.

Grumps
Posts: 6
Joined: Tue Mar 06, 2012 8:07 pm

Re: Execution speed

Tue Nov 20, 2012 4:07 pm

Thanks.
I'm using -O2 optimization, which generates the loop core as str, cmp, add, bne. Is there a guide somewhere that tells me how many clocks are needed per instruction?
If the instruction cache is on, then I'd expect no external memory access is required for the loop's core. So all of the slowdown must be associated with writing to the DDR (as you say, shared).
Most of my accesses are sequential, so the DDR should cope well with that (until interrupted by some GPU requirement).
I assume the DDR and core clock are all set up before my .img is executed? The BCM2835 ARM Peripherals PDF that I have shows nothing for any PLLs or the DDR controller.

I'm not complaining about the speed, it ain't bad at all. My little program is now doing continuous memory dumps in hex to the screen, and that works out at about 22Mpixels/second (32bpp). All written with my [email protected] C abilities ;)
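
To put the 22Mpixels/second figure into other units, here is a small sketch. The helper names are made up for illustration, and the 800x600 frame size is the one requested in the first post.

```c
/* Hypothetical helpers for converting a pixel rate into other units. */

/* Bytes written per second for a given pixel rate and colour depth. */
static double bytes_per_second(double pixels_per_sec, unsigned bits_per_pixel)
{
    return pixels_per_sec * (bits_per_pixel / 8.0);
}

/* Full-screen redraws per second for a width x height frame buffer. */
static double frames_per_second(double pixels_per_sec, unsigned width, unsigned height)
{
    return pixels_per_sec / ((double)width * height);
}
```

At 22 million pixels per second and 32bpp that is about 88 million bytes per second, or roughly 45 full 800x600 frames per second.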

tufty
Posts: 1456
Joined: Sun Sep 11, 2011 2:32 pm

Re: Execution speed

Wed Nov 21, 2012 11:57 am

Grumps wrote:
Thanks.
I'm using -O2 optimization which generates the loop core as str, cmp, add, bne. Is there a guide somewhere that tells me what amount of clocks are needed per instruction?
Yep, the ARM1176JZF-S TRM (Technical Reference Manual), Section 16, which covers cycle timings.
