vsiles
Posts: 41
Joined: Wed Feb 04, 2015 10:04 am

Performance issue

Thu Jan 14, 2016 6:43 pm

Hi !
Now that I have a small OS that works, I'm looking for a (very simple) example to test my system.

I'm starting with pure bare metal again, no OS, with a simple program that fills 600 4 KB pages with random data, copies them into a second buffer with a simple for loop, then prints a checksum of the second buffer.
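For readers who want to reproduce the test, here is a minimal sketch of such a program. It is not the original code: the xorshift32 generator, the additive checksum and every function name are assumptions made for illustration; only the sizes (600 pages of 4 KB) and the byte-wise for loop come from the description above.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE  4096u                    /* 4 KB pages, as described */
#define PAGE_COUNT 600u
#define BUF_SIZE   (PAGE_SIZE * PAGE_COUNT)

static uint8_t src[BUF_SIZE];
static uint8_t dst[BUF_SIZE];

/* Assumed stand-in for "random stuff": a plain xorshift32 generator. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

static void fill_random(uint8_t *buf, size_t len, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;      /* xorshift must not start at 0 */
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)xorshift32(&state);
}

/* The "simple for loop" copy: one byte per iteration. */
static void copy_bytes(uint8_t *to, const uint8_t *from, size_t len)
{
    for (size_t i = 0; i < len; i++)
        to[i] = from[i];
}

/* Assumed checksum: 32-bit wrapping sum of all bytes. */
static uint32_t checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Fill, copy, and return the checksum of the second buffer. */
uint32_t run_benchmark(uint32_t seed)
{
    fill_random(src, BUF_SIZE, seed);
    copy_bytes(dst, src, BUF_SIZE);
    return checksum(dst, BUF_SIZE);
}
```

On bare metal the result of run_benchmark() would be printed over the UART; on Raspbian a printf is enough.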

Without the MMU, it takes about 15 s. With a simple 1:1 MMU mapping (I can't tell exactly which flags I use, I don't have the code right now, I'll post it tomorrow), it takes about 6 s.

The same program on Raspbian, compiled with -O0, takes about 0.2 s.

I have no idea whatsoever why the difference is so huge. Does anyone have an educated guess?

Best,
V.

Paeryn
Posts: 3089
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: Performance issue

Thu Jan 14, 2016 10:25 pm

Caches will make a difference. I think that if the MMU is switched off, the D-cache is forcibly disabled even if you try to enable it; the I-cache can be enabled without the MMU, though.
She who travels light — forgot something.
Please note that my name doesn't start with the @ character so can people please stop writing it as if it does!

vsiles
Posts: 41
Joined: Wed Feb 04, 2015 10:04 am

Re: Performance issue

Fri Jan 15, 2016 10:29 am

Here are the details:
- The MMU and SCTLR.{C,I,Z} are on (D-cache, I-cache, branch prediction).
- I also set SCTLR.TRE and SCTLR.AFE for TEX remapping and the simplified access-flag model.
- My MMU maps the memory as sections, with the S bit set in the mappings.
- PRRR values are set to Normal memory, Shareable (Outer Shareable).
- NMRR values are set to Inner Write-Through, no Write-Allocate and Outer Write-Back, Write-Allocate.
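As a rough illustration of the SCTLR part of that list, the sketch below ORs the named enable bits into a value read from SCTLR. The bit positions are the ARMv7-A ones (M = 0, C = 2, Z = 11, I = 12, TRE = 28, AFE = 29); the macro and function names are made up for this example, and the PRRR/NMRR setup is not shown.

```c
#include <stdint.h>

#define SCTLR_M   (1u << 0)   /* MMU enable */
#define SCTLR_C   (1u << 2)   /* data cache enable */
#define SCTLR_Z   (1u << 11)  /* branch prediction enable */
#define SCTLR_I   (1u << 12)  /* instruction cache enable */
#define SCTLR_TRE (1u << 28)  /* TEX remap enable */
#define SCTLR_AFE (1u << 29)  /* access flag enable */

/* Hypothetical helper: takes the current SCTLR value and returns it
 * with the bits listed in the post switched on. Writing the result
 * back (mcr p15, 0, <Rt>, c1, c0, 0) is left to the caller. */
uint32_t sctlr_with_mmu_and_caches(uint32_t sctlr)
{
    return sctlr | SCTLR_M | SCTLR_C | SCTLR_Z | SCTLR_I
                 | SCTLR_TRE | SCTLR_AFE;
}
```

Keeping the computation separate from the CP15 access makes it easy to check the value on a host machine before trying it on the board.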

hldswrth
Posts: 108
Joined: Mon Sep 10, 2012 4:14 pm

Re: Performance issue

Fri Jan 15, 2016 11:57 am

You might find this thread interesting:
viewtopic.php?f=72&t=70428&p=510555&hil ... py#p510555
I was getting 1.6 GB/s in the end with an optimised memset function and memory set as non-shared.

vsiles
Posts: 41
Joined: Wed Feb 04, 2015 10:04 am

Re: Performance issue

Fri Jan 15, 2016 12:04 pm

Thank you for the link, I'll have a look right away !

dwelch67
Posts: 968
Joined: Sat May 26, 2012 5:32 pm

Re: Performance issue

Sat Jan 16, 2016 3:53 am

Are you sure that on Raspbian it has actually completed the copy, or has it simply scheduled it and let your program continue? In the same way, when Linux says it has finished copying a file, it has barely begun.

There is nothing an OS can do that you can't do bare metal; in either case it is just a sequence of instructions running through the processor. It is a matter of confirming that the time/performance measurements are real and accurate, and then figuring out ways to improve them.
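One way to sanity-check the measurement on the Raspbian side is to put both the copy and a read-back of the destination inside the timed region. The sketch below assumes a hosted C11 environment (timespec_get from <time.h>); timed_copy and elapsed_seconds are hypothetical names, and on bare metal the timestamps would have to come from a hardware timer instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Difference between two timestamps, in seconds. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Hypothetical helper: copies len bytes, then sums the destination,
 * and times both steps together so the measured interval includes
 * reading the copied data back. */
double timed_copy(uint8_t *dst, const uint8_t *src, size_t len,
                  uint32_t *checksum_out)
{
    struct timespec start, end;
    uint32_t sum = 0;

    timespec_get(&start, TIME_UTC);     /* wall clock, not monotonic */
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
    for (size_t i = 0; i < len; i++)
        sum += dst[i];
    timespec_get(&end, TIME_UTC);

    *checksum_out = sum;
    return elapsed_seconds(start, end);
}
```

If the checksum matches the source and the elapsed time is still around 0.2 s, the copy really did finish in that time.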

In one respect the MMU should actually make things slower, as it has to do table lookups, sometimes out of very slow DRAM, that you would not have had to do without the MMU. On the other hand, for a platform like this the MMU is the only way to separate cacheable address spaces from non-cacheable ones. The data cache might help you here (might not, it depends), so it is worth trying. Larger pages should be better than small ones: fewer uncached table lookups.

arrangement and alignment of instructions, etc...

David

vsiles
Posts: 41
Joined: Wed Feb 04, 2015 10:04 am

Re: Performance issue

Tue Jan 19, 2016 2:55 pm

I'm pretty sure the copy is done on Raspbian, since it outputs the correct value computed after the copy. I'm still going through all the info from the thread hldswrth mentioned, and I see there are lots of ways to improve my code. Thanks for the input!

vsiles
Posts: 41
Joined: Wed Feb 04, 2015 10:04 am

Re: Performance issue

Tue Jan 19, 2016 4:32 pm

Shame on me...
Next time, I'll remove the -O0 right away; it really is a performance killer. In memset, the generated code was performing way too many stores (mostly to the stack, to save local variables all the time)...

My bad :D

27troadster
Posts: 19
Joined: Sun Apr 12, 2015 12:10 pm

Re: Performance issue

Thu Jan 21, 2016 2:55 am

vsiles wrote:Here are the details:
...
- My MMU maps the memory as section, with the S bit set in the mappings
...
I'm pretty sure that having the S bit set (designating the section of memory as "shared") will turn off level 1 data caching. The ARM1176 technical manual, pg 6-21, in the paragraph titled "shared normal memory", states: "The processor does not cache shareable locations at level one."

So even if the C bit is set and the D-cache is enabled in CP15, a set S bit will override these and data caching will not occur. I do not know how the Pi deals with cache coherency for shared sections of memory in the level 2 cache.
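To make it concrete where the S bit sits, here is a small sketch that builds a first-level section descriptor in the short-descriptor format (type in bits [1:0], B = bit 2, C = bit 3, AP[1:0] = bits [11:10], TEX = bits [14:12], S = bit 16). The helper and macro names are invented for this example, and the attribute combination in the usage note is only an assumption, not the mapping used in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_TYPE   (2u << 0)    /* bits [1:0] = 0b10: section entry */
#define SECTION_B      (1u << 2)    /* bufferable */
#define SECTION_C      (1u << 3)    /* cacheable */
#define SECTION_AP_RW  (3u << 10)   /* AP[1:0] = 0b11: read/write access */
#define SECTION_TEX(x) (((uint32_t)(x) & 7u) << 12)
#define SECTION_S      (1u << 16)   /* shareable */

/* Hypothetical helper: combines a 1 MB aligned physical base with the
 * attribute bits. Domain 0 is assumed, so the domain field stays zero. */
uint32_t section_descriptor(uint32_t phys_base, uint32_t attrs)
{
    assert((phys_base & 0x000FFFFFu) == 0u);  /* sections are 1 MB aligned */
    return phys_base | attrs | SECTION_TYPE;
}
```

For example, section_descriptor(0x00100000, SECTION_C | SECTION_B | SECTION_AP_RW) gives a non-shared entry, and OR-ing in SECTION_S changes only bit 16, which makes it easy to benchmark both variants.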

Kipp

vsiles
Posts: 41
Joined: Wed Feb 04, 2015 10:04 am

Re: Performance issue

Thu Jan 21, 2016 11:08 am

I'm sorry, I forgot to say that I'm on an RPi 2, so the situation is a bit different with the Cortex-A7: you have to set up the Inner/Outer Shareability domains correctly, but the L1 cache should still be on.

vsiles
Posts: 41
Joined: Wed Feb 04, 2015 10:04 am

Re: Performance issue

Fri Jan 22, 2016 10:05 am

I also noticed that the cache/shareability flags you put in the TTBR0 register have a huge impact on performance!
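For anyone wondering which flags those are: the low bits of TTBR0 describe how the table walker itself accesses the page tables. The sketch below uses the ARMv7-A layout with the Multiprocessing Extensions (IRGN[1] = bit 0, S = bit 1, RGN = bits [4:3], NOS = bit 5, IRGN[0] = bit 6). The names are made up for this example, and the write-back/write-allocate combination shown is one plausible choice, not necessarily the one used here.

```c
#include <assert.h>
#include <stdint.h>

#define TTBR_IRGN_WBWA ((0u << 0) | (1u << 6)) /* IRGN[1:0] = 0b01: inner WB, WA */
#define TTBR_S         (1u << 1)               /* table walks are shareable */
#define TTBR_RGN_WBWA  (1u << 3)               /* RGN = 0b01: outer WB, WA */
#define TTBR_NOS       (1u << 5)               /* not outer shareable */

/* Hypothetical helper: combines the translation table base with the
 * walk attributes. The table must be 16 KB aligned when TTBCR.N is 0. */
uint32_t ttbr0_value(uint32_t table_base, uint32_t flags)
{
    assert((table_base & 0x3FFFu) == 0u);
    return table_base | flags;
}
```

With all flags left at zero the table walks are non-cacheable, so every TLB miss goes out to DRAM, which would be consistent with the large difference observed.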
