raphaeldinge
Posts: 6
Joined: Sat Dec 23, 2017 1:37 am

Enabling MMU slows down VFP?

Tue Dec 26, 2017 2:45 pm

I'm running the following performance test under different contexts:

Code:

void  measure_vmul_f32 (sys::Console & console)
{
   asm volatile ("vmov.f32 s15, %0" :: "r" (0x3f800000) : "memory"); // 1.f
   asm volatile ("vmov.f32 s14, %0" :: "r" (0xbf800000) : "memory"); // -1.f

   auto start = sys::get_clock_us ();

   for (int i = 0 ; i < 100 ; ++i)
   {
      unroll_loop1K (asm volatile ("vmul.f32 s15, s14, s15"));
   }

   auto end = sys::get_clock_us ();

   console << "100K vmul.f32 in " << (end - start) << " us\n";
}
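For reference, unroll_loop1K is just a repetition macro; the project's actual macro isn't shown here, so this is a plausible reconstruction (my sketch, the real one may differ) that expands its statement argument 1024 times so the timed region is branch-free:

```c
/* Hypothetical reconstruction of unroll_loop1K (not the project's actual
   macro): expand the statement argument 8 * 8 * 16 = 1024 times. */
#define REPEAT_8(op)    op; op; op; op; op; op; op; op;
#define REPEAT_64(op)   REPEAT_8 (op) REPEAT_8 (op) REPEAT_8 (op) REPEAT_8 (op) \
                        REPEAT_8 (op) REPEAT_8 (op) REPEAT_8 (op) REPEAT_8 (op)

#define unroll_loop1K(op) \
   do { \
      REPEAT_64 (op) REPEAT_64 (op) REPEAT_64 (op) REPEAT_64 (op) \
      REPEAT_64 (op) REPEAT_64 (op) REPEAT_64 (op) REPEAT_64 (op) \
      REPEAT_64 (op) REPEAT_64 (op) REPEAT_64 (op) REPEAT_64 (op) \
      REPEAT_64 (op) REPEAT_64 (op) REPEAT_64 (op) REPEAT_64 (op) \
   } while (0)
```

With this, 100 iterations of the outer loop give the 100K vmul.f32 measured below.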

config.txt is set up as below:

Code:

force_turbo=1
arm_freq=1200

The VFP is set up with single/double precision access, as well as "RunFast mode" (i.e. FZ flushes denormals to zero and DN returns the default NaN; I believe this is not relevant for the test, though, as I'm only going to oscillate between 1 and -1).

Code:

__vfp_init:
   push  {r0}

   // Setup coprocessor
   mrc   p15, 0, r0, c1, c0, 2   /* coprocessor access control register */
   orr   r0, r0, #0xF00000       /* cp10/cp11 (single + double precision) */
   mcr   p15, 0, r0, c1, c0, 2

   // Instruction Sync Barrier
   isb

   // Enable coprocessor
   mov   r0, #0x40000000   // EN bit to 1
   fmxr  fpexc, r0

   // Initialize FP Status Control Register
   mov   r0, #0x3000000    // FZ & DN bits to 1 : RunFast mode
   fmxr  fpscr, r0

   pop   {r0}
   bx    lr


If I set up only the instruction cache, e.g.:

Code:

void  enable_instr_cache ()
{
    uint32_t ctl;
    asm volatile ("mrc p15, 0, %0, c1, c0,  0" : "=r" (ctl));
    ctl |= ARM_CTL_BRANCH_PREDICTION;
    ctl |= ARM_CTL_L1_INSTRUCTION_CACHE;
    asm volatile ("mcr p15, 0, %0, c1, c0,  0" : : "r" (ctl) : "memory");
}

I get 100K vmul.f32 in 335 us, which gives 298 MFLOPS (is that already low?)

Now if I set up the MMU, whatever the configuration, it works, and I can even use my framebuffer, etc.
However I get 100K vmul.f32 in 820 us.

Considering my 100K loop is register-only, I don't see how the MMU could have any impact on it. Am I missing something?

dwelch67
Posts: 939
Joined: Sat May 26, 2012 5:32 pm

Re: Enabling MMU slows down VFP?

Wed Dec 27, 2017 1:28 am

every single access, including fetching of code, goes through the mmu, which walks a table in ram. there is some caching of the table entries (the tlb) but it is not perfect.

when you enabled or disabled the mmu, did you add any code (or did you simply nop or not-nop the mcr or other instructions that do the enable)? changing the alignment of your code could also easily cause such a performance difference. are you sure it is the mmu?
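One way to control for the alignment effect mentioned above (a sketch; the attribute is a GCC extension and the function name is mine, not from the thread) is to pin the measured function to a cache-line boundary so it sits at the same alignment in both builds:

```c
#include <stdint.h>

/* Pin the timed function to a 64-byte boundary (GCC extension) so the code
   under test has identical alignment whether or not the MMU-enable path is
   compiled in. If the timing difference survives this, alignment is ruled out. */
__attribute__ ((aligned (64)))
static int timed_kernel (int n)
{
   int acc = 1;
   for (int i = 0; i < n; ++i)
      acc = -acc;   /* placeholder for the vmul.f32 body */
   return acc;
}
```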

raphaeldinge
Posts: 6
Joined: Sat Dec 23, 2017 1:37 am

Re: Enabling MMU slows down VFP?

Wed Dec 27, 2017 11:25 am

@dwelch67 Thanks for your reply.

In my code I have different functions to enable the MMU or not (both set up the instruction and branch caches though), and I could verify that multiple times. I still need to verify that the linker doesn't move addresses around, but I'm pretty sure this is not the case (I output a disassembled version at each build so I can monitor the changes).

However, maybe I'm seeing something like you suggest with the instruction cache: maybe my unroll_loop1K is unrealistic (anyway) and is hitting the instruction cache in a bad way. I'll come back to that question if I still have the same problem in a realistic scenario.

Thanks!

AlfredJingle
Posts: 69
Joined: Thu Mar 03, 2016 10:43 pm

Re: Enabling MMU slows down VFP?

Sat Jan 06, 2018 3:39 pm

Hi,

I assume you're using an RPi 3 (given the 1200 MHz). So I tested your little loop (but not unrolled) directly in assembly on my own system (bare-metal RPi 3 @ 1200 MHz with core_freq @ 400 MHz), with the MMU switched on, and measured the loop at 334 microseconds. So the same as your first test.

Hopefully this helps a bit:
In your code I don't see you switching on the data cache; you should obviously do that.
I see you switching on the branch predictor, which according to the ARMv8 Architecture Reference Manual is unnecessary.
On an ARMv8 processor, unrolling loops is hardly ever beneficial, probably because it dirties the instruction cache unnecessarily while branching is highly optimised. On my system, as soon as a subroutine takes longer than 8-9 cycles, I have never seen any benefit from unrolling. I tried your test with 4 vmuls in a loop and that saved no time.

If you want faster calculation you should play with the SIMD (NEON): it does 4 float calculations in parallel.
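A minimal inline-asm sketch of what that looks like (the function name is mine; this assumes the cp10/cp11 access and FPEXC.EN bit already set up in __vfp_init, which on this core enables Advanced SIMD as well as VFP):

```c
/* Sketch only, not from the thread: multiply four f32 lanes per instruction
   with NEON q registers. */
static inline void vmul4_f32 (float dst [4], const float a [4], const float b [4])
{
   asm volatile
   (
      "vld1.32  {d0, d1}, [%1]\n\t"   // load a[0..3] into q0
      "vld1.32  {d2, d3}, [%2]\n\t"   // load b[0..3] into q1
      "vmul.f32 q0, q0, q1\n\t"       // four multiplies in one instruction
      "vst1.32  {d0, d1}, [%0]\n\t"   // store q0 to dst[0..3]
      :
      : "r" (dst), "r" (a), "r" (b)
      : "d0", "d1", "d2", "d3", "memory"
   );
}
```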

Happy coding!
going from a 6502 on an Oric-1 to an ARMv8 is quite a big step...

raphaeldinge
Posts: 6
Joined: Sat Dec 23, 2017 1:37 am

Re: Enabling MMU slows down VFP?

Sat Jan 06, 2018 4:23 pm

@AlfredJingle Thanks for your reply

In the meantime I also wrote some code to enable the data cache, but I could only see an improvement when the MMU is set up as well, which makes sense anyway, I guess.

I've had a really hard time setting up the MMU, and I think I should really read the datasheets about it some more. I finally found some code that I kind of copy/pasted to make it work, and I get a huge performance boost; my I2S and I2C implementations work, but my framebuffer implementation does not work anymore. I guess that's because I don't really understand what I'm doing.

Given the very short timeframe I have for this PoC project, I'm now considering using the UART for debugging and dropping the framebuffer (which I only used for debugging).

Here is the code anyway; if it rings a bell for someone, that would be greatly appreciated.
For example, when setting up the flags for the table sections, I don't get the difference between the 0x5 and 0x4 prefixes, which seem to indicate different kinds of sections in the datasheets and don't make sense to me.

Code:

void  enable_mmu ()
{
   addr_t page_table = 0x4000;

   uint32_t idx = 0;

   for (; idx < 1008 ; ++idx)
   {
      uint32_t val = (idx << 20) | 0x50C0E;
      kern::write_addr (page_table + idx * 4, val);
   }

   for (; idx < 4096 ; ++idx)
   {
      uint32_t val = (idx << 20) | 0x40C06;
      kern::write_addr (page_table + idx * 4, val);
   }

   // Only TTBR0
   asm volatile ("mcr p15, 0, %0, c2, c0,  2" : : "r" (0));

   // Table walk: base address 0x4000 plus walk attribute bits
   uint32_t ttbr0;
   asm volatile ("mrc p15, 0, %0, c2, c0,  0" : "=r" (ttbr0));
   ttbr0 &= 0x3FBE;
   ttbr0 |= 0x4043;   // table at 0x4000, cacheable/shareable table walks
   asm volatile ("mcr p15, 0, %0, c2, c0,  0" : : "r" (ttbr0));

   // All access domain client
   asm volatile ("mcr p15, 0, %0, c3, c0,  0" : : "r" (0x55555555));

   // Start caches (SCTLR, cp15 c1)
   uint32_t ctl;
   asm volatile ("mrc p15, 0, %0, c1, c0,  0" : "=r" (ctl));
   ctl &= 0xFFFFFFFD;   // clear A bit: strict alignment checking off
   ctl |= 4096;   // I bit : instruction cache
   ctl |= 2048;   // Z bit : branch prediction
   ctl |= 4;      // C bit : data cache
   asm volatile ("mcr p15, 0, %0, c1, c0,  0" : : "r" (ctl) : "memory");

   // Start MMU
   asm volatile ("mrc p15, 0, %0, c1, c0,  0" : "=r" (ctl));
   ctl |= 1;
   asm volatile ("mcr p15, 0, %0, c1, c0,  0" : : "r" (ctl) : "memory");
}

AlfredJingle
Posts: 69
Joined: Thu Mar 03, 2016 10:43 pm

Re: Enabling MMU slows down VFP?

Sat Jan 06, 2018 5:24 pm

Hi

This is what I use:
0x90C0E for normal memory. (0x50C0E is fine in itself but means you use 16Mb sections which you must do consistently throughout.)
Than 1 1Mb section with 0x90C12. This a uncached part of memory and is handy for Mailbox work.
0x90C1E for screen memory
and 0x90C16 for everything above 1008.
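These constants can be decoded against the ARMv7 short-descriptor section entry format; the bit names below are from the ARM ARM, but the decode itself is my reading, not from the post:

```c
#include <stdint.h>

/* ARMv7-A short-descriptor translation table, 1 MB section entry.
   The decode of the four constants below is my reading of the ARM ARM. */
enum
{
   SECTION       = 1u << 1,   /* descriptor type bits [1:0] = 0b10      */
   BUFFERABLE    = 1u << 2,   /* B                                      */
   CACHEABLE     = 1u << 3,   /* C                                      */
   EXECUTE_NEVER = 1u << 4,   /* XN                                     */
   AP_FULL       = 3u << 10,  /* AP[1:0] = 0b11: full read/write access */
   SHAREABLE     = 1u << 16,  /* S                                      */
   NON_SECURE    = 1u << 19   /* NS                                     */
};

/* 0x90C0E: normal memory, write-back cacheable */
static const uint32_t flags_normal   = SECTION | CACHEABLE | BUFFERABLE | AP_FULL | SHAREABLE | NON_SECURE;

/* 0x90C12: strongly-ordered (C = B = 0), execute-never: the uncached mailbox window */
static const uint32_t flags_uncached = SECTION | EXECUTE_NEVER | AP_FULL | SHAREABLE | NON_SECURE;

/* 0x90C1E: cacheable but execute-never: screen memory */
static const uint32_t flags_screen   = SECTION | CACHEABLE | BUFFERABLE | EXECUTE_NEVER | AP_FULL | SHAREABLE | NON_SECURE;

/* 0x90C16: device memory (B = 1, C = 0), execute-never: the peripherals */
static const uint32_t flags_device   = SECTION | BUFFERABLE | EXECUTE_NEVER | AP_FULL | SHAREABLE | NON_SECURE;
```

(The 0x5 prefix in the earlier enable_mmu code additionally sets bit 18, the supersection bit, which is why it implies 16MB sections.)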

When the GPU writes a mailbox reply into cached memory, the data cache is not aware of that, and thus the CPU (usually) will not reload the data from memory. To solve that problem you either let the mailbox/framebuffer code use a piece of uncached memory (hence my 1 MB uncached section), or you invalidate the data cache before reading the reply. Writing to an uncached piece of memory is far easier than learning how to manage your caches, so that would be my advice.
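The cache-maintenance alternative can be sketched like this (my helper name and the 64-byte line size are assumptions; the coprocessor operation is the ARMv7 DCIMVAC):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch, not from the thread: invalidate the data-cache lines covering a
   buffer (e.g. a mailbox reply) so the next loads fetch what was actually
   written to RAM. 64 bytes is the Cortex-A53 L1 data-cache line size. */
static void invalidate_dcache_range (uintptr_t addr, size_t size)
{
   const uintptr_t line = 64;
   uintptr_t end = addr + size;

   for (addr &= ~(line - 1); addr < end; addr += line)
   {
      // DCIMVAC: invalidate data cache line by virtual address to PoC
      asm volatile ("mcr p15, 0, %0, c7, c6, 1" : : "r" (addr) : "memory");
   }

   asm volatile ("dsb" ::: "memory");   // ensure completion before the read
}
```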
