leiradel
Posts: 32
Joined: Wed Feb 13, 2019 10:38 pm

Memory barriers

Sun Feb 24, 2019 10:36 pm

Hi,

I'm trying to add memory barriers around peripheral accesses. First, my memory barrier instructions are mcr p15, 0, r0, c7, c10, #5 on ARMv6-based boards, and dmb on other boards.
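
For reference, a minimal sketch of how those two barrier forms could be wrapped in one helper, assuming GCC or Clang inline assembly. The names mem_dmb, mem_read32 and mem_write32 match the calls in the code further down, but the signatures shown here are assumptions, and the non-ARM branch only exists so the sketch also builds on a host.

```c
#include <assert.h>
#include <stdint.h>

// Data memory barrier for the targets mentioned above.
static inline void mem_dmb(void) {
#if defined(__aarch64__) || (defined(__arm__) && defined(__ARM_ARCH) && __ARM_ARCH >= 7)
  // ARMv7 and ARMv8 have a dedicated instruction.
  __asm__ volatile("dmb sy" ::: "memory");
#elif defined(__arm__)
  // ARMv6: CP15 c7, c10, 5. The source register should hold zero.
  __asm__ volatile("mcr p15, 0, %0, c7, c10, 5" : : "r"(0) : "memory");
#else
  // Not an ARM target (assumption: host-side build), so use a plain fence.
  __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

// Volatile accessors for 32-bit registers (assumed signatures).
// The asserts are a debugging aid: register addresses must be word-aligned.
static inline uint32_t mem_read32(uintptr_t addr) {
  assert((addr & 3u) == 0);
  return *(volatile uint32_t*)addr;
}

static inline void mem_write32(uintptr_t addr, uint32_t value) {
  assert((addr & 3u) == 0);
  *(volatile uint32_t*)addr = value;
}
```

The "memory" clobber also stops the compiler from moving ordinary loads and stores across the barrier, which matters as much as the instruction itself once optimizations are on.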

The BCM2837 ARM Peripherals document is very clear about where memory barriers must be used:
1.3 Peripheral access precautions for correct memory ordering

(snip)

* A memory write barrier before the first write to a peripheral.
* A memory read barrier after the last read of a peripheral.

It is not required to put a memory barrier instruction after each read or write access. Only at
those places in the code where it is possible that a peripheral read or write may be followed
by a read or write of a different peripheral. This is normally at the entry and exit points of the
peripheral service code.
So I believe I could just add the memory barriers as the first and last instructions in functions that read or write to peripherals. However, I'm not sure what to do in functions that both read and write a peripheral, or that read in a loop, e.g.:

Code: Select all

int mbox_send(void* msg) {
  uint32_t value;

  // Write message to mailbox.
  do {
    value = mem_read32(BASE_ADDR + STATUS1);
  }
  while ((value & UINT32_C(0x80000000)) != 0); // Mailbox full, retry.

  // Send message to channel 8: tags (ARM to VC).
  const uint32_t msgaddr = (mem_arm2vc((uint32_t)msg) & ~15) | TAGS;
  mem_write32(BASE_ADDR + WRITE1, msgaddr);

  // Wait for the response.
  do {
    do {
      value = mem_read32(BASE_ADDR + STATUS0);
    }
    while ((value & UINT32_C(0x40000000)) != 0); // Mailbox empty, retry.

    value = mem_read32(BASE_ADDR + READ0);
  }
  while ((value & 15) != TAGS); // Wrong channel, retry.

  if (((mbox_msgheader_t*)msg)->code == UINT32_C(0x80000000)) {
    return 0; // Success!
  }

  return -1; // Ugh...
}
Where should I use memory barriers in the function above? I believe one just before the mem_write32, and one just before the if statement, which is after the last mem_read32 but outside the loop.

Also, how do I deal with multiple peripherals, i.e.:

Code: Select all

int uart_read(unsigned timeout_us) {
  if (timeout_us == 0) {
    // timeout_us == 0 means retry forever.
    while (!uart_canread()) {
      // nothing
    }
  }
  else {
    const uint64_t timeout = timer() + timeout_us;

    while (!uart_canread()) {
      if (timer() >= timeout) {
        return -1;
      }
    }
  }

  // There's data available in the receive FIFO, return it.
  return mem_read32(BASE_ADDR + AUX_MU_IO_REG) & 0xff;
}
uart_canread will read from the Mini UART, and timer will read from the system timer. Should I just use memory barriers inside those functions, as described in the documentation? I'd like to use as few memory barriers as possible.

Thanks in advance,

Andre

DavidS
Posts: 4334
Joined: Thu Dec 15, 2011 6:39 am
Location: USA
Contact: Website

Re: Memory barriers

Sun Mar 03, 2019 3:03 pm

You should have memory barriers around each and every access to the peripherals and/or mailboxes. Every time you access a peripheral or mailbox, the access should be enclosed by memory barriers; this is to be sure that the data gets out of the data cache and to where it is needed.
RPi = The best ARM based RISC OS computer around
More than 95% of posts made from RISC OS on RPi 1B/1B+ computers. Most of the rest from RISC OS on RPi 2B/3B/3B+ computers

leiradel
Posts: 32
Joined: Wed Feb 13, 2019 10:38 pm

Re: Memory barriers

Sat Mar 09, 2019 12:32 pm

DavidS wrote:
Sun Mar 03, 2019 3:03 pm
You should have memory barriers around each and every access to the peripherals and/or mailboxes.
Yes, my doubt is about the granularity. In the code below I follow the documentation to the letter, so I've added barriers outside the loops (which is sort of after the last read) and also before the only write in the function:

Code: Select all

int mbox_send(void* msg) {
  // Wait until the mailbox is no longer full.
  while (1) {
    const uint32_t value = mem_read32(BASE_ADDR + STATUS1);

    if ((value & UINT32_C(0x80000000)) == 0) {
      // Mailbox not full, exit loop.
      break;
    }
  }

  mem_dmb();

  // Send message to channel 8: tags (ARM to VC).
  const uint32_t msgaddr = (mem_arm2vc((uint32_t)msg) & ~15) | TAGS;
  mem_write32(BASE_ADDR + WRITE1, msgaddr);

  // Wait for the response.
  while (1) {
    while (1) {
      const uint32_t value = mem_read32(BASE_ADDR + STATUS0);
      
      if ((value & UINT32_C(0x40000000)) == 0) {
        // Response arrived, exit loop.
        break;
      }
    }

    const uint32_t value = mem_read32(BASE_ADDR + READ0);

    if ((value & 15) == TAGS) {
      // Correct channel, exit loop.
      break;
    }
  }

  mem_dmb();

  if (((mbox_msgheader_t*)msg)->code == UINT32_C(0x80000000)) {
    return 0; // Success!
  }

  return -1; // Ugh...
}
While this code works in all my tests, I wonder if I should move the barriers inside the loops.

Thanks for the help.

LdB
Posts: 1177
Joined: Wed Dec 07, 2016 2:29 pm

Re: Memory barriers

Sun Mar 10, 2019 1:53 am

The issue isn't huge on the Pi because none of the CPUs are out-of-order ARMs.
It really only matters for speculative reads, for accesses to two peripherals on the AXI bus coming back in a different order, and for sync primitives.
So your code is fine; you don't need to worry about the inside of the loops.
Once the cache is on, expect far more issues with cache coherency than with memory barriers.

leiradel
Posts: 32
Joined: Wed Feb 13, 2019 10:38 pm

Re: Memory barriers

Sun Mar 10, 2019 10:18 pm

LdB wrote:
Sun Mar 10, 2019 1:53 am
Once the cache is on, expect far more issues with cache coherency than with memory barriers.
But if I put the peripherals in a noncacheable page I shouldn't have trouble, right?

LdB
Posts: 1177
Joined: Wed Dec 07, 2016 2:29 pm

Re: Memory barriers

Mon Mar 11, 2019 1:25 am

leiradel wrote:
Sun Mar 10, 2019 10:18 pm
But if I put the peripherals in a noncacheable page I shouldn't have trouble, right?
Yes, but that decision precludes some things.

It depends on how, or indeed whether, you virtualize the peripherals area. Remember that how everything appears is up to your MMU mapping; the Cortex-A53 is designed to virtualize everything, with no exceptions. With the cache on, the memory structure exchanged with the GPU in the mailbox messages also becomes a problem, so you can (i) set up a memory area without caching for the exchanges or (ii) deal with the coherency in a cached memory area by using the cache instructions. Those will be the first areas you encounter, but later on, when you start doing DMA transfers and using memory structures with USB etc., these same issues will crop up, as two different cores will be accessing the same memory blocks.

So you have followed a good guideline, but that guideline is based around a "Linux-like" memory mapping; there are no guarantees if you set things up differently. I worry that you seem to be seeking guarantees that your code will work in any setup, and that simply isn't possible. I left this question alone originally because it is highly complex, depends on what is set up, and is problematic as asked; the only reason I have answered is because of some of the responses.

If you want to mess around, bzt has an example where he virtualized the UART port (play with the cache settings):
https://github.com/bztsrc/raspi3-tutori ... tualmemory
You may also look at trying to bang the fastest GPIO signal out of a Pi and the effect of the cache on that (not just the DMA):
https://github.com/hzeller/rpi-gpio-dma-demo

This harks back to the spider OS thread and trying to write to the screen the fastest way, which depends on what I am allowed to set up and use. Under a Linux setup, screen writing is much, much slower than what I can actually do if I set something up myself, because the Linux setup is not optimal for screen writing. If you are trying to do a hard RTOS it is all wrong; if you are trying to write a general OS it's good .. take your pick :-)

bzt
Posts: 374
Joined: Sat Oct 14, 2017 9:57 pm

Re: Memory barriers

Mon Mar 11, 2019 1:32 pm

Hi,
leiradel wrote:
Sun Mar 10, 2019 10:18 pm
But if I put the peripherals in a noncacheable page I shouldn't have trouble, right?
As LdB wrote, I've played with this a bit, and you can see an example in my tutorials. Noncacheable is not enough; I have found in some doc that you must map the MMIO as outer shareable, and the attridx must point to the MAIR value of 4 (nGnRE). The former is needed so that all cores access the memory the same way, and the latter to pass all accesses through the MMU right away, so that all peripherals get the register reads/writes immediately. With that mapping you don't need barriers.

The MAIR value comes from here: http://infocenter.arm.com/help/topic/co ... DHJBB.html
The outer shareability comes from here and also because that's how the trusted-firmware code does it.
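
To make the two settings concrete, here is a rough sketch of the values involved, assuming AArch64 stage 1 translation with a 4 KB granule and a 2 MB block mapping. The attribute index assignment, the helper names and the 0x3F000000 base used in the note below are illustrative assumptions, not taken from any particular tutorial.

```c
#include <assert.h>
#include <stdint.h>

// MAIR_EL1 attribute encodings (8 bits per index).
#define MAIR_NORMAL_WB    UINT64_C(0xFF) // normal memory, write-back cacheable
#define MAIR_DEVICE_nGnRE UINT64_C(0x04) // device memory, nGnRE
#define MAIR_NORMAL_NC    UINT64_C(0x44) // normal memory, non-cacheable

// Which MAIR slot holds which attribute (arbitrary choice for this sketch).
enum { ATTRIDX_NORMAL = 0, ATTRIDX_DEVICE = 1, ATTRIDX_NC = 2 };

// Stage 1 block descriptor fields.
#define DESC_BLOCK      UINT64_C(1)          // bits[1:0] = 0b01: block entry
#define DESC_ATTRIDX(i) ((uint64_t)(i) << 2) // bits[4:2]: index into MAIR
#define DESC_SH_OUTER   (UINT64_C(2) << 8)   // bits[9:8] = 0b10: outer shareable
#define DESC_AF         (UINT64_C(1) << 10)  // access flag, avoids a fault on first use
#define DESC_PXN        (UINT64_C(1) << 53)  // never execute at EL1
#define DESC_UXN        (UINT64_C(1) << 54)  // never execute at EL0

#define BLOCK_SIZE      UINT64_C(0x200000)   // 2 MB

// Value to load into MAIR_EL1 for the three slots above.
uint64_t mair_value(void) {
  return (MAIR_NORMAL_WB    << (8 * ATTRIDX_NORMAL)) |
         (MAIR_DEVICE_nGnRE << (8 * ATTRIDX_DEVICE)) |
         (MAIR_NORMAL_NC    << (8 * ATTRIDX_NC));
}

// Block descriptor for an MMIO region: device nGnRE, outer shareable,
// kernel read/write only (AP bits left at zero), never executable.
uint64_t mmio_block_desc(uint64_t phys_addr) {
  assert((phys_addr & (BLOCK_SIZE - 1)) == 0);
  return phys_addr | DESC_UXN | DESC_PXN | DESC_AF | DESC_SH_OUTER |
         DESC_ATTRIDX(ATTRIDX_DEVICE) | DESC_BLOCK;
}
```

With these assumptions mair_value() is 0x4404FF, and mmio_block_desc(0x3F000000) is 0x006000003F000605.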

Cheers,
bzt

leiradel
Posts: 32
Joined: Wed Feb 13, 2019 10:38 pm

Re: Memory barriers

Sat Mar 16, 2019 8:03 pm

LdB wrote:
Mon Mar 11, 2019 1:25 am
I worry that you seem to be seeking guarantees that your code will work in any setup, and that simply isn't possible.
I just want my code to be correct under my setup. Currently I'm running with caches disabled, but I want to enable them in the future, so I'll have to configure the MMU.
LdB wrote:
Mon Mar 11, 2019 1:25 am
If you are trying to do a hard RTOS it is all wrong; if you are trying to write a general OS it's good ..
I'm not sure what it'll be to be honest... For now I just want to be able to load programs from the file system and run them in user mode, and have the required syscalls implemented in the kernel.

Thanks for the info.
bzt wrote:
Mon Mar 11, 2019 1:32 pm
Noncacheable is not enough; I have found in some doc that you must map the MMIO as outer shareable, and the attridx must point to the MAIR value of 4 (nGnRE).
Ah nice, thanks for that.
bzt wrote:
Mon Mar 11, 2019 1:32 pm
With that mapping you don't need barriers.
You mean I don't need any of the memory barrier functions? The Broadcom documentation explicitly says I need them. As it doesn't specify under which circumstances, I was assuming the memory barriers would be needed with and without caches enabled, and independently of the MMU setup.

bzt
Posts: 374
Joined: Sat Oct 14, 2017 9:57 pm

Re: Memory barriers

Mon Mar 18, 2019 5:56 pm

Hi,
leiradel wrote:
Sat Mar 16, 2019 8:03 pm
You mean I don't need any of the memory barrier functions? The Broadcom documentation explicitly says I need them. As it doesn't specify under which circumstances, I was assuming the memory barriers would be needed with and without caches enabled, and independently of the MMU setup.
Well, I wasn't clear enough. With that mapping you guarantee that all reads and writes go through to the peripheral immediately, therefore all reads and writes will be in order as expected.

If you mix different peripheral accesses, then you'll still need the barrier. As the BCM2837 doc says on page 7:
It is only when switching from one peripheral to another that data can arrive out-of-order
It is not required to put a memory barrier instruction after each read or write access. Only at those places in code where it is possible that a peripheral read or write may be followed by a read or write of a different peripheral.
But I've found this isn't so strict; there's a good chance your code will work properly without a barrier (because it's not common to mix accesses to different peripherals). This is in contrast to setting the sctlr register, for example, where the barrier is a must, otherwise a crash will happen for sure. Nonetheless, I recommend surrounding every peripheral service routine with a read and a write barrier, as the doc suggests; better safe than sorry.
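
As a rough illustration of that advice applied to the Mini UART and system timer readers from the first post, the sketch below follows the rule quoted from the document: a barrier after the last read of a peripheral, and a barrier before the first write. The registers are simulated with plain variables so the code builds on a host; the register names, the barrier stand-in and the uart_write helper are assumptions.

```c
#include <assert.h>
#include <stdint.h>

// Stand-in barrier (assumption) so the sketch builds on a host. On the Pi
// this would be dmb, or the CP15 operation on ARMv6.
static inline void mem_dmb(void) { __atomic_thread_fence(__ATOMIC_SEQ_CST); }

// Simulated registers (assumption). On hardware these are fixed MMIO addresses.
enum { AUX_MU_IO, AUX_MU_LSR, TIMER_CLO, TIMER_CHI, REG_COUNT };
static volatile uint32_t fake_regs[REG_COUNT];

static uint32_t reg_read(unsigned reg) {
  assert(reg < REG_COUNT);
  return fake_regs[reg];
}

static void reg_write(unsigned reg, uint32_t value) {
  assert(reg < REG_COUNT);
  fake_regs[reg] = value;
}

// Mini UART: bit 0 of the line status register is "data ready".
int uart_canread(void) {
  const uint32_t lsr = reg_read(AUX_MU_LSR);
  mem_dmb(); // read barrier after the last read of the Mini UART
  return (lsr & 1u) != 0;
}

// Mini UART: queue one byte for transmission (no FIFO-full check here).
void uart_write(uint8_t byte) {
  mem_dmb(); // write barrier before the first write to the Mini UART
  reg_write(AUX_MU_IO, byte);
}

// System timer: 64-bit counter split across two 32-bit registers.
uint64_t timer(void) {
  uint32_t hi, lo;
  do {
    hi = reg_read(TIMER_CHI);
    lo = reg_read(TIMER_CLO);
  } while (hi != reg_read(TIMER_CHI)); // retry if the low word wrapped
  mem_dmb(); // read barrier after the last read of the system timer
  return ((uint64_t)hi << 32) | lo;
}
```

Because each routine ends its reads (or begins its writes) with a barrier, a caller that alternates between uart_canread and timer in a polling loop does not need barriers of its own.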

With ISRs it's the same. One ISR only handles one peripheral (the one the IRQ belongs to), but it could interrupt another peripheral's service routine, causing a mixed peripheral access, so as the doc says:
If an interrupt routine reads from a peripheral the routine should start with a memory read barrier. If an interrupt routine writes to a peripheral the routine should end with a memory write barrier.
That's two barriers per ISR tops.
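
A small sketch of that ISR shape, using a system timer tick handler as the example. The registers are simulated with plain variables so it builds on a host; the register layout, the tick interval, the barrier stand-in and the handler name are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

// Stand-in barrier (assumption) so the sketch builds on a host. On the Pi
// this would be dmb, or the CP15 operation on ARMv6.
static inline void mem_dmb(void) { __atomic_thread_fence(__ATOMIC_SEQ_CST); }

// Simulated system timer registers (assumption).
enum { TIMER_CS, TIMER_CLO, TIMER_C1, REG_COUNT };
static volatile uint32_t fake_regs[REG_COUNT];
static volatile uint32_t tick_count;

#define TICK_INTERVAL_US UINT32_C(1000)
#define TIMER_CS_M1      (UINT32_C(1) << 1) // compare 1 match flag

static uint32_t reg_read(unsigned reg) {
  assert(reg < REG_COUNT);
  return fake_regs[reg];
}

static void reg_write(unsigned reg, uint32_t value) {
  assert(reg < REG_COUNT);
  fake_regs[reg] = value;
}

void timer_irq_handler(void) {
  // Read barrier first: the interrupted code may have been in the middle
  // of reading a different peripheral.
  mem_dmb();

  const uint32_t now = reg_read(TIMER_CLO);
  reg_write(TIMER_C1, now + TICK_INTERVAL_US); // schedule the next tick
  reg_write(TIMER_CS, TIMER_CS_M1);            // write 1 to clear the match flag

  // Write barrier last: the interrupted code may write to a different
  // peripheral as soon as the handler returns.
  mem_dmb();

  tick_count = tick_count + 1;
}
```

The same barrier instruction serves as both the read and the write barrier here, as in the mem_dmb() calls earlier in the thread.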

Cheers,
bzt

leiradel
Posts: 32
Joined: Wed Feb 13, 2019 10:38 pm

Re: Memory barriers

Tue Mar 26, 2019 8:23 pm

bzt wrote:
Mon Mar 18, 2019 5:56 pm
Well, I wasn't clear enough. With that mapping you guarantee that all reads and writes go through to the peripheral immediately, therefore all reads and writes will be in order as expected.

If you mix different peripheral accesses, then you'll still need the barrier. As the BCM2837 doc says on page 7:
It is only when switching from one peripheral to another that data can arrive out-of-order
It is not required to put a memory barrier instruction after each read or write access. Only at those places in code where it is possible that a peripheral read or write may be followed by a read or write of a different peripheral.
Ah cool, thank you very much for the info.
