It depends on the vendor. The bus is likely 64 bits wide, but it has byte enables, so you can address any byte in the address space. Control and status registers generally don't decode on a byte basis, as it rarely makes sense. I would assume that on the Raspberry Pi you should use 32-bit accesses. Byte accesses don't save you anything compared with word accesses and can sometimes cost you more cycles; 32-bit variables and 32-bit accesses (or 64-bit) are the cheapest and fastest.
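As a rough illustration of sticking to 32-bit accesses even when only one byte of a register needs to change, here is a minimal sketch. The function name `write_byte_field` is made up for this example, and it assumes the register tolerates a read-modify-write; on real hardware `reg` would point at a device register, but any ordinary `uint32_t` works for trying it on a host.

```c
#include <stdint.h>

/* Hypothetical helper: update one byte lane (0..3) of a 32-bit register
 * using one 32-bit read and one 32-bit write, instead of a byte store. */
void write_byte_field(volatile uint32_t *reg, unsigned byte_index, uint8_t value)
{
    unsigned shift = (byte_index & 3u) * 8u;   /* bit offset of the byte lane */
    uint32_t word  = *reg;                     /* single 32-bit read */

    word &= ~((uint32_t)0xFFu << shift);       /* clear the target byte */
    word |= (uint32_t)value << shift;          /* insert the new byte */

    *reg = word;                               /* single 32-bit write */
}
```

Whether the compiler actually emits a 32-bit load and store for this is exactly the kind of thing worth checking in the disassembly.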
You most certainly can, and you will find many examples that use volatile uint32_t*. My examples don't, and I have stated the reasons here and there: I have many times gotten gcc (and other compilers) to fail to generate the right instruction for volatile uint32_t solutions. Further, I am willing to sacrifice the extra cycles of an abstraction layer for memory/register access, for many reasons. Mostly these have to do with writing a driver once that can be used on a host talking to simulated hardware, then on the processor in the simulation, then later on silicon, and also from the host through an interface (serial, JTAG, PCIe, USB, etc.), never having to rewrite the core application, only change the abstraction layer. There is also the benefit of using exactly the right instruction to cause the right bus cycle, rather than hoping the compiler will. And it is very easy to inline the abstraction if you want to recover the speed.
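To make the abstraction-layer idea concrete, here is a minimal sketch of what such a layer could look like when the backend is simulated hardware on a host. The names `PUT32`, `GET32`, `reg_set_bits`, `SIM_BASE` and `SIM_WORDS`, as well as the base address, are assumptions made up for this illustration and do not describe any real device. On silicon, the two accessor functions would typically be replaced by a few lines of assembly (or by code that forwards the access over serial, JTAG, etc.), while the driver code stays untouched.

```c
#include <stdint.h>

/* Simulated register file standing in for real hardware (illustrative only). */
#define SIM_BASE  0x40000000u   /* made-up base address */
#define SIM_WORDS 16u           /* made-up number of 32-bit registers */

static uint32_t sim_regs[SIM_WORDS];

/* Abstraction layer: the only two functions that touch the "hardware".
 * Swapping these out retargets every driver built on top of them. */
void PUT32(uint32_t addr, uint32_t value)
{
    sim_regs[((addr - SIM_BASE) / 4u) % SIM_WORDS] = value;
}

uint32_t GET32(uint32_t addr)
{
    return sim_regs[((addr - SIM_BASE) / 4u) % SIM_WORDS];
}

/* Driver code: written once, knows nothing about how accesses are performed. */
void reg_set_bits(uint32_t addr, uint32_t mask)
{
    PUT32(addr, GET32(addr) | mask);   /* 32-bit read-modify-write */
}
```

A call such as `reg_set_bits(SIM_BASE + 0x1C, 1u << 16)` behaves the same from the driver's point of view regardless of which backend is linked in.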
I have also watched folks have to refactor a lot of code to add an abstraction layer after the fact (for various reasons) when moving to a new platform or going through a new interface. It was painful for them every time.
No, there is nothing special about these compilers or this platform; the volatile uint32_t* approach will give you the same or a similar experience on this platform as on any other.