Posts: 19
Joined: Sun Apr 12, 2015 12:10 pm

Data Cache not working with Baking Pi - SOLVED

Fri Oct 02, 2015 8:38 am

Hi all,

I am learning about assembly and the RPi. I’ve made significant headway thanks to Alex Chadwick, David Welch and others; thanks for your postings.

I’ve read the ASM ref man, the 1176 tech man and Broadcom’s manual, and I’m having problems with the following. Hopefully someone knows the “trick” or bit of knowledge that I am missing to make this work:

I cannot get Data Caching to work with Alex Chadwick's "Input 02" program.

Background. I have successfully turned on the MMU, ICache and DCache when using Alex's Screen 04 program, and verified performance increases using a simple str/ldr loop to ram and checking # of clock cycles to complete X number of write/reads.

But when I try to apply the same code to Input02, it locks up. I have tracked it down to the function ReadLine by following the BL instructions from main -> ReadLine (located in "terminal.s"). The issue arises between the BL to KeyboardPoll and the "teq r0,#0" that follows it.

If I turn off the DCache between these two lines everything works fine. If I leave DCache turned on then the program locks with just a solid cursor showing on the screen.

I figured the problem was with the call to KeyboardPoll and had something to do with CSUD (Alex’s USB driver for keyboard I/O). So I reconfigured the page table (which is set up for 1 Meg sections) such that TEX = 0b000, B bit = 0 and C bit = 0, turning off buffering and caching, for sections starting at 1 Meg and above. Then I loaded all of the instructions from CSUD into the 1 Meg section beginning at address 0x100000. Meanwhile, the first 1 Meg section (beginning at address 0x0) was set to TEX = 0b000, B bit = 1, C bit = 1.

My thought was that all of the program except the stuff dealing with USB keyboard I/O would be in a cacheable section of memory and the CSUD USB keyboard I/O stuff would be in non-cacheable memory. Thus it should work. Nope, still locks up in the same manner.
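A minimal sketch of how the two section descriptors described above could be built, assuming the ARMv6 1 MB section format (TEX in bits [14:12], C in bit 3, B in bit 2). The name section_entry, domain 0 and AP = 0b11 (full access) are assumptions for illustration and are not taken from the original program:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build an ARMv6 first-level "section" descriptor.
 * Domain 0 and AP = 0b11 (full access) are assumed. */
uint32_t section_entry(uint32_t phys_base, unsigned tex, unsigned c, unsigned b)
{
    assert((phys_base & 0x000FFFFFu) == 0);   /* sections are 1 MB aligned */

    return phys_base
         | ((uint32_t)(tex & 7u) << 12)       /* TEX[2:0] -> bits [14:12] */
         | (3u << 10)                         /* AP = 0b11 (assumed)      */
         | ((uint32_t)(c & 1u) << 3)          /* C bit                    */
         | ((uint32_t)(b & 1u) << 2)          /* B bit                    */
         | 2u;                                /* 0b10 = section entry     */
}
```

Under these assumptions, section_entry(0x00000000, 0, 1, 1) would describe the cacheable first section and section_entry(0x00100000, 0, 0, 0) the uncached section holding CSUD.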

Now, if I make the 1st section non-cacheable and the rest cacheable, ie the section with CSUD code as cacheable, then it still locks but the cursor goes blank, which means it gets through a little deeper into the program before locking, I’m guessing?

I’ve read, re-read, and am reading again ARM Ref man, ARM 1176 tech man, and the Broadcom 2835 peripheral manual. I cannot figure out what is happening to cause the DCache to lock the program.

I don’t have any handlers set up, so I don’t know if it’s throwing an exception or not, my next step was to write them to try to troubleshoot this further.

I can’t imagine the issue is with the “teq r0,#0” but maybe I’m missing something. Anyone?
Also, how are immediates treated by the cache: are they instructions or data? To be specific, I am referring to an immediate that cannot be encoded in a MOV instruction, for example when you have to use something like: LDR r0,=0x12369, so the assembler puts the number at the end of that block of code in RAM and inserts its address into the LDR instruction. My question is; does the MMU treat that # placed at the end of the block of code as an instruction and load it into the ICache or as Data and load it into the DCache?

Thanks ,
Last edited by 27troadster on Sat Jan 23, 2016 4:26 pm, edited 4 times in total.

Posts: 477
Joined: Sat Apr 20, 2013 6:42 pm
Location: Germany

Re: Data Cache not working with Baking Pi

Sat Oct 03, 2015 9:35 am

27troadster wrote:My question is; does the MMU treat that # placed at the end of the block of code as an instruction and load it into the ICache or as Data and load it into the DCache?
I think into DCache but only if this memory range is cacheable.

If I remember correctly, somebody reported here some time ago how to use CSUD in an environment with the MMU and caches enabled. He modified CSUD so that it cleans/invalidates the data cache before/after the slave DMA operations it does to/from the USB host controller. The problem is the data buffers used for transfers, which must be kept coherent between the ARM CPU and the host controller.

Posts: 19
Joined: Sun Apr 12, 2015 12:10 pm

Re: Data Cache not working with Baking Pi

Mon Oct 12, 2015 8:19 am

Thanks rst!

Your post pointed me in the right direction that eventually led me to be able to make it work....mostly…

The post you referred to is: “memset performance,” by hldswrth on Mon Feb 24, 2014 5:36 pm.

For those interested in doing the same, it's not as straightforward as one might assume. The following are lessons I learned along the way (I’m sure there are other ways to make it work, but this is what I did):
1) The makefile in csud-master needs to be modified. The ‘make’ program treats the first target as the default target and tries to make it; as long as it can make that target, it ignores everything else. The first target, as written, is "all: ", followed by recipes that output some text to the screen. After researching the GNU Make Manual (by Richard M. Stallman, Roland McGrath and Paul D. Smith), I added the line ".DEFAULT_GOAL = device". "device" is a target that causes the "libcsud.a" archive to be built.

2) There are three possible builds, selected by the first couple of lines in the makefile: STANDALONE, LOWLEVEL and DRIVER. The LOWLEVEL and DRIVER builds have external dependencies that Alex discusses in his "readme" file, but I don't know how to implement a wrapper to pass stuff to the 'system', so I built it as STANDALONE and it works.

3) The archive file is built using -std=c99. According to the "info" and "pdf" files for the GNU gcc compiler, c99 does not recognize the keyword "asm". We have to use "__asm__" instead. NOTE: those underscores are actually TWO underscores each, ie: "_ _ a s m _ _" without the spaces of course. It took me a while and a lot of extra reading before I realized this minor point.

4) I also changed the first couple of lines in makefile so it builds the archive for RPI.
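For point 3, a minimal sketch of the spelling that compiles under -std=c99. The empty instruction string with a "memory" clobber acts only as a compiler barrier; it is used here so the example builds on any GCC target, and the function name is an assumption:

```c
#include <assert.h>

/* Hypothetical example: store a value, then stop the compiler from
 * reordering memory accesses across this point. */
int store_then_barrier(volatile int *p, int v)
{
    assert(p != 0);
    *p = v;
    __asm__ volatile ("" : : : "memory");   /* plain "asm" is not a keyword under -std=c99 */
    return *p;
}
```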

The last post in the above mentioned "memset performance" is:

“Final update here. Got non-shared memory with CSUD USB/keyboard driver working with some additional changes in the CSUD code. The simplest option was to use non-shared memory everywhere, and then force a cache clean/invalidate when transferring data from ARM to USB, which I did by updating the HcdTransmitChannel function:

Code:
// Clean and invalidate the data cache so that DMA sees the data and
// we see the updated data.
int cr = 0;
__asm volatile ("mcr p15, 0, %0, c7, c14, 0" :: "r" (cr));
// Rest of code unchanged...
Host->Channel[channel].DmaAddress = buffer;

With the non-shared memory attribute my memset routine completes 100MB in 66ms, or 1.4GB per second, which now sounds like the kind of figure I should be seeing, so thanks for the guidance!

To avoid the cache invalidation I then tried mapping 1MB of shared memory, and updated CSUD to use that for the buffer. I then wasted some time until I realised that the DmaAddress shown above has to be the physical address of the buffer, not the virtual address. My original non-shared buffer happened to be at physical+0x8000000 so it just worked, my shared buffer didn't have such a simple mapping. Now I have that sorted its working fine.”

I was able to get this working by applying the cache clean/invalidate fix. But, for the above mentioned reasons, and because I want to lock a portion of the cache (which would mean changing the clean/invalidate line buried in the HcdTransmitChannel function, not an ideal situation), I’d rather apply hldswrth’s second option of putting the ‘buffer’ in shared memory (which I read to mean caching is turned off for shared sections of memory).

Problem is: I moved all of the .data section for CSUD to the 1 Meg section starting at 0x200000. I made the 2nd and 3rd sections shared and set the bufferable and cacheable bits to 0 in the page table. The 2nd and 3rd sections correspond to addresses 0x100000 through 0x2FFFFF. I verified (at least I think I adequately verified, maybe not) that the buffer was located at an address of 0x20xxxx (where xxxx is some offset that I don’t recall off the top of my head).

I am doing 1:1 mapping for the entire page table, ie for the entire address space.

Does anyone have any insight into how to put the ‘buffer’ into shared memory? And/or how to verify the address of the ‘buffer’?
Does the buffer change location at run time?
Does the address of buffer simply hold a pointer to some other location?
I can’t figure out how hldswrth’s program put his buffer at 0x800 0000?

Any help would be great.


Posts: 477
Joined: Sat Apr 20, 2013 6:42 pm
Location: Germany

Re: Data Cache not working with Baking Pi

Tue Oct 13, 2015 4:15 pm

I think you can do two things here. First, you can define a special DMA (double) buffer in your non-cached region, copy the contents of the write buffer into it before starting DMA, and copy the data back out of it after a read operation completes. Because the "buffer" pointer can point anywhere in CSUD, copying should be the only way to get the data into a non-cached region, but it should be slow. I haven't tried this.
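A minimal sketch of that copy-in/copy-out idea, under stated assumptions: the buffer size, the names bounce_out/bounce_in, and the placement of the array in a non-cached section (which would be done in the linker script) are all hypothetical and not part of CSUD:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 1024u

/* Assumed to be placed in a non-cached 1 MB section by the linker script. */
static unsigned char bounce[BOUNCE_SIZE];

/* Before a DMA write: copy the caller's data into the bounce buffer and
 * return the address that would be handed to the host controller. */
void *bounce_out(const void *src, size_t len)
{
    assert(len <= BOUNCE_SIZE);
    memcpy(bounce, src, len);
    return bounce;
}

/* After a DMA read completes: copy the received data back to the caller. */
void bounce_in(void *dst, size_t len)
{
    assert(len <= BOUNCE_SIZE);
    memcpy(dst, bounce, len);
}
```

The cost is one extra memcpy per transfer, which is the slowness mentioned above.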

Rather than using a special DMA buffer in a non-cached region, I would suggest cleaning and invalidating the cache before/after the transfer, but not the entire cache. Instead you can clean and invalidate only the memory range of the currently used buffer by iterating over the buffer in steps of the data cache line length. The following function does this job. nAddress is the virtual address of the buffer.

Code:

#define DATA_CACHE_LINE_LENGTH		32		// for Raspberry Pi 1

void CleanAndInvalidateDataCacheRange (unsigned nAddress, unsigned nLength) __attribute__ ((optimize (3)));

void CleanAndInvalidateDataCacheRange (unsigned nAddress, unsigned nLength)
{
	nLength += DATA_CACHE_LINE_LENGTH;	// also covers a partial last line

	while (1)
	{
		//  Clean and Invalidate Data Cache Line, using MVA
		__asm__ volatile ("mcr p15, 0, %0, c7, c14,  1" : : "r" (nAddress) : "memory");

		if (nLength < DATA_CACHE_LINE_LENGTH)
			break;

		nAddress += DATA_CACHE_LINE_LENGTH;
		nLength  -= DATA_CACHE_LINE_LENGTH;
	}

	// Data Memory Barrier
	__asm__ volatile ("mcr p15, 0, %0, c7, c10, 5" : : "r" (0) : "memory");
}

BTW you should add 0x40000000 (if L2 cache is enabled, which is normally the case) to the physical address of the buffer before writing it into the DMA address register of the USB host controller, because the controller works with bus addresses of the GPU side.
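To illustrate why the range clean above has to step in whole cache lines, here is a small host-side sketch that counts the 32-byte lines a buffer touches. The helper name is hypothetical; it only does the arithmetic and performs no cache operations:

```c
#include <assert.h>
#include <stdint.h>

#define LINE 32u   /* L1 data cache line length in bytes on the Raspberry Pi 1 */

/* Hypothetical helper: number of cache lines touched by [addr, addr + len). */
uint32_t lines_covering(uint32_t addr, uint32_t len)
{
    if (len == 0)
        return 0;
    assert(addr <= UINT32_MAX - (len - 1u));   /* range must not wrap around */

    uint32_t first = addr & ~(LINE - 1u);               /* line holding the first byte */
    uint32_t last  = (addr + len - 1u) & ~(LINE - 1u);  /* line holding the last byte  */
    return (last - first) / LINE + 1u;
}
```

A 64-byte buffer that starts 4 bytes into a line spans three lines rather than two, so a loop that stops after len / 32 iterations would miss the tail.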

Posts: 161
Joined: Wed Sep 30, 2015 10:29 am
Location: Australia

Re: Data Cache not working with Baking Pi

Fri Oct 16, 2015 9:38 am

Just a little bit of extra info with regard to the RPi (not RPi 2) and cached / shared memory.

Because the RPi shares the L2 cache with the GPU you can mark memory as Cached (Write back) and Shared and pass that memory to the USB DMA buffer (you still need to translate physical to bus addresses) and it will remain coherent without using clean and/or invalidate cache operations.

Be aware, before you go and mark all memory as Shared, that if you use LDREX/STREX anywhere these need to be done in Cached Non Shared memory on the RPi or they fail randomly.

The RPi2 doesn't share the L2 cache with the GPU so none of the above applies and you have to use either clean/invalidate or non cached memory for the USB DMA buffer and LDREX/STREX must be in cached shared memory to be coherent between the cores.
Ultibo.org | Make something amazing

Threads, multi-core, OpenGL, Camera, FAT, NTFS, TCP/IP, USB and more in 3MB with 2 second boot!

Posts: 19
Joined: Sun Apr 12, 2015 12:10 pm

Re: Data Cache not working with Baking Pi

Sat Oct 17, 2015 1:53 pm

thanks rst and Ultibo,

So I guess I don't quite understand L2 and the memory map.

1) I assumed since physical memory stopped at 0x4000.0000 that the MMU memory map (ie page table entries) stopped at 0x4000.0000, but am I wrong on that point?
2) Do the memory attributes that I set in the page table get passed to L2 as well?
3) Should I use the option: TEX = 0b1xx, B=a,C=a, such that xx sets the attributes of L2 and 'aa' sets the attributes of L1 (per the ARM 1176 tech man)?
4) If I set memory attributes for memory > 0x4000.0000, for example I set 0x4200.0000 for cacheable, then how does that map to a physical address since physical addresses stop at 0x4000.0000? (per the BCM2835 manual)
5) Or...does 0x4200.0000 map to Bus address 0x4200.0000, which really means RAM address 0x200.0000 that is cacheable in L2 (with coherency) and if it were 0xC200.0000 that would mean RAM address 0x200.0000 that is non-cacheable in L2?

Am I in the ballpark or do I have it all screwed up?

Ok, did some more reading and even more confused....
The Broadcom 2835 manual states that RAM addresses from the ARM physical address space are mapped to VC/GPU bus addresses starting at 0xC000.0000 (no caching).
So why would I need to add 0x4000.0000 to the 'buffer' address that I pass to the DMA? (not saying that's wrong, I just don’t understand why)
And how do we control L2? ARM 1176 talks about inner and outer cache, which they recommend would refer to L1 and L2 respectively (for only 2 levels of cache), but that it is implementation defined. So how does the Pi work? Does it set the two MSB’s (ie, 0x0nnn.nnnn –vs- 0x4nnn.nnnn –vs- 0x8nnn.nnnn –vs- 0xCnnn.nnnn) based on the ARM MMU L2 cache settings? Or if I use the page tables to map to addresses > 0x4000.0000 does this force the VC/GPU to put it in RAM at 0x4000.0000 (or 0x8000.0000, etc)?

I guess the real questions are:
1) What VC/GPU bus address does RAM that is located at 0x0000.0000 – 0x2000.0000 (ie the first 512 megs) in the physical address space that the ARM MMU sees get mapped to? (because it’s the VC/GPU bus address that we need to pass to the Broadcom chip DMA, as I understand it) For example: if I put a ‘buffer’ at ARM physical address 0x1000, and I want the Broadcom chip DMA to load it with some data, do I pass the DMA controller a “destination_address” of 0xC000.1000? Or 0x4000.1000, as (if I read his post correctly) suggested by ‘rst’ above?
2) How can we control L2 cache on the Pi? (1st gen. B+)


Posts: 477
Joined: Sat Apr 20, 2013 6:42 pm
Location: Germany

Re: Data Cache not working with Baking Pi

Sun Oct 18, 2015 9:02 am

There are four alias regions in the address space of the VC/GPU (see the BCM2835 manual pg. 5). These map the same SDRAM region but with different behaviour. The "'4' Alias - L2 cache coherent (non allocating)" region at 0x4000.0000 should be used for DMA if the config.txt option "disable_l2cache" is not used. Otherwise use the "'C' Alias - direct uncached". I learned this some time ago and can only suggest it here.
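The rule above can be written down as a small conversion helper. This is only a sketch: the function and macro names are assumptions, and whether the L2 cache is enabled has to be known from config.txt rather than detected here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BUS_ALIAS_L2_COHERENT 0x40000000u   /* '4' alias: L2 cache coherent (non allocating) */
#define BUS_ALIAS_UNCACHED    0xC0000000u   /* 'C' alias: direct uncached */

/* Hypothetical helper: ARM physical SDRAM address -> VC/GPU bus address for DMA. */
uint32_t arm_to_bus(uint32_t phys, bool l2_enabled)
{
    assert(phys < 0x40000000u);   /* must not already carry an alias */
    return phys | (l2_enabled ? BUS_ALIAS_L2_COHERENT : BUS_ALIAS_UNCACHED);
}
```

With this sketch, a buffer at physical 0x1000 would be passed to the controller as 0x40001000 when L2 is enabled and as 0xC0001000 when "disable_l2cache" is set.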

Posts: 19
Joined: Sun Apr 12, 2015 12:10 pm

Re: Data Cache not working with Baking Pi

Tue Nov 24, 2015 10:45 pm


I added the following to csud-master/source/hcd/dwc/designware20.c

void setDatabuffer(u8* buffer) {
    databuffer = buffer;
}

Then I made a "databuffer" in main.s, with 0x4000 bytes allocated (because the databuffer in designware20.c pointed to a section of memory that csud calls the "heap" and it is 0x4000 long). After USBInitialize is called in the main function, I added the following code:

/*
 * Set up the databuffer in CSUD (for the USB keyboard). Adding 0x4000.0000 to the
 * dataBufferInMain memory location causes the BCM2835 to put this memory location
 * in the '4' Alias => L2 cache coherent (non allocating).
 */
ldr r0,=dataBufferInMain
add r0,r0, #0x40000000
bl setDatabuffer

I probably could have added the 0x4000.0000 alias to the "heap" in CSUD, but I don't know the C language very well.

thanks all for the help!


Posts: 19
Joined: Sun Apr 12, 2015 12:10 pm

Re: Data Cache not working with Baking Pi - UNSOLVED

Sun Nov 29, 2015 5:25 am

So I thought I had this “SOLVED”, but no, hence the title change to “UNSOLVED”.

The above solution works IF data caching is turned on after the first call to ReadLine in the file “terminal.s”. If I enable data caching before this, the computer locks up somewhere in the ReadLine function.

I found that I can get it to work with data caching turned on before the call to “ReadLine” if I insert a clean/invalidate data cache instruction, specifically: __asm__ volatile ("mcr p15,0,%0,c7,c14,0" :: "r" (0)); in the right place within CSUD, as follows:

Tracing the issue through the code:
ReadLine (in terminal.s) calls:
KeyboardUpdate (in keyboard.s) which calls:
KeyboardPoll (in keyboard.c, this and all the rest are in CSUD) which calls:
HidReadDevice (in hid.c) which calls:
HidGetReport (in hid.c)

If I clean/invalidate the data cache only once prior to the call to “HidGetReport” everything works fine.

Digging deeper through the code, only cleaning/invalidating once no longer works, I suspect due to multiple calls to the same code for the various devices and hubs. So cleaning/invalidating every time, I can get the program to work through the following:

HidGetReport (in hid.c) calls:
UsbControlMessage (in usbd.c) which calls:
HcdSubmitControlMessage (in designware20.c) which calls:
HcdChannelSendWait (in designware20.c)

In this function I’ve narrowed the issue down to a “trial” block that lets the code try to “talk” to the channel 3 times, followed by a “do-while” block. If I clean/invalidate the data cache prior to the “try” block, everything works fine. If I clean/invalidate the cache after the do-while block, it doesn’t work and the computer locks same as before.

Digging deeper, the next function called is:
HcdPrepareChannel (in designware20.c), if I clean/invalidate within this function, immediately upon entry, then everything works fine. Also, if I clean/invalidate after this function returns to HcdChannelSendWait, everything works. Problem is the above mentioned loops allow for multiple calls to HcdPrepareChannel, so I really can’t say for sure where the clean/invalidate needs to go in order to keep digging through the function calls, therefore:

Due to the “try” and “do-while” loops I’m stuck.

Other things I’ve tried, suspecting a DMA issue, I changed the code in MemoryAllocate such that all memory issued to CSUD has the 0x4000.0000 alias. => didn’t work.

I put all of CSUD in non-bufferable, non-cacheable, shared memory => didn’t work.

I did both the above => didn’t work.

In “memset performance,” by hldswrth on Mon Feb 24, 2014 5:36 pm, hldswrth stated he got it to work, but he didn’t say when he was turning the data caching on.

Does anyone know or have suggestions on how to get CSUD to work within Baking Pi’s Input02 program by turning data caching on at the beginning of main?

My motivation for doing this is because I want to lock down vital routines into the data cache before the main program runs. In order to lock data caches, I need to turn data caching on. I could do some work-around and only clean/invalidate the non-locked caches, but I’d rather find the source of the problem, seems like every time I try a band-aid solution, it comes back to haunt me down the road.

Any and all comments would be welcome, thanks,


P.S. Instead of allocating memory in “main”, then adding 0x4000.0000 to the memory address and passing it to the databuffer in designware20.c, I found it is better to just add 0x4000.0000 to the memory location returned from MemoryAllocate when it is called in the function HcdStart in the file designware20.c. Hope that helps anyone following this and trying to do the same.

Posts: 19
Joined: Sun Apr 12, 2015 12:10 pm

Re: Data Cache not working with Baking Pi - UNSOLVED

Mon Nov 30, 2015 10:57 pm


I’ve tracked the issue down to the call to HcdChannelTransmit. This is the same routine that was the issue before and it’s the only one that I can find that uses the DMA.

Previously I thought I had fixed the issue by adding 0x4000.0000 to the address of databuffer and saving the new address in databuffer. It appeared this solved the problem because the program worked if I turned Data Caching on, ie. Keyboard input was being read by the program and echoed onto the screen. However, I found out that although the program worked, there were errors occurring in the background.

I figured this out when I made a routine to output the “LOG” print information from CSUD and set the TYPE to LOWLEVEL in the CSUD makefile. What I found was with Data Caching turned off, there were no errors “talking” to the USB devices. But with Data Caching turned on, there were many screens worth of errors, but eventually all the errors stopped and the program returned to the cursor and would be ready for more input.

CSUD is written to deal with transmit errors and it will keep polling until it is successful. So as long as the transmission was eventually successful, everything appeared to work fine, even if it took 10, 20 or 100 tries before it worked, which of course would be imperceptible to humans.

I also wrote a routine that outputted the addresses sent to the DMA in HcdChannelTransmit. What I found was most of the calls to the DMA used the 0x4nnn.nnnn address of the databuffer, but every 5th call the DMA was passed an address from the stack (address: 0x17nnn, due to where I put my stack).

The peculiar thing is that if Data Caching is turned on before the first call to ReadLine, then roughly 1 of every 3 calls (in an irregular pattern) to the DMA uses an address from the stack. But if I turn on Data Cache after the first call to ReadLine, then it’s every 5th call that gets an address from the stack.

I believe the program appears to work when Data Caching is turned on after the first call to ReadLine, but doesn’t work if turned on before, because the number of stack addresses sent to the DMA is much fewer in the first scenario.

Attempted solution: I modified the code such that a 0x4000.0000 alias was ORed into every address that was sent to the DMA. I did not modify the actual address, I only changed the address just before sending it to the DMA. This did not work. I think this is because even though 0x4001.0000, for example, will return the same stored value as 0x1.0000, they have different L2 behavior, hence the “alias”.

I think the use of the stack is screwing things up. The compiler uses the stack for local variables, so I have the daunting task of finding where these local variables are declared in the long and numerous functional call chains within CSUD. The use of the stack is also why simply having all memory returned by MemoryAllocate have the “4” alias, did not work.

In the meantime, I will try to move the entire stack to 0x4nnn.nnnn, and see if that works.

Any suggestions / ideas are welcome and would be appreciated,


Posts: 19
Joined: Sun Apr 12, 2015 12:10 pm

Re: Data Cache not working with Baking Pi - UNSOLVED

Sat Jan 23, 2016 4:25 pm

Solved (at least solved good enough to work without errors):
Short answer: To use data caching with CSUD we need to do two things (details are included below):
1) Put the entire stack at 0x4nnn nnnn.
2) Put the databuffer at alias 0x4nnn nnnn.

Long answer: So I spent several days trying to figure out just where the local variables in CSUD were being declared (CSUD is written in C, C puts locally declared variables on the stack). Well...no luck. It is a nightmare because:
1) the call chains are very long with a lot of branches
2) calls that pass a variable for the data buffer don't always end up at HcdChannelTransmit
3) the passed argument for "buffer" isn't always used as the data buffer, in other words, the same function calls that will eventually cause a call to HcdChannelTransmit are sometimes used for different purposes, ie. to just get information about a USB device or hub and not actually to talk to the device.

So I gave up on that and just gave the entire stack the 0x4 alias. I did this in main by declaring a 0x1000 long space in memory in my .section .data region. The linker puts it in memory at the next available memory location when it links the program. Then, one of the very first things I have 'main' do is to add 0x4000 0000 to top of the stack address and store it in the stack pointer register:

@ allocate a spot for the stack:
.section .data
.align 2
.skip 0x1000
stackTop:
@ then in main:
ldr r0, =stackTop
add sp, r0, #0x40000000

To put the databuffer at the 0x4 alias:
in designware20.c, in the function "HcdStart()", change the call to MemoryAllocate from:
if ((databuffer = (MemoryAllocate(1024))) == NULL)
to:
if ((databuffer = (MemoryAllocate(1024) + 0x40000000)) == NULL)

Done. Of course this means that all accesses to the stack will have coherency in L2. I'm not sure what this does to execution times, since I don't know how Broadcom implements the L2 cache coherency policy. Unlike 'C', the routines I write in assembly don't put variables on the stack, so that's not an issue. However, I do use push and pop quite often when entering and exiting functions, so there may be a performance penalty there; I haven't benchmarked it. I did write code that used two different stacks, one used by CSUD and another used by everything else. For example, say stack1 was in the SP register and I wanted to switch to stack2: I would store SP into a memory location that holds the current position in stack1, then retrieve the stored current position in stack2 and put that into the SP register. Make a call to CSUD, then switch back to stack1. It worked fine, but was cumbersome and not very elegant, so I deleted it and I'm just using one stack at the 0x4nnn nnnn alias.

Hope this info helps anyone trying to use CSUD with data caching turned on.

Posts: 19
Joined: Sun Apr 12, 2015 12:10 pm

Re: Data Cache not working with Baking Pi - SOLVED

Fri Jan 06, 2017 1:06 am


Bottom Line Up Front:
With respect to CSUD, neither the address passed to the DMA nor the local variables on the stack (some of which use the DMA) are affected by L2 caching. It seems only L1 caching affects the DMA and the local variables on the stack.

When data caching is enabled, CSUD stops working. This is because problems arise with CSUD’s ability to “talk” to USB devices when data caching is used. The caching issues affect both the data buffer passed to the DMA engine and local variables, which are placed on the stack, that are passed to the DMA engine.

Findings overview (After extensive testing explained below):
DMA engine:
The address of the data buffer passed to the DMA engine needs to be in a section of memory in which L1 cache is turned off. Otherwise, the data sent from the DMA to the data buffer will only update the data in memory and will not be seen by the ARM, because the ARM will be reading old data that is in the L1 cache. This causes the ARM to believe nothing has changed on the keyboard, so it behaves as though no one typed on the keyboard.

Solution: One solution is to clean/invalidate the data cache as described by ‘rst’ above. Another solution is to place the data buffer passed to the DMA engine in a section of memory where L1 caching is turned off using page table attributes.
L2 caching does not seem to affect the DMA. As long as L1 caching is 'off', any alias (0x00000000, 0x40000000, 0x80000000, or 0xC0000000, per BCM2835 pg 5) and any L2 cache attributes set using the TEX bits in the page table (per ARM 1176 Tech man, table 6-3, pg 6-16) will allow the DMA to update the data buffer, and the ARM processor will receive the updates.
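The stale-read behaviour described above can be illustrated with a toy model. This is not how the ARM1176 cache is implemented; it is a one-byte simulation with made-up names, meant only to show why a DMA write goes unnoticed until the cached copy is invalidated:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one byte of SDRAM plus the L1 copy of it. */
typedef struct {
    uint8_t ram;      /* what is actually in memory        */
    uint8_t cached;   /* what the L1 data cache holds      */
    bool    valid;    /* true if the cached copy is usable */
} toy_mem;

/* CPU read: served from the cache when the line is valid. */
uint8_t cpu_read(toy_mem *m)
{
    assert(m != 0);
    if (!m->valid) {          /* miss: fill the line from RAM */
        m->cached = m->ram;
        m->valid  = true;
    }
    return m->cached;
}

/* DMA write: goes straight to RAM and never touches L1. */
void dma_write(toy_mem *m, uint8_t value)
{
    assert(m != 0);
    m->ram = value;
}

/* Invalidate: forces the next CPU read to go back to RAM. */
void invalidate(toy_mem *m)
{
    assert(m != 0);
    m->valid = false;
}
```

In this model, a cpu_read() after dma_write() keeps returning the old byte until invalidate() is called, which corresponds to the keyboard report that never appears to change.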

Stack: (for locally declared variables in CSUD that are passed to the DMA)
Background: CSUD is written in the 'C' language. In 'C', locally declared variables are placed on the stack. In CSUD there are many locally defined variables that are used to initially gather information from the USB devices. If the stack is placed in a section of memory that is L1 "write-back cached, no allocate on write" (ARM 1176 Tech Man, pg 6-15 & 6-16 refers), then the local variables do not get updated by the USB device, and CSUD runs in an infinite loop as it keeps trying to get info about the USB devices detected.
L1 caching can be enabled for the memory location that contains the stack, but it must be “write-through cached, no allocate on write” (ARM 1176 Tech Man, pg 6-15 & 6-16 refers).
L2 caching, both the TEX attributes for L2 caching and the aliases, has no effect on the stack for CSUD operation.

Solution 1: Put the stack in a section of memory that is designated as anything except L1 "write-back cached, no allocate on write".

Solution 2: Find all the local instances in CSUD that are passed to the DMA and put them in a section of memory that is not L1 "write-back cached, no allocate on write". After spending days trying to decipher the local variables, I gave up, so this solution is still UNSOLVED.

Detailed analysis:
As I now understand it: for simple 1 Mbyte sections, there are 4096 page table entries. Each page table entry corresponds to 1 Mbyte of memory.
The reason there are 4096 entries is that this covers all of the 32-bit address space. To explain: in 32-bit addressing, the address range goes from 0 to 4,294,967,295 (2^32 - 1). Each Mbyte contains 1,048,576 bytes. Therefore, there are 4096 1-Mbyte sections (1,048,576 * 4096 = 4,294,967,296).
When the ARM processor wants to access a memory address, it sends that address to the MMU which looks at the page table entry that corresponds to the Mbyte section of the address passed from the ARM to the MMU. For example: ARM passes the address 0x8000 to the MMU, this is located within the first Mbyte of memory, therefore, the MMU will look at the first page table entry. If the ARM passed the address: 0x108000 to the MMU, this will be in the second Mbyte of memory, therefore, the MMU will look at the second page table entry. So forth and so on.
Typically when we set up the page table entries to allow data caching, we set all memory below 512 Mbytes to cacheable and all entries above 512 Mbytes are set to “device, shared” which means caching is turned off. This is key to understanding the ‘4’ alias confusion.

The reason the addition of 0x40000000 to the data buffer’s address seemed to be the solution was not the ‘4’ alias itself, but rather that when the address was sent from the ARM to the MMU, the MMU looked at the page table entry corresponding to the 0x400-th Mbyte section. In other words, let's say the address of the data buffer passed to the DMA engine was initially 0x1234. This is in the first Mbyte of memory, so the MMU looks at the first entry in the page table and sees that caching is enabled in this section, thus CSUD will not work. The initial thinking, based on the BCM2835 manual, was that there was an issue with L2 caching and the DMA, so a ‘4’ alias was added to the data buffer address, resulting in 0x40001234. Doing this allowed CSUD to work. My false reasoning was that L1 cache was still enabled because 0x40001234 accesses the exact same memory location as 0x1234, therefore the issue must be in L2 cache, which was solved by the ‘4’ alias. However, this reasoning is wrong. When the ARM passed 0x40001234 to the MMU, the MMU no longer looked at the first page table entry, it looked at page table entry 1024 (0x400 = 1024). The memory attributes for page table entry 1024 were inner and outer caching off, device, shared. Thus the ‘4’ alias not only affected L2 per the BCM2835 manual pg 5, but also turned off caching due to the way I set up the page table.
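The lookup described above is just a shift by 20 bits. A minimal sketch, with hypothetical names, that reproduces the numbers used in the text:

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_SHIFT 20u                               /* 1 Mbyte = 2^20 bytes       */
#define NUM_SECTIONS  (1u << (32u - SECTION_SHIFT))     /* 2^32 / 2^20 = 4096 entries */

/* Hypothetical helper: index of the first-level page table entry the MMU
 * consults for a given address when 1 Mbyte sections are used. */
uint32_t section_index(uint32_t virt)
{
    uint32_t idx = virt >> SECTION_SHIFT;
    assert(idx < NUM_SECTIONS);
    return idx;
}
```

Here 0x1234 and 0x8000 select index 0, 0x108000 selects index 1, and 0x40001234 selects index 1024, which is why the '4' alias ended up picking a different set of memory attributes.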

First an assumption: L1 is “inner” and L2 is “outer”.

The MMU was set up with flat mapping, in other words ARM addresses mapped to the same physical address, ie ARM address 0x1234 mapped to physical address 0x1234.
The confusing part of all this is that there are multiple ways to affect L1 and L2 caching which are not always readily apparent. As described above, I thought adding the ‘4’ alias was only affecting L2 caching, but in fact L1 cache was also affected (albeit in a roundabout manner).
To test what actually affects what, I placed the data buffer for the DMA at memory location 0x300000, which is in the 4th Mbyte of memory. This corresponds to the 4th page table entry, which affects memory addresses 0x300000 through 0x3FFFFF. To separate the controls for L1 and L2 caching, I set the TEX [14] bit to ‘1’. This causes TEX bits [13:12] and the C,B bits [3:2] to behave according to table 6-3 in the ARM 1176 tech man, pg 6-16, allowing independent control of L1 and L2 caches. (With TEX[14] set to ‘0’, the C & B bits control both L1 and L2 cache.)
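A sketch of where those fields sit in a 1 Mbyte section descriptor when TEX[14] is set. The function name, domain 0 and AP = 0b11 are assumptions; the meaning of each 2-bit value is given by table 6-3 of the tech man and is deliberately not encoded here:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: section descriptor with bit 14 (TEX[2]) set, so that
 * bits [13:12] and the C,B bits [3:2] carry two separate 2-bit policy fields. */
uint32_t section_entry_split(uint32_t phys_base, unsigned outer, unsigned inner)
{
    assert((phys_base & 0x000FFFFFu) == 0);   /* sections are 1 MB aligned */
    assert(outer < 4u && inner < 4u);

    return phys_base
         | (1u << 14)                  /* TEX[2] = 1: split inner/outer control */
         | ((uint32_t)outer << 12)     /* bits [13:12]: L2 (outer) field        */
         | (3u << 10)                  /* AP = 0b11 (assumed)                   */
         | ((uint32_t)inner << 2)      /* bits [3:2] (C,B): L1 (inner) field    */
         | 2u;                         /* 0b10 = section entry                  */
}
```

Stepping outer and inner through 0..3 for the section at 0x300000 would enumerate the sixteen combinations that such a test sweep covers.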
Then, with L1 set to strongly ordered memory, bits [3:2] = 0b00, I tested every combination of L2 caching using TEX bits [13:12]. Next I changed L1 to “device, shared”, bits [3:2] = 0b01, and again tested every combination of L2 caching. CSUD worked correctly in every one of these scenarios. (note: the alias in these cases was ‘0’)
Next I added the ‘4’ alias by remapping in the page table, then ‘8’ and ‘C’. None of these aliases had any effect and CSUD worked.
I repeated all of the above tests with L1 cache set to “Write-Through cached”, bits [3:2] = 0b10, and again with L1 cache set to “Write-Back cached”, bits [3:2] = 0b11. CSUD failed in all cases. No combination of aliases (0, 4, 8 or C) or L2 caching, including off, allowed CSUD to function when either write-back or write-through L1 cache was turned on.
To further verify my thoughts on how the MMU works, I added the ‘4’ alias to the address of the data buffer and changed the L1 and L2 attributes for page table entry 1024. With L1 caching off, L2 attributes had no effect and CSUD worked fine. But when any type of L1 caching was turned on, CSUD stopped working even though it had the ‘4’ alias.

I did all of the above tests with the stack as well. I placed the stack at address 0x280000, halfway through the 1-Mbyte section that extends from 0x200000 to 0x2FFFFF, to ensure that moving downward through the stack would not cause it to reach the page table entry for the next lower Mbyte section. I got the same results as with the data buffer, with one exception: CSUD still worked with L1 cache set to “write-through cached” but not with “write-back cached”. So it seems the stack (local variables for CSUD) can be in an L1 cached section of memory as long as write-through is set as the cache policy. No combination of L2 caching or aliases on the stack address had any effect on the operation of CSUD; only L1 affected it.

The address passed to the DMA only has to have L1 caching turned off and is not affected by L2 caching.

If you have any comments, either confirming or disagreeing with my findings, please let me know. This is a complicated subject and I probably overlooked some nuance. I also hope this helps anyone trying to use CSUD with data caching turned on, or any application of L2 cache control.

