LizardLad_1
Posts: 126
Joined: Sat Jan 13, 2018 12:29 am

Bare metal malloc

Fri Jul 06, 2018 11:33 pm

Hi all,

I had an attempt at creating a malloc function, only I'm not sure whether I have made a working implementation or a disaster waiting to happen. Here is what I have written.

Code: Select all

#include <stddef.h>     /* for size_t and NULL */

/* Globals shared with the rest of the kernel.
   MMIO_BASE (the peripheral base address) is defined elsewhere in the project. */
unsigned char *StartOfProgram;
unsigned char *EndOfProgram;
unsigned long TotalSpaceAvailable;
unsigned long ChunkSize;
unsigned long NumberOfChunks;

void dynamic_memory_alloc_init_stage_1()
{
        extern unsigned char _start, _end;      /* provided by the linker script */

        StartOfProgram = &_start;
        EndOfProgram = &_end + 512;             /* leave a little headroom past the image */
        TotalSpaceAvailable = (unsigned char *)MMIO_BASE - EndOfProgram;
        ChunkSize = 1000;
        NumberOfChunks = TotalSpaceAvailable / ChunkSize;
}

void dynamic_memory_alloc_init(char allocated[])
{
        for(unsigned long i = 0; i < NumberOfChunks; i++)
        {
                allocated[i] = 0;               /* mark every chunk as free */
        }
}

char *malloc(size_t amountToAllocate, char allocated[])
{
        //Remember to offset all allocations by EndOfProgram
        //Remember to check that no chunk end hits MMIO_BASE
        if(amountToAllocate > ChunkSize)
        {
                return NULL;                    /* requests larger than one chunk are not supported */
        }
        for(unsigned long i = 0; i < NumberOfChunks; i++)
        {
                if(allocated[i] == 0)
                {
                        allocated[i] = 1;       /* claim the first free chunk */
                        unsigned char *x = EndOfProgram + i * ChunkSize;
                        if(x + ChunkSize <= (unsigned char *)MMIO_BASE) /* compare addresses, not the byte at x[ChunkSize] */
                        {
                                return (char *)x;
                        }
                        else
                        {
                                return NULL;
                        }
                }
        }
        return NULL;
}
I haven't yet implemented the MMU code from the thread I started about writing to the framebuffer being slow. I'm also unsure whether I will have to change my code to be compatible. If you see any problems here, please tell me. Also, I'm not sure how to make sure each memory location returned by malloc is memory aligned.

Arjan
Posts: 256
Joined: Sat Sep 08, 2012 1:59 pm

Re: Bare metal malloc

Sat Jul 07, 2018 10:36 am

Here are some bare metal examples of malloc:

https://github.com/rsta2/circle/blob/ma ... /alloc.cpp

https://github.com/vanvught/rpidmx512/b ... c/malloc.c

The malloc is independent of the MMU.
http://www.raspberrypi-dmx.org/
Open Source DMX/RDM/MIDI/OSC/Art-Net/sACN solutions

bzt
Posts: 183
Joined: Sat Oct 14, 2017 9:57 pm

Re: Bare metal malloc

Sat Jul 07, 2018 1:22 pm

Well, that's not entirely true; malloc is not independent of the MMU, not at all.

Without MMU: you use the free memory between EndOfProgram and MMIO_BASE. You have a fixed amount of memory to play with, and you just record which memory areas are allocated and which are free. The linked code examples are good for this.
Keywords: fixed-size memory buffer, single-level allocator
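
Just to illustrate the idea (the names here are made up for the example, not taken from any of the linked code), such a single-level allocator over a fixed buffer can be as simple as one flag per chunk, which also makes free trivial:

Code: Select all

#define CHUNK_SIZE 1024
#define NUM_CHUNKS 4096                   /* covers the fixed region chosen at init */

static unsigned char *heap_base;          /* e.g. EndOfProgram, set once at startup */
static char chunk_used[NUM_CHUNKS];       /* 0 = free, 1 = allocated */

void *chunk_alloc(unsigned long size)
{
    if (size == 0 || size > CHUNK_SIZE)
        return (void *)0;                 /* this simple version hands out one chunk max */
    for (unsigned long i = 0; i < NUM_CHUNKS; i++) {
        if (!chunk_used[i]) {
            chunk_used[i] = 1;
            return heap_base + i * CHUNK_SIZE;
        }
    }
    return (void *)0;                     /* buffer exhausted */
}

void chunk_free(void *ptr)
{
    unsigned long i = ((unsigned char *)ptr - heap_base) / CHUNK_SIZE;
    if (i < NUM_CHUNKS)
        chunk_used[i] = 0;                /* mark the chunk free again */
}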

With MMU: you'll need at least two memory allocators. One is responsible for allocating and freeing page frames (4K, 64K etc., depending on your MMU configuration) and is often called the PMM (Physical Memory Manager). The other is a higher-level allocator, often called the VMM (Virtual Memory Manager), which can allocate as little as 8 or 16 bytes and starts with an empty memory buffer. This one is responsible for keeping track of allocated and free memory only within that resizable buffer. When it needs to allocate and the buffer is not sufficient (too small, full or fragmented), the higher-level allocator must call the page frame allocator to allocate a frame (or more) and map it (them) at the end of the buffer. Typically this is done via an 'sbrk' or 'mmap' syscall, but in a bare metal app it could be a direct function call as well. This is how dlmalloc, jemalloc and others work. Most of them never free memory in the sense that they only record free memory in the buffer (at VMM level), but never unmap it and free the page frames (at PMM level). This is of course a waste of RAM, but it also means that subsequent malloc calls do not need to call the PMM allocator again. It is also worth mentioning that on systems which implement supervisor and user space, the PMM always runs in supervisor mode (EL1) and needs exclusive access to the RAM allocation data, while the VMM could run in user space (at EL0), possibly with several instances, each using its own buffer allocation data.
Keywords: dynamically sized buffers (one per address space), multi-level allocators (PMM + VMM)
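
As a rough sketch of that interaction (all names here are hypothetical, this is not taken from dlmalloc or jemalloc): the higher-level allocator only bothers the PMM when its own buffer runs out:

Code: Select all

#define PAGE_SIZE 4096

/* Hypothetical PMM entry point: allocates one physical frame and maps it at
   the given virtual address; returns 0 on success, -1 when out of frames. */
extern int pmm_map_page(void *virt_addr);

static unsigned char *heap_top;      /* next free byte inside the VMM buffer */
static unsigned char *heap_limit;    /* end of the region mapped so far      */

/* Make sure at least 'need' bytes are mapped past heap_top (sbrk-style). */
static int vmm_grow(unsigned long need)
{
    while ((unsigned long)(heap_limit - heap_top) < need) {
        if (pmm_map_page(heap_limit) != 0)
            return -1;               /* PMM has no free page frames left */
        heap_limit += PAGE_SIZE;
    }
    return 0;
}

/* A trivial VMM-side malloc built on top of it (no free list shown). */
void *vmm_malloc(unsigned long size)
{
    if (vmm_grow(size) != 0)
        return (void *)0;
    void *p = heap_top;
    heap_top += size;
    return p;
}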

Now it's possible to convert a higher-level memory allocator from one scenario to the other, and it's rather easy:
MMU-less allocator to MMU-aware allocator: start with an empty memory buffer, and when out of memory call 'sbrk'/'mmap' to make it grow. (Like modifying the one in Circle.)
MMU-aware to MMU-less allocator: instead of an empty buffer, set up a fixed-size memory buffer at start (like heap_low - heap_end), and replace all 'sbrk'/'mmap' calls with displaying an 'out of memory' error message. (Like modifying dlmalloc, jemalloc etc.)

Porting dlmalloc (either in its MMU-aware form as-is, or modified into an MMU-less version) to a bare metal application should be straightforward, as portability is one of its main goals. As far as I can see, Circle's allocator basically uses very similar concepts to dlmalloc. More information on Doug Lea's allocator (outdated, but a good start for understanding what it does).

Now dlmalloc is a small, portable, general-purpose allocator, and of course as such it can perform really poorly under certain circumstances. That's why most game developers hate it and implement their own special allocators in their game engines. One of those is jemalloc: robust, fast, multi-threaded, a bit harder to port to a bare metal application, but maybe needed if those special circumstances apply.

Oh, almost forgot to mention: both dlmalloc and jemalloc support alignment.

Cheers,
bzt

dwelch67
Posts: 944
Joined: Sat May 26, 2012 5:32 pm

Re: Bare metal malloc

Sat Jul 07, 2018 8:16 pm

I would step back before going forward. Most importantly, why do you need malloc in a bare metal program/environment? You should be able to remove that requirement.

If not, then why do you need the MMU? Data cache enable or something else? Even though it's bare metal, are you doing something multi-threaded, are you making an operating system? (Then it isn't bare metal.) If you are using the MMU only for data cache purposes, you can set up the MMU once to cover all of RAM, make virtual = physical as far as addressing goes, and you are done with the MMU; you only have to deal with allocating one address space.

Alignment means some number of zeros in the lower address bits, so if you want to return an 8-byte-aligned address you need three zeros at the bottom. If your current free address is 0x1111 then you can't hand out 0x1110, because that overlaps; you need to move it forward. We ask multiple alignment questions when interviewing candidates, so I'll just give you the answer. To align down to 8 bytes you do address = address & (~7), clearing the lower three bits; to align up and avoid an overlap, which is what you will want to do here, address = (address + 7) & (~7). We have seen many different ways to solve it: if these bits are set then add, AND with 7 and add 8 minus that number, add 8 and mask, or, worse, use a modulo operator (it optimizes out, but it's the thought process of tossing around divides). Now you may wish to care about those additional bytes; perhaps when an alloc happens you round the size they requested up to your alignment (same math: size = (size + 7) & (~7)).
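
In C, with the address held in an unsigned long, the 8-byte case looks something like this (the function names are just for illustration):

Code: Select all

/* Round an address down to the previous 8-byte boundary. */
unsigned long align_down(unsigned long address)
{
    return address & ~7UL;
}

/* Round an address up to the next 8-byte boundary (what an allocator wants,
   so the returned block does not overlap the bytes below it). */
unsigned long align_up(unsigned long address)
{
    return (address + 7UL) & ~7UL;
}

/* Example: align_up(0x1111) == 0x1118, align_down(0x1111) == 0x1110.
   The same math rounds a requested size up to the alignment:
       size = (size + 7UL) & ~7UL;                                  */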

But yes, if you are making an operating system with multiple threads, each using the same virtual address space (program A thinks it is at 0x80000, program B thinks it is at 0x80000), then you will need to take all of the available memory and either break it down into only one MMU block size or take advantage of the various block sizes the MMU offers (at the cost of a lot more work in your code). Depending on the MMU implementation you might have a third pointer/allocator (if not more, for other reasons) dealing with the MMU memory space that holds the second-level descriptors, in addition to keeping track of allocated memory blocks and how they map into a particular application's address space.

It is probably easiest to start by not using the MMU if possible: map it all linearly and pass it out linearly, with no frees. If that won't work completely for you, then use no MMU, a flat physical address space, and one of the many published algorithms. The next level, as needed, is MMU on but mapped virtual = physical. Then the pain comes after that.

As well: what system are you connected to? If you have malloc, do you have a full C library? Which one? Exactly how that connects to the malloc code is, or can be, C-library specific; there is more than one implementation of a C library. sbrk is a commonly used term/function, though, so that may still be your entry point into your memory management. Where this matters is the linker script, which is usually both compiler and C-library specific, and that tells you how to do the math to find the heap and its size programmatically; and/or you need to add to the library and/or its linker script and/or its bootstrap to be able to figure out where the unused space is to turn into a heap. There is not one answer to "what is the math for the base and size of the heap", because it very much depends on the linker script and the variables provided in that script (or some other solution), of which there are many for each target and compiler (or assume there are). So there is no one answer, but one (or more) answer(s) per combination of target, library, and options.
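
As an illustration only (the symbol names below are placeholders, your linker script will have its own), the usual pattern is to export the heap bounds from the linker script and do the math in C:

Code: Select all

/* These symbol names are examples only; use whatever your linker script
   actually defines (e.g. __end__, _heap_start, __HeapLimit, ...). */
extern unsigned char __heap_start;   /* placed after .bss by the linker script */
extern unsigned char __heap_end;     /* placed at the top of the usable RAM    */

static unsigned char *heap_base;
static unsigned long heap_size;

void heap_discover(void)
{
    heap_base = &__heap_start;
    heap_size = (unsigned long)(&__heap_end - &__heap_start);
}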

Step back and get a big-picture idea of the problem you are trying to solve, then break that down system-engineering style and attack those parts, starting with: do I support malloc() at all in my bare metal application?

David

macload1
Posts: 12
Joined: Mon Jul 07, 2014 6:50 am

Re: Bare metal malloc

Sun Jul 08, 2018 8:31 am

You may check out my repo: https://github.com/macload1/RPiFreeRTOSuGFX/

In src/Demo/main.c you'll find the function void * _sbrk(uint32_t delta) that's needed to use your compiler's malloc function.

Also check out the linker script needed to make all this work.
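
Roughly speaking (this is a generic sketch with placeholder symbol names; see the repo and its linker script for the real version), such an _sbrk just bumps a pointer inside the region the linker script reserves:

Code: Select all

#include <stdint.h>

extern uint8_t _heap_bottom;     /* placeholder linker symbols; the real names */
extern uint8_t _heap_top;        /* come from the project's linker script      */

void *_sbrk(uint32_t delta)
{
    static uint8_t *current_break = &_heap_bottom;

    if (current_break + delta > &_heap_top)
        return (void *)-1;       /* newlib treats (void *)-1 as out of memory */

    uint8_t *previous_break = current_break;
    current_break += delta;
    return previous_break;       /* old break = start of the newly granted block */
}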

Best regards,
macload1

bzt
Posts: 183
Joined: Sat Oct 14, 2017 9:57 pm

Re: Bare metal malloc

Sun Jul 08, 2018 10:03 am

@dwelch67: so true! I focused too much on the universal case, forgetting that a bare metal app is not a general-purpose application!

Indeed, if you use the MMU only for caching with identity mapping and no multi-threading, then the MMU-less allocator works perfectly, and there's no need to complicate things. Also, you should ask yourself: if you have taken everything into consideration and you do need malloc, do you need free too? Many bare metal apps just load, arrange their structures in memory and then run. Such an app only uses malloc during its initialization phase, but not later, therefore free can be omitted. That simplifies things a hell of a lot. If that's the case, a perfectly working malloc implementation can be as simple as:

Code: Select all

extern unsigned char _end;            /* end of the kernel image, from the linker script */

#define MEM_START (&_end)
#define MEM_END   ((unsigned char *)MMIO_BASE)

unsigned char *freeptr = MEM_START;

void *malloc(unsigned int size)
{
  if (size < 1 || freeptr + size > MEM_END)
    return (void*)0;                  /* zero-size request or out of memory */
  freeptr += size;
  return (void*)(freeptr - size);     /* start of the block we just reserved */
}
Cheers,
bzt

LizardLad_1
Posts: 126
Joined: Sat Jan 13, 2018 12:29 am

Re: Bare metal malloc

Sun Jul 08, 2018 10:54 am

The reason for malloc was actually for file reading, as I am going to have an interpreter running on the Pi. Malloc was only really going to be used for this, so I reckon my solution would suffice, clearly with a bit of a clean-up though. However, thanks for showing me your solutions. I was only going to do 1:1 mapping because I wanted to use the MMU for speeding up access to the framebuffer. However, I am still a bit confused as to how it works, because if you had to make a virtual address for literally every physical address used, wouldn't that be much slower? In the case of a framebuffer, would I only need to convert the first address of the framebuffer, or would I need to convert literally every pixel? I realise this is off topic, but how does the framebuffer not write over the kernel code? And how do you get the start and end address of the framebuffer?

bzt
Posts: 183
Joined: Sat Oct 14, 2017 9:57 pm

Re: Bare metal malloc

Wed Jul 11, 2018 5:12 pm

Okay, let's take a step back :-)

First of all, if you have 1:1 identity mapping, then you can forget about physical / virtual addresses (as they are mapped identically).

Otherwise, all memory references your code makes are virtual addresses. The MMU is responsible for translating those into physical addresses, completely transparently to your code; you don't have to convert anything. Your code does not need to know, when it reads or writes the address 0x1001, whether that's mapped to physical page 0x1000 or 0 or 0x100000 in RAM. As for the framebuffer, for example, let's say on one boot the VC returns 0xF000, and on the next boot 0xE000 (just examples). You can map that returned address (and all following pages up to the size of the framebuffer) at 0x10000. That way it doesn't matter what the real physical framebuffer address is: your code will always write to 0x10000 + y*pitch + x*4 to plot a pixel. Now if you don't use paging (or only use identity mapping), you have to use a variable to store the starting address of the framebuffer as returned by the VC mailbox call (as it may vary between boots), and therefore plotting a pixel would be fb + y*pitch + x*4. Hope this makes sense to you.
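
For example, a 32 bits-per-pixel plot routine looks the same under both schemes; only where the base address comes from differs (fb and pitch below stand for whatever the mailbox call returned, or the fixed virtual address you mapped the framebuffer to):

Code: Select all

static volatile unsigned char *fb;   /* framebuffer base: mailbox result, or the
                                        fixed virtual address you mapped it to  */
static unsigned int pitch;           /* bytes per row, also from the mailbox    */

/* Plot one pixel, assuming a 32 bits-per-pixel mode. */
void plot_pixel(unsigned int x, unsigned int y, unsigned int colour)
{
    *(volatile unsigned int *)(fb + y * pitch + x * 4) = colour;
}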

About malloc: if you plan to load a single program at boot (plus maybe a few additional libraries it references) and interpret that until you turn off the RPi, then the simple malloc is perfectly fine. But if you plan to provide an interactive user menu to load different programs (and their libraries) and interpret those, then you'll definitely need to free memory. The difference is that in the first case you only load one program and interpret that the entire time, while in the second case you don't know how many times the user wants to load a program (therefore you'll have to free the allocated memory to restore the initial state before a new program can be loaded). This gets more complicated if the interpreted program is also allowed to allocate memory dynamically (where the same consideration applies: does it do that only during initialization, or many times throughout its execution?). I'd say for a proper interpreter, you'll need both malloc and free.

Cheers,
bzt

dwelch67
Posts: 944
Joined: Sat May 26, 2012 5:32 pm

Re: Bare metal malloc

Wed Jul 11, 2018 11:15 pm

Not knowing your application, there are ways to statically "allocate" space for buffers that might in a normal situation have been allocated dynamically. So depending on what you are doing you might still be able to get rid of malloc.

Yes, in theory the MMU is slower. When we say 1:1 we mean 0x10000000 virtual maps to 0x10000000 physical. Why one would do this is because, certainly with this architecture/design, you might want to use the data cache for program data and whatever you have in RAM, but you don't want to generically have the cache on for the entire ARM address space, as the peripherals would then get cached: you check the status of a peripheral, that result lands in the cache, and when you later go to check it again, instead of getting the current status you may get the cached status. What the MMU provides in this situation is data cache control. You can map the whole memory space (that you are going to use) with the virtual address equal to the physical address, but for the RAM ranges you enable caching and for the peripheral address space you do not. Technically every access now has to get looked up in the MMU, which equals slower, but some percentage, probably a high percentage, of those are RAM accesses, and some percentage of those are cached, which equals faster. So there is a tradeoff.

The instruction cache does not have this problem; you can turn it on without the MMU because only instruction fetches are affected, and if you are trying to execute instructions out of the peripherals' control register address space you are already in trouble. The data bus is used both for program data (accessing variables/RAM) and for peripheral accesses (control registers, etc.).

The point here is that you can use what may or may not be a simpler algorithm, designed for a flat physical address space, for managing the allocations, and you don't have to separately manage the MMU allocations nor the MMU table allocations.

The good side of using an MMU for memory allocation is that you can make larger blocks out of smaller blocks: you can, to some extent, grab some number of non-linear physical blocks and make them linear in the virtual space. But that is more code to write and debug. I have not actually tried this, but the gray beard that mentored me when I was in my twenties had worked on some mainframes where the algorithms would try to migrate the allocations to the top or bottom of the heap, to try to leave large chunks in the middle in case a large allocation came along. My thought has always been that an MMU can help avoid some of that... Depends on the MMU, I guess, and how many address bits you can control.

There are no doubt (if not already among those linked above) many linear-address-space heap allocation routines you can borrow and use against the memory space you determine to be unused by your application (and unused by the GPU). Being bare metal, you own all of the memory available to the ARM; it's just how you divide it up that is up to you.

Again, not knowing your situation, you might be able to cheat. For example, I don't remember if it was a jpeg library or some other I was using to test a processor with, but my malloc was simply: give out the current pointer, then add the size (rounded up to an alignment) to the pointer for the next allocation. No way to free pointers, etc. I knew how many mallocs were coming, I didn't want to get into that code to change them to static compile-time allocations, and it just worked. Had I wanted to decode more than one jpeg per test, I would have simply reset the heap pointer between each decode, basically freeing all the mallocs from the prior run. I don't know if you are in this situation, but it is something to think about: if you only need this feature for short linear runs of memory that once done can be discarded completely, that shortcut could work. The alloc routine then becomes literally a few lines of code.
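
That throwaway allocator is literally just this kind of thing (names made up, 8-byte alignment assumed):

Code: Select all

static unsigned long heap_base;      /* start of the spare region, set at boot */
static unsigned long heap_ptr;       /* current bump pointer                   */

void *bump_alloc(unsigned long size)
{
    size = (size + 7) & (~7UL);      /* round the request up to the alignment */
    void *ptr = (void *)heap_ptr;
    heap_ptr += size;
    return ptr;                      /* no free(), no bounds check: throwaway use only */
}

void bump_reset(void)                /* "free" everything between runs */
{
    heap_ptr = heap_base;
}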

My guess is you want to start with an off-the-shelf routine that assumes a single memory space (no virtual vs physical, it is what it is), and go from there.

LizardLad_1
Posts: 126
Joined: Sat Jan 13, 2018 12:29 am

Re: Bare metal malloc

Thu Jul 12, 2018 3:43 am

I have actually got it working, but it only allocates 1024 bytes at a time. I'm looking to eventually get it allocating multiple chunks if required; however, this works for now. I have one problem though: how do I differentiate between the memory that I should be accessing and the GPU memory?
