LizardLad_1
Posts: 126
Joined: Sat Jan 13, 2018 12:29 am

Multicore Code works on QEMU not on real hardware

Sat Oct 13, 2018 10:03 am

Hello all,

I have written a method of controlling other cores from core 0. It works in QEMU and in my x86 port of my code (the x86 port runs on POSIX-compatible systems). When I try to run it on real hardware, the function that initializes the cores hangs. I'm guessing this happens because it only releases a core once a variable has been set by the core that was previously released. Given the nature of the problem, I'm guessing it is a cache coherency problem. I have used memory barriers and flushed the data cache over a certain range, to no avail. It still hangs, and there is no indication that the first core even sets the variable. It would be nice if someone could check my code; I reckon it is just something I overlooked. This is the link to the GitHub repo: https://github.com/OllieLollie1/Raspi3-Kernel The relevant files are located in src/ and are main.c, multicore.c, and start.s

EDIT 1:
I know LdB has a working asm version, but I would prefer C for ease of readability.

LdB
Posts: 872
Joined: Wed Dec 07, 2016 2:29 pm

Re: Multicore Code works on QEMU not on real hardware

Mon Oct 15, 2018 9:58 am

For a start you have bugs in the GPU mailbox messages.

The initialization code is barely 100 lines in all, so check all of the messages .. I found 2 straight away:

Uart init: the clock message does not take 12 bytes, it takes 8. You got it right in system.c but wrong in uart.c.
get_gpu_memory_split in gpu_memory.c: the response is not 0 bytes, the response is 8 bytes.

Can I also say I seriously dislike the malloc and free inside the printf. That is just crazy; set up a static buffer of the largest size you can imagine, it makes everything much easier.

LizardLad_1
Posts: 126
Joined: Sat Jan 13, 2018 12:29 am

Re: Multicore Code works on QEMU not on real hardware

Mon Oct 15, 2018 10:55 am

Thanks LdB for finding the mailbox message bugs. On a side note, the malloc and free are there because I need them for my eventual use of this kernel. I'm eventually going to be recording gigabytes of sensor data on an autonomous bot. I'm just used to using malloc and free since I've mostly only programmed user space applications before.

Is there an issue with my implementation of it, or is it just bare metal tradition that it doesn't happen?

LdB
Posts: 872
Joined: Wed Dec 07, 2016 2:29 pm

Re: Multicore Code works on QEMU not on real hardware

Mon Oct 15, 2018 1:21 pm

You have a semaphore lock on it, so using dynamic memory just slows down your print function. You only ever need one buffer; it is impossible that you need more, because the semaphore enforces that. As a general rule you make it a fixed size and flush at a fixed point. For example, a typical implementation will flush output when any given format processing instruction exceeds 256 or 512 characters in length.

Most of the individual conversions will not exceed 65 characters (you already worked out the max size of some of your conversions, I saw them), the largest being a binary output of a 64-bit value. Only strings (%s) and padding can be longer, and they are easy to flush by resetting/moving the buffer pointer and continuing. There is no actual conversion, they are just copying or padding characters, so it's dead easy to flush them to that point and resume. So most implementations either allocate once on startup or simply declare something along the lines of "static char printbuf[256];".
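As a rough illustration of the sizing argument, here is a minimal sketch of the worst-case conversion: a 64-bit value written in binary needs 64 digits plus the terminator, so a 65-byte scratch buffer is enough. The name u64_to_binary is made up for this example and does not come from either repo.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: writes val in binary into out (at least 65 bytes)
   and returns the number of digits written, without leading zeros. */
static size_t u64_to_binary(uint64_t val, char out[65])
{
	char tmp[64];
	size_t n = 0;

	do {				/* Produce digits, least significant first */
		tmp[n++] = (char)('0' + (int)(val & 1u));
		val >>= 1;
	} while (val != 0);

	for (size_t i = 0; i < n; i++)	/* Reverse into the caller's buffer */
		out[i] = tmp[n - 1 - i];
	out[n] = '\0';			/* Worst case: 64 digits + terminator = 65 */
	return n;
}
```

Any other integer base produces fewer digits than binary, which is why the binary case sets the buffer size.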

Even if you were going to have one for each core you would simply copy (or in Linux terms fork) the printf execution code and set up a buffer for each copy. Again, it would only ever be done at setup, as it is far too slow to do it every time you printf.

That is what I am raising my eyebrows at: I thought you wanted faster printing, not slower.

I would also point out .. the printf buffer implementation here is tiny, more complete and on the stack all you need is a semaphore if you want
https://github.com/LdB-ECM/Raspberry-Pi ... mb-stdio.c
Literally all the code needs is a void func (char ch) which you can initialize or change at will.
Your "uart_send" function needs a minor interface change to char from uint32_t to use with the unit, and you would just need a void func (char ch) output in the lfb unit to use that with the unit.
Look at Init_EmbStdio all it does is hold that pointer to flush the converted character output to. So all you have to do is provide a suitable function and you can pass it to the code.
Even if you eventually want to replace the code it's an easy start point.
To have each core have its own output, for example, you just need an array of output functions and the two sizes, which you set via coreID, and you call the output function via coreID. The cores can share the rest of the code because all the buffers are on their own core stack.
This may help you understand what I am saying for multicore consoles as an alternative to semaphoring a single function.

Code: Select all

struct PerCoreOutput {
	int MaxOutCount;			// Max output count
	int OutputCount;			// Current output count
	void (*Console_WriteChar) (char);	// The output handler function
};

struct PerCoreOutput CoreConsoles[4] = { 0 };

void Init_EmbStdio (void (*handler) (char ch))
{
	// Assume we have a function GetCoreID which gets the core ID
	CoreConsoles[GetCoreID()].Console_WriteChar = handler;	// Set new handler for this core ID
}
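
To make the per-core routing concrete, here is a small host-buildable sketch of how a character could reach the calling core's own handler. It is only an illustration: core_putc, fake_core_id and the two sink functions are invented for this example, and GetCoreID is stubbed where the real one would read the core number from the CPU.

```c
#include <stddef.h>

struct PerCoreOutput {
	int MaxOutCount;			/* Max output count */
	int OutputCount;			/* Current output count */
	void (*Console_WriteChar)(char);	/* The output handler function */
};

static struct PerCoreOutput CoreConsoles[4];

/* Host stand-in (assumption): on the Pi this would come from MPIDR_EL1. */
static unsigned int fake_core_id;
static unsigned int GetCoreID(void) { return fake_core_id & 3u; }

void Init_EmbStdio(void (*handler)(char ch))
{
	CoreConsoles[GetCoreID()].Console_WriteChar = handler;
}

/* Hypothetical per-core output: routes ch to the calling core's handler. */
int core_putc(char ch)
{
	struct PerCoreOutput *con = &CoreConsoles[GetCoreID()];

	if (con->Console_WriteChar == NULL)
		return -1;			/* No console set for this core */
	con->Console_WriteChar(ch);
	con->OutputCount++;
	return (unsigned char)ch;
}

/* Two dummy sinks so the routing can be observed on a host build. */
static char last_uart, last_screen;
static void uart_sink(char ch)   { last_uart = ch; }
static void screen_sink(char ch) { last_screen = ch; }
```

Because each core only ever touches its own array slot, this arrangement needs no lock around the handler lookup.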

UPDATE:
I can also tell you I found the source of your error beyond the mailbox problems ... you send cores 1..3 to core_wait_for_instruction.

core_wait_for_instruction executes a semaphore command, only you don't have the MMU online on that core yet, and that is a deadlock: it will never acquire a lock, so your core will be spinning forever. Go back to my code, I already told you that; I walked them in one at a time using only a volatile.
https://github.com/LdB-ECM/Exchange/blo ... src/main.c
/* Setup mmu on additional cores */
/* We can not use semaphores until MMU up on all cores */
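
The "one at a time using only a volatile" idea can be sketched roughly as follows. This is a host-buildable stand-in, not the code from the linked repo: release_core and secondary_entry are hypothetical names, the release is simulated by a direct call, and on real hardware the flag would probably also need cache maintenance while the secondary core still has its caches off.

```c
#include <stdbool.h>

#define NUM_CORES 4

static volatile bool core_mmu_ready[NUM_CORES];	/* Plain flags, no locks */

/* What each secondary core would run first. On the Pi, mmu_init() would be
   called before the flag write; it is left out so this builds on a host. */
static void secondary_entry(unsigned int core)
{
	core_mmu_ready[core] = true;
}

/* Host stand-in (assumption): on hardware this would write the entry
   address to the core's release slot and issue sev. */
static void release_core(unsigned int core)
{
	secondary_entry(core);
}

/* Core 0 walks the others in one at a time, polling only a volatile flag. */
static unsigned int bring_up_secondaries(void)
{
	unsigned int started = 0;

	for (unsigned int core = 1; core < NUM_CORES; core++) {
		release_core(core);
		while (!core_mmu_ready[core]) { }	/* No semaphore before the MMU is up */
		started++;
	}
	return started;
}
```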
For reference .... All I did to debug your code was add a uart_send after each function

Code: Select all

	// set up serial console
	uart_init();
	uart_send('1');

	lfb_init();
	uart_send('2');

	dynamic_memory_alloc_init();
	uart_send('3');
	console_init();	
	uart_send('4');
	init_audio_jack();
	uart_send('5');

	gl_quad_scene_init(&scene, &(shader[0]));
	uart_send('6');

	//Create mmu table on Core 0
	init_page_table();
	uart_send('7');

	mmu_init(); //Now turn on MMU on Core 0
	*core0_ready = true;
	uart_send('8');

	multicore_init(); //Now core_execute is available to be run after this
	uart_send('9');

What I got on the UART screen was 12345678 ... no 9, hence it's crashing in multicore_init, so I then looked at that.
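
The same breadcrumb trick can be wrapped in a macro so each init step and its marker stay on one line. This is only a sketch: CHECKPOINT, step_ok and the trace buffer are made up for illustration, and uart_send here is a host stand-in that records characters instead of writing to the real UART.

```c
#include <stddef.h>

/* Host stand-in for the real uart_send: records characters in a trace. */
static char trace[16];
static size_t trace_len;
static void uart_send(char ch)
{
	if (trace_len < sizeof(trace) - 1)
		trace[trace_len++] = ch;
}

/* Run one init step, then emit a marker, so the last marker seen on the
   UART names the last step that returned. */
#define CHECKPOINT(call, marker) do { call; uart_send(marker); } while (0)

static void step_ok(void) { }	/* Placeholder for uart_init(), lfb_init(), ... */

static void run_init(void)
{
	CHECKPOINT(step_ok(), '1');
	CHECKPOINT(step_ok(), '2');
	CHECKPOINT(step_ok(), '3');
}
```

If a step hangs, its marker never appears, which is exactly how the missing 9 pointed at multicore_init.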

The big hint is that, the way you have them coming in, they don't need to wait in that function anyhow. Bring them straight in, bring the MMU online and send them straight to where you want them; you have everything already set up with core 0.

LizardLad_1
Posts: 126
Joined: Sat Jan 13, 2018 12:29 am

Re: Multicore Code works on QEMU not on real hardware

Mon Oct 15, 2018 8:57 pm

Thanks for the ideas about printing. I knew the error occurs in the core_wait_for_instruction part and I used that to keep the cores there when they aren't doing anything. When I call core_execute() a function pointer is set and the specified core is then taken from that function and executes something and then it returns there. When a core first enters core_wait_for_instruction the first line of code is mmu_init so shouldn't that bring the mmu online so semaphores can be used?

LdB
Posts: 872
Joined: Wed Dec 07, 2016 2:29 pm

Re: Multicore Code works on QEMU not on real hardware

Tue Oct 16, 2018 12:42 am

Update at bottom:

I will deal with that in a second but you also have a print instruction right after the MMU_init call which never appears.
So we are left with even more questions about your printf implementation.

You also have printf which uses a semaphore used in places that raised my eyebrows like mbox.c, gpu_memory.c, core1.c, core2.c, core3.c which may well be called long before semaphores are ever available. For safety I think you need some sort of flag from the MMU unit when all 4 cores are done and printf needs to exit if it isn't set or something like that given it will lockup currently.

I tried commenting out the semaphore and putting a printf statement straight after the console_init and I get nothing up on screen so something is very broken and it has nothing to do with cache or multicores.

Now back to the semaphore I am terribly confused multicore.c has the function "void get_core_ready()" which clears the ready flag but core1,core2, core3 also clear the same flag, there seems to be no planning to all this or you have two schemes?????

Now follow what happens when you start core 1. It goes into "core_wait_for_instruction", which yes turns the MMU on, but only on that core. It then prints something that uses a semaphore (interesting), then hits "get_core_ready", and then comes back and attempts to lock another semaphore with "semaphore_inc(&core1_execute_lock);". There are still two other cores that haven't got their MMU online; they haven't even been launched yet ... I repeat, I think you must have all 4 cores with the MMU online before you try to acquire a semaphore. I am dubious about having some cores with and some without.

I am also a little dubious your core launch code works, core0 has the cache on over the memory .. try something like

Code: Select all

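	asm volatile ("mov x1, #0xe0\n"
				  "mov x2, #0x80000\n"
				  "str x2, [x1]"
				  : : : "x1", "x2", "memory");	// x1/x2 are overwritten, so tell the compiler
	asm volatile ("dc civac, %0" : : "r" (0xe0UL) : "memory");
	asm volatile ("dsb sy" : : : "memory");	// let the cache clean finish before waking the cores
	asm volatile ("sev");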

I did however notice you already have a suitable function you could convert to use with that other implementation in console.c in

Code: Select all

void console_print(char *input)

A simple single character print could be formed with something simple like

Code: Select all

void console_print_char(char ch)
{
	char temp[2] = { 0 };
	temp[0] = ch;
	console_print(&temp[0]);
}

UPDATE:
I changed the printf function over to the alternative, fixed a few things up in multicore.c and used the above console_print_char, and it works, at least to get display to screen. I also tried using the changed uart_send and that works as well. The use of semaphores while not all cores have the MMU up makes it very touchy; the printf that reports core acknowledge in "multicore_init" is currently needed to provide some sort of timing delay, and it shouldn't be. The execute function is dead, so it's not perfect, but it is a start and I am out of time to mess around with it. The code and binary are on the link
https://github.com/LdB-ECM/Exchange/tre ... nel-master

LizardLad_1
Posts: 126
Joined: Sat Jan 13, 2018 12:29 am

Re: Multicore Code works on QEMU not on real hardware

Tue Oct 16, 2018 9:39 am

Thanks for your assistance; however, on testing, the new printf implementation is more time consuming. Instead of printing everything that was passed to printf in one go, console_print is being called potentially hundreds of times to print 3 lines. Is there a way to speed this up? If there isn't, my implementation was faster by doing everything in one go.

LdB
Posts: 872
Joined: Wed Dec 07, 2016 2:29 pm

Re: Multicore Code works on QEMU not on real hardware

Tue Oct 16, 2018 1:53 pm

It is trivial to do what you want ... I assume you can make a suitable function on console.c which matches my function I setup next.

So make a suitable interface instead of the character-by-character one, something like void console_print(char* buffer, int count); You will need a size because you won't be able to rely on a '\0' terminated string: you have to chunk large strings out through the buffer and you won't want to have to insert '\0' into the long strings.

So let's do this, because it is really simple .. something like 30 lines of code changes.

We change the function pointer

Code: Select all

static void (*Console_WriteChar) (char* buffer, int size) = NULL;	// The output handler function

Okay, we need a new init to set the function; again that is trivial

Code: Select all

void Init_EmbStdio (void (*handler) (char* buffer, int size))
{
	Console_WriteChar = handler;	// Set new handler function
}

Now all we need to do is provide a slightly altered version of the internal output ... so let's do it.
Look at the code and follow what it does .. and convince yourself it writes intbuf size chunks.
There is a flush required at the end of printf, which is 1 line of code, as you may still have data in the buffer.

Code: Select all

static char intbuf[256];

static int blockprn_to_buf(int c, void* prn_func)
{
	intbuf[OutputCount] = (char)c;	// Hold the character in the internal buffer
	OutputCount++;			// Inc the count of characters in buffer
	if (OutputCount == sizeof(intbuf)) {	// If the buffer is full write the buffer to function pointer
		void (*handler) (char*, int) = prn_func;
		handler(&intbuf[0], OutputCount);	// Write internal buffer data
		OutputCount = 0;	// Buffer now empty .. start loading from zero again
	}
	return (int)c;
}

Now let's do the printf function

Code: Select all

int printf (const char *fmt, ...)
{
	va_list args;	// Argument list
	int count = -1;	// Preset fail to number of characters printed
	if (Console_WriteChar) {
		OutputCount = 0;	// Zero output count
		MaxOutCount = INT_MAX;	// This is a maximum size function
		va_start(args, fmt);	// Create argument list
		count = _doprnt(fmt, args, blockprn_to_buf, Console_WriteChar);	// Run conversions to our new block write function

		/* If OutputCount > 0 there is still data in intbuf we need to flush it */
		if (OutputCount > 0) Console_WriteChar(&intbuf[0], OutputCount);

		va_end(args);	// Done with argument list
	}
	return count;	// Return number of characters printed
}

So that is it, job done: your printf now writes chunks up to intbuf size instead of character by character .. you can change the size of the buffer to whatever you want.
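
To convince yourself of the chunking behaviour without the full printf, the buffering can be exercised on its own. The following is a hedged host-side sketch: block_putc, emit_string and record_chunk are invented names, the handler is passed directly rather than through a void pointer, and the buffer is shrunk to 8 bytes so the chunk boundaries are easy to see.

```c
#include <stddef.h>

#define CHUNK 8				/* Tiny on purpose; the post uses 256 */

static char intbuf[CHUNK];
static int OutputCount;

/* Test sink (assumption): records how many chunks arrived and their sizes. */
static int chunk_sizes[8];
static int chunk_count;
static void record_chunk(char *buffer, int size)
{
	(void)buffer;
	if (chunk_count < 8)
		chunk_sizes[chunk_count++] = size;
}

/* Same buffering idea as blockprn_to_buf, with the handler passed directly. */
static int block_putc(int c, void (*handler)(char *, int))
{
	intbuf[OutputCount++] = (char)c;
	if (OutputCount == (int)sizeof(intbuf)) {	/* Buffer full: write it out */
		handler(intbuf, OutputCount);
		OutputCount = 0;
	}
	return c;
}

/* Stand-in for the printf body: push a string through, then flush the tail. */
static int emit_string(const char *s, void (*handler)(char *, int))
{
	int n = 0;

	OutputCount = 0;
	for (; *s != '\0'; s++, n++)
		block_putc(*s, handler);
	if (OutputCount > 0)
		handler(intbuf, OutputCount);		/* Final flush */
	return n;
}
```

With an 8-byte buffer, a 20-character string reaches the handler as chunks of 8, 8 and 4 characters, so the handler is called three times instead of twenty.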

There may be a few typos; I haven't run the code, I just did it while I typed, but it should be very close to working as it's pretty trivial.

Now if you feel comfortable with your old code then use it but you will need to fix it up .. I am just offering easy and flexible options.

Note: Actually, looking at it, you might be able to use your existing char* null terminated console print if you pull up 1 character short on the internal buffer and zero it, and on the flush add a zero after the last character. It will depend on what your print function does when it sees a null termination, in terms of appending characters at the next screen position. I leave that for you to think about as an exercise. Hmm, let me try it next post.
Last edited by LdB on Tue Oct 16, 2018 2:36 pm, edited 3 times in total.

LdB
Posts: 872
Joined: Wed Dec 07, 2016 2:29 pm

Re: Multicore Code works on QEMU not on real hardware

Tue Oct 16, 2018 2:26 pm

UPDATE: Since I didn't have to change console.c I tested it with your console_print function ... works :-)
https://github.com/LdB-ECM/Exchange/tre ... nel-master

Let's try a different version ... We change the function pointer

Code: Select all

static void (*Console_WriteChar) (char*) = NULL;	// The output handler function

Okay, we need a new init to set the function

Code: Select all

void Init_EmbStdio (void (*handler) (char* buffer))
{
	Console_WriteChar = handler;	// Set new handler function
}

New block print function

Code: Select all

static char intbuf[256];

static int blockprn_to_buf(int c, void* prn_func)
{
	intbuf[OutputCount] = (char)c;	// Hold the character in the internal buffer
	OutputCount++;			// Inc the count of characters in buffer
	if (OutputCount == sizeof(intbuf)-1) {	// If the buffer is full (bar the terminator slot) write it to function pointer
		void (*handler) (char*) = prn_func;
		intbuf[sizeof(intbuf)-1] = 0;	// Null terminate in the last slot
		handler(&intbuf[0]);	// Write internal buffer data
		OutputCount = 0;	// Buffer now empty .. start loading from zero again
	}
	return (int)c;
}

New printf routine

Code: Select all

int printf (const char *fmt, ...)
{
	va_list args;	// Argument list
	int count = -1;	// Preset fail to number of characters printed
	if (Console_WriteChar) {
		OutputCount = 0;	// Zero output count
		MaxOutCount = INT_MAX;	// This is a maximum size function
		va_start(args, fmt);	// Create argument list
		count = _doprnt(fmt, args, blockprn_to_buf, Console_WriteChar);	// Run conversions to our new block write function

		/* If OutputCount > 0 there is still data in intbuf we need to flush it */
		if (OutputCount > 0) {
			intbuf[OutputCount] = 0;	// null terminate string
			Console_WriteChar(&intbuf[0]);
		}

		va_end(args);	// Done with argument list
	}
	return count;	// Return number of characters printed
}

Think that is all correct .. and should work with console_print and uart_puts

LizardLad_1
Posts: 126
Joined: Sat Jan 13, 2018 12:29 am

Re: Multicore Code works on QEMU not on real hardware

Wed Oct 17, 2018 6:07 am

Thank you for that. It is actually 30% faster than my original print function in QEMU, so I would guess there would be larger speed improvements on the actual Pi. Only minor edits were required to make your code compile, mostly where you try to set the address of an array index to null.

UPDATE:
I realised there was a small bug in the core_execute function but there seems to still be a bug in multicore_init

LdB
Posts: 872
Joined: Wed Dec 07, 2016 2:29 pm

Re: Multicore Code works on QEMU not on real hardware

Wed Oct 17, 2018 7:02 am

I had a few misplaced ampersands doing the terminate, I found that when I tried to compile it :-)

On the other issue: yes, you need to split this into 2 clear steps and you need to stop trying to merge them

1.) Bring the MMU online on each core
2.) Then setup what you want to do with the cores after that.

Currently you are trying to do both at once and you can't; the reason why is obvious .. semaphores & primitive locks.

Step 1 must be accomplished without semaphores and primitive locks
Step 2 you really want to use those because they are the best and most efficient options.

Step 1 invariably means you must have some sort of park while you wait for all the cores. Currently you seem to be trying to avoid the secondary park, and it bites you. Basically all you need is: when the core comes in, set up the MMU and park it until you have them all.
A simple secondary park is 11 lines of assembler, and that is basically what you are trying to avoid. You can possibly get it down simpler if you are prepared to hard spin the cores. Put the spin in C code if you like. I believe my normal park is something like the following, with mmu_init prepended at entry:

Code: Select all

/* cores enter here .. turn on mmu and then park */
volatile uint32_t* mailbox = (volatile uint32_t*)0x400000CC;
void (*func) (void) = 0;
mmu_init();
do {
   uint64_t CoreID;
   asm volatile ("wfe");                                   // Sleep until an event (sev) arrives
   asm volatile ("mrs %0, MPIDR_EL1" : "=r" (CoreID));     // Low 2 bits give the core number
   func = (void (*)(void))(uintptr_t)mailbox[(CoreID & 0x3) * 4];
   mailbox[(CoreID & 0x3) * 4] = 0;
} while (func == 0);
func();

Once they are all in the while loop you can command them to wherever you want by simply writing the destination address to the mailbox and issuing a sev.

LizardLad_1
Posts: 126
Joined: Sat Jan 13, 2018 12:29 am

Re: Multicore Code works on QEMU not on real hardware

Wed Oct 17, 2018 8:24 am

Yep that makes sense, as a side note why did you choose that address for your mailbox?

UPDATE:
I have implemented exactly what you have written but it only shows a black screen.
Last edited by LizardLad_1 on Wed Oct 17, 2018 8:40 am, edited 1 time in total.

LdB
Posts: 872
Joined: Wed Dec 07, 2016 2:29 pm

Re: Multicore Code works on QEMU not on real hardware

Wed Oct 17, 2018 8:30 am

That is the physical mailbox hardware between the cores .. you can use interrupts etc if you want

remember this sheet for the hardware
https://www.raspberrypi.org/documentati ... rev3.4.pdf

0x4000_00CC = Core 0 Mailbox 3 Rd/Clr
0x4000_00DC = Core 1 Mailbox 3 Rd/Clr
0x4000_00EC = Core 2 Mailbox 3 Rd/Clr
0x4000_00FC = Core 3 Mailbox 3 Rd/Clr

You write to them at
0x4000_008C = Core 0 Mailbox 3 Set
0x4000_009C = Core 1 Mailbox 3 Set
0x4000_00AC = Core 2 Mailbox 3 Set
0x4000_00BC = Core 3 Mailbox 3 Set
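
The two address lists follow a fixed pattern: each core's register sits 0x10 bytes after the previous core's. A small sketch of that arithmetic, with helper names invented for this example, might look like this:

```c
#include <stdint.h>

#define MBOX3_SET_BASE   0x4000008CUL	/* Core 0 Mailbox 3 Set */
#define MBOX3_RDCLR_BASE 0x400000CCUL	/* Core 0 Mailbox 3 Rd/Clr */
#define CORE_STRIDE      0x10UL		/* Byte distance between cores */

/* Hypothetical helper: address a core's mailbox 3 is written at. */
static inline uintptr_t mbox3_set_addr(unsigned int core)
{
	return (uintptr_t)(MBOX3_SET_BASE + (core & 3u) * CORE_STRIDE);
}

/* Hypothetical helper: address a core's mailbox 3 is read/cleared at. */
static inline uintptr_t mbox3_rdclr_addr(unsigned int core)
{
	return (uintptr_t)(MBOX3_RDCLR_BASE + (core & 3u) * CORE_STRIDE);
}

/* Index into a uint32_t array based at core 0's register; 0x10 bytes is
   4 words, which is where the "* 4" in the park loop comes from. */
static inline unsigned int mbox3_word_index(unsigned int core)
{
	return (core & 3u) * 4u;
}
```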

LizardLad_1
Posts: 126
Joined: Sat Jan 13, 2018 12:29 am

Re: Multicore Code works on QEMU not on real hardware

Wed Oct 17, 2018 8:43 am

I tried implementing it to no avail; it appears that now it doesn't work even in QEMU. I have pushed to GitHub if you want to check my main and multicore files.

LdB
Posts: 872
Joined: Wed Dec 07, 2016 2:29 pm

Re: Multicore Code works on QEMU not on real hardware

Wed Oct 17, 2018 10:07 am

No .. you aren't getting it.

You have the write address wrong (0x4000008C .. NOT 9c), think about the index that gets applied after it. But the problem is deeper: I don't even get what that function is for.

But your code is structurally not doing what it has to .. so let me fix it

There is a bit of an issue going on are you using your own libraries or the C compiler standards now?
I ask because you are including your own headers in the src/include such as stdint.h and they don't always match the actual standard library.
You need to go one way or other you can't interchange .. I will likely delete them from your code and use #include <stdint.h> etc

Anyhow I simplified your multicore the park and core_execute works. I also have given you an example of how to setup parameters as I made play_audio take a start and end pointer and setup core 2 to play the sound with parameters. I also added printf long long int as it looks like you want it (%llu) and fixed a bug in mbox.c. Everything in main.c now works.
https://github.com/LdB-ECM/Exchange/tre ... nel-master

LizardLad_1
Posts: 126
Joined: Sat Jan 13, 2018 12:29 am

Re: Multicore Code works on QEMU not on real hardware

Wed Oct 17, 2018 8:57 pm

Thank you it makes more sense now. The reason I wanted to be able to pass a void * is because I was going almost for a pthread interface. Yes on Monday I switched over to Linaro so I also changed all of the headers to the standard headers. I think when you forked my repo I hadn't pushed that change yet. Thank you so much for your help. I hope to one day learn enough so that I can eventually be of help to others.

LizardLad_1
Posts: 126
Joined: Sat Jan 13, 2018 12:29 am

Re: Multicore Code works on QEMU not on real hardware

Wed Oct 17, 2018 10:09 pm

Also, it says LONG_LONG_SIZE undefined. How did you make it work?

LdB
Posts: 872
Joined: Wed Dec 07, 2016 2:29 pm

Re: Multicore Code works on QEMU not on real hardware

Thu Oct 18, 2018 2:29 am

Yes I would go down the pthreads path as a start for your 1st O/S it's the simplest, if you make it modular you can even have one system on each core. Then you can play with interconnecting the thread systems on each core or make the cores all work in one thread system. Pros and cons to that choice. Just command the cores to enter the thread system since you have mmu's up and can build primitives .. you can just use the standard c atomics for the primitives since you are using c standard libraries you don't have to make them yourself.
https://www.studytonight.com/operating- ... ithreading
You will find a lot of one to one model code because most implementations of linux use it, the many to many implementation code you will have to scratch around harder for.
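
As a minimal illustration of building a primitive from the standard C atomics, here is a spinlock sketch based on atomic_flag from <stdatomic.h>. The spinlock_t type and function names are made up for this example, and on the Pi it assumes the MMU and caches are already up on every core that touches the lock, as discussed earlier in the thread.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type built on the standard atomic_flag. */
typedef struct { atomic_flag locked; } spinlock_t;
#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Returns true if the lock was free and is now held by the caller. */
static inline bool spin_trylock(spinlock_t *l)
{
	return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Hard spin until the lock is obtained. */
static inline void spin_lock(spinlock_t *l)
{
	while (!spin_trylock(l)) { }
}

static inline void spin_unlock(spinlock_t *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

A thread system would normally put a core to sleep rather than hard spin, but the acquire/release pairing shown here is the part the standard library provides for you.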

I also meant to say you seem to be using the render for a high speed screen clear (I get it is hundreds of times faster than manually clearing) and if that is the case use a single triangle and make all 3 corners black and it will render a black triangle to screen aka screen goes black :-)
If you look at the code you can also set the colour the screen gets cleared with, search for GL_CLEAR_COLORS you should be able to work it out (the next two entries and set triangle corners to same). You could do it without the triangle at all but I would need to rework the GLSetup.

I will explain in full printf since you may want to change this a few more times as you add in functionality.

Okay, printf takes a variadic argument list, so all you are trying to do is match each conversion in the format string to how the next variadic argument is loaded. The code implementation is simply a string parser connected to an output function pointer. To send blocks in the earlier code we changed the output function; this time we need to change the parser.

So for reading an integer there is an enumeration flag that gets set, and all I did was add long long ... that is what it is complaining about: you didn't see the change to the enumeration.

Code: Select all

enum integer_size {
    SHORT_SHORT_SIZE,
    SHORT_SIZE,
    REGULAR_SIZE,
    LONG_SIZE,
    LONG_LONG_SIZE
};

So as you read the format string flags you set which of those enumerations applies: char, short, int, long int, long long int .. signed or unsigned is held on a simple flag as well. So I changed the length parser done in step 4, which is marked with
/*************************************
* 4. Optional length modifier *
*************************************/
Eventually you get to the point where you need to read the right variadic size, and now you have the enumeration to tell you which.
There is a signed and an unsigned read for an integer so two places to change code .. lets look at unsigned which I added the new size to read.

Code: Select all

handle_unsigned:
if (size == LONG_LONG_SIZE)
{
	ularg = va_arg(ap, unsigned long long);
}
else if (size == LONG_SIZE)
{
	ularg = va_arg(ap, unsigned long);
}
else
{
	/* Note: 'unsigned char' and 'unsigned short' are promoted
	 * to 'unsigned int' when passed as variadic arguments.  */
	ularg = va_arg(ap, unsigned int);
}

That was all there was to it.
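
The signed read is the second place that needs the same change. A standalone sketch of that branch is below; read_signed and first_signed are hypothetical names for illustration, not functions from the printf unit, and the driver exists only so the branch can be exercised without the format parser.

```c
#include <stdarg.h>

enum integer_size {
	SHORT_SHORT_SIZE,
	SHORT_SIZE,
	REGULAR_SIZE,
	LONG_SIZE,
	LONG_LONG_SIZE
};

/* Hypothetical helper: pull the next signed argument at the given size. */
static long long read_signed(enum integer_size size, va_list *ap)
{
	if (size == LONG_LONG_SIZE)
		return va_arg(*ap, long long);
	if (size == LONG_SIZE)
		return va_arg(*ap, long);
	/* 'signed char' and 'short' are promoted to 'int' when passed
	   as variadic arguments, so the smaller sizes all read an int. */
	return va_arg(*ap, int);
}

/* Small driver: reads exactly one variadic argument of the stated size. */
static long long first_signed(int size, ...)
{
	va_list ap;
	long long v;

	va_start(ap, size);
	v = read_signed((enum integer_size)size, &ap);
	va_end(ap);
	return v;
}
```

Reading with the wrong size is undefined behaviour, which is why the enumeration has to be set correctly by the length parser before this point.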

LizardLad_1
Posts: 126
Joined: Sat Jan 13, 2018 12:29 am

Re: Multicore Code works on QEMU not on real hardware

Thu Oct 18, 2018 8:04 am

Yes I didn't see the enumerated type thanks. I have been testing the code and it is very close to working. It executes the function forever until a new call comes through. I tried to fix it because I thought you had made an error by writing a zero to the read/clr register so I tried to write 0xFFFF to the read/clr register and that didn't work. I thought it would because the documentation says there are 16 registers that are write to clear and can also be read and they were what you used. It also came up with an undefined instruction exception in qemu when I ran it so I know that writing 0xFFFF is wrong. So what is the correct way to do it so the function only executes once. If you want my current code just fork my repo. It has had a few updates since your last fork.

UPDATE:
Sorry everyone can completely disregard this post. I was misinterpreting output. Thanks for your help LdB everything is working as expected.

LdB
Posts: 872
Joined: Wed Dec 07, 2016 2:29 pm

Re: Multicore Code works on QEMU not on real hardware

Thu Oct 18, 2018 8:38 am

Good to hear :-)

Remember when you do your threads you can remove the complex function pointer union because you will jump it to entry so you just need the address thru the mailbox. You can't ever return so there is no point using a function just use a 1 line asm with the address to jump it to the address.

Good luck.

Update: Small apology for an error above and in the text in the GLES code. The triangle corner colors are marked wrong: the varyings are marked as red, green, blue but they are in fact blue, green, red. The background color format is Alpha, Red, Green, Blue, which wasn't mentioned above either.

So this makes screen background bright blue

Code: Select all

emit_uint8_t(&p, GL_CLEAR_COLORS);
emit_uint32_t(&p, 0xFF0000FF);	// Alpha full, Full blue
emit_uint32_t(&p, 0xFF0000FF);	// 32 bit clear colours need to be repeated twice
emit_uint32_t(&p, 0);
emit_uint8_t(&p, 0);
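
Since the clear colour is Alpha, Red, Green, Blue packed into 32 bits, a tiny helper makes the byte order harder to get wrong. pack_argb is a made-up name for this sketch and is not part of the GLES code.

```c
#include <stdint.h>

/* Hypothetical helper: pack a colour as Alpha, Red, Green, Blue,
   with alpha in the most significant byte. */
static inline uint32_t pack_argb(uint8_t a, uint8_t r, uint8_t g, uint8_t b)
{
	return ((uint32_t)a << 24) | ((uint32_t)r << 16) |
	       ((uint32_t)g << 8)  |  (uint32_t)b;
}
```

For example, pack_argb(0xFF, 0, 0, 0xFF) gives 0xFF0000FF, the bright blue value used above, and pack_argb(0xFF, 0, 0, 0) gives opaque black.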
