LizardLad_1
Posts: 133
Joined: Sat Jan 13, 2018 12:29 am

Writing to the framebuffer is slow

Sun May 20, 2018 4:49 am

Why does writing to the framebuffer appear to be slow? I can see the framebuffer being written to as it appears on the screen, and I am not sure how to speed it up. I know it must be possible if a Linux distro can render at 60fps on a Pi. So my question is: how do I speed up framebuffer access? Here is my GitHub repo: https://github.com/OllieLollie1/Raspi3-Kernel

LdB
Posts: 1576
Joined: Wed Dec 07, 2016 2:29 pm

Re: Writing to the framebuffer is slow

Sun May 20, 2018 1:30 pm

Let's deal with the basic stuff first

1.) You don't have the ARM cores at full speed; they are at the slow default speed
Get the max speed mailbox command 0x00030004 ARM clock ID = 0x3
Set that clock speed mailbox command 0x00038002 ARM clock ID = 0x3

2.) You haven't got even the basic instruction and data caches enabled in your start.S code
Your cores are still up at EL2 and, sorry, off the top of my head I don't know how to get the caches on in EL2.
I always run my kernel in EL1 as that is the more usual setup. Look at Figure 3.3, that is what I use
http://infocenter.arm.com/help/index.js ... 03s01.html
Not saying it's wrong or bad I just haven't looked at doing it in EL2

If you want to bring the whole MMU online bzt has a start tutorial on doing that which gives you all the caches
https://github.com/bztsrc/raspi3-tutori ... tualmemory

3.) It would appear you have 2 cores writing to the same framebuffer without resource locks. That may well prove interesting.
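For point 1, the two mailbox requests can be sketched in C roughly as below. This is only a sketch of the property-tag message layout as I understand it, not code from the repo: the function names are made up for illustration, and the actual mailbox handshake (writing the 16-byte aligned buffer address to the property channel and waiting for the reply) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAG_GET_MAX_CLOCK_RATE 0x00030004u
#define TAG_SET_CLOCK_RATE     0x00038002u
#define CLOCK_ID_ARM           0x3u

/* Hypothetical helper: fill msg[] with a "get max clock rate" request.
 * Returns the number of 32-bit words used. */
static size_t build_get_max_clock(uint32_t msg[8], uint32_t clock_id)
{
    assert(msg != NULL);
    msg[0] = 8 * sizeof(uint32_t);  /* total message size in bytes */
    msg[1] = 0;                     /* 0 = this is a request */
    msg[2] = TAG_GET_MAX_CLOCK_RATE;
    msg[3] = 8;                     /* value buffer size in bytes */
    msg[4] = 0;                     /* tag request code */
    msg[5] = clock_id;              /* in: clock id */
    msg[6] = 0;                     /* out: max rate in Hz, filled in by the VC */
    msg[7] = 0;                     /* end tag */
    return 8;
}

/* Hypothetical helper: fill msg[] with a "set clock rate" request. */
static size_t build_set_clock(uint32_t msg[9], uint32_t clock_id, uint32_t rate_hz)
{
    assert(msg != NULL);
    msg[0] = 9 * sizeof(uint32_t);
    msg[1] = 0;
    msg[2] = TAG_SET_CLOCK_RATE;
    msg[3] = 12;                    /* value buffer size in bytes */
    msg[4] = 0;
    msg[5] = clock_id;              /* in: clock id */
    msg[6] = rate_hz;               /* in: requested rate, out: rate actually set */
    msg[7] = 0;                     /* in: 0 = do not skip setting turbo */
    msg[8] = 0;                     /* end tag */
    return 9;
}
```

The idea is to send the first message, read the rate back from word 6 of the reply, and pass that rate to the second message.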

Now as for the other thing you are discussing, rendering, it has no relationship to what you are doing. Rendering gets the GPU to do all the work. Unfortunately it is not really viable to drive the GPU baremetal on the Pi, for two reasons. The first is that the documentation is a bit lacking, and if that were the only reason I probably would have persisted. The bigger problem, I think (as did Microsoft), is that the shader support is lacking.

I had a play around on GitHub and got one core rendering a full screen at 100fps, like you say
https://github.com/LdB-ECM/Raspberry-Pi ... LES2_Model
Why I stopped is that next you need shaders, and all the Pi support libraries use the MESA shader compiler. So you have to port a non-trivial amount of code to get your shaders compiled into the GPU language. The other alternative is to precompile all your shaders, but that is far too limiting. What is far more normal is for the shader compiler to be embedded in the GPU, or supplied as a standalone library supported by the manufacturer, and we just pass the shader code to it to be turned into GPU code.

Where Microsoft got up to with the shader support can be seen here
https://github.com/Microsoft/graphics-d ... d-Features

There is an alternative a number of people have now done, which is to use the released userland libraries for Linux by shimming the messaging thread system. For me there was no point in it, because I ended up stuck with libraries I couldn't update and we still had the shader issue.
The library files are here
https://github.com/raspberrypi/firmware ... dfp/opt/vc

Now what you could possibly do is turn your fonts into suitable render objects and get the GPU to render them for you. As you probably only want a solid color on the characters, you could use a pre-compiled shader.
Something like freetype-gl you could do with what we know of the GPU in baremetal
https://github.com/rougier/freetype-gl

LizardLad_1
Posts: 133
Joined: Sat Jan 13, 2018 12:29 am

Re: Writing to the framebuffer is slow

Mon May 21, 2018 7:08 am

Sorry, but I am quite new to the subject of bare metal and I have a few questions about your answer.

1) Would you please explain what the differences between the different exception levels

2) What is a MMU and how would it make anything faster.

I have looked at the MMU tutorials by bzt but I don't understand what they are used for. I thought that it would ruin all of my previous code by changing addresses.

If you look through my code you can see a lot of my code is derived from the code written by bzt. I am also in the process of writing the resource locks. If you look in the header folder of my code I just haven't had time to finish it because I have had a lot of assessments at school lately.

Also on an unrelated topic what are processor interrupts and how do they work?

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 26659
Joined: Sat Jul 30, 2011 7:41 pm

Re: Writing to the framebuffer is slow

Mon May 21, 2018 9:19 am

LizardLad_1 wrote:
Mon May 21, 2018 7:08 am
Sorry, but I am quite new to the subject of bare metal and I have a few questions about your answer.

1) Would you please explain what the differences between the different exception levels

2) What is a MMU and how would it make anything faster.

I have looked at the MMU tutorials by bzt but I don't understand what they are used for. I thought that it would ruin all of my previous code by changing addresses.

If you look through my code you can see a lot of my code is derived from the code written by bzt. I am also in the process of writing the resource locks. If you look in the header folder of my code I just haven't had time to finish it because I have had a lot of assessments at school lately.

Also on an unrelated topic what are processor interrupts and how do they work?
Don't want to discourage you, but if you are having to ask these sorts of basic questions, then going baremetal on a Pi may be a bit of a stretch. It might be worth starting with Linux and learning about all this stuff in a more benign environment first.

MMU = Memory Management unit - https://en.wikipedia.org/wiki/Memory_management_unit
Processor interrupts - https://en.wikipedia.org/wiki/Interrupt
Exception levels - https://en.wikipedia.org/wiki/Exception_handling
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed.
I've been saying "Mucho" to my Spanish friend a lot more lately. It means a lot to him.

LizardLad_1
Posts: 133
Joined: Sat Jan 13, 2018 12:29 am

Re: Writing to the framebuffer is slow

Mon May 21, 2018 10:46 am

Sorry I knew of them through different names. I understand what they are and their uses.

LdB
Posts: 1576
Joined: Wed Dec 07, 2016 2:29 pm

Re: Writing to the framebuffer is slow

Mon May 21, 2018 1:46 pm

Yes, on Intel they are called CPU protection rings or privilege levels; exception level is just ARM's name for it.
https://en.wikipedia.org/wiki/Protection_ring

So you have the ARM CPU in Hypervisor mode. In Intel ring terms that is called ring -1; see the section labelled "Hypervisor mode" in the link above.

Basically you currently have the CPU at the level you would run VMware at, not an O/S kernel, which is why I don't know it.
It is nothing you did: the CPU was given to you by the boot stub in EL2 mode, because if you want to run VMware you need it like that.

Generally (the ARM norm), if we aren't intending to run VMware we punch a few registers and switch the CPU to EL1 mode, where we run our kernel.

So beyond what you do in your basic bootloader here is all I do .. it's commented

Code: Select all

//"================================================================"
//  Initialize HCR_EL2 so EL1 is 64 bits for all Cores
//"================================================================"
	mov	x0, #(1 << 31)						// 64bit EL1
	msr	hcr_el2, x0

//"================================================================"
//  Initialize SCTLR_EL1 for all Cores
//"================================================================"
    /*  RES1 bits (29,28,23,22,20,11) to 1
	 *  RES0 bits (31,30,27,21,17,13,10,6) +
	 *  UCI,EE,EOE,WXN,nTWE,nTWI,UCT,DZE,I,UMA,SED,ITD,
	 *  CP15BEN,SA0,SA,C,A,M to 0 */
	mov	x0, #0x0800
	movk	x0, #0x30d0, lsl #16
	orr    x0, x0, #(0x1 << 2)            // The C bit on (data cache). 
	orr    x0, x0, #(0x1 << 12)           // The I bit on (instruction cache)
	msr	sctlr_el1, x0

//"================================================================"
//  Return to the EL1_SP1 mode from EL2 for all Cores
//"================================================================"
	mov	x0, #0x3c5							// EL1_SP1 | D | A | I | F
	msr	spsr_el2, x0						// Set spsr_el2 with settings
	adr	x0, exit_el1						// Address to exit EL2
	msr	elr_el2, x0							// Set elevated return register
	eret									// Call elevated return
exit_el1:
So my code exits at exit_el1 label running in EL1_SP1 mode

Walking thru the 3 steps.

EL1 can run either 64-bit or 32-bit code.
So a 64-bit VMware can run both a 64-bit O/S and a 32-bit O/S at the same time.
We don't have VMware and we want to run 64-bit, so we set EL1 to 64-bit.

The next bit turns on the data and instruction caches (the C and I bits of SCTLR_EL1), so I have at least some cache running if I decide not to do the full cache setup. That helps with execution speed. Other than that it sets up some basic EL2 security traps so they pass to EL1.

Then I switch the CPU from EL2 to EL1, making sure the interrupts are disabled.
To do that we write to special registers and do an eret, and the CPU does an exception return to the lower exception level.
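The magic numbers in the listing above can be cross-checked by building them from named bits. This is just an illustrative sketch: the macro and function names are my own, and the bit positions are taken from the comments in the listing.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n) (UINT64_C(1) << (n))

/* HCR_EL2.RW: EL1 runs in 64-bit (AArch64) state */
#define HCR_EL2_RW      BIT(31)

/* SCTLR_EL1: the RES1 bits from the listing, plus the two cache enables */
#define SCTLR_EL1_RES1  (BIT(29) | BIT(28) | BIT(23) | BIT(22) | BIT(20) | BIT(11))
#define SCTLR_EL1_C     BIT(2)   /* data cache enable */
#define SCTLR_EL1_I     BIT(12)  /* instruction cache enable */

/* SPSR_EL2: D, A, I, F masked, return to EL1 using SP_EL1 */
#define SPSR_MASK_DAIF  (BIT(9) | BIT(8) | BIT(7) | BIT(6))
#define SPSR_MODE_EL1H  UINT64_C(0x5)

static_assert(HCR_EL2_RW == UINT64_C(0x80000000), "matches mov x0, #(1 << 31)");

/* Value written to sctlr_el1 after the two orr instructions */
static uint64_t sctlr_el1_value(void)
{
    return SCTLR_EL1_RES1 | SCTLR_EL1_C | SCTLR_EL1_I;
}

/* Value written to spsr_el2 before the eret */
static uint64_t spsr_el2_value(void)
{
    return SPSR_MASK_DAIF | SPSR_MODE_EL1H;
}
```

Without the C and I bits the SCTLR_EL1 value is 0x30d00800, which is what the mov/movk pair loads before the two orr instructions.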

So really all the difference is I end up with the cpu in 64Bit EL1 and some cache turned on and you are still in EL2 with no cache.

In my code that happens for each core as I bring them into their startup position, so each core is in EL1_SP1 with basic cache on.
All your code runs in either state; it's just faster on the EL1 setup because of the cache.

The advantages of bringing the whole MMU online are more cache speed, memory access protection, cache coherency control for multiprocessor use, and the ability to virtualize memory. The cost is the setup code and the requirement for a bit of memory for the table.

LizardLad_1
Posts: 133
Joined: Sat Jan 13, 2018 12:29 am

Re: Writing to the framebuffer is slow

Wed Jun 06, 2018 10:49 am

Hi @LdB, your suggestions helped a lot, so thank you. However I am still only able to draw a single frame a second. I haven't been able to completely understand the use of the MMU and I foresee a few problems, the major one being that if I turn the caches on and change the value of a semaphore, the change wouldn't be visible to the other cores, making semaphores completely redundant. If I turn on the MMU, can I have a 1:1 virtual-to-physical mapping so that my current code doesn't break?

Once I have this completed I am hoping to achieve a layer-type system where I have a set number of layers all drawn to the framebuffer in a certain order. However I foresee this being terribly inefficient, so if you can give me any tips or if you see any errors in my reclocking code please tell me.

I have read this thread viewtopic.php?f=72&t=211003

LdB
Posts: 1576
Joined: Wed Dec 07, 2016 2:29 pm

Re: Writing to the framebuffer is slow

Wed Jun 06, 2018 5:40 pm

This is a 1:1 32bit table for the 4GB space
It assumes VC is configured for 128Mb, as if you are going to turn the camera or OpenGL on

0x00000000-0x00FFFFFF 0Mb-16Mb is set up for your kernel code
0x01000000-0x36FFFFFF 16Mb-880Mb is set up for heap/general memory
0x37000000-0x3EFFFFFF 880Mb-1008Mb is set up for VC
0x3F000000-0x3FFFFFFF 1008Mb-1024Mb for peripherals
0x40000000-0x400FFFFF 1024Mb-1025Mb CPU mailbox (QA7_rev3.4.pdf datasheet)
0x40100000-0x7FFFFFFF 1025Mb-2048Mb unused
0x80000000-0xFFFFFFFF 2048Mb-4096Mb unused

Single line of code will set it up for you .. configure each core to use the table.

Code: Select all

 init_page_table ();

Update: I just remembered you want it for 64bit don't you. It's very similar under 64bit just some different bit definitions. I will need to come back to this :-)

Code: Select all

static volatile __attribute__((aligned(0x4000))) uint32_t page_table[4096];
static volatile __attribute__((aligned(0x400))) uint32_t leaf_table[256];

// ARM Cortex-A Handbook 9.6.1
// bits 0,4 = PXN,XN
#define SECTION_SHAREABLE (1<<16)
#define SECTION_FULL_ACCESS ((1<<11)|(1<<10)) // APX 0, AP 11
#define SECTION_XN (1<<4)

//  12  2
// TEX CB when SCTLR.TRE is set to 0
// 001 11 Outer and Inner Write-Back, Write-Allocate 
// 000 11 Outer and Inner Write-Back, no Write-Allocate
// 000 10 Outer and Inner Write-Through, no Write-Allocate
// 000 00 Strongly-ordered 
// 000 01 Shareable Device
// 010 00 non shareable device
// (each macro is fully parenthesized so it is safe inside larger expressions)
#define SECTION_WRITEBACK_ALLOCATE    ((1<<12)|(1<<3)|(1<<2)|2)
#define SECTION_WRITEBACK_NO_ALLOCATE         ((1<<3)|(1<<2)|2)
#define SECTION_WRITETHROUGH_NO_ALLOCATE      ((1<<3)       |2)
#define SECTION_STRONGLY_ORDERED                            (2)
#define SECTION_SHAREABLE_DEVICE                     ((1<<2)|2)
#define SECTION_NON_SHAREABLE_DEVICE  ((1<<13)              |2)

void init_page_table (void) {
	uint32_t base = 0;

	// All Write-Back memory can be cached when ACTLR.SMP is set to 1, the MMU is enabled, and SCTLR.C is set to 1.

	// initialize page_table
	// 1024MB - 16MB of kernel memory (some belongs to the VC)
	// default: 880 MB ARM ram, 128MB VC

	/* This is the 0-16Mb which I allocate to kernel (entries 0..15) */
	for (base = 0; base < 16; base++) {
		// section descriptor (1 MB)
		// kernel is write-through cached
		page_table[base] = base << 20 | SECTION_WRITETHROUGH_NO_ALLOCATE | SECTION_FULL_ACCESS;
	}

	/* 880Mb of heap/memory space */
	for (; base < 880; base++) {
		// section descriptor (1 MB)
		// heap is cached
		page_table[base] = base << 20 | 1 << 14 | 1 << 13 | 1 << 12 | 1 << 3 | 1 << 2 | 2 | SECTION_FULL_ACCESS;
	}

	/* VC ram up to 0x3F000000 */
	for (; base < 1024 - 16; base++) {
		// section descriptor (1 MB)
		page_table[base] = base << 20 | SECTION_WRITETHROUGH_NO_ALLOCATE | SECTION_SHAREABLE | SECTION_FULL_ACCESS;
	}

	/* 16 MB peripherals at 0x3F000000 */
	for (; base < 1024; base++) {
		// shared device, never execute
		page_table[base] = base << 20 | 0x10416;
	}

	// 1 MB mailboxes
	// shared device, never execute
	page_table[base] = base << 20 | 0x10416;
	++base;

	// unused up to 0x7FFFFFFF
	for (; base < 2048; base++) {
		page_table[base] = 0;
	}

	// 2048MB unused (rest of address space)
	for (; base < 4096; base++) {
		page_table[base] = 0;
	}

	// initialize leaf_table
	for (base = 0; base < 256; base++) {
		leaf_table[base] = 0;
	}
}
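The 0x10416 constant used for the peripheral and mailbox entries above can be taken apart into named bits. The sketch below is only for illustration: the SEC_* names and section_entry() are hypothetical, not part of the code above, and the bit meanings follow the short-descriptor section format the listing's comments refer to.

```c
#include <assert.h>
#include <stdint.h>

#define SEC_TYPE_SECTION (1u << 1)   /* bits[1:0] = 10: 1 MB section */
#define SEC_B            (1u << 2)   /* with TEX=000, C=0: shareable device */
#define SEC_XN           (1u << 4)   /* never execute */
#define SEC_AP_PRIV_RW   (1u << 10)  /* AP[1:0] = 01 */
#define SEC_S            (1u << 16)  /* shareable */

#define SEC_DEVICE_SHARED_XN \
    (SEC_S | SEC_AP_PRIV_RW | SEC_XN | SEC_B | SEC_TYPE_SECTION)

/* Build a 1:1 section entry for the given megabyte index (0..4095) */
static uint32_t section_entry(uint32_t mb_index, uint32_t attrs)
{
    assert(mb_index < 4096);
    return (mb_index << 20) | attrs;
}
```

So the entry for the first peripheral megabyte (index 1008) is section_entry(1008, SEC_DEVICE_SHARED_XN), and because the mapping is 1:1 the top 12 bits of each entry are simply the index of the entry itself.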

LizardLad_1
Posts: 133
Joined: Sat Jan 13, 2018 12:29 am

Re: Writing to the framebuffer is slow

Sun Jun 10, 2018 7:34 am

Thanks a lot LdB. How are you going on the different bit definitions?

LdB
Posts: 1576
Joined: Wed Dec 07, 2016 2:29 pm

Re: Writing to the framebuffer is slow

Tue Jun 12, 2018 8:16 am

Sorry for the delay, I have been busy. I was going to cheat and use bzt's tutorial example, but it had too many problems so I ended up rewriting much of it.

So this is the conversion of bzt's tutorial
https://github.com/LdB-ECM/Raspberry-Pi ... tualmemory

You probably won't be interested in the virtualization just the 1:1 mapping so I will give you a quick description.

Now I give a warning: the L1, L2, L3 levels get very confusing in 64-bit mode because there are so many modes of table operation.
If you look at table D4-20 here
https://github.com/codingbelief/arm-arc ... _format.md#
you will see that for the 4K granule size the tables are actually between level 1 and level 2, and for the 16K & 64K granules they run between level 2 and 3. Now I started with 64K granules so my naming won't match the 4K naming; don't get caught.

So the 1:1 mapping is a simple 2 level table translation. I call them L2 & L3 (64K granule names); they are L1 & L2 respectively if you use the 4K granule names. Yes, it's a pain in the butt that your level names change just by changing granularity.
On the Pi3 the address range we need to span is just over 0x40000000, to reach the core mailboxes described in QA7_rev3.4.pdf

On my L3 table each entry is a block of 2M and you have max 512 entries, which means one L3 table spans 1GB (0x40000000), just short of what we need. So we need two 512-entry L3 tables. I allocate 1 big L3 table of 1024 entries to meet that requirement, and it needs to be aligned to a 4K boundary.

Code: Select all

/* The Level 3 ... 1 to 1 mapping */
/* This will have 1024 entries x 2M so a full range of 2GB */
static __attribute__((aligned(4096))) uint64_t L3map1to1[1024] = { 0 };
So now on the L2 table you simply map the first two entries to L3map1to1[0] and L3map1to1[512] the rest of the table is zero and unused

Code: Select all

/* The Level 2 .... 1 to 1 mapping */
static __attribute__((aligned(4096))) uint64_t L2map1to1[512] = { 0 };

So now in
void init_page_table(void)
You will see we organize that mapping, and then run thru allocating address to the blocks and or'ing on some flags.
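To make that step concrete, here is a rough sketch of filling a 1024-entry block table and the two top-level entries. It is not LdB's init_page_table: the function name, the DESC_* macros and the choice of attribute index 0 for normal memory and 1 for device memory are assumptions for illustration, and for simplicity everything from 0x3F000000 upwards is marked as device memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE_2M   UINT64_C(0x200000)
#define DESC_BLOCK      UINT64_C(0x1)          /* bits[1:0] = 01: block entry */
#define DESC_TABLE      UINT64_C(0x3)          /* bits[1:0] = 11: next-level table */
#define DESC_ATTR(i)    ((uint64_t)(i) << 2)   /* AttrIndx: slot in MAIR_EL1 */
#define DESC_SH_INNER   (UINT64_C(3) << 8)     /* inner shareable */
#define DESC_AF         (UINT64_C(1) << 10)    /* access flag */

#define ATTR_NORMAL     0   /* assumed MAIR_EL1 slot for cached normal memory */
#define ATTR_DEVICE     1   /* assumed MAIR_EL1 slot for device memory */
#define PERIPHERAL_BASE UINT64_C(0x3F000000)

/* Hypothetical helper: 1:1 map the first 2GB with 2MB blocks.
 * Both tables must be aligned to 4096 bytes. */
static void fill_identity_map(uint64_t blocks[1024], uint64_t top[512])
{
    assert(((uintptr_t)blocks & 0xFFF) == 0 && ((uintptr_t)top & 0xFFF) == 0);

    for (size_t i = 0; i < 1024; i++) {
        uint64_t pa = (uint64_t)i * BLOCK_SIZE_2M;   /* virtual == physical */
        uint64_t attr = (pa >= PERIPHERAL_BASE)
                      ? DESC_ATTR(ATTR_DEVICE)
                      : (DESC_ATTR(ATTR_NORMAL) | DESC_SH_INNER);
        blocks[i] = pa | attr | DESC_AF | DESC_BLOCK;
    }

    for (size_t i = 0; i < 512; i++)
        top[i] = 0;                                  /* unused entries fault */
    top[0] = (uint64_t)(uintptr_t)&blocks[0]   | DESC_TABLE;  /* first 1GB */
    top[1] = (uint64_t)(uintptr_t)&blocks[512] | DESC_TABLE;  /* second 1GB */
}
```

With tables shaped like L3map1to1 and L2map1to1 above, this would be called as fill_identity_map(L3map1to1, L2map1to1) before the table address is handed to ttbr0_el1.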

Then all you do is punch a few registers and throw the table to ttbr0_el1 and you are done.

In doing that the assumption is your kernel code is running in EL1 .. you need to adjust registers if you have the core in a different EL.

If it helps to see the physical written expansion of the 1:1 table here it is
https://github.com/LdB-ECM/Exchange/blo ... t/memmap.s

LizardLad_1
Posts: 133
Joined: Sat Jan 13, 2018 12:29 am

Re: Writing to the framebuffer is slow

Wed Jun 13, 2018 7:07 am

So from what I have read of your code I can still use direct writes to the addresses I may be wrong? Do I need to use virtualmap() for every read/write?

AlfredJingle
Posts: 69
Joined: Thu Mar 03, 2016 10:43 pm

Re: Writing to the framebuffer is slow

Wed Jun 13, 2018 10:24 am

Hi LizardLad:

Yes, once the translation tables are activated you can just write/read to/from any memory location like you would normally; it is just much faster. In my system a fill of a 3Mb screen buffer takes between 3-5 ms. The variation comes from whether the GPU is reading from VC memory at the same time as a core is writing to it. If you use C++ instead of optimised assembly for writing to VC memory, you will not see any slowing of memory access due to GPU access, as the compiled C++ routines leave the GPU enough time to get data from memory.

Because in your posts you wrote about cache coherency and semaphores, there is one addition I would propose to LdB's memory map. I would add a 1 Mb block (or a more specific size if you use L2/L3 translation), e.g. between VC and heap memory, which is uncached (tex: 001 cb: 00). This is where you put your buffers for communication to and from the mailboxes, and it is where you can put your semaphores shared between the different cores. Because the memory block is uncached, it will always contain correct data.
If you envision a high data flow between the cores, I would add a second memory block which is set to outer/inner cached write-back, write-ALLOCATE, shared. The coherency system of the ARMv8 processor only works for cached memory with write-allocate. Memory with this setting is slower than normal cached memory but much faster than uncached memory.

Finally: I have a system running which does more or less the same you describe. It took me long time (>18 months) to get it running perfectly. Just keep trying! You will get there, learn a lot and have a lot of fun!
going from a 6502 on an Oric-1 to an ARMv8 is quite a big step...

LizardLad_1
Posts: 133
Joined: Sat Jan 13, 2018 12:29 am

Re: Writing to the framebuffer is slow

Fri Jun 22, 2018 11:09 pm

So do I write to the 64 bit address returned by virtualmap() or can I write to the memory addresses that I have hard coded in?

AlfredJingle
Posts: 69
Joined: Thu Mar 03, 2016 10:43 pm

Re: Writing to the framebuffer is slow

Sat Jun 23, 2018 8:41 pm

You write to the virtual address. If you do a 1:1 mapping, as described above by LdB, there is no difference between physical and virtual addresses.

bzt
Posts: 563
Joined: Sat Oct 14, 2017 9:57 pm

Re: Writing to the framebuffer is slow

Sun Jul 01, 2018 12:04 am

Hi,

First of all, I'd like to say thanks to LdB for rewriting my tutorial! I liked that, the more examples we have the merrier! :-)

Second, on topic, LdB is right that you should have caching, but that's not all. You can do a lot by writing your code carefully. For example, it is no coincidence that my lfb_print() routine looks the way it does. I paid attention not to do multiplication within the loop; instead I calculated everything in advance and used only addition inside, as that's much faster.

I've checked your code, and here is some advice. Your setpixel function is just plain wrong. Don't calculate the offset every time, that's slow. Assuming you are displaying a 1024x768 image, you are repeating the same kind of multiplication about 786,000 times per frame. Instead use a variable and adjust it. Second, don't write r, g and b byte by byte to memory, that's slow too. Instead use bit shift operators, like (r<<16)|(g<<8)|b, which will likely be compiled to register-only instructions (no memory access penalty). Also, don't recalculate the pixel value from main_r, main_g and main_b every time; do it once before the loops, which also saves lots of function calls.
Furthermore, having transparent text is a nice feature, but this is not the way to do it. Instead of

Code: Select all

for
  for
    if transparency
      if fg
        setpixel1
    else
      setpixel2
Do something like

Code: Select all

if transparency
  for
    for
      if fg
        setpixel1
else
  for
    for
      setpixel2
Having a conditional within the loop is bad, as gcc will most likely compile it to a conditional branch. What a branch can do is flush the instruction prefetch, which slows down the CPU. So move the conditional outside of the loops. I've used the ternary operator on two constants to select background and foreground color, which is hopefully compiled to a branch-less conditional instruction (but that's not guaranteed; it depends on actual register load, -O level etc.). With transparency you must have at least one "if" in the loop, but avoid having more.
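In C, the second layout might look roughly like this. It is only a sketch, not the lfb_print() code: the function name, the 8-pixel-wide glyph layout (one byte per row, most significant bit on the left) and the 32 bits per pixel framebuffer are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Draw an 8-pixel-wide, 1 bit per pixel glyph into a 32 bpp framebuffer.
 * pitch is the number of bytes per scanline. */
static void draw_glyph8(uint32_t *fb, uint32_t pitch, uint32_t x, uint32_t y,
                        const uint8_t *glyph, uint32_t rows,
                        uint32_t fg, uint32_t bg, bool transparent)
{
    assert(pitch % 4 == 0);
    /* The only multiplications, done once outside the loops */
    uint8_t *row = (uint8_t *)fb + (size_t)y * pitch + (size_t)x * 4;

    if (transparent) {
        /* Transparent text: one "if" per pixel, background left untouched */
        for (uint32_t r = 0; r < rows; r++, row += pitch) {
            uint32_t *p = (uint32_t *)row;
            for (uint8_t mask = 0x80; mask != 0; mask >>= 1, p++) {
                if (glyph[r] & mask)
                    *p = fg;
            }
        }
    } else {
        /* Opaque text: every pixel is written, selected with a ternary */
        for (uint32_t r = 0; r < rows; r++, row += pitch) {
            uint32_t *p = (uint32_t *)row;
            for (uint8_t mask = 0x80; mask != 0; mask >>= 1, p++)
                *p = (glyph[r] & mask) ? fg : bg;
        }
    }
}
```

The transparency test runs once per glyph instead of once per pixel, and the row pointer advances by adding pitch rather than by multiplying.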

As a basic rule, use as little code as possible within the loops, and avoid function calls and arithmetic other than addition. If you can, replace multiplication and division of unsigned values with bit shifting (x*2 = x<<1, x*4 = x<<2, x*1024 = x<<10, x/2 = x>>1, x/4 = x>>2 etc.). When copying an image from memory to the framebuffer, it also helps to use uint64_t (pixel1<<32|pixel2), which cuts the number of loop iterations in half.
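As a small illustration of those rules, here is a sketch of a solid rectangle fill for a 32 bits per pixel framebuffer. The names pack_rgb and fill_rect are made up for this example; the pixel is packed once, the start offset is calculated once, and the loops only add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack 8-bit r, g, b into one 32-bit pixel using shifts only */
static inline uint32_t pack_rgb(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | b;
}

/* Fill a w x h rectangle at (x, y); pitch is bytes per scanline.
 * No clipping: the caller must keep the rectangle on screen. */
static void fill_rect(uint32_t *fb, uint32_t pitch,
                      uint32_t x, uint32_t y, uint32_t w, uint32_t h,
                      uint32_t pixel)
{
    assert(pitch % 4 == 0);
    /* Start offset: the only multiplications in the function */
    uint8_t *row = (uint8_t *)fb + (size_t)y * pitch + (size_t)x * 4;

    for (uint32_t j = 0; j < h; j++) {
        uint32_t *p = (uint32_t *)row;
        for (uint32_t i = 0; i < w; i++)
            *p++ = pixel;          /* one 32-bit store per pixel */
        row += pitch;              /* next scanline by addition */
    }
}
```

A call such as fill_rect(fb, pitch, 10, 10, 100, 50, pack_rgb(255, 128, 0)) then does 5000 stores and no per-pixel multiplication.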

For even further optimization, I'm afraid you'll have to drop C and switch to Assembly, use registers to store the pre-calculated values, and have only one single STR instruction to store the pixel in the framebuffer. With STP you can write 4 pixels at once, minimizing the number of memory accesses and MMU address translations.

(And a final note on why I've used font width + 1: on the original IBM VGA, where I got the font from, each character was represented in a 9x16 pixel box. Only 8 bits were stored in ROM, and the 9th column was cleared (making a 1 pixel gap between characters to increase readability), except for the box drawing characters, which copied the 8th column.)

bzt

LizardLad_1
Posts: 133
Joined: Sat Jan 13, 2018 12:29 am

Re: Writing to the framebuffer is slow

Sun Jul 01, 2018 4:49 am

Alright I tried some of your suggestions bzt here is what the asm is now

Code: Select all

adrp	x6, 4
adrp	x5, 8
lsl	w0, w0, #2
and	w2, w2, #0xff
ldr	w6, [x6]
ubfiz	w3, w3, #8, #8
ldr	x5, [x5]
ubfiz	x4, x4, #16, #8
orr	w3, w3, w2
orr	w3, w3, w4
mul	w1, w1, w6
add	x5, x5, x1
str	w3, [x5, w0, uxtw]
ret
It appears that ptr += x * 4 compiles to ptr += (x<<2) anyway with optimisations.
There is a noticeable speed difference however I don't know how to change ptr += pitch * y to a bit shift as y isn't always a power of two.

And one other question: without losing speed, is there a way to create layers such that text is drawn over the colours? The only way I can think of doing this is to create two framebuffers, write the text over the colours, and then copy the result to the display framebuffer.

bzt
Posts: 563
Joined: Sat Oct 14, 2017 9:57 pm

Re: Writing to the framebuffer is slow

Sun Jul 01, 2018 9:31 am

Hi,

About your Asm, I don't know, there are no comments at all. What are the registers for? And I don't see any loop; that's certainly missing. Otherwise yeah, that's the idea: calculate the pixel in a register, also keep track of the offset in a register, and use STR to write to the framebuffer.

You can only use the shift trick with powers of two. If your multiplication consists of a constant (in the mathematical sense, not the programming language sense) like pitch, and a variable (a mathematically varying value), then you can replace it effectively by repeated addition. That's what I do: "offs += pitch" in the "for y" loop.

About colour layers, I'm not sure what you're asking. If you mean reading the framebuffer, then yes, it is very common to create a shadow buffer in RAM, because reading the framebuffer is usually slow. Although I'm not sure this is needed on the RPi, considering it has a special SoC in which the GPU and the CPU share a memory bus (sort of). I'd say turn off caching and create a function that reads each and every uint32_t from a buffer (dropping the result). Call it at least a million times on a normal RAM buffer and measure the time it takes. That will be your baseline. Next, repeat the test on the framebuffer address provided by the mailbox call. If you see a significant difference, then it is worth creating a shadow RAM buffer.
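The read function for that test could be as simple as the sketch below. The name is made up, and instead of dropping the result it returns a sum, which is a small variation: together with the volatile qualifier it stops the compiler from optimising the reads away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read every 32-bit word of a buffer once and return the sum.
 * volatile forces one real load per word. */
static uint32_t read_all_words(const volatile uint32_t *buf, size_t words)
{
    assert(buf != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < words; i++)
        sum += buf[i];
    return sum;
}
```

To time it, read a timer before and after a large number of calls, first with a pointer to ordinary RAM and then with the framebuffer address, and compare the two durations. Which timer to use is up to your kernel.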

bzt

LizardLad_1
Posts: 133
Joined: Sat Jan 13, 2018 12:29 am

Re: Writing to the framebuffer is slow

Mon Jul 02, 2018 2:56 am

The ASM there is what is generated by gcc it only includes the lfb_draw_pixel function. The reason there is no loop there is because the function only draws a single pixel. The loop is in the main function not in the lfb_draw_pixel. That is why I calculate the offset each time because the coordinates could be any arbitrary number.

bzt
Posts: 563
Joined: Sat Oct 14, 2017 9:57 pm

Re: Writing to the framebuffer is slow

Mon Jul 02, 2018 9:32 am

LizardLad_1 wrote:
Mon Jul 02, 2018 2:56 am
The ASM there is what is generated by gcc it only includes the lfb_draw_pixel function. The reason there is no loop there is because the function only draws a single pixel. The loop is in the main function not in the lfb_draw_pixel. That is why I calculate the offset each time because the coordinates could be any arbitrary number.
See? There's your problem. When writing text on screen or displaying pixmaps, those coordinates are not arbitrary (only the first one is). Subsequent calls use constantly increasing positions. Use that to your advantage: replace the universal solution with a specialized offset adjustment to speed up your code.

Also, not making a function call from within the loop is essential to keep cache locality, so that your setpixel routine, the loop start and the loop end are all included in the same cache block. Using registers is cache agnostic and therefore does not violate the cache locality principle (and their access latency is far lower than that of memory, see memory hierarchy):

Code: Select all

---cache block A---
func
  code
---cache block B---
for
  for
    call func
is inefficient, as a cache block may have to be evicted and reloaded on each iteration. On the other hand,

Code: Select all

---cache block A---
for
  for
    code
fits in one cache block, therefore the cache won't be invalidated throughout the entire execution of the loop. Just to be clear, I'm talking about instruction caching of the generated code here, not data caching for the framebuffer.

bzt

LizardLad_1
Posts: 133
Joined: Sat Jan 13, 2018 12:29 am

Re: Writing to the framebuffer is slow

Sun Jul 08, 2018 11:16 am

The point of lfb_write_pixel was just to write to arbitrary pixels. I am planning on writing a lfb_draw_rectangle to quickly draw rectangles of a single colour. The reason I wrote a function to draw single pixels without writing a pixmap to them is to set the colour programmatically. If you look at my main loop (before I moved on to file reads, so look at previous commits), it is writing a gradient, not a pixmap or a piece of text.
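A gradient can still follow the running-offset advice from earlier in the thread, since its pixels are visited in order too. The sketch below is not the code from the repo: draw_gradient is a hypothetical name, it assumes a 32 bits per pixel framebuffer, and it ramps red from left to right and green from top to bottom using 16.16 fixed-point steps so that the inner loop only adds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill the whole screen with a red (x) / green (y) gradient.
 * pitch is bytes per scanline. */
static void draw_gradient(uint32_t *fb, uint32_t pitch,
                          uint32_t width, uint32_t height)
{
    assert(pitch % 4 == 0 && width > 0 && height > 0);

    /* Per-pixel and per-row colour increments in 16.16 fixed point */
    uint32_t r_step = width  > 1 ? (255u << 16) / (width  - 1) : 0;
    uint32_t g_step = height > 1 ? (255u << 16) / (height - 1) : 0;

    uint8_t *row = (uint8_t *)fb;
    uint32_t g_acc = 0;
    for (uint32_t y = 0; y < height; y++) {
        uint32_t *p = (uint32_t *)row;
        uint32_t g_bits = (g_acc >> 16) << 8;   /* green, once per row */
        uint32_t r_acc = 0;
        for (uint32_t x = 0; x < width; x++) {
            /* integer part of r_acc already sits in the red byte */
            *p++ = (r_acc & 0x00FF0000u) | g_bits;
            r_acc += r_step;
        }
        row += pitch;
        g_acc += g_step;
    }
}
```

The two divisions happen once per frame, so the cost per pixel is a mask, an OR, a store and an addition.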

riverajl
Posts: 10
Joined: Thu Dec 06, 2018 10:03 am

Re: Writing to the framebuffer is slow

Fri Dec 21, 2018 10:06 am

LdB wrote:
Sun May 20, 2018 1:30 pm
Lets deal with the basics stuff first

1.) You don't have the arm cores at full speed they are at the slow default speed
Get the max speed mailbox command 0x00030004 ARM clock ID = 0x3
Set that clock speed mailbox command 0x00038002 ARM clock ID = 0x3
Hi guys.

I have tried to check/set the ARM clock via mailbox but it always returns rate as zero. According to the documentation, it seems that the clock does not exist.

Can someone help me?

José Luiz

LizardLad_1
Posts: 133
Joined: Sat Jan 13, 2018 12:29 am

Re: Writing to the framebuffer is slow

Mon Dec 31, 2018 12:13 am

With the help of LdB I've gotten this working so hopefully I can help. First question do you have the caches enabled and if not may I see your mbox code and your calls?

ab1jx
Posts: 885
Joined: Thu Sep 26, 2013 1:54 pm
Location: Heath, MA USA

Re: Writing to the framebuffer is slow

Tue Feb 12, 2019 2:37 pm

My approach to doing this. You don't need to mess with mailboxes anymore, that's built into the kernel now. You open /dev/fb0, mmap to get a pointer and write to it. You need to keep track of what goes where but it's the same as writing into an image buffer then calling libjpeg or libpng to write it out as an image file.

Save as fbtemp.c, compile like:
gcc -g fbtemp.c -o fbtemp
I stripped this down to be a template that does the plumbing from another program. It doesn't actually do anything with the framebuffer once it has it, just backs up the screen, sets it to black, sleeps 2 seconds, then puts it back.

Code: Select all

/*
   framebuffer template program    
   
   Run this and the screen goes blank for 2 seconds, then comes back.
   
*/

#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>        /* for mmap */
#include <sys/ioctl.h>
#include <linux/fb.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

char *fbp; // pointer to framebuffer (upper left corner)
char *backp; // screen backup area
uint32_t screenbytes; // screen size in bytes (authoritative)
uint32_t screenstride; // bytes for each scanline of full screen
uint32_t yres; // vertical screen resolution
uint32_t xres; // horizontal screen resolution
uint32_t bpp; // bytes per pixel
uint32_t bitsperpixel; // sometimes you need bits, sometimes bytes
uint32_t speccolor = 0x01f5f600;  // Tektronix scope blue pixel at 32 bpp

void restore_screen(void) {  // copy back
  memcpy(fbp,backp,screenbytes);
}

void fbinit(void) { // set up framebuffer, back up screen area
  struct fb_var_screeninfo vinfo; // fetched with an ioctl
  struct fb_fix_screeninfo finfo; // this has smem_len: screen bytes
  int fbfd;  // frame buffer file descriptor
  fbfd = open("/dev/fb0", O_RDWR);
  if (fbfd == -1) {
    fprintf(stderr,"Error opening /dev/fb0\n");
    perror("open ");
    exit(1);
  }
  // get the fixed screen information
  if (ioctl (fbfd, FBIOGET_FSCREENINFO, &finfo)) {
    printf("Error reading fixed information.\n");
    exit(2);
  }
  // get variable screen info
  // each struct (FSCREENINFO and VSCREENINFO) has unique and useful numbers
  if (ioctl (fbfd, FBIOGET_VSCREENINFO, &vinfo)) {
    fprintf(stderr,"Error reading variable screen info struct.\n");
    exit(3);
  }
  xres = vinfo.xres;
  yres = vinfo.yres;
  bpp = vinfo.bits_per_pixel / 8;  
  bitsperpixel = vinfo.bits_per_pixel;
  printf("Screen is %u x %u, %u bytes/pixel\n",xres,yres,bpp);
  screenstride = finfo.line_length;
  screenbytes = finfo.smem_len;
  backp = malloc(screenbytes);
  if (backp == NULL) {
    fprintf(stderr,"Can't malloc %u bytes to back up screen\n",screenbytes);
    exit(1);
  }
  fbp = (char *) mmap(NULL,finfo.smem_len,PROT_READ | PROT_WRITE, MAP_SHARED,fbfd,0);
  if (fbp == MAP_FAILED) { // mmap returns MAP_FAILED, i.e. (void *) -1, on error
    fprintf(stderr,"mmap failed\n");
    perror("mmap ");
    exit(1);
  }
  close(fbfd); // don't need anymore
  memcpy(backp,fbp,screenbytes); // back up screen
  memset(fbp,0,screenbytes);  // fill screen with black (memset comes from string.h)
} // end fbinit


int main(void) {
  fbinit();
  sleep(2); // simulate doing something, call your function here
  restore_screen();
  free(backp);
  munmap(fbp,screenbytes);
  return 0;
}
In every instance I've seen, the framebuffer has been RGBA, which is good and bad. It uses a lot of memory, but you can write a uint32_t pixel anywhere, which is much simpler than 24 bit color, where you write red, green and blue as individual bytes and have to get them lined up right.

LdB
Posts: 1576
Joined: Wed Dec 07, 2016 2:29 pm

Re: Writing to the framebuffer is slow

Wed Feb 13, 2019 1:37 am

He is working baremetal, so there is no such thing as /dev/fb0; that is a Linux O/S construct and you need Linux for it to work.

You need to be careful in this section: historically it was just baremetal, no Linux or O/S at all.

ab1jx
Posts: 885
Joined: Thu Sep 26, 2013 1:54 pm
Location: Heath, MA USA

Re: Writing to the framebuffer is slow

Wed Feb 20, 2019 7:26 pm

LdB wrote:
Wed Feb 13, 2019 1:37 am
He is working baremetal there is no such thing as /dev/fb0 which is a linux O/S construct and you need linux for it to work.

You need to be careful in this section historically it was just baremetal no linux or O/S at all.
OK, I got here by search engine and didn't notice the Bare metal, Assembly language at the top. I've only done assembly on 80x86 but am thinking about it here. And/or GPU assembly.

Depending on what you're doing, writing to the framebuffer as uint32_t seems quite a bit faster than writing bytes. It's not always possible, for images for example. I just got a Bresenham line drawing function going at 32 bits, and the CPU usage seems to be a lot lower. 1 word per pixel, pretty slick.
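For reference, a 32-bit Bresenham line can be sketched like this. It is the standard integer form of the algorithm rather than the function mentioned above; draw_line32 is a made-up name, stride_px is the scanline length in pixels, and there is no clipping, so both endpoints must be on screen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Draw a line from (x0, y0) to (x1, y1), one uint32_t store per pixel */
static void draw_line32(uint32_t *fb, uint32_t stride_px,
                        int x0, int y0, int x1, int y1, uint32_t color)
{
    assert(x0 >= 0 && y0 >= 0 && x1 >= 0 && y1 >= 0);

    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;                       /* combined error term */

    for (;;) {
        fb[(size_t)y0 * stride_px + (size_t)x0] = color;
        if (x0 == x1 && y0 == y1)
            break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }   /* step in x */
        if (e2 <= dx) { err += dx; y0 += sy; }   /* step in y */
    }
}
```

On Linux the fb argument would be the mmap'd pointer cast to uint32_t * and stride_px would be the line length in bytes divided by 4, assuming a 32 bits per pixel mode.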
