krom
Posts: 60
Joined: Wed Dec 05, 2012 9:12 am
Contact: Website

Re: Trying Bare Metal on Raspberry Pi 2

Tue Feb 24, 2015 11:40 am

I tried my idea for using the multiple ARM CPU Cores to make my julia fractal animation faster,
using the same code block for all 4 cores & it worked & is a massive speed up:
https://github.com/PeterLemon/Raspberry ... ctal/Julia
If you change the resolution code to:

Code:

; Setup Frame Buffer
SCREEN_X       = 1920
SCREEN_Y       = 1080
You will see smooth animation even at this high resolution for the 1st time!

All cores are rendering 4 pixels at a time in linear frame buffer memory, I found this was a great way to maximize calculation thruput.
I did not use any synchronization in this demo at all, so I was pleasantly surprised that the animation looks so nice & stable!
I will of course explore synchronization in the future, but I wanted a demo like this to show what you can do without it =D
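
One way to see why this can work without synchronization: if every core only ever writes its own disjoint set of pixel groups, no two cores touch the same frame buffer word. Below is a minimal sketch in C++, assuming an interleaved split in which core n takes every fourth group of 4 pixels. The names (`render_core`, `julia_iter`), the tiny resolution and the colour ramp are hypothetical; this is not the demo's actual assembly, which lives in the linked repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int SCREEN_X = 64, SCREEN_Y = 32;   // tiny stand-in resolution
constexpr int CORES = 4, GROUP = 4;           // 4 cores, 4 pixels per step
constexpr std::uint32_t MAX_ITER = 32;

// Escape-time count for one Julia-set point under z -> z*z + c.
inline std::uint32_t julia_iter(double zr, double zi, double cr, double ci) {
    std::uint32_t n = 0;
    while (n < MAX_ITER && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        ++n;
    }
    return n;
}

// Work done by one core: it owns pixel groups core, core+4, core+8, ...
// of the linear frame buffer, so no other core writes the same pixels.
inline void render_core(int core, std::vector<std::uint32_t>& fb,
                        double cr, double ci) {
    assert(core >= 0 && core < CORES);
    assert(fb.size() == static_cast<std::size_t>(SCREEN_X * SCREEN_Y));
    const int total = SCREEN_X * SCREEN_Y;
    for (int base = core * GROUP; base < total; base += CORES * GROUP) {
        for (int i = base; i < base + GROUP && i < total; ++i) {
            int x = i % SCREEN_X, y = i / SCREEN_X;
            double zr = -2.0 + 4.0 * x / SCREEN_X;
            double zi = -1.0 + 2.0 * y / SCREEN_Y;
            fb[i] = julia_iter(zr, zi, cr, ci) * 0x080808u;  // grey ramp
        }
    }
}
```

On real hardware each core would call `render_core` with its own core number. The only thing the lack of synchronization can cost here is that cores may be working on slightly different animation frames at the same moment.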

Next up, I think I'll try to make a simple ray-tracer, optimized with NEON instructions, using all 4 cores.

rst
Posts: 267
Joined: Sat Apr 20, 2013 6:42 pm
Location: Germany

Re: Trying Bare Metal on Raspberry Pi 2

Tue Feb 24, 2015 1:14 pm

krom wrote:You will see smooth animation even at this high resolution for the 1st time!
Thank you for this beautiful smooth animation! I tried it out. Looks very nice!
All cores are rendering 4 pixels at a time in linear frame buffer memory, I found this was a great way to maximize calculation thruput.
I did not use any synchronization in this demo at all, so I was pleasantly surprised that the animation looks so nice & stable!
I think these fractal calculations are very well suited for the multi-core. Maybe you do not need any synchronization which is mostly required if two cores race for a resource or the calculation on one core depends on that on the other.
I will of course explore synchronization in the future, but I wanted a demo like this to show what you can do without it =D
Very impressive!

mrvn
Posts: 58
Joined: Wed Jan 09, 2013 6:50 pm

Re: Trying Bare Metal on Raspberry Pi 2

Wed Feb 25, 2015 11:30 pm

I now have a nice RPi 2 demo that is almost exclusively C/C++ with all the essentials and some bling: Multi-core Mandelbrot demo

Design notes:

My boot steps are as follows:
  1. setup stack, clear bss, call kernel_main (in boot.S)
  2. initialize the activity LED (bootloader should have done that but let's be sure) and blink it 3 times
  3. initialize the UART and say "Hello"
  4. initialize the Framebuffer and draw a test pattern (with text rendering)
  5. initialize page tables with write-back, write-allocate for memory, write-through for graphics, uncached for peripherals
  6. activate MMU and caches (now things run fast)
  7. activate FPU
  8. activate other cores (activate MMU + caches + FPU there too)
  9. enable locking for UART (see ldrex/strex below)
  10. render mandelbrot fractal
I found a few extra snafus along the way that others might stumble into as well, so here it goes:
  • don't forget to setup a stack for each core
  • cache snooping only works when caches are on, don't mix cores with and without caches enabled
  • ldrex/strex only works with caches enabled, __sync_lock_test_and_set() goes into an endless loop without caches, similar for all the other atomic operations
Code is split into separate files. The file names should be self-explanatory. E.g. enabling the FPU is in fpu.cc.
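
For reference, the kind of lock that depends on ldrex/strex can be sketched with the GCC builtin named in the list above. On ARMv7, GCC typically lowers `__sync_lock_test_and_set()` to an ldrex/strex retry loop, which would explain the endless loop when the exclusive store never succeeds. The `SpinLock` type and its method names are made up for illustration and are not taken from the demo.

```cpp
#include <cassert>

// Minimal test-and-set spinlock built on GCC's __sync builtins.
struct SpinLock {
    volatile int held = 0;

    // Returns true if the lock was free and now belongs to the caller.
    bool try_lock() {
        // Atomically writes 1 and returns the previous value (acquire barrier).
        return __sync_lock_test_and_set(&held, 1) == 0;
    }

    void lock() {
        while (!try_lock()) {
            // Spin on plain reads until the lock looks free, then retry the swap.
            while (held) { }
        }
    }

    void unlock() {
        assert(held == 1);
        __sync_lock_release(&held);   // writes 0 with release semantics
    }
};
```

A shared resource such as the UART would then be wrapped in `lock()` / `unlock()` calls on every core that prints.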

What it does / how to use it:
The demo is meant to run with serial and HDMI connected. The serial is used for control and HDMI to display pretty pictures. At boot you will see the activity LED blink and then the framebuffer show a test pattern. Then all the cores are started, which you can follow on the serial. After that, rendering of the famous Mandelbrot fractal starts.

Rendering method
I use a method called guessing with multiple passes. This first computes points in a loose grid. Then it looks for areas with uniform color and "guesses" that the points inside are also this color. Next the points that couldn't be guessed are computed and the guessing is repeated with a smaller and smaller grid size till all points are done.
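
A minimal sketch of this multi-pass guessing idea in C++, under simplifying assumptions that are not from the demo: a 65x65 image, a starting grid step of 16, iteration counts standing in for colours, and a cell counting as "uniform" when its four corner points agree. All names here are hypothetical.

```cpp
#include <cassert>
#include <vector>

constexpr int W = 65, H = 65;      // 2^k + 1, so every grid step reaches the edge
constexpr int MAX_ITER = 64;
constexpr int UNSET = -1;

inline int mandel_iter(double cr, double ci) {
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (n < MAX_ITER && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        ++n;
    }
    return n;
}

struct Image {
    std::vector<int> px = std::vector<int>(W * H, UNSET);
    int computed = 0;              // points that needed a real calculation

    int& at(int x, int y) { return px[y * W + x]; }

    // Compute a point only if no earlier pass computed or guessed it.
    void compute(int x, int y) {
        int& p = at(x, y);
        if (p != UNSET) return;
        p = mandel_iter(-2.0 + 3.0 * x / (W - 1), -1.5 + 3.0 * y / (H - 1));
        ++computed;
    }
};

inline void render_guessing(Image& img) {
    for (int step = 16; step >= 1; step /= 2) {
        assert((W - 1) % step == 0 && (H - 1) % step == 0);
        // Pass A: compute the grid points of this pass that are still unset.
        for (int y = 0; y < H; y += step)
            for (int x = 0; x < W; x += step)
                img.compute(x, y);
        // Pass B: where all four corners of a cell agree, guess the rest of it.
        for (int y = 0; y + step < H; y += step)
            for (int x = 0; x + step < W; x += step) {
                int c = img.at(x, y);
                if (img.at(x + step, y) != c || img.at(x, y + step) != c ||
                    img.at(x + step, y + step) != c)
                    continue;
                for (int yy = y; yy <= y + step; ++yy)
                    for (int xx = x; xx <= x + step; ++xx)
                        if (img.at(xx, yy) == UNSET) img.at(xx, yy) = c;
            }
    }
}
```

After `render_guessing()` every pixel is set, and `computed` shows how many of them were actually iterated rather than guessed. Guessing can misjudge thin filaments that pass between grid points, which is the usual trade-off of this kind of approach.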

Zoom control
You can zoom into the fractal via the serial. All zooms are factor 2 so you can only select where to zoom, not how much. The numbers 1-9 select part of the screen to zoom in. Look at your numpad on the keyboard to see which number is where on the screen. E.g. 1 means the left bottom corner, 5 the middle of the screen and 9 the right top corner and so on. With o you can zoom out by a factor of 2 and n doubles the max iteration (and shifts the coloring). When you zoom in the old image is scaled as starting point of the new one. This saves at least 1/4 of the time but can be much more if the guessing works well.
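
The key mapping can be sketched as follows. This is only one plausible reading of the description: the `View` struct, the function names and the choice of a lower-left origin with y growing upwards are assumptions, not the demo's code.

```cpp
#include <cassert>

// Visible region of the complex plane: lower-left corner plus size.
struct View { double x0, y0, w, h; };

// Zoom in by a factor of 2 towards the numpad position '1'..'9'.
inline View zoom_in(const View& v, char key) {
    assert(key >= '1' && key <= '9');
    int col = (key - '1') % 3;   // 0 = left,   1 = middle, 2 = right
    int row = (key - '1') / 3;   // 0 = bottom, 1 = middle, 2 = top
    return { v.x0 + col * v.w / 4, v.y0 + row * v.h / 4, v.w / 2, v.h / 2 };
}

// 'o': zoom out by a factor of 2 around the current centre.
inline View zoom_out(const View& v) {
    return { v.x0 - v.w / 2, v.y0 - v.h / 2, v.w * 2, v.h * 2 };
}
```

With this layout, key 1 keeps the lower-left corner fixed, key 5 keeps the centre fixed, and key 9 keeps the upper-right corner fixed.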

Multi-core method
Core 0 is the controlling unit and cores 1-3 run as additional compute slaves in the mandeld() function. When a new image needs to be rendered, core 0 sets the parameters and starts things off by setting params.line = 0. Then all cores start crunching. Each core atomically fetches and increments params.line, then computes the line it fetched, until the bottom of the screen is reached. Cores 1-3 then go back to sleep, waiting for params.line to be reset, while core 0 goes back into the control loop. If you look closely you can see that rendering happens on multiple lines in parallel, although it isn't very noticeable: look at the lines that already have half their pixels set. They fill up faster than the line before them. At the end of each screen, each core outputs the number of lines it rendered.
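
The fetch-and-increment scheme can be sketched with `std::atomic`. Only `params.line` is taken from the description above; `Params`, `claim_line`, `worker` and the screen height are hypothetical names and values used for illustration.

```cpp
#include <atomic>
#include <cassert>

constexpr int SCREEN_LINES = 480;          // hypothetical screen height

struct Params {
    // Starts at "screen finished"; core 0 stores 0 to kick off a new image.
    std::atomic<int> line{SCREEN_LINES};
};

// Atomically claim the next unrendered line; returns -1 once the screen is done.
inline int claim_line(Params& p) {
    int y = p.line.fetch_add(1, std::memory_order_relaxed);
    return y < SCREEN_LINES ? y : -1;
}

// Loop each core would run; render_line is whatever computes one row.
template <typename RenderLine>
int worker(Params& p, RenderLine render_line) {
    int rendered = 0;
    for (int y; (y = claim_line(p)) >= 0; ++rendered) {
        assert(y < SCREEN_LINES);
        render_line(y);
    }
    return rendered;                       // the per-core count printed at the end
}
```

Because every line number is handed out exactly once, faster cores simply end up claiming more lines, which matches the per-core line counts printed at the end of each screen.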

Enjoy.

mrvn
Posts: 58
Joined: Wed Jan 09, 2013 6:50 pm

Re: Trying Bare Metal on Raspberry Pi 2

Thu Feb 26, 2015 9:21 am

krom wrote: If you change the resolution code to:

Code:

; Setup Frame Buffer
SCREEN_X       = 1920
SCREEN_Y       = 1080
You will see smooth animation even at this high resolution for the 1st time!
What kind of FPS do you get?

My framebuffer access seems really slow, being uncached. But when I turn on caching for the framebuffer there is a very noticeable caching effect leaving parts of the screen not written back. I would have to flush caches every now and then, which seems to be rather complex on ARMv7.
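
For what it's worth, cleaning a framebuffer range by address on ARMv7 boils down to one CP15 operation per cache line (DCCMVAC, `c7, c10, 1`) followed by a barrier. Here is a rough sketch, assuming a 64-byte data cache line (the actual size can be read from the cache type register) and a made-up `BARE_METAL_ARMV7` build flag so the privileged instructions are only compiled for the target. Whether this alone would cure the artefacts described above is not something I can confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t CACHE_LINE = 64;  // assumed D-cache line size

// Clean one data cache line by address (DCCMVAC) so its contents reach memory.
inline void clean_dcache_line(std::uintptr_t addr) {
#ifdef BARE_METAL_ARMV7                    // hypothetical build flag
    asm volatile("mcr p15, 0, %0, c7, c10, 1" : : "r"(addr) : "memory");
#else
    (void)addr;                            // host build: nothing to clean
#endif
}

// Clean every line overlapping [start, start + size); returns lines touched.
inline std::size_t clean_dcache_range(std::uintptr_t start, std::size_t size) {
    assert(start + size >= start);         // no wrap-around
    if (size == 0) return 0;
    std::uintptr_t addr = start & ~(CACHE_LINE - 1);
    const std::uintptr_t end = start + size;
    std::size_t lines = 0;
    for (; addr < end; addr += CACHE_LINE, ++lines)
        clean_dcache_line(addr);
#ifdef BARE_METAL_ARMV7
    asm volatile("dsb" : : : "memory");    // wait for the cleans to complete
#endif
    return lines;
}
```

Calling this once per finished scanline (120 cache lines for a 1920-pixel row at 32 bpp) would be one way to flush "every now and then".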

mimi123
Posts: 583
Joined: Thu Aug 22, 2013 3:32 pm

Re: Trying Bare Metal on Raspberry Pi 2

Thu Feb 26, 2015 10:41 am

mrvn wrote:
krom wrote: If you change the resolution code to:

Code:

; Setup Frame Buffer
SCREEN_X       = 1920
SCREEN_Y       = 1080
You will see smooth animation even at this high resolution for the 1st time!
What kind of FPS do you get?

My framebuffer access seems really slow, being uncached. But when I turn on caching for the framebuffer there is a very noticeable caching effect leaving parts of the screen not written back. I would have to flush caches every now and then, which seems to be rather complex on ARMv7.
60 fps. The source code is available if you want it.
(I'm currently coding a Qemu for Pi2)

mrvn
Posts: 58
Joined: Wed Jan 09, 2013 6:50 pm

Re: Trying Bare Metal on Raspberry Pi 2

Thu Feb 26, 2015 6:01 pm

mimi123 wrote:(I'm currently coding a Qemu for Pi2)
Will you put it on github or is it there already?

mrvn
Posts: 58
Joined: Wed Jan 09, 2013 6:50 pm

Re: Trying Bare Metal on Raspberry Pi 2

Thu Feb 26, 2015 7:41 pm

mimi123 wrote:
mrvn wrote:
krom wrote: If you change the resolution code to:

Code:

; Setup Frame Buffer
SCREEN_X       = 1920
SCREEN_Y       = 1080
You will see smooth animation even at this high resolution for the 1st time!
What kind of FPS do you get?
60 fps. The source code is available if you want it.
I just did a little speed test on a 1920x1200x32 framebuffer using this code to fill the screen with colors:

Code:

for(uint32_t t = 0; t < 256; ++t) {
	uint32_t col = t * 0x10000;
	uint32_t *p = (uint32_t *)Framebuffer::fb.base;
	uint32_t *q = (uint32_t *)(Framebuffer::fb.base + Framebuffer::fb.size);
	while(p < q) {
		*p++ = col++;
	}
}

Inner loop gives:
	14e8:	e4832004	str     r2, [r3], #4
	14ec:	e1510003	cmp     r1, r3
	14f0:	e2822001	add     r2, r2, #1
	14f4:	8afffffb	bhi     14e8 <test_fb()+0x2c>

Enabling I/D caches and branch prediction, just like the julia demo uses, it takes ~12 seconds, or ~21 fps. It's just one core but also a much smaller loop than the julia demo has.

Enabling the MMU and mapping memory inner/outer write-back, write-allocate and the framebuffer inner write-through, no write-allocate + outer write-back, write-allocate, it takes ~8 seconds, or 32 fps.

So where did you get 60 fps from? It sounds suspiciously like the refresh rate of the mode the video chip outputs. There is no syncing going on, so hitting exactly the screen's refresh rate sounds suspicious.

PS: 640x480x32 with MMU gets me ~256 fps. Must have a greater L2 cache effect.
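
The arithmetic behind these figures, as a small sketch (the helper names are made up). Each run of the test fills the screen 256 times, so fps is 256 divided by the run time, and multiplying by the frame size gives the write throughput.

```cpp
#include <cassert>

// Frames per second for `frames` full-screen fills taking `seconds`.
inline double fps(int frames, double seconds) {
    assert(seconds > 0.0);
    return frames / seconds;
}

// Bytes written per second at a given resolution, depth and frame rate.
inline double bytes_per_second(int w, int h, int bits_per_pixel, double rate) {
    assert(bits_per_pixel % 8 == 0);
    return static_cast<double>(w) * h * (bits_per_pixel / 8) * rate;
}
```

With the numbers above: `fps(256, 12.0)` is about 21.3 and `fps(256, 8.0)` is 32. In bytes, 1920x1200x32 at 32 fps is roughly 295 MB/s, and 640x480x32 at 256 fps is roughly 315 MB/s. The two throughputs are close, which may hint that raw write bandwidth matters more here than an L2 effect, though nothing in these measurements settles that.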

mimi123
Posts: 583
Joined: Thu Aug 22, 2013 3:32 pm

Re: Trying Bare Metal on Raspberry Pi 2

Sun Mar 01, 2015 10:57 am

mrvn wrote:
mimi123 wrote:(I'm currently coding a Qemu for Pi2)
Will you put it on github or is it there already?
I will push it once I get SMP working. (Right now, Linux only boots in UP mode.)

krom
Posts: 60
Joined: Wed Dec 05, 2012 9:12 am
Contact: Website

Re: Trying Bare Metal on Raspberry Pi 2

Fri Apr 03, 2015 7:38 am

Hi,
Sorry for the brief pause in development...

So I was reading some of the forum posts here and I saw one about a guy who got more speed out of the Raspberry Pi 2 on single core:
http://www.raspberrypi.org/forums/viewt ... 2&t=102806
dom wrote:Your bare metal environment doesn't wake the other cores up, which means they keep polling a register, and so compete on the bus with the useful work you want the first core to do.
Using the config.txt option arm_control=0x1000 changes the "cmp r0, #0xff" to a "cmp r0, #0x1", which causes the secondary cores to not poll for the signal, but just to sleep.
This was obviously just a debug switch when trying to get multicore Linux working, which fortunately hasn't been removed yet.
The real solution is for your bare metal world to wake up the other cores and make them sleep, or change the bootcode to sleep directly.
So this reminded me of my bare metal video codec test, which seemed to run much slower on the Raspberry Pi 2 compared to the Raspberry Pi 1!
I tested my video using the config.txt option arm_control=0x1000, and it suddenly ran much faster as the extra cores were not fighting for BUS space.

I then took away the config.txt option, and woke up the 3 extra cores, putting them all into a simple loop, and the huge speed gain for the video decoding worked =D
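
A rough sketch of what "waking up the extra cores and parking them" can look like in C++. It assumes the register layout from the QA7 document (per-core mailbox write-set registers starting at 0x40000080) and that the firmware's boot stub has each secondary core waiting for a start address in its mailbox 3. The function names and the `BARE_METAL_ARMV7` build flag are invented for illustration, and this is not the code from the repository linked below.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: 16 bytes per core, 4 bytes per mailbox, starting at 0x40000080.
constexpr std::uint32_t MAILBOX_SET_BASE = 0x40000080u;

constexpr std::uint32_t mailbox_set_addr(unsigned core, unsigned mailbox) {
    return MAILBOX_SET_BASE + 0x10u * core + 4u * mailbox;
}

// Hand a start address to a secondary core by writing it to that core's mailbox 3.
inline void start_core(unsigned core, void (*entry)()) {
    assert(core >= 1 && core <= 3);
    auto* reg = reinterpret_cast<volatile std::uint32_t*>(
        static_cast<std::uintptr_t>(mailbox_set_addr(core, 3)));
    *reg = static_cast<std::uint32_t>(reinterpret_cast<std::uintptr_t>(entry));
}

#ifdef BARE_METAL_ARMV7                    // hypothetical build flag
// Parking loop: sleep until an event arrives instead of hammering the bus.
// Real code would set up a per-core stack before doing anything more than this.
extern "C" void parked_core() {
    for (;;) asm volatile("wfe");
}
#endif
```

On the target, core 0 would call `start_core(n, parked_core)` for n = 1..3 early in boot.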

Here is the updated Video test which now runs much faster on the Raspberry Pi 2:
https://github.com/PeterLemon/Raspberry ... GRBLZVideo

So that was a big problem fixed, and the moral is if you want the most speed out of your bare metal single core coding, you really should wake up the other cores and put them in a loop.

Now I only have 2 problems left in my Raspberry Pi 2 understanding:
1. DMA demos are still not working (the 1st DMA always works, my hello world demo only prints "h")
2. My V3D animation demo is static (the 1st V3D Control List always works, could this be linked to the DMA problem above?)

I will try to get to the bottom of this, would be great if anyone knows what the problem might be...

kriss
Posts: 66
Joined: Thu Apr 02, 2015 8:53 pm
Location: france for now ...

Re: Trying Bare Metal on Raspberry Pi 2

Fri Apr 03, 2015 2:05 pm

hi
With this interesting thread you gave me the need to buy a Pi 2 ;)
Before it arrives, can one of you give the qemu command line to emulate a Pi 2, so I will be able to learn and code asm for it?
I'll try to make a real bare metal Pi 2 program ;)
thanks

JS2
Posts: 14
Joined: Thu May 14, 2015 11:40 pm

Re: Trying Bare Metal on Raspberry Pi 2

Thu May 21, 2015 4:24 am

I'm not sure so far if the Snoop Control Unit is on by default so that the memory is coherent between the cores without intervention.
Anyone make any progress with this? I am seeing behavior that could be the result of coherency problems with different cores sharing the same atomic, and I'm not seeing any obvious info on how to turn on the SCU. Trying to get the peripheral base address with

Code:

mrc p15, 4, r1, c15, c0, 0
always returns zero so I'm not sure how to get at the config registers

rst
Posts: 267
Joined: Sat Apr 20, 2013 6:42 pm
Location: Germany

Re: Trying Bare Metal on Raspberry Pi 2

Thu May 21, 2015 11:26 am

JS2 wrote:
I'm not sure so far if the Snoop Control Unit is on by default so that the memory is coherent between the cores without intervention.
Anyone make any progress with this? I am seeing behavior that could be the result of coherency problems with different cores sharing the same atomic, and I'm not seeing any obvious info on how to turn on the SCU.
I think the SCU is on by default. But you have to ensure that data which is concurrently used from more than one core is located in memory which is marked as "Shareable" (S-Bit) in its page table entry.

JS2
Posts: 14
Joined: Thu May 14, 2015 11:40 pm

Re: Trying Bare Metal on Raspberry Pi 2

Fri May 22, 2015 5:04 am

Nothing like the embarrassment of asking for help to motivate you to find your own bug. My per-core page table setup was correct, but I had a bug in my atomic spinlock due to a bad mercurial merge. PC wasn't being restored at the end of my lock function, so we were falling off the edge and continuing right into the adjacent unlock function, hence other cores sometimes not seeing the lock was already locked. Sorry to waste your time

krohini
Posts: 8
Joined: Tue Jun 16, 2015 6:57 am

Re: Trying Bare Metal on Raspberry Pi 2

Sun Jun 21, 2015 4:52 pm

Hi,

I have been trying to get a speed-up with the cache/MMU turned on for a single core only. Despite many attempts I am unable to get any performance improvement. It's the same as without caches on, as though the memory regions are not getting cached. I am using single-level translation. What is the impact of having the other three cores just powered on, waiting for a jump address? Or is it simply the attributes going wrong in the translation table?
What is the effect of the various barriers (data, memory, instruction)?
What changes in settings are required when using a single-level, section-based translation?
I have been trying to follow some of the successful examples posted in this thread. It would be great to get some suggestions about probable errors.

I really need help here! Thanks!

JS2
Posts: 14
Joined: Thu May 14, 2015 11:40 pm

Re: Trying Bare Metal on Raspberry Pi 2

Mon Jun 22, 2015 2:17 am

Can't tell much without seeing the code, but I'm curious

0) how are you measuring performance?
1) what kind of stuff are you doing? Are you more heavy on ALU or memops?
2) how is your L2 set up? Writeback, write through, etc?
3) are you using cache-friendly access patterns?
4) are you talking i$ or d$?

Again, I don't know if it's the cache setup or the data access patterns, but turning on a cache is not a guaranteed universal speedup. Reading a byte from each cacheline is still going to result in a miss on every access.

krohini
Posts: 8
Joined: Tue Jun 16, 2015 6:57 am

Re: Trying Bare Metal on Raspberry Pi 2

Mon Jun 22, 2015 4:56 pm

Hi JS2,
I have tried to answer all the questions
0) I am measuring performance with just dhrystone for a million runs. So it's repeated use of the same data and instructions. Does that answer 3)?
Umm, I forgot to add an important part. I am not trying bare metal. It is part of another OS that I am trying to port to the Pi 2, so the MMU setting is not very straightforward. And fortunately, after some more trying, I have been able to see the expected performance improvement :) Something was missing in the MMU/cache configuration.

2) How is the L2 set up differently? Is it the same as the outer cache attributes (from the ARMv7 Architecture Reference Manual)? I am using write-back. When just running dhrystone several times, should write-allocate give a better result than no write-allocate?

4) I have both I and D caches turned on.

JS2
Posts: 14
Joined: Thu May 14, 2015 11:40 pm

Re: Trying Bare Metal on Raspberry Pi 2

Tue Jun 23, 2015 1:05 am

So its repeated use of same data and instructions. Does that answer
Well yes and no. Technically "the same data" could involve multiple threads false sharing cache lines, or the same sparsely located lines knocking existing lines out of cache. However, it seems you fixed the issue which is great. If you don't mind me asking, what was the setting you were missing?

krohini
Posts: 8
Joined: Tue Jun 16, 2015 6:57 am

Re: Trying Bare Metal on Raspberry Pi 2

Tue Jun 23, 2015 10:33 am

Ya. It was just the C and B bits in one of the sections :D The region was originally left non-cacheable, so I didn't change its configurations for a long time. And that alone seemed to be the problem. I need to understand this more.

Do the TEX bits configurations apply to sections just as they do to pages? For instance, the cacheability attributes for inner and outer caches?
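
For orientation: in the ARMv7 short-descriptor format a 1 MB section entry carries TEX, C and B as well, just at different bit positions than a small-page entry (TEX[2:0] sits in bits 14:12 of a section entry, C in bit 3, B in bit 2, and the shareable bit S in bit 16). The sketch below builds such entries with TEX remap disabled; the struct, the constant names and the choice of AP = full access with domain 0 are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Memory attributes for one section entry (TEX remap disabled).
struct Attr { std::uint32_t tex, c, b, s; };

constexpr Attr DEVICE        {0, 0, 1, 0};  // shareable device (peripherals)
constexpr Attr WRITE_THROUGH {0, 1, 0, 1};  // inner+outer write-through, no write-allocate
constexpr Attr WRITE_BACK_WA {1, 1, 1, 1};  // inner+outer write-back, write-allocate

inline std::uint32_t section_entry(std::uint32_t phys, Attr a) {
    assert((phys & 0x000FFFFFu) == 0);      // sections are 1 MB aligned
    return phys
         | (a.s   << 16)                    // S: shareable
         | (a.tex << 12)                    // TEX[2:0]
         | (3u    << 10)                    // AP[1:0] = 11: full access
         | (a.c   << 3)                     // C
         | (a.b   << 2)                     // B
         | 2u;                              // descriptor type: section
}
```

Separate inner and outer policies are also expressible per section: with TEX[2] set, TEX[1:0] selects the outer policy and C/B select the inner one.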

diracsbracket
Posts: 9
Joined: Thu Jun 25, 2015 3:07 pm

Re: Trying Bare Metal on Raspberry Pi 2

Thu Jun 25, 2015 3:14 pm

Hi,
Could you please detail how you compiled and generated the kernel.img file you put on the SD Card?
I tried a similar program in C, but the generated image does not work when put on SD Card. However, the same image (after converting it to the uImage format) works when loaded via TFTP in u-boot...
Any ideas ?
Thanks!

mimi123
Posts: 583
Joined: Thu Aug 22, 2013 3:32 pm

Re: Trying Bare Metal on Raspberry Pi 2

Fri Jun 26, 2015 9:19 am

diracsbracket wrote:Hi,
Could you please detail how you compiled and generated the kernel.img file you put on the SD Card?
I tried a similar program in C, but the generated image does not work when put on SD Card. However, the same image (after converting it to the uImage format) works when loaded via TFTP in u-boot...
Any ideas ?
Thanks!
Load address at 0x8000, stack address set up, objcopy -O binary used?

Sonny05
Posts: 22
Joined: Wed Jun 24, 2015 4:53 pm

Re: Trying Bare Metal on Raspberry Pi 2

Thu Jul 02, 2015 12:01 pm

Hello

I have a problem with the UART.

From PC to RPI the data is sent correctly
but from RPI to PC the data is sent badly.

I use program 'gtkterm' Baudrate-115200, Stop bit-1, Data bit-8, Parity bit-none

I tried an example from the user 'mrvn' and yet I am still getting wrong data back.

Any ideas?

daivuk
Posts: 6
Joined: Sun Jul 26, 2015 2:03 pm
Location: Canada

Re: Trying Bare Metal on Raspberry Pi 2

Mon Jul 27, 2015 12:54 am

rst wrote:
krom wrote:2. Old Frame Buffer code does not work, only MailBox Tags Frame Buffer works.
The older way of setting the frame buffer, does not seem to exist on the Raspberry Pi 2.
For me the older method works well. There must be another influence.
Do you have any code samples? I've been stuck on that part for many days now. It would be greatly appreciated!

Thanks

rst
Posts: 267
Joined: Sat Apr 20, 2013 6:42 pm
Location: Germany

Re: Trying Bare Metal on Raspberry Pi 2

Mon Jul 27, 2015 9:02 am

daivuk wrote:
rst wrote:
krom wrote:2. Old Frame Buffer code does not work, only MailBox Tags Frame Buffer works.
The older way of setting the frame buffer, does not seem to exist on the Raspberry Pi 2.
For me the older method works well. There must be another influence.
Do you have any code samples? I've been stuck on that part for many days now. It would be greatly appreciated!

Thanks
https://github.com/rsta2/circle/blob/ma ... buffer.cpp

daivuk
Posts: 6
Joined: Sun Jul 26, 2015 2:03 pm
Location: Canada

Re: Trying Bare Metal on Raspberry Pi 2

Mon Jul 27, 2015 7:22 pm

Thanks rst.

I actually found out this morning that subtracting 0xC0000000 from my address fixed my problem.
You are doing this:

Code:

m_pInfo->BufferPtr & 0x3FFFFFFF
It has the same effect.

It's when you add GPU_MEM_BASE, which is 0xC0000000
We have to remove it here when we get it back.
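
The two conversions side by side, as a small sketch. `GPU_MEM_BASE` and the mask come from the posts above; the function names are made up.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t GPU_MEM_BASE = 0xC0000000u;  // value quoted above

// Strip the two alias bits from a bus address returned by the GPU.
constexpr std::uint32_t bus_to_arm(std::uint32_t bus) {
    return bus & 0x3FFFFFFFu;
}

// Equivalent only while the address really lies in the 0xC0000000 alias.
inline std::uint32_t bus_to_arm_sub(std::uint32_t bus) {
    assert(bus >= GPU_MEM_BASE);
    return bus - GPU_MEM_BASE;
}
```

The mask has the small advantage of also being harmless when the firmware already returns an address without the alias bits, whereas the subtraction would underflow in that case.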

krohini
Posts: 8
Joined: Tue Jun 16, 2015 6:57 am

Re: Trying Bare Metal on Raspberry Pi 2

Sat Aug 08, 2015 7:14 pm

Hey,
Can somebody help me understand how to generate inter-processor interrupts on the Pi 2? I referred to the QA7 document and am clueless about this. Are there registers in common with the Pi 1 which are not specified in QA7?
