Circle - C++ bare metal environment (with USB)

by django013 » Mon May 15, 2017 2:49 pm

Thank you for your patience and support!
The QA7 document is very exciting. Thanks!

I don't know anything about multicore programming, so much of this is new to me.
By the way, do you have any links for learning multicore programming?
I know a bit about the STM32, and the newer devices are able to run timers at the system clock.

But this is not important to me, so maybe my questions were wrong. Sorry.

I'm working on a kind of software PWM, so my main interest is the fastest GPIO speed. Having a timer interrupt faster than the GPIO clock does not make a lot of sense.
The desired amplitude is several microseconds, but if I use 32 bits to divide the amplitude I end up with sub-nanosecond steps. A resolution that can't be output to the pins makes no sense, so I wanted to know the fastest GPIO clock and the related timer...
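To put numbers on this, here is a small C sketch (hypothetical helper names, not from any library) comparing the step size of a 32-bit divider with the number of steps the pins can actually follow. The GPIO write time is an input you would have to measure on your own board; the 50 ns used in the note below is only an example value.

```c
#include <stdint.h>

/* Hypothetical helper: size of one step in nanoseconds when a PWM period
 * of period_ns is split into 2^32 equal parts by a 32-bit counter. */
double step_ns_32bit(double period_ns)
{
    return period_ns / 4294967296.0; /* 2^32 */
}

/* Hypothetical helper: number of distinguishable duty-cycle steps when the
 * pin can change state at most once every gpio_write_ns nanoseconds.
 * gpio_write_ns is an assumed, measured value, not a datasheet figure. */
uint32_t usable_pwm_steps(uint32_t period_ns, uint32_t gpio_write_ns)
{
    if (gpio_write_ns == 0)
        return 0; /* reject a nonsensical write time instead of dividing by zero */
    return period_ns / gpio_write_ns;
}
```

With a 5 µs period and an assumed 50 ns per pin write, only about 100 duty-cycle steps (roughly 7 bits) are meaningful, while a 32-bit divider would give steps of about a femtosecond.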

In the QA7 document I read that the timer can be driven by either the crystal clock or the APB clock.
I guess the APB clock is the base for the GPIO clock, so that would be the option that makes sense to me.
The 500 MHz PLL is probably an internal clock only, and the APB clock would then be 125 MHz, right?
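Whichever input the timer uses, its best resolution is one period of that clock. A small C sketch of the arithmetic: the 19.2 MHz crystal value is an assumption (it is the figure usually quoted for the Pi 3's oscillator, but check your board), 125 MHz is only the APB guess from the question above, and `tick_period_ps` is a hypothetical helper.

```c
#include <stdint.h>

/* Hypothetical helper: length of one timer tick in picoseconds for an
 * input clock of clock_hz (integer division, so the result is truncated). */
uint64_t tick_period_ps(uint64_t clock_hz)
{
    if (clock_hz == 0)
        return 0; /* no clock, no meaningful period */
    return 1000000000000ULL / clock_hz; /* 10^12 ps per second */
}
```

With these assumed inputs the crystal gives roughly 52 ns per tick and a 125 MHz APB clock 8 ns per tick.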
by rst » Mon May 15, 2017 4:12 pm
django013 wrote: I'm working on a kind of software PWM, so my main interest is the fastest GPIO speed.
In 32-bit mode you can expect about 16 million GPIO read operations per second. Writing should be quicker. I have read about a 60 MHz rate on an RPi 1. I don't think the GPIO block will be much faster on the RPi 3 or in 64-bit mode.
by django013 » Tue May 16, 2017 3:34 am
rst wrote: I don't think the GPIO block will be much faster on the RPi 3 or in 64-bit mode.

Looking at timer access alone, you don't need 64-bit mode or single-instruction access.
But taking into account all the calculations that have to be done between pin changes, the calculation takes fewer CPU cycles in 64-bit mode.

I did some additional searching and found some interesting links:
- ... ransmitter

The latter is very interesting to me, as it shows that DMA cannot speed up GPIO output, so the fastest way to toggle a pin is direct register access.
The different timings for set and clear are very interesting too.
by LdB » Tue May 16, 2017 4:01 am
I have trouble with that last report because they are using C code and the only flag we get to see is the -O3 compilation flag.

This may come back to Raspbian on the Pi and the old GCC 4.8 or 4.9 compiler it uses, which I am unfamiliar with. It would have been really useful if they had shown the assembler dump of the I/O access so we could see what it was doing.

I am running the official ARM compiler, versions 6.4 and 7.1 beta, and there are huge differences in the opcodes it emits for ARM6, ARM7 and ARM8 compilation, especially if you align the data. I always see it make huge differences when the code is used for DMA transfers. I often use my ARM6 code on the Pi3, but it isn't a patch on the speed of ARM8 code on a Pi3. What you really want to see is which opcode set is being used for the output. The opcode sets were expanded on the ARM7 and ARM8 for a reason, and the report ignores that. A quick read of the first page of the ARM8 opcode set will shed some light on the changes and why they make a difference. If you compile 64-bit code the changes in particular are chalk and cheese, and it's obvious why from the link. ... FEEIA.html

I strongly doubt they have reached the limit, unless there is a physical GPIO bus constraint, which is outside my experience or knowledge and more in David's court.

Realistically, if you are after absolute speed, the only fair way to do all that is to craft your own block of assembler code, and I take the whole report with a grain of salt from the software perspective.

One of the interesting things they found was the slower speed on the Pi2 and Pi3 compared with the Pi1. This piqued my interest because it raised a question in my mind: do the ARM7 and ARM8 run some ARM6 code slower than the ARM6 itself? They assumed the Pi2/3 bus was slower, but the alternative that needs to be excluded is that the opcode speed has changed, either directly or via the bus implementation. They may have made new instructions faster, but have they made old ones slower? Time to hit the opcode data sheets and do some opcode testing :-)

I have never tried it, but I will craft some bare-metal GPIO code and see what I can get it to do. I only have a 100 MHz storage scope, so hopefully it's less than that, but unless there is a bus constraint I am pretty sure it will be faster. So you want jitter and max speed? That's simple; I will have time to do it tonight and post results.
by django013 » Tue May 16, 2017 8:33 am

I completely agree; they didn't tell the whole story.

Looking at the sources, it looks to me like they ran their build under Linux. But with Linux you don't get jitter-free execution times. And there's a huge difference whether you run 32-bit or 64-bit Linux...
And I also agree on the impact the compiler release and target have.
And if they did use Linux, then the GPIO clock is derived from the external crystal, which is probably not the fastest solution.

When I did some timing tests on Linux, I was told that clock_gettime(CLOCK_MONOTONIC_RAW, ...) would be the fastest way to read the time. With a "normal" multithreaded application I had response times ranging from a few nanoseconds to nearly a whole second.
And that was with nothing else running on the Pi.
The response time deviation was similar on 32-bit and 64-bit.

I then tried to create cpusets. That reduced the jitter, but the worst-case response time was still unacceptable. The structure returned by clock_gettime is really suboptimal and requires additional computation for each timestamp access.
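To show the per-timestamp work referred to above, here is a minimal sketch of turning the `struct timespec` from `clock_gettime()` into one 64-bit nanosecond value. It assumes Linux with glibc (`CLOCK_MONOTONIC_RAW` is Linux-specific), and `timespec_to_ns` and `monotonic_raw_ns` are hypothetical helper names.

```c
#define _GNU_SOURCE /* exposes clock_gettime and CLOCK_MONOTONIC_RAW on glibc */
#include <stdint.h>
#include <time.h>

/* Convert a struct timespec into a single 64-bit nanosecond count.
 * This multiply-and-add is the extra work needed for every timestamp. */
static inline uint64_t timespec_to_ns(const struct timespec *ts)
{
    return (uint64_t)ts->tv_sec * 1000000000ULL + (uint64_t)ts->tv_nsec;
}

/* Read CLOCK_MONOTONIC_RAW (Linux-specific) as nanoseconds; returns 0 on error. */
uint64_t monotonic_raw_ns(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts) != 0)
        return 0;
    return timespec_to_ns(&ts);
}
```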

Therefore I'd like to set up a timer that increments an aligned 64-bit counter, which can be read with a single instruction.
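A minimal sketch of that idea in C11, assuming a single writer (the timer interrupt) and any number of readers. The interrupt wiring is environment-specific and deliberately left out; `g_ticks`, `timer_tick` and `read_ticks` are hypothetical names. On AArch64 an aligned 64-bit relaxed atomic load normally compiles to a single `ldr`; on 32-bit ARM the compiler may need `ldrexd` or similar instead, so check the disassembly.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical tick counter, explicitly 8-byte aligned. _Atomic with relaxed
 * ordering stops the compiler from tearing the access or caching the value
 * in a register. */
static _Alignas(8) _Atomic uint64_t g_ticks;

/* Call this from the timer interrupt handler (wiring not shown). Because the
 * ISR is the only writer, a plain load + store is enough; no atomic
 * read-modify-write instruction is required. */
void timer_tick(void)
{
    uint64_t t = atomic_load_explicit(&g_ticks, memory_order_relaxed);
    atomic_store_explicit(&g_ticks, t + 1, memory_order_relaxed);
}

/* Any core can sample the current count with one 64-bit load. */
uint64_t read_ticks(void)
{
    return atomic_load_explicit(&g_ticks, memory_order_relaxed);
}
```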
by LdB » Tue May 16, 2017 3:37 pm
So I tried the code below. I can't capture it on my 100 MHz scope: it's too fast by a fair way, and the signal is heavily degraded. I am not sure the I/O pin tracking is set up for this sort of speed (my probe leads are rated at 1 GHz, the scope is not). I have arranged to go to a mate's place, who has a 1 GHz storage scope, and will post results tomorrow. I can sort of see the signal in free-running mode, but the digital capture won't work; it's too fast. So I am already strongly doubting the answer above, because if the bus were holding me up I would be seeing their recorded speed, which is well within my scope's range.

My feeling from playing tonight is that they didn't design the I/O bus to be used at these speeds, and that will be what limits operation speed, not anything on the ARM or bus :-)

This is the simplest ARM6 opcode deadloop: you can see it's just two store operations and then a branch, so just 3 opcodes. There is a bit of preamble creating the address and the bit mask, and then just a hard deadloop of 3 opcodes from which it won't return.

What I was expecting was a square wave with a 1/3 duty cycle; it will be off-centre because of the branch. So it will be interesting to see the result on the scope tomorrow.

Code:
/* "PROVIDE C FUNCTION: void RPI_GPIO_SQRWave (uint8_t gpio_num);" */
.section .text.RPI_GPIO_SQRWave, "ax", %progbits
.balign 4
.globl RPI_GPIO_SQRWave
.type RPI_GPIO_SQRWave, %function
.syntax unified
;@ RPI_GPIO_SQRWave -- Pi1, 2 & 3 code
;@ C Function: void RPI_GPIO_SQRWave (uint8_t gpio_num)
;@ Entry: R0 = GPIO Port Number
;@ Assumes the pin is already configured as an output (GPFSEL is not touched here)
;@ and that RPi_IO_Base_Addr (defined elsewhere) holds the peripheral base address.
RPI_GPIO_SQRWave:
    cmp r0, #54                        ;@ GPIO port number can only be 0..53
    blt .GPIO_SQRWave_Valid
    bx lr                              ;@ Port invalid, just exit
.GPIO_SQRWave_Valid:
    cmp r0, #32                        ;@ Bank 0 is 0..31, Bank 1 is 32..53
    bge .GPIO_Bank1_SqrWave
    mov r2, #0x1C                      ;@ BANK 0 GPIO_BIT_SET offset (GPSET0)
    mov r3, #0x28                      ;@ BANK 0 GPIO_BIT_CLR offset (GPCLR0)
    b .ReJoin_GPIO_SqrWave
.GPIO_Bank1_SqrWave:
    sub r0, r0, #32                    ;@ GPIO number % 32
    mov r2, #0x20                      ;@ BANK 1 GPIO_BIT_SET offset (GPSET1)
    mov r3, #0x2C                      ;@ BANK 1 GPIO_BIT_CLR offset (GPCLR1)
.ReJoin_GPIO_SqrWave:
    mov r1, #1                         ;@ 1 bit to shift
    lsl r0, r1, r0                     ;@ Shift it by GPIO number % 32 (R0 is now the bit mask)
    ldr r1, =RPi_IO_Base_Addr
    ldr r1, [r1]
    add r1, r1, #0x200000              ;@ Create GPIO base  Pi1: 0x20200000 Pi2/3: 0x3F200000
.Do_GPIOSqrWave:
    str r0, [r1, r2]                   ;@ Hit the set GPIO address
    str r0, [r1, r3]                   ;@ Hit the clear GPIO address
    b .Do_GPIOSqrWave                  ;@ Deadloop
    bx lr                              ;@ Return will never happen, we deadloop above
.balign 4
.ltorg                                 ;@ Tell assembler it's ok to put ltorg data for above code here
.size RPI_GPIO_SQRWave, .-RPI_GPIO_SQRWave
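For comparison, here is a rough C rendering of the same loop. It is a sketch, not LdB's code: `gpio_square_wave` is a hypothetical name, the GPIO base is passed in as a pointer (so the function can be exercised against a plain array off-target), and it runs a fixed number of cycles instead of dead-looping. The register offsets are the same ones the assembly uses. A compiler will not necessarily turn this into the same two-store-plus-branch loop, which is exactly why checking the disassembly (for example with `gcc -S` or `objdump -d`) matters.

```c
#include <stdint.h>

/* BCM283x GPIO register byte offsets, as used by the assembly above. */
#define GPSET0 0x1C
#define GPSET1 0x20
#define GPCLR0 0x28
#define GPCLR1 0x2C

/* Toggle one pin `cycles` times. gpio_base would be peripheral_base + 0x200000
 * (0x20200000 on a Pi1, 0x3F200000 on a Pi2/3). The pin must already be
 * configured as an output; GPFSEL is not touched here. */
void gpio_square_wave(volatile uint32_t *gpio_base, unsigned pin, unsigned cycles)
{
    if (pin > 53)
        return; /* same range check as the assembly: 0..53 only */

    uint32_t mask = 1u << (pin % 32);                 /* bit within the bank */
    unsigned set_off = (pin < 32) ? GPSET0 : GPSET1;  /* pick the bank's registers */
    unsigned clr_off = (pin < 32) ? GPCLR0 : GPCLR1;
    volatile uint32_t *set_reg = gpio_base + set_off / 4;
    volatile uint32_t *clr_reg = gpio_base + clr_off / 4;

    for (unsigned i = 0; i < cycles; ++i) {
        *set_reg = mask; /* drive the pin high */
        *clr_reg = mask; /* drive the pin low */
    }
}
```

On a Pi2/3 with identity-mapped peripherals, a call might look like `gpio_square_wave((volatile uint32_t *)0x3F200000, 7, 1000000);`.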
by django013 » Wed May 17, 2017 3:15 am

Your work is very interesting!

But that's only half of the story too. What about the frequency setup?
I guess if you drive the pins outside the spec, the signal quality might not be usable/reliable.

ST has a very nice tool to set up the pin and clock-tree configuration, so all dividers and constraints are visible. What about the BCM2837?
Sadly, I don't have any specs that cover the clock tree.
The ARMv8 programmer's guide doesn't say a word about clock settings, prescalers and constraints :(
by LdB » Wed May 17, 2017 4:14 am
So I got access to a 300 MHz storage scope, and here is something you wouldn't expect to see from that code.
I can see why my 100 MHz scope was struggling with it.

The really high-speed signal is what we were expecting, but I can't for the life of me work out what the big clunky signals knocking it out are. I tried moving the Pi3 from 700 MHz to 1.2 GHz, and the result is the same.

That is GPIO21; I went for it because it's at the end of the pinout, but I might try a different one. Is there any detail on the output drive of the different I/O pins?
by LdB » Wed May 17, 2017 4:48 am
So I moved everything to GPIO7 and thought I would first try the code on a Pi1. Even more interesting:
The code produces a 21 MHz output. Now that I wasn't expecting; I was expecting something like 150 MHz from the 700 MHz clock.
I was so stunned I had to write the CPU speed out to the video screen so I could check it was at 700 MHz.

It is taking a whopping 33 clock cycles to execute 3 opcodes (700 MHz / 21 MHz ≈ 33). The ARM6 data sheet lists the raw speed as 2 clock cycles for those instructions. In my head I was expecting 5 clock cycles for the opcodes (2+2+1), so a speed of 700 MHz / 5 = 140 MHz. So I guess the GPIO hardware in the SoC can't keep up with the ARM's speed, and it is becoming (16+16+1) clock cycles.
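The cycle arithmetic in this paragraph can be written out as a tiny C sketch (the helper names are made up for illustration); it simply treats one loop iteration as one output period.

```c
/* CPU clock cycles spent per loop iteration, given the CPU clock and the
 * measured output frequency (one iteration = one output period). */
double cycles_per_loop(double cpu_hz, double out_hz)
{
    return cpu_hz / out_hz;
}

/* Output frequency you would expect if one iteration took `cycles` clocks. */
double expected_out_hz(double cpu_hz, double cycles)
{
    return cpu_hz / cycles;
}
```

`expected_out_hz(700e6, 5.0)` gives the 140 MHz LdB expected, while `expected_out_hz(700e6, 33.0)` lands at about 21.2 MHz, matching the scope.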

I assume the Pi1 effectively wait-states the whole CPU when it outputs to GPIO, and it does indeed cause a fair amount of jitter (hopefully you can see it on the screen), as you have a fast clock gated by a slower clock.

The same code on the Pi3 on GPIO7 creates the same junk as on GPIO21.

Anyone got any ideas?

EDIT: I was so surprised by the Pi1 that I had to check each opcode was delayed, so I made it write the set register twice:
Code:
.Do_GPIOSqrWave:
   str r0, [r1, r2]                   ;@ Hit the set GPIO address
   str r0, [r1, r2]                   ;@ Hit the set GPIO address again (extra write)
   str r0, [r1, r3]                   ;@ Hit the clear GPIO address
   b .Do_GPIOSqrWave                  ;@ Deadloop

And yes, the high time expanded and the frequency dropped to 14 MHz.
So GPIO instructions are definitely wait-stated on the Pi1: they take approximately 23 ns each, and you can't exceed 21-22 MHz no matter what code you write. I assume it invokes the NWAIT line on the ARM6 and holds everything until the data hits the peripheral. If that is all true, you can also forget about the cache on the Pi1; it won't help you with GPIO.
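The ~23 ns figure falls out of the two measurements: adding one extra store stretched the period from 1/21 MHz to 1/14 MHz. A tiny C sketch of that arithmetic (hypothetical helper names), plus the toggle rate implied by two stores per period with the branch ignored:

```c
/* Time added by one extra GPIO store, in ns, from two measured output
 * frequencies: f_base with the original loop, f_extra with one more store. */
double extra_store_ns(double f_base_hz, double f_extra_hz)
{
    return (1.0 / f_extra_hz - 1.0 / f_base_hz) * 1e9;
}

/* Highest square-wave frequency if every period needs one set and one clear
 * store of store_ns each (branch cost assumed to be hidden). */
double max_toggle_hz(double store_ns)
{
    return 1e9 / (2.0 * store_ns);
}
```

`extra_store_ns(21e6, 14e6)` comes out at about 23.8 ns, and `max_toggle_hz(23.8)` at about 21 MHz, which is consistent with the ceiling LdB describes.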

ARM has an application note (PDF) on the ARM6 bus and on wait-stating it for slow peripherals.

It still leaves me scratching my head over the Pi3 GPIO. It sort of looks like there is some slow latching clock signal imposed over the real signal, through what almost looks like a tristate setup. At times I get what I expect, and at other times some slow latch seems to lock and hold the data out on the GPIO. I understand it would work at a slow enough speed, but it is a bit crass from a hardware point of view. I will work out some code to spread out my opcodes tonight, and once I get below the latching frequency it should work properly, if it is indeed like that.

So the take-home message for me was that GPIO is very different from normal memory on the Pi.
Last edited by LdB on Wed May 17, 2017 8:09 am, edited 1 time in total.
by django013 » Wed May 17, 2017 8:05 am

Your research is not far from my expectations.

Look at an STM32F407: it can drive GPIO pins at half the system clock. Even if you were able to write an opcode that toggles a pin at the system clock, the pin output subsystem can't follow it.
So whether you call it wait states or jitter, the pin-change commands are synced to the GPIO clock.

I read that the Pi1 has a very poor GPIO subsystem and is therefore really, really slow.
And your results match the results from other people measuring GPIO speed.

The interesting part is the junk from the Pi3; obviously the syncing between the system clock and the GPIO clock failed. From the STM32 I know that you won't get any pin change if you haven't set up the GPIO clock.
How does the BCM2837 behave?
Did you set up the GPIO clock?
If so, how did you do that? To what base?
What is the timebase of the micropulses?
by LdB » Wed May 17, 2017 8:39 am
Well it is as you expected then :-)

I find it a bit quirky: if the I/O bus can only latch at such a low speed, you would think they could provide a readback of when it's latched. You could then at least do stuff while you wait for it to latch; there are a lot of clock cycles going unused. They obviously don't really expect it to be used as high-speed I/O.
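There is no documented "write has landed" flag that I know of, but as a software approximation one could read the pin level register back after writing. A hedged sketch for bank 0 only: GPSET0 (0x1C) and GPLEV0 (0x34) are the BCM2835 register offsets, `gpio_set_and_confirm` is a hypothetical name, and whether GPLEV tracks the pin closely enough to be useful on the Pi3 is an assumption that would need checking on a scope.

```c
#include <stdbool.h>
#include <stdint.h>

#define GPSET0 0x1C /* output set register, pins 0..31 */
#define GPLEV0 0x34 /* pin level register, pins 0..31 */

/* Set a bank-0 pin, then poll the level register until the pin reads high
 * or the poll budget runs out. Returns true once the level matches. */
bool gpio_set_and_confirm(volatile uint32_t *gpio_base, unsigned pin,
                          unsigned max_polls)
{
    if (pin > 31)
        return false; /* bank 0 only in this sketch */

    uint32_t mask = 1u << pin;
    gpio_base[GPSET0 / 4] = mask;
    for (unsigned i = 0; i < max_polls; ++i) {
        if (gpio_base[GPLEV0 / 4] & mask)
            return true; /* the pin level now reflects the write */
        /* other useful work could be slotted in here instead of spinning */
    }
    return false;
}
```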

Anyhow, clock details are what I am ferreting around for; as with everything on the Pi3, so little is documented at this level in any meaningful way. The bootcode does all the clock setup initially, so the GPIO subsystem clock will be running at its default, as I haven't changed it.

The mailbox interface lets you change a number of clocks, but none that I can see is the GPIO system clock, although they note there are actually more clocks than are shown there:

Unique clock IDs:
0x000000000: reserved
0x000000001: EMMC
0x000000002: UART
0x000000003: ARM
0x000000004: CORE
0x000000005: V3D
0x000000006: H264
0x000000007: ISP
0x000000008: SDRAM
0x000000009: PIXEL
0x00000000a: PWM
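For reference, the firmware's "Get clock rate" property tag (0x00030002) can read back any of these clock IDs. Below is a minimal sketch that only builds the request buffer and the word written to the mailbox; the MMIO code that writes it and waits for the reply is omitted. The tag value, the property channel number 8 and the buffer layout follow my understanding of the firmware's mailbox property interface, so double-check them against that documentation; `build_get_clock_rate` and `mbox_word` are hypothetical names.

```c
#include <stdint.h>

#define TAG_GET_CLOCK_RATE 0x00030002u /* firmware property tag */
#define MBOX_CH_PROP       8u          /* ARM -> VideoCore property channel */

/* Fill an 8-word property buffer asking for the rate of one clock ID.
 * The real buffer must be 16-byte aligned, because the low 4 bits of its
 * address are used to carry the channel number. */
void build_get_clock_rate(uint32_t msg[8], uint32_t clock_id)
{
    msg[0] = 8 * sizeof(uint32_t); /* total buffer size in bytes */
    msg[1] = 0;                    /* request code: process request */
    msg[2] = TAG_GET_CLOCK_RATE;   /* tag identifier */
    msg[3] = 8;                    /* value buffer size: clock id + rate */
    msg[4] = 0;                    /* tag request code */
    msg[5] = clock_id;             /* e.g. 0x3 = ARM, 0x4 = CORE, 0xA = PWM */
    msg[6] = 0;                    /* rate in Hz, filled in by the firmware */
    msg[7] = 0;                    /* end tag */
}

/* Word for the mailbox write register: buffer address with the channel in
 * the low 4 bits. The address must be the one the VideoCore sees, which
 * depends on your memory and cache setup. */
uint32_t mbox_word(uint32_t buffer_addr)
{
    return (buffer_addr & ~0xFu) | MBOX_CH_PROP;
}
```

After the firmware has processed the request, `msg[6]` should hold the clock rate in Hz.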

I know it's not the Core, Arm, V3D, SDRAM or EMMC clocks as I know what they do.

If anyone knows and wants to put me out of my misery :-)
by django013 » Wed May 17, 2017 11:39 am

If the naming scheme ST uses wasn't invented by them, it should be something like APB1..n, where APB stands for Advanced Peripheral Bus (part of ARM's AMBA bus family) ;)
As ST chips have various APB clocks, the same might be true for the BCM2837, and some pins might clock at a higher speed when used by an alternate-function module like SPI or PWM...

Yes, documentation for the Pi3 processor is really a mess. I haven't seen complete specs yet, and no programmer's manual either. Most available docs refer to the Pi1, but the Pi3 is like a big secret :(

Shouldn't CMSIS have a startup-code template too?

LdB wrote: The bootcode does all the clock setup initially, so the GPIO subsystem clock will be running at its default, as I haven't changed it.

Then the peripheral bus will probably be derived from the crystal clock, which does not provide the highest speed possible.

I hadn't heard of mailboxes before (in an embedded context), but from the QA7 doc I think the mailbox is used for communication between the cores, isn't it?
So the peripheral bus needs to have its own prescale factors.
by rst » Wed May 17, 2017 3:42 pm
django013 wrote: Yes, documentation for the Pi3 processor is really a mess. I haven't seen complete specs yet, and no programmer's manual either. Most available docs refer to the Pi1, but the Pi3 is like a big secret :(
You should know that bare metal is not the favourite programming model for the RPi. That's my personal opinion, but everything I have read points me in this direction. As I understand it, the mission of the Raspberry Pi Foundation (RPF) is to get kids and young people familiar with computers and programming, and only a few kids will start using a computer by programming it bare metal. Writing documentation takes time and money, and the RPF has to decide carefully what to spend it on.

Nevertheless, I think the Cortex-A53 CPU in the RPi 3 is well documented. You can download the ARMv8 ARM and the TRM from the ARM site. We can also be happy that there is the BCM2835 Peripherals document, which is mostly still valid for the BCM2837. Without it (and the QA7 document) we would be walking totally in the dark. I think these forums are an important source of information too. It takes time to read them, I know.

The good thing about the Raspberry Pi is that it has many more users than most (or any) other related platforms. So if you manage to get your application running, you will have an audience. If you want to implement a bare metal application and an audience is not important to you, you may be on the wrong platform here. It is your decision.

django013 wrote: Then the peripheral bus will probably be derived from the crystal clock, which does not provide the highest speed possible.
To be honest, I cannot answer this. I have been dealing with the RPi for about four years, but I have never read about configuring a (general) GPIO clock that affects the timing of bit input or output at the GPIO pins. I think you have a different view here because you come from a different embedded platform, which provided such a clock. My previous platform was the PC, where GPIOs are very rare.

django013 wrote: I hadn't heard of mailboxes before (in an embedded context), but from the QA7 doc I think the mailbox is used for communication between the cores, isn't it?
The mailbox is used to communicate with the RPi firmware, which runs on the VideoCore (aka GPU). A lot of services otherwise available through some API can be reached using these mailbox services.
by dwelch67 » Thu May 18, 2017 1:50 am
Just like we use (or used to use) mailboxes to deliver messages to each other, it's no different here. Specifically, the RasPi ones we talk about are between the ARM and the GPU, but you could certainly set up your own between cores. It is usually just a register or memory location that both sides agree on: one side writes it and the other only reads it... with separate boxes (memory locations) for each direction if you need that.
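A minimal C11 sketch of such a one-direction, memory-based mailbox between cores, following the single-writer idea above: only the sender writes the box, and the receiver keeps its own private "last seen" counter. It assumes the box lives in memory both cores see coherently (shared cacheable memory with coherency enabled, or caches off); `mailbox_t`, `mailbox_send` and `mailbox_poll` are hypothetical names, not the Pi firmware mailbox.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* One-direction box: written only by the sender, read only by the receiver. */
typedef struct {
    _Atomic uint32_t data; /* payload */
    _Atomic uint32_t seq;  /* incremented by the sender after each payload */
} mailbox_t;

/* Sender side: publish a value. The release store makes the payload visible
 * to the receiver before the new sequence number is. */
void mailbox_send(mailbox_t *mb, uint32_t value)
{
    atomic_store_explicit(&mb->data, value, memory_order_relaxed);
    uint32_t s = atomic_load_explicit(&mb->seq, memory_order_relaxed);
    atomic_store_explicit(&mb->seq, s + 1, memory_order_release);
}

/* Receiver side: *last_seen is the receiver's private copy of the last
 * sequence number handled. Returns true and fills *out on a new message. */
bool mailbox_poll(mailbox_t *mb, uint32_t *last_seen, uint32_t *out)
{
    uint32_t s = atomic_load_explicit(&mb->seq, memory_order_acquire);
    if (s == *last_seen)
        return false; /* nothing new */
    *out = atomic_load_explicit(&mb->data, memory_order_relaxed);
    *last_seen = s;
    return true;
}
```

If the sender posts twice before the receiver looks, the receiver only sees the newest value; a second box in the opposite direction can serve as an acknowledgement when every message matters.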

AMBA/AXI are the traditional names for the buses on these full-sized cores; AHB and APB are certainly added to that for the microcontrollers, and they might also have put some on the big 'uns.

There is no general rule on clocking a chip, other than that you should do your system engineering. One of the many things you do before taping out (a very interesting historical term: the masks were literally hand-drawn, many times larger, on a material which is basically the same as clear tape; now it just means the files are sent to the foundry) is timing closure. We want this product to run at these clock speeds, and we have designed this portion of the device, if not the whole device, to work at these speeds, so we run simulations to determine whether the combinational paths between flip-flops/latches will resolve within the designed number of clock periods (or one clock in a lot of cases). This exercise gives you a feel for whether you want to change the design to meet closure, and what that means: do you simply fix the long paths by making them simpler, do you fix them by adding an intermediate latch, or do you fix it by saying this peripheral in the chip can only run at 1/x times the system speed? With the first parts you have another set of tests, called schmoo (various spellings, it is on Wikipedia). Basically you cover the various silicon mixtures and thicknesses that a chip might see in a normal production run: a part on the fast side has thicker material and/or a different mixture and can run faster, but can also burn out with infant mortality; a part on the slow side is... slow, and might not make timing at the desired speed. You of course have to do this at the desired max temperature for the part, which is at least a few times hotter than you need to burn your fingerprints off: 80-105C or more at the die, which is a little colder on the case, depending on the design requirements. In schmoo testing you push the various boundaries and plot; you are looking for margin. Can I run this at 750 MHz? Am I just barely making it, or do I have quite a bit of margin?
This is why you can overclock most devices: they are screened to work at max clock at max temperature, but if you make them colder (too cold and they will latch up and literally burn up something internally) then you can increase the clock and stay within margin, at least for the bulk of the production. You can of course also screen out some faster parts; the yield is lower, but one would expect you screen for X, and if a part fails it goes on to be screened for Y, so you end up with a number of "buckets" of speed grades. Or perhaps features in the chip can be fused out: a 486 with the FPU working is a 486DX, and with it broken it is a 486SX. It could potentially be the same die going through the same screening; some pass the FPU test, some don't, and you literally blow a fuse in the part that prevents that portion from working.

In microcontrollers power consumption is very high on the feature list, so it would make sense to have a number of the peripherals speed-limited by design relative to the CPU: you can't access all of them as fast as the CPU can run, so why burn the power, and maybe eat yield, by designing them to be faster? Not all chips, not all ARMs, work that way; in some the whole thing runs on one clock. What I suspect, from what we know about these chips, is that there is a 250 MHz system clock we know about; I wouldn't be surprised if much of the chip runs off that clock and the ARM core is on a faster clock (750, 1000, etc.). Likewise the GPU probably is too. Although designed for phones/tablets, which are power-conscious, these are not microcontrollers; we are not talking about milliamps here.

The AMBA/AXI is in the ARM clock domain; stepping down to another domain would be on the vendor side of that. After the AXI/AMBA/AHB/APB buses, the vendor logic would then cross a clock domain into the system logic with the peripherals, if you were to run at different clock rates.

On a part like this I wouldn't be surprised if there is no internal RC clock like an MCU has; it would rely on an external reference clock, probably 100 MHz, which we could maybe see from the original Pi schematics. Inside the part, PLLs are used to bump that up for the various use cases: the CPU cores of course, the system clock, and any peripheral clocks derived from that. It might be the case, as we see with 750/250 = 3 and 1000/250 = 4, that there is one PLL that creates both of those clocks, but who knows. You have the DRAM clocks as well, which could be separate PLLs. There's no PCIe on this device, nor Ethernet, but there is USB; you could clock that down or have another PLL. It's better to have one PLL and various divisors down from that, as the system is synced and there are fewer problems with crossing clock domains, but if nothing else, with DRAM we have that problem anyway, and it is easier to solve these days than it used to be.

My guess, as mentioned, would be that the 750 and/or 1000 and the 250 are derived from the same PLL; that the ARM, and possibly the GPU, are running off this faster clock; and that everything else, other than things in other clock domains (the DDR side of the DRAM IP, USB, video, etc.), is running at 250. Just a guess; I have no real information.

I also assume they don't have clock divisors for the peripheral clock bus, if there is one; again, this is not a microcontroller, and we are not fine-tuning the power consumption to milli/microamps. If that were remotely interesting, the first thing they would want to do would be to allow us to turn off ARM cores in the multi-core chips; even spinning idle, they use a noticeable amount of power. If those were not worth it, then a little GPIO peripheral isn't. I'm not saying they don't have registers for that, but AFAIK we have not seen them; someone would need to wade through GPU disassemblies...


Broadcom is very well known for being very tight with information. The Raspberry Pi experience has been shocking for us, and likely for them. How could they have known that the popularity would be what it is? The OLPC project was pretty much a failure IMO. How could they have predicted there would be a bare-metal community? How could they have predicted that one or a small number of individuals would reverse engineer the GPU, leading to BCM giving in and providing documentation? No surprise whatsoever that they have not produced docs for the Pi2 or Pi3 other than the little supplements we got, and no surprise that we didn't get full schematics. HUGE surprise that we got schematics for the early/original Pi, and HUGE surprise that we got the ARM peripherals doc we got. Typically you have to reverse engineer all of your information from the Linux driver sources, if they choose to honor the GPL and publish that code. Allwinner is a good example here, although recently vendors have been illegally publishing NDA-protected documents, and others are simply leaked into the wild.

So I don't think it is just a matter of why bother spending the man-hours on public docs for the relatively tiny bare-metal community; we don't need them to survive. The docs are already written internally; they could just release them (well, cutting out the peripherals they have NDAs for, like possibly the DDR controller and the USB controller) with minimal work. I think the keep-secrets factor is to some extent in play here. I am quite pleased with what they have provided so far, despite a couple of rants when they first refused schematics and docs for the Pi2; it makes total sense to me now. Now, the quality of the docs we have, sure, that is on the low end. I have seen worse, and have seen much better, but they did make an effort, and the community has to some extent picked up from there to keep track of the documentation errors. I still wish someone would make it their hobby to write a new doc from scratch, perhaps, that includes the info as well as the corrections, and post that somewhere in some form (ideally source-code-based documentation, LaTeX or DocBook for example, that builds PDF and epub/mobi). Not it!

The AMBA/AXI/AHB/APB information is available on ARM's website. The TRM indicates to some extent what the I/O looks like on the core, but is not detailed enough. For the Pi2 and Pi3 I have no doubt they are 64-bit AXI buses; for the ARM11, though, is that 64 or 32 bits wide? Probably 64. The L1 cache is inside the core, clocked accordingly, inside these external buses. If you buy an add-on L2 cache from ARM, then it has an AXI on each side, one to mate with the core and one the vendor hooks onto (also documented at ARM; the PL310 is one, if I remember correctly, could be wrong).

That makes it, and probably made it, fairly easy for Broadcom to rip out the ARM11 and put in the ARMv7, then rip that out and put the ARMv8 in there: just cut it off at the AXI bus and put it back on. There are other signals, like individual resets for each core if running as multi-core (documented in the TRM, I think). Broadcom didn't appear to document the individual resets and clock enables, so we simply deal with them all popping on at once; it would have been nice to have had access to the separate resets and enables for those cores (perhaps BCM has a register for that which the GPU uses, and the GPU software simply pops them all on rather than one at a time based on some mailbox thing). So it makes sense when they tell us that the Pi2 is a Pi1 chip with the ARM core replaced and the Pi3 is a Pi2 with the ARM core replaced. If you stay on the same foundry/process you don't have to redo timing; you only have to redo the layout of that section of the chip, etc. Now, the new cores are larger, one would assume, certainly going from a single-core ARM11 to a multi-core ARMv7 or v8, so maybe they did have to do a new layout of the whole chip, but I assume they didn't have to redo timing except at the boundaries where they cut/added stuff.

The M in CMSIS is for microcontroller, and this is not a microcontroller. I don't know, but I assume that ARM is not going to bother with chip vendors' CMSIS headers; they do the core headers and dictate some guidelines, and vendors that want to comply would need to make the headers/files. And you think Broadcom, who won't open up docs, would bother to write and publish header files for microcontroller use cases for the relatively small number of bare-metal folks using the platform? Nope... They already have header files for Linux that they or someone has produced for them, and that is all that is required of them.

Although it is difficult even for folks who can/do make board products to get Broadcom to sell to them, if you were to manage that, then there would be some docs we can't see that are protected by NDA that you would get to see, and/or access to their staff for getting your board product up. Where I work we were certainly not Broadcom-worthy, so I have no clue what any real, more useful Broadcom documents look like (and if I did, then the NDA would possibly prevent me from saying I had seen them or had an NDA)... other than the RasPi ones we have thus far.
by django013 » Thu May 18, 2017 4:11 am
Thank you for the big explanations :)

You should know, that bare metal is not the favourite programming model for the RPi.

Please - don't take me for stupid.
I saw lot of rpi projects and the raspberry homepage, which is targeted to kiddies and not to serious programmers ...

Anyway - the circle project shows, that I'm not the only with bare-metal wishes.

For me, the optimal solution would be a linux running on two cores leaving 2 cores to the programmer :)
That way you could leave the "disgusting" parts like communications or persitence-io to linux and care for "business logic" only :)
But that's not possible, so if you want full control of one core, you have to care about anything unwanted too. So that's where projects like circle come into play ...

... is well documented.

Sorry, but on this point I disagree completely.

I admit, that my expectations may be wrong. As I'm a pc programmer coming along the path from avr-programming to stm32 programming to rpi ...
So atmel is, what I call well documented. Whatever you imagine to do with an atmega or attiny there's a complete specs and of cause an app note with code samples for your problem.
Although st is not that good, I stil call it well documented. An stm32 consist of an arm core and peripherals and st documents both: the arm core, the peripherals and of cause the setup for both.
The java gui that helps on chip setup is a goodie that helps a lot in understanding, is it visualizes the clock tree :)

I agree with dwelch67 completely that the documentation has already been written. I'm convinced, that you can't sell a piece of silicon without specs and programmers manuals.
So the question is not about additional menpower to get the documents, but the question is, why those documents are hidden behind the nda-wall.

Maybe the success of the RPi was a surprise for Broadcom. That might be true for the Model 1...
But by the Model 3 they should have known that.
And as the Model 3 offers a SoC with real power and 64-bit capability, it should be obvious that serious programmers would ask for specs and programmer's manuals.
The ARM core is not even half of the picture. There are lots of peripheral modules, and there's a "bridge" between the core and the peripherals. So it does not help if you only have docs for the ARM core and for some of the peripheral modules.
The bridge is important: only when one understands the whole picture does efficient programming become possible.

... and as LdB already stated: if the two sides of a bridge run at different speeds, you'll need some means of synchronization to minimize the latching delays.
Posts: 21
Joined: Wed Mar 29, 2017 10:11 am
by LdB » Thu May 18, 2017 4:41 am
I pretty much agree with everything David and Django said, and I care little for what the Pi Foundation or Broadcom intended or want. They make a very, very cheap, fairly impressive system that can be bare-metalled, and it is getting easier every day as more information is added to the public knowledge pile.

I started playing with the Pi as a personal side project hobby, but there are several things I am looking at for commercial use: projects that would otherwise require the design to be done on something like a Xilinx SoC with a MicroBlaze soft core or a Cortex processor, where the cost and time of that process are formidable.

As with all designs, there are always things that we feel could have been done better, but they need to build the product for a price, and since they are producing in quantity and not going broke, they are meeting that target. I think the GPIO was obviously something they had to make do with, but it is a shame they didn't do better or publish how we can do better with it (AKA how they intended it to work on the Pi 3).

I actually think the Pi is now one of the easier full systems to program, especially since the bootloader essentially initializes everything, ready for you to start. Every day you see more and more bare-metal code available for the Pi, and it will only get easier.

I have said this before: I am surprised the Pi Foundation and ARM haven't cosied up more. I come from the same commercial world as Django, where ARM is a minor bit player; my only experience with them was via the Xilinx SoC. ARM seemed interested in spreading out from the phone/tablet market, and I thought they might be interested in helping the Pi Foundation as part of that drive.
Posts: 301
Joined: Wed Dec 07, 2016 2:29 pm
by rst » Thu May 18, 2017 11:08 am

I think you have to take a platform as it is, at least when you have a project and have to start with it now. That means you cannot say: "The hardware is great, but the documentation is poor." The documentation will not change because of this.

What I wrote targeted this problem, which, as I read it, is basically yours at the moment. You need some specific information on GPIO clocking, and there is no reliable source to get it from. You can do two things now: just try it out (perhaps it works anyway) or let it be and take another platform.

Don't get me wrong. I like this platform and I use it for bare metal programming only. I wouldn't have developed Circle if I did not like this machine. But because of this, I know that it is not perfect for this purpose. Often you have to use trial and error to get some information, and in the end you find a solution, but is it the perfect one? You don't know.

My English is not quick enough for longer discussions, excuse me.
Posts: 266
Joined: Sat Apr 20, 2013 6:42 pm
Location: Germany
by django013 » Thu May 18, 2017 11:54 am
I think you have to take a platform as it is, at least when you have a project and have to start with it now. That means you cannot say: "The hardware is great, but the documentation is poor." The documentation will not change because of this.

Yes, you're right.

I didn't want to argue about platform or framework. Maybe I hoped that you knew what I'm looking for. And I thought that, since the RPi hardware is similar to the STM32, the software might be similar too.
That was the point I was looking for.

If I had commercial plans, I would contact Broadcom, sign the NDA and get the docs. So what?
But I don't have commercial plans, and I don't have a 300 MHz oscilloscope to track down programming bugs.

When I look at LdB's results, I know that the RPi 3 programming was wrong. But I don't know how to do it better. So the only choices I have are to shut up and leave the platform alone, or to ask for help and hope that someone who knows what I'm looking for reads my question...

My English is not quick enough for longer discussions, excuse me.

I'm German too ;)
Posts: 21
Joined: Wed Mar 29, 2017 10:11 am
by trolly » Thu May 18, 2017 4:24 pm

I want to use Circle as the base environment to develop my robot's OS on the Raspberry Pi. For that it will need drivers for the following things:
- Adafruit PCA9685 (16 channel I2C servo driver)
- Gyrosensor ( ... 0.0.6iWnf6)
- Ultrasonic sensor ( ... 0.0.tLj0yH)

These are designed for Arduino, but I think they could be used on the Raspberry Pi via the I2C interface.

Are you interested in adding support for them in Circle?
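As a possible starting point for the PCA9685 part of this request, here is a minimal C++ sketch of the register-level side of such a driver. It assumes the register map and prescale formula from the NXP PCA9685 datasheet (MODE1 at 0x00, LED0_ON_L at 0x06 with four registers per channel, PRE_SCALE at 0xFE, 25 MHz internal oscillator) and the Adafruit board's default I2C address 0x40. The namespace and function names are hypothetical, and the actual I2C transfer is deliberately left out, since it depends on the I2C master class of the bare-metal environment in use.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Hypothetical helpers; register addresses follow the NXP PCA9685 datasheet.
namespace pca9685 {

constexpr std::uint8_t kDefaultAddress = 0x40;  // Adafruit board default (A0..A5 open)
constexpr std::uint8_t kMode1 = 0x00;           // bit 5 = auto-increment, bit 4 = SLEEP
constexpr std::uint8_t kLed0OnL = 0x06;         // each channel uses 4 consecutive registers
constexpr std::uint8_t kPreScale = 0xFE;        // writable only while SLEEP is set
constexpr double kOscHz = 25e6;                 // internal oscillator frequency

// prescale = round(osc / (4096 * freq)) - 1, clamped to the chip's 3..255 range.
std::uint8_t prescale_for(double pwm_hz) {
    long value = std::lround(kOscHz / (4096.0 * pwm_hz)) - 1;
    if (value < 3) value = 3;
    if (value > 255) value = 255;
    return static_cast<std::uint8_t>(value);
}

// Converts a servo pulse width to a 12-bit OFF count, assuming the nominal period.
std::uint16_t counts_for_pulse(double pulse_us, double pwm_hz) {
    double period_us = 1e6 / pwm_hz;
    long counts = std::lround(pulse_us / period_us * 4096.0);
    if (counts < 0) counts = 0;
    if (counts > 4095) counts = 4095;
    return static_cast<std::uint16_t>(counts);
}

// Builds the 5-byte I2C write: start register, then ON_L, ON_H, OFF_L, OFF_H.
// The pulse starts at count 0 (ON = 0) and ends at 'off'.
// Writing all four registers in one transfer requires the MODE1 auto-increment bit.
std::array<std::uint8_t, 5> channel_write(unsigned channel, std::uint16_t off) {
    return {static_cast<std::uint8_t>(kLed0OnL + 4 * channel),
            0x00, 0x00,
            static_cast<std::uint8_t>(off & 0xFF),
            static_cast<std::uint8_t>((off >> 8) & 0x0F)};
}

}  // namespace pca9685
```

For a standard hobby servo at 50 Hz, prescale_for(50.0) gives 121, and a 1.5 ms centre pulse maps to an OFF count of 307. Because the real frame rate after rounding the prescale is not exactly 50 Hz, the pulse widths are approximate.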
Posts: 7
Joined: Mon Oct 21, 2013 9:46 pm
by LdB » Fri May 19, 2017 3:52 am
I have been given a link showing the workings of the GPIO on the Pi by a party that wishes to remain unidentified ... ntrol2.pdf
I don't know the background of this datasheet and am taking it on good faith that it is okay to link it.

It clearly shows that there is a tri-state buffer on the pin, along with settings for its drive strength and slew rate. By attempting to drive it at this speed with a low current setting, they are saying, I am destabilizing the tri-state latch.
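The drive-strength and slew-rate settings mentioned above live in the BCM283x GPIO pad-control registers. The sketch below composes a value for one pad group. It assumes the layout described in the Raspberry Pi pads-control note (password 0x5A in bits 31:24, SLEW in bit 4 where 1 means not slew-rate limited, HYST in bit 3, DRIVE in bits 2:0 from 2 mA to 16 mA) and the RPi 2/3 peripheral base 0x3F000000; treat the addresses as assumptions to check against your board, and the names as hypothetical.

```cpp
#include <cstdint>

// Pad-control register layout; names are illustrative, not taken from Circle.
namespace pads {

// Bus addresses 0x7E10002C/30/34 mapped to ARM physical addresses on RPi 2/3
// (peripheral base 0x3F000000). The RPi 1 would use 0x2010002C etc.
constexpr std::uintptr_t kPads0_27  = 0x3F10002C;  // GPIO 0..27
constexpr std::uintptr_t kPads28_45 = 0x3F100030;  // GPIO 28..45
constexpr std::uintptr_t kPads46_53 = 0x3F100034;  // GPIO 46..53

constexpr std::uint32_t kPassword = 0x5Au << 24;   // writes without it are ignored
constexpr std::uint32_t kSlewUnlimited = 1u << 4;  // 0 = slew rate limited
constexpr std::uint32_t kHysteresis = 1u << 3;     // input hysteresis enable

// drive_ma must be one of 2, 4, ..., 16; the DRIVE field encodes (mA / 2) - 1.
constexpr std::uint32_t pad_value(unsigned drive_ma, bool slew_unlimited, bool hysteresis) {
    return kPassword
         | (slew_unlimited ? kSlewUnlimited : 0u)
         | (hysteresis ? kHysteresis : 0u)
         | (((drive_ma / 2u) - 1u) & 0x7u);
}

// Writes one pad group. On bare metal, pass
// reinterpret_cast<volatile std::uint32_t*>(kPads0_27); on a host, any variable.
inline void write_pads(volatile std::uint32_t* reg, std::uint32_t value) {
    *reg = value;
}

}  // namespace pads
```

For the full-strength setting with unlimited slew rate and hysteresis enabled, pad_value(16, true, true) gives 0x5A00001F for the GPIO 0..27 group. Note that the setting applies to the whole group, not to a single pin.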

The source also says that the maximum speed is hard-set at 68 MHz, in a way related to that of the Pi 1, which matches what I can see of my signal when it's working :-)

Will work on this tonight.
Posts: 301
Joined: Wed Dec 07, 2016 2:29 pm
by LdB » Fri May 19, 2017 6:50 am
Sorry for the photo quality; I didn't realize the light was at the wrong angle, and I am too lazy to set it all back up again.

Yes, it's a drive problem, and even with full-strength settings I had to take the Pi 3 down to 700 MHz... it played up at 1.2 GHz with the scope lead setup I have. I needed alligator clips to get my hands free to take the photo... it was fine at 1.2 GHz with me holding the scope lead on and no alligator clips, but I can't take a photo with no hands.

So that's 41.77 MHz at 700 MHz.

So at 1.2 GHz it would be 1200/700 * 41.77 ≈ 71.6 MHz, and it reported 70.48 MHz to me when I was holding it.
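The scaling above can be written out as two small helpers. They assume, as the post does, that the toggle loop is CPU-bound, so the output frequency scales linearly with the ARM core clock; the function names are hypothetical.

```cpp
// Linear estimate of a CPU-bound GPIO toggle rate at a different core clock.
// Assumption: the loop's cost in core cycles stays constant when the clock changes.
double scaled_toggle_mhz(double measured_mhz, double measured_core_mhz,
                         double target_core_mhz) {
    return measured_mhz * target_core_mhz / measured_core_mhz;
}

// Core clock cycles spent per output period (one high phase plus one low phase).
double core_cycles_per_period(double core_mhz, double signal_mhz) {
    return core_mhz / signal_mhz;
}
```

scaled_toggle_mhz(41.77, 700.0, 1200.0) returns about 71.6 MHz, slightly above the 70.48 MHz measured by hand, and core_cycles_per_period(700.0, 41.77) comes to roughly 16.8 core cycles per output period.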

So at these sorts of GPIO speeds you will not be able to simply drag the signals out on loose wires; you will need to take all the normal high-speed design considerations into account. Failing to do so lets the tri-state latch do bad things :-)
Posts: 301
Joined: Wed Dec 07, 2016 2:29 pm
by django013 » Fri May 19, 2017 7:40 am
I don't know whether I got you right.

I think tri-state has nothing to do with the output frequency of a pin. It is used to disable the output amplifier when the pin is switched to input mode.
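For reference, on the BCM283x the output driver of a pin is enabled or disabled through the GPFSELn function-select registers: each register holds ten 3-bit fields, where 000 selects input and 001 selects output. The sketch below only computes the register index and the new register value; the register address (GPIO base + 4 * index, with the GPIO base assumed at 0x3F200000 on RPi 2/3) is something to verify, and the helper names are hypothetical.

```cpp
#include <cstdint>

// BCM283x GPIO function select: each GPFSELn register holds ten 3-bit fields.
// 000 = input (output driver disabled), 001 = output.
enum class PinFunction : std::uint32_t { Input = 0, Output = 1 };

// GPFSEL0 covers GPIO 0..9, GPFSEL1 covers GPIO 10..19, and so on.
constexpr unsigned fsel_index(unsigned pin) { return pin / 10; }
constexpr unsigned fsel_shift(unsigned pin) { return (pin % 10) * 3; }

// Read-modify-write of one 3-bit field; the other pins in the register are preserved.
inline std::uint32_t with_function(std::uint32_t reg, unsigned pin, PinFunction fn) {
    const unsigned shift = fsel_shift(pin);
    reg &= ~(0x7u << shift);
    reg |= static_cast<std::uint32_t>(fn) << shift;
    return reg;
}
```

Because the other nine fields are preserved, the same read-modify-write works for any pin; for GPIO 17 the field is bits 23:21 of GPFSEL1.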

The delay, or what you call latch, results from the difference between the GPIO frequency and the core frequency. I have no idea where the junk pulses in your earlier oscilloscope screenshot came from. To me it looks like your commands caused an overshoot in the amplifier stage of the pin, but that should not be possible.
Probably I don't know enough to talk about those things.

cheers Reinhard
Posts: 21
Joined: Wed Mar 29, 2017 10:11 am
by LdB » Fri May 19, 2017 1:53 pm
I am not a hardware engineer, but I have had enough exposure to it and played with it, so let's see how I go.

Tri-state buffers in SoCs or FPGAs usually latch up more than any other circuit, because the device has an on-state resistance, but you have a high-side and a low-side gate that you need to get off. The voltage appearing across RDS(on) generally makes it harder to get one or the other gate closed. In fact, since the late '90s most Verilog and VHDL software converts them into a MUX structure unless a special tri-state logic block exists for them in the FPGA or SoC.

Now, in some ASICs and SoCs they would still exist, but it would depend on some skilled hardware engineer designing them not to latch up.

For the record, in an FPGA you are also not supposed to join the output buffers the way the Pi does, because the switching rates of the buffers may not be the same, so one can drive into the other. Now, they may have a special logic block designed carefully and lovingly by a skilled hardware engineer, and it may be perfectly valid, but we have no documentation to say that it is :-)
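To illustrate the point about joined output buffers, here is a purely conceptual C++ model (not the real BCM pad circuit, which is undocumented): several tri-state drivers on one net, resolved to Low, High, HighZ or Contention, next to the MUX structure that synthesis tools substitute. The type and function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Conceptual model only: resolution of a shared net driven by tri-state buffers.
enum class Level { Low, High, HighZ, Contention };

struct TriStateDriver {
    bool enabled;  // output enable; when false the buffer is high impedance
    bool value;    // level it drives when enabled
};

// No enabled driver -> HighZ; all enabled drivers agree -> that level;
// enabled drivers disagree -> Contention (one buffer drives into the other).
Level resolve(const std::vector<TriStateDriver>& drivers) {
    bool any = false;
    bool level = false;
    for (const auto& d : drivers) {
        if (!d.enabled) continue;
        if (any && d.value != level) return Level::Contention;
        any = true;
        level = d.value;
    }
    if (!any) return Level::HighZ;
    return level ? Level::High : Level::Low;
}

// The MUX replacement: exactly one source is selected, so contention
// cannot occur by construction.
bool mux(const std::vector<bool>& sources, std::size_t select) {
    return sources.at(select);
}
```

Two enabled drivers with different values resolve to Contention, which corresponds to the case described above where one buffer drives into the other; the MUX version cannot reach that state.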

All I can tell you is that the signals were clearly latching up to both rails; you could see them slam hard into both rails. That, and the fact that it was unsticking and latching to the other rail, was how I knew it was a tri-state before I even saw the circuit. I have seen it before when writing bad VHDL on FPGAs in the old days.
Posts: 301
Joined: Wed Dec 07, 2016 2:29 pm
by dwelch67 » Fri May 19, 2017 2:48 pm
Maybe this unnamed person has register specs for, say, the individual ARM core reset and enable signals (if they are separate in the BCM logic), and for any other useful registers that are currently undocumented (assuming they still are; that may have changed since I last looked).
Posts: 721
Joined: Sat May 26, 2012 5:32 pm