ShiftPlusOne
Raspberry Pi Engineer & Forum Moderator
Posts: 5767
Joined: Fri Jul 29, 2011 5:36 pm
Location: The unfashionable end of the western spiral arm of the Galaxy

Loading 32bit values into registers.

Mon May 26, 2014 6:11 am

In the past, I've used C on dev boards that provide nice libraries to hide all the magic that happens at the register level. I decided to finally get around to playing with bare metal programming on the Pi and learning a bit of ARM asm as well.

I've noticed the Cambridge Baking Pi tutorial does this

Code: Select all

ldr r0,=0x20200000 
which, when assembled with gas, actually puts the 0x20200000 somewhere at the end and uses the program counter with an offset to load it.

I noticed all the wizards are using fasmarm, so I decided to try doing the tutorial using fasmarm as well, but of course it doesn't do the same trick that gas does. I found that they use the following macro:

Code: Select all

macro imm32 reg,immediate {
  mov reg,(immediate) and $FF
  orr reg,(immediate) and $FF00
  orr reg,(immediate) and $FF0000
  orr reg,(immediate) and $FF000000
}
Following the OK1 tutorial, I ended up with this:

Code: Select all

format binary as 'img'

start:
	; Put GPIO16 in output mode
	ldr r0, [GPFSEL0]
	mov r1, #1			;Output mode
	lsl r1, #18 		;GPIO16
	str r1, [r0,#0x4]	;GPFSEL1
	; Clear GPIO16
	mov r1,#1			;Clear
	lsl r1,#16			;GPIO16
	str r1,[r0,#0x28]	;GPCLR0
	
hang:
	b hang

GPFSEL0: dw 0x20200000
So my question is: what are the advantages of the macro and why are the wizards using it? Something tells me the answer might have something to do with offset ranges, but I don't know.

tufty
Posts: 1456
Joined: Sun Sep 11, 2011 2:32 pm

Re: Loading 32bit values into registers.

Mon May 26, 2014 7:42 am

There are no advantages to that macro, only disadvantages.

The logic for loading a 32 bit constant into a register on ARMv6 using ARM assembly language is as follows:

- if x can be expressed as a modified immediate value*, use MOV
- if the complement of x can be expressed as a modified immediate value, use MVN
- Otherwise use a pc-relative load (what you're seeing).

That gives the fastest load available in each case. The macro you're looking at takes the "brain dead" approach of "everything is an arbitrary value".
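To make the three cases concrete, here is a minimal sketch in GNU as syntax; the constants are just chosen for illustration:

Code: Select all

mov r0, #0xFF000000    @ 0xFF rotated right by 8 bits: fits as a modified immediate
mvn r1, #0             @ the complement (0) fits, so MVN gives 0xFFFFFFFF in one instruction
ldr r2, =0x20200000    @ the significant bits span 9 positions, nothing fits,
                       @ so the assembler falls back to a pc-relative load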

Looking at the ARM1176jzf-s TRM, section "Cycle timings and interlock behaviour" we see that:

MOV Rn, x -> 1 cycle
MVN Rn, x -> 1 cycle
LDR Rn, [PC, #constant] -> 1 cycle, with a latency of 3 cycles on Rn

The macro, which consists of 4 data processing instructions, will always take 4 cycles, and will always underperform the other options by at least 1 cycle, even against the worst case of the pc-relative load, where you do something stupid like this:

Code: Select all

   ldr r0, [pc, #xx] ; My constant
   add r0, r0, r0 ; try to use r0 straight away, incurring a 3 cycle wait on use of r0
It's worth noting that a lot of compilers will generate the "worst case" code unless they have some fairly hairy ARM-specific optimisations enabled.

Another thing worth considering: if you can manage to express most of your constants as modified immediate values, you can not only load them fast, but mostly use them directly as immediates in data processing instructions rather than loading them into registers at all - the macro's 4 cycles become *zero* cycles (no register load) *and* win you an extra free register. For example, here's some already fairly optimised code to count the number of entries in a linked list.

Code: Select all

    mov r2, #0xff000000 ; end of list marker
    mov r1, #0 ; count
loop:
    cmp r0, r2
    ldrne r0, [r0]
    addne r1, #1
    bne loop
    ...
Can be expressed as:

Code: Select all

 
    mov r1, #0 ; count
loop:
    cmp r0, #0xff000000 ; end of list marker
    ldrne r0, [r0]
    addne r1, #1
    bne loop
    ...
... thus freeing up r2 for other uses. Now, the above is trivial (the constant only gets loaded once), but if the loop were more complex, the register holding it might have to be spilled (shoved to the frame temporarily) in order to free it up for other usage. Just spilling the register *once* in the loop would incur 4 load/store operations per loop iteration.

So. Why do the fasm boys & girls use that macro? Because fasmarm doesn't provide the facilities armasm or gcc do, and nobody (as far as I'm aware) has bothered to implement a more intelligent macro for fasmarm. Perhaps those wizards aren't as clever as they proclaim themselves to be.

Simon

* i.e. can be expressed as an 8-bit value rotated right by an even number of bit positions

ShiftPlusOne
Raspberry Pi Engineer & Forum Moderator
Posts: 5767
Joined: Fri Jul 29, 2011 5:36 pm
Location: The unfashionable end of the western spiral arm of the Galaxy

Re: Loading 32bit values into registers.

Mon May 26, 2014 8:09 am

Thanks for an awesomely detailed answer. There's a fair bit there to wrap my head around.

tufty
Posts: 1456
Joined: Sun Sep 11, 2011 2:32 pm

Re: Loading 32bit values into registers.

Mon May 26, 2014 9:36 am

I should probably add that it would be quite hard to implement "proper" constant loading as a fasmarm macro. The decision about how to load a constant (mov, mvn, etc.) should be entirely doable, as I'm led to believe that fasm's macro language is Turing complete. The pc-offset ldr option may well be quite hard, though, as it requires a "literal pool" to be laid down out of the instruction stream but still within "striking distance" of the PC; that's hard to do, especially if you're not generating the entire "syntax tree" of the assembler source. That said, gnu assembler users live with the option of "liberally dotting their source with .ltorg pseudo-instructions", so if fasmarm can make two macros co-operate, it would be possible.

Worst case, it would be possible to make the "mov/orr/orr/orr" macro the fallback option, although standalone mov / mvn instructions could well be considered an antipattern where they are being used for comparison or arithmetic (comparisons are covered by cmp / cmn, and basic arithmetic can either be expressed using immediates or, in many cases, be covered by rewrite rules at assembly time).
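To make that concrete, here is an untested sketch of what such a fallback macro might look like in fasmarm. It only recognises plain 8-bit values and their complements, not the full rotated-immediate range the hardware can encode, and "loadimm" is just an illustrative name:

Code: Select all

macro loadimm reg,immediate {
  if ((immediate) and $FFFFFFFF) < $100
    mov reg,(immediate)                          ; small value: a single MOV
  else if ((not (immediate)) and $FFFFFFFF) < $100
    mvn reg,(not (immediate)) and $FFFFFFFF      ; complement fits: a single MVN
  else
    mov reg,(immediate) and $FF                  ; fall back to the 4-instruction build
    orr reg,(immediate) and $FF00
    orr reg,(immediate) and $FF0000
    orr reg,(immediate) and $FF000000
  end if
}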

dwelch67
Posts: 955
Joined: Sat May 26, 2012 5:32 pm

Re: Loading 32bit values into registers.

Mon May 26, 2014 2:16 pm

I will throw my two cents in here. The answer from tufty is excellent, not stomping on that. I recommend you try all of those solutions and understand them. In particular, get the ARM docs for the instruction set(s) (ARM and Thumb) and, in this particular case, understand the immediate encoding. It is a little strange but beautiful at the same time. In a nutshell, the ARM encoding gives you up to 8 significant bits (non-zero for mov, non-one for mvn) in a window that can be rotated by an even amount, for example

Code: Select all

00000000 <.text>:
   0:	e3a002ff 	mov	r0, #0xf000000f
   4:	e3a000ff 	mov	r0, #0x000000ff
   8:	e3a00eff 	mov	r0, #0x00000ff0
   c:	e3a00a12 	mov	r0, #0x00012000
  10:	e3e00ced 	mvn	r0, #0x0000ed00
all will apparently encode. The ones that don't encode in one instruction, you have to figure out what to do with. The ldr equals thing is a shortcut, and at least with the gnu assembler, if it can fit the immediate into a mov or mvn it will make that shortcut for you; if it can't, it finds (or tries to find) a place to put the value and then encodes a pc-relative load.

I recommend you learn to make your own pc-relative loads so that you control where the data goes:

ldr r0,hello
mov r0,#0xF000000F
mov r0,#0x000000FF
mov r0,#0x00000FF0
mov r0,#0x00012000
mov r0,#0xFFFF12FF
...
hello: .word 0x12345678

and make sure you don't put that data in the execution path; it needs to hide behind an unconditional branch (b, bx, ldr pc)

and what you need to do to reach a far address.

Code: Select all

ldr r1,near
ldr r0,[r1]
mov r0,#0xF000000F
mov r0,#0x000000FF
mov r0,#0x00000FF0
mov r0,#0x00012000
mov r0,#0xFFFF12FF
...
near: .word far
...
far: .word 0x12345678

the temptation here would be to

Code: Select all

near: .word =far
so try that and disassemble it and see if it seems right. I don't think it is.

Now Thumb immediate encoding comes in more than one flavor, and at least one of them is not beautiful (it is difficult at best), but it is still worth understanding which immediates you can and can't do in a single instruction.

In short, get the ARM docs and understand the immediate encoding for ARM instructions. And understand the assembler syntax so that you can generate your own pc-relative loads to get at data, not just for immediates but in general, for loading things from other objects that are linked in later (and may not be reachable by a pc-relative load). Then you are immediately promoted to wizard status yourself. This takes a matter of minutes to understand, not hours or weeks...

David

ShiftPlusOne
Raspberry Pi Engineer & Forum Moderator
Posts: 5767
Joined: Fri Jul 29, 2011 5:36 pm
Location: The unfashionable end of the western spiral arm of the Galaxy

Re: Loading 32bit values into registers.

Mon May 26, 2014 11:33 pm

Thanks. I need to let the information bake in my head for a little. I get it, but need a bit of practice. Seems like the tricky part will be getting at data that cannot be encoded in a mov instruction and that is outside of PC offset reach.

tufty
Posts: 1456
Joined: Sun Sep 11, 2011 2:32 pm

Re: Loading 32bit values into registers.

Tue May 27, 2014 7:55 am

It's not that hard, to be honest. PC-offset addressing gives you about ±1024 instructions of "striking distance", which is enough for 90% of cases. Where it's not enough, you can manually place literal pools closer by branching over them. It's worth mentioning that, assuming you use the GNU tools, you'll get the following, fairly obvious, error at assembly time if there's no literal pool close enough:
Error: invalid literal constant: pool needs to be closer
The general approach is to put a .ltorg directive in between every function and, in the few cases where that still isn't close enough, to place a literal pool "instream". A good place is usually just after a loop, where the extra branch won't hurt too much, viz:

Code: Select all

    ldr r0, =0xdeadface
    ...
    bne loop
    b after_pool
.ltorg
after_pool:
    ...
Neither the GNU nor the ARM tools will automatically generate backward references. So the following will fail, despite there being a literal pool well within striking distance.

Code: Select all

.ltorg
 <less than 4079 bytes>
foo:
	ldr r0, =0xdeadface
<more than 4095 bytes>
.ltorg
You can, of course, try manually placing your constant(s) in this case:

Code: Select all

.ltorg
myconstant: .word 0xdeadface
foo:
<less than 4079 bytes>
	ldr r0, myconstant
...

ShiftPlusOne
Raspberry Pi Engineer & Forum Moderator
Posts: 5767
Joined: Fri Jul 29, 2011 5:36 pm
Location: The unfashionable end of the western spiral arm of the Galaxy

Re: Loading 32bit values into registers.

Tue May 27, 2014 10:07 am

Hm, I was thinking of a case where you might have a certain value that's used all over the place (for example a base address). My initial feeling was that sticking it in a literal pool every time it comes up might not be the best way to do it.

dwelch67
Posts: 955
Joined: Sat May 26, 2012 5:32 pm

Re: Loading 32bit values into registers.

Tue May 27, 2014 3:09 pm

For hardcoded numbers the compilers tend to just keep making new copies of the value at the end of the function. For hand-written assembly, a single instance will save some RAM, but is that a premature optimization? The nice-ish thing is that you can just place one instance of your base address in hand assembler and keep using it that way until the tool complains. Then decide if you want a second copy or if you want to do a far load of the original. That is the nice thing in general: hope for the best with immediates, etc., and then deal with problems if/when the tool complains.

tufty
Posts: 1456
Joined: Sun Sep 11, 2011 2:32 pm

Re: Loading 32bit values into registers.

Tue May 27, 2014 4:17 pm

ShiftPlusOne wrote:Hm, I was thinking of a case where you might have a certain value that's used all over the place (for example a base address). My initial feeling was that sticking it in a literal pool every time it comes up might not be the best way to do it.
Literal pooling probably /is/ the best way of doing it. Remember, it only takes one cycle (and a memory access) to load. Use a symbolic equate to keep the value defined in one place in your source and let the assembler deal with placing it in memory (it should do constant folding within a literal pool). Worrying about anything else is a very premature optimisation for space, which will probably have a trade-off in execution time.
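With the GNU tools that boils down to something like this (a minimal sketch; GPIO_BASE is just an illustrative name):

Code: Select all

.equ GPIO_BASE, 0x20200000   @ one definition in the source

    ldr r0, =GPIO_BASE       @ a one-cycle pc-relative load at each use...
    ...
    ldr r4, =GPIO_BASE       @ ...and duplicate pool entries get folded into one
.ltorg                       @ literal pool placed outside the execution path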

ShiftPlusOne
Raspberry Pi Engineer & Forum Moderator
Posts: 5767
Joined: Fri Jul 29, 2011 5:36 pm
Location: The unfashionable end of the western spiral arm of the Galaxy

Re: Loading 32bit values into registers.

Tue May 27, 2014 7:14 pm

I'm using fasmarm, and as far as I know there's no .ltorg or letting the assembler deal with memory, unless there are some features I'm not aware of. I've just been putting them in as described in the first post. Anyway, I think I've got the hang of it, thanks again.

krom
Posts: 61
Joined: Wed Dec 05, 2012 9:12 am
Contact: Website

Re: Loading 32bit values into registers.

Wed Jun 04, 2014 11:24 am

Hi guys, I just wanted to clear some of this IMM32 macro stuff up:
In 2011 I discovered FASMARM, after needing an ARM assembler that could compile floating point opcodes for the Raspberry Pi's CPU, for the first bare metal floating point fractal demo on the system. I noticed that FASMARM did not have ltorg or the ldr= style of loading immediate values, so I wanted a quick and simple way of loading any 32, 24, or 16 bit immediate value, hence the macros.

I did this for a very simple reason: so that even a child can understand my short & simple bare metal code examples...
I have devoted the last 15 years of my life to promoting the ARM CPU, starting with the GBA and NDS systems and now the R-Pi, promoting free university-level ARM assembly education through the internet (something which I am very passionate about). I am a 3D artist in my day job, and I first learnt how to program the ARM CPU many years ago because I wanted to better judge the GBA competitions we were running to promote our website GBADEV.

Around the same time I found FASMARM, I started converting my huge library of bare metal GBA code to FASMARM, which I will be releasing on my GitHub soon. My previous ARM assembler was GOLDROAD, an in-house GBA ARM assembler written by a guy who worked for RARE, which he kindly donated to my website very early on and which had the ltorg and ldr= statements built in.
As I started to convert my especially power-hungry GBA projects, like my software 3D engine & my video codec, I noticed something very strange...

My Myst conversion for the GBA, which uses my video codec, was much faster at playing back the video files once I had only replaced all the ldr= statements with IMM32 & IMM16 accordingly, e.g. the video was decoding all the video frames seconds before the audio had finished playback. Also my software 3D engine was significantly faster, showing more frames per second. The reason it is so easy for me to see big speed differences is that the GBA has a low-powered ~16 MHz ARM CPU, so every cycle counts. All the 3D engine code & video decoding code runs from internal work RAM (IWRAM), which has zero wait-states and which the GBA allows direct access to, for faster code throughput.

So I was really quite happy with including all the macros in all my educational R-Pi material on my GitHub the last few years =D

Everything was fine until about 6 months ago, when someone started to say that there were way better ways of doing it, etc...
"Why don't you use ltorg & ldr=", they were saying...
"Why don't you use my amazing 120 line macro to get the best result"...
"I don't think they are the wizards they claim to be using this IMM32 macro, it is the 'brain dead' approach" etc.

I found all of this highly offensive, not to me but to the educational process which we should all be trying to promote, so I thought I would voice my real hardware-testing reasons for using the IMM32 macros: code simplicity & speed. Also, you will find that all my code only uses the IMM macros to load up register state right at the start, & I never use them in tight inner loops.

Here is a hint as to why I think the "clever" people on many boards have no real clue how long CPU opcodes take to execute:
The GBA, even though it is almost just an ARM CPU plugged onto an LCD screen with RAM, has internal GFX hardware & internal timers that play havoc with the execution timing of opcodes.
The Raspberry Pi even more so, as it has so much going on inside the system hardware.
Loads from memory & any opcode executed are dictated by the RAM & internal cache speeds, & by the internal hardware component timings themselves.
Even placing code directly into internal work RAM on the GBA (which has zero wait-states) and running it from there will not guarantee you perfect cycle timing of the opcodes...

So the moral of this story is:
1. Always test what code is faster for yourself on real hardware.
2. Take what the seemingly "clever" people on this board say with a massive pinch of salt, as they are rarely correct in my experience.

P.S. I have never stated that I am a "Wizard" at anything (OK, maybe at playing Mario Kart), so I do not know where that came from!!
But I would like to state right now that Dex is a Wizard =D

ShiftPlusOne
Raspberry Pi Engineer & Forum Moderator
Posts: 5767
Joined: Fri Jul 29, 2011 5:36 pm
Location: The unfashionable end of the western spiral arm of the Galaxy

Re: Loading 32bit values into registers.

Wed Jun 04, 2014 11:37 am

Great to hear from the wizard in question. Thanks for the github repo; it has been fairly educational. Don't take the criticism so personally. It's clear that there are trade-offs, and the answers have been fairly objective about what those trade-offs are. I've read the responses and understood that outside of loops there are few practical disadvantages to using the macro. But having read about the disadvantages, I found that educational too, as it made me go back to the manual and read up on the cycle counts of different instructions.

krom
Posts: 61
Joined: Wed Dec 05, 2012 9:12 am
Contact: Website

Re: Loading 32bit values into registers.

Wed Jun 04, 2014 12:22 pm

Cheers for the great comments, it is great that you are learning so much in a short space of time =D
(Sorry if my comments came across a little harsh, I had to state my reasons to why I use those macros!)

dwelch67
Posts: 955
Joined: Sat May 26, 2012 5:32 pm

Re: Loading 32bit values into registers.

Wed Jun 04, 2014 6:34 pm

I don't know if you were targeting me or not, and that is fine if you were.

First off, the GBA was a very interesting platform for performance; what performed well on other ARM platforms was often the opposite on the GBA. There was that fast 32-bit-wide, byte-addressable memory, but the rest of it was on a 16-bit bus, so on top of the normal AMBA/AXI overhead that you can't get rid of, you also had to do two cycles for a 32-bit read, and worse than that, there were additional wait-state penalties on top of that FOR EACH transaction. ROM was the same, but it had a prefetch feature that made sequential accesses a bit faster, at the cost of random ones. The platform was amazingly predictable from a timing perspective simply because it had no cache and there were no other bus masters taking time away; you owned everything, it was SRAM, and the timing was fixed (using the same ROM of course; change cartridges and the timing there may change).

I have tried to mention Michael Abrash and his Zen of Assembly Language and Zen of Graphics Programming books here and elsewhere (you can now get free epubs of both on github). Even when the Zen of Assembly came out, the 8088/86 was old news, but the concept of not assuming, of testing, and of trying to understand why some crazy experiment was faster holds true today.

Here is the bigger problem: you have folks with a genuine interest in learning assembly language for the ARM; they bought a Raspberry Pi, maybe not for that purpose, but since they have one, why not. Just teaching them the basics, say adding two numbers, you very quickly run into the immediate problem. So using ldr= is not a bad "just do this" to get them past that hurdle (because it is more portable than other solutions). Later, if they care to know, or start to care about performance and not just the joy/pain of programming in assembly language, then there are the options of doing a series of immediates or a pc-relative ldr; both have pros and cons.

It is definitely worth talking about performance, and we do in these forums. There definitely isn't a "best" solution; as you know, many times with the right experience you can create a benchmark that shows either of the solutions as being "better" from a performance perspective. Unlike the GBA, which had a lot of deterministic features, the Raspberry Pi is about as non-deterministic as it gets: beyond the normal pipeline and caches, we are using DRAM and sharing it with another processor via undocumented/unknown logic, with another possible cache in that path. Many solutions will be both good and bad at the same time depending on what else is going on at the moment.

David

DexOS
Posts: 876
Joined: Wed May 16, 2012 6:32 pm
Contact: Website

Re: Loading 32bit values into registers.

Thu Jun 05, 2014 9:29 pm

I agree with krom, you need to keep code as simple and understandable as possible when writing code for beginners.

I also believe that there are two types of coders: there are the paper bashers and then there are the hardware bashers.
They both have their advantages.
I, like krom, like to hardware bash, test and then test some more.

I have also taken a lot of stick for using that macro, to the point that I have stopped work on my RPI tuts.
I got comments like,
"why are so many people using your code when it users s**t code like that macro, i will not stop spamming you until you use my 120 line macro"
Yet no one uses his code, maybe because no one can understand it.
Batteries not included, Some assembly required.

Siekmanski
Posts: 8
Joined: Mon Apr 28, 2014 11:14 pm
Location: Netherlands

Re: Loading 32bit values into registers.

Thu Jun 05, 2014 10:16 pm

Hello Dex,

I've just started to code assembly on the Pi and have already learned a lot from krom's examples.
I'd be very thankful if I could get your RPI assembly tuts,
because I can't find them and I'm very curious and would like to study them.

DexOS
Posts: 876
Joined: Wed May 16, 2012 6:32 pm
Contact: Website

Re: Loading 32bit values into registers.

Fri Jun 06, 2014 10:55 am

Siekmanski wrote:Hello Dex,

I've just started to code assembly on the Pi and have already learned a lot from krom's examples.
I'd be very thankful if I could get your RPI assembly tuts,
because I can't find them and I'm very curious and would like to study them.
Hi Siekmanski,
PM me your email and I will send you the tuts (they're in http forum)
Batteries not included, Some assembly required.

krom
Posts: 61
Joined: Wed Dec 05, 2012 9:12 am
Contact: Website

Re: Loading 32bit values into registers.

Fri Jun 06, 2014 6:28 pm

Hi Dex,
Cheers for the kind comments; it made me sad to read that you got so annoyed by the aforementioned stuff. Great to see you on the board again =D

P.S. I do not want anyone to feel that I singled them out, especially dwelch67 & tufty; you guys have contributed so much to the scene. I was really just pushing back against the bad feeling, as a whole, surrounding this specific subject. I feel much better now after hearing everyone's responses =D

tufty
Posts: 1456
Joined: Sun Sep 11, 2011 2:32 pm

Re: Loading 32bit values into registers.

Mon Jun 09, 2014 3:28 pm

Sorry guys, been away from the intertubes for a while. Actually, I've been on a beach on the west coast of France playing with powerkites, cycling, and generally having a holiday with the family.

Anyway. It was I who made the "maybe they're not wizards after all" comment, and I hesitated about posting it at the time. I wasn't intending to denigrate anyone, more to point out that one shouldn't assume anyone else is particularly good at anything based merely on "talking the talk". That goes for me, too: I might dabble in majick, but I ain't no frickin' wizard :)

So, no offence meant.

When it comes down to it, the only way to be *really* efficient with this kind of stuff is to tie it in with a whole bunch of other, intertwined, low-level peephole optimisations, and that's beyond the scope of most assembly-level tools. Optimisers in HLL compilers might be able to carry that off, but at the assembly level it all comes down to the optimiser that sits between keyboard and chair, and that means a whole load of other skills: profiling, knowing where to spend effort to best effect, and knowing how to measure the effects of a change are as important as knowing which specific optimisations might work.
krom wrote:My Myst conversion for the GBA, which uses my video codec, was much faster at playing back the video files once I had only replaced all the ldr= statements with IMM32 & IMM16 accordingly, e.g. the video was decoding all the video frames seconds before the audio had finished playback. Also my software 3D engine was significantly faster, showing more frames per second.
Why?

krom
Posts: 61
Joined: Wed Dec 05, 2012 9:12 am
Contact: Website

Re: Loading 32bit values into registers.

Tue Jun 10, 2014 12:47 am

Hi tufty,
Sounds like you had a great time in France =D

I'll try to answer your question with two answers, as I do not know specifically what you meant by "Why?"

1. If you meant why I was able to see an increase in FPS in my video codec & 3D engine:
I was using a lame decrement counter to wait a certain amount of time between frames, to get the video FPS to sync with the sound. When I converted the code to FASMARM, the video frames played back faster, as the same decrement counter was no longer delaying the video frames enough for the faster decoding.
The 3D engine just outputs frames as fast as possible, so any code that is faster shows a higher FPS, e.g. the same 3D scene stepped through about 1024 X, Y, Z rotations completes in less time.

2. If you meant why the code ran noticeably faster on GBA HW when using my IMM macros:
I am rather embarrassed to say, but I have absolutely no idea how the code can run faster.
I was expecting the same or slower output after just changing all the ldr= statements to IMM macros.

It was on the sole basis of these tests that I decided to use the IMM macros in all my R-Pi work, as they were so easy to understand and I felt they were not detrimental to performance.

Here is my post on FASMARM where I say thanks to revolution for making FASMARM & speeding up my code (Near Middle):
http://board.flatassembler.net/topic.ph ... &start=420
It states that my video decompression was sped up by 3%... Also I just noticed this is how I first met Dex =D

Sorry if I misunderstood your question; you might be unsatisfied with my answers without any actual proof other than my word on the subject...
If you have a way of running code on real GBA HW, e.g. a flash cart, I will be very happy to provide you with a minimal single-video test that shows exactly the same results I have seen, with full source code & binaries for GOLDROAD and FASMARM respectively.

tufty
Posts: 1456
Joined: Sun Sep 11, 2011 2:32 pm

Re: Loading 32bit values into registers.

Tue Jun 10, 2014 6:21 am

krom wrote:2. If you meant why the code ran noticeably faster on GBA HW when using my IMM macros:
I am rather embarrassed to say, but I have absolutely no idea how the code can run faster.
Yeah, that's what I meant.

I assume you have (or can produce) binaries; if the only difference in source is the change from ldr= under one assembler to imm32 under fasm, it should be relatively simple to find an example of what code is being produced under both. My hypothesis would be that it's related to your imm32 macro not touching memory, combined with the GBA's lack of cache, shorter pipeline and memory map oddities, perhaps along with some "pathological" constants, leading to imm32 coming out ahead on average, or maybe even just in one or two critical cases. Difficult to tell without a simulator, though.

As you are aware, though, "is faster on the GBA" does not equate to "will be faster on ${PLATFORM}". As ever, instrumentation is king.

krom
Posts: 61
Joined: Wed Dec 05, 2012 9:12 am
Contact: Website

Re: Loading 32bit values into registers.

Tue Jun 10, 2014 7:54 am

tufty wrote:if the only difference in source is the change from ldr= under one assembler to imm32 under fasm
Yep, the only change I needed to get the code to compile in FASMARM was changing all the ldr= to imm32.
tufty wrote:As you are aware, though, "is faster on the GBA" does not equate to "will be faster on ${PLATFORM}". As ever, instrumentation is king.
Yep, you are definitely right; the R-Pi is a very different beast (but I do like to pretend sometimes that it is a GBA on steroids!)

Cheers for your hypothesis of how this could be possible; it is the best explanation I have seen & makes some sense, given how the GBA hardware is set up & how I run the code directly from fast internal work RAM (IWRAM).

timrowledge
Posts: 1270
Joined: Mon Oct 29, 2012 8:12 pm
Location: Vancouver Island
Contact: Website

Re: Loading 32bit values into registers.

Tue Jun 10, 2014 5:44 pm

A little off to the side: sometimes the braindead mov/orr/orr/orr approach is required because of other constraints.
For example, I'm currently working on the CogVM for Squeak to support Scratch on the Pi, and we have to be able to go back to some values and edit them (when relinking polymorphic inline caches, for example), and we rely on a fixed instruction format for that to work. So until and unless I manage to make time to build literal pools, having a well-defined four-instruction sequence whose layout I know precisely is the only decent technique, as sketched below.
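For illustration, here's a minimal sketch of such a patchable sequence (GNU as syntax; the register and the value 0x12345678 are purely examples). Each instruction carries one byte of the constant at a known rotation, so a relinker can find and rewrite the value in place without needing a literal pool nearby:

Code: Select all

    mov r0, #0x00000078        @ byte 0 of the constant
    orr r0, r0, #0x00005600    @ byte 1, rotated into place
    orr r0, r0, #0x00340000    @ byte 2
    orr r0, r0, #0x12000000    @ byte 3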
Making Smalltalk on ARM since 1986; making your Scratch better since 2012

dwelch67
Posts: 955
Joined: Sat May 26, 2012 5:32 pm

Re: Loading 32bit values into registers.

Tue Jun 10, 2014 5:52 pm

Depending on the non-fasmarm assembler: the current gas, for example, will optimize ldr= into a non-memory immediate (mov or mvn) if it can do it in one instruction. With the GBA, even two or three instructions should be faster than one instruction that does a load, depending on the memory. On the GBA, Thumb mode runs faster than ARM even though it takes 10-15% more instructions (based on an old experiment that is of course application dependent), and the immediate rules are more limited with Thumb. As already mentioned, though, the real answer would be to simply compare the binaries. I'm curious to know what the ldr= was encoding as and what FASMARM did differently. It could be as simple as a single instruction in an often-used tight loop making the performance difference.
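A quick sketch of that gas behaviour, easy to check by disassembling the output yourself:

Code: Select all

ldr r0, =0xFF          @ gas emits mov r0, #0xFF (the value fits as an immediate)
ldr r0, =0xFFFFFF00    @ gas emits mvn r0, #0xFF (the complement fits)
ldr r0, =0x12345678    @ nothing fits, so gas emits a pc-relative ldr and puts
                       @ 0x12345678 in a literal pool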

For those tuning in: the GBA (Nintendo GameBoy Advance) has an ARM7TDMI, which is an ARMv4T architecture (ARM and Thumb v1 instructions). There is 32K or so of 32-bit-wide memory that is relatively fast, then 256K or so of 16-bit-wide memory with a few wait states. The ROM was also 16 bits wide, but there were knobs you could turn to control the wait states, so if your cartridge/rom/flash could handle it you could boost the speed a little; if not, leave it slow. There was also a prefetch FIFO that would do burst reads: N clocks for the first 16 bits, then 1 clock per 16 bits after that, up to the size of the FIFO (pretty small); it can hurt or help you depending on your ratio of sequential to random accesses. There was no cache. The video system was relatively quite powerful.

With no cache and the various memories you could get repeatable results for performance tests (+/-1 timer clock from run to run), but by simply changing compiler options, optimizations, Thumb vs ARM, and moving code around, you could see noticeable performance differences. As mentioned in another post, it is all about your instrumentation; with the GBA, outside the video system (which was still a bit deterministic), you didn't have a lot of variables, so you could benchmark things and have a good idea why something was faster or slower. With the Raspberry Pi and other systems with DRAM, caches, and in this case another processor sharing the same resources, it is not deterministic, and you may think that one knob you turned changed performance when it might have been a coincidence; you need to prove it as best you can (with the limited instrumentation available).
