BMardle
Posts: 92
Joined: Wed Feb 13, 2013 4:00 pm
Location: Isle of Wight

Assembly language: multiplication by a constant

Wed Jul 19, 2017 2:16 pm

Hi, folks.
1st up, is this the right forum to discuss assembly language, or should I be over in 'Bare Metal'?
2nd, I spent several hours at the weekend, away from my Pi-s (oh, no!), learning about doing multiplication by constants using addition/subtraction/shifting instead. Once I got back to my Pi-s, I tried a few constants on my Pi Zero and, for each, the technique was about the same speed as a MUL!
Does anyone know of any documentation of when (if ever) it's quicker to use ADD/SUB/RSB/LSL instead of MUL?
So far it looks like the weekend's reading was only of academic interest unless I stick to using them when I can do it in 1 instruction (since, unlike with MUL, I won't have to load the constant into a register).
Would Usenet be a better place to ask this? comp.sys.arm seems very quiet and a search didn't show anything about multiplication by constants.
The most recent post I could find about it in comp.compilers was from 2007 and used x86 for examples. Bleah! (It took me a while to figure out what 'LEA's were capable of.)
Any other ARM-related Usenet/Google groups?
Bruce Mardle. "You know I yearn for a simpler time of barn dances and buggy rides before life was cheapened by heartless machines."

Heater
Posts: 7936
Joined: Tue Jul 17, 2012 3:02 pm

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 3:17 pm

Using shifts and adds to multiply by a constant was pretty useful when using early micro-processors,and computers before that, that did not have a multiply instruction. Which is still the case for small devices like PIC and AVR (Arduino)

It was still useful when micro-processors did get a multiply-instruction but it took many clock cycles more than other operations.

Today, on devices like the Pi and as you have found, there is no point thinking about it.

Still, it's an interesting and fun thing to know. Your time is not wasted there.

BMardle
Posts: 92
Joined: Wed Feb 13, 2013 4:00 pm
Location: Isle of Wight

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 4:40 pm

Thanks, Heater.
Heater wrote:... Today, on devices like the Pi and as you have found, there is no point thinking about it.
Well, I think it's still worth it for some power of 2 (obviously!) , some power of 2 +1, some power of 2 -1, or -some power of 2 +1 since that's only 1 instruction and MOV rx, n; MUL ry,rz,rx would be 2.
Heater wrote:Still, it's an interesting and fun thing to know. Your time is not wasted there.
:D It was intersting, though I still only understand a fraction of the state of the art in 2007. I downloaded the source for 2 programs that take a stab at doing it: rigo.c is 1887 lines and 'synth' is 4412 lines of C++ + 1MiB of pre-computed data!
Bruce Mardle. "You know I yearn for a simpler time of barn dances and buggy rides before life was cheapened by heartless machines."

Heater
Posts: 7936
Joined: Tue Jul 17, 2012 3:02 pm

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 7:07 pm

Now I recall. Back in 1980's I applied for a job that would have involved a lot of work in assembler on 8 bit micros where maximum performance was desirable. The guy asked me how I would multiply a lot of numbers by the constant 7 on a machine with no MUL instruction.

As you say, it's just a shift up by 3 bits and subtract the original value.

I imagine that today a lot of devices can do such multiplies in a single MUL instruction much faster than shifting and adding.

Of course such micro-optimizations always depend on the details of the processor you are using.

Pre-computed data is an other story. If you have the memory space you can often trade space for speed by pre-computing a lot of stuff and keeping it in look up tables. It could be computed at run time when the program starts up but before it starts it's real work. Or computed before compilation and included in the source code.

User avatar
jojopi
Posts: 3000
Joined: Tue Oct 11, 2011 8:38 pm

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 7:38 pm

I doubt it is practical to make an integer multiply unit as fast as a shift or add, regardless of how much silicon you throw at it. Think about all the intermediate results, and the ways that the output bits depend on the bits of the inputs, and imagine trying to compute it all within a fixed number of gate delays.

Also the complexity of multiplication increases quadratically with word size, whereas addition and shifting are linear.

Manufacturers and third-parties often publish tables of pipeline latency and throughput for different instruction groups. Search for example, "Cortex-A57 Software Optimization Guide". (I cannot immediately find one specific to the cores in any model of Pi.)

Here is a "C" program that generates functions to multiply by small integer constants:

Code: Select all

#define Mn(n) long prod##n (long x) {return x * n;}
Mn(2) Mn(3) Mn(4) Mn(5) Mn(6) Mn(7) Mn(8) Mn(9)
#define Mm(m) Mn(m##0) Mn(m##1) Mn(m##2) Mn(m##3) Mn(m##4) \
              Mn(m##5) Mn(m##6) Mn(m##7) Mn(m##8) Mn(m##9)
Mm(1) Mm(2) Mm(3) Mm(4) Mm(5) Mm(6) Mm(7) Mm(8) Mm(9) Mm(10) Mm(11) Mm(12)
Compiling this with "gcc -O3 -S" gives a *.s assembly file. With the default compiler options in Rapbian, we can see that 22 is the smallest multiplier for which GCC emits an actual MUL instruction. However, with "-mtune=cortex-a53" to optimize for Pi3, it decides that 11 is also better as a MUL than as two ADD with ASL.

On x86-64, 46 is the smallest integer that GCC thinks is worth doing with IMUL instead of horrific sequences of LEA.

There is no guarantee that GCC is optimizing perfectly, of course, but I am very much inclined to believe that avoiding integer multiplication is indeed a benefit in many cases, and always will be.

jahboater
Posts: 1893
Joined: Wed Feb 04, 2015 6:38 pm

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 7:41 pm

BMardle wrote:Thanks, Heater.
Heater wrote:... Today, on devices like the Pi and as you have found, there is no point thinking about it.
Well, I think it's still worth it for some power of 2 (obviously!) , some power of 2 +1, some power of 2 -1, or -some power of 2 +1 since that's only 1 instruction and MOV rx, n; MUL ry,rz,rx would be 2.
Yes. The compiler does these tricks by the way - where its worthwhile.
Edit: sorry Jojopi cross posted.

Division is the really costly operation of course.
Heater wrote:Pre-computed data is an other story. If you have the memory space you can often trade space for speed by pre-computing a lot of stuff and keeping it in look up tables. It could be computed at run time when the program starts up but before it starts it's real work. Or computed before compilation and included in the source code.
I don't agree here. Multiply is very fast on modern CPU's (max 3 clocks ish on a recent Intel machine last time I Iooked). Whereas a table lookup may involve countless (perhaps hundreds) clocks depending if its in the cache or not. Even if its in the L1 cache it will more than 3 clocks to load I think. And on ARM you need multiple instructions just to load a value from a table.

jahboater
Posts: 1893
Joined: Wed Feb 04, 2015 6:38 pm

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 8:05 pm

For Intel and AMD CPU's these tables are very useful for instruction timings:

http://www.agner.org/optimize/instruction_tables.pdf

User avatar
Paeryn
Posts: 1674
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 8:14 pm

On the ARM11 (RPi 0 and 1) the MUL instruction takes 2 cycles to execute but as the MUL hardware is a 3-cycle pipeline you get the result available in 4 cycles (only 1 extra cycle for the 64-bit result of [SU]MULL), I don't know if they made it quicker in the A7 / A53.

So yes, a simple 2^n ±1 is quicker as a mov / add / sub with shift, but you very quickly start loosing time if you need to string several instructions together to do other constant multiplies.
She who travels light — forgot something.

BMardle
Posts: 92
Joined: Wed Feb 13, 2013 4:00 pm
Location: Isle of Wight

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 8:29 pm

jojopi wrote:On x86-64, 46 is the smallest integer that GCC thinks is worth doing with IMUL instead of horrific sequences of LEA.
Actually, 46 isn't that tricky. In ARM it's
rsb r1,r0,r0,lsl 2 @r1=3*r0
rsb r1,r0,r1,lsl 3 @r1=23*r0
lsl r1,r1,1 @r1=46*r0

I've been poking around inside gcc but not found the relevant code yet. (Don't hold your breath!)

FWIW, most of the stuff I've been reading about optimising constant multiples seems to be aimed at synthesising digital filter hardware. (Though opinions seem to differ as to what to minimise.)
Bruce Mardle. "You know I yearn for a simpler time of barn dances and buggy rides before life was cheapened by heartless machines."

BMardle
Posts: 92
Joined: Wed Feb 13, 2013 4:00 pm
Location: Isle of Wight

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 8:39 pm

Paeryn wrote:On the ARM11 (RPi 0 and 1) the MUL instruction takes 2 cycles to execute but as the MUL hardware is a 3-cycle pipeline you get the result available in 4 cycles (only 1 extra cycle for the 64-bit result of [SU]MULL), I don't know if they made it quicker in the A7 / A53...
That's interesting. I have been wondering whether pipeline register dependencies would have an effect. The algorithms I've seen for working out ADD/SUB/RSB/LSL sequences generally produce sequences which use each result in the next instruction.
So would
mov r1,<constant>
mul r1,r0,r1
<some simple instruction>
<some instruction which uses r1>
be as fast as the sequence without that 3rd instruction?
Paeryn wrote:So yes, a simple 2^n ±1 is quicker as a mov / add / sub with shift, but you very quickly start loosing time if you need to string several instructions together to do other constant multiplies.
FWIW (not much, probably!), you can multiply by any 20-bit number with at most 5 adders... except 699829, which requires 6!
Bruce Mardle. "You know I yearn for a simpler time of barn dances and buggy rides before life was cheapened by heartless machines."

Heater
Posts: 7936
Joined: Tue Jul 17, 2012 3:02 pm

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 8:58 pm

jahboater,

Quite so. I would not imagine that using a look up table for multiplication would be beneficial.

But for more complex functions it certainly can be. For example sin or log tables. Or perhaps you have a function nthPrime(n) that returns the Nth prime number for N up to a billion. A look up table, even on disk might be faster than computing it at run time.

Certainly the caches on modern processors change the game when it comes to many optimizations. For example the order in which you traverse large two dimensional arrays, row first or column first, can dramatically impact performance. If you are missing cache a lot you bog down. We never had to worry about that in pre-cache days. Thinks like linked lists can bog down as elements can be scattered around RAM and cause a lot of cache misses.

As usual, it all depends on what it is you are trying to optimize on what architecture.

Now a days I don't much get into such micro optimizations. At the machine instruction level. Getting algorithm right for the architecture and letting the compiler worry about the details is mostly plenty good enough.

For example, the last time I did this kind of thing was a few years back writing a Fast Fourier Transform for the Parallax Propeller in assembler. No multiply available so I use fixed point arithmetic to avoid having to write slow floating point routines. Multiply by shift and add. Sin and cos from pre-computed look up tables.

Turned out that the same algorithm written in C using the same fixed point format and look up tables was not much slower than my hand crafted assembler. The C version has some great advantages. Not only is it portable to other micro-controllers it can also be compiled to spread the workload over 2 or 4 cores on the Propeller chip for a speed boost. With no source code changes! That would have been much harder to pull off in assembler.

That brings us to another optimization issue. Spreading a work load around the 4 or more cores of the Pi or whatever machine. Using C and OpenMP can get you some nice speed boosts. Doing that in assembler is not something I would want to tackle.

BMardle
Posts: 92
Joined: Wed Feb 13, 2013 4:00 pm
Location: Isle of Wight

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 9:12 pm

Heater wrote:For example, the last time I did this kind of thing was a few years back writing a Fast Fourier Transform for the Parallax Propeller in assembler. No multiply available so I use fixed point arithmetic to avoid having to write slow floating point routines. Multiply by shift and add. Sin and cos from pre-computed look up tables.
Oh, yeah, I think precomputing at least 1/4 of a cycle of sin or cos is the way to go for FFTs since they're not arbitrary arguments.
I notice ARMv7 has a Reverse BITs instruction. Very handy for FFTs! I'm going to stick to writing ARMv6 code, though, so it'll run on all sorts of Pi.
I've got a book which includes a listing of a FFT in 6800 assembler :)
Sorry for wandering off topic!
Bruce Mardle. "You know I yearn for a simpler time of barn dances and buggy rides before life was cheapened by heartless machines."

BMardle
Posts: 92
Joined: Wed Feb 13, 2013 4:00 pm
Location: Isle of Wight

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 9:14 pm

BMardle wrote:I'm going to stick to writing ARMv6 code, though, so it'll run on all sorts of Pi.
But not the Archimedes A310 gathering dust in the corner of my living room!
Bruce Mardle. "You know I yearn for a simpler time of barn dances and buggy rides before life was cheapened by heartless machines."

Heater
Posts: 7936
Joined: Tue Jul 17, 2012 3:02 pm

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 10:12 pm

BMardl,
I notice ARMv7 has a Reverse BITs instruction. Very handy for FFTs!
The Parallax Propeller micro-controller has a reverse bits instruction. It's handy, saves a bit code space, but did not make a very dramatic difference in performance in the end.

After all the execution time of the bit reversal step is only proportional to n, the number of data points. Where as the execution time of the butterflies is proportional to n.log(n).

User avatar
Paeryn
Posts: 1674
Joined: Wed Nov 23, 2011 1:10 am
Location: Sheffield, England

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 10:42 pm

BMardle wrote:
Paeryn wrote:On the ARM11 (RPi 0 and 1) the MUL instruction takes 2 cycles to execute but as the MUL hardware is a 3-cycle pipeline you get the result available in 4 cycles (only 1 extra cycle for the 64-bit result of [SU]MULL), I don't know if they made it quicker in the A7 / A53...
That's interesting. I have been wondering whether pipeline register dependencies would have an effect. The algorithms I've seen for working out ADD/SUB/RSB/LSL sequences generally produce sequences which use each result in the next instruction.
So would
mov r1,<constant>
mul r1,r0,r1
<some simple instruction>
<some instruction which uses r1>
be as fast as the sequence without that 3rd instruction?
The A7/53 could possibly dual issue the instruction after the mul. Even for the ARM11 or when the other instruction can't be issued to pipeline 1 you'll still get so much of the next instruction through the early stages of the pipeline. If it needs the result of the mul then it will stall at the stage where it requires it until it is available. The problem with back to back adds using the barrel shifter is that the register is needed a stage earlier to shift it.
She who travels light — forgot something.

jahboater
Posts: 1893
Joined: Wed Feb 04, 2015 6:38 pm

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 10:44 pm

Heater wrote: But for more complex functions it certainly can be. For example sin or log tables. Or perhaps you have a function nthPrime(n) that returns the Nth prime number for N up to a billion. A look up table, even on disk might be faster than computing it at run time.
Thats what an experienced programmer does. Its all a tradeoff between speed, complexity (I find a lookup table can sometimes simplify things), program size, required resolution (sin and log you mention could only be a low resolution or it would take too much space).
Heater wrote: Now a days I don't much get into such micro optimizations. At the machine instruction level. Getting algorithm right for the architecture and letting the compiler worry about the details is mostly plenty good enough.
Yes, absolutely, always true at work in the office. But at home, programming purely for fun, fine tuning the code can be amusing!
Heater wrote:That brings us to another optimization issue. Spreading a work load around the 4 or more cores of the Pi or whatever machine. Using C and OpenMP can get you some nice speed boosts. Doing that in assembler is not something I would want to tackle.
or SIMD (NEON).

jahboater
Posts: 1893
Joined: Wed Feb 04, 2015 6:38 pm

Re: Assembly language: multiplication by a constant

Wed Jul 19, 2017 11:03 pm

Heater wrote: I notice ARMv7 has a Reverse BITs instruction. Very handy for FFTs!
I think the goal of simplicity and readability applies to assembler as well as high level languages. So I would use the RBIT instruction because its a lot easier to understand than the longhand version. Heck, I even use the loop instruction on x86 sometimes because it reads better, even though I know it is slow. So apart from using shift, or perhaps add (for *2), I would just stick with MUL - its simpler which is always a good thing.

BMardle
Posts: 92
Joined: Wed Feb 13, 2013 4:00 pm
Location: Isle of Wight

Re: Assembly language: multiplication by a constant

Thu Jul 20, 2017 12:25 am

jahboater wrote:
Heater wrote: I notice ARMv7 has a Reverse BITs instruction. Very handy for FFTs!
I think the goal of simplicity and readability applies to assembler as well as high level languages. So I would use the RBIT instruction because its a lot easier to understand than the longhand version. Heck, I even use the loop instruction on x86 sometimes because it reads better, even though I know it is slow. So apart from using shift, or perhaps add (for *2), I would just stick with MUL - its simpler which is always a good thing.
Actually, that's me you're quoting, but, as Heater pointed out, for a 2^n-point FFT, only 1 pass benefits from RBIT and there are n others (or n-1 if you're assuming real data in).

I started this thread because I'm considering writing a (relatively) simple compiler, just for fun. The readability of the object code won't be important.
Bruce Mardle. "You know I yearn for a simpler time of barn dances and buggy rides before life was cheapened by heartless machines."

BMardle
Posts: 92
Joined: Wed Feb 13, 2013 4:00 pm
Location: Isle of Wight

Re: Assembly language: multiplication by a constant

Thu Jul 20, 2017 3:48 am

FWIW, in gcc 4.9.2, the determination of how to multiply by a constant seems to be in gcc/expmed.c, most in synth_mult? A mere 479 lines, so it can't be very sophisticated :D
Bruce Mardle. "You know I yearn for a simpler time of barn dances and buggy rides before life was cheapened by heartless machines."

Heater
Posts: 7936
Joined: Tue Jul 17, 2012 3:02 pm

Re: Assembly language: multiplication by a constant

Thu Jul 20, 2017 4:13 am

Write a compiler hey. OK, now we are talking.

Some years ago, after having been programming in high level languages and assembler for a couple of decades I decided it was time I found out how these magic compiler things turn the one into the other.

I managed to write a compiler for a simple Pascal like language that generated Intel x86 or Parallax Propeller assembler. I wrote it from scratch in C. No lex or bison nonsense.

It was terribly inefficient. I paid no regard to code size or execution speed. I was just amazed that I had managed to do it at all!

Writing a compiler is a great thing to do. There is a lot to learn on the way.

Real compiler writers of course need to know their target machine architecture, and implementation, very well in order to produce optimal code.

jahboater
Posts: 1893
Joined: Wed Feb 04, 2015 6:38 pm

Re: Assembly language: multiplication by a constant

Thu Jul 20, 2017 8:19 am

Interesting ...
BMardle wrote:FWIW, in gcc 4.9.2, the determination of how to multiply by a constant seems to be in gcc/expmed.c, most in synth_mult? A mere 479 lines, so it can't be very sophisticated :D
GCC is now version 7.1 by the way, but that function doesn't seem to have changed a huge amount.

I suppose "lea" counts as a "cheap shift and add instruction"?

BMardle
Posts: 92
Joined: Wed Feb 13, 2013 4:00 pm
Location: Isle of Wight

Re: Assembly language: multiplication by a constant

Thu Jul 20, 2017 11:46 am

jahboater wrote:GCC is now version 7.1 by the way, but that function doesn't seem to have changed a huge amount.

I suppose "lea" counts as a "cheap shift and add instruction"?
Crikey! I don't know why I have the source for 5.9.2 here. Just did a `gcc -v` in Cygwin and it's 5.4.0!

I think LEA is the x86 equivalent of add rx,ry,rz,lsl n but n can only be 0, 1, 2 or 3.
Bruce Mardle. "You know I yearn for a simpler time of barn dances and buggy rides before life was cheapened by heartless machines."

jahboater
Posts: 1893
Joined: Wed Feb 04, 2015 6:38 pm

Re: Assembly language: multiplication by a constant

Thu Jul 20, 2017 12:05 pm

BMardle wrote: I think LEA is the x86 equivalent of add rx,ry,rz,lsl n but n can only be 0, 1, 2 or 3.
Yes it is, but I believe its done by the very fast address generation hardware.
I was curious as to how gcc works it all out.

You can get the source of GCC 7.1 from:-
ftp://ftp.mirrorservice.org/sites/sourc ... gcc-7.1.0/

BMardle
Posts: 92
Joined: Wed Feb 13, 2013 4:00 pm
Location: Isle of Wight

Re: Assembly language: multiplication by a constant

Sun Jul 23, 2017 3:09 pm

I've been trying some codes on actual, factual Raspberry PIs. It seems that on a Pi Zero, 2 ADD/SUB/RSB/LSLs are quicker than a MUL. On a Pi 2, 1 instruction is faster but I'm not sure about 2.

Rambling off-topic, Paeryn was right about the latency of MULs. This:

Code: Select all

	ldr	r1, =100000000
	mov	r3,0
	mov	r0,10
	mov	r5,0
loop:	mul	r2,r1,r0
	add	r3,r3,r5
	sub	r1, r1, 1
	mul	r5,r1,r0
	add	r3,r3,r2
	subs	r1, r1, 1
	bne	loop
	add	r3,r3,r5
	mov	pc, lr
is 30% faster than the code with the r2 and r5 swapped at the end of the ADDs (and the ADD after the loop deleted). I.E. Do something else while the Pi (Zero or 2) munches on the MUL!
Bruce Mardle. "You know I yearn for a simpler time of barn dances and buggy rides before life was cheapened by heartless machines."

Return to “Other languages”

Who is online

Users browsing this forum: No registered users and 2 guests