jahboater wrote: ↑Tue Dec 11, 2018 1:23 pm
DavidS wrote: ↑Tue Dec 11, 2018 12:58 pm
We still do not have a single-cycle 32-bit divide, and it takes longer in 64-bit. There are many more examples where 32-bit is faster than 64-bit.
Dividing large numbers is slower than dividing small numbers. If the numbers are the same size, then a 32-bit divide takes similar time to a 64-bit divide. I mean 42/12 will take the same time on both platforms. Obviously a 64-bit divide can deal with much larger numbers and so may potentially take longer - which is obviously not relevant.
Divide will never take one cycle on any platform, even Intel.
Not long ago we said the same thing about Multiply. Everyone believed that a single-cycle multiply was not possible without increasing propagation delay to an unacceptable level. That has been proven wrong, so I can see a time when the same is true of Divide. As it stands, implementing a single-cycle divide introduces too much propagation delay, and that is the same issue we had with multiply. The other solution, breaking a divide across multiple pipeline stages, is not acceptable because it would make the pipeline way too deep to manage performance in a sane way (optimization would be beyond even compilers of the highest caliber).
Though just because it is not done does not mean it cannot be done. And Intel is a poor example of anything, except for lackluster design.
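To make the operand-size point above concrete, here is a minimal sketch in C. It assumes a simple radix-2 shift-and-subtract divider that skips the leading zeros of the dividend; real hardware dividers use higher radices and other tricks, so the step counts are only illustrative. The name `div_steps` is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a radix-2 restoring divider with early start:
   one loop iteration stands in for one step of the divider.
   Writes the quotient to *q and returns the number of steps taken. */
unsigned div_steps(uint64_t n, uint64_t d, uint64_t *q)
{
    assert(d != 0);                     /* division by zero is not modelled */

    uint64_t quot = 0, rem = 0;
    unsigned steps = 0;
    int top = 63;

    /* Skip leading zeros of the dividend: they contribute nothing. */
    while (top >= 0 && !((n >> top) & 1))
        top--;

    for (int i = top; i >= 0; i--) {
        unsigned carry = (unsigned)(rem >> 63);   /* bit lost by the shift */
        rem = (rem << 1) | ((n >> i) & 1);        /* bring down next bit   */
        quot <<= 1;
        if (carry || rem >= d) {                  /* subtract if it fits   */
            rem -= d;
            quot |= 1;
        }
        steps++;
    }
    *q = quot;
    return steps;
}
```

In this model 42/12 takes the same number of steps whether the registers are 32 or 64 bits wide, because only the significant bits of the dividend count. A dividend that fills all 64 bits takes 64 steps, but that is a value a 32-bit divide could not handle in one instruction anyway.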
DavidS wrote: ↑Tue Dec 11, 2018 12:58 pm
So I must disagree on this issue. 32-bit rules and will until every advantage of the 32-bit ARM is matched on the 64-bit ARM, including the timing for execution of any given instruction.
You should look at the conditional instructions. The 64-bit ones have one less dependency than the 32-bit ones, and work better with modern CPUs (CSET/CSEL/CINC/CNEG/CINV etc).
So you are saying that it is lower latency to not be able to have every instruction conditional?
I would argue that, big time. That is the one thing missing from AARCH64 that will forever kill potential performance.
There are a bunch of cases where there is a huge advantage to have every instruction conditional (I know that a few of the newer instructions are not), and have the ability to specify which instructions set flags or not.
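For anyone following the conditional-instruction argument, a small sketch in C of the kind of code being discussed. The function names are made up for illustration. Compilers targeting AArch64 will typically turn selects like these into CMP plus CSEL or CINC style instructions, and compilers targeting 32-bit ARM will typically use conditionally executed instructions instead, but the exact output depends on the compiler, version and flags, so check the generated assembly rather than taking this as given.

```c
#include <stdint.h>

/* Branch-free clamp: each ?: is a candidate for a conditional select
   (AArch64) or a conditionally executed move (32-bit ARM). */
int32_t clamp_i32(int32_t x, int32_t lo, int32_t hi)
{
    int32_t t = (x < lo) ? lo : x;
    return (t > hi) ? hi : t;
}

/* Count the elements above a limit. The comparison result is added
   directly, so no branch is needed inside the loop body; a compiler
   may use a conditional increment here, or vectorise the loop. */
unsigned count_above(const int32_t *v, unsigned n, int32_t limit)
{
    unsigned c = 0;
    for (unsigned i = 0; i < n; i++)
        c += (v[i] > limit);
    return c;
}
```

Compiling the same file for both targets and comparing the assembly is the easiest way to see how each ISA handles the condition.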
LDP/STP is much much faster than LDM/STM.
That is true. Though there are other ways around that issue, using NEON (OK, it is a coprocessor, still it is standard now), and equally fast on both.
So not really an advantage in most situations, with very few exceptions.
Also, that is not an issue of the ISA but of the implementation. It would be fairly easy to make LDM/STM single cycle for any load of up to 4 registers (128 bits) without adding much to the implementation, and without increasing propagation delay in any stage of the pipeline.
Simple things like ADD take the same time even though the 64-bit version can handle much larger numbers.
That is a given. The propagation delay through the gates for the carry-lookahead is minimally different between the two lengths when done correctly.
So I stand on my argument.
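As a footnote to the carry-lookahead point, here is a minimal software model in C. It assumes a Kogge-Stone style parallel-prefix adder, which is only one of several lookahead schemes and not necessarily what any given ARM core uses. The name `prefix_add` is made up for this sketch, and `width` is assumed to be between 1 and 64.

```c
#include <stdint.h>

/* Model of a Kogge-Stone style prefix adder (no carry-in).
   Returns the sum truncated to `width` bits and writes the number of
   prefix levels used to *levels. Assumes 1 <= width <= 64. */
uint64_t prefix_add(uint64_t a, uint64_t b, unsigned width, unsigned *levels)
{
    uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    uint64_t g  = a & b & mask;      /* generate: bit creates a carry   */
    uint64_t p  = (a ^ b) & mask;    /* propagate: bit passes a carry   */
    uint64_t p0 = p;                 /* keep per-bit sum without carry  */
    unsigned n  = 0;

    /* Each pass doubles the span covered by every (g, p) pair. */
    for (unsigned s = 1; s < width; s <<= 1) {
        g |= p & (g << s);
        p &= (p << s);
        g &= mask;
        p &= mask;
        n++;
    }

    *levels = n;
    /* Bit i of g is now the carry out of bit i, so shift it up by one
       to get the carry into each position. */
    return (p0 ^ (g << 1)) & mask;
}
```

In this model a 32-bit add needs 5 prefix levels and a 64-bit add needs 6, so doubling the width adds one level rather than doubling the depth.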