bstrobl
Posts: 97
Joined: Wed Jun 04, 2014 8:31 pm
Location: Germany

Re: Raspbian Jessie (64bit) for RPi3?

Sun Apr 16, 2017 5:19 pm

Mr.Scoville wrote: Yet, I want to dare your statement. Why? Because: KISS. The single most important rule each and every software developer-to become has not just to learn, but to digest, to get assembl... assimilated by before they call themselves software developers. Simply put: If you need 1 bit, use 1 bit. If you need 4 bits, use 4 bits. If you need 128 bits, use 128 bits. In other words: If the stuff you've learned doesn't solve your current problem, don't try to hunt the beast down and bash it with what you've learned until it stops twitching, but adjust and refine your knowledge. To get the state of a simple button, you need exactly 1 bit. You won't even need an Arduino Nano to turn a darn bulb on.
I, on the other hand, would say that KISS in this case means everyone using the same architecture. Developer time tends to matter more than cheap hardware. Also, having all software written for one ISA makes for more robust code and better availability to others. You may have a fancy 8-bit processor, but there is very little value in it if you hit its limits prematurely during development. You could argue that your spec is fixed, but that rarely happens, and feature creep tends to be pervasive. If it turns out that your software would be useful on a wider machine, you now have to put in the work of porting it and, worst case, maintaining both versions.
Mr.Scoville wrote: Following the KISS principle, I am absolutely aware of the point that the less tools you need to solve a problem, the better the solution will be (kind of Ockhams' Principle, if you wish). But that doesn't mean that more bits are more effective. If you stick to the KISS principle (and you ought to!), there is no "eierlegende Wollmilchsau" (a German figure of speech, kind of the Swiss Army knife). Reinventing the wheel has always been a waste of time and energy. But reducing your real needs to exactly match the problem has always been a good idea.
Ironically, in the world of technology I see more Swiss Army knives than ever before, and they make things simpler :lol: . Just look at smartphones, which can replace your old cellphone, iPod, point-and-shoot camera, PDA and more with just one device. The same goes for standards such as USB, UTF-8 etc. Granted, the implementation might be a lot more complex, but there are massive advantages to be gained from a more complex integration of simpler components. The bcm2835 contains a decent amount of stuff that historically would have been separated out; it is usually no longer cost-effective to do so. Even simple microcontrollers frequently come with small amounts of RAM, flash and advanced IO, often ditching old 8-bit cores for newer ARM cores. Old chips can quickly become obsolete, especially if the newer ones (which may be more expensive) can be sold to a larger market (economies of scale and a reduced component count due to integration can offset a higher per-chip cost). The bcm2837 replaced the bcm2836 after just one year, for example. Customers may not use all of the newer chip's functionality or "bitness", but that's OK. More common software support also makes it less likely that new pieces have to be reinvented just for a specific application. A system cut to the perfect size may look more elegant for an application, but it is more likely to exhibit problems, since no one has encountered similar issues before and fixed them.
Mr.Scoville wrote: I am absolutely certain that, for some hands full of bucks, many companies would be glad to give the 64 Bit demanders what they want: AArch64, NANO, whatnot. But to those of you who are like "I want it now, and I want it for free!"
On that I can agree, things do take time. Problem is, sitting too long on older architectures has given competitors the chance to take over. I don't see that happening anytime soon with the Pi due to its excellent community and mainline support but you never know.

Fact is, the bcm2835 has been upgraded twice before, to the bcm2836 and bcm2837. Upgrading the bcm2835 with a Cortex-A35 core for low power usage should be doable, even though it won't be cheap. However, it would bring processor capabilities (including USB & network boot) in line with its bigger siblings and allow eventual optimization for one ISA, freeing development resources for other things. The good news is that AArch64 is likely to stay around for a decent chunk of time.

Mr.Scoville
Posts: 4
Joined: Sat Apr 15, 2017 3:16 pm

Re: Raspbian Jessie (64bit) for RPi3?

Mon Apr 17, 2017 4:22 pm

Hey! If they come across with kind of a 2^10^10-bit Gazillion-core-Peta^n-Hertz board including the mysterious De-Lancy QBits, some warp engines, a supersonic shower and a replicator, I'll be the first to order, even if it'd shimmer Borg'ish green. At least if it's around 50 US-$.

Still, they are not green, nor is our current technology close to becoming... usable. So, just to get simple and stupid again, as the KISS principle tells us to be... Do you really believe in 64 bits being the solution for turning light bulbs on and off? Or could, just perhaps, a simple switch solve the same problem? No silicon involved.

Have you ever wondered why cars get excessively more expensive? Because inventions done for expensive cars have to become worth it. Invent some fancy idea for a luxury car model, that's nice, and customers will be happy when paying 100k $ for their from-A-to-B-platform. Inventions and implementations are expensive.

You've been aware of the infamous "Wintel" discussion, haven't you? Intel needed a reason to publish a new, faster core, and Microsoft needed a reason to publish a new OS. Well done, both of you, you've finally managed it to the Linux community!

There are exactly three real reasons to push computing power further up:
1st - Mathematical and scientific demands, like simulations of galaxy collisions, or crash tests, or the weather
2nd - Gaming
3rd - Boasting

From a well-educated student, by contrast, I simply expect that they are able to bend their minds from the ivory tower down to a real-world requirement.

jahboater
Posts: 4759
Joined: Wed Feb 04, 2015 6:38 pm

Re: Raspbian Jessie (64bit) for RPi3?

Mon Apr 17, 2017 5:58 pm

Mr.Scoville wrote: So, just to get simple and stupid again, as the KISS principle tells us to be... Do you really believe in 64 bits being the solution for turning light bulbs on and off?
Someone in another thread is struggling with 64-bit multiplication in assembler.
The 32-bit compiler produces:

Code:

mymul:
    mul   r3, r0, r3  
    mla   r3, r2, r1, r3  
    umull r0, r1, r0, r2  
    add   r1, r3, r1 
    bx    lr 
Not simple, I would say (can you understand it?). The 64-bit version is:

Code:

mymul:
    mul x0, x0, x1 
    ret
KISS!!
Even the function return is easier to read.

Yes, I know it's a contrived example!
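
For readers who cannot follow the 32-bit listing: it builds a 64-bit product out of 32-bit pieces. The C++ below is a minimal sketch of that decomposition, assuming the listings came from a plain 64-bit multiply (the thread does not show the original source). The names `mymul_halves` and `check_mymul` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Plain 64-bit multiply: on AArch64 this can be a single mul instruction.
std::uint64_t mymul(std::uint64_t a, std::uint64_t b) {
    return a * b;
}

// The same product assembled from 32-bit halves, step by step like the
// 32-bit listing (hypothetical helper, not from the thread).
std::uint64_t mymul_halves(std::uint64_t a, std::uint64_t b) {
    const std::uint32_t a_lo = static_cast<std::uint32_t>(a);
    const std::uint32_t a_hi = static_cast<std::uint32_t>(a >> 32);
    const std::uint32_t b_lo = static_cast<std::uint32_t>(b);
    const std::uint32_t b_hi = static_cast<std::uint32_t>(b >> 32);

    // mul + mla: the cross products only matter modulo 2^32.
    const std::uint32_t cross = a_lo * b_hi + a_hi * b_lo;
    // umull: full 32x32 -> 64 product of the low halves.
    const std::uint64_t low = static_cast<std::uint64_t>(a_lo) * b_lo;
    // add: fold the cross terms into the high word.
    const std::uint32_t hi = static_cast<std::uint32_t>(low >> 32) + cross;

    return (static_cast<std::uint64_t>(hi) << 32) |
           static_cast<std::uint32_t>(low);
}

// Both routes must give the same result modulo 2^64.
void check_mymul(std::uint64_t a, std::uint64_t b) {
    assert(mymul_halves(a, b) == mymul(a, b));
}
```

Calling `check_mymul` with a few operand pairs is a quick way to convince yourself that the four-instruction sequence and the single `mul` compute the same thing.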

bstrobl
Posts: 97
Joined: Wed Jun 04, 2014 8:31 pm
Location: Germany

Re: Raspbian Jessie (64bit) for RPi3?

Tue Apr 18, 2017 8:41 am

Mr.Scoville wrote: Do you really believe in 64 bits being the solution for turning light bulbs on and off? Or could, just perhaps, a simple switch solve the same problem? No silicon involved.
If you are using any form of CPU for your light bulbs (other than the ones in the bulbs themselves), it's most likely going to be for home automation (Internet of Things, which many also deem the Pi usable for). That means you will want a 64-bit processor for a very long support schedule from the Linux kernel. After all, having your Pi join a botnet once 32-bit Linux support is dropped is less than ideal. Swapping out the Pi by that time may also be a rather big hassle once it's built into the system.

Disclaimer: I don't use IoT devices exactly for this reason. Manufacturers don't care about supporting their devices past the first sale. The fact that most of the stuff is then based on custom silicon or board designs makes it difficult to fix yourself.

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 23867
Joined: Sat Jul 30, 2011 7:41 pm

Re: Raspbian Jessie (64bit) for RPi3?

Tue Apr 18, 2017 10:36 am

Worth repeating: the amount of assembler I have had to write in 30 years as a SW engineer is pretty small.

The huge majority of code is written in a high level language.

Learn C, C++, Python etc. All standard and platform independent.

That is the consistent architecture. Not assembler. That is ALWAYS going to keep changing.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed. Here's an example...
“I think it’s wrong that only one company makes the game Monopoly.” – Steven Wright

rouan
Posts: 1
Joined: Sun Sep 10, 2017 7:35 pm

Re: Raspbian Jessie (64bit) for RPi3?

Sun Sep 10, 2017 7:38 pm

I think it's worth going through the effort to build a 64-bit version, for the simple reason that future Raspberry Pis should have enough RAM to make 64-bit worthwhile. At the moment 1GB is quite limiting, and progress marches on.

bensimmo
Posts: 4182
Joined: Sun Dec 28, 2014 3:02 pm
Location: East Yorkshire

Re: Raspbian Jessie (64bit) for RPi3?

Sun Sep 10, 2017 8:03 pm

tldr/dk

Which bits does Raspberry Pi need to convert to a 64-bit setup to make a 64-bit Raspbian?
If 64-bit Debian is out there...
and someone can easily add the 'RP-desktop' customisations and the majority of the functionality (do 32-bit programs work on 64-bit in the Debian/ARM world, as they do on Microsoft Windows?),
which bits would Raspberry Pi themselves still need to provide?

acwest
Posts: 1
Joined: Mon Sep 11, 2017 1:16 am

Re: Raspbian Jessie (64bit) for RPi3?

Mon Sep 11, 2017 1:19 am

As far as I have been able to tell, everything except the tools in /opt/vc is already available, for non-raspbian distributions, at least. I don't know if the raspbian kernel itself has been done yet, although I believe it is all doable, I haven't checked recently...

ClemS
Posts: 3
Joined: Sun Oct 08, 2017 12:54 pm

Re: Raspbian Jessie (64bit) for RPi3?

Sun Oct 08, 2017 1:35 pm

jamesh wrote:
Tue Apr 18, 2017 10:36 am
Learn C, C++, Python etc. All standard and platform independent.

That is the consistent architecture. Not assembler.
Hi,

I just bought a Raspberry Pi 3 (my first one) to host a game server.
This server will run several algorithms that use 64-bit arithmetic (hashing, ciphering and a lot more...), and I wondered how the Raspberry Pi performs on them. So I started with a simple SHA-256 vs SHA-512 test.

Computing 1 million SHA-256 of a 56-byte message took 5.5 seconds. (SHA-256 is a 32-bit algorithm.)
Computing 1 million SHA-512 of a 112-byte message took 19.3 seconds. (SHA-512 is a 64-bit algorithm.)

...then I realized that Raspbian is a 32-bit OS (on a 64-bit processor). During my search for a 64-bit Raspbian, I found this thread, which left me very disappointed. :(

I finally found a 64-bit Debian version for the Raspberry Pi 3 (https://github.com/bamarni/pi64). :D
Now the results are:

Computing 1 million SHA-256 of a 56-byte message took 4.9 seconds. 10.9% less time than before.
Computing 1 million SHA-512 of a 112-byte message took 6.7 seconds. 65.3% less time than before.
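
The two percentages quoted for the 64-bit run are reductions in elapsed time, which is not the same number as a speed ratio. A small sketch of the arithmetic, using only the four timings from this post (the function names are made up for illustration):

```cpp
#include <cassert>

// Percentage of elapsed time saved going from `before` to `after` seconds.
double time_saved_percent(double before, double after) {
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}

// How many times faster the new run is (ratio of elapsed times).
double speedup_factor(double before, double after) {
    assert(after > 0.0);
    return before / after;
}
```

With the numbers above, `time_saved_percent(19.3, 6.7)` is about 65.3 and `speedup_factor(19.3, 6.7)` is about 2.9, while the SHA-256 run works out to roughly 10.9% less time, or a factor of about 1.12.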

This test was written in C++ (NOT IN ASSEMBLER) and built with clang++ 3.8.
It demonstrates that a 64-bit OS can make some code almost 3 times faster (19.3 s vs 6.7 s)! (The assembler code shown above explains why.)
64-bit is NOT ONLY about memory addressing! (I have seen too many "Don't care, the RPi 3 has only 1GB RAM." comments.)

Thanks for reading...
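
One reason SHA-512 suffers so much on a 32-bit build is that it works on 64-bit words throughout, and every 64-bit rotate has to be stitched together from two 32-bit registers. The sketch below illustrates this under stated assumptions: `rotr64_halves` is a hypothetical helper showing roughly the work a 32-bit target has to do, not the code clang actually emitted for the benchmark. The rotation amounts in `big_sigma0` are those of the SHA-512 Σ0 function.

```cpp
#include <cassert>
#include <cstdint>

// 64-bit rotate right, valid for 0 < n < 64. A single instruction on AArch64.
std::uint64_t rotr64(std::uint64_t x, unsigned n) {
    assert(n > 0 && n < 64);
    return (x >> n) | (x << (64 - n));
}

// The same rotate using only 32-bit values (hypothetical illustration).
std::uint64_t rotr64_halves(std::uint64_t x, unsigned n) {
    assert(n > 0 && n < 64);
    std::uint32_t lo = static_cast<std::uint32_t>(x);
    std::uint32_t hi = static_cast<std::uint32_t>(x >> 32);
    if (n >= 32) {  // rotating by 32 just swaps the halves
        const std::uint32_t t = lo;
        lo = hi;
        hi = t;
        n -= 32;
    }
    if (n != 0) {   // each half takes bits from both halves
        const std::uint32_t new_lo = (lo >> n) | (hi << (32 - n));
        const std::uint32_t new_hi = (hi >> n) | (lo << (32 - n));
        lo = new_lo;
        hi = new_hi;
    }
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// SHA-512 "big sigma 0": three 64-bit rotates XORed together.
std::uint64_t big_sigma0(std::uint64_t x) {
    return rotr64(x, 28) ^ rotr64(x, 34) ^ rotr64(x, 39);
}
```

SHA-256 uses the same structure with 32-bit words, which fits with it gaining far less from the 64-bit build.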

jahboater
Posts: 4759
Joined: Wed Feb 04, 2015 6:38 pm

Re: Raspbian Jessie (64bit) for RPi3?

Sun Oct 08, 2017 11:55 pm

ClemS wrote:
Sun Oct 08, 2017 1:35 pm
Computing 1 million SHA-256 of a 56-byte message took 4.9 seconds. 10.9% less time than before.
Computing 1 million SHA-512 of a 112-byte message took 6.7 seconds. 65.3% less time than before.

This test was written in C++ (NOT IN ASSEMBLER) and built with clang++ 3.8.
It demonstrates that a 64-bit OS can make some code almost 3 times faster (19.3 s vs 6.7 s)! (The assembler code shown above explains why.)
64-bit is NOT ONLY about memory addressing! (I have seen too many "Don't care, the RPi 3 has only 1GB RAM." comments.)
And I believe AArch64 has special instructions for computing hash functions, which would make it a couple of orders of magnitude faster if used.

Interesting that the 32-bit code is also faster, though I guess there are many reasons for that.
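
The hash instructions belong to the ARMv8 Cryptography Extensions, which are optional, so a given core may or may not implement them. On AArch64 Linux a program can ask the kernel before relying on them. This is a minimal sketch assuming a Linux host with glibc; `cpu_has_sha256_instructions` is a made-up name, and on any other platform it simply reports false.

```cpp
#include <cassert>
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_SHA2 on arm64 Linux
#endif

// True if the kernel reports the SHA-256 instructions as available.
bool cpu_has_sha256_instructions() {
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_SHA2)
    return (getauxval(AT_HWCAP) & HWCAP_SHA2) != 0;
#else
    return false;  // not AArch64 Linux: nothing to report
#endif
}
```

The same information appears as the `sha2` flag in the Features line of `/proc/cpuinfo` on a 64-bit kernel.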

ClemS
Posts: 3
Joined: Sun Oct 08, 2017 12:54 pm

Re: Raspbian Jessie (64bit) for RPi3?

Tue Oct 10, 2017 10:14 pm

jahboater wrote:
Sun Oct 08, 2017 11:55 pm
And I believe AArch64 has special instructions for computing hash functions, which would make it a couple of orders of magnitude faster if used.

Interesting that the 32-bit code is also faster, though I guess there are many reasons for that.
Hello,

This is a relevant question.

The 10% time gained with the 32-bit SHA algorithm can be explained easily:
1/ The digest table needs to be initialized with some default values, generally using a memcpy from a constant pre-initialized buffer. The memcpy will use a 64-bit copy in 64-bit mode... So about 2 times faster than a 32-bit copy. ;)
2/ The hash is performed by computing successive blocks of N bytes. For the last block you need to pad memory with zeros. The memset will use the same 64-bit optimisation.
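
The two steps above can be sketched as follows. This is an illustration only, not the code used in the benchmark: the struct and function names are made up, the initial values are the standard SHA-256 constants, and the 64-bit length field that real padding also writes is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// SHA-256 initial hash values.
static const std::uint32_t kSha256Init[8] = {
    0x6a09e667u, 0xbb67ae85u, 0x3c6ef372u, 0xa54ff53au,
    0x510e527fu, 0x9b05688cu, 0x1f83d9abu, 0x5be0cd19u};

struct Sha256State {
    std::uint32_t h[8];
};

// Step 1/: seed the digest table with one bulk copy (32 bytes).
void sha256_init(Sha256State& s) {
    std::memcpy(s.h, kSha256Init, sizeof kSha256Init);
}

// Step 2/: after `used` message bytes, write the 0x80 marker and zero the
// rest of the 64-byte block. The message length field is not handled here.
void sha256_pad_block(std::uint8_t (&block)[64], std::size_t used) {
    assert(used < 64);
    block[used] = 0x80;
    std::memset(block + used + 1, 0, 64 - used - 1);
}
```

Both calls move only a few dozen bytes per hash, so how much they contribute to the measured 10% is something a profiler would have to confirm.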

Ok, now let's consider some kind of basic code that is used everywhere in every application... (not just in a specific hash function):

Code:

std::string foo = "My name is Bond...";
foo += " James Bond!";
On the second line, the std::string will need to reallocate its internal buffer to be able to append the new content. Once reallocated, it will copy the previous buffer content to the new buffer by performing a 64-bit copy. Doing so saves a lot of CPU time compared with a slower 32-bit copy.
And the same happens when copying or initializing structures, etc...

Conclusion: even if we don't expect building 32-bit code in 64-bit mode to affect our code's speed... it will. :D
That's one reason why several applications offer both a 32-bit release and a 64-bit one.

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 23867
Joined: Sat Jul 30, 2011 7:41 pm

Re: Raspbian Jessie (64bit) for RPi3?

Wed Oct 11, 2017 8:51 am

ClemS wrote:
Tue Oct 10, 2017 10:14 pm
Ok, now let's consider some kind of basic code that is used everywhere in every application... (not just in a specific hash function):

Code:

std::string foo = "My name is Bond...";
foo += " James Bond!";
On the second line, the std::string will need to reallocate its internal buffer to be able to append the new content. Once reallocated, it will copy the previous buffer content to the new buffer by performing a 64-bit copy. Doing so saves a lot of CPU time compared with a slower 32-bit copy.
And the same happens when copying or initializing structures, etc...
Not commenting on the 32 vs 64 argument, but worth noting that the string handling in the STL is a bit smarter than that. Generally, the default buffer allocated is larger than the string initially in it, so as long as the additional string fits in the buffer, there will be no reallocation. You can also reserve the size of the underlying buffer up front (std::string::reserve), so if you know the final length in advance, you can prevent a LOT of reallocation.
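
A minimal sketch of the reservation idea, using the Bond strings from the quoted example. The helper name `append_moves_buffer` is made up; it reports whether the second append had to move the character data, which is what a reallocation looks like from the outside.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds the greeting after reserving `reserve_chars` characters and reports
// whether the second append moved the string's character data.
bool append_moves_buffer(std::size_t reserve_chars) {
    std::string foo;
    foo.reserve(reserve_chars);        // size the buffer up front
    foo += "My name is Bond...";       // 18 characters
    const char* before = foo.data();
    foo += " James Bond!";             // 12 more, 30 in total
    return foo.data() != before;
}
```

With `append_moves_buffer(64)` the 30 characters fit in the reserved buffer, so no common implementation needs to move the data. With a reservation of 0 the answer depends on the library's growth policy, so it is worth measuring rather than assuming.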

Also applicable in other areas as well, so the speed up benefit may not be as great as naive code inspection may indicate, if the coder was half decent. It all comes down to writing decent code in the first place, which will reduce the difference between 32 and 64 bit code.

Of course, real speed comes with NEON!

Heater
Posts: 13592
Joined: Tue Jul 17, 2012 3:02 pm

Re: Raspbian Jessie (64bit) for RPi3?

Wed Oct 11, 2017 9:21 am

ClemS,
On the second line, the std::string will need to reallocate its internal buffer to be able to append the new content. Once reallocated, it will copy the previous buffer content to the new buffer by performing a 64-bit copy. Doing so saves a lot of CPU time compared with a slower 32-bit copy.
Unlikely.

As James points out the STL will quite likely have allocated a bigger buffer than needed to start with so no allocation is required.

I also doubt that a 64 bit copy takes place: " James Bond!" is 13 chars including the termination zero. That could be 8 chars moved as a 64 bit word, 4 chars moved as a 32 bit word and one extra byte to move. Quite likely any faffing about to figure all that out would be more code and slower than just copying byte by byte in a loop. I'd wager the compiler/library does not do that.

But what if there is an allocation? Then this code becomes hundreds of times slower.

What if the strings are not sitting in your cache memory? Then it can become hundreds of times slower to get them from RAM or perhaps indefinitely longer if they are in swap space.

All in all, the actual copy there is the least of the time taken. Using 64 bit does not help.
Memory in C++ is a leaky abstraction.

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 23867
Joined: Sat Jul 30, 2011 7:41 pm

Re: Raspbian Jessie (64bit) for RPi3?

Wed Oct 11, 2017 9:39 am

Heater wrote:
Wed Oct 11, 2017 9:21 am
ClemS,
On the second line, the std::string will need to reallocate its internal buffer to be able to append the new content. Once reallocated, it will copy the previous buffer content to the new buffer by performing a 64-bit copy. Doing so saves a lot of CPU time compared with a slower 32-bit copy.
Unlikely.

As James points out the STL will quite likely have allocated a bigger buffer than needed to start with so no allocation is required.

I also doubt that a 64 bit copy takes place: " James Bond!" is 13 chars including the termination zero. That could be 8 chars moved as a 64 bit word, 4 chars moved as a 32 bit word and one extra byte to move. Quite likely any faffing about to figure all that out would be more code and slower than just copying byte by byte in a loop. I'd wager the compiler/library does not do that.

But what if there is an allocation? Then this code becomes hundreds of times slower.

What if the strings are not sitting in your cache memory? Then it can become hundreds of times slower to get them from RAM or perhaps indefinitely longer if they are in swap space.

All in all, the actual copy there is the least of the time taken. Using 64 bit does not help.
It's a good point that reallocation time will dwarf the copy time. However, memcpy is indeed something that would be faster on 64-bit, but in an unpredictable way. For example, as memcpys get larger, the underlying OS libraries may well decide to use DMA, NEON or similar, as the setup costs are outweighed by the extra speed. Caching throws another variable into the mix. It's all quite complicated, which is why simply saying that 64-bit will be faster overall is difficult to show. I'm sure it is in many cases, and isn't in many others.
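
Since the effect is hard to predict, the practical answer is to measure it on the target. Below is a minimal timing sketch, assuming a hosted C++11 toolchain; `ns_per_copy` is a made-up helper, and a serious benchmark would also control for cache state, CPU frequency and compiler optimisation.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Average nanoseconds per memcpy of `size` bytes over `reps` repetitions.
double ns_per_copy(std::size_t size, int reps) {
    assert(size > 0 && reps > 0);
    std::vector<unsigned char> src(size, 0x5A), dst(size, 0);
    volatile unsigned char sink = 0;  // keeps the copies observable
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        src[0] = static_cast<unsigned char>(i);  // vary the input each round
        std::memcpy(dst.data(), src.data(), size);
        sink = dst[size - 1];
    }
    const auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / reps;
}
```

Running it for sizes such as 16, 64, 4096 and 1048576 bytes on both a 32-bit and a 64-bit build of the same system gives a first impression of where the two actually differ.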

jahboater
Posts: 4759
Joined: Wed Feb 04, 2015 6:38 pm

Re: Raspbian Jessie (64bit) for RPi3?

Wed Oct 11, 2017 10:22 am

This is the source of glibc memcpy for aarch64.
The main loop at the end just does ldp/stp copying two 64-bit registers each time (4 times to get a 64 byte copy - probably to match cache lines). (ldp is load pair).

I am surprised it doesn't use NEON Q registers which are 16 bytes each:
ldp q0,q1,[%1],32; stp q0,q1,[%0],32
copies 32 bytes and updates both pointers.
Very fast indeed last time I benchmarked it.

Code:

#include <sysdep.h>

/* Assumptions:
 *
 * ARMv8-a, AArch64, unaligned accesses.
 *
 */

#define dstin	x0
#define src	x1
#define count	x2
#define dst	x3
#define srcend	x4
#define dstend	x5
#define A_l	x6
#define A_lw	w6
#define A_h	x7
#define A_hw	w7
#define B_l	x8
#define B_lw	w8
#define B_h	x9
#define C_l	x10
#define C_h	x11
#define D_l	x12
#define D_h	x13
#define E_l	src
#define E_h	count
#define F_l	srcend
#define F_h	dst
#define G_l	count
#define G_h	dst
#define tmp1	x14

/* Copies are split into 3 main cases: small copies of up to 16 bytes,
   medium copies of 17..96 bytes which are fully unrolled. Large copies
   of more than 96 bytes align the destination and use an unrolled loop
   processing 64 bytes per iteration.
   In order to share code with memmove, small and medium copies read all
   data before writing, allowing any kind of overlap. So small, medium
   and large backwards memmoves are handled by falling through into memcpy.
   Overlapping large forward memmoves use a loop that copies backwards.
*/

#ifndef MEMMOVE
# define MEMMOVE memmove
#endif
#ifndef MEMCPY
# define MEMCPY memcpy
#endif

ENTRY_ALIGN (MEMMOVE, 6)

	DELOUSE (0)
	DELOUSE (1)
	DELOUSE (2)

	sub	tmp1, dstin, src
	cmp	count, 96
	ccmp	tmp1, count, 2, hi
	b.lo	L(move_long)

	/* Common case falls through into memcpy.  */
END (MEMMOVE)
libc_hidden_builtin_def (MEMMOVE)
ENTRY (MEMCPY)

	DELOUSE (0)
	DELOUSE (1)
	DELOUSE (2)

	prfm	PLDL1KEEP, [src]
	add	srcend, src, count
	add	dstend, dstin, count
	cmp	count, 16
	b.ls	L(copy16)
	cmp	count, 96
	b.hi	L(copy_long)

	/* Medium copies: 17..96 bytes.  */
	sub	tmp1, count, 1
	ldp	A_l, A_h, [src]
	tbnz	tmp1, 6, L(copy96)
	ldp	D_l, D_h, [srcend, -16]
	tbz	tmp1, 5, 1f
	ldp	B_l, B_h, [src, 16]
	ldp	C_l, C_h, [srcend, -32]
	stp	B_l, B_h, [dstin, 16]
	stp	C_l, C_h, [dstend, -32]
1:
	stp	A_l, A_h, [dstin]
	stp	D_l, D_h, [dstend, -16]
	ret

	.p2align 4
	/* Small copies: 0..16 bytes.  */
L(copy16):
	cmp	count, 8
	b.lo	1f
	ldr	A_l, [src]
	ldr	A_h, [srcend, -8]
	str	A_l, [dstin]
	str	A_h, [dstend, -8]
	ret
	.p2align 4
1:
	tbz	count, 2, 1f
	ldr	A_lw, [src]
	ldr	A_hw, [srcend, -4]
	str	A_lw, [dstin]
	str	A_hw, [dstend, -4]
	ret

	/* Copy 0..3 bytes.  Use a branchless sequence that copies the same
	   byte 3 times if count==1, or the 2nd byte twice if count==2.  */
1:
	cbz	count, 2f
	lsr	tmp1, count, 1
	ldrb	A_lw, [src]
	ldrb	A_hw, [srcend, -1]
	ldrb	B_lw, [src, tmp1]
	strb	A_lw, [dstin]
	strb	B_lw, [dstin, tmp1]
	strb	A_hw, [dstend, -1]
2:	ret

	.p2align 4
	/* Copy 64..96 bytes.  Copy 64 bytes from the start and
	   32 bytes from the end.  */
L(copy96):
	ldp	B_l, B_h, [src, 16]
	ldp	C_l, C_h, [src, 32]
	ldp	D_l, D_h, [src, 48]
	ldp	E_l, E_h, [srcend, -32]
	ldp	F_l, F_h, [srcend, -16]
	stp	A_l, A_h, [dstin]
	stp	B_l, B_h, [dstin, 16]
	stp	C_l, C_h, [dstin, 32]
	stp	D_l, D_h, [dstin, 48]
	stp	E_l, E_h, [dstend, -32]
	stp	F_l, F_h, [dstend, -16]
	ret

	/* Align DST to 16 byte alignment so that we don't cross cache line
	   boundaries on both loads and stores.  There are at least 96 bytes
	   to copy, so copy 16 bytes unaligned and then align.  The loop
	   copies 64 bytes per iteration and prefetches one iteration ahead.  */

	.p2align 4
L(copy_long):
	and	tmp1, dstin, 15
	bic	dst, dstin, 15
	ldp	D_l, D_h, [src]
	sub	src, src, tmp1
	add	count, count, tmp1	/* Count is now 16 too large.  */
	ldp	A_l, A_h, [src, 16]
	stp	D_l, D_h, [dstin]
	ldp	B_l, B_h, [src, 32]
	ldp	C_l, C_h, [src, 48]
	ldp	D_l, D_h, [src, 64]!
	subs	count, count, 128 + 16	/* Test and readjust count.  */
	b.ls	L(last64)
L(loop64):
	stp	A_l, A_h, [dst, 16]
	ldp	A_l, A_h, [src, 16]
	stp	B_l, B_h, [dst, 32]
	ldp	B_l, B_h, [src, 32]
	stp	C_l, C_h, [dst, 48]
	ldp	C_l, C_h, [src, 48]
	stp	D_l, D_h, [dst, 64]!
	ldp	D_l, D_h, [src, 64]!
	subs	count, count, 64
	b.hi	L(loop64)

	/* Write the last full set of 64 bytes.  The remainder is at most 64
	   bytes, so it is safe to always copy 64 bytes from the end even if
	   there is just 1 byte left.  */
L(last64):
	ldp	E_l, E_h, [srcend, -64]
	stp	A_l, A_h, [dst, 16]
	ldp	A_l, A_h, [srcend, -48]
	stp	B_l, B_h, [dst, 32]
	ldp	B_l, B_h, [srcend, -32]
	stp	C_l, C_h, [dst, 48]
	ldp	C_l, C_h, [srcend, -16]
	stp	D_l, D_h, [dst, 64]
	stp	E_l, E_h, [dstend, -64]
	stp	A_l, A_h, [dstend, -48]
	stp	B_l, B_h, [dstend, -32]
	stp	C_l, C_h, [dstend, -16]
	ret

	.p2align 4
L(move_long):
	cbz	tmp1, 3f

	add	srcend, src, count
	add	dstend, dstin, count

	/* Align dstend to 16 byte alignment so that we don't cross cache line
	   boundaries on both loads and stores.  There are at least 96 bytes
	   to copy, so copy 16 bytes unaligned and then align.  The loop
	   copies 64 bytes per iteration and prefetches one iteration ahead.  */

	and	tmp1, dstend, 15
	ldp	D_l, D_h, [srcend, -16]
	sub	srcend, srcend, tmp1
	sub	count, count, tmp1
	ldp	A_l, A_h, [srcend, -16]
	stp	D_l, D_h, [dstend, -16]
	ldp	B_l, B_h, [srcend, -32]
	ldp	C_l, C_h, [srcend, -48]
	ldp	D_l, D_h, [srcend, -64]!
	sub	dstend, dstend, tmp1
	subs	count, count, 128
	b.ls	2f

	nop
1:
	stp	A_l, A_h, [dstend, -16]
	ldp	A_l, A_h, [srcend, -16]
	stp	B_l, B_h, [dstend, -32]
	ldp	B_l, B_h, [srcend, -32]
	stp	C_l, C_h, [dstend, -48]
	ldp	C_l, C_h, [srcend, -48]
	stp	D_l, D_h, [dstend, -64]!
	ldp	D_l, D_h, [srcend, -64]!
	subs	count, count, 64
	b.hi	1b

	/* Write the last full set of 64 bytes.  The remainder is at most 64
	   bytes, so it is safe to always copy 64 bytes from the start even if
	   there is just 1 byte left.  */
2:
	ldp	G_l, G_h, [src, 48]
	stp	A_l, A_h, [dstend, -16]
	ldp	A_l, A_h, [src, 32]
	stp	B_l, B_h, [dstend, -32]
	ldp	B_l, B_h, [src, 16]
	stp	C_l, C_h, [dstend, -48]
	ldp	C_l, C_h, [src]
	stp	D_l, D_h, [dstend, -64]
	stp	G_l, G_h, [dstin, 48]
	stp	A_l, A_h, [dstin, 32]
	stp	B_l, B_h, [dstin, 16]
	stp	C_l, C_h, [dstin]
3:	ret

END (MEMCPY)
libc_hidden_builtin_def (MEMCPY)
Last edited by jahboater on Wed Oct 11, 2017 10:43 am, edited 1 time in total.
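
The small-copy paths in the listing avoid per-byte loops by letting loads and stores overlap. The C++ below is a sketch of that idea, written to mirror the `L(copy16)` path for 8 to 16 bytes and the branchless 1 to 3 byte sequence; the function names are made up, and each `std::memcpy` of a fixed 8 bytes stands in for a single load or store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies n bytes for 8 <= n <= 16 with two 8-byte moves, one anchored at the
// start and one at the end. They overlap in the middle when n < 16.
void copy_8_to_16(void* dst, const void* src, std::size_t n) {
    assert(n >= 8 && n <= 16);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    unsigned char* d = static_cast<unsigned char*>(dst);
    std::uint64_t head, tail;
    std::memcpy(&head, s, 8);           // ldr A_l, [src]
    std::memcpy(&tail, s + n - 8, 8);   // ldr A_h, [srcend, -8]
    std::memcpy(d, &head, 8);           // str A_l, [dstin]
    std::memcpy(d + n - 8, &tail, 8);   // str A_h, [dstend, -8]
}

// Copies n bytes for 1 <= n <= 3 without branching on n: first, middle and
// last byte are all read before any byte is written.
void copy_1_to_3(void* dst, const void* src, std::size_t n) {
    assert(n >= 1 && n <= 3);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const std::size_t mid = n >> 1;     // lsr tmp1, count, 1
    const unsigned char first = s[0];
    const unsigned char last = s[n - 1];
    const unsigned char middle = s[mid];
    d[0] = first;
    d[mid] = middle;
    d[n - 1] = last;
}
```

For the 13-byte " James Bond!" case discussed earlier in the thread, `copy_8_to_16` needs two loads and two stores, with bytes 5 to 7 simply written twice.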

jahboater
Posts: 4759
Joined: Wed Feb 04, 2015 6:38 pm

Re: Raspbian Jessie (64bit) for RPi3?

Wed Oct 11, 2017 10:37 am

32/64 bits.

Obviously any code dealing with 64 bit numbers or addresses is a win for A64.

There are other things which should benefit 32-bit stuff which are just inherent in the new re-designed architecture.

Some of these benefits won't appear on the Pi 3 because it uses a simple in-order processor. The higher-performance processors (Cortex-A57 and Cortex-A73, for example) are out-of-order, and the new instruction set was designed to make these work properly (for example by removing conditional execution of most instructions, removing ldm/stm, and removing the ability to write to the PC register).

Other things, like having 31 general-purpose registers, help everywhere (that's a lot of scalar local variables a function can have before the stack needs adjusting!). Regular opcodes with the fields all in the same place help the decoders, and so on. Much of the decades of legacy stuff in ARM has been removed or simplified.

ClemS
Posts: 3
Joined: Sun Oct 08, 2017 12:54 pm

Re: Raspbian Jessie (64bit) for RPi3?

Wed Oct 11, 2017 11:53 am

jamesh wrote:
Wed Oct 11, 2017 8:51 am
Not commenting on the 32 vs 64 argument, but worth noting that the string handling in the STL is a bit smarter than that. Generally, the default buffer allocated is larger than the string initially in it (...)
Of course the STL is far smarter than that! And you didn't mention SSO (Short String Optimization), which prevents allocation for a text that short! That said, if you consider the SSO mechanism, my example is perfectly valid, as the first 13 bytes fit in the SSO buffer (generally 16 bytes long, including the '\0'). :D
Then, when performing the concatenation, the SSO buffer will not be big enough, and a memory allocation will occur, as well as a copy of the SSO buffer content to the allocated buffer.

This was just an example to help people understand the gain from 64-bit mode (in response to some comments I saw before). It was not supposed to strictly represent what we do in real life to obtain good, beautiful and optimized code. ;) Of course, we try to prevent unnecessary memory allocations, for instance by reserving the size before any operation or by using other techniques... But that was not the point of this example.

PS: SSO is the most common string optimization (used in the Linux and Microsoft versions of the STL, AFAIK), but you can also find COW optimization (Copy On Write, used in MFC) or both combined (I only saw it once but don't remember where).
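
Whether a particular string sits in the SSO buffer can be checked rather than assumed. The sketch below uses a common trick: if the character data lies inside the std::string object itself, the string is stored inline. `stored_inline` is a made-up helper, and the inline capacity it reveals is an implementation detail that differs between standard libraries.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// True if the character data lives inside the std::string object itself,
// i.e. the implementation is using its small-string buffer.
bool stored_inline(const std::string& s) {
    const auto obj = reinterpret_cast<std::uintptr_t>(&s);
    const auto data = reinterpret_cast<std::uintptr_t>(s.data());
    return data >= obj && data < obj + sizeof(std::string);
}
```

Note that "My name is Bond..." is 18 characters and " James Bond!" is 12, so whether the first string fits inline depends on the library's SSO capacity. Calling `stored_inline` before and after the `+=` shows what actually happens on a given toolchain.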

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 23867
Joined: Sat Jul 30, 2011 7:41 pm

Re: Raspbian Jessie (64bit) for RPi3?

Wed Oct 11, 2017 1:25 pm

ClemS wrote:
Wed Oct 11, 2017 11:53 am
jamesh wrote:
Wed Oct 11, 2017 8:51 am
Not commenting on the 32 vs 64 argument, but worth noting that the string handling in the STL is a bit smarter than that. Generally, the default buffer allocated is larger than the string initially in it (...)
Of course the STL is far smarter than that! And you didn't mention SSO (Short String Optimization), which prevents allocation for a text that short! That said, if you consider the SSO mechanism, my example is perfectly valid, as the first 13 bytes fit in the SSO buffer (generally 16 bytes long, including the '\0'). :D
Then, when performing the concatenation, the SSO buffer will not be big enough, and a memory allocation will occur, as well as a copy of the SSO buffer content to the allocated buffer.

This was just an example to help people understand the gain from 64-bit mode (in response to some comments I saw before). It was not supposed to strictly represent what we do in real life to obtain good, beautiful and optimized code. ;) Of course, we try to prevent unnecessary memory allocations, for instance by reserving the size before any operation or by using other techniques... But that was not the point of this example.

PS: SSO is the most common string optimization (used in the Linux and Microsoft versions of the STL, AFAIK), but you can also find COW optimization (Copy On Write, used in MFC) or both combined (I only saw it once but don't remember where).
FYI, copy on write is used extensively in the Linux networking stack. (skb buffers use it - failure to use COW in a couple of network drivers was actually the source of a recent bug fix to do with bridging).

SSO is fair enough, but the copy difference for strings of 20 characters or less in 32 or 64 is just in the noise.

tkaiser
Posts: 103
Joined: Fri Aug 05, 2016 1:28 pm

Re: Raspbian Jessie (64bit) for RPi3?

Mon Oct 23, 2017 12:32 pm

django013 wrote:
Sun Apr 02, 2017 1:10 pm
3 minutes and 5 seconds is way slower than the 86 seconds from jahboater.

What am I doing wrong?
You are running undervolted, like the vast majority of RPi users, so your RPi stays at 600 MHz all the time. With Raspbian Stretch (packages built with GCC 6.3) an RPi 3 at 1200 MHz needs ~92.5 seconds for this sysbench call (with Raspbian Jessie and GCC 4.7 the same sysbench run needs 120 seconds, so sysbench gets 'accelerated' by 30 percent just by updating GCC).

Your 185 seconds are just the result of you using the wrong cable between PSU/charger and your Raspberry. But you're not alone: https://github.com/bamarni/pi64/issues/ ... -291425512
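
The numbers line up with a simple clock-scaling estimate. The sketch below assumes the benchmark is purely CPU-bound, so run time scales inversely with clock speed; the function names are made up, and the sysfs path is the standard Linux cpufreq location, which may not exist on every system.

```cpp
#include <cassert>
#include <fstream>

// Estimated run time at `to_mhz`, given a measured time at `from_mhz`,
// assuming time scales as 1/clock (purely CPU-bound workload).
double scale_runtime(double seconds, double from_mhz, double to_mhz) {
    assert(from_mhz > 0.0 && to_mhz > 0.0);
    return seconds * from_mhz / to_mhz;
}

// Current CPU clock in MHz from the Linux cpufreq sysfs file, or -1.0 if the
// file cannot be read (for example on a non-Linux host).
double current_cpu_mhz() {
    std::ifstream f("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
    long khz = 0;
    if (!(f >> khz)) return -1.0;
    return khz / 1000.0;
}
```

`scale_runtime(185.0, 600.0, 1200.0)` gives 92.5 seconds, which matches the figure quoted for an RPi 3 running at its full 1200 MHz.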

pguillem
Posts: 2
Joined: Sat Nov 12, 2016 7:34 am

Re: Raspbian Jessie (64bit) for RPi3?

Sat Nov 25, 2017 3:45 am

https://github.com/bamarni/pi64/releases

Here it is. Someone named BAMARNI made a 64-bit kernel for it.

Olle2
Posts: 11
Joined: Wed Jan 04, 2017 7:59 am

Re: Raspbian Jessie (64bit) for RPi3?

Sat Mar 03, 2018 8:30 pm

The 64 bit image for rpi3 in https://files.devuan.org/devuan_ascii_beta/embedded/ might be tried if you like to be outside of the safe zone. Otherwise Razpian rulez! The answer is 42. Olle2
