matherp
Posts: 11
Joined: Tue May 02, 2017 10:54 am

writing a transfer vector in C

Thu Feb 08, 2018 12:25 pm

I'm very old :-( and started programming with PDP11 assembler. There was a neat bit of code in the DEC OS to optimise memory-to-memory copies, called a transfer vector. This relied on the fact that you could do indirect register access with auto-increment.

In C this is simply:

*p1++ = *p2++

The clever bit was that to do N copies there was simply a string of these instructions in a row - say 128

Then the code was:
program counter = program counter + (128 - N)
*p1++ = *p2++ // copy 1
*p1++ = *p2++ // copy 2
......
*p1++ = *p2++ // copy 128
return

i.e. the program incremented the program counter so as to leave exactly the required number of copies before the return.

This allowed very fast memory copying of any size up to 128 (in our example) without any looping.

Can anyone think of a way to do this in C on the Pi, or supply the equivalent in-line assembler? I need to optimise some arbitrary-length memory copies.

Thanks

PeterO
Posts: 4237
Joined: Sun Jul 22, 2012 4:14 pm

Re: writing a transfer vector in C

Thu Feb 08, 2018 12:33 pm

I doubt you'll come up with anything that is noticeably more efficient than using library routines like memcpy.

PeterO
Discoverer of the PI2 XENON DEATH FLASH!
Interests: C,Python,PIC,Electronics,Ham Radio (G0DZB),Aeromodelling,1960s British Computers.
"The primary requirement (as we've always seen in your examples) is that the code is readable. " Dougie Lawson

ghans
Posts: 7787
Joined: Mon Dec 12, 2011 8:30 pm
Location: Germany

Re: writing a transfer vector in C

Thu Feb 08, 2018 12:36 pm

Optimized memcpy for the Raspberry Pi:
https://github.com/bavison/arm-mem
• Don't like the board ? Missing features ? Change to the prosilver theme ! You can find it in your settings.
• Don't like to search the forum BEFORE posting 'cos it's useless ? Try googling : yoursearchtermshere site:raspberrypi.org

LdB
Posts: 827
Joined: Wed Dec 07, 2016 2:29 pm

Re: writing a transfer vector in C

Thu Feb 08, 2018 3:03 pm

You need to be careful with terms like fast and copying, and explain what you are trying to do.

Technically you can just copy opcodes to a memory block and do exactly what the PDP11 did, for example.

ghans has shown one example of an assembler-optimized memcpy, which is nice, but it's not necessarily the fastest way to move data around the Pi, depending on what you are trying to do.

If the transfer is large, a DMA transfer is technically much faster and has very low overhead on the CPU (it all happens in the background).
Note that DMA to the peripheral memory is relatively slow and clunky because of the slow AXI bus to the peripherals.

If you are after real speed there are some tricks you can do with the cache to do JIT execution, which is what you seem to want to do.
ARM has details of writing self-modifying code inside the cache, which would fly for what you are trying to do:
https://community.arm.com/processors/b/ ... fying-code

Can't remember who to credit, but in my clippings I have this code for the Pi under Linux:

Code: Select all

#include <stdio.h>
#include <sys/mman.h>
#include <stdint.h>
#include <stdlib.h>

int inc(int x){ //increments x

    uint32_t *ret = mmap(NULL,
            2 * sizeof(uint32_t),  // Space for 2 instructions (mmap rounds up to a whole page anyway.)
            PROT_READ | PROT_WRITE | PROT_EXEC,
            MAP_PRIVATE | MAP_ANONYMOUS,
            -1,0);
    if (ret == MAP_FAILED) {
        printf("Could not mmap a memory buffer with the proper permissions.\n");
        return -1;
    }

    *(ret + 0) = 0xE2800001; //add r0 r0 #1 := r0 += 1
    *(ret + 1) = 0xE12FFF1E; //bx lr        := jump back to inc()

    __clear_cache((char*) ret, (char*) (ret+2));

    int(*f)(int) = (int (*)(int)) ret;
    return (*f)(x);
}

int main(){
    printf("%d\n",inc(6)); //expect '7' to be printed
    exit(0);
}

None of this is safe or normal, but if you want the fastest speed these are the sorts of tricks you do :-)

jahboater
Posts: 2858
Joined: Wed Feb 04, 2015 6:38 pm

Re: writing a transfer vector in C

Thu Feb 08, 2018 5:13 pm

For safety, why not just use C?
The ARM CPU has auto-increment addressing modes like the PDP 11 and the C compiler will produce similar, or more likely better code.

for( int i = 0; i < 128; ++i )
*p1++ = *p2++;

gcc produces this loop for chars:

.L2:
ldrb ip, [r3], #1
strb ip, [r1, #1]!
cmp r2, r3
bne .L2

gcc has eliminated "i" of course. I doubt there is much benefit from unrolling the loop.

As PeterO says, if you want speed and safety, use memcpy.
However if you have knowledge about the operation that memcpy() does not have, for example you know that both addresses are suitably aligned and the length is a multiple of something handy like 16, then you might do better (say with ldm/stm of the "q" registers).

I once wrote a function to fast-copy page tables with NEON and it benchmarked eight times faster than memcpy() (in C with inline assembler).

For small memcpy() calls where the length is known at compile time, gcc will replace memcpy() with inline instructions.
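
As a tiny illustration of that last point (copy_pixel is just an invented name for the example):

Code: Select all

#include <string.h>

/* With the length a compile-time constant, gcc at -O2 will usually
   replace this memcpy() with a couple of inline load/store
   instructions rather than a library call. */
void copy_pixel(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, 4);
}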

matherp
Posts: 11
Joined: Tue May 02, 2017 10:54 am

Re: writing a transfer vector in C

Fri Feb 09, 2018 8:40 am

jahboater wrote:
For safety, why not just use C?
The ARM CPU has auto-increment addressing modes like the PDP 11 and the C compiler will produce similar, or more likely better code.

As you demonstrate, the C compiler produces perfect code, and this will always be better than a more generic call like memcpy, however well written. But what I am trying to do is eliminate the loop counter. The DEC approach was to have a set of load/store instructions and jump down the set to a place that left the required number before the return. I can't think of a construct in C which would cause the compiler to do this.

// routine to copy N
jump relative (128-N)*2
ldrb ip, [r3], #1 //1st pair
strb ip, [r1, #1]!
ldrb ip, [r3], #1 //2nd pair
strb ip, [r1, #1]!
ldrb ip, [r3], #1 //3rd pair
strb ip, [r1, #1]!
....
ldrb ip, [r3], #1 // 128th pair
strb ip, [r1, #1]! //

jahboater
Posts: 2858
Joined: Wed Feb 04, 2015 6:38 pm

Re: writing a transfer vector in C

Fri Feb 09, 2018 8:59 am

matherp wrote:
The DEC approach was to have a set of load/store instructions and jump down the set to a place that left the required number before the return. I can't think of a construct in C which would cause the compiler to do this.

Why not try a simple switch?

If you give it sequential values gcc will definitely emit a jump table, and without the "break"s at the end of each case, it will fall through. I think this does what you want:

switch( n )
{
case 128:
*p1++ = *p2++;
case 127:
*p1++ = *p2++;
case 126:
*p1++ = *p2++;
............
}

On the Pi this will produce something like ldr pc, [pc, N, asl #2]

You shouldn't need to change it, but this sets the minimum sized switch for which the compiler will consider a jump table.
--param case-values-threshold=n
(the default is 4 for arm and 5 for x86).
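
For reference, a complete but scaled-down version of that switch (8 cases instead of 128; copy_up_to_8 is just an illustrative name, and n is assumed to be in the range 0..8):

Code: Select all

#include <stddef.h>

/* Switch-fallthrough copy: one indirect jump lands on the right number
   of copy statements, with no loop counter. */
void copy_up_to_8(unsigned char *p1, const unsigned char *p2, size_t n)
{
    switch( n )
    {
    case 8: *p1++ = *p2++;  /* fall through */
    case 7: *p1++ = *p2++;  /* fall through */
    case 6: *p1++ = *p2++;  /* fall through */
    case 5: *p1++ = *p2++;  /* fall through */
    case 4: *p1++ = *p2++;  /* fall through */
    case 3: *p1++ = *p2++;  /* fall through */
    case 2: *p1++ = *p2++;  /* fall through */
    case 1: *p1++ = *p2++;  /* fall through */
    case 0: break;
    }
}

Scaled up to 128 cases it reproduces the transfer vector idea: the only overhead is the single computed jump.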

One-by-one copies like this are slow; modern CPUs can do much better, especially using SIMD.
Perhaps use the switch to get to the first 16 byte boundary and then use the 128-bit "q" registers for the rest which will be very much faster.

I know you want arbitrary length, but for interest, the Pi can do this:-

vldm r1!,{q0-q7}
vstm r0!,{q0-q7}

which will copy 128 bytes very quickly with just two instructions
(and auto increment the addresses (that is, add 128 to each address)).

LdB
Posts: 827
Joined: Wed Dec 07, 2016 2:29 pm

Re: writing a transfer vector in C

Sun Feb 11, 2018 4:50 am

@matherp
The question we are wondering about is why you are trying to eliminate the loop; it is not faster or more optimal.

I have the feeling you are used to working with static opcodes on something like a micro, where each one takes x amount of cycles. It doesn't quite work like that on the ARM: you have a cache and predictive branching.
https://en.wikipedia.org/wiki/Branch_predictor
A conditional jump that controls a loop is best predicted with a special loop predictor. A conditional jump in the bottom of a loop that repeats N times will be taken N-1 times and then not taken once. If the conditional jump is placed at the top of the loop, it will be not taken N-1 times and then taken once. A conditional jump that goes many times one way and then the other way once is detected as having loop behavior. Such a conditional jump can be predicted easily with a simple counter. A loop predictor is part of a hybrid predictor where a meta-predictor detects whether the conditional jump has loop behavior.

Many microprocessors today have loop predictors.

In the same way, what you are doing is not a win even on an Intel processor, where a "rep movsd" would outperform it.
That is very similar to what jahboater was showing you on the ARM.

We keep coming back to the question why are you trying to do it?
What is the goal and point of it?

The only reason I thought you might want to do it was for self modifying code but it appears that is not the case.
So you seem to be going to a lot of effort for nothing :-).

matherp
Posts: 11
Joined: Tue May 02, 2017 10:54 am

Re: writing a transfer vector in C

Thu Feb 22, 2018 11:25 am

LdB wrote:
The question we are wondering about is why you are trying to eliminate the loop; it is not faster or more optimal.

Sorry, but simply not true in the real world (GCC, -O2). The application runs on both the Pi and the PIC32MZ and I want the fastest code across the platforms. For example:

for(i=0;i<count;i++)*b++=*a++;

is markedly slower than

for(i=0;i<count;i+=8){
*b++=*a++;
*b++=*a++;
*b++=*a++;
*b++=*a++;
*b++=*a++;
*b++=*a++;
*b++=*a++;
*b++=*a++;
}

The application is trying to optimise updating a framebuffer with graphics structures as fast as possible, where the number of bytes to transfer can be any number, starting anywhere (i.e. not constrained to word boundaries).

memcpy() is really odd. Occasionally it is faster than hardcoded loops and sometimes it is MUCH slower.
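
For anyone who wants to reproduce this, a rough benchmark sketch (the sizes, iteration count and names are arbitrary; real numbers depend on cache state and compiler flags):

Code: Select all

#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define N     1920      /* longest transfer mentioned above */
#define ITERS 100000

static unsigned char a[N], b[N];

static void copy_loop(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        *dst++ = *src++;
}

static void copy_memcpy(unsigned char *dst, const unsigned char *src, size_t n)
{
    memcpy(dst, src, n);
}

static double run_ms(void (*copy)(unsigned char *, const unsigned char *, size_t))
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        copy(b, a, N);
        __asm__ __volatile__("" ::: "memory");  /* stop gcc eliding the repeats */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    printf("byte loop: %.1f ms\n", run_ms(copy_loop));
    printf("memcpy   : %.1f ms\n", run_ms(copy_memcpy));
    return 0;
}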

jahboater
Posts: 2858
Joined: Wed Feb 04, 2015 6:38 pm

Re: writing a transfer vector in C

Thu Feb 22, 2018 11:41 am

matherp wrote:
Thu Feb 22, 2018 11:25 am
The application is trying to optimise updating a framebuffer with graphics structures as fast as possible, where the number of bytes to transfer can be any number, starting anywhere (i.e. not constrained to word boundaries).

memcpy() is really odd. Occasionally it is faster than hardcoded loops and sometimes it is MUCH slower.
At what point do you know the length? I mean, is it known at compile time? If so, memcpy() may well resolve to something far more efficient that the compiler has put in its place.

It's a shame it's not Intel, where I would say use "rep movsb": one instruction that works for all lengths and all alignments, but is spectacularly fast if the data is aligned (say to 16 bytes) or the length is a multiple of, say, 64 bytes. All the work memcpy does is done in the hardware, and it is kept up to date as the internal bus widths change.

What sort of lengths are you copying? I mentioned before that if you can use NEON for the bulk of the transfer it will be way faster. NEON registers are 16 bytes wide and it seems to have a very fast interface to memory. You can still use ldm/stm to transfer up to 8 of these 16 byte registers each time round the loop.

As for loops vs unrolling, with -O3 (where it doesn't care about the code size) GCC may unroll it for you if there is a benefit - not common nowadays.

matherp
Posts: 11
Joined: Tue May 02, 2017 10:54 am

Re: writing a transfer vector in C

Thu Feb 22, 2018 12:03 pm

jahboater wrote:
At what point do you know the length? I mean, is it known at compile time?

Only at runtime.

jahboater wrote:
What sort of lengths are you copying?

Can be anything from 1 to 1920 bytes.

jahboater wrote:
As for loops vs unrolling, with -O3 (where it doesn't care about the code size) GCC may unroll it for you if there is a benefit - not common nowadays.

-O3 is hopeless with GCC, always slower than -O2.

The issue with all clever optimisations is that they add overhead in the decision to optimise, which is fine if it is a big transfer but makes it much slower if the transfer is a single byte. The advantage of the transfer vector approach is that the overhead is always a single jmp instruction. If we take the example of transferring the bytes at addresses 0x1001 to 0x101F to a new location starting at 0x2003, working out what to transfer as quadwords, what as words, and what as bytes etc. is simply not going to work as a strategy.

jahboater
Posts: 2858
Joined: Wed Feb 04, 2015 6:38 pm

Re: writing a transfer vector in C

Thu Feb 22, 2018 12:46 pm

matherp wrote:
Thu Feb 22, 2018 12:03 pm
The issue with all clever optimisations is that they add overhead in the decision to optimise, which is fine if it is a big transfer but makes it much slower if the transfer is a single byte. The advantage of the transfer vector approach is that the overhead is always a single jmp instruction.

You are quite right! That is exactly the problem. That's why memcpy() functions tend to be hundreds of lines long, usually in assembler.

You can beat memcpy() only if you know, or have arranged, the optimal characteristics for the data alignment and length.

On the Pi, I believe one of the RPF engineers has written a faster memcpy()

Code: Select all

$ cat ld.so.preload
/usr/lib/arm-linux-gnueabihf/libarmmem.so
$
Not sure where the source is, but it might be worth studying.

Some random ideas ...

To help with alignment the C11 _Alignas could be handy, or the aligned_alloc() version of malloc.

#define aligned _Alignas(16)
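
A minimal sketch of both, assuming C11 (the names, and the 1920-byte size borrowed from the framebuffer line length above, are just illustrative):

Code: Select all

#include <stdalign.h>   /* C11 alignas */
#include <stdio.h>
#include <stdlib.h>     /* aligned_alloc() is C11 */

/* Statically allocated, 16-byte aligned buffer. */
static alignas(16) unsigned char line_buf[1920];

int main(void)
{
    /* aligned_alloc() wants the size to be a multiple of the alignment;
       1920 is already 120 * 16, so no rounding is needed. */
    unsigned char *heap_buf = aligned_alloc(16, 1920);
    if (heap_buf == NULL)
        return 1;
    printf("static %p, heap %p\n", (void *)line_buf, (void *)heap_buf);
    free(heap_buf);
    return 0;
}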
matherp wrote:
Thu Feb 22, 2018 12:03 pm
If we take the example of transferring the bytes at addresses 0x1001 to 0x101F to a new location starting at 0x2003, working out what to transfer as quadwords, what as words, and what as bytes etc. is simply not going to work as a strategy.

Yes. But perhaps you could ask some questions. Does it matter if you overwrite the end of the 30 byte block by two bytes? Or can you arrange for it to be OK by allocating extra space? See what the compiler produces if you do memcpy( dst, src, 30 ). Check to see if alignment matters on your platform.
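
If the answer to the second question is yes, the idea is simply to over-allocate. A rough sketch (the names and the rounding granularity are only illustrative):

Code: Select all

#include <stdlib.h>
#include <string.h>

#define COPY_ROUND 16   /* illustrative granularity; pick whatever suits the copy loop */

/* Allocate n bytes plus enough slack that a copy rounded up to a
   multiple of COPY_ROUND cannot run off the end of the buffer. */
void *alloc_with_slack(size_t n)
{
    return malloc((n + COPY_ROUND - 1) / COPY_ROUND * COPY_ROUND);
}

/* Round the length up; only safe if BOTH buffers came from
   alloc_with_slack(), since the tail bytes are read and written. */
void copy_rounded(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, (n + COPY_ROUND - 1) / COPY_ROUND * COPY_ROUND);
}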

Edit: see this in "man gcc" :

The overhead you speak of doesn't apply when the compiler does it!

Code: Select all

-mmemcpy-strategy=strategy
   Override the internal decision heuristic to decide if "__builtin_memcpy" should be
   inlined and what inline algorithm to use when the expected size of the copy operation
   is known. strategy is a comma-separated list of alg:max_size:dest_align triplets.  alg
   is specified in -mstringop-strategy, max_size specifies the max byte size with which
   inline algorithm alg is allowed.  For the last triplet, the max_size must be "-1". The
   max_size of the triplets in the list must be specified in increasing order.  The
   minimal byte size for alg is 0 for the first triplet and "max_size + 1" of the
   preceding range.
It's not an easy problem if you want real speed and flexibility!

LdB
Posts: 827
Joined: Wed Dec 07, 2016 2:29 pm

Re: writing a transfer vector in C

Sat Feb 24, 2018 3:00 am

For a large block of data on the Pi, given the following background:

• The data is aligned.
• The MMU and branch prediction are enabled.
• Both the level 1 and level 2 caches are enabled, and both the code and data memory regions being used are marked as cacheable.
• We also use the preload instruction (PLD), which causes the level 2 cache to start loading the data some time before the processor executes the code that accesses it.

If all of that holds on your Pi, no one has come up with a way to beat ye old NEONCopyPLD:

Code: Select all

NEONCopyPLD
      PLD [r1, #0xC0]
      VLDM r1!,{d0-d7}
      VSTM r0!,{d0-d7}
      SUBS r2,r2,#0x40
      BGE NEONCopyPLD
However the code is useless for his PIC and doesn't help him with his fight for the perfect memcpy :-)
I also question why there is so much memcpy'ing in the first place; I've never seen it in any program, but that isn't the discussion of the post.

The -O3 comments are misguided; it would depend on the code and compiler version. I don't think I have ever seen -O3 produce slower code on GCC 7.1. That doesn't mean it couldn't happen, and maybe our poster is just very unlucky with his code.

jahboater
Posts: 2858
Joined: Wed Feb 04, 2015 6:38 pm

Re: writing a transfer vector in C

Sat Feb 24, 2018 8:03 am

FYI
I found that moving the SUBS up one, to between the VLDM and the VSTM, gave a measurable speed increase.
Then there is plenty of time for the flags to be set, and I suppose an extra clock between the load and the store doesn't do any harm either.
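
For reference, the change described is just a two-line swap in the loop quoted above (everything else unchanged):

Code: Select all

NEONCopyPLD
      PLD [r1, #0xC0]
      VLDM r1!,{d0-d7}
      SUBS r2,r2,#0x40     @ moved up: the flags are settled well before the branch
      VSTM r0!,{d0-d7}
      BGE NEONCopyPLD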
