DarkElvenAngel
Posts: 1683
Joined: Tue Mar 20, 2018 9:53 pm

Need a 24 bit memset

Tue May 18, 2021 12:39 am

Hello everyone,

I have an array of a data structure that is 3 bytes in size. Right now I use memset to zero the array, which works as a reset. However, to do a proper reset I need to set some of the structure's values to something other than 0.

Using a for loop for this seems like it would be slow, as the entire array is upwards of 7,600 bytes in size. Is there a better way?

dshadoff
Posts: 42
Joined: Wed Apr 28, 2021 3:12 am

Re: Need a 24 bit memset

Tue May 18, 2021 12:56 am

How fast do you need it to be?
One way - which may or may not be fast enough for you, but is certainly programmer-friendly - is to store a binary "blob" in flash memory, and load it into the array with a memcpy(). But since this comes from flash, it has an upper limit in speed.

memcpy(array, (uint8_t *)XIP_BASE + FLASH_OFFSET, sizeof(array));

You can install the binary image using picotool - but you will need to explicitly be aware of the value of XIP_BASE when you use picotool to load data into flash using this method (this hung me up for hours).

Oh, and don't be surprised if you find that your 24-bit data structure as an array turns out to be 32 bits per array element, due to the way that compilers align things. (I'm not sure if this is true or not in the Pico environment, but it is often true on other architectures).
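
A quick way to settle the padding question is to let the compiler report the size. The sketch below assumes a hypothetical element made of three `uint8_t` members (the thread never shows the real structure). A struct built only from single-byte members has 1-byte alignment and normally needs no padding, whereas mixing in a wider member usually does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 3-byte element; the real structure is not shown in the thread. */
struct cell {
    uint8_t a;
    uint8_t b;
    uint8_t c;
};

/* Same 3 bytes of payload, but the 16-bit member forces 2-byte alignment. */
struct cell_mixed {
    uint8_t  a;
    uint16_t b;
};

/* Compilation fails here if the compiler pads the element. */
_Static_assert(sizeof(struct cell) == 3, "struct cell is padded");

/* Bytes occupied by an array of n elements. */
static size_t cell_array_bytes(size_t n) {
    return n * sizeof(struct cell);
}
```

On common ABIs `sizeof(struct cell_mixed)` comes out as 4 rather than 3, which is the effect described above.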

DarkElvenAngel
Posts: 1683
Joined: Tue Mar 20, 2018 9:53 pm

Re: Need a 24 bit memset

Tue May 18, 2021 1:14 am

Well, if the size is 32-bit aligned, there are memset functions mentioned in the SDK for that. sizeof reports 3, and the memory used for the array is 7,600 bytes, so maybe not. I could waste memory and go to 32 bits per element for 9,600 bytes, but that seems a bit extreme.

Looping with a memcpy might be faster
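
For reference, the plain loop under discussion might look like the sketch below. `cell_t` and `reset_cells` are hypothetical names standing in for the real structure and reset routine; it is meant as a baseline to measure against, not as the fastest option.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 3-byte element standing in for the real structure. */
typedef struct {
    uint8_t a, b, c;
} cell_t;

/* Copy one template element into every slot of the array. */
static void reset_cells(cell_t *cells, size_t count, const cell_t *init) {
    for (size_t i = 0; i < count; i++) {
        memcpy(&cells[i], init, sizeof *init);
    }
}
```

Because it works per element and never forms a wider pointer, this version has no alignment requirements.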

kilograham
Raspberry Pi Engineer & Forum Moderator
Posts: 608
Joined: Fri Apr 12, 2019 11:00 am
Location: austin tx

Re: Need a 24 bit memset

Tue May 18, 2021 1:39 am

copying a repeating pattern linearly from flash is a truly terrible idea!

// The compiler does an ok job with this.

Code: Select all

void memset12(uint32_t *dest, uint8_t a, uint8_t b, uint8_t c, uint32_t n) {
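    // casts keep the shifts unsigned (a uint8_t is promoted to a signed int)
    uint32_t v0 = ((uint32_t)a << 24) | ((uint32_t)c << 16) | ((uint32_t)b << 8) | a;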
    uint32_t v2 = (v0 << 8) | c;
    uint32_t v1 = (v2 << 8) | b;
    n >>= 2; // needs to be multiple of 4
    for(; n>0; n--) {
        *dest++ = v0;
        *dest++ = v1;
        *dest++ = v2;
    }
}
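
To see why three words are enough, it helps to work one case through. Four 3-byte elements occupy exactly 12 bytes, which is three 32-bit words, and on a little-endian core the lowest byte of each word lands first in memory. The sketch below uses a hypothetical helper, `pack_pattern`, that only builds the words.

```c
#include <stdint.h>

/* Build the three little-endian words that spell a,b,c,a,b,c,... in memory. */
static void pack_pattern(uint8_t a, uint8_t b, uint8_t c, uint32_t out[3]) {
    uint32_t v0 = ((uint32_t)a << 24) | ((uint32_t)c << 16) | ((uint32_t)b << 8) | a;
    uint32_t v2 = (v0 << 8) | c;   /* memory bytes: c, a, b, c */
    uint32_t v1 = (v2 << 8) | b;   /* memory bytes: b, c, a, b */
    out[0] = v0;                   /* memory bytes: a, b, c, a */
    out[1] = v1;
    out[2] = v2;
}
```

With a = 0x11, b = 0x22, c = 0x33 the words are 0x11332211, 0x22113322 and 0x33221133, which a little-endian machine stores as 11 22 33 11 22 33 11 22 33 11 22 33.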

dshadoff
Posts: 42
Joined: Wed Apr 28, 2021 3:12 am

Re: Need a 24 bit memset

Tue May 18, 2021 2:29 am

Based on the original poster's phrasing, I hadn't assumed that the structure was repeating; rather I had inferred that it was non-uniform.
But perhaps the original poster should comment on this point.

DarkElvenAngel
Posts: 1683
Joined: Tue Mar 20, 2018 9:53 pm

Re: Need a 24 bit memset

Tue May 18, 2021 4:31 am

dshadoff wrote: Based on the original poster's phrasing, I hadn't assumed that the structure was repeating; rather I had inferred that it was non-uniform.
But perhaps the original poster should comment on this point.
Sorry for the confusion. I'm using memset to reset the array; however, one of the elements in the structure really needs to be a value other than 0, so the basic memset doesn't quite do the job, although it works.

However, your idea of storing a populated array in flash and copying it is something I also need; those values would not be repeating, unlike the reset buffer. The best part is that I can change the data later without having to re-upload code. Very useful!

It took a few tries to figure out why my code was crashing with the function above, but in the end it was the n variable: I didn't take into account that the array count would be a third of the size.

Thanks for the help. Both solutions are very useful for me.

Memotech Bill
Posts: 98
Joined: Sun Nov 18, 2018 9:23 am

Re: Need a 24 bit memset

Tue May 18, 2021 10:49 am

Another option is DMA.
  • Write one copy of the pattern you wish to replicate to your buffer
  • Set up a byte-wide DMA
  • Set the start address to the start of your pattern
  • Set the destination address three bytes higher
  • Auto increment both source and destination addresses
  • Set the length of the transfer to the size of the block you wish to initialise, minus the initial 3 bytes
It can be improved using @kilograham's idea: write four copies of your 3-byte pattern to the start of your buffer, then use 32-bit wide DMA to replicate those four copies to the remainder of the buffer.
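
The byte-wide variant relies on the copy overlapping itself: every byte written is read again three steps later. The sketch below is only a host-side model in plain C, not Pico SDK DMA code. `replicate3` is a hypothetical name, and it assumes the transfer behaves like a strictly sequential byte copy; whether a real DMA engine keeps reads and writes in that order for overlapping ranges is an assumption that would need checking on hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the transfer: a sequential, byte-wide, forward copy from buf
 * to buf + 3. The first three bytes must already hold the seed pattern. */
static void replicate3(uint8_t *buf, size_t total_bytes) {
    if (total_bytes <= 3) {
        return;                       /* nothing beyond the seed */
    }
    const uint8_t *src = buf;         /* start of the seed pattern */
    uint8_t *dst = buf + 3;           /* destination three bytes higher */
    for (size_t i = 0; i < total_bytes - 3; i++) {
        dst[i] = src[i];              /* reads bytes written earlier in the loop */
    }
}
```

Note that `memcpy` must not be used for this, since its behaviour is undefined for overlapping ranges, and `memmove` would preserve the original contents instead of propagating the seed.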

DarkElvenAngel
Posts: 1683
Joined: Tue Mar 20, 2018 9:53 pm

Re: Need a 24 bit memset

Tue May 18, 2021 1:00 pm

Memotech Bill wrote:
Tue May 18, 2021 10:49 am
Another option is DMA.
  • Write one copy of the pattern you wish to replicate to your buffer
  • Set up a byte-wide DMA
  • Set the start address to the start of your pattern
  • Set the destination address three bytes higher
  • Auto increment both source and destination addresses
  • Set the length of the transfer to the size of the block you wish to initialise, minus the initial 3 bytes
It can be improved using @kilograham's idea: write four copies of your 3-byte pattern to the start of your buffer, then use 32-bit wide DMA to replicate those four copies to the remainder of the buffer.
I'm going to have to look into this. I haven't learnt how to use DMA on the Pico yet, but it should be worth the effort.

Thanks

dairequinlan
Posts: 29
Joined: Tue Feb 23, 2021 3:18 pm

Re: Need a 24 bit memset

Tue May 18, 2021 1:13 pm

If there are 32-bit aligned memset functions, then find the least common multiple of the element size and the word size, in this case 96 bits, and memset that instead. Pack it with 4 copies of the struct and memset away. However, I can't find any mention of a general memset implementation for anything other than single bytes that doesn't just use loops or something similar internally, so I don't know.
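
In the absence of a ready-made wide memset, the 96-bit idea can be approximated with a 12-byte tile and `memcpy`. This is a minimal sketch under that assumption; `fill24_tiled` is a hypothetical name, and `count` is the number of 3-byte elements, not the number of bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill count 3-byte elements by copying a 12-byte tile (four elements). */
static void fill24_tiled(void *dest, uint8_t a, uint8_t b, uint8_t c, size_t count) {
    const uint8_t tile[12] = { a, b, c, a, b, c, a, b, c, a, b, c };
    uint8_t *p = dest;                /* byte pointer for the arithmetic */
    size_t bytes = count * 3;

    while (bytes >= sizeof tile) {
        memcpy(p, tile, sizeof tile);
        p += sizeof tile;
        bytes -= sizeof tile;
    }
    if (bytes > 0) {
        memcpy(p, tile, bytes);       /* 3, 6 or 9 leftover bytes */
    }
}
```

Since `memcpy` copes with any alignment, this version works for any destination address and any element count, at the cost of a library call per tile.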

Alternative answer: why? Just how fast do you need this to be? I'm pretty sure dumping 3200 iterations of 3 bytes linearly into memory is going to be pretty zippy, so absent some really pressing performance requirement I'd probably just leave it at that.

DarkElvenAngel
Posts: 1683
Joined: Tue Mar 20, 2018 9:53 pm

Re: Need a 24 bit memset

Tue May 18, 2021 2:14 pm

dairequinlan wrote: If there are 32-bit aligned memset functions, then find the least common multiple of the element size and the word size, in this case 96 bits, and memset that instead. Pack it with 4 copies of the struct and memset away. However, I can't find any mention of a general memset implementation for anything other than single bytes that doesn't just use loops or something similar internally, so I don't know.

Alternative answer: why? Just how fast do you need this to be? I'm pretty sure dumping 3200 iterations of 3 bytes linearly into memory is going to be pretty zippy, so absent some really pressing performance requirement I'd probably just leave it at that.
To answer why: it's more of a preventative measure. This is the second iteration of my code. The first version, with almost all the features enabled, required overclocking (it seems this is inevitable), and then I had to cut the transfer rate because operations were taking too long. So the more I can optimize now, the better. That said, you may be right that this is fast enough.

dshadoff
Posts: 42
Joined: Wed Apr 28, 2021 3:12 am

Re: Need a 24 bit memset

Tue May 18, 2021 9:45 pm

To be clear, do you need overclocking because of an overall speed limitation, or a specific critical-section slowdown?
How much of your core program (especially critical section) is marked __not_in_flash_func ?

DarkElvenAngel
Posts: 1683
Joined: Tue Mar 20, 2018 9:53 pm

Re: Need a 24 bit memset

Tue May 18, 2021 10:01 pm

The need to overclock is due to a render loop for video generation that all happens in a separate core. Nothing to do with getting data into the buffer faster.

DarkElvenAngel
Posts: 1683
Joined: Tue Mar 20, 2018 9:53 pm

Re: Need a 24 bit memset

Thu May 20, 2021 12:20 am

kilograham wrote:
Tue May 18, 2021 1:39 am
copying a repeating pattern linearly from flash is a truly terrible idea!

// The compiler does an ok job with this.

Code: Select all

void memset12(uint32_t *dest, uint8_t a, uint8_t b, uint8_t c, uint32_t n) {
    uint32_t v0 = ((uint32_t)a << 24) | ((uint32_t)c << 16) | ((uint32_t)b << 8) | a;
    uint32_t v2 = (v0 << 8) | c;
    uint32_t v1 = (v2 << 8) | b;
    n >>= 2; // needs to be multiple of 4
    for(; n>0; n--) {
        *dest++ = v0;
        *dest++ = v1;
        *dest++ = v2;
    }
}
For whatever reason this function fails if n isn't a multiple of 4. For example, with n = 79 the whole thing crashes, whereas with n = 80 it works fine, so I made this adjustment to bring n to a multiple of 4:

Code: Select all

void memset12(uint32_t *dest, uint8_t a, uint8_t b, uint8_t c, uint32_t n) {
    uint32_t v0 = ((uint32_t)a << 24) | ((uint32_t)c << 16) | ((uint32_t)b << 8) | a;
    uint32_t v2 = (v0 << 8) | c;
    uint32_t v1 = (v2 << 8) | b;
    while (n % 4)
    {
        uint8_t* d = (uint8_t*)dest;
        memset(d++, a, 1);
        memset(d++, b, 1);
        memset(d++, c, 1);
        dest = (uint32_t*)d;
        n--;
    }
    if (n == 0) return;
    {
        n >>= 2; // needs to be multiple of 4
        for(; n>0; n--) {
            *dest++ = v0;
            *dest++ = v1;
            *dest++ = v2;
        }
    } 
}
There may be a way to streamline this; feel free to post an improved version :)

Memotech Bill
Posts: 98
Joined: Sun Nov 18, 2018 9:23 am

Re: Need a 24 bit memset

Thu May 20, 2021 7:35 am

Your initial function should not crash if called with n not a multiple of 4, but it would omit the last 1-3 copies. The fact that it is crashing suggests that the pointer you are passing in is not 32-bit aligned.

Your revised version tends to confirm this. If the input pointer were 32-bit aligned, then for n not a multiple of 4 it would no longer be aligned after your initial loop. So I would expect your revised version to work only if the value of n % 4 and the misalignment of the pointer match.

Try something like this (not tested)

Code: Select all

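void memset12(uint8_t *dest, uint8_t a, uint8_t b, uint8_t c, uint32_t n) {
    uint32_t v0 = ((uint32_t)a << 24) | ((uint32_t)c << 16) | ((uint32_t)b << 8) | a;
    uint32_t v2 = (v0 << 8) | c;
    uint32_t v1 = (v2 << 8) | b;
    // Write whole elements byte by byte until dest is 32-bit aligned.
    // The cast to uintptr_t (<stdint.h>) is required: & cannot be applied to a pointer.
    while (((uintptr_t)dest & 0x03) && (n > 0))
    {
        *(dest++) = a;
        *(dest++) = b;
        *(dest++) = c;
        n--;
    }
    // Aligned middle section: four elements (12 bytes) per iteration.
    uint32_t *d = (uint32_t *) dest;
    while ( n >= 4 )
    {
        *(d++) = v0;
        *(d++) = v1;
        *(d++) = v2;
        n -= 4;
    }
    // Remaining 0-3 elements, byte by byte.
    dest = (uint8_t *) d;
    while (n > 0)
    {
        *(dest++) = a;
        *(dest++) = b;
        *(dest++) = c;
        n--;
    }
}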

danjperron
Posts: 3788
Joined: Thu Dec 27, 2012 4:05 am
Location: Québec, Canada

Re: Need a 24 bit memset

Thu May 20, 2021 2:40 pm

Another approach is to initialize the first element of the array and then copy it over!

This is an example with the structure both packed and unpacked. This approach avoids any alignment problem.
The packed and unpacked structs are just there to prove that it doesn't matter.

I use memcpy in a geometric way: copy 1 segment, then 2 segments, 4 segments, 8, ...

Code: Select all

#include <unistd.h>
#include <stdio.h>
#include "pico/stdlib.h"
#include "tusb.h"
#include <string.h>

typedef struct __attribute__((packed))
{
   unsigned char a;
   unsigned short b;
   unsigned char c;
}structpack;

typedef struct
{
   unsigned char a;
   unsigned short b;
   unsigned char c;
}structunpack;

#define ARRAY_SIZE 17
structpack  packArray[ARRAY_SIZE];
structunpack unpackArray[ARRAY_SIZE];

void initStructWithFirst(void * src, int structSize, int ArraySize)
{
   int ArrayCount=1;  // we assume that the first index is initialize
   while(ArrayCount<ArraySize)
   {
      int tcount = ArrayCount * 2;
      if(tcount > ArraySize)
          tcount = ArraySize;
      // ok copy the biggest block possible from the last one
      memcpy((void*) (src + (structSize * ArrayCount)),  src, (tcount - ArrayCount)*structSize);
      ArrayCount = tcount;
   }
}

int main(void)
{
        int loop;
        stdio_init_all();

        cdcd_init();
        printf("waiting for usb host");
        while (!tud_cdc_connected()) {
        printf(".");
        sleep_ms(500);
        }
        printf("\nusb host detected!\n");

        printf("sizeof unpack3:%d\n",sizeof(structunpack));
        printf("sizeof pack3:%d\n",sizeof(structpack));

        // first  initialize first index
        packArray[0].a= 1;
        packArray[0].b= 2;
        packArray[0].c= 3;

        unpackArray[0].a= 1;
        unpackArray[0].b= 2;
        unpackArray[0].c= 3;

         initStructWithFirst(packArray,sizeof(structpack),ARRAY_SIZE);
         initStructWithFirst(unpackArray,sizeof(structunpack),ARRAY_SIZE);

        for(loop=0;loop<ARRAY_SIZE;loop++)
          printf("%02d: %d %d %d   - %d %d %d \n",loop, packArray[loop].a, packArray[loop].b, packArray[loop].c,\
                                                 unpackArray[loop].a, unpackArray[loop].b, unpackArray[loop].c);

        while(true);
        return 0;
}

Code: Select all

Welcome to minicom 2.7.1

OPTIONS: I18n
Compiled on Aug 13 2017, 15:25:34.
Port /dev/ttyACM0, 10:31:00

Press CTRL-A Z for help on special keys


usb host detected!
sizeof unpack3:6
sizeof pack3:4
00: 1 2 3   - 1 2 3 
01: 1 2 3   - 1 2 3 
02: 1 2 3   - 1 2 3 
03: 1 2 3   - 1 2 3 
04: 1 2 3   - 1 2 3 
05: 1 2 3   - 1 2 3 
06: 1 2 3   - 1 2 3 
07: 1 2 3   - 1 2 3 
08: 1 2 3   - 1 2 3 
09: 1 2 3   - 1 2 3 
10: 1 2 3   - 1 2 3 
11: 1 2 3   - 1 2 3 
12: 1 2 3   - 1 2 3 
13: 1 2 3   - 1 2 3 
14: 1 2 3   - 1 2 3 
15: 1 2 3   - 1 2 3 
16: 1 2 3   - 1 2 3 

DarkElvenAngel
Posts: 1683
Joined: Tue Mar 20, 2018 9:53 pm

Re: Need a 24 bit memset

Thu May 20, 2021 7:44 pm

Memotech Bill, the alignment problem is interesting. I don't know how I would test whether that was happening; with the revised code all the memset operations are successful. When it comes to pointers, and correct me if I'm wrong, you can't misalign the start of a uint32_t: if it starts on byte 0 or 1, that's where it starts. I noticed that the n >>= 2 operation gave the same value whether n was 80 or 79, so I suspected an out-of-bounds memory access was to blame. I made sure to do the alignment check first, and then it always works.

Casting the array to a uint32_t pointer on each call is cumbersome, but switching to a void pointer was a casting nightmare inside the function, and it just refused to work because I don't know the right way to cast the pointers and do the math on them.

danjperron, another solid suggestion. That's almost exactly what my modified code does to start with, but then I set 12 bytes at a time after that, so I don't know which is faster, or is it splitting hairs at that point? I like the different approach; it's possibly more portable if, say, the array were to jump to 5 bytes per element. Something to think on!

Memotech Bill
Posts: 98
Joined: Sun Nov 18, 2018 9:23 am

Re: Need a 24 bit memset

Fri May 21, 2021 7:29 am

n >> 2 is the same as (int)(n / 4). So n=80 gives 20, while 79, 78, 77, and 76 will all give 19. So you may be missing up to three values at the end of your buffer.
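
The arithmetic can be made explicit with two small helpers. The names `full_groups` and `leftover` are hypothetical and only illustrate the shift and its remainder.

```c
#include <stdint.h>

/* Number of complete groups of four elements that the word loop writes. */
static uint32_t full_groups(uint32_t n) {
    return n >> 2;
}

/* Elements left unwritten after the word loop (0 to 3). */
static uint32_t leftover(uint32_t n) {
    return n & 3u;
}
```

For n = 79 this gives 19 groups with 3 elements left over, so the last 9 bytes of the buffer are never touched by the word loop.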

The easiest way to check the pointer alignment is to insert the statement:

Code: Select all

printf ("dest = %p\n", dest);
Put this both before and after your first loop. For the pointer to be 32-bit aligned, the last digit of the pointer value must be one of "0", "4", "8" or "C". Any other value is not 32-bit aligned and I would expect a crash using the pointer to write a 32-bit value.

It is perhaps worth remarking that Intel processors are tolerant of misaligned memory access, and will do them in multiple steps, whereas ARM processors are not, and raise a trap if one is attempted.

jahboater
Posts: 7033
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: Need a 24 bit memset

Fri May 21, 2021 7:44 am

Memotech Bill wrote:
Fri May 21, 2021 7:29 am

Code: Select all

printf ("dest = %p\n", dest);
Put this both before and after your first loop. For the pointer to be 32-bit aligned, the last digit of the pointer value must be one of "0", "4", "8" or "C".
For a pointer, alignment may be checked automatically with:

Code: Select all

#define aligned(a,b) (((uintptr_t)(a) & ((b)-1)) == 0)
use like this:

Code: Select all

if( aligned(ptr, 4) )  .....
For anything else, use the C language features _Alignof, _Alignas, etc. (see #include <stdalign.h> for nicer names).
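
As a sketch of how those features fit this thread's problem: declaring the buffer with `alignas` guarantees that its start is safe for 32-bit stores, and a function form of the check can confirm it at run time. The buffer name `frame` and its size are hypothetical, as is the helper `is_aligned`.

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when p is a multiple of align (align must be a power of two). */
static bool is_aligned(const void *p, size_t align) {
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical buffer of 2400 three-byte elements, forced onto a
 * 32-bit boundary so word-wide stores to its start are safe. */
static alignas(uint32_t) uint8_t frame[3 * 2400];
```

Only the start of the buffer is aligned this way; a pointer to an arbitrary element inside it (for example `frame + 3`) still needs the run-time check.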

Also the compiler option may be of interest:

Code: Select all

-mstrict-align
-mno-strict-align
    Avoid or allow generating memory accesses that may not be aligned on a natural object
    boundary as described in the architecture specification.
Memotech Bill wrote:
Fri May 21, 2021 7:29 am
It is perhaps worth remarking that Intel processors are tolerant of misaligned memory access, and will do them in multiple steps, whereas ARM processors are not, and raise a trap if one is attempted.
FYI, in 64-bit mode ARM allows misalignment (by default on Linux anyway), even for SIMD vectors.

Memotech Bill
Posts: 98
Joined: Sun Nov 18, 2018 9:23 am

Re: Need a 24 bit memset

Fri May 21, 2021 7:57 am

With @kilograham's loop, the values v0, v1 and v2 will almost certainly be stored in registers, and the loop is only doing memory writes. As the structure to be replicated gets larger, the ability to store the source copy in registers decreases, and the replication loop will have to both read the source from memory and write the destination.

@danjperron's method of expanding memcpys is clever. It does mean, however, that the source data being copied gets larger each time, and is less likely to fit into cache. For this reason it might be better just to do n memcpys from the same original source copy, so that even if the source data is not in registers it is at least in cache.

pica200
Posts: 274
Joined: Tue Aug 06, 2019 10:27 am

Re: Need a 24 bit memset

Fri May 21, 2021 11:59 am

Memotech Bill wrote:
Fri May 21, 2021 7:29 am
It is perhaps worth remarking that Intel processors are tolerant of misaligned memory access, and will do them in multiple steps, whereas ARM processors are not, and raise a trap if one is attempted.
That's not true. Unaligned access has been supported since at least ARMv6, so for over a decade. I have not checked the Cortex-M0+, but considering its minimalism it probably does not support it.

kilograham
Raspberry Pi Engineer & Forum Moderator
Posts: 608
Joined: Fri Apr 12, 2019 11:00 am
Location: austin tx

Re: Need a 24 bit memset

Fri May 21, 2021 6:19 pm

Note (compiler explorer) https://godbolt.org/z/Y3ed8q1vf is your friend. With -O3 it unrolls the loop once. You could still do better in assembly with stmia.

DarkElvenAngel
Posts: 1683
Joined: Tue Mar 20, 2018 9:53 pm

Re: Need a 24 bit memset

Fri May 21, 2021 7:53 pm

kilograham wrote:
Fri May 21, 2021 6:19 pm
Note (compiler explorer) https://godbolt.org/z/Y3ed8q1vf is your friend. With -O3 it unrolls the loop once. You could still do better in assembly with stmia.
What an interesting tool, thanks for introducing it. I haven't written assembly in a very long time, and even then I wasn't good at it.

Maybe something to add to my list of things to do. I assume all the assembly instructions are in the chip documentation? I've only read through the SDK.

jahboater
Posts: 7033
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: Need a 24 bit memset

Fri May 21, 2021 8:34 pm

DarkElvenAngel wrote:
Fri May 21, 2021 7:53 pm
What an interesting tool, thanks for introducing it. I haven't written assembly in a very long time, and even then I wasn't good at it.
With GCC you can get the assembler, annotated with the source lines, using these flags:

-S -fverbose-asm

gcc -S -fverbose-asm hello.c -o hello.s

kilograham
Raspberry Pi Engineer & Forum Moderator
Posts: 608
Joined: Fri Apr 12, 2019 11:00 am
Location: austin tx

Re: Need a 24 bit memset

Fri May 21, 2021 9:10 pm

Ha, yeah Clang does a better job actually (I really need to find some time to get that running as an option)

https://godbolt.org/z/soxM7j7Mb

although it still inserts a pointless adds r0, #0 in the loop for some reason. yay humanity!
