HermannSW
Posts: 4119
Joined: Fri Jul 22, 2016 9:09 pm
Location: Eberbach, Germany
Contact: Website Twitter YouTube

gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (two samples)

Sat May 15, 2021 4:30 pm

I installed and tested the GMP library (build instructions are in "INSTALL" after unpacking the tar file):
https://gmplib.org/

For this simple example the GMP factorize demo needed 10 seconds on the Pi400 with 32-bit Raspberry Pi OS, and 2 seconds on an Intel i7 under Linux:

Code: Select all

$ time ./factorize `./pexpr "1000000000100011 * (2^61-1)"`
2305843009444303616194470745733461: 1000000000100011 2305843009213693951

real	0m2.163s
user	0m2.106s
sys	0m0.075s
$ 

Interestingly, factoring the 37-digit square of Mersenne prime M₆₁ (=2⁶¹-1) completed in less than 10 minutes on the Intel i7:

Code: Select all

$ time ./factorize -v `./pexpr "(2^61-1) * (2^61-1)"`
5316911983139663487003542222693990401:[trial division] [is number prime?] [pollard-rho (1)] [trial division] [is number prime?] [trial division] [is number prime?]  2305843009213693951 2305843009213693951

real	9m29.180s
user	9m29.155s
sys	0m0.027s
$

The library comes with quite a few demos:
https://gmplib.org/manual/Demonstration ... e-programs

Above I only used these two demos:
‘pexpr’ is an expression evaluator, the program used on the GMP web page.
‘factorize’ is a Pollard-Rho factorization program.

https://en.wikipedia.org/wiki/Pollard%27s_rho_algorithm
[Wikipedia animation of Pollard's rho algorithm]
https://stamm-wilbrandt.de/2wheel_balancing_robot
https://stamm-wilbrandt.de/en#raspcatbot
https://github.com/Hermann-SW/Raspberry_v1_camera_global_external_shutter
https://github.com/Hermann-SW/raspiraw
https://stamm-wilbrandt.de/en/Raspberry_camera.html

Heater
Posts: 18194
Joined: Tue Jul 17, 2012 3:02 pm

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (one sample)

Sat May 15, 2021 4:50 pm

Is there a question here?

Or some new fact we might be unaware of?
Memory in C++ is a leaky abstraction.

User avatar
jahboater
Posts: 7033
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (one sample)

Sat May 15, 2021 5:23 pm

My Core i7 is roughly 3x faster than my Pi4, for the type of work I do.

I think this is reasonable because the Core i7 cost about 40x as much, uses vast amounts of electricity, and is physically huge (a tower case) by comparison.

Heater
Posts: 18194
Joined: Tue Jul 17, 2012 3:02 pm

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (one sample)

Sat May 15, 2021 5:25 pm

That is my experience as well.

HermannSW
Posts: 4119
Joined: Fri Jul 22, 2016 9:09 pm
Location: Eberbach, Germany
Contact: Website Twitter YouTube

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (one sample)

Sun May 16, 2021 12:55 am

I am looking into factoring the first problem, RSA-100 (a 100 decimal digit number), of the RSA factoring challenge:
https://en.wikipedia.org/wiki/RSA_numbers#RSA-100
I use my many Pis often, but for this it is better not to use a Pi but the faster (and much more expensive) i7.

ejolson
Posts: 7231
Joined: Tue Mar 18, 2014 11:47 am

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (one sample)

Sun May 16, 2021 1:45 am

HermannSW wrote:
Sun May 16, 2021 12:55 am
I am looking into factoring the first problem, RSA-100 (a 100 decimal digit number), of the RSA factoring challenge:
https://en.wikipedia.org/wiki/RSA_numbers#RSA-100
I use my many Pis often, but for this it is better not to use a Pi but the faster (and much more expensive) i7.
Have you tried 64-bit ARM? I think there is a significant penalty for this kind of thing in 32-bit mode, especially when using the reverse compatible ARMv6 libraries. Note even if you compile all the code yourself using the ARMv7 target, the GCC runtime library that ships with the Raspberry Pi OS is restricted to ARMv6 instructions.

Even if you set things up for ARMv7 mode in an optimal way, ARMv8 has a full set of 64-bit operations that are missing when the processor runs in 32-bit mode.

My impression is that Intel is much faster for floating point but for integer the Pi 4B is surprisingly competitive.

As an aside, while rho is more popular, my opinion is Pollard's kangaroo algorithm is widely overlooked. A simple example of the kangaroo algorithm for a trivial case was created for this year's Advent of Code

viewtopic.php?p=1832037#p1832037

Later on in that same thread are some timing comparisons between 32-bit and 64-bit mode on the Pi.

cleverca22
Posts: 3741
Joined: Sat Aug 18, 2012 2:33 pm

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (one sample)

Sun May 16, 2021 2:06 am

HermannSW wrote:
Sun May 16, 2021 12:55 am
I am looking into factoring the first problem, RSA-100 (a 100 decimal digit number), of the RSA factoring challenge:
https://en.wikipedia.org/wiki/RSA_numbers#RSA-100
I use my many Pis often, but for this it is better not to use a Pi but the faster (and much more expensive) i7.
what kinds of math operations does that use at a lower level?

Code: Select all

for (int i=0; i<16; i++) { int temp = a[i] * b[i]; if (store) c[i] = temp; if (accumulate) accumulator[i] += temp; }
the rpi has hardware capable of doing this entire statement in 2 clock cycles at 500mhz, for fairly sustained periods of time, with provisions to interleave load/store operations in between

other ops are available, but not division

edit:
at a glance, the simplest option i can see, is to just pick a pair of random numbers, mult them, and see if you need to make one of them bigger/smaller?
the vector unit i mentioned, is capable of both normal (32bit*32bit->32bit) and high-word (32bit*32bit->64bit then >>32) multiplication, so 64bit mults are possible

bigint stuff would need more computation that i've not looked into
some very rough math says that's about 125 million 64bit int mults per sec, not sure how that compares to the power of the arm or qpu cores...

HermannSW
Posts: 4119
Joined: Fri Jul 22, 2016 9:09 pm
Location: Eberbach, Germany
Contact: Website Twitter YouTube

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (two samples)

Sun May 16, 2021 7:43 am

Above I showed that the Intel i7 was able to factor the square of Mersenne prime M₆₁ (=2⁶¹-1) in less than 10 minutes.
Now I started the Pi400, left it otherwise idle, and ran the same computation in an ssh session.
The Pi400 needs a bit less than 50 minutes, so the i7-to-Pi400 factor seems to be about 5 for bigint computations.

Code: Select all

pi@raspberrypi400:~/mpp/gmp-6.2.1/demos $ export PS1=$\ 
$ time ./factorize `./pexpr "(2^61-1)*(2^61-1)"`
5316911983139663487003542222693990401: 2305843009213693951 2305843009213693951

real	49m13.154s
user	49m11.625s
sys	0m0.407s
$ 
cleverca22 wrote:
Sun May 16, 2021 2:06 am
what kinds of math operations does that use at a lower level?
bigint operations:
https://gmplib.org/manual/Integer-Functions

For the example just shown, the number to be factorized had a bit less than 128 bits.
But the RSA-100 number to be factored has 100 decimal digits (330 bits)!
https://en.wikipedia.org/wiki/RSA_numbers#RSA-100

So 32-bit/64-bit performance does help, but only inside the bigint operations, which deal with much longer numbers.
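
Just to illustrate where the word size enters, here is a small sketch of my own (assuming GMP is installed and linked with -lgmp): it prints how many limbs (machine words) a 330-bit number occupies, which should report 11 on 32-bit Raspberry Pi OS but only 6 on a 64-bit system.

Code: Select all

/* sketch: how many limbs (machine words) does a 330-bit number need? */
#include <stdio.h>
#include <gmp.h>

int main (void)
{
  mpz_t n;
  mpz_init (n);
  mpz_ui_pow_ui (n, 2, 329);   /* 2^329 is a 330-bit number */

  printf ("bits per limb: %d\n", mp_bits_per_limb);
  printf ("limbs used:    %zu\n", mpz_size (n));

  mpz_clear (n);
  return 0;
}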

The GNU GMP library was started last century; the current stable release is from 11/2014.
This is "gmp-6.2.1/demos/factorize.c":

Code: Select all

/* Factoring with Pollard's rho method.

Copyright 1995, 1997-2003, 2005, 2009, 2012, 2015 Free Software
Foundation, Inc.

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 3 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program.  If not, see https://www.gnu.org/licenses/.  */


#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <inttypes.h>

#include "gmp.h"

static unsigned char primes_diff[] = {
#define P(a,b,c) a,
#include "primes.h"
#undef P
};
#define PRIMES_PTAB_ENTRIES (sizeof(primes_diff) / sizeof(primes_diff[0]))

int flag_verbose = 0;

/* Prove primality or run probabilistic tests.  */
int flag_prove_primality = 1;

/* Number of Miller-Rabin tests to run when not proving primality. */
#define MR_REPS 25

struct factors
{
  mpz_t         *p;
  unsigned long *e;
  long nfactors;
};

void factor (mpz_t, struct factors *);

void
factor_init (struct factors *factors)
{
  factors->p = malloc (1);
  factors->e = malloc (1);
  factors->nfactors = 0;
}

void
factor_clear (struct factors *factors)
{
  int i;

  for (i = 0; i < factors->nfactors; i++)
    mpz_clear (factors->p[i]);

  free (factors->p);
  free (factors->e);
}

void
factor_insert (struct factors *factors, mpz_t prime)
{
  long    nfactors  = factors->nfactors;
  mpz_t         *p  = factors->p;
  unsigned long *e  = factors->e;
  long i, j;

  /* Locate position for insert new or increment e.  */
  for (i = nfactors - 1; i >= 0; i--)
    {
      if (mpz_cmp (p[i], prime) <= 0)
	break;
    }

  if (i < 0 || mpz_cmp (p[i], prime) != 0)
    {
      p = realloc (p, (nfactors + 1) * sizeof p[0]);
      e = realloc (e, (nfactors + 1) * sizeof e[0]);

      mpz_init (p[nfactors]);
      for (j = nfactors - 1; j > i; j--)
	{
	  mpz_set (p[j + 1], p[j]);
	  e[j + 1] = e[j];
	}
      mpz_set (p[i + 1], prime);
      e[i + 1] = 1;

      factors->p = p;
      factors->e = e;
      factors->nfactors = nfactors + 1;
    }
  else
    {
      e[i] += 1;
    }
}

void
factor_insert_ui (struct factors *factors, unsigned long prime)
{
  mpz_t pz;

  mpz_init_set_ui (pz, prime);
  factor_insert (factors, pz);
  mpz_clear (pz);
}


void
factor_using_division (mpz_t t, struct factors *factors)
{
  mpz_t q;
  unsigned long int p;
  int i;

  if (flag_verbose > 0)
    {
      printf ("[trial division] ");
    }

  mpz_init (q);

  p = mpz_scan1 (t, 0);
  mpz_fdiv_q_2exp (t, t, p);
  while (p)
    {
      factor_insert_ui (factors, 2);
      --p;
    }

  p = 3;
  for (i = 1; i <= PRIMES_PTAB_ENTRIES;)
    {
      if (! mpz_divisible_ui_p (t, p))
	{
	  p += primes_diff[i++];
	  if (mpz_cmp_ui (t, p * p) < 0)
	    break;
	}
      else
	{
	  mpz_tdiv_q_ui (t, t, p);
	  factor_insert_ui (factors, p);
	}
    }

  mpz_clear (q);
}

static int
mp_millerrabin (mpz_srcptr n, mpz_srcptr nm1, mpz_ptr x, mpz_ptr y,
		mpz_srcptr q, unsigned long int k)
{
  unsigned long int i;

  mpz_powm (y, x, q, n);

  if (mpz_cmp_ui (y, 1) == 0 || mpz_cmp (y, nm1) == 0)
    return 1;

  for (i = 1; i < k; i++)
    {
      mpz_powm_ui (y, y, 2, n);
      if (mpz_cmp (y, nm1) == 0)
	return 1;
      if (mpz_cmp_ui (y, 1) == 0)
	return 0;
    }
  return 0;
}

int
mp_prime_p (mpz_t n)
{
  int k, r, is_prime;
  mpz_t q, a, nm1, tmp;
  struct factors factors;

  if (mpz_cmp_ui (n, 1) <= 0)
    return 0;

  /* We have already casted out small primes. */
  if (mpz_cmp_ui (n, (long) FIRST_OMITTED_PRIME * FIRST_OMITTED_PRIME) < 0)
    return 1;

  mpz_inits (q, a, nm1, tmp, NULL);

  /* Precomputation for Miller-Rabin.  */
  mpz_sub_ui (nm1, n, 1);

  /* Find q and k, where q is odd and n = 1 + 2**k * q.  */
  k = mpz_scan1 (nm1, 0);
  mpz_tdiv_q_2exp (q, nm1, k);

  mpz_set_ui (a, 2);

  /* Perform a Miller-Rabin test, finds most composites quickly.  */
  if (!mp_millerrabin (n, nm1, a, tmp, q, k))
    {
      is_prime = 0;
      goto ret2;
    }

  if (flag_prove_primality)
    {
      /* Factor n-1 for Lucas.  */
      mpz_set (tmp, nm1);
      factor (tmp, &factors);
    }

  /* Loop until Lucas proves our number prime, or Miller-Rabin proves our
     number composite.  */
  for (r = 0; r < PRIMES_PTAB_ENTRIES; r++)
    {
      int i;

      if (flag_prove_primality)
	{
	  is_prime = 1;
	  for (i = 0; i < factors.nfactors && is_prime; i++)
	    {
	      mpz_divexact (tmp, nm1, factors.p[i]);
	      mpz_powm (tmp, a, tmp, n);
	      is_prime = mpz_cmp_ui (tmp, 1) != 0;
	    }
	}
      else
	{
	  /* After enough Miller-Rabin runs, be content. */
	  is_prime = (r == MR_REPS - 1);
	}

      if (is_prime)
	goto ret1;

      mpz_add_ui (a, a, primes_diff[r]);	/* Establish new base.  */

      if (!mp_millerrabin (n, nm1, a, tmp, q, k))
	{
	  is_prime = 0;
	  goto ret1;
	}
    }

  fprintf (stderr, "Lucas prime test failure.  This should not happen\n");
  abort ();

 ret1:
  if (flag_prove_primality)
    factor_clear (&factors);
 ret2:
  mpz_clears (q, a, nm1, tmp, NULL);

  return is_prime;
}

void
factor_using_pollard_rho (mpz_t n, unsigned long a, struct factors *factors)
{
  mpz_t x, z, y, P;
  mpz_t t, t2;
  unsigned long long k, l, i;

  if (flag_verbose > 0)
    {
      printf ("[pollard-rho (%lu)] ", a);
    }

  mpz_inits (t, t2, NULL);
  mpz_init_set_si (y, 2);
  mpz_init_set_si (x, 2);
  mpz_init_set_si (z, 2);
  mpz_init_set_ui (P, 1);
  k = 1;
  l = 1;

  while (mpz_cmp_ui (n, 1) != 0)
    {
      for (;;)
	{
	  do
	    {
	      mpz_mul (t, x, x);
	      mpz_mod (x, t, n);
	      mpz_add_ui (x, x, a);

	      mpz_sub (t, z, x);
	      mpz_mul (t2, P, t);
	      mpz_mod (P, t2, n);

	      if (k % 32 == 1)
		{
		  mpz_gcd (t, P, n);
		  if (mpz_cmp_ui (t, 1) != 0)
		    goto factor_found;
		  mpz_set (y, x);
		}
	    }
	  while (--k != 0);

	  mpz_set (z, x);
	  k = l;
	  l = 2 * l;
	  for (i = 0; i < k; i++)
	    {
	      mpz_mul (t, x, x);
	      mpz_mod (x, t, n);
	      mpz_add_ui (x, x, a);
	    }
	  mpz_set (y, x);
	}

    factor_found:
      do
	{
	  mpz_mul (t, y, y);
	  mpz_mod (y, t, n);
	  mpz_add_ui (y, y, a);

	  mpz_sub (t, z, y);
	  mpz_gcd (t, t, n);
	}
      while (mpz_cmp_ui (t, 1) == 0);

      mpz_divexact (n, n, t);	/* divide by t, before t is overwritten */

      if (!mp_prime_p (t))
	{
	  if (flag_verbose > 0)
	    {
	      printf ("[composite factor--restarting pollard-rho] ");
	    }
	  factor_using_pollard_rho (t, a + 1, factors);
	}
      else
	{
	  factor_insert (factors, t);
	}

      if (mp_prime_p (n))
	{
	  factor_insert (factors, n);
	  break;
	}

      mpz_mod (x, x, n);
      mpz_mod (z, z, n);
      mpz_mod (y, y, n);
    }

  mpz_clears (P, t2, t, z, x, y, NULL);
}

void
factor (mpz_t t, struct factors *factors)
{
  factor_init (factors);

  if (mpz_sgn (t) != 0)
    {
      factor_using_division (t, factors);

      if (mpz_cmp_ui (t, 1) != 0)
	{
	  if (flag_verbose > 0)
	    {
	      printf ("[is number prime?] ");
	    }
	  if (mp_prime_p (t))
	    factor_insert (factors, t);
	  else
	    factor_using_pollard_rho (t, 1, factors);
	}
    }
}

int
main (int argc, char *argv[])
{
  mpz_t t;
  int i, j, k;
  struct factors factors;

  while (argc > 1)
    {
      if (!strcmp (argv[1], "-v"))
	flag_verbose = 1;
      else if (!strcmp (argv[1], "-w"))
	flag_prove_primality = 0;
      else
	break;

      argv++;
      argc--;
    }

  mpz_init (t);
  if (argc > 1)
    {
      for (i = 1; i < argc; i++)
	{
	  mpz_set_str (t, argv[i], 0);

	  gmp_printf ("%Zd:", t);
	  factor (t, &factors);

	  for (j = 0; j < factors.nfactors; j++)
	    for (k = 0; k < factors.e[j]; k++)
	      gmp_printf (" %Zd", factors.p[j]);

	  puts ("");
	  factor_clear (&factors);
	}
    }
  else
    {
      for (;;)
	{
	  mpz_inp_str (t, stdin, 0);
	  if (feof (stdin))
	    break;

	  gmp_printf ("%Zd:", t);
	  factor (t, &factors);

	  for (j = 0; j < factors.nfactors; j++)
	    for (k = 0; k < factors.e[j]; k++)
	      gmp_printf (" %Zd", factors.p[j]);

	  puts ("");
	  factor_clear (&factors);
	}
    }

  exit (0);
}

The bigint operations need temporary variables a lot, e.g.:

Code: Select all

	      mpz_mul (t, x, x);
	      mpz_mod (x, t, n);
	      mpz_add_ui (x, x, a);

This is really "x = (x * x) % n; x += a;".

I rewrote the interesting part of the Pollard rho function with standard C operators, so that the algorithm becomes clearer. I stop after the first factor is found, because that is all that is needed to break RSA ... ;-) (if done fast enough)

Code: Select all

void
factor_using_pollard_rho (mpz_t n, unsigned long a, struct factors *factors)
{
  t = t2 = 0;
  y = x = z = 2;
  P = 1;
  k = 1;
  l = 1;

      for (;;)
	{
	  do
	    {
	      x = (x*x) % n;
	      x += a;

	      P = (P*(z-x)) % n;

// do costly gcd() call only every 32nd round
// do it on product of all diffs P, so don't miss something 
	      if (k % 32 == 1)    
		{
		  t = gcd(P, n);
		  if (t != 1)
		    goto factor_found;
		  y = x;
		}
	    }
	  while (--k != 0);

	  z = x;
	  k = l;
	  l *= 2;
	  for (i = 0; i < k; i++)
	    {
	      x = (x*x) % n;
	      x += a;
	    }
	  y = x;
	}

    factor_found:
      do
	{
	  y = (y*y) % n;
	  y += a;

	  t = gcd(z-y, n);
	}
      while (t == 1);

STOP, factor t of n found (maybe composite)
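
The pseudocode above is not compilable as is, so here is a minimal runnable sketch of the same Brent-style loop on native 64-bit integers (unsigned __int128 for the intermediate products). The helpers gcd64/mulmod/submod and the test number in main() are my own additions, not part of the gmp demo; n must be composite (the real demo guarantees this by trial division and a primality test first) and fit in 64 bits:

Code: Select all

/* Brent-style Pollard rho on native 64-bit integers -- control-flow sketch */
#include <stdint.h>
#include <stdio.h>

static uint64_t gcd64 (uint64_t a, uint64_t b)
{ while (b) { uint64_t t = a % b; a = b; b = t; } return a; }

static uint64_t mulmod (uint64_t a, uint64_t b, uint64_t n)
{ return (uint64_t) ((unsigned __int128) a * b % n); }

static uint64_t submod (uint64_t a, uint64_t b, uint64_t n)
{ return a >= b ? a - b : n - (b - a); }

static uint64_t rho_brent (uint64_t n, uint64_t a, uint64_t x0)
{
  uint64_t x = x0 % n, y = x, z = x, P = 1, t = 1;
  uint64_t k = 1, l = 1;

  for (;;)
    {
      do
        {
          x = (mulmod (x, x, n) + a) % n;
          P = mulmod (P, submod (z, x, n), n);
          if (k % 32 == 1)              /* costly gcd only every 32nd round */
            {
              t = gcd64 (P, n);
              if (t != 1)
                goto factor_found;
              y = x;
            }
        }
      while (--k != 0);

      z = x;
      k = l;
      l *= 2;
      for (uint64_t i = 0; i < k; i++)
        x = (mulmod (x, x, n) + a) % n;
      y = x;
    }

 factor_found:
  do                                    /* backtrack from saved y */
    {
      y = (mulmod (y, y, n) + a) % n;
      t = gcd64 (submod (z, y, n), n);
    }
  while (t == 1);

  return t;                             /* factor of n, maybe composite or n */
}

int main (void)
{
  uint64_t n = 1000000000100011ULL * 10007ULL;   /* small odd composite */
  printf ("%llu\n", (unsigned long long) rho_brent (n, 1, 2));
  return 0;
}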


HermannSW
Posts: 4119
Joined: Fri Jul 22, 2016 9:09 pm
Location: Eberbach, Germany
Contact: Website Twitter YouTube

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (one sample)

Sun May 16, 2021 1:04 pm

Wow. I added command line options "-i" for passing the initial value of x (default 2) and "-a" for passing the (signed) value of a (default 1). And I made factorize.c always call the Pollard rho algorithm. Finally, output now ends after the first (maybe composite) factor > 1 has been found, instead of after the complete factorization. Then I played again with products of Mersenne primes:
https://en.wikipedia.org/wiki/Mersenne_prime
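
For reference, this is roughly how such options can be parsed; a hypothetical standalone sketch in the style of the existing strcmp() loop in main(), not the actual gist code (the real code works on mpz_t values; see the gist linked in the next post for the real changes):

Code: Select all

/* hypothetical sketch of "-i <x0>" and "-a <a>" option parsing */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main (int argc, char *argv[])
{
  unsigned long start_x = 2;   /* "-i": initial value for x */
  long param_a = 1;            /* "-a": additive constant a */

  while (argc > 2)
    {
      if (!strcmp (argv[1], "-i"))
        start_x = strtoul (argv[2], NULL, 0);
      else if (!strcmp (argv[1], "-a"))
        param_a = strtol (argv[2], NULL, 0);
      else
        break;
      argv += 2;
      argc -= 2;
    }

  printf ("x0=%lu a=%ld, %d number(s) left to factor\n",
          start_x, param_a, argc - 1);
  return 0;
}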

I ended up testing squares of Mersenne primes as before and got factorizations of big numbers in small fractions of a second, here for the square of M₅₂₁=2⁵²¹-1 in 0.044 seconds! M₅₂₁ has 157 decimal digits:

Code: Select all

$ time ./factorize -i 17 -a 0 `./pexpr "(2^521-1)^2"`
47125446914534694131579097993419809976955095716785201420286055195012674566357244479460731079205201122720511132925006540350105785156086431086764996857554304847155991333706718342307167456986269662311038377104760933477381254100896222805785374204495333936040246318307567782851014765052850751581472024524956029996236801: 6864797660130609714981900799081393217269435300143305409394463459185543183397656052122559640661454554977296311391480858037121987999716643812574028291115057151 * ...
real	0m0.044s
user	0m0.015s
sys	0m0.038s
$ 

I went further, to the square of the 6002-decimal-digit number M₁₉₉₃₇=2¹⁹⁹³⁷-1 -- that got factored in 16 seconds (on the Intel i7)!!

Code: Select all

$ time ./factorize -i 17 -a 0 `./pexpr "(2^19937-1)^2"`
186228911819126646425241645871047684422770000292785705361761330680456922395749954769728500549150047177383844481412037085646261393546553923859456396002670053755564804484680951279181313557049631112365387588854966784748078147764829632258164747591609964106079676487898566541298571683892919735395334337715228500473378268135063463213736327539576049746569264380950193366799557200262061824040131656444251467227185909807310012348881926595441892598137678402096637315405016060101087830968434610339713232864719504245029796544431830969610378083445115823803995494929390484014411086690678595919966127606195171631363514955356171468320747991062037255138764641704472349330058827196552526298726052088059248423693622338084804939188354996924441796443723757556138984977243755719107803769253836156768164105831496688438465361951446607763655708870219743351220872074280539529939832864697654558937860698600837844813720180036223707720884910330545008460944487163275253243686678077720693528893395360909429285066210990243436875761684866518816785031036566596431894523103068380421611744136625115791849386369652299165799435602719369070839983446396096513573991198033599170910873904638033114782595459436103429886698164720991809421275622494201062928764453917001417402307629046867995345770706431520411508780970925882254921186490574115897798227321946556756848165958289748922016665625853447901624406442959934095355968764455594068665165939005363234282909181849260901386022373300827185281103803897131173026762705299571359056858743225641000118858219592582187082278763112316714503694505230746400232752255857450231540299150280810888937828718654236334320404429481489174612012854090868375820413434335003734142142997447127485173254588299525433216373760231472164450162005496377917002603242579970522357031550376615313996164201015993814585266510936030634640667779014006527731625168130714783376607879987041108726790349157493338029617965223423531448651566331498132787756270276223245514448260646596377250160539593847991858531521623643445841973631889586192276387576917740110081259601609099658625261156881906700312111218021715786191142226775867259850866488038225639304050217991962648970112566691733803914876267203710638509505468663514912542989141856553508556566805787588572394829737473269890031789046529131105981844159790797943182324597956409068872200461868961347867236448987761483733986340736605475426075816996570619698331851971031681018194221050291790917324981682328635580266426562961893622472537463628386115928564003782856609105930888612722707820285296661149797183501516291150903073804605348767249570577307090665345616426658704531728466259987159385626003606163696084535507767791393120242427683830900993476105367787028755050981317319839483048115033360244182163941131348439552207990021337338100955819510566397313020485945160834653202890737058018622905287458488086792081435131537047983089619963170001346110251063325460572494728357201712721304921171338727202704251697259986269626302269681285508000045064321501677939663730743940772016890616257102599233139996135325960968707591332758250738266101014831282830264125597812008000035827427220356456991244008478614754642611840800113265107350858988493740793295190500575446099250232046089346898724744716624128661064568616086319331938910614959166945517452269690552203118799939626675798829499350615314442909049956405761936247202108870408583634796589864809168642799500185549457841251068901199762644672728487934048338697722996978266249802019198869988943224739853561544853001560032785420889498169269149189958751973580302388932914428376554157466461778463942078911141270747357780
123270479713634839641821970368207089235096363031433688713238990457529219116481798242623587431021599355196396081705794125779039582405874337777269893276668293369187984014317541779909345022386431627279452066496468779762834147845287915970038628441391957089618833749992197470193926549268107949524235327877078182872975197695676788888671842888028180554966846383286198717856064435300191762223362049536387249104059485107606384614856593598522967336277185994509244467185012046935241395613535378171518982030249385217890885420598867537331521091911606831045341357362657003289473347140635064039499991260149211209205044760789044200061924773626337631532286383075970368665249554997695356369934179849539095329457972778817132231129352036860328484955775853151624155610592665119350056547040451813567856509310424456863713864274633206067776319159980084905489657386736719966386339504387262582437355910002647312646671403531233580453040944786414481583087511168200330385931254379245093607086432324841676602657601635208750489097992912304764860303703917859012005267560947529109392564442895271077301887145710673889835830982167637878539359855965699640071299095731213409120846371519944517533581132001300140803643901891794623587780028220647389058327833515839368639298216324982496023766641314712487100159293160286665098229680667725077979034232324084647991991605715381595852125057702119660849212655220111532808607403929215531068422103493011611759063047150894545968044855821093985990973167194637915559200399921558862705286738892407314130086237638219188737325379962707703230440251021896971703955256051168063019394889239494015774027820862031315359217113907106088952978584505741108564865376468773749737147576512403741796779884514926020013550833173758716142010781843088873189292619113937875969702116308608148077047723073872582562883572682122668737139184302844509317150289574646495437024006266224885067320031029846893623899917616296009388526036430879630820058102698268300255934001140885524999569060251082177601726254355740658398334335214469780102308779694684302446976755597089502849406091963449852961864267355513240697028055685892595944691573419686801235077544749864121441866310284711647278225619630968650351927269359653436612429286098387448937076350191745666989644443365137749123505819255731650254387215502812552953279356864622397504386572039761001118793449742354962507199993530642772146179830376209419384300109106541320165054717779801294613345822039532579423973183197914122507784602821907110457352158661050941180752424533874476694318544869829423950659006260467240966673319877137143127905463740333729894002328202697952636228293509166185059979478625200876829782999530487793951488757636826673196708566993629748422054383806623740755635207399958342708250378489317256557195116788810416244014582876352998486729290685544918714716924608969836588358594996425004934714747989982530125904506505525841982421274742331137708142034850784877979063727480027091413970703446478684697098843512314732763995430874002910547709097827903144666016298761456193019324627113441326434642820052267386514345708177615680310493302235672654458629670490864613626986851156954978612389632201294852904626656386511612443301946964978349771049412145197461575082703465727134790827076357740718283820897163987405449106405748504403475400504085283458997702526840124262277766086628813216189464125950061805239971166229675346383108728676761682757373975963754514611407597549623870787905241622052435475873108306701000263056217851076630252772442542367264574054486884456073495615253708720223154974919305406323129267671744162689685490376248029847080466
668177782285799910443446414244104569501852689899726364544549779603160451498386014342489232753102857562600795869368591316346453176680122202535989916137725740217256771326284176054361986474226652532049125099912304011795340024056830600275189968152373229920296454583973974206408274100768078904639384619863672769358374413619186572881265052746003478998105669325883146262985536267877918641470767017036607915192893706892361377210379321353286688313951802000782867030534907049934537266297685014942490368283146957274846868021251477326589761133205410127234746185083128957234400439261774669913543404961089551675677605401060854140001053004910166858808323542388460662492903723652147605325838408555904224000390606056210406669925747470529580114651505184500295391170485088185634978177443619282083535672312982858504237824075250110220584987442899304543392747679850568654757811057908166951067457840929222308287434399709929411874780696050942741862733306954501651858906536869627294796657386265822652742541914999387100579805789802309384978163752196354718955400994849408653090459186623398078178436601141722797712448543411913096996205685669906953970631483511532842551190696468207581344125845190777428671607261206139106580806048813612747996061271872160136919763929602550670026987715396273455446076801618989179155571476366400344138322523283741609031231820411190252511743558224297716084211167684336433778366617039450580354186678555691140683408552817513338693080690888326850728809012539982230572116352122766750270153087749320430613065311907874941605727843418788317503899684764191401942349379540447108556146080389933968112317498431414531040848711713262041498756646581676379111824502700164405445302626449062997537548217594588015966193695694730376252351295393335361540648430219542491968702022710273323600906304768871489512594758030099141018601987614729627307778178376475624853710657955387187414952871653690993762132209409678933483100792180941557032754952854824979228373674974287648299258181589884161079857969880794618410431212027469188472392435356805773808839473191667981246905469149604766347420014878611426025425116638159368482679504149878554065882958604235837187653071851571860087545912056621059870270892839330577089619702734768506951171483337609418038400437082776051403630347438388580888288162624415795264591629229515721752962446882230456187226269740977072134151445805807017467296994909889602730997693671908605665322248058747826495330532613768863803415923632921716310172221699697394365645662675492398828361860269984988740607104387784113478247647532132010939096581862968120323880268721669459088807577103879070288954026088335222891239725244114993190445011196373446410228670791161947283181092792343622675660695098779832403632421755478984515039280371977549773590631233572002999642768479422163417540234043680757568554100361862322239799942185113753020817671331994294903969349990904543888803424156601378813329414915157453091391773467667535559972662827331986587650292744528014908000028691924768942240016330786128304858667468814336297944825818169957944335177790561157145285000140976590942550935424799735225760460140411244253258099744866701738938288394223850261055401034916215213019150802664538520277347493026123809246081269985309399714006649738838345700959976012453742232213080773846943619904963677766627039843224783560760693035756375953026914751962376339031451693767831167993653132729176488656614583147094887604420751483955363108592118430370499769475750101477453532575455144896660030627544383046750295155356580053198871966963926046944649956293059110020199134186254909460353991268256229849694687
0882439355066022690572481097589339427515812571053529897322703426490514701285625686349945396419248252976509789543318223071053092612581669663576707314741755911835287723850984262981201135148721047724945927244057110738731122863487178488180150492900211600861631596093241526616163707394948189209413666349128803794747105643761309790832807422941332143059995298032020064227343731210150950960401990245075488361270435371759765097552671507199076413017964329706839832092068098847963831575127664509493936287896389598202359380302761924101100021122946200551405915375500825167390949698518921619286485087404507349053867540830586661330932509149584085218913476675279636159133933938577652102228163614754739145164140069306863633789312955343269365507568726101525379246445628252156969026496950778251292057150305905909958451873492089177966921495470149589292491065580092247506148574041177156405152126382437284308987520313316553315609506364509789137357423946016061411923462362393134824230467258333160415434881262255580233616157290559913234376996499793320386785775853344914594756398400239801459631467670952270503804167446511411074367340160584329838828907770112151962590527083766213716887566970036433317179705604681989287145793042047738951758232510330804760083103793166943593119709832337856479039023594696405517658611639189270635121966737985500283079747330549575843841: 431542479738816264805523551633791983905393504322671150516525054140333068013765809113045136293185846655452699382576488353179022173345844139095282691546091680190078753437413962968019201144864809026614143184432769803000667281049840954515881760771329698437621346217903963913412852056276196005131066463766486159942366754865374802419643502959351686623639090479483476923139783013778207857124190544743328445291831729732423108882650813216264694510777078122828294447750226804880578200287646593991647662652009005614958003440543536903898628940617928720111208336148084474829135473283672778795656483078469091169458662301697024012602401870287466500334457745703154312929960251877807901193759028631710841496424733789862675033089613749057663409052895722900160380005716308751913739795550474681543332534749910462481325045163417965514705754814592008594726148362138755571168644457897508862779964873043084504842234206292665185560243393391908443689210184248446770427276646018529149252772809226975384267702573339289544012054658956103476588553866339025462899621326432824257480357862335806081546965469325638333276707698994397748885266872785274510029630591469638757154257355344759797344631006783673933274021499309687782967413915145996023742136298987206114314104021472389980909628189158906456939344833309941696322958779958489933667470148717634948055499961630515412254034652970077211462313557040814930986630657336771911728539870957481678162560842128233801686253345864312540346708061352735432707144788768618619833207772806448066911257131972625817631513135964295477635763678370193498351784621442949607571909180546251141436663841894338525764522893476524546315357404687862289458856546085620580424689873724369214450923153776984071681983765382377486141962070415481063793651231928179990066217664671671134716327154817958770053826943934004030617004576911353491878748889234293493401451705717161811257958888892774954269771499145496239163940148229850253316515114312788020090568084565068188772666098316368838849056218222629339865486456690806721917047404088913498356856624280632311985204368263294152907529727983434294465099922063687813671540917026557727273913294242775293490826005858847665231509574170778319100161684756856586731928608820701797603072698499873
548360423717346602576943472355063017441188741412924389581415491006097522168822308876114319964723308423801371109274494835578150375868496445857499177728699267442183696211376751010832785437940817490940910430840967741447084363242794768920562004272279616386691498054898311212446763999319553714840128863607487064795686690485747828552170547401139459296221775025755658110674522014489819919686359653615516812739827407601388996388203187763036687627301575846400427988806918626402686126861808838749395738181250222796899302674462557739595424698316378630001712792271514060341299021815706596505326007758236773981821290873944498591827499990072235924233345678506711865688391867477049600162775406253314406190191299837899147125153652003360579935086016788076875685623778570952555413049029271922201841725023571244499118702106426945650613849193734743245039662677990384023867816868099620158790905865494235046991907435195510437225445157409678290843360259382257807308802738552615519720440756203267806244488034909982321612316877947156134057932495455095280525180101230872587789741158170482455889714385967544080813134383755029887267395233752966416155014060916079832292398272406147832528924797165199369895191878086812211916417477109024806334910917048274412282811866324459071457871383512348422613800746219140048181523866660431333448750679035828382835626880832365754820684796395463838195321745225026823724413632757658756091197836532983120667082171493167735643403792897243939867441398918554166122957393566686126582712346964383771228389980401997390780614436754156710784634046737024037776534781733670848447347020568666361581380036922533822099094664695919301616260979205087421756703065051395428607508061598353575410321470950842784610567013677397949320242029987077310176925820462107022125141204293225304317896162670477761151235979354041470848709854654265027720573009003338479053342506041195030300017040028878929414046033458699263675013550949427505525915816399805231906796107849935808966832992976812624423140086570334218680945517405064488290392073167113076951318922965935090186230948105575195603052407871638092191644337545148633010009159169858562421765636247713289816785482462973762495302513603634127683664561750770319774575349128064331765399959943433081184701471587128161493944212766142282629099500557469810532066100015602957846566161932522694120268311595089496715138451958832171479827488792618514178199790344172855986077272208666776804260903087548238033454465663056192413083744527546681430154877108777280110860043258922622594139682852834970455710627577014217615652627251534074076254051499319894944591064146605343053785767098625200498648809611448692586034737143636591940139627063668513892996928694918051725568185082988249549548157960631695176587414201597987542734280267234524812635691573072131537397810416276537150785985041547972876631229467113481585294188164328250444666927811374744948983850643757875073764963451486253063833915551456900878919553159944629444932352488175999071191357559333821217061914771850549366322111572229203311485024875633031180188056850735698415805181187107786539535712960143729408652704070219243831672903232315679122894194862405940390744523216780193818712190921554607684445735785595136133042422061513564575139372709390097072378271012458538376783381610233975868548942306960915402499879074534613119239638529507547580582056259566008177430071917468126559550217476709224608667477445208756078590623347506270983285934800677894561696024943928137634956575998474857735539909575573132008090408300364464922194099340969487305474943012161656867507357495558823403039898746729754550609
57736921559195480815514035915707129930057027117286252843197413312307617886797506784260195436760305990340708481464607278955495487742140753570621217198252192978869786916734625618430175454903864111585429504569920905636741539030968041471 * ...
real	0m16.097s
user	0m16.067s
sys	0m0.036s
$ 

The default Pollard rho polynomial is x^2+a with a=1 -- it seems a=0 is perfect, at least for factoring products of Mersenne primes ...

I will now create a factorize.c gist starting with the original gmp demo, and then add the current version with the new command line options as a 2nd revision.

HermannSW
Posts: 4119
Joined: Fri Jul 22, 2016 9:09 pm
Location: Eberbach, Germany
Contact: Website Twitter YouTube

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (one sample)

Sun May 16, 2021 1:27 pm

Here is the modified "factorize.c" for you to play with:
https://gist.github.com/Hermann-SW/6bd1 ... 20dada7096
See under "Revisions" for the few changes I made.

I compiled the new factorize.c on the Pi400 as well. The square of the 6002-decimal-digit number M₁₉₉₃₇ takes 1:19 min to factor (a number with about 12000 decimal digits!), compared to 16 seconds on the i7, again roughly a factor of 5.

Factoring the square of the 157-decimal-digit M₅₂₁=2⁵²¹-1 is done in 0.149 seconds:

Code: Select all

pi@raspberrypi400:~/mpp/gmp-6.2.1/demos $ time ./factorize -i 17 -a 0 `./pexpr "(2^521-1)^2"`
47125446914534694131579097993419809976955095716785201420286055195012674566357244479460731079205201122720511132925006540350105785156086431086764996857554304847155991333706718342307167456986269662311038377104760933477381254100896222805785374204495333936040246318307567782851014765052850751581472024524956029996236801: 6864797660130609714981900799081393217269435300143305409394463459185543183397656052122559640661454554977296311391480858037121987999716643812574028291115057151 * ...
real 0m0.149s
user 0m0.100s
sys 0m0.068s
pi@raspberrypi400:~/mpp/gmp-6.2.1/demos $

P.S.:
After downloading the GMP library, ...
https://gmplib.org/
... extracting the tar and building according to the instructions in the "INSTALL" file, cd into demos and run "make".
After changing something in factorize.c, just do "make factorize".

HermannSW
Posts: 4119
Joined: Fri Jul 22, 2016 9:09 pm
Location: Eberbach, Germany
Contact: Website Twitter YouTube

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (two samples)

Mon May 17, 2021 10:19 am

The cycle finding algorithm used in factorize.c is not Floyd's tortoise and hare cycle finding algorithm
https://en.wikipedia.org/wiki/Cycle_det ... e_and_hare
which the original Pollard rho method used (and which is depicted in the Wikipedia animation in an earlier posting of this thread). Instead it is Brent's cycle detection algorithm:
https://en.wikipedia.org/wiki/Cycle_det ... _algorithm

I factorized (2^61-1)^2 again with a=1, which took 10 minutes. I inserted a simple counter for the number of times the while loop gets executed, and the value reported was 2,163,790,336. Then I changed the cycle finding algorithm to Floyd's tortoise and hare algorithm and output the counter for the same setup: it was 2,163,790,305, which is 31 less than the other value (with the same runtime of 10 minutes).

From now on, everything I report uses the changed code with Floyd's tortoise and hare algorithm.
I updated the gist; the few changes needed can be found here:
https://gist.github.com/Hermann-SW/6bd1 ... /revisions
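
For comparison, here is what that change amounts to, as a minimal native 64-bit sketch (gcd64 and step are hypothetical helper names of mine; the real change of course operates on mpz_t): the tortoise advances one step per iteration, the hare two, and the gcd is computed every iteration instead of being batched over 32 rounds:

Code: Select all

/* Pollard rho with Floyd's tortoise-and-hare cycle finding, 64-bit sketch */
#include <stdint.h>
#include <stdio.h>

static uint64_t gcd64 (uint64_t a, uint64_t b)
{ while (b) { uint64_t t = a % b; a = b; b = t; } return a; }

static uint64_t step (uint64_t x, uint64_t a, uint64_t n)  /* x^2 + a mod n */
{ return (uint64_t) (((unsigned __int128) x * x + a) % n); }

static uint64_t rho_floyd (uint64_t n, uint64_t a, uint64_t x0)
{
  uint64_t x = x0 % n, y = x, d = 1;

  while (d == 1)
    {
      x = step (x, a, n);                /* tortoise: one step  */
      y = step (step (y, a, n), a, n);   /* hare:     two steps */
      d = gcd64 (x > y ? x - y : y - x, n);
    }
  return d;                              /* factor, possibly n itself */
}

int main (void)
{
  printf ("%llu\n",
          (unsigned long long) rho_floyd (10007ULL * 10009ULL, 1, 2));
  return 0;
}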

The reason for the sub-second factorizations with a=0 is that only 65(!) while loop executions are needed, independent of the initial value of x:

Code: Select all

$ time ./factorize -i 2 -a 0 `./pexpr "(2^61-1)^2"`
5316911983139663487003542222693990401: 2305843009213693951 * ... [65]
real 0m0.038s
user 0m0.012s
sys 0m0.035s
$ time ./factorize -i 4 -a 0 `./pexpr "(2^61-1)^2"`
5316911983139663487003542222693990401: 2305843009213693951 * ... [65]
real 0m0.076s
user 0m0.025s
sys 0m0.070s
$ time ./factorize -i 16 -a 0 `./pexpr "(2^61-1)^2"`
5316911983139663487003542222693990401: 2305843009213693951 * ... [65]
real 0m0.035s
user 0m0.013s
sys 0m0.030s
$ time ./factorize -i 3 -a 0 `./pexpr "(2^61-1)^2"`
5316911983139663487003542222693990401: 2305843009213693951 * ... [65]
real 0m0.034s
user 0m0.015s
sys 0m0.028s
$

Next I used another prime number sequence as a source of inputs:
A057733, primes of the form 2^n + 3.

The last number listed is 228 bits long (bc, with l() being the natural logarithm):
l(431359146674410236714672241392314090778194310760649159697657763987459)/l(2)
228.00000000000000000236

Again factorization is sub-second, with only 481 while loop executions:

Code: Select all

$ time ./factorize -i 2 -a 0 `./pexpr "(2^228+3)^2"`
186070713419675363980626894819329160794532188335953423432061490990246245911909914832925197016171826668099874697068705445943043963509276681: 431359146674410236714672241392314090778194310760649159697657763987459 * ... [481]
real 0m0.050s
user 0m0.024s
sys 0m0.037s
$

Next I tried numbers not built from powers of two:
A076481, primes of the form (3^n-1)/2:
13, 1093, 797161, 3754733257489862401973357979128773, 6957596529882152968992225251835887181478451547013

The numbers I used have exponents 103 and 71 (again checked with bc):
l(2*6957596529882152968992225251835887181478451547013+1)/l(3)
103.00000000000000000049
l(2*3754733257489862401973357979128773+1)/l(3)
71.00000000000000000033

I was happy not to see a sub-second factorization here; I stopped it after 15 minutes:

Code: Select all

$ time ./factorize -i 2 -a 0 `./pexpr "((3^71-1)/2)^2"`
^C

real	15m6.099s
user	15m5.966s
sys	0m0.094s
$

But then I tried initial value 3, and sub-second factorizations came back ...

Code: Select all

$ time ./factorize -i 3 -a 0 `./pexpr "((3^71-1)/2)^2"`
14098021834900433353326504437269339492731805015461594895076116485529: 3754733257489862401973357979128773 * ... [65]
real 0m0.044s
user 0m0.019s
sys 0m0.038s
$
$ time ./factorize -i 3 -a 0 `./pexpr "((3^103-1)/2)^2"`
48408149472628176711992895941020836989627796746074189294836404623980614241510486963545132949222169: 6957596529882152968992225251835887181478451547013 * ... [65]
real 0m0.042s
user 0m0.017s
sys 0m0.036s
$

MikeDB
Posts: 511
Joined: Sun Oct 12, 2014 8:27 am
Contact: Website

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (one sample)

Sun May 23, 2021 12:20 am

ejolson wrote:
Sun May 16, 2021 1:45 am
If you set things up for ARMv7 mode in an optimal way ARMv8 has a full set of 64-bit operations that are missing when the processor runs 32-bit mode.

My impression is that Intel is much faster for floating point but for integer the Pi 4B is surprisingly competitive.
I'll concur with that, provided it is straight 64-bit integer maths code, which I use a lot in audio DSP work. And certainly on a calculations-per-Watt basis the Pi4B wins hands down, so I split the tasks across multiple Pi4s.

But if you move to vector processing then the i7 (or indeed most modern Intel processors) is still in another league. But ARM has SVE, the successor to NEON, which hopefully we'll see in a future Pi (6 maybe?).
Always interested in innovative audio startups needing help and investment. Look for me on ModWiggler or other sites that have PMs.

cleverca22
Posts: 3741
Joined: Sat Aug 18, 2012 2:33 pm

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (one sample)

Sun May 23, 2021 1:59 am

MikeDB wrote:
Sun May 23, 2021 12:20 am
ejolson wrote:
Sun May 16, 2021 1:45 am
If you set things up for ARMv7 mode in an optimal way ARMv8 has a full set of 64-bit operations that are missing when the processor runs 32-bit mode.

My impression is that Intel is much faster for floating point but for integer the Pi 4B is surprisingly competitive.
I'll concur with that provided it is straight 64 bit integer maths code which I use a lot in audio DSP work. And certainly on a calculations per Watt the Pi4B wins hands down so I split the tasks across multiple Pi4s.

But if you move to vector processing then the i7 (or indeed most modern Intel processors) are still in another league. But ARM has the SVE successor to NEON which hopefully we'll see in a future Pi (6 maybe ?).
the VPU in every pi model is also capable of vectorized math, including 32bit*32bit -> 64bit
i think the 64bit output, takes 4 clock cycles to run, its clocked at 500mhz, and its doing 16 mults in parallel

how well does that stack up against the arm core and the i7?

ejolson
Posts: 7231
Joined: Tue Mar 18, 2014 11:47 am

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (one sample)

Sun May 23, 2021 2:33 am

cleverca22 wrote:
Sun May 23, 2021 1:59 am
MikeDB wrote:
Sun May 23, 2021 12:20 am
ejolson wrote:
Sun May 16, 2021 1:45 am
If you set things up for ARMv7 mode in an optimal way ARMv8 has a full set of 64-bit operations that are missing when the processor runs 32-bit mode.

My impression is that Intel is much faster for floating point but for integer the Pi 4B is surprisingly competitive.
I'll concur with that provided it is straight 64 bit integer maths code which I use a lot in audio DSP work. And certainly on a calculations per Watt the Pi4B wins hands down so I split the tasks across multiple Pi4s.

But if you move to vector processing then the i7 (or indeed most modern Intel processors) are still in another league. But ARM has the SVE successor to NEON which hopefully we'll see in a future Pi (6 maybe ?).
the VPU in every pi model is also capable of vectorized math, including 32bit*32bit -> 64bit
i think the 64bit output, takes 4 clock cycles to run, its clocked at 500mhz, and its doing 16 mults in parallel

how well does that stack up against the arm core and the i7?
Do you have any code examples? Though I'd love to find out otherwise, using the VPU as a vector accelerator seems to me to be in the impossible-to-program category.

cleverca22
Posts: 3741
Joined: Sat Aug 18, 2012 2:33 pm

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (two samples)

Sun May 23, 2021 2:57 am

ejolson wrote:
Sun May 23, 2021 2:33 am
Do you have any code examples? Though I'd love to find out otherwise, using the VPU as a vector accelerator seems to me in the impossible to program category.
https://github.com/hermanhermitage/vide ... figuration

there is a 64 x 64 x 8bit matrix of registers

H(row,col) refers to a row of 16 cells, starting at row,col and ending at row,(col+15)
V(row,col) refers to a column of cells instead

HX and VX, will take 2 sets of 16 cells (2nd set at col+16), and concat the 8bit values together to make a 16bit value
HY and VY take 4 sets of 16, and combine them into a 32bit value

so HY(0,0) for example, refers to 16 values in row 0
8 bits of each value come from columns 0-15
the next 8 bits come from columns 16-31
then cols 32-47
then the last 8bits in cols 48-63

depending on the bit width (8/16/32), you can fit more or less sets of 16 into the matrix
and with care, you can mix them as well, so you could have HX(0,0) and HX(0,32) living together in row0, and then HY(1,0) taking all of row1 up

Code: Select all

uint16_t a[1024];
__asm__ volatile ("v16ld HX(0++,0), (%0+=%1) REP64": : "r"(a), "r"(2*16));
this will then load a c-array into the matrix
the REP64 says to repeat this opcode 64 times
the 0++ in the row slot, says to increment the row each time it repeats
gcc will replace the %0 and %1 with registers, so it becomes (r0+=r1)
r0 is the address to load the first uint16_t[16] from
and r1 is the stride, how much to increment by when loading the next 16 values

Code: Select all

__asm__ volatile ("vmull.ss HX(0++,0), HX(0++,0), 2 REP64");
vmull is a vector multiply
.ss is the signed/unsigned flag for the 1st and 2nd argument, you can do .ss .uu .su or .us
then you have destination, arg1, arg2
the arguments can be any vector reg, a scalar reg, or an immediate
in this example, it is doing:

Code: Select all

uint16_t matrix[64][16];
for (int row=0; row<64; row++) {
  for (int col=0; col<16; col++) {
    matrix[row][col] = matrix[row][col] * 2;
  }
}
that entire loop, all 1024 mults, will be completed in just 128 clock cycles

some other examples ive written

Code: Select all

v16st HX(0++,0), (%0+=%1) REP64 // store the matrix back into ram
v32add HY(63,0), HY(63,0), HY(31,0)
v32add HY(63,0), HY(63,0), r0
v32asr HY(63,0), HY(63,0), r1
vmul32.ss HY(16,0), HY(0,0), 4
v32sub HY(16,0), HY(48,0), HY(63,0)
the 32bit*32bit->64bit mult, is done using vmulhd, which is a "high side mult", returning the upper 32bits of a mult
so you need to combine both the low-side and high-side mults together, to get the full 64bit answer
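
In scalar C terms (just to illustrate the arithmetic, not VPU code), combining the two halves looks like this:

Code: Select all

/* combining a low-side and a high-side 32x32 multiply into the 64-bit product */
#include <stdint.h>
#include <stdio.h>

int main (void)
{
  uint32_t a = 0xDEADBEEFu, b = 0x12345678u;

  uint32_t lo = a * b;                                 /* low  32 bits */
  uint32_t hi = (uint32_t) (((uint64_t) a * b) >> 32); /* high 32 bits */
  uint64_t full = ((uint64_t) hi << 32) | lo;

  printf ("%016llx\n", (unsigned long long) full);
  printf ("%016llx\n", (unsigned long long) ((uint64_t) a * b));  /* same */
  return 0;
}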


the biggest limitation i can see is that it's int only; if you want floats, then you need to implement that yourself (split the float up into its pieces, and compute the new exponent, with vector ops)

MikeDB
Posts: 511
Joined: Sun Oct 12, 2014 8:27 am
Contact: Website

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (one sample)

Sun May 23, 2021 9:31 am

cleverca22 wrote:
Sun May 23, 2021 1:59 am
the VPU in every pi model is also capable of vectorized math, including 32bit*32bit -> 64bit
i think the 64bit output, takes 4 clock cycles to run, its clocked at 500mhz, and its doing 16 mults in parallel

how well does that stack up against the arm core and the i7?
Like others I thought that was near impossible to program but it certainly looks like it might be faster than the ARM core. I'll come back once I've tried it in my code. Is it the same code for VideoCore VI on the Pi4 ?

HermannSW
Posts: 4119
Joined: Fri Jul 22, 2016 9:09 pm
Location: Eberbach, Germany
Contact: Website Twitter YouTube

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (two samples)

Sun May 23, 2021 9:59 am

cleverca22 wrote:
Sun May 23, 2021 2:57 am
https://github.com/hermanhermitage/vide ... figuration

there is a 64 x 64 x 8bit matrix of registers

H(row,col) refers to a row of 16 cells, starting at row,col and ending at row,(col+15)
V(row,col) refers to a column of cells instead

...
Thank you for those details and impressive examples.

The 64 x 64 x 8-bit matrix of registers means 4096 bytes, i.e. 1024 32-bit registers.
The most costly operation is multiplication, whose result needs the sum of the space needed by its two operands.

I have copied and grouped the GMP library calls used by the Pollard rho algorithm below.
"mpz_mul (t, x, x);" needs 4x the space of "x" (the two operands plus the double-size product t), so "x" could be 1024/4 = 256 32-bit values, i.e. 8192 bits!!
This should suffice to attack RSA with a 4096-bit modulus.
I just used Firefox to inspect the modulus of my website stamm-wilbrandt.de, and it has a 512-byte modulus (4096 bits).

The question is what the loading of the registers from C arrays and the readout of results from the registers back to C arrays will cost.
Perhaps the easiest test would be to write a program that just initializes the mpz variables the standard way.
Then time a normal "mpz_mul(t,x,x)" against a VPU replacement for just that operation.
This should give information on what can be achieved over the current GMP library implementation on a Pi4[00] ...

Code: Select all

  mpz_inits (t, t2, NULL); // init
  mpz_init_set_si (y, 2);
  mpz_init_set_ui (P, 1);
  mpz_set (y, x); // setvar
  mpz_cmp_ui (n, 1) // compare
  mpz_mul (t, x, x);
  mpz_mod (x, t, n);
  mpz_add_ui (x, x, a);
  mpz_sub (t, z, x);
  mpz_divexact (n, n, t); // arith
  mpz_gcd (t, P, n);// complex arith
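
Here is a minimal sketch of the first half of such a test (my own, assuming GMP is installed and linked with -lgmp): it only times the plain mpz_mul on a 4096-bit operand; a VPU replacement would then be timed the same way for comparison.

Code: Select all

/* time mpz_mul on a 4096-bit operand -- baseline for a VPU comparison */
#include <stdio.h>
#include <time.h>
#include <gmp.h>

#define REPS 100000

int main (void)
{
  mpz_t x, t;
  gmp_randstate_t rs;
  struct timespec t0, t1;

  gmp_randinit_default (rs);
  gmp_randseed_ui (rs, 42);
  mpz_inits (x, t, NULL);
  mpz_urandomb (x, rs, 4096);            /* random 4096-bit operand */

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < REPS; i++)
    mpz_mul (t, x, x);
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  printf ("%d squarings of a 4096-bit number: %.3f s (%.0f ns each)\n",
          REPS, s, 1e9 * s / REPS);

  mpz_clears (x, t, NULL);
  return 0;
}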

cleverca22
Posts: 3741
Joined: Sat Aug 18, 2012 2:33 pm

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (two samples)

Sun May 23, 2021 12:54 pm

MikeDB wrote:
Sun May 23, 2021 9:31 am
Like others I thought that was near impossible to program but it certainly looks like it might be faster than the ARM core. I'll come back once I've tried it in my code. Is it the same code for VideoCore VI on the Pi4 ?
pi4 uses the exact same ISA as past models, the only real difference is the stuff in MMIO, which doesnt matter for vector math
HermannSW wrote:
Sun May 23, 2021 9:59 am
Most costly operation is multiplication, resulting in sum of space needed by its two operands.
today, i tried to implement floating point math on the "int only" vector unit
it has bit shifts, AND and OR, so it can just shuffle the bits around: decode the floats, do a fixed-point mult, and fix up the exponents

then i discovered to my horror, there is no `uint32_t mult(uint32_t, uint32_t)`
there are only several variations of `mult(uint16_t, uint16_t)`

i have spent the last 2 hours working out how to implement `(a*b)>>32` when my * operator can only take 16bit inputs
it still has edge cases that it gets wrong
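
For anyone following along, the scalar version of that decomposition (building a 32x32->64 multiply, and hence (a*b)>>32, out of 16x16 partial products) looks like this; this is my own reference sketch, the vectorized VPU version has to do the same steps per lane:

Code: Select all

/* 32x32->64 multiply built from 16x16 partial products (scalar reference) */
#include <stdint.h>
#include <stdio.h>

static uint64_t mul32x32 (uint32_t a, uint32_t b)
{
  uint32_t al = a & 0xFFFF, ah = a >> 16;
  uint32_t bl = b & 0xFFFF, bh = b >> 16;

  uint64_t ll = (uint64_t) al * bl;   /* contributes to bits  0..31 */
  uint64_t lh = (uint64_t) al * bh;   /* contributes to bits 16..47 */
  uint64_t hl = (uint64_t) ah * bl;   /* contributes to bits 16..47 */
  uint64_t hh = (uint64_t) ah * bh;   /* contributes to bits 32..63 */

  return ll + ((lh + hl) << 16) + (hh << 32);
}

static uint32_t mulhi32 (uint32_t a, uint32_t b)   /* the wanted (a*b)>>32 */
{
  return (uint32_t) (mul32x32 (a, b) >> 32);
}

int main (void)
{
  uint32_t a = 0xDEADBEEFu, b = 0x12345678u;
  printf ("%08x vs %08x\n", mulhi32 (a, b),
          (uint32_t) (((uint64_t) a * b) >> 32));   /* must agree */
  return 0;
}

The scalar version can simply widen to 64 bits for the shifted partial sums; on a 16-bit-only vector unit the tricky part is propagating the carries between those partial sums, which is presumably where the edge cases come from.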

MikeDB
Posts: 511
Joined: Sun Oct 12, 2014 8:27 am
Contact: Website

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (two samples)

Sun May 23, 2021 1:28 pm

cleverca22 wrote:
Sun May 23, 2021 12:54 pm
pi4 uses the exact same ISA as past models, the only real difference is the stuff in MMIO, which doesnt matter for vector math
Does the Pi4 ARM core that issued the VPU command stall when the VPU is calculating, or plough on with its own work ?
And if the latter, is there a flag or something to say the array of multiplies is complete so as to load back to ARM ?

cleverca22
Posts: 3741
Joined: Sat Aug 18, 2012 2:33 pm

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (two samples)

Sun May 23, 2021 2:17 pm

MikeDB wrote:
Sun May 23, 2021 1:28 pm
cleverca22 wrote:
Sun May 23, 2021 12:54 pm
pi4 uses the exact same ISA as past models, the only real difference is the stuff in MMIO, which doesnt matter for vector math
Does the Pi4 ARM core that issued the VPU command stall when the VPU is calculating, or plough on with its own work ?
And if the latter, is there a flag or something to say the array of multiplies is complete so as to load back to ARM ?
They talk over the same mailbox as most other things, so your thread will block and the ARM will go do other things.

https://github.com/ali1234/vcpoke/blob/ ... .c#L21-L53
Lines 27-34 are a tiny (4-opcode) VPU program.
Lines 46 and 51 then create C wrappers for calling the 2 VPU functions.

In theory, you just need to compile your vector code to VPU asm, with some extra sugar to disable and re-enable IRQs and to save/restore the vector state (so you don't corrupt whatever the firmware is doing).
Then you can use this same code to run a vector operation, and another 4 threads can do ARM computation while you wait for the VPU reply.

But if you instead use the open firmware, you control what it is doing with the vector core, and can do things even more efficiently.
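To make the mailbox part a bit more concrete, here is a rough, untested sketch of the property message involved, going by the firmware's documented Mailbox property interface ("execute code", tag 0x00030010). The /dev/vcio ioctl plumbing that actually submits it (as in vcpoke above, or mailbox.c in the hello_fft demo) is left out.

Code: Select all

#include <stdint.h>

/* Sketch of the mailbox property message for running a function on the VPU.
   Only the message layout is shown; submitting it via /dev/vcio is omitted. */
static void build_execute_code_msg(uint32_t msg[13],
                                   uint32_t vpu_code_bus_addr,
                                   const uint32_t r[6]) {
    msg[0] = 13 * sizeof(uint32_t);  /* total message size in bytes          */
    msg[1] = 0;                      /* request code                         */
    msg[2] = 0x00030010;             /* tag: execute code on the VPU         */
    msg[3] = 7 * sizeof(uint32_t);   /* value buffer size                    */
    msg[4] = 0;                      /* request indicator                    */
    msg[5] = vpu_code_bus_addr;      /* function pointer (bus address)       */
    for (int i = 0; i < 6; i++)
        msg[6 + i] = r[i];           /* r0..r5 arguments for the VPU routine */
    msg[12] = 0;                     /* end tag                              */
    /* The ARM thread that submits this blocks in the mailbox call until the
       VPU function returns; its r0 return value comes back in msg[5]. */
}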

ejolson
Posts: 7231
Joined: Tue Mar 18, 2014 11:47 am

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (two samples)

Sun May 23, 2021 2:46 pm

cleverca22 wrote:
Sun May 23, 2021 2:17 pm
MikeDB wrote:
Sun May 23, 2021 1:28 pm
cleverca22 wrote:
Sun May 23, 2021 12:54 pm
The Pi4 uses the exact same ISA as past models; the only real difference is the stuff in MMIO, which doesn't matter for vector math.
Does the Pi4 ARM core that issued the VPU command stall while the VPU is calculating, or plough on with its own work?
And if the latter, is there a flag or something to say the array of multiplies is complete, so as to load the results back to the ARM?
They talk over the same mailbox as most other things, so your thread will block and the ARM will go do other things.

https://github.com/ali1234/vcpoke/blob/ ... .c#L21-L53
Lines 27-34 are a tiny (4-opcode) VPU program.
Lines 46 and 51 then create C wrappers for calling the 2 VPU functions.

In theory, you just need to compile your vector code to VPU asm, with some extra sugar to disable and re-enable IRQs and to save/restore the vector state (so you don't corrupt whatever the firmware is doing).
Then you can use this same code to run a vector operation, and another 4 threads can do ARM computation while you wait for the VPU reply.

But if you instead use the open firmware, you control what it is doing with the vector core, and can do things even more efficiently.
Doing a big number multiply may be a bit much as the first project. Given a 20x20 matrix A with 32-bit integer values, would taking the matrix power A^5 be easier? What about with 16-bit values?

User avatar
HermannSW
Posts: 4119
Joined: Fri Jul 22, 2016 9:09 pm
Location: Eberbach, Germany
Contact: Website Twitter YouTube

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (two samples)

Sun May 23, 2021 3:04 pm

cleverca22 wrote:
Sun May 23, 2021 12:54 pm
Then I discovered, to my horror, that there is no `uint32_t mult(uint32_t, uint32_t)`;
there are only several variations of `mult(uint16_t,uint16_t)`.
Thanks for the clarification.

I have spent the last 2 hours working out how to implement `(a*b)>>32` when my * operator can only take 16-bit inputs.
It still has edge cases that it gets wrong.
Please let us know when you have something to test.

My older son made me aware of this gmplib page wrt SIMD:
https://gmplib.org/manual/Assembly-SIMD-Instructions
Since the current stable gmplib version dates from 2014, these statements likely do not reflect the last 7 years of hardware improvements:
15.8.7 SIMD Instructions

The single-instruction multiple-data support in current microprocessors is aimed at signal processing algorithms where each data point can be treated more or less independently. There’s generally not much support for propagating the sort of carries that arise in GMP.

SIMD multiplications of say four 16x16 bit multiplies only do as much work as one 32x32 from GMP’s point of view, and need some shifts and adds besides. But of course if say the SIMD form is fully pipelined and uses less instruction decoding then it may still be worthwhile.

On the x86 chips, MMX has so far found a use in mpn_rshift and mpn_lshift, and is used in a special case for 16-bit multipliers in the P55 mpn_mul_1. SSE2 is used for Pentium 4 mpn_mul_1, mpn_addmul_1, and mpn_submul_1.
https://stamm-wilbrandt.de/2wheel_balancing_robot
https://stamm-wilbrandt.de/en#raspcatbot
https://github.com/Hermann-SW/Raspberry_v1_camera_global_external_shutter
https://github.com/Hermann-SW/raspiraw
https://stamm-wilbrandt.de/en/Raspberry_camera.html

cleverca22
Posts: 3741
Joined: Sat Aug 18, 2012 2:33 pm

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (two samples)

Sun May 23, 2021 4:24 pm

ejolson wrote:
Sun May 23, 2021 2:46 pm
Doing a big number multiply may be a bit much as the first project. Given a 20x20 matrix A with 32-bit integer values, would taking the matrix power A^5 be easier? What about with 16-bit values?
I'm not sure how to decode what you've said into actual code.

But I have partially solved the mult problem.

Code: Select all

#include <stdint.h>
#include <stdio.h>

uint32_t m(uint16_t a, uint16_t b) {
  return (uint32_t)a * b;    // 16x16 -> 32 bit, the widest multiply available
}
uint64_t mult(uint32_t a, uint32_t b) {
  uint16_t ah = a >> 16;
  uint16_t al = a & 0xffff;
  uint16_t bh = b >> 16;
  uint16_t bl = b & 0xffff;

  uint32_t part1 = m(al,bl); // 31:0

  uint32_t part2 = m(ah,bl); // 47:16
  uint32_t part3 = m(al,bh); // 47:16

  uint64_t partb = (uint64_t)part2 + part3; // 48:16, summed in 64 bits so the carry out of bit 31 is kept
  if (partb > 0xffffffff) printf("partb carries: 0x%llx\n", (unsigned long long)partb); // debug only

  uint64_t part4 = m(ah,bh); // 63:32
  return part1 + (partb << 16) + (part4 << 32);
}
This can do a 32-bit x 32-bit -> 64-bit multiply using only 16-bit x 16-bit -> 32-bit mults and 64-bit adds.

That solves the missing mults on the vector unit, so the only problem left is taking 3-4 32-bit ints, bit-shifting some of them, and then doing the 64-bit adds.
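Not part of the post above, but for anyone who wants to poke at it: a quick check of mult() on the ARM side against the compiler's native 64-bit multiply, including the carry-heavy corner cases.

Code: Select all

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

uint64_t mult(uint32_t a, uint32_t b);   /* the routine above */

int main(void) {
    /* corner cases first, including operands whose cross terms carry */
    uint32_t cases[] = { 0, 1, 0xffff, 0x10000, 0x12345678,
                         0x80000000, 0xffff0001, 0xffffffff };
    unsigned n = sizeof cases / sizeof cases[0];
    for (unsigned i = 0; i < n; i++)
        for (unsigned j = 0; j < n; j++) {
            uint64_t want = (uint64_t)cases[i] * cases[j];
            uint64_t got  = mult(cases[i], cases[j]);
            if (got != want)
                printf("mismatch: %08x * %08x: want %016llx got %016llx\n",
                       cases[i], cases[j],
                       (unsigned long long)want, (unsigned long long)got);
        }

    /* then a pile of pseudo-random operands */
    srand(1);
    for (int i = 0; i < 1000000; i++) {
        uint32_t a = ((uint32_t)rand() << 17) ^ (uint32_t)rand();
        uint32_t b = ((uint32_t)rand() << 17) ^ (uint32_t)rand();
        if (mult(a, b) != (uint64_t)a * b) {
            printf("random mismatch at %08x %08x\n", a, b);
            return 1;
        }
    }
    puts("all good");
    return 0;
}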

ejolson
Posts: 7231
Joined: Tue Mar 18, 2014 11:47 am

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (two samples)

Sun May 23, 2021 6:17 pm

cleverca22 wrote:
Sun May 23, 2021 4:24 pm
ejolson wrote:
Sun May 23, 2021 2:46 pm
Doing a big number multiply may be a bit much as the first project. Given a 20x20 matrix A with 32-bit integer values, would taking the matrix power A^5 be easier? What about with 16-bit values?
I'm not sure how to decode what you've said into actual code.

But I have partially solved the mult problem.

Code: Select all

#include <stdint.h>
#include <stdio.h>

uint32_t m(uint16_t a, uint16_t b) {
  return (uint32_t)a * b;    // 16x16 -> 32 bit, the widest multiply available
}
uint64_t mult(uint32_t a, uint32_t b) {
  uint16_t ah = a >> 16;
  uint16_t al = a & 0xffff;
  uint16_t bh = b >> 16;
  uint16_t bl = b & 0xffff;

  uint32_t part1 = m(al,bl); // 31:0

  uint32_t part2 = m(ah,bl); // 47:16
  uint32_t part3 = m(al,bh); // 47:16

  uint64_t partb = (uint64_t)part2 + part3; // 48:16, summed in 64 bits so the carry out of bit 31 is kept
  if (partb > 0xffffffff) printf("partb carries: 0x%llx\n", (unsigned long long)partb); // debug only

  uint64_t part4 = m(ah,bh); // 63:32
  return part1 + (partb << 16) + (part4 << 32);
}
This can do a 32-bit x 32-bit -> 64-bit multiply using only 16-bit x 16-bit -> 32-bit mults and 64-bit adds.

That solves the missing mults on the vector unit, so the only problem left is taking 3-4 32-bit ints, bit-shifting some of them, and then doing the 64-bit adds.
I like your code because it uses 3 multiplications rather than 4. Did you figure it out or look it up? My understanding is this is related to Karatsuba's algorithm

https://en.wikipedia.org/wiki/Karatsuba_algorithm

and similar to the Gauss trick for multiplying complex numbers.

There was a thread some time ago comparing the efficiency and expressiveness of different programming languages through the challenge of computing really big Fibonacci numbers using the Karatsuba algorithm.

viewtopic.php?f=62&t=227343

Although we've not seen David online for a couple of years (hope you're doing well) and John is running the Basic forum

https://raspberrybasic.org/forum/index.php

most of the others who participated in that project are still around.

Note that the tradeoff between additional additions versus four multiplications depends on multiplication being relatively expensive. As indicated by the GMP documentation linked above, many SIMD units perform 16-bit multiplication so fast it's better to use four multiplies.
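To make the three-versus-four tradeoff concrete, here is an illustrative (untested) plain-C sketch of the Karatsuba/Gauss form of the same 32x32 product. The catch shows up in the cross term: ah+al and bh+bl are 17-bit values, so a unit that only has 16x16 multipliers cannot feed them in directly, which is exactly why such SIMD units tend to stick with four multiplies.

Code: Select all

#include <stdint.h>

/* 32x32 -> 64-bit product with 3 multiplications (Karatsuba/Gauss trick).
   Note that ah+al and bh+bl can be 17 bits wide, so the cross term needs a
   multiplier wider than 16x16. */
static uint64_t mult_karatsuba(uint32_t a, uint32_t b) {
    uint32_t ah = a >> 16, al = a & 0xffff;
    uint32_t bh = b >> 16, bl = b & 0xffff;

    uint64_t hi    = (uint64_t)ah * bh;                          /* bits 63:32 */
    uint64_t lo    = (uint64_t)al * bl;                          /* bits 31:0  */
    uint64_t cross = (uint64_t)(ah + al) * (bh + bl) - hi - lo;  /* bits 48:16 */

    return (hi << 32) + (cross << 16) + lo;
}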

What I meant by a matrix power function is matpower below where

Code: Select all

#include <string.h>   /* memcpy */
#include <strings.h>  /* bzero */

/* A = A*A */
void matsquare(int A[20][20]){
    int C[20][20];
    memcpy(C,A,sizeof(int)*20*20);
    bzero(A,sizeof(int)*20*20);
    for(int i=0;i<20;i++){
        for(int k=0;k<20;k++){
            for(int j=0;j<20;j++){
                A[i][j]+=C[i][k]*C[k][j];
            }
        }
    }
}

/* A = A*B */
void matmul(int A[20][20],int B[20][20]){
    int C[20][20];
    memcpy(C,A,sizeof(int)*20*20);
    bzero(A,sizeof(int)*20*20);
    for(int i=0;i<20;i++){
        for(int k=0;k<20;k++){
            for(int j=0;j<20;j++){
                A[i][j]+=C[i][k]*B[k][j];
            }
        }
    }
}

/* A = A^p by repeated squaring; assumes p >= 1 */
void matpower(int A[20][20],int p){
    if(p==1) return;
    int q=p/2, r=p%2;
    int C[20][20];
    memcpy(C,A,sizeof(int)*20*20);
    matpower(A,q);
    matsquare(A);
    if(r==0) return;
    matmul(A,C);
}
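Not from the original code, just a quick host-side sanity check (plain C on the ARM, arbitrary test values) that the halving recursion in matpower really produces A^5, by comparing it against four explicit matmul calls.

Code: Select all

#include <stdio.h>
#include <string.h>

void matmul(int A[20][20],int B[20][20]);   /* from the block above */
void matpower(int A[20][20],int p);

int main(void) {
    int A[20][20], B[20][20], C[20][20];
    for (int i = 0; i < 20; i++)
        for (int j = 0; j < 20; j++)
            A[i][j] = B[i][j] = (i + 2*j) % 5;   /* small entries, no overflow at p=5 */

    matpower(A, 5);                              /* A becomes (initial A)^5 via the recursion */

    memcpy(C, B, sizeof C);                      /* C = the initial matrix */
    for (int n = 0; n < 4; n++) matmul(B, C);    /* B becomes (initial A)^5 the slow way */

    puts(memcmp(A, B, sizeof A) ? "mismatch" : "match");
    return 0;
}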
Ideally I could pass a 20x20 matrix and a value of p to the matpower function which would then hand things off to the VPU and later obtain a value for A^p. If this is done with 32-bit integers, the 16-bit to 32-bit routine you posted could be useful. However, doing everything modulo 2^16 could also be interesting.

At any rate, my intention is not to derail this thread, but simply provide a test case that might be easier than a VPU-based big number library as a starting point.

Edit: I reread your code and realized that there are actually 4 multiplications, but that you are working around the lack of a

32-bit x 32-bit -> 64-bit

operation, so never mind all the stuff about 3 multiplications versus 4. I'll leave the earlier part of this post as is, in case someone has already read it and was interested.

cleverca22
Posts: 3741
Joined: Sat Aug 18, 2012 2:33 pm

Re: gmp (Gnu multiple precision arithmetic lib) 5x faster on intel i7 than on Pi400 (two samples)

Sun May 23, 2021 7:59 pm

ejolson wrote:
Sun May 23, 2021 6:17 pm
I like your code because it uses 3 multiplications rather than 4. Did you figure it out or look it up? My understanding is this is related to Karatsuba's algorithm
I was partially talking to a guy on IRC, but being low on sleep, I couldn't fully understand what he was saying.

Based on my poor understanding of his directions, I then moved on to just doing al * bl (lower 16 bits) and ah * bl (high 16 and low 16),
then just multiplying 0x12345678 by 0x100, looking at the result, and bit-shifting it until it lands in the right spot,

and repeating until I had a working 32-bit x 32-bit -> 64-bit mult.
ejolson wrote:
Sun May 23, 2021 6:17 pm
Note that the tradeoff between additional additions versus four multiplications depends on multiplication being relatively expensive. As indicated by the GMP documentation linked above, many SIMD units perform 16-bit multiplication so fast it's better to use four multiplies.
From my testing, the 16-bit mult takes 2 clock cycles and happens over all 16 lanes,
so the 4 mults to patch around a missing 32-bit mult would take 8 clock cycles.
Then I need a few more cycles for the adds (not yet solved that problem); spread over the 16 lanes, it feels like I could get it down to effectively 1-2 clocks per float mult?

That doesn't sound that impressive... I'll need to compare against the scalar float ops and see how they stack up.
ejolson wrote:
Sun May 23, 2021 6:17 pm
What I meant by a matrix power function is matpower below where
The first issue I can see is that the vector unit always operates on units of 16; if you could change that to be [16][16] instead, it would be far simpler to implement.

The next thing is the C[i][k] * C[k][j]. I could see how I might implement that by using both the row and column syntax at the same time:
take a row of data, starting at (0,0) and spanning to (0,15), and make that arg1,
then take a column of data, starting at (0,0) and spanning to (15,0), and that is arg2,
and mult them together with the accumulator flags (I haven't played with them much yet).

That does look feasible, but I'd have to draw it out on paper a bit to make full sense of the algorithm and implement it.
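In plain C, just to pin down the data movement (this is not VPU code), the row-times-column idea for one 16x16 square would look something like the following; how the 16 lane products then get summed with the accumulator flags is exactly the part that still needs working out on the real hardware.

Code: Select all

#include <stdint.h>

/* Scalar model of squaring one 16x16 matrix of 16-bit values.
   The inner k loop is the 16-lane elementwise multiply of a row of C
   against a column of C; each output element is the sum of those lanes. */
void matsquare16(const int16_t C[16][16], int32_t out[16][16]) {
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++) {
            int32_t acc = 0;
            for (int k = 0; k < 16; k++)   /* one lane per k */
                acc += (int32_t)C[i][k] * C[k][j];
            out[i][j] = acc;
        }
}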

Another factor is that the current mailbox API may have a relatively high latency, resulting in a lot of idle time;
the solution then is to do bigger batches.

All of the vector opcodes can be repeated a power-of-2 number of times, between 1 and 64.
Depending on how many rows it takes up in the matrix, I could potentially just do 64 matpowers at once, each on a separate 16x16 matrix.
