lurk101
Posts: 289
Joined: Mon Jan 27, 2020 2:35 pm

Euler performance revisited...

Sun Sep 27, 2020 8:35 am

It’s my custom to benchmark Euler Project problem solutions on every box in the house as I solve them. Recent problem #719, not one of the project’s most mind bending problems, is solved here using recursion and a mixture of 32 and 64 bit integer arithmetic. The benchmark results are partly expected but contain some surprises.

Expected:
  • The Raspberry Pi 4 is about the usual 10% faster than the Jetson without GPU assist.
Unexpected:
  • The Raspberry Pi 4 is 10 times faster in 64 bit mode! Yes, a full order of magnitude.
  • Multi-threaded efficiency ( defined as (single_time * cores) / multi_time ) drops sharply above 8 threads regardless of architecture. Likely due to heavy use of interlocked memory accesses by multi-threaded algorithm causing memory contention. The Ryzen usually excels at handling multi-threaded load.
  • Though slower, the Pi 4’s Cortex A72 at 1.5 GHz does about 20% more work per clock cycle in single threaded mode than the 2013 vintage I7, and 70% more than the 2011 Xeon. On the other hand the more modern Ryzen at 4 GHz does about 30% more work per cycle than the A72. Haven’t done any power calculations but I bet they’d tell a different story.
Solution code:

Code: Select all

// euler719.cpp :
//
// Problem originally published: https://projecteuler.net/problem=719
//
// As:
//
// We define an S - number to be a natural number, n, that is a perfect
// square and its square root can be obtained by splitting the decimal
// representation of n into 2 or more numbers then adding the numbers.
//
// For example,
// 81   is an S - number because sqrt( 81 ) = 8 + 1
// 6724 is an S - number: sqrt( 6724 ) = 6 + 72 + 4
// 8281 is an S - number: sqrt( 8281 ) = 8 + 2 + 81 = 82 + 8 + 1
// 9801 is an S - number: sqrt( 9801 ) = 98 + 0 + 1
//
// Further we define T(N) to be the sum of all S numbers n <=  N.You are given
// T(10 ^ 4) = 41333
//
// Find T(10 ^ 12)

#include <cmath>
#include <cstdint>
#include <iostream>
#include <string>

using namespace std;

static bool S(const uint32_t s, const uint64_t l)
{
    for (uint32_t pwr10 = 10;; pwr10 *= 10)
    {
        const uint32_t right = l % pwr10;
        if (right > s)
            return false;
        const uint64_t left = l / pwr10;
        if (right + left == s)
            return true;
        if (left < 10)
            return false;
        if ((s > right) && S(s - right, left))
            return true;
    }
}

static uint64_t T(const uint64_t N)
{
    // Technically 0 and 1 should be S numbers,
    // but Euler doesn't think so!
    uint64_t sum = 0;
    for (uint32_t sn = 4; sn <= uint32_t(sqrt(N)); sn++)
    {
        const uint64_t n = sn * uint64_t(sn);
        if (S(sn, n))
            sum += n;
    }
    return sum;
}

int main(int ac, char** av)
{
    uint64_t N;
    if (ac < 2)
    {
        cout << endl << "Please specify a value for N." << endl << endl;
        return -1;
    }
    N = strtoull(av[1], nullptr, 10);

    cout << endl << "T(" << N << ") = " << T(N) << endl << endl;
}

Code: Select all

// euler719_mt.cpp :
//
// Problem originally published: https://projecteuler.net/problem=719
//
// As:
//
// We define an S - number to be a natural number, n, that is a perfect
// square and its square root can be obtained by splitting the decimal
// representation of n into 2 or more numbers then adding the numbers.
//
// For example,
// 81   is an S - number because sqrt( 81 ) = 8 + 1
// 6724 is an S - number: sqrt( 6724 ) = 6 + 72 + 4
// 8281 is an S - number: sqrt( 8281 ) = 8 + 2 + 81 = 82 + 8 + 1
// 9801 is an S - number: sqrt( 9801 ) = 98 + 0 + 1
//
// Further we define T(N) to be the sum of all S numbers n <=  N.You are given
// T(10 ^ 4) = 41333
//
// Find T(10 ^ 12)

#include <atomic>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <thread>

using namespace std;

static uint64_t N;
static uint32_t sqrtN;

static atomic<uint64_t> sum = {0};
// Technically 0 and 1 should be S numbers, but Euler doesn't think so!
// So we start with 1st possible of 16.
static atomic<uint32_t> sn = {4};

static bool S(const uint32_t s, const uint64_t l)
{
    for (uint32_t pwr10 = 10;; pwr10 *= 10)
    {
        const uint32_t right = l % pwr10;
        if (right > s)
            return false;
        const uint64_t left = l / pwr10;
        if (right + left == s)
            return true;
        if (left < 10)
            return false;
        if ((s > right) && S(s - right, left))
            return true;
    }
}

static void T_thread(int i)
{
    for (;;)
    {
        uint32_t n = sn.load();
        if (n > sqrtN)
            return;
        while (!sn.compare_exchange_weak(n, n + 1))
            if (n > sqrtN)
                return;
        const uint64_t n2 = n * uint64_t(n);
        if (S(n, n2))
            sum.fetch_add(n2);
    }
}

static uint64_t T(const uint64_t N)
{
    uint32_t proc_count = std::thread::hardware_concurrency();
    // may return 0 when not able to detect
    if (proc_count == 0)
        proc_count = 1;
    cout << "Threads: " << proc_count << endl;
    thread t[proc_count];
    for (int i = 0; i < proc_count; i++)
        t[i] = thread(T_thread, i);
    for (int i = 0; i < proc_count; i++)
        t[i].join();
    return sum;
}

int main(int ac, char** av)
{
    if (ac < 2)
    {
        cout << endl << "Please specify a value for N." << endl << endl;
        return -1;
    }
    N = strtoull(av[1], nullptr, 10);
    sqrtN = uint32_t(sqrt(N));
    uint64_t t = T(N);
    cout << endl << "T(" << N << ") = " << t << endl << endl;
}
Results:
perf.gif
perf.gif (100.08 KiB) Viewed 428 times
Of course these results may not apply to the general case, whatever that is.

T(1,000,000,000,000) = 128,088,830,547,982
Last edited by lurk101 on Sun Sep 27, 2020 5:58 pm, edited 1 time in total.

User avatar
jahboater
Posts: 6274
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: Euler performance revisited...

Sun Sep 27, 2020 9:25 am

lurk101 wrote:
Sun Sep 27, 2020 8:35 am
Unexpected:

The Raspberry Pi 4 is 10 times faster in 64 bit mode! Yes, a full order of magnitude.[/list]
That may be because it has a fast hardware 64-bit division instruction, in 32-bit mode its done by a library call.
sysbench also exhibits this large performance gain (a 64-bit integer prime number benchmark reliant on division/remainder).
Pi4 8GB running PIOS64 Lite

m4r35n357
Posts: 42
Joined: Fri Jul 06, 2012 4:31 pm

Re: Euler performance revisited...

Mon Nov 09, 2020 4:24 pm

jahboater wrote:
Sun Sep 27, 2020 9:25 am
That may be because it has a fast hardware 64-bit division instruction, in 32-bit mode its done by a library call.
sysbench also exhibits this large performance gain (a 64-bit integer prime number benchmark reliant on division/remainder).
Sounds plausible.

On a related subject, I have just tried both the RaspOS and Ubuntu 64 bit distros on my mathematical code and it runs at least an order of magnitude slower on each, so I strongly suspect software float. I hope they will hook up to the hardware at some point . . .

Heater
Posts: 16832
Joined: Tue Jul 17, 2012 3:02 pm

Re: Euler performance revisited...

Mon Nov 09, 2020 5:15 pm

lurk101 wrote:
Sun Sep 27, 2020 8:35 am
Multi-threaded efficiency ( defined as (single_time * cores) / multi_time ) drops sharply above 8 threads regardless of architecture.
It's never been clear to me why some expect multi-threaded efficiency above and beyond the number of cores a machine should improve, or even stay the same.

To my simple mind if one splits the work over 8 threads on a 4 core machine one has just introduced the overheads of juggling two threads on each core. How can that be anything but worse?

Ignoring the notion of hyperthreads, whatever they actually are, on x86 machines. The Pi does not have hyperthreads as far as I know.
Memory in C++ is a leaky abstraction .

lurk101
Posts: 289
Joined: Mon Jan 27, 2020 2:35 pm

Re: Euler performance revisited...

Mon Nov 09, 2020 6:43 pm

Heater wrote:
Mon Nov 09, 2020 5:15 pm
lurk101 wrote:
Sun Sep 27, 2020 8:35 am
Multi-threaded efficiency ( defined as (single_time * cores) / multi_time ) drops sharply above 8 threads regardless of architecture.
It's never been clear to me why some expect multi-threaded efficiency above and beyond the number of cores a machine should improve, or even stay the same.

To my simple mind if one splits the work over 8 threads on a 4 core machine one has just introduced the overheads of juggling two threads on each core. How can that be anything but worse?

Ignoring the notion of hyperthreads, whatever they actually are, on x86 machines. The Pi does not have hyperthreads as far as I know.
As I recall that was about performance gains dropping off sharply beyond 8 threads on an 8 core, 16 hyperthread CPU. But you're right, much is shared between pairs of hyperthreads, the likely bottleneck.

Return to “General programming discussion”