## Super-cheap Computing Cluster for Learning

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

A surprising number of jobs that execute on many clusters are single-threaded and non-parallel. In those cases parallelism is sometimes obtained as a by-product of running many independent copies of the same program with different choices of parameters. Consider the Monte-Carlo method for approximating Pi from the OctaPi project Calculating Pi. Like darts thrown at a dartboard, this method generates random points in a square and then counts how many of those points fall inside the circle. Note that, due to the randomness, one obtains a different approximation of Pi with each run.
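The idea can be sketched in a few lines of Python. This is a minimal stand-alone version of the dartboard method, not the OctaPi code itself; the function name `estimate_pi` is ours:

```python
import random

def estimate_pi(n, seed=None):
    """Throw n random darts at the unit square and count how many
    land inside the quarter circle x*x + y*y <= 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x = rng.random()
        y = rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle has area pi/4, so scale the fraction by 4.
    return 4.0 * inside / n

print(estimate_pi(100000))  # a different approximation on every unseeded run
```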

While the dartboard method is a general way of approximating areas, it converges very slowly, at a rate of 1/sqrt(n), and is therefore also one of the least efficient ways to compute Pi. The Python program included in the OctaPi project computes the percentage of darts falling inside the circle at a speed of 25000 darts/second using a precision of 100 decimal digits. To compute all 100 decimal digits accurately would require approximately 0.4E197 darts and consequently about 0.5E185 years to finish. On the other hand, the age of the universe is estimated to be about 1.3E10 years, and it follows that in that amount of time a Pi Zero could throw 4.1E18 darts to obtain an approximation good to about 8 significant digits. As in the OctaPi project, our goal is not to compute Pi, but to understand parallel processing and how the Slurm batch system works.
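The 1/sqrt(n) rate can be made concrete. By standard Monte-Carlo theory the standard error of the estimate 4*inside/n is 4*sqrt(p*(1-p)/n) with p = pi/4, so every additional correct digit costs roughly 100 times more darts. A back-of-envelope helper of our own illustrating this:

```python
from math import pi, sqrt

def dartboard_stderr(n):
    """Standard error of the dartboard estimate 4*inside/n after n darts."""
    p = pi / 4  # probability that a dart lands inside the circle
    return 4 * sqrt(p * (1 - p) / n)

# A tenfold improvement in accuracy requires a hundredfold increase in darts.
for n in (10**4, 10**6, 10**8):
    print(n, dartboard_stderr(n))
```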

We begin by creating a batch job which runs the Python code from the OctaPi project on one of the available Pi Zeros in the cluster. Make a working directory called pi. Then download the program file pi_dartboard.py from the OctaPi project and move it to the working directory. Note this is the dartboard algorithm for a stand-alone processor, not the version designed for clusters. Before continuing, we make one minor change to pi_dartboard.py to flush the output buffers after each status update. Without this change, all output would be buffered until the program finishes, which would make it impossible to monitor the status of the program while it is running in the batch environment. To do this, add the option flush=True to the print on line 63 to obtain

Code: Select all

``            print(('Executed trial %i using random seed %i with result %i' % (i, ran_seed, inside)),flush=True)``
Next create a Slurm script called pi_dartboard.slm which contains

Code: Select all

``````#!/bin/bash
time {
printf '100\n10000\n' | python3 pi_dartboard.py
}``````
Since the original Python program was intended to be run interactively, it waits on user input for the size of each trial and the number of trials to run. In a batch system the program will be started when computing resources become available at some unspecified time in the future. Therefore, we specify the inputs for the program ahead of time. In the above case we set the trial size to 100 points and the number of trials to 10000.

Now, submit a few copies of the batch job, check whether they have been scheduled and monitor their progress using

Code: Select all

``````$ sbatch pi_dartboard.slm
Submitted batch job 54
$ sbatch pi_dartboard.slm
Submitted batch job 55
$ sbatch pi_dartboard.slm
Submitted batch job 56
$ sbatch pi_dartboard.slm
Submitted batch job 57
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
54      zero pi_dartb  ejolson  R       0:07      1 s0
55      zero pi_dartb  ejolson  R       0:04      1 s1
56      zero pi_dartb  ejolson  R       0:04      1 s2
57      zero pi_dartb  ejolson  R       0:04      1 s3
$ cat slurm-56.out
Number of random points to include in each trial =
Number of trials to run =
Doing 10000 trials of 100 points each
Executed trial 0 using random seed 43897 with result 86
Executed trial 1000 using random seed 52318 with result 79
Executed trial 2000 using random seed 59206 with result 80``````
Note that the output of each run is stored in a file of the form slurm-XX.out, where XX is the job ID of the run. When the squeue listing is empty, all programs have finished. The final output should now appear as

Code: Select all

``````$ cat slurm-54.out
Number of random points to include in each trial = Number of trials to run = Doing 10000 trials of 100 points each
Executed trial 0 using random seed 41166 with result 74
Executed trial 1000 using random seed 36385 with result 79
Executed trial 2000 using random seed 39999 with result 87
Executed trial 3000 using random seed 55252 with result 79
Executed trial 4000 using random seed 9008 with result 79
Executed trial 5000 using random seed 28951 with result 81
Executed trial 6000 using random seed 22201 with result 79
Executed trial 7000 using random seed 54954 with result 74
Executed trial 8000 using random seed 65485 with result 74
Executed trial 9000 using random seed 53049 with result 78
The value of Pi is estimated to be 3.14902000000000015234036254696547985076904296875 using 1000000 points

real    0m40.572s
user    0m40.300s
sys 0m0.120s
$ cat slurm-55.out
Number of random points to include in each trial = Number of trials to run = Doing 10000 trials of 100 points each
Executed trial 0 using random seed 50261 with result 78
Executed trial 1000 using random seed 24993 with result 75
Executed trial 2000 using random seed 56492 with result 76
Executed trial 3000 using random seed 2695 with result 79
Executed trial 4000 using random seed 39028 with result 79
Executed trial 5000 using random seed 42505 with result 69
Executed trial 6000 using random seed 38572 with result 79
Executed trial 7000 using random seed 21871 with result 87
Executed trial 8000 using random seed 52748 with result 79
Executed trial 9000 using random seed 16073 with result 77
The value of Pi is estimated to be 3.1490960000000001173248165287077426910400390625 using 1000000 points

real    0m40.088s
user    0m39.830s
sys 0m0.080s
$ cat slurm-56.out
Number of random points to include in each trial = Number of trials to run = Doing 10000 trials of 100 points each
Executed trial 0 using random seed 43897 with result 86
Executed trial 1000 using random seed 52318 with result 79
Executed trial 2000 using random seed 59206 with result 80
Executed trial 3000 using random seed 16956 with result 82
Executed trial 4000 using random seed 44605 with result 82
Executed trial 5000 using random seed 11321 with result 76
Executed trial 6000 using random seed 46185 with result 77
Executed trial 7000 using random seed 23947 with result 76
Executed trial 8000 using random seed 29328 with result 74
Executed trial 9000 using random seed 62471 with result 85
The value of Pi is estimated to be 3.149503999999999859227273191208951175212860107421875 using 1000000 points

real    0m37.257s
user    0m36.980s
sys 0m0.100s
$ cat slurm-57.out
Number of random points to include in each trial = Number of trials to run = Doing 10000 trials of 100 points each
Executed trial 0 using random seed 17784 with result 77
Executed trial 1000 using random seed 18571 with result 76
Executed trial 2000 using random seed 11993 with result 81
Executed trial 3000 using random seed 58358 with result 75
Executed trial 4000 using random seed 60686 with result 84
Executed trial 5000 using random seed 38634 with result 79
Executed trial 6000 using random seed 34956 with result 80
Executed trial 7000 using random seed 11397 with result 81
Executed trial 8000 using random seed 54805 with result 77
Executed trial 9000 using random seed 1196 with result 78
The value of Pi is estimated to be 3.14919200000000021333335098461247980594635009765625 using 1000000 points

real    0m39.179s
user    0m38.910s
sys 0m0.090s``````
We observe that 1000000 darts were thrown in about 40 seconds per Pi Zero, for a speed of 25000 darts/second per Pi. Also note that the approximation is good to only about 3 significant digits, even though 100 digits of precision were used in the calculation, and that the resulting approximation is different for each run. In particular, averaging the approximations obtained from all the runs results in a better approximation. This is how we shall obtain parallel speed-up when using multiple Zeros.
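The effect of averaging can be seen directly by simulation. The sketch below is our own illustration using a simplified estimator: it compares the spread of single 10000-dart estimates with the spread of averages of four of them. Averaging 4 independent runs is statistically equivalent to one run with 4 times as many darts, so by the 1/sqrt(n) rate the spread should roughly halve:

```python
import random
import statistics

def estimate_pi(rng, n):
    """One Monte-Carlo estimate of Pi from n darts."""
    inside = sum(1 for _ in range(n)
                 if rng.random()**2 + rng.random()**2 <= 1.0)
    return 4.0 * inside / n

rng = random.Random(12345)
singles = [estimate_pi(rng, 10000) for _ in range(100)]
# Group the independent estimates four at a time and average each group.
averages = [statistics.mean(singles[i:i + 4]) for i in range(0, 100, 4)]
print(statistics.stdev(singles), statistics.stdev(averages))
```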

For performance reasons, scientific computing is usually done using a combination of the Fortran and C programming languages. While Fortran remains a domain-specific language with little popularity outside scientific computing, the C programming language is equally suited to scientific computing but also widely used in many other application areas. We rewrite the Monte-Carlo method for computing Pi in C as an example of the single-threaded non-parallel codes that might be run on computing clusters. The C program looks like

Code: Select all

``````#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/resource.h>

typedef unsigned int uint;
uint no_of_trials = 1000;
uint no_of_points = 100000;

uint compute(uint s,uint n){
    srandom(s);
    uint inside=0;
    for(uint i=0;i<n;i++){
        double x=(double)random()/RAND_MAX;
        double y=(double)random()/RAND_MAX;
        double z=x*x+y*y;
        if(z<=1.0){
            inside++;
        }
    }
    return inside;
}

static double tic_time;
void tic() {
    struct timeval tp;
    gettimeofday(&tp,0);
    tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
    struct timeval tp;
    gettimeofday(&tp,0);
    return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

int main(){
    printf("Estimating Pi using Dartboard/Monte-Carlo Method\n"
        "with %u trials of %u points.\n\n",
        no_of_trials,no_of_points);
    fflush(stdout);
    double pi=0;
    tic();
    double tprint=5;
    FILE *urandom=fopen("/dev/urandom","r");
    for(uint i=1;i<=no_of_trials;i++){
        uint ran_seed;
        /* draw a fresh random seed for each trial */
        fread(&ran_seed,sizeof(ran_seed),1,urandom);
        uint inside=compute(ran_seed,no_of_points);
        double tnow=toc();
        if(tnow>tprint){
            printf("Trial %u using seed %u with %u inside"
                " (ETA %g seconds)\n",
                i,ran_seed,inside,tnow/i*(no_of_trials-i));
            fflush(stdout);
            tprint=tnow+5;
        }
        pi+=(double)inside/no_of_points;
    }
    fclose(urandom);
    pi*=4.0/no_of_trials;
    printf("\nThe value of Pi is estimated to be %.15f\n"
        "using %u trials of %u points each.\n\n",
        pi,no_of_trials,no_of_points);
    double ttime=toc();
    printf("Total running time %g seconds\n"
        "with average rate of %g darts/second.\n",
        ttime,no_of_trials/ttime*no_of_points);
    return 0;
}
``````
and the corresponding pi.slm is

Code: Select all

``````#!/bin/bash
./pi
``````
Compile and submit this program, then monitor its progress using

Code: Select all

``````$ gcc -std=gnu99 -Wall -O2 -o pi pi.c -lm
$ sbatch pi.slm
Submitted batch job 66
$ sbatch pi.slm
Submitted batch job 67
$ sbatch pi.slm
Submitted batch job 68
$ sbatch pi.slm
Submitted batch job 69
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
66      zero   pi.slm  ejolson  R       0:04      1 s0
67      zero   pi.slm  ejolson  R       0:04      1 s1
68      zero   pi.slm  ejolson  R       0:01      1 s2
69      zero   pi.slm  ejolson  R       0:01      1 s3
$ cat slurm-68.out
Estimating Pi using Dartboard/Monte-Carlo Method
with 1000 trials of 100000 points.

Trial 168 using seed 1638688743 with 78596 inside (ETA 24.894 seconds)
Trial 337 using seed 76444161 with 78365 inside (ETA 19.7693 seconds)
Trial 506 using seed 1246513006 with 78327 inside (ETA 14.7182 seconds)``````
The final output from one of the jobs looks like

Code: Select all

``````$ cat slurm-68.out
Estimating Pi using Dartboard/Monte-Carlo Method
with 1000 trials of 100000 points.

Trial 168 using seed 1638688743 with 78596 inside (ETA 24.894 seconds)
Trial 337 using seed 76444161 with 78365 inside (ETA 19.7693 seconds)
Trial 506 using seed 1246513006 with 78327 inside (ETA 14.7182 seconds)
Trial 675 using seed 3803787632 with 78665 inside (ETA 9.67678 seconds)
Trial 844 using seed 1400931100 with 78663 inside (ETA 4.64343 seconds)

The value of Pi is estimated to be 3.141761479999999
using 1000 trials of 100000 points each.

Total running time 29.7284 seconds
with average rate of 3.36379e+06 darts/second.``````
We note that the C code runs about 132 times faster than the Python code. Therefore, if a Python program is running slowly, it might be wise to translate it to a more efficient language before making a parallel version. On the other hand, the original Python code may run nearly as fast as the C version under the PyPy just-in-time compiler. It would be interesting for someone to check this. In subsequent posts we obtain a parallel increase in speed for the C version of this computation using the MPI message passing interface.
Last edited by ejolson on Fri Oct 12, 2018 11:24 am, edited 1 time in total.

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

In this post we will create a parallel Monte-Carlo approximation of Pi. Almost all parallel programs that run on supercomputers are created using the MPI message passing interface. This library passes messages over the high-speed interconnect, copying data back and forth between distributed memories to coordinate the individual nodes in performing a single computation. From an educational point of view, it is wonderful that the same MPI libraries used to program the fastest supercomputers in the world also run on the super-cheap Raspberry Pi cluster.

To use MPI, a parallel code must be divided into ranks that run on separate CPUs. As all supercomputers run Linux these days, it is reasonable to think of a rank as a Linux process. To minimize latencies induced by context switching in general-purpose operating systems not designed for message passing, most MPI libraries busy wait, spinning at 100% CPU while rapidly polling until new data arrives. Historically, busy waiting was not seen as a problem because it was assumed that exactly one CPU is reserved for each rank. It also significantly speeds things up when using shared memory to pass messages between ranks on a multi-core node. Unfortunately, busy waiting reduces energy efficiency by preventing modern hardware from entering low-power states, it makes multi-threaded programming difficult, and it degrades throughput in cases where the CPU could otherwise have been used for timeshared multitasking.

In particular, busy waiting does not speed things up on the super-cheap cluster. There is already significant latency in the USB Ethernet gadget, and speeding up message passing through shared memory is not possible because there is only one CPU per Zero. Moreover, having only one CPU means nothing is left to run the operating system if we devote the entire CPU to the MPI calculation. Currently there are two competitive implementations of the MPI library: MPICH, which was developed at Argonne National Laboratory, and OpenMPI, which represents a merger between versions developed at the University of Tennessee, Los Alamos National Laboratory and Indiana University. Between the two, it turns out that MPICH can more easily be tuned to use blocking input/output instead of busy waiting. However, the version of MPICH included in Raspbian was configured to use busy waiting. Therefore, we configure and compile it ourselves.

$ tar zxf mpich-3.2.1.tar.gz
$ cd mpich-3.2.1
$ ./configure --with-device=ch3:sock
$ make

Note that --with-device=ch3:sock is the option required to avoid busy waiting. If all goes well, switch to the root user and install using

$ su
# make install

At this point update the root nodes of the filesystem using the commands

# cd /x
# ./sreboot

wait for the Zeros to disconnect from the network bridge and then type

# cd /x
# ./supdate
# ./szeroup

Once the Zeros reboot you may have to update their state in Slurm using the commands

# scontrol update nodename=s0 state=idle
# scontrol update nodename=s1 state=idle
# scontrol update nodename=s2 state=idle
# scontrol update nodename=s3 state=idle
# scontrol update nodename=s4 state=idle

At this point we are ready to create a parallel Monte-Carlo approximation of Pi using the MPICH message passing library. The MPI C code is named pi_mpi.c and looks like

Code: Select all

``````#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <mpi.h>
int mpi_rank, mpi_size;

typedef unsigned int uint;
uint no_of_trials = 10000;
uint no_of_points = 100000;

typedef struct { uint seed,darts,inside; } computeargs;

uint compute(uint s,uint n){
    srandom(s);
    uint inside=0;
    for(uint i=0;i<n;i++){
        double x=(double)random()/RAND_MAX;
        double y=(double)random()/RAND_MAX;
        double z=x*x+y*y;
        if(z<=1.0){
            inside++;
        }
    }
    return inside;
}
uint rankn(){
    for(;;){
        computeargs x;
        MPI_Status err;
        MPI_Recv(&x,sizeof(computeargs),MPI_BYTE,
            MPI_ANY_SOURCE,0,MPI_COMM_WORLD,&err);
        if(!x.darts) return 0;
        x.inside=compute(x.seed,x.darts);
        MPI_Send(&x,sizeof(computeargs),MPI_BYTE,
            0,0,MPI_COMM_WORLD);
    }
}

static double tic_time;
void tic() {
    struct timeval tp;
    gettimeofday(&tp,0);
    tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
    struct timeval tp;
    gettimeofday(&tp,0);
    return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

int rank0(){
    printf("Estimating Pi using Dartboard/Monte-Carlo Method\n"
        "with %u trials of %u points.\n\n",
        no_of_trials,no_of_points);
    fflush(stdout);
    double pi=0;
    tic();
    double tprint=5;
    uint started=0,done=0,stopped=0;
    FILE *urandom=fopen("/dev/urandom","r");
    for(;;){
        int rank;
        if(started>=(uint)(mpi_size-1)){
            computeargs x;
            MPI_Status err;
            MPI_Recv(&x,sizeof(computeargs),MPI_BYTE,
                MPI_ANY_SOURCE,0,MPI_COMM_WORLD,&err);
            done++;
            double tnow=toc();
            if(tnow>tprint){
                printf("Trial %u rank %d seed %u with %u inside"
                    " (ETA %g sec)\n",
                    done,err.MPI_SOURCE,x.seed,x.inside,
                    tnow/done*(no_of_trials-done));
                fflush(stdout);
                tprint=tnow+5;
            }
            pi+=(double)x.inside/no_of_points;
            rank=err.MPI_SOURCE;
        } else {
            rank=started+1;
        }
        if(started<no_of_trials){
            started++;
            uint ran_seed;
            /* draw a fresh random seed for each trial */
            fread(&ran_seed,sizeof(ran_seed),1,urandom);
            computeargs x = { ran_seed, no_of_points, 0 };
            MPI_Send(&x,sizeof(computeargs),MPI_BYTE,
                rank,0,MPI_COMM_WORLD);
        } else {
            /* darts=0 tells the worker rank to exit */
            computeargs x = { 0, 0, 0 };
            MPI_Send(&x,sizeof(computeargs),MPI_BYTE,
                rank,0,MPI_COMM_WORLD);
            if(++stopped>=(uint)(mpi_size-1)) break;
        }
    }
    fclose(urandom);
    pi*=4.0/no_of_trials;
    printf("\nThe value of Pi is estimated to be %.15f\n"
        "using %u trials of %u points each.\n\n",
        pi,no_of_trials,no_of_points);
    double ttime=toc();
    printf("Total running time %g seconds\n"
        "with average rate of %g darts/second.\n",
        ttime,no_of_trials/ttime*no_of_points);
    return 0;
}

int main(int argc, char *argv[]){
    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
    if(mpi_size<2){
        printf("Please use at least 2 ranks!\n");
    } else {
        if(mpi_rank>0) rankn();
        else rank0();
    }
    MPI_Finalize();
    return 0;
}``````
Note that rank 0 collects the results while the other ranks perform the computations. Compile the program using

$ mpicc -std=gnu99 -Wall -O2 -o pi_mpi pi_mpi.c -lm

and then run it using batch script pi_mpi.slm given by

Code: Select all

``````#!/bin/bash
#SBATCH -n5
mpirun -n 6 ./pi_mpi``````
with the command

$ sbatch pi_mpi.slm

Note that Slurm has been set to allocate five tasks (here one per node) but mpirun starts the MPI C program using six ranks. The result is that rank 0 and rank 5 will both be started on node s0. This over-provisioning would lead to performance degradation if the MPI library were configured for busy waiting, but since we are using blocking input/output we can efficiently use all Zeros for the calculation while rank 0 collects the results. The output looks like

Code: Select all

``````Estimating Pi using Dartboard/Monte-Carlo Method
with 10000 trials of 100000 points.

Trial 802 rank 1 seed 426747775 with 78258 inside (ETA 57.425 sec)
Trial 1609 rank 1 seed 4232193648 with 78599 inside (ETA 52.2031 sec)
Trial 2418 rank 1 seed 2269661710 with 78461 inside (ETA 47.0672 sec)
Trial 3225 rank 5 seed 264736980 with 78526 inside (ETA 42.0558 sec)
Trial 4032 rank 5 seed 3973449706 with 78469 inside (ETA 37.0415 sec)
Trial 4841 rank 5 seed 1758916544 with 78848 inside (ETA 31.9995 sec)
Trial 5649 rank 1 seed 1912997491 with 78575 inside (ETA 26.9957 sec)
Trial 6458 rank 2 seed 3703685934 with 78492 inside (ETA 21.9661 sec)
Trial 7265 rank 1 seed 2545871756 with 78677 inside (ETA 16.9625 sec)
Trial 8072 rank 5 seed 3193836803 with 78528 inside (ETA 11.958 sec)
Trial 8882 rank 2 seed 2354279827 with 78560 inside (ETA 6.9314 sec)
Trial 9689 rank 2 seed 4144606175 with 78351 inside (ETA 1.92819 sec)

The value of Pi is estimated to be 3.141592439999978
using 10000 trials of 100000 points each.

Total running time 61.9744 seconds
with average rate of 1.61357e+07 darts/second.``````
Comparing the darts per second rate of the parallel code with the serial code in the previous post we obtain that

1.61357e+07/3.36379e+06=4.797

and conclude that our cluster of five Zeros performs the Monte-Carlo calculation almost five times faster than a single Zero.
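The master/worker pattern that pi_mpi.c builds from MPI_Send and MPI_Recv can be sketched in pure Python with the standard multiprocessing module standing in for the worker ranks. This is our own illustration of the pattern, not a translation of the C code; the names trial and work are ours. Rank 0's role is played by the main process, which hands out (seed, darts) work units and combines the results as they come back in whatever order the workers finish:

```python
import random
from multiprocessing import Pool

def trial(args):
    """One work unit: throw `darts` darts with the given seed and
    return the count that landed inside the circle."""
    seed, darts = args
    rng = random.Random(seed)
    return sum(1 for _ in range(darts)
               if rng.random()**2 + rng.random()**2 <= 1.0)

if __name__ == "__main__":
    no_of_trials, no_of_points = 100, 1000
    work = [(seed, no_of_points) for seed in range(no_of_trials)]
    with Pool(2) as pool:  # two workers stand in for the worker ranks
        # imap_unordered delivers results as they finish, like MPI_ANY_SOURCE
        total = sum(pool.imap_unordered(trial, work))
    pi = 4.0 * total / (no_of_trials * no_of_points)
    print("Pi is estimated to be", pi)
```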
Last edited by ejolson on Fri Oct 12, 2018 11:27 am, edited 2 times in total.

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

The program pi_mpi.c from the previous post uses the MPI message passing interface to simulate remote procedure calls which are then executed asynchronously on the available nodes. This is because the original OctaPi: Calculating Pi project used remote procedure calls and we chose to emulate the parallel Python code as closely as possible. A simpler, more idiomatic MPI program might break the problem into mpi_size parts and combine them to form the solution at the end of the run. The C code pi_reduce.c given by

Code: Select all

``````#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <mpi.h>
int mpi_rank, mpi_size;

typedef unsigned int uint;
uint no_of_trials = 10000;
uint no_of_points = 100000;
double global_pi;

static double tic_time;
void tic() {
    struct timeval tp;
    gettimeofday(&tp,0);
    tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
    struct timeval tp;
    gettimeofday(&tp,0);
    return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

uint compute(uint s,uint n){
    srandom(s);
    uint inside=0;
    for(uint i=0;i<n;i++){
        double x=(double)random()/RAND_MAX;
        double y=(double)random()/RAND_MAX;
        double z=x*x+y*y;
        if(z<=1.0){
            inside++;
        }
    }
    return inside;
}
void dotrials(){
    int first_trial=mpi_rank*no_of_trials/mpi_size;
    int last_trial=(mpi_rank+1)*no_of_trials/mpi_size;
    int trials=last_trial-first_trial;
    double pi=0;
    FILE *urandom=fopen("/dev/urandom","r");
    for(int i=first_trial;i<last_trial;i++){
        uint ran_seed;
        /* draw a fresh random seed for each trial */
        fread(&ran_seed,sizeof(ran_seed),1,urandom);
        uint inside=compute(ran_seed,no_of_points);
        pi+=(double)inside/no_of_points;
    }
    fclose(urandom);
    pi*=4.0/trials;
    printf("Rank %d finished in %g seconds with Pi %.15f\n",
        mpi_rank,toc(),pi);
    fflush(stdout);
    MPI_Reduce(&pi,&global_pi,1,MPI_DOUBLE,
        MPI_SUM,0,MPI_COMM_WORLD);
}

int main(int argc, char *argv[]){
    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

    tic();
    if(mpi_rank==0){
        printf("Estimating Pi using Dartboard/Monte-Carlo Method\n"
            "with %u trials of %u points.\n\n",
            no_of_trials,no_of_points);
        fflush(stdout);
    }

    dotrials();

    if(mpi_rank==0){
        global_pi/=mpi_size;
        printf("\nThe value of Pi is estimated to be %.15f\n"
            "using %u trials of %u points each.\n\n",
            global_pi,no_of_trials,no_of_points);
        double ttime=toc();
        printf("Total running time %g seconds\n"
            "with average rate of %g darts/second.\n",
            ttime,no_of_trials/ttime*no_of_points);
    }

    MPI_Finalize();
    return 0;
}``````
uses a single MPI_Reduce call to sum the separate approximations of Pi from each of the nodes and combine them into a final answer. Running the above program using the Slurm configuration file

Code: Select all

``````#!/bin/bash
#SBATCH -n5
mpirun ./pi_reduce``````
results in the output

Code: Select all

``````Estimating Pi using Dartboard/Monte-Carlo Method
with 10000 trials of 100000 points.

Rank 3 finished in 59.4139 seconds with Pi 3.141538220000008
Rank 4 finished in 59.4436 seconds with Pi 3.141701000000008
Rank 1 finished in 59.4442 seconds with Pi 3.141408220000017
Rank 2 finished in 59.4451 seconds with Pi 3.141482060000007
Rank 0 finished in 60.6071 seconds with Pi 3.141667720000010

The value of Pi is estimated to be 3.141559444000010
using 10000 trials of 100000 points each.

Total running time 60.6118 seconds
with average rate of 1.64984e+07 darts/second.``````
Again comparing the darts per second rate with the serial code we obtain a parallel speedup of

1.64984e+07/3.36379e+06=4.905

and infer the above program is slightly faster than the previous MPI code that used simulated remote procedure calls.

Using MPI_Reduce works well for the Monte-Carlo simulation because each trial takes essentially the same amount of time and each node in the super-cheap cluster has the same performance characteristics. Thus, the total work needed to throw all the darts is easily partitioned between the individual ranks ahead of time. If the cluster were heterogeneous, made of computers running at different speeds, or if some of the nodes were over-provisioned running additional tasks, then it would not be possible to accurately partition the work ahead of time and the program would run only as fast as the slowest node.

We emphasize that using simulated remote procedure calls as in the previous MPI code adapts to situations where the time to compute each trial on any particular node is unpredictable. At the same time, the number of ranks to which this method can scale is limited by the fact that the rank 0 process must explicitly start each trial on each of the other ranks. In the case of the super-cheap cluster, there are only 5 nodes so this is not an issue. However, larger clusters have thousands of nodes. In that case MPI_Reduce may be preferable because it uses sophisticated algorithms built into the MPI library that can scale to a much larger number of ranks.
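The static partition that pi_reduce.c computes with first_trial and last_trial deserves a closer look: the integer formula rank*n/size splits n trials into size contiguous blocks whose lengths differ by at most one, even when size does not divide n evenly. A quick check of that property, written by us in Python but mirroring the C arithmetic in dotrials:

```python
def partition(n, size):
    """Return the [first, last) trial range for every rank, using the
    same integer arithmetic as pi_reduce.c."""
    return [(rank * n // size, (rank + 1) * n // size)
            for rank in range(size)]

ranges = partition(10000, 6)  # e.g. 10000 trials over 6 ranks
print([last - first for first, last in ranges])
# → [1666, 1667, 1667, 1666, 1667, 1667]
```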

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

As stated at the beginning of this thread, the super-cheap computing cluster is designed for learning, not for practical scientific computation. To illustrate the difference in performance, we now run pi_reduce on a 24-core 2.6GHz Xeon Scalable server. The results

Code: Select all

``````Estimating Pi using Dartboard/Monte-Carlo Method
with 10000 trials of 1000000 points.

Rank 3 finished in 15.6546 seconds with Pi 3.141525894230769
Rank 6 finished in 15.6564 seconds with Pi 3.141710009615383
Rank 12 finished in 15.6574 seconds with Pi 3.141567644230772
Rank 0 finished in 15.6596 seconds with Pi 3.141744076923075
Rank 18 finished in 15.6618 seconds with Pi 3.141539538461537
Rank 15 finished in 15.6683 seconds with Pi 3.141624634615384
Rank 9 finished in 15.6766 seconds with Pi 3.141604730769233
Rank 21 finished in 15.6854 seconds with Pi 3.141742913461537
Rank 2 finished in 15.6938 seconds with Pi 3.141564623501198
Rank 8 finished in 15.6941 seconds with Pi 3.141575184652276
Rank 17 finished in 15.694 seconds with Pi 3.141724940047959
Rank 23 finished in 15.694 seconds with Pi 3.141690100719419
Rank 20 finished in 15.6943 seconds with Pi 3.141587194244604
Rank 10 finished in 15.6949 seconds with Pi 3.141487079136687
Rank 16 finished in 15.6953 seconds with Pi 3.141587741007196
Rank 1 finished in 15.6953 seconds with Pi 3.141562302158272
Rank 14 finished in 15.7046 seconds with Pi 3.141616575539569
Rank 5 finished in 15.7096 seconds with Pi 3.141489410071944
Rank 4 finished in 15.7101 seconds with Pi 3.141559769784172
Rank 22 finished in 15.7104 seconds with Pi 3.141569179856113
Rank 11 finished in 15.715 seconds with Pi 3.141545352517987
Rank 19 finished in 15.7156 seconds with Pi 3.141607050359713
Rank 13 finished in 15.7251 seconds with Pi 3.141589227817745
Rank 7 finished in 15.7348 seconds with Pi 3.141577544364508

The value of Pi is estimated to be 3.141599696586960
using 10000 trials of 1000000 points each.

Total running time 15.7351 seconds
with average rate of 6.35523e+08 darts/second.``````
indicate that a single node of a typical supercomputer runs the code we developed on the Pi a factor of

6.35523e+08/1.64984e+07=38.5

times faster than the super-cheap cluster. This is actually less of a speedup than I had expected, especially considering the Xeon server costs about 100 times more than the entire Pi cluster. Note that we are essentially comparing the speed of the pseudo-random number generator built into the C library on the ARM and x86 versions of Linux. If five Xeon nodes were devoted to the computation, one would expect a 188 times performance increase. Additional speedup may be possible with tuning. We end this post with the following observation: it is significant from a learning point of view that the code developed for the super-cheap cluster will run on a real supercomputer without modification using exactly the same commands.

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

This post describes how to install and run the linear algebra benchmark HPL, which solves systems of linear equations using Gaussian elimination with partial pivoting. This benchmark is used to rank the top 500 supercomputers in the world in both performance and electrical efficiency.
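What HPL actually times can be sketched compactly. The toy solver below is our own dense, unblocked illustration of Gaussian elimination with partial pivoting; HPL's real kernel is blocked, distributed across ranks, and built on optimized BLAS, but the underlying algorithm is the same:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    A is a list of row lists; both A and b are modified in place."""
    n = len(A)
    for k in range(n):
        # Partial pivoting: move the largest entry in column k to row k.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k below the pivot.
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

print(solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))  # ≈ [0.8, 1.4]
```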

Before getting started install a couple of subroutine libraries: the basic linear algebra subroutine library OpenBLAS as well as the linear algebra package LAPACK. Log into the head node and as root type

# apt-get install libopenblas-dev liblapacke-dev

Note that LAPACK is not needed for the benchmark, but may be needed for running other programs in the future. While making changes, edit /etc/bash.bashrc and add the line

export SQUEUE_FORMAT="%.6i %.18j %.10u %.2t %.11M %.3C %.8r %N"

in order to make the default output format for the squeue command more appropriate to our hardware configuration. Finally, let's update the operating system using

# apt-get update
# apt-get upgrade

Be sure to watch out for kernel upgrades. If the kernel is upgraded, you will have to regenerate the initramfs and copy the new kernel into the correct place manually for the new kernel to be used. At this point I discovered that the original 48M size of the /boot partition was too small for the package manager to update the kernel and initramfs while also storing the additional files mykernel.img and myinitrd.img. This oversight has been fixed by editing the earlier posts to specify a /boot partition size of 96M.

Propagate the changes from the head node to the computational nodes by shutting them down, updating their snapshots and rebooting them using the same combination of sreboot, supdate and szeroup commands described earlier in this thread.

$ cd
$ mkdir -p work
$ cd work
$ wget http://www.netlib.org/benchmark/hpl/hpl-2.2.tar.gz
$ tar zxf hpl-2.2.tar.gz
$ cd hpl-2.2

Create a configuration file called Make.OpenBLAS which reads

Code: Select all

``````$ cat Make.OpenBLAS
#
#  -- High Performance Computing Linpack Benchmark (HPL)
#     HPL - 2.2 - February 24, 2016
#     Antoine P. Petitet
#     University of Tennessee, Knoxville
#     Innovative Computing Laboratory
#
#  -- Copyright notice and Licensing terms:
#
#  Redistribution  and  use in  source and binary forms, with or without
#  modification, are  permitted provided  that the following  conditions
#  are met:
#
#  1. Redistributions  of  source  code  must retain the above copyright
#  notice, this list of conditions and the following disclaimer.
#
#  2. Redistributions in binary form must reproduce  the above copyright
#  notice, this list of conditions,  and the following disclaimer in the
#  documentation and/or other materials provided with the distribution.
#
#  3. All  advertising  materials  mentioning  features  or  use of this
#  software must display the following acknowledgement:
#  This  product  includes  software  developed  at  the  University  of
#  Tennessee, Knoxville, Innovative Computing Laboratory.
#
#  4. The name of the  University,  the name of the  Laboratory,  or the
#  names  of  its  contributors  may  not  be used to endorse or promote
#  products  derived   from   this  software  without  specific  written
#  permission.
#
#  -- Disclaimer:
#
#  THIS  SOFTWARE  IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
#  ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,  INCLUDING,  BUT NOT
#  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
#  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY
#  OR  CONTRIBUTORS  BE  LIABLE FOR ANY  DIRECT,  INDIRECT,  INCIDENTAL,
#  SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL DAMAGES  (INCLUDING,  BUT NOT
#  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
#  DATA OR PROFITS; OR BUSINESS INTERRUPTION)  HOWEVER CAUSED AND ON ANY
#  THEORY OF LIABILITY, WHETHER IN CONTRACT,  STRICT LIABILITY,  OR TORT
#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
#  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# ######################################################################
#
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -s -f
MKDIR        = mkdir -p
RM           = /bin/rm -f
TOUCH        = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH         = OpenBLAS
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir       = /x/snaa/ejolson/work/hpl-2.2
INCdir       = \$(TOPdir)/include
BINdir       = \$(TOPdir)/bin/\$(ARCH)
LIBdir       = \$(TOPdir)/lib/\$(ARCH)
#
HPLlib       = \$(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the  C  compiler where to find the Message Passing library
# header files,  MPlib  is defined  to be the name of  the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir        = /usr/local
MPinc        = -I\$(MPdir)/include
MPlib        = -lmpich
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the  C  compiler where to find the Linear Algebra  library
# header files,  LAlib  is defined  to be the name of  the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAinc        =
LAlib        = -lopenblas
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section  if and only if  you are not planning to use
# a  BLAS  library featuring a Fortran 77 interface.  Otherwise,  it  is
# necessary  to  fill out the  F2CDEFS  variable  with  the  appropriate
# options.  **One and only one**  option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_              : all lower case and a suffixed underscore  (Suns,
#                       Intel, ...),                           [default]
# -DNoChange          : all lower case (IBM RS6000),
# -DUpCase            : all upper case (Cray),
# -DAdd__             : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int   : Fortran 77 INTEGER is a C int,         [default]
# -DF77_INTEGER=long  : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle    : The string address is passed at the string loca-
#                       tion on the stack, and the string length is then
#                       passed as  an  F77_INTEGER  after  all  explicit
#                       stack arguments,                       [default]
# -DStringStructPtr   : The address  of  a  structure  is  passed  by  a
#                       Fortran 77  string,  and the structure is of the
#                       form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal   : A structure is passed by value for each  Fortran
#                       77 string,  and  the  structure is  of the form:
#                       struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle   : Special option for  Cray  machines,  which  uses
#                       Cray  fcd  (fortran  character  descriptor)  for
#                       interoperation.
#
F2CDEFS      =
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I\$(INCdir) -I\$(INCdir)/\$(ARCH) \$(LAinc) \$(MPinc)
HPL_LIBS     = \$(HPLlib) \$(LAlib) \$(MPlib)
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L           force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS       call the cblas interface;
# -DHPL_CALL_VSIPL       call the vsip  library;
# -DHPL_DETAILED_TIMING  enable detailed timers;
#
# By default HPL will:
#    *) not copy L before broadcast,
#    *) call the BLAS Fortran 77 interface,
#    *) not display detailed timing information.
#
HPL_OPTS     = -DHPL_CALL_CBLAS
#
# ----------------------------------------------------------------------
#
HPL_DEFS     = \$(F2CDEFS) \$(HPL_OPTS) \$(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC           = /usr/bin/gcc
CCNOOPT      = \$(HPL_DEFS)
CCFLAGS      = \$(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
#
# On some platforms,  it is necessary  to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER       = /usr/bin/gcc
LINKFLAGS    = \$(CCFLAGS)
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo
#
# ----------------------------------------------------------------------``````
and compile with the command

\$ make arch=OpenBLAS

After the compilation is finished, change to the hpl-2.2/bin/OpenBLAS directory using

\$ cd bin/OpenBLAS

and edit the file HPL.DAT so it reads

Code: Select all

``````HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
8000         Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
5            Qs
16.0         threshold
1            # of panel fact
0            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)``````
Create a Slurm batch file called xhpl.slm that reads

Code: Select all

``````#!/bin/bash
#SBATCH -n5
mpirun -n 5 ./xhpl``````
Submit the job using

\$ sbatch xhpl.slm

After about six minutes the output from the job will be left in slurm-X.out, where X is the job sequence number. In our case the output was given by

Code: Select all

``````================================================================================
HPLinpack 2.2  --  High-Performance Linpack benchmark  --   February 24, 2016
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :    8000
NB     :     128
PMAP   : Row-major process mapping
P      :       1
Q      :       5
PFACT  :    Left
NBMIN  :       2
NDIV   :       2
RFACT  :   Right
BCAST  :   1ring
DEPTH  :       0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00R2L2        8000   128     1     5             308.26              1.108e+00
HPL_pdgesv() start time Mon Mar 12 21:56:49 2018

HPL_pdgesv() end time   Mon Mar 12 22:01:57 2018

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0017511 ...... PASSED
================================================================================

Finished      1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================``````
which indicates that the aggregate performance of the cluster is about 1.1 Gflops. This result is about 6 times slower than a single Raspberry Pi 3B but similar to the Terrible Cluster which reached a speed of 1.3 Gflops. I suspect the 20 percent improvement in speed for the Terrible Cluster may come from running part of the computation on the head node, which has faster communication to the other nodes. In 1993 the super-cheap cluster described in this thread would have ranked about 250 on the list of the top 500 supercomputers, right next to the Cray Y-MP.

For further comparison, note that on the same benchmark the four cores in the Pi 2B achieve 1.56 Gflops, the Pi 3B achieves 6.4 Gflops and the Pi 3B+ achieves 6.7 Gflops. Moreover using a larger 20000 by 20000 matrix, an AMD Ryzen 1700 8-core 16-thread desktop achieves 89.9 Gflops, a dual Intel E5-2620 Xeon 12-core 24-thread server achieves 226.0 Gflops and a dual Intel Gold 6126 Xeon 24-core 48-thread server achieves 344.3 Gflops.

As of November 2017, most of the computational power represented by the supercomputers in the top 500 list is delivered by NVIDIA GPU and Intel Xeon Phi many-core accelerators. A notable exception is the fastest machine on the list, the Sunway TaihuLight, which uses a custom processor designed and manufactured in China with no additional accelerators. In the case of the super-cheap cluster, the SoC in each Pi Zero includes a VideoCore IV GPU with a peak speed of 24 single-precision Gflops. In total, therefore, our super-cheap cluster has a theoretical peak of 120 single-precision Gflops. However, since the Linpack benchmark dictates the use of double precision, it is difficult to use the VideoCore IV in the Pi Zero to speed things up.

Before abandoning the idea of using the VideoCore entirely, note that it is possible to emulate double precision by using two single-precision numbers. This float-float representation is described by G. Da Graça and D. Defour and by A. Thall. We note that the correctly rounded float-float sum df64_add takes 20 single-precision operations while the float-float product df64_mult takes 22. Dividing 120 by the average cost of 21 suggests that the 5-node Pi Zero cluster has a peak rate of about 5.7 float-float Gflops. While it is reasonable to assume peak rates will not be achieved, it would be interesting if someone wrote a float-float linear algebra solver using QPULib to test whether increased performance is possible. I suspect the resulting speedup would be somewhere between 1.3 and 2.7 times; however, the computation might run slower instead.
Last edited by ejolson on Wed Sep 05, 2018 7:41 pm, edited 2 times in total.

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

So far we have written two parallel MPI programs that implement the dartboard method for computing Pi. One version used MPI_Recv and MPI_Send to emulate remote procedure calls in order to stay as close to the original Python program as possible, while the other version was simpler and used MPI_Reduce. Recall that using MPI_Reduce for computing Pi resulted in a more efficient computation.

Using MPI_Reduce works well for a Monte-Carlo simulation because the total work needed to throw all the darts is predictable and easily partitioned between the individual ranks ahead of time. There are also parallel computations for which it is not so easy to predict how much work any single part might take. Emulated remote procedure calls automatically adapt to situations where it is not possible to evenly partition the work between the nodes ahead of time. Remote procedure calls can also overcome difficulties when the cluster is heterogeneous and some computers run more slowly than others.

To further compare the performance of emulating remote procedure calls using MPI versus statically partitioning the problem ahead of time, consider the task of computing the basins of attraction in the complex plane for Newton's method. A parallel code for calculating Newton fractals was developed in a previous thread using the Intel/MIT Cilk Plus parallel processing extensions to the C programming language on the Raspberry Pi 2B. We now develop the parallel codes newton_mpi.c and newton_reduce.c which perform the same calculation on the Zero cluster and consider the serial code newton.c to gauge the resulting parallel speedup. The results of running these programs are

Code: Select all

``````\$ sbatch newton.slm
Submitted batch job 207
\$ sbatch newton_mpi.slm
Submitted batch job 208
\$ sbatch newton_reduce.slm
Submitted batch job 209
\$ squeue
JOBID               NAME       USER ST        TIME CPU   REASON NODELIST
208     newton_mpi.slm    ejolson PD        0:00   5 Resource
209  newton_reduce.slm    ejolson PD        0:00   5 Priority
207         newton.slm    ejolson  R        2:48   1     None s0
\$ cat slurm-207.out
Compute Basin of Attraction for Newton's Method
Image size 2048 x 2048 with maximum of 100 iterations

Finished raster 37 (ETA 273.734 seconds)
Finished raster 74 (ETA 268.729 seconds)
Finished raster 111 (ETA 263.665 seconds)
Finished raster 149 (ETA 257.031 seconds)
Finished raster 187 (ETA 251.128 seconds)
Finished raster 225 (ETA 245.429 seconds)
Finished raster 264 (ETA 239.284 seconds)
Finished raster 303 (ETA 233.173 seconds)
Finished raster 342 (ETA 227.169 seconds)
Finished raster 381 (ETA 221.245 seconds)
Finished raster 422 (ETA 214.365 seconds)
Finished raster 465 (ETA 206.566 seconds)
Finished raster 509 (ETA 198.914 seconds)
Finished raster 553 (ETA 191.55 seconds)
Finished raster 598 (ETA 183.979 seconds)
Finished raster 645 (ETA 176.023 seconds)
Finished raster 692 (ETA 168.419 seconds)
Finished raster 742 (ETA 160.242 seconds)
Finished raster 796 (ETA 151.09 seconds)
Finished raster 858 (ETA 140.215 seconds)
Finished raster 920 (ETA 130.129 seconds)
Finished raster 982 (ETA 120.675 seconds)
Finished raster 1047 (ETA 111.136 seconds)
Finished raster 1111 (ETA 102.295 seconds)
Finished raster 1191 (ETA 90.9095 seconds)
Finished raster 1284 (ETA 78.155 seconds)
Finished raster 1384 (ETA 65.4242 seconds)
Finished raster 1477 (ETA 54.6616 seconds)
Finished raster 1561 (ETA 45.6879 seconds)
Finished raster 1643 (ETA 37.337 seconds)
Finished raster 1720 (ETA 29.8472 seconds)
Finished raster 1791 (ETA 23.1771 seconds)
Finished raster 1869 (ETA 15.9518 seconds)
Finished raster 1960 (ETA 7.70327 seconds)

Total running time 182.86 seconds.
\$ cat slurm-208.out
Compute Basin of Attraction for Newton's Method
Image size 2048 x 2048 with maximum of 100 iterations
Asynchronous MPI Parallel version 6 ranks (5 for compute)

Rank 5 finished raster 175 (ETA 53.9952 seconds)
Rank 2 finished raster 356 (ETA 47.8896 seconds)
Rank 2 finished raster 555 (ETA 40.5987 seconds)
Rank 1 finished raster 781 (ETA 32.6375 seconds)
Rank 3 finished raster 1071 (ETA 22.9179 seconds)
Rank 1 finished raster 1469 (ETA 11.8825 seconds)
Rank 5 finished raster 1832 (ETA 4.14494 seconds)

Total running time 38.6461 seconds.
\$ cat slurm-209.out
Compute Basin of Attraction for Newton's Method
Image size 2048 x 2048 with maximum of 100 iterations
Simple MPI Parallel version 5 ranks for compute

Finished rank 3 slice in 23.9256 seconds.
Finished rank 4 slice in 24.3671 seconds.
Finished rank 2 slice in 31.8431 seconds.
Finished rank 1 slice in 45.6 seconds.
Finished rank 0 slice in 55.7153 seconds.

Total running time 58.5545 seconds.``````
The relative performance for each of the three codes may be visualized as

Note that newton_mpi.c achieves almost perfect parallel scaling across the 5 nodes in the super-cheap cluster while newton_reduce.c is significantly less efficient. Although MPI_Reduce performed better for the Monte-Carlo calculation of Pi, the situation is reversed in the case of Newton fractals. The work is hard to partition ahead of time because different parts of the Newton fractal take more or less time to compute depending on the complex iterated dynamics. For reference, the code for newton.c, newton_mpi.c and newton_reduce.c is included below.

Code: Select all

``````// newton.c
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>
#include <math.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>
int job_id;

#define RES 2048
#define IMAX 100
static int no_of_rows,no_of_cols;
static double xmin=0, xmax=1, ymin=0, ymax=1;
static double delta;

typedef struct { unsigned char r,g,b; } rgbtype;

unsigned char clip256(double x){
double r=256*x;
if(r>255) return 255;
if(r<0) return 0;
return (unsigned char)r;
}

rgbtype ztorgb(complex z){
rgbtype rgb;
double theta=carg(z), r=cabs(z);
double h=3*theta/M_PI;
rgb.r=clip256(2-fabs(h));
rgb.g=clip256(fabs(h-1)-1);
rgb.b=clip256(fabs(h+1)-1);
if(r<1){
rgb.r=255-(255-rgb.r)*r;
rgb.g=255-(255-rgb.g)*r;
rgb.b=255-(255-rgb.b)*r;
} else {
rgb.r/=r; rgb.g/=r; rgb.b/=r;
}
return rgb;
}
void outimage(int m,int n,complex A[m][n]){
char filename[80];
sprintf(filename,"image-%d.ppm",job_id);
FILE *fp=fopen(filename,"w");
fprintf(fp,"P6 %d %d 255\n",n,m);
for(int j=0;j<m;j++){
rgbtype raster[n];
for(int k=0;k<n;k++){
raster[k]=ztorgb(A[j][k]);
}
fwrite(raster,n*sizeof(rgbtype),1,fp);
}
fclose(fp);
}

complex f(complex z) {
return z*z*z*z*z-1;
}
complex df(complex z) {
return 5*z*z*z*z;
}
complex phi(complex z) {
return z-f(z)/df(z);
}
void compute(int n,complex a[n]){
for(int k=0;k<n;k++){
complex z=a[k];
for(int i=0;i<IMAX;i++){
complex w0=phi(z);
if(w0==z) break;
z=w0;
}
a[k]=z;
}
}

static double tic_time;
void tic() {
struct timeval tp;
gettimeofday(&tp,0);
tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
struct timeval tp;
gettimeofday(&tp,0);
return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

int main(int argc,char *argv[]){
setrlimit(RLIMIT_STACK,
&(const struct rlimit)
{RLIM_INFINITY,RLIM_INFINITY});
if(xmax-xmin>ymax-ymin){
delta=(xmax-xmin)/RES;
} else {
delta=(ymax-ymin)/RES;
}
no_of_rows=(ymax-ymin)/delta;
no_of_cols=(xmax-xmin)/delta;
char *p=getenv("SLURM_JOB_ID");
if(p) job_id=atoi(p);

complex A[no_of_rows][no_of_cols];
printf(
"Compute Basin of Attraction for Newton's Method\n"
"Image size %d x %d with maximum of %d iterations\n\n",
no_of_rows,no_of_cols,IMAX);
for(int j=0;j<no_of_rows;j++) for(int k=0;k<no_of_cols;k++){
A[j][k]=xmin+k*delta+I*(ymax-j*delta);
}
tic();
double tprint=5;
for(int j=0;j<no_of_rows;){
compute(no_of_cols,A[j]);
j++;
double tnow=toc();
if(tnow>tprint){
printf("Finished raster %u (ETA %g seconds)\n",
j,tnow/j*(no_of_rows-j));
fflush(stdout);
tprint=tnow+5;
}
}
outimage(no_of_rows,no_of_cols,A);
double ttime=toc();
printf("\nTotal running time %g seconds.\n",ttime);
return 0;
}``````

Code: Select all

``````// newton_mpi.c
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>
#include <math.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <mpi.h>
int mpi_rank, mpi_size, job_id;

#define RES 2048
#define IMAX 100
static int no_of_rows,no_of_cols,xsize;
static double xmin=0, xmax=1, ymin=0, ymax=1;
static double delta;

typedef struct { unsigned char r,g,b; } rgbtype;

unsigned char clip256(double x){
double r=256*x;
if(r>255) return 255;
if(r<0) return 0;
return (unsigned char)r;
}

rgbtype ztorgb(complex z){
rgbtype rgb;
double theta=carg(z), r=cabs(z);
double h=3*theta/M_PI;
rgb.r=clip256(2-fabs(h));
rgb.g=clip256(fabs(h-1)-1);
rgb.b=clip256(fabs(h+1)-1);
if(r<1){
rgb.r=255-(255-rgb.r)*r;
rgb.g=255-(255-rgb.g)*r;
rgb.b=255-(255-rgb.b)*r;
} else {
rgb.r/=r; rgb.g/=r; rgb.b/=r;
}
return rgb;
}
void outimage(int m,int n,rgbtype *image[m]){
char filename[80];
sprintf(filename,"image-%d.ppm",job_id);
FILE *fp=fopen(filename,"w");
fprintf(fp,"P6 %d %d 255\n",n,m);
for(int j=0;j<m;j++){
fwrite(image[j],n*sizeof(rgbtype),1,fp);
}
fclose(fp);
}

complex f(complex z) {
return z*z*z*z*z-1;
}
complex df(complex z) {
return 5*z*z*z*z;
}
complex phi(complex z) {
return z-f(z)/df(z);
}
rgbtype computergb(complex z){
for(int i=0;i<IMAX;i++){
complex w0=phi(z);
if(w0==z) break;
z=w0;
}
return ztorgb(z);
}

typedef struct { int this_row; rgbtype raster[]; } computeargs;

void rankn(){
computeargs *xp = malloc(xsize);
for(;;){
computeargs x;
MPI_Status err;
MPI_Recv(&x,sizeof(computeargs),MPI_BYTE,
MPI_ANY_SOURCE,0,MPI_COMM_WORLD, &err);
if(x.this_row<0) break;
xp->this_row=x.this_row;
for(int k=0;k<no_of_cols;k++) {
xp->raster[k]=computergb(
xmin+k*delta+I*(ymax-x.this_row*delta));
}
MPI_Send(xp,xsize,MPI_BYTE,0,0,MPI_COMM_WORLD);
}
free(xp);
return;
}

static double tic_time;
void tic() {
struct timeval tp;
gettimeofday(&tp,0);
tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
struct timeval tp;
gettimeofday(&tp,0);
return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

void rank0(){
setrlimit(RLIMIT_STACK,
&(const struct rlimit)
{RLIM_INFINITY,RLIM_INFINITY});
rgbtype *image[no_of_rows];
printf(
"Compute Basin of Attraction for Newton's Method\n"
"Image size %d x %d with maximum of %d iterations\n"
"Asynchronous MPI Parallel version %d ranks (%d for compute)\n\n",
no_of_rows,no_of_cols,IMAX,mpi_size,mpi_size-1);

int started=0, received=0;
tic();
double tprint=5;
for(;;){
int rank;
if(started>=mpi_size-1){
MPI_Status err;
computeargs *xp=malloc(xsize);
MPI_Recv(xp,xsize,MPI_BYTE,
MPI_ANY_SOURCE,0,MPI_COMM_WORLD,&err);
image[xp->this_row]=xp->raster;
received++;
double tnow=toc();
if(tnow>tprint){
printf("Rank %d finished raster %d (ETA %g seconds)\n",
err.MPI_SOURCE,xp->this_row,
tnow/received*(no_of_rows-received));
fflush(stdout);
tprint=tnow+5;
}
rank=err.MPI_SOURCE;
} else {
rank=started+1;
}
if(started<no_of_rows){
computeargs x = { started };
started++;
MPI_Send(&x,sizeof(computeargs),MPI_BYTE,
rank,0,MPI_COMM_WORLD);
} else {
computeargs x = { -1 };
MPI_Send(&x,sizeof(computeargs),MPI_BYTE,
rank,0,MPI_COMM_WORLD);
if(received>=no_of_rows) break;
}
}
outimage(no_of_rows,no_of_cols,image);
double ttime=toc();
printf("\nTotal running time %g seconds.\n",ttime);
}

int main(int argc,char *argv[]){
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

if(xmax-xmin>ymax-ymin){
delta=(xmax-xmin)/RES;
} else {
delta=(ymax-ymin)/RES;
}
no_of_rows=(ymax-ymin)/delta;
no_of_cols=(xmax-xmin)/delta;
xsize=sizeof(computeargs)+sizeof(rgbtype)*no_of_cols;
char *p=getenv("SLURM_JOB_ID");
if(p) job_id=atoi(p);

if(mpi_size<2) {
printf("Please use at least 2 ranks!\n");
} else {
if(mpi_rank>0) rankn();
else rank0();
}
MPI_Finalize();
return 0;
}``````

Code: Select all

``````// newton_reduce.c
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>
#include <math.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <mpi.h>
int mpi_rank, mpi_size, job_id;

#define RES 2048
#define IMAX 100
static int no_of_rows,no_of_cols,xsize;
static double xmin=0, xmax=1, ymin=0, ymax=1;
static double delta;

typedef struct { unsigned char r,g,b; } rgbtype;

unsigned char clip256(double x){
double r=256*x;
if(r>255) return 255;
if(r<0) return 0;
return (unsigned char)r;
}

rgbtype ztorgb(complex z){
rgbtype rgb;
double theta=carg(z), r=cabs(z);
double h=3*theta/M_PI;
rgb.r=clip256(2-fabs(h));
rgb.g=clip256(fabs(h-1)-1);
rgb.b=clip256(fabs(h+1)-1);
if(r<1){
rgb.r=255-(255-rgb.r)*r;
rgb.g=255-(255-rgb.g)*r;
rgb.b=255-(255-rgb.b)*r;
} else {
rgb.r/=r; rgb.g/=r; rgb.b/=r;
}
return rgb;
}
void outimage(int m,int n,rgbtype image[m][n]){
char filename[80];
sprintf(filename,"image-%d.ppm",job_id);
FILE *fp=fopen(filename,"w");
fprintf(fp,"P6 %d %d 255\n",n,m);
for(int j=0;j<m;j++){
fwrite(image[j],n*sizeof(rgbtype),1,fp);
}
fclose(fp);
}

complex f(complex z) {
return z*z*z*z*z-1;
}
complex df(complex z) {
return 5*z*z*z*z;
}
complex phi(complex z) {
return z-f(z)/df(z);
}
rgbtype computergb(complex z){
for(int i=0;i<IMAX;i++){
complex w0=phi(z);
if(w0==z) break;
z=w0;
}
return ztorgb(z);
}
typedef struct { int this_row; complex a[]; } computeargs;

static double tic_time;
void tic() {
struct timeval tp;
gettimeofday(&tp,0);
tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
struct timeval tp;
gettimeofday(&tp,0);
return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

void rankn(){
tic();
int first_row=mpi_rank*no_of_rows/mpi_size;
int last_row=(mpi_rank+1)*no_of_rows/mpi_size;
int rows=last_row-first_row;
rgbtype slice[rows][no_of_cols];
for(int j=first_row;j<last_row;j++){
for(int k=0;k<no_of_cols;k++){
slice[j-first_row][k]=computergb(xmin+k*delta+I*(ymax-j*delta));
}
}
printf("Finished rank %d slice in %g seconds.\n",mpi_rank,toc());
fflush(stdout);
MPI_Send(slice,sizeof(rgbtype)*rows*no_of_cols,
MPI_BYTE,0,0,MPI_COMM_WORLD);
return;
}

void rank0(){
setrlimit(RLIMIT_STACK,
&(const struct rlimit)
{RLIM_INFINITY,RLIM_INFINITY});
printf(
"Compute Basin of Attraction for Newton's Method\n"
"Image size %d x %d with maximum of %d iterations\n"
"Simple MPI Parallel version %d ranks for compute\n\n",
no_of_rows,no_of_cols,IMAX,mpi_size);
fflush(stdout);

tic();
rgbtype image[no_of_rows][no_of_cols];
{
int last_row=1*no_of_rows/mpi_size;
for(int j=0;j<last_row;j++){
for(int k=0;k<no_of_cols;k++){
image[j][k]=computergb(xmin+k*delta+I*(ymax-j*delta));
}
}
}
printf("Finished rank 0 slice in %g seconds.\n",toc());
fflush(stdout);
for(int rank=1;rank<mpi_size;rank++){
int first_row=rank*no_of_rows/mpi_size;
int last_row=(rank+1)*no_of_rows/mpi_size;
int rows=last_row-first_row;
MPI_Status err;
MPI_Recv(image[first_row],sizeof(rgbtype)*rows*no_of_cols,MPI_BYTE,
rank,0,MPI_COMM_WORLD, &err);
}
outimage(no_of_rows,no_of_cols,image);
double ttime=toc();
printf("\nTotal running time %g seconds.\n",ttime);
}

int main(int argc,char *argv[]){
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

if(xmax-xmin>ymax-ymin){
delta=(xmax-xmin)/RES;
} else {
delta=(ymax-ymin)/RES;
}
no_of_rows=(ymax-ymin)/delta;
no_of_cols=(xmax-xmin)/delta;
char *p=getenv("SLURM_JOB_ID");
if(p) job_id=atoi(p);

if(mpi_size<2) {
printf("Please use at least 2 ranks!\n");
} else {
if(mpi_rank>0) rankn();
else rank0();
}
MPI_Finalize();
return 0;
}``````

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

In the previous post each of the three programs generated the same Newton fractal illustrating the basins of attraction for the five roots of the function z^5-1 over the region [0,1]x[0,1] of the complex plane. I converted the output to portable network graphics format using the command

\$ pnmtopng <image-209.ppm >image-209.png

The resulting image looks like

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

In this post we explore the use of the super-cheap cluster for sorting. This time the goal is not parallel speedup, but to combine the distributed memory of multiple nodes to sort large datasets. Consider the task of sorting 200,000,000 four-byte integers. Since a dataset of that size takes 763MB of memory, it is not possible to store it in the 512MB RAM available on a single Pi Zero. While out-of-core sorting algorithms exist, they run much slower than the parallel algorithm described here.

The algorithm we shall employ is called odd-even sorting. Suppose there are N nodes in the cluster. First partition the data among those N nodes. Each node then sorts its part of the data in parallel. Finally, neighboring nodes repeatedly merge their parts, alternating between even and odd pairings. After N merge phases the lowest MPI rank holds the first part of the sorted data, the highest rank holds the last part, and the rest of the data is stored in order on the ranks in between. Since N merge phases must be performed, each on 1/N of the data, there is no expected parallel speedup from this algorithm. However, there is no expected slowdown either. Also note that the merge operations require twice the memory needed to store the entire dataset. The super-cheap cluster consists of five computational nodes with 512MB each. Since some of that memory is used for operating-system tasks and overhead, there is about 480MB per node which can be devoted to sorting. Multiplying that by five yields a total of 2400MB of distributed memory, which is more than enough to accommodate the 1526MB needed to merge our 763MB dataset.

We begin by comparing our parallel algorithm to a serial program using a smaller dataset of 100,000,000 four-byte integers. This allows us to verify correct operation and compare performance. The result for the serial code is

Code: Select all

``````Perform serial sorting using qsort.
Sorting total 100000000 integers in serial.
Finished sorting in 155.168 seconds.

real    5m3.215s
user    2m35.130s
sys 0m16.610s``````
and the result for the parallel code is

Code: Select all

``````Perform odd-even sorting using MPI.
Sorting total 100000000 integers using 5 ranks.
Rank 4 finished sorting in 150.809 seconds.
Rank 2 finished sorting in 181.27 seconds.
Rank 0 finished sorting in 181.525 seconds.
Rank 3 finished sorting in 181.341 seconds.
Rank 1 finished sorting in 181.755 seconds.

real    5m1.880s
user    0m0.070s
sys 0m0.120s``````
The timings for only the sorting (neglecting file input and output) show that the parallel sort itself is about 20 percent slower than the serial code. However, after including file input and output the total wall time is almost exactly 5 minutes for either method. Since total wall time is what really matters to a person waiting for the program to finish, we conclude that the parallel and serial sorts take about the same amount of time.

Finally, we demonstrate the use of distributed memory by sorting a larger dataset that doesn't fit in the memory of a single Pi Zero.

Code: Select all

``````Perform odd-even sorting using MPI.
Sorting total 200000000 integers using 5 ranks.
Rank 4 finished sorting in 309.344 seconds.
Rank 0 finished sorting in 378.947 seconds.
Rank 2 finished sorting in 371.555 seconds.
Rank 3 finished sorting in 371.941 seconds.
Rank 1 finished sorting in 370.915 seconds.

real    10m28.655s
user    0m0.080s
sys 0m0.130s``````
Note that it takes 2.08 times longer to sort 200,000,000 integers than it took to sort 100,000,000 integers. The expected ratio is

Code: Select all

``````       200000000*log(200000000)
------------------------ = 2.075
100000000*log(100000000)``````
This ratio is consistent with the expected n log n asymptotic performance of a reasonable sorting algorithm.

For reference the parallel odd-even sort program is

Code: Select all

``````/*  This code implements a parallel odd-even sort using MPI and is
based on the lecture notes by Ian Finlayson for the course

CPSC 425: Parallel Computing
University of Mary Washington
http://cs.umw.edu/~finlayson/class/fall14/cpsc425/syllabus.html

This code was modified by Eric Olson for the super-cheap cluster.
*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <fcntl.h>
#include <string.h>
#include <mpi.h>

typedef unsigned int uint;
typedef long unsigned int luint;

char *infn = "unsorted.dat";
char *outfn = "sorted.dat";
int mpi_rank, mpi_size, job_id;

luint totalsize;
uint *getinput(luint *n,char *fn){
int fd=open(fn,O_RDONLY);
if(fd==-1){
printf("Unable to open %s for read!\n",fn);
exit(1);
}
struct stat buf;
fstat(fd,&buf);
totalsize=buf.st_size;
if(totalsize%sizeof(uint)){
printf("File %s contains an incomplete integer!\n",fn);
exit(1);
}
totalsize/=sizeof(uint);
if(!mpi_rank){
printf("Sorting total %lu integers using %d ranks.\n",
totalsize,mpi_size);
}
luint i0=mpi_rank*totalsize/mpi_size;
luint iN=(mpi_rank+1)*totalsize/mpi_size;
luint bytes=(iN-i0)*sizeof(uint);
uint *data=malloc(bytes+sizeof(uint));
if(lseek(fd,i0*sizeof(uint),SEEK_SET)==-1){
printf("Error seeking %s to offset %lu!\n",fn,i0*sizeof(uint));
exit(1);
}
luint r=read(fd,data,bytes);
if(r!=bytes){
printf("Error short read of %lu instead of %lu!\n",r,bytes);
exit(1);
}
close(fd);
*n=iN-i0;
return data;
}
void putrank(int fd,int p,uint *data,char *fn){
luint i0=p*totalsize/mpi_size;
luint iN=(p+1)*totalsize/mpi_size;
luint bytes=(iN-i0)*sizeof(uint);
if(lseek(fd,i0*sizeof(uint),SEEK_SET)==-1){
printf("Error seeking %s to offset %lu!\n",fn,i0*sizeof(uint));
exit(1);
}
luint r=write(fd,data,bytes);
if(r!=bytes){
printf("Error short write of %lu instead of %lu!\n",r,bytes);
exit(1);
}
}
void putoutput(luint n,uint *data,char *fn){
int fd;
if(!mpi_rank){
fd=creat(fn,0644);
if(fd==-1){
printf("Unable to create %s for write!\n",fn);
exit(1);
}
putrank(fd,0,data,fn);
for(int p=1;p<mpi_size;p++){
MPI_Recv(data,n+1,MPI_INT,p,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
putrank(fd,p,data,fn);
}
close(fd);
} else {
MPI_Send(data,n,MPI_INT,0,0,MPI_COMM_WORLD);
}
}

int intcmp(const void* ap,const void* bp){
uint a=*((const uint*)ap);
uint b=*((const uint*)bp);
if(a<b){
return -1;
} else if(a>b){
return 1;
} else {
return 0;
}
}
static double tic_time;
void tic() {
struct timeval tp;
gettimeofday(&tp,0);
tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
struct timeval tp;
gettimeofday(&tp,0);
return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

uint *parallel_sort(luint n,uint *data,uint *other){
qsort(data,n,sizeof(uint),intcmp);
for(int mpi=0;mpi<mpi_size;mpi++){
int p;
if(mpi%2){
if(mpi_rank%2) p=mpi_rank+1;
else p=mpi_rank-1;
} else {
if(mpi_rank%2) p=mpi_rank-1;
else p=mpi_rank+1;
}
if(p<0 || p>=mpi_size) continue;
luint p0=p*totalsize/mpi_size;
luint pN=(p+1)*totalsize/mpi_size;
luint m=pN-p0;

/*  do the exchange - odd ranks send first while even ranks
receive first so the paired processes do not deadlock */

if (mpi_rank%2){
MPI_Send(data,n,MPI_INT,p,0,MPI_COMM_WORLD);
MPI_Recv(other,m,MPI_INT,p,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
} else {
MPI_Recv(other,m,MPI_INT,p,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
MPI_Send(data,n,MPI_INT,p,0,MPI_COMM_WORLD);
}
if(mpi_rank<p){
luint i=0, j=n;
while(i<m && j>0){
if(other[i]<data[j-1]){
j--;
uint tmp=data[j]; data[j]=other[i]; other[i]=tmp;
i++;
} else break;
}
i=0; j=n-1;
for(luint k=0;k<n;k++){
if(data[i]<=data[j]) other[k]=data[i++];
else other[k]=data[j--];
}
} else {
luint i=m, j=0;
while(i>0 && j<n){
if(other[i-1]>data[j]){
i--;
uint tmp=data[j]; data[j]=other[i]; other[i]=tmp;
j++;
} else break;
}
i=0; j=n-1;
for(luint k=n;k>0;k--){
if(data[i]>=data[j]) other[k-1]=data[i++];
else other[k-1]=data[j--];
}
}
{ uint *tmp=data; data=other; other=tmp; }
}
return data;
}

int main(int argc,char* argv[]){
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&mpi_rank);
MPI_Comm_size(MPI_COMM_WORLD,&mpi_size);

setrlimit(RLIMIT_STACK,
&(const struct rlimit)
{RLIM_INFINITY,RLIM_INFINITY});

if(!mpi_rank){
printf("Perform odd-even sorting using MPI.\n");
}
char filename[80];
char *p=getenv("SLURM_JOB_ID");
if(p) {
job_id=atoi(p);
sprintf(filename,"sorted-%d.dat",job_id);
outfn=filename;
}
luint n;
uint *data=getinput(&n,infn);
uint *other=malloc((n+1)*sizeof(uint));
tic();
data=parallel_sort(n,data,other);
double sorttime=toc();
printf("Rank %d finished sorting in %g seconds.\n",
mpi_rank,sorttime);
putoutput(n,data,outfn);

MPI_Finalize();
return 0;
}``````
the serial sorting routine is

Code: Select all

``````/*    This code uses qsort to sort a file of integers
*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <fcntl.h>
#include <string.h>

typedef unsigned int uint;
typedef long unsigned int luint;

char *infn = "unsorted.dat";
char *outfn = "sorted.dat";
int job_id;

uint *getinput(luint *n,char *fn){
int fd=open(fn,O_RDONLY);
if(fd==-1){
printf("Unable to open %s for read!\n",fn);
exit(1);
}
struct stat buf;
fstat(fd,&buf);
if(buf.st_size%sizeof(uint)){
printf("File %s contains an incomplete integer!\n",fn);
exit(1);
}
*n=buf.st_size/sizeof(uint);
printf("Sorting total %lu integers in serial.\n",*n);
uint *data=malloc(buf.st_size);
luint r=read(fd,data,buf.st_size);
if(r!=buf.st_size){
printf("Error short read of %lu instead of %lu!\n",r,(luint)buf.st_size);
exit(1);
}
close(fd);
return data;
}
void putoutput(luint n,uint *data,char *fn){
int fd;
fd=creat(fn,0644);
if(fd==-1){
printf("Unable to create %s for write!\n",fn);
exit(1);
}
luint bytes=n*sizeof(uint);
luint r=write(fd,data,bytes);
if(r!=bytes){
printf("Error short write of %lu instead of %lu!\n",r,bytes);
exit(1);
}
close(fd);
}
int intcmp(const void* ap,const void* bp){
uint a=*((const uint*)ap);
uint b=*((const uint*)bp);
if(a<b){
return -1;
} else if(a>b){
return 1;
} else {
return 0;
}
}
static double tic_time;
void tic() {
struct timeval tp;
gettimeofday(&tp,0);
tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
struct timeval tp;
gettimeofday(&tp,0);
return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

uint *serial_sort(luint n,uint *data){
qsort(data,n,sizeof(uint),intcmp);
return data;
}

int main(int argc,char* argv[]){
setrlimit(RLIMIT_STACK,
&(const struct rlimit)
{RLIM_INFINITY,RLIM_INFINITY});

printf("Perform serial sorting using qsort.\n");
char filename[80];
char *p=getenv("SLURM_JOB_ID");
if(p) {
job_id=atoi(p);
sprintf(filename,"sorted-%d.dat",job_id);
outfn=filename;
}
luint n;
uint *data=getinput(&n,infn);
tic();
data=serial_sort(n,data);
double sorttime=toc();
printf("Finished sorting in %g seconds.\n",sorttime);
putoutput(n,data,outfn);
return 0;
}
``````
and the program which creates the datafile to sort is

Code: Select all

``````#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define BUFSIZE 1024U

typedef unsigned int uint;
typedef long unsigned int luint;

char *outfn="unsorted.dat";
uint data[BUFSIZE];
luint n=200000000UL;

int main(int argc,char *argv[]){
int fd=creat(outfn,0644);
if(fd==-1){
printf("Unable to open %s for write!\n",outfn);
exit(1);
}
uint dn=BUFSIZE;
while(n>0){
if(dn>n) dn=n;
for(uint i=0;i<dn;i++) data[i]=rand();
luint bytes=dn*sizeof(uint);
luint r=write(fd,data,bytes);
if(r!=bytes){
printf("Short write of %lu instead of %lu!\n",r,bytes);
exit(1);
}
n-=dn;
}
close(fd);
return 0;
}``````

abymmathew
Posts: 3
Joined: Tue Oct 02, 2018 4:10 am

### Re: Super-cheap Computing Cluster for Learning

Have you tried installing Docker Swarm on it? I have, and it hangs when replicating images across the nodes.

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

abymmathew wrote:
Tue Oct 02, 2018 4:12 am
Have you tried installing Docker Swarm on it? I have, and it hangs when replicating images across the nodes.
That's curious.

Have you looked at this blog and the links it contains? There it clearly indicates the possibility of running a Docker swarm on a cluster of USB networked Pi Zero computers. I suspect they are each running from local SD cards rather than mounting root from NFS as in this tutorial.

If you are following this tutorial exactly, it is worth noting that the individual compute nodes are completely isolated from the Internet. If your Docker configuration needs to contact an Internet-hosted server while running, you should additionally set up the host computer as a masquerading firewall. This should prevent the Zeros from hanging when they try to access the Internet.

abymmathew
Posts: 3
Joined: Tue Oct 02, 2018 4:10 am

### Re: Super-cheap Computing Cluster for Learning

Thanks for the quick reply. I will look into the pointers you provided. Very informative article.

abymmathew
Posts: 3
Joined: Tue Oct 02, 2018 4:10 am

### Re: Super-cheap Computing Cluster for Learning

Docker on NFS won't work. Reference, https://blog.alexellis.io/the-state-of- ... pberry-pi/

However, the OTG network is sometimes choppy and SSH over it fails often. Using Mosh helped: https://mosh.org/#techinfo

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

abymmathew wrote:
Tue Oct 09, 2018 1:04 am
Docker on NFS won't work. Reference, https://blog.alexellis.io/the-state-of- ... pberry-pi/

However, the OTG network is sometimes choppy and SSH over it fails often. Using Mosh helped: https://mosh.org/#techinfo
Sorry to hear you are having trouble with your network of USB Ethernet gadgets. For the record, the super-cheap cluster described in this thread has no such problem. In particular, SSH sessions are stable and I can run MPI applications that are fairly network intensive.

It should be noted that I'm still running a 4.9.59 kernel because of bugs in the 4.14.x series which affect the use of jumbo packets in the Ethernet gadgets.

I would also point out that for the configuration described in this thread, the NFS server is backed by a BTRFS copy-on-write filesystem. While the ability to create snapshots is not exported through NFS, the resulting subvolumes are. Therefore, Docker containers could be efficiently created on the head node using the copy-on-write semantics and then executed on one of the attached Zero compute nodes.

macbug
Posts: 12
Joined: Thu Dec 04, 2014 8:50 pm

### Re: Super-cheap Computing Cluster for Learning

I find this to be a way too complex way of benchmarking. I know nothing about benchmarking, so don't get me wrong. This isn't for me; I just want to code and have the computations spread over several computers.

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

macbug wrote:
Mon Jan 28, 2019 8:09 pm
I find this to be a way too complex way of benchmarking. I know nothing about benchmarking, so don't get me wrong. This isn't for me; I just want to code and have the computations spread over several computers.
Sounds good. I reported how long it takes to execute each example program because it is rather easy to spread a computation over several machines and then have it take longer than using only one. Just like adding manpower to a late software project makes it later, adding processors to a lengthy calculation can also slow things down. This parallel processing analogue of Brooks' Law was amusingly expressed by Seymour Cray as "if you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?" The new Summit super-computer at Oak Ridge National Laboratory uses 2,397,824 oxen and they are worse than the chickens.

The difficulty of obtaining parallel speedup in practice when solving a problem using a cluster is why it is important to have real hardware to test your code on. The fact that anyone can purchase 1993 levels of super-computing performance for the price of a nice meal for two means almost anyone with an interest can begin understanding the constraints of synchronization latency and communication bandwidth that still affect parallel algorithms today. Developing practical code that works well on clusters, though difficult, is now possible without a hundred-million-dollar budget.

If you don't want to include timing routines in your software as I have, you can either use the built-in time command to check how fast your program finished or a hand-held stopwatch. Having said this, many scientific codes as well as games have either built-in performance metrics or a set of regression tests to check what the current computational speeds are. This is useful, for example, when scheduling time on bigger clusters to solve problems sized to the limits of current technology.

It will be interesting to see what kind of parallel programs you code for your cluster. I've been thinking about creating a version of the shared-memory parallel Fibonacci programs studied in this post (see also here and here) that uses a message-passing approach to run on a distributed-memory cluster. As indicated in those posts, the SMP program ran 4 times faster using 5 processors. It would be a significant achievement, in my opinion, to obtain even a three-fold performance increase when using the 5-node super-cheap cluster.

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

ejolson wrote:
Wed Jan 30, 2019 12:16 am
I've been thinking about creating a version of the shared-memory parallel Fibonacci programs studied in this post (see also here and here) that uses a message-passing approach to run on a distributed-memory cluster. As indicated in those posts, the SMP program ran 4 times faster using 5 processors. It would be a significant achievement, in my opinion, to obtain even a three-fold performance increase when using the 5-node super-cheap cluster.
An MPI-parallel Fibonacci code that can run on the super-cheap cluster is now available here. The parallel speedup as measured by wall time did not quite reach 3, but was 2.718 instead. Since there are 5 nodes, this translates to a parallel efficiency of 54 percent.

As the total calculation time was around 9 seconds, the size of the problem may not be large enough to clearly offset the overhead of starting multiple communicating ranks across the cluster. In my opinion, it is likely that computing a larger Fibonacci number could achieve efficiencies as great as 80 percent.

energyi
Posts: 182
Joined: Tue Mar 24, 2015 9:39 pm

### Re: Super-cheap Computing Cluster for Learning

Great thread ejolson et al. Thanks.
energyi

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

energyi wrote:
Sat Mar 16, 2019 1:41 pm
Great thread ejolson et al. Thanks.
You are welcome!

For people who followed along with this thread and possibly built a similar system, I've been having a slight hardware problem recently related to low voltage. It would appear the barrel connector on the Sabrent USB2 hub has corroded and is no longer making good contact. This became obvious when the red light on the head node randomly started turning on and off.

I've temporarily fixed things by wiggling the barrel connector until the flashing stopped. Such problems with barrel connectors are likely why Raspberry Pi engineers used USB to deliver power to all types of Pi computers designed so far. With the Pi Zero this also allowed a super cheap way to obtain a network connection through the same cable used for power.

With other models of Pi, a standard one-cable strategy has been a Power-over-Ethernet HAT. However, with the Pi 4B, just like the Zero, it is possible to provide power and networking through a single USB cable. This opens up the possibility of a not-quite-so-cheap cluster based on the same networking fabric as the super-cheap cluster.

Even though a USB-C connector is now used for power, the 4B still only functions as a USB2 Ethernet gadget. Fortunately, the faster processor speed results in greater network throughput anyway. I've been working on a Pi Cloud to explore these ideas

viewtopic.php?f=36&t=279176

however, the focus of that thread is Kubernetes and micro-services rather than the configuration of Slurm and MPI for high-performance parallel computing.

Even so, it is interesting to compare the speed of NFS over a USB Ethernet gadget on the super-cheap cluster compared to the 4B-based Pi cloud. In particular, consider the setups given by
• Zero <--> NFS/USB2 <--> Pi B
• Pi 4B <--> NFS/USB2 <--> PC
The results are

Code: Select all

``````             Sequential Write   Random Write   Random Read
Zero Cluster     9753 KB/sec      661 IOPS       861 IOPS
Pi 4B Cloud     26089 KB/sec     5535 IOPS      5184 IOPS
``````
Note that, even though the same USB 2 networking fabric is used for the network file system in both cases, there is a significant difference in performance. Thus, it should be possible to set up a not-quite-so-cheap cluster of Pi 4B computers following the directions in this thread and obtain significant performance improvements both in processor and networking speed without resorting to traditional networking cables and switches.
Last edited by ejolson on Wed Jul 22, 2020 9:16 pm, edited 16 times in total.

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

As a follow-up to the previous post, to see how much more true gigabit Ethernet offers over the USB2 gadget, I ran sdtest one more time for the setup
• Pi 4B <--> NFS/GB Ethernet <--> PC
This resulted in

Code: Select all

``````             Sequential Write   Random Write   Random Read
Zero Cluster     9753 KB/sec      661 IOPS       861 IOPS
Pi 4B Cloud     26089 KB/sec     5535 IOPS      5184 IOPS
GB Ethernet    111646 KB/sec    10448 IOPS     11932 IOPS
``````
Since a cheap gigabit switch and cables typically cost less than US$50, using true gigabit Ethernet is likely a win from a price-performance point of view. At the same time, it's good to know the cheaper alternative of using only USB still performs well enough when budget considerations overrule reason.

mahjongg
Forum Moderator
Posts: 14042
Joined: Sun Mar 11, 2012 12:19 am
Location: South Holland, The Netherlands

### Re: Super-cheap Computing Cluster for Learning

Unlocked

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

mahjongg wrote:
Fri Mar 26, 2021 2:28 pm
Unlocked
Thanks for reopening the thread. I want to add some examples to this thread on using Julia for numeric computation and machine learning. Julia has been gaining popularity, not only as a replacement for Python because of its speed, but as a replacement for C, C++ and Fortran due to its interactive convenience.

The syntax of Julia includes Matlab-like matrix and vector operations, a choice between strong and duck typing, built-in arbitrary-precision arithmetic, interfaces to many high-performance C, C++ and Fortran subroutine libraries and support for heterogeneous computing using Nvidia and AMD GPU accelerators. All this is tied together into an optimised just-in-time interactive computing environment based on LLVM behind the scenes.

https://www.nextplatform.com/2021/03/22 ... s-in-2021/

However, Julia is astonishingly complex and difficult to build for 32-bit ARM. The current version is 1.6.0, while version 1.0.2 is built into Raspberry Pi OS and 0.6.0 is the version on the super-cheap cluster. As Julia has seen rapid development for many years, there are significant differences between these versions.

Many super-expensive clusters run Linux with a 3.10.x kernel that includes custom network and GPU drivers. Such a cluster operating system could, for example, be based on CentOS, soon to be Rocky Linux. Aside from whether there will ever be a Bullwinkle Linux, this creates a similar situation to the super-cheap cluster: The operating system and included packages are so out of date as to be useless for the intended computational work.

Note that the system compiler on CentOS is gcc version 4.8.5 which doesn't even support the OpenMP extensions used in many parallel programs. The software on the super-cheap cluster, though built on an earlier Stretch-based version of Raspberry Pi OS, is up to date compared to that.

More and more, the solution to out-of-date system software is to use containers. In the case of the Pi Zero, installing a new version of Raspberry Pi OS would not be difficult as there are only 5 nodes which boot diskless over the high-speed USB network fabric. For clusters consisting of hundreds or even thousands of nodes, updating the OS is much more difficult.

Therefore, to maintain as much realism as possible in this learning project, the next goal is to develop and run a parallel code using a modern version of Julia without upgrading the base OS.

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

ejolson wrote:
Sun Mar 28, 2021 3:47 pm
Therefore, to maintain as much realism as possible in this learning project, the next goal is to develop and run a parallel code using a modern version of Julia without upgrading the base OS.
While Kubernetes and Docker are popular for cloud-based micro-services, a more suitable container for high-performance computing is Singularity.

https://sylabs.io/singularity/

Note that a Singularity container runs with the same permissions as the logged in user rather than creating an additional set of privileges the way Docker does. This provides a way to bundle an application with the needed libraries such that the resulting container functions more like a local install of the program.

I found it interesting that Singularity, like Kubernetes and Docker, is written using the Go programming language.

As no binary executable is available for Raspberry Pi, I found myself building Singularity from source. Fortunately, the binary for the latest Go compiler needed to do this is readily installed by typing

Code: Select all

``````$ sudo -s
# cd /usr/local
# wget https://golang.org/dl/go1.16.3.linux-armv6l.tar.gz
# tar zxf go1.16.3.linux-armv6l.tar.gz
# cd bin
# ln -s ../go/bin/* .
# exit
``````
To check that Go works type

Code: Select all

``````$ go version
go version go1.16.3 linux/arm
``````
While it would be possible at this point to create snapshots of the root image so that Go is available everywhere on the cluster, I decided first to install Singularity.

After reading INSTALL.md, make sure the dependencies are installed.

Code: Select all

``````$ sudo -s
# apt-get install -y build-essential \
libseccomp-dev pkg-config squashfs-tools cryptsetup
# exit
``````
Then create a work directory, download the source and build it using

Code: Select all

``````$ mkdir singularity
$ cd singularity
$ export GOPATH=`pwd`
$ mkdir -p ${GOPATH}/src/github.com/sylabs
$ cd ${GOPATH}/src/github.com/sylabs
$ tar xf ${GOPATH}/singularity-3.7.2.tar.gz
$ cd singularity
$ ./mconfig -p /usr/local/singularity
$ cd builddir
$ nohup time make &
``````
Astonishingly, the build finished without errors with the report

Code: Select all

``````6926.80user 639.78system 2:21:44elapsed 88%CPU (0avgtext+0avgdata 323344maxresident)k
1191704inputs+5081416outputs (6811major+6488304minor)pagefaults 0swaps
``````
Considering the advertised speed of the Go compiler, two and a half hours seems like a pretty long time. However, this is mostly a reminder of why the super-cheap cluster is named snail.

Next install Singularity with

Code: Select all

``````$ sudo -s
# make install
# cd /usr/local/bin
# ln -s ../singularity/bin/* .
# exit
``````
and then check it's installed properly with

Code: Select all

``````$ singularity version
3.7.2
``````
The next post will describe deploying a singularity container with a newer version of Julia onto the super-cheap cluster.

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

ejolson wrote:
Mon Apr 05, 2021 2:47 am
The next post will describe deploying a singularity container with a newer version of Julia onto the super-cheap cluster.
Ready to deploy a new Julia I typed

Code: Select all

``````$ singularity search julia
No container images found for arm matching "julia".
``````
and immediately felt disappointed.

On an x86 system I get

Code: Select all

``````$ singularity search julia
Found 1 users for 'julia'
library://julian

No collections found for 'julia'

Found 8 containers for 'julia'
library://sylabs/examples/julia.sif
Tags: latest

library://sylabs/examples/julia
Tags: latest

library://dtrudg-utsw/demo/julia
Tags: 20190319 latest

library://sebastian_mc/default/julia
Tags: julia1.1.0

library://crown421/default/juliabase
Tags: 1.3.1 1.4.2 latest

library://ericneiva/default/juliabase
Tags: 1.5.1 latest

library://ajgreen/default/julia1.4.2-py3.7-jupyterlab1.2.6
Tags: latest

library://kool/default/julia
``````
so I was expecting at least one result from my search.

After further typing

Code: Select all

``````$ singularity search rstudio
No container images found for arm matching "rstudio".

$ singularity search python
No container images found for arm matching "python".

$ singularity search golang
No container images found for arm matching "golang".
``````
I get the feeling there are no 32-bit ARM Singularity containers.

My next plan is to follow the instructions

https://singularity-tutorial.github.io/03-building/

to try and build my own container that runs on the super-cheap cluster.

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

ejolson wrote:
Tue Apr 06, 2021 1:40 am
My next plan is to follow the instructions

https://singularity-tutorial.github.io/03-building/

to try and build my own container that runs on the super-cheap cluster.
Those instructions don't work on a Pi. After installing debootstrap using apt, the 32-bit Raspberry Pi OS gets confused about whether it's Debian or something else and complains

Code: Select all

``````$ singularity build --fakeroot --sandbox orig orig.def
INFO:    Starting build...
W: Cannot check Release signature; keyring file not available /usr/share/keyrings/debian-archive-keyring.gpg
I: Retrieving InRelease
E: Invalid Release file, no entry for main/binary-arm/Packages
FATAL:   While performing build: conveyor failed to get: while debootstrapping: exit status 1
``````
I asked the dog developer and it seems the way most people build a Singularity container is to first import a Docker container and then run some extra install scripts. According to Fido, interoperability of Singularity with Docker was an important part of the design. This interoperability essentially allows one to run Docker containers using Singularity with user-level permissions and the native networking needed when the container is scheduled as a Slurm job on a super computer.

Since Alpine Linux has an ARMv6 Docker container, I created an alpine.def file

Code: Select all

``````BootStrap: docker
From: alpine@sha256:cba24b50b9d81704968f65455897a3a519568b236c174c9040135c5afee5dc54

%runscript
echo "This is what happens when you run the container..."
``````
and then created and tested the container with

Code: Select all

``````$ singularity build --fakeroot alpine.sif alpine.def
INFO:    Starting build...
Getting image source signatures
Copying blob bb87125c6ee1 [--------------------------------------] 0.0b / 0.0b
Copying config 302fb7d0eb done
Writing manifest to image destination
Storing signatures
2021/04/06 21:42:09  info unpack layer: sha256:bb87125c6ee1ce30c6b33d2c96a9fbe46da4a290f7cb1dd73fd62d4e06503699
INFO:    Creating SIF file...
INFO:    Build complete: alpine.sif
$ singularity run alpine.sif
This is what happens when you run the container...
``````
Note that the sha256 signature was needed to prevent Singularity from pulling the ARMv7 version of Alpine Linux which, of course, segfaults if run on a Pi Zero. I think there should be a way to build Singularity so it automatically detects the difference between a Pi Zero and a Pi 4B. If anyone knows how, a post explaining the build procedure would be appreciated.

Note that to find the sha256 I ran the following Docker command:

Code: Select all

``````$ docker manifest inspect alpine
{
"schemaVersion": 2,
"mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
"manifests": [
{
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
"size": 528,
"digest": "sha256:e103c1b4bf019dc290bcc7aca538dc2bf7a9d0fc836e186f5fa34945c5168310",
"platform": {
"architecture": "amd64",
"os": "linux"
}
},
{
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
"size": 528,
"digest": "sha256:cba24b50b9d81704968f65455897a3a519568b236c174c9040135c5afee5dc54",
"platform": {
"architecture": "arm",
"os": "linux",
"variant": "v6"
}
},
{
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
"size": 528,
"digest": "sha256:59b46c319f3b66dfda96faafd0c6959e9b2f409792d0236204f270dfd0235960",
"platform": {
"architecture": "arm",
"os": "linux",
"variant": "v7"
}
},
{
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
"size": 528,
"digest": "sha256:071fa5de01a240dbef5be09d69f8fef2f89d68445d9175393773ee389b6f5935",
"platform": {
"architecture": "arm64",
"os": "linux",
"variant": "v8"
}
},
{
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
"size": 528,
"digest": "sha256:e10c13a5af47b1f2f5e3fbb9355fa82bc1234567a83a549d61d15697131e6e66",
"platform": {
"architecture": "386",
"os": "linux"
}
},
{
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
"size": 528,
"digest": "sha256:f3a907bc0278ea0de7ddafcbca3c9a63cee253a4698eb248a4416d46fc906dbd",
"platform": {
"architecture": "ppc64le",
"os": "linux"
}
},
{
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
"size": 528,
"platform": {
"architecture": "s390x",
"os": "linux"
}
}
]
}
``````
and chose the sha256 corresponding to the v6 variant. As I'm not experienced with Docker, there might also be a way to simply specify v6 somehow instead of the signature, but I couldn't figure it out. Again, a post describing a better way would be appreciated.

The next step seems to be creating a container with a new version of Julia inside. At the moment I'm stuck on that, because I still can't get version 1.6.0 to compile on the 4B, let alone the Zero.

viewtopic.php?f=34&t=285336

ejolson
Posts: 7510
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

ejolson wrote:
Mon Apr 05, 2021 2:47 am
Astonishingly, the build finished without errors with the report

Code: Select all

``````6926.80user 639.78system 2:21:44elapsed 88%CPU (0avgtext+0avgdata 323344maxresident)k
1191704inputs+5081416outputs (6811major+6488304minor)pagefaults 0swaps
``````
I decided to install Singularity on the 2U Pi Cloud to compare compile times.

viewtopic.php?f=36&t=279176

The time to build it on a 4B was

Code: Select all

``````794.58user 142.69system 6:52.11elapsed 227%CPU (0avgtext+0avgdata 315092maxresident)k
362472inputs+3322856outputs (91056major+6610603minor)pagefaults 0swaps
``````
This suggests the Pi 4B is 20.6 times faster than the B+ which is used as the head node for the super-cheap cluster. That's not too surprising and suggests it might be a better idea to build a new cluster using the 4B rather than the Zero. At the same time, just as the real epidemic has led to extended depreciation schedules for practical computing clusters, the resulting silicon drought has intubated the life of the super-cheap cluster for learning.