## Super-cheap Computing Cluster for Learning

ejolson
Posts: 2482
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

A surprising number of jobs which execute on many clusters are single-threaded and non-parallel. In those cases parallelism is sometimes obtained as a by product of running many independent copies of the same program with a different choice of parameters. Consider the Monte-Carlo method for approximating Pi from the OctaPi project Calculating Pi. Like darts thrown at a dart board, this method generates random points in a square and then counts how many of those points fall into the circle. Note that due to the randomness, one obtains a different approximation of Pi with each run.

While the dartboard method is a general way of approximating areas, it converges very slowly at a rate of 1/sqrt(n) and is, therefore, also one of the least efficient ways to compute Pi. The Python program included in the OctaPi project computes the percentage of darts falling inside the circle at a speed of 25000 darts/second using a precision of 100 decimal digits. To compute all 100 decimal digits accurately would require approximately 0.4E197 darts and consequently about 0.5E185 years to finish. On the other hand, the age of the universe is estimated to be about 1.3E10 years and it follows in that amount of time that a Pi Zero could throw 4.1E18 darts to obtain an approximation good to about 8 significant digits. As in the OctaPi project, our goal is not to compute Pi, but to understand parallel processing and how the Slurm batch system works.

We begin by creating a batch job which runs the Python code from the OctaPi project on one of the available Pi Zeros in the cluster. Make a working directory called pi. Then download the program file pi_dartboard.py from the OctaPi project and move it to the working directory. Note this is the dartboard algorithm for a stand alone processor, not the version designed for clusters. Before continuing, we make one minor change to pi_dartboard.py to flush the output buffers after each status update. Without this change all output would be buffered until the program finishes and that would make it impossible to monitor the status of the program while it is running in the batch environment. To do this add the option flush=True to the print on line 63 to obtain

Code: Select all

``            print(('Executed trial %i using random seed %i with result %i' % (i, ran_seed, inside)),flush=True)``
Next create a Slurm script called pi_dartboard.slm which contains

Code: Select all

``````#!/bin/bash
time {
printf '100\n10000\n' | python3 pi_dartboard.py
}``````
Since the original Python program was intended to be run interactively, it waits on user input for the size of each trial and the number of trials to run. In a batch system the program will be started when computing resources are available at some unspecified time in the future. Therefore, we specify the inputs for the program ahead of time. In the above case we set trial size to 100 and number of trials to 10000.

Now, submit a few copies of the batch job, check whether they have been scheduled and monitor their progress using

Code: Select all

``````\$ sbatch pi_dartboard.slm
Submitted batch job 54
\$ sbatch pi_dartboard.slm
Submitted batch job 55
\$ sbatch pi_dartboard.slm
Submitted batch job 56
\$ sbatch pi_dartboard.slm
Submitted batch job 57
\$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
54      zero pi_dartb  ejolson  R       0:07      1 s0
55      zero pi_dartb  ejolson  R       0:04      1 s1
56      zero pi_dartb  ejolson  R       0:04      1 s2
57      zero pi_dartb  ejolson  R       0:04      1 s3
\$ cat slurm-56.out
Number of random points to include in each trial =
Number of trials to run =
Doing 10000 trials of 100 points each
Executed trial 0 using random seed 43897 with result 86
Executed trial 1000 using random seed 52318 with result 79
Executed trial 2000 using random seed 59206 with result 80``````
Note that the output of each run is stored in a file of the form slurm-XX.out where XX is the job ID of the run. When squeue returns empty, all programs have finished. The final output should now appear as

Code: Select all

``````\$ cat slurm-54.out
Number of random points to include in each trial = Number of trials to run = Doing 10000 trials of 100 points each
Executed trial 0 using random seed 41166 with result 74
Executed trial 1000 using random seed 36385 with result 79
Executed trial 2000 using random seed 39999 with result 87
Executed trial 3000 using random seed 55252 with result 79
Executed trial 4000 using random seed 9008 with result 79
Executed trial 5000 using random seed 28951 with result 81
Executed trial 6000 using random seed 22201 with result 79
Executed trial 7000 using random seed 54954 with result 74
Executed trial 8000 using random seed 65485 with result 74
Executed trial 9000 using random seed 53049 with result 78
The value of Pi is estimated to be 3.14902000000000015234036254696547985076904296875 using 1000000 points

real    0m40.572s
user    0m40.300s
sys 0m0.120s
\$ cat slurm-55.out
Number of random points to include in each trial = Number of trials to run = Doing 10000 trials of 100 points each
Executed trial 0 using random seed 50261 with result 78
Executed trial 1000 using random seed 24993 with result 75
Executed trial 2000 using random seed 56492 with result 76
Executed trial 3000 using random seed 2695 with result 79
Executed trial 4000 using random seed 39028 with result 79
Executed trial 5000 using random seed 42505 with result 69
Executed trial 6000 using random seed 38572 with result 79
Executed trial 7000 using random seed 21871 with result 87
Executed trial 8000 using random seed 52748 with result 79
Executed trial 9000 using random seed 16073 with result 77
The value of Pi is estimated to be 3.1490960000000001173248165287077426910400390625 using 1000000 points

real    0m40.088s
user    0m39.830s
sys 0m0.080s
\$ cat slurm-56.out
Number of random points to include in each trial = Number of trials to run = Doing 10000 trials of 100 points each
Executed trial 0 using random seed 43897 with result 86
Executed trial 1000 using random seed 52318 with result 79
Executed trial 2000 using random seed 59206 with result 80
Executed trial 3000 using random seed 16956 with result 82
Executed trial 4000 using random seed 44605 with result 82
Executed trial 5000 using random seed 11321 with result 76
Executed trial 6000 using random seed 46185 with result 77
Executed trial 7000 using random seed 23947 with result 76
Executed trial 8000 using random seed 29328 with result 74
Executed trial 9000 using random seed 62471 with result 85
The value of Pi is estimated to be 3.149503999999999859227273191208951175212860107421875 using 1000000 points

real    0m37.257s
user    0m36.980s
sys 0m0.100s
\$ cat slurm-57.out
Number of random points to include in each trial = Number of trials to run = Doing 10000 trials of 100 points each
Executed trial 0 using random seed 17784 with result 77
Executed trial 1000 using random seed 18571 with result 76
Executed trial 2000 using random seed 11993 with result 81
Executed trial 3000 using random seed 58358 with result 75
Executed trial 4000 using random seed 60686 with result 84
Executed trial 5000 using random seed 38634 with result 79
Executed trial 6000 using random seed 34956 with result 80
Executed trial 7000 using random seed 11397 with result 81
Executed trial 8000 using random seed 54805 with result 77
Executed trial 9000 using random seed 1196 with result 78
The value of Pi is estimated to be 3.14919200000000021333335098461247980594635009765625 using 1000000 points

real    0m39.179s
user    0m38.910s
sys 0m0.090s``````
We observe that 1000000 darts were thrown in about 40 seconds per Pi Zero for a speed of 25000 darts/second per Pi. Also note that the approximation is only good to about 3 significant digits even though 100 digits of precision were used in the calculation and that the resulting approximation is different for each run. In particular, averaging the approximations obtained by all the runs results in a better approximation. This is how we shall obtain parallel speed-up when using multiple Zeros.

For performance reasons, scientific computing is usually done using a combination of Fortran and C programming languages. While Fortran remains a domain-specific language with little popularity outside scientific computing, the C programming language is equally suited for scientific computing but also widely used in many other application areas. We rewrite the Monte-Carlo method for computing Pi in C as an example of a single-threaded non-parallel code that might be run on computing clusters. The C program looks like

Code: Select all

``````#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/resource.h>

typedef unsigned int uint;
uint no_of_trials = 1000;
uint no_of_points = 100000;

uint compute(uint s,uint n){
srandom(s);
uint inside=0;
for(uint i=0;i<n;i++){
double x=(double)random()/RAND_MAX;
double y=(double)random()/RAND_MAX;
double z=x*x+y*y;
if(z<=1.0){
inside++;
}
}
return inside;
}

static double tic_time;
void tic() {
struct timeval tp;
gettimeofday(&tp,0);
tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
struct timeval tp;
gettimeofday(&tp,0);
return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

int main(){
printf("Estimating Pi using Dartboard/Monte-Carlo Method\n"
"with %u trials of %u points.\n\n",
no_of_trials,no_of_points);
fflush(stdout);
double pi=0;
tic();
double tprint=5;
FILE *urandom=fopen("/dev/urandom","r");
for(uint i=1;i<=no_of_trials;i++){
uint ran_seed;
uint inside=compute(ran_seed,no_of_points);
double tnow=toc();
if(tnow>tprint){
printf("Trial %u using seed %u with %u inside"
" (ETA %g seconds)\n",
i,ran_seed,inside,tnow/i*(no_of_trials-i));
fflush(stdout);
tprint=tnow+5;
}
pi+=(double)inside/no_of_points;
}
fclose(urandom);
pi*=4.0/no_of_trials;
printf("\nThe value of Pi is estimated to be %.15f\n"
"using %u trials of %u points each.\n\n",
pi,no_of_trials,no_of_points);
double ttime=toc();
printf("Total running time %g seconds\n"
"with average rate of %g darts/second.\n",
ttime,no_of_trials/ttime*no_of_points);
return 0;
}
``````
and the corresponding pi.slm is

Code: Select all

``````#!/bin/bash
./pi
``````
Compile, submit this program and monitor its progress using

Code: Select all

``````\$ gcc -std=gnu99 -Wall -O2 -o pi pi.c -lm
\$ sbatch pi.slm
Submitted batch job 66
\$ sbatch pi.slm
Submitted batch job 67
\$ sbatch pi.slm
Submitted batch job 68
\$ sbatch pi.slm
Submitted batch job 69
\$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
66      zero   pi.slm  ejolson  R       0:04      1 s0
67      zero   pi.slm  ejolson  R       0:04      1 s1
68      zero   pi.slm  ejolson  R       0:01      1 s2
69      zero   pi.slm  ejolson  R       0:01      1 s3
\$ cat slurm-68.out
Estimating Pi using Dartboard/Monte-Carlo Method
with 1000 trials of 100000 points.

Trial 168 using seed 1638688743 with 78596 inside (ETA 24.894 seconds)
Trial 337 using seed 76444161 with 78365 inside (ETA 19.7693 seconds)
Trial 506 using seed 1246513006 with 78327 inside (ETA 14.7182 seconds)``````
The final output from one of the jobs looks like

Code: Select all

``````\$ cat slurm-68.out
Estimating Pi using Dartboard/Monte-Carlo Method
with 1000 trials of 100000 points.

Trial 168 using seed 1638688743 with 78596 inside (ETA 24.894 seconds)
Trial 337 using seed 76444161 with 78365 inside (ETA 19.7693 seconds)
Trial 506 using seed 1246513006 with 78327 inside (ETA 14.7182 seconds)
Trial 675 using seed 3803787632 with 78665 inside (ETA 9.67678 seconds)
Trial 844 using seed 1400931100 with 78663 inside (ETA 4.64343 seconds)

The value of Pi is estimated to be 3.141761479999999
using 1000 trials of 100000 points each.

Total running time 29.7284 seconds
with average rate of 3.36379e+06 darts/second.``````
We note that the C code runs about 132 times faster than the Python code. Therefore, if a Python program is running slow, it might be wise to translate it to a more efficient language before making a parallel version. On the other hand, the original Python code may run as fast as the C version using the PyPy just-in-time compiler. It would be interesting for someone to check this. In subsequent posts we obtain a parallel increase in speed for the C version of this computation using the OpenMPI message passing interface.
Last edited by ejolson on Fri Oct 12, 2018 11:24 am, edited 1 time in total.

ejolson
Posts: 2482
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

In this post we will create a parallel Monte-Carlo approximation of Pi. Almost all parallel programs that run on supercomputers are created using the MPI message passing interface. This library passes messages over the high-speed interconnect to copy data back and forth between distributed memory to coordinate the individual nodes in order to perform a single computation. From an educational point of view, it is wonderful that the same MPI libraries used to program the fastest supercomputers in the world also run on the super-cheap Raspberry Pi cluster.

To use MPI, a parallel code must be divided into ranks that run on separate CPUs. As all supercomputers run Linux these days, it is reasonable to think of a rank as a Linux process. To minimize latencies induced by context-switching in general purpose operating systems not designed for message passing, most MPI libraries busy wait and spend 100% CPU rapidly polling until new data arrives. Historically, busy waiting was not seen to be a problem because it was assumed that there is exactly one CPU reserved for each rank. It also significantly speeds things up when using shared memory to pass messages between ranks on a multi-core node. Unfortunately, busy waiting reduces energy efficiency by preventing modern hardware from entering low power states, it makes multi-threaded programming difficult and it degrades throughput in cases where the CPU could have been used for timeshared multitasking.

In particular, busy waiting does not speed things up for the super-cheap cluster. There is significant latency already present in the USB Ethernet gadget and using shared memory by multi threading is not possible because there is only one CPU per Zero. Moreover, having only one CPU means nothing is left to run the operating system if we devote the entire CPU to the MPI calculation. Currently there are two competitive implementations of the MPI library. MPICH which was developed at Argonne National Laboratory and OpenMPI which represents a merger between versions developed at the University of Tennessee, Los Alamos National Laboratory and Indiana University. Between the two, it turns out that MPICH can more easily be tuned to use blocking input output instead of busy waiting. However, the version of MPICH included in Raspbian was configured to use busy waiting. Therefore, we configure and compile it ourselves.

To do this download mpich-3.2.1 and then type the following commands

\$ tar zxf mpich-3.2.1.tar.gz
\$ cd mpich-3.2.1
\$ ./configure --with-device=ch3:sock
\$ make

Note that --with-device=ch3:sock is the option required to avoid busy waiting. If all goes well, switch to the root user and install using

\$ su
# make install

At this point update the root nodes of the filesystem using the commands

# cd /x
# ./sreboot

wait for the Zeros to disconnect from the network bridge and then type

# cd /x
# ./supdate
# ./szeroup

Once the Zeros reboot you may have to update their state in Slurm using the commands

# scontrol update nodename=s0 state=idle
# scontrol update nodename=s1 state=idle
# scontrol update nodename=s2 state=idle
# scontrol update nodename=s3 state=idle
# scontrol update nodename=s4 state=idle

At this point we are ready to create a parallel Monte-Carlo approximation of Pi using the MPICH message passing library. The MPI C code is named pi_mpi.c and looks like like

Code: Select all

``````#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <mpi.h>
int mpi_rank, mpi_size;

typedef unsigned int uint;
uint no_of_trials = 10000;
uint no_of_points = 100000;

typedef struct { uint seed,darts,inside; } computeargs;

uint compute(uint s,uint n){
srandom(s);
uint inside=0;
for(uint i=0;i<n;i++){
double x=(double)random()/RAND_MAX;
double y=(double)random()/RAND_MAX;
double z=x*x+y*y;
if(z<=1.0){
inside++;
}
}
return inside;
}
uint rankn(){
for(;;){
computeargs x;
MPI_Status err;
MPI_Recv(&x,sizeof(computeargs),MPI_BYTE,
MPI_ANY_SOURCE,0,MPI_COMM_WORLD, &err);
if(!x.darts) return 0;
x.inside=compute(x.seed,x.darts);
MPI_Send(&x,sizeof(computeargs),MPI_BYTE,
0,0,MPI_COMM_WORLD);
}
}

static double tic_time;
void tic() {
struct timeval tp;
gettimeofday(&tp,0);
tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
struct timeval tp;
gettimeofday(&tp,0);
return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

int rank0(){
printf("Estimating Pi using Dartboard/Monte-Carlo Method\n"
"with %u trials of %u points.\n\n",
no_of_trials,no_of_points);
fflush(stdout);
double pi=0;
tic();
double tprint=5;
FILE *urandom=fopen("/dev/urandom","r");
for(;;){
int rank;
if(started>=mpi_size-1){
computeargs x;
MPI_Status err;
MPI_Recv(&x,sizeof(computeargs),MPI_BYTE,
MPI_ANY_SOURCE,0,MPI_COMM_WORLD,&err);
double tnow=toc();
if(tnow>tprint){
printf("Trial %u rank %d seed %u with %u inside"
" (ETA %g sec)\n",
fflush(stdout);
tprint=tnow+5;
}
pi+=(double)x.inside/no_of_points;
rank=err.MPI_SOURCE;
} else {
rank=started+1;
}
if(started<no_of_trials){
started++;
uint ran_seed;
computeargs x = { ran_seed, no_of_points };
MPI_Send(&x,sizeof(computeargs),MPI_BYTE,
rank,0,MPI_COMM_WORLD);
} else {
computeargs x = { 0, 0 };
MPI_Send(&x,sizeof(computeargs),MPI_BYTE,
rank,0,MPI_COMM_WORLD);
}
}
fclose(urandom);
pi*=4.0/no_of_trials;
printf("\nThe value of Pi is estimated to be %.15f\n"
"using %u trials of %u points each.\n\n",
pi,no_of_trials,no_of_points);
double ttime=toc();
printf("Total running time %g seconds\n"
"with average rate of %g darts/second.\n",
ttime,no_of_trials/ttime*no_of_points);
return 0;
}

int main(int argc, char *argv[]){
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);
if(mpi_size<2){
printf("Please use at least 2 ranks!\n");
} else {
if(mpi_rank>0) rankn();
else rank0();
}
MPI_Finalize();
return 0;
}``````
Note that rank 0 collects the results while the other ranks perform the computations. Compile the program using

\$ mpicc -std=gnu99 -Wall -O2 -o pi_mpi pi_mpi.c -lm

and then run it using batch script pi_mpi.slm given by

Code: Select all

``````#!/bin/bash
#SBATCH -n5
mpirun -n 6 ./pi_mpi``````
with the command

\$ sbatch pi_mpi.slm

Note that Slurm has been set to allocate five nodes but mpirun starts the MPI C program using six ranks. The result is that rank 0 and rank 5 will both be started on node s0. This over-provisioning would lead to performance degradation if the MPI library was configured for busy waiting, but since we are using blocking input output we can efficiently use all Zeros for the calculation while rank 0 collects the results. The output looks like

Code: Select all

``````Estimating Pi using Dartboard/Monte-Carlo Method
with 10000 trials of 100000 points.

Trial 802 rank 1 seed 426747775 with 78258 inside (ETA 57.425 sec)
Trial 1609 rank 1 seed 4232193648 with 78599 inside (ETA 52.2031 sec)
Trial 2418 rank 1 seed 2269661710 with 78461 inside (ETA 47.0672 sec)
Trial 3225 rank 5 seed 264736980 with 78526 inside (ETA 42.0558 sec)
Trial 4032 rank 5 seed 3973449706 with 78469 inside (ETA 37.0415 sec)
Trial 4841 rank 5 seed 1758916544 with 78848 inside (ETA 31.9995 sec)
Trial 5649 rank 1 seed 1912997491 with 78575 inside (ETA 26.9957 sec)
Trial 6458 rank 2 seed 3703685934 with 78492 inside (ETA 21.9661 sec)
Trial 7265 rank 1 seed 2545871756 with 78677 inside (ETA 16.9625 sec)
Trial 8072 rank 5 seed 3193836803 with 78528 inside (ETA 11.958 sec)
Trial 8882 rank 2 seed 2354279827 with 78560 inside (ETA 6.9314 sec)
Trial 9689 rank 2 seed 4144606175 with 78351 inside (ETA 1.92819 sec)

The value of Pi is estimated to be 3.141592439999978
using 10000 trials of 100000 points each.

Total running time 61.9744 seconds
with average rate of 1.61357e+07 darts/second.``````
Comparing the darts per second rate of the parallel code with the serial code in the previous post we obtain that

1.61357e+07/3.36379e+06=4.797

and conclude that our cluster of five Zeros performs the Monte-Carlo calculation almost five times faster than a single Zero.
Last edited by ejolson on Fri Oct 12, 2018 11:27 am, edited 2 times in total.

ejolson
Posts: 2482
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

The program pi_mpi.c from the previous post uses the MPI message passing interface to simulate remote procedure calls which are then executed asynchronously on the available nodes. This is because the original OctaPi: Calculating Pi project used remote procedure calls and we chose to emulate the parallel Python code as closely as possible. A simpler more idiomatic MPI program might break the problem up into mpi_size number of parts and combine them to form the solution at the end of the run. The C code pi_reduce.c given by

Code: Select all

``````#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <mpi.h>
int mpi_rank, mpi_size;

typedef unsigned int uint;
uint no_of_trials = 10000;
uint no_of_points = 100000;
double global_pi;

static double tic_time;
void tic() {
struct timeval tp;
gettimeofday(&tp,0);
tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
struct timeval tp;
gettimeofday(&tp,0);
return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

uint compute(uint s,uint n){
srandom(s);
uint inside=0;
for(uint i=0;i<n;i++){
double x=(double)random()/RAND_MAX;
double y=(double)random()/RAND_MAX;
double z=x*x+y*y;
if(z<=1.0){
inside++;
}
}
return inside;
}
void dotrials(){
int first_trial=mpi_rank*no_of_trials/mpi_size;
int last_trial=(mpi_rank+1)*no_of_trials/mpi_size;
int trials=last_trial-first_trial;
double pi=0;
FILE *urandom=fopen("/dev/urandom","r");
for(int i=first_trial;i<last_trial;i++){
uint ran_seed;
uint inside=compute(ran_seed,no_of_points);
pi+=(double)inside/no_of_points;
}
fclose(urandom);
pi*=4.0/trials;
printf("Rank %d finished in %g seconds with Pi %.15f\n",
mpi_rank,toc(),pi);
fflush(stdout);
MPI_Reduce(&pi,&global_pi,1,MPI_DOUBLE,
MPI_SUM,0,MPI_COMM_WORLD);
}

int main(int argc, char *argv[]){
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

tic();
if(mpi_rank==0){
printf("Estimating Pi using Dartboard/Monte-Carlo Method\n"
"with %u trials of %u points.\n\n",
no_of_trials,no_of_points);
fflush(stdout);
}

dotrials();

if(mpi_rank==0){
global_pi/=mpi_size;
printf("\nThe value of Pi is estimated to be %.15f\n"
"using %u trials of %u points each.\n\n",
global_pi,no_of_trials,no_of_points);
double ttime=toc();
printf("Total running time %g seconds\n"
"with average rate of %g darts/second.\n",
ttime,no_of_trials/ttime*no_of_points);
}

MPI_Finalize();
return 0;
}``````
uses a single MPI_Reduce call to sum the separate approximations of Pi from each of the nodes and combine them into a final answer. Running the above program using the Slurm configuration file

Code: Select all

``````#!/bin/bash
#SBATCH -n5
mpirun ./pi_reduce``````
results in the output

Code: Select all

``````Estimating Pi using Dartboard/Monte-Carlo Method
with 10000 trials of 100000 points.

Rank 3 finished in 59.4139 seconds with Pi 3.141538220000008
Rank 4 finished in 59.4436 seconds with Pi 3.141701000000008
Rank 1 finished in 59.4442 seconds with Pi 3.141408220000017
Rank 2 finished in 59.4451 seconds with Pi 3.141482060000007
Rank 0 finished in 60.6071 seconds with Pi 3.141667720000010

The value of Pi is estimated to be 3.141559444000010
using 10000 trials of 100000 points each.

Total running time 60.6118 seconds
with average rate of 1.64984e+07 darts/second.``````
Again comparing the darts per second rate with the serial code we obtain a parallel speedup of

1.64984e+07/3.36379e+06=4.905

and infer the above program is slightly faster than the previous MPI code that used simulated remote procedure calls.

Using MPI_Reduce works well for the Monte-Carlo simulation, because each trial takes the exact same amount of time because each node in the super-cheap cluster has the same performance characteristics. Thus, the total work needed to throw all the darts is easily partitioned between the individual ranks ahead of time. If the cluster were heterogeneous made out of computers running at different speeds or if some of the nodes were over provisioned running additional tasks, then it would not be possible to accurately partition the work ahead of time and the program would run only as fast as the slowest node.

We emphasize that using simulated remote procedure calls as in the previous MPI code adapts to situations where the time to compute each trial on any particular node is unpredictable. At the same time, the number of ranks to which this method can scale is limited by the fact that the rank 0 process must explicitly start each trial on each of the other ranks. In the case of the super-cheap cluster, there are only 5 nodes so this is not an issue. However, larger clusters have thousands of nodes. In that case MPI_Reduce may be preferable because it uses sophisticated algorithms built into the MPI library that can scale to a much larger number of ranks.

ejolson
Posts: 2482
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

As stated at the beginning of this thread, the super-cheap computing cluster is designed for learning, not for practical scientific computation. To illustrate the difference in performance, we now run pi_reduce on a 24-core 2.6Ghz Xeon Scale server. The results

Code: Select all

``````Estimating Pi using Dartboard/Monte-Carlo Method
with 10000 trials of 1000000 points.

Rank 3 finished in 15.6546 seconds with Pi 3.141525894230769
Rank 6 finished in 15.6564 seconds with Pi 3.141710009615383
Rank 12 finished in 15.6574 seconds with Pi 3.141567644230772
Rank 0 finished in 15.6596 seconds with Pi 3.141744076923075
Rank 18 finished in 15.6618 seconds with Pi 3.141539538461537
Rank 15 finished in 15.6683 seconds with Pi 3.141624634615384
Rank 9 finished in 15.6766 seconds with Pi 3.141604730769233
Rank 21 finished in 15.6854 seconds with Pi 3.141742913461537
Rank 2 finished in 15.6938 seconds with Pi 3.141564623501198
Rank 8 finished in 15.6941 seconds with Pi 3.141575184652276
Rank 17 finished in 15.694 seconds with Pi 3.141724940047959
Rank 23 finished in 15.694 seconds with Pi 3.141690100719419
Rank 20 finished in 15.6943 seconds with Pi 3.141587194244604
Rank 10 finished in 15.6949 seconds with Pi 3.141487079136687
Rank 16 finished in 15.6953 seconds with Pi 3.141587741007196
Rank 1 finished in 15.6953 seconds with Pi 3.141562302158272
Rank 14 finished in 15.7046 seconds with Pi 3.141616575539569
Rank 5 finished in 15.7096 seconds with Pi 3.141489410071944
Rank 4 finished in 15.7101 seconds with Pi 3.141559769784172
Rank 22 finished in 15.7104 seconds with Pi 3.141569179856113
Rank 11 finished in 15.715 seconds with Pi 3.141545352517987
Rank 19 finished in 15.7156 seconds with Pi 3.141607050359713
Rank 13 finished in 15.7251 seconds with Pi 3.141589227817745
Rank 7 finished in 15.7348 seconds with Pi 3.141577544364508

The value of Pi is estimated to be 3.141599696586960
using 10000 trials of 1000000 points each.

Total running time 15.7351 seconds
with average rate of 6.35523e+08 darts/second.``````
indicate that a single node in a typical supercomputer runs the code we developed using the Pi a factor of

6.35523e+08/1.64984e+07=38.5

times faster than the super-cheap cluster. This is actually less of a speedup than I had expected, especially considering the Xeon server costs about 100 times more than the entire Pi cluster. Note that we are essentially testing the speed of the pseudo-random number generator built into the C library between the ARM and x86 versions of Linux. If five Xeon nodes were devoted to the computation, one would expect a 188 times performance increase. Additional speedup may be possible with tuning. We end this post with the following observation: It is significant from a learning point of view that the code developed for the super-cheap cluster will run on a real supercomputer without modification using exactly the same commands.

ejolson
Posts: 2482
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

This post describes how to install and run the linear algebra benchmark called HPL that solves systems of linear equations using Gaussian elimination with partial pivoting. This benchmark is used for ranking the top 500 supercomputers in the world in both performance and electrical efficiency.

Before getting started install a couple of subroutine libraries: the basic linear algebra subroutine library OpenBLAS as well as the linear algebra package LAPACK. Log into the head node and as root type

# apt-get install libopenblas-dev liblapacke-dev

Note that LAPACK is not needed for the benchmark, but may be needed for running other programs in the future. While making changes, edit /etc/bash.bashrc and add the line

export SQUEUE_FORMAT="%.6i %.18j %.10u %.2t %.11M %.3C %.8r %N"

in order to make the default output format for the squeue command more appropriate to our hardware configuration. Finally, let's update the operating system using

# apt-get update; apt-get upgrade

Be sure to watch out for kernel upgrades. If the kernel is upgraded you will have to regenerate the initramfs and copy the new kernel into the correct place manually for the new kernel to be used. At this point I discovered that the original 48M size of the /boot partition was too small for the package manager to update the kernel and initramfs while also storing the additional files mykernel.img and myinitrd.img. This oversight has been fixed by editing the earlier posts to specify a /boot partition size of 96M.

Propagate the changes from the head node to the computational nodes by shutting them down, updating their snapshots and rebooting them using the same combination of sreboot, supdate and szeroup commands described earlier in this thread.

Now, download HPL and unpack it using

\$ cd
\$ mkdir -p work
\$ cd work
\$ wget http://www.netlib.org/benchmark/hpl/hpl-2.2.tar.gz
\$ tar zxf hpl-2.2.tar.gz
\$ cd hpl-2.2

Create a configuration file called make.OpenBLAS which reads

Code: Select all

``````\$ cat Make.OpenBLAS
#
#  -- High Performance Computing Linpack Benchmark (HPL)
#     HPL - 2.2 - February 24, 2016
#     Antoine P. Petitet
#     University of Tennessee, Knoxville
#     Innovative Computing Laboratory
#
#  -- Copyright notice and Licensing terms:
#
#  Redistribution  and  use in  source and binary forms, with or without
#  modification, are  permitted provided  that the following  conditions
#  are met:
#
#  1. Redistributions  of  source  code  must retain the above copyright
#  notice, this list of conditions and the following disclaimer.
#
#  2. Redistributions in binary form must reproduce  the above copyright
#  notice, this list of conditions,  and the following disclaimer in the
#  documentation and/or other materials provided with the distribution.
#
#  3. All  advertising  materials  mentioning  features  or  use of this
#  software must display the following acknowledgement:
#  This  product  includes  software  developed  at  the  University  of
#  Tennessee, Knoxville, Innovative Computing Laboratory.
#
#  4. The name of the  University,  the name of the  Laboratory,  or the
#  names  of  its  contributors  may  not  be used to endorse or promote
#  products  derived   from   this  software  without  specific  written
#  permission.
#
#  -- Disclaimer:
#
#  THIS  SOFTWARE  IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
#  ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES,  INCLUDING,  BUT NOT
#  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
#  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY
#  OR  CONTRIBUTORS  BE  LIABLE FOR ANY  DIRECT,  INDIRECT,  INCIDENTAL,
#  SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL DAMAGES  (INCLUDING,  BUT NOT
#  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
#  DATA OR PROFITS; OR BUSINESS INTERRUPTION)  HOWEVER CAUSED AND ON ANY
#  THEORY OF LIABILITY, WHETHER IN CONTRACT,  STRICT LIABILITY,  OR TORT
#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
#  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# ######################################################################
#
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -s -f
MKDIR        = mkdir -p
RM           = /bin/rm -f
TOUCH        = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH         = OpenBLAS
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir       = /x/snaa/ejolson/work/hpl-2.2
INCdir       = \$(TOPdir)/include
BINdir       = \$(TOPdir)/bin/\$(ARCH)
LIBdir       = \$(TOPdir)/lib/\$(ARCH)
#
HPLlib       = \$(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the  C  compiler where to find the Message Passing library
# header files,  MPlib  is defined  to be the name of  the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir        = /usr/local
MPinc        = -I\$(MPdir)/include
MPlib        = -lmpich
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the  C  compiler where to find the Linear Algebra  library
# header files,  LAlib  is defined  to be the name of  the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAinc        =
LAlib        = /usr/lib/libopenblas.a -lpthread
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section  if and only if  you are not planning to use
# a  BLAS  library featuring a Fortran 77 interface.  Otherwise,  it  is
# necessary  to  fill out the  F2CDEFS  variable  with  the  appropriate
# options.  **One and only one**  option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_              : all lower case and a suffixed underscore  (Suns,
#                       Intel, ...),                           [default]
# -DNoChange          : all lower case (IBM RS6000),
# -DUpCase            : all upper case (Cray),
# -DAdd__             : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int   : Fortran 77 INTEGER is a C int,         [default]
# -DF77_INTEGER=long  : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle    : The string address is passed at the string loca-
#                       tion on the stack, and the string length is then
#                       passed as  an  F77_INTEGER  after  all  explicit
#                       stack arguments,                       [default]
# -DStringStructPtr   : The address  of  a  structure  is  passed  by  a
#                       Fortran 77  string,  and the structure is of the
#                       form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal   : A structure is passed by value for each  Fortran
#                       77 string,  and  the  structure is  of the form:
#                       struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle   : Special option for  Cray  machines,  which  uses
#                       Cray  fcd  (fortran  character  descriptor)  for
#                       interoperation.
#
F2CDEFS      =
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I\$(INCdir) -I\$(INCdir)/\$(ARCH) \$(LAinc) \$(MPinc)
HPL_LIBS     = \$(HPLlib) \$(LAlib) \$(MPlib)
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L           force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS       call the cblas interface;
# -DHPL_CALL_VSIPL       call the vsip  library;
# -DHPL_DETAILED_TIMING  enable detailed timers;
#
# By default HPL will:
#    *) not copy L before broadcast,
#    *) call the BLAS Fortran 77 interface,
#    *) not display detailed timing information.
#
HPL_OPTS     = -DHPL_CALL_CBLAS
#
# ----------------------------------------------------------------------
#
HPL_DEFS     = \$(F2CDEFS) \$(HPL_OPTS) \$(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC           = /usr/bin/gcc
CCNOOPT      = \$(HPL_DEFS)
CCFLAGS      = \$(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
#
# On some platforms,  it is necessary  to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo
#
# ----------------------------------------------------------------------``````
and compile with the command

\$ make arch=OpenBLAS

After the compilation is finished, change to the hpl-2.2/bin/OpenBLAS directory using

\$ cd bin/OpenBLAS

and edit the file HPL.DAT so it reads

Code: Select all

``````HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
8000         Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
5            Qs
16.0         threshold
1            # of panel fact
0            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)``````
Create a slurm configuration file called xhpl.slm that reads

Code: Select all

``````#!/bin/bash
#SBATCH -n5
mpirun -n 5 ./xhpl``````
Submit the the job using

\$ sbatch xhpl.slm

After about six minutes the output from the job will be left in slurm-X.out where X was the job sequence number. In our case the output was given by

Code: Select all

``````================================================================================
HPLinpack 2.2  --  High-Performance Linpack benchmark  --   February 24, 2016
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :    8000
NB     :     128
PMAP   : Row-major process mapping
P      :       1
Q      :       5
PFACT  :    Left
NBMIN  :       2
NDIV   :       2
RFACT  :   Right
BCAST  :   1ring
DEPTH  :       0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00R2L2        8000   128     1     5             308.26              1.108e+00
HPL_pdgesv() start time Mon Mar 12 21:56:49 2018

HPL_pdgesv() end time   Mon Mar 12 22:01:57 2018

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0017511 ...... PASSED
================================================================================

Finished      1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================``````
which indicates that the aggregate performance of the cluster is about 1.1 Gflops. This result is about 6 times slower than a single Raspberry Pi 3B but similar to the Terrible Cluster which reached a speed of 1.3 Gflops. I suspect the 20 percent improvement in speed for the Terrible Cluster may come from running part of the computation on the head node, which has faster communication to the other nodes. In 1993 the super-cheap cluster described in this thread would have ranked about 250 on the list of the top 500 supercomputers, right next to the Cray Y-MP.

For further comparison, note that on the same benchmark the four cores in the Pi 2B achieve 1.56 Gflops, the Pi 3B achieves 6.4 Gflops and the Pi 3B+ achieves 6.7 Gflops. Moreover using a larger 20000 by 20000 matrix, an AMD Ryzen 1700 8-core 16-thread desktop achieves 89.9 Gflops, a dual Intel E5-2620 Xeon 12-core 24-thread server achieves 226.0 Gflops and a dual Intel Gold 6126 Xeon 24-core 48-thread server achieves 344.3 Gflops.

As of November 2017, most of the computational power represented by the supercomputers in the top 500 list is delivered by NVIDIA GPU and Intel Phi multi-core accelerators. A notable exception is the fastest machine on the list, the Sunway TaihuLight, which uses a custom processor designed and manufactured in China supplemented by no additional accelerators. In the case of the super-cheap cluster, the CPU in each Pi Zero includes a VideoCore IV GPU with a peak speed of 24 single-precision Gflops. In total, therefore, our super-cheap cluster has a theoretical peak of 120 single-precision Gflops. However, since the Linpack benchmark dictates the use of double precision, it is difficult to use the VideoCore IV in the Pi Zero to speed things up.

Before abandoning the idea of using the VideoCore entirely, note that it is possible to emulate double precision by using two single precision numbers. This float-float representation is described by G. D. Gracca and D. Defour and by A. Thall. We note that the correctly rounded float-float sum df64_add is obtained using 20 single precision operations while the float-float product df64_mult takes 22. Dividing 120 by 21 suggests that the 5 node Pi Zero cluster has a peak rate of about 5.7 float-float Gflops. While it is reasonable to assume peak rates will not be achieved, it would be interesting if someone wrote a float-float linear algebra solver using QPUlib to test whether increased performance is possible. I suspect the resulting speedup would be somewhere between 1.3 and 2.7 times; however, the computation might run slower instead.
Last edited by ejolson on Wed Sep 05, 2018 7:41 pm, edited 2 times in total.

ejolson
Posts: 2482
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

So far we have written two parallel MPI programs that implement the dartboard method for computing Pi. One version used MPI_Recv and MPI_Send to emulate remote procedure calls in order to be as close to the original Python program as possible while the other version was simpler and used MPI_Reduce. Recall that using MPI_Reduce for computing Pi resulted in a more efficient computation.

Using MPI_Reduce works well for a Monte-Carlo simulation because the total work needed to throw all the darts is predictable and easily partitioned between the individual ranks ahead of time. There are also parallel computations for which it is not so easy to predict how much work any single part might take. Emulated remote procedure calls automatically adapt to situations where it is not possible to evenly partition the work between the nodes ahead of time. Remote procedure calls can also overcome difficulties when the cluster is heterogeneous and some computers run more slowly than others.

To further compare the performance of emulating remote procedure calls using MPI versus statically partitioning the problem ahead of time, consider the task of computing the basins of attraction in the complex plane for Newton's method. A parallel code for calculating Newton fractals was developed in a previous thread using the Intel/MIT Cilkplus parallel processing extensions to the C programming languages on the Raspberry Pi 2B. We now develop the parallel codes newton_mpi.c and newton_reduce.c which perform the same calculation on a Zero cluster and consider the serial code newton.c to gauge the resulting parallel speedup. The results of running these programs are

Code: Select all

``````\$ sbatch newton.slm
Submitted batch job 207
\$ sbatch newton_mpi.slm
Submitted batch job 208
\$ sbatch newton_reduce.slm
Submitted batch job 209
\$ squeue
JOBID               NAME       USER ST        TIME CPU   REASON NODELIST
208     newton_mpi.slm    ejolson PD        0:00   5 Resource
209  newton_reduce.slm    ejolson PD        0:00   5 Priority
207         newton.slm    ejolson  R        2:48   1     None s0
\$ cat slurm-207.out
Compute Basin of Attraction for Newton's Method
Image size 2048 x 2048 with maximum of 100 iterations

Finished raster 37 (ETA 273.734 seconds)
Finished raster 74 (ETA 268.729 seconds)
Finished raster 111 (ETA 263.665 seconds)
Finished raster 149 (ETA 257.031 seconds)
Finished raster 187 (ETA 251.128 seconds)
Finished raster 225 (ETA 245.429 seconds)
Finished raster 264 (ETA 239.284 seconds)
Finished raster 303 (ETA 233.173 seconds)
Finished raster 342 (ETA 227.169 seconds)
Finished raster 381 (ETA 221.245 seconds)
Finished raster 422 (ETA 214.365 seconds)
Finished raster 465 (ETA 206.566 seconds)
Finished raster 509 (ETA 198.914 seconds)
Finished raster 553 (ETA 191.55 seconds)
Finished raster 598 (ETA 183.979 seconds)
Finished raster 645 (ETA 176.023 seconds)
Finished raster 692 (ETA 168.419 seconds)
Finished raster 742 (ETA 160.242 seconds)
Finished raster 796 (ETA 151.09 seconds)
Finished raster 858 (ETA 140.215 seconds)
Finished raster 920 (ETA 130.129 seconds)
Finished raster 982 (ETA 120.675 seconds)
Finished raster 1047 (ETA 111.136 seconds)
Finished raster 1111 (ETA 102.295 seconds)
Finished raster 1191 (ETA 90.9095 seconds)
Finished raster 1284 (ETA 78.155 seconds)
Finished raster 1384 (ETA 65.4242 seconds)
Finished raster 1477 (ETA 54.6616 seconds)
Finished raster 1561 (ETA 45.6879 seconds)
Finished raster 1643 (ETA 37.337 seconds)
Finished raster 1720 (ETA 29.8472 seconds)
Finished raster 1791 (ETA 23.1771 seconds)
Finished raster 1869 (ETA 15.9518 seconds)
Finished raster 1960 (ETA 7.70327 seconds)

Total running time 182.86 seconds.
\$ cat slurm-208.out
Compute Basin of Attraction for Newton's Method
Image size 2048 x 2048 with maximum of 100 iterations
Asynchronous MPI Parallel version 6 ranks (5 for compute)

Rank 5 finished raster 175 (ETA 53.9952 seconds)
Rank 2 finished raster 356 (ETA 47.8896 seconds)
Rank 2 finished raster 555 (ETA 40.5987 seconds)
Rank 1 finished raster 781 (ETA 32.6375 seconds)
Rank 3 finished raster 1071 (ETA 22.9179 seconds)
Rank 1 finished raster 1469 (ETA 11.8825 seconds)
Rank 5 finished raster 1832 (ETA 4.14494 seconds)

Total running time 38.6461 seconds.
\$ cat slurm-209.out
Compute Basin of Attraction for Newton's Method
Image size 2048 x 2048 with maximum of 100 iterations
Simple MPI Parallel version 5 ranks for compute

Finished rank 3 slice in 23.9256 seconds.
Finished rank 4 slice in 24.3671 seconds.
Finished rank 2 slice in 31.8431 seconds.
Finished rank 1 slice in 45.6 seconds.
Finished rank 0 slice in 55.7153 seconds.

Total running time 58.5545 seconds.``````
The relative performance for each of the three codes may be visualized as Note that newton_mpi.c achieves almost perfect parallel scaling to the 5 nodes in the super-cheap cluster while newton_reduce.c is significantly less efficient. Although MPI_Reduce performed better for the Monte-Carlo calculation of Pi, the situation is reversed in the case of Newton fractals. The work is hard to partition ahead of time because different parts of the Newton fractal take more or less time to compute based on the complex iterated dynamics. For reference, the code for newton.c, newton_mpi.c and newton_reduce.c are included below.

Code: Select all

``````// newton.c
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>
#include <math.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>
int job_id;

#define RES 2048
#define IMAX 100
static int no_of_rows,no_of_cols;
static double xmin=0, xmax=1, ymin=0, ymax=1;
static double delta;

typedef struct { unsigned char r,g,b; } rgbtype;

unsigned char clip256(double x){
double r=256*x;
if(r>255) return 255;
if(r<0) return 0;
return (unsigned char)r;
}

rgbtype ztorgb(complex z){
rgbtype rgb;
double theta=carg(z), r=cabs(z);
double h=3*theta/M_PI;
rgb.r=clip256(2-fabs(h));
rgb.g=clip256(fabs(h-1)-1);
rgb.b=clip256(fabs(h+1)-1);
if(r<1){
rgb.r=255-(255-rgb.r)*r;
rgb.g=255-(255-rgb.g)*r;
rgb.b=255-(255-rgb.b)*r;
} else {
rgb.r/=r; rgb.g/=r; rgb.b/=r;
}
return rgb;
}
void outimage(int m,int n,complex A[m][n]){
char filename;
sprintf(filename,"image-%d.ppm",job_id);
FILE *fp=fopen(filename,"w");
fprintf(fp,"P6 %d %d 255\n",n,m);
for(int j=0;j<m;j++){
rgbtype raster[n];
for(int k=0;k<n;k++){
raster[k]=ztorgb(A[j][k]);
}
fwrite(raster,n*sizeof(rgbtype),1,fp);
}
fclose(fp);
}

complex f(complex z) {
return z*z*z*z*z-1;
}
complex df(complex z) {
return 5*z*z*z*z;
}
complex phi(complex z) {
return z-f(z)/df(z);
}
complex compute(int n,complex a[n]){
for(int k=0;k<n;k++){
complex z=a[k];
for(int i=0;i<IMAX;i++){
complex w0=phi(z);
if(w0==z) break;
z=w0;
}
a[k]=z;
}
}

static double tic_time;
void tic() {
struct timeval tp;
gettimeofday(&tp,0);
tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
struct timeval tp;
gettimeofday(&tp,0);
return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

int main(int argc,char *argv[]){
setrlimit(RLIMIT_STACK,
&(const struct rlimit)
{RLIM_INFINITY,RLIM_INFINITY});
if(xmax-xmin>ymax-ymin){
delta=(xmax-xmin)/RES;
} else {
delta=(ymax-ymin)/RES;
}
no_of_rows=(ymax-ymin)/delta;
no_of_cols=(xmax-xmin)/delta;
char *p=getenv("SLURM_JOB_ID");
if(p) job_id=atoi(p);

complex A[no_of_rows][no_of_cols];
printf(
"Compute Basin of Attraction for Newton's Method\n"
"Image size %d x %d with maximum of %d iterations\n"
no_of_rows,no_of_cols,IMAX);
for(int j=0;j<no_of_rows;j++) for(int k=0;k<no_of_cols;k++){
A[j][k]=xmin+k*delta+I*(ymax-j*delta);
}
tic();
double tprint=5;
for(int j=0;j<no_of_rows;){
compute(no_of_cols,A[j]);
j++;
double tnow=toc();
if(tnow>tprint){
printf("Finished raster %u (ETA %g seconds)\n",
j,tnow/j*(no_of_rows-j));
fflush(stdout);
tprint=tnow+5;
}
}
outimage(no_of_rows,no_of_cols,A);
double ttime=toc();
printf("\nTotal running time %g seconds.\n",ttime);
return 0;
}``````

Code: Select all

``````// newton_mpi.c
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>
#include <math.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <mpi.h>
int mpi_rank, mpi_size, job_id;

#define RES 2048
#define IMAX 100
static int no_of_rows,no_of_cols,xsize;
static double xmin=0, xmax=1, ymin=0, ymax=1;
static double delta;

typedef struct { unsigned char r,g,b; } rgbtype;

unsigned char clip256(double x){
double r=256*x;
if(r>255) return 255;
if(r<0) return 0;
return (unsigned char)r;
}

rgbtype ztorgb(complex z){
rgbtype rgb;
double theta=carg(z), r=cabs(z);
double h=3*theta/M_PI;
rgb.r=clip256(2-fabs(h));
rgb.g=clip256(fabs(h-1)-1);
rgb.b=clip256(fabs(h+1)-1);
if(r<1){
rgb.r=255-(255-rgb.r)*r;
rgb.g=255-(255-rgb.g)*r;
rgb.b=255-(255-rgb.b)*r;
} else {
rgb.r/=r; rgb.g/=r; rgb.b/=r;
}
return rgb;
}
void outimage(int m,int n,rgbtype *image[m]){
char filename;
sprintf(filename,"image-%d.ppm",job_id);
FILE *fp=fopen(filename,"w");
fprintf(fp,"P6 %d %d 255\n",n,m);
for(int j=0;j<m;j++){
fwrite(image[j],n*sizeof(rgbtype),1,fp);
}
fclose(fp);
}

complex f(complex z) {
return z*z*z*z*z-1;
}
complex df(complex z) {
return 5*z*z*z*z;
}
complex phi(complex z) {
return z-f(z)/df(z);
}
rgbtype computergb(complex z){
for(int i=0;i<IMAX;i++){
complex w0=phi(z);
if(w0==z) break;
z=w0;
}
return ztorgb(z);
}

typedef struct { int this_row; rgbtype raster[]; } computeargs;

void rankn(){
computeargs *xp = malloc(xsize);
for(;;){
computeargs x;
MPI_Status err;
MPI_Recv(&x,sizeof(computeargs),MPI_BYTE,
MPI_ANY_SOURCE,0,MPI_COMM_WORLD, &err);
if(x.this_row<0) break;
xp->this_row=x.this_row;
for(int k=0;k<no_of_cols;k++) {
xp->raster[k]=computergb(
xmin+k*delta+I*(ymax-x.this_row*delta));
}
MPI_Send(xp,xsize,MPI_BYTE,0,0,MPI_COMM_WORLD);
}
free(xp);
return;
}

static double tic_time;
void tic() {
struct timeval tp;
gettimeofday(&tp,0);
tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
struct timeval tp;
gettimeofday(&tp,0);
return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

void rank0(){
setrlimit(RLIMIT_STACK,
&(const struct rlimit)
{RLIM_INFINITY,RLIM_INFINITY});
rgbtype *image[no_of_rows];
printf(
"Compute Basin of Attraction for Newton's Method\n"
"Image size %d x %d with maximum of %d iterations\n"
"Asynchronous MPI Parallel version %d ranks (%d for compute)\n\n",
no_of_rows,no_of_cols,IMAX,mpi_size,mpi_size-1);

tic();
double tprint=5;
for(;;){
computeargs x;
int rank;
if(started>=mpi_size-1){
MPI_Status err;
computeargs *xp=malloc(xsize);
MPI_Recv(xp,xsize,MPI_BYTE,
MPI_ANY_SOURCE,0,MPI_COMM_WORLD,&err);
image[xp->this_row]=xp->raster;
double tnow=toc();
if(tnow>tprint){
printf("Rank %d finished raster %u (ETA %g seconds)\n",
err.MPI_SOURCE,
fflush(stdout);
tprint=tnow+5;
}
rank=err.MPI_SOURCE;
} else {
rank=started+1;
}
if(started<no_of_rows){
computeargs x = { started };
started++;
MPI_Send(&x,sizeof(computeargs),MPI_BYTE,
rank,0,MPI_COMM_WORLD);
} else {
computeargs x = { -1 };
MPI_Send(&x,sizeof(computeargs),MPI_BYTE,
rank,0,MPI_COMM_WORLD);
}
}
outimage(no_of_rows,no_of_cols,image);
double ttime=toc();
printf("\nTotal running time %g seconds.\n",ttime);
}

int main(int argc,char *argv[]){
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

if(xmax-xmin>ymax-ymin){
delta=(xmax-xmin)/RES;
} else {
delta=(ymax-ymin)/RES;
}
no_of_rows=(ymax-ymin)/delta;
no_of_cols=(xmax-xmin)/delta;
xsize=sizeof(computeargs)+sizeof(rgbtype)*no_of_cols;
char *p=getenv("SLURM_JOB_ID");
if(p) job_id=atoi(p);

if(mpi_size<2) {
printf("Please use at least 2 ranks!\n");
} else {
if(mpi_rank>0) rankn();
else rank0();
}
MPI_Finalize();
return 0;
}``````

Code: Select all

``````// newton_reduce.c
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>
#include <math.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <mpi.h>
int mpi_rank, mpi_size, job_id;

#define RES 2048
#define IMAX 100
static int no_of_rows,no_of_cols,xsize;
static double xmin=0, xmax=1, ymin=0, ymax=1;
static double delta;

typedef struct { unsigned char r,g,b; } rgbtype;

unsigned char clip256(double x){
double r=256*x;
if(r>255) return 255;
if(r<0) return 0;
return (unsigned char)r;
}

rgbtype ztorgb(complex z){
rgbtype rgb;
double theta=carg(z), r=cabs(z);
double h=3*theta/M_PI;
rgb.r=clip256(2-fabs(h));
rgb.g=clip256(fabs(h-1)-1);
rgb.b=clip256(fabs(h+1)-1);
if(r<1){
rgb.r=255-(255-rgb.r)*r;
rgb.g=255-(255-rgb.g)*r;
rgb.b=255-(255-rgb.b)*r;
} else {
rgb.r/=r; rgb.g/=r; rgb.b/=r;
}
return rgb;
}
void outimage(int m,int n,rgbtype image[m][n]){
char filename;
sprintf(filename,"image-%d.ppm",job_id);
FILE *fp=fopen(filename,"w");
fprintf(fp,"P6 %d %d 255\n",n,m);
for(int j=0;j<m;j++){
fwrite(image[j],n*sizeof(rgbtype),1,fp);
}
fclose(fp);
}

complex f(complex z) {
return z*z*z*z*z-1;
}
complex df(complex z) {
return 5*z*z*z*z;
}
complex phi(complex z) {
return z-f(z)/df(z);
}
rgbtype computergb(complex z){
for(int i=0;i<IMAX;i++){
complex w0=phi(z);
if(w0==z) break;
z=w0;
}
return ztorgb(z);
}
typedef struct { int this_row; complex a[]; } computeargs;

static double tic_time;
void tic() {
struct timeval tp;
gettimeofday(&tp,0);
tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
struct timeval tp;
gettimeofday(&tp,0);
return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

void rankn(){
tic();
int first_row=mpi_rank*no_of_rows/mpi_size;
int last_row=(mpi_rank+1)*no_of_rows/mpi_size;
int rows=last_row-first_row;
rgbtype slice[rows][no_of_cols];
for(int j=first_row;j<last_row;j++){
for(int k=0;k<no_of_cols;k++){
slice[j-first_row][k]=computergb(xmin+k*delta+I*(ymax-j*delta));
}
}
printf("Finished rank %d slice in %g seconds.\n",mpi_rank,toc());
fflush(stdout);
MPI_Send(slice,sizeof(rgbtype)*rows*no_of_cols,
MPI_BYTE,0,0,MPI_COMM_WORLD);
return;
}

void rank0(){
setrlimit(RLIMIT_STACK,
&(const struct rlimit)
{RLIM_INFINITY,RLIM_INFINITY});
printf(
"Compute Basin of Attraction for Newton's Method\n"
"Image size %d x %d with maximum of %d iterations\n"
"Simple MPI Parallel version %d ranks for compute\n\n",
no_of_rows,no_of_cols,IMAX,mpi_size,mpi_size-1);
fflush(stdout);

tic();
rgbtype image[no_of_rows][no_of_cols];
{
int last_row=1*no_of_rows/mpi_size;
for(int j=0;j<last_row;j++){
for(int k=0;k<no_of_cols;k++){
image[j][k]=computergb(xmin+k*delta+I*(ymax-j*delta));
}
}
}
printf("Finished rank 0 slice in %g seconds.\n",toc());
fflush(stdout);
for(int rank=1;rank<mpi_size;rank++){
int first_row=rank*no_of_rows/mpi_size;
int last_row=(rank+1)*no_of_rows/mpi_size;
int rows=last_row-first_row;
MPI_Status err;
MPI_Recv(image[first_row],sizeof(rgbtype)*rows*no_of_cols,MPI_BYTE,
rank,0,MPI_COMM_WORLD, &err);
}
outimage(no_of_rows,no_of_cols,image);
double ttime=toc();
printf("\nTotal running time %g seconds.\n",ttime);
}

int main(int argc,char *argv[]){
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

if(xmax-xmin>ymax-ymin){
delta=(xmax-xmin)/RES;
} else {
delta=(ymax-ymin)/RES;
}
no_of_rows=(ymax-ymin)/delta;
no_of_cols=(xmax-xmin)/delta;
char *p=getenv("SLURM_JOB_ID");
if(p) job_id=atoi(p);

if(mpi_size<2) {
printf("Please use at least 2 ranks!\n");
} else {
if(mpi_rank>0) rankn();
else rank0();
}
MPI_Finalize();
return 0;
}``````

ejolson
Posts: 2482
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

In the previous post each of the three programs generated the same Newton fractal illustrating the basins of attraction for the five roots of the function z^5-1 in the complex plane [0,1]x[0,1]. I converted the output to portable network graphics format using the command

\$ pnmtopng <image-209.ppm >image-209.png

The resulting image looks like ejolson
Posts: 2482
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

In this post we explore the use of the super-cheap cluster for sorting. This time the goal is not parallel speedup, but to combine the distributed memory of multiple nodes to sort large datasets. Consider the task of sorting 200,000,000 four-byte integers. Since a dataset of that size takes 763MB of memory, it is not possible to store it in the 512MB RAM available on a single Pi Zero. While out-of-core sorting algorithms exist, they run much slower than the parallel algorithm described here.

The algorithm we shall employ is called odd-even sorting. Suppose there are N nodes in the cluster. First partition the data among those N nodes. Each node then sorts its part of the data in parallel. Finally, data between nodes is merged between even and odd nodes. After N merge operations the lowest MPI rank has the first part of the sorted data, the highest rank has the last part, with the rest of the data being stored in order on the ranks in between. Since N merge operations must be performed using 1/N the size of the data, there is no expected parallel speedup from this algorithm. However, there is no expected slowdown either. Also note that merge operations require twice the amount of memory as needed to store the entire dataset. The super-cheap cluster consists of five computational nodes with 512MB each. Since some of that memory is used for operating system tasks and overhead, there is about 480MB per node which can be devoted to sorting. Multiplying that by five yields a total of 2400MB available distributed memory. This is more than enough to accommodate the 1526MB needed to merge our 763MB dataset.

We begin by comparing our parallel algorithm to a serial program using a smaller dataset of 100,000,000 four-byte integers. This allows us to verify correctness of operation and relate performance. The result for the serial code is

Code: Select all

``````Perform serial sorting using qsort.
Sorting total 100000000 integers in serial.
Finished sorting in 155.168 seconds.

real    5m3.215s
user    2m35.130s
sys 0m16.610s``````
and the result for the parallel code is

Code: Select all

``````Perform odd-even sorting using MPI.
Sorting total 100000000 integers using 5 ranks.
Rank 4 finished sorting in 150.809 seconds.
Rank 2 finished sorting in 181.27 seconds.
Rank 0 finished sorting in 181.525 seconds.
Rank 3 finished sorting in 181.341 seconds.
Rank 1 finished sorting in 181.755 seconds.

real    5m1.880s
user    0m0.070s
sys 0m0.120s``````
The timings for only the sorting (neglecting file input and output) show that the parallel sort itself is about 20 percent slower than the serial code. However, after including file input and output the total wall time is almost exactly 5 minutes for either method. Since total wall time is what really matters to a person waiting for the program to finish, we conclude that the parallel and serial sorts take about the same amount of time.

Finally, we demonstrate the use of distributed memory by sorting a larger dataset that doesn't fit in the memory of a single Pi Zero.

Code: Select all

``````Perform odd-even sorting using MPI.
Sorting total 200000000 integers using 5 ranks.
Rank 4 finished sorting in 309.344 seconds.
Rank 0 finished sorting in 378.947 seconds.
Rank 2 finished sorting in 371.555 seconds.
Rank 3 finished sorting in 371.941 seconds.
Rank 1 finished sorting in 370.915 seconds.

real    10m28.655s
user    0m0.080s
sys 0m0.130s``````
Note that it takes 2.08 times longer to sort 200,000,000 integers than it took to sort 100,000,000 integers. Since

Code: Select all

``````       200000000*log(200000000)
------------------------ = 2.075
100000000*log(100000000)``````
This is consistent with the expected nlogn asymptotic performance of a reasonable sorting algorithm.

For reference the parallel odd-even sort program is

Code: Select all

``````/*  This code implements a parallel odd-even sort using MPI and is
based on the lecture notes by Ian Finlayson for the course

CPSC 425: Parallel Computing
University of Mary Washington
http://cs.umw.edu/~finlayson/class/fall14/cpsc425/syllabus.html

This code was modified by Eric Olson for the super-cheap cluster.
*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <fcntl.h>
#include <string.h>
#include <mpi.h>

typedef unsigned int uint;
typedef long unsigned int luint;

char *infn = "unsorted.dat";
char *outfn = "sorted.dat";
int mpi_rank, mpi_size, job_id;

luint totalsize;
uint *getinput(luint *n,char *fn){
int fd=open(fn,O_RDONLY);
if(fd==-1){
printf("Unable to open %s for read!\n",fn);
exit(1);
}
struct stat buf;
fstat(fd,&buf);
totalsize=buf.st_size;
if(totalsize%sizeof(uint)){
printf("File %s contains an incomplete integer!\n",fn);
exit(1);
}
totalsize/=sizeof(uint);
if(!mpi_rank){
printf("Sorting total %lu integers using %d ranks.\n",
totalsize,mpi_size);
}
luint i0=mpi_rank*totalsize/mpi_size;
luint iN=(mpi_rank+1)*totalsize/mpi_size;
luint bytes=(iN-i0)*sizeof(uint);
uint *data=malloc(bytes+sizeof(uint));
if(lseek(fd,i0*sizeof(uint),SEEK_SET)==-1){
printf("Error seeking %s to offset %lu!\n",fn,i0*sizeof(uint));
exit(1);
}
if(r!=bytes){
printf("Error short read of %lu instead of %lu!\n",r,bytes);
exit(1);
}
close(fd);
*n=iN-i0;
return data;
}
void putrank(int fd,int p,uint *data,char *fn){
luint i0=p*totalsize/mpi_size;
luint iN=(p+1)*totalsize/mpi_size;
luint bytes=(iN-i0)*sizeof(uint);
if(lseek(fd,i0*sizeof(uint),SEEK_SET)==-1){
printf("Error seeking %s to offset %lu!\n",fn,i0*sizeof(uint));
exit(1);
}
luint r=write(fd,data,bytes);
if(r!=bytes){
printf("Error short write of %lu instead of %lu!\n",r,bytes);
exit(1);
}
}
void putoutput(luint n,uint *data,char *fn){
int fd;
if(!mpi_rank){
fd=creat(fn,0644);
if(fd==-1){
printf("Unable to create %s for write!\n",fn);
exit(1);
}
putrank(fd,0,data,fn);
for(int p=1;p<mpi_size;p++){
MPI_Recv(data,n+1,MPI_INT,p,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
putrank(fd,p,data,fn);
}
close(fd);
} else {
MPI_Send(data,n,MPI_INT,0,0,MPI_COMM_WORLD);
}
}

int intcmp(const void* ap,const void* bp){
uint a=*((const uint*)ap);
uint b=*((const uint*)bp);
if(a<b){
return -1;
} else if(a>b){
return 1;
} else {
return 0;
}
}
static double tic_time;
void tic() {
struct timeval tp;
gettimeofday(&tp,0);
tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
struct timeval tp;
gettimeofday(&tp,0);
return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

uint *parallel_sort(luint n,uint *data,uint *other){
qsort(data,n,sizeof(int),intcmp);
for(int mpi=0;mpi<mpi_size;mpi++){
int p;
if(mpi%2){
if(mpi_rank%2) p=mpi_rank+1;
else p=mpi_rank-1;
} else {
if(mpi_rank%2) p=mpi_rank-1;
else p=mpi_rank+1;
}
if(p<0 || p>=mpi_size) continue;
luint p0=p*totalsize/mpi_size;
luint pN=(p+1)*totalsize/mpi_size;
luint m=pN-p0;

/*  do the exchange - even processes send first and odd processes
receive first this avoids possible deadlock of two processes
working together both sending */

if (mpi_rank%2){
MPI_Send(data,n,MPI_INT,p,0,MPI_COMM_WORLD);
MPI_Recv(other,m,MPI_INT,p,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
} else {
MPI_Recv(other,m,MPI_INT,p,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
MPI_Send(data,n,MPI_INT,p,0,MPI_COMM_WORLD);
}
if(mpi_rank<p){
luint i=0, j=n;
while(i<m && j>0){
if(other[i]<data[j-1]){
j--;
uint tmp=data[j]; data[j]=other[i]; other[i]=tmp;
i++;
} else break;
}
i=0; j=n-1;
for(luint k=0;k<n;k++){
if(data[i]<=data[j]) other[k]=data[i++];
else other[k]=data[j--];
}
} else {
luint i=m, j=0;
while(i>0 && j<n){
if(other[i-1]>data[j]){
i--;
uint tmp=data[j]; data[j]=other[i]; other[i]=tmp;
j++;
} else break;
}
i=0; j=n-1;
for(luint k=n;k>0;k--){
if(data[i]>=data[j]) other[k-1]=data[i++];
else other[k-1]=data[j--];
}
}
{ uint *tmp=data; data=other; other=tmp; }
}
return data;
}

int main(int argc,char* argv[]){
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&mpi_rank);
MPI_Comm_size(MPI_COMM_WORLD,&mpi_size);

setrlimit(RLIMIT_STACK,
&(const struct rlimit)
{RLIM_INFINITY,RLIM_INFINITY});

if(!mpi_rank){
printf("Perform odd-even sorting using MPI.\n");
}
char filename;
char *p=getenv("SLURM_JOB_ID");
if(p) {
job_id=atoi(p);
sprintf(filename,"sorted-%d.dat",job_id);
outfn=filename;
}
luint n;
uint *data=getinput(&n,infn);
uint *other=malloc((n+1)*sizeof(int));
tic();
data=parallel_sort(n,data,other);
double sorttime=toc();
printf("Rank %d finished sorting in %g seconds.\n",
mpi_rank,sorttime);
putoutput(n,data,outfn);

MPI_Finalize();
return 0;
}``````
the serial sorting routine is

Code: Select all

``````/*    This code uses qsort to sort a file of integers
*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <fcntl.h>
#include <string.h>

typedef unsigned int uint;
typedef long unsigned int luint;

char *infn = "unsorted.dat";
char *outfn = "sorted.dat";
int job_id;

uint *getinput(luint *n,char *fn){
int fd=open(fn,O_RDONLY);
if(fd==-1){
printf("Unable to open %s for read!\n",fn);
exit(1);
}
struct stat buf;
fstat(fd,&buf);
if(buf.st_size%sizeof(uint)){
printf("File %s contains an incomplete integer!\n",fn);
exit(1);
}
*n=buf.st_size/sizeof(uint);;
printf("Sorting total %lu integers in serial.\n",*n);
uint *data=malloc(buf.st_size);
if(r!=buf.st_size){
printf("Error short read of %lu instead of %lu!\n",r,buf.st_size);
exit(1);
}
close(fd);
return data;
}
void putoutput(luint n,uint *data,char *fn){
int fd;
fd=creat(fn,0644);
if(fd==-1){
printf("Unable to create %s for write!\n",fn);
exit(1);
}
luint bytes=n*sizeof(uint);
luint r=write(fd,data,bytes);
if(r!=bytes){
printf("Error short write of %lu instead of %lu!\n",r,bytes);
exit(1);
}
close(fd);
}
int intcmp(const void* ap,const void* bp){
uint a=*((const uint*)ap);
uint b=*((const uint*)bp);
if(a<b){
return -1;
} else if(a>b){
return 1;
} else {
return 0;
}
}
static double tic_time;
void tic() {
struct timeval tp;
gettimeofday(&tp,0);
tic_time=tp.tv_sec+tp.tv_usec*1.0e-6;
}
double toc() {
struct timeval tp;
gettimeofday(&tp,0);
return (tp.tv_sec+tp.tv_usec*1.0e-6)-tic_time;
}

uint *serial_sort(luint n,uint *data){
qsort(data,n,sizeof(int),intcmp);
return data;
}

int main(int argc,char* argv[]){
setrlimit(RLIMIT_STACK,
&(const struct rlimit)
{RLIM_INFINITY,RLIM_INFINITY});

printf("Perform serial sorting using qsort.\n");
char filename;
char *p=getenv("SLURM_JOB_ID");
if(p) {
job_id=atoi(p);
sprintf(filename,"sorted-%d.dat",job_id);
outfn=filename;
}
luint n;
uint *data=getinput(&n,infn);
tic();
data=serial_sort(n,data);
double sorttime=toc();
printf("Finished sorting in %g seconds.\n",sorttime);
putoutput(n,data,outfn);
return 0;
}
``````
and the program which creates the datafile to sort is

Code: Select all

``````#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define BUFSIZE 1024U

typedef unsigned int uint;
typedef long unsigned int luint;

char *outfn="unsorted.dat";
uint data[BUFSIZE];
luint n=200000000UL;

int main(int argc,char *argv[]){
int fd=creat(outfn,0644);
if(fd==-1){
printf("Unable to open %s for write!\n",outfn);
exit(1);
}
uint dn=BUFSIZE;
while(n>0){
if(dn>n) dn=n;
for(uint i=0;i<dn;i++) data[i]=rand();
luint bytes=dn*sizeof(uint);
luint r=write(fd,data,bytes);
if(r!=bytes){
printf("Short write of %lu instead of %lu!\n",r,bytes);
exit(1);
}
n-=dn;
}
close(fd);
return 0;
}``````

abymmathew
Posts: 3
Joined: Tue Oct 02, 2018 4:10 am

### Re: Super-cheap Computing Cluster for Learning

Have you tried installing docker swarm on it. I have and it hangs when replicating images across the nodes.

ejolson
Posts: 2482
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

abymmathew wrote:
Tue Oct 02, 2018 4:12 am
Have you tried installing docker swarm on it. I have and it hangs when replicating images across the nodes.
That's curious.

Have you looked at this blog and the links it contains? There it clearly indicates the possibility of running a Docker swarm on a cluster of USB networked Pi Zero computers. I suspect they are each running from local SD cards rather than mounting root from NFS as in this tutorial.

If you are following this tutorial exactly, it is worth noting the individual compute nodes are completely isolated from the Internet. If your Docker configuration needs to contact an Internet-hosted server while running, you should further setup the host computer as a masquerade firewall. This should prevent the Zeros from hanging when they try to access the Internet.

abymmathew
Posts: 3
Joined: Tue Oct 02, 2018 4:10 am

### Re: Super-cheap Computing Cluster for Learning

thanks for quick reply. Will look into the pointers you provided. Very informative article.

abymmathew
Posts: 3
Joined: Tue Oct 02, 2018 4:10 am

### Re: Super-cheap Computing Cluster for Learning

Docker on NFS won't work. Reference, https://blog.alexellis.io/the-state-of- ... pberry-pi/

However OTG network is sometimes choppy and SSH over it fails often. Using Mosh helped, https://mosh.org/#techinfo

ejolson
Posts: 2482
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

abymmathew wrote:
Tue Oct 09, 2018 1:04 am
Docker on NFS won't work. Reference, https://blog.alexellis.io/the-state-of- ... pberry-pi/

However OTG network is sometimes choppy and SSH over it fails often. Using Mosh helped, https://mosh.org/#techinfo
Sorry to hear you are having trouble with your network of USB Ethernet gadgets. For the record, the super-cheap cluster described in this thread has no such problem. In particular ssh sessions are stable and I can run MPI applications which are fairly network intense.

It should be noted that I'm still running a 4.9.59 kernel because of bugs in the 4.14.x series which affect the use of jumbo packets in the Ethernet gadgets.

I would also point out that for the configuration described in this thread, the NFS server is backed by a BTRFS copy-on-write filesystem. While the ability to create snapshots is not exported through NFS, the resulting subvolumes are. Therefore, Docker containers could be efficiently created on the head node using the copy-on-write semantics and then executed on one of the attached Zero compute nodes.

macbug
Posts: 10
Joined: Thu Dec 04, 2014 8:50 pm

### Re: Super-cheap Computing Cluster for Learning

I find this to be a way to complex way of benchmarking. I know nothing about benchmarking so dont get me wrong. This aint for me, i just wanna kode and have the computitations spread over several computers.

ejolson
Posts: 2482
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

macbug wrote:
Mon Jan 28, 2019 8:09 pm
I find this to be a way to complex way of benchmarking. I know nothing about benchmarking so dont get me wrong. This aint for me, i just wanna kode and have the computitations spread over several computers.
Sounds good. I reported how long it takes to execute each example program because it is rather easy to spread a computation over several machines and then have it take longer than using only one. Just like adding manpower to a late software project makes it later, adding processors to a lengthy calculation can also slow things down. This parallel processing analogue of Brooks' Law was amusingly expressed by Seymour Cray as "if you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?" The new Summit super-computer at Oak Ridge National Laboratory uses 2,397,824 oxen and they are worse than the chickens.

The difficulty of obtaining parallel speedup in practice when solving a problem using a cluster is why it is important to have real hardware to test your code on. The fact that anyone can purchase 1993 levels of super-computing performance for the price of a nice meal for two means almost anyone with an interest can begin understanding the constraints of synchronization latency and communication bandwidth that still affect parallel algorithms today. Developing practical code that works well on clusters, though difficult, is now possible without a hundred-million-dollar budget.

If you don't want to include timing routines in your software as I have, you can either use the built-in time command to check how fast your program finished or a hand-held stopwatch. Having said this, many scientific codes as well as games have either built-in performance metrics or a set of regression tests to check what the current computational speeds are. This is useful, for example, when scheduling time on bigger clusters to solve problems sized to the limits of current technology.

It will be interesting to see what kind of parallel programs you code for your cluster. I've been thinking about creating a version of the shared-memory parallel Fibonacci programs studied in this post (see also here and here) that uses a message-passing approach to run on a distributed-memory cluster. As indicated in those posts, the SMP program ran 4 times faster using 5 processors. It would be a significant achievement, in my opinion, to obtain even a three-fold performance increase when using the 5-node super-cheap cluster.

ejolson
Posts: 2482
Joined: Tue Mar 18, 2014 11:47 am

### Re: Super-cheap Computing Cluster for Learning

ejolson wrote:
Wed Jan 30, 2019 12:16 am
I've been thinking about creating a version of the shared-memory parallel Fibonacci programs studied in this post (see also here and here) that uses a message-passing approach to run on a distributed-memory cluster. As indicated in those posts, the SMP program ran 4 times faster using 5 processors. It would be a significant achievement, in my opinion, to obtain even a three-fold performance increase when using the 5-node super-cheap cluster.
There is now available here an MPI-parallel Fibonacci code that can run on the super-cheap cluster. The parallel speedup as measured by wall time did not quite reach 3, but was 2.718 instead. Since there are 5 nodes this translates to a parallel efficiency of 54 percent.

As the total calculation time was around 9 seconds, the size of the problem may not be large enough to clearly offset the overhead of starting multiple communicating ranks across the cluster. I'm my opinion, it is likely that computing a larger Fibonacci number could achieve efficiencies as great as 80 percent.

energyi
Posts: 98
Joined: Tue Mar 24, 2015 9:39 pm

### Re: Super-cheap Computing Cluster for Learning

Great thread ejolson et al. Thanks.
energyi