mn416
Posts: 2
Joined: Wed Apr 20, 2016 7:52 pm

QPULib: a language and compiler for the QPUs

Thu Apr 21, 2016 8:58 pm

Hi all,

For anyone interested in using the Pi's GPU for general-purpose compute, the following may be worth a look:

https://github.com/mn416/QPULib

It's a programming language and compiler targeting the Pi's vector processors (QPUs). This has been a fun side-project for the past year and seems to have reached a somewhat useable state, although (as ever) it could be better in many ways. Hopefully the README gives enough info to understand what it is and how to use it.

Matt

pagenotfound
Posts: 68
Joined: Mon Mar 14, 2016 12:44 pm

Re: QPULib: a language and compiler for the QPUs

Sat Apr 23, 2016 3:41 pm

First of all, thanks a lot for making this available!

Going by your tutorial, it seems surprisingly easy to use. Can you comment on how this compares to using NEON instructions?

ejolson
Posts: 3421
Joined: Tue Mar 18, 2014 11:47 am

Re: QPULib: a language and compiler for the QPUs

Mon Apr 25, 2016 5:22 am

pagenotfound wrote:First of all, thanks a lot for making this available!

Going by your tutorial, it seems surprisingly easy to use. Can you comment on how this compares to using NEON instructions?
Update: Timings changed to use 2000 time steps, so that Pi 2B timings are now consistent with the updated README.

It looks like a great way of introducing vector processing. To further compare the vectorized speed, I downloaded the file HeatMapScalar.cpp from the test directory, changed the grid size to 512x512 and time steps to 2000 so they match the parameters used for the performance analysis in the README file. I compiled the program using gcc version 5.2 as

Code: Select all

$ g++ -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 \
    -mfloat-abi=hard -ffast-math -Wall \
    -o HeatMapSerial HeatMapSerial.cpp -lm
The program took 35.59 seconds to run on a single CPU core of the Pi 2B. Next I created a parallel version of the serial program using my ARM port of the MIT/Intel Cilk/Cilkplus parallel programming extensions to the C/C++ programming language by changing the outer loop of the time step algorithm to

Code: Select all

  cilk_for (int y = 1; y < height-1; y++) {
and including the file cilk/cilk.h at the beginning. Upon compiling with the command

Code: Select all

$ g++ -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 \
    -mfloat-abi=hard -ffast-math -Wall -fcilkplus \
    -o HeatMapCilk HeatMapCilk.cpp -lcilkrts -lm
the program took 10.63 seconds to run on 4 cores of the Pi 2B.

As reported in the README, the best time for the GPU vectorized code running 2000 time steps on a 512x512 grid was 20.36 seconds. Thus, the vectorized version appears to be about 2 times slower than the quad-core CPU of the Pi 2B for this particular problem.
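As a quick check of the ratios, here is a small sketch that recomputes them from the timings quoted in this post. The helper names `speedup` and `printComparison` are made up for illustration; the three numbers are the ones reported above and in the README.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// Ratio of a baseline run time to a comparison run time.
// A value above 1 means the comparison run was faster.
double speedup(double baselineSeconds, double otherSeconds) {
  assert(baselineSeconds > 0.0 && otherSeconds > 0.0);
  return baselineSeconds / otherSeconds;
}

// Print the ratios for the timings quoted in the post.
void printComparison() {
  const double serial = 35.59;  // one CPU core of the Pi 2B
  const double cilk   = 10.63;  // four cores of the Pi 2B via cilk_for
  const double qpu    = 20.36;  // best GPU time reported in the README

  std::printf("Cilk vs serial: %.2fx\n", speedup(serial, cilk));
  std::printf("Cilk vs QPU:    %.2fx\n", speedup(qpu, cilk));
}
```

Calling `printComparison()` from `main` shows the quad-core run a bit over three times faster than the single-core run, and a bit under twice as fast as the QPU time.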

It is possible that the tests included in the test directory are not comparable to the ones discussed in the README. In particular, the included code seems to compute a heat field consisting of zero boundary conditions with a few initial hot spots while the image included in the README shows non-zero boundary conditions. Any information on the exact test problem used for the GPU timings would be appreciated.
Last edited by ejolson on Thu Apr 28, 2016 7:26 pm, edited 2 times in total.

ejolson
Posts: 3421
Joined: Tue Mar 18, 2014 11:47 am

Re: QPULib: a language and compiler for the QPUs

Tue Apr 26, 2016 8:12 pm

Update: Timings changed to use 2000 time steps, so that Pi 2B timings are now consistent with the updated README.

I made further changes to HeatMapSerial.cpp so it compiles as valid C code and added the C99 restrict keyword to the pointers used in the time step loop. This resulted in 7.91 seconds running on the 4 cores of a Pi 2B. Thus, with a little optimization but still no hand-coded NEON intrinsics, the ARM CPU on the Pi 2B is about 2.5 times faster than the vectorized GPU code. Aside from teaching vector processing techniques, this apparently means the QPU library is more useful for improving performance on the Pi B, B+ and Zero models. On the Pi 3 it could also be useful for writing diagnostic software to test cooling systems and reliability.

Included is the modified source code for reference.

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <cilk/cilk.h>

// Heat dissipation constant
#define K 0.25

// ============================================================================
// Intel/MIT Cilk Parallel version
// ============================================================================

// One time step
void step(int width, int height,
    float (*restrict map)[height], float (*restrict mapOut)[height])
{
  cilk_for (int y = 1; y < height-1; y++) {
    for (int x = 1; x < width-1; x++) {
      float surroundings =
        map[y-1][x-1] + map[y-1][x]   + map[y-1][x+1] +
        map[y][x-1]   +                 map[y][x+1]   +
        map[y+1][x-1] + map[y+1][x]   + map[y+1][x+1];
      surroundings *= 0.125;
      mapOut[y][x] = map[y][x] - (K * (map[y][x] - surroundings));
    }
  }
}

// ============================================================================
// Main
// ============================================================================

int main()
{
  // Parameters
  const int WIDTH  = 512;
  const int HEIGHT = 512;
  const int NSPOTS = 10;
  const int NSTEPS = 2000;

  // Timestamps
  struct timeval tvStart, tvEnd, tvDiff;

  // Allocate
  float (*map2D)[HEIGHT] = malloc(HEIGHT * WIDTH * sizeof(float));
  float (*mapOut2D)[HEIGHT] = malloc(HEIGHT * WIDTH * sizeof(float));

  // Initialise
  memset(map2D, 0, HEIGHT * WIDTH * sizeof(float));
  memset(mapOut2D, 0, HEIGHT * WIDTH * sizeof(float));

  // Inject hot spots
  srand(0);
  for (int i = 0; i < NSPOTS; i++) {
    int t = rand() % 256;
    int x = 1 + rand() % (WIDTH-2);
    int y = 1 + rand() % (HEIGHT-2);
    map2D[y][x] = (float) 1000*t;
  }

  // Simulate
  gettimeofday(&tvStart, NULL);
  for (int i = 0; i < NSTEPS; i++) {
    step(WIDTH, HEIGHT, map2D, mapOut2D);
    float (*tmp)[HEIGHT] = map2D; map2D = mapOut2D; mapOut2D = tmp;
  }
  gettimeofday(&tvEnd, NULL);
  timersub(&tvEnd, &tvStart, &tvDiff);

  // Display results
  printf("P2\n%i %i\n255\n", WIDTH, HEIGHT);
  for (int y = 0; y < HEIGHT; y++)
    for (int x = 0; x < WIDTH; x++) {
      int t = (int) map2D[y][x];
      t = t < 0   ? 0 : t;
      t = t > 255 ? 255 : t;
      printf("%d\n", t);
    }
 
  // Run-time of simulation
  printf("# %ld.%06lds\n", tvDiff.tv_sec, tvDiff.tv_usec);

  return 0;
}
Compiled using

Code: Select all

$ gcc --version
gcc (GCC) 5.2.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ gcc -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard \
    -ffast-math -Wall -fcilkplus -o myheat myheat.c -lcilkrts -lm
Last edited by ejolson on Thu Apr 28, 2016 7:25 pm, edited 1 time in total.

mn416
Posts: 2
Joined: Wed Apr 20, 2016 7:52 pm

Re: QPULib: a language and compiler for the QPUs

Wed Apr 27, 2016 10:54 pm

@pagenotfound and @ejolson: thank you very much for the responses.

@pagenotfound: I don't have any comparison against NEON as yet.

@ejolson: Cilk is very cool. Nice work and very nice results!

It turns out I have made a slight mistake in the README. After
rerunning the HeatMap example it is apparent that my timings for the
vector versions are for 2000 iterations, not 1000. Also, my timing
for the scalar version was for 1500 iterations. Have
committed an update to the README: all timings now for 2000
iterations.

The heat map shown in the README is indeed produced using a slight
variant of the program in the Tests directory -- the QPULib kernel is
the same, but the way it is invoked is different, initialising the
boundary rather than random hot spots and emitting a PPM image instead
of a PGM image. However, the timings were taken using the program in
the Tests directory.

The only other thing I can really add is that I wouldn't be surprised
if I am missing a few tricks to get better performance from the QPUs,
especially w.r.t. memory, and to achieve better scaling. If so,
hopefully this will be worked out in due course.

ejolson
Posts: 3421
Joined: Tue Mar 18, 2014 11:47 am

Re: QPULib: a language and compiler for the QPUs

Thu Apr 28, 2016 7:39 pm

mn416 wrote:It turns out I have made a slight mistake in the README. After rerunning the HeatMap example it is apparent that my timings for the vector versions are for 2000 iterations, not 1000. Also, my timing for the scalar version was for 1500 iterations. Have committed an update to the README: all timings now for 2000 iterations.
Thanks for the details. I've updated the timings and commentary in my previous posts to reflect the change. A comparison with hand optimized NEON assembler would be interesting--especially for the Pi 3B. Optimal NEON timings for the Pi 3B would also require a heat sink and possibly a fan. In this case, the program should output the final heat field in floating point so it can be checked for the random numerical errors which have been reported to occur on some Pi 3B systems.

mic_s
Posts: 91
Joined: Sun Oct 26, 2014 4:15 pm

Re: QPULib: a language and compiler for the QPUs

Fri Jun 10, 2016 5:12 pm

ejolson wrote:the ARM CPU on the Pi 2B is about 2.5 times faster than the vectorized GPU code.
Your comparison between QPULib and Cilk is not fair. Consider the price points:

HeatMap example (512 x 512 samples, 2000 time steps)

(1) QPULib run time: 20 s (Zero GPU, price point: $5), 20 * 5 = 100 dollar-seconds
(2) Cilk run time: 8 s (Pi 2B, 4 x A7 cores, price point: $25), 8 * 25 = 200 dollar-seconds

So it's the other way around.

mic_s
.

eupton
Forum Moderator
Posts: 56
Joined: Sun Apr 15, 2012 7:28 pm

Re: QPULib: a language and compiler for the QPUs

Fri Jun 10, 2016 7:55 pm

Can you post some of the generated QPU code for the various examples? I'd like to trawl through and see if any tweaks spring to mind.

mic_s
Posts: 91
Joined: Sun Oct 26, 2014 4:15 pm

Re: QPULib: a language and compiler for the QPUs

Sat Jun 11, 2016 1:31 am

eupton wrote:Can you post some of the generated QPU code for the various examples?
All you need is :
https://github.com/mn416/QPULib
("HeatMap" is in https://github.com/mn416/QPULib/Tests)

and the „Getting started“
https://github.com/mn416/QPULib/blob/ma ... Started.md

The generated GPU-Assembler-Code you are asking for is shown with DEBUG=1

e.g.
make QPU=1 DEBUG=1 HeatMap
sudo ./HeatMap

A simple example:
#include <stdio.h>
#include <stdlib.h>
#include "QPULib.h"

// Kernel: element-wise sum of the vectors pointed to by a and b
void add(Ptr<Int> a, Ptr<Int> b, Ptr<Int> r)
{
  *r = *a + *b;
}

int main()
{
  // Construct kernel
  auto core = compile(add);

  // Allocate arrays shared between ARM and GPU
  SharedArray<int> a(16), b(16), r(16);

  // Initialise arrays
  srand(0);
  for (int i = 0; i < 16; i++) {
    a[i] = 100 + (rand() % 100);
    b[i] = 100 + (rand() % 100);
  }

  // Invoke the kernel
  core(&a, &b, &r);

  // Display the result
  for (int i = 0; i < 16; i++)
    printf("add(%i, %i) = %i\n", a[i], b[i], r[i]);

  return 0;
}


Step 1
Source Code is generated :
Source code
===========

v0 = UNIFORM;
v1 = UNIFORM;
v4 = UNIFORM;
v5 = UNIFORM;
v6 = UNIFORM;
*v6 = (*v4+*v5);
flush()
If (any(v0==0))
v7 = (v1-1);
v8 = 0;
While (any(v8<v7))
semaDec(15)
v8 = (v8+1);
End
hostIRQ()
Else
semaInc(15)
End

Step 2
QPU Assembler Code is generated :
Target code
===========

0: A0 <- or(S[QPU_NUM], S[QPU_NUM])
1: B0 <- -1879048188
2: A2 <- -1073676288
3: ACC1 <- -2146428928
4: A4 <- or(A0, ACC1)
5: B1 <- -2013179904
6: ACC1 <- shl(A0, 3)
7: B4 <- or(B1, ACC1)
8: ACC1 <- 1049088
9: B3 <- or(A0, ACC1)
10: ACC1 <- 1049120
11: S[WR_SETUP] <- or(A0, ACC1)
12: A0 <- or(S[UNIFORM], S[UNIFORM])
13: A1 <- or(S[UNIFORM], S[UNIFORM])
14: B1 <- or(S[UNIFORM], S[UNIFORM])
15: A3 <- or(S[UNIFORM], S[UNIFORM])
16: B2 <- or(S[UNIFORM], S[UNIFORM])
17: S[RD_SETUP] <- or(B0, B0)
18: S[RD_SETUP] <- or(A4, A4)
19: S[DMA_LD_ADDR] <- or(B1, B1)
20: LD2
21: S[RD_SETUP] <- or(B3, B3)
22: NOP
23: NOP
24: NOP
25: B1 <- LD4
26: S[RD_SETUP] <- or(B0, B0)
27: S[RD_SETUP] <- or(A4, A4)
28: S[DMA_LD_ADDR] <- or(A3, A3)
29: LD2
30: S[RD_SETUP] <- or(B3, B3)
31: NOP
32: NOP
33: NOP
34: A3 <- LD4
35: NOP
36: ACC1 <- add(B1, A3)
37: ST1(A) <- ACC1
38: S[WR_SETUP] <- or(A2, A2)
39: S[WR_SETUP] <- or(B4, B4)
40: S[DMA_ST_ADDR] <- or(B2, B2)
41: ST3
42: ST3
43: B0 <-{sf} sub(A0, 0)
44: if all(ZC) goto L0
45: NOP
46: NOP
47: NOP
48: A0 <- sub(A1, 1)
49: B0 <- 0
50: NOP
51: A1 <-{sf} sub(B0, A0)
52: if all(NC) goto L3
53: NOP
54: NOP
55: NOP
56: L2:
57: SDEC 15
58: ACC0 <- or(B0, B0)
59: B0 <- add(ACC0, 1)
60: NOP
61: B1 <-{sf} sub(B0, A0)
62: if any(NS) goto L2
63: NOP
64: NOP
65: NOP
66: L3:
67: IRQ
68: if always goto L1
69: NOP
70: NOP
71: NOP
72: L0:
73: SINC 15
74: L1:
75: END
76: NOP
77: NOP
78: NOP


You can see the NOPs filling the slots after jumps, and all those glorious settings needed to transport pacs in and out.

Target code up to line 25: load first pac (pac1) into B1
Target code up to line 34: load second pac (pac2) into A3
Target code up to line 36: add pac1, pac2
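For readers new to the vector model, here is a rough sketch in plain C++ of what the add kernel computes: one vector load per operand, one element-wise add, then one vector store. It assumes 16 lanes per vector, matching the 16-element arrays in the example. The names `Vec16`, `load`, `addLanes`, `store` and `addKernel` are made up for illustration and are not QPULib functions, and none of the DMA setup from the listing is modelled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 16;          // assumed vector width
using Vec16 = std::array<int, kLanes>;

// Stand-in for one vector load (target code up to line 25 or line 34).
Vec16 load(const int* src) {
  Vec16 v{};
  for (std::size_t i = 0; i < kLanes; ++i) v[i] = src[i];
  return v;
}

// Stand-in for the single add instruction at line 36: all lanes at once.
Vec16 addLanes(const Vec16& x, const Vec16& y) {
  Vec16 out{};
  for (std::size_t i = 0; i < kLanes; ++i) out[i] = x[i] + y[i];
  return out;
}

// Stand-in for the store sequence that follows the add.
void store(const Vec16& v, int* dst) {
  for (std::size_t i = 0; i < kLanes; ++i) dst[i] = v[i];
}

// Scalar model of the kernel body "*r = *a + *b".
void addKernel(const int* a, const int* b, int* r) {
  assert(a != nullptr && b != nullptr && r != nullptr);
  const Vec16 pac1 = load(a);
  const Vec16 pac2 = load(b);
  store(addLanes(pac1, pac2), r);
}
```

Most of the listing is setup for moving data in and out; the arithmetic itself is the single add.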

another example (walsh_float):

<skip>
#include "QPULib.h"
void walsh_float(Ptr<Float> a, Ptr<Float> b, Ptr<Float> r1, Ptr<Float> r2)
{
*r1 = *a+*b;
*r2 = *a-*b;
}
<skip>

Target code
===========

0: A0 <- or(S[QPU_NUM], S[QPU_NUM])
1: B0 <- -1879048188
2: A2 <- -1073676288
3: ACC1 <- -2146428928
4: A6 <- or(A0, ACC1)
5: B1 <- -2013179904
6: ACC1 <- shl(A0, 3)
7: B5 <- or(B1, ACC1)
8: ACC1 <- 1049088
9: B4 <- or(A0, ACC1)
10: ACC1 <- 1049120
11: S[WR_SETUP] <- or(A0, ACC1)
12: A0 <- or(S[UNIFORM], S[UNIFORM])
13: A1 <- or(S[UNIFORM], S[UNIFORM])
14: B1 <- or(S[UNIFORM], S[UNIFORM])
15: A3 <- or(S[UNIFORM], S[UNIFORM])
16: B2 <- or(S[UNIFORM], S[UNIFORM])
17: A4 <- or(S[UNIFORM], S[UNIFORM])
18: S[RD_SETUP] <- or(B0, B0)
19: S[RD_SETUP] <- or(A6, A6)
20: S[DMA_LD_ADDR] <- or(B1, B1)
21: LD2
22: S[RD_SETUP] <- or(B4, B4)
23: NOP
24: NOP
25: NOP
26: B3 <- LD4
27: S[RD_SETUP] <- or(B0, B0)
28: S[RD_SETUP] <- or(A6, A6)
29: S[DMA_LD_ADDR] <- or(A3, A3)
30: LD2
31: S[RD_SETUP] <- or(B4, B4)
32: NOP
33: NOP
34: NOP
35: A5 <- LD4
36: NOP
37: ACC1 <- addf(B3, A5)
38: ST1(A) <- ACC1
39: S[WR_SETUP] <- or(A2, A2)
40: S[WR_SETUP] <- or(B5, B5)
41: S[DMA_ST_ADDR] <- or(B2, B2)
42: ST3
43: S[RD_SETUP] <- or(B0, B0)
44: S[RD_SETUP] <- or(A6, A6)
45: S[DMA_LD_ADDR] <- or(B1, B1)
46: LD2
47: S[RD_SETUP] <- or(B4, B4)
48: NOP
49: NOP
50: NOP
51: B1 <- LD4
52: S[RD_SETUP] <- or(B0, B0)
53: S[RD_SETUP] <- or(A6, A6)
54: S[DMA_LD_ADDR] <- or(A3, A3)
55: LD2
56: S[RD_SETUP] <- or(B4, B4)
57: NOP
58: NOP
59: NOP
60: A3 <- LD4
61: NOP
62: ACC1 <- subf(B1, A3)
63: ST1(A) <- ACC1
64: S[WR_SETUP] <- or(A2, A2)
65: S[WR_SETUP] <- or(B5, B5)
66: S[DMA_ST_ADDR] <- or(A4, A4)
67: ST3
68: ST3
69: B0 <-{sf} sub(A0, 0)
70: if all(ZC) goto L0
71: NOP
72: NOP
73: NOP
74: A0 <- sub(A1, 1)
75: B0 <- 0
76: NOP
77: A1 <-{sf} sub(B0, A0)
78: if all(NC) goto L3
79: NOP
80: NOP
81: NOP
82: L2:
83: SDEC 15
84: ACC0 <- or(B0, B0)
85: B0 <- add(ACC0, 1)
86: NOP
87: B1 <-{sf} sub(B0, A0)
88: if any(NS) goto L2
89: NOP
90: NOP
91: NOP
92: L3:
93: IRQ
94: if always goto L1
95: NOP
96: NOP
97: NOP
98: L0:
99: SINC 15
100: L1:
101: END
102: NOP
103: NOP
104: NOP

details ?
contact : mn416

mic_s
.

Yggdrasil
Posts: 138
Joined: Sun Aug 26, 2012 8:45 pm

Re: QPULib: a language and compiler for the QPUs

Mon Jul 04, 2016 8:05 pm

Hello,

first of all: Great idea/library. :)
I've been wondering whether using multiple QPUs works, because I cannot produce any scaling effect. I've edited the Rot3D example and added some null operations to the rot3D_3 function:

Code: Select all

    Float xTmp = xOld * cosTheta - yOld * sinTheta;
    Float yTmp = yOld * cosTheta + xOld * sinTheta;
    // Some lines without effect to increase operations.
    xTmp = xTmp + xTmp;
    yTmp = yTmp + yTmp;
    xTmp = xTmp * 0.5f;
    yTmp = yTmp * 0.5f;
    // Multiple repeat of above lines
    [...]
    store(xTmp, p);
    store(yTmp, q);
This increases the total run time, but the number of QPUs used (the constant value was replaced by a parsed argument) made no difference. Is this a bug, or did I make a mistake?
System: RPi1 Model B.

Regards YggdrasiI

Yggdrasil
Posts: 138
Joined: Sun Aug 26, 2012 8:45 pm

Re: QPULib: a language and compiler for the QPUs

Tue Jul 05, 2016 8:55 am

Never mind, I simply forgot to compile with 'QPU=1'. Now it scales as expected. :)
I would suggest echoing the status of the flags in the Makefile, or adding this info to the README file.
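One way to make a missing 'QPU=1' visible at run time is for the program to report how it was built. The sketch below assumes the build defines a preprocessor macro named `QPU_MODE` when 'QPU=1' is given; that macro name is an assumption and should be checked against the Makefile. `buildMode` and `printBuildMode` are hypothetical helpers, not part of QPULib.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Returns a label for the build mode chosen at compile time.
// QPU_MODE is an assumed macro name; adjust it to whatever the
// Makefile actually passes with -D when QPU=1 is set.
const char* buildMode() {
#ifdef QPU_MODE
  return "QPU";
#else
  return "emulation";
#endif
}

// Print the build mode once at program start.
void printBuildMode() {
  const char* mode = buildMode();
  assert(mode != nullptr && std::strlen(mode) > 0);
  std::printf("Build mode: %s\n", mode);
}
```

Calling `printBuildMode()` at the top of `main` would show straight away if the binary was built without the QPU flag.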

boban_r
Posts: 1
Joined: Wed Feb 13, 2019 4:23 pm

how to display the number of QPUs on the screen?

Wed May 01, 2019 5:14 pm

I'm using your QPU library created on GitHub (https://github.com/mn416/QPULib) to run a sample code on Raspberry Pi 3B

I need to display the number of QPUs utilised by the GCD algorithm. I'm getting an error which seems to say that I'm performing an illegal type cast.

This is the function for GCD where I've added two lines to display the QPUs running the algorithm:

void gcd(Ptr<Int> p, Ptr<Int> q, Ptr<Int> r)
{
  int nqpu=numQPUs(); // this line causes the error
  cout<<"number of QPUS="<<nqpu<<endl; // this line causes the error
  Int a = *p;
  Int b = *q;
  While (any(a != b))
    Where (a > b)
      a = a-b;
    End
    Where (a < b)
      b = b-a;
    End
  End
  *r = a;
}

I've attached the screenshot of the error message

Could you tell me the return datatype of the numQPUs function? I believe it's a vector but I don't understand the difference between 'Int' and 'int'.

Thank you for your time.
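On the 'Int' versus 'int' question, the sketch below shows the general idea behind an embedded language like this one, under the assumption that kernel-side values such as 'Int' describe code to be run later on the QPUs rather than numbers the ARM side can read. `StagedInt`, `stagedNumQPUs`, `HostConfig` and `describe` are hypothetical stand-ins, not QPULib types or functions. If that assumption holds, a plain 'int' for printing has to come from the host side, for example from whatever value the host program used to configure the kernel.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for a kernel-side integer: it records the text of
// an expression to be compiled later, and holds no number the host can read.
struct StagedInt {
  std::string expr;
};

// Adding two staged values builds a bigger expression; nothing is computed.
StagedInt operator+(const StagedInt& a, const StagedInt& b) {
  return StagedInt{"(" + a.expr + "+" + b.expr + ")"};
}

// Hypothetical analogue of a query that only has a value on the device.
StagedInt stagedNumQPUs() { return StagedInt{"numQPUs()"}; }

// There is no conversion to int, so "int n = stagedNumQPUs();" would be
// rejected by the compiler, much like the cast error described in the post.
static_assert(!std::is_convertible<StagedInt, int>::value,
              "a staged value cannot be used as a host int");

// Host side: the QPU count is an ordinary int chosen before launch.
struct HostConfig {
  int numQPUs = 1;
};

// Printing works here because the value lives on the host.
std::string describe(const HostConfig& c) {
  return "number of QPUs=" + std::to_string(c.numQPUs);
}
```

In this model the print statement belongs in `main`, next to the code that chooses the QPU count, not inside the kernel function.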
Attachments
error.jpg (22.34 KiB)
