User avatar
teh_orph
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
Contact: Website

LLVM backend for QPU development

Wed May 14, 2014 10:43 pm

Howdy. Ignoring all the drama, who is actually after a high-level GPU processing API for the RPI?

If it's just to tick a box saying "open source OpenCL, woo yeah!" that doesn't count. Running an OpenCL benchmark doesn't count. Mining Bitcoin doesn't count. Is there a real task which people would like to do on the device, which they believe the RPI's GPU could meaningfully accelerate? Or, if you were seriously interested in performance, would you just run the task on your PC? Would it be heavy lifting for a Raspberry Pi that isn't near a PC? Now if it were a programming learning tool for people to mess around with, then that I could get behind...

I'm interested because I have a near-working LLVM back-end for the QPUs, and need input from people who might use it to guide its development. An LLVM back-end can technically plug into any LLVM front-end, meaning you can use any supported programming language. This back-end then connects to a QPU assembler I wrote, allowing you to run (eg) C++ on the GPU's QPUs. Clang, the C/C++ compiler, has a simple compatibility mode for OpenCL kernels. Here's a basic kernel, turned into a bucket-load of *non-working* QPU assembly:

Code: Select all

__kernel void test(__global int *inA, __global int *out) {
        *out = 3 * *inA;
}

Code: Select all

~/llvm$ clang test.cl -x cl -S -emit-llvm -o test.ll
~/llvm$ ~/llvm/cmake_debug_build/bin/llc test.ll -march=qpu -O3 -o -
	.file	"test.ll"
	.text
	.globl	test
	.align	2
	.type	test,@function
	.cfi_startproc                  // @test
// BB#0:
	add	ra30, ra30, -8
$tmp3:
	.cfi_def_cfa_offset 8
	store_word	ra29, ra30, 4   // 4-byte Folded Spill
$tmp4:
	.cfi_offset ra29, -4
	add	ra29, ra30, rda_nop
$tmp5:
	.cfi_def_cfa_register ra29
	add	vpm_ld_addr, ra29, 8
	or	wra_nop, vpm_ld_wait, 0
	il	vpm_ld_setup, 1234
	add	ra0, rda_vpm_dat, 0
	add	vpm_ld_addr, ra0, 0
	add	acc0, rda_nop, 3
	or	wra_nop, vpm_ld_wait, 0
	il	vpm_ld_setup, 1234
	add	ra0, rda_vpm_dat, 0
	mul24	acc0, ra0, acc0
	add	vpm_ld_addr, ra29, 12
	or	wra_nop, vpm_ld_wait, 0
	il	vpm_ld_setup, 1234
	add	ra0, rda_vpm_dat, 0
	store_word	acc0, ra0, 0
	add	ra30, ra29, rda_nop
	add	ra29, ra30, 4           // 4-byte Folded Reload
	add	ra30, ra30, 8
	bla	wra_nop, wrb_nop, ra31
$tmp6:
	.size	test, ($tmp6)-test
	.cfi_endproc
Helpful comments please :)

Twinkletoes
Posts: 210
Joined: Fri May 25, 2012 9:44 pm

GPU Processing API - 2

Thu May 15, 2014 12:40 am

who is actually after a high-level GPU processing API for the RPI?
Me.

Ideas:

Facial (or gender/age) recognition.
Audio mixer with DSP per channel.

User avatar
teh_orph
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
Contact: Website

Re: GPU Processing API

Thu May 15, 2014 7:03 am

jamesh wrote:That's good work - I have forwarded it to someone who mentioned this sort of thing to me the other day...
Thanks. But lots of work still to go! Can I ask: when you program these devices, do you do it in assembly or in some higher-level language?
Twinkletoes wrote:
who is actually after a high-level GPU processing API for the RPI?
Facial (or gender/age) recognition.
Audio mixer with DSP per channel.
Have you already written code to do this, which you would port over? Which language is it in? Are you happy with SIMD (not parallel!) programming?

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 24142
Joined: Sat Jul 30, 2011 7:41 pm

Re: GPU Processing API

Thu May 15, 2014 7:48 am

teh_orph wrote:
jamesh wrote:That's good work - I have forwarded it to someone who mentioned this sort of thing to me the other day...
Thanks. But lots of work still to go! Can I ask: when you program these devices, do you do it in assembly or in some higher-level language?
For the scalar CPU, C; for vector code, assembler (I rarely do that) as there is no compiler for it.

The QPUs are all programmed in assembler, but very few people do this as the instruction set is a nightmare. I've certainly never programmed a QPU, and I only know of one person who has.

The vast vast majority of code running on the VC4 is C.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed. Here's an example...
“I think it’s wrong that only one company makes the game Monopoly.” – Steven Wright

User avatar
teh_orph
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
Contact: Website

Re: GPU Processing API

Thu May 15, 2014 9:32 am

jamesh wrote:but very few people do this as the instruction set is a nightmare
+1
The whole design is pretty whack tbh. I can just imagine pitching the architecture design. I'd like to compare it to previous designs and see where the next version goes. There must be some enormous die-space advantage to structuring it like this.

Anyway this does make the back-end design pretty tricky, as does its bias towards vertex and pixel shaders. I expect generic OpenCL to perform poorly due to the lack of scatter/gather load-stores from both VPM and main memory. If you don't touch memory or do conditional operations I'm sure those 24 GFLOPS are pretty attainable though. Generic 32x32-bit integer multiplications will be slow too - I think I'm gonna expose a stack of intrinsic operations to pull the efficiency up a bit!
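To give a flavour of the sort of intrinsic/lowering I mean: a full 32x32 multiply could be pieced together from multiplies whose inputs fit in 24 bits (the kind of thing mul24 can do) by splitting the operands into 16-bit halves. Plain-C sketch only, nothing to do with real back-end output:

Code: Select all

#include <stdint.h>

/* Low 32 bits of a 32x32 multiply built only from multiplies whose inputs
   fit in 24 bits (so each one could map onto a mul24-style operation). */
static inline uint32_t mul32_via_24bit(uint32_t a, uint32_t b)
{
        uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
        uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;
        uint32_t lo    = a_lo * b_lo;                 /* 16x16: inputs fit in 24 bits */
        uint32_t cross = a_lo * b_hi + a_hi * b_lo;   /* contributes from bit 16 up */
        return lo + (cross << 16);                    /* a_hi*b_hi only affects bits >= 32 */
}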

My biggest win I think will be turning it from a 16-way scalar programming model to a 1-way 16-wide SIMD model (or perhaps 4-way 4-wide SIMD) but this will really put the onus on the programmer to make efficient use of each QPU.

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 24142
Joined: Sat Jul 30, 2011 7:41 pm

Re: GPU Processing API

Thu May 15, 2014 10:53 am

The QPUs were designed by some bloke here called Upton, E. So you can blame him for the wackiness. I think they were indeed designed for high performance, low die area (and hence low power consumption), but this Upton bloke knows much more about it than I do. All I know is that people avoid them like the plague unless they really REALLY need to use them.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed. Here's an example...
“I think it’s wrong that only one company makes the game Monopoly.” – Steven Wright

User avatar
PeterO
Posts: 5138
Joined: Sun Jul 22, 2012 4:14 pm

Re: GPU Processing API

Thu May 15, 2014 10:57 am

jamesh wrote:The QPU's were designed by some bloke here called Upton, E. So you can blame him on the wackiness. I think they were indeed designed for high performance, low die area (and hence low power consumption) but this Upton bloke knows much more about it that I do. All I know is that people avoid them like the plague unless they really REALLY need to use them.
I once showed this Upton bloke some OpenGL ES code I had written. He went all gooey-eyed and said something like "It's nice to see someone using my GPU" :-)

PeterO
Discoverer of the PI2 XENON DEATH FLASH!
Interests: C,Python,PIC,Electronics,Ham Radio (G0DZB),1960s British Computers.
"The primary requirement (as we've always seen in your examples) is that the code is readable. " Dougie Lawson

eman
Posts: 9
Joined: Wed Mar 19, 2014 10:23 pm

Re: GPU Processing API

Thu May 15, 2014 7:12 pm

An LLVM back-end would be pretty cool and pretty interesting. I was thinking about taking a crack at that but I haven't gotten much past reading the docs and cloning the bare minimum from one of the existing back-ends.

I honestly didn't think the QPUs were that bad. They have about the functionality I would expect from a 4-year-old mobile GPU, and they make sense in that context. There's stuff they're going to do well and stuff they're going to do very poorly (if at all). The trick is finding the things they do well and putting them into real, useful applications. GPU programming is usually not an end unto itself (unless it is for learning). Conveniently, some of the things GPUs are good at, like FFT and matrix maths, turn out to be broadly applicable in many areas and as building blocks for many interesting algorithms.

I think an OpenCL implementation would be cool but I'm not sure it's worth it. (I think a C back-end is, though). Part of the idea of OpenCL is that the code is (relatively) portable and, in theory, can be compiled and run on any machine with an OpenCL implementation but that may not be realistic in the case of the QPUs (not without serious performance implications). If you have to write "special" OpenCL compiled specially for the QPU, it sort of defeats the purpose.

In any case, I wouldn't get caught up on the host APIs; I'd focus instead on the individual algorithms to be accelerated, since what to accelerate, and how, is likely to be a case-by-case decision.

User avatar
teh_orph
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
Contact: Website

Re: GPU Processing API

Fri May 16, 2014 6:58 am

Yeah I def agree with you there. A target-specific C would be a winner. I think the biggest problems are the load/store system not supporting vector loads/stores, and the operations for manipulating the execution mask not being flexible enough if you want to treat it like 16 scalar processors.
I think I've got to just suck it up regarding the execution mask but the load/store thing is a bit tricky. Serialising a vector load into 16 separate loads (as I can't be sure that their addresses are sequential) would make touching memory dead slow. Sure I could make an intrinsic where you could say exactly how you want to read memory. But in C you touch memory all the time (eg the stack) so there needs to be a general/fallback solution for this too.
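To make that concrete, here's a plain-C model of the two cases (the names are mine and purely illustrative): the contiguous case could in principle become one VPM/DMA transfer, while the gather case has to be serialised into 16 separate loads.

Code: Select all

#include <stdint.h>

void load_contiguous(const int32_t *base, int32_t out[16])
{
        for (int lane = 0; lane < 16; lane++)
                out[lane] = base[lane];         /* sequential addresses: one burst, ideally */
}

void load_gather(const int32_t *base, const int32_t idx[16], int32_t out[16])
{
        for (int lane = 0; lane < 16; lane++)
                out[lane] = base[idx[lane]];    /* arbitrary addresses: 16 separate loads */
}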

Once these two are cracked, you should be able to run regular C. OpenCL and C are pretty similar from a semantic perspective, it's just that CL needs a bit of compile time help. For example all those built-in maths functions would need to be written! No way :)
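By "compile time help" I mean things along these lines - CL built-ins re-expressed as plain C that the back-end can already chew through (the names mirror the CL built-ins, but these are only sketches, not a real CL implementation):

Code: Select all

static inline int clamp_i(int x, int lo, int hi)
{
        return x < lo ? lo : (x > hi ? hi : x);
}

static inline int mad24_i(int a, int b, int c)
{
        return a * b + c;       /* maps naturally onto mul24 + add for 24-bit-safe inputs */
}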

eman
Posts: 9
Joined: Wed Mar 19, 2014 10:23 pm

Re: GPU Processing API

Sat May 17, 2014 4:13 pm

Yeah. If it makes it easier, I would consider exposing the 16-wide vector as the primitive type and let the programmer worry about handling scalar code. To get good performance out of it, the user is going to have to understand how it works and restructure their algorithm. Just having the compiler do instruction scheduling and register allocation (and the optimizations that go along with those stages) would be a big win over writing assembly by hand even if we still have to worry about making it explicitly parallel.
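Something like the Clang/GCC vector extension could be one way to express that - the type and function here are purely illustrative, not a proposal for the actual back-end:

Code: Select all

/* One QPU-sized vector as the primitive type: 16 x 32-bit lanes. */
typedef int v16i __attribute__((vector_size(64)));

v16i madd16(v16i a, v16i x, v16i y)
{
        return a * x + y;       /* one expression = one 16-wide operation per ALU op */
}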

User avatar
teh_orph
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
Contact: Website

Re: GPU Processing API

Sun May 25, 2014 11:11 pm

Just an update to say this is still on-going.

Previously I was limiting myself to the accumulators for nearly all operations, and only using the two register files for my procedure-calling system. Also, try as I might, I couldn't get LLVM to insert branch delay slots (even copying the MIPS code for the same thing... I must be missing some key setting). This also meant I couldn't insert delays after register usage, hence using just the accumulators. The memory subsystem - not being able to trigger a read or write atomically - was also too hard to model in LLVM. One word-load 'instruction' translates into at least six machine instructions, with plenty of implicit deps to track...

Anyway, I have upgraded my assembler to do MIPS-style automatic instruction reordering. Both register files and accumulators can be comfortably used with the code LLVM emits (once I expose ra/rb correctly) and the assembled code should now actually run. Load/stores work by having LLVM emit pseudo instructions which the assembler expands and reschedules.

I'm pretty close to running C on this thing now, wup wup!

User avatar
teh_orph
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
Contact: Website

Re: GPU Processing API

Tue May 27, 2014 5:48 pm

Quite excitingly it worked first time. Of course every single thing I try needs to have its assembly looked over carefully :-)
For those who are interested,

Code: Select all

extern "C" {
void entry(void)
{
        int *A = (int *)0x0888c000;
        for (int count = 0; count < 16; count++)
                A[count] = count - 15;
}
}
goes through Clang to make LLVM IR (O3, hence the unroll)

Code: Select all

define void @entry() #0 {
  store i32 -15, i32* inttoptr (i64 143179776 to i32*), align 16384, !tbaa !1
  store i32 -14, i32* inttoptr (i64 143179780 to i32*), align 4, !tbaa !1
  store i32 -13, i32* inttoptr (i64 143179784 to i32*), align 8, !tbaa !1
  store i32 -12, i32* inttoptr (i64 143179788 to i32*), align 4, !tbaa !1
  store i32 -11, i32* inttoptr (i64 143179792 to i32*), align 16, !tbaa !1
  store i32 -10, i32* inttoptr (i64 143179796 to i32*), align 4, !tbaa !1
  store i32 -9, i32* inttoptr (i64 143179800 to i32*), align 8, !tbaa !1
  store i32 -8, i32* inttoptr (i64 143179804 to i32*), align 4, !tbaa !1
  store i32 -7, i32* inttoptr (i64 143179808 to i32*), align 32, !tbaa !1
  store i32 -6, i32* inttoptr (i64 143179812 to i32*), align 4, !tbaa !1
  store i32 -5, i32* inttoptr (i64 143179816 to i32*), align 8, !tbaa !1
  store i32 -4, i32* inttoptr (i64 143179820 to i32*), align 4, !tbaa !1
  store i32 -3, i32* inttoptr (i64 143179824 to i32*), align 16, !tbaa !1
  store i32 -2, i32* inttoptr (i64 143179828 to i32*), align 4, !tbaa !1
  store i32 -1, i32* inttoptr (i64 143179832 to i32*), align 8, !tbaa !1
  store i32 0, i32* inttoptr (i64 143179836 to i32*), align 4, !tbaa !1
  ret void
}
goes through my back-end to make pseudo QPU asm

Code: Select all

	.file	"test.ll"
	.text
	.globl	entry
	.align	2
	.type	entry,@function
	.set	reorder                 // @entry
// BB#0:
	il	acc0, 143179776
	il	acc1, -15
	store_word	acc1, acc0, 0
	il	acc0, 143179780
	il	acc1, -14
<snip>
	store_word	acc1, acc0, 0
	il	acc0, 143179832
	il	acc1, -1
	store_word	acc1, acc0, 0
	il	acc0, 143179836
	il	acc1, 0
	store_word	acc1, acc0, 0
	bla	wra_nop, wrb_nop, lr
	.set	noreorder
$tmp0:
	.size	entry, ($tmp0)-entry
goes into my assembler, where the pseudos get turned into sequences of real instructions, the whole thing gets rescheduled, nops get inserted and everything gets assembled (note I've squirted in a tiny start-up operation to set the stack pointer (ra30) and end the program correctly)

Code: Select all

/*0888c000*/	/*start:*/	/*  */
/*0888c000*/	0x0888c038, 0xf0f009e7,	/* bra wra_nop, wrb_nop, 0888c038 ( 143179832 ) */
/*0888c008*/	0x0888cff0, 0xe0020780,	/* il ra30, 0888cff0 ( 143183856 ) */
/*0888c010*/	0x00000000, 0x10000820,	/* nop;  */
/*0888c018*/	0x00000000, 0x10000820,	/* nop;  */
/*0888c020*/	0x15000000, 0x30020820,	/* or acc0, acc0, acc0; ProgramEnd */
/*0888c028*/	0x15000000, 0x10020820,	/* or acc0, acc0, acc0;  */
/*0888c030*/	0x15000000, 0x10020820,	/* or acc0, acc0, acc0;  */
/*0888c038*/	/*entry:*/	/*  */
/*0888c038*/	0xfffffff1, 0xe0021840,	/* il acc1, fffffff1 ( -15 ) */
/*0888c040*/	0x0888c000, 0xe0021800,	/* il acc0, 0888c000 ( 143179776 ) */
/*0888c048*/	0x00000a00, 0xe0021c40,	/* il vpmvcd_wr_setup, 00000a00 ( 2560 ) */
/*0888c050*/	0x15000249, 0x10020c20,	/* or wra_vpm_dat, acc1, acc1;  */
/*0888c058*/	0xfffffff2, 0xe0021840,	/* il acc1, fffffff2 ( -14 ) */
/*0888c060*/	0x80814000, 0xe0021c40,	/* il vpmvcd_wr_setup, 80814000 ( -2139013120 ) */
/*0888c068*/	0x0c0001c7, 0xd0021ca0,	/* add vpm_st_addr, acc0, 00000000 (0);  */
/*0888c070*/	0x0888c004, 0xe0021800,	/* il acc0, 0888c004 ( 143179780 ) */
/*0888c078*/	0x15032fff, 0x100209e0,	/* or wra_nop, vpm_st_wait, vpm_st_wait;  */
/*0888c080*/	0x00000a00, 0xe0021c40,	/* il vpmvcd_wr_setup, 00000a00 ( 2560 ) */
/*0888c088*/	0x15000249, 0x10020c20,	/* or wra_vpm_dat, acc1, acc1;  */
/*0888c090*/	0xfffffff3, 0xe0021840,	/* il acc1, fffffff3 ( -13 ) */
/*0888c098*/	0x80814000, 0xe0021c40,	/* il vpmvcd_wr_setup, 80814000 ( -2139013120 ) */
/*0888c0a0*/	0x0c0001c7, 0xd0021ca0,	/* add vpm_st_addr, acc0, 00000000 (0);  */
<snip>
/*0888c390*/	0x00000a00, 0xe0021c40,	/* il vpmvcd_wr_setup, 00000a00 ( 2560 ) */
/*0888c398*/	0x15000249, 0x10020c20,	/* or wra_vpm_dat, acc1, acc1;  */
/*0888c3a0*/	0x00000000, 0xf0f7e9e7,	/* bra wra_nop, wrb_nop, ra31, 00000000 ( 0 ) */
/*0888c3a8*/	0x80814000, 0xe0021c40,	/* il vpmvcd_wr_setup, 80814000 ( -2139013120 ) */
/*0888c3b0*/	0x0c0001c7, 0xd0021ca0,	/* add vpm_st_addr, acc0, 00000000 (0);  */
/*0888c3b8*/	0x15032fff, 0x100209e0,	/* or wra_nop, vpm_st_wait, vpm_st_wait;  */
/*0888c3c0*/	/*$tmp0:*/	/*  */
and then it gets run live on the pi

Code: Select all

~/dma $ sudo ./mapper /dev/mem
phys addr 0888c000
mapped in at 0x888c000
status 00000600
status 00000600
status 00000700
status 00000700
0x888c000	fffffff1
0x888c004	fffffff2
0x888c008	fffffff3
0x888c00c	fffffff4
0x888c010	fffffff5
0x888c014	fffffff6
0x888c018	fffffff7
0x888c01c	fffffff8
0x888c020	fffffff9
0x888c024	fffffffa
0x888c028	fffffffb
0x888c02c	fffffffc
0x888c030	fffffffd
0x888c034	fffffffe
0x888c038	ffffffff
0x888c03c	00000000
Things which are working right now
  • 32-bit integer arithmetic and logic (excluding operations which expect carry-in/out)
  • integer multiplication is 24-bit
  • loading and storing of X words of data
  • 16-way conditional control flow (branches)
  • function calling and most stack operations
  • register, flag and branch delay slot hazards
  • out-of-order scheduling within a basic block to avoid inserting nops
  • load/store reordering
We're getting there. I'd like to fork this out into a new thread now...

mimi123
Posts: 583
Joined: Thu Aug 22, 2013 3:32 pm

Re: GPU Processing API

Thu May 29, 2014 7:12 am

Where is the source code of your LLVM backend?

User avatar
teh_orph
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
Contact: Website

Re: GPU Processing API

Thu May 29, 2014 8:44 am

I've not checked it in anywhere - originally I started coding from an LLVM source tar.gz, but I think I need to branch their git repo, and I don't know how that works as I don't think they use GitHub...
Anyway I've not got much motivation for that at the moment as my machine at work (where I sometimes work on stuff at lunch time) takes an age to link the shared libraries! SSDs appear to be a requirement.

Anyway the assembler (which forms a key part of my back end) can be found here.
https://github.com/simonjhall/qpu_assembler
Clone it at the same level as my disassembler; the assembler uses a shared header from the disassembler.
https://github.com/simonjhall/qpu
------------
Last night I discovered something new, which will def help me. Can anyone else who has done QPU stuff confirm this? Writes to accumulator 5 from the mul ALU pipe seem to replicate the first lane's value across the entire register. Reading it back will give the same result for *all 16 lanes*. Writing to it from the add pipe does what it says in the documentation, and broadcasts it in a 4x4 fashion. Each quad gets a different value, but every lane within a quad gets the same value.

This does not appear to be in the documentation...
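In other words, here's a tiny C model of what I'm seeing (the function names are mine, and I'm assuming it's the first lane of each quad that gets replicated on the add-pipe path):

Code: Select all

/* Observed mul-pipe write to acc5: lane 0 replicated to all 16 lanes. */
void acc5_write_mul_pipe(int acc5[16], const int src[16])
{
        for (int lane = 0; lane < 16; lane++)
                acc5[lane] = src[0];
}

/* Documented add-pipe write to acc5: per-quad broadcast, each quad of 4
   lanes sees that quad's first lane. */
void acc5_write_add_pipe(int acc5[16], const int src[16])
{
        for (int lane = 0; lane < 16; lane++)
                acc5[lane] = src[lane & ~3];
}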

ghans
Posts: 7878
Joined: Mon Dec 12, 2011 8:30 pm
Location: Germany

Re: GPU Processing API

Thu May 29, 2014 8:53 am

Simon , are you on the raspi-internals mailinglist ?
http://www.freelists.org/list/raspi-internals

I think people would be interested in your efforts.


ghans
• Don't like the board ? Missing features ? Change to the prosilver theme ! You can find it in your settings.
• Don't like to search the forum BEFORE posting 'cos it's useless ? Try googling : yoursearchtermshere site:raspberrypi.org

mimi123
Posts: 583
Joined: Thu Aug 22, 2013 3:32 pm

Re: GPU Processing API

Fri May 30, 2014 8:47 am

teh_orph wrote:I've not checked it in anywhere - originally I started coding from an LLVM source tar.gz, but I think I need to branch their git repo [...]

Last night I discovered something new, which will def help me. Can anyone else who has done QPU stuff confirm this? Writes to accumulator 5 from the mul ALU pipe seem to replicate the first lane's value across the entire register. [...]

This does not appear to be in the documentation...
It is in the HHH documentation.

For the LLVM backend, upload a tarball with all the source.

User avatar
teh_orph
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
Contact: Website

Re: GPU Processing API

Fri May 30, 2014 8:55 am

What's HHH?

ghans
Posts: 7878
Joined: Mon Dec 12, 2011 8:30 pm
Location: Germany

Re: GPU Processing API

Fri May 30, 2014 9:12 am

I guess he means Herman H. Hermitage, who runs a GitHub repo with lots of unofficial documentation on the GPU:

https://github.com/hermanhermitage/videocoreiv

ghans
• Don't like the board ? Missing features ? Change to the prosilver theme ! You can find it in your settings.
• Don't like to search the forum BEFORE posting 'cos it's useless ? Try googling : yoursearchtermshere site:raspberrypi.org

User avatar
teh_orph
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
Contact: Website

Re: GPU Processing API

Fri May 30, 2014 9:16 am

Ah I didn't know his middle name had an H in it too!
I've read some of his QPU stuff but I don't see anything on accumulator 5 and broadcast. Might you have a link or quote?

User avatar
teh_orph
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
Contact: Website

Re: GPU Processing API

Sat May 31, 2014 11:41 pm

Today's new gotcha - although the spec sheet says branches through a register use lane zero's value as the destination, this does not appear to be true! Hours wasted. (I'm still not sure which lane is required, but broadcasting lane zero to all lanes equals success)

In better news, the Fibonacci series now runs on the QPU :) This means both the stack and function calling are working.
On the other hand, I'm really hampered by the lack of an overflow flag. Signed and unsigned comparisons are the same at this point.
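(One well-known trick, assuming an unsigned less-than can be produced somehow: flip the sign bits so that signed ordering matches unsigned ordering. Plain-C sketch only, not something the back-end emits yet.)

Code: Select all

#include <stdint.h>

/* Signed a < b using only an unsigned compare: XOR with 0x80000000 maps
   signed ordering onto unsigned ordering. */
static inline int signed_lt(uint32_t a, uint32_t b)
{
        return (a ^ 0x80000000u) < (b ^ 0x80000000u);
}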

(defo need to fork this thread - can a mod help?)
EDIT: for those who care, it's the *last* lane, lane 15 which is used as the branch target

ghans
Posts: 7878
Joined: Mon Dec 12, 2011 8:30 pm
Location: Germany

Re: GPU Processing API

Sun Jun 01, 2014 8:06 am

I guess you need to PM a mod directly.

ghans
• Don't like the board ? Missing features ? Change to the prosilver theme ! You can find it in your settings.
• Don't like to search the forum BEFORE posting 'cos it's useless ? Try googling : yoursearchtermshere site:raspberrypi.org

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 24142
Joined: Sat Jul 30, 2011 7:41 pm

Re: GPU Processing API

Sun Jun 01, 2014 3:10 pm

At what point do you want to fork this thread, and what's the new name for it?
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed. Here's an example...
“I think it’s wrong that only one company makes the game Monopoly.” – Steven Wright

User avatar
teh_orph
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
Contact: Website

Re: GPU Processing API

Sun Jun 01, 2014 9:40 pm

Cheers James for the help :)
How about we start from my first post and go from there? http://www.raspberrypi.org/forums/viewt ... 00#p550710
Let's go for something like "LLVM backend for QPU development". I dunno which sub-forum to put it in though.
- it's not graphics
- it's not C/C++...though users would probably use that as their front-end (or OpenCL)
- it's not bare metal
- it's not other languages (more like any language)

This is where I think having a dedicated VideoCore forum would be handy - eg, as mentioned, I've found a number of gotchas that I wouldn't want to bury in a long sprawling thread. Dunno.

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 24142
Joined: Sat Jul 30, 2011 7:41 pm

Re: LLVM backend for QPU development

Fri Jun 06, 2014 12:17 pm

MOD: Just split this topic to here. If anyone has any suggestions for a better forum for this to be in please let me know by PM.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed. Here's an example...
“I think it’s wrong that only one company makes the game Monopoly.” – Steven Wright

User avatar
teh_orph
Posts: 346
Joined: Mon Jan 30, 2012 2:09 pm
Location: London
Contact: Website

Re: LLVM backend for QPU development

Sun Jun 08, 2014 8:33 pm

An update: the code generator has improved quite a bit since my last post.
- rewrite of conditional branches and conditional stores
- Fib now works in O3
- rework of references to symbols (I basically started with the MIPS backend which uses hi(symbol) + lo(symbol))
- removal of (MIPS) big endian targets
- load/store [rbXY+imm] support
- WAW hazards for conditional writes
- back-end lives on git: https://github.com/simonjhall/llvm_qpu

and the big news is that I can now build and run the C++ reference code for SHA-256 found here ( http://www.raspberrypi.org/forums/viewt ... 33&t=77231 ) on the QPU, with no modification to the algorithm!
I've not bothered to measure the performance as I expect it to be poor, but this is a pretty major step for the back-end.

Before I progress with adding features etc., I really need to make what I've got robust. But unfortunately I just can't think of any compact, 32-bit-integer-only algorithms which use no built-in or system functions...
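(As an illustration of the sort of thing that would fit the bill - compact, 32-bit integer only, no built-ins, easy to check against a host run - a xorshift PRNG is one candidate:)

Code: Select all

#include <stdint.h>

/* Marsaglia xorshift32: compact, 32-bit-integer-only, no library calls. */
uint32_t xorshift32(uint32_t x)
{
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        return x;
}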

Once I'm happy with what I've got I'll add 8- and 16-bit integer support then 32-bit float. I'll then add 2-to-16-way SIMD operations and perhaps expose VPM as a C++ address space. I'd like to stick the stack there (sort of like what you get on SPARC) but adding constant code to check for free space + DMA would add real bloat.

EDIT: added basic 32-bit FP maths support
