Whatever the choice for the next GPU is, it had better have complete openness as a requirement. It took over four years to get to the still incomplete alpha stage.
Full GPU documentation has been available for two years - http://www.broadcom.com/support/videocore
I'd also note that source for an EGL driver has been available for the same amount of time. Eric Anholt's driver is GL and will be better performing but it isn't required for you to play with the GPU to your heart's content.
I was actually working on a Metal/Vulkan-like thing for Raspberry Pi: an API that would allow you to construct command lists for V3D to give you direct bare metal GPU control. Didn't get very far and haven't touched it for a while, but what I've done so far is available here - https://github.com/GregAC/rpi-v3d
It's got a Python script that generates some C for creating/disassembling command lists, and a small program that can disassemble command lists and QPU programs in memory (I could freeze Quake 3 in the debugger and disassemble the command list for the frame, including dumping the shader programs). Might pick it up again soon; I had grand plans to replace the Q3 rendering back end with a direct to V3D command list version.
No other mobile GPU has such documentation freely available (though I think AMD and Intel release similar stuff for their desktop parts).
Sadly the rest of VideoCore hasn't been opened up in the same way. Would be nice to get open VPU documentation.
If the GPU on the Pi were user programmable, this would help teach how to program the next generation of intelligent devices. It would also open up many embedded applications that are currently beyond the computing power of even the newest ARM CPUs.
It is; the documentation is available above. I know various people have been playing with GPGPU using the QPUs. I'd also note that whilst it's an interesting/useful thing to play with, using the QPUs doesn't suddenly make the Raspberry Pi massively more powerful.
Still, it compares favourably to the A53s. They can manage 4 32-bit floating point operations per cycle per core, so 4 cores at 1.2 GHz gives 19.2 GFLOPS peak (the actual practical max will be lower, as you need load/store/branch/increment operations along with all those floating point operations at the very least, plus cache misses etc). I remember VC4 V3D being quoted as 24 GFLOPS peak; with the bump from 250 MHz to 300 MHz that gives you 28.8 GFLOPS peak.
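
For anyone who wants to check the arithmetic, here's a rough sketch of those peak numbers in Python. The function and variable names are just mine for illustration, the 24 GFLOPS figure is the one I remember being quoted rather than something I've verified, and I'm assuming peak throughput scales linearly with clock.

```python
def peak_gflops(flops_per_cycle, cores, clock_ghz):
    """Theoretical peak: every core retires flops_per_cycle on every cycle."""
    return flops_per_cycle * cores * clock_ghz


# Cortex-A53 cluster: 4 single-precision ops per cycle per core, 4 cores, 1.2 GHz.
a53_peak = peak_gflops(flops_per_cycle=4, cores=4, clock_ghz=1.2)

# VC4 V3D: quoted peak at 250 MHz (remembered figure, treat as an assumption),
# scaled linearly to the 300 MHz clock.
v3d_peak_250 = 24.0
v3d_peak_300 = v3d_peak_250 * 300 / 250

print(f"A53 peak:          {a53_peak:.1f} GFLOPS")
print(f"V3D peak @ 250MHz: {v3d_peak_250:.1f} GFLOPS")
print(f"V3D peak @ 300MHz: {v3d_peak_300:.1f} GFLOPS")
print(f"V3D / A53 ratio:   {v3d_peak_300 / a53_peak:.2f}x")
```

These are peak figures only; neither side will sustain them in real code, so the ratio is a ceiling comparison rather than a benchmark.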