schamberlin
Posts: 25
Joined: Tue Sep 16, 2014 5:29 pm

RPi 3D triangles per second?

Mon Sep 29, 2014 12:15 am

Can anyone offer some general guidelines on what sort of 3D performance is possible on the Raspberry Pi, assuming a reasonably efficient OpenGL-based program written in C? I realize this is a very broad question with lots of "it depends on screen resolution, texture size, etc." answers, but some kind of rough real-world numbers for triangles per second or vertex or pixel fill rates would be helpful. The sources I've found so far cover a tremendous range:

"40 million shaded polygons per second" is mentioned a couple of times on the Raspberry Pi website. I assume "shaded" means lit but not textured, and this number is the peak under theoretical optimal conditions, and probably not something achievable in real-world programs.

I've also seen the original Xbox used as a similar-performing comparison. According to Wikipedia, the first Xbox could do 115 million vertices/sec, 932 megapixels/sec. Assuming 32-pixel triangles, that's about 29 million triangles per second. I'm unsure if that means shaded and textured triangles. With 1-pixel triangles and triangle strips, you could theoretically reach the vertex-bound limit of 115 million triangles/sec.

I know Quake 3 can run at 60+ fps on the RPi, but I'm not sure how many triangles are in a typical Quake 3 scene frame. Some Googling suggests 10K to 100K textured polygons per frame, which implies 600K to 6 million shaded triangles per second.
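The back-of-envelope arithmetic in the estimates above can be written down as a small C sketch. The function names are made up for illustration, and the inputs are just the figures quoted in this post (932 megapixels/sec, 32-pixel triangles, 10K to 100K polygons per frame at 60 fps), not measurements.

```c
#include <assert.h>

/* Fill-rate-bound estimate: pixels per second divided by pixels per triangle. */
static double tris_from_fill_rate(double pixels_per_sec, double pixels_per_tri)
{
    assert(pixels_per_tri > 0.0);
    return pixels_per_sec / pixels_per_tri;
}

/* Throughput implied by a per-frame triangle count at a given frame rate. */
static double tris_per_second(double tris_per_frame, double fps)
{
    assert(fps >= 0.0);
    return tris_per_frame * fps;
}
```

For example, tris_from_fill_rate(932e6, 32.0) gives roughly 29 million, and tris_per_second(10e3, 60.0) and tris_per_second(100e3, 60.0) give the 600K and 6 million figures for Quake 3.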

Roy Longbottom did some OpenGL benchmarks on the Raspberry Pi, and got about 360K shaded triangles/sec, and about 250K shaded and textured triangles/sec. http://www.raspberrypi.org/forums/viewt ... 82#p370482

I've started some simple 3D performance tests of my own, and so far I've achieved about 1.1 million shaded and textured triangles per second under semi-realistic conditions. I can push it to about 6.7 million triangles per second by using 4x4 textures and a 568 x 320 frame buffer - not so realistic conditions. How much further can I likely improve this?

dave j
Posts: 116
Joined: Mon Mar 05, 2012 2:19 pm

Re: RPi 3D triangles per second?

Tue Sep 30, 2014 1:24 pm

I don't think anyone will be able to give you any numbers because, as your post says, it's a very broad question with many "it depends".

I doubt Roy's tests provide a good measure of real world performance - they use very small batch sizes which kills performance.

The Pi's GPU, like most other mobile GPUs, is tile-based rather than immediate mode. These have different performance characteristics, so extrapolating performance figures from one to the other, as you did with the Xbox, won't produce good figures.

As far as your own code goes, you could look to make sure your code is optimised for tile-based GPUs rather than assuming what is best for a desktop GPU will also be best for the Pi. The OpenGL Insights book's chapter on Performance Tuning for Tile-Based Architectures provides a good guide to what to look for.

schamberlin
Posts: 25
Joined: Tue Sep 16, 2014 5:29 pm

Re: RPi 3D triangles per second?

Tue Sep 30, 2014 4:31 pm

Thanks for the link to the great doc on tile-based rendering! I'll definitely check that out.

I agree there's a huge amount of "it depends", but I'd rather have fuzzy numbers than no numbers at all. Maybe I should rephrase my original question as: can anyone offer performance numbers from their own Raspberry Pi 3D program, along with a short summary of the program conditions? Something like:

I get about X triangles/second (X fps with average of X triangles/frame) in my Pi-based 3D racing game. It runs at 1280 x 720, with eight moving car models, a static world map, and a 2D UI overlay. Each object has a single 512x512 texture and uses the same standard per-pixel lighting shader with one directional light source. No alpha blending or anti-aliasing. There's an average of about 80 draw calls per frame.

schamberlin
Posts: 25
Joined: Tue Sep 16, 2014 5:29 pm

Re: RPi 3D triangles per second?

Fri Oct 03, 2014 7:08 pm

I'm now getting between 4 million and 16 million textured tris/second, depending on the model drawn. I'd appreciate feedback on how this compares to other programs or other hardware platforms, since relative comparisons are probably more helpful than absolute numbers.

My test draws many copies of the same model in a grid layout on the screen. The models are positioned so there's minimal overdraw between adjacent objects. I tried several models ranging from 24 verts/12 tris to 65536 verts/58808 tris. All models have indexed verts. Each vertex has position XYZ, normal XYZ, and texture UV for a total of 32 bytes per vertex. Rendering is done with VBOs (vertex buffer objects) and glDrawElements. Each copy of the model is a separate draw call to OpenGL. Backface culling is on - glEnable(GL_CULL_FACE). Screen resolution is 1280 x 720 with no multi-sampling or anti-aliasing. Each model has a single 256 x 256 RGB texture with mipmaps. The shader is a basic per-pixel lighting shader (phong shading) with one directional light source. The RPi was running at the stock clock speed of 700 MHz.
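As a rough illustration of the 32-byte interleaved vertex described above, here is a minimal C sketch. The struct and helper names are hypothetical, and the use of 16-bit indices is an assumption (it is consistent with the largest model having 65536 verts, but the post doesn't say).

```c
#include <assert.h>
#include <stddef.h>

/* One interleaved vertex: position, normal, and texture coordinate as floats. */
typedef struct {
    float position[3];  /* XYZ, byte offset 0  */
    float normal[3];    /* XYZ, byte offset 12 */
    float texcoord[2];  /* UV,  byte offset 24 */
} Vertex;

/* The layout only matches the description if the whole vertex is 32 bytes. */
static_assert(sizeof(Vertex) == 32, "expected a 32-byte vertex");

/* Bytes of vertex data a model occupies in its VBO. */
static size_t vbo_bytes(size_t vertex_count)
{
    return vertex_count * sizeof(Vertex);
}

/* Bytes of index data, assuming 16-bit indices and three per triangle. */
static size_t index_bytes(size_t triangle_count)
{
    return triangle_count * 3 * sizeof(unsigned short);
}
```

With this layout, each glVertexAttribPointer call would use sizeof(Vertex) as the stride and offsetof(Vertex, normal) or offsetof(Vertex, texcoord) as the attribute offset. Under these assumptions the largest model needs about 2 MB of vertex data and about 345 KB of indices.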

I went a little overboard, and made a Raspberry Pi OpenGL ES performance testing program that can change every rendering parameter on the fly, to gauge its impact on performance, including screen resolution and the type of shader used. I'll post more about this program in a separate thread, since I think it may be helpful to others.

With no textures and per-vertex lighting (Gouraud shading) instead of per-pixel, I reached as high as 19 million tris/sec in one test. That's about half of the 40M/sec number for "shaded triangles" mentioned on the Raspberry Pi web site.

Absolute peak throughput was 27.3 million tris/sec with a very simple flat-colored shader, screen resolution of 568 x 320 (roughly standard-definition television resolution), and the camera pointed away from the objects. Since nothing was actually drawn on the screen, that number is presumably limited by vertex shader performance, and represents the upper bound of what can be achieved using a typical vertex structure consisting of a 3D position, a 3D normal vector, and a 2D texture coordinate.

jeanleflambeur
Posts: 157
Joined: Mon Jun 16, 2014 6:07 am

Re: RPi 3D triangles per second?

Fri Oct 03, 2014 9:23 pm

Can you try these things to see if there's any improvement?
- interleaved VBOs if they are not already
- compressed (ETC1) texture
- quantized tex coords (short or char) instead of float
- quantized positions
- try to discard the depth/stencil buffer with GL_EXT_discard_framebuffer.
All these should save bandwidth.
Can you post the vertex shader?

Also, what's the CPU usage?

schamberlin
Posts: 25
Joined: Tue Sep 16, 2014 5:29 pm

Re: RPi 3D triangles per second?

Fri Oct 03, 2014 11:54 pm

Thanks for the suggestions! I've posted the code at http://www.raspberrypi.org/forums/viewt ... 67&t=88424, so you can download it and make modifications if you'd like.
jeanleflambeur wrote: - interleaved VBOs if they are not already
They are already interleaved.
jeanleflambeur wrote: - compressed (ETC1) texture
I'm lacking a good example of how to create a compressed texture, or how to load one.
jeanleflambeur wrote: - quantized tex coords (short or char) instead of float
- quantized positions
I'll try these.
jeanleflambeur wrote: - try to discard the depth/stencil buffer with GL_EXT_discard_framebuffer.
I'm not sure I understand. The depth buffer is required for correct rendering - discarding it wouldn't be a fair test. There shouldn't be any stencil buffer, assuming I initialized the frame buffer correctly.
jeanleflambeur wrote: Can you post the vertex shader?
There are several choices of vertex shaders, which you can see if you download the code. The one for per-pixel lighting is this:

Code:

// input constants
uniform mat4 mvMatrix;
uniform mat4 mvpMatrix;

// input variables, different for each vertex
attribute vec4 vertPosition_modelspace;
attribute vec3 vertNormal_modelspace;
attribute vec2 vertTexCoord0;

// outputs
varying vec2 texCoord0;
varying vec3 position_viewspace;
varying vec3 normal_viewspace;

void main()
{
    // pass the texture coordinate unchanged
    texCoord0 = vertTexCoord0;

    // vertex position in viewspace is needed for lighting in the fragment shader
    position_viewspace = vec3(mvMatrix * vertPosition_modelspace);

    // normal vector in viewspace is also needed for lighting in the fragment shader
    // this math is only correct if mvMatrix has a uniform scale! Otherwise use its inverse transpose.
    normal_viewspace = vec3(mvMatrix * vec4(vertNormal_modelspace,0)); 

    // OpenGL needs the fully-projected vertex position for rendering the triangle
    gl_Position = mvpMatrix * vertPosition_modelspace;
}
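The shader comment about non-uniform scale refers to the usual "normal matrix": the inverse transpose of the upper-left 3x3 of the model-view matrix. Below is a minimal CPU-side sketch of how that could be computed in C before uploading it as a mat3 uniform. The function name is hypothetical and this is not taken from the posted program.

```c
#include <assert.h>

/* Write the inverse transpose of the 3x3 matrix m into out.
   Returns 1 on success, 0 if m is singular (out is left untouched).
   Inverse-transpose commutes with transposition, so the same code works
   whether the nine floats are stored row-major or column-major. */
static int normal_matrix(const float m[9], float out[9])
{
    assert(m && out);

    /* Cofactors of m; the inverse transpose is the cofactor matrix / det. */
    float c[9];
    c[0] = m[4] * m[8] - m[5] * m[7];
    c[1] = m[5] * m[6] - m[3] * m[8];
    c[2] = m[3] * m[7] - m[4] * m[6];
    c[3] = m[2] * m[7] - m[1] * m[8];
    c[4] = m[0] * m[8] - m[2] * m[6];
    c[5] = m[1] * m[6] - m[0] * m[7];
    c[6] = m[1] * m[5] - m[2] * m[4];
    c[7] = m[2] * m[3] - m[0] * m[5];
    c[8] = m[0] * m[4] - m[1] * m[3];

    /* Determinant by expansion along the first row. */
    float det = m[0] * c[0] + m[1] * c[1] + m[2] * c[2];
    if (det == 0.0f)
        return 0;

    for (int i = 0; i < 9; i++)
        out[i] = c[i] / det;
    return 1;
}
```

For a pure rotation the result equals the input, which is why the shader above can get away with using mvMatrix directly when the scale is uniform (the normal then only needs renormalizing).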
jeanleflambeur wrote: Also, what's the CPU usage?
If I'm interpreting top correctly, it's about 9.5% when rendering some of the high-poly models. More like 50% when rendering lots of low-poly models.

jeanleflambeur
Posts: 157
Joined: Mon Jun 16, 2014 6:07 am

Re: RPi 3D triangles per second?

Sat Oct 04, 2014 10:25 am

GL_EXT_discard_framebuffer is used to inform the driver that you don't care about preserving the contents of some framebuffer attachments - like depth and stencil - between frames. On tile based architectures this allows the driver to avoid writing the depth/stencil to memory and keep them in the fast tile memory. On PowerVR (iOS) this saves huge amounts of bandwidth.

I see that in the vertex shader you compute vertPosition_modelspace. What do you use it for?
I'm not familiar with the video core GPU, but on PowerVR the math is scalar and a mat4*vec4 is significant.

Try to replace this:
vec3(mvMatrix * vertPosition_modelspace);
with this:
vec3(mvMatrix * vec4(vertPosition_modelspace.xyz, 1.0));

You can compress to ETC using the PowerVR texture compressor or something like this: https://code.google.com/p/rg-etc1/
To upload to the gpu, replace this:
glTexImage2D(GL_TEXTURE_2D, mipmap, internalFormat, width, height, 0, pixelFormat, pixelType, data);
with this:
glCompressedTexImage2D(GL_TEXTURE_2D, mipmap, internalFormat, width, height, 0, dataSize, data);
where internalFormat is GL_ETC1_RGB8_OES (value == 0x8D64)

For the file format itself you can use either pvr (which can hold uncompressed, pvrtc, dxt1/3/5 and etc compressed data) or KTX:
https://www.khronos.org/opengles/sdk/to ... rmat_spec/

Regarding the interleaved VBO - what are your offsets for each attribute?
I imagine it's this:
position - offset 0, size 16
normal - offset 16, size 12
tex_coord - offset 28, size 8
This will prevent the positions from being aligned to 16 bytes. I don't know if it's relevant for VideoCore, but it's something to test.
When you try quantization make sure you align any float attribute at 4 bytes minimum.
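To make the alignment point concrete, here is a small C sketch that checks whether an attribute in an interleaved VBO lands on a given boundary for every vertex. The helper names are hypothetical. The numbers to try are the guessed layout above (stride 36, offsets 0/16/28) and the 32-byte layout from the earlier post (offsets 0/12/24).

```c
#include <assert.h>
#include <stddef.h>

/* Byte address of one attribute of one vertex, relative to the VBO start. */
static size_t attrib_address(size_t vertex_index, size_t stride, size_t offset)
{
    return vertex_index * stride + offset;
}

/* Returns 1 if the attribute is aligned to `alignment` bytes for every
   vertex. That holds exactly when both the stride and the attribute
   offset are multiples of the alignment. */
static int always_aligned(size_t stride, size_t offset, size_t alignment)
{
    assert(alignment != 0);
    return stride % alignment == 0 && offset % alignment == 0;
}
```

With a 36-byte stride the position of vertex 1 sits at byte 36, so always_aligned(36, 0, 16) is false. With a 32-byte stride the positions are 16-byte aligned but the normals at offset 12 are not. Every attribute in both layouts still meets the 4-byte minimum.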

/edit/ Just noticed this:
Each vertex has position XYZ, normal XYZ, and texture UV for a total of 32 bytes per vertex
You should match your vertex format with the attribute declaration, otherwise the driver will patch your shader and add code to convert between formats. Sometimes this is non-optimal.

Cheers.

schamberlin
Posts: 25
Joined: Tue Sep 16, 2014 5:29 pm

Re: RPi 3D triangles per second?

Mon Oct 06, 2014 4:51 pm

Thanks for pointing out the vec3/vec4 mismatch between my vertex format and the shader attributes. I've fixed this so it's vec3 everywhere now. That didn't seem to have any effect on performance.
jeanleflambeur wrote:I see that in the vertex shader you compute vertPosition_modelspace. What do you use it for?
I think you mean position_viewspace? It's part of the lighting calculation for the specular term. The fragment shader needs it to get the eye vector used for computing specular. It's a standard per-pixel lighting shader unless I'm doing something stupid.

I tried the other ideas you suggested, but unfortunately didn't see much improvement.

ETC1 compressed textures: This gave about a 6% frame rate improvement for the lower-poly models. None for high-poly models that aren't fill-bound on performance. If I disabled mipmapping the difference became more dramatic. I'm sure I could create a test where the gain from using compressed textures was bigger, but for "typical" rendering the benefit was fairly modest. This was with a 500 triangle model drawn at about 10% of the screen width, using mipmaps, and power-of-two texture sizes between 256 and 1024.

By the way, getting ETC1 texture loading working was a huge pain in the butt. I couldn't find any example code for loading the texture that wasn't Java/Android stuff, so in the end I wrote my own loader. I happened to begin testing with a 512 x 256 non-square texture, and my code shrank it to a 1 x 0 texture for the smallest mipmap. The 0 height of that mipmap seemed to make the driver reject my entire mipmap pyramid and render everything black, but without reporting any errors. Argh! Once I made sure mipmap dimensions never shrank below 1, it worked. My rasperf3d example has been updated with the new ETC1 loader if anybody is looking at it.
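For anyone hitting the same 1 x 0 problem, here is a minimal sketch of the mipmap sizing logic in C. It is not the rasperf3d loader, and the function names are made up. It relies only on the ETC1 format storing each 4x4 texel block in 8 bytes, with partial blocks rounded up.

```c
#include <assert.h>
#include <stddef.h>

/* Next mipmap dimension: halve, but never drop below 1. */
static int next_mip_dim(int d)
{
    return d > 1 ? d / 2 : 1;
}

/* Number of levels in a full mipmap chain that ends at 1x1. */
static int mip_level_count(int width, int height)
{
    assert(width > 0 && height > 0);
    int levels = 1;
    while (width > 1 || height > 1) {
        width = next_mip_dim(width);
        height = next_mip_dim(height);
        levels++;
    }
    return levels;
}

/* Size in bytes of one ETC1 mipmap level: 8 bytes per 4x4 block. */
static size_t etc1_level_bytes(int width, int height)
{
    assert(width > 0 && height > 0);
    return (size_t)((width + 3) / 4) * (size_t)((height + 3) / 4) * 8;
}
```

A 512 x 256 texture gets 10 levels, and the last two are 2 x 1 and 1 x 1 rather than 1 x 0. The value from etc1_level_bytes is what would be passed as the dataSize argument of glCompressedTexImage2D for each level.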

GL_EXT_discard_framebuffer: This had no noticeable effect. I'm not even sure it was working. I added the following lines immediately before my call to swap buffers:

Code:

const GLenum discards[]  = {GL_DEPTH_EXT, GL_STENCIL_EXT, GL_COLOR_EXT};
glBindFramebuffer(GL_FRAMEBUFFER, 0);
glDiscardFramebufferEXT(GL_FRAMEBUFFER, 3, discards);
Some more discussion of using this on the Raspberry Pi is here: http://www.raspberrypi.org/forums/viewt ... 68&t=11931

Quantized vertex data: I made positions, normals, and texture coordinates into shorts (2 bytes) instead of floats (4 bytes), but there was no performance improvement for the vertex-bound models. If anything, it was slightly slower. I guess it's limited by the computations in the vertex shader, and not by the memory bandwidth needed to move the vertex data. This makes sense, as the vertex data is static and is only ever uploaded to the GPU in a VBO once when the program starts.
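For reference, this is roughly what the quantization looks like on the CPU side, as a minimal C sketch. The names are hypothetical. It assumes values are pre-scaled into [-1, 1] (positions would need dividing by the model's bounding size first) and that the vertex shader decodes with value / 32767.0. The normalization rule GL applies when the normalized flag is set may differ slightly from that, so treat the scale factor as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Map a float in [-1, 1] to a signed 16-bit value, rounding to nearest.
   Out-of-range input is clamped; NaN is rejected. */
static int16_t quantize_snorm16(float x)
{
    assert(x == x); /* NaN never equals itself */
    if (x > 1.0f)
        x = 1.0f;
    if (x < -1.0f)
        x = -1.0f;
    float scaled = x * 32767.0f;
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Quantized interleaved vertex: 3 + 3 + 2 shorts. */
typedef struct {
    int16_t position[3];
    int16_t normal[3];
    int16_t texcoord[2];
} QuantizedVertex;

/* Half the size of the 32-byte float vertex. */
static_assert(sizeof(QuantizedVertex) == 16, "expected a 16-byte vertex");
```

This halves the vertex data fetched per vertex, which only helps if vertex fetch bandwidth is the bottleneck. The result above suggests it isn't for these models.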

16-byte alignment: I started to play with this, but I don't think it's possible. With interleaved vertices and 3-element positions and 3-element normals, there's no way to get all the positions and/or normals on 16-byte boundaries. Even if I used 4-element positions and normals, I'd have to pad the whole vertex structure from 40 bytes out to 48 bytes in order to make the alignment work.

I'm starting to think that the numbers I'm getting are pretty solid, and there are no 2X speedups hidden anywhere that I can find. If anyone's been able to coax better 3D performance out of the Raspberry Pi, I'd be interested to hear what techniques they used, and under what conditions. I think what I have now is a good base for making a Pi-based 3D game or demo.

toxibunny
Posts: 1382
Joined: Thu Aug 18, 2011 9:21 pm

Re: RPi 3D triangles per second?

Tue Oct 07, 2014 11:34 am

In terms of old PCs/consoles, what would you compare your Raspi's performance results to?
note: I may or may not know what I'm talking about...

PeterO
Posts: 4538
Joined: Sun Jul 22, 2012 4:14 pm

Re: RPi 3D triangles per second?

Tue Oct 07, 2014 12:10 pm

I've reported this thread (and your other one) for being in the wrong place. There is an openGLES specific sub-forum where they should be.
PeterO
Discoverer of the PI2 XENON DEATH FLASH!
Interests: C,Python,PIC,Electronics,Ham Radio (G0DZB),1960s British Computers.
"The primary requirement (as we've always seen in your examples) is that the code is readable. " Dougie Lawson

PeterO
Posts: 4538
Joined: Sun Jul 22, 2012 4:14 pm

Re: RPi 3D triangles per second?

Tue Oct 07, 2014 12:33 pm

Surely all this is pretty pointless, as it's going to be the complexity of your vertex and fragment shaders that ultimately determines the maximum frame rate you can achieve. And there's not much point in worrying about getting a frame rate faster than 60Hz anyway.

PeterO

schamberlin
Posts: 25
Joined: Tue Sep 16, 2014 5:29 pm

Re: RPi 3D triangles per second?

Tue Oct 07, 2014 3:33 pm

toxibunny wrote: In terms of old PCs/consoles, what would you compare your Raspi's performance results to?
I hesitated to make any such comparison, since there's so much "it depends", and most older systems also ran at substantially lower screen resolutions than the RPi typically does. Just a guess: I used to do game development for the Gamecube, Playstation 2, and the original Xbox - I'd say this feels roughly in the same ballpark.
PeterO wrote: Surely all this is pretty pointless, as it's going to be the complexity of your vertex and fragment shaders that ultimately determines the maximum frame rate you can achieve.
Yeah, I agree the raw tris/sec numbers I got are mostly pointless. What was more useful to me was measuring relative performance: exploring different ways of passing the data to OpenGL, and measuring the impact of the other render settings. For my use case, screen resolution and texture settings had a major impact, while other settings were not as significant. It's also interesting to measure things like per-pixel vs per-vertex lighting performance, and the approximate triangle counts where vertex and fragment shader work are roughly balanced.

mimi123
Posts: 583
Joined: Thu Aug 22, 2013 3:32 pm

Re: RPi 3D triangles per second?

Sun Mar 01, 2015 6:33 pm

You can write your own shaders in assembly for the VideoCore...
(it would be optimal if you manage it well)

See github.com/phire/hackdriver

PeterO
Posts: 4538
Joined: Sun Jul 22, 2012 4:14 pm

Re: RPi 3D triangles per second?

Sun Mar 01, 2015 7:06 pm

mimi123 wrote:You can write your own shaders in assembly for the VideoCore...
(it would be optimal if you manage it well)

See github.com/phire/hackdriver
Life is just too short to do everything the hard way !

PeterO
