I've put together a program that lets the user run a vertex and fragment shader and capture the QPU output. It's at:
https://github.com/hermanhermitage/videocoreiv-qpu/tree/master/qpu-sniff

Usage is:
  ./qpu-sniff --testgl <vsfilename> <fsfilename>

The assembly it spits out is still raw (no names for operations). I will populate it again shortly, I'm just working out which way to go with the names. Unlike the VPU, for the 3D side we know the names, as some debug builds of the blob contain the relevant strings.

For those who want to understand more, here is a sample (no blendfunc, depthfunc, colormask or stencilfunc):

----
fs/min.fs:
uniform vec4 c1;
uniform vec4 c2;
void main(void) {
  gl_FragColor = min(c1, c2);
}

('shader code' 1c50b140 88)
00000000: 15827d80 10020827 mov A0, uniform; nop;
00000002: 03827c00 40020867 fmin A1, uniform, A0; nop; scoreboard-wait
00000004: 15827d80 10020827 mov A0, uniform; nop;
00000006: 03827c00 10020827 fmin A0, uniform, A0; nop;
00000008: 95827d80 114258a0 mov A2, uniform; mov8 A0.8a, A0;
0000000a: 83827c89 11525860 fmin A1, uniform, A2; mov8 A0.8b, A1;
0000000c: 95827d89 11625860 mov A1, uniform; mov8 A0.8c, A1;
0000000e: 03827c40 10020867 fmin A1, uniform, A1; nop;
00000010: 809e7009 317059e0 nop; mov8 A0.8d, A1; thread-end
00000012: 159e7000 10020ba7 mov gl_FragColor, A0; nop;
00000014: 009e7000 500009e7 nop; nop; scoreboard-done

As mentioned before, there is an ADD slot, a MUL slot and a CONTROL slot. Apart from the triple issue of the ADD, MUL and CONTROL slots, the majority of the parallelism comes from running across threads rather than from within one, e.g. notice the sequential packing of the fields of the gl_FragColor result. The code sequences are generated without branches.

Packing/unpacking (e.g. A0.8a) allows reference to the 8-bit parts (called a, b, c, d) of a 32-bit register. A0..A3 are accumulators (fast access). This shader is too simple to use the A and B register banks.
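To make the packing concrete, here is a rough Python sketch of what the four mov8 writes in the listing appear to do: each one converts a float result to an 8-bit value and drops it into one byte of the 32-bit destination. Two things here are my assumptions and not taken from the capture: that lane a is the least significant byte (with b, c, d following upwards), and that the float is clamped to [0, 1] and rounded to nearest. The function names are made up for illustration.

```python
# Assumed lane layout: 8a is the least significant byte, 8d the most significant.
LANE_SHIFT = {"8a": 0, "8b": 8, "8c": 16, "8d": 24}


def float_to_u8(x):
    """Clamp x to [0, 1] and scale to 0..255 (round-to-nearest is an assumption)."""
    x = min(max(x, 0.0), 1.0)
    return int(x * 255.0 + 0.5)


def mov8(dest_word, lane, value):
    """Model of 'mov8 A0.<lane>, value': overwrite one byte lane, keep the other three."""
    shift = LANE_SHIFT[lane]
    cleared = dest_word & ~(0xFF << shift) & 0xFFFFFFFF
    return cleared | (float_to_u8(value) << shift)


def pack_frag_color(x, y, z, w):
    """Pack four float components one lane at a time, as the listing does."""
    word = 0
    for lane, component in zip(("8a", "8b", "8c", "8d"), (x, y, z, w)):
        word = mov8(word, lane, component)
    return word


if __name__ == "__main__":
    print(hex(pack_frag_color(1.0, 0.0, 0.5, 1.0)))
```

The point of the sketch is only that the four components end up in one 32-bit word, written lane by lane, which is why the packing in the listing is sequential.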
A4 holds the result of the log2/exp2/recip/recipsqrt ops, or the result of a load of gl_FragColor from the tile buffer.

The scoreboard control is discussed in the patent at http://www.google.com/patents/US20110242113. Essentially it is used to order the writes of overlapping fragments.

Here is the same sample with a blendfunc (GL_ONE_MINUS_SRC_ALPHA, GL_ONE); GL_FUNC_ADD:

----
00000000: 15827d80 10020827 mov A0, uniform; nop;
00000002: 03827c00 40020867 fmin A1, uniform, A0; nop; scoreboard-wait
00000004: 15827d80 80020827 mov A0, uniform; nop; load-gl_FragColor
00000006: 03827c00 10020827 fmin A0, uniform, A0; nop;
00000008: 95827d80 114258a0 mov A2, uniform; mov8 A0.8a, A0;
0000000a: 83827c89 11525860 fmin A1, uniform, A2; mov8 A0.8b, A1;
0000000c: 95827d89 11625860 mov A1, uniform; mov8 A0.8c, A1;
0000000e: 03827c40 10020867 fmin A1, uniform, A1; nop;
00000010: 809e7009 117059e0 nop; mov8 A0.8d, A1;
00000012: 159e7000 10020027 mov ra0, A0; nop;
00000014: 009e7000 100009e7 nop; nop;
00000016: 17027d80 36020827 not A0, ra0.8dr; nop; thread-end
00000018: 60027006 100059e0 nop; mul8 A0, A0, ra0;
0000001a: 1e9e7100 50020ba7 adds8 gl_FragColor, A0, A4; nop; scoreboard-done
----------------------------------------

> - 3 Slices * 4 QPUs * (4+4)*250MHz -> 24 GFLOPS

This statement is misleading / wrong.

Cheers
HH.
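For anyone who wants to trace the last three instructions of the blend listing by hand, here is a small Python sketch of how I read them: not on the replicated alpha byte gives (1 - src.a) per channel, mul8 scales the source colour by it, and adds8 adds the destination colour loaded into A4 with saturation, i.e. src * (1 - src.a) + dst. The exact semantics are assumptions on my part: that .8dr replicates byte d (taken here as the top byte, holding alpha) into all four lanes, that mul8 treats each byte as a 0..1 fraction with round-to-nearest, and that adds8 saturates per byte at 255. The function names just mirror the mnemonics.

```python
def bytes_of(word):
    """Split a 32-bit word into four byte lanes, least significant first."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def word_of(lanes):
    """Rebuild a 32-bit word from four byte lanes, least significant first."""
    return sum((b & 0xFF) << (8 * i) for i, b in enumerate(lanes))


def not_8dr(word):
    """'not A0, ra0.8dr': replicate byte d (assumed top byte) and invert it."""
    alpha = (word >> 24) & 0xFF
    return ~(alpha * 0x01010101) & 0xFFFFFFFF


def mul8(x, y):
    """Per-byte multiply, bytes treated as fractions of 255 (rounding is an assumption)."""
    return word_of([(a * b + 127) // 255 for a, b in zip(bytes_of(x), bytes_of(y))])


def adds8(x, y):
    """Per-byte add, saturating at 255 (assumed)."""
    return word_of([min(a + b, 255) for a, b in zip(bytes_of(x), bytes_of(y))])


def blend(src, dst):
    """Mirror of instructions 00000016..0000001a; dst plays the role of A4."""
    a0 = not_8dr(src)      # 255 - src alpha in every lane
    a0 = mul8(a0, src)     # src * (1 - src alpha)
    return adds8(a0, dst)  # ... + dst, saturated


if __name__ == "__main__":
    print(hex(blend(0x80FF8040, 0x10203040)))
```

With a fully opaque source the first term vanishes and the destination comes through unchanged, which is a quick sanity check on the reading above.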