[raspi-internals] Re: 24 GFLOPS QPUs

  • From: Herman Hermitage <hermanhermitage@xxxxxxxxxxx>
  • To: "raspi-internals@xxxxxxxxxxxxx" <raspi-internals@xxxxxxxxxxxxx>
  • Date: Tue, 30 Jul 2013 13:48:44 +1200

I've put together a program that lets the user run a vertex and fragment shader 
and capture the QPU output.  It's at:
  https://github.com/hermanhermitage/videocoreiv-qpu/tree/master/qpu-sniff

Usage is:
./qpu-sniff --testgl <vsfilename> <fsfilename>

The assembly it emits is still raw (no names for the operations); I will fill 
the names in again shortly, once I have worked out which naming scheme to use.
Unlike the VPU, for the 3D hardware we know the names, because some debug 
builds of the blob contain the relevant strings.


For those who want to understand more, here is a sample (no blendfunc, 
depthfunc, colormask or stencilfunc):
----

fs/min.fs:
  uniform vec4 c1;
  uniform vec4 c2;
  void main(void) {
    gl_FragColor = min(c1, c2);
  }

('shader code' 1c50b140 88)
00000000: 15827d80 10020827  mov  A0, uniform;       nop;
00000002: 03827c00 40020867  fmin A1, uniform, A0;   nop;             scoreboard-wait
00000004: 15827d80 10020827  mov  A0, uniform;       nop;
00000006: 03827c00 10020827  fmin A0, uniform, A0;   nop;
00000008: 95827d80 114258a0  mov  A2, uniform;       mov8 A0.8a, A0;
0000000a: 83827c89 11525860  fmin A1, uniform, A2;   mov8 A0.8b, A1;
0000000c: 95827d89 11625860  mov  A1, uniform;       mov8 A0.8c, A1;
0000000e: 03827c40 10020867  fmin A1, uniform, A1;   nop;
00000010: 809e7009 317059e0  nop;                    mov8 A0.8d, A1;  thread-end
00000012: 159e7000 10020ba7  mov  gl_FragColor, A0;  nop;
00000014: 009e7000 500009e7  nop;                    nop;             scoreboard-done

As mentioned before there is an ADD slot, a MUL slot and a CONTROL slot.

Apart from the triple issue of the ADD, MUL and CONTROL slots, most of the 
parallelism comes from running multiple threads.
E.g. notice that the fields of the gl_FragColor result are packed one after 
another rather than in parallel.

The code sequences are generated without branches.

Packing/unpacking (e.g. A0.8a) allows reference to the 8-bit parts (called 
a, b, c, d) of a 32-bit register.
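
To make the packing concrete, here is a rough Python model of what the first 
listing appears to compute: a per-component fmin followed by four mov8 packs 
into lanes 8a..8d of one 32-bit word. The lane order (8a as the lowest byte) 
and the float-to-byte conversion (clamp to [0, 1], scale by 255, round) are 
assumptions on my part, not confirmed hardware behaviour, and the function 
names are made up for illustration.

```python
def float_to_u8(x):
    """Assumed conversion: clamp to [0, 1], scale to 0..255, round to nearest."""
    x = min(max(x, 0.0), 1.0)
    return int(x * 255.0 + 0.5)


def pack_8888(components):
    """Pack four floats into one 32-bit word, lane 8a in bits 0..7 (assumed)."""
    word = 0
    for lane, value in enumerate(components):  # lanes a, b, c, d in order
        word |= float_to_u8(value) << (8 * lane)
    return word


def min_shader(c1, c2):
    """Model of gl_FragColor = min(c1, c2) followed by the 8-bit pack."""
    return pack_8888([min(a, b) for a, b in zip(c1, c2)])
```

For example, min_shader((1.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)) gives the 
components (0, 0, 1, 1), which this model packs as 0xffff0000.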

A0..A3 are accumulators (fast access).  This shader is too simple to use the 
A and B register banks.

A4 holds the result of the log2/exp2/recip/recipsqrt ops, or of a load of 
gl_FragColor from the tile buffer.

The scoreboard control is discussed in the patent at 
http://www.google.com/patents/US20110242113.  Essentially it is used to order 
the writes of overlapping fragments.


Here is the same sample with blendfunc (GL_ONE_MINUS_SRC_ALPHA, GL_ONE) and 
blend equation GL_FUNC_ADD:
----
00000000: 15827d80 10020827  mov A0, uniform;        nop;
00000002: 03827c00 40020867  fmin A1, uniform, A0;   nop;             scoreboard-wait
00000004: 15827d80 80020827  mov A0, uniform;        nop;             load-gl_FragColor
00000006: 03827c00 10020827  fmin A0, uniform, A0;   nop;
00000008: 95827d80 114258a0  mov A2, uniform;        mov8 A0.8a, A0;
0000000a: 83827c89 11525860  fmin A1, uniform, A2;   mov8 A0.8b, A1;
0000000c: 95827d89 11625860  mov A1, uniform;        mov8 A0.8c, A1;
0000000e: 03827c40 10020867  fmin A1, uniform, A1;   nop;
00000010: 809e7009 117059e0  nop;                    mov8 A0.8d, A1;
00000012: 159e7000 10020027  mov ra0, A0;            nop;
00000014: 009e7000 100009e7  nop;                    nop;
00000016: 17027d80 36020827  not A0, ra0.8dr;        nop;             thread-end
00000018: 60027006 100059e0  nop;                    mul8 A0, A0, ra0;
0000001a: 1e9e7100 50020ba7  adds8 gl_FragColor, A0, A4; nop;         scoreboard-done
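
The last three instructions of the blended listing can be read as 
src * (1 - src_alpha) + dst on packed 8-bit lanes. Below is a hedged Python 
sketch of that reading. It assumes that .8dr replicates lane d (alpha) into 
all four lanes, that mul8 is a per-lane multiply normalised by 255 with 
rounding, and that adds8 is a per-lane saturating add. The exact rounding of 
mul8 in particular is a guess, and the helper names are only illustrative.

```python
MASK32 = 0xFFFFFFFF


def lanes(word):
    """Split a 32-bit word into its four 8-bit lanes a, b, c, d (a = low byte)."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def from_lanes(values):
    """Rebuild a 32-bit word from four 8-bit lane values."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def replicate_8d(word):
    """Assumed meaning of the .8dr unpack: copy lane d into every lane."""
    d = (word >> 24) & 0xFF
    return from_lanes([d] * 4)


def mul8(x, y):
    """Assumed mul8: per-lane product scaled back to 0..255, rounded."""
    return from_lanes([(a * b + 127) // 255 for a, b in zip(lanes(x), lanes(y))])


def adds8(x, y):
    """Assumed adds8: per-lane add, saturating at 255."""
    return from_lanes([min(a + b, 255) for a, b in zip(lanes(x), lanes(y))])


def blend(src, dst):
    """not A0, ra0.8dr ; mul8 A0, A0, ra0 ; adds8 gl_FragColor, A0, A4."""
    inv_alpha = ~replicate_8d(src) & MASK32  # 255 - alpha in every lane
    scaled = mul8(inv_alpha, src)            # src * (1 - alpha)
    return adds8(scaled, dst)                # + dst (from the tile buffer)
```

With a fully opaque source (alpha lane 0xff) the scaled term is zero and the 
result is just the tile buffer colour, which is what GL_ONE_MINUS_SRC_ALPHA, 
GL_ONE would give.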


----------------------------------------

> - 3 Slices * 4 QPUs * (4+4)*250MHz -> 24 GFLOPS
This statement is misleading / wrong.
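
For reference, this is only how the quoted figure is arrived at, taking its 
terms at face value. It reproduces the arithmetic and does not say whether 
the hardware can sustain that rate. The variable names are mine.

```python
slices = 3
qpus_per_slice = 4
ops_per_qpu_per_cycle = 4 + 4   # the "(4+4)" term from the quoted line
clock_hz = 250_000_000          # 250 MHz

quoted_flops = slices * qpus_per_slice * ops_per_qpu_per_cycle * clock_hz
quoted_gflops = quoted_flops // 1_000_000_000  # 24
```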


Cheers
HH.                                       
