[raspi-internals] Re: 24 GFLOPS QPUs

From: Herman Hermitage <hermanhermitage@xxxxxxxxxxx>
To: "raspi-internals@xxxxxxxxxxxxx" <raspi-internals@xxxxxxxxxxxxx>
Date: Wed, 24 Jul 2013 14:09:39 +1200
> I'm going to start tinkering with the shader processor at:
>    https://github.com/hermanhermitage/videocoreiv-qpu
>
> It's in a separate repo in case there are any copyright issues - basically
> I'm going to document it based on a differential analysis, feeding the
> blob different inputs and capturing the outputs.
>
> My understanding is the outputs of a computer program are generally not
> copyrightable, as a program can't be considered an author of an artistic work.
>
> For anyone interested in contributing, patent "US20110227920 Method and
> System for a Shader Processor With Closely-Coupled Peripherals" is a good
> starting point.

[1] I've added a simple tool at 
https://github.com/hermanhermitage/videocoreiv-qpu/tree/master/qpu-sniff

It uses /opt/vc/bin/vcdbg to walk relocatable memory allocations on the
VideoCore side, searching for ones marked as GL related.

Whilst an OpenGL program is active, run it as:
$ ./qpu-scan --qpuscan

type = 'mem_strdup'
size = 108

type = 'GL20_PROGRAM_T.uniform_data'
size = 20
.......? 00000000 3f800000          0          1
.......@ bf800000 40000000         -1          2
...?.... 3f000000 8000000b        0.5 -1.541e-44

'shader code':
00000000: 009e7000 100009e7 ra=39, rb=39, adda=0, addb=0, mula=0, mulb=0, wa=39, wb=39, F=0, X=0, packbits=0x00; addop00<cc0> io39, A0, A0; mulop00<cc0> io39, A0, A0; op01
00000002: 009e7000 400009e7 ra=39, rb=39, adda=0, addb=0, mula=0, mulb=0, wa=39, wb=39, F=0, X=0, packbits=0x00; addop00<cc0> io39, A0, A0; mulop00<cc0> io39, A0, A0; op04
00000004: 15827d80 10020ba7 ra=32, rb=39, adda=6, addb=6, mula=0, mulb=0, wa=46, wb=39, F=0, X=0, packbits=0x00; addop21<cc1> io46, io32, io32; mulop00<cc0> io39, A0, A0; op01
00000006: 009e7000 300009e7 ra=39, rb=39, adda=0, addb=0, mula=0, mulb=0, wa=39, wb=39, F=0, X=0, packbits=0x00; addop00<cc0> io39, A0, A0; mulop00<cc0> io39, A0, A0; op03
00000008: 009e7000 100009e7 ra=39, rb=39, adda=0, addb=0, mula=0, mulb=0, wa=39, wb=39, F=0, X=0, packbits=0x00; addop00<cc0> io39, A0, A0; mulop00<cc0> io39, A0, A0; op01
0000000a: 009e7000 500009e7 ra=39, rb=39, adda=0, addb=0, mula=0, mulb=0, wa=39, wb=39, F=0, X=0, packbits=0x00; addop00<cc0> io39, A0, A0; mulop00<cc0> io39, A0, A0; op05
...
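
The uniform_data rows in the dump above appear to show each pair of 32-bit words three ways: as ASCII bytes, as hex, and as floats. A minimal Python sketch of the hex-to-float step, assuming the words are IEEE 754 single-precision values (the helper name word_to_float is made up for illustration and is not part of the tool):

```python
import struct

def word_to_float(word: int) -> float:
    """Reinterpret a 32-bit unsigned integer as an IEEE 754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", word))[0]

# Hex words copied from the uniform_data rows in the dump.
UNIFORM_WORDS = [0x00000000, 0x3F800000, 0xBF800000,
                 0x40000000, 0x3F000000, 0x8000000B]

for w in UNIFORM_WORDS:
    # Print each word next to its float interpretation.
    print(f"{w:08x} {word_to_float(w):>12.4g}")
```

The last word, 8000000b, only makes sense as a float if it is a tiny negative denormal, so it may well be something other than a float (this is a guess, not something the dump confirms).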

[2] Some background to aid understanding:
- My understanding is there are 3 "slices". Each slice has 4 QPUs.
- Each QPU is a 4-way SIMD unit with an add ALU and a multiply ALU.
- 3 slices * 4 QPUs * (4+4) * 250MHz -> 24 GFLOPS
- Each QPU has two register banks, ra and rb, with limitations on read/write
  ports and latencies.
  - The first 32 entries, ra0..ra31 and rb0..rb31, are normal registers.
  - The second 32 entries are actually references to units such as exp, log,
    reciprocal, reciprocal square root and the 3D pipeline registers.
- Each QPU has 4 or more accumulators (high-speed registers) for
  back-to-back access.
- The split of slices and QPUs is due to balancing of other shared units for
  the 3D pipeline (texturing, tiling etc).
- Bit encodings can be deduced from parts of the blob (e.g. the shader emitter).
- The main fragments in the blob seem related to OpenVG, whereas the majority
  of the OpenGL ES code is generated dynamically from dataflow graphs of the
  user-supplied shaders.
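
The 24 GFLOPS figure in the list above is just the product of the stated counts. A quick Python check, assuming the (4+4) term means four add lanes plus four multiply lanes, each retiring one floating-point operation per clock (the constant names are mine):

```python
SLICES = 3
QPUS_PER_SLICE = 4
SIMD_LANES = 4
ALUS_PER_LANE = 2      # one add ALU + one multiply ALU
CLOCK_HZ = 250e6       # 250 MHz

# Floating-point operations issued per clock across the whole array.
flops_per_clock = SLICES * QPUS_PER_SLICE * SIMD_LANES * ALUS_PER_LANE
peak_gflops = flops_per_clock * CLOCK_HZ / 1e9

print(f"{flops_per_clock} flops/clock -> {peak_gflops:.0f} GFLOPS peak")
```

This is a theoretical peak: it only holds when both ALUs of every QPU do useful work on every cycle.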

Examining a compiled shader fragment:

  void main(void) {
    gl_FragColor = vec4(1,1,0,0.5);
  }

Compiles to:

  # addop; mulop; controlop;
  addop00<cc0> io39, A0, A0; mulop00<cc0> io39, A0, A0; op01
  addop00<cc0> io39, A0, A0; mulop00<cc0> io39, A0, A0; op04
  addop21<cc1> io46, io32, io32; mulop00<cc0> io39, A0, A0; op01
  addop00<cc0> io39, A0, A0; mulop00<cc0> io39, A0, A0; op03
  addop00<cc0> io39, A0, A0; mulop00<cc0> io39, A0, A0; op01
  addop00<cc0> io39, A0, A0; mulop00<cc0> io39, A0, A0; op05

Assuming
  <cc0> = never
  <cc1> = always
  io32 = fetch from uniform memory
  io39 = discard/ignore
  io46 = gl_FragColor
  addop21 = one of mov, or, and
  
This gives:
  nop; nop; op01
  nop; nop; op04
  mov gl_FragColor, uniform; nop; op01
  nop; nop; op03
  nop; nop; op01
  nop; nop; op05
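
To tie the raw words from qpu-sniff to the readable form, here is a rough Python sketch that pulls a few fields out of each instruction and applies the guessed names. The bit positions are not given anywhere in this post; they are an assumption that happens to reproduce the ra/wa/adda/addop/cc/op values the tool prints for these six instructions. The names in IO_NAMES and ADD_OPS are the guesses from the "Assuming" list, and all function names are made up for illustration.

```python
from typing import NamedTuple

class Fields(NamedTuple):
    sig: int        # control op (opNN in the dump)
    cond_add: int   # <ccN> on the add op
    cond_mul: int   # <ccN> on the mul op
    waddr_add: int  # wa
    waddr_mul: int  # wb
    op_mul: int
    op_add: int
    raddr_a: int    # ra
    add_a: int      # adda: input select for the add ALU

def decode(lo: int, hi: int) -> Fields:
    """Split one instruction, given as the two 32-bit words in dump order.

    ASSUMPTION: bit positions are inferred, checked only against the dump.
    """
    return Fields(
        sig=(hi >> 28) & 0xF,
        cond_add=(hi >> 17) & 0x7,
        cond_mul=(hi >> 14) & 0x7,
        waddr_add=(hi >> 6) & 0x3F,
        waddr_mul=hi & 0x3F,
        op_mul=(lo >> 29) & 0x7,
        op_add=(lo >> 24) & 0x1F,
        raddr_a=(lo >> 18) & 0x3F,
        add_a=(lo >> 9) & 0x7,
    )

# Provisional names from the "Assuming" list; every entry is a guess.
IO_NAMES = {32: "uniform", 39: "discard", 46: "gl_FragColor"}
ADD_OPS = {21: "mov"}   # stated only as "one of mov, or, and"
COND_NEVER = 0          # <cc0> = never

def io_name(n: int) -> str:
    return IO_NAMES.get(n, f"io{n}")

def render(f: Fields) -> str:
    """Format one instruction as 'addop; mulop; controlop'."""
    if f.cond_add == COND_NEVER:
        add = "nop"
    else:
        op = ADD_OPS.get(f.op_add, f"addop{f.op_add:02d}")
        # In the dump adda=6 reads bank ra and adda=0 reads A0.
        src = io_name(f.raddr_a) if f.add_a == 6 else f"A{f.add_a}"
        add = f"{op} {io_name(f.waddr_add)}, {src}"
    if f.cond_mul == COND_NEVER:
        mul = "nop"
    else:
        mul = f"mulop{f.op_mul:02d} {io_name(f.waddr_mul)}"
    return f"{add}; {mul}; op{f.sig:02d}"

# The six instruction word pairs from the 'shader code' dump.
PROGRAM = [
    (0x009E7000, 0x100009E7), (0x009E7000, 0x400009E7),
    (0x15827D80, 0x10020BA7), (0x009E7000, 0x300009E7),
    (0x009E7000, 0x100009E7), (0x009E7000, 0x500009E7),
]

listing = [render(decode(lo, hi)) for lo, hi in PROGRAM]
print("\n".join(listing))
```

It prints six lines, one per instruction. Only the third one has an add op whose condition is not "never", so that is the only one that does any visible work.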

By playing around with different fragment inputs I think I have a handle on
the operations for:
  add, sub, mul, exp, log, min, max, etc

Also
  gl_FragCoord[.xyzw], gl_FrontFacing, etc...


Will try and post the OpenGL shader sample program soon.

Cheers
HH.